Motivation and Examples
Motivation:
Hierarchical or multilevel models are used when observations are nested (e.g., students within schools, repeated measures within individuals). Standard regression models ignore this clustering, which leads to underestimated standard errors and biased inference.
Hierarchical models can be thought of as GLMs that are reparametrized to separate within-group and between-group sources of variation in nested data structures. These include:
Fixed effects for population-level estimates.
Random effects for group-specific deviations.
Models can have multiple levels within one structure.
This has a very Bayesian flavor! So it is more natural for me to think this way.
Motivating Example: School Math Scores
We consider a three-level hierarchical model of math achievement data.
Levels:
Level 1 (Year within child):
\(math_{ijk}\): standardized math score of child \(j\) in school \(i\) during year \(k\).
\(year_{ijk}\): primary school year (centered at 3.5).
\(retained_{ijk}\): indicator for repeating the grade.
Level 2 (Child within school):
\(female_{ij}\), \(black_{ij}\), \(hispanic_{ij}\): child-level demographic indicators.
Level 3 (School):
School \(i\) defines the top-level clusters; school effects enter through a random intercept.
```r
library(ggplot2)  # plotting

ggplot(dat, aes(x = year, y = math, group = child, color = factor(school))) +
  geom_line(alpha = 0.4) +   # line for each student's trajectory
  geom_point(alpha = 0.6) +  # points for observations
  labs(
    title = "Student Math Scores Over Time",
    x = "Year",
    y = "Math Score",
    color = "School"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
```
The plot above shows individual trajectories of student scores across years. Colors denote school.
Fitting a Linear Mixed Model
Using the methods we have learned so far, we would fit a linear mixed model (LMM). We can include a random intercept to account for student-specific differences in baseline math scores: \[
math_{jk} = \beta_0 + \beta_1 \, year_{jk} + \cdots + u_j + \epsilon_{jk},
\] where:
- \(k\) indexes repeated measurements (years) within student \(j\),
- \(u_{j} \sim N(0, \tau^2)\) is a student-specific random intercept,
- \(\epsilon_{jk} \sim N(0, \sigma^2)\) is the residual error term.
This model partially pools the intercepts across students, borrowing strength from the entire dataset while still allowing each student to have their own baseline.
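As a concrete illustration, here is a minimal sketch of this two-level fit in lme4 (the covariates are the Level 1 variables listed above; variable names match the dataset used later in these notes):

```r
library(lme4)

# Two-level LMM: repeated years nested within children only,
# ignoring the school level for now
lmm_2level <- lmer(math ~ year + retained + (1 | child), data = dat)
summary(lmm_2level)
```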
Problem: Years are nested within students, and students are nested within schools. In other words, math scores are measured repeatedly for each student (longitudinal data), and these students are clustered within schools. So even if we account for within-student correlation, we have not accounted for within-school correlation. A standard LMM with only a student-level random intercept fails to account for this hierarchical data structure.
The Better Approach: Three-Level Linear Mixed Model
We extend the previous model to account for both student-level and school-level random intercepts, fully respecting the data hierarchy (years within students, and students within schools): \[
math_{ijk} = \beta_0 + \beta_1 \, year_{ijk} + \cdots + u_i + u_{ij} + \epsilon_{ijk},
\] where \(u_i \sim N(0, \nu^2)\) is a school-level random intercept and \(u_{ij} \sim N(0, \tau^2)\) is a student-level random intercept.
We can think of \(u_i\) and \(u_{ij}\) as parameters that control for unmeasured confounders at the school and student level.
“One statistician’s mean structure is another’s covariance structure.” (Donald Rubin)
In frequentist mixed models, random effects are viewed as part of the covariance structure because they induce correlation between observations within a cluster. The random effects are integrated out when estimating the model, and their variability (e.g., \(\tau^2\)) contributes to the marginal covariance matrix.
In Bayesian or hierarchical modeling, the same random effects are treated as part of the mean structure with unknown parameters that have prior distributions. We estimate the full posterior for both fixed effects and random effects.
Bayesian method: \(u_i\) is a random deviation that adjusts the mean for cluster \(i\) (part of the mean structure). In our 3-level LMM, we place priors on both population-level parameters and random effects, e.g., \(\beta_k \sim N(0, 10^2)\).
Frequentist method: Equivalently, the marginal distribution of \(y_{ij}\) can be expressed as: \[
y_{ij} \sim N(x_{ij}' \beta, \; \tau^2 + \sigma^2),
\] where the extra variance \(\tau^2\) is part of the covariance structure. We refer to the marginal distribution because in the frequentist approach the random effects are integrated out, and the variability they induce enters the covariance structure.
In our 3-level LMM, these random effects are treated as components that induce correlation but are not explicitly estimated as parameters.
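To make the contrast concrete, here is a sketch of the same random-intercept structure fit both ways. The brms call is an illustrative assumption (brms is not used elsewhere in these notes); its prior echoes the \(\beta_k \sim N(0, 10^2)\) example above.

```r
library(lme4)
library(brms)  # assumed available: Bayesian mixed models via Stan

# Frequentist: random effects are integrated out; their variances
# enter the marginal covariance structure
fit_freq <- lmer(math ~ year + (1 | school/child), data = dat)

# Bayesian: random effects are part of the mean structure, with
# priors on all unknowns (e.g., beta_k ~ N(0, 10^2) as in the text)
fit_bayes <- brm(
  math ~ year + (1 | school/child),
  data   = dat,
  prior  = set_prior("normal(0, 10)", class = "b"),
  chains = 4, iter = 2000
)
```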
Pooling (Borrowing Information) at Multiple Levels
Our three-level linear mixed model: \[
math_{ijk} = \beta_0 + \beta_1 year_{ijk} + \cdots + u_i + u_{ij} + \epsilon_{ijk},
\] borrows strength across both schools (Level 3) and students (Level 2). This is known as multi-level partial pooling.
1. Pooling Across Schools
No pooling (separate means): Each school mean is just its sample mean: \[
\hat{\mu}_i^{\text{no-pool}} = \bar{y}_i.
\]
Complete pooling (one mean): All schools share the overall mean: \[
\hat{\mu}_i^{\text{complete-pool}} = \bar{y}.
\]
Partial pooling (hierarchical): \[
\hat{\mu}_i^{\text{partial-pool}} = w_i \, \bar{y}_i + (1 - w_i)\, \bar{y},
\] where \[
w_i = \frac{n_i}{n_i + \frac{\sigma^2}{\nu^2}},
\] with \(n_i\) = number of observations in school \(i\), \(\nu^2\) = between-school variance, and \(\sigma^2\) = residual variance.
Large schools (big \(n_i\)) get \(w_i \approx 1\) (close to no pooling).
Small schools (small \(n_i\)) get \(w_i \approx 0\) (closer to complete pooling).
2. Pooling Across Students (Within Schools)
Within each school \(i\), we also pool across students \(j\): \[
\hat{u}_{ij}^{\text{partial}} = w_{ij} \, (\bar{y}_{ij} - \bar{y}_i),
\] where \[
w_{ij} = \frac{n_{ij}}{n_{ij} + \frac{\sigma^2}{\tau^2}},
\] with \(n_{ij}\) = number of observations for student \(j\), and \(\tau^2\) = student-level variance.
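A small numerical sketch of both shrinkage weights (all values below are made up for illustration):

```r
# Hypothetical variance components (illustrative only)
sigma2 <- 0.34   # residual variance
nu2    <- 0.13   # between-school variance
tau2   <- 0.66   # between-student variance

# School-level weights: small schools shrink toward the grand mean
n_i <- c(5, 50, 500)                 # observations per school
round(n_i / (n_i + sigma2 / nu2), 3)   # 0.657 0.950 0.995

# Student-level weights: students with few years shrink toward the school mean
n_ij <- c(1, 3, 6)                   # observations per student
round(n_ij / (n_ij + sigma2 / tau2), 3)  # 0.660 0.853 0.921
```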
Derivation of student-level partially pooled estimate
Step 1: Define the Student Mean
For student \(j\) in school \(i\), the average math score is: \[
\bar{y}_{ij} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} math_{ijk}.
\]
Define the average residual error: \[
\bar{\epsilon}_{ij} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} \epsilon_{ijk},
\] which satisfies \(\bar{\epsilon}_{ij} \sim N(0, \sigma^2 / n_{ij})\).
Step 2: Center by the School Mean
Subtract the school mean \(\bar{y}_i\) (which contains \(\beta_0 + u_i\)): \[
\bar{y}_{ij} - \bar{y}_i = u_{ij} + \bar{\epsilon}_{ij}.
\]
This shows that a student's deviation from the school mean is driven by:
- The student's random effect \(u_{ij}\),
- The average residual noise \(\bar{\epsilon}_{ij}\).
Step 3: Target Quantity
We want the posterior mean of \(u_{ij}\) given the observed deviation: \[
\hat{u}_{ij}^{\text{partial}} = E[ u_{ij} \mid \bar{y}_{ij} - \bar{y}_i ].
\]
Since \(u_{ij} \sim N(0, \tau^2)\) and \(\bar{\epsilon}_{ij} \sim N(0, \sigma^2 / n_{ij})\) are independent, standard normal-normal conjugacy gives \[
\hat{u}_{ij}^{\text{partial}} = \frac{\tau^2}{\tau^2 + \sigma^2 / n_{ij}} \, (\bar{y}_{ij} - \bar{y}_i) = w_{ij} \, (\bar{y}_{ij} - \bar{y}_i), \qquad w_{ij} = \frac{n_{ij}}{n_{ij} + \frac{\sigma^2}{\tau^2}},
\] matching the shrinkage weight given earlier.
Returning to the full three-level model, conditional on the random effects each observation satisfies \(math_{ijk} \mid u_i, u_{ij} \sim N(\mu_{ijk}, \sigma^2)\), where \(\mu_{ijk}\) is the linear predictor: \[
E(math_{ijk} \mid u_i, u_{ij}) = \mu_{ijk} = \beta_0 + \beta^{(1)'} x_{ijk} + \beta^{(2)'} x_{ij} + u_i + u_{ij},
\] and \[
Var(math_{ijk} \mid u_i, u_{ij}) = \sigma^2.
\]
Derivation of conditional variance
Conditioned on the random effects \(u_i\) and \(u_{ij}\), we can write: \[
math_{ijk} \mid u_i, u_{ij} = \mu_{ijk} + \epsilon_{ijk}.
\]
Since \(\mu_{ijk}\) is treated as fixed once we condition on \(u_i\) and \(u_{ij}\), the only remaining random component is \(\epsilon_{ijk}\). Therefore: \[
\begin{aligned}
Var(math_{ijk} \mid u_i, u_{ij})
&= Var(\mu_{ijk} + \epsilon_{ijk} \mid u_i, u_{ij}) \\
&= Var(\epsilon_{ijk}) \\
&= \sigma^2.
\end{aligned}
\]
Total Variance
This variance \(\sigma^2\) describes the Level 1 variability (within-student, across years), assuming the random effects \(u_i\) and \(u_{ij}\) are fixed.
The total variability in \(math_{ijk}\) can be decomposed into school-level, student-level, and residual (within-student) components, as we now show.
For the three-level linear mixed model: \[
math_{ijk} = \beta_0 + \beta_1 \, year_{ijk} + \cdots + u_i + u_{ij} + \epsilon_{ijk},
\] we assume: \[
u_i \sim N(0, \nu^2), \quad u_{ij} \sim N(0, \tau^2), \quad \epsilon_{ijk} \sim N(0, \sigma^2),
\] with all random components mutually independent.
We want to show that: \[
Var(math_{ijk}) = \nu^2 + \tau^2 + \sigma^2.
\]
Step 1: Apply the Law of Total Variance
The law of total variance states: \[
Var(Y) = E[Var(Y \mid U)] + Var(E[Y \mid U]),
\] where \(U\) represents any random variables we are conditioning on.
In our case, we condition on both \(u_i\) and \(u_{ij}\): \[
Var(math_{ijk}) = E[Var(math_{ijk} \mid u_i, u_{ij})] + Var(E[math_{ijk} \mid u_i, u_{ij}]).
\]
Step 2: Compute the Conditional Variance
From the model: \[
math_{ijk} \mid u_i, u_{ij} = \mu_{ijk} + \epsilon_{ijk},
\] where \(\mu_{ijk} = \beta_0 + \beta_1 \, year_{ijk} + \cdots + u_i + u_{ij}\).
Conditioned on \(u_i\) and \(u_{ij}\), the only random term is \(\epsilon_{ijk}\), so: \[
Var(math_{ijk} \mid u_i, u_{ij}) = Var(\epsilon_{ijk}) = \sigma^2.
\] Therefore: \[
E[Var(math_{ijk} \mid u_i, u_{ij})] = \sigma^2.
\]
Step 3: Compute the Variance of the Conditional Expectation
The conditional expectation is the linear predictor, \(E[math_{ijk} \mid u_i, u_{ij}] = \mu_{ijk}\). The fixed-effect part is constant, so: \[
Var(E[math_{ijk} \mid u_i, u_{ij}]) = Var(u_i + u_{ij}) = \nu^2 + \tau^2,
\] using the independence of \(u_i\) and \(u_{ij}\). Combining Steps 2 and 3 gives \(Var(math_{ijk}) = \sigma^2 + \nu^2 + \tau^2\), as claimed.
Induced Covariance Structure
For two observations \(math_{ijk}\) and \(math_{ijk'}\) from the same student \(j\) in the same school \(i\): \[
Cov(math_{ijk}, math_{ijk'} \mid u_i, u_{ij}) = 0,
\] because, conditioned on \(u_i\) and \(u_{ij}\), the residual errors \(\epsilon_{ijk}\) are independent.
However, the marginal covariance (after integrating over the random effects) is: \[
Cov(math_{ijk}, math_{ijk'}) = \nu^2 + \tau^2,
\] since both observations share \(u_i\) (school effect) and \(u_{ij}\) (student effect).
For two students \(j \neq j'\) in the same school \(i\): \[
Cov(math_{ijk}, math_{ij'k'}) = \nu^2,
\] because they share only the school effect \(u_i\).
Intraclass Correlation Coefficients (ICC)
The total variance of \(math_{ijk}\) is: \[
Var(math_{ijk}) = \nu^2 + \tau^2 + \sigma^2.
\]
The school-level intraclass correlation (ICC), which measures the proportion of variance explained by differences between schools, is: \[
ICC_{\text{school}} = \frac{\nu^2}{\nu^2 + \tau^2 + \sigma^2}.
\]
The student-level ICC (within a school) is: \[
ICC_{\text{student}} = \frac{\nu^2 + \tau^2}{\nu^2 + \tau^2 + \sigma^2}.
\]
These ICCs quantify how strongly scores are correlated for students within the same school or the same student across multiple years.
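As a quick sanity check, a short simulation confirms the variance decomposition and the student-level correlation. The variance values and cluster sizes here are made up purely for illustration:

```r
# Monte Carlo check of Var(math) = nu^2 + tau^2 + sigma^2
set.seed(1)
nu2 <- 0.5; tau2 <- 1.0; sigma2 <- 0.25      # hypothetical variances
n_school <- 200; n_child <- 30; n_year <- 6  # hypothetical sizes

u_i  <- rnorm(n_school, 0, sqrt(nu2))             # school effects
u_ij <- rnorm(n_school * n_child, 0, sqrt(tau2))  # student effects
school  <- rep(1:n_school, each = n_child * n_year)
student <- rep(1:(n_school * n_child), each = n_year)
y <- u_i[school] + u_ij[student] +
  rnorm(n_school * n_child * n_year, 0, sqrt(sigma2))

var(y)  # close to nu2 + tau2 + sigma2 = 1.75

# Correlation between two years for the same student:
# approx (nu2 + tau2) / (nu2 + tau2 + sigma2) = 0.857
y_mat <- matrix(y, ncol = n_year, byrow = TRUE)  # one row per student
cor(y_mat[, 1], y_mat[, 2])
```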
Hierarchical LMM Fitting in R
```r
# Load required package
library(lme4)

# Fit the 3-level LMM
# math ~ fixed effects + (1 | school/child) fits random intercepts for school and child
lmm_3level <- lmer(
  math ~ year + female + black + hispanic + retained +
    (1 | school/child),  # child is nested within school
  data = dat
)

# Display model summary
summary(lmm_3level)
```
```
Linear mixed model fit by REML ['lmerMod']
Formula: math ~ year + female + black + hispanic + retained + (1 | school/child)
   Data: dat

REML criterion at convergence: 16699.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.8443 -0.5854 -0.0272  0.5588  6.1663 

Random effects:
 Groups       Name        Variance Std.Dev.
 child:school (Intercept) 0.6638   0.8148  
 school       (Intercept) 0.1292   0.3594  
 Residual                 0.3445   0.5870  
Number of obs: 7230, groups:  child:school, 1721; school, 60

Fixed effects:
              Estimate Std. Error t value
(Intercept) -3.477e-01  8.095e-02  -4.295
year         7.483e-01  5.396e-03 138.677
female       8.850e-06  4.226e-02   0.000
black       -6.235e-01  7.873e-02  -7.919
hispanic    -3.636e-01  8.813e-02  -4.125
retained     1.487e-01  3.536e-02   4.207

Correlation of Fixed Effects:
         (Intr) year   female black  hispnc
year     -0.020                            
female   -0.264 -0.005                     
black    -0.691 -0.004 -0.003              
hispanic -0.515 -0.009  0.019  0.519       
retained -0.020  0.086  0.017 -0.015 -0.007
```
Model Interpretations
Random Effects
School-level variance (\(\nu^2 = 0.1292\)):
Differences between schools account for some variation in math scores (Std. Dev. = 0.359).
This suggests that students from different schools start at different baseline math levels.
Student-level variance (\(\tau^2 = 0.6638\)):
Differences between students (within schools) account for a larger portion of variability (Std. Dev. = 0.815), indicating substantial heterogeneity across students.
Residual variance (\(\sigma^2 = 0.3445\)):
Variation in math scores within the same student (across years) after accounting for predictors is 0.3445 (Std. Dev. = 0.587).
Number of groups:
There are 60 schools and 1,721 students across 7,230 total observations.
Fixed Effects
Intercept (-0.348):
The expected math score for a reference student (male, non-Black, non-Hispanic, not retained) at \(year = 0\) (the center of the year scale, since year is centered at 3.5) is -0.35 on the standardized scale.
Year (0.748):
On average, math scores increase by 0.748 units per year, holding all other factors constant.
This is a strong positive trend over time (t = 138.7).
Female (0.00001):
The coefficient for being female is nearly zero (and not statistically significant).
Black (-0.623):
Black students have an average math score 0.62 units lower than non-Black students, holding other variables constant (t = -7.92).
Hispanic (-0.364):
Hispanic students have an average math score 0.36 units lower than non-Hispanic students (t = -4.13).
Retained (0.149):
Students who were retained (repeated a grade) have math scores approximately 0.15 units higher than otherwise similar students (t = 4.21). This may reflect additional time or intervention effects.
We can calculate ICCs directly from the variance components of the model:
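A sketch using lme4's VarCorr; the numbers correspond to the variance components in the output above:

```r
# Extract variance components from the fitted 3-level model
vc <- as.data.frame(VarCorr(lmm_3level))
tau2   <- vc$vcov[vc$grp == "child:school"]  # student-level variance
nu2    <- vc$vcov[vc$grp == "school"]        # school-level variance
sigma2 <- vc$vcov[vc$grp == "Residual"]      # residual variance

# School-level ICC: share of total variance between schools
nu2 / (nu2 + tau2 + sigma2)            # approx 0.11

# Student-level ICC: correlation between years for the same student
(nu2 + tau2) / (nu2 + tau2 + sigma2)   # approx 0.70
```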
Summary
A three-level hierarchical linear model extends linear regression to account for nested data structures: specifically, years within students, and students within schools. The model includes two random intercepts: \(y_{ijk} = X_{ijk}\beta + u_i + u_{ij} + \epsilon_{ijk}\), where \(u_i \sim N(0, \nu^2)\) and \(u_{ij} \sim N(0, \tau^2)\).
Motivation
Hierarchical or multilevel data violate the independence assumption of standard linear models. Ignoring the nesting—repeated years per student, students within schools—leads to underestimated standard errors and biased inference. Hierarchical models account for this by modeling correlated errors at each level.
Model Components
Level 1: residual error \(\epsilon_{ijk} \sim N(0, \sigma^2)\)
Level 2: student-level random intercept \(u_{ij} \sim N(0, \tau^2)\)
Level 3: school-level random intercept \(u_i \sim N(0, \nu^2)\)
The systematic component is: \(\mu_{ijk} = \beta_0 + X_{ijk} \beta + u_i + u_{ij}\)
Hierarchical Partial Pooling
The model performs partial pooling across both schools and students.
- School-level: \(\hat{\mu}_i^{\text{partial}} = w_i \bar{y}_i + (1 - w_i)\bar{y}\) with \(w_i = \frac{n_i}{n_i + \frac{\sigma^2}{\nu^2}}\)
- Student-level: \(\hat{u}_{ij}^{\text{partial}} = w_{ij} (\bar{y}_{ij} - \bar{y}_i)\) with \(w_{ij} = \frac{n_{ij}}{n_{ij} + \frac{\sigma^2}{\tau^2}}\)
Likelihood
The likelihood reflects two nested levels of integration over the random effects: \[
L(\beta, \nu^2, \tau^2 \mid y) = \prod_i \int \prod_j \int \prod_k p(y_{ijk} \mid u_i, u_{ij}) \, p(u_{ij}) \, du_{ij} \, p(u_i) \, du_i.
\] It is estimated using REML (frequentist) or MCMC (Bayesian) methods.
Interpretation
- Fixed effects (\(\beta\)): average effects across all observations (e.g., year, race, gender)
- Random effects (\(u_i\), \(u_{ij}\)): group-specific deviations
- Variance components: \(\nu^2\) (school), \(\tau^2\) (student), \(\sigma^2\) (within-student error)