Module 1B Linear Mixed Models

The Joint Density

  • To simplify the derivation, first consider a random effects model without fixed effects:

\[ y_{ij} = \theta_i + \epsilon_{ij}, \quad \theta_i \overset{iid}{\sim} N(0, \tau^2), \quad \epsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2) \]

  • The joint density of the data and random effects is given by:

\[ f(\{y_{ij}\}, \{\theta_i\}) = \prod_{i,j} f(y_{ij}|\theta_i) \cdot \prod_i g(\theta_i) \]

By the chain rule of probability:

\[ f(y_{ij}, \theta_i) = f(y_{ij} \mid \theta_i) \, g(\theta_i), \]

so we can write:

\[ f(\{y_{ij}\}, \{\theta_i\}) = \prod_{i=1}^m \left[ \prod_{j=1}^{n_i} f(y_{ij} \mid \theta_i) \right] g(\theta_i). \]

The conditional density of \(y_{ij}\) given \(\theta_i\) is:

\[ y_{ij} \mid \theta_i \sim N(\theta_i, \sigma^2), \]

so

\[ f(y_{ij} \mid \theta_i) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(y_{ij} - \theta_i)^2}{2 \sigma^2} \right). \]

The density of \(\theta_i\) is:

\[ g(\theta_i) = \frac{1}{\sqrt{2 \pi \tau^2}} \exp \left( - \frac{\theta_i^2}{2 \tau^2} \right). \]

Putting these together, the joint density of the data and random effects is:

\[ f(\{y_{ij}\}, \{\theta_i\}) = \prod_{i,j} f(y_{ij}|\theta_i) \cdot \prod_i g(\theta_i). \]

Going through the derivation:

\[ f(\{y_{ij}\}, \{\theta_i\}) = \prod_{i=1}^m \left[ \left( \prod_{j=1}^{n_i} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(y_{ij} - \theta_i)^2}{2 \sigma^2} \right) \right) \frac{1}{\sqrt{2 \pi \tau^2}} \exp \left( - \frac{\theta_i^2}{2 \tau^2} \right) \right]. \]
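To make the factorization concrete, here is a minimal R sketch (simulated data, with hypothetical values of \(\tau^2\) and \(\sigma^2\)) that draws \(\theta_i\) and \(y_{ij}\) from the model and evaluates the joint log-density as the sum of the conditional and random-effect log-densities.

set.seed(1)
m <- 5; n_i <- 4            # number of groups and observations per group (assumed)
tau2 <- 2; sigma2 <- 1      # hypothetical variance components

theta <- rnorm(m, 0, sqrt(tau2))                       # random effects theta_i
y <- matrix(rnorm(m * n_i, rep(theta, each = n_i), sqrt(sigma2)),
            nrow = m, byrow = TRUE)                    # y_ij | theta_i ~ N(theta_i, sigma2)

# log f({y_ij}, {theta_i}) = sum_ij log f(y_ij | theta_i) + sum_i log g(theta_i)
log_joint <- sum(dnorm(y, mean = theta[row(y)], sd = sqrt(sigma2), log = TRUE)) +
  sum(dnorm(theta, mean = 0, sd = sqrt(tau2), log = TRUE))
log_joint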

Best Linear Unbiased Prediction (BLUP) of \(\theta_i\)

We aim to find the predictor \(\hat{\theta}_i\) that is linear in the data, unbiased for \(\theta_i\), and minimizes mean squared error (the BLUP).

Taking logs of the joint density (and dropping terms that do not involve \(\theta_i\)) gives, for group \(i\): \[ \ell(\theta_i) = -\frac{1}{2\sigma^2} \sum_{j=1}^{n_i} (y_{ij} - \theta_i)^2 - \frac{\theta_i^2}{2 \tau^2}. \]

Step 1: Expand the Quadratic Term Expand the sum of squares: \[ \sum_{j=1}^{n_i} (y_{ij} - \theta_i)^2 = \sum_{j=1}^{n_i} y_{ij}^2 - 2 \theta_i \sum_{j=1}^{n_i} y_{ij} + n_i \theta_i^2. \]

So: \[ \ell(\theta_i) = -\frac{1}{2\sigma^2} \left( \sum_{j=1}^{n_i} y_{ij}^2 - 2 \theta_i \sum_{j=1}^{n_i} y_{ij} + n_i \theta_i^2 \right) - \frac{\theta_i^2}{2 \tau^2}. \]


Step 2: Differentiate with Respect to \(\theta_i\) The derivative is: \[ \frac{\partial \ell}{\partial \theta_i} = \frac{\sum_{j=1}^{n_i} y_{ij} - n_i \theta_i}{\sigma^2} - \frac{\theta_i}{\tau^2}. \]


Step 3: Solve for \(\hat{\theta}_i\) Set the derivative to zero:

\[ \frac{\sum_{j=1}^{n_i} y_{ij} - n_i \hat{\theta}_i}{\sigma^2} - \frac{\hat{\theta}_i}{\tau^2} = 0. \]

Multiply by \(\sigma^2\): \[ \sum_{j=1}^{n_i} y_{ij} - n_i \hat{\theta}_i - \frac{\sigma^2}{\tau^2} \hat{\theta}_i = 0. \]

Rearrange: \[ \sum_{j=1}^{n_i} y_{ij} = \hat{\theta}_i \left( n_i + \frac{\sigma^2}{\tau^2} \right). \]

Thus: \[ \hat{\theta}_i = \frac{\sum_{j=1}^{n_i} y_{ij}}{n_i + \sigma^2 / \tau^2} = \frac{n_i}{n_i + \sigma^2 / \tau^2} \bar{y}_i \]

where \(\bar{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}\).

We can express the BLUP as: \[ \hat{\theta}_i = B_i \, \bar{y}_i, \]

where the shrinkage factor is: \[ B_i = \frac{n_i}{n_i + \sigma^2/\tau^2}. \]
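A quick numerical illustration of the shrinkage (simulated group means, variance components treated as known, arbitrary group sizes):

set.seed(2)
m <- 6
sigma2 <- 1; tau2 <- 0.5                         # treated as known
n_i <- c(2, 2, 5, 5, 30, 30)                     # arbitrary group sizes
theta <- rnorm(m, 0, sqrt(tau2))                 # true random effects
ybar <- theta + rnorm(m, 0, sqrt(sigma2 / n_i))  # ybar_i | theta_i ~ N(theta_i, sigma2/n_i)

B <- n_i / (n_i + sigma2 / tau2)                 # shrinkage factor B_i
theta_hat <- B * ybar                            # BLUP, shrunk toward 0

round(cbind(n_i, B, ybar, theta_hat, theta), 3)  # small n_i => more shrinkage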

Note that:

  • When \(\tau^2 \to 0\) (no between-group variation),
    \(B_i \to 0\) and all random intercepts are shrunk to 0 (the global mean).
  • When \(\tau^2 \to \infty\) (very large between-group variation),
    \(B_i \to 1\) and there is no shrinkage: \(\hat{\theta}_i = \bar{y}_i\).
  • When \(n_i\) is small, \(B_i\) is small and there is more shrinkage, while large \(n_i\) makes \(B_i\) close to 1 (less shrinkage).

Hierarchical Formulation

In a random intercept model, we assume group-specific effects \(\theta_i\) vary around a common mean \(\mu\):

\[ y_{ij} = \mu + \theta_i + \epsilon_{ij}, \quad \theta_i \sim N(0, \tau^2), \quad \epsilon_{ij} \sim N(0, \sigma^2), \]

where:

  • \(i = 1, \ldots, m\) indexes groups,
  • \(j = 1, \ldots, n_i\) indexes observations within group \(i\),
  • \(\tau^2\) is the between-group variance,
  • \(\sigma^2\) is the within-group variance.

Equivalently, this can be viewed as a two-level hierarchical model (a short simulation sketch follows this list):

  • Level 1 (within-group model): \(y_{ij} \mid \theta_i \sim N(\mu + \theta_i, \sigma^2)\).
  • Level 2 (random effect model): \(\theta_i \sim N(0, \tau^2)\).
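The sketch below (simulated data, hypothetical values of \(\mu\), \(\tau\), and \(\sigma\)) fits this two-level model with lme4::lmer(); VarCorr() returns the estimated variance components and ranef() the predicted (shrunken) \(\theta_i\).

library(lme4)

set.seed(3)
m <- 20; n_i <- 8
mu <- 10; tau <- 2; sigma <- 1                  # hypothetical true values
group_id <- rep(1:m, each = n_i)
theta <- rnorm(m, 0, tau)                       # Level 2: theta_i ~ N(0, tau^2)
y <- mu + theta[group_id] + rnorm(m * n_i, 0, sigma)   # Level 1: y_ij | theta_i
group <- factor(group_id)

fit <- lmer(y ~ 1 + (1 | group))                # random intercept model
VarCorr(fit)                                    # estimates of tau and sigma
head(ranef(fit)$group)                          # predicted (shrunken) theta_i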

Borrowing of Information

The estimated group-level mean, written here as \(\hat{\theta}_i\) (i.e., the prediction of \(\mu + \theta_i\)), can be expressed as a precision-weighted combination of the overall mean \(\mu\) and the group-specific mean \(\bar{y}_i\):

\[ \hat{\theta}_i = \frac{(1/\tau^2) \mu + (n_i / \sigma^2) \bar{y}_i}{1/\tau^2 + n_i / \sigma^2}. \]

This shows that:

  • Groups with small \(n_i\) (few observations) are pulled more strongly toward the overall mean \(\mu\),
  • Groups with large \(n_i\) rely more on their own mean \(\bar{y}_i\).

We can also express \(\hat{\theta}_i\) in terms of the intraclass correlation: \[ \rho = \frac{\tau^2}{\tau^2 + \sigma^2}, \] leading to:

\[ \hat{\theta}_i = \frac{\rho^{-1} \mu + n_i (1-\rho)^{-1} \bar{y}_i}{\rho^{-1} + n_i (1-\rho)^{-1}}. \]
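A quick numerical check (hypothetical values of \(\mu\), \(\tau^2\), \(\sigma^2\), \(n_i\), and \(\bar{y}_i\)) that the precision-weighted form and the intraclass-correlation form give the same answer:

mu <- 10; tau2 <- 4; sigma2 <- 9                # hypothetical values
n_i <- 5; ybar_i <- 14

est_precision <- ((1 / tau2) * mu + (n_i / sigma2) * ybar_i) /
  ((1 / tau2) + (n_i / sigma2))

rho <- tau2 / (tau2 + sigma2)                   # intraclass correlation
est_icc <- ((1 / rho) * mu + n_i / (1 - rho) * ybar_i) /
  ((1 / rho) + n_i / (1 - rho))

c(est_precision, est_icc)                       # identical; the estimate lies between mu and ybar_i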

Note that:

  • As \(\tau^2 \to 0\), all \(\hat{\theta}_i \to \mu\) (complete shrinkage toward the common mean).
  • As \(\tau^2 \to \infty\), \(\hat{\theta}_i \to \bar{y}_i\) (no shrinkage, group means are used as-is).
  • The amount of shrinkage depends jointly on \(\tau^2\), \(\sigma^2\), and \(n_i\).

Parameter Estimation

Theorem

Let \(Y = X\beta + \epsilon\) with \(E(Y) = X\beta\) and \(\mathrm{cov}(Y) = \sigma^2 V\), where \(X\) has full rank and \(V\) is a known positive definite matrix. The best linear unbiased estimator (BLUE) of \(\beta\) is given by:

\[ \hat{\beta}_{GLS} = (X'V^{-1}X)^{-1} X'V^{-1}y \]

Moreover, under normality the value of \(\beta\) that maximizes the likelihood function is the GLS estimate, i.e., \(\hat{\beta}_{MLE} = \hat{\beta}_{GLS}\).

To estimate the fixed effects \(\boldsymbol{\beta}\) and the variance components \((\sigma^2, \tau^2)\), we use the marginal model:

\[ \mathbf{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \mathbf{V}), \] where

\[ \mathbf{V} = \mathbf{Z} G \mathbf{Z}^T + R \] captures both the between-group (\(G = \tau^2 I\)) and within-group (\(R = \sigma^2 I\)) variability.

In ordinary linear regression \(Y = X \beta + \epsilon\) with \(\epsilon \sim N(0, \sigma^2 I)\), the OLS estimator

\[ \hat{\beta}_{OLS} = (X^T X)^{-1} X^T y \]

is optimal because the errors are uncorrelated and have constant variance.

In LMMs, the data are correlated: \[ \text{Cov}(\mathbf{y}) = \mathbf{V} \neq \sigma^2 I. \] The optimal estimator becomes the generalized least squares (GLS) estimator:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{V}^{-1} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{V}^{-1} \mathbf{y}. \]
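The following sketch (simulated data, variance components assumed known, hypothetical fixed effects) builds \(\mathbf{V} = \mathbf{Z}G\mathbf{Z}^T + R\) for a random intercept model and computes the GLS estimate directly, with OLS shown for comparison.

set.seed(4)
m <- 10; n_i <- 4; n <- m * n_i
tau2 <- 2; sigma2 <- 1                           # assumed known here
group <- rep(1:m, each = n_i)

X <- cbind(1, rnorm(n))                          # intercept + one covariate
Z <- model.matrix(~ 0 + factor(group))           # n x m random-effects design
G <- tau2 * diag(m)                              # between-group covariance
R <- sigma2 * diag(n)                            # within-group covariance
V <- Z %*% G %*% t(Z) + R                        # marginal covariance of y

beta <- c(5, 2)                                  # hypothetical fixed effects
y <- X %*% beta + Z %*% rnorm(m, 0, sqrt(tau2)) + rnorm(n, 0, sqrt(sigma2))

Vinv <- solve(V)
beta_gls <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)   # GLS
beta_ols <- solve(t(X) %*% X, t(X) %*% y)                     # OLS (ignores correlation)
cbind(beta_gls, beta_ols)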

Variance Components: MLE vs. REML
  • The MLE of the variance component, \(\hat{\sigma}^2_{MLE}\), is biased! Why? Because if we take the expectation (e.g., in the univariate case), \(E(\hat{\sigma}^2_{MLE}) = \frac{n-1}{n} \sigma^2 \neq \sigma^2\).

  • It is important to note that when we calculate the MLE, we plug in the fitted mean \(\hat{y}\). If we used the true mean \(\mu\) instead of \(\hat{y}\), the estimator would no longer be biased. The intuition is that we used some of the information in the data to estimate the mean, and we need to account for this when estimating the spread around the mean.

  • The idea behind REML is to remove information about the mean from the likelihood and maximize what remains, giving a variance estimate \(\hat{V}_{REML}\) that does not depend on the mean.

  • The REML estimator is equivalent to the MLE based on transformed outcomes \(Y^* = AY\), where \(A\) is chosen so that the distribution of \(Y^*\) does not depend on \(\beta\).

  • The MLE estimates of the variance components \((\sigma^2, \tau^2)\) are biased, much like the naive variance estimate in OLS: \[ \hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2. \]

  • Restricted Maximum Likelihood (REML) corrects this bias by accounting for the degrees of freedom used to estimate \(\boldsymbol{\beta}\) (see the simulation sketch following this list): \[ \hat{\sigma}^2_{REML} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2. \]

  • REML is preferred in most applications and is the default in lme4::lmer() in R.
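A small simulation (hypothetical \(n\) and \(\sigma^2\)) illustrating the bias: the denominator-\(n\) estimator averages to roughly \(\frac{n-1}{n}\sigma^2\), while the denominator-\((n-1)\) estimator averages to roughly \(\sigma^2\).

set.seed(5)
n <- 10; sigma2 <- 4                             # hypothetical values
sims <- replicate(10000, {
  x <- rnorm(n, mean = 3, sd = sqrt(sigma2))
  c(mle = mean((x - mean(x))^2),                 # divides by n
    reml = var(x))                               # divides by n - 1
})
rowMeans(sims)                                   # approx. 3.6 (= (n-1)/n * 4) vs. approx. 4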

Important: the variance now enters the estimator. This is different from \(\hat{\beta}_{MLE} = (X'X)^{-1} X'y\), where the variance does not appear. When outcomes are correlated, the covariance matters for estimating \(\beta\).

library(nlme) 

Attaching package: 'nlme'
The following object is masked from 'package:dplyr':

    collapse
library(ggplot2)
library(dplyr)
library(gridExtra) 

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

    combine
library(lme4)
Loading required package: Matrix

Attaching package: 'lme4'
The following object is masked from 'package:nlme':

    lmList
tlcdata <- readRDS("data/tlc.RDS")


succimer_data_only <- tlcdata %>% filter(treatment == "succimer")

tlc.gls <- gls(lead ~ relevel(as.factor(week), 4),
               data = succimer_data_only,               # only running for the treatment group
               correlation = corSymm(form = ~ 1 | id),  # within-subject correlation by individual
               weights = varIdent(form = ~ 1 | week))   # weights (variances) differ across time

summary(tlc.gls) 
Generalized least squares fit by REML
  Model: lead ~ relevel(as.factor(week), 4) 
  Data: succimer_data_only 
       AIC      BIC    logLik
  1308.337 1354.231 -640.1687

Correlation Structure: General
 Formula: ~1 | id 
 Parameter estimate(s):
 Correlation: 
  1     2     3    
2 0.401            
3 0.384 0.731      
4 0.495 0.507 0.455
Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | week 
 Parameter estimates:
       0        1        4        6 
1.000000 1.528082 1.563878 1.841541 

Coefficients:
                              Value Std.Error   t-value p-value
(Intercept)                  20.762  1.307627 15.877618   0e+00
relevel(as.factor(week), 4)0  5.778  1.137826  5.078104   0e+00
relevel(as.factor(week), 4)1 -7.240  1.203585 -6.015361   0e+00
relevel(as.factor(week), 4)4 -5.248  1.273641 -4.120470   1e-04

 Correlation: 
                             (Intr) r(.(),4)0 r(.(),4)1
relevel(as.factor(week), 4)0 -0.840                    
relevel(as.factor(week), 4)1 -0.629  0.614             
relevel(as.factor(week), 4)4 -0.630  0.616     0.790   

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-1.8020146 -0.8258012 -0.1227684  0.5184783  4.6654246 

Residual standard error: 5.020968 
Degrees of freedom: 200 total; 196 residual

What if we want the estimated marginal covariance matrix for an individual?

# Get covariance matrix for individual i. 

getVarCov(tlc.gls, individual = 3) 
Marginal variance covariance matrix
       [,1]   [,2]   [,3]   [,4]
[1,] 25.210 15.466 15.138 22.986
[2,] 15.466 58.867 44.029 35.965
[3,] 15.138 44.029 61.657 33.022
[4,] 22.986 35.965 33.022 85.494
  Standard Deviations: 5.021 7.6725 7.8522 9.2463 
#Get correlation matrix for individual i. 

cov2cor(getVarCov(tlc.gls, individual = 3)) 
Marginal variance covariance matrix
        [,1]    [,2]    [,3]    [,4]
[1,] 1.00000 0.40146 0.38397 0.49512
[2,] 0.40146 1.00000 0.73082 0.50696
[3,] 0.38397 0.73082 1.00000 0.45482
[4,] 0.49512 0.50696 0.45482 1.00000
  Standard Deviations: 1 1 1 1 
#To get the variance weights 

summary(tlc.gls$modelStruct$varStruct) 
Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | week 
 Parameter estimates:
       0        1        4        6 
1.000000 1.528082 1.563878 1.841541 
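As a check, the per-week standard deviations reported by getVarCov() above are (approximately) the residual standard error (5.020968) multiplied by these variance-function ratios. One way to extract the ratios from the fitted object is the coef() accessor on the variance structure, sketched below.

sigma_hat <- tlc.gls$sigma                       # residual standard error (~5.021)
ratios <- coef(tlc.gls$modelStruct$varStruct, unconstrained = FALSE, allCoef = TRUE)
sigma_hat * ratios                               # approx. 5.021 7.672 7.852 9.246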

Summary

  • What is the model? Random intercept model with hierarchical structure: \(y_{ij} = \mu + \theta_i + \epsilon_{ij}\), where \(\theta_i \sim N(0, \tau^2)\) and \(\epsilon_{ij} \sim N(0, \sigma^2)\). The data’s covariance is \(\mathbf{V} = \mathbf{Z}G\mathbf{Z}^T + R\), with \(G = \tau^2 I\) (between-group) and \(R = \sigma^2 I\) (within-group).
  • BLUP of \(\theta_i\): Best Linear Unbiased Prediction \(\hat{\theta}_i = B_i \bar{y}_i\) with shrinkage factor \(B_i = \frac{n_i}{n_i + \sigma^2/\tau^2}\). Small \(n_i\) or small \(\tau^2\) leads to more shrinkage toward \(\mu\), while large \(n_i\) or large \(\tau^2\) reduces shrinkage.
  • Borrowing of information: \(\hat{\theta}_i = \frac{(1/\tau^2)\mu + (n_i/\sigma^2)\bar{y}_i}{1/\tau^2 + n_i/\sigma^2}\). Groups with fewer observations are pulled toward the overall mean \(\mu\), while larger groups rely more on their own group mean \(\bar{y}_i\).
  • Parameter estimation: Fixed effects \(\beta\) estimated using GLS: \(\hat{\beta}_{GLS} = (X^T V^{-1} X)^{-1} X^T V^{-1} y\). Variance components \((\sigma^2, \tau^2)\) estimated by MLE or REML.
  • Variance components: MLE of variance components is biased (like the naive OLS variance). REML adjusts for the loss of degrees of freedom when estimating \(\beta\) and provides unbiased estimates of \(\sigma^2\) and \(\tau^2\).
  • Interpretation: Total variance = \(\sigma^2 + \tau^2\). Intraclass correlation: \(\rho = \frac{\tau^2}{\tau^2 + \sigma^2}\). The magnitude of \(\rho\) reflects within-group similarity relative to residual variation.