A useful reference: Halekoh, U., S. Højsgaard, and J. Yan (2006). The R Package geepack for Generalized Estimating Equations. Journal of Statistical Software, 15(2), 1–11.
Foundations for Generalized Estimating Equations are covered in Module 3B.
Weighted Least Squares
Motivation and Examples
Motivation:
Marginal Models: assess population-level associations in the presence of heteroscedastic and/or correlated errors without requiring assumptions about subject-specific random effects.
Longitudinal data
Clustered data
Multilevel data
For example, suppose we have the linear regression problem given below:
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \operatorname{Var}(\boldsymbol{\epsilon}) = \mathbf{V}.
\]
We are interested in learning the population-level associations between \(\mathbf{y}\) and \(\mathbf{X}\) across \(n\) subjects, i.e., how can we estimate \(\hat{\beta}\) when errors are heteroscedastic (unequal variances across units)?
Motivating Example: Revisiting TLC
The Treatment of Lead-Exposed Children (TLC) trial was a placebo-controlled, randomized study of succimer (a chelating agent) in children with blood lead levels of 20–44 µg/dL. These data consist of four repeated measurements of blood lead levels obtained at baseline (week 0), week 1, week 4, and week 6 on 100 children who were randomly assigned to chelation treatment with succimer or placebo.
One alternative is a random effects (mixed) model: estimate population-level associations (fixed effects) conditional on subject-specific random effects, \(E(y_{ij} \mid \theta_i) = X_{ij}^\top \boldsymbol{\beta} + \theta_i\). The random effects are assumed to be normally distributed, and inference depends on correct specification of the random effects distribution and covariance structure.
\(V\) is the variance–covariance matrix implied by the random effects structure. In a random intercept model, \(V\) has compound symmetry form, with diagonal elements \(\tau^2+\sigma^2\) and off-diagonal elements \(\tau^2\) within each subject.
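For reference, a random intercept fit of this kind might look as follows in R (a sketch only; lead, treatment, week_f, id, and tlcdata are the variable names assumed in the TLC example later in this section):

```r
library(nlme)

# Random intercept model: E(y_ij | theta_i) = X_ij' beta + theta_i,
# with theta_i ~ N(0, tau^2); implies compound-symmetry V within each child
m_lmm <- lme(lead ~ treatment * week_f, random = ~ 1 | id, data = tlcdata)
summary(m_lmm)
```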
The Better Approach: Weighted Least Squares
Using our example, we extend OLS (for normal/continuous outcomes) to account for heteroscedasticity (unequal variance). Each observation is given a weight reflecting its reliability, improving efficiency when some points are noisier than others.
Let \(\mathbf{W}\) be diagonal with \(\operatorname{diag}(\mathbf{W})=(v_1^{-1/2}, v_2^{-1/2},\ldots,v_N^{-1/2})\). Note that \(\mathbf{W}^\top \mathbf{W}=\mathbf{V}^{-1}\).
Now derive the WLS estimator \(\hat{\beta}_{\text{WLS}}\):
Derivation of WLS estimator
Define the transformed variables \[
y^*=\mathbf{W}y,\quad X^*=\mathbf{W}X,\quad \epsilon^*=\mathbf{W}\epsilon.
\] Then \[
y^*=X^*\beta+\epsilon^*,\qquad \mathrm{Var}(\epsilon^*)=\mathbf{W}\,\mathrm{Var}(\epsilon)\,\mathbf{W}^\top=I \; (\text{equal variance}).
\] Applying OLS to the transformed variables then gives the WLS estimator: \[
\hat\beta_{\text{WLS}} = (X^{*\top}X^*)^{-1}X^{*\top}y^* = (X^\top \mathbf{V}^{-1}X)^{-1}X^\top \mathbf{V}^{-1}y.
\]
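A minimal simulation sketch of this equivalence (the variables here are hypothetical, not from the TLC data): lm() with inverse-variance weights returns the same coefficients as OLS on the transformed variables.

```r
set.seed(1)

# Simulate heteroscedastic data: Var(eps_i) = v_i
n <- 200
x <- runif(n, 1, 10)
v <- x^2
y <- 2 + 3 * x + rnorm(n, sd = sqrt(v))

# WLS via lm(): weights are the inverse variances 1 / v_i
m_wls_sim <- lm(y ~ x, weights = 1 / v)

# Same estimator by hand: OLS on y* = W y, X* = W X with W = diag(1 / sqrt(v))
w <- 1 / sqrt(v)
m_star <- lm(I(w * y) ~ 0 + w + I(w * x))  # 'w' is the transformed intercept column

coef(m_wls_sim)  # (Intercept), x
coef(m_star)     # w, I(w * x): identical values
```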
Properties of the WLS estimator (when \(E[y|X]=X\beta\) and \(V\) is correctly specified):
Unbiasedness: \(E[\hat\beta_{\text{WLS}}]=\beta\) (same as OLS).
Efficiency: If errors are heteroscedastic and uncorrelated, WLS is more efficient (lower variance) than OLS. How do we know this?
Under heteroscedasticity, the true covariance of the OLS estimator is given by: \(\operatorname{Cov}(\hat\beta_{\text{OLS}}) =(X^\top X)^{-1}X^\top V X\,(X^\top X)^{-1}.\)
When \(V= \sigma^2I\) this reduces to the usual \(\sigma^2 (X^\top X)^{-1}.\)
When \(V\neq\sigma^2 I\), the usual SEs are biased. In that case we must either
use robust (sandwich) SEs, or
explicitly model \(V\) and fit WLS.
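The efficiency claim can be checked by simulation; a sketch under the same hypothetical setup as above:

```r
set.seed(42)

# Sampling variability of the slope estimate: OLS vs. WLS when Var(eps_i) = x_i^2
sim_once <- function() {
  x <- runif(100, 1, 10)
  y <- 2 + 3 * x + rnorm(100, sd = x)
  c(ols = unname(coef(lm(y ~ x))[2]),
    wls = unname(coef(lm(y ~ x, weights = 1 / x^2))[2]))
}
est <- replicate(2000, sim_once())
apply(est, 1, sd)  # the WLS slope shows the smaller empirical SD
```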
WLS: TLC example
We’ll now apply WLS to the TLC blood lead dataset to account for heteroskedasticity across weeks. The model fit is given by:
\[
\begin{aligned}
y_{ij} &= \beta_0
+ \beta_1 \,\mathbf{1}\{\text{trt}_i=\text{succimer}\}
+ \sum_{k=1}^K \gamma_k \,\mathbf{1}\{\text{week}_j=k\} \\
&\quad + \sum_{k=1}^K \delta_k \,\mathbf{1}\{\text{trt}_i=\text{succimer}\}\,\mathbf{1}\{\text{week}_j=k\}
+ \epsilon_{ij},\qquad \boldsymbol{\epsilon}\sim N(\mathbf{0},\mathbf{V}).
\end{aligned}
\] Note: \(N=\) total number of observations.
General model: allow variances to be non-constant.
Implementation: impose structure (by week, by group, by function of covariate) so the model is estimable.
Visual inspection of residuals by week (not shown here yet) suggests that variability in lead levels increases over time.
We’ll allow a non-constant variance for each week using `nlme::gls` with a `varIdent` variance function.
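(A hypothetical prep step, assuming tlcdata is in long format with one row per child per week; coding week_f as an ordered factor is what makes R report the orthogonal polynomial contrasts interpreted below.)

```r
library(nlme)

# Week as an ordered factor: R then uses orthogonal polynomial
# contrasts (linear, quadratic, cubic) for this term
tlcdata$week_f <- factor(tlcdata$week, levels = c(0, 1, 4, 6), ordered = TRUE)
```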
```r
# Different variance per week; no correlation structure
m_wls <- gls(
  lead ~ treatment * week_f,
  data = tlcdata,
  # Var differs by week, but still no correlation among observations
  weights = varIdent(form = ~ 1 | week_f),
  method = "REML"
)
summary(m_wls)
```
Intercept (24.66, p < 0.001):
Average baseline lead level (week 0) for children in the placebo group.
Succimer main effect (–5.58, p < 0.001):
At baseline, succimer-treated children had lead levels about 5.6 µg/dL lower than placebo children.
Time trend in placebo group (orthogonal polynomial coding):
Linear contrast = –1.89 (p < 0.001): clear linear decline in mean lead level across weeks.
Quadratic contrast = 0.59 (p = 0.53): no curvature beyond the linear decline.
Cubic contrast = –0.19 (p = 0.84): no higher-order trend.
Succimer × time interactions:
Linear interaction = –1.54 (p = 0.24): no evidence that succimer changed the linear decline relative to placebo.
Quadratic interaction = +8.54 (p < 0.001): strong evidence that succimer altered the curvature of the time trend (nonlinear divergence from placebo).
Cubic interaction = –2.44 (p = 0.072): marginal evidence of a cubic pattern.
Variance function:
Residual standard deviations were allowed to differ by week (relative to week 0 = 1.00).
Week 1: 1.33× larger
Week 4: 1.37× larger
Week 6: 1.52× larger
This means variability in lead levels increased over time, and GLS down-weighted later observations accordingly.
Generalized Least Squares
In WLS, we assumed the residual covariance matrix \(\mathbf{V}\) was known and diagonal — errors were uncorrelated but had unequal variances.
Generalized Least Squares (GLS) drops that restriction: \(\mathbf{V}\) can be any known, positive definite covariance matrix, allowing for both:
Unequal variances (heteroskedasticity), and
Correlations among observations.
GLS solves the same weighted least squares problem, but now the “weights” come from \(\mathbf{V}^{-1}\), which accounts for the full error structure (variances and covariances).
If \(\mathbf{V}\) is diagonal ⟶ GLS = WLS.
If \(\mathbf{V} = \sigma^2 \mathbf{I}\) ⟶ GLS = OLS.
If \(\mathbf{V}\) is a full positive definite matrix with nonzero off-diagonals ⟶ GLS (allows both unequal variances and correlations among errors).
Estimator and covariance:
\[
\hat\beta_{\text{GLS}} = (X^\top \mathbf{V}^{-1} X)^{-1} X^\top \mathbf{V}^{-1} y,
\qquad
\operatorname{Cov}(\hat\beta_{\text{GLS}}) = (X^\top \mathbf{V}^{-1} X)^{-1}.
\]
Because \(V\) is positive definite, we can always factor it as \(\mathbf{V}^{-1} = W^T W\) with \(W = V^{-1/2}\), showing GLS is just OLS on transformed data.
\(\hat{\beta}_{GLS}\) is the best (minimum variance) linear unbiased estimator (BLUE) when \(V\) is correctly specified under model assumptions.
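As a concrete illustration, the estimator can be computed directly from its formula; gls_beta below is a hypothetical helper (not part of any package), applied here to the simulated x and y from the WLS sketch above.

```r
# beta_hat = (X' V^{-1} X)^{-1} X' V^{-1} y, for any positive definite V
gls_beta <- function(X, y, V) {
  Vinv <- solve(V)
  solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)
}

# With V = I this reproduces OLS
X <- cbind(1, x)
gls_beta(X, y, diag(length(y)))
```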
Choosing a Covariance Structure for Grouped Data
When data are grouped or repeatedly measured (e.g., patients within clinics, time points within subjects), the error terms are usually correlated. In GLS, we model this correlation by specifying the covariance matrix:
\[
V_i = \sigma^2 R_i(\alpha),
\]
where \(\sigma^2\) is the common variance and \(R_i(\alpha)\) is the “group” correlation matrix that depends on parameters \(\alpha\) (e.g., a correlation \(-1 < \alpha < 1\), but in practice it is often positive).
\(R_i(\alpha)=\) working correlation structure, where \(R_i[j, j'] = \operatorname{cor}(y_{ij}, y_{ij'})\).
For example, for a group (or cluster) of size 3, common choices for correlation matrices \(R_i(\alpha)\) look as follows:
Independent: no correlation among observations within a group, \[ R_i(\alpha) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \]
Exchangeable: equal correlation \(\alpha\) between every pair of observations in a group, \[ R_i(\alpha) = \begin{pmatrix} 1 & \alpha & \alpha \\ \alpha & 1 & \alpha \\ \alpha & \alpha & 1 \end{pmatrix}. \]
AR(1): correlation decays with lag, \[ R_i(\alpha) = \begin{pmatrix} 1 & \alpha & \alpha^2 \\ \alpha & 1 & \alpha \\ \alpha^2 & \alpha & 1 \end{pmatrix}. \]
Unstructured: a free correlation parameter for each pair, \[ R_i = \begin{pmatrix} 1 & \alpha_{12} & \alpha_{13} \\ \alpha_{12} & 1 & \alpha_{23} \\ \alpha_{13} & \alpha_{23} & 1 \end{pmatrix}. \]
Coefficient estimates are typically similar across OLS, WLS, and GLS, but they need not be identical because each estimator applies different weights implied by the assumed error structure. The estimators remain unbiased for the population-level effects, while their precision differs depending on how well the covariance structure is modeled. In the TLC example the fixed-effect estimates are essentially unchanged across the three models, e.g., \(\hat\beta_0 = 24.662\) in all of them.
OLS ignores heteroskedasticity and correlation → larger residual SE, higher AIC.
WLS improves efficiency by modeling unequal variance across weeks.
GLS(AR1) further improves model fit (lowest AIC) by capturing within-child correlation.
Fixed-effect estimates are broadly similar, but precision improves as error structure is modeled more realistically.
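A sketch of how this three-way comparison might be run with nlme::gls, reusing the TLC setup above (the child identifier column id is an assumption about the data layout):

```r
# OLS: constant variance, independent errors
m_ols <- gls(lead ~ treatment * week_f, data = tlcdata, method = "REML")

# WLS: week-specific variances, no correlation (m_wls, fit above)

# GLS: week-specific variances plus AR(1) correlation within child
m_gls_ar1 <- gls(
  lead ~ treatment * week_f,
  data = tlcdata,
  weights = varIdent(form = ~ 1 | week_f),
  correlation = corAR1(form = ~ 1 | id),
  method = "REML"
)

# All three fits share the same fixed effects, so the REML AICs are comparable
AIC(m_ols, m_wls, m_gls_ar1)
```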
Robust Inference
So far we have assumed the error covariance matrix \(V\) is known and correctly specified.
- If that is true, WLS/GLS give efficient estimates and valid standard errors.
- But if \(V\) is unknown or misspecified, the coefficient estimates remain unbiased, yet the usual standard errors can be wrong.
Solution: use robust (sandwich) standard errors, which provide valid inference without needing the exact form of \(V\).
Robust SEs for OLS
The general covariance of the OLS estimator is
\[
\operatorname{Cov}(\hat{\beta}_{OLS}) = (X^\top X)^{-1} X^\top V X (X^\top X)^{-1}.
\]
If \(V = \sigma^2 I\) (homoskedastic, independent errors), this reduces to \(\operatorname{Cov}(\hat\beta_{OLS}) = \sigma^2 (X^\top X)^{-1}\).
If \(V \neq \sigma^2 I\), the usual plug-in formula is invalid.
A robust alternative keeps the outer “bread” term but replaces the middle with an empirical estimate:
\[
\widehat{\operatorname{Cov}}(\hat\beta) = (X^\top V^{-1} X)^{-1}\, \hat\Omega\, (X^\top V^{-1} X)^{-1},
\]
where \(\hat\Omega\) captures empirical residual variation (for OLS, set \(V = I\)):
Clustered data:
Let \(\hat{\boldsymbol\epsilon}_i = y_i - X_i \hat\beta_{\text{GLS}}\) be the residual vector for cluster \(i\). Then \[
\hat\Omega = \sum_{i=1}^m X_i^\top V_i^{-1}\,\hat{\boldsymbol\epsilon}_i \hat{\boldsymbol\epsilon}_i^\top\, V_i^{-1} X_i.
\]
Independent data (special case):
Each cluster has size 1 and \(V = I\), so \(\hat{\boldsymbol\epsilon}_i = \hat\epsilon_i\) is a scalar. Then \[
\hat\Omega = \sum_{i=1}^N \hat\epsilon_i^2\, x_i x_i^\top = X^\top \operatorname{diag}(\hat\epsilon_1^2, \ldots, \hat\epsilon_N^2)\, X,
\] and the sandwich reduces to \((X^\top X)^{-1} X^\top \operatorname{diag}(\hat\epsilon_i^2)\, X\, (X^\top X)^{-1}\), recovering the White heteroskedasticity-robust estimator.
Summary
OLS + robust SEs: makes no assumptions about \(V\). Inference is valid, but estimates may be inefficient.
GLS + robust SEs: assumes some structure for \(V\). If that structure is close to the truth, estimates are more efficient; robust SEs protect inference if the structure is wrong.
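In R, robust and cluster-robust covariance estimates for an OLS fit are available from the sandwich and lmtest packages; a sketch, again assuming the tlcdata layout with child identifier id:

```r
library(sandwich)
library(lmtest)

m_lm <- lm(lead ~ treatment * week_f, data = tlcdata)

# White heteroskedasticity-robust (HC0) standard errors
coeftest(m_lm, vcov = vcovHC(m_lm, type = "HC0"))

# Cluster-robust standard errors, clustering on child
coeftest(m_lm, vcov = vcovCL(m_lm, cluster = ~ id))

# Hand-rolled HC0, matching vcovHC(type = "HC0") above:
X <- model.matrix(m_lm)
e <- residuals(m_lm)
bread <- solve(crossprod(X))  # (X'X)^{-1}
meat  <- crossprod(X * e)     # X' diag(e^2) X
sqrt(diag(bread %*% meat %*% bread))  # robust SEs
```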
GEE: TLC Example
We now fit a generalized estimating equations (GEE) model for lead level, with exchangeable working correlation within each child; GEE automatically computes robust (“sandwich”) SEs (a code sketch follows the list below). We will further discuss GEE in Module 3B.
Estimates population-averaged effects, not subject-specific trajectories.
Uses a working correlation structure (“exchangeable”) to model repeated measures within child.
Robust (sandwich) standard errors are reported, so inference remains valid even if the correlation model is wrong.
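A sketch of this fit with geepack (the data must be sorted by child; id and the tlcdata layout are assumptions carried over from the earlier examples):

```r
library(geepack)

# Population-averaged model with exchangeable working correlation within child;
# geeglm reports robust (sandwich) standard errors by default
m_gee <- geeglm(
  lead ~ treatment * week_f,
  family = gaussian,
  data = tlcdata,
  id = id,
  corstr = "exchangeable"
)
summary(m_gee)
```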
Intercept (24.66, SE = 0.71, p < 0.001)
Average baseline blood lead level in the placebo group is ~24.7 µg/dL.
Treatment main effect (−5.58, SE = 1.10, p < 0.001)
At baseline, succimer-treated children have significantly lower mean lead levels than placebo, and the robust SE confirms this effect is precise.
Week trends (placebo group)
Linear effect (−1.89, SE = 0.39, p < 0.001): mean lead decreases linearly across weeks.
Quadratic effect (+0.59, SE = 0.31, p ≈ 0.056): weak evidence of curvature.
Cubic effect (−0.19, SE = 0.27, p ≈ 0.48): no evidence of additional curvature.
Treatment × Week interactions
Linear (−1.54, SE = 0.86, p ≈ 0.07): succimer shows a steeper linear decline, but evidence is modest.
Quadratic (+8.54, SE = 0.95, p < 0.001): strong, robust evidence that succimer introduces a pronounced curved pattern of decline.
Cubic (−2.44, SE = 0.66, p < 0.001): robust evidence of additional curvature in succimer group compared to placebo.
Even if this structure is not perfectly correct, robust SEs protect our inference.
Summary
| Section | Description |
|---|---|
| What is the model? | Marginal models estimate population-level associations without specifying random effects. The general form is \(y = X\beta + \epsilon\), with \(\operatorname{Var}(\epsilon) = V\). |
| Motivation | Standard OLS assumes independent, homoskedastic errors. Real-world data (longitudinal, clustered, multilevel) often violate these assumptions, leading to biased SEs and invalid inference. |
| OLS vs. WLS vs. GLS | OLS: ignores correlation/heteroskedasticity → unbiased coefficients, wrong SEs if \(V \neq \sigma^2 I\). WLS: handles heteroskedasticity with diagonal \(V\). GLS: allows general \(V\) with variances + covariances. |
| Estimators | OLS: \(\hat{\beta}_{OLS} = (X^\top X)^{-1}X^\top y\). WLS: \(\hat{\beta}_{WLS} = (X^\top V^{-1}X)^{-1} X^\top V^{-1} y\) with \(V\) diagonal. GLS: same formula with full \(V\). |
| Properties | All are unbiased and consistent for \(\beta\). Efficiency differs: WLS/GLS improve efficiency when \(V\) is correctly specified. |
| Robust Inference | OLS + robust SEs: no assumptions about \(V\), robust but inefficient. GLS + robust SEs: assumes structure in \(V\) for efficiency; robust SEs protect against misspecification. |
| Covariance Structures | \(V_i = \sigma^2 R_i(\alpha)\) with group correlation matrix \(R_i(\alpha)\). Independent: no correlation. Exchangeable: equal correlation. Unstructured: free variances/covariances. AR(1): correlation decays with lag. |
| Example (TLC data) | WLS with varIdent model: succimer significantly reduced baseline lead levels (–5.6 µg/dL, p < 0.001) and showed a strong quadratic trajectory difference vs. placebo. Residual variability increased over time, modeled via week-specific variances. |
| Takeaway | Marginal models give valid population-level inference when correlation/heteroskedasticity is present. Efficiency gains come from modeling \(V\), while robust SEs ensure inference remains valid even if \(V\) is misspecified. |