Wood, S. (2017). Generalized Additive Models: An Introduction with R, 2nd ed. Chapter 3 has a nice self-contained introduction to generalized linear models.
Gelman, A. & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models.
Overview
In this module, you will learn generalized linear models (GLMs), a flexible extension of linear regression that allows for modeling response variables with error distributions other than the normal distribution.
After completing this module, you will be able to:
Explain the components of GLMs (link function, linear predictor, and error distribution).
Fit and interpret logistic and Poisson regression models.
Calculate and interpret odds ratios and rate ratios.
Perform hypothesis testing using Wald and likelihood ratio tests.
Motivation and Examples
Motivation:
If the outcome variable does not follow a normal distribution, alternative regression methods, such as generalized linear models, are required.
Bernoulli/Binomial: Binary outcomes (0/1) or counts of successes out of a fixed number of trials.
Poisson: Count data representing the number of events occurring over a fixed period of time or region of space (often expressed as a rate).
Treating these as continuous outcomes can lead to biased estimates, incorrect standard errors, and invalid inference because the underlying distributional assumptions of linear regression are violated.
Motivating Example: Preterm Births Data
The dataset: a cohort of live births in Georgia in 2001 (\(N = 77{,}340\)).
Variables:
ptb: indicator for whether a baby from pregnancy \(i\) was born preterm (<37 weeks).
age: the mother’s age at delivery (centered at age 25).
male: indicator of the baby’s sex (1 = male, 0 = female).
tobacco: indicator for mother’s tobacco use during pregnancy (1 = yes, 0 = no).
Tobacco is treated as the exposure variable. Mother’s age is a covariate.
Recall the linear regression model \(y_i = \beta_0 + \sum_{k=1}^K \beta_k x_{ik} + \epsilon_i\) with \(\epsilon_i \sim N(0, \sigma^2)\). This is equivalent to \(y_i \sim N(\beta_0 + \sum_{k=1}^K \beta_k x_{ik} , \sigma^2)\), where \(x_{ik}\) is the \(k\)th covariate for observation \(i\).
Reminder of the assumptions of the above model:
The \(\beta's\) are fixed unknown constants.
The mean of \(y\) is a linear function of the covariates.
The expectation of the errors \(\epsilon_i\) is zero.
Only the \(\epsilon_i\)'s are random; the covariates are treated as fixed.
Therefore, \(y_i\) follows a normal distribution with \(E(y_i|x_i) = \beta_0 + \sum_{k=1}^K \beta_k x_{ik}\); the linear regression part models the mean function of \(y_i\). These assumptions do not hold for other kinds of data, e.g., binary or count data.
The Better Approach: Generalized Linear Regression
A generalized linear model (GLM) extends linear regression to settings where the response variable is generated from a distribution in the exponential family. The exponential family is mathematically convenient because it allows for:
A natural way to link the mean of the response to the linear predictors via a link function.
Simple expressions for the mean and variance, which are essential for estimation and inference.
Unified estimation procedures (e.g., MLE).
A GLM involves three ingredients:
A probability distribution in the Exponential Family.
A linear model \(x_i'\beta\).
A link function \(g(\cdot)\), with inverse \(g^{-1}(\cdot)\), that relates the linear model to the expectation, i.e., transforms the mean function so that it satisfies the linear model assumptions.
A distribution belongs to the exponential family if its density can be written as
\[
f(y \mid \theta, \phi) = \exp\left( \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right),
\]
where \(a, b, c\) are known functions, and \(\phi\) is a known scale parameter. The ONLY parameter is \(\theta\). In the GLM, \(\theta\) will be a function of \(x_i'\beta\).
Examples of distributions in the exponential family include: normal with known variance (link = identity), Bernoulli (link = logit or probit), binomial (link = logit or log), gamma, exponential (link = negative inverse), and others.
Let's derive key statistical properties of the general exponential family form.
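As a brief sketch of two such properties (a standard derivation; see Wood 2017, Ch. 3): differentiating the log-density with respect to \(\theta\) gives the score
\[
\frac{\partial \log f}{\partial \theta} = \frac{y - b'(\theta)}{a(\phi)},
\]
which has expectation zero, so \(E(Y) = b'(\theta)\). Applying the second Bartlett identity, \(E\left(\partial^2 \log f / \partial \theta^2\right) + E\left[(\partial \log f / \partial \theta)^2\right] = 0\), then gives
\[
Var(Y) = a(\phi)\, b''(\theta).
\]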
The Binomial Distribution
The Binomial distribution with parameters \(n\) and \(p\) is:
\[f(y \mid p) = \binom{n}{y} p^y (1-p)^{n-y},\]
which can be expressed in the exponential family form
\[
f(y \mid p) = \exp\left( y \log\frac{p}{1-p} + n \log(1-p) + \log\binom{n}{y} \right),
\]
so that \(\theta = \log\frac{p}{1-p}\), \(b(\theta) = n \log(1 + e^{\theta})\), \(a(\phi) = 1\), and \(c(y, \phi) = \log\binom{n}{y}\).
We know that \(\mu_i = p_i = P(Y_i=1)\) (taking \(n = 1\), the Bernoulli case). So we wish to choose a link function \(g(\cdot)\) that enforces a linear relationship between a transformation of the expectation \(\mu_i\) and the linear predictor, i.e., we wish to model
\[
g(\mu_i) = \log\frac{\mu_i}{1-\mu_i} = \beta_0 + \sum_{k=1}^K \beta_k x_{ik}.
\]
It is really important to understand that, while the change in log-odds per one-unit change in \(X_i\) is constant, the corresponding change in probability is not: it depends on the current value of \(X_i\).
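As a small numeric illustration (the 0.41 shift is an arbitrary illustrative value, not taken from any fitted model), the same increase in log-odds produces different changes on the probability scale depending on the baseline:
# plogis() is the inverse logit (logistic CDF)
# Same +0.41 shift in log-odds, different shift in probability
plogis(-2.42 + 0.41) - plogis(-2.42)  # ~0.036 starting from a low baseline probability
plogis(0 + 0.41) - plogis(0)          # ~0.101 starting near p = 0.5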
# The Logit Function
p.i <- seq(0, 1, by = 0.001)
g_p.i <- log(p.i / (1 - p.i))
plot(p.i, g_p.i, col = "red", xlab = "p", ylab = "Logit Function")
Note how the steepness of the curve changes with \(p\): the logit is steep near \(p = 0\) and \(p = 1\) and flatter near \(p = 0.5\). Note also that the sign of the logit flips at \(p = 0.5\).
Fitting a logistic regression in R
We come back to our motivating example using data on preterm births in Georgia (2001).
Fitting a GLM in R is very similar to fitting a linear regression model. We need to specify the distribution as binomial and the link function as logit.
### The GLM Function
load("/Users/emilypeterson/Library/CloudStorage/OneDrive-EmoryUniversity/BIOS 526/BIOS526_Book/data/PTB.Rdata")
fit <- glm(formula = ptb ~ age + male + tobacco, family = binomial(link = "logit"), data = dat)
summary(fit)
Call:
glm(formula = ptb ~ age + male + tobacco, family = binomial(link = "logit"),
data = dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.4212664 0.0631355 -38.350 < 2e-16 ***
age -0.0006295 0.0021596 -0.291 0.77068
maleM 0.0723659 0.0258672 2.798 0.00515 **
tobacco 0.4096495 0.0534627 7.662 1.83e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 44908 on 77339 degrees of freedom
Residual deviance: 44846 on 77336 degrees of freedom
AIC: 44854
Number of Fisher Scoring iterations: 5
Model Interpretation
Preterm delivery was significantly associated with male babies (p-value = 0.005) when controlling for age and mother’s smoking status. The odds ratio of a preterm birth for a male baby versus a female baby was 1.07 (95% CI: 1.02, 1.13).
\(OR = e^{0.0723} = 1.07\) with 95% CI = (1.02, 1.13).
How we get CIs: \(e^{(0.0723 \pm 1.96 * 0.0258)}\).
NOTE: TRANSFORM THE INTERVALS. DO NOT TRANSFORM THE STANDARD ERRORS (this requires the delta method).
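As a minimal sketch in R (using the fitted object fit from above), compute the Wald interval on the log-odds scale and then exponentiate the endpoints:
# Odds ratio and 95% CI for male: exponentiate the interval, not the SE
est <- coef(fit)["maleM"]
se  <- sqrt(diag(vcov(fit))["maleM"])
exp(c(OR = unname(est), lower = unname(est - 1.96 * se), upper = unname(est + 1.96 * se)))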
Preterm delivery was significantly associated with whether the mother smoked during pregnancy (p<0.001) when controlling for age and the baby’s sex.
\(OR = e^{0.409} = 1.51\) with 95% CI = (1.36, 1.67).
The baseline proportion (female babies born to mother of age 25 who didn’t smoke) of preterm delivery was
\[
\frac{e^{-2.42}}{1 + e^{-2.42}} = 0.08.
\]
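As a quick check in R (using fit from above), the built-in inverse logit gives the same baseline proportion:
# Inverse logit of the intercept: baseline probability of preterm birth
plogis(coef(fit)[["(Intercept)"]])  # ~0.08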
We did not find a significant effect of mother’s age (p = 0.77).
Wald Test
The Wald Test evaluates the significance of individual regression coefficients directly by comparing the estimated coefficient to its standard error. It relies on the fact that MLEs of the coefficients are approximately normally distributed in large samples (Central Limit Theorem).
Under regularity conditions and asymptotic arguments, \(\hat{\beta} \approx N(\beta, I(\beta)^{-1})\), where \(I(\beta)\) is the Fisher information matrix (see Wood p. 106). The standard error of \(\hat{\beta}\) is derived from \(I(\beta)^{-1}\).
The Wald test uses the standardized statistic: \[
Z = \frac{\hat{\beta}_i}{s.e.(\hat{\beta}_i)}
\]
The z value column in the summary output is exactly this statistic, and the p-values test \(H_0 : \beta_i = 0\). For example, for male, \(z \approx 2.798\) with \(p = 0.005\), indicating that male babies are significantly associated with preterm birth, holding other variables constant.
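As a quick numeric check against the summary output above:
# Reproduce the Wald z-statistic and two-sided p-value for maleM
z <- 0.0723659 / 0.0258672             # estimate / standard error
2 * pnorm(abs(z), lower.tail = FALSE)  # ~0.005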
Likelihood Ratio Test
In logistic regression, we use the Likelihood Ratio Test to determine whether adding predictors significantly improves model fit, since the standard F-tests from linear regression do not apply. The Likelihood Ratio Test (LRT) provides a natural way to compare nested models by examining how much better the full model fits the data compared to a reduced model (with some parameters set to zero).
The likelihood ratio test (LRT) is used to test the null hypothesis that any subset of the \(\beta's\) is equal to 0. The number of \(\beta's\) in the full model is \(k+1\), while the number of \(\beta's\) in the reduced model is \(r+1\). (Remember the reduced model is the model that results when the \(\beta's\) in the null hypothesis are set to 0). Thus, the number of \(\beta's\) being tested in the null hypothesis is \((k+1) - (r+1) = k-r\).
The Likelihood Ratio Test is given by:
\[
LR = \frac{\max_{H_0} L(\beta)}{\max_{H_A} L(\beta)}.
\]
This gives the ratio of the likelihood of the reduced (null) model evaluated at its MLEs to the likelihood of the full model evaluated at its MLEs. Under \(H_0\), the statistic \(-2 \log LR = -2[l_{reduced} - l_{full}]\) approximately follows a \(\chi^2_{k-r}\) distribution.
In the normal case, this is analogous to the F-test described in Chapter 1.
Maximizing the log-likelihood has no closed-form solution, so MLEs \(\hat{\beta}\) are obtained via iteratively reweighted least squares (IRLS; not covered in detail here).
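Below is a minimal illustrative sketch of IRLS for the logit link (glm() uses a more careful implementation internally; X is assumed to be a design matrix including an intercept column, and y a 0/1 response):
# Minimal IRLS sketch for logistic regression (illustration only)
irls_logit <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    eta <- as.vector(X %*% beta)          # linear predictor
    mu  <- plogis(eta)                    # inverse logit: fitted probabilities
    W   <- mu * (1 - mu)                  # working weights (Bernoulli variance)
    z   <- eta + (y - mu) / W             # working response
    beta_new <- as.vector(solve(t(X) %*% (W * X), t(X) %*% (W * z)))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
# e.g., irls_logit(model.matrix(fit), dat$ptb) should closely match coef(fit),
# assuming ptb is coded 0/1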
library(car)
Anova(fit)
Analysis of Deviance Table (Type II tests)
Response: ptb
LR Chisq Df Pr(>Chisq)
age 0.085 1 0.770669
male 7.834 1 0.005126 **
tobacco 53.648 1 2.399e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation
Comparing the full model against the reduced model containing everything except age, we see a \(\chi^2\) statistic of 0.085 with a p-value of 0.77, meaning that removing age does not significantly worsen the model fit; i.e., age is not a significant predictor when controlling for the other variables.
Removing the male variable significantly worsens model fit (\(p < 0.01\)), meaning the baby's sex is a significant predictor.
Tobacco use has a \(\chi^2\) statistic of 53.65 and a p-value of <0.001, meaning it is a highly significant predictor, and removing it significantly reduces model fit.
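Equivalently, each row of the Anova table can be reproduced by hand (a sketch using dat and fit from above): fit the reduced model without that predictor and compare the two fits.
# LRT for tobacco: compare the full model to the model that drops tobacco
fit_reduced <- glm(ptb ~ age + male, family = binomial(link = "logit"), data = dat)
anova(fit_reduced, fit, test = "LRT")  # LR chi-square 53.648 on 1 df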
Summary
What is the model?
A generalized linear model (GLM) links the expected response \(E(Y|X) = \mu\) to predictors via a link function: \(g(\mu) = \beta_0 + \sum_k \beta_k x_{ik}\). The response \(Y\) must follow a distribution in the exponential family (e.g., Normal, Bernoulli, Binomial).
Model Components
1. Random component: the distribution of \(Y\) (e.g., Bernoulli for binary data).
2. Systematic component: the linear predictor \(\eta_i = x_i' \beta\).
3. Link function: \(g(\mu_i)\) maps the mean response to \(\eta_i\).
Logistic Regression
A special case of the GLM for binary outcomes (\(Y \in \{0,1\}\)) using the logit link: \(g(p) = \log \frac{p}{1-p}\). Coefficients \(\beta_j\) represent changes in log-odds, while \(e^{\beta_j}\) are odds ratios.
Model Assumptions
- Correct link function (e.g., logit).
- Independent observations.
- Mean and variance correctly specified for Bernoulli data: \(E(Y)=p\), \(Var(Y)=p(1-p)\).
Hypothesis Testing
Wald Test: tests \(H_0: \beta_j = 0\) using \(Z = \hat{\beta}_j / \text{SE}(\hat{\beta}_j)\). Likelihood Ratio Test (LRT): compares nested models with \(\lambda_{LR} = -2 [l_{reduced} - l_{full}] \sim \chi^2\).
Interpretation
- \(\beta_j\): change in log-odds of success for a one-unit increase in \(X_j\), holding other variables constant.
- \(e^{\beta_j}\): odds ratio (OR) for a one-unit change in \(X_j\).
- Intercept \(\beta_0\): baseline log-odds for the reference group.