24 Day 23 (July 14)
24.1 Announcements
Workday on Tuesday
Ch. 6 in Linear Models with R is what we are covering now
24.2 Selected questions from journals
- “Will we be docked points on final projects if we do utilize cross-validation for model evaluation?”
- “You mentioned it briefly, but the idea of reparameterization still confuses me. So, it’s the idea of re-writing the same model in a different way. Is this the same as model checking? Or is it just a different way to display the data?”
- “I read this in the book: “If all statistical assumptions of an interval-producing procedure are not met, we cannot be sure of the true coverage probability of the procedure as it will depend on the extent to which these assumptions are violated and the robustness of the procedure.” Could you share your perspective on this?”
24.3 ANOVA/F-test
F-test (pgs. 33 - 39 in Faraway (2014))
- The t-test is useful for testing simple hypotheses
- F-test is useful for testing more complicated hypotheses
- Example: comparing more than two means
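The table of group means below comes from R's built-in PlantGrowth data; a minimal sketch of one way to produce it (the `aggregate()` call is my assumption about how it was computed):

```r
# Mean weight in each treatment group of the built-in PlantGrowth data
aggregate(weight ~ group, data = PlantGrowth, FUN = mean)
```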
##   group weight
## 1  ctrl  5.032
## 2  trt1  4.661
## 3  trt2  5.526
To determine if treatment 1 and 2 had an effect, we might propose the null hypothesis \[\text{H}_{0}:\beta_1 = \beta_2 = 0\] and alternative hypothesis \[\text{H}_{\text{a}}:\beta_1 \neq 0 \;\text{or}\; \beta_2 \neq 0.\] To test the null hypothesis, we calculate the probability of observing values of \(\hat{\beta_1}\) and \(\hat{\beta_2}\) as or more extreme than what we would expect under the null hypothesis. To calculate this probability, note that under \(\text{H}_{0}\) \[\frac{(TSS - RSS)/(p-1)}{RSS/(n-p)} \sim\text{F}(p-1,n-p).\] The probability of observing a value of the statistic as or more extreme than the one we did can be written as \[\text{P}\bigg(\frac{(TSS - RSS)/(p-1)}{RSS/(n-p)} >F_{\text{obs}}\bigg),\] where \(F_{\text{obs}}\) is the observed value of the statistic, \(TSS = (\mathbf{y} - \bar{\mathbf{y}})^{\prime}(\mathbf{y} - \bar{\mathbf{y}})\), and \(RSS = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\prime}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\). This probability is the area under the curve of an F-distribution to the right of \(F_{\text{obs}}\). The code below shows how to do these calculations in R.
## 
## Call:
## lm(formula = weight ~ group, data = PlantGrowth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0710 -0.4180 -0.0060  0.2627  1.3690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0320     0.1971  25.527   <2e-16 ***
## grouptrt1    -0.3710     0.2788  -1.331   0.1944    
## grouptrt2     0.4940     0.2788   1.772   0.0877 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6234 on 27 degrees of freedom
## Multiple R-squared:  0.2641, Adjusted R-squared:  0.2096 
## F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591
## Analysis of Variance Table
## 
## Response: weight
##           Df  Sum Sq Mean Sq F value  Pr(>F)  
## group      2  3.7663  1.8832  4.8461 0.01591 *
## Residuals 27 10.4921  0.3886                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
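The same F-test can equivalently be framed as a comparison of two nested models: the intercept-only model (the null hypothesis) versus the model with group effects. A minimal sketch (the object name `m0` is my own):

```r
# Compare the intercept-only (null) model to the full model with
# treatment effects; anova() on two nested fits computes the same F-test
m0 <- lm(weight ~ 1, data = PlantGrowth)      # model under H0
m1 <- lm(weight ~ group, data = PlantGrowth)  # full model
anova(m0, m1)
```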
# 'By hand'
beta.hat <- coef(m1)
X <- model.matrix(m1)
y <- PlantGrowth$weight
y.bar <- mean(y)
n <- 30
p <- 3
RSS <- t(y - X %*% beta.hat) %*% (y - X %*% beta.hat)
RSS

##          [,1]
## [1,] 10.49209

TSS <- t(y - y.bar) %*% (y - y.bar)
TSS

##          [,1]
## [1,] 14.25843

TSS - RSS

##         [,1]
## [1,] 3.76634

F.stat <- ((TSS - RSS)/(p - 1))/(RSS/(n - p))
F.stat

##          [,1]
## [1,] 4.846088

1 - pf(F.stat, df1 = p - 1, df2 = n - p)

##            [,1]
## [1,] 0.01590996
Live example
24.4 Model checking
- Given a statistical model, estimation, prediction, and statistical inference are somewhat “automatic”
- If the statistical model is misspecified (i.e., wrong) in any way, the resulting statistical inference (including predictions and prediction uncertainty) rests on a house of cards.
- George Box quote: “All models are wrong but some are useful.”
- Box (1976) “Since all models are wrong the scientist cannot obtain a correct one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.”
- We have assumed the linear model \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\), which allowed us to:
- Estimate \(\boldsymbol{\beta}\) and \(\sigma^2\)
- Make statistical inference about \(\hat{\boldsymbol{\beta}}\)
- Make predictions and obtain prediction intervals for future values of \(\mathbf{y}\)
- All statistical inference we obtained requires that the linear model \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) gave rise to the data.
- Support
- Linearity
- Constant variance
- Independence
- Outliers
- Model diagnostics (Ch 6 in Faraway (2014)) is a set of tools and procedures to see if the assumptions of our model are approximately correct.
- Statistical tests (e.g., Shapiro-Wilk test for normality)
- Specific
- What if you reject the null?
- Graphical
- Broad
- Subjective
- Widely used
- Predictive model checks
- More common for Bayesian models (e.g., posterior predictive checks)
- We will explore numerous ways to check
- Distributional assumptions
- Normality
- Constant variance
- Correlation among errors
- Detection of outliers
- Deterministic model structure
- Is \(\mathbf{X}\boldsymbol{\beta}\) a reasonable assumption?
24.5 Distributional assumptions
Why did we assume \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\)?
Is the assumption \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) ever correct? Is there a “true” model?
When would we expect the assumption \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) to be approximately correct?
- Human body weights
- Stock prices
- Temperature
- Proportion of votes for a candidate in an election
Checking distributional assumptions
- If \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\), then \(\mathbf{y} - \mathbf{X\boldsymbol{\beta}}\sim ?\)
If the assumption \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) is approximately correct, then what should \(\hat{\boldsymbol{\varepsilon}}\) look like?
Example: checking the assumption that \(\boldsymbol{\varepsilon}\sim\text{N}(\mathbf{0},\sigma^{2}\mathbf{I})\)
- Data
y <- c(63, 68, 61, 44, 103, 90, 107, 105, 76, 46, 60, 66, 58, 39, 64, 29, 
  37, 27, 38, 14, 38, 52, 84, 112, 112, 97, 131, 168, 70, 91, 52, 33, 33, 
  27, 18, 14, 5, 22, 31, 23, 14, 18, 23, 27, 44, 18, 19)
year <- 1965:2011
df <- data.frame(y = y, year = year)

plot(x = df$year, y = df$y, xlab = "Year", ylab = "Annual count", 
  main = "", col = "brown", pch = 20)
m1 <- lm(y ~ year, data = df)
abline(m1)
- Histogram of \(\hat{\boldsymbol{\varepsilon}}\)
m1 <- lm(y ~ year, data = df)
e.hat <- residuals(m1)
hist(e.hat, col = "grey", breaks = 15, main = "", 
  xlab = expression(hat(epsilon)))
- Plot covariate vs. \(\hat{\boldsymbol{\varepsilon}}\)
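No code accompanies this bullet; below is a minimal sketch of the plot it refers to, assuming the fit `m1` and data frame `df` from the chunks above:

```r
# Plot the covariate (year) against the residuals; under the model
# assumptions the points should scatter evenly around zero with no trend
e.hat <- residuals(m1)
plot(x = df$year, y = e.hat, xlab = "Year", 
  ylab = expression(hat(epsilon)), col = "brown", pch = 20)
abline(h = 0, lty = 2)
```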
- A formal hypothesis test (see pg. 81 in Faraway (2014))
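The output below was presumably produced by a call along these lines, assuming `e.hat` holds the residuals from the chunk above:

```r
# Shapiro-Wilk test of H0: the residuals are draws from a normal distribution
shapiro.test(e.hat)
```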
## 
##  Shapiro-Wilk normality test
## 
## data:  e.hat
## W = 0.86281, p-value = 5.709e-05
Example: Checking the assumption that \(\boldsymbol{\varepsilon}\sim\text{N}\left(\mathbf{0},\sigma^{2}\mathbf{I}\right)\) (What it should look like)
- Simulated data
beta.truth <- c(2356, -1.15)
sigma2.truth <- 33^2
n <- 47
year <- 1965:2011
X <- model.matrix(~year)

set.seed(2930)
y <- rnorm(n, X %*% beta.truth, sigma2.truth^0.5)
df1 <- data.frame(y = y, year = year)

plot(x = df1$year, y = df1$y, xlab = "Year", ylab = "Annual count", 
  main = "", col = "brown", pch = 20)
## 
## Call:
## lm(formula = y ~ year, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.757 -22.237   3.767  19.353  66.634 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 1717.2121   638.5293   2.689   0.0100 *
## year          -0.8272     0.3212  -2.575   0.0134 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.87 on 45 degrees of freedom
## Multiple R-squared:  0.1285, Adjusted R-squared:  0.1091 
## F-statistic: 6.632 on 1 and 45 DF,  p-value: 0.01337
- Histogram of \(\hat{\boldsymbol{\varepsilon}}\)
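A sketch of the histogram code for the simulated data; the original chunk is not shown, and the fit name `m2` is my own:

```r
# Fit the linear model to the simulated data and examine the residuals
m2 <- lm(y ~ year, data = df1)
e.hat <- residuals(m2)
hist(e.hat, col = "grey", breaks = 15, main = "", 
  xlab = expression(hat(epsilon)))
```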
- Plot covariate vs. \(\hat{\boldsymbol{\varepsilon}}\)
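Likewise, a sketch of the covariate-vs.-residual plot for the simulated data, assuming `df1` from the simulation chunk above:

```r
# Residuals of the fit to the simulated data, plotted against year;
# this is what the plot should look like when the model is correct
e.hat <- residuals(lm(y ~ year, data = df1))
plot(x = df1$year, y = e.hat, xlab = "Year", 
  ylab = expression(hat(epsilon)), col = "brown", pch = 20)
abline(h = 0, lty = 2)
```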
- A formal hypothesis test (see pg. 81 in Faraway (2014))
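The output below is consistent with a call like the following, assuming `df1` from the simulation chunk above:

```r
# Shapiro-Wilk test on the residuals from the simulated-data fit;
# here the data really are normal, so we expect a large p-value
e.hat <- residuals(lm(y ~ year, data = df1))
shapiro.test(e.hat)
```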
## 
##  Shapiro-Wilk normality test
## 
## data:  e.hat
## W = 0.98556, p-value = 0.8228