20 Day 19 (July 8)
20.1 Announcements
See schedule for next 5 workdays
Final week (July 28-Aug. 1)
Assignment 4 is posted
- Due date?
Selected questions from journals
- “Since bootstrapping is good for larger datasets and (ab)normal distributions, and the delta method is good for symmetrical distributions, why use the delta method? Does bootstrapping have fewer assumptions than the delta method? Can you accomplish similar outcomes using the bootstrap vs DM?”
- “How to handle model uncertainty? Especially when we know that the patterns in our data might not hold in the future. For example, if a species is currently recovering but hunting suddenly starts again, or in agriculture, if a new management practice is introduced or a pathogen mutation appears, the predictions from our current model could quickly become unreliable or useless. How do we communicate to non-specialists that our models have these limitations and might not work if things change unexpectedly, while still helping them understand why it is important in the first place?”
- “Where bootstrapping is concerned, is it possible to use bootstrapping skills or something similar to create synthetic data? It is not for testing, just for ensuring the privacy of original data. I am trying to synthesize spatial data and still get the same model results.”
20.2 Prediction
- My definitions of inference and prediction
- Inference = Learning about what you can’t observe given what you did observe (and some assumptions)
- Prediction = Learning about what you didn’t observe given what you did observe (and some assumptions)
- Prediction (Ch. 4 in Faraway (2014))
- Derived quantity
- \(\mathbf{x}^{\prime}_0\) is a \(1\times p\) row vector of covariates (could be a row of \(\mathbf{X}\) or completely new values of the predictors)
- Use \(\widehat{\text{E}(y_0)}=\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}\)
- Example
- Predicting the number of whooping cranes
url <- "https://www.dropbox.com/s/8grip274233dr9a/Butler%20et%20al.%20Table%201.csv?dl=1" df1 <- read.csv(url) plot(df1$Winter, df1$N, xlab = "Year", ylab = "Population size", xlim = c(1940, 2025), ylim = c(0, 300), typ = "b", lwd = 1.5, pch = "*")
- What model should we use?
m1 <- lm(N ~ Winter + I(Winter^2), data = df1)
Ey.hat <- predict(m1)
plot(df1$Winter, df1$N, xlab = "Year", ylab = "Population size",
     xlim = c(1940, 2021), ylim = c(0, 300), typ = "b", lwd = 1.5, pch = "*")
points(df1$Winter, Ey.hat, typ = "l", col = "red")
- Derived quantity
20.3 Intervals for predictions
Expected value and variance of \(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}\)
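Both follow from \(\hat{\boldsymbol{\beta}}\sim\text{N}\left(\boldsymbol{\beta},\sigma^{2}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\right)\):
\[\text{E}(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}})=\mathbf{x}^{\prime}_0\boldsymbol{\beta},\qquad\text{Var}(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}})=\sigma^{2}\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0}\]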
Confidence interval: \(\text{P}\left(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}-t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^{2}\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0}}<\mathbf{x}^{\prime}_0\boldsymbol{\beta}<\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}+t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^{2}\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0}}\right)=1-\alpha\)
- The 95% CI is \(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}\pm t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^{2}\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0}}\)
- In R
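The exact call isn't shown in the notes, but a fit/lwr/upr matrix like the output below comes from asking predict() for a confidence interval, e.g.:

Ey.hat <- predict(m1, interval = "confidence")  # columns: fit, lwr (lower), upr (upper)
head(Ey.hat)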
##        fit      lwr      upr
## 1 29.35980 21.43774 37.28186
## 2 28.84122 21.42928 36.25315
## 3 28.47610 21.54645 35.40575
## 4 28.26444 21.78812 34.74075
## 5 28.20624 22.15308 34.25940
## 6 28.30150 22.64000 33.96300
plot(df1$Winter, df1$N, xlab = "Year", ylab = "Population size",
     xlim = c(1940, 2025), ylim = c(0, 300), typ = "b", lwd = 1.5, pch = "*")
points(df1$Winter, Ey.hat[, 1], typ = "l", col = "red")
points(df1$Winter, Ey.hat[, 2], typ = "l", col = "red", lty = 2)
points(df1$Winter, Ey.hat[, 3], typ = "l", col = "red", lty = 2)
- Why are there so many data points that fall outside of the 95% CIs?
Prediction intervals vs. Confidence intervals
CIs for \(\widehat{\text{E}(y_0)}=\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}\)
How to interpret \(\widehat{\text{E}(y_0)}\)
What if I wanted to predict \(y_0\)?
- \(y_0\sim\text{N}(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}},\hat{\sigma}^2)\)
Expected value and variance of \(y_0\)
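For a new observation independent of the data used to fit the model,
\[\text{E}(y_0-\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}})=0,\qquad\text{Var}(y_0-\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}})=\sigma^{2}\left(1+\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0}\right),\]
which is where the extra 1 in the prediction interval comes from.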
Prediction interval: \(\text{P}\left(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}-t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^{2}(1+\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0})}<y_0<\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}+t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^{2}(1+\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0})}\right)=1-\alpha\)
The 95% PI is \(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}\pm t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^{2}(1+\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0})}\)
Example in R
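As above, the call isn't shown, but interval = "prediction" produces the fit/lwr/upr matrix used below:

y.hat <- predict(m1, interval = "prediction")  # wider than the CI: adds sigma^2 for the new observation
head(y.hat)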
##        fit      lwr      upr
## 1 29.35980 6.630220 52.08938
## 2 28.84122 6.284368 51.39807
## 3 28.47610 6.073089 50.87910
## 4 28.26444 5.997480 50.53139
## 5 28.20624 6.058655 50.35382
## 6 28.30150 6.257741 50.34526
plot(df1$Winter, df1$N, xlab = "Year", ylab = "Population size",
     xlim = c(1940, 2025), ylim = c(0, 300), typ = "b", lwd = 1.5, pch = "*")
points(df1$Winter, y.hat[, 1], typ = "l", col = "red")
points(df1$Winter, y.hat[, 2], typ = "l", col = "red", lty = 2)
points(df1$Winter, y.hat[, 3], typ = "l", col = "red", lty = 2)
Live example
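One sketch of what the live example might look like (the year 2030 is our illustration, not from the notes): forecast a future count by passing a new value of the predictor to predict().

predict(m1, newdata = data.frame(Winter = 2030), interval = "prediction")  # point forecast with 95% PI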
20.4 t-test
- Common statistical test taught in introductory statistics classes
Linear model representation
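With treatment contrasts, the two-sample (equal variance) t-test is the linear model
\[y_i=\beta_0+\beta_1 x_i+\varepsilon_i,\qquad x_i=\text{I}(\text{group}_i=\text{trt2}),\qquad\varepsilon_i\sim\text{N}(0,\sigma^{2}),\]
so \(\beta_0=\mu_{\text{trt1}}\), \(\beta_1=\mu_{\text{trt2}}-\mu_{\text{trt1}}\), and the t-test of equal means is the test of \(\beta_1=0\).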
Example: comparing two means
df.plant <- PlantGrowth[-c(1:10), ]
df.plant$group <- factor(df.plant$group)
boxplot(weight ~ group, df.plant, col = "grey", ylim = c(0, 7))
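The group means below were presumably computed with something like:

aggregate(weight ~ group, data = df.plant, FUN = mean)  # mean weight per group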
##   group weight
## 1  trt1  4.661
## 2  trt2  5.526
Use the t.test() function in R.
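The header "Two Sample t-test" and df = 18 in the output indicate the pooled-variance version, so the call was presumably:

t.test(weight ~ group, data = df.plant, var.equal = TRUE)  # pooled two-sample t-test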
## 
##  Two Sample t-test
## 
## data:  weight by group
## t = -3.0101, df = 18, p-value = 0.007518
## alternative hypothesis: true difference in means between group trt1 and group trt2 is not equal to 0
## 95 percent confidence interval:
##  -1.4687336 -0.2612664
## sample estimates:
## mean in group trt1 mean in group trt2 
##              4.661              5.526
Use the lm() function in R.
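The Call line in the output confirms the formula; note the t value for grouptrt2 matches t.test() up to sign (grouptrt2 estimates trt2 minus trt1):

summary(lm(weight ~ group, data = df.plant))  # treatment-contrast parameterization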
## 
## Call:
## lm(formula = weight ~ group, data = df.plant)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0710 -0.3573 -0.0910  0.2402  1.3690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.6610     0.2032   22.94 8.93e-15 ***
## grouptrt2     0.8650     0.2874    3.01  0.00752 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6426 on 18 degrees of freedom
## Multiple R-squared:  0.3348, Adjusted R-squared:  0.2979 
## F-statistic: 9.061 on 1 and 18 DF,  p-value: 0.007518
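Dropping the intercept refits the same model in the means parameterization, so each coefficient is a group mean (the Call line below confirms the formula):

summary(lm(weight ~ group - 1, data = df.plant))  # one coefficient per group mean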
## 
## Call:
## lm(formula = weight ~ group - 1, data = df.plant)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0710 -0.3573 -0.0910  0.2402  1.3690 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## grouptrt1   4.6610     0.2032   22.94 8.93e-15 ***
## grouptrt2   5.5260     0.2032   27.20 4.52e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6426 on 18 degrees of freedom
## Multiple R-squared:  0.986, Adjusted R-squared:  0.9844 
## F-statistic: 632.9 on 2 and 18 DF,  p-value: < 2.2e-16
- Live example