20 Day 19 (July 8)
20.1 Announcements
See schedule for next 5 workdays
Final week (July 28-Aug. 1)
Assignment 4 is posted
- Due date?
Selected questions from journals
- “Since bootstrapping is good for larger datasets and (ab)normal distributions, and the delta method is good for symmetrical distributions, why use the delta method? Does bootstrapping have fewer assumptions than the delta method? Can you accomplish similar outcomes using the bootstrap vs DM?”
- “How to handle model uncertainty? Especially when we know that the patterns in our data might not hold in the future. For example, if a species is currently recovering but hunting suddenly starts again, or in agriculture, if a new management practice is introduced or a pathogen mutation appears, the predictions from our current model could quickly become unreliable or useless. How do we communicate to non-specialists that our models have these limitations and might not work if things change unexpectedly, while still helping them understand why it is important in the first place?”
- “Where bootstrapping is concerned, is it possible to use bootstrapping skills or something similar to create synthetic data? It is not for testing, just for ensuring the privacy of original data. I am trying to synthesize spatial data and still get the same model results.”
20.2 Prediction
- My definitions of inference and prediction
- Inference = Learning about what you can’t observe given what you did observe (and some assumptions)
- Prediction = Learning about what you didn’t observe given what you did observe (and some assumptions)
- Prediction (Ch. 4 in Faraway (2014))
- Derived quantity
- \(\mathbf{x}^{\prime}_0\) is a \(1\times p\) row vector of covariates (could be a row of \(\mathbf{X}\) or completely new values of the predictors)
- Use \(\widehat{\text{E}(y_0)}=\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}\)
- Example
- Predicting the number of whooping cranes
url <- "https://www.dropbox.com/s/8grip274233dr9a/Butler%20et%20al.%20Table%201.csv?dl=1" df1 <- read.csv(url) plot(df1$Winter, df1$N, xlab = "Year", ylab = "Population size", xlim = c(1940, 2025), ylim = c(0, 300), typ = "b", lwd = 1.5, pch = "*")
- What model should we use?
m1 <- lm(N ~ Winter + I(Winter^2), data = df1)
Ey.hat <- predict(m1)
plot(df1$Winter, df1$N, xlab = "Year", ylab = "Population size",
     xlim = c(1940, 2021), ylim = c(0, 300), typ = "b", lwd = 1.5, pch = "*")
points(df1$Winter, Ey.hat, typ = "l", col = "red")
- Derived quantity
20.3 Intervals for predictions
Expected value and variance of \(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}\)
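Both follow from \(\hat{\boldsymbol{\beta}}\sim\text{N}\left(\boldsymbol{\beta},\sigma^{2}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\right)\):
\[\text{E}(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}})=\mathbf{x}^{\prime}_0\boldsymbol{\beta},\qquad\text{Var}(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}})=\sigma^{2}\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0}\]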
Confidence interval: \(\text{P}\left(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}-t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^{2}\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0}}<\mathbf{x}^{\prime}_0\boldsymbol{\beta}<\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}+t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^{2}\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0}}\right)=1-\alpha\)
- The 95% CI is \(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}\pm t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^{2}\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0}}\)
- In R
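The exact call isn't shown in the notes, but a fit/lwr/upr matrix like the output below comes from asking predict() for a confidence interval, e.g.:

Ey.hat <- predict(m1, interval = "confidence")  # columns: fit, lwr (lower), upr (upper)
head(Ey.hat)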
##        fit      lwr      upr
## 1 29.35980 21.43774 37.28186
## 2 28.84122 21.42928 36.25315
## 3 28.47610 21.54645 35.40575
## 4 28.26444 21.78812 34.74075
## 5 28.20624 22.15308 34.25940
## 6 28.30150 22.64000 33.96300
plot(df1$Winter, df1$N, xlab = "Year", ylab = "Population size",
     xlim = c(1940, 2025), ylim = c(0, 300), typ = "b", lwd = 1.5, pch = "*")
points(df1$Winter, Ey.hat[, 1], typ = "l", col = "red")
points(df1$Winter, Ey.hat[, 2], typ = "l", col = "red", lty = 2)
points(df1$Winter, Ey.hat[, 3], typ = "l", col = "red", lty = 2)
- Why are there so many data points that fall outside of the 95% CIs?
Prediction intervals vs. Confidence intervals
CIs for \(\widehat{\text{E}(y_0)}=\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}\)
How to interpret \(\widehat{\text{E}(y_0)}\)
What if I wanted to predict \(y_0\)?
- \(y_0\sim\text{N}(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}},\hat{\sigma}^2)\)
Expected value and variance of \(y_0\)
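For a new observation independent of the data used to fit the model,
\[\text{E}(y_0-\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}})=0,\qquad\text{Var}(y_0-\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}})=\sigma^{2}\left(1+\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0}\right),\]
which is where the extra 1 in the prediction interval comes from.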
Prediction interval: \(\text{P}\left(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}-t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^{2}(1+\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0})}<y_0<\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}+t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^{2}(1+\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0})}\right)=1-\alpha\)
The 95% PI is \(\mathbf{x}^{\prime}_0\hat{\boldsymbol{\beta}}\pm t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^{2}(1+\mathbf{x}^{\prime}_{0}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{x}_{0})}\)
Example in R
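As above, the call isn't shown, but interval = "prediction" produces the fit/lwr/upr matrix used below:

y.hat <- predict(m1, interval = "prediction")  # wider than the CI: adds sigma^2 for the new observation
head(y.hat)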
##        fit      lwr      upr
## 1 29.35980 6.630220 52.08938
## 2 28.84122 6.284368 51.39807
## 3 28.47610 6.073089 50.87910
## 4 28.26444 5.997480 50.53139
## 5 28.20624 6.058655 50.35382
## 6 28.30150 6.257741 50.34526
plot(df1$Winter, df1$N, xlab = "Year", ylab = "Population size",
     xlim = c(1940, 2025), ylim = c(0, 300), typ = "b", lwd = 1.5, pch = "*")
points(df1$Winter, y.hat[, 1], typ = "l", col = "red")
points(df1$Winter, y.hat[, 2], typ = "l", col = "red", lty = 2)
points(df1$Winter, y.hat[, 3], typ = "l", col = "red", lty = 2)
Live example
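One sketch of what the live example might look like (the year 2030 is our illustration, not from the notes): forecast a future count by passing a new value of the predictor to predict().

predict(m1, newdata = data.frame(Winter = 2030), interval = "prediction")  # point forecast with 95% PI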
20.4 t-test
- Common statistical test taught in introductory statistics classes
Linear model representation
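With treatment contrasts, the two-sample (equal variance) t-test is the linear model
\[y_i=\beta_0+\beta_1 x_i+\varepsilon_i,\qquad x_i=\text{I}(\text{group}_i=\text{trt2}),\qquad\varepsilon_i\sim\text{N}(0,\sigma^{2}),\]
so \(\beta_0=\mu_{\text{trt1}}\), \(\beta_1=\mu_{\text{trt2}}-\mu_{\text{trt1}}\), and the t-test of equal means is the test of \(\beta_1=0\).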
Example: comparing two means
df.plant <- PlantGrowth[-c(1:10), ]
df.plant$group <- factor(df.plant$group)
boxplot(weight ~ group, df.plant, col = "grey", ylim = c(0, 7))
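The group means below were presumably computed with something like:

aggregate(weight ~ group, data = df.plant, FUN = mean)  # mean weight per group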
##   group weight
## 1  trt1  4.661
## 2  trt2  5.526
Use the t.test() function in R.
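The header "Two Sample t-test" and df = 18 in the output indicate the pooled-variance version, so the call was presumably:

t.test(weight ~ group, data = df.plant, var.equal = TRUE)  # pooled two-sample t-test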
## 
##  Two Sample t-test
## 
## data:  weight by group
## t = -3.0101, df = 18, p-value = 0.007518
## alternative hypothesis: true difference in means between group trt1 and group trt2 is not equal to 0
## 95 percent confidence interval:
##  -1.4687336 -0.2612664
## sample estimates:
## mean in group trt1 mean in group trt2 
##              4.661              5.526
Use the lm() function in R.
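The Call line in the output confirms the formula; note the t value for grouptrt2 matches t.test() up to sign (grouptrt2 estimates trt2 minus trt1):

summary(lm(weight ~ group, data = df.plant))  # treatment-contrast parameterization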
## 
## Call:
## lm(formula = weight ~ group, data = df.plant)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0710 -0.3573 -0.0910  0.2402  1.3690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.6610     0.2032   22.94 8.93e-15 ***
## grouptrt2     0.8650     0.2874    3.01  0.00752 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6426 on 18 degrees of freedom
## Multiple R-squared:  0.3348, Adjusted R-squared:  0.2979 
## F-statistic: 9.061 on 1 and 18 DF,  p-value: 0.007518
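Dropping the intercept refits the same model in the means parameterization, so each coefficient is a group mean (the Call line below confirms the formula):

summary(lm(weight ~ group - 1, data = df.plant))  # one coefficient per group mean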
## 
## Call:
## lm(formula = weight ~ group - 1, data = df.plant)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0710 -0.3573 -0.0910  0.2402  1.3690 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## grouptrt1   4.6610     0.2032   22.94 8.93e-15 ***
## grouptrt2   5.5260     0.2032   27.20 4.52e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6426 on 18 degrees of freedom
## Multiple R-squared:  0.986, Adjusted R-squared:  0.9844 
## F-statistic: 632.9 on 2 and 18 DF,  p-value: < 2.2e-16
- Live example