28 Day 27 (July 18)
28.1 Announcements
- Final presentations
- Send me an email (thefley@ksu.edu) to schedule a 20-minute time slot for your final presentation. In your email, give me three dates/times during the week of July 29 - July 31 that work for you.
- Selected questions from journals
- “Since age and sex directly influence body weight, these are not independent factors. This got me thinking about my own project, and (multi)collinearity. Including these factors are important to a certain extent, but doesn’t it also make it difficult to interpret which predictor matters? Or make estimates less reliable?”
- “I’m still working on fully grasping the concept of prediction interval coverage and how it’s calculated using observed versus predicted values. While I understand that we’re checking whether observed values fall within the upper and lower prediction bounds, I’d like to better understand what it means when the coverage is too low or too high, and how to adjust models accordingly.”
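As a companion to the second question, here is a minimal sketch of how prediction interval coverage can be computed. The data are simulated and every name in it (`df.sim`, `holdout`, `m.cov`, `pi95`) is hypothetical; it illustrates the idea and is not code from the course.

```r
# Simulated data (hypothetical values, chosen only for illustration)
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, mean = 0, sd = 3)
df.sim <- data.frame(x = x, y = y)

# Fit on the first half, check coverage on the held-out second half
train <- df.sim[1:100, ]
holdout <- df.sim[101:200, ]
m.cov <- lm(y ~ x, data = train)

# 95% prediction intervals for the held-out observations
pi95 <- predict(m.cov, newdata = holdout, interval = "prediction", level = 0.95)

# Coverage: proportion of observed values that fall inside their interval
covered <- holdout$y >= pi95[, "lwr"] & holdout$y <= pi95[, "upr"]
coverage <- mean(covered)
coverage
```

If the model is reasonable, `coverage` should be close to the nominal 0.95. Coverage well below 0.95 suggests the intervals are too narrow (for example, the error variance is underestimated or the mean structure is wrong). Coverage well above 0.95 suggests the intervals are wider than they need to be.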
28.2 ANOVA/F-test
- Live example
- ANCOVA
28.3 Model checking
- Given a statistical model, estimation, prediction, and statistical inference are somewhat “automatic”
- If the statistical model is misspecified (i.e., wrong) in any way, the resulting statistical inference (including predictions and prediction uncertainty) rests on a house of cards.
- George Box quote: “All models are wrong but some are useful.”
- Box (1976) “Since all models are wrong the scientist cannot obtain a correct one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.”
- We have assumed the linear model \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\), which allowed us to:
- Estimate \(\boldsymbol{\beta}\) and \(\sigma^2\)
- Make statistical inference about \(\boldsymbol{\beta}\) using \(\hat{\boldsymbol{\beta}}\)
- Make predictions and obtain prediction intervals for future values of \(\mathbf{y}\)
- All statistical inference we obtained requires that the linear model \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) gave rise to the data.
- Support
- Linear
- Constant variance
- Independence
- Outliers
- Model diagnostics (Ch 6 in Faraway (2014)) is a set of tools and procedures to see if the assumptions of our model are approximately correct.
- Statistical tests (e.g., Shapiro-Wilk test for normality)
- Specific
- What if you reject the null?
- Graphical
- Broad
- Subjective
- Widely used
- Predictive model checks
- More common for Bayesian models (e.g., posterior predictive checks)
- We will explore numerous ways to check
- Distributional assumptions
- Normality
- Constant variance
- Correlation among errors
- Detection of outliers
- Deterministic model structure
- Is \(\mathbf{X}\boldsymbol{\beta}\) a reasonable assumption?
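To make the “automatic” part concrete, the three tasks listed earlier (estimating \(\boldsymbol{\beta}\) and \(\sigma^2\), inference about \(\boldsymbol{\beta}\), and prediction intervals) can be sketched in a few lines of R. The built-in `cars` data set is used only as a stand-in, and the object names (`m.cars`, `new.speed`) are hypothetical.

```r
# Built-in 'cars' data used only as a stand-in example (an assumption,
# not a data set from these notes)
m.cars <- lm(dist ~ speed, data = cars)

# Estimates of beta and sigma^2
beta.hat <- coef(m.cars)
sigma2.hat <- summary(m.cars)$sigma^2

# Inference about beta: 95% confidence intervals
ci.beta <- confint(m.cars, level = 0.95)
ci.beta

# Prediction interval for a future observation at speed = 21
new.speed <- data.frame(speed = 21)
pred <- predict(m.cars, newdata = new.speed, interval = "prediction", level = 0.95)
pred
```

Every number this produces is only as trustworthy as the assumed model, which is why the checks in the following sections matter.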
28.4 Distributional assumptions
Why did we assume \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\)?
Is the assumption \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) ever correct? Is there a “true” model?
When would we expect the assumption \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) to be approximately correct?
- Human body weights
- Stock prices
- Temperature
- Proportion of votes for a candidate in an election
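One way to think about these examples is through the support of the response. A normal distribution puts probability on every real number, while a proportion must lie between 0 and 1. The sketch below uses made-up values (a mean of 0.9 and a standard deviation of 0.08, both hypothetical) to show how much probability a normal model can place on impossible values.

```r
# Hypothetical vote-share example: mean 0.9, standard deviation 0.08
mu <- 0.9
sigma <- 0.08

# Probability the normal model assigns to impossible proportions (< 0 or > 1)
p.impossible <- pnorm(0, mean = mu, sd = sigma) +
  pnorm(1, mean = mu, sd = sigma, lower.tail = FALSE)
p.impossible
```

With these values the normal model places roughly 0.11 of its probability outside the interval from 0 to 1. When the mean is far from the boundary relative to the standard deviation, that probability becomes negligible and the normal assumption may be a reasonable approximation.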
Checking distributional assumptions
- If \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\), then \(\mathbf{y} - \mathbf{X\boldsymbol{\beta}}\sim ?\)
If the assumption \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) is approximately correct, then what should \(\hat{\boldsymbol{\varepsilon}}\) look like?
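For reference, a sketch of the standard results behind the two questions above. The hat matrix \(\mathbf{H}=\mathbf{X}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}\) is notation assumed here; it is not introduced in this section.

\[\boldsymbol{\varepsilon}=\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\sim\text{N}(\mathbf{0},\sigma^{2}\mathbf{I}),\qquad\hat{\boldsymbol{\varepsilon}}=\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}=(\mathbf{I}-\mathbf{H})\mathbf{y}\sim\text{N}(\mathbf{0},\sigma^{2}(\mathbf{I}-\mathbf{H}))\]

So if the assumed model is approximately correct, the residuals \(\hat{\boldsymbol{\varepsilon}}\) should look roughly like draws from a normal distribution centered at zero with similar spread everywhere, although they are not exactly independent and do not have exactly equal variances.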
Example: checking the assumption that \(\boldsymbol{\varepsilon}\sim\text{N}(\mathbf{0},\sigma^{2}\mathbf{I})\)
- Data
```r
# Annual counts, 1965-2011
y <- c(63, 68, 61, 44, 103, 90, 107, 105, 76, 46, 60, 66, 58, 39, 64, 29,
       37, 27, 38, 14, 38, 52, 84, 112, 112, 97, 131, 168, 70, 91, 52, 33,
       33, 27, 18, 14, 5, 22, 31, 23, 14, 18, 23, 27, 44, 18, 19)
year <- 1965:2011
df <- data.frame(y = y, year = year)

# Scatterplot of the data with the fitted regression line
plot(x = df$year, y = df$y, xlab = "Year", ylab = "Annual count", main = "",
     col = "brown", pch = 20)
m1 <- lm(y ~ year, data = df)
abline(m1)
```
- Histogram of \(\hat{\boldsymbol{\varepsilon}}\)
```r
m1 <- lm(y ~ year, data = df)
e.hat <- residuals(m1)
hist(e.hat, col = "grey", breaks = 15, main = "", xlab = expression(hat(epsilon)))
```
- Plot covariate vs. \(\hat{\boldsymbol{\varepsilon}}\)
- A formal hypothesis test (see pg. 81 in Faraway (2014))
```
## 
##  Shapiro-Wilk normality test
## 
## data:  e.hat
## W = 0.86281, p-value = 5.709e-05
```
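The residual plot and the formal test in the list above appear without their code. A minimal sketch that should reproduce them is given here, assuming the residuals are those of `m1` and that the test was run with `shapiro.test()` (the original calls are not shown, and the object name `sw` is hypothetical).

```r
# Annual counts from the example above
y <- c(63, 68, 61, 44, 103, 90, 107, 105, 76, 46, 60, 66, 58, 39, 64, 29,
       37, 27, 38, 14, 38, 52, 84, 112, 112, 97, 131, 168, 70, 91, 52, 33,
       33, 27, 18, 14, 5, 22, 31, 23, 14, 18, 23, 27, 44, 18, 19)
year <- 1965:2011
df <- data.frame(y = y, year = year)
m1 <- lm(y ~ year, data = df)
e.hat <- residuals(m1)

# Residuals against the covariate: look for trends or changing spread
plot(x = df$year, y = e.hat, xlab = "Year", ylab = expression(hat(epsilon)),
     main = "", col = "brown", pch = 20)
abline(h = 0, lty = 2)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
sw <- shapiro.test(e.hat)
sw
```

A small p-value is evidence against normality of the residuals, but it does not say what is wrong with the model, which is why the plots are worth looking at alongside the test.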
Example: Checking the assumption that \(\boldsymbol{\varepsilon}\sim\text{N}\left(\mathbf{0},\sigma^{2}\mathbf{I}\right)\) (What it should look like)
- Simulated data
```r
# True parameter values used to simulate the data
beta.truth <- c(2356, -1.15)
sigma2.truth <- 33^2
n <- 47
year <- 1965:2011
X <- model.matrix(~year)

# Simulate y from the assumed linear model
set.seed(2930)
y <- rnorm(n, X %*% beta.truth, sigma2.truth^0.5)
df1 <- data.frame(y = y, year = year)
plot(x = df1$year, y = df1$y, xlab = "Year", ylab = "Annual count", main = "",
     col = "brown", pch = 20)
```
```
## 
## Call:
## lm(formula = y ~ year, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.757 -22.237   3.767  19.353  66.634 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 1717.2121   638.5293   2.689   0.0100 *
## year          -0.8272     0.3212  -2.575   0.0134 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.87 on 45 degrees of freedom
## Multiple R-squared:  0.1285, Adjusted R-squared:  0.1091 
## F-statistic: 6.632 on 1 and 45 DF,  p-value: 0.01337
```
- Histogram of \(\hat{\boldsymbol{\varepsilon}}\)
- Plot covariate vs. \(\hat{\boldsymbol{\varepsilon}}\)
- A formal hypothesis test (see pg. 81 in Faraway (2014))
```
## 
##  Shapiro-Wilk normality test
## 
## data:  e.hat
## W = 0.98556, p-value = 0.8228
```
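The fit, plots, and test for the simulated data are likewise shown without code. The sketch below repeats the simulation and then runs the same three checks. The model name `m2` and the object name `sw2` are hypothetical, and the calls are assumed to mirror those used for the real data.

```r
# Simulate data from the assumed linear model (same settings as above)
beta.truth <- c(2356, -1.15)
sigma2.truth <- 33^2
n <- 47
year <- 1965:2011
X <- model.matrix(~year)
set.seed(2930)
y <- rnorm(n, X %*% beta.truth, sigma2.truth^0.5)
df1 <- data.frame(y = y, year = year)

# Fit the model and extract residuals
m2 <- lm(y ~ year, data = df1)
summary(m2)
e.hat <- residuals(m2)

# Histogram of the residuals
hist(e.hat, col = "grey", breaks = 15, main = "", xlab = expression(hat(epsilon)))

# Residuals against the covariate
plot(x = df1$year, y = e.hat, xlab = "Year", ylab = expression(hat(epsilon)),
     main = "", col = "brown", pch = 20)
abline(h = 0, lty = 2)

# Shapiro-Wilk test on the residuals
sw2 <- shapiro.test(e.hat)
sw2
```

Because these data were generated from the assumed model, the histogram should look roughly symmetric and bell-shaped, and the residual plot should show no obvious pattern. Comparing these pictures with the ones from the real data gives a sense of what a well-behaved set of residuals looks like at this sample size.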