26 Day 25 (July 16)

26.1 Announcements

  • Workday on Thursday

  • See due dates for final project

  • Peer review

  • Final presentations

    • Send me an email () to schedule a 20-minute time slot for your final presentation. In your email, give me three dates/times during the week of July 29–31 that work for you.
  • We are now covering Ch. 6 in Linear Models with R (Faraway 2014)

  • Selected questions from journals

    • “Something that stood out today was the idea of noisy data, when residual variation is higher than expected or not explained by the model. I wrote down that if the residual error is larger than observable variation, then we are likely dealing with noisy data. I’m wondering: if a dataset is too noisy, is it still publishable? Or is the presence of high residual error a sign that the experimental design or variable selection needs revision? Also, what strategies do we have to manage noisy data in a way that preserves validity?”
    • “Since the anova test doesn’t give information on the size of the difference and the sign and even where the difference is, if there are more than two treatments are there other ways to dive further into the differences in the data. What do you think about conducting another test such as Tukey’s test to test differences at different points and how large? This is something we do in my lab and I was wondering if there are any other statistical tests to accomplish the same thing.”
    • “Right now I am mostly confused about how to apply all this information to our models and what our end goal is to get to. I know we want a distribution and calibrate our model, but what do these test do to help us understand our model and what information can we read from the results to help us.”
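  • On the post-hoc comparison question: after a significant ANOVA F-test, Tukey’s honestly significant difference (HSD) test is one standard way to estimate which pairs of treatments differ, in which direction, and by how much, while adjusting for multiple comparisons. A minimal sketch in base R, using the built-in warpbreaks data purely for illustration (not our course data):
    # One-way ANOVA followed by Tukey's HSD; the output gives each
    # pairwise difference, an adjusted confidence interval, and a p-value
    m.aov <- aov(breaks ~ tension, data = warpbreaks)
    TukeyHSD(m.aov)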

26.2 ANOVA/F-test

26.3 Model checking

  • Given a statistical model, estimation, prediction, and statistical inference are somewhat “automatic”
  • If the statistical model is misspecified (i.e., wrong) in any way, the resulting statistical inference (including predictions and prediction uncertainty) rests on a house of cards.
  • George Box quote: “All models are wrong but some are useful.”
    • Box (1976) “Since all models are wrong the scientist cannot obtain a correct one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.”
  • We have assumed the linear model \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\), which allowed us to:
    • Estimate \(\boldsymbol{\beta}\) and \(\sigma^2\)
    • Make statistical inference about \(\hat{\boldsymbol{\beta}}\)
    • Make predictions and obtain prediction intervals for future values of \(\mathbf{y}\)
  • All statistical inference we obtained requires that the linear model \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) gave rise to the data.
    • Support
    • Linear
    • Constant variance
    • Independence
    • Outliers
  • Model diagnostics (Ch 6 in Faraway (2014)) is a set of tools and procedures to see if the assumptions of our model are approximately correct.
    • Statistical tests (e.g., Shapiro-Wilk test for normality)
      • Specific
      • What if you reject the null?
    • Graphical
      • Broad
      • Subjective
      • Widely used
    • Predictive model checks
      • More common for Bayesian models (e.g., posterior predictive checks); a sketch of the idea for our linear model follows this list
  • We will explore numerous ways to check
    • Distributional assumptions
      • Normality
      • Constant variance
      • Correlation among errors
    • Detection of outliers
    • Deterministic model structure
      • Is \(\mathbf{X}\boldsymbol{\beta}\) a reasonable assumption?
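  • A minimal sketch of a predictive model check for a linear model, using R’s built-in cars data purely for illustration (this shows the general idea, not an example from Faraway (2014)): simulate replicate response vectors from the fitted model and ask whether a summary of the observed data looks typical of the same summary computed from the simulated data.
    # Fit a simple linear model (illustration only)
    m <- lm(dist ~ speed, data = cars)
    # simulate() draws replicate responses from the fitted model,
    # i.e., y.rep ~ N(X beta.hat, sigma.hat^2 I)
    y.rep <- simulate(m, nsim = 1000)
    # Predictive distribution of the sample standard deviation
    sd.rep <- apply(y.rep, 2, sd)
    hist(sd.rep, col = "grey", main = "", xlab = "SD of replicate data")
    # Observed value; a value far in the tails suggests misspecification
    abline(v = sd(cars$dist), col = "red", lwd = 2)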

26.4 Distributional assumptions

  • Why did we assume \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\)?

  • Is the assumption \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) ever correct? Is there a “true” model?

  • When would we expect the assumption \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) to be approximately correct?

    • Human body weights
    • Stock prices
    • Temperature
    • Proportion of votes for a candidate in an election
  • Checking distributional assumptions

    • If \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\), then \(\mathbf{y} - \mathbf{X\boldsymbol{\beta}}\sim ?\)
  • If the assumption \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) is approximately correct, then what should \(\hat{\boldsymbol{\varepsilon}}\) look like?
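    • A partial answer, stated here as a standard fact rather than something we have derived in class: if the model is correct, then \(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}=\boldsymbol{\varepsilon}\sim\text{N}(\mathbf{0},\sigma^{2}\mathbf{I})\), while the estimated residuals satisfy \(\hat{\boldsymbol{\varepsilon}}=(\mathbf{I}-\mathbf{H})\mathbf{y}\sim\text{N}(\mathbf{0},\sigma^{2}(\mathbf{I}-\mathbf{H}))\), where \(\mathbf{H}=\mathbf{X}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}\) is the hat matrix. As long as no observation has large leverage, \(\hat{\boldsymbol{\varepsilon}}\) should look approximately like a draw from \(\text{N}(\mathbf{0},\sigma^{2}\mathbf{I})\).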

  • Example: checking the assumption that \(\boldsymbol{\varepsilon}\sim\text{N}(\mathbf{0},\sigma^{2}\mathbf{I})\)

    • Data
    y <- c(63, 68, 61, 44, 103, 90, 107, 105, 76, 46, 60, 66, 58, 39, 64, 29, 37, 27,
        38, 14, 38, 52, 84, 112, 112, 97, 131, 168, 70, 91, 52, 33, 33, 27, 18, 14, 5,
        22, 31, 23, 14, 18, 23, 27, 44, 18, 19)
    year <- 1965:2011
    df <- data.frame(y = y, year = year)
    
    # Scatter plot of the annual counts with the fitted regression line
    plot(x = df$year, y = df$y, xlab = "Year", ylab = "Annual count", main = "", col = "brown",
        pch = 20)
    m1 <- lm(y ~ year, data = df)
    abline(m1)

    • Histogram of \(\hat{\boldsymbol{\varepsilon}}\)
    e.hat <- residuals(m1)  # estimated residuals from the fitted model
    hist(e.hat, col = "grey", breaks = 15, main = "", xlab = expression(hat(epsilon)))

    • Plot covariate vs. \(\hat{\boldsymbol{\varepsilon}}\)
    plot(year, e.hat, xlab = "Year", ylab = expression(hat(epsilon)), col = "darkgreen")

    • A formal hypothesis test (see pg. 81 in Faraway (2014))
    shapiro.test(e.hat)
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  e.hat
    ## W = 0.86281, p-value = 5.709e-05
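    • Another widely used graphical check (Faraway (2014), Ch. 6): a normal quantile-quantile (Q-Q) plot of \(\hat{\boldsymbol{\varepsilon}}\); strong curvature away from the reference line suggests non-normal residuals
    # Sample quantiles of the residuals vs. theoretical normal quantiles
    qqnorm(e.hat, main = "")
    qqline(e.hat)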
  • Example: checking the assumption that \(\boldsymbol{\varepsilon}\sim\text{N}(\mathbf{0},\sigma^{2}\mathbf{I})\) (what \(\hat{\boldsymbol{\varepsilon}}\) should look like when the model is correct)

    • Simulated data
    # True parameter values used to simulate data from the assumed model
    beta.truth <- c(2356, -1.15)
    sigma2.truth <- 33^2
    n <- 47

    year <- 1965:2011
    X <- model.matrix(~year)
    set.seed(2930)
    # Simulate y ~ N(X beta, sigma^2 I)
    y <- rnorm(n, X %*% beta.truth, sigma2.truth^0.5)
    df1 <- data.frame(y = y, year = year)

    plot(x = df1$year, y = df1$y, xlab = "Year", ylab = "Annual count", main = "", col = "brown",
        pch = 20)

    m2 <- lm(y ~ year, df1)
    e.hat <- residuals(m2)
    
    summary(m2)
    ## 
    ## Call:
    ## lm(formula = y ~ year, data = df1)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -76.757 -22.237   3.767  19.353  66.634 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)  
    ## (Intercept) 1717.2121   638.5293   2.689   0.0100 *
    ## year          -0.8272     0.3212  -2.575   0.0134 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 29.87 on 45 degrees of freedom
    ## Multiple R-squared:  0.1285, Adjusted R-squared:  0.1091 
    ## F-statistic: 6.632 on 1 and 45 DF,  p-value: 0.01337
    • Histogram of \(\hat{\boldsymbol{\varepsilon}}\)
    hist(e.hat, col = "grey", breaks = 15, main = "", xlab = expression(hat(epsilon)))

    • Plot covariate vs. \(\hat{\boldsymbol{\varepsilon}}\)
    plot(year, e.hat, xlab = "Year", ylab = expression(hat(epsilon)))

    • A formal hypothesis test (see pg. 81 in Faraway (2014))
    shapiro.test(e.hat)
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  e.hat
    ## W = 0.98556, p-value = 0.8228
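    • For comparison, the same Q-Q plot for the simulated data, where (up to sampling noise) the points should track the reference line
    e.hat <- residuals(m2)
    qqnorm(e.hat, main = "")
    qqline(e.hat)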

Literature cited

Box, George E. P. 1976. “Science and Statistics.” Journal of the American Statistical Association 71 (356): 791–99.
Faraway, J. J. 2014. Linear Models with R. CRC Press.