21 Day 20 (July 9)
21.1 Announcements
- Workday on Thursday
- Ch. 14-16 in Linear Models with R
21.2 Selected questions from journals
- Calibration and reproducibility
- “What would you do to improve this model? If only 2 out of the 68(?) points are outside the interval, that would mean 97% of the data fits within this model. Would we want to improve it to 100%? Wouldn’t that be overfitting?”
- Prediction accuracy example
21.3 T-test
- Common statistical test taught in introductory statistics classes
- Linear model representation (the two-sample t-test can be written as a linear model; see the sketch below)
- Example: comparing two means
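As a sketch of the linear model representation (with trt1 as the reference level, matching the lm() output below), the two-group comparison can be written as \[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim \text{N}(0,\sigma^2),\] where \(x_i = 1\) if observation \(i\) is in group trt2 and \(x_i = 0\) if it is in group trt1. Testing whether the two group means differ is the same as testing \(\text{H}_{0}: \beta_1 = 0\), which is the hypothesis evaluated by both t.test() and lm() below (the t statistics agree up to sign).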
library(faraway)
# Drop the first 10 rows (the ctrl group) so only trt1 and trt2 remain
df.plant <- PlantGrowth[-c(1:10), ]
# Re-create the factor so the unused ctrl level is dropped
df.plant$group <- factor(df.plant$group)
boxplot(weight ~ group, df.plant, col = "grey", ylim = c(0, 7))
##   group weight
## 1  trt1  4.661
## 2  trt2  5.526
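The group means above can be obtained, for example, with aggregate(); the original call is not shown, so this is a sketch:

# One way to reproduce the table of group means above
aggregate(weight ~ group, data = df.plant, FUN = mean)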
Use the t.test() function in R.

## 
##  Two Sample t-test
## 
## data:  weight by group
## t = -3.0101, df = 18, p-value = 0.007518
## alternative hypothesis: true difference in means between group trt1 and group trt2 is not equal to 0
## 95 percent confidence interval:
##  -1.4687336 -0.2612664
## sample estimates:
## mean in group trt1 mean in group trt2 
##              4.661              5.526
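A call like the following reproduces the output above; var.equal = TRUE is assumed because the output reports a pooled two-sample t-test with 18 degrees of freedom:

# Pooled two-sample t-test of mean weight in trt1 vs. trt2
t.test(weight ~ group, data = df.plant, var.equal = TRUE)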
Use the lm() function in R.

## 
## Call:
## lm(formula = weight ~ group, data = df.plant)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0710 -0.3573 -0.0910  0.2402  1.3690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.6610     0.2032   22.94 8.93e-15 ***
## grouptrt2     0.8650     0.2874    3.01  0.00752 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6426 on 18 degrees of freedom
## Multiple R-squared:  0.3348, Adjusted R-squared:  0.2979 
## F-statistic: 9.061 on 1 and 18 DF,  p-value: 0.007518
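The Call: lines show where these fits come from; as a sketch, the two summaries can be reproduced with the calls below. In the first fit the intercept is the trt1 mean and grouptrt2 (0.865) is the difference between the group means; the second fit, shown after the code, drops the intercept so that each coefficient is simply a group mean.

# Default (treatment) coding: intercept = trt1 mean, grouptrt2 = difference in means
summary(lm(weight ~ group, data = df.plant))

# No-intercept (cell means) coding: one coefficient per group mean
summary(lm(weight ~ group - 1, data = df.plant))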
## 
## Call:
## lm(formula = weight ~ group - 1, data = df.plant)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0710 -0.3573 -0.0910  0.2402  1.3690 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## grouptrt1   4.6610     0.2032   22.94 8.93e-15 ***
## grouptrt2   5.5260     0.2032   27.20 4.52e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6426 on 18 degrees of freedom
## Multiple R-squared:  0.986,  Adjusted R-squared:  0.9844 
## F-statistic: 632.9 on 2 and 18 DF,  p-value: < 2.2e-16
- Live example
21.4 ANOVA/F-test
F-test (pgs. 33 - 39 in Faraway (2014))
- The t-test is useful for testing simple hypotheses
- The F-test is useful for testing more complicated hypotheses
- Example: comparing more than two means
##   group weight
## 1  ctrl  5.032
## 2  trt1  4.661
## 3  trt2  5.526
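For the full PlantGrowth data (all three groups), the F-test below is based on the linear model, sketched here with ctrl as the reference level, \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i,\] where \(x_{i1} = 1\) if observation \(i\) received treatment 1 (and 0 otherwise) and \(x_{i2} = 1\) if observation \(i\) received treatment 2. With this coding, \(\beta_0\) is the ctrl mean, and \(\beta_1\) and \(\beta_2\) are the differences between the treatment means and the ctrl mean; these are the \(\beta_1\) and \(\beta_2\) that appear in the hypotheses below.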
To determine whether treatment 1 or treatment 2 had an effect, we might propose the following null hypothesis \[\text{H}_{0}:\beta_1 = \beta_2 = 0\] and alternative hypothesis \[\text{H}_{\text{a}}:\beta_1 \neq 0 \;\text{or}\; \beta_2 \neq 0.\] To test the null hypothesis, we calculate the probability of observing values of \(\hat{\beta}_1\) and \(\hat{\beta}_2\) as or more extreme than the ones we obtained, assuming the null hypothesis is true. To calculate this probability, note that under \(\text{H}_{0}\) \[\frac{(TSS - RSS)/(p-1)}{RSS/(n-p)} \sim\text{F}(p-1,n-p),\] where \(TSS = (\mathbf{y} - \bar{\mathbf{y}})^{\prime}(\mathbf{y} - \bar{\mathbf{y}})\) and \(RSS = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\prime}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\). The p-value is the probability that an \(\text{F}(p-1,n-p)\) random variable exceeds the observed statistic, \[\text{P}\bigg(\text{F}(p-1,n-p) > \frac{(TSS - RSS)/(p-1)}{RSS/(n-p)}\bigg),\] which can be calculated as the area under the upper tail of the F-distribution's density curve. The code below shows how to do these calculations in R.
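The summary and ANOVA table that follow come from fitting the full three-group model; a sketch of those calls is shown here (the object is named m1 because the 'by hand' code further down uses coef(m1) and model.matrix(m1)):

# Fit the three-group model to the full PlantGrowth data
m1 <- lm(weight ~ group, data = PlantGrowth)
summary(m1)   # coefficient-level t-tests
anova(m1)     # F-test of H0: beta_1 = beta_2 = 0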
## 
## Call:
## lm(formula = weight ~ group, data = PlantGrowth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0710 -0.4180 -0.0060  0.2627  1.3690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0320     0.1971  25.527   <2e-16 ***
## grouptrt1    -0.3710     0.2788  -1.331   0.1944    
## grouptrt2     0.4940     0.2788   1.772   0.0877 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6234 on 27 degrees of freedom
## Multiple R-squared:  0.2641, Adjusted R-squared:  0.2096 
## F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591
## Analysis of Variance Table
## 
## Response: weight
##           Df  Sum Sq Mean Sq F value  Pr(>F)  
## group      2  3.7663  1.8832  4.8461 0.01591 *
## Residuals 27 10.4921  0.3886                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 'By hand'
beta.hat <- coef(m1)         # estimated coefficients
X <- model.matrix(m1)        # design matrix
y <- PlantGrowth$weight
y.bar <- mean(y)
n <- 30                      # number of observations
p <- 3                       # number of coefficients
RSS <- t(y - X %*% beta.hat) %*% (y - X %*% beta.hat)   # residual sum of squares
RSS
##          [,1]
## [1,] 10.49209
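The code for the remaining outputs is not shown; the lines below are a sketch that reproduces them, continuing with the objects defined above (the name F.stat is introduced here just for illustration):

TSS <- t(y - y.bar) %*% (y - y.bar)                       # total sum of squares
TSS
TSS - RSS                                                 # sum of squares explained by group
F.stat <- ((TSS - RSS) / (p - 1)) / (RSS / (n - p))       # observed F statistic
F.stat
pf(F.stat, df1 = p - 1, df2 = n - p, lower.tail = FALSE)  # p-value (upper-tail area)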
##          [,1]
## [1,] 14.25843

##         [,1]
## [1,] 3.76634

##          [,1]
## [1,] 4.846088

##            [,1]
## [1,] 0.01590996
- Live example