21 Day 20 (July 9)
21.1 Announcements
- Workday on Thursday
- Ch. 14-16 in Linear Models with R
21.2 Selected questions from journals
- Calibration and reproducibility
- “What would you do to improve this model? If only 2 out of the 68(?) points are outside the interval, that would mean 97% of the data fits within this model. Would we want to improve it to 100%? Wouldn’t that be overfitting?”
- Prediction accuracy example
21.3 T-test
- Common statistical test taught in introductory statistics classes
- Linear model representation (the two-sample t-test can be written as a linear model; see the sketch below)
- Example: comparing two means
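As a sketch of the linear model representation (with trt1 as the reference level, matching the lm() output below), the two-group comparison can be written as \[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim \text{N}(0,\sigma^2),\] where \(x_i = 1\) if observation \(i\) is in group trt2 and \(x_i = 0\) if it is in group trt1. Testing whether the two group means differ is the same as testing \(\text{H}_{0}: \beta_1 = 0\), which is the hypothesis evaluated by both t.test() and lm() below (the t statistics agree up to sign).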
library(faraway)
# Drop the first 10 rows (the ctrl group) so only trt1 and trt2 remain
df.plant <- PlantGrowth[-c(1:10), ]
# Re-create the factor so the unused ctrl level is dropped
df.plant$group <- factor(df.plant$group)
boxplot(weight ~ group, df.plant, col = "grey", ylim = c(0, 7))
##   group weight
## 1  trt1  4.661
## 2  trt2  5.526
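The group means above can be obtained, for example, with aggregate(); the original call is not shown, so this is a sketch:

# One way to reproduce the table of group means above
aggregate(weight ~ group, data = df.plant, FUN = mean)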
Use the t.test() function in R.

## 
##  Two Sample t-test
## 
## data:  weight by group
## t = -3.0101, df = 18, p-value = 0.007518
## alternative hypothesis: true difference in means between group trt1 and group trt2 is not equal to 0
## 95 percent confidence interval:
##  -1.4687336 -0.2612664
## sample estimates:
## mean in group trt1 mean in group trt2 
##              4.661              5.526
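A call like the following reproduces the output above; var.equal = TRUE is assumed because the output reports a pooled two-sample t-test with 18 degrees of freedom:

# Pooled two-sample t-test of mean weight in trt1 vs. trt2
t.test(weight ~ group, data = df.plant, var.equal = TRUE)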
Use the lm() function in R.

## 
## Call:
## lm(formula = weight ~ group, data = df.plant)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0710 -0.3573 -0.0910  0.2402  1.3690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.6610     0.2032   22.94 8.93e-15 ***
## grouptrt2     0.8650     0.2874    3.01  0.00752 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6426 on 18 degrees of freedom
## Multiple R-squared:  0.3348, Adjusted R-squared:  0.2979 
## F-statistic: 9.061 on 1 and 18 DF,  p-value: 0.007518
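The Call: lines show where these fits come from; as a sketch, the two summaries can be reproduced with the calls below. In the first fit the intercept is the trt1 mean and grouptrt2 (0.865) is the difference between the group means; the second fit, shown after the code, drops the intercept so that each coefficient is simply a group mean.

# Default (treatment) coding: intercept = trt1 mean, grouptrt2 = difference in means
summary(lm(weight ~ group, data = df.plant))

# No-intercept (cell means) coding: one coefficient per group mean
summary(lm(weight ~ group - 1, data = df.plant))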
## 
## Call:
## lm(formula = weight ~ group - 1, data = df.plant)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0710 -0.3573 -0.0910  0.2402  1.3690 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## grouptrt1   4.6610     0.2032   22.94 8.93e-15 ***
## grouptrt2   5.5260     0.2032   27.20 4.52e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6426 on 18 degrees of freedom
## Multiple R-squared:  0.986,  Adjusted R-squared:  0.9844 
## F-statistic: 632.9 on 2 and 18 DF,  p-value: < 2.2e-16
- Live example
21.4 ANOVA/F-test
F-test (pgs. 33 - 39 in Faraway (2014))
- The t-test is useful for testing simple hypotheses
- The F-test is useful for testing more complicated hypotheses
- Example: comparing more than two means
##   group weight
## 1  ctrl  5.032
## 2  trt1  4.661
## 3  trt2  5.526
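For the full PlantGrowth data (all three groups), the F-test below is based on the linear model, sketched here with ctrl as the reference level, \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i,\] where \(x_{i1} = 1\) if observation \(i\) received treatment 1 (and 0 otherwise) and \(x_{i2} = 1\) if observation \(i\) received treatment 2. With this coding, \(\beta_0\) is the ctrl mean, and \(\beta_1\) and \(\beta_2\) are the differences between the treatment means and the ctrl mean; these are the \(\beta_1\) and \(\beta_2\) that appear in the hypotheses below.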
To determine whether treatment 1 or treatment 2 had an effect, we might propose the following null hypothesis \[\text{H}_{0}:\beta_1 = \beta_2 = 0\] and alternative hypothesis \[\text{H}_{\text{a}}:\beta_1 \neq 0 \;\text{or}\; \beta_2 \neq 0.\] To test the null hypothesis, we calculate the probability of observing values of \(\hat{\beta}_1\) and \(\hat{\beta}_2\) as or more extreme than the ones we obtained, assuming the null hypothesis is true. To calculate this probability, note that under \(\text{H}_{0}\) \[\frac{(TSS - RSS)/(p-1)}{RSS/(n-p)} \sim\text{F}(p-1,n-p),\] where \(TSS = (\mathbf{y} - \bar{\mathbf{y}})^{\prime}(\mathbf{y} - \bar{\mathbf{y}})\) and \(RSS = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\prime}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\). The p-value is the probability that an \(\text{F}(p-1,n-p)\) random variable exceeds the observed statistic, \[\text{P}\bigg(\text{F}(p-1,n-p) > \frac{(TSS - RSS)/(p-1)}{RSS/(n-p)}\bigg),\] which can be calculated as the area under the upper tail of the F-distribution's density curve. The code below shows how to do these calculations in R.
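The summary and ANOVA table that follow come from fitting the full three-group model; a sketch of those calls is shown here (the object is named m1 because the 'by hand' code further down uses coef(m1) and model.matrix(m1)):

# Fit the three-group model to the full PlantGrowth data
m1 <- lm(weight ~ group, data = PlantGrowth)
summary(m1)   # coefficient-level t-tests
anova(m1)     # F-test of H0: beta_1 = beta_2 = 0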
## 
## Call:
## lm(formula = weight ~ group, data = PlantGrowth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0710 -0.4180 -0.0060  0.2627  1.3690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0320     0.1971  25.527   <2e-16 ***
## grouptrt1    -0.3710     0.2788  -1.331   0.1944    
## grouptrt2     0.4940     0.2788   1.772   0.0877 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6234 on 27 degrees of freedom
## Multiple R-squared:  0.2641, Adjusted R-squared:  0.2096 
## F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591
## Analysis of Variance Table
## 
## Response: weight
##           Df  Sum Sq Mean Sq F value  Pr(>F)  
## group      2  3.7663  1.8832  4.8461 0.01591 *
## Residuals 27 10.4921  0.3886                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 'By hand'
beta.hat <- coef(m1)         # estimated coefficients
X <- model.matrix(m1)        # design matrix
y <- PlantGrowth$weight
y.bar <- mean(y)
n <- 30                      # number of observations
p <- 3                       # number of coefficients
RSS <- t(y - X %*% beta.hat) %*% (y - X %*% beta.hat)   # residual sum of squares
RSS
##          [,1]
## [1,] 10.49209
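The code for the remaining outputs is not shown; the lines below are a sketch that reproduces them, continuing with the objects defined above (the name F.stat is introduced here just for illustration):

TSS <- t(y - y.bar) %*% (y - y.bar)                       # total sum of squares
TSS
TSS - RSS                                                 # sum of squares explained by group
F.stat <- ((TSS - RSS) / (p - 1)) / (RSS / (n - p))       # observed F statistic
F.stat
pf(F.stat, df1 = p - 1, df2 = n - p, lower.tail = FALSE)  # p-value (upper-tail area)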
##          [,1]
## [1,] 14.25843

##         [,1]
## [1,] 3.76634

##          [,1]
## [1,] 4.846088

##            [,1]
## [1,] 0.01590996
- Live example