21 Day 20 (July 9)

21.1 Announcements

  • Workday on Thursday

  • Ch. 14-16 in Linear Models with R

  • Selected questions from journals

    • Calibration and reproducibility
    • “What would you do to improve this model? If only 2 out of the 68(?) points are outside the interval, that would mean 97% of the data fits within this model. Would we want to improve it to 100%? Wouldn’t that be overfitting?”
    • Prediction accuracy example

21.2 Prediction

21.3 T-test

  • Common statistical test taught in introductory statistics classes
    • Linear model representation

    • Example: comparing two means

      library(faraway)
      df.plant <- PlantGrowth[-c(1:10), ]
      df.plant$group <- factor(df.plant$group)
      
      boxplot(weight ~ group, df.plant, col = "grey", ylim = c(0, 7))

      aggregate(weight ~ group, FUN = mean, data = df.plant)
      ##   group weight
      ## 1  trt1  4.661
      ## 2  trt2  5.526
    • Use t.test() function in R.

      t.test(weight ~ group - 1, data = df.plant, var.equal = TRUE)
      ## 
      ##  Two Sample t-test
      ## 
      ## data:  weight by group
      ## t = -3.0101, df = 18, p-value = 0.007518
      ## alternative hypothesis: true difference in means between group trt1 and group trt2 is not equal to 0
      ## 95 percent confidence interval:
      ##  -1.4687336 -0.2612664
      ## sample estimates:
      ## mean in group trt1 mean in group trt2 
      ##              4.661              5.526
    • Use lm() function in R.

      # With 'intercept'
      m1 <- lm(weight ~ group, data = df.plant)
      summary(m1)
      ## 
      ## Call:
      ## lm(formula = weight ~ group, data = df.plant)
      ## 
      ## Residuals:
      ##     Min      1Q  Median      3Q     Max 
      ## -1.0710 -0.3573 -0.0910  0.2402  1.3690 
      ## 
      ## Coefficients:
      ##             Estimate Std. Error t value Pr(>|t|)    
      ## (Intercept)   4.6610     0.2032   22.94 8.93e-15 ***
      ## grouptrt2     0.8650     0.2874    3.01  0.00752 ** 
      ## ---
      ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      ## 
      ## Residual standard error: 0.6426 on 18 degrees of freedom
      ## Multiple R-squared:  0.3348, Adjusted R-squared:  0.2979 
      ## F-statistic: 9.061 on 1 and 18 DF,  p-value: 0.007518
      # Without 'intercept'
      m2 <- lm(weight ~ group - 1, data = df.plant)
      summary(m2)
      ## 
      ## Call:
      ## lm(formula = weight ~ group - 1, data = df.plant)
      ## 
      ## Residuals:
      ##     Min      1Q  Median      3Q     Max 
      ## -1.0710 -0.3573 -0.0910  0.2402  1.3690 
      ## 
      ## Coefficients:
      ##           Estimate Std. Error t value Pr(>|t|)    
      ## grouptrt1   4.6610     0.2032   22.94 8.93e-15 ***
      ## grouptrt2   5.5260     0.2032   27.20 4.52e-16 ***
      ## ---
      ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      ## 
      ## Residual standard error: 0.6426 on 18 degrees of freedom
      ## Multiple R-squared:  0.986,  Adjusted R-squared:  0.9844 
      ## F-statistic: 632.9 on 2 and 18 DF,  p-value: < 2.2e-16
  • Live example

21.4 ANOVA/F-test

  • F-test (pgs. 33 - 39 in Faraway (2014))

    • The t-test is useful for testing simple hypotheses
    • F-test is useful for testing more complicated hypotheses
    • Example: comparing more than two means
    library(faraway)
    boxplot(weight ~ group, data = PlantGrowth, col = "grey", ylim = c(0, 7))

    aggregate(weight ~ group, FUN = mean, data = PlantGrowth)
    ##   group weight
    ## 1  ctrl  5.032
    ## 2  trt1  4.661
    ## 3  trt2  5.526

    To determine if treatment 1 and 2 had an effects, we might propose the following null hypothesis \[\text{H}_{0}:\beta_1 = \beta_2 = 0\] and alternative hypothesis \[\text{H}_{\text{a}}:\beta_1 \neq 0 \;\text{or}\; \beta_2 \neq 0.\] To test the null hypothesis, we calculate the probability that we observe values of \(\hat{\beta_1}\) and \(\hat{\beta_2}\) as or more extreme than the null hypothesis. To calculate this probability, note that \[\frac{(TSS - RSS)/(p-1)}{RSS/(n-1)} \sim\text{F}(p-1,n-p).\] The probability that we observe a value of \(\hat{\beta_1}\) and \(\hat{\beta_2}\) as or more extreme than the null hypothesis can be written as \[\text{P}\bigg(\frac{(TSS - RSS)/(p-1)}{RSS/(n-1)} <F\bigg),\] where \(TSS = (\mathbf{y} - \bar{\mathbf{y}})^{\prime}(\mathbf{y} - \bar{\mathbf{y}})\) and \(RSS = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\prime}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\). The probability can be calculated as the area under the curve of a F-distribution. The code below shows how to do these calculations in R

    m1 <- lm(weight ~ group, data = PlantGrowth)
    summary(m1)
    ## 
    ## Call:
    ## lm(formula = weight ~ group, data = PlantGrowth)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -1.0710 -0.4180 -0.0060  0.2627  1.3690 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   5.0320     0.1971  25.527   <2e-16 ***
    ## grouptrt1    -0.3710     0.2788  -1.331   0.1944    
    ## grouptrt2     0.4940     0.2788   1.772   0.0877 .  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.6234 on 27 degrees of freedom
    ## Multiple R-squared:  0.2641, Adjusted R-squared:  0.2096 
    ## F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591
    anova(m1)
    ## Analysis of Variance Table
    ## 
    ## Response: weight
    ##           Df  Sum Sq Mean Sq F value  Pr(>F)  
    ## group      2  3.7663  1.8832  4.8461 0.01591 *
    ## Residuals 27 10.4921  0.3886                  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    # 'By hand'
    beta.hat <- coef(m1)
    X <- model.matrix(m1)
    y <- PlantGrowth$weight
    y.bar <- mean(y)
    n <- 30
    p <- 3
    
    
    RSS <- t(y - X %*% beta.hat) %*% (y - X %*% beta.hat)
    RSS
    ##          [,1]
    ## [1,] 10.49209
    TSS <- t(y - y.bar) %*% (y - y.bar)
    TSS
    ##          [,1]
    ## [1,] 14.25843
    TSS - RSS
    ##         [,1]
    ## [1,] 3.76634
    F <- ((TSS - RSS)/(p - 1))/(RSS/(n - p))
    F
    ##          [,1]
    ## [1,] 4.846088
    pf(F, p - 1, n - p, lower.tail = FALSE)
    ##            [,1]
    ## [1,] 0.01590996
  • Live example

Literature cited

Faraway, J. J. 2014. Linear Models with r. CRC Press.