23 Day 22 (July 11)

23.1 Announcements

  • Ch. 14-16 in Linear Models with R is what we are covering now

  • Ch. 6 is what we will be moving into

  • Selected questions from journals

    • Show cool plot and talk quickly about derived quantities
    • “I am super confused about the beta values with t-tests. How did we figure out that B0 is equal to the first treatment when the model has an intercept? And that B1 is the difference between the 2 treatments? And why is B1 the first treatment and B2 the second treatment without the intercept? How can we just get rid of the intercept?? I’m sure part of my confusion comes from being completely exhausted and at most like 10% working mental capacity this morning, but still super confused.”
    • “The question I have is tailored for my project, and if it’s worth it to incorporate another parameter: I will be using a two-sampled t-test…..Could I add pigweed density (per area) in the organic model to try and account for….”
    • “Should I always test model accuracy and calibration using data that was not used for model fitting? Could you better explain the k-fold cross-validation when we have small data sets? Can I use the bootstrapping method for model validation as well?”

23.2 T-test

  • Common statistical test taught in introductory statistics classes
    • Linear model representation

    • Example: comparing two means

      library(faraway)
      df.plant <- PlantGrowth[-c(1:10), ]       # drop the control group (rows 1-10)
      df.plant$group <- factor(df.plant$group)  # re-factor to drop the unused 'ctrl' level
      
      boxplot(weight ~ group, df.plant, col = "grey", ylim = c(0, 7))

      aggregate(weight ~ group, FUN = mean, data = df.plant)
      ##   group weight
      ## 1  trt1  4.661
      ## 2  trt2  5.526
    • Use t.test() function in R.

      t.test(weight ~ group - 1, data = df.plant, var.equal = TRUE)
      ## 
      ##  Two Sample t-test
      ## 
      ## data:  weight by group
      ## t = -3.0101, df = 18, p-value = 0.007518
      ## alternative hypothesis: true difference in means between group trt1 and group trt2 is not equal to 0
      ## 95 percent confidence interval:
      ##  -1.4687336 -0.2612664
      ## sample estimates:
      ## mean in group trt1 mean in group trt2 
      ##              4.661              5.526
    • Use the lm() function in R (see the sketch after this list for how the two sets of coefficients relate).

      # With 'intercept'
      m1 <- lm(weight ~ group, data = df.plant)
      summary(m1)
      ## 
      ## Call:
      ## lm(formula = weight ~ group, data = df.plant)
      ## 
      ## Residuals:
      ##     Min      1Q  Median      3Q     Max 
      ## -1.0710 -0.3573 -0.0910  0.2402  1.3690 
      ## 
      ## Coefficients:
      ##             Estimate Std. Error t value Pr(>|t|)    
      ## (Intercept)   4.6610     0.2032   22.94 8.93e-15 ***
      ## grouptrt2     0.8650     0.2874    3.01  0.00752 ** 
      ## ---
      ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      ## 
      ## Residual standard error: 0.6426 on 18 degrees of freedom
      ## Multiple R-squared:  0.3348, Adjusted R-squared:  0.2979 
      ## F-statistic: 9.061 on 1 and 18 DF,  p-value: 0.007518
      # Without 'intercept'
      m2 <- lm(weight ~ group - 1, data = df.plant)
      summary(m2)
      ## 
      ## Call:
      ## lm(formula = weight ~ group - 1, data = df.plant)
      ## 
      ## Residuals:
      ##     Min      1Q  Median      3Q     Max 
      ## -1.0710 -0.3573 -0.0910  0.2402  1.3690 
      ## 
      ## Coefficients:
      ##           Estimate Std. Error t value Pr(>|t|)    
      ## grouptrt1   4.6610     0.2032   22.94 8.93e-15 ***
      ## grouptrt2   5.5260     0.2032   27.20 4.52e-16 ***
      ## ---
      ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      ## 
      ## Residual standard error: 0.6426 on 18 degrees of freedom
      ## Multiple R-squared:  0.986,  Adjusted R-squared:  0.9844 
      ## F-statistic: 632.9 on 2 and 18 DF,  p-value: < 2.2e-16
  • Live example
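
  A minimal sketch connecting the two lm() parameterizations above: both models describe the same two group means, and only the interpretation of the coefficients changes. The numbers below match the summary() output shown above.

      # With the intercept (m1): '(Intercept)' is the mean of trt1 and
      # 'grouptrt2' is the difference trt2 - trt1
      coef(m1)["(Intercept)"]                          # 4.661, mean of trt1
      coef(m1)["(Intercept)"] + coef(m1)["grouptrt2"]  # 5.526, mean of trt2
      
      # Without the intercept (m2): each coefficient is a group mean
      coef(m2)                                         # 4.661 and 5.526
      
      # The design matrices show why: dropping the intercept replaces the
      # column of 1's with an indicator column for trt1
      head(model.matrix(m1))
      head(model.matrix(m2))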

23.3 ANOVA/F-test

  • F-test (pp. 33-39 in Faraway (2014))

    • The t-test is useful for testing simple hypotheses
    • F-test is useful for testing more complicated hypotheses
    • Example: comparing more than two means
    library(faraway)
    boxplot(weight ~ group, data = PlantGrowth, col = "grey", ylim = c(0, 7))

    aggregate(weight ~ group, FUN = mean, data = PlantGrowth)
    ##   group weight
    ## 1  ctrl  5.032
    ## 2  trt1  4.661
    ## 3  trt2  5.526

    To determine whether treatments 1 and 2 had an effect, we might propose the null hypothesis \[\text{H}_{0}:\beta_1 = \beta_2 = 0\] and the alternative hypothesis \[\text{H}_{\text{a}}:\beta_1 \neq 0 \;\text{or}\; \beta_2 \neq 0.\] To test the null hypothesis, we calculate the probability of observing values of \(\hat{\beta}_1\) and \(\hat{\beta}_2\) as or more extreme than what the null hypothesis predicts. To calculate this probability, note that under \(\text{H}_{0}\) \[F = \frac{(TSS - RSS)/(p-1)}{RSS/(n-p)} \sim\text{F}(p-1,n-p),\] where \(TSS = (\mathbf{y} - \bar{\mathbf{y}})^{\prime}(\mathbf{y} - \bar{\mathbf{y}})\) and \(RSS = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\prime}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\). The p-value is then \[\text{P}\big(\text{F}(p-1,n-p) > F_{\text{obs}}\big),\] where \(F_{\text{obs}}\) is the value of the statistic computed from the data; this probability is the area under the curve of an F-distribution to the right of \(F_{\text{obs}}\). The code below shows how to do these calculations in R; a short sketch after this list shows the same test framed as a comparison of nested models.

    m1 <- lm(weight ~ group, data = PlantGrowth)
    summary(m1)
    ## 
    ## Call:
    ## lm(formula = weight ~ group, data = PlantGrowth)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -1.0710 -0.4180 -0.0060  0.2627  1.3690 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   5.0320     0.1971  25.527   <2e-16 ***
    ## grouptrt1    -0.3710     0.2788  -1.331   0.1944    
    ## grouptrt2     0.4940     0.2788   1.772   0.0877 .  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.6234 on 27 degrees of freedom
    ## Multiple R-squared:  0.2641, Adjusted R-squared:  0.2096 
    ## F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591
    anova(m1)
    ## Analysis of Variance Table
    ## 
    ## Response: weight
    ##           Df  Sum Sq Mean Sq F value  Pr(>F)  
    ## group      2  3.7663  1.8832  4.8461 0.01591 *
    ## Residuals 27 10.4921  0.3886                  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    # 'By hand'
    beta.hat <- coef(m1)
    X <- model.matrix(m1)
    y <- PlantGrowth$weight
    y.bar <- mean(y)
    n <- 30  # number of observations
    p <- 3   # number of model parameters (columns of X)
    
    RSS <- t(y - X %*% beta.hat) %*% (y - X %*% beta.hat)
    RSS
    ##          [,1]
    ## [1,] 10.49209
    TSS <- t(y - y.bar) %*% (y - y.bar)
    TSS
    ##          [,1]
    ## [1,] 14.25843
    TSS - RSS
    ##         [,1]
    ## [1,] 3.76634
    F.obs <- ((TSS - RSS)/(p - 1))/(RSS/(n - p))  # observed F statistic
    F.obs
    ##          [,1]
    ## [1,] 4.846088
    pf(F.obs, p - 1, n - p, lower.tail = FALSE)  # p-value: area in the upper tail of F(p-1, n-p)
    ##            [,1]
    ## [1,] 0.01590996
  • Live example
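
  As a complement to the "by hand" calculation above, here is a minimal sketch showing the same F-test framed as a comparison of nested models: the intercept-only (null) model versus the model with a separate mean for each group. anova() applied to the pair of fitted models reports the same F statistic and p-value as above.

    # Null model: a single mean for all plants (intercept only)
    m0 <- lm(weight ~ 1, data = PlantGrowth)
    # Full model: a separate mean for each group
    m1 <- lm(weight ~ group, data = PlantGrowth)
    # F-test comparing the nested models; matches F = 4.846 and p = 0.0159 above
    anova(m0, m1)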

23.4 Model checking

  • Given a statistical model, estimation, prediction, and statistical inference are somewhat “automatic”
  • If the statistical model is misspecified (i.e., wrong) in any way, the resulting statistical inference (including predictions and prediction uncertainty) rests on a house of cards.
  • George Box quote: “All models are wrong but some are useful.”
    • Box (1976) “Since all models are wrong the scientist cannot obtain a correct one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.”
  • We have assumed the linear model \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\), which allowed us to:
    • Estimate \(\boldsymbol{\beta}\) and \(\sigma^2\)
    • Make statistical inference about \(\hat{\boldsymbol{\beta}}\)
    • Make predictions and obtain prediction intervals for future values of \(\mathbf{y}\)
  • All of the statistical inference we obtained requires that the linear model \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) gave rise to the data. In particular, we should check:
    • Support
    • Linear
    • Constant variance
    • Independence
    • Outliers
  • Model diagnostics (Ch. 6 in Faraway (2014)) are a set of tools and procedures for checking whether the assumptions of our model are approximately correct (a minimal sketch of a few such checks in R appears after this list).
    • Statistical tests (e.g., Shapiro-Wilk test for normality)
      • Specific
      • What if you reject the null?
    • Graphical
      • Broad
      • Subjective
      • Widely used
    • Predictive model checks
      • More common for Bayesian models (e.g., posterior predictive checks)
  • We will explore numerous ways to check
    • Distributional assumptions
      • Normality
      • Constant variance
      • Correlation among errors
    • Detection of outliers
    • Deterministic model structure
      • Is \(\mathbf{X}\boldsymbol{\beta}\) a reasonable assumption?
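
  As a preview of Ch. 6, here is a minimal sketch of a few of the checks listed above, applied to the PlantGrowth model from the previous section; this is only an illustration, not the full set of diagnostics we will cover.

    # Refit the model from the ANOVA/F-test example
    m1 <- lm(weight ~ group, data = PlantGrowth)
    
    # Graphical checks: residuals vs. fitted values (constant variance, structure)
    # and a normal Q-Q plot (normality)
    par(mfrow = c(1, 2))
    plot(fitted(m1), residuals(m1), xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)
    qqnorm(residuals(m1))
    qqline(residuals(m1))
    
    # A formal (and specific) test of the normality assumption: Shapiro-Wilk
    shapiro.test(residuals(m1))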

Literature cited

Box, George E. P. 1976. “Science and Statistics.” Journal of the American Statistical Association 71 (356): 791–99.
Faraway, J. J. 2014. Linear Models with R. CRC Press.