31 Day 30 (July 23)

31.1 Announcements

  • Plan for next week
  • Final presentations
    • Send me an email () to schedule a 20 min time interval for your final presentation. In your email give me three dates/times during the week of July 29 - July. 31 that work for you.
  • Selected questions from journals
    • “I did have another question… do we have to prepare a professional presentation for our final project now that we are only going to be talking with you?”
    • “How do calculations of mean error and calibration work for binary datasets with logit links? Do you transform the real y values to log odds to compare, or do you aim for the same proportion of presences in the training data to be predicted in the testing dataset?”
    • “The discussion about the assumptions around linear regression has a little bit confusing for me.We usually assume that 𝐲 ∼ 𝑁(𝐗β, σ²𝐈) and this is usually not met, or this basically never happen in applied sciences such as plant pathology. So, why we do this kind of assumption and why this is always the starting point?”

31.2 Distributional assumptions

  • Why did we assume \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\)?

  • Is the assumption \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) ever correct? Is there a “true” model?

  • When would we expect the assumption \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) to be approximately correct?

    • Human body weights
    • Stock prices
    • Temperature
    • Proportion of votes for a candidate in an elections
  • Checking distributional assumptions

    • If \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\), then \(\mathbf{y} - \mathbf{X\boldsymbol{\beta}}\sim ?\)
  • If the assumption \(\mathbf{y}\sim\text{N}(\mathbf{X\boldsymbol{\beta}},\sigma^{2}\mathbf{I})\) is approximately correct, then what should \(\hat{\boldsymbol{\varepsilon}}\) look like?

  • Example: checking the assumption that \(\boldsymbol{\varepsilon}\sim\text{N}(\mathbf{0},\sigma^{2}\mathbf{I})\)

    • Data
    y <- c(63, 68, 61, 44, 103, 90, 107, 105, 76, 46, 60, 66, 58, 39, 64, 29, 37, 27,
        38, 14, 38, 52, 84, 112, 112, 97, 131, 168, 70, 91, 52, 33, 33, 27, 18, 14, 5,
        22, 31, 23, 14, 18, 23, 27, 44, 18, 19)
    year <- 1965:2011
    df <- data.frame(y = y, year = year)
    
    plot(x = df$year, y = df$y, xlab = "Year", ylab = "Annual count", main = "", col = "brown",
        pch = 20)
    m1 <- lm(y ~ year, data = df)
    abline(m1)

    • Histogram of \(\hat{\boldsymbol{\varepsilon}}\)
    m1 <- lm(y ~ year, data = df)
    e.hat <- residuals(m1)
    hist(e.hat, col = "grey", breaks = 15, main = "", xlab = expression(hat(epsilon)))

    • Plot covariate vs. \(\hat{\boldsymbol{\varepsilon}}\)
    plot(year, e.hat, xlab = "Year", ylab = expression(hat(epsilon)), col = "darkgreen")

    • A formal hypothesis test (see pg. 81 in Faraway (2014))
    shapiro.test(e.hat)
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  e.hat
    ## W = 0.86281, p-value = 5.709e-05
  • Example: Checking the assumption that \(\boldsymbol{\varepsilon}\sim\text{N}\left(\mathbf{0},\sigma^{2}\mathbf{I}\right)\) (What it should look like)

    • Simulated data
    beta.truth <- c(2356, -1.15)
    sigma2.truth <- 33^2
    n <- 47
    
    
    year <- 1965:2011
    X <- model.matrix(~year)
    set.seed(2930)
    y <- rnorm(n, X %*% beta.truth, sigma2.truth^0.5)
    df1 <- data.frame(y = y, year = year)
    
    plot(x = df1$year, y = df1$y, xlab = "Year", ylab = "Annual count", main = "", col = "brown",
        pch = 20)

    m2 <- lm(y ~ year, df1)
    e.hat <- residuals(m2)
    
    summary(m2)
    ## 
    ## Call:
    ## lm(formula = y ~ year, data = df1)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -76.757 -22.237   3.767  19.353  66.634 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)  
    ## (Intercept) 1717.2121   638.5293   2.689   0.0100 *
    ## year          -0.8272     0.3212  -2.575   0.0134 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 29.87 on 45 degrees of freedom
    ## Multiple R-squared:  0.1285, Adjusted R-squared:  0.1091 
    ## F-statistic: 6.632 on 1 and 45 DF,  p-value: 0.01337
    • Histogram of \(\hat{\boldsymbol{\varepsilon}}\)
    hist(e.hat, col = "grey", breaks = 15, main = "", xlab = expression(hat(epsilon)))

    • Plot covariate vs. \(\hat{\boldsymbol{\varepsilon}}\)
    plot(year, e.hat, xlab = "Year", ylab = expression(hat(epsilon)))

    • A formal hypothesis test (see pg. 81 in Faraway (2014))
    shapiro.test(e.hat)
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  e.hat
    ## W = 0.98556, p-value = 0.8228

Literature cited

Faraway, J. J. 2014. Linear Models with r. CRC Press.