1  Simple Linear Regression

Learning objectives

By the end of this week you should be able to:

  1. Describe the different motivations for regression modelling

  2. Formulate a simple linear regression model

  3. Understand the least squares method of parameter estimation and its equivalence to maximum likelihood

  4. Interpret statistical output for a simple linear regression model

  5. Calculate and interpret confidence intervals and prediction intervals for simple linear regression

Learning activities

This week’s learning activities include:

| Learning activity | Learning objectives |
|-------------------------------|------------|
| Video 1 | 1, 2, 3 |
| Readings | 1, 2, 3, 4 |
| Video 2 | 4, 5 |
| Independent exercises | 2, 4, 5 |
| Live tutorial/discussion board | 2, 4, 5 |

In the notes below, when we mention a “Book Chapter”, the “Book” we are referring to is: Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models, by Vittinghoff et al. (Second Edition). You should be able to obtain a digital copy of the book from your university library.

Introduction to regression

Regression modelling is one of the key tools that statisticians use to understand and quantify the relationship between an outcome variable, \(Y\), (also known as the “dependent” or “response” variable) and one or more covariates, \(\mathbf{x}\) (also known as the “predictor”, “independent” or “explanatory” variables). It involves constructing a mathematical equation to describe the relationship between these variables, and aims to find the “best fit” to describe how the outcome variable \(Y\) changes as a covariate \(x\) changes in value. For example, we might be interested in studying how systolic blood pressure changes for every 1kg increase in body weight, or we might want to know how the average haemoglobin levels differ between males and females. Regression models are an extremely useful tool that can be used to answer a variety of research questions and for different purposes (outlined in the video below).

The aim of this unit is to lay the foundation of “regression models” for analysing data from randomised or observational studies. “Regression” is a general term for a set of methods for measuring associations between an outcome and one or more covariates at once, allowing for the adjustment of confounding (Module 4) and effect modification (Module 6). Regression models are commonly used in health research, and being able to implement these methods appropriately and interpret their results are vital skills for an effective practitioner of biostatistics. A suite of common regression models will be taught across this unit (Regression Modelling 1 (RM1)) and in the subsequent Regression Modelling 2 (RM2) unit. The skills taught in this unit (and in RM2) will be used for the remainder of your BCA studies (and career in biostatistics).

In RM1 we will be focussing on regressions where the outcome variables are either continuously distributed (linear regression models), or are binary (logistic regression models). RM2 will then expand the logistic regression concepts introduced in RM1 for multinomial and ordinal categorical data, and also include other regression models for count and rate data, and survival models in the framework of generalised linear models.

1.0.1 Introduction to Simple Linear Regression

This lecture introduces you to the purpose of regression models to answer three types of research questions: prediction; isolating the effect of a single predictor; and understanding multiple predictors. You will also learn what a simple linear regression looks like, how to interpret the parameters in the model, and learn about the method used to estimate its parameters.

Book (Vittinghoff et al) Chapter 1. Introduction to Section 1.3.3 (pages 1-4).

This reading (pages 1-4 of the textbook) supplements Lecture 1 with a similar motivation for the need for regression models (which they refer to as “multipredictor regression models”) to answer three types of research questions:

  1. Prediction - when we want to use one or more variables to predict an outcome

  2. Isolating the effect of a single predictor/exposure on the outcome (with or without the presence of confounders)

  3. Understanding the effect of multiple predictors/covariates on an outcome

Nothing new is introduced in this reading, but it provides some further examples and its purpose is to allow you to become familiar with the writing style of the textbook that we follow in this course.

Book Chapter 3. Section 3.3 to 3.3.3 (pages 35-38).

This reading (pages 35-38 of the textbook) introduces the simple linear regression model and describes how to interpret each parameter of the model. This will be further explored in Lecture 2. It also describes the error term between individual observations and the mean behaviour of the population, which is important because the assumptions of linear regression are all about the error term. Stata and R code corresponding to the output in this reading can be found below.

Stata code and output

Show the code
use hersdata, clear   // load the HERS dataset
set seed 90896        // make the random sample reproducible
sample 10             // keep a random 10% of observations
reg SBP age           // simple linear regression of SBP on age
## (2,487 observations deleted)
## 
## 
##       Source |       SS           df       MS      Number of obs   =       276
## -------------+----------------------------------   F(1, 274)       =     11.70
##        Model |  4595.93381         1  4595.93381   Prob > F        =    0.0007
##     Residual |  107671.294       274  392.960929   R-squared       =    0.0409
## -------------+----------------------------------   Adj R-squared   =    0.0374
##        Total |  112267.228       275  408.244466   Root MSE        =    19.823
## 
## ------------------------------------------------------------------------------
##          SBP | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##          age |   .6411057   .1874638     3.42   0.001     .2720533    1.010158
##        _cons |   93.87961   12.43407     7.55   0.000     69.40115    118.3581
## ------------------------------------------------------------------------------

R code and output

Show the code
hers_subset <- read.csv("hers_subset.csv")    # import the 10% subsample of the HERS data
lm.hers <- lm(SBP ~ age, data = hers_subset)  # fit the simple linear regression of SBP on age
summary(lm.hers)                              # coefficients, standard errors, t-tests and R-squared
## 
## Call:
## lm(formula = SBP ~ age, data = hers_subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.193 -14.346  -1.578  13.391  57.961 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  93.8796    12.4341    7.55 6.48e-13 ***
## age           0.6411     0.1875    3.42 0.000722 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.82 on 274 degrees of freedom
## Multiple R-squared:  0.04094,    Adjusted R-squared:  0.03744 
## F-statistic:  11.7 on 1 and 274 DF,  p-value: 0.0007219
confint(lm.hers)                              # 95% confidence intervals for the coefficients
##                  2.5 %     97.5 %
## (Intercept) 69.4011476 118.358063
## age          0.2720533   1.010158
Note

This table does not use the complete HERS dataset; rather, it takes a random sample of 10% of the data. In Stata this is achieved with set seed 90896 and sample 10.

Here, set seed 90896 ensures that the random sample is reproducible, i.e. we draw the same random sample each time. As random sampling is hard to replicate exactly across statistical programs, to get the same output in R we took the random sample in Stata and then imported this subset of the data into R. A copy of this subset is provided in the data resources as hers_subset.csv.
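
If you prefer to work entirely in R, the minimal sketch below shows how a reproducible 10% sample could be drawn with set.seed() and sample(). It assumes the full dataset is available as a file named hersdata.csv with the same variable names; note that the rows selected will not match Stata's draw, which is why the course provides hers_subset.csv.

Show the code
hers <- read.csv("hersdata.csv")                # full HERS data (assumed file name)

set.seed(90896)                                 # make the random draw reproducible
n10 <- round(0.10 * nrow(hers))                 # 10% of the observations
hers_sample <- hers[sample(nrow(hers), n10), ]  # keep a random 10% of rows

lm(SBP ~ age, data = hers_sample)               # coefficients will differ slightly from the output above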

Notation

Before continuing further with the theory of linear regression, it is helpful to see some of the variations in notation used in regression formulae. In general, Greek letters are used for true population values, whereas the Latin (or modern) alphabet is used to denote values estimated from a sample. The hat symbol (^) can also be used to indicate estimated or fitted values. Subscripts on the \(Y\)’s and \(x\)’s indicate the observation number. Some examples of the different ways regression notation is used in this course are shown below. Don’t worry if some of these terms are not familiar to you yet; they will be introduced in due course.

| Term | True population | Estimated from data/sample |
|------|-----------------|----------------------------|
| Regression line | \(\text{E}(Y) = \beta_0 + \beta_1 x\); \(Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\) | \(\bar{Y}_x = \hat{\beta}_0 + \hat{\beta}_1 x\); \(\bar{Y}_x = b_0 + b_1 x\); \(Y_i = b_0 + b_1 x_i + e_i\) |
| Expected values / means | \(\text{E}(Y)\), \(\text{E}(x)\) | \(\bar{Y}\), \(\bar{x}\) |
| Parameters, regression coefficients | \(\beta\) | \(\hat{\beta}\), \(b\) |
| Error terms | \(\varepsilon\), called the “error” | \(e\) or \(\hat{\varepsilon}\), called the “residual” or “residual error” |
| Variance of error | \(\sigma^2\), \(\text{Var}(\varepsilon)\) | Mean square error (MSE), \(\hat{\sigma}^2\), \(\widehat{\text{Var}}(\varepsilon)\), \(s^2\) |

Properties of ordinary least squares

There are many ways to fit a straight line to data in a scatterplot. Linear regression uses the principle of ordinary least squares, which finds the values of the parameters (\(\beta_0\) and \(\beta_1\)) of the regression line that minimise the sum of the squared vertical deviations of each point from the fitted line. That is, the line that minimises: \[ \sum(Y_i - \hat{Y}_i)^2 = \sum(Y_i - (\hat{\beta}_0 + \hat{\beta}_1x_i))^2\]

This principle is illustrated in the diagram below, where a line is shown passing near the value of \(Y\) for six values of \(x\). Each choice of values for \(\hat{\beta}_0\) and \(\hat{\beta}_1\) defines a different line, resulting in different values for the vertical deviations. There is, however, one pair of parameter values that produces the smallest possible value of the sum of squared deviations: this pair is the least squares estimate.

Ordinary least squares minimises the square of the vertical distance between data and the line

In the scatterplot below, you can see how the line adjusts to the points in the graph. Try dragging some of the points, or creating new points by clicking in an empty area of the plot, and see how the equation changes. In particular, notice how moving a point up or down at the extremes of the \(x\) scale affects the fitted line much more than doing the same to a point in the mid-range of the \(x\) scale. We will see later that this is why caution is needed when there are outliers in the data.
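
To see this numerically rather than interactively, here is a small illustrative R sketch with made-up data: shifting the point at the extreme of the \(x\) scale changes the fitted slope far more than shifting a mid-range point by the same amount.

Show the code
set.seed(1)
x <- 1:10
y <- 2 + 0.5 * x + rnorm(10, sd = 1)   # made-up data scattered around a true line

coef(lm(y ~ x))                        # slope from the original fit

y_edge <- y
y_edge[10] <- y_edge[10] + 5           # shift the point at the extreme of the x scale
coef(lm(y_edge ~ x))                   # slope changes substantially

y_mid <- y
y_mid[5] <- y_mid[5] + 5               # shift a mid-range point by the same amount
coef(lm(y_mid ~ x))                    # slope barely changes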

Linear regression uses least squares to estimate parameters because, when the regression assumptions are met (more on this later), the estimator is BLUE: the Best Linear Unbiased Estimator. Unbiased means that over many repetitions of sampling and fitting a model, the estimated parameter values average out to equal the true “population” parameter value (i.e. \(\text{E}[\hat{\beta}_1]=\beta_1\)). Having an unbiased estimator does not mean that every parameter estimate is close to the true value; in fact, unbiasedness says nothing about the sample-to-sample variability of the estimates, which is a matter of precision. “Linear” means that the class of estimators considered is those that can be written as linear combinations of the observations \(Y\). More specifically, any linear unbiased estimator of the slope parameter \(\beta_1\) can be written as \(\sum_{i=1}^n a_iY_i\), where the values of \((a_1,\ldots,a_n)\) must be such that \(\text{E}(\sum_{i=1}^n a_iY_i)=\beta_1\). This was important when computational power was limited, as linear estimators can be computed easily. The “best” component of BLUE says that least squares estimators are best in the sense of having the smallest variance of all linear unbiased estimators; that is, they have the best precision, or equivalently, they are the most efficient.

The mathematical theorem and proof that the least squares estimator is the best linear unbiased estimator (BLUE) is called the Gauss-Markov theorem. The least squares estimator is also identical to the maximum likelihood estimator when the regression assumptions are met.
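
For simple linear regression the least squares estimates have a closed-form solution: \(\hat{\beta}_1 = \sum(x_i-\bar{x})(Y_i-\bar{Y}) / \sum(x_i-\bar{x})^2\) and \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{x}\). The short R sketch below, reusing hers_subset from the code above, computes these by hand and checks them against lm().

Show the code
x <- hers_subset$age
y <- hers_subset$SBP

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # least squares slope
b0 <- mean(y) - b1 * mean(x)                                     # least squares intercept

c(intercept = b0, slope = b1)            # should reproduce 93.88 and 0.64 from the output above
coef(lm(SBP ~ age, data = hers_subset))  # identical values from lm()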

Book Chapter 3. Section 3.3.5 to 3.3.9 (pages 39-42).

This reading (pages 39-42 of the textbook) describes the basic properties of regression coefficients, including their standard errors, hypothesis testing, confidence intervals, and their role in the calculation of \(R^2\).
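
As a companion to this reading, the sketch below computes the slope's standard error, t-value, p-value and \(R^2\) by hand for the SBP and age example. It continues from the previous code block (it assumes x, y, b0 and b1 are still in the workspace) and uses the standard formulas \(\hat{\sigma}^2 = \sum e_i^2/(n-2)\), \(\text{SE}(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2 / \sum(x_i-\bar{x})^2}\) and \(R^2 = 1 - \sum e_i^2 / \sum(Y_i - \bar{Y})^2\).

Show the code
n  <- length(y)
e  <- y - (b0 + b1 * x)                    # residuals
s2 <- sum(e^2) / (n - 2)                   # mean square error (estimate of sigma^2)

se_b1 <- sqrt(s2 / sum((x - mean(x))^2))   # standard error of the slope
t_b1  <- b1 / se_b1                        # t statistic for H0: beta1 = 0
p_b1  <- 2 * pt(-abs(t_b1), df = n - 2)    # two-sided p-value

r2 <- 1 - sum(e^2) / sum((y - mean(y))^2)  # proportion of variation in SBP explained by age

c(se = se_b1, t = t_b1, p = p_b1, R2 = r2) # compare with the regression output above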

Regression in Stata and R

The lecture below (which you can watch for either Stata or R (or both if you are keen)) shows how to carry out and interpret the results of a simple linear regression in statistical software. It then shows how to calculate and interpret confidence and prediction intervals.

Stata Lecture

Download video

R Lecture

Download video
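
In R, confidence and prediction intervals such as those discussed in the lectures can be obtained with predict(). Below is a minimal sketch for the SBP and age model fitted earlier (lm.hers), using an illustrative age of 67 years.

Show the code
new_obs <- data.frame(age = 67)                               # illustrative covariate value

predict(lm.hers, newdata = new_obs, interval = "confidence")  # CI for the mean SBP at age 67
predict(lm.hers, newdata = new_obs, interval = "prediction")  # PI for an individual's SBP at age 67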

Exercises

The following exercises will allow you to test yourself on what you have learned so far. The solutions will be released at the end of the week.

Using the hers_subset.csv dataset, use simple linear regression in R or Stata to measure the association between diastolic blood pressure (DBP, the outcome) and body mass index (BMI, the exposure).

  1. Summarise the important findings by interpreting the relevant parameter values, associated P-values and confidence intervals, and \(R^2\) value. Three to four sentences is usually enough here.

  2. From your regression output, calculate how much the mean DBP changes for a 5 kg/m² increase in BMI. Can you verify this by modifying your data and re-running your regression? Hint (for the second part of the question): you will need to create a new variable for BMI (called BMI5)

  3. Manually calculate the standard error of \(\beta_1\), the t-value, the p-value, and \(R^2\)

  4. Based on your regression, make a prediction for the mean diastolic blood pressure of people with a BMI of 28 kg/m².

  5. Calculate and interpret a confidence interval for this prediction.

  6. Calculate and interpret a prediction interval for this prediction.

Live tutorial and discussion

The final learning activity for this week is the live tutorial and discussion. This tutorial will provide an overview of the unit and is an opportunity for you to interact with your lecturers, meet your classmates, ask questions about the course, and learn about biostatistics in practice. You are expected to attend these tutorials when it is possible for you to do so. For those who cannot attend, the tutorial will be recorded and made available on Canvas. We hope to see you there!

Preparation for week 2

In Week 2 we will hold a tutorial where you will be required to collaboratively complete some exercises (in breakout rooms on Zoom). Interacting, discussing, and working through problems with your peers is an important skill for biostatisticians. This is also a nice activity for getting to know your peers in this online course. We will communicate the timing of the tutorial in Week 1.

Summary

This week’s key concepts are:

    • Regression models have three main purposes: prediction; isolating the effect of a single exposure; and understanding multiple predictors. The purpose of a regression model will influence the procedures you follow to build it; this will be explored further in Week 8.

    • Simple linear regression measures the association between a continuous outcome and a single exposure. This exposure can be continuous or binary (in which case simple linear regression is equivalent to a two-sample Student’s t-test; see the short sketch after this list).

    • The relevant output to interpret and report from a simple linear regression includes:

      • The p-value for the exposure’s regression coefficient (slope)

      • The effect size of the exposure regression coefficient and 95% confidence interval

      • The amount of variation in the outcome explained by the exposure (\(R^2\))

    • The confidence interval for a prediction represents the uncertainty associated with an estimated predicted mean. In contrast, the prediction interval represents the uncertainty associated with the spread of individual observations around the predicted mean.
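
As a quick illustration of the equivalence mentioned above, the sketch below (with made-up data) fits a simple linear regression with a binary exposure and compares it with the equal-variance two-sample t-test: the slope equals the difference in group means and the p-values agree.

Show the code
set.seed(42)
group <- rep(c(0, 1), each = 30)             # binary exposure
y     <- 10 + 2 * group + rnorm(60, sd = 3)  # continuous outcome

summary(lm(y ~ group))$coefficients          # slope = difference in group means
t.test(y ~ group, var.equal = TRUE)          # same p-value (the mean difference has opposite sign)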