6  Interaction and Collinearity

Learning objectives

By the end of this week you should be able to:

  1. Understand the concept of collinearity and how it affects linear regression

  2. Implement model building strategies for dealing with collinearity

  3. Understand and explain the concept of interaction (effect modification)

  4. Carry out linear regression analysis that accounts for interaction and interpret the key findings

Learning activities

This week’s learning activities include:

Learning Activity            Learning objectives
Reading                      1, 2
Independent investigation    3, 4
Independent exercises        3, 4
Live tutorial/discussion     3, 4

Collinearity

Collinearity (also known as multicollinearity) may be present in any regression model with more than one covariate, and occurs when there is correlation among the predictor variables. This means that there is a linear association between two or more predictors in the model. For example, you might be interested in modelling the risk of a cardiovascular event based on a person’s systolic blood pressure (SBP) and diastolic blood pressure (DBP). SBP and DBP are physiologically related as they are both measures of blood pressure and are often strongly correlated, so including both of these in the regression model may result in collinearity.

(Technically speaking, “collinearity” occurs when there is a linear relationship between two predictor variables, whilst multicollinearity refers to a linear relationship between two or more predictor variables. Thus, multicollinearity is a more general term that includes collinearity as a special case. We will use these terms interchangeably in RM1).

Implications of collinearity

The presence of collinearity can cause several problems:

  1. Inflation of the standard errors of the estimated coefficients (\(\beta\) terms), making the estimates less precise. Covariates that appeared to be significant when evaluated independently of their collinear counterpart may no longer be significant (i.e., exhibit larger p-values, wider confidence intervals) when both are included in the model.

  2. The model may struggle to identify the unique effect that each covariate has on the dependent variable, leading to unstable estimates - small changes in the data may cause large changes in the estimated coefficients, which reduces the reliability of the model for prediction (and interpretation).

  3. It can become difficult to interpret the regression coefficients for an individual predictor variable/covariate. Normally we interpret a regression coefficient for a covariate by considering how a change in the covariate affects the outcome variable (e.g. a 1 unit change in the covariate), holding all other covariates/predictor variables constant. However, when collinearity is present, the covariates move together (changes in one predictor variable are associated with shifts in the collinear predictor variable(s)) and we would need to jointly consider changes in the collinear predictors.

  4. In some scenarios, collinearity can contribute to overfitting, where the model performs well on the current data but poorly on new, unseen data.

Essentially, collinearity makes it difficult to determine which variables truly influence the outcome and by how much. However, collinearity doesn’t necessarily invalidate the model or its predictive power (discussed further below).
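
To see implication 1 in action, here is a small R simulation (a sketch using invented data, not any dataset from this unit) in which a second predictor is almost a copy of the first; note how the standard error of the first coefficient grows once the collinear covariate is added.

# Illustrative simulation (invented data): two highly correlated predictors
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)          # x2 is almost a copy of x1 (correlation near 1)
y  <- 1 + 2 * x1 + rnorm(n)            # only x1 truly affects y

fit1 <- lm(y ~ x1)                     # model without the collinear covariate
fit2 <- lm(y ~ x1 + x2)                # model with both collinear covariates

# Compare the standard error of the x1 coefficient in the two models:
summary(fit1)$coefficients["x1", "Std. Error"]
summary(fit2)$coefficients["x1", "Std. Error"]   # roughly ten times larger in this example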

Detecting collinearity

There are a number of approaches that you can use to detect collinearity:

  1. By examining scatterplots

  2. By examining the correlation between the predictor variables (e.g., via a correlation matrix)

  3. By fitting a series of regressions where the predictor variables are treated as the outcome variable (one at a time) and the other covariates are included as predictors in the model (this is a bit cumbersome)

  4. By examining the variance inflation factor (VIF), which quantifies the extent of the collinearity (it measures how much a coefficient’s variance is inflated by the correlation between the collinear predictor variables). You can obtain this in R or Stata after fitting your model (a value is provided for each predictor in the model); a brief R sketch is given below. See Section 4.2.2.2 (p74) of Vittinghoff et al for more details. Note that different cut-offs have been recommended for the VIF. In general, a VIF of 1 indicates no collinearity; VIFs of 1-5 suggest some collinearity that should be investigated further; VIFs of 5-10 indicate potentially problematic collinearity that should be investigated; and a VIF above 10 often indicates serious collinearity that will need to be addressed.
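
Continuing the small simulation above, here is a sketch of approaches 2 and 4 in R (the vif() function is from the car package, which you may need to install):

library(car)   # provides vif(); install.packages("car") if needed
cor(x1, x2)    # approach 2: correlation between the predictors (very close to 1 here)
vif(fit2)      # approach 4: one VIF per predictor in the model (well above 10 here)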

Handling collinearity

How we handle collinearity depends partly on our research question and the purpose of fitting our regression model (the inferential goals from week 1), as well as the degree of collinearity and the variables affected. Some common strategies are:

  • If we are fitting a prediction model and are not interested in interpreting individual parameter values of the predictors, we may keep collinear variables if they improve the model’s performance (decrease prediction error).
  • If we wish to interpret individual parameter values, we might re-run the analysis with only one of the collinear variables included at a time, and slowly re-introduce the collinear variables into the model (this might involve fitting several models).
  • We could combine some of the correlated variables into a new variable (e.g., use BMI instead of including height and weight in a model).
  • We could choose which variables to include based on what makes the most sense clinically (based on theory) and/or on which measures are the easiest/most practical to collect.
  • In some instances we might still need to keep collinear predictor variables in the model, for example if there are higher order terms in the model such as quadratic terms (e.g., \(x^2\); Module 7) or interaction terms (introduced below), or if one variable is the predictor of main interest and another is an important confounder.

See Vittinghoff et al. pp421-422 for more examples.

Note that other advanced techniques such as ridge regression may also be used to handle collinearity. These are beyond the scope of RM1.

Book Chapter 10: Section 10.4.1 (pages 421-422).

This reading (pages 421-422 of the textbook) supplements the notes above.

Independent Exercise - Collinearity

In managing children with asthma and other respiratory diseases, it is important to monitor various measures of lung function, one of the most important of which is the volume of air that can be forced out of the lungs in one second, known as the Forced Expiratory Volume (1 second), or FEV1 for short (measured in litres). The “[lungfun.csv]” dataset contains the following variables:

  • idnum: patient identification number

  • age: age in years

  • wt: weight in kg

  • height: in cm

  • armsp: arm span in cm (the distance between the fingertips of left and right hands when hands outstretched)

  • ulna: ulna length in cm (the ulna is the main bone of the arm below the elbow)

  • farm: forearm length in cm (distance from elbow to the fingertip)

  • fev1: Forced Expiratory Volume in 1 second (L)

Carry out a regression with fev1 as the outcome, and all covariates included (excluding idnum). Describe any collinearity you observe. How would you recommend dealing with this collinearity?
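
If you want a starting point, here is a minimal R sketch (assuming lungfun.csv is in your working directory); the investigation of collinearity and your recommendation are left to you.

lungfun <- read.csv("lungfun.csv")   # assumes the file is in your working directory
fit.fev <- lm(fev1 ~ age + wt + height + armsp + ulna + farm, data = lungfun)
summary(fit.fev)
# From here, examine scatterplots, the correlation matrix of the covariates, and the VIFs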

Interaction (effect modification)

In previous weeks we assumed that the true underlying effect of an exposure on an outcome was constant over all individuals. This is not always the case and when the effect of the exposure is different for different groups this is called “interaction”. Another term frequently used instead of interaction is “effect modification”. You may wish to review your epidemiology notes for more information and examples of interaction.

The panel of images below is a helpful visual representation of the differences in regression models with and without interaction. The first column shows simple linear regression with a single relationship between \(x\) and \(Y\) (a continuous \(x\) is shown in the top three panels, and a binary \(x\) in the bottom three). The second column shows multiple linear regression as we have used it so far. Here a binary “group” variable is included to adjust for the mean difference in \(Y\) across groups A and B. As the two lines are parallel along \(x\), the mean difference between groups is constant for all values of \(x\). In the third column, we see that the relationship between \(x\) and \(Y\) is different for group A and group B. That is, there is interaction between \(x\) and the group variable. The lines are not parallel, so the mean difference between group A and B depends on the value of \(x\).

A regression model for interaction

We introduce the mathematical form of the regression model for interaction by considering a regression model with two covariates \(x_1\) and \(x_2\) which are regressed on our outcome variable \(Y\). The equation for this model without interaction is shown below:

\[\text{E}(Y) = \beta_0 +\beta_1 x_1 + \beta_2 x_2\]

The term we add to this model to account for, and test for, interaction is the product of \(x_1\) and \(x_2\), as follows:

\[\text{E}(Y) = \beta_0 +\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2\] To see why this works, consider the following factorisations of this regression equation

\[\text{E}(Y) = \beta_0 +\beta_1 x_1 + (\beta_2 + \beta_3 x_1) x_2\] Here, we see that the effect of \(x_2\) on \(Y\) equals \(\beta_2 + \beta_3 x_1\). That is, the effect of \(x_2\) depends on the value of \(x_1\), which is the definition of interaction. Alternatively, you could factor out \(x_1\) instead of \(x_2\) to obtain the following

\[\text{E}(Y) = \beta_0 + (\beta_1 + \beta_3 x_2) x_1 + \beta_2 x_2\] where we see that the effect of \(x_1\) equals \(\beta_1 + \beta_3 x_2\). That is, the effect of \(x_1\) is dependent on the value of \(x_2\).

This factorisation is also useful when considering the interpretation of each of the regression coefficients \(\beta_0\), \(\beta_1\), \(\beta_2\) and \(\beta_3\). These are:

  • \(\beta_0\): the mean of \(Y\) when \(x_1 = x_2 = 0\)
  • \(\beta_1\): the effect of \(x_1\) when \(x_2 = 0\). i.e. The mean of \(Y\) will increase by \(\beta_1\) for every one unit increase in \(x_1\) when \(x_2 = 0\).
  • \(\beta_2\): the effect of \(x_2\) when \(x_1 = 0\). i.e. The mean of \(Y\) will increase by \(\beta_2\) for every one unit increase in \(x_2\) when \(x_1 = 0\).
  • \(\beta_3\): How the effect of \(x_1\) changes for every one unit increase in \(x_2\). Or alternatively, how the effect of \(x_2\) changes for every one unit increase in \(x_1\).
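
For example (with made-up numbers), if \(\beta_1 = 2\) and \(\beta_3 = 0.5\), then a one unit increase in \(x_1\) increases the mean of \(Y\) by \(2\) when \(x_2 = 0\), but by \(2 + 0.5 \times 4 = 4\) when \(x_2 = 4\): the slope for \(x_1\) changes with the value of \(x_2\).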

Importantly, the statistical test for interaction here is the test of the null hypothesis \(\beta_3 = 0\). Therefore the P-value for \(\beta_3\) should be interpreted as the evidence for interaction between \(x_1\) and \(x_2\).

Interaction in statistical software

There are two approaches for implementing an interaction regression model in statistical software. The first is to manually create a new column of data that is the product of the two columns you wish to test for interaction. You can then include this new variable as a covariate in the regression. The second method is to let Stata or R do this automatically, by specifying the interaction in the regression formula. The second method is always recommended because interaction involving a categorical variable with more than two categories requires more than a single product of two covariates: a product must be created with each of the dummy variables representing the categorical variable. A small R illustration of the two approaches is given below.
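
As a small illustration (a sketch with invented data, not the HERS dataset used later), the R code below fits the same interaction model both ways and shows that the estimated coefficients agree:

# Invented data purely to illustrate the two ways of specifying an interaction
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)                               # a binary covariate coded 0/1
y  <- 1 + 0.5 * x1 + 1 * x2 + 0.8 * x1 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

# Approach 1: manually create the product column and add it as a covariate
dat$x1x2 <- dat$x1 * dat$x2
fit.manual  <- lm(y ~ x1 + x2 + x1x2, data = dat)

# Approach 2: let R construct the interaction from the formula (recommended)
fit.formula <- lm(y ~ x1 + x2 + x1:x2, data = dat)    # or equivalently y ~ x1*x2

coef(fit.manual)
coef(fit.formula)   # identical estimates; only the name of the interaction term differs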

Stata

In Stata the hash symbol # is used to specify interaction. E.g. reg Y x1 x2 x1#x2 would be a regression model with outcome \(Y\), covariates \(x_1\) and \(x_2\) and interaction between \(x_1\) and \(x_2\). Alternatively you can use the double hash ## as a shorthand to specify the inclusion of both covariates and interaction between them. i.e. reg Y x1 x2 x1#x2 is equivalent to reg Y x1##x2

R

In R the colon symbol : is used to specify interaction. E.g. lm(Y ~ x1 + x2 + x1:x2, data=...) would be a regression model with outcome \(Y\), covariates \(x_1\) and \(x_2\) and interaction between \(x_1\) and \(x_2\). Alternatively you can use the star * as a shorthand to specify the inclusion of both covariates and the interaction between them. i.e. lm(Y ~ x1 + x2 + x1:x2, data=...) is equivalent to lm(Y ~ x1*x2, data=...).

Book Chapter 4. Section 4.6.5 (pages 107-108).

This small reading provides brief descriptions of several important factors to consider when using a regression model with interaction.

Example

We will show two examples from the textbook that demonstrate interaction. These are examples 4.6.1 (interaction between two binary variables) and 4.6.2 (interaction between a binary variable and a continuous variable). You can also of course have interaction between two continuous variables.

Example 4.6.1

We return to the HERS dataset with low-density lipoprotein (LDL) as the outcome and statin use (statins) and hormone therapy (HT) as the two binary covariates. Note that the results below will differ from those in the textbook, as the textbook uses LDL1 as the outcome instead of LDL. The regression model with interaction is:

Stata code and output

Show the code
use hersdata, clear
reg LDL i.HT i.statins  i.HT#i.statins
##       Source |       SS           df       MS      Number of obs   =     2,752
## -------------+----------------------------------   F(3, 2748)      =     44.06
##        Model |  180425.141         3  60141.7137   Prob > F        =    0.0000
##     Residual |  3750983.32     2,748  1364.98665   R-squared       =    0.0459
## -------------+----------------------------------   Adj R-squared   =    0.0449
##        Total |  3931408.46     2,751  1429.08341   Root MSE        =    36.946
## 
## ------------------------------------------------------------------------------
##          LDL | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##           HT |
## hormone t..  |   .6656023   1.764553     0.38   0.706    -2.794382    4.125587
##              |
##      statins |
##         yes  |   -15.8714   2.055967    -7.72   0.000     -19.9028   -11.84001
##              |
##   HT#statins |
## hormone t.. #|
##         yes  |  -1.918487   2.931023    -0.65   0.513    -7.665717    3.828743
##              |
##        _cons |   150.7981   1.257647   119.90   0.000     148.3321    153.2642
## ------------------------------------------------------------------------------

R code and output

Show the code
hers <- read.csv("hersdata.csv")
lm.hers <- lm(LDL ~ HT + statins + HT:statins, data = hers)
summary(lm.hers)
confint(lm.hers)

You may notice that the R and Stata output have some differences because they choose different reference categories for the HT variable (Stata chose the placebo group because the first observation in the dataset is from the placebo group, while R chose the hormone therapy group because “hormone therapy” comes before “placebo” alphabetically). In both cases, however, the primary result is the same: “There is no evidence of interaction between HT and statin use (P = 0.513)”. As there is no evidence of interaction, you would then proceed with a non-interaction regression model.
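
If you would like the R output to use the same reference category as Stata, one option is to relevel the factor before fitting (a sketch, assuming HT is stored in the hers data frame with the values “placebo” and “hormone therapy”):

hers$HT <- relevel(factor(hers$HT), ref = "placebo")   # make placebo the reference level
lm.hers <- lm(LDL ~ HT + statins + HT:statins, data = hers)
summary(lm.hers)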

Example 4.6.2

Using the same HERS data as example 4.6.1 we now investigate possible interaction between statin use (binary variable) and body mass index (BMI - a continuous variable). In this example, we also adjust for several other covariates including age, nonwhite, smoking and drinkany. We also use a centered version of the BMI variable. Unlike in example 4.6.1 our results should match the textbook as we both use the LDL variable.

Stata code and output

Show the code
use hersdata, clear
gen BMIc = BMI - 28.57925
regress LDL i.statins##c.BMIc age nonwhite smoking drinkany
## (5 missing values generated)
## 
## 
##       Source |       SS           df       MS      Number of obs   =     2,745
## -------------+----------------------------------   F(7, 2737)      =     22.85
##        Model |  216681.484         7  30954.4978   Prob > F        =    0.0000
##     Residual |     3707501     2,737  1354.58568   R-squared       =    0.0552
## -------------+----------------------------------   Adj R-squared   =    0.0528
##        Total |  3924182.49     2,744  1430.09566   Root MSE        =    36.805
## 
## ------------------------------------------------------------------------------
##          LDL | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##      statins |
##         yes  |  -16.25301   1.468788   -11.07   0.000    -19.13305   -13.37296
##         BMIc |   .5821275    .160095     3.64   0.000     .2682082    .8960468
##              |
##      statins#|
##       c.BMIc |
##         yes  |   -.701947   .2693752    -2.61   0.009    -1.230146   -.1737478
##              |
##          age |  -.1728526   .1105696    -1.56   0.118    -.3896608    .0439556
##     nonwhite |   4.072767   2.275126     1.79   0.074    -.3883704    8.533903
##      smoking |   3.109819    2.16704     1.44   0.151    -1.139381    7.359019
##     drinkany |  -2.075282   1.466581    -1.42   0.157    -4.950999    .8004355
##        _cons |   162.4052   7.583312    21.42   0.000     147.5356    177.2748
## ------------------------------------------------------------------------------

R code and output

Show the code
hers <- read.csv("hersdata.csv")
hers$BMIc <- hers$BMI - 28.57925
lm.hers <- lm(LDL ~ statins*BMIc + age + nonwhite + smoking + drinkany, data = hers)
summary(lm.hers)
confint(lm.hers)

We can see from the output above that there is strong evidence of interaction between body mass index and statin use, after adjusting for the age, nonwhite, smoking, and drinkany covariates (P = 0.009). Given this strong evidence, we would keep the interaction term in the model. This changes the interpretation of the statins and BMIc coefficients as follows:

  • statins. When BMIc = 0, those taking statins have an LDL on average 16.25 mg/dL lower than those not taking statins. Here we can see the value of centering BMI, as BMIc = 0 corresponds to the effect of statin use at the mean BMI of this sample. Otherwise it would correspond to the effect of statins at a BMI of 0 (which is not appropriate to calculate).

  • BMIc. For those not taking statins (i.e. statins = 0), for every 1 kg/m2 increase in BMI, the mean LDL increases by 0.58 mg/dL.

To calculate the mean effect of BMI for those taking statins, we need to take a linear combination of the regression coefficients above. In this case the effect is 0.58 - 0.70 = -0.12. To obtain a more precise estimate, along with the corresponding confidence interval, use the lincom BMIc + 1.statins#c.BMIc statement in Stata or the summary(glht(lm.hers, "BMIc + statinsyes:BMIc = 0")) command in R (you can swap “summary” with “confint” in this statement to obtain the confidence interval).
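
For completeness, here is a sketch of that R command (it assumes the lm.hers model from this example has been fitted and that the multcomp package, which provides glht(), is installed):

library(multcomp)   # provides glht(); install.packages("multcomp") if needed
# Effect of a 1 kg/m2 increase in BMI among those taking statins:
lc <- glht(lm.hers, linfct = "BMIc + statinsyes:BMIc = 0")
summary(lc)    # estimate, standard error and P-value for the linear combination
confint(lc)    # the corresponding 95% confidence interval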

Summary

This week’s key concepts are:

  1. Collinearity occurs when two or more covariates in a regression model are linearly associated with each other, making it difficult to separate their independent associations with the outcome

  2. Collinearity increases standard errors of regression coefficients

  3. You should check for collinearity when carrying out linear regression, for example by examining the VIF. If detected, its effects can be assessed by refitting the model with some of the collinear covariates removed.

  4. Interaction between two covariates occurs when the effect of covariate 1 depends on the level of covariate 2. This is also called effect modification.

  5. Interaction between two variables in regression can be tested by including an additional covariate in your regression model that is the product of your two covariates. If one of these covariates is categorical with more than two categories, this will involve adding several interaction terms, one for each of its dummy variables.

  6. The interpretation of regression models with interaction terms is more complex as effect sizes are not constant for interacting covariates.