3.4 General Linear Models with multiple predictors

3.4.1 A multiple-predictor model of behavioural inhibition

Both our models so far have had a single predictor variable (Condition for model m1; Age for model m2). Often, though, you will want to consider several predictors at the same time. The researchers in the behavioural inhibition paper were in such a situation: they wanted to consider the impact on SSRT of both their experimental manipulation, Condition, and childhood socioeconomic deprivation (Deprivation_Score), since they had hypotheses about both. They also wanted to account statistically for two covariates, Age and GRT. They were not actually interested in these variables from the point of view of the research questions, but they thought the covariates might account for additional variation in SSRT scores. You can do all this with a single model.

Let’s now fit this model. First, we will centre the continuous predictors:

d <- d %>% mutate(
  Deprivation_Score_centred = Deprivation_Score - 
    mean(Deprivation_Score, na.rm = TRUE),
  Age_centred = Age - mean(Age, na.rm = TRUE), 
  GRT_centred = GRT - mean(GRT, na.rm = TRUE))

Now let’s run the model and get its summary. We put all the predictor variables on the right-hand side of the formula, separated by ‘+’ signs:

m3 <- lm(SSRT ~ Condition + Deprivation_Score_centred + Age_centred + GRT_centred, data = d)
summary(m3)$coefficients
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)               234.7493     7.7250  30.388 8.92e-35
## ConditionNegative           7.6383    10.7889   0.708 4.82e-01
## Deprivation_Score_centred  52.3777    19.7332   2.654 1.05e-02
## Age_centred                 1.1483     0.4163   2.759 7.99e-03
## GRT_centred                -0.0796     0.0383  -2.075 4.30e-02

Now we have a parameter estimate (and standard error) for each of the variables we thought might affect or be associated with SSRT. You interpret these parameter estimates in the same basic way as for models m1 and m2. However, you may notice some differences: the intercept of m3 does not have exactly the same value as the intercept of m1 or m2. The parameter estimate for the effect of Condition on SSRT is not quite identical in m1 and m3; and the parameter estimate for Age is not identical in m2 and m3. What is going on? Surely the association between SSRT and Age is the association between SSRT and Age; why should it be changed by what other variables we choose to look at as well?

In fact, the association between SSRT and Age that we are estimating in m3 is not the same as the one we were estimating in m2. In a model with multiple predictors, the parameter estimates are called partial coefficients. Each one estimates the effect on the outcome of a one-unit change in that predictor, on the assumption that all other variables in the model remain unchanged. And the intercept represents the value of the outcome when all of the predictor variables are zero, and therefore changes according to which predictors are included.
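One way to see what the intercept of m3 means is to ask the model for a prediction with every predictor at its reference or zero value. Here is a quick sketch (it assumes the reference level of Condition is simply its first factor level, which may differ in your data); the result should reproduce the intercept of m3 (about 235 here):

predict(m3, newdata = data.frame(
  Condition = levels(factor(d$Condition))[1],  # reference condition
  Deprivation_Score_centred = 0,               # i.e. mean Deprivation_Score
  Age_centred = 0,                             # i.e. mean Age
  GRT_centred = 0))                            # i.e. mean GRT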

In model m2, where Age was the only predictor, the coefficient for Age represented the change in SSRT that would be expected if Age increased by a year. In model m3, the coefficient for Age represents the change in SSRT that would be expected if Age increased by a year and Condition, Deprivation_Score, and GRT stayed the same.

If Age were perfectly uncorrelated with Condition, Deprivation_Score, and GRT, then the coefficients for Age in m2 and m3 would be identical: when Age changed by one unit, nothing else would change. But in fact, Age is somewhat correlated with the other variables. In particular, it is moderately positively correlated with GRT; older participants are also slower to respond overall. You can verify this positive correlation:

cor(d$Age, d$GRT, use="complete.obs")
## [1] 0.495

GRT is, for its part, negatively associated with SSRT (check this in the summary(m3) output above). So, in this dataset, when Age increases by one year, two things relevant to SSRT happen:

  • the direct association with Age: SSRT goes up;

  • the indirect association: GRT also goes up, because GRT is positively correlated with Age; and a higher GRT is associated with a lower SSRT.

These two things partially offset one another: the direct association pushes SSRT up with age, and the indirect one pulls it down a bit. In model m2, we were estimating the sum of the direct and indirect associations: the total, raw association between Age and SSRT, which comes from both processes. In model m3, we are estimating only the direct association. The partial coefficient for Age in m3 is an estimate of the effect of a change in Age where GRT does not change, and hence the indirect association does not happen. This is why the coefficient for Age in m3 is actually a bit bigger than the coefficient for Age in m2: m3 estimates the impact of an increase in Age on SSRT in the case where GRT does not change at all, eliminating the indirect path from consideration. It imagines people getting older, as it were, without also getting slower in GRT.
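If you want to see this offsetting at work, here is a rough sketch using the objects we have already created. It estimates the indirect path (how much GRT changes per year of Age, times the partial coefficient for GRT in m3) and adds it to the direct path (the partial coefficient for Age in m3). Because m3 also contains Condition and Deprivation_Score, the sum will only approximately match the Age coefficient from m2, but it should be close:

grt_per_year <- coef(lm(GRT ~ Age, data = d))["Age"]    # change in GRT per year of Age
direct       <- coef(m3)["Age_centred"]                 # direct path, from m3
indirect     <- grt_per_year * coef(m3)["GRT_centred"]  # indirect path via GRT
unname(direct + indirect)                               # compare with the Age coefficient from m2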

3.4.2 Which predictor variables should you include?

The lesson from the previous section should be that which predictor variables you include in your model depends on exactly what it is you are trying to find out. Unfortunately, many disciplines have got into the habit of including as many ‘control variables’ as they can get hold of, on the apparent assumption that controlling for more things is always better than controlling for fewer things. This is not true!

On the contrary, a parameter estimate from a model with additional predictors is not a better estimate than one from a model with fewer; it is just an estimate of a different estimand, and perhaps not the one you actually care about. You can end up with a bad estimator of the quantity you do care about (one that gives a biased, or systematically wrong, estimate) through the well-intentioned inclusion of other variables that you thought might help. This is known as the problem of bad control.

When variables are directly mentioned in your research questions, it is pretty straightforward that they should be included. If your research question is ‘how does self-esteem vary with age and sex?’, then of course age and sex both go in as predictors (though you might want to consider interactions between them, see section 5.5, and also the possibility that the association between age and self-esteem is non-linear).

More problematic are cases where there is a covariate (like Age or GRT here) that is not mentioned in the research question, but is still plausibly related to the outcome. Should this variable go in, or not? Here, we need to think about two things: what your estimand is (the thing you are trying to find out about); and what the causal relationships are between the covariate, the outcome, and the predictor(s) of direct interest.

There are several possible scenarios. Let’s denote the predictor of interest \(X\), the outcome \(Y\), and the candidate covariate \(Z\). We will assume that your estimand is the average causal effect (ACE) of \(X\) on \(Y\).

  • \(Z\) is a potential confounder. This means that \(Z\) could have a causal impact on \(X\) and, separately, a causal impact on \(Y\). Confounders should generally go into the model. If you think that reading Proust (\(X\)) might increase wages (\(Y\)), then you should control for level of education (\(Z\)), because people who are more educated might read more Proust, and also (separately) get higher-paying jobs. The association between reading Proust and wages without controlling for level of education is a biased estimate of the (probably non-existent) ACE of reading Proust on wages. This is a form of bias called omitted variable bias, and it is dealt with by controlling for the omitted confounder (a simulation sketch of this appears after this list). Of course, better still than controlling for confounders statistically is designing experimental studies in which you manipulate \(X\) without changing \(Z\). These are de-confounded by design, and hence stronger for causal inference.

  • \(Z\) is a potential mediator. A mediator (or intervening variable) is a variable that lies on the causal pathway from \(X\) to \(Y\). (More generally, variables that can be affected by \(X\) are sometimes known as post-treatment variables.) Physical activity (\(X\)) might improve depressive symptoms (\(Y\)) via the mediator of sleep quality (\(Z\)): when you exercise more, you sleep better; and when you sleep better, you feel less depressed. You should not include potential mediators in your model, as long as your estimand is the ACE of \(X\) on \(Y\). You should not include them because if you do, you are estimating something else: the part of the improvement in depressive symptoms with greater physical activity that is not due to improved sleep; or, what would happen to the average person’s depression if they did more activity but their sleep quality stayed the same (see the simulation sketch after this list).

  • \(Z\) is potentially affected by the outcome variable \(Y\). In this case, you should not include it. For example, you should never include \(Z\) variables like ‘how happy the participant felt after finishing the task’ if your outcome is task performance, because how happy they felt afterwards could be affected by how well they performed. Including variables that are affected by \(Y\) in the model can undermine your ability to get an unbiased estimate of the ACE of \(X\) on \(Y\). In the case where \(Z\) is affected by both \(X\) and \(Y\), this is called collider bias.

  • \(Z\) is an important determinant of \(Y\), but not related to \(X\). Sometimes you have cases where a variable represents an important causal determinant of \(Y\), but cannot causally affect \(X\), or be affected by \(X\). Such a \(Z\) is neither a confounder nor a mediator. An example would be including Age as an additional predictor in model m1 of this chapter. Age cannot affect Condition, since assignment to conditions was done at random. Age cannot be affected by Condition, or indeed by SSRT. But Age does explain a chunk of variation in SSRT. Variables like this do not have to be included, but it can be a good idea to do so. Doing so does not change the estimand, but it can improve the precision of the estimate (i.e. make the standard error smaller). It does this by ‘soaking up’ variation in \(Y\) that has nothing to do with \(X\). (In the current dataset, including Age in the model barely improves the standard error of the parameter estimate for Condition, but there are cases where it helps quite a bit.)
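To make the confounder and mediator scenarios concrete, here is a minimal simulation sketch. The variable names and effect sizes are invented purely for illustration. In the first block, education drives both Proust-reading and wages, so the unadjusted regression shows a spurious association that disappears once education is controlled for. In the second block, activity affects mood directly (0.3) and also via sleep (0.5 × 0.4 = 0.2), so regressing mood on activity alone recovers the total effect of about 0.5, while adding the mediator recovers only the direct part:

# Confounder case (invented example): education affects both Proust-reading and wages.
set.seed(1)
n <- 10000
education <- rnorm(n)
proust    <- 0.6 * education + rnorm(n)              # X: caused by the confounder, no effect on wages
wages     <- 0.8 * education + rnorm(n)              # Y: caused by the confounder only
coef(lm(wages ~ proust))["proust"]                   # spuriously non-zero: omitted variable bias
coef(lm(wages ~ proust + education))["proust"]       # close to 0: controlling removes the bias

# Mediator case (invented example): activity affects mood directly and via sleep.
activity <- rnorm(n)
sleep    <- 0.5 * activity + rnorm(n)                # Z: mediator, affected by X
mood     <- 0.3 * activity + 0.4 * sleep + rnorm(n)  # Y: direct effect 0.3; total 0.3 + 0.5*0.4 = 0.5
coef(lm(mood ~ activity))["activity"]                # close to 0.5: the total (ACE) effect
coef(lm(mood ~ activity + sleep))["activity"]        # close to 0.3: the direct part only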

To summarise this section, a few recommendations. First, be sparing in which covariates you include; err on the side of leaving them out, and defend this position against annoying referees who will always think it would be better if you ‘controlled for more things’ (I will shortly give you citations from really clever people to aid your defence!). Second, do some reading on causal identification, bad control and the principles of which variables to include: there are some excellent guidelines out there (Cinelli et al., 2024; Montgomery et al., 2018; Rohrer, 2018). Third, decide up front which covariates you intend to include, as part of the analysis strategy in your pre-registration (see chapter 12); and when you do this, give an explicit justification for what you include and what you exclude (so-and-so is a confounder and so we will include it; such-and-such is a potential mediator and so we will not) (Wysocki et al., 2022). Finally, explore the sensitivity or robustness of your conclusions to the inclusion or exclusion of the covariates. In other words, how different would the conclusions be if you were to exclude rather than include these variables? Sensitivity analysis is the topic of chapter 9.
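As a very first taste of that, here is a small sketch using the models already fitted in this chapter: refit the model without the two covariates and compare the Condition estimate and its standard error across the two specifications. (Note that if Age or GRT contain missing values, the two models will be fitted on slightly different subsets of the data.)

m3_no_covariates <- lm(SSRT ~ Condition + Deprivation_Score_centred, data = d)
rbind(
  with_covariates    = summary(m3)$coefficients["ConditionNegative", 1:2],
  without_covariates = summary(m3_no_covariates)$coefficients["ConditionNegative", 1:2])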

References

Cinelli, C., Forney, A., & Pearl, J. (2024). A Crash Course in Good and Bad Controls. Sociological Methods and Research, 53(3), 1071–1104. https://doi.org/10.1177/00491241221099552
Montgomery, J. M., Nyhan, B., & Torres, M. (2018). How Conditioning on Posttreatment Variables Can Ruin Your Experiment and What to Do about It. American Journal of Political Science, 62(3), 760–775. https://doi.org/10.1111/ajps.12357
Rohrer, J. M. (2018). Thinking Clearly About Correlations and Causation: Graphical Causal Models for Observational Data. Advances in Methods and Practices in Psychological Science, 1(1), 27–42. https://doi.org/10.1177/2515245917745629
Wysocki, A. C., Lawson, K. M., & Rhemtulla, M. (2022). Statistical Control Requires Causal Justification. Advances in Methods and Practices in Psychological Science, 5(2), 25152459221095823. https://doi.org/10.1177/25152459221095823