3.4 General Linear Models with multiple predictors
3.4.1 A multiple-predictor model of behavioural inhibition
Both our models so far have had a single predictor variable (Condition for model m1; Age for model m2). Often, though, you will want to consider several predictors at the same time. In the behavioural inhibition paper, the researchers had such a situation, because they wanted to consider the impact on SSRT of their experimental manipulation, Condition, and of childhood socioeconomic deprivation (Deprivation_Score): they had hypotheses about both. They also wanted to account statistically for two covariates, Age and GRT. They were not actually interested in these from the point of view of the research questions, but they thought they might account for additional variation in SSRT scores. You can do all this with a single model.
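Let’s now fit this model. First, we will centre the continuous predictors:
library(dplyr)  # provides %>% and mutate(); skip if the tidyverse is already loaded
# Subtract each variable's mean so that zero corresponds to the sample average
d <- d %>% mutate(
  Deprivation_Score_centred = Deprivation_Score - mean(Deprivation_Score, na.rm = TRUE),
  Age_centred = Age - mean(Age, na.rm = TRUE),
  GRT_centred = GRT - mean(GRT, na.rm = TRUE))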
Now let’s run the model and get its summary. We put all the predictor variables on the right-hand side of the formula, separated by ‘+’ signs:
m3 <- lm(SSRT ~ Condition + Deprivation_Score_centred + Age_centred + GRT_centred, data = d)
summary(m3)$coefficients
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                234.7493     7.7250  30.388 8.92e-35
## ConditionNegative            7.6383    10.7889   0.708 4.82e-01
## Deprivation_Score_centred   52.3777    19.7332   2.654 1.05e-02
## Age_centred                  1.1483     0.4163   2.759 7.99e-03
## GRT_centred                 -0.0796     0.0383  -2.075 4.30e-02
Now we have a parameter estimate (and standard error) for each of the variables we thought might affect or be associated with SSRT. You interpret these parameter estimates in the same basic way as for models m1 and m2. However, you may notice some differences: the intercept of m3 does not have exactly the same value as the intercept of m1 or m2. The parameter estimate for the effect of Condition on SSRT is not quite identical in m1 and m3; and the parameter estimate for Age is not identical in m2 and m3. What is going on? Surely the association between SSRT and Age is the association between SSRT and Age; why should it be changed by what other variables we choose to look at as well?
In fact, the association between SSRT and Age that we are estimating in m3 is not the same association between SSRT and Age as the one we are estimating in m2. In a model with multiple predictors, the parameter estimates are called partial coefficients. Each one estimates the effect on the outcome variable of a one-unit change in its predictor, on the assumption that all the other variables in the model remain unchanged. The intercept represents the value of the outcome when all of the predictor variables are zero (or, for a categorical predictor such as Condition, at its reference level), and therefore changes according to which predictors are included.
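The point about the intercept is also why centring matters. Here is a minimal sketch on simulated data (the variable names age, y and age_centred, and all the numbers, are made up for illustration and are not from the behavioural inhibition dataset). It suggests that centring a predictor leaves its slope untouched and only moves the intercept, from the predicted outcome at zero to the predicted outcome at the mean:

```r
set.seed(3)
# Simulated, hypothetical data: age in years and a noisy outcome
age <- runif(100, min = 18, max = 70)
y   <- 200 + 1.5 * age + rnorm(100, sd = 20)
age_centred <- age - mean(age)

m_raw     <- lm(y ~ age)          # intercept = predicted y at age 0
m_centred <- lm(y ~ age_centred)  # intercept = predicted y at the mean age

coef(m_raw)
coef(m_centred)

# The slope is unchanged by centring; only the intercept moves
slopes_match <- all.equal(unname(coef(m_raw)["age"]),
                          unname(coef(m_centred)["age_centred"]))
# With a single centred predictor, the intercept equals the mean of the outcome
intercept_is_mean <- all.equal(unname(coef(m_centred)["(Intercept)"]), mean(y))
```

An intercept at age 0 is an extrapolation well outside the data; an intercept at the mean age is usually easier to interpret.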
In the model where Age was the only predictor, the coefficient for Age represented the change in SSRT that would be expected if Age increased by a year. In the model m3, the coefficient for Age represents the change in SSRT that would be expected if Age increased by a year while Condition, Deprivation_Score, and GRT stayed the same.
If Age were perfectly uncorrelated with Condition, Deprivation_Score, and GRT, then the coefficients for Age in m2 and m3 would be identical. They would be identical because, when Age changes by one unit, nothing else would change. But in fact, Age is somewhat correlated with the other variables. In particular, it is moderately positively correlated with GRT. This means that older people also tend to be slower to respond overall. You can verify this positive correlation:
## [1] 0.495
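A correlation like this can be obtained with cor(). A minimal sketch, using a small made-up data frame d_demo as a stand-in for d (the values are invented, so the result will not match the 0.495 above); the use = "complete.obs" argument drops rows where either variable is missing:

```r
# Hypothetical stand-in for the chapter's data frame (values invented for illustration)
d_demo <- data.frame(
  Age = c(20, 25, 31, 38, 44, NA, 52, 60),
  GRT = c(410, 395, 450, 430, 480, 470, NA, 510)
)

# Pearson correlation, using only rows where both variables are observed
r <- cor(d_demo$Age, d_demo$GRT, use = "complete.obs")
round(r, 3)
```

With the real data, the same call on d$Age and d$GRT would be one way to run the check.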
GRT is, for its part, negatively associated with SSRT (check this on the summary(m3) output). So, in this dataset, when Age increases by one year, two things happen relevant to SSRT:
the direct association with Age: SSRT goes up;
the indirect association: GRT also goes up, because GRT is positively correlated with Age; and a higher GRT reduces SSRT.
These two things partially offset one another: the direct association causes SSRT to increase with age, and the indirect one causes SSRT to decrease a bit with age. In model m2, we were estimating the sum of the direct and indirect associations: the total impact of Age on SSRT that comes from both processes. That’s because model m2 estimates the total, raw association between Age and SSRT. In model m3, we are estimating only the direct association. The partial coefficient for Age in m3 is an estimate of the effect of a change in Age where GRT does not change, and hence the indirect association does not happen. This is why the coefficient for Age in m3 is actually a bit bigger than the coefficient for Age in m2; the model estimates the impact of an increase in Age on SSRT in the case where GRT does not change at all, thus eliminating the indirect association from consideration. It imagines people getting older, as it were, without also getting slower in GRT.
3.4.2 Which predictor variables should you include?
The lesson from the previous section is that the predictor variables you include in your model should depend on exactly what you are trying to find out. Unfortunately, many disciplines have got into the habit of including as many ‘control variables’ as they can get hold of, on the apparent assumption that controlling for more things is always better than controlling for fewer things. This is not true!
On the contrary, a parameter estimate from a model with more predictors is not a better estimate than one from a model with fewer predictors; it is just an estimate of a different estimand, and perhaps not the one you actually care about. You can end up with a bad estimator of the quantity you actually care about (an estimator that gives a biased, or systematically wrong, estimate), through the well-intentioned inclusion of other variables that you think might help. This is known as the problem of bad control.
When variables are directly mentioned in your research questions, it is pretty straightforward that they should be included. If your research question is ‘how does self-esteem vary with age and sex?’, then of course age and sex both go in as predictors (though you might also want to consider interactions between them, see section 5.5, and the possibility that the association between age and self-esteem is non-linear).
More problematic are cases where there is a covariate (like Age or GRT here) that is not mentioned in the research question, but still plausibly related to the outcome. Should this variable go in, or not? Here, we need to think about two things: what your estimand is (the thing you are trying to find out about); and what the causal relationships are between the covariate and the outcome and predictor(s) of direct interest.
There are several possible scenarios. Let’s denote the predictor of interest \(X\), the outcome \(Y\), and the candidate covariate \(Z\). We will assume that your estimand is the ACE of \(X\) on \(Y\).
\(Z\) is a potential confounder. This means that \(Z\) could have a causal impact on \(X\) and, separately, a causal impact on \(Y\). Confounders should generally go into the model. If you think that reading Proust (\(X\)) might increase wages (\(Y\)), then you should control for level of education (\(Z\)), because people who are more educated might read more Proust, and also (separately) get higher-paying jobs. The association between reading Proust and wages without controlling for level of education is a biased estimate of the (probably non-existent) ACE of reading Proust on wages. This is a form of bias called omitted variable bias, and it is dealt with by controlling for the omitted confounder. Of course, better still than controlling for confounders statistically is designing experimental studies in which you manipulate \(X\) without changing \(Z\). These are de-confounded by design, and hence stronger for causal inference.
\(Z\) is a potential mediator. A mediator (or intervening variable) is a variable that lies on the causal pathway from \(X\) to \(Y\). (More generally, variables that can be affected by \(X\) are sometimes known as post-treatment variables.) Physical activity (\(X\)) might improve depressive symptoms (\(Y\)) via the mediator of sleep quality (\(Z\)): when you exercise more, you sleep better; and when you sleep better, you feel less depressed. You should not include potential mediators in your model, as long as your estimand is the ACE of \(X\) on \(Y\). If you do include them, you are estimating something else: the part of the improvement in depressive symptoms with greater physical activity that is not due to improved sleep; or, what would happen to the average person’s depression if they did more activity but their sleep quality stayed the same.
\(Z\) is potentially affected by the outcome variable \(Y\). In this case, you should not include it. For example, you should never include \(Z\) variables like ‘how happy the participant felt after finishing the task’ if your outcome is task performance, because how happy they felt afterwards could be affected by how well they performed. Including variables that are affected by \(Y\) in the model can undermine your ability to get an unbiased estimate of the ACE of \(X\) on \(Y\). In the case where \(Z\) is affected by both \(X\) and \(Y\), \(Z\) is a collider, and the resulting distortion is called collider bias.
\(Z\) is an important determinant of \(Y\), but not related to \(X\). Sometimes a variable represents an important causal determinant of \(Y\), but cannot causally affect \(X\), or be affected by \(X\). Such a \(Z\) is neither a confounder nor a mediator. An example would be including Age as an additional predictor in model m1 of this chapter. Age cannot affect Condition, since assignment to conditions was done at random. Age cannot be affected by Condition, or indeed by SSRT. But Age does explain a chunk of variation in SSRT. You don’t have to include variables like this, but it can be a good idea to do so. Doing so does not change the estimand, but it can improve the precision of the estimate (i.e. make the standard error smaller). It does this by ‘soaking up’ variation in \(Y\) that has nothing to do with \(X\). (In the current dataset, including Age in the model barely improves the standard error of the parameter estimate for Condition, but there are cases where it helps quite a bit.)
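The four scenarios above can be explored by simulation. The sketch below is only an illustration under stated assumptions: the data are simulated, the helper functions x_row() and compare() are hypothetical names introduced here, and the effect sizes are arbitrary. For each scenario it fits the model with and without \(Z\) and reports the estimate and standard error for \(X\):

```r
set.seed(42)
n <- 5000

# Helpers (hypothetical names): estimate and standard error for X in a fitted model
x_row <- function(model) {
  summary(model)$coefficients["X", c("Estimate", "Std. Error")]
}
compare <- function(unadjusted, adjusted) {
  rbind(unadjusted = x_row(unadjusted), adjusted = x_row(adjusted))
}

# 1. Confounder: Z causes both X and Y; the true effect of X on Y is 0
Z <- rnorm(n); X <- Z + rnorm(n); Y <- Z + rnorm(n)
confounder <- compare(lm(Y ~ X), lm(Y ~ X + Z))

# 2. Mediator: X affects Y only through Z; the total effect of X on Y is 0.5
X <- rnorm(n); Z <- X + rnorm(n); Y <- 0.5 * Z + rnorm(n)
mediator <- compare(lm(Y ~ X), lm(Y ~ X + Z))

# 3. Collider: Z is caused by both X and Y; the true effect of X on Y is 0.5
X <- rnorm(n); Y <- 0.5 * X + rnorm(n); Z <- X + Y + rnorm(n)
collider <- compare(lm(Y ~ X), lm(Y ~ X + Z))

# 4. Z strongly predicts Y but is unrelated to a randomised X; true effect 0.5
X <- rbinom(n, 1, 0.5); Z <- rnorm(n); Y <- 0.5 * X + 2 * Z + rnorm(n)
precision <- compare(lm(Y ~ X), lm(Y ~ X + Z))

list(confounder = confounder, mediator = mediator,
     collider = collider, precision = precision)
```

In this setup, adjusting for \(Z\) should pull the estimate towards the true value in scenario 1, away from it in scenarios 2 and 3, and leave it roughly where it was, with a smaller standard error, in scenario 4.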
To summarise this section, a few recommendations. First, be sparing in which covariates you include; err on the side of leaving them out, and defend this position against annoying referees who will always think it would be better if you ‘controlled for more things’ (I will shortly give you citations from really clever people to aid your defence!). Second, do some reading on causal identification, bad control and the principles of which variables to include: there are some excellent guidelines out there (Cinelli et al., 2024; Montgomery et al., 2018; Rohrer, 2018). Third, decide up front which covariates you intend to include, as part of the analysis strategy in your pre-registration (see chapter 12); and when you do this, give an explicit justification for what you include and what you exclude (so-and-so is a confounder and so we will include it, such-and-such is a potential mediator and so we will not) (Wysocki et al., 2022). Finally, explore the sensitivity or robustness of your conclusions to the inclusion or exclusion of the covariates. In other words, how different would the conclusions be if you were to exclude rather than include these variables? Sensitivity analysis is the topic of chapter 9.