7.2 Problematic hypothesis tests using individual coefficients
7.2.1 Reload the Nettle and Saxe data
For the first part of the chapter, we will continue to analyse the data from Nettle and Saxe’s study on intuitions about social sharing (Nettle & Saxe, 2020). If you want to revisit the information about the study, it is in section 6.5.1.
Let’s load the data again (study1.data.csv; you may well already have it saved locally from chapter 6) and convert the independent variables into factors.
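Here is a minimal sketch of this step in R. It assumes the following:

- the file is a plain CSV in the working directory;
- its columns include luck and heterogeneity (the names used in this chapter's model formulas);
- both are stored as text labels such as High/Low/Medium.

The helper name load_study1 is hypothetical.

```r
# Hypothetical helper (not from the original chapter): read the study 1 file
# and convert the two independent variables into factors.
# Assumes columns named luck and heterogeneity, stored as text labels.
load_study1 <- function(path = "study1.data.csv") {
  d <- read.csv(path)
  # factor() orders levels alphabetically, so High becomes the reference
  # level of luck and Heterogeneous the reference level of heterogeneity
  d$luck <- factor(d$luck)
  d$heterogeneity <- factor(d$heterogeneity)
  d
}
```

Calling d1 <- load_study1() would then give the data frame used below. Under these assumptions, luck starts out with High as its reference level. That is consistent with the luckLow and luckMedium coefficients in the first model summary.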
7.2.2 First problem: No single test for a variable with three levels
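We are going to fit our Linear Mixed Model as we did in the previous session: m1, with formula level ~ luck + heterogeneity + (1|participant) and data d1. We need lmerTest loaded, remember, because it is what makes summary() report degrees of freedom and p-values for each coefficient.
Let’s look at the coefficients of the model, using summary(m1)$coefficients: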
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 59.27 2.22 326 26.66 6.57e-84
## luckLow -16.56 2.21 497 -7.48 3.29e-13
## luckMedium -11.95 2.21 497 -5.40 1.02e-07
## hetHomogeneous 3.67 1.81 497 2.03 4.27e-02
One of the predictions of the study was that the luck variable should affect the mean of the DV (level). If you look at the model summary, there are two significance tests that relate to this prediction: the one associated with the coefficient for luckLow, and the one associated with the coefficient for luckMedium. Respectively, these test the prediction that the mean of level in the Low condition differed from that in the reference category High, and that the mean of level in the Medium condition differed from that in the reference category High. As it happens, both are significant, so that looks like support for the prediction. But the test of that single prediction is distributed across two coefficients. What happens if one of them is significant and the other not? Is the prediction then supported or not?
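Each coefficient compares one level with the reference level because of how R turns a factor into model columns. Under R's default treatment coding (contr.treatment), a three-level factor becomes two 0/1 indicator columns, and the reference level is the one with zeros in both. The illustration below uses base R's contrasts() on a toy factor rather than on d1.

```r
# R's default coding for an unordered factor is treatment (dummy) coding:
# one column per non-reference level; the reference level's row is all zeros
luck <- factor(c("High", "Low", "Medium"))
contrasts(luck)
#        Low Medium
# High     0      0
# Low      1      0
# Medium   0      1

# After relevel(), the columns (and hence the coefficients) compare with Medium
contrasts(relevel(luck, ref = "Medium"))
#        High Low
# Medium    0   0
# High      1   0
# Low       0   1
```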
It gets worse, because which (if any) of the significance tests returns ‘significant’ depends on which level of luck we choose as the reference level. Try this:
# Make Medium the reference level of luck, then refit the same additive model
d1$luck <- relevel(d1$luck, ref="Medium")
m1bis <- lmer(level ~ luck + heterogeneity + (1|participant), data=d1)
summary(m1bis)$coefficients
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 47.32 2.22 326 21.29 1.22e-63
## luckHigh 11.95 2.21 497 5.40 1.02e-07
## luckLow -4.61 2.21 497 -2.08 3.79e-02
## hetHomogeneous 3.67 1.81 497 2.03 4.27e-02
In this particular case, all the tests come out significant whatever reference level you choose. But you will see that the p-values for the luck variable are not the same in m1 and m1bis. For example, the coefficient labelled luckLow has p = 3.29e-13 in m1 but p = 0.038 in m1bis. This is disquieting: which is the right one?
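Both summaries come from the same fitted model, just parametrized differently. As a quick check, the m1bis coefficients can be recovered by hand from the m1 output. The values below are copied from the tables above, which are rounded to two decimals. The heterogeneity coefficient (3.67) is unchanged because, in an additive model, it does not depend on how luck is coded.

```r
# Coefficients copied from the m1 output above (reference level of luck: High)
m1_intercept  <- 59.27
m1_luckLow    <- -16.56
m1_luckMedium <- -11.95

# The same quantities re-expressed with Medium as the reference level
m1bis_intercept <- m1_intercept + m1_luckMedium  # Medium, at the reference level of heterogeneity
m1bis_luckHigh  <- -m1_luckMedium                # High minus Medium
m1bis_luckLow   <- m1_luckLow - m1_luckMedium    # Low minus Medium

round(c(m1bis_intercept, m1bis_luckHigh, m1bis_luckLow), 2)
# [1] 47.32 11.95 -4.61
```

These match the m1bis table. What changes between the two summaries is only which pairwise comparison each individual test asks about.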
The problem is that, in the summaries of m1 and m1bis, the significance tests associated with luck are actually testing a different prediction than the one we want to test. They are testing, for each level of luck, ‘does this level differ significantly from the reference level?’ But what we actually wanted was a test of the prediction ‘the levels of luck differ from one another more than expected under the null hypothesis’; or, if you like, ‘at least one level of luck will be different from at least one of the others’. The prediction did not specify which one would differ from which other, only that luck would make some difference. This problem arises whenever you have a qualitative or ordinal predictor variable with more than two levels.
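The difference between these two kinds of test can be seen in a small simulation. This is a sketch under simplifying assumptions:

- the data are made up, with the same 3 x 2 structure as the study;
- the fits are ordinary lm() models without the random effect for participant, so the sketch runs with base R alone.

The per-coefficient p-values change with the reference level. One possible way to get a single test of luck as a whole is to compare the model with and without luck, here via anova() on two nested lm() fits. That comparison gives the same p-value whichever level is the reference.

```r
set.seed(1)

# Hypothetical balanced design: 3 luck levels x 2 heterogeneity levels,
# 20 rows per cell, with made-up effect sizes (not the Nettle & Saxe data)
sim <- expand.grid(luck = c("High", "Low", "Medium"),
                   heterogeneity = c("Heterogeneous", "Homogeneous"),
                   rep = 1:20)
luck_shift <- c(High = 10, Low = -5, Medium = 0)
sim$level <- 50 + unname(luck_shift[as.character(sim$luck)]) +
  rnorm(nrow(sim), sd = 15)

# Fit the additive model with a given reference level for luck and return
# (a) the per-coefficient p-values for luck and (b) one p-value for luck
# as a whole, from comparing the model with and without luck
luck_tests <- function(ref) {
  d <- sim
  d$luck <- relevel(d$luck, ref = ref)
  full    <- lm(level ~ luck + heterogeneity, data = d)
  reduced <- lm(level ~ heterogeneity, data = d)
  ct <- summary(full)$coefficients
  list(
    per_coefficient = ct[grep("^luck", rownames(ct)), "Pr(>|t|)"],
    whole_variable  = anova(reduced, full)[["Pr(>F)"]][2]
  )
}

by_high   <- luck_tests("High")
by_medium <- luck_tests("Medium")
by_high$per_coefficient    # luckLow, luckMedium
by_medium$per_coefficient  # luckHigh, luckLow: different comparisons, different p-values
c(by_high$whole_variable, by_medium$whole_variable)  # the same value twice
```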
7.2.3 Second problem: Main effects are hard to interpret when the model contains interactions
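Model m1 only includes the additive effects of luck and heterogeneity. But in an experimental study with two IVs, it would be more usual to include the interaction between them in the model, in case one IV modifies the effect of the other. Let’s fit that model, m2 (formula level ~ luck*heterogeneity + (1|participant), data d1), and view its coefficients. Note that luck still has Medium as its reference level from the relevel() call above, which is why the output shows luckHigh and luckLow: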
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 48.10 2.57 449 18.745 2.54e-58
## luckHigh 10.70 3.13 495 3.415 6.90e-04
## luckLow -5.71 3.13 495 -1.822 6.90e-02
## hetHomogeneous 2.10 3.13 495 0.670 5.03e-01
## luckHigh:hetHomogeneous 2.50 4.43 495 0.564 5.73e-01
## luckLow:hetHomogeneous 2.21 4.43 495 0.499 6.18e-01
The first thing you notice is that there are a lot of coefficients! The test of the prediction ‘the level of heterogeneity modifies the effect of luck’ is again spread across two different tests: the one for the interaction term involving luckLow, and the one for the interaction term involving luckHigh. But that is not our only problem.
You might be tempted to interpret the test of the heterogeneityHomogeneous coefficient (labelled hetHomogeneous in the output above) as a test of whether the heterogeneity variable affects the DV overall, i.e. ignoring the level of the luck variable. But that is not what it tests. In fact, it is a test of whether heterogeneity affects the DV specifically when luck is at its reference category (here Medium). That’s not the prediction the study authors pre-registered, and not usually the one you would be interested in.
This also has the unsettling consequence that if you change the reference category for luck, you get a different p-value for the main effect of heterogeneity. Try this:
# Make High the reference level of luck and refit the interaction model
d1$luck <- relevel(d1$luck, ref="High")
m2bis <- lmer(level ~ luck*heterogeneity + (1|participant), data=d1)
summary(m2bis)$coefficients
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 58.80 2.57 449 22.9148 1.65e-77
## luckMed -10.70 3.13 495 -3.4151 6.90e-04
## luckLow -16.41 3.13 495 -5.2375 2.41e-07
## hetHomogeneous 4.60 3.13 495 1.4682 1.43e-01
## luckMed:hetHomogeneous -2.50 4.43 495 -0.5642 5.73e-01
## luckLow:hetHomogeneous -0.29 4.43 495 -0.0654 9.48e-01
Depending on which reference level we set for luck, the p-value for the main effect of heterogeneity is about 0.5 (reference Medium), 0.17 (reference Low) or 0.14 (reference High). As well as being very different from one another, these are all non-significant. Yet in the additive model m1, the main effect of heterogeneity was significant, if only just (p = 0.043). This suggests that, on average across the levels of luck, heterogeneity does affect the DV.
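The 0.17 figure does not appear in either table above; it is what you get with Low as the reference level. It can be approximately recovered by hand from the m2 output. This assumes, as the identical 3.13 standard errors in m2 and m2bis suggest, that the standard error and degrees of freedom of the heterogeneity effect are the same at every level of luck. Because the inputs are rounded to two decimals, the result is only approximate.

```r
# Values copied from the m2 output above (reference level of luck: Medium)
b_het_medium <- 2.10  # hetHomogeneous: heterogeneity effect when luck is Medium
b_low_het    <- 2.21  # luckLow:hetHomogeneous: change in that effect when luck is Low
se_het       <- 3.13  # assumed identical for every level of luck
df_het       <- 495   # degrees of freedom reported for these terms

# Heterogeneity effect when luck is Low (m2bis gives the same: 4.60 - 0.29)
b_het_low <- b_het_medium + b_low_het
t_low     <- b_het_low / se_het
p_low     <- 2 * pt(-abs(t_low), df = df_het)  # two-sided p-value
round(p_low, 2)
# [1] 0.17
```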
So, we have identified more problems with reporting significance tests on individual coefficients. Critically, tests of the main effect of a variable in a model that also contains interactions involving that variable do not represent whether that variable has some effect overall. Rather, they represent the effect of that variable in the specific case where the other variables are all at their reference level (or zero, for continuous variables).
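This point can be checked directly on simulated data. The sketch below makes the following assumptions:

- the data are made up, in a balanced 3 x 2 design;
- an ordinary lm() fit stands in for the mixed model.

In a model with the full interaction, the fitted values are the cell means. The heterogeneity "main effect" coefficient is therefore exactly the difference between Homogeneous and Heterogeneous within the reference level of luck (High here). That generally differs from the effect averaged over the three luck conditions.

```r
set.seed(2)

# Hypothetical balanced 3 x 2 design with made-up numbers (not the study data)
sim <- expand.grid(luck = c("High", "Low", "Medium"),
                   heterogeneity = c("Heterogeneous", "Homogeneous"),
                   rep = 1:20)
sim$level <- 50 + rnorm(nrow(sim), sd = 15)

# Give heterogeneity a different effect in each luck condition (an interaction)
bump  <- c(High = 8, Low = 0, Medium = 2)
homog <- sim$heterogeneity == "Homogeneous"
sim$level[homog] <- sim$level[homog] + unname(bump[as.character(sim$luck[homog])])

# Interaction model; High is the reference level of luck
fit <- lm(level ~ luck * heterogeneity, data = sim)

# Cell means: rows are luck levels, columns are heterogeneity levels
cell <- tapply(sim$level, list(sim$luck, sim$heterogeneity), mean)
effect_within_high <- cell["High", "Homogeneous"] - cell["High", "Heterogeneous"]
effect_averaged    <- mean(cell[, "Homogeneous"] - cell[, "Heterogeneous"])

coef(fit)[["heterogeneityHomogeneous"]]  # equals effect_within_high
effect_within_high
effect_averaged                          # the effect averaged over luck: a different number
```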