7.2 Problematic hypothesis tests using individual coefficients

7.2.1 Reload the Nettle and Saxe data

For the first part of the chapter, we will continue to analyse the data from Nettle and Saxe’s study on intuitions about social sharing (Nettle & Saxe, 2020). If you want to revisit the information about the study, it is in section 6.5.1.

Let’s get the data file again (study1.data.csv; you may well already have it saved locally from chapter 6), load it in, and convert the independent variables into factors.

library(tidyverse)
d1 <- read_csv("study1.data.csv") %>%
  mutate(luck=as.factor(luck), 
         heterogeneity=as.factor(heterogeneity))
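
If you want to check how R has coded the factors, you can look at their levels; by default R orders the levels alphabetically and treats the first one as the reference category, so High will be the reference level for luck. This is just an optional check:

# Optional check: the first level listed is the reference category
levels(d1$luck)
levels(d1$heterogeneity)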

7.2.2 First problem: No single test for a variable with three levels

We are going to fit our Linear Mixed Model as we did in the previous session. We need lmerTest, remember:

library(lmerTest)
m1 <- lmer(level ~ luck + heterogeneity + (1|participant), data=d1)

Let’s look at the coefficients of the model:

summary(m1)$coefficients
##                          Estimate Std. Error  df t value Pr(>|t|)
## (Intercept)                 59.27       2.22 326   26.66 6.57e-84
## luckLow                    -16.56       2.21 497   -7.48 3.29e-13
## luckMedium                 -11.95       2.21 497   -5.40 1.02e-07
## heterogeneityHomogeneous     3.67       1.81 497    2.03 4.27e-02

One of the predictions of the study was that the level of the luck variable should affect the mean of the DV (level). If you look at the model summary, there are two significance tests that relate to this prediction: the one associated with the coefficient for luckLow, and the one associated with the coefficient for luckMedium. Respectively, these test the predictions that the mean of level in the Low condition was different from the reference category High; and that the mean of level in the Medium condition was different from the reference category High. As it happens, both are significant, so that looks like support for the prediction. But, the test of that single prediction is distributed across two coefficients. What happens if one of them is significant and the other not? Is the prediction then supported or not?
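
To see why the prediction ends up spread across two coefficients, it can help to look at the design matrix that R builds from the luck factor. Here is a minimal sketch, using only base R and the pipe we have already loaded:

# A three-level factor is expanded into two 0/1 dummy variables, one for
# each non-reference level; the reference level (High) is absorbed into
# the intercept, so 'the effect of luck' requires two coefficients.
model.matrix(~ luck, data=d1) %>% head()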

It gets worse, because which (if any) of the significance tests returns ‘significant’ depends on which level of luck we choose as the reference level. Try this:

d1$luck <- relevel(d1$luck, ref="Medium")
m1bis <- lmer(level ~ luck + heterogeneity + (1|participant), data=d1)
summary(m1bis)$coefficients
##                          Estimate Std. Error  df t value Pr(>|t|)
## (Intercept)                 47.32       2.22 326   21.29 1.22e-63
## luckHigh                    11.95       2.21 497    5.40 1.02e-07
## luckLow                     -4.61       2.21 497   -2.08 3.79e-02
## heterogeneityHomogeneous     3.67       1.81 497    2.03 4.27e-02

In this particular case, all the tests come out significant whatever reference level you choose. But, you will see that the p-values for the variable luck are not the same in m1 and m1bis. This is disquieting: which is the right one?

The problem is that, in the summaries of m1 and m1bis, the significance tests associated with luck are actually testing a different prediction than the one we want to test. They are testing, for each level of luck, ‘does this level differ significantly from the reference level?’ But what we actually wanted was a test of the prediction ‘the levels of luck differ from one another more than expected under the null hypothesis’; or, if you like, ‘at least one level of luck will be different from at least one of the others’. The prediction did not specify which one would differ from which other, only that luck would make some difference. This problem arises whenever you have a qualitative or ordinal predictor variable with more than two levels.
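
What we would like instead is a single joint test of both luck coefficients at once. As a preview of one way to obtain such a test (we are not relying on it here, and the output is not shown), lmerTest provides an anova() method for fitted models that returns one F-test per predictor variable:

# One possible joint test: a single F-test for luck as a whole, rather than
# separate t-tests of each level against the reference level.
anova(m1)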

7.2.3 Second problem: Main effects are hard to interpret when the model contains interactions

Model m1 only includes the additive effects of luck and heterogeneity. But in an experimental study with two IVs, it would be more normal to include the interaction between them in the model, in case one IV modifies the effect of the other. Let’s fit that model and view its summary:

m2 <- lmer(level ~ luck*heterogeneity + (1|participant), data=d1)
summary(m2)$coefficients
##                                   Estimate Std. Error  df t value Pr(>|t|)
## (Intercept)                          48.10       2.57 449  18.745 2.54e-58
## luckHigh                             10.70       3.13 495   3.415 6.90e-04
## luckLow                              -5.71       3.13 495  -1.822 6.90e-02
## heterogeneityHomogeneous              2.10       3.13 495   0.670 5.03e-01
## luckHigh:heterogeneityHomogeneous     2.50       4.43 495   0.564 5.73e-01
## luckLow:heterogeneityHomogeneous      2.21       4.43 495   0.499 6.18e-01

The first thing you notice is that there are a lot of coefficients! The test of the prediction 'the level of heterogeneity modifies the effect of luck' is again spread across two different tests, the one involving luckLow and the one involving luckHigh. But that is not our only problem.

You might be tempted to interpret the test of the heterogeneityHomogeneous coefficient as a test of whether the heterogeneity variable affects the DV overall, i.e. ignoring the level of the luck variable. But it is not this. In fact, it is a test of whether heterogeneity affects the DV specifically when luck is at its reference category. That’s not the prediction the study authors pre-registered, and not usually the one you would be interested in.

This also has the unsettling consequence that if you change the reference category for luck, you get a different p-value for the main effect of heterogeneity. Try this:

d1$luck <- relevel(d1$luck, ref="High")
m2bis <- lmer(level ~ luck*heterogeneity + (1|participant), data=d1)
summary(m2bis)$coefficients
##                                     Estimate Std. Error  df t value Pr(>|t|)
## (Intercept)                            58.80       2.57 449 22.9148 1.65e-77
## luckMedium                            -10.70       3.13 495 -3.4151 6.90e-04
## luckLow                               -16.41       3.13 495 -5.2375 2.41e-07
## heterogeneityHomogeneous                4.60       3.13 495  1.4682 1.43e-01
## luckMedium:heterogeneityHomogeneous    -2.50       4.43 495 -0.5642 5.73e-01
## luckLow:heterogeneityHomogeneous       -0.29       4.43 495 -0.0654 9.48e-01
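
Before comparing the p-values, notice how the two heterogeneity coefficients relate to one another: the value in m2bis (4.60, with High as the reference level of luck) is simply the value in m2 (2.10, with Medium as the reference) plus the luckHigh interaction term (2.50). You can check this from the fitted models, for example with fixef() from lme4 (a quick check, not part of the original analysis):

# The heterogeneity coefficient with High as the luck reference (m2bis) equals
# the heterogeneity coefficient with Medium as the reference (m2) plus the
# luckHigh interaction term: 2.10 + 2.50 = 4.60.
fixef(m2)["heterogeneityHomogeneous"] + fixef(m2)["luckHigh:heterogeneityHomogeneous"]
fixef(m2bis)["heterogeneityHomogeneous"]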

Depending on which reference level we set for luck, the p-value for the 'main effect' of heterogeneity is about 0.50 (with Medium as the reference, as in m2), 0.14 (with High, as in m2bis), or 0.17 (with Low, not shown here). As well as being very different from one another, these are all non-significant. Yet, in the additive model m1, the effect of heterogeneity was significant (p = 0.043), which suggests that, on average across the levels of luck, heterogeneity does affect the DV.

So, we have identified more problems with reporting significance tests on individual coefficients. Critically, tests of the main effect of a variable in a model that also contains interactions involving that variable do not represent whether that variable has some effect overall. Rather, they represent the effect of that variable in the specific case where the other variables are all at their reference level (or zero, for continuous variables).

References

Nettle, D., & Saxe, R. (2020). Preferences for redistribution are sensitive to perceived luck, social homogeneity, war and scarcity. Cognition, 198, 104234. https://doi.org/10.1016/j.cognition.2020.104234