6.5 Linear Mixed Models: Empirical example

6.5.1 The study by Nettle and Saxe (2020) and its data

Our example dataset comes from some of my own work in political cognition (Nettle & Saxe, 2020). Specifically, we are looking at study 1 of this 7-study paper. In the study, we were interested in people’s intuitions about how much of what people earn should be shared with other members of society, and how much individuals should get to keep for themselves. We studied this by asking 100 participants to imagine hypothetical villages in which different conditions obtained. In these villages, people grew crops in their gardens. The DV was the percentage of their crop that the villagers should be obliged to share out with other villagers, the rest being kept for themselves. The IVs were: the extent to which variation in garden productivity was due to luck rather than hard work (three levels: low importance of luck; medium importance; high importance); and the extent to which the villagers were culturally homogeneous (two levels: homogeneous and heterogeneous). This latter independent variable was intended to capture the hypothesis that people are parochial, i.e. they think one should share more with one’s own group than with outgroup members. Thus, there are 2 x 3 = 6 conditions overall (low luck homogeneous, low luck heterogeneous, medium luck homogeneous, etc.).

Manipulation of luck and heterogeneity was done within subjects, so every participant completed the DV measure six times in total. The predictions were that the greater the importance of luck, the more participants would think the villagers should share; and that participants would think villagers should share more in the homogeneous than in the heterogeneous villages.

6.5.2 Getting the data

The data for the study are available in the repository at: https://osf.io/xrqae/. It is the file ‘study1.data.csv’ that we are interested in. Download it from the repository and save it into your working directory. Now let’s read the data in. You should get a dataframe d with 600 observations.

library(tidyverse)
d <- read_csv("study1.data.csv")

6.5.3 Wide versus long format

You might be forgiven for imagining that you were going to get 100 rows of data, because there are 100 participants. In other words, you might have expected the first few rows of the data frame to look something like this:

##   Participant Low.Hom Low.Het Med.Hom Med.Het High.Hom High.Het
## 1           A      23      18      40      35       56       51
## 2           B      42      40      41      34       67       60
## 3           C      31      33      54      22       45       48

In fact, it looks something like this:

##          participant level   luck heterogeneity
## 1  R_0ijWqatE9iFA0Rr    80   High Heterogeneous
## 2  R_0ijWqatE9iFA0Rr    42    Low Heterogeneous
## 3  R_0ijWqatE9iFA0Rr    60 Medium Heterogeneous
## 4  R_0ijWqatE9iFA0Rr    70   High   Homogeneous
## 5  R_0ijWqatE9iFA0Rr    20    Low   Homogeneous
## 6  R_0ijWqatE9iFA0Rr    50 Medium   Homogeneous
## 7  R_0TAfv2g6dKUBf9v    80   High Heterogeneous
## 8  R_0TAfv2g6dKUBf9v    31    Low Heterogeneous
## 9  R_0TAfv2g6dKUBf9v    50 Medium Heterogeneous
## 10 R_0TAfv2g6dKUBf9v    50   High   Homogeneous
## 11 R_0TAfv2g6dKUBf9v    30    Low   Homogeneous
## 12 R_0TAfv2g6dKUBf9v    50 Medium   Homogeneous

The first format (1 row per participant, 6 columns corresponding to the DV in the 6 conditions) is called wide format. The format the data are actually in (1 row per trial, so 600 rows overall, with one column giving all the DV values, and three other columns identifying respectively: which participant it was; whether the luck independent variable was Low, Medium or High; and whether the village was Homogeneous or Heterogeneous) is called long format.

You will almost always want your data to be in long format. This is because in wide format, observations of the same variable (the DV) are spread over six different columns. And, there is no single column representing either of the IVs. In principle, you always want every variable in your study to be represented by exactly one column in the dataframe, and every observation of that variable to be in that column. This is known as the principle of tidy data (Wickham, 2014). Here, the long format is tidy but the wide format is not. (If you want to convert between wide format and long format, there are tidyverse functions called pivot_longer() and pivot_wider(). You can find examples of how to use these on the web.)
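For example, to turn the long-format dataframe d into the wide format shown above and back again, something like the following sketch should work (the column names are those in d; d_wide and d_long are just illustrative names):

```r
# From long to wide: one column per combination of luck and heterogeneity,
# with names like "High_Heterogeneous" (the default separator is "_")
d_wide <- d %>%
  pivot_wider(names_from = c(luck, heterogeneity), values_from = level)

# From wide back to long: gather the six condition columns again,
# splitting the column names back into the two IVs
d_long <- d_wide %>%
  pivot_longer(-participant,
               names_to = c("luck", "heterogeneity"),
               names_sep = "_", values_to = "level")
```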

6.5.4 Fitting the Linear Mixed Model

To recap, we now have a data frame d with 600 observations of the DV level. The independent variables are luck and heterogeneity. A further variable, participant, identifies which of the 100 participants each response came from.

We fit Linear Mixed Models using a contributed package called lmerTest. This is in fact a wrapper for another package, lme4, which actually fits the model; lmerTest just presents the results in a handier way. You will need to run install.packages('lmerTest'); this will also install lme4 if needed.
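In full, the setup is (run the install line only once):

```r
install.packages("lmerTest")  # also installs the dependency lme4 if needed
library(lmerTest)             # load the package for this session
```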

Let’s examine what seems to be going on in the data. Recall, the predictions were that level would be: higher when luck was more important; and lower when the village was heterogeneous than homogeneous. Let’s look at the means and standard deviations of level by condition:

d %>% group_by(luck, heterogeneity) %>%   
  summarise(M=mean(level), SD = sd(level))
## # A tibble: 6 × 4
## # Groups:   luck [3]
##   luck   heterogeneity     M    SD
##   <chr>  <chr>         <dbl> <dbl>
## 1 High   Heterogeneous  58.8  27.5
## 2 High   Homogeneous    63.4  25.5
## 3 Low    Heterogeneous  42.4  29.6
## 4 Low    Homogeneous    46.7  29.8
## 5 Medium Heterogeneous  48.1  20.9
## 6 Medium Homogeneous    50.2  18.6

It’s clear that the means are much higher when luck is High and lowest when luck is Low, so that manipulation seems to have worked. And, for each level of luck, the mean level is slightly higher for Homogeneous than for Heterogeneous: 63 versus 59; 50 versus 48; 47 versus 42. So there might be some support for the prediction concerning heterogeneity too. The trouble is that the standard deviations within each condition are really big (between 18 and 30), which rather dwarfs the small differences in the means by heterogeneity.

We can start by fitting our habitual General Linear Model:

g1 <- lm(level ~ luck + heterogeneity, data=d)
summary(g1)$coefficients
##                          Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                 59.27       2.09   28.33 1.80e-112
## luckLow                    -16.56       2.56   -6.46  2.16e-10
## luckMedium                 -11.95       2.56   -4.66  3.84e-06
## heterogeneityHomogeneous     3.67       2.09    1.75  7.99e-02

There are massively significant differences between the Low and Medium levels of luck and the reference category High. There is also a higher average level when the village is Homogeneous, as predicted, but the coefficient is small (3.67) compared to its standard error (2.09), and hence the p-value indicates we cannot conclude it is significantly different from zero.

However, what we really want to ask is: do participants increase their level rating from their own baseline when the village is Homogeneous? To do this, we have to inform the model of the clustered structure of the responses by participant, so that it can fit a random intercept to account for this. Here is where we use the lmer() function from the lmerTest package rather than our habitual lm() function.

library(lmerTest)
g2 <- lmer(level ~ luck + heterogeneity + (1|participant), data=d)
summary(g2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: level ~ luck + heterogeneity + (1 | participant)
##    Data: d
## 
## REML criterion at convergence: 5514
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4272 -0.6297  0.0056  0.6399  2.5827 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  participant (Intercept) 168      13.0    
##  Residual                489      22.1    
## Number of obs: 600, groups:  participant, 100
## 
## Fixed effects:
##                          Estimate Std. Error     df t value Pr(>|t|)    
## (Intercept)                 59.27       2.22 325.95   26.66  < 2e-16 ***
## luckLow                    -16.56       2.21 497.00   -7.48  3.3e-13 ***
## luckMedium                 -11.95       2.21 497.00   -5.40  1.0e-07 ***
## heterogeneityHomogeneous     3.67       1.81 497.00    2.03    0.043 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) luckLw lckMdm
## luckLow     -0.498              
## luckMedium  -0.498  0.500       
## htrgntyHmgn -0.406  0.000  0.000

The differences in the syntax are simple:

  • We now call lmer() instead of lm().

  • We have added a term + (1|participant) to the formula. This is the random effect term, and it says in effect ‘add a separate intercept (represented by the 1) for each participant, as identified in the variable called participant’.

As for the output, it looks pretty familiar too. The first thing you will notice is that the coefficients are all the same. Yes, the parameter estimates come from the data in the same way as before: 3.67 for heterogeneity being Homogeneous, for example. However, their standard errors are different. This is because the standard errors in g2 take account of the clustering in the data, whereas the standard errors in g1 did not.

In this case, the standard errors are smaller when we fit the Linear Mixed Model, because we are measuring the within-subjects deviations from the participant’s own baseline as the IVs vary. This means that the parameter estimate for heterogeneity has gone from being just non-significant to being just significant. The standard errors for the parameter estimates for the luck variable have also got smaller, though those were already so significant anyway that this does not change our interpretation.

There are some other additional features in the output of g2. The output of g1 informed us that the standard error of the residuals was 25.62. Squaring this, the variance of the residuals is therefore \(25.62^2 \approx 656\). So this means that the residuals, the bits of discrepancy left over after we fit all the predictors, have a variance of about 656 (their mean is zero by construction).
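You can check this directly in R; sigma() extracts the residual standard error from a fitted model:

```r
sigma(g1)    # residual standard error, about 25.6
sigma(g1)^2  # residual variance, about 656
```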

In g2, this residual variance is broken down into two parts. There is a variance of 167.9 between individuals; this is an indicator of how much the means of different participants differ from one another. And there is a variance of 489.2 that is described as Residual. This is how much residual variance is left over after the predictors have been fitted and the differences in average response between individuals have been accounted for. In other words, it captures the variation within individuals that we cannot account for with anything we know about. In this case, since \(168/656 \approx 0.26\), we can say that about 26% of the variation in the dataset not explained by the predictors was due to individuals having different baseline values of level, with the remaining 74% representing the unexplained noise within participants.
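We can extract these variance components from g2 and compute the proportion ourselves, using lme4's VarCorr() function (a sketch; the intermediate names are just illustrative):

```r
# Variance components as a dataframe: one row per random effect plus the residual
vc <- as.data.frame(VarCorr(g2))
between <- vc$vcov[vc$grp == "participant"]  # between-participant variance, ~168
within  <- vc$vcov[vc$grp == "Residual"]     # within-participant variance, ~489
between / (between + within)                 # proportion due to participants, ~0.26
```

This ratio is also known as the intraclass correlation coefficient.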

6.5.5 More complex random effects structures

We have dealt with a case where there was one source of clustering in the data: by participant. But often there will be more sources. For example, imagine a study where participants respond to visual images. Each participant responds to several images, so there is clustering by participant. The images are either faces, or (let’s say) flowers. But there are ten different face images, and ten different flower images. The experimenters will have included this replication in face images and in flower images to avoid over-generalizing on the basis of the response to one particular face image.

So here, responses are clustered both in terms of which participant they belonged to, and which individual image they were a response to. This would be easily captured in a mixed model. The syntax would be something like: m1 <- lmer(y ~ x + (1|participant) + (1|image), data=data). The estimation would automatically sort out the variance due to participant (people being characteristically high or low) and the variance due to image (certain images being more arresting than others), and estimate the effect of the predictor variables on top of this clustering.

In all the cases we have thought about so far, we have just wanted to add an intercept for each cluster (participant or image), to allow for the different baselines of different clusters. But sometimes you also want to estimate a cluster-specific response to some predictor variable. This case, which I will not deal with here, is called a random slopes model. Random slopes models often (though not only) occur when the predictor is time. For example, you might measure the height of children repeatedly as they grow, and want to allow for the fact that each individual child has a characteristic growth rate. Here you would fit each child with not just a random intercept (the children were different heights at the beginning), but also a random slope against time (the children were growing at characteristically different rates).

The syntax for fitting a random slopes model using lmer() is very similar to our cases so far, except that instead of + (1|participant) it will be something like + (time|participant).
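For the growth example, the model might look like this (a sketch, assuming hypothetical variables height, time and child in a dataframe called growth):

```r
# Random intercept and random slope of time for each child;
# (time | child) implicitly includes the intercept, i.e. (1 + time | child)
m2 <- lmer(height ~ time + (1 + time | child), data = growth)
```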

Sometimes there are random effects that you are not sure whether to put in or not. For example, let’s say that as well as some responses being from the same participants, some responses were collected on a particular day, or by a particular experimenter. Should day or experimenter be included as random effects, or not?

This is not always straightforward, but there are some rules of thumb. The first is that if there are very few levels of the potential clustering variable, don’t include a random effect for it. Say there are two experimenters. You might want to include experimenter as a covariate in the non-random part of the model equation, but don’t try to include a random effect. Random effects are better suited for the case where there are many levels of the clustering variable (as is true of participant in the data we looked at, for example). The second rule of thumb is that if your potential random effect variable does have many levels, it’s probably fine (and prudent) to include it in cases where you are not sure. If there is in fact no clustering by that variable in the data, the model including this random effect will produce almost identical results to the one excluding it, and the residual variance associated with that random effect will be close to or exactly zero. In lmerTest, if you get a red warning about a boundary (singular) fit, this is probably because your model contains a random effect that accounts for almost zero variation. If you remove that random effect (as you would be justified in doing, since that level of clustering accounts for no variation in the residuals), the warning will probably go away.

References

Nettle, D., & Saxe, R. (2020). Preferences for redistribution are sensitive to perceived luck, social homogeneity, war and scarcity. Cognition, 198, 104234. https://doi.org/10.1016/j.cognition.2020.104234
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59, 1–23. https://doi.org/10.18637/jss.v059.i10