3.3 General Linear Models with a single predictor
The referendum opinion poll example was very simple: the parameter we wanted to estimate was just a proportion in the population. In research, the parameters you want to estimate are more often the effects of a change in some IV or predictor variable on some DV or outcome variable. That’s the kind of parameter you find in the General Linear Model. You want to be able to apply a General Linear Model to real data sets, so in this section, we will work again on the behavioural inhibition dataset we met in chapter 2.5.
3.3.1 Loading in the behavioural inhibition data, again
First we are going to load the data in, as we previously did. Your script to do this should look like this:
# Script to analyse behavioural inhibition data
# Load up tidyverse
library(tidyverse)
# Read in the data
d <- read_csv("https://bit.ly/inhibitiondata")
# Rename the first column
colnames(d)[1] <- "Participant"
# Recode the Condition variable with nicer labels
d <- d %>% mutate(Condition = case_when(
  Mood_induction_condition == 1 ~ "Negative",
  Mood_induction_condition == 2 ~ "Neutral"))
Run this: you should have a data frame d in your environment, with 58 observations of 14 variables.
3.3.2 A first parameter estimate
The experimental prediction in Paál et al. (2015) was about the difference in SSRT between people in the negative and neutral conditions. (We are leaving aside the other predictions, about socioeconomic deprivation and age, for now; we will return to them.) So, let us set up a model of this situation.
Let’s say that in the population, there is some average SSRT of people who are in neutral moods. We can represent this parameter with the symbol \(\beta_0\). So, in the population, the average SSRT of people in neutral moods is as follows:
\[E(SSRT_{neutral}) = \beta_0 \] What about people in negative moods? Their average SSRT is going to differ from the average SSRT of people in neutral moods by some amount, which we can capture with the parameter \(\beta_1\). We are not prejudging the question of whether SSRT is higher, lower, or the same for people in negative moods as compared with neutral moods. \(\beta_1\) might turn out to be equal to 0, in which case mood makes no difference to SSRT. But \(\beta_1\) might also turn out to be different from zero, in either direction. Under our model, then, the average SSRT of people in negative moods will be given by:
\[ E(SSRT_{negative}) = \beta_0 + \beta_1 \]
Putting this together, we can say that the expected value of someone’s SSRT is going to be:
\[ E(SSRT) = \beta_0 + Condition * \beta_1 \] Here, \(Condition\) represents a variable that takes the value 0 if their mood is neutral, and 1 if their mood has been made negative. What we want from our model, of course, is estimates of \(\beta_0\) and \(\beta_1\), plus the precision of those estimates. It turns out (again, I won’t go into the maths) that the best possible estimate I can make of \(\beta_0\) is the mean SSRT in the neutral condition of my sample; and the best possible estimate I can make of \(\beta_1\) is the difference in mean SSRTs between the neutral and negative conditions of my sample.
Let’s now see how this works by fitting a General Linear Model to the data. We do this with the R function lm():
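# Fit a General Linear Model predicting SSRT from Condition, and assign it to m1
m1 <- lm(SSRT ~ Condition, data = d)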
This says: fit a General Linear Model to the data in data frame d, in which the variable SSRT is predicted by the variable Condition; then assign this model to the object m1. You could call the model something else if you like; it’s up to you. Also, if you have named your data frame something other than d, then you will need to modify the lm() call appropriately.
Now we have our model object (it should have appeared in your Environment window), let us see what it contains. We do this with the function summary():
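summary(m1)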
##
## Call:
## lm(formula = SSRT ~ Condition, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.19 -23.37 -0.17 25.86 150.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 242.09 8.06 30.04 <2e-16 ***
## ConditionNeutral -7.57 11.60 -0.65 0.52
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.1 on 56 degrees of freedom
## Multiple R-squared: 0.00754, Adjusted R-squared: -0.0102
## F-statistic: 0.426 on 1 and 56 DF, p-value: 0.517
What does this summary tell us? There is a parameter estimate called Intercept, and one that estimates the effect of Condition being Neutral rather than Negative (-7.569). This is an inconvenient way round for our purposes. The way I set up the example before, I was treating SSRT in Neutral as the baseline case, and SSRT in Negative as the departure from this. But, because Negative comes before Neutral in alphabetical order, R has taken Negative as the baseline. Before we go any further, let’s fix this. Run the following line:
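# Make Condition a factor with Neutral as the first (reference) level
# (one way to do it; base R assignment to d$Condition would work equally well)
d <- d %>% mutate(Condition = factor(Condition, levels = c("Neutral", "Negative")))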
This says: treat Condition as a factor (a qualitative variable whose levels have a specified order) and specify the order of the levels as Neutral first and Negative second (this is also called setting Neutral as the reference category). Now let’s fit and summarise our model again. To save space, I will just show the coefficients part of the summary.
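# Refit the model with the releveled Condition, showing just the coefficients
m1 <- lm(SSRT ~ Condition, data = d)
summary(m1)$coefficients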
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 234.52 8.34 28.110 1.05e-34
## ConditionNegative 7.57 11.60 0.652 5.17e-01
Now Intercept represents an estimate of \(\beta_0\) as we originally defined it (i.e. average SSRT in neutral mood), and ConditionNegative represents \(\beta_1\), the difference in average SSRT when mood is negative instead of neutral. So, interpreting the first column, \(\beta_0\) is about 235 msec, and \(\beta_1\) about 8 msec. SSRT is estimated as a bit higher in negative mood, but only a tiny bit (8 msec more on a baseline of 235).
How do these numbers relate to the raw data? Let’s get the key descriptive statistics again:
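# Mean SSRT in each condition (the column name M matches the output below)
d %>% group_by(Condition) %>% summarise(M = mean(SSRT))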
## # A tibble: 2 × 2
## Condition M
## <fct> <dbl>
## 1 Neutral 234.5
## 2 Negative 242.1
We can see that the Intercept, the estimate of \(\beta_0\), is just the mean SSRT in the neutral condition, about 235; and the estimate of \(\beta_1\) is just the difference in mean SSRTs between the two conditions (about 8 after rounding).
Make sure you are clear on everything to this point. This is important stuff!
3.3.3 Bringing in imprecision
Now we need to get a sense from our model of how precise our parameter estimates are. The standard error of the estimates is reported in the second column of the model summary, entitled Std. Error. Here, for \(\beta_1\), the standard error is about 11, whereas the estimate itself is only about 8. In other words, we estimate \(\beta_1\) as 8 msec (SSRTs are 8 msec longer for people in bad moods in the population), but we acknowledge that the typical error of this estimate is 11. In other words, the true value could easily be 8 + 11 = 19, and could easily be 8 - 11 = -3. Alternatively, we can get the confidence interval for our estimate of \(\beta_1\). This is how you do it:
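confint(m1)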
## 2.5 % 97.5 %
## (Intercept) 217.8 251.2
## ConditionNegative -15.7 30.8
What this tells us is that, if we repeated the experiment again and again, we think that 95% of the time we would get an estimate of \(\beta_0\) between about 218 and 251, and an estimate of \(\beta_1\) between about -16 and +31. So, the bad news is that the effect of negative mood on SSRT could, in the light of our data, be negative, zero, or positive, since all of these possible scenarios are contained within the 95% confidence interval of the parameter estimate.
3.3.4 A General Linear Model with a continuous predictor
The model m1 had one predictor, experimental condition, which was binary (i.e. Negative versus Neutral). How do we fit a General Linear Model when our predictor is a continuous variable?
Everything is pretty much the same. Let’s consider the case of whether SSRT is predicted by the variable Age. We assume that there is some baseline value of SSRT at age zero, which we represent by the parameter \(\beta_0\). Then we assume that SSRT changes by an amount \(\beta_1\) with each additional year of age that passes (note, therefore, that we are assuming a linear relationship between age and SSRT, at least across the range that we are studying). So, the expected value of a person’s SSRT under this model is: \[E(SSRT) = \beta_0 + \beta_1 * Age\]
If \(\beta_1\) is a positive number, older people have higher SSRTs; if it is a negative number, older people have lower SSRTs; and if \(\beta_1 = 0\), then SSRT does not change with age. Note the slight difference in interpretation: for model m1 with a binary predictor, \(\beta_1\) represents the difference in expected SSRT when you go from Neutral to Negative; in this model where the predictor is continuous, \(\beta_1\) represents the difference in expected SSRT when Age increases by one unit (i.e., one year).
We fit this model using the lm() function in exactly the same way as before:
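# Fit a model predicting SSRT from Age, assign it to m2, and show the coefficients
m2 <- lm(SSRT ~ Age, data = d)
summary(m2)$coefficients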
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 208.145 13.750 15.14 2.90e-21
## Age 0.933 0.383 2.44 1.81e-02
The estimate of \(\beta_0\) is about 208, and the estimate of \(\beta_1\) is about 1, suggesting that SSRT goes up by about 1 msec with every year older a person is. Though m2 is fine, it is not perhaps expressed in the most easily interpretable way. \(\beta_0\) here represents the expected SSRT of someone with an age of 0. The age of 0 years is way outside the range of the data (the youngest participant is 19), so extrapolating the SSRT of someone with an age of 0 isn’t statistically very defensible. More importantly, a one-day-old baby could not possibly do the task anyway, so this parameter does not make sense. We would be better off setting our zero point for the Age variable somewhere else, such as at the average age of a member of the sample.
We do this by centring the Age variable. This means putting the zero value in the middle of the distribution, and expressing the other values as negative or positive deviations from the middle. When you centre a variable, you also have the option of scaling it. Scaling means dividing by the standard deviation, so that the standard deviation of the variable becomes 1. (Standardizing a variable means both centring and scaling it, so that it ends up with a mean of 0 and a standard deviation of 1.)
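To make the difference concrete, here is a quick toy illustration (the numbers are made up for the example):
x <- c(20, 30, 40, 50)
x - mean(x)            # centred: mean is now 0, spread unchanged
(x - mean(x)) / sd(x)  # centred and scaled, i.e. standardized: mean 0, SD 1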
In modelling data, I recommend always centring your continuous predictor variables, for a number of reasons, including avoiding parameter estimates that make no intuitive sense. This centring will become even more useful later when there are multiple predictor variables, and especially when there are interactions between them. Whether you should scale or not depends. Where the predictor has easily interpretable units, as with age, which has units of years, I would probably keep it unscaled, so the interpretation is intuitive. If you scale it, the parameter estimate for Age comes to represent the expected change in SSRT when Age changes by one standard deviation. The standard deviation of Age in this dataset is about 14.8 years, so the parameter estimate would represent ‘the amount SSRT changes when age increases by a bit less than 15 years’. The amount by which SSRT changes with every year older is simpler to explain.
So, let’s centre Age but not scale it. We do this by subtracting the mean of the variable, as shown below. You could also use the R function scale(), which can scale a variable, centre it, or both.
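# Centre Age by subtracting its mean (the name Age_centred matches the output below)
d <- d %>% mutate(Age_centred = Age - mean(Age))
# Equivalently: d %>% mutate(Age_centred = as.numeric(scale(Age, scale = FALSE)))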
Now rerun m2 and get the summary:
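m2 <- lm(SSRT ~ Age_centred, data = d)
summary(m2)$coefficients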
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 238.716 5.620 42.48 9.77e-44
## Age_centred 0.933 0.383 2.44 1.81e-02
The Intercept, which represents the expected SSRT of a person of average age, is now about 238. This is identical to the average SSRT in the sample (as you can verify using mean(d$SSRT) if you wish). And, as before, \(\beta_1\) represents the change in SSRT when age increases by a year (about 1 msec). Let’s get the confidence intervals:
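confint(m2)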
## 2.5 % 97.5 %
## (Intercept) 227.453 250.0
## Age_centred 0.165 1.7
The confidence interval for \(\beta_1\) ranges from about 0.2 to about 1.7. If we were to run the study 100 times, we would almost always get estimates in this range. Thus, though we are not sure exactly what \(\beta_1\) is, we are pretty sure it is a positive number: if we ran many studies, in almost all of them, SSRT would be higher on average in the older participants. The fact that zero is not in the confidence interval gives us grounds to believe that, in the big world of all humans, \(\beta_1 > 0\). It also gives us grounds to believe that \(\beta_1\) is not very big: because 10 is way outside the confidence interval, the data are incompatible with the hypothesis that SSRT increases by 10 msec on average with every year of age. We return to how to test hypotheses using the output of your General Linear Model in chapter 4.