8.4 Model selection for exploratory analysis
In the previous example, we used model selection to test between several competing hypotheses about the association between income and depressive symptoms. Here, we want to use it for a more exploratory case. Let's take a different outcome variable in the same dataset, anxiety symptoms. Our question is going to be a rather open-ended one: what factors predict how anxious someone is?
Even for an exploratory case, you must still have an a priori candidate list of predictors you might consider. Here, our list is going to include income again; as the threshold model did so well in the previous section, we will consider household income as a binary variable, above or below 3000 euros. In addition, we will consider age, gender, and social support. Social support is a construct concerning whether a person feels cared for and has assistance available in their social network. It seems plausible that having such support would reduce anxiety.
As well as the main effects of the variables, we are going to consider some possible interactions between them. First, there are potential interactions with gender: perhaps having low income, low social support, or being young affects men and women differently. We are also going to allow for an interaction between social support and income threshold, since there is some evidence that being on a low income is not so bad if you have good social support to help compensate (Davis et al., 2025). Considering these interactions means that even with just four possible predictor variables, the number of potential models is quite large. We are going to simplify the situation slightly, by removing participants whose gender is self-described, or who preferred not to give one. This is not discrimination against non-traditional genders, just a reflection of the unfortunate fact that with just 6 cases in the data, we will not be able to estimate parameters concerning them with any precision.
8.4.1 Setting up the data and model selection
We need to reaggregate the data. The outcome variable this time, anxiety symptoms, comes from a variable in the dataset called GAD, after the Generalized Anxiety Disorder scale on which it is based. We will then remove genders other than Man or Woman, recalculate the income threshold variable, and remove any non-complete cases.
a <- d %>% group_by(pid) %>%
  summarise(anxiety.symptoms = mean(GAD, na.rm=T),
            income = mean(incomec, na.rm=T),
            gender = first(gender),
            age = mean(med.age, na.rm=T),
            social.support = mean(social.support, na.rm=T)) %>%
  subset(gender=="Man"|gender=="Woman") %>%
  mutate(threshold.met = (income>3000)) %>%
  na.omit()   # drop non-complete cases
And then, because our model includes interaction terms this time, it is important that we centre our continuous variables:
a <- a %>% mutate(age.c = age - mean(age),
social.support.c = social.support - mean(social.support))
As there are so many models to consider, it would be tedious to write them all out as we did in the previous example. Instead, MuMIn allows you to define the maximally complex model of your set, called the global model. There is a function called dredge() which fits all possible sub-models of the global model (i.e. all possible models that can be made by removing one or more terms from the global model). Let's set up our global model (note how the interactions are specified):
big.g <- lm(sqrt(anxiety.symptoms) ~ age.c + gender + threshold.met + social.support.c +
gender:age.c +
gender:threshold.met +
gender:social.support.c +
social.support.c:threshold.met,
data=a)
And now, get the model selection table by running dredge() on big.g. (You will still need to have MuMIn loaded, and the na.action option set to na.fail, from the previous section.)
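Here is a minimal sketch of the calls involved (the object name anxiety.models is my choice, not part of the original analysis):
library(MuMIn)                    # if not already loaded from the previous section
options(na.action = "na.fail")    # dredge() will refuse to run under the default na.omit
anxiety.models <- dredge(big.g)   # fits every sub-model of big.g
anxiety.models                    # prints the full model selection table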
I have not shown the output, since, as you will see if you run it, there are 49 models. However, many of them receive essentially no weight, and can be ignored. What we do in such cases is retain only a 'best models' subset: those models that come within 2 AIC units of the best one.
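In MuMIn, this subsetting can be done directly on the model selection table; a sketch, assuming the table is stored in anxiety.models as above:
best.models <- subset(anxiety.models, delta < 2)   # keep models within 2 units of the best
best.models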
Now we have five models in the running, with rather evenly distributed weight. Although this looks like a confusing picture, the best five models have a lot in common. For example, they all include age, gender and social support. The next thing to do in such a case is to get the variable importance weights. The variable importance weight for a given variable is the total AIC weight going to models that include that variable. If the variable importance weight is 1, then the variable is obviously important: all of the weight goes to models containing it. If it is 0, then it is not important at all: no weight goes to models including it. Let's get our variable importance weights:
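MuMIn's sw() function computes these sums of Akaike weights for each term; a sketch, applied here to the retained best models subset defined above:
sw(best.models)   # sum of Akaike weights across the retained models containing each term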
All five of the best models contain age, gender and social.support (variable importance weights of 1); two of them contain the income variable threshold.met (variable importance weight 0.42). There is some support for gender modifying the effect of age (2 models, variable importance 0.36), and a little for gender modifying the effect of social support (one model, 0.12).
So, despite there being support for several models, there are some things we can clearly conclude from this analysis of this dataset. There is resounding support for the claim that age matters for anxiety; that gender matters for anxiety; and that social support matters for anxiety. It is possible that gender modifies the effects of age and social support, but the evidence for this is quite weak. What is interesting is that the evidence that having enough income is important is rather weak compared to the evidence that age, gender, and having social support matter. (This conclusion is not much different if we use log(income) rather than threshold.met, by the way; a sketch of that alternative specification is given below.) I would report all of the variables that get any importance weight, perhaps in a table, making clear the graded nature of the evidential strength.
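Here is a sketch of what the alternative global model might look like with centred log income in place of the threshold variable; the name log.income.c is my own, and the rest of the procedure (dredge, subsetting, importance weights) would be unchanged:
a <- a %>% mutate(log.income.c = log(income) - mean(log(income)))

big.g.log <- lm(sqrt(anxiety.symptoms) ~ age.c + gender + log.income.c + social.support.c +
                  gender:age.c +
                  gender:log.income.c +
                  gender:social.support.c +
                  social.support.c:log.income.c,
                data=a)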
The other thing you would want to do is to get model-average parameter estimates for the variables which had any weight. This is because, as well as saying that there is evidence for associations with age, gender and social support, you want to indicate the magnitudes and directions of association. I would report these alongside the variable importance weights.
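A sketch of how these could be obtained with MuMIn, using the names assumed above; model.avg() averages the coefficients across the retained models, weighting each by its Akaike weight:
averaged <- model.avg(anxiety.models, subset = delta < 2)
coef(averaged)      # model-averaged parameter estimates
confint(averaged)   # and their confidence intervals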
##       (Intercept)             age.c
##           1.92402          -0.02315
##       genderWoman  social.support.c
##           0.34478          -0.24296
## threshold.metTRUE age.c:genderWoman
##          -0.05365          -0.00356
## genderWoman:social.support.c
##                      0.00294
And their confidence intervals:
##                                 2.5 %    97.5 %
## (Intercept)                    1.7700   2.07802
## age.c                         -0.0337  -0.01259
## genderWoman                    0.1698   0.51972
## social.support.c              -0.2894  -0.19654
## threshold.metTRUE             -0.3071   0.05435
## age.c:genderWoman             -0.0269   0.00717
## genderWoman:social.support.c  -0.0589   0.10854
There you go. I urge you to be judicious in your use of dredge(). It's such a powerful tool that you can find yourself going off into it like a child in a soft play area. You should still let your goals and your theoretical understanding - your understanding of the relevant possible causal relationships - delimit the variables and variable combinations that you will consider. And you should still pre-register your aims, rationale and analysis strategy. You don't want to dredge() unthinkingly. We return to the question of when to use model selection and when to use null hypothesis significance testing on a pre-selected model in section 12.3.