8.1 Introduction
So far we have accompanied our statistical models with statistical tests to adjudicate on hypotheses: either null hypothesis significance tests, or equivalence tests. This is the standard approach in psychology and behavioural science (especially null hypothesis significance tests). Null hypothesis significance testing has useful applications, especially for data from experimental studies. But, it does not make sense for everything. In particular, there are two kinds of cases where its utility is quite low. People often use it in these cases, but there are better alternatives.
The first kind of case is where the researcher is entertaining multiple hypotheses, none of which is null, and wishes to say something about which hypothesis is best supported by the data. For example, consider how the diameters of the trunks of trees relate to their height. Obviously, taller trees tend to have thicker trunks. Thus, testing (and rejecting) the hypothesis that there is no relationship between trunk diameter and height is not very interesting. However, there are multiple possible hypotheses about how trunk diameter might relate to height. For example: is there a linear relationship; does height increase with the square root of diameter; is height related to diameter to the power of 2/3; and so on. Each of these possibilities is a hypothesis; you would want to evaluate each of these hypotheses against one another given the data.
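To make this concrete, here is a minimal sketch in Python of what comparing several non-null hypotheses might look like. The data are simulated purely for illustration, and the use of the statsmodels package is an assumption, not something taken from this book's materials.

```python
# Minimal sketch: three non-null hypotheses about how tree height relates
# to trunk diameter, fitted to simulated data (illustration only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
diameter = rng.uniform(0.1, 1.5, size=200)                  # trunk diameter (m)
height = 30 * diameter ** (2 / 3) + rng.normal(0, 2, 200)   # simulated 2/3 power law

# One transformed predictor per hypothesis
candidates = {
    "linear":      diameter,
    "square root": np.sqrt(diameter),
    "2/3 power":   diameter ** (2 / 3),
}

for name, x in candidates.items():
    fit = sm.OLS(height, sm.add_constant(x)).fit()
    # Every one of these fits will be 'significant'; the interesting question
    # is which functional form the data support best, which a null hypothesis
    # test against zero cannot tell us.
    print(f"{name:12s}  slope p-value = {fit.pvalues[1]:.2e}")
```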
People try to hang on to null hypothesis significance testing in cases like this by looking at which terms in different models are significant against the null hypothesis of zero, and which are not. But, difference of significance is not significance of difference (see section 5.6). Besides, this does not help if all of the models have significantly non-zero terms, as would probably be true in the tree trunk case.
The second kind of case is where you have observational data, quite a few potential predictor variables, and you want to know which combination of predictors best predicts the outcome. This situation is common in more exploratory research. Because there are multiple predictors, there are many different models you could test: all the different combinations of all the predictors, their interactions, and so on. Which model should you think of as the best model? Which one should you take to represent the causal forces that really generated this data; and, hence, which one is likely to predict the outcome in future replication samples?
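To get a sense of how quickly the space of candidate models grows, here is a small back-of-the-envelope calculation (plain Python, not tied to any particular dataset):

```python
# How many candidate models are there with k predictors?
# Counting only main effects, each predictor is either in or out: 2**k models.
from math import comb

for k in (5, 10, 15):
    main_effects_only = 2 ** k
    # Allowing every two-way interaction term as well enlarges the space enormously.
    with_pairwise_terms = 2 ** (k + comb(k, 2))
    print(f"k = {k:2d}: {main_effects_only:6d} main-effect models, "
          f"~{with_pairwise_terms:.2e} including two-way interaction terms")
```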
Here, people sometimes put all the possible predictors in the model, and then prune it based on which terms are significant or not (i.e. remove the non-significant terms and leave the significant ones). This is a bad idea for several reasons. Because terms in linear models are partial coefficients (section 3.4), the coefficients that remain don’t retain the same magnitude or the same interpretation when other terms in the model get removed. You end up effectively doing many statistical tests, increasing the likelihood of retaining terms through false positives; plus your estimates are biased by the way you have selected which terms to include.
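The false-positive problem is easy to see in a simulation. The sketch below, again assuming simulated data and the statsmodels package, fits twenty predictors that have no real relationship with the outcome and counts how many a prune-by-significance rule would keep.

```python
# Minimal simulation of the false-positive problem with significance-based pruning.
# Twenty predictors, none of which actually influences the outcome.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n, k = 200, 20
X = rng.normal(size=(n, k))
y = rng.normal(size=n)                   # outcome is pure noise: no true effects

fit = sm.OLS(y, sm.add_constant(X)).fit()
kept = (fit.pvalues[1:] < 0.05).sum()    # terms a pruning rule would retain
print(f"'Significant' predictors kept: {kept} of {k} (about 1 expected by chance alone)")
```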
Alternatively, people keep playing with the terms included in the model until they find the combination for which the variance of the residuals is the smallest (or, equivalently, the coefficient of determination, \(R^2\), is the biggest). \(R^2\) is the proportion of the variance in your outcome measure that is statistically explained by the combination of all your predictors; it is inversely related to the variance of the residuals. The problem with this is that it leads to overly complex models. A set of predictors must generate an \(R^2\) that is at least as big as the \(R^2\) generated by any of its subsets. Often it will be a little bit bigger just by chance. So looking for a bigger \(R^2\) will always lead you to add more and more predictors, many of which might not be important determinants in the world, and whose associations will not replicate in future samples. This is known as overfitting.
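The following short simulation illustrates the point. It again uses made-up data and the statsmodels package (an assumption on my part): one genuine predictor plus batches of pure-noise predictors, with \(R^2\) creeping upward each time noise is added.

```python
# Minimal sketch of overfitting: R-squared can only go up as predictors are added,
# even when the added predictors are pure noise. Data simulated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
x_real = rng.normal(size=n)
y = 2 * x_real + rng.normal(size=n)      # one genuine predictor of the outcome

noise = rng.normal(size=(n, 15))         # 15 useless predictors, added in batches
for extra in (0, 5, 10, 15):
    X_full = np.column_stack([x_real, noise[:, :extra]])
    r2 = sm.OLS(y, sm.add_constant(X_full)).fit().rsquared
    print(f"{1 + extra:2d} predictors: R^2 = {r2:.3f}")
```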
This chapter introduces an alternative approach to inference about the best model: model selection, or information-theoretic model selection. Model selection allows you to test between multiple non-null hypotheses, and to find the best model in exploratory analysis of datasets with many predictor variables. Model selection does not involve p-values or significance tests in the way you are used to, so we will have to learn some new terminology.