8.2 The Akaike Information Criterion

The Akaike Information Criterion (AIC) is a quantity that can be calculated for any statistical model fitted by likelihood-based methods, whether a Linear Mixed Model, a General Linear Model, or a Generalized Linear Model. It is named after the Japanese mathematician Hirotugu Akaike, and it forms the basis of one widely used approach to model selection. The lower the AIC, the ‘better’ the model is considered to be. You can get the AIC for most fitted models in R using the function AIC().
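To make this concrete, here is a minimal sketch in R. The data frame d and the variables outcome and predictor are made up purely for illustration:

# Simulate a small dataset (purely for illustration)
set.seed(1)
d <- data.frame(predictor = rnorm(100))
d$outcome <- 2 + 0.5 * d$predictor + rnorm(100)

# Fit a General Linear Model and ask for its AIC
m <- lm(outcome ~ predictor, data = d)
AIC(m)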

The AIC is in fact the sum of two terms. One term is a measure of goodness of fit. This part of the AIC is a bit like the variance of the residuals: as a model explains more of the variance in the data, this part gets smaller. The other term is a penalty for the number of parameters in the model. As the model gets more complex, this term gets bigger and hence makes the AIC worse. Thus, model complexity is penalized. Of two models whose linear predictors predict the variation in the outcome about equally well, the one with more predictors will have the worse AIC, due to its larger penalty term. For a more complex model to be inferred to be the better model of the underlying data-generating process, it must account not just for additional variation, but for enough additional variation to overcome the penalty for its additional complexity. This is what defends us from overfitting.
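Concretely, the AIC is minus twice the maximised log-likelihood (the goodness-of-fit term) plus twice the number of estimated parameters (the penalty term). Continuing the illustrative sketch above, we can reconstruct the value returned by AIC() by hand:

# Goodness-of-fit term: -2 times the maximised log-likelihood
fit_term <- -2 * as.numeric(logLik(m))

# Penalty term: 2 times the number of estimated parameters
# (for an lm, this count includes the residual variance)
k <- attr(logLik(m), "df")
penalty_term <- 2 * k

fit_term + penalty_term   # equals AIC(m)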

Of a set of models, the one with the lowest AIC is the one that does best at predicting variation in the outcome in the current sample, given its complexity. More importantly, it can be shown mathematically that it is also the one most likely to do well at predicting variation in the outcome in novel samples from the same population. That is, selecting by AIC should find us the model that does best at producing reliable predictions about future samples, not just at explaining the variation in the data we already have. This is because the two components of the AIC are weighted in such a way as to optimally balance capturing the systematic forces against avoiding overfitting.
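A small simulation can illustrate this idea. The variable names, and the spurious predictor noise, are made up for illustration, and because the data are random the exact numbers will differ from run to run:

set.seed(2)
# Training sample and a novel sample from the same population
train <- data.frame(predictor = rnorm(100), noise = rnorm(100))
train$outcome <- 2 + 0.5 * train$predictor + rnorm(100)
test <- data.frame(predictor = rnorm(100), noise = rnorm(100))
test$outcome <- 2 + 0.5 * test$predictor + rnorm(100)

# A simpler and a more complex model, both fitted to the training sample
m_simple  <- lm(outcome ~ predictor, data = train)
m_complex <- lm(outcome ~ predictor + noise, data = train)

# Compare AICs in the training sample...
AIC(m_simple, m_complex)

# ...with prediction error in the novel sample
mean((test$outcome - predict(m_simple,  newdata = test))^2)
mean((test$outcome - predict(m_complex, newdata = test))^2)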

Model selection proceeds by defining a set of models of interest. Obviously, all the models must have the same outcome variable, and all must be fitted to exactly the same dataset. This has implications for datasets in which some values are missing for some variables; you need to fit all the models to the subset of cases that are non-missing on every variable appearing in any of the models. Having defined your set, you calculate AICs for all the models, and the smallest wins. Very often, though, several models have AICs that are almost the same. In that case, the software can give you an AIC weight for each model in the set. You can interpret this as the probability, given the data and the candidate set, that that particular model is the best one. You might for example have a situation where model m1 has an AIC weight of 0.72 and model m2 has an AIC weight of 0.28. In such a case, the data suggest a 72% probability that m1 is the best model, and a 28% probability that m2 is.
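In R, packages such as MuMIn provide functions for tabulating AICs and weights, but the arithmetic behind AIC weights is simple enough to sketch by hand (continuing with the hypothetical m_simple and m_complex from the simulation above):

# AIC weights for a set of candidate models
aics  <- AIC(m_simple, m_complex)$AIC
delta <- aics - min(aics)                         # difference from the best AIC
weights <- exp(-delta / 2) / sum(exp(-delta / 2)) # weights sum to 1
round(weights, 2)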

All of this will seem less abstract when we work through examples. We will have two: first, an example where we use model selection to adjudicate between multiple non-null hypotheses; and second, a case of using AIC in an exploratory manner.