3.1 Introduction
The previous chapter got us as far as looking at some data in R and calculating descriptive statistics. Most data analysis is about something a bit more than describing the dataset we have in front of us, though. It is about inferring, from that dataset, what might be true of the world more generally: for example, whether our hypotheses about the world seem likely to be true in the light of the evidence we have gathered. To do this kind of ‘going beyond’ the data, we use a set of techniques known as inferential statistics. Inferential statistics are a set of computations that allow us to say what, in the light of our data, we should best believe about the broader world of which our data are a snapshot; and what the margin of error on those beliefs should be.
If you have done a previous course in inferential statistics, you probably learned a laundry list of different statistical tests, such as the t-test, ANOVA, regression, ANCOVA, and so on. About the only thing these tests seemed to have in common was that they spat out a p-value, and you could say whether the result was ‘significant’ or not.
This book takes a different approach. Inferential statistics always involves assuming a statistical model of the processes in the world that lie behind your observations. This is true even in very simple cases. You use your data to infer the likely values of the parameters in this model. These inferred values are called parameter estimates. As well as the parameter estimates themselves, you can compute their imprecision, or margin of error. This margin of error represents just how wrong you could be about the processes in the world given the observations that you have made. It becomes smaller as you gather more data. Inferential statistics is about using data to infer parameter estimates and their margins of error.
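To make this concrete, here is a minimal sketch in R using simulated data (the variable names height and weight, and all the numbers, are made up purely for illustration). We fit a simple model and look at the parameter estimates alongside their margins of error:

```r
# Simulate a small dataset: weight depends on height, plus noise.
set.seed(42)
n <- 50
height <- rnorm(n, mean = 170, sd = 10)
weight <- 0.5 * height + rnorm(n, mean = 0, sd = 5)

# Fit a model and inspect the parameter estimates.
m <- lm(weight ~ height)
summary(m)$coefficients  # estimates with their standard errors
confint(m)               # 95% confidence intervals (margins of error)
```

If you increase n in the simulation and re-run it, you will see the ‘Std. Error’ column and the confidence intervals shrink: more data means a smaller margin of error on the same parameters.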
All those different tests you have heard of - t-tests, regression, ANCOVA - arise from fitting the same family of statistical models to your data: the General Linear Model. This book teaches you how to fit and interpret a General Linear Model without getting excited about whether we want to call what we are doing ‘a t-test’ or ‘multiple regression’. Instead, I want to focus on understanding the numbers that the modelling spits out. The General Linear Model can’t do quite everything. It is suitable for outcome variables that are continuous. When our outcomes are binary (something happened or did not), or discrete counts (the number of times something happened), we need extensions of the GLM that belong to the family known as Generalized Linear Models (see chapter 6). This is a terminological nightmare for two reasons: Generalized Linear Model acronymizes as GLM, just like the General Linear Model does, so when people abbreviate it is unclear which one they mean; and, the General Linear Model actually belongs to the class of Generalized Linear Models. I know, I know.
Sometimes people use the abbreviation GLIM for the generalized version and GLM for the general. But sometimes people use GLM for the generalized one. R does this: its function for fitting Generalized Linear Models is called glm(), whilst the function for a General Linear Model is called lm(). In this book, I will avoid acronyms, and always spell out whether I mean a vanilla General Linear Model (lm()) or a Generalized one (glm()).
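A brief sketch may help fix the distinction, again using simulated data invented purely for illustration. Note how lm() handles a continuous outcome, while glm() with a family argument handles a binary one; and fitting glm() with family = gaussian reproduces lm() exactly, which is the sense in which the General Linear Model belongs to the Generalized family:

```r
set.seed(1)
x <- rnorm(100)
y_cont <- 2 + 3 * x + rnorm(100)           # a continuous outcome
y_bin  <- rbinom(100, 1, plogis(0.5 * x))  # a binary outcome

m1 <- lm(y_cont ~ x)                       # General Linear Model
m2 <- glm(y_bin ~ x, family = binomial)    # Generalized Linear Model
m3 <- glm(y_cont ~ x, family = gaussian)   # same estimates as m1

coef(m1)
coef(m3)  # identical to coef(m1): lm() is a special case of glm()
```

The family argument to glm() is what chooses the extension: binomial for binary outcomes, poisson for counts, gaussian for the plain General Linear Model.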