9.1 Introduction
When you analyse a dataset, you have many choices in how to specify your statistical model. These choices have been described as a garden of forking paths. Let’s think back to the ‘tomboys’ data set from chapter 6, the last session (Atkinson et al., 2017). The authors were interested in whether someone was a tomboy or not, so the outcome variable (called yesOrNo in the data set, coded 1 if the person was called a tomboy and 0 if not) is pretty clear. But there are actually several ways they could have tested the hypothesis that they stated.
First, there is the main predictor, the 2D:4D ratio. They measured both hands. So, they could use the ratio for the right hand; or the ratio for the left hand; or the average of the ratios of the right and left hands (these variables are all in the data set, as rightHand, LeftHand and Averagehand respectively). So the first forking path in the garden is which hand to use. Then, there is the decision of whether to log transform whichever 2D:4D measure we settle on. The authors decided they should log transform, but they might not have done, and in fact in chapter 6, we did not. Also in the paper, the authors consider whether the covariates Age and Ethnicity might need to be included. In fact there are four possibilities as far as covariates are concerned: include neither Age nor Ethnicity; Age but not Ethnicity; Ethnicity but not Age; or both Age and Ethnicity.
So, when we think about it, there are actually 24 ways the data could have been analysed to test this simple prediction (at least 24; there are probably other variants too that we have not discussed). The number 24 comes from: 3 different 2D:4D measures, times 2 for either log transforming or not, times 4 for the combinations of covariates included or not included. Each of these ways is called a model specification.
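To see where the 24 comes from, we can enumerate the combinations explicitly. Here is a minimal sketch in R, using the variable names from the data set:

```r
# Enumerate every combination of the three analytical choices.
specs <- expand.grid(
  hand       = c("rightHand", "LeftHand", "Averagehand"),
  log        = c(FALSE, TRUE),
  covariates = c("none", "Age", "Ethnicity", "Age + Ethnicity"),
  stringsAsFactors = FALSE
)
nrow(specs)  # 3 hands x 2 transforms x 4 covariate sets = 24
```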
The existence of 24 reasonable specifications is a problem, because every one of those 24 ways represents another ‘go’ at getting a significant p-value. Obviously, if you run enough tests, some of them will come up significant even if the hypothesis is false (by definition, one in 20 will on average if the threshold is p = 0.05). So, the fact that there are many places you could go in the garden of forking paths inflates the chance of finding a significant p-value somewhere; and if researcher decisions about which places to go in the garden (i.e. which analyses to report) are driven by whether they find significant results there, then claims to have found a ‘significant’ result don’t tell us much about whether the hypothesis is actually true or not.
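To get a sense of the scale of the inflation: if the 24 specifications were independent tests (they are not, since they all reuse the same data, so this is only an upper-bound illustration), the probability of getting at least one significant result from a true null hypothesis would be:

```r
# Chance of at least one p < 0.05 across 24 independent tests of a
# true null hypothesis (an upper bound; our 24 specifications share
# the same data, so they are far from independent).
1 - 0.95^24  # approximately 0.71
```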
The researchers in the tomboy study present us with only some of the 24 possible model specifications, finding a ‘significant’ result (p just a bit less than 0.05), and interpreting this as evidence in support of their hypothesis. We really want to know which of the following situations we are in. Is it: (a) any of the 24 possible specifications would lead to the conclusion of a significant association between tomboy status and 2D:4D; (b) only a very few of the 24 possible specifications would lead to that conclusion, and the researchers are presenting just those; or (c) somewhere in between, a good number but not all of the specifications would lead to that conclusion.
If we are in world (a), then we can be fairly confident that the result is not a fluke. We would have come to the same conclusion however we had specified our model, and the hypothesis looks to be supported by our dataset. If we are in world (b), where only a few specifications produce the conclusion, we should be more circumspect. The association may be a false positive, and the inference that it is real is only supported under very specific assumptions, which we would have to interrogate. And if we are in world (c), then it is interesting to know what proportion of possible specifications lead to the same conclusion, and which modelling choices make the difference. This information is really going to help us evaluate the strength of the evidence presented. The investigation of the sensitivity of conclusions to specification choices is called sensitivity analysis.
Researchers have long known that some conclusions change when the specification of the model is altered, and that it is important to communicate the extent to which this is the case. For this reason, papers will often try out different specifications to convince you that specification choices don’t matter for the conclusions. In the paper on the tomboy study, for example, the authors start by presenting a simple model using just the logged right-hand variable; then they experiment with adding age and show that this makes no difference; and then they experiment with adding ethnicity (without age) and show that this makes no difference either. Finally, they mention that you do not get the same conclusion using the left hand. What they don’t do, however, is systematically try out all the logically possible combinations, or visualize the results of doing so.
You will often read papers that present a model 1 with no covariates, then a model 2 that adds some covariates, a model 3 that adds still more, and so on. Or papers that specify their outcome variable one way, and then another way. But this is still a rather improvised way of exploring the sensitivity of conclusions to analytical decisions. A more modern and systematic approach is called specification curve analysis, the topic of this chapter.
The principle of specification curve analysis is very simple. You work out the total set of model specifications that you might reasonably have chosen. For example, here, we have the 24 possibilities made up by the combinations of which hand to use, whether or not to log transform, and which set of covariates to include. Then, you run all 24 of these models, and you visualize the parameter estimate and its confidence interval for each one. In particular, this allows you to identify which features of the specification matter for the parameter estimate. For example, log transforming the predictor might be critical to your conclusion, but including age as a covariate might make no difference at all. In the next section, we will run a simple specification curve analysis for the tomboy data.
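Before we work through it properly, here is a minimal sketch of what such an analysis involves, in base R. It assumes the data are in a data frame called tomboys (an assumed name) and that, as in chapter 6, each model is a logistic regression of yesOrNo; the next section develops this step by step:

```r
# A minimal sketch: run all 24 specifications and plot the estimates.
# Assumes a data frame called `tomboys` with the variables named above.
specs <- expand.grid(
  hand       = c("rightHand", "LeftHand", "Averagehand"),
  log        = c(FALSE, TRUE),
  covariates = c("", "+ Age", "+ Ethnicity", "+ Age + Ethnicity"),
  stringsAsFactors = FALSE
)

results <- do.call(rbind, lapply(seq_len(nrow(specs)), function(i) {
  predictor <- if (specs$log[i]) paste0("log(", specs$hand[i], ")") else specs$hand[i]
  f <- as.formula(paste("yesOrNo ~", predictor, specs$covariates[i]))
  m <- glm(f, data = tomboys, family = binomial)
  est <- coef(summary(m))[2, ]  # the row for the 2D:4D predictor
  data.frame(estimate = unname(est["Estimate"]),
             lower = unname(est["Estimate"] - 1.96 * est["Std. Error"]),  # approximate
             upper = unname(est["Estimate"] + 1.96 * est["Std. Error"]))  # 95% Wald CI
}))

# The 'specification curve': estimates ranked from smallest to largest,
# each with its confidence interval, so we can see how many specifications
# exclude zero (no association) and which way the estimates point.
results <- results[order(results$estimate), ]
plot(results$estimate, pch = 16,
     ylim = range(results$lower, results$upper),
     xlab = "Specification (ranked by estimate)", ylab = "Estimate (log odds)")
arrows(seq_len(nrow(results)), results$lower,
       seq_len(nrow(results)), results$upper,
       angle = 90, code = 3, length = 0.02)
abline(h = 0, lty = 2)
```

The details of the plot can be elaborated considerably; the point here is simply that all 24 models are treated symmetrically, rather than a favoured few being reported.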