12.1 Introduction

This final chapter brings together everything we have covered, focusing on good practices you should adopt when approaching data analysis. It’s a high-level overview: there is no dataset to analyse, just a few short illustrative snippets of R rather than a full worked analysis.

We know that, over the last few decades, the literatures of psychology and behavioural science have contained too many statistically significant findings, too few of which have corresponded to reliable discoveries about the world. We know this because of the rise of meta-science, the science of how science works. Meta-scientists scour the published literature to establish things like what proportion of reported results are significant and what kinds of analysis approaches people use. Importantly, they have also conducted mass replication studies, in which teams of researchers take a large set of experiments recently published in the leading journals of the field and repeat them as closely as they can. Almost all of the original experiments report statistically significant effects, and interpret those as justification for the claim that they have discovered some generalisable truth about the human mind or behaviour. If this interpretation were right, the same effects should be found in the replications.

One thing that comes out of these mass replications is how hard it is to work out what the original researchers actually did. Often their materials, raw data and analysis code are unavailable, and the laconic descriptions in the published papers are usually too thin to allow an exact repetition of the procedures. When the replicators do manage to repeat the experiments, many or most do not recover the original significant effect. In the most famous mass replication, 97 of the 100 target articles had reported significant effects, but statistically significant effects were re-found in only 35 of those 97 replications, and the replication effect sizes were, on average, about half those originally reported (Open Science Collaboration, 2015). Other mass replications have led to similar conclusions: most papers in the published literature claim some novel finding or discovery, but many or even most of these are false positives that don’t reveal any generalisable principle. There is no indication, by the way, that this is unique to psychology and behavioural science; the situation in biomedical science, for example, seems to be at least as bad.

We always knew that truth could not be definitively demonstrated by the results of a single experiment; not in sciences that deal with living organisms, anyway. The hope is that, in the long run and at the population level, science will be self-correcting: the true hypotheses will continue to show themselves fruitful through further studies (eventually there will be definitive meta-analyses), and the spurious ones will fall away. Nonetheless, it is epistemically inefficient to have such a high rate of spurious claims making it into the peer-reviewed journals of record. They take a long time to get weeded out, if they ever are, and in the meantime researchers spend their resources building on findings that were never reliable in the first place. We need to increase the reliability of the evidence within the single paper, and data analysis has a big role to play here. This chapter is about ways of doing your data analysis that increase the chance that any claim you make turns out to be an epistemically reliable one.

It’s worth spending a bit more time on how such a high rate of false positives comes about. Actually, it should not surprise us that many statistically significant findings from single experiments don’t replicate (see section 4.2.2). With p < 0.05 as the criterion, the probability of a false positive when there is really nothing going on is one in twenty per test; whereas the prior probability that any given study has hit on a novel true thing about the mind could be much lower (it’s hard to say exactly how much). So, given that an individual result is statistically significant, it may be more likely to have come from the false-positive compartment than the true-new-discovery one (Ioannidis, 2005). And even when there is something going on, if the sample size is modest the observed effect size will vary a lot from sample to sample, so it will sometimes fall on one side of the significance threshold and sometimes on the other (see section 11.4).
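To see how quickly this arithmetic turns against us, here is a rough back-of-the-envelope calculation. The numbers below (10% of tested hypotheses true, 50% power) are illustrative assumptions, not estimates of their real values:

```r
# Illustrative base-rate calculation (assumed numbers, not estimates).
# Suppose 1000 hypotheses are tested, 10% of them are actually true,
# studies have 50% power, and the significance criterion is 0.05.
n_tests   <- 1000
prop_true <- 0.10
power     <- 0.50
alpha     <- 0.05

true_positives  <- n_tests * prop_true * power        # 50 real effects detected
false_positives <- n_tests * (1 - prop_true) * alpha  # 45 spurious 'discoveries'

# Of all the significant results, what proportion reflect a real effect?
true_positives / (true_positives + false_positives)   # about 0.53
```

Under these made-up assumptions, nearly half of the significant results in the literature would be false positives, before any questionable research practice has even entered the picture.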

In fact, the effective false-positive rate is probably much higher than one in twenty. This is because there were, traditionally, many researcher degrees of freedom in how the analysis was done and how it was interpreted (Simmons et al., 2011). Researchers could try many specifications of the analysis and report the ones that made their results look best (see chapter 9), and trying out multiple specifications inflates the false-positive rate. This came to be known as p-hacking: running analysis after analysis until one produced the p-value that made the paper attractive. Researchers weren’t necessarily being cynical; once you know that controlling for left-handedness and time of day makes your main statistical test just significant instead of just non-significant, it’s hard to un-know it.
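A small simulation gives a feel for how much this flexibility matters. The scenario below is invented: the data are pure noise, and the four specifications (with and without a covariate, a subgroup analysis, an interaction) simply stand in for the kinds of choices a researcher might plausibly try:

```r
# Minimal sketch of how trying several specifications inflates the
# false-positive rate. There is no real effect in these data.
set.seed(1234)

one_null_study <- function(n = 50) {
  d <- data.frame(
    outcome     = rnorm(n),
    group       = rep(c("A", "B"), length.out = n),
    covariate   = rnorm(n),
    time_of_day = sample(c("am", "pm"), n, replace = TRUE)
  )
  # Four 'reasonable' specifications a researcher might try:
  p1 <- summary(lm(outcome ~ group, data = d))$coefficients["groupB", 4]
  p2 <- summary(lm(outcome ~ group + covariate, data = d))$coefficients["groupB", 4]
  p3 <- summary(lm(outcome ~ group,
                   data = subset(d, time_of_day == "am")))$coefficients["groupB", 4]
  p4 <- summary(lm(outcome ~ group * covariate, data = d))$coefficients["groupB", 4]
  min(p1, p2, p3, p4) < 0.05  # 'significant' if any specification works
}

mean(replicate(2000, one_null_study()))  # well above the nominal 0.05
```

Each individual test has a 5% false-positive rate, but the freedom to report whichever specification happens to cross the threshold pushes the effective rate well above that.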

Relatedly, and again often unintentionally, researchers were prone to hypothesising after the results were known (HARKing). If you find a significant interaction between left-handedness and your IV, it’s easy to persuade yourself that this is what you were looking for all along: of course the effect is only going to be present in right-handers! If you can subtly adjust your prediction in the light of your results, you again increase the chance of being able to claim, with apparent justification, that your prediction was supported.

Publication bias also plays a big role here. If you only succeed in publishing the subset of your studies that produced a significant result, then by definition the published record is a non-representative subset of all the results obtained, biased towards significance. Researchers exacerbated this by selective self-presentation: leaving out the measures or manipulations that did not work and reporting only the ones that seemed to have done so, or simply leaving whole studies on the cutting-room floor.
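A quick simulation, again with made-up numbers, shows what this selection does to the published record, and why replication effect sizes come out so much smaller than the originals:

```r
# Minimal sketch of publication bias (illustrative numbers). Many labs run
# the same underpowered study of a small but real effect, and only the
# significant results get published.
set.seed(1234)

true_d <- 0.2  # true mean difference (sd = 1, so this is also the standardised effect)
n      <- 30   # participants per group

one_study <- function() {
  x <- rnorm(n, mean = 0)
  y <- rnorm(n, mean = true_d)
  c(estimate = mean(y) - mean(x), p = t.test(y, x)$p.value)
}

studies <- t(replicate(5000, one_study()))

mean(studies[, "estimate"])                       # close to the true 0.2
mean(studies[studies[, "p"] < 0.05, "estimate"])  # the 'published' subset is much larger
```

Across all the simulated studies the average estimate is close to the truth, but the publishable (significant) subset overstates it considerably, which is exactly the pattern the mass replications reveal.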

There is broad consensus nowadays on a set of good practices that will reduce the rate of spurious claims, and increase the credibility of the results that are put forward. The next section sets out what these are.

References

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632