1.8 How to do data analysis well enough

In the final chapter of this book (chapter 12), after introducing all the practical skills, I will return to general advice on how to do data analysis well; by which I mean, in a way that leads to maximally sound, useful, robust evidence with minimal heartache. The final chapter is a long way off, though, and so it is worth introducing some key principles now, so that you can bear them in mind throughout. These principles will be elaborated in the final chapter and elsewhere.

One thing you quickly learn about data analysis is that there are many different reasonable ways of analysing the same dataset. Different analysts have different opinions and styles. Whether a method is appropriate or applicable is a matter of degree, and at times a judgement call. I can’t teach you how to do an analysis that everyone will always agree is the right one. Nonetheless, if you follow the guidelines of this section and chapter 12, your analysis will always be good enough to make a productive contribution to the dialogue.

1.8.1 Analysis strategy and pre-registration

For every study, you need to define an analysis strategy in advance. Never just set out to ‘try some things’, even if your goals are exploratory. The list of things you might try is basically limitless, and your work time is finite. Instead, your strategy should state clearly what the questions are, any hypotheses and predictions, the estimand of interest, and which variables you are going to relate to that estimand, in what combinations. In writing your strategy, you should link back from the patterns you might see to your question or hypothesis (if \(Y\) is higher in the \(Treatment\) group, that would be consistent with the prediction from hypothesis 1; if \(Y\) is lower in the \(Treatment\) group, that would be consistent with hypothesis 2; if there is no difference, neither hypothesis is supported).
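To make the idea concrete, here is a minimal sketch, in Python, of a pre-specified decision rule that maps the pattern you might observe back onto the hypotheses. The function name, the group means, and the zero threshold are all illustrative assumptions, not part of any real analysis in this book.

```python
# Sketch of a pre-specified decision rule linking an observed pattern
# back to the hypotheses stated in the analysis strategy.
# All names and numbers here are illustrative.

def interpret_difference(mean_treatment, mean_control, threshold=0.0):
    """Map the direction of the group difference in Y onto the
    hypotheses written down in the analysis strategy."""
    diff = mean_treatment - mean_control
    if diff > threshold:
        return "consistent with hypothesis 1 (Y higher under Treatment)"
    elif diff < -threshold:
        return "consistent with hypothesis 2 (Y lower under Treatment)"
    else:
        return "no difference: neither hypothesis supported"

print(interpret_difference(5.2, 4.1))
```

The point is not the code itself but the discipline: the mapping from possible patterns to conclusions is written down before the data are seen.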

You should where possible write your analysis strategy before gathering any data. Thinking about analysis strategy carefully upfront can save you from wasting time and money running an experiment that cannot possibly answer your question. The design of a study and the data analysis you will do cannot be separated; they are an integrated whole developed at the same time. Sometimes you will be re-analysing existing data. In this case, you should still write down your analysis strategy prior to doing any analysis, but with knowledge of which variables are available. Parts of your strategy will take the form of a decision tree: if this variable is highly correlated with that one, I will do A; if not, I will do B.
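One branch of such a decision tree can be sketched in a few lines of Python. The 0.7 cut-off and the ‘plan A / plan B’ labels are illustrative assumptions; your own strategy would state its own cut-off and plans.

```python
# Sketch of one branch of a pre-registered decision tree:
# "if these two predictors are highly correlated, I will do A; if not, B."
# The 0.7 cut-off and plan labels are illustrative assumptions.

def pearson_r(x, y):
    """Plain Pearson correlation coefficient, standard library only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def choose_analysis(var1, var2, cutoff=0.7):
    """Follow the pre-registered branch: plan A if the predictors are
    highly correlated, plan B otherwise."""
    r = pearson_r(var1, var2)
    return "A: drop one predictor" if abs(r) > cutoff else "B: include both"
```

Because the branch is written down in advance, a reader can check that the analysis you ran is the one your data dictated, not the one that gave the nicest result.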

Next, you should always pre-register your study, even if it involves re-analysis of existing data, and even if your goals are mainly exploratory. A pre-registration is a time-stamped document of record, posted in a public repository such as the Open Science Framework (OSF, https://osf.io), that states what your goals were, what you knew, what you hypothesized, what methods you planned to use, and your planned analysis strategy, all before you do the analysis. We will return to why pre-registration is so important in section 12.2.1. Suffice it to say for now that it makes your work more credible, because it is transparent what you really knew and predicted beforehand, and what you learned from the data; and also easier and more enjoyable for you, because it gives you a clear framework to guide you through analysis and writing up, and a way of adjudicating disputes (‘I really did predict this; look at what I said in the pre-reg!’). You may well depart from your pre-registration and pre-planned analysis strategy, and that’s fine. Having these documents upfront makes such departures transparent.

1.8.2 Reproducibility

You need to document the full chain of evidence that leads to your final paper, and open it to inspection. This means that you need to keep copies of all of the raw files or questionnaires or field notebooks, or however your data arrives. Then you need to clearly document the workflow that leads from that raw data to your final paper: the script that merges all the raw files into an analysable dataset and sorts out weird cases and missing variables; the other script that does the analyses and produces the graphs and numbers in the paper. Never change raw data files in an untraceable and non-recoverable way (like changing an implausible value in a spreadsheet manually). Instead, write some code that takes the raw version and returns the ‘cleaned’ version for analysis, then keep both versions.
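A minimal sketch of such a cleaning step might look as follows, in Python. The file layout, the plausible range (0 to 120), and the choice to replace implausible values with a missing-value marker are all assumptions for illustration; the principle is that the raw file is never edited, and both versions are kept.

```python
# Sketch of a traceable cleaning step: the raw file is never edited by
# hand; a script derives the cleaned version, and both files are kept.
# File layout, the 0-120 plausible range, and the use of None as a
# missing-value marker are illustrative assumptions.

import csv

def clean_value(raw, low=0.0, high=120.0):
    """Replace implausible or unparseable values with None, rather than
    silently editing them in the raw file."""
    try:
        x = float(raw)
    except ValueError:
        return None
    return x if low <= x <= high else None

def clean_file(raw_path, clean_path):
    """Read the raw CSV (id, value per row), write a cleaned copy."""
    with open(raw_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    cleaned = [[row[0], clean_value(row[1])] for row in data]
    with open(clean_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(cleaned)
```

Anyone inspecting your repository can then re-run this script on the raw file and confirm that the cleaned file is exactly what the code produces.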

You should publicly archive the whole workflow by putting all the files onto a public repository like the OSF, with clear descriptions of what the files are, and a link to the repository in your paper. Create the repository and make it public prior to submitting the paper, as your reviewers need to be able to access it. Having open data (and open analysis code) in this way is a critical part of the process. You are making claims, and the community needs to be able to understand, check, scrutinize and improve the evidence supporting them. In addition, people may want to incorporate your data in a meta-analysis, learn analysis tricks from you, or answer a different question with your data or code. Science is a collective refining of our knowledge about the world, a collaborative activity, and that requires that we pool our information freely. People can sometimes be resistant to data openness, but such resistance is neither acceptable nor prudent. You make your work more useful and influential by making it more open.

The materials you make openly available must allow someone else to take your raw data, apply the same procedures you did, and come up with exactly the same results. This is known as computational reproducibility. The methods of the whole project, including the data analysis methods, must also be described in enough detail and clarity that someone else could run your study again in a different sample of participants. This would be called a replication study. They may or may not get the same result (that’s an empirical question), but you need to ensure that they would have enough information to run the study and to perform the data analysis: your chain of evidence needs to be sufficiently transparent that the study is at least potentially replicable. Computational reproducibility and potential replicability are necessary, though not sufficient, conditions for scientific value.
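One common obstacle to computational reproducibility is randomness in the analysis itself, such as bootstrapping or simulation. A minimal sketch of the standard remedy, in Python, is to fix the random seed so that every re-run produces byte-identical output; the function name, data, and seed value here are illustrative assumptions.

```python
# Sketch of making a stochastic analysis step computationally
# reproducible: a fixed seed means anyone re-running the script gets
# exactly the same resamples, and so exactly the same number.
# Function name, data, and seed value are illustrative.

import random

def bootstrap_mean(data, n_boot=1000, seed=2024):
    """Bootstrap estimate of the mean, reproducible via a fixed seed."""
    rng = random.Random(seed)  # fixed seed: same resamples every run
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]
        means.append(sum(sample) / len(sample))
    return sum(means) / n_boot
```

Running `bootstrap_mean` twice on the same data returns exactly the same value, which is what allows someone else to verify the number in your paper from your archived raw data and code.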