12.2 Principles of good data analysis
12.2.1 Pre-registration
The centrepiece of modern good practice is pre-registration. You should pre-register your study on a platform such as the OSF, even if the study involves re-analysis or follow-up analysis of existing data that have already been analysed in relation to different questions.
A pre-registration is a concise description of the research questions, hypotheses, predictions, and planned methods of the study. In typical experimental cases, it is deposited before the data have been collected. However, it can also be deposited after the data exist but before the researcher has done the analysis, for example with pre-existing data or in a quasi-experimental case. Your pre-registration should contain a detailed description of your planned data analysis strategy.
The epistemic benefits of pre-registration should be obvious. Pre-registration makes it transparent what the researcher really hypothesized, not what they persuaded themselves, after seeing the results, that they had hypothesized. Hence it reduces HARKing. It also mitigates the p-hacking problem, since the researchers will have been obliged to pre-specify which analysis strategy they intended to follow. I would advise you to make the data analysis part of your pre-registration as detailed and specific as you can; you can even simulate a dataset (chapter 11), and use it to pre-write the analysis code you will use on the real data.
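As a minimal sketch of this idea (the design, variable names and effect size here are purely illustrative), you might simulate data with the structure you expect to collect, and then write your confirmatory analysis against the simulated dataset:

set.seed(42)
n <- 100                                        # planned sample size per group
sim <- data.frame(
  condition = rep(c("control", "treatment"), each = n),
  score = c(rnorm(n, mean = 0, sd = 1),         # control group
            rnorm(n, mean = 0.3, sd = 1))       # treatment group, assumed effect
)

# The pre-written confirmatory analysis: exactly the model you plan to run
# on the real data, ready to be pointed at the real file once it exists.
m <- lm(score ~ condition, data = sim)
summary(m)

When the real data arrive, only the line that creates (or reads in) the dataset needs to change; the analysis itself was fixed in advance.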
Researchers are sometimes concerned that having to pre-register will inhibit their ability to discover phenomena serendipitously, or to adjust nimbly when the results are not as anticipated. This is really not a concern. You can still perform unanticipated analyses, find unexpected findings, and pivot your strategy if the data look different from what you expected. Having a pre-registration merely makes it transparent when you have done this: which parts were always part of the plan, and which parts evolved. In your paper, you should include a transparent changes section, either in the Methods or as a supplementary appendix. This is a succinct statement of all the ways you deviated from the pre-registration, in the methods, the sampling, and the data analysis, along with a justification of why each deviation happened. If your data analysis strategy is substantially updated compared to your pre-registered one, you should consider presenting your originally planned analysis even if you now also have another that you consider superior.
Pre-registration is often described as beneficial for the field as a whole, but don’t underestimate its benefits to the researchers themselves. If you don’t pre-register, you can rush into running a study, only to realise as you stare at the dataset that you had not adequately thought through why the design was as it was, exactly what the predictions were, or how on earth you were going to do the analysis. At best this is time consuming, and at worst it leads you to do completely useless work. Pre-registration improves your workflow because it forces you to think harder upfront, so things go more smoothly and efficiently down the line. Also, don’t forget the limitations of your own memory. You may well have had reasons for doing something a certain way; it is by consulting your own pre-registration that you will remember what they were.
There are a number of different formats for writing your pre-registration. The OSF provides a helpful list (see https://help.osf.io/article/229-select-a-registration-template). The OSF standard pre-registration is the most widely used, and consists of text answers to a series of questions. There are also other templates for specific situations, such as secondary analysis of existing data. Personally, I prefer to write a fairly full, open-ended document which resembles the Introduction and Methods sections of a paper. The advantage of doing this is that it can form the basis of the eventual paper; half of your write-up is done before you have even collected the data. If you write your own document, make sure it covers the content of all the obligatory sections of the OSF standard template.
You should aim for a clear structural correspondence between the pre-registration and the eventual paper. If you defined predictions 1, 2 and 3 in the pre-registration, then introduce them in that order and with that numbering in the paper, and use that order to structure your Results section (and also your analysis script). Assume that your reader may be reading your pre-registration and paper, and perhaps your R script, in parallel.
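The analysis script can carry the same numbering, so that a reader can move between pre-registration, paper and code without getting lost. A minimal sketch of what this might look like (the predictions, variables and models are invented for illustration):

# analysis.R -- structured to mirror the numbered predictions
set.seed(1)
dat <- data.frame(                    # stand-in for the real dataset
  condition = rep(c("control", "treatment"), each = 50),
  age_group = rep(c("younger", "older"), times = 50),
  score = rnorm(100)
)

# Prediction 1: scores are higher in the treatment condition
m1 <- lm(score ~ condition, data = dat)

# Prediction 2: the condition effect is larger in the older age group
m2 <- lm(score ~ condition * age_group, data = dat)

summary(m1); summary(m2)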
12.2.2 Pre-printing and results-blind publication
If the literature is to represent without bias the data we have about the world, then null findings need to be as likely to enter the record as non-null findings. Journals are realizing this and are more open to publishing null results than they used to be. You should play your part in keeping the record honest in this way.
An innovation that has helped with the publication of null results is the registered report. These are papers that are peer-reviewed, and in principle accepted for publication, before the results are known. Registered reports help bind journal editors and authors alike to publication even if the results are not as hoped. Although this is an excellent idea, the registered report mechanism is not always convenient. For example, I often write papers reporting series of experiments, in which each one is designed in the light of the previous one. Here, it is not clear at what point you would write and submit a registered report. Journals and authors need to make an effort to publish null findings even when they are not registered reports.
A helpful innovation for the dissemination of null results (and other results too) is the pre-print. Pre-prints are early, not yet peer-reviewed versions of papers, posted publicly via services such as PsyArXiv. You should always pre-print everything, null or not. That way, even if you add more studies, journal publication takes a long time, or the paper is hard to place because editors are sniffy about null results, the data are on record. They can be found by meta-analysts and by people wanting to build on what we know.
A subtle form of publication bias is to publish the study but leave out measures or manipulations that produced nothing of apparent interest. This is a shame, since it misleadingly represents what was done, and withholds information from the record. Your write-up should always cover all of the predictions, all of the variables you measured, and all of the manipulations you did. You should not cherry-pick, though of course some might get more space in the Discussion than others. Pre-registration helps here. If a prediction or variable is in the pre-registration, then it should be in the paper.
12.2.3 Sensitivity analysis
We have already covered sensitivity analysis in chapter 9, but it is worth mentioning again here. It really helps readers (and you) to know whether the result you want to interpret appears under any reasonable specification of the data analysis, under only one very particular specification, or under some specifications but not others. Even for apparently simple studies, there are often multiple reasonable analyses. As well as your primary analysis, you should consider carefully how sensitive the conclusions are to alternative specifications, transformations and covariates, using specification curve analysis where useful. You should present sensitivity analyses either in the paper or in a supplementary appendix.
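As a minimal sketch of the idea (the data, specifications and variable names are invented for illustration), you can fit the same basic model under several reasonable specifications and compare the estimate of interest across them:

set.seed(1)
d <- data.frame(x = rnorm(200), covariate = rnorm(200))
d$y <- 0.2 * d$x + 0.1 * d$covariate + rnorm(200)

specs <- list(
  no_covariate      = lm(y ~ x, data = d),
  with_covariate    = lm(y ~ x + covariate, data = d),
  outliers_excluded = lm(y ~ x, data = d[abs(d$y - mean(d$y)) < 3 * sd(d$y), ])
)

# Estimate and p-value for x under each specification
sapply(specs, function(m) coef(summary(m))["x", c("Estimate", "Pr(>|t|)")])

If the estimate is similar across specifications, you can say so; if it appears under only one, that is important for the reader to know.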
12.2.4 Internal replication
If you have a finding that you wish to communicate (whether it is null or non-null), then the credibility of the conclusion is greatly enhanced by replicating it in a new sample. (Of course, you should also have a credible sample size in the original study.) What we are looking for is statements about the mind or behaviour that are true not just in your sample, but in humans more generally. The first step, then, is to see whether you yourself can find it in another sample drawn from the same population, before we move on to other researchers and other populations. You don’t need to modify the study, other than in small ways that make it better, or conceivably adding an extra measure that clarifies how it worked. Just run it again. If the finding is a fluke, it probably won’t look the same in the replication sample. That’s a key way that we tell false positives from true positives, and false negatives from true negatives. And you can meta-analyse the datasets (section 10.5).
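One simple way to combine the original study and the replication is an inverse-variance (fixed-effect) meta-analysis of the two estimates. A minimal sketch, in which the numbers are illustrative stand-ins for the coefficients and standard errors from your two fitted models:

est <- c(original = 0.42, replication = 0.31)   # estimates of the same effect
se  <- c(original = 0.20, replication = 0.18)   # their standard errors

w         <- 1 / se^2                 # inverse-variance weights
pooled    <- sum(w * est) / sum(w)    # pooled estimate across the two samples
pooled_se <- sqrt(1 / sum(w))         # standard error of the pooled estimate
z         <- pooled / pooled_se
p         <- 2 * pnorm(-abs(z))       # two-sided p-value
round(c(pooled = pooled, se = pooled_se, z = z, p = p), 3)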
A paper is much more compelling if it shows that its main findings replicate in a new sample. You should always try to do this before deciding that you know the answer and moving on. If you are studying the mating behaviour of the rare greater vasa parrot, or your study required you to be embedded in an organization for three years, then you might deserve an exemption from this. But most studies in psychology and behavioural science are quite quick and simple to do, and internal replication will add to the confidence you can place in your conclusions.
12.2.5 Open data and code
Finally for this section, it is essential to archive all your raw data and code openly and publicly in a durable repository such as the OSF. You should make this available as soon as you post a pre-print, and it should certainly be available to your peer reviewers. This archive is as much a part of the scientific work as the paper is.
As with pre-registration, we tend to emphasize the benefits of open data and code for the field as a whole: people can scrutinize the way you made your inferences, explore other specifications or questions, use your evidence in meta-analysis, and see how to do follow-up studies of their own. But, again, I think that many of the main benefits are for the researchers themselves. In three years, you will have moved jobs, changed computers, and above all forgotten what final.modified.csv or analysis.v7 were. By making a well-labelled, durable archive of the definitive versions that is easy for someone else to understand, you are leaving a gift for your future self.
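As a minimal sketch of what a well-labelled archive can look like in practice (the file names, variables and descriptions are invented for illustration), save the definitive dataset under a descriptive name and deposit a data dictionary alongside it:

dat <- data.frame(                    # stand-in for the definitive dataset
  participant_id = 1:4,
  condition = c("control", "treatment", "control", "treatment"),
  score = c(52, 61, 48, 65)
)
write.csv(dat, "study1_trial_data.csv", row.names = FALSE)

dictionary <- data.frame(
  variable = c("participant_id", "condition", "score"),
  description = c("Anonymous participant identifier",
                  "Experimental condition: control or treatment",
                  "Outcome score on the 0-100 scale")
)
write.csv(dictionary, "study1_data_dictionary.csv", row.names = FALSE)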