12.5 Being a good data citizen
One of the great things about R is that it is an international lingua franca. You can share your R code with your collaborators, and with strangers from all over the world, and they can use it. At least in principle. In reality, I find that when collaborators send me their code, or I download code from a repository, I am almost always unable to make it work without frustration. That’s because the code is poorly presented. This section contains a series of tips for writing your code, tips that will make your code easy and stress-free for others to use (and for you to use when you re-open it in three months’ time).
The key to being a great scientific writer is to overcome the curse of knowledge. The same is true for writing good code. The curse of knowledge is that you know what a chunk of code is trying to do, what each variable name means, and which bits you need to run to make which other bits work. The curse is that you cannot appreciate what it would be like not to know this. But your reader will not know, and, more importantly, in fifteen days you will have forgotten too. The tips below are devices for writing code that can be run without knowledge the user does not have.
If you can, teach yourself to write neat code from the first draft. Often when you first analyse your data, you write a really messy script, with no logical order, no comments, and full of inconsistencies. You persuade yourself that you will tidy up later, before posting it to the repository. But, try to internalise the tips below so that you employ them from the very beginning. That way you save yourself a lot of effort tidying up later on. A good chef tidies their work area as they go along.
12.5.1 Use transparent and consistent naming conventions
Give all of the variables in your data files maximally clear (full word) names. Thus, Condition, not co, Neuroticism, not pers_3, and so on. It means typing a few more characters, but the transparency saving is large. Relatedly, always code your categorical variables using transparent words, so that Condition has values Control and Intervention, not 1 or 2, or anything else. Use a consistent convention about how the names are formed. If one variable name starts with a capital letter, then all should do so. If some names have multiple words, then have a consistent rule about whether the separator is . or _. (I do not recommend variable names including spaces.) Some people prefer the convention of full stops (.) in variable names and underscores (_) in function names. It does not matter too much what system you adopt, as long as it is consistent.
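If the raw data arrive with cryptic column names, you can give them full-word names right at the top of your script. A minimal sketch using dplyr (the raw names co and pers_3 here just stand in for whatever your data file contains):

library(dplyr)

# Replace cryptic raw column names with full-word names
d <- d %>%
  rename(Condition = co,
         Neuroticism = pers_3)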
12.5.2 Section and comment your script
In an RStudio script file, any line that begins and ends with ####
is read as a section heading. This means it becomes available to navigate to in the little bar at the bottom of the script window. It is also visibly different in the script window itself, especially if you put a line of blank space before and after it. Use this feature to give your script a clear structure of headings. In between your headings, put in comments to tell yourself and the reader what you are doing in the following lines. Comment a lot. Too much is better than too little. Here is a hypothetical example:
#### Preparing the data frame ####
# First figure out which experimental group is which
# By seeing which one has a worse mood after than before
d %>%
  mutate(Difference_Mood = Final_Mood - Initial_Mood) %>%
  group_by(Mood_induction_condition) %>%
  summarise(M = mean(Difference_Mood),
            SD = sd(Difference_Mood))
# Looks like 1 was negative and 2 was neutral
# Now let's recode the condition variable:
d <- d %>%
  mutate(Condition = case_when(
    Mood_induction_condition == 1 ~ "Negative",
    Mood_induction_condition == 2 ~ "Neutral"))
#### Now the next task ####
#...
Aim for structural isomorphism between your script and your paper. That is, if your paper’s Results section has headings ‘Descriptive statistics of sample’, ‘Impulsivity measure’, ‘Experimental effects’ and ‘Sensitivity analysis’, then those should be the headings in your script too (preceded by an initial section, the ‘head’, see section 12.5.3). This makes it easy for the reader to map back and forth from one to the other.
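For that hypothetical paper, the skeleton of the script might look something like this:

#### Head ####

#### Descriptive statistics of sample ####

#### Impulsivity measure ####

#### Experimental effects ####

#### Sensitivity analysis ####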
12.5.3 Make your scripts modular with a head
A problem you face is that the operations in your script must be written in a linear order, but sometimes the reader wants to jump straight to a later part, say the part where you make the figures. You don’t want the user to have to run all of the earlier sections in order to make the code that produces figure 1, at line 320, work correctly. And you don’t want the figure 1 code at line 320 to stop working just because you changed something at line 115 in order to fit the statistical model.
In other words, you want your script to be modular. That means it should consist of a number of parts that can be operated and changed independently: the section that calculates the descriptive statistics; the one that fits the statistical model; the one that makes the figures; and so on. If you took modularity to the extreme, then you would need to do basic operations common to the whole script, like reading in the data and calculating derived variables, separately in many different places. This would make sure that each module worked autonomously, but it would be highly repetitive.
The compromise position I use is to have one common section, the head, which must be run and contains all the things necessary for the whole script. Thereafter I try to write sections that are completely autonomous, other than depending on the head having been run. Thus, to make a later section work, all the user needs to do is to run the head, then run the section of interest. This avoids the tedious situation of wanting to make the figure at line 320, and finding that this does not work unless you have run the code at line 285 that did the log-transformation, which in turn does not work without the subset of the data defined at line 113, and so on.
The head is the first section of the script, and contains a small number of general operations that are needed for all or most of the modules in the script. It should be identified with #### Head #### and commented. In the head, you:
- Load all contributed packages used at any point.
- Read in all data files used at any point.
- Do any merging or reshaping of data frames that is going to be needed at any point.
- Rename and recode variables if necessary. Set the reference levels of any factors if required.
- Calculate derived variables like scores from scales, or indices.
- Apply any exclusions (or replacements of missing values) that are going to apply to all the analyses in the paper.
The user should know that once they have run all the code of the head, they are ready to jump to any section and find that it works correctly.
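A minimal sketch of what a head might contain (the package, file name and variable names here are purely illustrative, reusing the mood-induction example from above):

#### Head ####

# Load contributed packages used anywhere in the script
library(tidyverse)

# Read in the raw data
d <- read_csv("mood_experiment_data.csv")

# Recode the condition variable transparently
d <- d %>%
  mutate(Condition = case_when(
    Mood_induction_condition == 1 ~ "Negative",
    Mood_induction_condition == 2 ~ "Neutral"))

# Calculate derived variables
d <- d %>%
  mutate(Difference_Mood = Final_Mood - Initial_Mood)

# Apply exclusions that hold for all analyses in the paper
d <- d %>%
  filter(!is.na(Final_Mood))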
12.5.4 Consider a separate data wrangling script
Sometimes the head of your script can get very long. For example, you may have to read in and merge multiple data files, rename and recode many variables, add up scores from scales, and reshape the data frame from wide to long format. It is important that these operations are done reproducibly, since they are part of the chain of evidence that goes from your experiment to your paper. But they don’t make for gripping reading, and most users would rather fast-forward through them. In cases like this, I sometimes separate my work into a data wrangling script and a data analysis script. In the data wrangling script, I do all the tedious work described above, the work that takes my raw data files and makes them ready to start analysing and plotting. At the end of the data wrangling script, I save a processed version of the data (usually as an .Rdata file).
The head of the data analysis script is then very simple: it consists of loading contributed packages, loading the processed data file, and off we go. The user can choose their own adventure: either work through the data wrangling themselves, or just start at the beginning of the data analysis, where things get interesting.
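Concretely, the handover between the two scripts might look something like this (the file name is illustrative):

# At the end of the data wrangling script:
save(d, file = "processed_data.Rdata")

# In the head of the data analysis script:
library(tidyverse)
load("processed_data.Rdata")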
For completeness, I usually put the raw data files and wrangling script, as well as the processed data file and analysis script, in the data repository. You might want to accompany these with a ‘readme’ file explaining what everything is. In terms of computational reproducibility, the data analysis script plus the processed data file should allow reproduction of the paper; and the data wrangling script plus the raw data files should allow reproduction of the processed data file.
12.5.5 Consider using R Markdown
You can go further with the idea of making your script and your results section correspond to one another. You can make your script and your results section be the very same document. You do this by writing your document in R Markdown. R Markdown is a way of writing formatted text documents that include sections of R code (or code in other programming languages) alongside their text. You write your R Markdown document in RStudio and save it as a file with the extension .Rmd. You can render your R Markdown into a number of formats for reading, such as Microsoft Word, PDF and HTML. When you output your file to one of the formats, the code will all be run and the output included alongside the text. This book is written in R Markdown, which I then output into HTML for the web or PDF for the print version.
You include R code within R Markdown in two ways. The first is code chunks. These are blocks of code which appear within the text in grey boxes. Underneath the code is the output that would be produced in the console and plot window by running that code. The second way is inline code. Let’s say that within a sentence you want to cite some numbers, like the mean and standard deviation of a variable. Instead of running the relevant R code and then retyping the number you see into your text document, you simply call R to do the relevant calculation within the text. As long as the necessary variable is in your R environment within the script, R does the calculation and places the result into the document at the indicated place. This is obviously much less error-prone than retyping. As someone who has spent months of their life typing numbers from R output into word-processor documents, I find this a game changer in terms of error reduction, and potentially time-saving too. In particular, if the data change (you decide to apply a different exclusion criterion, for example), all the numbers in your tables and results update automatically and are correct.
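A minimal sketch of the two ways, as they would appear in the .Rmd file (the variable Reaction_Time is illustrative):

```{r}
# A code chunk: this code and its output appear in the rendered document
d %>% summarise(M = mean(Reaction_Time), SD = sd(Reaction_Time))
```

The mean reaction time was `r round(mean(d$Reaction_Time), 1)` msec (SD = `r round(sd(d$Reaction_Time), 1)`).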
I won’t go into detail here about how to write R Markdown. It is very intuitive and there are good materials on the web. It also works well as a word processor. I have experimented with everything from writing my entire paper in R Markdown, to using R Markdown to make a statistical document that I put into the online repository but that is not the actual paper. A good compromise is to write the Results section (or Methods and Results) in R Markdown, then output this, once you are happy with it, as a word-processor document that you combine with the Introduction and Discussion, which you have written in your word processor. Most journals will want a word-processor version at some point, and often your collaborators will too. A Results section written in R Markdown is fully reproducible by definition: by running your R Markdown file on your data file, a user will end up with the Results section you got.
12.5.6 Maximize local autonomy and minimize intermediate objects
I used to write scripts that made many intermediate objects. For example, if my main data frame was d but I wanted to fit a statistical model to only the data from participants with a reaction time of less than 700 msec, I would first make a data frame with the required data:
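A sketch of what that step might look like (the object name d.include matches the discussion below, and GRT is the reaction-time variable referred to there):

d.include <- d %>% filter(GRT < 700)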
Then later I would fit the model:
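Something like this, say (the outcome variable Accuracy and the choice of a simple linear model are illustrative):

m1 <- lm(Accuracy ~ Condition, data = d.include)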
However, there is no need to do this in two steps. You can just write the model as:
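With the same illustrative model, that becomes:

m1 <- lm(Accuracy ~ Condition, data = filter(d, GRT < 700))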
Why do I now prefer the second way to the first? Your environment does not get littered with intermediate objects like d.include. The line defining the model m1 is more autonomous, in that it does not depend on having previously run the line defining d.include in order to work. And the line defining the model m1 is more transparent: it is more obvious that m1 has been fitted only to the data from participants with GRT < 700, because the call actually says so explicitly. Thus, these days I prefer the second version, because more of the work is done locally to the operation itself.
There are many other examples of this kind. Let’s say you want to work out the mean of the variable SSRT by participants (for a case where participants do the same task multiple times), and then work out the mean and standard deviation of those means by Sex. You could do this via the intermediate object participant.summary, as follows:
participant.summary <- d %>% group_by(ParticipantID) %>%
summarise(participant.mean.SSRT = mean(SSRT),
Sex = first(Sex))
And then later:
participant.summary %>% group_by(Sex) %>%
summarise(mean(participant.mean.SSRT),
sd(participant.mean.SSRT))
However, there is no need for the intermediate object. You can do everything in one go:
d %>% group_by(ParticipantID) %>%
summarise(participant.mean.SSRT = mean(SSRT),
Sex = first(Sex)) %>%
group_by(Sex) %>%
summarise(mean(participant.mean.SSRT),
sd(participant.mean.SSRT))
No intermediate objects with funny names. The pipe is a bit long, but the user can work through what it is doing step by step. There are exceptions to this rule. Sometimes it is much more economical, or easier to understand, if you define an intermediate object to which you will apply further operations. But in general I would privilege transparency and local autonomy over brevity of the script.