2.2 What is R and why do we use it?

2.2.1 R and RStudio

R is a programming language for statistical computing. I recommend you interact with R through RStudio, a helper programme that provides a convenient interface to see and edit your R code, view your data, preview graphics, and so on. There are other ways of interacting with R, and the code you write in RStudio can generally be run without using RStudio. But I highly recommend RStudio, and it is extremely widely used. If you have not yet done so, this would be a good point to install R, and then RStudio, onto your computer. Both are free and both are available for the major types of computer operating system. I will not provide detailed instructions on how to do this; a web search will easily find what you need.

2.2.2 R is a kind of anarchist utopia

R is a kind of anarchist utopia. Originally started by professors Ross Ihaka and Robert Gentleman as a programming language to teach introductory statistics at the University of Auckland, it grew organically in all directions, becoming a worldwide scientific movement. Hundreds of different people have contributed to it over the years (thousands if you include all the contributed packages, which we will discuss shortly). The fact that R is free is an astonishing boon for the democratization of knowledge and scientific capacity. The fact that it is a de facto lingua franca of data analysis fosters scientific collaboration and the reproducibility of knowledge, because you can run or adapt someone else’s code and they can run and adapt yours.

The thing about anarchist utopias is that they are kind of anarchic. Central control is weak. There are dozens of different ways of doing the same thing. Updates are issued and things that worked yesterday stop working, in ways that no one seems to have anticipated and that can be frustrating to sort out. Different parts of R do things using different words and even different symbols, for reasons to do with its decentralized historical development. If you want to tell R to ignore missing values in the calculation of a correlation coefficient, you need to say use="complete.obs". If you want to tell it the same thing when calculating a mean, you say na.rm=TRUE, and in calculating a regression, you say na.action=na.omit. Why? That’s just what you get from being an open and evolving movement.
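To make this inconsistency concrete, here is a minimal sketch using a small made-up data frame (the names d, x and y are invented for illustration, and the numbers mean nothing). The same request, "ignore the missing values", is spelled three different ways:

```r
# A small made-up data set with one missing value (NA) in each variable
d <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(2.1, 3.9, 6.2, NA, 9.8)
)

# Correlation: missing values are handled through the 'use' argument
r <- cor(d$x, d$y, use = "complete.obs")

# Mean: missing values are handled through the 'na.rm' argument
m <- mean(d$x, na.rm = TRUE)

# Regression: missing values are handled through the 'na.action' argument
fit <- lm(y ~ x, data = d, na.action = na.omit)

# How many complete rows did the regression actually use?
nobs(fit)
```

If you leave out use="complete.obs" or na.rm=TRUE here, cor() and mean() return NA rather than a number. In a default R setup, lm() drops incomplete rows even when you do not spell out na.action, which is one more small inconsistency to be aware of.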

2.2.3 Use R because you’re worth it

R can be forbidding at first. Sometimes all you want to do is read some data in and calculate the mean, and using R appears far more complex than necessary. Basic statistical operations will initially take you longer than they would in a commercial spreadsheet or statistics package. Sometimes your code will not work and you will stare at it for hours, baffled, and search the internet endlessly for solutions. R’s error messages are usually totally incomprehensible (a side effect of the whole anarchist utopia situation). However, I would recommend really getting to grips with R even if your statistics needs are currently very modest, or you also use other languages like Python.

The reasons for this recommendation are the following:

  • You can never outgrow R. R is the language of choice of many professional statisticians and data scientists. There are contributed packages to implement almost every conceivable statistical technique. You can write new functions. You can make professional quality graphics. You can make animations and run simulations. You can even write web pages and books (this one is written and typeset in R). As your skills and ambitions grow, R will be equal to the challenge.

  • Programming in R helps you think better. To do data analysis in R, you have to write a short computer programme (a script). This means writing down a sequence of exact operations, in order, understanding what each line is doing. This helps you to think clearly about what question you are asking, and what you are doing with your data.

  • Programming in R aids reproducibility. When you analyse data in R, you save your script and data files, and publish them along with your paper or dissertation. This means that someone wanting to understand your claims or findings can download your script, rerun it for themselves, and understand the full pipeline that leads from your raw observations to your pretty pictures and final conclusions.

  • R is pedagogical. There are incredible R resources available, many of them free online, that teach statistics, data science and graphics. By speaking R you are able to benefit from these resources.

  • It’s free. You are on the side of the good guys. It’s good for you, and it is also good for the openness of science. It also works on different operating systems.

  • It’s fun. Arguably.

2.2.4 Working with R

There are several cautionary points to make about working with R. The first is that there are many ways of doing the same thing. This is particularly true once you take all the contributed packages into account. If you are already familiar with R, you will be puzzled why I do certain operations one way, when another book, a web search, or an AI assistant suggests another way. It doesn’t matter. Try both ways and convince yourself they give the same result. Be curious and develop your own style. Above all, try to understand the logic of what each line or chunk is doing.
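As a small illustration of this point, here is a sketch that calculates the same mean in three different ways using only base R. The data frame scores and its column score are made up for the example:

```r
# Made-up scores, for illustration only
scores <- data.frame(score = c(10, 12, 20, 24))

# Way 1: pull the column out with $ and take its mean
m1 <- mean(scores$score)

# Way 2: pull the column out with [[ ]] and take its mean
m2 <- mean(scores[["score"]])

# Way 3: evaluate the expression 'inside' the data frame
m3 <- with(scores, mean(score))

# Convince yourself that they give the same result
all.equal(m1, m2)
all.equal(m2, m3)
```

None of these is the right way; they are just different routes to the same number. Contributed packages add further routes of their own.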

A second point: the R archipelago changes quite fast. There are new versions all the time and some of the changes are consequential. Contributed packages come and go. By the time you read this, there may be better ones for doing some of the things we do in this book (or existing ones will no longer be supported), and you may have to adapt the code I use here. Again, this does not matter as long as you are prepared to actively experiment, use the community and the web for answers, and try to understand the logic of what is going on.

Finally, there are bugs and errors. You will quite often find that bits of R code, even quite simple bits, just don’t work. They produce a bizarre error message or an incomprehensible result. You may briefly think you have discovered a flaw in R itself, or even a lacuna in Western mathematics. You haven’t. You have just typed something wrong. Look again at your code and if necessary type it out again from scratch. Remember that in R, the placement of every comma and line break matters, the order of things matters, and the difference between capital and lower case letters matters. Which type of bracket you use, {, ( or [, matters. You must close every bracket you open, and close it in the right place. If your code produces a problem, you have probably typed something wrong. In the rare event that you have not typed something wrong, you have thought about something wrong. In this case you need to think more carefully about what you are trying to do, maybe breaking it down into smaller logical steps.
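The following sketch shows the three kinds of bracket doing their different jobs, and shows that capitalisation matters. The object name values is made up, and the last line assumes you have not separately created an object called Values:

```r
# Round brackets call a function: here c() combines three numbers
values <- c(3, 8, 5)

# Round brackets again: call the function mean() on those numbers
mean(values)

# Square brackets pick out elements: here, the second one
values[2]

# Curly brackets group one or more lines of code into a block
if (values[2] > 5) {
  message("The second value is greater than 5")
}

# Capital and lower case letters are different:
# 'values' exists, but 'Values' is a different name that R has never seen
exists("values")
exists("Values")
```

If you swap one type of bracket for another in any of these lines, or type Values where you meant values, R will either stop with an error or do something other than what you intended.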

2.2.5 Base R and contributed packages

One of the main reasons that R is anarchic and plural is that as well as R itself (often referred to as base R), there is a whole archipelago of free sub-programs available that can be deployed within R. These contributed packages are written by a dispersed community of different data scientists and statisticians. They do not come with base R, but need to be installed. This is easy to do from within R. Often there are multiple different contributed packages available to do the same task. You will use many different ones and come to have your favourites. Each contributed package has its own lingo. We will be using contributed packages as well as base R throughout this course. In particular, every code example in this book uses a package called tidyverse. Tidyverse is in fact a bundle of packages for organizing, sorting, summarizing and plotting your data. It is very widely used and makes R more user-friendly for people who are not computer scientists.
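As a minimal sketch of what installing and loading a package looks like, assuming your computer has an internet connection and can reach the standard package repository (CRAN): you install a package once per computer, and then load it at the start of every R session in which you want to use it. The check with requireNamespace() and the name has_tidyverse are additions of mine so that the example runs cleanly whether or not the package is already installed; in everyday use you would usually just write library(tidyverse).

```r
# Install the package (needed only once per computer).
# It is commented out here so that running this script does not
# trigger a fresh download every time.
# install.packages("tidyverse")

# Check whether the package is installed, without loading it
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

# Load the package (needed in every new R session)
if (has_tidyverse) {
  library(tidyverse)
} else {
  message("tidyverse is not installed yet; run install.packages(\"tidyverse\") first")
}
```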