2.3 Basics of R

2.3.1 The RStudio environment

When you launch RStudio, you will see a window with several panes. The critical parts for today are the console (the big window on the left) and the environment window (the window at the top right). In each of the areas of the RStudio screen, you can toggle between different things to view, using the tabs (for example, between the Console and the Terminal in the left-hand window, or the Environment and History in the top right-hand window).

2.3.2 Is anybody in?

Go to the console and click. The cursor after the > symbol will start to blink. You can now type input into the console. Anything you type now will be sent to R as soon as you hit return. For example, do some simple arithmetic:

2 + 3

[1] 5

Yes, you can use R to successfully add 2 and 3 to get 5. Try a few others:

2 / 3

[1] 0.667

Note that you will most likely get more decimal places in your output than I show in this book. I have set my output options to fewer digits to save space.
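
If you would like your console output to look like mine, here is a minimal sketch of one way to do it. It assumes the shorter output comes from the digits setting of the base R function options(); I am not claiming this is the only setting that affects printing. The name old_options is just a label chosen for this illustration.

```r
# Remember the current settings while asking for 3 significant digits
old_options <- options(digits = 3)
print(2 / 3)          # prints 0.667

# Put the previous settings back
options(old_options)
print(2 / 3)          # with R's default of 7 digits, prints 0.6666667
```

Changing this option only affects how numbers are printed. The value R stores and computes with keeps its full precision.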

2 * 3

[1] 6

2 ** 3

[1] 8

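Perhaps you did not know this last one: 2 ** 3 means 2 to the power of 3. R treats ** as another way of writing ^, so 2 ^ 3 gives the same result, and ^ is the form you will see most often in other people’s code.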

2.3.3 Objects

R would be of little interest if you could just use it as a calculator. What you usually do with R is create objects and then apply functions to those objects. Let’s start with the simplest possible object. Type the following:

x <- 3

What happens? You should see that an object appears in your environment. The environment window conveniently tells you that you have an object called x, and its current value is 3. You have assigned the value 3 to the object x (<- is R for ‘assign’; in speech, it is often glossed as ‘gets’, as in ‘x gets 3’). The object x remains in the environment for the rest of your R session, allowing you to look at it, modify it, apply operations to it or whatever. (To delete it from your environment would be rm(x), or ‘remove x’.)

By the way, you can also assign the value 3 to x with x = 3. This works fine and does the same thing. However, for reasons to do with clarity and the history of R, assigning is usually done with <-. The symbol ‘=’ is used in calling functions, as we shall see, and, as a double form ‘==’ in logical statements.
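
To see assignment and removal side by side, you could try something like the following sketch. The object name my_value is hypothetical, chosen so that it does not disturb the x you have just created; exists() is a base R function that reports whether an object of a given name can be found.

```r
# my_value is a hypothetical name used only for this illustration
my_value <- 3              # the usual way to assign
my_value = 3               # also works, and does the same thing
print(exists("my_value"))  # TRUE: the object is in the environment

rm(my_value)               # remove it again
print(exists("my_value"))  # FALSE: the object is gone
```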

So, what can you do with your first, beautiful object? Well, you can perform numerical operations on it:

x * 3

[1] 9

x ** 5

[1] 243

Who knew that 3 to the power of 5 was 243? Note that when we performed the operations above (multiplying x by 3 or raising it to the power of 5), the value of x did not change. If you look at the value of x in your environment window, it should still be 3. You merely printed the result of a computation involving x to the console.

What do we do if we want to change the value of the object instead? To do this, we need to reassign the new value to x:

x <- x * 3

Read this as: please assign a new value to x, equal to the old value of x multiplied by 3. The x on the left hand side represents the new, post-reassignment value of x; whereas the one on the right hand side represents the current one. You can use a similar formulation to make a second object that is derived from the first:

y <- x ** 5

This should create a second object in your environment, y, which is the current value of x, raised to the power of 5. Note that if you now change the value of x, the value of y will not update unless you rerun y <- x ** 5 after changing the value of x. This means that the order you do things in always matters in programming. A computation on an object will depend on the value of that object at that point in the program, and not its value earlier or later on. For example, the two sequences of code below produce different results, as you can quickly verify by typing the lines yourself.

object1 <- 10
object2 <- 2 * object1
object1 <- object1 + 3
object1 + object2

[1] 33

object1 <- 10
object1 <- object1 + 3
object2 <- 2 * object1
object1 + object2

[1] 39

2.3.4 Vectors as objects

R objects can be all kinds of things: individual numbers, bits of text, lists of things, statistical models, data sets, even graphics. The most basic type of object you will work with in R is a vector. A vector is an ordered sequence of numbers or characters. For example, let’s say you have collected some data on people’s scores on a questionnaire. Your five respondents scored 3, 7, 15, 1, and 5 respectively. You would input these scores like this:

scores <- c(3, 7, 15, 1, 5)

Now you should have a new object in your environment, called scores (c() means combine). The environment window describes scores as num [1:5]. Here num means that it is a vector of numbers, and [1:5] means that it has five positions in it, each of which is occupied by one of the scores. If you ask R to print the object, it will print all five entries. You can also use square brackets to address specific positions within it. Try these:

scores

[1]  3  7 15  1  5

scores[3]

[1] 15

scores[3:5]

[1] 15  1  5

You can perform numeric operations on a vector object, and these will be applied to each element of the vector in turn. So:

scores + 2

[1]  5  9 17  3  7

Note, again, that this last example does not change the value of the entries in scores. To do that would require scores <- scores + 2.
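
As a small sketch of the difference between printing a result and keeping it, the code below stores the shifted values under a new, hypothetical name (shifted), so that scores itself stays as it was for the sections that follow. length() is a base R function that counts the positions in a vector.

```r
scores <- c(3, 7, 15, 1, 5)

print(length(scores))   # 5: the number of positions in the vector
print(scores[c(1, 3)])  # 3 15: positions 1 and 3

# 'shifted' is a hypothetical name for the new object
shifted <- scores + 2
print(shifted)          # 5 9 17 3 7
print(scores)           # 3 7 15 1 5: the original is unchanged
```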

2.3.5 Applying functions to objects

All the objects in the world would be no use without being able to apply functions to them. Functions are operations we perform on the information in an object, and in R a function call is always written as the name of the function followed by round brackets (). In fact, you have already encountered a couple of functions: remove, rm(), and combine, c(). The objects within the round brackets of a function are called its arguments. The arguments of a function can be individual values, vectors, or other objects, depending on the function in question. Many functions take several arguments.

Try the following functions yourself and guess what they do: min(scores); max(scores); sum(scores); mean(scores); median(scores). These are the kinds of functions we are often going to use in data analysis.
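
Since many functions take several arguments, here is a brief sketch of what that looks like, using the base R functions round() and seq(). Arguments can be given by name, using =, or simply by position. The particular numbers are arbitrary choices for illustration.

```r
# Two arguments, given by name
print(round(2 / 3, digits = 2))       # 0.67

# The same call, with the second argument given by position
print(round(2 / 3, 2))                # 0.67

# Three named arguments: start at 1, stop at 9, go up in steps of 2
print(seq(from = 1, to = 9, by = 2))  # 1 3 5 7 9
```

Naming the arguments takes a little more typing, but it makes the code easier to read later on.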

2.3.6 Classes

Objects in R belong to classes. We have already encountered the class numeric, which our vector scores belongs to. There are many other classes of object in R, which we will meet as we go along. If you ever want to know what class an object belongs to, use the class() function.

class(scores)

[1] "numeric"

class(c("apple", "pear", "banana"))

[1] "character"

You can also coerce the class of an object to be something different to what it currently is. For example, there is a class called integer which as you might guess contains only integers. Let’s define a vector of non-integer numbers:

v <- c(4.1, 4.9, 5.7, 5.2)

Now let us coerce this object to have the class integer:

v.integer <- as.integer(v)
print(v.integer)

[1] 4 4 5 5

You see that the entries of v are now coerced to the class integer, which means anything after the decimal point is truncated. Note that this is not the same as rounding to the nearest integer, which you can do like this (rounding, however, does not change the class of the object, which remains numeric):

v.round <- round(v, 0)
print(v.round)

[1] 4 5 6 5
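
You can check these claims about classes for yourself. The sketch below uses only functions already introduced, and assumes v is defined as above.

```r
v <- c(4.1, 4.9, 5.7, 5.2)

print(class(as.integer(v)))     # "integer"
print(class(round(v, 0)))       # "numeric": rounding keeps the class

# Round first, then coerce, to get the nearest integers as class integer
print(as.integer(round(v, 0)))  # 4 5 6 5

# Truncation is toward zero, so negative numbers move up, not down
print(as.integer(-4.9))         # -4
```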

2.3.7 Assigning and logical checking

Sometimes we want to establish whether a logical condition is true. For example, we might want to know whether the second element of the vector scores is equal to 7. If we say scores[2] = 7, we will assign the value 7 to the second element of scores, which is not what we wanted to do. We wanted to check whether that is the current value. We do this with double equals, ==. This means ‘check whether this condition is met’, rather than ‘make it the case that this condition is met’:

scores[2] == 7

[1] TRUE

Yes, the second entry in scores is equal to 7. You can also ask the same question for all of the vector:

scores == 7

[1] FALSE  TRUE FALSE FALSE FALSE

Only the second entry of scores is 7. For the other entries, the equality is false. A sequence of TRUEs and FALSEs is a special type of object of class logical. One of the reasons it is good to use <- for assigning in R, rather than =, is to avoid any possible confusion with logical checking ==.
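
Here is a short sketch of a few other things you might try with logical checks. The name is_seven is hypothetical, and the comparisons > and != (greater than, not equal to) work in the same element-by-element way as ==.

```r
scores <- c(3, 7, 15, 1, 5)

# 'is_seven' is a hypothetical name for the stored result
is_seven <- scores == 7
print(class(is_seven))   # "logical"

print(scores > 4)        # FALSE TRUE TRUE FALSE TRUE
print(scores != 7)       # TRUE FALSE TRUE TRUE TRUE

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
print(sum(scores > 4))   # 3
```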

2.3.8 Installing and activating a contributed package

Contributed packages define extra functions, and sometimes new types of objects too, to allow you to do specific statistical or data manipulation tasks efficiently and elegantly. Here we are going to install and start the package tidyverse. If you want to use a dishwasher, you first have to install it (bring it to your house and connect it to the water and electricity), and then start it. You install it only once, but you start it every time you want to use it. Likewise with contributed packages. Just once for each computer you use, you need to run the following command. This will install the package locally.

install.packages('tidyverse')

Do this now. You will get a lot of output messages, but hopefully the result will be success.

Having installed tidyverse, in every R session where you want to use it, you need to activate it:

library(tidyverse)

The package tidyverse and the functions it contains are now available for use in the current session.
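
If you are unsure whether you have already installed a package on a given computer, one possible approach is sketched below. The helper install_if_missing() is hypothetical (it is not part of R or of tidyverse); it relies on the base R function requireNamespace(), which reports whether a package is available without activating it.

```r
# Hypothetical helper: install a package only if it is not already there
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}
```

You might then run install_if_missing("tidyverse") once, and still call library(tidyverse) in every session where you want to use the package.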

2.3.9 Scripts

So far, we have typed commands straight into the R console. This is not how you would ever actually work with R. Instead, you will perform your data analysis by writing scripts. Scripts are text files that contain multiple lines of R, sometimes hundreds or thousands of lines, in order. When you run the script, each line is passed to R for execution, in the order in which they appear in the script.

Opening a script, or starting a new one, works exactly like opening a new document in a word-processing program. In RStudio, go to File > New File > R Script for a new script, or File > Open File for an existing one. You can also use the little icons at the top of the RStudio window. Open a new script now. It should appear in the upper part of the left-hand window of RStudio.

The first thing we are going to do with our new script is save it. This raises the question: which directory will it save to? Files are by default saved and looked for in the current working directory. To find out what your current directory is, use the following function:

getwd()

To set it to something else, you have two choices. You can use the setwd() function and type the path you want to be the working directory. Alternatively (and this is what I usually do), in the menu bar, choose Session > Set Working Directory > Choose Directory and navigate to the directory you wish to use.
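
For those who prefer the setwd() route, here is a minimal sketch. So that it runs on any computer, it uses the temporary directory that R provides through tempdir() as a stand-in; in your own work you would type the path to your chosen directory instead. The name old_dir is hypothetical.

```r
# Remember where we started
old_dir <- getwd()

# Move to another directory (here, R's temporary directory as a stand-in)
setwd(tempdir())
print(getwd())

# Go back to where we were
setwd(old_dir)
print(getwd())
```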

Set a convenient working directory where you wish to have all your R work. Save your blank script (which is currently called Untitled1, probably) using the menus or icons at the top of the screen. Call it firstscript.R, or whatever you wish, respecting the .R suffix.

Now, type the following lines into your script, separating them with a return:

x <- 5
y <- 3
output <- paste("The product is:", x*y)
print(output)

paste() is a function that sticks its arguments together into a single piece of text, separated by a space. In this case, it joins the text “The product is:” and the number 15 (the value of x*y) into one character vector. Run the script by clicking the little Source button at the top right of the script. The output of the script will now appear at the console.

[1] "The product is: 15"
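
To get a feel for paste(), you could experiment along the following lines. This sketch uses the sep argument of paste() and the related base R function paste0(), which joins its arguments with nothing in between; the example strings are arbitrary.

```r
# By default the pieces are separated by a single space
print(paste("The product is:", 5 * 3))            # "The product is: 15"

# The sep argument changes what goes between the pieces
print(paste("The product is:", 5 * 3, sep = ""))  # "The product is:15"

# paste0() joins with no separator, and works along a vector
print(paste0("x", 1:3))                           # "x1" "x2" "x3"

# The result is always text, even when one of the inputs was a number
print(class(paste("The product is:", 15)))        # "character"
```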

You can also source your script using Ctrl + Shift + S. You can run individual lines or sections of the script by placing the cursor on the desired line, or selecting multiple lines, and then using Ctrl + Enter.
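
Running a whole script can also be done from code, with the base R function source(). The sketch below is self-contained: it writes the four lines of the script to a file in R's temporary directory (an assumption made so that the example runs anywhere) and then sources it. With your own saved script, you would only need the last line, giving the path to your file.

```r
# Write the example script to a temporary location
script_path <- file.path(tempdir(), "firstscript.R")
writeLines(
  c('x <- 5',
    'y <- 3',
    'output <- paste("The product is:", x*y)',
    'print(output)'),
  script_path
)

# Run every line of the script, in order
source(script_path)   # prints "The product is: 15"
```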

A script allows you to write down all the operations involved in your data analysis, in order, so that you can get it perfect and then run it. You can also save it and come back to it tomorrow, send it to your collaborators, and publish it with your paper so that anyone can see and reproduce the exact analysis operations that underlie your claims about your results. A useful feature of scripts is that you can comment them, using the # symbol. A comment line will not be evaluated by R. It is just there for you to leave notes for yourself and for your collaborators and readers.

# First specify two numbers
x <- 5
y <- 3
# Then make a sentence saying what their product is
output <- paste("The product is:", x*y)
# Then print this sentence to the console
print(output)

[1] "The product is: 15"

This does exactly the same thing as the version without comments above, but helps the reader see what is going on. At the end of your session, always save the latest version of your script. If there are unsaved modifications in it, its name will be in red at the top of the tab.

2.3.10 Loading in data

You could conceivably type your data directly into R, using c() to create a vector that represents each variable. But, you will almost never do that; usually you will be loading in electronic data files that have already been created in another program. You can load them in from a web link or from a locally saved file. There are two file formats that you will probably use, .csv and .Rdata. R can handle many other formats too (e.g. .xlsx), using contributed packages where necessary.

.csv is the most common format for storing and exchanging raw data, because it is neutral and economical. It just consists of the data values separated by commas (csv stands for comma separated values). Sending data in the .csv format makes no assumptions about what software the recipient might use to process or view it: any spreadsheet or statistical package can read and output .csv files. When you publish your raw data, you should probably do so as .csv to maximize accessibility. You load .csv files in with the tidyverse function read_csv() (as we will do in the next section), and you write them with the function write_csv().
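
To make the idea of writing and reading a .csv file concrete, here is a minimal sketch of a round trip. It deliberately uses the base R functions write.csv() and read.csv() rather than the tidyverse functions named above, so that it runs without any contributed packages. The object names (survey, reloaded), the file name, and the use of R's temporary directory are all assumptions made for illustration.

```r
# A small made-up data set: one row per respondent
survey <- data.frame(id = 1:5, score = c(3, 7, 15, 1, 5))

# Write it out as comma separated values
csv_path <- file.path(tempdir(), "survey.csv")
write.csv(survey, csv_path, row.names = FALSE)

# The file is plain text: a header line, then one line per row
print(readLines(csv_path))

# Read it back in as a new object
reloaded <- read.csv(csv_path)
print(reloaded$score)
```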

There is a downside to using the .csv format: it can’t save all kinds of R objects, and it loses some subtle information in those it does save, like labels or the order of the levels of a factor. So, you may also sometimes need to use another format, .Rdata, particularly when you want to save a version of your ongoing R work for safekeeping or to share with collaborators. .Rdata is native to R and not easily readable by other software. An .Rdata file can contain multiple objects, of different classes. You load in .Rdata files with the base R function load(), and write them with the function save().
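
As a sketch of how save() and load() fit together, the code below saves two objects of different classes into one .Rdata file, removes them, and then brings them back. The file name and the use of R's temporary directory are assumptions made so that the example runs anywhere; fruit is a hypothetical object name.

```r
scores <- c(3, 7, 15, 1, 5)
fruit <- c("apple", "pear", "banana")   # hypothetical second object

# Save both objects into a single file
rdata_path <- file.path(tempdir(), "mywork.Rdata")
save(scores, fruit, file = rdata_path)

# Remove them from the environment...
rm(scores, fruit)

# ...and restore them, under their original names
load(rdata_path)
print(scores)
print(fruit)
```

Note that load() recreates the objects under the names they had when saved, so it will overwrite any object of the same name in your current environment.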