2 Data Wrangling

Data wrangling refers to the way to view, change, and reorganize data in a data frame. Before we dive into the operations, it is important to know that there are two distinct schools of data wrangling: base R and tidyverse, where the latter has contributed substantially in enhancing the process. We will start with the base R version, followed by the tidyverse.

Learning Objectives:

This chapter covers common data wrangling techniques. After finishing this chapter, you should be able to:

  • Select rows that fulfill specified conditions.

  • Randomly sample rows from a data frame.

  • Sort rows in a data frame.

  • Select a subset of columns.

  • Add and remove columns to an existing data frame.

  • Tabulate data.

2.1 Base R

2.1.1 Select Rows by Condition(s)

The subset() function selects rows if they satisfy the condition(s). Here are the three relational operators to support multi-parts conditions:

  1. “or” is denoted by |

  2. “and” is denoted by &

  3. “not” is denoted by !

Note that the operator for checking equality is “==”; two equal signs touching each other.

# Choose all setosa iris. 
setosa_iris <- subset(iris, Species == 'setosa')

# How many are selected?
print(paste("There are", nrow(setosa_iris), "rows."))
#> [1] "There are 50 rows."
# Peek into the first few rows of the result.
head(setosa_iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
# More than one condition. The relational operator is the logical OR, i.e., a vertical bar |
iris_or_eg <- subset(iris, Sepal.Length > 5 | Sepal.Width < 3 )

# How many?
print(paste("There are", nrow(iris_or_eg), "rows."))
#> [1] "There are 124 rows."

head(iris_or_eg)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1           5.1         3.5          1.4         0.2
#> 6           5.4         3.9          1.7         0.4
#> 9           4.4         2.9          1.4         0.2
#> 11          5.4         3.7          1.5         0.2
#> 15          5.8         4.0          1.2         0.2
#> 16          5.7         4.4          1.5         0.4
#>    Species
#> 1   setosa
#> 6   setosa
#> 9   setosa
#> 11  setosa
#> 15  setosa
#> 16  setosa
# More than one condition. The relational AND operator.
iris_and_eg <- subset(iris, Petal.Length > 6 & Petal.Width < 2 )

# How many?
print(paste("There are", nrow(iris_and_eg), "rows."))
#> [1] "There are 2 rows."

iris_and_eg
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 108          7.3         2.9          6.3         1.8
#> 131          7.4         2.8          6.1         1.9
#>       Species
#> 108 virginica
#> 131 virginica
# No setosa iris
iris_no_setosa_eg <- subset(iris, Species != 'setosa')

# How many?
print(paste("There are", nrow(iris_no_setosa_eg), "rows."))
#> [1] "There are 100 rows."

head(iris_no_setosa_eg)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 51          7.0         3.2          4.7         1.4
#> 52          6.4         3.2          4.5         1.5
#> 53          6.9         3.1          4.9         1.5
#> 54          5.5         2.3          4.0         1.3
#> 55          6.5         2.8          4.6         1.5
#> 56          5.7         2.8          4.5         1.3
#>       Species
#> 51 versicolor
#> 52 versicolor
#> 53 versicolor
#> 54 versicolor
#> 55 versicolor
#> 56 versicolor

2.1.2 Select Rows Randomly

It is quite common to randomly sample a subset of rows from a data frame. It takes to two steps to achieve that:

Step 1. Create a list (vector) of random row indices by the sample() function. Specify the following parameters:

a.) The range of values. For this example, the range is from 1 to the last row of iris, which is the return of nrow(iris).

  1. How many samples, e.g., size = 5.

  2. Can a sample be drawn repeatedly? If not, replace = F.

# seq(nrow(iris)) generates 1, 2, 3, ..., 150.
rnd_idx <- sample(seq(nrow(iris)), size = 5, replace = F)
rnd_idx
#> [1]  75  24   9  79 142

Step 2: select rows from iris according to the indices in rnd_idx.

a_random_sample <- iris[rnd_idx,]

a_random_sample
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 75           6.4         2.9          4.3         1.3
#> 24           5.1         3.3          1.7         0.5
#> 9            4.4         2.9          1.4         0.2
#> 79           6.0         2.9          4.5         1.5
#> 142          6.9         3.1          5.1         2.3
#>        Species
#> 75  versicolor
#> 24      setosa
#> 9       setosa
#> 79  versicolor
#> 142  virginica

2.1.3 Sort Rows

It is a two-step process similar to the random sampling of rows from the previus subsection. Suppose you want to sort iris in Sepal.Length in ascending order (small to large).

Step 1: Obtain a list of row indices that indicates the sorting order to the data.

sorted_idx <- order(iris$Sepal.Length)
sorted_idx
#>   [1]  14   9  39  43  42   4   7  23  48   3  30  12  13
#>  [14]  25  31  46   2  10  35  38  58 107   5   8  26  27
#>  [27]  36  41  44  50  61  94   1  18  20  22  24  40  45
#>  [40]  47  99  28  29  33  60  49   6  11  17  21  32  85
#>  [53]  34  37  54  81  82  90  91  65  67  70  89  95 122
#>  [66]  16  19  56  80  96  97 100 114  15  68  83  93 102
#>  [79] 115 143  62  71 150  63  79  84  86 120 139  64  72
#>  [92]  74  92 128 135  69  98 127 149  57  73  88 101 104
#> [105] 124 134 137 147  52  75 112 116 129 133 138  55 105
#> [118] 111 117 148  59  76  66  78  87 109 125 141 145 146
#> [131]  77 113 144  53 121 140 142  51 103 110 126 130 108
#> [144] 131 106 118 119 123 136 132

Step 2: Rearrange the rows in the data frame according to the sorted row indices.

# The 5 shortest Sepal Lengths
head(iris[sorted_idx,])
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 14          4.3         3.0          1.1         0.1
#> 9           4.4         2.9          1.4         0.2
#> 39          4.4         3.0          1.3         0.2
#> 43          4.4         3.2          1.3         0.2
#> 42          4.5         2.3          1.3         0.3
#> 4           4.6         3.1          1.5         0.2
#>    Species
#> 14  setosa
#> 9   setosa
#> 39  setosa
#> 43  setosa
#> 42  setosa
#> 4   setosa

# The 5 longest Sepal Lengths
tail(iris[sorted_idx,])
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 106          7.6         3.0          6.6         2.1
#> 118          7.7         3.8          6.7         2.2
#> 119          7.7         2.6          6.9         2.3
#> 123          7.7         2.8          6.7         2.0
#> 136          7.7         3.0          6.1         2.3
#> 132          7.9         3.8          6.4         2.0
#>       Species
#> 106 virginica
#> 118 virginica
#> 119 virginica
#> 123 virginica
#> 136 virginica
#> 132 virginica

Note that the above two steps do NOT change the original order of the rows in the data frame. If you intend to change the row order permanently, use the assignment operator to update the existing data frame or create a new data frame to store the sorted rows. See below:

ascending_iris <- iris[sorted_idx,]

# The shortest Sepal Length
head(ascending_iris)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 14          4.3         3.0          1.1         0.1
#> 9           4.4         2.9          1.4         0.2
#> 39          4.4         3.0          1.3         0.2
#> 43          4.4         3.2          1.3         0.2
#> 42          4.5         2.3          1.3         0.3
#> 4           4.6         3.1          1.5         0.2
#>    Species
#> 14  setosa
#> 9   setosa
#> 39  setosa
#> 43  setosa
#> 42  setosa
#> 4   setosa

# The longest Sepal Length
tail(ascending_iris)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 106          7.6         3.0          6.6         2.1
#> 118          7.7         3.8          6.7         2.2
#> 119          7.7         2.6          6.9         2.3
#> 123          7.7         2.8          6.7         2.0
#> 136          7.7         3.0          6.1         2.3
#> 132          7.9         3.8          6.4         2.0
#>       Species
#> 106 virginica
#> 118 virginica
#> 119 virginica
#> 123 virginica
#> 136 virginica
#> 132 virginica

To sort rows in descending order (large to small), add the parameter decreasing = T to the order() function.

desc_idx <- order(iris$Sepal.Length, decreasing = T)
desc_idx
#>   [1] 132 118 119 123 136 106 131 108 110 126 130 103  51
#>  [14]  53 121 140 142  77 113 144  66  78  87 109 125 141
#>  [27] 145 146  59  76  55 105 111 117 148  52  75 112 116
#>  [40] 129 133 138  57  73  88 101 104 124 134 137 147  69
#>  [53]  98 127 149  64  72  74  92 128 135  63  79  84  86
#>  [66] 120 139  62  71 150  15  68  83  93 102 115 143  16
#>  [79]  19  56  80  96  97 100 114  65  67  70  89  95 122
#>  [92]  34  37  54  81  82  90  91   6  11  17  21  32  85
#> [105]  49  28  29  33  60   1  18  20  22  24  40  45  47
#> [118]  99   5   8  26  27  36  41  44  50  61  94   2  10
#> [131]  35  38  58 107  12  13  25  31  46   3  30   4   7
#> [144]  23  48  42   9  39  43  14

decreasing_iris <- iris[desc_idx,]
head(decreasing_iris)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 132          7.9         3.8          6.4         2.0
#> 118          7.7         3.8          6.7         2.2
#> 119          7.7         2.6          6.9         2.3
#> 123          7.7         2.8          6.7         2.0
#> 136          7.7         3.0          6.1         2.3
#> 106          7.6         3.0          6.6         2.1
#>       Species
#> 132 virginica
#> 118 virginica
#> 119 virginica
#> 123 virginica
#> 136 virginica
#> 106 virginica
tail(decreasing_iris)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 48          4.6         3.2          1.4         0.2
#> 42          4.5         2.3          1.3         0.3
#> 9           4.4         2.9          1.4         0.2
#> 39          4.4         3.0          1.3         0.2
#> 43          4.4         3.2          1.3         0.2
#> 14          4.3         3.0          1.1         0.1
#>    Species
#> 48  setosa
#> 42  setosa
#> 9   setosa
#> 39  setosa
#> 43  setosa
#> 14  setosa

2.1.4 Rearranging Columns

Suppose you want to move the Species column to the leftmost position. It can be done using two approaches.

# Reorder columns by name
new_iris <- iris[,c("Species","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")]
head(new_iris)
#>   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1  setosa          5.1         3.5          1.4         0.2
#> 2  setosa          4.9         3.0          1.4         0.2
#> 3  setosa          4.7         3.2          1.3         0.2
#> 4  setosa          4.6         3.1          1.5         0.2
#> 5  setosa          5.0         3.6          1.4         0.2
#> 6  setosa          5.4         3.9          1.7         0.4
# Reorder columns by index
new_iris <- iris[,c(5,1,2,3,4)]
head(new_iris)
#>   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1  setosa          5.1         3.5          1.4         0.2
#> 2  setosa          4.9         3.0          1.4         0.2
#> 3  setosa          4.7         3.2          1.3         0.2
#> 4  setosa          4.6         3.1          1.5         0.2
#> 5  setosa          5.0         3.6          1.4         0.2
#> 6  setosa          5.4         3.9          1.7         0.4

Note that if a data frame contains many columns, say >100, it is daunting to use this approach. A much better way can be done using dplyr, so stay tuned.

2.1.5 Focus on a Subset of Columns

If a data frame contains many columns, way more than you need for analysis. It is neat to focus on those columns and remove the rest. Let’s suppose you want to focus only on sepal information and species.

sepal_iris <- iris[,c("Sepal.Length","Sepal.Width","Species")]
dim(sepal_iris)
#> [1] 150   3

head(sepal_iris)
#>   Sepal.Length Sepal.Width Species
#> 1          5.1         3.5  setosa
#> 2          4.9         3.0  setosa
#> 3          4.7         3.2  setosa
#> 4          4.6         3.1  setosa
#> 5          5.0         3.6  setosa
#> 6          5.4         3.9  setosa

2.1.6 Add a New Column

Suppose we want to add a new column called Sepal.Ratio, which is the ratio Sepal.Length to Sepal.Width. Here is the code:

iris$Sepal.Ratio <- iris$Sepal.Length/iris$Sepal.Width
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
#>   Sepal.Ratio
#> 1    1.457143
#> 2    1.633333
#> 3    1.468750
#> 4    1.483871
#> 5    1.388889
#> 6    1.384615

You can see the new column Sepal.Ratio is added to the rightmost position.

2.1.7 Remove a Column

You can remove a column from a data frame, e.g., remove the Sepal.Ratio added above.

iris$Sepal.Ratio <- NULL

# the column is gone
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

2.1.8 Tabulation

Tally the number of samples per one or more categorical variables (factors) by the table() function.

table(iris$Species)
#> 
#>     setosa versicolor  virginica 
#>         50         50         50

# Alternatively, use with()
with(iris, table(Species))
#> Species
#>     setosa versicolor  virginica 
#>         50         50         50

It can tally more than one categorical variable. Let’s us another built-in dataset CO2 to illustrate the point.

with(CO2, table(Plant, Type))
#>      Type
#> Plant Quebec Mississippi
#>   Qn1      7           0
#>   Qn2      7           0
#>   Qn3      7           0
#>   Qc1      7           0
#>   Qc3      7           0
#>   Qc2      7           0
#>   Mn3      0           7
#>   Mn2      0           7
#>   Mn1      0           7
#>   Mc2      0           7
#>   Mc3      0           7
#>   Mc1      0           7

2.2 tidyverse

There is an alternate, and importantly, less cryptic way to perform data wrangling. This new set of functions are offered by the R package dplyr. One design goal of dplyr is to use verbs to name these operations, a huge difference from the unpronounceable symbols, such as $, [, etc. by base R as shown in the previous section.

Before we dive into dplyr, it is noteworthy to mention that there is a slew of packages with the aim of improving data cleaning, data visualization, and loading data in various formats. To spare people from the question of when to use what package, all these packages are bundled in a single package called tidyverse. Once you have activated tidyverse, dplyr, ggplot (for plotting data), and other packages are available for use. When you call a function, you don’t have to worry if it is from dplyr or tidyr because the tidyverse community has resolved all name conflicts. An analogy is like there is only one “Main Street” in tidyverse package; no ambiguity at all. The take home is: just activate tidyverse and you are good to go.

tidyverse does not come with the base R so you have to install it manually. But the installation is simple, click the menu option Tools, select the first option “install packages”, type “tidyverse”, and click the “Install” button.

Let’s activate tidyverse by using the library() function. Alternatively, you can use require(); they are largely doing the same thing. Note that you only need to activate a package once per session.

library(tidyverse)
#> ── Attaching core tidyverse packages ──── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
#> ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
#> ✔ purrr     1.1.0     
#> ── Conflicts ────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The library() function has churned out many messages telling you what packages (9 of them) are loaded together.

Let’s use tidyverse to repeat the data wrangling tasks by base R above.

2.2.1 Select Rows by Condition(s)

The function is filter(). It chooses rows that satisfy the specified condition(s). The syntax is filter(data_frame, conditions). E.g.,

# select all setosa iris
setosa_iris <- filter(iris, Species == 'setosa')

# How many?
print(paste("There are", nrow(setosa_iris), "rows."))
#> [1] "There are 50 rows."
# More than one condition. The symbol for the logical OR operator is a vertical bar |
iris_or_eg <- filter(iris, Sepal.Length > 5 | Sepal.Width < 3 )

# How many?
print(paste("There are", nrow(iris_or_eg), "rows."))
#> [1] "There are 124 rows."

head(iris_or_eg)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          5.4         3.9          1.7         0.4  setosa
#> 3          4.4         2.9          1.4         0.2  setosa
#> 4          5.4         3.7          1.5         0.2  setosa
#> 5          5.8         4.0          1.2         0.2  setosa
#> 6          5.7         4.4          1.5         0.4  setosa
# More than one condition. The symbol for the logical AND operator is &
iris_and_eg <- filter(iris, Petal.Length > 6 & Petal.Width < 2 )

# How many?
print(paste("There are", nrow(iris_and_eg), "rows."))
#> [1] "There are 2 rows."

iris_and_eg
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1          7.3         2.9          6.3         1.8
#> 2          7.4         2.8          6.1         1.9
#>     Species
#> 1 virginica
#> 2 virginica
# No setosa iris
iris_not_eg <- filter(iris, Species != 'setosa')

# How many?
print(paste("There are", nrow(iris_not_eg), "rows."))
#> [1] "There are 100 rows."

head(iris_not_eg)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1          7.0         3.2          4.7         1.4
#> 2          6.4         3.2          4.5         1.5
#> 3          6.9         3.1          4.9         1.5
#> 4          5.5         2.3          4.0         1.3
#> 5          6.5         2.8          4.6         1.5
#> 6          5.7         2.8          4.5         1.3
#>      Species
#> 1 versicolor
#> 2 versicolor
#> 3 versicolor
#> 4 versicolor
#> 5 versicolor
#> 6 versicolor

2.2.2 Select Rows Randomly

With tidyverse, you can do it in one step by using the sample_n() function.

a_random_sample <- sample_n(iris, size = 5)

a_random_sample
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1          5.6         3.0          4.1         1.3
#> 2          6.5         3.2          5.1         2.0
#> 3          6.7         3.0          5.2         2.3
#> 4          5.9         3.0          4.2         1.5
#> 5          5.7         2.8          4.5         1.3
#>      Species
#> 1 versicolor
#> 2  virginica
#> 3  virginica
#> 4 versicolor
#> 5 versicolor

2.2.3 Sort Rows in a Data Frame

Use arrange(). Note that the function will not change the original data frame. Use the assignment operator (<-) to change the original data frame or store the changes in a new data frame. For example, sort iris by Sepal.Length, and store the sorted result in a new data frame.

sorted_iris <- arrange(iris, Sepal.Length)

# The shortest Sepal Length
head(sorted_iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          4.3         3.0          1.1         0.1  setosa
#> 2          4.4         2.9          1.4         0.2  setosa
#> 3          4.4         3.0          1.3         0.2  setosa
#> 4          4.4         3.2          1.3         0.2  setosa
#> 5          4.5         2.3          1.3         0.3  setosa
#> 6          4.6         3.1          1.5         0.2  setosa

# The longest Sepal Length
tail(sorted_iris)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 145          7.6         3.0          6.6         2.1
#> 146          7.7         3.8          6.7         2.2
#> 147          7.7         2.6          6.9         2.3
#> 148          7.7         2.8          6.7         2.0
#> 149          7.7         3.0          6.1         2.3
#> 150          7.9         3.8          6.4         2.0
#>       Species
#> 145 virginica
#> 146 virginica
#> 147 virginica
#> 148 virginica
#> 149 virginica
#> 150 virginica

To sort rows according to the descending order of the sorting column, simply mark the sorting column with a minus -. E.g.,

decreasing_iris <- arrange(iris, -Sepal.Length)
head(decreasing_iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1          7.9         3.8          6.4         2.0
#> 2          7.7         3.8          6.7         2.2
#> 3          7.7         2.6          6.9         2.3
#> 4          7.7         2.8          6.7         2.0
#> 5          7.7         3.0          6.1         2.3
#> 6          7.6         3.0          6.6         2.1
#>     Species
#> 1 virginica
#> 2 virginica
#> 3 virginica
#> 4 virginica
#> 5 virginica
#> 6 virginica
tail(decreasing_iris)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 145          4.6         3.2          1.4         0.2
#> 146          4.5         2.3          1.3         0.3
#> 147          4.4         2.9          1.4         0.2
#> 148          4.4         3.0          1.3         0.2
#> 149          4.4         3.2          1.3         0.2
#> 150          4.3         3.0          1.1         0.1
#>     Species
#> 145  setosa
#> 146  setosa
#> 147  setosa
#> 148  setosa
#> 149  setosa
#> 150  setosa

Specify more than one sorting column is much easier in dplyr than base R. E.g. sort iris in descending order Sepal.Length and ascending order of Petal.Length.

double_sorted_iris <- arrange(iris, -Sepal.Length, Petal.Length)
head(double_sorted_iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1          7.9         3.8          6.4         2.0
#> 2          7.7         3.0          6.1         2.3
#> 3          7.7         3.8          6.7         2.2
#> 4          7.7         2.8          6.7         2.0
#> 5          7.7         2.6          6.9         2.3
#> 6          7.6         3.0          6.6         2.1
#>     Species
#> 1 virginica
#> 2 virginica
#> 3 virginica
#> 4 virginica
#> 5 virginica
#> 6 virginica
tail(double_sorted_iris)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 145          4.6         3.1          1.5         0.2
#> 146          4.5         2.3          1.3         0.3
#> 147          4.4         3.0          1.3         0.2
#> 148          4.4         3.2          1.3         0.2
#> 149          4.4         2.9          1.4         0.2
#> 150          4.3         3.0          1.1         0.1
#>     Species
#> 145  setosa
#> 146  setosa
#> 147  setosa
#> 148  setosa
#> 149  setosa
#> 150  setosa

For rows with equal Sepal.Length (rows 2-5, 147-149), you can see their Petal.Length are in ascending order.

2.2.4 Rearranging columns of a data frame

Suppose you want to move the Species column to the leftmost position. You’re going to love dplyr, as it can be done easily by using two functions: select() and everything().

# Reorder columns by name
new_iris <- select(iris, Species, everything())
head(new_iris)
#>   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1  setosa          5.1         3.5          1.4         0.2
#> 2  setosa          4.9         3.0          1.4         0.2
#> 3  setosa          4.7         3.2          1.3         0.2
#> 4  setosa          4.6         3.1          1.5         0.2
#> 5  setosa          5.0         3.6          1.4         0.2
#> 6  setosa          5.4         3.9          1.7         0.4

2.2.5 Focus on a Subset of Columns

If a data frame contains many columns, way more than you need for analysis. It is neat to focus on those columns and remove the rest. E.g., use the select() function to keep only sepal information and species.

sepal_iris <- select(iris, Sepal.Length, Sepal.Width, Species)
dim(sepal_iris)
#> [1] 150   3

head(sepal_iris)
#>   Sepal.Length Sepal.Width Species
#> 1          5.1         3.5  setosa
#> 2          4.9         3.0  setosa
#> 3          4.7         3.2  setosa
#> 4          4.6         3.1  setosa
#> 5          5.0         3.6  setosa
#> 6          5.4         3.9  setosa

2.2.6 Add a New Column

Use the mutate() function add columns to a data frame.

iris <- mutate(iris, Sepal.Ratio = Sepal.Length/Sepal.Width)
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
#>   Sepal.Ratio
#> 1    1.457143
#> 2    1.633333
#> 3    1.468750
#> 4    1.483871
#> 5    1.388889
#> 6    1.384615

2.2.7 Remove a Column

Use the select() function as before but put a minus '-' symbol before the column name to be removed.

iris <- select(iris, -Sepal.Ratio)
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

2.2.8 Rename a Column

So far, you have seen all the data wrangling tasks performed by base R using tidyverse. Starting from here, you are going to demonstrate tasks that are harder to do with base R. The first task is to rename a column. Suppose you want to rename Species to Iris_Species. You can do it by the rename() function:

# new_name = existing_name
iris <- rename(iris, Iris_Species = Species)
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1          5.1         3.5          1.4         0.2
#> 2          4.9         3.0          1.4         0.2
#> 3          4.7         3.2          1.3         0.2
#> 4          4.6         3.1          1.5         0.2
#> 5          5.0         3.6          1.4         0.2
#> 6          5.4         3.9          1.7         0.4
#>   Iris_Species
#> 1       setosa
#> 2       setosa
#> 3       setosa
#> 4       setosa
#> 5       setosa
#> 6       setosa

Let’s revert the renaming so it will not affect the following examples.

iris <- rename(iris, Species = Iris_Species)
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

2.2.9 Pipe

Often, data wrangling is a multi-step process in which many intermediate steps are involved before producing the final result. E.g., select iris where its Petal.Length is between 0.5 and 2, and sort the results in descending order of Petal.Length. As you can see, it involves filtering, followed by sorting. It can done into two steps via an intermediate data frame as below:

tmp_iris <- filter(iris, Petal.Length >= 0.5 & Petal.Length <=2)
final_iris <- arrange(tmp_iris, -Petal.Length)
dim(final_iris)
#> [1] 50  5
head(final_iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          4.8         3.4          1.9         0.2  setosa
#> 2          5.1         3.8          1.9         0.4  setosa
#> 3          5.4         3.9          1.7         0.4  setosa
#> 4          5.7         3.8          1.7         0.3  setosa
#> 5          5.4         3.4          1.7         0.2  setosa
#> 6          5.1         3.3          1.7         0.5  setosa

The final result contains 50 rows.

tidyverse provides a neater way - no intermediate data frames - to do it by the concept of pip. You can imagine it like the assembly line of a factory, where an upstream work station passes (pipe) intermediate results to the next work station for further processing.

The pip symbol is represented by %>%, no space between them. Let’s see how to apply pip to the above example without creating the intermediate data frame tmp_iris.

final_results <- iris %>% 
                   filter(Petal.Length >= 0.5 & Petal.Length <=2) %>%
                   arrange(-Petal.Length)
dim(final_iris)
#> [1] 50  5
head(final_iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          4.8         3.4          1.9         0.2  setosa
#> 2          5.1         3.8          1.9         0.4  setosa
#> 3          5.4         3.9          1.7         0.4  setosa
#> 4          5.7         3.8          1.7         0.3  setosa
#> 5          5.4         3.4          1.7         0.2  setosa
#> 6          5.1         3.3          1.7         0.5  setosa

Let’s walk through the code above line by line.

  • iris %>% sends all rows, 150 in total, to the downstream filter() function.

  • filter(Petal.Length >= 0.5 & Petal.Length <=2) %>% receives the data, and check each row if it satisfies the specified conditions. If so, send it to the downstream process; otherwise do not pass down.

  • arrange(-Petal.Length) received rows that satisfy the previous filtering process, and sort them in descending order of Petal.Length.

  • final_results <- the sorted result is assigned to the variable final_results.

By using pip (%>%), you can avoid intermediate variables, making the code cleaner and easily to read.

2.2.10 Tabulation and Summarize

Another task which is useful but difficult to get it done in base R is to produce summary information by group. E.g., you want to produce a table that shows the average sepal length for each species. Here’s how to do it using two new functions group_by() and summarize(), and pipe them together.

E.g., tally the number of samples by Plant and ‘Type’ in CO2. The function that does the tallying is n().

CO2 %>%
  group_by(Plant, Type) %>%
  summarize(n=n())
#> `summarise()` has grouped output by 'Plant'. You can
#> override using the `.groups` argument.
#> # A tibble: 12 × 3
#> # Groups:   Plant [12]
#>    Plant Type            n
#>    <ord> <fct>       <int>
#>  1 Qn1   Quebec          7
#>  2 Qn2   Quebec          7
#>  3 Qn3   Quebec          7
#>  4 Qc1   Quebec          7
#>  5 Qc3   Quebec          7
#>  6 Qc2   Quebec          7
#>  7 Mn3   Mississippi     7
#>  8 Mn2   Mississippi     7
#>  9 Mn1   Mississippi     7
#> 10 Mc2   Mississippi     7
#> 11 Mc3   Mississippi     7
#> 12 Mc1   Mississippi     7

You do not see much difference between base R and tidyverse in the example above. But if you want to do more than tallying, summarize() is a treat, as summarize() supports the descriptive statistics functions mentioned before, such as mean(), sd(), median(), quantile(), etc. E.g,

iris %>%
  group_by(Species) %>%
  summarize(u=mean(Sepal.Length))
#> # A tibble: 3 × 2
#>   Species        u
#>   <fct>      <dbl>
#> 1 setosa      5.01
#> 2 versicolor  5.94
#> 3 virginica   6.59

You can apply multiple functions in summarize(). E.g., shows the number of samples and standard deviation.

iris %>%
  group_by(Species) %>%
  summarize(n=n(),
            u=mean(Sepal.Length),
            s=sd(Sepal.Length))
#> # A tibble: 3 × 4
#>   Species        n     u     s
#>   <fct>      <int> <dbl> <dbl>
#> 1 setosa        50  5.01 0.352
#> 2 versicolor    50  5.94 0.516
#> 3 virginica     50  6.59 0.636

You can group by more than one column. Let’s switch to CO2 dataset to illustrate this point.

CO2 %>%
  group_by(Plant, Type, Treatment) %>%
  summarize(n=n(),
            u=mean(conc),
            s=sd(conc))
#> `summarise()` has grouped output by 'Plant', 'Type'. You
#> can override using the `.groups` argument.
#> # A tibble: 12 × 6
#> # Groups:   Plant, Type [12]
#>    Plant Type        Treatment      n     u     s
#>    <ord> <fct>       <fct>      <int> <dbl> <dbl>
#>  1 Qn1   Quebec      nonchilled     7   435  318.
#>  2 Qn2   Quebec      nonchilled     7   435  318.
#>  3 Qn3   Quebec      nonchilled     7   435  318.
#>  4 Qc1   Quebec      chilled        7   435  318.
#>  5 Qc3   Quebec      chilled        7   435  318.
#>  6 Qc2   Quebec      chilled        7   435  318.
#>  7 Mn3   Mississippi nonchilled     7   435  318.
#>  8 Mn2   Mississippi nonchilled     7   435  318.
#>  9 Mn1   Mississippi nonchilled     7   435  318.
#> 10 Mc2   Mississippi chilled        7   435  318.
#> 11 Mc3   Mississippi chilled        7   435  318.
#> 12 Mc1   Mississippi chilled        7   435  318.

Two remarks:

  1. group_by() is never used alone. It is often followed by summarize().

  2. The variables feed to group_by() must be factor, i.e., categorical.

2.2.11 A Summary of tidyverse functions

  • filter(): select rows that satistify the specified conditions.

  • select(): choose or drop certain columns.

  • arrange(): sort rows by the specified columns.

  • rename(): change the name of a column.

  • mutate(): add a column or modify the values of an existing column.

  • group_by(), followed by summarize()

  • pip %>%: funnel intermediate result from an upstream process to the input of an immediate downstream process.

2.3 Summary

In this chapter, you have seen how to perform data wrangling with basic R and tidyverse. What you have seen is only a fraction of what tidyverse can do. Readers who want to explore the full capability of tidyverse should consult the most authorative source: R for Data Science by Hadley Wickham & Garrett Grolemund (https://r4ds.had.co.nz/). In the next chapter, you will see how to plot data. A picture is worth a thousand words. Data visualization is an indispensible step for data analysis. But before the next topic, let’s recall all the new R commands discussed in this chapter.

|, &, ==, !, !=, subset(), sample(), order(), NONE (a keyword), table(), library(), require(), filter(), select(), arrange(), rename(), mutate(), group_by(), summarize(), n(), and pip %>%.

Table 2.1: A Summary of R Operators and Functions
x
&#124;
&
==
!
!=
subset()