2 Data Wrangling
Data wrangling refers to the way to view, change, and reorganize data in a data frame. Before we dive into the operations, it is important to know that there are two distinct schools of data wrangling: base R and tidyverse
, where the latter has contributed substantially in enhancing the process. We will start with the base R version, followed by the tidyverse
.
Learning Objectives:
This chapter covers common data wrangling techniques. After finishing this chapter, you should be able to:
Select rows that fulfill specified conditions.
Randomly sample rows from a data frame.
Sort rows in a data frame.
Select a subset of columns.
Add and remove columns to an existing data frame.
Tabulate data.
2.1 Base R
2.1.1 Select Rows by Condition(s)
The subset()
function selects rows if they satisfy the condition(s). Here are the three relational operators to support multi-parts conditions:
“or” is denoted by
|
“and” is denoted by
&
“not” is denoted by
!
Note that the operator for checking equality is “==”; two equal signs touching each other.
# Choose all setosa iris.
setosa_iris <- subset(iris, Species == 'setosa')
# How many are selected?
print(paste("There are", nrow(setosa_iris), "rows."))
#> [1] "There are 50 rows."
# Peek into the first few rows of the result.
head(setosa_iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
# More than one condition. The relational operator is the logical OR, i.e., a vertical bar |
iris_or_eg <- subset(iris, Sepal.Length > 5 | Sepal.Width < 3 )
# How many?
print(paste("There are", nrow(iris_or_eg), "rows."))
#> [1] "There are 124 rows."
head(iris_or_eg)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.1 3.5 1.4 0.2
#> 6 5.4 3.9 1.7 0.4
#> 9 4.4 2.9 1.4 0.2
#> 11 5.4 3.7 1.5 0.2
#> 15 5.8 4.0 1.2 0.2
#> 16 5.7 4.4 1.5 0.4
#> Species
#> 1 setosa
#> 6 setosa
#> 9 setosa
#> 11 setosa
#> 15 setosa
#> 16 setosa
# More than one condition. The relational AND operator.
iris_and_eg <- subset(iris, Petal.Length > 6 & Petal.Width < 2 )
# How many?
print(paste("There are", nrow(iris_and_eg), "rows."))
#> [1] "There are 2 rows."
iris_and_eg
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 108 7.3 2.9 6.3 1.8
#> 131 7.4 2.8 6.1 1.9
#> Species
#> 108 virginica
#> 131 virginica
# No setosa iris
iris_no_setosa_eg <- subset(iris, Species != 'setosa')
# How many?
print(paste("There are", nrow(iris_no_setosa_eg), "rows."))
#> [1] "There are 100 rows."
head(iris_no_setosa_eg)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 51 7.0 3.2 4.7 1.4
#> 52 6.4 3.2 4.5 1.5
#> 53 6.9 3.1 4.9 1.5
#> 54 5.5 2.3 4.0 1.3
#> 55 6.5 2.8 4.6 1.5
#> 56 5.7 2.8 4.5 1.3
#> Species
#> 51 versicolor
#> 52 versicolor
#> 53 versicolor
#> 54 versicolor
#> 55 versicolor
#> 56 versicolor
2.1.2 Select Rows Randomly
It is quite common to randomly sample a subset of rows from a data frame. It takes to two steps to achieve that:
Step 1. Create a list (vector) of random row indices by the sample()
function. Specify the following parameters:
a.) The range of values. For this example, the range is from 1 to the last row of iris
, which is the return of nrow(iris)
.
How many samples, e.g.,
size = 5
.Can a sample be drawn repeatedly? If not,
replace = F
.
# seq(nrow(iris)) generates 1, 2, 3, ..., 150.
rnd_idx <- sample(seq(nrow(iris)), size = 5, replace = F)
rnd_idx
#> [1] 75 24 9 79 142
Step 2: select rows from iris according to the indices in rnd_idx
.
a_random_sample <- iris[rnd_idx,]
a_random_sample
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 75 6.4 2.9 4.3 1.3
#> 24 5.1 3.3 1.7 0.5
#> 9 4.4 2.9 1.4 0.2
#> 79 6.0 2.9 4.5 1.5
#> 142 6.9 3.1 5.1 2.3
#> Species
#> 75 versicolor
#> 24 setosa
#> 9 setosa
#> 79 versicolor
#> 142 virginica
2.1.3 Sort Rows
It is a two-step process similar to the random sampling of rows from the previus subsection. Suppose you want to sort iris
in Sepal.Length
in ascending order (small to large).
Step 1: Obtain a list of row indices that indicates the sorting order to the data.
sorted_idx <- order(iris$Sepal.Length)
sorted_idx
#> [1] 14 9 39 43 42 4 7 23 48 3 30 12 13
#> [14] 25 31 46 2 10 35 38 58 107 5 8 26 27
#> [27] 36 41 44 50 61 94 1 18 20 22 24 40 45
#> [40] 47 99 28 29 33 60 49 6 11 17 21 32 85
#> [53] 34 37 54 81 82 90 91 65 67 70 89 95 122
#> [66] 16 19 56 80 96 97 100 114 15 68 83 93 102
#> [79] 115 143 62 71 150 63 79 84 86 120 139 64 72
#> [92] 74 92 128 135 69 98 127 149 57 73 88 101 104
#> [105] 124 134 137 147 52 75 112 116 129 133 138 55 105
#> [118] 111 117 148 59 76 66 78 87 109 125 141 145 146
#> [131] 77 113 144 53 121 140 142 51 103 110 126 130 108
#> [144] 131 106 118 119 123 136 132
Step 2: Rearrange the rows in the data frame according to the sorted row indices.
# The 5 shortest Sepal Lengths
head(iris[sorted_idx,])
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 14 4.3 3.0 1.1 0.1
#> 9 4.4 2.9 1.4 0.2
#> 39 4.4 3.0 1.3 0.2
#> 43 4.4 3.2 1.3 0.2
#> 42 4.5 2.3 1.3 0.3
#> 4 4.6 3.1 1.5 0.2
#> Species
#> 14 setosa
#> 9 setosa
#> 39 setosa
#> 43 setosa
#> 42 setosa
#> 4 setosa
# The 5 longest Sepal Lengths
tail(iris[sorted_idx,])
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 106 7.6 3.0 6.6 2.1
#> 118 7.7 3.8 6.7 2.2
#> 119 7.7 2.6 6.9 2.3
#> 123 7.7 2.8 6.7 2.0
#> 136 7.7 3.0 6.1 2.3
#> 132 7.9 3.8 6.4 2.0
#> Species
#> 106 virginica
#> 118 virginica
#> 119 virginica
#> 123 virginica
#> 136 virginica
#> 132 virginica
Note that the above two steps do NOT change the original order of the rows in the data frame. If you intend to change the row order permanently, use the assignment operator to update the existing data frame or create a new data frame to store the sorted rows. See below:
ascending_iris <- iris[sorted_idx,]
# The shortest Sepal Length
head(ascending_iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 14 4.3 3.0 1.1 0.1
#> 9 4.4 2.9 1.4 0.2
#> 39 4.4 3.0 1.3 0.2
#> 43 4.4 3.2 1.3 0.2
#> 42 4.5 2.3 1.3 0.3
#> 4 4.6 3.1 1.5 0.2
#> Species
#> 14 setosa
#> 9 setosa
#> 39 setosa
#> 43 setosa
#> 42 setosa
#> 4 setosa
# The longest Sepal Length
tail(ascending_iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 106 7.6 3.0 6.6 2.1
#> 118 7.7 3.8 6.7 2.2
#> 119 7.7 2.6 6.9 2.3
#> 123 7.7 2.8 6.7 2.0
#> 136 7.7 3.0 6.1 2.3
#> 132 7.9 3.8 6.4 2.0
#> Species
#> 106 virginica
#> 118 virginica
#> 119 virginica
#> 123 virginica
#> 136 virginica
#> 132 virginica
To sort rows in descending order (large to small), add the parameter decreasing = T
to the order()
function.
desc_idx <- order(iris$Sepal.Length, decreasing = T)
desc_idx
#> [1] 132 118 119 123 136 106 131 108 110 126 130 103 51
#> [14] 53 121 140 142 77 113 144 66 78 87 109 125 141
#> [27] 145 146 59 76 55 105 111 117 148 52 75 112 116
#> [40] 129 133 138 57 73 88 101 104 124 134 137 147 69
#> [53] 98 127 149 64 72 74 92 128 135 63 79 84 86
#> [66] 120 139 62 71 150 15 68 83 93 102 115 143 16
#> [79] 19 56 80 96 97 100 114 65 67 70 89 95 122
#> [92] 34 37 54 81 82 90 91 6 11 17 21 32 85
#> [105] 49 28 29 33 60 1 18 20 22 24 40 45 47
#> [118] 99 5 8 26 27 36 41 44 50 61 94 2 10
#> [131] 35 38 58 107 12 13 25 31 46 3 30 4 7
#> [144] 23 48 42 9 39 43 14
decreasing_iris <- iris[desc_idx,]
head(decreasing_iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 132 7.9 3.8 6.4 2.0
#> 118 7.7 3.8 6.7 2.2
#> 119 7.7 2.6 6.9 2.3
#> 123 7.7 2.8 6.7 2.0
#> 136 7.7 3.0 6.1 2.3
#> 106 7.6 3.0 6.6 2.1
#> Species
#> 132 virginica
#> 118 virginica
#> 119 virginica
#> 123 virginica
#> 136 virginica
#> 106 virginica
tail(decreasing_iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 48 4.6 3.2 1.4 0.2
#> 42 4.5 2.3 1.3 0.3
#> 9 4.4 2.9 1.4 0.2
#> 39 4.4 3.0 1.3 0.2
#> 43 4.4 3.2 1.3 0.2
#> 14 4.3 3.0 1.1 0.1
#> Species
#> 48 setosa
#> 42 setosa
#> 9 setosa
#> 39 setosa
#> 43 setosa
#> 14 setosa
2.1.4 Rearranging Columns
Suppose you want to move the Species
column to the leftmost position. It can be done using two approaches.
# Reorder columns by name
new_iris <- iris[,c("Species","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")]
head(new_iris)
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 setosa 5.1 3.5 1.4 0.2
#> 2 setosa 4.9 3.0 1.4 0.2
#> 3 setosa 4.7 3.2 1.3 0.2
#> 4 setosa 4.6 3.1 1.5 0.2
#> 5 setosa 5.0 3.6 1.4 0.2
#> 6 setosa 5.4 3.9 1.7 0.4
# Reorder columns by index
new_iris <- iris[,c(5,1,2,3,4)]
head(new_iris)
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 setosa 5.1 3.5 1.4 0.2
#> 2 setosa 4.9 3.0 1.4 0.2
#> 3 setosa 4.7 3.2 1.3 0.2
#> 4 setosa 4.6 3.1 1.5 0.2
#> 5 setosa 5.0 3.6 1.4 0.2
#> 6 setosa 5.4 3.9 1.7 0.4
Note that if a data frame contains many columns, say >100, it is daunting to use this approach. A much better way can be done using dplyr
, so stay tuned.
2.1.5 Focus on a Subset of Columns
If a data frame contains many columns, way more than you need for analysis. It is neat to focus on those columns and remove the rest. Let’s suppose you want to focus only on sepal information and species.
2.1.6 Add a New Column
Suppose we want to add a new column called Sepal.Ratio
, which is the ratio Sepal.Length
to Sepal.Width
. Here is the code:
iris$Sepal.Ratio <- iris$Sepal.Length/iris$Sepal.Width
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> Sepal.Ratio
#> 1 1.457143
#> 2 1.633333
#> 3 1.468750
#> 4 1.483871
#> 5 1.388889
#> 6 1.384615
You can see the new column Sepal.Ratio
is added to the rightmost position.
2.1.7 Remove a Column
You can remove a column from a data frame, e.g., remove the Sepal.Ratio
added above.
iris$Sepal.Ratio <- NULL
# the column is gone
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
2.1.8 Tabulation
Tally the number of samples per one or more categorical variables (factors) by the table()
function.
table(iris$Species)
#>
#> setosa versicolor virginica
#> 50 50 50
# Alternatively, use with()
with(iris, table(Species))
#> Species
#> setosa versicolor virginica
#> 50 50 50
It can tally more than one categorical variable. Let’s us another built-in dataset CO2
to illustrate the point.
2.2 tidyverse
There is an alternate, and importantly, less cryptic way to perform data wrangling. This new set of functions are offered by the R package dplyr
. One design goal of dplyr
is to use verbs to name these operations, a huge difference from the unpronounceable symbols, such as $
, [
, etc. by base R as shown in the previous section.
Before we dive into dplyr
, it is noteworthy to mention that there is a slew of packages with the aim of improving data cleaning, data visualization, and loading data in various formats. To spare people from the question of when to use what package, all these packages are bundled in a single package called tidyverse
. Once you have activated tidyverse
, dplyr
, ggplot
(for plotting data), and other packages are available for use. When you call a function, you don’t have to worry if it is from dplyr
or tidyr
because the tidyverse
community has resolved all name conflicts. An analogy is like there is only one “Main Street” in tidyverse
package; no ambiguity at all. The take home is: just activate tidyverse
and you are good to go.
tidyverse
does not come with the base R so you have to install it manually. But the installation is simple, click the menu option Tools, select the first option “install packages”, type “tidyverse”, and click the “Install” button.
Let’s activate tidyverse
by using the library()
function. Alternatively, you can use require()
; they are largely doing the same thing. Note that you only need to activate a package once per session.
library(tidyverse)
#> ── Attaching core tidyverse packages ──── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
#> ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
#> ✔ purrr 1.1.0
#> ── Conflicts ────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The library()
function has churned out many messages telling you what packages (9 of them) are loaded together.
Let’s use tidyverse
to repeat the data wrangling tasks by base R above.
2.2.1 Select Rows by Condition(s)
The function is filter()
. It chooses rows that satisfy the specified condition(s). The syntax is filter(data_frame, conditions)
. E.g.,
# select all setosa iris
setosa_iris <- filter(iris, Species == 'setosa')
# How many?
print(paste("There are", nrow(setosa_iris), "rows."))
#> [1] "There are 50 rows."
# More than one condition. The symbol for the logical OR operator is a vertical bar |
iris_or_eg <- filter(iris, Sepal.Length > 5 | Sepal.Width < 3 )
# How many?
print(paste("There are", nrow(iris_or_eg), "rows."))
#> [1] "There are 124 rows."
head(iris_or_eg)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 5.4 3.9 1.7 0.4 setosa
#> 3 4.4 2.9 1.4 0.2 setosa
#> 4 5.4 3.7 1.5 0.2 setosa
#> 5 5.8 4.0 1.2 0.2 setosa
#> 6 5.7 4.4 1.5 0.4 setosa
# More than one condition. The symbol for the logical AND operator is &
iris_and_eg <- filter(iris, Petal.Length > 6 & Petal.Width < 2 )
# How many?
print(paste("There are", nrow(iris_and_eg), "rows."))
#> [1] "There are 2 rows."
iris_and_eg
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 7.3 2.9 6.3 1.8
#> 2 7.4 2.8 6.1 1.9
#> Species
#> 1 virginica
#> 2 virginica
# No setosa iris
iris_not_eg <- filter(iris, Species != 'setosa')
# How many?
print(paste("There are", nrow(iris_not_eg), "rows."))
#> [1] "There are 100 rows."
head(iris_not_eg)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 7.0 3.2 4.7 1.4
#> 2 6.4 3.2 4.5 1.5
#> 3 6.9 3.1 4.9 1.5
#> 4 5.5 2.3 4.0 1.3
#> 5 6.5 2.8 4.6 1.5
#> 6 5.7 2.8 4.5 1.3
#> Species
#> 1 versicolor
#> 2 versicolor
#> 3 versicolor
#> 4 versicolor
#> 5 versicolor
#> 6 versicolor
2.2.2 Select Rows Randomly
With tidyverse
, you can do it in one step by using the sample_n()
function.
a_random_sample <- sample_n(iris, size = 5)
a_random_sample
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.6 3.0 4.1 1.3
#> 2 6.5 3.2 5.1 2.0
#> 3 6.7 3.0 5.2 2.3
#> 4 5.9 3.0 4.2 1.5
#> 5 5.7 2.8 4.5 1.3
#> Species
#> 1 versicolor
#> 2 virginica
#> 3 virginica
#> 4 versicolor
#> 5 versicolor
2.2.3 Sort Rows in a Data Frame
Use arrange()
. Note that the function will not change the original data frame. Use the assignment operator (<-
) to change the original data frame or store the changes in a new data frame. For example, sort iris
by Sepal.Length
, and store the sorted result in a new data frame.
sorted_iris <- arrange(iris, Sepal.Length)
# The shortest Sepal Length
head(sorted_iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 4.3 3.0 1.1 0.1 setosa
#> 2 4.4 2.9 1.4 0.2 setosa
#> 3 4.4 3.0 1.3 0.2 setosa
#> 4 4.4 3.2 1.3 0.2 setosa
#> 5 4.5 2.3 1.3 0.3 setosa
#> 6 4.6 3.1 1.5 0.2 setosa
# The longest Sepal Length
tail(sorted_iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 145 7.6 3.0 6.6 2.1
#> 146 7.7 3.8 6.7 2.2
#> 147 7.7 2.6 6.9 2.3
#> 148 7.7 2.8 6.7 2.0
#> 149 7.7 3.0 6.1 2.3
#> 150 7.9 3.8 6.4 2.0
#> Species
#> 145 virginica
#> 146 virginica
#> 147 virginica
#> 148 virginica
#> 149 virginica
#> 150 virginica
To sort rows according to the descending order of the sorting column, simply mark the sorting column with a minus -
. E.g.,
decreasing_iris <- arrange(iris, -Sepal.Length)
head(decreasing_iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 7.9 3.8 6.4 2.0
#> 2 7.7 3.8 6.7 2.2
#> 3 7.7 2.6 6.9 2.3
#> 4 7.7 2.8 6.7 2.0
#> 5 7.7 3.0 6.1 2.3
#> 6 7.6 3.0 6.6 2.1
#> Species
#> 1 virginica
#> 2 virginica
#> 3 virginica
#> 4 virginica
#> 5 virginica
#> 6 virginica
tail(decreasing_iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 145 4.6 3.2 1.4 0.2
#> 146 4.5 2.3 1.3 0.3
#> 147 4.4 2.9 1.4 0.2
#> 148 4.4 3.0 1.3 0.2
#> 149 4.4 3.2 1.3 0.2
#> 150 4.3 3.0 1.1 0.1
#> Species
#> 145 setosa
#> 146 setosa
#> 147 setosa
#> 148 setosa
#> 149 setosa
#> 150 setosa
Specify more than one sorting column is much easier in dplyr
than base R. E.g. sort iris
in descending order Sepal.Length
and ascending order of Petal.Length
.
double_sorted_iris <- arrange(iris, -Sepal.Length, Petal.Length)
head(double_sorted_iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 7.9 3.8 6.4 2.0
#> 2 7.7 3.0 6.1 2.3
#> 3 7.7 3.8 6.7 2.2
#> 4 7.7 2.8 6.7 2.0
#> 5 7.7 2.6 6.9 2.3
#> 6 7.6 3.0 6.6 2.1
#> Species
#> 1 virginica
#> 2 virginica
#> 3 virginica
#> 4 virginica
#> 5 virginica
#> 6 virginica
tail(double_sorted_iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 145 4.6 3.1 1.5 0.2
#> 146 4.5 2.3 1.3 0.3
#> 147 4.4 3.0 1.3 0.2
#> 148 4.4 3.2 1.3 0.2
#> 149 4.4 2.9 1.4 0.2
#> 150 4.3 3.0 1.1 0.1
#> Species
#> 145 setosa
#> 146 setosa
#> 147 setosa
#> 148 setosa
#> 149 setosa
#> 150 setosa
For rows with equal Sepal.Length
(rows 2-5, 147-149), you can see their Petal.Length
are in ascending order.
2.2.4 Rearranging columns of a data frame
Suppose you want to move the Species
column to the leftmost position. You’re going to love dplyr
, as it can be done easily by using two functions: select()
and everything()
.
# Reorder columns by name
new_iris <- select(iris, Species, everything())
head(new_iris)
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 setosa 5.1 3.5 1.4 0.2
#> 2 setosa 4.9 3.0 1.4 0.2
#> 3 setosa 4.7 3.2 1.3 0.2
#> 4 setosa 4.6 3.1 1.5 0.2
#> 5 setosa 5.0 3.6 1.4 0.2
#> 6 setosa 5.4 3.9 1.7 0.4
2.2.5 Focus on a Subset of Columns
If a data frame contains many columns, way more than you need for analysis. It is neat to focus on those columns and remove the rest. E.g., use the select()
function to keep only sepal information and species.
2.2.6 Add a New Column
Use the mutate()
function add columns to a data frame.
iris <- mutate(iris, Sepal.Ratio = Sepal.Length/Sepal.Width)
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> Sepal.Ratio
#> 1 1.457143
#> 2 1.633333
#> 3 1.468750
#> 4 1.483871
#> 5 1.388889
#> 6 1.384615
2.2.7 Remove a Column
Use the select()
function as before but put a minus '-'
symbol before the column name to be removed.
2.2.8 Rename a Column
So far, you have seen all the data wrangling tasks performed by base R using tidyverse
. Starting from here, you are going to demonstrate tasks that are harder to do with base R. The first task is to rename a column. Suppose you want to rename Species
to Iris_Species
. You can do it by the rename()
function:
# new_name = existing_name
iris <- rename(iris, Iris_Species = Species)
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3.0 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5.0 3.6 1.4 0.2
#> 6 5.4 3.9 1.7 0.4
#> Iris_Species
#> 1 setosa
#> 2 setosa
#> 3 setosa
#> 4 setosa
#> 5 setosa
#> 6 setosa
Let’s revert the renaming so it will not affect the following examples.
2.2.9 Pipe
Often, data wrangling is a multi-step process in which many intermediate steps are involved before producing the final result. E.g., select iris where its Petal.Length
is between 0.5 and 2, and sort the results in descending order of Petal.Length
. As you can see, it involves filtering, followed by sorting. It can done into two steps via an intermediate data frame as below:
tmp_iris <- filter(iris, Petal.Length >= 0.5 & Petal.Length <=2)
final_iris <- arrange(tmp_iris, -Petal.Length)
dim(final_iris)
#> [1] 50 5
head(final_iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 4.8 3.4 1.9 0.2 setosa
#> 2 5.1 3.8 1.9 0.4 setosa
#> 3 5.4 3.9 1.7 0.4 setosa
#> 4 5.7 3.8 1.7 0.3 setosa
#> 5 5.4 3.4 1.7 0.2 setosa
#> 6 5.1 3.3 1.7 0.5 setosa
The final result contains 50 rows.
tidyverse
provides a neater way - no intermediate data frames - to do it by the concept of pip. You can imagine it like the assembly line of a factory, where an upstream work station passes (pipe) intermediate results to the next work station for further processing.
The pip symbol is represented by %>%
, no space between them. Let’s see how to apply pip to the above example without creating the intermediate data frame tmp_iris
.
final_results <- iris %>%
filter(Petal.Length >= 0.5 & Petal.Length <=2) %>%
arrange(-Petal.Length)
dim(final_iris)
#> [1] 50 5
head(final_iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 4.8 3.4 1.9 0.2 setosa
#> 2 5.1 3.8 1.9 0.4 setosa
#> 3 5.4 3.9 1.7 0.4 setosa
#> 4 5.7 3.8 1.7 0.3 setosa
#> 5 5.4 3.4 1.7 0.2 setosa
#> 6 5.1 3.3 1.7 0.5 setosa
Let’s walk through the code above line by line.
iris %>%
sends all rows, 150 in total, to the downstreamfilter()
function.filter(Petal.Length >= 0.5 & Petal.Length <=2) %>%
receives the data, and check each row if it satisfies the specified conditions. If so, send it to the downstream process; otherwise do not pass down.arrange(-Petal.Length)
received rows that satisfy the previous filtering process, and sort them in descending order ofPetal.Length
.final_results <-
the sorted result is assigned to the variablefinal_results
.
By using pip (%>%
), you can avoid intermediate variables, making the code cleaner and easily to read.
2.2.10 Tabulation and Summarize
Another task which is useful but difficult to get it done in base R is to produce summary information by group. E.g., you want to produce a table that shows the average sepal length for each species. Here’s how to do it using two new functions group_by()
and summarize()
, and pipe them together.
E.g., tally the number of samples by Plant
and ‘Type’ in CO2
. The function that does the tallying is n()
.
CO2 %>%
group_by(Plant, Type) %>%
summarize(n=n())
#> `summarise()` has grouped output by 'Plant'. You can
#> override using the `.groups` argument.
#> # A tibble: 12 × 3
#> # Groups: Plant [12]
#> Plant Type n
#> <ord> <fct> <int>
#> 1 Qn1 Quebec 7
#> 2 Qn2 Quebec 7
#> 3 Qn3 Quebec 7
#> 4 Qc1 Quebec 7
#> 5 Qc3 Quebec 7
#> 6 Qc2 Quebec 7
#> 7 Mn3 Mississippi 7
#> 8 Mn2 Mississippi 7
#> 9 Mn1 Mississippi 7
#> 10 Mc2 Mississippi 7
#> 11 Mc3 Mississippi 7
#> 12 Mc1 Mississippi 7
You do not see much difference between base R and tidyverse
in the example above. But if you want to do more than tallying, summarize()
is a treat, as summarize()
supports the descriptive statistics functions mentioned before, such as mean(), sd(), median(), quantile(), etc. E.g,
iris %>%
group_by(Species) %>%
summarize(u=mean(Sepal.Length))
#> # A tibble: 3 × 2
#> Species u
#> <fct> <dbl>
#> 1 setosa 5.01
#> 2 versicolor 5.94
#> 3 virginica 6.59
You can apply multiple functions in summarize()
. E.g., shows the number of samples and standard deviation.
iris %>%
group_by(Species) %>%
summarize(n=n(),
u=mean(Sepal.Length),
s=sd(Sepal.Length))
#> # A tibble: 3 × 4
#> Species n u s
#> <fct> <int> <dbl> <dbl>
#> 1 setosa 50 5.01 0.352
#> 2 versicolor 50 5.94 0.516
#> 3 virginica 50 6.59 0.636
You can group by more than one column. Let’s switch to CO2
dataset to illustrate this point.
CO2 %>%
group_by(Plant, Type, Treatment) %>%
summarize(n=n(),
u=mean(conc),
s=sd(conc))
#> `summarise()` has grouped output by 'Plant', 'Type'. You
#> can override using the `.groups` argument.
#> # A tibble: 12 × 6
#> # Groups: Plant, Type [12]
#> Plant Type Treatment n u s
#> <ord> <fct> <fct> <int> <dbl> <dbl>
#> 1 Qn1 Quebec nonchilled 7 435 318.
#> 2 Qn2 Quebec nonchilled 7 435 318.
#> 3 Qn3 Quebec nonchilled 7 435 318.
#> 4 Qc1 Quebec chilled 7 435 318.
#> 5 Qc3 Quebec chilled 7 435 318.
#> 6 Qc2 Quebec chilled 7 435 318.
#> 7 Mn3 Mississippi nonchilled 7 435 318.
#> 8 Mn2 Mississippi nonchilled 7 435 318.
#> 9 Mn1 Mississippi nonchilled 7 435 318.
#> 10 Mc2 Mississippi chilled 7 435 318.
#> 11 Mc3 Mississippi chilled 7 435 318.
#> 12 Mc1 Mississippi chilled 7 435 318.
Two remarks:
group_by()
is never used alone. It is often followed bysummarize()
.The variables feed to
group_by()
must be factor, i.e., categorical.
2.2.11 A Summary of tidyverse
functions
filter()
: select rows that satistify the specified conditions.select()
: choose or drop certain columns.arrange()
: sort rows by the specified columns.rename()
: change the name of a column.mutate()
: add a column or modify the values of an existing column.group_by()
, followed bysummarize()
pip
%>%
: funnel intermediate result from an upstream process to the input of an immediate downstream process.
2.3 Summary
In this chapter, you have seen how to perform data wrangling with basic R and tidyverse
. What you have seen is only a fraction of what tidyverse
can do. Readers who want to explore the full capability of tidyverse
should consult the most authorative source: R for Data Science by Hadley Wickham & Garrett Grolemund (https://r4ds.had.co.nz/). In the next chapter, you will see how to plot data. A picture is worth a thousand words. Data visualization is an indispensible step for data analysis. But before the next topic, let’s recall all the new R commands discussed in this chapter.
|
, &
, ==
, !
, !=
, subset()
, sample()
, order()
, NONE
(a keyword), table()
, library()
, require()
, filter()
, select()
, arrange()
, rename()
, mutate()
, group_by()
, summarize()
, n()
, and pip %>%
.
x |
---|
| |
& |
== |
! |
!= |
subset() |