4 数据操作

目前， R 语言在数据操作方面陆续出现三套工具，最早的是 Base R（1997 年 4月），之后是 data.table（2006年4月）和 dplyr（2014 年1月）。下面以 R 软件内置的 iris 数据集为例，介绍三者的常用数据操作。

4.1 操作工具

Base R 指的是 R 软件自带的一套操作数据的工具，只要安装了 R 软件就可使用，各个操作函数分布在基础 R 包中，主要且常见的操作对象是数据框（data.frame）。
data.table 是一个 R 包，它强化和扩展了 Base R 操作数据的功能，提供一致的链式法则，更高效的功能实现，兼容 Base R 的函数和数据。
dplyr 是另一个扩展包，它基于一套新的数据操作语法，各个主要函数语义上比较接近 SQL 查询语言，代码易读。

表 4.1: Base R、data.table 和 dplyr

操作工具	依赖	稳定	效率
Base R	零依赖	很稳定	一般
data.table	少依赖	稳定	很高
dplyr	依赖多	一般	高

4.1.1 Base R

在 data.frame 的基础上，提供一系列辅助函数实现各类数据操作。

aggregate(iris, Sepal.Length ~ Species, FUN = length)

     Species Sepal.Length
1     setosa           50
2 versicolor           50
3  virginica           50

4.1.2 data.table

data.table 包在 Base R 的基础上，扩展和加强了原有函数的功能，提供一套完整的链式操作语法。

library(data.table)
iris_dt <- as.data.table(iris)
iris_dt[ ,.(cnt = length(Sepal.Length)) , by = "Species"]

      Species   cnt
       <fctr> <int>
1:     setosa    50
2: versicolor    50
3:  virginica    50

4.1.3 dplyr

dplyr 包提供一套全新的数据操作语法，与 purrr 包和 tidyr 包一起形成完备的数据操作功能。在 R 环境下，dplyr 包提供一套等价的表示，代码如下：

iris |> 
  dplyr::group_by(Species) |> 
  dplyr::count()

# A tibble: 3 × 2
# Groups:   Species [3]
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

4.1.4 SQL

实际工作中，SQL （结构化查询语言）是必不可少的基础性工具，比如 SQLite、 Hive 和 Spark 等都提供基于 SQL 的数据查询引擎，没有重点介绍 SQL 操作是因为本书以 R 语言为数据分析的主要工具，而不是它不重要。以 dplyr 来说吧，它的诸多语义动词就是对标 SQL 的。

library(DBI)
conn <- DBI::dbConnect(RSQLite::SQLite(),
  dbname = system.file("db", "datasets.sqlite", package = "RSQLite")
)

按 Species 分组统计数据条数， SQL 查询语句如下：

SELECT COUNT(1) AS cnt, Species
FROM iris
GROUP BY Species;

SQL 代码执行的结果如下：

iris_preview

  cnt    Species
1  50     setosa
2  50 versicolor
3  50  virginica

dplyr 包能连接数据库，以上 SQL 代码也可以翻译成等价的 dplyr 语句。

dplyr::tbl(conn, "iris") |> 
  dplyr::group_by(Species) |> 
  dplyr::count()

# Source:   SQL [?? x 2]
# Database: sqlite 3.50.4 [/Users/runner/work/_temp/Library/RSQLite/db/datasets.sqlite]
# Groups:   Species
  Species        n
  <chr>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

dplyr 包的函数 show_query() 可以将 dplyr 语句转化为查询语句，这有助于排错。

dplyr::tbl(conn, "iris") |> 
  dplyr::group_by(Species) |> 
  dplyr::count() |> 
  dplyr::show_query()

<SQL>
SELECT `Species`, COUNT(*) AS `n`
FROM `iris`
GROUP BY `Species`

glue 包可以使用 R 环境中的变量，相比于 sprintf() 函数，可以组合更大型的 SQL 语句，这在生产环境中广泛使用。

# R 环境中的变量
group <- "Species"
# 组合 SQL
query <- glue::glue("
  SELECT COUNT(1) AS cnt, Species
  FROM iris
  GROUP BY ({group})
")
# 将 SQL 语句传递给数据库，执行 SQL 语句
DBI::dbGetQuery(conn, query)

  cnt    Species
1  50     setosa
2  50 versicolor
3  50  virginica

用完后，关闭连接通道。

dbDisconnect(conn = conn)

更多关于 SQL 语句的使用介绍见书籍《Become a SELECT star》。

4.2 Base R 操作

介绍最核心的 Base R 数据操作，如筛选、排序、变换、聚合、重塑等

4.2.1 筛选

筛选操作可以用函数 subset() 或 [ 实现

subset(iris, subset = Species == "setosa" & Sepal.Length > 5.5, select = c("Sepal.Length", "Sepal.Width"))

   Sepal.Length Sepal.Width
15          5.8         4.0
16          5.7         4.4
19          5.7         3.8

iris[iris$Species == "setosa" & iris$Sepal.Length > 5.5, c("Sepal.Length", "Sepal.Width")]

   Sepal.Length Sepal.Width
15          5.8         4.0
16          5.7         4.4
19          5.7         3.8

4.2.2 变换

变换操作可以用函数 within()/transform() 实现。最常见的变换操作是类型转化，比如从字符串型转为因子型、整型或日期型等。

# iris2 <- transform(iris, Species_N = as.integer(Species))[1:3, ]
iris2 <- within(iris, {
  Species_N <- as.integer(Species)
})
str(iris2)

'data.frame':   150 obs. of  6 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Species_N   : int  1 1 1 1 1 1 1 1 1 1 ...

4.2.3 排序

排序操作可以用函数 order() 实现。

iris[order(iris$Sepal.Length, decreasing = FALSE)[1:3], ]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
14          4.3         3.0          1.1         0.1  setosa
9           4.4         2.9          1.4         0.2  setosa
39          4.4         3.0          1.3         0.2  setosa

4.2.4 聚合

聚合操作可以用函数 aggregate() / by() / ave() 实现。

# 按 Species 分组计算 Sepal.Length 的均值
aggregate(iris, Sepal.Length ~ Species, mean)

     Species Sepal.Length
1     setosa        5.006
2 versicolor        5.936
3  virginica        6.588

by(iris, ~ Species, FUN = dim) # 函数应用于整个 data.frame

Species: setosa
[1] 50  5
------------------------------------------------------------ 
Species: versicolor
[1] 50  5
------------------------------------------------------------ 
Species: virginica
[1] 50  5

# 分组线性回归
by(iris, ~ Species, FUN = function(x) lm(Sepal.Length ~ Sepal.Width, data = x))

Species: setosa

Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = x)

Coefficients:
(Intercept)  Sepal.Width  
     2.6390       0.6905  

------------------------------------------------------------ 
Species: versicolor

Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = x)

Coefficients:
(Intercept)  Sepal.Width  
     3.5397       0.8651  

------------------------------------------------------------ 
Species: virginica

Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = x)

Coefficients:
(Intercept)  Sepal.Width  
     3.9068       0.9015

# 按 Species 分组计算 Sepal.Length 的均值
with(iris, ave(Sepal.Length, Species, FUN = mean))

  [1] 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006
 [13] 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006
 [25] 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006
 [37] 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006 5.006
 [49] 5.006 5.006 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936
 [61] 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936
 [73] 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936
 [85] 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936 5.936
 [97] 5.936 5.936 5.936 5.936 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588
[109] 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588
[121] 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588
[133] 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588 6.588
[145] 6.588 6.588 6.588 6.588 6.588 6.588

4.2.5 合并

两个数据框的合并操作可以用函数 merge() 实现

df1 <- data.frame(a1 = c(1, 2, 3), a2 = c("A", "B", "C"))
df2 <- data.frame(b1 = c(2, 3, 4), b2 = c("A", "B", "D"))
# LEFT JOIN
merge(x = df1, y = df2, by.x = "a2", by.y = "b2", all.x = TRUE)

# RIGHT JOIN
merge(x = df1, y = df2, by.x = "a2", by.y = "b2", all.y = TRUE)

# INNER JOIN
merge(x = df1, y = df2, by.x = "a2", by.y = "b2", all = FALSE)

  a2 a1 b1
1  A  1  2
2  B  2  3

# FULL JOIN
merge(x = df1, y = df2, by.x = "a2", by.y = "b2", all = TRUE)

  a2 a1 b1
1  A  1  2
2  B  2  3
3  C  3 NA
4  D NA  4

4.2.6 重塑

将数据集从宽格式转为长格式，可以用函数 reshape() 实现，反之，亦然。

# 长格式
df3 <- data.frame(
  extra = c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4),
  group = c("A", "A", "A", "B", "B", "B"),
  id = c(1, 2, 3, 1, 2, 3)
)
# 长转宽
reshape(df3, direction = "wide", timevar = "group", idvar = "id")

  id extra.A extra.B
1  1     0.7    -1.2
2  2    -1.6    -0.1
3  3    -0.2     3.4

# 也可以指定组合变量的列名
reshape(df3, direction = "wide", timevar = "group", idvar = "id",
        v.names = "extra", sep = "_")

  id extra_A extra_B
1  1     0.7    -1.2
2  2    -1.6    -0.1
3  3    -0.2     3.4

提取并整理分组线性回归系数。函数 split() 将数据集 iris 按分类变量 Species 拆分成列表，函数 lapply() 将线性回归操作 lm() 应用于列表的每一个元素上，再次用函数 lapply() 将函数 coef() 应用于线性回归后的列表上，提取回归系数，用函数 do.call() 将系数合并成矩阵，最后，用函数as.data.frame() 转化成数据框。

s1 <- split(iris, ~Species)
s2 <- lapply(s1, lm, formula = Sepal.Length ~ Sepal.Width)
s3 <- lapply(s2, coef)
s4 <- do.call("rbind", s3)
s5 <- as.data.frame(s4)
s5

           (Intercept) Sepal.Width
setosa        2.639001   0.6904897
versicolor    3.539735   0.8650777
virginica     3.906836   0.9015345

do.call(
  "rbind",
  lapply(
    lapply(
      split(iris, ~Species), lm,
      formula = Sepal.Length ~ Sepal.Width
    ),
    coef
  )
)

           (Intercept) Sepal.Width
setosa        2.639001   0.6904897
versicolor    3.539735   0.8650777
virginica     3.906836   0.9015345

类似 Base R，也用 data.table 来实现 iris 分组线性回归

iris_dt[, as.list(coef(lm(Sepal.Length ~ Sepal.Width))), by = "Species"]

      Species (Intercept) Sepal.Width
       <fctr>       <num>       <num>
1:     setosa    2.639001   0.6904897
2: versicolor    3.539735   0.8650777
3:  virginica    3.906836   0.9015345

4.3 data.table 操作

掌握此等基础性的工具，再去了解新工具也不难，更重要的是，只要将一种工具掌握的足够好，也就足以应付绝大多数的情况。

介绍 data.table 基础语法，对标 Base R，介绍基础操作，同时给出等价的 dplyr 实现，但不运行代码。
data.table 扩展 Base R 数据操作，介绍常用的操作 8 个，讲清楚出现的具体场景，同时给出等价的 dplyr 实现，但不运行代码。
data.table 特有的高级数据操作 on、.SD 、.I 、.J 等。

4.3.1 筛选

data.table 扩展了函数 [ 功能，简化 iris$Species == "setosa" 代码 Species == "setosa"

iris_dt[Species == "setosa" & Sepal.Length > 5.5, c("Sepal.Length", "Sepal.Width")]

   Sepal.Length Sepal.Width
          <num>       <num>
1:          5.8         4.0
2:          5.7         4.4
3:          5.7         3.8

4.3.2 变换

变换操作可以用函数 := 作用类似于 Base R 函数 transform()

iris_dt[, Species_N := as.integer(Species)]
str(iris_dt)

Classes 'data.table' and 'data.frame':  150 obs. of  6 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Species_N   : int  1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>

4.3.3 排序

排序操作可以用函数 order() 。

iris_dt[order(Sepal.Length, decreasing = FALSE), ] |> head()

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_N
          <num>       <num>        <num>       <num>  <fctr>     <int>
1:          4.3         3.0          1.1         0.1  setosa         1
2:          4.4         2.9          1.4         0.2  setosa         1
3:          4.4         3.0          1.3         0.2  setosa         1
4:          4.4         3.2          1.3         0.2  setosa         1
5:          4.5         2.3          1.3         0.3  setosa         1
6:          4.6         3.1          1.5         0.2  setosa         1

4.3.4 聚合

聚合操作用函数 .() 和 by 组合。

iris_dt[, .(mean = mean(Sepal.Length)), by = "Species"]

      Species  mean
       <fctr> <num>
1:     setosa 5.006
2: versicolor 5.936
3:  virginica 6.588

4.3.5 合并

合并操作也是用函数 merge() 来实现。

dt1 <- data.table(a1 = c(1, 2, 3), a2 = c("A", "B", "C"))
dt2 <- data.table(b1 = c(2, 3, 4), b2 = c("A", "B", "D"))
# LEFT JOIN
merge(x = dt1, y = dt2, by.x = "a2", by.y = "b2", all.x = TRUE)

Key: <a2>
       a2    a1    b1
   <char> <num> <num>
1:      A     1     2
2:      B     2     3
3:      C     3    NA

# RIGHT JOIN
merge(x = dt1, y = dt2, by.x = "a2", by.y = "b2", all.y = TRUE)

Key: <a2>
       a2    a1    b1
   <char> <num> <num>
1:      A     1     2
2:      B     2     3
3:      D    NA     4

# INNER JOIN
merge(x = dt1, y = dt2, by.x = "a2", by.y = "b2", all = FALSE)

Key: <a2>
       a2    a1    b1
   <char> <num> <num>
1:      A     1     2
2:      B     2     3

# FULL JOIN
merge(x = dt1, y = dt2, by.x = "a2", by.y = "b2", all = TRUE)

Key: <a2>
       a2    a1    b1
   <char> <num> <num>
1:      A     1     2
2:      B     2     3
3:      C     3    NA
4:      D    NA     4

4.3.6 重塑

将数据集从宽格式转为长格式，可以用函数 dcast() 实现，反之，可以用函数 melt() 实现。

# 长格式
dt3 <- data.table(
  extra = c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4),
  group = c("A", "A", "A", "B", "B", "B"),
  id = c(1, 2, 3, 1, 2, 3)
)
# 长转宽
dcast(dt3, id ~ group, value.var = "extra")

Key: <id>
      id     A     B
   <num> <num> <num>
1:     1   0.7  -1.2
2:     2  -1.6  -0.1
3:     3  -0.2   3.4

4.4 分组操作

按 Species 分组筛选出前两行数据

do.call("rbind", by(iris, ~Species, head, 2))

              Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
setosa.1               5.1         3.5          1.4         0.2     setosa
setosa.2               4.9         3.0          1.4         0.2     setosa
versicolor.51          7.0         3.2          4.7         1.4 versicolor
versicolor.52          6.4         3.2          4.5         1.5 versicolor
virginica.101          6.3         3.3          6.0         2.5  virginica
virginica.102          5.8         2.7          5.1         1.9  virginica

iris_dt[, head(.SD, 2), by = .(Species)]

      Species Sepal.Length Sepal.Width Petal.Length Petal.Width Species_N
       <fctr>        <num>       <num>        <num>       <num>     <int>
1:     setosa          5.1         3.5          1.4         0.2         1
2:     setosa          4.9         3.0          1.4         0.2         1
3: versicolor          7.0         3.2          4.7         1.4         2
4: versicolor          6.4         3.2          4.5         1.5         2
5:  virginica          6.3         3.3          6.0         2.5         3
6:  virginica          5.8         2.7          5.1         1.9         3