6.4 Rectangling
Rectangling is the art and craft of taking a deeply nested list (often sourced from wild caught JSON or XML) and taming it into a tidy data set of rows and columns. There are three functions from tidyr that are particularly useful for rectangling:
unnest_longer()takes each element of a list-column and makes a new row.unnest_wider()takes each element of a list-column and makes a new column.unnest_auto()guesses whether you wantunnest_longer()orunnest_wider().hoist()is similar to unnest_wider() but only plucks out selected components, and can reach down multiple levels.
A very large number of data rectangling problems can be solved by combining these functions with a splash of dplyr (largely eliminating prior approaches that combined mutate() with multiple purrr::map()s).
To illustrate these techniques, we’ll use the repurrrsive package, which provides a number deeply nested lists originally mostly captured from web APIs.
6.4.1 Github users
We’ll start with gh_users, a list which contains information about six GitHub users.
To begin, we put the gh_users list into a data frame:
(users <- tibble(user = gh_users))
#> # A tibble: 6 x 1
#> user
#> <list>
#> 1 <named list [30]>
#> 2 <named list [30]>
#> 3 <named list [30]>
#> 4 <named list [30]>
#> 5 <named list [30]>
#> 6 <named list [30]>Each element of column user is yet another list, where each element represents a column.
names(users$user[[1]])
#> [1] "login" "id" "avatar_url"
#> [4] "gravatar_id" "url" "html_url"
#> [7] "followers_url" "following_url" "gists_url"
#> [10] "starred_url" "subscriptions_url" "organizations_url"
#> [13] "repos_url" "events_url" "received_events_url"
#> [16] "type" "site_admin" "name"
#> [19] "company" "blog" "location"
#> [22] "email" "hireable" "bio"
#> [25] "public_repos" "public_gists" "followers"
#> [28] "following" "created_at" "updated_at"Obviously we could use unnest_wider() to turn the list components into columns:
users %>% unnest_wider(user)
#> # A tibble: 6 x 30
#> login id avatar_url gravatar_id url html_url followers_url following_url
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 gabo~ 6.60e5 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 2 jenn~ 5.99e5 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 3 jtle~ 1.57e6 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 4 juli~ 1.25e7 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 5 leep~ 3.51e6 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 6 masa~ 8.36e6 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> # ... with 22 more variables: gists_url <chr>, starred_url <chr>,
#> # subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
#> # events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
#> # name <chr>, company <chr>, blog <chr>, location <chr>, email <chr>,
#> # public_repos <int>, public_gists <int>, followers <int>, following <int>,
#> # created_at <chr>, updated_at <chr>, bio <chr>, hireable <lgl>But in this case, there are many components and we don’t need most of them so we can instead use hoist(). hoist() allows us to pull out selected components using the same syntax as purrr::pluck():
users %>% hoist(user,
followers = "followers",
login = "login",
url = "html_url")
#> # A tibble: 6 x 4
#> followers login url user
#> <int> <chr> <chr> <list>
#> 1 303 gaborcsardi https://github.com/gaborcsardi <named list [27]>
#> 2 780 jennybc https://github.com/jennybc <named list [27]>
#> 3 3958 jtleek https://github.com/jtleek <named list [27]>
#> 4 115 juliasilge https://github.com/juliasilge <named list [27]>
#> 5 213 leeper https://github.com/leeper <named list [27]>
#> 6 34 masalmon https://github.com/masalmon <named list [27]>hoist() 从列表列中提取出指明的元素作为新变量,保留余下的元素
hoist() removes the named components from the user list-column, so you can think of it as moving components out of the inner list into the top-level data frame
6.4.2 Github repos
We start off gh_repos similarly, by putting it in a tibble:
repos <- tibble(repo = gh_repos)
repos
#> # A tibble: 6 x 1
#> repo
#> <list>
#> 1 <list [30]>
#> 2 <list [30]>
#> 3 <list [30]>
#> 4 <list [26]>
#> 5 <list [30]>
#> 6 <list [30]>By comparison, gh_repos is more nested than gh_users, with elements in the 2nd hierarchy being repositorys that gh_users own, and thus requires one more level of information to record each repo.
This time the elements of user are a list of repositories that belong to that user. These are observations, so should become new rows, so we use unnest_longer() rather than unnest_wider():
repos <- repos %>% unnest_longer(repo)
repos
#> # A tibble: 176 x 1
#> repo
#> <list>
#> 1 <named list [68]>
#> 2 <named list [68]>
#> 3 <named list [68]>
#> 4 <named list [68]>
#> 5 <named list [68]>
#> 6 <named list [68]>
#> # ... with 170 more rowsNow each rwo representes a repository, then we can use unnest_wider() or hoist():
repos %>% hoist(repo,
login = list("owner", "login"),
name = "name",
homepage = "homepage",
watchers = "watchers_count"
)
#> # A tibble: 176 x 5
#> login name homepage watchers repo
#> <chr> <chr> <chr> <int> <list>
#> 1 gaborcsardi after <NA> 5 <named list [65]>
#> 2 gaborcsardi argufy <NA> 19 <named list [65]>
#> 3 gaborcsardi ask <NA> 5 <named list [65]>
#> 4 gaborcsardi baseimports <NA> 0 <named list [65]>
#> 5 gaborcsardi citest <NA> 0 <named list [65]>
#> 6 gaborcsardi clisymbols "" 18 <named list [65]>
#> # ... with 170 more rowsNote the use of list("owner", "login"): this allows us to reach two levels deep inside of a list using the same syntax as purrr::pluck(). An alternative approach would be to pull out just owner and then put each element of it in a column:
repos %>%
hoist(repo, owner = "owner") %>%
unnest_wider(owner)
#> # A tibble: 176 x 18
#> login id avatar_url gravatar_id url html_url followers_url following_url
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 gabo~ 660288 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 2 gabo~ 660288 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 3 gabo~ 660288 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 4 gabo~ 660288 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 5 gabo~ 660288 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> 6 gabo~ 660288 https://a~ "" http~ https:/~ https://api.~ https://api.~
#> # ... with 170 more rows, and 10 more variables: gists_url <chr>,
#> # starred_url <chr>, subscriptions_url <chr>, organizations_url <chr>,
#> # repos_url <chr>, events_url <chr>, received_events_url <chr>, type <chr>,
#> # site_admin <lgl>, repo <list>Instead of looking at the list and carefully thinking about whether it needs to become rows or columns, you can use unnest_auto(). It uses a handful of heuristics to figure out whether unnest_longer() or unnest_wider() is appropriate, and tells you about its reasoning.
tibble(repo = gh_repos) %>%
unnest_auto(repo) %>%
unnest_auto(repo)
#> Using `unnest_longer(repo)`; no element has names
#> Using `unnest_wider(repo)`; elements have 68 names in common
#> # A tibble: 176 x 67
#> id name full_name owner private html_url description fork url
#> <int> <chr> <chr> <lis> <lgl> <chr> <chr> <lgl> <chr>
#> 1 6.12e7 after gaborcsa~ <nam~ FALSE https:/~ Run Code i~ FALSE http~
#> 2 4.05e7 argu~ gaborcsa~ <nam~ FALSE https:/~ Declarativ~ FALSE http~
#> 3 3.64e7 ask gaborcsa~ <nam~ FALSE https:/~ Friendly C~ FALSE http~
#> 4 3.49e7 base~ gaborcsa~ <nam~ FALSE https:/~ Do we get ~ FALSE http~
#> 5 6.16e7 cite~ gaborcsa~ <nam~ FALSE https:/~ Test R pac~ TRUE http~
#> 6 3.39e7 clis~ gaborcsa~ <nam~ FALSE https:/~ Unicode sy~ FALSE http~
#> # ... with 170 more rows, and 58 more variables: forks_url <chr>,
#> # keys_url <chr>, collaborators_url <chr>, teams_url <chr>, hooks_url <chr>,
#> # issue_events_url <chr>, events_url <chr>, assignees_url <chr>,
#> # branches_url <chr>, tags_url <chr>, blobs_url <chr>, git_tags_url <chr>,
#> # git_refs_url <chr>, trees_url <chr>, statuses_url <chr>,
#> # languages_url <chr>, stargazers_url <chr>, contributors_url <chr>,
#> # subscribers_url <chr>, subscription_url <chr>, commits_url <chr>,
#> # git_commits_url <chr>, comments_url <chr>, issue_comment_url <chr>,
#> # contents_url <chr>, compare_url <chr>, merges_url <chr>, archive_url <chr>,
#> # downloads_url <chr>, issues_url <chr>, pulls_url <chr>,
#> # milestones_url <chr>, notifications_url <chr>, labels_url <chr>,
#> # releases_url <chr>, deployments_url <chr>, created_at <chr>,
#> # updated_at <chr>, pushed_at <chr>, git_url <chr>, ssh_url <chr>,
#> # clone_url <chr>, svn_url <chr>, size <int>, stargazers_count <int>,
#> # watchers_count <int>, language <chr>, has_issues <lgl>,
#> # has_downloads <lgl>, has_wiki <lgl>, has_pages <lgl>, forks_count <int>,
#> # open_issues_count <int>, forks <int>, open_issues <int>, watchers <int>,
#> # default_branch <chr>, homepage <chr>6.4.3 Game of Throne characters
got_chars has a similar structure to gh_users: it’s a list of named lists, where each element of the inner list describes some attribute of a GoT character.
We start in the same way, first by creating a data frame and then by unnesting each component into a column:
chars <- tibble(char = got_chars)
chars
#> # A tibble: 30 x 1
#> char
#> <list>
#> 1 <named list [18]>
#> 2 <named list [18]>
#> 3 <named list [18]>
#> 4 <named list [18]>
#> 5 <named list [18]>
#> 6 <named list [18]>
#> # ... with 24 more rowschars2 <- chars %>% unnest_wider(char)
chars2
#> # A tibble: 30 x 18
#> url id name gender culture born died alive titles aliases father
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <list> <list> <chr>
#> 1 http~ 1022 Theo~ Male "Ironb~ "In ~ "" TRUE <chr ~ <chr [~ ""
#> 2 http~ 1052 Tyri~ Male "" "In ~ "" TRUE <chr ~ <chr [~ ""
#> 3 http~ 1074 Vict~ Male "Ironb~ "In ~ "" TRUE <chr ~ <chr [~ ""
#> 4 http~ 1109 Will Male "" "" "In ~ FALSE <chr ~ <chr [~ ""
#> 5 http~ 1166 Areo~ Male "Norvo~ "In ~ "" TRUE <chr ~ <chr [~ ""
#> 6 http~ 1267 Chett Male "" "At ~ "In ~ FALSE <chr ~ <chr [~ ""
#> # ... with 24 more rows, and 7 more variables: mother <chr>, spouse <chr>,
#> # allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
#> # playedBy <list>This is more complex than gh_users because some component of char are themselves a list, giving us a collection of list-columns:
chars2 %>% select_if(is.list)
#> select_if: dropped 11 variables (url, id, name, gender, culture, …)
#> # A tibble: 30 x 7
#> titles aliases allegiances books povBooks tvSeries playedBy
#> <list> <list> <list> <list> <list> <list> <list>
#> 1 <chr [3]> <chr [4]> <chr [1]> <chr [3]> <chr [2]> <chr [6]> <chr [1]>
#> 2 <chr [2]> <chr [11]> <chr [1]> <chr [2]> <chr [4]> <chr [6]> <chr [1]>
#> 3 <chr [2]> <chr [1]> <chr [1]> <chr [3]> <chr [2]> <chr [1]> <chr [1]>
#> 4 <chr [1]> <chr [1]> <??? [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
#> 5 <chr [1]> <chr [1]> <chr [1]> <chr [3]> <chr [2]> <chr [2]> <chr [1]>
#> 6 <chr [1]> <chr [1]> <??? [1]> <chr [2]> <chr [1]> <chr [1]> <chr [1]>
#> # ... with 24 more rowsWhat you do next will depend on the purposes of the analysis. Maybe you want a row for every book and TV series that the character appears in:
chars2 %>%
select(name, books, tvSeries) %>%
pivot_longer(c(books, tvSeries), names_to = "media", values_to = "value") %>%
unnest_longer(value)
#> select: dropped 15 variables (url, id, gender, culture, born, …)
#> pivot_longer: reorganized (books, tvSeries) into (media, value) [was 30x3, now 60x3]
#> # A tibble: 180 x 3
#> name media value
#> <chr> <chr> <chr>
#> 1 Theon Greyjoy books A Game of Thrones
#> 2 Theon Greyjoy books A Storm of Swords
#> 3 Theon Greyjoy books A Feast for Crows
#> 4 Theon Greyjoy tvSeries Season 1
#> 5 Theon Greyjoy tvSeries Season 2
#> 6 Theon Greyjoy tvSeries Season 3
#> # ... with 174 more rowsOr maybe you want to build a table that lets you match title to name:
chars2 %>%
select(name, title = titles) %>%
unnest_longer(title)
#> select: renamed one variable (title) and dropped 16 variables
#> # A tibble: 60 x 2
#> name title
#> <chr> <chr>
#> 1 Theon Greyjoy Prince of Winterfell
#> 2 Theon Greyjoy Captain of Sea Bitch
#> 3 Theon Greyjoy Lord of the Iron Islands (by law of the green lands)
#> 4 Tyrion Lannister Acting Hand of the King (former)
#> 5 Tyrion Lannister Master of Coin (former)
#> 6 Victarion Greyjoy Lord Captain of the Iron Fleet
#> # ... with 54 more rows