Chapter 7 Data Wrangling Part 2

In this chapter we present some more advanced methods that are required by data scientists.

7.1 Advanced Piping

We have previously experienced the “%>%” pipe, meaning “and then” that stitched together a set of operations.

Here we create a set of observations for Australia and save the subset in a dataframe called ozdata.

ozdata <- 
gapminder %>% 
  dplyr::filter(country=="Australia") 
#
head(ozdata)

## # A tibble: 6 x 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Australia Oceania    1952    69.1  8691212    10040.
## 2 Australia Oceania    1957    70.3  9712569    10950.
## 3 Australia Oceania    1962    70.9 10794968    12217.
## 4 Australia Oceania    1967    71.1 11872264    14526.
## 5 Australia Oceania    1972    71.9 13177000    16789.
## 6 Australia Oceania    1977    73.5 14074100    18334.

This same action could be accomplished by using a “forward assignment” operation

gapminder %>% 
  dplyr::filter(country=="Australia") -> ozdata 
#
head(ozdata)

## # A tibble: 6 x 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Australia Oceania    1952    69.1  8691212    10040.
## 2 Australia Oceania    1957    70.3  9712569    10950.
## 3 Australia Oceania    1962    70.9 10794968    12217.
## 4 Australia Oceania    1967    71.1 11872264    14526.
## 5 Australia Oceania    1972    71.9 13177000    16789.
## 6 Australia Oceania    1977    73.5 14074100    18334.

The assignment T-pipe magrittr::%T>% allows two separate pipes or pathways after the implementation of this operator.

Illustration: The Australia subset of the dataframe is piped into a brackets that print the head of these data, but the filtered subset is also piped around the printing bracket and into another filter that is then displayed. The crucial feature is that intermediate results (tables or graphs) of the pipeline can be displayed without breaking the data flow through the main pipeline.

gapminder %>% 
  dplyr::filter(country=="Australia") %T>%
  {head(.) %>% print()} %>%
  filter(year > 1990) %>%
  head()

## # A tibble: 6 x 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Australia Oceania    1952    69.1  8691212    10040.
## 2 Australia Oceania    1957    70.3  9712569    10950.
## 3 Australia Oceania    1962    70.9 10794968    12217.
## 4 Australia Oceania    1967    71.1 11872264    14526.
## 5 Australia Oceania    1972    71.9 13177000    16789.
## 6 Australia Oceania    1977    73.5 14074100    18334.

## # A tibble: 4 x 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Australia Oceania    1992    77.6 17481977    23425.
## 2 Australia Oceania    1997    78.8 18565243    26998.
## 3 Australia Oceania    2002    80.4 19546792    30688.
## 4 Australia Oceania    2007    81.2 20434176    34435.

7.2 Apply a function to object

7.2.1 Base apply functions

We have seen the dplyr summarise function compute summary statistics for variables in a dataframe, but there is rather a lot of coding to ask for the mean or standard deviation of each variable in a dataframe. If there is a large number of variables, this could easily become a nightmare.

We can ask the apply function from base R to compute the mean of each variable in a dataframe (represented by . in the first argument below - dataframe was piped into apply) by noting that MARGIN=2 means compute the function for columns, and MARGIN=1 means compute for rows of a dataframe.

The following code computes the average of these three quantitative variables across years for Australia.

gapminder %>%
  filter(country=="Australia") %>%
  select(lifeExp,pop,gdpPercap) %>%
  base::apply(.,MARGIN=2,FUN=mean)

##      lifeExp          pop    gdpPercap 
## 7.466292e+01 1.464931e+07 1.998060e+04

Equivalent code using base function colMeans.

gapminder %>%
  filter(country=="Australia") %>%
  select(lifeExp,pop,gdpPercap) %>%
  base::colMeans()

##      lifeExp          pop    gdpPercap 
## 7.466292e+01 1.464931e+07 1.998060e+04

7.2.2 Using purrr package

This section uses the purrr package in tidyverse, and the data used comes from the repurrsive package.

library(purrr)
library(repurrrsive)

The function purrr::map() is a function for applying a function to each element of a list. The closest base R function is lapply().

Here we illustrate the map function by applying the natural log function to each element of a list, returning a list:

tempobj <- purrr::map(c(1,2,3), log)
str(tempobj)

## List of 3
##  $ : num 0
##  $ : num 0.693
##  $ : num 1.1

is.list(tempobj)

## [1] TRUE

tempobj

## [[1]]
## [1] 0
## 
## [[2]]
## [1] 0.6931472
## 
## [[3]]
## [1] 1.098612

The map function has two basic components: a list, and a function to apply to each element of the list. There are variants of the map function that are used for different kinds of output objects: for example,map_chr() will produce a vector of character values. The previous example now produces a vector of characters rather than numbers.

tempobj <- purrr::map_chr(c(1,2,3), log)
str(tempobj)

##  chr [1:3] "0.000000" "0.693147" "1.098612"

tempobj

## [1] "0.000000" "0.693147" "1.098612"

is.list(tempobj)

## [1] FALSE

is.vector(tempobj)

## [1] TRUE

Next we consider a list of characters from “Game of Thrones” - we display a list of three items, with each item a list of 18 sub-items, only the first 8 sub-items are shown in the output by the list.len option. Based on tutorial: https://jennybc.github.io/purrr-tutorial/index.html.

str(got_chars[1:3], list.len = 8)

## List of 3
##  $ :List of 18
##   ..$ url        : chr "https://www.anapioficeandfire.com/api/characters/1022"
##   ..$ id         : int 1022
##   ..$ name       : chr "Theon Greyjoy"
##   ..$ gender     : chr "Male"
##   ..$ culture    : chr "Ironborn"
##   ..$ born       : chr "In 278 AC or 279 AC, at Pyke"
##   ..$ died       : chr ""
##   ..$ alive      : logi TRUE
##   .. [list output truncated]
##  $ :List of 18
##   ..$ url        : chr "https://www.anapioficeandfire.com/api/characters/1052"
##   ..$ id         : int 1052
##   ..$ name       : chr "Tyrion Lannister"
##   ..$ gender     : chr "Male"
##   ..$ culture    : chr ""
##   ..$ born       : chr "In 273 AC, at Casterly Rock"
##   ..$ died       : chr ""
##   ..$ alive      : logi TRUE
##   .. [list output truncated]
##  $ :List of 18
##   ..$ url        : chr "https://www.anapioficeandfire.com/api/characters/1074"
##   ..$ id         : int 1074
##   ..$ name       : chr "Victarion Greyjoy"
##   ..$ gender     : chr "Male"
##   ..$ culture    : chr "Ironborn"
##   ..$ born       : chr "In 268 AC or before, at Pyke"
##   ..$ died       : chr ""
##   ..$ alive      : logi TRUE
##   .. [list output truncated]

purrr::map(got_chars[1:3], "name")

## [[1]]
## [1] "Theon Greyjoy"
## 
## [[2]]
## [1] "Tyrion Lannister"
## 
## [[3]]
## [1] "Victarion Greyjoy"

Next we take this list of character names and pipe it through map again to extract the number of characters in each name.

purrr::map(got_chars[1:3], "name") %>%
  purrr::map(.,nchar)

## [[1]]
## [1] 13
## 
## [[2]]
## [1] 16
## 
## [[3]]
## [1] 17

What if we wanted to do some sort of numerical analysis of name lengths? We could collect these numeric values in a numeric vector to enable further processing, like calculate the average word length.

purrr::map(got_chars[1:3], "name") %>%
  purrr::map_dbl(.,nchar)

## [1] 13 16 17

#
purrr::map(got_chars[1:3], "name") %>%
  purrr::map_dbl(.,nchar) %T>%
  print() %>%
  mean()

## [1] 13 16 17

## [1] 15.33333

There is a variant of map that makes a dataframe as output.

map_dfr(got_chars[1:3], magrittr::extract, c("name", "culture", "gender")) %>%
  head()

## # A tibble: 3 x 3
##   name              culture    gender
##   <chr>             <chr>      <chr> 
## 1 Theon Greyjoy     "Ironborn" Male  
## 2 Tyrion Lannister  ""         Male  
## 3 Victarion Greyjoy "Ironborn" Male

gapminder %>%
  # make sure year is numeric...
  dplyr::mutate(year=as.numeric(year)) %>%
  dplyr::filter(continent=="Oceania") %>%
  # we will want to estimate separate models for each
  # country in Oceania, but the dataframe retains the
  # memory of other countries that must be dropped to
  # avoid empty dataframes... so we drop unused levels
  droplevels() %>%
  split(.$country) %>%
   purrr::map(.,~ lm(lifeExp ~ year, data = .x)) %>%
  purrr::map(summary) %>%
  purrr::map_dbl("r.squared")

##   Australia New Zealand 
##   0.9796477   0.9535846

Estimate a linear regression model on each data subset - again one subset for each value of cylinder (cyl). Resulting output is a dataframe, based on purrr package online documentation.

mtcars %>%
  split(.$cyl) %>%
  purrr::map(~ lm(mpg ~ wt, data = .x)) %>%
  purrr::map_dfr(~ as.data.frame(t(as.matrix(coef(.)))))

##   (Intercept)        wt
## 1    39.57120 -5.647025
## 2    28.40884 -2.780106
## 3    23.86803 -2.192438

7.3 Joining Datasets

Notice that both country and continent are factor/categorical variables.

gapminder %>% 
  dplyr::select(year,country, continent) %>% 
  dplyr::group_by(country) %>% 
  head(n=3)

## # A tibble: 3 x 3
## # Groups:   country [1]
##    year country     continent
##   <int> <fct>       <fct>    
## 1  1952 Afghanistan Asia     
## 2  1957 Afghanistan Asia     
## 3  1962 Afghanistan Asia

#
gapminder %>% 
  dplyr::select(country, continent) %>% 
  dplyr::group_by(country) %>% 
  # the country continent values repeat over now-nonexistent years
  # so extract only the first instance with slice(1)
  # slice_head(n=1) will work soon
  dplyr::slice(n=1) %>% 
  head()

## # A tibble: 6 x 2
## # Groups:   country [6]
##   country     continent
##   <fct>       <fct>    
## 1 Afghanistan Asia     
## 2 Albania     Europe   
## 3 Algeria     Africa   
## 4 Angola      Africa   
## 5 Argentina   Americas 
## 6 Australia   Oceania

Notice in this dataframe, country is type character.

# view the country_codes dataframe from gapminder...
country_codes %>%
  head()

## # A tibble: 6 x 3
##   country     iso_alpha iso_num
##   <chr>       <chr>       <int>
## 1 Afghanistan AFG             4
## 2 Albania     ALB             8
## 3 Algeria     DZA            12
## 4 Angola      AGO            24
## 5 Argentina   ARG            32
## 6 Armenia     ARM            51

A left_join will return all rows from x(left=gapminder), and all columns from x(left=gapminder ) and y(right=country_codes table). If there are multiple matches between x and y, all combination of the matches are returned.

The left_join command will keep all elements (rows) on the left (gapminder records) and join the country_codes rows that share common variables. Notice here the type conflict for country is resolved by forcing the gapminder country variable to character.

gapminder %>% 
  dplyr::select(country, continent) %>% 
  dplyr::group_by(country) %>% 
  dplyr::slice(1) %>% 
  dplyr::left_join(country_codes) %>%
  head()

## Joining, by = "country"

## # A tibble: 6 x 4
## # Groups:   country [6]
##   country     continent iso_alpha iso_num
##   <chr>       <fct>     <chr>       <int>
## 1 Afghanistan Asia      AFG             4
## 2 Albania     Europe    ALB             8
## 3 Algeria     Africa    DZA            12
## 4 Angola      Africa    AGO            24
## 5 Argentina   Americas  ARG            32
## 6 Australia   Oceania   AUS            36

Notice that changing the type declaration for country in gapminder, we get a clean join on all common variables (country).

gapminder %>% 
  dplyr::select(country, continent) %>% 
  dplyr::mutate(country = as.character(country)) %>%
  dplyr::group_by(country) %>% 
  dplyr::slice(1) %>% 
  dplyr::left_join(country_codes) %>%
  head()

## Joining, by = "country"

## # A tibble: 6 x 4
## # Groups:   country [6]
##   country     continent iso_alpha iso_num
##   <chr>       <fct>     <chr>       <int>
## 1 Afghanistan Asia      AFG             4
## 2 Albania     Europe    ALB             8
## 3 Algeria     Africa    DZA            12
## 4 Angola      Africa    AGO            24
## 5 Argentina   Americas  ARG            32
## 6 Australia   Oceania   AUS            36

You can also explicitly control the joining key, the variables that form the connection between the two tables (here a common variable, country of type character) variables with a by option. Notice that because the right (tempcc) dataset does not contain rows for most of the countries, the variables for tempcc are missing for the rows that are only contained in the gapminder dataframe.

tempcc <- country_codes %>%
  filter(country %in% c("Australia","New Zealand","Japan"))
gapminder %>% 
  dplyr::select(country, continent) %>% 
  dplyr::mutate(country = as.character(country)) %>%
  dplyr::group_by(country) %>% 
  dplyr::slice(1) %>% 
  dplyr::left_join(tempcc,by=c("country"="country")) %>%
  head()

## # A tibble: 6 x 4
## # Groups:   country [6]
##   country     continent iso_alpha iso_num
##   <chr>       <fct>     <chr>       <int>
## 1 Afghanistan Asia      <NA>           NA
## 2 Albania     Europe    <NA>           NA
## 3 Algeria     Africa    <NA>           NA
## 4 Angola      Africa    <NA>           NA
## 5 Argentina   Americas  <NA>           NA
## 6 Australia   Oceania   AUS            36

An inner_join produces table entries that are common to both datasets. Specifically, it returns all rows from x (gapminder) where there are matching values in y(tempcc), and all columns from x(gapminder) and y(tempcc). If there are multiple matches between x and y, all combination of the matches are returned.

tempcc <- country_codes %>%
  filter(country %in% c("Australia","New Zealand","Japan"))
gapminder %>%
  dplyr::filter(continent=="Oceania") %>%
  dplyr::select(country, continent) %>% 
  dplyr::mutate(country = as.character(country)) %>%
  dplyr::group_by(country) %>% 
  dplyr::slice(1) %>% 
  dplyr::inner_join(tempcc,by = c("country" = "country")) %>%
  head()

## # A tibble: 2 x 4
## # Groups:   country [2]
##   country     continent iso_alpha iso_num
##   <chr>       <fct>     <chr>       <int>
## 1 Australia   Oceania   AUS            36
## 2 New Zealand Oceania   NZL           554

Illustration of right_join:

tempcc <- country_codes %>%
  filter(country %in% c("Australia","New Zealand","Japan"))
gapminder %>%
  dplyr::filter(continent=="Oceania") %>%
  dplyr::select(country, continent) %>% 
  dplyr::mutate(country = as.character(country)) %>%
  dplyr::group_by(country) %>% 
  dplyr::slice(1) %>% 
  dplyr::right_join(tempcc,by = c("country" = "country")) %>%
  head()

## # A tibble: 3 x 4
## # Groups:   country [3]
##   country     continent iso_alpha iso_num
##   <chr>       <fct>     <chr>       <int>
## 1 Australia   Oceania   AUS            36
## 2 New Zealand Oceania   NZL           554
## 3 Japan       <NA>      JPN           392

Illustration of a semi_join(x, y): Return all rows from x(gapminder) where there are matching values in y(tempcc), keeping just columns from x(gapminder). A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

tempcc <- country_codes %>%
  filter(country %in% c("Australia","New Zealand","Japan"))
gapminder %>%
  dplyr::filter(continent=="Oceania") %>%
  dplyr::select(country, continent) %>% 
  dplyr::mutate(country = as.character(country)) %>%
  dplyr::group_by(country) %>% 
  dplyr::slice(1) %>% 
  dplyr::semi_join(tempcc,by = c("country" = "country")) %>%
  head()

## # A tibble: 2 x 2
## # Groups:   country [2]
##   country     continent
##   <chr>       <fct>    
## 1 Australia   Oceania  
## 2 New Zealand Oceania

An anti_join returns all rows from x (gapminder) where there are not matching values in y (tempcc), keeping just columns from x(gapminder). Illustration of anti_join:

tempcc <- country_codes %>%
  filter(country %in% c("Australia","New Zealand","Japan"))
gapminder %>%
  dplyr::filter(continent=="Oceania") %>%
  dplyr::select(country, continent) %>% 
  dplyr::mutate(country = as.character(country)) %>%
  dplyr::group_by(country) %>% 
  dplyr::slice(1) %>% 
  dplyr::anti_join(tempcc,by = c("country" = "country")) %>%
  head()

## # A tibble: 0 x 2
## # Groups:   country [0]
## # … with 2 variables: country <chr>, continent <fct>

7.4 Pivoting

7.4.1 Converting Wide to Long Format

data("relig_income",package="tidyr")
#

The relig_income dataset from the tidyr stores counts based on a survey which asked people about their religion and annual income:

head(relig_income)

## # A tibble: 6 x 11
##   religion       `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
##   <chr>            <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
## 1 Agnostic            27        34        60        81        76       137        122
## 2 Atheist             12        27        37        52        35        70         73
## 3 Buddhist            27        21        30        34        33        58         62
## 4 Catholic           418       617       732       670       638      1116        949
## 5 Don’t know/re…      15        14        15        11        10        35         21
## 6 Evangelical P…     575       869      1064       982       881      1486        949
## # … with 3 more variables: $100-150k <dbl>, >150k <dbl>, Don't know/refused <dbl>

The first argument is the dataset to reshape, relig_income.

The second argument describes which columns need to be reshaped - every column but not religion.

The names_to gives the name of the variable that will be created from the data stored in the original data (relig_income) column names - in this case income.

The values_to gives the name of the variable that will be created from the data stored in the cell value - in this case count.

relig_income %>%
  pivot_longer(data=., cols=-religion, names_to = "income", values_to = "count") %>%
  head()

## # A tibble: 6 x 3
##   religion income  count
##   <chr>    <chr>   <dbl>
## 1 Agnostic <$10k      27
## 2 Agnostic $10-20k    34
## 3 Agnostic $20-30k    60
## 4 Agnostic $30-40k    81
## 5 Agnostic $40-50k    76
## 6 Agnostic $50-75k   137

week2 <- c(2,2,2)
week6 <- c(3,3,3)
week10 <- c(5,5,5)
name <- c("Jerry","Bob","Bill")
instrument <- c("guitar","guitar","drums")
ds <- data.frame(name,instrument,week2,week6,week10)
head(ds)

##    name instrument week2 week6 week10
## 1 Jerry     guitar     2     3      5
## 2   Bob     guitar     2     3      5
## 3  Bill      drums     2     3      5

ds %>%
  tidyr::pivot_longer(data=.,
    cols = starts_with("week"),
    names_prefix = "week",
    names_to = "week",
    values_to = "concerts",
    values_drop_na = TRUE) %>% 
  head(n=7)

## # A tibble: 7 x 4
##   name  instrument week  concerts
##   <chr> <chr>      <chr>    <dbl>
## 1 Jerry guitar     2            2
## 2 Jerry guitar     6            3
## 3 Jerry guitar     10           5
## 4 Bob   guitar     2            2
## 5 Bob   guitar     6            3
## 6 Bob   guitar     10           5
## 7 Bill  drums      2            2

Note that the new variable week is type character. We can add an option to the pivot_longer command or add a mutate to change the form of the variable.

ds %>%
  tidyr::pivot_longer(data=.,
    cols = starts_with("week"),
    names_prefix = "week",
    names_to = "week",
    values_to = "concerts",
    values_drop_na = TRUE) %>% 
  mutate(week=as.integer(week)) %>%
  head()

## # A tibble: 6 x 4
##   name  instrument  week concerts
##   <chr> <chr>      <int>    <dbl>
## 1 Jerry guitar         2        2
## 2 Jerry guitar         6        3
## 3 Jerry guitar        10        5
## 4 Bob   guitar         2        2
## 5 Bob   guitar         6        3
## 6 Bob   guitar        10        5

7.4.2 Converting Long to Wide Format

fish_encounters %>% head()

## # A tibble: 6 x 3
##   fish  station  seen
##   <fct> <fct>   <int>
## 1 4842  Release     1
## 2 4842  I80_1       1
## 3 4842  Lisbon      1
## 4 4842  Rstr        1
## 5 4842  Base_TD     1
## 6 4842  BCE         1

fish_encounters %>% pivot_wider(data=.,names_from = station, values_from = seen) %>%
  head()

## # A tibble: 6 x 12
##   fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
##   <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
## 1 4842        1     1      1     1       1     1     1     1     1     1     1
## 2 4843        1     1      1     1       1     1     1     1     1     1     1
## 3 4844        1     1      1     1       1     1     1     1     1     1     1
## 4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA
## 5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
## 6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA    NA

find example of values_fill

id <- c(1,1,2,2,3,3)
response <- c(5,6,2,3,7,8)
period <- c("pre","post","pre","post","pre","post")
dframe <- data.frame(id,response,period)
head(dframe)

##   id response period
## 1  1        5    pre
## 2  1        6   post
## 3  2        2    pre
## 4  2        3   post
## 5  3        7    pre
## 6  3        8   post

dframe %>%
pivot_wider(data=.,names_from = period, values_from = response) %>%
  head()

## # A tibble: 3 x 3
##      id   pre  post
##   <dbl> <dbl> <dbl>
## 1     1     5     6
## 2     2     2     3
## 3     3     7     8

7.5 Rectangling and Nested Data

Rectangling - means creating a “tidy” data table with rows as observations, columns as variables.

Let’s use the “Game of Thrones” nested data structure used earlier, and create a data table with the following components:

The name of the character.
Their unique identifier.
If they’re alive.
Their gender.

To extract each of these variables we need to use a map series of functions. The first four are straightforward: you just need to identify the name of the component and its type.

got_tibble <- tibble(
  name = got_chars %>% map_chr("name"),
  id = got_chars %>% map_int("id"),
  alive = got_chars %>% map_lgl("alive"),
  gender = got_chars %>% map_chr("gender"))
#
head(got_tibble)

## # A tibble: 6 x 4
##   name                 id alive gender
##   <chr>             <int> <lgl> <chr> 
## 1 Theon Greyjoy      1022 TRUE  Male  
## 2 Tyrion Lannister   1052 TRUE  Male  
## 3 Victarion Greyjoy  1074 TRUE  Male  
## 4 Will               1109 FALSE Male  
## 5 Areo Hotah         1166 TRUE  Male  
## 6 Chett              1267 FALSE Male

An example using models

mtcars_nested <- mtcars %>%
  group_by(cyl) %>%
  nest()
#
mtcars_nested

## # A tibble: 3 x 2
## # Groups:   cyl [3]
##     cyl data              
##   <dbl> <list>            
## 1     6 <tibble [7 × 10]> 
## 2     4 <tibble [11 × 10]>
## 3     8 <tibble [14 × 10]>

# display 6 cyl dataset
mtcars_nested$data[1] %>% head()

## [[1]]
## # A tibble: 7 x 10
##     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21    160    110  3.9   2.62  16.5     0     1     4     4
## 2  21    160    110  3.9   2.88  17.0     0     1     4     4
## 3  21.4  258    110  3.08  3.22  19.4     1     0     3     1
## 4  18.1  225    105  2.76  3.46  20.2     1     0     3     1
## 5  19.2  168.   123  3.92  3.44  18.3     1     0     4     4
## 6  17.8  168.   123  3.92  3.44  18.9     1     0     4     4
## 7  19.7  145    175  3.62  2.77  15.5     0     1     5     6

mtcars_nested <- mtcars_nested %>%
  mutate(model = map(data, function(df) lm(mpg ~ wt, data = df)))
mtcars_nested

## # A tibble: 3 x 3
## # Groups:   cyl [3]
##     cyl data               model 
##   <dbl> <list>             <list>
## 1     6 <tibble [7 × 10]>  <lm>  
## 2     4 <tibble [11 × 10]> <lm>  
## 3     8 <tibble [14 × 10]> <lm>

# model result for cyl=4 model:
mtcars_nested$model[2]

## [[1]]
## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Coefficients:
## (Intercept)           wt  
##      39.571       -5.647

# make a set of predictions from each model
mtcars_nested <- mtcars_nested %>%
  mutate(model = map(model, predict))
mtcars_nested

## # A tibble: 3 x 3
## # Groups:   cyl [3]
##     cyl data               model     
##   <dbl> <list>             <list>    
## 1     6 <tibble [7 × 10]>  <dbl [7]> 
## 2     4 <tibble [11 × 10]> <dbl [11]>
## 3     8 <tibble [14 × 10]> <dbl [14]>

mtcars_nested$model[2]

## [[1]]
##        1        2        3        4        5        6        7        8        9 
## 26.47010 21.55719 21.78307 27.14774 30.45125 29.20890 25.65128 28.64420 27.48656 
##       10       11 
## 31.02725 23.87247

7.6 Handling Missing Values

Make implicit missing values explicit with complete(); make explicit missing values implicit with drop_na(); replace missing values with next/previous value with fill(), or a known value with replace_na().

If a variable is not complete and contains empty places, these are denoted in R as NA. We will often wish to create a dataframe without any missing values, or discover how many rows contain variables with missing values.

First let’s create a small dataset with missing values:

x <- c(1,2,NA,4)
y <- c(11,12,13,NA)
z <- c(7,8,9,10)
tempdf <- data.frame(x,y,z)
tempdf

##    x  y  z
## 1  1 11  7
## 2  2 12  8
## 3 NA 13  9
## 4  4 NA 10

# count missing values for variable x
tempdf %>%
  dplyr::summarise(count = sum(is.na(x)))

##   count
## 1     1

# count rows with missing y
tempdf %>%
  dplyr::tally(is.na(y))

##   n
## 1 1

# subset of rows with complete data for specified columns
tempdf %>%
  dplyr::select(y,z) %>%
  tidyr::drop_na() %>%
  head()

##    y z
## 1 11 7
## 2 12 8
## 3 13 9

#  
tempdf %>%
  tidyr::drop_na() %>%
  head()

##   x  y z
## 1 1 11 7
## 2 2 12 8

7.7 Handling Dates

We will use the lubridate package from the tidyverse to assist in handling date variables. Date variables are a special data type like character or integer or double precision for numeric values.

library(lubridate)

7.7.1 creating dates from strings

Notice that these lubridate functions accept strings (or numbers) of many different forms on intake.

ymd("2017-01-31")

## [1] "2017-01-31"

ymd(20170131)

## [1] "2017-01-31"

mdy("January 31st, 2017")

## [1] "2017-01-31"

mdy(c("01/01/1999","Jan031993"))

## [1] "1999-01-01" "1993-01-03"

tempvarstring <-c("31-Jan-2017","01-June-2020","20-April-2019")
dmy(c("31-Jan-2017","01-June-2020","20-April-2019"))

## [1] "2017-01-31" "2020-06-01" "2019-04-20"

tempvar <- dmy(c("31-Jan-2017","01-June-2020","20-April-2019"))
dplyr::glimpse(data.frame(tempvar,tempvarstring))

## Rows: 3
## Columns: 2
## $ tempvar       <date> 2017-01-31, 2020-06-01, 2019-04-20
## $ tempvarstring <chr> "31-Jan-2017", "01-June-2020", "20-April-2019"

7.7.2 Date time variables

Beyond day-month-year information, hour-minute-seconds (and even finer increments) can be added to the variable.

ymd_hms("2017-01-31 20:11:59")

## [1] "2017-01-31 20:11:59 UTC"

mdy_hm("01/31/2017 08:01")

## [1] "2017-01-31 08:01:00 UTC"

ymd(20170131, tz = "Pacific/Auckland")

## [1] "2017-01-31 NZDT"

7.7.3 Create from components of time

Sometimes components of time are present, and a date variable can be made from the components.

library(nycflights13)
flights %>% 
  select(year, month, day, hour, minute) %>%
  head()

## # A tibble: 6 x 5
##    year month   day  hour minute
##   <int> <int> <int> <dbl>  <dbl>
## 1  2013     1     1     5     15
## 2  2013     1     1     5     29
## 3  2013     1     1     5     40
## 4  2013     1     1     5     45
## 5  2013     1     1     6      0
## 6  2013     1     1     5     58

#
flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day, hour, minute)) %>%
  head()

## # A tibble: 6 x 6
##    year month   day  hour minute departure          
##   <int> <int> <int> <dbl>  <dbl> <dttm>             
## 1  2013     1     1     5     15 2013-01-01 05:15:00
## 2  2013     1     1     5     29 2013-01-01 05:29:00
## 3  2013     1     1     5     40 2013-01-01 05:40:00
## 4  2013     1     1     5     45 2013-01-01 05:45:00
## 5  2013     1     1     6      0 2013-01-01 06:00:00
## 6  2013     1     1     5     58 2013-01-01 05:58:00

7.7.4 Computing With Dates

We will illustrate how to do some computations using time/date variables. First we define two date/time variables representing an arrival and departure from New Zealand.

arrive <- ymd_hms("2011-06-04 12:00:00", tz = "Pacific/Auckland")
arrive

## [1] "2011-06-04 12:00:00 NZST"

leave <- ymd_hms("2011-08-10 14:00:00", tz = "Pacific/Auckland")
leave

## [1] "2011-08-10 14:00:00 NZST"

7.7.5 extract date components

lubridate::month(arrive)

## [1] 6

lubridate::month(arrive, label=TRUE)

## [1] Jun
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < ... < Dec

#
lubridate::day(arrive)

## [1] 4

lubridate::year(arrive)

## [1] 2011

7.7.6 Periods and Durations

The following constructions make a range or interval of time - between arrival and leave dates.

auckland <- lubridate::interval(arrive, leave) 
auckland

## [1] 2011-06-04 12:00:00 NZST--2011-08-10 14:00:00 NZST

auckland <- arrive %--% leave
auckland

## [1] 2011-06-04 12:00:00 NZST--2011-08-10 14:00:00 NZST

Another person will be at a meeting outside of New Zealand from July 20 until the end of August.

atameeting <- lubridate::interval(ymd(20110720, tz = "Pacific/Auckland"), 
                ymd(20110831, tz = "Pacific/Auckland"))
atameeting

## [1] 2011-07-20 NZST--2011-08-31 NZST

Will the first interval overlap with the interval of being at a meeting?

lubridate::int_overlaps(atameeting, auckland)

## [1] TRUE

What period will the first person be in New Zealand and the second person not being away at a meeting?

lubridate::setdiff(auckland, atameeting)

## [1] 2011-06-04 12:00:00 NZST--2011-07-20 NZST

An age computation

# How old is someone born July 4, 1994?
person_age <- lubridate::today() - lubridate::ymd(19940704)
person_age

## Time difference of 9838 days

A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the duration.

ageduration<-lubridate::as.duration(person_age)
ageduration

## [1] "850003200s (~26.93 years)"

Intervals are specific time spans (because they are tied to specific dates), but lubridate also supplies two general time span classes: Durations and Periods. Helper functions for creating periods are named after the units of time (plural). Helper functions for creating durations follow the same format but begin with a “d” (for duration) or, if you prefer, and “e” (for exact).

minutes(2) ## period

## [1] "2M 0S"

dminutes(2) ## duration

## [1] "120s (~2 minutes)"

auckland

## [1] 2011-06-04 12:00:00 NZST--2011-08-10 14:00:00 NZST

# number of days
auckland / lubridate::ddays(1)

## [1] 67.08333

auckland / lubridate::days(1)

## [1] 67.08333

# sets of 2days
auckland / ddays(2)

## [1] 33.54167

# number of minutes
auckland / dminutes(1)

## [1] 96600

How many weeks, days, hours, minutes was person alive based on duration variable ageduration?

ageduration / dweeks(1)

## [1] 1405.429

ageduration / ddays(1)

## [1] 9838

ageduration / dhours(1)

## [1] 236112

ageduration / dminutes(1)

## [1] 14166720

periods:

one_pm <- lubridate::ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")
#
one_pm + lubridate::days(1)

## [1] "2016-03-13 13:00:00 EDT"

Notice what happens if we add the “duration” form of a year to Jan 1, 2016 compared to the period-form of a year. We get the expected outcome from the period form - the duration form suffers because 2016 was a leap year.

# A leap year
lubridate::ymd("2016-01-01") + lubridate::dyears(1)

## [1] "2016-12-31 06:00:00 UTC"

lubridate::ymd("2016-01-01") + lubridate::years(1)

## [1] "2017-01-01"

Let’s look at an example that denotes the amount of time under observation expressed in weeks and days.

ObsTime <- c("10w 4d", "13w", "9w 5d", NA, "15w 6d")

What if we wanted to convert that to numeric weeks?

lubridate::as.duration(lubridate::period(ObsTime, units = c("week","day"))) / dweeks(1)

## [1] 10.571429 13.000000  9.714286        NA 15.857143

We use the period function to define a Period object. The units argument says the first part of the text string represents weeks and the second part represents days. Then this is converted to a Duration object that stores time in seconds. We then divide by dweeks(1) (duration) to convert seconds to weeks. Notice how the NA remains NA and that “13w” converts to 13 because there are no days appended to it.