Chapter 7 Data Wrangling Part 2
In this chapter we present some more advanced methods that are required by data scientists.
7.1 Advanced Piping
We have previously experienced the “%>%” pipe, meaning “and then” that stitched together a set of operations.
Here we create a set of observations for Australia and save the subset in a dataframe called ozdata
.
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 1952 69.1 8691212 10040.
## 2 Australia Oceania 1957 70.3 9712569 10950.
## 3 Australia Oceania 1962 70.9 10794968 12217.
## 4 Australia Oceania 1967 71.1 11872264 14526.
## 5 Australia Oceania 1972 71.9 13177000 16789.
## 6 Australia Oceania 1977 73.5 14074100 18334.
This same action could be accomplished by using a “forward assignment” operation
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 1952 69.1 8691212 10040.
## 2 Australia Oceania 1957 70.3 9712569 10950.
## 3 Australia Oceania 1962 70.9 10794968 12217.
## 4 Australia Oceania 1967 71.1 11872264 14526.
## 5 Australia Oceania 1972 71.9 13177000 16789.
## 6 Australia Oceania 1977 73.5 14074100 18334.
The assignment T-pipe magrittr::%T>%
allows two separate pipes or pathways after the implementation of this operator.
Illustration: The Australia subset of the dataframe is piped into a brackets that print the head of these data, but the filtered subset is also piped around the printing bracket and into another filter that is then displayed. The crucial feature is that intermediate results (tables or graphs) of the pipeline can be displayed without breaking the data flow through the main pipeline.
gapminder %>%
dplyr::filter(country=="Australia") %T>%
{head(.) %>% print()} %>%
filter(year > 1990) %>%
head()
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 1952 69.1 8691212 10040.
## 2 Australia Oceania 1957 70.3 9712569 10950.
## 3 Australia Oceania 1962 70.9 10794968 12217.
## 4 Australia Oceania 1967 71.1 11872264 14526.
## 5 Australia Oceania 1972 71.9 13177000 16789.
## 6 Australia Oceania 1977 73.5 14074100 18334.
## # A tibble: 4 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 1992 77.6 17481977 23425.
## 2 Australia Oceania 1997 78.8 18565243 26998.
## 3 Australia Oceania 2002 80.4 19546792 30688.
## 4 Australia Oceania 2007 81.2 20434176 34435.
7.2 Apply a function to object
7.2.1 Base apply functions
We have seen the dplyr summarise function compute summary statistics for variables in a dataframe, but there is rather a lot of coding to ask for the mean or standard deviation of each variable in a dataframe. If there is a large number of variables, this could easily become a nightmare.
We can ask the apply
function from base R to compute the mean of each variable in a dataframe (represented by .
in the first argument below - dataframe was piped into apply
) by noting that MARGIN=2
means compute the function for columns, and MARGIN=1
means compute for rows of a dataframe.
The following code computes the average of these three quantitative variables across years for Australia.
gapminder %>%
filter(country=="Australia") %>%
select(lifeExp,pop,gdpPercap) %>%
base::apply(.,MARGIN=2,FUN=mean)
## lifeExp pop gdpPercap
## 7.466292e+01 1.464931e+07 1.998060e+04
Equivalent code using base function colMeans
.
## lifeExp pop gdpPercap
## 7.466292e+01 1.464931e+07 1.998060e+04
7.2.2 Using purrr package
This section uses the purrr package in tidyverse, and the data used comes from the repurrsive package.
The function purrr::map()
is a function for applying a function to each element of a list. The closest base R function is lapply().
Here we illustrate the map
function by applying the natural log function to each element of a list, returning a list:
## List of 3
## $ : num 0
## $ : num 0.693
## $ : num 1.1
## [1] TRUE
## [[1]]
## [1] 0
##
## [[2]]
## [1] 0.6931472
##
## [[3]]
## [1] 1.098612
The map
function has two basic components: a list, and a function to apply to each element of the list. There are variants of the map
function that are used for different kinds of output objects: for example,map_chr()
will produce a vector of character values. The previous example now produces a vector of characters rather than numbers.
## chr [1:3] "0.000000" "0.693147" "1.098612"
## [1] "0.000000" "0.693147" "1.098612"
## [1] FALSE
## [1] TRUE
Next we consider a list of characters from “Game of Thrones” - we display a list of three items, with each item a list of 18 sub-items, only the first 8 sub-items are shown in the output by the list.len
option. Based on tutorial: https://jennybc.github.io/purrr-tutorial/index.html.
## List of 3
## $ :List of 18
## ..$ url : chr "https://www.anapioficeandfire.com/api/characters/1022"
## ..$ id : int 1022
## ..$ name : chr "Theon Greyjoy"
## ..$ gender : chr "Male"
## ..$ culture : chr "Ironborn"
## ..$ born : chr "In 278 AC or 279 AC, at Pyke"
## ..$ died : chr ""
## ..$ alive : logi TRUE
## .. [list output truncated]
## $ :List of 18
## ..$ url : chr "https://www.anapioficeandfire.com/api/characters/1052"
## ..$ id : int 1052
## ..$ name : chr "Tyrion Lannister"
## ..$ gender : chr "Male"
## ..$ culture : chr ""
## ..$ born : chr "In 273 AC, at Casterly Rock"
## ..$ died : chr ""
## ..$ alive : logi TRUE
## .. [list output truncated]
## $ :List of 18
## ..$ url : chr "https://www.anapioficeandfire.com/api/characters/1074"
## ..$ id : int 1074
## ..$ name : chr "Victarion Greyjoy"
## ..$ gender : chr "Male"
## ..$ culture : chr "Ironborn"
## ..$ born : chr "In 268 AC or before, at Pyke"
## ..$ died : chr ""
## ..$ alive : logi TRUE
## .. [list output truncated]
## [[1]]
## [1] "Theon Greyjoy"
##
## [[2]]
## [1] "Tyrion Lannister"
##
## [[3]]
## [1] "Victarion Greyjoy"
Next we take this list of character names and pipe it through map
again to extract the number of characters in each name.
## [[1]]
## [1] 13
##
## [[2]]
## [1] 16
##
## [[3]]
## [1] 17
What if we wanted to do some sort of numerical analysis of name lengths? We could collect these numeric values in a numeric vector to enable further processing, like calculate the average word length.
## [1] 13 16 17
## [1] 13 16 17
## [1] 15.33333
There is a variant of map
that makes a dataframe as output.
## # A tibble: 3 x 3
## name culture gender
## <chr> <chr> <chr>
## 1 Theon Greyjoy "Ironborn" Male
## 2 Tyrion Lannister "" Male
## 3 Victarion Greyjoy "Ironborn" Male
gapminder %>%
# make sure year is numeric...
dplyr::mutate(year=as.numeric(year)) %>%
dplyr::filter(continent=="Oceania") %>%
# we will want to estimate separate models for each
# country in Oceania, but the dataframe retains the
# memory of other countries that must be dropped to
# avoid empty dataframes... so we drop unused levels
droplevels() %>%
split(.$country) %>%
purrr::map(.,~ lm(lifeExp ~ year, data = .x)) %>%
purrr::map(summary) %>%
purrr::map_dbl("r.squared")
## Australia New Zealand
## 0.9796477 0.9535846
Estimate a linear regression model on each data subset - again one subset for each value of cylinder (cyl). Resulting output is a dataframe, based on purrr package online documentation.
mtcars %>%
split(.$cyl) %>%
purrr::map(~ lm(mpg ~ wt, data = .x)) %>%
purrr::map_dfr(~ as.data.frame(t(as.matrix(coef(.)))))
## (Intercept) wt
## 1 39.57120 -5.647025
## 2 28.40884 -2.780106
## 3 23.86803 -2.192438
7.3 Joining Datasets
Notice that both country and continent are factor/categorical variables.
## # A tibble: 3 x 3
## # Groups: country [1]
## year country continent
## <int> <fct> <fct>
## 1 1952 Afghanistan Asia
## 2 1957 Afghanistan Asia
## 3 1962 Afghanistan Asia
#
gapminder %>%
dplyr::select(country, continent) %>%
dplyr::group_by(country) %>%
# the country continent values repeat over now-nonexistent years
# so extract only the first instance with slice(1)
# slice_head(n=1) will work soon
dplyr::slice(n=1) %>%
head()
## # A tibble: 6 x 2
## # Groups: country [6]
## country continent
## <fct> <fct>
## 1 Afghanistan Asia
## 2 Albania Europe
## 3 Algeria Africa
## 4 Angola Africa
## 5 Argentina Americas
## 6 Australia Oceania
Notice in this dataframe, country is type character.
## # A tibble: 6 x 3
## country iso_alpha iso_num
## <chr> <chr> <int>
## 1 Afghanistan AFG 4
## 2 Albania ALB 8
## 3 Algeria DZA 12
## 4 Angola AGO 24
## 5 Argentina ARG 32
## 6 Armenia ARM 51
A left_join
will return all rows from x(left=gapminder), and all columns from x(left=gapminder ) and y(right=country_codes table). If there are multiple matches between x and y, all combination of the matches are returned.
The left_join
command will keep all elements (rows) on the left (gapminder records) and join the country_codes
rows that share common variables. Notice here the type conflict for country is resolved by forcing the gapminder
country variable to character.
gapminder %>%
dplyr::select(country, continent) %>%
dplyr::group_by(country) %>%
dplyr::slice(1) %>%
dplyr::left_join(country_codes) %>%
head()
## Joining, by = "country"
## # A tibble: 6 x 4
## # Groups: country [6]
## country continent iso_alpha iso_num
## <chr> <fct> <chr> <int>
## 1 Afghanistan Asia AFG 4
## 2 Albania Europe ALB 8
## 3 Algeria Africa DZA 12
## 4 Angola Africa AGO 24
## 5 Argentina Americas ARG 32
## 6 Australia Oceania AUS 36
Notice that changing the type declaration for country
in gapminder
, we get a clean join on all common variables (country).
gapminder %>%
dplyr::select(country, continent) %>%
dplyr::mutate(country = as.character(country)) %>%
dplyr::group_by(country) %>%
dplyr::slice(1) %>%
dplyr::left_join(country_codes) %>%
head()
## Joining, by = "country"
## # A tibble: 6 x 4
## # Groups: country [6]
## country continent iso_alpha iso_num
## <chr> <fct> <chr> <int>
## 1 Afghanistan Asia AFG 4
## 2 Albania Europe ALB 8
## 3 Algeria Africa DZA 12
## 4 Angola Africa AGO 24
## 5 Argentina Americas ARG 32
## 6 Australia Oceania AUS 36
You can also explicitly control the joining key, the variables that form the connection between the two tables (here a common variable, country of type character) variables with a by
option. Notice that because the right (tempcc) dataset does not contain rows for most of the countries, the variables for tempcc
are missing for the rows that are only contained in the gapminder
dataframe.
tempcc <- country_codes %>%
filter(country %in% c("Australia","New Zealand","Japan"))
gapminder %>%
dplyr::select(country, continent) %>%
dplyr::mutate(country = as.character(country)) %>%
dplyr::group_by(country) %>%
dplyr::slice(1) %>%
dplyr::left_join(tempcc,by=c("country"="country")) %>%
head()
## # A tibble: 6 x 4
## # Groups: country [6]
## country continent iso_alpha iso_num
## <chr> <fct> <chr> <int>
## 1 Afghanistan Asia <NA> NA
## 2 Albania Europe <NA> NA
## 3 Algeria Africa <NA> NA
## 4 Angola Africa <NA> NA
## 5 Argentina Americas <NA> NA
## 6 Australia Oceania AUS 36
An inner_join
produces table entries that are common to both datasets. Specifically, it returns all rows from x (gapminder) where there are matching values in y(tempcc), and all columns from x(gapminder) and y(tempcc). If there are multiple matches between x and y, all combination of the matches are returned.
tempcc <- country_codes %>%
filter(country %in% c("Australia","New Zealand","Japan"))
gapminder %>%
dplyr::filter(continent=="Oceania") %>%
dplyr::select(country, continent) %>%
dplyr::mutate(country = as.character(country)) %>%
dplyr::group_by(country) %>%
dplyr::slice(1) %>%
dplyr::inner_join(tempcc,by = c("country" = "country")) %>%
head()
## # A tibble: 2 x 4
## # Groups: country [2]
## country continent iso_alpha iso_num
## <chr> <fct> <chr> <int>
## 1 Australia Oceania AUS 36
## 2 New Zealand Oceania NZL 554
Illustration of right_join
:
tempcc <- country_codes %>%
filter(country %in% c("Australia","New Zealand","Japan"))
gapminder %>%
dplyr::filter(continent=="Oceania") %>%
dplyr::select(country, continent) %>%
dplyr::mutate(country = as.character(country)) %>%
dplyr::group_by(country) %>%
dplyr::slice(1) %>%
dplyr::right_join(tempcc,by = c("country" = "country")) %>%
head()
## # A tibble: 3 x 4
## # Groups: country [3]
## country continent iso_alpha iso_num
## <chr> <fct> <chr> <int>
## 1 Australia Oceania AUS 36
## 2 New Zealand Oceania NZL 554
## 3 Japan <NA> JPN 392
Illustration of a semi_join(x, y): Return all rows from x(gapminder) where there are matching values in y(tempcc), keeping just columns from x(gapminder). A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
tempcc <- country_codes %>%
filter(country %in% c("Australia","New Zealand","Japan"))
gapminder %>%
dplyr::filter(continent=="Oceania") %>%
dplyr::select(country, continent) %>%
dplyr::mutate(country = as.character(country)) %>%
dplyr::group_by(country) %>%
dplyr::slice(1) %>%
dplyr::semi_join(tempcc,by = c("country" = "country")) %>%
head()
## # A tibble: 2 x 2
## # Groups: country [2]
## country continent
## <chr> <fct>
## 1 Australia Oceania
## 2 New Zealand Oceania
An anti_join
returns all rows from x (gapminder) where there are not matching values in y (tempcc), keeping just columns from x(gapminder).
Illustration of anti_join
:
tempcc <- country_codes %>%
filter(country %in% c("Australia","New Zealand","Japan"))
gapminder %>%
dplyr::filter(continent=="Oceania") %>%
dplyr::select(country, continent) %>%
dplyr::mutate(country = as.character(country)) %>%
dplyr::group_by(country) %>%
dplyr::slice(1) %>%
dplyr::anti_join(tempcc,by = c("country" = "country")) %>%
head()
## # A tibble: 0 x 2
## # Groups: country [0]
## # … with 2 variables: country <chr>, continent <fct>
7.4 Pivoting
7.4.1 Converting Wide to Long Format
The relig_income dataset from the tidyr stores counts based on a survey which asked people about their religion and annual income:
## # A tibble: 6 x 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t know/re… 15 14 15 11 10 35 21
## 6 Evangelical P… 575 869 1064 982 881 1486 949
## # … with 3 more variables: $100-150k <dbl>, >150k <dbl>, Don't know/refused <dbl>
The first argument is the dataset to reshape, relig_income.
The second argument describes which columns need to be reshaped - every column but not religion.
The names_to
gives the name of the variable that will be created from the data stored in the original data (relig_income) column names - in this case income
.
The values_to
gives the name of the variable that will be created from the data stored in the cell value - in this case count
.
relig_income %>%
pivot_longer(data=., cols=-religion, names_to = "income", values_to = "count") %>%
head()
## # A tibble: 6 x 3
## religion income count
## <chr> <chr> <dbl>
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
week2 <- c(2,2,2)
week6 <- c(3,3,3)
week10 <- c(5,5,5)
name <- c("Jerry","Bob","Bill")
instrument <- c("guitar","guitar","drums")
ds <- data.frame(name,instrument,week2,week6,week10)
head(ds)
## name instrument week2 week6 week10
## 1 Jerry guitar 2 3 5
## 2 Bob guitar 2 3 5
## 3 Bill drums 2 3 5
ds %>%
tidyr::pivot_longer(data=.,
cols = starts_with("week"),
names_prefix = "week",
names_to = "week",
values_to = "concerts",
values_drop_na = TRUE) %>%
head(n=7)
## # A tibble: 7 x 4
## name instrument week concerts
## <chr> <chr> <chr> <dbl>
## 1 Jerry guitar 2 2
## 2 Jerry guitar 6 3
## 3 Jerry guitar 10 5
## 4 Bob guitar 2 2
## 5 Bob guitar 6 3
## 6 Bob guitar 10 5
## 7 Bill drums 2 2
Note that the new variable week
is type character. We can add an option to the pivot_longer
command or add a mutate
to change the form of the variable.
ds %>%
tidyr::pivot_longer(data=.,
cols = starts_with("week"),
names_prefix = "week",
names_to = "week",
values_to = "concerts",
values_drop_na = TRUE) %>%
mutate(week=as.integer(week)) %>%
head()
## # A tibble: 6 x 4
## name instrument week concerts
## <chr> <chr> <int> <dbl>
## 1 Jerry guitar 2 2
## 2 Jerry guitar 6 3
## 3 Jerry guitar 10 5
## 4 Bob guitar 2 2
## 5 Bob guitar 6 3
## 6 Bob guitar 10 5
7.4.2 Converting Long to Wide Format
## # A tibble: 6 x 3
## fish station seen
## <fct> <fct> <int>
## 1 4842 Release 1
## 2 4842 I80_1 1
## 3 4842 Lisbon 1
## 4 4842 Rstr 1
## 5 4842 Base_TD 1
## 6 4842 BCE 1
## # A tibble: 6 x 12
## fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
## <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 4842 1 1 1 1 1 1 1 1 1 1 1
## 2 4843 1 1 1 1 1 1 1 1 1 1 1
## 3 4844 1 1 1 1 1 1 1 1 1 1 1
## 4 4845 1 1 1 1 1 NA NA NA NA NA NA
## 5 4847 1 1 1 NA NA NA NA NA NA NA NA
## 6 4848 1 1 1 1 NA NA NA NA NA NA NA
find example of values_fill
id <- c(1,1,2,2,3,3)
response <- c(5,6,2,3,7,8)
period <- c("pre","post","pre","post","pre","post")
dframe <- data.frame(id,response,period)
head(dframe)
## id response period
## 1 1 5 pre
## 2 1 6 post
## 3 2 2 pre
## 4 2 3 post
## 5 3 7 pre
## 6 3 8 post
## # A tibble: 3 x 3
## id pre post
## <dbl> <dbl> <dbl>
## 1 1 5 6
## 2 2 2 3
## 3 3 7 8
7.5 Rectangling and Nested Data
Rectangling - means creating a “tidy” data table with rows as observations, columns as variables.
Let’s use the “Game of Thrones” nested data structure used earlier, and create a data table with the following components:
- The name of the character.
- Their unique identifier.
- If they’re alive.
- Their gender.
To extract each of these variables we need to use a map
series of functions. The first four are straightforward: you just need to identify the name of the component and its type.
got_tibble <- tibble(
name = got_chars %>% map_chr("name"),
id = got_chars %>% map_int("id"),
alive = got_chars %>% map_lgl("alive"),
gender = got_chars %>% map_chr("gender"))
#
head(got_tibble)
## # A tibble: 6 x 4
## name id alive gender
## <chr> <int> <lgl> <chr>
## 1 Theon Greyjoy 1022 TRUE Male
## 2 Tyrion Lannister 1052 TRUE Male
## 3 Victarion Greyjoy 1074 TRUE Male
## 4 Will 1109 FALSE Male
## 5 Areo Hotah 1166 TRUE Male
## 6 Chett 1267 FALSE Male
An example using models
## # A tibble: 3 x 2
## # Groups: cyl [3]
## cyl data
## <dbl> <list>
## 1 6 <tibble [7 × 10]>
## 2 4 <tibble [11 × 10]>
## 3 8 <tibble [14 × 10]>
## [[1]]
## # A tibble: 7 x 10
## mpg disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 160 110 3.9 2.88 17.0 0 1 4 4
## 3 21.4 258 110 3.08 3.22 19.4 1 0 3 1
## 4 18.1 225 105 2.76 3.46 20.2 1 0 3 1
## 5 19.2 168. 123 3.92 3.44 18.3 1 0 4 4
## 6 17.8 168. 123 3.92 3.44 18.9 1 0 4 4
## 7 19.7 145 175 3.62 2.77 15.5 0 1 5 6
mtcars_nested <- mtcars_nested %>%
mutate(model = map(data, function(df) lm(mpg ~ wt, data = df)))
mtcars_nested
## # A tibble: 3 x 3
## # Groups: cyl [3]
## cyl data model
## <dbl> <list> <list>
## 1 6 <tibble [7 × 10]> <lm>
## 2 4 <tibble [11 × 10]> <lm>
## 3 8 <tibble [14 × 10]> <lm>
## [[1]]
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Coefficients:
## (Intercept) wt
## 39.571 -5.647
# make a set of predictions from each model
mtcars_nested <- mtcars_nested %>%
mutate(model = map(model, predict))
mtcars_nested
## # A tibble: 3 x 3
## # Groups: cyl [3]
## cyl data model
## <dbl> <list> <list>
## 1 6 <tibble [7 × 10]> <dbl [7]>
## 2 4 <tibble [11 × 10]> <dbl [11]>
## 3 8 <tibble [14 × 10]> <dbl [14]>
## [[1]]
## 1 2 3 4 5 6 7 8 9
## 26.47010 21.55719 21.78307 27.14774 30.45125 29.20890 25.65128 28.64420 27.48656
## 10 11
## 31.02725 23.87247
7.6 Handling Missing Values
Make implicit missing values explicit with complete(); make explicit missing values implicit with drop_na(); replace missing values with next/previous value with fill(), or a known value with replace_na().
If a variable is not complete and contains empty places, these are denoted in R as NA
. We will often wish to create a dataframe without any missing values, or discover how many rows contain variables with missing values.
First let’s create a small dataset with missing values:
## x y z
## 1 1 11 7
## 2 2 12 8
## 3 NA 13 9
## 4 4 NA 10
## count
## 1 1
## n
## 1 1
# subset of rows with complete data for specified columns
tempdf %>%
dplyr::select(y,z) %>%
tidyr::drop_na() %>%
head()
## y z
## 1 11 7
## 2 12 8
## 3 13 9
## x y z
## 1 1 11 7
## 2 2 12 8
7.7 Handling Dates
We will use the lubridate package from the tidyverse to assist in handling date variables. Date variables are a special data type like character or integer or double precision for numeric values.
7.7.1 creating dates from strings
Notice that these lubridate functions accept strings (or numbers) of many different forms on intake.
## [1] "2017-01-31"
## [1] "2017-01-31"
## [1] "2017-01-31"
## [1] "1999-01-01" "1993-01-03"
tempvarstring <-c("31-Jan-2017","01-June-2020","20-April-2019")
dmy(c("31-Jan-2017","01-June-2020","20-April-2019"))
## [1] "2017-01-31" "2020-06-01" "2019-04-20"
tempvar <- dmy(c("31-Jan-2017","01-June-2020","20-April-2019"))
dplyr::glimpse(data.frame(tempvar,tempvarstring))
## Rows: 3
## Columns: 2
## $ tempvar <date> 2017-01-31, 2020-06-01, 2019-04-20
## $ tempvarstring <chr> "31-Jan-2017", "01-June-2020", "20-April-2019"
7.7.2 Date time variables
Beyond day-month-year information, hour-minute-seconds (and even finer increments) can be added to the variable.
## [1] "2017-01-31 20:11:59 UTC"
## [1] "2017-01-31 08:01:00 UTC"
## [1] "2017-01-31 NZDT"
7.7.3 Create from components of time
Sometimes components of time are present, and a date variable can be made from the components.
## # A tibble: 6 x 5
## year month day hour minute
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 5 15
## 2 2013 1 1 5 29
## 3 2013 1 1 5 40
## 4 2013 1 1 5 45
## 5 2013 1 1 6 0
## 6 2013 1 1 5 58
#
flights %>%
select(year, month, day, hour, minute) %>%
mutate(departure = make_datetime(year, month, day, hour, minute)) %>%
head()
## # A tibble: 6 x 6
## year month day hour minute departure
## <int> <int> <int> <dbl> <dbl> <dttm>
## 1 2013 1 1 5 15 2013-01-01 05:15:00
## 2 2013 1 1 5 29 2013-01-01 05:29:00
## 3 2013 1 1 5 40 2013-01-01 05:40:00
## 4 2013 1 1 5 45 2013-01-01 05:45:00
## 5 2013 1 1 6 0 2013-01-01 06:00:00
## 6 2013 1 1 5 58 2013-01-01 05:58:00
7.7.4 Computing With Dates
We will illustrate how to do some computations using time/date variables. First we define two date/time variables representing an arrival and departure from New Zealand.
## [1] "2011-06-04 12:00:00 NZST"
## [1] "2011-08-10 14:00:00 NZST"
7.7.5 extract date components
## [1] 6
## [1] Jun
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < ... < Dec
## [1] 4
## [1] 2011
7.7.6 Periods and Durations
The following constructions make a range or interval of time - between arrival and leave dates.
## [1] 2011-06-04 12:00:00 NZST--2011-08-10 14:00:00 NZST
## [1] 2011-06-04 12:00:00 NZST--2011-08-10 14:00:00 NZST
Another person will be at a meeting outside of New Zealand from July 20 until the end of August.
atameeting <- lubridate::interval(ymd(20110720, tz = "Pacific/Auckland"),
ymd(20110831, tz = "Pacific/Auckland"))
atameeting
## [1] 2011-07-20 NZST--2011-08-31 NZST
Will the first interval overlap with the interval of being at a meeting?
## [1] TRUE
What period will the first person be in New Zealand and the second person not being away at a meeting?
## [1] 2011-06-04 12:00:00 NZST--2011-07-20 NZST
An age computation
# How old is someone born July 4, 1994?
person_age <- lubridate::today() - lubridate::ymd(19940704)
person_age
## Time difference of 9838 days
A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the duration.
## [1] "850003200s (~26.93 years)"
Intervals are specific time spans (because they are tied to specific dates), but lubridate also supplies two general time span classes: Durations and Periods. Helper functions for creating periods are named after the units of time (plural). Helper functions for creating durations follow the same format but begin with a “d” (for duration) or, if you prefer, and “e” (for exact).
## [1] "2M 0S"
## [1] "120s (~2 minutes)"
## [1] 2011-06-04 12:00:00 NZST--2011-08-10 14:00:00 NZST
## [1] 67.08333
## [1] 67.08333
## [1] 33.54167
## [1] 96600
How many weeks, days, hours, minutes was person alive based on duration variable ageduration
?
## [1] 1405.429
## [1] 9838
## [1] 236112
## [1] 14166720
periods:
one_pm <- lubridate::ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")
#
one_pm + lubridate::days(1)
## [1] "2016-03-13 13:00:00 EDT"
Notice what happens if we add the “duration” form of a year to Jan 1, 2016 compared to the period-form of a year. We get the expected outcome from the period form - the duration form suffers because 2016 was a leap year.
## [1] "2016-12-31 06:00:00 UTC"
## [1] "2017-01-01"
Let’s look at an example that denotes the amount of time under observation expressed in weeks and days.
What if we wanted to convert that to numeric weeks?
## [1] 10.571429 13.000000 9.714286 NA 15.857143
We use the period function to define a Period object. The units argument says the first part of the text string represents weeks and the second part represents days. Then this is converted to a Duration object that stores time in seconds. We then divide by dweeks(1) (duration) to convert seconds to weeks. Notice how the NA remains NA and that “13w” converts to 13 because there are no days appended to it.