Spreadsheets

Tentative schedule

1. Introduction (18’)

Principles for good data organization

Exercise

What are spreadsheets good/bad for?

Data entry.
Organizing data.
Subsetting and sorting data.
Statistics.
Plotting.

Key point

It is crucial to organize research data well.

2. Formatting data tables in Spreadsheets (35’)

How to structure a dataset so that it is easy to work with

Wide layout: Good to enter data

Example

Each cell represents a count of individuals for a given combination of plot and Sex.
The variable sex is implicit. It’s values are spread along two columns F and M.

plot	F	M
1	11	12
2	21	22
3	31	32

From wide to long, automatically

The key word here is automatically. Example in R:

long <- wide %>% 
  gather(key = "sex", value = "n", F:M)

long
## # A tibble: 6 x 3
##    plot sex       n
##   <int> <chr> <int>
## 1     1 F        11
## 2     2 F        21
## 3     3 F        31
## 4     1 M        12
## 5     2 M        22
## 6     3 M        32

Long layout: Great for data analysis

Example in R

Take long, then group by sex, then sum the n of individuals.

long %>% 
  group_by(sex) %>% 
  summarize(n = sum(n))
## # A tibble: 2 x 2
##   sex       n
##   <chr> <int>
## 1 F        63
## 2 M        66

Make your research reproducible

Track how you cleaned your data

Automatically with googlesheets: File > Version history

Make your research reproducible

Track how you cleaned your data

Manually: Write cleaning steps in plain text (e.g. README.txt)

Structuring data in spreadsheets

Tidy data (Book chapter; paper)

(Different kinds of data may be in different, related tables.)

Is this a tidy dataset?

Introduction to workshop data

(source)

Time-series for a small mammal community in southern Arizona to study the effects of rodents and ants on the plant community

Rodents sampled on 24 plots, with different treatments.
Real dataset (used in over 100 publications).
Slightly simplified for the workshop.

Discussion

What’s wrong and how would you fix it?

Online: http://bit.ly/carpentries-googlesheets-messy, or
Download: http://bit.ly/carpentries-spreadsheets-messy

Key points

Keep raw data raw.
Track changes with an automated system or in plain text.
Organize your data according to tidy data principles.

Bonus: Google forms and sheets

Demo google forms

3. Formatting problems

Common formatting challenges and how to avoid them

See lesson.

Learn more

Data organization in spreadsheets

Broman KW, Woo KH. (2018) Data organization in spreadsheets. PeerJ Preprints 6:e3183v2 https://doi.org/10.7287/peerj.preprints.3183v2

4. Dates as data (13’)

How to handle dates safely

Exercise

Solution

Exercise

Solution

How spreadsheets store dates?

Dates stored as integer: Number of days from 1899-12-31.

WARNING: Can’t parse dates before 1899-12-31

Exercise

Solution

Preferred date formats

Many programs like these formats

Date:

Clear and succint (1 column): YYYY-MM-DD
Clear but not succint (3 columns): YEAR, MONTH, DAY
Succint but unclear (1 column): YYYYMMDD.

Generalizing to date-time: YYYYMMDDhhmmss

Example in R:

library(lubridate)
as_datetime("20180923145108")
## [1] "2018-09-23 14:51:08 UTC"

Key point

Format dates for any program to interpret them correctly.

5. Quality control (5’)

How to validate data to avoid errors

Design a system to miminize errors

From be_pragmatic()

Demo data validation

bit.ly/carpentries-googlesheets-messy

Cell range and Criteria may come from different sheets.

Demo data validation

You can protect sheets and cells

Demo data validation

Others can request changes via comments

Key points

You can control what type of data gets entered.
You can explore invalid data with sorting.
You can flag invalid data with conditional formatting.

6. Exporting data (10’)

How to export data to maximize downstream applications

Export a spreadsheet as a .csv file

In googlesheets:

File > Download as > Comma-separated values (.csv, current sheet)

When whould you export Tab-separated values (.tsv)?

Key points

Export data to plain text formats (.csv, .tsv).

Tentative schedule

1. Introduction (18’)

Principles for good data organization

Exercise

What are spreadsheets good/bad for?

Key point

2. Formatting data tables in Spreadsheets (35’)

How to structure a dataset so that it is easy to work with

Wide layout: Good to enter data

Example

From wide to long, automatically

The key word here is automatically. Example in R:

Long layout: Great for data analysis

Example in R

Make your research reproducible

Track how you cleaned your data

Make your research reproducible

Track how you cleaned your data

Structuring data in spreadsheets

Tidy data (Book chapter; paper)

Is this a tidy dataset?

Is this a tidy dataset?

Introduction to workshop data

(source)

Discussion

What’s wrong and how would you fix it?

Key points

Bonus: Google forms and sheets

3. Formatting problems

Common formatting challenges and how to avoid them

Learn more

Data organization in spreadsheets

4. Dates as data (13’)

How to handle dates safely

Exercise

Solution

Exercise

Solution

How spreadsheets store dates?

Exercise

Solution

Preferred date formats

Many programs like these formats

Key point

5. Quality control (5’)

How to validate data to avoid errors

Design a system to miminize errors

Demo data validation

bit.ly/carpentries-googlesheets-messy

Demo data validation

You can protect sheets and cells

Demo data validation

Others can request changes via comments

Key points

6. Exporting data (10’)

How to export data to maximize downstream applications

Export a spreadsheet as a .csv file

Key points

End

Have you installed the software for the next lesson?