2 Dataframes in R

If you haven’t already, read in the data using

library('readr')
library('dplyr')
library('ggplot2')
testdata=read.csv("https://raw.github.com/hdg204/DoctorsAsDataScientists/main/simulated_diabetes_data.csv")

2.1 The Diabetes Prediction Dataset

The Diabetes Prediction Dataset contains 9 variables recorded in 100,000 people. You can explore the dataset yourself by double clicking on testdata on the top right of your screen, in the Environment tab. You can also sort the columns by double clicking on the headers to get a feel for what’s in the dataset.

2.2 Summarising columns

We will use two functions for summarising the data in one column: mean and sum. Both need to be told which column to work on, which you do by writing dataframe$variable.
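As a minimal sketch of the dataframe$variable notation, the example below builds a tiny made-up dataframe called toy (an assumption for illustration, not part of the diabetes data) and pulls out one of its columns.

```r
# A small made-up dataframe, for illustration only
toy = data.frame(age = c(30, 45, 60), smoker = c(0, 1, 0))

# dataframe$variable returns one column as a list of values
toy$age
#> [1] 30 45 60
```

Functions such as mean and sum can then be given toy$age in exactly the same way as a column from testdata.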

2.2.1 Mean

To calculate a column average, run

mean(testdata$bmi)
#> [1] 27.8137

This will give you the average BMI.

Exercise 1: Write a command to give the average HbA1c level
Note that R is case sensitive, so pay attention to the column header.

2.2.2 Sum

Adding up everything in a column seems weird until you see that hypertension, heart_disease and diabetes are all coded as 0 and 1. This means you can find out how many people have diabetes by running

sum(testdata$diabetes)
#> [1] 8699

Exercise 2: How many people in the dataset have hypertension?

While we’re here, you can also calculate the mean of a variable coded as 0 and 1, and this will give you the fraction that are 1. In this data, 0.08699 have diabetes (8.699%). What is the prevalence of hypertension?
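To see why the mean of a 0 and 1 variable is the fraction of 1s, here is a small sketch using a made-up variable called has_condition (hypothetical, not a column in testdata) where 2 out of 5 people are coded as 1.

```r
# Made-up 0/1 variable: 2 of these 5 people have the condition
has_condition = c(0, 1, 0, 0, 1)

sum(has_condition)         # how many are 1
#> [1] 2
mean(has_condition)        # fraction that are 1 (2 divided by 5)
#> [1] 0.4
100 * mean(has_condition)  # the same thing as a percentage
#> [1] 40
```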

2.3 Summary Statistics on Dataframes

R has a lot of commands that allow you to summarise data quickly and easily. Run the following commands one by one and try to figure out what they’re doing.

dim(testdata)
nrow(testdata)
ncol(testdata)
ls(testdata)
head(testdata)
glimpse(testdata)

Of particular importance is nrow(testdata). The nrow function counts how many rows are in your data, which in this case is the number of people because we have one row per person.

2.4 Filtering and Managing Data

We will study two commands from the dplyr package: filter and mutate.

dplyr commands always take the same general form, dataframe2=command(dataframe1,instructions). This means you are creating a new dataframe, dataframe2, which is the original dataframe, dataframe1, with the instructions applied to it.
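As a rough sketch of this pattern, the example below uses a made-up dataframe called dataframe1 (the id and age columns are assumptions for illustration). The command, here filter, takes the old dataframe first and the instructions second, and the result is stored under a new name.

```r
library('dplyr')

# Made-up dataframe, for illustration only
dataframe1 = data.frame(id = c(1, 2, 3, 4), age = c(25, 40, 55, 70))

# New dataframe = command(old dataframe, instructions)
dataframe2 = filter(dataframe1, age > 50)

nrow(dataframe1)  # the original dataframe is unchanged
#> [1] 4
nrow(dataframe2)  # the new one keeps only the rows matching the instructions
#> [1] 2
```

Storing the result under a new name means the original dataframe is still there if you need it again.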

2.4.1 Filter

Filter is a useful command for studying a subset of people. For example, run

females=filter(testdata,gender=='Female')

This has made a new dataframe, called females, which includes only the people whose gender is Female. R uses the double == for comparisons, and = for defining new variables. Double click on females in the Environment tab to check it’s done the right thing.

You can use nrow(females) to count them.
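The difference between = and == is easy to mix up, so here is a short sketch using made-up values (x and gender below are hypothetical and not taken from testdata).

```r
x = 5     # a single = stores the value 5 under the name x
x == 5    # a double == asks the question "is x equal to 5?"
#> [1] TRUE
x == 3
#> [1] FALSE

# On a whole list of values, == answers the question once per value
gender = c('Female', 'Male', 'Female')
gender == 'Female'
#> [1]  TRUE FALSE  TRUE
```

filter keeps the rows where the answer is TRUE and drops the rest.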

The & symbol can be used to combine multiple conditions. For example,

females_diabetes=filter(testdata,gender=='Female' & diabetes==1)
nrow(females_diabetes)
#> [1] 4684

This will tell you how many women in the data have diabetes.

Exercise 3: how many men have hypertension?

Using nrow and filter, we can calculate the prevalence of a disease in a subset.

100*nrow(females_diabetes)/nrow(females)
#> [1] 9.027657

This takes the number of women with diabetes, divides by the number of women, and multiplies by 100 to give a percentage.
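The same calculation can be sketched on a tiny made-up dataframe where the answer is easy to check by eye. The dataframe toy and its age and disease columns are assumptions for illustration only.

```r
library('dplyr')

# Made-up data: 5 people, 3 of them aged over 50, 2 of those with the disease
toy = data.frame(age = c(30, 55, 62, 48, 71),
                 disease = c(0, 1, 0, 0, 1))

over50 = filter(toy, age > 50)                          # the subset
over50_disease = filter(toy, age > 50 & disease == 1)   # cases within the subset

# Prevalence in the subset: 2 out of 3 people, as a percentage
100 * nrow(over50_disease) / nrow(over50)
#> [1] 66.66667

# Because disease is coded 0 and 1, the mean of the column in the subset agrees
100 * mean(over50$disease)
#> [1] 66.66667
```

Both routes give the same number, so you can use whichever you find easier to read.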

Exercise 4: what proportion of men have hypertension?

2.4.2 Mutate

Mutate lets you make new columns based on operations on other ones. This can be useful for transforming variables or classifying people based on a criterion.

testdata=mutate(testdata,obesity=bmi>30)

The command above takes the dataframe testdata and makes a new column called obesity, which is TRUE when BMI is over 30 and FALSE otherwise. It then overwrites testdata with this new dataframe, which contains obesity.
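To see what the new column looks like, here is a minimal sketch on a made-up dataframe called toy with three invented BMI values (not taken from testdata).

```r
library('dplyr')

# Made-up BMI values, for illustration only
toy = data.frame(bmi = c(22.5, 31.0, 28.4))

# Add a column that is TRUE when bmi is over 30
toy = mutate(toy, obesity = bmi > 30)

toy$obesity
#> [1] FALSE  TRUE FALSE

# TRUE counts as 1 and FALSE as 0, so sum counts the people over the threshold
sum(toy$obesity)
#> [1] 1
```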

If you look at HbA1c_level, it’s still in the old % units, not mmol/mol.

Exercise 5: Use mutate to make a new HbA1c column called HbA1c_mmol
Hint: the conversion formula is HbA1c_mmol=11*(HbA1c_level-2.15)

2.5 Final Exercise

Now that you can filter dataframes and make new variables based on old ones, you can answer some questions about the data.

What is the average BMI in people with diabetes and in people without diabetes?
Do males or females have higher rates of hypertension, diabetes or heart disease?
How many times higher is the prevalence of heart disease in people with BMI over 25 than in people with BMI under 25?
The threshold for a diabetes diagnosis is HbA1c > 48 mmol/mol. How many people have diabetes but don’t have a diagnosis?
