2 Dataframes in R
If you haven’t already, read in the data using
library('readr')
library('dplyr')
library('ggplot2')
testdata=read.csv("https://raw.github.com/hdg204/DoctorsAsDataScientists/main/simulated_diabetes_data.csv")
2.1 The Diabetes Prediction Dataset
The Diabetes Prediction Dataset contains 9 variables recorded in 100,000 people. You can explore the dataset yourself by double clicking on testdata on the top right of your screen, in the Environment tab. You can also sort the columns by double clicking on the headers to get a feel for what’s in the dataset.
2.2 Summarising columns
We will be using two functions for summarising the data in one column, mean and sum. Both functions need to be told which variable to use, which can be done with dataframe$variable.
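As a minimal sketch of the dataframe$variable pattern, the snippet below builds a tiny made-up dataframe and pulls out one column. The name toy and its values are illustrative assumptions, not part of the real dataset.

```r
# Hypothetical three-row dataframe, for illustration only
toy = data.frame(
  gender = c("Female", "Male", "Female"),
  bmi = c(22.5, 31.0, 27.5)
)

# dataframe$variable returns that column as a vector
bmi_values = toy$bmi
print(bmi_values)

# Column names are case sensitive: toy$BMI does not find the bmi column
print(is.null(toy$BMI))
```

The same pattern applies to testdata, for example testdata$bmi.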
2.2.1 Mean
To calculate a column average, run
mean(testdata$bmi)
#> [1] 27.8137
This will give you the average BMI.
Exercise 1: Write a command to give the average HbA1c level
Note that R is case sensitive, so pay attention to the exact spelling of the column header.
2.2.2 Sum
Adding up everything in a column seems weird until you see that hypertension, heart_disease and diabetes are all coded as 0 and 1. This means you can find out how many people have diabetes by running
sum(testdata$diabetes)
#> [1] 8699
Exercise 2: How many people in the dataset have hypertension?
While we’re here, you can also calculate the mean of a variable coded as 0 and 1, and this will give you the fraction that are 1. In this data, 0.08699 have diabetes (8.699%). What is the prevalence of hypertension?
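To see why the mean of a 0/1 column is a fraction, here is a small sketch on a made-up indicator vector. The name has_condition and its five values are assumptions for illustration only.

```r
# Hypothetical 0/1 indicator for five people (illustrative values only)
has_condition = c(0, 1, 0, 0, 1)

# sum counts the 1s; mean divides that count by the number of people
n_with = sum(has_condition)
fraction_with = mean(has_condition)
percent_with = 100 * fraction_with

print(n_with)         # 2 of the 5 values are 1
print(fraction_with)  # 2 / 5
print(percent_with)   # the same fraction as a percentage
```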
2.3 Summary Statistics on Dataframes
R has a lot of commands that allow you to summarise data quickly and easily. Run the following commands one by one and try to figure out what they’re doing.
Of particular importance is nrow(testdata). The nrow function counts how many rows are in your data, which in this case is the number of people because we have one row per person.
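The commands themselves are not listed in this section. As an assumption about the kind of commands meant, the sketch below runs some commonly used base R summary functions on a made-up dataframe (toy and its values are hypothetical). Try each line in the console and compare the output.

```r
# Hypothetical four-row dataframe, for illustration only
toy = data.frame(
  gender = c("Female", "Male", "Female", "Male"),
  bmi = c(22.5, 31.0, 27.5, 24.0),
  diabetes = c(0, 1, 0, 0)
)

head(toy, 2)   # first two rows
nrow(toy)      # number of rows (here, one per person)
ncol(toy)      # number of columns
names(toy)     # column names
summary(toy)   # per-column summary (min, mean, max and so on)
str(toy)       # column types and a preview of the values
```

Each of these also works on testdata, for example summary(testdata).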
2.4 Filtering and Managing Data
We will study two commands from the dplyr package: filter and mutate.
The dplyr commands used here take the same general form, dataframe2=command(dataframe1,instructions). This means you are creating a new dataframe, dataframe2, which is the original dataframe, dataframe1, with the instructions applied to it.
2.4.1 Filter
Filter is a useful command for studying a subset of people. For example, run
females=filter(testdata,gender=='Female')
This has made a new dataframe, called females, which includes only the people whose gender is Female. Double click on females in the Environment tab to check it’s done the right thing. Note that R uses the double == for comparisons, and a single = for defining new variables.
You can use nrow(females) to count them.
The & symbol can be used to combine multiple conditions. For example, filtering testdata on gender=='Female' & diabetes==1 and passing the result to nrow will tell you how many women in the data have diabetes.
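A sketch of combining two conditions with &, run on a made-up dataframe so the result is easy to check by eye. The name toy and its values are hypothetical; the commented line at the end is an assumption about how the same pattern would look on testdata.

```r
library(dplyr)

# Hypothetical five-row dataframe, for illustration only
toy = data.frame(
  gender = c("Female", "Male", "Female", "Female", "Male"),
  diabetes = c(1, 1, 0, 1, 0)
)

# Keep only the rows where both conditions are TRUE
women_with_diabetes = filter(toy, gender == 'Female' & diabetes == 1)

# Count the rows that are left
n_women_with_diabetes = nrow(women_with_diabetes)
print(n_women_with_diabetes)

# Assumed equivalent on the real data:
# nrow(filter(testdata, gender == 'Female' & diabetes == 1))
```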
Exercise 3: how many men have hypertension?
Using nrow and filter we can calculate the prevalence of a disease in a subset. For diabetes in women, take the number of women with diabetes, divide by the number of women, and multiply by 100 to give a percentage.
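The calculation described above can be sketched as follows, again on a made-up dataframe (toy and its values are hypothetical, not taken from the real dataset).

```r
library(dplyr)

# Hypothetical five-row dataframe, for illustration only
toy = data.frame(
  gender = c("Female", "Male", "Female", "Female", "Male"),
  diabetes = c(1, 1, 0, 1, 0)
)

# Denominator: everyone in the subset
n_women = nrow(filter(toy, gender == 'Female'))

# Numerator: people in the subset who have the disease
n_women_with_diabetes = nrow(filter(toy, gender == 'Female' & diabetes == 1))

# Prevalence as a percentage
prevalence_percent = 100 * n_women_with_diabetes / n_women
print(prevalence_percent)
```

Replacing toy with testdata gives the prevalence in the real data.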
Exercise 4: what proportion of men have hypertension?
2.4.2 Mutate
Mutate lets you make new columns based on operations on existing ones. This can be useful for transforming variables or classifying people based on a criterion.
testdata=mutate(testdata,obesity=bmi>30)
The command above takes the dataframe testdata and makes a new column called obesity, which is TRUE when bmi is over 30 and FALSE otherwise. It then overwrites testdata with this new dataframe, which contains obesity.
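A small sketch of the same mutate pattern on a made-up dataframe (toy and its values are hypothetical). It also shows that sum and mean work on a TRUE/FALSE column, because R treats TRUE as 1 and FALSE as 0.

```r
library(dplyr)

# Hypothetical four-row dataframe, for illustration only
toy = data.frame(bmi = c(22.5, 31.0, 27.5, 35.2))

# Add a TRUE/FALSE column and overwrite toy with the result
toy = mutate(toy, obesity = bmi > 30)
print(toy)

# TRUE counts as 1, so sum gives a count and mean gives a fraction
n_obese = sum(toy$obesity)
fraction_obese = mean(toy$obesity)
print(n_obese)
print(fraction_obese)
```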
If you look at HbA1c_level, it’s still in the old % units, not mmol/mol.
Exercise 5: Use mutate to make a new HbA1c column called HbA1c_mmol
Hint: the conversion formula is approximately HbA1c_mmol=11*(HbA1c_level-2.15)
2.5 Final Exercise
You can now filter dataframes and make new variables based on old ones, so you can answer some questions about the data.
- What is the average BMI in people with diabetes and in people without diabetes?
- Do males or females have higher rates of hypertension, diabetes or heart disease?
- How many times higher is the heart disease prevalence in people with BMI over 25 compared to people with BMI under 25?
- The threshold for a diabetes diagnosis is HbA1c > 48 mmol/mol. How many people have diabetes by this threshold but don’t have a diagnosis?