1 Doctors as Data Scientists

1.1 About

This document contains instructions for a tutorial aimed at introducing third-year medical students at the University of Exeter to data science, using R. The focus is on giving some experience of handling large healthcare datasets and visualising health data, rather than on developing good programming fundamentals. There are two sections, one on data handling and one on visualisation, each of which is intended to take approximately 90 minutes.

1.2 Introduction

Data science has become increasingly important in modern medical research, which is often done using large databases covering millions of people. As computing facilities become better and we move towards a “Digital NHS,” the amount of data available for medical research will only increase. Data science is about gaining insights from large datasets. Most medical guidelines that inform your practice will have been shaped by data science, so having some experience in the field will help you understand where these insights come from.

As a junior doctor, you will also be expected to audit your own practice. You may also become involved in research using data from either your own practice or larger, external databases. Efficient and accurate data-handling skills can make these processes a lot easier.

1.3 Usage

We will be using the R programming language through a piece of software called RStudio, hosted on a website called Posit Cloud, which is free to use. R is widely used for data analysis and statistics in medical research; it is free to download and very flexible in what you can use it for. Compared with something like Excel, R copes better with hundreds of thousands of records, is easier to automate, and makes it easy to share the steps of your analysis with others.

While this tutorial can be run on your own machine, for ease of setup I recommend creating an account on Posit Cloud at https://posit.cloud/.

Once logged in, at the top right, click New Project -> New RStudio Project. This should load up the RStudio interface. If all has worked correctly, you should see something like the following in the console:

R version 4.4.3 (2025-02-28) -- "Trophy Case"
Copyright (C) 2025 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu
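If the startup message has scrolled out of view, the same details can be checked from the console. The following is a minimal sketch using base R's built-in R.version.string and R.version objects; the values printed depend on the R installation you are connected to.

```r
# Full version string of the running R session
R.version.string

# Platform R was built for, matching the "Platform:" line of the startup message
R.version$platform
```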

1.4 Setup

There are a few R commands we need to run to set up the project before starting session 1.

Install the packages below by pasting their install commands into the console and pressing Enter. readr is a package for reading in data, dplyr is a popular package for managing dataframes, and ggplot2 is a popular package for making graphs.

Once these packages are installed (ignore any red text), they need to be loaded.
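The install and load steps can be sketched as follows. This is a minimal sketch, assuming the packages come from the default repository via install.packages() and are loaded with library(). The names pkgs and to_install are illustrative, and the check that skips packages that are already installed is a convenience added here rather than part of the original instructions.

```r
# Packages used in this tutorial
pkgs <- c("readr", "dplyr", "ggplot2")

# Install only the packages that are not already present.
# Red text during installation is usually just progress messages.
to_install <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(to_install) > 0) install.packages(to_install)

# Load each package so its functions are available in this session
library(readr)
library(dplyr)
library(ggplot2)
```

Installation only needs to happen once per project, but the library() calls need to be run again each time a new R session is started.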

There’s one last thing we need to do before we start: read in the data.

testdata <- read.csv("https://raw.github.com/hdg204/DoctorsAsDataScientists/main/simulated_diabetes_data.csv")

This command reads a CSV file that I’ve stored on GitHub into a dataframe called testdata. If everything has run correctly, nrow(testdata), which counts the rows of the dataframe, should give the following result.

nrow(testdata)
#> [1] 100000
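Beyond the row count, a few base R functions give a quick first look at a freshly loaded dataframe. The sketch below repeats the read.csv() call so that it runs on its own. It assumes nothing about the column names, and the output is not shown here because it depends on the contents of the file.

```r
# Read the simulated dataset from GitHub (same call as above)
testdata <- read.csv("https://raw.github.com/hdg204/DoctorsAsDataScientists/main/simulated_diabetes_data.csv")

dim(testdata)    # number of rows, then number of columns
names(testdata)  # column names
head(testdata)   # first six rows
str(testdata)    # type of each column, with a few example values
```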

Once that’s done, you’re ready to move on to part 1 - Dataframes in R.