3 Data Visualization

After working hard to collect, clean, and analyze the data, you want to share the findings and engage the audience with insights about the significance of the study. But most people are not good at parsing rows and columns of numbers. It is the researcher's job to get the important messages across so that the audience can understand them with as little effort as possible.

Learning Objectives:

This chapter discusses the properties of various graphs, what they are good at and not so good at, and how to plot them in R using two different graphical systems. After finishing this chapter, you should be able to:

  • Utilize the number of variables and their types to choose the most appropriate graph.

  • Convey the message embedded in the data to the audience through graphs.

  • Create basic graphs with labels and a main title using the base R graphical system.

  • Create publication-quality graphs using the ggplot2 package.

3.1 Types of Graphs

The most appropriate graph depends on the data: specifically, on the types of variables (numeric or categorical) and how many of them there are. You will plot graphs for:

  1. One numeric variable.

  2. One categorical variable.

  3. One categorical and one numeric variable.

  4. Two or more numeric variables.

  5. A summary table.

3.2 Base R Graphics

This graphical system is built into base R; no additional package is required. It is simple and ideal for creating graphs quickly. The flip side is that the graphs may look primitive and lack features.

3.2.1 One Numeric Variable

You can plot a histogram of the numeric variable to see its distribution, such as the spread, the peak, symmetry, and flatness. If your concern is robust statistics, a boxplot will do the job.

3.2.1.1 Histogram

Let’s examine the distribution of iris’s Sepal.Width.


par(mfrow=c(2,2))

# Top left
hist(iris$Sepal.Width)

# Top right
hist(iris$Sepal.Width,
     xlab="Sepal Width",
     ylab = "Count",
     main = "Histogram of Sepal Width")

# Bottom left
hist(iris$Sepal.Width,
     col = 'skyblue', 
     border = 'white', 
     xlab="Sepal Width", 
     ylab = "Count", 
     main = "Color options")

# Bottom right
hist(iris$Sepal.Width, 
     breaks = seq(2,4.5,by=0.1), 
     col = 'skyblue', 
     border = 'white', 
     xlab="Sepal Width", 
     ylab = "Count", 
     main = "breaks option")

Figure 3.1: Base R Histogram. Top left: default. Top right: Customized X-Y labels and the main title. Bottom left: change color of bars and border. Bottom right: change the bin width (breaks=)

Note: par(mfrow=c(2,2)) splits the plotting area into a 2x2 grid so that multiple graphs can be shown together. The panels are filled row by row.
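Because a par() setting stays in effect for all later plots on the same device, it can help to save the previous settings and restore them afterwards. The following is a minimal sketch of that pattern; the name old_par is an illustrative choice and is not used elsewhere in this chapter.

```r
# par() invisibly returns the previous values of the settings it changes,
# so they can be stored and restored later.
old_par <- par(mfrow = c(1, 2))   # 1 row, 2 columns

hist(iris$Sepal.Width, main = "Histogram", xlab = "Sepal Width")
boxplot(iris$Sepal.Width, main = "Boxplot", ylab = "Sepal Width")

# Restore the previous layout for subsequent plots
par(old_par)
```

After par(old_par), the next plot again fills the whole plotting area.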

3.2.1.2 Boxplot

If the message is about robust statistics, such as the median and quartiles, or about the presence of outliers, the best graph is a boxplot. The y-axis represents the numeric variable; the x-axis carries no information.

par(mfrow=c(2,2))

# Top left
boxplot(iris$Sepal.Width)

# Top right
boxplot(iris$Sepal.Width, ylab="Sepal Width", main="Boxplot of Sepal Width")

# Bottom left
boxplot(iris$Sepal.Width,
        col = "skyblue",
        ylab="Sepal Width",
        outcol = 'red',
        outpch = 19,
        main="Color options")

# Bottom right
boxplot(iris$Sepal.Width, 
        horizontal = TRUE, 
        xlab="Sepal Width", 
        main="Horizontal Boxplot")

Figure 3.2: Boxplot. Top left: Default, Top right: labels and main title. Bottom left: color options. Bottom right: horizontal = TRUE


Figure 3.3: Anatomy of a Boxplot

The limits of the upper and lower whiskers are defined as:

\[Upper\ Whisker = Q3 + 1.5 \times IQR\]
\[Lower\ Whisker = Q1 - 1.5 \times IQR\]

where \(IQR = Q3 - Q1\). Each whisker is drawn to the most extreme data point that still lies within its limit.

In practice, values above the upper limit or below the lower limit are treated as outliers.
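The whisker formulas above can be checked by hand. The sketch below computes the two limits for iris$Sepal.Width with quantile() and IQR arithmetic, then compares the flagged values with boxplot.stats(), which reports the points that boxplot() draws as outliers. One assumption to keep in mind: boxplot() takes its quartiles from the hinges returned by fivenum(), which can differ slightly from quantile() for some data sets. The variable names are illustrative.

```r
x <- iris$Sepal.Width

# Quartiles and interquartile range
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Whisker limits from the formulas above
upper_limit <- q3 + 1.5 * iqr
lower_limit <- q1 - 1.5 * iqr

# Values beyond the limits are flagged as outliers
outliers <- x[x > upper_limit | x < lower_limit]
outliers

# Points that boxplot() would draw as outliers, for comparison
boxplot.stats(x)$out
```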

A histogram offers a glimpse of how the data is distributed, displaying the peak(s) and any long tail, which are less obvious in a boxplot. However, pinpointing the median and outliers on a histogram is non-trivial. Therefore, it makes sense to combine the strengths of both plots in a single graph. To keep the two plots comparable, set their axes to the same range with the xlim= and/or ylim= parameters. Note that the data axis of a horizontal boxplot is still controlled by ylim=. Here is an example:

par(mfrow=c(2,1))

hist(iris$Sepal.Width, 
     breaks = seq(2,4.5,by=0.1), 
     col = 'skyblue', 
     border = 'white', 
     xlim = c(2,4.5),
     xlab = "",
     ylab = "Count", 
     main = "Stacking Up Graphs")

boxplot(iris$Sepal.Width, 
        horizontal = TRUE, 
        ylim = c(2,4.5),
        xlab="Sepal Width", 
        main="")

Figure 3.4: Stacking up two graphs.

Figure 3.4 illustrates the added advantage of combining graphs. The peak of the histogram is almost perfectly aligned with the median shown in the boxplot, which suggests a roughly symmetric distribution. Additionally, the outliers that are hardly visible in the histogram can easily be spotted in the boxplot. This example shows the synergy and additional perspectives created by combining different graphs.
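With par(mfrow=c(2,1)) both panels get the same height, although the boxplot needs far less room than the histogram. As an alternative sketch, base R's layout() function accepts a heights= argument so the panels can be sized differently. The 2:1 height ratio, the margin values, and the names shared_range and old_mar are illustrative assumptions, not settings taken from this chapter.

```r
x <- iris$Sepal.Width
shared_range <- c(2, 4.5)   # common range for both data axes

# Two rows in one column; the top panel is twice as tall as the bottom one
layout(matrix(c(1, 2), nrow = 2), heights = c(2, 1))

hist(x,
     breaks = seq(2, 4.5, by = 0.1),
     col = "skyblue", border = "white",
     xlim = shared_range,
     xlab = "", ylab = "Count",
     main = "Histogram with a Compact Boxplot")

# Shrink the top margin of the short panel so the plot region stays usable
old_mar <- par(mar = c(4, 4, 0.5, 2))
boxplot(x, horizontal = TRUE, ylim = shared_range, xlab = "Sepal Width")

# Restore the margins and return to a single-panel layout
par(old_mar)
layout(1)
```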

3.2.1.3 One Categorical Variable

When it comes to a categorical variable, we are usually interested in the number of samples that fall in each of its categories. As discussed in Chapter 2, the tallying can be done using the table() function (2.1.8) or group_by() followed by summarize(n=n()) (2.2.10).
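As a quick reminder of what such a tally looks like, here is a minimal sketch using base R's table() on iris$Species. The name counts is an illustrative choice.

```r
# Tally the number of samples per species
counts <- table(iris$Species)
counts

# The category names and a single count can be extracted from the table
names(counts)
counts[["setosa"]]
```

The resulting object is a named table of counts, which is the form that the plotting functions in this section expect.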

3.2.1.3.1 Barplot

A barplot is the standard choice for visualizing tallied data. The only required input to R's barplot() function is a tabulated count of the categorical variable. Here is an example:

par(mfrow=c(2,2))

# Top left
barplot(with(iris, table(Species)))

# Top right
barplot(with(iris, table(Species)), 
        xlab="Species", 
        ylab="Count", 
        main="A Barplot of Species")

# Bottom left
barplot(with(iris, table(Species)), 
        col='skyblue', 
        border='blue', 
        xlab="Species", 
        ylab="Count", 
        main="Color Options")

# Bottom right
barplot(with(iris, table(Species)), 
        horiz = TRUE, 
        col='skyblue', 
        border='blue', 
        xlab="Count", 
        ylab="Species", 
        main="Horizontal Barplot")

Figure 3.5: A simple bar plot. Top left: Default. Top right: with labels and a main title. Bottom left: color options. Bottom right: horiz = TRUE

Sometimes you might want to reorder the categories along the x- or y-axis. By default, barplot() places the bars from left to right according to the current order of the factor levels. For example, the levels of iris$Species are:

levels(iris$Species)
#> [1] "setosa"     "versicolor" "virginica"

You can change the current order by redefining the factor variable with the factor() function (1.2.5). Suppose the desired order is virginica, setosa, and versicolor. Note that the assignment below changes iris$Species for the rest of the session.

iris$Species <- factor(iris$Species, 
                       levels = c("virginica", "setosa", "versicolor"), 
                       ordered = TRUE)
levels(iris$Species)
#> [1] "virginica"  "setosa"     "versicolor"
barplot(with(iris, table(Species)), 
        col='skyblue', 
        border='blue', 
        xlab="Species", 
        ylab="Count", 
        main="Reordered Species")

Figure 3.6: Reorder the Bars of a Barplot
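Another common request is to order the bars by their counts rather than by a hand-picked level order. Since every iris species has the same count, the sketch below uses the built-in mtcars data set instead (an assumption made purely for illustration) and sorts the tallied table with sort() before plotting.

```r
# Tally the number of cars per cylinder count
cyl_counts <- table(mtcars$cyl)

# Sort the table so the most frequent category comes first
sorted_counts <- sort(cyl_counts, decreasing = TRUE)
sorted_counts

barplot(sorted_counts,
        col = "skyblue",
        border = "blue",
        xlab = "Number of Cylinders",
        ylab = "Count",
        main = "Bars Ordered by Count")
```

Sorting the table leaves the underlying factor untouched, so the original data set is not modified.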

3.2.1.3.2 Pie Chart

Besides a barplot, a pie chart is also a good fit for displaying tallied data.

par(mfrow=c(2,2))

# Top left
pie(with(iris, table(Species)))

# Top right
pie(with(iris, table(Species)), 
    col = c('skyblue', 'orange', 'white'), 
    main="Species")

# Bottom left
counts <- with(iris, table(Species))
my_labs <- paste0(names(counts),"(",counts,")")
pie(counts, 
    labels = my_labs, 
    col = c('skyblue', 'orange', 'white'), 
    main = "Species Counts")


# Bottom right
counts <- with(iris, table(Species))
fractions <- round(counts/sum(counts),2)
my_labs <- paste0(names(counts),"(",fractions,")")
pie(counts, 
    labels = my_labs, 
    col = c('skyblue', 'orange', 'white'), 
    main = "Species Fractions")

Figure 3.7: A simple pie chart. Top left: Default. Top right: custom colors and a main title. Bottom left: labels with counts. Bottom right: labels with fractions
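The fractions in the bottom-right panel can also be obtained with base R's prop.table(), which divides each count by the total. The sketch below uses it to build percentage labels; the names fractions and percent_labs and the label format are illustrative choices.

```r
counts <- table(iris$Species)

# prop.table() converts counts to fractions that sum to 1
fractions <- prop.table(counts)

# Build labels such as "name (xx.x%)"
percent_labs <- paste0(names(counts), " (",
                       round(100 * as.vector(fractions), 1), "%)")

pie(counts,
    labels = percent_labs,
    col = c("skyblue", "orange", "white"),
    main = "Species Percentages")
```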

3.2.1.4 One Categorical and One Numeric Variable

The categorical variable usually serves as a group identifier that splits the data into groups so that the numeric variable can be compared between groups. This kind of graph is generally called a side-by-side plot. You can choose a side-by-side boxplot or a side-by-side barplot. For example, you can visually compare the robust statistics of different iris species.

3.2.1.4.1 Side-by-side Boxplot
par(mfrow=c(2,1))

boxplot(Sepal.Length ~ Species,
        data=iris, 
        main="A Side-by-Side Boxplot")

boxplot(Sepal.Length ~ Species,
        data=iris, 
        horizontal = TRUE,
        main="A Horizontal Side-by-Side Boxplot")

Figure 3.8: Side-by-side plot

Sepal.Length ~ Species is read as “Sepal length by species”; the tilde symbol “~” means “by”.
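The same formula notation is understood by other base R functions. As a small sketch, aggregate() can use the formula to compute the median sepal length by species, which is the value marked by the thick line in each box. The name medians is an illustrative choice.

```r
# "Sepal.Length by Species": one median per species
medians <- aggregate(Sepal.Length ~ Species, data = iris, FUN = median)
medians
```

The result is a data frame with one row per species, so the numbers can be read off directly instead of being estimated from the plot.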

However, there is a glitch in the horizontal boxplot of Figure 3.8: the y-axis tick labels collide. This can be fixed by adding the las = 2 parameter to the boxplot() function, which draws the tick labels perpendicular to their axis.

boxplot(Sepal.Length ~ Species,
        data=iris, 
        horizontal = TRUE,
        las = 2,
        main="Y-tick Labels Reorientated")

Figure 3.9: Side-by-side plot

3.2.1.4.2 Side-by-side Barplot

Suppose you want to visually compare the average sepal length among different iris species. Here you will begin to feel how cumbersome base R can be, and to appreciate the power of tidyverse. Still, you can make such a graph in two steps with the help of a new function, tapply().

First, make a table of average sepal length by species. Note that this cannot be done with the table() function, which only tallies the number of samples per species.

mean_data <- tapply(iris$Sepal.Length, iris$Species, FUN = mean)
mean_data
#>  virginica     setosa versicolor 
#>      6.588      5.006      5.936

How does tapply() work? The “t” represents table. The function applies mean() to the input data iris$Sepal.Length grouped by iris$Species. In other words, mean() is applied to the sepal lengths of each species separately. The species appear in the reordered level sequence because iris$Species was redefined earlier.

Then pass the mean_data object to barplot().

barplot(mean_data, xlab="Species", ylab="Average Sepal.Length", main="Side-by-side Barplot")

Figure 3.10: Side-by-side Barplot
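The grouping that tapply() performs can also be spelled out in two explicit steps, which may make the mechanism easier to see. This is only a sketch of equivalent base R code, not a description of how tapply() is implemented internally; the names groups, manual_means, and tapply_means are illustrative.

```r
# Step 1: split the sepal lengths into one vector per species
groups <- split(iris$Sepal.Length, iris$Species)

# Step 2: apply mean() to each group
manual_means <- sapply(groups, mean)
manual_means

# The one-step tapply() call gives the same numbers
tapply_means <- tapply(iris$Sepal.Length, iris$Species, FUN = mean)
all.equal(as.vector(manual_means), as.vector(tapply_means))
```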

3.2.1.5 Two or more Numeric Variables

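A scatter plot is the usual choice for two numeric variables, and a regression line can be added on top of it with abline(). The pch parameter sets the point symbol; see more in this post: https://www.r-bloggers.com/2021/06/r-plot-pch-symbols-different-point-shapes-in-r/

The cex parameter scales the size of the points; cex=1 is the default.

For more than two numeric variables, several scatter plots can be arranged in a multipanel plot.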
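As a minimal sketch of the scatter plot described above, the following code plots two numeric columns of iris and adds a least-squares regression line. The choice of Petal.Length and Petal.Width, the colors, and the name fit are assumptions made for illustration.

```r
# Scatter plot of two numeric variables
plot(iris$Petal.Length, iris$Petal.Width,
     pch = 19,          # solid circle
     cex = 1,           # default point size
     col = "skyblue",
     xlab = "Petal Length",
     ylab = "Petal Width",
     main = "Scatter Plot with Regression Line")

# Fit a simple linear regression and draw it on the existing plot
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
abline(fit, col = "red", lwd = 2)
```

abline() reads the intercept and slope from the fitted model, so the line always matches the data passed to lm().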

3.2.1.6 Summary Table

3.3 ggplot2

Before you plot the graphs, load tidyverse so you can use the bundled graphics package ggplot2.

library(tidyverse)
#> ── Attaching core tidyverse packages ──── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
#> ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
#> ✔ purrr     1.1.0     
#> ── Conflicts ────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

3.3.1 Grammar of ggplot