ECOM20001
Econometrics 1


Semester 1, 2025

Tutorial 2
Visualising and Describing Data in R

Objectives



After completing this tutorial you should be comfortable with

  • Using R to summarise and interpret data
  • Creating, and interpreting, density plots in R
  • Creating, and interpreting, scatter plots in R

You should have also revisited

  • marginal, joint and conditional distributions

  • linear transformations of probability distributions

  • using Summation and Expectation operators

Contact Details



Tutor: Richard Hayes (Richard)



Email: rjhayes@unimelb.edu.au



Consultation Time: Wednesdays, 02:00-03:00 pm

Room 473, 4th floor, FBE Building

Create a Data Frame



First, you should already have the data file, tute2_crime.csv, and the \(\texttt{R}\) script file, tute2.R, in your T2 folder (as we discussed last week).

Then open \(\texttt{RStudio}\), go to the top menu bar and select

\[ \text{Session} \rightarrow \text{Set Working Directory} \rightarrow \text{Choose Directory ...} \]

highlight your T2 folder and hit Open.

Now, in \(\texttt{RStudio}\), click on the \(\texttt{R}\) script file, tute2.R, to open it in the Script Window.

Highlight then Run the following lines

## Load Stargazer package for summary statistics and regression tables
library(stargazer)

## Load the dataset from a comma-separated values (CSV) file
data=read.csv(file="tute2_crime.csv")

The line library(stargazer) loads the stargazer package.

The next (executable) line data=read.csv(file="tute2_crime.csv") creates a data frame named data.
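
After reading in a CSV file it is worth checking that the import worked. The sketch below is self-contained: it writes a tiny data frame with made-up values (the names toy, toy_file and toy_data are hypothetical and the numbers are NOT the tutorial data) to a temporary file, reads it back with read.csv, and inspects the result. With the real file you would apply the same checks to data.

```r
# Hypothetical rows for illustration only; these are NOT the tutorial data
toy <- data.frame(stateid = 1:3,
                  vio     = c(300, 450, 600),
                  rob     = c(80, 110, 150),
                  dens    = c(20, 90, 250),
                  avginc  = c(14.2, 15.8, 17.5))

# Write the toy data to a temporary CSV file, then read it back
toy_file <- tempfile(fileext = ".csv")
write.csv(toy, toy_file, row.names = FALSE)
toy_data <- read.csv(file = toy_file)

# Quick checks that the import worked as expected
dim(toy_data)     # number of rows and columns
str(toy_data)     # variable names and types
head(toy_data)    # first few rows
```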

Describing Data



The variables are

names(data)
[1] "stateid" "vio"     "rob"     "dens"    "avginc" 


We are then asked what a “typical” state looks like. Use stargazer:

stargazer(data, 
          summary.stat = c("n", "mean", "sd", "median","p25","p75", "min", "max"), 
          covariate.labels = c("State", "Violent Rate", "Robbery Rate", "Pop. Density", "Ave. PC Income"),
          type="html", title="Descriptive Statistics")


Descriptive Statistics
Statistic         N    Mean      St. Dev.   Median    Pctl(25)   Pctl(75)   Min      Max
State             45   23.000    13.134     23        12         34         1        45
Violent Rate      45   431.484   209.541    382.800   275.500    570.000    66.900   854.000
Robbery Rate      45   106.656   64.193     100.900   75.300     152.500    8.800    240.800
Pop. Density      45   105.656   97.664     76.530    34.542     157.042    1.086    385.441
Ave. PC Income    45   15.816    1.937      15.797    13.919     17.114     12.370   20.273




Interpretation:

  • So we see that a typical state has 431 violent crimes per 100,000 people, 107 robberies per 100,000 people, an urban density of 106 people per square mile, and an average per-capita income of about $15,820 per year.
  • The range of robbery and violent crime rates is remarkable.
    Some states have only 67 violent crimes per 100,000 people per year, while others have up to 854 (!) violent crimes per 100,000 people per year.
    That is a more than tenfold difference between the lowest and highest violent crime rates across states.
    Similarly, the robbery rate is as low as 9 robberies per 100,000 people per year and goes up to 241 (!) per 100,000 people per year.
  • We also have some very rural (1 person per square mile) and very urban (385 people per square mile) states.
    And per-capita income similarly ranges from $12,370 to $20,270.
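
The "more than tenfold" comparison can be checked with a line or two of arithmetic. This is a minimal sketch using only the Min and Max values from the Descriptive Statistics table; the variable names are chosen here for illustration.

```r
# Minimum and maximum rates copied from the Descriptive Statistics table
vio_min <- 66.9;  vio_max <- 854.0
rob_min <- 8.8;   rob_max <- 240.8

# Ratio of the highest to the lowest rate across states
vio_ratio <- vio_max / vio_min
rob_ratio <- rob_max / rob_min
round(c(vio = vio_ratio, rob = rob_ratio), 2)
```

The violent crime ratio is a little under 13 and the robbery ratio a little over 27, so the spread in robbery rates is even wider in relative terms.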

Probability Density Plots



Now, we are asked to produce probability density plots for these variables. This can be done using the plot(density()) command in R.

# split the plotting area into a 2 x 2 grid so that four plots fit in one window
par(mfrow = c(2, 2))
# create density plots 
plot(density(data$vio))
plot(density(data$rob))
plot(density(data$avginc))
plot(density(data$dens))

Better looking Probability Density Plots:

You should also add a title and label the X and Y axes (particularly when including graphs in an assignment). For example, use

## Graph density of vio, more nicely done
# pdf("fig_nice_density_vio.pdf")
plot(density(data$vio),
     main="Density of Violent Crime Rate",
     xlab="Violent Crimes per 100,000 People",
     ylab="Density",
     col="orange")
# dev.off() 

Interpretation

# split the plotting area into a 1 x 2 grid so that two plots sit side by side
par(mfrow = c(1, 2)) 
# create density plots  
plot(density(data$rob),
     main="Density of Robbery Rate",
     xlab="Robberies per 100,000 People",
     ylab="Density",
     col="orange",
     lwd=2) 

plot(density(data$avginc),
     main="Per Capita Income (in $000's)",
     xlab="Per Capita Income",
     ylab="Density",
     col="orange",lwd=2)

The distributions of vio, rob and avginc appear roughly symmetric around their respective means.


plot(density(data$dens),
     main="Density of People Per Square Mile",
     xlab="People Per Square Mile",
     ylab="Density",
     col="orange", lwd=2) 

####  The following is optional you do NOT need to include
####  these lines in your assignment code
# to add the mean and median as dashed vertical lines use
abline(v=mean(data$dens),col="red", lty=2)
abline(v=median(data$dens),col="blue", lty=2)
# and add a legend whose line types and widths match the lines drawn above
legend(220, 0.005, legend=c("PDF", "mean of pop density", 
                            "median of pop density"),
       col=c("orange", "red","blue"), lty=c(1, 2, 2), cex=0.8, lwd=c(2, 1, 1))

The urban density variable is right skewed: many US states have similar, relatively low densities, but a few states in the right tail of the distribution, such as New York and California, are very densely populated.
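
One quick numerical signal of right skew is a mean that sits well above the median. The sketch below compares dens with avginc using only figures from the Descriptive Statistics table. Scaling the gap by the standard deviation is a rough rule of thumb chosen here for illustration, not a formal skewness test.

```r
# Means, medians and standard deviations copied from the Descriptive Statistics table
stats <- data.frame(
  variable = c("dens", "avginc"),
  mean     = c(105.656, 15.816),
  median   = c(76.530, 15.797),
  sd       = c(97.664, 1.937)
)

# Gap between mean and median, measured in standard deviations
stats$gap_in_sd <- (stats$mean - stats$median) / stats$sd
stats
```

For dens the mean exceeds the median by roughly 0.3 standard deviations, whereas for avginc the two are almost identical, which is consistent with the shapes of the density plots.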

Histograms



Histograms are closely related to probability density plots. For example, again looking at the dens variable, we have:

# produce a histogram
hist(data$dens, 
     col="green",
     border="black",
     prob = TRUE,
     xlab = "People per Square Mile",
     main = "Histogram and Density Plot for Urban Density")

# overlay the density plot 
lines(density(data$dens),
      lwd = 2,
      col = "chocolate3")
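
The reason the density curve can be overlaid on the histogram is that prob = TRUE rescales the bars so that their total area is 1, just like a density. The sketch below illustrates this with simulated right-skewed numbers (hypothetical values, NOT the tutorial data); hist(..., plot = FALSE) returns the bar heights and break points without drawing anything.

```r
set.seed(1)                        # make the simulated draws reproducible
x <- rexp(45, rate = 1 / 100)      # hypothetical right-skewed data, NOT the tutorial data

# Compute the histogram without drawing it
h <- hist(x, plot = FALSE)

# Area of each bar = height (density) x width; the areas should sum to 1
bar_area <- h$density * diff(h$breaks)
sum(bar_area)
```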

Box-Plots



Another method we could use is a box-plot e.g.

## Box and whisker plot of dens, more nicely done
boxplot(data$dens,
        main="Box and Whisker Plot of Urban Density",
        ylab="Population per square mile",
        col="purple")
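
To read a box-plot it helps to know where the whiskers stop. By default boxplot() draws whiskers no further than 1.5 times the interquartile range from the box and plots anything beyond as individual points. The sketch below works this out from the table's quartiles for dens. It is approximate, since boxplot() uses hinges that can differ slightly from the 25th and 75th percentiles reported by stargazer.

```r
# Quartiles, minimum and maximum of dens copied from the Descriptive Statistics table
q1 <- 34.542; q3 <- 157.042
dens_min <- 1.086; dens_max <- 385.441

iqr <- q3 - q1                      # interquartile range (height of the box)
upper_fence <- q3 + 1.5 * iqr       # whiskers extend no further than this
lower_fence <- q1 - 1.5 * iqr

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)
dens_max > upper_fence              # TRUE means at least one high outlier
dens_min < lower_fence              # FALSE means no low outliers
```

The maximum lies above the upper fence while the minimum is well inside the lower one, which again points to a long right tail.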

Scatter Plots



Scatter plots are used to visualise the relationship between two variables.

Again, we have to be careful about formatting the plots correctly.

Let’s start with looking at the relationship (if any) between rob and vio.

plot(data$vio,data$rob, 
     main="Relationship Between Robbery Rate and Violent Crime Rate",
     xlab="Violent Crime Rate per 100,000 People",
     ylab="Robbery Rate per 100,000 People",
     col="blue",
     pch=16)

What do you think?

Positive/Negative/No Relationship?

Linear/Non-Linear?
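
A sample correlation and a fitted straight line can back up what you see by eye. The sketch below uses simulated data (x and y are hypothetical and are NOT the tutorial variables), so it shows the mechanics only. With the real data you would use data$vio and data$rob in place of x and y.

```r
set.seed(2)                                   # reproducible simulated data
x <- runif(45, min = 50, max = 850)           # hypothetical values on the horizontal axis
y <- 0.25 * x + rnorm(45, mean = 0, sd = 20)  # hypothetical values on the vertical axis

# Sample correlation: the sign gives the direction and the
# magnitude gives the strength of the linear association
r <- cor(x, y)
r

# Scatter plot with a fitted straight line for reference
plot(x, y, pch = 16, col = "blue",
     xlab = "x (hypothetical)", ylab = "y (hypothetical)")
abline(lm(y ~ x), col = "red", lwd = 2)
```

If the points bend away from the fitted line in a systematic way, that suggests a non-linear relationship.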

Other examples:

Interpretation



You have been asked to offer an economic explanation of why a relationship may exist.

Meaning

Economic explanations focus on the costs and benefits of a particular behaviour to explain empirical patterns

There may be multiple explanations; one is fine!
There is not one “correct” explanation, so as long as the one you come up with makes sense, go with that.
However, if the explanation does not make sense you would lose marks in an assignment.

Example

We see a positive relationship between robbery rates and urban density.
This could potentially reflect:

  • The cost of robbery being lower in more dense states as potential robbery targets are more plentiful in close proximity.

  • The benefit of robbery being higher if more dense locations attract more retail shops and merchants (called “agglomeration” benefits of urban density), which provides more opportunities and hence benefit for robbery.

  • Police finding it more difficult to identify potential robbers in more crowded places, which again makes the expected costs of robbery lower since robbers are less likely to be caught.

Summation and Expectation Operators



Some useful rules for Summations

If \(\text{a}\) and \(\text{b}\) are constants and \(X\) and \(Y\) are random variables then:

  1. \(\qquad \sum\limits_{i=1}^n \text{a}X_i = \text{a}\sum\limits_{i=1}^n X_i\)

  2. \(\qquad \sum\limits_{i=1}^n (X_i+Y_i) = \sum\limits_{i=1}^n X_i+\sum\limits_{i=1}^n Y_i\)

  3. \(\qquad \sum\limits_{i=1}^n \text{b} = n\text{b}\)

  4. \(\qquad \overline{X} = \dfrac{\sum\limits_{i=1}^n X_i}{n} \Leftrightarrow \sum\limits_{i=1}^n X_i=n\overline {X}\)
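
The four rules are easy to check numerically. A minimal sketch with small made-up vectors (x, y, a and b are hypothetical values chosen only for illustration); each comparison should return TRUE.

```r
# Hypothetical numbers chosen only for illustration
x <- c(2, 4, 6, 8)
y <- c(1, 3, 5, 7)
a <- 3; b <- 5
n <- length(x)

sum(a * x) == a * sum(x)             # Rule 1: constants factor out of a sum
sum(x + y) == sum(x) + sum(y)        # Rule 2: the sum of a sum splits
sum(rep(b, n)) == n * b              # Rule 3: summing a constant n times
sum(x) == n * mean(x)                # Rule 4: the sum equals n times the mean
```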

Useful rules for Expectation and Variance operators can be found in the solutions (HTML version).

Also see Lecture 2 slides 23-28.

Exercises



Let’s go through an example of how these rules can be applied (see Part 2 Summation example 3 in the tutorial questions).

Show the following equality is true
\[\sum\limits_{i=1}^{n}\left(x_i - \bar{x} \right)^2 = \sum\limits_{i=1}^{n} x_i^2 - n\bar{x}^2 \]

\[\begin{align}
\sum\left( x_i - \overline{x}\right)^2 &= \sum \left( x_i^2 - 2\overline{x} x_i + \overline{x}^2 \right) \tag{1}\\
&= \sum x_i^2 -\sum\left( 2 \overline{x} x_i \right) + \sum\left( \overline{x}^2 \right) \tag{2}\\
&= \sum x_i^2 - 2 \overline{x} \sum x_i + n \overline{x}^2 \tag{3} \\
&= \sum x_i^2 - 2 \overline{x} n \overline{x} + n \overline{x}^2 \tag{4}\\
&= \sum x_i^2 - n \overline{x}^2
\end{align}\]

In line 3, you could also multiply the term \(2 \overline{x} \sum x_i\) by \(\dfrac{n}{n}\), which gives

\[ \sum x_i^2 - 2 n \overline{x} \frac{\sum x_i}{n} + n \overline{x}^2 \]

and, since \(\dfrac{\sum x_i}{n} = \overline{x}\), this leads to the same result as above.
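
A quick numerical check of the identity, as a sketch with a small made-up vector (the values in x are hypothetical). The last line also shows that the same sum of squares is what R's var() divides by n - 1.

```r
x <- c(3, 7, 8, 12)        # hypothetical data, chosen only for illustration
n <- length(x)
xbar <- mean(x)

lhs <- sum((x - xbar)^2)           # left-hand side of the identity
rhs <- sum(x^2) - n * xbar^2       # right-hand side of the identity
c(lhs = lhs, rhs = rhs)

# The same sum of squares sits inside R's sample variance
all.equal(lhs, (n - 1) * var(x))
```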

Linear Function of a Random Variable



In Part 2 Qn1, we have a random variable \(X\) that is i.i.d. from a \(N(\mu_X,1)\) distribution and another random variable \(Y\), defined as \(Y=2+2X\).

It turns out that \(Y \thicksim N(2+2\mu_X,4)\)

How did we get this?

In general (using the Expectation rules outlined in Lecture 2), if one random variable, \(Y\), is a linear function of another random variable, \(X\), such that
\[ Y = a + bX \]
the mean of \(Y\) is
\[\mu_Y = a + b \mu_X \]
and the variance of \(Y\) is
\[ \sigma_Y^2 = b^2 \sigma_X^2 \]
A linear function of a normally distributed variable is itself normally distributed, so only the mean and variance need to be worked out. In this case \(a=2\), \(b=2\) and \(\sigma_X^2=1\),

so \(\qquad \mu_Y = 2+2\mu_X\) and \(\, \sigma_Y^2 = 2^2 \times1=4\).
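
The mean and variance rules can also be seen in a simulation. This is a minimal sketch: the value of mu_X, the seed and the number of draws are arbitrary choices made here, not part of the tutorial question.

```r
set.seed(3)                         # reproducible draws
mu_X <- 5                           # hypothetical value for the mean of X
x <- rnorm(10000, mean = mu_X, sd = 1)
y <- 2 + 2 * x                      # the linear transformation Y = 2 + 2X

# These hold exactly for sample moments (up to floating-point error)
all.equal(mean(y), 2 + 2 * mean(x))
all.equal(var(y), 2^2 * var(x))

# With many draws the sample moments should be close to 2 + 2*mu_X = 12 and 4
c(mean_y = mean(y), var_y = var(y))
```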

Graphically



If \(\mu_X=2\), \(5\) or \(10\), then the distribution of \(Y\) is \(N(6,4)\), \(N(12,4)\) and \(N(22,4)\) respectively.
The following graph plots the distributions of \(Y\), conditional on the three \(\mu_X\) values.
Larger values of \(\mu_X\) shift the distribution of \(Y\) to the right.
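
A plot like this can be drawn with curve() and dnorm(). The sketch below is one possible way to do it; the colours, axis range and title are arbitrary choices made here and are not taken from the tutorial code. Note that dnorm() takes the standard deviation, so a variance of 4 is entered as sd = 2.

```r
mu_X <- c(2, 5, 10)                 # the three values of mu_X considered above
mu_Y <- 2 + 2 * mu_X                # implied means of Y: 6, 12, 22
sd_Y <- sqrt(4)                     # standard deviation of Y (variance is 4)
cols <- c("orange", "blue", "red")  # colours are an arbitrary choice

# Draw the first density, then overlay the other two
curve(dnorm(x, mean = mu_Y[1], sd = sd_Y), from = -2, to = 30,
      col = cols[1], lwd = 2, xlab = "y", ylab = "Density",
      main = "Distribution of Y = 2 + 2X for three values of the mean of X")
for (j in 2:3) {
  curve(dnorm(x, mean = mu_Y[j], sd = sd_Y), add = TRUE, col = cols[j], lwd = 2)
}
legend("topright", legend = paste0("N(", mu_Y, ", 4)"),
       col = cols, lwd = 2, cex = 0.8)
```

All three curves have the same spread because the variance of \(Y\) does not depend on \(\mu_X\); only the location changes.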

Probabilities - Contingency Tables



               High Grade   Medium Grade   Low Grade   Total
Study Hard     0.20         0.10           0.02        0.32
Sometimes      0.07         0.30           0.10        0.47
Never Study    0.01         0.05           0.15        0.21
Total          0.28         0.45           0.27        1.00

Study \ Grade  High         Medium         Low         Total
Hard           joint        joint          joint       marginal
Sometimes      joint        joint          joint       marginal
Never          joint        joint          joint       marginal
Total          marginal     marginal       marginal    1.00

Marginal and Conditional Probabilities



The marginal distribution for studying is

  • P(Study Hard)= 0.32

  • P(Study Sometimes)= 0.47

  • P(Study Never) = 0.21

The marginal distribution for performance is

  • P(High Grade)= 0.28

  • P(Medium Grade)= 0.45

  • P(Low Grade) = 0.27

The probability distribution for performance, conditional on Studying Hard is

  • P(High Grade|Study Hard)= 0.20/0.32 = 0.625

  • P(Medium Grade|Study Hard)= 0.10/0.32=0.3125

  • P(Low Grade|Study Hard) = 0.02/0.32=0.0625
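
These calculations can be reproduced in R by storing the joint probabilities in a matrix. This is a minimal sketch: the object names (joint, p_study, p_grade, p_grade_given_hard) are chosen here for illustration, and the numbers are the joint probabilities from the contingency table.

```r
# Joint probabilities copied from the contingency table
joint <- matrix(c(0.20, 0.10, 0.02,
                  0.07, 0.30, 0.10,
                  0.01, 0.05, 0.15),
                nrow = 3, byrow = TRUE,
                dimnames = list(study = c("Hard", "Sometimes", "Never"),
                                grade = c("High", "Medium", "Low")))

p_study <- rowSums(joint)    # marginal distribution for studying
p_grade <- colSums(joint)    # marginal distribution for performance

# Conditional distribution of grade given Study Hard: joint / marginal
p_grade_given_hard <- joint["Hard", ] / p_study["Hard"]
p_grade_given_hard
sum(p_grade_given_hard)      # a conditional distribution sums to 1
```

Replacing "Hard" with "Sometimes" or "Never" gives the other two conditional distributions.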

Statistical Independence

If, for example, Studying and Performance were independent, then the joint probability P(Study Hard, High Grade) would equal the product of the respective marginal probabilities.


P(Study Hard) \(\times\) P(High Grade)

Computing this product we get \(0.32 \times 0.28=0.0896\).

This is not equal to the joint probability P(Study Hard, High Grade) which is, from the table, \(0.20\).


Therefore, the random variables Studying and Performance are not independent.

You can pick any of the other joint probabilities and compare it with the product of the respective marginal probabilities; in this example you will reach the same conclusion.
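
To check every cell at once, outer() builds the full table of products of marginals, which is what the joint table would look like under independence. This is a sketch using the joint probabilities from the contingency table; the object names joint and indep are chosen here for illustration.

```r
# Joint probabilities copied from the contingency table
joint <- matrix(c(0.20, 0.10, 0.02,
                  0.07, 0.30, 0.10,
                  0.01, 0.05, 0.15),
                nrow = 3, byrow = TRUE,
                dimnames = list(study = c("Hard", "Sometimes", "Never"),
                                grade = c("High", "Medium", "Low")))

# Products of marginals: what the joint probabilities would be under independence
indep <- outer(rowSums(joint), colSums(joint))
round(indep, 4)

# Difference between the actual joint probabilities and the independence benchmark
round(joint - indep, 4)

# TRUE only if every cell matches, which is what independence requires
isTRUE(all.equal(joint, indep, check.attributes = FALSE))
```

Every cell differs from its independence benchmark, so the final line returns FALSE.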