ECOM20001
Econometrics 1

Semester 1, 2025

Tutorial 2
Visualising and Describing Data in R

Objectives

After completing this tutorial you should be comfortable with

Using R to summarise and interpret data
Creating, and interpreting, density plots in R
Creating, and interpreting, scatter plots in R

You should have also revisited

marginal, joint and conditional distributions
linear transformations of probability distributions
using Summation and Expectation operators

Contact Details

Tutor: Richard Hayes (Richard)

Email: rjhayes@unimelb.edu.au

Consultation Time: Wednesdays, 02:00-03:00 pm

Room 473, 4th floor, FBE Building

Create a Data Frame

First, you should already have the data file, tute2_crime.csv, and the $\texttt{R}$ script file, tute2.R, in your T2 folder (as we discussed last week).

Then open $\texttt{RStudio}$, go to the top menu bar and select

\[ \text{Session} \rightarrow \text{Set Working Directory} \rightarrow \text{Choose Directory ...} \]

highlight your T2 folder and hit Open.

Now, in $\texttt{RStudio}$, click on the $\texttt{R}$ script file, tute2.R, to open it in the Script Window.

Highlight then Run the following lines

## Load Stargazer package for summary statistics and regression tables
library(stargazer)

## Load the dataset from a comma-separated values (CSV) file
data=read.csv(file="tute2_crime.csv")

The line library(stargazer) loads the stargazer package.

The next (executable) line data=read.csv(file="tute2_crime.csv") creates a data frame named data.

Describing Data

The variables are

names(data)

[1] "stateid" "vio"     "rob"     "dens"    "avginc"

We are then asked what a “typical” state looks like. Use stargazer:

stargazer(data,
          summary.stat = c("n", "mean", "sd", "median", "p25", "p75", "min", "max"),
          covariate.labels = c("State", "Violent Rate", "Robbery Rate", "Pop. Density", "Ave. PC Income"),
          type="html", title="Descriptive Statistics")

**Descriptive Statistics**

Statistic	N	Mean	St. Dev.	Median	Pctl(25)	Pctl(75)	Min	Max

State	45	23.000	13.134	23	12	34	1	45
Violent Rate	45	431.484	209.541	382.800	275.500	570.000	66.900	854.000
Robbery Rate	45	106.656	64.193	100.900	75.300	152.500	8.800	240.800
Pop. Density	45	105.656	97.664	76.530	34.542	157.042	1.086	385.441
Ave. PC Income	45	15.816	1.937	15.797	13.919	17.114	12.370	20.273


Interpretation:

So we see a typical state has: 431 violent crimes per 100,000 people, 107 robberies per 100,000 people, a population density of 106 people per square mile, and an average per-capita income of about $15,820 per year.
The range of robbery and violent crime rates is remarkable.
Some states have only 67 violent crimes per 100,000 people per year, while others have up to 854 (!) violent crimes per 100,000 people per year.
That is more than a 10-fold difference between the lowest and highest violent crime rates across states.
Similarly, the robbery rate is as low as 9 robberies per 100,000 people per year and goes up to 241 (!) per 100,000 people per year.
We also have some very rural (1 person per square mile) and very urban (385 people per square mile) states.
And per capita income similarly ranges from $12,370 to $20,270.
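
The "how many times larger" comparison above can be checked with a couple of lines of arithmetic. This is only a sketch: the four numbers are typed in by hand from the Min and Max columns of the table, and the variable names are made up for illustration.

```r
# Values copied by hand from the Min and Max columns of the summary table
vio_min <- 66.9;  vio_max <- 854.0   # violent crimes per 100,000 people
rob_min <- 8.8;   rob_max <- 240.8   # robberies per 100,000 people

# Ratio of the highest to the lowest rate across states
vio_ratio <- vio_max / vio_min
rob_ratio <- rob_max / rob_min
round(c(violent = vio_ratio, robbery = rob_ratio), 1)

# With the data frame loaded, the same ratio could be computed directly, e.g.
# max(data$vio) / min(data$vio)
```

Both ratios are well above 10, so the spread in robbery rates is, in relative terms, even larger than the spread in violent crime rates.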

Probability Density Plots

Now, we are asked to produce probability density plots for these variables. This can be done using the plot(density()) command in R.

# create a new plotting window and set the plotting area into a 2*2 array
par(mfrow = c(2, 2))
# create density plots 
plot(density(data$vio))
plot(density(data$rob))
plot(density(data$avginc))
plot(density(data$dens))

Better looking Probability Density Plots:

You should also add a title, and label the X and Y axis (particularly when including graphs in an assignment). For example, use

## Graph density of vio, more nicely done
# pdf("fig_nice_density_rob.pdf")
plot(density(data$vio),
     main="Density of Violent Crime Rate",
     xlab="Violent Crimes per 100,000 People",
     ylab="Density",
     col="orange")

# dev.off()

Interpretation

# create a new plotting window and set the plotting area into a 1*2 array 
par(mfrow = c(1, 2)) 
# create density plots  
plot(density(data$rob),
     main="Density of Robbery Rate",
     xlab="Robberies per 100,000 People",
     ylab="Density",
     col="orange",
     lwd=2) 

plot(density(data$avginc),
     main="Per Capita Income (in $000's)",
     xlab="Per Capita Income",
     ylab="Density",
     col="orange",lwd=2)

The distributions of vio, rob and avginc appear roughly symmetric around their respective means.


plot(density(data$dens),
     main="Density of People Per Square Mile",
     xlab="People Per Square Mile",
     ylab="Density",
     col="orange", lwd=2) 

####  The following is optional: you do NOT need to include
####  these lines in your assignment code
# add the mean and median as dashed vertical lines
abline(v=mean(data$dens),col="red", lty=2)
abline(v=median(data$dens),col="blue", lty=2)
# and add a legend whose line types match the plot
# (solid for the density, dashed for the mean and the median)
legend(220, 0.005, legend=c("PDF", "mean of pop density", 
                            "median of pop density"),
       col=c("orange", "red","blue"), lty=c(1, 2, 2), cex=0.8,lwd=2)

The urban density variable is right-skewed: most US states have similar, relatively low densities, but a few states in the right tail of the distribution, such as New York and California, are very densely populated.
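
A quick numerical symptom of right skew is a mean that sits above the median, because the long right tail pulls the mean up. The sketch below checks this first with the two figures reported in the summary table, and then with a simulated sample; the simulated data are a hypothetical stand-in, not the tutorial dataset.

```r
# Mean and median of dens, copied by hand from the summary table
dens_mean   <- 105.656
dens_median <- 76.530
dens_gap    <- dens_mean - dens_median   # positive gap suggests right skew
dens_gap

# Same idea on a simulated right-skewed sample (hypothetical data)
set.seed(1)
x_sim <- rlnorm(1000, meanlog = 0, sdlog = 1)
c(mean = mean(x_sim), median = median(x_sim))

# With the data frame loaded, the check would be:
# mean(data$dens) > median(data$dens)
```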

Histograms

Histograms are closely related to probability density plots. For example, again looking at the dens variable, we have:

# produce a histogram
hist(data$dens, 
     col="green",
     border="black",
     prob = TRUE,
     xlab = "People per Square Mile",
     main = "Histogram and Density Plot for Urban Density")

# overlay the density plot 
lines(density(data$dens),
      lwd = 2,
      col = "chocolate3")
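
One way to see why the two plots are "closely related": with prob = TRUE the histogram bars are scaled so that their total area is 1, just like the area under a density curve. The sketch below checks this on simulated data; the sample `x` is a hypothetical stand-in for dens, and the Riemann sum for the density curve is only approximate.

```r
set.seed(123)                     # arbitrary seed, for reproducibility
x <- rexp(200, rate = 1 / 100)    # hypothetical right-skewed stand-in for dens

# Histogram: compute the bins without drawing them
h <- hist(x, plot = FALSE)
bar_area <- sum(h$density * diff(h$breaks))   # bar height times bar width
bar_area

# Kernel density estimate: approximate the area with a Riemann sum
d <- density(x)
curve_area <- sum(d$y) * diff(d$x[1:2])       # grid points are equally spaced
curve_area
```

Both areas should come out at (or very close to) 1, which is what allows the density curve to be overlaid on the histogram using the same vertical axis.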

Box-Plots

Another method we could use is a box-plot e.g.

## Box and whisker plot of dens, more nicely done
boxplot(data$dens,
        main="Box and Whisker Plot of Urban Density",
        ylab="Population per square mile",
        col="purple")
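
To read a box plot it can help to look at the numbers behind it. Base R's boxplot.stats() returns the five values that boxplot() draws by default (whisker ends, hinges and median) plus any points plotted individually as outliers. The sketch below uses simulated data as a hypothetical stand-in for dens.

```r
set.seed(42)                     # arbitrary seed, for reproducibility
x <- rexp(45, rate = 1 / 100)    # hypothetical stand-in for dens (45 "states")

bs <- boxplot.stats(x)           # default coef = 1.5, as in boxplot()
five <- bs$stats
names(five) <- c("lower whisker", "lower hinge", "median",
                 "upper hinge", "upper whisker")
five      # the five numbers drawn as the box and whiskers
bs$out    # observations beyond the whiskers, drawn as separate points

# With the data frame loaded: boxplot.stats(data$dens)
```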

Scatter Plots

Scatter plots are used to visualise the relationship between two variables.

Again, we have to be careful about formatting the plots correctly.

Let’s start by looking at the relationship (if any) between rob and vio.

plot(data$vio,data$rob, 
     main="Relationship Between Robbery Rate and Violent Crime Rate",
     xlab="Violent Crime Rate per 100,000 People",
     ylab="Robbery Rate per 100,000 People",
     col="blue",
     pch=16)

What do you think?

Positive/Negative/No Relationship?

Linear/Non-Linear?
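
A scatter plot can be complemented with the sample correlation coefficient, which summarises the direction and strength of a linear relationship in a single number. The sketch below uses simulated data that were deliberately built to have a positive relationship; the variables `vio_sim` and `rob_sim` are hypothetical and are not the tutorial data.

```r
set.seed(7)                                    # arbitrary seed
vio_sim <- runif(45, min = 70, max = 850)      # hypothetical violent crime rates
rob_sim <- 0.25 * vio_sim +                    # built to rise with vio_sim ...
           rnorm(45, mean = 0, sd = 30)        # ... plus random noise

r <- cor(vio_sim, rob_sim)   # sample correlation, always between -1 and 1
r

# With the data frame loaded: cor(data$vio, data$rob)
```

A value near +1 or -1 points to a strong linear relationship and a value near 0 to a weak one, but the correlation can miss non-linear patterns, so it does not replace looking at the plot.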

Other examples:

Interpretation

You have been asked to offer an economic explanation of why a relationship may exist.

Meaning

Economic explanations focus on the costs and benefits of a particular behaviour to explain empirical patterns

There may be multiple explanations; one is fine!
There is no single “correct” explanation, so as long as the one you come up with makes sense, go with that.
However, if the explanation does not make sense you would lose marks in an assignment.

Example

We see a positive relationship between robbery rates and urban density.
This could potentially reflect:

The cost of robbery being lower in more dense states as potential robbery targets are more plentiful in close proximity.
The benefit of robbery being higher if more dense locations attract more retail shops and merchants (called “agglomeration” benefits of urban density), which provides more opportunities and hence benefit for robbery.
The expected cost of robbery being lower because it is more difficult for police to identify potential robbers in more crowded places, so robbers are less likely to be caught.

Summation and Expectation Operators

Some useful rules for Summations

If $\text{a}$ and $\text{b}$ are constants and $X$ and $Y$ random variables then:

$\qquad \sum\limits_{i=1}^n \text{a}X_i = \text{a}\sum\limits_{i=1}^n X_i$
$\qquad \sum\limits_{i=1}^n (X_i+Y_i) = \sum\limits_{i=1}^n X_i+\sum\limits_{i=1}^n Y_i$
$\qquad \sum\limits_{i=1}^n \text{b} = n\text{b}$
$\qquad \overline{X} = \dfrac{\sum\limits_{i=1}^n X_i}{n} \Leftrightarrow \sum\limits_{i=1}^n X_i=n\overline {X}$
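
These rules are easy to sanity-check numerically. The sketch below draws two arbitrary vectors and confirms that each rule holds up to floating-point rounding; the constants and the sample size are arbitrary choices made for illustration.

```r
set.seed(1)          # arbitrary seed
n <- 10
X <- rnorm(n)
Y <- rnorm(n)
a <- 3; b <- 5       # arbitrary constants

rule1 <- all.equal(sum(a * X), a * sum(X))        # constant factors out of the sum
rule2 <- all.equal(sum(X + Y), sum(X) + sum(Y))   # sum of a sum splits in two
rule3 <- all.equal(sum(rep(b, n)), n * b)         # summing a constant n times
rule4 <- all.equal(sum(X), n * mean(X))           # sum equals n times the mean
c(rule1, rule2, rule3, rule4)
```

all.equal() is used instead of == because the two sides of each rule are computed in a different order and may differ by a tiny rounding error.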

Useful rules for Expectation and Variance operators can be found in the solutions (HTML version).

Also see Lecture 2 slides 23-28.

Exercises

Let’s go through an example of how these rules can be applied (see Part 2 Summation example 3 in the tutorial questions).

Show the following equality is true
\[\sum\limits_{i=1}^{n}\left(x_i - \bar{x} \right)^2 = \sum\limits_{i=1}^{n} x_i^2 - n\bar{x}^2 \]
\[\begin{align}
\displaystyle \sum\left( x_i - \overline{x}\right)^2 &= \sum \left( x_i^2 - 2\overline{x} x_i + \overline{x}^2 \right) \tag{1}\\
&= \displaystyle \sum x_i^2 -\sum\left( 2 \overline{x} x_i \right) + \sum\left( \overline{x}^2 \right) \tag{2}\\
&= \displaystyle \sum x_i^2 - 2 \overline{x} \sum x_i + n \overline{x}^2 \tag{3} \\
&= \displaystyle \sum x_i^2 - 2 \overline{x} n \overline{x}+n \overline{x}^2 \tag{4}\\
&= \displaystyle \sum x_i^2 - n \overline{x}^2
\end{align}\]
In line 3, you could also multiply the term $2 \overline{x} \sum x_i$ by $\dfrac{n}{n}$, which gives
\[ \sum x_i^2 - 2 n \overline{x} \frac{\sum x_i}{n} + n \overline{x}^2 \]
Since $\dfrac{\sum x_i}{n} = \overline{x}$, this gives the same result as above.
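
The identity can also be checked numerically on any small dataset. The numbers below are an arbitrary example chosen so the arithmetic is easy to follow by hand (the mean is 5).

```r
x <- c(2, 4, 4, 5, 7, 8)   # arbitrary example data, mean = 5
n <- length(x)

lhs <- sum((x - mean(x))^2)         # sum of squared deviations from the mean
rhs <- sum(x^2) - n * mean(x)^2     # sum of squares minus n times mean squared

c(lhs = lhs, rhs = rhs)
all.equal(lhs, rhs)
```

By hand: the squared deviations are 9, 1, 1, 0, 4 and 9, which sum to 24, while the sum of squares is 174 and $n\bar{x}^2 = 6 \times 25 = 150$, again giving 24.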

Linear Function of a Random Variable

In Part 2 Qn1, we have a random variable, $X$, that is i.i.d. from a $N(\mu_X,1)$ distribution, and another random variable, $Y$, defined as $Y=2+2X$.

It turns out that $Y \thicksim N(2+2\mu_X,4)$

How did we get this?

In general (using the Expectation rules outlined in Lecture 2), if a random variable $Y$ is a linear function of another random variable $X$, such that \[ Y = a + bX \] the mean of $Y$ is \[\mu_Y = a + b \mu_X \] and the variance of $Y$ is \[ \sigma_Y^2 = b^2 \sigma_X^2 \] A linear function of a normally distributed random variable is itself normally distributed, so $Y$ is normal with this mean and variance.

In this case $a=2$, $b=2$ and $\sigma_X^2=1$, so $\mu_Y = 2+2\mu_X$ and $\sigma_Y^2 = 2^2 \times 1=4$.
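
A small simulation can make the mean and variance rules concrete. The sketch below assumes $\mu_X = 5$ (an arbitrary choice for illustration), draws a large sample of $X$, builds $Y = 2 + 2X$, and compares the sample mean and variance of $Y$ with the theoretical values 12 and 4.

```r
set.seed(2025)                        # arbitrary seed
mu_X <- 5                             # assumed value, for illustration only
X <- rnorm(10000, mean = mu_X, sd = 1)
Y <- 2 + 2 * X

# Sample moments: close to 2 + 2 * mu_X = 12 and 2^2 * 1 = 4, not exactly equal
c(mean_Y = mean(Y), var_Y = var(Y))

# The rules hold exactly for the sample moments themselves
mean_rule <- all.equal(mean(Y), 2 + 2 * mean(X))
var_rule  <- all.equal(var(Y), 2^2 * var(X))
c(mean_rule, var_rule)
```

Note that the constant $a$ shifts the mean but drops out of the variance, while $b$ enters the variance squared.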

Graphically

If $\mu_X=2$, $5$ or $10$, then the distribution of $Y$ is $N(6,4)$, $N(12,4)$ and $N(22,4)$ respectively.
The following graph plots the distributions of $Y$, conditional on the three $\mu_X$ values.
Larger values of $\mu_X$ shift the distribution of $Y$ to the right, while its spread stays the same.
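
A graph of this kind could be drawn in R along the following lines. This is a sketch rather than the code behind the original figure: the colours, axis range and legend position are arbitrary choices. Note that dnorm() takes the standard deviation, so a variance of 4 is entered as sd = 2.

```r
mu_X <- c(2, 5, 10)
mu_Y <- 2 + 2 * mu_X            # means of Y: 6, 12, 22
sd_Y <- sqrt(4)                 # dnorm() expects the sd, not the variance
cols <- c("orange", "blue", "red")

# First curve sets up the plotting region
curve(dnorm(x, mean = mu_Y[1], sd = sd_Y), from = 0, to = 30,
      col = cols[1], lwd = 2,
      main = "Distribution of Y = 2 + 2X",
      xlab = "Y", ylab = "Density")

# Remaining curves are added to the same plot
for (j in 2:3) {
  curve(dnorm(x, mean = mu_Y[j], sd = sd_Y), add = TRUE,
        col = cols[j], lwd = 2)
}

legend("topright", legend = paste0("mu_X = ", mu_X),
       col = cols, lwd = 2, cex = 0.8)
```

The three curves have identical shapes and differ only in where they are centred.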

Probabilities - Contingency Tables

	High Grade	Medium Grade	Low Grade	Total
Study Hard	0.20	0.10	0.02	0.32
Sometimes	0.07	0.30	0.10	0.47
Never Study	0.01	0.05	0.15	0.21
Total	0.28	0.45	0.27	1.00

Study_Grade	High	Medium	Low	Total
Hard	joint	joint	joint	marginal
Sometimes	joint	joint	joint	marginal
Never	joint	joint	joint	marginal
Total	marginal	marginal	marginal	1.00

Marginal and Conditional Probabilities

The marginal distribution for studying is

P(Study Hard)= 0.32
P(Study Sometimes)= 0.47
P(Study Never) = 0.21

The marginal distribution for performance is

P(High Grade)= 0.28
P(Medium Grade)= 0.45
P(Low Grade) = 0.27

The probability distribution for performance, conditional on Studying Hard is

P(High Grade | Study Hard) = 0.20/0.32 = 0.625
P(Medium Grade | Study Hard) = 0.10/0.32 = 0.3125
P(Low Grade | Study Hard) = 0.02/0.32 = 0.0625
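
The same calculations can be done in R by storing the joint probabilities from the contingency table in a matrix. In this sketch the object names are made up for illustration: marginal probabilities are row and column sums, and conditional probabilities are joint probabilities divided by the relevant marginal.

```r
# Joint probabilities from the contingency table (rows: study, columns: grade)
joint <- matrix(c(0.20, 0.10, 0.02,
                  0.07, 0.30, 0.10,
                  0.01, 0.05, 0.15),
                nrow = 3, byrow = TRUE,
                dimnames = list(Study = c("Hard", "Sometimes", "Never"),
                                Grade = c("High", "Medium", "Low")))

p_study <- rowSums(joint)   # marginal distribution for studying
p_grade <- colSums(joint)   # marginal distribution for performance
p_study
p_grade

# Distribution of performance conditional on studying hard
cond_hard <- joint["Hard", ] / p_study["Hard"]
cond_hard
sum(cond_hard)              # a conditional distribution sums to 1

# All three conditional distributions at once (each row sums to 1)
prop.table(joint, margin = 1)
```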

Statistical Independence

If, for example, Studying and Performance were independent, then the joint probability P(Study Hard, High Grade) would equal the product of the respective marginal probabilities:

P(Study Hard) $\times$ P(High Grade)

Computing this product we get $0.32 \times 0.28=0.0896$.

This is not equal to the joint probability P(Study Hard, High Grade), which is, from the table, $0.20$.

Therefore, the random variables Studying and Performance are not independent.

You can pick any other joint probability and compute the product of the respective marginal probabilities; you will reach the same conclusion in this example.
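
To check every cell at once, the table of products of marginals (what the joint probabilities would be under independence) can be built with outer() and compared with the actual joint probabilities. As before, this is a sketch with object names chosen for illustration.

```r
# Joint probabilities from the contingency table (rows: study, columns: grade)
joint <- matrix(c(0.20, 0.10, 0.02,
                  0.07, 0.30, 0.10,
                  0.01, 0.05, 0.15),
                nrow = 3, byrow = TRUE,
                dimnames = list(Study = c("Hard", "Sometimes", "Never"),
                                Grade = c("High", "Medium", "Low")))

# Product of marginals for every (study, grade) pair
expected <- outer(rowSums(joint), colSums(joint))
round(expected, 4)

# Gap between the actual joint probability and the independence benchmark
gap <- joint - expected
round(gap, 4)
```

Independence would require every entry of `gap` to be zero. Here none of them is: for instance, the (Hard, High) cell is 0.20 against a benchmark of 0.0896, so studying hard and getting a high grade occur together far more often than independence would imply.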
