ECON30001/ECOM90001
Tutorial 1
Introduction to R and RStudio

Introductions

Your tutor will provide their contact details in this inital tutorial.

Please take a note of this email address and be sure to send through any queries you may have during the semester to this address.

Hopefully, you were able to meet a few of your classmates in this, your first tutorial. You don’t have to form groups within the same tutorial for assignments but if you are after an assignment group partner(s) your tutorial could be a good place to start.

Admin

The Subject Home Page in Canvas can be found here .

Dr, Andrew Clarke (Andy) will go through requirements to successfully complete this capstone subject in the first weeks’ lectures.

In terms of assessment, key dates are:

Assessment Overview
Task Indiv./Grp. Week Due Date Weight
Capstone Project Proposal Group 4 Sunday, March 30th 5%
Online Mid-semester test Individual 6 Tuesday, April 17th 10%
Capstone Project Progress Report Group 8 Monday, April 28th 10%
Capstone Project Final Report Group 8 Monday, May 26th 25%
End-of-semester exam Individual Final Exam Period 50%

We will be using R (and the integrated development environment (ide) for R, RStudio) exclusively in this subject. Details of how to download R and RStudio are contained in Section 6.10 of the Subject User Guide.

Once you have installed both, you will only need to use RStudio to run the required R scripts for tutorials each week.

The tutorial material for each week can be found under Modules \(\rightarrow\) Tutorials in the Subject Home Page e.g. the material for this tutorial can be found here .

You are expected to run the code provided in the R script file before attending your tutorial.

If you are unsure how the blocks of code (“chunks”) are set up then either/or:

  • have a look at the Pre-Tutorial videos that Andy will provde each week which contain a “line-by-line” description of how the code chunks work; e.g. see here for Tutorial 1.

  • make a note of what you are unsure about and ask about your tutor

  • use one of the consultation times

Objectives

Tutorial 1 reviews some basic operations using the R software package used in this subject. Specifically, this tutorial provides some practice in:

  • obtaining summary statistics for a dataset using the stargazer package in R

  • using the ggplot2 package to produce high-quality graphs in R

  • importing financial and economic time-series data using the quantmod package in R

The tutorial file used today is:

  • attend.csv

This file can be obtained from the Canvas subject page.

In addition, the R script file tut1.R provides the program code necessary to complete the tutorial. The R script file for this tutorial also uses the following packages which need to be installed prior to running the R script file:

  • ggplot2 - for creating graphs and plots in R

  • stargazer - to easily generate summary statistics for an R data file

  • zoo - for working with dates in R

  • rio - for easily importing data files into R

  • quantmod - R package for accessing financial/economic data

  • broom - for easily “tidying” up R data

  • tidyr - for easily “tidying” up R data

  • scales - used for formatting scales in ggplot2

These can be installed directly in RStudio from the packages tab or by using the command install.packages() and inserting the name of the package in the brackets.

Please feel free to play around with the code, particularly the plotting commands for ggplot2 and the table commands for stargazer.

Important Preliminary Task

We will be using the package stargazer to generate summary statistics and produce high-quality tables for presenting the results from estimated econometric models in R. Unfortunately, there are some issues with the current version of stargazer and the current version of R (Version 4.3.2 “Eye Holes”).

A recommended fix for this issue is:

  1. install stargazer directly in RSTUDIO from the packages tab or by using the command install.packages() and inserting the name of the package in the brackets

  2. using the packages window in RSTUDIO make sure the stargazer package is un-ticked

  3. open the run the R script file stargazer_fix_2024.R avail;able on the Canvas subject page

After completing these steps, the package stargazer should now be working with the current version of R.

Note

this fix should work well if you are running R in a PC-Windows environment. It will not work for users who are running R in a Mac-OS environment - see additional code in the R script file in this case.

Question 1

Download the data file attend.csv from the subject page.

This file contains data on attendance at Australian Football League (AFL) matches at the Melbourne Cricket Ground (MCG) for 317 matches (excluding the finals) between 2014 and 2019.
It also contains data on other variables that might determine attendance:

crowd Attendance at the M.C.G.
maxtemp Maximum temperature on the day in degrees Celsius
totmembers Sum of the total membership of the two clubs
victoria 1 if both are Victorian teams;0 otherwise
weekend 1 if if the match was held on Saturday or Sunday; 0 otherwise
night 1 if the match was held after 7 p.m.; 0 otherwise

Note that:

\[ \begin{align*}\texttt{victoria} & =\begin{cases}1 & \text{if both teams are from Victoria}\\0 & \text{if at least one team is not from Victoria}\\\end{cases}\\ \\\texttt{weekend} & =\begin{cases}1 & \text{if the match was held on a Saturday or a Sunday}\\0 & \text{if the match was not held on a Saturday or Sunday}\\\end{cases}\\ \\\texttt{night} & =\begin{cases}1 & \text{if the match was held after 7:00pm}\\0 & \text{if the match was not held after 7:00pm}\\\end{cases}\end{align*} \]

(a)

i.

examine the data to make sure it has been imported correctly

Solution

Tip

you only need to install a package one time. Once it is installed you can call it using the library() command

Load the required packages and import the data

# Required packages
library(ggplot2)    # Flexible graphic facility for R
library(stargazer)  # For producing good quality output from R
library(zoo)        # For working with dates in R
library(rio)        # for easily importing data files into R
library(quantmod)   # R package for accessing financial/economic data
library(broom)      # used for ggplot2 tidy
library(tidyr)      # used for ggplot2 spread
library(scales)     # used for formatting scales in ggplot2
# (a)(i) Import Data File (csv file) and Save as R Data File 
# attend <- read.csv("attend.csv")  
# or using the rio package simply 
attend <-import("attend.csv")

If you click on the dataframe attend in the Environment window (top right in R Studio) you will get a spreadsheet view of the data.
However, often be dealing with large data sets so to check that all the data has been imported you can use

# Required packages
# Print first 10 rows of the data frame
print(head(attend,10))
   round dayname day month year   time        home            away venue crowd
1      2     Thu  27     3 2014 7:45pm    Richmond         Carlton   MCG 62037
2      2     Sun  30     3 2014 3:20pm   Melbourne      West Coast   MCG 22226
3      3     Fri   4     4 2014 7:50pm    Hawthorn       Fremantle   MCG 43583
4      3     Sat   5     4 2014 7:40pm Collingwood         Geelong   MCG 63152
5      3     Sun   6     4 2014 7:10pm    Essendon         Carlton   MCG 62730
6      4     Fri  11     4 2014 7:50pm    Richmond     Collingwood   MCG 62100
7      4     Sat  12     4 2014 1:45pm     Carlton       Melbourne   MCG 37323
8      5     Sat  19     4 2014 1:40pm Collingwood North Melbourne   MCG 57116
9      5     Sun  20     4 2014 3:20pm   Melbourne      Gold Coast   MCG 17243
10     5     Mon  21     4 2014 3:20pm     Geelong        Hawthorn   MCG 80222
   victoria night weekend totmembers rainfall maxtemp
1         1     1       0     113607        5      26
2         0     0       1      94440        1      24
3         0     1       0     117427        0      19
4         1     1       1     123150        0      24
5         1     1       1     108111        0      24
6         1     1       0     145469       10      19
7         1     0       1      83396        0      22
8         1     0       1     118407        2      17
9         0     0       1      49389        0      19
10        1     0       0     112453        0      19
### extra - and for,say the last 4 rows
tail(attend,4)
    round dayname day month year   time        home     away venue crowd
314    23     Fri  23     8 2019 7:50pm Collingwood Essendon   MCG 85405
315    23     Fri  23     8 2019 7:50pm Collingwood Essendon   MCG 85405
316    23     Sun  25     8 2019 3:20pm    Richmond Brisbane   MCG 76995
317    23     Sun  25     8 2019 3:20pm    Richmond Brisbane   MCG 76995
    victoria night weekend totmembers rainfall maxtemp
314        1     1       0     169463        0      15
315        1     1       0     169463        0      15
316        0     0       1     131381        1      13
317        0     0       1     131381        1      13

ii.

restrict the sample to matched held on either a Saturday or a Sunday

Solution

# (a)(ii) Restrict sample to weekend games
attend_wend <- subset(attend, attend$weekend==1)

iii.

generate the following two variables

crowd_th Attendance at the M.C.G. in ’000s
totmembers_th Sum of the total membership of the two clubs in ’000s

Solution

# (a)(iii) Create new variables, divide attendance and total members by thousands
attend_wend$crowd_th = attend_wend$crowd/1000
attend_wend$totmembers_th = attend_wend$totmembers/1000

(b)

Construct a scatter plot if crowd_th against totmembers_th . You should ensure the variable crowd_th is on the vertical axis and totmembers_th is on the horizontal axis.

  • does there appear to be a positive relationship between attendance and (total) membership?
  • does it look linear?

Solution

The code used to produce this graph is below and is included in the R script file tut1.R on the subject home page.

A line-by-line description of how this code works is in the Subject Home Page here


# (b) Plot Membership and Attendance
# x axis: membership
# y axis: attendance
ggplot(attend_wend, aes(x=totmembers_th, y = crowd_th)) + 
geom_point(color="blue") +
labs(x = "Club Membership (in thousands)", y = "Attendance (In thousands)") +
  ggtitle("Club Membership and Attendance at AFL Games at the MCG 2014-2019")  +
  theme_classic() +
  scale_x_continuous(breaks = round(seq(0,240, by = 20),1)) +
  scale_y_continuous(breaks = round(seq(0,120, by = 20),1)) +
  theme(axis.text=element_text(size=8), axis.title=element_text(size=8),
  plot.title=element_text(size=10))
# landscape A4 size
ggsave("tut1_graph1.pdf", width = 297, height = 210, units = "mm")

It appears as if there is a positive relationship between attendance and the total membership of the two teams.
An examination of the plot suggests that the relationship appears to be approximately linear. However, further examination might be necessary to investigate this issue further.

(c)

Construct a scatter plot of crowd_th against maxtemp.
You should ensure that the variable crowd_th is on the vertical axis and the variable maxtemp is on the horizontal axis.
Does there appear to be a positive relationship between attendance and temperature?
Does it look linear?

Solution

# (c) Plot attendance and temperature
 # x axis: temperature
 # y axis: attendance
ggplot(attend_wend, aes(x=maxtemp, y = crowd_th)) + 
  geom_point(color="blue") +
  labs(x = "Temperature (in celsius)", y = "Attendance (In thousands)") +
  ggtitle("Temperature and Attendance at AFL Games at the MCG 2014-2019")  +
  scale_x_continuous(breaks = round(seq(0,40, by = 5),1)) +
  scale_y_continuous(breaks = round(seq(0,120, by = 20),1)) +
  theme_classic() +
  theme(axis.text=element_text(size=8), axis.title=element_text(size=8),
        plot.title=element_text(size=10))
ggsave("tut1_graph2.pdf", width = 297, height = 210, units = "mm")

It appears as if there is a positive relationship between attendance and the maximum temperature.
An examination of the plot suggests, if anything, a weak relationship between attendance and temperature. However, further examination might be necessary to investigate this issue further.

(d)

Compute some descriptive statistics for the data. Describe the data.

Solution

We will use the stargazer package to generate the descriptive statistics below by running this R code chunk. The statistics will be printed to an HTML file containing the following table

stargazer(attend_wend [c("crowd_th", "totmembers_th", "maxtemp", "rainfall", "victoria", "night")],  
          covariate.labels = c("Attendance ('000s)", "Total Members ('000s)", 
                               "Max Temp (cels).","Rainfall (mm)", "Victoria.", "Night"),
          type="html", digits=3, out = "tut1_sum1.html",omit.summary.stat = c("p25", "p75"))
# MAC users (without using stargazer fix)
table1df <- attend_wend[c("crowd_th", "totmembers_th", "maxtemp", "rainfall", "victoria", "night")]
stargazer(table1df,  
          covariate.labels = c("Attendance ('000s)", "Total Members ('000s)", 
                               "Max Temp (cels).","Rainfall (mm)", "Victoria.", "Night"),
          type="html", digits=3, out = "tut1_mac_sum1.html",omit.summary.stat = c("p25", "p75"))
Fig 3: Descriptive Statistics
Statistic N Mean St. Dev. Min Max
Attendance (’000s) 236 45.175 15.576 14.636 88.395
Total Members (’000s) 236 119.970 29.370 48.951 193.803
Max Temp (cels). 236 17.059 4.099 10 31
Rainfall (mm) 236 1.411 2.770 0 16
Victoria. 236 0.597 0.491 0 1
Night 236 0.136 0.343 0 1

Concentrate on the means and range (maximum-minimum)

Note that Victoria and Night are dummy variables taking the values [0,1] when interpreting the sample statistics for these variables.

Some descriptive statistics:

  • The mean attendance at weekend matches is 45,175 with a minimum of 14,636 and a maximum of 88,395.
  • The mean total membership of the two teams is 119,970 with a minimum of 48,951 and a maximum of 193,803.
  • The above two imply that, on average, approximately 37.6% of members attend weekend matches.
  • The mean maximum temperature is 17:06 degrees Celsius with a minimum of 10:0 degrees Celsius and a maximum of 31:0 degrees Celsius.
  • About 59.7% of weekend matches involves both teams from Victoria.
  • About 13.6% of weekend matches are played after 7:00pm.

(e)

Generate a variable representing the natural logarithm of crowd_th.
Compute some descriptive statistics for the natural logarithm of crowd_th.
How does the sample mean and sample variance differ to that for crowd_th computed in part (d)?

Solution

First create the variable attend_wend

# Generate new variable for (log) attendance
attend_wend$lncrowd_th <- log(attend_wend$crowd_th)

then use stargazer to generate a table of statistics stored in the HTML file tut1_sum2.html

stargazer(attend_wend [c("crowd_th", "lncrowd_th")],
          covariate.labels = c("Attendance", "(Log) Attendance"),
          type="html", digits=3, out = "tut1_sum2.html", omit.summary.stat = c("p25", "p75"))
Fig4: Descriptive Statistis - attend and lnattend
Statistic N Mean St. Dev. Min Max
Attendance 236 45.175 15.576 14.636 88.395
(Log) Attendance 236 3.749 0.360 2.683 4.482

The table of summary statistics for crowd_th and the natural log of crowd_th shows that the sample mean and variance of the natural logarithm of attendance is lower than that for the level of attendance.

Taking logs reduces the scale in which the variable is measured.
For example, an attendance of 50 is five times as large as an attendance of 10.
However, ln(50) = 3.912 and ln(10) = 2.3026 so the log of a value of 50 is only approximately 1.6 times larger than (log) attendance of 10.

(f)

Restrict the sample to only those matches in which both teams are from Victoria (Victoria==1). Compare the sample mean and sample variance to those for the full sample.

Solution

First restrict the sample

# (f) Only Victorian teams

victoria <- subset(attend_wend, attend_wend$victoria==1)

if using a MAC, first create a table, then use stargazer to report the statistics for Victorian teams stored in tut1_sum3.html

# PC Users
stargazer(victoria [c("crowd_th", "totmembers_th", "maxtemp", "rainfall", "night")],  
          covariate.labels = c("Attendance ('000s)", "Total Members ('000s)", 
                               "Max Temp (cels).","Rainfall (mm)", "Night"),
          type="html", digits=3, out = "tut1_sum3.html",omit.summary.stat = c("p25", "p75"))
# MAC users
table3df <- victoria [c("crowd_th", "totmembers_th", "maxtemp", "rainfall", "night")]
stargazer(table2df, 
          covariate.labels = c("Attendance", "(Log) Attendance"),
          type="html", digits=3, out = "tut1_mac_sum2.html", omit.summary.stat = c("p25", "p75"))
stargazer(attend_wend [c("crowd_th", "lncrowd_th")],title = "Fig 5: Descriptive Statistics All Teams (full sample)",           covariate.labels = c("Attendance", "(Log) Attendance"),           type="html", digits=3,  omit.summary.stat = c("p25", "p75"))   
stargazer(victoria  [c("crowd_th", "lncrowd_th")], title = "Fig 6: Descriptive Statistics Victorian Teams",           covariate.labels = c("Attendance", "(Log) Attendance"),           type="html", digits=3, omit.summary.stat = c("p25", "p75")) 
Fig 5: Descriptive Statistics All Teams (full sample)
Statistic
N
Mean
St. Dev.
Min
Max
Attendance
236
45.175
15.576
14.636
88.395
(Log) Attendance
236
3.749
0.360
2.683
4.482
Fig 6: Descriptive Statistics Victorian Teams
Statistic
N
Mean
St. Dev.
Min
Max
Attendance
141
51.393
13.810
22.945
88.395
(Log) Attendance
141
3.904
0.269
3.133
4.482

Figure 5 provides descriptive statistics for the full sample of weekend matches and Figure 6 provides descriptive statistics for the sample of weekend matches in which both teams are Victorian.
As expected, the mean temperature of these matches does not appear to be considerably different to that for the full sample.

As expected, the mean attendance for matches at the M.C.G. when both teams are from Victoria is slightly higher than in the full sample (45,175 for the full sample and 51,393 when both teams are from Victoria).

The mean of totmembers_th is slightly higher and the variance slightly lower than that for the full sample.

Question 2

The quantmod package (Quantitative Financial Modelling Framework) is a very useful package for data management and visualisation for financial and economic time-series data. It enables direct access in R to financial data from Yahoo Finance as well as economic data from FRED (Federal Reserve Economic Data), providing both Australian and international data.

(a)

Using the quantmod package, obtain data, from Yahoo Finance, on the daily stock price for the Westfarmers Limited (symbol Wes.AX) for the period 1 January, 2013 to 31 January, 2024.
Then using the ggplot2 package, construct a line-graph of the Adjusted Closing Price, with year on the horizontal axis.
Provide a brief discussion of the main features of the time-series over the sample period.

Solution

# Obtain data from Yahoo Finance using the quantmod package
# Wesfarmers Australia
# symbol = "WES.AX"
wes_stock <- getSymbols("WES.AX", from = "2013-1-1", to = "2024-1-31", src = "yahoo", auto.assign = FALSE)
tail(wes_stock)
##            WES.AX.Open WES.AX.High WES.AX.Low WES.AX.Close WES.AX.Volume
## 2024-01-22       58.31       58.39      57.95        57.95       1324253
## 2024-01-23       57.87       57.95      57.31        57.50       1299810
## 2024-01-24       58.03       58.70      57.45        58.45       2040139
## 2024-01-28       58.00       58.13      57.47        57.89       1430680
## 2024-01-29       58.27       58.32      57.72        57.91       1860202
## 2024-01-30       58.11       58.24      57.36        58.17       2669475
##            WES.AX.Adjusted
## 2024-01-22        55.58640
## 2024-01-23        55.15474
## 2024-01-24        56.06600
## 2024-01-28        55.52884
## 2024-01-29        55.54802
## 2024-01-30        55.79742
tail(wes_stock,10)
##            WES.AX.Open WES.AX.High WES.AX.Low WES.AX.Close WES.AX.Volume
## 2024-01-16       57.28       57.36      56.94        57.05       1241329
## 2024-01-17       56.95       57.11      56.70        57.11       1808529
## 2024-01-18       57.78       57.90      56.86        57.16       2478035
## 2024-01-21       57.41       58.28      57.32        58.10       1310907
## 2024-01-22       58.31       58.39      57.95        57.95       1324253
## 2024-01-23       57.87       57.95      57.31        57.50       1299810
## 2024-01-24       58.03       58.70      57.45        58.45       2040139
## 2024-01-28       58.00       58.13      57.47        57.89       1430680
## 2024-01-29       58.27       58.32      57.72        57.91       1860202
## 2024-01-30       58.11       58.24      57.36        58.17       2669475
##            WES.AX.Adjusted
## 2024-01-16        54.72310
## 2024-01-17        54.78065
## 2024-01-18        54.82862
## 2024-01-21        55.73027
## 2024-01-22        55.58640
## 2024-01-23        55.15474
## 2024-01-24        56.06600
## 2024-01-28        55.52884
## 2024-01-29        55.54802
## 2024-01-30        55.79742
# convert data to long form with a `date' variable`
wes_stock2 <- tidy(wes_stock)
# convert data to wide form with a `date' variable`
wes_stock3 <- spread(wes_stock2, series, value)
# basic plot
plot(wes_stock$WES.AX.Adjusted)

a better plot using ggplot2

# create plot
ggplot(wes_stock3, aes(y= WES.AX.Adjusted, x= index)) +
  geom_line(color="blue") +
  labs(x = "Year", y = "Adjusted Closing Price") +
  ggtitle("Wesfarmers, Adjusted Closing Price: 1 Jan 2013 - 31 Jan 2024")  +
  scale_y_continuous(breaks = round(seq(0,120, by = 10),0)) +
  scale_x_date(breaks = date_breaks("1 year"),labels = date_format("%Y")) +
  theme_classic() +
  theme(axis.text=element_text(size=8), axis.title=element_text(size=8))

(b)

Using the quantmod package, obtain data, from the FRED database, on the monthly Australian Seasonally Adjusted Unemployment Rate, All Persons Aged 15 and over (LRUNUNTTTTAUM156S).

Then, using the subset command in the zoo package, restrict the sample to the period January,2013 to November,2023. Using the ggplot2 package, construct a line-graph of the monthly unemployment rate rate , with year on the horizontal axis.

Provide a brief discussion of the main features of the time-series over the sample period.

Solution

# Obtain data from FRED using the quantmod package
# Unemployment Rate: Aged 15 and over: All Persons for Australia, Percent, Monthly, Seasonally Adjusted
# LRUNTTTTAUM156S
urate <- getSymbols("LRUNTTTTAUM156S", src= "FRED", auto.assign = FALSE)
# convert data to long form with a `date' variable`
urate2 <- tidy(urate)
# only keep a specific date range: January 2013 - November 2023
urate3 <-subset(urate2,urate2$index>="2013-01-01" & urate2$index<="2023-11-01")
# rename series to audusd
names(urate3)[names(urate3)=="value"] <- "urate"

then use ggplot2()

# create plot
ggplot(urate3, aes(y= urate, x= index)) +
  geom_line(color="blue") +
  labs(x = "Year", y = "Unemployment Rate") +
  ggtitle(" Unemployment Rate: Aged 15 and over: All Persons for Australia, Percent, Monthly, Seasonally Adjusted:
          January 2013 - November 2023")  +
  scale_y_continuous(breaks = round(seq(0,10, by = 0.5),1)) +
  scale_x_date(breaks = date_breaks("1 year"),labels = date_format("%Y")) +
    theme_classic() +
  theme(axis.text=element_text(size=8), axis.title=element_text(size=8))

The plot is provided in Figure 8 ABOVE.

In general, the unemployment rate exhibits a downward trend over the sample period. There is clear evidence of a considerable of an intimal increase in the unemployment rate associated with the lock downs for the onset of COVID19 in Australia.

Following this, there has been a sustained decrease in the unemployment rate such that for the period 2022-2023, the unemployment rate is below its pre-COVID levels.

There is clear evidence of a tightening of labour markets (lower unemployment) in the period 2022-2023.