# Required packages
library(ggplot2) # Flexible graphic facility for R
library(stargazer) # For producing good quality output from R
library(zoo) # For working with dates in R
library(rio) # for easily importing data files into R
library(quantmod) # R package for accessing financial/economic data
library(broom) # used for ggplot2 tidy
library(tidyr) # used for ggplot2 spread
library(scales) # used for formatting scales in ggplot2
# (a)(i) Import Data File (csv file) and Save as R Data File
# attend <- read.csv("attend.csv")
# or using the rio package simply
<-import("attend.csv") attend
Introductions
Your tutor will provide their contact details in this inital tutorial.
Please take a note of this email address and be sure to send through any queries you may have during the semester to this address.
Hopefully, you were able to meet a few of your classmates in this, your first tutorial. You don’t have to form groups within the same tutorial for assignments but if you are after an assignment group partner(s) your tutorial could be a good place to start.
Admin
The Subject Home Page in Canvas can be found here .
Dr, Andrew Clarke (Andy) will go through requirements to successfully complete this capstone subject in the first weeks’ lectures.
In terms of assessment, key dates are:
Task | Indiv./Grp. | Week | Due Date | Weight |
---|---|---|---|---|
Capstone Project Proposal | Group | 4 | Sunday, March 30th | 5% |
Online Mid-semester test | Individual | 6 | Tuesday, April 17th | 10% |
Capstone Project Progress Report | Group | 8 | Monday, April 28th | 10% |
Capstone Project Final Report | Group | 8 | Monday, May 26th | 25% |
End-of-semester exam | Individual | Final Exam Period | 50% |
We will be using R (and the integrated development environment (ide) for R, RStudio) exclusively in this subject. Details of how to download R and RStudio are contained in Section 6.10 of the Subject User Guide.
Once you have installed both, you will only need to use RStudio to run the required R scripts for tutorials each week.
The tutorial material for each week can be found under Modules \(\rightarrow\) Tutorials in the Subject Home Page e.g. the material for this tutorial can be found here .
You are expected to run the code provided in the R script file before attending your tutorial.
If you are unsure how the blocks of code (“chunks”) are set up then either/or:
have a look at the Pre-Tutorial videos that Andy will provde each week which contain a “line-by-line” description of how the code chunks work; e.g. see here for Tutorial 1.
make a note of what you are unsure about and ask about your tutor
use one of the consultation times
Objectives
Tutorial 1 reviews some basic operations using the R software package used in this subject. Specifically, this tutorial provides some practice in:
obtaining summary statistics for a dataset using the stargazer package in R
using the ggplot2 package to produce high-quality graphs in R
importing financial and economic time-series data using the quantmod package in R
The tutorial file used today is:
- attend.csv
This file can be obtained from the Canvas subject page.
In addition, the R script file tut1.R provides the program code necessary to complete the tutorial. The R script file for this tutorial also uses the following packages which need to be installed prior to running the R script file:
ggplot2 - for creating graphs and plots in R
stargazer - to easily generate summary statistics for an R data file
zoo - for working with dates in R
rio - for easily importing data files into R
quantmod - R package for accessing financial/economic data
broom - for easily “tidying” up R data
tidyr - for easily “tidying” up R data
scales - used for formatting scales in ggplot2
These can be installed directly in RStudio from the packages tab or by using the command install.packages() and inserting the name of the package in the brackets.
Please feel free to play around with the code, particularly the plotting commands for ggplot2 and the table commands for stargazer.
Important Preliminary Task
We will be using the package stargazer to generate summary statistics and produce high-quality tables for presenting the results from estimated econometric models in R. Unfortunately, there are some issues with the current version of stargazer and the current version of R (Version 4.3.2 “Eye Holes”).
A recommended fix for this issue is:
install stargazer directly in RSTUDIO from the packages tab or by using the command install.packages() and inserting the name of the package in the brackets
using the packages window in RSTUDIO make sure the stargazer package is un-ticked
open the run the R script file stargazer_fix_2024.R avail;able on the Canvas subject page
After completing these steps, the package stargazer should now be working with the current version of R.
this fix should work well if you are running R in a PC-Windows environment. It will not work for users who are running R in a Mac-OS environment - see additional code in the R script file in this case.
Question 1
Download the data file attend.csv from the subject page.
This file contains data on attendance at Australian Football League (AFL) matches at the Melbourne Cricket Ground (MCG) for 317 matches (excluding the finals) between 2014 and 2019.
It also contains data on other variables that might determine attendance:
crowd | Attendance at the M.C.G. |
maxtemp | Maximum temperature on the day in degrees Celsius |
totmembers | Sum of the total membership of the two clubs |
victoria | 1 if both are Victorian teams;0 otherwise |
weekend | 1 if if the match was held on Saturday or Sunday; 0 otherwise |
night | 1 if the match was held after 7 p.m.; 0 otherwise |
Note that:
\[ \begin{align*}\texttt{victoria} & =\begin{cases}1 & \text{if both teams are from Victoria}\\0 & \text{if at least one team is not from Victoria}\\\end{cases}\\ \\\texttt{weekend} & =\begin{cases}1 & \text{if the match was held on a Saturday or a Sunday}\\0 & \text{if the match was not held on a Saturday or Sunday}\\\end{cases}\\ \\\texttt{night} & =\begin{cases}1 & \text{if the match was held after 7:00pm}\\0 & \text{if the match was not held after 7:00pm}\\\end{cases}\end{align*} \]
(a)
i.
examine the data to make sure it has been imported correctly
Solution
you only need to install a package one time. Once it is installed you can call it using the library()
command
Load the required packages and import the data
If you click on the dataframe attend in the Environment window (top right in R Studio) you will get a spreadsheet view of the data.
However, often be dealing with large data sets so to check that all the data has been imported you can use
# Required packages
# Print first 10 rows of the data frame
print(head(attend,10))
round dayname day month year time home away venue crowd
1 2 Thu 27 3 2014 7:45pm Richmond Carlton MCG 62037
2 2 Sun 30 3 2014 3:20pm Melbourne West Coast MCG 22226
3 3 Fri 4 4 2014 7:50pm Hawthorn Fremantle MCG 43583
4 3 Sat 5 4 2014 7:40pm Collingwood Geelong MCG 63152
5 3 Sun 6 4 2014 7:10pm Essendon Carlton MCG 62730
6 4 Fri 11 4 2014 7:50pm Richmond Collingwood MCG 62100
7 4 Sat 12 4 2014 1:45pm Carlton Melbourne MCG 37323
8 5 Sat 19 4 2014 1:40pm Collingwood North Melbourne MCG 57116
9 5 Sun 20 4 2014 3:20pm Melbourne Gold Coast MCG 17243
10 5 Mon 21 4 2014 3:20pm Geelong Hawthorn MCG 80222
victoria night weekend totmembers rainfall maxtemp
1 1 1 0 113607 5 26
2 0 0 1 94440 1 24
3 0 1 0 117427 0 19
4 1 1 1 123150 0 24
5 1 1 1 108111 0 24
6 1 1 0 145469 10 19
7 1 0 1 83396 0 22
8 1 0 1 118407 2 17
9 0 0 1 49389 0 19
10 1 0 0 112453 0 19
### extra - and for,say the last 4 rows
tail(attend,4)
round dayname day month year time home away venue crowd
314 23 Fri 23 8 2019 7:50pm Collingwood Essendon MCG 85405
315 23 Fri 23 8 2019 7:50pm Collingwood Essendon MCG 85405
316 23 Sun 25 8 2019 3:20pm Richmond Brisbane MCG 76995
317 23 Sun 25 8 2019 3:20pm Richmond Brisbane MCG 76995
victoria night weekend totmembers rainfall maxtemp
314 1 1 0 169463 0 15
315 1 1 0 169463 0 15
316 0 0 1 131381 1 13
317 0 0 1 131381 1 13
ii.
restrict the sample to matched held on either a Saturday or a Sunday
Solution
# (a)(ii) Restrict sample to weekend games
<- subset(attend, attend$weekend==1) attend_wend
iii.
generate the following two variables
crowd_th | Attendance at the M.C.G. in ’000s |
totmembers_th | Sum of the total membership of the two clubs in ’000s |
Solution
# (a)(iii) Create new variables, divide attendance and total members by thousands
$crowd_th = attend_wend$crowd/1000
attend_wend$totmembers_th = attend_wend$totmembers/1000 attend_wend
(b)
Construct a scatter plot if crowd_th against totmembers_th . You should ensure the variable crowd_th is on the vertical axis and totmembers_th is on the horizontal axis.
- does there appear to be a positive relationship between attendance and (total) membership?
- does it look linear?
Solution
The code used to produce this graph is below and is included in the R script file tut1.R on the subject home page.
A line-by-line description of how this code works is in the Subject Home Page here
# (b) Plot Membership and Attendance
# x axis: membership
# y axis: attendance
ggplot(attend_wend, aes(x=totmembers_th, y = crowd_th)) +
geom_point(color="blue") +
labs(x = "Club Membership (in thousands)", y = "Attendance (In thousands)") +
ggtitle("Club Membership and Attendance at AFL Games at the MCG 2014-2019") +
theme_classic() +
scale_x_continuous(breaks = round(seq(0,240, by = 20),1)) +
scale_y_continuous(breaks = round(seq(0,120, by = 20),1)) +
theme(axis.text=element_text(size=8), axis.title=element_text(size=8),
plot.title=element_text(size=10))
# landscape A4 size
ggsave("tut1_graph1.pdf", width = 297, height = 210, units = "mm")
It appears as if there is a positive relationship between attendance and the total membership of the two teams.
An examination of the plot suggests that the relationship appears to be approximately linear. However, further examination might be necessary to investigate this issue further.
(c)
Construct a scatter plot of crowd_th against maxtemp.
You should ensure that the variable crowd_th is on the vertical axis and the variable maxtemp is on the horizontal axis.
Does there appear to be a positive relationship between attendance and temperature?
Does it look linear?
Solution
# (c) Plot attendance and temperature
# x axis: temperature
# y axis: attendance
ggplot(attend_wend, aes(x=maxtemp, y = crowd_th)) +
geom_point(color="blue") +
labs(x = "Temperature (in celsius)", y = "Attendance (In thousands)") +
ggtitle("Temperature and Attendance at AFL Games at the MCG 2014-2019") +
scale_x_continuous(breaks = round(seq(0,40, by = 5),1)) +
scale_y_continuous(breaks = round(seq(0,120, by = 20),1)) +
theme_classic() +
theme(axis.text=element_text(size=8), axis.title=element_text(size=8),
plot.title=element_text(size=10))
ggsave("tut1_graph2.pdf", width = 297, height = 210, units = "mm")
It appears as if there is a positive relationship between attendance and the maximum temperature.
An examination of the plot suggests, if anything, a weak relationship between attendance and temperature. However, further examination might be necessary to investigate this issue further.
(d)
Compute some descriptive statistics for the data. Describe the data.
Solution
We will use the stargazer package to generate the descriptive statistics below by running this R code chunk. The statistics will be printed to an HTML file containing the following table
stargazer(attend_wend [c("crowd_th", "totmembers_th", "maxtemp", "rainfall", "victoria", "night")],
covariate.labels = c("Attendance ('000s)", "Total Members ('000s)",
"Max Temp (cels).","Rainfall (mm)", "Victoria.", "Night"),
type="html", digits=3, out = "tut1_sum1.html",omit.summary.stat = c("p25", "p75"))
# MAC users (without using stargazer fix)
<- attend_wend[c("crowd_th", "totmembers_th", "maxtemp", "rainfall", "victoria", "night")]
table1df stargazer(table1df,
covariate.labels = c("Attendance ('000s)", "Total Members ('000s)",
"Max Temp (cels).","Rainfall (mm)", "Victoria.", "Night"),
type="html", digits=3, out = "tut1_mac_sum1.html",omit.summary.stat = c("p25", "p75"))
Statistic | N | Mean | St. Dev. | Min | Max |
Attendance (’000s) | 236 | 45.175 | 15.576 | 14.636 | 88.395 |
Total Members (’000s) | 236 | 119.970 | 29.370 | 48.951 | 193.803 |
Max Temp (cels). | 236 | 17.059 | 4.099 | 10 | 31 |
Rainfall (mm) | 236 | 1.411 | 2.770 | 0 | 16 |
Victoria. | 236 | 0.597 | 0.491 | 0 | 1 |
Night | 236 | 0.136 | 0.343 | 0 | 1 |
Concentrate on the means and range (maximum-minimum)
Note that Victoria and Night are dummy variables taking the values [0,1] when interpreting the sample statistics for these variables.
Some descriptive statistics:
- The mean attendance at weekend matches is 45,175 with a minimum of 14,636 and a maximum of 88,395.
- The mean total membership of the two teams is 119,970 with a minimum of 48,951 and a maximum of 193,803.
- The above two imply that, on average, approximately 37.6% of members attend weekend matches.
- The mean maximum temperature is 17:06 degrees Celsius with a minimum of 10:0 degrees Celsius and a maximum of 31:0 degrees Celsius.
- About 59.7% of weekend matches involves both teams from Victoria.
- About 13.6% of weekend matches are played after 7:00pm.
(e)
Generate a variable representing the natural logarithm of crowd_th.
Compute some descriptive statistics for the natural logarithm of crowd_th.
How does the sample mean and sample variance differ to that for crowd_th computed in part (d)?
Solution
First create the variable attend_wend
# Generate new variable for (log) attendance
$lncrowd_th <- log(attend_wend$crowd_th) attend_wend
then use stargazer to generate a table of statistics stored in the HTML file tut1_sum2.html
stargazer(attend_wend [c("crowd_th", "lncrowd_th")],
covariate.labels = c("Attendance", "(Log) Attendance"),
type="html", digits=3, out = "tut1_sum2.html", omit.summary.stat = c("p25", "p75"))
Statistic | N | Mean | St. Dev. | Min | Max |
Attendance | 236 | 45.175 | 15.576 | 14.636 | 88.395 |
(Log) Attendance | 236 | 3.749 | 0.360 | 2.683 | 4.482 |
The table of summary statistics for crowd_th and the natural log of crowd_th shows that the sample mean and variance of the natural logarithm of attendance is lower than that for the level of attendance.
Taking logs reduces the scale in which the variable is measured.
For example, an attendance of 50 is five times as large as an attendance of 10.
However, ln(50) = 3.912
and ln(10) = 2.3026
so the log of a value of 50 is only approximately 1.6 times larger than (log) attendance of 10.
(f)
Restrict the sample to only those matches in which both teams are from Victoria (Victoria==1). Compare the sample mean and sample variance to those for the full sample.
Solution
First restrict the sample
# (f) Only Victorian teams
<- subset(attend_wend, attend_wend$victoria==1) victoria
if using a MAC, first create a table, then use stargazer to report the statistics for Victorian teams stored in tut1_sum3.html
# PC Users
stargazer(victoria [c("crowd_th", "totmembers_th", "maxtemp", "rainfall", "night")],
covariate.labels = c("Attendance ('000s)", "Total Members ('000s)",
"Max Temp (cels).","Rainfall (mm)", "Night"),
type="html", digits=3, out = "tut1_sum3.html",omit.summary.stat = c("p25", "p75"))
# MAC users
<- victoria [c("crowd_th", "totmembers_th", "maxtemp", "rainfall", "night")]
table3df stargazer(table2df,
covariate.labels = c("Attendance", "(Log) Attendance"),
type="html", digits=3, out = "tut1_mac_sum2.html", omit.summary.stat = c("p25", "p75"))
stargazer(attend_wend [c("crowd_th", "lncrowd_th")],title = "Fig 5: Descriptive Statistics All Teams (full sample)", covariate.labels = c("Attendance", "(Log) Attendance"), type="html", digits=3, omit.summary.stat = c("p25", "p75"))
stargazer(victoria [c("crowd_th", "lncrowd_th")], title = "Fig 6: Descriptive Statistics Victorian Teams", covariate.labels = c("Attendance", "(Log) Attendance"), type="html", digits=3, omit.summary.stat = c("p25", "p75"))
|
|||||
Statistic
|
N
|
Mean
|
St. Dev.
|
Min
|
Max
|
|
|||||
Attendance
|
236
|
45.175
|
15.576
|
14.636
|
88.395
|
(Log) Attendance
|
236
|
3.749
|
0.360
|
2.683
|
4.482
|
|
|
|||||
Statistic
|
N
|
Mean
|
St. Dev.
|
Min
|
Max
|
|
|||||
Attendance
|
141
|
51.393
|
13.810
|
22.945
|
88.395
|
(Log) Attendance
|
141
|
3.904
|
0.269
|
3.133
|
4.482
|
|
Figure 5 provides descriptive statistics for the full sample of weekend matches and Figure 6 provides descriptive statistics for the sample of weekend matches in which both teams are Victorian.
As expected, the mean temperature of these matches does not appear to be considerably different to that for the full sample.
As expected, the mean attendance for matches at the M.C.G. when both teams are from Victoria is slightly higher than in the full sample (45,175 for the full sample and 51,393 when both teams are from Victoria).
The mean of totmembers_th is slightly higher and the variance slightly lower than that for the full sample.
Question 2
The quantmod package (Quantitative Financial Modelling Framework) is a very useful package for data management and visualisation for financial and economic time-series data. It enables direct access in R to financial data from Yahoo Finance as well as economic data from FRED (Federal Reserve Economic Data), providing both Australian and international data.
(a)
Using the quantmod package, obtain data, from Yahoo Finance, on the daily stock price for the Westfarmers Limited (symbol Wes.AX) for the period 1 January, 2013 to 31 January, 2024.
Then using the ggplot2 package, construct a line-graph of the Adjusted Closing Price, with year on the horizontal axis.
Provide a brief discussion of the main features of the time-series over the sample period.
Solution
# Obtain data from Yahoo Finance using the quantmod package
# Wesfarmers Australia
# symbol = "WES.AX"
<- getSymbols("WES.AX", from = "2013-1-1", to = "2024-1-31", src = "yahoo", auto.assign = FALSE)
wes_stock tail(wes_stock)
## WES.AX.Open WES.AX.High WES.AX.Low WES.AX.Close WES.AX.Volume
## 2024-01-22 58.31 58.39 57.95 57.95 1324253
## 2024-01-23 57.87 57.95 57.31 57.50 1299810
## 2024-01-24 58.03 58.70 57.45 58.45 2040139
## 2024-01-28 58.00 58.13 57.47 57.89 1430680
## 2024-01-29 58.27 58.32 57.72 57.91 1860202
## 2024-01-30 58.11 58.24 57.36 58.17 2669475
## WES.AX.Adjusted
## 2024-01-22 55.58640
## 2024-01-23 55.15474
## 2024-01-24 56.06600
## 2024-01-28 55.52884
## 2024-01-29 55.54802
## 2024-01-30 55.79742
tail(wes_stock,10)
## WES.AX.Open WES.AX.High WES.AX.Low WES.AX.Close WES.AX.Volume
## 2024-01-16 57.28 57.36 56.94 57.05 1241329
## 2024-01-17 56.95 57.11 56.70 57.11 1808529
## 2024-01-18 57.78 57.90 56.86 57.16 2478035
## 2024-01-21 57.41 58.28 57.32 58.10 1310907
## 2024-01-22 58.31 58.39 57.95 57.95 1324253
## 2024-01-23 57.87 57.95 57.31 57.50 1299810
## 2024-01-24 58.03 58.70 57.45 58.45 2040139
## 2024-01-28 58.00 58.13 57.47 57.89 1430680
## 2024-01-29 58.27 58.32 57.72 57.91 1860202
## 2024-01-30 58.11 58.24 57.36 58.17 2669475
## WES.AX.Adjusted
## 2024-01-16 54.72310
## 2024-01-17 54.78065
## 2024-01-18 54.82862
## 2024-01-21 55.73027
## 2024-01-22 55.58640
## 2024-01-23 55.15474
## 2024-01-24 56.06600
## 2024-01-28 55.52884
## 2024-01-29 55.54802
## 2024-01-30 55.79742
# convert data to long form with a `date' variable`
<- tidy(wes_stock)
wes_stock2 # convert data to wide form with a `date' variable`
<- spread(wes_stock2, series, value)
wes_stock3 # basic plot
plot(wes_stock$WES.AX.Adjusted)
a better plot using ggplot2
# create plot
ggplot(wes_stock3, aes(y= WES.AX.Adjusted, x= index)) +
geom_line(color="blue") +
labs(x = "Year", y = "Adjusted Closing Price") +
ggtitle("Wesfarmers, Adjusted Closing Price: 1 Jan 2013 - 31 Jan 2024") +
scale_y_continuous(breaks = round(seq(0,120, by = 10),0)) +
scale_x_date(breaks = date_breaks("1 year"),labels = date_format("%Y")) +
theme_classic() +
theme(axis.text=element_text(size=8), axis.title=element_text(size=8))
(b)
Using the quantmod package, obtain data, from the FRED database, on the monthly Australian Seasonally Adjusted Unemployment Rate, All Persons Aged 15 and over (LRUNUNTTTTAUM156S
).
Then, using the subset command in the zoo package, restrict the sample to the period January,2013 to November,2023. Using the ggplot2 package, construct a line-graph of the monthly unemployment rate rate , with year on the horizontal axis.
Provide a brief discussion of the main features of the time-series over the sample period.
Solution
# Obtain data from FRED using the quantmod package
# Unemployment Rate: Aged 15 and over: All Persons for Australia, Percent, Monthly, Seasonally Adjusted
# LRUNTTTTAUM156S
<- getSymbols("LRUNTTTTAUM156S", src= "FRED", auto.assign = FALSE)
urate # convert data to long form with a `date' variable`
<- tidy(urate)
urate2 # only keep a specific date range: January 2013 - November 2023
<-subset(urate2,urate2$index>="2013-01-01" & urate2$index<="2023-11-01")
urate3 # rename series to audusd
names(urate3)[names(urate3)=="value"] <- "urate"
then use ggplot2()
# create plot
ggplot(urate3, aes(y= urate, x= index)) +
geom_line(color="blue") +
labs(x = "Year", y = "Unemployment Rate") +
ggtitle(" Unemployment Rate: Aged 15 and over: All Persons for Australia, Percent, Monthly, Seasonally Adjusted:
January 2013 - November 2023") +
scale_y_continuous(breaks = round(seq(0,10, by = 0.5),1)) +
scale_x_date(breaks = date_breaks("1 year"),labels = date_format("%Y")) +
theme_classic() +
theme(axis.text=element_text(size=8), axis.title=element_text(size=8))
The plot is provided in Figure 8 ABOVE.
In general, the unemployment rate exhibits a downward trend over the sample period. There is clear evidence of a considerable of an intimal increase in the unemployment rate associated with the lock downs for the onset of COVID19 in Australia.
Following this, there has been a sustained decrease in the unemployment rate such that for the period 2022-2023, the unemployment rate is below its pre-COVID levels.
There is clear evidence of a tightening of labour markets (lower unemployment) in the period 2022-2023.