ECOM30001/ECOM90001
Tutorial 12
Panel Data Models

This tutorial reviews some concepts using the econometrics software package R. Specifically, the tutorial reviews:

  • estimation of linear panel data models by pooled OLS, Fixed Effects (FE) and Random Effects (RE)
  • cluster-robust standard errors for panel data
  • tests for choosing between the pooled OLS, FE and RE specifications

This tutorial requires one (1) data file: tut12_panel.csv

This file can be obtained from the Canvas subject page

In addition, the file, tut12.R , provides the program code (R script file) necessary to complete this tutorial. The R script file uses the following packages which need to be installed prior to running this program:

stargazer: for easily generating summary statistics and regression tables in R
ggplot2: for easily producing graphs in R
car: for easily conducting hypothesis tests in R
lmtest: for easily conducting hypothesis tests on estimated coefficients in R
sandwich: for easily calculating robust standard errors in R
rio: for easily importing data into R
plm: for easily estimating linear panel data models in R
gplots: for easily plotting group means (plotmeans) in R

These can be installed directly in RStudio from the packages tab or by using the command install.packages() and inserting the name of the package in the brackets.
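For example, the packages can be installed once and then loaded at the start of each session (a minimal sketch; only install the packages that are not already on your machine):

Code
# install the required packages (only needed once per machine)
install.packages(c("stargazer", "ggplot2", "car", "lmtest",
                   "sandwich", "rio", "plm", "gplots"))
# load the packages for the current session
library(stargazer)
library(ggplot2)
library(car)
library(lmtest)
library(sandwich)
library(rio)
library(plm)
library(gplots)    # used later for the plotmeans() group-means plots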

Question

Consider the following econometric model for the price of airline seats on routes in a large country: \[\begin{align} \ln\,\mbox{fare}_{it} & = \beta_{0} + \beta_{1}\,\mbox{concen}_{it} + \beta_{2}\,\ln\,\mbox{dist}_{i} + \beta_{3}\left\{\ln\,\mbox{dist}_{i}\right\}^{2} \notag \\ & + \beta_{4}\,\mbox{YEAR1998}_{t} + \beta_{5}\,\mbox{YEAR1999}_{t} + \beta_{6}\,\mbox{YEAR2000}_{t} + \varepsilon_{it} \tag{1} \end{align}\] with the following assumptions

\[\begin{align*} E[\varepsilon_{it}|\mathbf{X}_{it}] = 0 & \quad \text{(zero conditional mean)} \\ \mathop{\mathrm{VAR}}[\varepsilon_{it}|\mathbf{X}_{it}] = \sigma^{2} & \quad \text{(homoskedasticity)} \\ \mathop{\mathrm{COV}}[\varepsilon_{it},\,\varepsilon_{is}|\mathbf{X}] = 0 \text{ for } t \ne s &\quad \text{(uncorrelated errors)} \\ \mathop{\mathrm{COV}}[\varepsilon_{it},\,\mathbf{X}_{it}]= 0 & \quad \text{(strict exogeneity)} \end{align*}\]

where: \[\begin{align*} \text{fare}_{it} = & \mbox{ average fare on route $i$ in period $t$, in dollars} \\ \text{dist}_{i} = & \mbox{ average distance of route $i$, in miles} \\ \text{concen}_{it} = & \mbox{ market concentration, measured by the market share of the largest carrier} \\ & \mbox{ on route $i$ at time $t$} \\ \text{YEAR1998}_{t} = & \mbox{ 1 if year=1998, 0 otherwise} \\ \text{YEAR1999}_{t} = & \mbox{ 1 if year=1999, 0 otherwise} \\ \text{YEAR2000}_{t} = & \mbox{ 1 if year=2000, 0 otherwise} \\ \end{align*}\] and \(\ln(X)\) denotes the natural logarithm of variable \(X\).

The data file tut12_panel.csv contains observations on 1,149 airline routes for the period 1997 - 2000.

(a)

Estimate the econometric model (1) by Ordinary Least Squares (OLS).

Solution

Code
tut12_panel <- import("tut12_panel.csv")
tut12_pd <- pdata.frame(tut12_panel, index=c("id", "year"))
pdim(tut12_pd)
Balanced Panel: n = 1149, T = 4, N = 4596
Code
#----------------------------------------------
# Part (a) and (b)
# Estimate pooled OLS model
# note: the required (log) distance variables are already in the data frame
pool <- plm(lnfare ~ concen + lndist + lndistsq 
            + year1998 + year1999 + year2000, 
            model="pooling", data = tut12_pd )
# print(summary(pool))
# wald_pool <-pwaldtest(pool)
# print(wald_pool)
wald_pool <-pwaldtest(pool, test = "F")
fstat1 <-round(wald_pool[[2]][1], digits=4)
pvalf1 <-round(wald_pool[[4]][1], digits=4)
numdf1 <-wald_pool[[3]][1]
demdf1 <-wald_pool[[3]][2]
#-------------------------------------------------
# Cluster-Robust Variance Estimator
# (1) Use finite-sample degrees-of-freedom adjustment:
# dfa = (N/(N-1)) * (NT-1)/(NT-K-1)
N <- length(unique(tut12_pd$id))                   # cross section size: unique ID
# print(N)
NT <- length(tut12_pd$id)                          # NT: (balanced) panel dimension
# print(NT)
# print(pool$df.residual)                            #NT-K-1
dfa <- (N/(N - 1)) * (NT - 1)/pool$df.residual    # df adjustment  
# display with cluster VCE and df-adjustment
pool_crse_vcov <- dfa * vcovHC(pool, type = "HC0", cluster = "group")
pool_crse_r <-coeftest(pool, vcov = pool_crse_vcov, df=Inf)
# print(pool_crse_r)
# cluster-robust standard errors for pooled ols model
pool_crse <- sqrt(diag(pool_crse_vcov))
# Note: pwaldtest reports chi-squared, not F by default
# wald_pool_crse <- pwaldtest(pool, vcov = pool_crse_vcov)
# print(wald_pool_crse)
wald_pool_crse <- pwaldtest(pool, vcov = pool_crse_vcov, test= "F")
# print(wald_pool_crse)
fstat2 <-round(wald_pool_crse[[2]][1], digits=4)
pvalf2 <-round(wald_pool_crse[[4]][1], digits=4)
numdf2 <-wald_pool_crse[[3]][1]
demdf2 <-wald_pool_crse[[3]][2]
#------------------------------------------------------
Code
stargazer(pool,pool, type = "html", dep.var.labels=c("(Log) Fare"),
          column.labels = c( "Pool OLS", "Pool OLS (Cluster-Robust)"),
          covariate.labels =c("Intercept", "Market Concentration", 
                              "(Log) Distance", "(Log) Distance Squared",
                              "Year=1998", "Year=1999", "Year=2000"),
          se        = list(NULL, pool_crse),
          omit.stat = c("f","adj.rsq"),
          add.lines = list(c("F Statistic", fstat1, fstat2),
                           c("F p value", pvalf1, pvalf2),
                           c("F num df", numdf1, numdf2),
                           c("F dem df", demdf1,demdf2)),
          digits=4, align=TRUE,
          intercept.bottom=FALSE,
          star.cutoffs = c(0.05, 0.01, 0.001))
                          Dependent variable: (Log) Fare
                          -----------------------------------------------
                          Pool OLS            Pool OLS (Cluster-Robust)
                          (1)                 (2)
-------------------------------------------------------------------------
Intercept                 6.6638***           6.6638***
                          (0.4836)            (1.0454)
Market Concentration      0.3601***           0.3601***
                          (0.0301)            (0.0586)
(Log) Distance            -1.0002***          -1.0002***
                          (0.1375)            (0.2912)
(Log) Distance Squared    0.1031***           0.1031***
                          (0.0097)            (0.0202)
Year=1998                 0.0211              0.0211***
                          (0.0140)            (0.0041)
Year=1999                 0.0378**            0.0378***
                          (0.0140)            (0.0052)
Year=2000                 0.0999***           0.0999***
                          (0.0140)            (0.0056)
-------------------------------------------------------------------------
F Statistic               523.1228            205.6331
F p value                 0                   0
F num df                  6                   6
F dem df                  4589                1148
Observations              4,596               4,596
R2                        0.4062              0.4062
-------------------------------------------------------------------------
Note: *p<0.05; **p<0.01; ***p<0.001

Figure 1: Estimation Results Model (1)


Pooled OLS

consistent estimator
incorrect standard errors

The estimation results are reported in Figure 1. Under the assumptions as written, the OLS estimator for model (1) is a consistent estimator of the population parameters. However, the sample consists of repeated observations on airline routes, so the assumptions about the error terms outlined above may not be appropriate. It is likely that the errors are correlated over time within each route, in which case the error \(\varepsilon_{it}\) might be characterised by both heteroskedasticity and autocorrelation within routes.
In that case the OLS estimator will still be consistent, but the usual standard errors will be wrong, and inferences based upon t-tests and F-tests could be misleading.

(b)

Do you think that the standard errors are valid? Clearly explain why or why not?

Solution
The estimation results are reported in Figure 1. The following table summarizes the results.

                Pooled OLS (ordinary s.e.)        Pooled OLS (cluster s.e.)
Variable        coef.        std. error           coef.        std. error
CONCEN          0.3601       0.0301               0.3601       0.0586
LDIST           -1.0002      0.1375               -1.0002      0.2912
LDISTSQ         0.1031       0.0097               0.1031       0.0202
Y98             0.0211       0.0140               0.0211       0.0041
Y99             0.0378       0.0140               0.0378       0.0052
Y00             0.0999       0.0140               0.0999       0.0056

As expected, the (ordinary) OLS standard errors are too low in the presence of errors that are correlated over time, for each route. Note that, for reasons outside the scope of this subject, care must be exercised when using the cluster-robust standard errors on indicator variables (such as the time indicator variables used here). In this example, the cluster-robust standard errors are lower than the (ordinary) OLS standard errors for the time indicator variables.
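To see the difference directly, the two sets of standard errors can be placed side by side from the objects created in part (a) (a minimal sketch; pool and pool_crse_vcov are defined in the code above):

Code
# compare ordinary OLS and cluster-robust standard errors for the pooled model
ols_se   <- sqrt(diag(vcov(pool)))         # ordinary (default) standard errors
clust_se <- sqrt(diag(pool_crse_vcov))     # cluster-robust standard errors
round(cbind(estimate = coef(pool), ols_se, clust_se), 4)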

(c)

Consider the following alternative econometric model: \[\begin{align} \ln\,\mbox{fare}_{it} & = \beta_{1}\,\mbox{concen}_{it} + \beta_{2}\,\ln\,\mbox{dist}_{i} + \beta_{3}\left\{\ln\,\mbox{dist}_{i}\right\}^{2} \notag \\ & + \beta_{4}\,\mbox{YEAR1998}_{t} + \beta_{5}\,\mbox{YEAR1999}_{t} + \beta_{6}\,\mbox{YEAR2000}_{t} + \upsilon_{i} + \varepsilon_{it} \tag{2} \end{align}\] where \(\upsilon_{i}\) represents an unobserved time-invariant random variable and:

\[\begin{align*} E[\varepsilon_{it}|\mathbf{X}_{it},\upsilon_{i}] =0 & \quad \text{(zero conditional mean)} \\ \mathop{\mathrm{VAR}}[\varepsilon_{it}|\mathbf{X}_{it},\upsilon_{i}] = \sigma^{2} & \quad \text{(homoskedasticity)} \\ \mathop{\mathrm{COV}}[\varepsilon_{it},\,\varepsilon_{is}|\mathbf{X},\upsilon_{i}] = 0 \text{ for } t \ne s & \quad \text{(uncorrelated errors, within groups)} \\ \mathop{\mathrm{COV}}[\varepsilon_{it},\,\varepsilon_{js}|\mathbf{X},\upsilon_{i}] = 0 \text{ for }i \ne j & \quad \text{(uncorrelated errors, across groups)} \\ \mathop{\mathrm{COV}}[\varepsilon_{it},\,\mathbf{X}_{it}|\upsilon_{i}] = 0 & \quad \text{(strict exogeneity)} \\ \end{align*}\]

(i)


Clearly explain, and provide some examples, whether you think that \(\upsilon_{i}\) is correlated with \(\mbox{concen}_{it}\) or not.

Solution

\(\mathbf{\text{COV}[\upsilon_{i},X_{it}] \neq 0}\)
for the pooled OLS model

  • population
  • demographics
  • road and rail alternatives

Factors about the cities near the two airports served by the routes could affect demand for air travel. These factors might include population, average education, the characteristics of employers (size, industry etc. \(\ldots\)), and the quality and feasibility of alternatives such as freeways and rail, just to name a few. Although these factors might be time-varying, over a short period (such as the four years in this sample) these might be approximately constant. Importantly, all these time-invariant factors might reasonably be correlated with \(\mbox{concen}_{it}\), the degree of competition on the route, leading to \(\mathop{\mathrm{COV}}(\mbox{concen}_{it},\upsilon_{i}) \ne 0\).

(ii)

Suppose you attempt to estimate this econometric model using the Fixed Effects (FE) estimator. Clearly explain why your estimated model cannot include the distance variables.

Solution
The variables \(\text{lndist}\) and \(\text{lndistsq}\) only vary across routes, not over time within routes. The “within transformation” eliminates the fixed effect by subtracting individual-specific means from each variable. For example: \(\mbox{lndist}^{*}_{i} = \mbox{lndist}_{i} - \frac{\sum_{t}\,\mbox{lndist}_{i}}{T} = \mbox{lndist}_{i} - \frac{T\,\mbox{lndist}_{i}}{T} = 0\). The variable \(\text{lndistsq}\) is transformed in the same way. Consequently, in a fixed effects specification, the coefficients on time-invariant explanatory variables cannot be identified. The fixed effects (FE) estimator only identifies the coefficients using the within-route variation over time.
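A minimal sketch of the within transformation applied by hand illustrates this point (the demeaned columns created here are for illustration only; ave() computes route-specific means, as in the part (d) code below):

Code
# within transformation by hand: subtract route-specific means
within_lndist <- tut12_pd$lndist - ave(tut12_pd$lndist, tut12_pd$id)
within_concen <- tut12_pd$concen - ave(tut12_pd$concen, tut12_pd$id)
summary(within_lndist)   # essentially zero: lndist is time invariant
summary(within_concen)   # varies within routes, so concen is identified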

Fixed Effects Model
distance variables are time invariant
so are omitted

concentration and log(fare)
do vary over time
so are included

Code
par(mfrow=c(2,2))
library(gplots)
plotmeans(lndist~year, main="Heterogeneity across years",data=tut12_pd)
plotmeans(lndistsq~year, main="Heterogeneity across years",data=tut12_pd)
plotmeans(lnfare~year, main="Heterogeneity across years",data=tut12_pd)
plotmeans(concen~year, main="Heterogeneity across years",data=tut12_pd)

(iii)

Estimate model (2) using the Fixed Effects (FE) estimator with cluster-robust standard errors. Compare and contrast the estimate for \(\beta_{1}\) to that obtained for the pooled OLS model in part b).

Solution

Code
#-------------------------------------------
# Question  (c)(i)-(iv) 
# Estimate Fixed Effects (FE) model
#--------------------------------------
# fixed <- plm(lnfare ~ concen + lndist + lndistsq +year1998 + year1999 + year2000, 
 #            model="within", data = tut12_pd )
# print(summary(fixed))
fixed <- plm(lnfare ~ concen + year1998 + year1999 + year2000, 
             model="within", data = tut12_pd )
# print(summary(fixed))
wald_fixed <-pwaldtest(fixed, test="F")
#  print(wald_fixed)
fstat3 <-round(wald_fixed[[2]][1], digits=4)
pvalf3 <-round(wald_fixed[[4]][1], digits=4)
numdf3 <-wald_fixed[[3]][1]
demdf3 <-wald_fixed[[3]][2]
#------------------------------------------
# Cluster-Robust Variance Estimator
# (1) Use finite-sample degrees-of-freedom adjustment (dfa computed in part (a)):
# dfa = (N/(N-1)) * (NT-1)/(NT-K-1)
# display with cluster VCE and df-adjustment
fixed_crse_vcov <- dfa * vcovHC(fixed, type = "HC0", cluster = "group")
fixed_crse_r <-coeftest(fixed, vcov = fixed_crse_vcov)
# print(fixed_crse_r)
# cluster-robust standard errors for fixed effects model
fixed_crse <- sqrt(diag(fixed_crse_vcov))
#---------------------------------------------
# Note: pwaldtest reports chi-squared, not F by default
wald_fixed_crse <- pwaldtest(fixed, vcov = fixed_crse_vcov, test= "F")
# print(wald_fixed_crse)
fstat4 <-round(wald_fixed_crse[[2]][1], digits=4)
pvalf4 <-round(wald_fixed_crse[[4]][1], digits=4)
numdf4 <-wald_fixed_crse[[3]][1]
demdf4 <-wald_fixed_crse[[3]][2]
Code
# compare FE estimator with cluster-robust std. errors
stargazer(fixed,fixed, type = "html", dep.var.labels=c("(Log) Fare"),
          column.labels = c( "Within", "Within (Cluster-Robust)"),
          covariate.labels =c("Market Concentration", "Year=1998", 
                              "Year=1999", "Year=2000"),
          se        = list(NULL, fixed_crse),
          omit.stat = c("f","adj.rsq"),
          add.lines = list(c("F Statistic", fstat3, fstat4),
                           c("F p value", pvalf3, pvalf4),
                           c("F num df", numdf3, numdf4),
                           c("F dem df", demdf3,demdf4)),
          digits=4, align=TRUE,
          intercept.bottom=FALSE,
          star.cutoffs = c(0.05, 0.01, 0.001))
                          Dependent variable: (Log) Fare
                          -----------------------------------------------
                          Within              Within (Cluster-Robust)
                          (1)                 (2)
-------------------------------------------------------------------------
Market Concentration      0.1689***           0.1689***
                          (0.0294)            (0.0495)
Year=1998                 0.0228***           0.0228***
                          (0.0045)            (0.0042)
Year=1999                 0.0364***           0.0364***
                          (0.0044)            (0.0051)
Year=2000                 0.0978***           0.0978***
                          (0.0045)            (0.0055)
-------------------------------------------------------------------------
F Statistic               134.6105            120.004
F p value                 0                   0
F num df                  4                   4
F dem df                  3443                1148
Observations              4,596               4,596
R2                        0.1352              0.1352
-------------------------------------------------------------------------
Note: *p<0.05; **p<0.01; ***p<0.001

Figure 2: Fixed Effects (Within transformation) Results Model (2)


The estimation results are reported in Figure 2. The OLS results (with cluster-robust standard errors) are compared to the fixed effects estimates in the table below.

Pooled OLS / FE comparison

Relative to FE
Pooled OLS suffers
from Omitted Variable Bias
In FE, \(\upsilon_{i}\) in the error term drops out using the “within transformation”

                Pooled OLS (cluster s.e.)         Fixed Effects (cluster s.e.)
Variable        coef.        std. error           coef.        std. error
CONCEN          0.3601       0.0586               0.1689       0.0495

Note the model degrees of freedom for the FE model (without cluster-robust standard errors). The residual degrees of freedom are \(\left\{NT - (K+1) - (N-1)\right\} = (NT - K - N) = (4,596 - 4 - 1,149) = 3,443\). The degrees of freedom for the FE model with cluster-robust standard errors are given by \(N-1=1,148\).
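These degrees of freedom can be checked directly from the estimated objects (a quick sketch using objects already created above):

Code
# residual degrees of freedom for the within (FE) model: NT - K - N = 3,443
df.residual(fixed)
# number of clusters (routes) minus one: N - 1 = 1,148
length(unique(tut12_pd$id)) - 1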

Relative to the FE model, the time-invariant unobserved effect \(\upsilon_{i}\) represents an omitted variable in the pooled OLS model. The pooled OLS estimate of 0.3601 is greater than the FE estimate (which controls for \(\upsilon_{i}\)) of 0.1689.

Provided routes with higher values of \(\upsilon_{i}\) tend to have higher fares, this result would be consistent with omitted variable bias in the pooled OLS model in which \(\upsilon_{i}\) is positively correlated with \(\text{concen}_{it}\): routes with higher values of \(\upsilon_{i}\) also tend to have less competition (a larger market share for the largest carrier).

(iv)

At the 5% level of significance, test the hypothesis that the pooled OLS model is the appropriate model, against the alternative that the Fixed Effects Model is the most appropriate. Your answer should clearly state the null and alternative hypothesis, the distribution of the test statistic, and your conclusion.

Test FE vs Pooled OLS

Restricted Model: Pooled OLS
Unrestricted: FE (Within) Model
not reject \(H_0: \rightarrow\) Pooled OLS
reject \(H_0: \rightarrow\) FE model

Solution
The null hypothesis is: \[ H_{0} : \mbox{'fixed effects' are jointly } = 0 \qquad H_{A} : \mbox{at least one 'fixed effect' } \ne 0 \] The test statistic follows an F-distribution with \(M\) numerator degrees of freedom and \((NT-K-N)\) denominator degrees of freedom. Here the number of restrictions is \(M = (N-1) = 1,148\), where \(N\) represents the number of cross-sectional units (air routes), and the denominator degrees of freedom are \((NT-K-N) = (4,596 - 4 - 1,149) = 3,443\). The 5% F critical value is only slightly greater than one (\(F_{c} \thickapprox 1.08\)). The decision rule: reject \(H_{0}\) if the sample value of the F test statistic exceeds the critical value \(F_{c}\). Alternatively, reject \(H_{0}\) if the p-value for the sample value of the test statistic is less than \(\alpha = 0.05\).

The restricted model is: \[\begin{align*} \ln\,\mbox{fare}_{it} & = \beta_{0} + \beta_{1}\,\mbox{concen}_{it} + \beta_{2}\,\mbox{YEAR1998}_{t} \\ & + \beta_{3}\,\mbox{YEAR1999}_{t} + \beta_{4}\,\mbox{YEAR2000}_{t} + \varepsilon_{it} \end{align*}\] The unrestricted model is the Fixed Effects model: \[\begin{align*} \ln\,\mbox{fare}_{it} & = \beta_{0i} + \beta_{1}\,\mbox{concen}_{it} + \beta_{2}\,\mbox{YEAR1998}_{t} \\ & + \beta_{3}\,\mbox{YEAR1999}_{t} + \beta_{4}\,\mbox{YEAR2000}_{t} + \varepsilon_{it} \end{align*}\]
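For reference, in terms of the restricted and unrestricted sums of squared residuals (\(SSR_{R}\) and \(SSR_{U}\)), the test statistic has the standard form:

\[ F = \frac{\left(SSR_{R} - SSR_{U}\right)/(N-1)}{SSR_{U}/(NT-K-N)} \sim F_{(N-1),\,(NT-K-N)} \quad \text{under } H_{0} \]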

The R output from the F test is given in Figure 3.
The output reports a sample F-statistic of \(F=60.521\) with a p-value of \(0.0000\). Since the p-value is less than the desired level of significance we reject the null hypothesis. The sample evidence is consistent with the Fixed Effects model, rather than the pooled OLS model.

Code
#------------------------------------------
# Which is the preferred model? 
# Do we need to include individual specific intercepts?
# Pooled OLS or Fixed Effects model?
# Ftest: unrestricted Model: Fixed Effects Model
# Ftest: restricted Model:   Pooled OLS
# To compare pooled OLS with FE we need to estimate the pooled OLS model
# with only the time-varying X variables,
# so re-estimate the pooled OLS model without the distance variables.
pool1 <- plm(lnfare ~ concen + year1998 + year1999 + year2000, 
             model="pooling", data = tut12_pd )
pFtest(fixed, pool1)

    F test for individual effects

data:  lnfare ~ concen + year1998 + year1999 + year2000
F = 60.521, df1 = 1148, df2 = 3443, p-value < 0.00000000000000022
alternative hypothesis: significant effects

Figure 3: F Test of Individual Effects (\(H_0:\) no individual effects)

(v)

Estimate this econometric model (2) using the Random Effects (RE) estimator. Clearly explain the assumption about the relationship between \(\upsilon_{i}\) and \(\mbox{concen}_{it}\) that is imposed when estimating the model using the Random Effects (RE) estimator.
Clearly explain, and provide an example, whether you think that this is a realistic assumption.

Solution

Code
#-----------------------------------------
# Question (c)(v)-(vii): Random Effects (RE) Model
#------------------------------------------
random <- plm(lnfare ~ concen + lndist + lndistsq 
              + year1998 + year1999 + year2000, 
              model="random", data = tut12_pd )
# print(summary(random))
wald_random <-pwaldtest(random)
# print(wald_random)
chisq5 <-round(wald_random[[2]][1], digits=4)
pvalf5 <-round(wald_random[[4]][1], digits=4)
numdf5 <-wald_random[[3]][1]
#--------------------------------------
# Cluster-Robust Variance Estimator
# (1) Use finite-sample degrees-of-freedom adjustment (dfa computed in part (a)):
# dfa = (N/(N-1)) * (NT-1)/(NT-K-1)
# display with cluster VCE and df-adjustment
random_crse_vcov <- dfa * vcovHC(random, type = "HC0", cluster = "group")
# report robust se with z values, not t
random_crse_r <-coeftest(random, vcov = random_crse_vcov,df= Inf)
# print(random_crse_r)
# cluster-robust standard errors for random effects model
random_crse <- sqrt(diag(random_crse_vcov))
#------------------------------------------
# RE estimator is only consistent so we want a chi-squared statistic
wald_random_crse <- pwaldtest(random, vcov = random_crse_vcov)
# print(wald_random_crse)
chisq6 <-round(wald_random_crse[[2]][1], digits=4)
pvalf6 <-round(wald_random_crse[[4]][1], digits=4)
numdf6 <-wald_random_crse[[3]][1]
# Display additional RE parameters 
# print(ercomp(random))
sigma_e_random <- round(sqrt(ercomp(random)[["sigma2"]][1]), digits=4)
# print(sigma_e_random)
sigma_v_random <- round(sqrt(ercomp(random)[["sigma2"]][2]), digits=4)
# print(sigma_v_random)
theta_random <- round(ercomp(random)[["theta"]], digits=4)
# print(theta_random)
Code
# compare RE estimator with cluster-robust std. errors
stargazer(random,random, type = "html", dep.var.labels=c("(Log) Fare"),
          column.labels = c( "Random", "Random (Cluster-Robust)"),
          covariate.labels =c("Intercept", "Market Concentration", 
                              "(Log) Distance", "(Log) Distance Squared",
                              "Year=1998", "Year=1999", "Year=2000"),
          se        = list(NULL, random_crse),
          omit.stat = c("f","adj.rsq"),
          add.lines = list(c("Chi-sq Statistic", chisq5, chisq6),
                           c("Chi-sq p value", pvalf5, pvalf6),
                           c("Chi-sq df", numdf5, numdf6),
                           c("sigma (idiosyncratic)", sigma_e_random, sigma_e_random),
                           c("sigma (individual)", sigma_v_random, sigma_v_random),
                           c("alpha", theta_random,theta_random)),
          digits=4, align=TRUE,
          intercept.bottom=FALSE,
          star.cutoffs = c(0.05, 0.01, 0.001))
                          Dependent variable: (Log) Fare
                          -----------------------------------------------
                          Random              Random (Cluster-Robust)
                          (1)                 (2)
-------------------------------------------------------------------------
Intercept                 6.6516***           6.6516***
                          (0.9312)            (1.0483)
Market Concentration      0.2090***           0.2090***
                          (0.0265)            (0.0422)
(Log) Distance            -0.9454***          -0.9454**
                          (0.2642)            (0.2913)
(Log) Distance Squared    0.0975***           0.0975***
                          (0.0186)            (0.0201)
Year=1998                 0.0225***           0.0225***
                          (0.0045)            (0.0041)
Year=1999                 0.0367***           0.0367***
                          (0.0045)            (0.0051)
Year=2000                 0.0982***           0.0982***
                          (0.0045)            (0.0055)
-------------------------------------------------------------------------
Chi-sq Statistic          1360.3406           1222.7445
Chi-sq p value            0                   0
Chi-sq df                 6                   6
sigma (idiosyncratic)     0.1065              0.1065
sigma (individual)        0.3193              0.3193
alpha                     0.8355              0.8355
Observations              4,596               4,596
R2                        0.2287              0.2287
-------------------------------------------------------------------------
Note: *p<0.05; **p<0.01; ***p<0.001

Figure 4: Random Effects Results Model (2)

The estimation results are reported in Figure 4. Note that:

Code
# Display additional RE parameters 
print(c('sigma_e_random: ',sigma_e_random))
                                idios 
"sigma_e_random: "           "0.1065" 
Code
print(c('sigma_v_random: ',sigma_v_random))
                                   id 
"sigma_v_random: "           "0.3193" 
Code
print(c('theta_random: ',theta_random))
                               id 
"theta_random: "         "0.8355" 

\(\hat{\sigma}_{\upsilon} = 0.31935 \qquad \qquad \hat{\sigma}_{\varepsilon} = 0.10651\) so:

\(\hat{\rho} = \frac{\hat{\sigma}_{\upsilon}^{2}}{\hat{\sigma}_{\upsilon}^{2} + \hat{\sigma}_{\varepsilon}^{2}} = \frac{0.31935^{2}}{0.31935^{2} + 0.10651^{2}} = 0.8999\)
and \[\begin{align*} \hat{\alpha} & = 1 - \frac{\hat{\sigma}_{\varepsilon}}{\sqrt{T\,\hat{\sigma}^{2}_{\upsilon} + \hat{\sigma}^{2}_{\varepsilon}}} \\ & = 1 - \frac{0.10651}{\sqrt{4\,(0.31935^{2}) + 0.10651^{2}}} \\ & = 1 - 0.164489 \\ & = 0.8355 \end{align*}\]

Since \(\hat{\alpha}\) is close to 1, we expect that the pooled OLS model is likely to be rejected by the data. However, this would need to be formally tested using an LM test.
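These quantities can also be recovered directly from the estimated random effects object (a short sketch using ercomp(), as in the code above; \(\hat{\rho}\) is computed manually since it is not reported in the table):

Code
# variance components from the random effects model
sigma_e <- as.numeric(sqrt(ercomp(random)[["sigma2"]][1]))   # idiosyncratic std. dev.
sigma_v <- as.numeric(sqrt(ercomp(random)[["sigma2"]][2]))   # individual std. dev.
# share of the error variance due to the individual effect (rho)
rho_hat <- sigma_v^2/(sigma_v^2 + sigma_e^2)
# quasi-demeaning parameter (theta in plm, alpha in these notes), with T = 4
alpha_hat <- 1 - sigma_e/sqrt(4*sigma_v^2 + sigma_e^2)
round(c(rho = rho_hat, alpha = alpha_hat), 4)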

The RE estimator imposes the restriction that \(\mathop{\mathrm{COV}}(\upsilon_{i}, \mbox{concen}_{it}) = 0\) - the concentration index is uncorrelated with the individual effect.

Factors about the cities near the two airports served by the routes could affect demand for air travel. These factors might include population, average education, the characteristics of employers (size, industry \(\ldots\)), and the quality and feasibility of alternatives such as freeways and rail, just to name a few.
Although these factors might be time-varying, over a short period (such as the four years in this sample) these might be approximately constant. Importantly, all these time-invariant factors might reasonably be correlated with \(\mbox{concen}_{it}\), the degree of competition on the route, leading to \(\mathop{\mathrm{COV}}(\mbox{concen}_{it},\upsilon_{i}) \ne 0\).
The identifying restriction under the Random Effects model might not be appropriate.

(vi)

The Random Effects estimator (RE) nests the Pooled OLS model when \(\sigma_{\upsilon} = 0\). Test the hypothesis that the pooled OLS model is the most appropriate model for the data. Your answer should clearly state the null and alternative hypotheses, the distribution of the test statistic, and your decision.

Solution

Code
# Question (c)(vi): LM test: compare random effects & pooled OLS
# Test statistic does not change with the cluster-robust estimator
# Compare to Pool OLS (including time-invariant variables)
plmtest(pool, type=c("bp"))    

    Lagrange Multiplier Test - (Breusch-Pagan)

data:  lnfare ~ concen + lndist + lndistsq + year1998 + year1999 + year2000
chisq = 5566.2, df = 1, p-value < 0.00000000000000022
alternative hypothesis: significant effects

Figure 5: LM Test for Random Effects in Model (2)


Test RE vs Pooled OLS

not reject \(H_0: \rightarrow\) Pooled OLS
reject \(H_0: \rightarrow\) RE model

The null hypothesis is \(H_{0}: \sigma^{2}_{\upsilon} = 0\) against the alternative \(H_{A}: \sigma^{2}_{\upsilon} \ne 0\). The test statistic is an LM statistic and follows a \(\chi^{2}\) distribution with 1 degree of freedom. The 5% critical value is approximately 3.84.
The decision rule is to reject the null hypothesis when the sample value of the test statistic exceeds 3.84.
The sample value of the test statistic, provided in Figure 5 is 5566.2 with a p-value of 0.0000. Clearly reject the null hypothesis. The sample evidence is consistent with significant individual effects, suggesting the pooled OLS specification is not the appropriate model.

(vii)

Clearly outline the important differences between the Random Effects (RE) estimator and the Fixed Effects (FE) estimator. Your answer should clearly explain the variation in the data that is used to identify the parameters of interest.

Solution

this is a good summary


Fixed Effects:

  • allow for time invariant unobserved heterogeneity between individuals.

  • time invariant unobserved heterogeneity between individuals potentially correlated with included explanatory variables.

  • can allow for arbitrary correlation in error terms for the same individual (cluster-robust standard errors).

  • identification from within-individual variation only: time-varying explanatory variables only.

Random Effects:

  • allow for time invariant unobserved heterogeneity between individuals.

  • time invariant unobserved heterogeneity between individuals uncorrelated with included explanatory variables.

  • errors correlated over time for the same individual.

  • identification from both within individual variation and between individuals: can include time-varying and time-invariant explanatory variables.

When \(\mathop{\mathrm{COV}}(\upsilon_{i},X_{itk}) = 0\): both the RE estimator (GLS) and the FE estimator (within transformation) are consistent estimators, but the GLS estimator for the RE model has a lower variance. In large samples, both estimators converge to the true parameter values \(\beta_{k}\) - RE and FE estimates should be asymptotically similar.

When \(\mathop{\mathrm{COV}}(\upsilon_{i},X_{itk}) \ne 0\): the RE estimator is not consistent, but the FE estimator is consistent. In large samples the FE estimator converges to the true parameter values \(\beta_{k}\), but the RE estimator converges to some other value - expect the RE and FE estimates to differ, even in large samples.
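A quick way to see which case is relevant here is to put the FE and RE estimates of the coefficient on market concentration side by side (a sketch using the fixed and random objects estimated above):

Code
# compare the FE (within) and RE (GLS) estimates of the concentration effect
round(c(FE = coef(fixed)["concen"], RE = coef(random)["concen"]), 4)
# a sizeable gap between the two estimates suggests COV(v_i, concen_it) != 0,
# in which case only the FE estimator is consistent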

(d)

Consider the following alternative Correlated Random Effects (CRE) model for the price of airline seats on routes in a large country: \[\begin{align} \ln\,\mbox{fare}_{it} & = \beta_{0} + \beta_{1}\,\mbox{concen}_{it} + \beta_{2}\,\ln\,\mbox{dist}_{i} + \beta_{3}\left\{\ln\,\mbox{dist}_{i}\right\}^{2} \notag \\ & + \beta_{4}\,\mbox{YEAR1998}_{t} + \beta_{5}\,\mbox{YEAR1999}_{t} + \beta_{6}\,\mbox{YEAR2000}_{t} + \beta_{7}\,\overline{\mbox{concen}}_{i} + \eta_{i} + \varepsilon_{it} \tag{3} \end{align}\] where \(\overline{\mbox{concen}}_{i}\) represents the mean value of the market concentration index on route \(i\) and \(\eta_{i}\) represents an unobserved time-invariant random variable which satisfies the restriction \(\mathop{\mathrm{COV}}(\eta_{i},\mathbf{X}_{it}) = 0\) for each of the explanatory variables and: \[ \mathop{\mathrm{VAR}}(\eta_{i}|\mathbf{X}_{it}) = \sigma^{2}_{\eta} \qquad \mathop{\mathrm{COV}}(\eta_{i},\varepsilon_{it}) = 0 \qquad \mathop{\mathrm{COV}}(\eta_{i},\mathbf{X}_{it}) = 0 \] and \(\mathop{\mathrm{VAR}}(\varepsilon_{it}|\mathbf{X}_{it},\eta_{i}) = \sigma^{2}_{\varepsilon}\).

Estimate this correlated random effect model (CRE) using the random effects estimator.
Test the hypothesis that the Random Effects (RE) estimator is the most appropriate model using a z-test at the 5% level of significance.
Your answer should clearly state the null and alternative hypotheses, the distribution of the test statistic, and your decision.

Solution

Code
#---------------------
# Question (d): Correlated Random Effects Estimation
#--------------------
# create individual specific means of X vars
tut12_pd$mconcen <- ave(tut12_pd$concen, tut12_pd$id)
# correlated random effects (CRE)
random_cre <- plm(lnfare ~ concen + lndist + lndistsq + year1998 
                  + year1999 + year2000 + mconcen, 
                  model="random", data = tut12_pd )
# print(summary(random_cre))
wald_random_cre <- pwaldtest(random_cre)
# print(wald_random_cre)
chisq7 <-round(wald_random_cre[[2]][1], digits=4)
pvalf7 <-round(wald_random_cre[[4]][1], digits=4)
numdf7 <-wald_random_cre[[3]][1]
# Cluster-Robust VCOV
random_cre_crse_vcov <- dfa * vcovHC(random_cre, type = "HC0", cluster = "group")
random_cre_crse_r <-coeftest(random_cre, vcov = random_cre_crse_vcov, df= Inf)
# print(random_cre_crse_r)
# cluster-robust standard errors for CRE model
random_cre_crse <- sqrt(diag(random_cre_crse_vcov))
# RE estimator is only consistent so we want a chi-squared statistic
wald_random_cre_crse <- pwaldtest(random_cre, vcov = random_cre_crse_vcov)
# print(wald_random_cre_crse)
chisq8 <-round(wald_random_cre_crse[[2]][1], digits=4)
pvalf8 <-round(wald_random_cre_crse[[4]][1], digits=4)
numdf8 <-wald_random_cre_crse[[3]][1]

# print(ercomp(random_cre))
sigma_e_random_cre <- round(sqrt(ercomp(random_cre)[["sigma2"]][1]), digits=4)
# print(sigma_e_random_cre)
sigma_v_random_cre <- round(sqrt(ercomp(random_cre)[["sigma2"]][2]), digits=4)
# print(sigma_v_random_cre)
theta_random_cre <- round(ercomp(random_cre)[["theta"]], digits=4)
# print(theta_random_cre)
Code
stargazer(random_cre,random_cre, type = "html", dep.var.labels=c("(Log) Fare"),
          column.labels = c( "CRE", "CRE (Cluster-Robust)"),
          covariate.labels =c("Intercept", "Market Concentration", 
                              "(Log) Distance", "(Log) Distance Squared",
                              "Year=1998", "Year=1999", "Year=2000",
                              "(Mean) Mkt. Concentr."),
          se        = list(NULL, random_cre_crse),
          omit.stat = c("f","adj.rsq"),
          add.lines = list(c("Chi-sq Statistic", chisq7, chisq8),
                           c("Chi-sq value", pvalf7, pvalf8),
                           c("Chi-sq df", numdf7, numdf8),
                           c("sigma (idiosyncratic)", sigma_e_random, sigma_e_random),
                           c("sigma (individual)", sigma_v_random, sigma_v_random),
                           c("alpha", theta_random,theta_random)),
          digits=4, align=TRUE,
          intercept.bottom=FALSE,
          star.cutoffs = c(0.05, 0.01, 0.001))
                          Dependent variable: (Log) Fare
                          -----------------------------------------------
                          CRE                 CRE (Cluster-Robust)
                          (1)                 (2)
-------------------------------------------------------------------------
Intercept                 6.6661***           6.6661***
                          (0.9303)            (1.0454)
Market Concentration      0.1689***           0.1689***
                          (0.0294)            (0.0495)
(Log) Distance            -1.0083***          -1.0083***
                          (0.2647)            (0.2914)
(Log) Distance Squared    0.1039***           0.1039***
                          (0.0187)            (0.0202)
Year=1998                 0.0228***           0.0228***
                          (0.0045)            (0.0042)
Year=1999                 0.0364***           0.0364***
                          (0.0044)            (0.0051)
Year=2000                 0.0978***           0.0978***
                          (0.0045)            (0.0055)
(Mean) Mkt. Concentr.     0.2136**            0.2136**
                          (0.0679)            (0.0816)
-------------------------------------------------------------------------
Chi-sq Statistic          1372.8893           1273.4667
Chi-sq p value            0                   0
Chi-sq df                 7                   7
sigma (idiosyncratic)     0.1065              0.1065
sigma (individual)        0.3193              0.3193
alpha                     0.8355              0.8355
Observations              4,596               4,596
R2                        0.2303              0.2303
-------------------------------------------------------------------------
Note: *p<0.05; **p<0.01; ***p<0.001

Figure 6: Correlated Random Effects Model (3)


The estimation results are reported in Figure 6.

note in the Lecture example
there are 7 restrictions
so use an F test
here there is only 1 restriction
so use a t test

The null hypothesis is \(H_{0}: \beta_{7} = 0\) against the two-sided alternative \(H_{A}: \beta_{7} \ne 0\). The test statistic will follow a t-distribution with \((N-1) =1,148\) degrees of freedom, under the null hypothesis. Note that since we are using the cluster-robust variance estimator the correct degrees of freedom are \((N-1)\).
Since this is a two-sided test, the 5% critical value will be \(t_{c} \thickapprox 1.96\). The decision rule is to reject \(H_{0}\) if \(t > t_{c}\) or \(t< - t_{c}\).
The sample test statistic is: \[ t = \frac{\hat{\beta}_{7}}{\mbox{se}(\hat{\beta}_{7})} = \frac{0.2136}{0.0816} = 2.6165 \] Since \(t > t_{c}\), we reject \(H_{0}\).
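The same t statistic can also be read directly from the coefficient table with cluster-robust standard errors (a sketch; the df argument imposes the \(N-1 = 1,148\) degrees of freedom discussed above):

Code
# t test on the route-mean concentration variable, cluster-robust s.e.
coeftest(random_cre, vcov = random_cre_crse_vcov, df = N - 1)["mconcen", ]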

Alternatively, use a \(\chi^{2}\) (large-sample Wald) test. The test statistic will follow a \(\chi^{2}\) distribution with \(M\) degrees of freedom.
Since there is a single restriction, \(M=1\), and the \(\chi^{2}(1)\) critical value is 3.84.
The decision rule is to reject \(H_{0}\) if the test statistic exceeds 3.84. Alternatively, reject \(H_{0}\) if the p-value for the sample value of the test statistic is less than \(\alpha = 0.05\).

Code
# Using car package
hnull <-c("mconcen=0")
f_cre<-linearHypothesis(random_cre, hnull, vcov=random_cre_crse_vcov)
print(f_cre)

Linear hypothesis test:
mconcen = 0

Model 1: restricted model
Model 2: lnfare ~ concen + lndist + lndistsq + year1998 + year1999 + year2000 + 
    mconcen

Note: Coefficient covariance matrix supplied.

  Res.Df Df  Chisq Pr(>Chisq)   
1   4589                        
2   4588  1 6.8459   0.008885 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Figure 7: \(H_0:\beta_7=0\) in Model (3)


The R output from the Wald test is given in Figure 7.
The output provides a sample \(\chi^{2}\) test statistic of \(6.8459\) with a p-value of \(0.0089\).
Since the p-value is less than the desired level of significance we reject the null hypothesis.

Note: When using R, I would recommend using the \(\chi^{2}\) form of the test, rather than the F version. The main reason is that the random effects estimator is consistent (but not unbiased) and asymptotically normally distributed. Additionally, the \(\text{linearHypothesis}\) function in the \(\text{car}\) package does not get the denominator degrees of freedom correct when conducting the F test using the cluster-robust variance estimator. After estimating the random effects model (and correlated random effects model), R reports the \(\chi^{2}\) test statistic following the \(\text{linearHypothesis}\) command by default, so there is nothing extra you need to do.

Regardless of whether you use a t-test or a \(\chi^{2}\) test, the sample evidence is consistent with the alternative hypothesis that \(\mathop{\mathrm{COV}}(\upsilon_{i}, \mbox{concen}_{it}) \ne 0\), so the fixed effects specification (rather than the random effects specification) is the appropriate specification.

Fixed Effects is the preferred model

Aside: Which model is preferred?

  1. Test Pooled OLS vs Fixed Effects Model

    • Sample test statistic F = 60.521, p-value = 0.000
    • Fixed Effects preferred over pooled OLS

  2. Test Pooled OLS vs Random Effects

    • Sample test statistic LM = 5,566.2, p-value = 0.000
    • Random Effects preferred over pooled OLS

  3. Test Fixed Effects vs Random Effects

    • CRE (cluster-robust) sample test statistic \(\chi^{2} = 6.8459\), p-value = 0.0089
    • Fixed Effects preferred over Random Effects