Chapter 10 Solutions to Selected Exercises
10.1 Chapter 1
Question 5
- \(y_i =\) weight of the \(i\)-th person, \(x_{i1} =\) their age, \(x_{i2} =\) sex, \(x_{i3} =\) height, \(x_{i4} =\) mean daily food intake, \(x_{i5} =\) mean daily energy expenditure. The distribution of \(y_i\) could be Gaussian. The linear component is \(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5}\).
- \(y_i =\) proportion of infected mice among the 20 mice in one experiment, \(x_{ij} =\) indicator variable for exposure at level \(j\), with \(j = 1, 2, 3, 4\). The distribution of \(y_i\) could be rescaled binomial. The linear component is \(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4}\).
- \(y_i =\) number of trips to the supermarket per week, \(x_{i1} =\) number of people in the household, \(x_{i2} =\) household income, \(x_{i3} =\) distance to the supermarket. The distribution of \(y_i\) could be Poisson. The linear component is \(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}\).
10.2 Chapter 2
Question 1b
For the log link
We have \(S(\boldsymbol{\beta}) = \sum_{i=1}^{n_A + n_B} ( y_i - \mu_i ) \, \boldsymbol{x}_i\), where \(\boldsymbol{x}_i = (1, x_i)^T\). Note that data in the same group have the same estimated mean since \(x_i\) is the same within each group. From the score equation:
\[ \boldsymbol{0} = S(\boldsymbol{\hat{\beta}}) = \sum_{i=1}^{n_A} ( y_i - \hat{\mu}_A ) \begin{pmatrix} 1 \\ 1 \end{pmatrix} + \sum_{i=n_A+1}^{n_A+n_B} ( y_i - \hat{\mu}_B ) \begin{pmatrix} 1 \\ 0 \end{pmatrix}. \]
This implies:
\[ \sum_{i=1}^{n_A} ( y_i - \hat{\mu}_A ) + \sum_{i=n_A+1}^{n_A+n_B} ( y_i - \hat{\mu}_B ) = 0 \] and
\[ \sum_{i=1}^{n_A} ( y_i - \hat{\mu}_A ) = 0.\] These two equations imply that \(\hat{\mu}_A = \frac{1}{n_A} \sum_{i=1}^{n_A} y_i\) and \(\hat{\mu}_B = \frac{1}{n_B} \sum_{i=n_A+1}^{n_A+n_B} y_i\).
For a general link
With a general link, data in the same group still have the same estimated mean and linear predictor. Using the general form of the score function, we have:
\[ \boldsymbol{0} = S(\boldsymbol{\hat{\beta}}) = \frac{1}{\phi} \sum_{i=1}^{n_A} ( y_i - \hat{\mu}_A ) \frac{h'(\hat{\eta}_A)}{\mathcal{V}(\hat{\mu}_A)} \begin{pmatrix} 1 \\ 1 \end{pmatrix} + \frac{1}{\phi} \sum_{i=n_A+1}^{n_A+n_B} ( y_i - \hat{\mu}_B ) \frac{h'(\hat{\eta}_B)}{\mathcal{V}(\hat{\mu}_B)} \begin{pmatrix} 1 \\ 0 \end{pmatrix}. \] This implies:
\[ \sum_{i=1}^{n_A} ( y_i - \hat{\mu}_A ) \frac{h'(\hat{\eta}_A)}{\mathcal{V}(\hat{\mu}_A)} + \sum_{i=n_A+1}^{n_A+n_B} ( y_i - \hat{\mu}_B ) \frac{h'(\hat{\eta}_B)}{\mathcal{V}(\hat{\mu}_B)} = 0 \] and
\[ \sum_{i=1}^{n_A} ( y_i - \hat{\mu}_A ) \frac{h'(\hat{\eta}_A)}{\mathcal{V}(\hat{\mu}_A)} = 0. \] These two equations also imply that \(\hat{\mu}_A = \frac{1}{n_A} \sum_{i=1}^{n_A} y_i\) and \(\hat{\mu}_B = \frac{1}{n_B} \sum_{i=n_A+1}^{n_A+n_B} y_i\).
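As a quick numerical check of this result (a sketch with made-up simulated data, assuming for illustration a Poisson response with log link), the fitted group means from the GLM agree with the sample means of each group:
set.seed(1)
y <- c(rpois(10, lambda = 3), rpois(15, lambda = 7))   # simulated responses for groups A and B
x <- c(rep(1, 10), rep(0, 15))                          # 1 = group A, 0 = group B
fit <- glm(y ~ x, family = poisson(link = "log"))
fitted(fit)[c(1, 11)]                                   # fitted means of groups A and B
c(mean(y[x == 1]), mean(y[x == 0]))                     # group sample means agree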
Question 2
- This link function tries to capture \(E[y_i] = \mu_i = (\boldsymbol{\beta}^T \boldsymbol{x}_i)^2\).
- \(\displaystyle S(\boldsymbol{\beta}) = \frac{2}{\sigma^2} \sum_i \left( y_i - (\boldsymbol{\beta}^T \boldsymbol{x}_i)^2 \right) (\boldsymbol{\beta}^T \boldsymbol{x}_i) \boldsymbol{x}_i\).
Question 3
- First we identify the EDF components. An exponential distribution is an EDF with \(\phi = 1\), \(\theta = -\lambda\), and \(b(\theta) = -\ln(-\theta)\). This gives \(\mu = -1/\theta = 1/\lambda\) and \(\text{Var}[y] = 1/\theta^{2} = \mu^{2}\) as expected. Recall that \(\text{Var}[y] = \phi \;\mathcal{V}(\mu)\), defining \(\mathcal{V}\). The response function \(h\) is \(\mu = h(\eta) = e^{\eta}\), so that \(\mathcal{V}(\mu) = \mu^{2} = e^{2\eta}\). Putting all this together in the standard formula gives
\[ S(\boldsymbol{\beta}) = \sum_i (y_i - e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}) \;e^{-2 \boldsymbol{\beta}^T \boldsymbol{x}_i} \;e^{\boldsymbol{\beta}^T \boldsymbol{x}_i} \boldsymbol{x}_i = \sum_i \Biggl( {y_i \over e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}} - 1 \Biggr) \boldsymbol{x}_i. \]
- Differentiating gives
\[ F_{\text{obs}}(\boldsymbol{\beta}) = -{\partial S\over \partial\boldsymbol{\beta}} = \sum_i {y_i \over e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}} \;\boldsymbol{x}_{i} \boldsymbol{x}_{i}^{T}. \]
- Expectation will only act on the \(y_i\) in each term, giving \(\mu_i\), so
\[ F(\boldsymbol{\beta}) = \sum_i {\mu_i \over e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}} \;\boldsymbol{x}_i \boldsymbol{x}_i^T = \sum_i \boldsymbol{x}_i \boldsymbol{x}_i^T. \]
Notice that this is not even a function of \(\boldsymbol{\beta}\).
Question 5
We have \(\eta_i = g(\mu_i) = \log(\mu_i) = \log(n_i \lambda_i) = \log n_i + \boldsymbol{\beta}^T \boldsymbol{x}_i\). We can also think of this model as a standard Poisson GLM with \(\eta_i = \tilde{\boldsymbol{\beta}}^T \tilde{\boldsymbol{x}}_i\), where \(\tilde{\boldsymbol{\beta}} = \begin{pmatrix} 1 \\ \boldsymbol{\beta} \end{pmatrix}\) and \(\tilde{\boldsymbol{x}}_i = \begin{pmatrix} \log n_i \\ \boldsymbol{x}_i \end{pmatrix}\). See Section 3.5.3 in the textbook for an example of this model.
This model is similar to a standard Poisson GLM but with an additional \(\log n_i\) term (an offset) in the linear predictor, which does not affect the general forms of the score function and Fisher information (you should verify this). Thus, we can use the expressions for the score function and Fisher information in the lecture notes. Assume the data are ungrouped (grouped data would also be unrealistic here, since we typically cannot obtain several responses for each population). Since log is the natural link for this model, we have:
\[ S(\boldsymbol{\beta}) = \sum_i (y_i - \mu_i) \boldsymbol{x}_i = \sum_i \left( y_i - n_i \exp(\boldsymbol{\beta}^T \boldsymbol{x}_i) \right) \boldsymbol{x}_i. \] Furthermore, for the natural link, the observed Fisher information and expected Fisher information are the same, with
\[ F_{\text{obs}}(\boldsymbol{\beta}) = F(\boldsymbol{\beta}) = \sum_i h'(\eta_i) \boldsymbol{x}_i\boldsymbol{x}_i^T = \sum_i n_i \exp(\boldsymbol{\beta}^T \boldsymbol{x}_i) \boldsymbol{x}_i\boldsymbol{x}_i^T. \]
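In R, such a model can be fitted directly via an offset; a minimal sketch, assuming a hypothetical data frame d with a count response y, population size n, and covariates x1 and x2:
fit <- glm(y ~ x1 + x2 + offset(log(n)), family = poisson(link = "log"), data = d)
# equivalently: glm(y ~ x1 + x2, offset = log(n), family = poisson, data = d)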
Question 6
The score function and Fisher information in matrix notation are:
\[\begin{align} \boldsymbol{S} &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} (\boldsymbol{Y} - \boldsymbol{\mu}) \\ \boldsymbol{F} &= \boldsymbol{X}^{T}\boldsymbol{D}^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{D}\boldsymbol{X} \end{align}\]
where
\[\begin{align} \boldsymbol{X} &= \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{pmatrix} \\ \boldsymbol{Y} &= \begin{pmatrix} 12 \\ 10 \\ 8 \\ 7 \end{pmatrix} \\ \boldsymbol{\mu} &= \begin{pmatrix} e^{\beta_0} \\ e^{\beta_0 + \beta_1} \\ e^{\beta_0 + 2 \beta_1} \\ e^{\beta_0 + 3 \beta_1} \end{pmatrix} \\ \boldsymbol{D} &= \begin{pmatrix} e^{\beta_0} & 0 & 0 & 0 \\ 0 & e^{\beta_0 + \beta_1} & 0 & 0 \\ 0 & 0 & e^{\beta_0 + 2 \beta_1} & 0 \\ 0 & 0 & 0 & e^{\beta_0 + 3 \beta_1} \end{pmatrix} \\ \boldsymbol{\Sigma} &= \begin{pmatrix} \frac{e^{\beta_0}}{20} & 0 & 0 & 0 \\ 0 & \frac{e^{\beta_0 + \beta_1}}{20} & 0 & 0 \\ 0 & 0 & \frac{e^{\beta_0 + 2 \beta_1}}{10} & 0 \\ 0 & 0 & 0 & \frac{e^{\beta_0 + 3 \beta_1}}{10} \end{pmatrix}. \end{align}\]
Furthermore, since the log link is the natural link and \(\phi=1\) for the Poisson GLM, we can simplify further:
\[\begin{align} \boldsymbol{S} &= \boldsymbol{X}^{T} \boldsymbol{G} (\boldsymbol{Y} - \boldsymbol{\mu}) \\ \boldsymbol{F} & = \boldsymbol{X}^{T}\boldsymbol{G}^{T}\boldsymbol{\Sigma}\boldsymbol{G}\boldsymbol{X} \end{align}\]
where \(\boldsymbol{G} = \boldsymbol{D} \boldsymbol{\Sigma}^{-1}\) is the grouping matrix, i.e., the diagonal matrix of group sizes: \[ \boldsymbol{G} = \begin{pmatrix} 20 & 0 & 0 & 0 \\ 0 & 20 & 0 & 0 \\ 0 & 0 & 10 & 0 \\ 0 & 0 & 0 & 10 \end{pmatrix}. \]
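As a numerical illustration (a sketch only; the evaluation point \(\boldsymbol{\beta} = (\log 12, 0)^T\) is an arbitrary choice, not part of the exercise), the score and Fisher information above can be evaluated in R:
X    <- cbind(1, 0:3)                  # design matrix
Y    <- c(12, 10, 8, 7)                # group means
n    <- c(20, 20, 10, 10)              # group sizes
beta <- c(log(12), 0)                  # arbitrary evaluation point
mu   <- exp(drop(X %*% beta))
D     <- diag(mu)                      # dmu/deta for the log link
Sigma <- diag(mu / n)                  # variances of the group means (phi = 1)
S    <- t(X) %*% D %*% solve(Sigma) %*% (Y - mu)   # score vector
Fmat <- t(X) %*% D %*% solve(Sigma) %*% D %*% X    # Fisher information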
Question 9
- From \(\boldsymbol{S} = \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} (\boldsymbol{Y} - \boldsymbol{\mu})\), we have:
\[\begin{align} E[\boldsymbol{S}] = \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} (E[\boldsymbol{Y}] - \boldsymbol{\mu}) = \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}) = 0 \end{align}\]
and
\[\begin{align} \text{Var}[\boldsymbol{S}] &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \text{Var}(\boldsymbol{Y} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \text{Var}(\boldsymbol{Y}) \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \boldsymbol{\Sigma} \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X}. \end{align}\]
- Taking the derivative of \(\boldsymbol{S}\) and applying the product rule for derivatives, we have:
\[\begin{align} \frac{\partial \boldsymbol{S}}{\partial \boldsymbol{\beta}} &= \boldsymbol{X}^{T} \frac{\partial }{\partial \boldsymbol{\beta}}(\boldsymbol{D} \boldsymbol{\Sigma}^{-1}) (\boldsymbol{Y} - \boldsymbol{\mu}) + \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial }{\partial \boldsymbol{\beta}} (\boldsymbol{Y} - \boldsymbol{\mu}) \\ &= \boldsymbol{X}^{T} \frac{\partial }{\partial \boldsymbol{\beta}}(\boldsymbol{D} \boldsymbol{\Sigma}^{-1}) (\boldsymbol{Y} - \boldsymbol{\mu}) - \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\beta}}. \end{align}\]
Thus,
\[\begin{align} \boldsymbol{F} = E \left[- \frac{\partial \boldsymbol{S}}{\partial \boldsymbol{\beta}} \right] &= - E \left[ \boldsymbol{X}^{T} \frac{\partial }{\partial \boldsymbol{\beta}}(\boldsymbol{D} \boldsymbol{\Sigma}^{-1}) ( \boldsymbol{Y} - \boldsymbol{\mu} ) \right] + E \left[ \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\beta}} \right] \\ &= - \boldsymbol{X}^{T} \frac{\partial }{\partial \boldsymbol{\beta}}( \boldsymbol{D} \boldsymbol{\Sigma}^{-1}) E [ \boldsymbol{Y} - \boldsymbol{\mu} ] + \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\beta}} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\eta}} \frac{\partial \boldsymbol{\eta}}{\partial \boldsymbol{\beta}} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X}. \end{align}\]
10.3 Chapter 3
Question 2a
Let \(x_{i1} = {\tt time}_i\), \(x_{i2} = {\tt I(cos(2*pi*time/12))}_i\), and \(x_{i3} = {\tt I(sin(2*pi*time/12))}_i\), where the subscript \(i\) indicates the \(i\)-th data point. We perform the test with
\[ \begin{aligned} & \mathcal{H}_0: \eta_i = \beta_0 + \beta_1 x_{i1} \\ \text{against } \quad & \mathcal{H}_1: \eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}. \end{aligned} \] Let \(\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \beta_3)^T\), then
\[ \begin{aligned} C & = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \\ \gamma & = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \end{aligned} \]
The variance that we are looking for is:
\[ C \, F(\hat{\boldsymbol\beta})^{-1} \, C^T = C \, \text{Var} [ \hat{\boldsymbol\beta} ] \, C^T = \text{Var} \left[ \begin{pmatrix} \hat{\beta_2} \\ \hat{\beta_3} \end{pmatrix} \right]. \]
Thus, the Wald statistic is:
\[ \begin{aligned} W &= (C \hat{\boldsymbol\beta} - \gamma)^T \left( C F(\hat{\boldsymbol\beta})^{-1} C^T \right)^{-1} (C \hat{\boldsymbol\beta} - \gamma) \\ &= (\hat{\beta_2} \;\; \hat{\beta_3}) \, \text{Var} \left[ \begin{pmatrix} \hat{\beta_2} \\ \hat{\beta_3} \end{pmatrix} \right]^{-1} \begin{pmatrix} \hat{\beta_2} \\ \hat{\beta_3} \end{pmatrix}. \end{aligned} \]
From R, we can get \((\hat{\beta_2} \;\; \hat{\beta_3})\) and the corresponding covariance matrix by:
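For instance (a sketch, assuming the fitted model object is named polio1.glm, as in the Chapter 4 solutions, so that the cosine and sine terms are its third and fourth coefficients):
coef(polio1.glm)[3:4]        # estimates of beta_2 and beta_3
vcov(polio1.glm)[3:4, 3:4]   # their estimated covariance matrix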
## I(cos(2 * pi * time/12)) I(sin(2 * pi * time/12))
## 0.1812537 -0.4231874
## I(cos(2 * pi * time/12)) I(sin(2 * pi * time/12))
## I(cos(2 * pi * time/12)) 0.0092468054 -0.0002083338
## I(sin(2 * pi * time/12)) -0.0002083338 0.0095238337
Thus, the Wald statistic can be computed by:
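For example (a sketch, continuing with the objects extracted above):
b <- coef(polio1.glm)[3:4]
V <- vcov(polio1.glm)[3:4, 3:4]
W <- t(b) %*% solve(V) %*% b   # Wald statistic
W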
## [,1]
## [1,] 22.00497
Under \(\mathcal{H}_0\), this statistic is approximately \(\chi^{2}(2)\) distributed, with the p-value:
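For example (a sketch, reusing W from above):
pchisq(W, df = 2, lower.tail = FALSE)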
## [,1]
## [1,] 1.666024e-05
Since this p-value is very small, we reject \(\mathcal{H}_0\) at both 5% and 1% significance levels.
Note that we can check our result by using the wald.test function from the mdscore package (to be installed if necessary), e.g.,
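A possible call is sketched below; the exact arguments are an assumption here (the fitted model and the positions of the tested coefficients), so check the package documentation:
library(mdscore)
wald.test(polio1.glm, terms = 3:4)   # assumed signature: wald.test(model, terms)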
## $W
## [1] 22.00497
##
## $pvalue
## [1] 1.666024e-05
##
## attr(,"class")
## [1] "wald.test"
Question 3
- From exercise 3 of Chapter 2, when \(\boldsymbol{x}_i = (1, \texttt{wbc}_i)^T\), the Fisher information becomes:
\[ F = \sum_i \begin{pmatrix} 1 & \texttt{wbc}_i \\ \texttt{wbc}_i & \texttt{wbc}_i^2 \end{pmatrix}. \]
Using the data, we can find:
\[ F = \begin{pmatrix} 17.0000 & 69.6300 \\ 69.6300 & 291.4571 \end{pmatrix}. \]
Its inverse is:
\[ F^{-1} = \begin{pmatrix} 2.7384 & -0.6542 \\ -0.6542 & 0.1597 \end{pmatrix}. \]
- For a new value \(\texttt{wbc}_0 = 3\), the estimated value of the linear predictor is given by \(\hat\eta_0 = \hat{\boldsymbol\beta}^T \boldsymbol{x}_0 = \hat{\beta}_1 + 3 \hat{\beta}_2 = 5.150\). The estimated expected value of \(\texttt{time}\) is thus \(e^{5.15} = 172.4\).
From the value of \(F^{-1}\), we find the standard error of this value to be:
\[ SE = \sqrt{ \boldsymbol{x}_0^T F^{-1} \boldsymbol{x}_0 } = \sqrt{ \begin{pmatrix} 1 & 3 \end{pmatrix} \begin{pmatrix} 2.7384 & -0.6542 \\ -0.6542 & 0.1597 \end{pmatrix} \begin{pmatrix} 1 \\ 3 \end{pmatrix} } = \sqrt{0.251} = 0.5. \]
Thus the normal-approximation confidence interval on the linear predictor scale is: \[ CI = 5.15 \pm 1.96 \times 0.5 = [4.17, 6.13]. \]
and thus on the response scale, the confidence interval is: \[ CI = \exp([4.17, 6.13]) = [64.6, 460]. \]
- The value of \(\hat\beta_2\) implies that \(\mu = e^{\beta_1 - 1.109 \, {\tt wbc}}\). Thus, as the white blood cell count increases, the survival time goes down exponentially. In fact, an increase of \(1\) in the count multiplies the expected survival time by \(e^{-1.109} = 0.3299\), i.e., reduces it by about two-thirds.
As it is not specified here which test to use, it is acceptable to apply a Wald test, a z-test, a t-test, or even to test via a confidence interval for \({\beta}_2\). We can simply conduct the z-test, remembering that \(\hat{\boldsymbol\beta} - \boldsymbol\beta\) is asymptotically normally distributed with mean \(0\) and covariance \(F^{-1}(\hat{\boldsymbol\beta})\). From the value of \(F^{-1}\), we have that under \(\mathcal{H}_0\), \(\text{Var}[\hat\beta_2] = 0.1597\). This gives a \(z\) value of \(-1.109/\sqrt{0.1597} = -2.78\). The probability of getting the observed value or less (i.e., a more extreme situation) is then \(\Phi(-2.78) = 0.00276\). This is significant at the \(5\%\) level (and at the \(1\%\) level).
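The numbers above can be reproduced from the quantities quoted in this solution (a sketch using only the values given here):
Finv <- matrix(c(2.7384, -0.6542, -0.6542, 0.1597), nrow = 2)
x0   <- c(1, 3)
eta0 <- 5.150                               # estimated linear predictor at wbc = 3
se   <- sqrt(drop(t(x0) %*% Finv %*% x0))   # standard error on the linear predictor scale
ci_eta <- eta0 + c(-1, 1) * 1.96 * se       # CI on the linear predictor scale
exp(ci_eta)                                 # CI on the response scale
z <- -1.109 / sqrt(0.1597)                  # z statistic for beta_2
pnorm(z)                                    # one-sided p-value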
Question 4
- Let there be \(n\) matches and \(p\) teams. We can construct a match matrix \(M \in \mathbb{R}^{n \times p}\) where each column represents a team and each row a match by assigning:
\[ x_{ij} = \begin{cases} +1 & \mbox{if team } j \mbox{ played at home in match } i \\ -1 & \mbox{if team } j \mbox{ played away in match } i \\ 0 & \mbox{if team } j \mbox{ did not play in match } i \end{cases} \]
However, notice that:
- \(x_{i1} = - \sum_{j=2}^p x_{ij}\), meaning that the columns of \(M\) are linearly dependent, so \(M\) cannot serve as a design matrix (\(M^T M\) would be singular). In particular, the individual team strengths are not identifiable.
- Further, this matrix does not provide for an intercept.
Hence, we adjust the above matrix into the design matrix \(X \in \mathbb{R}^{n \times p}\), whose first column is \((1,\ldots, 1)^T\) and whose remaining columns are any \(p-1\) of the columns of \(M\). From the result table, the strength for Arsenal is omitted, which means that the column for Arsenal is not included in \(X\). In this setting, Arsenal is defined to have strength zero and all other strengths are fitted relative to it (a small construction sketch is given at the end of this question).
The intercept term represents the home team advantage as it is a constant additive effect to the home team strength.
- The simple model predicts odds on Chelsea winning of:
\[ e^{\hat{\beta}_0 + \hat{\beta}_{\text{Chelsea}} - \hat{\beta}_{\text{Newcastle}}} = e^{-0.284+(-0.498)-(-1.507)} = e^{0.725} = 2.06. \]
This is quite different from the bookmakers' odds.
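To make the construction of \(X\) concrete, here is a small sketch with hypothetical matches (the first team plays the role of Arsenal as the reference):
teams <- c("Arsenal", "Chelsea", "Newcastle")
home  <- c("Chelsea", "Arsenal")        # home team of each (made-up) match
away  <- c("Newcastle", "Chelsea")      # away team of each (made-up) match
M <- t(sapply(seq_along(home), function(i) {
  row <- setNames(numeric(length(teams)), teams)
  row[home[i]] <-  1                    # +1 for the home team
  row[away[i]] <- -1                    # -1 for the away team
  row
}))
X <- cbind(Intercept = 1, M[, -1])      # drop the reference team's column
X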
10.4 Chapter 4
Question 2
We will only compare polio.glm and polio1.glm here (see the solutions of Exercise 2a in Chapter 3 for the test hypothesis).
- First, we fit polio3.glm again but with an appropriate order for the variables and then get the analysis of deviance table:
# Temperature data: one value per year, repeated for the 12 months of that year
temp_data <- rep(c(5.195, 5.138, 5.316, 5.242, 5.094, 5.108, 5.260, 5.153,
                   5.155, 5.231, 5.234, 5.142, 5.173, 5.167), each = 12)
# Rescale the temperatures to the interval [0, 10]
scaled_temp <- 10 * (temp_data - min(temp_data)) / (max(temp_data) - min(temp_data))
uspolio$temp <- scaled_temp
polio3.glm <- glm(cases ~ time + I(cos(2*pi*time/12)) + I(sin(2*pi*time/12)) +
                    I(cos(2*pi*time/6)) + I(sin(2*pi*time/6)) + temp,
                  family = poisson(link = log), data = uspolio)
anova(polio3.glm)
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: cases
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 167 343.00
## time 1 9.4538 166 333.55 0.0021071 **
## I(cos(2 * pi * time/12)) 1 3.4306 165 330.12 0.0639970 .
## I(sin(2 * pi * time/12)) 1 19.3938 164 310.72 1.064e-05 ***
## I(cos(2 * pi * time/6)) 1 21.3637 163 289.36 3.799e-06 ***
## I(sin(2 * pi * time/6)) 1 0.5036 162 288.85 0.4779331
## temp 1 12.0192 161 276.84 0.0005266 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the table, the test statistic for \(\mathcal{H}_0\): polio.glm and \(\mathcal{H}_1\): polio1.glm is:
\[ \frac{D({\tt polio.glm}, {\tt polio1.glm})}{\hat{\phi}} = \frac{333.55 - 310.72}{1} = 22.83. \]
Under \(\mathcal{H}_0\), this statistic is approximately \(\chi^{2}(2)\) distributed with the p-value:
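For example (a sketch):
pchisq(22.83, df = 2, lower.tail = FALSE)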
## [1] 1.102881e-05
Since this p-value is very small, we reject \(\mathcal{H}_0\) at both 5% and 1% significance levels.
- We can check our result above using the following R code, which should give us a very similar p-value.
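A sketch of such a check, assuming polio.glm and polio1.glm are the Poisson fits from the Chapter 3 exercise:
anova(polio.glm, polio1.glm, test = "Chisq")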
## Analysis of Deviance Table
##
## Model 1: cases ~ time
## Model 2: cases ~ time + I(cos(2 * pi * time/12)) + I(sin(2 * pi * time/12))
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 166 333.55
## 2 164 310.72 2 22.825 1.106e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- If we only have the summary of polio3.glm, we can use the null deviance and residual deviance in the summary to test the null model against polio3.glm.
Question 3
- From the table in Section 4.5.3, the test statistic for \(\mathcal{H}_0: \mathcal{M}_1\) and \(\mathcal{H}_1: \mathcal{M}_3\) is:
\[ \frac{D(\mathcal{M}_1, \mathcal{M}_3)}{\hat{\phi}} = \frac{8.172 - 5.785}{0.266} = 8.97. \]
Under \(\mathcal{H}_0\), this statistic is approximately \(\chi^{2}(2)\) distributed with the p-value:
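For example (a sketch):
pchisq(8.97, df = 2, lower.tail = FALSE)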
## [1] 0.01127689
Thus, we reject \(\mathcal{H}_0\) at 5% but not at 1% significance level.
The full model \(\mathcal{M}_7\) adds further parameters that do little to reduce the deviance while at the same time using up degrees of freedom. This is why the corresponding test fails to reject \(\mathcal{H}_0\).
This is because R uses different dispersion estimates for the two anova calls. The first table uses the estimated dispersion of \(\mathcal{M}_7\), while the second table uses the estimated dispersion of \(\mathcal{M}_2\) (the bigger model).
## [1] 0.02258198
## [1] 0.04149395
Question 4
- \(A = 1\), \(C = 80\), \(D = 72.807 + 5.2574 = 78.064\), \(B = 88.268 - 78.064 = 10.204\).
- We compare the models: \[ \begin{aligned} M_0 &: 1 + \texttt{age} + \texttt{sex} \\ M_1 &: 1 + \texttt{age} + \texttt{sex} + \texttt{ward} \end{aligned} \]
The test statistic is:
\[ \frac{D(M_0, M_1)}{\hat\phi} = \frac{72.807 - 71.719}{0.9856213} = 1.1035. \]
This is a likelihood ratio statistic, which approximately follows a \(\chi^2\) distribution under \(M_0\). The degrees of freedom correspond to the difference in the number of parameters between the two models, which is 2 here. The relevant quantile is:
\[ \chi^2_{2, 0.05} = 5.99\]
so we do not reject \(M_0\). There is no evidence that the inclusion of \(\texttt{ward}\) leads to a better model compared to the model with only \(\texttt{age}\) and \(\texttt{sex}\).
- The estimated dispersion is very close to 1, so the shape parameter (1/dispersion) is close to 1 as well. Since an Exponential GLM is the special case of a Gamma GLM with dispersion 1 (or equivalently, shape parameter 1), one may consider modelling the response with an Exponential rather than a Gamma distribution.
10.5 Chapter 5
Question 2
We will only compare polio.glm and polio1.glm here (continuing from the solutions of Exercise 2 in Chapter 4).
- The Pearson estimated dispersion of polio1.glm is:
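This can be computed, for example, by (a sketch, reusing the fitted polio1.glm object):
sum(residuals(polio1.glm, type = "pearson")^2) / df.residual(polio1.glm)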
## [1] 2.347206
The test statistic for \(\mathcal{H}_0\): polio.glm and \(\mathcal{H}_1\): polio1.glm is:
\[ \frac{D({\tt polio.glm}, {\tt polio1.glm})}{\hat{\phi}} = \frac{333.55 - 310.72}{2.347} = 9.727. \]
Under \(\mathcal{H}_0\), this statistic is approximately \(\chi^{2}(2)\) distributed with the p-value:
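For example (a sketch):
pchisq(9.727, df = 2, lower.tail = FALSE)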
## [1] 0.007723405
Thus, we reject \(\mathcal{H}_0\) at both 5% and 1% significance levels.
- We fit the quasi-Poisson models and compare them using the anova function. The p-value here is similar to the one obtained from part (a).
polio.quasi <- glm(cases ~ time, family=quasipoisson(link=log), data=uspolio)
polio1.quasi <- glm(cases ~ time + I(cos(2*pi*time/12)) + I(sin(2*pi*time/12)),
                    family=quasipoisson(link=log), data=uspolio)
anova(polio.quasi, polio1.quasi, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: cases ~ time
## Model 2: cases ~ time + I(cos(2 * pi * time/12)) + I(sin(2 * pi * time/12))
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 166 333.55
## 2 164 310.72 2 22.825 0.007737 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question 3
- Note that if we model \(Y \sim \text{Bin}(m, p)\), then \(E(Y) = mp\) and \(\mbox{Var}(Y) = mp(1-p)\). There will be overdispersion if the actual variance of \(Y\) is greater than \(mp(1-p)\).
Since \(Y = \sum_{i=1}^m Y_i\), the actual variance of \(Y\) is:
\[ \mbox{Var}(Y) = \sum_{i=1}^m \mbox{Var}(Y_i) + 2 \sum_{i<j} \mbox{Cov}(Y_i, Y_j). \]
We know that \(\mbox{Var}(Y_i) = p(1-p)\) and \(\mbox{Cov}(Y_i, Y_j) = \mbox{Corr}(Y_i, Y_j) \sqrt{\mbox{Var}(Y_i)} \sqrt{\mbox{Var}(Y_j)} = \tau p(1-p)\). Thus,
\[ \mbox{Var}(Y) = mp(1-p) + (m-1) m \, \tau p(1-p). \]
If \(\tau > 0\) then \(\mbox{Var}(Y) > mp(1-p)\), which implies overdispersion.
If \(\tau < 0\) then \(\mbox{Var}(Y) < mp(1-p)\), which implies underdispersion. However, we also need \(\mbox{Var}(Y) > 0\), which means \(\tau > - \frac{1}{m-1}\). This case rarely happens in a real poll, but it may happen when the respondents compete with each other to give an affirmative answer.
- Now suppose the number of respondents \(M\) is random, so that \(Y = \sum_{i=1}^M Y_i\) with \(E(M) = \mu_M\) and \(\text{Var}(M) = \sigma_M^2\). Using the law of total expectation, we have:
\[ E(Y) = E(E(Y | M)) = E(E(\sum_{i=1}^M Y_i | M)) = E(M p) = p E(M) = p \mu_M. \]
Using the law of total variance, the previous computation, and the computation from part (a), we have:
\[ \begin{aligned} \text{Var}(Y) &= E(\text{Var}(Y | M)) + \text{Var}(E(Y|M)) \\ &= E(Mp(1-p) + (M-1) M \tau p(1-p)) + \text{Var}(M p) \\ &= E(Mp(1-p)) + E(M^2 \tau p(1-p)) - E(M \tau p(1-p)) + \text{Var}(M p) \\ &= \mu_M p(1-p) + (\sigma_M^2 + \mu_M^2) \tau p(1-p) - \mu_M \tau p(1-p) + p^2 \sigma_M^2. \end{aligned} \]
There is overdispersion in the binomial model if \(\text{Var}(Y) > \mu_M p(1-p)\), which means:
\[ (\sigma_M^2 + \mu_M^2) \tau p(1-p) - \mu_M \tau p(1-p) + p^2 \sigma_M^2 > 0. \]
This is equivalent to:
\[ \tau > - \frac{p \sigma_M^2}{(1-p)(\sigma_M^2 + \mu_M^2 - \mu_M)}. \]
Note that part (a) is a special case of this condition with \(\sigma_M = 0\) and \(\mu_M = m\).
Question 4
a. i. The mean of an EDF is given by \(\mu = b'(\theta)\). In this case, we have \(\mu = -1/\theta\).
- The log-likelihood is given by:
\[ \ell(\beta) = \sum_{i=1}^n \left\{ \frac{y_i \theta_i - b(\theta_i)}{\phi} + c(y_i, \phi) \right\}. \]
In this case, \(b(\theta) = -\ln(-\theta) = \ln \mu\) and \(\phi = 1/\nu\). The log-likelihood therefore becomes:
\[ \begin{aligned} \ell(\beta) &= \sum_{i=1}^n \left\{ \frac{ - \frac{y_i}{\mu_i} - \ln \mu_i}{\phi} + \text{const} \right\} = -\frac{1}{\phi} \sum_{i=1}^n \left\{ \frac{y_i}{\mu_i} + \ln \mu_i + \text{const} \right\}. \end{aligned} \]
- In the saturated model, there is one parameter per observation, so maximization with respect to the separate \(\beta_i\)'s is the same as maximization with respect to the \(\mu_i\), leading to:
\[ \frac{\partial \ell}{\partial \mu_i} = \frac{1}{\phi} \left\{ \frac{y_i}{\mu_i^2} - \frac{1}{\mu_i} \right\} = 0, \]
so that its MLE becomes \(\mu_i = y_i\). Thus, the log-likelihood of the saturated model is:
\[ \ell_{\text{sat}} = -\frac{1}{\phi} \sum_{i=1}^n \left\{ 1 + \ln y_i + \text{const} \right\}. \]
- Using the definition of the deviance, we have:
\[ D = 2 \phi (\ell_{\text{sat}} - \ell(\beta)) = 2 \sum_{i=1}^n \left\{ \frac{y_i}{\mu_i} - \ln \frac{y_i}{\mu_i} - 1 \right\}. \]
b. The rough (quick-and-dirty) estimate of the dispersion is: \[ \hat{\phi} = \frac{D}{n-p} = \frac{5.109}{25-8} = 0.3005. \]
c. i. \[ C = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \in \mathbb{R}^{2 \times 8}. \]
ii. \[ C = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 1 & -1 & -1 \end{pmatrix} \in \mathbb{R}^{1 \times 8}. \]
iii. \[ C = \begin{pmatrix} 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{pmatrix} \in \mathbb{R}^{7 \times 8}. \]
d. The hypothesis is \(H_0: \beta_2 = 0\) and \(\beta_3 = 0\), or \(H_0 : C \beta = 0\).
- Recall that for the Wald test, under \(H_0 : C \beta = \gamma\):
\[ W = (C \hat\beta - \gamma)^T (C F^{-1}(\hat\beta) C^T)^{-1} (C \hat\beta - \gamma) \stackrel{a}{\sim} \chi^2(s), \]
where \(C F^{-1}(\hat\beta) C^T = \text{Var}(C \hat\beta)\). Then we can reject \(H_0\) at significance level \(\alpha\) if \(W > \chi^2_{s, \alpha}\).
For this question, \(\text{Var}(C \hat\beta) = \text{Var}((\hat\beta_2, \hat\beta_3)^T)\), which can be obtained from the appropriate submatrix of the full covariance:
\[ \text{Var}(C \hat\beta) = \begin{pmatrix} 0.00004767174 & -0.0001699029 \\ -0.0001699029 & 0.0660872869 \end{pmatrix}. \]
To calculate \(W\), we need its inverse:
\[ \text{Var}(C \hat\beta)^{-1} = \begin{pmatrix} 21170.77 & 54.43 \\ 54.43 & 15.27 \end{pmatrix}. \]
Now, the vector \(C \hat\beta = (0.009675, 0.05091)^T\), so that:
\[ W = \begin{pmatrix} 0.009675 & 0.05091 \end{pmatrix} \begin{pmatrix} 21170.77 & 54.43 \\ 54.43 & 15.27 \end{pmatrix} \begin{pmatrix} 0.009675 \\ 0.05091 \end{pmatrix} = 2.0749. \]
Since \(\chi^2_{2, 0.05} = 5.99\), the conclusion is not to reject \(H_0\).
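The arithmetic can be checked with a short sketch using only the numbers quoted above:
Cb <- c(0.009675, 0.05091)
V  <- matrix(c(0.00004767174, -0.0001699029,
               -0.0001699029,  0.0660872869), nrow = 2)
solve(V)                           # matches the stated inverse
W  <- drop(t(Cb) %*% solve(V) %*% Cb)
W                                  # about 2.07
qchisq(0.95, df = 2)               # critical value, about 5.99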
- Recall that for the likelihood ratio test,
\[ \Lambda = 2 (\ell (\hat\beta) - \ell(\hat\beta_0)) = \frac{D_0 - D}{\hat\phi} \stackrel{a}{\sim} \chi^2(s), \]
where \(\hat\beta\) is the MLE of the full model, \(\hat\beta_0\) is the MLE of the reduced model under \(H_0\), \(D_0\) is the deviance of the reduced model, and \(D\) is the deviance of the full model. Then we can reject \(H_0\) at significance level \(\alpha\) if \(\Lambda > \chi^2_{s, \alpha}\).
The deviance of the full model is, according to the R code:
\[ D = 2 \phi (\ell_{\text{sat}} - \ell (\hat\beta)) = 5.109. \]
The deviance of the reduced model with \(\beta_2 = \beta_3 = 0\) is:
\[ D_0 = 2 \phi (\ell_{\text{sat}} - \ell (\hat\beta_0)) = 5.604. \]
Therefore,
\[ \Lambda = \frac{5.604 - 5.109}{0.3005} = 1.6473. \]
Since \(\chi^2_{2, 0.05} = 5.99\), the conclusion is again not to reject \(H_0\); i.e., the full model is not sufficiently better than the reduced model.
10.6 Chapter 6
Question 2
- For stationary AR(1) processes where \(|\alpha|<1\), one has \(Var(y_t) \equiv V\) for some \({ V>0 }\). (We can also easily show that \(V=\sigma^2/(1-\alpha^2)\), even though this result is not necessary to solve this question).
We have:
\[ \begin{aligned} \mbox{Corr}(y_t, y_{t+1}) &= \mbox{Corr}(y_t, \alpha y_{t}+\epsilon_{t+1}) = \frac{\mbox{Cov}(y_t, \alpha y_{t} + \epsilon_{t+1})}{\sqrt{\mbox{Var}(y_t)}\sqrt{\mbox{Var}(y_{t+1})}} \\ &= \frac{1}{\mbox{Var}(y_t)}(\alpha \mbox{Var}(y_t)+ \mbox{Cov}(y_{t}, \epsilon_{t+1})) = \frac{1}{V}(\alpha V) = \alpha. \end{aligned} \]
- \(\alpha=0.3034688\).
Question 3
- With \(\boldsymbol{\mu}= \boldsymbol{X}\boldsymbol{\beta}\), we have the mean estimate \(\hat{\boldsymbol{\mu}}= \boldsymbol{X}\hat{\boldsymbol{\beta}}\), and so
\[ \begin{aligned} \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{Y}-\hat{\boldsymbol{\mu}}) &= \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{Y}- \boldsymbol{X}\hat{\boldsymbol{\beta}}) \\ &= \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{Y}- \boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{Y}) \\ &= \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{Y}-(\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})(\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{Y} \\ &= \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{Y}- \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{Y} \\ &= 0. \end{aligned} \]
- Note that for any non-random matrix \(\boldsymbol{A}\), we have \(\mbox{Var}(\boldsymbol{A}\boldsymbol{Y}) = \boldsymbol{A} \mbox{Var}(\boldsymbol{Y}) \boldsymbol{A}^T\). Also note that \(\mbox{Var}(\boldsymbol{Y})=\boldsymbol{\Sigma}\). Thus,
\[ \begin{aligned} \mbox{Var}(\hat{\boldsymbol{\beta}}) &= (\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1} \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\mbox{Var}(\boldsymbol{Y})\boldsymbol{\Sigma}^{-1}\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1} \\ &= (\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1} \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1} \\ &= (\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1}. \end{aligned} \]
- In this case, one can write \(\boldsymbol{\Sigma}= \phi\tilde{\boldsymbol{\Sigma}}\), where the dispersion parameter \(\phi\) is unknown and the matrix \(\tilde{\boldsymbol{\Sigma}}\) is known. Hence, one has \(\boldsymbol{\Sigma}^{-1}= \phi^{-1}\tilde{\boldsymbol{\Sigma}}^{-1}\), and so
\[ \begin{aligned} \hat{\boldsymbol{\beta}} &= (\boldsymbol{X}^T\phi^{-1}\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\phi^{-1}\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{Y} \\ &= \phi \phi^{-1}(\boldsymbol{X}^T\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{Y} \\ &= (\boldsymbol{X}^T\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{Y}. \end{aligned} \]
This does not depend on \(\phi\) and hence does not require its estimation.
- Since we do not want to model temporal autocorrelations, we can use the exchangeable correlation structure. Thus, one way of writing \(\boldsymbol{\Sigma}\) is:
\[ \begin{pmatrix} \sigma^2 & \alpha \sigma^2 & \alpha \sigma^2 & & & \\ \alpha \sigma^2 & \sigma^2 & \alpha \sigma^2 & & & \\ \alpha \sigma^2 & \alpha \sigma^2 & \sigma^2 & & & \\ &&& \sigma^2 & \alpha \sigma^2 & \alpha \sigma^2\\ &&& \alpha \sigma^2 & \sigma^2 & \alpha \sigma^2\\ &&& \alpha \sigma^2 & \alpha \sigma^2 & \sigma^2 \end{pmatrix} \]
where \(\alpha<1\) is some constant. (If one wants to make a link to part (c), which was not requested, one could observe that \(\phi=\sigma^2\) is just the error variance and then \(\alpha\) plays the role of the intra-class correlation.)
Question 4
- The naive variance estimate is:
\[ \text{Var}(\hat\beta) = \frac{\hat\sigma^2 (1 - \hat\alpha^2)}{\sum_i \left( a_i^2 + b_i^2 - 2 \hat\alpha \, a_i b_i \right)} \]
where \(a_i = x_{i1} h'(\hat\beta x_{i1})\) and \(b_i = x_{i2} h'(\hat\beta x_{i2})\).
- If \(a_i = b_i\), the formula gives \(\text{Var}(\hat\beta) = \hat\sigma^2 (1 + \hat\alpha)/(\sum_i 2a_i^2)\). So, \(\text{Var}(\hat\beta)\) increases as \(\hat\alpha\) increases from 0.
- If \(a_i = 0\), the formula gives \(\text{Var}(\hat\beta) = \hat\sigma^2 (1 - \hat\alpha^2)/(\sum_i b_i^2)\). So, \(\text{Var}(\hat\beta)\) decreases as \(\hat\alpha\) increases from 0 to 1.
Question 5
- From the notations for \(\boldsymbol{X}\), \(\boldsymbol{\Sigma}\), and \(\boldsymbol{Y}\) in Section 6.3, we can easily verify the expressions for \(\hat{\boldsymbol\beta}\) and \(\text{Var}(\hat{\boldsymbol\beta})\). Thus,
\[ E(\hat\beta) = \big( \sum_i \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} \boldsymbol{x}_i \big)^{-1} \big( \sum_i \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} E ( \boldsymbol{Y}_i ) \big) = \big( \sum_i \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} \boldsymbol{x}_i \big)^{-1} \big( \sum_i \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} \boldsymbol{x}_i \big) \, \beta = \beta. \]
- We have \(\text{Var}(\hat{\beta}) = \big( \sum_i \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} \boldsymbol{x}_i \big)^{-1}\) and
\[ \begin{aligned} \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} \boldsymbol{x}_i &= \left(\begin{array}{ccc} x_{i1} & \dots & x_{im} \end{array}\right) \, c \left(\begin{array}{cccc} 1 & \phi & \dots & \phi \\ \phi & 1 & \dots & \phi \\ \vdots & \vdots & \ddots & \vdots \\ \phi & \phi & \dots & 1 \end{array}\right) \left(\begin{array}{c} x_{i1} \\ \vdots \\ x_{im} \end{array}\right) \\ &= c \, \left(\begin{array}{ccc} x_{i1} & \dots & x_{im} \end{array}\right) \left(\begin{array}{c} \phi \sum_j x_{ij} + (1-\phi) x_{i1} \\ \vdots \\ \phi \sum_j x_{ij} + (1-\phi) x_{im} \end{array}\right) \\ &= c \, \sum_k x_{ik} (\phi \sum_j x_{ij} + (1-\phi) x_{ik}) \\ &= c \left( \sum_j x_{ij}^2 + \phi \Big[ \big(\sum_j x_{ij}\big)^2 - \sum_j x_{ij}^2 \Big] \right). \end{aligned} \]
Thus,
\[ \begin{aligned} \text{Var}(\hat{\beta}) &= \big( \sum_i \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} \boldsymbol{x}_i \big)^{-1} = \left( \sum_i c \, \Big\{ \sum_j x_{ij}^2 + \phi \Big[ \big(\sum_j x_{ij}\big)^2 - \sum_j x_{ij}^2 \Big] \Big\} \right)^{-1} \\ &= \frac{c^{-1}}{\sum_i \Big\{ \sum_j x_{ij}^2 + \phi \Big[ \big(\sum_j x_{ij}\big)^2 - \sum_j x_{ij}^2 \Big] \Big\}} \\ &= \frac{\sigma^2(1 + (m-1)\phi\rho)}{\sum_i \Big\{ \sum_j x_{ij}^2 + \phi \Big[ \big(\sum_j x_{ij}\big)^2 - \sum_j x_{ij}^2 \Big] \Big\}}. \end{aligned} \]
- If the clustering is ignored, then \(\boldsymbol{Y} \sim \mathcal{N}(\boldsymbol{X} \beta, \sigma^2 \boldsymbol{I})\). Thus, \(b = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{Y}\) and \(\text{Var}(b) = \sigma^2 (\boldsymbol{X}^T \boldsymbol{X})^{-1} = \sigma^2/\sum_i \sum_j x_{ij}^2.\)
10.7 Chapter 7
Question 2
a. i. We have: \(\boldsymbol{\beta} = (\alpha, \beta, \gamma)^T, \; \boldsymbol{u} = (u_1, \ldots, u_n)^T\),
\[ \boldsymbol{X} = \begin{pmatrix} 1 & x_{11} & x_{11}^2 \\ 1 & x_{12} & x_{12}^2 \\ \vdots & \vdots & \vdots \\ 1 & x_{n n_n} & x_{n n_n}^2 \\ \end{pmatrix} \in \mathbb{R}^{N \times 3},\, \quad \boldsymbol{Z} = \begin{pmatrix} 1 & & \\ \vdots & & \\ 1 & & \\ & 1 & \\ & \vdots & \\ & 1 & \\ & & \ddots \end{pmatrix} \in \mathbb{R}^{N \times n} \]
with \(N= \sum_{i=1}^n n_i\) being the total sample size.
\[ \begin{aligned} E(y_{ij}) &= \alpha+ \beta x_{ij}+ \gamma x_{ij}^2 \\ \mbox{Var}(y_{ij})&= \mbox{Var}(u_i+\epsilon_{ij})= \sigma_u^2+\sigma^2. \end{aligned} \]
\[ \begin{aligned} \mbox{Cov}(y_{ij}, y_{ik}) &= E\left((y_{ij} -E(y_{ij})) (y_{ik} -E(y_{ik})) \right) \\ &= E((u_i+\epsilon_{ij})(u_i+\epsilon_{ik})) \\ &= E(u_i^2)+0 \\ &= \sigma_u^2 \\ \mbox{Corr}(y_{ij}, y_{ik})&= \frac{\sigma_u^2}{\sigma^2+\sigma_u^2}. \end{aligned} \]
- In this model, the mean response follows the same quadratic shape for all clusters, but each cluster starts at a different vertical level because of its random intercept \(u_i\).
b. i. We have: \(\boldsymbol{\beta}= (\beta), \; \boldsymbol{u}= (v_1, \ldots, v_n)^T\),
\[ \boldsymbol{X} = \begin{pmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{21} \\ x_{22} \\ \vdots \\ x_{n1} \\ x_{n2} \\ \vdots \end{pmatrix} \in \mathbb{R}^{N \times 1},\, \quad \boldsymbol{Z} = \begin{pmatrix} x_{11} & & & \\ x_{12} & & & \\ \vdots & & & \\ & x_{21} & & \\ & x_{22} & & \\ & \vdots & & \\ & & \ddots & \\ & & & x_{n1} \\ & & & x_{n2} \\ & & & \vdots \\ \end{pmatrix} \in \mathbb{R}^{N \times n}. \]
\[ \begin{aligned} E(y_{ij}) &= \beta x_{ij} \\ \mbox{Var}(y_{ij})&= \mbox{Var}(v_i x_{ij} + \epsilon_{ij}) = x_{ij}^2 \sigma_v^2 + \sigma^2. \end{aligned} \]
\[ \begin{aligned} \mbox{Cov}(y_{ij}, y_{ik}) &= E\left((y_{ij} -E(y_{ij})) (y_{ik} -E(y_{ik})) \right) \\ &= E((v_i x_{ij} + \epsilon_{ij})(v_i x_{ik} + \epsilon_{ik})) \\ &= E(v_i^2) x_{ij} x_{ik} + 0 \\ &= \sigma_v^2 x_{ij} x_{ik}. \\ \mbox{Corr}(y_{ij}, y_{ik}) &= \frac{\sigma_v^2 x_{ij} x_{ik}}{\sqrt{x_{ij}^2 \sigma_v^2 + \sigma^2} \sqrt{x_{ik}^2 \sigma_v^2 + \sigma^2}}. \end{aligned} \]
- In this model, each cluster is roughly a line through the origin, with a cluster-specific slope randomly distributed around the value \(\beta\).
Question 3
We have
\[ \tilde{\boldsymbol{Q}}= \left(\begin{array}{cc} 24.741^2 & 0.07 \times 24.741 \times 5.922 \\ 0.07 \times 24.741 \times 5.922 & 5.922^2 \end{array}\right)= \left(\begin{array}{cc} 612.12 & 10.26 \\ 10.26 & 35.07 \end{array}\right). \]
Furthermore, since we have \(n=18\) subjects, the \(36 \times 36\) matrix \(\boldsymbol{Q}\) is given by
\[ \boldsymbol{Q}= \left(\begin{array}{cccccccc} 612.12 & 10.26 &&&&&&\\ 10.26 & 35.07 &&&&&&\\ && 612.12 & 10.26 &&&&\\ && 10.26 & 35.07 &&&&\\ &&&& \ddots & &&\\ &&&& &\ddots &&\\ &&&&&& 612.12 & 10.26\\ &&&&&&10.26 & 35.07\\ \end{array}\right). \]
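A sketch of how \(\boldsymbol{Q}\) could be assembled in R (using only the numbers above):
Qtilde <- matrix(c(612.12, 10.26, 10.26, 35.07), nrow = 2)
Q <- kronecker(diag(18), Qtilde)   # 36 x 36 block-diagonal matrix
dim(Q)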