Chair of Spatial Data Science and Statistical Learning

Lecture 6 Bayesian Modelling

6.1 Overview

In this part of the lecture, we delve deeper into Bayesian inference by exploring different modeling techniques. Using the linear model as an example, we present approaches for deriving the full conditionals of the parameters involved. Our main focus is the Gibbs Sampling algorithm. To help clarify its components and hyperparameters, we include an interactive Shiny app.

6.2 Example: Linear Model

6.2.1 Setup

For the Bayesian modeling approaches, we take the linear model that was introduced in Lecture 3.
\[ \begin{eqnarray*} \boldsymbol{y} &=& \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\\ \varepsilon_i&\overset{iid}{\sim}&\N(0,\sigma^2), \quad i = 1,\ldots,n\\ Y_i&\overset{ind}{\sim}&\N(\beta_0+\sum_{j=1}^{p}\beta_jx_{ij},\sigma^2)\\ \boldsymbol{Y}&\sim&\MVN(\boldsymbol{X}\boldsymbol{\beta}, \sigma^2\boldsymbol{I}) \end{eqnarray*} \]
Reminder: Dimensions

\[\begin{eqnarray} \boldsymbol{Y} = \underbrace{\left( \begin{array}{c} y_1 \\ \vdots \\ y_n \end{array} \right)}_{n\times 1} \end{eqnarray}\]

\[\begin{eqnarray} \boldsymbol{\mu_Y} = \boldsymbol{X\beta} = \underbrace{\left(\begin{array}{cccc} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{array} \right)}_{n \times (p+1)} \underbrace{\left(\begin{array}{c} \beta_0 \\ \vdots \\ \beta_p \end{array} \right)}_{(p+1) \times 1} \end{eqnarray}\]

\[\begin{eqnarray} \boldsymbol{\Sigma_Y} = \sigma^2\boldsymbol{I} = \underbrace{\left(\begin{array}{ccc} \sigma^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma^2 \end{array} \right)}_{n \times n} \end{eqnarray}\]
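
The following R sketch simulates data from this model; the values for \(n\), \(p\), \(\boldsymbol{\beta}\), and \(\sigma^2\) are purely illustrative and not part of the lecture.

```r
# Simulate data from the linear model (illustrative values)
set.seed(1)
n <- 100                      # number of observations
p <- 2                        # number of covariates (without intercept)
beta <- c(1, 2, -0.5)         # (beta_0, beta_1, beta_2)
sigma2 <- 1.5                 # error variance

X <- cbind(1, matrix(rnorm(n * p), n, p))       # n x (p+1) design matrix
y <- X %*% beta + rnorm(n, sd = sqrt(sigma2))   # response vector
```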

6.2.2 Bayesian Considerations

In Bayesian inference we assume that parameters have distributions. In the linear model the random parameters for which we need distributions are \(\bb\) and \(\sigma^2\). To find suitable distributions, we have to consider the domain for each of the parameters:

  1. The parameter vector \(\bb\) is defined on the entire real line: \(-\infty < \beta_j < \infty\). Since the vector may contain more than one random variable, the Multivariate Normal Distribution is a good choice.
  2. The variance must be positive: \(\sigma^2 > 0\), so its distribution must have support on the positive real numbers. Candidates include the Chi-Squared, Gamma, and Inverse Gamma distributions. We choose the latter since it is conjugate to the (Multivariate) Normal likelihood; its density is recalled below.
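
As a reminder, in the common shape-scale parameterisation (which we assume here), the \(\IG(a_0,b_0)\) density is positive only for \(\sigma^2 > 0\):

\[ p(\sigma^2) = \frac{b_0^{a_0}}{\Gamma(a_0)}\,(\sigma^2)^{-a_0-1}\exp\!\left(-\frac{b_0}{\sigma^2}\right), \quad \sigma^2 > 0. \]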

6.2.3 Bayesian Ingredients

Every Bayesian Model needs a well-defined Likelihood and Prior distribution(s) for the parameter(s):

Likelihood:

  • \(\boldsymbol{Y}\sim\MVN(\boldsymbol{X}\boldsymbol{\beta}, \sigma^2\boldsymbol{I})\)

Priors:

  • \(\bb|\sigma^2\sim\MVN(\boldsymbol{m},\sigma^2\boldsymbol{M})\)
  • \(\sigma^2\sim \IG(a_0,b_0)\)
Question: What would an uninformative prior for the coefficient vector look like?

When we have no prior knowledge about the possible values of the coefficient vector \(\bb\), a typical approach is to set:

\[ \boldsymbol{m} = \left(\begin{array}{c} 0 \\ \vdots \\ 0 \end{array} \right) \] and

\[ \boldsymbol{M} = \left(\begin{array}{ccc} \psi & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \psi \end{array} \right) \] where \(\psi\) is an arbitrarily large number, such as 100,000. This choice effectively results in a flat prior distribution for \(\bb\).
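
In code, such a weakly informative prior specification could look as follows; this is only a sketch, and the values for \(\psi\), \(a_0\), and \(b_0\) are illustrative.

```r
# Prior hyperparameters (illustrative values)
p   <- 2                     # number of covariates
m   <- rep(0, p + 1)         # prior mean of beta: zero vector
psi <- 1e5                   # large prior variance -> effectively flat
M   <- diag(psi, p + 1)      # prior covariance scale of beta
a0  <- 0.001                 # vague Inverse Gamma shape
b0  <- 0.001                 # vague Inverse Gamma scale
```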

6.3 Bayesian Inference

Overview:

  • analytical calculation of the posterior distribution
  • Markov Chain Monte Carlo (MCMC) techniques:
      • Gibbs Sampler
      • Metropolis-Hastings Sampler

6.3.1 Gibbs Sampler


6.3.1.1 General Algorithm
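
In general, the Gibbs sampler constructs a Markov chain whose stationary distribution is the joint posterior by repeatedly drawing each parameter (or parameter block) from its full conditional distribution, given the current values of all other parameters. For a parameter vector \(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)\), iteration \(t\) consists of:

  1. Draw \(\theta_1^{(t)} \sim p(\theta_1 \mid \theta_2^{(t-1)}, \ldots, \theta_k^{(t-1)}, \boldsymbol{y})\).
  2. Draw \(\theta_2^{(t)} \sim p(\theta_2 \mid \theta_1^{(t)}, \theta_3^{(t-1)}, \ldots, \theta_k^{(t-1)}, \boldsymbol{y})\).
  3. Continue until \(\theta_k^{(t)} \sim p(\theta_k \mid \theta_1^{(t)}, \ldots, \theta_{k-1}^{(t)}, \boldsymbol{y})\).

After discarding an initial burn-in phase, the draws \(\boldsymbol{\theta}^{(t)}\) are treated as (approximately) dependent samples from the joint posterior.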

6.3.1.2 Example: Multivariate Normal Distribution

6.3.1.2.1 Full Conditional: \(\bb\)
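
With the Likelihood and Priors from Section 6.2.3, a standard conjugate calculation (sketched here without the full derivation) yields

\[ \bb \mid \sigma^2, \boldsymbol{y} \sim \MVN\left(\boldsymbol{m}^*, \sigma^2\boldsymbol{M}^*\right), \qquad \boldsymbol{M}^* = \left(\boldsymbol{M}^{-1} + \boldsymbol{X}^\top\boldsymbol{X}\right)^{-1}, \qquad \boldsymbol{m}^* = \boldsymbol{M}^*\left(\boldsymbol{M}^{-1}\boldsymbol{m} + \boldsymbol{X}^\top\boldsymbol{y}\right). \]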
6.3.1.2.2 Full Conditional: \(\sigma^2\)
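
Because the prior covariance of \(\bb\) is scaled by \(\sigma^2\), the quadratic form of the prior also enters the full conditional of \(\sigma^2\) (again only the standard result, stated without derivation):

\[ \sigma^2 \mid \bb, \boldsymbol{y} \sim \IG\left(a_0 + \frac{n+p+1}{2},\; b_0 + \frac{1}{2}\left[(\boldsymbol{y}-\boldsymbol{X}\bb)^\top(\boldsymbol{y}-\boldsymbol{X}\bb) + (\bb-\boldsymbol{m})^\top\boldsymbol{M}^{-1}(\bb-\boldsymbol{m})\right]\right). \]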
6.3.1.2.3 Application: Algorithm
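
A minimal R sketch of the resulting Gibbs sampler is given below. It assumes the simulated data and the prior hyperparameters from the sketches above; the function and variable names are illustrative, not part of the lecture code.

```r
# Gibbs sampler for the Bayesian linear model (sketch)
gibbs_lm <- function(y, X, m, M, a0, b0, n_iter = 5000, sigma2_init = 1) {
  n <- nrow(X); k <- ncol(X)          # k = p + 1 coefficients
  M_inv <- solve(M)
  XtX   <- crossprod(X)               # X'X
  Xty   <- crossprod(X, y)            # X'y

  beta_draws   <- matrix(NA_real_, n_iter, k)
  sigma2_draws <- numeric(n_iter)
  sigma2 <- sigma2_init

  for (it in seq_len(n_iter)) {
    ## 1. beta | sigma2, y  ~  MVN(m_star, sigma2 * M_star)
    M_star <- solve(M_inv + XtX)
    m_star <- M_star %*% (M_inv %*% m + Xty)
    beta   <- as.vector(m_star + t(chol(sigma2 * M_star)) %*% rnorm(k))

    ## 2. sigma2 | beta, y  ~  IG(a_new, b_new)
    resid  <- y - X %*% beta
    a_new  <- a0 + (n + k) / 2
    b_new  <- b0 + 0.5 * (sum(resid^2) +
                          t(beta - m) %*% M_inv %*% (beta - m))
    sigma2 <- 1 / rgamma(1, shape = a_new, rate = as.numeric(b_new))

    beta_draws[it, ] <- beta
    sigma2_draws[it] <- sigma2
  }
  list(beta = beta_draws, sigma2 = sigma2_draws)
}

# Usage with the objects defined earlier:
# draws <- gibbs_lm(y, X, m, M, a0, b0)
# colMeans(draws$beta[-(1:1000), ])    # posterior means after burn-in
```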
6.3.1.2.4 Interactive Shiny App