3.4 Multivariate linear regression: The conjugate normal-normal/inverse Wishart model
Let’s study the multivariate regression setting in which there are \(N\)-dimensional vectors \({\boldsymbol{y}}_m\), for \(m = 1, 2, \dots, M\), such that \({\boldsymbol{y}}_m = {\boldsymbol{X}} \boldsymbol{\beta}_m + \boldsymbol{\mu}_m\). Here, \({\boldsymbol{X}}\) is the \(N\times K\) matrix of common regressors, and \(\boldsymbol{\mu}_m\) is the \(N\)-dimensional vector of stochastic errors of the \(m\)-th equation. We assume that \({\boldsymbol{U}} = [\boldsymbol{\mu}_1 \ \boldsymbol{\mu}_2 \ \dots \ \boldsymbol{\mu}_M] \sim MN_{N,M}({\boldsymbol{0}}, {\boldsymbol{I}}_N, {\boldsymbol{\Sigma}})\), a matrix variate normal distribution where \(\boldsymbol{\Sigma}\) is the covariance matrix of each row of \({\boldsymbol{U}}\), \(i = 1, 2, \dots, N\), and the rows are mutually independent. Consequently, \(vec({\boldsymbol{U}}) \sim N_{N \times M}({\boldsymbol{0}}, \boldsymbol{\Sigma} \otimes {\boldsymbol{I}}_N)\).23
This framework can be written in matrix form \[\begin{align*} \underbrace{ \begin{bmatrix} y_{11} & y_{12} & \dots & y_{1M}\\ y_{21} & y_{22} & \dots & y_{2M}\\ \vdots & \vdots & \dots & \vdots\\ y_{N1} & y_{N2} & \dots & y_{NM}\\ \end{bmatrix}}_{\boldsymbol{Y}} &= \underbrace{\begin{bmatrix} x_{11} & x_{12} & \dots & x_{1K}\\ x_{21} & x_{22} & \dots & x_{2K}\\ \vdots & \vdots & \dots & \vdots\\ x_{N1} & x_{N2} & \dots & x_{NK}\\ \end{bmatrix}}_{\boldsymbol{X}} \underbrace{ \begin{bmatrix} \beta_{11} & \beta_{12} & \dots & \beta_{1M}\\ \beta_{21} & \beta_{22} & \dots & \beta_{2M}\\ \vdots & \vdots & \dots & \vdots\\ \beta_{K1} & \beta_{K2} & \dots & \beta_{KM}\\ \end{bmatrix}}_{\boldsymbol{B}}\\ &+ \underbrace{\begin{bmatrix} \mu_{11} & \mu_{12} & \dots & \mu_{1M}\\ \mu_{21} & \mu_{22} & \dots & \mu_{2M}\\ \vdots & \vdots & \dots & \vdots\\ \mu_{N1} & \mu_{N2} & \dots & \mu_{NM}\\ \end{bmatrix}}_{\boldsymbol{U}}. \end{align*}\]
Therefore, \({\boldsymbol{Y}}\sim N_{N\times M}({\boldsymbol{X}}{\boldsymbol{B}},\boldsymbol{\Sigma}\otimes {\boldsymbol{I}}_N)\), \[\begin{align*} p({\boldsymbol{Y}}\mid {\boldsymbol{B}},{\boldsymbol{\Sigma}}, {\boldsymbol{X}})&\propto |{{\boldsymbol \Sigma}}|^{-N/2}\exp\left\lbrace -\frac{1}{2}tr\left[({\boldsymbol{Y}}-{\boldsymbol{X}}{\boldsymbol{B}})^{\top}({\boldsymbol{Y}}-{\boldsymbol{X}}{\boldsymbol{B}}){{\boldsymbol \Sigma}}^{-1}\right]\right\rbrace \\ &=|{{\boldsymbol \Sigma}}|^{-N/2}\exp\left\lbrace -\frac{1}{2}tr\left[\left({\boldsymbol{S}}+({\boldsymbol{B}}-\widehat{\boldsymbol{B}})^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}({\boldsymbol{B}}-\widehat{\boldsymbol{B}})\right){{\boldsymbol \Sigma}}^{-1}\right]\right\rbrace, \end{align*}\]
where \({\boldsymbol{S}}= ({\boldsymbol{Y}}-{\boldsymbol{X}}\widehat{\boldsymbol{B}})^{\top}({\boldsymbol{Y}}-{\boldsymbol{X}}\widehat{\boldsymbol{B}})\), \(\widehat{\boldsymbol{B}}= ({\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}{\boldsymbol{X}}^{\top}{\boldsymbol{Y}}\) (see Exercise 9).
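As a quick numerical sanity check (not part of the original text), the following Python/NumPy sketch verifies the decomposition of the exponent used above, \(({\boldsymbol{Y}}-{\boldsymbol{X}}{\boldsymbol{B}})^{\top}({\boldsymbol{Y}}-{\boldsymbol{X}}{\boldsymbol{B}})={\boldsymbol{S}}+({\boldsymbol{B}}-\widehat{\boldsymbol{B}})^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}({\boldsymbol{B}}-\widehat{\boldsymbol{B}})\), for an arbitrary coefficient matrix; all data are simulated and the dimensions are illustrative.

```python
# Numerical check (illustrative, simulated data) of the identity
# (Y - XB)'(Y - XB) = S + (B - B_hat)' X'X (B - B_hat) for any B (Exercise 9).
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 50, 3, 2
X = rng.normal(size=(N, K))
Y = rng.normal(size=(N, M))
B = rng.normal(size=(K, M))                   # arbitrary coefficient matrix

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)     # OLS estimator
S = (Y - X @ B_hat).T @ (Y - X @ B_hat)       # residual sum of squares matrix

lhs = (Y - X @ B).T @ (Y - X @ B)
rhs = S + (B - B_hat).T @ (X.T @ X) @ (B - B_hat)
print(np.allclose(lhs, rhs))                  # True
```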
The conjugate prior for this model is \(\pi({\boldsymbol{B}},{\boldsymbol{\Sigma}})=\pi({\boldsymbol{B}}\mid {\boldsymbol{\Sigma}})\pi({\boldsymbol{\Sigma}})\) where \({\boldsymbol{B}}\mid {\boldsymbol \Sigma}\sim N_{K\times M}({\boldsymbol{B}}_{0},{\boldsymbol{V}}_{0},{\boldsymbol{\Sigma}})\) and \({\boldsymbol{\Sigma}}\sim IW({\boldsymbol{\Psi}}_{0},\alpha_{0})\), that is, \[\begin{align*} \pi ({\boldsymbol{B}},{\boldsymbol{\Sigma}})\propto &\left|{\boldsymbol{\Sigma}} \right|^{-K/2}\exp\left\lbrace -\frac{1}{2}tr\left[({\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}{\boldsymbol{V}}_{0}^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_{0}){\boldsymbol \Sigma}^{-1}\right]\right\rbrace \\ & \times \left|{\boldsymbol \Sigma} \right|^{-(\alpha_{0}+M+1)/2}\exp\left\lbrace -\frac{1}{2}tr \left[ {\boldsymbol{\Psi}}_{0} {\boldsymbol \Sigma}^{-1}\right] \right\rbrace. \end{align*}\]
The posterior distribution is given by \[\begin{align*} \pi({\boldsymbol{B}},{\boldsymbol{\Sigma}}\mid {\boldsymbol{Y}},{\boldsymbol{X}})&\propto p({\boldsymbol{Y}}\mid {\boldsymbol{B}},{\boldsymbol{\Sigma}},{\boldsymbol{X}}) \pi({\boldsymbol{B}}\mid {\boldsymbol \Sigma})\pi({\boldsymbol{\Sigma}})\\ &\propto \left|{\boldsymbol{\Sigma}} \right|^{-\frac{N+K+\alpha_{0}+M+1}{2}}\\ &\times\exp\left\lbrace -\frac{1}{2}tr\left[(\boldsymbol{\Psi}_{0}+{\boldsymbol{S}} +({\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}{\boldsymbol{V}}_{0}^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_{0})\right.\right.\\ &\left.\left. +({\boldsymbol{B}}-\widehat{\boldsymbol{B}})^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}({\boldsymbol{B}}-\widehat{\boldsymbol{B}}))\boldsymbol{\Sigma}^{-1}\right]\right\rbrace . \end{align*}\] Completing the squares on \({\boldsymbol{B}}\) and collecting the remaining terms in the bracket yields
\[\begin{align*} {\boldsymbol{\Psi}}_{0}+{\boldsymbol{S}} +({\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}{\boldsymbol{V}}_{0}^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_{0})+({\boldsymbol{B}}-\widehat{\boldsymbol{B}})^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}({\boldsymbol{B}}-\widehat{\boldsymbol{B}}) & = ({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)+{\boldsymbol{\Psi}}_n, \end{align*}\]
where \[\begin{align*} {\boldsymbol{B}}_n = &({\boldsymbol{V}}_{0}^{-1}+{\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}({\boldsymbol{V}}_{0}^{-1}{\boldsymbol{B}}_{0}+{\boldsymbol{X}}^{\top}{\boldsymbol{Y}})=({\boldsymbol{V}}_{0}^{-1}+{\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}({\boldsymbol{V}}_{0}^{-1}{\boldsymbol{B}}_{0}+{\boldsymbol{X}}^{\top}{\boldsymbol{X}}\widehat{\boldsymbol{B}}),\\ {\boldsymbol{V}}_n = &({\boldsymbol{V}}_{0}^{-1}+{\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1},\\ {\boldsymbol{\Psi}}_n= &{\boldsymbol{\Psi}}_{0}+{\boldsymbol{S}}+{\boldsymbol{B}}_{0}^{\top}{\boldsymbol{V}}_{0}^{-1}{\boldsymbol{B}}_{0}+\widehat{\boldsymbol{B}}^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}\widehat{\boldsymbol{B}}-{\boldsymbol{B}}_n^{\top}{\boldsymbol{V}}_n^{-1}{\boldsymbol{B}}_n. \end{align*}\] Thus, the posterior distribution can be written as \[\begin{align*} \pi({\boldsymbol{B}},{\boldsymbol \Sigma}\mid {\boldsymbol{Y}}, {\boldsymbol{X}})\propto &\left|{\boldsymbol \Sigma} \right|^{-K/2}\exp\left\lbrace -\frac{1}{2} tr\left[({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n) {\boldsymbol \Sigma}^{-1}\right]\right\rbrace \\ \times & \left|{\boldsymbol \Sigma} \right|^{-\frac{N+\alpha_{0}+M+1}{2}}\exp\left\lbrace -\frac{1}{2} tr \left[ {\boldsymbol{\Psi}}_n{\boldsymbol \Sigma}^{-1}\right] \right\rbrace . \end{align*}\] That is \(\pi({\boldsymbol{B}},{\boldsymbol \Sigma}\mid {\boldsymbol{Y}}, {\boldsymbol{X}})=\pi ({\boldsymbol{B}}\mid {\boldsymbol \Sigma},{\boldsymbol{Y}},{\boldsymbol{X}})\pi({\boldsymbol \Sigma}\mid {\boldsymbol{Y}},{\boldsymbol{X}})\) where \({\boldsymbol{B}}\mid {\boldsymbol \Sigma},{\boldsymbol{Y}}, {\boldsymbol{X}} \sim N_{K\times M}({\boldsymbol{B}}_n,{\boldsymbol{V}}_n,{\boldsymbol \Sigma})\) and \({\boldsymbol \Sigma}\mid {\boldsymbol{Y}},{\boldsymbol{X}} \sim IW({\boldsymbol{\Psi}}_n,{\alpha}_n)\), \(\alpha_n= N+\alpha_{0}\). Observe again that we can write down the posterior mean as a weighted average between prior and sample information such that \({\boldsymbol{V}}_0\rightarrow\infty\) implies \({\boldsymbol{B}}_n\rightarrow\hat{{\boldsymbol{B}}}\), as we show in the univariate linear model.
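To make the updating concrete, the following minimal Python/NumPy sketch (not part of the original text; the data are simulated and the prior values are illustrative) computes \(\widehat{\boldsymbol{B}}\), \({\boldsymbol{S}}\), and the posterior hyperparameters \({\boldsymbol{B}}_n\), \({\boldsymbol{V}}_n\), \({\boldsymbol{\Psi}}_n\), and \(\alpha_n\) from the formulas above, using the section's notation for all variable names.

```python
# Conjugate normal-normal/inverse Wishart updating for multivariate regression.
# All hyperparameter names follow the section's notation; values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 100, 3, 2                              # observations, regressors, equations

# Simulate Y = X B + U, with the rows of U i.i.d. N(0, Sigma)
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
B_true = rng.normal(size=(K, M))
Sigma_true = np.array([[1.0, 0.5], [0.5, 2.0]])
U = rng.multivariate_normal(np.zeros(M), Sigma_true, size=N)
Y = X @ B_true + U

# Prior: B | Sigma ~ N_{KxM}(B0, V0, Sigma), Sigma ~ IW(Psi0, alpha0)
B0 = np.zeros((K, M))
V0 = 100.0 * np.eye(K)
Psi0 = np.eye(M)
alpha0 = M + 2

# Sample statistics
XtX = X.T @ X
B_hat = np.linalg.solve(XtX, X.T @ Y)            # OLS estimator
S = (Y - X @ B_hat).T @ (Y - X @ B_hat)          # residual sum of squares matrix

# Posterior hyperparameters (formulas above)
V0_inv = np.linalg.inv(V0)
Vn = np.linalg.inv(V0_inv + XtX)
Bn = Vn @ (V0_inv @ B0 + XtX @ B_hat)
Psin = Psi0 + S + B0.T @ V0_inv @ B0 + B_hat.T @ XtX @ B_hat \
       - Bn.T @ np.linalg.inv(Vn) @ Bn
alphan = N + alpha0

print("Posterior mean B_n:\n", Bn.round(3))      # close to B_hat under a vague prior
```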
The marginal posterior for \({\boldsymbol{B}}\) is given by \[\begin{align*} \pi({\boldsymbol{B}}\mid {\boldsymbol{Y}},{\boldsymbol{X}})&\propto \int_{\boldsymbol{\mathcal{S}}} \left|{\boldsymbol \Sigma} \right|^{-(\alpha_n+K+M+1)/2}\\ &\times\exp\left\lbrace -\frac{1}{2} tr\left\{\left[({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)+{\boldsymbol{\Psi}}_n \right] {\boldsymbol \Sigma}^{-1}\right\}\right\rbrace d{\boldsymbol{\Sigma}} \\ &\propto|({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)+{\boldsymbol{\Psi}}_n|^{-(K+\alpha_n)/2}\\ &=\left[|{\boldsymbol{\Psi}}_n|\times|{\boldsymbol{I}}_K+{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n){\boldsymbol{\Psi}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}|\right]^{-(\alpha_n+1-M+K+M-1)/2}\\ &\propto|{\boldsymbol{I}}_K+{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n){\boldsymbol{\Psi}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}|^{-(\alpha_n+1-M+K+M-1)/2}. \end{align*}\]
The second line uses the kernel of the inverse Wishart distribution, the third line Sylvester’s theorem, and the last line is the kernel of a matrix \(t\)-distribution, that is, \({\boldsymbol{B}}\mid {\boldsymbol{Y}},{\boldsymbol{X}}\sim T_{K\times M}({\boldsymbol{B}}_n,{\boldsymbol{V}}_n,{\boldsymbol{\Psi}}_n)\) with \(\alpha_n+1-M\) degrees of freedom.
Observe that \(vec({\boldsymbol{B}})\) has mean \(vec({\boldsymbol{B}}_n)\) and variance \(({\boldsymbol{\Psi}}_n\otimes{\boldsymbol{V}}_n)/(\alpha_n-M-1)\) based on its marginal distribution. On the other hand, the variance based on the conditional distribution is \({\boldsymbol{\Sigma}}\otimes{\boldsymbol{V}}_n\), where the posterior mean of \({\boldsymbol{\Sigma}}\) is \({\boldsymbol{\Psi}}_n/(\alpha_n-M-1)\); thus both routes deliver the same unconditional covariance matrix.
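Although the marginal posterior of \({\boldsymbol{B}}\) is matrix \(t\), exact joint posterior draws are easy to obtain by composition: draw \({\boldsymbol{\Sigma}}\) from \(IW({\boldsymbol{\Psi}}_n,\alpha_n)\) and then \({\boldsymbol{B}}\mid{\boldsymbol{\Sigma}}\) from the matrix normal. The sketch below (not from the text; the hyperparameter values are placeholders standing in for the output of the previous sketch) uses SciPy's `invwishart` and `matrix_normal` and checks the Monte Carlo covariance of \(vec({\boldsymbol{B}})\) against \(({\boldsymbol{\Psi}}_n\otimes{\boldsymbol{V}}_n)/(\alpha_n-M-1)\).

```python
# Composition sampling from the joint posterior: Sigma ~ IW(Psi_n, alpha_n),
# then B | Sigma ~ N_{KxM}(B_n, V_n, Sigma). Placeholder hyperparameter values.
import numpy as np
from scipy.stats import invwishart, matrix_normal

rng = np.random.default_rng(1)
K, M = 3, 2
Bn, Vn = np.zeros((K, M)), 0.01 * np.eye(K)
Psin = np.array([[1.0, 0.3], [0.3, 2.0]])
alphan = 105

vec_draws = []
for _ in range(10000):
    Sigma = invwishart.rvs(df=alphan, scale=Psin, random_state=rng)
    B = matrix_normal.rvs(mean=Bn, rowcov=Vn, colcov=Sigma, random_state=rng)
    vec_draws.append(B.flatten(order="F"))        # column-stacking vec

mc_cov = np.cov(np.array(vec_draws), rowvar=False)
theory_cov = np.kron(Psin, Vn) / (alphan - M - 1)  # marginal matrix-t covariance
print(np.max(np.abs(mc_cov - theory_cov)))         # small Monte Carlo error
```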
The marginal likelihood is \[\begin{align*} p({\boldsymbol{Y}})&=\int_{\mathcal{B}}\int_{\mathcal{S}}\left\{ (2\pi)^{-NM/2} |{\boldsymbol{\Sigma}}|^{-N/2}\exp\left\lbrace -\frac{1}{2}tr\left[\left({\boldsymbol{S}}+({\boldsymbol{B}}-\widehat{\boldsymbol{B}})^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}({\boldsymbol{B}}-\widehat{\boldsymbol{B}})\right){\boldsymbol{\Sigma}}^{-1}\right]\right\rbrace\right.\\ &\times (2\pi)^{-KM/2}\left|{\boldsymbol{V}}_0 \right|^{-M/2} \left|{\boldsymbol{\Sigma}} \right|^{-K/2}\exp\left\lbrace -\frac{1}{2}tr\left[({\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}{\boldsymbol{V}}_{0}^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_{0}){\boldsymbol{\Sigma}}^{-1}\right]\right\rbrace \\ &\left. \times \frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)} \left|{\boldsymbol{\Sigma}} \right|^{-(\alpha_{0}+M+1)/2}\exp\left\lbrace -\frac{1}{2}tr \left[ {\boldsymbol{\Psi}}_{0} {\boldsymbol{\Sigma}}^{-1}\right] \right\rbrace \right\} d{\boldsymbol{\Sigma}} d{\boldsymbol{B}}\\ &=(2\pi)^{-M(N+K)/2}\left|{\boldsymbol{V}}_0\right|^{-M/2}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)}\\ &\times\int_{\mathcal{B}}\int_{\mathcal{S}} \left\{ \left|{\boldsymbol{\Sigma}} \right|^{-(\alpha_{0}+N+K+M+1)/2}\right.\\ &\left. \times\exp\left\lbrace -\frac{1}{2}tr\left[\left({\boldsymbol{S}}+({\boldsymbol{B}}-\widehat{\boldsymbol{B}})^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}({\boldsymbol{B}}-\widehat{\boldsymbol{B}})+({\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}{\boldsymbol{V}}_{0}^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_{0})+{\boldsymbol{\Psi}}_0\right){\boldsymbol{\Sigma}}^{-1}\right]\right\rbrace\right\}d{\boldsymbol{\Sigma}} d{\boldsymbol{B}}\\ &=(2\pi)^{-M(N+K)/2}\left|{\boldsymbol{V}}_0\right|^{-M/2}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)}2^{M(\alpha_n+K)/2}\Gamma_M((\alpha_n+K)/2)\\ &\times \int_{\mathcal{B}}\left|{\boldsymbol{S}}+({\boldsymbol{B}}-\widehat{\boldsymbol{B}})^{\top}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}({\boldsymbol{B}}-\widehat{\boldsymbol{B}})+({\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}{\boldsymbol{V}}_{0}^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_{0})+{\boldsymbol{\Psi}}_0\right|^{-(\alpha_n+K)/2}d{\boldsymbol{B}}\\ &=(2\pi)^{-M(N+K)/2}\left|{\boldsymbol{V}}_0\right|^{-M/2}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)}2^{M(\alpha_n+K)/2}\Gamma_M((\alpha_n+K)/2)\\ &\times \int_{\mathcal{B}}\left|({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)+{\boldsymbol{\Psi}}_n\right|^{-(\alpha_n+K)/2}d{\boldsymbol{B}}\\ &=(2\pi)^{-M(N+K)/2}\left|{\boldsymbol{V}}_0\right|^{-M/2}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)}2^{M(\alpha_n+K)/2}\Gamma_M((\alpha_n+K)/2)\\ &\times \int_{\mathcal{B}}\left[|{\boldsymbol{\Psi}}_n|\times |{\boldsymbol{I}}_{K}+{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n){\boldsymbol{\Psi}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}|\right]^{-(\alpha_n+K)/2}d{\boldsymbol{B}}\\ &=|{\boldsymbol{\Psi}}_n|^{-(\alpha_n+K)/2}(2\pi)^{-M(N+K)/2}\left|{\boldsymbol{V}}_0\right|^{-M/2}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}2^{M(\alpha_n+K)/2}\Gamma_M((\alpha_n+K)/2)}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)}\\ &\times \int_{\mathcal{B}}\left| {\boldsymbol{I}}_{K}+{\boldsymbol{V}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n){\boldsymbol{\Psi}}_n^{-1}({\boldsymbol{B}}-{\boldsymbol{B}}_n)^{\top}\right|^{-(\alpha_n+1-M+K+M-1)/2}d{\boldsymbol{B}}\\ &=|{\boldsymbol{\Psi}}_n|^{-(\alpha_n+K)/2}(2\pi)^{-M(N+K)/2}\left|{\boldsymbol{V}}_0\right|^{-M/2}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}2^{M(\alpha_n+K)/2}\Gamma_M((\alpha_n+K)/2)}{2^{\alpha_0M/2}\Gamma_M(\alpha_0/2)}\\ &\times \pi^{MK/2}\frac{\Gamma_M((\alpha_n+1-M+M-1)/2)}{\Gamma_M((\alpha_n+1-M+K+M-1)/2)}|{\boldsymbol{\Psi}}_n|^{K/2}|{\boldsymbol{V}}_n|^{M/2}\\ &=\frac{|{\boldsymbol{V}}_n|^{M/2}}{|{\boldsymbol{V}}_0|^{M/2}}\frac{|{\boldsymbol{\Psi}}_0|^{\alpha_0/2}}{|{\boldsymbol{\Psi}}_n|^{\alpha_n/2}}\frac{\Gamma_M(\alpha_n/2)}{\Gamma_M(\alpha_0/2)}\pi^{-MN/2}. \end{align*}\]
The third equality follows from the kernel of an inverse Wishart distribution, the fifth from Sylvester’s theorem, and the seventh from the kernel of a matrix \(t\)-distribution.
Observe that this last expression is the multivariate generalization of the marginal likelihood of the univariate regression model. Taking into account that \[\begin{align*} ({\boldsymbol{A}}+{\boldsymbol{B}})^{-1}&={\boldsymbol{A}}^{-1}-{\boldsymbol{A}}^{-1}({\boldsymbol{A}}^{-1}+{\boldsymbol{B}}^{-1})^{-1}{\boldsymbol{A}}^{-1}\\ &={\boldsymbol{B}}^{-1}-{\boldsymbol{B}}^{-1}({\boldsymbol{A}}^{-1}+{\boldsymbol{B}}^{-1})^{-1}{\boldsymbol{B}}^{-1}\\ &={\boldsymbol{A}}^{-1}({\boldsymbol{A}}^{-1}+{\boldsymbol{B}}^{-1})^{-1}{\boldsymbol{B}}^{-1}, \end{align*}\]
we can show that \({\boldsymbol{\Psi}}_{n}={\boldsymbol{\Psi}}_{0}+{\boldsymbol{S}}+(\widehat{\boldsymbol{B}}-{\boldsymbol{B}}_{0})^{\top}\left[{\boldsymbol{V}}_{0}+({\boldsymbol{X}}^{\top}{\boldsymbol{X}})^{-1}\right]^{-1}(\widehat{\boldsymbol{B}}-{\boldsymbol{B}}_{0})\) (see Exercise 7). Therefore, the marginal likelihood rewards goodness of fit (a small residual sum of squares matrix \({\boldsymbol{S}}\)), agreement between the prior location \({\boldsymbol{B}}_0\) and the sample estimate \(\widehat{\boldsymbol{B}}\), and the gain in precision about the location parameters in moving from \({\boldsymbol{V}}_0\) to \({\boldsymbol{V}}_n\).
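For model comparison it is convenient to evaluate the marginal likelihood on the log scale. The following sketch (not part of the original text) implements the closed-form expression derived above, using SciPy's log multivariate gamma function `multigammaln`; the values in the example call are purely illustrative.

```python
# Closed-form log marginal likelihood of the conjugate multivariate regression model.
import numpy as np
from scipy.special import multigammaln

def log_marginal_likelihood(N, M, V0, Vn, Psi0, Psin, alpha0, alphan):
    """log p(Y) = -MN/2 log(pi) + M/2 (log|Vn| - log|V0|)
                  + alpha0/2 log|Psi0| - alphan/2 log|Psin|
                  + log Gamma_M(alphan/2) - log Gamma_M(alpha0/2)."""
    _, logdet_V0 = np.linalg.slogdet(V0)
    _, logdet_Vn = np.linalg.slogdet(Vn)
    _, logdet_Psi0 = np.linalg.slogdet(Psi0)
    _, logdet_Psin = np.linalg.slogdet(Psin)
    return (-0.5 * M * N * np.log(np.pi)
            + 0.5 * M * (logdet_Vn - logdet_V0)
            + 0.5 * alpha0 * logdet_Psi0
            - 0.5 * alphan * logdet_Psin
            + multigammaln(0.5 * alphan, M)
            - multigammaln(0.5 * alpha0, M))

# Example call with illustrative hyperparameter values (K = 3, M = 2, N = 100)
print(log_marginal_likelihood(N=100, M=2,
                              V0=100 * np.eye(3), Vn=0.01 * np.eye(3),
                              Psi0=np.eye(2), Psin=80 * np.eye(2),
                              alpha0=4, alphan=104))
```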
Given a matrix of regressors \({\boldsymbol{X}}_0\) for \(N_0\) unobserved units, the predictive density of \({\boldsymbol{Y}}_0\) given \({\boldsymbol{Y}}\), \(\pi({\boldsymbol{Y}}_0\mid {\boldsymbol{Y}})\), is a matrix \(t\)-distribution \(T_{N_0,M}(\alpha_n-M+1,{\boldsymbol{X}}_0{\boldsymbol{B}}_n,{\boldsymbol{I}}_{N_0}+{\boldsymbol{X}}_0{\boldsymbol{V}}_n{\boldsymbol{X}}_0^{\top},{\boldsymbol{\Psi}}_n)\) (see Exercise 6). Observe that the prediction is centered at \({\boldsymbol{X}}_0{\boldsymbol{B}}_n\), and the covariance matrix of \(vec({\boldsymbol{Y}}_0)\) is \(\frac{{\boldsymbol{\Psi}}_n\otimes({\boldsymbol{I}}_{N_0}+{\boldsymbol{X}}_0{\boldsymbol{V}}_n{\boldsymbol{X}}_0^{\top})}{\alpha_n-M-1}\).
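Draws from this predictive distribution can again be obtained by composition: sample \(({\boldsymbol{B}},{\boldsymbol{\Sigma}})\) from the posterior and then \({\boldsymbol{Y}}_0\mid{\boldsymbol{B}},{\boldsymbol{\Sigma}}\) from the likelihood evaluated at \({\boldsymbol{X}}_0\). A minimal sketch (not from the text; hyperparameter values are placeholders in the section's notation):

```python
# Composition sampling from the predictive density of Y0 at new regressors X0.
import numpy as np
from scipy.stats import invwishart, matrix_normal

rng = np.random.default_rng(2)
K, M, N0 = 3, 2, 5
Bn, Vn = np.zeros((K, M)), 0.01 * np.eye(K)
Psin, alphan = np.eye(M), 105
X0 = np.column_stack([np.ones(N0), rng.normal(size=(N0, K - 1))])

Y0_draws = []
for _ in range(1000):
    Sigma = invwishart.rvs(df=alphan, scale=Psin, random_state=rng)
    B = matrix_normal.rvs(mean=Bn, rowcov=Vn, colcov=Sigma, random_state=rng)
    Y0 = matrix_normal.rvs(mean=X0 @ B, rowcov=np.eye(N0), colcov=Sigma,
                           random_state=rng)
    Y0_draws.append(Y0)

# Monte Carlo predictive mean; should be close to X0 @ Bn
print(np.mean(Y0_draws, axis=0).round(3))
```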
\(vec\) denotes the vectorization operation, and \(\otimes\) denotes the Kronecker product.↩︎