Exercise 4.12 - BIC for Gaussians

Answers

For question (a), recall that the maximum likelihood estimates for an MVN model are:

$$\mu_{\mathrm{MLE}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n,$$

$$\Sigma_{\mathrm{MLE}} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \mu_{\mathrm{MLE}})(\mathbf{x}_n - \mu_{\mathrm{MLE}})^T.$$
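As a quick numerical companion, here is a minimal NumPy sketch of these two estimators (the function name `mvn_mle` and the `(N, D)` data layout are illustrative conventions, not part of the exercise):

```python
import numpy as np

def mvn_mle(X):
    """MLE of the mean and the (biased, 1/N) covariance of an MVN.
    X has shape (N, D): one sample per row."""
    N, D = X.shape
    mu = X.mean(axis=0)          # (1/N) * sum_n x_n
    Xc = X - mu                  # centered samples
    Sigma = (Xc.T @ Xc) / N      # (1/N) * sum_n (x_n - mu)(x_n - mu)^T
    return mu, Sigma
```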

So the likelihood reads:

$$\begin{aligned}
p(\mathcal{D} \mid \mu_{\mathrm{MLE}}, \Sigma_{\mathrm{MLE}}) &= \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mu_{\mathrm{MLE}}, \Sigma_{\mathrm{MLE}}) \\
&= \prod_{n=1}^{N} (2\pi)^{-\frac{D}{2}} |\Sigma_{\mathrm{MLE}}|^{-\frac{1}{2}} \exp\left( -\frac{1}{2} (\mathbf{x}_n - \mu_{\mathrm{MLE}})^T \Sigma_{\mathrm{MLE}}^{-1} (\mathbf{x}_n - \mu_{\mathrm{MLE}}) \right) \\
&= (2\pi)^{-\frac{ND}{2}} |\Sigma_{\mathrm{MLE}}|^{-\frac{N}{2}} \exp\left( -\frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \mu_{\mathrm{MLE}})^T \Sigma_{\mathrm{MLE}}^{-1} (\mathbf{x}_n - \mu_{\mathrm{MLE}}) \right).
\end{aligned}$$

Denote:

$$\mathbf{Y} = \begin{pmatrix} \mathbf{x}_1 - \mu_{\mathrm{MLE}} & \cdots & \mathbf{x}_N - \mu_{\mathrm{MLE}} \end{pmatrix},$$

then $\Sigma_{\mathrm{MLE}} = \frac{1}{N} \mathbf{Y} \mathbf{Y}^T$, and the term in the exponential of the likelihood is:

$$-\frac{1}{2} \operatorname{tr}\left( \mathbf{Y}^T \Sigma_{\mathrm{MLE}}^{-1} \mathbf{Y} \right) = -\frac{1}{2} \operatorname{tr}\left( \Sigma_{\mathrm{MLE}}^{-1} \mathbf{Y} \mathbf{Y}^T \right) = -\frac{1}{2} \operatorname{tr}\left( N \mathbf{I}_D \right) = -\frac{ND}{2}.$$
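This identity is easy to check numerically. The snippet below continues the sketch above, reusing `mvn_mle`; the data size and seed are arbitrary:

```python
# Check: with Sigma = (1/N) Y Y^T, we get
# tr(Sigma^{-1} Y Y^T) = tr(N I_D) = N * D.
rng = np.random.default_rng(0)
N, D = 500, 3
X = rng.normal(size=(N, D))
mu, Sigma = mvn_mle(X)
Y = (X - mu).T                                    # D x N matrix of centered samples
val = -0.5 * np.trace(np.linalg.solve(Sigma, Y @ Y.T))
print(val, -N * D / 2)                            # both ~ -750.0
```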

Thus, counting $D$ free parameters for the mean and $\frac{D(D+1)}{2}$ for the symmetric covariance, the BIC (in the form $\log p(\mathcal{D} \mid \hat{\theta}) - \frac{\mathrm{dof}}{2} \log N$) is:

$$\mathrm{BIC} = -\frac{ND}{2} \log(2\pi e) - \frac{N}{2} \log|\Sigma_{\mathrm{MLE}}| - \frac{D + \frac{D(D+1)}{2}}{2} \log N.$$
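In code, the expression above might look as follows, continuing the sketch (under the "higher is better" convention $\log p(\mathcal{D} \mid \hat{\theta}) - \frac{\mathrm{dof}}{2} \log N$; `bic_full` is an illustrative name):

```python
def bic_full(X):
    """BIC of a full-covariance MVN fitted by maximum likelihood."""
    N, D = X.shape
    _, Sigma = mvn_mle(X)
    loglik = (-N * D / 2 * np.log(2 * np.pi * np.e)
              - N / 2 * np.linalg.slogdet(Sigma)[1])
    dof = D + D * (D + 1) // 2   # D for the mean, D(D+1)/2 for the covariance
    return loglik - dof / 2 * np.log(N)
```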

For question (b), fitting a diagonal MVN model is tantamount to fitting $D$ independent 1-d Gaussian models simultaneously, so the $d$-th diagonal entry of $\Sigma_{\mathrm{MLE}}^{\mathrm{diag}}$ is:

$$\left( \Sigma_{\mathrm{MLE}}^{\mathrm{diag}} \right)_{dd} = \frac{1}{N} \sum_{n=1}^{N} x_{n,d}^2,$$

where we have assumed $\bar{\mathbf{x}} = \mathbf{0}$ w.l.o.g. (i.e., the data are centered). The term inside the exponential of the likelihood again evaluates to $-\frac{ND}{2}$ by the same trace argument, so the BIC in this case, now with $2D$ free parameters ($D$ means and $D$ variances), is:

$$\mathrm{BIC}^{\mathrm{diag}} = -\frac{ND}{2} \log(2\pi e) - \frac{N}{2} \log\left|\Sigma_{\mathrm{MLE}}^{\mathrm{diag}}\right| - D \log N.$$

We observe that if all $D$ components are mutually independent, i.e., $\Sigma_{\mathrm{MLE}}$ is (essentially) diagonal, then $\log|\Sigma_{\mathrm{MLE}}| \approx \log|\Sigma_{\mathrm{MLE}}^{\mathrm{diag}}|$ while the diagonal model pays a smaller penalty, so its BIC is strictly larger (for $D > 1$) and the diagonal model is preferred. When there is dependence among the components, the BIC of the general MVN is still not necessarily larger than that of the diagonal MVN: the log-likelihood gain from modeling the correlations must exceed the extra penalty $\frac{D(D-1)}{4} \log N$. This is a reflection of the trade-off between complexity and fit, as the sketch below illustrates.
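Continuing the sketch (`bic_diag` mirrors `bic_full`; the correlation structure `L` is an arbitrary choice, and the comparisons hold with high probability rather than deterministically):

```python
def bic_diag(X):
    """BIC of a diagonal-covariance MVN: D means + D variances."""
    N, D = X.shape
    var = X.var(axis=0)          # per-dimension MLE variances
    loglik = (-N * D / 2 * np.log(2 * np.pi * np.e)
              - N / 2 * np.sum(np.log(var)))
    dof = 2 * D
    return loglik - dof / 2 * np.log(N)

# Truly independent components: nearly identical fit, smaller penalty
# -> the diagonal model wins.
X_ind = rng.normal(size=(1000, 3))
print(bic_diag(X_ind) > bic_full(X_ind))          # True (w.h.p.)

# Strongly correlated components: the likelihood gain from modeling
# the correlations outweighs the extra penalty -> the full model wins.
L = np.array([[1.0, 0.0, 0.0], [0.9, 0.4, 0.0], [0.9, 0.0, 0.4]])
X_cor = rng.normal(size=(1000, 3)) @ L.T          # covariance L L^T
print(bic_full(X_cor) > bic_diag(X_cor))          # True (w.h.p.)
```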

The Bayesian information criterion is an approximation of a model's evidence, $p(\mathcal{D})$. Let us start from:

$$p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta,$$

where $\theta$ is the collection of all parameters of the current model. The trick is to expand $\log p(\mathbf{x} \mid \theta)$ as a function of $\theta$ to second order around $\theta_0 = \theta_{\mathrm{MAP}}$, where the first-order gradient is taken to vanish:

$$\log p(\mathbf{x} \mid \theta) \approx \log p(\mathbf{x} \mid \theta_0) - \frac{1}{2} (\theta - \theta_0)^T \mathbf{H} (\theta - \theta_0),$$

where $\mathbf{H}$ is the negative Hessian of $\log p(\mathbf{x} \mid \theta)$ evaluated at $\theta_0$ (positive definite at a maximum). Thus we have:

$$p(\mathbf{x} \mid \theta) \approx p(\mathbf{x} \mid \theta_0) \exp\left( -\frac{1}{2} (\theta - \theta_0)^T \mathbf{H} (\theta - \theta_0) \right).$$

We are now ready to perform the integral. Applying the expansion to each of the $N$ samples gives:

$$p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} p(\mathbf{x}_n \mid \theta) \approx p(\mathcal{D} \mid \theta_0) \exp\left( -\frac{N}{2} (\theta - \theta_0)^T \mathbf{H} (\theta - \theta_0) \right),$$

where $\mathbf{H}$ is now understood as the average per-sample Hessian. Conducting the integral in the neighbourhood of $\theta_0 = \theta_{\mathrm{MAP}}$:

$$\begin{aligned}
\int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta &\approx p(\mathcal{D} \mid \theta_{\mathrm{MAP}})\, p(\theta_{\mathrm{MAP}}) \int \exp\left( -\frac{N}{2} (\theta - \theta_{\mathrm{MAP}})^T \mathbf{H} (\theta - \theta_{\mathrm{MAP}}) \right) d\theta \\
&= p(\mathcal{D} \mid \theta_{\mathrm{MAP}})\, p(\theta_{\mathrm{MAP}})\, (2\pi)^{\frac{d}{2}} \left| N^{-1} \mathbf{H}^{-1} \right|^{\frac{1}{2}} \\
&= p(\mathcal{D} \mid \theta_{\mathrm{MAP}})\, p(\theta_{\mathrm{MAP}})\, (2\pi)^{\frac{d}{2}} N^{-\frac{d}{2}} \left| \mathbf{H} \right|^{-\frac{1}{2}},
\end{aligned}$$

where $d$ is the number of components of $\theta$. Taking the logarithm of both sides of the evidence yields

$$\log p(\mathcal{D}) \approx \log p(\mathcal{D} \mid \theta_{\mathrm{MAP}}) - \frac{d}{2} \log N + \log p(\theta_{\mathrm{MAP}}) + \frac{d}{2} \log(2\pi) - \frac{1}{2} \log|\mathbf{H}|,$$

and dropping the terms that stay $O(1)$ as $N$ grows leaves the BIC, $\log p(\mathcal{D} \mid \theta_{\mathrm{MAP}}) - \frac{d}{2} \log N$. One can see how many compromises and approximations go into obtaining an analytic form of the evidence, which is arguably the hardest quantity to compute in a Bayesian analysis.
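As a small numerical confirmation of the Gaussian integral used in the last step, consider the 1-d case (`N`, `H`, and `theta0` are arbitrary illustrative values):

```python
from scipy.integrate import quad

# 1-d check: integral of exp(-N/2 * H * (t - theta0)^2) dt
# equals sqrt(2*pi / (N * H)).
N, H, theta0 = 200, 2.5, 0.7
numeric, _ = quad(lambda t: np.exp(-N / 2 * H * (t - theta0) ** 2),
                  theta0 - 1.0, theta0 + 1.0)     # +/- 1 covers many std devs
print(numeric, np.sqrt(2 * np.pi / (N * H)))      # both ~ 0.1121
```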

See also PRML, Section 4.4.
