Ch 10 — Approximate Inference
Variational inference, mean-field factorization, expectation propagation, the variational view of Bayesian learning.
For most non-trivial Bayesian models, the posterior \(p(\mathbf{Z} \mid \mathbf{X})\) has no closed form. Two attack vectors: deterministic approximation (this chapter — variational inference, EP) and stochastic approximation (Ch 11 — sampling). Variational methods replace the intractable posterior with a tractable surrogate and minimize the gap; sampling methods generate empirical samples and reason about them statistically. Each has its sweet spot.
Prerequisites
- EM (Ch 9) — variational inference is EM with a constrained E-step.
- Information theory (Ch 1) — KL divergence, entropy.
- Calculus of variations — comfortable optimizing over functions, not just numbers.
10.1 The Variational Lower Bound
Recall (Ch 9): for any distribution \(q(\mathbf{Z})\),
\[\ln p(\mathbf{X}) = \mathcal{L}(q) + \mathrm{KL}\bigl(q \,\|\, p(\mathbf{Z} \mid \mathbf{X})\bigr), \qquad \mathcal{L}(q) = \int q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z})}{q(\mathbf{Z})}\, \mathrm{d}\mathbf{Z}. \tag{10.1}\]
\(\mathcal{L}\) is the evidence lower bound (ELBO). Because \(\ln p(\mathbf{X})\) does not depend on \(q\), minimizing the KL term is equivalent to maximizing \(\mathcal{L}\). With \(q\) unrestricted, though, the optimum is the true posterior itself, which is exactly the object we cannot compute. The variational approach therefore restricts \(q\) to a tractable family and maximizes \(\mathcal{L}\) within that family. The result is a tractable approximation to the true posterior, and the gap between \(\mathcal{L}(q)\) and \(\ln p(\mathbf{X})\) is exactly the residual KL.
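The decomposition in (10.1) can be checked numerically on a model small enough to enumerate. The sketch below assumes a hypothetical three-state latent variable and a single fixed observation; the prior and likelihood values are made up for illustration.

```python
import numpy as np

# Hypothetical toy model: Z takes 3 values, X is one fixed observation.
prior = np.array([0.5, 0.3, 0.2])          # p(Z)
likelihood = np.array([0.10, 0.40, 0.70])  # p(X = x_obs | Z)

joint = prior * likelihood                 # p(X, Z) at the observed X
evidence = joint.sum()                     # p(X)
posterior = joint / evidence               # p(Z | X)
log_evidence = float(np.log(evidence))

def elbo(q):
    """L(q) = sum_z q(z) * ln[p(X, z) / q(z)]."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def kl(q, p):
    """KL(q || p) for strictly positive discrete distributions."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

q = np.array([0.6, 0.3, 0.1])              # an arbitrary approximation

# ln p(X) should equal L(q) + KL(q || posterior) for any valid q.
print(log_evidence, elbo(q) + kl(q, posterior))
# The bound is tight when q is the exact posterior.
print(elbo(posterior))
```

Any other strictly positive `q` that sums to one gives the same total; only the split between the ELBO and the KL term changes.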
10.2 Mean-Field Variational Inference
The classical restriction: \(q\) factorizes over disjoint groups of latent variables,
\[q(\mathbf{Z}) = \prod_{i=1}^{M} q_i(\mathbf{Z}_i). \tag{10.2}\]
This is mean field. Maximizing \(\mathcal{L}\) under this constraint, holding \(q_{j \neq i}\) fixed, gives the optimal \(q_i\):
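\[\ln q_i^{\star}(\mathbf{Z}_i) = \mathbb{E}_{j \neq i}\bigl[\ln p(\mathbf{X}, \mathbf{Z})\bigr] + \mathrm{const}. \tag{10.3}\]
That is, take the log-joint, average it over all the other factors \(q_j\), \(j \neq i\), then exponentiate and normalize. Because each \(q_i^{\star}\) depends on expectations under the other factors, the updates are coupled: cycle over \(i\) until convergence. No update can decrease \(\mathcal{L}\). This is the same alternating-optimization structure as EM, except that each step replaces the exact conditional posterior with one computed from expectations under the rest of \(q\).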
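As a minimal sketch of the coordinate updates in (10.3), take a target that is itself a correlated two-dimensional Gaussian \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})\) and approximate it by \(q_1(z_1)\, q_2(z_2)\). Applying (10.3) gives Gaussian factors with variance \(1/\Lambda_{ii}\) and a mean that depends on the other factor's current mean. The target parameters and the starting point below are assumptions chosen for illustration.

```python
import numpy as np

# Assumed target: a correlated 2-D Gaussian with known mean and covariance.
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)          # precision matrix

m = np.array([5.0, 5.0])            # arbitrary initial means of q1 and q2
for sweep in range(200):
    # Update q1 holding q2 fixed, then q2 holding q1 fixed (equation 10.3).
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

q_var = 1.0 / np.diag(Lam)          # variances of the factors q1 and q2

print("mean-field means:   ", m)
print("mean-field variance:", q_var)
print("true marginal var:  ", np.diag(Sigma))
```

The means converge to the true mean, while the factor variances come out smaller than the true marginal variances. This is a small instance of the variance underestimation typical of mean field.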
Example — Variational Gaussian Mixture
Take the Bayesian GMM from Ch 9 (Dirichlet prior on \(\boldsymbol{\pi}\), Gaussian-Wishart on \((\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k)\)). Mean-field with the factorization \(q(\mathbf{Z})\, q(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Lambda})\) gives closed-form updates for both factors. The fit effectively prunes unused components: when a component takes responsibility for almost no data, its posterior \(q(\pi_k)\) collapses around 0. In practice you can set \(K\) generously and let the fit decide how many components to keep, instead of running a separate model-selection search over \(K\).
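The pruning effect can be seen from the mixing-weight update alone. The sketch below assumes the standard conjugate Dirichlet update, \(\alpha_k = \alpha_0 + N_k\) with \(N_k = \sum_n r_{nk}\), and uses a made-up responsibility matrix; it is not a full variational GMM.

```python
import numpy as np

alpha0 = 1e-3   # assumed small Dirichlet prior concentration

# Hypothetical responsibilities r[n, k] = q(z_n = k): 6 points, K = 4 components.
# Components 2 and 3 explain none of the data.
r = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
])

N_k = r.sum(axis=0)                  # effective number of points per component
alpha = alpha0 + N_k                 # posterior Dirichlet parameters
expected_pi = alpha / alpha.sum()    # E[pi_k] under q(pi)

print(expected_pi)
```

Components with no assigned data end up with an expected weight close to zero, so they drop out of the mixture without being removed by hand.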
Why KL(q||p) and not KL(p||q)?
Mean-field minimizes \(\mathrm{KL}(q \| p)\), which is mode-seeking: the divergence heavily penalizes any mass \(q\) places where \(p\) is near zero, so \(q\) tends to lock onto a single mode of \(p\) (often the largest) and ignore the others. The reverse, \(\mathrm{KL}(p \| q)\), is mode-covering: \(q\) must place mass wherever \(p\) does, so it spreads across all modes and smears density between them. Mode-seeking is computationally convenient (gradients of \(\mathrm{KL}(q\|p)\) involve only expectations under \(q\)) and usually gives reasonable point estimates, but it tends to underestimate posterior variance. Expectation propagation (Section 10.4) uses \(\mathrm{KL}(p\|q)\) and trades convenience for fidelity.
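The two behaviours are easy to see by fitting a single Gaussian to a bimodal target under each divergence. The sketch below assumes a target that is an equal mixture of \(\mathcal{N}(-3, 1)\) and \(\mathcal{N}(3, 1)\), evaluates both divergences by numerical integration on a grid, and picks the best Gaussian by brute-force search; the grid ranges are arbitrary choices.

```python
import numpy as np

x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]

def log_normal(x, mean, std):
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

# Assumed bimodal target: equal mixture of N(-3, 1) and N(3, 1).
log_p = np.logaddexp(log_normal(x, -3.0, 1.0), log_normal(x, 3.0, 1.0)) + np.log(0.5)
p = np.exp(log_p)

def kl_q_p(mean, std):
    """KL(q || p): the direction minimized by variational inference."""
    log_q = log_normal(x, mean, std)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dx

def kl_p_q(mean, std):
    """KL(p || q): the direction used in EP projections."""
    log_q = log_normal(x, mean, std)
    return np.sum(p * (log_p - log_q)) * dx

means = np.linspace(-5.0, 5.0, 51)
stds = np.linspace(0.5, 4.0, 36)
candidates = [(m, s) for m in means for s in stds]

m_seek, s_seek = min(candidates, key=lambda ms: kl_q_p(*ms))
m_cover, s_cover = min(candidates, key=lambda ms: kl_p_q(*ms))

print("KL(q||p) minimizer:", m_seek, s_seek)     # sits on one mode, narrow
print("KL(p||q) minimizer:", m_cover, s_cover)   # centred between modes, wide
```

The \(\mathrm{KL}(q\|p)\) fit settles on one of the two modes with roughly that mode's width. The \(\mathrm{KL}(p\|q)\) fit matches the overall mean and variance of the mixture, placing most of its density in the low-probability valley between the modes.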
10.3 Variational Logistic Regression
Logistic regression’s posterior over \(\mathbf{w}\) is non-Gaussian (there is no conjugate prior). In Ch 4 we used the Laplace approximation; the variational alternative lower-bounds the logistic sigmoid by a function whose exponent is quadratic in \(z\) (the Jaakkola–Jordan bound):
\[\sigma(z) \geq \sigma(\xi)\, \exp\!\left( \tfrac{z - \xi}{2} - \lambda(\xi)(z^{2} - \xi^{2}) \right),\]
with \(\lambda(\xi) = \tfrac{1}{2\xi}\bigl(\sigma(\xi) - \tfrac{1}{2}\bigr)\) and a free variational parameter \(\xi\). The bound is tight at \(z = \pm\xi\). Because the exponent is quadratic in \(z = \mathbf{w}^{\top}\mathbf{x}\), plugging the bound in turns the integral over \(\mathbf{w}\) into a Gaussian integral with a closed form. Adjust \(\xi\) per data point to tighten the bound, alternating with updates of \(q(\mathbf{w})\). The result is a Gaussian \(q(\mathbf{w})\) that is typically a more accurate approximation than Laplace, since the bound parameters are optimized rather than fixed by the curvature at the mode.
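A quick numerical check of the bound, as a sketch: evaluate both sides on a grid of \(z\) for one assumed value of \(\xi\) and confirm that the bound never exceeds the sigmoid and touches it at \(z = \pm\xi\). The grid and the choice \(\xi = 2.5\) are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lam(xi):
    """lambda(xi) = (sigmoid(xi) - 1/2) / (2 xi), defined for xi != 0."""
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def jj_bound(z, xi):
    """Jaakkola-Jordan lower bound on sigmoid(z) with variational parameter xi."""
    return sigmoid(xi) * np.exp((z - xi) / 2.0 - lam(xi) * (z**2 - xi**2))

z = np.linspace(-8.0, 8.0, 1601)
xi = 2.5                                  # assumed value for illustration
slack = sigmoid(z) - jj_bound(z, xi)      # should be >= 0 everywhere

print("minimum slack:", slack.min())
print("slack at z = +xi:", sigmoid(xi) - jj_bound(xi, xi))
print("slack at z = -xi:", sigmoid(-xi) - jj_bound(-xi, xi))
```

Changing `xi` moves the two contact points, which is why tuning \(\xi\) separately for each data point tightens the bound where that point's activation actually lies.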
10.4 Expectation Propagation
A different decomposition. Suppose the posterior factorizes as a product over data points:
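\[p(\boldsymbol{\theta}) \propto p_0(\boldsymbol{\theta}) \prod_{n=1}^{N} f_n(\boldsymbol{\theta}).\]
EP approximates each factor \(f_n\) by a member of an exponential family \(\widetilde{f}_n\), giving a tractable global approximation \(\widetilde{p} \propto p_0 \prod_n \widetilde{f}_n\). Updates are local: pick a factor, remove its current approximation from \(\widetilde{p}\) to leave the cavity \(p_0 \widetilde{f}_{\setminus n}\), where \(\widetilde{f}_{\setminus n} = \prod_{m \neq n} \widetilde{f}_m\), then multiply by the exact factor and project \(p_0 \widetilde{f}_{\setminus n} f_n\) back onto the chosen family by minimizing \(\mathrm{KL}(p \,\|\, \widetilde{p})\), with the distribution being approximated in the first argument. For an exponential family this projection amounts to matching expected sufficient statistics (mean and variance for a Gaussian). Note the direction: it is the opposite of the \(\mathrm{KL}(q \,\|\, p)\) that VI minimizes.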
Why bother? Each EP projection minimizes the mode-covering KL, so the approximation tends to capture the posterior’s mass more faithfully, which matters particularly for predictive uncertainty. Trade-offs: no convergence guarantees in general, and updates can oscillate. Used heavily in Gaussian-process classification.
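A single EP update can be sketched in one dimension. The example below assumes a Gaussian cavity and one logistic factor \(f(\theta) = \sigma(y\theta)\), computes the moments of the cavity-times-factor distribution by numerical integration, and recovers the updated factor approximation by dividing out the cavity in natural parameters. The function name `ep_site_update` and all numerical values are hypothetical; a full EP implementation would loop this over all factors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ep_site_update(cav_mean, cav_var, y, grid):
    """One EP projection for a logistic factor f(theta) = sigmoid(y * theta)."""
    dtheta = grid[1] - grid[0]
    cavity = np.exp(-0.5 * (grid - cav_mean) ** 2 / cav_var)
    tilted = cavity * sigmoid(y * grid)       # cavity times the exact factor
    tilted /= tilted.sum() * dtheta           # normalize numerically

    # Minimizing KL(tilted || Gaussian) reduces to matching mean and variance.
    new_mean = np.sum(grid * tilted) * dtheta
    new_var = np.sum((grid - new_mean) ** 2 * tilted) * dtheta

    # Updated factor approximation = new Gaussian / cavity (natural parameters).
    site_precision = 1.0 / new_var - 1.0 / cav_var
    site_shift = new_mean / new_var - cav_mean / cav_var
    return new_mean, new_var, site_precision, site_shift

grid = np.linspace(-12.0, 12.0, 4801)
new_mean, new_var, site_precision, site_shift = ep_site_update(0.0, 4.0, +1, grid)

print("updated mean and variance:", new_mean, new_var)
print("site natural parameters:  ", site_precision, site_shift)
```

With a positive label the updated mean moves to the right of the cavity mean and the variance shrinks, so the factor approximation carries positive precision. When a factor is not log-concave the site precision can come out negative, which is one source of the instability mentioned above.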
10.5 Variational vs Sampling
| | Variational | Sampling (Ch 11) |
|---|---|---|
| Output | Parametric \(q^{\star}\) | Empirical samples \(\lbrace \mathbf{z}^{(t)}\rbrace\) |
| Bias | Approximation gap (positive) | Asymptotically zero |
| Variance | Deterministic | Monte Carlo noise |
| Speed | Fast — gradient-based | Slow — many samples to converge |
| Scalability to large data | Mini-batch friendly (SVI, BBVI) | Per-step cost grows with data |
| Practical win | Posterior approximations for huge models | Gold standard when feasible |
Modern rule of thumb: variational for scale, sampling for accuracy. For deep latent-variable generative models such as VAEs, variational inference is essentially the only practical option, since running a sampler per data point is rarely affordable at that scale.
10.6 Modern Variational Inference
In PRML’s era, VI meant deriving the updates by hand for each model. Today:
- Black-box VI (BBVI). Define an arbitrary parameterized \(q_{\boldsymbol{\phi}}\); estimate gradients of \(\mathcal{L}\) by Monte Carlo sampling from \(q\). Works for any model with a tractable joint.
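- Reparameterization trick. When \(q_{\boldsymbol{\phi}}\) is reparameterizable (e.g., Gaussian), write \(\mathbf{Z} = g(\boldsymbol{\epsilon}, \boldsymbol{\phi})\) with \(\boldsymbol{\epsilon}\) drawn from a fixed distribution. The gradient then moves inside the expectation, \(\nabla_{\boldsymbol{\phi}} \mathbb{E}_{q}[f(\mathbf{Z})] = \mathbb{E}_{\boldsymbol{\epsilon}}\bigl[\nabla_{\boldsymbol{\phi}} f(g(\boldsymbol{\epsilon}, \boldsymbol{\phi}))\bigr]\), and its Monte Carlo estimate typically has low variance. Foundation of the variational autoencoder.
- Stochastic VI. For huge datasets, replace the full sum over data points in \(\mathcal{L}\) with rescaled mini-batch estimates. Each update costs a fraction of a full pass, so a good solution is usually reached much sooner than with full-batch updates.
- Normalizing flows. Build expressive \(q\) by composing invertible transformations of a base distribution. A flexible \(q\) can capture posterior correlations that mean field drops, which mitigates the variance underestimation.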
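To make the gradient estimators in the list above concrete, the sketch below compares the reparameterization estimator with the score-function estimator commonly used in black-box VI, on a toy problem where the answer is known: for \(q = \mathcal{N}(\mu, \sigma^2)\) and \(f(z) = z^2\), \(\mathbb{E}_q[f] = \mu^2 + \sigma^2\), so the gradient with respect to \(\mu\) is \(2\mu\). The integrand and parameter values are assumptions chosen so the result can be checked by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8      # assumed variational parameters
n = 100_000               # number of Monte Carlo samples

def f(z):
    return z ** 2         # toy integrand; d/dmu E_q[f] = 2 * mu

eps = rng.standard_normal(n)
z = mu + sigma * eps      # reparameterization: z = g(eps, phi)

# Reparameterization estimator: f'(z) * dz/dmu, with dz/dmu = 1.
grad_reparam = 2.0 * z

# Score-function estimator: f(z) * d ln q(z) / dmu.
grad_score = f(z) * (z - mu) / sigma ** 2

print("exact gradient:       ", 2.0 * mu)
print("reparam mean, var:    ", grad_reparam.mean(), grad_reparam.var())
print("score-func mean, var: ", grad_score.mean(), grad_score.var())
```

Both estimators are unbiased, but the per-sample variance of the score-function version is much larger here, so it needs far more samples for the same accuracy. The score-function estimator remains useful when \(q\) cannot be reparameterized, for example for discrete latent variables.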
Takeaways
- ELBO \(\mathcal{L}(q)\) is the central object: \(\ln p(\mathbf{X}) = \mathcal{L} + \mathrm{KL}\); maximizing \(\mathcal{L}\) over a tractable family minimizes the gap.
- Mean-field VI factorizes \(q\) over latent groups — closed-form coordinate updates, mode-seeking, underestimates variance.
- EP approximates factor by factor with mode-covering KL: better coverage of posterior mass, no convergence guarantees.
- Modern BBVI + reparameterization makes variational inference essentially black-box; the engine behind VAEs and large-scale Bayesian neural networks.
Forward to Ch 11 — Sampling Methods.
Figures from Bishop, Pattern Recognition and Machine Learning (Springer, 2006), © Springer / C. M. Bishop. Reproduced for educational use.