Bayesian Inference, Part 4: Core Theory II (Estimation, Prediction, and Uncertainty) through Part 6 (Advanced Topics: Approximate Inference)
4. Core Theory II: Estimation, Prediction, and Uncertainty
4.1 MAP Estimation as Regularized MLE
Maximum a posteriori estimation chooses

$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} p(\theta \mid x) = \arg\max_{\theta} \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$

Because the evidence $p(x)$ does not depend on $\theta$, MAP is equivalent to maximizing log-likelihood plus log-prior:

$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[ \log p(x \mid \theta) + \log p(\theta) \right]$
This makes the relationship to regularization immediate.
Gaussian prior
If $\theta \sim \mathcal{N}(0, \tau^2 I)$, then

$\log p(\theta) = -\frac{1}{2\tau^2} \|\theta\|_2^2 + \text{const}$

So MAP becomes

$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[ \log p(x \mid \theta) - \frac{1}{2\tau^2} \|\theta\|_2^2 \right]$
That is L2-regularized MLE, or ridge-style shrinkage.
Laplace prior
If the components are independent Laplace:

$p(\theta_j) = \frac{1}{2b} \exp\!\left(-\frac{|\theta_j|}{b}\right)$

then

$\log p(\theta) = -\frac{1}{b} \|\theta\|_1 + \text{const}$
so MAP becomes L1-regularized MLE.
Examples:
- ridge regression = Gaussian-prior MAP
- lasso = Laplace-prior MAP
- weight decay in neural nets = Gaussian-prior interpretation of penalized optimization
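The Gaussian-prior case can be checked numerically. The sketch below uses made-up data and assumes known noise scale `sigma` and prior scale `tau`; it verifies that the ridge solution with penalty $\sigma^2/\tau^2$ satisfies the MAP stationarity condition:

```python
# Numerical check that Gaussian-prior MAP equals ridge regression.
# Illustrative sketch: data, sigma, and tau are made up.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 0.5, 1.0
y = X @ w_true + sigma * rng.normal(size=n)

# Ridge closed form with penalty lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP minimizes ||y - Xw||^2 / (2 sigma^2) + ||w||^2 / (2 tau^2),
# whose gradient vanishes at exactly the ridge solution.
grad_at_ridge = X.T @ (X @ w_ridge - y) / sigma**2 + w_ridge / tau**2
assert np.allclose(grad_at_ridge, 0.0, atol=1e-8)
print(w_ridge)
```

The same check with an L1 penalty would recover the Laplace-prior (lasso) correspondence, though the non-smooth objective needs a subgradient condition instead.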
Non-examples:
- MAP is not full Bayesian inference. It collapses the posterior to one point.
- A regularizer is not automatically Bayesian unless it can be interpreted as a log-prior under a probabilistic model.
This section should connect back to Estimation Theory, where MLE was treated as the frequentist workhorse. MAP is the bridge between frequentist optimization and Bayesian inference: the moment we add a prior penalty to the likelihood objective, we have moved into Bayesian territory, even if we stop short of integrating over the posterior.
For AI: Weight decay in large language model training is often discussed as a purely optimization-side trick. The Bayesian interpretation is cleaner: it encodes prior preference for smaller parameter norms, which in turn stabilizes generalization and reduces overfitting.
4.2 Posterior Covariance and Shrinkage
Point estimates answer "where is the posterior centered?" Posterior covariance answers "how uncertain are we, and in which directions?"
For a vector parameter $\theta$, the posterior covariance is

$\operatorname{Cov}(\theta \mid x) = \mathbb{E}\left[\left(\theta - \mathbb{E}[\theta \mid x]\right)\left(\theta - \mathbb{E}[\theta \mid x]\right)^\top \,\middle|\, x\right]$
This matrix tells us:
- which directions are well identified
- which directions remain uncertain
- how parameter uncertainties co-vary
- how much predictions should vary when parameter uncertainty is propagated
Shrinkage is the visible effect of posterior covariance interacting with the prior. In small samples, the posterior often contracts toward structured prior beliefs, reducing variance at the price of some bias. This tradeoff is often beneficial.
Examples:
- In Gaussian-Gaussian updating, posterior variance is smaller than both prior variance and raw sampling variance once both information sources are combined.
- In Bayesian linear regression, highly collinear features produce posterior covariance directions with large uncertainty.
- In hierarchical models, weak groups borrow information from the population prior, dramatically reducing variance.
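The first example above can be verified directly. In the conjugate normal-normal model, the posterior precision is the sum of prior precision and data precision, so posterior variance is below both the prior variance and the sampling variance of the mean. Values are illustrative:

```python
# Gaussian-Gaussian updating: posterior variance is smaller than both the
# prior variance and the sampling variance once the two are combined.
import numpy as np

prior_mean, prior_var = 0.0, 4.0   # prior on the unknown mean theta
sigma2, n = 1.0, 10                # known noise variance, sample size
x = np.full(n, 0.8)                # any data with sample mean 0.8

sampling_var = sigma2 / n          # variance of the sample mean: 0.1
post_var = 1.0 / (1.0 / prior_var + n / sigma2)
post_mean = post_var * (prior_mean / prior_var + x.sum() / sigma2)

assert post_var < prior_var and post_var < sampling_var
print(post_mean, post_var)         # posterior mean shrinks x-bar toward the prior mean
```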
Non-examples:
- Shrinkage is not the same as underfitting. It is controlled bias used to reduce variance.
- Posterior covariance is not simply the Hessian inverse unless a Gaussian approximation is justified.
Posterior covariance also clarifies why Bayesian inference is attractive for unstable estimation problems. If the likelihood is flat in some direction, a frequentist optimizer may still return one arbitrary point. A posterior keeps that uncertainty visible.
For AI: In Bayesian deep learning, uncertainty in weights is usually too high-dimensional to represent exactly, so practical methods approximate only a low-rank or diagonal covariance. Methods like SWAG make this tradeoff explicit.
4.3 Posterior Predictive Distribution
The posterior predictive distribution is the Bayesian answer to prediction:

$p(\tilde{x} \mid x) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid x)\, d\theta$
This is the most important formula in applied Bayesian inference after Bayes' rule itself.
Why? Because it integrates parameter uncertainty into prediction. Plug-in prediction replaces the posterior $p(\theta \mid x)$ by a point estimate $\hat{\theta}$ such as MLE or MAP:

$p(\tilde{x} \mid x) \approx p(\tilde{x} \mid \hat{\theta})$
Bayesian prediction averages over all plausible parameter values under the posterior. That averaging is often more stable and better calibrated.
Examples:
- Beta-Binomial: the predictive probability of success equals the posterior mean of the success probability.
- Gamma-Poisson: integrating out the Poisson rate yields an overdispersed predictive distribution, more realistic than a pure plug-in Poisson.
- Bayesian linear regression: predictive variance contains both observation noise and parameter uncertainty.
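The Beta-Binomial claim can be confirmed by Monte Carlo: averaging the Bernoulli success probability over posterior draws recovers the posterior mean. Prior pseudo-counts here are illustrative:

```python
# Beta-Binomial: the predictive probability of the next success equals the
# posterior mean of the success probability. Checked by simulation.
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 2.0                 # illustrative Beta prior pseudo-counts
k, n = 7, 10                    # observed successes / trials
a_post, b_post = a + k, b + (n - k)

posterior_mean = a_post / (a_post + b_post)          # 9/14
# Monte Carlo version of the predictive integral: E[theta | data]
theta_draws = rng.beta(a_post, b_post, size=200_000)
mc_predictive = theta_draws.mean()

assert abs(mc_predictive - posterior_mean) < 0.01
print(posterior_mean, mc_predictive)
```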
Non-examples:
- Reporting only a posterior mean parameter and then acting as if uncertainty vanished.
- Confusing the posterior over parameters with the predictive distribution over future observations.
The distinction matters. A posterior can be concentrated while predictions remain noisy because the observation model itself is noisy. Conversely, predictions can be sharp even when parameter uncertainty is moderate if the downstream quantity is insensitive to the uncertain directions.
For AI: Posterior predictive distributions are directly relevant to calibrated confidence, abstention, active learning, and synthetic-data generation under uncertainty.
The posterior predictive can be written in two-stage form:

$\theta \sim p(\theta \mid x), \qquad \tilde{x} \sim p(\tilde{x} \mid \theta)$
This says:
- sample a plausible parameter from what the data allow
- simulate or score the new observation under that parameter
- average over all such plausible parameters
That averaging is often called Bayesian model averaging at the parameter level. It explains why posterior predictive distributions can be more conservative and better calibrated than point-estimate predictions. A point estimate pretends one parameter is true. The posterior predictive averages over uncertainty honestly.
Three useful examples:
Example 1: Beta-Binomial smoothing. If one observes 1 success in 1 trial, the MLE predictive probability of success is 1. The posterior predictive with a moderate prior is far lower, avoiding catastrophic overconfidence from tiny samples.
Example 2: Gamma-Poisson overdispersion. Plugging in the posterior mean rate yields a Poisson predictive variance equal to the mean. Integrating over posterior rate uncertainty yields larger variance, which is often more realistic for real count data.
Example 3: Bayesian linear regression. Predictions far from the observed design points have larger uncertainty because the parameter-uncertainty term in the predictive variance grows in poorly observed directions. This is exactly the kind of uncertainty a safe system should expose.
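Example 2 can be checked by simulation. Under an assumed Gamma posterior over the Poisson rate (illustrative shape and rate values), the plug-in Poisson has variance equal to its mean, while the integrated predictive is overdispersed:

```python
# Gamma-Poisson: integrating over rate uncertainty inflates predictive variance
# beyond the plug-in Poisson's variance-equals-mean.
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 5.0, 0.5            # illustrative Gamma(shape, rate) posterior on the rate

plug_in_rate = alpha / beta       # posterior mean rate = 10
plug_in_var = plug_in_rate        # plug-in Poisson: variance equals the mean

# Posterior predictive by simulation: draw a rate, then a count.
rates = rng.gamma(alpha, 1.0 / beta, size=200_000)   # numpy uses scale = 1/rate
counts = rng.poisson(rates)
predictive_var = counts.var()

# Analytic predictive variance: alpha/beta + alpha/beta^2 = 10 + 20 = 30
assert predictive_var > plug_in_var
print(plug_in_var, predictive_var)
```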
Two non-examples:
Non-example 1: "posterior predictive uncertainty is just aleatoric noise." No. It also includes epistemic uncertainty from the posterior over parameters.
Non-example 2: "if the posterior is concentrated, predictive uncertainty must be small." Not necessarily. If the data model itself is noisy, predictive uncertainty can remain large even with very certain parameters.
This distinction between parameter uncertainty and observation uncertainty is crucial in AI safety and active learning. Epistemic uncertainty can often be reduced by collecting more data; aleatoric uncertainty often cannot.
4.4 Posterior Predictive Checks
A model can fit the data badly even if posterior computation is exact. Bayesian inference does not rescue a misspecified model. That is why posterior predictive checks (PPCs) matter.
The idea is simple:
- Sample $\theta^{(s)} \sim p(\theta \mid x)$.
- Simulate replicated data $x^{\text{rep},(s)} \sim p(x \mid \theta^{(s)})$.
- Compare statistics of $x^{\text{rep}}$ to those of the observed dataset.
If the model is adequate, the observed data should look typical under these replications. If the observed data sit in the extreme tail of replicated summaries, the model is missing important structure.
Examples of discrepancy statistics:
- mean or variance
- tail behavior
- number of zero counts
- class imbalance
- calibration error
- residual autocorrelation
Examples:
- A Poisson model may underfit overdispersed counts; PPCs reveal replicated variance far below observed variance.
- A Gaussian observation model may miss heavy tails; PPCs reveal too few extreme values in the replicated data.
- A Bayesian classifier may look good on average accuracy but fail PPC-style calibration diagnostics.
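The first example above can be sketched end to end. All data here are simulated: the "observed" counts are deliberately overdispersed, the Poisson rate gets a conjugate Gamma posterior, and the replicated variances expose the misfit:

```python
# Posterior predictive check sketch: a Poisson model fit to overdispersed counts.
import numpy as np

rng = np.random.default_rng(3)
# Overdispersed "observed" data: Poisson with a Gamma-mixed rate.
observed = rng.poisson(rng.gamma(2.0, 5.0, size=200))

# Conjugate Gamma(a, b) posterior for the Poisson rate (weak prior).
a_post = 1.0 + observed.sum()
b_post = 0.1 + len(observed)

rep_vars = []
for _ in range(500):
    rate = rng.gamma(a_post, 1.0 / b_post)        # theta ~ posterior
    rep = rng.poisson(rate, size=len(observed))   # replicated dataset
    rep_vars.append(rep.var())

# Tail probability of the observed variance under replications.
p_value = np.mean(np.array(rep_vars) >= observed.var())
print(observed.var(), np.mean(rep_vars), p_value)
```

The observed variance sits far above every replicated variance, so the tail probability is essentially zero: the Poisson model is missing the overdispersion.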
Non-examples:
- Using PPCs as formal proof the model is true. They are diagnostics, not proofs.
- Comparing only one statistic and declaring the whole model validated.
For AI: In probabilistic ML systems, PPCs are a natural way to ask whether the learned generative or predictive model captures the patterns that matter operationally. They are especially useful for drift monitoring and calibration sanity checks.
PPCs are especially valuable because they focus attention on the observable world rather than on latent mathematical elegance. A posterior may look concentrated and well behaved while still generating unrealistic data. The real question is not "did the model optimize cleanly?" but "if I sampled from this model after seeing the data, would the synthetic worlds resemble the one I am actually trying to understand?"
Three examples:
Example 1: A language-quality model might match average toxicity rates but fail badly on tail-risk prompts. A PPC based on extreme-quantile toxicity scores can reveal the gap.
Example 2: A recommendation model may match mean click rate yet fail to reproduce the observed heterogeneity across users. A PPC on user-level variance catches what global accuracy misses.
Example 3: A forecasting model may predict the correct marginal variance but miss autocorrelation structure. A PPC based on lagged residual statistics can reveal the defect.
4.5 Bayesian Linear Regression as a Worked Example
Bayesian linear regression is a complete worked example that unifies prior design, posterior inference, shrinkage, and posterior prediction without becoming a full regression chapter.
Assume

$y = Xw + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)$

with prior

$w \sim \mathcal{N}(0, \tau^2 I)$

The posterior is Gaussian:

$w \mid X, y \sim \mathcal{N}(\mu_N, \Sigma_N)$

where

$\Sigma_N = \left(\frac{1}{\sigma^2} X^\top X + \frac{1}{\tau^2} I\right)^{-1}, \qquad \mu_N = \frac{1}{\sigma^2}\, \Sigma_N X^\top y$
This formula contains several important ideas at once.
MAP equals ridge. The posterior mode is the ridge estimator:

$\hat{w}_{\text{MAP}} = \left(X^\top X + \frac{\sigma^2}{\tau^2} I\right)^{-1} X^\top y$
Posterior covariance measures identifiability. If columns of $X$ are nearly collinear, $X^\top X$ is ill-conditioned and posterior uncertainty grows along unstable directions.
Posterior predictive is Gaussian. For a new feature vector $x_*$,

$\tilde{y} \mid x_*, X, y \sim \mathcal{N}\!\left(x_*^\top \mu_N,\ \sigma^2 + x_*^\top \Sigma_N x_*\right)$

The first term in the variance is observation noise. The second term is parameter uncertainty.
Examples:
- With many data and moderate noise, $\Sigma_N$ shrinks and predictions approach plug-in regression.
- With little data, predictions widen far from the training design.
- With strong prior shrinkage (small $\tau^2$), coefficients stay near zero unless strongly supported by the data.
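These formulas fit in a few lines of code. The sketch below assumes known noise and prior scales and a simple intercept-plus-slope design, and shows the predictive variance widening away from the training inputs:

```python
# Bayesian linear regression sketch: predictive variance grows away from
# the training inputs. sigma and tau are assumed known for simplicity.
import numpy as np

rng = np.random.default_rng(4)
sigma, tau = 0.3, 2.0
x = rng.uniform(-1, 1, size=30)
X = np.column_stack([np.ones_like(x), x])       # intercept + slope features
y = X @ np.array([0.5, 1.5]) + sigma * rng.normal(size=30)

# Posterior: Sigma_N = (X^T X / sigma^2 + I / tau^2)^{-1}, mu_N = Sigma_N X^T y / sigma^2
Sigma_N = np.linalg.inv(X.T @ X / sigma**2 + np.eye(2) / tau**2)
mu_N = Sigma_N @ X.T @ y / sigma**2

def predictive_var(x_star):
    phi = np.array([1.0, x_star])
    return sigma**2 + phi @ Sigma_N @ phi       # noise + parameter uncertainty

assert predictive_var(5.0) > predictive_var(0.0)  # wider far from the data
print(predictive_var(0.0), predictive_var(5.0))
```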
Non-examples:
- Calling Bayesian linear regression "just ridge." Ridge corresponds only to the MAP point, not to the full posterior or predictive distribution.
- Thinking the predictive variance is just the residual variance $\sigma^2$. It also contains uncertainty from estimating $w$.
This example is the right stopping point for this section. We use regression as a vehicle for Bayesian ideas, not as a substitute for the full Regression Analysis section.
5. Core Theory III: Model Comparison and Hierarchical Bayes
5.1 Marginal Likelihood and Occam's Razor
The marginal likelihood

$p(x \mid M) = \int p(x \mid \theta, M)\, p(\theta \mid M)\, d\theta$
scores an entire model class by averaging the likelihood over its parameter space under the prior.
This is conceptually different from maximum likelihood. Maximum likelihood asks: is there one parameter setting in this model that fits the data well? Marginal likelihood asks: does the model place substantial prior mass on parameter settings that fit the data well? A flexible model with huge parameter space can achieve a large maximum likelihood while still receiving a poor marginal likelihood if most of its parameter space fits badly.
This is the Bayesian version of Occam's razor. Complex models are not penalized by an arbitrary extra term. They are penalized automatically because averaging over a larger parameter space dilutes probability mass unless the complexity is truly supported by the data.
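This automatic penalty is easy to see in a hypothetical coin-flip comparison: a sharp model that pins the bias at 0.5 against a flexible model with a uniform prior on the bias. Under the uniform prior, the Beta-Binomial marginal likelihood of $k$ successes in $n$ trials is $1/(n+1)$ for every $k$:

```python
# Bayesian Occam's razor sketch: a sharp fixed-parameter model beats a flexible
# one when the data match its prediction, and loses when they don't.
from math import comb

def evidence_fixed(k, n, theta=0.5):
    # M1: bias pinned at 0.5 -- nothing to integrate over
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def evidence_uniform(k, n):
    # M2: bias ~ Uniform(0,1); the Beta-Binomial marginal likelihood
    # C(n,k) * B(k+1, n-k+1) simplifies to 1/(n+1) for every k.
    return 1.0 / (n + 1)

# Balanced data: the sharp model wins the evidence comparison.
assert evidence_fixed(6, 10) > evidence_uniform(6, 10)
# Extreme data: the flexible model wins.
assert evidence_fixed(10, 10) < evidence_uniform(10, 10)
print(evidence_fixed(6, 10), evidence_uniform(6, 10))
```

The flexible model spreads its evidence evenly over all outcomes, so it can never score as well on data the sharp model predicted precisely.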
Examples:
- A model with many irrelevant parameters can overfit pointwise but still have weak marginal likelihood.
- A simpler model with slightly lower peak likelihood may win because it concentrates probability on useful regions.
- In latent-variable models, richer priors can improve marginal likelihood when they align with real structure rather than adding free flexibility blindly.
Non-examples:
- Marginal likelihood is not "the likelihood at the posterior mean."
- Occam's razor here is not a hand-tuned penalty; it is a consequence of integration.
For AI: Bayesian model evidence gives a principled language for deciding whether extra complexity in a probabilistic model is truly supported, rather than rewarding any architecture that can fit the training data harder.
5.2 Bayes Factors and Bayesian Model Comparison
If two models or hypotheses $M_1$ and $M_2$ are being compared, the Bayes factor is

$BF_{12} = \frac{p(x \mid M_1)}{p(x \mid M_2)}$

It updates prior model odds into posterior model odds:

$\frac{p(M_1 \mid x)}{p(M_2 \mid x)} = BF_{12} \times \frac{p(M_1)}{p(M_2)}$
This is the Bayesian analog of hypothesis comparison, but it differs fundamentally from p-values and likelihood-ratio tests.
- A p-value asks how surprising the data are under the null.
- A likelihood ratio compares best-fit parameter settings.
- A Bayes factor compares integrated support across the full parameter spaces.
Examples:
- Comparing a null model to a nonzero-effect model in online experimentation.
- Comparing a sparse prior to a dense prior when choosing feature structure.
- Comparing probabilistic sequence models under different smoothing assumptions.
Non-examples:
- Treating a Bayes factor as a posterior probability without including prior model odds.
- Treating Bayes factors as numerically interchangeable with p-values.
Backward reference: Hypothesis Testing gave only a brief preview of Bayes factors. This section is their canonical home.
For AI: In benchmark comparison and model selection, Bayes factors offer an evidence-centric alternative to thresholded significance testing, especially when one wants direct probability updates on competing models.
There is a subtle but important reason Bayes factors can behave very differently from p-values. Under a diffuse alternative prior, the model spreads probability mass over many parameter values that fit the data poorly. Even if one subset of those values fits well, the integrated evidence can remain modest because the average fit under the alternative is diluted. This is sometimes called the Occam penalty in action.
That means Bayes factors are sensitive not only to the data but also to the geometry of the alternative model. A large model with a vague prior can be punished more strongly than intuition expects. This is not a bug. It is the mechanism by which Bayesian evidence rewards models that made sharp, successful predictions rather than merely permitting many possibilities.
Three instructive examples:
Example 1: point null vs broad alternative. Suppose the null says $\theta = 0$ and the alternative says $\theta \sim \mathcal{N}(0, \tau^2)$ with large $\tau$. If the observed effect is small but nonzero, the Bayes factor can still favor the null because most of the broad alternative prior mass predicted larger effects than the data delivered.
Example 2: nested regression models. A larger regression model may improve the fitted likelihood, yet still lose on marginal likelihood if the extra coefficients are poorly identified and the prior spreads mass widely over them.
Example 3: model family comparison in sequence modeling. A more expressive probabilistic decoder can assign very high likelihood to some settings but still lose evidence if the prior support is too diffuse relative to the data size.
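Example 1 can be computed in closed form for a normal mean with known sampling variance: under the alternative, the observed effect is marginally $\mathcal{N}(0, \tau^2 + \sigma^2/n)$. The numbers below are illustrative:

```python
# Point null vs broad alternative: with a small observed effect, a more
# diffuse alternative prior gives stronger support to the null.
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)

sigma2_over_n = 0.04    # sampling variance of the observed mean effect
x_bar = 0.15            # small but nonzero observed effect

def bf_null_vs_alt(tau2):
    # H0: theta = 0; H1: theta ~ N(0, tau2), so marginally x_bar ~ N(0, tau2 + sigma2/n)
    return normal_pdf(x_bar, 0.0, sigma2_over_n) / normal_pdf(x_bar, 0.0, tau2 + sigma2_over_n)

# A more diffuse alternative is penalized more heavily (Occam penalty in action).
assert bf_null_vs_alt(100.0) > bf_null_vs_alt(1.0) > 1.0
print(bf_null_vs_alt(1.0), bf_null_vs_alt(100.0))
```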
Two non-examples:
Non-example 1: "Bayes factor close to 1 means the models are equally true." No. It means the observed data do not strongly update the prior odds between the candidate models.
Non-example 2: "A strong Bayes factor removes dependence on the prior." The Bayes factor already incorporates the prior over model parameters. Strong evidence can dominate weak prior odds, but it does not erase prior specification.
Practical model comparison therefore often includes:
- clear prior specification on each model
- sensitivity analysis over prior scales
- reporting posterior odds only after stating prior model odds
- avoiding the temptation to compare Bayes factors and p-values as if they answered the same question
For decision-making, this matters because Bayesian comparison asks not only "can this model fit?" but "did this model place meaningful prior probability on what actually happened?" That is a much stronger and often more useful standard.
5.3 Empirical Bayes
Empirical Bayes sits between full Bayes and plug-in frequentist estimation. The idea is:
- posit a prior family $p(\theta \mid \eta)$ with hyperparameter $\eta$
- estimate $\eta$ from the data, often by maximizing marginal likelihood
- condition on the estimated $\hat{\eta}$ and proceed as if it were fixed
Formally,

$\hat{\eta} = \arg\max_{\eta} p(x \mid \eta) = \arg\max_{\eta} \int p(x \mid \theta)\, p(\theta \mid \eta)\, d\theta$

and inference then proceeds with $p(\theta \mid x, \hat{\eta})$.
This approach is attractive because it preserves shrinkage and pooled-information benefits without requiring a fully Bayesian posterior over hyperparameters.
Examples:
- estimating a global variance scale for many related coefficients
- setting smoothing strength in large collections of sparse count models
- fitting prior variance in Bayesian linear models from observed tasks
Non-examples:
- Calling empirical Bayes "fully Bayesian." It is not, because hyperparameters are plugged in rather than integrated out.
- Thinking empirical Bayes is automatically ad hoc. In high-dimensional settings it can be extremely effective and principled.
For AI: Empirical Bayes ideas appear whenever a system learns regularization strength, task-sharing scales, or uncertainty hyperparameters from data instead of fixing them by hand.
Empirical Bayes is often misunderstood because it occupies an uncomfortable middle ground. Purist Bayesians object that uncertainty in $\eta$ should be integrated out, not plugged in. Pure frequentists may object that the prior is being estimated from the same data it is supposed to regularize. In practice, the method works well in many high-dimensional settings because it shares information efficiently while remaining computationally tractable.
Three examples:
Example 1: many related effect sizes. If thousands of sparse task-level coefficients are believed to come from a common Gaussian prior, estimating the prior variance from all tasks can dramatically improve shrinkage relative to tuning each task independently.
Example 2: empirical Bayes smoothing in count models. Large collections of low-count rates benefit from a shared prior estimated from the full population, then used to stabilize each local estimate.
Example 3: layerwise uncertainty scales. In approximate Bayesian deep learning, one may fit prior or posterior scale hyperparameters from data rather than fixing one global number.
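Example 1 can be sketched with a normal-normal toy model: many noisy effects share a common Gaussian prior whose variance is estimated from the full collection, then plugged back in for shrinkage. The true variances here are made up:

```python
# Empirical Bayes for many related effects: estimate the shared prior
# variance from the marginal distribution, then shrink every estimate.
import numpy as np

rng = np.random.default_rng(5)
K, s2, tau2_true = 2000, 1.0, 0.5
theta = rng.normal(0.0, np.sqrt(tau2_true), size=K)   # true group effects
y = theta + rng.normal(0.0, np.sqrt(s2), size=K)      # one noisy observation each

# Marginally y_k ~ N(0, tau2 + s2), so a moment estimator of tau2 is:
tau2_hat = max(np.mean(y**2) - s2, 0.0)

# Plug the estimate back in: posterior mean shrinks each y_k toward zero.
shrink = tau2_hat / (tau2_hat + s2)
theta_hat = shrink * y

mse_raw = np.mean((y - theta) ** 2)
mse_eb = np.mean((theta_hat - theta) ** 2)
assert mse_eb < mse_raw      # pooled shrinkage beats the unpooled estimates
print(tau2_hat, mse_raw, mse_eb)
```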
Two non-examples:
Non-example 1: "Empirical Bayes uses the data twice, so it is invalid." It uses the data in a coupled estimation problem. Whether the approximation is acceptable depends on the inferential goal and uncertainty requirements, not on a slogan.
Non-example 2: "Empirical Bayes removes all need for priors." It still requires a prior family; only the hyperparameters are estimated.
5.4 Hierarchical Models and Partial Pooling
Hierarchical Bayes is one of the most practically important Bayesian ideas.
Suppose we estimate task-specific parameters $\theta_1, \dots, \theta_K$ for $K$ related groups. A non-hierarchical approach either:
- fits each group independently (no pooling), or
- forces all groups to share one parameter (complete pooling).
Hierarchical Bayes introduces a population distribution:

$\theta_k \sim p(\theta \mid \eta), \qquad \eta \sim p(\eta)$

Now each group has its own parameter, but the parameters are coupled through shared hyperparameters.
This creates partial pooling:
- data-rich groups are estimated mostly from their own data
- data-poor groups are shrunk toward the population mean
- uncertainty is propagated at both group and population levels
Examples:
- user-specific click models in recommendation systems
- per-language or per-region metrics in multilingual products
- patient- or site-level effects in medical ML
Non-examples:
- Averaging all groups together and calling it hierarchical. That is complete pooling.
- Fitting all groups independently and then post-hoc averaging. That misses shared uncertainty structure.
ASCII PARTIAL POOLING
======================================================================
no pooling: each group learns alone
complete pooling: all groups forced to same value
partial pooling: groups share information through a population prior
weak groups ---> pulled strongly toward population mean
strong groups ---> remain close to their own data signal
======================================================================
Hierarchical Bayes often outperforms both extremes because it respects heterogeneity while reducing variance.
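The pulling behavior in the diagram can be made concrete with known variances: each group's posterior mean is a precision-weighted blend of its own sample mean and the population mean. Population parameters here are assumed known for simplicity:

```python
# Partial pooling sketch (known variances): a data-poor group is pulled
# toward the population mean far more than a data-rich group.
pop_mean, tau2 = 0.0, 1.0     # population prior N(pop_mean, tau2), assumed known
sigma2 = 4.0                  # within-group observation variance

def pooled_estimate(group_mean, n):
    # Posterior mean of theta_k given n observations averaging group_mean:
    # precision-weighted combination of the group's data and the population prior.
    w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)   # weight on the group's own data
    return w * group_mean + (1 - w) * pop_mean

# Same raw mean, very different sample sizes:
weak = pooled_estimate(2.0, n=2)      # pulled strongly toward the population mean
strong = pooled_estimate(2.0, n=200)  # stays close to its own data signal

assert abs(weak - pop_mean) < abs(strong - pop_mean)
print(weak, strong)
```

In a full hierarchical model the population mean and variance would themselves carry posterior uncertainty; fixing them keeps the shrinkage mechanism visible.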
For AI: Partial pooling is valuable whenever many related tasks, users, prompts, or environments share structure but not identity. It is a natural language for transfer, personalization, and multi-task uncertainty.
The hidden principle behind hierarchical Bayes is exchangeability. Before seeing group identities in detail, we often judge the groups to be similar enough that their parameters should be modeled as draws from a common population. Exchangeability is weaker than identical equality and stronger than complete independence. It says the ordering of groups should not matter to the prior.
This matters because many AI problems have exactly this form:
- users are different but not unrelated
- prompts are different but drawn from a broader task family
- regional models differ but share common infrastructure
- evaluation suites contain related but nonidentical sub-benchmarks
Hierarchical models let the data decide how much pooling is appropriate. If the estimated population variance $\tau^2$ is small, groups are strongly tied together. If $\tau^2$ is large, the model relaxes toward no pooling.
Three extended examples:
Example 1: multilingual toxicity filtering. High-resource languages provide abundant labels, low-resource languages provide few. A hierarchical prior over language-specific parameters lets low-resource languages borrow strength without assuming all languages behave identically.
Example 2: hospital-level outcome models. Some hospitals contribute many observations, others very few. Partial pooling prevents extreme estimates for small hospitals while preserving true large-hospital differences.
Example 3: prompt family evaluation. If one estimates failure rates across prompt categories, hierarchical Bayes prevents sparse categories from producing overconfident extremes.
Two non-examples:
Non-example 1: using complete pooling because it "improves stability." Stability bought by erasing genuine heterogeneity is often misleading.
Non-example 2: using no pooling because it "avoids assumptions." Independence across groups is itself a strong assumption and is often worse.
The characteristic output of a hierarchical model is not just one estimate per group. It is a joint posterior over all groups and population-level hyperparameters. That joint uncertainty is what makes principled partial pooling possible.
5.5 Prior Sensitivity and Robustness
Bayesian inference is only as trustworthy as the model-prior pair. Sensitivity analysis asks: how much do posterior conclusions change under reasonable prior alternatives?
This is especially important when:
- data are scarce
- the likelihood is weakly informative
- model comparison is sensitive to the prior scale
- posterior tails drive decisions
Examples of sensitivity questions:
- does a launch decision change if the prior on effect size widens modestly?
- does a Bayes factor reverse if the alternative prior is made too diffuse?
- do posterior tail probabilities for safety risk stay stable across plausible priors?
Non-examples:
- Reporting one prior and pretending prior choice is irrelevant.
- Declaring the posterior "objective" because the prior was weakly informative.
Robust Bayesian practice often includes:
- prior predictive checks
- multiple plausible prior specifications
- reporting how posterior summaries move with prior strength
- avoiding excessively diffuse priors that create unstable computation or misleading evidence calculations
For AI: Prior sensitivity matters in safety-related claims, low-data evaluations, and model-comparison problems where apparently strong evidence can collapse under a slightly different prior scale.
6. Advanced Topics: Approximate Inference
6.1 Why Exact Posteriors Become Intractable
Exact posterior formulas rely on one of two favorable situations:
- conjugacy gives analytic normalization and closed-form updates
- the parameter space is small enough for direct numerical integration
Modern ML problems usually satisfy neither. Deep nets have millions or billions of parameters, latent-variable models require integration over large hidden spaces, and structured priors introduce dependencies that destroy conjugacy.
The generic problem is:

$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$

The unnormalized numerator is easy to write down, but the posterior is hard to normalize, summarize, sample from, or integrate against.
This creates four recurring computational goals:
- approximate the posterior itself
- approximate posterior expectations
- approximate the evidence
- approximate the posterior predictive
Approximate inference methods trade off bias, variance, scalability, and calibration. There is no universal winner.
6.2 MCMC for Posterior Sampling
Markov chain Monte Carlo constructs a Markov chain whose stationary distribution is the posterior. Once the chain has mixed, samples from the chain can approximate posterior expectations:

$\mathbb{E}\left[f(\theta) \mid x\right] \approx \frac{1}{S} \sum_{s=1}^{S} f\!\left(\theta^{(s)}\right), \qquad \theta^{(s)} \sim p(\theta \mid x)$
Important families include:
- Metropolis-Hastings
- Gibbs sampling
- Hamiltonian Monte Carlo
We do not re-derive Markov-chain theory here; that belongs to Markov Chains. Here the emphasis is inferential use.
Examples:
- Metropolis-Hastings proposes a candidate and accepts it with a ratio preserving the posterior as invariant.
- Gibbs sampling cycles through tractable conditional distributions.
- HMC uses gradients and auxiliary momentum to explore high-dimensional posteriors more efficiently.
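The Metropolis-Hastings recipe fits in a dozen lines. This sketch targets a standard normal (so the answer is checkable); a real posterior would replace `log_target` with an unnormalized log-posterior:

```python
# Minimal random-walk Metropolis-Hastings: sample from an unnormalized
# log-density. Target here is a standard normal for easy checking.
import numpy as np

def log_target(theta):
    return -0.5 * theta**2          # unnormalized log posterior

rng = np.random.default_rng(6)
theta, samples = 0.0, []
for _ in range(50_000):
    proposal = theta + rng.normal(0.0, 1.0)     # symmetric random-walk proposal
    # Accept with probability min(1, p(proposal)/p(theta)), computed in log space.
    if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
        theta = proposal
    samples.append(theta)                       # keep the current state either way

samples = np.array(samples[5_000:])             # drop burn-in
print(samples.mean(), samples.var())            # should approach 0 and 1
```

Note that rejected proposals still contribute the current state to the sample; dropping them would bias the chain.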
Non-examples:
- A sampler that returns one optimized point is not MCMC.
- A chain that has not mixed does not justify posterior summaries as if they were exact.
Strengths:
- asymptotically exact under appropriate conditions
- flexible across complex posterior shapes
- natural uncertainty propagation
Weaknesses:
- expensive in high dimensions
- convergence diagnostics are nontrivial
- often hard to scale to modern deep learning without approximation
For AI: MCMC remains essential in Bayesian modeling and scientific ML, but for very large neural networks it is often replaced by approximate or local methods because full posterior sampling is too costly.
Because MCMC is approximate in finite compute, diagnostics matter.
Important practical diagnostics include:
- trace plots for qualitative mixing
- effective sample size (ESS)
- split-$\hat{R}$ for cross-chain agreement
- autocorrelation decay
- divergence warnings in HMC-like methods
These diagnostics do not prove correctness, but they can reveal obvious failure. A badly mixed chain with high autocorrelation may produce deceptively stable but wrong posterior summaries.
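Effective sample size makes the cost of autocorrelation concrete. The sketch below uses a crude truncated-autocorrelation estimator (production tools use more careful versions) on a deliberately sticky AR(1) chain versus independent draws:

```python
# Effective sample size sketch: a highly autocorrelated chain carries far
# less information than its raw length suggests.
import numpy as np

def ess(chain, max_lag=200):
    # Crude ESS: n / (1 + 2 * sum of positive autocorrelations), truncated
    # once the estimated correlation has decayed.
    x = chain - chain.mean()
    n = len(x)
    acf_sum = 0.0
    for lag in range(1, max_lag):
        rho = np.dot(x[:-lag], x[lag:]) / np.dot(x, x)
        if rho < 0.05:
            break
        acf_sum += rho
    return n / (1 + 2 * acf_sum)

rng = np.random.default_rng(7)
n, phi = 20_000, 0.9
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()     # sticky AR(1) chain

iid = rng.normal(size=n)
assert ess(x) < ess(iid) / 2                 # same length, far less information
print(ess(x), ess(iid))
```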
Three examples:
Example 1: random-walk Metropolis in high dimension. Acceptance rates can collapse because local proposals do not move efficiently through the posterior.
Example 2: Gibbs in strongly correlated posteriors. Each conditional update is easy, but the chain moves slowly across the joint geometry.
Example 3: HMC in smooth posteriors. Gradient-guided exploration often gives better mixing and lower autocorrelation than naive local proposals.
Two non-examples:
Non-example 1: "the chain ran for many iterations, so it must have converged." Long runtime alone proves nothing.
Non-example 2: "MCMC gives exact samples." Only the stationary limit is exact under assumptions. Finite-run practice is always approximate.
There is also a conceptual distinction between optimization and sampling. Optimization seeks one good point. Sampling seeks representative coverage of the whole posterior mass. Mixing time, not just objective decrease, determines success.
6.3 Variational Inference and the ELBO
Variational inference replaces integration with optimization. Choose a tractable family $\mathcal{Q}$ and fit it to the posterior by minimizing KL divergence:

$q^* = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\!\left(q(\theta) \,\|\, p(\theta \mid x)\right)$

Because the true posterior contains the intractable evidence, we optimize the evidence lower bound (ELBO):

$\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}\!\left(q(\theta) \,\|\, p(\theta \mid x)\right)$

where

$\mathrm{ELBO}(q) = \mathbb{E}_{q}\left[\log p(x, \theta)\right] - \mathbb{E}_{q}\left[\log q(\theta)\right]$
Since KL divergence is nonnegative, maximizing the ELBO minimizes the KL gap to the posterior within the chosen family.
Examples:
- mean-field VI assumes factorization and gains scalability
- structured VI adds correlations for better fidelity
- stochastic VI uses minibatches for large datasets
Non-examples:
- VI is not exact Bayes unless the true posterior lies in the chosen variational family.
- Mean-field VI is not "uncertainty solved." It often underestimates posterior variance because minimizing $\mathrm{KL}(q \,\|\, p)$ tends to favor mode-seeking approximations.
For AI: VI is central to VAEs, amortized latent inference, approximate Bayesian neural networks, and many large-scale probabilistic models where sampling is too expensive.
The direction of KL divergence matters. Standard VI often minimizes

$\mathrm{KL}\!\left(q(\theta) \,\|\, p(\theta \mid x)\right)$

not

$\mathrm{KL}\!\left(p(\theta \mid x) \,\|\, q(\theta)\right)$

These are not symmetric. Minimizing $\mathrm{KL}(q \,\|\, p)$ heavily penalizes placing mass where the true posterior has little support, but it is more tolerant of missing some posterior mass. As a result, mean-field VI often becomes mode-seeking and underestimates posterior variance.
This explains several common empirical observations:
- VI can produce sharp but overconfident approximations
- multimodal posteriors may be represented by one dominant mode
- diagonal Gaussian approximations miss posterior correlations
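The variance underestimation is exactly computable for a Gaussian target: the reverse-KL mean-field optimum matches the diagonal of the precision matrix, so each factor gets variance $1/\Lambda_{ii}$, which is below the true marginal variance $\Sigma_{ii}$ whenever correlations are present:

```python
# Mean-field reverse-KL variance underestimation, made concrete for a
# correlated 2D Gaussian target.
import numpy as np

rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])   # true posterior covariance
Lambda = np.linalg.inv(Sigma)                 # precision matrix

true_marginal_var = Sigma[0, 0]               # = 1.0
vi_var = 1.0 / Lambda[0, 0]                   # mean-field optimum: 1 - rho^2 = 0.19

assert vi_var < true_marginal_var
print(true_marginal_var, vi_var)
```

At correlation 0.9 the mean-field approximation reports barely a fifth of the true marginal variance, which is the "sharp but overconfident" failure mode in miniature.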
Examples:
Example 1: bimodal posterior. A mean-field Gaussian may center on one mode and largely ignore the other.
Example 2: funnel geometry. A simple Gaussian variational family can misrepresent heavy curvature and tail structure badly.
Example 3: latent-variable autoencoders. A diagonal encoder posterior is computationally efficient, but it necessarily suppresses some dependence structure among latent coordinates.
Two non-examples:
Non-example 1: "ELBO improvement means posterior quality is good." A better ELBO is good within the variational family, but a limited family can still miss important structure.
Non-example 2: "VI is just a faster version of MCMC." It solves a different approximation problem and produces different error modes.
There are several standard responses:
- richer variational families
- normalizing flows
- low-rank covariance structure
- importance-weighted objectives
- hybrid methods mixing variational warm starts with sampling refinements
The correct mental model is not "VI approximates Bayes perfectly if tuned carefully." It is "VI trades posterior fidelity for scalable optimization, and the choice of variational family is the central design decision."
6.4 Amortized Inference and VAEs
In many latent-variable models, each datapoint $x_i$ has a latent variable $z_i$. Running a full optimization to approximate each posterior $p(z_i \mid x_i)$ separately is too expensive. Amortized inference solves this by learning a shared inference network:

$q_\phi(z \mid x) \approx p(z \mid x)$
The variational autoencoder is the canonical example. It defines:
- prior $p(z)$
- decoder $p_\theta(x \mid z)$
- encoder $q_\phi(z \mid x)$
and optimizes

$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$
The first term is reconstruction quality. The second term keeps the approximate posterior close to the prior and prevents arbitrary memorization of latent codes.
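For the standard diagonal-Gaussian encoder against a standard normal prior, the KL term has the well-known closed form $\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$. The sketch below verifies it by Monte Carlo with illustrative $\mu$ and $\sigma$:

```python
# Closed-form KL term of the VAE objective for a diagonal Gaussian encoder
# against a standard normal prior, checked by Monte Carlo.
import numpy as np

rng = np.random.default_rng(8)
mu = np.array([0.5, -1.0])
sigma = np.array([0.8, 1.2])

kl_closed = 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

# Monte Carlo: E_q[log q(z) - log p(z)]; the 2*pi constants cancel in the difference.
z = mu + sigma * rng.normal(size=(200_000, 2))
log_q = np.sum(-0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma), axis=1)
log_p = np.sum(-0.5 * z**2, axis=1)
kl_mc = np.mean(log_q - log_p)

assert abs(kl_mc - kl_closed) < 0.02
print(kl_closed, kl_mc)
```

This is the term that keeps the approximate posterior close to the prior; when it collapses to zero the encoder carries no information about $x$.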
Examples:
- image VAEs with Gaussian latent spaces
- text latent-variable models with approximate posterior encoders
- structured probabilistic models with learned local posterior approximations
Non-examples:
- A deterministic autoencoder is not a VAE.
- The encoder output is not the true posterior unless the approximation family and optimization happen to recover it exactly.
For AI: Amortized inference is one of the main places where Bayesian ideas became fully integrated with neural networks and modern autodiff tooling.
6.5 Laplace, MC Dropout, SGLD, and SWAG
Exact posterior inference for large neural networks is generally unavailable, so practical Bayesian deep learning uses approximations.
Laplace approximation. Approximate the posterior locally around a mode by a Gaussian:

$p(\theta \mid x) \approx \mathcal{N}\!\left(\hat{\theta}_{\text{MAP}},\ H^{-1}\right)$

where $H$ is the Hessian of the negative log-posterior at the mode. This is local and can miss multimodality, but it is computationally attractive.
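The Laplace idea can be sketched in one dimension against a posterior with a known answer. Here the target is shaped like a Beta(5, 3) density (an illustrative choice), and the local Gaussian gets the right scale but not the exact variance:

```python
# Laplace approximation in 1D: Gaussian at the mode with variance 1/H,
# compared against the exact variance of a Beta(5, 3)-shaped posterior.
import numpy as np

a, b = 5.0, 3.0
def neg_log_post(t):
    return -((a - 1) * np.log(t) + (b - 1) * np.log(1 - t))

# Mode of Beta(a, b): (a - 1) / (a + b - 2)
mode = (a - 1) / (a + b - 2)                      # = 2/3

# Hessian of the negative log-posterior at the mode (finite differences).
h = 1e-5
H = (neg_log_post(mode + h) - 2 * neg_log_post(mode) + neg_log_post(mode - h)) / h**2
laplace_var = 1.0 / H                             # = 1/27 ~ 0.037

true_var = a * b / ((a + b) ** 2 * (a + b + 1))   # exact Beta variance ~ 0.026
print(laplace_var, true_var)                      # same scale, but not exact:
                                                  # Laplace is only a local fit
```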
MC dropout. Keep dropout active at test time and average predictions across stochastic forward passes. Following Gal and Ghahramani, this can be interpreted as approximate variational Bayesian inference in a restricted family.
SGLD. Add Gaussian noise to stochastic-gradient updates so the iterates approximate posterior samples rather than collapsing to a single optimum. This ties posterior sampling to large-scale optimization.
SWAG. Fit a low-rank-plus-diagonal Gaussian approximation to SGD iterates near a wide optimum, then sample weights from that approximation for Bayesian model averaging.
Examples:
- Laplace for local curvature-based uncertainty
- MC dropout for cheap predictive uncertainty
- SGLD for sampling-style uncertainty in scalable settings
- SWAG for practical posterior approximations from standard training trajectories
Non-examples:
- None of these methods is universally exact.
- Good calibration from one benchmark does not prove full posterior fidelity.
For AI: These methods matter because exact Bayesian neural networks remain expensive. Approximate posterior methods are often used as uncertainty baselines, calibration tools, or practical approximations in safety-sensitive or data-limited applications.
It helps to compare the methods directly.
| Method | Main idea | Strength | Main weakness |
|---|---|---|---|
| Laplace | Gaussian around MAP mode | Cheap local uncertainty | Misses multimodality and nonlocal geometry |
| MC dropout | Stochastic subnet averaging | Easy to bolt onto existing models | Approximation quality depends strongly on architecture and interpretation |
| SGLD | SGD plus injected noise | Sampling flavor at scale | Sensitive to step-size schedules and mixing quality |
| SWAG | Gaussian fit to SGD iterates | Strong practical calibration baseline | Approximate posterior geometry only near a training trajectory |
Three examples of when each tends to shine:
Example 1: Laplace is attractive when one already has a trained MAP network and wants a quick curvature-based uncertainty estimate near the optimum.
Example 2: MC dropout is attractive when architecture and code already use dropout, and a cheap predictive-uncertainty baseline is needed.
Example 3: SWAG is attractive when one can afford collecting SGD iterates and wants a better uncertainty baseline than plain deterministic softmax confidence.
Two non-examples:
Non-example 1: using any of these methods and then assuming posterior probabilities are perfectly calibrated under severe distribution shift.
Non-example 2: comparing these methods only by test accuracy. Their purpose is uncertainty, calibration, and risk-sensitive prediction, not just point performance.
In Bayesian deep learning, the decisive question is often not "which method is most Bayesian?" but "which approximation gives the most reliable uncertainty at acceptable compute cost for this deployment setting?"