Bayesian Inference: Part 7 — Applications in Machine Learning
7. Applications in Machine Learning
7.1 Naive Bayes as Generative Classification
Naive Bayes uses Bayes' rule for classification:

$$P(y \mid x_1, \dots, x_d) \propto P(y)\, P(x_1, \dots, x_d \mid y)$$
Its defining assumption is conditional independence of features given the class:

$$P(x_1, \dots, x_d \mid y) = \prod_{j=1}^{d} P(x_j \mid y)$$
That assumption is usually false, but the classifier can still perform well because classification needs the relative posterior ordering of classes more than exact density fidelity.
Backward reference: The conditional-independence structure behind Naive Bayes was developed in Joint Distributions.
For AI: Naive Bayes remains a useful illustration of generative classification, prior smoothing, and posterior class reasoning.
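To make the generative recipe concrete, here is a minimal Bernoulli Naive Bayes sketch in pure Python. The toy data and the add-alpha smoothing value are invented for illustration, not taken from the lesson:

```python
import math
from collections import Counter

def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit Bernoulli Naive Bayes with add-alpha (Laplace) prior smoothing."""
    n, d = len(X), len(X[0])
    class_counts = Counter(y)
    log_prior = {c: math.log(k / n) for c, k in class_counts.items()}
    feat_prob = {}
    for c in class_counts:
        rows = [x for x, yi in zip(X, y) if yi == c]
        # Smoothed per-feature Bernoulli parameters P(x_j = 1 | y = c).
        feat_prob[c] = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                        for j in range(d)]
    return log_prior, feat_prob

def log_posterior_scores(x, log_prior, feat_prob):
    """Unnormalized log posteriors under the conditional-independence assumption."""
    return {c: lp + sum(math.log(p if xj else 1.0 - p)
                        for xj, p in zip(x, feat_prob[c]))
            for c, lp in log_prior.items()}

# Invented toy data: feature 0 tracks class 1.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
log_prior, feat_prob = train_bernoulli_nb(X, y)
scores = log_posterior_scores([1, 0], log_prior, feat_prob)
```

Note that only the relative ordering of the scores matters for classification, which is exactly why the false independence assumption is often survivable.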
7.2 Bayesian Neural Networks
A Bayesian neural network places a distribution over weights:

$$p(w \mid \mathcal{D}) \propto p(\mathcal{D} \mid w)\, p(w)$$
Predictions average over weight uncertainty:

$$p(y^\ast \mid x^\ast, \mathcal{D}) = \int p(y^\ast \mid x^\ast, w)\, p(w \mid \mathcal{D})\, dw$$
This is conceptually attractive because the network can express epistemic uncertainty: uncertainty due to limited knowledge rather than inherent observation noise.
In practice, exact BNN posteriors are intractable. That is why the approximate-inference methods developed earlier in this lesson matter so much.
Examples:
- Bayes by Backprop for variational Gaussian weight posteriors
- MC dropout as approximate posterior averaging
- Laplace or SWAG as posterior surrogates around trained networks
Non-examples:
- A deterministic softmax score is not a posterior uncertainty estimate by itself.
- High confidence on in-distribution data does not imply good OOD uncertainty.
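As a sketch of what "averaging over weight uncertainty" means computationally, the toy below assumes a 1-D linear model whose weight posterior has already been approximated by a Gaussian; the posterior mean and standard deviation are invented for illustration:

```python
import random
import statistics

random.seed(0)

# Hypothetical approximate posterior over the single weight of y = w * x,
# e.g. the Gaussian a variational method might return. Values are invented.
POST_MEAN, POST_SD = 2.0, 0.5

def posterior_predictive(x, n_samples=5000):
    """Monte Carlo average over weight samples: mean and sd of predictions."""
    preds = [random.gauss(POST_MEAN, POST_SD) * x for _ in range(n_samples)]
    return statistics.mean(preds), statistics.stdev(preds)

mean_near, sd_near = posterior_predictive(x=1.0)
mean_far, sd_far = posterior_predictive(x=10.0)
# Weight uncertainty is amplified by |x|, so sd_far should exceed sd_near.
```

A deterministic network would return a single number at both inputs; the Monte Carlo average exposes how much the prediction depends on which weights are plausible.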
It is useful to separate two types of uncertainty in Bayesian neural networks.
- Aleatoric uncertainty: noise or ambiguity inherent in the data-generating process
- Epistemic uncertainty: uncertainty due to limited knowledge of the model parameters
Bayesian weight distributions primarily target epistemic uncertainty. When the model sees more relevant data, epistemic uncertainty should contract. This is why Bayesian approximations are most useful in small-data regimes, under distribution shift, or in active-learning settings where additional data can genuinely reduce uncertainty.
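The aleatoric/epistemic split can be checked numerically with the law of total variance: total predictive variance equals the mean of per-sample variances (aleatoric) plus the variance of per-sample means (epistemic). A sketch under assumed toy values:

```python
import random
import statistics

random.seed(1)

# Assumed toy model: each posterior weight sample w defines a predictive
# distribution N(w * x, NOISE_SD^2). All numbers below are illustrative.
NOISE_SD = 0.3                                            # aleatoric: irreducible noise
weights = [random.gauss(1.0, 0.4) for _ in range(4000)]   # epistemic spread over w

x = 2.0
means = [w * x for w in weights]            # per-sample predictive means
epistemic = statistics.pvariance(means)     # variance of the means
aleatoric = NOISE_SD ** 2                   # mean of the (constant) variances
total = epistemic + aleatoric               # law of total variance
```

More data would shrink the spread of `weights` and hence the epistemic term, while the aleatoric term stays fixed no matter how much data arrives.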
A second key distinction is between weight-space and function-space uncertainty. Many approximate BNN methods place distributions over weights because weights are the direct parameters of the model. But what matters for decision-making is usually uncertainty over functions or predictions. Different weight configurations can induce similar predictive functions, especially in overparameterized networks. This makes posterior approximation in deep learning difficult: weight-space geometry can be highly redundant, multimodal, and curved even when predictive behavior is comparatively smooth.
Three examples:
Example 1: data-rich regime. On a very large in-distribution dataset, a Bayesian neural network may provide only modest epistemic gains over strong ensembles because the posterior is already relatively concentrated in function space.
Example 2: sparse or rare classes. For low-frequency classes or rare prompt types, posterior uncertainty can remain large even when average training accuracy is high. This is where Bayesian approximations are most informative.
Example 3: covariate shift. A model trained on one image distribution or one prompt distribution may produce confident but unstable predictions off-distribution. Posterior surrogates that widen uncertainty under shift can improve abstention and monitoring behavior, though they do not solve OOD detection perfectly.
Two non-examples:
Non-example 1: "A Bayesian neural network always outperforms a deterministic network." The point is not guaranteed accuracy gains. The point is better uncertainty representation for downstream decisions.
Non-example 2: "Any ensemble is Bayesian." Ensembles often behave like uncertainty approximators and can resemble Bayesian model averaging in practice, but they are not automatically posterior samples.
There is also an engineering reality. Exact Bayesian neural networks remain too expensive for many large-scale systems, so practitioners often choose between:
- a theoretically closer but expensive approximation
- a cheaper surrogate with weaker posterior interpretation
- a deterministic model plus calibration fix
The right choice depends on the cost of being wrong, the amount of available data, and whether downstream systems actually use uncertainty rather than merely logging it.
For AI: This is why Bayesian deep learning should be evaluated by calibration, risk-sensitive utility, abstention quality, active-learning value, and OOD behavior, not only by top-line accuracy.
7.3 Uncertainty for Calibration, Active Learning, and OOD Detection
Bayesian uncertainty is operationally useful when a model must decide whether to trust itself.
Calibration. Posterior-aware prediction can improve probability calibration by accounting for parameter uncertainty.
Active learning. Query the datapoints with highest posterior uncertainty or highest expected information gain.
OOD detection. If the posterior predictive is diffuse, unstable across posterior samples, or assigns low mass to a new input, the system can flag potential distribution shift.
Examples:
- uncertain moderation predictions routed to human review
- active learning for labeling expensive data
- uncertainty-aware ranking under sparse feedback
Non-examples:
- Predictive entropy alone is not a perfect OOD detector.
- A method with approximate Bayesian flavor is not automatically calibrated.
Calibration itself splits into at least two questions:
- are predicted probabilities aligned with empirical frequencies in-distribution?
- does uncertainty rise appropriately under ambiguity, shift, or sparse evidence?
The first can often be improved by temperature scaling or ensembling. The second is harder. Bayesian methods aim at the second because they try to propagate epistemic uncertainty instead of just correcting output logits after the fact.
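The first question can be quantified with expected calibration error (ECE). A minimal binned-ECE sketch, where the bin count and toy prediction batches are illustrative:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; ECE is the bin-size-weighted mean gap
    between average confidence and empirical accuracy across the bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Calibrated toy batch: 0.8-confidence predictions right 80% of the time.
calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
# Overconfident toy batch: 0.9 confidence but only 50% accuracy.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

ECE only answers the in-distribution question; a model can score well here and still fail the second, harder question under shift.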
Three examples:
Example 1: An image classifier may achieve good top-1 accuracy while remaining overconfident on corrupted data. Posterior-aware prediction aims to widen uncertainty under such perturbations.
Example 2: In active learning, a posterior or posterior surrogate identifies examples where the model's current beliefs are most uncertain, so labeling effort is spent where it will reduce epistemic uncertainty fastest.
Example 3: In retrieval and ranking, uncertainty over sparse user interactions can prevent a system from overcommitting to noisy early feedback.
Two non-examples:
Non-example 1: using softmax confidence as if it were Bayesian uncertainty.
Non-example 2: interpreting disagreement across posterior samples as a guarantee of correctness. It is evidence of uncertainty, not proof of truth.
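As a sketch of uncertainty-driven querying, the snippet below ranks an unlabeled pool by predictive entropy. The pool probabilities are invented, and a real system might use expected information gain rather than raw entropy:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a Bernoulli predictive probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def select_queries(pool_probs, k=2):
    """Pick the k pool indices whose predictive probability is most uncertain."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical predictive probabilities for 5 unlabeled points;
# values near 0.5 are maximally uncertain and get labeled first.
pool = [0.99, 0.52, 0.10, 0.48, 0.95]
chosen = select_queries(pool)
```

Labeling effort then goes to the points where a new label will move the posterior most, rather than to points the model already handles confidently.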
7.4 Bayesian Optimization and Thompson Sampling
Bayesian optimization treats the objective function as uncertain and updates a posterior over it after each expensive evaluation. A surrogate model, often a Gaussian process, balances exploration and exploitation by using both posterior mean and posterior uncertainty.
Thompson sampling uses posterior sampling directly: sample a plausible objective from the posterior and act optimally under that sampled world. Repeating this naturally balances exploration and exploitation.
Examples:
- hyperparameter tuning for expensive model training
- prompt or policy selection with expensive evaluations
- recommendation-bandit systems under uncertainty
Non-examples:
- random search is not Bayesian optimization
- fixed epsilon-greedy exploration does not exploit posterior structure
For AI: When every training run costs substantial compute, posterior-aware search can save enormous resources by targeting informative trials instead of brute-force sweeps.
Bayesian optimization is especially attractive when the objective is:
- expensive to evaluate
- noisy
- derivative-free
- available only through black-box experimentation
That description fits many modern ML workflows: hyperparameter tuning, inference-time tradeoff tuning, reward-model threshold selection, and evaluation-guided prompt or policy search.
Thompson sampling deserves separate emphasis because it turns posterior reasoning directly into a policy. Sampling one plausible world from the posterior and acting optimally in that sampled world leads to a simple randomized exploration strategy whose randomness is informed, not arbitrary.
This gives a clean contrast:
- epsilon-greedy adds blind random exploration
- UCB adds optimism bonuses
- Thompson sampling samples according to posterior uncertainty
All three can work. Thompson sampling is the most explicitly Bayesian.
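A minimal Beta-Bernoulli Thompson sampling loop makes the contrast concrete. The arm success rates below are hypothetical, and the conjugate Beta update stands in for a full posterior computation:

```python
import random

random.seed(42)

TRUE_RATES = [0.2, 0.5, 0.8]  # hypothetical arm success probabilities, unknown to the agent
alpha = [1.0] * 3             # Beta(1, 1) prior for every arm
beta = [1.0] * 3
pulls = [0] * 3

for _ in range(2000):
    # Sample one plausible world from the posterior and act optimally in it.
    sampled = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
    arm = max(range(3), key=lambda i: sampled[i])
    reward = 1 if random.random() < TRUE_RATES[arm] else 0
    alpha[arm] += reward        # conjugate Beta-Bernoulli update
    beta[arm] += 1 - reward
    pulls[arm] += 1
```

Early on the Beta posteriors are wide, so all arms get sampled; as evidence accumulates, the best arm's posterior concentrates and exploration tapers off on its own, with no epsilon schedule to tune.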
7.5 Bayesian Views of Regularization and Fine-Tuning
Many ML regularizers admit Bayesian interpretations:
- L2 penalty = Gaussian prior
- L1 penalty = Laplace prior
- group shrinkage = hierarchical prior
- sparse adaptation = structured prior on updates
This perspective is especially useful in fine-tuning. If we believe most weights should remain near their pretrained values, then Bayesian language suggests priors centered at those values rather than at zero. If we believe updates should be low-rank or sparse, Bayesian structure can encode that too.
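The L2-equals-Gaussian-prior correspondence can be verified numerically: with lam = 1/(2 * sigma**2), the negative log prior density and the weight-decay penalty differ only by a constant that does not depend on the weights. A small check with invented weight vectors:

```python
import math

def neg_log_gaussian_prior(w, sigma):
    """Negative log density of an isotropic Gaussian prior N(0, sigma^2 I)."""
    d = len(w)
    return (sum(wi * wi for wi in w) / (2 * sigma ** 2)
            + d * math.log(sigma * math.sqrt(2 * math.pi)))

def l2_penalty(w, lam):
    """The familiar weight-decay term lam * ||w||^2."""
    return lam * sum(wi * wi for wi in w)

# With lam = 1 / (2 sigma^2), the two differ only by a w-independent constant,
# so they induce the same MAP optimizer.
w1, w2 = [0.5, -1.0], [2.0, 0.3]
sigma = 0.7
lam = 1 / (2 * sigma ** 2)
diff1 = neg_log_gaussian_prior(w1, sigma) - l2_penalty(w1, lam)
diff2 = neg_log_gaussian_prior(w2, sigma) - l2_penalty(w2, lam)
```

Because additive constants do not change the argmin, minimizing loss plus weight decay is exactly MAP estimation under this Gaussian prior.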
Examples:
- posterior or MAP interpretation of weight decay
- priors centered on pretrained weights for conservative adaptation
- hierarchical priors across tasks or domains during multi-task fine-tuning
Non-examples:
- not every optimizer trick has a clean Bayesian meaning
- saying "Bayesian" does not remove the need to validate adaptation behavior empirically
LoRA and DoRA as structured priors. Low-Rank Adaptation (LoRA) constrains weight updates to a low-rank subspace: $W = W_0 + \Delta W$ with $\Delta W = BA$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. From a Bayesian perspective, this is a structured prior that places all probability mass on rank-$r$ updates, encoding the belief that fine-tuning should perturb the pretrained representation along only a few informative directions. DoRA (Weight-Decomposed Low-Rank Adaptation) further decomposes updates into magnitude and direction components, which can be read as placing independent priors on scale and orientation. Neither method propagates full posterior uncertainty over weights, but both encode prior beliefs about adaptation geometry in a way that pure optimization language obscures.
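A sketch of the rank-$r$ constraint itself, with tiny invented matrices (real adapters use much larger $d$ and $k$):

```python
def matmul(A, B):
    """Naive dense matrix multiply, enough for tiny illustrative matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Invented 3x3 pretrained weight W0 and a rank-1 adapter (r = 1 < d = k = 3).
W0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0]]
B = [[0.5], [0.0], [-0.5]]   # d x r factor
A = [[1.0, 2.0, 0.0]]        # r x k factor

delta = matmul(B, A)         # every reachable update lies in a rank-r set:
                             # the "prior" puts zero mass on full-rank updates
W = [[W0[i][j] + delta[i][j] for j in range(3)] for i in range(3)]
```

Note that the update is centered at zero in this parameterization: with $B$ initialized to produce $\Delta W = 0$, adaptation starts exactly at the pretrained weights.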
The Bayesian view is most useful when it changes design decisions. For example:
- Should the prior be centered at zero or at pretrained weights? (LoRA centers at zero; a Bayesian treatment might center at the pretrained value with tight scale.)
- Should updates be globally isotropic or structured by layer and task? (DoRA's magnitude-direction decomposition suggests direction-specific priors.)
- Should uncertainty be propagated into predictions after fine-tuning?
- Should low-rank adaptation be interpreted as a computational constraint, a prior belief, or both?
These are not merely philosophical questions. They affect which parameters move, how confidently they move, and how much a system trusts its adapted outputs after seeing limited new data.
8. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | "The prior is just subjective bias." | Priors can encode symmetry, scale constraints, weak regularization, historical evidence, or hierarchical structure. | State the modeling role of the prior explicitly and perform sensitivity checks. |
| 2 | "The likelihood is a probability distribution over parameters." | The likelihood is a function of the parameter with the data held fixed, not a normalized distribution in . | Use posterior language only after multiplying by the prior and normalizing. |
| 3 | "MAP is the same as Bayesian inference." | MAP collapses the posterior to one point and discards posterior uncertainty. | Distinguish posterior summaries from the full posterior and posterior predictive. |
| 4 | "A 95% credible interval and a 95% confidence interval mean the same thing." | They may have similar endpoints but different semantics. | Keep the Bayesian and frequentist probability statements separate. |
| 5 | "Flat priors are objective and harmless." | Flatness depends on parameterization and can create unstable or improper posteriors. | Prefer justified weakly informative priors or invariance-based priors when appropriate. |
| 6 | "Bayes factors are just Bayesian p-values." | Bayes factors compare integrated model evidence, not tail probabilities under a null. | Interpret Bayes factors through posterior odds and prior model odds. |
| 7 | "Conjugate priors are automatically correct." | Conjugacy gives tractability, not truth. | Use conjugacy when it fits the problem or as an instructional baseline, not as a dogma. |
| 8 | "Variational inference gives exact uncertainty if optimized well." | A restricted variational family can remain systematically overconfident. | Evaluate posterior quality, calibration, and sensitivity to the approximation family. |
| 9 | "Softmax confidence is Bayesian uncertainty." | Deterministic confidence scores can be sharply wrong under shift or sparse evidence. | Use posterior or posterior-surrogate uncertainty and validate calibration. |
| 10 | "Posterior predictive checks prove the model is true." | PPCs are diagnostics for mismatch, not proofs of correctness. | Use PPCs to falsify bad fit and combine them with domain checks. |
| 11 | "More MCMC iterations always solve convergence." | Poor geometry and high autocorrelation can persist for a long time. | Check ESS, split-, trace behavior, and proposal geometry. |
| 12 | "Bayesian methods eliminate the need for empirical validation." | Bayesian inference still depends on model assumptions, approximations, and deployment conditions. | Validate calibration, robustness, and decision quality on the real use case. |
9. Exercises
These exercises are mirrored in exercises.ipynb, where each one includes a scaffold cell and a full reference solution.
- Exercise 1 (*) - Beta-Binomial Updating. Let $X \sim \mathrm{Binomial}(n, \theta)$ with prior $\theta \sim \mathrm{Beta}(\alpha, \beta)$. (a) Derive the posterior density. (b) Compute the posterior mean, variance, and MAP. (c) Derive the posterior predictive probability of success for one future trial. (d) Explain how the answer changes as $n$ grows with fixed prior mean.
- Exercise 2 (*) - Gaussian Prior, Gaussian Likelihood. Suppose $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$ and prior $\mu \sim \mathcal{N}(\mu_0, \tau^2)$. (a) Derive the posterior by completing the square. (b) Express the posterior mean as a precision-weighted average. (c) Show what happens as $n \to \infty$. (d) Interpret the result for a small-data ML monitoring problem.
- Exercise 3 (*) - MAP as Regularised MLE. (a) Show that a Gaussian prior on $w$ yields an L2 penalty in the MAP objective. (b) Show that an independent Laplace prior yields an L1 penalty. (c) Explain why the resulting optimizer is not yet full Bayesian inference. (d) Connect the result to weight decay in deep learning.
- Exercise 4 (**) - Posterior Predictive for Count Data. Let $X_1, \dots, X_n \mid \lambda \sim \mathrm{Poisson}(\lambda)$ and $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$. (a) Derive the posterior on $\lambda$. (b) Derive the posterior predictive for one future count. (c) Compare its variance to that of a plug-in Poisson predictor using the posterior mean of $\lambda$. (d) Explain why posterior predictive uncertainty can exceed observation-model uncertainty.
- Exercise 5 (**) - Bayes Factor vs p-Value. Consider a simple null-vs-effect comparison problem. (a) Write down the likelihood under $H_0$ and $H_1$. (b) Derive or numerically approximate the Bayes factor for a chosen prior under $H_1$. (c) Compare the conclusion to a classical significance test. (d) Explain how the prior scale affects the Bayes factor.
- Exercise 6 (**) - Hierarchical Partial Pooling. You observe noisy click-through data for several related groups with very different sample sizes. (a) Write down a two-level hierarchical model. (b) Explain the difference between no pooling, complete pooling, and partial pooling. (c) Show qualitatively which groups shrink most strongly. (d) Explain why this is useful in multilingual or multi-market ML systems.
- Exercise 7 (***) - ELBO and Variational Inference. (a) Starting from $\log p(x)$, derive the ELBO identity. (b) Explain why maximizing the ELBO minimizes KL divergence. (c) Explain why mean-field VI can underestimate posterior variance. (d) Connect the derivation to the VAE objective.
- Exercise 8 (***) - Bayesian Uncertainty in ML. Compare one or more approximate Bayesian deep-learning methods such as MC dropout, Laplace, SGLD, or SWAG. (a) State what posterior object each method approximates. (b) Identify one computational advantage. (c) Identify one calibration or fidelity limitation. (d) Propose which method you would choose for an uncertainty-aware deployment with limited compute and justify the decision.
10. Why This Matters for AI (2026 Perspective)
| Concept | AI Impact |
|---|---|
| Prior as regularizer | Makes assumptions explicit instead of burying them in penalties and initialization choices |
| Posterior uncertainty | Supports abstention, fallback logic, and risk-sensitive prediction |
| Credible intervals | Enable direct probability statements about launch thresholds and safety margins |
| Posterior predictive | Gives calibrated next-step uncertainty rather than just point predictions |
| Conjugate Bayes | Powers fast online updates for bandits, CTR models, and monitoring systems |
| MAP interpretation | Explains weight decay, L1 sparsity, and other regularizers as prior assumptions |
| Hierarchical Bayes | Shares strength across tasks, languages, regions, users, and low-resource groups |
| Bayes factors | Provide an evidence-based alternative to thresholded model-comparison rituals |
| PPCs | Expose misspecification when a probabilistic model fits one metric but misses the real data shape |
| MCMC | Remains the gold-standard asymptotic route for posterior sampling in rich probabilistic models |
| Variational inference | Makes latent-variable Bayes scalable enough for neural architectures and large datasets |
| Amortized inference | Lets neural networks learn approximate posteriors efficiently, as in VAEs |
| Approximate Bayesian DL | Methods like MC dropout, SGLD, and SWAG provide practical uncertainty baselines |
| Bayesian optimization | Saves expensive training budget by using posterior uncertainty to guide search |
| Thompson sampling | Turns posterior uncertainty directly into principled exploration policies |
| Fine-tuning priors | Encourages conservative adaptation around pretrained models and supports structured uncertainty |
Bayesian inference matters for AI because it keeps one crucial quantity visible: what the system still does not know. In modern model deployment, that is often more valuable than squeezing one more decimal place out of average validation accuracy.
11. Conceptual Bridge
This section sits at a precise point in the curriculum.
Backward, it depends on probability theory and frequentist statistics. From Joint Distributions, we inherit Bayes' theorem, conditional distributions, multivariate Gaussian conditioning, and the language of latent variables. From Estimation Theory, we inherit likelihoods, MLE, Fisher information, asymptotic normality, and confidence intervals. From Hypothesis Testing, we inherit the classical model-comparison mindset that Bayesian evidence now complements and sometimes challenges.
Conceptually, Bayesian inference is the point where those ideas are reorganized. Bayes' theorem stops being one identity inside probability theory and becomes the architecture of inference. Likelihood stops being only an optimization objective and becomes one term in a full posterior update. Confidence language stops being the only way to talk about uncertainty and is joined by posterior probability statements, credible intervals, and posterior predictive reasoning.
Forward, this section opens several doors.
- Into Time Series, repeated Gaussian updating becomes filtering and state estimation.
- Into Regression Analysis, penalized estimators acquire full posterior and predictive interpretations.
- Into Optimization, variational objectives and stochastic-gradient posterior approximations become algorithmic objects.
- Into Information Theory, KL divergence and evidence connect posterior approximation to compression and coding.
- Into modern ML practice, approximate Bayesian methods become tools for calibration, active learning, bandits, and uncertainty-aware deployment.
The deepest conceptual move is this: frequentist statistics asks how procedures behave across hypothetical repeated datasets; Bayesian statistics asks how beliefs should change for the dataset we actually observed. Mature ML systems often need both views. They need procedures with strong repeated-use guarantees and posterior-aware decisions on real realized data.
ASCII CURRICULUM POSITION
======================================================================
Probability Theory
|
+-- Bayes' theorem, conditionals, MVN conditioning
|
v
Estimation Theory
|
+-- likelihood, MLE, Fisher information, asymptotics
|
v
Hypothesis Testing
|
+-- p-values, likelihood ratios, decision rules
|
v
Bayesian Inference
|
+-- prior + likelihood -> posterior
+-- posterior -> prediction, evidence, decisions
|
+---------------------> Time Series
+---------------------> Regression Analysis
+---------------------> Optimization
+---------------------> Information Theory
======================================================================
If probability theory gave us the syntax of uncertainty and estimation theory gave us the grammar of data-to-parameter inference, Bayesian inference gives us the full probabilistic semantics of learning under uncertainty. It is the chapter where uncertainty stops being a nuisance term and becomes a first-class computational object.
One final way to summarize the bridge is this:
- probability theory told us how to manipulate uncertainty
- estimation theory told us how to fit unknown parameters from data
- hypothesis testing told us how to compare claims under repeated-sampling logic
- Bayesian inference tells us how to update whole distributions of belief and use them for prediction and action
That perspective will keep returning. Whenever a later chapter asks us to choose under uncertainty, propagate uncertainty through a model, or reason about what the system does not know yet, the machinery introduced here is part of the answer.
References
- Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. Bayesian Data Analysis. CRC Press.
- Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press.
- Bishop, C. M. Pattern Recognition and Machine Learning. Springer.
- Bernardo, J. M., and Smith, A. F. M. Bayesian Theory. Wiley.
- Robert, C., and Casella, G. Monte Carlo Statistical Methods. Springer.
- Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. "Variational Inference: A Review for Statisticians." JASA (2017).
- Kingma, D. P., and Welling, M. "Auto-Encoding Variational Bayes." ICLR (2014).
- Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. "Weight Uncertainty in Neural Networks." ICML (2015).
- Gal, Y., and Ghahramani, Z. "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." ICML (2016).
- Welling, M., and Teh, Y. W. "Bayesian Learning via Stochastic Gradient Langevin Dynamics." ICML (2011).
- Maddox, W., Garipov, T., Izmailov, P., Vetrov, D., and Wilson, A. G. "A Simple Baseline for Bayesian Uncertainty in Deep Learning." NeurIPS (2019).
- Johari, R. "Lecture 16: Bayesian Inference." Stanford MS&E 226 notes.
- Jeffreys, H. Theory of Probability. Oxford University Press.
- Gelman, A. and Shalizi, C. R. "Philosophy and the Practice of Bayesian Statistics." British Journal of Mathematical and Statistical Psychology (2013).
- Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., and Burkner, P.-C. "Rank-Normalization, Folding, and Localization: An Improved R-hat for Assessing Convergence of MCMC." Bayesian Analysis (2021).