Theory Notebook
Converted from theory.ipynb for web reading.
Bayesian Inference — Theory Notebook
Posterior reasoning turns uncertainty into something we can compute with, visualize, and act on.
This notebook is the interactive companion to notes.md. It follows the section arc from exact Bayesian updating through posterior prediction, model comparison, approximate inference, and uncertainty-aware ML applications.
| Coverage | What You Will Do |
|---|---|
| Exact Bayes | Conjugate updates, posterior summaries, credible intervals |
| Prediction | Posterior predictive distributions, posterior predictive checks |
| Model comparison | Marginal likelihoods, Bayes factors, prior sensitivity |
| Approximate Bayes | MCMC, VI, amortized inference, Bayesian deep-learning surrogates |
| ML applications | Naive Bayes, Bayesian neural nets, active learning, Thompson sampling |
Code cell 3
import numpy as np
try:
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
HAS_MPL = True
except ImportError:
HAS_MPL = False
try:
import seaborn as sns
if HAS_MPL:
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
if HAS_MPL:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
np.random.seed(42)
print("Plot setup complete.")
Code cell 4
import numpy.linalg as la
from scipy import stats, special
COLORS = {
"primary": "#0077BB",
"secondary": "#EE7733",
"tertiary": "#009988",
"error": "#CC3311",
"neutral": "#555555",
"highlight": "#EE3377",
}
np.set_printoptions(precision=6, suppress=True)
def header(title):
print("\n" + "=" * len(title))
print(title)
print("=" * len(title))
def report(name, cond):
print(f"{'PASS' if cond else 'FAIL'} - {name}")
return cond
def normalize_density(x, values):
area = np.trapezoid(values, x)
return values / area
def beta_posterior(alpha, beta, k, n):
return alpha + k, beta + n - k
print("Extra imports and helper functions loaded.")
1. Intuition
1.1 From Data to Belief
The first Bayesian habit is to represent uncertainty about a parameter before and after seeing data. A prior over a Bernoulli success probability becomes a posterior after observing outcomes.
Code cell 6
# === 1.1 From Data to Belief ===
theta = np.linspace(0.001, 0.999, 500)
prior = np.ones_like(theta)
k, n = 7, 10
likelihood = theta**k * (1 - theta)**(n - k)
posterior = normalize_density(theta, prior * likelihood)
posterior_mean = np.trapezoid(theta * posterior, theta)
posterior_map = theta[np.argmax(posterior)]
header("Bernoulli belief update")
print(f"Observed data: {k} successes out of {n} trials")
print(f"Posterior mean: {posterior_mean:.4f}")
print(f"Posterior MAP : {posterior_map:.4f}")
report("Posterior mean lies between 0 and 1", 0 < posterior_mean < 1)
if HAS_MPL:
fig, ax = plt.subplots()
ax.plot(theta, normalize_density(theta, prior), color=COLORS["neutral"], label="Prior")
ax.plot(theta, normalize_density(theta, likelihood), color=COLORS["secondary"], label="Likelihood")
ax.plot(theta, posterior, color=COLORS["primary"], label="Posterior")
ax.set_title("Prior, likelihood, and posterior for Bernoulli data")
ax.set_xlabel(r"Success probability $\theta$")
ax.set_ylabel("Density")
ax.legend()
fig.tight_layout()
plt.show()
1.2 Frequentist vs Bayesian Uncertainty
Bayesian questions often sound like: "Given what we observed, what is the probability the improvement exceeds the deployment threshold?" That is a posterior probability question, not a p-value question.
Code cell 8
# === 1.2 Frequentist vs Bayesian Uncertainty ===
alpha0, beta0 = 2, 18
control_clicks, control_impressions = 45, 500
test_clicks, test_impressions = 60, 500
a_c, b_c = beta_posterior(alpha0, beta0, control_clicks, control_impressions)
a_t, b_t = beta_posterior(alpha0, beta0, test_clicks, test_impressions)
draws_c = np.random.beta(a_c, b_c, size=50_000)
draws_t = np.random.beta(a_t, b_t, size=50_000)
improvement = draws_t - draws_c
prob_better = np.mean(improvement > 0)
prob_big = np.mean(improvement > 0.01)
header("Posterior probability of improvement")
print(f"P(test > control | data) = {prob_better:.4f}")
print(f"P(test - control > 0.01 | data) = {prob_big:.4f}")
report("Improvement probability exceeds 0.5", prob_better > 0.5)
if HAS_MPL:
fig, ax = plt.subplots()
ax.hist(improvement, bins=60, density=True, color=COLORS["primary"], alpha=0.75)
ax.axvline(0.0, color=COLORS["neutral"], linestyle="--", label="No improvement")
ax.axvline(0.01, color=COLORS["error"], linestyle="--", label="1 percentage point")
ax.set_title("Posterior uncertainty over effect size")
ax.set_xlabel("Test CTR - Control CTR")
ax.set_ylabel("Density")
ax.legend()
fig.tight_layout()
plt.show()
1.3 Historical Timeline
Bayesian inference has a 250-year history, but most of its computational machinery emerged only in the last 30 years.
| Year | Contributor | Contribution |
|---|---|---|
| 1763 | Thomas Bayes | Posthumous essay on inverse probability |
| 1774–1812 | Laplace | Generalizes to continuous parameters, develops asymptotic ideas |
| 1939 | Jeffreys | Objective Bayes, invariant priors |
| 1954 | Savage | Subjective Bayesian decision theory |
| 1990s | Gelfand, Smith, Neal, Gelman | MCMC makes complex posteriors tractable |
| 2014 | Kingma & Welling | VAE turns amortized VI into scalable deep learning |
| 2015 | Blundell et al. | Bayes by Backprop for weight posteriors |
| 2016 | Gal & Ghahramani | MC dropout as approximate Bayesian inference |
| 2019 | Maddox et al. | SWAG uncertainty baseline from SGD iterates |
Key pattern: The philosophy was settled early; the computation came later.
1.4 Why Bayesian Inference Matters for AI
When the cost of being overconfident is high, posterior probabilities and posterior predictive uncertainty become decision tools rather than academic extras.
Code cell 11
# === 1.4 Why Bayesian Inference Matters for AI ===
risk_threshold = 0.02
prob_bad = np.mean(improvement < -risk_threshold)
header("Deployment-style decision statistics")
print(f"Posterior risk of harming CTR by more than 2 points: {prob_bad:.6f}")
print(f"Posterior expected uplift: {np.mean(improvement):.5f}")
print(f"Posterior 95% interval: ({np.quantile(improvement, 0.025):.5f}, {np.quantile(improvement, 0.975):.5f})")
report("Expected uplift is positive", np.mean(improvement) > 0)
2. Formal Definitions
2.1 Prior, Likelihood, Posterior, Evidence
In exact Bayesian inference it is often helpful to compute the posterior on a dense grid to see normalization directly.
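For reference, the objects in this section are tied together by Bayes' rule for a continuous parameter, with the evidence as the normalizing integral:

$$
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}, \qquad p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta .
$$

The grid code below approximates the evidence integral with np.trapezoid.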
Code cell 13
# === 2.1 Prior, Likelihood, Posterior, Evidence ===
mu_grid = np.linspace(-2.0, 2.0, 800)
sigma = 0.5
x_obs = np.array([0.35, 0.1, 0.42, 0.55])
mu0, tau0 = 0.0, 0.7
prior_mu = stats.norm.pdf(mu_grid, loc=mu0, scale=tau0)
likelihood_mu = np.exp(np.sum(stats.norm.logpdf(x_obs[:, None], loc=mu_grid[None, :], scale=sigma), axis=0))
posterior_mu = normalize_density(mu_grid, prior_mu * likelihood_mu)
evidence_numeric = np.trapezoid(prior_mu * likelihood_mu, mu_grid)
header("Posterior normalization on a grid")
print(f"Numeric evidence p(D): {evidence_numeric:.8f}")
print(f"Posterior integral : {np.trapezoid(posterior_mu, mu_grid):.8f}")
report("Posterior integrates to 1", np.isclose(np.trapezoid(posterior_mu, mu_grid), 1.0, atol=1e-5))
2.2 Continuous Bayes and Normalization
For the Gaussian-Gaussian model, the posterior also has a closed form. We can compare the closed form to the grid approximation.
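With a known noise scale $\sigma$, a prior $\mu \sim \mathcal{N}(\mu_0, \tau_0^2)$, and $n$ observations with mean $\bar{x}$, the closed form used in the next cell is

$$
\sigma_n^2 = \left( \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \right)^{-1}, \qquad \mu_n = \sigma_n^2 \left( \frac{\mu_0}{\tau_0^2} + \frac{n \bar{x}}{\sigma^2} \right).
$$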
Code cell 15
# === 2.2 Continuous Bayes and Normalization ===
n = len(x_obs)
xbar = x_obs.mean()
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + n * xbar / sigma**2)
posterior_closed = stats.norm.pdf(mu_grid, loc=post_mean, scale=np.sqrt(post_var))
header("Closed-form Gaussian posterior")
print(f"Posterior mean (closed form): {post_mean:.6f}")
print(f"Posterior var (closed form): {post_var:.6f}")
diff = np.max(np.abs(posterior_closed - posterior_mu))
print(f"Max abs difference vs grid : {diff:.6e}")
report("Closed form matches grid posterior", diff < 5e-4)
2.3 Posterior Summaries and Bayes Estimators
Posterior mean, median, and MAP need not coincide, especially in skewed posteriors.
Code cell 17
# === 2.3 Posterior Summaries and Bayes Estimators ===
a, b = 2, 8
theta_grid = np.linspace(0.001, 0.999, 800)
post = stats.beta.pdf(theta_grid, a, b)
post = normalize_density(theta_grid, post)
mean_est = a / (a + b)
median_est = stats.beta.ppf(0.5, a, b)
map_est = (a - 1) / (a + b - 2)
header("Posterior summaries of a skewed Beta posterior")
print(f"Posterior mean : {mean_est:.4f}")
print(f"Posterior median: {median_est:.4f}")
print(f"Posterior MAP : {map_est:.4f}")
report("Mean, median, and MAP differ", len({round(mean_est, 3), round(median_est, 3), round(map_est, 3)}) == 3)
2.4 Credible Intervals vs Confidence Intervals
The numbers can be similar while the semantics differ. Here we compare a Bayesian equal-tail interval and a simple frequentist normal-approximation interval for a Bernoulli rate.
Code cell 19
# === 2.4 Credible Intervals vs Confidence Intervals ===
k, n = 18, 30
p_hat = k / n
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se_hat, p_hat + 1.96 * se_hat)
cred = stats.beta.ppf([0.025, 0.975], 1 + k, 1 + n - k)
header("Credible interval vs confidence interval")
print(f"Frequentist 95% interval (normal approx): ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"Bayesian 95% credible interval : ({cred[0]:.4f}, {cred[1]:.4f})")
report("Intervals stay inside [0, 1]", 0 <= cred[0] <= 1 and 0 <= cred[1] <= 1)
2.5 Prior Families and Elicitation
Priors encode structure. Here the same data are combined with different Beta priors to show how prior strength changes the posterior.
Code cell 21
# === 2.5 Prior Families and Elicitation ===
priors = {
"weak uniform": (1, 1),
"skeptical": (10, 30),
"optimistic": (20, 10),
}
k, n = 9, 20
header("Effect of prior choice")
if HAS_MPL:
fig, ax = plt.subplots()
for name, (a0, b0) in priors.items():
a1, b1 = beta_posterior(a0, b0, k, n)
density = stats.beta.pdf(theta_grid, a1, b1)
ax.plot(theta_grid, density, label=f"{name}: mean={a1/(a1+b1):.3f}")
print(f"{name:12s} -> posterior mean = {a1/(a1+b1):.4f}")
ax.set_title("Posterior under different priors with the same data")
ax.set_xlabel(r"$\theta$")
ax.set_ylabel("Density")
ax.legend()
fig.tight_layout()
plt.show()
3. Core Theory I: Conjugacy and Closed-Form Updating
3.1 and 3.2 Conjugate Priors and the Beta-Binomial Model
Conjugacy lets us compare exact posterior formulas with brute-force grid integration.
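The conjugate update being verified: if $\theta \sim \mathrm{Beta}(\alpha, \beta)$ and we observe $k$ successes in $n$ Bernoulli trials, then

$$
\theta \mid k, n \sim \mathrm{Beta}(\alpha + k,\; \beta + n - k),
$$

which is exactly what the helper beta_posterior implements.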
Code cell 23
# === 3.1 and 3.2 Beta-Binomial conjugacy ===
alpha, beta = 3, 7
k, n = 11, 20
posterior_exact = stats.beta.pdf(theta_grid, alpha + k, beta + n - k)
posterior_exact = normalize_density(theta_grid, posterior_exact)
likelihood = theta_grid**k * (1 - theta_grid)**(n - k)
prior = stats.beta.pdf(theta_grid, alpha, beta)
posterior_grid = normalize_density(theta_grid, prior * likelihood)
max_err = np.max(np.abs(posterior_exact - posterior_grid))
header("Beta-Binomial exact update")
print(f"Posterior exact mean: {(alpha + k) / (alpha + beta + n):.6f}")
print(f"Max abs error vs grid: {max_err:.6e}")
report("Exact Beta posterior matches grid", max_err < 1e-5)
3.2 Sequential Updating
Bayesian updating can be done online. Each new Bernoulli outcome increments one posterior shape parameter.
Code cell 25
# === 3.2 Sequential updating ===
outcomes = np.array([1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1])
a_hist, b_hist = [1], [1]
means = [0.5]
for x in outcomes:
a_hist.append(a_hist[-1] + x)
b_hist.append(b_hist[-1] + (1 - x))
means.append(a_hist[-1] / (a_hist[-1] + b_hist[-1]))
header("Online Beta-Binomial updating")
print(f"Final posterior: Beta({a_hist[-1]}, {b_hist[-1]})")
print(f"Final posterior mean: {means[-1]:.4f}")
report("Posterior mean remains in [0, 1]", np.all((np.array(means) >= 0) & (np.array(means) <= 1)))
if HAS_MPL:
fig, ax = plt.subplots()
ax.plot(range(len(means)), means, marker="o", color=COLORS["primary"])
ax.set_title("Posterior mean under sequential Bernoulli updates")
ax.set_xlabel("Number of observations processed")
ax.set_ylabel("Posterior mean")
fig.tight_layout()
plt.show()
3.3 Gamma-Poisson and Dirichlet-Categorical
Count and categorical models update by adding sufficient statistics.
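Concretely, with a rate-parameterized $\mathrm{Gamma}(\alpha, \beta)$ prior on a Poisson rate and counts $x_1, \dots, x_n$, and a $\mathrm{Dir}(\boldsymbol{\alpha})$ prior on category probabilities with count vector $\mathbf{c}$:

$$
\lambda \mid x \sim \mathrm{Gamma}\Big(\alpha + \textstyle\sum_i x_i,\; \beta + n\Big), \qquad \boldsymbol{\pi} \mid \mathbf{c} \sim \mathrm{Dir}(\boldsymbol{\alpha} + \mathbf{c}).
$$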
Code cell 27
# === 3.3 Gamma-Poisson and Dirichlet-Categorical ===
counts = np.array([4, 3, 7, 2, 6])
alpha, beta = 2.0, 1.5
alpha_post = alpha + counts.sum()
beta_post = beta + len(counts)
rate_mean = alpha_post / beta_post
dir_prior = np.array([1.0, 2.0, 1.0, 3.0])
observed = np.array([12, 4, 1, 9])
dir_post = dir_prior + observed
dir_mean = dir_post / dir_post.sum()
header("Gamma-Poisson and Dirichlet updates")
print(f"Posterior Gamma parameters: alpha={alpha_post:.2f}, beta={beta_post:.2f}")
print(f"Posterior mean count rate : {rate_mean:.4f}")
print(f"Dirichlet posterior mean : {dir_mean}")
report("Dirichlet posterior sums to 1", np.isclose(dir_mean.sum(), 1.0))
3.4 Gaussian-Gaussian Updating
The posterior mean becomes a precision-weighted average of prior mean and sample mean.
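Writing the posterior mean as an interpolation makes the precision weighting explicit:

$$
\mu_n = w\, \bar{x} + (1 - w)\, \mu_0, \qquad w = \frac{n / \sigma^2}{n / \sigma^2 + 1 / \tau_0^2},
$$

which is algebraically the same expression the next cell computes via post_var and post_mean.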
Code cell 29
# === 3.4 Gaussian-Gaussian updating ===
sigma = 1.0
tau0 = 2.0
mu0 = -0.5
x = np.array([0.4, 0.2, 0.1, 0.8, 0.5])
n = len(x)
xbar = x.mean()
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + n * xbar / sigma**2)
header("Gaussian-Gaussian shrinkage")
print(f"Prior mean : {mu0:.4f}")
print(f"Sample mean : {xbar:.4f}")
print(f"Posterior mean: {post_mean:.4f}")
print(f"Posterior var : {post_var:.4f}")
report("Posterior mean lies between prior and sample mean", min(mu0, xbar) <= post_mean <= max(mu0, xbar))
grid = np.linspace(-2.5, 2.5, 500)
prior = stats.norm.pdf(grid, mu0, tau0)
post = stats.norm.pdf(grid, post_mean, np.sqrt(post_var))
like = stats.norm.pdf(grid, xbar, sigma / np.sqrt(n))
if HAS_MPL:
fig, ax = plt.subplots()
ax.plot(grid, prior, label="Prior", color=COLORS["neutral"])
ax.plot(grid, like, label="Sampling model for mean", color=COLORS["secondary"])
ax.plot(grid, post, label="Posterior", color=COLORS["primary"])
ax.set_title("Precision-weighted Gaussian updating")
ax.set_xlabel(r"$\mu$")
ax.set_ylabel("Density")
ax.legend()
fig.tight_layout()
plt.show()
3.5 What Conjugacy Buys
Conjugacy gives exact posteriors and exact evidences in simple models. We can verify the evidence numerically and analytically.
Code cell 31
# === 3.5 Exact evidence in a conjugate Gaussian model ===
sigma = 0.8
tau0 = 1.1
mu0 = 0.0
x = np.array([0.4, 0.1, 0.2, -0.1])
xbar = x.mean()
n = len(x)
grid = np.linspace(-4, 4, 4000)
prior = stats.norm.pdf(grid, mu0, tau0)
likelihood = np.exp(np.sum(stats.norm.logpdf(x[:, None], loc=grid[None, :], scale=sigma), axis=0))
evidence_numeric = np.trapezoid(prior * likelihood, grid)
var_bar = sigma**2 / n + tau0**2
evidence_closed = stats.norm.pdf(xbar, loc=mu0, scale=np.sqrt(var_bar)) * (2 * np.pi * sigma**2) ** (-(n - 1) / 2) * np.exp(-np.sum((x - xbar) ** 2) / (2 * sigma**2)) / np.sqrt(n)
header("Conjugacy gives closed-form evidence")
print(f"Numeric evidence: {evidence_numeric:.8e}")
print(f"Closed evidence : {evidence_closed:.8e}")
report("Numeric and closed-form evidence agree", np.isclose(evidence_numeric, evidence_closed, rtol=1e-3))
4. Core Theory II: Estimation, Prediction, and Uncertainty
4.1 MAP Estimation as Regularized MLE
Gaussian priors produce ridge penalties in the MAP objective.
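Taking the negative log posterior of a Gaussian likelihood with a zero-mean Gaussian prior gives, up to constants,

$$
-\log p(\beta \mid D) = \frac{\lVert y - X\beta \rVert^2}{2\sigma^2} + \frac{\lVert \beta \rVert^2}{2\tau^2} + \text{const},
$$

a ridge objective; in the next cell, lam plays the role of the prior precision $1/\tau^2$.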
Code cell 33
# === 4.1 MAP with Gaussian prior equals ridge ===
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.1, 0.9, 2.0, 2.7])
sigma2 = 0.25
lam = 1.5
ridge_map = la.solve(X.T @ X / sigma2 + lam * np.eye(X.shape[1]), X.T @ y / sigma2)
ols = la.lstsq(X, y, rcond=None)[0]
header("Gaussian prior -> ridge-style MAP")
print(f"OLS estimate : {ols}")
print(f"MAP estimate : {ridge_map}")
report("MAP shrinks coefficients toward zero", la.norm(ridge_map) < la.norm(ols))
4.1 Laplace Prior and Sparsity
In one dimension, the Laplace-prior MAP solution reduces to soft-thresholding.
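The one-dimensional statement is

$$
\hat{\theta}_{\mathrm{MAP}} = \arg\min_{\theta}\; \tfrac{1}{2}(z - \theta)^2 + \lambda \lvert \theta \rvert = \operatorname{sign}(z)\, \max(\lvert z \rvert - \lambda,\; 0),
$$

the soft-thresholding operator evaluated in the next cell.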
Code cell 35
# === 4.1 Laplace prior -> soft-threshold MAP ===
z = 1.8
lam = 0.6
map_scalar = np.sign(z) * max(abs(z) - lam, 0.0)
header("Laplace prior and soft-thresholding")
print(f"Unregularized estimate z : {z:.4f}")
print(f"Soft-threshold MAP : {map_scalar:.4f}")
report("MAP magnitude is no larger than the unregularized estimate", abs(map_scalar) <= abs(z))
4.2 Posterior Covariance and Shrinkage
In multivariate Gaussian models, posterior covariance identifies directions of certainty and uncertainty.
Code cell 37
# === 4.2 Posterior covariance ===
prior_prec = np.array([[1.0, 0.0], [0.0, 1.0]])
like_prec = np.array([[9.0, 6.0], [6.0, 9.0]])
post_cov = la.inv(prior_prec + like_prec)
evals, evecs = la.eigh(post_cov)
header("Posterior covariance geometry")
print("Posterior covariance:")
print(post_cov)
print(f"Eigenvalues: {evals}")
report("Posterior covariance is PSD", np.all(evals > 0))
angles = np.linspace(0, 2 * np.pi, 400)
circle = np.vstack([np.cos(angles), np.sin(angles)])
ellipse = evecs @ np.diag(np.sqrt(evals)) @ circle
if HAS_MPL:
fig, ax = plt.subplots()
ax.plot(ellipse[0], ellipse[1], color=COLORS["primary"])
ax.set_title("Posterior covariance ellipse")
ax.set_xlabel(r"$\theta_1$")
ax.set_ylabel(r"$\theta_2$")
ax.axhline(0, color=COLORS["neutral"], linewidth=1)
ax.axvline(0, color=COLORS["neutral"], linewidth=1)
fig.tight_layout()
plt.show()
4.3 Posterior Predictive Distribution
The posterior predictive averages over parameter uncertainty instead of plugging in one estimate.
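Formally, the posterior predictive integrates the likelihood of new data against the posterior:

$$
p(\tilde{y} \mid D) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid D)\, d\theta .
$$

For a Beta posterior and Binomial likelihood this integral has the Beta-Binomial closed form, which the next cell evaluates with stats.betabinom.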
Code cell 39
# === 4.3 Posterior predictive Beta-Binomial ===
a, b = 4, 9
k, n = 8, 12
a1, b1 = beta_posterior(a, b, k, n)
future_m = 10
y_vals = np.arange(future_m + 1)
predictive = stats.betabinom.pmf(y_vals, future_m, a1, b1)
header("Beta-Binomial posterior predictive")
print(f"Predictive mean future successes: {future_m * a1 / (a1 + b1):.4f}")
print(f"Predictive variance : {future_m * a1 * b1 * (a1 + b1 + future_m) / ((a1 + b1)**2 * (a1 + b1 + 1)):.4f}")
report("Predictive PMF sums to 1", np.isclose(predictive.sum(), 1.0))
if HAS_MPL:
fig, ax = plt.subplots()
ax.bar(y_vals, predictive, color=COLORS["primary"], alpha=0.8)
ax.set_title("Posterior predictive over future Bernoulli successes")
ax.set_xlabel("Future successes in 10 trials")
ax.set_ylabel("Probability")
fig.tight_layout()
plt.show()
4.4 Posterior Predictive Checks
If the model says replicated data should look unlike the observed data, the model is misspecified even if posterior computation is exact.
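The check below reports a posterior predictive p-value for the variance statistic,

$$
p_{\mathrm{ppc}} = P\big(T(y^{\mathrm{rep}}) \ge T(y^{\mathrm{obs}}) \mid y^{\mathrm{obs}}\big), \qquad T(y) = \operatorname{Var}(y),
$$

where values near 0 or 1 signal that replications rarely look like the observed data.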
Code cell 41
# === 4.4 Posterior predictive checks for overdispersed counts ===
true_rates = np.random.gamma(shape=2.0, scale=2.0, size=80)
observed_counts = np.random.poisson(true_rates)
alpha0, beta0 = 1.0, 1.0
alpha_post = alpha0 + observed_counts.sum()
beta_post = beta0 + len(observed_counts)
replicated_vars = []
for _ in range(2000):
lam = np.random.gamma(shape=alpha_post, scale=1 / beta_post)
repl = np.random.poisson(lam, size=len(observed_counts))
replicated_vars.append(repl.var())
replicated_vars = np.array(replicated_vars)
obs_var = observed_counts.var()
header("Posterior predictive check")
print(f"Observed count variance : {obs_var:.4f}")
print(f"Posterior predictive mean variance : {replicated_vars.mean():.4f}")
print(f"P(replicated variance >= observed) : {np.mean(replicated_vars >= obs_var):.4f}")
report("Observed variance is larger than a typical Poisson replication", obs_var > np.median(replicated_vars))
if HAS_MPL:
fig, ax = plt.subplots()
ax.hist(replicated_vars, bins=40, density=True, color=COLORS["primary"], alpha=0.75)
ax.axvline(obs_var, color=COLORS["error"], linestyle="--", label="Observed variance")
ax.set_title("Posterior predictive check for count variance")
ax.set_xlabel("Replicated variance")
ax.set_ylabel("Density")
ax.legend()
fig.tight_layout()
plt.show()
4.5 Bayesian Linear Regression
Bayesian linear regression combines shrinkage, posterior covariance, and posterior predictive intervals in one exact model.
Code cell 43
# === 4.5 Bayesian linear regression ===
x = np.linspace(-2, 2, 25)
X = np.column_stack([np.ones_like(x), x])
beta_true = np.array([0.5, 1.2])
y = X @ beta_true + np.random.normal(0, 0.5, size=len(x))
sigma2 = 0.25
tau2 = 4.0
Sigma_n = la.inv(X.T @ X / sigma2 + np.eye(X.shape[1]) / tau2)
mu_n = Sigma_n @ (X.T @ y / sigma2)
x_star = np.linspace(-2.5, 2.5, 200)
X_star = np.column_stack([np.ones_like(x_star), x_star])
pred_mean = X_star @ mu_n
pred_var = sigma2 + np.sum(X_star @ Sigma_n * X_star, axis=1)
header("Bayesian linear regression")
print(f"Posterior mean coefficients: {mu_n}")
print("Posterior covariance:")
print(Sigma_n)
report("Predictive variances are positive", np.all(pred_var > 0))
if HAS_MPL:
fig, ax = plt.subplots()
ax.scatter(x, y, color=COLORS["neutral"], alpha=0.7, label="Observed data")
ax.plot(x_star, pred_mean, color=COLORS["primary"], label="Posterior predictive mean")
ax.fill_between(x_star, pred_mean - 1.96 * np.sqrt(pred_var), pred_mean + 1.96 * np.sqrt(pred_var),
color=COLORS["primary"], alpha=0.2, label="95% predictive interval")
ax.set_title("Bayesian linear regression predictive uncertainty")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.tight_layout()
plt.show()
5. Core Theory III: Model Comparison and Hierarchical Bayes
5.1 Marginal Likelihood and Occam's Razor
The marginal likelihood rewards models that put probability mass on data like what we observed, not just models that can fit well at one lucky parameter value.
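The quantity being scanned is the marginal likelihood of model $M$,

$$
p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta,
$$

which a diffuse prior dilutes by spreading mass over parameter values that explain the data poorly.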
Code cell 45
# === 5.1 Marginal likelihood versus prior scale ===
x = np.array([0.22, 0.18, 0.25, 0.29, 0.31])
sigma = 0.2
n = len(x)
xbar = x.mean()
sse = np.sum((x - xbar)**2)
tau_grid = np.logspace(-2, 1.2, 150)
log_evidence = []
for tau in tau_grid:
var_bar = sigma**2 / n + tau**2
logp = stats.norm.logpdf(xbar, loc=0.0, scale=np.sqrt(var_bar)) - (n - 1) * np.log(np.sqrt(2 * np.pi) * sigma) - sse / (2 * sigma**2)
log_evidence.append(logp)
log_evidence = np.array(log_evidence)
best_tau = tau_grid[np.argmax(log_evidence)]
header("Occam's razor through marginal likelihood")
print(f"Best prior scale on this grid: {best_tau:.4f}")
report("Log evidence is finite across scales", np.isfinite(log_evidence).all())
if HAS_MPL:
fig, ax = plt.subplots()
ax.plot(tau_grid, log_evidence, color=COLORS["primary"])
ax.set_xscale("log")
ax.set_title("Marginal likelihood vs prior scale")
ax.set_xlabel(r"Prior std $\tau$")
ax.set_ylabel("Log evidence")
fig.tight_layout()
plt.show()
5.2 Bayes Factors and Bayesian Model Comparison
Bayes factors compare integrated support under competing models. They can behave differently from p-values because they average over the alternative model rather than optimizing it.
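The Bayes factor for the point-null comparison below is a ratio of marginal likelihoods,

$$
\mathrm{BF}_{10} = \frac{p(D \mid H_1)}{p(D \mid H_0)} = \frac{\int p(D \mid \mu)\, p(\mu \mid H_1)\, d\mu}{p(D \mid \mu = 0)},
$$

and for this Gaussian model both marginal densities are Gaussian in $\bar{x}$.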
Code cell 47
# === 5.2 Bayes factor versus p-value ===
x = np.array([0.11, 0.22, 0.28, 0.31, 0.40, 0.33, 0.27, 0.37])
sigma = 0.25
n = len(x)
xbar = x.mean()
z = xbar / (sigma / np.sqrt(n))
p_value = 2 * stats.norm.sf(abs(z))
tau = 0.3
p_data_h0 = stats.norm.pdf(xbar, loc=0.0, scale=sigma / np.sqrt(n))
p_data_h1 = stats.norm.pdf(xbar, loc=0.0, scale=np.sqrt(sigma**2 / n + tau**2))
bf10 = p_data_h1 / p_data_h0
header("Bayes factor versus p-value")
print(f"xbar = {xbar:.4f}")
print(f"z = {z:.4f}")
print(f"p-value= {p_value:.6f}")
print(f"BF10 = {bf10:.6f}")
report("Bayes factor is positive", bf10 > 0)
5.3 Empirical Bayes
Empirical Bayes estimates prior hyperparameters from the data and then uses that fitted prior for shrinkage.
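In the Gaussian case below, the hyperparameter maximizes the marginal likelihood of the group means, and the fitted prior then shrinks each mean by a common factor:

$$
\hat{\tau} = \arg\max_{\tau} \sum_g \log \mathcal{N}\big(\bar{y}_g \mid 0,\; \sigma^2 + \tau^2\big), \qquad \hat{\mu}_g = \frac{\hat{\tau}^2}{\hat{\tau}^2 + \sigma^2}\, \bar{y}_g .
$$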
Code cell 49
# === 5.3 Empirical Bayes ===
group_means = np.array([1.4, 0.3, 1.1, -0.2, 0.0, 0.7])
obs_var = 0.2
tau_grid = np.linspace(0.05, 2.0, 300)
log_ml = []
for tau in tau_grid:
log_ml.append(np.sum(stats.norm.logpdf(group_means, loc=0.0, scale=np.sqrt(obs_var + tau**2))))
log_ml = np.array(log_ml)
tau_hat = tau_grid[np.argmax(log_ml)]
shrink_factor = tau_hat**2 / (tau_hat**2 + obs_var)
post_means = shrink_factor * group_means
header("Empirical Bayes shrinkage")
print(f"Estimated prior std tau_hat: {tau_hat:.4f}")
print(f"Shrink factor : {shrink_factor:.4f}")
print(f"Posterior means : {post_means}")
report("Posterior means are shrunk toward zero", np.all(np.abs(post_means) <= np.abs(group_means) + 1e-12))
5.4 Hierarchical Models and Partial Pooling
Partial pooling shrinks weak groups strongly and strong groups weakly.
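The group posterior mean in the next cell is a precision-weighted compromise,

$$
\hat{\mu}_g = w_g\, \bar{y}_g + (1 - w_g)\, \mu_0, \qquad w_g = \frac{n_g / \sigma^2}{n_g / \sigma^2 + 1 / \tau^2},
$$

so small $n_g$ gives small $w_g$ and strong shrinkage toward $\mu_0$.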
Code cell 51
# === 5.4 Hierarchical partial pooling ===
n_g = np.array([5, 10, 40, 100])
ybar_g = np.array([1.6, 0.8, 1.1, 1.0])
sigma2 = 1.0
mu0 = 1.0
tau2 = 0.25
weights = (n_g / sigma2) / (n_g / sigma2 + 1 / tau2)
post_means = weights * ybar_g + (1 - weights) * mu0
header("Partial pooling by group size")
for n_i, y_i, m_i in zip(n_g, ybar_g, post_means):
print(f"group n={n_i:3d}: raw mean={y_i:.3f}, posterior mean={m_i:.3f}")
report("Large groups move less", abs(post_means[-1] - ybar_g[-1]) < abs(post_means[0] - ybar_g[0]))
if HAS_MPL:
fig, ax = plt.subplots()
ax.scatter(n_g, ybar_g, color=COLORS["secondary"], label="No pooling")
ax.scatter(n_g, post_means, color=COLORS["primary"], label="Partial pooling")
ax.axhline(mu0, color=COLORS["neutral"], linestyle="--", label="Complete pooling mean")
ax.set_title("Partial pooling shrinks sparse groups more strongly")
ax.set_xlabel("Group sample size")
ax.set_ylabel("Estimated mean")
ax.legend()
fig.tight_layout()
plt.show()
5.5 Prior Sensitivity and Robustness
Posterior decisions can depend materially on the prior scale when the data are weak.
Code cell 53
# === 5.5 Prior sensitivity ===
x = np.array([0.15, 0.19, 0.22, 0.29])
sigma = 0.3
n = len(x)
xbar = x.mean()
tau_grid = np.logspace(-2, 1, 100)
probs = []
for tau in tau_grid:
post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
post_mean = post_var * (n * xbar / sigma**2)
probs.append(1 - stats.norm.cdf(0.2, loc=post_mean, scale=np.sqrt(post_var)))
probs = np.array(probs)
header("Prior sensitivity of a posterior tail probability")
print(f"P(mu > 0.2 | D) at tau=0.1 : {probs[np.argmin(np.abs(tau_grid - 0.1))]:.4f}")
print(f"P(mu > 0.2 | D) at tau=1.0 : {probs[np.argmin(np.abs(tau_grid - 1.0))]:.4f}")
report("Tail probabilities vary with prior scale", probs.max() - probs.min() > 0.05)
if HAS_MPL:
fig, ax = plt.subplots()
ax.plot(tau_grid, probs, color=COLORS["primary"])
ax.set_xscale("log")
ax.set_title("Posterior tail probability under different prior scales")
ax.set_xlabel(r"Prior std $\tau$")
ax.set_ylabel(r"$P(\mu > 0.2 \mid D)$")
fig.tight_layout()
plt.show()
6. Advanced Topics: Approximate Inference
6.1 Why Exact Posteriors Become Intractable
Exact posterior normalization requires integrating over all parameters. In conjugate models this is algebra; in general models it is a numerical integration whose grid cost grows exponentially with parameter dimension.
Code cell 56
# === 6.1 Curse of dimensionality for posterior normalisation ===
import numpy as np
# Grid-based integration: cost grows as N^d
N_per_dim = 100 # grid points per dimension
dims = np.arange(1, 9)
grid_points = N_per_dim ** dims
header("Grid integration cost vs parameter dimension")
for d, g in zip(dims, grid_points):
feasible = 'feasible ' if g <= 1e8 else 'INFEASIBLE'
print(f" d={d}: {g:.2e} grid points [{feasible}]")
report("Grid becomes infeasible by d=5", grid_points[4] > 1e8)
6.2 MCMC for Posterior Sampling
Markov chain Monte Carlo approximates posterior expectations with samples from a Markov chain whose stationary distribution is the posterior. With a symmetric proposal, Metropolis-Hastings accepts a move from $\theta$ to $\theta'$ with probability $\min\{1,\, p(\theta' \mid D) / p(\theta \mid D)\}$.
Code cell 57
# === 6.2 MCMC with Metropolis-Hastings on a Beta posterior ===
k, n = 14, 20
a_post, b_post = beta_posterior(1, 1, k, n)
target = lambda t: stats.beta.logpdf(t, a_post, b_post) if 0 < t < 1 else -np.inf
samples = np.empty(6000)
theta = 0.5
accepted = 0
for i in range(len(samples)):
proposal = theta + np.random.normal(0, 0.08)
log_acc = target(proposal) - target(theta)
if np.log(np.random.rand()) < log_acc:
theta = proposal
accepted += 1
samples[i] = theta
burn = 1000
chain = samples[burn:]
exact_mean = a_post / (a_post + b_post)
header("Metropolis-Hastings for a Beta posterior")
print(f"Acceptance rate : {accepted / len(samples):.4f}")
print(f"Sample mean : {chain.mean():.4f}")
print(f"Exact mean : {exact_mean:.4f}")
report("Sample mean is close to exact mean", abs(chain.mean() - exact_mean) < 0.02)
if HAS_MPL:
fig, axes = plt.subplots(1, 2, figsize=(12, 4.5))
axes[0].plot(chain[:600], color=COLORS["primary"])
axes[0].set_title("Trace plot")
axes[0].set_xlabel("Iteration")
axes[0].set_ylabel(r"$\theta$")
axes[1].hist(chain, bins=40, density=True, color=COLORS["primary"], alpha=0.75, label="MCMC")
axes[1].plot(theta_grid, stats.beta.pdf(theta_grid, a_post, b_post), color=COLORS["error"], label="Exact")
axes[1].set_title("MCMC samples vs exact posterior")
axes[1].set_xlabel(r"$\theta$")
axes[1].set_ylabel("Density")
axes[1].legend()
fig.tight_layout()
plt.show()
6.3 Variational Inference and the ELBO
A restricted variational family can miss posterior structure. The ELBO objective may then prefer a simple approximation that covers only part of the true posterior.
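The identity behind this behavior is

$$
\log p(D) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q(z)\,\Vert\, p(z \mid D)\big),
$$

so maximizing the ELBO over a family $q$ is equivalent to minimizing $\mathrm{KL}(q \,\Vert\, \text{posterior})$, the mode-seeking direction of the KL divergence. The grid search below evaluates exactly this objective, up to the constant $\log p(D)$.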
Code cell 59
# === 6.3 Variational inference on a bimodal target ===
z_grid = np.linspace(-5, 5, 2000)
target_density = 0.75 * stats.norm.pdf(z_grid, 2.0, 0.7) + 0.25 * stats.norm.pdf(z_grid, -1.7, 0.5)
target_density = normalize_density(z_grid, target_density)
log_target = np.log(target_density + 1e-12)
mu_candidates = np.linspace(-2.5, 2.5, 120)
sigma_candidates = np.linspace(0.3, 2.0, 100)
best = None
best_elbo = -np.inf
for mu in mu_candidates:
for sig in sigma_candidates:
q = stats.norm.pdf(z_grid, mu, sig)
q = normalize_density(z_grid, q)
elbo = np.trapezoid(q * log_target, z_grid) - np.trapezoid(q * np.log(q + 1e-12), z_grid)
if elbo > best_elbo:
best_elbo = elbo
best = (mu, sig, q)
mu_q, sig_q, q_best = best
header("Mean-field style Gaussian VI on a bimodal target")
print(f"Best q mean : {mu_q:.4f}")
print(f"Best q std : {sig_q:.4f}")
report("Best ELBO is finite", np.isfinite(best_elbo))
if HAS_MPL:
fig, ax = plt.subplots()
ax.plot(z_grid, target_density, color=COLORS["secondary"], label="Target posterior")
ax.plot(z_grid, q_best, color=COLORS["primary"], label="Best Gaussian q")
ax.set_title("Variational approximation can be mode-seeking")
ax.set_xlabel("z")
ax.set_ylabel("Density")
ax.legend()
fig.tight_layout()
plt.show()
6.4 Amortized Inference and VAEs
In a linear-Gaussian latent model, the exact posterior is available. This makes it easy to see what an amortized encoder is trying to approximate.
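For the model assumed in the next cell, $z \sim \mathcal{N}(0, 1)$ and $x \mid z \sim \mathcal{N}(z, \sigma_x^2)$, the exact posterior is

$$
p(z \mid x) = \mathcal{N}\!\left( \frac{x}{1 + \sigma_x^2},\; \frac{\sigma_x^2}{1 + \sigma_x^2} \right),
$$

so the optimal amortized encoder is a linear map of $x$ with a fixed variance.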
Code cell 61
# === 6.4 Amortized inference in a linear-Gaussian model ===
sigma_x = 0.5
a_exact = 1 / (1 + sigma_x**2)
var_exact = sigma_x**2 / (1 + sigma_x**2)
x_vals = np.linspace(-2, 2, 9)
post_mean_exact = a_exact * x_vals
a_encoder = 0.7
var_encoder = 0.3
def elbo_for_x(x, a, var_q):
mu_q = a * x
recon = -0.5 * np.log(2 * np.pi * sigma_x**2) - 0.5 * (((x - mu_q) ** 2 + var_q) / sigma_x**2)
kl = 0.5 * (var_q + mu_q**2 - 1 - np.log(var_q))
return recon - kl
elbo_exact = np.mean([elbo_for_x(x, a_exact, var_exact) for x in x_vals])
elbo_encoder = np.mean([elbo_for_x(x, a_encoder, var_encoder) for x in x_vals])
header("Amortized inference as a learned posterior map")
print(f"Exact posterior mean coefficient: {a_exact:.4f}")
print(f"Encoder coefficient : {a_encoder:.4f}")
print(f"Average ELBO with exact q* : {elbo_exact:.4f}")
print(f"Average ELBO with simple encoder: {elbo_encoder:.4f}")
report("Exact posterior gives a higher ELBO", elbo_exact > elbo_encoder)
6.5 Laplace, MC Dropout, SGLD, and SWAG
Practical Bayesian deep-learning methods differ mainly in what posterior object they approximate and how much compute they need.
Code cell 63
# === 6.5 Approximate Bayesian deep-learning surrogates on toy regression ===
x = np.linspace(-1, 1, 30)
Phi = np.column_stack([np.ones_like(x), x, x**2])
w_hat = np.array([0.2, 1.0, -0.8])
y_hat = Phi @ w_hat
laplace_cov = np.diag([0.02, 0.03, 0.04])
laplace_draws = np.random.multivariate_normal(w_hat, laplace_cov, size=200)
swag_cov = np.diag([0.01, 0.02, 0.02]) + 0.01 * np.outer([1, -1, 1], [1, -1, 1])
swag_draws = np.random.multivariate_normal(w_hat, swag_cov, size=200)
dropout_preds = []
for _ in range(200):
mask = np.random.binomial(1, 0.8, size=Phi.shape[1]) / 0.8
dropout_preds.append((Phi * mask) @ w_hat)
dropout_preds = np.array(dropout_preds)
laplace_std = (laplace_draws @ Phi.T).std(axis=0)
swag_std = (swag_draws @ Phi.T).std(axis=0)
dropout_std = dropout_preds.std(axis=0)
header("Toy comparison of posterior-surrogate uncertainty")
print(f"Laplace predictive std at x=1 : {laplace_std[-1]:.4f}")
print(f"SWAG predictive std at x=1 : {swag_std[-1]:.4f}")
print(f"Dropout predictive std at x=1 : {dropout_std[-1]:.4f}")
report("All uncertainty estimates are nonnegative", np.all(laplace_std >= 0) and np.all(swag_std >= 0) and np.all(dropout_std >= 0))
if HAS_MPL:
fig, ax = plt.subplots()
ax.plot(x, y_hat, color=COLORS["neutral"], label="Deterministic predictor")
ax.plot(x, laplace_std, color=COLORS["primary"], label="Laplace std")
ax.plot(x, swag_std, color=COLORS["secondary"], label="SWAG-like std")
ax.plot(x, dropout_std, color=COLORS["tertiary"], label="Dropout-like std")
ax.set_title("Different posterior surrogates give different uncertainty")
ax.set_xlabel("x")
ax.set_ylabel("Predictive standard deviation")
ax.legend()
fig.tight_layout()
plt.show()
7. Applications in Machine Learning
7.1 Naive Bayes as Generative Classification
Dirichlet smoothing keeps class-conditional probabilities from collapsing to zero.
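With a $\mathrm{Dir}(\boldsymbol{\alpha})$ prior over per-class token probabilities, the posterior mean estimate for token $w$ in class $c$ is

$$
\hat{p}(w \mid c) = \frac{n_{wc} + \alpha_w}{\sum_{w'} \big( n_{w'c} + \alpha_{w'} \big)},
$$

which stays strictly positive even for unseen tokens; the cell below uses $\alpha_w = 1$.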
Code cell 65
# === 7.1 Naive Bayes with Dirichlet smoothing ===
vocab = ["bayes", "prior", "gradient", "token"]
class_counts = {
"stats": np.array([18, 14, 2, 1]),
"ml": np.array([3, 2, 16, 12]),
}
alpha = np.ones(len(vocab))
probs = {}
for cls, counts in class_counts.items():
post = counts + alpha
probs[cls] = post / post.sum()
test_doc = np.array([2, 1, 0, 0])
log_scores = {}
for cls, p in probs.items():
log_scores[cls] = np.sum(test_doc * np.log(p))
header("Naive Bayes with Dirichlet smoothing")
print("Posterior token probabilities:")
for cls, p in probs.items():
print(f"{cls:5s}: {p}")
print(f"Document log-scores: {log_scores}")
report("Stats class wins on the Bayesian vocabulary", log_scores["stats"] > log_scores["ml"])
7.2 Bayesian Neural Networks
A full Bayesian neural network is intractable in general, but even a toy Bayesian logistic model shows how posterior uncertainty over parameters propagates to predictions.
Code cell 67
# === 7.2 Toy Bayesian logistic regression ===
x = np.linspace(-2, 2, 25)
y = (x + np.random.normal(0, 0.8, size=len(x)) > 0).astype(float)
w_grid = np.linspace(-6, 6, 1500)
prior_w = stats.norm.pdf(w_grid, 0, 2.0)
logits = np.outer(x, w_grid)
ll = np.sum(y[:, None] * -np.logaddexp(0, -logits) + (1 - y)[:, None] * -np.logaddexp(0, logits), axis=0)
posterior_w = normalize_density(w_grid, prior_w * np.exp(ll - ll.max()))
w_samples = np.random.choice(w_grid, size=3000, p=posterior_w / posterior_w.sum())
x_test = np.linspace(-2.5, 2.5, 200)
pred_probs = special.expit(np.outer(w_samples, x_test))
mean_prob = pred_probs.mean(axis=0)
std_prob = pred_probs.std(axis=0)
header("Toy Bayesian logistic regression")
print(f"Posterior mean of weight: {np.mean(w_samples):.4f}")
print(f"Posterior std of weight : {np.std(w_samples):.4f}")
center_std = std_prob[np.abs(x_test) < 0.5].mean()
edge_std = std_prob[np.abs(x_test) > 2.0].mean()
report("Uncertainty is larger near the boundary on average", center_std > edge_std)
if HAS_MPL:
fig, ax = plt.subplots()
ax.scatter(x, y, color=COLORS["neutral"], alpha=0.7, label="Training labels")
ax.plot(x_test, mean_prob, color=COLORS["primary"], label="Posterior predictive mean")
ax.fill_between(x_test, np.clip(mean_prob - 2 * std_prob, 0, 1), np.clip(mean_prob + 2 * std_prob, 0, 1),
color=COLORS["primary"], alpha=0.2, label="Approx. uncertainty band")
ax.set_title("Parameter uncertainty creates predictive uncertainty")
ax.set_xlabel("x")
ax.set_ylabel("P(y=1 | x, D)")
ax.legend()
fig.tight_layout()
plt.show()
7.3 Uncertainty for Calibration, Active Learning, and OOD Detection
Posterior-aware systems often spend labeling budget or human review budget where predictive uncertainty is high.
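The acquisition score used below is the predictive entropy of the binary label,

$$
H\big[p(y \mid x, D)\big] = -\,\bar{p} \log \bar{p} - (1 - \bar{p}) \log (1 - \bar{p}), \qquad \bar{p} = P(y = 1 \mid x, D),
$$

which is maximized where the posterior predictive is closest to 0.5.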
Code cell 69
# === 7.3 Uncertainty-driven querying ===
candidate_x = np.linspace(-2.5, 2.5, 25)
pred_probs = special.expit(np.outer(w_samples[:2000], candidate_x))
mean_probs = pred_probs.mean(axis=0)
entropies = -(mean_probs * np.log(mean_probs + 1e-12) + (1 - mean_probs) * np.log(1 - mean_probs + 1e-12))
query_idx = np.argmax(entropies)
header("Active-learning style uncertainty query")
print(f"Most uncertain candidate x: {candidate_x[query_idx]:.4f}")
print(f"Predictive entropy there : {entropies[query_idx]:.4f}")
report("Uncertainty peaks near the class boundary", abs(candidate_x[query_idx]) < 0.7)
if HAS_MPL:
fig, ax = plt.subplots()
ax.plot(candidate_x, entropies, color=COLORS["primary"])
ax.axvline(candidate_x[query_idx], color=COLORS["error"], linestyle="--", label="Suggested query")
ax.set_title("Posterior predictive entropy for candidate queries")
ax.set_xlabel("Candidate x")
ax.set_ylabel("Predictive entropy")
ax.legend()
fig.tight_layout()
plt.show()
7.4 Bayesian Optimization and Thompson Sampling
Posterior sampling turns uncertainty into exploration. A Bernoulli bandit with Beta posteriors is the cleanest demonstration.
Code cell 71
# === 7.4 Thompson sampling for a Bernoulli bandit ===
true_rates = np.array([0.08, 0.11, 0.15])
a = np.ones_like(true_rates)
b = np.ones_like(true_rates)
pulls = np.zeros_like(true_rates, dtype=int)
rewards = np.zeros_like(true_rates, dtype=int)
for _ in range(400):
theta_samp = np.random.beta(a, b)
arm = np.argmax(theta_samp)
reward = np.random.rand() < true_rates[arm]
pulls[arm] += 1
rewards[arm] += int(reward)
a[arm] += int(reward)
b[arm] += 1 - int(reward)
header("Thompson sampling")
print(f"Arm pulls : {pulls}")
print(f"Arm rewards : {rewards}")
print(f"Posterior means: {a / (a + b)}")
report("Best arm is sampled most often", pulls[np.argmax(true_rates)] == pulls.max())
7.5 Bayesian Views of Regularization and Fine-Tuning
Priors need not be centered at zero. In fine-tuning, a natural prior center is often the pretrained parameter value.
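A one-dimensional sketch of the next cell: with a quadratic task loss centered at the task MLE and a quadratic prior penalty of strength $\lambda$ centered at the pretrained value,

$$
\theta_{\mathrm{MAP}} = \arg\min_{\theta}\; \tfrac{1}{2}\big(\theta - \hat{\theta}_{\mathrm{task}}\big)^2 + \tfrac{\lambda}{2}\big(\theta - \theta_{\mathrm{pre}}\big)^2 = \frac{\hat{\theta}_{\mathrm{task}} + \lambda\, \theta_{\mathrm{pre}}}{1 + \lambda},
$$

which is the convex combination Code cell 73 computes with $\lambda$ = prior_strength.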
Code cell 73
# === 7.5 Prior-centered fine-tuning ===
theta_pretrained = np.array([1.2, -0.4, 0.8])
theta_mle = np.array([1.7, -0.1, 0.4])
prior_strength = 3.0
map_centered = (theta_mle + prior_strength * theta_pretrained) / (1 + prior_strength)
header("Prior centered at pretrained weights")
print(f"Pretrained weights : {theta_pretrained}")
print(f"Task-only MLE : {theta_mle}")
print(f"Prior-centered MAP : {map_centered}")
report("MAP stays closer to pretrained weights than the task-only MLE", la.norm(map_centered - theta_pretrained) < la.norm(theta_mle - theta_pretrained))
8. Common Mistakes
| # | Mistake | Why it is wrong | Fix |
|---|---|---|---|
| 1 | Treating likelihood as a posterior | Likelihood is not normalized over θ (see the sketch after this table) | Multiply by prior and normalize |
| 2 | MAP = full Bayesian inference | MAP collapses the posterior to one point | Use full posterior or predictive |
| 3 | Flat prior = objective | Flatness is not parameterization-invariant | Use weakly informative or Jeffreys priors |
| 4 | Credible interval = confidence interval | Different probability statements | Use posterior language for Bayesian intervals |
| 5 | ELBO improvement = posterior quality | Restricted family can still miss mass | Evaluate calibration, not just ELBO |
| 6 | Softmax confidence = Bayesian uncertainty | Deterministic logits can be wildly wrong off-distribution | Use posterior surrogates (MC dropout, SWAG, Laplace) |
| 7 | More MCMC iterations always fix convergence | Bad posterior geometry persists at any chain length | Check ESS, split-R̂, trace plots |
| 8 | Conjugate prior is automatically correct | Conjugacy buys tractability, not truth | Validate model fit with PPCs |
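A minimal runnable sketch of mistake #1, reusing the Bernoulli setup from Section 1.1 (it assumes NumPy 2.0+ for np.trapezoid, as in the setup cell above): the likelihood curve integrates to roughly 7.6e-4 over θ, so it is not a distribution over θ until it is multiplied by a prior and renormalized.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 500)
k, n = 7, 10

# Bernoulli likelihood viewed as a function of theta -- NOT a density over theta.
likelihood = theta**k * (1 - theta)**(n - k)
print(f"Integral of likelihood over theta: {np.trapezoid(likelihood, theta):.6f}")

# Multiplying by a flat prior and renormalizing yields an actual posterior density.
posterior = likelihood / np.trapezoid(likelihood, theta)
print(f"Integral of posterior over theta : {np.trapezoid(posterior, theta):.6f}")
```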
What to Notice
- Exact conjugate Bayes is algebraically simple because sufficient statistics add.
- Posterior prediction integrates uncertainty instead of pretending the fitted parameter is exact.
- Bayes factors and marginal likelihoods compare full models, not just best-fitting points.
- Approximate inference is unavoidable in modern ML, so the error mode of the approximation matters.
- In ML deployment, uncertainty is useful only when it changes decisions: abstention, exploration, monitoring, or safer adaptation.