Math for LLMs

Statistics / Bayesian Inference

Converted from theory.ipynb for web reading.

Bayesian Inference — Theory Notebook

Posterior reasoning turns uncertainty into something we can compute with, visualize, and act on.

This notebook is the interactive companion to notes.md. It follows the section arc from exact Bayesian updating through posterior prediction, model comparison, approximate inference, and uncertainty-aware ML applications.

| Coverage | What You Will Do |
| --- | --- |
| Exact Bayes | Conjugate updates, posterior summaries, credible intervals |
| Prediction | Posterior predictive distributions, posterior predictive checks |
| Model comparison | Marginal likelihoods, Bayes factors, prior sensitivity |
| Approximate Bayes | MCMC, VI, amortized inference, Bayesian deep-learning surrogates |
| ML applications | Naive Bayes, Bayesian neural nets, active learning, Thompson sampling |

Code cell 3

import numpy as np

try:
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    mpl.rcParams.update({
        "figure.figsize":    (10, 6),
        "figure.dpi":         120,
        "font.size":           13,
        "axes.titlesize":      15,
        "axes.labelsize":      13,
        "xtick.labelsize":     11,
        "ytick.labelsize":     11,
        "legend.fontsize":     11,
        "legend.framealpha":   0.85,
        "lines.linewidth":      2.0,
        "axes.spines.top":     False,
        "axes.spines.right":   False,
        "savefig.bbox":       "tight",
        "savefig.dpi":         150,
    })
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

try:
    import seaborn as sns
    if HAS_MPL:
        sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    if HAS_MPL:
        plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

np.random.seed(42)
print("Plot setup complete.")

Code cell 4

import numpy.linalg as la
from scipy import stats, special

COLORS = {
    "primary":   "#0077BB",
    "secondary": "#EE7733",
    "tertiary":  "#009988",
    "error":     "#CC3311",
    "neutral":   "#555555",
    "highlight": "#EE3377",
}

np.set_printoptions(precision=6, suppress=True)

def header(title):
    print("\n" + "=" * len(title))
    print(title)
    print("=" * len(title))

def report(name, cond):
    print(f"{'PASS' if cond else 'FAIL'} - {name}")
    return cond

def normalize_density(x, values):
    area = np.trapezoid(values, x)
    return values / area

def beta_posterior(alpha, beta, k, n):
    return alpha + k, beta + n - k

print("Extra imports and helper functions loaded.")

1. Intuition

1.1 From Data to Belief

The first Bayesian habit is to represent uncertainty about a parameter before and after seeing data. A prior over a Bernoulli success probability becomes a posterior after observing outcomes.

Code cell 6

# === 1.1 From Data to Belief ===
theta = np.linspace(0.001, 0.999, 500)
prior = np.ones_like(theta)
k, n = 7, 10
likelihood = theta**k * (1 - theta)**(n - k)
posterior = normalize_density(theta, prior * likelihood)

posterior_mean = np.trapezoid(theta * posterior, theta)
posterior_map = theta[np.argmax(posterior)]

header("Bernoulli belief update")
print(f"Observed data: {k} successes out of {n} trials")
print(f"Posterior mean: {posterior_mean:.4f}")
print(f"Posterior MAP : {posterior_map:.4f}")
report("Posterior mean lies between 0 and 1", 0 < posterior_mean < 1)

if HAS_MPL:
    fig, ax = plt.subplots()
    ax.plot(theta, normalize_density(theta, prior), color=COLORS["neutral"], label="Prior")
    ax.plot(theta, normalize_density(theta, likelihood), color=COLORS["secondary"], label="Likelihood")
    ax.plot(theta, posterior, color=COLORS["primary"], label="Posterior")
    ax.set_title("Prior, likelihood, and posterior for Bernoulli data")
    ax.set_xlabel(r"Success probability $\theta$")
    ax.set_ylabel("Density")
    ax.legend()
    fig.tight_layout()
    plt.show()

1.2 Frequentist vs Bayesian Uncertainty

Bayesian questions often sound like: "Given what we observed, what is the probability the improvement exceeds the deployment threshold?" That is a posterior probability question, not a p-value question.

Code cell 8

# === 1.2 Frequentist vs Bayesian Uncertainty ===
alpha0, beta0 = 2, 18
control_clicks, control_impressions = 45, 500
test_clicks, test_impressions = 60, 500

a_c, b_c = beta_posterior(alpha0, beta0, control_clicks, control_impressions)
a_t, b_t = beta_posterior(alpha0, beta0, test_clicks, test_impressions)

draws_c = np.random.beta(a_c, b_c, size=50_000)
draws_t = np.random.beta(a_t, b_t, size=50_000)
improvement = draws_t - draws_c

prob_better = np.mean(improvement > 0)
prob_big = np.mean(improvement > 0.01)

header("Posterior probability of improvement")
print(f"P(test > control | data)       = {prob_better:.4f}")
print(f"P(test - control > 0.01 | data) = {prob_big:.4f}")
report("Improvement probability exceeds 0.5", prob_better > 0.5)

if HAS_MPL:
    fig, ax = plt.subplots()
    ax.hist(improvement, bins=60, density=True, color=COLORS["primary"], alpha=0.75)
    ax.axvline(0.0, color=COLORS["neutral"], linestyle="--", label="No improvement")
    ax.axvline(0.01, color=COLORS["error"], linestyle="--", label="1 percentage point")
    ax.set_title("Posterior uncertainty over effect size")
    ax.set_xlabel("Test CTR - Control CTR")
    ax.set_ylabel("Density")
    ax.legend()
    fig.tight_layout()
    plt.show()

1.3 Historical Timeline

Bayesian inference has more than 250 years of history, but most of its computational machinery emerged only in the last 30 years.

| Year | Contributor | Contribution |
| --- | --- | --- |
| 1763 | Thomas Bayes | Posthumous essay on inverse probability |
| 1774–1812 | Laplace | Generalizes to continuous parameters, develops asymptotic ideas |
| 1939 | Jeffreys | Objective Bayes, invariant priors |
| 1954 | Savage | Subjective Bayesian decision theory |
| 1990s | Gelfand, Smith, Neal, Gelman | MCMC makes complex posteriors tractable |
| 2014 | Kingma & Welling | VAE turns amortized VI into scalable deep learning |
| 2015 | Blundell et al. | Bayes by Backprop for weight posteriors |
| 2016 | Gal & Ghahramani | MC dropout as approximate Bayesian inference |
| 2019 | Maddox et al. | SWAG uncertainty baseline from SGD iterates |

Key pattern: The philosophy was settled early; the computation came later.

1.4 Why Bayesian Inference Matters for AI

When the cost of being overconfident is high, posterior probabilities and posterior predictive uncertainty become decision tools rather than academic extras.

Code cell 11

# === 1.4 Why Bayesian Inference Matters for AI ===
risk_threshold = 0.02
prob_bad = np.mean(improvement < -risk_threshold)

header("Deployment-style decision statistics")
print(f"Posterior risk of harming CTR by more than 2 percentage points: {prob_bad:.6f}")
print(f"Posterior expected uplift: {np.mean(improvement):.5f}")
print(f"Posterior 95% interval: ({np.quantile(improvement, 0.025):.5f}, {np.quantile(improvement, 0.975):.5f})")
report("Expected uplift is positive", np.mean(improvement) > 0)

2. Formal Definitions

2.1 Prior, Likelihood, Posterior, Evidence

In exact Bayesian inference it is often helpful to compute the posterior on a dense grid to see normalization directly.

Code cell 13

# === 2.1 Prior, Likelihood, Posterior, Evidence ===
mu_grid = np.linspace(-2.0, 2.0, 800)
sigma = 0.5
x_obs = np.array([0.35, 0.1, 0.42, 0.55])
mu0, tau0 = 0.0, 0.7

prior_mu = stats.norm.pdf(mu_grid, loc=mu0, scale=tau0)
likelihood_mu = np.exp(np.sum(stats.norm.logpdf(x_obs[:, None], loc=mu_grid[None, :], scale=sigma), axis=0))
posterior_mu = normalize_density(mu_grid, prior_mu * likelihood_mu)
evidence_numeric = np.trapezoid(prior_mu * likelihood_mu, mu_grid)

header("Posterior normalization on a grid")
print(f"Numeric evidence p(D): {evidence_numeric:.8f}")
print(f"Posterior integral   : {np.trapezoid(posterior_mu, mu_grid):.8f}")
report("Posterior integrates to 1", np.isclose(np.trapezoid(posterior_mu, mu_grid), 1.0, atol=1e-5))

2.2 Continuous Bayes and Normalisation

For the Gaussian-Gaussian model, the posterior also has a closed form. We can compare the closed form to the grid approximation.

Code cell 15

# === 2.2 Continuous Bayes and Normalisation ===
n = len(x_obs)
xbar = x_obs.mean()
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + n * xbar / sigma**2)

posterior_closed = stats.norm.pdf(mu_grid, loc=post_mean, scale=np.sqrt(post_var))

header("Closed-form Gaussian posterior")
print(f"Posterior mean (closed form): {post_mean:.6f}")
print(f"Posterior var  (closed form): {post_var:.6f}")
diff = np.max(np.abs(posterior_closed - posterior_mu))
print(f"Max abs difference vs grid  : {diff:.6e}")
report("Closed form matches grid posterior", diff < 5e-4)

2.3 Posterior Summaries and Bayes Estimators

Posterior mean, median, and MAP need not coincide, especially in skewed posteriors.

Code cell 17

# === 2.3 Posterior Summaries and Bayes Estimators ===
a, b = 2, 8
theta_grid = np.linspace(0.001, 0.999, 800)
post = stats.beta.pdf(theta_grid, a, b)
post = normalize_density(theta_grid, post)

mean_est = a / (a + b)
median_est = stats.beta.ppf(0.5, a, b)
map_est = (a - 1) / (a + b - 2)

header("Posterior summaries of a skewed Beta posterior")
print(f"Posterior mean  : {mean_est:.4f}")
print(f"Posterior median: {median_est:.4f}")
print(f"Posterior MAP   : {map_est:.4f}")
report("Mean, median, and MAP differ", len({round(mean_est, 3), round(median_est, 3), round(map_est, 3)}) == 3)

2.4 Credible Intervals vs Confidence Intervals

The numbers can be similar while the semantics differ. Here we compare a Bayesian equal-tail interval and a simple frequentist normal-approximation interval for a Bernoulli rate.

Code cell 19

# === 2.4 Credible Intervals vs Confidence Intervals ===
k, n = 18, 30
p_hat = k / n
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se_hat, p_hat + 1.96 * se_hat)
cred = stats.beta.ppf([0.025, 0.975], 1 + k, 1 + n - k)

header("Credible interval vs confidence interval")
print(f"Frequentist 95% interval (normal approx): ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"Bayesian 95% credible interval          : ({cred[0]:.4f}, {cred[1]:.4f})")
report("Intervals stay inside [0, 1]", 0 <= cred[0] <= 1 and 0 <= cred[1] <= 1)
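The semantic difference can be made concrete by simulation: fix a true rate, repeat the experiment many times, and count how often each interval captures the truth. This is a sketch, assuming a hypothetical true rate of 0.6 with the same n = 30; both intervals happen to behave well here, but only the confidence interval is built to target coverage.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_true, n, trials = 0.6, 30, 5000
hit_ci = hit_cred = 0
for _ in range(trials):
    k = rng.binomial(n, p_true)
    p_hat = k / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    # Frequentist normal-approximation interval
    if p_hat - 1.96 * se <= p_true <= p_hat + 1.96 * se:
        hit_ci += 1
    # Bayesian equal-tail interval under a uniform prior
    lo, hi = stats.beta.ppf([0.025, 0.975], 1 + k, 1 + n - k)
    if lo <= p_true <= hi:
        hit_cred += 1

print(f"Normal-approx CI coverage      : {hit_ci / trials:.3f}")
print(f"Uniform-prior credible coverage: {hit_cred / trials:.3f}")
```

Both fractions land near 0.95 in this setting; a credible interval carries no general coverage guarantee, it just often behaves like one in well-specified low-dimensional models.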

2.5 Prior Families and Elicitation

Priors encode structure. Here the same data are combined with different Beta priors to show how prior strength changes the posterior.

Code cell 21

# === 2.5 Prior Families and Elicitation ===
priors = {
    "weak uniform": (1, 1),
    "skeptical": (10, 30),
    "optimistic": (20, 10),
}
k, n = 9, 20

header("Effect of prior choice")
if HAS_MPL:
    fig, ax = plt.subplots()
    for name, (a0, b0) in priors.items():
        a1, b1 = beta_posterior(a0, b0, k, n)
        density = stats.beta.pdf(theta_grid, a1, b1)
        ax.plot(theta_grid, density, label=f"{name}: mean={a1/(a1+b1):.3f}")
        print(f"{name:12s} -> posterior mean = {a1/(a1+b1):.4f}")
    ax.set_title("Posterior under different priors with the same data")
    ax.set_xlabel(r"$\theta$")
    ax.set_ylabel("Density")
    ax.legend()
    fig.tight_layout()
    plt.show()

3. Core Theory I: Conjugacy and Closed-Form Updating

3.1 and 3.2 Conjugate Priors and the Beta-Binomial Model

Conjugacy lets us compare exact posterior formulas with brute-force grid integration.

Code cell 23

# === 3.1 and 3.2 Beta-Binomial conjugacy ===
alpha, beta = 3, 7
k, n = 11, 20
posterior_exact = stats.beta.pdf(theta_grid, alpha + k, beta + n - k)
posterior_exact = normalize_density(theta_grid, posterior_exact)

likelihood = theta_grid**k * (1 - theta_grid)**(n - k)
prior = stats.beta.pdf(theta_grid, alpha, beta)
posterior_grid = normalize_density(theta_grid, prior * likelihood)

max_err = np.max(np.abs(posterior_exact - posterior_grid))
header("Beta-Binomial exact update")
print(f"Posterior exact mean: {(alpha + k) / (alpha + beta + n):.6f}")
print(f"Max abs error vs grid: {max_err:.6e}")
report("Exact Beta posterior matches grid", max_err < 1e-5)

3.2 Sequential Updating

Bayesian updating can be done online. Each new Bernoulli outcome increments one posterior shape parameter.

Code cell 25

# === 3.2 Sequential updating ===
outcomes = np.array([1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1])
a_hist, b_hist = [1], [1]
means = [0.5]
for x in outcomes:
    a_hist.append(a_hist[-1] + x)
    b_hist.append(b_hist[-1] + (1 - x))
    means.append(a_hist[-1] / (a_hist[-1] + b_hist[-1]))

header("Online Beta-Binomial updating")
print(f"Final posterior: Beta({a_hist[-1]}, {b_hist[-1]})")
print(f"Final posterior mean: {means[-1]:.4f}")
report("Posterior mean remains in [0, 1]", np.all((np.array(means) >= 0) & (np.array(means) <= 1)))

if HAS_MPL:
    fig, ax = plt.subplots()
    ax.plot(range(len(means)), means, marker="o", color=COLORS["primary"])
    ax.set_title("Posterior mean under sequential Bernoulli updates")
    ax.set_xlabel("Number of observations processed")
    ax.set_ylabel("Posterior mean")
    fig.tight_layout()
    plt.show()
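A quick sanity check on the same outcome sequence: processing the observations one at a time and all at once produce the identical posterior, because the update only ever accumulates sufficient statistics.

```python
import numpy as np

outcomes = np.array([1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1])

# Sequential: one Bernoulli outcome at a time
a, b = 1, 1
for x in outcomes:
    a, b = a + x, b + (1 - x)

# Batch: add the totals in a single step
a_batch = 1 + outcomes.sum()
b_batch = 1 + len(outcomes) - outcomes.sum()

print(f"Sequential posterior: Beta({a}, {b})")
print(f"Batch posterior     : Beta({a_batch}, {b_batch})")
```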

3.3 Gamma-Poisson and Dirichlet-Categorical

Count and categorical models update by adding sufficient statistics.

Code cell 27

# === 3.3 Gamma-Poisson and Dirichlet-Categorical ===
counts = np.array([4, 3, 7, 2, 6])
alpha, beta = 2.0, 1.5
alpha_post = alpha + counts.sum()
beta_post = beta + len(counts)
rate_mean = alpha_post / beta_post

dir_prior = np.array([1.0, 2.0, 1.0, 3.0])
observed = np.array([12, 4, 1, 9])
dir_post = dir_prior + observed
dir_mean = dir_post / dir_post.sum()

header("Gamma-Poisson and Dirichlet updates")
print(f"Posterior Gamma parameters: alpha={alpha_post:.2f}, beta={beta_post:.2f}")
print(f"Posterior mean count rate : {rate_mean:.4f}")
print(f"Dirichlet posterior mean  : {dir_mean}")
report("Dirichlet posterior sums to 1", np.isclose(dir_mean.sum(), 1.0))
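The Gamma-Poisson pair also has a closed-form posterior predictive: with a Gamma posterior over the rate, the next count follows a negative binomial. A sketch using the same posterior as above, Gamma(24, 6.5), and SciPy's (n, p) parameterization of the negative binomial:

```python
import numpy as np
from scipy import stats

alpha_post, beta_post = 24.0, 6.5      # Gamma posterior from the update above

# Posterior predictive for the next Poisson count: NegBinom(alpha, beta/(beta+1))
p_nb = beta_post / (beta_post + 1.0)
y = np.arange(60)
pred = stats.nbinom.pmf(y, alpha_post, p_nb)

print(f"Predictive mean : {np.sum(y * pred):.4f}")    # matches the Gamma mean rate
print(f"Gamma mean rate : {alpha_post / beta_post:.4f}")
print(f"Mass on 0..59   : {pred.sum():.6f}")
```

The predictive mean equals the posterior mean rate, but the predictive variance is inflated by rate uncertainty, which is exactly the overdispersion a plug-in Poisson would miss.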

3.4 Gaussian-Gaussian Updating

The posterior mean becomes a precision-weighted average of prior mean and sample mean.

Code cell 29

# === 3.4 Gaussian-Gaussian updating ===
sigma = 1.0
tau0 = 2.0
mu0 = -0.5
x = np.array([0.4, 0.2, 0.1, 0.8, 0.5])
n = len(x)
xbar = x.mean()
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + n * xbar / sigma**2)

header("Gaussian-Gaussian shrinkage")
print(f"Prior mean    : {mu0:.4f}")
print(f"Sample mean   : {xbar:.4f}")
print(f"Posterior mean: {post_mean:.4f}")
print(f"Posterior var : {post_var:.4f}")
report("Posterior mean lies between prior and sample mean", min(mu0, xbar) <= post_mean <= max(mu0, xbar))

grid = np.linspace(-2.5, 2.5, 500)
prior = stats.norm.pdf(grid, mu0, tau0)
post = stats.norm.pdf(grid, post_mean, np.sqrt(post_var))
like = stats.norm.pdf(grid, xbar, sigma / np.sqrt(n))
if HAS_MPL:
    fig, ax = plt.subplots()
    ax.plot(grid, prior, label="Prior", color=COLORS["neutral"])
    ax.plot(grid, like, label="Sampling model for mean", color=COLORS["secondary"])
    ax.plot(grid, post, label="Posterior", color=COLORS["primary"])
    ax.set_title("Precision-weighted Gaussian updating")
    ax.set_xlabel(r"$\mu$")
    ax.set_ylabel("Density")
    ax.legend()
    fig.tight_layout()
    plt.show()

3.5 What Conjugacy Buys

Conjugacy gives exact posteriors and exact evidences in simple models. We can verify the evidence numerically and analytically.

Code cell 31

# === 3.5 Exact evidence in a conjugate Gaussian model ===
sigma = 0.8
tau0 = 1.1
mu0 = 0.0
x = np.array([0.4, 0.1, 0.2, -0.1])
xbar = x.mean()
n = len(x)

grid = np.linspace(-4, 4, 4000)
prior = stats.norm.pdf(grid, mu0, tau0)
likelihood = np.exp(np.sum(stats.norm.logpdf(x[:, None], loc=grid[None, :], scale=sigma), axis=0))
evidence_numeric = np.trapezoid(prior * likelihood, grid)

var_bar = sigma**2 / n + tau0**2
evidence_closed = (
    stats.norm.pdf(xbar, loc=mu0, scale=np.sqrt(var_bar))
    * (2 * np.pi * sigma**2) ** (-(n - 1) / 2)
    * np.exp(-np.sum((x - xbar) ** 2) / (2 * sigma**2))
    / np.sqrt(n)
)

header("Conjugacy gives closed-form evidence")
print(f"Numeric evidence: {evidence_numeric:.8e}")
print(f"Closed evidence : {evidence_closed:.8e}")
report("Numeric and closed-form evidence agree", np.isclose(evidence_numeric, evidence_closed, rtol=1e-3))

4. Core Theory II: Estimation, Prediction, and Uncertainty

4.1 MAP Estimation as Regularised MLE

Gaussian priors produce ridge penalties in the MAP objective.

Code cell 33

# === 4.1 MAP with Gaussian prior equals ridge ===
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.1, 0.9, 2.0, 2.7])
sigma2 = 0.25
lam = 1.5

ridge_map = la.solve(X.T @ X / sigma2 + lam * np.eye(X.shape[1]), X.T @ y / sigma2)
ols = la.lstsq(X, y, rcond=None)[0]

header("Gaussian prior -> ridge-style MAP")
print(f"OLS estimate : {ols}")
print(f"MAP estimate : {ridge_map}")
report("MAP shrinks coefficients toward zero", la.norm(ridge_map) < la.norm(ols))
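The prior precision λ interpolates between two limits: as λ → 0 the MAP recovers OLS, and as λ → ∞ it collapses to the prior mean (zero here). A sketch reusing the same toy design, with illustrative λ values:

```python
import numpy as np
import numpy.linalg as la

X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.1, 0.9, 2.0, 2.7])
sigma2 = 0.25

ols = la.lstsq(X, y, rcond=None)[0]
for lam in [1e-8, 1.5, 1e6]:
    # MAP under a zero-mean Gaussian prior with precision lam
    w = la.solve(X.T @ X / sigma2 + lam * np.eye(2), X.T @ y / sigma2)
    print(f"lam={lam:8.1e}: w={w}, ||w||={la.norm(w):.4f}")
print(f"OLS for reference: {ols}")
```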

4.1 Laplace Prior and Sparsity

In one dimension, the Laplace-prior MAP solution reduces to soft-thresholding.

Code cell 35

# === 4.1 Laplace prior -> soft-threshold MAP ===
z = 1.8
lam = 0.6
map_scalar = np.sign(z) * max(abs(z) - lam, 0.0)

header("Laplace prior and soft-thresholding")
print(f"Unregularized estimate z : {z:.4f}")
print(f"Soft-threshold MAP       : {map_scalar:.4f}")
report("MAP magnitude is no larger than the unregularized estimate", abs(map_scalar) <= abs(z))
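Applying the same operator across a grid of estimates shows why the Laplace prior induces sparsity: every estimate with |z| ≤ λ maps to exactly zero, while larger estimates are shifted toward zero by λ.

```python
import numpy as np

def soft_threshold(z, lam):
    # One-dimensional Laplace-prior MAP: shrink by lam, clip at zero
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.linspace(-3, 3, 13)
lam = 0.6
est = soft_threshold(z, lam)
for zi, ei in zip(z, est):
    print(f"z={zi:5.2f} -> MAP={ei:5.2f}")
print(f"Exact zeros: {(est == 0).sum()} of {len(z)}")
```

A Gaussian prior, by contrast, rescales estimates multiplicatively and never produces exact zeros.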

4.2 Posterior Covariance and Shrinkage

In multivariate Gaussian models, posterior covariance identifies directions of certainty and uncertainty.

Code cell 37

# === 4.2 Posterior covariance ===
prior_prec = np.array([[1.0, 0.0], [0.0, 1.0]])
like_prec = np.array([[9.0, 6.0], [6.0, 9.0]])
post_cov = la.inv(prior_prec + like_prec)
evals, evecs = la.eigh(post_cov)

header("Posterior covariance geometry")
print("Posterior covariance:")
print(post_cov)
print(f"Eigenvalues: {evals}")
report("Posterior covariance is PSD", np.all(evals > 0))

angles = np.linspace(0, 2 * np.pi, 400)
circle = np.vstack([np.cos(angles), np.sin(angles)])
ellipse = evecs @ np.diag(np.sqrt(evals)) @ circle
if HAS_MPL:
    fig, ax = plt.subplots()
    ax.plot(ellipse[0], ellipse[1], color=COLORS["primary"])
    ax.set_title("Posterior covariance ellipse")
    ax.set_xlabel(r"$\theta_1$")
    ax.set_ylabel(r"$\theta_2$")
    ax.axhline(0, color=COLORS["neutral"], linewidth=1)
    ax.axvline(0, color=COLORS["neutral"], linewidth=1)
    fig.tight_layout()
    plt.show()

4.3 Posterior Predictive Distribution

The posterior predictive averages over parameter uncertainty instead of plugging in one estimate.

Code cell 39

# === 4.3 Posterior predictive Beta-Binomial ===
a, b = 4, 9
k, n = 8, 12
a1, b1 = beta_posterior(a, b, k, n)

future_m = 10
y_vals = np.arange(future_m + 1)
predictive = stats.betabinom.pmf(y_vals, future_m, a1, b1)

header("Beta-Binomial posterior predictive")
print(f"Predictive mean future successes: {future_m * a1 / (a1 + b1):.4f}")
print(f"Predictive variance            : {future_m * a1 * b1 * (a1 + b1 + future_m) / ((a1 + b1)**2 * (a1 + b1 + 1)):.4f}")
report("Predictive PMF sums to 1", np.isclose(predictive.sum(), 1.0))

if HAS_MPL:
    fig, ax = plt.subplots()
    ax.bar(y_vals, predictive, color=COLORS["primary"], alpha=0.8)
    ax.set_title("Posterior predictive over future Bernoulli successes")
    ax.set_xlabel("Future successes in 10 trials")
    ax.set_ylabel("Probability")
    fig.tight_layout()
    plt.show()
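The closed-form Beta-Binomial predictive can be cross-checked by direct simulation of the two-stage process it integrates over: draw θ from the posterior, then draw a binomial outcome given θ. Same posterior Beta(12, 13) as above.

```python
import numpy as np

rng = np.random.default_rng(0)
a1, b1, m = 12, 13, 10               # posterior Beta(4+8, 9+4), 10 future trials

theta = rng.beta(a1, b1, size=200_000)   # parameter uncertainty
y = rng.binomial(m, theta)               # data noise given each theta

mc_mean = y.mean()
exact_mean = m * a1 / (a1 + b1)
print(f"Monte Carlo predictive mean: {mc_mean:.4f}")
print(f"Exact predictive mean      : {exact_mean:.4f}")
```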

4.4 Posterior Predictive Checks

If the model says replicated data should look unlike the observed data, the model is misspecified even if posterior computation is exact.

Code cell 41

# === 4.4 Posterior predictive checks for overdispersed counts ===
true_rates = np.random.gamma(shape=2.0, scale=2.0, size=80)
observed_counts = np.random.poisson(true_rates)

alpha0, beta0 = 1.0, 1.0
alpha_post = alpha0 + observed_counts.sum()
beta_post = beta0 + len(observed_counts)

replicated_vars = []
for _ in range(2000):
    lam = np.random.gamma(shape=alpha_post, scale=1 / beta_post)
    repl = np.random.poisson(lam, size=len(observed_counts))
    replicated_vars.append(repl.var())
replicated_vars = np.array(replicated_vars)
obs_var = observed_counts.var()

header("Posterior predictive check")
print(f"Observed count variance              : {obs_var:.4f}")
print(f"Posterior predictive mean variance   : {replicated_vars.mean():.4f}")
print(f"P(replicated variance >= observed)   : {np.mean(replicated_vars >= obs_var):.4f}")
report("Observed variance is larger than a typical Poisson replication", obs_var > np.median(replicated_vars))

if HAS_MPL:
    fig, ax = plt.subplots()
    ax.hist(replicated_vars, bins=40, density=True, color=COLORS["primary"], alpha=0.75)
    ax.axvline(obs_var, color=COLORS["error"], linestyle="--", label="Observed variance")
    ax.set_title("Posterior predictive check for count variance")
    ax.set_xlabel("Replicated variance")
    ax.set_ylabel("Density")
    ax.legend()
    fig.tight_layout()
    plt.show()

4.5 Bayesian Linear Regression

Bayesian linear regression combines shrinkage, posterior covariance, and posterior predictive intervals in one exact model.

Code cell 43

# === 4.5 Bayesian linear regression ===
x = np.linspace(-2, 2, 25)
X = np.column_stack([np.ones_like(x), x])
beta_true = np.array([0.5, 1.2])
y = X @ beta_true + np.random.normal(0, 0.5, size=len(x))

sigma2 = 0.25
tau2 = 4.0
Sigma_n = la.inv(X.T @ X / sigma2 + np.eye(X.shape[1]) / tau2)
mu_n = Sigma_n @ (X.T @ y / sigma2)

x_star = np.linspace(-2.5, 2.5, 200)
X_star = np.column_stack([np.ones_like(x_star), x_star])
pred_mean = X_star @ mu_n
pred_var = sigma2 + np.sum(X_star @ Sigma_n * X_star, axis=1)

header("Bayesian linear regression")
print(f"Posterior mean coefficients: {mu_n}")
print("Posterior covariance:")
print(Sigma_n)
report("Predictive variances are positive", np.all(pred_var > 0))

if HAS_MPL:
    fig, ax = plt.subplots()
    ax.scatter(x, y, color=COLORS["neutral"], alpha=0.7, label="Observed data")
    ax.plot(x_star, pred_mean, color=COLORS["primary"], label="Posterior predictive mean")
    ax.fill_between(x_star, pred_mean - 1.96 * np.sqrt(pred_var), pred_mean + 1.96 * np.sqrt(pred_var),
                    color=COLORS["primary"], alpha=0.2, label="95% predictive interval")
    ax.set_title("Bayesian linear regression predictive uncertainty")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    fig.tight_layout()
    plt.show()

5. Core Theory III: Model Comparison and Hierarchical Bayes

5.1 Marginal Likelihood and Occam's Razor

The marginal likelihood rewards models that put probability mass on data like what we observed, not just models that can fit well at one lucky parameter value.

Code cell 45

# === 5.1 Marginal likelihood versus prior scale ===
x = np.array([0.22, 0.18, 0.25, 0.29, 0.31])
sigma = 0.2
n = len(x)
xbar = x.mean()
sse = np.sum((x - xbar)**2)

tau_grid = np.logspace(-2, 1.2, 150)
log_evidence = []
for tau in tau_grid:
    var_bar = sigma**2 / n + tau**2
    logp = (
        stats.norm.logpdf(xbar, loc=0.0, scale=np.sqrt(var_bar))
        - (n - 1) / 2 * np.log(2 * np.pi * sigma**2)
        - sse / (2 * sigma**2)
        - 0.5 * np.log(n)  # constant in tau; completes the exact log evidence
    )
    log_evidence.append(logp)
log_evidence = np.array(log_evidence)

best_tau = tau_grid[np.argmax(log_evidence)]
header("Occam's razor through marginal likelihood")
print(f"Best prior scale on this grid: {best_tau:.4f}")
report("Log evidence is finite across scales", np.isfinite(log_evidence).all())

if HAS_MPL:
    fig, ax = plt.subplots()
    ax.plot(tau_grid, log_evidence, color=COLORS["primary"])
    ax.set_xscale("log")
    ax.set_title("Marginal likelihood vs prior scale")
    ax.set_xlabel(r"Prior std $\tau$")
    ax.set_ylabel("Log evidence")
    fig.tight_layout()
    plt.show()

5.2 Bayes Factors and Bayesian Model Comparison

Bayes factors compare integrated support under competing models. They can behave differently from p-values because they average over the alternative model rather than optimizing it.

Code cell 47

# === 5.2 Bayes factor versus p-value ===
x = np.array([0.11, 0.22, 0.28, 0.31, 0.40, 0.33, 0.27, 0.37])
sigma = 0.25
n = len(x)
xbar = x.mean()
z = xbar / (sigma / np.sqrt(n))
p_value = 2 * stats.norm.sf(abs(z))

tau = 0.3
p_data_h0 = stats.norm.pdf(xbar, loc=0.0, scale=sigma / np.sqrt(n))
p_data_h1 = stats.norm.pdf(xbar, loc=0.0, scale=np.sqrt(sigma**2 / n + tau**2))
bf10 = p_data_h1 / p_data_h0

header("Bayes factor versus p-value")
print(f"xbar   = {xbar:.4f}")
print(f"z      = {z:.4f}")
print(f"p-value= {p_value:.6f}")
print(f"BF10   = {bf10:.6f}")
report("Bayes factor is positive", bf10 > 0)
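One classic divergence, the Jeffreys-Lindley effect, appears when the z-score is held fixed at a "significant" value while n grows: the p-value stays constant, but the Bayes factor drifts toward H0. A sketch with the same σ = 0.25 and τ = 0.3 as above and an assumed fixed z of 2.5:

```python
import numpy as np
from scipy import stats

sigma, tau, z_fixed = 0.25, 0.3, 2.5    # z held at a "significant" value
bfs = []
for n in [10, 100, 1_000, 10_000]:
    se = sigma / np.sqrt(n)
    xbar = z_fixed * se                 # data engineered to keep z constant
    bf10 = (stats.norm.pdf(xbar, 0.0, np.sqrt(se**2 + tau**2))
            / stats.norm.pdf(xbar, 0.0, se))
    bfs.append(bf10)
    print(f"n={n:6d}: xbar={xbar:.5f}, BF10={bf10:.4f}")
print(f"p-value at every n: {2 * stats.norm.sf(z_fixed):.4f}")
```

As n grows, the same z corresponds to a smaller and smaller effect, and the averaging over H1 penalizes that shrinking effect while the p-value cannot.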

5.3 Empirical Bayes

Empirical Bayes estimates prior hyperparameters from the data and then uses that fitted prior for shrinkage.

Code cell 49

# === 5.3 Empirical Bayes ===
group_means = np.array([1.4, 0.3, 1.1, -0.2, 0.0, 0.7])
obs_var = 0.2
tau_grid = np.linspace(0.05, 2.0, 300)
log_ml = []
for tau in tau_grid:
    log_ml.append(np.sum(stats.norm.logpdf(group_means, loc=0.0, scale=np.sqrt(obs_var + tau**2))))
log_ml = np.array(log_ml)
tau_hat = tau_grid[np.argmax(log_ml)]
shrink_factor = tau_hat**2 / (tau_hat**2 + obs_var)
post_means = shrink_factor * group_means

header("Empirical Bayes shrinkage")
print(f"Estimated prior std tau_hat: {tau_hat:.4f}")
print(f"Shrink factor             : {shrink_factor:.4f}")
print(f"Posterior means           : {post_means}")
report("Posterior means are shrunk toward zero", np.all(np.abs(post_means) <= np.abs(group_means) + 1e-12))

5.4 Hierarchical Models and Partial Pooling

Partial pooling shrinks weak groups strongly and strong groups weakly.

Code cell 51

# === 5.4 Hierarchical partial pooling ===
n_g = np.array([5, 10, 40, 100])
ybar_g = np.array([1.6, 0.8, 1.1, 1.0])
sigma2 = 1.0
mu0 = 1.0
tau2 = 0.25

weights = (n_g / sigma2) / (n_g / sigma2 + 1 / tau2)
post_means = weights * ybar_g + (1 - weights) * mu0

header("Partial pooling by group size")
for n_i, y_i, m_i in zip(n_g, ybar_g, post_means):
    print(f"group n={n_i:3d}: raw mean={y_i:.3f}, posterior mean={m_i:.3f}")
report("Large groups move less", abs(post_means[-1] - ybar_g[-1]) < abs(post_means[0] - ybar_g[0]))

if HAS_MPL:
    fig, ax = plt.subplots()
    ax.scatter(n_g, ybar_g, color=COLORS["secondary"], label="No pooling")
    ax.scatter(n_g, post_means, color=COLORS["primary"], label="Partial pooling")
    ax.axhline(mu0, color=COLORS["neutral"], linestyle="--", label="Complete pooling mean")
    ax.set_title("Partial pooling shrinks sparse groups more strongly")
    ax.set_xlabel("Group sample size")
    ax.set_ylabel("Estimated mean")
    ax.legend()
    fig.tight_layout()
    plt.show()

5.5 Prior Sensitivity and Robustness

Posterior decisions can depend materially on the prior scale when the data are weak.

Code cell 53

# === 5.5 Prior sensitivity ===
x = np.array([0.15, 0.19, 0.22, 0.29])
sigma = 0.3
n = len(x)
xbar = x.mean()
tau_grid = np.logspace(-2, 1, 100)
probs = []
for tau in tau_grid:
    post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
    post_mean = post_var * (n * xbar / sigma**2)
    probs.append(1 - stats.norm.cdf(0.2, loc=post_mean, scale=np.sqrt(post_var)))
probs = np.array(probs)

header("Prior sensitivity of a posterior tail probability")
print(f"P(mu > 0.2 | D) at tau=0.1 : {probs[np.argmin(np.abs(tau_grid - 0.1))]:.4f}")
print(f"P(mu > 0.2 | D) at tau=1.0 : {probs[np.argmin(np.abs(tau_grid - 1.0))]:.4f}")
report("Tail probabilities vary with prior scale", probs.max() - probs.min() > 0.05)

if HAS_MPL:
    fig, ax = plt.subplots()
    ax.plot(tau_grid, probs, color=COLORS["primary"])
    ax.set_xscale("log")
    ax.set_title("Posterior tail probability under different prior scales")
    ax.set_xlabel(r"Prior std $\tau$")
    ax.set_ylabel(r"$P(\mu > 0.2 \mid D)$")
    fig.tight_layout()
    plt.show()

6. Advanced Topics: Approximate Inference

6.1 Why Exact Posteriors Become Intractable

Exact posterior normalization requires integrating over all parameters. In conjugate models this is algebra. In general models it is an integral whose cost grows exponentially with the parameter dimension.

Code cell 56

# === 6.1 Curse of dimensionality for posterior normalisation ===
# Grid-based integration: cost grows as N^d
N_per_dim = 100  # grid points per dimension
dims = np.arange(1, 9)
grid_points = N_per_dim ** dims

header("Grid integration cost vs parameter dimension")
for d, g in zip(dims, grid_points):
    feasible = 'feasible  ' if g <= 1e8 else 'INFEASIBLE'
    print(f"  d={d}: {g:.2e} grid points  [{feasible}]")

report("Grid becomes infeasible by d=5", grid_points[4] > 1e8)

6.2 MCMC for Posterior Sampling

Markov chain Monte Carlo approximates posterior expectations with samples from a Markov chain whose stationary distribution is the posterior.

Code cell 57

# === 6.2 MCMC with Metropolis-Hastings on a Beta posterior ===
k, n = 14, 20
a_post, b_post = beta_posterior(1, 1, k, n)
def target(t):
    return stats.beta.logpdf(t, a_post, b_post) if 0 < t < 1 else -np.inf

samples = np.empty(6000)
theta = 0.5
accepted = 0
for i in range(len(samples)):
    proposal = theta + np.random.normal(0, 0.08)
    log_acc = target(proposal) - target(theta)
    if np.log(np.random.rand()) < log_acc:
        theta = proposal
        accepted += 1
    samples[i] = theta

burn = 1000
chain = samples[burn:]
exact_mean = a_post / (a_post + b_post)

header("Metropolis-Hastings for a Beta posterior")
print(f"Acceptance rate : {accepted / len(samples):.4f}")
print(f"Sample mean     : {chain.mean():.4f}")
print(f"Exact mean      : {exact_mean:.4f}")
report("Sample mean is close to exact mean", abs(chain.mean() - exact_mean) < 0.02)

if HAS_MPL:
    fig, axes = plt.subplots(1, 2, figsize=(12, 4.5))
    axes[0].plot(chain[:600], color=COLORS["primary"])
    axes[0].set_title("Trace plot")
    axes[0].set_xlabel("Iteration")
    axes[0].set_ylabel(r"$\theta$")
    axes[1].hist(chain, bins=40, density=True, color=COLORS["primary"], alpha=0.75, label="MCMC")
    axes[1].plot(theta_grid, stats.beta.pdf(theta_grid, a_post, b_post), color=COLORS["error"], label="Exact")
    axes[1].set_title("MCMC samples vs exact posterior")
    axes[1].set_xlabel(r"$\theta$")
    axes[1].set_ylabel("Density")
    axes[1].legend()
    fig.tight_layout()
    plt.show()
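Raw MCMC draws are autocorrelated, so 5000 kept samples carry less information than 5000 independent draws. Below is a crude effective-sample-size estimate from the empirical autocorrelation, truncated at the first non-positive lag. It is sketched on a synthetic AR(1) chain so the right answer is known in closed form; more careful estimators exist in dedicated libraries.

```python
import numpy as np

def effective_sample_size(chain, max_lag=500):
    # ESS = n / (1 + 2 * sum of positive-lag autocorrelations)
    x = chain - chain.mean()
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (n * x.var())
    rho_sum = 0.0
    for lag in range(1, max_lag):
        if acf[lag] <= 0:          # stop at the first non-positive lag
            break
        rho_sum += acf[lag]
    return n / (1 + 2 * rho_sum)

# AR(1) chain with phi = 0.9; theoretical ESS is n * (1 - phi) / (1 + phi)
rng = np.random.default_rng(1)
phi, n = 0.9, 5000
chain = np.zeros(n)
for t in range(1, n):
    chain[t] = phi * chain[t - 1] + rng.normal()

ess = effective_sample_size(chain)
print(f"Chain length {n}, estimated ESS ~ {ess:.0f} (theory ~ {n * (1 - phi) / (1 + phi):.0f})")
```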

6.3 Variational Inference and the ELBO

A restricted variational family can miss posterior structure. The ELBO objective may then prefer a simple approximation that covers only part of the true posterior.

Code cell 59

# === 6.3 Variational inference on a bimodal target ===
z_grid = np.linspace(-5, 5, 2000)
target_density = 0.75 * stats.norm.pdf(z_grid, 2.0, 0.7) + 0.25 * stats.norm.pdf(z_grid, -1.7, 0.5)
target_density = normalize_density(z_grid, target_density)
log_target = np.log(target_density + 1e-12)

mu_candidates = np.linspace(-2.5, 2.5, 120)
sigma_candidates = np.linspace(0.3, 2.0, 100)
best = None
best_elbo = -np.inf
for mu in mu_candidates:
    for sig in sigma_candidates:
        q = stats.norm.pdf(z_grid, mu, sig)
        q = normalize_density(z_grid, q)
        elbo = np.trapezoid(q * log_target, z_grid) - np.trapezoid(q * np.log(q + 1e-12), z_grid)
        if elbo > best_elbo:
            best_elbo = elbo
            best = (mu, sig, q)

mu_q, sig_q, q_best = best
header("Mean-field style Gaussian VI on a bimodal target")
print(f"Best q mean : {mu_q:.4f}")
print(f"Best q std  : {sig_q:.4f}")
report("Best ELBO is finite", np.isfinite(best_elbo))

if HAS_MPL:
    fig, ax = plt.subplots()
    ax.plot(z_grid, target_density, color=COLORS["secondary"], label="Target posterior")
    ax.plot(z_grid, q_best, color=COLORS["primary"], label="Best Gaussian q")
    ax.set_title("Variational approximation can be mode-seeking")
    ax.set_xlabel("z")
    ax.set_ylabel("Density")
    ax.legend()
    fig.tight_layout()
    plt.show()

6.4 Amortized Inference and VAEs

In a linear-Gaussian latent model, the exact posterior is available. This makes it easy to see what an amortized encoder is trying to approximate.

Code cell 61

# === 6.4 Amortized inference in a linear-Gaussian model ===
sigma_x = 0.5
a_exact = 1 / (1 + sigma_x**2)
var_exact = sigma_x**2 / (1 + sigma_x**2)
x_vals = np.linspace(-2, 2, 9)
post_mean_exact = a_exact * x_vals

a_encoder = 0.7
var_encoder = 0.3

def elbo_for_x(x, a, var_q):
    mu_q = a * x
    recon = -0.5 * np.log(2 * np.pi * sigma_x**2) - 0.5 * (((x - mu_q) ** 2 + var_q) / sigma_x**2)
    kl = 0.5 * (var_q + mu_q**2 - 1 - np.log(var_q))
    return recon - kl

elbo_exact = np.mean([elbo_for_x(x, a_exact, var_exact) for x in x_vals])
elbo_encoder = np.mean([elbo_for_x(x, a_encoder, var_encoder) for x in x_vals])

header("Amortized inference as a learned posterior map")
print(f"Exact posterior mean coefficient: {a_exact:.4f}")
print(f"Encoder coefficient             : {a_encoder:.4f}")
print(f"Average ELBO with exact q*      : {elbo_exact:.4f}")
print(f"Average ELBO with simple encoder: {elbo_encoder:.4f}")
report("Exact posterior gives a higher ELBO", elbo_exact > elbo_encoder)
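
A quick Monte Carlo sanity check on the analytic coefficients (a sketch that simulates the same generative model, z ~ N(0, 1) with x = z + N(0, σx²) noise): regressing z on x recovers the posterior-mean coefficient, and the residual variance recovers the posterior variance.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_x = 0.5
n = 200_000

# Simulate the generative model: z ~ N(0, 1), x = z + noise.
z = rng.normal(size=n)
x = z + rng.normal(scale=sigma_x, size=n)

# Regression of z on x (through the origin, since both have mean zero):
# the slope estimates the posterior-mean coefficient a, and the residual
# variance estimates the posterior variance of z given x.
a_mc = np.sum(x * z) / np.sum(x * x)
var_mc = np.var(z - a_mc * x)

a_exact = 1 / (1 + sigma_x**2)
var_exact = sigma_x**2 / (1 + sigma_x**2)
print(f"a  : exact {a_exact:.4f} vs Monte Carlo {a_mc:.4f}")
print(f"var: exact {var_exact:.4f} vs Monte Carlo {var_mc:.4f}")
```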

6.5 Laplace, MC Dropout, SGLD, and SWAG

Practical Bayesian deep-learning methods differ mainly in what posterior object they approximate and how much compute they need.

Code cell 63

# === 6.5 Approximate Bayesian deep-learning surrogates on toy regression ===
x = np.linspace(-1, 1, 30)
Phi = np.column_stack([np.ones_like(x), x, x**2])
w_hat = np.array([0.2, 1.0, -0.8])
y_hat = Phi @ w_hat

laplace_cov = np.diag([0.02, 0.03, 0.04])
laplace_draws = np.random.multivariate_normal(w_hat, laplace_cov, size=200)
swag_cov = np.diag([0.01, 0.02, 0.02]) + 0.01 * np.outer([1, -1, 1], [1, -1, 1])
swag_draws = np.random.multivariate_normal(w_hat, swag_cov, size=200)
dropout_preds = []
for _ in range(200):
    mask = np.random.binomial(1, 0.8, size=Phi.shape[1]) / 0.8
    dropout_preds.append((Phi * mask) @ w_hat)
dropout_preds = np.array(dropout_preds)

laplace_std = (laplace_draws @ Phi.T).std(axis=0)
swag_std = (swag_draws @ Phi.T).std(axis=0)
dropout_std = dropout_preds.std(axis=0)

header("Toy comparison of posterior-surrogate uncertainty")
print(f"Laplace predictive std at x=1 : {laplace_std[-1]:.4f}")
print(f"SWAG predictive std at x=1    : {swag_std[-1]:.4f}")
print(f"Dropout predictive std at x=1 : {dropout_std[-1]:.4f}")
report("All uncertainty estimates are nonnegative", np.all(laplace_std >= 0) and np.all(swag_std >= 0) and np.all(dropout_std >= 0))

if HAS_MPL:
    fig, ax = plt.subplots()
    ax.plot(x, y_hat, color=COLORS["neutral"], label="Deterministic predictor")
    ax.plot(x, laplace_std, color=COLORS["primary"], label="Laplace std")
    ax.plot(x, swag_std, color=COLORS["secondary"], label="SWAG-like std")
    ax.plot(x, dropout_std, color=COLORS["tertiary"], label="Dropout-like std")
    ax.set_title("Different posterior surrogates give different uncertainty")
    ax.set_xlabel("x")
    ax.set_ylabel("Predictive standard deviation")
    ax.legend()
    fig.tight_layout()
    plt.show()
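
The Laplace idea in its simplest form: fit a Gaussian at the posterior mode using the curvature of the log posterior. For a Beta posterior everything is closed-form, so the quality of the fit is easy to check (a sketch with hypothetical posterior counts):

```python
import numpy as np
from scipy import stats

a_post, b_post = 12.0, 28.0  # hypothetical Beta posterior parameters

# Mode of Beta(a, b), valid for a, b > 1.
theta_map = (a_post - 1) / (a_post + b_post - 2)

# Second derivative of the log Beta density at the mode:
# d^2/dtheta^2 log p = -(a-1)/theta^2 - (b-1)/(1-theta)^2.
curv = -(a_post - 1) / theta_map**2 - (b_post - 1) / (1 - theta_map)**2
laplace_std = np.sqrt(-1 / curv)

exact_std = stats.beta.std(a_post, b_post)
print(f"MAP estimate : {theta_map:.4f}")
print(f"Laplace std  : {laplace_std:.4f}")
print(f"Exact std    : {exact_std:.4f}")
```

With moderately large counts the Gaussian fit at the mode is already close to the exact posterior spread; the approximation degrades when the posterior is skewed or the mode sits near a boundary.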

7. Applications in Machine Learning

7.1 Naive Bayes as Generative Classification

Dirichlet smoothing keeps class-conditional probabilities from collapsing to zero.

Code cell 65

# === 7.1 Naive Bayes with Dirichlet smoothing ===
vocab = ["bayes", "prior", "gradient", "token"]
class_counts = {
    "stats": np.array([18, 14, 2, 1]),
    "ml":    np.array([3, 2, 16, 12]),
}
alpha = np.ones(len(vocab))

probs = {}
for cls, counts in class_counts.items():
    post = counts + alpha
    probs[cls] = post / post.sum()

test_doc = np.array([2, 1, 0, 0])
log_scores = {}
for cls, p in probs.items():
    log_scores[cls] = np.sum(test_doc * np.log(p))

header("Naive Bayes with Dirichlet smoothing")
print("Posterior token probabilities:")
for cls, p in probs.items():
    print(f"{cls:5s}: {p}")
print(f"Document log-scores: {log_scores}")
report("Stats class wins on the Bayesian vocabulary", log_scores["stats"] > log_scores["ml"])
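
The log-scores above omit the class prior P(c), which matters when classes are imbalanced. A sketch that adds hypothetical document counts and normalizes the scores into class posteriors:

```python
import numpy as np

vocab = ["bayes", "prior", "gradient", "token"]
token_counts = {
    "stats": np.array([18, 14, 2, 1]),
    "ml":    np.array([3, 2, 16, 12]),
}
doc_counts = {"stats": 5, "ml": 45}  # hypothetical: "ml" documents dominate
alpha = np.ones(len(vocab))          # Dirichlet(1) smoothing, as above

total_docs = sum(doc_counts.values())
test_doc = np.array([2, 1, 0, 0])

log_post = {}
for cls, counts in token_counts.items():
    token_probs = (counts + alpha) / (counts + alpha).sum()
    log_post[cls] = (np.log(doc_counts[cls] / total_docs)
                     + np.sum(test_doc * np.log(token_probs)))

# Normalize the log-scores into class posteriors (log-sum-exp trick).
scores = np.array(list(log_post.values()))
post = np.exp(scores - scores.max())
post /= post.sum()
for cls, p in zip(log_post, post):
    print(f"P({cls} | doc) = {p:.4f}")
```

Even with a 9:1 imbalance toward "ml", the vocabulary evidence here still favors "stats"; with weaker token evidence, the prior term would tip the decision.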

7.2 Bayesian Neural Networks

A full Bayesian neural network is intractable in general, but even a toy Bayesian logistic model shows how posterior uncertainty over parameters propagates to predictions.

Code cell 67

# === 7.2 Toy Bayesian logistic regression ===
x = np.linspace(-2, 2, 25)
y = (x + np.random.normal(0, 0.8, size=len(x)) > 0).astype(float)
w_grid = np.linspace(-6, 6, 1500)
prior_w = stats.norm.pdf(w_grid, 0, 2.0)
logits = np.outer(x, w_grid)
ll = np.sum(y[:, None] * -np.logaddexp(0, -logits) + (1 - y)[:, None] * -np.logaddexp(0, logits), axis=0)
posterior_w = normalize_density(w_grid, prior_w * np.exp(ll - ll.max()))

w_samples = np.random.choice(w_grid, size=3000, p=posterior_w / posterior_w.sum())
x_test = np.linspace(-2.5, 2.5, 200)
pred_probs = special.expit(np.outer(w_samples, x_test))
mean_prob = pred_probs.mean(axis=0)
std_prob = pred_probs.std(axis=0)

header("Toy Bayesian logistic regression")
print(f"Posterior mean of weight: {np.mean(w_samples):.4f}")
print(f"Posterior std of weight : {np.std(w_samples):.4f}")
center_std = std_prob[np.abs(x_test) < 0.5].mean()
edge_std = std_prob[np.abs(x_test) > 2.0].mean()
report("Uncertainty is larger near the boundary on average", center_std > edge_std)

if HAS_MPL:
    fig, ax = plt.subplots()
    ax.scatter(x, y, color=COLORS["neutral"], alpha=0.7, label="Training labels")
    ax.plot(x_test, mean_prob, color=COLORS["primary"], label="Posterior predictive mean")
    ax.fill_between(x_test, np.clip(mean_prob - 2 * std_prob, 0, 1), np.clip(mean_prob + 2 * std_prob, 0, 1),
                    color=COLORS["primary"], alpha=0.2, label="Approx. uncertainty band")
    ax.set_title("Parameter uncertainty creates predictive uncertainty")
    ax.set_xlabel("x")
    ax.set_ylabel("P(y=1 | x, D)")
    ax.legend()
    fig.tight_layout()
    plt.show()
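
The link between weight uncertainty and predictive uncertainty can also be made analytic. For a single weight, a first-order delta-method approximation says the predictive std is roughly the sigmoid slope times the weight std, scaled by |x|. A sketch with a hypothetical Gaussian posterior over one weight:

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

rng = np.random.default_rng(0)
w_mean, w_std = 2.0, 0.6  # hypothetical posterior over a single weight
w_draws = rng.normal(w_mean, w_std, size=50_000)

x0 = 0.4
mc_std = sigmoid(w_draws * x0).std()

# Delta method: std[sigmoid(w x)] ~= sigmoid'(w_mean x) * |x| * std[w],
# where sigmoid'(t) = p (1 - p) with p = sigmoid(t).
p = sigmoid(w_mean * x0)
delta_std = p * (1 - p) * abs(x0) * w_std

print(f"Monte Carlo predictive std : {mc_std:.4f}")
print(f"Delta-method approximation : {delta_std:.4f}")
```

The two agree closely here because the sigmoid is nearly linear over the posterior's effective range; the delta method degrades when the posterior is wide or centered in the sigmoid's saturated tails.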

7.3 Uncertainty for Calibration, Active Learning, and OOD Detection

Posterior-aware systems often spend labeling or human-review budget where predictive uncertainty is high.

Code cell 69

# === 7.3 Uncertainty-driven querying ===
candidate_x = np.linspace(-2.5, 2.5, 25)
pred_probs = special.expit(np.outer(w_samples[:2000], candidate_x))
mean_probs = pred_probs.mean(axis=0)
entropies = -(mean_probs * np.log(mean_probs + 1e-12) + (1 - mean_probs) * np.log(1 - mean_probs + 1e-12))
query_idx = np.argmax(entropies)

header("Active-learning style uncertainty query")
print(f"Most uncertain candidate x: {candidate_x[query_idx]:.4f}")
print(f"Predictive entropy there  : {entropies[query_idx]:.4f}")
report("Uncertainty peaks near the class boundary", abs(candidate_x[query_idx]) < 0.7)

if HAS_MPL:
    fig, ax = plt.subplots()
    ax.plot(candidate_x, entropies, color=COLORS["primary"])
    ax.axvline(candidate_x[query_idx], color=COLORS["error"], linestyle="--", label="Suggested query")
    ax.set_title("Posterior predictive entropy for candidate queries")
    ax.set_xlabel("Candidate x")
    ax.set_ylabel("Predictive entropy")
    ax.legend()
    fig.tight_layout()
    plt.show()
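
Predictive entropy mixes aleatoric noise with epistemic (parameter) uncertainty, but active learning mostly wants the epistemic part. The BALD-style decomposition isolates it: mutual information = entropy of the mean prediction minus the mean per-draw entropy. A sketch that assumes posterior samples of per-candidate probabilities (like the `pred_probs` array above), synthesized here so the cell stands alone:

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log1p(-p))

# Posterior samples of P(y=1 | x, w) per candidate; rows = parameter draws.
# Candidate 0: all draws agree it's a coin flip    -> aleatoric only.
# Candidate 1: draws split between 0.02 and 0.98   -> mostly epistemic.
pred = np.column_stack([
    rng.uniform(0.48, 0.52, size=2000),
    rng.choice([0.02, 0.98], size=2000),
])

total = binary_entropy(pred.mean(axis=0))      # entropy of the mean prediction
aleatoric = binary_entropy(pred).mean(axis=0)  # mean per-draw entropy
epistemic = total - aleatoric                  # BALD mutual information

for i in range(pred.shape[1]):
    print(f"candidate {i}: total={total[i]:.3f}  "
          f"aleatoric={aleatoric[i]:.3f}  epistemic={epistemic[i]:.3f}")
```

Both candidates have near-maximal total entropy, but only the second has high mutual information, so a BALD acquisition function would query it while plain entropy cannot tell them apart.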

7.4 Bayesian Optimization and Thompson Sampling

Posterior sampling turns uncertainty into exploration. A Bernoulli bandit with Beta posteriors is the cleanest demonstration.

Code cell 71

# === 7.4 Thompson sampling for a Bernoulli bandit ===
true_rates = np.array([0.08, 0.11, 0.15])
a = np.ones_like(true_rates)
b = np.ones_like(true_rates)
pulls = np.zeros_like(true_rates, dtype=int)
rewards = np.zeros_like(true_rates, dtype=int)

for _ in range(400):
    theta_samp = np.random.beta(a, b)
    arm = np.argmax(theta_samp)
    reward = np.random.rand() < true_rates[arm]
    pulls[arm] += 1
    rewards[arm] += int(reward)
    a[arm] += int(reward)
    b[arm] += 1 - int(reward)

header("Thompson sampling")
print(f"Arm pulls   : {pulls}")
print(f"Arm rewards : {rewards}")
print(f"Posterior means: {a / (a + b)}")
report("Best arm is sampled most often", pulls[np.argmax(true_rates)] == pulls.max())
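
Exploration quality is usually scored by cumulative regret: the expected reward lost relative to always pulling the best arm. A sketch comparing Thompson sampling against a uniform-random baseline, using hypothetical well-separated rates (wider than the ones above, so the comparison is stable over a short horizon):

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([0.2, 0.4, 0.7])  # hypothetical, well-separated rates
horizon = 2000

def thompson_regret():
    a = np.ones(3)
    b = np.ones(3)
    regret = 0.0
    for _ in range(horizon):
        arm = np.argmax(rng.beta(a, b))     # sample, then act greedily on it
        reward = rng.random() < true_rates[arm]
        a[arm] += reward
        b[arm] += 1 - reward
        regret += true_rates.max() - true_rates[arm]
    return regret

ts = np.mean([thompson_regret() for _ in range(10)])
# A uniform-random policy loses the average gap on every single pull.
random_regret = horizon * np.mean(true_rates.max() - true_rates)
print(f"Thompson cumulative regret (mean of 10 runs): {ts:.1f}")
print(f"Uniform-random cumulative regret (expected) : {random_regret:.1f}")
```

Thompson's regret grows roughly logarithmically in the horizon because the posterior concentrates and suboptimal arms stop being sampled, while random exploration pays a constant per-step price forever.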

7.5 Bayesian Views of Regularization and Fine-Tuning

Priors need not be centered at zero. In fine-tuning, a natural prior center is often the pretrained parameter value.

Code cell 73

# === 7.5 Prior-centered fine-tuning ===
theta_pretrained = np.array([1.2, -0.4, 0.8])
theta_mle = np.array([1.7, -0.1, 0.4])
prior_strength = 3.0
map_centered = (theta_mle + prior_strength * theta_pretrained) / (1 + prior_strength)

header("Prior centered at pretrained weights")
print(f"Pretrained weights : {theta_pretrained}")
print(f"Task-only MLE      : {theta_mle}")
print(f"Prior-centered MAP : {map_centered}")
report("MAP stays closer to pretrained weights than the task-only MLE", la.norm(map_centered - theta_pretrained) < la.norm(theta_mle - theta_pretrained))
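
The closed form above follows from a Gaussian likelihood centered at the task MLE and a Gaussian prior centered at the pretrained weights (a sketch with unit likelihood precision and prior precision λ equal to `prior_strength`):

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\min_{\theta}\;
    \tfrac{1}{2}\,\lVert \theta - \theta_{\mathrm{MLE}} \rVert^{2}
    + \tfrac{\lambda}{2}\,\lVert \theta - \theta_{\mathrm{pre}} \rVert^{2}
\;\;\Rightarrow\;\;
(1 + \lambda)\,\hat{\theta}_{\mathrm{MAP}}
  = \theta_{\mathrm{MLE}} + \lambda\,\theta_{\mathrm{pre}}
\;\;\Rightarrow\;\;
\hat{\theta}_{\mathrm{MAP}}
  = \frac{\theta_{\mathrm{MLE}} + \lambda\,\theta_{\mathrm{pre}}}{1 + \lambda}.
```

Setting the gradient of the penalized objective to zero gives the middle equation; L2 regularization toward zero is the special case θ_pre = 0.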

8. Common Mistakes

| # | Mistake | Why it is wrong | Fix |
|---|---------|-----------------|-----|
| 1 | Treating the likelihood as a posterior | The likelihood is not normalized over θ | Multiply by the prior and normalize |
| 2 | MAP = full Bayesian inference | MAP collapses the posterior to one point | Use the full posterior or predictive |
| 3 | Flat prior = objective | Flatness is not parameterization-invariant | Use weakly informative or Jeffreys priors |
| 4 | Credible interval = confidence interval | They make different probability statements | Use posterior language for Bayesian intervals |
| 5 | ELBO improvement = posterior quality | A restricted family can still miss mass | Evaluate calibration, not just the ELBO |
| 6 | Softmax confidence = Bayesian uncertainty | Deterministic logits can be wildly wrong off-distribution | Use posterior surrogates (MC dropout, SWAG, Laplace) |
| 7 | More MCMC iterations always fix convergence | Bad geometry persists | Check ESS, split-R̂, and trace plots |
| 8 | A conjugate prior is automatically correct | Conjugacy buys tractability, not truth | Validate model fit with PPCs |

What to Notice

  • Exact conjugate Bayes is algebraically simple because sufficient statistics add.
  • Posterior prediction integrates uncertainty instead of pretending the fitted parameter is exact.
  • Bayes factors and marginal likelihoods compare full models, not just best-fitting points.
  • Approximate inference is unavoidable in modern ML, so the error mode of the approximation matters.
  • In ML deployment, uncertainty is useful only when it changes decisions: abstention, exploration, monitoring, or safer adaptation.