Math for LLMs: Exercises Notebook
Probability Theory / Common Distributions

Converted from exercises.ipynb for web reading.

§6.2 Common Distributions — Exercises

10 exercises covering discrete and continuous distributions, MGFs, exponential family, conjugate priors, and ML applications (cross-entropy, VAE KL, language model sampling).

Format         Description
Problem        Markdown cell with task description
Your Solution  Code cell with scaffolding
Solution       Code cell with reference solution and checks

Difficulty Levels

Level  Exercises  Focus
★      1–3        PMFs, CDFs, simulation
★★     4–6        MGFs, exponential family, conjugate priors
★★★    7–10       ML: softmax cross-entropy, VAE KL divergence

Topic Map

Topic                               Exercise
Binomial and Poisson                1
Beta distribution                   2
Gaussian moments and CLT            3
Moment generating functions         4
Exponential family                  5
Dirichlet-Categorical conjugate     6
Softmax cross-entropy               7
VAE KL divergence                   8
Beta-Binomial posterior predictive  9
Temperature-scaled categorical      10

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
from scipy import stats
from scipy.special import gamma, gammaln
import math

try:
    import matplotlib.pyplot as plt
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.rcParams['figure.figsize'] = [10, 5]
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)

def header(title):
    print('\n' + '=' * len(title))
    print(title)
    print('=' * len(title))

def check_close(name, got, expected, tol=1e-8):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'}{name}")
    if not ok:
        print(f'  expected: {expected}')
        print(f'  got     : {got}')
    return ok

def check_true(name, cond):
    print(f"{'PASS' if cond else 'FAIL'}{name}")
    return cond

print('Setup complete.')

Exercise 1 ★ — Binomial and Poisson: Token Error Rates

A language model makes transcription errors with probability $p = 0.02$ per token. A document has $n = 500$ tokens.

Part (a): Model the number of errors $X \sim \text{Binomial}(500, 0.02)$. Compute $P(X = 0)$, $P(X \leq 5)$, $P(X > 15)$, $\mathbb{E}[X]$, and $\text{Var}(X)$.

Part (b): Approximate using $Y \sim \text{Poisson}(\lambda = np = 10)$. Compute the same quantities and compare.

Part (c): Compute the total variation distance $\text{TVD} = \frac{1}{2}\sum_k |P(X=k) - P(Y=k)|$.

Part (d): Simulate 50,000 documents. Empirically verify $P(X \leq 5)$.

Code cell 5

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 6

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

n, p = 500, 0.02
lam = n * p  # = 10

binom_rv = stats.binom(n, p)
pois_rv  = stats.poisson(lam)

# Part (a)
p_zero_b  = binom_rv.pmf(0)
p_le5_b   = binom_rv.cdf(5)
p_gt15_b  = binom_rv.sf(15)
e_x_b     = binom_rv.mean()
var_x_b   = binom_rv.var()

# Part (b)
p_zero_p  = pois_rv.pmf(0)
p_le5_p   = pois_rv.cdf(5)
p_gt15_p  = pois_rv.sf(15)

# Part (c): TVD
k_vals = np.arange(0, 50)
tvd = 0.5 * np.sum(np.abs(binom_rv.pmf(k_vals) - pois_rv.pmf(k_vals)))

# Part (d): Simulation
n_sim = 50000
errors = stats.binom.rvs(n, p, size=n_sim)
p_le5_sim = np.mean(errors <= 5)

header('Exercise 1: Binomial and Poisson Approximation')
print(f'Binomial:  P(X=0)={p_zero_b:.6f}, P(X<=5)={p_le5_b:.6f}, P(X>15)={p_gt15_b:.6f}')
print(f'Poisson:   P(Y=0)={p_zero_p:.6f}, P(Y<=5)={p_le5_p:.6f}, P(Y>15)={p_gt15_p:.6f}')
print(f'Binom E[X]={e_x_b:.4f}, Var={var_x_b:.4f}')
print(f'Pois  E[Y]={pois_rv.mean():.4f}, Var={pois_rv.var():.4f}')
check_close('Binom E[X] = np', e_x_b, n*p)
check_close('Binom Var = np(1-p)', var_x_b, n*p*(1-p))
print(f'TVD(Binom, Poisson) = {tvd:.6f}')
check_true('TVD < 0.01', tvd < 0.01)
print(f'Simulation P(X<=5) = {p_le5_sim:.4f}, Analytical = {p_le5_b:.4f}')
check_true('Simulation within 0.005', abs(p_le5_sim - p_le5_b) < 0.005)
print('\nTakeaway: For rare events (np fixed, n large), Poisson excellently approximates')
print('Binomial — the theoretical bound on TVD is min(p, np^2).')
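
As an optional sanity check of the bound quoted in the takeaway, the short sketch below compares the computed TVD against the two quantities it mentions (Le Cam's inequality gives $d_{TV} \leq np^2$, and a refined bound gives $d_{TV} \leq p$). It is illustrative only and reuses n, p, tvd, and check_true from the cells above.

# Optional check (assumes n, p, tvd, and check_true from the cells above).
le_cam_bound  = n * p**2   # Le Cam: d_TV <= n*p^2 = 0.2
refined_bound = p          # refined bound: d_TV <= p = 0.02
print(f'TVD = {tvd:.6f}, n*p^2 = {le_cam_bound:.4f}, p = {refined_bound:.4f}')
check_true('TVD <= n*p^2 (Le Cam)', tvd <= le_cam_bound)
check_true('TVD <= p (refined bound)', tvd <= refined_bound)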

Exercise 2 ★ — Beta Distribution: Model Confidence Calibration

A model's confidence score for a class is modelled as $\theta \sim \text{Beta}(\alpha, \beta)$.

Part (a): Compute the mean, variance, and mode for $(\alpha, \beta) \in \{(2,5), (5,2), (5,5), (0.5,0.5)\}$.

Part (b): For $\text{Beta}(3, 7)$, compute $P(\theta > 0.5)$ and $P(0.2 \leq \theta \leq 0.4)$.

Part (c): Show numerically that $\text{Beta}(1,1) = \text{Uniform}(0,1)$ by comparing CDFs at 10 test points.

Part (d): The effective sample size of a $\text{Beta}(\alpha, \beta)$ prior is $n_0 = \alpha + \beta$. After observing 20 successes out of 30 trials, starting from $\text{Beta}(2, 2)$, compute the posterior and its 95% credible interval.

Code cell 8

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 9

# Solution
import numpy as np
from scipy import stats

header('Exercise 2: Beta Distribution')

# Part (a)
configs = [(2, 5), (5, 2), (5, 5), (0.5, 0.5)]
print('Beta distribution statistics:')
for a, b in configs:
    rv = stats.beta(a, b)
    mean = rv.mean()
    var  = rv.var()
    mode = (a-1)/(a+b-2) if (a > 1 and b > 1) else 'boundary'
    print(f'  Beta({a},{b}): mean={mean:.4f}, var={var:.6f}, mode={mode}')

# Part (b)
a, b = 3, 7
rv = stats.beta(a, b)
p_gt_half  = rv.sf(0.5)
p_interval = rv.cdf(0.4) - rv.cdf(0.2)
print(f'\nBeta(3,7):')
print(f'  P(theta>0.5) = {p_gt_half:.6f}')
print(f'  P(0.2<=theta<=0.4) = {p_interval:.6f}')
check_true('P(theta>0.5) < 0.1', p_gt_half < 0.1)  # Beta(3,7) puts little mass above 0.5

# Part (c): Beta(1,1) = Uniform(0,1)
test_pts = np.linspace(0.1, 0.9, 10)
beta11_cdf = stats.beta(1, 1).cdf(test_pts)
uniform_cdf = stats.uniform.cdf(test_pts)
check_close('Beta(1,1) = Uniform CDF', beta11_cdf, uniform_cdf)

# Part (d): Bayesian update
alpha_prior, beta_prior = 2, 2
successes, failures = 20, 10
alpha_post = alpha_prior + successes
beta_post  = beta_prior  + failures
post_rv    = stats.beta(alpha_post, beta_post)
ci_low, ci_high = post_rv.ppf(0.025), post_rv.ppf(0.975)

print(f'\nBayesian Update:')
print(f'  Prior:     Beta({alpha_prior},{beta_prior}), mean={alpha_prior/(alpha_prior+beta_prior):.4f}')
print(f'  Data:      {successes}/{successes+failures} successes')
print(f'  Posterior: Beta({alpha_post},{beta_post}), mean={post_rv.mean():.4f}')
print(f'  MLE: {successes/(successes+failures):.4f}')
print(f'  95% Credible Interval: [{ci_low:.4f}, {ci_high:.4f}]')
check_true('Posterior mean > prior mean (data is informative)', post_rv.mean() > alpha_prior/(alpha_prior+beta_prior))
check_true('95% CI contains MLE', ci_low <= successes/(successes+failures) <= ci_high)
print('\nTakeaway: Beta is the conjugate prior for Binomial — posterior update is just')
print('adding observations to the prior hyperparameters (pseudo-counts).')
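
A small optional sketch of the pseudo-count view in the takeaway: applying the conjugate update one observation at a time gives the same posterior as the batch update above. It is illustrative only and reuses the check_true helper from the setup cell.

# Optional sketch: sequential conjugate updates = batch update (assumes check_true).
import numpy as np
np.random.seed(0)

a_seq, b_seq = 2, 2                    # Beta(2,2) prior
outcomes = np.array([1]*20 + [0]*10)   # 20 successes, 10 failures, in any order
np.random.shuffle(outcomes)
for x in outcomes:
    a_seq += x          # success adds a pseudo-count to alpha
    b_seq += 1 - x      # failure adds a pseudo-count to beta
print(f'Sequential posterior: Beta({a_seq},{b_seq})')
check_true('Sequential update = batch update Beta(22,12)', (a_seq, b_seq) == (22, 12))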

Exercise 3 ★ — Gaussian: Xavier Initialisation and the CLT

Part (a): For $X \sim \mathcal{N}(0, \sigma^2)$, compute $\mathbb{E}[X^2]$, $\mathbb{E}[X^4]$, and $\mathbb{E}[X^6]$ using the Gaussian moment formula $\mathbb{E}[X^{2k}] = (2k-1)!! \, \sigma^{2k}$, where $(2k-1)!! = 1 \cdot 3 \cdot 5 \cdots (2k-1)$.

Part (b): Xavier initialisation sets $\sigma^2 = 1/n_{\text{in}}$. For a layer with $n_{\text{in}} = 512$ inputs, compute the standard deviation of the output $z = \sum_{i=1}^{512} w_i x_i$, where $w_i \sim \mathcal{N}(0, \sigma_w^2)$ and $x_i \sim \mathcal{N}(0, 1)$ independently. Show that Xavier achieves $\text{Var}(z) = 1$.

Part (c): Demonstrate the CLT: if $X_i \sim \text{Uniform}(0,1)$ i.i.d., then $S_n = \frac{\bar{X}_n - 0.5}{1/\sqrt{12n}} \xrightarrow{d} \mathcal{N}(0,1)$. Verify via a KS test for $n = 1, 5, 30$.

Code cell 11

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 12

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

header('Exercise 3: Gaussian Moments and Xavier Init')

# Part (a): Gaussian even moments
sigma = 2.0
# E[X^{2k}] = (2k-1)!! * sigma^{2k}
E_X2 = sigma**2           # k=1: 1!! * sigma^2 = sigma^2
E_X4 = 3 * sigma**4       # k=2: 3!! * sigma^4 = 3*sigma^4
E_X6 = 15 * sigma**6      # k=3: 5!! * sigma^6 = 15*sigma^6

# Verify numerically
X = np.random.normal(0, sigma, 1000000)
print('Gaussian moments:')
check_close('E[X^2]', (X**2).mean(), E_X2, tol=0.1)
check_close('E[X^4]', (X**4).mean(), E_X4, tol=1.0)
check_close('E[X^6]', (X**6).mean(), E_X6, tol=20.0)

# Part (b): Xavier initialisation
n_in = 512
sigma_w = np.sqrt(1.0 / n_in)  # Xavier
# z = sum_i w_i * x_i; Var(w_i*x_i) = Var(w_i)*Var(x_i) = sigma_w^2 * 1
# Var(z) = n_in * sigma_w^2 = n_in * (1/n_in) = 1
var_z_theory = n_in * sigma_w**2
print(f'\nXavier init: sigma_w={sigma_w:.6f}')
print(f'Var(z) = n_in * sigma_w^2 = {var_z_theory:.6f}')
check_close('Xavier: Var(z) = 1', var_z_theory, 1.0)

# Verify with simulation
n_trials = 10000
W = np.random.normal(0, sigma_w, (n_trials, n_in))
X = np.random.normal(0, 1, (n_trials, n_in))
Z = (W * X).sum(axis=1)
print(f'Simulated: mean={Z.mean():.4f}, var={Z.var():.4f}')
check_close('Xavier simulated Var(z) ≈ 1', Z.var(), 1.0, tol=0.05)

# Part (c): CLT demonstration
print('\nCLT: Uniform -> Normal')
for n in [1, 5, 30]:
    samples = np.random.uniform(0, 1, (50000, n)).mean(axis=1)
    z_scores = (samples - 0.5) / (1 / np.sqrt(12*n))
    ks_stat, p_val = stats.kstest(z_scores, 'norm')
    print(f'  n={n:>2}: KS stat={ks_stat:.4f}, p={p_val:.4f}')
print('\nTakeaway: Xavier init preserves variance through layers — preventing vanishing/')
print('exploding activations. This relies on the fact that Var(wX) = Var(w)*Var(X) for')
print('independent w, X both centred at 0.')
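
To make the "preserves variance through layers" point concrete, here is a minimal optional sketch (not part of the exercise) that propagates activations through a stack of random linear layers, once with Xavier scaling and once with a scale three times too large. The layer width and depth are arbitrary illustration choices.

# Optional sketch: variance propagation through stacked linear layers.
import numpy as np
np.random.seed(0)

def propagate_variance(scale, n_layers=10, width=512, batch=200):
    h = np.random.normal(0, 1, (batch, width))
    for _ in range(n_layers):
        W = np.random.normal(0, scale, (width, width))
        h = h @ W                      # each layer multiplies Var by width*scale^2
    return h.var()

xavier_scale = np.sqrt(1.0 / 512)
print(f'After 10 layers, Xavier scale:    var = {propagate_variance(xavier_scale):.3f}')
print(f'After 10 layers, 3x Xavier scale: var = {propagate_variance(3 * xavier_scale):.3e}')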

Exercise 4 ★★ — Moment Generating Functions

Part (a): Compute the first four moments of $\text{Poisson}(\lambda = 3)$ using the MGF $M(t) = \exp(\lambda(e^t - 1))$. Recall that $M^{(k)}(0) = \mathbb{E}[X^k]$.

Part (b): Use the product rule for MGFs to show that if $X \sim \text{Exp}(\lambda)$ and $Y \sim \text{Exp}(\lambda)$ are independent, then $X + Y \sim \text{Gamma}(2, \lambda)$. Verify numerically via a KS test.

Part (c): Write a function empirical_mgf(samples, t) that estimates $M_X(t)$ from samples. Compare the empirical MGF of $\mathcal{N}(2, 9)$ against its theoretical value $\exp(2t + 9t^2/2)$ at $t \in \{0.1, 0.2, 0.3\}$.

Code cell 14

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 15

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

header('Exercise 4: Moment Generating Functions')

# Part (a): Poisson moments via numerical differentiation of MGF
lam = 3.0
def mgf_pois(t):
    return np.exp(lam * (np.exp(t) - 1))

# Numerical derivatives at t=0
eps = 1e-6
E_X1 = (mgf_pois(eps) - mgf_pois(-eps)) / (2*eps)
E_X2_num = (mgf_pois(eps) - 2*mgf_pois(0) + mgf_pois(-eps)) / eps**2

# Analytical: for Poisson(lambda)
# E[X]   = lambda
# E[X^2] = lambda^2 + lambda
# E[X^3] = lambda^3 + 3*lambda^2 + lambda
# E[X^4] = lambda^4 + 6*lambda^3 + 7*lambda^2 + lambda
E_X1_th = lam
E_X2_th = lam**2 + lam
E_X3_th = lam**3 + 3*lam**2 + lam
E_X4_th = lam**4 + 6*lam**3 + 7*lam**2 + lam

# Verify vs simulation
n_sim = 500000
X = stats.poisson(lam).rvs(n_sim)
print('Poisson moments (simulation vs theory):')
check_close('E[X]   = lam', X.mean(), E_X1_th, tol=0.05)
check_close('E[X^2]', (X**2).mean(), E_X2_th, tol=0.3)
check_close('E[X^3]', (X**3).mean(), E_X3_th, tol=1.0)
print('MGF derivatives at t=0 (numerical vs analytical):')
check_close("M'(0)  = E[X]", E_X1, E_X1_th, tol=1e-3)
check_close("M''(0) = E[X^2]", E_X2_num, E_X2_th, tol=1e-2)

# Part (b): Exp + Exp ~ Gamma(2, lambda) via MGF product rule
# M_Exp(t) = lambda/(lambda-t)
# M_{X+Y}(t) = [lambda/(lambda-t)]^2 = M_Gamma(2,lambda)(t)
lambda_e = 2.0
X_exp = stats.expon(scale=1/lambda_e).rvs(50000)
Y_exp = stats.expon(scale=1/lambda_e).rvs(50000)
Z = X_exp + Y_exp
gamma_rv = stats.gamma(2, scale=1/lambda_e)
ks_stat, p_val = stats.kstest(Z, gamma_rv.cdf)
print(f'\nExp+Exp ~ Gamma(2,lambda): KS p={p_val:.4f}')
check_true('Exp+Exp ~ Gamma(2,lambda) (p>0.05)', p_val > 0.05)

# Part (c): Empirical MGF
def empirical_mgf(samples, t):
    return np.mean(np.exp(t * samples))

mu_g, sigma_g = 2.0, 3.0
X_g = np.random.normal(mu_g, sigma_g, 200000)
print('\nEmpirical vs theoretical MGF for N(2,9):')
for t in [0.1, 0.2, 0.3]:
    emp    = empirical_mgf(X_g, t)
    theory = np.exp(mu_g*t + sigma_g**2*t**2/2)
    print(f'  t={t}: emp={emp:.4f}, theory={theory:.4f}, rel_err={abs(emp-theory)/theory:.4f}')
    check_close(f'MGF at t={t}', emp, theory, tol=0.02)
print('\nTakeaway: The MGF uniquely determines a distribution. Its derivatives at t=0 give')
print('all moments, and the product rule for independent sums is a powerful tool.')

Exercise 5 ★★ — Exponential Family: Bernoulli and Gaussian

Part (a): Write the $\text{Bernoulli}(p)$ distribution in canonical exponential family form:

$$p(x \mid \eta) = h(x)\exp(\eta T(x) - A(\eta))$$

Identify $\eta$, $T(x)$, $A(\eta)$, and $h(x)$.

Part (b): Verify the log-partition theorem for the Bernoulli: $\mathbb{E}[T(X)] = A'(\eta)$ and $\text{Var}(T(X)) = A''(\eta)$.

Part (c): Show that $A'(\eta) = \sigma(\eta) = 1/(1+e^{-\eta})$, where $\sigma$ is the sigmoid function. This connects maximum likelihood estimation to logistic regression.

Part (d): Write $\mathcal{N}(\mu, \sigma^2)$ in exponential family form with natural parameters $\eta_1 = \mu/\sigma^2$ and $\eta_2 = -1/(2\sigma^2)$. Verify that $\partial A/\partial \eta_1 = \mathbb{E}[X] = \mu$.

Code cell 17

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 18

# Solution
import numpy as np
from scipy import stats

header('Exercise 5: Exponential Family')

# Part (a): Bernoulli as exponential family
# p(x|p) = exp(x*log(p/(1-p)) + log(1-p))
# eta = log(p/(1-p))  [log-odds / logit]
# T(x) = x  [sufficient statistic]
# A(eta) = log(1 + e^eta)  [log-partition]
# h(x) = 1  [base measure]

p = 0.7
eta = np.log(p / (1 - p))  # logit(p)
A_eta = np.log(1 + np.exp(eta))
print(f'Bernoulli exponential family:')
print(f'  eta = logit(p) = {eta:.4f}')
print(f'  A(eta) = log(1+e^eta) = {A_eta:.4f}')
print(f'  A(eta) = -log(1-p) = {-np.log(1-p):.4f}')
check_close('A(eta) = -log(1-p)', A_eta, -np.log(1-p))

# Part (b)+(c): A'(eta) = sigmoid(eta) = p
def A_bernoulli(eta):
    return np.log(1 + np.exp(eta))

eps = 1e-7
dA = (A_bernoulli(eta+eps) - A_bernoulli(eta-eps)) / (2*eps)
sigmoid_eta = 1 / (1 + np.exp(-eta))
print(f'\nLog-partition theorem:')
print(f'  A prime(eta) = {dA:.6f}')
print(f'  sigmoid(eta) = {sigmoid_eta:.6f}')
print(f'  p            = {p:.6f}')
check_close('A prime(eta) = sigmoid(eta) = p', dA, p)

# Part (d): Gaussian exponential family
mu, sigma = 2.0, 1.5
eta1 = mu / sigma**2
eta2 = -1 / (2 * sigma**2)
def A_gaussian(eta1, eta2):
    return -eta1**2 / (4*eta2) - 0.5 * np.log(-2*eta2)

eps = 1e-7
dA_deta1 = (A_gaussian(eta1+eps, eta2) - A_gaussian(eta1-eps, eta2)) / (2*eps)
print(f'\nGaussian exponential family:')
print(f'  dA/d_eta1 = {dA_deta1:.6f}  (E[X] = mu = {mu:.6f})')
check_close('dA/d_eta1 = E[X] = mu', dA_deta1, mu)
print('\nTakeaway: The exponential family unifies all common distributions. The log-partition')
print('theorem connects MLE to moment matching — the basis of GLMs and natural gradients.')
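
To make the logistic-regression connection in part (c) explicit, here is an optional hedged sketch: for logits $\eta = \mathbf{w}^\top \mathbf{x}$, the Bernoulli negative log-likelihood written through $A(\eta)$ has gradient $X^\top(\sigma(\eta) - y)$, the familiar "prediction minus target" form. It is illustrative only (random toy data) and reuses the check_close helper from the setup cell.

# Optional sketch: A'(eta) = sigmoid(eta) gives the logistic-regression gradient.
import numpy as np
np.random.seed(0)

def nll_logistic(w, X, y):
    # NLL of Bernoulli labels with logits eta = X @ w, via A(eta) = log(1+e^eta):
    # NLL = sum_i [A(eta_i) - y_i * eta_i]
    eta = X @ w
    return np.sum(np.log1p(np.exp(eta)) - y * eta)

def grad_logistic(w, X, y):
    # d/dw sum_i [A(eta_i) - y_i eta_i] = X^T (sigmoid(eta) - y)
    eta = X @ w
    return X.T @ (1 / (1 + np.exp(-eta)) - y)

X = np.random.normal(size=(50, 3))
y = (np.random.rand(50) < 0.5).astype(float)
w = np.random.normal(size=3)

# Finite-difference check of the analytical gradient
eps = 1e-6
g_fd = np.array([(nll_logistic(w + eps*np.eye(3)[i], X, y) -
                  nll_logistic(w - eps*np.eye(3)[i], X, y)) / (2*eps) for i in range(3)])
check_close('Logistic NLL gradient = X^T (sigmoid(eta) - y)', grad_logistic(w, X, y), g_fd, tol=1e-4)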

Exercise 6 ★★ — Dirichlet-Categorical: Topic Model Simulation

Part (a): Starting from a $\text{Dir}(\boldsymbol{1}_K)$ prior over $K = 4$ topics, update the distribution after observing word counts $\mathbf{n} = [15, 8, 3, 4]$. Compute the posterior mean and the MAP estimate.

Part (b): Sample 1000 probability vectors from $\text{Dir}(\boldsymbol{\alpha})$ for $\boldsymbol{\alpha} = [1,1,1,1]$, $[0.1,0.1,0.1,0.1]$, and $[10,10,10,10]$. Report the mean entropy $H(\mathbf{p}) = -\sum_k p_k \log p_k$ across samples in each case.

Part (c): Implement a single-step LDA document update: given a document with word-topic assignments, update the topic distribution $\boldsymbol{\theta}$ using the Dirichlet conjugate update.

Part (d): Show that the posterior predictive probability of observing category $k$ next, given counts $\mathbf{n}$ from a $\text{Dir}(\boldsymbol{\alpha})$ prior, is $P(x_{n+1} = k \mid \mathbf{n}) = (\alpha_k + n_k) / (\alpha_0 + n)$.

Code cell 20

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 21

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

header('Exercise 6: Dirichlet-Categorical Conjugate Prior')

K = 4
alpha_prior = np.ones(K)
counts = np.array([15, 8, 3, 4])
n = counts.sum()

# Part (a)
alpha_post = alpha_prior + counts
alpha0_post = alpha_post.sum()
post_mean = alpha_post / alpha0_post
post_mode = (alpha_post - 1) / (alpha0_post - K)  # valid since alpha_k > 1

print('Dirichlet-Categorical update:')
print(f'  Prior: Dir({alpha_prior})')
print(f'  Counts: {counts}')
print(f'  Posterior: Dir({alpha_post})')
print(f'  Posterior mean: {post_mean.round(4)}')
print(f'  MAP estimate:   {post_mode.round(4)}')
print(f'  MLE (counts/n): {(counts/n).round(4)}')
check_close('Post mean sums to 1', post_mean.sum(), 1.0)
check_close('MAP sums to 1', post_mode.sum(), 1.0)

# Part (b): Entropy comparison
def entropy_samples(samples):
    p = samples + 1e-12
    return -np.sum(p * np.log(p), axis=1).mean()

max_entropy = np.log(K)
print('\nDirichlet entropy vs concentration:')
for alpha_scale in [0.1, 1.0, 10.0]:
    alpha = np.ones(K) * alpha_scale
    samples = np.random.dirichlet(alpha, 2000)
    mean_H = entropy_samples(samples)
    print(f'  alpha={alpha_scale}: mean H={mean_H:.4f} (max={max_entropy:.4f})')
check_true('Concentrated (alpha=10) has higher mean entropy than sparse (alpha=0.1)',
    entropy_samples(np.random.dirichlet(np.ones(K)*10, 2000)) >
    entropy_samples(np.random.dirichlet(np.ones(K)*0.1, 2000)))

# Part (d): Posterior predictive
# P(x_{n+1}=k | n) = (alpha_k + n_k) / (alpha_0 + n)
pred_prob = alpha_post / (alpha_prior.sum() + n)
print(f'\nPosterior predictive: {pred_prob.round(4)}')
check_close('Predictive probs sum to 1', pred_prob.sum(), 1.0)
print('\nTakeaway: Dirichlet-Categorical conjugacy is the backbone of LDA topic models.')
print('The posterior predictive with Laplace (alpha=1) smoothing avoids zero-probability')
print('tokens — crucial for language model generalisation.')
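
The reference solution above does not cover part (c), so here is one possible minimal sketch under the reading that the per-word topic assignments are given and fixed (a full collapsed Gibbs sampler would also resample the assignments). The assignment vector is hypothetical, and the sketch reuses the check_close helper from the setup cell.

# Part (c) sketch: conjugate update of a document's topic distribution theta.
import numpy as np

K = 4
alpha_doc = np.ones(K)                                      # symmetric Dir(1) prior
z_assignments = np.array([0, 0, 1, 0, 2, 1, 0, 3, 1, 0])    # hypothetical per-word topics
topic_counts = np.bincount(z_assignments, minlength=K)

alpha_doc_post = alpha_doc + topic_counts                   # Dirichlet conjugate update
theta_mean = alpha_doc_post / alpha_doc_post.sum()
print(f'Topic counts:         {topic_counts}')
print(f'Posterior Dir params: {alpha_doc_post}')
print(f'Posterior mean theta: {theta_mean.round(4)}')
check_close('theta sums to 1', theta_mean.sum(), 1.0)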

Exercise 7 ★★★ — Softmax Cross-Entropy: Loss and Gradient

The cross-entropy loss for multi-class classification is

$$\mathcal{L}(\mathbf{z}, y) = -\log \text{softmax}(\mathbf{z})_y = -z_y + \log\sum_k e^{z_k}$$

Part (a): Implement softmax(z) with numerical stability (subtract the max). Implement cross_entropy_loss(z, y), where $y$ is the true class index.

Part (b): Compute the gradient $\partial \mathcal{L} / \partial \mathbf{z}$. Show that it equals $\mathbf{p} - \mathbf{e}_y$, where $\mathbf{p} = \text{softmax}(\mathbf{z})$ and $\mathbf{e}_y$ is the one-hot vector for class $y$.

Part (c): Verify your gradient via finite differences.

Part (d): Show that label smoothing with parameter $\epsilon$ replaces the one-hot target with $\tilde{y}_k = (1-\epsilon)\,\mathbf{1}[k=y] + \epsilon/K$. Compute the smoothed loss for $\epsilon = 0.1$.

Code cell 23

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 24

# Solution
import numpy as np

header('Exercise 7: Softmax Cross-Entropy')

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(z, y):
    p = softmax(z)
    return -np.log(p[y] + 1e-12)

def ce_gradient(z, y):
    K = len(z)
    p = softmax(z)
    e_y = np.zeros(K)
    e_y[y] = 1.0
    return p - e_y  # gradient w.r.t. logits

z = np.array([2.0, 1.0, 0.5, -0.5, -1.0])
y = 0
p = softmax(z)
loss = cross_entropy_loss(z, y)
grad = ce_gradient(z, y)

print(f'softmax(z) = {p.round(4)}')
print(f'CE loss = {loss:.4f}')
print(f'gradient = {grad.round(4)}')
check_close('p sums to 1', p.sum(), 1.0)

# Verify gradient via finite differences
eps = 1e-7
grad_fd = np.zeros(len(z))
for i in range(len(z)):
    z_plus = z.copy(); z_plus[i] += eps
    z_minus = z.copy(); z_minus[i] -= eps
    grad_fd[i] = (cross_entropy_loss(z_plus, y) - cross_entropy_loss(z_minus, y)) / (2*eps)

print('\nGradient verification:')
check_close('Analytical gradient = finite diff', grad, grad_fd)

# Label smoothing
K = len(z)
eps_smooth = 0.1
y_smooth = np.ones(K) * eps_smooth / K
y_smooth[y] += 1 - eps_smooth

loss_smooth = -np.sum(y_smooth * np.log(p + 1e-12))
print(f'\nLabel smoothing (eps={eps_smooth}):')
print(f'  Hard CE loss:    {loss:.4f}')
print(f'  Smoothed CE loss: {loss_smooth:.4f}')
check_true('Label-smoothed loss > hard loss (entropy penalty)', loss_smooth > loss)
print('\nTakeaway: The CE gradient p - e_y has an elegant form: it pushes the model')
print('probability toward 1 for the correct class and toward 0 for others.')
print('Label smoothing prevents over-confident predictions (regularisation).')
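
In training code the loss and gradient are usually computed for a whole batch of logits at once. A vectorised variant along the lines of the single-example solution above might look like the following optional sketch (rows are examples; names are illustrative).

# Batched variant (optional sketch, not part of the exercise).
import numpy as np

def softmax_batch(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # row-wise stability shift
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def ce_loss_and_grad_batch(Z, y_idx):
    N = Z.shape[0]
    P = softmax_batch(Z)
    loss = -np.log(P[np.arange(N), y_idx] + 1e-12).mean()
    G = P.copy()
    G[np.arange(N), y_idx] -= 1.0             # p - e_y, row by row
    return loss, G / N                        # gradient of the mean loss

Z = np.array([[2.0, 1.0, 0.5, -0.5, -1.0],
              [0.2, 0.1, 1.5,  0.0, -2.0]])
y_idx = np.array([0, 2])
loss, G = ce_loss_and_grad_batch(Z, y_idx)
print(f'batched CE loss = {loss:.4f}')
print('gradient rows sum to 0:', np.allclose(G.sum(axis=1), 0.0))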

Exercise 8 ★★★ — VAE KL Divergence: Closed-Form Regulariser

In a Variational Autoencoder (VAE), the encoder outputs $\mu_\phi(\mathbf{x})$ and $\log\sigma^2_\phi(\mathbf{x})$, which parameterise $q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$. The prior is $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I)$.

Part (a): Derive and implement the closed-form KL divergence:

$$D_{\text{KL}}(q \,\|\, p) = \frac{1}{2}\sum_{d=1}^D \left(\mu_d^2 + \sigma_d^2 - \log\sigma_d^2 - 1\right)$$

Part (b): Verify that KL = 0 when $\mu = 0$ and $\sigma^2 = 1$.

Part (c): Compute the KL for $\mu = [1.0, -0.5, 2.0]$, $\log\sigma^2 = [0.0, -0.5, 0.5]$.

Part (d): Show numerically that this closed form matches the Monte Carlo estimate $\hat{D}_{\text{KL}} = \frac{1}{N}\sum_i [\log q(z_i) - \log p(z_i)]$ using $N = 100{,}000$ samples.

Part (e): Plot how the KL varies as $\mu$ ranges from $-3$ to $3$ (with $\sigma^2 = 1$) and as $\sigma^2$ ranges from $0.1$ to $5$ (with $\mu = 0$).

Code cell 26

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 27

# Solution
import numpy as np
from scipy import stats

header('Exercise 8: VAE KL Divergence')

def kl_gaussian(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), per-dimension then summed."""
    sigma2 = np.exp(log_var)
    return 0.5 * np.sum(mu**2 + sigma2 - log_var - 1)

# Part (b): KL = 0 at prior
mu_zero = np.zeros(3)
lv_zero = np.zeros(3)
kl_zero = kl_gaussian(mu_zero, lv_zero)
print(f'KL at mu=0, log_var=0: {kl_zero:.8f}')
check_close('KL = 0 at prior', kl_zero, 0.0)

# Part (c)
mu_c  = np.array([1.0, -0.5, 2.0])
lv_c  = np.array([0.0, -0.5,  0.5])
kl_c  = kl_gaussian(mu_c, lv_c)
print(f'KL = {kl_c:.4f}')
# Per-dimension contributions
per_d = 0.5 * (mu_c**2 + np.exp(lv_c) - lv_c - 1)
print(f'Per-dim contributions: {per_d.round(4)}')
check_close('Sum of per-dim = total KL', per_d.sum(), kl_c)

# Part (d): Monte Carlo verification
D = 3
N = 200000
mu_mc = np.array([1.0, 0.5, -0.5])
lv_mc = np.array([0.0, 0.5, -0.3])
sigma_mc = np.exp(0.5 * lv_mc)

# Sample from q
Z = np.random.normal(0, 1, (N, D)) * sigma_mc + mu_mc

# log q(z) - log p(z)
log_q = stats.norm.logpdf(Z, loc=mu_mc, scale=sigma_mc).sum(axis=1)
log_p = stats.norm.logpdf(Z, loc=0, scale=1).sum(axis=1)
kl_mc = (log_q - log_p).mean()
kl_closed = kl_gaussian(mu_mc, lv_mc)
print(f'\nKL Monte Carlo: {kl_mc:.4f}')
print(f'KL closed-form: {kl_closed:.4f}')
check_close('MC ≈ closed-form', kl_mc, kl_closed, tol=0.05)

try:
    import matplotlib.pyplot as plt
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

# Part (e): KL as function of mu and sigma^2
if HAS_MPL:
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    ax = axes[0]
    mu_range = np.linspace(-3, 3, 200)
    kl_vs_mu = [kl_gaussian(np.array([m]), np.array([0.0])) for m in mu_range]
    ax.plot(mu_range, kl_vs_mu, 'steelblue', lw=2)
    ax.axhline(0, color='red', ls='--', alpha=0.5)
    ax.set_xlabel('μ (σ²=1)'); ax.set_ylabel('KL'); ax.set_title('KL vs Mean')

    ax = axes[1]
    sigma2_range = np.linspace(0.1, 5, 200)
    lv_range = np.log(sigma2_range)
    kl_vs_var = [kl_gaussian(np.array([0.0]), np.array([lv])) for lv in lv_range]
    ax.plot(sigma2_range, kl_vs_var, 'firebrick', lw=2)
    ax.axhline(0, color='blue', ls='--', alpha=0.5)
    ax.set_xlabel('σ² (μ=0)'); ax.set_ylabel('KL'); ax.set_title('KL vs Variance')
    plt.suptitle('VAE Regularisation: KL(N(μ,σ²)||N(0,1))', fontsize=13)
    plt.tight_layout(); plt.show()

print('\nTakeaway: The closed-form KL enables efficient ELBO optimisation in VAEs.')
print('The KL term is 0 only at the prior (mu=0, sigma^2=1), pushing the encoder')
print('toward a compact, unit-variance latent space while the reconstruction term')
print('pulls it toward informative encodings.')
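
When the closed form is used as a training regulariser, it is applied per example over a batch. A minimal optional sketch of that batched convention (assumed shapes (batch, D); reuses the check_close helper from the setup cell):

# Batched KL sketch (not part of the exercise).
import numpy as np

def kl_gaussian_batch(mu, log_var):
    # Per-example KL(N(mu, diag(exp(log_var))) || N(0, I)), returns shape (batch,)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1, axis=1)

mu_batch = np.array([[0.0, 0.0], [1.0, -0.5]])
lv_batch = np.array([[0.0, 0.0], [0.5, -0.3]])
kl_batch = kl_gaussian_batch(mu_batch, lv_batch)
print(f'Per-example KL: {kl_batch.round(4)}')   # first row is exactly the prior
check_close('KL = 0 for the prior row', kl_batch[0], 0.0)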

Exercise 9 ★★★ — Beta-Binomial Posterior Predictive

A classifier has been evaluated on $n = 40$ examples and got $k = 31$ correct. Put a $\text{Beta}(2,2)$ prior on its unknown accuracy $\theta$.

Part (a): Compute the posterior distribution for $\theta$.

Part (b): Compute the posterior mean and a 95% credible interval.

Part (c): Compute the posterior predictive probability that the next label is correct.

Part (d): Compare this Bayesian estimate with the raw empirical accuracy.

Code cell 29

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 30

# Solution
from scipy import stats

a0, b0 = 2, 2
n, k = 40, 31
apost = a0 + k
bpost = b0 + n - k
posterior = stats.beta(apost, bpost)
posterior_mean = apost / (apost + bpost)
ci = posterior.ppf([0.025, 0.975])
raw_acc = k / n

header('Exercise 9: Beta-Binomial Posterior Predictive')
print(f'Posterior: Beta({apost}, {bpost})')
print(f'Raw accuracy: {raw_acc:.4f}')
print(f'Posterior mean / predictive P(next correct): {posterior_mean:.4f}')
print(f'95% credible interval: [{ci[0]:.4f}, {ci[1]:.4f}]')
check_true('Posterior mean is shrunk toward prior mean 0.5', 0.5 < posterior_mean < raw_acc)
check_true('Credible interval contains posterior mean', ci[0] < posterior_mean < ci[1])
print('Takeaway: conjugacy turns counts into calibrated uncertainty, not just a point estimate.')
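
As an optional cross-check, the posterior predictive $P(\text{next correct}) = \mathbb{E}[\theta]$ can also be estimated by first sampling $\theta$ from the posterior and then sampling a Bernoulli outcome. The sketch below reuses posterior, posterior_mean, and check_close from the cells above.

# Optional Monte Carlo check of the posterior predictive.
import numpy as np
np.random.seed(42)

theta_samples = posterior.rvs(200000)                   # theta ~ Beta(apost, bpost)
next_correct = np.random.rand(200000) < theta_samples   # y ~ Bernoulli(theta)
p_next_mc = next_correct.mean()
print(f'MC posterior predictive: {p_next_mc:.4f}  (closed form: {posterior_mean:.4f})')
check_close('MC predictive matches E[theta]', p_next_mc, posterior_mean, tol=0.01)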

Exercise 10 ★★★ — Temperature Scaling as a Categorical Distribution Family

Let logits $\mathbf{z}$ define a categorical distribution $p_T = \text{softmax}(\mathbf{z}/T)$.

Part (a): Compute $p_T$ for several temperatures.

Part (b): Verify that entropy increases with temperature for the given logits.

Part (c): Compute the negative log-likelihood of a fixed target class under each $T$.

Part (d): Connect this to calibration and decoding in language models.

Code cell 32

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 33

# Solution
import numpy as np

def softmax_T(z, T):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def entropy_nat(p):
    return -np.sum(p * np.log(p + 1e-15))

z = np.array([2.2, 1.0, 0.3, -0.5])
target = 0
temps = np.array([0.5, 0.8, 1.0, 1.5, 2.0])
entropies = []
nlls = []

header('Exercise 10: Temperature Scaling Categorical Family')
for T in temps:
    p = softmax_T(z, T)
    H = entropy_nat(p)
    nll = -np.log(p[target])
    entropies.append(H)
    nlls.append(nll)
    print(f'T={T:.1f}: p={p.round(4)}, entropy={H:.4f}, target NLL={nll:.4f}')

check_true('Entropy is nondecreasing here as T rises', np.all(np.diff(entropies) > -1e-10))
check_true('Top-class NLL rises as distribution is softened', nlls[-1] > nlls[0])
print('Takeaway: temperature is not a new distribution; it is a reparameterized categorical that controls sharpness.')
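
To make the decoding connection in part (d) concrete, the optional sketch below samples "tokens" from $p_T$ at a low and a high temperature and compares how often the top class is drawn: lower T sharpens the distribution (greedier decoding), higher T flattens it (more diverse sampling). It reuses softmax_T, z, and check_true from the cells above.

# Decoding sketch (assumes softmax_T, z, and check_true from the cells above).
np.random.seed(42)
n_draws = 20000
top_class = int(np.argmax(z))
top_freq = {}
for T in [0.5, 2.0]:
    p = softmax_T(z, T)
    draws = np.random.choice(len(z), size=n_draws, p=p)
    top_freq[T] = np.mean(draws == top_class)
    print(f'T={T}: P(top class)={p[top_class]:.4f}, empirical frequency={top_freq[T]:.4f}')
check_true('Lower T samples the top class more often', top_freq[0.5] > top_freq[2.0])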

What to Review After Finishing

  • Binomial → Poisson approximation: when is the TVD small?
  • Beta distribution: mode formula and edge cases ($\alpha < 1$)
  • Gaussian moment formula: $\mathbb{E}[X^{2k}] = (2k-1)!!\,\sigma^{2k}$
  • MGF uniqueness and the product rule for independent sums
  • Exponential family: natural parameters, sufficient statistics, log-partition theorem
  • Dirichlet-Categorical conjugacy: counting update rule
  • Softmax gradient: derive $\partial \mathcal{L}/\partial z_k$ analytically (see the derivation sketch below)
  • VAE KL: closed-form derivation and connection to the ELBO
  • Beta-Binomial posterior predictive: can you explain the probabilistic calculation and the ML relevance?
  • Temperature-scaled categorical: can you connect the computation to model behaviour?
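
For the softmax-gradient review item, one way to set out the standard calculation (a worked sketch, not new material from the exercises): differentiating $\mathcal{L}(\mathbf{z}, y) = -z_y + \log\sum_k e^{z_k}$ with respect to $z_j$ gives

$$\frac{\partial \mathcal{L}}{\partial z_j} = -\mathbf{1}[j = y] + \frac{e^{z_j}}{\sum_k e^{z_k}} = \text{softmax}(\mathbf{z})_j - \mathbf{1}[j = y], \qquad \text{so} \qquad \nabla_{\mathbf{z}}\,\mathcal{L} = \mathbf{p} - \mathbf{e}_y.$$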
