Math for LLMs: Exercises Notebook
Probability Theory / Common Distributions

Converted from exercises.ipynb for web reading.

§6.2 Common Distributions — Exercises

10 exercises covering discrete and continuous distributions, MGFs, exponential family, conjugate priors, and ML applications (cross-entropy, VAE KL, language model sampling).

Format         Description
Problem        Markdown cell with task description
Your Solution  Code cell with scaffolding
Solution       Code cell with reference solution and checks

Difficulty Levels

Level  Exercises  Focus
★      1–3        PMFs, CDFs, simulation
★★     4–6        MGFs, exponential family, conjugate priors
★★★    7–10       ML: softmax cross-entropy, VAE KL divergence

Topic Map

Topic                               Exercise
Binomial and Poisson                1
Beta distribution                   2
Gaussian moments and CLT            3
Moment generating functions         4
Exponential family                  5
Dirichlet-Categorical conjugate     6
Softmax cross-entropy               7
VAE KL divergence                   8
Beta-Binomial posterior predictive  9
Temperature-scaled categorical      10

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
from scipy import stats
from scipy.special import gamma, gammaln
import math

try:
    import matplotlib.pyplot as plt
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.rcParams['figure.figsize'] = [10, 5]
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)

def header(title):
    print('\n' + '=' * len(title))
    print(title)
    print('=' * len(title))

def check_close(name, got, expected, tol=1e-8):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'}{name}")
    if not ok:
        print(f'  expected: {expected}')
        print(f'  got     : {got}')
    return ok

def check_true(name, cond):
    print(f"{'PASS' if cond else 'FAIL'}{name}")
    return cond

print('Setup complete.')

Exercise 1 ★ — Binomial and Poisson: Token Error Rates

A language model makes transcription errors with probability $p = 0.02$ per token. A document has $n = 500$ tokens.

Part (a): Model the number of errors $X \sim \text{Binomial}(500, 0.02)$. Compute $P(X = 0)$, $P(X \leq 5)$, $P(X > 15)$, $\mathbb{E}[X]$, and $\text{Var}(X)$.

Part (b): Approximate using $Y \sim \text{Poisson}(\lambda = np = 10)$. Compute the same quantities and compare.

Part (c): Compute the total variation distance $\text{TVD} = \frac{1}{2}\sum_k |P(X=k) - P(Y=k)|$.

Part (d): Simulate 50,000 documents. Empirically verify $P(X \leq 5)$.

Code cell 5

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 6

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

n, p = 500, 0.02
lam = n * p  # = 10

binom_rv = stats.binom(n, p)
pois_rv  = stats.poisson(lam)

# Part (a)
p_zero_b  = binom_rv.pmf(0)
p_le5_b   = binom_rv.cdf(5)
p_gt15_b  = binom_rv.sf(15)
e_x_b     = binom_rv.mean()
var_x_b   = binom_rv.var()

# Part (b)
p_zero_p  = pois_rv.pmf(0)
p_le5_p   = pois_rv.cdf(5)
p_gt15_p  = pois_rv.sf(15)

# Part (c): TVD
k_vals = np.arange(0, 50)
tvd = 0.5 * np.sum(np.abs(binom_rv.pmf(k_vals) - pois_rv.pmf(k_vals)))

# Part (d): Simulation
n_sim = 50000
errors = stats.binom.rvs(n, p, size=n_sim)
p_le5_sim = np.mean(errors <= 5)

header('Exercise 1: Binomial and Poisson Approximation')
print(f'Binomial:  P(X=0)={p_zero_b:.6f}, P(X<=5)={p_le5_b:.6f}, P(X>15)={p_gt15_b:.6f}')
print(f'Poisson:   P(Y=0)={p_zero_p:.6f}, P(Y<=5)={p_le5_p:.6f}, P(Y>15)={p_gt15_p:.6f}')
print(f'Binom E[X]={e_x_b:.4f}, Var={var_x_b:.4f}')
print(f'Pois  E[Y]={pois_rv.mean():.4f}, Var={pois_rv.var():.4f}')
check_close('Binom E[X] = np', e_x_b, n*p)
check_close('Binom Var = np(1-p)', var_x_b, n*p*(1-p))
print(f'TVD(Binom, Poisson) = {tvd:.6f}')
check_true('TVD < 0.01', tvd < 0.01)
print(f'Simulation P(X<=5) = {p_le5_sim:.4f}, Analytical = {p_le5_b:.4f}')
check_true('Simulation within 0.005', abs(p_le5_sim - p_le5_b) < 0.005)
print('\nTakeaway: For rare events (np fixed, n large), Poisson excellently approximates')
print('Binomial — the theoretical bound on TVD is min(p, np^2).')
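
As an optional sanity check of the bound quoted in the takeaway, the short sketch below compares the computed TVD against the two quantities it mentions (Le Cam's inequality gives $d_{TV} \leq np^2$, and a refined bound gives $d_{TV} \leq p$). It is illustrative only and reuses n, p, tvd, and check_true from the cells above.

# Optional check (assumes n, p, tvd, and check_true from the cells above).
le_cam_bound  = n * p**2   # Le Cam: d_TV <= n*p^2 = 0.2
refined_bound = p          # refined bound: d_TV <= p = 0.02
print(f'TVD = {tvd:.6f}, n*p^2 = {le_cam_bound:.4f}, p = {refined_bound:.4f}')
check_true('TVD <= n*p^2 (Le Cam)', tvd <= le_cam_bound)
check_true('TVD <= p (refined bound)', tvd <= refined_bound)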

Exercise 2 ★ — Beta Distribution: Model Confidence Calibration

A model's confidence score for a class is modelled as $\theta \sim \text{Beta}(\alpha, \beta)$.

Part (a): Compute the mean, variance, and mode for $(\alpha, \beta) \in \{(2,5), (5,2), (5,5), (0.5,0.5)\}$.

Part (b): For $\text{Beta}(3, 7)$, compute $P(\theta > 0.5)$ and $P(0.2 \leq \theta \leq 0.4)$.

Part (c): Show numerically that $\text{Beta}(1,1) = \text{Uniform}(0,1)$ by comparing CDFs at 10 test points.

Part (d): The effective sample size of a $\text{Beta}(\alpha, \beta)$ prior is $n_0 = \alpha + \beta$. After observing 20 successes out of 30 trials, starting from $\text{Beta}(2, 2)$, compute the posterior and its 95% credible interval.

Code cell 8

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 9

# Solution
import numpy as np
from scipy import stats

header('Exercise 2: Beta Distribution')

# Part (a)
configs = [(2, 5), (5, 2), (5, 5), (0.5, 0.5)]
print('Beta distribution statistics:')
for a, b in configs:
    rv = stats.beta(a, b)
    mean = rv.mean()
    var  = rv.var()
    mode = (a-1)/(a+b-2) if (a > 1 and b > 1) else 'boundary'
    print(f'  Beta({a},{b}): mean={mean:.4f}, var={var:.6f}, mode={mode}')

# Part (b)
a, b = 3, 7
rv = stats.beta(a, b)
p_gt_half  = rv.sf(0.5)
p_interval = rv.cdf(0.4) - rv.cdf(0.2)
print(f'\nBeta(3,7):')
print(f'  P(theta>0.5) = {p_gt_half:.6f}')
print(f'  P(0.2<=theta<=0.4) = {p_interval:.6f}')
check_true('P(theta>0.5) < 0.1', p_gt_half < 0.1)  # Beta(3,7) puts little mass above 0.5

# Part (c): Beta(1,1) = Uniform(0,1)
test_pts = np.linspace(0.1, 0.9, 10)
beta11_cdf = stats.beta(1, 1).cdf(test_pts)
uniform_cdf = stats.uniform.cdf(test_pts)
check_close('Beta(1,1) = Uniform CDF', beta11_cdf, uniform_cdf)

# Part (d): Bayesian update
alpha_prior, beta_prior = 2, 2
successes, failures = 20, 10
alpha_post = alpha_prior + successes
beta_post  = beta_prior  + failures
post_rv    = stats.beta(alpha_post, beta_post)
ci_low, ci_high = post_rv.ppf(0.025), post_rv.ppf(0.975)

print(f'\nBayesian Update:')
print(f'  Prior:     Beta({alpha_prior},{beta_prior}), mean={alpha_prior/(alpha_prior+beta_prior):.4f}')
print(f'  Data:      {successes}/{successes+failures} successes')
print(f'  Posterior: Beta({alpha_post},{beta_post}), mean={post_rv.mean():.4f}')
print(f'  MLE: {successes/(successes+failures):.4f}')
print(f'  95% Credible Interval: [{ci_low:.4f}, {ci_high:.4f}]')
check_true('Posterior mean > prior mean (data is informative)', post_rv.mean() > alpha_prior/(alpha_prior+beta_prior))
check_true('95% CI contains MLE', ci_low <= successes/(successes+failures) <= ci_high)
print('\nTakeaway: Beta is the conjugate prior for Binomial — posterior update is just')
print('adding observations to the prior hyperparameters (pseudo-counts).')
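
A small optional sketch of the pseudo-count view in the takeaway: applying the conjugate update one observation at a time gives the same posterior as the batch update above. It is illustrative only and reuses the check_true helper from the setup cell.

# Optional sketch: sequential conjugate updates = batch update (assumes check_true).
import numpy as np
np.random.seed(0)

a_seq, b_seq = 2, 2                    # Beta(2,2) prior
outcomes = np.array([1]*20 + [0]*10)   # 20 successes, 10 failures, in any order
np.random.shuffle(outcomes)
for x in outcomes:
    a_seq += x          # success adds a pseudo-count to alpha
    b_seq += 1 - x      # failure adds a pseudo-count to beta
print(f'Sequential posterior: Beta({a_seq},{b_seq})')
check_true('Sequential update = batch update Beta(22,12)', (a_seq, b_seq) == (22, 12))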

Exercise 3 ★ — Gaussian: Xavier Initialisation and the CLT

Part (a): For $X \sim \mathcal{N}(0, \sigma^2)$, compute $\mathbb{E}[X^2]$, $\mathbb{E}[X^4]$, and $\mathbb{E}[X^6]$ using the Gaussian moment formula $\mathbb{E}[X^{2k}] = (2k-1)!! \, \sigma^{2k}$, where $(2k-1)!! = 1 \cdot 3 \cdot 5 \cdots (2k-1)$.

Part (b): Xavier initialisation sets $\sigma^2 = 1/n_{\text{in}}$. For a layer with $n_{\text{in}} = 512$ inputs, compute the standard deviation of the output $z = \sum_{i=1}^{512} w_i x_i$, where $w_i \sim \mathcal{N}(0, \sigma_w^2)$ and $x_i \sim \mathcal{N}(0, 1)$ independently. Show that Xavier achieves $\text{Var}(z) = 1$.

Part (c): Demonstrate the CLT: if $X_i \sim \text{Uniform}(0,1)$ i.i.d., then $S_n = \frac{\bar{X}_n - 0.5}{1/\sqrt{12n}} \xrightarrow{d} \mathcal{N}(0,1)$. Verify via a KS test for $n = 1, 5, 30$.

Code cell 11

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 12

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

header('Exercise 3: Gaussian Moments and Xavier Init')

# Part (a): Gaussian even moments
sigma = 2.0
# E[X^{2k}] = (2k-1)!! * sigma^{2k}
E_X2 = sigma**2           # k=1: 1!! * sigma^2 = sigma^2
E_X4 = 3 * sigma**4       # k=2: 3!! * sigma^4 = 3*sigma^4
E_X6 = 15 * sigma**6      # k=3: 5!! * sigma^6 = 15*sigma^6

# Verify numerically
X = np.random.normal(0, sigma, 1000000)
print('Gaussian moments:')
check_close('E[X^2]', (X**2).mean(), E_X2, tol=0.1)
check_close('E[X^4]', (X**4).mean(), E_X4, tol=1.0)
check_close('E[X^6]', (X**6).mean(), E_X6, tol=20.0)

# Part (b): Xavier initialisation
n_in = 512
sigma_w = np.sqrt(1.0 / n_in)  # Xavier
# z = sum_i w_i * x_i; Var(w_i*x_i) = Var(w_i)*Var(x_i) = sigma_w^2 * 1
# Var(z) = n_in * sigma_w^2 = n_in * (1/n_in) = 1
var_z_theory = n_in * sigma_w**2
print(f'\nXavier init: sigma_w={sigma_w:.6f}')
print(f'Var(z) = n_in * sigma_w^2 = {var_z_theory:.6f}')
check_close('Xavier: Var(z) = 1', var_z_theory, 1.0)

# Verify with simulation
n_trials = 10000
W = np.random.normal(0, sigma_w, (n_trials, n_in))
X = np.random.normal(0, 1, (n_trials, n_in))
Z = (W * X).sum(axis=1)
print(f'Simulated: mean={Z.mean():.4f}, var={Z.var():.4f}')
check_close('Xavier simulated Var(z) ≈ 1', Z.var(), 1.0, tol=0.05)

# Part (c): CLT demonstration
print('\nCLT: Uniform -> Normal')
for n in [1, 5, 30]:
    samples = np.random.uniform(0, 1, (50000, n)).mean(axis=1)
    z_scores = (samples - 0.5) / (1 / np.sqrt(12*n))
    ks_stat, p_val = stats.kstest(z_scores, 'norm')
    print(f'  n={n:>2}: KS stat={ks_stat:.4f}, p={p_val:.4f}')
print('\nTakeaway: Xavier init preserves variance through layers — preventing vanishing/')
print('exploding activations. This relies on the fact that Var(wX) = Var(w)*Var(X) for')
print('independent w, X both centred at 0.')
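
To make the "preserves variance through layers" point concrete, here is a minimal optional sketch (not part of the exercise) that propagates activations through a stack of random linear layers, once with Xavier scaling and once with a scale three times too large. The layer width and depth are arbitrary illustration choices.

# Optional sketch: variance propagation through stacked linear layers.
import numpy as np
np.random.seed(0)

def propagate_variance(scale, n_layers=10, width=512, batch=200):
    h = np.random.normal(0, 1, (batch, width))
    for _ in range(n_layers):
        W = np.random.normal(0, scale, (width, width))
        h = h @ W                      # each layer multiplies Var by width*scale^2
    return h.var()

xavier_scale = np.sqrt(1.0 / 512)
print(f'After 10 layers, Xavier scale:    var = {propagate_variance(xavier_scale):.3f}')
print(f'After 10 layers, 3x Xavier scale: var = {propagate_variance(3 * xavier_scale):.3e}')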

Exercise 4 ★★ — Moment Generating Functions

Part (a): Compute the first four moments of $\text{Poisson}(\lambda = 3)$ using the MGF $M(t) = \exp(\lambda(e^t - 1))$. Recall that $M^{(k)}(0) = \mathbb{E}[X^k]$.

Part (b): Use the product rule for MGFs to show that if $X \sim \text{Exp}(\lambda)$ and $Y \sim \text{Exp}(\lambda)$ are independent, then $X + Y \sim \text{Gamma}(2, \lambda)$. Verify numerically via a KS test.

Part (c): Write a function empirical_mgf(samples, t) that estimates $M_X(t)$ from samples. Compare the empirical MGF of $\mathcal{N}(2, 9)$ against its theoretical value $\exp(2t + 9t^2/2)$ at $t \in \{0.1, 0.2, 0.3\}$.

Code cell 14

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 15

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

header('Exercise 4: Moment Generating Functions')

# Part (a): Poisson moments via numerical differentiation of MGF
lam = 3.0
def mgf_pois(t):
    return np.exp(lam * (np.exp(t) - 1))

# Numerical derivatives at t=0
eps = 1e-6
E_X1 = (mgf_pois(eps) - mgf_pois(-eps)) / (2*eps)
E_X2_num = (mgf_pois(eps) - 2*mgf_pois(0) + mgf_pois(-eps)) / eps**2

# Analytical: for Poisson(lambda)
# E[X]   = lambda
# E[X^2] = lambda^2 + lambda
# E[X^3] = lambda^3 + 3*lambda^2 + lambda
# E[X^4] = lambda^4 + 6*lambda^3 + 7*lambda^2 + lambda
E_X1_th = lam
E_X2_th = lam**2 + lam
E_X3_th = lam**3 + 3*lam**2 + lam
E_X4_th = lam**4 + 6*lam**3 + 7*lam**2 + lam

# Verify vs simulation
n_sim = 500000
X = stats.poisson(lam).rvs(n_sim)
print('Poisson moments (simulation vs theory):')
check_close('E[X]   = lam', X.mean(), E_X1_th, tol=0.05)
check_close('E[X^2]', (X**2).mean(), E_X2_th, tol=0.3)
check_close('E[X^3]', (X**3).mean(), E_X3_th, tol=1.0)
print('MGF derivatives at t=0 (numerical vs analytical):')
check_close("M'(0)  = E[X]", E_X1, E_X1_th, tol=1e-3)
check_close("M''(0) = E[X^2]", E_X2_num, E_X2_th, tol=1e-2)

# Part (b): Exp + Exp ~ Gamma(2, lambda) via MGF product rule
# M_Exp(t) = lambda/(lambda-t)
# M_{X+Y}(t) = [lambda/(lambda-t)]^2 = M_Gamma(2,lambda)(t)
lambda_e = 2.0
X_exp = stats.expon(scale=1/lambda_e).rvs(50000)
Y_exp = stats.expon(scale=1/lambda_e).rvs(50000)
Z = X_exp + Y_exp
gamma_rv = stats.gamma(2, scale=1/lambda_e)
ks_stat, p_val = stats.kstest(Z, gamma_rv.cdf)
print(f'\nExp+Exp ~ Gamma(2,lambda): KS p={p_val:.4f}')
check_true('Exp+Exp ~ Gamma(2,lambda) (p>0.05)', p_val > 0.05)

# Part (c): Empirical MGF
def empirical_mgf(samples, t):
    return np.mean(np.exp(t * samples))

mu_g, sigma_g = 2.0, 3.0
X_g = np.random.normal(mu_g, sigma_g, 200000)
print('\nEmpirical vs theoretical MGF for N(2,9):')
for t in [0.1, 0.2, 0.3]:
    emp    = empirical_mgf(X_g, t)
    theory = np.exp(mu_g*t + sigma_g**2*t**2/2)
    print(f'  t={t}: emp={emp:.4f}, theory={theory:.4f}, rel_err={abs(emp-theory)/theory:.4f}')
    check_close(f'MGF at t={t}', emp, theory, tol=0.02)
print('\nTakeaway: The MGF uniquely determines a distribution. Its derivatives at t=0 give')
print('all moments, and the product rule for independent sums is a powerful tool.')

Exercise 5 ★★ — Exponential Family: Bernoulli and Gaussian

Part (a): Write the $\text{Bernoulli}(p)$ distribution in canonical exponential family form:

$$p(x \mid \eta) = h(x)\exp(\eta T(x) - A(\eta))$$

Identify $\eta$, $T(x)$, $A(\eta)$, and $h(x)$.

Part (b): Verify the log-partition theorem for the Bernoulli: $\mathbb{E}[T(X)] = A'(\eta)$ and $\text{Var}(T(X)) = A''(\eta)$.

Part (c): Show that $A'(\eta) = \sigma(\eta) = 1/(1+e^{-\eta})$, where $\sigma$ is the sigmoid function. This connects maximum likelihood estimation to logistic regression.

Part (d): Write $\mathcal{N}(\mu, \sigma^2)$ in exponential family form with natural parameters $\eta_1 = \mu/\sigma^2$ and $\eta_2 = -1/(2\sigma^2)$. Verify that $\partial A/\partial \eta_1 = \mathbb{E}[X] = \mu$.

Code cell 17

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 18

# Solution
import numpy as np
from scipy import stats

header('Exercise 5: Exponential Family')

# Part (a): Bernoulli as exponential family
# p(x|p) = exp(x*log(p/(1-p)) + log(1-p))
# eta = log(p/(1-p))  [log-odds / logit]
# T(x) = x  [sufficient statistic]
# A(eta) = log(1 + e^eta)  [log-partition]
# h(x) = 1  [base measure]

p = 0.7
eta = np.log(p / (1 - p))  # logit(p)
A_eta = np.log(1 + np.exp(eta))
print(f'Bernoulli exponential family:')
print(f'  eta = logit(p) = {eta:.4f}')
print(f'  A(eta) = log(1+e^eta) = {A_eta:.4f}')
print(f'  A(eta) = -log(1-p) = {-np.log(1-p):.4f}')
check_close('A(eta) = -log(1-p)', A_eta, -np.log(1-p))

# Part (b)+(c): A'(eta) = sigmoid(eta) = p
def A_bernoulli(eta):
    return np.log(1 + np.exp(eta))

eps = 1e-7
dA = (A_bernoulli(eta+eps) - A_bernoulli(eta-eps)) / (2*eps)
sigmoid_eta = 1 / (1 + np.exp(-eta))
print(f'\nLog-partition theorem:')
print(f'  A prime(eta) = {dA:.6f}')
print(f'  sigmoid(eta) = {sigmoid_eta:.6f}')
print(f'  p            = {p:.6f}')
check_close('A prime(eta) = sigmoid(eta) = p', dA, p)

# Part (d): Gaussian exponential family
mu, sigma = 2.0, 1.5
eta1 = mu / sigma**2
eta2 = -1 / (2 * sigma**2)
def A_gaussian(eta1, eta2):
    return -eta1**2 / (4*eta2) - 0.5 * np.log(-2*eta2)

eps = 1e-7
dA_deta1 = (A_gaussian(eta1+eps, eta2) - A_gaussian(eta1-eps, eta2)) / (2*eps)
print(f'\nGaussian exponential family:')
print(f'  dA/d_eta1 = {dA_deta1:.6f}  (E[X] = mu = {mu:.6f})')
check_close('dA/d_eta1 = E[X] = mu', dA_deta1, mu)
print('\nTakeaway: The exponential family unifies all common distributions. The log-partition')
print('theorem connects MLE to moment matching — the basis of GLMs and natural gradients.')
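
To make the logistic-regression connection in part (c) explicit, here is an optional hedged sketch: for logits $\eta = \mathbf{w}^\top \mathbf{x}$, the Bernoulli negative log-likelihood written through $A(\eta)$ has gradient $X^\top(\sigma(\eta) - y)$, the familiar "prediction minus target" form. It is illustrative only (random toy data) and reuses the check_close helper from the setup cell.

# Optional sketch: A'(eta) = sigmoid(eta) gives the logistic-regression gradient.
import numpy as np
np.random.seed(0)

def nll_logistic(w, X, y):
    # NLL of Bernoulli labels with logits eta = X @ w, via A(eta) = log(1+e^eta):
    # NLL = sum_i [A(eta_i) - y_i * eta_i]
    eta = X @ w
    return np.sum(np.log1p(np.exp(eta)) - y * eta)

def grad_logistic(w, X, y):
    # d/dw sum_i [A(eta_i) - y_i eta_i] = X^T (sigmoid(eta) - y)
    eta = X @ w
    return X.T @ (1 / (1 + np.exp(-eta)) - y)

X = np.random.normal(size=(50, 3))
y = (np.random.rand(50) < 0.5).astype(float)
w = np.random.normal(size=3)

# Finite-difference check of the analytical gradient
eps = 1e-6
g_fd = np.array([(nll_logistic(w + eps*np.eye(3)[i], X, y) -
                  nll_logistic(w - eps*np.eye(3)[i], X, y)) / (2*eps) for i in range(3)])
check_close('Logistic NLL gradient = X^T (sigmoid(eta) - y)', grad_logistic(w, X, y), g_fd, tol=1e-4)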

Exercise 6 ★★ — Dirichlet-Categorical: Topic Model Simulation

Part (a): Starting from a $\text{Dir}(\boldsymbol{1}_K)$ prior over $K = 4$ topics, update the distribution after observing word counts $\mathbf{n} = [15, 8, 3, 4]$. Compute the posterior mean and the MAP estimate.

Part (b): Sample 1000 probability vectors from $\text{Dir}(\boldsymbol{\alpha})$ for $\boldsymbol{\alpha} = [1,1,1,1]$, $[0.1,0.1,0.1,0.1]$, and $[10,10,10,10]$. Report the mean entropy $H(\mathbf{p}) = -\sum_k p_k \log p_k$ across samples in each case.

Part (c): Implement a single-step LDA document update: given a document with word-topic assignments, update the topic distribution $\boldsymbol{\theta}$ using the Dirichlet conjugate update.

Part (d): Show that the posterior predictive probability of observing category $k$ next, given counts $\mathbf{n}$ from a $\text{Dir}(\boldsymbol{\alpha})$ prior, is $P(x_{n+1} = k \mid \mathbf{n}) = (\alpha_k + n_k) / (\alpha_0 + n)$.

Code cell 20

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 21

# Solution
import numpy as np
from scipy import stats
np.random.seed(42)

header('Exercise 6: Dirichlet-Categorical Conjugate Prior')

K = 4
alpha_prior = np.ones(K)
counts = np.array([15, 8, 3, 4])
n = counts.sum()

# Part (a)
alpha_post = alpha_prior + counts
alpha0_post = alpha_post.sum()
post_mean = alpha_post / alpha0_post
post_mode = (alpha_post - 1) / (alpha0_post - K)  # valid since alpha_k > 1

print('Dirichlet-Categorical update:')
print(f'  Prior: Dir({alpha_prior})')
print(f'  Counts: {counts}')
print(f'  Posterior: Dir({alpha_post})')
print(f'  Posterior mean: {post_mean.round(4)}')
print(f'  MAP estimate:   {post_mode.round(4)}')
print(f'  MLE (counts/n): {(counts/n).round(4)}')
check_close('Post mean sums to 1', post_mean.sum(), 1.0)
check_close('MAP sums to 1', post_mode.sum(), 1.0)

# Part (b): Entropy comparison
def entropy_samples(samples):
    p = samples + 1e-12
    return -np.sum(p * np.log(p), axis=1).mean()

max_entropy = np.log(K)
print('\nDirichlet entropy vs concentration:')
for alpha_scale in [0.1, 1.0, 10.0]:
    alpha = np.ones(K) * alpha_scale
    samples = np.random.dirichlet(alpha, 2000)
    mean_H = entropy_samples(samples)
    print(f'  alpha={alpha_scale}: mean H={mean_H:.4f} (max={max_entropy:.4f})')
check_true('Concentrated (alpha=10) has higher mean entropy than sparse (alpha=0.1)',
    entropy_samples(np.random.dirichlet(np.ones(K)*10, 2000)) >
    entropy_samples(np.random.dirichlet(np.ones(K)*0.1, 2000)))

# Part (d): Posterior predictive
# P(x_{n+1}=k | n) = (alpha_k + n_k) / (alpha_0 + n)
pred_prob = alpha_post / (alpha_prior.sum() + n)
print(f'\nPosterior predictive: {pred_prob.round(4)}')
check_close('Predictive probs sum to 1', pred_prob.sum(), 1.0)
print('\nTakeaway: Dirichlet-Categorical conjugacy is the backbone of LDA topic models.')
print('The posterior predictive with Laplace (alpha=1) smoothing avoids zero-probability')
print('tokens — crucial for language model generalisation.')
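
The reference solution above does not cover part (c), so here is one possible minimal sketch under the reading that the per-word topic assignments are given and fixed (a full collapsed Gibbs sampler would also resample the assignments). The assignment vector is hypothetical, and the sketch reuses the check_close helper from the setup cell.

# Part (c) sketch: conjugate update of a document's topic distribution theta.
import numpy as np

K = 4
alpha_doc = np.ones(K)                                      # symmetric Dir(1) prior
z_assignments = np.array([0, 0, 1, 0, 2, 1, 0, 3, 1, 0])    # hypothetical per-word topics
topic_counts = np.bincount(z_assignments, minlength=K)

alpha_doc_post = alpha_doc + topic_counts                   # Dirichlet conjugate update
theta_mean = alpha_doc_post / alpha_doc_post.sum()
print(f'Topic counts:         {topic_counts}')
print(f'Posterior Dir params: {alpha_doc_post}')
print(f'Posterior mean theta: {theta_mean.round(4)}')
check_close('theta sums to 1', theta_mean.sum(), 1.0)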

Exercise 7 ★★★ — Softmax Cross-Entropy: Loss and Gradient

The cross-entropy loss for multi-class classification is

$$\mathcal{L}(\mathbf{z}, y) = -\log \text{softmax}(\mathbf{z})_y = -z_y + \log\sum_k e^{z_k}$$

Part (a): Implement softmax(z) with numerical stability (subtract the max). Implement cross_entropy_loss(z, y), where $y$ is the true class index.

Part (b): Compute the gradient $\partial \mathcal{L} / \partial \mathbf{z}$. Show that it equals $\mathbf{p} - \mathbf{e}_y$, where $\mathbf{p} = \text{softmax}(\mathbf{z})$ and $\mathbf{e}_y$ is the one-hot vector for class $y$.

Part (c): Verify your gradient via finite differences.

Part (d): Show that label smoothing with parameter $\epsilon$ replaces the one-hot target with $\tilde{y}_k = (1-\epsilon)\,\mathbf{1}[k=y] + \epsilon/K$. Compute the smoothed loss for $\epsilon = 0.1$.

Code cell 23

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 24

# Solution
import numpy as np

header('Exercise 7: Softmax Cross-Entropy')

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(z, y):
    p = softmax(z)
    return -np.log(p[y] + 1e-12)

def ce_gradient(z, y):
    K = len(z)
    p = softmax(z)
    e_y = np.zeros(K)
    e_y[y] = 1.0
    return p - e_y  # gradient w.r.t. logits

z = np.array([2.0, 1.0, 0.5, -0.5, -1.0])
y = 0
p = softmax(z)
loss = cross_entropy_loss(z, y)
grad = ce_gradient(z, y)

print(f'softmax(z) = {p.round(4)}')
print(f'CE loss = {loss:.4f}')
print(f'gradient = {grad.round(4)}')
check_close('p sums to 1', p.sum(), 1.0)

# Verify gradient via finite differences
eps = 1e-7
grad_fd = np.zeros(len(z))
for i in range(len(z)):
    z_plus = z.copy(); z_plus[i] += eps
    z_minus = z.copy(); z_minus[i] -= eps
    grad_fd[i] = (cross_entropy_loss(z_plus, y) - cross_entropy_loss(z_minus, y)) / (2*eps)

print('\nGradient verification:')
check_close('Analytical gradient = finite diff', grad, grad_fd)

# Label smoothing
K = len(z)
eps_smooth = 0.1
y_smooth = np.ones(K) * eps_smooth / K
y_smooth[y] += 1 - eps_smooth

loss_smooth = -np.sum(y_smooth * np.log(p + 1e-12))
print(f'\nLabel smoothing (eps={eps_smooth}):')
print(f'  Hard CE loss:    {loss:.4f}')
print(f'  Smoothed CE loss: {loss_smooth:.4f}')
check_true('Label-smoothed loss > hard loss (entropy penalty)', loss_smooth > loss)
print('\nTakeaway: The CE gradient p - e_y has an elegant form: it pushes the model')
print('probability toward 1 for the correct class and toward 0 for others.')
print('Label smoothing prevents over-confident predictions (regularisation).')
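
In training code the loss and gradient are usually computed for a whole batch of logits at once. A vectorised variant along the lines of the single-example solution above might look like the following optional sketch (rows are examples; names are illustrative).

# Batched variant (optional sketch, not part of the exercise).
import numpy as np

def softmax_batch(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # row-wise stability shift
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def ce_loss_and_grad_batch(Z, y_idx):
    N = Z.shape[0]
    P = softmax_batch(Z)
    loss = -np.log(P[np.arange(N), y_idx] + 1e-12).mean()
    G = P.copy()
    G[np.arange(N), y_idx] -= 1.0             # p - e_y, row by row
    return loss, G / N                        # gradient of the mean loss

Z = np.array([[2.0, 1.0, 0.5, -0.5, -1.0],
              [0.2, 0.1, 1.5,  0.0, -2.0]])
y_idx = np.array([0, 2])
loss, G = ce_loss_and_grad_batch(Z, y_idx)
print(f'batched CE loss = {loss:.4f}')
print('gradient rows sum to 0:', np.allclose(G.sum(axis=1), 0.0))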

Exercise 8 ★★★ — VAE KL Divergence: Closed-Form Regulariser

In a Variational Autoencoder (VAE), the encoder outputs $\mu_\phi(\mathbf{x})$ and $\log\sigma^2_\phi(\mathbf{x})$, which parameterise $q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$. The prior is $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I)$.

Part (a): Derive and implement the closed-form KL divergence:

$$D_{\text{KL}}(q \,\|\, p) = \frac{1}{2}\sum_{d=1}^D \left(\mu_d^2 + \sigma_d^2 - \log\sigma_d^2 - 1\right)$$

Part (b): Verify that KL = 0 when $\mu = 0$ and $\sigma^2 = 1$.

Part (c): Compute the KL for $\mu = [1.0, -0.5, 2.0]$, $\log\sigma^2 = [0.0, -0.5, 0.5]$.

Part (d): Show numerically that this closed form matches the Monte Carlo estimate $\hat{D}_{\text{KL}} = \frac{1}{N}\sum_i [\log q(z_i) - \log p(z_i)]$ using $N = 100{,}000$ samples.

Part (e): Plot how the KL varies as $\mu$ ranges from $-3$ to $3$ (with $\sigma^2 = 1$) and as $\sigma^2$ ranges from $0.1$ to $5$ (with $\mu = 0$).

Code cell 26

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 27

# Solution
import numpy as np
from scipy import stats

header('Exercise 8: VAE KL Divergence')

def kl_gaussian(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), per-dimension then summed."""
    sigma2 = np.exp(log_var)
    return 0.5 * np.sum(mu**2 + sigma2 - log_var - 1)

# Part (b): KL = 0 at prior
mu_zero = np.zeros(3)
lv_zero = np.zeros(3)
kl_zero = kl_gaussian(mu_zero, lv_zero)
print(f'KL at mu=0, log_var=0: {kl_zero:.8f}')
check_close('KL = 0 at prior', kl_zero, 0.0)

# Part (c)
mu_c  = np.array([1.0, -0.5, 2.0])
lv_c  = np.array([0.0, -0.5,  0.5])
kl_c  = kl_gaussian(mu_c, lv_c)
print(f'KL = {kl_c:.4f}')
# Per-dimension contributions
per_d = 0.5 * (mu_c**2 + np.exp(lv_c) - lv_c - 1)
print(f'Per-dim contributions: {per_d.round(4)}')
check_close('Sum of per-dim = total KL', per_d.sum(), kl_c)

# Part (d): Monte Carlo verification
D = 3
N = 200000
mu_mc = np.array([1.0, 0.5, -0.5])
lv_mc = np.array([0.0, 0.5, -0.3])
sigma_mc = np.exp(0.5 * lv_mc)

# Sample from q
Z = np.random.normal(0, 1, (N, D)) * sigma_mc + mu_mc

# log q(z) - log p(z)
log_q = stats.norm.logpdf(Z, loc=mu_mc, scale=sigma_mc).sum(axis=1)
log_p = stats.norm.logpdf(Z, loc=0, scale=1).sum(axis=1)
kl_mc = (log_q - log_p).mean()
kl_closed = kl_gaussian(mu_mc, lv_mc)
print(f'\nKL Monte Carlo: {kl_mc:.4f}')
print(f'KL closed-form: {kl_closed:.4f}')
check_close('MC ≈ closed-form', kl_mc, kl_closed, tol=0.05)

try:
    import matplotlib.pyplot as plt
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

# Part (e): KL as function of mu and sigma^2
if HAS_MPL:
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    ax = axes[0]
    mu_range = np.linspace(-3, 3, 200)
    kl_vs_mu = [kl_gaussian(np.array([m]), np.array([0.0])) for m in mu_range]
    ax.plot(mu_range, kl_vs_mu, 'steelblue', lw=2)
    ax.axhline(0, color='red', ls='--', alpha=0.5)
    ax.set_xlabel('μ (σ²=1)'); ax.set_ylabel('KL'); ax.set_title('KL vs Mean')

    ax = axes[1]
    sigma2_range = np.linspace(0.1, 5, 200)
    lv_range = np.log(sigma2_range)
    kl_vs_var = [kl_gaussian(np.array([0.0]), np.array([lv])) for lv in lv_range]
    ax.plot(sigma2_range, kl_vs_var, 'firebrick', lw=2)
    ax.axhline(0, color='blue', ls='--', alpha=0.5)
    ax.set_xlabel('σ² (μ=0)'); ax.set_ylabel('KL'); ax.set_title('KL vs Variance')
    plt.suptitle('VAE Regularisation: KL(N(μ,σ²)||N(0,1))', fontsize=13)
    plt.tight_layout(); plt.show()

print('\nTakeaway: The closed-form KL enables efficient ELBO optimisation in VAEs.')
print('The KL term is 0 only at the prior (mu=0, sigma^2=1), pushing the encoder')
print('toward a compact, unit-variance latent space while the reconstruction term')
print('pulls it toward informative encodings.')
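
When the closed form is used as a training regulariser, it is applied per example over a batch. A minimal optional sketch of that batched convention (assumed shapes (batch, D); reuses the check_close helper from the setup cell):

# Batched KL sketch (not part of the exercise).
import numpy as np

def kl_gaussian_batch(mu, log_var):
    # Per-example KL(N(mu, diag(exp(log_var))) || N(0, I)), returns shape (batch,)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1, axis=1)

mu_batch = np.array([[0.0, 0.0], [1.0, -0.5]])
lv_batch = np.array([[0.0, 0.0], [0.5, -0.3]])
kl_batch = kl_gaussian_batch(mu_batch, lv_batch)
print(f'Per-example KL: {kl_batch.round(4)}')   # first row is exactly the prior
check_close('KL = 0 for the prior row', kl_batch[0], 0.0)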

Exercise 9 ★★★ — Beta-Binomial Posterior Predictive

A classifier has been evaluated on $n = 40$ examples and got $k = 31$ correct. Put a $\text{Beta}(2,2)$ prior on its unknown accuracy $\theta$.

Part (a): Compute the posterior distribution for $\theta$.

Part (b): Compute the posterior mean and a 95% credible interval.

Part (c): Compute the posterior predictive probability that the next label is correct.

Part (d): Compare this Bayesian estimate with the raw empirical accuracy.

Code cell 29

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 30

# Solution
from scipy import stats

a0, b0 = 2, 2
n, k = 40, 31
apost = a0 + k
bpost = b0 + n - k
posterior = stats.beta(apost, bpost)
posterior_mean = apost / (apost + bpost)
ci = posterior.ppf([0.025, 0.975])
raw_acc = k / n

header('Exercise 9: Beta-Binomial Posterior Predictive')
print(f'Posterior: Beta({apost}, {bpost})')
print(f'Raw accuracy: {raw_acc:.4f}')
print(f'Posterior mean / predictive P(next correct): {posterior_mean:.4f}')
print(f'95% credible interval: [{ci[0]:.4f}, {ci[1]:.4f}]')
check_true('Posterior mean is shrunk toward prior mean 0.5', 0.5 < posterior_mean < raw_acc)
check_true('Credible interval contains posterior mean', ci[0] < posterior_mean < ci[1])
print('Takeaway: conjugacy turns counts into calibrated uncertainty, not just a point estimate.')
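
As an optional cross-check, the posterior predictive $P(\text{next correct}) = \mathbb{E}[\theta]$ can also be estimated by first sampling $\theta$ from the posterior and then sampling a Bernoulli outcome. The sketch below reuses posterior, posterior_mean, and check_close from the cells above.

# Optional Monte Carlo check of the posterior predictive.
import numpy as np
np.random.seed(42)

theta_samples = posterior.rvs(200000)                   # theta ~ Beta(apost, bpost)
next_correct = np.random.rand(200000) < theta_samples   # y ~ Bernoulli(theta)
p_next_mc = next_correct.mean()
print(f'MC posterior predictive: {p_next_mc:.4f}  (closed form: {posterior_mean:.4f})')
check_close('MC predictive matches E[theta]', p_next_mc, posterior_mean, tol=0.01)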

Exercise 10 ★★★ — Temperature Scaling as a Categorical Distribution Family

Let logits $\mathbf{z}$ define a categorical distribution $p_T = \text{softmax}(\mathbf{z}/T)$.

Part (a): Compute $p_T$ for several temperatures.

Part (b): Verify that entropy increases with temperature for the given logits.

Part (c): Compute the negative log-likelihood of a fixed target class under each $T$.

Part (d): Connect this to calibration and decoding in language models.

Code cell 32

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 33

# Solution
import numpy as np

def softmax_T(z, T):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def entropy_nat(p):
    return -np.sum(p * np.log(p + 1e-15))

z = np.array([2.2, 1.0, 0.3, -0.5])
target = 0
temps = np.array([0.5, 0.8, 1.0, 1.5, 2.0])
entropies = []
nlls = []

header('Exercise 10: Temperature Scaling Categorical Family')
for T in temps:
    p = softmax_T(z, T)
    H = entropy_nat(p)
    nll = -np.log(p[target])
    entropies.append(H)
    nlls.append(nll)
    print(f'T={T:.1f}: p={p.round(4)}, entropy={H:.4f}, target NLL={nll:.4f}')

check_true('Entropy is nondecreasing here as T rises', np.all(np.diff(entropies) > -1e-10))
check_true('Top-class NLL rises as distribution is softened', nlls[-1] > nlls[0])
print('Takeaway: temperature is not a new distribution; it is a reparameterized categorical that controls sharpness.')
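
To make the decoding connection in part (d) concrete, the optional sketch below samples "tokens" from $p_T$ at a low and a high temperature and compares how often the top class is drawn: lower T sharpens the distribution (greedier decoding), higher T flattens it (more diverse sampling). It reuses softmax_T, z, and check_true from the cells above.

# Decoding sketch (assumes softmax_T, z, and check_true from the cells above).
np.random.seed(42)
n_draws = 20000
top_class = int(np.argmax(z))
top_freq = {}
for T in [0.5, 2.0]:
    p = softmax_T(z, T)
    draws = np.random.choice(len(z), size=n_draws, p=p)
    top_freq[T] = np.mean(draws == top_class)
    print(f'T={T}: P(top class)={p[top_class]:.4f}, empirical frequency={top_freq[T]:.4f}')
check_true('Lower T samples the top class more often', top_freq[0.5] > top_freq[2.0])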

What to Review After Finishing

  • Binomial → Poisson approximation: when is the TVD small?
  • Beta distribution: mode formula and edge cases ($\alpha < 1$)
  • Gaussian moment formula: $\mathbb{E}[X^{2k}] = (2k-1)!!\,\sigma^{2k}$
  • MGF uniqueness and the product rule for independent sums
  • Exponential family: natural parameters, sufficient statistics, log-partition theorem
  • Dirichlet-Categorical conjugacy: counting update rule
  • Softmax gradient: derive $\partial \mathcal{L}/\partial z_k$ analytically (see the derivation sketch below)
  • VAE KL: closed-form derivation and connection to the ELBO
  • Beta-Binomial posterior predictive: can you explain the probabilistic calculation and the ML relevance?
  • Temperature-scaled categorical: can you connect the computation to model behaviour?
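
For the softmax-gradient review item, one way to set out the standard calculation (a worked sketch, not new material from the exercises): differentiating $\mathcal{L}(\mathbf{z}, y) = -z_y + \log\sum_k e^{z_k}$ with respect to $z_j$ gives

$$\frac{\partial \mathcal{L}}{\partial z_j} = -\mathbf{1}[j = y] + \frac{e^{z_j}}{\sum_k e^{z_k}} = \text{softmax}(\mathbf{z})_j - \mathbf{1}[j = y], \qquad \text{so} \qquad \nabla_{\mathbf{z}}\,\mathcal{L} = \mathbf{p} - \mathbf{e}_y.$$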
