Estimation Theory — Exercises

Converted from exercises.ipynb for web reading.

10 exercises covering the core results of estimation theory, from basic MLE and bias proofs through Fisher information for neural network layers and scaling law estimation.

Each exercise uses three cell types:

  • Problem — Markdown cell with task description
  • Your Solution — Code cell with scaffolding (# YOUR CODE HERE)
  • Solution — Code cell with reference solution and checks

Difficulty Levels

  • ★ (Exercises 1–3) — Core mechanics: MLE, bias, CRB
  • ★★ (Exercises 4–6) — Deeper theory: MoM, bootstrap, asymptotic normality
  • ★★★ (Exercises 7–10) — AI applications: FIM for neural layers, scaling laws

Topic Map

  • MLE derivation — Exercises 1, 3
  • Bias/variance/MSE — Exercise 2
  • Cramér-Rao bound — Exercise 3
  • Method of moments — Exercise 4
  • Bootstrap CI — Exercise 5
  • Asymptotic normality — Exercise 6
  • Fisher information (neural networks) — Exercise 7
  • Scaling law MLE — Exercise 8

Two additional applied exercises round out the set: Exercise 9 (bootstrap CI for model accuracy) and Exercise 10 (delta method for log-odds).

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import numpy.linalg as la
from scipy import stats, optimize, special

try:
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    plt.style.use('seaborn-v0_8-whitegrid')
    mpl.rcParams.update({'figure.figsize': (10, 6), 'font.size': 12,
                         'axes.titlesize': 14, 'axes.labelsize': 12})
    COLORS = {'primary':'#0077BB','secondary':'#EE7733','tertiary':'#009988',
              'error':'#CC3311','neutral':'#555555','highlight':'#EE3377'}
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)

def header(title):
    print('\n' + '=' * len(title))
    print(title)
    print('=' * len(title))

def check_close(name, got, expected, tol=1e-4):
    got = np.atleast_1d(np.asarray(got, dtype=float))
    expected = np.atleast_1d(np.asarray(expected, dtype=float))
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'}{name}")
    if not ok:
        print(f'  expected: {expected}')
        print(f'  got:      {got}')
    return ok

def check_true(name, cond):
    print(f"{'PASS' if cond else 'FAIL'}{name}")
    return cond

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

print('Setup complete.')

Exercise 1 ★ — MLE for Bernoulli and Poisson

Background: The sample proportion is the MLE for the Bernoulli parameter $p$. The sample mean is the MLE for the Poisson rate $\lambda$.
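
For reference, both results follow from solving the score equation; with $S$ denoting the number of heads (Bernoulli) or the sum of the counts (Poisson):

$$\ell(p) = S\log p + (n-S)\log(1-p), \qquad \ell'(p) = \frac{S}{p} - \frac{n-S}{1-p} = 0 \;\Rightarrow\; \hat{p}_{\text{MLE}} = \frac{S}{n},$$

$$\ell(\lambda) = S\log\lambda - n\lambda - \sum_i \log x_i!, \qquad \ell'(\lambda) = \frac{S}{\lambda} - n = 0 \;\Rightarrow\; \hat{\lambda}_{\text{MLE}} = \frac{S}{n} = \bar{x}.$$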

Tasks:

(a) Given 10 coin flips with 6 heads, compute $\hat{p}_{\text{MLE}}$, its standard error $\text{SE} = \sqrt{\hat{p}(1-\hat{p})/n}$, and the 95% Wald CI.

(b) Given Poisson data $\mathbf{x} = [3, 1, 4, 1, 5, 9, 2, 6]$, compute $\hat{\lambda}_{\text{MLE}}$ by solving the score equation $\ell'(\lambda) = S/\lambda - n = 0$.

(c) Verify that the score evaluated at the MLE equals zero and the second derivative is negative (confirming a maximum).

Code cell 5

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 6

# Solution
x_bern = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
n_b = len(x_bern)
S_b = x_bern.sum()

# (a) Bernoulli MLE
p_hat = S_b / n_b
se_p = np.sqrt(p_hat * (1 - p_hat) / n_b)
z95 = stats.norm.ppf(0.975)
ci_lo = p_hat - z95 * se_p
ci_hi = p_hat + z95 * se_p

# (b) Poisson MLE
x_pois = np.array([3, 1, 4, 1, 5, 9, 2, 6])
n_p = len(x_pois)
S_p = x_pois.sum()
lambda_hat = S_p / n_p

# (c) Score at MLE
score_at_mle = S_p / lambda_hat - n_p
d2l_at_mle = -S_p / lambda_hat**2

header('Exercise 1: MLE for Bernoulli and Poisson')
print(f'(a) Bernoulli: p_hat={p_hat:.4f}, SE={se_p:.4f}')
print(f'    95% Wald CI: [{ci_lo:.4f}, {ci_hi:.4f}]')
check_close('MLE = S/n', p_hat, 0.6)
check_close('SE formula', se_p, np.sqrt(0.6*0.4/10))
print()
print(f'(b) Poisson: lambda_hat = {lambda_hat:.4f}')
check_close('Poisson MLE = xbar', lambda_hat, x_pois.mean())
print()
print(f'(c) Score at MLE = {score_at_mle:.10f}')
print(f'    d^2ell/dlambda^2 = {d2l_at_mle:.4f} (negative = maximum)')
check_true('Score equals zero at MLE', np.isclose(score_at_mle, 0, atol=1e-10))
check_true('Second derivative is negative (maximum)', d2l_at_mle < 0)
print('\nTakeaway: MLE score equation has a unique solution for exponential family distributions')

Exercise 2 ★ — Bias of the MLE Variance Estimator

Background: The MLE of the Gaussian variance uses denominator $n$, not $n-1$, giving a biased estimator with bias $-\sigma^2/n$.
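
The bias value quoted above follows from the standard expectation of the sum of squared deviations:

$$\mathbb{E}\bigl[\hat{\sigma}^2_{\text{MLE}}\bigr] = \frac{1}{n}\,\mathbb{E}\Bigl[\sum_{i=1}^{n}(x_i - \bar{x})^2\Bigr] = \frac{n-1}{n}\,\sigma^2, \qquad \text{Bias} = \mathbb{E}\bigl[\hat{\sigma}^2_{\text{MLE}}\bigr] - \sigma^2 = -\frac{\sigma^2}{n}.$$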

Tasks:

(a) For $n = 10$ and $\sigma^2 = 4$, compute the theoretical bias of $\hat{\sigma}^2_{\text{MLE}}$.

(b) Simulate $M = 5000$ datasets of size $n = 10$ from $\mathcal{N}(0, 4)$ and verify that the empirical bias matches the theoretical value.

(c) Show that $\hat{\sigma}_{\text{MLE}} = \sqrt{\hat{\sigma}^2_{\text{MLE}}}$ is a negatively biased estimator of $\sigma$. Verify using simulation.

(d) By MLE invariance, why is $\hat{\sigma}_{\text{MLE}}$ the MLE of $\sigma$, even though it is biased?

Code cell 8

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 9

# Solution
sigma_true = 2.0
sigma2_true = sigma_true**2
n = 10
M = 5000

# (a) Theoretical bias
bias_theory = -sigma2_true / n

# (b) Simulate
np.random.seed(42)
data = np.random.normal(0, sigma_true, size=(M, n))
xbar = data.mean(axis=1, keepdims=True)
sigma2_mle = ((data - xbar)**2).mean(axis=1)
sigma2_unb = ((data - xbar)**2).sum(axis=1) / (n-1)

bias_empirical = sigma2_mle.mean() - sigma2_true

# (c) Bias of sigma_MLE
sigma_mle = np.sqrt(sigma2_mle)
bias_sigma = sigma_mle.mean() - sigma_true

header('Exercise 2: Bias of MLE Variance Estimator')
print(f'(a) Theoretical bias = {bias_theory:.6f}')
print(f'(b) Empirical bias   = {bias_empirical:.6f}')
check_close('empirical bias ≈ theoretical', bias_empirical, bias_theory, tol=0.05)
print(f'    E[sigma2_MLE]  = {sigma2_mle.mean():.4f} (true {sigma2_true})')
print(f'    E[sigma2_unb]  = {sigma2_unb.mean():.4f} (should be ≈ {sigma2_true})')
check_close('unbiased estimator centred at truth', sigma2_unb.mean(), sigma2_true, tol=0.05)
print(f'\n(c) E[sigma_MLE] = {sigma_mle.mean():.4f}, true sigma = {sigma_true}')
print(f'    Bias of sigma_MLE = {bias_sigma:.6f}')
check_true('sigma_MLE is negatively biased (Jensen: E[sqrt(X)] < sqrt(E[X]))', bias_sigma < 0)
print(f'\n(d) By MLE invariance: g(hat_theta) = hat_g(theta) for any g.')
print(f'    sigma = g(sigma^2) = sqrt(sigma^2), so MLE of sigma = sqrt(MLE of sigma^2).')
print(f'    This holds even though sqrt is nonlinear and introduces bias.')
print('\nTakeaway: MLE invariance gives the MLE of any smooth function, but bias is not invariant')

Exercise 3 ★ — Cramér-Rao Bound Verification for Exponential Distribution

Background: For XExp(λ)X \sim \operatorname{Exp}(\lambda), p(x;λ)=λeλxp(x;\lambda) = \lambda e^{-\lambda x}, the Fisher information is I(λ)=1/λ2\mathcal{I}(\lambda) = 1/\lambda^2 and the CRB is λ2/n\lambda^2/n.
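
For reference, the one-observation calculations behind these statements:

$$\log p(x;\lambda) = \log\lambda - \lambda x, \qquad s(\lambda;x) = \frac{\partial \log p}{\partial \lambda} = \frac{1}{\lambda} - x, \qquad \mathcal{I}(\lambda) = -\mathbb{E}\!\left[\frac{\partial^2 \log p}{\partial \lambda^2}\right] = \frac{1}{\lambda^2},$$

so for an i.i.d. sample of size $n$ the Cramér-Rao bound for unbiased estimators of $\lambda$ is $1/\bigl(n\,\mathcal{I}(\lambda)\bigr) = \lambda^2/n$.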

Tasks:

(a) Derive the score $s(\lambda;x) = \partial \log p / \partial \lambda$ and verify $\mathbb{E}[s] = 0$ numerically.

(b) Compute $\mathcal{I}(\lambda)$ using both the variance-of-score formula and the negative-expected-Hessian formula, and verify they match.

(c) Show that the MLE $\hat{\lambda} = 1/\bar{x}$ attains the CRB (asymptotically): $\operatorname{Var}(\hat{\lambda}) \approx \lambda^2/n$.

Code cell 11

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 12

# Solution
lam_true = 2.5
n = 100
M = 10000
np.random.seed(42)

data_exp = np.random.exponential(1/lam_true, size=(M, n))

# (a) Score s(lambda; x) = 1/lambda - x
score_values = 1/lam_true - data_exp[:, 0]
mean_score = score_values.mean()

# (b) Fisher information
I_from_var = score_values.var()           # Var(score) = I(lambda)
I_from_hessian = 1 / lam_true**2         # -E[d^2 log p / dlambda^2] = 1/lambda^2
I_theoretical = 1 / lam_true**2

# (c) MLE achieves CRB
lambda_hats = 1.0 / data_exp.mean(axis=1)
var_mle = lambda_hats.var()
crb = lam_true**2 / n

header('Exercise 3: CRB Verification for Exponential Distribution')
print(f'(a) E[score] = {mean_score:.6f} (expected 0)')
check_true('Score has zero mean', np.isclose(mean_score, 0, atol=0.05))
print()
print(f'(b) I(lambda) via Var(score)    = {I_from_var:.5f}')
print(f'    I(lambda) via -E[hessian]   = {I_from_hessian:.5f}')
print(f'    I(lambda) theoretical       = {I_theoretical:.5f}')
check_close('Two FIM formulas agree', I_from_var, I_from_hessian, tol=0.05)
print()
print(f'(c) Var(lambda_hat) = {var_mle:.7f}')
print(f'    CRB = lambda^2/n = {crb:.7f}')
ratio = var_mle / crb
print(f'    Ratio Var/CRB = {ratio:.4f} (should be ≈ 1.0)')
check_close('MLE achieves CRB', var_mle, crb, tol=crb*0.05)
print('\nTakeaway: Exponential MLE is efficient -- it achieves the Cramér-Rao lower bound')

Exercise 4 ★★ — Method of Moments for the Beta Distribution

Background: The Beta distribution $\operatorname{Beta}(\alpha, \beta)$ has:

  • Mean: $\mu = \alpha/(\alpha+\beta)$
  • Variance: $v = \alpha\beta/\bigl[(\alpha+\beta)^2(\alpha+\beta+1)\bigr]$
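
Equating these to the sample moments $\bar{x}$ and $s^2$ and solving for the parameters is the derivation that task (a) asks for; a brief sketch, using $\mu(1-\mu) = \alpha\beta/(\alpha+\beta)^2$:

$$v = \frac{\mu(1-\mu)}{\alpha+\beta+1} \;\Rightarrow\; \alpha+\beta = \frac{\mu(1-\mu)}{v} - 1, \qquad \alpha = \mu(\alpha+\beta), \qquad \beta = (1-\mu)(\alpha+\beta),$$

and substituting $\bar{x}$ for $\mu$ and $s^2$ for $v$ gives the estimators quoted below.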

Tasks:

(a) Derive the MoM estimators: $\hat{\alpha} = \bar{x}\bigl(\bar{x}(1-\bar{x})/s^2 - 1\bigr)$ and $\hat{\beta} = (1-\bar{x})\bigl(\bar{x}(1-\bar{x})/s^2 - 1\bigr)$.

(b) Generate $n = 300$ samples from $\operatorname{Beta}(3, 7)$ and compute $\hat{\alpha}_{\text{MoM}}$ and $\hat{\beta}_{\text{MoM}}$.

(c) Compare to the MLE (use scipy.stats.beta.fit) and report which has smaller MSE.

(d) Verify the constraint: MoM estimates require $s^2 < \bar{x}(1-\bar{x})$. What does this mean?

Code cell 14

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 15

# Solution
np.random.seed(42)
alpha_true, beta_true = 3.0, 7.0
n = 300
x_beta = np.random.beta(alpha_true, beta_true, n)

xbar = x_beta.mean()
s2 = x_beta.var()

# (a)-(b) MoM
common = xbar * (1 - xbar) / s2 - 1
alpha_mom = xbar * common
beta_mom  = (1 - xbar) * common

# (c) MLE via scipy
fit_result = stats.beta.fit(x_beta, floc=0, fscale=1)
alpha_mle = fit_result[0]
beta_mle = fit_result[1]

# MSE for alpha
M_comp = 1000
a_mom_sim, a_mle_sim = [], []
for _ in range(M_comp):
    x = np.random.beta(alpha_true, beta_true, n)
    xb, s2_ = x.mean(), x.var()
    c_ = xb*(1-xb)/s2_ - 1
    a_mom_sim.append(xb * c_)
    fr = stats.beta.fit(x, floc=0, fscale=1)
    a_mle_sim.append(fr[0])

mse_mom = np.mean((np.array(a_mom_sim) - alpha_true)**2)
mse_mle = np.mean((np.array(a_mle_sim) - alpha_true)**2)

# (d)
constraint_satisfied = s2 < xbar * (1 - xbar)

header('Exercise 4: Method of Moments for Beta Distribution')
print(f'True: alpha={alpha_true}, beta={beta_true}')
print(f'MoM:  alpha={alpha_mom:.4f}, beta={beta_mom:.4f}')
print(f'MLE:  alpha={alpha_mle:.4f}, beta={beta_mle:.4f}')
check_close('MoM alpha within 20% of truth', alpha_mom, alpha_true, tol=0.6)
print()
print(f'MSE (alpha) — MoM: {mse_mom:.5f}, MLE: {mse_mle:.5f}')
check_true('MLE more efficient than MoM', mse_mle < mse_mom)
print(f'Relative efficiency MLE/MoM: {mse_mom/mse_mle:.2f}')
print()
print(f'(d) Constraint: s2={s2:.6f} < xbar*(1-xbar)={xbar*(1-xbar):.6f}')
check_true('Constraint satisfied', constraint_satisfied)
print('    Meaning: sample variance must be less than max possible variance (at p=0.5).')
print('    If s2 >= xbar*(1-xbar), data are more dispersed than Beta can explain -- MoM fails.')
print('\nTakeaway: MLE is more efficient than MoM for Beta; both are consistent')

Exercise 5 ★★ — Bootstrap Confidence Interval

Background: The bootstrap resamples from the empirical distribution to estimate the sampling distribution of any statistic.

Tasks:

(a) Given the dataset x = [3.1, 1.4, 2.5, 4.2, 1.8, 3.7, 2.1, 4.8, 1.2, 3.3], compute the sample median.

(b) Implement the non-parametric bootstrap with $B = 2000$ resamples to construct a 95% percentile CI for the median.

(c) Compare this to the asymptotic CI. For a Gaussian distribution, $\operatorname{Var}(\text{median}) \approx \pi\sigma^2/(2n)$ (a derivation sketch follows this task list). Use $\hat{\sigma} = s$ (sample std) to compute the asymptotic CI and compare widths.

(d) Check coverage: repeat the experiment 1000 times (simulate from $\mathcal{N}(2.7,\, 1.2^2)$) and compute empirical coverage for both CI types.
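
The formula in task (c) is the general asymptotic variance of a sample quantile, $\operatorname{Var}(\hat{m}) \approx 1/\bigl(4 n f(m)^2\bigr)$ with $f$ the density at the true median $m$, specialised to the Gaussian case:

$$f(m) = \frac{1}{\sigma\sqrt{2\pi}} \ \text{at } m = \mu \;\Rightarrow\; \operatorname{Var}(\hat{m}) \approx \frac{2\pi\sigma^2}{4n} = \frac{\pi\sigma^2}{2n}.$$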

Code cell 17

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 18

# Solution
x = np.array([3.1, 1.4, 2.5, 4.2, 1.8, 3.7, 2.1, 4.8, 1.2, 3.3])
n = len(x)

# (a)
median_obs = np.median(x)

# (b) Bootstrap
B = 2000
np.random.seed(42)
boot_medians = np.array([np.median(x[np.random.randint(0, n, n)]) for _ in range(B)])
ci_boot_lo = np.percentile(boot_medians, 2.5)
ci_boot_hi = np.percentile(boot_medians, 97.5)

# (c) Asymptotic
sigma_hat = x.std(ddof=1)
var_median_asym = np.pi * sigma_hat**2 / (2 * n)
se_median = np.sqrt(var_median_asym)
z95 = 1.96
ci_asym_lo = median_obs - z95 * se_median
ci_asym_hi = median_obs + z95 * se_median

# (d) Coverage simulation
mu_sim = 2.7
sigma_sim = 1.2
M_cov = 1000
true_median = mu_sim  # median = mean for Gaussian
boot_covers, asym_covers = [], []

for _ in range(M_cov):
    x_sim = np.random.normal(mu_sim, sigma_sim, n)
    m_sim = np.median(x_sim)
    # Bootstrap CI
    bm = np.array([np.median(x_sim[np.random.randint(0, n, n)]) for _ in range(500)])
    bl, bh = np.percentile(bm, [2.5, 97.5])
    boot_covers.append(bl <= true_median <= bh)
    # Asymptotic CI
    s = x_sim.std(ddof=1)
    se = np.sqrt(np.pi * s**2 / (2*n))
    al, ah = m_sim - 1.96*se, m_sim + 1.96*se
    asym_covers.append(al <= true_median <= ah)

from scipy import stats as st
header('Exercise 5: Bootstrap Confidence Interval')
print(f'(a) Sample median = {median_obs:.4f}')
print(f'(b) Bootstrap 95% CI: [{ci_boot_lo:.4f}, {ci_boot_hi:.4f}]  width={ci_boot_hi-ci_boot_lo:.4f}')
print(f'(c) Asymptotic 95% CI: [{ci_asym_lo:.4f}, {ci_asym_hi:.4f}]  width={ci_asym_hi-ci_asym_lo:.4f}')
print(f'(d) Coverage (n=10, M=1000): bootstrap={np.mean(boot_covers):.3f}, asymptotic={np.mean(asym_covers):.3f}')
check_true('Bootstrap CI has reasonable coverage', abs(np.mean(boot_covers) - 0.95) < 0.05)
print('\nTakeaway: Bootstrap captures asymmetry in sampling distribution; better for small n')

Exercise 6 ★★ — Asymptotic Normality Simulation

Background: The MLE satisfies $\sqrt{n}\,(\hat{\theta}_{\text{MLE}} - \theta^*) \xrightarrow{d} \mathcal{N}\bigl(0,\ \mathcal{I}(\theta^*)^{-1}\bigr)$.
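
For the Bernoulli model this specialises to the standardisation used in task (a) below:

$$\mathcal{I}(p) = \frac{1}{p(1-p)} \;\Rightarrow\; \sqrt{n}\,(\hat{p} - p^*) \xrightarrow{d} \mathcal{N}\bigl(0,\ p^*(1-p^*)\bigr), \qquad Z_n = \frac{\sqrt{n}\,(\hat{p} - p^*)}{\sqrt{p^*(1-p^*)}} \xrightarrow{d} \mathcal{N}(0,1).$$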

Tasks:

(a) For $p^* = 0.3$, simulate $M = 5000$ Bernoulli MLEs at each $n \in \{10, 30, 100, 500\}$ and standardise: $Z_n = \sqrt{n}\,(\hat{p} - 0.3)/\sqrt{0.3 \times 0.7}$.

(b) At each $n$, perform a Kolmogorov–Smirnov test against $\mathcal{N}(0,1)$ and report the p-value.

(c) Compute the empirical coverage of the 95% Wald CI at each $n$. When does it first reach [94%, 96%]?

(d) Plot the four distributions and overlay $\mathcal{N}(0,1)$.

Code cell 20

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 21

# Solution
p_true = 0.3
M = 5000
n_values = [10, 30, 100, 500]
np.random.seed(42)

asym_std = np.sqrt(p_true * (1 - p_true))  # 1/sqrt(I) for n=1
z95 = 1.96

results = {}
for n in n_values:
    data = np.random.binomial(1, p_true, size=(M, n))
    p_hats = data.mean(axis=1)
    z_scores = (p_hats - p_true) / (asym_std / np.sqrt(n))
    ks_stat, ks_pval = stats.kstest(z_scores, 'norm')
    se = np.sqrt(p_hats * (1-p_hats) / n)
    ci_lo = p_hats - z95 * se
    ci_hi = p_hats + z95 * se
    coverage = np.mean((ci_lo <= p_true) & (p_true <= ci_hi))
    results[n] = {'z': z_scores, 'ks': ks_pval, 'cov': coverage}

header('Exercise 6: Asymptotic Normality of Bernoulli MLE')
print(f'{"n":>6} {"KS p-value":>12} {"Wald cov":>12} {"Normal?":>10}')
print('-' * 44)
for n in n_values:
    r = results[n]
    normal = 'Yes' if r['ks'] > 0.05 else 'No'
    print(f'{n:>6} {r["ks"]:>12.4f} {r["cov"]:>12.4f} {normal:>10}')

in_range = {n: abs(results[n]['cov'] - 0.95) < 0.01 for n in n_values}
first_good = next((n for n in n_values if in_range[n]), None)
print(f'\nFirst n with coverage in [94%, 96%]: {first_good}')
check_true('Coverage in [94%, 96%] for n>=100', in_range.get(100, False))

if HAS_MPL:
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    x_r = np.linspace(-4, 4, 200)
    normal_pdf = stats.norm.pdf(x_r)
    cols = [COLORS['primary'], COLORS['secondary'], COLORS['tertiary'], COLORS['highlight']]
    for ax, n, col in zip(axes, n_values, cols):
        ax.hist(results[n]['z'], bins=40, density=True, alpha=0.6, color=col, edgecolor='white')
        ax.plot(x_r, normal_pdf, color=COLORS['neutral'], lw=2)
        ax.set_title(f'$n={n}$, KS p={results[n]["ks"]:.3f}')
        ax.set_xlabel('$Z_n$')
        if n == 10:
            ax.set_ylabel('Density')
    fig.suptitle('Asymptotic normality of Bernoulli MLE', fontsize=13)
    fig.tight_layout()
    plt.show()

print('\nTakeaway: n≥30 is often sufficient for Wald CI accuracy in Bernoulli models')

Exercise 7 ★★★ — Fisher Information Matrix for a Neural Network Layer

Background: For a softmax layer $p(y \mid \mathbf{x}; W) = \operatorname{softmax}(W\mathbf{x})_y$, the score is $\nabla_W \log p(y \mid \mathbf{x}; W) = (\mathbf{e}_y - \hat{\mathbf{p}})\,\mathbf{x}^\top$. The FIM has the Kronecker structure $G \otimes A$.
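
A sketch of where the Kronecker structure comes from, writing $\hat{\mathbf{p}} = \operatorname{softmax}(W\mathbf{x})$ and vectorising the score row-wise (the convention the solution cell below also uses):

$$\mathcal{I}(W) = \mathbb{E}_{\mathbf{x}}\,\mathbb{E}_{y\sim\hat{\mathbf{p}}}\!\left[\operatorname{vec}(g)\operatorname{vec}(g)^\top\right] = \mathbb{E}_{\mathbf{x}}\!\left[\bigl(\operatorname{diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^\top\bigr) \otimes \mathbf{x}\mathbf{x}^\top\right],$$

since $\mathbb{E}_{y\sim\hat{\mathbf{p}}}\bigl[(\mathbf{e}_y - \hat{\mathbf{p}})(\mathbf{e}_y - \hat{\mathbf{p}})^\top\bigr] = \operatorname{diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^\top$. K-FAC then replaces the expectation of the Kronecker product by the Kronecker product of expectations, $G \otimes A$ with $G = \mathbb{E}\bigl[\operatorname{diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^\top\bigr]$ and $A = \mathbb{E}[\mathbf{x}\mathbf{x}^\top]$, which is exact only when the two factors are uncorrelated, hence the approximation error measured in task (c).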

Tasks:

(a) Implement the score computation for the softmax layer.

(b) Compute the full FIM $\mathcal{I}(W)$ by averaging outer products of score vectors.

(c) Compute the K-FAC approximation $\hat{\mathcal{I}}_{\text{kfac}} = G \otimes A$ and measure the relative approximation error.

(d) Show that the natural gradient update $W \leftarrow W + \eta\,\mathcal{I}^{-1}\nabla\mathcal{L}$ gives a larger step in directions of low curvature.

Code cell 23

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 24

# Solution
np.random.seed(42)
d = 4
K = 3
n_fim = 300

W = np.random.randn(K, d) * 0.5
X_fim = np.random.randn(n_fim, d)
logits = X_fim @ W.T

def softmax_2d(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

P = softmax_2d(logits)

# (b) Full FIM
FIM_full = np.zeros((K*d, K*d))
for i in range(n_fim):
    for k in range(K):
        e_k = np.zeros(K)
        e_k[k] = 1.0
        delta = e_k - P[i]
        g = np.outer(delta, X_fim[i]).reshape(-1)
        FIM_full += P[i, k] * np.outer(g, g)
FIM_full /= n_fim

# (c) K-FAC
A = (X_fim.T @ X_fim) / n_fim
G = np.zeros((K, K))
for i in range(n_fim):
    G += np.diag(P[i]) - np.outer(P[i], P[i])
G /= n_fim
FIM_kfac = np.kron(G, A)

rel_err = np.linalg.norm(FIM_full - FIM_kfac) / np.linalg.norm(FIM_full)

# (d) Natural gradient step comparison
grad_L = np.random.randn(K*d)  # mock gradient
reg = 0.01 * np.eye(K*d)
nat_grad = np.linalg.solve(FIM_full + reg, grad_L)
eigs_fim = np.linalg.eigvalsh(FIM_full)

header('Exercise 7: FIM for Softmax Layer')
print(f'Full FIM shape: {FIM_full.shape}')
print(f'FIM min eigenvalue: {eigs_fim.min():.6f}')
print(f'FIM max eigenvalue: {eigs_fim.max():.6f}')
print(f'Condition number: {eigs_fim.max()/max(eigs_fim.min(), 1e-10):.2f}')
print()
print(f'K-FAC relative error: {rel_err:.4f} ({100*rel_err:.1f}%)')
check_true('FIM is positive semidefinite', np.all(eigs_fim >= -1e-10))
check_true('K-FAC relative error < 20%', rel_err < 0.20)
print()
print(f'Ratio ||nat_grad||/||grad||: {np.linalg.norm(nat_grad)/np.linalg.norm(grad_L):.4f}')
print('Natural gradient rescales steps by 1/curvature -- larger steps in flat directions')
print('\nTakeaway: K-FAC approximates the FIM with Kronecker factorisation for tractable inversion')

Exercise 8 ★★★ — Chinchilla Scaling Law Estimation

Background: The Chinchilla model (Hoffmann et al., 2022) fits:

$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

by nonlinear least squares (MLE with Gaussian noise), then minimises $L$ under the compute budget constraint $C = 6ND$.
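
Substituting the constraint $D = C/(6N)$ into $L$ and setting $\partial L/\partial N = 0$ gives a closed form for the compute-optimal allocation (the solution cell below instead locates the optimum numerically on a grid, which should agree):

$$\alpha A N^{-\alpha} = \beta B\left(\frac{C}{6}\right)^{-\beta} N^{\beta} \;\Rightarrow\; N^*(C) = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}\left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}}, \qquad D^*(C) = \frac{C}{6\,N^*(C)}.$$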

Tasks:

(a) Generate 30 synthetic $(N, D, L)$ training runs with realistic parameters ($E = 1.69$, $A = 406$, $B = 410$, $\alpha = 0.34$, $\beta = 0.28$) and Gaussian noise $\sigma = 0.02$.

(b) Fit the 5 parameters using scipy.optimize.curve_fit and report 95% CIs for $\alpha$ and $\beta$.

(c) For $C = 6 \times 10^{20}$ FLOPs, find the optimal $(N^*, D^*)$ and compare to the Chinchilla recommendation ($N^* \approx 70$B, $D^* \approx 1.4$T).

(d) How sensitive is $N^*$ to $\pm 1\sigma$ uncertainty in $\alpha$?

Code cell 26

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 27

# Solution
from scipy.optimize import curve_fit
np.random.seed(42)

E_t, A_t, B_t, alpha_t, beta_t = 1.69, 406.4, 410.7, 0.34, 0.28

def scaling_law(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# (a) Generate data
N_vals = np.logspace(8, 12, 5)
D_vals = np.logspace(9, 13, 6)
NN, DD = np.meshgrid(N_vals, D_vals)
N_flat = NN.flatten()
D_flat = DD.flatten()

L_true = scaling_law((N_flat, D_flat), E_t, A_t, B_t, alpha_t, beta_t)
L_obs = L_true + 0.02 * np.random.randn(len(L_true))

# (b) Fit
p0 = [1.7, 400, 400, 0.3, 0.3]
bounds = ([0, 0, 0, 0.1, 0.1], [5, 1e4, 1e4, 1.0, 1.0])
popt, pcov = curve_fit(scaling_law, (N_flat, D_flat), L_obs,
                        p0=p0, bounds=bounds, maxfev=20000)
perr = np.sqrt(np.diag(pcov))

# (c) Optimal allocation for fixed C
C_budget = 6e20
N_cand = np.logspace(8, 13, 2000)
D_cand = C_budget / (6 * N_cand)
L_cand = scaling_law((N_cand, D_cand), *popt)
idx_opt = np.argmin(L_cand)
N_star = N_cand[idx_opt]
D_star = D_cand[idx_opt]

# (d) Sensitivity to alpha uncertainty
alpha_lo = popt[3] - perr[3]
alpha_hi = popt[3] + perr[3]

def optimal_N(alpha_val):
    params_mod = list(popt)
    params_mod[3] = alpha_val
    L_v = scaling_law((N_cand, D_cand), *params_mod)
    return N_cand[np.argmin(L_v)]

N_star_lo = optimal_N(alpha_lo)
N_star_hi = optimal_N(alpha_hi)

header('Exercise 8: Chinchilla Scaling Law Estimation')
names = ['E', 'A', 'B', 'alpha', 'beta']
true_vals = [E_t, A_t, B_t, alpha_t, beta_t]
print(f'{"Param":<8} {"True":>8} {"MLE":>8} {"SE":>8}  95% CI')
print('-' * 52)
for nm, tv, est, se in zip(names, true_vals, popt, perr):
    ci = f'[{est-1.96*se:.3f}, {est+1.96*se:.3f}]'
    print(f'{nm:<8} {tv:>8.3f} {est:>8.3f} {se:>8.4f}  {ci}')

print(f'\n(c) Optimal allocation:')
print(f'    N* = {N_star:.2e} parameters')
print(f'    D* = {D_star:.2e} tokens')
check_true('N* in plausible range (1e9 to 1e12)', 1e9 < N_star < 1e12)

print(f'\n(d) Sensitivity to alpha uncertainty (±{perr[3]:.4f}):')
print(f'    N*(alpha - 1SE): {N_star_lo:.2e}')
print(f'    N*(alpha + 1SE): {N_star_hi:.2e}')
print(f'    Ratio: {N_star_hi/N_star_lo:.2f}x variation in N* per 1-sigma change in alpha')
print('\nTakeaway: Scaling law estimation via MLE enables compute-optimal training allocation')

Exercise 9 ★★★ — Bootstrap Confidence Interval for Model Accuracy

Background: A classifier is evaluated on held-out examples with binary correctness indicators.

Tasks:

(a) Estimate the accuracy.

(b) Build a percentile bootstrap 95% confidence interval.

(c) Compare with the normal approximation interval.

(d) Explain when bootstrap intervals are more trustworthy than asymptotic formulas.

Code cell 29

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 30

# Solution
n = 600
true_acc = 0.82
correct = (np.random.rand(n) < true_acc).astype(float)
acc_hat = correct.mean()
B = 3000
boot = np.array([np.random.choice(correct, size=n, replace=True).mean() for _ in range(B)])
ci_boot = np.percentile(boot, [2.5, 97.5])
se = np.sqrt(acc_hat * (1 - acc_hat) / n)
ci_norm = acc_hat + np.array([-1.96, 1.96]) * se

header('Exercise 9: Bootstrap CI for Accuracy')
print(f'accuracy estimate: {acc_hat:.4f}')
print(f'bootstrap 95% CI: [{ci_boot[0]:.4f}, {ci_boot[1]:.4f}]')
print(f'normal 95% CI:    [{ci_norm[0]:.4f}, {ci_norm[1]:.4f}]')
check_true('Both intervals contain the estimate', ci_boot[0] < acc_hat < ci_boot[1] and ci_norm[0] < acc_hat < ci_norm[1])
check_true('Intervals have similar width here', abs((ci_boot[1]-ci_boot[0]) - (ci_norm[1]-ci_norm[0])) < 0.02)
print('Takeaway: bootstrap treats the empirical sample as a reusable proxy population.')

Exercise 10 ★★★ — Delta Method for Log-Odds

Background: Let $\hat{p}$ be a sample proportion and define $g(p) = \log\bigl(p/(1-p)\bigr)$.
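
For reference, the delta-method standard error requested in task (a) below comes from a first-order Taylor expansion of $g$ around $p$:

$$g(\hat{p}) \approx g(p) + g'(p)\,(\hat{p} - p), \qquad g'(p) = \frac{1}{p(1-p)} \;\Rightarrow\; \operatorname{SE}\bigl(g(\hat{p})\bigr) \approx \bigl|g'(\hat{p})\bigr|\,\operatorname{SE}(\hat{p}) = \frac{1}{\hat{p}(1-\hat{p})}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \frac{1}{\sqrt{n\,\hat{p}(1-\hat{p})}}.$$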

Tasks:

(a) Derive the delta-method standard error for $g(\hat{p})$.

(b) Compare the delta-method interval with a Monte Carlo sampling distribution.

(c) Explain why transformed estimands need transformed uncertainty, not only transformed point estimates.

Code cell 32

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 33

# Solution
from scipy import stats

n = 1000
p = 0.72
phat = np.random.binomial(n, p) / n
logit = lambda u: np.log(u / (1 - u))
gprime = lambda u: 1 / (u * (1 - u))
se_p = np.sqrt(phat * (1 - phat) / n)
se_logit = abs(gprime(phat)) * se_p
ci_delta = logit(phat) + np.array([-1.96, 1.96]) * se_logit

reps = 20000
phats = np.random.binomial(n, p, size=reps) / n
logits = logit(np.clip(phats, 1e-6, 1 - 1e-6))
ci_mc = np.percentile(logits, [2.5, 97.5])

header('Exercise 10: Delta Method for Log-Odds')
print(f'phat={phat:.4f}, logit(phat)={logit(phat):.4f}')
print(f'delta SE={se_logit:.4f}')
print(f'delta CI: [{ci_delta[0]:.4f}, {ci_delta[1]:.4f}]')
print(f'MC CI:    [{ci_mc[0]:.4f}, {ci_mc[1]:.4f}]')
check_true('Delta and MC intervals are close', np.max(np.abs(ci_delta - ci_mc)) < 0.08)
print('Takeaway: the delta method propagates estimator uncertainty through nonlinear transformations.')

What to Review After Finishing

  • Bias-variance decomposition — can you prove MSE = Bias² + Var from scratch? (a short proof sketch follows this list)
  • CRB proof — key step is differentiating the unbiasedness condition and applying Cauchy-Schwarz
  • MLE derivation — can you find the score equation and verify it is a maximum for Bernoulli, Poisson, Gaussian?
  • Asymptotic normality — key steps: Taylor expand score, apply CLT to numerator, LLN to denominator
  • Bootstrap CI — understand what distribution you are resampling from (empirical CDF)
  • FIM structure — why does $\mathcal{I}(W) = G \otimes A$ for a linear softmax layer?
  • Scaling laws — why is nonlinear least squares equivalent to MLE under Gaussian noise?
  • Bootstrap CI for model accuracy — can you connect the computation to statistical ML practice?
  • Delta method for log-odds — can you connect the computation to statistical ML practice?
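
For the bias-variance item above, a compact proof: add and subtract $\mathbb{E}\hat{\theta}$ inside the square and note that the cross term vanishes because $\mathbb{E}[\hat{\theta} - \mathbb{E}\hat{\theta}] = 0$:

$$\operatorname{MSE}(\hat{\theta}) = \mathbb{E}\bigl[(\hat{\theta} - \theta)^2\bigr] = \mathbb{E}\bigl[(\hat{\theta} - \mathbb{E}\hat{\theta})^2\bigr] + \bigl(\mathbb{E}\hat{\theta} - \theta\bigr)^2 = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2.$$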

References

  1. Casella & Berger (2002). Statistical Inference, 2nd ed. — complete treatment of all exercises
  2. Hoffmann et al. (2022). 'Training Compute-Optimal Large Language Models' — Exercise 8
  3. Kirkpatrick et al. (2017). 'EWC' — Fisher information for continual learning
  4. Martens & Grosse (2015). 'K-FAC' — Exercise 7
  5. Efron (1979). 'Bootstrap Methods' — Exercise 5