Exercises Notebook
Converted from exercises.ipynb for web reading.
Estimation Theory — Exercises
10 exercises covering the core results of estimation theory, from basic MLE and bias proofs through Fisher information for neural network layers and scaling law estimation.
| Format | Description |
|---|---|
| Problem | Markdown cell with task description |
| Your Solution | Code cell with scaffolding (# YOUR CODE HERE) |
| Solution | Code cell with reference solution and checks |
Difficulty Levels
| Level | Exercises | Focus |
|---|---|---|
| ★ | 1–3 | Core mechanics: MLE, bias, CRB |
| ★★ | 4–6 | Deeper theory: MoM, bootstrap, asymptotic normality |
| ★★★ | 7–10 | AI applications: FIM for neural layers, scaling laws |
Topic Map
| Topic | Exercises |
|---|---|
| MLE derivation | 1, 3 |
| Bias/variance/MSE | 2 |
| Cramér-Rao bound | 3 |
| Method of moments | 4 |
| Bootstrap CI | 5 |
| Asymptotic normality | 6 |
| Fisher information (neural networks) | 7 |
| Scaling law MLE | 8 |
Additional applied exercises: Exercise 9 (bootstrap CI for model accuracy) and Exercise 10 (delta method for log-odds).
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
import numpy.linalg as la
from scipy import stats, optimize, special
try:
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-v0_8-whitegrid')
mpl.rcParams.update({'figure.figsize': (10, 6), 'font.size': 12,
'axes.titlesize': 14, 'axes.labelsize': 12})
COLORS = {'primary':'#0077BB','secondary':'#EE7733','tertiary':'#009988',
'error':'#CC3311','neutral':'#555555','highlight':'#EE3377'}
HAS_MPL = True
except ImportError:
HAS_MPL = False
np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
def header(title):
print('\n' + '=' * len(title))
print(title)
print('=' * len(title))
def check_close(name, got, expected, tol=1e-4):
got = np.atleast_1d(np.asarray(got, dtype=float))
expected = np.atleast_1d(np.asarray(expected, dtype=float))
ok = np.allclose(got, expected, atol=tol, rtol=tol)
print(f"{'PASS' if ok else 'FAIL'} — {name}")
if not ok:
print(f' expected: {expected}')
print(f' got: {got}')
return ok
def check_true(name, cond):
print(f"{'PASS' if cond else 'FAIL'} — {name}")
return cond
def softmax(z):
e = np.exp(z - z.max())
return e / e.sum()
print('Setup complete.')
Exercise 1 ★ — MLE for Bernoulli and Poisson
Background: The sample proportion $\hat{p} = S/n$ is the MLE for the Bernoulli parameter $p$. The sample mean $\bar{x}$ is the MLE for the Poisson rate $\lambda$.
Tasks:
(a) Given 10 coin flips with 6 heads, compute $\hat{p}$, its standard error $\mathrm{SE}(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$, and the 95% Wald CI.
(b) Given Poisson data $x = [3, 1, 4, 1, 5, 9, 2, 6]$, compute $\hat{\lambda}$ by solving the score equation $\sum_i x_i/\lambda - n = 0$.
(c) Verify that the score evaluated at the MLE equals zero and that the second derivative is negative (confirming a maximum).
Code cell 5
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 6
# Solution
x_bern = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
n_b = len(x_bern)
S_b = x_bern.sum()
# (a) Bernoulli MLE
p_hat = S_b / n_b
se_p = np.sqrt(p_hat * (1 - p_hat) / n_b)
z95 = stats.norm.ppf(0.975)
ci_lo = p_hat - z95 * se_p
ci_hi = p_hat + z95 * se_p
# (b) Poisson MLE
x_pois = np.array([3, 1, 4, 1, 5, 9, 2, 6])
n_p = len(x_pois)
S_p = x_pois.sum()
lambda_hat = S_p / n_p
# (c) Score at MLE
score_at_mle = S_p / lambda_hat - n_p
d2l_at_mle = -S_p / lambda_hat**2
header('Exercise 1: MLE for Bernoulli and Poisson')
print(f'(a) Bernoulli: p_hat={p_hat:.4f}, SE={se_p:.4f}')
print(f' 95% Wald CI: [{ci_lo:.4f}, {ci_hi:.4f}]')
check_close('MLE = S/n', p_hat, 0.6)
check_close('SE formula', se_p, np.sqrt(0.6*0.4/10))
print()
print(f'(b) Poisson: lambda_hat = {lambda_hat:.4f}')
check_close('Poisson MLE = xbar', lambda_hat, x_pois.mean())
print()
print(f'(c) Score at MLE = {score_at_mle:.10f}')
print(f' d^2ell/dlambda^2 = {d2l_at_mle:.4f} (negative = maximum)')
check_true('Score equals zero at MLE', np.isclose(score_at_mle, 0, atol=1e-10))
check_true('Second derivative is negative (maximum)', d2l_at_mle < 0)
print('\nTakeaway: MLE score equation has a unique solution for exponential family distributions')
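# A minimal sketch (not part of the reference solution), assuming x_pois and lambda_hat
# from the solution above: confirm the closed-form Poisson MLE by numerically maximising
# the log-likelihood with a bounded scalar optimiser.
from scipy.optimize import minimize_scalar
negloglik = lambda lam: -(x_pois.sum() * np.log(lam) - len(x_pois) * lam)
res = minimize_scalar(negloglik, bounds=(1e-6, 50), method='bounded')
print(f'numerical MLE: {res.x:.6f}   closed form S/n: {lambda_hat:.6f}')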
Exercise 2 ★ — Bias of the MLE Variance Estimator
Background: The MLE of the Gaussian variance, $\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\sum_i (x_i - \bar{x})^2$, uses denominator $n$ rather than $n-1$, giving a biased estimator with bias $-\sigma^2/n$.
Tasks:
(a) For $\sigma^2 = 4$ and $n = 10$, compute the theoretical bias of $\hat{\sigma}^2_{\mathrm{MLE}}$.
(b) Simulate $M = 5000$ datasets of size $n = 10$ from $\mathcal{N}(0, 4)$ and verify the empirical bias matches the theoretical value.
(c) Show that $\hat{\sigma}_{\mathrm{MLE}} = \sqrt{\hat{\sigma}^2_{\mathrm{MLE}}}$ is a negatively biased estimator of $\sigma$. Verify using simulation.
(d) By MLE invariance, why is $\sqrt{\hat{\sigma}^2_{\mathrm{MLE}}}$ the MLE of $\sigma$, even though it is biased?
Code cell 8
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 9
# Solution
sigma_true = 2.0
sigma2_true = sigma_true**2
n = 10
M = 5000
# (a) Theoretical bias
bias_theory = -sigma2_true / n
# (b) Simulate
np.random.seed(42)
data = np.random.normal(0, sigma_true, size=(M, n))
xbar = data.mean(axis=1, keepdims=True)
sigma2_mle = ((data - xbar)**2).mean(axis=1)
sigma2_unb = ((data - xbar)**2).sum(axis=1) / (n-1)
bias_empirical = sigma2_mle.mean() - sigma2_true
# (c) Bias of sigma_MLE
sigma_mle = np.sqrt(sigma2_mle)
bias_sigma = sigma_mle.mean() - sigma_true
header('Exercise 2: Bias of MLE Variance Estimator')
print(f'(a) Theoretical bias = {bias_theory:.6f}')
print(f'(b) Empirical bias = {bias_empirical:.6f}')
check_close('empirical bias ≈ theoretical', bias_empirical, bias_theory, tol=0.05)
print(f' E[sigma2_MLE] = {sigma2_mle.mean():.4f} (true {sigma2_true})')
print(f' E[sigma2_unb] = {sigma2_unb.mean():.4f} (should be ≈ {sigma2_true})')
check_close('unbiased estimator centred at truth', sigma2_unb.mean(), sigma2_true, tol=0.05)
print(f'\n(c) E[sigma_MLE] = {sigma_mle.mean():.4f}, true sigma = {sigma_true}')
print(f' Bias of sigma_MLE = {bias_sigma:.6f}')
check_true('sigma_MLE is negatively biased (Jensen: E[sqrt(X)] < sqrt(E[X]))', bias_sigma < 0)
print(f'\n(d) By MLE invariance: g(hat_theta) = hat_g(theta) for any g.')
print(f' sigma = g(sigma^2) = sqrt(sigma^2), so MLE of sigma = sqrt(MLE of sigma^2).')
print(f' This holds even though sqrt is nonlinear and introduces bias.')
print('\nTakeaway: MLE invariance gives the MLE of any smooth function, but bias is not invariant')
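# A minimal sketch (not part of the reference solution), assuming sigma2_mle, sigma2_unb
# and sigma2_true from the solution above: check the decomposition MSE = bias^2 + variance
# for both variance estimators, tying the simulation back to the bias-variance identity.
for name, est in [('MLE (1/n)', sigma2_mle), ('unbiased (1/(n-1))', sigma2_unb)]:
    bias = est.mean() - sigma2_true
    var = est.var()
    mse = np.mean((est - sigma2_true)**2)
    print(f'{name:>20}: bias^2 + var = {bias**2 + var:.4f},  MSE = {mse:.4f}')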
Exercise 3 ★ — Cramér-Rao Bound Verification for Exponential Distribution
Background: For $X \sim \mathrm{Exp}(\lambda)$ with density $p(x;\lambda) = \lambda e^{-\lambda x}$, the Fisher information is $I(\lambda) = 1/\lambda^2$ and the CRB for an unbiased estimator from $n$ samples is $\lambda^2/n$.
Tasks:
(a) Derive the score $s(\lambda; x) = 1/\lambda - x$ and verify $\mathbb{E}[s(\lambda; X)] = 0$ numerically.
(b) Compute $I(\lambda)$ using both the variance-of-score formula and the negative-expected-Hessian formula, and verify they match.
(c) Show the MLE achieves the CRB: $\mathrm{Var}(\hat{\lambda}) \approx \lambda^2/n$.
Code cell 11
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 12
# Solution
lam_true = 2.5
n = 100
M = 10000
np.random.seed(42)
data_exp = np.random.exponential(1/lam_true, size=(M, n))
# (a) Score s(lambda; x) = 1/lambda - x
score_values = 1/lam_true - data_exp[:, 0]
mean_score = score_values.mean()
# (b) Fisher information
I_from_var = score_values.var() # Var(score) = I(lambda)
I_from_hessian = 1 / lam_true**2 # -E[d^2 log p / dlambda^2] = 1/lambda^2
I_theoretical = 1 / lam_true**2
# (c) MLE achieves CRB
lambda_hats = 1.0 / data_exp.mean(axis=1)
var_mle = lambda_hats.var()
crb = lam_true**2 / n
header('Exercise 3: CRB Verification for Exponential Distribution')
print(f'(a) E[score] = {mean_score:.6f} (expected 0)')
check_true('Score has zero mean', np.isclose(mean_score, 0, atol=0.05))
print()
print(f'(b) I(lambda) via Var(score) = {I_from_var:.5f}')
print(f' I(lambda) via -E[hessian] = {I_from_hessian:.5f}')
print(f' I(lambda) theoretical = {I_theoretical:.5f}')
check_close('Two FIM formulas agree', I_from_var, I_from_hessian, tol=0.05)
print()
print(f'(c) Var(lambda_hat) = {var_mle:.7f}')
print(f' CRB = lambda^2/n = {crb:.7f}')
ratio = var_mle / crb
print(f' Ratio Var/CRB = {ratio:.4f} (should be ≈ 1.0)')
check_close('MLE achieves CRB', var_mle, crb, tol=crb*0.05)
print('\nTakeaway: Exponential MLE is efficient -- it achieves the Cramér-Rao lower bound')
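# A minimal sketch (not part of the reference solution), assuming data_exp from the
# solution above: the observed information -d^2 loglik/dlambda^2 at the MLE, computed by
# a central finite difference on one dataset, should be close to n / lambda_hat^2.
x1 = data_exp[0]                                   # one simulated dataset of size n
loglik = lambda lam: len(x1) * np.log(lam) - lam * x1.sum()
lam_hat_1 = 1 / x1.mean()
h = 1e-4
obs_info = -(loglik(lam_hat_1 + h) - 2 * loglik(lam_hat_1) + loglik(lam_hat_1 - h)) / h**2
print(f'observed information: {obs_info:.4f},  n/lambda_hat^2: {len(x1) / lam_hat_1**2:.4f}')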
Exercise 4 ★★ — Method of Moments for the Beta Distribution
Background: The $\mathrm{Beta}(\alpha, \beta)$ distribution has:
- Mean: $\mu = \dfrac{\alpha}{\alpha+\beta}$
- Variance: $\sigma^2 = \dfrac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
Tasks:
(a) Derive the MoM estimators: $\hat{\alpha} = \bar{x}\left(\dfrac{\bar{x}(1-\bar{x})}{s^2} - 1\right)$ and $\hat{\beta} = (1-\bar{x})\left(\dfrac{\bar{x}(1-\bar{x})}{s^2} - 1\right)$.
(b) Generate $n = 300$ samples from $\mathrm{Beta}(3, 7)$ and compute $\hat{\alpha}_{\mathrm{MoM}}$, $\hat{\beta}_{\mathrm{MoM}}$.
(c) Compare to the MLE (use scipy.stats.beta.fit) and report which has smaller MSE.
(d) Verify the constraint: the MoM estimates require $s^2 < \bar{x}(1-\bar{x})$. What does this mean?
Code cell 14
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 15
# Solution
np.random.seed(42)
alpha_true, beta_true = 3.0, 7.0
n = 300
x_beta = np.random.beta(alpha_true, beta_true, n)
xbar = x_beta.mean()
s2 = x_beta.var()
# (a)-(b) MoM
common = xbar * (1 - xbar) / s2 - 1
alpha_mom = xbar * common
beta_mom = (1 - xbar) * common
# (c) MLE via scipy
fit_result = stats.beta.fit(x_beta, floc=0, fscale=1)
alpha_mle = fit_result[0]
beta_mle = fit_result[1]
# MSE for alpha
M_comp = 1000
a_mom_sim, a_mle_sim = [], []
for _ in range(M_comp):
x = np.random.beta(alpha_true, beta_true, n)
xb, s2_ = x.mean(), x.var()
c_ = xb*(1-xb)/s2_ - 1
a_mom_sim.append(xb * c_)
fr = stats.beta.fit(x, floc=0, fscale=1)
a_mle_sim.append(fr[0])
mse_mom = np.mean((np.array(a_mom_sim) - alpha_true)**2)
mse_mle = np.mean((np.array(a_mle_sim) - alpha_true)**2)
# (d)
constraint_satisfied = s2 < xbar * (1 - xbar)
header('Exercise 4: Method of Moments for Beta Distribution')
print(f'True: alpha={alpha_true}, beta={beta_true}')
print(f'MoM: alpha={alpha_mom:.4f}, beta={beta_mom:.4f}')
print(f'MLE: alpha={alpha_mle:.4f}, beta={beta_mle:.4f}')
check_close('MoM alpha within 20% of truth', alpha_mom, alpha_true, tol=0.6)
print()
print(f'MSE (alpha) — MoM: {mse_mom:.5f}, MLE: {mse_mle:.5f}')
check_true('MLE more efficient than MoM', mse_mle < mse_mom)
print(f'Relative efficiency MLE/MoM: {mse_mom/mse_mle:.2f}')
print()
print(f'(d) Constraint: s2={s2:.6f} < xbar*(1-xbar)={xbar*(1-xbar):.6f}')
check_true('Constraint satisfied', constraint_satisfied)
print(' Meaning: sample variance must be less than max possible variance (at p=0.5).')
print(' If s2 >= xbar*(1-xbar), data are more dispersed than Beta can explain -- MoM fails.')
print('\nTakeaway: MLE is more efficient than MoM for Beta; both are consistent')
Exercise 5 ★★ — Bootstrap Confidence Interval
Background: The bootstrap resamples from the empirical distribution to estimate the sampling distribution of any statistic.
Tasks:
(a) Given the dataset x = [3.1, 1.4, 2.5, 4.2, 1.8, 3.7, 2.1, 4.8, 1.2, 3.3], compute the sample median.
(b) Implement the non-parametric bootstrap with $B = 2000$ resamples to construct a 95% percentile CI for the median.
(c) Compare this to the asymptotic CI. For a Gaussian distribution, $\mathrm{Var}(\mathrm{median}) \approx \dfrac{\pi\sigma^2}{2n}$. Use $\hat{\sigma}$ (the sample standard deviation) to compute the asymptotic CI and compare widths.
(d) Check coverage: repeat the experiment 1000 times (simulate from $\mathcal{N}(2.7, 1.2^2)$) and compute empirical coverage for both CI types.
Code cell 17
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 18
# Solution
x = np.array([3.1, 1.4, 2.5, 4.2, 1.8, 3.7, 2.1, 4.8, 1.2, 3.3])
n = len(x)
# (a)
median_obs = np.median(x)
# (b) Bootstrap
B = 2000
np.random.seed(42)
boot_medians = np.array([np.median(x[np.random.randint(0, n, n)]) for _ in range(B)])
ci_boot_lo = np.percentile(boot_medians, 2.5)
ci_boot_hi = np.percentile(boot_medians, 97.5)
# (c) Asymptotic
sigma_hat = x.std(ddof=1)
var_median_asym = np.pi * sigma_hat**2 / (2 * n)
se_median = np.sqrt(var_median_asym)
z95 = 1.96
ci_asym_lo = median_obs - z95 * se_median
ci_asym_hi = median_obs + z95 * se_median
# (d) Coverage simulation
mu_sim = 2.7
sigma_sim = 1.2
M_cov = 1000
true_median = mu_sim # median = mean for Gaussian
boot_covers, asym_covers = [], []
for _ in range(M_cov):
x_sim = np.random.normal(mu_sim, sigma_sim, n)
m_sim = np.median(x_sim)
# Bootstrap CI
bm = np.array([np.median(x_sim[np.random.randint(0, n, n)]) for _ in range(500)])
bl, bh = np.percentile(bm, [2.5, 97.5])
boot_covers.append(bl <= true_median <= bh)
# Asymptotic CI
s = x_sim.std(ddof=1)
se = np.sqrt(np.pi * s**2 / (2*n))
al, ah = m_sim - 1.96*se, m_sim + 1.96*se
asym_covers.append(al <= true_median <= ah)
from scipy import stats as st
header('Exercise 5: Bootstrap Confidence Interval')
print(f'(a) Sample median = {median_obs:.4f}')
print(f'(b) Bootstrap 95% CI: [{ci_boot_lo:.4f}, {ci_boot_hi:.4f}] width={ci_boot_hi-ci_boot_lo:.4f}')
print(f'(c) Asymptotic 95% CI: [{ci_asym_lo:.4f}, {ci_asym_hi:.4f}] width={ci_asym_hi-ci_asym_lo:.4f}')
print(f'(d) Coverage (n=10, M=1000): bootstrap={np.mean(boot_covers):.3f}, asymptotic={np.mean(asym_covers):.3f}')
check_true('Bootstrap CI has reasonable coverage', abs(np.mean(boot_covers) - 0.95) < 0.05)
print('\nTakeaway: Bootstrap captures asymmetry in sampling distribution; better for small n')
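# A minimal sketch (not part of the reference solution), assuming median_obs and
# boot_medians from the solution above: a normal-approximation bootstrap interval built
# from the bootstrap standard error, for contrast with the percentile interval.
se_boot = boot_medians.std(ddof=1)
ci_se_lo, ci_se_hi = median_obs - 1.96 * se_boot, median_obs + 1.96 * se_boot
print(f'bootstrap-SE 95% CI: [{ci_se_lo:.4f}, {ci_se_hi:.4f}] width={ci_se_hi - ci_se_lo:.4f}')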
Exercise 6 ★★ — Asymptotic Normality Simulation
Background: The MLE satisfies $\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}\!\left(0,\ I(\theta_0)^{-1}\right)$.
Tasks:
(a) For $n \in \{10, 30, 100, 500\}$, simulate $M = 5000$ Bernoulli($p = 0.3$) MLEs at each $n$ and standardise: $Z_n = \dfrac{\hat{p} - p}{\sqrt{p(1-p)/n}}$.
(b) At each $n$, perform a Kolmogorov-Smirnov test against $\mathcal{N}(0, 1)$ and report the p-value.
(c) Compute the empirical coverage of the 95% Wald CI at each $n$. When does it first reach [94%, 96%]?
(d) Plot the 4 $Z_n$ distributions and overlay the $\mathcal{N}(0, 1)$ density.
Code cell 20
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 21
# Solution
p_true = 0.3
M = 5000
n_values = [10, 30, 100, 500]
np.random.seed(42)
asym_std = np.sqrt(p_true * (1 - p_true)) # 1/sqrt(I) for n=1
z95 = 1.96
results = {}
for n in n_values:
data = np.random.binomial(1, p_true, size=(M, n))
p_hats = data.mean(axis=1)
z_scores = (p_hats - p_true) / (asym_std / np.sqrt(n))
ks_stat, ks_pval = stats.kstest(z_scores, 'norm')
se = np.sqrt(p_hats * (1-p_hats) / n)
ci_lo = p_hats - z95 * se
ci_hi = p_hats + z95 * se
coverage = np.mean((ci_lo <= p_true) & (p_true <= ci_hi))
results[n] = {'z': z_scores, 'ks': ks_pval, 'cov': coverage}
header('Exercise 6: Asymptotic Normality of Bernoulli MLE')
print(f'{"n":>6} {"KS p-value":>12} {"Wald cov":>12} {"Normal?":>10}')
print('-' * 44)
for n in n_values:
r = results[n]
normal = 'Yes' if r['ks'] > 0.05 else 'No'
print(f'{n:>6} {r["ks"]:>12.4f} {r["cov"]:>12.4f} {normal:>10}')
in_range = {n: abs(results[n]['cov'] - 0.95) < 0.01 for n in n_values}
first_good = next((n for n in n_values if in_range[n]), None)
print(f'\nFirst n with coverage in [94%, 96%]: {first_good}')
check_true('Coverage in [94%, 96%] for n>=100', in_range.get(100, False))
if HAS_MPL:
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
x_r = np.linspace(-4, 4, 200)
normal_pdf = stats.norm.pdf(x_r)
cols = [COLORS['primary'], COLORS['secondary'], COLORS['tertiary'], COLORS['highlight']]
for ax, n, col in zip(axes, n_values, cols):
ax.hist(results[n]['z'], bins=40, density=True, alpha=0.6, color=col, edgecolor='white')
ax.plot(x_r, normal_pdf, color=COLORS['neutral'], lw=2)
ax.set_title(f'$n={n}$, KS p={results[n]["ks"]:.3f}')
ax.set_xlabel('$Z_n$')
if n == 10:
ax.set_ylabel('Density')
fig.suptitle('Asymptotic normality of Bernoulli MLE', fontsize=13)
fig.tight_layout()
plt.show()
print('\nTakeaway: n≥30 is often sufficient for Wald CI accuracy in Bernoulli models')
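# A minimal sketch (not part of the reference solution), assuming p_true, M and results
# from the solution above: coverage of the Wilson score interval at n=10, a standard
# alternative that fixes most of the Wald interval's small-sample under-coverage.
n_small, z = 10, 1.96
ph = np.random.binomial(1, p_true, size=(M, n_small)).mean(axis=1)
centre = (ph + z**2 / (2 * n_small)) / (1 + z**2 / n_small)
half = z * np.sqrt(ph * (1 - ph) / n_small + z**2 / (4 * n_small**2)) / (1 + z**2 / n_small)
cov_wilson = np.mean((centre - half <= p_true) & (p_true <= centre + half))
print(f'Wilson coverage at n=10: {cov_wilson:.3f}  (Wald above: {results[10]["cov"]:.3f})')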
Exercise 7 ★★★ — Fisher Information Matrix for a Neural Network Layer
Background: For a softmax output layer $p = \mathrm{softmax}(Wx)$ with label $y$, the score with respect to the weights is $\nabla_W \log p_y = (e_y - p)\,x^\top$. The FIM has the approximate Kronecker structure $F \approx G \otimes A$, where $A = \mathbb{E}[xx^\top]$ and $G = \mathbb{E}[\mathrm{diag}(p) - pp^\top]$.
Tasks:
(a) Implement the score computation for the softmax layer.
(b) Compute the full FIM by averaging outer products of score vectors.
(c) Compute the K-FAC approximation and measure the relative approximation error.
(d) Show that the natural gradient update gives a larger step in directions of low curvature.
Code cell 23
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 24
# Solution
np.random.seed(42)
d = 4
K = 3
n_fim = 300
W = np.random.randn(K, d) * 0.5
X_fim = np.random.randn(n_fim, d)
logits = X_fim @ W.T
def softmax_2d(Z):
e = np.exp(Z - Z.max(axis=1, keepdims=True))
return e / e.sum(axis=1, keepdims=True)
P = softmax_2d(logits)
# (b) Full FIM
FIM_full = np.zeros((K*d, K*d))
for i in range(n_fim):
for k in range(K):
e_k = np.zeros(K)
e_k[k] = 1.0
delta = e_k - P[i]
g = np.outer(delta, X_fim[i]).reshape(-1)
FIM_full += P[i, k] * np.outer(g, g)
FIM_full /= n_fim
# (c) K-FAC
A = (X_fim.T @ X_fim) / n_fim
G = np.zeros((K, K))
for i in range(n_fim):
G += np.diag(P[i]) - np.outer(P[i], P[i])
G /= n_fim
FIM_kfac = np.kron(G, A)
rel_err = np.linalg.norm(FIM_full - FIM_kfac) / np.linalg.norm(FIM_full)
# (d) Natural gradient step comparison
grad_L = np.random.randn(K*d) # mock gradient
reg = 0.01 * np.eye(K*d)
nat_grad = np.linalg.solve(FIM_full + reg, grad_L)
eigs_fim = np.linalg.eigvalsh(FIM_full)
header('Exercise 7: FIM for Softmax Layer')
print(f'Full FIM shape: {FIM_full.shape}')
print(f'FIM min eigenvalue: {eigs_fim.min():.6f}')
print(f'FIM max eigenvalue: {eigs_fim.max():.6f}')
print(f'Condition number: {eigs_fim.max()/max(eigs_fim.min(), 1e-10):.2f}')
print()
print(f'K-FAC relative error: {rel_err:.4f} ({100*rel_err:.1f}%)')
check_true('FIM is positive semidefinite', np.all(eigs_fim >= -1e-10))
check_true('K-FAC relative error < 20%', rel_err < 0.20)
print()
print(f'Ratio ||nat_grad||/||grad||: {np.linalg.norm(nat_grad)/np.linalg.norm(grad_L):.4f}')
print('Natural gradient rescales steps by 1/curvature -- larger steps in flat directions')
print('\nTakeaway: K-FAC approximates the FIM with Kronecker factorisation for tractable inversion')
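# A minimal sketch (not part of the reference solution), assuming G, A, K, d and grad_L
# from the solution above. K-FAC's payoff is that the Kronecker-factored FIM never has
# to be inverted as a (K*d x K*d) matrix: with the row-major flattening used above,
# (G kron A) vec(V) = vec(G V A), so the natural gradient is vec(G^{-1} V A^{-1}).
damping = 0.01
G_d = G + damping * np.eye(K)                      # damped Kronecker factors
A_d = A + damping * np.eye(d)
V = grad_L.reshape(K, d)                           # gradient viewed as a K x d matrix
nat_kfac = np.linalg.solve(G_d, V) @ np.linalg.inv(A_d)   # G^{-1} V A^{-1}
# Compare with an explicit solve on the full Kronecker product (same damped factors)
nat_full = np.linalg.solve(np.kron(G_d, A_d), grad_L).reshape(K, d)
print(f'K-FAC vs explicit solve, max |difference|: {np.abs(nat_kfac - nat_full).max():.2e}')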
Exercise 8 ★★★ — Chinchilla Scaling Law Estimation
Background: The Chinchilla model (Hoffmann et al., 2022) fits
$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$
by nonlinear least squares (MLE with Gaussian noise), then minimises $L(N, D)$ under the compute budget constraint $C \approx 6ND$.
Tasks:
(a) Generate 30 synthetic training runs with realistic parameters ($E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$) and Gaussian noise with $\sigma = 0.02$.
(b) Fit the 5 parameters using scipy.optimize.curve_fit and report 95% CIs for $\alpha$ and $\beta$.
(c) For $C = 6 \times 10^{20}$ FLOPs, find the optimal $(N^*, D^*)$ and compare to the Chinchilla recommendation ($N^* \propto C^{0.5}$, $D^* \propto C^{0.5}$).
(d) How sensitive is $N^*$ to uncertainty in $\alpha$?
Code cell 26
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 27
# Solution
from scipy.optimize import curve_fit
np.random.seed(42)
E_t, A_t, B_t, alpha_t, beta_t = 1.69, 406.4, 410.7, 0.34, 0.28
def scaling_law(ND, E, A, B, alpha, beta):
N, D = ND
return E + A / N**alpha + B / D**beta
# (a) Generate data
N_vals = np.logspace(8, 12, 5)
D_vals = np.logspace(9, 13, 6)
NN, DD = np.meshgrid(N_vals, D_vals)
N_flat = NN.flatten()
D_flat = DD.flatten()
L_true = scaling_law((N_flat, D_flat), E_t, A_t, B_t, alpha_t, beta_t)
L_obs = L_true + 0.02 * np.random.randn(len(L_true))
# (b) Fit
p0 = [1.7, 400, 400, 0.3, 0.3]
bounds = ([0, 0, 0, 0.1, 0.1], [5, 1e4, 1e4, 1.0, 1.0])
popt, pcov = curve_fit(scaling_law, (N_flat, D_flat), L_obs,
p0=p0, bounds=bounds, maxfev=20000)
perr = np.sqrt(np.diag(pcov))
# (c) Optimal allocation for fixed C
C_budget = 6e20
N_cand = np.logspace(8, 13, 2000)
D_cand = C_budget / (6 * N_cand)
L_cand = scaling_law((N_cand, D_cand), *popt)
idx_opt = np.argmin(L_cand)
N_star = N_cand[idx_opt]
D_star = D_cand[idx_opt]
# (d) Sensitivity to alpha uncertainty
alpha_lo = popt[3] - perr[3]
alpha_hi = popt[3] + perr[3]
def optimal_N(alpha_val):
params_mod = list(popt)
params_mod[3] = alpha_val
L_v = scaling_law((N_cand, D_cand), *params_mod)
return N_cand[np.argmin(L_v)]
N_star_lo = optimal_N(alpha_lo)
N_star_hi = optimal_N(alpha_hi)
header('Exercise 8: Chinchilla Scaling Law Estimation')
names = ['E', 'A', 'B', 'alpha', 'beta']
true_vals = [E_t, A_t, B_t, alpha_t, beta_t]
print(f'{"Param":<8} {"True":>8} {"MLE":>8} {"SE":>8} 95% CI')
print('-' * 52)
for nm, tv, est, se in zip(names, true_vals, popt, perr):
ci = f'[{est-1.96*se:.3f}, {est+1.96*se:.3f}]'
print(f'{nm:<8} {tv:>8.3f} {est:>8.3f} {se:>8.4f} {ci}')
print(f'\n(c) Optimal allocation:')
print(f' N* = {N_star:.2e} parameters')
print(f' D* = {D_star:.2e} tokens')
check_true('N* in plausible range (1e9 to 1e12)', 1e9 < N_star < 1e12)
print(f'\n(d) Sensitivity to alpha uncertainty (±{perr[3]:.4f}):')
print(f' N*(alpha - 1SE): {N_star_lo:.2e}')
print(f' N*(alpha + 1SE): {N_star_hi:.2e}')
print(f' Ratio: {N_star_hi/N_star_lo:.2f}x variation in N* per 1-sigma change in alpha')
print('\nTakeaway: Scaling law estimation via MLE enables compute-optimal training allocation')
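# A minimal sketch (not part of the reference solution), assuming popt, C_budget and
# N_star from the solution above. Setting dL/dN = 0 with D = C/(6N) gives a closed-form
# compute-optimal N*:  N* = (alpha*A / (beta*B))^(1/(alpha+beta)) * (C/6)^(beta/(alpha+beta)).
_, A_f, B_f, alpha_f, beta_f = popt
expo = beta_f / (alpha_f + beta_f)
N_closed = (alpha_f * A_f / (beta_f * B_f))**(1 / (alpha_f + beta_f)) * (C_budget / 6)**expo
print(f'closed-form N* = {N_closed:.3e}  (grid search above gave {N_star:.3e})')
print(f'N* scales as C^{expo:.3f} for the fitted exponents (Chinchilla reports roughly C^0.5)')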
Exercise 9 ★★★ — Bootstrap Confidence Interval for Model Accuracy
A classifier is evaluated on $n = 600$ held-out examples with binary correctness indicators.
Part (a): Estimate accuracy.
Part (b): Build a percentile bootstrap 95% confidence interval.
Part (c): Compare with the normal approximation interval.
Part (d): Explain when bootstrap intervals are more trustworthy than asymptotic formulas.
Code cell 29
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 30
# Solution
n = 600
true_acc = 0.82
correct = (np.random.rand(n) < true_acc).astype(float)
acc_hat = correct.mean()
B = 3000
boot = np.array([np.random.choice(correct, size=n, replace=True).mean() for _ in range(B)])
ci_boot = np.percentile(boot, [2.5, 97.5])
se = np.sqrt(acc_hat * (1 - acc_hat) / n)
ci_norm = acc_hat + np.array([-1.96, 1.96]) * se
header('Exercise 9: Bootstrap CI for Accuracy')
print(f'accuracy estimate: {acc_hat:.4f}')
print(f'bootstrap 95% CI: [{ci_boot[0]:.4f}, {ci_boot[1]:.4f}]')
print(f'normal 95% CI: [{ci_norm[0]:.4f}, {ci_norm[1]:.4f}]')
check_true('Both intervals contain the estimate', ci_boot[0] < acc_hat < ci_boot[1] and ci_norm[0] < acc_hat < ci_norm[1])
check_true('Intervals have similar width here', abs((ci_boot[1]-ci_boot[0]) - (ci_norm[1]-ci_norm[0])) < 0.02)
print('Takeaway: bootstrap treats the empirical sample as a reusable proxy population.')
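# A minimal sketch (not part of the reference solution), assuming acc_hat from the solution
# above: invert the normal-approximation CI width to estimate how many held-out examples
# are needed for a target half-width -- a common planning question when sizing eval sets.
target_half_width = 0.01
n_needed = int(np.ceil((1.96 / target_half_width)**2 * acc_hat * (1 - acc_hat)))
print(f'examples needed for a +/-{target_half_width} CI at accuracy ~{acc_hat:.2f}: {n_needed}')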
Exercise 10 ★★★ — Delta Method for Log-Odds
Let $\hat{p}$ be a sample proportion and define the log-odds $\hat{\eta} = \log\dfrac{\hat{p}}{1-\hat{p}}$.
Part (a): Derive the delta-method standard error for $\hat{\eta}$.
Part (b): Compare the delta-method interval with a Monte Carlo sampling distribution.
Part (c): Explain why transformed estimands need transformed uncertainty, not only transformed point estimates.
Code cell 32
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 33
# Solution
from scipy import stats
n = 1000
p = 0.72
phat = np.random.binomial(n, p) / n
logit = lambda u: np.log(u / (1 - u))
gprime = lambda u: 1 / (u * (1 - u))
se_p = np.sqrt(phat * (1 - phat) / n)
se_logit = abs(gprime(phat)) * se_p
ci_delta = logit(phat) + np.array([-1.96, 1.96]) * se_logit
reps = 20000
phats = np.random.binomial(n, p, size=reps) / n
logits = logit(np.clip(phats, 1e-6, 1 - 1e-6))
ci_mc = np.percentile(logits, [2.5, 97.5])
header('Exercise 10: Delta Method for Log-Odds')
print(f'phat={phat:.4f}, logit(phat)={logit(phat):.4f}')
print(f'delta SE={se_logit:.4f}')
print(f'delta CI: [{ci_delta[0]:.4f}, {ci_delta[1]:.4f}]')
print(f'MC CI: [{ci_mc[0]:.4f}, {ci_mc[1]:.4f}]')
check_true('Delta and MC intervals are close', np.max(np.abs(ci_delta - ci_mc)) < 0.08)
print('Takeaway: the delta method propagates estimator uncertainty through nonlinear transformations.')
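# A minimal sketch (not part of the reference solution), assuming ci_delta from the solution
# above: mapping the log-odds interval back through the inverse logit gives a CI for p that
# always stays inside (0, 1), unlike a plain Wald interval on the probability scale.
from scipy.special import expit
ci_p_from_logit = expit(ci_delta)
print(f'logit-based 95% CI for p: [{ci_p_from_logit[0]:.4f}, {ci_p_from_logit[1]:.4f}]')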
What to Review After Finishing
- Bias-variance decomposition — can you prove MSE = Bias² + Var from scratch?
- CRB proof — key step is differentiating the unbiasedness condition and applying Cauchy-Schwarz
- MLE derivation — can you find the score equation and verify it is a maximum for Bernoulli, Poisson, Gaussian?
- Asymptotic normality — key steps: Taylor expand score, apply CLT to numerator, LLN to denominator
- Bootstrap CI — understand what distribution you are resampling from (empirical CDF)
- FIM structure — why does $F \approx G \otimes A$ hold for a linear softmax layer?
- Scaling laws — why is nonlinear least squares equivalent to MLE under Gaussian noise?
- Bootstrap CI for model accuracy — can you connect the computation to statistical ML practice?
- Delta method for log-odds — can you connect the computation to statistical ML practice?
References
- Casella & Berger (2002). Statistical Inference, 2nd ed. — complete treatment of all exercises
- Hoffmann et al. (2022). 'Training Compute-Optimal Large Language Models' — Exercise 8
- Kirkpatrick et al. (2017). 'Overcoming Catastrophic Forgetting in Neural Networks' (EWC) — Fisher information for continual learning
- Martens & Grosse (2015). 'Optimizing Neural Networks with Kronecker-Factored Approximate Curvature' (K-FAC) — Exercise 7
- Efron (1979). 'Bootstrap Methods: Another Look at the Jackknife' — Exercise 5