Part 2

29 min read18 headingsSplit lesson page

Lesson overview | Previous part | Next part

Estimation Theory: Part 6: Method of Moments to Appendix L: Worked Examples - End-to-End Estimation

6. Method of Moments

6.1 The Method of Moments

The method of moments (MoM), introduced by Karl Pearson (1894), is the oldest systematic estimation procedure. The idea is simple: match sample moments to population moments and solve for $\boldsymbol{\theta}$ .

Definition 6.1 (Method of Moments Estimator). For a $k$ -parameter model, the MoM estimator solves the system:

\hat{\mu}_j = \mathbb{E}[X^j;\boldsymbol{\theta}], \quad j = 1, \ldots, k

where $\hat{\mu}_j = \frac{1}{n}\sum_{i=1}^n (x^{(i)})^j$ is the $j$ -th sample moment.

Example: Normal distribution. For $\mathcal{N}(\mu, \sigma^2)$ with 2 parameters:

1st moment equation: $\bar{x} = \mathbb{E}[X] = \mu \implies \hat{\mu} = \bar{x}$
2nd moment equation: $\frac{1}{n}\sum x_i^2 = \mathbb{E}[X^2] = \mu^2 + \sigma^2 \implies \hat{\sigma}^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$

For the Gaussian, MoM and MLE coincide.

Example: Gamma distribution. $X \sim \text{Gamma}(\alpha, \beta)$ with mean $\alpha/\beta$ and variance $\alpha/\beta^2$ .

$\hat{\mu}_1 = \bar{x} = \alpha/\beta$
$\hat{\mu}_2 - \hat{\mu}_1^2 = s^2 = \alpha/\beta^2$

Solving: $\hat{\beta} = \bar{x}/s^2$ , $\hat{\alpha} = \bar{x}^2/s^2$ . Unlike the Gaussian case, the Gamma MLE requires numerical optimisation, making MoM attractive for quick estimation.

Example: Beta distribution. $X \sim \text{Beta}(\alpha, \beta)$ with mean $\alpha/(\alpha+\beta)$ and variance $\alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]$ .

Setting $m = \bar{x}$ and $v = s^2$ , solving the moment equations:

\hat{\alpha} = m\left(\frac{m(1-m)}{v} - 1\right), \quad \hat{\beta} = (1-m)\left(\frac{m(1-m)}{v} - 1\right)

This requires $v < m(1-m)$ (the variance must be less than the theoretical maximum for the Beta).

Properties of MoM:

Consistent: by the LLN, sample moments converge to population moments; moment equations are continuous, so MoM estimators are consistent
Asymptotically normal: by the multivariate CLT and delta method
Inefficient: generally does not achieve the CRB; MLE has smaller asymptotic variance when it exists in closed form

6.2 Generalised Method of Moments

The Generalised Method of Moments (GMM), introduced by Hansen (1982), allows for more moment conditions than parameters.

Suppose we have $m$ moment conditions $\mathbb{E}[g(\mathbf{x};\boldsymbol{\theta})] = \mathbf{0}$ , where $g: \mathbb{R}^d \times \Theta \to \mathbb{R}^m$ and $m \geq k$ (over-identified system).

Sample moment conditions: $\hat{\mathbf{g}}_n(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^n g(\mathbf{x}^{(i)};\boldsymbol{\theta})$

When $m > k$ , we cannot set all $m$ conditions to zero simultaneously. The GMM estimator minimises the weighted quadratic form:

\hat{\boldsymbol{\theta}}_{\text{GMM}} = \arg\min_{\boldsymbol{\theta}} \hat{\mathbf{g}}_n(\boldsymbol{\theta})^\top W \hat{\mathbf{g}}_n(\boldsymbol{\theta})

for some positive-definite weighting matrix $W$ . The optimal weight matrix (minimising asymptotic variance) is $W^* = S^{-1}$ where $S = \operatorname{Var}(g(\mathbf{x};\boldsymbol{\theta}^*))$ .

GMM is widely used in econometrics. In ML, it appears implicitly when models are estimated by matching distributional moments rather than maximising likelihood.

6.3 MoM vs MLE

Property	Method of Moments	Maximum Likelihood
Computational	Often closed-form	May require numerical optimisation
Asymptotic efficiency	Generally inefficient	Efficient (achieves CRB)
Consistency	Yes	Yes (under regularity conditions)
Robustness	Depends only on moments	Can be sensitive to tail misspecification
Applicability	Works even when likelihood is intractable	Requires tractable likelihood
Parameter constraints	Can produce invalid estimates (e.g., $\hat{\sigma}^2 < 0$ )	Respects constraints if optimised on $\Theta$

When to prefer MoM:

The likelihood is intractable (latent variable models without EM)
Speed matters more than efficiency
Only the first few moments are of interest
The distributional assumption may be wrong (moments are more robust)

When to prefer MLE:

Full likelihood is tractable and the model is well-specified
Maximum statistical efficiency is needed
Regularity conditions are satisfied
A standard reference distribution is being fitted

7. Confidence Intervals

7.1 The Frequentist Interpretation

Definition 7.1 (Confidence Interval). A random interval $[\hat{\theta}_L(\mathbf{X}), \hat{\theta}_U(\mathbf{X})]$ is a $(1-\alpha)$ confidence interval for $\theta$ if:

P_\theta\!\left(\hat{\theta}_L(\mathbf{X}) \leq \theta \leq \hat{\theta}_U(\mathbf{X})\right) \geq 1-\alpha \quad \text{for all } \theta \in \Theta

The key is that $\hat{\theta}_L$ and $\hat{\theta}_U$ are random (functions of the data $\mathbf{X}$ ), while $\theta$ is fixed (unknown but not random in the frequentist framework).

The correct interpretation: If we repeat the experiment many times and construct a 95% CI each time, 95% of those intervals will contain the true $\theta$ . This specific computed interval either contains $\theta$ or it does not - there is no 95% probability attached to this particular interval after observing data.

Common misconceptions:

Incorrect statement	Why it's wrong
"There is a 95% probability that $\theta \in [2.1, 3.4]$ ."	After computing the interval, $\theta$ is either in it or not - no probability remains
"95% of the data lies within the CI."	CIs are about the parameter, not the data distribution
"If I repeat the experiment, my estimate will be in this CI 95% of the time."	The CI is about the parameter being covered, not about future estimates
"A wider CI means I'm 95% more confident."	Width affects precision but all 95% CIs have the same coverage probability

The Bayesian analogue - "there is a 95% probability that $\theta$ is in this interval" - is a credible interval and requires a prior distribution on $\theta$ . The full treatment is in Section04 Bayesian Inference.

7.2 Exact Confidence Intervals via Pivoting

The pivotal method constructs CIs by finding a pivot: a function of data and parameter whose distribution is known and does not depend on $\theta$ .

Gaussian mean, $\sigma^2$ known. Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ .

The pivot is $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1)$ .

P\!\left(-z_{\alpha/2} \leq \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \leq z_{\alpha/2}\right) = 1 - \alpha

Solving for $\mu$ : the $(1-\alpha)$ CI is $\left[\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right]$ .

Gaussian mean, $\sigma^2$ unknown. Replacing $\sigma$ with the sample standard deviation $s$ changes the distribution: $T = \frac{\bar{x}-\mu}{s/\sqrt{n}} \sim t_{n-1}$ (Student's $t$ with $n-1$ degrees of freedom).

$(1-\alpha)$ CI: $\left[\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}},\; \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right]$

As $n \to \infty$ , $t_{n-1,\alpha/2} \to z_{\alpha/2}$ (Student's $t$ converges to standard normal).

STUDENT'S t DISTRIBUTION vs. NORMAL
========================================================================

  p(t)
   |
   |            --- Normal(0,1)
   |          ==== t(df=5)
   |        - - - t(df=1) [Cauchy]
   |
   |          +-----+         Heavier tails of t distribution
   |         ++     ++        reflect additional uncertainty
   |        ++       ++       from estimating \\sigma.
   |      --+         +--
   |--------------------------- t
       -3  -2  -1   0   1   2   3

  For df \\geq 30, t \\approx Normal - practical "large sample" threshold.

========================================================================

Gaussian variance CI. The pivot is $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$ , giving the CI:

\left[\frac{(n-1)s^2}{\chi^2_{n-1, \alpha/2}},\; \frac{(n-1)s^2}{\chi^2_{n-1, 1-\alpha/2}}\right]

Note this CI is asymmetric because the $\chi^2$ distribution is skewed.

7.3 Asymptotic Confidence Intervals

When exact pivots are unavailable, use the asymptotic normality of MLE: for large $n$ , $\hat{\theta}_{\text{MLE}} \approx \mathcal{N}(\theta^*, \frac{1}{n}\mathcal{I}(\theta^*)^{-1})$ .

Wald interval. Plugging in the MLE for the unknown $\theta^*$ in the Fisher information:

\hat{\theta}_{\text{MLE}} \pm z_{\alpha/2} \cdot \frac{1}{\sqrt{n\mathcal{I}(\hat{\theta}_{\text{MLE}})}}

For Bernoulli with $n$ observations: $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$ .

Limitations of the Wald interval near boundaries. For $p$ near 0 or 1, the Wald interval can extend outside $[0,1]$ . The Wilson score interval is better-behaved:

\frac{\hat{p} + z^2/(2n)}{1 + z^2/n} \pm \frac{z\sqrt{\hat{p}(1-\hat{p})/n + z^2/(4n^2)}}{1 + z^2/n}

where $z = z_{\alpha/2}$ . This is the standard CI for proportions in ML evaluation.

Likelihood ratio interval. Based on the likelihood ratio statistic $\Lambda = 2[\ell(\hat{\theta}) - \ell(\theta_0)]$ , which satisfies $\Lambda \xrightarrow{d} \chi^2_1$ under $H_0: \theta = \theta_0$ . The CI is the set of $\theta_0$ values that pass the test:

\text{CI}_{1-\alpha} = \{\theta : 2[\ell(\hat{\theta}) - \ell(\theta)] \leq \chi^2_{1,\alpha}\}

This is often more accurate than the Wald interval for small $n$ .

7.4 Bootstrap Confidence Intervals

The bootstrap (Efron, 1979) is a non-parametric resampling method for constructing CIs without distributional assumptions.

Non-parametric bootstrap algorithm:

From the original sample $\mathcal{X} = \{x^{(1)}, \ldots, x^{(n)}\}$ , draw $B$ bootstrap samples $\mathcal{X}^*_1, \ldots, \mathcal{X}^*_B$ (each of size $n$ , with replacement)
Compute the statistic $\hat{\theta}^*_b = T(\mathcal{X}^*_b)$ for each bootstrap sample
Estimate the sampling distribution of $\hat{\theta}$ by the empirical distribution of $\{\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B\}$

Percentile bootstrap CI: Use the $(\alpha/2)$ and $(1-\alpha/2)$ quantiles of $\{\hat{\theta}^*_b\}$ as CI endpoints.

BCa (bias-corrected and accelerated) bootstrap: Adjusts for bias in the bootstrap distribution and non-constant acceleration. Typically more accurate than the percentile method for small $n$ .

When to use bootstrap:

The sampling distribution of $\hat{\theta}$ is non-standard (no closed-form pivot)
The distributional assumptions for parametric CIs are suspect
The statistic is complex (e.g., ratio of two MLEs, AUC, BLEU score)
$n$ is moderate and asymptotic approximations are poor

Bootstrap for ML evaluation (example). To compute a 95% CI for test accuracy:

Let the test set be $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$ ; accuracy $\hat{a} = \frac{1}{n}\sum \mathbb{1}[\hat{y}^{(i)} = y^{(i)}]$
Bootstrap: resample with replacement $B = 2000$ times, compute accuracy on each bootstrap test set
CI = [2.5th percentile, 97.5th percentile] of bootstrap accuracies

7.5 Confidence Intervals in ML Evaluation

Understanding CI coverage for ML evaluation prevents erroneous conclusions.

Binary accuracy. Test accuracy $\hat{a}$ with $n$ examples and $k$ correct. Wilson score CI:

\text{CI}_{95\%} \approx \hat{a} \pm 1.96\sqrt{\hat{a}(1-\hat{a})/n}

For $n = 1000$ , $\hat{a} = 0.85$ : CI = $[0.828, 0.872]$ - width $\pm 2.2\%$ . For $n = 100$ : CI = $[0.780, 0.920]$ - width $\pm 7\%$ .

This shows why benchmark results on $n < 500$ test examples are statistically unreliable.

Comparing two models. If Model A achieves $\hat{a}_A = 0.852$ and Model B achieves $\hat{a}_B = 0.847$ on the same $n = 1000$ test examples, is A significantly better? The difference $\Delta = 0.005$ with standard error $\approx \sqrt{\hat{a}_A(1-\hat{a}_A)/n + \hat{a}_B(1-\hat{a}_B)/n} \approx 0.011$ gives a 95% CI for $\Delta$ of $[-0.017, +0.027]$ , which includes zero. The models are not statistically distinguishable.

McNemar's test is more powerful than comparing independent CIs when models are evaluated on the same examples (the paired structure reduces variance). See Section03 Hypothesis Testing for the full treatment.

Multiple benchmark correction. When evaluating on $k$ benchmarks, the probability of at least one spuriously significant result rises. This is the multiple comparison problem - addressed fully in Section03. For reporting CIs, Bonferroni correction replaces $\alpha$ with $\alpha/k$ , widening each CI to maintain family-wise coverage.

8. Asymptotic Theory

8.1 Convergence Modes

Before stating the asymptotic normality theorem, we review the convergence modes used.

Almost-sure (a.s.) convergence ( $\xrightarrow{a.s.}$ ): $P(\lim_{n\to\infty} X_n = X) = 1$ . The sequence converges on every sample path except a set of probability zero. This is the strongest mode; the Strong LLN gives $\bar{x}_n \xrightarrow{a.s.} \mu$ .

Convergence in probability ( $\xrightarrow{P}$ ): For every $\varepsilon > 0$ , $P(|X_n - X| > \varepsilon) \to 0$ . Weaker than a.s. convergence. The Weak LLN gives $\bar{x}_n \xrightarrow{P} \mu$ . Sufficient for consistency.

Convergence in distribution ( $\xrightarrow{d}$ ): $F_{X_n}(t) \to F_X(t)$ at every continuity point of $F_X$ . The weakest mode; the CLT gives $\sqrt{n}(\bar{x}_n - \mu)/\sigma \xrightarrow{d} \mathcal{N}(0,1)$ . Used for asymptotic normality.

Key theorems for manipulating convergence:

Slutsky's Theorem. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$ (a constant), then:

$X_n + Y_n \xrightarrow{d} X + c$
$X_n \cdot Y_n \xrightarrow{d} c \cdot X$
$X_n / Y_n \xrightarrow{d} X/c$ (if $c \neq 0$ )

Continuous Mapping Theorem. If $X_n \xrightarrow{P} X$ and $g$ is continuous, then $g(X_n) \xrightarrow{P} g(X)$ . Same holds for $\xrightarrow{d}$ .

Application: The MLE $\hat{\theta} \xrightarrow{P} \theta^*$ (consistency) implies $\mathcal{I}(\hat{\theta}) \xrightarrow{P} \mathcal{I}(\theta^*)$ by the continuous mapping theorem (assuming $\mathcal{I}$ is continuous). Then by Slutsky: $\sqrt{n\mathcal{I}(\hat{\theta})}(\hat{\theta} - \theta^*) \xrightarrow{d} \mathcal{N}(0,1)$ .

8.2 The Delta Method

The delta method extends the CLT to smooth functions of asymptotically normal estimators.

Theorem 8.2 (Delta Method - Scalar). Suppose $\sqrt{n}(\hat{\theta} - \theta^*) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\theta^*$ with $g'(\theta^*) \neq 0$ . Then:

\sqrt{n}\left(g(\hat{\theta}) - g(\theta^*)\right) \xrightarrow{d} \mathcal{N}\!\left(0,\, [g'(\theta^*)]^2 \sigma^2\right)

Proof sketch: By differentiability, $g(\hat{\theta}) \approx g(\theta^*) + g'(\theta^*)(\hat{\theta} - \theta^*)$ . Multiply through by $\sqrt{n}$ : $\sqrt{n}(g(\hat{\theta}) - g(\theta^*)) \approx g'(\theta^*) \cdot \sqrt{n}(\hat{\theta} - \theta^*) \xrightarrow{d} g'(\theta^*)\mathcal{N}(0,\sigma^2) = \mathcal{N}(0, [g'(\theta^*)]^2\sigma^2)$ .

Examples:

CI for log-odds. For Bernoulli with $\hat{p} = \bar{x}$ , the log-odds are $\theta = \log[p/(1-p)]$ . The MLE is $\hat{\theta} = \log[\hat{p}/(1-\hat{p})]$ . The asymptotic variance: $g(p) = \log[p/(1-p)]$ , $g'(p) = 1/[p(1-p)]$ . So:

\operatorname{Var}(\hat{\theta}) \approx \frac{1}{n} \cdot \frac{1}{[p(1-p)]^2} \cdot p(1-p) = \frac{1}{np(1-p)}

CI for standard deviation. Given $\hat{\sigma}^2_{\text{MLE}} \approx \mathcal{N}(\sigma^2, 2\sigma^4/n)$ (asymptotic variance of variance estimator), the delta method with $g(v) = \sqrt{v}$ , $g'(v) = 1/(2\sqrt{v})$ :

\operatorname{Var}(\hat{\sigma}) \approx \frac{1}{4\sigma^2} \cdot \frac{2\sigma^4}{n} = \frac{\sigma^2}{2n}

Multivariate delta method. If $\sqrt{n}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}^*) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma)$ and $g: \mathbb{R}^k \to \mathbb{R}^m$ is differentiable:

\sqrt{n}(g(\hat{\boldsymbol{\theta}}) - g(\boldsymbol{\theta}^*)) \xrightarrow{d} \mathcal{N}\!\left(\mathbf{0},\, J_g \Sigma J_g^\top\right)

where $J_g$ is the $m \times k$ Jacobian of $g$ at $\boldsymbol{\theta}^*$ .

8.3 Asymptotic Normality of MLE

Theorem 8.3 (Asymptotic Normality of MLE). Under the Cramer-Rao regularity conditions, if $\hat{\boldsymbol{\theta}}_n$ is the MLE based on $n$ iid observations:

\sqrt{n}\left(\hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta}^*\right) \xrightarrow{d} \mathcal{N}\!\left(\mathbf{0},\, \mathcal{I}(\boldsymbol{\theta}^*)^{-1}\right)

Proof sketch for the scalar case. Taylor-expand the score equation around $\boldsymbol{\theta}^*$ :

0 = \frac{1}{\sqrt{n}} S_n(\hat{\theta}) \approx \frac{1}{\sqrt{n}} S_n(\theta^*) + \frac{1}{n} S_n'(\theta^*) \cdot \sqrt{n}(\hat{\theta} - \theta^*)

Rearranging:

\sqrt{n}(\hat{\theta} - \theta^*) \approx -\frac{\frac{1}{\sqrt{n}}S_n(\theta^*)}{\frac{1}{n}S_n'(\theta^*)}

Numerator: $\frac{1}{\sqrt{n}}S_n(\theta^*) = \frac{1}{\sqrt{n}}\sum_i s(\theta^*; x^{(i)})$ . By CLT (scores are iid with mean 0 and variance $\mathcal{I}(\theta^*)$ ): $\frac{1}{\sqrt{n}}S_n(\theta^*) \xrightarrow{d} \mathcal{N}(0, \mathcal{I}(\theta^*))$ .

Denominator: $\frac{1}{n}S_n'(\theta^*) = \frac{1}{n}\sum_i \frac{\partial^2 \log p}{\partial\theta^2}\big|_{\theta^*}$ . By LLN: $\frac{1}{n}S_n'(\theta^*) \xrightarrow{P} \mathbb{E}\!\left[\frac{\partial^2 \log p}{\partial\theta^2}\right] = -\mathcal{I}(\theta^*)$ .

By Slutsky: $\sqrt{n}(\hat{\theta} - \theta^*) \xrightarrow{d} \mathcal{N}(0, \mathcal{I}(\theta^*)) \cdot \frac{-1}{-\mathcal{I}(\theta^*)} = \mathcal{N}(0, \mathcal{I}(\theta^*)^{-1})$ . $\square$

Implications for practice:

Large sample CI: For $n \geq 50$ (rule of thumb), the $(1-\alpha)$ CI is $\hat{\theta} \pm z_{\alpha/2}/\sqrt{n\mathcal{I}(\hat{\theta})}$
Convergence rate: The MLE error is $O(1/\sqrt{n})$ - doubling data halves the standard error
Efficiency: The asymptotic covariance $\mathcal{I}^{-1}/n$ achieves the CRB - no consistent estimator is asymptotically better
Standard errors in ML: The standard errors reported in logistic regression, GLMs, and survival models are all based on this theorem, using the observed or expected Fisher information

8.4 Misspecified Models

In practice, the model $\{p(\cdot;\boldsymbol{\theta})\}$ rarely contains the true data-generating distribution $p^*$ . What happens when the model is misspecified?

Theorem 8.4 (MLE under Misspecification). Under regularity conditions, even when $p^* \notin \mathcal{M}$ , the MLE converges to:

\hat{\boldsymbol{\theta}}_{\text{MLE}} \xrightarrow{P} \boldsymbol{\theta}^{**} = \arg\min_{\boldsymbol{\theta} \in \Theta} D_{\mathrm{KL}}(p^* \| p(\cdot;\boldsymbol{\theta}))

The "pseudo-true parameter" $\boldsymbol{\theta}^{**}$ is the closest point in the model family to the truth in KL divergence. The asymptotic distribution under misspecification is the sandwich estimator:

\sqrt{n}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}^{**}) \xrightarrow{d} \mathcal{N}\!\left(\mathbf{0},\, \mathcal{I}^{-1} V \mathcal{I}^{-1}\right)

where $V = \operatorname{Var}[s(\boldsymbol{\theta}^{**}; X)]$ is the actual variance of the score (not equal to $\mathcal{I}$ when misspecified). Under correct specification, $V = \mathcal{I}$ and the sandwich collapses to $\mathcal{I}^{-1}$ .

Implications for LLM training: Language models trained by NLL minimisation are nearly always misspecified (the model cannot perfectly represent natural language). The trained model converges to the KL-minimising distribution within the model class. This is why language models exhibit characteristic failure modes related to the model architecture's inductive biases - they converge to the nearest representable distribution, not the true distribution.

9. Applications in Machine Learning

9.1 MLE Is Cross-Entropy Training

We derived in Section5.4 that MLE = cross-entropy minimisation. Here we examine the consequences.

NLL loss for common model types:

| Task | Model $p(y|\mathbf{x};\boldsymbol{\theta})$ | NLL loss | Standard name | | --- | --- | --- | --- | | Binary classification | $\operatorname{Bern}(\sigma(\mathbf{w}^\top\mathbf{x}))$ | $-y\log\hat{p} - (1-y)\log(1-\hat{p})$ | Binary cross-entropy | | Multi-class | $\operatorname{Cat}(\operatorname{softmax}(\mathbf{W}\mathbf{x}))$ | $-\sum_k y_k\log\hat{p}_k$ | Categorical cross-entropy | | Regression | $\mathcal{N}(\mathbf{w}^\top\mathbf{x}, \sigma^2)$ | $\frac{(y-\hat{y})^2}{2\sigma^2}$ | MSE (up to constant) | | Language model | $\prod_t p_\theta(x_t|x_{<t})$ | $-\frac{1}{T}\sum_t\log p_\theta(x_t|x_{<t})$ | NLL / perplexity |

Temperature scaling for calibration. After training a classifier, its softmax probabilities are often overconfident (Guo et al., 2017). Temperature scaling finds $T^* = \arg\min_T \mathcal{L}_{\text{NLL}}(\boldsymbol{\theta}, T)$ where logits are divided by $T$ . This is a 1-parameter MLE problem on a calibration set, provably maintaining accuracy while improving Expected Calibration Error (ECE).

Label smoothing. Standard MLE on one-hot targets drives $\log p(y|\mathbf{x}) \to 0$ (i.e., $p \to 1$ ) for the correct class, making the model overconfident. Label smoothing (Szegedy et al., 2016) uses target $\tilde{y}_k = (1-\varepsilon)\mathbb{1}[k=y] + \varepsilon/K$ for small $\varepsilon$ (typically 0.1). This regularises MLE toward a mixture of the data distribution and uniform - reducing overconfidence without sacrificing accuracy.

9.2 Natural Gradient and Fisher Information

The problem with Euclidean gradient descent. The ordinary gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ is the direction of steepest ascent in Euclidean space. But this is parameterisation-dependent: rescaling parameters changes the gradient direction. Different parameterisations of the same model family give different gradient descent trajectories.

Information geometry. The space of probability distributions has a natural Riemannian metric - the Fisher information metric $g_{jk} = \mathcal{I}(\boldsymbol{\theta})_{jk}$ . In this curved space, the steepest ascent direction is:

\tilde{\nabla}_{\boldsymbol{\theta}} \mathcal{L} = \mathcal{I}(\boldsymbol{\theta})^{-1} \nabla_{\boldsymbol{\theta}} \mathcal{L}

Natural gradient descent (Amari, 1998):

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \mathcal{I}(\boldsymbol{\theta}_t)^{-1} \nabla_{\boldsymbol{\theta}} \mathcal{L}

Properties:

Parameterisation-invariant: changing parameterisation gives the same update in the function space
Adaptive: steps are small in directions of high curvature (where the loss changes rapidly), large in flat directions
Connection to Newton's method: natural gradient = Newton's method when the Hessian equals the FIM (true for GLMs at the maximum)

K-FAC (Martens & Grosse, 2015). Full FIM inversion is $O(p^3)$ - impossible for neural networks. K-FAC approximates $\mathcal{I}^{-1}$ layer-wise using the Kronecker structure:

\mathcal{I}^{[l]} \approx A^{[l-1]} \otimes G^{[l]}

where $A^{[l-1]} = \mathbb{E}[\mathbf{a}^{[l-1]}(\mathbf{a}^{[l-1]})^\top]$ and $G^{[l]} = \mathbb{E}[\delta^{[l]}(\delta^{[l]})^\top]$ . This reduces the cost from $O(p^3)$ to $O(d_{\text{in}}^3 + d_{\text{out}}^3)$ per layer, making natural gradient feasible for moderate-sized networks.

Shampoo (Gupta et al., 2018) is a related second-order optimiser that maintains full-matrix preconditioners per parameter group, used at Google-scale training.

9.3 Confidence Intervals for Model Evaluation

The evaluation uncertainty problem. A model is evaluated on a test set of $n$ examples. The reported accuracy $\hat{a}$ is an estimate of the true accuracy $a^*$ . Without a CI, it is impossible to determine if two models are genuinely different or differ only by chance.

Standard practice for ML papers:

Wilson score CI for binary metrics (accuracy, F1 per class)
Bootstrap CI for non-standard metrics (BLEU, ROUGE, perplexity)
Correct for multiple benchmarks using Bonferroni or BH correction

Confidence interval for AUC. The Area Under the ROC Curve is a function of ranks, not a simple proportion. Its asymptotic distribution follows the Wilcoxon-Mann-Whitney form, giving a CI:

\widehat{\text{AUC}} \pm z_{\alpha/2} \sqrt{\frac{\widehat{\text{AUC}}(1-\widehat{\text{AUC}}) + (n_+-1)(Q_1-\widehat{\text{AUC}}^2) + (n_--1)(Q_2-\widehat{\text{AUC}}^2)}{n_+ n_-}}

For complex metrics like this, bootstrap is typically preferred.

LLM benchmark confidence. Major benchmarks (MMLU, HumanEval, HellaSwag) have 1000-14000 questions. For MMLU ( $n \approx 14000$ ), the 95% Wilson CI for an accuracy of 70% is approximately $\pm 0.76\%$ . For HumanEval ( $n = 164$ ), the CI is $\pm 7.0\%$ - models within 7 percentage points cannot be reliably ranked.

9.4 Estimating Scaling Laws

Scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) describe how model performance $L$ depends on model size $N$ (parameters), training tokens $D$ , and compute $C$ . These are regression problems - a direct application of estimation theory.

Chinchilla scaling law (Hoffmann et al., 2022). The loss function is modelled as:

L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}

where $E, A, B, \alpha, \beta$ are estimated by nonlinear least squares MLE on $(N, D, L)$ observations from hundreds of training runs. The fitted parameters ( $\alpha \approx 0.34$ , $\beta \approx 0.28$ ) are MLE estimates with associated uncertainty.

The optimal allocation. For a fixed compute budget $C = 6ND$ , the optimal $(N^*, D^*)$ is found by constrained optimisation of the fitted loss model. Chinchilla showed that LLaMA-1 (Touvron et al., 2023) and GPT-3 were undertrained relative to their model size - the optimal allocation uses smaller models trained on more tokens.

For AI: The Chinchilla scaling law computation is MLE applied at the scale of frontier AI research. The "parameters" being estimated ( $\alpha, \beta, A, B, E$ ) are 5 numbers that determine the optimal architecture and training strategy for models costing millions of dollars to train.

9.5 Fisher Information in Fine-Tuning

Elastic Weight Consolidation (EWC, Kirkpatrick et al., 2017). In continual learning, a model trained sequentially on tasks $T_1, T_2, \ldots$ tends to forget earlier tasks - catastrophic forgetting. EWC prevents this by penalising changes to weights that were important for task $T_1$ :

\mathcal{L}_{\text{EWC}}(\boldsymbol{\theta}) = \mathcal{L}_{T_2}(\boldsymbol{\theta}) + \frac{\lambda}{2} \sum_j F_j (\theta_j - \theta^*_{1,j})^2

where $F_j$ is the $j$ -th diagonal element of the Fisher information matrix computed on task $T_1$ 's data: $F_j = \frac{1}{n}\sum_i (\partial \log p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta}^*_1)/\partial\theta_j)^2$ .

Parameters with high $F_j$ are those to which the task-1 likelihood is most sensitive - changing them most damages task-1 performance. EWC protects these parameters with strong quadratic penalties.

LoRA as low-rank MLE (Hu et al., 2022). LoRA fine-tunes large models by constraining weight updates to be low-rank: $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ , $A \in \mathbb{R}^{r \times k}$ , and $r \ll \min(d,k)$ . This is MLE with a rank constraint on the parameter update - the MLE over a restricted parameter manifold. The rank $r$ controls the trade-off between expressivity and the number of estimated parameters.

Optimal Brain Damage (LeCun et al., 1990). Neural network pruning by removing parameters $\theta_j$ that, according to a second-order Taylor expansion, increase the loss least. The per-parameter importance is approximately $\frac{1}{2} H_{jj} \theta_j^2$ , where $H_{jj}$ is the diagonal Hessian. Replacing $H_{jj}$ with $F_{jj}$ (diagonal FIM) recovers FIM-based pruning - estimation theory meets efficient neural architecture.

10. Common Mistakes

#	Mistake	Why It's Wrong	Fix
1	"The 95% CI means there is a 95% probability that $\theta \in [a,b]$ ."	After observing data, $\theta$ is either in $[a,b]$ or not - no randomness remains. The probability refers to the procedure, not this interval.	Say "In repeated experiments, 95% of such intervals cover $\theta$ ." Use Bayesian credible intervals if a posterior probability statement is needed.
2	Confusing bias with MSE, treating "unbiased" as synonymous with "accurate."	Low bias says nothing about variance or overall error. An unbiased estimator with high variance can have larger MSE than a biased estimator with low variance.	Always decompose MSE = Bias^2 + Var and consider both terms.
3	Using $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ as the unbiased variance estimator.	The $1/n$ estimator is the MLE and is biased; $\mathbb{E}[\hat{\sigma}^2] = (n-1)\sigma^2/n$ .	Use $s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2$ (Bessel's correction) for unbiased estimation.
4	Treating the MLE as automatically unbiased.	MLE is consistent and asymptotically efficient but not generally unbiased. The Gaussian variance MLE, Uniform maximum MLE, and many others are biased.	Compute $\mathbb{E}[\hat{\theta}_{\text{MLE}}]$ explicitly or use the delta method for composite parameters.
5	Applying the CRB to biased estimators using the unbiased form $\operatorname{Var} \geq 1/\mathcal{I}$ .	The CRB for biased estimators is $\operatorname{Var}(\hat{\theta}) \geq (1+b'(\theta))^2/\mathcal{I}(\theta)$ . Using the unbiased form gives an incorrect lower bound.	For biased estimators, always use the biased-estimator CRB, or switch to MSE rather than variance comparisons.
6	Using the Wald CI for proportions near 0 or 1 (e.g., $\hat{p} = 0.02$ , $n = 50$ ).	The Wald interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$ can extend below 0 or above 1, and has poor coverage near the boundary.	Use the Wilson score interval or Clopper-Pearson exact interval for proportions near 0 or 1.
7	Treating the Fisher information (expected curvature) as identical to the Hessian of the log-likelihood (observed curvature).	$\mathcal{I}(\theta) = -\mathbb{E}[H_{\log p}]$ equals the expected Hessian. For a specific sample, the observed Hessian $-H_{\log p}(\hat{\theta})$ differs. The observed Hessian is used in practice (faster) but they are equal only in expectation.	Use observed Hessian when computational speed is needed; expected FIM for theoretical comparisons. Both give consistent estimates of $\mathcal{I}(\theta^*)$ .
8	Applying the MLE invariance property in reverse: assuming $g(\hat{\theta})$ is unbiased for $g(\theta)$ because $\hat{\theta}$ is unbiased for $\theta$ .	MLE invariance says $\arg\max L(g(\theta)) = g(\hat{\theta}_{\text{MLE}})$ , but says nothing about bias. If $g$ is nonlinear, Jensen's inequality shows $\mathbb{E}[g(\hat{\theta})] \neq g(\mathbb{E}[\hat{\theta}]) = g(\theta)$ in general.	Use the delta method to compute the asymptotic bias of $g(\hat{\theta})$ and add a bias correction if needed.
9	Applying asymptotic ( $n \to \infty$ ) CI formulas for small $n$ (e.g., $n = 15$ ).	Asymptotic normality of MLE typically requires $n \geq 30$ - $100$ depending on the model. For small $n$ , the $t$ -distribution, exact intervals, or bootstrap should be used.	Use the $t$ -distribution CI for Gaussian mean with small $n$ ; bootstrap CI for non-standard statistics; exact binomial CI for small count data.
10	Using the sum of likelihoods $\sum L(\theta; x_i)$ instead of the product $\prod L(\theta; x_i)$ (or sum of log-likelihoods).	The likelihood function is the joint probability of the data - a product for independent observations, not a sum. Maximising the sum gives a different, generally inconsistent estimator.	Always write $\ell(\theta) = \sum_i \log p(x^{(i)};\theta)$ (log-likelihood = sum of log densities).

11. Exercises

Exercise 1 [*] MLE for Bernoulli and Poisson.

(a) Given $n = 10$ coin flips with outcomes $\{H, H, T, H, T, T, H, H, H, T\}$ (6 heads, 4 tails), compute $\hat{p}_{\text{MLE}}$ and its standard error $\sqrt{\hat{p}(1-\hat{p})/n}$ .

(b) For the Poisson model $\operatorname{Poisson}(\lambda)$ with $n$ iid observations summing to $S$ , derive the MLE $\hat{\lambda}$ by solving the score equation and verify it is a maximum (second derivative < 0).

(c) Using the MLE from (b) and the Fisher information $\mathcal{I}(\lambda) = 1/\lambda$ , verify the MLE achieves the CRB: $\operatorname{Var}(\hat{\lambda}) = \lambda/n = 1/(n\mathcal{I}(\lambda))$ .

Exercise 2 [*] Bias of the MLE variance estimator.

Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with both parameters unknown.

(a) Prove that $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \bar{x})^2$ has $\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2$ . (Hint: write $x^{(i)} - \bar{x}$ in terms of $x^{(i)} - \mu$ and $\bar{x} - \mu$ .)

(b) Show that the corrected estimator $s^2 = \frac{n}{n-1}\hat{\sigma}^2_{\text{MLE}}$ is unbiased.

(c) By MLE invariance, $\hat{\sigma}_{\text{MLE}} = \sqrt{\hat{\sigma}^2_{\text{MLE}}}$ . Is this an unbiased estimator of $\sigma$ ? If not, what is the sign of its bias? Use Jensen's inequality.

Exercise 3 [*] Cramer-Rao bound verification.

For $n$ iid observations from $\operatorname{Exp}(\lambda)$ with $p(x;\lambda) = \lambda e^{-\lambda x}$ :

(a) Compute the score $s(\lambda; x) = \partial \log p(x;\lambda)/\partial \lambda$ and verify $\mathbb{E}[s] = 0$ .

(b) Compute the Fisher information $\mathcal{I}(\lambda)$ using both the variance-of-score and negative-expected-Hessian formulas and verify they are equal.

(c) Show that $\hat{\lambda} = n/\sum_{i=1}^n x^{(i)}$ is the MLE and compute $\operatorname{Var}(\hat{\lambda})$ using the delta method applied to $g(\bar{x}) = 1/\bar{x}$ with $\operatorname{Var}(\bar{x}) = (1/\lambda^2)/n$ . Verify that $\operatorname{Var}(\hat{\lambda}) = \lambda^2/n = 1/(n\mathcal{I}(\lambda))$ - the CRB is achieved.

Exercise 4 [**] Method of Moments for the Beta distribution.

Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \operatorname{Beta}(\alpha, \beta)$ with moments $\mathbb{E}[X] = \alpha/(\alpha+\beta)$ and $\operatorname{Var}(X) = \alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]$ .

(a) Derive the MoM estimators $\hat{\alpha}$ and $\hat{\beta}$ in terms of $\bar{x}$ and $s^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ .

(b) Show that the MoM estimates are only valid when $s^2 < \bar{x}(1-\bar{x})$ . What is the statistical interpretation of this constraint?

(c) Generate $n = 200$ samples from $\operatorname{Beta}(2, 5)$ and compute both the MoM and the MLE (via numerical optimisation). Compare the estimates and their standard errors. Which is more efficient?

Exercise 5 [**] Bootstrap confidence interval.

Let $\mathbf{x} = \{3.1, 1.4, 2.5, 4.2, 1.8, 3.7, 2.1, 4.8, 1.2, 3.3\}$ (a sample of size 10).

(a) Compute the sample median $\tilde{x}$ and the 95% asymptotic CI for the median using the fact that, for a distribution with density $f(\tilde{x})$ at the true median, $\operatorname{Var}(\text{sample median}) \approx \frac{1}{4nf(\tilde{x})^2}$ .

(b) Implement the non-parametric bootstrap with $B = 2000$ resamples and compute the percentile bootstrap CI for the median. Compare the width to the asymptotic CI.

Exercise 6 [**] Asymptotic normality simulation.

Verify the asymptotic normality of the Bernoulli MLE empirically.

(a) For true parameter $p = 0.3$ and sample sizes $n \in \{10, 30, 100, 500\}$ , simulate $M = 5000$ MLEs $\hat{p}^{(m)} = \bar{x}^{(m)}$ and plot their distributions.

(b) For each $n$ , standardise: $Z^{(m)} = \frac{\hat{p}^{(m)} - 0.3}{\sqrt{0.3 \cdot 0.7/n}}$ and overlay the $\mathcal{N}(0,1)$ PDF. At what $n$ does the approximation appear adequate?

(c) Compute the empirical coverage of the 95% Wald CI at each $n$ . At what $n$ does coverage first reach 94%-96%? Does the CRB bound hold: is $n \cdot \operatorname{Var}(\hat{p})$ always $\geq 1/\mathcal{I}(0.3) = p(1-p)$ ?

Exercise 7 [***] Fisher information for a neural network layer.

Consider a one-layer softmax classifier $p(y|\mathbf{x};\mathbf{W}) = \operatorname{softmax}(\mathbf{W}\mathbf{x})_y$ for $y \in \{1,\ldots,K\}$ , where $\mathbf{W} \in \mathbb{R}^{K \times d}$ .

(a) Derive the score $\nabla_{\mathbf{W}} \log p(y|\mathbf{x};\mathbf{W}) = (\mathbf{e}_y - \hat{\mathbf{p}}) \mathbf{x}^\top$ where $\hat{\mathbf{p}} = \operatorname{softmax}(\mathbf{W}\mathbf{x})$ .

(b) Show that the Fisher information matrix (as a $Kd \times Kd$ matrix) is:

\mathcal{I}(\mathbf{W}) = \mathbb{E}_{\mathbf{x}}\!\left[(\operatorname{Diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^\top) \otimes \mathbf{x}\mathbf{x}^\top\right]

(c) Identify the two factors in the K-FAC approximation: $A = \mathbb{E}[\mathbf{x}\mathbf{x}^\top]$ (input covariance) and $G = \mathbb{E}[\operatorname{Diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^\top]$ (output covariance). Simulate this for $d = 10$ , $K = 5$ , and compare $A \otimes G$ to the full FIM.

Exercise 8 [***] Chinchilla scaling law estimation.

The Chinchilla model (Hoffmann et al., 2022) estimates language model loss as $L(N,D) = E + A/N^\alpha + B/D^\beta$ .

(a) Given synthetic training run data (simulate 50 runs with $(N, D)$ pairs and losses with added Gaussian noise), fit the 5 parameters $(E, A, B, \alpha, \beta)$ by nonlinear least squares using scipy.optimize.curve_fit.

(b) Compute 95% CIs for each parameter using the covariance matrix returned by curve_fit (which estimates $\sigma^2(J^\top J)^{-1}$ , where $J$ is the Jacobian at the solution).

(c) For a fixed compute budget $C = 6ND$ (FLOPs), find the optimal $(N^*, D^*)$ by minimising $L(N, C/(6N))$ over $N$ . How sensitive is this optimal allocation to uncertainty in $\alpha$ and $\beta$ ?

12. Why This Matters for AI (2026 Perspective)

Concept	AI/LLM Impact
MLE = cross-entropy minimisation	Every language model (GPT-4, LLaMA-3, Gemini, Claude) is trained by NLL minimisation; MLE provides the theoretical foundation for why this converges to the right distribution
Fisher information matrix	K-FAC (Martens & Grosse, 2015) enables tractable second-order optimisation for deep networks; Shampoo (Google) uses full-matrix FIM approximations in large-scale training
Natural gradient	Invariant to reparametrisation; theoretically optimal steepest direction in distribution space; practical approximations (K-FAC, SOAP) achieve closer-to-optimal convergence rates
Asymptotic normality of MLE	Justifies confidence intervals for model parameters; provides the basis for Laplace approximation in Bayesian neural networks (Section04)
Cramer-Rao bound	Fundamental limit on estimation from data; explains why larger datasets ( $\sim 1/\sqrt{n}$ ) always help; motivates data-efficient fine-tuning methods
Confidence intervals	Benchmark evaluation without CIs is scientifically misleading; Wilson CIs for accuracy, bootstrap CIs for composite metrics; the HELM/OpenLLM leaderboards report point estimates without CIs - a known limitation
Bias-variance decomposition	Theoretical foundation for regularisation in ML: L2/L1 regularisation = adding bias to reduce variance; early stopping, dropout, and weight decay all operate on this trade-off
Sufficient statistics	Exponential family sufficient statistics underlie variational autoencoders (encoder outputs $\boldsymbol{\mu}$ , $\boldsymbol{\sigma}$ of Gaussian posterior) and normalising flows
Scaling law estimation	Chinchilla (Hoffmann et al., 2022) uses nonlinear MLE to fit $(N, D, L)$ data; the estimated scaling exponents ( $\alpha \approx 0.34$ , $\beta \approx 0.28$ ) determine optimal model sizes at every compute budget
Elastic weight consolidation	FIM-based penalty prevents catastrophic forgetting in continual learning; applied in fine-tuning pipelines to protect pre-trained capabilities while adapting to new tasks
Temperature calibration	MLE on a calibration set finds the scalar $T$ that minimises NLL; well-calibrated models produce reliable uncertainty estimates for downstream decision-making
Bootstrap evaluation	Non-parametric CI construction for arbitrary metrics (BLEU, pass@k, ECE); essential for comparing models with complex evaluation protocols

13. Conceptual Bridge

Estimation theory is the bridge between probability and action. Probability theory (Chapter 6) defines random variables, distributions, and their properties - it answers "if we know the model, what data should we expect?" Estimation theory inverts this: "given data, what can we infer about the model?" This inversion from data to parameters is the core operation of every machine learning training algorithm.

The section you have just completed builds on Descriptive Statistics (Section01) in a crucial way: sample statistics like $\bar{x}$ and $s^2$ were introduced there as data summaries; here they are elevated to estimators - random variables with sampling distributions, bias, variance, and convergence properties. The concept of Bessel's correction ( $n-1$ denominator) was motivated in Section01 by practicality; here it is derived from the requirement of unbiasedness.

Looking forward to Hypothesis Testing (Section03): the confidence intervals of this section have a duality with hypothesis tests - rejecting $H_0: \theta = \theta_0$ at level $\alpha$ is equivalent to $\theta_0$ falling outside the $(1-\alpha)$ CI. The sampling distributions developed here (Gaussian, Student's $t$ , chi-squared, $F$ ) are the exact distributions that hypothesis tests use.

The next major extension is Bayesian Inference (Section04), which reframes the estimation problem by treating $\boldsymbol{\theta}$ as a random variable with a prior distribution $p(\boldsymbol{\theta})$ . The posterior $p(\boldsymbol{\theta}|\mathcal{D}) \propto L(\boldsymbol{\theta};\mathcal{D})\, p(\boldsymbol{\theta})$ combines the likelihood (covered here) with the prior. MAP estimation - maximising the posterior - is MLE regularised by the log-prior. L2 regularisation is MAP with a Gaussian prior; L1 regularisation is MAP with a Laplace prior. The full treatment of conjugate priors, credible intervals, and MCMC is in Section04.

ESTIMATION THEORY IN THE CURRICULUM
========================================================================

  Chapter 6 - Probability Theory
  |  Section01 Random Variables (distributions, CDF/PDF)
  |  Section04 Expectation and Moments (E[X], Var(X), CLT, LLN)
  |  Section03 Joint Distributions (Bayes' theorem)
  +------------------------------------------+
                                             | prerequisites
  Chapter 7 - Statistics                     v
  |  Section01 Descriptive Statistics    ----------------------------------
  |       sample mean, variance,   data summaries -> estimators
  |       covariance, correlation
  |
  +--> Section02 ESTIMATION THEORY (this section)
  |       - Point estimators: bias, variance, MSE
  |       - Cramer-Rao bound, Fisher information
  |       - MLE: derivation, properties, examples
  |       - Asymptotic normality, delta method
  |       - Confidence intervals (frequentist)
  |       - MLE = cross-entropy training
  |
  +--> Section03 Hypothesis Testing      (duality: CIs <-> tests)
  |       p-values, power, t/z/\\chi^2 tests, A/B testing
  |
  +--> Section04 Bayesian Inference      (extension: add prior to likelihood)
  |       posterior = likelihood x prior, MAP = regularised MLE
  |       conjugate priors, MCMC, variational inference
  |
  +--> Section05 Time Series             (sequential estimation, Kalman filter)
  |
  +--> Section06 Regression Analysis     (applied MLE: OLS, Ridge, Lasso)

  Chapter 8 - Optimisation         (natural gradient uses FIM)
  Chapter 9 - Information Theory   (FIM and Shannon information)

========================================================================

The Fisher information matrix is the central object connecting estimation theory outward: it bounds estimation variance (CRB), defines the geometry of natural gradient optimisation (Ch8), underpins the Laplace approximation in Bayesian inference (Section04), and is related to the Shannon capacity of statistical experiments (Ch9). Understanding the FIM deeply is understanding the mathematical infrastructure of modern ML training.

End of Section02 Estimation Theory

<- Back to Chapter 7: Statistics | Next: Hypothesis Testing ->

Appendix A: Exponential Families

Exponential families are the most important class of parametric models in statistics, unifying the Gaussian, Bernoulli, Poisson, Exponential, Gamma, Beta, and dozens of other distributions under one framework.

Definition A.1 (Exponential Family). A parametric family $\{p(x;\boldsymbol{\eta})\}$ is an exponential family if it can be written in the form:

p(x;\boldsymbol{\eta}) = h(x) \exp\!\left(\boldsymbol{\eta}^\top T(x) - A(\boldsymbol{\eta})\right)

where:

$\boldsymbol{\eta} \in \mathbb{R}^k$ is the natural (canonical) parameter
$T(x) \in \mathbb{R}^k$ is the sufficient statistic (vector of natural statistics)
$h(x) \geq 0$ is the base measure
$A(\boldsymbol{\eta}) = \log \int h(x) e^{\boldsymbol{\eta}^\top T(x)} dx$ is the log-partition function (ensures normalisation)

The log-partition function encodes all moments: $\nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \mathbb{E}[T(X)]$ and $\nabla^2_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \operatorname{Cov}(T(X))$ .

Verification examples:

Bernoulli. $p(x;p) = p^x(1-p)^{1-x}$ . Write as $\exp(x\log(p/(1-p)) + \log(1-p))$ . Natural parameter $\eta = \log(p/(1-p))$ (log-odds), sufficient statistic $T(x) = x$ , $A(\eta) = \log(1 + e^\eta)$ .

Gaussian. $\mathcal{N}(\mu,\sigma^2)$ : natural parameters $\boldsymbol{\eta} = (\mu/\sigma^2, -1/(2\sigma^2))$ , sufficient statistics $T(x) = (x, x^2)$ .

Key properties of exponential families for MLE:

MLE score equation: $\hat{\boldsymbol{\eta}}_{\text{MLE}}$ satisfies $\mathbb{E}_{\hat{\boldsymbol{\eta}}}[T(X)] = \frac{1}{n}\sum_i T(x^{(i)})$ - the MLE moment-matches the sufficient statistics.
MLE = MoM: For exponential families, MLE and method of moments coincide (they both set $\mathbb{E}[T] = \hat{T}$ ).
Efficient MLE: The MLE achieves the CRB for all exponential families - it is always efficient.
Log-likelihood is concave: For minimal exponential families, $\ell(\boldsymbol{\eta})$ is strictly concave - the MLE is the unique global maximum, and Newton-Raphson converges from any starting point.
Complete sufficient statistics: $T(x^{(1)}, \ldots, x^{(n)}) = \sum_i T(x^{(i)})$ is a complete sufficient statistic, so the MLE is the UMVUE by Lehmann-Scheffe.

Importance for ML: The categorical distribution (softmax output of neural networks), Gaussian (regression, VAE latent space), Bernoulli (binary classification), and Poisson (count data, Poisson regression) are all exponential families. The VAE encoder outputs the natural parameters $(\boldsymbol{\mu}, \boldsymbol{\sigma})$ of the Gaussian posterior - the sufficient statistics. Generalised linear models (GLMs, of which logistic regression is a special case) are defined precisely as: exponential family response + linear natural parameter $\boldsymbol{\eta} = \boldsymbol{W}\mathbf{x}$ .

Appendix B: Sufficient Statistics - Detailed Examples

B.1 The Exponential Distribution

For $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \operatorname{Exp}(\lambda)$ :

p(\mathbf{x};\lambda) = \lambda^n \exp\!\left(-\lambda \sum_{i=1}^n x^{(i)}\right) \cdot \mathbb{1}[x^{(i)} \geq 0 \; \forall i]

By the factorisation theorem: $T = \sum_i x^{(i)}$ (total waiting time) is a sufficient statistic. It factors as $g(T,\lambda) \cdot h(\mathbf{x})$ with $g(T,\lambda) = \lambda^n e^{-\lambda T}$ and $h(\mathbf{x}) = \prod \mathbb{1}[x^{(i)}\geq 0]$ .

MLE: $\hat{\lambda} = n/T = 1/\bar{x}$ , which depends on the data only through $T$ - consistent with sufficiency.

B.2 Order Statistics and the Uniform Distribution

For $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{U}(0,\theta)$ :

p(\mathbf{x};\theta) = \frac{1}{\theta^n} \prod_{i=1}^n \mathbb{1}[0 \leq x^{(i)} \leq \theta] = \frac{1}{\theta^n}\mathbb{1}[\max_i x^{(i)} \leq \theta]\mathbb{1}[\min_i x^{(i)} \geq 0]

The sufficient statistic is $T = X_{(n)} = \max_i x^{(i)}$ (the maximum order statistic). The MLE $\hat{\theta} = X_{(n)}$ is a function of $T$ as expected.

Bias correction. Since $\mathbb{E}[X_{(n)}] = n\theta/(n+1)$ , the unbiased estimator is $\tilde{\theta} = \frac{n+1}{n}X_{(n)}$ . By Rao-Blackwell applied to $\tilde{\theta}' = \frac{n+1}{n}x^{(1)}$ (use only first observation): conditioning on $T = X_{(n)}$ gives $\mathbb{E}[\tilde{\theta}' | T] = \frac{n+1}{n}X_{(n)}\cdot\frac{1}{n} \cdot n = \frac{n+1}{n}X_{(n)}$ , so the Rao-Blackwell estimator is indeed $\tilde{\theta}$ .

Appendix C: The EM Algorithm - Connecting MLE to Latent Variable Models

Many models of interest (Gaussian mixture models, HMMs, topic models, VAEs) have latent variables $Z$ in addition to observed data $X$ . The complete-data log-likelihood $\log p(X, Z; \boldsymbol{\theta})$ would be easy to maximise if $Z$ were observed, but we only observe $X$ .

The EM algorithm maximises the marginal log-likelihood $\ell(\boldsymbol{\theta}) = \log p(X;\boldsymbol{\theta}) = \log \sum_Z p(X,Z;\boldsymbol{\theta})$ by iterating two steps:

E-step: Compute the expected complete-data log-likelihood under the current posterior over $Z$ :

Q(\boldsymbol{\theta};\boldsymbol{\theta}^{(t)}) = \mathbb{E}_{Z \mid X;\boldsymbol{\theta}^{(t)}}[\log p(X, Z;\boldsymbol{\theta})]

M-step: Maximise:

\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta};\boldsymbol{\theta}^{(t)})

Why EM guarantees $\ell(\boldsymbol{\theta}^{(t+1)}) \geq \ell(\boldsymbol{\theta}^{(t)})$ : The evidence lower bound (ELBO):

\ell(\boldsymbol{\theta}) = Q(\boldsymbol{\theta};\boldsymbol{\theta}^{(t)}) + H(\boldsymbol{\theta}^{(t)})

where $H$ is the entropy term not involving $\boldsymbol{\theta}$ . The M-step maximises $Q$ , and the E-step tightens the bound at the new $\boldsymbol{\theta}^{(t+1)}$ .

Gaussian Mixture Model (GMM). For $K$ -component GMM with means $\boldsymbol{\mu}_k$ , covariances $\Sigma_k$ , and mixing weights $\pi_k$ :

E-step: Compute responsibilities $r_{ik} = \frac{\pi_k \mathcal{N}(\mathbf{x}^{(i)};\boldsymbol{\mu}_k,\Sigma_k)}{\sum_j \pi_j \mathcal{N}(\mathbf{x}^{(i)};\boldsymbol{\mu}_j,\Sigma_j)}$
M-step: Update parameters:

\hat{\pi}_k = \frac{\sum_i r_{ik}}{n}, \quad \hat{\boldsymbol{\mu}}_k = \frac{\sum_i r_{ik}\mathbf{x}^{(i)}}{\sum_i r_{ik}}, \quad \hat{\Sigma}_k = \frac{\sum_i r_{ik}(\mathbf{x}^{(i)}-\hat{\boldsymbol{\mu}}_k)(\mathbf{x}^{(i)}-\hat{\boldsymbol{\mu}}_k)^\top}{\sum_i r_{ik}}

Each M-step is a weighted MLE for a Gaussian - exactly the closed-form solutions from Section5.3, but with fractional weights.

For AI: The EM algorithm is the algorithmic prototype for variational inference in VAEs (Section04). In the VAE, the E-step is replaced by the encoder $q_\phi(\mathbf{z}|\mathbf{x})$ (an amortised posterior approximation) and the M-step is replaced by joint gradient ascent on the ELBO with respect to both encoder parameters $\phi$ and decoder parameters $\boldsymbol{\theta}$ .

Appendix D: Cramer-Rao Bound - Multivariate Proof and Geometry

Multivariate CRB (full proof). Let $\hat{\boldsymbol{\theta}}$ be an unbiased estimator of $\boldsymbol{\theta} \in \mathbb{R}^k$ . Define the score vector $\mathbf{s}(\boldsymbol{\theta};\mathbf{x}) = \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x};\boldsymbol{\theta}) \in \mathbb{R}^k$ .

Since $\hat{\boldsymbol{\theta}}$ is unbiased: $\mathbb{E}[\hat{\boldsymbol{\theta}}] = \boldsymbol{\theta}$ . Differentiating with respect to $\theta_j$ :

I_j = \frac{\partial}{\partial\theta_j}\mathbb{E}[\hat{\boldsymbol{\theta}}] = \mathbb{E}[\hat{\boldsymbol{\theta}} \cdot s_j(\boldsymbol{\theta};\mathbf{x})] = \mathbb{E}[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) s_j]

In matrix form: $I_k = \mathbb{E}[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})\mathbf{s}^\top]$ (a $k \times k$ identity).

By the matrix Cauchy-Schwarz (Lowner ordering):

\begin{pmatrix} \operatorname{Cov}(\hat{\boldsymbol{\theta}}) & I_k \\ I_k & \mathcal{I}(\boldsymbol{\theta}) \end{pmatrix} \succeq 0

The Schur complement condition gives: $\operatorname{Cov}(\hat{\boldsymbol{\theta}}) - \mathcal{I}(\boldsymbol{\theta})^{-1} \succeq 0$ , i.e., $\operatorname{Cov}(\hat{\boldsymbol{\theta}}) \succeq \mathcal{I}(\boldsymbol{\theta})^{-1}$ . $\square$

Information-geometric interpretation. The statistical manifold $\mathcal{M} = \{p(\cdot;\boldsymbol{\theta}) : \boldsymbol{\theta} \in \Theta\}$ is a Riemannian manifold with the Fisher-Rao metric $g_{jl} = \mathcal{I}(\boldsymbol{\theta})_{jl}$ . The CRB states that the covariance of any unbiased estimator is at least as large as the inverse metric - the "volume" of estimation uncertainty is bounded below by the curvature of the statistical manifold.

This geometric view motivates the natural gradient: gradient descent in the Fisher-Rao metric on the statistical manifold corresponds to the natural gradient update $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \mathcal{I}^{-1}\nabla\ell$ .

Appendix E: The Score Test, Wald Test, and Likelihood Ratio Test

These three test statistics form the classical trinity of asymptotic tests in statistics. While their full treatment belongs in Section03 Hypothesis Testing, their derivations from MLE theory belong here.

For testing $H_0: \boldsymbol{\theta} = \boldsymbol{\theta}_0$ :

Score (Rao) test: $\Lambda_S = \frac{1}{n}\mathbf{s}_n(\boldsymbol{\theta}_0)^\top \mathcal{I}(\boldsymbol{\theta}_0)^{-1} \mathbf{s}_n(\boldsymbol{\theta}_0) \xrightarrow{d} \chi^2_k$ under $H_0$ .

Uses only the restricted model - does not require fitting the unrestricted MLE.

Wald test: $\Lambda_W = n(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)^\top \mathcal{I}(\hat{\boldsymbol{\theta}})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \xrightarrow{d} \chi^2_k$ under $H_0$ .

Uses only the unrestricted MLE.

Likelihood ratio test: $\Lambda_{LR} = 2[\ell(\hat{\boldsymbol{\theta}}) - \ell(\boldsymbol{\theta}_0)] \xrightarrow{d} \chi^2_k$ under $H_0$ .

Requires fitting both restricted and unrestricted models. By Wilks' theorem (1938), this converges to the chi-squared distribution.

Asymptotic equivalence. Under $H_0$ : $\Lambda_S = \Lambda_W = \Lambda_{LR} + O_P(n^{-1/2})$ - all three are first-order equivalent. They differ in finite samples. The LRT is generally most accurate for small $n$ .

These tests are developed fully in Section03 Hypothesis Testing, including the Neyman-Pearson lemma, p-values, and power calculations.

Appendix F: Sampling Distributions of Classical Statistics

Understanding the exact (finite-sample) distributions of key estimators is essential for constructing valid hypothesis tests and confidence intervals when $n$ is small.

F.1 Distribution of the Sample Mean

Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ .

\bar{x} = \frac{1}{n}\sum_{i=1}^n x^{(i)} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right)

This is an exact result (not asymptotic) - the sum of independent Gaussians is Gaussian. The standardised version $Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1)$ gives exact z-tests and z-intervals.

F.2 Distribution of the Sample Variance: Chi-Squared

Theorem F.1. Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ . Then:

\frac{(n-1)s^2}{\sigma^2} = \frac{\sum_{i=1}^n (x^{(i)} - \bar{x})^2}{\sigma^2} \sim \chi^2_{n-1}

and this is independent of $\bar{x}$ .

Proof sketch. Write $\sum(x^{(i)}-\bar{x})^2/\sigma^2 = \sum(z^{(i)}-\bar{z})^2$ where $z^{(i)} = (x^{(i)}-\mu)/\sigma \overset{\text{iid}}{\sim} \mathcal{N}(0,1)$ . The quadratic form $\sum(z^{(i)}-\bar{z})^2$ equals $\mathbf{z}^\top(I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top)\mathbf{z}$ , a projection of $\mathbf{z}$ onto the $(n-1)$ -dimensional subspace orthogonal to $\mathbf{1}$ . By the spectral theorem for Gaussian quadratic forms, this has a $\chi^2_{n-1}$ distribution. Independence from $\bar{x}$ (a function of $\mathbf{1}^\top\mathbf{z}$ ) follows from the orthogonality.

The $n-1$ degrees of freedom arise because estimating $\mu$ by $\bar{x}$ removes one degree of freedom from the $n$ deviations $x^{(i)}-\bar{x}$ (they sum to zero, so only $n-1$ are free).

F.3 Student's $t$ Distribution

Definition. If $Z \sim \mathcal{N}(0,1)$ and $V \sim \chi^2_\nu$ independently, then $T = Z/\sqrt{V/\nu} \sim t_\nu$ .

Application to Gaussian mean. With $Z = (\bar{x}-\mu)/(\sigma/\sqrt{n}) \sim \mathcal{N}(0,1)$ and $V = (n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$ (independent of $Z$ ):

T = \frac{\bar{x}-\mu}{s/\sqrt{n}} = \frac{Z}{\sqrt{V/(n-1)}} \sim t_{n-1}

This is exact for all $n \geq 2$ - no large-sample approximation needed. The $t$ -distribution is symmetric, bell-shaped, heavier-tailed than the Gaussian, and converges to $\mathcal{N}(0,1)$ as $\nu \to \infty$ .

Quantiles comparison:

$\alpha$ (two-sided)	$z_{\alpha/2}$	$t_{9, \alpha/2}$	$t_{29, \alpha/2}$
10%	1.645	1.833	1.699
5%	1.960	2.262	2.045
1%	2.576	3.250	2.756

The $t$ -distribution with small df requires wider intervals to achieve the same coverage - reflecting the additional uncertainty from estimating $\sigma$ .

F.4 The $F$ Distribution

Definition. If $U \sim \chi^2_m$ and $V \sim \chi^2_n$ independently, then $F = (U/m)/(V/n) \sim F_{m,n}$ .

Application. Comparing variances: $s_1^2/s_2^2 \cdot (\sigma_2^2/\sigma_1^2) \sim F_{n_1-1, n_2-1}$ . Testing equality of variances $H_0: \sigma_1^2 = \sigma_2^2$ : $F_{\text{obs}} = s_1^2/s_2^2 \sim F_{n_1-1, n_2-1}$ under $H_0$ .

The $F$ -distribution also underlies ANOVA (testing equality of multiple means), regression significance tests, and model comparison - fully developed in Section03 Hypothesis Testing and Section06 Regression Analysis.

Appendix G: Numerical Example - Fitting a Gaussian Mixture Model via EM

This worked example demonstrates MLE for a latent variable model.

Setup. Suppose we observe $n = 200$ observations believed to come from a mixture of two Gaussians: $p(x;\boldsymbol{\theta}) = \pi_1 \mathcal{N}(x;\mu_1,\sigma_1^2) + (1-\pi_1)\mathcal{N}(x;\mu_2,\sigma_2^2)$ .

The parameters are $\boldsymbol{\theta} = (\pi_1, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$ .

Direct MLE is hard. The log-likelihood is $\ell(\boldsymbol{\theta}) = \sum_{i=1}^n \log[\pi_1\mathcal{N}(x^{(i)};\mu_1,\sigma_1^2) + (1-\pi_1)\mathcal{N}(x^{(i)};\mu_2,\sigma_2^2)]$ , which contains a log of a sum - non-convex, no closed form.

EM solution. Introduce latent indicators $Z^{(i)} \in \{1,2\}$ indicating which component generated $x^{(i)}$ .

E-step: Compute responsibilities

r^{(i)}_1 = \frac{\pi_1\mathcal{N}(x^{(i)};\mu_1,\sigma_1^2)}{\pi_1\mathcal{N}(x^{(i)};\mu_1,\sigma_1^2) + (1-\pi_1)\mathcal{N}(x^{(i)};\mu_2,\sigma_2^2)}, \quad r^{(i)}_2 = 1 - r^{(i)}_1

M-step: Update parameters (weighted MLEs for each component):

\hat{\pi}_1 = \frac{\sum_i r^{(i)}_1}{n}, \quad \hat{\mu}_k = \frac{\sum_i r^{(i)}_k x^{(i)}}{\sum_i r^{(i)}_k}, \quad \hat{\sigma}_k^2 = \frac{\sum_i r^{(i)}_k (x^{(i)} - \hat{\mu}_k)^2}{\sum_i r^{(i)}_k}

Convergence. EM guarantees $\ell(\boldsymbol{\theta}^{(t+1)}) \geq \ell(\boldsymbol{\theta}^{(t)})$ at every iteration, converging to a local maximum. Multiple random initialisations are needed to find the global maximum.

For AI: The EM algorithm is the prototype for alternating optimisation in deep learning. The BERT pre-training objective (masked language modelling) alternates between predicting masked tokens (like the E-step computing responsibilities) and updating parameters (like the M-step). Expectation-Maximisation underlies the theory of the EM algorithm, which is used for fitting GMMs, HMMs, and topic models in NLP preprocessing pipelines.

Appendix H: Regularised MLE and the Bias-Variance Trade-Off in Practice

Ridge regression (L2-regularised linear regression) is the canonical example of regularised MLE.

Setup. For linear regression $y^{(i)} = \boldsymbol{\theta}^\top \mathbf{x}^{(i)} + \varepsilon^{(i)}$ with $\varepsilon^{(i)} \overset{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2)$ , the MLE is OLS: $\hat{\boldsymbol{\theta}}_{\text{OLS}} = (X^\top X)^{-1}X^\top\mathbf{y}$ .

Ridge regression: Add L2 penalty:

\hat{\boldsymbol{\theta}}_{\text{ridge}}(\lambda) = \arg\min_{\boldsymbol{\theta}} \lVert \mathbf{y} - X\boldsymbol{\theta} \rVert_2^2 + \lambda\lVert\boldsymbol{\theta}\rVert_2^2 = (X^\top X + \lambda I)^{-1}X^\top\mathbf{y}

Bias: $\operatorname{Bias}(\hat{\boldsymbol{\theta}}_{\text{ridge}}) = -\lambda(X^\top X + \lambda I)^{-1}\boldsymbol{\theta}^* \neq \mathbf{0}$ - ridge is biased.

Variance: $\operatorname{Cov}(\hat{\boldsymbol{\theta}}_{\text{ridge}}) = \sigma^2(X^\top X + \lambda I)^{-1}X^\top X(X^\top X + \lambda I)^{-1} \preceq \sigma^2(X^\top X)^{-1} = \operatorname{Cov}(\hat{\boldsymbol{\theta}}_{\text{OLS}})$ .

Ridge has smaller covariance than OLS - introducing bias reduces variance. The bias-variance trade-off: for well-chosen $\lambda$ , MSE of ridge < MSE of OLS.

MAP interpretation. Ridge is MAP estimation with Gaussian prior $\boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, \sigma^2/\lambda \cdot I)$ . The prior pulls $\boldsymbol{\theta}$ toward zero, introducing bias toward zero. The full development of MAP (adding a prior to MLE) is in Section04 Bayesian Inference.

Optimal $\lambda$ selection. Cross-validation estimates the prediction MSE for each $\lambda$ , selecting the one that minimises the bias-variance trade-off from a prediction perspective. Alternatively, analytical formula: the MSE-minimising $\lambda^* = k\sigma^2/\lVert\boldsymbol{\theta}^*\rVert_2^2$ (depends on unknown $\boldsymbol{\theta}^*$ , so must be estimated).

Connection to weight decay in LLMs. AdamW (Loshchilov & Hutter, 2019) adds weight decay $\lambda\lVert\boldsymbol{\theta}\rVert_2^2$ to the Adam update. This is regularised MLE with a Gaussian prior on all parameters - the standard training procedure for all frontier language models. The weight decay coefficient $\lambda$ (typically 0.1) controls the strength of the prior, determining how much parameters are regularised toward zero.

Appendix I: Non-Parametric Estimation

When the parametric model is uncertain or clearly wrong, non-parametric estimation makes minimal distributional assumptions.

Empirical Distribution Function (EDF). Given sample $x^{(1)}, \ldots, x^{(n)}$ :

\hat{F}_n(t) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[x^{(i)} \leq t]

The EDF is the non-parametric MLE of the CDF $F$ . By the Glivenko-Cantelli theorem: $\sup_t |\hat{F}_n(t) - F(t)| \xrightarrow{a.s.} 0$ - the EDF converges uniformly to the true CDF.

The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality gives a finite-sample bound:

P\!\left(\sup_t |\hat{F}_n(t) - F(t)| > \varepsilon\right) \leq 2e^{-2n\varepsilon^2}

This is the non-parametric Cramer-Rao analogue - it bounds estimation error for the CDF without any parametric assumption.

Kernel density estimation. Estimating the PDF $f$ non-parametrically:

\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^n K\!\left(\frac{x - x^{(i)}}{h}\right)

where $K$ is a kernel (e.g., Gaussian) and $h$ is the bandwidth. The bias-variance trade-off applies: small $h$ -> low bias, high variance; large $h$ -> high bias, low variance. Optimal bandwidth: $h^* \propto n^{-1/5}$ (mean integrated squared error).

Bootstrap as non-parametric MLE. The bootstrap resamples from the EDF $\hat{F}_n$ - effectively replacing the unknown $F$ with its non-parametric MLE. Bootstrap CI validity follows from the closeness of $\hat{F}_n$ to $F$ (Glivenko-Cantelli) and the continuity of the statistic of interest.

Appendix J: Selected Proofs and Derivations

J.1 Asymptotic Variance of the Sample Median

For a distribution with PDF $f$ and CDF $F$ , let $m$ denote the population median ( $F(m) = 1/2$ ). The sample median $\tilde{x}_n$ satisfies:

\sqrt{n}(\tilde{x}_n - m) \xrightarrow{d} \mathcal{N}\!\left(0, \frac{1}{4f(m)^2}\right)

Proof sketch. The sample median is the MLE for the double-exponential (Laplace) distribution, and the asymptotic variance of the $p$ -th quantile estimator $\hat{q}_p$ is $p(1-p)/[nf(q_p)^2]$ . For $p = 1/2$ : $(1/2)(1/2)/[n f(m)^2] = 1/(4nf(m)^2)$ .

Comparison to mean (for Gaussian data). For $\mathcal{N}(\mu,\sigma^2)$ : $f(\mu) = 1/(\sigma\sqrt{2\pi})$ , so the median's asymptotic variance is $\pi\sigma^2/(2n)$ vs. $\sigma^2/n$ for the mean. The relative efficiency is $2/\pi \approx 0.637$ - the mean extracts $100/0.637 \approx 57\%$ more information from Gaussian data than the median. However, for heavy-tailed distributions (e.g., Cauchy), the mean has infinite variance while the median remains consistent - robustness matters.

J.2 The Neyman-Scott Paradox - Inconsistency of MLE

MLE is not always consistent. The Neyman-Scott example (1948) provides a famous cautionary case.

Setup. Observe $n$ pairs $(x_j^{(1)}, x_j^{(2)})$ for $j = 1, \ldots, n$ , where $x_j^{(k)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu_j, \sigma^2)$ . The parameters are $(\mu_1, \ldots, \mu_n, \sigma^2)$ - $n+1$ parameters for $2n$ observations.

MLE. For each pair: $\hat{\mu}_j = (x_j^{(1)} + x_j^{(2)})/2$ . For $\sigma^2$ : $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{2n}\sum_{j=1}^n \left[(x_j^{(1)} - \hat{\mu}_j)^2 + (x_j^{(2)} - \hat{\mu}_j)^2\right] = \frac{1}{2n}\sum_j \frac{(x_j^{(1)}-x_j^{(2)})^2}{2}$ .

Inconsistency. $\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \sigma^2/2$ - the MLE permanently underestimates $\sigma^2$ by a factor of 2, regardless of $n$ . The bias does not vanish because the number of parameters grows with $n$ .

Root cause. The regularity conditions for MLE consistency require the number of parameters to be fixed as $n \to \infty$ . When nuisance parameters grow with $n$ (an "incidental parameters" problem), MLE can fail. The fix: use a conditional or marginal likelihood that eliminates the nuisance parameters.

For AI. This paradox is relevant to neural network generalisation. A model with $p \gg n$ parameters (overparameterised regime) has more parameters than observations. Classical MLE theory breaks down - yet in practice, overparameterised networks often generalise well due to implicit regularisation (gradient descent inductive bias). This remains an active research area.

J.3 Locally Most Powerful Tests and the Neyman-Pearson Lemma

(Preview - full treatment in Section03 Hypothesis Testing)

The Neyman-Pearson lemma provides the most powerful test for simple hypotheses $H_0: \theta = \theta_0$ vs. $H_1: \theta = \theta_1$ . The rejection region is:

\{x : L(\theta_1;x)/L(\theta_0;x) > c\} = \{\text{reject } H_0\}

where $c$ is chosen to control type-I error at level $\alpha$ . The likelihood ratio $L(\theta_1;x)/L(\theta_0;x)$ is a sufficient statistic for the test - no other test can be more powerful at the same level. The connection to MLE: the most powerful test uses the likelihood, and the MLE is the point that maximises the likelihood.

For composite alternatives $H_1: \theta \in \Theta_1$ , the likelihood ratio test statistic $\Lambda = 2[\ell(\hat{\theta}) - \ell(\hat{\theta}_{\text{restricted}})] \xrightarrow{d} \chi^2_k$ is the standard generalisation, and connects to the Wald and score tests via asymptotic equivalence.

Appendix K: Reference Tables

K.1 Common MLE Formulas

Distribution	Parameters	Log-likelihood	MLE
$\operatorname{Bern}(p)$	$p \in (0,1)$	$S\log p + (n-S)\log(1-p)$	$\hat{p} = S/n$
$\operatorname{Poisson}(\lambda)$	$\lambda > 0$	$S\log\lambda - n\lambda$	$\hat{\lambda} = S/n = \bar{x}$
$\operatorname{Exp}(\lambda)$	$\lambda > 0$	$n\log\lambda - \lambda\sum x_i$	$\hat{\lambda} = n/\sum x_i = 1/\bar{x}$
$\mathcal{N}(\mu,\sigma^2)$	$\mu \in \mathbb{R}$ , $\sigma^2 > 0$	$-n\log\sigma - \frac{\sum(x_i-\mu)^2}{2\sigma^2}$	$\hat{\mu} = \bar{x}$ , $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i-\bar{x})^2$
$\mathcal{U}(0,\theta)$	$\theta > 0$	$-n\log\theta \cdot \mathbb{1}[\theta \geq x_{(n)}]$	$\hat{\theta} = x_{(n)} = \max_i x_i$
$\operatorname{Gamma}(\alpha,\beta)$	$\alpha,\beta > 0$	$n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha-1)\sum\log x_i - \beta\sum x_i$	No closed form; requires Newton-Raphson
$\operatorname{Beta}(\alpha,\beta)$	$\alpha,\beta > 0$	$n\log\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} + (\alpha-1)\sum\log x_i + (\beta-1)\sum\log(1-x_i)$	No closed form

K.2 Fisher Information Summary

Distribution	Parameter	$\mathcal{I}(\theta)$	Efficient MLE	$\operatorname{Var}_{\text{CRB}} = 1/(n\mathcal{I})$
$\operatorname{Bern}(p)$	$p$	$\frac{1}{p(1-p)}$	$\bar{x}$	$p(1-p)/n$
$\mathcal{N}(\mu,\sigma^2)$ , $\sigma^2$ known	$\mu$	$1/\sigma^2$	$\bar{x}$	$\sigma^2/n$
$\mathcal{N}(\mu,\sigma^2)$ , $\mu$ known	$\sigma^2$	$1/(2\sigma^4)$	$\frac{1}{n}\sum(x_i-\mu)^2$	$2\sigma^4/n$
$\operatorname{Poisson}(\lambda)$	$\lambda$	$1/\lambda$	$\bar{x}$	$\lambda/n$
$\operatorname{Exp}(\lambda)$	$\lambda$	$1/\lambda^2$	$1/\bar{x}$	$\lambda^2/n$
$\operatorname{Categorical}(p_1,\ldots,p_K)$	$p_k$ (one of $K-1$ free)	$\frac{1}{p_k} + \frac{1}{p_K}$ (marginal)	$n_k/n$	Multinomial variance

K.3 Confidence Interval Reference

Parameter	Setting	Pivot	CI formula
Gaussian mean $\mu$	$\sigma^2$ known	$Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1)$	$\bar{x} \pm z_{\alpha/2}\sigma/\sqrt{n}$
Gaussian mean $\mu$	$\sigma^2$ unknown	$T = \frac{\bar{x}-\mu}{s/\sqrt{n}} \sim t_{n-1}$	$\bar{x} \pm t_{n-1,\alpha/2} s/\sqrt{n}$
Gaussian variance $\sigma^2$	$\mu$ unknown	$(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$	$\left[\frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}}, \frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}}\right]$
Bernoulli $p$ (large $n$ )	-	Wald: $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$	Wilson score preferred near 0/1
General MLE $\hat{\theta}$	Large $n$	$\sqrt{n\mathcal{I}(\hat{\theta})}(\hat{\theta}-\theta) \xrightarrow{d} \mathcal{N}(0,1)$	$\hat{\theta} \pm z_{\alpha/2}/\sqrt{n\mathcal{I}(\hat{\theta})}$
Any statistic	Non-parametric	Bootstrap	Percentile/BCa bootstrap

End of Appendices

Appendix L: Worked Examples - End-to-End Estimation

L.1 Estimating the Parameters of a Sentiment Classifier

Problem. A BERT-based sentiment classifier is evaluated on a test set of $n = 500$ examples. It classifies 418 correctly ( $\hat{a} = 0.836$ ). Report (a) the point estimate with standard error, (b) the 95% Wilson CI, (c) determine if this differs significantly from a baseline accuracy of 80%.

Step 1: Point estimate and SE.

$\hat{a} = 418/500 = 0.836$ . Standard error: $\text{SE} = \sqrt{\hat{a}(1-\hat{a})/n} = \sqrt{0.836 \times 0.164/500} = \sqrt{0.000274} \approx 0.0166$ .

Step 2: Wilson score CI.

With $z = z_{0.025} = 1.96$ , $n = 500$ , $\hat{a} = 0.836$ :

\text{CI} = \left[\frac{0.836 + 1.96^2/(2 \cdot 500)}{1 + 1.96^2/500} \pm \frac{1.96\sqrt{0.836 \times 0.164/500 + 1.96^2/(4 \cdot 500^2)}}{1 + 1.96^2/500}\right]

Numerically: centre $\approx (0.836 + 0.00384)/1.00768 \approx 0.8364$ ; margin $\approx 0.0323$ . CI $\approx [0.804, 0.869]$ .

Step 3: Comparison to baseline 80%.

The null hypothesis $H_0: a^* = 0.80$ corresponds to checking if 0.80 is inside the CI. Since 0.80 is inside $[0.804, 0.869]$ ? No - 0.80 is just outside. Alternatively, $z$ -test: $z = (0.836 - 0.80)/\text{SE} = 0.036/0.0166 = 2.17 > 1.96$ . The improvement is statistically significant at $\alpha = 0.05$ .

Statistical interpretation. The classifier significantly outperforms the 80% baseline, but uncertainty spans from 80.4% to 86.9% - real-world performance could be anywhere in this range, highlighting the importance of CIs even for significant results.

L.2 MLE for a Gaussian Mixture - Worked Iteration

Problem. Two clusters are observed: $\{1.1, 1.3, 0.9, 1.2\}$ and $\{5.8, 6.2, 5.5, 6.1\}$ (but we don't know the assignments). Fit a 2-component Gaussian mixture.

Initialisation. $\hat{\pi}_1 = 0.5$ , $\hat{\mu}_1 = 2.0$ , $\hat{\sigma}_1^2 = 1.0$ , $\hat{\mu}_2 = 5.0$ , $\hat{\sigma}_2^2 = 1.0$ .

E-step (iteration 1). For $x = 1.1$ :

r_1 = \frac{0.5 \cdot \mathcal{N}(1.1;2.0,1.0)}{0.5 \cdot \mathcal{N}(1.1;2.0,1.0) + 0.5 \cdot \mathcal{N}(1.1;5.0,1.0)} = \frac{0.5 \times 0.266}{0.5 \times 0.266 + 0.5 \times 0.000}\approx 0.9999

For $x = 5.8$ : $r_1 \approx 0.0001$ . The algorithm already strongly assigns each point to the correct cluster.

M-step (iteration 1). With $\sum r^{(i)}_1 \approx 4.0$ (the four small values):

\hat{\mu}_1 \approx (1.1+1.3+0.9+1.2)/4 = 1.125, \quad \hat{\mu}_2 \approx (5.8+6.2+5.5+6.1)/4 = 5.9

After 2-3 iterations, the algorithm converges to the obvious clusters. The key insight: EM separates the cluster assignment (E-step soft assignments $r_{ik}$ ) from parameter estimation (M-step weighted MLEs), alternating until convergence.

L.3 Bootstrap CI for Median Income - ML Context

Problem. A dataset of $n = 50$ yearly salaries (thousands) for ML engineers is collected. Sample statistics: $\tilde{x} = 145$ , $s = 42$ . Construct a 95% bootstrap CI for the median.

Why bootstrap? The Gaussian CI for the median requires knowing the PDF at the median - unknown without assuming a distribution. The salary distribution is likely right-skewed (some outliers earn much more), making parametric CIs unreliable.

Bootstrap procedure (pseudocode):

for b = 1 to B=2000:
    x_star = sample(x, n=50, replace=True)
    median_star[b] = median(x_star)

CI_95 = [percentile(median_star, 2.5), percentile(median_star, 97.5)]

Typical result: $\text{CI}_{95} \approx [132, 158]$ thousand. The CI is slightly asymmetric (the right tail is longer due to income skewness), which the bootstrap correctly captures and a symmetric Gaussian CI would miss.

For AI. This exact procedure is used to report confidence intervals for ML engineer compensation surveys, which inform hiring benchmarks and compensation strategy at AI companies. Statistically sound salary benchmarks require bootstrap CIs because income distributions are strongly non-Gaussian.

Estimation Theory: Part 2 - Method Of Moments To Appendix L Worked Examples End To End Estimatio