Estimation Theory, Part 1: Intuition to Maximum Likelihood Estimation

1. Intuition

1.1 What Is Statistical Estimation?

Suppose you flip a coin 100 times and observe 63 heads. The coin's true bias $p$ - the probability of heads on any single flip - is unknown. What is your best guess for $p$? The obvious answer is $\hat{p} = 0.63$. But why is this the right answer? What makes it "best"? Could a different estimate be better in some sense? And how confident should you be - could the true $p$ be 0.5 (a fair coin) despite observing 63 heads?

These are precisely the questions estimation theory addresses. The statistical estimation problem has three ingredients:

  1. An unknown parameter $\theta \in \Theta$ governing some probability distribution $p(\mathbf{x}; \theta)$
  2. Observed data $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}$ drawn independently from $p(\cdot\,; \theta)$
  3. An estimator $\hat{\theta} = T(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)})$: a function of the data that produces a guess for $\theta$

The key insight, often overlooked by beginners, is that the estimator is a random variable. Before we collect data, $\hat{\theta}$ is uncertain - its value depends on which random sample we happen to observe. If we repeated the experiment with a fresh sample, we would get a different $\hat{\theta}$. The sampling distribution of $\hat{\theta}$ - the distribution of $\hat{\theta}$ across all possible samples - is the object of study. A good estimator has a sampling distribution tightly concentrated around the true $\theta$.
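A minimal simulation makes the sampling distribution concrete (the seed and counts below are illustrative, not from the lesson): repeat the 100-flip experiment many times, compute $\hat{p}$ each time, and inspect the spread of the results.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p, n_flips, n_experiments = 0.63, 100, 10_000

# Each row is one replication of the 100-flip experiment; each p_hat is
# one draw from the sampling distribution of the estimator.
flips = rng.random((n_experiments, n_flips)) < true_p
p_hats = flips.mean(axis=1)

print(round(p_hats.mean(), 3))  # centre of the sampling distribution, near 0.63
print(round(p_hats.std(), 3))   # spread, near sqrt(p(1-p)/n) ~ 0.048
```

The histogram of `p_hats` is the sampling distribution; its standard deviation $\sqrt{p(1-p)/n}$ is exactly the quantity that shrinks as $n$ grows.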

For AI: This perspective matters directly. When you train a neural network on a dataset $\mathcal{D}$ and obtain weights $\hat{\boldsymbol{\theta}}$, those weights are a realisation of an estimator - a function of the random training data. Training on a different shuffle or subsample gives a different $\hat{\boldsymbol{\theta}}$. The variation you see across training runs, seeds, and dataset splits is the sampling distribution of the MLE manifesting in practice.

1.2 The Estimation Landscape

Estimation theory spans three major frameworks, each answering a different question:

Point estimation produces a single value $\hat{\theta}$ for the unknown parameter. The challenge is specifying what "best" means. Three criteria dominate:

  • Unbiasedness: $\mathbb{E}[\hat{\theta}] = \theta$ - on average, the estimator is correct
  • Minimum variance: among unbiased estimators, prefer the one with smallest $\operatorname{Var}(\hat{\theta})$
  • Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$ as $n \to \infty$ - with enough data, the estimator converges to the truth

These criteria can conflict. The Cramer-Rao lower bound establishes a hard floor on variance for unbiased estimators, and the estimators that achieve this floor are called efficient. Maximum likelihood estimation (MLE) is the dominant method: it is consistent, asymptotically efficient, and invariant under smooth reparametrisations.

Interval estimation goes beyond a point to a range $[\hat{\theta}_L, \hat{\theta}_U]$ with a coverage guarantee: in repeated experiments, a $1 - \alpha$ fraction of such intervals will contain the true $\theta$. Frequentist confidence intervals capture estimation uncertainty without requiring a prior distribution on $\theta$.

Asymptotic theory studies estimator behaviour as $n \to \infty$. The flagship result is the asymptotic normality of MLE: under regularity conditions,

$$\sqrt{n}\left(\hat{\theta}_{\text{MLE}} - \theta^*\right) \xrightarrow{d} \mathcal{N}\!\left(0,\, \mathcal{I}(\theta^*)^{-1}\right)$$

where $\mathcal{I}(\theta^*)$ is the Fisher information. This single result simultaneously justifies MLE as an estimation method, provides the basis for asymptotic confidence intervals, and connects estimation theory to information geometry.
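This limit can be sanity-checked for the Bernoulli model, where $\mathcal{I}(p) = 1/(p(1-p))$ (derived in Section 4.2), so the limiting variance should be $p(1-p)$. A sketch with illustrative settings:

```python
import numpy as np

rng = np.random.default_rng(1)
p_star, n, reps = 0.3, 2_000, 10_000

# The MLE for a Bernoulli sample is the sample mean; rescale the error by sqrt(n).
x = rng.random((reps, n)) < p_star
z = np.sqrt(n) * (x.mean(axis=1) - p_star)

# Limiting distribution is N(0, I(p*)^-1) with I(p*)^-1 = p*(1 - p*) = 0.21.
print(round(z.mean(), 3), round(z.var(), 3))
```

The empirical variance of the rescaled error matches $\mathcal{I}(p^*)^{-1}$, and a histogram of `z` is visibly Gaussian.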

For AI: The three frameworks map directly to ML practice. Point estimation = model training (find $\hat{\boldsymbol{\theta}}$). Interval estimation = evaluation with error bars (report accuracy $\pm$ 95% CI). Asymptotic theory = understanding why larger datasets produce more reliable models (the $1/\sqrt{n}$ convergence rate).

1.3 Historical Timeline

ESTIMATION THEORY - HISTORICAL TIMELINE
========================================================================

  1795  Gauss (age 18) uses least squares to predict asteroid orbits
        - the first systematic estimation method

  1805  Legendre publishes least squares formally (Gauss priority dispute)

  1809  Gauss derives least squares from the assumption of Gaussian errors
        - first connection between MLE and squared-error minimisation

  1894  Pearson introduces method of moments - systematic moment matching

  1912  Fisher introduces "maximum likelihood" as a term; begins systematic
        study of estimation properties

  1922  Fisher: "On the Mathematical Foundations of Theoretical Statistics"
        - defines sufficiency, efficiency, consistency; derives properties of MLE

  1925  Fisher introduces Fisher information and the information inequality
        (precursor to Cramer-Rao)

  1945  Cramer proves the Cramer-Rao lower bound (independently of Rao)

  1945  Rao independently proves the same bound + Rao-Blackwell theorem

  1950  Lehmann & Scheffe: completeness and UMVUE theory

  1950s Wald develops statistical decision theory - unified framework for
        estimation as minimising expected loss

  1979  Efron introduces the bootstrap - non-parametric confidence intervals
        without distributional assumptions

  1998  Amari: natural gradient via Fisher information geometry
        - direct connection to modern ML optimisation

  2015  Martens & Grosse: K-FAC - tractable approximation of FIM for DNNs

  2017  Kirkpatrick et al.: Elastic Weight Consolidation (EWC) uses FIM
        to prevent catastrophic forgetting in continual learning

  2022  Hoffmann et al. (Chinchilla): scaling laws estimated via MLE on
        (N, D, L) data - estimation theory at trillion-parameter scale

========================================================================

1.4 Why Estimation Theory Matters for AI

Every major component of a modern ML pipeline rests on estimation theory:

Training objectives. Minimising cross-entropy loss is equivalent to performing MLE. The connection is exact: $\arg\min_{\boldsymbol{\theta}} \sum_i -\log p(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \prod_i p(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta})$. Every training run of GPT-4, LLaMA-3, Gemini, or any neural network trained with NLL loss is performing MLE on the conditional distribution.

Evaluation uncertainty. When a benchmark reports "Model A achieves 73.2% accuracy", this is a point estimate. How uncertain is it? With $n = 1000$ test examples, the 95% CI is approximately $\pm 2.8\%$ - meaning Model A and a Model B with 73.5% accuracy are statistically indistinguishable. Estimation theory makes this precise.
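The $\pm 2.8\%$ figure follows from the normal-approximation (Wald) interval for a proportion; a small helper (the function name is ours) reproduces it:

```python
import math

def accuracy_ci_halfwidth(acc: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% CI for accuracy on n examples."""
    return z * math.sqrt(acc * (1 - acc) / n)

h = accuracy_ci_halfwidth(0.732, 1000)
print(round(100 * h, 1))  # about 2.7 points with z = 1.96; the z ~ 2 rule gives ~2.8
```

With $z = 1.96$ the half-width is about $\pm 2.7$ points; the rounder $z \approx 2$ rule of thumb gives the $\pm 2.8$ quoted above.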

Second-order optimisation. Adam, K-FAC, and natural gradient all incorporate curvature information. The natural gradient multiplies the gradient by $\mathcal{I}(\boldsymbol{\theta})^{-1}$, where $\mathcal{I}$ is the Fisher information matrix. This reparametrises the optimisation in the geometry of the statistical manifold, making gradient steps invariant to reparametrisation. K-FAC (Martens & Grosse, 2015) approximates $\mathcal{I}^{-1}$ tractably for deep networks.

Continual learning. Elastic weight consolidation (Kirkpatrick et al., 2017) protects important weights during sequential task learning. "Importance" is measured by the diagonal of the FIM: parameters with high Fisher information are those to which the likelihood is most sensitive, and perturbing them most damages performance.

Calibration. A well-calibrated model's confidence scores match empirical frequencies: when it says "80% confident", it's right 80% of the time. Temperature scaling - dividing logits by a scalar $\tau$ - is an MLE problem: find the $\tau$ that maximises likelihood on a held-out calibration set.
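A sketch of temperature scaling as MLE (function names and settings are illustrative, assuming nothing beyond NumPy): fit $\tau$ by minimising held-out NLL with a simple grid search, and check that artificially overconfident logits recover the scale they were inflated by.

```python
import numpy as np

def nll(logits, labels, tau):
    """Average negative log-likelihood of labels under softmax(logits / tau)."""
    z = logits / tau
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    """MLE of tau on a held-out calibration set, via 1-D grid search."""
    grid = np.linspace(0.25, 5.0, 400)
    return grid[np.argmin([nll(logits, labels, t) for t in grid])]

# Toy check: labels are drawn from softmax(l), but the "model" reports logits 3l
# (overconfident). The fitted temperature should come out near 3.
rng = np.random.default_rng(0)
cal_logits = rng.normal(size=(5000, 10))
probs = np.exp(cal_logits) / np.exp(cal_logits).sum(axis=1, keepdims=True)
labels = np.array([rng.choice(10, p=p) for p in probs])
print(round(fit_temperature(3.0 * cal_logits, labels), 2))
```

In practice $\tau$ is usually fitted with a few L-BFGS steps rather than a grid, but the objective - held-out likelihood in $\tau$ alone - is the same.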


2. The Formal Estimation Problem

2.1 Statistical Model and Parametric Family

Definition 2.1 (Parametric Statistical Model). A parametric statistical model is a collection of probability distributions indexed by a parameter:

$$\mathcal{M} = \{p(\cdot\,;\boldsymbol{\theta}) : \boldsymbol{\theta} \in \Theta\}$$

where $\Theta \subseteq \mathbb{R}^k$ is the parameter space. The model is:

  • Correctly specified if the true data-generating distribution $p^*$ satisfies $p^* = p(\cdot\,;\boldsymbol{\theta}^*)$ for some $\boldsymbol{\theta}^* \in \Theta$
  • Misspecified if $p^* \notin \mathcal{M}$ - the model family does not contain the truth (most practical ML models are misspecified)
  • Identifiable if different parameters give different distributions: $\boldsymbol{\theta} \neq \boldsymbol{\theta}' \Rightarrow p(\cdot\,;\boldsymbol{\theta}) \neq p(\cdot\,;\boldsymbol{\theta}')$

Identifiability is necessary for consistent estimation - you cannot consistently estimate a parameter that leaves the data distribution unchanged.

Standard examples of parametric families:

| Model | Parameter(s) | Parameter space | Notes |
|---|---|---|---|
| $\operatorname{Bern}(p)$ | $p$ | $(0,1)$ | Single binary outcome |
| $\mathcal{N}(\mu, \sigma^2)$ | $(\mu, \sigma^2)$ | $\mathbb{R} \times \mathbb{R}_{>0}$ | Location-scale family |
| $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ | $(\boldsymbol{\mu}, \Sigma)$ | $\mathbb{R}^d \times \mathbb{S}^d_{++}$ | Multivariate Gaussian |
| $\operatorname{Poisson}(\lambda)$ | $\lambda$ | $\mathbb{R}_{>0}$ | Count data |
| $\operatorname{Exp}(\lambda)$ | $\lambda$ | $\mathbb{R}_{>0}$ | Time-to-event |
| $\operatorname{Beta}(\alpha, \beta)$ | $(\alpha, \beta)$ | $\mathbb{R}_{>0}^2$ | Probabilities |

Non-examples (non-identifiable):

  • $\mathcal{N}(\mu_1 + \mu_2, 1)$ with parameter $(\mu_1, \mu_2)$: we can only estimate the sum $\mu_1 + \mu_2$, not the individual components. The model is not identifiable.
  • A neural network with one hidden layer of width 2 using $\tanh$: swapping the two hidden neurons (and their weights) gives the same function, so the parameterisation is not identifiable (though the function class may be).

The iid assumption. Throughout this section, observations $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}$ are assumed independent and identically distributed (iid) from $p(\cdot\,;\boldsymbol{\theta}^*)$. The iid assumption enables the joint log-likelihood to factorise as a sum:

$$\log p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)};\boldsymbol{\theta}) = \sum_{i=1}^n \log p(\mathbf{x}^{(i)};\boldsymbol{\theta})$$

This factorisation is what makes MLE both computationally and theoretically tractable. In practice, the assumption is violated whenever data points are correlated (time series, text sequences, spatially clustered data), and estimation theory has extensions to handle dependent data.

2.2 Point Estimators

Definition 2.2 (Estimator). An estimator of $\boldsymbol{\theta}$ is any measurable function $\hat{\boldsymbol{\theta}}_n = T(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)})$ of the sample. The key points:

  1. $\hat{\boldsymbol{\theta}}_n$ is a random variable - before observing data, it is uncertain
  2. A specific value $T(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}) = t$ after observing data is called an estimate (not estimator)
  3. The sampling distribution of $\hat{\boldsymbol{\theta}}_n$ is the distribution of $T$ across all possible samples of size $n$

This distinction - estimator (random variable) vs. estimate (specific realisation) - is pedantic but consequential. Confidence interval coverage is a property of the estimator (random), not the estimate (fixed). When we say "95% CI", we mean that the random interval contains $\boldsymbol{\theta}^*$ with probability 95%, not that this specific computed interval has a 95% chance of containing $\boldsymbol{\theta}^*$ (after observing data, $\boldsymbol{\theta}^*$ is either in the interval or not - there is no remaining randomness).

Examples of estimators for the Gaussian mean $\mu$:

Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$.

  • Sample mean: $\hat{\mu}_1 = \frac{1}{n}\sum_{i=1}^n x^{(i)}$ - unbiased, consistent, efficient
  • Constant zero: $\hat{\mu}_2 = 0$ - biased unless $\mu = 0$, inconsistent
  • First observation: $\hat{\mu}_3 = x^{(1)}$ - unbiased! but inconsistent (variance $\sigma^2$ regardless of $n$)
  • Trimmed mean: $\hat{\mu}_4 =$ mean of the middle 90% - unbiased for the symmetric Gaussian and less efficient than $\bar{x}$ there, but robust to outliers (and biased under skewed distributions)
  • Constant $c$: $\hat{\mu}_5 = c$ - bias $c - \mu$, inconsistent; can have lower MSE than the sample mean when $|\mu - c|$ is small (illustrates the bias-variance trade-off)

The existence of unbiased but inconsistent estimators ($\hat{\mu}_3$) and consistent but biased estimators (such as the variance MLE $\hat{\sigma}^2_{\text{MLE}}$ of Section 2.4) shows these properties are logically independent.
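A quick Monte Carlo comparison of three of these estimators (settings are illustrative) shows the bias and variance behaviour described above:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 1.0, 100, 5_000

samples = rng.normal(mu, sigma, size=(reps, n))

mean_est    = samples.mean(axis=1)                            # mu_hat_1
first_obs   = samples[:, 0]                                   # mu_hat_3
trimmed_est = np.sort(samples, axis=1)[:, 5:-5].mean(axis=1)  # middle 90%

for name, est in [("sample mean", mean_est),
                  ("first obs", first_obs),
                  ("trimmed mean", trimmed_est)]:
    print(f"{name:12s} bias={est.mean() - mu:+.3f}  var={est.var():.4f}")
```

All three are (nearly) unbiased here, but their variances differ enormously: $\sigma^2/n$ for the mean versus $\sigma^2$ for the first observation, with the trimmed mean in between.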

2.3 The Loss Framework: MSE, Bias, and Variance

To compare estimators, we need a criterion. The most widely used is Mean Squared Error (MSE):

Definition 2.3 (MSE, Bias, Variance). For an estimator $\hat{\theta}$ of a scalar parameter $\theta$:

$$\operatorname{MSE}(\hat{\theta}) = \mathbb{E}\left[(\hat{\theta} - \theta)^2\right]$$

The bias-variance decomposition is the central identity of estimation theory:

$$\operatorname{MSE}(\hat{\theta}) = \operatorname{Bias}^2(\hat{\theta}) + \operatorname{Var}(\hat{\theta})$$

Proof:

$$\operatorname{MSE}(\hat{\theta}) = \mathbb{E}\left[(\hat{\theta} - \theta)^2\right]$$

Add and subtract $\mathbb{E}[\hat{\theta}]$:

$$= \mathbb{E}\left[\left((\hat{\theta} - \mathbb{E}[\hat{\theta}]) + (\mathbb{E}[\hat{\theta}] - \theta)\right)^2\right]$$
$$= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right] + 2(\mathbb{E}[\hat{\theta}] - \theta)\underbrace{\mathbb{E}\left[\hat{\theta} - \mathbb{E}[\hat{\theta}]\right]}_{=0} + (\mathbb{E}[\hat{\theta}] - \theta)^2$$
$$= \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2 \qquad \square$$

where $\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$.
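The identity is easy to verify numerically; for Monte Carlo moments it holds exactly as an algebraic fact (illustrative sketch, using a deliberately biased shrinkage estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 1.0, 2.0, 25, 100_000

# A deliberately biased estimator: shrink the sample mean toward zero.
estimates = 0.8 * rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

mse  = ((estimates - mu) ** 2).mean()
bias = estimates.mean() - mu
var  = estimates.var()

# The decomposition MSE = Bias^2 + Var holds exactly for these empirical moments.
print(round(mse, 4), round(bias ** 2 + var, 4))
```

Here the true bias is $-0.2$ and the true variance is $0.64\,\sigma^2/n$, and the two printed numbers agree to floating-point precision.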

Geometric interpretation:

BIAS-VARIANCE DECOMPOSITION - GEOMETRIC VIEW
========================================================================

  True parameter θ* = 0 (target)

  Case 1: Low bias, low variance (good estimator)
    ---    Estimates cluster tightly around θ*
    ----   MSE ≈ small
    ---

  Case 2: Low bias, high variance (unbiased but noisy)
        -             -
    -       -   x   -       -
        -             -
    Centred on θ* but spread out.  MSE = Var is large.

  Case 3: High bias, low variance (biased but stable)
              ---
             ----    x  (θ*)
              ---
    Centred off target.  MSE = Bias^2 + small Var.

  Case 4: High bias, high variance (worst case)
    -           -
            -       x  (θ*)
        -       -
    Both centred off target AND spread out.

  Key trade-off: Reducing variance by shrinkage introduces bias.
  Increasing bias via regularisation often reduces variance more than
  it increases bias-squared, giving lower MSE overall.

========================================================================

The bias-variance trade-off in ML is the exact same phenomenon: regularisation (L2, dropout, early stopping) introduces bias (the model is pulled away from the MLE toward a constrained region) but reduces variance (the model is less sensitive to the particular training sample), often reducing test MSE overall.

Minimax estimators. MSE is not the only loss criterion. The minimax estimator minimises the worst-case risk: $\min_T \max_\theta \mathbb{E}[(\hat{\theta} - \theta)^2]$. The James-Stein estimator (1961) famously showed that in $\mathbb{R}^k$ for $k \geq 3$, the sample mean is inadmissible - there exists an estimator with strictly lower MSE at every $\theta$. This is the Stein paradox: shrinkage toward the origin (a biased estimator!) uniformly dominates the unbiased sample mean in 3+ dimensions.
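The Stein paradox is striking enough to be worth simulating (the setup below is illustrative): in $\mathbb{R}^{10}$, James-Stein shrinkage beats using the observation itself as the estimate of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
k, reps = 10, 50_000
theta = np.ones(k)  # true mean vector in R^10

x = rng.normal(theta, 1.0, size=(reps, k))  # one k-dim observation per replication

norm2 = (x ** 2).sum(axis=1, keepdims=True)
js = (1.0 - (k - 2) / norm2) * x            # James-Stein shrinkage toward the origin

mse_mle = ((x - theta) ** 2).sum(axis=1).mean()   # risk of the unbiased estimator: k
mse_js  = ((js - theta) ** 2).sum(axis=1).mean()  # strictly smaller
print(round(mse_mle, 2), round(mse_js, 2))
```

The unbiased estimator's risk is $k = 10$ regardless of $\theta$; the shrinkage estimator's risk is strictly below it, with the gap largest when $\|\theta\|$ is small.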

2.4 Examples of Biased and Unbiased Estimators

Example 1: Sample mean is unbiased.

Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} p(\cdot\,;\theta)$ with $\mathbb{E}[x^{(i)}] = \mu$. Then:

$$\mathbb{E}\!\left[\bar{x}\right] = \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^n x^{(i)}\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[x^{(i)}] = \frac{n\mu}{n} = \mu$$

The sample mean is an unbiased estimator of $\mu$ for any distribution with a finite mean.

Example 2: MLE of Gaussian variance is biased.

Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$. The MLE of $\sigma^2$ is:

$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \bar{x})^2$$

Computing the bias: Using $x^{(i)} - \bar{x} = (x^{(i)} - \mu) - (\bar{x} - \mu)$:

$$\sum_{i=1}^n (x^{(i)} - \bar{x})^2 = \sum_{i=1}^n (x^{(i)} - \mu)^2 - n(\bar{x} - \mu)^2$$

Taking expectations:

$$\mathbb{E}\!\left[\sum_{i=1}^n (x^{(i)} - \bar{x})^2\right] = n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\sigma^2$$

Therefore:

$$\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{(n-1)\sigma^2}{n} = \sigma^2 - \frac{\sigma^2}{n}$$

The bias is $-\sigma^2/n$ - the MLE underestimates the variance. The correction is Bessel's correction: $s^2 = \frac{1}{n-1}\sum(x^{(i)} - \bar{x})^2$ is unbiased.

Recall from Section 01: We saw Bessel's correction $s^2 = \frac{1}{n-1}\sum(x^{(i)}-\bar{x})^2$ introduced as the "standard" sample variance formula. Now we understand why: MLE gives the $1/n$ version, which is biased. The $n-1$ denominator corrects for the one degree of freedom lost in estimating $\mu$ by $\bar{x}$.
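A numerical check of the bias (settings are illustrative), using NumPy's `ddof` argument to switch between the $1/n$ and $1/(n-1)$ denominators:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, reps = 4.0, 10, 100_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
mle_var    = x.var(axis=1, ddof=0)  # divide by n     - the MLE
bessel_var = x.var(axis=1, ddof=1)  # divide by n - 1 - Bessel-corrected

# Expect E[MLE] = (n-1)/n * sigma^2 = 3.6 and E[s^2] = sigma^2 = 4.0.
print(round(mle_var.mean(), 2), round(bessel_var.mean(), 2))
```

With $n = 10$ the downward bias of the MLE is a full 10% of $\sigma^2$, large enough to see clearly in the Monte Carlo average.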

Example 3: MLE is biased but consistent.

The Gaussian variance MLE $\hat{\sigma}^2_{\text{MLE}}$ has bias $-\sigma^2/n \to 0$ as $n \to \infty$. It is asymptotically unbiased and consistent. This illustrates that bias can be acceptable if it vanishes with sample size.

Example 4: When bias reduces MSE.

Consider estimating $\mu$ with the shrinkage estimator $\hat{\mu}_c = c\,\bar{x}$ for some $c \in (0,1)$. Then:

$$\operatorname{MSE}(\hat{\mu}_c) = (c-1)^2 \mu^2 + c^2 \frac{\sigma^2}{n}$$

The optimal $c^* = \frac{\mu^2}{\mu^2 + \sigma^2/n}$, which gives lower MSE than $c = 1$ (the unbiased sample mean) whenever $\sigma^2/n > 0$. The key insight: when $|\mu|$ is small relative to $\sigma/\sqrt{n}$, shrinking toward zero introduces a small bias but a large variance reduction.
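A small check of the closed-form optimum against a brute-force grid (the values of $\mu$, $\sigma^2$, $n$ are illustrative):

```python
import numpy as np

mu, sigma2, n = 0.5, 4.0, 20

def mse_shrinkage(c):
    # MSE(c * x_bar) = (c - 1)^2 mu^2 + c^2 sigma^2 / n
    return (c - 1) ** 2 * mu ** 2 + c ** 2 * sigma2 / n

c_star = mu ** 2 / (mu ** 2 + sigma2 / n)  # closed-form optimum
grid = np.linspace(0.0, 1.0, 100_001)
c_grid = grid[np.argmin(mse_shrinkage(grid))]

print(round(c_star, 4), round(c_grid, 4))          # the two optima coincide
print(mse_shrinkage(c_star) < mse_shrinkage(1.0))  # shrinkage beats c = 1
```

For these numbers $c^* = 0.25/0.45 = 5/9$, i.e. substantial shrinkage, because $\mu$ is small relative to $\sigma/\sqrt{n}$.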


3. Properties of Estimators

3.1 Bias and Unbiasedness

Definition 3.1 (Bias). The bias of an estimator $\hat{\theta}$ for parameter $\theta$ is:

$$\operatorname{Bias}(\hat{\theta};\theta) = \mathbb{E}[\hat{\theta}] - \theta$$

The estimator is unbiased if $\operatorname{Bias}(\hat{\theta};\theta) = 0$ for all $\theta \in \Theta$, and asymptotically unbiased if $\operatorname{Bias}(\hat{\theta}_n;\theta) \to 0$ as $n \to \infty$.

Why unbiasedness is desirable - but not sacred. An unbiased estimator is correct on average over infinite repetitions of the experiment. This is a natural property to want. However, unbiasedness alone is insufficient:

  • The estimator $\hat{\theta} = x^{(1)}$ (use only the first observation) is unbiased but throws away $n-1$ observations - clearly wasteful
  • No unbiased estimator exists for some parameters: for a $\operatorname{Bern}(p)$ sample of size $n$, the expectation of any estimator is a polynomial in $p$ of degree at most $n$, so non-polynomial targets such as the odds $p/(1-p)$ admit no unbiased estimator
  • The Stein phenomenon (Section 2.3) shows that in 3+ dimensions, unbiased estimators can be uniformly dominated by biased ones

Bias correction. When an estimator $\hat{\theta}$ has known bias $b(\theta) = \mathbb{E}[\hat{\theta}] - \theta$, we can correct it: $\hat{\theta}_{\text{corr}} = \hat{\theta} - b(\hat{\theta})$ (a plug-in correction, exact when $b$ does not depend on $\theta$). This is the idea behind Bessel's correction ($n-1$ vs $n$ in the variance formula) and jackknife bias reduction.

For AI: Batch normalisation computes $\hat{\mu}_B = \frac{1}{B}\sum_{i \in \text{batch}} x_i$ as the batch mean during training. This is an unbiased estimator of the feature mean. During inference, PyTorch uses the running mean accumulated over training, which is a biased estimator of the current data mean if the distribution has shifted - illustrating that unbiasedness depends on whether the estimation scenario matches the training scenario.

3.2 Consistency

Definition 3.2 (Consistency). An estimator $\hat{\theta}_n$ is:

  • Weakly consistent if $\hat{\theta}_n \xrightarrow{P} \theta$ for all $\theta \in \Theta$ (convergence in probability)
  • Strongly consistent if $\hat{\theta}_n \xrightarrow{a.s.} \theta$ for all $\theta \in \Theta$ (almost-sure convergence)

Weak consistency: for every $\varepsilon > 0$, $P(|\hat{\theta}_n - \theta| > \varepsilon) \to 0$ as $n \to \infty$.

Sufficient conditions for consistency:

  • If $\operatorname{Bias}(\hat{\theta}_n;\theta) \to 0$ and $\operatorname{Var}(\hat{\theta}_n) \to 0$ as $n \to \infty$, then $\hat{\theta}_n$ is weakly consistent. (By Chebyshev plus the bias decomposition: $P(|\hat{\theta}_n - \theta| > \varepsilon) \leq \operatorname{MSE}(\hat{\theta}_n)/\varepsilon^2 \to 0$.)

Example: Sample mean consistency.

$\bar{x}_n = \frac{1}{n}\sum_{i=1}^n x^{(i)}$ satisfies $\operatorname{Bias} = 0$ and $\operatorname{Var}(\bar{x}_n) = \sigma^2/n \to 0$. Therefore $\bar{x}_n \xrightarrow{P} \mu$. This is exactly the Weak Law of Large Numbers - consistency of the sample mean is the LLN.
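Consistency is visible in a single long stream (an illustrative sketch): the running mean settles onto $\mu$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0

# One long iid stream; the running mean x_bar_n should converge to mu.
x = rng.normal(mu, sigma, size=100_000)
running = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 1_000, 100_000):
    print(f"n={n:6d}  |x_bar_n - mu| = {abs(running[n - 1] - mu):.4f}")
```

The error shrinks at the $\sigma/\sqrt{n}$ rate promised by $\operatorname{Var}(\bar{x}_n) = \sigma^2/n$.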

Example: MLE of Gaussian variance is consistent.

$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum(x^{(i)} - \bar{x})^2$ has bias $-\sigma^2/n \to 0$ and variance $O(1/n)$, so it is consistent even though it is biased for finite $n$.

Inconsistent estimator example: $\hat{\theta} = x^{(1)}$ (take only the first observation) has $\operatorname{Var} = \sigma^2$ regardless of $n$. It never concentrates - it is inconsistent.

Why consistency matters: Consistency is the minimal requirement for an estimator to be scientifically useful. A method that doesn't converge to the right answer with infinite data is fundamentally broken. Consistency does not guarantee that the estimator is good for small nn - convergence can be arbitrarily slow - but it at least ensures the method is correct in the large-data limit.

For AI: The "double descent" phenomenon in over-parameterised neural networks challenges classical bias-variance thinking. A massively over-parameterised model (dnd \gg n) can still be consistent for the Bayes-optimal predictor if trained with appropriate implicit regularisation (e.g., gradient descent from zero initialisation). This is an active area connecting statistical estimation theory to modern deep learning theory.

3.3 Efficiency

Among all unbiased estimators of $\theta$, which has the smallest variance? This question is answered by the Cramer-Rao bound (Section 4), but we can define efficiency as a property here.

Definition 3.3 (Efficiency and Relative Efficiency).

  • An unbiased estimator $\hat{\theta}$ is efficient if its variance equals the Cramer-Rao lower bound: $\operatorname{Var}(\hat{\theta}) = 1/\mathcal{I}(\theta)$.
  • The relative efficiency of two unbiased estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ is $e(\hat{\theta}_1, \hat{\theta}_2) = \operatorname{Var}(\hat{\theta}_2)/\operatorname{Var}(\hat{\theta}_1)$. Values greater than 1 mean $\hat{\theta}_1$ is more efficient.

Examples:

  1. Gaussian mean: The sample mean $\bar{x}$ is efficient - it achieves the CRB $\sigma^2/n$. The relative efficiency of the sample median vs. the mean is $2/\pi \approx 0.637$ for Gaussian data - the median throws away ~36% of the efficiency.

  2. Gaussian variance: The unbiased sample variance $s^2$ is not efficient: $\operatorname{Var}(s^2) = 2\sigma^4/(n-1)$, strictly above the CRB $2\sigma^4/n$. No unbiased estimator attains the bound here, yet $s^2$ is still the best unbiased choice - it is the UMVUE (Section 3.5).

Asymptotic efficiency. For large $n$, MLE achieves the CRB asymptotically - it is asymptotically efficient. This is one of its key advantages.

For AI: The sample mean is the most efficient estimator of the mean for Gaussian data. Batch gradient descent using all $n$ samples computes the exact gradient (the analogue of efficient estimation), while SGD uses a single sample or mini-batch (less efficient, higher variance, but much faster per update). The trade-off between statistical efficiency and computational efficiency is fundamental to modern ML training.
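The $2/\pi$ relative efficiency of the median can be reproduced by simulation (settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 101, 50_000

x = rng.normal(0.0, 1.0, size=(reps, n))
var_mean   = x.mean(axis=1).var()
var_median = np.median(x, axis=1).var()

# Relative efficiency of the median vs the mean under Gaussian data: near 2/pi.
print(round(var_mean / var_median, 3))
```

Both estimators are unbiased for $\mu = 0$ here, so the variance ratio is exactly the relative-efficiency comparison defined above.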

3.4 Sufficiency and the Rao-Blackwell Theorem

A sufficient statistic captures all the information in the data that is relevant to estimating $\theta$.

Definition 3.4 (Sufficient Statistic). A statistic $T = T(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)})$ is sufficient for $\theta$ if the conditional distribution $p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)} \mid T = t; \theta)$ does not depend on $\theta$ for any $t$.

Intuitively: once you know $T$, the raw data $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}$ provide no additional information about $\theta$.

Fisher-Neyman Factorisation Theorem. $T$ is sufficient for $\theta$ if and only if the likelihood factors as:

$$p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)};\theta) = g(T(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}), \theta) \cdot h(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)})$$

where $g$ depends on the data only through $T$, and $h$ does not depend on $\theta$.

Examples of sufficient statistics:

| Model | Sufficient statistic $T$ |
|---|---|
| $\operatorname{Bern}(p)$, $n$ samples | $T = \sum_{i=1}^n x^{(i)}$ (total successes) |
| $\mathcal{N}(\mu, \sigma^2)$, $\sigma^2$ known | $T = \bar{x}$ (sample mean) |
| $\mathcal{N}(\mu, \sigma^2)$, both unknown | $T = (\bar{x}, \sum(x^{(i)} - \bar{x})^2)$ (mean and SS) |
| $\operatorname{Poisson}(\lambda)$ | $T = \sum_{i=1}^n x^{(i)}$ (total count) |
| $\operatorname{Exp}(\lambda)$ | $T = \sum_{i=1}^n x^{(i)}$ (total time) |

Rao-Blackwell Theorem. If $\hat{\theta}$ is any unbiased estimator and $T$ is a sufficient statistic, then $\tilde{\theta} = \mathbb{E}[\hat{\theta} \mid T]$ satisfies:

$$\operatorname{Var}(\tilde{\theta}) \leq \operatorname{Var}(\hat{\theta})$$

with equality iff $\hat{\theta}$ is already a function of $T$. The "Rao-Blackwellised" estimator is at least as good.

Proof sketch: By the law of total variance, $\operatorname{Var}(\hat{\theta}) = \operatorname{Var}(\mathbb{E}[\hat{\theta} \mid T]) + \mathbb{E}[\operatorname{Var}(\hat{\theta} \mid T)] \geq \operatorname{Var}(\tilde{\theta})$.
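A concrete Rao-Blackwellisation (an illustrative sketch): for Bernoulli data, start from the wasteful unbiased estimator $x^{(1)}$ and condition on the sufficient statistic $T = \sum_i x^{(i)}$; since $\mathbb{E}[x^{(1)} \mid T = t] = t/n$, the improved estimator is exactly the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.3, 20, 100_000

x = (rng.random((reps, n)) < p).astype(float)

crude = x[:, 0]      # unbiased for p, but uses only the first flip
# Rao-Blackwellise against T = sum(x): E[x_1 | T = t] = t / n, the sample mean.
rb = x.mean(axis=1)

print(round(crude.var(), 3), round(rb.var(), 4))  # variance shrinks roughly n-fold
```

Both estimators are unbiased; conditioning on $T$ drops the variance from $p(1-p)$ to $p(1-p)/n$, exactly the law-of-total-variance gap in the proof sketch.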

For AI: Sufficient statistics are the foundation of exponential families, which include the Gaussian, Bernoulli, Poisson, Gamma, and most common distributions. The natural parameterisation of exponential families (used in generalised linear models and variational inference) exploits sufficient statistics to simplify computation. In the VAE (Kingma & Welling, 2014), the encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$ outputs the defining parameters $(\boldsymbol{\mu}_\phi(\mathbf{x}), \boldsymbol{\sigma}_\phi(\mathbf{x}))$ of the Gaussian posterior approximation.

3.5 Completeness and the Lehmann-Scheffe Theorem

Definition 3.5 (Complete Statistic). A sufficient statistic $T$ is complete if, for any measurable function $g$, $\mathbb{E}[g(T); \theta] = 0$ for all $\theta \in \Theta$ implies $g(T) = 0$ a.s. for all $\theta$.

Completeness means the only function of $T$ that has zero expectation everywhere is the zero function - there are no "hidden" unbiasedness conditions.

Lehmann-Scheffe Theorem. If $T$ is a complete sufficient statistic and $\tilde{\theta} = h(T)$ is unbiased for $\theta$, then $\tilde{\theta}$ is the unique minimum variance unbiased estimator (UMVUE) of $\theta$.

The UMVUE is the "best possible" unbiased estimator - it has the lowest variance among all unbiased estimators, and it is unique.

Example: For $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known, $T = \bar{x}$ is complete sufficient, and $\bar{x}$ itself is unbiased for $\mu$. By Lehmann-Scheffe, $\bar{x}$ is the UMVUE of $\mu$ - there is no unbiased estimator with smaller variance. This also confirms that $\bar{x}$ achieves the CRB $\sigma^2/n$.


4. Fisher Information and the Cramer-Rao Bound

4.1 The Score Function

The score function is the fundamental quantity linking estimation and information theory.

Definition 4.1 (Score Function). For a single observation $x$ from $p(x;\theta)$, the score is:

$$s(\theta; x) = \frac{\partial}{\partial \theta} \log p(x; \theta)$$

For a sample of $n$ iid observations, the total score is:

$$S_n(\theta) = \sum_{i=1}^n \frac{\partial}{\partial \theta} \log p(x^{(i)};\theta)$$

The MLE solves $S_n(\hat{\theta}) = 0$ (the score equation).

Zero-mean property of the score:

$$\mathbb{E}\!\left[s(\theta; X)\right] = \int \frac{\partial \log p(x;\theta)}{\partial \theta}\, p(x;\theta)\, dx = \int \frac{\partial p(x;\theta)/\partial \theta}{p(x;\theta)}\, p(x;\theta)\, dx = \frac{\partial}{\partial \theta}\int p(x;\theta)\, dx = \frac{\partial}{\partial \theta}(1) = 0$$

under regularity conditions permitting interchange of differentiation and integration.

Interpretation: The score measures how sensitively the log-likelihood responds to perturbations of $\theta$. At the true parameter value, the score has zero mean - on average, the likelihood is at a stationary point. But it has positive variance, measuring how much information the data carries about $\theta$.

SCORE FUNCTION - GEOMETRIC INTUITION
========================================================================

  Log-likelihood ℓ(θ) = Σ_i log p(x_i; θ) as a function of θ:

         ℓ(θ)
          |
          |         +-----+
          |       +-+     +-+
          |     +-+         +-+
          |---+-+             +-+----
          |                        θ
                    ^
                 θ* (true parameter)
                 slope = score = 0 at maximum

  Score s(θ; data) = dℓ/dθ is the slope of the log-likelihood.
  High variance of the score at θ* means the data sharply identifies
  θ* (the likelihood is "peaky"). Low variance means the data
  are relatively uninformative about θ.

========================================================================

4.2 Fisher Information

Definition 4.2 (Fisher Information). The Fisher information for parameter $\theta$ in model $p(x;\theta)$ is:

$$\mathcal{I}(\theta) = \mathbb{E}\!\left[\left(\frac{\partial \log p(X;\theta)}{\partial \theta}\right)^2\right] = \operatorname{Var}\!\left(\frac{\partial \log p(X;\theta)}{\partial \theta}\right)$$

(The second equality uses $\mathbb{E}[s] = 0$.)

Alternative form (under regularity conditions):

I(θ)=E ⁣[2logp(X;θ)θ2]\mathcal{I}(\theta) = -\mathbb{E}\!\left[\frac{\partial^2 \log p(X;\theta)}{\partial \theta^2}\right]

Proof of equivalence: Differentiating E[s(θ;X)]=0\mathbb{E}[s(\theta;X)] = 0 with respect to θ\theta:

0=θlogpθpdx=[2logpθ2p+logpθpθ]dx0 = \frac{\partial}{\partial\theta}\int \frac{\partial \log p}{\partial \theta} p\, dx = \int \left[\frac{\partial^2 \log p}{\partial \theta^2} p + \frac{\partial \log p}{\partial \theta} \cdot \frac{\partial p}{\partial \theta}\right] dx

=E ⁣[2logpθ2]+E ⁣[(logpθ)2]= \mathbb{E}\!\left[\frac{\partial^2 \log p}{\partial \theta^2}\right] + \mathbb{E}\!\left[\left(\frac{\partial \log p}{\partial\theta}\right)^2\right]

Therefore I(θ)=E ⁣[2logpθ2]\mathcal{I}(\theta) = -\mathbb{E}\!\left[\frac{\partial^2 \log p}{\partial\theta^2}\right]. \square

The second-derivative form has intuitive content: I(θ)\mathcal{I}(\theta) is the expected (negative) curvature of the log-likelihood. High curvature = the likelihood has a sharp peak = the data strongly identifies θ\theta. Low curvature = flat likelihood = data is relatively uninformative.

Fisher information for an iid sample of nn observations:

In(θ)=nI1(θ)\mathcal{I}_n(\theta) = n \cdot \mathcal{I}_1(\theta)

Information accumulates linearly with sample size - doubling data doubles information, halving uncertainty (in variance terms: 1/In=(1/I1)/n1/\mathcal{I}_n = (1/\mathcal{I}_1)/n).

Computed examples:

Bernoulli(p)(p): logp(x;p)=xlogp+(1x)log(1p)\log p(x;p) = x\log p + (1-x)\log(1-p). Score: s=x/p(1x)/(1p)s = x/p - (1-x)/(1-p). Score squared: s2=(xp)2p2(1p)2s^2 = \frac{(x-p)^2}{p^2(1-p)^2}.

I(p)=E[s2]=Var(X)p2(1p)2=p(1p)p2(1p)2=1p(1p)\mathcal{I}(p) = \mathbb{E}[s^2] = \frac{\operatorname{Var}(X)}{p^2(1-p)^2} = \frac{p(1-p)}{p^2(1-p)^2} = \frac{1}{p(1-p)}

N(μ,σ2)\mathcal{N}(\mu, \sigma^2) with σ2\sigma^2 known: logp(x;μ)=(xμ)22σ2+const\log p(x;\mu) = -\frac{(x-\mu)^2}{2\sigma^2} + \text{const}. Score: s=(xμ)/σ2s = (x-\mu)/\sigma^2.

I(μ)=E ⁣[(Xμ)2σ4]=Var(X)σ4=σ2σ4=1σ2\mathcal{I}(\mu) = \mathbb{E}\!\left[\frac{(X-\mu)^2}{\sigma^4}\right] = \frac{\operatorname{Var}(X)}{\sigma^4} = \frac{\sigma^2}{\sigma^4} = \frac{1}{\sigma^2}

Poisson(λ)(\lambda): logp(x;λ)=xlogλλlog(x!)\log p(x;\lambda) = x\log\lambda - \lambda - \log(x!). Score: s=x/λ1s = x/\lambda - 1.

I(λ)=E ⁣[(Xλ1)2]=Var(X)λ2=λλ2=1λ\mathcal{I}(\lambda) = \mathbb{E}\!\left[\left(\frac{X}{\lambda} - 1\right)^2\right] = \frac{\operatorname{Var}(X)}{\lambda^2} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}

Pattern: In each example the parameter is the mean of the distribution, and I(θ)=1/Var(X)\mathcal{I}(\theta) = 1/\operatorname{Var}(X). This is general: for a one-parameter exponential family in its mean parametrisation, the Fisher information is the reciprocal of the variance of the sufficient statistic.
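Both forms of the Fisher information can be verified numerically. A NumPy sketch for the Bernoulli example above (the sample size and seed are arbitrary):

```python
import numpy as np

# Monte Carlo check of the two forms of the Bernoulli Fisher information:
# I(p) = E[score^2] = -E[d^2 log p / dp^2] = 1/(p(1-p)).
rng = np.random.default_rng(1)
p = 0.3
x = rng.binomial(1, p, size=1_000_000)

score = x / p - (1 - x) / (1 - p)
second_deriv = -x / p**2 - (1 - x) / (1 - p)**2

print(np.mean(score**2))        # variance-of-score form, near 4.76
print(-np.mean(second_deriv))   # curvature form, same value
print(1 / (p * (1 - p)))        # exact: 4.7619...
```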

Multivariate Fisher Information Matrix. For θRk\boldsymbol{\theta} \in \mathbb{R}^k:

I(θ)jl=E ⁣[logpθjlogpθl]=E ⁣[2logpθjθl]\mathcal{I}(\boldsymbol{\theta})_{jl} = \mathbb{E}\!\left[\frac{\partial \log p}{\partial \theta_j} \cdot \frac{\partial \log p}{\partial \theta_l}\right] = -\mathbb{E}\!\left[\frac{\partial^2 \log p}{\partial \theta_j \partial \theta_l}\right]

In matrix form: I(θ)=E[ss]=E[Hlogp]\mathcal{I}(\boldsymbol{\theta}) = \mathbb{E}[\mathbf{s}\mathbf{s}^\top] = -\mathbb{E}[H_{\log p}] where HlogpH_{\log p} is the Hessian of the log-likelihood.

The FIM is always positive semi-definite (as an outer product expectation), and positive definite under identifiability.

4.3 The Cramer-Rao Lower Bound

The Cramer-Rao Lower Bound (CRB) is the fundamental limit on estimation accuracy.

Theorem 4.3 (Cramer-Rao Lower Bound). Let θ^\hat{\theta} be any unbiased estimator of θ\theta based on nn iid observations. Under regularity conditions:

Var(θ^)1nI(θ)\operatorname{Var}(\hat{\theta}) \geq \frac{1}{n\mathcal{I}(\theta)}

For the multivariate case with unbiased estimator θ^\hat{\boldsymbol{\theta}} of θRk\boldsymbol{\theta} \in \mathbb{R}^k:

Cov(θ^)1nI(θ)1\operatorname{Cov}(\hat{\boldsymbol{\theta}}) \succeq \frac{1}{n}\mathcal{I}(\boldsymbol{\theta})^{-1}

(meaning Cov(θ^)1nI(θ)1\operatorname{Cov}(\hat{\boldsymbol{\theta}}) - \frac{1}{n}\mathcal{I}(\boldsymbol{\theta})^{-1} is positive semidefinite).

Proof (scalar case, single observation):

We need to show Var(θ^)I(θ)1\operatorname{Var}(\hat{\theta}) \cdot \mathcal{I}(\theta) \geq 1.

Since θ^\hat{\theta} is unbiased: E[θ^]=θ\mathbb{E}[\hat{\theta}] = \theta. Differentiating with respect to θ\theta:

1=θθ^(x)p(x;θ)dx=θ^(x)logp(x;θ)θp(x;θ)dx=E[θ^s(θ;X)]1 = \frac{\partial}{\partial\theta}\int \hat{\theta}(x)\, p(x;\theta)\, dx = \int \hat{\theta}(x)\, \frac{\partial \log p(x;\theta)}{\partial\theta}\, p(x;\theta)\, dx = \mathbb{E}[\hat{\theta} \cdot s(\theta;X)]

Now apply the Cauchy-Schwarz inequality to E[θ^s]\mathbb{E}[\hat{\theta} \cdot s]:

1=E[θ^s]=E[(θ^θ)s](since E[s]=0)1 = \mathbb{E}[\hat{\theta} \cdot s] = \mathbb{E}[(\hat{\theta} - \theta) \cdot s] \quad (\text{since } \mathbb{E}[s] = 0)

By Cauchy-Schwarz: E[(θ^θ)s]2E[(θ^θ)2]E[s2]\mathbb{E}[(\hat{\theta} - \theta) \cdot s]^2 \leq \mathbb{E}[(\hat{\theta}-\theta)^2] \cdot \mathbb{E}[s^2]

1Var(θ^)I(θ)1 \leq \operatorname{Var}(\hat{\theta}) \cdot \mathcal{I}(\theta)

Var(θ^)1I(θ)\operatorname{Var}(\hat{\theta}) \geq \frac{1}{\mathcal{I}(\theta)} \qquad \square

When is the bound tight? The CRB is achieved (with equality in Cauchy-Schwarz) if and only if (θ^θ)=c(θ)s(θ;X)(\hat{\theta} - \theta) = c(\theta) \cdot s(\theta; X) for some function c(θ)c(\theta). This linearity holds essentially only in exponential families when the mean parameter is being estimated; in those models an efficient estimator exists and coincides with the MLE.

Biased estimator CRB. For biased estimators with E[θ^]=θ+b(θ)\mathbb{E}[\hat{\theta}] = \theta + b(\theta):

Var(θ^)(1+b(θ))2I(θ)\operatorname{Var}(\hat{\theta}) \geq \frac{(1 + b'(\theta))^2}{\mathcal{I}(\theta)}

This shows that introducing bias (via regularisation) can actually reduce the CRB - a biased estimator faces a different, potentially less demanding bound.

Examples of efficient estimators:

  Model                          Parameter   Efficient estimator   Variance     CRB
  N(mu, sigma^2), sigma^2 known  mu          x_bar                 sigma^2/n    sigma^2/n   [ok]
  Bern(p)                        p           x_bar                 p(1-p)/n     p(1-p)/n    [ok]
  Poisson(lambda)                lambda      x_bar                 lambda/n     lambda/n    [ok]
  Exp(lambda)                    lambda      n / sum_i x^(i)       lambda^2/n   lambda^2/n  [ok]

The sample mean is efficient in the first three models; for the Exponential rate, the MLE n/x(i)=1/xˉn/\sum x^{(i)} = 1/\bar{x} is biased in finite samples and attains the λ2/n\lambda^2/n bound asymptotically. This is not a coincidence - it follows from the structure of single-parameter exponential families.
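Efficiency of the sample mean is easy to confirm by simulation. A sketch for the Bernoulli row of the table (parameter values and trial count are illustrative):

```python
import numpy as np

# Var(x_bar) for Bernoulli across many simulated datasets vs the CRB p(1-p)/n.
rng = np.random.default_rng(2)
p, n, trials = 0.3, 50, 100_000

p_hat = rng.binomial(1, p, size=(trials, n)).mean(axis=1)

print(p_hat.var())        # empirical variance of the estimator
print(p * (1 - p) / n)    # CRB: 0.0042 -- the bound is attained
```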

4.4 Fisher Information Matrix in Practice

For a neural network with parameters θRp\boldsymbol{\theta} \in \mathbb{R}^p defining a conditional distribution p(yx;θ)p(y|\mathbf{x};\boldsymbol{\theta}), the FIM is:

I(θ)=E(x,y)pdata ⁣[θlogp(yx;θ)θlogp(yx;θ)]\mathcal{I}(\boldsymbol{\theta}) = \mathbb{E}_{(\mathbf{x},y)\sim p_{\text{data}}}\!\left[\nabla_{\boldsymbol{\theta}} \log p(y|\mathbf{x};\boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \log p(y|\mathbf{x};\boldsymbol{\theta})^\top\right]

This is an expected outer product of gradient vectors, making it a p×pp \times p PSD matrix. For modern LLMs with p1011p \sim 10^{11} parameters, the full matrix has p21022p^2 \sim 10^{22} entries - completely intractable to store.

Empirical FIM. In practice, approximate by sampling:

I^(θ)=1ni=1nθlogp(y(i)x(i);θ)θlogp(y(i)x(i);θ)\hat{\mathcal{I}}(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^n \nabla_{\boldsymbol{\theta}} \log p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \log p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta})^\top

This is the average outer product of per-example gradients (the "empirical Fisher"). Its diagonal is the average squared gradient - closely related to the second-moment accumulators tracked by adaptive optimisers such as Adam.
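As a concrete sketch, the empirical FIM for a toy logistic-regression model (all names and sizes here are illustrative, not from any particular library):

```python
import numpy as np

# Empirical FIM for a toy logistic model p(y=1|x) = sigmoid(w.x).
# Per-example score vector: grad_w log p(y|x; w) = (y - sigmoid(w.x)) * x.
rng = np.random.default_rng(3)
n, d = 5000, 3
X = rng.normal(size=(n, d))
w = np.array([0.5, -1.0, 2.0])
probs = 1 / (1 + np.exp(-X @ w))
y = rng.binomial(1, probs)

grads = (y - probs)[:, None] * X   # (n, d): one score vector per example
F_hat = grads.T @ grads / n        # average outer product = empirical FIM

print(F_hat.shape)                                # (3, 3)
print(np.linalg.eigvalsh(F_hat).min() >= -1e-10)  # PSD up to roundoff (True)
```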

Natural gradient (Amari, 1998). The ordinary gradient θL\nabla_{\boldsymbol{\theta}} \mathcal{L} is the steepest ascent direction in Euclidean parameter space Rp\mathbb{R}^p. But parameters do not live in Euclidean space - they live in the statistical manifold of distributions, where the natural metric is the Fisher information. The natural gradient is:

~θL=I(θ)1θL\tilde{\nabla}_{\boldsymbol{\theta}} \mathcal{L} = \mathcal{I}(\boldsymbol{\theta})^{-1} \nabla_{\boldsymbol{\theta}} \mathcal{L}

This is the steepest ascent direction in the metric defined by the FIM. Natural gradient is invariant to reparametrisation: if we change from θ\boldsymbol{\theta} to ϕ=g(θ)\boldsymbol{\phi} = g(\boldsymbol{\theta}), the natural gradient updates in the two parametrisations induce the same change in the underlying distribution (to first order).

K-FAC approximation (Martens & Grosse, 2015). Full FIM inversion costs O(p3)O(p^3). K-FAC approximates I1\mathcal{I}^{-1} by assuming layer-wise independence and Kronecker factorisation of each layer's FIM: I[l]A[l]G[l]\mathcal{I}^{[l]} \approx A^{[l]} \otimes G^{[l]}, where A[l]=E[a[l1](a[l1])]A^{[l]} = \mathbb{E}[\mathbf{a}^{[l-1]}(\mathbf{a}^{[l-1]})^\top] (input covariance) and G[l]=E[δ[l](δ[l])]G^{[l]} = \mathbb{E}[\delta^{[l]}(\delta^{[l]})^\top] (gradient covariance). Inverting the Kronecker product: (AG)1=A1G1(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}, reducing cost from O(p3)O(p^3) to O(din3+dout3)O(d_{\text{in}}^3 + d_{\text{out}}^3) per layer.
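The Kronecker inverse identity that K-FAC exploits is easy to verify numerically (the matrix sizes below are arbitrary stand-ins for layer dimensions):

```python
import numpy as np

# The identity K-FAC exploits: (A kron G)^{-1} = A^{-1} kron G^{-1}.
# Inverting the small factors replaces inverting the large product matrix.
rng = np.random.default_rng(4)

def random_spd(k):
    # random symmetric positive-definite matrix
    M = rng.normal(size=(k, k))
    return M @ M.T + k * np.eye(k)

A, G = random_spd(4), random_spd(3)      # stand-ins for the two layer factors
direct = np.linalg.inv(np.kron(A, G))    # invert the full 12x12 matrix
factored = np.kron(np.linalg.inv(A), np.linalg.inv(G))

print(np.allclose(direct, factored))     # True
```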

Jeffreys prior. In Bayesian statistics (preview of Section04), the Jeffreys prior p(θ)I(θ)p(\theta) \propto \sqrt{\mathcal{I}(\theta)} is the "uninformative" prior that is invariant to reparametrisation. For θ=p\theta = p in Bernoulli: I(p)=1/(p(1p))\mathcal{I}(p) = 1/(p(1-p)), so pJ(p)p1/2(1p)1/2p_J(p) \propto p^{-1/2}(1-p)^{-1/2}, which is Beta(1/2,1/2)\operatorname{Beta}(1/2, 1/2).

Preview: MAP Estimation. Adding a prior p(θ)p(\boldsymbol{\theta}) and finding argmaxθ[p(Dθ)p(θ)]\arg\max_{\boldsymbol{\theta}} [p(\mathcal{D}|\boldsymbol{\theta}) p(\boldsymbol{\theta})] gives the Maximum A Posteriori (MAP) estimate - MLE regularised by the log-prior. MAP with a Gaussian prior N(0,λ1I)\mathcal{N}(0, \lambda^{-1}I) gives L2-regularised MLE. The full Bayesian treatment of priors, posteriors, and conjugate families is in Section04 Bayesian Inference.


5. Maximum Likelihood Estimation

5.1 The Likelihood Principle

Definition 5.1 (Likelihood Function). Given observed data x(1),,x(n)\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)} and a parametric model p(;θ)p(\cdot;\boldsymbol{\theta}), the likelihood function is:

L(θ;x(1),,x(n))=i=1np(x(i);θ)L(\boldsymbol{\theta}; \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}) = \prod_{i=1}^n p(\mathbf{x}^{(i)};\boldsymbol{\theta})

The log-likelihood is:

(θ)=logL(θ)=i=1nlogp(x(i);θ)\ell(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}) = \sum_{i=1}^n \log p(\mathbf{x}^{(i)};\boldsymbol{\theta})

The Maximum Likelihood Estimator (MLE) is:

θ^MLE=argmaxθΘ(θ)=argmaxθΘi=1nlogp(x(i);θ)\hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta} \in \Theta} \ell(\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta} \in \Theta} \sum_{i=1}^n \log p(\mathbf{x}^{(i)};\boldsymbol{\theta})

The likelihood is not a probability. L(θ;x)L(\boldsymbol{\theta}; \mathbf{x}) is a function of θ\boldsymbol{\theta} for fixed data x\mathbf{x}, not a probability distribution over θ\boldsymbol{\theta}. It does not integrate to 1 over θ\boldsymbol{\theta}. The statement "L(θ)=0.3L(\boldsymbol{\theta}) = 0.3" means "the data has probability density 0.3 under parameter θ\boldsymbol{\theta}", not "there is a 30% chance θ\boldsymbol{\theta} is the true parameter".

Why log? Three reasons:

  1. Computational: i=1npi\prod_{i=1}^n p_i underflows to zero for large nn (e.g., 0.51000103010.5^{1000} \approx 10^{-301}); sums are numerically stable
  2. Mathematical: sums are easier to differentiate and optimise than products
  3. Statistical: log is a monotone transformation, so argmaxL=argmaxlogL\arg\max L = \arg\max \log L
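Reason 1 can be demonstrated in a few lines (a fair coin and 2000 flips are used here so that the raw product actually leaves float64's range):

```python
import numpy as np

# The product of 2000 coin-flip probabilities underflows float64;
# the sum of logs is perfectly well-behaved.
probs = np.full(2000, 0.5)

print(np.prod(probs))          # 0.0 -- 0.5**2000 is below float64's range
print(np.sum(np.log(probs)))   # -1386.29... = 2000 * log(0.5)
```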

The likelihood principle states that all information about θ\boldsymbol{\theta} in the data is contained in the likelihood function. Two datasets with proportional likelihoods should lead to the same inferences about θ\boldsymbol{\theta}. This is a philosophical principle - frequentists may not accept it fully (confidence intervals depend on the sampling procedure, not just the likelihood), but it underlies MLE and Bayesian inference alike.

5.2 Deriving MLEs: Scalar Parameters

The standard procedure for finding the MLE:

  1. Write the log-likelihood (θ)=ilogp(x(i);θ)\ell(\theta) = \sum_i \log p(x^{(i)};\theta)
  2. Differentiate and set (θ)=0\ell'(\theta) = 0 (the score equation)
  3. Verify it is a maximum (second derivative <0< 0 or global structure)
  4. Check boundary cases (the maximum might be at Θ\Theta's boundary)

Bernoulli MLE. Let x(1),,x(n)iidBern(p)x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \operatorname{Bern}(p).

(p)=i=1n[x(i)logp+(1x(i))log(1p)]=Slogp+(nS)log(1p)\ell(p) = \sum_{i=1}^n \left[x^{(i)}\log p + (1-x^{(i)})\log(1-p)\right] = S\log p + (n-S)\log(1-p)

where S=i=1nx(i)S = \sum_{i=1}^n x^{(i)} is the total number of successes. Setting (p)=0\ell'(p) = 0:

SpnS1p=0    S(1p)=p(nS)    p^MLE=Sn=xˉ\frac{S}{p} - \frac{n-S}{1-p} = 0 \implies S(1-p) = p(n-S) \implies \hat{p}_{\text{MLE}} = \frac{S}{n} = \bar{x}

The MLE is the sample proportion - the intuitive answer.

Poisson MLE. Let x(1),,x(n)iidPoisson(λ)x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \operatorname{Poisson}(\lambda).

(λ)=i=1n[x(i)logλλlog(x(i)!)]=Slogλnλ+const\ell(\lambda) = \sum_{i=1}^n \left[x^{(i)}\log\lambda - \lambda - \log(x^{(i)}!)\right] = S\log\lambda - n\lambda + \text{const}

Setting (λ)=S/λn=0\ell'(\lambda) = S/\lambda - n = 0: λ^MLE=S/n=xˉ\hat{\lambda}_{\text{MLE}} = S/n = \bar{x}.

The MLE of the Poisson rate is the sample mean - natural, since λ\lambda is the mean of the distribution (E[X]=λ\mathbb{E}[X] = \lambda).

Exponential MLE. Let x(1),,x(n)iidExp(λ)x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \operatorname{Exp}(\lambda), i.e., p(x;λ)=λeλxp(x;\lambda) = \lambda e^{-\lambda x} for x>0x > 0.

(λ)=nlogλλi=1nx(i)\ell(\lambda) = n\log\lambda - \lambda\sum_{i=1}^n x^{(i)}

Setting (λ)=n/λx(i)=0\ell'(\lambda) = n/\lambda - \sum x^{(i)} = 0: λ^MLE=n/x(i)=1/xˉ\hat{\lambda}_{\text{MLE}} = n/\sum x^{(i)} = 1/\bar{x}.

The MLE of the rate is the reciprocal of the sample mean - since E[X]=1/λ\mathbb{E}[X] = 1/\lambda, this is natural.

Uniform MLE. Let x(1),,x(n)iidU(0,θ)x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{U}(0,\theta), p(x;θ)=1/θp(x;\theta) = 1/\theta for x[0,θ]x \in [0,\theta].

L(θ)=θn1[θmaxix(i)]L(\theta) = \theta^{-n}\, \mathbb{1}[\theta \geq \max_i x^{(i)}]

The likelihood is zero for θ<maxix(i)\theta < \max_i x^{(i)} (some observation would be outside [0,θ][0,\theta]) and decreasing in θ\theta for θmaxix(i)\theta \geq \max_i x^{(i)}. The MLE is at the boundary: θ^MLE=maxix(i)\hat{\theta}_{\text{MLE}} = \max_i x^{(i)}.

This is an example where the score equation gives no solution (the log-likelihood has no interior critical point); the MLE is found by reasoning about the likelihood's shape. Also, the MLE here is biased: E[maxix(i)]=nθ/(n+1)<θ\mathbb{E}[\max_i x^{(i)}] = n\theta/(n+1) < \theta.
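The bias of the Uniform MLE is visible in simulation. A sketch (the values of θ\theta, nn, and the trial count are arbitrary):

```python
import numpy as np

# Bias of the Uniform(0, theta) MLE: E[max_i x_i] = n*theta/(n+1) < theta.
rng = np.random.default_rng(5)
theta, n, trials = 2.0, 10, 200_000

mle = rng.uniform(0, theta, size=(trials, n)).max(axis=1)

print(mle.mean())           # near 1.818, systematically below theta = 2
print(n * theta / (n + 1))  # theoretical mean: 20/11 = 1.8181...
```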

5.3 Deriving MLEs: Multivariate Gaussian

Let x(1),,x(n)iidN(μ,Σ)\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\boldsymbol{\mu}, \Sigma) with μRd\boldsymbol{\mu} \in \mathbb{R}^d and ΣS++d\Sigma \in \mathbb{S}^d_{++}.

The log-likelihood is:

(μ,Σ)=n2logdet(Σ)12i=1n(x(i)μ)Σ1(x(i)μ)nd2log(2π)\ell(\boldsymbol{\mu}, \Sigma) = -\frac{n}{2}\log\det(\Sigma) - \frac{1}{2}\sum_{i=1}^n (\mathbf{x}^{(i)} - \boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}^{(i)} - \boldsymbol{\mu}) - \frac{nd}{2}\log(2\pi)

MLE of μ\boldsymbol{\mu}: Taking the gradient with respect to μ\boldsymbol{\mu} and setting to zero:

μ=i=1nΣ1(x(i)μ)=0    μ^MLE=1ni=1nx(i)=xˉ\nabla_{\boldsymbol{\mu}} \ell = \sum_{i=1}^n \Sigma^{-1}(\mathbf{x}^{(i)} - \boldsymbol{\mu}) = 0 \implies \hat{\boldsymbol{\mu}}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}^{(i)} = \bar{\mathbf{x}}

MLE of Σ\Sigma: Substituting μ^=xˉ\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} and differentiating with respect to Σ1\Sigma^{-1} (using matrix calculus identities AlogdetA=A\nabla_A \log\det A = A^{-\top} and Atr(BA)=B\nabla_A \operatorname{tr}(BA) = B^\top):

Σ1=n2Σ12i=1n(x(i)xˉ)(x(i)xˉ)=0\frac{\partial \ell}{\partial \Sigma^{-1}} = \frac{n}{2}\Sigma - \frac{1}{2}\sum_{i=1}^n (\mathbf{x}^{(i)} - \bar{\mathbf{x}})(\mathbf{x}^{(i)} - \bar{\mathbf{x}})^\top = 0

Σ^MLE=1ni=1n(x(i)xˉ)(x(i)xˉ)\hat{\Sigma}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}^{(i)} - \bar{\mathbf{x}})(\mathbf{x}^{(i)} - \bar{\mathbf{x}})^\top

Bias of Σ^MLE\hat{\Sigma}_{\text{MLE}}: By the same calculation as in the scalar case, E[Σ^MLE]=n1nΣ\mathbb{E}[\hat{\Sigma}_{\text{MLE}}] = \frac{n-1}{n}\Sigma. The unbiased estimator is S=1n1i(x(i)xˉ)(x(i)xˉ)S = \frac{1}{n-1}\sum_i (\mathbf{x}^{(i)} - \bar{\mathbf{x}})(\mathbf{x}^{(i)} - \bar{\mathbf{x}})^\top.

For AI: Fitting a Gaussian to activations or embeddings is done in numerous ML methods. The empirical covariance of a layer's activations, used in whitening transforms, weight initialisation (Xavier/He initialisation matches the second moment), and covariance-regularised fine-tuning, all use this MLE formula. The multivariate Gaussian MLE is also the building block for Gaussian mixture models (EM algorithm) and linear discriminant analysis.
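The Gaussian MLE formulas above map directly onto array operations. A NumPy sketch (the parameters are illustrative); note that np.cov with bias=True uses the same 1/n normalisation as the MLE:

```python
import numpy as np

# Multivariate Gaussian MLE: sample mean and 1/n covariance.
rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=100_000)

mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)   # divide by n, not n-1

print(mu_hat)                                          # near [1, -2]
print(np.allclose(Sigma_hat, np.cov(X.T, bias=True)))  # True: same 1/n formula
```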

5.4 MLE as Cross-Entropy Minimisation

This connection is perhaps the most important in all of applied ML.

Theorem 5.4. For a classification model p(yx;θ)p(y|\mathbf{x};\boldsymbol{\theta}) and iid data {(x(i),y(i))}i=1n\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n:

θ^MLE=argmaxθi=1nlogp(y(i)x(i);θ)=argminθ[1ni=1nlogp(y(i)x(i);θ)]\hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^n \log p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta}) = \arg\min_{\boldsymbol{\theta}} \left[-\frac{1}{n}\sum_{i=1}^n \log p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta})\right]

The right-hand side is the negative log-likelihood (NLL) or cross-entropy loss:

LCE(θ)=1ni=1nlogp(y(i)x(i);θ)\mathcal{L}_{\text{CE}}(\boldsymbol{\theta}) = -\frac{1}{n}\sum_{i=1}^n \log p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta})

Connection to KL divergence: Let pdatap_{\text{data}} denote the empirical distribution 1niδ(x(i),y(i))\frac{1}{n}\sum_i \delta_{(\mathbf{x}^{(i)}, y^{(i)})} over observed pairs. Then:

LCE=1nilogp(y(i)x(i);θ)=H(pdata)+DKL(pdatapθ)\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_i \log p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta}) = H(p_{\text{data}}) + D_{\mathrm{KL}}(p_{\text{data}} \| p_\theta)

Since H(pdata)H(p_{\text{data}}) does not depend on θ\boldsymbol{\theta}:

argminθLCE=argminθDKL(pdatapθ)\arg\min_{\boldsymbol{\theta}} \mathcal{L}_{\text{CE}} = \arg\min_{\boldsymbol{\theta}} D_{\mathrm{KL}}(p_{\text{data}} \| p_{\boldsymbol{\theta}})

MLE minimises the KL divergence from the data distribution to the model. This is the information-theoretic interpretation of MLE.
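The decomposition of cross-entropy into entropy plus KL can be checked exactly for small discrete distributions (the two distributions below are arbitrary):

```python
import numpy as np

# Check H(p, q) = H(p) + KL(p || q) for discrete distributions.
p = np.array([0.5, 0.3, 0.2])   # "data" distribution
q = np.array([0.4, 0.4, 0.2])   # "model" distribution

cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
kl = np.sum(p * np.log(p / q))

print(np.isclose(cross_entropy, entropy + kl))   # True
print(kl > 0)                                    # True: the model pays a KL penalty
```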

For language models: An LLM defines pθ(xtx1,,xt1)p_{\boldsymbol{\theta}}(x_t | x_1, \ldots, x_{t-1}). Training by NLL is MLE:

θ^=argminθ1Tt=1Tlogpθ(xtx<t)\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} -\frac{1}{T}\sum_{t=1}^T \log p_{\boldsymbol{\theta}}(x_t | x_{<t})

where the sum is over all tokens in the training corpus. This objective is used verbatim in every large language model - GPT-4 (OpenAI, 2023), LLaMA-3 (Meta, 2024), Gemini (Google, 2023), Claude (Anthropic, 2024). The perplexity exp(LCE)\exp(\mathcal{L}_{\text{CE}}) is the standard evaluation metric.

Label smoothing as regularised MLE. Standard MLE uses hard labels y(i){0,1}Ky^{(i)} \in \{0,1\}^K (one-hot). Label smoothing (Szegedy et al., 2016) uses soft targets y~(i)=(1ε)y(i)+ε/K\tilde{y}^{(i)} = (1-\varepsilon)y^{(i)} + \varepsilon/K, adding a small weight to incorrect classes. This is equivalent to regularising MLE by mixing the empirical distribution with a uniform prior - reducing overconfidence and improving calibration.

5.5 Properties of MLE

Under regularity conditions (the "Cramer-Rao conditions" on the model), MLE satisfies:

1. Consistency: θ^MLEPθ\hat{\boldsymbol{\theta}}_{\text{MLE}} \xrightarrow{P} \boldsymbol{\theta}^*. The proof uses the fact that the expected log-likelihood E[(θ)/n]\mathbb{E}[\ell(\boldsymbol{\theta})/n] is uniquely maximised at the true parameter θ\boldsymbol{\theta}^* (by Gibbs' inequality, DKL(pθpθ)>0D_{\mathrm{KL}}(p_{\theta^*} \| p_\theta) > 0 for θθ\boldsymbol{\theta} \neq \boldsymbol{\theta}^* under identifiability), and the LLN ensures (θ)/nE[logp(X;θ)]\ell(\boldsymbol{\theta})/n \to \mathbb{E}[\log p(X;\boldsymbol{\theta})] uniformly.

2. Asymptotic normality:

n(θ^MLEθ)dN ⁣(0,I(θ)1)\sqrt{n}\left(\hat{\boldsymbol{\theta}}_{\text{MLE}} - \boldsymbol{\theta}^*\right) \xrightarrow{d} \mathcal{N}\!\left(\mathbf{0},\, \mathcal{I}(\boldsymbol{\theta}^*)^{-1}\right)

For large nn: θ^MLEN(θ,1nI(θ)1)\hat{\boldsymbol{\theta}}_{\text{MLE}} \approx \mathcal{N}(\boldsymbol{\theta}^*, \frac{1}{n}\mathcal{I}(\boldsymbol{\theta}^*)^{-1}).

3. Asymptotic efficiency: The asymptotic covariance 1nI1\frac{1}{n}\mathcal{I}^{-1} achieves the CRB. No consistent estimator can have smaller asymptotic variance (at every θ\boldsymbol{\theta}). MLE is asymptotically optimal.

4. Invariance under reparametrisation: If θ^\hat{\boldsymbol{\theta}} is the MLE of θ\boldsymbol{\theta}, then g(θ^)g(\hat{\boldsymbol{\theta}}) is the MLE of g(θ)g(\boldsymbol{\theta}) for any function gg (not required to be injective). This is the invariance property of MLE.

Examples of invariance:

  • MLE of σ\sigma is σ^=σ^MLE2\hat{\sigma} = \sqrt{\hat{\sigma}^2_{\text{MLE}}} (not ss, which is the unbiased estimator)
  • MLE of 1/λ1/\lambda in Exponential is xˉ\bar{x} (consistent with λ^=1/xˉ\hat{\lambda} = 1/\bar{x})
  • MLE of p2p^2 in Bernoulli is p^2=xˉ2\hat{p}^2 = \bar{x}^2 (biased, but the MLE)
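Asymptotic normality and efficiency can be seen in simulation. A sketch for the Exponential-rate MLE λ^=1/xˉ\hat{\lambda} = 1/\bar{x}, where I(λ)=1/λ2\mathcal{I}(\lambda) = 1/\lambda^2 (sample sizes are illustrative):

```python
import numpy as np

# Sampling distribution of the Exponential-rate MLE lambda_hat = 1/x_bar.
# Since I(lambda) = 1/lambda^2, sqrt(n)(lambda_hat - lambda) -> N(0, lambda^2).
rng = np.random.default_rng(7)
lam, n, trials = 2.0, 500, 20_000

x = rng.exponential(scale=1 / lam, size=(trials, n))
z = np.sqrt(n) * (1 / x.mean(axis=1) - lam)

print(z.mean())   # small: the O(1/n) bias vanishes in this scaling
print(z.var())    # near lambda^2 = 4, the inverse Fisher information
```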

Regularity conditions (Cramer-Rao conditions). These are the sufficient conditions for the above properties:

  1. The parameter space Θ\Theta is an open subset of Rk\mathbb{R}^k
  2. The model is identifiable
  3. The support {x:p(x;θ)>0}\{x: p(x;\boldsymbol{\theta}) > 0\} does not depend on θ\boldsymbol{\theta} (excludes Uniform)
  4. The log-likelihood is three times differentiable in θ\boldsymbol{\theta}
  5. The Fisher information matrix is positive definite at θ\boldsymbol{\theta}^*

When these fail (as in the Uniform example), MLE may still be the natural estimator but its properties differ.

5.6 Numerical MLE

For complex models, the score equation (θ)=0\ell'(\boldsymbol{\theta}) = 0 has no closed-form solution and must be solved numerically.

Gradient ascent on log-likelihood. The simplest approach:

θt+1=θt+ηθ(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \eta \nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}_t)

For neural networks trained with mini-batches: stochastic gradient ascent on the mini-batch log-likelihood. Equivalently, SGD on the NLL loss - descending the negative log-likelihood is the same update as ascending the log-likelihood.

Newton-Raphson / Fisher scoring. Use second-order information:

θt+1=θt[H(θt)]1(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - [H_\ell(\boldsymbol{\theta}_t)]^{-1} \nabla \ell(\boldsymbol{\theta}_t)

where HH_\ell is the Hessian of the log-likelihood. Replacing the Hessian with the expected negative Hessian (Fisher information) gives Fisher scoring:

θt+1=θt+1nI(θt)1(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \frac{1}{n}\mathcal{I}(\boldsymbol{\theta}_t)^{-1} \nabla \ell(\boldsymbol{\theta}_t)

This is exactly the natural gradient update. Newton-Raphson converges quadratically near the maximum - much faster than gradient ascent - but requires inverting a k×kk \times k matrix per step.
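As an illustration of Newton-Raphson where the score equation has no closed form, a sketch for logistic regression (for this canonical model Newton and Fisher scoring coincide; the data below are synthetic):

```python
import numpy as np

# Newton-Raphson MLE for logistic regression. The log-likelihood is concave,
# so Newton converges rapidly to the global maximum.
rng = np.random.default_rng(8)
n, d = 20_000, 3
X = rng.normal(size=(n, d))
w_true = np.array([0.5, -1.0, 2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ w_true)))

w = np.zeros(d)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ w))
    grad = X.T @ (y - p)                      # score of the log-likelihood
    H = -(X * (p * (1 - p))[:, None]).T @ X   # Hessian (negative definite)
    w = w - np.linalg.solve(H, grad)          # Newton step

print(w)   # close to w_true for large n
```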

EM algorithm. For models with latent variables (mixtures, HMMs, VAEs), the EM algorithm alternates between:

  • E-step: compute Q(θ;θt)=Ezx;θt[logp(x,z;θ)]Q(\boldsymbol{\theta};\boldsymbol{\theta}_t) = \mathbb{E}_{z|\mathbf{x};\boldsymbol{\theta}_t}[\log p(\mathbf{x},z;\boldsymbol{\theta})]
  • M-step: θt+1=argmaxθQ(θ;θt)\boldsymbol{\theta}_{t+1} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta};\boldsymbol{\theta}_t)

EM guarantees that (θt+1)(θt)\ell(\boldsymbol{\theta}_{t+1}) \geq \ell(\boldsymbol{\theta}_t) - the log-likelihood is non-decreasing. EM is a special case of variational inference (developed in Section04 Bayesian Inference).
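The E- and M-steps above can be sketched for a two-component 1-D Gaussian mixture; the assertion checks the monotone-ascent guarantee (all values below are illustrative):

```python
import numpy as np

# Minimal EM for a two-component 1-D Gaussian mixture.
rng = np.random.default_rng(9)
x = np.concatenate([rng.normal(-2, 1, 3000), rng.normal(3, 1, 7000)])

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

pi = np.array([0.5, 0.5])     # mixing weights
mu = np.array([-1.0, 1.0])    # crude initial means
var = np.array([1.0, 1.0])
prev_ll = -np.inf
for _ in range(50):
    # E-step: responsibilities r[i, k] = P(component k | x_i)
    dens = pi * normal_pdf(x[:, None], mu, var)   # (n, 2)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted MLE updates
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu)**2).sum(axis=0) / nk
    ll = np.log(dens.sum(axis=1)).sum()
    assert ll >= prev_ll - 1e-8                   # monotone ascent guarantee
    prev_ll = ll

print(np.sort(mu))   # close to the true component means -2 and 3
```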

Numerical pitfalls:

  • Overflow/underflow: compute =logpi\ell = \sum \log p_i, not logpi\log \prod p_i (avoid multiplying probabilities)
  • Log-sum-exp trick: for softmax-based likelihoods, logkezk=zmax+logkezkzmax\log \sum_k e^{z_k} = z_{\max} + \log\sum_k e^{z_k - z_{\max}}
  • Flat log-likelihoods: when I(θ)0\mathcal{I}(\boldsymbol{\theta}) \approx 0 near the solution, gradient methods converge extremely slowly; natural gradient methods help
  • Local maxima: the log-likelihood may be multimodal for mixture models and neural networks; multiple restarts are necessary
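The log-sum-exp trick from the list above, as a minimal implementation (the overflowing logits are deliberately extreme):

```python
import numpy as np

# Log-sum-exp: shift by the max so the exponentials cannot overflow.
def logsumexp(z):
    z = np.asarray(z, dtype=float)
    m = z.max()
    return m + np.log(np.sum(np.exp(z - m)))

logits = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(logits))))   # inf: exp(1000) overflows float64
print(logsumexp(logits))                # 1002.407..., computed stably
```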
