Estimation Theory: Appendix M (Glossary of Key Terms) to Appendix P (Practice Problems)

Appendix M: Glossary of Key Terms

| Term | Formal definition | Intuition |
| --- | --- | --- |
| Estimator | A measurable function $T(\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(n)})$ of the data | A rule for computing a guess from observed data |
| Estimate | A specific value $T(x^{(1)},\ldots,x^{(n)}) = t$ after observing data | The actual number produced by the estimator |
| Sampling distribution | The distribution of $\hat{\theta}$ across all possible samples of size $n$ | How much the estimate varies from experiment to experiment |
| Bias | $\mathbb{E}[\hat{\theta}] - \theta$ | How far off the estimator is on average |
| Variance | $\mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2]$ | How much the estimator fluctuates around its mean |
| MSE | $\mathbb{E}[(\hat{\theta}-\theta)^2] = \text{Bias}^2 + \text{Var}$ | Total estimation error |
| Consistency | $\hat{\theta}_n \xrightarrow{P} \theta$ as $n \to \infty$ | Converges to the right answer with more data |
| Efficiency | $\operatorname{Var}(\hat{\theta}) = 1/(n\mathcal{I}(\theta))$ | Achieves the minimum possible variance |
| Sufficient statistic | $T$ such that $p(\mathbf{x} \mid T;\theta)$ does not depend on $\theta$ | Compresses the data without losing information about $\theta$ |
| Fisher information | $\mathcal{I}(\theta) = \mathbb{E}[s^2(\theta;X)] = -\mathbb{E}[\ell''(\theta;X)]$ | How much the data informs $\theta$; curvature of the log-likelihood |
| CRB | $\operatorname{Var}(\hat{\theta}) \geq 1/(n\mathcal{I}(\theta))$ for unbiased $\hat{\theta}$ | Hard lower bound on variance; no unbiased estimator can beat it |
| MLE | $\hat{\theta} = \arg\max_\theta \sum_i \log p(x^{(i)};\theta)$ | Parameter value that makes the observed data most probable |
| Asymptotic normality | $\sqrt{n}(\hat{\theta}_{\text{MLE}}-\theta^*) \xrightarrow{d} \mathcal{N}(0,\mathcal{I}^{-1})$ | MLE is approximately normal for large $n$ |
| Confidence interval | Random interval covering $\theta$ with probability $1-\alpha$ | Uncertainty quantification for frequentist estimation |
| Natural parameter | $\boldsymbol{\eta}$ in $p(x;\boldsymbol{\eta}) = h(x)\exp(\boldsymbol{\eta}^\top T(x) - A(\boldsymbol{\eta}))$ | Parameterisation of exponential families |
| Natural gradient | $\mathcal{I}(\boldsymbol{\theta})^{-1}\nabla_{\boldsymbol{\theta}}\mathcal{L}$ | Steepest ascent in the Fisher-Rao metric on the statistical manifold |
| Misspecification | $p^* \notin \{p(\cdot;\theta):\theta \in \Theta\}$ | True distribution is not in the model family |
| Pseudo-true parameter | $\theta^{**} = \arg\min_\theta D_{\mathrm{KL}}(p^*\,\Vert\,p(\cdot;\theta))$ | Closest point in the model to the truth under KL divergence |
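
The bias-variance decomposition of the MSE in the table is easy to verify numerically. Below is a minimal sketch, assuming NumPy; the Gaussian variance MLE (which divides by $n$ and has exact bias $-\sigma^2/n$) and all constants are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, trials = 10, 4.0, 200_000

# MLE of the variance divides by n (ddof=0), so it is biased downward by sigma2/n
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
sigma2_hat = samples.var(axis=1)

bias = sigma2_hat.mean() - sigma2
var = sigma2_hat.var()
mse = np.mean((sigma2_hat - sigma2) ** 2)

print(f"bias ≈ {bias:.4f}  (exact: {-sigma2 / n:.4f})")
print(f"bias² + var ≈ {bias**2 + var:.4f}   MSE ≈ {mse:.4f}")
```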

Appendix N: Further Reading and References

Primary References

  1. Casella, G. & Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. - The definitive graduate textbook on classical estimation theory; covers all topics in this section at full rigor.

  2. Lehmann, E.L. & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. - Advanced treatment of UMVUE theory, Rao-Blackwell, and sufficiency.

  3. Fisher, R.A. (1922). "On the Mathematical Foundations of Theoretical Statistics." Philosophical Transactions of the Royal Society A, 222, 309-368. - The foundational paper defining sufficiency, efficiency, consistency, and MLE.

  4. Cramer, H. (1946). Mathematical Methods of Statistics. Princeton University Press. - Original proof of the CRB.

  5. Rao, C.R. (1945). "Information and the Accuracy Attainable in the Estimation of Statistical Parameters." Bulletin of the Calcutta Mathematical Society, 37, 81-91. - Independent CRB proof and Rao-Blackwell theorem.

  6. Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife." Annals of Statistics, 7(1), 1-26. - Original bootstrap paper.

ML-Specific References

  1. Amari, S. (1998). "Natural Gradient Works Efficiently in Learning." Neural Computation, 10(2), 251-276. - Foundation of natural gradient methods.

  2. Martens, J. & Grosse, R. (2015). "Optimizing Neural Networks with Kronecker-factored Approximate Curvature." ICML. - K-FAC for tractable FIM approximation in deep networks.

  3. Kirkpatrick, J. et al. (2017). "Overcoming Catastrophic Forgetting in Neural Networks." PNAS, 114(13), 3521-3526. - EWC using FIM for continual learning.

  4. Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS. - Chinchilla scaling laws via nonlinear MLE.

  5. Hu, E. et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR. - Low-rank MLE for efficient fine-tuning.

  6. Guo, C. et al. (2017). "On Calibration of Modern Neural Networks." ICML. - Temperature scaling as MLE on calibration set.

  7. Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. - Chapter 5 covers MLE in the context of deep learning at the appropriate depth for ML practitioners.

Advanced References

  1. van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press. - Graduate-level asymptotic theory; proofs of asymptotic normality, delta method, semiparametric theory.

  2. Wasserman, L. (2004). All of Statistics. Springer. - Excellent graduate-level survey connecting classical statistics to modern methods.

  3. Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. - Chapters 1-3 cover MLE, MAP, and Bayesian estimation with ML motivation.


This section is part of the Math for LLMs curriculum. For corrections or contributions, see CONTRIBUTING.md.


Appendix O: Connections to Information Theory

The Fisher information matrix has a deep connection to Shannon information theory, previewing material from Chapter 9 - Information Theory.

O.1 Fisher Information as Local Curvature of KL Divergence

The KL divergence between distributions at nearby parameter values satisfies:

$$D_{\mathrm{KL}}\big(p(\cdot;\boldsymbol{\theta}) \,\|\, p(\cdot;\boldsymbol{\theta}+\mathrm{d}\boldsymbol{\theta})\big) = \frac{1}{2}\,\mathrm{d}\boldsymbol{\theta}^\top \mathcal{I}(\boldsymbol{\theta})\, \mathrm{d}\boldsymbol{\theta} + O(\|\mathrm{d}\boldsymbol{\theta}\|^3)$$

The Fisher information matrix is the Hessian of the KL divergence with respect to the second argument, evaluated at the point where both arguments agree. This identifies I\mathcal{I} as the local curvature of the divergence surface.

Consequence: the quadratic form $\mathrm{d}\boldsymbol{\theta}^\top \mathcal{I}(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}$ measures how far a parameter step $\mathrm{d}\boldsymbol{\theta}$ moves the distribution, in KL terms. The natural gradient (stepping along $\mathcal{I}^{-1}\nabla\mathcal{L}$ in parameter space) is therefore the direction of steepest ascent of the objective per unit of KL divergence moved in distribution space, rather than per unit of Euclidean distance in parameter space.
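
This expansion is straightforward to check numerically. Below is a minimal sketch, assuming NumPy; the Bernoulli model, for which $\mathcal{I}(\theta) = 1/(\theta(1-\theta))$, is an illustrative choice:

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta = 0.3
fisher = 1.0 / (theta * (1.0 - theta))  # I(theta) for a Bernoulli model

# The ratio KL / (0.5 * I * dθ²) should approach 1 as dθ shrinks
for d in [1e-1, 1e-2, 1e-3]:
    kl = kl_bernoulli(theta, theta + d)
    quad = 0.5 * fisher * d**2
    print(f"dθ={d:.0e}:  KL={kl:.3e}  ½·I·dθ²={quad:.3e}  ratio={kl / quad:.4f}")
```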

O.2 Cramer-Rao and Channel Capacity

Shannon's channel capacity theorem and the Cramer-Rao bound are related through the van Trees inequality (a Bayesian version of the CRB for random θ\theta):

$$\operatorname{MSE}(\hat{\theta}) \geq \frac{1}{n\,\mathcal{I}(\theta) + \mathcal{I}_{\text{prior}}(\theta)}$$

where Iprior\mathcal{I}_{\text{prior}} is the Fisher information of the prior. As the prior becomes uninformative (Iprior0\mathcal{I}_{\text{prior}} \to 0), this recovers the standard CRB. The connection: information accumulates (Fisher information adds) across observations and from the prior, exactly as Shannon information adds in a noisy channel.
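
As an illustration, consider a conjugate Gaussian model (an illustrative choice, not one from the text): for $X_i \sim \mathcal{N}(\theta, \sigma^2)$ with prior $\theta \sim \mathcal{N}(0, \tau^2)$, we have $\mathcal{I}(\theta) = 1/\sigma^2$ and $\mathcal{I}_{\text{prior}} = 1/\tau^2$, and the posterior mean attains the van Trees bound exactly. A minimal simulation sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, tau2, trials = 5, 1.0, 2.0, 200_000

# Draw theta from the prior, then data given theta
theta = rng.normal(0.0, np.sqrt(tau2), size=trials)
x = rng.normal(theta[:, None], np.sqrt(sigma2), size=(trials, n))

# Posterior mean under the conjugate Gaussian model shrinks x̄ toward the prior mean 0
w = (n / sigma2) / (n / sigma2 + 1 / tau2)
theta_hat = w * x.mean(axis=1)

bayes_mse = np.mean((theta_hat - theta) ** 2)
van_trees = 1.0 / (n / sigma2 + 1.0 / tau2)
print(f"Bayes MSE ≈ {bayes_mse:.4f}   van Trees bound = {van_trees:.4f}")
```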

O.3 Sufficient Statistics and Data Compression

Any statistic $T$ obeys the data processing inequality for Fisher information: $\mathcal{I}_T(\theta) \leq \mathcal{I}_{\mathbf{X}}(\theta)$, so processing the data cannot increase information about $\theta$. For a sufficient statistic the inequality is tight, $\mathcal{I}_T(\theta) = \mathcal{I}_{\mathbf{X}}(\theta)$ - no information is lost. This is the estimation-theory analogue of lossless compression: a sufficient statistic compresses the data without losing any information about $\theta$.

Minimal sufficient statistics provide the maximum compression while retaining all information - analogous to the minimum description length (MDL) principle in information theory.
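
As a concrete instance (a standard textbook example, not one worked in this lesson): for $X_1,\ldots,X_n \sim \mathcal{N}(\mu, \sigma^2)$ i.i.d. with $\sigma^2$ known, the sample mean $T = \bar{X}$ is sufficient for $\mu$, and computing both informations confirms that nothing is lost:

$$\mathcal{I}_{\mathbf{X}}(\mu) = \frac{n}{\sigma^2}, \qquad T = \bar{X} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right) \;\Rightarrow\; \mathcal{I}_T(\mu) = \frac{1}{\sigma^2/n} = \frac{n}{\sigma^2}.$$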


Appendix P: Practice Problems

P.1 Identification Problems

For each of the following, state whether the model is identifiable. If not, identify the unidentifiable combination.

  1. $X \sim \mathcal{N}(\mu_1 - \mu_2, 1)$ with parameters $(\mu_1, \mu_2) \in \mathbb{R}^2$.
  2. $p(x|\theta) = \theta e^{-\theta x}$ for $x > 0$, with $\theta > 0$.
  3. A two-layer ReLU network $f(x) = \text{ReLU}(w_1 x) + \text{ReLU}(w_2 x)$ with parameters $(w_1, w_2)$. Are the parameters identifiable from the function?

P.2 Computational Problems

  1. MLE and invariance. For Poisson data with $n = 20$ and $\sum x_i = 48$: (a) find $\hat{\lambda}$; (b) find the MLE of $e^{-\lambda}$ (the probability of observing zero events); (c) find the MLE of $1/\lambda$ (the mean inter-arrival time).

  2. CRB for a transformed parameter. For $X \sim \mathcal{N}(\mu, 1)$ with $n$ observations, derive the CRB for estimating $g(\mu) = \mu^2$ using the biased-estimator form of the CRB. Show that the MLE $\hat{\mu}^2 = \bar{x}^2$ does not achieve this bound for finite $n$, but does so asymptotically.

  3. Bootstrap vs. asymptotic. For $n = 20$ observations from $\mathcal{N}(\mu, 1)$ with $\bar{x} = 3.2$: (a) compute the exact 95% $t$-CI; (b) simulate the bootstrap CI with $B = 1000$; (c) compare coverage by repeating 1000 times and computing the empirical coverage of each CI type. A starter sketch follows this list.
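
A starter sketch for problem 3, assuming NumPy and SciPy are available; the percentile bootstrap used here is one of several valid bootstrap CIs, and only parts (a) and (b) are shown, with the coverage study in (c) left as the exercise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, B = 20, 1000

# Stand-in sample, shifted so the sample mean is exactly 3.2 as in the problem
x = rng.normal(0.0, 1.0, size=n)
x = x - x.mean() + 3.2

# (a) Exact 95% t-interval for the mean
half = stats.t.ppf(0.975, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
print(f"t-CI: ({x.mean() - half:.3f}, {x.mean() + half:.3f})")

# (b) Percentile bootstrap CI from B resamples of the same sample
boot_means = np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(B)])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap CI: ({lo:.3f}, {hi:.3f})")
```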

P.3 Conceptual Problems

  1. Stein paradox. For $d = 1$, the sample mean $\bar{x}$ is the admissible MLE. For $d = 3$, the James-Stein estimator $\tilde{\boldsymbol{\mu}} = \left(1 - \frac{d-2}{n\lVert\bar{\mathbf{x}}\rVert_2^2}\right)\bar{\mathbf{x}}$ dominates $\bar{\mathbf{x}}$ in MSE. Simulate this for $d = 3$, $n = 1$, true $\boldsymbol{\mu} = (2,2,2)$, $\sigma = 1$, and verify $\text{MSE}(\tilde{\boldsymbol{\mu}}) < \text{MSE}(\bar{\mathbf{x}})$. A simulation sketch follows this list.

  2. Model misspecification. A logistic regression model is fitted to data generated from a probit model ($p(y=1|\mathbf{x}) = \Phi(\mathbf{w}^\top\mathbf{x})$). To what does the MLE converge? Explain using the misspecified-MLE theorem and the KL divergence interpretation.

  3. FIM for temperature scaling. For a $K$-class classifier with logits $\mathbf{z}$ and temperature $T$, the softmax is $p_k = e^{z_k/T}/\sum_j e^{z_j/T}$. Derive the Fisher information $\mathcal{I}(T)$ for a single example and explain why Newton-Raphson temperature calibration converges in 1-2 steps.
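
A simulation sketch for problem 1, assuming NumPy; with $n = 1$ and $\sigma = 1$ the shrinkage factor reduces to $1 - (d-2)/\lVert\mathbf{x}\rVert_2^2$, and the constants follow the problem statement:

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 3, 200_000
mu = np.array([2.0, 2.0, 2.0])

# One observation per trial: x ~ N(mu, I_3), so the MLE of mu is x itself
x = mu + rng.standard_normal((trials, d))

# James-Stein shrinkage toward the origin (n = 1, sigma = 1)
shrink = 1.0 - (d - 2) / np.sum(x**2, axis=1, keepdims=True)
js = shrink * x

mse_mle = np.mean(np.sum((x - mu) ** 2, axis=1))  # should be close to d = 3
mse_js = np.mean(np.sum((js - mu) ** 2, axis=1))  # strictly smaller for d >= 3
print(f"MSE(MLE) ≈ {mse_mle:.4f}   MSE(JS) ≈ {mse_js:.4f}")
```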

