Estimation Theory: Appendix M (Glossary of Key Terms) to Appendix P (Practice Problems)
Appendix M: Glossary of Key Terms
| Term | Formal definition | Intuition |
|---|---|---|
| Estimator | $\hat{\theta} = T(x_1, \dots, x_n)$, a measurable function of the data | A rule for computing a guess from observed data |
| Estimate | The specific value $\hat{\theta}(x_1, \dots, x_n)$ after observing data | The actual number produced by the estimator |
| Sampling distribution | The distribution of $\hat{\theta}$ across all possible samples of size $n$ | How much the estimate varies from experiment to experiment |
| Bias | $\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$ | How far off the estimator is on average |
| Variance | $\operatorname{Var}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]$ | How much the estimator fluctuates around its mean |
| MSE | $\operatorname{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \operatorname{Bias}^2 + \operatorname{Var}$ | Total estimation error |
| Consistency | $\hat{\theta}_n \xrightarrow{p} \theta$ as $n \to \infty$ | Converges to the right answer with more data |
| Efficiency | $\operatorname{Var}(\hat{\theta})$ attains the Cramér-Rao bound | Achieves the minimum possible variance |
| Sufficient statistic | $T(\mathbf{x})$ such that $p(\mathbf{x} \mid T; \theta)$ does not depend on $\theta$ | Captures everything the data say about $\theta$ |
| Fisher information | $I(\theta) = \mathbb{E}\big[\big(\partial_\theta \log p(\mathbf{x}; \theta)\big)^2\big]$ | How much the data informs $\theta$; curvature of the log-likelihood |
| CRB | $\operatorname{Var}(\hat{\theta}) \ge I(\theta)^{-1}$ for unbiased $\hat{\theta}$ | Hard lower bound on variance; no unbiased estimator can beat it |
| MLE | $\hat{\theta}_{\text{MLE}} = \arg\max_\theta \log p(\mathbf{x}; \theta)$ | Parameter that makes the observed data most probable |
| Asymptotic normality | $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}\big(0, I(\theta)^{-1}\big)$ | MLE is approximately normal for large $n$ |
| Confidence interval | Random interval covering $\theta$ with probability $1 - \alpha$ | Uncertainty quantification for frequentist estimation |
| Natural parameter | $\eta$ in $p(x; \eta) = h(x)\exp\big(\eta^\top T(x) - A(\eta)\big)$ | Canonical parameterisation of exponential families |
| Natural gradient | $\tilde{\nabla} L = F^{-1}\nabla L$; steepest ascent in the Fisher-Rao metric on the statistical manifold | Gradient steps measured in distribution space rather than parameter space |
| Misspecification | True distribution $p^\ast$ is not in the model family $\{p(\cdot;\theta)\}$ | The model cannot represent the data-generating process exactly |
| Pseudo-true parameter | $\theta^\ast = \arg\min_\theta \operatorname{KL}\big(p^\ast \parallel p(\cdot;\theta)\big)$ | Closest point in the model to the truth under KL divergence |
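Several of these definitions can be checked numerically. The sketch below is not part of the original glossary; it assumes i.i.d. normal data with illustrative parameter values, and estimates the bias, variance, and MSE of the (biased but consistent) MLE of the variance by Monte Carlo, confirming the decomposition MSE = bias² + variance.

```python
import numpy as np

# Monte Carlo check of the glossary definitions: bias, variance, MSE,
# and the decomposition MSE = bias^2 + variance, for the MLE of the
# variance of N(mu, sigma^2) data (a biased but consistent estimator).
rng = np.random.default_rng(0)
mu, sigma2, n, n_reps = 0.0, 4.0, 10, 100_000   # illustrative values, not from the text

estimates = np.empty(n_reps)
for r in range(n_reps):
    x = rng.normal(mu, np.sqrt(sigma2), size=n)
    estimates[r] = np.mean((x - x.mean()) ** 2)   # MLE: divides by n, not n - 1

bias = estimates.mean() - sigma2          # E[theta_hat] - theta  (theory: -sigma2/n)
variance = estimates.var()                # fluctuation around its own mean
mse = np.mean((estimates - sigma2) ** 2)  # total estimation error

print(f"bias     = {bias:+.4f}   (theory: {-sigma2 / n:+.4f})")
print(f"variance = {variance:.4f}")
print(f"mse      = {mse:.4f}   (bias^2 + var = {bias ** 2 + variance:.4f})")
```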
Appendix N: Further Reading and References
Primary References
- Casella, G. & Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. - The definitive graduate textbook on classical estimation theory; covers all topics in this section at full rigor.
- Lehmann, E.L. & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. - Advanced treatment of UMVUE theory, Rao-Blackwell, and sufficiency.
- Fisher, R.A. (1922). "On the Mathematical Foundations of Theoretical Statistics." Philosophical Transactions of the Royal Society A, 222, 309-368. - The foundational paper defining sufficiency, efficiency, consistency, and MLE.
- Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. - Original proof of the CRB.
- Rao, C.R. (1945). "Information and the Accuracy Attainable in the Estimation of Statistical Parameters." Bulletin of the Calcutta Mathematical Society, 37, 81-91. - Independent CRB proof and the Rao-Blackwell theorem.
- Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife." Annals of Statistics, 7(1), 1-26. - The original bootstrap paper.
ML-Specific References
- Amari, S. (1998). "Natural Gradient Works Efficiently in Learning." Neural Computation, 10(2), 251-276. - Foundation of natural gradient methods.
- Martens, J. & Grosse, R. (2015). "Optimizing Neural Networks with Kronecker-factored Approximate Curvature." ICML. - K-FAC for tractable FIM approximation in deep networks.
- Kirkpatrick, J. et al. (2017). "Overcoming Catastrophic Forgetting in Neural Networks." PNAS, 114(13), 3521-3526. - EWC, which uses the FIM for continual learning.
- Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS. - Chinchilla scaling laws fitted via nonlinear MLE.
- Hu, E. et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR. - Low-rank adaptation for efficient fine-tuning.
- Guo, C. et al. (2017). "On Calibration of Modern Neural Networks." ICML. - Temperature scaling as MLE on a calibration set.
- Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. - Chapter 5 covers MLE in the context of deep learning at the appropriate depth for ML practitioners.
Advanced References
- van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press. - Graduate-level asymptotic theory; proofs of asymptotic normality, the delta method, and semiparametric theory.
- Wasserman, L. (2004). All of Statistics. Springer. - Excellent graduate-level survey connecting classical statistics to modern methods.
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. - Chapters 1-3 cover MLE, MAP, and Bayesian estimation with ML motivation.
Appendix O: Connections to Information Theory
The Fisher information matrix has a deep connection to Shannon information theory, previewing material from Chapter 9 - Information Theory.
O.1 Fisher Information as Local Curvature of KL Divergence
The KL divergence between distributions at nearby parameter values satisfies

$$\operatorname{KL}\big(p_\theta \,\|\, p_{\theta + \delta}\big) = \tfrac{1}{2}\,\delta^\top F(\theta)\,\delta + O(\|\delta\|^3).$$

The Fisher information matrix $F(\theta)$ is the Hessian of the KL divergence with respect to the second argument, evaluated at the point where both arguments agree. This identifies $F(\theta)$ as the local curvature of the divergence surface.
Consequence: a small parameter step $\delta$ incurs KL divergence $\tfrac{1}{2}\delta^\top F(\theta)\,\delta$, so measuring step length in the Fisher metric is the same as measuring movement in distribution space. This is why the natural gradient $F^{-1}\nabla L$ (a step in parameter space) is the steepest-ascent direction of the objective when distances are measured by KL divergence between distributions rather than by Euclidean distance between parameter vectors.
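As a concrete check (a minimal sketch, not from the original appendix; the Bernoulli model and parameter value are assumptions chosen for illustration), one can verify numerically that the KL divergence between nearby Bernoulli distributions approaches $\tfrac{1}{2}\delta^2 I(\theta)$ with $I(\theta) = 1/(\theta(1-\theta))$ as $\delta \to 0$.

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL( Ber(p) || Ber(q) )."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta = 0.3                                   # illustrative parameter value
fisher = 1.0 / (theta * (1.0 - theta))        # I(theta) for Bernoulli(theta)

# The quadratic approximation 0.5 * I(theta) * delta^2 matches the exact KL
# increasingly well as the perturbation delta shrinks.
for delta in (1e-1, 1e-2, 1e-3):
    exact = kl_bernoulli(theta, theta + delta)
    quad = 0.5 * fisher * delta ** 2
    print(f"delta={delta:g}  KL={exact:.3e}  0.5*I*delta^2={quad:.3e}")
```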
O.2 Cramér-Rao and Channel Capacity
Shannon's channel capacity theorem and the Cramér-Rao bound are related through the van Trees inequality (a Bayesian version of the CRB for a random parameter $\theta$ with prior $\pi$):

$$\mathbb{E}\big[(\hat{\theta} - \theta)^2\big] \;\ge\; \frac{1}{\mathbb{E}_\pi[\,n I(\theta)\,] + I(\pi)},$$

where $I(\pi)$ is the Fisher information of the prior. As the prior becomes uninformative ($I(\pi) \to 0$), this recovers the standard CRB in its Bayes-averaged form. The connection: information accumulates (Fisher information adds) across observations and from the prior, exactly as Shannon information adds in a noisy channel.
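For a concrete instance (an assumed Gaussian-Gaussian example, not from the original text): with prior $\theta \sim \mathcal{N}(0, \tau^2)$ and observations $x_i \sim \mathcal{N}(\theta, \sigma^2)$, the posterior mean attains the van Trees bound exactly, since its Bayes risk equals $1/(n/\sigma^2 + 1/\tau^2)$. The sketch below checks this by simulation with illustrative parameter values.

```python
import numpy as np

# Van Trees check for a conjugate Gaussian model (all settings illustrative):
# prior theta ~ N(0, tau^2), data x_i | theta ~ N(theta, sigma^2), i = 1..n.
rng = np.random.default_rng(1)
tau2, sigma2, n, n_reps = 2.0, 1.0, 5, 200_000

# Van Trees bound: Bayes risk >= 1 / (n * I(theta) + I(prior)),
# with I(theta) = 1/sigma^2 per observation and I(prior) = 1/tau^2.
bound = 1.0 / (n / sigma2 + 1.0 / tau2)

theta = rng.normal(0.0, np.sqrt(tau2), size=n_reps)
x = rng.normal(theta[:, None], np.sqrt(sigma2), size=(n_reps, n))

# Posterior mean under the conjugate prior (shrinks the sample mean toward 0).
w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)
theta_hat = w * x.mean(axis=1)

bayes_risk = np.mean((theta_hat - theta) ** 2)
print(f"van Trees bound: {bound:.4f}   simulated Bayes risk: {bayes_risk:.4f}")
```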
O.3 Sufficient Statistics and Data Compression
A sufficient statistic $T(\mathbf{x})$ satisfies the data processing inequality: processing the data through any statistic cannot increase Fisher information, $I_{T(\mathbf{x})}(\theta) \le I_{\mathbf{x}}(\theta)$. In fact, for a sufficient statistic, $I_{T(\mathbf{x})}(\theta) = I_{\mathbf{x}}(\theta)$ - no information is lost. This is the estimation-theory analogue of lossless compression: a sufficient statistic compresses the data without losing any information about $\theta$.
Minimal sufficient statistics provide the maximum compression while retaining all information - analogous to the minimum description length (MDL) principle in information theory.
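As a worked instance (an assumed example, not from the original text): for $x_1, \dots, x_n \overset{\text{iid}}{\sim} \mathcal{N}(\theta, \sigma^2)$ with $\sigma^2$ known, the sample mean is sufficient for $\theta$ and carries the full Fisher information,

$$
I_{\mathbf{x}}(\theta) = \frac{n}{\sigma^2},
\qquad
\bar{x} \sim \mathcal{N}\!\left(\theta, \tfrac{\sigma^2}{n}\right)
\;\Rightarrow\;
I_{\bar{x}}(\theta) = \frac{1}{\sigma^2 / n} = \frac{n}{\sigma^2},
$$

so compressing the $n$ observations down to the single number $\bar{x}$ loses nothing about $\theta$, whereas a non-sufficient summary such as $x_1$ alone retains only $1/\sigma^2$.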
Appendix P: Practice Problems
P.1 Identification Problems
For each of the following, state whether the model is identifiable. If not, identify the unidentifiable combination.
- with parameters .
- for , with . (Identifiable?)
- A two-layer ReLU network with parameters . Is the function class identifiable?
P.2 Computational Problems
- MLE and invariance. For i.i.d. Poisson($\lambda$) data $x_1, \dots, x_n$: (a) find $\hat{\lambda}_{\text{MLE}}$; (b) find the MLE of $e^{-\lambda}$ (the probability of observing zero events); (c) find the MLE of $1/\lambda$ (mean inter-arrival time).
- CRB for a transformed parameter. For $n$ i.i.d. observations from a regular parametric model $p(x; \theta)$, derive the CRB for estimating a transformed parameter $g(\theta)$ using the biased-estimator form of the CRB. Show that the MLE $g(\hat{\theta})$ does not achieve this bound for finite $n$, but does so asymptotically.
- Bootstrap vs. asymptotic. For $n$ observations from a normal distribution with unknown mean and variance: (a) compute the exact 95% $t$-CI for the mean; (b) simulate the bootstrap percentile CI with $B$ resamples; (c) compare coverage by repeating 1000 times and computing the empirical coverage of each CI type. (A starter skeleton follows this problem set.)
P.3 Conceptual Problems
- Stein paradox. For $\mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \sigma^2 I)$ with $d \le 2$, the sample mean is the admissible MLE. For $d \ge 3$, the James-Stein estimator dominates it in MSE. Simulate this for a choice of dimension $d \ge 3$, sample size $n$, true mean $\boldsymbol{\mu}$, and noise variance $\sigma^2$, and verify that $\operatorname{MSE}(\hat{\boldsymbol{\mu}}_{\text{JS}}) < \operatorname{MSE}(\bar{\mathbf{x}})$ (a starter sketch appears after this problem list).
- Model misspecification. A logistic regression model is fitted to data generated from a probit model ($P(y = 1 \mid x) = \Phi(x^\top \beta)$). To what does the MLE converge? Explain using the misspecified-MLE theorem and the KL divergence interpretation.
- FIM for temperature scaling. For a $K$-class classifier with logits $z \in \mathbb{R}^K$ and temperature $T$, the softmax is $p_k = \exp(z_k / T) / \sum_j \exp(z_j / T)$. Derive the Fisher information $I(T)$ for a single example and explain why Newton-Raphson temperature calibration converges in 1-2 steps.
Section 02: Estimation Theory - Chapter 7: Statistics - Math for LLMs curriculum
Lines: ~2000 | Theory notebook: 50+ cells | Exercises: 8 graded problems