Estimation Theory: Appendix M (Glossary of Key Terms) to Appendix P (Practice Problems)

Appendix M: Glossary of Key Terms

| Term | Formal definition | Intuition |
| --- | --- | --- |
| Estimator | A measurable function $T(\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(n)})$ of the data | A rule for computing a guess from observed data |
| Estimate | A specific value $T(x^{(1)},\ldots,x^{(n)}) = t$ after observing data | The actual number produced by the estimator |
| Sampling distribution | The distribution of $\hat{\theta}$ across all possible samples of size $n$ | How much the estimate varies from experiment to experiment |
| Bias | $\mathbb{E}[\hat{\theta}] - \theta$ | How far off the estimator is on average |
| Variance | $\mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2]$ | How much the estimator fluctuates around its mean |
| MSE | $\mathbb{E}[(\hat{\theta}-\theta)^2] = \text{Bias}^2 + \text{Var}$ | Total estimation error |
| Consistency | $\hat{\theta}_n \xrightarrow{P} \theta$ as $n \to \infty$ | Converges to the right answer with more data |
| Efficiency | $\operatorname{Var}(\hat{\theta}) = 1/(n\mathcal{I}(\theta))$ | Achieves the minimum possible variance |
| Sufficient statistic | $T$ such that $p(\mathbf{x} \mid T;\theta)$ does not depend on $\theta$ | Compresses the data without losing information about $\theta$ |
| Fisher information | $\mathcal{I}(\theta) = \mathbb{E}[s^2(\theta;X)] = -\mathbb{E}[\ell''(\theta;X)]$ | How much the data informs $\theta$; curvature of the log-likelihood |
| CRB | $\operatorname{Var}(\hat{\theta}) \geq 1/(n\mathcal{I}(\theta))$ for unbiased $\hat{\theta}$ | Hard lower bound on variance; no unbiased estimator can beat it |
| MLE | $\hat{\theta} = \arg\max_\theta \sum_i \log p(x^{(i)};\theta)$ | Parameter value that makes the observed data most probable |
| Asymptotic normality | $\sqrt{n}(\hat{\theta}_{\text{MLE}}-\theta^*) \xrightarrow{d} \mathcal{N}(0,\mathcal{I}^{-1})$ | MLE is approximately normal for large $n$ |
| Confidence interval | Random interval covering $\theta$ with probability $1-\alpha$ | Uncertainty quantification for frequentist estimation |
| Natural parameter | $\boldsymbol{\eta}$ in $p(x;\boldsymbol{\eta}) = h(x)\exp(\boldsymbol{\eta}^\top T(x) - A(\boldsymbol{\eta}))$ | Parameterisation of exponential families |
| Natural gradient | $\mathcal{I}(\boldsymbol{\theta})^{-1}\nabla_{\boldsymbol{\theta}}\mathcal{L}$ | Steepest ascent in the Fisher-Rao metric on the statistical manifold |
| Misspecification | $p^* \notin \{p(\cdot;\theta):\theta \in \Theta\}$ | True distribution is not in the model family |
| Pseudo-true parameter | $\theta^{**} = \arg\min_\theta D_{\mathrm{KL}}(p^*\,\Vert\,p(\cdot;\theta))$ | Closest point in the model to the truth under KL divergence |
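
The bias-variance decomposition of the MSE in the table is easy to verify numerically. Below is a minimal sketch, assuming NumPy; the Gaussian variance MLE (which divides by $n$ and has exact bias $-\sigma^2/n$) and all constants are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, trials = 10, 4.0, 200_000

# MLE of the variance divides by n (ddof=0), so it is biased downward by sigma2/n
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
sigma2_hat = samples.var(axis=1)

bias = sigma2_hat.mean() - sigma2
var = sigma2_hat.var()
mse = np.mean((sigma2_hat - sigma2) ** 2)

print(f"bias ≈ {bias:.4f}  (exact: {-sigma2 / n:.4f})")
print(f"bias² + var ≈ {bias**2 + var:.4f}   MSE ≈ {mse:.4f}")
```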

Appendix N: Further Reading and References

Primary References

  1. Casella, G. & Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. - The definitive graduate textbook on classical estimation theory; covers all topics in this section at full rigor.

  2. Lehmann, E.L. & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. - Advanced treatment of UMVUE theory, Rao-Blackwell, and sufficiency.

  3. Fisher, R.A. (1922). "On the Mathematical Foundations of Theoretical Statistics." Philosophical Transactions of the Royal Society A, 222, 309-368. - The foundational paper defining sufficiency, efficiency, consistency, and MLE.

  4. Cramer, H. (1946). Mathematical Methods of Statistics. Princeton University Press. - Original proof of the CRB.

  5. Rao, C.R. (1945). "Information and the Accuracy Attainable in the Estimation of Statistical Parameters." Bulletin of the Calcutta Mathematical Society, 37, 81-91. - Independent CRB proof and Rao-Blackwell theorem.

  6. Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife." Annals of Statistics, 7(1), 1-26. - Original bootstrap paper.

ML-Specific References

  1. Amari, S. (1998). "Natural Gradient Works Efficiently in Learning." Neural Computation, 10(2), 251-276. - Foundation of natural gradient methods.

  2. Martens, J. & Grosse, R. (2015). "Optimizing Neural Networks with Kronecker-factored Approximate Curvature." ICML. - K-FAC for tractable FIM approximation in deep networks.

  3. Kirkpatrick, J. et al. (2017). "Overcoming Catastrophic Forgetting in Neural Networks." PNAS, 114(13), 3521-3526. - EWC using FIM for continual learning.

  4. Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS. - Chinchilla scaling laws via nonlinear MLE.

  5. Hu, E. et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR. - Low-rank MLE for efficient fine-tuning.

  6. Guo, C. et al. (2017). "On Calibration of Modern Neural Networks." ICML. - Temperature scaling as MLE on calibration set.

  7. Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. - Chapter 5 covers MLE in the context of deep learning at the appropriate depth for ML practitioners.

Advanced References

  1. van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press. - Graduate-level asymptotic theory; proofs of asymptotic normality, delta method, semiparametric theory.

  2. Wasserman, L. (2004). All of Statistics. Springer. - Excellent graduate-level survey connecting classical statistics to modern methods.

  3. Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. - Chapters 1-3 cover MLE, MAP, and Bayesian estimation with ML motivation.


This section is part of the Math for LLMs curriculum. For corrections or contributions, see CONTRIBUTING.md.


Appendix O: Connections to Information Theory

The Fisher information matrix has a deep connection to Shannon information theory, previewing material from Chapter 9 - Information Theory.

O.1 Fisher Information as Local Curvature of KL Divergence

The KL divergence between distributions at nearby parameter values satisfies:

$$D_{\mathrm{KL}}\big(p(\cdot;\boldsymbol{\theta}) \,\|\, p(\cdot;\boldsymbol{\theta}+\mathrm{d}\boldsymbol{\theta})\big) = \frac{1}{2}\,\mathrm{d}\boldsymbol{\theta}^\top \mathcal{I}(\boldsymbol{\theta})\, \mathrm{d}\boldsymbol{\theta} + O(\|\mathrm{d}\boldsymbol{\theta}\|^3)$$

The Fisher information matrix is the Hessian of the KL divergence with respect to the second argument, evaluated at the point where both arguments agree. This identifies I\mathcal{I} as the local curvature of the divergence surface.

Consequence: the quadratic form $\mathrm{d}\boldsymbol{\theta}^\top \mathcal{I}(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}$ measures how far a parameter step $\mathrm{d}\boldsymbol{\theta}$ moves the distribution, in KL terms. The natural gradient (stepping along $\mathcal{I}^{-1}\nabla\mathcal{L}$ in parameter space) is therefore the direction of steepest ascent of the objective per unit of KL divergence moved in distribution space, rather than per unit of Euclidean distance in parameter space.
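
This expansion is straightforward to check numerically. Below is a minimal sketch, assuming NumPy; the Bernoulli model, for which $\mathcal{I}(\theta) = 1/(\theta(1-\theta))$, is an illustrative choice:

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta = 0.3
fisher = 1.0 / (theta * (1.0 - theta))  # I(theta) for a Bernoulli model

# The ratio KL / (0.5 * I * dθ²) should approach 1 as dθ shrinks
for d in [1e-1, 1e-2, 1e-3]:
    kl = kl_bernoulli(theta, theta + d)
    quad = 0.5 * fisher * d**2
    print(f"dθ={d:.0e}:  KL={kl:.3e}  ½·I·dθ²={quad:.3e}  ratio={kl / quad:.4f}")
```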

O.2 Cramer-Rao and Channel Capacity

Shannon's channel capacity theorem and the Cramer-Rao bound are related through the van Trees inequality (a Bayesian version of the CRB for random θ\theta):

$$\operatorname{MSE}(\hat{\theta}) \geq \frac{1}{n\,\mathcal{I}(\theta) + \mathcal{I}_{\text{prior}}(\theta)}$$

where Iprior\mathcal{I}_{\text{prior}} is the Fisher information of the prior. As the prior becomes uninformative (Iprior0\mathcal{I}_{\text{prior}} \to 0), this recovers the standard CRB. The connection: information accumulates (Fisher information adds) across observations and from the prior, exactly as Shannon information adds in a noisy channel.
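
As an illustration, consider a conjugate Gaussian model (an illustrative choice, not one from the text): for $X_i \sim \mathcal{N}(\theta, \sigma^2)$ with prior $\theta \sim \mathcal{N}(0, \tau^2)$, we have $\mathcal{I}(\theta) = 1/\sigma^2$ and $\mathcal{I}_{\text{prior}} = 1/\tau^2$, and the posterior mean attains the van Trees bound exactly. A minimal simulation sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, tau2, trials = 5, 1.0, 2.0, 200_000

# Draw theta from the prior, then data given theta
theta = rng.normal(0.0, np.sqrt(tau2), size=trials)
x = rng.normal(theta[:, None], np.sqrt(sigma2), size=(trials, n))

# Posterior mean under the conjugate Gaussian model shrinks x̄ toward the prior mean 0
w = (n / sigma2) / (n / sigma2 + 1 / tau2)
theta_hat = w * x.mean(axis=1)

bayes_mse = np.mean((theta_hat - theta) ** 2)
van_trees = 1.0 / (n / sigma2 + 1.0 / tau2)
print(f"Bayes MSE ≈ {bayes_mse:.4f}   van Trees bound = {van_trees:.4f}")
```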

O.3 Sufficient Statistics and Data Compression

Any statistic $T$ obeys the data processing inequality for Fisher information: $\mathcal{I}_T(\theta) \leq \mathcal{I}_{\mathbf{X}}(\theta)$, so processing the data cannot increase information about $\theta$. For a sufficient statistic the inequality is tight, $\mathcal{I}_T(\theta) = \mathcal{I}_{\mathbf{X}}(\theta)$ - no information is lost. This is the estimation-theory analogue of lossless compression: a sufficient statistic compresses the data without losing any information about $\theta$.

Minimal sufficient statistics provide the maximum compression while retaining all information - analogous to the minimum description length (MDL) principle in information theory.
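
As a concrete instance (a standard textbook example, not one worked in this lesson): for $X_1,\ldots,X_n \sim \mathcal{N}(\mu, \sigma^2)$ i.i.d. with $\sigma^2$ known, the sample mean $T = \bar{X}$ is sufficient for $\mu$, and computing both informations confirms that nothing is lost:

$$\mathcal{I}_{\mathbf{X}}(\mu) = \frac{n}{\sigma^2}, \qquad T = \bar{X} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right) \;\Rightarrow\; \mathcal{I}_T(\mu) = \frac{1}{\sigma^2/n} = \frac{n}{\sigma^2}.$$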


Appendix P: Practice Problems

P.1 Identification Problems

For each of the following, state whether the model is identifiable. If not, identify the unidentifiable combination.

  1. $X \sim \mathcal{N}(\mu_1 - \mu_2, 1)$ with parameters $(\mu_1, \mu_2) \in \mathbb{R}^2$.
  2. $p(x|\theta) = \theta e^{-\theta x}$ for $x > 0$, with $\theta > 0$.
  3. A two-layer ReLU network $f(x) = \text{ReLU}(w_1 x) + \text{ReLU}(w_2 x)$ with parameters $(w_1, w_2)$. Are the parameters identifiable from the function?

P.2 Computational Problems

  1. MLE and invariance. For Poisson data with $n = 20$ and $\sum x_i = 48$: (a) find $\hat{\lambda}$; (b) find the MLE of $e^{-\lambda}$ (the probability of observing zero events); (c) find the MLE of $1/\lambda$ (the mean inter-arrival time).

  2. CRB for a transformed parameter. For $X \sim \mathcal{N}(\mu, 1)$ with $n$ observations, derive the CRB for estimating $g(\mu) = \mu^2$ using the biased-estimator form of the CRB. Show that the MLE $\hat{\mu}^2 = \bar{x}^2$ does not achieve this bound for finite $n$, but does so asymptotically.

  3. Bootstrap vs. asymptotic. For $n = 20$ observations from $\mathcal{N}(\mu, 1)$ with $\bar{x} = 3.2$: (a) compute the exact 95% $t$-CI; (b) simulate the bootstrap CI with $B = 1000$; (c) compare coverage by repeating 1000 times and computing the empirical coverage of each CI type. A starter sketch follows this list.
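
A starter sketch for problem 3, assuming NumPy and SciPy are available; the percentile bootstrap used here is one of several valid bootstrap CIs, and only parts (a) and (b) are shown, with the coverage study in (c) left as the exercise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, B = 20, 1000

# Stand-in sample, shifted so the sample mean is exactly 3.2 as in the problem
x = rng.normal(0.0, 1.0, size=n)
x = x - x.mean() + 3.2

# (a) Exact 95% t-interval for the mean
half = stats.t.ppf(0.975, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
print(f"t-CI: ({x.mean() - half:.3f}, {x.mean() + half:.3f})")

# (b) Percentile bootstrap CI from B resamples of the same sample
boot_means = np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(B)])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap CI: ({lo:.3f}, {hi:.3f})")
```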

P.3 Conceptual Problems

  1. Stein paradox. For $d = 1$, the sample mean $\bar{x}$ is the admissible MLE. For $d = 3$, the James-Stein estimator $\tilde{\boldsymbol{\mu}} = \left(1 - \frac{d-2}{n\lVert\bar{\mathbf{x}}\rVert_2^2}\right)\bar{\mathbf{x}}$ dominates $\bar{\mathbf{x}}$ in MSE. Simulate this for $d = 3$, $n = 1$, true $\boldsymbol{\mu} = (2,2,2)$, $\sigma = 1$, and verify $\text{MSE}(\tilde{\boldsymbol{\mu}}) < \text{MSE}(\bar{\mathbf{x}})$. A simulation sketch follows this list.

  2. Model misspecification. A logistic regression model is fitted to data generated from a probit model ($p(y=1|\mathbf{x}) = \Phi(\mathbf{w}^\top\mathbf{x})$). To what does the MLE converge? Explain using the misspecified-MLE theorem and the KL divergence interpretation.

  3. FIM for temperature scaling. For a $K$-class classifier with logits $\mathbf{z}$ and temperature $T$, the softmax is $p_k = e^{z_k/T}/\sum_j e^{z_j/T}$. Derive the Fisher information $\mathcal{I}(T)$ for a single example and explain why Newton-Raphson temperature calibration converges in 1-2 steps.
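
A simulation sketch for problem 1, assuming NumPy; with $n = 1$ and $\sigma = 1$ the shrinkage factor reduces to $1 - (d-2)/\lVert\mathbf{x}\rVert_2^2$, and the constants follow the problem statement:

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 3, 200_000
mu = np.array([2.0, 2.0, 2.0])

# One observation per trial: x ~ N(mu, I_3), so the MLE of mu is x itself
x = mu + rng.standard_normal((trials, d))

# James-Stein shrinkage toward the origin (n = 1, sigma = 1)
shrink = 1.0 - (d - 2) / np.sum(x**2, axis=1, keepdims=True)
js = shrink * x

mse_mle = np.mean(np.sum((x - mu) ** 2, axis=1))  # should be close to d = 3
mse_js = np.mean(np.sum((js - mu) ** 2, axis=1))  # strictly smaller for d >= 3
print(f"MSE(MLE) ≈ {mse_mle:.4f}   MSE(JS) ≈ {mse_js:.4f}")
```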

