"The amount of information carried by a sample is not how surprising the sample is in isolation, but how sharply it lets us distinguish one model from its neighbors."
Overview
Fisher information is one of the rare ideas that lives comfortably in several mathematical worlds at once. In statistics, it measures how much an observation reveals about an unknown parameter. In differential geometry, it becomes the metric tensor of a statistical manifold. In information theory, it is the local second-order curvature of KL divergence and a bridge between estimation and entropy. In machine learning, it reappears as the geometry behind natural gradient descent, curvature-aware optimization, continual-learning penalties, uncertainty approximations, and score-based generative modeling.
The intuition is simple but deep. If two nearby parameter values $\theta$ and $\theta + d\theta$ produce nearly indistinguishable distributions, then the data is weakly informative about the parameter. If tiny parameter changes induce visibly different likelihoods, then the parameter is statistically easy to estimate. Fisher information is the quantitative object that measures this local distinguishability.
This section gives Fisher information its own canonical home inside the Information Theory chapter. The Statistics chapter already covers Fisher information as part of estimation theory, especially through maximum likelihood estimation, asymptotic normality, and the Cramer-Rao lower bound. The Optimization chapter already covers natural gradient and practical second-order methods. Here we focus on what is unique about Fisher information itself: score-based definitions, structural properties, local KL geometry, Fisher-Rao metrics, Jeffreys priors, Fisher divergence, de Bruijn-style identities, and the modern AI systems that use these ideas.
Prerequisites
- Entropy, KL divergence, and cross-entropy - 01-Entropy, 02-KL-Divergence, 04-Cross-Entropy
- Mutual information and statistical dependence - 03-Mutual-Information
- Gradients, Hessians, and multivariate calculus - 05-Multivariate-Calculus/README.md
- Maximum likelihood estimation and asymptotic statistics - 07-Statistics/02-Estimation-Theory
- Natural gradient and curvature-aware optimization - 08-Optimization/03-Second-Order-Methods
- Basic probability distributions and expectations - 06-Probability-Theory/README.md
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations, closed-form Fisher calculations, KL-curvature checks, Jeffreys priors, and natural-gradient demos |
| exercises.ipynb | 10 graded exercises covering score functions, matrix Fisher information, KL curvature, Jeffreys priors, empirical Fisher pitfalls, and ML applications |
Learning Objectives
After completing this section, you will be able to:
- Define the score function and explain why Fisher information is its variance
- Compute scalar Fisher information and matrix Fisher information for standard models
- Distinguish expected Fisher, observed information, and empirical Fisher clearly
- Prove or explain the key structural properties: additivity, positive semidefiniteness, and reparameterization behavior
- Interpret Fisher information as local curvature of KL divergence
- Explain the Fisher-Rao metric and why it gives a geometry on parametric model families
- Derive Jeffreys priors from the Fisher information $I(\theta)$
- Distinguish Fisher divergence from KL divergence and cross-entropy
- Understand the role of Fisher information in natural gradient methods without confusing it with the full optimization chapter
- Explain why empirical Fisher approximations can fail in modern deep learning
- Connect Fisher information to K-FAC, EWC, continual learning, and score-based generative modeling
- Relate Fisher information to entropy smoothing identities such as de Bruijn's identity
Table of Contents
- Fisher Information
- Overview
- Prerequisites
- Companion Notebooks
- Learning Objectives
- Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Core Theory I: Structural Properties
- 4. Core Theory II: Geometry and Local Divergence
- 5. Core Theory III: Information-Theoretic Identities
- 6. Advanced Topics
- 7. Applications in Machine Learning
- 8. Common Mistakes
- 9. Exercises
- 10. Why This Matters for AI (2026 Perspective)
- 11. Conceptual Bridge
- References
1. Intuition
1.1 Local Distinguishability of Models
Suppose we have a family of probability distributions $\{p(x \mid \theta) : \theta \in \Theta\}$. A parameter value $\theta$ by itself is not the object we observe. What we actually observe is data sampled from the distribution associated with that parameter. This means that "how much we know about $\theta$" can only be interpreted through the distributions the parameter generates.
That observation immediately suggests a local perspective. If two nearby parameter values $\theta$ and $\theta + d\theta$ produce nearly identical distributions, then even a large amount of data may struggle to distinguish them. If tiny perturbations in $\theta$ produce visibly different likelihoods, then a small sample can already tell us a lot about the parameter.
Fisher information formalizes this local distinguishability. It does not ask:
- how much uncertainty the parameter has in some prior sense
- how surprising a specific observation is
- how different two arbitrary distributions are globally
Instead, it asks:
How sensitive is the model distribution to infinitesimal changes in the parameter?
This is why Fisher information sits naturally between statistics and information theory. It is not a global divergence like KL divergence, but it is also not only an optimization curvature quantity. It is a local information metric on the model family itself.
LOCAL DISTINGUISHABILITY
===============================================================
parameter space:
theta -------- theta + dtheta
induced model family:
p(x|theta) ---- p(x|theta + dtheta)
If the two model distributions are almost the same:
little statistical information about theta
If the two model distributions separate quickly:
much statistical information about theta
Fisher information measures:
sensitivity of the distribution, not just sensitivity of a number
===============================================================
There is a useful contrast here with entropy. Entropy measures uncertainty inside a single distribution. Fisher information measures the sharpness with which a family of distributions changes as we move in parameter space.
Three quick examples build the instinct:
- Bernoulli coin with $p$ near $1/2$: the likelihood changes moderately with the parameter, and the Fisher information per sample is finite and symmetric around $1/2$.
- Bernoulli coin with $p$ near $0$ or $1$: tiny changes in $p$ dramatically affect the likelihood of rare outcomes, so the Fisher information becomes large.
- Overparameterized neural network with redundant weights: many different weight vectors can induce almost the same function, so there are directions in parameter space with very low local distinguishability.
The third example is the bridge to modern AI. Flat directions, symmetries, and degeneracies in deep networks are not just optimization curiosities. They are geometric statements about information in parameter space.
1.2 From Curvature to Information
There are several equivalent ways to think about Fisher information. Each is useful in a different context.
The first is the score-variance view. The score is the derivative of the log-likelihood:
$$ s(\theta; x) = \frac{\partial}{\partial \theta} \log p(x \mid \theta) $$
If the score fluctuates a lot under the model, then the likelihood is very sensitive to the parameter. That means the sample carries strong information about $\theta$.
The second is the curvature view. Near the truth, the log-likelihood has a local quadratic shape. Sharp curvature means small parameter perturbations rapidly reduce the likelihood, which again means the data localizes the parameter well.
The third is the distance view. The KL divergence between two nearby model distributions satisfies a second-order approximation:
$$ D_{\mathrm{KL}}\big(p_\theta \,\|\, p_{\theta + d\theta}\big) \approx \tfrac{1}{2}\, d\theta^\top I(\theta)\, d\theta $$
So Fisher information is literally the quadratic form that measures local KL distance.
These three views are not separate facts to memorize. They are three readings of the same object:
| View | Object | Interpretation |
|---|---|---|
| Score view | $\mathrm{Var}_\theta[\partial_\theta \log p(X \mid \theta)]$ or $\mathbb{E}_\theta[\nabla_\theta \log p\, \nabla_\theta \log p^\top]$ | sensitivity of log-likelihood |
| Curvature view | $-\mathbb{E}_\theta[\partial^2_\theta \log p(X \mid \theta)]$ | expected local sharpness |
| Divergence view | local Hessian of KL | infinitesimal information geometry |
This is also why Fisher information shows up in so many chapters of the curriculum:
- in statistics, because curvature controls estimation precision
- in optimization, because curvature affects update geometry
- in Bayesian inference, because local curvature shapes Laplace-style approximations and Jeffreys priors
- in information theory, because local KL structure is part of the geometry of model families
THREE VIEWS OF FISHER INFORMATION
===============================================================
score variance
Var_theta[ d/dtheta log p(X|theta) ]
|
v
likelihood curvature
- E_theta[ d^2/dtheta^2 log p(X|theta) ]
|
v
local KL geometry
D_KL(p_theta || p_{theta + dtheta})
~ (1/2) dtheta^T I(theta) dtheta
same object, three interpretations
===============================================================
One subtle but important point: Fisher information is local. It tells us what happens in a tiny neighborhood around a parameter value. It does not summarize the full global geometry of the model family. This local-global distinction matters in deep learning, where local curvature can be informative even when the global landscape is highly nonconvex.
1.3 Why Fisher Information Matters for AI
If you work on modern AI systems, Fisher information matters for at least five practical reasons.
First, it is the geometric object behind natural gradient descent. Ordinary gradient descent treats parameter space as Euclidean. Natural gradient treats the space of model distributions as primary and uses Fisher information as the metric tensor. This matters because two parameter vectors can be numerically different yet functionally similar, and Fisher geometry is designed to respect that difference.
Second, it appears in structured second-order methods such as K-FAC and related preconditioners. These methods do not use the full Fisher matrix exactly, but they are motivated by the idea that steepest descent should be measured in distribution space, not naive coordinate space.
Third, Fisher information appears in continual learning, especially Elastic Weight Consolidation (EWC). There, diagonal approximations to Fisher information are used as importance weights: parameters that strongly affect the old task's likelihood are penalized more heavily when learning a new task.
Fourth, Fisher ideas connect to uncertainty and calibration. The inverse of the Fisher matrix appears in asymptotic covariance approximations, which is why observed and expected information are tied to standard errors, Laplace approximations, and local uncertainty summaries.
Fifth, Fisher-related identities connect to score-based modeling and diffusion-style generative methods. There the relevant "score" is with respect to the data variable rather than the parameter, but the conceptual bridge is the same: derivatives of log densities are information-bearing objects.
| AI setting | Where Fisher enters | Why it matters |
|---|---|---|
| Natural gradient | metric on model family | parameterization-aware updates |
| K-FAC / structured curvature | tractable Fisher approximation | faster curvature-aware optimization |
| Continual learning | diagonal Fisher importance weights | reduce catastrophic forgetting |
| Laplace / uncertainty | inverse local curvature | local posterior or confidence shape |
| Diffusion / score models | score-field viewpoint | local density geometry and denoising |
A common misconception is that Fisher information is "old classical stats" and therefore not central to frontier ML. The opposite is closer to the truth. Deep learning repeatedly rediscovers that geometry matters, and Fisher information is one of the cleanest geometry-carrying objects we have for probabilistic models.
1.4 Historical Timeline
FISHER INFORMATION -- KEY MILESTONES
===============================================================
1920s-1930s Ronald Fisher
Introduces likelihood, score, and information as central
objects in statistical inference.
1940s Cramer and Rao
Fisher information becomes the denominator of a fundamental
lower bound on estimator variance.
1945-1946 Jeffreys and Cramer
Jeffreys prior and asymptotic efficiency deepen the link
between geometry, invariance, and estimation.
1940s-1970s Information theory era
Fisher information becomes connected to inequalities,
de Bruijn identities, and entropy power arguments.
1945-1990s Rao and information geometry
Fisher information is recognized as a Riemannian metric
on statistical model manifolds.
1998 Amari
Natural gradient reinterprets learning through Fisher-Rao
geometry.
2015 Martens and Grosse
K-FAC makes approximate Fisher-based curvature methods more
practical for neural networks.
2017 Kirkpatrick et al.
EWC uses diagonal Fisher approximations for continual learning.
2019-2026 Modern AI
Fisher remains central in natural-gradient methods, curvature
approximations, continual learning, uncertainty, and
information-geometric interpretations of model families.
===============================================================
Historically, Fisher information began as an inference object: a way of quantifying how much a sample says about a parameter. It later became clear that this quantity carries geometry. The parameter space of distributions is not just a coordinate box; it is a curved manifold whose local metric is given by Fisher information.
That shift from "variance formula" to "geometry of distributions" is exactly why Fisher information belongs in this chapter. Its deepest meaning is not only statistical efficiency. Its deepest meaning is local information geometry.
2. Formal Definitions
2.1 Score Function
Let $X$ be a random variable with density or mass function $p(x \mid \theta)$ depending on a scalar parameter $\theta$. The score function is
$$ s(\theta; x) = \frac{\partial}{\partial \theta} \log p(x \mid \theta) $$
For vector parameters $\theta \in \mathbb{R}^d$, the score is the gradient
$$ s(\theta; x) = \nabla_\theta \log p(x \mid \theta) $$
The score is not itself the amount of information. It is the local sensitivity of the log-likelihood. If the likelihood responds strongly to changes in the parameter, the score will be large in magnitude. If the model is locally insensitive, the score will be small.
Under standard regularity conditions, the score has mean zero:
$$ \mathbb{E}_\theta\big[ s(\theta; X) \big] = 0 $$
The proof is simple and worth remembering because it drives many later identities:
$$ \mathbb{E}_\theta\big[ s(\theta; X) \big] = \int \frac{\partial_\theta p(x \mid \theta)}{p(x \mid \theta)}\, p(x \mid \theta)\, dx = \frac{\partial}{\partial \theta} \int p(x \mid \theta)\, dx = \frac{\partial}{\partial \theta} 1 = 0 $$
This identity says that the score is centered. Positive and negative local perturbation signals cancel in expectation under the model itself.
Three examples make the score concrete.
Example 1: Bernoulli($p$). For $x \in \{0, 1\}$ with $P(X = 1) = p$,
$$ \log p(x \mid p) = x \log p + (1 - x) \log(1 - p), $$
so
$$ s(p; x) = \frac{x}{p} - \frac{1 - x}{1 - p} = \frac{x - p}{p(1 - p)}. $$
Example 2: Poisson($\lambda$). For $x \in \{0, 1, 2, \dots\}$,
$$ \log p(x \mid \lambda) = x \log \lambda - \lambda - \log x!, $$
so
$$ s(\lambda; x) = \frac{x}{\lambda} - 1 = \frac{x - \lambda}{\lambda}. $$
Example 3: Gaussian mean with known variance. If $X \sim \mathcal{N}(\mu, \sigma^2)$ with fixed $\sigma^2$,
$$ \log p(x \mid \mu) = -\frac{(x - \mu)^2}{2\sigma^2} + \text{const}, $$
so
$$ s(\mu; x) = \frac{x - \mu}{\sigma^2}. $$
These examples already show the pattern: the score is a standardized residual.
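To make this concrete in code, here is a minimal numpy sketch (the distributions, parameter values, and sample sizes are illustrative choices, not taken from the companion notebooks) that checks the mean-zero property of the score and the standardized-residual pattern by simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli(p): score s(p; x) = (x - p) / (p (1 - p))
p = 0.3
x = rng.binomial(1, p, size=200_000)
score_bern = (x - p) / (p * (1 - p))

# Poisson(lam): score s(lam; x) = (x - lam) / lam
lam = 4.0
x = rng.poisson(lam, size=200_000)
score_pois = (x - lam) / lam

# Gaussian mean, known sigma: score s(mu; x) = (x - mu) / sigma^2
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=200_000)
score_gauss = (x - mu) / sigma**2

for name, s in [("Bernoulli", score_bern), ("Poisson", score_pois), ("Gaussian", score_gauss)]:
    print(f"{name:9s} mean of score ~ {s.mean():+.4f}  (should be ~ 0)")
```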
Non-examples help too:
- the raw likelihood is not the score
- the negative log-likelihood itself is not the score
- the gradient with respect to the data variable is not the parameter score
This third distinction becomes crucial later, because score matching and diffusion models use derivatives with respect to $x$, not with respect to $\theta$.
2.2 Scalar Fisher Information
For a scalar parameter $\theta$, the Fisher information is defined by
$$ I(\theta) = \mathbb{E}_\theta\big[ s(\theta; X)^2 \big] = \mathbb{E}_\theta\left[ \left( \frac{\partial}{\partial \theta} \log p(X \mid \theta) \right)^2 \right] $$
Because the score has mean zero, this is the variance of the score:
$$ I(\theta) = \mathrm{Var}_\theta\big[ s(\theta; X) \big] $$
That single formula already explains a lot.
- If the score barely fluctuates, the log-likelihood is locally insensitive and the sample provides little information about the parameter.
- If the score fluctuates strongly, the likelihood sharply reacts to parameter changes and the sample is informative.
Since it is a variance, Fisher information is always nonnegative:
$$ I(\theta) \ge 0 $$
The units also make sense. If $\theta$ is measured in some parameter unit, then the score has units of inverse parameter, and the Fisher information has units of inverse parameter squared. This is why its inverse naturally behaves like a variance scale.
For the earlier examples:
Bernoulli($p$): $I(p) = \dfrac{1}{p(1-p)}$.
Poisson($\lambda$): $I(\lambda) = \dfrac{1}{\lambda}$.
Gaussian mean with known variance: $I(\mu) = \dfrac{1}{\sigma^2}$.
These formulas are worth internalizing because they reveal qualitative behavior:
- Bernoulli information explodes near $p = 0$ and $p = 1$
- Poisson information decreases as the mean $\lambda$ grows
- Gaussian mean information is constant when variance is fixed
SCALAR FISHER INFORMATION EXAMPLES
===============================================================
Bernoulli(p): I(p) = 1 / (p(1-p))
Poisson(lambda): I(lambda) = 1 / lambda
Gaussian mean mu: I(mu) = 1 / sigma^2
Interpretation:
smaller noise -> larger information
larger variance -> smaller information
more fragile likelihood -> sharper distinguishability
===============================================================
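These closed forms can be sanity-checked directly against the score-variance definition. The following numpy sketch (illustrative parameter values and sample sizes assumed) estimates each Fisher information as the Monte Carlo variance of the score and compares it to the formula in the box.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Bernoulli(p): I(p) = 1 / (p(1-p))
p = 0.2
x = rng.binomial(1, p, size=n)
I_mc = np.var((x - p) / (p * (1 - p)))
print(f"Bernoulli  MC {I_mc:8.3f}   exact {1/(p*(1-p)):8.3f}")

# Poisson(lam): I(lam) = 1 / lam
lam = 3.0
x = rng.poisson(lam, size=n)
I_mc = np.var((x - lam) / lam)
print(f"Poisson    MC {I_mc:8.3f}   exact {1/lam:8.3f}")

# Gaussian mean, known sigma: I(mu) = 1 / sigma^2
mu, sigma = 0.0, 1.5
x = rng.normal(mu, sigma, size=n)
I_mc = np.var((x - mu) / sigma**2)
print(f"Gaussian   MC {I_mc:8.3f}   exact {1/sigma**2:8.3f}")
```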
There is an important forward reference here:
Preview: Cramer-Rao lower bound
In Estimation Theory, Fisher information appears in the bound $\mathrm{Var}_\theta(\hat{\theta}) \ge \dfrac{1}{I(\theta)}$ for unbiased estimators $\hat{\theta}$. That section is the canonical home for the full theorem and its proof. Here we keep the focus on Fisher information as an information object rather than fully redeveloping classical efficiency theory.
2.3 Fisher Information Matrix
For vector parameters $\theta \in \mathbb{R}^d$, the scalar definition generalizes to the Fisher information matrix
$$ I(\theta) = \mathbb{E}_\theta\Big[ \nabla_\theta \log p(X \mid \theta)\; \nabla_\theta \log p(X \mid \theta)^\top \Big] $$
Because this is an outer product averaged over the model distribution, the matrix is symmetric and positive semidefinite:
$$ I(\theta) = I(\theta)^\top, \qquad I(\theta) \succeq 0 $$
For any vector $v \in \mathbb{R}^d$,
$$ v^\top I(\theta)\, v = \mathbb{E}_\theta\Big[ \big( v^\top \nabla_\theta \log p(X \mid \theta) \big)^2 \Big] \ge 0 $$
This quadratic-form identity is one of the cleanest ways to understand the matrix. The matrix does not merely record coordinate-wise sensitivity. It tells us how informative the model is in every direction of parameter space.
Interpret the matrix entries carefully:
- Diagonal entries measure sensitivity with respect to single coordinates.
- Off-diagonal entries measure coupling between coordinates.
- Small eigenvalues indicate flat or weakly identifiable directions.
- Large eigenvalues indicate sharply distinguishable directions.
Example: Gaussian mean with known covariance. Let $X \sim \mathcal{N}(\mu, \Sigma)$ with fixed positive definite $\Sigma$. Then
$$ \nabla_\mu \log p(x \mid \mu) = \Sigma^{-1}(x - \mu), $$
so
$$ I(\mu) = \mathbb{E}\big[ \Sigma^{-1}(X - \mu)(X - \mu)^\top \Sigma^{-1} \big] = \Sigma^{-1}. $$
This example is foundational. In a Gaussian location model, covariance and information are literal inverses. High observation noise means low information; low observation noise means high information.
Another example: scalar Gaussian variance parameter with known mean. If $X \sim \mathcal{N}(\mu, \sigma^2)$ with $\mu$ known and we parameterize by $\sigma^2$, then the Fisher information depends strongly on the parameterization. This is our first warning that Fisher information values are not invariant as plain numbers across different coordinates.
2.4 Expected Hessian Form
Under regularity conditions that allow interchange of differentiation and integration, Fisher information admits another representation:
$$ I(\theta) = -\mathbb{E}_\theta\big[ \nabla^2_\theta \log p(X \mid \theta) \big] $$
In the scalar case,
$$ I(\theta) = -\mathbb{E}_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log p(X \mid \theta) \right] $$
This is often called the expected-curvature form of Fisher information. It matches the intuition that Fisher information measures how sharply the log-likelihood bends on average.
Why the minus sign? Because log-likelihoods are often locally concave near the truth, so their second derivatives are negative. Taking the negative expected Hessian yields a nonnegative information quantity.
It is tempting to conclude that Fisher information is just "the Hessian." That is too quick. The exact relationship depends on context:
- Fisher information is the expected negative Hessian under the model
- observed information is the sample-specific negative Hessian
- empirical Fisher is a different object again, based on gradients from observed labels or samples
These distinctions matter a lot in machine learning practice.
Regularity conditions are not decorative technicalities here. The expected-Hessian identity can fail when:
- the support of the distribution depends on the parameter
- derivatives cannot be exchanged with integrals safely
- the model is singular or not differentiable in the required sense
For most textbook parametric families used in ML and statistics, the identity works cleanly. But it is still conceptually healthier to treat the score-outer-product definition as primary and the expected-Hessian formula as a theorem.
2.5 Observed, Expected, and Empirical Fisher
One of the biggest sources of confusion in modern ML is that several related curvature-like objects are all called "Fisher" in casual conversation.
We need to separate three of them carefully.
Expected Fisher
$$ I(\theta) = \mathbb{E}_{x \sim p(\cdot \mid \theta)}\Big[ \nabla_\theta \log p(x \mid \theta)\, \nabla_\theta \log p(x \mid \theta)^\top \Big] $$
This is the population object. It depends on the model family and the parameter value.
Observed information
$$ J(\theta) = -\nabla^2_\theta \log p(x_1, \dots, x_n \mid \theta) $$
This is the sample-specific negative Hessian of the log-likelihood.
Empirical Fisher
$$ \hat{F}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \log p(x_i \mid \theta)\, \nabla_\theta \log p(x_i \mid \theta)^\top $$
This is a gradient outer-product computed on observed data. In practice it is often easier to estimate than the true Fisher, which is why it shows up in diagonal approximations and heuristics.
These objects can be close in some settings, but they are not generally the same.
| Object | Depends on | Interpretation | Same as Fisher? |
|---|---|---|---|
| Expected Fisher | model distribution | population local information geometry | yes |
| Observed information | specific dataset | sample-specific local curvature | not exactly |
| Empirical Fisher | data gradients | computational approximation | not generally |
This section will keep these distinctions explicit throughout.
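A tiny model makes the three objects concrete. The sketch below (an illustrative Poisson setup, deliberately evaluated away from the maximum-likelihood estimate to exaggerate the gaps) computes all three per observation; under this kind of mismatch they separate cleanly.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.poisson(5.0, size=100_000)   # data generated with lambda_true = 5
lam = 3.0                            # evaluate all three objects at a different lambda

score   = x / lam - 1.0              # per-sample score d/dlam log p(x|lam)
d_score = -x / lam**2                # derivative of the score with respect to lam

expected_fisher  = 1.0 / lam              # E_lam[score^2] under the model at lam
observed_info    = -d_score.mean()        # sample-specific negative Hessian (per sample)
empirical_fisher = (score**2).mean()      # gradient outer product on observed data

print(f"expected Fisher : {expected_fisher:.3f}")   # ~ 0.333
print(f"observed info   : {observed_info:.3f}")     # ~ 0.556
print(f"empirical Fisher: {empirical_fisher:.3f}")  # ~ 1.000
# Away from the MLE and under mismatch between data and model value, the three separate.
```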
Preview: Optimization chapter
The full optimizer-level discussion of Hessian, Gauss-Newton, natural gradient, K-FAC, and empirical Fisher approximations belongs in 08-Optimization/03-Second-Order-Methods. Here we focus on the mathematical identity of these objects, not only on their engineering tradeoffs.
3. Core Theory I: Structural Properties
3.1 Nonnegativity and PSD Geometry
For scalar parameters, Fisher information is the variance of the score:
$$ I(\theta) = \mathrm{Var}_\theta\big[ s(\theta; X) \big] $$
That immediately implies
$$ I(\theta) \ge 0 $$
For vector parameters, the matrix form
$$ I(\theta) = \mathbb{E}_\theta\big[ s(\theta; X)\, s(\theta; X)^\top \big] $$
implies that $I(\theta)$ is positive semidefinite. The proof is a single line:
$$ v^\top I(\theta)\, v = \mathbb{E}_\theta\big[ (v^\top s(\theta; X))^2 \big] \ge 0 \quad \text{for all } v $$
This line is more than a proof. It is the geometric interpretation. Every vector $v$ specifies a direction in parameter space. The quantity
$$ v^\top I(\theta)\, v $$
is the one-dimensional Fisher information in that direction. So the full matrix packages directional information over the entire tangent space.
When is the Fisher information strictly positive definite rather than only semidefinite?
- when nearby parameter perturbations in every direction change the distribution distinguishably
- when the parameterization is locally identifiable
- when there are no redundant or symmetry directions
It becomes singular when some directions fail to change the distribution to first order.
Three recurring reasons for singularity are:
- Redundant parameterization: different parameter vectors represent the same distribution
- Boundary phenomena: the model loses smooth identifiability near parameter boundaries
- Symmetry: permutations or rescalings leave the function unchanged
Neural networks exhibit all three. Hidden-unit permutations, scale symmetries under normalization, and saturated activations all create low-information directions.
PSD GEOMETRY OF FISHER INFORMATION
===============================================================
direction in parameter space: v
directional information:
v^T I(theta) v
if large:
changing theta along v quickly changes the model
if near zero:
changing theta along v barely changes the model
eigenvalues of I(theta):
large -> sharp / informative directions
small -> flat / weakly identifiable directions
zero -> exact local redundancy
===============================================================
This is the first place where Fisher differs from plain Euclidean geometry. Euclidean length treats all coordinate directions equally. Fisher geometry weights directions by their effect on the model distribution.
3.2 Additivity over Independent Observations
One of the most important structural facts about Fisher information is additivity under independence.
Suppose $X_1, \dots, X_n$ are iid from $p(x \mid \theta)$. Then the log-likelihood adds:
$$ \log p(x_1, \dots, x_n \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta) $$
Therefore the score adds too:
$$ s_n(\theta) = \sum_{i=1}^{n} s(\theta; x_i) $$
Taking the variance under the model gives
$$ I_n(\theta) = \mathrm{Var}_\theta\left[ \sum_{i=1}^{n} s(\theta; X_i) \right] $$
Because the scores are iid and mean zero,
$$ I_n(\theta) = n\, I(\theta) $$
For vectors, exactly the same logic yields the matrix identity
$$ I_n(\theta) = n\, I(\theta) $$
This additivity is one reason Fisher information is so natural in asymptotic statistics. Information grows linearly with sample size, so standard errors often shrink like $1/\sqrt{n}$.
There is a deeper conceptual reading too:
- entropy does not generally add under dependence
- KL divergence adds over independent product models
- Fisher information also adds over independent observations because it is the local second-order form associated with that product structure
This makes Fisher information an "accumulating resource." Each independent observation contributes a local information increment.
Example: Gaussian mean with known variance. If $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ iid, then each observation contributes $1/\sigma^2$, so $I_n(\mu) = n/\sigma^2$.
Example: Bernoulli($p$). Each sample contributes $\dfrac{1}{p(1-p)}$, so $I_n(p) = \dfrac{n}{p(1-p)}$.
This is exactly why confidence intervals narrow with more samples and why temperature-calibration or logistic-regression standard errors become more stable on large datasets.
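A quick simulation sketch (a hypothetical Bernoulli setup with arbitrary sample sizes) shows the practical face of additivity: the Monte Carlo variance of the maximum-likelihood estimate tracks $1/(n\, I(p)) = p(1-p)/n$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
p_true = 0.3
for n in [50, 200, 800]:
    # MLE of the Bernoulli parameter over many repeated experiments of size n
    phat = rng.binomial(1, p_true, size=(5_000, n)).mean(axis=1)
    mc_var = phat.var()
    inv_info = p_true * (1 - p_true) / n      # 1 / (n * I(p))
    print(f"n={n:4d}  MC var {mc_var:.5f}   1/(n I(p)) {inv_info:.5f}")
```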
Two caveats matter:
- If observations are dependent, simple linear additivity fails.
- If the model is misspecified, the relevant variance and curvature objects can separate, leading to sandwich-type corrections.
Those caveats do not destroy the additivity idea. They just clarify that it belongs to the clean iid parametric setting.
3.3 Reparameterization Law
Fisher information has a famous and subtle invariance structure.
Suppose we reparameterize a scalar model using a smooth bijection $\theta = \theta(\phi)$. Then by the chain rule,
$$ \frac{\partial}{\partial \phi} \log p(x \mid \theta(\phi)) = \frac{\partial}{\partial \theta} \log p(x \mid \theta) \cdot \frac{d\theta}{d\phi} $$
Squaring and taking expectation gives
$$ I_\phi(\phi) = I_\theta(\theta(\phi)) \left( \frac{d\theta}{d\phi} \right)^2 $$
So scalar Fisher information is not invariant as a plain number. It transforms like a metric coefficient.
For vector parameters with $\theta = \theta(\phi)$ and Jacobian $J = \partial \theta / \partial \phi$, the matrix transforms as
$$ I_\phi(\phi) = J^\top I_\theta(\theta(\phi))\, J $$
This is exactly how a Riemannian metric tensor transforms.
What remains invariant is not the matrix itself, but the quadratic differential form
$$ ds^2 = d\theta^\top I_\theta(\theta)\, d\theta = d\phi^\top I_\phi(\phi)\, d\phi $$
That is why Fisher information defines geometry. Coordinate descriptions change, but the intrinsic local distance does not.
Simple scalar example: Let $\theta = p$ for a Bernoulli model. Then
$$ I(p) = \frac{1}{p(1-p)} $$
Now reparameterize with the logit
$$ \eta = \log \frac{p}{1-p}, \qquad p = \frac{1}{1 + e^{-\eta}} $$
Since
$$ \frac{dp}{d\eta} = p(1-p), $$
the Fisher information in logit coordinates is
$$ I(\eta) = I(p) \left( \frac{dp}{d\eta} \right)^2 = p(1-p) $$
These look very different numerically, but they describe the same local statistical geometry once expressed in their own coordinates.
This is one of the strongest reasons Fisher information matters in machine learning. Raw gradients depend heavily on parameterization. Fisher geometry tells us how to measure change in a coordinate-aware but intrinsically meaningful way.
REPARAMETERIZATION LAW
===============================================================
same model family
p(x|theta) = p(x|phi(theta))
coordinate matrix/value changes
I_theta != I_phi
but local statistical distance stays the same
dtheta^T I_theta dtheta
=
dphi^T I_phi dphi
Fisher is not "number invariant"
Fisher is "geometry invariant"
===============================================================
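The law in the logit example above is easy to verify numerically. The sketch below (an illustrative value of $p$ and step size assumed) compares $I(p)$ and $I(\eta)$ and checks that matched perturbations give the same local squared distance.

```python
import numpy as np

p = 0.2
I_p   = 1.0 / (p * (1 - p))      # Fisher information in probability coordinates
I_eta = p * (1 - p)              # Fisher information in logit coordinates

# A small step in p and the corresponding step in eta = log(p/(1-p))
dp   = 1e-4
deta = dp / (p * (1 - p))        # chain rule: d(eta)/dp = 1/(p(1-p))

print(f"I(p) = {I_p:.4f},  I(eta) = {I_eta:.4f}   (very different numbers)")
print(f"dp^2 * I(p)     = {dp**2 * I_p:.3e}")
print(f"deta^2 * I(eta) = {deta**2 * I_eta:.3e}   (same local squared distance)")
```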
3.4 Sufficiency and Information Preservation
A statistic $T(X)$ is sufficient for $\theta$ if it compresses the sample without losing information about the parameter. In classical statistics this is usually stated via factorization theorems or conditional independence, but it has a Fisher-information reading as well.
If $T(X)$ is sufficient under regularity conditions, then
$$ I_{T(X)}(\theta) = I_X(\theta) $$
That is, the statistic preserves Fisher information completely.
This is an information-theoretic compression statement. The raw observation may contain many bits of irrelevant randomness, but as long as a statistic retains the entire local sensitivity structure relevant to $\theta$, Fisher information is unchanged.
Example: Gaussian mean with known variance. If $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ iid with known $\sigma^2$, then the sample mean $\bar{X}$ is sufficient for $\mu$. The full sample Fisher information is $n/\sigma^2$. Since
$$ \bar{X} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right), $$
the Fisher information in $\bar{X}$ is also
$$ I_{\bar{X}}(\mu) = \frac{n}{\sigma^2}. $$
No local information is lost by replacing the full sample with the sample mean.
This is a beautiful bridge between statistics and information theory:
- sufficient statistics are data compressions
- but unlike generic compressions, they preserve all parameter-relevant local information
In deep learning language, a sufficient statistic is a representation that retains everything needed about the target parameter while discarding nuisance variation. The analogy is imperfect but instructive.
We should also state a limitation clearly. Equality of Fisher information is a local statement about parameter sensitivity. It does not automatically imply that all global inferential tasks are identical outside the regular parametric setting. Still, in the classical setting it is the right local preservation criterion.
3.5 Closed-Form Examples
Closed-form examples are where Fisher information stops being abstract.
Example 1: Bernoulli($p$)
For $x \in \{0, 1\}$,
$$ \log p(x \mid p) = x \log p + (1 - x) \log(1 - p) $$
We already computed the score:
$$ s(p; x) = \frac{x - p}{p(1 - p)} $$
Then
$$ I(p) = \mathbb{E}_p\big[ s(p; X)^2 \big] = \frac{1}{p(1 - p)} $$
Interpretation:
- information is lowest at $p = 1/2$
- information grows near the boundaries
- rare outcomes are highly informative about whether the parameter is near an edge
Example 2: Poisson($\lambda$)
For $x \in \{0, 1, 2, \dots\}$,
$$ I(\lambda) = \mathbb{E}_\lambda\left[ \left( \frac{X - \lambda}{\lambda} \right)^2 \right] = \frac{1}{\lambda} $$
Interpretation:
- larger count scales produce smaller per-sample information about the rate itself
- to estimate a large rate accurately, we usually need more samples
Example 3: Gaussian mean, known variance
For $X \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$,
$$ I(\mu) = \frac{1}{\sigma^2} $$
Interpretation:
- noisier data means less information
- every observation contributes the same amount of information regardless of the mean itself
Example 4: Gaussian variance, known mean
Parameterize by $\theta = \sigma^2$. Then
$$ \log p(x \mid \sigma^2) = -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}, $$
so
$$ s(\sigma^2; x) = -\frac{1}{2\sigma^2} + \frac{(x - \mu)^2}{2\sigma^4}. $$
A straightforward calculation yields
$$ I(\sigma^2) = \frac{1}{2\sigma^4} $$
This example is a good reminder that parameterization matters: using $\sigma$ instead of $\sigma^2$ produces a different numerical Fisher quantity, though the induced geometry is consistent after transformation.
Example 5: Multivariate Gaussian mean, known covariance
For $X \sim \mathcal{N}(\mu, \Sigma)$ with known $\Sigma$,
$$ I(\mu) = \Sigma^{-1} $$
Interpretation:
- correlated noise creates coupled information directions
- principal directions of $\Sigma$ correspond to inverse principal directions of information
These examples are enough to establish the recurring theme:
- noise variance suppresses information
- repeated independent samples add information
- nonlinear parameterizations rescale information
- matrix Fisher captures anisotropy and coupling
4. Core Theory II: Geometry and Local Divergence
4.1 Fisher as Local Curvature of KL Divergence
Among all interpretations of Fisher information, the most important for this chapter is the local KL-curvature view.
Consider a smooth parametric family $\{p(x \mid \theta)\}$. Compare the distribution at $\theta$ to the nearby distribution at $\theta + d\theta$. Then the KL divergence satisfies the second-order expansion
$$ D_{\mathrm{KL}}\big( p_\theta \,\|\, p_{\theta + d\theta} \big) = \tfrac{1}{2}\, d\theta^\top I(\theta)\, d\theta + o(\|d\theta\|^2) $$
This formula is a cornerstone of information geometry.
Why is there no first-order term? Because KL divergence is minimized at zero when the two distributions coincide, so the linear term vanishes. The second derivative is the first nontrivial local structure, and that derivative is the Fisher information matrix.
This identity tells us something profound:
- KL divergence is a global asymmetric divergence
- Fisher information is its local symmetric quadratic approximation
So Fisher information is the infinitesimal geometry hidden inside KL divergence.
Sketch of the derivation:
- Expand $\log p(x \mid \theta + d\theta)$ to second order around $\theta$.
- Insert the expansion into the KL formula $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + d\theta}) = \mathbb{E}_{p_\theta}\big[ \log p(X \mid \theta) - \log p(X \mid \theta + d\theta) \big]$.
- The first-order term disappears because the expected score is zero.
- The remaining quadratic term becomes the Fisher matrix.
This local-KL identity is one of the cleanest examples of why information theory and statistics belong together. Global divergence and local estimation geometry are not separate topics. One is the infinitesimal limit of the other.
Three consequences are worth stating immediately:
- Fisher information is symmetric even though KL divergence is not.
- Fisher information captures only local behavior; two model families with the same local Fisher metric can still differ globally.
- Small KL-trust regions naturally induce Fisher-metric balls.
That third point is what powers natural-gradient derivations.
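The expansion can be checked directly on a Bernoulli family, where KL divergence has a closed form. The sketch below (illustrative parameter value assumed) compares the exact KL to $\tfrac{1}{2} I(p)\, dp^2$ as the perturbation shrinks.

```python
import numpy as np

def kl_bernoulli(p, q):
    """Exact KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p = 0.3
I_p = 1.0 / (p * (1 - p))
for dp in [1e-1, 1e-2, 1e-3]:
    exact  = kl_bernoulli(p, p + dp)
    approx = 0.5 * I_p * dp**2
    print(f"dp={dp:.0e}   KL={exact:.3e}   (1/2) I dp^2={approx:.3e}")
# The ratio tends to 1 as dp -> 0: Fisher information is the local curvature of KL.
```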
4.2 Fisher-Rao Metric
Because the quadratic KL expansion defines an intrinsic positive semidefinite form on the tangent space, Fisher information gives a Riemannian metric on a smooth identifiable statistical model. This metric is called the Fisher-Rao metric.
Infinitesimally, the line element is
$$ ds^2 = d\theta^\top I(\theta)\, d\theta $$
This formula should be read the same way one reads a metric tensor in ordinary differential geometry:
- coordinates are parameters
- tangent directions are parameter perturbations
- local lengths are measured by Fisher information, not Euclidean dot products
Why should this metric be privileged rather than any arbitrary positive definite matrix field? The classical answer comes from invariance. Fisher-Rao is the canonical metric compatible with probabilistic reparameterization structure. It transforms correctly under smooth coordinate changes and is deeply tied to KL divergence.
Simple Gaussian example: For the one-dimensional Gaussian location family with known $\sigma^2$, the metric is
$$ ds^2 = \frac{d\mu^2}{\sigma^2} $$
So the manifold is flat in $\mu$, but distances are scaled by the inverse noise variance. This makes intuitive sense: if noise is high, two means need to be further apart numerically to be statistically distinguishable.
For richer families, the geometry becomes curved, and this curvature encodes how parameter changes distort distributions.
FISHER-RAO GEOMETRY
===============================================================
parameter coordinates:
theta_1, theta_2, ..., theta_d
ordinary Euclidean view:
distance depends on coordinate differences only
Fisher-Rao view:
distance depends on how those differences change the distributions
local metric:
ds^2 = dtheta^T I(theta) dtheta
so the model family is not just a parameter grid
it is a curved statistical manifold
===============================================================
This perspective matters for AI because neural-network parameter spaces are full of anisotropy, redundancy, and badly scaled coordinates. A geometry based on distributional effect is often much more meaningful than one based on raw coordinate magnitudes.
4.3 Jeffreys Prior
Jeffreys proposed using Fisher information to define a parameterization-invariant prior. For a scalar parameter,
$$ \pi_J(\theta) \propto \sqrt{I(\theta)} $$
For vector parameters,
$$ \pi_J(\theta) \propto \sqrt{\det I(\theta)} $$
Why the square root of the determinant? Because the metric tensor defines a volume form on the statistical manifold. Jeffreys prior is the density induced by that intrinsic volume.
The invariance story is the key:
- ordinary "uniform priors" are not invariant under reparameterization
- Jeffreys prior transforms correctly with the metric volume element
This is one reason Jeffreys prior is often introduced as a default objective or reference prior in Bayesian analysis.
Example: Bernoulli($p$). We know
$$ I(p) = \frac{1}{p(1-p)}, $$
so
$$ \pi_J(p) \propto p^{-1/2} (1-p)^{-1/2} $$
This is the Beta$(1/2, 1/2)$ prior, also known as the arcsine prior.
Example: Exponential rate parameter $\lambda$. If $X \sim \mathrm{Exp}(\lambda)$, then $I(\lambda) = 1/\lambda^2$, so
$$ \pi_J(\lambda) \propto \frac{1}{\lambda} $$
This prior is improper on $(0, \infty)$, which is an important practical caveat: Jeffreys priors are geometrically natural, but they are not always proper probability distributions without additional conditions or truncations.
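For the Bernoulli case, propriety is easy to verify numerically: the unnormalized Jeffreys density $p^{-1/2}(1-p)^{-1/2}$ integrates to $\pi$, the Beta$(1/2, 1/2)$ normalizer. A minimal sketch (the grid resolution and edge cutoff are arbitrary choices):

```python
import numpy as np

# Jeffreys prior for Bernoulli(p): pi(p) proportional to sqrt(I(p)) = 1 / sqrt(p(1-p))
grid = np.linspace(1e-6, 1 - 1e-6, 2_000_001)
vals = 1.0 / np.sqrt(grid * (1 - grid))
Z = vals.sum() * (grid[1] - grid[0])          # crude Riemann sum over the truncated grid

print(f"normalizing constant ~ {Z:.4f}  (exact value is pi = {np.pi:.4f})")
# So the Bernoulli Jeffreys prior is Beta(1/2, 1/2), which is proper;
# the exponential-rate Jeffreys prior above has no finite normalizer.
```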
Why Jeffreys prior belongs in this section:
- it depends directly on Fisher information
- it is about invariant local geometry
- it links estimation, Bayesian default priors, and information-theoretic redundancy arguments
Why it does not belong here in full generality:
- the whole prior/posterior machinery belongs in 07-Statistics/04-Bayesian-Inference
So this section presents Jeffreys prior as a Fisher-geometric object and not as a full Bayesian workflow.
4.4 Natural Gradient as KL-Constrained Steepest Descent
Natural gradient is often introduced in optimization classes, but its cleanest derivation is actually information-theoretic.
Suppose we want to improve an objective $L(\theta)$, but we want to measure the step size not in Euclidean norm $\|d\theta\|$, but in the local KL change it induces on the model distribution.
Using the quadratic KL approximation, we solve:
$$ \max_{d\theta} \; \nabla_\theta L(\theta)^\top d\theta \quad \text{subject to} \quad \tfrac{1}{2}\, d\theta^\top I(\theta)\, d\theta \le \epsilon $$
This is a constrained linear optimization problem in a quadratic metric. Using Lagrange multipliers yields the solution direction
$$ d\theta \propto I(\theta)^{-1} \nabla_\theta L(\theta) $$
That is the natural gradient.
So natural gradient is the steepest-ascent direction when distance is measured by local KL divergence rather than coordinate norm.
This derivation matters because it shows what natural gradient is for. It is not just "gradient preconditioning with some matrix." It is the geometry-aware update induced by the local information metric.
Still, we should keep the chapter boundaries clear:
Forward reference: Optimization
The full algorithmic story of natural gradient, K-FAC, damping, inversion approximations, and large-scale training tradeoffs belongs in 08-Optimization/03-Second-Order-Methods. Here the goal is to understand why Fisher information defines the relevant geometry.
Two high-level consequences matter for AI:
- reparameterization invariance in distribution space
- step sizes that adapt to local model sensitivity
This is particularly attractive in large probabilistic models where coordinate scales can be wildly heterogeneous.
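A one-parameter toy example shows how strongly the geometry matters. For a Bernoulli log-likelihood, preconditioning by $I(p)^{-1} = p(1-p)$ turns the gradient step into $\bar{x} - p$, so a unit-step natural-gradient update lands exactly on the MLE. A minimal numpy sketch (illustrative data and step size assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.8, size=10_000)
xbar = x.mean()

p = 0.2                                         # current parameter value
grad = (xbar - p) / (p * (1 - p))               # gradient of the average log-likelihood
fisher = 1.0 / (p * (1 - p))                    # I(p) for one Bernoulli observation

vanilla_step = p + 0.1 * grad                   # Euclidean gradient ascent, step size 0.1
natural_step = p + (1.0 / fisher) * grad        # natural gradient with unit step size

print(f"sample mean (MLE)     : {xbar:.4f}")
print(f"vanilla gradient step : {vanilla_step:.4f}")
print(f"natural gradient step : {natural_step:.4f}  (jumps straight to the MLE)")
```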
4.5 Identifiability, Flat Directions, and Singular Models
Not every model family has a nicely invertible Fisher matrix.
If two distinct parameter values produce the same distribution, the model is not identifiable. Then some directions in parameter space do not change the distribution locally, and the Fisher matrix becomes singular.
This is not a pathological corner case. It is common in modern deep learning.
Examples:
- Permutation symmetries: permuting hidden units can leave the function unchanged
- Scaling symmetries: in some architectures, scaling one layer up and another down preserves the same input-output map
- Inactive units or saturated gates: local parameter directions cease to affect the output distribution
When Fisher has near-zero eigenvalues, several things happen:
- natural gradient updates become unstable without damping or pseudo-inverses
- local uncertainty approximations become broad or ill-conditioned
- empirical diagonals can mislead us about which directions matter
This is why "invert the Fisher matrix" is conceptually elegant but often computationally delicate in deep networks.
There is a broader lesson here too. In singular models, local quadratic approximations can fail to capture the true inferential geometry. The Fisher matrix remains informative, but it is no longer a complete summary.
WHEN FISHER LOSES RANK
===============================================================
parameter direction changes coordinates
but does not change the distribution
then:
score in that direction is ~ 0
local KL curvature is ~ 0
Fisher eigenvalue is ~ 0
interpretation:
the model cannot locally distinguish movement in that direction
deep-learning sources of rank loss:
hidden-unit permutation
scale symmetry
dead/saturated units
redundant parameterization
===============================================================
This section is a good place to be honest about a common misconception:
Fisher information is not magic curvature that resolves every deep-learning geometry problem. It is the correct local metric for probabilistic model families, but local correctness does not remove singularity, redundancy, or large-scale computational barriers.
5. Core Theory III: Information-Theoretic Identities
5.1 Fisher Divergence
KL divergence compares probability distributions directly through log-density ratios. Fisher divergence compares them through the difference of their score functions.
For smooth densities $p$ and $q$ on $\mathbb{R}^d$, the Fisher divergence is
$$ D_F(p \,\|\, q) = \int p(x)\, \big\| \nabla_x \log p(x) - \nabla_x \log q(x) \big\|^2 \, dx $$
Be careful about the object here: this uses derivatives with respect to the data variable $x$, not the parameter $\theta$. So despite the name, Fisher divergence is not a parameter-space curvature matrix. It is a discrepancy between score fields.
This distinction matters enough to say twice:
- Fisher information uses $\nabla_\theta \log p(x \mid \theta)$
- Fisher divergence uses $\nabla_x \log p(x)$
Why keep Fisher divergence in this section at all?
Because it belongs to the same conceptual family:
- both are built from log-density derivatives
- both encode local distinguishability
- both become central when we study score-based objects
There is also a practical reason. Fisher divergence is the population quantity underlying score matching, which provides a route to training unnormalized density models without needing the partition function explicitly.
High-level contrast:
| Quantity | Variable differentiated | What it compares |
|---|---|---|
| Fisher information | parameter | local sensitivity of a model family |
| Fisher divergence | data | mismatch of score fields between densities |
| KL divergence | none directly | global expected log-ratio mismatch |
Fisher divergence is local in a different sense from KL. It does not directly care about global normalizing constants. This is why score-matching objectives can be attractive when exact likelihoods are intractable.
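For two equal-variance Gaussians both objects have closed forms, which makes a minimal numerical comparison possible (illustrative means and variance assumed; the Monte Carlo sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
mu_p, mu_q, sigma = 0.0, 1.5, 1.0

# Score fields with respect to the data variable x (not the parameter)
score_p = lambda x: -(x - mu_p) / sigma**2
score_q = lambda x: -(x - mu_q) / sigma**2

x = rng.normal(mu_p, sigma, size=1_000_000)     # samples from p
fisher_div = np.mean((score_p(x) - score_q(x))**2)
closed_form = (mu_p - mu_q)**2 / sigma**4       # exact Fisher divergence here
kl_div = (mu_p - mu_q)**2 / (2 * sigma**2)      # closed-form KL for equal-variance Gaussians

print(f"Fisher divergence ~ {fisher_div:.4f}  (closed form {closed_form:.4f})")
print(f"KL divergence     = {kl_div:.4f}")
# Note that the Fisher divergence never touches the normalizing constant of q.
```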
5.2 de Bruijn Identity
One of the most beautiful links between entropy and Fisher information is the de Bruijn identity.
Let $X$ be a random vector with smooth density, and let $Z \sim \mathcal{N}(0, I)$ be independent Gaussian noise. Define a Gaussian-smoothed random variable
$$ Y_t = X + \sqrt{t}\, Z, \qquad t \ge 0 $$
Then the differential entropy of $Y_t$ evolves according to
$$ \frac{d}{dt}\, h(Y_t) = \tfrac{1}{2}\, J(Y_t), $$
where $J(Y_t) = \int p_t(y)\, \| \nabla_y \log p_t(y) \|^2 \, dy$ denotes the Fisher information with respect to location.
This identity is extraordinary because it directly links:
- entropy, a global uncertainty quantity
- Gaussian smoothing, a heat-flow or noise-injection process
- Fisher information, a local derivative-based sharpness quantity
Intuitively:
- adding Gaussian noise smooths the distribution
- smoothing increases entropy
- the instantaneous rate of entropy increase is governed by Fisher information
So Fisher information measures how rapidly uncertainty expands under infinitesimal Gaussian perturbation.
This identity is a key bridge between the "estimation world" and the "information world." Entropy looks global and combinatorial. Fisher looks local and differential. de Bruijn's identity says they are dynamically connected.
We do not need the full proof here, but the structure matters:
- Gaussian convolution corresponds to heat-flow evolution.
- Differentiating entropy along the heat flow produces an integral involving gradients of the density.
- After algebraic rearrangement, that integral is exactly Fisher information.
This is not just a classical theorem with little modern relevance. It explains why score fields become the natural infinitesimal objects when studying noisy distribution evolution.
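In the Gaussian case both sides of the identity are available in closed form, so the identity can be checked in a few lines (a minimal sketch; the variance and evaluation time are arbitrary choices). For $X \sim \mathcal{N}(0, \sigma^2)$, the smoothed variable $Y_t$ has variance $\sigma^2 + t$, entropy $\tfrac12 \log(2\pi e(\sigma^2 + t))$, and location Fisher information $1/(\sigma^2 + t)$.

```python
import numpy as np

sigma2 = 1.0   # variance of X; Y_t = X + sqrt(t) Z has variance sigma2 + t

def entropy(t):
    return 0.5 * np.log(2 * np.pi * np.e * (sigma2 + t))

def fisher_location(t):
    return 1.0 / (sigma2 + t)

t, eps = 0.5, 1e-5
dh_dt = (entropy(t + eps) - entropy(t - eps)) / (2 * eps)   # numerical d/dt h(Y_t)
print(f"d/dt h(Y_t)  ~ {dh_dt:.6f}")
print(f"(1/2) J(Y_t) = {0.5 * fisher_location(t):.6f}")
# The two agree: de Bruijn's identity in the Gaussian case.
```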
5.3 Stam-Type Inequalities
Another family of classical results connects Fisher information and entropy through inequalities of convolution and smoothing. The exact technical forms vary, but the conceptual theme is stable:
Gaussian distributions optimize or extremize many joint entropy-Fisher relationships.
One representative example is Stam's inequality, which in one of its common forms implies a reciprocal-information inequality for sums of independent random variables:
$$ \frac{1}{J(X + Y)} \;\ge\; \frac{1}{J(X)} + \frac{1}{J(Y)} $$
for suitable smooth independent variables $X$ and $Y$.
This is structurally reminiscent of other Gaussian-extremal facts:
- entropy power inequalities
- central limit smoothing effects
- Gaussian maximality under variance constraints
What should you take away without diving into every proof?
- Fisher information decreases under additive Gaussian smoothing.
- Convolution tends to reduce local sharpness.
- Gaussian laws occupy a distinguished extremal role.
This matters because many modern generative and denoising procedures systematically inject noise and then learn to reverse or control that smoothing process. Stam-type results tell us that Fisher is the right local quantity for tracking how sharpness behaves under such transformations.
5.4 Heat Flow, Noise Injection, and Scores
The de Bruijn identity and Stam-type inequalities are easiest to remember if we view them through heat flow.
Adding Gaussian noise to a random variable corresponds to evolving its density under the heat equation. As this happens:
- sharp local structure gets blurred
- entropy rises
- Fisher information falls
This gives a clean dynamic picture:
GAUSSIAN SMOOTHING FLOW
===============================================================
start with density p(x)
|
v
add small Gaussian noise
|
v
density becomes smoother
|
+--> entropy increases
+--> Fisher information decreases
+--> score field becomes less sharp
entropy tracks global spread
Fisher tracks local sharpness
===============================================================
In one dimension, Fisher information for a density $p(x)$ can be written as
$$ J(p) = \int p(x) \left( \frac{d}{dx} \log p(x) \right)^2 dx $$
So when the log-density has steep derivatives, Fisher information is large. Heat flow blunts those derivatives, pushing Fisher downward.
This point is especially useful pedagogically because it cures a possible misunderstanding:
Fisher information is not "more uncertainty."
It is closer to the opposite. Large Fisher information corresponds to sharp local structure and strong local distinguishability. When Gaussian noise blurs the density, entropy increases but Fisher information tends to decrease.
This entropy-Fisher tradeoff is one of the most elegant dualities in the entire subject.
5.5 Why These Identities Matter for Modern AI
At first glance, de Bruijn and Stam may look too classical for machine learning. But they matter because they teach the right local language for noisy generative modeling.
Three modern AI-facing lessons emerge.
1. Score fields are natural objects under noise. Diffusion and score-based models are built around learning vector fields related to gradients of log densities. The entropy-Fisher identities explain why gradients of log densities are the right infinitesimal objects to study under Gaussian corruption.
2. Noise schedules are not arbitrary smoothing tricks. Adding noise changes entropy and Fisher in predictable directional ways. Even if we do not use de Bruijn's identity explicitly in every training loop, the theory explains why Gaussian perturbation creates increasingly smooth intermediate distributions.
3. Local geometry matters in generative modeling. Likelihood, score matching, denoising objectives, and reverse-time SDE/ODE views are all easier to interpret once we understand that Fisher-type objects measure local structural sharpness.
This section is not the canonical home for diffusion models or score matching in full generality, but the identities here provide conceptual infrastructure that those topics rely on.
6. Advanced Topics
6.1 Fisher vs Hessian vs Gauss-Newton
In machine learning, these three objects are often used as if they were interchangeable:
- Fisher information
- Hessian of the loss
- Gauss-Newton matrix
They are related, but they are not the same object in general.
Fisher information is the expected outer product of score vectors under the model distribution.
Hessian is the exact second derivative matrix of the objective being optimized.
Gauss-Newton is a positive-semidefinite approximation that drops certain second-derivative terms and is especially natural for least-squares and likelihood-style problems.
When do they align most closely?
- near well-specified likelihood optima
- in generalized linear-model style settings
- when the objective is negative log-likelihood and model assumptions are well matched
When do they differ?
- for misspecified models
- for non-log-likelihood objectives
- far from optimum
- in large neural nets with strong nonlinearities and saturation
This is why the sentence "the Fisher is the Hessian" is both common and dangerous. It can be a helpful local approximation, but it is not an identity you should use blindly.
6.2 True Fisher vs Empirical Fisher
The empirical Fisher is popular because it is easy to compute: take gradients on observed examples and average their outer products.
But the true Fisher is defined through expectation under the model distribution, not just through the observed-label gradients on a finite sample.
Why practitioners like empirical Fisher:
- simple to estimate from minibatches
- automatically positive semidefinite
- cheap compared to exact curvature objects
Why the approximation can fail:
- it uses realized targets rather than the full model-induced distribution
- it need not capture the same second-order structure as the true Fisher
- it can behave very differently away from the optimum
This distinction became important enough in ML theory that there are now dedicated analyses showing the empirical Fisher can have undesirable pathologies when used as a stand-in for natural-gradient geometry.
The safe takeaway is:
empirical Fisher can be a useful heuristic approximation, but it is not the theoretical Fisher information matrix by default.
6.3 Misspecification and Robust Information
In the ideal parametric story, the model family contains the truth. In reality, models are often misspecified.
When misspecification occurs, two score-based matrices matter:
- the outer-product variance of the score
- the expected negative Hessian
Under correct specification, these coincide. Under misspecification, they can separate. The resulting asymptotic covariance formulas use sandwich-type corrections rather than a single clean Fisher inverse.
This section will not fully develop robust asymptotic theory, because that belongs more naturally in advanced statistics. But it is important to know that "the Fisher matrix" is cleanest in the well-specified setting.
For machine learning, misspecification is the norm rather than the exception. That is one reason local Fisher geometry remains useful as a principled reference object, even if practical uncertainty quantification often needs more than naive Fisher inverses.
6.4 Large-Scale Approximations
The exact Fisher matrix for a modern deep model is too large to form or invert directly. This forces approximations.
Common approximation families:
- Diagonal Fisher - cheapest, but loses coupling
- Block-diagonal Fisher - keeps some structure by layer or module
- Kronecker-factored approximations - exploit layer structure in dense or convolutional modules
- Low-rank or sketching methods - keep only dominant directions
Each approximation trades off:
- memory
- computation
- faithfulness to the true geometry
This section treats these approximations conceptually. The engineering and optimizer-level details belong in Chapter 8.
Still, the high-level idea is worth emphasizing:
Approximate Fisher methods are not trying to approximate "some arbitrary matrix." They are trying to preserve the local statistical geometry while making it computationally tractable.
6.5 Singular and Symmetric Neural Models
Deep networks often live in regimes where classical regularity assumptions are strained:
- parameters can be redundant
- functions can be unchanged by permutations
- activations can become locally insensitive
- the same predictor can correspond to many parameter vectors
This means Fisher matrices can be poorly conditioned or singular even when training is proceeding normally.
The consequence is not that Fisher information becomes useless. Rather, we should interpret it more carefully:
- large eigenvalues identify highly sensitive parameter directions
- tiny eigenvalues often reveal symmetry or redundancy
- damping, pseudo-inverses, or structured approximations become essential
This is a major reason why practical Fisher-based optimization is an art as well as a theorem.
7. Applications in Machine Learning
7.1 Logistic and Softmax Models
Some of the cleanest Fisher calculations in ML come from logistic and softmax models.
For binary logistic regression with
$$ p(y = 1 \mid x, \theta) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}, $$
the Fisher matrix has the form
$$ I(\theta) = \mathbb{E}\Big[ \sigma(\theta^\top x)\big(1 - \sigma(\theta^\top x)\big)\, x x^\top \Big] $$
This matrix automatically weights examples by predictive uncertainty:
- examples with probability near $1/2$ contribute strongly
- saturated examples near $0$ or $1$ contribute less
That is a very intuitive and useful fact. Fisher information in logistic models emphasizes regions where the model is locally most sensitive.
For multiclass softmax models, the Fisher matrix involves the softmax covariance structure
$$ \mathrm{diag}(\pi) - \pi \pi^\top, \qquad \pi = \mathrm{softmax}(z). $$
This is exactly the same local probability geometry that appears in cross-entropy Hessians and generalized Gauss-Newton approximations.
So Fisher information is not an abstract add-on to classification. It is built into the local curvature of the most common predictive models in ML.
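The logistic Fisher matrix can be computed two ways that should agree: from the closed form $\mathbb{E}[\sigma(1-\sigma)\, x x^\top]$, or from the definition, with labels resampled from the model itself and gradient outer products averaged. A minimal numpy sketch (synthetic inputs and an arbitrary $\theta$ assumed):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 200_000, 3
X = rng.normal(size=(n, d))
theta = np.array([1.0, -2.0, 0.5])
p = 1.0 / (1.0 + np.exp(-X @ theta))          # sigma(theta^T x)

# Closed form: E[ sigma(1-sigma) x x^T ], averaged over the inputs
W = p * (1 - p)
fisher_closed = (X * W[:, None]).T @ X / n

# Definition: labels drawn from the model itself, then the outer product
# of per-example gradients of log p(y | x, theta)
y_model = rng.binomial(1, p)                  # y ~ model distribution
g = (y_model - p)[:, None] * X                # per-example score gradient
fisher_sampled = g.T @ g / n

print("closed form:\n", np.round(fisher_closed, 3))
print("sampled    :\n", np.round(fisher_sampled, 3))
```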
7.2 Natural Gradient and Curvature-Aware Training
Natural gradient methods use Fisher information to define steepest descent in distribution space rather than Euclidean parameter space.
Why is this attractive for ML?
- different parameter coordinates can be arbitrarily scaled
- raw gradients can be badly conditioned
- two large Euclidean steps can produce almost no change in the model, while a tiny Euclidean step in another direction can radically alter predictions
Fisher-aware updates attempt to normalize for this mismatch.
The conceptual update is
$$ \theta \leftarrow \theta - \eta\, I(\theta)^{-1} \nabla_\theta \mathcal{L}(\theta) $$
for a loss $\mathcal{L}$ and step size $\eta$.
In practice, exact inversion is impossible for large networks, but the idea still guides structured approximations and trust-region methods.
This is especially relevant in large probabilistic models, recurrent systems, and settings where invariance to reparameterization is not just mathematically nice, but operationally stabilizing.
7.3 K-FAC and Structured Fisher Approximations
K-FAC is one of the most influential large-scale approximations to natural-gradient geometry.
Its core idea is that in layered neural networks, certain blocks of the Fisher matrix can be approximated by Kronecker products of smaller factors. This makes matrix inversion much cheaper while preserving more structure than a purely diagonal approximation.
Why K-FAC matters conceptually:
- it treats Fisher geometry as worth preserving
- it exploits model structure rather than discarding it
- it demonstrates that local information geometry can be operationalized at useful scale
K-FAC is not the only such approximation, but it is a canonical example of how Fisher information motivates algorithm design rather than only statistical analysis.
The right mental model is:
K-FAC is a tractability strategy for Fisher-aware optimization, not a claim that full Fisher inversion has become cheap.
7.4 Elastic Weight Consolidation
Elastic Weight Consolidation (EWC) uses Fisher information in a very different way from natural gradient.
Instead of preconditioning updates, EWC uses diagonal Fisher information to estimate how important each parameter was for a previously learned task. When training on a new task, it penalizes movement along parameters deemed important for old-task likelihood.
A typical penalty has the form
$$ \Omega(\theta) = \frac{\lambda}{2} \sum_i F_i \big( \theta_i - \theta_i^{*} \big)^2, $$
where $F_i$ is a diagonal Fisher estimate from the old task and $\theta_i^{*}$ is the old-task parameter value.
Interpretation:
- if changing a parameter would strongly alter the old task's likelihood, the Fisher entry is large
- if a parameter barely mattered for the old task, the Fisher entry is small
This is a powerful reuse of the local-distinguishability idea. Fisher information identifies directions that were important to the model's previous behavior, and EWC tries not to forget them.
The limitations are also instructive:
- diagonal approximations ignore parameter coupling
- empirical Fisher estimates can be noisy
- deep-network symmetries can complicate the notion of parameter importance
Still, EWC is one of the clearest examples of Fisher information leaving classical statistics and becoming an ML systems tool.
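A minimal sketch of the penalty's shape is below. The gradients and parameter values are made up for illustration; a real EWC implementation estimates the diagonal Fisher from the old task's data and model, typically with labels drawn from the model's predictive distribution.

```python
import numpy as np

def diagonal_fisher(grads):
    """Diagonal Fisher estimate from per-example gradients of the old-task log-likelihood.

    grads: array of shape (n_examples, n_params)."""
    return (grads**2).mean(axis=0)

def ewc_penalty(theta, theta_old, fisher_diag, lam=1.0):
    """EWC-style quadratic penalty: (lam/2) * sum_i F_i (theta_i - theta_old_i)^2."""
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_old)**2)

# Tiny illustration with made-up gradients and parameters
rng = np.random.default_rng(7)
grads_old_task = rng.normal(scale=[2.0, 0.1, 1.0], size=(1_000, 3))  # param 0 mattered most
theta_old = np.array([0.5, -1.0, 2.0])
theta_new = theta_old + np.array([0.3, 0.3, 0.0])                    # moved params 0 and 1 equally

F = diagonal_fisher(grads_old_task)
print("diagonal Fisher estimate:", np.round(F, 3))
print("EWC penalty:", round(ewc_penalty(theta_new, theta_old, F, lam=1.0), 3))
# Moving the high-Fisher parameter is penalized far more than moving the low-Fisher one.
```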
7.5 Score Matching, Denoising, and Diffusion Connections
Score-based generative modeling uses derivatives of log densities with respect to data, not parameters:
$$ \nabla_x \log p(x) $$
This is not the same object as parameter-space Fisher information, but the conceptual bridge is direct:
- both are built from local log-density derivatives
- both encode local geometric structure
- both become especially meaningful under Gaussian smoothing
In score matching, one minimizes objectives related to Fisher divergence. In diffusion and denoising models, Gaussian corruption and score recovery are the central moves. The de Bruijn identity and related Fisher-entropy connections help explain why these procedures are so natural mathematically.
So while the full diffusion chapter would live elsewhere, Fisher information provides part of the local calculus that makes score-based generative modeling coherent.
8. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Treating Fisher information as "just the Hessian." | Fisher is the expected score outer product and, under regularity, the negative expected Hessian. The observed Hessian is sample-specific and not identical in general. | Keep expected Fisher, observed information, and Hessian separate unless you have conditions that justify equating them. |
| 2 | Thinking Fisher information is globally invariant as a number. | Scalar and matrix entries change under reparameterization. What is invariant is the induced metric form. | Track the quadratic form or metric interpretation, not raw matrix entries alone. |
| 3 | Confusing parameter score and data score. | $\nabla_\theta \log p(x \mid \theta)$ and $\nabla_x \log p(x)$ play different roles. | Always say which variable you are differentiating with respect to. |
| 4 | Using empirical Fisher as if it were the true Fisher automatically. | Empirical Fisher is a convenient approximation, not the same object by definition. | Name the approximation explicitly and explain why you are using it. |
| 5 | Assuming high Fisher means "more uncertainty." | High Fisher means stronger local distinguishability and typically lower variance bounds. | Remember that inverse Fisher behaves like a local variance scale. |
| 6 | Forgetting regularity conditions. | Mean-zero score and expected-Hessian identities can fail if support depends on the parameter or smoothness breaks. | Use the score-outer-product definition as primary and mention assumptions when invoking equivalent forms. |
| 7 | Believing Fisher must always be invertible. | Redundancy, symmetry, and non-identifiability create zero or tiny eigenvalues. | Expect singularity in overparameterized or symmetric models and use damping or pseudo-inverses where appropriate. |
| 8 | Treating Jeffreys prior as always proper. | The density $\pi(\theta) \propto \sqrt{\det I(\theta)}$ can fail to integrate over the parameter domain. | Check propriety separately from invariance arguments. |
| 9 | Thinking natural gradient solves all optimization problems. | It improves local geometry, but inversion, noise, singularity, and approximation quality still matter. | View Fisher-based updates as principled geometry-aware tools, not universal silver bullets. |
| 10 | Ignoring chapter boundaries and re-proving the full Cramer-Rao theory here. | The canonical home for CRB and asymptotic efficiency is Estimation Theory, not this Information Theory section. | Use forward and backward references instead of duplicating full proofs. |
| 11 | Assuming Fisher and Gauss-Newton coincide everywhere in neural nets. | They align only in specific likelihood-structured settings and approximations. | State the conditions under which the approximation is justified. |
| 12 | Missing the entropy connection. | Fisher can look like a purely statistical curvature object, but de Bruijn and related identities show it links directly to information theory. | Keep both the local-geometry and entropy-smoothing viewpoints in mind. |
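Mistakes 1 and 4 are common enough to deserve a tiny numerical illustration. The NumPy sketch below, assuming a toy logistic-regression model and an evaluation point that is deliberately not the maximum-likelihood fit, compares the (conditional) true Fisher, which averages $p(1-p)\,x x^\top$ over the inputs, with the empirical Fisher built from observed-label scores. Away from the MLE or under misspecification the two can differ noticeably.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, n = 3, 10_000
X = rng.normal(size=(n, d))
y = (rng.uniform(size=n) < sigmoid(X @ np.array([1.0, -2.0, 0.5]))).astype(float)

w = np.array([0.3, -0.3, 0.0])             # deliberately NOT the maximum-likelihood fit
p = sigmoid(X @ w)

# True (conditional) Fisher at w: expectation over labels drawn from the model,
# which for logistic regression reduces to E[p(1 - p) x x^T].
F_true = (X * (p * (1 - p))[:, None]).T @ X / n

# Empirical Fisher at w: outer products of observed-data scores, E[(y - p)^2 x x^T].
F_emp = (X * ((y - p) ** 2)[:, None]).T @ X / n

rel_diff = np.linalg.norm(F_true - F_emp) / np.linalg.norm(F_true)
print("true Fisher:\n", np.round(F_true, 3))
print("empirical Fisher:\n", np.round(F_emp, 3))
print("relative difference:", round(rel_diff, 3))
```

Exercise 6 below asks you to extend this comparison to the observed Hessian as well.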
9. Exercises
- **Exercise 1 [\*] - Bernoulli and Poisson Fisher information.** Derive the score and Fisher information for Bernoulli($p$) and Poisson($\lambda$), and explain the parameter regions where each model is most informative.
- **Exercise 2 [\*] - Additivity under iid sampling.** Starting from the score of an iid sample, prove that $I_n(\theta) = n\,I(\theta)$ for scalar parameters and explain the connection to uncertainty scaling.
- **Exercise 3 [\*\*] - Reparameterization.** For the Bernoulli model, compute Fisher information both in probability coordinates $p$ and in logit coordinates $\eta = \log\frac{p}{1-p}$, and verify the transformation law explicitly.
- **Exercise 4 [\*\*] - Local KL curvature.** Perform a second-order Taylor expansion of $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta+\delta})$ for a smooth scalar model and show that the quadratic coefficient is $\tfrac{1}{2} I(\theta)$.
- **Exercise 5 [\*\*] - Jeffreys prior.** Derive Jeffreys prior for Bernoulli($p$), Poisson($\lambda$), and the Gaussian mean model with known variance. Identify which priors are proper and which are not.
- **Exercise 6 [\*\*\*] - True Fisher, empirical Fisher, and Hessian.** On a toy logistic-regression problem, compute a minibatch empirical Fisher, an observed Hessian, and a Monte-Carlo approximation to the true Fisher. Compare them numerically and interpret the differences.
- **Exercise 7 [\*\*\*] - Natural gradient or K-FAC-style structure.** Derive the Fisher matrix for a logistic model and use it to construct a natural-gradient update. Then explain what would need to be approximated in a multilayer network for this idea to scale.
- **Exercise 8 [\*\*\*] - Fisher in a modern AI workflow.** Choose one of the following: (a) diagonal Fisher for continual-learning importance weights, (b) Fisher divergence in score matching, (c) Jeffreys prior in a one-parameter Bayesian model. Build a small worked example and explain the modeling tradeoffs.
10. Why This Matters for AI (2026 Perspective)
| Concept | AI impact |
|---|---|
| Score variance interpretation | Clarifies how local sensitivity of probabilistic models translates into learnability and uncertainty |
| Fisher information matrix | Provides a local geometry for model families and informs curvature-aware training |
| Expected vs empirical Fisher | Separates principled information geometry from convenient deep-learning approximations |
| Local KL curvature | Explains trust-region and natural-gradient ideas in probabilistic terms |
| Fisher-Rao metric | Gives an intrinsic notion of distance between nearby models, useful for invariant optimization and geometry |
| Jeffreys prior | Connects invariant Bayesian defaults to local information geometry |
| Fisher divergence | Bridges local score-field matching and likelihood-free or unnormalized modeling ideas |
| de Bruijn identity | Connects noise injection, entropy growth, and local sharpness in generative modeling |
| K-FAC-style approximations | Shows how Fisher-aware updates can become computationally plausible in large nets |
| EWC | Uses Fisher diagonals to protect old-task knowledge in continual learning |
| Singular Fisher directions | Explains why overparameterized neural models can have flat, redundant, or weakly identifiable directions |
| Diffusion / score connections | Provides conceptual grounding for why local density gradients matter so much in modern generative AI |
Fisher information matters in 2026 because AI systems are no longer judged only by raw accuracy or loss curves. We care about stability, calibration, catastrophic forgetting, parameter efficiency, and local uncertainty geometry. Fisher information sits near the center of all those concerns.
It is also a concept that travels unusually well across abstraction levels. The same matrix can appear as:
- a variance of scores in a statistics proof
- a metric tensor in information geometry
- a preconditioner target in natural-gradient optimization
- a parameter-importance proxy in continual learning
- a conceptual bridge to score-based generative modeling
That is rare. Most mathematical objects live primarily in one chapter of the curriculum. Fisher information touches many.
There is also a strategic reason advanced practitioners keep returning to Fisher-based ideas. Modern foundation-model training mixes at least four pressures at once:
- models are huge, so exact curvature is impossible
- parameterizations are highly redundant
- calibration and uncertainty still matter
- continual or modular adaptation is increasingly common
Fisher information does not solve all four by itself, but it gives a common language for all four. That is why it appears repeatedly in papers that look superficially unrelated: optimizers, continual-learning penalties, local uncertainty approximations, and score-based generative methods are all borrowing from the same local geometry.
For LLM-era systems, the most realistic takeaway is not "compute the exact Fisher matrix." It is:
- understand what Fisher geometry says the right local metric should be
- understand which approximations preserve that geometry faithfully enough for your use case
- know when the approximation you are using is merely heuristic
That distinction between principle and approximation is a large part of mature ML engineering in 2026.
11. Conceptual Bridge
Fisher information closes the Information Theory chapter by turning global divergence ideas into local geometry. Entropy gave us intrinsic uncertainty. KL Divergence gave us a global mismatch measure. Mutual Information gave us uncertainty reduction between variables. Cross-Entropy gave us expected code length under a model. Fisher information now gives us the local second-order structure of those model families.
Backward, it connects strongly to 07-Statistics/02-Estimation-Theory, where Fisher information appears in the asymptotic theory of maximum likelihood, standard errors, and the Cramer-Rao lower bound. That Statistics section is the canonical home of full efficiency theory. This section reframes the same object through the lens of local information geometry.
Forward, the next major connections are:
- to 08-Optimization/03-Second-Order-Methods through natural gradient, structured curvature, and K-FAC
- to 07-Statistics/04-Bayesian-Inference through Jeffreys priors and local curvature approximations
- to future generative-model sections through score matching, denoising, and diffusion-style local density geometry
```
FISHER INFORMATION IN THE CURRICULUM
===============================================================
  Entropy
    uncertainty of a distribution
        |
        v
  KL divergence
    global mismatch between distributions
        |
        v
  Cross-entropy
    expected code length under a model
        |
        v
  Fisher information
    local curvature of model distinguishability
        |
        +--> Cramer-Rao and asymptotic variance
        +--> Jeffreys prior
        +--> natural gradient
        +--> K-FAC / curvature approximations
        +--> EWC
        +--> score matching and diffusion intuition
===============================================================
```
If there is one sentence to keep from this chapter, it is this:
Fisher information measures how sharply nearby models become distinguishable.
That sentence unifies likelihood curvature, estimator precision, local KL geometry, natural gradient, and the modern reuse of Fisher ideas in deep learning.
There is an additional bridge worth carrying forward into later chapters:
- in estimation theory, Fisher tells us how hard it is to infer parameters
- in optimization, Fisher tells us how to move parameters in a geometry-aware way
- in Bayesian workflows, Fisher shapes invariant priors and local posterior approximations
- in generative modeling, Fisher-type score objects tell us how densities respond to infinitesimal smoothing and denoising
Seen this way, Fisher information is not "one more formula after KL." It is the object that converts information theory from a language of global divergences into a language of local geometry.
That local-geometry perspective is one of the main mathematical habits that separates beginner probabilistic ML from advanced probabilistic ML. Once you start asking not only "what is the loss?" but also "what metric is this update implicitly using?" or "what local directions are statistically meaningful?" you are already thinking in Fisher's language.
This is why the chapter order matters. Entropy, KL, mutual information, and cross-entropy teach you how to measure uncertainty and mismatch globally. Fisher information teaches you how those global quantities behave infinitesimally. That is the natural endpoint of the Information Theory chapter and a natural launch point into geometry-aware optimization and probabilistic modeling.
References
- Fisher, R. A. (1925). Theory of Statistical Estimation. Proceedings of the Cambridge Philosophical Society.
- Cramer, H. (1946). Mathematical Methods of Statistics. Princeton University Press.
- Rao, C. R. (1945). "Information and the Accuracy Attainable in the Estimation of Statistical Parameters." Bulletin of the Calcutta Mathematical Society.
- Jeffreys, H. (1946). "An Invariant Form for the Prior Probability in Estimation Problems." Proceedings of the Royal Society A.
- Cover, T. M., and Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.
- Duchi, J. C. Lecture Notes on Statistics and Information Theory.
- MIT OpenCourseWare. Statistics for Applications, lecture3.pdf.
- Ly, A., Marsman, M., Verhagen, J., Grasman, R., and Wagenmakers, E.-J. (2017). "A Tutorial on Fisher Information." arXiv
- Amari, S. (1998). "Natural Gradient Works Efficiently in Learning." Neural Computation.
- Martens, J., and Grosse, R. (2015). "Optimizing Neural Networks with Kronecker-factored Approximate Curvature." arXiv
- Kirkpatrick, J., Pascanu, R., Rabinowitz, N., et al. (2017). "Overcoming catastrophic forgetting in neural networks." arXiv
- Kunstner, F., Balles, L., and Hennig, P. (2020). "Limitations of the Empirical Fisher Approximation for Natural Gradient Descent." arXiv
- Amari, S., Karakida, R., and Oizumi, M. (2019). "Fisher Information and Natural Gradient Learning in Random Deep Networks." PMLR
- Gupta, V., Koren, T., and Singer, Y. (2018). "Shampoo: Preconditioned Stochastic Tensor Optimization." Proceedings of ICML.
- Hyvarinen, A. (2005). "Estimation of Non-Normalized Statistical Models by Score Matching." Journal of Machine Learning Research.