"The Jacobian is the derivative - not a number, not a vector, but a linear map. Once you see it that way, everything in deep learning becomes composition of linear maps."
- paraphrasing Michael Spivak, Calculus on Manifolds
Overview
The gradient is the right object when $f$ is scalar-valued. Neural networks are not scalar-valued: every layer maps a vector to a vector, every activation transforms a vector elementwise, and the full forward pass is a long chain of vector-to-vector maps. When you differentiate a vector-valued function you get a matrix of partial derivatives - the Jacobian. The Jacobian is the derivative in the correct sense: it is the unique linear map that best approximates the function near a point.
The Hessian is the second derivative. Once you have the gradient (which is vector-valued), you can differentiate it again to get the Jacobian of the gradient - an $n \times n$ matrix called the Hessian $H_f$. The Hessian encodes curvature: how quickly the gradient rotates and stretches as you move through parameter space. Curvature determines whether a critical point is a minimum, maximum, or saddle; it determines how fast gradient descent converges; it is the object that second-order optimisers (Newton, K-FAC, Shampoo) use to accelerate training.
This section builds both objects completely. We prove every key result from first principles: the Fréchet derivative definition, Clairaut's theorem with a complete proof and a counterexample, the second-order Taylor theorem with remainder, the Newton step derivation, and the efficient HVP identity. We derive from scratch the Jacobian of every major ML operation: linear layers, elementwise activations, softmax, layer normalisation. We connect everything to modern practice: why backpropagation is VJP composition, why the K-FAC approximation works, why SAM seeks flat minima, and how influence functions use HVPs.
Prerequisites
- Partial derivatives, gradient, directional derivatives - 01 Partial Derivatives and Gradients
- Matrix-vector multiplication, transpose, trace, singular values - 02 Linear Algebra Basics
- Eigenvalue decomposition, positive definite matrices - 07 Positive Definite Matrices
- Single-variable chain rule and Taylor series - 02-04 Calculus Fundamentals
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Jacobian and Hessian computation, ML layer Jacobians, Taylor approximation, Hessian spectrum, Newton's method, HVPs |
| exercises.ipynb | 10 graded exercises from Jacobian computation through Gauss-Newton and SAM |
Learning Objectives
After completing this section, you will be able to:
- Define the Jacobian for $f: \mathbb{R}^n \to \mathbb{R}^m$ via the Fréchet derivative and compute it for any differentiable function
- Prove Clairaut's theorem on symmetry of mixed partial derivatives, and give a concrete counterexample showing when it fails
- Define the Hessian as $H_f = J_{\nabla f}$ and compute it analytically for quadratic, log-sum-exp, and neural network loss functions
- Derive the second-order multivariate Taylor theorem with explicit remainder, and use it to build quadratic models of loss functions
- Classify critical points using Hessian eigenvalues: positive definite -> strict minimum, indefinite -> saddle
- Derive from scratch the Jacobian of a linear layer, elementwise activation, softmax, and layer normalisation - including the full proof for softmax
- Explain JVPs (forward mode) and VJPs (reverse mode), and prove why reverse mode requires only one pass for a scalar loss
- Derive the Newton step from the quadratic model and explain its quadratic convergence
- Implement Hessian-vector products in $O(n)$ time without materialising the Hessian
- State and apply the Gauss-Newton and K-FAC approximations, explaining why their factored structure makes curvature tractable at scale
Table of Contents
- 1. Intuition
- 2. The Jacobian Matrix
- 3. Jacobians of Key ML Operations
- 4. The Hessian Matrix
- 5. Second-Order Taylor Expansion
- 6. Hessian Definiteness and Curvature
- 7. Newton's Method and Second-Order Optimisation
- 8. Hessian-Vector Products
- 9. Advanced Topics
- 10. Common Mistakes
- 11. Exercises
- 12. Why This Matters for AI (2026 Perspective)
- Conceptual Bridge
1. Intuition
1.1 From Gradient to Jacobian
Recall from 01 that for a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient captures all the first-order information: knowing $\nabla f(x)$, we can predict the change in $f$ along any direction $u$ as $\nabla f(x)^\top u$. The fundamental linear approximation is:

$$f(x + \delta) \approx f(x) + \nabla f(x)^\top \delta$$

Now suppose the function is vector-valued: $g: \mathbb{R}^n \to \mathbb{R}^m$. It produces $m$ output numbers from $n$ input numbers. We need to capture how each of the $m$ outputs changes as we vary each of the $n$ inputs. There are $mn$ such partial derivatives - they form an $m \times n$ matrix. This matrix is the Jacobian.

The linear approximation generalises cleanly: if $\delta \in \mathbb{R}^n$ is a small input perturbation, then the output perturbation is approximately:

$$g(x + \delta) - g(x) \approx J_g(x)\,\delta$$

where $J_g(x) \in \mathbb{R}^{m \times n}$ is the Jacobian. The product $J_g(x)\,\delta$ is a matrix-vector product - not a dot product. This is the key structural difference between scalar and vector differentiation.

```
SCALAR FUNCTION                       VECTOR FUNCTION
f: R^n -> R                           g: R^n -> R^m
Derivative: grad f in R^n             Derivative: J_g in R^{m x n}
df ~= grad f . delta                  dg ~= J_g delta
     (dot product)                         (matrix-vector product)

Row i of J_g = (grad g_i)^T    gradient of i-th output component
Col j of J_g = dg/dx_j         how ALL outputs respond to x_j

Shape check:
delta in R^n,  J_g in R^{m x n}  ->  J_g delta in R^m
```

Why this matters for neural networks. A transformer layer takes an embedding $x \in \mathbb{R}^d$ and produces another embedding $y \in \mathbb{R}^d$. The Jacobian $J \in \mathbb{R}^{d \times d}$ captures how every output dimension responds to every input dimension. In backpropagation, when a gradient $\nabla_y L$ arrives from the next layer, the upstream gradient is $\nabla_x L = J^\top \nabla_y L$ - the transposed Jacobian acting on the incoming gradient. Backpropagation is Jacobian transposition applied repeatedly.

The Jacobian also governs whether information propagates through the network without distortion. If $\lVert J\rVert_2 > 1$ layer after layer, gradients explode. If $\lVert J\rVert_2 < 1$ layer after layer, gradients vanish. This is why initialisation schemes (Xavier, He) and normalisation layers (BatchNorm, LayerNorm) are designed to keep Jacobians near the identity in spectral norm.
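To make the linear-approximation picture concrete, here is a minimal PyTorch sketch (the function `f` below is an arbitrary illustrative map, not one from the text): the predicted change $J_f(x)\,\delta$ matches the actual output change up to $O(\lVert\delta\rVert^2)$.

```python
import torch

def f(x):
    # Arbitrary map R^2 -> R^3 for illustration
    return torch.stack([torch.sin(x[0]) + x[1], x[0] * x[1], torch.exp(x[1])])

x = torch.tensor([0.5, -1.0])
delta = torch.tensor([1e-3, 2e-3])

J = torch.autograd.functional.jacobian(f, x)   # 3 x 2 matrix of partials
predicted = J @ delta                          # matrix-vector product, not a dot product
actual = f(x + delta) - f(x)
print(torch.allclose(predicted, actual, atol=1e-5))   # True: error is O(||delta||^2)
```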
1.2 From Gradient to Hessian
The gradient $\nabla f: \mathbb{R}^n \to \mathbb{R}^n$ is a vector-valued function. We just established that vector-valued functions have Jacobians. So: what is the Jacobian of the gradient? It is the Hessian:

$$H_f(x) = J_{\nabla f}(x)$$

The $(i, j)$ entry of $H_f$ is:

$$[H_f(x)]_{ij} = \frac{\partial}{\partial x_j}\left(\frac{\partial f}{\partial x_i}\right)$$

This measures how the $i$-th component of the gradient changes when you increase $x_j$. Geometrically: the Hessian describes how the gradient rotates and stretches as you move through space. This is curvature.

Curvature in direction $u$: Consider the function restricted to the line through $x$ in direction $u$: $\phi(t) = f(x + tu)$. The curvature of $\phi$ at $t = 0$ is:

$$\phi''(0) = u^\top H_f(x)\, u$$

The Hessian thus answers: "how fast does $f$ curve when I walk in direction $u$?" The answer is the Rayleigh quotient $u^\top H_f(x)\, u$ (for unit $u$).

Why sign of curvature matters for optimisation. If every direction has positive curvature ($u^\top H u > 0$ for all $u \neq 0$), the function is locally bowl-shaped - every direction goes up from the bottom, confirming we are at a local minimum. If some direction has negative curvature, we can slide downhill in that direction - we are at a saddle point, not a minimum. If all curvatures are negative, we are at a local maximum.

```
CURVATURE AND CRITICAL POINTS

All eigenvalues lambda > 0:   u^T H u > 0 for all u != 0
    \  bowl  /    <- curved upward in ALL directions
     \______/        local MINIMUM (grad f = 0 here)

Mixed eigenvalues:   exists u, v:  u^T H u > 0,  v^T H v < 0
      saddle         goes up in some directions, down in others
                     SADDLE POINT (grad f = 0 here)

All eigenvalues lambda < 0:   u^T H u < 0 for all u != 0
      ______
     / hill \     <- curved downward in ALL directions
    /        \       local MAXIMUM (grad f = 0 here)
```

For neural networks, the Hessian of the loss has a rich spectrum: thousands of near-zero eigenvalues (flat directions in parameter space) and a handful of large eigenvalues (sharply curved directions, often corresponding to the most influential parameters). The ratio $\lambda_{\max}/\lambda_{\min}$ - the condition number - determines how hard the optimisation problem is for gradient descent.
1.3 Historical Context
| Year | Contributor | Development |
|---|---|---|
| 1841 | Carl Jacobi | Introduced functional determinants in the study of coordinate transforms - what we now call the Jacobian matrix |
| 1844 | Ludwig Hesse | Second-order determinant for classifying critical points of functions of two variables |
| 1867 | Hesse | Expanded to the full matrix of second partials; term "Hessian" coined posthumously |
| 1912 | Morse | Morse lemma: non-degenerate critical points ($\det H \neq 0$) have a normal form determined entirely by the Hessian |
| 1960s | Levenberg, Marquardt | Gauss-Newton + regularisation for nonlinear least squares - still the workhorse for nonlinear optimisation |
| 1987 | Rumelhart, Hinton, Williams | Popularised backpropagation = repeated VJP composition through Jacobians |
| 1992 | LeCun, Simard, Pearlmutter | Diagonal Hessian approximation for adaptive learning rates - precursor to Adagrad/Adam |
| 2002 | Amari | Natural gradient = Fisher information (expected Hessian of log-likelihood) as preconditioner |
| 2015 | Martens & Grosse | K-FAC: Kronecker-factored curvature approximation; practical second-order training |
| 2018 | Ghorbani, Krishnan, Xiao | Empirical Hessian spectrum of deep networks: bulk + outlier structure |
| 2020 | Foret et al. | SAM: sharpness-aware minimisation; seeking parameters where the worst-case loss increase $\max_{\lVert\epsilon\rVert \le \rho} L(w+\epsilon) - L(w)$ is small |
| 2022 | Vyas et al. | Influence functions at GPT scale via HVPs and the Arnoldi method |
| 2023 | Google Brain | Distributed Shampoo in production LLM training; K-FAC for trillion-parameter models |
1.4 Why These Matrices Define Deep Learning
Every significant algorithm in modern deep learning uses Jacobians or Hessians, often implicitly:
Jacobians are present in:

- Backpropagation. For a network $f = f_L \circ \cdots \circ f_1$, the gradient with respect to the input of layer $\ell$ is computed via $g_\ell = J_{f_\ell}^\top g_{\ell+1}$. Every backward pass is a sequence of transposed Jacobian multiplications.

- LoRA and low-rank updates. The weight gradient for a linear layer is $\nabla_W L = g\,x^\top$ - a rank-1 outer product. LoRA exploits that the effective gradient update across a training run lives in a low-rank subspace, parameterising $\Delta W = BA$ with rank $r \ll \min(m, n)$.

- Normalising flows. Generative models (RealNVP, Glow, Neural ODE) learn invertible maps. The log-likelihood requires $\log|\det J_f(x)|$ - the log-volume scaling factor of the Jacobian.

- Gradient explosion/vanishing. For a depth-$L$ network, $J = J_L J_{L-1} \cdots J_1$. If each $\lVert J_\ell\rVert_2 > 1$, gradients grow exponentially (explosion); if each $\lVert J_\ell\rVert_2 < 1$, they decay exponentially (vanishing).

Hessians are present in:

- Newton's method and quasi-Newton (L-BFGS). The Newton step $\Delta x = -H^{-1}\nabla f$ converges quadratically near the minimum. L-BFGS approximates $H^{-1}$ via low-rank updates built from the last $m$ gradient differences.

- K-FAC and Shampoo. Approximate the Hessian (or Fisher) using Kronecker products of per-layer statistics. Used in production training at Google (Shampoo for Gemini-scale models).

- SAM (Sharpness-Aware Minimisation). Explicitly targets parameters where $\lambda_{\max}(H)$ is small, achieving better generalisation by finding flat minima.

- Influence functions. The influence of training example $z_i$ on the test loss is $-\nabla L(z_{\text{test}})^\top H^{-1} \nabla L(z_i)$, requiring solving a linear system via Hessian-vector products.

- Loss landscape analysis. The ratio $\kappa = \lambda_{\max}/\lambda_{\min}$ determines convergence speed; the spectral density reveals whether the landscape is nearly flat (overparameterised regime) or sharply curved.
2. The Jacobian Matrix
2.1 Formal Definition via Fréchet Derivative
Before writing down the Jacobian matrix, it helps to understand the conceptual definition from which the matrix naturally emerges.
Definition (Fréchet derivative). Let $f: \mathbb{R}^n \to \mathbb{R}^m$ be defined on an open set $U \subseteq \mathbb{R}^n$. We say $f$ is differentiable at $x \in U$ if there exists a linear map $A: \mathbb{R}^n \to \mathbb{R}^m$ such that:

$$\lim_{\delta \to 0} \frac{\lVert f(x + \delta) - f(x) - A\delta \rVert}{\lVert \delta \rVert} = 0$$

In other words: $A$ is the best linear approximation to $f$ near $x$, in the sense that the error is sublinear in $\lVert\delta\rVert$ (it goes to zero faster than $\lVert\delta\rVert$ itself).

Theorem (Uniqueness). If such $A$ exists, it is unique. Moreover, the matrix of $A$ in the standard bases is exactly the matrix of partial derivatives.

Definition (Jacobian matrix). The unique linear map from the Fréchet definition, written as an $m \times n$ matrix, is the Jacobian of $f$ at $x$:

$$J_f(x) = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{pmatrix}$$

The $(i, j)$ entry is $[J_f]_{ij} = \partial f_i / \partial x_j$.

Proof that the Fréchet derivative equals the Jacobian (sketch). Applying the Fréchet condition with $\delta = t e_j$ (scalar multiple of the $j$-th basis vector) and dividing by $|t|$:

$$A e_j = \lim_{t \to 0} \frac{f(x + t e_j) - f(x)}{t} = \frac{\partial f}{\partial x_j}(x)$$

So the matrix entries of $A$ are precisely the partial derivatives.

Important caveat. The converse fails: the existence of all partial derivatives does not guarantee Fréchet differentiability. You additionally need the partial derivatives to be continuous, or equivalently that the limit in the Fréchet definition holds for all directions simultaneously (not just coordinate directions). The standard sufficient condition is: if all partial derivatives exist and are continuous at $x$, then $f$ is Fréchet differentiable at $x$ (this is the $C^1$ condition).

Notation. We write $J_f(x)$, $Df(x)$, $\frac{\partial f}{\partial x}$, or $f'(x)$ interchangeably. The Jacobian matrix depends on the point $x$ - it is a matrix-valued function of the input.
Row and column interpretations:
```
TWO WAYS TO READ THE JACOBIAN

J_f in R^{m x n}   for   f: R^n -> R^m

         x_1   x_2   x_3  ...  x_n
f_1   [ row 1 = (grad f_1)^T ]    <- transposed gradient of output 1
f_2   [ row 2 = (grad f_2)^T ]    <- transposed gradient of output 2
f_3   [ row 3 = (grad f_3)^T ]
...
f_m   [ row m = (grad f_m)^T ]

        col j = df/dx_j : sensitivity of ALL outputs to x_j

Row reading:    "Given that I move in all n directions,
                 how does output i respond?"
Column reading: "If I increase input j alone,
                 how do all m outputs respond?"
```
2.2 Special Cases
The Jacobian unifies all the derivative objects from earlier chapters:
Case 1: Scalar function $f: \mathbb{R}^n \to \mathbb{R}$ (i.e., $m = 1$). The Jacobian is a $1 \times n$ row vector: $J_f(x) = \nabla f(x)^\top$. The gradient is the transposed Jacobian. This is why we write $\nabla f$ as a column vector and $J_f$ as a row vector for scalar functions - they are transposes of each other.

Case 2: Linear map $f(x) = Ax$ with $A \in \mathbb{R}^{m \times n}$. $f_i(x) = \sum_j A_{ij} x_j$, so $\partial f_i/\partial x_j = A_{ij}$. So $J_f(x) = A$ everywhere - the Jacobian of a linear map is the matrix of that map, constant everywhere. This is the vector analogue of $\frac{d}{dx}(ax) = a$.

Case 3: Identity map $f(x) = x$. $\partial f_i/\partial x_j = \delta_{ij}$, so $J_f = I_n$.

Case 4: Scalar function of a scalar $f: \mathbb{R} \to \mathbb{R}$. $J_f(x) = [f'(x)]$ - a $1 \times 1$ matrix, i.e., the ordinary derivative.

Case 5: Vector function of a scalar $f: \mathbb{R} \to \mathbb{R}^m$ (a parametric curve). $J_f(t) = (f_1'(t), \ldots, f_m'(t))^\top$ - a column vector, the velocity vector. Shape: $m \times 1$.

| Map type | Inputs | Outputs | Jacobian shape | Name |
|---|---|---|---|---|
| $f: \mathbb{R}^n \to \mathbb{R}$ | $n$ | 1 | $1 \times n$ | Transposed gradient |
| $f: \mathbb{R}^n \to \mathbb{R}^m$ | $n$ | $m$ | $m \times n$ | Jacobian matrix |
| $f: \mathbb{R}^n \to \mathbb{R}^n$ | $n$ | $n$ | $n \times n$ | Square Jacobian |
| $f: \mathbb{R} \to \mathbb{R}$ | 1 | 1 | $1 \times 1$ | Ordinary derivative |

Non-example: a function with partials but no Jacobian.

$$f(x, y) = \begin{cases} \dfrac{xy}{x^2 + y^2} & (x, y) \neq (0, 0) \\ 0 & (x, y) = (0, 0) \end{cases}$$

Both partials $f_x(0,0)$ and $f_y(0,0)$ equal $0$ at the origin (the function vanishes along both axes). But $f$ is not differentiable there: along the diagonal $x = y = t$, $f(t, t) = 1/2 \not\to 0$. The Fréchet condition fails. This shows why we need continuity of partials, not just existence.
2.3 Computing Jacobians Systematically
Procedure. To compute $J_f$ for $f: \mathbb{R}^n \to \mathbb{R}^m$:

Row-by-row method (one output at a time): For each $i = 1, \ldots, m$: treat $f_i$ as a scalar function and differentiate it with respect to each $x_j$. Place the resulting partial derivatives in row $i$.

Column-by-column method (one input at a time): For each $j = 1, \ldots, n$: differentiate the entire vector $f$ with respect to $x_j$ (treating all other inputs as constants). Place the resulting $m$-vector in column $j$.

Standard Jacobian table:

| Function | Domain -> Codomain | Jacobian |
|---|---|---|
| $f(x) = Ax$ | $\mathbb{R}^n \to \mathbb{R}^m$ | $A$ |
| $f(x) = \sigma(x)$ (elementwise) | $\mathbb{R}^n \to \mathbb{R}^n$ | $\mathrm{diag}(\sigma'(x_1), \ldots, \sigma'(x_n))$ |
| $f(x) = a^\top x$ (scalar) | $\mathbb{R}^n \to \mathbb{R}$ | $a^\top$ (row vector) |
| $f(x) = \lVert x\rVert^2$ (scalar) | $\mathbb{R}^n \to \mathbb{R}$ | $2x^\top$ (row vector) |
| $f(X) = XW$ (matrix output) | $\mathbb{R}^{B \times n} \to \mathbb{R}^{B \times m}$ | Tensor: $\partial (XW)_{ij}/\partial X_{kl} = \delta_{ik} W_{lj}$ |
2.4 Jacobian Determinant and Volume Scaling
When $m = n$ (square Jacobian), the Jacobian determinant $\det J_f(x)$ has a direct geometric meaning: it measures by what factor $f$ scales volumes near $x$.

Formally: if $R$ is a small region near $x$, then $\mathrm{vol}(f(R)) \approx |\det J_f(x)| \cdot \mathrm{vol}(R)$.

- $|\det J_f| > 1$: expands local volume
- $|\det J_f| < 1$: contracts local volume
- $\det J_f = 0$: collapses volume (the map is locally degenerate, not invertible)
- $\det J_f < 0$: reverses orientation (a reflection-like component)

Change of variables formula. For integration:

$$\int_{f(R)} g(y)\, dy = \int_R g(f(x))\, |\det J_f(x)|\, dx$$

This is the multivariate analogue of substitution: $\int g(u)\, du = \int g(f(x))\, f'(x)\, dx$ with $u = f(x)$.

Application to normalising flows. A normalising flow models the data density by learning an invertible map $f$ from a simple base density (e.g., Gaussian). By change of variables:

$$\log p_X(x) = \log p_Z(f(x)) + \log|\det J_f(x)|$$

Training requires computing $\log|\det J_f(x)|$ for each batch element. Standard Gaussian elimination would cost $O(n^3)$ - prohibitive for large $n$. Flow architectures (RealNVP, Glow) are designed so that $J_f$ is triangular or has Kronecker structure, making the determinant $O(n)$ to evaluate.

Singular values and volume scaling. More precisely: $|\det J_f| = \prod_{i=1}^n \sigma_i$, where $\sigma_i$ are the singular values. The two-norm $\lVert J_f\rVert_2 = \sigma_{\max}$ tells us the maximum stretching factor in any direction. The condition number $\kappa = \sigma_{\max}/\sigma_{\min}$ measures how anisotropic the stretching is.
2.5 Jacobian of a Composition (Preview)
The chain rule for Jacobians states: if $h = g \circ f$ (i.e., $h(x) = g(f(x))$) then:

$$J_h(x) = J_g(f(x)) \; J_f(x)$$

The Jacobian of a composed function is the product of Jacobians - outer function first.

Shape check: if $f: \mathbb{R}^n \to \mathbb{R}^m$ and $g: \mathbb{R}^m \to \mathbb{R}^p$, then $J_f \in \mathbb{R}^{m \times n}$, $J_g \in \mathbb{R}^{p \times m}$, and $J_g J_f \in \mathbb{R}^{p \times n}$ - matches for $h: \mathbb{R}^n \to \mathbb{R}^p$.

This is the mathematical theorem that underlies backpropagation. A neural network $f = f_L \circ \cdots \circ f_2 \circ f_1$ with scalar loss $L$ has Jacobian:

$$J_{L \circ f}(x) = J_L \; J_{f_L} \cdots J_{f_2} \; J_{f_1}$$

Since $L$ is scalar, $J_{L \circ f}$ is a $1 \times n$ row vector - the transposed input gradient.

Full proof and derivation: 03 Chain Rule and Backpropagation derives this rigorously and applies it to the full backpropagation algorithm.
2.6 Worked Examples with Full Derivations
Example 1: Componentwise function. $f: \mathbb{R}^2 \to \mathbb{R}^2$, $f(x_1, x_2) = (x_1 x_2,\; x_1 + x_2^2)$.

Row 1 ($f_1 = x_1 x_2$): $\partial f_1/\partial x_1 = x_2$, $\partial f_1/\partial x_2 = x_1$.

Row 2 ($f_2 = x_1 + x_2^2$): $\partial f_2/\partial x_1 = 1$, $\partial f_2/\partial x_2 = 2x_2$.

$$J_f(x) = \begin{pmatrix} x_2 & x_1 \\ 1 & 2x_2 \end{pmatrix}$$

At $x = (1, 2)$:

$$J_f(1, 2) = \begin{pmatrix} 2 & 1 \\ 1 & 4 \end{pmatrix}$$

Interpretation: if we perturb the input by $\delta = (0.1, 0)$, the output changes by approximately $J_f\,\delta = (0.2, 0.1)$.
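A quick autograd check of this example (a minimal sketch; `torch.autograd.functional.jacobian` materialises the full matrix, which is fine at this size):

```python
import torch

def f(x):
    # f(x1, x2) = (x1*x2, x1 + x2^2) from the example above
    return torch.stack([x[0] * x[1], x[0] + x[1] ** 2])

x = torch.tensor([1.0, 2.0])
J = torch.autograd.functional.jacobian(f, x)
print(J)                              # [[2., 1.], [1., 4.]] - matches the hand computation
print(J @ torch.tensor([0.1, 0.0]))   # [0.2, 0.1] - predicted output change
```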
Example 2: Unit normalisation $f(x) = \dfrac{x}{\lVert x\rVert}$.

Let $r = \lVert x\rVert$, so $f_i = x_i/r$ and $\partial r/\partial x_j = x_j/r$. By the quotient rule:

$$\frac{\partial f_i}{\partial x_j} = \frac{\delta_{ij}}{r} - \frac{x_i x_j}{r^3}$$

In matrix form, with $\hat{x} = x/\lVert x\rVert$:

$$J_f(x) = \frac{1}{\lVert x\rVert}\left(I - \hat{x}\hat{x}^\top\right)$$

Interpretation: $J_f$ is a projection (scaled by $1/\lVert x\rVert$). The matrix $I - \hat{x}\hat{x}^\top$ projects onto the hyperplane perpendicular to $x$. So:

- Perturbations perpendicular to $x$: these are preserved (scaled by $1/\lVert x\rVert$) - perturbing perpendicular to a unit vector changes its direction
- Perturbations parallel to $x$ (i.e., scaling $x$): these are killed by the projection - scaling a vector doesn't change its direction

Non-example: Jacobian of a max operation. $f(x) = \max_i x_i$ (scalar). This function is not differentiable when two or more components tie for the maximum. At a point where $x_k$ is the unique maximum:

$$\nabla f(x) = e_k$$

At a tie, the Fréchet derivative does not exist. The Clarke subdifferential is the convex hull of $\{e_i : x_i = \max_j x_j\}$.
3. Jacobians of Key ML Operations
3.1 Linear Layer
Setup. A linear (dense) layer computes $y = Wx + b$ with $W \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$.

Jacobian w.r.t. input $x$:

$$\frac{\partial y_i}{\partial x_j} = W_{ij}$$

So $J_x(y) = W$. The Jacobian of a linear layer is constant (same at every input point) and equals the weight matrix.

Backpropagation through a linear layer. Given upstream gradient $g = \nabla_y L$ (i.e., $g_i = \partial L/\partial y_i$), the downstream gradient is:

$$\nabla_x L = W^\top g$$

This is the fundamental backprop equation for a linear layer: multiply by the transposed weight matrix.

Gradient w.r.t. weight matrix $W$. The function $y = Wx + b$ is also linear in $W$. Consider $y$ as a function of $\mathrm{vec}(W)$. The Jacobian of $y$ w.r.t. $\mathrm{vec}(W)$ is:

$$\frac{\partial y}{\partial\,\mathrm{vec}(W)} = x^\top \otimes I_m$$

where $\otimes$ is the Kronecker product. For scalar loss $L$, the weight gradient is:

$$\nabla_W L = g\, x^\top$$

This is a rank-1 outer product - for each training sample, the weight gradient has rank at most 1. The gradient accumulated over a mini-batch of size $B$ has rank at most $B$.

Implication for LoRA. Since the weight gradient is the sum of rank-1 outer products $g^{(b)} (x^{(b)})^\top$, the effective gradient subspace across an entire training run has rank much less than $\min(m, n)$ - especially for large pre-trained models where the task-specific signal is concentrated in a low-dimensional subspace. LoRA (Hu et al., 2021) exploits this by parameterising the weight update as $\Delta W = BA$ with $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$, $r \ll \min(m, n)$. Training only $A$ and $B$ reduces parameters from $mn$ to $r(m + n)$.
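Both backprop identities for the linear layer can be checked in a few lines of PyTorch (a minimal sketch with arbitrary shapes):

```python
import torch

m, n = 3, 4
W = torch.randn(m, n, requires_grad=True)
x = torch.randn(n, requires_grad=True)
g = torch.randn(m)                       # upstream gradient dL/dy

y = W @ x
y.backward(g)                            # VJP: pushes g backward through the layer

print(torch.allclose(x.grad, W.detach().T @ g))            # True: grad_x = W^T g
print(torch.allclose(W.grad, torch.outer(g, x.detach())))  # True: grad_W = g x^T
print(torch.linalg.matrix_rank(W.grad).item())             # 1: rank-1 outer product
```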
3.2 Elementwise Activations
Setup. $\sigma$ applied elementwise: $y_i = \sigma(x_i)$ for each $i$.

Jacobian:

$$J_\sigma(x) = \mathrm{diag}\big(\sigma'(x_1), \ldots, \sigma'(x_n)\big)$$

The Jacobian is diagonal - output $i$ depends only on input $i$. Backpropagation through an elementwise activation is elementwise multiplication: $\nabla_x L = \sigma'(x) \odot \nabla_y L$.
Common activations in full detail:
ReLU: $\sigma(x) = \max(0, x)$. $\sigma'(x) = \mathbb{1}[x > 0]$ (1 for positive inputs, 0 for negative, undefined at 0). The Jacobian is a binary diagonal matrix. Every negative-input neuron contributes zero gradient - the "dead neuron" problem. At scale, if a large fraction of neurons receive negative pre-activations, the effective gradient signal becomes sparse.

Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$. $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. Maximum value $1/4$ (at $x = 0$). For deep networks, the product of sigmoid derivatives is at most $(1/4)^L$ - exponentially small in depth $L$ - the vanishing gradient problem. This is why sigmoid was replaced by ReLU in most architectures.

tanh: $\sigma(x) = \tanh(x)$. $\sigma'(x) = 1 - \tanh^2(x)$. Maximum $1$ (at $x = 0$). Better than sigmoid (maximum derivative is 1 not 1/4), but still vanishes for large $|x|$.

GELU: $\sigma(x) = x\,\Phi(x)$ where $\Phi$ is the standard Gaussian CDF. $\sigma'(x) = \Phi(x) + x\,\phi(x)$ where $\phi$ is the Gaussian PDF. The derivative is approximately 1 for large positive $x$, 0 for large negative $x$, and smooth around 0. Default activation in BERT, GPT-2/3/4.

SiLU (Swish): $\sigma(x) = x \cdot \mathrm{sigmoid}(x)$. $\sigma'(x) = \mathrm{sigmoid}(x)\big(1 + x(1 - \mathrm{sigmoid}(x))\big)$. Non-monotone (has a local minimum). Used in LLaMA, Mistral, Gemma.

Gradient flow analysis. For a depth-$L$ network with linear layers $W_\ell$ and elementwise activation Jacobians $D_\ell = \mathrm{diag}(\sigma'(z_\ell))$, the input-to-output Jacobian is:

$$J = W_L D_{L-1} W_{L-1} \cdots D_1 W_1$$

The spectral norm satisfies $\lVert J\rVert_2 \le \prod_\ell \lVert W_\ell\rVert_2 \,\lVert D_\ell\rVert_2$. For sigmoid/tanh, $\lVert D_\ell\rVert_2 \le 1/4$ or $\le 1$; for large $L$, this product vanishes or explodes depending on the spectrum.

BatchNorm and LayerNorm are specifically designed to keep each factor's spectral norm near 1 throughout training, preventing both explosion and vanishing (see the sketch below).
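The vanishing effect is easy to see by pushing a tangent vector through a deep sigmoid chain - a JVP done by hand (a minimal sketch with random Xavier-style weights):

```python
import torch

torch.manual_seed(0)
depth, n = 50, 100
x = torch.randn(n)
v = torch.randn(n)
v = v / v.norm()                       # unit tangent vector

for _ in range(depth):
    W = torch.randn(n, n) / n ** 0.5   # Xavier-style scaling
    z = W @ x
    s = torch.sigmoid(z)
    v = s * (1 - s) * (W @ v)          # layer JVP: diag(sigmoid'(z)) W v
    x = s
print(v.norm())   # tiny (sigmoid' <= 1/4 per layer): vanishing gradient signal
```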
3.3 Softmax Jacobian - Full Proof
Setup. The softmax function $s: \mathbb{R}^n \to \Delta^{n-1}$ (the probability simplex) is defined by:

$$s_i(x) = \frac{e^{x_i}}{\sum_{k=1}^n e^{x_k}}$$

We want to compute all partial derivatives $\partial s_i/\partial x_j$ for all $i, j$.

Full derivation. Let $Z = \sum_k e^{x_k}$ (the partition function). Then $s_i = e^{x_i}/Z$.

Case 1: $i = j$ (diagonal entries). Use the quotient rule:

$$\frac{\partial s_i}{\partial x_i} = \frac{e^{x_i} Z - e^{x_i} e^{x_i}}{Z^2} = s_i - s_i^2 = s_i(1 - s_i)$$

Case 2: $i \neq j$ (off-diagonal entries). Since $\partial e^{x_i}/\partial x_j = 0$ (different index):

$$\frac{\partial s_i}{\partial x_j} = \frac{0 \cdot Z - e^{x_i} e^{x_j}}{Z^2} = -s_i s_j$$

Unifying both cases. We can write:

$$\frac{\partial s_i}{\partial x_j} = s_i(\delta_{ij} - s_j)$$

where $\delta_{ij}$ is the Kronecker delta. In matrix form:

$$J_s(x) = \mathrm{diag}(s) - s s^\top$$

Verification. Entry $(i, j)$ of $\mathrm{diag}(s) - ss^\top$ is:

- Diagonal ($i = j$): $s_i - s_i^2 = s_i(1 - s_i)$
- Off-diagonal ($i \neq j$): $0 - s_i s_j = -s_i s_j$
Properties of the softmax Jacobian:
Symmetry: $[J_s]_{ij} = s_i(\delta_{ij} - s_j)$ and $[J_s]_{ji} = s_j(\delta_{ji} - s_i)$. Both equal $-s_i s_j$ for $i \neq j$ and $s_i(1 - s_i)$ on the diagonal. So $J_s = J_s^\top$. (The softmax Jacobian is symmetric despite softmax being non-linear.)

Singularity: Consider $J_s \mathbf{1}$. Entry $i$ is $\sum_j s_i(\delta_{ij} - s_j) = s_i - s_i \sum_j s_j = 0$. So $J_s \mathbf{1} = 0$ - the all-ones vector is in the null space. This reflects that softmax is shift-invariant: $s(x + c\mathbf{1}) = s(x)$ for any scalar $c$. Perturbations in the constant direction produce no change.

Rank $n - 1$: One zero eigenvalue (direction $\mathbf{1}$), and $n - 1$ positive eigenvalues. $J_s$ is positive semidefinite.

Positive semidefiniteness: For any $v$, $v^\top J_s v = \sum_i s_i v_i^2 - \big(\sum_i s_i v_i\big)^2$. By Jensen's inequality applied to the convex function $t \mapsto t^2$ with weights $s_i$: $\big(\sum_i s_i v_i\big)^2 \le \sum_i s_i v_i^2$. So $v^\top J_s v \ge 0$, with equality iff all $v_i$ are equal (i.e., $v \propto \mathbf{1}$).

Cross-entropy gradient elegance. When softmax feeds into cross-entropy loss $L = -\log s_y$ for true class $y$:

Step 1: $\dfrac{\partial L}{\partial s_i} = -\dfrac{\delta_{iy}}{s_y}$ (gradient of the log w.r.t. the relevant probability).

Step 2: Chain rule gives $\nabla_x L = J_s^\top \nabla_s L$.

Entry $j$: $\displaystyle [\nabla_x L]_j = \sum_i s_i(\delta_{ij} - s_j)\left(-\frac{\delta_{iy}}{s_y}\right) = -\frac{s_y(\delta_{yj} - s_j)}{s_y} = s_j - \delta_{yj}$.

So $\nabla_x L = s - e_y$, i.e., prediction minus one-hot label.

The combined gradient is simply prediction minus one-hot label - an elegant result. This is the gradient that flows backward through every transformer's output layer.
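The closed form $J_s = \mathrm{diag}(s) - ss^\top$, the null direction $\mathbf{1}$, and the $s - e_y$ gradient can all be verified numerically (a minimal PyTorch sketch):

```python
import torch

x = torch.randn(5, requires_grad=True)
s = torch.softmax(x, dim=0)

# Softmax Jacobian: autograd vs the closed form diag(s) - s s^T
J = torch.autograd.functional.jacobian(lambda z: torch.softmax(z, dim=0), x.detach())
sd = s.detach()
print(torch.allclose(J, torch.diag(sd) - torch.outer(sd, sd), atol=1e-6))  # True
print(torch.allclose(J @ torch.ones(5), torch.zeros(5), atol=1e-6))        # True: 1 is in the null space

# Cross-entropy through softmax: gradient is s - e_y
y = 2
(-torch.log(s[y])).backward()
e_y = torch.zeros(5); e_y[y] = 1.0
print(torch.allclose(x.grad, sd - e_y, atol=1e-6))                         # True
```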
3.4 Layer Normalization Jacobian - Full Derivation
Setup. Layer normalisation (Ba et al., 2016) standardises each token embedding independently:

$$\mathrm{LN}(x) = \gamma \odot \hat{x} + \beta, \qquad \hat{x} = \frac{x - \mu\mathbf{1}}{\sigma}, \qquad \mu = \frac{1}{n}\sum_i x_i, \qquad \sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2$$

We derive the Jacobian for the normalisation step $x \mapsto \hat{x}$ (ignoring the learnable scale/shift $\gamma, \beta$).
Step 1: Derivatives of $\mu$ and $\sigma$.

$\dfrac{\partial \mu}{\partial x_j} = \dfrac{1}{n}$ (all components).

For $\sigma$, use the chain rule on $\sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2$:

$$\frac{\partial \sigma^2}{\partial x_j} = \frac{2}{n}\sum_i (x_i - \mu)\left(\delta_{ij} - \frac{1}{n}\right) = \frac{2}{n}(x_j - \mu)$$

(the $\frac{1}{n}$ term drops because $\sum_i (x_i - \mu) = 0$). Therefore:

$$\frac{\partial \sigma}{\partial x_j} = \frac{1}{2\sigma}\cdot\frac{2}{n}(x_j - \mu) = \frac{x_j - \mu}{n\sigma} = \frac{\hat{x}_j}{n}$$

Step 2: Derivative of $\hat{x}_i = (x_i - \mu)/\sigma$.

Using the quotient rule:

$$\frac{\partial \hat{x}_i}{\partial x_j} = \frac{\left(\delta_{ij} - \frac{1}{n}\right)\sigma - (x_i - \mu)\,\frac{\partial \sigma}{\partial x_j}}{\sigma^2} = \frac{1}{\sigma}\left(\delta_{ij} - \frac{1}{n} - \frac{\hat{x}_i \hat{x}_j}{n}\right)$$

In matrix form, this is the standard result (Xu et al., 2019):

$$J_{\hat{x}}(x) = \frac{1}{\sigma}\left(I - \frac{\mathbf{1}\mathbf{1}^\top}{n} - \frac{\hat{x}\hat{x}^\top}{n}\right)$$

Note that $\mathbf{1}^\top \hat{x} = 0$ (the standardised vector has zero mean) and $\lVert\hat{x}\rVert^2 = n$ (unit per-coordinate variance), so $\frac{\hat{x}\hat{x}^\top}{n}$ is the orthogonal projection onto the direction $\hat{x}$, and $\frac{\mathbf{1}\mathbf{1}^\top}{n}$ is the orthogonal projection onto the direction $\mathbf{1}$.
Geometric interpretation. The matrix $P = I - \frac{\mathbf{1}\mathbf{1}^\top}{n} - \frac{\hat{x}\hat{x}^\top}{n}$ is an orthogonal projection that:

- Removes the mean direction ($\mathbf{1}$): $J_{\hat{x}}\,\mathbf{1} = 0$ (since $\frac{\mathbf{1}\mathbf{1}^\top}{n}\mathbf{1} = \mathbf{1}$ and $\hat{x}^\top\mathbf{1} = 0$).

- Removes the normalised input direction ($\hat{x}$): $J_{\hat{x}}\,\hat{x} = 0$ (since $\frac{\hat{x}\hat{x}^\top}{n}\hat{x} = \hat{x}$).

Null space and rank. The null space is $\mathrm{span}\{\mathbf{1}, \hat{x}\}$ (2-dimensional, since $\hat{x} \perp \mathbf{1}$ whenever $x$ is not constant). So $\mathrm{rank}(J_{\hat{x}}) = n - 2$.

Training implication. Gradients cannot flow through the mean or norm directions. This is a desirable property: it prevents trivial changes (scaling all activations or shifting them) from affecting the gradient, making training more stable.
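The closed form and its two-dimensional null space are easy to confirm against autograd (a minimal sketch, using the biased variance as in the derivation and omitting the usual eps):

```python
import torch

def normalise(x):
    mu = x.mean()
    sigma = x.std(unbiased=False)    # biased (1/n) variance, matching the derivation
    return (x - mu) / sigma

n = 6
x = torch.randn(n, dtype=torch.double)
J = torch.autograd.functional.jacobian(normalise, x)

xhat = normalise(x)
sigma = x.std(unbiased=False)
ones = torch.ones(n, dtype=torch.double)
J_closed = (torch.eye(n, dtype=torch.double)
            - torch.outer(ones, ones) / n
            - torch.outer(xhat, xhat) / n) / sigma
print(torch.allclose(J, J_closed, atol=1e-10))                 # True
print(torch.allclose(J @ ones, torch.zeros(n), atol=1e-10))    # mean direction killed
print(torch.allclose(J @ xhat, torch.zeros(n), atol=1e-10))    # input direction killed
```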
3.5 JVPs and VJPs - Why Reverse Mode Dominates
Given $f: \mathbb{R}^n \to \mathbb{R}^m$ with Jacobian $J \in \mathbb{R}^{m \times n}$, we often need to multiply by $J$ or $J^\top$ without materialising the full matrix.

Jacobian-vector product (JVP, forward mode):

$$v \mapsto J_f(x)\, v, \qquad v \in \mathbb{R}^n$$

Answers: "If the input moves in direction $v$, how does the output change?"

Computed by a forward pass with dual numbers: run the computation on input $x + \epsilon v$ where $\epsilon v$ is the "tangent" component. Cost: one forward pass plus tracking tangents. Does not require storing intermediate values.

Vector-Jacobian product (VJP, reverse mode):

$$u \mapsto J_f(x)^\top u, \qquad u \in \mathbb{R}^m$$

Answers: "Given a weighting over outputs, what is each input's sensitivity?"

Computed by a backward pass: first run a forward pass (storing intermediates), then propagate backward through the computation graph. Cost: one forward plus one backward pass, plus memory for stored activations.

Why neural network training uses reverse mode (VJP). The loss is scalar ($m = 1$). We want all $n$ parameter gradients simultaneously. Compare:

- Reverse mode (VJP): One backward pass gives $\nabla_\theta L = J^\top \cdot 1$ - all gradients at once. Total cost: one forward + one backward pass.

- Forward mode (JVP): To get the gradient component along $e_i$, run one JVP with $v = e_i$, giving $\partial L/\partial \theta_i$. Getting all $n$ gradients requires $n$ JVPs. Total cost: $n$ forward passes.

For $n \sim 10^{11}$ parameters (a large LLM), reverse mode is astronomically cheaper. This is why PyTorch, JAX, and TensorFlow all implement reverse-mode AD (backpropagation) as their default.

When forward mode is preferable. If $n \ll m$ (few inputs, many outputs), forward mode is cheaper: $n$ JVPs vs $m$ VJPs. Applications: uncertainty propagation through a small number of uncertain inputs, sensitivity analysis, directional derivative computation.
```
COST COMPARISON: JVP vs VJP

f: R^n -> R^m,   J_f in R^{m x n}

Full Jacobian via JVPs: n passes of JVP(e_j), cost = n * c_fwd
Full Jacobian via VJPs: m passes of VJP(e_i), cost = m * c_bwd

For ML (m = 1, scalar loss):
  Forward mode: n passes  -> cost n * c_fwd   (EXPENSIVE)
  Reverse mode: 1 pass    -> cost c_bwd       (CHEAP)
  Speedup: n / (c_bwd/c_fwd) ~= n/3 ~= 6*10^10 for GPT-3 (n ~= 1.75*10^11)

For normalising flows (need full Jacobian for log|det J|):
  Forward mode: n passes -> expensive for large n
  Special architectures (triangular J): O(n) determinant
```
Full Jacobian recovery. To get $J \in \mathbb{R}^{m \times n}$:

- Apply JVP with $v = e_j$ for $j = 1, \ldots, n$: fills the $n$ columns of $J$. Cost: $n$ forward passes.
- Apply VJP with $u = e_i$ for $i = 1, \ldots, m$: fills the $m$ rows of $J$. Cost: $m$ backward passes.

Choose the cheaper option: if $n < m$, use JVPs; if $m < n$, use VJPs.
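PyTorch exposes both primitives directly; a minimal sketch checking them against the materialised Jacobian:

```python
import torch
from torch.autograd.functional import jacobian, jvp, vjp

def f(x):
    return torch.stack([x.sum(), (x ** 2).sum(), x.prod()])   # R^4 -> R^3

x = torch.randn(4)
v = torch.randn(4)    # input-space direction
u = torch.randn(3)    # output-space weighting

_, Jv = jvp(f, x, v)      # forward mode: J @ v
_, JTu = vjp(f, x, u)     # reverse mode: J^T @ u

J = jacobian(f, x)        # 3 x 4, materialised for comparison only
print(torch.allclose(Jv, J @ v, atol=1e-5))     # True
print(torch.allclose(JTu, J.T @ u, atol=1e-5))  # True
```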
3.6 Jacobian Condition Number and Training Stability
Definition. The condition number of the Jacobian is:

$$\kappa(J) = \frac{\sigma_{\max}(J)}{\sigma_{\min}(J)}$$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values. (For rank-deficient $J$, $\sigma_{\min}$ refers to the smallest nonzero singular value.)

Geometric meaning. The singular value decomposition $J = U\Sigma V^\top$ shows that $J$ rotates (via $V^\top$) then scales (via $\Sigma$) then rotates again (via $U$). The input directions (columns of $V$) are mapped to output directions (scaled columns of $U$). $\kappa$ measures how anisotropic this scaling is:

- $\kappa = 1$: all input directions are scaled equally - perfectly isotropic
- $\kappa \gg 1$: some directions are amplified by $\sigma_{\max}$, others suppressed to near zero by $\sigma_{\min}$ - highly anisotropic

Why $\kappa$ matters for training. In backpropagation, the gradient flows backward as $g \mapsto J^\top g$. The singular values of $J^\top$ are the same as those of $J$. So:

- Gradient components along the top singular direction are amplified by $\sigma_{\max}$
- Gradient components along the bottom singular direction are suppressed by $\sigma_{\min}$

Over $L$ layers, the overall Jacobian has singular values bounded by $\prod_\ell \sigma_{\max}^{(\ell)}$ (upper) and $\prod_\ell \sigma_{\min}^{(\ell)}$ (lower). If each $\sigma_{\min}^{(\ell)} > 1$: exponential gradient explosion. If each $\sigma_{\max}^{(\ell)} < 1$: exponential gradient vanishing.

Initialisation strategies. Xavier (Glorot) initialisation sets $\mathrm{Var}(W_{ij}) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$, which keeps $\mathbb{E}\lVert Wx\rVert^2 \approx \lVert x\rVert^2$ and the typical singular values near 1 for square layers. He initialisation uses $\mathrm{Var}(W_{ij}) = \frac{2}{n_{\text{in}}}$, designed for ReLU which zeroes half the neurons.

Normalisation layers. BatchNorm and LayerNorm act as implicit preconditioning: they rescale the activations so that the effective Jacobian of each normalised layer has singular values near 1. This keeps the product of singular values near 1 regardless of depth.
4. The Hessian Matrix
4.1 Formal Definition
Definition (Hessian). Let $f: \mathbb{R}^n \to \mathbb{R}$ be twice differentiable at $x$. The Hessian matrix of $f$ at $x$ is the $n \times n$ matrix of all second-order partial derivatives:

$$[H_f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x)$$

The $(i, j)$ entry is the derivative of $\partial f/\partial x_i$ with respect to $x_j$.

Three equivalent ways to think about the Hessian:

- Matrix of second partial derivatives. Each entry is "differentiate twice": first with respect to $x_i$, then with respect to $x_j$.

- Jacobian of the gradient. Since $\nabla f: \mathbb{R}^n \to \mathbb{R}^n$ is a vector-valued function, it has a Jacobian. That Jacobian is the Hessian: $H_f = J_{\nabla f}$, because when $f \in C^2$ the order of differentiation does not matter by Clairaut's theorem.

- Curvature operator. For any unit direction $u$, the scalar curvature of $f$ in that direction is $u^\top H_f(x)\,u$. This is the second derivative of $t \mapsto f(x + tu)$ at $t = 0$.

Notation. We write $H_f(x)$, $\nabla^2 f(x)$, $D^2 f(x)$, or $H$ interchangeably.

```
HESSIAN STRUCTURE (n=3 example)

         x_1               x_2               x_3
x_1   d^2f/dx_1^2      d^2f/dx_1 dx_2    d^2f/dx_1 dx_3
x_2   d^2f/dx_2 dx_1   d^2f/dx_2^2      d^2f/dx_2 dx_3
x_3   d^2f/dx_3 dx_1   d^2f/dx_3 dx_2   d^2f/dx_3^2

Diagonal:           pure second derivatives (curvature along each axis)
Off-diagonal [i,j]: how df/dx_i changes as x_j moves
                    (interaction / cross-curvature)

When f in C^2: H_f is symmetric (Clairaut's theorem)
i.e., d^2f/dx_i dx_j = d^2f/dx_j dx_i for all i, j
```
4.2 Clairaut's Theorem - Complete Proof and Failure Case
Theorem (Clairaut-Schwarz, 1740/1873). If $f: \mathbb{R}^n \to \mathbb{R}$ has continuous second-order partial derivatives at $x$ (i.e., $f \in C^2$ near $x$), then:

$$\frac{\partial^2 f}{\partial x_i \partial x_j}(x) = \frac{\partial^2 f}{\partial x_j \partial x_i}(x) \quad \text{for all } i, j$$

Equivalently: $H_f(x)$ is a symmetric matrix whenever $f \in C^2$.

Complete proof (two-variable case). It suffices to prove the result for $n = 2$ with variables $(x, y)$. Define the mixed difference quotient:

$$\Delta(h) = \frac{f(x+h, y+h) - f(x+h, y) - f(x, y+h) + f(x, y)}{h^2}$$

Step 1: Express $\Delta$ via $f_{xy}$. Define $g(t) = f(t, y+h) - f(t, y)$. Then $\Delta(h) = \frac{g(x+h) - g(x)}{h^2}$. By the mean value theorem (MVT) applied to $g$ on $[x, x+h]$:

$$\Delta(h) = \frac{g'(\xi)}{h} = \frac{f_x(\xi, y+h) - f_x(\xi, y)}{h}$$

for some $\xi \in (x, x+h)$. Now apply MVT to $t \mapsto f_x(\xi, t)$ on $[y, y+h]$:

$$\Delta(h) = f_{xy}(\xi, \eta) \quad \text{for some } \eta \in (y, y+h)$$

Step 2: Express $\Delta$ via $f_{yx}$. Now define $\varphi(t) = f(x+h, t) - f(x, t)$. Then $\Delta(h) = \frac{\varphi(y+h) - \varphi(y)}{h^2}$. By MVT applied to $\varphi$ on $[y, y+h]$:

$$\Delta(h) = \frac{\varphi'(\eta')}{h} = \frac{f_y(x+h, \eta') - f_y(x, \eta')}{h}$$

for some $\eta' \in (y, y+h)$. Apply MVT to $t \mapsto f_y(t, \eta')$ on $[x, x+h]$:

$$\Delta(h) = f_{yx}(\xi', \eta') \quad \text{for some } \xi' \in (x, x+h)$$

Step 3: Take the limit. From Steps 1 and 2:

$$f_{xy}(\xi, \eta) = \Delta(h) = f_{yx}(\xi', \eta')$$

As $h \to 0$, both $(\xi, \eta)$ and $(\xi', \eta')$ approach $(x, y)$. By continuity of the second partial derivatives:

$$f_{xy}(x, y) = f_{yx}(x, y)$$

Why continuity is essential. The proof uses continuity only in the last step - to take the limit. Without continuity, the two sides approach the same point but need not converge to the same value.
Concrete counterexample (Peano, 1880). Define:

$$f(x, y) = \begin{cases} \dfrac{xy(x^2 - y^2)}{x^2 + y^2} & (x, y) \neq (0, 0) \\ 0 & (x, y) = (0, 0) \end{cases}$$

Step 1: Compute the first partials at the origin. By the limit definition:

$$f_x(0, 0) = \lim_{h \to 0} \frac{f(h, 0) - f(0, 0)}{h} = 0$$

Similarly $f_y(0, 0) = 0$.

Step 2: Compute $f_x(0, y)$ for $y \neq 0$. By the quotient rule (or directly from the limit definition):

$$f_x(0, y) = \lim_{h \to 0} \frac{f(h, y) - f(0, y)}{h} = \lim_{h \to 0} \frac{y(h^2 - y^2)}{h^2 + y^2} = -y$$

Step 3: Compute the mixed partial at the origin.

$$f_{xy}(0, 0) = \lim_{k \to 0} \frac{f_x(0, k) - f_x(0, 0)}{k} = \lim_{k \to 0} \frac{-k}{k} = -1$$

Step 4: By symmetry of the roles of $x$ and $y$ (with a sign flip from $x^2 - y^2$): $f_y(x, 0) = x$, so

$$f_{yx}(0, 0) = \lim_{h \to 0} \frac{f_y(h, 0) - f_y(0, 0)}{h} = +1$$

So $f_{xy}(0,0) = -1 \neq +1 = f_{yx}(0,0)$ at the origin, even though both first partial derivatives exist everywhere. The second partial derivatives exist at $(0,0)$ but are not continuous there - exactly the hypothesis Clairaut requires.
Practical implication. For any function built from compositions of smooth operations (addition, multiplication, $\exp$, $\log$, $\sin$, $\tanh$, polynomials), all partial derivatives of all orders are continuous. Neural network losses with smooth activations (GELU, sigmoid, tanh, softmax) therefore always have symmetric Hessians. For ReLU networks, the Hessian is symmetric almost everywhere (where it exists) but may not exist on the activation boundaries where a pre-activation equals zero.
4.3 Computing the Hessian
Method 1: Direct differentiation (symbolic). Compute $\nabla f$ first, then differentiate each component again.

Cost: $n^2$ partial derivatives. For $n = 10^9$ parameters: $10^{18}$ entries - the full Hessian matrix is utterly infeasible to compute or store for any modern neural network.

Method 2: Numerical Hessian via finite differences. Use the second-order central difference formula:

$$[H]_{ij} \approx \frac{f(x + he_i + he_j) - f(x + he_i - he_j) - f(x - he_i + he_j) + f(x - he_i - he_j)}{4h^2}$$

Error: $O(h^2)$ truncation error. Round-off error $O(\epsilon_{\text{mach}}/h^2)$. The optimal $h \sim \epsilon_{\text{mach}}^{1/4}$ balances both.

Cost: $O(n^2)$ function evaluations. Still infeasible for large $n$.

Method 3: Via automatic differentiation.

- PyTorch: `torch.autograd.functional.hessian(f, x)` - materialises the full Hessian via $n$ backward passes
- JAX: `jax.hessian(f)(x)` - uses forward-over-reverse AD (one forward pass per input dimension)

Both require $n$ AD passes, each costing $O(n)$ - total $O(n^2)$ time. Infeasible for large networks.

Method 4: Hessian-vector products (HVPs). Do not materialise the Hessian; instead compute $Hv$ for a given $v$ in $O(n)$ time. This is the practical approach for all large-scale applications. See 8.1.
4.4 Hessian as Jacobian of the Gradient
The identity $H_f = J_{\nabla f}$ is more than a definition - it is a computational recipe.

Proof. The gradient $\nabla f$ maps $x$ to the vector $(\partial f/\partial x_1, \ldots, \partial f/\partial x_n)$. The $(i, j)$ entry of its Jacobian is:

$$[J_{\nabla f}]_{ij} = \frac{\partial}{\partial x_j}\left(\frac{\partial f}{\partial x_i}\right) = \frac{\partial^2 f}{\partial x_j \partial x_i} = [H_f]_{ij}$$

Directional second derivative. Applying this to the directional derivative of $f$ in direction $v$:

$$H_f(x)\, v = \nabla_x\big(\nabla f(x)^\top v\big)$$

This identity is the key to efficient HVP computation (8.1): to compute $Hv$, differentiate the scalar function $x \mapsto \nabla f(x)^\top v$ with respect to $x$. One backward pass computes $\nabla f(x)$; a second pass through the dot product gives $Hv$.

Curvature Rayleigh quotient. For a unit vector $u$:

$$u^\top H_f(x)\, u = \frac{d^2}{dt^2} f(x + tu)\Big|_{t=0}$$

This is the second derivative of $f$ along the line through $x$ in direction $u$:

- $u^\top H u > 0$: $f$ is concave up (curves upward) in direction $u$
- $u^\top H u < 0$: $f$ is concave down (curves downward) in direction $u$
- $u^\top H u = 0$: $f$ is locally flat in direction $u$

The maximum curvature is $\lambda_{\max}(H)$, achieved in the direction of the top eigenvector $v_1$. The minimum curvature is $\lambda_{\min}(H)$, in the direction of $v_n$.
4.5 Worked Examples with Full Derivations
Example 1: Quadratic form. $f(x) = \frac{1}{2}x^\top A x + b^\top x + c$ where $A$ is symmetric.

Gradient: $\nabla f(x) = Ax + b$.

Hessian: $H_f = A$ (the Jacobian of the linear function $x \mapsto Ax + b$ is the coefficient matrix).

The Hessian is constant - it does not depend on $x$. A quadratic function has the same curvature everywhere.

Verification: $[H_f]_{ij} = \frac{\partial}{\partial x_j}(Ax + b)_i = A_{ij}$ (since $A$ is symmetric).

Example 2: Mean squared error loss. $L(w) = \frac{1}{2}\lVert Xw - y\rVert^2$ with $X \in \mathbb{R}^{N \times d}$.

Gradient: $\nabla L(w) = X^\top(Xw - y)$.

Hessian: $H_L = X^\top X$.

$X^\top X$ is positive semidefinite (PSD) since $v^\top X^\top X v = \lVert Xv\rVert^2 \ge 0$ for all $v$. Therefore MSE is a globally convex function - any local minimum is the global minimum. The eigenvalues of $H_L$ are $\sigma_i^2$ where $\sigma_i$ are the singular values of $X$.

Example 3: Log-sum-exp (softmax partition function). $f(x) = \log\sum_k e^{x_k}$.

Gradient: $\nabla f(x) = s(x)$ (the softmax probabilities).

Hessian: $H_f = J_s$ - the Jacobian of the softmax function. From 3.3:

$$H_f = \mathrm{diag}(s) - s s^\top$$

This is PSD (shown in 3.3), so log-sum-exp is convex. Geometrically: the level sets of $f$ are smooth convex surfaces in $\mathbb{R}^n$.
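Since the log-sum-exp Hessian is exactly the softmax Jacobian, both the closed form and convexity can be checked with autograd in a few lines (a minimal sketch):

```python
import torch

f = lambda x: torch.logsumexp(x, dim=0)
x = torch.randn(5, dtype=torch.double)

H = torch.autograd.functional.hessian(f, x)
s = torch.softmax(x, dim=0)
print(torch.allclose(H, torch.diag(s) - torch.outer(s, s), atol=1e-10))  # True

print(torch.linalg.eigvalsh(H).min().item() >= -1e-12)  # True: PSD, so log-sum-exp is convex
```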
Example 4: Cross-entropy loss for logistic regression. $L(w) = -\big[y\log p + (1 - y)\log(1 - p)\big]$ with $p = \sigma(w^\top x)$, for a single training example $(x, y)$.

Gradient: $\nabla L(w) = (p - y)\,x$.

Hessian: Using the chain rule:

$$H_L = p(1 - p)\, x x^\top$$

This is a rank-1 matrix (outer product of $x$ with itself, scaled by $p(1-p) > 0$). Its single nonzero eigenvalue is $p(1-p)\lVert x\rVert^2$ and the corresponding eigenvector is $x$. All other eigenvectors are in $x^\perp$ with eigenvalue 0 - the loss is flat in all directions perpendicular to $x$.

Over a batch of $N$ examples: $H_L = \sum_{b=1}^N p_b(1 - p_b)\, x_b x_b^\top$ - a sum of rank-1 PSD matrices, so PSD with rank at most $N$.

Non-example: Hessian of a ReLU network. A single ReLU unit $f(x) = \max(0, w^\top x)$:

- For $w^\top x > 0$: $\nabla f = w$, $H_f = 0$
- For $w^\top x < 0$: $\nabla f = 0$, $H_f = 0$
- At $w^\top x = 0$: $\nabla f$ is discontinuous, $H_f$ does not exist classically

A deep ReLU network is piecewise linear - it is linear on each region where all activation patterns are fixed. Therefore its Hessian is zero almost everywhere. Second-order curvature only appears at the activation boundaries (measure zero).

This is why Gauss-Newton and K-FAC (which avoid computing the Hessian through nonlinearities) are preferred over exact Newton for ReLU networks. The exact Hessian is nearly zero and gives no useful curvature information.
5. Second-Order Taylor Expansion
The Taylor expansion is the bridge between the Hessian's definition and its computational use. It explains why the Hessian governs convergence rates, why eigenvalues bound step sizes, and why second-order optimisers are faster than first-order ones.
5.1 The Multivariate Taylor Theorem
Theorem (Taylor's theorem in $\mathbb{R}^n$, exact form). Let $f$ be $C^3$ in a neighbourhood of $x$. Then for any $\delta$:

$$f(x + \delta) = f(x) + \nabla f(x)^\top \delta + \frac{1}{2}\delta^\top H_f(x)\,\delta + R_3(\delta)$$

where the remainder satisfies $|R_3(\delta)| \le C\lVert\delta\rVert^3$ for some constant $C$.

Proof. Define $\phi: [0, 1] \to \mathbb{R}$ by $\phi(t) = f(x + t\delta)$. This reduces the multivariate theorem to the single-variable one. Compute derivatives by the chain rule:

$$\phi'(t) = \nabla f(x + t\delta)^\top \delta, \qquad \phi''(t) = \delta^\top H_f(x + t\delta)\,\delta$$

Apply the single-variable Taylor theorem with integral remainder:

$$\phi(1) = \phi(0) + \phi'(0) + \frac{1}{2}\phi''(0) + \int_0^1 \frac{(1 - t)^2}{2}\,\phi'''(t)\, dt$$

Substituting: $\phi(1) = f(x + \delta)$, $\phi(0) = f(x)$, $\phi'(0) = \nabla f(x)^\top\delta$, and $\phi''(0) = \delta^\top H_f(x)\delta$. The integral remainder is $R_3(\delta)$; bounding $|\phi'''(t)| \le M\lVert\delta\rVert^3$ for some $M$ gives $|R_3| \le \frac{M}{6}\lVert\delta\rVert^3$.

The quadratic approximation. Dropping $R_3$ yields the second-order approximation:

$$f(x + \delta) \approx f(x) + \nabla f(x)^\top\delta + \frac{1}{2}\delta^\top H_f(x)\,\delta$$

This is a quadratic function in $\delta$ - an elliptic paraboloid (if $H \succ 0$), a hyperboloid (if $H$ is indefinite), or a cylinder/flat direction (if $H$ is singular). The entire geometry of the function near $x$ is encoded in this approximation.

5.2 Geometric Picture: Level Sets and Curvature

To see the geometry, write the quadratic approximation as a function of $\delta$:

$$q(\delta) = f(x) + g^\top\delta + \frac{1}{2}\delta^\top H\delta$$

where $g = \nabla f(x)$ and $H = H_f(x)$. Complete the square: if $H \succ 0$,

$$q(\delta) = f(x) - \frac{1}{2}g^\top H^{-1}g + \frac{1}{2}\big(\delta + H^{-1}g\big)^\top H\big(\delta + H^{-1}g\big)$$

The minimum is at $\delta^* = -H^{-1}g$ - this is the Newton step. The level sets are ellipses (in 2D) or ellipsoids (in $n$D) aligned with the eigenvectors of $H$.
```
QUADRATIC LANDSCAPE GEOMETRY

Isotropic Hessian (H = lambda*I):     Anisotropic Hessian:
Circular level sets                   Elliptical level sets

      *  *  *                          * * * * * * * *
    *   (o)   *                       *     (o)       *
    *    *    *                        * * * * * * * *
      *  *  *

GD step = Newton step                 GD step != Newton step
(steepest descent is direct)          (GD "zigzags"; Newton goes direct)

Eigenvalues: lambda_1 = lambda_2      Eigenvalues: lambda_1 >> lambda_2
Condition number: kappa = 1           Condition number: kappa = lambda_1/lambda_2 >> 1
```
Principal curvatures. The Hessian at a critical point (where $\nabla f(x^*) = 0$) has eigendecomposition $H = Q\Lambda Q^\top$ where $Q = (q_1, \ldots, q_n)$ is orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. In the basis $\{q_i\}$, the function is a sum of independent quadratics:

$$f(x^* + \delta) \approx f(x^*) + \frac{1}{2}\sum_{i=1}^n \lambda_i c_i^2, \qquad c_i = q_i^\top\delta$$

Each $\lambda_i$ is the curvature along direction $q_i$:

- $\lambda_i > 0$: $f$ increases parabolically in direction $q_i$ -> local min in that direction
- $\lambda_i < 0$: $f$ decreases parabolically -> saddle direction
- $\lambda_i = 0$: $f$ is flat -> cannot determine from second-order information

Second derivative test. At a critical point $x^*$:

- $H \succ 0$ (all $\lambda_i > 0$): strict local minimum
- $H \prec 0$ (all $\lambda_i < 0$): strict local maximum
- $H$ indefinite (mixed signs): saddle point
- $H \succeq 0$ but singular: inconclusive (need higher-order terms)
5.3 Hessian Eigendecomposition and Convergence Rates
The eigenvalues of the Hessian at a minimum directly control how fast gradient descent converges nearby.
Local convergence analysis. Near $x^*$, linearise the gradient iteration. The gradient update $x_{t+1} = x_t - \eta\nabla f(x_t)$ gives, by Taylor:

$$\nabla f(x_t) \approx H(x_t - x^*), \qquad H = H_f(x^*)$$

Let $e_t = x_t - x^*$. Then:

$$e_{t+1} = e_t - \eta\nabla f(x_t) \approx (I - \eta H)\,e_t$$

Ignoring the higher-order term: $e_{t+1} = (I - \eta H)e_t$. In the eigenbasis:

$$c_{i,t+1} = (1 - \eta\lambda_i)\,c_{i,t}$$

After $t$ steps: $c_{i,t} = (1 - \eta\lambda_i)^t\, c_{i,0}$.

Convergence requires $|1 - \eta\lambda_i| < 1$ for all $i$, i.e., $\eta < 2/\lambda_{\max}$. The optimal step size (minimising the worst-case contraction ratio) is:

$$\eta^* = \frac{2}{\lambda_{\max} + \lambda_{\min}}$$

giving contraction ratio $\frac{\kappa - 1}{\kappa + 1}$ where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number. The number of steps to reduce the error by a factor $\epsilon$ scales as $O(\kappa\log(1/\epsilon))$.

The condition number catastrophe. For a poorly conditioned Hessian ($\kappa \gg 1$), GD makes tiny progress per step. A concrete example: for $f(x) = \frac{1}{2}(100\,x_1^2 + x_2^2)$, the Hessian is $\mathrm{diag}(100, 1)$, $\kappa = 100$. With $\eta = 1/100$ (the scale forced by the sharp direction):

- Along $x_1$ (curvature 100): contraction factor $1 - \eta \cdot 100 = 0$ - converges immediately
- Along $x_2$ (curvature 1): contraction factor $1 - \eta = 0.99$ - very slow; overall convergence is limited by this flat direction

Newton's method avoids this by multiplying by $H^{-1}$, giving $e_{t+1} = O(\lVert e_t\rVert^2)$ - quadratic convergence, independent of conditioning.
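A minimal numerical illustration of the catastrophe (using the quadratic from the example above; for a quadratic, one Newton step is exact):

```python
import torch

H = torch.diag(torch.tensor([100.0, 1.0]))     # kappa = 100
grad = lambda x: H @ x                         # gradient of f(x) = 0.5 x^T H x

x_gd = torch.tensor([1.0, 1.0])
eta = 1.0 / 100                                # forced by the sharp direction
for _ in range(100):
    x_gd = x_gd - eta * grad(x_gd)
print(x_gd)   # ~[0.0000, 0.3660]: the flat direction decays only as 0.99^t

x_nt = torch.tensor([1.0, 1.0])
x_nt = x_nt - torch.linalg.solve(H, grad(x_nt))   # one Newton step
print(x_nt)   # [0., 0.]: the quadratic model is exact here
```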
5.4 Gradient Descent vs Newton's Method
Gradient descent update:

$$x_{t+1} = x_t - \eta\,\nabla f(x_t)$$

Move in the steepest descent direction, step size set by $\eta$.

Newton's update:

$$x_{t+1} = x_t - H_f(x_t)^{-1}\,\nabla f(x_t)$$

Move to the minimum of the quadratic approximation. Derivation: minimise $q(\delta) = f + g^\top\delta + \frac{1}{2}\delta^\top H\delta$ over $\delta$:

$$\nabla_\delta q = g + H\delta = 0 \quad\Longrightarrow\quad \delta^* = -H^{-1}g$$

| Property | Gradient Descent | Newton's Method |
|---|---|---|
| Convergence rate (near $x^*$) | Linear: $\lVert e_{t+1}\rVert \le \rho\lVert e_t\rVert$ | Quadratic: $\lVert e_{t+1}\rVert \le C\lVert e_t\rVert^2$ |
| Dependence on $\kappa$ | $O(\kappa\log(1/\epsilon))$ steps | $O(\log\log(1/\epsilon))$ steps |
| Cost per step | $O(n)$ (gradient) | $O(n^3)$ (Hessian inversion) |
| Memory | $O(n)$ | $O(n^2)$ (Hessian storage) |
| Works when $H \not\succ 0$? | Yes (gradient always valid) | No ($H^{-1}$ may not exist or give an ascent direction) |
| Suitable for deep learning? | Yes (SGD, Adam) | No ($n \sim 10^9$+) |

For AI: Deep networks have $10^9$ - $10^{12}$ parameters. Exact Newton is completely infeasible. This is why the entire field of second-order optimisation for deep learning focuses on approximating $H^{-1}\nabla L$ cheaply:

- Diagonal approximations: Adam uses $\sqrt{\hat{v}_t}$ (RMS of past gradients) as a diagonal curvature estimate
- Kronecker factorisations: K-FAC approximates each layer's Fisher as a Kronecker product
- Low-rank approximations: KFAC-Reduce, EKFAC
- Hessian-vector products: Lanczos algorithm to find the top eigenvectors
6. Hessian Definiteness and Flat vs Sharp Minima
The definiteness of the Hessian at a minimum has consequences far beyond the second derivative test. It determines generalisation, robustness to perturbations, and the choice of optimiser. This section examines each case in depth.
6.1 The Positive Definite Case: Strongly Convex Minima
If $H_f(x^*) \succ 0$, the minimum is strongly convex locally. All curvatures are positive; the function rises quadratically in every direction. Key properties:

- Unique minimum: in the neighbourhood, $x^*$ is the only critical point
- Gradient grows with distance: $\lVert\nabla f(x)\rVert \gtrsim \lambda_{\min}\lVert x - x^*\rVert$ for $x$ near $x^*$
- Gradient descent converges linearly (as derived in 5.3)
- Robust to small perturbations: a perturbation of size $\epsilon$ changes $f$ by at most $\approx \frac{1}{2}\lambda_{\max}\epsilon^2$

Algebraic criterion (Sylvester's criterion). $H \succ 0$ if and only if all leading principal minors are positive:

$$\det H_{1:k,\,1:k} > 0 \quad \text{for } k = 1, \ldots, n$$

In 2D: $H_{11} > 0$ and $\det H > 0$.

Eigenvalue criterion. $H \succ 0 \iff \lambda_{\min}(H) > 0$. The smallest eigenvalue measures how "tight" the well is in the flattest direction.
6.2 The Positive Semidefinite Case: Degenerate Minima
If $H \succeq 0$ with $\lambda_{\min} = 0$, the quadratic approximation is flat in the null space $\ker(H)$. The second-order test is inconclusive.
What can happen along flat directions:
- The function might be constant in that direction (a continuum of minima - "flat valley")
- The function might dip lower (a deeper minimum exists nearby)
- The function might rise (a saddle, with flat approach)
Example: Over-parameterised networks. Consider fitting $N$ data points with a model having $n \gg N$ parameters. The minimum loss is achieved on a manifold of parameters of dimension at least $n - N$ (not a single point). Along this manifold, the loss is constant, so $\nabla L = 0$ and $H$ has at least $n - N$ zero eigenvalues. The loss landscape near a minimum is extremely flat in the "model space" directions.

Example: Batch normalisation symmetry. If layer $\ell$ has BatchNorm, then scaling the preceding weights $W_\ell \mapsto \alpha W_\ell$ (for any $\alpha > 0$) leaves the network function unchanged. This scale symmetry creates a zero-eigenvalue direction in the Hessian of any loss function.
6.3 The Indefinite Case: Saddle Points
If $H$ has both positive and negative eigenvalues, $x^*$ is a saddle point. The function rises in some directions and falls in others.

Strict saddles. A saddle is strict if $\lambda_{\min}(H) < 0$ (there is a direction of strictly negative curvature). Gradient descent perturbed by noise escapes strict saddles - the noisy gradient has a component in the negative-curvature direction, pulling the iterate away. This is the basis for the claim that SGD finds local minima rather than saddles in practice.

Prevalence in deep learning. Dauphin et al. (2014) showed empirically that the loss landscape of deep networks is dominated by saddle points rather than local minima. The ratio of saddle points to local minima grows exponentially with dimension. Goodfellow et al. confirmed that most "local minima" found by SGD are actually saddle points or nearly flat regions with very small negative curvature.

Index of a saddle. The Morse index is the number of negative eigenvalues. A local minimum has index 0; a typical saddle in $\mathbb{R}^n$ has index $k$ for some $1 \le k \le n - 1$.
6.4 Flat vs Sharp Minima and Generalisation
One of the most important practical consequences of Hessian analysis is the flat minima hypothesis for neural network generalisation.
Sharp minimum: $\lambda_{\max}(H)$ is large. The loss landscape around $w^*$ is narrow and steep. A small perturbation of the weights significantly increases the training loss. Empirically associated with poor generalisation - the test distribution's slight differences from training push the model off the sharp peak.

Flat minimum: $\lambda_{\max}(H)$ is small. The loss landscape is broad and shallow. A perturbation of the weights leaves training loss roughly unchanged. Empirically associated with better generalisation - the model is robust to distribution shift.

PAC-Bayes bound. Hochreiter & Schmidhuber (1997) and Keskar et al. (2017) formalised this via a generalisation bound based on the minimum description length of a model. Flatter minima require fewer bits to specify (the volume of the basin in parameter space is larger), so PAC-Bayes bounds on the generalisation error are tighter.

The SAM algorithm (Sharpness-Aware Minimisation, Foret et al. 2021). SAM explicitly searches for flat minima:

$$\min_w \max_{\lVert\epsilon\rVert \le \rho} L(w + \epsilon)$$

The inner maximisation finds the worst-case perturbation within radius $\rho$. By Taylor:

$$\max_{\lVert\epsilon\rVert \le \rho} L(w + \epsilon) \approx L(w) + \rho\lVert\nabla L(w)\rVert$$

The SAM gradient is $\nabla L(w + \epsilon^*)$ where $\epsilon^* = \rho\,\frac{\nabla L(w)}{\lVert\nabla L(w)\rVert}$ (ascent step to the worst-case point). This requires two gradient computations per step (one for the ascent, one for the final gradient), doubling cost.
SAM in practice. SAM and its variants (ASAM, mSAM, SAM-ON) consistently improve generalisation by 1-3% on ImageNet and CIFAR benchmarks. The improvement is attributed to SAM finding flatter minima, though the exact mechanism is debated (some argue SAM also acts as a regulariser).
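A minimal sketch of one SAM update (the helper name `sam_step` and the hyperparameter values here are illustrative, not the reference implementation; assumes gradients start cleared):

```python
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    """One SAM update: ascend to the first-order worst-case point within
    radius rho, take the gradient there, then step the original weights."""
    loss_fn(model(x), y).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12

    with torch.no_grad():                    # ascent: w -> w + rho * g / ||g||
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / norm)

    model.zero_grad()
    loss_fn(model(x), y).backward()          # SAM gradient at the perturbed point

    with torch.no_grad():                    # undo the perturbation
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / norm)

    base_opt.step()                          # step with the SAM gradient
    base_opt.zero_grad()
```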
Sharpness measures. Several quantities measure flatness:

- Trace of Hessian: $\mathrm{tr}(H) = \sum_i \lambda_i$ - total curvature (sum of all curvatures)
- Frobenius norm: $\lVert H\rVert_F = \sqrt{\sum_i \lambda_i^2}$ - total squared curvature
- Largest eigenvalue: $\lambda_{\max}(H)$ - worst-case curvature direction
- Volume of basin: $\propto (\det H)^{-1/2}$ - the Laplace approximation normaliser
6.5 Hessian Spectrum in Practice
Ghorbani et al. (2019) performed careful empirical Hessian eigenspectrum analysis of large neural networks, finding a consistent structure:
```
HESSIAN SPECTRUM OF DEEP NETWORKS

Eigenvalue density (schematic): a dense "bulk" near 0,
plus a few isolated outliers out to lambda_max.

- Bulk: dense cluster of small eigenvalues near 0
  (most directions have very small curvature)
- Outliers: a few large eigenvalues (often O(10)-O(1000))
  (a handful of directions have sharp curvature)
- The ratio lambda_max / lambda_bulk >> 1 (condition number very large)
- After convergence: more eigenvalues cluster near 0
  (the model has found flat directions = the "flat manifold" of solutions)
```
Implications:
- The learning rate must satisfy $\eta < 2/\lambda_{\max}$ for stability - and $\lambda_{\max}$ varies enormously across architectures and training stages
- Learning rate warmup is necessary because $\lambda_{\max}$ early in training is large and unstable
- Gradient clipping effectively caps the step size in the sharp directions
- Adam/AdaGrad normalise by the gradient magnitude, implicitly compensating for large curvature in some directions
7. Newton's Method and Second-Order Optimisation
7.1 Newton's Method: Full Analysis
We derived the Newton step in 5.4. Here we analyse it thoroughly.
Local quadratic convergence. Let $x^*$ be a local minimum with $H(x^*) \succ 0$. For $x_t$ sufficiently close to $x^*$:

$$\lVert x_{t+1} - x^*\rVert \le C\,\lVert x_t - x^*\rVert^2$$

where $C = \frac{M}{2\lambda_{\min}(H(x^*))}$ ($M$ = Lipschitz constant of the Hessian). This is quadratic convergence: each iteration squares the error. From $10^{-1}$: $10^{-2}$, $10^{-4}$, $10^{-8}$, $10^{-16}$. Only a handful of steps to reach machine precision!

Proof sketch. By Taylor at $x_t$: $\nabla f(x^*) = \nabla f(x_t) + H(x_t)(x^* - x_t) + O(\lVert x^* - x_t\rVert^2)$. The Newton step gives:

$$x_{t+1} - x^* = x_t - x^* - H(x_t)^{-1}\nabla f(x_t)$$

Substituting and using $\nabla f(x^*) = 0$, the linear error term cancels exactly, leaving only $O(\lVert x_t - x^*\rVert^2)$.

Problems with vanilla Newton:

- $H$ not PD: If $H$ is indefinite, $-H^{-1}\nabla f$ may be an ascent direction. At saddle points Newton steps towards the saddle, not away.
- Step too large: Even near a minimum, the step may overshoot if the quadratic approximation is inaccurate far from the current point.
- Hessian cost: Computing $H$ costs $O(n^2)$ memory and $n$ gradient computations. Inverting costs $O(n^3)$.
7.2 Damped Newton and Quasi-Newton Methods
Damped Newton's method:

$$x_{t+1} = x_t - \alpha_t\, H^{-1}\nabla f(x_t)$$

where $\alpha_t$ is chosen by line search to ensure sufficient decrease (Armijo condition):

$$f(x_t + \alpha d) \le f(x_t) + c_1\,\alpha\,\nabla f(x_t)^\top d$$

for $c_1 \in (0, 1)$ (typically $10^{-4}$). This guarantees global descent while preserving superlinear convergence near $x^*$.

Modified Cholesky. To handle non-PD Hessians, add a diagonal: $H + \tau I$ with $\tau$ chosen so $H + \tau I \succ 0$. This is the Levenberg-Marquardt modification. The step interpolates between Newton ($\tau \to 0$) and gradient descent ($\tau \to \infty$, step $\approx \frac{1}{\tau}\nabla f$).

L-BFGS (Limited-memory BFGS). The most widely used quasi-Newton method in practice (used for GPT-2 fine-tuning in some settings, L-BFGS-B for convex ML). Rather than computing $H^{-1}$ exactly, L-BFGS maintains an implicit approximation $B_t \approx H^{-1}$ from the last $m$ gradient differences ($s_t = x_{t+1} - x_t$, $y_t = \nabla f_{t+1} - \nabla f_t$). The BFGS update enforces $B_{t+1} y_t = s_t$ (secant condition). L-BFGS applies the implicit product $B_t \nabla f$ via a two-loop recursion in $O(mn)$ time, avoiding the Hessian matrix.
7.3 The Gauss-Newton Method
For least-squares problems $f(x) = \frac{1}{2}\lVert r(x)\rVert^2$ where $r: \mathbb{R}^n \to \mathbb{R}^N$ is the residual vector, the exact Hessian is:

$$H = J_r^\top J_r + \sum_{i=1}^N r_i\,\nabla^2 r_i$$

The second term involves Hessians of each residual, which are expensive and may be indefinite. The Gauss-Newton approximation drops this term:

$$H_{GN} = J_r^\top J_r$$

This is always PSD (so the step is always a descent direction) and costs only the Jacobian $J_r$ - compute it once, then form the product.

The Gauss-Newton step: $\delta = -(J_r^\top J_r)^{-1} J_r^\top r$.

For neural networks: The Fisher Information Matrix (FIM) is the natural Gauss-Newton matrix:

$$F = \mathbb{E}\big[\nabla_\theta \log p_\theta\, \nabla_\theta \log p_\theta^\top\big]$$

where the gradients are rows of the Jacobian of the log-probabilities. This equals the Gauss-Newton matrix for log-likelihood maximisation, and is always PSD. Natural gradient descent uses $F^{-1}\nabla L$ as the update direction.
7.4 Kronecker-Factored Approximate Curvature (K-FAC)
K-FAC (Martens & Grosse 2015) is the most successful practical second-order method for deep learning. It exploits the structure of neural network layers to cheaply approximate the FIM.
The key insight. For a linear layer with input activations $a$ and pre-activations $z = Wa$, the gradient of the loss with respect to $W$ is:

$$\nabla_W L = g\, a^\top$$

where $g = \nabla_z L$ (backpropagated gradient). Therefore:

$$\mathrm{vec}(\nabla_W L) = a \otimes g$$

The exact FIM block for layer $\ell$ is:

$$F_\ell = \mathbb{E}\big[(a a^\top) \otimes (g g^\top)\big]$$

K-FAC approximates this by assuming independence of $a$ and $g$:

$$F_\ell \approx A \otimes G$$

where $A = \mathbb{E}[a a^\top]$ and $G = \mathbb{E}[g g^\top]$.

The Kronecker structure enables cheap inversion:

$$(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}$$

So inverting the large $mn \times mn$ matrix costs only two small inversions. The natural gradient step is:

$$\mathrm{vec}^{-1}\big((A^{-1} \otimes G^{-1})\,\mathrm{vec}(\nabla_W L)\big) = G^{-1}\,(\nabla_W L)\,A^{-1}$$

This can be applied in $O(m^3 + n^3)$ per layer (far less than $O((mn)^3)$ for the exact FIM). K-FAC achieves 10-30x speedup over SGD in terms of wall-clock time to convergence on large models, while maintaining good generalisation.
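A toy illustration of the Kronecker-factored update for one layer (a sketch with random statistics; the damping value is arbitrary):

```python
import torch

B, n, m = 256, 20, 10                    # batch, layer input dim, output dim
a = torch.randn(B, n)                    # layer inputs
g = torch.randn(B, m)                    # backpropagated pre-activation gradients
grad_W = g.T @ a / B                     # average gradient, shape m x n

A = a.T @ a / B + 1e-3 * torch.eye(n)    # input second moment (damped)
G = g.T @ g / B + 1e-3 * torch.eye(m)    # gradient second moment (damped)

# (A kron G)^-1 vec(grad_W) = vec(G^-1 grad_W A^-1): two small solves,
# never forming the mn x mn Fisher block
step = torch.linalg.solve(G, grad_W) @ torch.linalg.inv(A)
print(step.shape)                        # torch.Size([10, 20]), same as grad_W
```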
8. Hessian-Vector Products and Spectral Methods
Computing the full Hessian is infeasible for large networks. But many algorithms only need Hessian-vector products (HVPs) - and these can be computed in $O(n)$ time.
8.1 The HVP Identity
Theorem. For twice-differentiable $f: \mathbb{R}^n \to \mathbb{R}$ and any fixed $v \in \mathbb{R}^n$:

$$H_f(x)\, v = \nabla_x\big(\nabla f(x)^\top v\big)$$

Proof. The $i$-entry of $Hv$ is $\sum_j H_{ij} v_j = \sum_j \frac{\partial^2 f}{\partial x_i \partial x_j} v_j$. The dot product $\nabla f(x)^\top v = \sum_j \frac{\partial f}{\partial x_j} v_j$ is a scalar function of $x$. Taking its gradient:

$$\frac{\partial}{\partial x_i}\sum_j \frac{\partial f}{\partial x_j} v_j = \sum_j \frac{\partial^2 f}{\partial x_i \partial x_j} v_j = [Hv]_i$$

So $\nabla_x(\nabla f(x)^\top v) = H_f(x)\,v$.
Implementation. In any automatic differentiation framework (PyTorch/JAX):
```python
import torch

def hvp(f, x, v):
    """Hessian-vector product H(x) v via double backprop - two O(n) passes,
    never materialising the n x n Hessian."""
    loss = f(x)
    # First backward pass: gradient, keeping the graph for a second differentiation
    grad = torch.autograd.grad(loss, x, create_graph=True)[0]   # grad f(x), O(n)
    # Second backward pass: differentiate the scalar grad . v
    return torch.autograd.grad(grad @ v, x)[0]                  # H(x) v, O(n)
```

The first `grad` call is standard backprop. The second differentiates the scalar $\nabla f(x)^\top v$ - also a standard backprop, costing $O(n)$ just like the first. Total: 2 passes, $O(n)$ each, giving $Hv$ without ever forming the $n \times n$ matrix.
8.2 Power Iteration and Dominant Eigenvalue
Using HVPs, we can find the largest eigenvalue $\lambda_{\max}$ via power iteration:

- Initialise random $v_0$, normalise: $v_0 \leftarrow v_0/\lVert v_0\rVert$
- Repeat: $w \leftarrow Hv_t$ (one HVP), $v_{t+1} \leftarrow w/\lVert w\rVert$
- Estimate: $\lambda_{\max} \approx v^\top H v$ (Rayleigh quotient, one more HVP)

Each iteration costs 1-2 HVPs = $O(n)$. Convergence rate: $O\big((\lambda_2/\lambda_1)^t\big)$ (geometric in the gap ratio). For well-separated spectra, this converges in roughly 20-50 iterations. A sketch using the `hvp` helper above follows.
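```python
import torch

def hvp(f, x, v):
    grad = torch.autograd.grad(f(x), x, create_graph=True)[0]
    return torch.autograd.grad(grad @ v, x)[0]

def lambda_max(f, x, iters=50):
    """Estimate the top Hessian eigenvalue by power iteration on HVPs."""
    v = torch.randn_like(x)
    v = v / v.norm()
    for _ in range(iters):
        w = hvp(f, x, v)              # one HVP per iteration, O(n)
        v = w / w.norm()
    return (v @ hvp(f, x, v)).item()  # Rayleigh quotient

# Check on a quadratic with known spectrum {3, 1, 0.1}
A = torch.diag(torch.tensor([3.0, 1.0, 0.1]))
f = lambda z: 0.5 * z @ (A @ z)
x = torch.zeros(3, requires_grad=True)
print(lambda_max(f, x))   # ~3.0, the top eigenvalue of A
```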
For AI: $\lambda_{\max}$ determines the maximum stable learning rate: $\eta < 2/\lambda_{\max}$. Computing this via 20 power iterations costs 20 HVPs = 40 backward passes - affordable even for billion-parameter models. The progressive sharpening and catapult phenomena in transformer training are explained by $\lambda_{\max}$ dynamics: if the learning rate exceeds $2/\lambda_{\max}$, the loss spikes; after a spike, a sharp minimum may be replaced by a flatter one (the "edge of stability" phenomenon, Cohen et al. 2022).
8.3 Lanczos Algorithm and Full Spectrum
The Lanczos algorithm extracts the full eigenspectrum using a sequence of HVPs. Starting from a random unit vector $v_1$:

- Build an orthonormal basis of the Krylov subspace $\mathrm{span}\{v_1, Hv_1, H^2 v_1, \ldots\}$ by repeated HVPs:
  - $w = Hv_j$; $\alpha_j = v_j^\top w$; $w \leftarrow w - \alpha_j v_j - \beta_{j-1} v_{j-1}$; $\beta_j = \lVert w\rVert$; $v_{j+1} = w/\beta_j$
- After $k$ steps, $H$ is approximated in the Krylov basis as a tridiagonal matrix $T_k$:

$$T_k = \begin{pmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_{k-1} \\ & & \beta_{k-1} & \alpha_k \end{pmatrix}$$

- Eigenvalues of $T_k$ (Ritz values) approximate eigenvalues of $H$. The extreme eigenvalues converge fastest (in $O(\sqrt{\kappa})$ iterations rather than $O(\kappa)$ for power iteration).

Density estimation. Using the stochastic Lanczos quadrature method (Ghorbani et al. 2019), one can estimate the full spectral density $\rho(\lambda)$ using a handful of random starting vectors and on the order of 100 Lanczos steps each. This gives a smooth density plot revealing the bulk-outlier structure discussed in 6.5.
8.4 Applications of HVPs in Deep Learning
1. Computing $\lambda_{\max}$ for learning rate bounds. Cohen et al. (2022) showed that SGD with fixed learning rate $\eta$ converges at the "edge of stability" where $\lambda_{\max} \approx 2/\eta$. Measuring $\lambda_{\max}$ via power iteration explains why training stabilises at a specific sharpness level.

2. Influence functions. For convex models, the influence of removing training point $z_i$ on the prediction at $z_{\text{test}}$ is:

$$\mathcal{I}(z_i, z_{\text{test}}) = -\nabla L(z_{\text{test}})^\top H^{-1} \nabla L(z_i)$$

Computing this requires solving the linear system $Hu = \nabla L(z_i)$, which can be done iteratively (conjugate gradient, LiSSA) using HVPs. Used in data valuation and finding mislabelled examples.

3. Second-order optimisers (K-FAC, Shampoo, SOAP). All second-order optimisers for deep learning reduce to computing products of the form $\tilde{H}^{-1}\nabla L$ where $\tilde{H}$ is a tractable approximation. The quality of the approximation determines how well the algorithm corrects the gradient with second-order information.

4. Hyperparameter optimisation. Gradient-based hyperparameter optimisation (MAML meta-learning, bilevel optimisation) requires differentiating through optimisation steps. This involves HVPs of the meta-loss with respect to the base loss's Hessian - third-order derivatives!

5. Neural Tangent Kernel (NTK). In the infinite-width limit, the NTK is $\Theta = J J^\top$ where $J$ is the Jacobian of the network output with respect to the parameters. HVPs of the squared-loss Hessian equal NTK-vector products, connecting second-order optimisation to kernel theory.
9. Advanced Topics
9.1 Jacobians of Implicit Functions
The Implicit Function Theorem (IFT) gives the Jacobian of an implicitly defined function without solving for it explicitly.
Setup. Suppose $F: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ is $C^1$ with $F(x_0, y_0) = 0$ and the partial Jacobian $\frac{\partial F}{\partial y}(x_0, y_0) \in \mathbb{R}^{m \times m}$ is invertible. Then there exists a function $y = g(x)$ near $x_0$ satisfying $F(x, g(x)) = 0$, with Jacobian:

$$J_g(x) = -\left(\frac{\partial F}{\partial y}\right)^{-1}\frac{\partial F}{\partial x}$$

Proof sketch. Differentiate $F(x, g(x)) = 0$ with respect to $x$ using the chain rule:

$$\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y}\, J_g(x) = 0$$

Application: Bilevel optimisation (meta-learning). MAML's inner loop finds $\theta^*(\phi)$ satisfying $\nabla_\theta L_{\text{inner}}(\theta^*, \phi) = 0$. The meta-gradient requires $\frac{d\theta^*}{d\phi}$, which by IFT is:

$$\frac{d\theta^*}{d\phi} = -H_{\text{inner}}^{-1}\,\frac{\partial^2 L_{\text{inner}}}{\partial\theta\,\partial\phi}$$

where $H_{\text{inner}} = \nabla^2_\theta L_{\text{inner}}$. This requires second-order information (Hessians involving $\theta$ and $\phi$), which is why MAML-based methods are costly.
9.2 Clarke Subdifferential for Non-Smooth Functions
For non-differentiable $f$ (e.g., networks with ReLU), the Hessian does not exist at activation boundaries. The Clarke subdifferential extends the Jacobian to this setting.

Definition (Clarke subdifferential). For locally Lipschitz $f: \mathbb{R}^n \to \mathbb{R}$, the Clarke subdifferential is:

$$\partial f(x) = \mathrm{conv}\left\{\lim_{k \to \infty}\nabla f(x_k) : x_k \to x,\ x_k \in D\right\}$$

where $D$ is the set of points where $f$ is differentiable (full measure by Rademacher's theorem) and $\mathrm{conv}$ denotes the convex hull.

For ReLU: At $x = 0$, $\partial\,\mathrm{ReLU}(0) = [0, 1]$ - the interval between the left and right derivatives. Any value in $[0, 1]$ is a valid subgradient.

Clarke Jacobian. For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Clarke generalised Jacobian is the convex hull of limiting Jacobians. For piecewise-smooth $f$, it is the convex hull of the Jacobians from each smooth piece meeting at $x$.
Generalised second-order theory. There is no direct analogue of the Hessian in Clarke's framework. Instead, semismoothness (Qi & Sun 1993) allows Newton-like methods: a function is semismooth if the Clarke Jacobian converges in a directional sense. Semismooth Newton methods converge superlinearly to solutions of nonsmooth equations.
9.3 Fisher Information Matrix and Natural Gradient
The Fisher Information Matrix plays a central role in statistical learning theory and connects Jacobians to probabilistic models.
Definition. For a parametric model $p_\theta(x)$, the FIM is:

$$F(\theta) = \mathbb{E}_{x \sim p_\theta}\big[\nabla_\theta \log p_\theta(x)\,\nabla_\theta \log p_\theta(x)^\top\big] = -\mathbb{E}_{x \sim p_\theta}\big[\nabla^2_\theta \log p_\theta(x)\big]$$

The second equality (Fisher's identity) shows $F = -\mathbb{E}[H_{\log p}]$ - the FIM equals the negative expected Hessian of the log-likelihood.

Geometric interpretation. The FIM defines a Riemannian metric on the statistical manifold $\{p_\theta\}$. The KL divergence between nearby distributions is:

$$D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta+\delta}) = \frac{1}{2}\delta^\top F(\theta)\,\delta + O(\lVert\delta\rVert^3)$$

The FIM is the Hessian of the KL divergence. Natural gradient descent moves in the direction that most rapidly decreases the loss per unit of KL divergence, not per unit of Euclidean distance:

$$\theta_{t+1} = \theta_t - \eta\, F^{-1}\nabla L$$

Equivalence to Gauss-Newton. For cross-entropy loss, the FIM equals the Gauss-Newton matrix built from $J^\top J$ where $J$ is the Jacobian of the log-probability outputs. This provides a PSD, well-conditioned curvature matrix.

For transformers: The FIM of a transformer language model is too large to compute exactly. K-FAC approximates it using the Kronecker structure. Recent work (Martens 2020, Bernacchia 2022) shows the FIM of attention layers has special structure exploitable for efficient natural gradient computation.
9.4 Jacobians in Normalising Flows and Change of Variables
Change of variables formula. If $f: \mathbb{R}^n \to \mathbb{R}^n$ is a diffeomorphism (bijective, smooth, smooth inverse) and $z = f(x)$ with $z \sim p_Z$, then:

$$p_X(x) = p_Z(f(x))\,|\det J_f(x)|$$

Log-likelihood under a flow:

$$\log p_X(x) = \log p_Z(f(x)) + \log|\det J_f(x)|$$

The second term is the log-determinant of the Jacobian - the log-volume scaling factor.

Normalising flows (RealNVP, Glow, NICE) are composed diffeomorphisms $f = f_K \circ \cdots \circ f_1$. By the chain rule, $J_f = J_{f_K} \cdots J_{f_1}$, so $\log|\det J_f| = \sum_k \log|\det J_{f_k}|$. Each layer is designed so that $\det J_{f_k}$ is cheap to compute:

- Coupling layers: $J$ is triangular, $\det J = \prod_i J_{ii}$, computable in $O(n)$
- Invertible 1x1 convolutions (Glow): $J = W$ (a learnable matrix), $\det W$ computed via LU decomposition

This is why Jacobian theory is central to generative modelling: the entire training objective depends on computing $\log|\det J_f|$ efficiently.
8.5 Conjugate Gradient for Solving $Hx = b$

Many applications require not just $Hv$ but $H^{-1}b$ - solving a linear system with the Hessian. The Conjugate Gradient (CG) method solves $Hx = b$ for symmetric positive definite $H$ using only matrix-vector products:

CG algorithm:

- Initialise: $x_0 = 0$, $r_0 = b$, $p_0 = r_0$
- Repeat until convergence:
  - $\alpha_k = \dfrac{r_k^\top r_k}{p_k^\top H p_k}$ (one HVP: $Hp_k$)
  - $x_{k+1} = x_k + \alpha_k p_k$, $\quad r_{k+1} = r_k - \alpha_k H p_k$
  - $\beta_k = \dfrac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}$, $\quad p_{k+1} = r_{k+1} + \beta_k p_k$ (update direction)

Convergence. CG converges in at most $n$ steps (exact arithmetic). In $k$ steps, the error satisfies:

$$\lVert x_k - x^*\rVert_H \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k \lVert x_0 - x^*\rVert_H$$

where $\kappa = \lambda_{\max}/\lambda_{\min}$ and $\lVert u\rVert_H = \sqrt{u^\top H u}$. For well-conditioned systems ($\kappa$ small), CG converges in a few dozen iterations - far fewer than $n$.

For influence functions. The LiSSA algorithm (Agarwal et al. 2017) solves $H^{-1}v$ via stochastic approximation: draw mini-batches, apply a truncated Neumann-series expansion of the inverse, and average. Each iteration uses one stochastic HVP. For large networks ($n \gtrsim 10^8$), this is the only feasible approach.
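A minimal CG solver over matrix-vector products (a sketch; for real losses the `matvec` closure would wrap the `hvp` helper from 8.1 - here it is checked on a small explicit quadratic):

```python
import torch

def cg_solve(matvec, b, iters=50, tol=1e-10):
    """Solve H x = b for symmetric PD H, given only the map v -> H v."""
    x = torch.zeros_like(b)
    r = b.clone()                  # residual b - Hx (x = 0 initially)
    p = r.clone()
    rs = r @ r
    for _ in range(iters):
        Hp = matvec(p)             # one HVP per iteration
        alpha = rs / (p @ Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = torch.diag(torch.tensor([4.0, 2.0, 1.0]))
b = torch.ones(3)
x = cg_solve(lambda v: A @ v, b)
print(x)   # [0.25, 0.50, 1.00] = A^{-1} b
```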
9.5 Jacobians and Automatic Differentiation: The Full Picture
This section ties together the JVP/VJP duality from 3.5 with the full structure of automatic differentiation.
The AD computational graph. Every computation is a directed acyclic graph (DAG) where:

- Nodes represent intermediate values
- Edges represent elementary operations (add, multiply, exp, sin, ...)
- Input nodes: the input variables $x_1, \ldots, x_n$
- Output node: the scalar loss $L$

Each elementary operation has a Jacobian - a small, cheap-to-compute matrix.

Forward mode (JVP, tangent propagation). Compute $\dot{y} = J_f(x)\,v$ for a fixed tangent vector $v$. At each node, propagate the tangent:

$$\dot{z} = J_{\text{op}}(\text{inputs})\;\dot{\text{inputs}}$$

Running forward mode for each basis vector $e_j$ separately gives column $j$ of the full Jacobian $J$. Cost: $n$ forward passes total.

Reverse mode (VJP, adjoint propagation). Compute $\bar{x} = J_f(x)^\top\bar{y}$ backward through the graph. At each node, going backward, accumulate:

$$\bar{\text{input}} \mathrel{+}= J_{\text{op}}^\top\,\bar{\text{output}}$$

Starting from $\bar{L} = 1$ (derivative of $L$ with respect to itself), this accumulates all partial derivatives in one backward pass. Cost: one backward pass.

Why reverse mode wins for scalar $L$. Computing the full gradient requires one VJP (one backward pass). Computing it with JVPs requires $n$ forward passes - $n$ times more expensive. Since $n \sim 10^{11}$ for LLMs, reverse mode is the only option.

Forward mode wins when $n \ll m$. If $f: \mathbb{R}^n \to \mathbb{R}^m$ with $n \ll m$, computing the full Jacobian requires $n$ JVPs (forward) or $m$ VJPs (reverse). When $n < m$, forward mode is cheaper. Example: sensitivity analysis of a physical simulation with 10 parameters and $10^6$ output measurements.
Higher-order derivatives. The Hessian can be computed by applying the AD operators twice:
| Method | Cost | When to use |
|---|---|---|
| Reverse over reverse | (full ) | Never for large |
| Forward over reverse | per row (builds row-by-row) | When you need full at small |
| Reverse over forward | per column | Same as above |
| HVP (reverse-of-dot) | per HVP | Standard: only need |
The HVP trick from 8.1 (differentiate $\nabla L(\theta)^T v$) is exactly "reverse over the dot product with $v$" - it's forward-then-reverse in the AD sense, but implemented as two reverse passes.
Mixed-mode AD. Some computations benefit from mixing forward and reverse within the same graph. JAX's jax.hessian is implemented as jacfwd(jacrev(f)) - forward-over-reverse, typically the cheaper composition for scalar-output functions, because the inner reverse pass computes the whole gradient in one sweep.
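As a concrete instance of the forward-over-reverse composition, here is a short JAX sketch of the HVP primitive (the example function `f` is an arbitrary choice of ours):

```python
import jax
import jax.numpy as jnp

def f(x):
    # Any scalar-valued example function
    return jnp.log(jnp.sum(jnp.exp(x))) + 0.5 * jnp.sum(x ** 2)

def hvp(x, v):
    # Forward-over-reverse: push the tangent v through the gradient function.
    # One forward + one backward pass; the Hessian is never materialised.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.arange(4.0)
v = jnp.ones(4)
print(hvp(x, v))
# Cross-check against the explicit Hessian (only feasible for tiny n)
print(jax.hessian(f)(x) @ v)
```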
10. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Confusing Jacobian rows and columns: "the $(i,j)$ entry is $\partial f_j/\partial x_i$" | The $(i,j)$ entry of $J_f$ for $f: \mathbb{R}^n \to \mathbb{R}^m$ is $\partial f_i/\partial x_j$: row $i$ = gradient of output $f_i$, column $j$ = sensitivity of all outputs to input $x_j$ | Remember: rows index outputs, columns index inputs. Shape: $m \times n$ for $f: \mathbb{R}^n \to \mathbb{R}^m$ |
| 2 | Forgetting the Hessian is symmetric | The Hessian is symmetric when $f$ is $C^2$ (Clairaut's theorem). Without the symmetry assumption, algorithms that diagonalise $H$ are invalid | Check $H = H^T$; if computing numerically, average $H$ with its transpose |
| 3 | Applying the second derivative test when $\nabla f(x) \neq 0$ | The test "if $H \succ 0$ then $x$ is a local minimum" requires $x$ to be a critical point. A PD Hessian at a non-critical point just means the quadratic approximation has a minimum elsewhere | First verify $\nabla f(x) = 0$, then check definiteness |
| 4 | Treating the Jacobian chain rule as commutative: $J_{f \circ g} = J_g J_f$ | Matrix multiplication is not commutative. $J_{f \circ g}(x) = J_f(g(x))\,J_g(x)$ - evaluate the outer function's Jacobian at the output of the inner function | Write dimensions explicitly: if $g: \mathbb{R}^n \to \mathbb{R}^p$, $f: \mathbb{R}^p \to \mathbb{R}^m$, then $J_{f \circ g}$ is $m \times n$ |
| 5 | Using a GD learning rate $\eta > 2/\lambda_{\max}$ near a non-convex region | The stable range for GD is $\eta < 2/\lambda_{\max}(H)$. If $\lambda_{\max}$ is large (sharp direction), even $\eta = 0.01$ can cause oscillation | Estimate $\lambda_{\max}$ via power iteration (10-20 HVPs) and set $\eta$ below $2/\lambda_{\max}$; see the sketch after this table |
| 6 | Confusing JVP and VJP modes | JVP (forward mode): computes $Jv$, one forward pass per vector; VJP (reverse mode): computes $J^T u$, one backward pass per vector. For scalar loss ($m = 1$), VJP needs one pass regardless of $n$; JVP would cost $n$ passes | Use reverse mode (backprop) for scalar losses; use forward mode when $n \ll m$ |
| 7 | Concluding a flat Hessian means global minimum | $H = 0$ at $\nabla f = 0$ only means the quadratic approximation is flat. The actual function could be a plateau, a flat saddle, or a valley floor. Example: $f(x) = x^3$ has $f''(0) = 0$ but $x = 0$ is an inflection point, not a minimum | Check higher-order derivatives or sufficient decrease conditions from all directions |
| 8 | Computing $Jv$ by forming $J$ first | For large $m, n$, forming $J$ costs $O(mn)$ memory and $m$ backward passes. Computing $Jv$ directly via forward-mode AD costs one pass (a single JVP) | Use torch.autograd.functional.jvp or JAX's jvp for forward mode; never materialise the full Jacobian if you only need the product |
| 9 | Assuming the softmax Jacobian is diagonal | The softmax has Jacobian $\mathrm{diag}(p) - pp^T$ - a full symmetric rank-$(K-1)$ matrix. Only the entries $p_i(1 - p_i)$ appear on the diagonal | When backpropagating through softmax, use the full Jacobian. Many frameworks handle this correctly, but manual implementations often miss the off-diagonal terms |
| 10 | Ignoring that the Hessian of a ReLU network is zero almost everywhere | ReLU networks are piecewise linear in their inputs. Their Hessian is 0 in each linear region. Newton's method applied to a ReLU network Hessian gives a zero-curvature degenerate system | Use Gauss-Newton or K-FAC (which build curvature from the output Jacobian, not the parameter Hessian) for ReLU networks |
| 11 | Confusing the FIM with the Hessian of the loss | The FIM $F$ is not the same as $H$ (Hessian of the training loss). They are equal only when the model is at its optimum (Fisher identity). Away from the optimum, $F \neq H$ (the Hessian includes an extra term for model misspecification) | Natural gradient uses $F^{-1}\nabla L$; full second-order Newton uses $H^{-1}\nabla L$. K-FAC approximates $F$, not $H$ |
| 12 | Expecting Newton's method to work with indefinite Hessians | Newton's step $-H^{-1}\nabla L$ is only guaranteed to be a descent direction when $H \succ 0$. At saddle points, the Newton step points toward the saddle, making the loss increase | Add a regularisation term: $(H + \lambda I)^{-1}\nabla L$ with $\lambda$ chosen so $H + \lambda I \succ 0$ (Levenberg-Marquardt) |
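Mistakes 5 and 8 share one remedy: estimate $\lambda_{\max}$ with matrix-free products. A minimal NumPy sketch, where a least-squares HVP ($Hv = X^T(Xv)$) stands in for the two backward passes a real network would use:

```python
import numpy as np

def lambda_max(hvp, n, iters=20, seed=0):
    """Estimate the top Hessian eigenvalue with power iteration on HVPs."""
    v = np.random.default_rng(seed).normal(size=n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        Hv = hvp(v)                   # one HVP per iteration
        v = Hv / np.linalg.norm(Hv)
    return v @ hvp(v)                 # Rayleigh quotient

# Stand-in HVP for H = X^T X, never forming H explicitly
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
lam = lambda_max(lambda v: X.T @ (X @ v), n=30)
print(lam, "vs exact", np.linalg.eigvalsh(X.T @ X).max())
print("max stable GD learning rate ~", 2 / lam)
```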
11. Exercises
Exercise 1. Computing Jacobians.
Let $f: \mathbb{R}^n \to \mathbb{R}^m$ be a nonlinear map of your choice.
(a) Compute $J_f(x)$ for general $x$.
(b) Evaluate $J_f$ at a specific point $x_0$.
(c) Compute $J_f(x_0)v$ for a tangent vector $v$ (the JVP).
(d) Compute $u^T J_f(x_0)$ for a covector $u$ (the VJP).
(e) Verify that $u^T (J_f v) = (u^T J_f) v$ (duality of JVP and VJP).
Exercise 2. Softmax Jacobian properties.
Let $p = \mathrm{softmax}(x) \in \mathbb{R}^K$ be the softmax function.
(a) For $K = 3$ and an $x$ of your choice, compute $p$ and the full Jacobian $J = \mathrm{diag}(p) - pp^T$.
(b) Verify $J\mathbf{1} = 0$ (the all-ones vector is in the null space).
(c) Verify $J$ is symmetric and compute its eigenvalues numerically.
(d) Show that $\mathrm{rank}(J) = K - 1$ by arguing that $\mathbf{1}$ spans the null space.
(e) For AI: In cross-entropy loss $L = -\log p_y$, the gradient of $L$ with respect to the logits is $p - e_y$ (where $e_y$ is the one-hot vector). Derive this from the softmax Jacobian.
Exercise 3. Hessian computation and classification.
Let $f(x, y)$ be a smooth function with both a minimum and a saddle point (e.g., $f(x, y) = x^3 - 3x + y^2$).
(a) Find all critical points (where $\nabla f = 0$).
(b) Compute the Hessian $H_f$ at each critical point.
(c) Classify each critical point as a local min, local max, or saddle using the second derivative test.
(d) For the saddle point, find the two eigenvectors of $H_f$ and describe the geometry: in which directions is $f$ increasing/decreasing quadratically?
Exercise 4. LayerNorm Jacobian.
Consider LayerNorm without learnable parameters $\gamma, \beta$:

$$y = \frac{x - \mu}{\sigma}, \qquad \mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2$$

(a) Show that $J = \frac{1}{\sigma}\left(I - \frac{1}{d}\mathbf{1}\mathbf{1}^T - \frac{1}{d}\hat{x}\hat{x}^T\right)$ where $\hat{x} = (x - \mu)/\sigma$.
(b) Verify that $\mathbf{1}$ and $\hat{x}$ are in the null space of $J$.
(c) Compute the rank of $J$ and explain its geometric meaning (which input variations LayerNorm cannot distinguish).
(d) For a small $d$ (say $d = 4$) and a random $x$, compute $J$ numerically and verify your formula.
Exercise 5. Hessian-vector product and edge of stability.
For the MSE loss $L(w) = \frac{1}{2}\|Xw - y\|^2$ with $X \in \mathbb{R}^{m \times n}$:
(a) Show that $\nabla L = X^T(Xw - y)$ and $H = X^T X$.
(b) Implement the HVP formula: $Hv = X^T(Xv)$ (matrix-free, $O(mn)$ per product).
(c) Use 10 iterations of power iteration starting from a random $v$ to estimate $\lambda_{\max}(H)$.
(d) Compute the maximum stable learning rate $\eta_{\max} = 2/\lambda_{\max}$ and verify numerically that GD with $\eta > \eta_{\max}$ diverges while $\eta < \eta_{\max}$ converges.
(e) For AI: Explain why the "edge of stability" phenomenon (Cohen et al. 2021) - where the training loss oscillates as $\lambda_{\max}$ hovers around $2/\eta$ - is consistent with this analysis.
Exercise 6. Newton's method convergence.
Consider the softplus $f(x) = \log(1 + e^x)$.
(a) Compute $f'(x) = \sigma(x)$ (the sigmoid) and $f''(x) = \sigma(x)(1 - \sigma(x))$.
(b) Softplus itself has no minimiser, so minimise the symmetrised $g(x) = f(x) + f(-x)$, whose minimum is at $x^* = 0$. Apply 5 steps of Newton's method from a starting point of your choice (e.g., $x_0 = 2$) and record the error $|x_k - x^*|$ at each step.
(c) Plot $\log|x_k - x^*|$ vs $k$. Is the convergence roughly linear or quadratic (does the log-error decrease linearly or super-linearly)?
(d) Track the Newton decrement $\lambda(x_k) = \sqrt{g'(x_k)^2 / g''(x_k)}$ at each step and interpret its role as a stopping criterion.
Exercise 7. K-FAC structure.
Consider a single linear layer $y = Wx$ with $W \in \mathbb{R}^{m \times n}$ and loss $L(y)$.
(a) Show that $\mathrm{vec}(\nabla_W L) = x \otimes g$ where $g = \nabla_y L$, $\otimes$ is the Kronecker product, and $\mathrm{vec}$ stacks columns.
(b) The exact FIM is $F = \mathbb{E}[(x \otimes g)(x \otimes g)^T] = \mathbb{E}[xx^T \otimes gg^T]$. Show that K-FAC's independence assumption (with $A = \mathbb{E}[xx^T]$, $G = \mathbb{E}[gg^T]$) gives $F \approx A \otimes G$.
(c) Show that the K-FAC natural gradient step (reshape form) implements $(A \otimes G)^{-1}\mathrm{vec}(\nabla_W L) = \mathrm{vec}(G^{-1}(\nabla_W L)A^{-1})$.
(d) Generate random samples of $x$ and $g$, compute the exact FIM from 100 samples, compute the K-FAC approximation, and compare $F^{-1}\mathrm{vec}(\nabla_W L)$ (exact) with $(A \otimes G)^{-1}\mathrm{vec}(\nabla_W L)$ (K-FAC) numerically.
Exercise 8. Jacobians in normalising flows.
Implement a single coupling layer (as in RealNVP) and verify the log-determinant formula.
(a) Define a coupling layer $f$: split $x = (x_1, x_2)$; output $y_1 = x_1$, $y_2 = x_2 \odot \exp(s(x_1)) + t(x_1)$, where $s, t$ are simple networks.
(b) Show that the Jacobian is lower-triangular, and the log-determinant is $\log|\det J| = \sum_i s(x_1)_i$ (sum of scale outputs).
(c) Numerically compute $J$ by finite differences for a specific $x$, and verify $\log|\det J| = \sum_i s(x_1)_i$.
(d) Implement the inverse $f^{-1}$ and verify $f^{-1}(f(x)) = x$.
(e) For generative modelling: Explain why the tractable log-determinant (which avoids $O(d^3)$ determinant computation) is essential for training normalising flows via maximum likelihood.
12. Why This Matters for AI (2026 Perspective)
| Concept | Concrete AI/ML Usage |
|---|---|
| Jacobian of layer outputs | Backpropagation is iterated vector-Jacobian products (VJPs) through each layer. Every training step of every neural network in the world computes these products implicitly |
| Softmax Jacobian | Understanding that $\mathrm{diag}(p) - pp^T$ is rank $K - 1$ explains why large-vocabulary softmax is numerically tricky (singular Jacobian) and why temperature scaling affects gradient magnitudes |
| LayerNorm Jacobian | The rank-$(d-2)$ Jacobian of LayerNorm means gradients cannot pass through in the directions $\mathbf{1}$ and $\hat{x}$ - this "gradient bottleneck" explains why Pre-LN transformers (LN before attention) converge faster than Post-LN (LN after attention) |
| JVP vs VJP (forward vs reverse mode) | All modern deep learning frameworks (PyTorch, JAX) use reverse mode. Forward mode is used in specific settings: sensitivity analysis, Hessian-diagonal estimation, gradient checkpointing trade-offs |
| Jacobian condition number | High condition number -> vanishing/exploding gradients in deep networks. Xavier and He initialisation set the initial Jacobian's singular values near 1 across layers, ensuring stable gradient flow at init |
| Hessian eigenspectrum | The largest Hessian eigenvalue $\lambda_{\max}$ determines the maximum stable learning rate $2/\lambda_{\max}$. The "edge of stability" phenomenon (Cohen et al. 2021) shows SGD self-regulates to $\lambda_{\max} \approx 2/\eta$ - foundational for understanding training dynamics |
| Flat vs sharp minima (SAM) | SAM and its variants (ASAM, mSAM, SAM-ON, Adan-SAM) are widely used in production systems. SAM finds flatter minima, improving generalisation by 1-3% on vision and NLP benchmarks. The analysis of flat vs sharp minima requires understanding Hessian definiteness and condition numbers |
| Hessian-vector products | Power iteration on the Hessian (using HVPs) gives $\lambda_{\max}$ for learning rate selection. Influence functions (for data valuation, finding mislabelled examples, understanding model behaviour) use iterative HVP solvers |
| K-FAC / natural gradient | Second-order methods that approximate the Fisher Information Matrix (= Gauss-Newton matrix) achieve 10-30x speedups over Adam in terms of steps to convergence. EKFAC and KFAC-Reduce are used in large-scale transformer training at Google and DeepMind |
| Jacobian determinant / normalising flows | The change-of-variables formula requires $\log\lvert\det J_f\rvert$ for every training sample; flow architectures (coupling layers, invertible 1x1 convolutions) are designed around making it tractable |
| IFT / bilevel optimisation | MAML meta-learning, hyperparameter optimisation via gradient, and NAS all require differentiating through optimisation loops. This requires Hessian-vector products in the IFT formula |
| Neural Tangent Kernel (NTK) | In the infinite-width limit, the kernel is $\Theta(x, x') = J_\theta f(x)\, J_\theta f(x')^T$ (outer product of output Jacobians). Understanding NTK evolution during training (NTK changes vs stays constant) explains why wide networks behave like kernel machines and why the lazy training regime exists |
| RoPE and attention Jacobians | Rotary Position Embeddings (RoPE, used in LLaMA, Mistral, Qwen) are orthogonal transformations applied to queries and keys. Their Jacobians are rotation matrices (orthogonal, $\lvert\det\rvert = 1$), so they preserve gradient norms through the positional encoding |
Conceptual Bridge
Looking backward. This section builds directly on 01-Partial-Derivatives-and-Gradients, which introduced the gradient as the collection of partial derivatives for scalar functions. The Jacobian is the natural generalisation: instead of a single gradient vector, we have a matrix with one gradient per output. Every formula from the gradient section - chain rule, directional derivative, gradient descent - has a Jacobian analogue: the chain rule for Jacobians ($J_{f \circ g} = J_f J_g$), the directional derivative as a JVP ($J_f v$), and backpropagation as iterated VJPs.
Looking forward. The Hessian established here is the foundation for the next two sections:
- 03-Optimization-on-Manifolds: When parameters are constrained to a manifold (e.g., orthogonal matrices, the unit sphere), the Riemannian Hessian replaces the Euclidean Hessian. The Fisher Information Matrix defines a natural Riemannian metric on statistical manifolds.
- 04-Automatic-Differentiation: The JVP and VJP operators computed throughout this section are exactly what forward-mode and reverse-mode AD implement. The Jacobian-vector product is the primitive of forward mode; the vector-Jacobian product is the primitive of reverse mode. Understanding both modes, and when to use each, is the foundation of efficient AD system design.
Beyond this chapter: Jacobians and Hessians appear throughout the rest of the curriculum:
- Optimisation (Chapter 8): gradient descent convergence rates, condition numbers, Newton's method, and conjugate gradients all depend on the Hessian eigenspectrum
- Probability and Statistics (Chapter 9): the Fisher Information Matrix and Cramér-Rao bound require Jacobians of log-likelihood functions
- Differential Geometry (Chapter 10): the pushforward and pullback are Jacobian maps between tangent spaces on manifolds
POSITION IN THE CURRICULUM

```
01-Partial-Derivatives-and-Gradients
    (scalar functions: ∇f, directional derivatives, chain rule)
                        |
                        v
02-Jacobians-and-Hessians                      <- YOU ARE HERE
    (vector functions: J_f, H_f, second-order analysis)
            |                          |
            v                          v
03-Optimization-on-Manifolds    04-Automatic-Differentiation
  (Riemannian Hessian,            (JVP/VJP implementation)
   natural gradient)
                        |
                        v
Chapter 8: Optimisation
    (convergence theory, Newton, second-order methods)
```
The Jacobian and Hessian are the core analytical tools of multivariate analysis. Every gradient-based learning algorithm, every convergence theorem, every second-order method in deep learning is a consequence of these fundamental objects. With this section complete, you have the mathematical foundation to understand not just how modern optimisers work, but why they work - and where they break down.
13. Numerical Methods and Finite Differences
Before automatic differentiation, Jacobians and Hessians were computed numerically. Even today, numerical methods are essential for verifying analytical gradients (gradient checking), debugging implementations, and understanding numerical stability.
13.1 Finite Difference Approximations
First-order differences. For $f: \mathbb{R}^n \to \mathbb{R}$:

$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + h e_i) - f(x)}{h} \;\;(\text{forward, error } O(h)), \qquad \frac{\partial f}{\partial x_i} \approx \frac{f(x + h e_i) - f(x - h e_i)}{2h} \;\;(\text{centred, error } O(h^2))$$

The centred difference is preferred: the $O(h)$ error term cancels by symmetry (it is exact for quadratics), at the price of two function evaluations per coordinate.

Second-order differences (Hessian diagonal).

$$\frac{\partial^2 f}{\partial x_i^2} \approx \frac{f(x + h e_i) - 2f(x) + f(x - h e_i)}{h^2}$$

Off-diagonal Hessian entries. Two ways: the forward stencil

$$\frac{\partial^2 f}{\partial x_i \partial x_j} \approx \frac{f(x + he_i + he_j) - f(x + he_i) - f(x + he_j) + f(x)}{h^2}$$

or the more accurate centred stencil

$$\frac{\partial^2 f}{\partial x_i \partial x_j} \approx \frac{f(x + he_i + he_j) - f(x + he_i - he_j) - f(x - he_i + he_j) + f(x - he_i - he_j)}{4h^2}$$

Computing the full Hessian this way requires $O(n^2)$ function evaluations - completely infeasible for large $n$.
13.2 Gradient Checking
Gradient checking compares the analytical gradient $g_{\text{analytic}}$ (computed by backprop) against the numerical estimate $g_{\text{numeric}}$ (computed by centred differences):

$$\text{rel. error} = \frac{\|g_{\text{analytic}} - g_{\text{numeric}}\|}{\|g_{\text{analytic}}\| + \|g_{\text{numeric}}\|}$$

A relative error below about $10^{-7}$ (in float64) usually indicates a correct implementation; above about $10^{-4}$ it indicates a bug. This is a gold-standard debugging technique - every significant ML framework implementation should be gradient-checked before its gradients are trusted.
Common pitfalls in gradient checking:
- Kink near the evaluation point: ReLU has an undefined derivative at 0. If the evaluation point puts some neuron's preactivation near 0, the numerical gradient straddles a kink and gives an inaccurate estimate. Fix: perturb the input slightly away from known kinks.
- Float32 precision: Centred differences in float32 suffer catastrophic cancellation for small $h$ (both terms round to the same value). Use $h \approx 10^{-3}$ for float32, $h \approx 10^{-5}$ for float64.
- Dropout and batch norm: These layers behave differently in train vs eval mode. Gradient-check with these layers frozen or in eval mode.
Stochastic gradient checking. Instead of checking all $n$ coordinates (cost: $O(n)$ function evaluations), check a random projection: compute $g^T v$ for a random unit vector $v$ (one dot product) and compare against $\frac{f(x + hv) - f(x - hv)}{2h}$ (one centred difference in direction $v$). This costs only 2 function evaluations regardless of $n$, and catches most gradient bugs.
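A sketch of this stochastic check (the log-sum-exp example and the helper names are our illustrative choices):

```python
import numpy as np

def directional_check(f, grad_f, x, h=1e-5, seed=0):
    """Compare analytic g^T v against a centred difference along a random v."""
    v = np.random.default_rng(seed).normal(size=x.shape)
    v /= np.linalg.norm(v)
    analytic = grad_f(x) @ v
    numeric = (f(x + h * v) - f(x - h * v)) / (2 * h)
    return abs(analytic - numeric) / (abs(analytic) + abs(numeric) + 1e-12)

# Example: log-sum-exp, whose gradient is softmax(x)
f = lambda x: np.log(np.sum(np.exp(x)))
grad_f = lambda x: np.exp(x) / np.sum(np.exp(x))
x = np.random.default_rng(1).normal(size=10)
print(directional_check(f, grad_f, x))   # ~1e-10: gradients agree
```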
13.3 Numerical Stability of Jacobian Computations
Condition number and ill-conditioned Jacobians. The condition number $\kappa(J) = \sigma_{\max}(J)/\sigma_{\min}(J)$ measures how amplified input perturbations become in the output. For a neural network layer, large $\kappa(J)$ means:
- Small input perturbations (e.g., numerical noise) produce large output perturbations
- Backpropagated gradients through this layer are magnified by factors up to $\sigma_{\max}$ or shrunk by factors down to $\sigma_{\min}$

Ill-conditioning accumulates through depth. For a depth-$L$ network, the condition number of the full Jacobian satisfies:

$$\kappa(J_{\text{total}}) \le \prod_{\ell=1}^{L} \kappa(J_\ell)$$

This bound can be exponential in $L$ - explaining gradient explosion. Good architecture design (residual connections, normalisation layers) keeps $\kappa(J_\ell) \approx 1$ for each layer.
Log-sum-exp trick. The softmax Jacobian involves $e^{x_i}$. Direct computation overflows for large $x_i$. The numerically stable implementation subtracts $M = \max_i x_i$ before exponentiation:

$$p_i = \frac{e^{x_i - M}}{\sum_j e^{x_j - M}}$$

This shift doesn't change the output but keeps all exponentials $\le 1$. The Jacobian formula $\mathrm{diag}(p) - pp^T$ remains the same - only the computation of $p$ is stabilised.
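A minimal stable-softmax sketch; the extreme logits below would overflow a naive `np.exp(x)` even in float64:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = np.exp(x - np.max(x))     # all exponentials <= 1, no overflow
    return z / z.sum()

def softmax_jacobian(x):
    p = softmax(x)
    return np.diag(p) - np.outer(p, p)

x = np.array([1000.0, 1001.0, 1002.0])    # naive exp(x) would overflow to inf
p = softmax(x)
print(p)                                   # same as softmax([0, 1, 2])
print(softmax_jacobian(x) @ np.ones(3))   # ~0: the ones vector is in the null space
```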
Mixed precision Hessian computations. When computing HVPs in mixed precision (float16/bfloat16 for activations, float32 for gradients), the second differentiation step (gradient of gradient) may accumulate significant rounding error. Google's T5X and JAX-based training systems perform HVP computations in float32 even during mixed-precision training to maintain numerical accuracy of second-order estimates.
14. Jacobians in Modern Transformer Architectures
Transformers introduced several operations with non-trivial Jacobians. Understanding these helps explain training behaviour, initialisation requirements, and the design of efficient fine-tuning methods.
14.1 Attention Jacobian
The scaled dot-product attention function

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

maps queries, keys, and values to outputs. The Jacobian of the output $o_i = \sum_j a_{ij} v_j$ with respect to the query $q_i$ (for a single head, single query position $i$):

$$\frac{\partial o_i}{\partial q_i} = \frac{1}{\sqrt{d_k}}\, V^T\left(\mathrm{diag}(a_i) - a_i a_i^T\right) K$$

where $a_i = \mathrm{softmax}(Kq_i/\sqrt{d_k})$ and $\mathrm{diag}(a_i) - a_i a_i^T$ is the softmax Jacobian. The full attention Jacobian is a complex composition involving the softmax Jacobian, the key matrix $K$, and the value matrix $V$.
Implications of the softmax Jacobian in attention:
- The singular softmax Jacobian (null space = span$\{\mathbf{1}\}$) means attention weights are translation-invariant: shifting all logits by a constant doesn't change the output
- Temperature scaling ($x \mapsto x/T$) rescales the Jacobian: $J_T = \frac{1}{T}\left(\mathrm{diag}(p_T) - p_T p_T^T\right)$. High temperature ($T \gg 1$) -> nearly-uniform attention -> small Jacobian -> slow learning. Low temperature ($T \ll 1$) -> peaked attention -> large Jacobian -> sharp gradients.
- Attention entropy collapse: if attention concentrates on one token, $a_i \to e_k$ for some $k$, and $\mathrm{diag}(a_i) - a_i a_i^T \to 0$ (PSD, vanishing norm). Gradients vanish through the attention softmax.
14.2 RoPE and Orthogonal Jacobians
Rotary Position Embeddings (RoPE), used in LLaMA, Mistral, Qwen, and DeepSeek, apply a position-dependent rotation to queries and keys. For position $m$ and dimension pair $(2i, 2i+1)$:

$$\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}, \qquad \theta_i = 10000^{-2i/d}$$

The Jacobian of the RoPE transformation (with respect to the query) is a block-diagonal rotation matrix $R$ with:
- $R^T R = I$ (orthogonal matrix)
- $\det R = 1$ (volume-preserving)
- $\sigma_i(R) = 1$ for all $i$ (condition number = 1)
- $\|Rv\| = \|v\|$ (isometry: no gradient scaling)

RoPE's orthogonal Jacobian is a mathematically elegant design: gradient norms are preserved through the positional encoding transformation, preventing it from being a source of gradient instability.
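A short NumPy check of these four properties (the position and dimension below are arbitrary choices):

```python
import numpy as np

def rope_matrix(m, d, base=10000.0):
    """Block-diagonal RoPE rotation for position m (d must be even)."""
    R = np.zeros((d, d))
    for i in range(d // 2):
        theta = m * base ** (-2 * i / d)
        c, s = np.cos(theta), np.sin(theta)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

R = rope_matrix(m=7, d=8)
q = np.random.default_rng(0).normal(size=8)
print(np.allclose(R.T @ R, np.eye(8)))           # True: orthogonal
print(np.linalg.det(R))                           # 1.0: volume-preserving
print(np.linalg.norm(R @ q), np.linalg.norm(q))  # equal: isometry, no gradient scaling
```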
14.3 LoRA and the Low-Rank Jacobian Approximation
Low-Rank Adaptation (LoRA, Hu et al. 2022) freezes the original weight matrix $W_0 \in \mathbb{R}^{m \times n}$ and adds a trainable low-rank perturbation:

$$W = W_0 + BA, \qquad B \in \mathbb{R}^{m \times r},\; A \in \mathbb{R}^{r \times n},\; r \ll \min(m, n)$$

The Jacobian of the layer output $y = (W_0 + BA)x$ with respect to the LoRA parameters: perturbing $B$ changes the output by $\Delta B\,(Ax)$, and perturbing $A$ changes it by $B\,\Delta A\,x$.

The gradient of the loss with respect to $B$: $\nabla_B L = g\,(Ax)^T$ (with $g = \nabla_y L$) - a rank-1 outer product per example, matching the original weight gradient structure. The gradient with respect to $A$: $\nabla_A L = (B^T g)\,x^T$ - again rank 1 per example.

Why LoRA works (Jacobian perspective). The empirical observation that full fine-tuning has low "intrinsic dimensionality" (Aghajanyan et al. 2021) means the weight update concentrates in a low-rank subspace during fine-tuning. LoRA parameterises updates in this subspace directly, with $r(m + n)$ trainable parameters vs $mn$ for full fine-tuning - a reduction of roughly $mn/\big(r(m+n)\big)$, often two to three orders of magnitude.

DoRA (Weight-Decomposed LoRA, Liu et al. 2024). Decomposes the weight as $W = m \cdot \frac{V}{\|V\|_c}$ (a magnitude vector times a column-normalised direction). The Jacobian of the direction normalisation introduces a LayerNorm-like structure - the gradients of the magnitude and the direction decouple, enabling more stable fine-tuning.
14.4 Gradient Flow Through Residual Connections
The residual connection $y = x + F(x)$ (introduced in ResNets, universal in transformers) has a special Jacobian structure:

$$J = I + J_F$$

This is the key insight: the Jacobian of a residual block is always the identity plus a perturbation. Even if $J_F$ has small singular values (vanishing), the sum $I + J_F$ has singular values bounded below by $1 - \sigma_{\max}(J_F)$, which stays near 1 whenever $\|J_F\|$ is small. For a well-initialised network, $J_F \approx 0$ at init, so $J \approx I$ - the identity map passes gradients through unchanged.

Depth effect. For an $L$-layer residual network:

$$J_{\text{total}} = \prod_{\ell=1}^{L}(I + J_{F_\ell}) = I + \sum_{\ell} J_{F_\ell} + \sum_{\ell' > \ell} J_{F_{\ell'}} J_{F_\ell} + \cdots$$

The leading term is $I$ (identity shortcut from input to output), followed by first-order terms (single-layer Jacobians), then higher-order interaction terms. This identity shortcut is why ResNets train stably at great depth: the gradient can always flow through the identity path without attenuation.

Comparison to plain networks. For a plain (non-residual) network: $J_{\text{total}} = J_{F_L} \cdots J_{F_1}$. This product can have exponentially small singular values at depth $L$ - the vanishing gradient problem. Residual connections solve this by ensuring the Jacobian includes an identity component.

For transformers: Every transformer layer adds a residual connection around both the attention block and the MLP block. The combined Jacobian per layer is $(I + J_{\text{MLP}})(I + J_{\text{attn}})$ - dominated by $I$ at initialisation, with the attention and MLP Jacobians as small corrections.
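A quick numerical illustration of the contrast (the layer scale $0.1/\sqrt{d}$ is an arbitrary small-initialisation stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 30
# Random layer Jacobians with small norm, as near a typical initialisation
Js = [0.1 * rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(L)]

plain = np.eye(d)
resid = np.eye(d)
for J in Js:
    plain = J @ plain                 # plain network: product of layer Jacobians
    resid = (np.eye(d) + J) @ resid   # residual network: identity + perturbation

print("plain    sigma_min, sigma_max:", np.linalg.svd(plain, compute_uv=False)[[-1, 0]])
print("residual sigma_min, sigma_max:", np.linalg.svd(resid, compute_uv=False)[[-1, 0]])
# The plain product collapses toward 0; the residual product stays O(1).
```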
15. Worked Examples: End-to-End Computations
15.1 Complete Backprop Trace for a Two-Layer Network
Let us trace the full computation of Jacobians and VJPs for the small network:

$$\hat{y} = W_2\,\mathrm{ReLU}(W_1 x), \qquad L = \tfrac{1}{2}\|\hat{y} - y\|^2$$

with $W_1 \in \mathbb{R}^{h \times n}$, $W_2 \in \mathbb{R}^{m \times h}$.
Forward pass (computing values):
- $z_1 = W_1 x$ (linear)
- $a = \mathrm{ReLU}(z_1)$ (elementwise)
- $\hat{y} = W_2 a$ (linear)
- $L = \tfrac{1}{2}\|\hat{y} - y\|^2$ (loss)
Backward pass (computing Jacobians and VJPs):

Start from the loss: $\partial L/\partial\hat{y} = \hat{y} - y$ (call this $\delta_2$).

Step 1: Through the linear layer $\hat{y} = W_2 a$.
The Jacobian of $\hat{y}$ with respect to $W_2$ and $a$:
- Weight gradient: $\nabla_{W_2} L = \delta_2\, a^T$ (outer product)
- Input Jacobian: $\partial\hat{y}/\partial a = W_2$, so VJP: $\delta_a = W_2^T \delta_2$

Step 2: Through ReLU.
$J = \mathrm{diag}(\mathbf{1}[z_1 > 0])$ (diagonal mask). VJP:

$$\delta_1 = \mathbf{1}[z_1 > 0] \odot \delta_a$$

(elementwise multiply by the activation mask)

Step 3: Through the linear layer $z_1 = W_1 x$.
Let $\delta_1$ be the incoming gradient.
- $\nabla_{W_1} L = \delta_1\, x^T$ (outer product)
- $\nabla_x L = W_1^T \delta_1$ (if we need the input gradient)
Summary of the structure. Every backward step computes:
- Weight gradient = (backpropagated error) x (forward activation) - always a rank-1 outer product for a single example
- Activation gradient = (transposed weight matrix) x (backpropagated error) - transposed Jacobian times incoming gradient

This pattern repeats identically at any depth: the backward pass is a chain of transposed-Jacobian-vector products (VJPs), with weight gradients as outer products at each layer.
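The trace above can be checked directly against autograd. A PyTorch sketch (the dimensions are arbitrary choices):

```python
import torch

torch.manual_seed(0)
n, h, m = 4, 5, 3
x, y = torch.randn(n), torch.randn(m)
W1 = torch.randn(h, n, requires_grad=True)
W2 = torch.randn(m, h, requires_grad=True)

# Forward pass
z1 = W1 @ x
a = torch.relu(z1)
y_hat = W2 @ a
L = 0.5 * torch.sum((y_hat - y) ** 2)
L.backward()

# Manual backward pass, following the trace above
delta2 = (y_hat - y).detach()             # dL/dy_hat
gW2 = torch.outer(delta2, a.detach())     # weight grad = outer product
delta_a = W2.detach().T @ delta2          # VJP through the linear layer
delta1 = delta_a * (z1.detach() > 0)      # elementwise mask through ReLU
gW1 = torch.outer(delta1, x)              # weight grad = outer product

print(torch.allclose(gW2, W2.grad), torch.allclose(gW1, W1.grad))  # True True
```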
15.2 Numerical Verification of the Softmax Jacobian
A worked numerical example. Take $K = 3$ and $x = (1, 2, 3)$.

Compute $p = \mathrm{softmax}(x)$:

$$p = \frac{(e^1, e^2, e^3)}{e^1 + e^2 + e^3} \approx (0.0900,\; 0.2447,\; 0.6652)$$

Jacobian $J = \mathrm{diag}(p) - pp^T$:

$$J \approx \begin{pmatrix} 0.0819 & -0.0220 & -0.0599 \\ -0.0220 & 0.1848 & -0.1628 \\ -0.0599 & -0.1628 & 0.2227 \end{pmatrix}$$

Verification:
- Row sums: each row sums to $\approx 0$ (rounding; exact = 0)
- $J\mathbf{1} = 0$ (null space contains $\mathbf{1}$)
- Symmetry: $J = J^T$
- Diagonal = $p_i(1 - p_i)$: e.g., $0.0900 \cdot 0.9100 \approx 0.0819$

Eigenvalues (computed numerically): $\lambda_1 \approx 0.3709$, $\lambda_2 \approx 0.1186$, $\lambda_3 = 0$. Two nonzero eigenvalues (rank $2 = K - 1$), one zero eigenvalue corresponding to eigenvector $\mathbf{1}$.
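The numbers above can be reproduced in a few lines:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
p = np.exp(x) / np.exp(x).sum()
J = np.diag(p) - np.outer(p, p)

print(p.round(4))                       # [0.09   0.2447 0.6652]
print(J.sum(axis=1).round(10))          # row sums: ~0
print(np.linalg.eigvalsh(J).round(4))   # [~0, 0.1186, 0.3709]
print(np.linalg.matrix_rank(J))         # 2 = K - 1
```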
15.3 Newton's Method: Traced Convergence on a Quadratic
Take $f(x) = \frac{1}{2}(x_1^2 + 25 x_2^2)$ (condition number $\kappa = 25$). Minimum at $x^* = (0, 0)$.

$\nabla f(x) = (x_1,\; 25 x_2)$, $\quad H = \mathrm{diag}(1, 25)$.

Gradient descent with the optimal constant step $\eta = 2/(\lambda_{\min} + \lambda_{\max}) = 2/26 \approx 0.077$:

| $k$ | $\lvert x_1^{(k)}\rvert$ | $\lvert x_2^{(k)}\rvert$ | $f(x^{(k)})$ | $\|x^{(k)}\|$ |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 13.0 | 1.414 |
| 1 | 0.923 | 0.923 | 11.08 | 1.305 |
| 5 | 0.652 | 0.652 | 5.51 | 0.922 |
| 20 | 0.215 | 0.215 | 0.599 | 0.304 |
| 50 | 0.029 | 0.029 | 0.011 | 0.041 |
Contraction ratio: $\frac{\kappa - 1}{\kappa + 1} = \frac{24}{26} \approx 0.923$ per step - slow!
Newton's method with exact Hessian $H = \mathrm{diag}(1, 25)$:

Newton step: $x_{k+1} = x_k - H^{-1}\nabla f(x_k)$.

From $x_0 = (1, 1)$: $H^{-1}\nabla f(x_0) = \mathrm{diag}(1, 1/25)\,(1, 25) = (1, 1)$, so the Newton step gives $x_1 = (0, 0) = x^*$ - exactly one step!

This is because $f$ is exactly quadratic: the Newton step minimises the quadratic approximation exactly, and the approximation is exact for quadratic functions. For non-quadratic functions, Newton converges quadratically (a handful of steps) near the minimum.

Key insight. GD takes 50 steps to get to error $\approx 0.03$; Newton takes 1 step to get to error 0. The difference is the Hessian inverse $H^{-1} = \mathrm{diag}(1, 1/25)$, which scales the poorly-conditioned direction (the $x_2$-direction with curvature 25) by $1/25$, compensating exactly.
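Both traces in a few lines of NumPy (the printed GD iterates track the table above up to its rounding):

```python
import numpy as np

H = np.diag([1.0, 25.0])              # Hessian of f(x) = 0.5*(x1^2 + 25*x2^2)
grad = lambda x: H @ x
f = lambda x: 0.5 * x @ H @ x

# Gradient descent with the optimal constant step 2/(1 + 25)
x = np.array([1.0, 1.0])
eta = 2.0 / 26.0
for k in range(51):
    if k in (0, 1, 5, 20, 50):
        print(k, np.abs(x).round(3), round(float(f(x)), 3))
    x = x - eta * grad(x)

# Newton: one exact step from the same start
x0 = np.array([1.0, 1.0])
print("Newton:", x0 - np.linalg.solve(H, grad(x0)))   # [0. 0.] in one step
```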
16. Summary of Key Formulas
A quick-reference table of all major Jacobians and Hessians covered in this section.
JACOBIAN REFERENCE TABLE

| Function | Jacobian $J_f$ | Shape |
|---|---|---|
| $f(x) = Ax + b$ | $A$ | $(m, n)$ |
| $f(x) = \sigma(x)$ elementwise | $\mathrm{diag}(\sigma'(x))$ | $(n, n)$ |
| $f(x) = \mathrm{softmax}(x)$ | $\mathrm{diag}(p) - pp^T$ | $(K, K)$ |
| $f(x) = \mathrm{LayerNorm}(x)$ | $\frac{1}{\sigma}\left(I - \frac{1}{d}\mathbf{1}\mathbf{1}^T - \frac{1}{d}\hat{x}\hat{x}^T\right)$, $\hat{x} = (x - \mu)/\sigma$ | $(d, d)$ |
| $f(x) = \|x\|_2$ | $x^T/\|x\|_2$ (row vector) | $(1, n)$ |
| $f(x) = x/\|x\|_2$ | $\frac{1}{\|x\|}\left(I - \frac{xx^T}{\|x\|^2}\right)$ | $(n, n)$ |
| $f(x) = \log\sum_i e^{x_i}$ | $\mathrm{softmax}(x)^T$ (row vector) | $(1, n)$ |
| $f(x) = \mathrm{ReLU}(x)$ | $\mathrm{diag}(\mathbf{1}[x > 0])$ (a.e.) | $(n, n)$ |
| $f(x, y) = xy$ | $[y,\; x]$ (row Jacobian) | $(1, 2)$ |
| Chain: $f \circ g$ | $J_f(g(x))\, J_g(x)$ | $(m, n)$ |
HESSIAN REFERENCE TABLE

| Function $f(x)$ | Hessian $H_f$ | Definiteness |
|---|---|---|
| $\frac{1}{2}x^T A x$ ($A$ symmetric) | $A$ | same as $A$ |
| $\frac{1}{2}\|Ax - b\|^2$ | $A^T A$ | PSD |
| $\log\sum_i e^{x_i}$ | $\mathrm{diag}(p) - pp^T$ | PSD, rank $K - 1$ |
| logistic CE, 1 sample | $p(1-p)\, xx^T$ (rank 1) | PSD, rank 1 |
| logistic CE, $m$ samples | $\sum_i p_i(1 - p_i)\, x_i x_i^T$ | PSD |
| $f(x) = \|x\|^2$ | $2I$ | PD |
| $f(x) = 1/\|x\|$ | $(3xx^T - \|x\|^2 I)/\|x\|^5$ | Indefinite |
| ReLU network (as a function of its input) | $0$ almost everywhere | PSD (trivially) |
Cost summary:

| Computation | Naive cost | Efficient cost |
|---|---|---|
| Full Jacobian $J \in \mathbb{R}^{m \times n}$ | $n$ JVPs or $m$ VJPs | - |
| JVP $Jv$ (single vector) | form $J$ first: $O(mn)$ | 1 forward pass |
| VJP $J^T u$ (single vector) | form $J$ first: $O(mn)$ | 1 backward pass |
| Full gradient of scalar $L$ | $n$ finite differences | 1 backward pass, $O(\text{forward})$ |
| Full Hessian $H \in \mathbb{R}^{n \times n}$ | $O(n^2)$ finite differences | $n$ HVPs ($n$ backward passes) |
| HVP $Hv$ | form $H$: $O(n^2)$ memory | 2 backward passes, $O(n)$ memory |
| Top eigenvalue $\lambda_{\max}$ | full eigendecomposition, $O(n^3)$ | 10-50 HVPs (power iteration) |
| Solve $Hx = v$ (CG) | direct solve, $O(n^3)$ | $O(\sqrt{\kappa})$ HVPs |
| K-FAC update per layer | invert the full layer FIM block | invert the two Kronecker factors $A$ and $G$ |
References
- Spivak, M. (1965). Calculus on Manifolds. Benjamin. - The rigorous treatment of Fréchet derivatives and the inverse/implicit function theorems.
- Rumelhart, D., Hinton, G., Williams, R. (1986). "Learning representations by back-propagating errors." Nature, 323. - The original backpropagation paper; backprop = iterated VJP.
- Baydin, A., Pearlmutter, B., Radul, A., Siskind, J. (2018). "Automatic differentiation in machine learning: a survey." JMLR, 18(153). - Comprehensive AD survey covering forward/reverse mode and higher-order derivatives.
- Pearlmutter, B. (1994). "Fast exact multiplication by the Hessian." Neural Computation, 6(1). - The original HVP identity paper.
- Martens, J., Grosse, R. (2015). "Optimizing neural networks with Kronecker-factored approximate curvature." ICML. - The K-FAC paper.
- Ghorbani, B., Krishnan, S., Xiao, Y. (2019). "An investigation into neural net optimization via Hessian eigenvalue density." ICML. - Hessian spectrum analysis of deep networks.
- Cohen, J., Kaur, S., Li, Y., Kolter, J., Talwalkar, A. (2021). "Gradient descent on neural networks typically occurs at the edge of stability." ICLR. - Edge of stability and training dynamics.
- Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B. (2021). "Sharpness-Aware Minimization for Efficiently Improving Generalization." ICLR. - The SAM algorithm.
- Hu, E., Shen, Y., Wallis, P., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR. - LoRA; a Jacobian perspective on parameter-efficient fine-tuning.
- Su, J., Lu, Y., Pan, S., et al. (2024). "RoFormer: Enhanced transformer with rotary position embedding." Neurocomputing. - RoPE; orthogonal Jacobian design for position embeddings.
- Kook, Y., Sra, S. (2024). "Geometric analysis of neural collapse." - Geometric perspective on loss landscape and Hessian structure.
- Boyd, S., Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. - Definitive reference for second-order analysis, Newton's method, and optimality conditions.
End of 02 Jacobians and Hessians
<- Back to Chapter 05: Multivariate Calculus | Next: 03 Optimization on Manifolds ->