Math for LLMs / Multivariate Calculus

Jacobians and Hessians

"The Jacobian is the derivative - not a number, not a vector, but a linear map. Once you see it that way, everything in deep learning becomes composition of linear maps."

  • paraphrasing Michael Spivak, Calculus on Manifolds

Overview

The gradient $\nabla f(\mathbf{x}) \in \mathbb{R}^n$ is the right object when $f$ is scalar-valued. Neural networks are not scalar-valued: every layer maps a vector to a vector, every activation transforms a vector elementwise, and the full forward pass is a long chain of vector-to-vector maps. When you differentiate a vector-valued function you get a matrix of partial derivatives - the Jacobian. The Jacobian is the derivative in the correct sense: it is the unique linear map that best approximates the function near a point.

The Hessian is the second derivative. Once you have the gradient $\nabla f: \mathbb{R}^n \to \mathbb{R}^n$ (which is vector-valued), you can differentiate it again to get the Jacobian of the gradient - an $n \times n$ matrix called the Hessian $H_f$. The Hessian encodes curvature: how quickly the gradient rotates and stretches as you move through parameter space. Curvature determines whether a critical point is a minimum, maximum, or saddle; it determines how fast gradient descent converges; and it is the object that second-order optimisers (Newton, K-FAC, Shampoo) use to accelerate training.

This section builds both objects completely. We prove every key result from first principles: the Fréchet derivative definition, Clairaut's theorem with a complete proof and a counterexample, the second-order Taylor theorem with remainder, the Newton step derivation, and the efficient HVP identity. We derive from scratch the Jacobian of every major ML operation: linear layers, elementwise activations, softmax, layer normalisation. We connect everything to modern practice: why backpropagation is VJP composition, why the K-FAC approximation works, why SAM seeks flat minima, and how influence functions use HVPs.

Prerequisites

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Jacobian and Hessian computation, ML layer Jacobians, Taylor approximation, Hessian spectrum, Newton's method, HVPs |
| exercises.ipynb | 10 graded exercises from Jacobian computation through Gauss-Newton and SAM |

Learning Objectives

After completing this section, you will be able to:

  1. Define the Jacobian $J_f \in \mathbb{R}^{m \times n}$ for $f: \mathbb{R}^n \to \mathbb{R}^m$ via the Fréchet derivative and compute it for any differentiable function
  2. Prove Clairaut's theorem on the symmetry of mixed partial derivatives, and give a concrete counterexample showing when it fails
  3. Define the Hessian as $H_f = J_{\nabla f}$ and compute it analytically for quadratic, log-sum-exp, and neural network loss functions
  4. Derive the second-order multivariate Taylor theorem with explicit remainder, and use it to build quadratic models of loss functions
  5. Classify critical points using Hessian eigenvalues: positive definite -> strict minimum, indefinite -> saddle
  6. Derive from scratch the Jacobian of a linear layer, elementwise activation, softmax, and layer normalisation - including the full proof for softmax
  7. Explain JVPs (forward mode) and VJPs (reverse mode), and prove why reverse mode requires only one pass for a scalar loss
  8. Derive the Newton step from the quadratic model and explain its quadratic convergence
  9. Implement Hessian-vector products in $O(n)$ time without materialising the $n \times n$ Hessian
  10. State and apply the Gauss-Newton and K-FAC approximations, explaining why $H_{\text{GN}} = J^\top J \succeq 0$



1. Intuition

1.1 From Gradient to Jacobian

Recall from 01 that for a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient $\nabla f(\mathbf{x}) \in \mathbb{R}^n$ captures all the first-order information: knowing $\nabla f(\mathbf{x})$, we can predict the change in $f$ along any direction $\mathbf{u}$ as $\nabla f(\mathbf{x})^\top \mathbf{u}$. The fundamental linear approximation is:

$$f(\mathbf{x}+\boldsymbol{\delta}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top\boldsymbol{\delta}$$

Now suppose the function is vector-valued: $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$. It produces $m$ output numbers from $n$ input numbers. We need to capture how each of the $m$ outputs changes as we vary each of the $n$ inputs. There are $m \times n$ such partial derivatives - they form a matrix. This matrix is the Jacobian.

The linear approximation generalises cleanly: if $\boldsymbol{\delta} \in \mathbb{R}^n$ is a small input perturbation, then the output perturbation is approximately:

$$\mathbf{f}(\mathbf{x}+\boldsymbol{\delta}) \approx \mathbf{f}(\mathbf{x}) + J_f(\mathbf{x})\,\boldsymbol{\delta} \in \mathbb{R}^m$$

where $J_f(\mathbf{x}) \in \mathbb{R}^{m \times n}$ is the Jacobian. The product $J_f\boldsymbol{\delta}$ is a matrix-vector product - not a dot product. This is the key structural difference between scalar and vector differentiation.
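
A quick numerical sanity check of this first-order approximation - a minimal sketch assuming PyTorch; the toy map f below is an arbitrary illustration:

```python
import torch
from torch.autograd.functional import jacobian

# A small vector-valued map f: R^3 -> R^2 (illustrative example).
def f(x):
    return torch.stack([torch.sin(x[0]) * x[1], x[2]**2 - x[0]])

x = torch.tensor([0.5, -1.0, 2.0])
delta = 1e-4 * torch.randn(3)

J = jacobian(f, x)                               # shape (2, 3)
pred = f(x) + J @ delta                          # first-order prediction
err = (f(x + delta) - pred).norm() / delta.norm()
print(float(err))                                # O(||delta||): shrinks with delta
```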

SCALAR FUNCTION vs VECTOR FUNCTION


  f: R^n -> R                     g: R^n -> R^m

  Derivative: ∇f ∈ R^n            Derivative: J_g ∈ R^{m×n}

  df ≈ ∇f · δ                     dg ≈ J_g δ
       (dot product)                   (matrix-vector product)

  Row i of J_g = (∇g_i)ᵀ          gradient of i-th output component
  Col j of J_g = ∂g/∂x_j          how ALL outputs respond to x_j

  Shape check:
    δ ∈ R^n,  J_g ∈ R^{m×n}  ->  J_g δ ∈ R^m

Why this matters for neural networks. A transformer layer takes an embedding $\mathbf{x} \in \mathbb{R}^d$ and produces another embedding $\mathbf{y} \in \mathbb{R}^d$. The Jacobian $J_f(\mathbf{x}) \in \mathbb{R}^{d \times d}$ captures how every output dimension responds to every input dimension. In backpropagation, when a gradient $\boldsymbol{\delta}^{\text{out}} \in \mathbb{R}^d$ arrives from the next layer, the upstream gradient is $\boldsymbol{\delta}^{\text{in}} = J_f(\mathbf{x})^\top \boldsymbol{\delta}^{\text{out}} \in \mathbb{R}^d$ - the transposed Jacobian acting on the incoming gradient. Backpropagation is Jacobian transposition applied repeatedly.

The Jacobian also governs whether information propagates through the network without distortion. If $\|J_f\|_2 \gg 1$, gradients explode. If $\|J_f\|_2 \ll 1$, gradients vanish. This is why initialisation schemes (Xavier, He) and normalisation layers (BatchNorm, LayerNorm) are designed to keep Jacobians near the identity in spectral norm.

1.2 From Gradient to Hessian

The gradient $\nabla f: \mathbb{R}^n \to \mathbb{R}^n$ is a vector-valued function. We just established that vector-valued functions have Jacobians. So: what is the Jacobian of the gradient? It is the Hessian:

$$H_f(\mathbf{x}) = J_{\nabla f}(\mathbf{x}) \in \mathbb{R}^{n \times n}$$

The $(i,j)$ entry of $H_f$ is:

$$[H_f]_{ij} = \frac{\partial [\nabla f]_i}{\partial x_j} = \frac{\partial}{\partial x_j}\frac{\partial f}{\partial x_i} = \frac{\partial^2 f}{\partial x_j \partial x_i}$$

This measures how the $i$-th component of the gradient changes when you increase $x_j$. Geometrically: the Hessian describes how the gradient rotates and stretches as you move through space. This is curvature.

Curvature in direction $\mathbf{u}$: Consider the function restricted to the line through $\mathbf{x}$ in direction $\mathbf{u}$: $g(t) = f(\mathbf{x}+t\mathbf{u})$. The curvature of $g$ at $t=0$ is:

$$g''(0) = \frac{d^2}{dt^2}\bigg|_{t=0} f(\mathbf{x}+t\mathbf{u}) = \mathbf{u}^\top H_f(\mathbf{x})\,\mathbf{u}$$

The Hessian thus answers: "how fast does $f$ curve when I walk in direction $\mathbf{u}$?" The answer is the quadratic form $\mathbf{u}^\top H_f \mathbf{u}$ - the Rayleigh quotient when $\mathbf{u}$ is a unit vector.
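
A small check of this identity against a finite difference - a sketch assuming PyTorch; the scalar function is an arbitrary smooth example:

```python
import torch
from torch.autograd.functional import hessian

# Directional curvature: u^T H u should match g''(0) for g(t) = f(x + t u).
def f(x):
    return x[0]**2 + 0.5 * x[0] * x[1] + torch.sin(x[1])

x = torch.tensor([0.3, -1.2])
u = torch.tensor([1.0, 2.0])
u = u / u.norm()

H = hessian(f, x)
quad = u @ H @ u                         # Rayleigh quotient u^T H u

t = 1e-3
g = lambda s: f(x + s * u)
fd = (g(t) - 2 * g(0.0) + g(-t)) / t**2  # central second difference
print(float(quad), float(fd))            # agree to ~1e-4
```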

Why the sign of curvature matters for optimisation. If every direction $\mathbf{u}$ has positive curvature ($\mathbf{u}^\top H_f \mathbf{u} > 0$ for all $\mathbf{u} \neq \mathbf{0}$), the function is locally bowl-shaped - every direction goes up from the bottom, confirming we are at a local minimum. If some direction has negative curvature, we can slide downhill in that direction - we are at a saddle point, not a minimum. If all curvatures are negative, we are at a local maximum.

CURVATURE AND CRITICAL POINTS


  All eigenvalues λ > 0:     uᵀHu > 0 for all u ≠ 0

      \        /
       \ bowl /       <-  curved upward in ALL directions
        \____/            local MINIMUM  (∇f = 0 here)

  Mixed eigenvalues:         ∃ u, v: uᵀHu > 0, vᵀHv < 0

        saddle            <-  goes up in some directions, down in others
                              SADDLE POINT  (∇f = 0 here)

  All eigenvalues λ < 0:     uᵀHu < 0 for all u ≠ 0

         ____
        / hill \      <-  curved downward in ALL directions
       /        \         local MAXIMUM  (∇f = 0 here)

For neural networks, the Hessian of the loss $H_\mathcal{L}(\boldsymbol{\theta})$ has a rich spectrum: thousands of near-zero eigenvalues (flat directions in parameter space) and a handful of large eigenvalues (sharply curved directions, often corresponding to the most influential parameters). The ratio $\lambda_{\max}/\lambda_{\min}$ - the condition number - determines how hard the optimisation problem is for gradient descent.

1.3 Historical Context

| Year | Contributor | Development |
| --- | --- | --- |
| 1841 | Carl Jacobi | Introduced functional determinants in the study of coordinate transforms - what we now call the Jacobian matrix |
| 1844 | Ludwig Hesse | Second-order determinant for classifying critical points of functions of two variables |
| 1867 | Hesse | Expanded to the full matrix of second partials; the term "Hessian" was coined posthumously |
| 1925 | Morse | Morse lemma: non-degenerate critical points ($\det H \neq 0$) have a normal form determined entirely by the Hessian |
| 1960s | Levenberg, Marquardt | Gauss-Newton + regularisation for nonlinear least squares - still the workhorse for nonlinear optimisation |
| 1986 | Rumelhart, Hinton, Williams | Popularised backpropagation = repeated VJP composition through Jacobians |
| 1992 | LeCun, Simard, Pearlmutter | Diagonal Hessian approximation for adaptive learning rates - precursor to Adagrad/Adam |
| 1998 | Amari | Natural gradient = Fisher information (expected Hessian of the log-likelihood) as preconditioner |
| 2015 | Martens & Grosse | K-FAC: Kronecker-factored curvature approximation; practical second-order training |
| 2019 | Ghorbani, Krishnan, Xiao | Empirical Hessian spectrum of deep networks: bulk + outlier structure |
| 2020 | Foret et al. | SAM: sharpness-aware minimisation; seeking parameters with small $\lambda_{\max}(H)$ |
| 2022 | Vyas et al. | Influence functions at GPT scale via HVPs and the Arnoldi method |
| 2023 | Google Brain | Distributed Shampoo in production LLM training; K-FAC for trillion-parameter models |

1.4 Why These Matrices Define Deep Learning

Every significant algorithm in modern deep learning uses Jacobians or Hessians, often implicitly:

Jacobians are present in:

  • Backpropagation. For a network $\mathcal{L} = \ell(f_L(\cdots f_1(\mathbf{x})))$, the gradient with respect to the input of layer $l$ is computed via $\boldsymbol{\delta}^{(l)} = J_{f_l}(\mathbf{x}^{(l)})^\top \boldsymbol{\delta}^{(l+1)}$. Every backward pass is a sequence of transposed Jacobian multiplications.

  • LoRA and low-rank updates. The weight gradient for a linear layer is $\nabla_W \mathcal{L} = \boldsymbol{\delta}\mathbf{a}^\top$ - a rank-1 outer product. LoRA exploits the fact that the effective gradient update across a training run lives in a low-rank subspace, parameterising $\Delta W = AB$ with rank $r \ll \min(m,n)$.

  • Normalising flows. Generative models (RealNVP, Glow, Neural ODEs) learn invertible maps. The log-likelihood requires $\log|\det J_f(\mathbf{x})|$ - the log-volume scaling factor of the Jacobian.

  • Gradient explosion/vanishing. For a depth-$L$ network, $\|\partial\mathcal{L}/\partial\mathbf{x}^{(1)}\|_2 \leq \prod_{l=1}^L \|J_{f_l}\|_2$. If each $\|J_{f_l}\|_2 > 1$, gradients grow exponentially (explosion); if each is $< 1$, they decay exponentially (vanishing).

Hessians are present in:

  • Newton's method and quasi-Newton (L-BFGS). The Newton step $\boldsymbol{\delta}^* = -H^{-1}\nabla\mathcal{L}$ converges quadratically near the minimum. L-BFGS approximates $H^{-1}$ via a rank-$k$ update built from gradient differences.

  • K-FAC and Shampoo. Approximate the Hessian (or Fisher) using Kronecker products of per-layer statistics. Used in production training at Google (Shampoo for Gemini-scale models).

  • SAM (Sharpness-Aware Minimisation). Explicitly targets parameters where $\lambda_{\max}(H_\mathcal{L})$ is small, achieving better generalisation by finding flat minima.

  • Influence functions. The influence of training example $i$ on the test loss is $\approx -\nabla\mathcal{L}_{\text{test}}^\top H^{-1}\nabla\mathcal{L}_i$, which requires solving the linear system $H\mathbf{u} = \nabla\mathcal{L}_{\text{test}}$ via Hessian-vector products.

  • Loss landscape analysis. The ratio $\lambda_{\max}/\lambda_{\min}$ determines convergence speed; the spectral density reveals whether the landscape is nearly flat (overparameterised regime) or sharply curved.


2. The Jacobian Matrix

2.1 Formal Definition via Fréchet Derivative

Before writing down the Jacobian matrix, it helps to understand the conceptual definition from which the matrix naturally emerges.

Definition (Fréchet derivative). Let $f: U \subseteq \mathbb{R}^n \to \mathbb{R}^m$ be defined on an open set $U$. We say $f$ is differentiable at $\mathbf{x} \in U$ if there exists a linear map $L: \mathbb{R}^n \to \mathbb{R}^m$ such that:

$$\lim_{\|\boldsymbol{\delta}\| \to 0} \frac{\|f(\mathbf{x}+\boldsymbol{\delta}) - f(\mathbf{x}) - L\boldsymbol{\delta}\|}{\|\boldsymbol{\delta}\|} = 0$$

In other words: $L$ is the best linear approximation to $f$ near $\mathbf{x}$, in the sense that the error $f(\mathbf{x}+\boldsymbol{\delta}) - f(\mathbf{x}) - L\boldsymbol{\delta}$ is sublinear in $\|\boldsymbol{\delta}\|$ (it goes to zero faster than $\boldsymbol{\delta}$ itself).

Theorem (Uniqueness). If such an $L$ exists, it is unique. Moreover, the matrix of $L$ in the standard bases is exactly the matrix of partial derivatives.

Definition (Jacobian matrix). The unique linear map $L$ from the Fréchet definition, written as an $m \times n$ matrix, is the Jacobian of $f$ at $\mathbf{x}$:

$$J_f(\mathbf{x}) = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_n} \\ \vdots & & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \dfrac{\partial f_m}{\partial x_2} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{pmatrix} \in \mathbb{R}^{m \times n}$$

The $(i,j)$ entry is $[J_f]_{ij} = \partial f_i / \partial x_j$.

Proof that the Fréchet derivative equals the Jacobian (sketch). Apply the Fréchet condition with $\boldsymbol{\delta} = t\mathbf{e}_j$ (a scalar multiple of the $j$-th basis vector) and let $t \to 0$:

$$[L]_{ij} = \lim_{t\to 0}\frac{f_i(\mathbf{x}+t\mathbf{e}_j) - f_i(\mathbf{x})}{t} = \frac{\partial f_i}{\partial x_j}$$

So the matrix entries of $L$ are precisely the partial derivatives. $\square$

Important caveat. The converse fails: the existence of all partial derivatives does not guarantee Fréchet differentiability. You additionally need the partial derivatives to be continuous, or the limit in the Fréchet definition to hold in all directions simultaneously (not just the coordinate directions). The standard sufficient condition: if all partial derivatives exist and are continuous at $\mathbf{x}$, then $f$ is Fréchet differentiable at $\mathbf{x}$ (the $C^1$ condition).

Notation. We write $J_f(\mathbf{x})$, $Df(\mathbf{x})$, $f'(\mathbf{x})$, or $\partial\mathbf{f}/\partial\mathbf{x}$ interchangeably. The Jacobian depends on the point $\mathbf{x}$ - it is a matrix-valued function of the input.

Row and column interpretations:

TWO WAYS TO READ THE JACOBIAN


  J_f ∈ R^{m×n}  for  f: R^n -> R^m

          x_1  x_2  x_3  ...  x_n
        ┌                        ┐
   f_1  │  row 1 = (∇f_1)ᵀ       │   <- transposed gradient of output 1
   f_2  │  row 2 = (∇f_2)ᵀ       │   <- transposed gradient of output 2
   f_3  │  row 3 = (∇f_3)ᵀ       │
   ...  │         ...            │
   f_m  │  row m = (∇f_m)ᵀ       │
        └                        ┘
           col j = ∂f/∂x_j  :  sensitivity of ALL outputs to x_j

  Row reading:    "Given that I move in all n directions,
                   how does output i respond?"

  Column reading: "If I increase input j alone,
                   how do all m outputs respond?"

2.2 Special Cases

The Jacobian unifies all the derivative objects from earlier chapters:

Case 1: Scalar function $f: \mathbb{R}^n \to \mathbb{R}$ (i.e., $m=1$). The Jacobian is a $1 \times n$ row vector: $J_f = (\nabla f)^\top$. The gradient is the transposed Jacobian. This is why we write $\nabla f$ as a column vector and $J_f$ as a row vector for scalar functions - they are transposes of each other.

Case 2: Linear map $f(\mathbf{x}) = A\mathbf{x}+\mathbf{b}$ with $A \in \mathbb{R}^{m \times n}$. $[J_f]_{ij} = \partial(A\mathbf{x}+\mathbf{b})_i/\partial x_j = A_{ij}$. So $J_f = A$ everywhere - the Jacobian of a linear map is the matrix of that map, constant at every point. This is the vector analogue of $(ax+b)' = a$.

Case 3: Identity map $f(\mathbf{x}) = \mathbf{x}$. $[J_f]_{ij} = \partial x_i/\partial x_j = \delta_{ij}$, so $J_f = I_n$.

Case 4: Scalar function of a scalar $f: \mathbb{R} \to \mathbb{R}$. $J_f = [f'(x)]$ - a $1 \times 1$ matrix, i.e., the ordinary derivative.

Case 5: Vector function of a scalar $f: \mathbb{R} \to \mathbb{R}^m$ (a parametric curve). $J_f = f'(t) \in \mathbb{R}^m$ - a column vector, the velocity vector. Shape: $m \times 1$.

| Map type | Inputs | Outputs | $J_f$ shape | Name |
| --- | --- | --- | --- | --- |
| $f:\mathbb{R}^n\to\mathbb{R}$ | $n$ | $1$ | $1\times n$ | Transposed gradient |
| $f:\mathbb{R}^n\to\mathbb{R}^m$ | $n$ | $m$ | $m\times n$ | Jacobian matrix |
| $f:\mathbb{R}^n\to\mathbb{R}^n$ | $n$ | $n$ | $n\times n$ | Square Jacobian |
| $f:\mathbb{R}\to\mathbb{R}$ | $1$ | $1$ | $1\times 1$ | Ordinary derivative |
| $f(\mathbf{x})=A\mathbf{x}$ | $n$ | $m$ | $m\times n$ | $= A$ |

Non-example: a function with partials but no Jacobian.

$$f(x,y) = \begin{cases} \dfrac{xy}{\sqrt{x^2+y^2}} & (x,y)\neq(0,0) \\ 0 & (x,y)=(0,0) \end{cases}$$

Both partials $\partial f/\partial x$ and $\partial f/\partial y$ equal $0$ at the origin. But $f$ is not differentiable there: along the diagonal, $f(t,t) = t^2/\sqrt{2t^2} = |t|/\sqrt{2}$, so the Fréchet error ratio is $|f(t,t) - f(0,0) - \mathbf{0}^\top(t,t)| / \|(t,t)\| = (|t|/\sqrt{2})/(\sqrt{2}\,|t|) = 1/2$, which does not vanish as $t \to 0$. The Fréchet condition fails. This shows why we need continuity of the partials, not just their existence.
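
A numeric illustration of the failure, in plain Python: the error ratio along the diagonal stays at $1/2$ instead of shrinking.

```python
import math

# f(x,y) = xy / sqrt(x^2 + y^2): partials at the origin are 0, but the
# Frechet error ratio with candidate derivative L = 0 never vanishes.
def f(x, y):
    return x * y / math.sqrt(x**2 + y**2)

for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    err_ratio = f(t, t) / (math.sqrt(2) * t)   # |f(δ) - f(0) - 0·δ| / ‖δ‖
    print(err_ratio)                           # stays at 0.5, never -> 0
```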

2.3 Computing Jacobians Systematically

Procedure. To compute $J_f(\mathbf{x})$ for $f: \mathbb{R}^n \to \mathbb{R}^m$:

Row-by-row method (one output at a time): for each $i = 1,\ldots,m$, treat $f_i(\mathbf{x})$ as a scalar function and differentiate it with respect to each $x_j$. Place the resulting partial derivatives in row $i$.

Column-by-column method (one input at a time): for each $j = 1,\ldots,n$, differentiate the entire vector $\mathbf{f}(\mathbf{x})$ with respect to $x_j$ (treating all other inputs as constants). Place the resulting $m$-vector in column $j$.

Standard Jacobian table:

| Function | Domain -> Codomain | Jacobian |
| --- | --- | --- |
| $f(\mathbf{x}) = A\mathbf{x}+\mathbf{b}$ | $\mathbb{R}^n\to\mathbb{R}^m$ | $A$ |
| $f(\mathbf{x}) = \sigma(\mathbf{x})$ (elementwise) | $\mathbb{R}^n\to\mathbb{R}^n$ | $\text{diag}(\sigma'(\mathbf{x}))$ |
| $f(\mathbf{x}) = \mathbf{a}^\top\mathbf{x}$ (scalar) | $\mathbb{R}^n\to\mathbb{R}$ | $\mathbf{a}^\top$ (row vector) |
| $f(\mathbf{x}) = \lVert\mathbf{x}\rVert^2$ (scalar) | $\mathbb{R}^n\to\mathbb{R}$ | $2\mathbf{x}^\top$ (row vector) |
| $f(\mathbf{x}) = \mathbf{x}/\lVert\mathbf{x}\rVert$ | $\mathbb{R}^n\to\mathbb{R}^n$ | $\frac{1}{\lVert\mathbf{x}\rVert}(I - \hat{\mathbf{x}}\hat{\mathbf{x}}^\top)$ |
| $f(\mathbf{x}) = \text{softmax}(\mathbf{x})$ | $\mathbb{R}^K\to\mathbb{R}^K$ | $\text{diag}(\mathbf{p})-\mathbf{p}\mathbf{p}^\top$ |
| $f(\mathbf{x}) = \text{ReLU}(\mathbf{x})$ | $\mathbb{R}^n\to\mathbb{R}^n$ | $\text{diag}(\mathbf{1}[\mathbf{x}>0])$ |
| $f(\mathbf{x}) = \mathbf{x}\mathbf{x}^\top$ (matrix output) | $\mathbb{R}^n\to\mathbb{R}^{n\times n}$ | tensor: $[J_f]_{ij,k} = x_i\delta_{jk}+x_j\delta_{ik}$ |

2.4 Jacobian Determinant and Volume Scaling

When $m = n$ (square Jacobian), the Jacobian determinant $\det J_f(\mathbf{x})$ has a direct geometric meaning: it measures by what factor $f$ scales volumes near $\mathbf{x}$.

Formally: if $S \subset \mathbb{R}^n$ is a small region near $\mathbf{x}$, then $\text{vol}(f(S)) \approx |\det J_f(\mathbf{x})| \cdot \text{vol}(S)$.

  • $|\det J_f| > 1$: $f$ expands local volume
  • $|\det J_f| < 1$: $f$ contracts local volume
  • $|\det J_f| = 0$: $f$ collapses volume (the map is locally degenerate, not invertible)
  • $\det J_f < 0$: $f$ reverses orientation (a reflection-like component)

Change of variables formula. For integration:

$$\int_{f(U)} h(\mathbf{y})\,d\mathbf{y} = \int_U h(f(\mathbf{x}))\,|\det J_f(\mathbf{x})|\,d\mathbf{x}$$

This is the multivariate analogue of substitution: $\int h(f(x))\,|f'(x)|\,dx$.

Application to normalising flows. A normalising flow models the data density $p(\mathbf{x})$ by learning an invertible map $f: \mathbf{z} \mapsto \mathbf{x}$ from a simple base density $p_Z(\mathbf{z})$ (e.g., Gaussian). By change of variables:

$$\log p(\mathbf{x}) = \log p_Z(f^{-1}(\mathbf{x})) + \log\left|\det J_{f^{-1}}(\mathbf{x})\right|$$

Training requires computing $\log|\det J_f|$ for each batch element. Standard Gaussian elimination would cost $O(n^3)$ - prohibitive for $n = 10^5$. Flow architectures (RealNVP, Glow) are designed so that $J_f$ is triangular or has Kronecker structure, making the determinant $O(n)$ to evaluate.
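
For small $n$, the brute-force route is easy to demonstrate - a sketch assuming PyTorch; the map f is a toy elementwise bijection, so its Jacobian happens to be diagonal:

```python
import torch
from torch.autograd.functional import jacobian

# Brute-force log|det J_f| for a toy invertible map; this is the O(n^3)
# computation that flow architectures avoid by construction.
def f(z):
    return torch.tanh(z) + 1.5 * z      # elementwise slope 2.5 - tanh(z)^2 > 0

z = torch.randn(5)
J = jacobian(f, z)                      # (5, 5); diagonal for an elementwise map
sign, logabsdet = torch.linalg.slogdet(J)
print(float(logabsdet))                 # equals sum(log(2.5 - tanh(z)**2))
```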

Singular values and volume scaling. More precisely: $|\det J_f| = \prod_i \sigma_i(J_f)$, where the $\sigma_i$ are the singular values. The spectral norm $\|J_f\|_2 = \sigma_{\max}$ gives the maximum stretching factor in any direction. The condition number $\kappa(J_f) = \sigma_{\max}/\sigma_{\min}$ measures how anisotropic the stretching is.

2.5 Jacobian of a Composition (Preview)

The chain rule for Jacobians states: if $h = f \circ g$ (i.e., $h(\mathbf{x}) = f(g(\mathbf{x}))$), then:

$$J_h(\mathbf{x}) = J_f(g(\mathbf{x})) \cdot J_g(\mathbf{x})$$

The Jacobian of a composed function is the product of Jacobians - outer function first.

Shape check: if $g: \mathbb{R}^n \to \mathbb{R}^k$ and $f: \mathbb{R}^k \to \mathbb{R}^m$, then $J_g \in \mathbb{R}^{k\times n}$, $J_f \in \mathbb{R}^{m\times k}$, and $J_f J_g \in \mathbb{R}^{m\times n}$ - matching $J_h \in \mathbb{R}^{m\times n}$ for $h:\mathbb{R}^n\to\mathbb{R}^m$.
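
A quick autograd check of this product rule - a sketch assuming PyTorch; g and f are arbitrary toy maps:

```python
import torch
from torch.autograd.functional import jacobian

# Check J_{f∘g}(x) = J_f(g(x)) @ J_g(x) on toy maps.
g = lambda x: torch.stack([x[0] * x[1], torch.sin(x[2]), x.sum()])  # R^3 -> R^3
f = lambda y: torch.stack([y[0] + y[1]**2, y[2] * y[0]])            # R^3 -> R^2

x = torch.randn(3)
J_composed = jacobian(lambda t: f(g(t)), x)        # (2, 3)
J_product = jacobian(f, g(x)) @ jacobian(g, x)     # (2, 3), via the chain rule
print(torch.allclose(J_composed, J_product, atol=1e-6))
```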

This is the mathematical theorem that underlies backpropagation. A neural network $\mathcal{L} = \ell \circ f_L \circ \cdots \circ f_1$ has Jacobian:

$$J_\mathcal{L} = J_\ell \cdot J_{f_L} \cdot J_{f_{L-1}} \cdots J_{f_1}$$

Since $\mathcal{L}$ is scalar, $J_\mathcal{L} = (\nabla_\mathbf{x}\mathcal{L})^\top$ is a $1 \times n$ row vector - the transposed input gradient.

Full proof and derivation: 03 Chain Rule and Backpropagation derives this rigorously and applies it to the full backpropagation algorithm.

2.6 Worked Examples with Full Derivations

Example 1: Componentwise function. $f: \mathbb{R}^3 \to \mathbb{R}^2$, $f(x,y,z) = (x^2 y + z,\; y^2 - xz)$.

Row 1 ($f_1 = x^2y + z$):

$$\frac{\partial f_1}{\partial x} = 2xy, \quad \frac{\partial f_1}{\partial y} = x^2, \quad \frac{\partial f_1}{\partial z} = 1$$

Row 2 ($f_2 = y^2 - xz$):

$$\frac{\partial f_2}{\partial x} = -z, \quad \frac{\partial f_2}{\partial y} = 2y, \quad \frac{\partial f_2}{\partial z} = -x$$

$$J_f(x,y,z) = \begin{pmatrix}2xy & x^2 & 1 \\ -z & 2y & -x\end{pmatrix}$$

At $(1, 2, 3)$:

$$J_f(1,2,3) = \begin{pmatrix}4 & 1 & 1 \\ -3 & 4 & -1\end{pmatrix}$$

Interpretation: if we perturb the input by $\boldsymbol{\delta} = (0.1, 0, 0)^\top$, the output changes by approximately $J_f\,\boldsymbol{\delta} = (0.4, -0.3)^\top$.
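
The same Jacobian can be confirmed with autograd (assuming PyTorch):

```python
import torch
from torch.autograd.functional import jacobian

# Autograd check of Example 1 at (1, 2, 3).
f = lambda v: torch.stack([v[0]**2 * v[1] + v[2], v[1]**2 - v[0] * v[2]])
print(jacobian(f, torch.tensor([1.0, 2.0, 3.0])))
# tensor([[ 4.,  1.,  1.],
#         [-3.,  4., -1.]])
```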

Example 2: Unit normalisation $f(\mathbf{x}) = \mathbf{x}/\lVert\mathbf{x}\rVert$.

Let $r = \lVert\mathbf{x}\rVert$. Then $f_i = x_i / r$.

$$\frac{\partial f_i}{\partial x_j} = \frac{\partial}{\partial x_j}\frac{x_i}{r} = \frac{\delta_{ij}\,r - x_i\,\frac{x_j}{r}}{r^2} = \frac{\delta_{ij}}{r} - \frac{x_i x_j}{r^3} = \frac{1}{r}\left(\delta_{ij} - \frac{x_i x_j}{r^2}\right)$$

In matrix form, with $\hat{\mathbf{x}} = \mathbf{x}/r$:

$$J_f = \frac{1}{r}(I - \hat{\mathbf{x}}\hat{\mathbf{x}}^\top)$$

Interpretation: $J_f$ is a projection (scaled by $1/r$). The matrix $P = I - \hat{\mathbf{x}}\hat{\mathbf{x}}^\top$ projects onto the hyperplane perpendicular to $\hat{\mathbf{x}}$. So (see the check after this list):

  • Perturbations perpendicular to $\mathbf{x}$ are preserved (scaled by $1/r$) - perturbing perpendicular to a vector changes its direction
  • Perturbations parallel to $\mathbf{x}$ (i.e., scalings of $\mathbf{x}$) are killed by the projection - scaling a vector doesn't change its direction
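
A numerical check of the formula and the projection behaviour (assuming PyTorch):

```python
import torch
from torch.autograd.functional import jacobian

# Check J = (I - x̂ x̂ᵀ)/r for f(x) = x/‖x‖.
x = torch.randn(4)
r, xhat = x.norm(), x / x.norm()

J = jacobian(lambda v: v / v.norm(), x)
print(torch.allclose(J, (torch.eye(4) - torch.outer(xhat, xhat)) / r, atol=1e-6))
print(J @ x)    # ~0: radial (scaling) perturbations are annihilated
```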

Non-example: Jacobian of a max operation. $f(\mathbf{x}) = \max_i x_i$ (scalar). This function is not differentiable when two or more components tie for the maximum. At a point where $x_{i^*}$ is the unique maximum:

$$J_f = \mathbf{e}_{i^*}^\top \quad (\text{row vector with a 1 in position } i^*, \text{ zeros elsewhere})$$

At a tie, the Fréchet derivative does not exist. The Clarke subdifferential is the convex hull of $\{\mathbf{e}_i : x_i = \max_j x_j\}$.


3. Jacobians of Key ML Operations

3.1 Linear Layer

Setup. A linear (dense) layer computes $f(\mathbf{x}) = W\mathbf{x}+\mathbf{b}$ with $W \in \mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$, $\mathbf{b} \in \mathbb{R}^{d_{\text{out}}}$.

Jacobian w.r.t. the input $\mathbf{x}$:

$$[J_f]_{ij} = \frac{\partial (W\mathbf{x}+\mathbf{b})_i}{\partial x_j} = \frac{\partial}{\partial x_j}\left(\sum_k W_{ik}x_k + b_i\right) = W_{ij}$$

So $J_f(\mathbf{x}) = W$. The Jacobian of a linear layer is constant (the same at every input point) and equals the weight matrix.

Backpropagation through a linear layer. Given the upstream gradient $\boldsymbol{\delta}^{\text{out}} \in \mathbb{R}^{d_{\text{out}}}$ (i.e., $\partial\mathcal{L}/\partial(W\mathbf{x}+\mathbf{b})$), the downstream gradient is:

$$\boldsymbol{\delta}^{\text{in}} = J_f^\top\,\boldsymbol{\delta}^{\text{out}} = W^\top\boldsymbol{\delta}^{\text{out}} \in \mathbb{R}^{d_{\text{in}}}$$

This is the fundamental backprop equation for a linear layer: multiply by the transposed weight matrix.

Gradient w.r.t. the weight matrix $W$. The function $g(W) = W\mathbf{x}+\mathbf{b}$ is also linear in $W$. Consider $\text{vec}(W) \in \mathbb{R}^{d_{\text{out}}d_{\text{in}}}$. The Jacobian of $g$ w.r.t. $\text{vec}(W)$ is:

$$J_g = \mathbf{x}^\top \otimes I_{d_{\text{out}}} \in \mathbb{R}^{d_{\text{out}} \times (d_{\text{out}}d_{\text{in}})}$$

where $\otimes$ is the Kronecker product. For a scalar loss $\mathcal{L}$, the weight gradient is:

$$\frac{\partial\mathcal{L}}{\partial W} = \boldsymbol{\delta}^{\text{out}}\mathbf{x}^\top \in \mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$$

This is a rank-1 outer product - for each training sample, the weight gradient has rank at most 1. The gradient accumulated over a mini-batch of size $B$ has rank at most $\min(B, d_{\text{out}}, d_{\text{in}})$.

Implication for LoRA. Since the weight gradient is a sum of rank-1 outer products $\boldsymbol{\delta}_i\mathbf{x}_i^\top$, the effective gradient subspace across an entire training run has rank much less than $\min(d_{\text{out}}, d_{\text{in}})$ - especially for large pre-trained models, where the task-specific signal is concentrated in a low-dimensional subspace. LoRA (Hu et al., 2021) exploits this by parameterising the weight update as $\Delta W = AB$ with $A \in \mathbb{R}^{d_{\text{out}}\times r}$, $B \in \mathbb{R}^{r\times d_{\text{in}}}$, $r \ll \min(d_{\text{out}}, d_{\text{in}})$. Training only $A$ and $B$ reduces the parameter count from $d_{\text{out}}d_{\text{in}}$ to $r(d_{\text{out}}+d_{\text{in}})$.
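
A minimal sketch of the rank-1 structure (assuming PyTorch; delta_out is a stand-in for the upstream gradient, not a quantity computed from a real loss):

```python
import torch

# The per-sample weight gradient of a linear layer is the rank-1
# outer product delta_out @ x^T.
d_in, d_out = 5, 3
W = torch.randn(d_out, d_in, requires_grad=True)
x = torch.randn(d_in)
delta_out = torch.randn(d_out)                 # stand-in for dL/d(Wx + b)

(W @ x).backward(delta_out)                    # VJP with cotangent delta_out
print(torch.allclose(W.grad, torch.outer(delta_out, x)))
print(torch.linalg.matrix_rank(W.grad))        # tensor(1)
```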

3.2 Elementwise Activations

Setup. $f(\mathbf{x}) = \sigma(\mathbf{x})$ applied elementwise: $f_i(\mathbf{x}) = \sigma(x_i)$ for each $i$.

Jacobian:

$$[J_f]_{ij} = \frac{\partial f_i}{\partial x_j} = \frac{\partial \sigma(x_i)}{\partial x_j} = \sigma'(x_i)\,\mathbf{1}[i=j]$$

$$J_f(\mathbf{x}) = \text{diag}(\sigma'(x_1),\ldots,\sigma'(x_n)) = \text{diag}(\sigma'(\mathbf{x}))$$

The Jacobian is diagonal - output $i$ depends only on input $i$. Backpropagation through an elementwise activation is elementwise multiplication: $\boldsymbol{\delta}^{\text{in}} = \sigma'(\mathbf{x}) \odot \boldsymbol{\delta}^{\text{out}}$.

Common activations in full detail:

ReLU: $\sigma(x) = \max(0, x)$, so $\sigma'(x) = \mathbf{1}[x>0]$ (1 for positive inputs, 0 for negative, undefined at 0). The Jacobian is a binary diagonal matrix. Every negative-input neuron contributes zero gradient - the "dead neuron" problem. At scale, if a large fraction of neurons receive negative pre-activations, the effective gradient signal becomes sparse.

Sigmoid: $\sigma(x) = 1/(1+e^{-x})$, so $\sigma'(x) = \sigma(x)(1-\sigma(x))$, with maximum value $1/4$ (at $x=0$). For deep networks, the product of $L$ sigmoid derivatives is at most $(1/4)^L \to 0$ exponentially - the vanishing gradient problem. This is why sigmoid was replaced by ReLU in most architectures.

tanh: $\sigma(x) = (e^x-e^{-x})/(e^x+e^{-x})$, so $\sigma'(x) = 1-\tanh^2(x) \in [0,1]$, with maximum $1$ (at $x=0$). Better than sigmoid (maximum derivative 1, not 1/4), but still vanishing for large $|x|$.

GELU: $\sigma(x) = x\,\Phi(x)$ where $\Phi$ is the standard Gaussian CDF, so $\sigma'(x) = \Phi(x) + x\,\phi(x)$ where $\phi$ is the Gaussian PDF. The derivative is approximately 1 for large positive $x$, 0 for large negative $x$, and smooth around 0. Default activation in BERT and GPT-2/3/4.

SiLU (Swish): $\sigma(x) = x\cdot\text{sigmoid}(x)$, so $\sigma'(x) = \text{sigmoid}(x) + x\cdot\text{sigmoid}(x)(1-\text{sigmoid}(x))$. Non-monotone (has a local minimum). Used in LLaMA, Mistral, and Gemma.

Gradient flow analysis. For a depth-$L$ network whose linear layers have $\|W^{(l)}\|_2 = 1$ and whose elementwise activations have Jacobians $D^{(l)} = \text{diag}(\sigma'(\mathbf{x}^{(l)}))$, the input-to-output Jacobian is:

$$J = D^{(L)}W^{(L)}\cdots D^{(1)}W^{(1)}$$

The spectral norm satisfies $\|J\|_2 \leq \prod_l \|D^{(l)}\|_2 \|W^{(l)}\|_2$. For sigmoid, $\|D^{(l)}\|_2 \leq 1/4$; for tanh, $\|D^{(l)}\|_2 \leq 1$. For large $L$, this product vanishes or explodes depending on the spectrum.

BatchNorm and LayerNorm are specifically designed to keep each factor $\|D^{(l)}W^{(l)}\|_2 \approx 1$ throughout training, preventing both explosion and vanishing.
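
A numerical sketch of the sigmoid collapse (assuming PyTorch; the depth, width, and $1/\sqrt{d}$ weight scaling are illustrative choices):

```python
import torch
from torch.autograd.functional import jacobian

# Depth-L sigmoid chain: the per-layer factor ||D W||_2 <= 0.25 * ||W||_2
# compounds, so the end-to-end Jacobian norm collapses with depth.
torch.manual_seed(0)
d, L = 32, 20
Ws = [torch.randn(d, d) / d**0.5 for _ in range(L)]   # spectral norm ~2 each

def net(x):
    h = x
    for W in Ws:
        h = torch.sigmoid(W @ h)
    return h

J = jacobian(net, torch.randn(d))
print(float(torch.linalg.matrix_norm(J, ord=2)))      # << 1: vanishing gradients
```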

3.3 Softmax Jacobian - Full Proof

Setup. The softmax function $\sigma: \mathbb{R}^K \to \Delta^{K-1}$ (the probability simplex) is defined by:

$$\sigma(\mathbf{z})_k = p_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}, \quad k = 1,\ldots,K$$

We want to compute all partial derivatives $\partial p_i / \partial z_j$ for $i, j \in \{1,\ldots,K\}$.

Full derivation. Let $Z = \sum_{j=1}^K e^{z_j}$ (the partition function). Then $p_i = e^{z_i}/Z$.

Case 1: $i = j$ (diagonal entries). Use the quotient rule:

$$\frac{\partial p_i}{\partial z_i} = \frac{\partial}{\partial z_i}\frac{e^{z_i}}{Z} = \frac{e^{z_i}\,Z - e^{z_i}\,e^{z_i}}{Z^2} = \frac{e^{z_i}}{Z} - \left(\frac{e^{z_i}}{Z}\right)^2 = p_i - p_i^2 = p_i(1-p_i)$$

Case 2: $i \neq j$ (off-diagonal entries). Since $\partial e^{z_i}/\partial z_j = 0$ (different index):

$$\frac{\partial p_i}{\partial z_j} = \frac{0 \cdot Z - e^{z_i}\,\partial Z/\partial z_j}{Z^2} = \frac{-e^{z_i}\,e^{z_j}}{Z^2} = -\frac{e^{z_i}}{Z}\cdot\frac{e^{z_j}}{Z} = -p_i p_j$$

Unifying both cases. We can write:

$$\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$$

where $\delta_{ij}$ is the Kronecker delta. In matrix form:

$$\boxed{J_\sigma(\mathbf{z}) = \text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top}$$

Verification. Entry $(i,j)$ of $\text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top$ is:

  • Diagonal ($i=j$): $p_i - p_i^2 = p_i(1-p_i)$
  • Off-diagonal ($i\neq j$): $0 - p_ip_j = -p_ip_j$

Properties of the softmax Jacobian:

Symmetry: $[J_\sigma]_{ij} = p_i(\delta_{ij}-p_j)$ and $[J_\sigma]_{ji} = p_j(\delta_{ji}-p_i)$. Since $\delta_{ij}=\delta_{ji}$, both equal $-p_ip_j$ for $i\neq j$ and $p_i(1-p_i)$ on the diagonal. So $J_\sigma = J_\sigma^\top$. (The softmax Jacobian is symmetric even though $\sigma:\mathbb{R}^K\to\mathbb{R}^K$ is nonlinear.)

Singularity: Consider $J_\sigma\mathbf{1}$. Entry $i$ is $\sum_j [J_\sigma]_{ij} = \sum_j (p_i\delta_{ij} - p_i p_j) = p_i - p_i\sum_j p_j = p_i - p_i = 0$. So $J_\sigma\mathbf{1} = \mathbf{0}$ - the all-ones vector is in the null space. This reflects that softmax is shift-invariant: $\sigma(\mathbf{z}+c\mathbf{1}) = \sigma(\mathbf{z})$ for any scalar $c$. Perturbations in the constant direction produce no change.

Rank $K-1$: one zero eigenvalue (direction $\mathbf{1}$) and $K-1$ positive eigenvalues. $J_\sigma$ is positive semidefinite.

Positive semidefiniteness: For any $\mathbf{v}$, $\mathbf{v}^\top J_\sigma \mathbf{v} = \mathbf{v}^\top(\text{diag}(\mathbf{p})-\mathbf{p}\mathbf{p}^\top)\mathbf{v} = \sum_i p_i v_i^2 - (\sum_i p_i v_i)^2$. By Jensen's inequality applied to the convex function $t^2$ with weights $p_i > 0$ summing to 1: $\sum_i p_i v_i^2 \geq (\sum_i p_i v_i)^2$. So $\mathbf{v}^\top J_\sigma\mathbf{v} \geq 0$, with equality iff all $v_i$ are equal (i.e., $\mathbf{v} \propto \mathbf{1}$). $\square$

Cross-entropy gradient elegance. When softmax feeds into the cross-entropy loss $\mathcal{L} = -\log p_y$ for true class $y$:

Step 1: $\nabla_\mathbf{p}\mathcal{L} = -\mathbf{e}_y/p_y$ (the gradient of the log w.r.t. the relevant probability).

Step 2: The chain rule gives $\nabla_\mathbf{z}\mathcal{L} = J_\sigma^\top \nabla_\mathbf{p}\mathcal{L} = (\text{diag}(\mathbf{p})-\mathbf{p}\mathbf{p}^\top)\cdot(-\mathbf{e}_y/p_y)$.

Entry $k$: $(p_k\delta_{ky} - p_kp_y) \cdot (-1/p_y) = -\delta_{ky} + p_k$.

So $[\nabla_\mathbf{z}\mathcal{L}]_k = p_k - \delta_{ky}$, i.e., $\nabla_\mathbf{z}\mathcal{L} = \mathbf{p} - \mathbf{e}_y$.

The combined gradient is simply prediction minus one-hot label - an elegant result. This is the gradient that flows backward through every transformer's output layer.
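
Both the Jacobian formula and the fused gradient are easy to verify (assuming PyTorch; the class index y is arbitrary):

```python
import torch
from torch.autograd.functional import jacobian

# Check J_softmax = diag(p) - p pᵀ and the fused gradient ∇_z L = p - e_y.
z = torch.randn(5, requires_grad=True)
p = torch.softmax(z.detach(), dim=0)

J = jacobian(lambda t: torch.softmax(t, dim=0), z.detach())
print(torch.allclose(J, torch.diag(p) - torch.outer(p, p), atol=1e-6))

y = 2                                       # arbitrary "true" class
loss = -torch.log_softmax(z, dim=0)[y]
loss.backward()
e_y = torch.zeros(5); e_y[y] = 1.0
print(torch.allclose(z.grad, p - e_y, atol=1e-6))
```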

3.4 Layer Normalization Jacobian - Full Derivation

Setup. Layer normalisation (Ba et al., 2016) standardises each token embedding independently:

$$\hat{x}_i = \frac{x_i - \mu}{\sigma}, \qquad \mu = \frac{1}{d}\sum_{j=1}^d x_j, \qquad \sigma = \sqrt{\frac{1}{d}\sum_{j=1}^d (x_j-\mu)^2 + \varepsilon}$$

We derive the Jacobian $\partial\hat{x}_i/\partial x_k$ of the normalisation step (ignoring the learnable scale/shift $\boldsymbol{\gamma}, \boldsymbol{\beta}$).

Step 1: Derivatives of $\mu$ and $\sigma$.

$\partial\mu/\partial x_k = 1/d$ (for every $k$).

For $\partial\sigma/\partial x_k$, differentiate $\sigma^2 = \frac{1}{d}\sum_j(x_j-\mu)^2 + \varepsilon$ using the chain rule:

$$2\sigma\frac{\partial\sigma}{\partial x_k} = \frac{2}{d}\sum_j(x_j-\mu)\left(\delta_{jk}-\frac{1}{d}\right) = \frac{2}{d}\left[(x_k-\mu) - \frac{1}{d}\sum_j(x_j-\mu)\right] = \frac{2}{d}(x_k-\mu)$$

(since $\sum_j(x_j-\mu)=0$). Therefore $\dfrac{\partial\sigma}{\partial x_k} = \dfrac{x_k-\mu}{d\sigma} = \dfrac{\hat{x}_k}{d}$.

Step 2: Derivative of $\hat{x}_i$. Apply the quotient rule to $\hat{x}_i = (x_i-\mu)/\sigma$, using $\partial(x_i-\mu)/\partial x_k = \delta_{ik} - 1/d$ and $\partial\sigma/\partial x_k = \hat{x}_k/d$ from Step 1:

$$\frac{\partial\hat{x}_i}{\partial x_k} = \frac{(\delta_{ik} - 1/d)\,\sigma - (x_i-\mu)\cdot\hat{x}_k/d}{\sigma^2} = \frac{\delta_{ik}-1/d}{\sigma} - \frac{\hat{x}_i\hat{x}_k}{d\sigma} = \frac{1}{\sigma}\left(\delta_{ik} - \frac{1}{d} - \frac{\hat{x}_i\hat{x}_k}{d}\right)$$

where the middle step uses $(x_i-\mu)/\sigma = \hat{x}_i$. In matrix form:

$$\boxed{J_{\text{LN}} = \frac{1}{\sigma}\left(I - \frac{1}{d}\mathbf{1}\mathbf{1}^\top - \frac{1}{d}\hat{\mathbf{x}}\hat{\mathbf{x}}^\top\right)}$$

A note on conventions: some references (e.g., Xu et al., 2019) write the last term as an unscaled outer product $\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top$ with the unit-normalised vector $\tilde{\mathbf{x}} = \hat{\mathbf{x}}/\sqrt{d}$. The two forms agree: since $\frac{1}{d}\sum_j\hat{x}_j^2 = 1$ (taking $\varepsilon = 0$), we have $\|\hat{\mathbf{x}}\|^2 = d$, so $\frac{1}{d}\hat{\mathbf{x}}\hat{\mathbf{x}}^\top = \tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top$ is exactly the rank-1 projector onto the normalised-input direction.

Geometric interpretation. The matrix $I - \frac{1}{d}\mathbf{1}\mathbf{1}^\top - \frac{1}{d}\hat{\mathbf{x}}\hat{\mathbf{x}}^\top$ is a projection that:

  1. Removes the mean direction ($\mathbf{1}/\sqrt{d}$): $J_{\text{LN}}\mathbf{1} = \frac{1}{\sigma}\left(\mathbf{1} - \frac{d}{d}\mathbf{1} - \frac{\hat{\mathbf{x}}^\top\mathbf{1}}{d}\hat{\mathbf{x}}\right) = \frac{1}{\sigma}\,\mathbf{0} = \mathbf{0}$ (since $\sum_j\hat{x}_j = \sum_j(x_j-\mu)/\sigma = 0$).

  2. Removes the normalised-input direction ($\hat{\mathbf{x}}/\sqrt{d}$): $J_{\text{LN}}\hat{\mathbf{x}} = \frac{1}{\sigma}\left(\hat{\mathbf{x}} - \frac{\mathbf{1}^\top\hat{\mathbf{x}}}{d}\mathbf{1} - \frac{\|\hat{\mathbf{x}}\|^2}{d}\hat{\mathbf{x}}\right) = \frac{1}{\sigma}(\hat{\mathbf{x}} - \mathbf{0} - \hat{\mathbf{x}}) = \mathbf{0}$.

Null space and rank. The null space is $\text{span}\{\mathbf{1}, \hat{\mathbf{x}}\}$ (two-dimensional whenever $\hat{\mathbf{x}}$ is not parallel to $\mathbf{1}$, which holds generically). So $\text{rank}(J_{\text{LN}}) = d-2$.

Training implication. Gradients cannot flow through the mean or norm directions. This is a desirable property: it prevents trivial changes (scaling all activations or shifting them) from affecting the gradient, making training more stable.
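
A numerical confirmation of the formula, the null space, and the rank - assuming PyTorch; $\varepsilon$ is set to 0 so the identities hold exactly, and float64 keeps the rank computation clean:

```python
import torch
from torch.autograd.functional import jacobian

d = 6

def ln(x):                                  # normalisation step only, eps = 0
    mu = x.mean()
    sigma = torch.sqrt(((x - mu)**2).mean())
    return (x - mu) / sigma

x = torch.randn(d, dtype=torch.float64)
J = jacobian(ln, x)

xhat = ln(x)
sigma = torch.sqrt(((x - x.mean())**2).mean())
ones = torch.ones(d, dtype=torch.float64)
J_formula = (torch.eye(d, dtype=torch.float64)
             - torch.outer(ones, ones) / d
             - torch.outer(xhat, xhat) / d) / sigma
print(torch.allclose(J, J_formula, atol=1e-10))
print(J @ ones, J @ xhat)                   # both ~0: the two null directions
print(torch.linalg.matrix_rank(J))          # tensor(4) = d - 2
```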

3.5 JVPs and VJPs - Why Reverse Mode Dominates

Given $f: \mathbb{R}^n \to \mathbb{R}^m$ with Jacobian $J_f \in \mathbb{R}^{m\times n}$, we often need to multiply by $J_f$ or $J_f^\top$ without materialising the full matrix.

Jacobian-vector product (JVP, forward mode):

$$\text{JVP}(\mathbf{v}) = J_f\,\mathbf{v} \in \mathbb{R}^m \quad \text{for a tangent vector } \mathbf{v} \in \mathbb{R}^n$$

Answers: "If the input moves in direction $\mathbf{v}$, how does the output change?"

Computed by a forward pass with dual numbers: run the computation on the pair $(\mathbf{x}, \mathbf{v})$, where $\mathbf{v}$ is the tangent component. Cost: $O(n+m)$ - one forward pass plus tangent tracking. Does not require storing intermediate values.

Vector-Jacobian product (VJP, reverse mode):

$$\text{VJP}(\mathbf{u}) = J_f^\top\,\mathbf{u} \in \mathbb{R}^n \quad \text{for a cotangent vector } \mathbf{u} \in \mathbb{R}^m$$

Answers: "Given a weighting $\mathbf{u}$ over the outputs, what is each input's sensitivity?"

Computed by a backward pass: first run a forward pass (storing intermediates), then propagate $\mathbf{u}$ backward through the computation graph. Cost: $O(n+m)$ compute plus $O(\text{forward-pass memory})$ for the stored activations.

Why neural network training uses reverse mode (VJP). The loss $\mathcal{L}$ is scalar ($m = 1$). We want all $n$ parameter gradients simultaneously. Compare:

  • Reverse mode (VJP): one backward pass gives $\nabla\mathcal{L} = J_\mathcal{L}^\top \cdot 1 \in \mathbb{R}^n$ - all $n$ gradients at once. Total cost: $c_{\text{fwd}} + c_{\text{bwd}} \approx 2\text{-}3\times c_{\text{fwd}}$.

  • Forward mode (JVP): to get the gradient component along $\mathbf{e}_j$, run one JVP with $\mathbf{v}=\mathbf{e}_j$, giving $\partial\mathcal{L}/\partial\theta_j$. Getting all $n$ gradients requires $n$ JVPs. Total cost: $n \times c_{\text{fwd}}$.

For $n = 10^8$ (a typical LLM), reverse mode is roughly $10^8\times$ cheaper. This is why PyTorch, JAX, and TensorFlow all implement reverse-mode AD (backpropagation) as their default.

When forward mode is preferable. If $n \ll m$ (few inputs, many outputs), forward mode is cheaper: $n$ JVPs vs $m$ VJPs. Applications: uncertainty propagation through a small number of uncertain inputs, sensitivity analysis, directional derivative computation.

COST COMPARISON: JVP vs VJP


  f: R^n -> R^m,  J_f ∈ R^{m×n}

  Full Jacobian via JVPs:   n passes of JVP(e_j),  cost = n * c_fwd
  Full Jacobian via VJPs:   m passes of VJP(e_i),  cost = m * c_bwd

  For ML (m=1, scalar loss):
    Forward mode: n passes  ->  cost n * c_fwd     (EXPENSIVE)
    Reverse mode: 1 pass    ->  cost     c_bwd     (CHEAP)
    Speedup: n / (c_bwd/c_fwd) ~= n/3 ~= 3*10^7  for GPT-3

  For normalising flows (need full Jacobian for log|det J|):
    Forward mode: n passes  ->  expensive for large n
    Special architectures (triangular J): O(n) determinant

Full Jacobian recovery. To get $J_f \in \mathbb{R}^{m\times n}$:

  • Apply a JVP with $\mathbf{v}=\mathbf{e}_j$ for $j=1,\ldots,n$: fills the columns of $J_f$. Cost: $n\cdot c_{\text{fwd}}$.
  • Apply a VJP with $\mathbf{u}=\mathbf{e}_i$ for $i=1,\ldots,m$: fills the rows of $J_f$. Cost: $m\cdot c_{\text{bwd}}$.

Choose the cheaper option: if $n < m$, use JVPs; if $m < n$, use VJPs.
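
Both primitives are exposed directly in PyTorch's torch.func (PyTorch 2.x); a minimal sketch with toy shapes:

```python
import torch
from torch.func import jvp, vjp

# JVP pushes a tangent v forward; VJP pulls a cotangent u back.
f = lambda x: torch.stack([x[0] * x[1], torch.sin(x[1]), x.sum()])  # R^2 -> R^3

x = torch.randn(2)
v = torch.randn(2)            # tangent (input space)
u = torch.randn(3)            # cotangent (output space)

y, Jv = jvp(f, (x,), (v,))    # forward mode: J_f v in one pass
y, vjp_fn = vjp(f, x)
(JTu,) = vjp_fn(u)            # reverse mode: J_f^T u in one backward pass
print(Jv.shape, JTu.shape)    # torch.Size([3]) torch.Size([2])
```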

3.6 Jacobian Condition Number and Training Stability

Definition. The condition number of the Jacobian $J_f \in \mathbb{R}^{m\times n}$ is:

$$\kappa(J_f) = \frac{\sigma_{\max}(J_f)}{\sigma_{\min}(J_f)}$$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values. (For $m \neq n$, $\sigma_{\min}$ refers to the smallest nonzero singular value.)

Geometric meaning. The singular value decomposition $J_f = U\Sigma V^\top$ shows that $J_f$ rotates (via $V^\top$), then scales (via $\Sigma$), then rotates again (via $U$). The input directions $\mathbf{v}_j$ (columns of $V$) are mapped to output directions $\sigma_j \mathbf{u}_j$ (scaled columns of $U$). $\kappa = \sigma_{\max}/\sigma_{\min}$ measures how anisotropic this scaling is:

  • $\kappa = 1$: all input directions are scaled equally - perfectly isotropic
  • $\kappa \gg 1$: some directions are amplified by $\sigma_{\max}$, others suppressed to near zero by $\sigma_{\min}$ - highly anisotropic

Why $\kappa$ matters for training. In backpropagation, the gradient flows backward as $\boldsymbol{\delta}^{\text{in}} = J_f^\top \boldsymbol{\delta}^{\text{out}}$. The singular values of $J_f^\top$ are the same as those of $J_f$. So:

  • Gradient components along the $\sigma_{\max}$ direction are amplified by $\sigma_{\max}$
  • Gradient components along the $\sigma_{\min}$ direction are suppressed by $\sigma_{\min}$

Over $L$ layers, the overall Jacobian $J = J_L \cdots J_1$ has singular values bounded above by $\prod_l \sigma_{\max}(J_l)$ and below by $\prod_l \sigma_{\min}(J_l)$. If each $\sigma_{\max} > 1+\varepsilon$: exponential gradient explosion. If each $\sigma_{\min} < 1-\varepsilon$: exponential gradient vanishing.

Initialisation strategies. Xavier (Glorot) initialisation sets $W_{ij} \sim \mathcal{U}\left(-\sqrt{6/(n_{\text{in}}+n_{\text{out}})},\; +\sqrt{6/(n_{\text{in}}+n_{\text{out}})}\right)$, which gives $\mathbb{E}[\sigma_{\max}(W)] \approx 1$ and $\kappa(W) \approx \sqrt{n_{\text{out}}/n_{\text{in}}}$ (near 1 for square layers). He initialisation uses $W_{ij} \sim \mathcal{N}(0, 2/n_{\text{in}})$, designed for ReLU, which zeroes half the neurons.

Normalisation layers. BatchNorm and LayerNorm act as implicit preconditioning: they rescale the activations so that the effective Jacobian of each normalised layer has $\sigma_{\max} \approx \sigma_{\min} \approx 1$. This keeps the product of singular values near 1 regardless of depth.


4. The Hessian Matrix

4.1 Formal Definition

Definition (Hessian). Let $f: U \subseteq \mathbb{R}^n \to \mathbb{R}$ be twice differentiable at $\mathbf{x} \in U$. The Hessian matrix of $f$ at $\mathbf{x}$ is the $n \times n$ matrix of all second-order partial derivatives:

$$H_f(\mathbf{x}) = \nabla^2 f(\mathbf{x}) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2\partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2\partial x_n} \\ \vdots & & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n\partial x_1} & \dfrac{\partial^2 f}{\partial x_n\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$$

The $(i,j)$ entry is $[H_f]_{ij} = \partial^2 f / \partial x_i \partial x_j$.

Three equivalent ways to think about the Hessian:

  1. Matrix of second partial derivatives. Each entry differentiates $f$ twice: first with respect to $x_j$, then with respect to $x_i$.

  2. Jacobian of the gradient. Since $\nabla f: \mathbb{R}^n \to \mathbb{R}^n$ is a vector-valued function, it has a Jacobian. That Jacobian is the Hessian: $H_f(\mathbf{x}) = J_{\nabla f}(\mathbf{x})$, because $[J_{\nabla f}]_{ij} = \partial[\nabla f]_i/\partial x_j = \partial^2 f/\partial x_j\partial x_i = [H_f]_{ij}$ (when $f \in C^2$, the order of differentiation does not matter by Clairaut's theorem).

  3. Curvature operator. For any unit direction $\mathbf{u}$, the scalar curvature of $f$ in that direction is $\kappa(\mathbf{u}) = \mathbf{u}^\top H_f(\mathbf{x})\,\mathbf{u}$. This is the second derivative of $g(t) = f(\mathbf{x}+t\mathbf{u})$ at $t=0$.

Notation. We write $H_f(\mathbf{x})$, $\nabla^2 f(\mathbf{x})$, $D^2 f(\mathbf{x})$, or $\text{Hess}(f)(\mathbf{x})$ interchangeably.

HESSIAN STRUCTURE  (n=3 example)


           x_1             x_2             x_3
        ┌                                              ┐
   x_1  │  ∂²f/∂x_1²       ∂²f/∂x_1∂x_2    ∂²f/∂x_1∂x_3 │
   x_2  │  ∂²f/∂x_2∂x_1    ∂²f/∂x_2²       ∂²f/∂x_2∂x_3 │
   x_3  │  ∂²f/∂x_3∂x_1    ∂²f/∂x_3∂x_2    ∂²f/∂x_3²    │
        └                                              ┘

  Diagonal: pure second derivatives (curvature along each axis)
  Off-diagonal [i,j]: how ∂f/∂x_i changes as x_j moves
                      (interaction / cross-curvature)

  When f ∈ C²: H_f is symmetric (Clairaut's theorem)
  i.e., ∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i for all i, j

4.2 Clairaut's Theorem - Complete Proof and Failure Case

Theorem (Clairaut-Schwarz, 1740/1873). If $f: U \subseteq \mathbb{R}^n \to \mathbb{R}$ has continuous second-order partial derivatives at $\mathbf{x} \in U$ (i.e., $f \in C^2$ near $\mathbf{x}$), then:

$$\frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{x}) = \frac{\partial^2 f}{\partial x_j \partial x_i}(\mathbf{x}) \quad \text{for all } i \neq j$$

Equivalently: $H_f(\mathbf{x})$ is a symmetric matrix whenever $f \in C^2$.

Complete proof (two-variable case). It suffices to prove the result for $f: \mathbb{R}^2 \to \mathbb{R}$ with variables $(x, y)$. Define the mixed difference quotient:

$$\Delta(h, k) = f(x+h, y+k) - f(x+h, y) - f(x, y+k) + f(x, y)$$

Step 1: Express $\Delta$ via $\partial^2 f/\partial y\partial x$. Define $\phi(t) = f(t, y+k) - f(t, y)$. Then $\Delta(h,k) = \phi(x+h) - \phi(x)$. By the mean value theorem (MVT) applied to $\phi$ on $[x, x+h]$:

$$\Delta(h,k) = h\,\phi'(x + \theta_1 h) = h\left[\frac{\partial f}{\partial x}(x+\theta_1 h, y+k) - \frac{\partial f}{\partial x}(x+\theta_1 h, y)\right]$$

for some $\theta_1 \in (0,1)$. Now apply the MVT to $s \mapsto \frac{\partial f}{\partial x}(x+\theta_1 h, s)$ on $[y, y+k]$:

$$\Delta(h,k) = h\,k\,\frac{\partial^2 f}{\partial y\partial x}(x+\theta_1 h,\; y+\theta_2 k) \quad \text{for some } \theta_2 \in (0,1)$$

Step 2: Express $\Delta$ via $\partial^2 f/\partial x\partial y$. Now define $\psi(s) = f(x+h, s) - f(x, s)$. Then $\Delta(h,k) = \psi(y+k) - \psi(y)$. By the MVT applied to $\psi$ on $[y, y+k]$:

$$\Delta(h,k) = k\,\psi'(y+\theta_3 k) = k\left[\frac{\partial f}{\partial y}(x+h, y+\theta_3 k) - \frac{\partial f}{\partial y}(x, y+\theta_3 k)\right]$$

for some $\theta_3 \in (0,1)$. Apply the MVT to $s \mapsto \frac{\partial f}{\partial y}(s, y+\theta_3 k)$ on $[x, x+h]$:

$$\Delta(h,k) = h\,k\,\frac{\partial^2 f}{\partial x\partial y}(x+\theta_4 h,\; y+\theta_3 k) \quad \text{for some } \theta_4 \in (0,1)$$

Step 3: Take the limit. From Steps 1 and 2:

$$\frac{\partial^2 f}{\partial y\partial x}(x+\theta_1 h, y+\theta_2 k) = \frac{\Delta(h,k)}{hk} = \frac{\partial^2 f}{\partial x\partial y}(x+\theta_4 h, y+\theta_3 k)$$

As $h, k \to 0$, both evaluation points approach $(x, y)$ since $\theta_i \in (0,1)$. By continuity of the second partial derivatives:

$$\frac{\partial^2 f}{\partial y\partial x}(x, y) = \frac{\partial^2 f}{\partial x\partial y}(x, y) \quad \square$$

Why continuity is essential. The proof uses continuity only in the last step - to take the limit. Without continuity, the two sides approach the same point but need not converge to the same value.

Concrete counterexample (Peano, 1880). Define:

$$f(x,y) = \begin{cases} \dfrac{xy(x^2-y^2)}{x^2+y^2} & (x,y)\neq(0,0) \\ 0 & (x,y)=(0,0) \end{cases}$$

Step 1: Compute $\partial f/\partial x$ at $(0,0)$. By the limit definition:

$$\frac{\partial f}{\partial x}(0,0) = \lim_{h\to 0}\frac{f(h,0)-f(0,0)}{h} = \lim_{h\to 0}\frac{0}{h} = 0$$

Similarly $\partial f/\partial y(0,0) = 0$.

Step 2: Compute $\partial f/\partial x$ at $(0,y)$ for $y\neq 0$. Write $f(x,y) = y\,\dfrac{x^3-xy^2}{x^2+y^2}$ and apply the quotient rule, then set $x=0$:

$$\frac{\partial f}{\partial x}(0,y) = y\left[\frac{(3x^2-y^2)(x^2+y^2) - (x^3-xy^2)\cdot 2x}{(x^2+y^2)^2}\right]_{x=0} = y\cdot\frac{-y^4}{y^4} = -y$$

Step 3: Compute the mixed partial $\partial^2 f/\partial y\partial x$ at $(0,0)$.

$$\frac{\partial^2 f}{\partial y\partial x}(0,0) = \lim_{k\to 0}\frac{\frac{\partial f}{\partial x}(0,k) - \frac{\partial f}{\partial x}(0,0)}{k} = \lim_{k\to 0}\frac{-k - 0}{k} = -1$$

Step 4: By the symmetry of the roles of $x$ and $y$ (with a sign flip from the factor $x^2-y^2$):

$$\frac{\partial^2 f}{\partial x\partial y}(0,0) = +1$$

So $\partial^2 f/\partial y\partial x = -1 \neq +1 = \partial^2 f/\partial x\partial y$ at the origin, even though both first partial derivatives exist everywhere. The second partial derivatives exist at $(0,0)$ but are not continuous there - exactly the hypothesis Clairaut requires.

Practical implication. For any function built from compositions of smooth operations ($\exp$, $\log$, $+$, $\times$, $\sin$, $\cos$, polynomials), all partial derivatives of all orders are continuous. Neural network losses with smooth activations (GELU, sigmoid, tanh, softmax) therefore always have symmetric Hessians. For ReLU networks, the Hessian is symmetric almost everywhere (where it exists) but may fail to exist on the boundaries $\{x_i = 0\}$.

4.3 Computing the Hessian

Method 1: Direct differentiation (symbolic). Compute f\nabla f first, then differentiate each component again.

Cost: O(n2)O(n^2) partial derivatives. For n=108n = 10^8 parameters: 101610^{16} entries - the full Hessian matrix is utterly infeasible to compute or store for any modern neural network.

Method 2: Numerical Hessian via finite differences. Use the second-order central difference formula:

[Hf]ijf(x+hei+hej)f(x+heihej)f(xhei+hej)+f(xheihej)4h2[H_f]_{ij} \approx \frac{f(\mathbf{x}+h\mathbf{e}_i+h\mathbf{e}_j) - f(\mathbf{x}+h\mathbf{e}_i-h\mathbf{e}_j) - f(\mathbf{x}-h\mathbf{e}_i+h\mathbf{e}_j) + f(\mathbf{x}-h\mathbf{e}_i-h\mathbf{e}_j)}{4h^2}

Error: O(h2)O(h^2) truncation error. Round-off error O(εmach/h2)O(\varepsilon_{\text{mach}}/h^2). Optimal hεmach1/4104h \approx \varepsilon_{\text{mach}}^{1/4} \approx 10^{-4} balances both.

Cost: O(n2)O(n^2) function evaluations. Still infeasible for large nn.
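A minimal NumPy sketch of this formula on a small problem (the quadratic test function, the evaluation point, and the step size below are illustrative choices, not fixed by the text):

import numpy as np

def hessian_fd(f, x, h=1e-4):
    # Central-difference Hessian of a scalar function f at the point x.
    n = x.size
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + h*I[i] + h*I[j]) - f(x + h*I[i] - h*I[j])
                       - f(x - h*I[i] + h*I[j]) + f(x - h*I[i] - h*I[j])) / (4 * h**2)
    return H

# Check on f(x) = 0.5 x^T A x, whose exact Hessian is A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
print(hessian_fd(f, np.array([0.3, -0.7])))   # close to A, up to O(h^2) error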

Method 3: Via automatic differentiation.

  • PyTorch: torch.autograd.functional.hessian(f, x) - materialises full Hessian via nn backward passes
  • JAX: jax.hessian(f)(x) - uses forward-over-reverse AD (one forward pass per output component of f\nabla f)

Both require O(n)O(n) AD passes, each costing O(n)O(n) - total O(n2)O(n^2) time. Infeasible for large networks.

Method 4: Hessian-vector products (HVPs). Do not materialise the Hessian; instead compute HfvH_f\mathbf{v} for a given v\mathbf{v} in O(n)O(n) time. This is the practical approach for all large-scale applications. See 8.1.

4.4 Hessian as Jacobian of the Gradient

The identity Hf=JfH_f = J_{\nabla f} is more than a definition - it is a computational recipe.

Proof. The gradient f:RnRn\nabla f: \mathbb{R}^n \to \mathbb{R}^n maps x\mathbf{x} to the vector (f/x1,,f/xn)(\partial f/\partial x_1, \ldots, \partial f/\partial x_n)^\top. The (i,j)(i,j) entry of its Jacobian JfJ_{\nabla f} is:

[Jf]ij=[f]ixj=xjfxi=2fxjxi=[Hf]ij(when fC2)[J_{\nabla f}]_{ij} = \frac{\partial [\nabla f]_i}{\partial x_j} = \frac{\partial}{\partial x_j}\frac{\partial f}{\partial x_i} = \frac{\partial^2 f}{\partial x_j \partial x_i} = [H_f]_{ij} \quad \text{(when } f\in C^2\text{)} \quad \square

Directional second derivative. Applying this to the directional derivative of f\nabla f in direction v\mathbf{v}:

Jfv=Hfv=x[f(x)v]J_{\nabla f}\,\mathbf{v} = H_f\,\mathbf{v} = \nabla_\mathbf{x}[\nabla f(\mathbf{x})\cdot\mathbf{v}]

This identity is the key to efficient HVP computation (8.1): to compute HfvH_f\mathbf{v}, differentiate the scalar function xf(x)v\mathbf{x}\mapsto\nabla f(\mathbf{x})\cdot\mathbf{v} with respect to x\mathbf{x}. One forward pass computes f(x)v\nabla f(\mathbf{x})\cdot\mathbf{v} (a JVP); one backward pass gives x[fv]=Hfv\nabla_\mathbf{x}[\nabla f\cdot\mathbf{v}] = H_f\mathbf{v}.

Curvature Rayleigh quotient. For a unit vector u\mathbf{u}:

g(t)=f(x+tu)    g(0)=uHf(x)ug(t) = f(\mathbf{x}+t\mathbf{u}) \implies g''(0) = \mathbf{u}^\top H_f(\mathbf{x})\,\mathbf{u}

This is the second derivative of ff along the line through x\mathbf{x} in direction u\mathbf{u}:

  • g(0)>0g''(0) > 0: ff is concave up (curves upward) in direction u\mathbf{u}
  • g(0)<0g''(0) < 0: ff is concave down (curves downward) in direction u\mathbf{u}
  • g(0)=0g''(0) = 0: ff is locally flat in direction u\mathbf{u}

The maximum curvature is λmax(Hf)\lambda_{\max}(H_f), achieved in the direction of the top eigenvector q1\mathbf{q}_1. The minimum curvature is λmin(Hf)\lambda_{\min}(H_f), in the direction of qn\mathbf{q}_n.

4.5 Worked Examples with Full Derivations

Example 1: Quadratic form. f(x)=12xAx+bx+cf(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\top A\mathbf{x} + \mathbf{b}^\top\mathbf{x} + c where AA is symmetric.

Gradient: f=Ax+b\nabla f = A\mathbf{x} + \mathbf{b}.

Hessian: Hf=JAx+b=AH_f = J_{A\mathbf{x}+\mathbf{b}} = A (Jacobian of a linear function is the coefficient matrix).

The Hessian is constant - it does not depend on x\mathbf{x}. A quadratic function has the same curvature everywhere.

Verification: [Hf]ij=2(12k,lAklxkxl)/xixj=12(Aij+Aji)=Aij[H_f]_{ij} = \partial^2(\frac{1}{2}\sum_{k,l}A_{kl}x_kx_l)/\partial x_i\partial x_j = \frac{1}{2}(A_{ij}+A_{ji}) = A_{ij} (since AA is symmetric).

Example 2: Mean squared error loss. f(w)=1mXwy2=1m(Xwy)(Xwy)f(\mathbf{w}) = \frac{1}{m}\|X\mathbf{w}-\mathbf{y}\|^2 = \frac{1}{m}(X\mathbf{w}-\mathbf{y})^\top(X\mathbf{w}-\mathbf{y}) with XRm×nX\in\mathbb{R}^{m\times n}.

Gradient: f=2mX(Xwy)\nabla f = \frac{2}{m}X^\top(X\mathbf{w}-\mathbf{y}).

Hessian: Hf=J[2mX(Xwy)]=2mXXH_f = J\left[\frac{2}{m}X^\top(X\mathbf{w}-\mathbf{y})\right] = \frac{2}{m}X^\top X.

HfH_f is positive semidefinite (PSD) since v2mXXv=2mXv20\mathbf{v}^\top\frac{2}{m}X^\top X\mathbf{v} = \frac{2}{m}\|X\mathbf{v}\|^2 \geq 0 for all v\mathbf{v}. Therefore MSE is a globally convex function - any local minimum is the global minimum. The eigenvalues of Hf=2mXXH_f = \frac{2}{m}X^\top X are 2mσi(X)2\frac{2}{m}\sigma_i(X)^2 where σi(X)\sigma_i(X) are singular values of XX.
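A quick numerical sanity check of this formula on a small random problem, comparing torch.autograd.functional.hessian against $\frac{2}{m}X^\top X$ (the sizes below are arbitrary):

import torch

m, n = 20, 5
X, y = torch.randn(m, n), torch.randn(m)

def mse(w):
    return ((X @ w - y) ** 2).mean()          # (1/m) ||Xw - y||^2

w0 = torch.randn(n)
H_ad = torch.autograd.functional.hessian(mse, w0)
H_analytic = 2.0 / m * X.T @ X
print(torch.allclose(H_ad, H_analytic, atol=1e-4))   # True: the Hessian is constant in w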

Example 3: Log-sum-exp (softmax partition function). f(x)=logk=1Kexkf(\mathbf{x}) = \log\sum_{k=1}^K e^{x_k}.

Gradient: [f]k=exk/jexj=pk[\nabla f]_k = e^{x_k}/\sum_j e^{x_j} = p_k (the softmax probabilities).

Hessian: Hf=Jp(x)H_f = J_\mathbf{p}(\mathbf{x}) - the Jacobian of the softmax function. From 3.3:

Hf=diag(p)ppH_f = \text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top

This is PSD (shown in 3.3), so log-sum-exp is convex. Geometrically: the sublevel sets $\{\mathbf{x} : \log\sum_k e^{x_k} \leq c\}$ are convex subsets of $\mathbb{R}^K$ with smooth boundary.

Example 4: Cross-entropy loss for logistic regression. f(w)=ylogσ(wx)(1y)log(1σ(wx))f(\mathbf{w}) = -y\log\sigma(\mathbf{w}^\top\mathbf{x}) - (1-y)\log(1-\sigma(\mathbf{w}^\top\mathbf{x})) for a single training example (x,y)(\mathbf{x}, y).

Gradient: f=(σ(wx)y)x=(p^y)x\nabla f = (\sigma(\mathbf{w}^\top\mathbf{x})-y)\mathbf{x} = (\hat{p}-y)\mathbf{x}.

Hessian: Using the chain rule:

Hf=dp^d(wx)xx=σ(wx)(1σ(wx))xx=p^(1p^)xxH_f = \frac{d\hat{p}}{d(\mathbf{w}^\top\mathbf{x})}\cdot\mathbf{x}\mathbf{x}^\top = \sigma(\mathbf{w}^\top\mathbf{x})(1-\sigma(\mathbf{w}^\top\mathbf{x}))\cdot\mathbf{x}\mathbf{x}^\top = \hat{p}(1-\hat{p})\cdot\mathbf{x}\mathbf{x}^\top

This is a rank-1 matrix (outer product of x\mathbf{x} with itself, scaled by p^(1p^)[0,1/4]\hat{p}(1-\hat{p}) \in [0, 1/4]). Its single nonzero eigenvalue is p^(1p^)x2\hat{p}(1-\hat{p})\|\mathbf{x}\|^2 and the corresponding eigenvector is x^=x/x\hat{\mathbf{x}} = \mathbf{x}/\|\mathbf{x}\|. All other eigenvectors are in x\mathbf{x}^\perp with eigenvalue 0 - the loss is flat in all directions perpendicular to x\mathbf{x}.

Over a batch of mm examples: Hf=i=1mp^i(1p^i)xixiH_f = \sum_{i=1}^m \hat{p}_i(1-\hat{p}_i)\mathbf{x}_i\mathbf{x}_i^\top - sum of rank-1 PSD matrices, so PSD with rank at most min(m,n)\min(m, n).

Non-example: Hessian of a ReLU network. A single ReLU unit f(x)=max(0,x)f(x) = \max(0,x):

  • For x>0x > 0: f(x)=1f'(x) = 1, f(x)=0f''(x) = 0
  • For x<0x < 0: f(x)=0f'(x) = 0, f(x)=0f''(x) = 0
  • At x=0x = 0: ff' is discontinuous, ff'' does not exist classically

A deep ReLU network is piecewise linear - it is linear on each region of parameter space where all activation patterns are fixed. Therefore its Hessian is zero almost everywhere. Second-order curvature only appears at the activation boundaries (measure zero).

This is why Gauss-Newton and K-FAC (which avoid computing the Hessian through nonlinearities) are preferred over exact Newton for ReLU networks. The exact Hessian is nearly zero and gives no useful curvature information.


5. Second-Order Taylor Expansion

The Taylor expansion is the bridge between the Hessian's definition and its computational use. It explains why the Hessian governs convergence rates, why eigenvalues bound step sizes, and why second-order optimisers are faster than first-order ones.

5.1 The Multivariate Taylor Theorem

Theorem (Taylor's theorem in Rn\mathbb{R}^n, exact form). Let f:RnRf: \mathbb{R}^n \to \mathbb{R} be C3C^3 in a neighbourhood of x0\mathbf{x}_0. Then for any δRn\boldsymbol{\delta} \in \mathbb{R}^n:

f(x0+δ)=f(x0)+f(x0)δ+12δHf(x0)δ+R3(δ)f(\mathbf{x}_0 + \boldsymbol{\delta}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top\boldsymbol{\delta} + \frac{1}{2}\boldsymbol{\delta}^\top H_f(\mathbf{x}_0)\boldsymbol{\delta} + R_3(\boldsymbol{\delta})

where the remainder satisfies R3(δ)Cδ3|R_3(\boldsymbol{\delta})| \leq C\|\boldsymbol{\delta}\|^3 for some constant C>0C > 0.

Proof. Define $\phi: [0,1] \to \mathbb{R}$ by $\phi(t) = f(\mathbf{x}_0 + t\boldsymbol{\delta})$. This reduces the multivariate theorem to the single-variable one. Compute derivatives by the chain rule:

$$\phi'(t) = \nabla f(\mathbf{x}_0 + t\boldsymbol{\delta})^\top \boldsymbol{\delta} = \sum_i \frac{\partial f}{\partial x_i}\bigg|_{\mathbf{x}_0+t\boldsymbol{\delta}} \delta_i$$

$$\phi''(t) = \boldsymbol{\delta}^\top H_f(\mathbf{x}_0 + t\boldsymbol{\delta})\boldsymbol{\delta} = \sum_{i,j} \frac{\partial^2 f}{\partial x_i \partial x_j}\bigg|_{\mathbf{x}_0+t\boldsymbol{\delta}} \delta_i\delta_j$$

Apply the single-variable Taylor theorem with integral remainder:

ϕ(1)=ϕ(0)+ϕ(0)+12ϕ(0)+01(1t)22ϕ(t)dt\phi(1) = \phi(0) + \phi'(0) + \frac{1}{2}\phi''(0) + \int_0^1 \frac{(1-t)^2}{2}\phi'''(t)\,dt

Substituting: ϕ(0)=f(x0)\phi(0) = f(\mathbf{x}_0), ϕ(0)=f(x0)δ\phi'(0) = \nabla f(\mathbf{x}_0)^\top\boldsymbol{\delta}, ϕ(0)=δHf(x0)δ\phi''(0) = \boldsymbol{\delta}^\top H_f(\mathbf{x}_0)\boldsymbol{\delta}, and ϕ(1)=f(x0+δ)\phi(1) = f(\mathbf{x}_0 + \boldsymbol{\delta}). The integral remainder is R3R_3; bounding ϕ(t)Mδ3|\phi'''(t)| \leq M\|\boldsymbol{\delta}\|^3 for some MM gives R3M6δ3|R_3| \leq \frac{M}{6}\|\boldsymbol{\delta}\|^3. \square

The quadratic approximation. Dropping R3R_3 yields the second-order approximation:

f(x0+δ)f(x0)+f(x0)δ+12δHf(x0)δ\boxed{f(\mathbf{x}_0 + \boldsymbol{\delta}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top\boldsymbol{\delta} + \frac{1}{2}\boldsymbol{\delta}^\top H_f(\mathbf{x}_0)\boldsymbol{\delta}}

This is a quadratic function in $\boldsymbol{\delta}$ - an elliptic paraboloid (if $H_f \succ 0$), a hyperbolic paraboloid / saddle surface (if $H_f$ is indefinite), or a cylinder with flat directions (if $H_f$ is singular). The entire local geometry of the function near $\mathbf{x}_0$ is encoded in this approximation.

5.2 Geometric Picture: Level Sets and Curvature

To see the geometry, write the quadratic approximation as a function of δ\boldsymbol{\delta}:

q(δ)=12δHfδ+gδ+cq(\boldsymbol{\delta}) = \frac{1}{2}\boldsymbol{\delta}^\top H_f \boldsymbol{\delta} + \mathbf{g}^\top\boldsymbol{\delta} + c

where g=f(x0)\mathbf{g} = \nabla f(\mathbf{x}_0) and c=f(x0)c = f(\mathbf{x}_0). Complete the square: if Hf0H_f \succ 0,

q(δ)=12(δ+Hf1g)Hf(δ+Hf1g)+c12gHf1gq(\boldsymbol{\delta}) = \frac{1}{2}(\boldsymbol{\delta} + H_f^{-1}\mathbf{g})^\top H_f (\boldsymbol{\delta} + H_f^{-1}\mathbf{g}) + c - \frac{1}{2}\mathbf{g}^\top H_f^{-1}\mathbf{g}

The minimum is at δ=Hf1g\boldsymbol{\delta}^* = -H_f^{-1}\mathbf{g} - this is the Newton step. The level sets {q=const}\{q = \text{const}\} are ellipses (in 2D) or ellipsoids (in nnD) aligned with the eigenvectors of HfH_f.

QUADRATIC LANDSCAPE GEOMETRY

  • Isotropic Hessian ($H = \lambda I$): circular level sets. The negative gradient points straight at the minimum, so the GD step coincides with the Newton step. Eigenvalues $\lambda_1 = \lambda_2 = \lambda$; condition number $\kappa = 1$.
  • Anisotropic Hessian: elliptical level sets. The negative gradient is not aligned with the minimum, so GD "zigzags" across the valley while Newton goes direct. Eigenvalues $\lambda_1 \gg \lambda_2$; condition number $\kappa = \lambda_1/\lambda_2 \gg 1$.


Principal curvatures. The Hessian at a critical point x\mathbf{x}^* (where f=0\nabla f = 0) has eigendecomposition Hf=QΛQH_f = Q\Lambda Q^\top where Q=[q1qn]Q = [\mathbf{q}_1|\cdots|\mathbf{q}_n] is orthogonal and Λ=diag(λ1,,λn)\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n). In the basis {qi}\{\mathbf{q}_i\}, the function is a sum of independent quadratics:

f(x+δ)f(x)+12i=1nλi(qiδ)2f(\mathbf{x}^* + \boldsymbol{\delta}) \approx f(\mathbf{x}^*) + \frac{1}{2}\sum_{i=1}^n \lambda_i ({\mathbf{q}_i^\top\boldsymbol{\delta}})^2

Each λi\lambda_i is the curvature along direction qi\mathbf{q}_i:

  • λi>0\lambda_i > 0: ff increases parabolically in direction qi\mathbf{q}_i -> local min in that direction
  • λi<0\lambda_i < 0: ff decreases parabolically -> saddle direction
  • λi=0\lambda_i = 0: ff is flat -> cannot determine from second-order information

Second derivative test. At a critical point f(x)=0\nabla f(\mathbf{x}^*) = 0:

  • Hf0H_f \succ 0 (all λi>0\lambda_i > 0) \Rightarrow strict local minimum
  • Hf0H_f \prec 0 (all λi<0\lambda_i < 0) \Rightarrow strict local maximum
  • HfH_f indefinite (mixed signs) \Rightarrow saddle point
  • Hf0H_f \succeq 0 but singular \Rightarrow inconclusive (need higher-order terms)

5.3 Hessian Eigendecomposition and Convergence Rates

The eigenvalues of the Hessian at a minimum x\mathbf{x}^* directly control how fast gradient descent converges nearby.

Local convergence analysis. Near x\mathbf{x}^*, linearise the gradient iteration. The gradient update xt+1=xtηf(xt)\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\nabla f(\mathbf{x}_t) gives, by Taylor:

f(xt)=f(x)+Hf(x)(xtx)+O(xtx2)=Hf(x)(xtx)+O(2)\nabla f(\mathbf{x}_t) = \nabla f(\mathbf{x}^*) + H_f(\mathbf{x}^*)(\mathbf{x}_t - \mathbf{x}^*) + O(\|\mathbf{x}_t - \mathbf{x}^*\|^2) = H_f(\mathbf{x}^*)(\mathbf{x}_t - \mathbf{x}^*) + O(\|\cdot\|^2)

Let ϵt=xtx\boldsymbol{\epsilon}_t = \mathbf{x}_t - \mathbf{x}^*. Then:

ϵt+1=ϵtηHfϵt+O(ϵt2)=(IηHf)ϵt+O(ϵt2)\boldsymbol{\epsilon}_{t+1} = \boldsymbol{\epsilon}_t - \eta H_f \boldsymbol{\epsilon}_t + O(\|\boldsymbol{\epsilon}_t\|^2) = (I - \eta H_f)\boldsymbol{\epsilon}_t + O(\|\boldsymbol{\epsilon}_t\|^2)

Ignoring the higher-order term: ϵt+1(IηHf)ϵt\boldsymbol{\epsilon}_{t+1} \approx (I - \eta H_f)\boldsymbol{\epsilon}_t. In the eigenbasis:

(ϵt+1)i=(1ηλi)(ϵt)i(\boldsymbol{\epsilon}_{t+1})_i = (1 - \eta\lambda_i)(\boldsymbol{\epsilon}_t)_i

After TT steps: (ϵT)i=(1ηλi)T(ϵ0)i(\boldsymbol{\epsilon}_T)_i = (1 - \eta\lambda_i)^T (\boldsymbol{\epsilon}_0)_i.

Convergence requires 1ηλi<1|1 - \eta\lambda_i| < 1 for all ii, i.e., 0<η<2/λmax0 < \eta < 2/\lambda_{\max}. The optimal step size (minimising the worst-case contraction ratio) is:

η=2λmin+λmax\eta^* = \frac{2}{\lambda_{\min} + \lambda_{\max}}

giving contraction ratio ρ=κ1κ+1\rho = \frac{\kappa - 1}{\kappa + 1} where κ=λmax/λmin\kappa = \lambda_{\max}/\lambda_{\min} is the condition number. The number of steps to reduce error by factor ε\varepsilon scales as O(κlog(1/ε))O(\kappa \log(1/\varepsilon)).

The condition number catastrophe. For a poorly conditioned Hessian ($\kappa \gg 1$), GD makes tiny progress per step. A concrete example: for $f(x,y) = \frac{1}{2}(x^2 + 100y^2)$, the Hessian is $\text{diag}(1, 100)$, $\kappa = 100$. With the optimal step $\eta^* = 2/101 \approx 0.0198$:

  • Along $y$ (the stiff direction, $\lambda = 100$): contraction factor $|1 - \eta^*\cdot 100| = |1 - 1.98| = 0.98$ - the iterate overshoots and oscillates
  • Along $x$ (the flat direction, $\lambda = 1$): contraction factor $|1 - \eta^*\cdot 1| \approx 0.98$ - the step is far too small

Both directions contract by only $\rho = (\kappa-1)/(\kappa+1) = 0.98$ per step, so roughly 230 iterations are needed just to shrink the error by a factor of 100.

Newton's method avoids this by multiplying by Hf1H_f^{-1}, giving ϵt+1=O(ϵt2)\boldsymbol{\epsilon}_{t+1} = O(\|\boldsymbol{\epsilon}_t\|^2) - quadratic convergence, independent of conditioning.
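A NumPy illustration of this contrast on the example above (the iteration count is arbitrary):

import numpy as np

H = np.diag([1.0, 100.0])                 # Hessian of f(x, y) = 0.5 (x^2 + 100 y^2)
grad = lambda z: H @ z

z = np.array([1.0, 1.0])
eta = 2.0 / (1.0 + 100.0)                 # optimal fixed step 2 / (lambda_min + lambda_max)
for _ in range(200):
    z = z - eta * grad(z)                 # gradient descent
print(np.linalg.norm(z))                  # ~0.02: contraction factor 0.98 per step

z = np.array([1.0, 1.0])
z = z - np.linalg.solve(H, grad(z))       # Newton step
print(np.linalg.norm(z))                  # 0.0: exact for a quadratic in one step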

5.4 Gradient Descent vs Newton's Method

Gradient descent update:

xt+1=xtηf(xt)\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\nabla f(\mathbf{x}_t)

Move in the steepest descent direction, step size proportional to η\eta.

Newton's update:

xt+1=xtHf(xt)1f(xt)\mathbf{x}_{t+1} = \mathbf{x}_t - H_f(\mathbf{x}_t)^{-1}\nabla f(\mathbf{x}_t)

Move to the minimum of the quadratic approximation. Derivation: minimise q(δ)=f(xt)+gδ+12δHδq(\boldsymbol{\delta}) = f(\mathbf{x}_t) + \mathbf{g}^\top\boldsymbol{\delta} + \frac{1}{2}\boldsymbol{\delta}^\top H\boldsymbol{\delta} over δ\boldsymbol{\delta}:

qδ=g+Hδ=0    δ=H1g\frac{\partial q}{\partial \boldsymbol{\delta}} = \mathbf{g} + H\boldsymbol{\delta} = 0 \implies \boldsymbol{\delta}^* = -H^{-1}\mathbf{g}
| Property | Gradient Descent | Newton's Method |
| --- | --- | --- |
| Convergence rate (near $\mathbf{x}^*$) | Linear: $\lVert\boldsymbol{\epsilon}_{t+1}\rVert \leq \rho\lVert\boldsymbol{\epsilon}_t\rVert$ | Quadratic: $\lVert\boldsymbol{\epsilon}_{t+1}\rVert = O(\lVert\boldsymbol{\epsilon}_t\rVert^2)$ |
| Dependence on $\kappa$ | $O(\kappa\log(1/\varepsilon))$ steps | $O(\log\log(1/\varepsilon))$ steps |
| Cost per step | $O(n)$ (gradient) | $O(n^3)$ (Hessian inversion) |
| Memory | $O(n)$ | $O(n^2)$ (Hessian storage) |
| Works when $H \not\succ 0$? | Yes (gradient always valid) | No ($H^{-1}$ may not exist or may give an ascent direction) |
| Suitable for deep learning? | Yes (SGD, Adam) | No ($n \sim 10^9$) |

For AI: Deep networks have n106n \sim 10^6-101010^{10} parameters. Exact Newton is completely infeasible. This is why the entire field of second-order optimisation for deep learning focuses on approximating Hf1H_f^{-1} cheaply:

  • Diagonal approximations: Adam uses vt\sqrt{v_t} (RMS of past gradients) as a diagonal Hessian estimate
  • Kronecker factorizations: K-FAC approximates each layer's Fisher as a Kronecker product
  • Low-rank approximations: KFAC-Reduce, EKFAC
  • Hessian-vector products: Lanczos algorithm to find the top eigenvectors

6. Hessian Definiteness and Flat vs Sharp Minima

The definiteness of the Hessian at a minimum has consequences far beyond the second derivative test. It determines generalisation, robustness to perturbations, and the choice of optimiser. This section examines each case in depth.

6.1 The Positive Definite Case: Strongly Convex Minima

If Hf(x)0H_f(\mathbf{x}^*) \succ 0, the minimum is strongly convex locally. All curvatures are positive; the function rises quadratically in every direction. Key properties:

  • Unique minimum: in the neighbourhood, x\mathbf{x}^* is the only critical point
  • Gradient lower bound: $\|\nabla f(\mathbf{x})\| \geq \lambda_{\min}\|\mathbf{x} - \mathbf{x}^*\|$ (to leading order) for $\mathbf{x}$ near $\mathbf{x}^*$
  • Gradient descent converges linearly (as derived in 5.3)
  • Robust to small perturbations: a perturbation ϵ\boldsymbol{\epsilon} of size δ\delta changes ff by at most 12λmaxδ2\frac{1}{2}\lambda_{\max}\delta^2

Algebraic criterion (Sylvester's criterion). H0H \succ 0 if and only if all leading principal minors are positive:

det(H1:k,1:k)>0for k=1,2,,n\det(H_{1:k,1:k}) > 0 \quad \text{for } k = 1, 2, \ldots, n

In 2D: H=(abbc)0    a>0 and acb2>0H = \begin{pmatrix}a & b \\ b & c\end{pmatrix} \succ 0 \iff a > 0 \text{ and } ac - b^2 > 0.

Eigenvalue criterion. H0    λmin(H)>0H \succ 0 \iff \lambda_{\min}(H) > 0. The smallest eigenvalue measures how "tight" the well is in the flattest direction.

6.2 The Positive Semidefinite Case: Degenerate Minima

If Hf(x)0H_f(\mathbf{x}^*) \succeq 0 with λmin=0\lambda_{\min} = 0, the quadratic approximation is flat in the null space ker(Hf)\ker(H_f). The second-order test is inconclusive.

What can happen along flat directions:

  • The function might be constant in that direction (a continuum of minima - "flat valley")
  • The function might dip lower (a deeper minimum exists nearby)
  • The function might rise (a saddle, with flat approach)

Example: Over-parameterised networks. Consider fitting nn data points with a model having pnp \gg n parameters. The minimum loss is achieved on a manifold of parameters (not a single point). Along this manifold, the loss is constant =0= 0, so f=0\nabla f = 0 and HfH_f has at least pnp - n zero eigenvalues. The loss landscape near a minimum is extremely flat in the "model space" directions.

Example: Batch normalisation symmetry. If layer ll has BatchNorm, then scaling W(l)αW(l)W^{(l)} \to \alpha W^{(l)} and W(l+1)W(l+1)/αW^{(l+1)} \to W^{(l+1)}/\alpha leaves the network function unchanged. This scale symmetry creates a zero eigenvalue direction in the Hessian of any loss function.

6.3 The Indefinite Case: Saddle Points

If Hf(x)H_f(\mathbf{x}^*) has both positive and negative eigenvalues, x\mathbf{x}^* is a saddle point. The function rises in some directions and falls in others.

Strict saddles. A saddle is strict if λmin(Hf)<0\lambda_{\min}(H_f) < 0. Gradient descent perturbed by noise escapes strict saddles - the noisy gradient has a component in the negative-curvature direction, pulling the iterate away. This is the basis for the claim that SGD finds local minima rather than saddles in practice.

Prevalence in deep learning. Dauphin et al. (2014) showed empirically that the loss landscape of deep networks is dominated by saddle points rather than local minima. The ratio of saddle points to local minima grows exponentially with depth. Goodfellow et al. confirmed that most "local minima" found by SGD are actually saddle points or nearly flat regions with very small negative curvature.

Index of a saddle. The Morse index is the number of negative eigenvalues. A local minimum has index 0 and a local maximum has index $n$; a saddle in $\mathbb{R}^n$ has index $k$ for some $1 \leq k \leq n-1$.

6.4 Flat vs Sharp Minima and Generalisation

One of the most important practical consequences of Hessian analysis is the flat minima hypothesis for neural network generalisation.

Sharp minimum: λmax(Hf)1\lambda_{\max}(H_f) \gg 1. The loss landscape around x\mathbf{x}^* is narrow and steep. A small perturbation of the weights significantly increases the training loss. Empirically associated with poor generalisation - the test distribution's slight differences from training push the model off the sharp peak.

Flat minimum: λmax(Hf)\lambda_{\max}(H_f) is small. The loss landscape is broad and shallow. A perturbation of the weights leaves training loss roughly unchanged. Empirically associated with better generalisation - the model is robust to distribution shift.

PAC-Bayes bound. Hochreiter & Schmidhuber (1997) and Keskar et al. (2017) formalised this: a generalisation bound based on the minimum description length of a model. Flatter minima require fewer bits to specify (the volume of the basin in parameter space is larger), so PAC-Bayes bounds on generalisation error are tighter.

The SAM algorithm (Sharpness-Aware Minimisation, Foret et al. 2021). SAM explicitly searches for flat minima:

minwfSAM(w):=maxϵρf(w+ϵ)\min_{\mathbf{w}} f_{\text{SAM}}(\mathbf{w}) := \max_{\|\boldsymbol{\epsilon}\| \leq \rho} f(\mathbf{w} + \boldsymbol{\epsilon})

The inner maximisation finds the worst-case perturbation within radius ρ\rho. By Taylor:

maxϵρf(w+ϵ)f(w)+ρf(w)\max_{\|\boldsymbol{\epsilon}\| \leq \rho} f(\mathbf{w} + \boldsymbol{\epsilon}) \approx f(\mathbf{w}) + \rho\|\nabla f(\mathbf{w})\|

The SAM gradient is wf(w^)\nabla_{\mathbf{w}} f(\hat{\mathbf{w}}) where w^=w+ρf(w)f(w)\hat{\mathbf{w}} = \mathbf{w} + \rho\frac{\nabla f(\mathbf{w})}{\|\nabla f(\mathbf{w})\|} (ascent step to the worst-case point). This requires two gradient computations per step (one for the ascent, one for the final gradient), doubling cost.
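A minimal PyTorch sketch of one SAM update implementing exactly this two-gradient procedure. It is not the authors' reference implementation; model, loss_fn (a closure evaluating the loss on the current batch), and optimizer are assumed to exist:

import torch

def sam_step(model, loss_fn, optimizer, rho=0.05):
    # 1) gradient at w
    loss_fn(model).backward()
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum()
                               for p in model.parameters() if p.grad is not None))
    # 2) ascent step to the worst-case point w_hat = w + rho * g / ||g||
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            e = rho * p.grad / (grad_norm + 1e-12) if p.grad is not None else None
            if e is not None:
                p.add_(e)
            perturbations.append(e)
    optimizer.zero_grad()
    # 3) gradient at w_hat, then undo the perturbation and update w with that gradient
    loss_fn(model).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), perturbations):
            if e is not None:
                p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()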

SAM in practice. SAM and its variants (ASAM, mSAM, SAM-ON) consistently improve generalisation by 1-3% on ImageNet and CIFAR benchmarks. The improvement is attributed to SAM finding flatter minima, though the exact mechanism is debated (some argue SAM also acts as a regulariser).

Sharpness measures. Several quantities measure flatness:

  • Trace of Hessian: tr(Hf)=iλi\text{tr}(H_f) = \sum_i \lambda_i - total curvature (sum of all curvatures)
  • Frobenius norm: HfF=i,jhij2\|H_f\|_F = \sqrt{\sum_{i,j} h_{ij}^2} - total variation
  • Largest eigenvalue: λmax(Hf)\lambda_{\max}(H_f) - worst-case curvature direction
  • Volume of basin: (detHf)1/2iλi1/2(\det H_f)^{-1/2} \propto \prod_i \lambda_i^{-1/2} - Laplace approximation normaliser

6.5 Hessian Spectrum in Practice

Ghorbani et al. (2019) performed careful empirical Hessian eigenspectrum analysis of large neural networks, finding a consistent structure:

HESSIAN SPECTRUM OF DEEP NETWORKS


  Eigenvalue density (schematic): a dense "bulk" of eigenvalues piled up near $\lambda = 0$, plus a thin tail of isolated outliers extending out to $\lambda_{\max}$.

  - Bulk: dense cluster of small eigenvalues near 0
    (most directions have very small curvature)

  - Outliers: a few large eigenvalues (often O(10)-O(1000))
    (a handful of directions have sharp curvature)

  - The ratio lambda_max / lambda_bulk >> 1 (condition number very large)

  - After convergence: more eigenvalues cluster near 0
    (the model has found flat directions = the "flat manifold" of solutions)


Implications:

  • Learning rate must be η<2/λmax\eta < 2/\lambda_{\max} for stability - but λmax\lambda_{\max} can be very large (103\sim 10^3 for ResNets)
  • Learning rate warmup is necessary because early in training λmax\lambda_{\max} is large and unstable
  • Gradient clipping effectively caps the "step size" in the sharp directions
  • Adam/AdaGrad normalise by the gradient magnitude, implicitly compensating for large curvature in some directions

7. Newton's Method and Second-Order Optimisation

7.1 Newton's Method: Full Analysis

We derived the Newton step δN=Hf1f\boldsymbol{\delta}_N = -H_f^{-1}\nabla f in 5.4. Here we analyse it thoroughly.

Local quadratic convergence. Let x\mathbf{x}^* be a local minimum with Hf(x)0H_f(\mathbf{x}^*) \succ 0. For x0\mathbf{x}_0 sufficiently close to x\mathbf{x}^*:

xt+1xM2λminxtx2\|\mathbf{x}_{t+1} - \mathbf{x}^*\| \leq \frac{M}{2\lambda_{\min}} \|\mathbf{x}_t - \mathbf{x}^*\|^2

where $M = \sup_{\mathbf{x}} \|D^3f(\mathbf{x})\|$ (Lipschitz constant of the Hessian). This is quadratic convergence: each iteration squares the error. From $\varepsilon_0 = 0.1$: $\varepsilon_1 \approx 0.01$, $\varepsilon_2 \approx 10^{-4}$, $\varepsilon_3 \approx 10^{-8}$ - only four or five steps to reach machine precision.

Proof sketch. By Taylor at x\mathbf{x}^*: f(xt)=Hf(x)(xtx)+O(xtx2)\nabla f(\mathbf{x}_t) = H_f(\mathbf{x}^*)(\mathbf{x}_t - \mathbf{x}^*) + O(\|\mathbf{x}_t - \mathbf{x}^*\|^2). The Newton step gives:

xt+1=xtHf(xt)1f(xt)\mathbf{x}_{t+1} = \mathbf{x}_t - H_f(\mathbf{x}_t)^{-1}\nabla f(\mathbf{x}_t)

Substituting and using Hf(xt)=Hf(x)+O(xtx)H_f(\mathbf{x}_t) = H_f(\mathbf{x}^*) + O(\|\mathbf{x}_t - \mathbf{x}^*\|), the linear error term cancels exactly, leaving only O(xtx2)O(\|\mathbf{x}_t - \mathbf{x}^*\|^2).

Problems with vanilla Newton:

  1. HfH_f not PD: If HfH_f is indefinite, Hf1g-H_f^{-1}\mathbf{g} may be an ascent direction. At saddle points Newton steps towards the saddle, not away.
  2. Step too large: Even near a minimum, the step may overshoot if the quadratic approximation is inaccurate far from the current point.
  3. Hessian cost: Computing $H_f \in \mathbb{R}^{n\times n}$ costs $O(n^2)$ memory and $O(n)$ backward passes ($O(n^2)$ total work). Inverting costs $O(n^3)$.

7.2 Damped Newton and Quasi-Newton Methods

Damped Newton's method:

xt+1=xtαtHf(xt)1f(xt)\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t H_f(\mathbf{x}_t)^{-1}\nabla f(\mathbf{x}_t)

where αt\alpha_t is chosen by line search to ensure sufficient decrease (Armijo condition):

f(xt+αtδN)f(xt)+cαtf(xt)δNf(\mathbf{x}_t + \alpha_t\boldsymbol{\delta}_N) \leq f(\mathbf{x}_t) + c\alpha_t\nabla f(\mathbf{x}_t)^\top\boldsymbol{\delta}_N

for c(0,1)c \in (0,1) (typically c=104c = 10^{-4}). This guarantees global descent while preserving superlinear convergence near x\mathbf{x}^*.

Modified Cholesky. To handle non-PD Hessians, add a diagonal: H~=Hf+μI\tilde{H} = H_f + \mu I with μ0\mu \geq 0 chosen so H~0\tilde{H} \succ 0. This is the Levenberg-Marquardt modification. The step H~1g-\tilde{H}^{-1}\mathbf{g} interpolates between Newton (μ=0\mu = 0) and gradient descent (μ\mu \to \infty, step g/μ\to -\mathbf{g}/\mu).

L-BFGS (Limited-memory BFGS). The most widely used quasi-Newton method in practice (used for GPT-2 fine-tuning in some settings, L-BFGS-B for convex ML). Rather than computing Hf1H_f^{-1} exactly, L-BFGS maintains an implicit approximation from the last mm gradient differences (m=5m = 5 to 2020):

sk=xk+1xk,yk=f(xk+1)f(xk)\mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k, \quad \mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k)

The BFGS update enforces Hk+1sk=ykH_{k+1}\mathbf{s}_k = \mathbf{y}_k (secant condition). L-BFGS applies the implicit product Hk1gH_k^{-1}\mathbf{g} via a two-loop recursion in O(mn)O(mn) time, avoiding the O(n2)O(n^2) Hessian matrix.
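A NumPy sketch of the two-loop recursion that applies the implicit inverse-Hessian approximation to a gradient, given the stored pairs $(\mathbf{s}_i, \mathbf{y}_i)$ (the standard form, with the usual $\gamma$-scaled initial matrix $H_0 = \gamma I$):

import numpy as np

def lbfgs_direction(g, s_list, y_list):
    # Returns an approximation to H_k^{-1} g from the last m stored (s, y) pairs.
    q = g.copy()
    stack = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        stack.append((rho, alpha, s, y))
    if s_list:                                             # initial scaling H_0 = gamma * I
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for rho, alpha, s, y in reversed(stack):               # oldest to newest
        beta = rho * (y @ q)
        q += (alpha - beta) * s
    return q                                               # the quasi-Newton step is -q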

7.3 The Gauss-Newton Method

For least-squares problems f(w)=12r(w)2f(\mathbf{w}) = \frac{1}{2}\|\mathbf{r}(\mathbf{w})\|^2 where r:RnRm\mathbf{r}: \mathbb{R}^n \to \mathbb{R}^m is the residual vector, the exact Hessian is:

Hf=JrJr+i=1mri(w)HriH_f = J_{\mathbf{r}}^\top J_{\mathbf{r}} + \sum_{i=1}^m r_i(\mathbf{w}) H_{r_i}

The second term involves Hessians of each residual, which are expensive and may be negative-definite. The Gauss-Newton approximation drops this term:

H^GN=JrJr0\hat{H}_{\text{GN}} = J_{\mathbf{r}}^\top J_{\mathbf{r}} \succeq 0

This is always PSD (so the step is always a descent direction) and costs only O(mn2)O(mn^2) - compute JrJ_{\mathbf{r}} once, then form the n×nn\times n product.

The Gauss-Newton step: wt+1=wt(JJ)1Jr(wt)\mathbf{w}_{t+1} = \mathbf{w}_t - (J^\top J)^{-1} J^\top \mathbf{r}(\mathbf{w}_t).
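A NumPy sketch of this iteration on a toy curve-fitting problem (the exponential model, the synthetic data, and the iteration count are made up for illustration):

import numpy as np

t = np.linspace(0.0, 4.0, 30)
y = 2.0 * np.exp(-1.3 * t) + 0.01 * np.random.randn(30)   # data from a*exp(b*t) plus noise

def residuals(w):                     # r(w) with w = (a, b)
    a, b = w
    return a * np.exp(b * t) - y

def jac(w):                           # J_r, shape (m, 2): columns dr/da and dr/db
    a, b = w
    return np.stack([np.exp(b * t), a * t * np.exp(b * t)], axis=1)

w = np.array([1.0, -1.0])
for _ in range(10):                   # Gauss-Newton iterations
    r, J = residuals(w), jac(w)
    w = w - np.linalg.solve(J.T @ J, J.T @ r)
print(w)                              # close to the generating parameters (2.0, -1.3)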

For neural networks: The Fisher Information Matrix (FIM) is the natural Gauss-Newton matrix:

F=Ex,ypdata[(wlogp(yx;w))(wlogp(yx;w))]=JpJpF = \mathbb{E}_{\mathbf{x},y\sim p_{\text{data}}}[(\nabla_{\mathbf{w}}\log p(y|\mathbf{x};\mathbf{w}))(\nabla_{\mathbf{w}}\log p(y|\mathbf{x};\mathbf{w}))^\top] = J_{p}^\top J_{p}

where JpJ_p is the Jacobian of the log-probabilities. This equals the Gauss-Newton matrix for log-likelihood maximisation, and is always PSD. Natural gradient descent uses F1gF^{-1}\mathbf{g} as the update direction.

7.4 Kronecker-Factored Approximate Curvature (K-FAC)

K-FAC (Martens & Grosse 2015) is the most successful practical second-order method for deep learning. It exploits the structure of neural network layers to cheaply approximate the FIM.

The key insight. For a linear layer y=Wa\mathbf{y} = W\mathbf{a} with input activations aRnin\mathbf{a} \in \mathbb{R}^{n_\text{in}} and pre-activations yRnout\mathbf{y} \in \mathbb{R}^{n_\text{out}}, the gradient of the loss with respect to WW is:

WL=gya\nabla_W \mathcal{L} = \mathbf{g}_y \mathbf{a}^\top

where gy=Ly\mathbf{g}_y = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} (backpropagated gradient). Therefore:

vec(WL)=agy\text{vec}(\nabla_W \mathcal{L}) = \mathbf{a} \otimes \mathbf{g}_y

The exact FIM block for layer ll is:

Fl=E[(agy)(agy)]=E[aagygy]F_l = \mathbb{E}[(\mathbf{a} \otimes \mathbf{g}_y)(\mathbf{a} \otimes \mathbf{g}_y)^\top] = \mathbb{E}[\mathbf{a}\mathbf{a}^\top \otimes \mathbf{g}_y\mathbf{g}_y^\top]

K-FAC approximates this by assuming independence of a\mathbf{a} and gy\mathbf{g}_y:

F~l=E[aa]E[gygy]=AlGl\tilde{F}_l = \mathbb{E}[\mathbf{a}\mathbf{a}^\top] \otimes \mathbb{E}[\mathbf{g}_y\mathbf{g}_y^\top] = A_l \otimes G_l

where Al=E[aa]Rnin×ninA_l = \mathbb{E}[\mathbf{a}\mathbf{a}^\top] \in \mathbb{R}^{n_\text{in}\times n_\text{in}} and Gl=E[gygy]Rnout×noutG_l = \mathbb{E}[\mathbf{g}_y\mathbf{g}_y^\top] \in \mathbb{R}^{n_\text{out}\times n_\text{out}}.

The Kronecker structure enables cheap inversion:

(AlGl)1=Al1Gl1(A_l \otimes G_l)^{-1} = A_l^{-1} \otimes G_l^{-1}

So inverting the large ninnout×ninnoutn_\text{in}n_\text{out} \times n_\text{in}n_\text{out} matrix costs only O(nin3+nout3)O(n_\text{in}^3 + n_\text{out}^3) - inverting two small matrices. The natural gradient step is:

ΔWl=Gl1(WlL)Al1\Delta W_l = -G_l^{-1}(\nabla_{W_l}\mathcal{L}) A_l^{-1}

This can be applied with two small matrix products per layer, costing $O(n_\text{in} n_\text{out}(n_\text{in} + n_\text{out}))$ - far less than the $O((n_\text{in} n_\text{out})^3)$ cost of inverting the exact FIM block. K-FAC achieves 10-30x speedup over SGD in terms of wall-clock time to convergence on large models, while maintaining good generalisation.
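A NumPy sketch of the per-layer K-FAC update, estimating $A_l$ and $G_l$ from a batch of activations and backpropagated gradients (batch size, layer sizes, and the damping constant are illustrative; the damping is added so the factors are invertible):

import numpy as np

batch, n_in, n_out = 64, 128, 256
a = np.random.randn(batch, n_in)            # layer inputs
g = np.random.randn(batch, n_out)           # backpropagated gradients dL/dy
grad_W = g.T @ a / batch                    # gradient of the loss w.r.t. W, shape (n_out, n_in)

A = a.T @ a / batch                         # Kronecker factor E[a a^T]
G = g.T @ g / batch                         # Kronecker factor E[g g^T]
damping = 1e-3
A_inv = np.linalg.inv(A + damping * np.eye(n_in))
G_inv = np.linalg.inv(G + damping * np.eye(n_out))

delta_W = -G_inv @ grad_W @ A_inv           # K-FAC natural-gradient step (up to a learning rate)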


8. Hessian-Vector Products and Spectral Methods

Computing the full n×nn\times n Hessian is infeasible for large networks. But many algorithms only need Hessian-vector products (HVPs) HfvH_f\mathbf{v} - and these can be computed in O(n)O(n) time.

8.1 The O(n)O(n) HVP Identity

Theorem. For f:RnRf: \mathbb{R}^n \to \mathbb{R} differentiable and any vRn\mathbf{v} \in \mathbb{R}^n:

Hfv=x[xf(x)v]H_f\mathbf{v} = \nabla_{\mathbf{x}}[\nabla_{\mathbf{x}} f(\mathbf{x}) \cdot \mathbf{v}]

Proof. The (i,j)(i,j)-entry of HfH_f is 2fxixj\frac{\partial^2 f}{\partial x_i \partial x_j}. The dot product fv=jfxjvj\nabla f \cdot \mathbf{v} = \sum_j \frac{\partial f}{\partial x_j} v_j. Taking the gradient:

xi(fv)=j2fxixjvj=(Hfv)i\frac{\partial}{\partial x_i}(\nabla f \cdot \mathbf{v}) = \sum_j \frac{\partial^2 f}{\partial x_i \partial x_j} v_j = (H_f\mathbf{v})_i

So [fv]=Hfv\nabla[\nabla f \cdot \mathbf{v}] = H_f\mathbf{v}. \square

Implementation. In any automatic differentiation framework (PyTorch/JAX):

# v is a fixed vector; x is a tensor with requires_grad=True at which to evaluate H(x)v
loss = f(x)
grad = torch.autograd.grad(loss, x, create_graph=True)[0]    # grad f(x), one backward pass, O(n)
hvp = torch.autograd.grad(grad @ v, x)[0]                    # H(x)v, one more backward pass, O(n)

The first grad call is standard backprop. The second differentiates the scalar fv\nabla f \cdot \mathbf{v} - also a standard backprop, costing O(n)O(n) just like the first. Total: 2 passes, O(n)O(n) each, giving HfvH_f\mathbf{v} without ever forming the n×nn\times n matrix.

8.2 Power Iteration and Dominant Eigenvalue

Using HVPs, we can find the largest eigenvalue λmax(Hf)\lambda_{\max}(H_f) via power iteration:

  1. Initialise v0N(0,I)\mathbf{v}_0 \sim \mathcal{N}(0, I), normalise: v0v0/v0\mathbf{v}_0 \leftarrow \mathbf{v}_0/\|\mathbf{v}_0\|
  2. Repeat: v~t+1=Hfvt\tilde{\mathbf{v}}_{t+1} = H_f\mathbf{v}_t (one HVP), vt+1=v~t+1/v~t+1\mathbf{v}_{t+1} = \tilde{\mathbf{v}}_{t+1}/\|\tilde{\mathbf{v}}_{t+1}\|
  3. Estimate: λmaxvtHfvt\lambda_{\max} \approx \mathbf{v}_t^\top H_f\mathbf{v}_t (Rayleigh quotient, another HVP)

Each iteration costs 1-2 HVPs = O(n)O(n). Convergence rate: λmax/λ2t|\lambda_{\max}/\lambda_2|^t (geometric in gap ratio). For well-separated spectra, converges in 10\sim 10-50 iterations.
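A PyTorch sketch of this loop, reusing the HVP pattern from 8.1 (f is assumed to be a scalar-valued loss of a single flat parameter tensor x with requires_grad=True; the iteration count is arbitrary):

import torch

def hvp(f, x, v):
    grad = torch.autograd.grad(f(x), x, create_graph=True)[0]
    return torch.autograd.grad(grad @ v, x)[0]

def lambda_max_estimate(f, x, iters=20):
    v = torch.randn_like(x)
    v = v / v.norm()
    for _ in range(iters):
        hv = hvp(f, x, v)                       # one HVP per iteration
        v = hv / hv.norm()
    return torch.dot(v, hvp(f, x, v)).item()    # Rayleigh quotient v^T H v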

For AI: $\lambda_{\max}(H_f)$ determines the maximum stable learning rate: $\eta_{\max} = 2/\lambda_{\max}$. Computing this via 20 power iterations costs 20 HVPs = 40 backward passes - affordable even for billion-parameter models. The progressive sharpening and catapult phenomena in transformer training are explained by $\lambda_{\max}$ dynamics: if the learning rate exceeds $2/\lambda_{\max}$, loss explodes; after explosion, a sharp minimum may be replaced by a flatter one (the "edge of stability" phenomenon, Cohen et al. 2022).

8.3 Lanczos Algorithm and Full Spectrum

The Lanczos algorithm extracts the full eigenspectrum using a sequence of HVPs. Starting from a random unit vector q1\mathbf{q}_1:

  1. Build an orthonormal basis {q1,q2,,qk}\{q_1, q_2, \ldots, q_k\} (Krylov subspace) by repeated HVPs:

    • q~j+1=Hfqjαjqjβj1qj1\tilde{\mathbf{q}}_{j+1} = H_f\mathbf{q}_j - \alpha_j\mathbf{q}_j - \beta_{j-1}\mathbf{q}_{j-1}
    • qj+1=q~j+1/q~j+1\mathbf{q}_{j+1} = \tilde{\mathbf{q}}_{j+1}/\|\tilde{\mathbf{q}}_{j+1}\| where αj=qjHfqj\alpha_j = \mathbf{q}_j^\top H_f\mathbf{q}_j, βj=q~j+1\beta_j = \|\tilde{\mathbf{q}}_{j+1}\|
  2. After kk steps, HfH_f is approximated in the Krylov basis as a tridiagonal matrix TkT_k:

Tk=(α1β1β1α2β2βk1αk)T_k = \begin{pmatrix}\alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \beta_2 & \\ & \ddots & \ddots & \ddots\\ & & \beta_{k-1} & \alpha_k\end{pmatrix}
  3. Eigenvalues of $T_k$ approximate eigenvalues of $H_f$. The extreme eigenvalues converge fastest (in $O(\sqrt{\kappa})$ iterations rather than $O(\kappa)$ for power iteration).

Density estimation. Using the stochastic Lanczos quadrature method (Ghorbani et al. 2019), one can estimate the full spectral density ρ(λ)=1niδ(λλi)\rho(\lambda) = \frac{1}{n}\sum_i \delta(\lambda - \lambda_i) using 100\sim 100 random starting vectors and 100\sim 100 Lanczos steps each. This gives a smooth density plot revealing the bulk-outlier structure discussed in 6.5.
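In practice the Lanczos iteration rarely needs to be hand-coded for the extreme eigenvalues: SciPy's eigsh (a Lanczos-type solver) can consume HVPs through a matrix-free LinearOperator. A sketch, assuming hvp is a callable mapping a NumPy vector v to Hv:

import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def extreme_eigenvalues(hvp, n, k=5):
    # Largest and smallest algebraic Hessian eigenvalues, never forming H itself.
    H_op = LinearOperator((n, n), matvec=hvp, dtype=np.float64)
    top = eigsh(H_op, k=k, which='LA', return_eigenvectors=False)
    bottom = eigsh(H_op, k=k, which='SA', return_eigenvectors=False)
    return np.sort(top)[::-1], np.sort(bottom)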

8.4 Applications of HVPs in Deep Learning

1. Computing λmax\lambda_{\max} for learning rate bounds. Cohen et al. (2022) showed that SGD with fixed learning rate η\eta converges at the "edge of stability" where λmax(Hf)2/η\lambda_{\max}(H_f) \approx 2/\eta. Measuring λmax\lambda_{\max} via power iteration explains why training stabilises at a specific sharpness level.

2. Influence functions. For convex models, the influence of removing training point ziz_i on prediction is:

I(zi,ztest)=wL(ztest)Hf1wL(zi)\mathcal{I}(z_i, z_{\text{test}}) = -\nabla_{\mathbf{w}}L(z_{\text{test}})^\top H_f^{-1} \nabla_{\mathbf{w}}L(z_i)

Computing Hf1vH_f^{-1}\mathbf{v} requires solving a linear system Hfx=vH_f\mathbf{x} = \mathbf{v}, which can be done iteratively (conjugate gradient, LiSSA) using HVPs. Used in data valuation and finding mislabelled examples.

3. Second-order optimisers (K-FAC, Shampoo, SOAP). All second-order optimisers for deep learning reduce to computing products of the form H^1g\hat{H}^{-1}\mathbf{g} where H^\hat{H} is a tractable approximation. The quality of the approximation determines how well the algorithm second-order corrects the gradient.

4. Hyperparameter optimisation. Gradient-based hyperparameter optimisation (MAML meta-learning, bilevel optimisation) requires differentiating through optimisation steps. This involves HVPs of the meta-loss with respect to the base loss's Hessian - third-order derivatives!

  5. Neural Tangent Kernel (NTK). In the infinite-width limit, the NTK Gram matrix is $\Theta = JJ^\top$, where $J = \frac{\partial f(\mathbf{x};\mathbf{w})}{\partial \mathbf{w}}$ is the Jacobian of the network outputs with respect to the parameters (the parameter-space matrix $J^\top J$ is the corresponding Gauss-Newton matrix). Products with the squared-loss Gauss-Newton Hessian reduce to one JVP followed by one VJP, connecting second-order optimisation to kernel theory.


9. Advanced Topics

9.1 Jacobians of Implicit Functions

The Implicit Function Theorem (IFT) gives the Jacobian of an implicitly defined function without solving for it explicitly.

Setup. Suppose F:Rn×RmRmF: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m is C1C^1 with F(x0,y0)=0F(\mathbf{x}_0, \mathbf{y}_0) = 0 and the partial Jacobian Fy(x0,y0)Rm×m\frac{\partial F}{\partial \mathbf{y}}(\mathbf{x}_0, \mathbf{y}_0) \in \mathbb{R}^{m\times m} is invertible. Then there exists a C1C^1 function y=g(x)\mathbf{y} = g(\mathbf{x}) near x0\mathbf{x}_0 satisfying F(x,g(x))=0F(\mathbf{x}, g(\mathbf{x})) = 0, with Jacobian:

Jg(x0)=[Fy(x0,y0)]1Fx(x0,y0)J_g(\mathbf{x}_0) = -\left[\frac{\partial F}{\partial \mathbf{y}}(\mathbf{x}_0, \mathbf{y}_0)\right]^{-1} \frac{\partial F}{\partial \mathbf{x}}(\mathbf{x}_0, \mathbf{y}_0)

Proof sketch. Differentiate F(x,g(x))=0F(\mathbf{x}, g(\mathbf{x})) = 0 with respect to x\mathbf{x} using the chain rule:

Fx+FyJg=0    Jg=(Fy)1Fx\frac{\partial F}{\partial \mathbf{x}} + \frac{\partial F}{\partial \mathbf{y}}J_g = 0 \implies J_g = -\left(\frac{\partial F}{\partial \mathbf{y}}\right)^{-1}\frac{\partial F}{\partial \mathbf{x}}

Application: Bilevel optimisation (meta-learning). MAML's inner loop finds ϕ(w)=argminϕLtrain(ϕ;w)\phi^*(w) = \arg\min_\phi \mathcal{L}_{\text{train}}(\phi; w) satisfying ϕLtrain=0\nabla_\phi \mathcal{L}_{\text{train}} = 0. The meta-gradient requires dϕdw\frac{d\phi^*}{dw}, which by IFT is:

dϕdw=Hϕϕ1Hϕw\frac{d\phi^*}{dw} = -H_{\phi\phi}^{-1} \cdot H_{\phi w}

where Hϕϕ=2Lϕ2H_{\phi\phi} = \frac{\partial^2 \mathcal{L}}{\partial\phi^2} and Hϕw=2LϕwH_{\phi w} = \frac{\partial^2 \mathcal{L}}{\partial\phi\,\partial w}. This requires second-order information (Hessians involving ϕ\phi and ww), which is why MAML-based methods are costly.

9.2 Clarke Subdifferential for Non-Smooth Functions

For non-differentiable ff (e.g., networks with ReLU), the Hessian does not exist at activation boundaries. The Clarke subdifferential extends the Jacobian to this setting.

Definition (Clarke subdifferential). For locally Lipschitz f:RnRf: \mathbb{R}^n \to \mathbb{R}, the Clarke subdifferential is:

Cf(x)=conv{limkf(xk):xkx,xkΩf}\partial_C f(\mathbf{x}) = \text{conv}\left\{\lim_{k\to\infty} \nabla f(\mathbf{x}_k) : \mathbf{x}_k \to \mathbf{x}, \mathbf{x}_k \in \Omega_f\right\}

where $\Omega_f$ is the set of points where $f$ is differentiable (a set of full measure, by Rademacher's theorem) and $\text{conv}$ denotes convex hull.

For ReLU: At x=0x = 0, CReLU(x)=[0,1]\partial_C \text{ReLU}(x) = [0, 1] - the interval between the left and right derivatives. Any value in [0,1][0,1] is a valid subgradient.

Clarke Jacobian. For f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m, the Clarke generalised Jacobian Cf(x)\partial_C f(\mathbf{x}) is the convex hull of limiting Jacobians. For piecewise-smooth ff, Cf(x)\partial_C f(\mathbf{x}) is the convex hull of Jacobians from each smooth piece meeting at x\mathbf{x}.

Generalised second-order theory. There is no direct analogue of the Hessian in Clarke's framework. Instead, semismoothness (Qi & Sun 1993) allows Newton-like methods: a function is semismooth if the Clarke Jacobian converges in a directional sense. Semismooth Newton methods converge superlinearly to solutions of nonsmooth equations.

9.3 Fisher Information Matrix and Natural Gradient

The Fisher Information Matrix plays a central role in statistical learning theory and connects Jacobians to probabilistic models.

Definition. For a parametric model p(x;θ)p(x; \theta), the FIM is:

F(θ)=Exp(;θ)[logp(x;θ)θlogp(x;θ)θ]=E[2logp(x;θ)θ2]F(\theta) = \mathbb{E}_{x\sim p(\cdot;\theta)}\left[\frac{\partial \log p(x;\theta)}{\partial\theta}\frac{\partial \log p(x;\theta)}{\partial\theta}^\top\right] = -\mathbb{E}\left[\frac{\partial^2 \log p(x;\theta)}{\partial\theta^2}\right]

The second equality (Fisher's identity) shows F=E[Hlogp]F = -\mathbb{E}[H_{\log p}] - the FIM equals the negative expected Hessian of the log-likelihood.

Geometric interpretation. The FIM defines a Riemannian metric on the statistical manifold M={p(;θ)}\mathcal{M} = \{p(\cdot;\theta)\}. The KL divergence between nearby distributions is:

DKL(p(;θ)p(;θ+δ))12δF(θ)δD_\text{KL}(p(\cdot;\theta)\|p(\cdot;\theta+\boldsymbol{\delta})) \approx \frac{1}{2}\boldsymbol{\delta}^\top F(\theta)\boldsymbol{\delta}

The FIM is the Hessian of the KL divergence. Natural gradient descent moves in the direction that most rapidly decreases the KL divergence, not the Euclidean loss:

~θL=F(θ)1θL\tilde{\nabla}_\theta \mathcal{L} = F(\theta)^{-1}\nabla_\theta \mathcal{L}

Equivalence to Gauss-Newton. For cross-entropy loss L=E[logp(yx;θ)]\mathcal{L} = -\mathbb{E}[\log p(y|x;\theta)], the FIM equals the Gauss-Newton matrix JJJ^\top J where JJ is the Jacobian of the log-probability outputs. This provides a PSD, well-conditioned curvature matrix.

For transformers: The FIM of a transformer language model is too large to compute exactly. K-FAC approximates it using the Kronecker structure. Recent work (Martens 2020, Bernacchia 2022) shows the FIM of attention layers has special structure exploitable for efficient natural gradient computation.

9.4 Jacobians in Normalising Flows and Change of Variables

Change of variables formula. If f:RnRnf: \mathbb{R}^n \to \mathbb{R}^n is a diffeomorphism (bijective, smooth, smooth inverse) and z=f(x)\mathbf{z} = f(\mathbf{x}) with xpX\mathbf{x} \sim p_X, then:

pZ(z)=pX(f1(z))detJf1(z)=pX(x)detJf(x)1p_Z(\mathbf{z}) = p_X(f^{-1}(\mathbf{z}))\cdot |\det J_{f^{-1}}(\mathbf{z})| = p_X(\mathbf{x}) \cdot |\det J_f(\mathbf{x})|^{-1}

Log-likelihood under a flow:

logpZ(z)=logpX(x)logdetJf(x)\log p_Z(\mathbf{z}) = \log p_X(\mathbf{x}) - \log|\det J_f(\mathbf{x})|

The second term is the log-determinant of the Jacobian - the log-volume scaling factor.

Normalising flows (RealNVP, Glow, NICE) are composed diffeomorphisms f=fKf1f = f_K \circ \cdots \circ f_1. By the chain rule, Jf=JfKJf1J_f = J_{f_K} \cdots J_{f_1}, so detJf=kdetJfk\det J_f = \prod_k \det J_{f_k}. Each layer is designed so that detJfk\det J_{f_k} is cheap to compute:

  • Coupling layers: JfkJ_{f_k} is triangular, det=i(Jfk)ii\det = \prod_i (J_{f_k})_{ii}, computable in O(n)O(n)
  • Invertible 1x1 convolutions (Glow): Jfk=WJ_{f_k} = W (a learnable matrix), detW\det W computed via LU decomposition

This is why Jacobian theory is central to generative modelling: the entire training objective depends on computing logdetJf\log|\det J_f| efficiently.
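A minimal PyTorch sketch of the coupling-layer case, where the triangular Jacobian makes the log-determinant a simple sum (the split in half and the scale_net / shift_net networks are placeholders):

import torch

def affine_coupling_forward(x, scale_net, shift_net):
    # z = [x1, x2 * exp(s(x1)) + t(x1)];  log|det J| = sum_i s_i(x1)
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    s, t = scale_net(x1), shift_net(x1)
    z = torch.cat([x1, x2 * torch.exp(s) + t], dim=-1)
    log_det = s.sum(dim=-1)            # Jacobian is triangular with diagonal [1, exp(s)]
    return z, log_det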

9.5 Conjugate Gradient for Solving $H\mathbf{x} = \mathbf{b}$

Many applications require not just HvH\mathbf{v} but H1vH^{-1}\mathbf{v} - solving a linear system with the Hessian. The Conjugate Gradient (CG) method solves Hx=bH\mathbf{x} = \mathbf{b} for symmetric H0H \succ 0 using only matrix-vector products:

CG algorithm:

  1. Initialise: x0=0\mathbf{x}_0 = 0, r0=b\mathbf{r}_0 = \mathbf{b}, p0=r0\mathbf{p}_0 = \mathbf{r}_0
  2. Repeat until convergence:
    • αk=rk2/(pkHpk)\alpha_k = \|\mathbf{r}_k\|^2 / (\mathbf{p}_k^\top H \mathbf{p}_k) (one HVP: HpkH\mathbf{p}_k)
    • xk+1=xk+αkpk\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k
    • rk+1=rkαkHpk\mathbf{r}_{k+1} = \mathbf{r}_k - \alpha_k H\mathbf{p}_k
    • βk=rk+12/rk2\beta_k = \|\mathbf{r}_{k+1}\|^2 / \|\mathbf{r}_k\|^2
    • pk+1=rk+1+βkpk\mathbf{p}_{k+1} = \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k (update direction)

Convergence. CG converges in at most nn steps (exact arithmetic). In kk steps, the error satisfies:

xkxH2(κ1κ+1)kx0xH\|\mathbf{x}_k - \mathbf{x}^*\|_H \leq 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \|\mathbf{x}_0 - \mathbf{x}^*\|_H

where yH=yHy\|\mathbf{y}\|_H = \sqrt{\mathbf{y}^\top H\mathbf{y}} and κ=λmax/λmin\kappa = \lambda_{\max}/\lambda_{\min}. For well-conditioned systems (κ\kappa small), CG converges in a few dozen iterations - far fewer than nn.
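A NumPy sketch of this loop in matrix-free form, where hvp is any callable $\mathbf{v} \mapsto H\mathbf{v}$ (tolerance and iteration cap are arbitrary):

import numpy as np

def conjugate_gradient(hvp, b, max_iters=100, tol=1e-8):
    # Solves H x = b for symmetric positive-definite H, touching H only through hvp.
    x = np.zeros_like(b)
    r = b.copy()                       # residual b - Hx (x starts at 0)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iters):
        Hp = hvp(p)                    # the only Hessian access per iteration
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x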

For influence functions. The LiSSA algorithm (Agarwal et al. 2017) solves H1vH^{-1}\mathbf{v} via stochastic approximation: draw batches B1,B2,B_1, B_2, \ldots, compute HBt1H_{{B_t}}^{-1} via Taylor series truncation, average. Each iteration uses one stochastic HVP. For large networks (n=108n = 10^8), this is the only feasible approach.

9.6 Jacobians and Automatic Differentiation: The Full Picture

This section ties together the JVP/VJP duality from 3.5 with the full structure of automatic differentiation.

The AD computational graph. Every computation is a directed acyclic graph (DAG) where:

  • Nodes represent intermediate values v1,v2,,vkv_1, v_2, \ldots, v_k
  • Edges represent elementary operations (add, multiply, exp, sin, ...)
  • Input nodes: v1,,vnv_1, \ldots, v_n (the input variables x\mathbf{x})
  • Output node: vkv_k (the scalar loss L\mathcal{L})

Each elementary operation vj=gj(vpa(j))v_j = g_j(v_{\text{pa}(j)}) has a Jacobian JgjJ_{g_j} - a small, cheap-to-compute matrix.

Forward mode (JVP, tangent propagation). Compute v˙i=Jf up to node ix˙\dot{v}_i = J_{f \text{ up to node } i} \cdot \dot{\mathbf{x}} for a fixed tangent vector x˙=v\dot{\mathbf{x}} = \mathbf{v}. At each node:

v˙j=ipa(j)gjviv˙i\dot{v}_j = \sum_{i \in \text{pa}(j)} \frac{\partial g_j}{\partial v_i} \dot{v}_i

Running forward mode for each basis vector ek\mathbf{e}_k separately gives column kk of the full Jacobian JfJ_f. Cost: nn forward passes total, each O(time(f))O(\text{time}(f)).

Reverse mode (VJP, adjoint propagation). Compute vˉi=Lvi\bar{v}_i = \frac{\partial \mathcal{L}}{\partial v_i} backward through the graph. At each node, going backward:

vˉi=jch(i)gjvivˉj\bar{v}_i = \sum_{j \in \text{ch}(i)} \frac{\partial g_j}{\partial v_i} \bar{v}_j

Starting from vˉk=1\bar{v}_k = 1 (derivative of L\mathcal{L} with respect to itself), this accumulates all partial derivatives in one backward pass. Cost: 11 backward pass, O(time(f))O(\text{time}(f)).

Why reverse mode wins for scalar L\mathcal{L}. Computing the full gradient xLRn\nabla_{\mathbf{x}}\mathcal{L} \in \mathbb{R}^n requires one VJP (one backward pass). Computing it with JVPs requires nn forward passes - nn times more expensive. Since n109n \sim 10^9 for LLMs, reverse mode is the only option.

Forward mode wins when mnm \gg n. If f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m with mnm \gg n, computing the full Jacobian JfRm×nJ_f \in \mathbb{R}^{m \times n} requires nn JVPs (forward) or mm VJPs (reverse). When nmn \ll m, forward mode is cheaper. Example: sensitivity analysis of a physical simulation with n=5n = 5 parameters and m=104m = 10^4 output measurements.
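Both modes are exposed directly in PyTorch's torch.autograd.functional; a small sketch contrasting them on a vector-valued map (the shapes are chosen only to make the $m \gg n$ case concrete):

import torch
from torch.autograd.functional import jvp, vjp

W = torch.randn(10000, 5)                   # f: R^5 -> R^10000, so m >> n
f = lambda x: torch.tanh(W @ x)

x = torch.randn(5)
v = torch.randn(5)                          # tangent in input space
u = torch.randn(10000)                      # cotangent in output space

_, Jv  = jvp(f, x, v)                       # forward mode: J_f v, shape (10000,)
_, JTu = vjp(f, x, u)                       # reverse mode: J_f^T u, shape (5,)
print(torch.allclose(u @ Jv, JTu @ v, atol=1e-3))   # duality: u^T (J v) = (J^T u)^T v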

Higher-order derivatives. The Hessian HfRn×nH_f \in \mathbb{R}^{n \times n} can be computed by applying the AD operators twice:

| Method | Cost | When to use |
| --- | --- | --- |
| Reverse over reverse | $O(n^2)$ (full $H$) | Never for large $n$ |
| Forward over reverse | $O(n)$ per row (builds $H$ row-by-row) | When you need the full $H$ at small $n$ |
| Reverse over forward | $O(n)$ per column | Same as above |
| HVP (reverse-of-dot) | $O(n)$ per HVP | Standard: only need $H\mathbf{v}$ |

The HVP trick from 8.1 (differentiate fv\nabla f \cdot \mathbf{v}) is exactly "reverse over the dot product with v\mathbf{v}" - it's forward-then-reverse in the AD sense, but implemented as two reverse passes.

Mixed-mode AD. Some computations benefit from mixing forward and reverse within the same graph. JAX's jax.hessian is implemented as jacfwd(jacrev(f)) - forward-over-reverse - which is usually the cheapest composition for scalar-output functions; other compositions can be assembled manually from jax.jacfwd and jax.jacrev when the input/output dimensions favour them.


10. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
| --- | --- | --- | --- |
| 1 | Confusing Jacobian rows and columns: "the $(i,j)$ entry is $\partial x_i / \partial f_j$" | The $(i,j)$ entry of $J_f$ for $f:\mathbb{R}^n\to\mathbb{R}^m$ is $\partial f_i/\partial x_j$: row $i$ = gradient of output $i$, column $j$ = sensitivity of all outputs to input $j$ | Remember: rows index outputs, columns index inputs. Shape: $(m,n)$ for $f:\mathbb{R}^n\to\mathbb{R}^m$ |
| 2 | Forgetting the Hessian is symmetric | The Hessian $H_{ij} = \partial^2 f/\partial x_i\partial x_j$ is symmetric when $f\in C^2$ (Clairaut's theorem). Without the symmetry assumption, algorithms that diagonalise $H$ are invalid | Check $H = H^\top$; if computing numerically, average with its transpose |
| 3 | Applying the second derivative test when $\nabla f(\mathbf{x}^*) \neq 0$ | The test "if $H \succ 0$ then $\mathbf{x}^*$ is a local minimum" requires $\mathbf{x}^*$ to be a critical point. A PD Hessian at a non-critical point just means the quadratic approximation has a minimum elsewhere | First verify $\nabla f(\mathbf{x}^*) = 0$, then check $H$ definiteness |
| 4 | Treating the Jacobian chain rule as commutative: $J_{f\circ g} = J_f \cdot J_g = J_g \cdot J_f$ | Matrix multiplication is not commutative. $J_{f\circ g}(\mathbf{x}) = J_f(g(\mathbf{x})) \cdot J_g(\mathbf{x})$ - evaluate the outer function's Jacobian at the output of the inner function | Write dimensions explicitly: if $g:\mathbb{R}^n\to\mathbb{R}^k$, $f:\mathbb{R}^k\to\mathbb{R}^m$, then $J_{f\circ g}$ is $(m,k)\cdot(k,n)=(m,n)$ |
| 5 | Using GD learning rate $\eta = 1$ near a non-convex region | The stable range for GD is $0 < \eta < 2/\lambda_{\max}$. If $\lambda_{\max}$ is large (sharp direction), even $\eta = 0.01$ can cause oscillation | Estimate $\lambda_{\max}$ via power iteration (10-20 HVPs) and set $\eta < 2/\lambda_{\max}$ |
| 6 | Confusing JVP and VJP modes | JVP (forward mode) computes $J_f\mathbf{v}$, one forward pass per tangent vector; VJP (reverse mode) computes $J_f^\top\mathbf{u}$, one backward pass per cotangent vector. For a scalar loss ($m=1$), a single VJP gives the whole gradient regardless of $n$; JVPs would need $n$ passes | Use reverse mode (backprop) for scalar losses; use forward mode when $m \gg n$ |
| 7 | Concluding a flat Hessian means global minimum | $H_f = 0$ at $\mathbf{x}$ only means the quadratic approximation is flat. The actual function could be a plateau, a flat saddle, or a valley floor. Example: $f(x) = x^3$ has $f''(0) = 0$ but $x=0$ is an inflection point, not a minimum | Check higher-order derivatives or sufficient decrease conditions from all directions |
| 8 | Computing $J_f\mathbf{v}$ by forming $J_f$ first | For large $n$, forming $J_f \in \mathbb{R}^{m\times n}$ costs $O(mn)$ memory and $m$ backward passes (or $n$ forward passes). Computing $J_f\mathbf{v}$ directly via forward-mode AD costs a single pass (one JVP) | Use torch.autograd.functional.jvp or JAX's jvp for forward mode; never materialise the full Jacobian if you only need the product |
| 9 | Assuming the softmax Jacobian is diagonal | The softmax $\sigma:\mathbb{R}^K\to\mathbb{R}^K$ has Jacobian $J_\sigma = \text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top$ - a full symmetric rank-$(K-1)$ matrix. Only the entries $p_i(1-p_i)$ appear on the diagonal | When backpropagating through softmax, use the full Jacobian. Many frameworks handle this correctly, but manual implementations often miss the $-p_ip_j$ off-diagonal terms |
| 10 | Ignoring that the Hessian of a ReLU network is zero almost everywhere | ReLU networks are piecewise linear. Their Hessian is 0 in each linear region. Newton's method applied to a ReLU network Hessian gives a zero-curvature degenerate system | Use Gauss-Newton or K-FAC (which use $J^\top J$ from the output Jacobian, not the parameter Hessian) for ReLU networks |
| 11 | Confusing the FIM with the Hessian of the loss | The FIM $F = \mathbb{E}[(\nabla\log p)(\nabla\log p)^\top]$ is not the same as $H_{\mathcal{L}}$ (Hessian of the training loss). They coincide only when the model is at its optimum (Fisher identity); away from it, the Hessian contains an additional term reflecting model misfit | Natural gradient uses $F^{-1}$; full second-order Newton uses $H_{\mathcal{L}}^{-1}$. K-FAC approximates $F$, not $H_{\mathcal{L}}$ |
| 12 | Expecting Newton's method to work with indefinite Hessians | Newton's step $-H^{-1}\mathbf{g}$ is only guaranteed to be a descent direction when $H \succ 0$. At saddle points, the Newton step points toward the saddle, making loss increase | Add a regularisation term: $-(H + \mu I)^{-1}\mathbf{g}$ with $\mu > 0$ chosen so $H + \mu I \succ 0$ (Levenberg-Marquardt) |

11. Exercises

Exercise 1 (). Computing Jacobians.

Let f:R3R2f: \mathbb{R}^3 \to \mathbb{R}^2 be defined by f(x1,x2,x3)=(x12+x2x3,  ex1x32)f(x_1, x_2, x_3) = (x_1^2 + x_2x_3,\; e^{x_1} - x_3^2).

(a) Compute Jf(x)J_f(\mathbf{x}) for general x\mathbf{x}.

(b) Evaluate JfJ_f at x0=(0,1,1)\mathbf{x}_0 = (0, 1, 1).

(c) Compute JfvJ_f\mathbf{v} at x0\mathbf{x}_0 for v=(1,0,1)\mathbf{v} = (1, 0, -1)^\top (the JVP).

(d) Compute JfuJ_f^\top\mathbf{u} at x0\mathbf{x}_0 for u=(1,2)\mathbf{u} = (1, 2)^\top (the VJP).

(e) Verify that u(Jfv)=(Jfu)v\mathbf{u}^\top(J_f\mathbf{v}) = (J_f^\top\mathbf{u})^\top\mathbf{v} (duality of JVP and VJP).


Exercise 2 (). Softmax Jacobian properties.

Let σ:R3R3\sigma: \mathbb{R}^3 \to \mathbb{R}^3 be the softmax function.

(a) For z=(1,2,0)\mathbf{z} = (1, 2, 0)^\top, compute p=σ(z)\mathbf{p} = \sigma(\mathbf{z}) and the full 3×33\times 3 Jacobian Jσ(z)=diag(p)ppJ_\sigma(\mathbf{z}) = \text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top.

(b) Verify Jσ1=0J_\sigma\mathbf{1} = \mathbf{0} (the all-ones vector is in the null space).

(c) Verify JσJ_\sigma is symmetric and compute its eigenvalues numerically.

(d) Show that rank(Jσ)=K1\text{rank}(J_\sigma) = K-1 by arguing that 1\mathbf{1} spans the null space.

(e) For AI: In cross-entropy loss L=logpy\mathcal{L} = -\log p_y, the gradient of L\mathcal{L} with respect to the logits z\mathbf{z} is pey\mathbf{p} - \mathbf{e}_y (where ey\mathbf{e}_y is the one-hot vector). Derive this from Lz=JσLp\frac{\partial\mathcal{L}}{\partial\mathbf{z}} = J_\sigma^\top \frac{\partial\mathcal{L}}{\partial\mathbf{p}}.


Exercise 3 (). Hessian computation and classification.

Let f(x,y)=x3+3xyy3f(x, y) = x^3 + 3xy - y^3.

(a) Find all critical points (where f=0\nabla f = 0).

(b) Compute Hf(x,y)H_f(x, y) at each critical point.

(c) Classify each critical point as local min, local max, or saddle using the second derivative test.

(d) For the saddle point, find the two eigenvectors of HfH_f and describe the geometry: in which directions is ff increasing/decreasing quadratically?


Exercise 4 (). LayerNorm Jacobian.

Consider LayerNorm LN:RdRd\text{LN}: \mathbb{R}^d \to \mathbb{R}^d (without learnable parameters γ,β\gamma, \beta):

LN(x)=xxˉ1σ,xˉ=1dixi,σ=1di(xixˉ)2\text{LN}(\mathbf{x}) = \frac{\mathbf{x} - \bar{x}\mathbf{1}}{\sigma}, \quad \bar{x} = \frac{1}{d}\sum_i x_i, \quad \sigma = \sqrt{\frac{1}{d}\sum_i(x_i-\bar{x})^2}

(a) Show that JLN(x)=1σ(I1d111dx^x^)J_{\text{LN}}(\mathbf{x}) = \frac{1}{\sigma}(I - \frac{1}{d}\mathbf{1}\mathbf{1}^\top - \frac{1}{d}\hat{\mathbf{x}}\hat{\mathbf{x}}^\top) where x^=(xxˉ1)/σ\hat{\mathbf{x}} = (\mathbf{x}-\bar{x}\mathbf{1})/\sigma.

(b) Verify that 1\mathbf{1} and x^\hat{\mathbf{x}} are in the null space of JLNJ_{\text{LN}}.

(c) Compute the rank of JLNJ_{\text{LN}} and explain its geometric meaning (what input variations LayerNorm cannot distinguish).

(d) For d=3d = 3 and x=(1,2,3)\mathbf{x} = (1, 2, 3)^\top, compute JLNJ_{\text{LN}} numerically and verify your formula.


Exercise 5. Hessian-vector product and edge of stability.

For the MSE loss f(w)=12mXwy2f(\mathbf{w}) = \frac{1}{2m}\|X\mathbf{w} - \mathbf{y}\|^2 with XRm×nX \in \mathbb{R}^{m\times n}:

(a) Show that Hf=1mXXH_f = \frac{1}{m}X^\top X and λmax(Hf)=1mλmax(XX)\lambda_{\max}(H_f) = \frac{1}{m}\lambda_{\max}(X^\top X).

(b) Implement the HVP formula: Hfv=1mX(Xv)H_f\mathbf{v} = \frac{1}{m}X^\top(X\mathbf{v}) (matrix-free, O(mn)O(mn)).

(c) Use 10 iterations of power iteration starting from a random v\mathbf{v} to estimate λmax(Hf)\lambda_{\max}(H_f).

(d) Compute the maximum stable learning rate ηmax=2/λmax\eta_{\max} = 2/\lambda_{\max} and verify numerically that GD with η=1.1ηmax\eta = 1.1\eta_{\max} diverges while η=0.9ηmax\eta = 0.9\eta_{\max} converges.

(e) For AI: Explain why the "edge of stability" phenomenon (Cohen et al. 2022) - where SGD's training loss oscillates as λmax2/η\lambda_{\max} \approx 2/\eta - is consistent with this analysis.


Exercise 6. Newton's method convergence.

Consider f(x)=log(1+ex)f(x) = \log(1 + e^x) (softplus).

(a) Compute f(x)=σ(x)f'(x) = \sigma(x) (sigmoid) and f(x)=σ(x)(1σ(x))f''(x) = \sigma(x)(1-\sigma(x)).

(b) Apply 5 steps of Newton's method starting from x0=5x_0 = 5 to minimise g(x)=f(x)34xg(x) = f(x) - \tfrac{3}{4}x, so that g(x)=σ(x)34g'(x) = \sigma(x) - \tfrac{3}{4} and the minimum is at x=log3x^* = \log 3. Record the error xtx|x_t - x^*| at each step.

(c) Plot the sequence logxtx\log|x_t - x^*| vs tt. Is the convergence roughly linear or quadratic (does the log-error decrease linearly or super-linearly)?

(d) Now add a constraint: minimise the same gg but also track the Newton decrement λN=(g(x))2/g(x)\lambda_N = \sqrt{(g'(x))^2/g''(x)} at each step. Interpret its role as a stopping criterion.


Exercise 7. K-FAC structure.

Consider a single linear layer: y=Wx\mathbf{y} = W\mathbf{x} with WRnout×ninW \in \mathbb{R}^{n_\text{out}\times n_\text{in}}, loss L\mathcal{L}.

(a) Show that vec(WL)=xg\text{vec}(\nabla_W\mathcal{L}) = \mathbf{x}\otimes\mathbf{g} where g=L/y\mathbf{g} = \partial\mathcal{L}/\partial\mathbf{y} and \otimes is the Kronecker product.

(b) The exact FIM is F=E[(xg)(xg)]=E[xxgg]F = \mathbb{E}[(\mathbf{x}\otimes\mathbf{g})(\mathbf{x}\otimes\mathbf{g})^\top] = \mathbb{E}[\mathbf{x}\mathbf{x}^\top\otimes\mathbf{g}\mathbf{g}^\top]. Show that K-FAC's independence assumption FAGF \approx A\otimes G (with A=E[xx]A = \mathbb{E}[\mathbf{x}\mathbf{x}^\top], G=E[gg]G = \mathbb{E}[\mathbf{g}\mathbf{g}^\top]) gives F1A1G1F^{-1} \approx A^{-1}\otimes G^{-1}.

(c) Show that the K-FAC natural gradient step ΔW=G1(WL)A1\Delta W = -G^{-1}(\nabla_W\mathcal{L})A^{-1} (reshape form) implements vec1(F1vec(WL))-\text{vec}^{-1}(F^{-1}\text{vec}(\nabla_W\mathcal{L})).

(d) Generate random nin=4n_\text{in}=4, nout=3n_\text{out}=3, compute exact FIM from 100 samples, compute K-FAC approximation, and compare F1gF^{-1}\mathbf{g} (exact) with (A1G1)g(A^{-1}\otimes G^{-1})\mathbf{g} (K-FAC) numerically.


Exercise 8. Jacobians in normalising flows.

Implement a single coupling layer (as in RealNVP) and verify the log-determinant formula.

(a) Define a coupling layer f:R4R4f: \mathbb{R}^4\to\mathbb{R}^4: split x=(x1,x2)R2×R2\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2) \in \mathbb{R}^2\times\mathbb{R}^2; output (y1,y2)=(x1,x2es(x1)+t(x1))(\mathbf{y}_1, \mathbf{y}_2) = (\mathbf{x}_1, \mathbf{x}_2 \odot e^{s(\mathbf{x}_1)} + t(\mathbf{x}_1)) where s,t:R2R2s, t: \mathbb{R}^2\to\mathbb{R}^2 are simple networks.

(b) Show that the Jacobian JfJ_f is lower-triangular, and the log-determinant is logdetJf=isi(x1)\log|\det J_f| = \sum_i s_i(\mathbf{x}_1) (sum of scale outputs).

(c) Numerically compute JfJ_f by finite differences for a specific x\mathbf{x}, and verify logdetJf=isi(x1)\log|\det J_f| = \sum_i s_i(\mathbf{x}_1).

(d) Implement the inverse f1(y)=(y1,(y2t(y1))es(y1))f^{-1}(\mathbf{y}) = (\mathbf{y}_1, (\mathbf{y}_2 - t(\mathbf{y}_1))\odot e^{-s(\mathbf{y}_1)}) and verify f1(f(x))=xf^{-1}(f(\mathbf{x})) = \mathbf{x}.

(e) For generative modelling: Explain why the tractable log-determinant (which avoids O(n3)O(n^3) computation) is essential for training normalising flows via maximum likelihood.


12. Why This Matters for AI (2026 Perspective)

| Concept | Concrete AI/ML Usage |
| --- | --- |
| Jacobian of layer outputs | Backpropagation is iterated vector-Jacobian products (VJPs) through each layer. Every training step of every neural network in the world computes these products implicitly |
| Softmax Jacobian | Understanding that $J_\sigma = \text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top$ is rank-$(K-1)$ explains why large-vocabulary softmax is numerically tricky (singular Jacobian) and why temperature scaling affects gradient magnitudes |
| LayerNorm Jacobian | The rank-$(d-2)$ Jacobian of LayerNorm means gradients cannot pass through in the directions $\mathbf{1}$ and $\hat{\mathbf{x}}$ - this "gradient bottleneck" explains why Pre-LN transformers (LN before attention) converge faster than Post-LN (LN after attention) |
| JVP vs VJP (forward vs reverse mode) | All modern deep learning frameworks (PyTorch, JAX) use reverse mode. Forward mode is used in specific settings: sensitivity analysis, Hessian-diagonal estimation, gradient checkpointing trade-offs |
| Jacobian condition number | High condition number -> vanishing/exploding gradients in deep networks. Xavier and He initialisation set the initial Jacobian's singular values near 1 across layers, ensuring stable gradient flow at init |
| Hessian eigenspectrum | The largest Hessian eigenvalue determines the maximum stable learning rate. The "edge of stability" phenomenon (Cohen et al. 2022) shows SGD self-regulates to $\eta \approx 2/\lambda_{\max}$ - foundational for understanding training dynamics |
| Flat vs sharp minima (SAM) | SAM and its variants (ASAM, mSAM, SAM-ON, Adan-SAM) are widely used in production systems. SAM finds flatter minima, improving generalisation by 1-3% on vision and NLP benchmarks. The analysis of flat vs sharp minima requires understanding Hessian definiteness and condition numbers |
| Hessian-vector products | Power iteration on the Hessian (using HVPs) gives $\lambda_{\max}$ for learning rate selection. Influence functions (for data valuation, finding mislabelled examples, understanding model behaviour) use iterative HVP solvers |
| K-FAC / natural gradient | Second-order methods that approximate the Fisher Information Matrix (= Gauss-Newton matrix) achieve 10-30x speedup over Adam in terms of steps to convergence. EKFAC and KFAC-Reduce are used in large-scale transformer training at Google and DeepMind |
| Jacobian determinant / normalising flows | The change-of-variables formula requires $\log\lvert\det J_f\rvert$ for every layer; coupling-layer designs (RealNVP-style, see Exercise 8) keep this tractable and avoid the $O(n^3)$ cost of a general determinant |
| IFT / bilevel optimisation | MAML meta-learning, hyperparameter optimisation via gradient, and NAS all require differentiating through optimisation loops. This requires Hessian-vector products in the IFT formula $d\phi^*/dw = -H_{\phi\phi}^{-1}H_{\phi w}$ |
| Neural Tangent Kernel (NTK) | In the infinite-width limit, the kernel is $\Theta = J^\top J$ (output Jacobian squared). Understanding NTK evolution during training (NTK changes vs stays constant) explains why wide networks behave like kernel machines and why the lazy training regime exists |
| RoPE and attention Jacobians | Rotary Position Embeddings (RoPE, used in LLaMA, Mistral, Qwen) are orthogonal transformations applied to queries and keys. Their Jacobians are rotation matrices (orthogonal, $\det = 1$, condition number 1), so the positional encoding preserves gradient norms exactly (see Section 14.2) |

Conceptual Bridge

Looking backward. This section builds directly on 01-Partial-Derivatives-and-Gradients, which introduced the gradient as the collection of partial derivatives of a scalar function. The Jacobian is the natural generalisation: instead of a single gradient vector, we have one gradient row per output, stacked into a matrix. Every formula from the gradient section has a Jacobian analogue: the chain rule becomes the Jacobian chain rule ($J_{f\circ g} = J_f \cdot J_g$), the directional derivative becomes a JVP ($J_f\mathbf{v}$), and gradient-based steepest descent becomes backpropagation built from VJPs.

Looking forward. The Hessian established here is the foundation for the next two sections:

  • 03-Optimization-on-Manifolds: When parameters are constrained to a manifold (e.g., orthogonal matrices O(n)O(n), unit sphere), the Riemannian Hessian replaces the Euclidean Hessian. The Fisher Information Matrix defines a natural Riemannian metric on statistical manifolds.

  • 04-Automatic-Differentiation: The JVP and VJP operators computed throughout this section are exactly what forward-mode and reverse-mode AD implement. The Jacobian-vector product is the primitive of forward-mode; the vector-Jacobian product is the primitive of reverse-mode. Understanding both modes, and when to use each, is the foundation of efficient AD system design.

Beyond this chapter: Jacobians and Hessians appear throughout the rest of the curriculum:

  • Optimisation (Chapter 8): gradient descent convergence rates, condition numbers, Newton's method, and conjugate gradients all depend on the Hessian eigenspectrum
  • Probability and Statistics (Chapter 9): the Fisher Information Matrix and Cramér-Rao bound require Jacobians of log-likelihood functions
  • Differential Geometry (Chapter 10): the pushforward and pullback are Jacobian maps between tangent spaces on manifolds
POSITION IN THE CURRICULUM

    01-Partial-Derivatives-and-Gradients
    (scalar functions: ∇f, directional derivatives, chain rule)
                      |
                      v
    02-Jacobians-and-Hessians   <- YOU ARE HERE
    (vector functions: J_f, H_f, second-order analysis)
                      |
            +---------+---------+
            |                   |
            v                   v
    03-Optimization-on-Manifolds            04-Automatic-Differentiation
    (Riemannian Hessian, natural gradient)  (JVP/VJP implementation)
            |                   |
            +---------+---------+
                      |
                      v
            Chapter 8: Optimisation
            (convergence theory, Newton, second-order methods)


The Jacobian and Hessian are the core analytical tools of multivariate analysis. Every gradient-based learning algorithm, every convergence theorem, every second-order method in deep learning is a consequence of these fundamental objects. With this section complete, you have the mathematical foundation to understand not just how modern optimisers work, but why they work - and where they break down.

<- Back to Chapter 05: Multivariate Calculus | Next: 03 Optimization on Manifolds ->


13. Numerical Methods and Finite Differences

Before automatic differentiation, Jacobians and Hessians were computed numerically. Even today, numerical methods are essential for verifying analytical gradients (gradient checking), debugging implementations, and understanding numerical stability.

13.1 Finite Difference Approximations

First-order differences. For f:RnRf: \mathbb{R}^n \to \mathbb{R}:

$$\frac{\partial f}{\partial x_j} \approx \frac{f(\mathbf{x} + h\mathbf{e}_j) - f(\mathbf{x})}{h} \quad \text{(forward difference, error } O(h)\text{)}$$

$$\frac{\partial f}{\partial x_j} \approx \frac{f(\mathbf{x} + h\mathbf{e}_j) - f(\mathbf{x} - h\mathbf{e}_j)}{2h} \quad \text{(centred difference, error } O(h^2)\text{)}$$

The centred difference is preferred: the symmetric evaluation cancels the even-order error terms, so the leading error is $O(h^2)$ rather than $O(h)$ and the formula is exact for quadratics. The price is two function evaluations per coordinate, whereas the forward difference can reuse $f(\mathbf{x})$ across all coordinates.

Second-order differences (Hessian diagonal).

$$\frac{\partial^2 f}{\partial x_j^2} \approx \frac{f(\mathbf{x}+h\mathbf{e}_j) - 2f(\mathbf{x}) + f(\mathbf{x}-h\mathbf{e}_j)}{h^2} \quad \text{(centred, error } O(h^2)\text{)}$$

Off-diagonal Hessian entries. A standard four-point formula:

$$\frac{\partial^2 f}{\partial x_i \partial x_j} \approx \frac{f(\mathbf{x}+h(\mathbf{e}_i+\mathbf{e}_j)) - f(\mathbf{x}+h\mathbf{e}_i) - f(\mathbf{x}+h\mathbf{e}_j) + f(\mathbf{x})}{h^2}$$

Computing the full Hessian this way requires O(n2)O(n^2) function evaluations - completely infeasible for large nn.
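
A minimal NumPy sketch of the formulas above (the helpers `fd_grad` and `fd_hess_entry` are our own names, not library functions), checked against a function whose derivatives are known in closed form:

```python
import numpy as np

def fd_grad(f, x, h=1e-5):
    """Centred-difference gradient, error O(h^2)."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = h
        g[j] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def fd_hess_entry(f, x, i, j, h=1e-4):
    """Four-point formula for d^2 f / dx_i dx_j."""
    ei = np.zeros_like(x); ei[i] = h
    ej = np.zeros_like(x); ej[j] = h
    return (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / h**2

f = lambda x: x[0]**2 * x[1] + np.sin(x[1])   # test function with known derivatives
x = np.array([1.0, 2.0])

print(fd_grad(f, x))               # analytic gradient: [2*x0*x1, x0^2 + cos(x1)] = [4, 1 + cos 2]
print(fd_hess_entry(f, x, 0, 1))   # analytic mixed partial: 2*x0 = 2
```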

13.2 Gradient Checking

Gradient checking compares the analytical gradient ganalytical=f(x)\mathbf{g}_{\text{analytical}} = \nabla f(\mathbf{x}) (computed by backprop) against the numerical estimate gnumerical\mathbf{g}_{\text{numerical}} (computed by centred differences):

$$\text{relative error} = \frac{\|\mathbf{g}_{\text{analytical}} - \mathbf{g}_{\text{numerical}}\|}{\|\mathbf{g}_{\text{analytical}}\| + \|\mathbf{g}_{\text{numerical}}\|}$$

A relative error of <105< 10^{-5} usually indicates a correct implementation; >103> 10^{-3} indicates a bug. This is a gold standard debugging technique - every significant ML framework implementation should be gradient-checked before trusting gradients.

Common pitfalls in gradient checking:

  1. Kink near the evaluation point: ReLU has f(0)f'(0) undefined. If the evaluation point happens to be near x=0x = 0 for some neuron, the numerical gradient straddles a kink and gives an inaccurate estimate. Fix: perturb the input slightly away from known kinks.
  2. Float32 precision: Centred differences in low precision suffer catastrophic cancellation for small $h$ (both function values round to the same number). A rule of thumb is $h \approx \varepsilon_{\text{mach}}^{1/3}$: roughly $5\times 10^{-3}$ in float32 and about $10^{-5}$ in float64 - and run gradient checks in float64 whenever possible.
  3. Dropout and batch norm: These layers behave differently in train vs eval mode. Gradient check with these layers frozen or in eval mode.

Stochastic gradient checking. Instead of checking all nn coordinates (cost: 2n2n function evaluations), check a random projection: compute vganalytical\mathbf{v}^\top \mathbf{g}_{\text{analytical}} (one dot product) and compare against (f(x+hv)f(xhv))/(2h)(f(\mathbf{x}+h\mathbf{v}) - f(\mathbf{x}-h\mathbf{v}))/(2h) (one centred difference in direction v\mathbf{v}). This costs only 2 function evaluations regardless of nn, and catches most gradient bugs.
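
A sketch of the directional check (assuming a generic Python callable `f` and a precomputed analytical gradient; the helper name and tolerances are illustrative):

```python
import numpy as np

def directional_grad_check(f, g_analytical, x, h=1e-5, rtol=1e-4, seed=0):
    """Compare v^T g (analytic) with a centred difference along a random direction v."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(x.shape)
    v /= np.linalg.norm(v)
    analytic = float(v @ g_analytical)
    numeric = (f(x + h * v) - f(x - h * v)) / (2 * h)
    rel_err = abs(analytic - numeric) / (abs(analytic) + abs(numeric) + 1e-12)
    return rel_err < rtol, rel_err

# Example: f(x) = ||x||^2 has gradient 2x, so the check should pass.
x = np.array([1.0, -2.0, 0.5])
ok, err = directional_grad_check(lambda z: float(z @ z), 2 * x, x)
print(ok, err)
```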

13.3 Numerical Stability of Jacobian Computations

Condition number and ill-conditioned Jacobians. The condition number κ(Jf)=σmax(Jf)/σmin(Jf)\kappa(J_f) = \sigma_{\max}(J_f)/\sigma_{\min}(J_f) measures how amplified input perturbations become in the output. For a neural network layer, κ(Jf)\kappa(J_f) large means:

  • Small input perturbations (e.g., numerical noise) produce large output perturbations
  • Backpropagated gradients through this layer are magnified or shrunk by κ\kappa

Ill-conditioning accumulates through depth. For a depth-LL network, the condition number of the full Jacobian J=JLJ1J = J_L \cdots J_1 satisfies:

$$\kappa(J) \leq \prod_{l=1}^L \kappa(J_l)$$

This bound can be exponential in $L$ - one route to exploding (and, via small singular values, vanishing) gradients. Good architecture design (residual connections, normalisation layers) keeps $\kappa(J_l) \approx 1$ for each layer.

Log-sum-exp trick. The softmax Jacobian involves pi=ezi/jezjp_i = e^{z_i}/\sum_j e^{z_j}. Direct computation overflows for large ziz_i. The numerically stable implementation subtracts maxjzj\max_j z_j before exponentiation:

$$p_i = \frac{e^{z_i - z_{\max}}}{\sum_j e^{z_j - z_{\max}}}$$

This shift does not change the output but keeps every exponential $\leq 1$. The Jacobian formula $J_\sigma = \text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top$ is unchanged - only the computation of $\mathbf{p}$ is stabilised.
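
A short NumPy sketch of the stabilised computation (helper names are ours), showing that the shifted softmax stays finite for logits that would overflow naively and that the Jacobian identity still holds:

```python
import numpy as np

def softmax_stable(z):
    """Shift by the max before exponentiation; the output is identical, overflow is avoided."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(p):
    """J_sigma = diag(p) - p p^T, evaluated from the (stabilised) probabilities."""
    return np.diag(p) - np.outer(p, p)

z = np.array([1000.0, 1001.0, 999.0])   # naive exp(z) would overflow even in float64
p = softmax_stable(z)
J = softmax_jacobian(p)
print(p)                                # well-defined probabilities
print(J @ np.ones(3))                   # ~0: the all-ones vector is still in the null space
```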

Mixed precision Hessian computations. When computing HVPs in mixed precision (float16/bfloat16 for activations, float32 for gradients), the second differentiation step (gradient of gradient) may accumulate significant rounding error. Google's T5X and JAX-based training systems perform HVP computations in float32 even during mixed-precision training to maintain numerical accuracy of second-order estimates.
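
A minimal PyTorch sketch of an HVP via double backward, carried out in float32 as described above (the quadratic test function is an illustrative assumption, not any particular framework's implementation):

```python
import torch

def hvp(loss_fn, params, v):
    """Hessian-vector product via gradient-of-gradient, computed in float32."""
    params = params.detach().to(torch.float32).requires_grad_(True)
    v = v.to(torch.float32)
    loss = loss_fn(params)
    (grad,) = torch.autograd.grad(loss, params, create_graph=True)
    (hv,) = torch.autograd.grad(grad @ v, params)
    return hv

# Quadratic test: f(w) = 0.5 * w^T A w has Hessian A, so hvp(...) must equal A @ v.
A = torch.tensor([[3.0, 1.0], [1.0, 2.0]])
loss_fn = lambda w: 0.5 * w @ A @ w
w = torch.tensor([1.0, -1.0])
v = torch.tensor([0.5, 2.0])
print(hvp(loss_fn, w, v))   # tensor([3.5, 4.5]) == A @ v
```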


14. Jacobians in Modern Transformer Architectures

Transformers introduced several operations with non-trivial Jacobians. Understanding these helps explain training behaviour, initialisation requirements, and the design of efficient fine-tuning methods.

14.1 Attention Jacobian

The scaled dot-product attention function maps (Q,K,V)O(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \to \mathbf{O}:

$$\mathbf{A} = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right), \qquad \mathbf{O} = \mathbf{A}\mathbf{V}$$

The Jacobian of the output for a single head and a single query position $i$ with respect to that query (writing the output $\mathbf{o}_i$ and query $\mathbf{q}_i$ as column vectors):

$$\frac{\partial \mathbf{o}_i}{\partial \mathbf{q}_i} = \frac{1}{\sqrt{d_k}}\,\mathbf{V}^\top J_\sigma(\mathbf{s}_i)\,\mathbf{K} \;\in\; \mathbb{R}^{d_v\times d_k}$$

where $\mathbf{s}_i = \mathbf{K}\mathbf{q}_i/\sqrt{d_k}$ are the attention logits, $\mathbf{a}_i = \sigma(\mathbf{s}_i)$ is the attention distribution, and $J_\sigma(\mathbf{s}_i) = \text{diag}(\mathbf{a}_i) - \mathbf{a}_i\mathbf{a}_i^\top$ is the softmax Jacobian. The full attention Jacobian (including derivatives with respect to $\mathbf{K}$ and $\mathbf{V}$) is a more complex composition involving the softmax Jacobian, the key matrix $\mathbf{K}$, and the value matrix $\mathbf{V}$.

Implications of the softmax Jacobian in attention:

  • The singular softmax Jacobian (null space = span{1\mathbf{1}}) means attention weights are translation-invariant: shifting all logits si\mathbf{s}_i by a constant doesn't change the output
  • Temperature τ\tau scaling (softmax(s/τ)\text{softmax}(\mathbf{s}/\tau)) scales the Jacobian: Jσ(τ)=(1/τ)Jσ(1)J_\sigma^{(\tau)} = (1/\tau) J_\sigma^{(1)}. High temperature (τ1\tau \gg 1) -> nearly-uniform attention -> small Jacobian -> slow learning. Low temperature (τ1\tau \ll 1) -> peaked attention -> large Jacobian -> sharp gradients.
  • Attention entropy collapse: if attention concentrates on one token (small τ\tau or sharp s\mathbf{s}), aej\mathbf{a} \approx \mathbf{e}_j for some jj, and Jσ0J_\sigma \approx 0 (PSD, small norm). Gradients vanish through the attention softmax.
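
A quick PyTorch check of the closed form above for a single query position (random sizes and seed are arbitrary):

```python
import torch

torch.manual_seed(0)
L, d_k, d_v = 5, 4, 3
K = torch.randn(L, d_k)
V = torch.randn(L, d_v)
q = torch.randn(d_k)

def attn_out(q):
    s = K @ q / d_k**0.5
    a = torch.softmax(s, dim=0)
    return V.T @ a                  # attention output for one query position, shape (d_v,)

# Autograd Jacobian vs the closed form V^T (diag(a) - a a^T) K / sqrt(d_k)
J_auto = torch.autograd.functional.jacobian(attn_out, q)
s = K @ q / d_k**0.5
a = torch.softmax(s, dim=0)
J_softmax = torch.diag(a) - torch.outer(a, a)
J_formula = V.T @ J_softmax @ K / d_k**0.5
print(torch.allclose(J_auto, J_formula, atol=1e-5))   # True
```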

14.2 RoPE and Orthogonal Jacobians

Rotary Position Embeddings (RoPE), used in LLaMA, Mistral, Qwen, DeepSeek, apply a position-dependent rotation to queries and keys. For position mm and dimension pair (2i,2i+1)(2i, 2i+1):

$$\begin{pmatrix}q_{2i}' \\ q_{2i+1}'\end{pmatrix} = \begin{pmatrix}\cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i)\end{pmatrix}\begin{pmatrix}q_{2i} \\ q_{2i+1}\end{pmatrix}$$

The Jacobian of the RoPE transformation (with respect to the query) is a block-diagonal rotation matrix RmO(d)R_m \in O(d) - a rotation matrix with:

  • JRoPE=RmJ_{\text{RoPE}} = R_m (orthogonal matrix)
  • det(Rm)=1\det(R_m) = 1 (volume-preserving)
  • σmax=σmin=1\sigma_{\max} = \sigma_{\min} = 1 (condition number = 1)
  • JRoPE2=1\|J_{\text{RoPE}}\|_2 = 1 (isometry: no gradient scaling)

RoPE's orthogonal Jacobian is a mathematically elegant design: gradient norms are preserved through the positional encoding transformation, preventing it from being a source of gradient instability.
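
A minimal NumPy sketch (the base-10000 frequency schedule follows the RoFormer convention; the helper name is ours) confirming that the RoPE map, whose Jacobian is the block-diagonal rotation $R_m$, preserves vector norms:

```python
import numpy as np

def rope_rotate(q, m, theta_base=10000.0):
    """Apply the RoPE rotation to a query vector q at position m (d must be even)."""
    d = q.size
    out = q.copy()
    for i in range(d // 2):
        theta = m * theta_base ** (-2 * i / d)
        c, s = np.cos(theta), np.sin(theta)
        x, y = q[2 * i], q[2 * i + 1]
        out[2 * i] = c * x - s * y
        out[2 * i + 1] = s * x + c * y
    return out

q = np.random.default_rng(0).standard_normal(8)
q_rot = rope_rotate(q, m=7)
print(np.linalg.norm(q), np.linalg.norm(q_rot))   # equal norms: the Jacobian R_m is orthogonal
```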

14.3 LoRA and the Low-Rank Jacobian Approximation

Low-Rank Adaptation (LoRA, Hu et al. 2022) freezes the original weight matrix W0Rm×nW_0 \in \mathbb{R}^{m\times n} and adds a trainable low-rank perturbation:

$$W = W_0 + BA, \quad B \in \mathbb{R}^{m\times r},\; A \in \mathbb{R}^{r\times n},\; r \ll \min(m,n)$$

The Jacobian of the layer output y=Wx\mathbf{y} = W\mathbf{x} with respect to the LoRA parameters (A,B)(A, B):

$$\frac{\partial \mathbf{y}}{\partial \text{vec}(A)} = B \otimes \mathbf{x}^\top, \qquad \frac{\partial \mathbf{y}}{\partial \text{vec}(B)} = I_m \otimes (A\mathbf{x})^\top$$

(here $\text{vec}$ stacks rows, the flattening a row-major reshape uses; under the column-stacking convention of Exercise 7 the Kronecker factors appear in the opposite order).

The gradient of the loss with respect to $B$: $\nabla_B \mathcal{L} = \boldsymbol{\delta}(A\mathbf{x})^\top$ - a rank-1 outer product for a single example, matching the structure of the full weight gradient. The gradient with respect to $A$: $\nabla_A \mathcal{L} = B^\top\boldsymbol{\delta}\mathbf{x}^\top$ - again a rank-1 outer product per example, and rank at most $r$ after summing over a batch.

Why LoRA works (Jacobian perspective). The empirical observation that full fine-tuning has low "intrinsic dimensionality" (Aghajanyan et al. 2021) means the weight gradient WL\nabla_W\mathcal{L} concentrates in a low-rank subspace during fine-tuning. LoRA parameterises updates in this subspace directly. The Jacobian of the LoRA update rule has mr+rnmr + rn parameters vs mnmn for full fine-tuning - a reduction of min(m,n)/r\min(m,n)/r in trainable parameters.
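
A PyTorch sketch verifying the gradient formulas above for a single example (dimensions, the random upstream gradient $\boldsymbol{\delta}$, and the non-zero initialisation of $B$ are assumptions made only for this check; LoRA initialises $B = 0$ in practice):

```python
import torch

torch.manual_seed(0)
m, n, r = 6, 5, 2
W0 = torch.randn(m, n)                       # frozen pretrained weight
A = torch.randn(r, n, requires_grad=True)    # trainable low-rank factors
B = torch.randn(m, r, requires_grad=True)    # random here so both checks are non-trivial
x = torch.randn(n)
delta = torch.randn(m)                       # stand-in for dL/dy

y = W0 @ x + B @ (A @ x)
y.backward(delta)                            # VJP with the upstream gradient delta

# Closed forms from the text: grad_B = delta (Ax)^T, grad_A = B^T delta x^T
print(torch.allclose(B.grad, torch.outer(delta, A @ x)))    # True
print(torch.allclose(A.grad, torch.outer(B.T @ delta, x)))  # True
print(B.grad.shape, A.grad.shape)                           # (m, r), (r, n): mr + rn parameters
```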

DoRA (Weight-Decomposed LoRA, Liu et al. 2024). Decomposes the weight as W=m(W0+BA)/W0+BAW = m\cdot(W_0 + BA)/\|W_0 + BA\| (magnitude x direction). The Jacobian of the direction normalisation introduces a LayerNorm-like structure - the gradient of the magnitude scalar and the direction matrix decouple, enabling more stable fine-tuning.

14.4 Gradient Flow Through Residual Connections

The residual connection y=x+F(x)\mathbf{y} = \mathbf{x} + F(\mathbf{x}) (introduced in ResNets, universal in transformers) has a special Jacobian structure:

$$J_{\mathbf{y}}(\mathbf{x}) = I + J_F(\mathbf{x})$$

This is the key insight: the Jacobian of a residual block is always $I$ plus a perturbation. Even if $J_F$ has small singular values (vanishing), the sum $I + J_F$ has singular values bounded below by $1 - \|J_F\|_2$ whenever $\|J_F\|_2 < 1$. For a well-initialised network, $\|J_F\|_2 \ll 1$ at init, so $J_{\mathbf{y}} \approx I$ - the identity map passes gradients through unchanged.

Depth effect. For an LL-layer residual network:

$$J_{\text{network}} = \prod_{l=1}^L (I + J_{F_l}) = I + \sum_l J_{F_l} + \sum_{l < l'} J_{F_{l'}}J_{F_l} + \cdots$$

The leading term is II (identity shortcut from input to output), followed by first-order terms (single-layer Jacobians), then higher-order interaction terms. This identity shortcut is why ResNets train stably at depth L=1000+L = 1000+: the gradient can always flow through the identity path without attenuation.

Comparison to plain networks. For a plain (non-residual) network: Jnetwork=JFLJF1J_{\text{network}} = J_{F_L}\cdots J_{F_1}. This product can have exponentially small singular values at depth - the vanishing gradient problem. Residual connections solve this by ensuring the Jacobian includes an identity component.

For transformers: Every transformer layer adds a residual connection around both the attention block and the MLP block. The combined Jacobian is (I+Jattn)(I+JMLP)=I+Jattn+JMLP+JattnJMLP(I + J_{\text{attn}})(I + J_{\text{MLP}}) = I + J_{\text{attn}} + J_{\text{MLP}} + J_{\text{attn}}J_{\text{MLP}} - dominated by II at initialisation, with the attention and MLP Jacobians as small corrections.
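
A small NumPy experiment illustrating the contrast (random layer Jacobians with spectral norm fixed at 0.1 stand in for well-initialised residual branches; this is an illustration, not a claim about any particular architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 30, 16

def random_contraction(scale=0.1):
    """Random layer Jacobian J_F with spectral norm exactly `scale`."""
    G = rng.standard_normal((d, d))
    return scale * G / np.linalg.norm(G, 2)

layers = [random_contraction() for _ in range(L)]

J_plain = np.eye(d)
J_resid = np.eye(d)
for JF in layers:
    J_plain = JF @ J_plain                 # plain net: product of layer Jacobians
    J_resid = (np.eye(d) + JF) @ J_resid   # residual net: (I + J_F) at every layer

print(np.linalg.svd(J_plain, compute_uv=False)[[0, -1]])  # ~1e-30: gradients vanish with depth
print(np.linalg.svd(J_resid, compute_uv=False)[[0, -1]])  # bounded below by (1 - 0.1)^L ~ 0.04
```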


15. Worked Examples: End-to-End Computations

15.1 Complete Backprop Trace for a Two-Layer Network

Let us trace the full computation of Jacobians and VJPs for the small network:

$$\mathbf{z}^{(1)} = W_1\mathbf{x}, \qquad \mathbf{h}^{(1)} = \text{ReLU}(\mathbf{z}^{(1)}), \qquad \mathbf{z}^{(2)} = W_2\mathbf{h}^{(1)}, \qquad \mathcal{L} = \frac{1}{2}\|\mathbf{z}^{(2)} - \mathbf{y}\|^2$$

with W1Rd1×d0W_1 \in \mathbb{R}^{d_1\times d_0}, W2Rd2×d1W_2 \in \mathbb{R}^{d_2\times d_1}.

Forward pass (computing values):

  1. z(1)=W1x\mathbf{z}^{(1)} = W_1\mathbf{x} (linear)
  2. h(1)=ReLU(z(1))\mathbf{h}^{(1)} = \text{ReLU}(\mathbf{z}^{(1)}) (elementwise)
  3. z(2)=W2h(1)\mathbf{z}^{(2)} = W_2\mathbf{h}^{(1)} (linear)
  4. L=12z(2)y2\mathcal{L} = \frac{1}{2}\|\mathbf{z}^{(2)} - \mathbf{y}\|^2 (loss)

Backward pass (computing Jacobians and VJPs):

Start from the loss: Lz(2)=z(2)yRd2\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{(2)}} = \mathbf{z}^{(2)} - \mathbf{y} \in \mathbb{R}^{d_2} (call this δ(2)\boldsymbol{\delta}^{(2)}).

Step 1: Through linear layer W2W_2.

The Jacobian of z(2)=W2h(1)\mathbf{z}^{(2)} = W_2\mathbf{h}^{(1)} with respect to W2W_2 and h(1)\mathbf{h}^{(1)}:

  • Jz(2),W2J_{\mathbf{z}^{(2)}, W_2}: zi(2)[W2]ij=hj(1)\frac{\partial z^{(2)}_i}{\partial [W_2]_{ij}} = h^{(1)}_j, so W2L=δ(2)(h(1))Rd2×d1\nabla_{W_2}\mathcal{L} = \boldsymbol{\delta}^{(2)}(\mathbf{h}^{(1)})^\top \in \mathbb{R}^{d_2\times d_1}
  • Jz(2),h(1)=W2J_{\mathbf{z}^{(2)}, \mathbf{h}^{(1)}} = W_2, so VJP: δ(1)=W2δ(2)Rd1\boldsymbol{\delta}^{(1)} = W_2^\top\boldsymbol{\delta}^{(2)} \in \mathbb{R}^{d_1}

Step 2: Through ReLU.

$J_{\mathbf{h}^{(1)}, \mathbf{z}^{(1)}} = \text{diag}(\mathbf{1}[\mathbf{z}^{(1)} > 0])$ (diagonal mask). VJP:

$$\frac{\partial\mathcal{L}}{\partial\mathbf{z}^{(1)}} = \boldsymbol{\delta}^{(1)} \odot \mathbf{1}[\mathbf{z}^{(1)} > 0] \in \mathbb{R}^{d_1}$$

(elementwise multiply by the activation mask)

Step 3: Through linear layer W1W_1.

Let δ(0)=Lz(1)\boldsymbol{\delta}^{(0)} = \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{(1)}}.

  • W1L=δ(0)xRd1×d0\nabla_{W_1}\mathcal{L} = \boldsymbol{\delta}^{(0)}\mathbf{x}^\top \in \mathbb{R}^{d_1\times d_0}
  • xL=W1δ(0)Rd0\nabla_{\mathbf{x}}\mathcal{L} = W_1^\top\boldsymbol{\delta}^{(0)} \in \mathbb{R}^{d_0} (if we need the input gradient)

Summary of the structure. Every backward step computes:

  • Weight gradient = (backpropagated error) x (forward activation)^\top - always a rank-1 outer product in a single example
  • Activation gradient = (weight matrix)^\top x (backpropagated error) - transposed Jacobian times incoming gradient

This pattern repeats identically for any depth: backward pass = chain of transposed-Jacobian-vector products (VJPs), with weight gradients as outer products at each layer.
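
The trace above is easy to check against autograd. A PyTorch sketch with made-up dimensions:

```python
import torch

torch.manual_seed(0)
d0, d1, d2 = 4, 5, 3
W1 = torch.randn(d1, d0, requires_grad=True)
W2 = torch.randn(d2, d1, requires_grad=True)
x = torch.randn(d0)
y = torch.randn(d2)

# Forward pass
z1 = W1 @ x
h1 = torch.relu(z1)
z2 = W2 @ h1
loss = 0.5 * torch.sum((z2 - y) ** 2)
loss.backward()

# Manual backward pass following the trace above
delta2 = (z2 - y).detach()                      # dL/dz2
grad_W2 = torch.outer(delta2, h1.detach())      # delta2 (h1)^T
delta1 = W2.detach().T @ delta2                 # VJP through W2
delta0 = delta1 * (z1.detach() > 0).float()     # elementwise mask through ReLU
grad_W1 = torch.outer(delta0, x)                # delta0 x^T

print(torch.allclose(W2.grad, grad_W2), torch.allclose(W1.grad, grad_W1))  # True True
```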

15.2 Numerical Verification of the Softmax Jacobian

A worked numerical example. Take z=(1,0,1)\mathbf{z} = (1, 0, -1)^\top and K=3K = 3.

Compute p=softmax(z)\mathbf{p} = \text{softmax}(\mathbf{z}):

$$e^1 \approx 2.718, \quad e^0 = 1, \quad e^{-1} \approx 0.368, \qquad Z = 2.718+1+0.368 = 4.086$$

$$p_1 = 2.718/4.086 \approx 0.665, \quad p_2 = 1/4.086 \approx 0.245, \quad p_3 = 0.368/4.086 \approx 0.090$$

Jacobian: Jσ=diag(p)ppJ_\sigma = \text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top:

$$J_\sigma = \begin{pmatrix}0.665 & 0 & 0\\ 0 & 0.245 & 0 \\ 0 & 0 & 0.090\end{pmatrix} - \begin{pmatrix}0.665\\ 0.245\\ 0.090\end{pmatrix}\begin{pmatrix}0.665 & 0.245 & 0.090\end{pmatrix}$$

$$= \begin{pmatrix}0.665 - 0.442 & -0.163 & -0.060\\ -0.163 & 0.245 - 0.060 & -0.022\\ -0.060 & -0.022 & 0.090 - 0.008\end{pmatrix} = \begin{pmatrix}0.222 & -0.163 & -0.060\\ -0.163 & 0.185 & -0.022\\ -0.060 & -0.022 & 0.082\end{pmatrix}$$

Verification:

  • Row sums: 0.2220.1630.060=0.00100.222 - 0.163 - 0.060 = -0.001 \approx 0 (rounding; exact = 0)
  • Jσ1=0J_\sigma\mathbf{1} = 0 (null space contains 1\mathbf{1})
  • Symmetry: J12=J21=0.163J_{12} = J_{21} = -0.163
  • Diagonal = pi(1pi)p_i(1-p_i): 0.665×0.335=0.2230.2220.665 \times 0.335 = 0.223 \approx 0.222

Eigenvalues (computed numerically): $\lambda_1 \approx 0.371$, $\lambda_2 \approx 0.119$, $\lambda_3 = 0$. Two nonzero eigenvalues (rank $2 = K-1$), one zero eigenvalue whose eigenvector is $\mathbf{1}/\sqrt{3}$.
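
A few lines of NumPy reproduce the worked example (values differ from the above only by rounding):

```python
import numpy as np

z = np.array([1.0, 0.0, -1.0])
p = np.exp(z) / np.exp(z).sum()
J = np.diag(p) - np.outer(p, p)

print(np.round(p, 3))                      # [0.665 0.245 0.09 ]
print(np.round(J, 3))                      # matches the matrix above up to rounding
print(np.allclose(J @ np.ones(3), 0.0))    # True: the all-ones vector spans the null space
print(np.round(np.linalg.eigvalsh(J), 3))  # one eigenvalue 0, two positive
```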

15.3 Newton's Method: Traced Convergence on a Quadratic

Take f(x,y)=12(x2+25y2)f(x, y) = \frac{1}{2}(x^2 + 25y^2) (condition number κ=25\kappa = 25). Minimum at (0,0)(0, 0).

f=(x,25y)\nabla f = (x, 25y)^\top, Hf=diag(1,25)H_f = \text{diag}(1, 25).

Gradient descent with η=2/(1+25)=0.0769\eta = 2/(1+25) = 0.0769:

| $t$ | $x_t$ | $y_t$ | $f(\mathbf{x}_t)$ | $\lVert(x_t, y_t)\rVert$ |
| --- | --- | --- | --- | --- |
| 0 | 1.0 | 1.0 | 13.0 | 1.414 |
| 1 | 0.923 | 0.923 | 11.08 | 1.305 |
| 5 | 0.652 | 0.652 | 5.51 | 0.922 |
| 20 | 0.215 | 0.215 | 0.599 | 0.304 |
| 50 | 0.029 | 0.029 | 0.011 | 0.041 |

Contraction ratio: ρ=(251)/(25+1)0.923\rho = (25-1)/(25+1) \approx 0.923 - slow!

Newton's method with exact Hessian H1=diag(1,1/25)H^{-1} = \text{diag}(1, 1/25):

Newton step: δ=H1f=(x,y)\boldsymbol{\delta} = -H^{-1}\nabla f = -(x, y)^\top.

From (x0,y0)=(1,1)(x_0, y_0) = (1, 1): Newton step gives (11,11)=(0,0)(1-1, 1-1) = (0, 0) - exactly one step!

This is because ff is exactly quadratic: the Newton step minimises the quadratic approximation exactly, and the approximation is exact for quadratic functions. For non-quadratic functions, Newton converges in O(loglog(1/ε))O(\log\log(1/\varepsilon)) steps near the minimum.

Key insight. GD takes 50\sim 50 steps to get to error 0.030.03; Newton takes 1 step to get to error 00. The difference is the Hessian inverse which scales the poorly-conditioned direction (the yy-direction with curvature 25) by 1/251/25, compensating exactly.
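
A minimal NumPy replay of the comparison, using the same quadratic and the same step size:

```python
import numpy as np

H = np.diag([1.0, 25.0])                 # Hessian of f(x, y) = 0.5 (x^2 + 25 y^2)
grad = lambda w: H @ w
eta = 2.0 / (1.0 + 25.0)                 # best fixed step size for this spectrum

w_gd = np.array([1.0, 1.0])
for _ in range(50):
    w_gd = w_gd - eta * grad(w_gd)
print(np.linalg.norm(w_gd))              # still a few percent away after 50 GD steps

w_nt = np.array([1.0, 1.0])
w_nt = w_nt - np.linalg.solve(H, grad(w_nt))
print(np.linalg.norm(w_nt))              # 0.0 after a single Newton step
```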


16. Summary of Key Formulas

A quick-reference table of all major Jacobians and Hessians covered in this section.

JACOBIAN REFERENCE TABLE


  Function                  | Jacobian J_f                        | Shape
  
  f(x) = Ax + b             | A                                   | (m,n)
  f(x) = sigma(x)  elementwise  | diag(sigma'(x))                         | (n,n)
  f(x) = softmax(x)         | diag(p) - pp^T                      | (K,K)
  f(x) = LayerNorm(x)       | (1/sigma)(I - (1/d)11^T - (1/d) xhat xhat^T), xhat = (x - xbar 1)/sigma | (d,d)
  f(x) = ||x||_2             | x^T / ||x||_2  (row vector)          | (1,n)
  f(x) = x/||x||_2           | (1/||x||)(I - xx^T/||x||^2)          | (n,n)
  f(x) = log(sum(exp(x)))   | softmax(x)^T  (row vector)          | (1,n)
  f(x) = ReLU(x)            | diag(1[x > 0])  (a.e.)              | (n,n)
  f(x,y) = xy               | [y, x]  (row Jacobian)              | (1,2)
  Chain: fog                | J_f(g(x)) * J_g(x)                  | (m,n)



HESSIAN REFERENCE TABLE


  Function f(x)             | Hessian H_f                         | Definiteness
  
  (1/2)x^T A x              | A                                   | same as A
  (1/2)||Ax - b||^2          | A^T A                               | PSD
  log-sum-exp(x)            | diag(p) - pp^T                      | PSD, rank K-1
  logistic CE, 1 sample     | p(1-p) xx^T  (rank-1)               | PSD, rank 1
  logistic CE, m samples    | sum_i p_i(1-p_i) x_i x_i^T          | PSD
  f(x) = ||x||^2             | 2I                                  | PD
  f(x) = 1/||x||            | (3xx^T - ||x||^2I)/||x||^5           | Indefinite
  ReLU network              | 0 almost everywhere                  | PSD (trivially)


Cost summary:

| Computation | Naive cost | Efficient cost |
| --- | --- | --- |
| Full Jacobian $J_f \in \mathbb{R}^{m\times n}$ | $O(mn)$ AD calls | - |
| JVP $J_f\mathbf{v}$ (single vector) | 1 forward pass | $O(n)$ |
| VJP $J_f^\top\mathbf{u}$ (single vector) | 1 backward pass | $O(n)$ |
| Full gradient $\nabla f$ for scalar $f$ | - | 1 backward pass, $O(n)$ |
| Full Hessian $H_f \in \mathbb{R}^{n\times n}$ | $n$ backward passes | $O(n^2)$ |
| HVP $H_f\mathbf{v}$ | - | 2 backward passes, $O(n)$ |
| Top eigenvalue $\lambda_{\max}$ | - | $\sim 20$ HVPs, $O(n)$ each |
| Solve $H_f\mathbf{x} = \mathbf{b}$ (CG) | - | $O(\sqrt{\kappa})$ HVPs |
| K-FAC update per layer | $O(n_{\text{in}}^3 n_{\text{out}}^3)$ | $O(n_{\text{in}}^3 + n_{\text{out}}^3)$ |

References

  1. Spivak, M. (1965). Calculus on Manifolds. Benjamin. - The rigorous treatment of Fréchet derivatives and the inverse/implicit function theorems.

  2. Rumelhart, D., Hinton, G., Williams, R. (1986). "Learning representations by back-propagating errors." Nature, 323. - The original backpropagation paper; backprop = iterated VJP.

  3. Baydin, A., Pearlmutter, B., Radul, A., Siskind, J. (2018). "Automatic differentiation in machine learning: a survey." JMLR, 18(153). - Comprehensive AD survey covering forward/reverse mode and higher-order derivatives.

  4. Pearlmutter, B. (1994). "Fast exact multiplication by the Hessian." Neural Computation, 6(1). - The original HVP identity paper.

  5. Martens, J., Grosse, R. (2015). "Optimizing neural networks with Kronecker-factored approximate curvature." ICML. - K-FAC paper.

  6. Ghorbani, B., Krishnan, S., Xiao, Y. (2019). "An investigation into neural net optimization via Hessian eigenvalue density." ICML. - Hessian spectrum analysis of deep networks.

  7. Cohen, J., Kaur, S., Li, Y., Kolter, J., Talwalkar, A. (2022). "Gradient descent on neural networks typically occurs at the edge of stability." ICLR. - Edge of stability and λmax\lambda_{\max} dynamics.

  8. Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B. (2021). "Sharpness-Aware Minimization for Efficiently Improving Generalization." ICLR. - SAM algorithm.

  9. Hu, E., Shen, Y., Wallis, P., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR. - LoRA; Jacobian perspective on parameter-efficient fine-tuning.

  10. Su, J., Lu, Y., Pan, S., et al. (2024). "RoFormer: Enhanced transformer with rotary position embedding." Neurocomputing. - RoPE; orthogonal Jacobian design for position embeddings.

  11. Kook, Y., Sra, S. (2024). "Geometric analysis of neural collapse." - Geometric perspective on loss landscape and Hessian structure.

  12. Boyd, S., Vandenberghe, L. (2004). Convex Optimization. Cambridge. - Definitive reference for second-order analysis, Newton's method, and optimality conditions.


End of 02 Jacobians and Hessians

<- Back to Chapter 05: Multivariate Calculus | Next: 03 Optimization on Manifolds ->