
Chain Rule and Backpropagation: Part 1 - From Intuition to Common Mistakes


1. Intuition

1.1 From Single-Variable to Multivariate Chain Rule

The single-variable chain rule says: if $y = f(u)$ and $u = g(x)$, then

$$\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx}$$

The intuition: rates of change compose multiplicatively. If $g$ triples its input and $f$ doubles its input, then $f \circ g$ multiplies by six.

The multivariate generalisation replaces scalars with vectors and scalar derivatives with Jacobian matrices. If $\mathbf{y} = f(\mathbf{u})$ and $\mathbf{u} = g(\mathbf{x})$, then

$$J_{f\circ g}(\mathbf{x}) = J_f(g(\mathbf{x})) \cdot J_g(\mathbf{x})$$

The product is now matrix multiplication. This is not a different rule - it is the same rule, stated in the correct language for vector-valued functions. The single-variable rule is the special case $n = m = p = 1$ where Jacobians degenerate to scalars.

SCALAR CHAIN RULE vs JACOBIAN CHAIN RULE


  Scalar:    x -> g -> u -> f -> y
             R         R         R
             dy/dx = (dy/du)(du/dx)   [scalar multiplication]

  Vector:    x -> g -> u -> f -> y
             R^p       R^n       R^m
             J_{fog} = J_f * J_g     [matrix multiplication]
             (mxp) = (mxn) * (nxp)

  The dimensions work out exactly like matrix multiplication.
  The chain rule IS matrix multiplication for Jacobians.


What makes the multivariate version non-trivial is that $J_f$ must be evaluated at $g(\mathbf{x})$ - the output of the inner function - not at $\mathbf{x}$ itself. This point-dependence is where the local linear approximation lives: the Jacobian $J_f(g(\mathbf{x}))$ is the best linear approximation to $f$ at the specific point $g(\mathbf{x})$, and $J_g(\mathbf{x})$ is the best linear approximation to $g$ at $\mathbf{x}$.
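This is easy to verify numerically. A minimal sketch using torch.autograd.functional.jacobian (the particular f and g below are arbitrary choices for illustration):

import torch
from torch.autograd.functional import jacobian

def g(x):                                 # g: R^3 -> R^2
    return torch.stack([x[0] * x[1], torch.sin(x[2])])

def f(u):                                 # f: R^2 -> R^2
    return torch.stack([u[0] + u[1] ** 2, torch.exp(u[0])])

x = torch.tensor([0.5, -1.0, 2.0])

J_g = jacobian(g, x)                      # shape (2, 3), evaluated at x
J_f = jacobian(f, g(x))                   # shape (2, 2), evaluated at g(x), not x
J_h = jacobian(lambda t: f(g(t)), x)      # shape (2, 3)

assert torch.allclose(J_h, J_f @ J_g, atol=1e-6)   # the chain rule as a matmul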

1.2 Backpropagation as Iterated Chain Rule

A deep neural network is a long composition of functions:

$$\mathcal{L} = \ell\bigl(f_L(\cdots f_2(f_1(\mathbf{x}))\cdots)\bigr)$$

where each $f_l$ is a layer (linear + activation), $\ell$ is the loss function, and $\mathcal{L}$ is a scalar. Computing $\nabla_{\mathbf{w}^{[l]}}\mathcal{L}$ - the gradient of the loss with respect to layer $l$'s parameters - requires applying the chain rule through every layer from $l$ to $L$.

The chain rule gives:

$$\nabla_{\mathbf{w}^{[l]}}\mathcal{L} = J_{f_l,\mathbf{w}}(\mathbf{a}^{[l-1]})^\top \cdot \boldsymbol{\delta}^{[l]}$$

where $\boldsymbol{\delta}^{[l]} = \nabla_{\mathbf{z}^{[l]}}\mathcal{L}$ is the error signal at layer $l$, and it satisfies the backpropagation recurrence:

$$\boldsymbol{\delta}^{[l]} = J_{f_{l+1},\mathbf{a}}(\mathbf{z}^{[l]})^\top \cdot \boldsymbol{\delta}^{[l+1]}$$

This recurrence propagates the error signal backward from layer $L$ to layer 1 - hence "backpropagation." At each step, we multiply by the transposed Jacobian of the next layer. The entire algorithm is:

  1. Forward pass: compute and store $\mathbf{z}^{[l]}$, $\mathbf{a}^{[l]}$ for $l = 1, \ldots, L$
  2. Initialise: $\boldsymbol{\delta}^{[L]} = \nabla_{\mathbf{z}^{[L]}}\mathcal{L}$
  3. Backward pass: compute $\boldsymbol{\delta}^{[l]}$ for $l = L-1, \ldots, 1$ using the recurrence
  4. Gradients: extract $\nabla_{W^{[l]}}\mathcal{L}$ from $\boldsymbol{\delta}^{[l]}$ and $\mathbf{a}^{[l-1]}$

Backpropagation is not a fundamentally different concept from the chain rule. It is the chain rule, applied efficiently by sharing intermediate computations (the error signals $\boldsymbol{\delta}^{[l]}$) across all parameters in a layer.
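The four steps fit in a few lines of NumPy. A minimal sketch for a two-layer ReLU network with MSE loss (single sample; all sizes and values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                       # a^[0]
y = rng.normal(size=2)                       # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# 1. Forward pass: compute and store z^[l], a^[l]
z1 = W1 @ x + b1
a1 = np.maximum(z1, 0)                       # ReLU
z2 = W2 @ a1 + b2                            # linear output layer
loss = 0.5 * np.sum((z2 - y) ** 2)

# 2. Initialise the error signal at the output
delta2 = z2 - y                              # dL/dz^[2] for MSE

# 3. Backward recurrence: delta^[l] = (W^[l+1])^T delta^[l+1] * sigma'(z^[l])
delta1 = (W2.T @ delta2) * (z1 > 0)

# 4. Extract parameter gradients from deltas and cached activations
dW2, db2 = np.outer(delta2, a1), delta2
dW1, db1 = np.outer(delta1, x), delta1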

1.3 Historical Context

Year | Contributor | Development
---- | ----------- | -----------
1676 | Leibniz | Differential calculus; first statement of the single-variable chain rule
1755 | Euler | Extended to multiple variables
1960 | Kelley | Gradient computation for optimal control (independent discovery of the backprop concept)
1970 | Linnainmaa | First complete description of reverse-mode automatic differentiation for computing gradients
1974 | Werbos | First application to neural networks in his PhD thesis
1986 | Rumelhart, Hinton, Williams | Popularised backpropagation in "Learning representations by back-propagating errors" - the paper that launched the neural network revolution
1989 | LeCun | Applied backprop to convolutional networks for handwritten digit recognition
2012 | Krizhevsky, Sutskever, Hinton | AlexNet demonstrated GPU-accelerated backprop at scale - kicked off the deep learning era
2015 | Google Brain, Facebook AI | TensorFlow (2015) and later PyTorch: automatic differentiation engines that compute backprop automatically
2017 | Vaswani et al. | Transformer: backprop through multi-head attention; the architecture underlying GPT, BERT, LLaMA
2021 | Hu et al. (LoRA) | Parameter-efficient fine-tuning by limiting gradient flow to low-rank subspaces
2022 | Dao et al. (FlashAttention) | Recompute activations during the backward pass to avoid materialising the $N\times N$ attention matrix

1.4 Why Backprop Defines Modern AI

Every large language model, image classifier, and diffusion model trained today relies on backpropagation for every gradient update. The scale is staggering: training GPT-4 reportedly required $\sim 10^{25}$ floating-point operations, the vast majority of which are forward and backward passes through the transformer network.

Backprop enables gradient-based learning at scale because its cost is a small constant multiple of the forward pass - $O(\text{FLOPs}(f))$, independent of the number of parameters $n$. Alternative approaches (finite differences, evolution strategies, zeroth-order methods) cost $O(n \cdot \text{FLOPs}(f))$ and are orders of magnitude more expensive.

Three properties make backprop indispensable:

  1. Efficiency: One backward pass computes $\nabla_{\boldsymbol{\theta}}\mathcal{L}$ for all $|\boldsymbol{\theta}| \sim 10^{10}$ parameters simultaneously. Finite differences would need $10^{10}$ forward passes.

  2. Exactness: Unlike finite differences, backprop computes the exact gradient (up to floating-point precision), not an approximation.

  3. Composability: Any differentiable function composed of differentiable primitives has an automatically computable gradient. This is why PyTorch/JAX can differentiate arbitrary Python code that uses differentiable operations.

For AI in 2026: The gradient is the workhorse of every training algorithm: SGD, Adam, AdaGrad, Muon, SOAP - all are gradient-based. Fine-tuning (LoRA, QLoRA, DoRA), RLHF (PPO, DPO, GRPO), distillation, and continual learning all depend on backprop. Even methods that appear gradient-free (evolutionary strategies, black-box optimisation) are often used because they approximate the gradient in settings where backprop is unavailable (non-differentiable objectives, external APIs).


2. The Multivariate Chain Rule - Full Theory

2.1 The General Chain Rule - Proof

We prove the chain rule using the Fréchet derivative from 02. Recall:

Definition. $f: U \subseteq \mathbb{R}^n \to \mathbb{R}^m$ is Fréchet differentiable at $\mathbf{x}$ if there exists a linear map $L_\mathbf{x}: \mathbb{R}^n \to \mathbb{R}^m$ such that

$$\lim_{\|\boldsymbol{\delta}\|\to 0}\frac{\|f(\mathbf{x}+\boldsymbol{\delta}) - f(\mathbf{x}) - L_\mathbf{x}\boldsymbol{\delta}\|}{\|\boldsymbol{\delta}\|} = 0$$

The matrix of $L_\mathbf{x}$ is the Jacobian $J_f(\mathbf{x})$.

Theorem (Chain Rule). Let $g: \mathbb{R}^p \to \mathbb{R}^n$ be Fréchet differentiable at $\mathbf{x}$, and $f: \mathbb{R}^n \to \mathbb{R}^m$ be Fréchet differentiable at $g(\mathbf{x})$. Then $h = f \circ g: \mathbb{R}^p \to \mathbb{R}^m$ is Fréchet differentiable at $\mathbf{x}$ and

$$J_h(\mathbf{x}) = J_f(g(\mathbf{x}))\cdot J_g(\mathbf{x})$$

Proof. Let $\mathbf{u}_0 = g(\mathbf{x})$. We need to show that $J_f(\mathbf{u}_0)J_g(\mathbf{x})$ is the Fréchet derivative of $h$ at $\mathbf{x}$. Write:

$$h(\mathbf{x}+\boldsymbol{\delta}) - h(\mathbf{x}) = f(g(\mathbf{x}+\boldsymbol{\delta})) - f(g(\mathbf{x}))$$

Let $\boldsymbol{\eta} = g(\mathbf{x}+\boldsymbol{\delta}) - g(\mathbf{x})$. Since $g$ is Fréchet differentiable:

$$\boldsymbol{\eta} = J_g(\mathbf{x})\boldsymbol{\delta} + \mathbf{r}_g(\boldsymbol{\delta}), \qquad \frac{\|\mathbf{r}_g(\boldsymbol{\delta})\|}{\|\boldsymbol{\delta}\|} \to 0 \text{ as } \boldsymbol{\delta} \to \mathbf{0}$$

Now apply Fréchet differentiability of $f$ at $\mathbf{u}_0$:

$$f(\mathbf{u}_0 + \boldsymbol{\eta}) - f(\mathbf{u}_0) = J_f(\mathbf{u}_0)\boldsymbol{\eta} + \mathbf{r}_f(\boldsymbol{\eta}), \qquad \frac{\|\mathbf{r}_f(\boldsymbol{\eta})\|}{\|\boldsymbol{\eta}\|} \to 0 \text{ as } \boldsymbol{\eta} \to \mathbf{0}$$

Substituting:

$$h(\mathbf{x}+\boldsymbol{\delta}) - h(\mathbf{x}) = J_f(\mathbf{u}_0)J_g(\mathbf{x})\boldsymbol{\delta} + \underbrace{J_f(\mathbf{u}_0)\mathbf{r}_g(\boldsymbol{\delta}) + \mathbf{r}_f(\boldsymbol{\eta})}_{\text{remainder}}$$

We show the remainder is $o(\|\boldsymbol{\delta}\|)$:

  • First term: $\|J_f(\mathbf{u}_0)\mathbf{r}_g(\boldsymbol{\delta})\| \leq \|J_f(\mathbf{u}_0)\|\cdot\|\mathbf{r}_g(\boldsymbol{\delta})\| = o(\|\boldsymbol{\delta}\|)$.
  • Second term: Since $\|\boldsymbol{\eta}\| \leq \|J_g\|\|\boldsymbol{\delta}\| + \|\mathbf{r}_g\| = O(\|\boldsymbol{\delta}\|)$, we have $\|\mathbf{r}_f(\boldsymbol{\eta})\| = o(\|\boldsymbol{\eta}\|) = o(\|\boldsymbol{\delta}\|)$.

Therefore $J_h(\mathbf{x}) = J_f(g(\mathbf{x}))J_g(\mathbf{x})$. $\square$

When the chain rule fails. The chain rule requires both $g$ at $\mathbf{x}$ and $f$ at $g(\mathbf{x})$ to be Fréchet differentiable. If either fails - for example at a ReLU kink where $\mathbf{z}^{[l]} = 0$ - the classical chain rule does not apply. In practice, these measure-zero sets are handled by choosing a subgradient (any element of the Clarke subdifferential), which is what deep learning frameworks do automatically.

2.2 Three Cases in Increasing Generality

Case 1: Scalar composition $h: \mathbb{R} \to \mathbb{R}$. $h(x) = f(g(x))$ where $f, g: \mathbb{R} \to \mathbb{R}$. Jacobians are $1\times 1$ = scalars, so $h'(x) = f'(g(x)) \cdot g'(x)$.

Case 2: Scalar loss of a vector function. $h: \mathbb{R}^n \to \mathbb{R}$, $h(\mathbf{x}) = f(g(\mathbf{x}))$ where $g: \mathbb{R}^n \to \mathbb{R}^m$ and $f: \mathbb{R}^m \to \mathbb{R}$. Jacobians: $J_g \in \mathbb{R}^{m\times n}$ and $J_f \in \mathbb{R}^{1\times m}$ (a row vector). So:

$$J_h = J_f(g(\mathbf{x})) \cdot J_g(\mathbf{x}) \in \mathbb{R}^{1\times n}$$

Taking the transpose: $\nabla_\mathbf{x} h = J_g(\mathbf{x})^\top \nabla_{g(\mathbf{x})} f$ - the gradient of $h$ with respect to $\mathbf{x}$ is the transposed Jacobian of $g$ times the gradient of $f$. This is the VJP equation, the core of backprop.

Case 3: Vector composition $h: \mathbb{R}^p \to \mathbb{R}^m$. The most general case; Jacobians are full matrices and the chain rule is full matrix multiplication:

$$J_h(\mathbf{x}) = J_f(g(\mathbf{x})) \cdot J_g(\mathbf{x}) \in \mathbb{R}^{m\times p}$$

The dimensions verify: $(m\times p) = (m\times n)(n\times p)$. The "inner dimension" $n$ (the dimension of the intermediate space $\mathbb{R}^n$) cancels in the product, exactly as in matrix multiplication.

2.3 The VJP Form - Foundation of Backprop

Definition (VJP). For $g: \mathbb{R}^n \to \mathbb{R}^m$ and a "cotangent" vector $\mathbf{u} \in \mathbb{R}^m$, the vector-Jacobian product is:

$$\text{VJP}(g, \mathbf{x}, \mathbf{u}) = J_g(\mathbf{x})^\top \mathbf{u} \in \mathbb{R}^n$$

Why this is the right primitive for backprop. For a scalar loss $\mathcal{L}: \mathbb{R}^m \to \mathbb{R}$ composed with $g: \mathbb{R}^n \to \mathbb{R}^m$:

$$\nabla_\mathbf{x}(\mathcal{L} \circ g) = J_g(\mathbf{x})^\top \nabla_{g(\mathbf{x})}\mathcal{L} = \text{VJP}(g, \mathbf{x}, \nabla_{g(\mathbf{x})}\mathcal{L})$$

The gradient of the composed function with respect to the input is the VJP of the inner function, with the cotangent being the gradient of the outer function.

The backprop recursion is a chain of VJPs. For $\mathcal{L} = \ell \circ f_L \circ \cdots \circ f_1$:

$$\nabla_{\mathbf{a}^{[l]}}\mathcal{L} = \text{VJP}(f_{l+1}, \mathbf{a}^{[l]}, \nabla_{\mathbf{a}^{[l+1]}}\mathcal{L}) = J_{f_{l+1}}(\mathbf{a}^{[l]})^\top \nabla_{\mathbf{a}^{[l+1]}}\mathcal{L}$$

Starting from $\nabla_{\mathbf{a}^{[L]}}\mathcal{L}$ and applying VJPs from right to left computes all intermediate gradients.

Cost comparison. Computing $\nabla_\mathbf{x}\mathcal{L}$ for $\mathcal{L}: \mathbb{R}^n \to \mathbb{R}$:

  • JVP (forward mode): requires $n$ passes (one per input dimension). Cost: $n \times O(\text{FLOPs}(g))$.
  • VJP (reverse mode): requires 1 pass. Cost: $O(\text{FLOPs}(g))$.

For $n \sim 10^{10}$ parameters, reverse mode is $10^{10}$ times cheaper. This asymmetry is why all gradient-based deep learning uses reverse mode (backprop).
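In PyTorch, torch.autograd.grad with a grad_outputs argument is exactly a VJP, and torch.autograd.functional.jvp gives the forward-mode counterpart. A minimal sketch (the function g is an arbitrary example):

import torch
from torch.autograd.functional import jvp

def g(x):                                    # g: R^5 -> R^5
    return torch.tanh(x).cumsum(0)

x = torch.randn(5, requires_grad=True)
y = g(x)

u = torch.randn(5)                           # cotangent vector
vjp = torch.autograd.grad(y, x, grad_outputs=u)[0]   # J_g(x)^T u, one pass

v = torch.randn(5)                           # tangent direction
_, jvp_val = jvp(g, (x,), (v,))              # J_g(x) v, one pass per direction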

2.4 Long Chains and Telescoping Products

For a depth-$L$ network $h = f_L \circ \cdots \circ f_1$, the Jacobian of $h$ with respect to $\mathbf{x}$ is:

$$J_h(\mathbf{x}) = J_{f_L}(\mathbf{a}^{[L-1]}) \cdot J_{f_{L-1}}(\mathbf{a}^{[L-2]}) \cdots J_{f_1}(\mathbf{x})$$

This is a product of $L$ matrices. The spectral norm of the product satisfies:

$$\|J_h\|_2 \leq \prod_{l=1}^L \|J_{f_l}\|_2$$

If each $\|J_{f_l}\|_2 = \rho$, then $\|J_h\|_2 \leq \rho^L$. For $\rho < 1$, the gradient vanishes exponentially; for $\rho > 1$, it can explode. This is the mathematical source of the vanishing/exploding gradient problem (6).

Efficient computation: reverse order. In the forward direction, we compute $\mathbf{a}^{[1]}, \ldots, \mathbf{a}^{[L]}$ left to right. In the backward direction, we compute the error signals $\boldsymbol{\delta}^{[L]}, \boldsymbol{\delta}^{[L-1]}, \ldots, \boldsymbol{\delta}^{[1]}$ right to left, reusing stored activations. The key observation: at step $l$, we only need $\boldsymbol{\delta}^{[l+1]}$ and the stored activation $\mathbf{a}^{[l]}$ (or $\mathbf{z}^{[l]}$) - we do not need to recompute from scratch.

2.5 Differentiating Through Discrete Operations

Some operations in neural networks are discontinuous or discrete: argmax (in beam search), rounding/quantisation (in QAT), sampling (in VAEs and RL). The chain rule does not directly apply.

Straight-Through Estimator (STE). For a quantisation function $q(x) = \lfloor x \rceil$ (round to nearest integer), the derivative is $q'(x) = 0$ almost everywhere, giving zero gradient. The STE replaces the "true" zero gradient with 1 during the backward pass:

$$\frac{\partial \mathcal{L}}{\partial x} \approx \frac{\partial \mathcal{L}}{\partial q(x)} \qquad \text{(pass the gradient through as if $q$ were the identity)}$$

In code: y = round(x).detach() + x - x.detach() - this adds $x$ in the forward pass (where it cancels) but contributes its gradient in the backward pass. STE is used in VQ-VAE, binary neural networks, and quantisation-aware training.
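The one-liner can be checked directly (a minimal sketch):

import torch

x = torch.tensor([0.3, 1.7, -0.6], requires_grad=True)

# Forward value is round(x); backward treats the rounding as the identity.
y = torch.round(x).detach() + x - x.detach()

loss = (y ** 2).sum()
loss.backward()
print(y)        # tensor([ 0.,  2., -1.], ...)
print(x.grad)   # 2 * round(x): nonzero, even though round'(x) = 0 a.e.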

REINFORCE (score function estimator). For a stochastic node $y \sim p_\theta(y|x)$ and loss $\mathcal{L}(y)$, the gradient of $\mathbb{E}[\mathcal{L}(y)]$ with respect to $\theta$ is:

$$\nabla_\theta \mathbb{E}_{y \sim p_\theta}[\mathcal{L}(y)] = \mathbb{E}_{y \sim p_\theta}[\mathcal{L}(y) \nabla_\theta \log p_\theta(y|x)]$$

This allows gradient estimation without differentiating through the sampling step. Used in RLHF (PPO, GRPO) and variational inference. High variance; mitigated by baselines.


3. Computation Graphs

3.1 Formal DAG Definition

A computation graph is a directed acyclic graph $G = (V, E)$ encoding how scalar or tensor quantities depend on one another.

Nodes $V$ partition into three types:

Type | Symbol | Role
---- | ------ | ----
Input nodes | $v_1,\ldots,v_n$ | Hold model inputs and parameters; no incoming edges
Intermediate nodes | $v_{n+1},\ldots,v_{N-1}$ | Hold computed activations; receive edges from their operands
Output node | $v_N$ | Holds the scalar loss $\mathcal{L}$; required to be scalar for standard backprop

Edges $(u, v) \in E$ encode data dependency: $v = \phi(u_1, \ldots, u_k)$ for some primitive $\phi$. Each edge carries an implicit local Jacobian $\partial v / \partial u_i$.

Primitive operations are the atomic building blocks with known local gradients:

PRIMITIVE OPERATIONS AND THEIR LOCAL GRADIENTS


  Operation          Forward                  Local gradient (wrt input)

  z = x + y          z = x + y                dz/dx = 1,   dz/dy = 1
  z = x * y          z = xy                   dz/dx = y,   dz/dy = x
  z = exp(x)         z = e^x                  dz/dx = e^x
  z = log(x)         z = ln x                 dz/dx = 1/x
  z = relu(x)        z = max(0,x)             dz/dx = [x>0]  (a.e.)
  z = W x + b        Wx + b                   dz/dW = x (as outer product),  dz/dx = W
  z = softmax(x)     e^{x_i} / sum_j e^{x_j}  diag(p) - p p^T  (see 02)


  Every deep learning framework maintains a lookup table of these
  primitives together with their VJP implementations.


Topological ordering - a linear ordering $\pi$ of $V$ such that for every edge $(u,v) \in E$, $u$ appears before $v$ in $\pi$. A topological order exists iff $G$ is acyclic (Kahn's algorithm, 1962). Both the forward pass and the backward pass respect topological order (the latter in reverse).

For AI: Every modern deep learning framework (PyTorch, JAX, TensorFlow) represents a neural network as a computation graph. PyTorch builds the graph dynamically during the forward pass via the autograd tape; JAX traces the graph statically via XLA compilation.

3.2 Forward Pass - Value Propagation

The forward pass evaluates all node values in topological order, caching intermediates required by the backward pass.

Algorithm (Forward Pass):

Input:  graph G = (V, E),  input values {x_1, ..., x_n}
Output: loss value v_N,    cache of intermediates

  For v in topological_order(G):
    if v is an input node:
      cache[v] = x_v  (given)
    else:
      cache[v] = phi_v(cache[u_1], ..., cache[u_k])
                 where u_1, ..., u_k = parents(v)

  return cache[v_N]

Memory cost of a naive forward pass: Caching all intermediates for backprop costs $O(N)$ memory where $N$ is the number of nodes. For a transformer with $L=96$ layers and activations of size $(B, T, d)$, this is approximately:

$$\text{Memory} = L \cdot B \cdot T \cdot d \cdot \text{sizeof(float16)} \approx 96 \times 32 \times 4096 \times 8192 \times 2 \text{ bytes} \approx 200\,\text{GB}$$

This is why gradient checkpointing (7) is essential for large models.

What gets cached? A memory-optimal forward pass only caches values that appear in at least one local gradient formula. For a linear layer $\mathbf{z} = W\mathbf{x} + \mathbf{b}$, the backward needs $\mathbf{x}$ (to compute $\nabla_W$) but not $\mathbf{z}$ (already accumulated into the output).

3.3 Backward Pass - Gradient Accumulation

The backward pass evaluates adjoint values $\bar{v} = \partial \mathcal{L}/\partial v$ for every node, in reverse topological order.

Define the adjoint of node $v$ as:

$$\bar{v} \;:=\; \frac{\partial \mathcal{L}}{\partial v}$$

where we treat $v$ as a scalar intermediate (extending to tensors componentwise).

Initialisation: $\bar{v}_N = 1$ (the loss node).

Backward recurrence: For a node $v$ with children (successors) $c_1, \ldots, c_m$ - nodes that depend on $v$:

$$\bar{v} = \sum_{j=1}^{m} \bar{c}_j \cdot \frac{\partial c_j}{\partial v}$$

This is exactly the chain rule applied in reverse.

Algorithm (Backward Pass):

Input:  graph G,  cache from forward pass
Output: adjoint[v] = dL/dv for all v in V

  adjoint[v_N] <- 1
  For v in reverse_topological_order(G):
    For each parent u of v:
      adjoint[u] += adjoint[v] * dv/du(cache)
                    ^-- this product is local_vjp(v, u, adjoint[v])

  return {adjoint[u] : u is a parameter node}

The key observation: each edge $(u, v)$ requires only:

  1. The cached forward value at $u$ (for the local gradient formula)
  2. The downstream adjoint $\bar{v}$ (for the VJP multiplication)
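The whole machinery fits in a page of Python. A minimal scalar reverse-mode sketch (in the spirit of micrograd; only + and * as primitives):

class Node:
    """Scalar node in a computation graph, with reverse-mode autodiff."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents            # operand Nodes
        self.local_grads = local_grads    # d(self)/d(parent), precomputed
        self.adjoint = 0.0                # dL/d(self), filled in by backward()

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

def backward(loss):
    order, seen = [], set()               # build topological order by DFS
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p in v.parents:
                visit(p)
            order.append(v)
    visit(loss)
    loss.adjoint = 1.0                    # dL/dL = 1
    for v in reversed(order):             # reverse topological order
        for parent, g in zip(v.parents, v.local_grads):
            parent.adjoint += v.adjoint * g   # += accumulates at fan-out nodes

x, y = Node(2.0), Node(3.0)
L = x * y + x                             # L = xy + x; x fans out to two uses
backward(L)
print(x.adjoint, y.adjoint)               # 4.0 (= y + 1), 2.0 (= x)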

3.4 Gradient Accumulation at Branching Nodes

A fan-out node $u$ has multiple children $c_1, \ldots, c_m$. The correct gradient is the sum of contributions:

$$\bar{u} = \sum_{j=1}^{m} \bar{c}_j \cdot \frac{\partial c_j}{\partial u}$$

Proof: By the total derivative,

$$\frac{\partial \mathcal{L}}{\partial u} = \sum_{j=1}^{m} \frac{\partial \mathcal{L}}{\partial c_j} \cdot \frac{\partial c_j}{\partial u} = \sum_{j=1}^{m} \bar{c}_j \cdot \frac{\partial c_j}{\partial u} \qquad \square$$

Example - residual connection:

RESIDUAL BRANCH: u feeds into both F(u) and the skip path


          u
         / \
        /   \
      F(u)   \  <- identity skip
        \   /
         \ /
    z = F(u) + u

  Forward:    z = F(u) + u
  Backward:   ubar = zbar * dF(u)/du  +  zbar * 1
                   = J_F(u)^T zbar    +  zbar

  The identity skip guarantees a gradient highway:
  even if J_F(u) ~= 0 (saturated layer), zbar flows back unchanged.


This is the deep reason residual networks (He et al., 2016) solved the vanishing gradient problem: the skip connection contributes a constant identity term in the backward accumulation, so $\bar{z}$ reaches $\bar{u}$ undiminished no matter how small $J_F$ becomes.

3.5 Dynamic vs Static Graphs

Two design philosophies produce different tradeoffs:

DYNAMIC GRAPHS (PyTorch eager mode)    STATIC GRAPHS (JAX jit / TF graph)

Graph built anew each forward pass     Graph compiled once, reused

+ Natural Python control flow          + XLA/CUDA fusion, kernel merging
+ Easy debugging (print anywhere)      + Memory-optimal buffer allocation
+ Variable-length sequences trivial    + Can export/serve without Python
- Graph construction overhead          - Tracing must handle all branches
- Less compiler optimisation           - Python side-effects invisible

Examples: PyTorch, early Chainer       Examples: JAX jit, TF2 tf.function,
                                        ONNX Runtime, TensorRT

For transformers: Most production LLM training uses torch.compile (PyTorch 2.0+) which bridges the two: eager-mode graph construction with TorchDynamo tracing and inductor backend compilation, recovering ~30-50% throughput from kernel fusion.


4. Backpropagation

4.1 Network Notation

Consider a feedforward neural network with $L$ layers. Define:

Symbol | Meaning
------ | -------
$\mathbf{a}^{[0]} = \mathbf{x}$ | Input vector, dimension $n_0$
$W^{[l]} \in \mathbb{R}^{n_l \times n_{l-1}}$ | Weight matrix for layer $l$
$\mathbf{b}^{[l]} \in \mathbb{R}^{n_l}$ | Bias vector for layer $l$
$\mathbf{z}^{[l]} = W^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$ | Pre-activation (linear combination)
$\mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})$ | Post-activation (elementwise)
$\hat{\mathbf{y}} = \mathbf{a}^{[L]}$ | Network output
$\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})$ | Scalar loss

The forward pass computes $\mathbf{z}^{[l]}$ and $\mathbf{a}^{[l]}$ for $l = 1, \ldots, L$.

4.2 Forward Equations

$$\mathbf{z}^{[l]} = W^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}, \qquad \mathbf{a}^{[l]} = \sigma^{[l]}(\mathbf{z}^{[l]}), \qquad l = 1,\ldots,L, \qquad \hat{\mathbf{y}} = \mathbf{a}^{[L]}$$

Cache for backward: $\{\mathbf{z}^{[l]}, \mathbf{a}^{[l-1]}\}_{l=1}^{L}$.

4.3 Output Layer Gradient

For cross-entropy loss with softmax output, the output gradient has the celebrated clean form (derived in 5.3):

$$\boldsymbol{\delta}^{[L]} := \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[L]}} = \hat{\mathbf{y}} - \mathbf{y}$$

where $\mathbf{y}$ is the one-hot label. This combines the softmax Jacobian with the cross-entropy gradient into a single elegant expression.

For MSE loss ($\mathcal{L} = \tfrac{1}{2}\|\hat{\mathbf{y}} - \mathbf{y}\|^2$) with linear output:

$$\boldsymbol{\delta}^{[L]} = \hat{\mathbf{y}} - \mathbf{y}$$

(Same form, different derivation - a useful coincidence that makes implementation uniform.)

4.4 Backpropagation Recurrence - Proof

Define the error signal:

$$\boldsymbol{\delta}^{[l]} \;:=\; \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[l]}} \in \mathbb{R}^{n_l}$$

Theorem (Backpropagation Recurrence):

$$\boldsymbol{\delta}^{[l]} = \left(W^{[l+1]}\right)^\top \boldsymbol{\delta}^{[l+1]} \odot \sigma'^{[l]}(\mathbf{z}^{[l]})$$

Proof: Apply the chain rule from $\mathcal{L}$ to $\mathbf{z}^{[l]}$ via $\mathbf{a}^{[l]}$ and $\mathbf{z}^{[l+1]}$:

$$\boldsymbol{\delta}^{[l]} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[l]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[l+1]}} \cdot \frac{\partial \mathbf{z}^{[l+1]}}{\partial \mathbf{z}^{[l]}}$$

Step 1: $\partial \mathcal{L}/\partial \mathbf{z}^{[l+1]} = (\boldsymbol{\delta}^{[l+1]})^\top$ (as a row vector).

Step 2: $\mathbf{z}^{[l+1]} = W^{[l+1]}\mathbf{a}^{[l]} + \mathbf{b}^{[l+1]} = W^{[l+1]}\sigma(\mathbf{z}^{[l]}) + \mathbf{b}^{[l+1]}$, so

$$\frac{\partial z^{[l+1]}_i}{\partial z^{[l]}_j} = W^{[l+1]}_{ij} \cdot \sigma'(z^{[l]}_j)$$

In matrix form: $J = W^{[l+1]} \operatorname{diag}(\sigma'(\mathbf{z}^{[l]}))$.

Step 3: Multiply by $(\boldsymbol{\delta}^{[l+1]})^\top$ and transpose to get the column vector $\boldsymbol{\delta}^{[l]}$:

$$\boldsymbol{\delta}^{[l]} = J^\top \boldsymbol{\delta}^{[l+1]} = \operatorname{diag}(\sigma'(\mathbf{z}^{[l]})) \left(W^{[l+1]}\right)^\top \boldsymbol{\delta}^{[l+1]} = \left(W^{[l+1]}\right)^\top \boldsymbol{\delta}^{[l+1]} \odot \sigma'(\mathbf{z}^{[l]}) \qquad \square$$

The $\odot$ (elementwise) product arises because $\sigma$ is applied elementwise - its Jacobian is diagonal.

4.5 Weight and Bias Gradients

Once the error signals $\boldsymbol{\delta}^{[l]}$ are computed, parameter gradients follow immediately:

$$\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \boldsymbol{\delta}^{[l]} (\mathbf{a}^{[l-1]})^\top \in \mathbb{R}^{n_l \times n_{l-1}}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} = \boldsymbol{\delta}^{[l]} \in \mathbb{R}^{n_l}$$

Derivation of the weight gradient:

$$\frac{\partial \mathcal{L}}{\partial W^{[l]}_{ij}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}_i} \cdot \frac{\partial z^{[l]}_i}{\partial W^{[l]}_{ij}} = \delta^{[l]}_i \cdot a^{[l-1]}_j$$

Collecting over all $i,j$: $\nabla_{W^{[l]}} \mathcal{L} = \boldsymbol{\delta}^{[l]} (\mathbf{a}^{[l-1]})^\top$.

This is an outer product - the gradient is rank-1 for a single sample. For a batch of $B$ samples it averages to higher rank.

4.6 Batched Backpropagation

With a mini-batch $\{(\mathbf{x}^{(b)}, \mathbf{y}^{(b)})\}_{b=1}^{B}$, stack inputs into a matrix $X \in \mathbb{R}^{n_0 \times B}$.

The forward pass becomes:

$$Z^{[l]} = W^{[l]} A^{[l-1]} + \mathbf{b}^{[l]} \mathbf{1}^\top, \qquad A^{[l]} = \sigma(Z^{[l]})$$

where $A^{[l]} \in \mathbb{R}^{n_l \times B}$.

The backward pass produces $\Delta^{[l]} \in \mathbb{R}^{n_l \times B}$ (error signals for all samples simultaneously).

Weight gradient for the batch:

$$\frac{\partial \mathcal{L}_\text{batch}}{\partial W^{[l]}} = \frac{1}{B} \Delta^{[l]} (A^{[l-1]})^\top$$

This is a single matrix multiplication, making batched backprop efficient on GPUs, which excel at large GEMM (general matrix multiplication) operations.
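A minimal batched sketch in NumPy for a two-layer tanh network (biases omitted for brevity; sizes are arbitrary), with a finite-difference check on one weight entry:

import numpy as np

rng = np.random.default_rng(0)
B, n0, n1, n2 = 8, 5, 7, 3
X = rng.normal(size=(n0, B))                 # columns are samples
Y = rng.normal(size=(n2, B))
W1, W2 = rng.normal(size=(n1, n0)), rng.normal(size=(n2, n1))

def loss_fn(W1, W2):
    A1 = np.tanh(W1 @ X)                     # Z1 = W1 X, A1 = tanh(Z1)
    Y_hat = W2 @ A1
    return 0.5 * np.mean(np.sum((Y_hat - Y) ** 2, axis=0))

# Backward: batched error signals Delta^[l], shape (n_l, B)
A1 = np.tanh(W1 @ X)
Delta2 = (W2 @ A1 - Y) / B                   # 1/B from the batch mean
Delta1 = (W2.T @ Delta2) * (1 - A1 ** 2)     # tanh'(z) = 1 - tanh^2(z)
dW2 = Delta2 @ A1.T                          # a single GEMM per layer
dW1 = Delta1 @ X.T

# Finite-difference check on one entry of W1
eps, i, j = 1e-6, 2, 3
W1p = W1.copy(); W1p[i, j] += eps
W1m = W1.copy(); W1m[i, j] -= eps
fd = (loss_fn(W1p, W2) - loss_fn(W1m, W2)) / (2 * eps)
assert abs(fd - dW1[i, j]) < 1e-6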


5. Gradient Derivations for Standard Layers

5.1 Linear Layer

Forward: $\mathbf{z} = W\mathbf{x} + \mathbf{b}$, where $W \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$.

Upstream gradient: $\bar{\mathbf{z}} = \partial \mathcal{L}/\partial \mathbf{z} \in \mathbb{R}^m$.

VJP (backward):

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = W^\top \bar{\mathbf{z}} \in \mathbb{R}^n, \qquad \frac{\partial \mathcal{L}}{\partial W} = \bar{\mathbf{z}} \mathbf{x}^\top \in \mathbb{R}^{m \times n}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}} = \bar{\mathbf{z}} \in \mathbb{R}^m$$

Derivation of $\partial \mathcal{L}/\partial \mathbf{x}$: Each $z_i = \sum_k W_{ik} x_k + b_i$, so $\partial z_i / \partial x_j = W_{ij}$. By VJP: $\partial \mathcal{L}/\partial x_j = \sum_i \bar{z}_i W_{ij} = (W^\top \bar{\mathbf{z}})_j$.

For AI: In a transformer with hidden dim $d$ and MLP expansion $4d$: the two linear layers in the FFN pass gradients back with $W^\top$ operations - the same cost as the forward GEMM. Gradient computation for $W$ is also a GEMM.

5.2 Activation Functions

For elementwise $\mathbf{a} = \sigma(\mathbf{z})$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \bar{\mathbf{a}} \odot \sigma'(\mathbf{z})$$

Gradient formulas for common activations:

Activation | $\sigma(z)$ | $\sigma'(z)$ | Notes
---------- | ----------- | ------------ | -----
ReLU | $\max(0,z)$ | $\mathbf{1}[z>0]$ | Sparse gradient; "dead neurons" if $z<0$ always
Sigmoid | $1/(1+e^{-z})$ | $\sigma(z)(1-\sigma(z))$ | Saturates; max gradient 0.25 at $z=0$
Tanh | $\tanh(z)$ | $1-\tanh^2(z)$ | Saturates; max gradient 1 at $z=0$
GELU | $z\Phi(z)$ | $\Phi(z) + z\phi(z)$ | $\Phi$ = Gaussian CDF; smooth at 0
SiLU/Swish | $z\sigma(z)$ | $\sigma(z)(1+z(1-\sigma(z)))$ | Used in LLaMA, Mistral
Softplus | $\log(1+e^z)$ | $\sigma(z)$ | Smooth ReLU; gradient never zero

GELU (Hendrycks & Gimpel, 2016) is the standard activation in GPT-2/3, BERT, and most modern LLMs. It gates the input by its own probability under a Gaussian, producing richer gradient structure than ReLU.

5.3 Fused Softmax + Cross-Entropy Gradient

Setup: Output logits $\mathbf{z} \in \mathbb{R}^K$, softmax probabilities $\mathbf{p} = \text{softmax}(\mathbf{z})$, true label $y \in \{1,\ldots,K\}$, loss $\mathcal{L} = -\log p_y$.

Claim:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \mathbf{p} - \mathbf{e}_y$$

where $\mathbf{e}_y$ is the $y$-th standard basis vector.

Proof: Write $\mathcal{L} = -\log p_y = -z_y + \log \sum_k e^{z_k}$. Then

$$\frac{\partial \mathcal{L}}{\partial z_j} = -\mathbf{1}[j=y] + \frac{e^{z_j}}{\sum_k e^{z_k}} = p_j - \mathbf{1}[j=y] = (\mathbf{p} - \mathbf{e}_y)_j \qquad \square$$

This direct derivation bypasses the softmax Jacobian computation entirely, which is why modern frameworks implement cross-entropy as a fused operation. For numerical stability, the $\log\sum\exp$ is computed with the log-sum-exp trick: $\log\sum_k e^{z_k} = m + \log\sum_k e^{z_k - m}$ where $m = \max_k z_k$.
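Both the fused gradient and the log-sum-exp trick fit in a few lines. A minimal NumPy sketch:

import numpy as np

def cross_entropy_with_logits(z, y):
    """Stable fused softmax + cross-entropy: returns (loss, dL/dz)."""
    m = z.max()
    log_sum = m + np.log(np.sum(np.exp(z - m)))   # log-sum-exp trick
    loss = log_sum - z[y]                          # -log p_y
    p = np.exp(z - log_sum)                        # softmax, no overflow
    grad = p.copy()
    grad[y] -= 1.0                                 # dL/dz = p - e_y
    return loss, grad

z = np.array([2.0, -1.0, 1000.0])                  # extreme logit
loss, grad = cross_entropy_with_logits(z, y=0)     # finite, no warnings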

5.4 LayerNorm Gradient

Forward: LayerNorm normalises each token independently:

$$\hat{\mathbf{x}} = \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad \mathbf{y} = \boldsymbol{\gamma} \odot \hat{\mathbf{x}} + \boldsymbol{\beta}$$

where $\mu = \tfrac{1}{d}\sum_i x_i$ and $\sigma^2 = \tfrac{1}{d}\sum_i (x_i - \mu)^2$.

Backward: Let $\bar{\mathbf{y}} = \partial \mathcal{L}/\partial \mathbf{y}$ be the upstream gradient.

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\gamma}} = \bar{\mathbf{y}} \odot \hat{\mathbf{x}}, \qquad \frac{\partial \mathcal{L}}{\partial \boldsymbol{\beta}} = \bar{\mathbf{y}}$$

For the input gradient, define $\bar{\mathbf{x}}_\text{norm} = \bar{\mathbf{y}} \odot \boldsymbol{\gamma}$. The full gradient through the normalisation is:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{1}{\sqrt{\sigma^2+\epsilon}} \left( \bar{\mathbf{x}}_\text{norm} - \frac{1}{d}\bigl(\mathbf{1}^\top \bar{\mathbf{x}}_\text{norm}\bigr)\mathbf{1} - \frac{1}{d}\bigl(\hat{\mathbf{x}}^\top \bar{\mathbf{x}}_\text{norm}\bigr)\hat{\mathbf{x}} \right)$$

This expression subtracts the mean of $\bar{\mathbf{x}}_\text{norm}$ and its component along $\hat{\mathbf{x}}$, reflecting that LayerNorm's Jacobian projects out two degrees of freedom (02 exercises).

For AI: LayerNorm appears in every transformer layer (pre-norm placement in modern architectures like GPT-NeoX, LLaMA). The gradient through LayerNorm is never zero - it always passes signal, unlike BatchNorm, which can become degenerate at small batch sizes.
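The formula is easy to get wrong in implementation, so it is worth checking against autograd. A minimal sketch (toy dimension, arbitrary upstream gradient):

import torch

d, eps = 16, 1e-5
x = torch.randn(d, requires_grad=True)
gamma, beta = torch.randn(d), torch.randn(d)

mu, var = x.mean(), x.var(unbiased=False)        # population variance
x_hat = (x - mu) / torch.sqrt(var + eps)
y = gamma * x_hat + beta

upstream = torch.randn(d)                        # arbitrary upstream gradient
y.backward(upstream)

xn_bar = upstream * gamma                        # xbar_norm
dx = (xn_bar - xn_bar.mean()
      - x_hat * (x_hat * xn_bar).mean()) / torch.sqrt(var + eps)

assert torch.allclose(x.grad, dx, atol=1e-5)     # matches autograd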

5.5 Dot-Product Attention Gradient

Forward (simplified single-head):

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$

$$S = QK^\top / \sqrt{d_k}, \qquad P = \text{softmax}(S), \qquad O = PV$$

Backward: Given upstream $\bar{O} \in \mathbb{R}^{T \times d_v}$:

$$\bar{V} = P^\top \bar{O}, \qquad \bar{P} = \bar{O} V^\top$$

$$\bar{S} = \text{softmax\_backward}(P, \bar{P}) = P \odot \bigl(\bar{P} - (P \odot \bar{P})\mathbf{1}\mathbf{1}^\top\bigr) \cdot \frac{1}{\sqrt{d_k}}$$

$$\bar{Q} = \bar{S} K, \qquad \bar{K} = \bar{S}^\top Q$$

Then $\nabla_{W_Q} = X^\top \bar{Q}$, and similarly for $W_K$, $W_V$.

Critical memory issue: Storing $P \in \mathbb{R}^{T \times T}$ for the backward costs $O(T^2)$ - this is what FlashAttention avoids by recomputing $P$ from $Q, K$ during the backward pass (see 7.3).

5.6 Embedding Layer Gradient

Forward: $\mathbf{h}_t = E[\text{token}_t]$, where $E \in \mathbb{R}^{V \times d}$ is the embedding table.

Backward: Given upstream $\bar{\mathbf{h}}_t$ for all positions $t = 1,\ldots,T$:

$$\frac{\partial \mathcal{L}}{\partial E[i]} = \sum_{t \,:\, \text{token}_t = i} \bar{\mathbf{h}}_t$$

This is a sparse gradient - only rows corresponding to tokens in the sequence receive nonzero updates. For vocabulary size $V = 128{,}000$ (LLaMA-3), the embedding matrix is $128{,}000 \times 4{,}096$, but only a tiny fraction of rows are updated per batch. Distributed training with embedding sharding exploits this sparsity.
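The scatter-add structure is visible in a few lines of PyTorch (a minimal sketch with toy sizes):

import torch

V, d, T = 10, 4, 6
E = torch.randn(V, d)                       # embedding table
tokens = torch.tensor([3, 1, 3, 0, 1, 3])   # token ids, with repeats
h_bar = torch.randn(T, d)                   # upstream gradient per position

dE = torch.zeros_like(E)
dE.index_add_(0, tokens, h_bar)             # row i sums h_bar over all t with token_t = i
# Only rows 0, 1, 3 are nonzero; row 3 accumulates three contributions.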


6. Vanishing and Exploding Gradients

6.1 Magnitude Analysis - The Core Problem

Consider an $L$-layer network with no activation functions (to isolate the linear case). The gradient of the loss with respect to layer $l$'s pre-activations involves the product:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[l]}} = W^{[L]} W^{[L-1]} \cdots W^{[l+1]} \cdot \boldsymbol{\delta}^{[L]}$$

This is a product of $L - l$ matrices. By the submultiplicativity of the spectral norm:

$$\left\| \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[l]}} \right\|_2 \leq \prod_{k=l+1}^{L} \|W^{[k]}\|_2 \cdot \|\boldsymbol{\delta}^{[L]}\|_2$$

If $\|W^{[k]}\|_2 = \rho < 1$ for all layers:

$$\text{gradient norm} \leq \rho^{L-l} \cdot \|\boldsymbol{\delta}^{[L]}\|_2 \xrightarrow{\;L \to \infty\;} 0 \qquad \text{(vanishing)}$$

If $\|W^{[k]}\|_2 = \rho > 1$, the product can grow at the same exponential rate:

$$\text{gradient norm} \sim \rho^{L-l} \cdot \|\boldsymbol{\delta}^{[L]}\|_2 \xrightarrow{\;L \to \infty\;} \infty \qquad \text{(exploding)}$$
GRADIENT MAGNITUDE ACROSS LAYERS


  gradient norm (log scale)
  ^
  |         .'  exploding (rho > 1)
  |       .'
  |-----.'----------- ideal (rho = 1)
  |       '.
  |         '.  vanishing (rho < 1)
  +-------------------------------> layer l
  L (output)                0 (input)

  With activations, the product also includes sigma'(z) terms
  (< 1 for sigmoid), compounding the vanishing problem.


This was identified by Hochreiter (1991) as the fundamental obstacle to training deep networks with gradient descent.
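The exponential behaviour is easy to reproduce. A minimal sketch that pushes an error signal through scaled orthogonal matrices, so each factor has spectral norm exactly rho:

import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 50

for rho in (0.8, 1.0, 1.2):
    g = rng.normal(size=d)                          # stand-in for delta^[L]
    for _ in range(L):
        Q = np.linalg.qr(rng.normal(size=(d, d)))[0]   # orthogonal factor
        g = (rho * Q).T @ g                         # ||rho * Q||_2 = rho exactly
    print(rho, np.linalg.norm(g))                   # ~ rho^L: decay / constant / blow-up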

6.2 Activations and Saturation

For sigmoid $\sigma$: $\sigma'(z) = \sigma(z)(1-\sigma(z)) \leq 0.25$ for all $z$, with equality only at $z=0$. In the tails ($|z| \gg 0$), $\sigma'(z) \approx 0$.

For tanh: $\tanh'(z) = 1 - \tanh^2(z) \leq 1$, saturating similarly.

In a network with $L$ sigmoid layers and all activations near saturation, the gradient at layer 1 is suppressed by approximately $0.25^L$. For $L = 20$: $0.25^{20} \approx 10^{-12}$ - numerically zero.

ReLU resolves saturation: $\text{relu}'(z) = \mathbf{1}[z > 0]$, which is either 0 or 1. For active neurons, it passes gradients unchanged. However, "dying ReLU" (neurons with $z < 0$ always) creates a different problem - those neurons receive zero gradient and never recover.

GELU and SiLU (used in LLaMA) are smooth approximations that avoid hard zeros, maintaining nonzero gradients everywhere.

6.3 Xavier and He Initialisation

Goal: Choose initial weights so that gradient (and activation) variance is preserved across layers - avoiding exponential growth or decay from the start of training.

Xavier Initialisation (Glorot & Bengio, 2010) - for symmetric activations (tanh, linear):

Assumption: Weights $W_{ij} \sim \mathcal{N}(0, \sigma^2)$ i.i.d., inputs $x_j$ with variance $\text{Var}(x_j) = v$.

Forward variance preservation: $\text{Var}(z_i) = n_\text{in} \sigma^2 v \Rightarrow \sigma^2 = 1/n_\text{in}$.

Backward variance preservation: $\text{Var}(\bar{x}_j) = n_\text{out} \sigma^2 \text{Var}(\bar{z}_i) \Rightarrow \sigma^2 = 1/n_\text{out}$.

Compromise:

$$\sigma^2 = \frac{2}{n_\text{in} + n_\text{out}}, \qquad \text{or uniform } W_{ij} \sim U\!\left[-\sqrt{\frac{6}{n_\text{in}+n_\text{out}}},\; \sqrt{\frac{6}{n_\text{in}+n_\text{out}}}\right]$$

He Initialisation (He et al., 2015) - for ReLU activations:

ReLU zeroes half the distribution, so for zero-mean symmetric $z$ the effective variance is halved: $\text{Var}(\text{relu}(z)) = \tfrac{1}{2}\text{Var}(z)$. To compensate:

$$\sigma^2 = \frac{2}{n_\text{in}}$$

For AI: GPT-2 uses a scaled version: weight initialisation $\mathcal{N}(0, 0.02^2)$ with the residual projection layers further scaled by $1/\sqrt{2L}$, where $L$ is the number of transformer layers, to control variance accumulation in the residual stream.
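In PyTorch these schemes are one-liners (a sketch; the layer sizes are arbitrary):

import torch
import torch.nn as nn

lin_tanh = nn.Linear(512, 512)
nn.init.xavier_uniform_(lin_tanh.weight)             # Var = 2/(n_in + n_out)

lin_relu = nn.Linear(512, 2048)
nn.init.kaiming_normal_(lin_relu.weight,
                        mode='fan_in', nonlinearity='relu')   # Var = 2/n_in

# GPT-2-style: N(0, 0.02^2), residual projections scaled by 1/sqrt(2L)
L = 48
proj = nn.Linear(512, 512)
nn.init.normal_(proj.weight, mean=0.0, std=0.02 / (2 * L) ** 0.5)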

6.4 Residual Connections as Gradient Highways

Theorem: In a residual network $F^{[l+1]}(\mathbf{x}) = \mathbf{x} + G^{[l]}(\mathbf{x})$, the gradient satisfies:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}^{[0]}} = \prod_{l=1}^{L}\left(I + J_{G^{[l]}}\right) \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{x}^{[L]}}$$

Key insight: Expanding the product, we get:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}^{[0]}} = \left(I + \sum_l J_{G^{[l]}} + \sum_{l < l'} J_{G^{[l']}} J_{G^{[l]}} + \cdots \right) \frac{\partial \mathcal{L}}{\partial \mathbf{x}^{[L]}}$$

The identity term guarantees that even if all $J_{G^{[l]}} \approx 0$ (at initialisation), the gradient $\partial \mathcal{L}/\partial \mathbf{x}^{[0]}$ receives the full upstream signal unchanged. This is the theoretical explanation for why ResNets (He et al., 2016) can be trained with hundreds of layers.

In modern transformers, the pre-norm architecture (LayerNorm before the sublayer, not after) further improves gradient flow by ensuring that the residual path carries a pure copy of the signal.

6.5 Gradient Clipping

Gradient explosion is addressed pragmatically by global gradient norm clipping:

$$\mathbf{g} \leftarrow \mathbf{g} \cdot \min\!\left(1,\; \frac{\tau}{\|\mathbf{g}\|_2}\right)$$

where $\mathbf{g}$ is the concatenated parameter gradient vector and $\tau$ is the clip threshold.

Typical values: $\tau = 1.0$ for transformers (used in GPT-3, PaLM, LLaMA).

Why global (not per-layer)? Clipping each layer's gradient independently destroys the relative proportions of updates across layers, disrupting the Adam momentum states. Global clipping preserves direction, only reducing magnitude.

Relationship to RNNs: Gradient clipping was originally introduced for RNNs (Mikolov, 2012; Pascanu et al., 2013), where the vanishing/exploding problem is especially severe due to the long chain of time steps.
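In PyTorch, global clipping is a single call between backward() and the optimiser step (a minimal sketch):

import torch
import torch.nn as nn

model = nn.Linear(10, 10)
loss = model(torch.randn(4, 10)).pow(2).sum()
loss.backward()

# One norm over ALL parameters; preserves direction, reduces magnitude
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# total_norm is the pre-clip global norm - useful to log at every step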

6.6 Batch Normalisation and Layer Normalisation

BatchNorm (Ioffe & Szegedy, 2015) normalises each feature across the batch, stabilising the distribution of pre-activations. Its gradient has a complex form involving batch statistics, but crucially it prevents activations from saturating on average.

LayerNorm (Ba et al., 2016) normalises each sample across features - preferred in transformers because:

  1. Behaviour is independent of batch size (critical for small-batch inference)
  2. Gradient analysis shows it damps large pre-activation magnitudes
  3. Pre-norm placement ensures the residual stream grows in a controlled manner

Empirical gradient norm tracking is standard practice in LLM training: the gradient norm is logged at every step, and sudden spikes indicate loss spikes or numerical issues. Large published runs such as Chinchilla (and, reportedly, frontier models like GPT-4) used gradient norm monitoring as a primary signal for training health.


7. Memory-Efficient Backpropagation

7.1 Memory Cost of Standard Backprop

Standard backpropagation caches all intermediate activations for use in the backward pass. For a transformer with $L$ layers, batch size $B$, sequence length $T$, $H$ heads, and hidden dimension $d$:

Component cached | Size (elements) | At FP16
---------------- | --------------- | -------
Attention QKV projections | $3 \times L \times B \times T \times d$ | $6LBTd$ bytes
Attention scores (pre-softmax) | $L \times B \times H \times T^2$ | $2LBHT^2$ bytes
MLP intermediate | $L \times B \times T \times 4d$ | $8LBTd$ bytes
LayerNorm stats | $2 \times 2L \times B \times T$ | negligible

For GPT-3 ($L=96$, $B=512$, $T=2048$, $d=12288$, $H=96$): the attention scores alone require $96 \times 512 \times 96 \times 2048^2 \times 2 \approx 40\,\text{TB}$ - clearly infeasible without optimisation.

7.2 Gradient Checkpointing

Idea: Trade compute for memory. Instead of caching all activations, cache only a subset of "checkpoint" activations and recompute the rest during the backward pass.

Algorithm (checkpointing at every $k$-th layer):

GRADIENT CHECKPOINTING


  Forward pass:
    Compute all layers normally
    Save activations only at layers 0, k, 2k, 3k, ...
    Discard all other intermediate activations

  Backward pass:
    For each segment [lk, (l+1)k]:
      Re-run the forward pass from checkpoint lk
      Now have all intermediates for this segment
      Compute gradients for layers lk+1 to (l+1)k-1
      Discard intermediates (no longer needed)


Memory-compute tradeoff:

  • Memory: $O(\sqrt{L})$ checkpoints (optimal with $k = \sqrt{L}$) instead of $O(L)$
  • Compute: Each layer's forward pass is run twice (once in the original forward, once in recomputation) -> approximately $+33\%$ compute overhead

For AI: torch.utils.checkpoint.checkpoint() implements this in PyTorch with a single function call. LLaMA, Mistral, and most OSS LLM trainers enable activation checkpointing by default for sequences longer than ~2048 tokens.
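In code (a minimal sketch; the block structure and sizes are arbitrary):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

blocks = nn.ModuleList(
    nn.Sequential(nn.Linear(256, 256), nn.GELU()) for _ in range(8)
)

h = torch.randn(4, 256, requires_grad=True)
for block in blocks:
    # Intermediates inside `block` are discarded after the forward
    # and recomputed from the checkpointed input during the backward.
    h = checkpoint(block, h, use_reentrant=False)
h.sum().backward()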

Selective recomputation: Flash Attention (see 7.3) takes a more targeted approach - instead of checkpointing by layer, it recomputes only the attention scores (the T2T^2 term) during the backward pass, since those are the dominant memory consumer.

7.3 FlashAttention: Fused Backward Pass

The $O(T^2)$ problem: Standard attention stores $P = \text{softmax}(QK^\top/\sqrt{d}) \in \mathbb{R}^{T \times T}$ for the backward pass. For $T = 32{,}768$ (long-context models), this is $32768^2 \times 2 \approx 2\,\text{GB}$ per attention matrix (per head, per batch element).

FlashAttention solution (Dao et al., 2022): Compute attention in tiles that fit in SRAM (GPU on-chip cache), using the online softmax algorithm (Milakov & Gimelshein, 2018) to avoid materialising the full $T \times T$ matrix.

Backward pass in FlashAttention: The backward pass needs $P$ but doesn't store it. Instead:

  1. Store only the softmax normalisation statistics $m_i, \ell_i$ (scalars per row) - $O(T)$ memory
  2. During the backward pass, recompute $P$ tile by tile from $Q, K$ and the stored statistics
  3. Accumulate gradients $\bar{Q}, \bar{K}, \bar{V}$ tile by tile without ever forming the full $P$

Complexity:

  • Memory: $O(T)$ instead of $O(T^2)$
  • FLOPs: $\approx 4\times$ the forward FLOPs for the backward pass (a small constant factor)
  • Wall-clock speedup: 2-4x over standard PyTorch attention on A100

For AI: FlashAttention is the default attention implementation in modern LLM training (vLLM, HuggingFace Transformers, NanoGPT). FlashAttention-3 (2024) further optimises for H100 tensor cores and async operations.

7.4 Mixed Precision Training

Observation: FP32 (32-bit float) is unnecessarily precise for gradients. FP16 (16-bit float) halves memory traffic and runs substantially faster on modern GPU tensor cores, but overflow/underflow is common for very large/small gradient values.

AMP (Automatic Mixed Precision) strategy:

Component | Precision | Reason
--------- | --------- | ------
Forward activations | FP16 | Fast compute, lower memory
Backward gradients | FP16 | Fast compute
Weight updates | FP32 | Avoid precision loss
Master weights | FP32 | Preserve small updates ($\Delta w \ll w$)
Loss scaling | Dynamic | Prevent FP16 underflow for small gradients

Loss scaling: Multiply the loss by a large scale factor $S$ (typically $2^{12}$ to $2^{16}$) before backward, then divide gradients by $S$ before the weight update. This shifts gradient values into the representable FP16 range. The scale factor is increased or decreased based on whether overflow (inf/nan) occurred.

BF16 (Brain Float 16, used in TPUs and H100): same 16-bit width but with 8 exponent bits (same as FP32) and only 7 mantissa bits. It eliminates most overflow issues by retaining FP32's dynamic range - now the preferred format for LLM training.
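The standard PyTorch pattern, with dynamic loss scaling (a minimal sketch; assumes a CUDA device is available):

import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()      # maintains the dynamic scale S

for _ in range(3):
    x = torch.randn(8, 1024, device='cuda')
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        loss = model(x).pow(2).mean()     # forward in reduced precision
    opt.zero_grad()
    scaler.scale(loss).backward()         # backward on S * loss
    scaler.step(opt)                      # unscales grads; skips step on inf/nan
    scaler.update()                       # grows/shrinks S based on overflows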


8. Advanced Differentiation Topics

8.1 Backpropagation Through Time (BPTT)

A recurrent neural network (RNN) with hidden state $\mathbf{h}_t = \sigma(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b})$ can be viewed as a feedforward network unrolled through time:

UNROLLED RNN - BPTT VIEW


  x_1 -> [cell] -> h_1 -> [cell] -> h_2 -> [cell] -> h_3 -> ... -> h_T -> loss
           W_h              W_h              W_h           (shared weights)

  BPTT = backprop through the unrolled graph.
  Gradient of the loss w.r.t. W_h = sum of gradients from all time steps.


The gradient with respect to $W_h$ involves products of per-step Jacobians:

$$\frac{\partial \mathcal{L}}{\partial W_h} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial \mathbf{h}_t} \left( \prod_{k=1}^{t-1} \frac{\partial \mathbf{h}_{t-k+1}}{\partial \mathbf{h}_{t-k}} \right) \frac{\partial \mathbf{h}_1}{\partial W_h}$$

Each factor $\partial \mathbf{h}_{t+1}/\partial \mathbf{h}_t = W_h \cdot \text{diag}(\sigma'(\mathbf{z}_t))$. When $\|W_h\|_2 \cdot \|\sigma'\|_\infty < 1$, the product of $T$ such factors vanishes exponentially. This is the core failure mode of vanilla RNNs on long sequences (Hochreiter, 1991; Bengio et al., 1994).

Truncated BPTT: In practice, gradients are truncated to a window of $k$ steps to reduce memory and compute costs, at the cost of ignoring long-range dependencies beyond step $k$.

LSTM/GRU solution: Long Short-Term Memory networks (Hochreiter & Schmidhuber, 1997) use gating mechanisms to maintain a cell state $\mathbf{c}_t$ with additive updates - replacing multiplicative products of weight matrices with additive accumulation, similar to residual connections.

8.2 Implicit Differentiation Preview

For optimisation problems or fixed-point iterations, we sometimes need gradients of implicit functions.

Example: Consider $\mathbf{y}^* = \arg\min_\mathbf{y} \mathcal{L}(\mathbf{y}, \theta)$, where the optimum satisfies $\nabla_\mathbf{y} \mathcal{L}(\mathbf{y}^*, \theta) = 0$.

By the implicit function theorem:

$$\frac{d\mathbf{y}^*}{d\theta} = -\left[\nabla^2_{\mathbf{y}\mathbf{y}} \mathcal{L}\right]^{-1} \nabla^2_{\mathbf{y}\theta} \mathcal{L}$$

This allows differentiating through optimisation steps without unrolling them - the basis of MAML (Model-Agnostic Meta-Learning, Finn et al., 2017) and DEQs (Deep Equilibrium Models, Bai et al., 2019).

Full treatment: Implicit differentiation and differentiable optimisation are covered in depth in 05/05-Automatic-Differentiation.

8.3 Straight-Through Estimator and REINFORCE

The discrete problem: When a node in the computation graph applies a discrete operation (argmax, sampling, rounding), the gradient is zero almost everywhere. The chain rule breaks - the graph is not differentiable at these nodes.

Straight-Through Estimator (STE) (Hinton, 2012; Bengio et al., 2013):

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} := \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{x}}} \qquad \text{(treat the discretisation as the identity in the backward pass)}$$

Applications:

  • Quantisation-aware training (QAT): Simulate INT8 forward, use STE backward. Used in GPTQ, AWQ, and quantised LLM training.
  • VQ-VAE: Vector quantisation in the encoder uses STE so gradients flow from the decoder back to the encoder.
  • Binary neural networks: Forward uses sign(x), backward uses STE with gradient identity.

REINFORCE (Williams, 1992): For stochastic nodes, use the log-derivative trick:

$$\nabla_\theta \mathbb{E}_{z \sim p_\theta}[\mathcal{L}(z)] = \mathbb{E}_{z \sim p_\theta}[\mathcal{L}(z) \nabla_\theta \log p_\theta(z)]$$

This produces an unbiased gradient estimate but with high variance (addressed by baseline subtraction: $(\mathcal{L}(z) - b)\nabla_\theta \log p_\theta(z)$). REINFORCE is the foundation of policy gradient methods in RL and is used in RLHF's PPO step.
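A minimal sketch of the estimator with a mean baseline (the categorical "policy" and per-action losses below are toy choices):

import torch

logits = torch.randn(5, requires_grad=True)       # parameters theta
dist = torch.distributions.Categorical(logits=logits)

cost = torch.tensor([0.0, 1.0, 2.0, 0.0, 3.0])    # L(z) for each action
z = dist.sample((1000,))                          # sampling: not differentiated
baseline = cost[z].mean()

# Surrogate whose gradient is the score-function estimate of grad E[L(z)]
surrogate = ((cost[z] - baseline).detach() * dist.log_prob(z)).mean()
surrogate.backward()                              # estimate lands in logits.grad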

8.4 Higher-Order Gradients

Second-order gradients arise in:

  1. Newton's method: requires the Hessian $H = \nabla^2 \mathcal{L}$ (see 02-Jacobians-and-Hessians)
  2. Meta-learning (MAML): gradient of a gradient w.r.t. outer parameters
  3. Gradient penalty in GAN training: $\|\nabla_x D(x)\|^2$

In PyTorch: Higher-order gradients are computed by running autograd through itself:

import torch

# Second derivative of a scalar loss w.r.t. the input (toy loss shown)
x = torch.randn(5, requires_grad=True)
loss = (x ** 3).sum()
g = torch.autograd.grad(loss, x, create_graph=True)[0]   # first derivative, 3x^2
g2 = torch.autograd.grad(g.sum(), x)[0]                  # second derivative, 6x

create_graph=True tells autograd to build a graph for the gradient computation itself, enabling differentiation through it.

Hessian-vector products (HVPs): As shown in 02, the HVP $H\mathbf{v}$ can be computed in $O(n)$ time without forming $H$:

$$H\mathbf{v} = \nabla_\mathbf{x}\bigl[(\nabla_\mathbf{x} \mathcal{L})^\top \mathbf{v}\bigr]$$

This is the primitive operation behind conjugate gradient and Lanczos methods for curvature estimation.
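The same double-backward pattern gives an HVP directly (a minimal sketch with a toy loss):

import torch

x = torch.randn(4, requires_grad=True)
v = torch.randn(4)

loss = (x ** 2).prod()                     # toy scalar loss
g = torch.autograd.grad(loss, x, create_graph=True)[0]
hvp = torch.autograd.grad(g @ v, x)[0]     # H v, without materialising H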


9. Transformer Backpropagation

9.1 Full Transformer Layer Gradient Flow

A pre-norm transformer layer processes the residual stream $\mathbf{x} \in \mathbb{R}^d$ as:

$$\mathbf{x}' = \mathbf{x} + \text{Attn}(\text{LN}_1(\mathbf{x})), \qquad \mathbf{x}'' = \mathbf{x}' + \text{MLP}(\text{LN}_2(\mathbf{x}'))$$

Backward through one transformer layer (given $\bar{\mathbf{x}}''$):

GRADIENT FLOW - ONE TRANSFORMER LAYER


  FORWARD                              BACKWARD

  x                                    xbar'' flows in
    |                                      |
    LN_1(x)                            xbar' = xbar'' + MLP_backward(xbar'')
    |                                      |
    Attn(.)                            xbar  = xbar' + Attn_backward(xbar')
    |
  x' = x + Attn(LN_1(x))               The two residual additions split the
    |                                  gradient stream into parallel paths -
    LN_2(x')                           the identity skip carries the full
    |                                  upstream signal unchanged.
    MLP(.)
    |
  x'' = x' + MLP(LN_2(x'))

The critical observation: both residual additions in the transformer layer act as gradient splitters. The skip path carries a copy of $\bar{\mathbf{x}}''$ directly back to $\bar{\mathbf{x}}'$ without passing through the MLP Jacobian. This gives transformers well-behaved gradients even at $L = 96$ layers (GPT-3) and beyond.

9.2 LoRA Backward Pass

Low-Rank Adaptation (Hu et al., 2022) reparametrises a weight matrix:

$$W = W_0 + BA, \qquad W_0 \in \mathbb{R}^{m \times n} \text{ (frozen)}, \qquad B \in \mathbb{R}^{m \times r}, \quad A \in \mathbb{R}^{r \times n}$$

Forward: $\mathbf{y} = W\mathbf{x} = W_0\mathbf{x} + BA\mathbf{x}$.

Backward (given $\bar{\mathbf{y}}$):

$$\bar{A} = B^\top \bar{\mathbf{y}} \mathbf{x}^\top \in \mathbb{R}^{r \times n}, \qquad \bar{B} = \bar{\mathbf{y}} \mathbf{x}^\top A^\top \in \mathbb{R}^{m \times r}, \qquad \bar{\mathbf{x}}_\text{from LoRA} = (BA)^\top \bar{\mathbf{y}}$$

Note: $W_0$ is frozen, so $\bar{W}_0 = 0$ - no gradient is computed or stored for $W_0$. The backward pass only updates $A$ and $B$.

Memory savings: For $W_0 \in \mathbb{R}^{4096 \times 4096}$ with $r = 16$: gradient storage drops from $4096^2 = 16.7\text{M}$ to $r(m + n) = 16 \times 8192 = 131\text{K}$ parameters - a $128\times$ reduction in gradient memory for that layer.
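A minimal module sketch (using the standard LoRA initialisation: $A$ small random, $B$ zero, so $BA = 0$ at the start):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen W0 plus a trainable low-rank update B A."""
    def __init__(self, m, n, r):
        super().__init__()
        self.W0 = nn.Parameter(torch.randn(m, n), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, n) * 0.01)
        self.B = nn.Parameter(torch.zeros(m, r))       # BA = 0 at init

    def forward(self, x):
        return x @ self.W0.T + (x @ self.A.T) @ self.B.T

layer = LoRALinear(64, 32, r=4)
layer(torch.randn(8, 32)).sum().backward()
print(layer.W0.grad)                            # None: frozen, nothing stored
print(layer.A.grad.shape, layer.B.grad.shape)   # (4, 32), (64, 4)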

DoRA (Liu et al., 2024) further decomposes LoRA into magnitude + direction components, improving fine-tuning quality while preserving the low-rank backward structure.

9.3 Gradient Accumulation

Problem: Large effective batch sizes ($\sim 4$M tokens per step, as reported for GPT-4-scale training) don't fit in GPU memory for a single forward-backward pass.

Solution - gradient accumulation:

for micro_batch in micro_batches:      # G micro-batches total
    loss = forward(micro_batch) / G    # scaled loss
    loss.backward()                    # accumulates into param.grad
    # gradients are NOT zeroed between micro-batches

optimizer.step()                       # one update after G micro-batches
optimizer.zero_grad()

The division by $G$ ensures the accumulated gradient is mathematically identical to what a single pass with the full batch would produce.

For AI: GPT-3 used gradient accumulation to achieve an effective batch of $\sim 3\text{M}$ tokens with hardware that could only process $\sim 500\text{K}$ tokens per step.

9.4 Distributed Gradient Synchronisation

In data parallelism, each GPU processes a different micro-batch but shares the same model weights. After the backward pass, gradients must be synchronised:

All-Reduce: Sum gradients across all $N$ GPUs and divide by $N$. Implemented via ring all-reduce (NCCL) with $O(|\boldsymbol{\theta}|)$ communication per GPU, which is bandwidth-optimal.

Gradient sharding (ZeRO): DeepSpeed's ZeRO (Zero Redundancy Optimizer) partitions gradient storage across GPUs:

  • ZeRO Stage 1: Shard optimiser states -> $4\times$ memory reduction
  • ZeRO Stage 2: Additionally shard gradients -> $8\times$ reduction
  • ZeRO Stage 3: Shard parameters too -> $N\times$ reduction (linear in GPU count)

For LLaMA-3 70B training: ZeRO Stage 3 across 1024 H100 GPUs allows storing only $\sim 70\text{B}/1024 \approx 68\text{M}$ parameters per GPU - fitting the model in memory.


10. Common Mistakes

# | Mistake | Why it's wrong | Fix
- | ------- | -------------- | ---
1 | Applying the scalar chain rule $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$ to vector functions | The multivariate chain rule composes Jacobians, and order matters: $J_{f\circ g} = J_f \cdot J_g$, not $J_g \cdot J_f$ | Write Jacobians explicitly and multiply left-to-right in the order of composition
2 | Forgetting to sum gradients at fan-out (shared weight) nodes | Each use of a weight contributes a gradient; missing uses means undercounting | Accumulate gradients with += in the backward loop over all uses
3 | Treating $\nabla_W \mathcal{L} = \boldsymbol{\delta} \mathbf{x}^\top$ as shape-correct without checking | The outer product $\boldsymbol{\delta} \mathbf{x}^\top$ has shape $(n_\text{out}, n_\text{in})$ matching $W$; transposing either vector gives the wrong shape | Always verify gradient shapes match parameter shapes before implementation
4 | Using sigmoid/tanh in deep networks expecting no vanishing gradients | Their derivatives are bounded by 0.25 / 1.0 - products over many layers vanish exponentially | Use ReLU, GELU, or SiLU with proper initialisation; add residual connections
5 | Initialising all weights to zero (or the same value) | Symmetry breaking fails: every neuron in a layer computes the same gradient, so they all update identically and remain identical forever | Use Xavier or He initialisation with random values
6 | Computing softmax and cross-entropy separately instead of fused | Intermediate probabilities $p_i = e^{z_i}/\sum_j e^{z_j}$ overflow/underflow for large logits | Always use the log-sum-exp trick or a library's CrossEntropyLoss (which applies it internally)
7 | Confusing JVP and VJP - using JVP for all gradient computations | For a scalar loss with $n$ inputs, JVP needs $n$ passes; VJP needs one | Use VJP (backward mode) for scalar losses; reserve JVP for directional derivatives or Jacobian columns
8 | Clipping per-layer gradients independently instead of the global norm | Destroys the relative scale of gradients across layers; disrupts Adam's per-parameter adaptive scaling | Clip the global gradient norm: compute $\|\mathbf{g}\|$ across all parameters, scale down if above threshold
9 | Applying STE to continuous weights in quantisation-aware training | STE should only be applied at the discrete rounding step, not to subsequent continuous operations | Apply STE only at the round() or sign() node; propagate real gradients elsewhere
10 | Forgetting to scale the loss during gradient accumulation | Accumulating $G$ micro-batch gradients without dividing by $G$ produces a $G\times$ too-large effective gradient | Divide the loss by $G$ before backward, or divide accumulated gradients by $G$ before the optimiser step
11 | Not using create_graph=True when computing higher-order gradients in PyTorch | Without create_graph=True, the gradient computation is not tracked, so differentiating through it fails or returns wrong values | Pass create_graph=True in the first torch.autograd.grad() call when second derivatives are needed
12 | Confusing BPTT truncation with sequence truncation | Truncated BPTT still runs the full forward sequence; it only truncates the backward window. Sequence truncation shortens both | These are different operations - read the framework docs to confirm which is applied
