Part 2


Chain Rule and Backpropagation: Part 11: Exercises to Appendix K: Summary Tables

11. Exercises

Exercise 1 - Scalar Chain Rule Verification

Let $f(t) = \sin(t^2)$ and $g(t) = e^{3t}$ .

(a) Compute $\frac{d}{dt}[f(g(t))]$ using the chain rule analytically.

(b) Evaluate the derivative at $t = 0$ and $t = 1$ .

(c) Verify numerically using centred finite differences.

(d) Compute $\frac{d}{dt}[g(f(t))]$ - explain why the order of composition matters.

Exercise 2 - Jacobian Composition

Let $f: \mathbb{R}^3 \to \mathbb{R}^2$ and $g: \mathbb{R}^2 \to \mathbb{R}^3$ be defined by:

f(\mathbf{u}) = (u_1 u_2,\ u_2 + u_3^2), \quad g(\mathbf{x}) = (x_1^2,\ x_1 x_2,\ e^{x_2})

(a) Compute $J_f(\mathbf{u})$ and $J_g(\mathbf{x})$ analytically.

(b) Compute the Jacobian of $h = f \circ g: \mathbb{R}^2 \to \mathbb{R}^2$ using the chain rule $J_h = J_f(g(\mathbf{x})) \cdot J_g(\mathbf{x})$ .

(c) Verify using finite differences at $\mathbf{x}_0 = (1, 0)^\top$ .

(d) Compute $J_h$ directly and confirm it equals part (b).

Exercise 3 - Backprop Through a 2-Layer Network

Two-layer network: $\mathbf{z}^{[1]} = W^{[1]}\mathbf{x} + \mathbf{b}^{[1]}$ , $\mathbf{a}^{[1]} = \text{relu}(\mathbf{z}^{[1]})$ , $\hat{y} = \mathbf{w}^{[2]} \cdot \mathbf{a}^{[1]} + b^{[2]}$ , $\mathcal{L} = \tfrac{1}{2}(\hat{y} - y)^2$ .

With $W^{[1]} \in \mathbb{R}^{3 \times 2}$ , $\mathbf{x} \in \mathbb{R}^2$ , $\mathbf{w}^{[2]} \in \mathbb{R}^3$ :

(a) Implement forward pass. Compute $\mathcal{L}$ for given values.

(b) Implement backward pass manually using the backpropagation recurrence.

(c) Verify your gradients using numpy finite differences.

(d) Implement gradient descent for 100 steps with learning rate $0.01$ and verify loss decreases.

Exercise 4 - Vanishing Gradients Analysis

(a) Construct a 20-layer sigmoid network with all weights $= 0.3$ . Compute the gradient at layer 1 symbolically and numerically.

(b) Repeat with ReLU activation. Compare gradient magnitudes at layers 1, 5, 10, 20.

(c) Apply Xavier initialisation to the sigmoid network and compare gradient flow.

(d) Add residual connections to the 20-layer sigmoid network. Quantify the improvement.

(e) Plot gradient norm vs. layer depth for all four cases.

Exercise 5 - Gradient Checkpointing

(a) Implement a 10-layer feedforward network with explicit intermediate caching. Measure peak memory usage.

(b) Implement the same network with gradient checkpointing at every 3rd layer. Measure memory.

(c) Verify that both implementations produce identical gradients.

(d) Measure the compute overhead of recomputation. How does it compare to the theoretical $+33\%$ ?

(e) Find the optimal checkpoint interval $k$ that minimises total memory $\times$ compute cost.

Exercise 6 - Attention Gradient

Single-head attention: $O = \text{softmax}(QK^\top/\sqrt{d})V$ with $Q, K, V \in \mathbb{R}^{T \times d}$ for $T = 4$ , $d = 3$ .

(a) Implement forward pass.

(b) Implement backward pass computing $\bar{Q}, \bar{K}, \bar{V}$ given $\bar{O}$ .

(c) Verify all three gradients using finite differences.

(d) For causal masking (set $S_{ij} = -\infty$ for $j > i$ ), show that the backward pass is unchanged except at masked positions.

Exercise 7 - LoRA Gradient Analysis

(a) Implement a linear layer $\mathbf{y} = (W_0 + BA)\mathbf{x}$ with LoRA adaptation. Set $m=8, n=6, r=2$ .

(b) Compute gradients $\nabla_A \mathcal{L}$ and $\nabla_B \mathcal{L}$ analytically and verify numerically.

(c) Confirm that $\nabla_{W_0} \mathcal{L} = \bar{\mathbf{y}} \mathbf{x}^\top$ but is not used (frozen).

(d) Compare the number of gradient parameters for full fine-tuning vs. LoRA.

(e) Implement LoRA training for 200 steps on a toy task and compare convergence with full fine-tuning.

Exercise 8 - REINFORCE and STE

(a) Implement a stochastic computational graph: $z \sim \text{Bernoulli}(\sigma(\theta))$ , $\mathcal{L} = z^2$ .

(b) Compute the REINFORCE gradient $\nabla_\theta \mathbb{E}[\mathcal{L}]$ analytically.

(c) Estimate the REINFORCE gradient with 10000 samples. Verify against the analytical value.

(d) Implement STE for the rounding operation: $\hat{w} = \text{round}(w)$ , $\mathcal{L} = (\hat{w} - w_\text{target})^2$ . Compute the STE gradient and update $w$ .

(e) Compare STE-based quantisation-aware training on a toy example: train for 50 steps and measure quantisation error vs. a post-training quantised model.

12. Why This Matters for AI (2026 Perspective)

Concept	Concrete AI Impact
Multivariate chain rule	The mathematical foundation of every gradient-based learning algorithm - without it, backprop cannot be defined
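VJP as backprop primitive	Modern autodiff systems (JAX, PyTorch) are built around VJP primitives; reverse mode computes the full gradient of a scalar loss at a small constant multiple of the forward-pass cost, independent of the parameter count, which is what makes training billion-parameter models tractable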
Computation graphs	`torch.compile` (PyTorch 2.0), XLA (JAX/TensorFlow), TensorRT all operate by analysing the computation graph to fuse kernels and optimise memory layout
Fused softmax + CE gradient	The clean gradient $\mathbf{p} - \mathbf{e}_y$ makes language model training numerically stable; Flash Attention's backward uses the same softmax log-sum-exp statistics
Xavier/He initialisation	Ensures $O(1)$ gradient scale at depth 1 or depth 96 - a critical practical enabler for deep network training
Residual connections	The "gradient highway" identity term in ResNets/transformers is why 100-layer networks train at all; this was the key insight enabling GPT-3's 96 layers
Gradient checkpointing	Enables training LLMs with 128K context lengths; without it, the $O(T^2)$ activation memory would be prohibitive
FlashAttention backward	IO-aware backward pass reduces memory from $O(T^2)$ to $O(T)$ while maintaining numerical equivalence; widely used in production LLM training as of 2024
LoRA backward	Only $r(m+n) \ll mn$ parameters accumulate gradients; enables fine-tuning 70B models on a single H100 via the low-rank backward structure
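STE / REINFORCE	STE lets gradients pass through the non-differentiable rounding step in quantisation-aware training (post-training methods such as GPTQ and AWQ do not backpropagate through the quantiser); REINFORCE is the score-function estimator underlying the policy gradient step in PPO-based RLHF alignment training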
BPTT	The failure of vanilla BPTT for long sequences motivated LSTMs, GRUs, and ultimately the attention mechanism which replaces recurrence with direct pairwise interactions
ZeRO gradient sharding	Partitions gradient storage across GPUs linearly in GPU count; enables training models that would require $8\times$ more memory per GPU without it
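Mixed precision backward	16-bit (FP16/BF16) backward passes move half the bytes of FP32 and run roughly $2$-$3\times$ faster on H100-class GPUs; FP16 relies on dynamic loss scaling to prevent gradient underflow, while BF16 usually does not need it; standard practice in large-scale LLM training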
Higher-order gradients	Gradient penalties in GANs, MAML's meta-gradient, and Hessian-vector products for learning rate scheduling all require differentiating through the backward pass

Conceptual Bridge

Where we came from: 01 (Partial Derivatives) gave us tools to differentiate multivariate functions component by component. 02 (Jacobians and Hessians) assembled those into matrix objects capturing full first- and second-order sensitivity. We now know what a derivative is for a function $f: \mathbb{R}^n \to \mathbb{R}^m$ .

What this section added: The chain rule tells us how derivatives compose - allowing us to differentiate functions built from primitives. Backpropagation is the algorithmic instantiation of this composition for computation graphs, and the VJP (reverse mode) makes the cost of differentiating a scalar loss with respect to millions of parameters a small constant multiple of the cost of a single forward pass. This is not an approximation - the gradient is exact up to floating-point rounding, and the cost is optimal up to that constant factor.

What this enables: Every gradient-based learning algorithm - SGD, Adam, RMSprop, LARS, Shampoo - requires only the gradient $\nabla_\theta \mathcal{L}$ , which backprop provides. The advanced sections of this chapter (04 Optimisation, 05 Automatic Differentiation) build directly on the VJP abstraction established here.

Connection to transformer training: Modern LLM training is essentially an exercise in efficient backpropagation at scale. Every engineering decision - Flash Attention's tiled backward, ZeRO's gradient sharding, gradient checkpointing, LoRA's low-rank backward, mixed precision loss scaling - is a response to the memory and compute constraints of the backward pass. Understanding backpropagation is therefore prerequisite to understanding why LLM training systems are designed the way they are.

POSITION IN THE CURRICULUM


  PREREQUISITES (must know):
    01 Partial Derivatives - ∂f/∂x, gradient, directional derivative
    02 Jacobians & Hessians - J_f, Fréchet derivative, VJP/JVP

  THIS SECTION (03):
  
    Chain Rule & Backpropagation                                   
    - Multivariate chain rule (J_{fog} = J_f * J_g)               
    - Computation graphs (DAG, topological order)                  
    - Backprop recurrence (δ^[l] = (W^[l+1])ᵀ δ^[l+1] ⊙ σ'(z^[l]))
    - Gradient derivations (linear, softmax+CE, LN, attention)     
    - Vanishing/exploding gradients + solutions                    
    - Memory-efficient backprop (checkpointing, Flash Attention)   
    - Advanced: BPTT, STE, REINFORCE, higher-order gradients       
  

  WHAT THIS ENABLES:
    04 Optimisation - gradient descent, Adam, second-order methods
    05 Automatic Differentiation - AD systems, tape, jit compilation
    07 Neural Networks - full training loop built on backprop
    08 Transformer Architecture - FlashAttention, LoRA, gradient flow

  CROSS-CHAPTER CONNECTIONS:
    03-Advanced-LA/02-SVD - gradient low-rank structure
    04-Calculus/02-Derivatives - scalar chain rule (special case)
    06-Probability/03-MLE - loss functions that backprop optimises

For automatic differentiation systems that implement these ideas at scale, see 05 Automatic Differentiation.

For the optimisation algorithms that consume backprop's output, see 04 Multivariate Optimisation.

Appendix A: Worked Backpropagation Example

A.1 Complete Worked Example - 2-Layer Network

To make the backpropagation formulas concrete, we trace through a minimal example end-to-end.

Network architecture:

Input: $\mathbf{x} = (1, 2)^\top$
Layer 1: $W^{[1]} = \begin{pmatrix}0.5 & 0.1 \\ 0.2 & 0.3\end{pmatrix}$ , $\mathbf{b}^{[1]} = (0, 0)^\top$ , activation: ReLU
Layer 2: $\mathbf{w}^{[2]} = (0.4, 0.6)^\top$ , $b^{[2]} = 0$ , activation: none (scalar output)
Loss: $\mathcal{L} = \tfrac{1}{2}(\hat{y} - 1)^2$ (MSE with target $y = 1$ )

Forward pass:

\mathbf{z}^{[1]} = W^{[1]}\mathbf{x} = \begin{pmatrix}0.5 \cdot 1 + 0.1 \cdot 2 \\ 0.2 \cdot 1 + 0.3 \cdot 2\end{pmatrix} = \begin{pmatrix}0.7 \\ 0.8\end{pmatrix}

\mathbf{a}^{[1]} = \text{relu}(\mathbf{z}^{[1]}) = \begin{pmatrix}0.7 \\ 0.8\end{pmatrix} \quad \text{(both positive, no change)}

\hat{y} = \mathbf{w}^{[2]} \cdot \mathbf{a}^{[1]} = 0.4 \times 0.7 + 0.6 \times 0.8 = 0.28 + 0.48 = 0.76

\mathcal{L} = \tfrac{1}{2}(0.76 - 1)^2 = \tfrac{1}{2}(0.0576) = 0.0288

Backward pass:

Output layer gradient:

\frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y = 0.76 - 1 = -0.24

Layer 2 gradients (scalar output, linear):

\frac{\partial \mathcal{L}}{\partial \mathbf{w}^{[2]}} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \mathbf{a}^{[1]} = -0.24 \times \begin{pmatrix}0.7 \\ 0.8\end{pmatrix} = \begin{pmatrix}-0.168 \\ -0.192\end{pmatrix}

\frac{\partial \mathcal{L}}{\partial b^{[2]}} = -0.24

Error signal propagated to layer 1:

\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[1]}} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \mathbf{w}^{[2]} = -0.24 \times \begin{pmatrix}0.4 \\ 0.6\end{pmatrix} = \begin{pmatrix}-0.096 \\ -0.144\end{pmatrix}

Through ReLU:

\boldsymbol{\delta}^{[1]} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[1]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[1]}} \odot \text{relu}'(\mathbf{z}^{[1]}) = \begin{pmatrix}-0.096 \\ -0.144\end{pmatrix} \odot \begin{pmatrix}1 \\ 1\end{pmatrix} = \begin{pmatrix}-0.096 \\ -0.144\end{pmatrix}

Layer 1 weight gradients:

\frac{\partial \mathcal{L}}{\partial W^{[1]}} = \boldsymbol{\delta}^{[1]} \mathbf{x}^\top = \begin{pmatrix}-0.096 \\ -0.144\end{pmatrix}(1, 2) = \begin{pmatrix}-0.096 & -0.192 \\ -0.144 & -0.288\end{pmatrix}

Verification (finite difference for $W^{[1]}_{11}$ ): Perturb $W^{[1]}_{11}$ by $h = 10^{-5}$ :

\mathcal{L}(W^{[1]}_{11} + h) - \mathcal{L}(W^{[1]}_{11} - h) \approx 2h \cdot (-0.096)

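Numerically: $\mathcal{L}(0.50001) \approx 0.02879904$ and $\mathcal{L}(0.49999) \approx 0.02880096$ , which straddle the unperturbed loss $0.0288$ . The difference is $-1.92 \times 10^{-6}$ , and dividing by $2h = 2 \times 10^{-5}$ gives $-0.096$ , matching the backprop value.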
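The arithmetic of this worked example can be reproduced in a few lines of NumPy. The sketch below is only a check of the numbers above, not a general implementation; the variable names (`loss_fn`, `delta1`, `grad_W1`) are illustrative choices, and the biases are omitted because they are zero in the example.

```python
import numpy as np

# Values copied from the worked example above.
x = np.array([1.0, 2.0])
W1 = np.array([[0.5, 0.1], [0.2, 0.3]])
w2 = np.array([0.4, 0.6])
y = 1.0

def loss_fn(W1_):
    """Forward pass with zero biases; returns the scalar loss."""
    a1 = np.maximum(0.0, W1_ @ x)
    return 0.5 * (w2 @ a1 - y) ** 2

# Forward pass, caching what the backward pass needs.
z1 = W1 @ x
a1 = np.maximum(0.0, z1)
y_hat = w2 @ a1
loss = 0.5 * (y_hat - y) ** 2

# Backward pass, following the recurrence step by step.
d_yhat = y_hat - y                    # dL/dy_hat
grad_w2 = d_yhat * a1                 # dL/dw2
delta1 = (d_yhat * w2) * (z1 > 0)     # dL/dz1 (ReLU mask)
grad_W1 = np.outer(delta1, x)         # dL/dW1

# Centred finite difference for the single entry W1[0, 0].
h = 1e-5
E = np.zeros_like(W1)
E[0, 0] = h
fd = (loss_fn(W1 + E) - loss_fn(W1 - E)) / (2 * h)

print("loss    :", loss)
print("grad_w2 :", grad_w2)
print("grad_W1 :", grad_W1)
print("FD check:", fd, "vs backprop", grad_W1[0, 0])
```

Because the loss is quadratic in $W^{[1]}_{11}$ at this point (both ReLUs are active), the centred difference agrees with backprop up to rounding error.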

A.2 Computational Cost Comparison

Forward pass: $O(n_0 n_1 + n_1 n_2 + \cdots + n_{L-1} n_L)$ - one GEMM per layer.

Backward pass: Also $O(\sum_l n_{l-1} n_l)$ - same asymptotic cost, with constant factor $\approx 2-3$ .

Memory: Cache all $\mathbf{z}^{[l]}$ and $\mathbf{a}^{[l]}$ : $O(\sum_l n_l)$ scalars - linear in total neuron count.

The fundamental theorem of backpropagation: Computing $\nabla_\theta \mathcal{L}$ for all $|\theta|$ parameters costs only a constant factor more than computing $\mathcal{L}$ itself. This is the miracle that makes gradient-based learning tractable.

Formal statement: Let $T_f$ be the time to evaluate $\mathcal{L}$ in the forward pass. Then the time to compute all partial derivatives $\partial \mathcal{L}/\partial \theta_i$ via backprop is at most $c \cdot T_f$ where $c \leq 5$ in practice.

This contrasts with finite differences: computing $\partial \mathcal{L}/\partial \theta_i$ for each of $|\theta|$ parameters via centred finite differences costs $2|\theta|$ forward passes - for GPT-3 with $|\theta| = 175\text{B}$ , this would be $350$ billion forward passes for a single gradient, which is far beyond any practical compute budget.

Appendix B: JVP vs VJP - Mode Selection and Complexity

B.1 Forward Mode vs Reverse Mode

Given $f: \mathbb{R}^n \to \mathbb{R}^m$ , both modes compute the same gradient information but with different costs:

Mode	Computes	Cost per pass	Total cost for full Jacobian
Forward (JVP)	One column of $J_f$	$O(T_f)$	$O(n \cdot T_f)$
Reverse (VJP)	One row of $J_f$	$O(T_f)$	$O(m \cdot T_f)$

COST MATRIX: WHICH MODE WINS?

  Goal: compute ∂L/∂θ for L: R^n -> R (scalar loss)

  n = |θ| = 175,000,000,000  (GPT-3 parameter count)
  m = 1                       (scalar loss)

  Reverse mode: m x T_f = 1 x T_f     <- ONE PASS
  Forward mode: n x T_f = 175B x T_f  <- 175 BILLION PASSES

  Reverse mode (backprop) costs one pass because m = 1 means the
  Jacobian has ONE ROW: the gradient row vector.
  Forward mode would need n = 175B passes to fill all columns.

    RULE: Use reverse mode (backprop) when m ≪ n
    RULE: Use forward mode (JVP) when n ≪ m

  Most ML: n ≫ m = 1 -> backprop is optimal

When forward mode wins: Computing the sensitivity of all $m$ outputs to one input parameter - e.g., computing how the entire model output changes as a single hyperparameter varies. Also: Jacobian-vector products in conjugate gradient (no need for the full Jacobian).

Mixed strategies: For functions $f: \mathbb{R}^n \to \mathbb{R}^m$ with $n \approx m$ , neither mode has a clear advantage. One option is to split the Jacobian into row/column blocks and use whichever mode is cheaper for each block; related ideas appear in adjoint methods for numerical PDE solvers.
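The pass counts in the table can be made concrete with a hand-written JVP and VJP for a small function. The sketch below assumes a toy map $f(\mathbf{x}) = W_2 \tanh(W_1 \mathbf{x})$ with a scalar output; the names `jvp`, `vjp`, `W1`, `W2` and the sizes are illustrative and not tied to any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 6, 5, 1                      # n inputs, k hidden units, m = 1 scalar output
W1 = rng.normal(size=(k, n))
W2 = rng.normal(size=(m, k))

def f(x):
    return W2 @ np.tanh(W1 @ x)

def jvp(x, x_dot):
    """Forward mode: push a tangent x_dot through f, giving J @ x_dot."""
    s = 1.0 - np.tanh(W1 @ x) ** 2     # tanh'(z)
    return W2 @ (s * (W1 @ x_dot))

def vjp(x, y_bar):
    """Reverse mode: pull a cotangent y_bar back through f, giving J.T @ y_bar."""
    s = 1.0 - np.tanh(W1 @ x) ** 2
    return W1.T @ (s * (W2.T @ y_bar))

x = rng.normal(size=n)

# Forward mode needs n passes: one Jacobian column per basis tangent.
J_forward = np.stack([jvp(x, e) for e in np.eye(n)], axis=1)   # shape (m, n)
# Reverse mode needs m passes: one Jacobian row per basis cotangent.
J_reverse = np.stack([vjp(x, e) for e in np.eye(m)], axis=0)   # shape (m, n)

print("JVP passes:", n, " VJP passes:", m)
print("same Jacobian:", np.allclose(J_forward, J_reverse))
```

With $m = 1$ the single VJP already returns the whole gradient, while forward mode has to be repeated once per input coordinate.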

B.2 Tangent Mode for Hessian-Vector Products

As shown in 02, the Hessian-vector product $H\mathbf{v}$ can be computed by composing forward and reverse modes:

H\mathbf{v} = J_\mathbf{g}^\top \mathbf{v} \quad \text{where} \quad \mathbf{g} = \nabla_\theta \mathcal{L}

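Algorithm (Pearlmutter's $\mathcal{R}\{\cdot\}$ operator, 1994):

Apply a JVP with direction $\mathbf{v}$ to the gradient computation, so that $\mathbf{g} = \nabla \mathcal{L}$ and $\dot{\mathbf{g}} = J_\mathbf{g} \mathbf{v}$ are computed together in one combined pass
Cost: a small constant multiple of one backprop pass ( $O(T_f)$ )

Implementation in PyTorch (this variant uses two reverse passes rather than forward-over-reverse):

import torch
# Assumes: loss is a scalar tensor, params a list of tensors with requires_grad=True, v a flat vector
g = torch.autograd.grad(loss, params, create_graph=True)  # first backward, keeps the graph
flat_g = torch.cat([gi.reshape(-1) for gi in g])
hvp = torch.autograd.grad(flat_g @ v, params)  # second backward: d(g . v)/d(theta) = Hv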

Cost: 2 backprop passes, no $n \times n$ matrix formed. This is the primitive for:

Conjugate gradient for Newton steps (Hessian-free / truncated Newton methods)
Lanczos iteration for $\lambda_\text{max}$ of Hessian
Eigenvalue monitoring during training (Cohen et al., 2022 - edge of stability)
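To see a Hessian-vector product without ever forming $H$ , the sketch below differentiates an analytic gradient along a direction $\mathbf{v}$ by hand, which is the idea behind the $\mathcal{R}$-operator. It assumes a toy loss $\mathcal{L}(\theta) = \sum_i \text{softplus}((A\theta)_i)$ ; the matrix `A` and the function names are hypothetical, and the explicit Hessian is built only as a reference because $n = 4$ here.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))            # hypothetical data matrix
theta = rng.normal(size=4)
v = rng.normal(size=4)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def grad(theta):
    """Gradient of L(theta) = sum(softplus(A @ theta))."""
    return A.T @ sigmoid(A @ theta)

def hvp(theta, v):
    """Directional derivative of grad() along v: H @ v without building H."""
    s = sigmoid(A @ theta)
    return A.T @ (s * (1.0 - s) * (A @ v))

# Reference: explicit Hessian, affordable only because n = 4 here.
s = sigmoid(A @ theta)
H = A.T @ np.diag(s * (1.0 - s)) @ A

# Independent check: centred difference of the gradient along v.
h = 1e-5
fd = (grad(theta + h * v) - grad(theta - h * v)) / (2 * h)

print("matches H @ v :", np.allclose(hvp(theta, v), H @ v))
print("matches FD    :", np.allclose(hvp(theta, v), fd, atol=1e-6))
```

The `hvp` function touches only matrix-vector products with `A`, so its cost stays proportional to one gradient evaluation.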

Appendix C: Automatic Differentiation Preview

C.1 The AD Abstraction

Automatic differentiation (AD) is a mechanical procedure for transforming any program that computes $f(\mathbf{x})$ into a program that also computes $\nabla f(\mathbf{x})$ (or JVPs/VJPs). This section previews the idea; the full treatment is in 05.

AD is neither symbolic differentiation (too slow, exponentially large expressions) nor numerical differentiation (finite differences - too imprecise, costs $O(n)$ evaluations). AD exploits the fact that every program is a composition of primitives, and the chain rule tells us exactly how to compose their derivatives.

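AD compared with the two alternatives:

Approach	How it works	Example: $f(x) = x^2 + \sin(x)$	Accuracy and cost
Symbolic differentiation	Manipulates the expression	$f'(x) = 2x + \cos(x)$	Exact, but expression size can explode
Numerical differentiation	Evaluates $f(x+h)$ and $f(x-h)$	$(f(x+h) - f(x-h))/2h$	Approximate; costs $O(n)$ evaluations for $n$ inputs
Automatic differentiation	Records operations on a computation tape	Derivative values propagated through the tape	Exact to floating-point precision; reverse mode costs a constant multiple of one evaluation for a scalar output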

C.2 The Tape (Wengert List)

The Wengert list (1964) records, during the forward pass, every primitive operation applied and its operands. The backward pass replays this tape in reverse, accumulating adjoints.

FORWARD TAPE EXAMPLE: f(x) = exp(x) * (x + 1)

  Tape (built during forward):
    v_1 = x             (input)
    v_2 = exp(v_1)      (op: exp,  operand: v_1)
    v_3 = v_1 + 1.0     (op: add,  operands: v_1, 1.0)
    v_4 = v_2 * v_3     (op: mul,  operands: v_2, v_3)

  Backward (replay in reverse; v̄_i is the adjoint df/dv_i, initialised to 0):
    v̄_4  = 1.0                                  (seed)
    v̄_2 += v̄_4 * v_3      = 1.0 * (x+1)         (mul backward)
    v̄_3 += v̄_4 * v_2      = 1.0 * exp(x)        (mul backward)
    v̄_1 += v̄_3 * 1.0      = exp(x)              (add backward)
    v̄_1 += v̄_2 * exp(v_1) = (x+1)exp(x)         (exp backward)

  Total: v̄_1 = exp(x) + (x+1)exp(x) = (x+2)exp(x)  (matches the product rule)

PyTorch's Tensor stores a grad_fn attribute at each node - this is the tape in disguise. Calling .backward() replays the tape in reverse.
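The tape mechanism can be imitated in plain Python for the same function $f(x) = e^x (x + 1)$ . This is a minimal sketch under the assumption that each tape entry stores an output index together with its inputs and local derivatives; real AD systems store closures or op records instead, and the zero-based indices here correspond to $v_1, \dots, v_4$ in the text.

```python
import math

def f_and_grad(x):
    """Evaluate f(x) = exp(x) * (x + 1) and df/dx using an explicit tape."""
    tape = []                      # entries: (output index, [(input index, local derivative)])
    v = [x]                        # v[0] is the input (v_1 in the text)

    v.append(math.exp(v[0]))       # v[1] = exp(v[0])
    tape.append((1, [(0, v[1])]))  # d exp(u)/du = exp(u)

    v.append(v[0] + 1.0)           # v[2] = v[0] + 1
    tape.append((2, [(0, 1.0)]))

    v.append(v[1] * v[2])          # v[3] = v[1] * v[2]
    tape.append((3, [(1, v[2]), (2, v[1])]))

    # Backward: seed the output adjoint, then replay the tape in reverse.
    adj = [0.0] * len(v)
    adj[3] = 1.0
    for out, inputs in reversed(tape):
        for inp, local in inputs:
            adj[inp] += adj[out] * local
    return v[3], adj[0]

value, grad = f_and_grad(0.5)
print("f(0.5)        =", value)
print("tape gradient =", grad)
print("(x+2)exp(x)   =", (0.5 + 2.0) * math.exp(0.5))
```

The `+=` in the backward loop matters: the input feeds two operations, so its adjoint is the sum of two contributions.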

For more: See 05 Automatic Differentiation for the complete treatment of forward/reverse mode AD, source transformation, operator overloading, and the design of JAX vs PyTorch autograd.

Appendix D: Numerical Gradient Verification

In practice, every backpropagation implementation should be verified against finite differences. This appendix presents the standard toolkit.

D.1 Centred Finite Differences

For a scalar loss $\mathcal{L}: \mathbb{R}^n \to \mathbb{R}$ and parameter $\theta_i$ :

\frac{\partial \mathcal{L}}{\partial \theta_i} \approx \frac{\mathcal{L}(\boldsymbol{\theta} + h\mathbf{e}_i) - \mathcal{L}(\boldsymbol{\theta} - h\mathbf{e}_i)}{2h}

Error analysis: Centred differences have $O(h^2)$ error (vs $O(h)$ for forward differences). Optimal step size $h$ balances truncation error ( $O(h^2)$ ) against floating-point cancellation error ( $O(\epsilon/h)$ where $\epsilon$ is machine epsilon):

h_\text{opt} \approx \epsilon^{1/3} \approx (10^{-16})^{1/3} \approx 10^{-5} \quad \text{(for float64)}

Use $h = 10^{-5}$ for float64 and $h = 10^{-3}$ for float32.

Relative error check: Accept the gradient check if:

\frac{\|\nabla_\text{backprop} - \nabla_\text{FD}\|}{\|\nabla_\text{backprop}\| + \|\nabla_\text{FD}\|} < 10^{-7} \quad \text{(float64)}
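The centred-difference formula and the relative-error test translate directly into a small checker. The sketch below is one possible arrangement, assuming a flat float64 parameter vector; the helper names `numerical_grad` and `relative_error` and the toy loss are illustrative.

```python
import numpy as np

def numerical_grad(loss, theta, h=1e-5):
    """Centred finite differences, one coordinate at a time (2n loss evaluations)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * h)
    return g

def relative_error(g_a, g_b):
    """Symmetric relative error used in the acceptance test above."""
    denom = np.linalg.norm(g_a) + np.linalg.norm(g_b)
    return np.linalg.norm(g_a - g_b) / denom if denom > 0 else 0.0

# Toy loss with a known gradient: L = sum(theta * sin(theta)).
def loss(t):
    return np.sum(np.sin(t) * t)

def analytic(t):
    return np.sin(t) + t * np.cos(t)

theta = np.array([0.3, -1.2, 2.0])     # float64 by default
err = relative_error(analytic(theta), numerical_grad(loss, theta))
print("relative error:", err, "-> pass" if err < 1e-7 else "-> FAIL")
```

A deliberately wrong gradient (for example, scaling `analytic` by 1.01) pushes the relative error to roughly $5 \times 10^{-3}$ , far above the threshold.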

D.2 When Gradient Checks Fail

Common failure modes:

Symptom	Likely cause
Relative error $\sim 10^{-3}$ throughout	`h` too large (truncation) or float32 precision
Relative error $\sim 1$ for specific parameters	Bug in backward for that parameter type
Relative error $\sim 0$ for all gradients	Loss is approximately linear in those parameters at the test point
Fails at kink (ReLU/max)	Gradient not defined at $z=0$ ; test point near kink; use $\mathbf{x}$ away from kinks
Fails only for batch size 1	BatchNorm statistics degenerate; use batch size $\geq 2$ for BN checks

D.3 Gradient Check in PyTorch

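import torch
from torch.autograd import gradcheck

def f(x):
    return (x ** 2).sum()

# float64 input: gradcheck compares analytic and finite-difference Jacobians
x = torch.randn(5, requires_grad=True, dtype=torch.float64)
gradcheck(f, (x,), eps=1e-6, atol=1e-4, rtol=1e-4)  # returns True on success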

gradcheck automates centred finite differences for all inputs with requires_grad=True. Always use dtype=torch.float64 for gradient checking - float32 precision is insufficient for reliable checks.

Appendix E: Key Formulas Reference

E.1 Chain Rule Summary

Setting	Formula
Scalar composition	$(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$
Vector composition	$J_{f \circ g}(\mathbf{x}) = J_f(g(\mathbf{x})) \cdot J_g(\mathbf{x})$
VJP (backprop step)	$\bar{\mathbf{x}} = J_g(\mathbf{x})^\top \bar{\mathbf{y}}$
JVP (forward step)	$\dot{\mathbf{y}} = J_g(\mathbf{x}) \dot{\mathbf{x}}$

E.2 Backpropagation Formulas

Layer	Forward	Backward ( $\bar{\mathbf{z}}$ given)
Linear $\mathbf{z} = W\mathbf{x} + \mathbf{b}$	-	$\bar{\mathbf{x}} = W^\top \bar{\mathbf{z}}$ , $\bar{W} = \bar{\mathbf{z}}\mathbf{x}^\top$ , $\bar{\mathbf{b}} = \bar{\mathbf{z}}$
Elementwise $\mathbf{a} = \sigma(\mathbf{z})$	-	$\bar{\mathbf{z}} = \bar{\mathbf{a}} \odot \sigma'(\mathbf{z})$
Softmax+CE	$p_i = e^{z_i}/\sum e^{z_j}$	$\partial \mathcal{L}/\partial \mathbf{z} = \mathbf{p} - \mathbf{e}_y$
Residual $\mathbf{y} = \mathbf{x} + F(\mathbf{x})$	-	$\bar{\mathbf{x}} = \bar{\mathbf{y}} + J_F^\top \bar{\mathbf{y}}$
LayerNorm	$\hat{x}_i = (x_i-\mu)/\sigma$	Complex (see 5.4); passes signal

E.3 Activation Derivatives

Name	$\sigma(z)$	$\sigma'(z)$
ReLU	$\max(0,z)$	$\mathbf{1}[z>0]$
Sigmoid	$1/(1+e^{-z})$	$\sigma(1-\sigma)$
Tanh	$(e^z-e^{-z})/(e^z+e^{-z})$	$1-\tanh^2$
GELU	$z\Phi(z)$	$\Phi(z)+z\phi(z)$
SiLU	$z/(1+e^{-z})$	$\sigma(z)(1+z(1-\sigma(z)))$
Softplus	$\log(1+e^z)$	$\sigma(z)$
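The closed-form derivatives in this table can be spot-checked against centred differences using only the standard library. The sketch below covers the smooth activations; ReLU is left out because its kink at $z = 0$ makes finite differences unreliable there. The test points and the helper name `max_fd_error` are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# name -> (activation, derivative as listed in the table)
ACTIVATIONS = {
    "sigmoid":  (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
    "tanh":     (math.tanh, lambda z: 1.0 - math.tanh(z) ** 2),
    "gelu":     (lambda z: z * Phi(z), lambda z: Phi(z) + z * phi(z)),
    "silu":     (lambda z: z * sigmoid(z),
                 lambda z: sigmoid(z) * (1.0 + z * (1.0 - sigmoid(z)))),
    "softplus": (lambda z: math.log1p(math.exp(z)), sigmoid),
}

def max_fd_error(fn, dfn, points=(-2.0, -0.5, 0.7, 1.9), h=1e-5):
    """Largest gap between the tabulated derivative and a centred difference."""
    return max(abs((fn(z + h) - fn(z - h)) / (2 * h) - dfn(z)) for z in points)

errors = {name: max_fd_error(fn, dfn) for name, (fn, dfn) in ACTIVATIONS.items()}
for name, err in errors.items():
    print(f"{name:9s} max |FD - formula| = {err:.2e}")
```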

E.4 Initialisation Standards

Method	Distribution	Variance	When
Xavier uniform	$U[-a,a]$	$2/(n_\text{in}+n_\text{out})$	Sigmoid, tanh
Xavier normal	$\mathcal{N}(0,\sigma^2)$	$2/(n_\text{in}+n_\text{out})$	Sigmoid, tanh
He uniform	$U[-a,a]$	$2/n_\text{in}$	ReLU
He normal	$\mathcal{N}(0,\sigma^2)$	$2/n_\text{in}$	ReLU
GPT-2 residual	$\mathcal{N}(0,(0.02/\sqrt{2L})^2)$	-	Transformer residuals
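As a quick sanity check on the variance column, the sketch below samples a Xavier-uniform and a He-normal weight matrix and compares the empirical variance with the target. The bound $a = \sqrt{6/(n_\text{in}+n_\text{out})}$ follows from the fact that $U[-a,a]$ has variance $a^2/3$ ; the layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 400, 300

# Xavier uniform: U[-a, a] has variance a^2 / 3, so a = sqrt(6 / (n_in + n_out)).
a = np.sqrt(6.0 / (n_in + n_out))
W_xavier = rng.uniform(-a, a, size=(n_out, n_in))

# He normal: standard deviation sqrt(2 / n_in).
W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

print("Xavier: empirical", W_xavier.var(), "target", 2.0 / (n_in + n_out))
print("He:     empirical", W_he.var(), "target", 2.0 / n_in)
```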

Appendix F: Deep Dive - Vanishing Gradients in Transformers

F.1 Why Transformers Don't Vanish

A naive reading of the vanishing gradient analysis (6.1) suggests that 96-layer transformers should suffer catastrophic vanishing. They don't. Here is why.

The residual stream analysis: In a pre-norm transformer, the residual stream after layer $l$ is:

\mathbf{x}^{[l]} = \mathbf{x}^{[0]} + \sum_{k=1}^{l} F^{[k]}(\mathbf{x}^{[k-1]})

where $F^{[k]}$ is the $k$ -th sublayer (attention or MLP, wrapped in LayerNorm).

The gradient of the loss with respect to the input is:

\frac{\partial \mathcal{L}}{\partial \mathbf{x}^{[0]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}^{[L]}} \cdot \prod_{l=1}^{L}\left(I + J_{F^{[l]}}\right)

At initialisation, the transformer weights are small, so $J_{F^{[l]}} \approx 0$ and the product $\prod(I + J_{F^{[l]}}) \approx I$ . The gradient flows back unchanged through all $L$ layers. This is categorically different from a plain deep network where the product of small Jacobians vanishes.
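A small numerical experiment illustrates the difference between the two products. The sketch below uses random matrices as stand-ins for the layer Jacobians $J_{F^{[l]}}$ at initialisation; the width, depth and the weight scale `0.02` are assumptions chosen for illustration, not values taken from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 16, 50
scale = 0.02                                   # assumed small-weight scale

# One random Jacobian per layer, standing in for J_F at initialisation.
jacobians = [scale * rng.normal(size=(d, d)) for _ in range(L)]
g_out = rng.normal(size=d)                     # stand-in for dL/dx^[L]

g_plain, g_resid = g_out.copy(), g_out.copy()
for J in reversed(jacobians):                  # walk from layer L back to layer 1
    g_plain = g_plain @ J                      # plain stack: product of J
    g_resid = g_resid @ (np.eye(d) + J)        # residual stack: product of (I + J)

ratio_plain = np.linalg.norm(g_plain) / np.linalg.norm(g_out)
ratio_resid = np.linalg.norm(g_resid) / np.linalg.norm(g_out)
print("plain    gradient norm ratio:", ratio_plain)
print("residual gradient norm ratio:", ratio_resid)
```

With these settings the plain product collapses by many orders of magnitude, while the residual product keeps the gradient norm close to its original size.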

Gradient norm growth: As training progresses and weights grow, $J_{F^{[l]}}$ becomes nontrivial. The gradient norm may grow with depth, but this is controlled by:

LayerNorm dampening (see 6.6)
GPT-2's $1/\sqrt{2L}$ scaling of residual projections
Gradient clipping ( $\tau = 1.0$ )

The "edge of stability" phenomenon (Cohen et al., 2022): In practice, the maximum Hessian eigenvalue $\lambda_\text{max}$ often approaches $2/\eta$ (twice the inverse learning rate) and oscillates there. In this regime the training dynamics are neither fully stable nor unstable: the curvature is high enough to cause oscillation but not divergence.

F.2 Gradient Norm as Training Signal

Modern LLM training monitors gradient norm at every step. Typical patterns:

GRADIENT NORM DURING LLM TRAINING

  [Figure: global gradient norm ‖∇_θ L‖₂ against training steps. The curve stays below the clip threshold of 1 during normal training, with an occasional spike that coincides with a loss spike.]

  Patterns:
  - Steady ‖∇_θ L‖₂ < 1: healthy training, clipping inactive
  - Sudden spike -> loss spike -> recovery: numerical event
    (often a "bad" batch; LLM training has ~1-3 such events
     per trillion tokens at scale)
  - Slow upward drift: learning rate may be too high
Loss spike mitigation: When the gradient norm exceeds the clip threshold, the entire gradient update is scaled down. If the spike is from a corrupted batch, this prevents permanent damage to the model weights.

Gradient accumulation and norm: When using $G$ accumulation steps, each micro-batch contributes $1/G$ of the gradient. The global norm is computed across the accumulated gradient (after summation, before the optimiser step) - not across individual micro-batches.
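The ordering described here (accumulate first, then clip once on the global norm) can be sketched in NumPy. The parameter shapes, the random stand-in gradients and the helper name `clip_by_global_norm` are all hypothetical; a framework would normally provide an equivalent utility.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradient arrays by one factor so their joint L2 norm is <= max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    factor = min(1.0, max_norm / (total + 1e-12))
    return [g * factor for g in grads], total

rng = np.random.default_rng(0)
G = 4                                             # accumulation steps
shapes = [(3, 2), (3,)]                           # hypothetical parameter shapes

# Accumulate: each micro-batch contributes 1/G of the final gradient.
accum = [np.zeros(s) for s in shapes]
for _ in range(G):
    micro = [5.0 * rng.normal(size=s) for s in shapes]   # stand-in micro-batch gradients
    accum = [a + m / G for a, m in zip(accum, micro)]

# Clip once, on the accumulated gradient, just before the optimiser step.
clipped, norm_before = clip_by_global_norm(accum, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print("norm before clipping:", norm_before)
print("norm after clipping :", norm_after)
```

Using a single scale factor for every parameter tensor preserves the direction of the update; clipping each tensor separately would not.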

F.3 Per-Layer Gradient Norm Analysis

For diagnostic purposes, logging the gradient norm per layer reveals:

Embedding gradients: Often the largest, due to sparse updates (5.6)
Early layers: Smallest (furthest from loss); potential vanishing
Late layers: Largest; potential exploding
LayerNorm parameters: Very small - $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ converge quickly

This per-layer analysis guided the design of:

LARS/LAMB optimisers (You et al., 2017): layer-wise adaptive learning rates based on weight-to-gradient ratio
Muon (2024): orthogonalises the momentum-averaged gradient of each weight matrix with a Newton-Schulz iteration (using Nesterov momentum); designed for hidden layers while AdamW handles embedding and output

Appendix G: Historical Development

G.1 Timeline of Backpropagation

The development of backpropagation spans three centuries and multiple independent discoveries:

Year	Event	Significance
1676	Leibniz publishes differential calculus (chain rule for single variable)	Mathematical foundation
1744	Euler uses variational methods (antecedent of reverse mode)	First "adjoint" idea
1847	Cauchy introduces gradient descent	The algorithm backprop serves
1960	Kalman filter (reverse-mode for linear dynamical systems)	AD in engineering
1964	Wengert introduces forward accumulation and the evaluation trace (Wengert list)	First explicit AD
1970	Linnainmaa's thesis: general backpropagation	Full theoretical framework
1974	Werbos PhD thesis: backprop for neural networks	Connection to ML
1982	Hopfield networks (energy-based models with gradient)	Alternative to backprop
1986	Rumelhart, Hinton & Williams - "Learning representations by back-propagating errors"	Popularised backprop for NNs
1991	Hochreiter: vanishing gradient problem analysed	Identified depth barrier
1997	LSTM: gating to address vanishing gradient in RNNs	First scalable deep sequence model
2012	AlexNet: backprop on GPU at scale	Practical deep learning
2015	ResNets: residual connections for gradient flow	Enabled 100+ layer networks
2016	PyTorch / TensorFlow 1.0: autodiff frameworks	Democratised backprop
2017	Transformers: attention replaces recurrence, removing BPTT	Mitigated long-range vanishing
2018	JAX: functional autodiff, JIT compilation	Research-grade AD
2022	FlashAttention: IO-aware backward pass	Exact attention backward with $O(T)$ memory instead of $O(T^2)$
2022	PyTorch 2.0 `torch.compile`	Graph-based kernel fusion
2023	FlashAttention-2: improved GPU utilisation	Standard for production
2024	FlashAttention-3: H100-optimised with async	State-of-art attention backward

G.2 The Independent Discoveries

Backpropagation was independently discovered at least four times before becoming widely known:

Linnainmaa (1970): In his master's thesis, presented the general algorithm for computing exact partial derivatives of any function composed of elementary operations - precisely what we today call reverse-mode AD.
Werbos (1974): Applied the same idea to multi-layer neural networks in his PhD thesis, but the work was largely ignored for over a decade.
Parker (1985): Independently rediscovered backpropagation for neural networks.
Rumelhart, Hinton & Williams (1986): Published the algorithm in Nature and produced the critical experimental demonstrations that convinced the community it could work. Their paper is the one most often cited today.

This pattern of independent rediscovery is common in mathematics - the ideas are "in the air" once the prerequisites are established. The chain rule (1676) + computation graphs (1960s) + gradient descent (1847) = backpropagation (inevitable).

G.3 The Hardware-Algorithm Co-evolution

The practical impact of backpropagation depends critically on hardware:

CPU era (1986-2011): Backprop is theoretically valid but computationally slow. Networks with more than 3-4 layers were impractical.
GPU era (2012-present): NVIDIA's CUDA (2007) enables massively parallel GEMM operations. The bottleneck shifts from FLOPS to memory bandwidth.
Tensor core era (2017-present): NVIDIA Volta/Ampere/Hopper GPUs have dedicated matrix multiply accelerators. FP16/BF16 tensor cores achieve 10x the throughput of FP32.
Memory wall: As models scale, the backward pass's memory requirements dominate. FlashAttention, ZeRO, gradient checkpointing all address the memory wall.

On H100 GPUs, the ratio of compute to memory bandwidth (on the order of $1000$ dense BF16 TFLOPS, roughly double that for FP8, versus $\sim 3.35$ TB/s of HBM bandwidth) means that memory access, not computation, is often the primary bottleneck for backprop at scale. This constraint is why FlashAttention's IO-aware design is so impactful.

Appendix H: Connections to Optimisation and Learning Theory

H.1 What the Gradient Tells Us

The gradient $\nabla_\theta \mathcal{L}$ computed by backpropagation is the direction of steepest ascent in parameter space (by the first-order Taylor expansion). Gradient descent moves in the opposite direction:

\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta)

What the gradient does NOT tell us:

The curvature of the loss landscape (need Hessian for that)
The optimal step size $\eta$
Whether we are near a local minimum, saddle point, or maximum
Whether the gradient is statistically well-estimated (needs large enough batch)

What the gradient DOES tell us:

The direction of maximal increase (used negated for descent)
The sensitivity of the loss to each parameter
Which parameters are "active" (nonzero gradient) vs. saturated (near-zero gradient)

H.2 Gradient Stochasticity

In practice, the true gradient $\nabla_\theta \mathbb{E}[\mathcal{L}]$ over the full data distribution is approximated by the stochastic gradient over a mini-batch:

\hat{g}_B = \frac{1}{B}\sum_{b=1}^B \nabla_\theta \mathcal{L}(\mathbf{x}^{(b)}, \mathbf{y}^{(b)})

This is an unbiased estimator: $\mathbb{E}[\hat{g}_B] = \nabla_\theta \mathbb{E}[\mathcal{L}]$ .

Variance: $\text{Var}(\hat{g}_B) = \text{Var}(\nabla_\theta \mathcal{L})/B$ . Larger batches have lower gradient variance (more accurate gradient estimate) but provide diminishing returns beyond the "critical batch size" (McCandlish et al., 2018).
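The $1/B$ scaling of the gradient variance is easy to observe empirically. The sketch below assumes a synthetic population of per-example gradients for a single parameter (drawn from a normal distribution, which is purely an illustrative choice) and samples mini-batches with replacement.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
# Stand-in per-example gradients of a single parameter (mean 1.0, variance 4.0).
per_example = rng.normal(loc=1.0, scale=2.0, size=N)

def minibatch_grad_variance(B, trials=4000):
    """Empirical variance of the mini-batch mean gradient at batch size B."""
    idx = rng.integers(0, N, size=(trials, B))     # sample with replacement
    return per_example[idx].mean(axis=1).var()

for B in (1, 4, 16, 64):
    print(f"B={B:3d}  empirical {minibatch_grad_variance(B):.4f}"
          f"  predicted {per_example.var() / B:.4f}")
```

Each fourfold increase in batch size cuts the variance by about four, so the standard deviation of the gradient estimate only halves.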

For LLMs: The critical batch size for GPT-3-scale models is approximately $B^*$ of 1 to 4 million tokens. Training near this batch size gives a good trade-off between total compute and training time. Much larger batches waste compute on gradient accuracy that is no longer needed; much smaller batches need many more sequential optimisation steps.

H.3 The Gradient as a Sufficient Statistic

For first-order optimisers (SGD, Adam, AdaGrad, RMSprop), the gradient is the only information extracted from the forward-backward pass. Second-order information (Hessian curvature) is either ignored or approximated.

Why not use the full Hessian? For $|\theta| = 70\text{B}$ parameters, the Hessian is a $70\text{B} \times 70\text{B}$ matrix with $(7 \times 10^{10})^2 \approx 5 \times 10^{21}$ entries. Storing it is impossible: at 4 bytes per FP32 value that is $\approx 2 \times 10^{22}$ bytes, or $\approx 2 \times 10^{13}$ GB. Inverting it is further out of reach still.

Practical second-order methods use approximations:

Diagonal: AdaGrad/Adam maintain per-parameter second-moment estimates of the gradient, used as a diagonal curvature proxy ( $O(|\theta|)$ memory)
Kronecker factored: K-FAC (see 02) uses $A \otimes G$ per layer ( $O(n^2)$ per layer)
Low-rank: PSGD, Shampoo maintain low-rank or block-diagonal approximations
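Newton-Schulz: Muon (2024) uses a few Newton-Schulz iterations to approximately orthogonalise each weight matrix's momentum update, which acts like an inverse-square-root preconditioner without an explicit matrix decomposition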

H.4 Generalisation and the Implicit Gradient Bias

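Stochastic gradient descent does not merely find any minimum. With small mini-batches or a relatively large learning rate (a large ratio $\eta / B$), its gradient noise gives it an implicit bias toward flat minima (large regions with low loss) over sharp minima (narrow valleys); large-batch training loses much of this bias.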

Conjecture (Keskar et al., 2017): Flat minima generalise better because small perturbations to the parameters don't change the loss much - robust to noise in the data.

Mathematical foundation: The SGD noise $\hat{g}_B - g$ effectively adds a regularisation term proportional to $\eta B^{-1} \text{tr}(H)$ - the trace of the Hessian - biasing toward flat (low-trace-Hessian) minima.

This connects gradient computation (the topic of this section) to generalisation theory (a major open question in deep learning theory) - a reminder that understanding backpropagation fully requires understanding not just the mechanics, but the geometry of the loss landscape it navigates.

Appendix I: Practical Implementation Guide

I.1 Implementing Backprop from Scratch

When building a neural network framework from scratch, implement these components in order:

1. Primitive registry:

primitives = {}

def register_primitive(name, forward_fn, backward_fn):
    """Register a primitive op with its VJP."""
    primitives[name] = (forward_fn, backward_fn)

# Example: multiplication primitive
def mul_forward(x, y): return x * y
def mul_backward(x, y, g_out): return g_out * y, g_out * x  # (g_x, g_y)
register_primitive('mul', mul_forward, mul_backward)

2. Value class with gradient tracking:

class Value:
    def __init__(self, data, parents=(), op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # closure capturing parents
        self._parents = parents
        self._op = op
    
    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), 'mul')
        def _backward():
            self.grad += other.data * out.grad   # VJP for self
            other.grad += self.data * out.grad   # VJP for other
        out._backward = _backward
        return out
    
    def backward(self):
        # Topological sort, then reverse
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents: build_topo(p)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo): v._backward()

This is essentially the complete autograd engine from Karpathy's micrograd (2020) - approximately 100 lines implement a working backprop engine.

3. Building blocks: Extend Value with __add__, __pow__, exp, log, relu, softmax - each with its VJP closure.
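A self-contained sketch of that extension step, adding __add__ and relu next to __mul__. It follows the same closure pattern as the class above but is a simplified stand-in (no operator-overload conveniences such as mixing in plain floats), not micrograd's actual source.

```python
class Value:
    """Scalar that records how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(out)/d(self) = d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))
        def _backward():
            # Local derivative is 1 where the input was positive, else 0
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort, then apply each node's VJP in reverse order
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build_topo(p)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

x, w, b = Value(3.0), Value(-2.0), Value(10.0)
y = (x * w + b).relu() + x * x   # x fans out: once in x*w, twice in x*x
y.backward()
print(y.data, x.grad, w.grad, b.grad)
```

Here $y = \text{relu}(xw + b) + x^2$ with the ReLU active, so $\partial y/\partial x = w + 2x$, $\partial y/\partial w = x$, and $\partial y/\partial b = 1$. The `+=` in every closure is what lets the three uses of `x` sum correctly.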

I.2 Common Implementation Bugs

Bug 1: Overwriting instead of accumulating gradients

# Wrong:
self.grad = other.data * out.grad   # erases previous contributions!

# Correct:
self.grad += other.data * out.grad  # accumulates (fan-out nodes)

Bug 2: Forgetting to zero gradients between batches

# Wrong: gradient accumulates across batches
loss = model(x)
loss.backward()
optimizer.step()

# Correct:
optimizer.zero_grad()   # <- must come before backward
loss = model(x)
loss.backward()
optimizer.step()

Bug 3: Not detaching from the graph for inference

# Wrong: builds the graph unnecessarily during inference
prediction = model(x)

# Correct: disable gradient tracking
with torch.no_grad():
    prediction = model(x)

Bug 4: Shape mismatch in weight gradient

# Wrong: for 1-D arrays, @ is an inner product, not an outer product
grad_W = delta @ x    # a scalar if n_out == n_in, a shape error otherwise

# Correct: outer product for single sample
grad_W = np.outer(delta, x)  # (n_out, n_in)

# Correct: batched
grad_W = (1/B) * Delta @ X.T  # (n_out, B) @ (B, n_in) = (n_out, n_in)
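A quick NumPy check that the batched product matches the average of per-sample outer products. This sketch assumes the column-per-sample layout used above, with `X` of shape (n_in, B) and `Delta` of shape (n_out, B); the sizes and random data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, B = 4, 3, 5

X = rng.normal(size=(n_in, B))       # inputs, one column per sample
Delta = rng.normal(size=(n_out, B))  # error signals, one column per sample

# Batched form: a single matrix product, averaged over the batch
grad_W_batched = (1.0 / B) * Delta @ X.T

# Reference: average of per-sample outer products
grad_W_loop = np.zeros((n_out, n_in))
for b in range(B):
    grad_W_loop += np.outer(Delta[:, b], X[:, b])
grad_W_loop /= B

print(grad_W_batched.shape)                      # same shape as W
print(np.allclose(grad_W_batched, grad_W_loop))  # the two forms agree
```

Asserting `grad_W.shape == W.shape` right after computing a weight gradient is a cheap way to catch this class of bug early.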

I.3 Testing Checklist

Before deploying any backprop implementation:

Gradient check passes for all primitive operations (relative error $< 10^{-6}$ )
Full-batch loss decreases monotonically for a small enough learning rate (verify on a toy problem)
Gradients are zero for frozen parameters
Gradient accumulation at fan-out nodes verified (shared weight receives sum)
Shape of each gradient matches shape of corresponding parameter
Memory usage is $O(\text{num\_layers})$ not $O(\text{num\_layers}^2)$
Higher-order gradients work if needed (use create_graph=True in PyTorch)
Mixed precision: FP16 forward, FP32 gradient accumulation, loss scaling in place
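The first checklist item can be sketched as a small gradient checker. The toy loss, its hand-derived gradient, and the function names below are hypothetical examples chosen for illustration; the relative-error formula is one common choice, not the only one.

```python
import math

def numerical_gradient(f, params, eps=1e-5):
    """Centred finite differences: (f(p + eps) - f(p - eps)) / (2 * eps)."""
    grads = []
    for i in range(len(params)):
        plus = list(params)
        plus[i] += eps
        minus = list(params)
        minus[i] -= eps
        grads.append((f(plus) - f(minus)) / (2 * eps))
    return grads

def relative_error(analytic, numeric):
    """||a - n|| / (||a|| + ||n||), guarded against division by zero."""
    diff = math.sqrt(sum((a - n) ** 2 for a, n in zip(analytic, numeric)))
    norm_a = math.sqrt(sum(a * a for a in analytic))
    norm_n = math.sqrt(sum(n * n for n in numeric))
    return diff / max(norm_a + norm_n, 1e-12)

# Hypothetical toy loss: L(w) = 0.5 * (tanh(w0 * x + w1) - y)^2
x, y = 0.7, 0.2

def loss(w):
    return 0.5 * (math.tanh(w[0] * x + w[1]) - y) ** 2

def analytic_gradient(w):
    a = math.tanh(w[0] * x + w[1])
    delta = (a - y) * (1.0 - a * a)   # dL/dz via the chain rule
    return [delta * x, delta]         # [dL/dw0, dL/dw1]

w = [0.3, -0.1]
err = relative_error(analytic_gradient(w), numerical_gradient(loss, w))
print(f"relative error: {err:.2e}")
```

Run in double precision, a correct gradient should land well below the $10^{-6}$ threshold; a swapped index or a missing factor typically produces an error many orders of magnitude larger.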

Appendix J: Connections to Information Theory and Statistics

J.1 Fisher Information and the Natural Gradient

The ordinary gradient $\nabla_\theta \mathcal{L}$ measures the steepest direction in parameter space with respect to the Euclidean metric. But parameter space has a natural metric induced by the probability distribution $p_\theta$ - the Fisher information metric.

Fisher information matrix:

F(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x) \, (\nabla_\theta \log p_\theta(x))^\top\right]

Natural gradient (Amari, 1998):

\tilde{\nabla}_\theta \mathcal{L} = F(\theta)^{-1} \nabla_\theta \mathcal{L}

The natural gradient is the steepest direction in the distributional geometry of the model - invariant to reparametrisation. Computing it exactly requires inverting $F$ , which costs $O(|\theta|^3)$ .
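A small worked example may help make $F$ and $F^{-1}\nabla_\theta \mathcal{L}$ concrete. The sketch below assumes a one-dimensional Gaussian model with parameters $(\mu, s)$ and $\sigma = e^s$, for which the Fisher matrix has the closed form $\text{diag}(1/\sigma^2, 2)$; the gradient vector is a made-up placeholder rather than the gradient of any particular loss.

```python
import numpy as np

def score(x, mu, s):
    """Gradient of log N(x; mu, exp(s)^2) w.r.t. (mu, s), one row per sample."""
    sigma2 = np.exp(2.0 * s)
    d_mu = (x - mu) / sigma2
    d_s = (x - mu) ** 2 / sigma2 - 1.0
    return np.stack([d_mu, d_s], axis=1)

mu, s = 0.5, np.log(3.0)                      # sigma = 3
rng = np.random.default_rng(0)
x = rng.normal(mu, np.exp(s), size=200_000)   # samples from the model itself

scores = score(x, mu, s)
fisher_mc = scores.T @ scores / len(x)             # Monte Carlo estimate of E[s s^T]
fisher_exact = np.diag([np.exp(-2.0 * s), 2.0])    # closed form for this model

grad = np.array([0.2, -0.4])   # hypothetical Euclidean gradient of some loss
# Solve F v = grad instead of forming F^{-1} explicitly
natural_grad = np.linalg.solve(fisher_exact, grad)

print(np.round(fisher_mc, 3))
print(natural_grad)
```

With $\sigma = 3$ the natural gradient scales the $\mu$ component up by $\sigma^2 = 9$ and the $s$ component down by 2: directions in which the distribution changes slowly get larger steps.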

K-FAC (02) approximates $F^{-1}$ as a Kronecker product, making the natural gradient step tractable. It remains one of the more principled second-order optimisers for neural networks.

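For LLMs: The approximation used in practice is Adam's diagonal preconditioner, which uses the running second moment of the gradient as a proxy for the diagonal of the Fisher. This is crude (Adam divides by the square root of that estimate, not by the estimate itself) but works well in practice, so Adam is best read as a loose diagonal relative of the natural gradient step rather than an exact one.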

J.2 Gradient as Score Function

For a probabilistic model $p_\theta(\mathbf{x})$ , the gradient of the log-likelihood is the score function:

s(\mathbf{x}; \theta) = \nabla_\theta \log p_\theta(\mathbf{x})

The score function is the quantity computed by backpropagation during maximum likelihood estimation. Properties:

$\mathbb{E}_{x \sim p_\theta}[s(\mathbf{x};\theta)] = 0$ (score has zero mean)
$\text{Var}_{x \sim p_\theta}[s(\mathbf{x};\theta)] = F(\theta)$ (Fisher information = variance of score)

For language models: The negative log-likelihood $\mathcal{L} = -\log p_\theta(\mathbf{y}|\mathbf{x})$ has gradient $-s(\mathbf{y}|\mathbf{x};\theta)$. Taken with respect to the logits, the score is $\mathbf{e}_y - \mathbf{p}$, so the loss gradient is $-(\mathbf{e}_y - \mathbf{p}) = \mathbf{p} - \mathbf{e}_y$ - the same $\mathbf{p} - \mathbf{e}_y$ formula from 5.3, now understood as the negative score.
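These properties can be verified exactly for a small softmax categorical model by summing over every outcome. The sketch assumes the parameters are the logits themselves, so the score with respect to the logits is $\mathbf{e}_x - \mathbf{p}$; the logit values are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(label, probs):
    """Score w.r.t. the logits for a softmax categorical: e_label - p."""
    return [(1.0 if k == label else 0.0) - p for k, p in enumerate(probs)]

logits = [2.0, -1.0, 0.5]
p = softmax(logits)
K = len(p)

# Exact expectations: sum over every outcome x, weighted by p(x)
mean_score = [sum(p[x] * score(x, p)[k] for x in range(K)) for k in range(K)]
fisher = [[sum(p[x] * score(x, p)[i] * score(x, p)[j] for x in range(K))
           for j in range(K)] for i in range(K)]

# Closed form for this model: F = diag(p) - p p^T
fisher_closed = [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(K)]
                 for i in range(K)]

# Gradient of the negative log-likelihood for an observed label y is -score
y = 0
nll_grad = [-s_k for s_k in score(y, p)]   # equals p - e_y

print(mean_score)
print(nll_grad)
```

Because the mean score is zero, the second moment computed in `fisher` is also the variance of the score, which is the identity stated in the second property.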

J.3 KL Divergence and the Gradient of ELBO

In variational inference and RL (RLHF), we often need gradients of KL divergences. For discrete distributions:

\text{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

\nabla_\phi \text{KL}(p_\theta \| q_\phi) = -\mathbb{E}_{x \sim p_\theta}\left[\nabla_\phi \log q_\phi(x)\right]

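This is computed via backprop through the log-probability $\log q_\phi(x)$, with samples drawn from the fixed distribution $p_\theta$. RLHF's PPO loss includes a closely related KL penalty between the fine-tuned policy $\pi_\phi$ and the reference model $\pi_\text{ref}$. There the trainable policy is the first argument, $\text{KL}(\pi_\phi \| \pi_\text{ref})$, so the sampling distribution itself depends on $\phi$ and the gradient is estimated from samples of $\pi_\phi$ rather than by applying the formula above unchanged.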
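The gradient formula for $\text{KL}(p \| q_\phi)$ with fixed $p$ can be checked against finite differences. This sketch assumes $q_\phi = \text{softmax}(\phi)$, for which $\nabla_\phi \log q_\phi(x) = \mathbf{e}_x - \mathbf{q}$ and the gradient therefore simplifies to $\mathbf{q} - \mathbf{p}$; the numbers are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]      # fixed distribution (does not depend on phi)
phi = [0.1, -0.4, 0.8]   # logits of q_phi
K = len(phi)
q = softmax(phi)

# -E_{x~p}[grad_phi log q_phi(x)], using grad_phi log q_phi(x) = e_x - q
grad_formula = [
    -sum(p[x] * ((1.0 if k == x else 0.0) - q[k]) for x in range(K))
    for k in range(K)
]

# Centred finite differences on KL(p || softmax(phi))
eps = 1e-6
grad_fd = []
for k in range(K):
    up = list(phi)
    up[k] += eps
    down = list(phi)
    down[k] -= eps
    grad_fd.append((kl(p, softmax(up)) - kl(p, softmax(down))) / (2 * eps))

print(grad_formula)
print(grad_fd)
```

Both vectors agree with $\mathbf{q} - \mathbf{p}$, the same "prediction minus target" shape as the cross-entropy gradient.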

References

Rumelhart, Hinton & Williams (1986) - "Learning representations by back-propagating errors." Nature, 323, 533-536. The canonical backpropagation paper.
Linnainmaa, S. (1970) - "The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors." Master's thesis, University of Helsinki. First general reverse-mode AD.
Hochreiter, S. (1991) - "Untersuchungen zu dynamischen neuronalen Netzen." Diploma thesis, TU Munich. First analysis of vanishing gradients.
Glorot, X. & Bengio, Y. (2010) - "Understanding the difficulty of training deep feedforward neural networks." AISTATS. Xavier initialisation.
He, K. et al. (2015) - "Delving Deep into Rectifiers." ICCV. He initialisation for ReLU networks.
He, K. et al. (2016) - "Deep Residual Learning for Image Recognition." CVPR. ResNets and gradient highways.
Ba, J. et al. (2016) - "Layer Normalization." arXiv:1607.06450. LayerNorm for transformers.
Vaswani, A. et al. (2017) - "Attention Is All You Need." NeurIPS. Transformer architecture with attention backward pass.
Amari, S. (1998) - "Natural Gradient Works Efficiently in Learning." Neural Computation. Natural gradient and Fisher information.
Martens, J. & Grosse, R. (2015) - "Optimizing Neural Networks with Kronecker-factored Approximate Curvature." ICML. K-FAC.
Hu, E. et al. (2022) - "LoRA: Low-Rank Adaptation of Large Language Models." ICLR. LoRA backward pass.
Dao, T. et al. (2022) - "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS. IO-aware backward for attention.
Cohen, J. et al. (2022) - "Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability." ICLR. Edge of stability phenomenon.
Dao, T. (2023) - "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." ICLR 2024. FlashAttention-2.
Liu, S. et al. (2024) - "DoRA: Weight-Decomposed Low-Rank Adaptation." DoRA backward analysis.

Appendix K: Summary Tables

K.1 Backpropagation Algorithm Summary

COMPLETE BACKPROPAGATION ALGORITHM

  INPUT: network weights theta, training pair (x, y)

  PHASE 1 - FORWARD PASS

  a^0 = x
  For l = 1, 2, ..., L:
    z^l = W^l a^{l-1} + b^l            (cache z^l and a^{l-1})
    a^l = sigma^l(z^l)                 (cache a^l)
  y_hat = a^L
  Loss = loss(y_hat, y)

  PHASE 2 - BACKWARD PASS

  delta^L = partial Loss / partial z^L     (output layer gradient, layer-specific)
  For l = L-1, L-2, ..., 1:
    delta^l = (W^{l+1})^T delta^{l+1} odot sigma'^l(z^l)     (odot = elementwise product)

  PHASE 3 - GRADIENT ASSEMBLY

  For l = 1, 2, ..., L:
    nabla_{W^l} Loss = delta^l (a^{l-1})^T
    nabla_{b^l} Loss = delta^l

  PHASE 4 - PARAMETER UPDATE

  theta <- theta - eta * nabla_theta Loss     (or Adam/RMSprop update)
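The four phases map onto a short NumPy sketch. It assumes tanh hidden layers, an identity output layer, and the squared loss $\tfrac{1}{2}\|\hat{y} - y\|^2$ (so the output-layer delta is $\hat{y} - y$); the layer sizes, initial weights, and function names are arbitrary choices for illustration.

```python
import numpy as np

def forward(params, x):
    """Phase 1: run the network, caching every activation a^0 .. a^L."""
    activations = [x]
    last = len(params) - 1
    for l, (W, b) in enumerate(params):
        z = W @ activations[-1] + b
        # Assumption: tanh on hidden layers, identity on the output layer
        activations.append(z if l == last else np.tanh(z))
    return activations

def loss_fn(params, x, y):
    y_hat = forward(params, x)[-1]
    return 0.5 * float(np.sum((y_hat - y) ** 2))

def backward(params, activations, y):
    """Phases 2 and 3: propagate delta backward and assemble gradients."""
    grads = [None] * len(params)
    delta = activations[-1] - y          # output delta for squared loss + identity
    for l in reversed(range(len(params))):
        # Gradient for this layer: outer product of delta with the layer input
        grads[l] = (np.outer(delta, activations[l]), delta)
        if l > 0:
            W = params[l][0]
            a_hidden = activations[l]    # tanh output of the layer below
            # (W^T delta) elementwise-times tanh'(z) = 1 - tanh(z)^2
            delta = (W.T @ delta) * (1.0 - a_hidden ** 2)
    return grads

rng = np.random.default_rng(0)
sizes = [2, 3, 3, 1]                     # input, two hidden layers, output
params = [(0.5 * rng.normal(size=(m, n)), 0.1 * rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]
x = np.array([0.5, -1.0])
y = np.array([0.3])

grads = backward(params, forward(params, x), y)

# Phase 4: one plain gradient-descent step
eta = 0.01
new_params = [(W - eta * gW, b - eta * gb)
              for (W, b), (gW, gb) in zip(params, grads)]
print(loss_fn(params, x, y), loss_fn(new_params, x, y))
```

Note that the weight gradient for a layer must be computed before `delta` is overwritten with the delta of the layer below, which is why both happen inside the same reversed loop.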

K.2 Complexity Summary

Operation	Time	Memory
Forward pass (L layers, width n)	$O(Ln^2)$	$O(Ln)$ cached activations
Backward pass	$O(Ln^2)$	$O(Ln)$ error signals
Full Jacobian via FD	$O(|\theta| \cdot T_f)$	-
Full Jacobian via backprop	$O(m \cdot T_f)$	$O(L)$
Hessian-vector product	$O(T_f)$	$O(L)$
Gradient checkpointing	$O(1.33 T_f)$	$O(\sqrt{L})$
FlashAttention forward	$O(T^2 d)$	$O(T)$
FlashAttention backward	$O(T^2 d)$	$O(T)$

K.3 Gradient Flow Interventions

Problem	Diagnosis	Intervention
Vanishing gradients	$\|\boldsymbol{\delta}^{[1]}\| \ll 1$	ReLU/GELU, He init, residual connections
Exploding gradients	$\|\boldsymbol{\delta}^{[1]}\| \gg 1$	Gradient clipping, LR warmup
Dead neurons	$\|\boldsymbol{\delta}^{[l]}\| = 0$ for layer	Leaky ReLU, better init, BN
Slow convergence	$\|\nabla_\theta \mathcal{L}\| \approx 0$ at saddle	Momentum, Adam, noise injection
Oscillating loss	$\|\nabla_\theta \mathcal{L}\|$ spikes	Reduce LR, increase batch
NaN gradients	$\|\nabla_\theta \mathcal{L}\| = \infty$	Loss scaling, check log/softmax
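As one concrete instance of the interventions in the table, gradient clipping by global norm can be sketched in a few lines of NumPy. The function name and the example gradients are hypothetical; frameworks ship their own versions of this utility.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping,
    which is itself a useful diagnostic to log.
    """
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Hypothetical gradients for a weight matrix and a bias: joint norm is 13
grads = [np.array([[3.0, 0.0], [0.0, 4.0]]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))

print(norm_before, norm_after)
```

Clipping the joint norm, rather than each tensor separately, preserves the direction of the overall update and only shortens it.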
