Chapter 5 — Multivariate Calculus
"In machine learning, everything is a function of millions of parameters. Multivariate calculus is the language that tells us how each one matters."
Overview
This chapter extends single-variable calculus to functions of many variables — the setting where all practical machine learning lives. A neural network loss depends on billions of parameters; understanding how it changes with each one, how those changes compose through layers, and where the minimum lies requires the full machinery of multivariate calculus.
The progression runs from partial derivatives and gradients (the basic instruments of measurement in high dimensions), to Jacobians and Hessians (the matrix generalizations that capture interactions between variables), to the multivariate chain rule and backpropagation (how derivatives flow through computation graphs), to optimality conditions (the theory of when you've found a minimum), and finally to automatic differentiation (how modern frameworks compute exact derivatives of arbitrary code).
Every concept here has a direct ML instantiation: the gradient is the update direction for every optimizer; the Jacobian is the object backpropagation pushes through each layer; the Hessian's curvature governs optimizer convergence; KKT conditions underlie support vector machines and constrained training; automatic differentiation is the machinery inside PyTorch and JAX.
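As a concrete taste of the first of those instantiations, here is a minimal sketch, assuming JAX (one of the frameworks named above) and an invented least-squares toy problem: the gradient of a scalar loss with respect to a parameter vector, computed by reverse-mode AD and checked against the closed form.

```python
import jax
import jax.numpy as jnp

# Toy least-squares loss L(w) = ||Xw - y||^2: a scalar function of the
# parameter vector w, the setting described above (data invented).
def loss(w, X, y):
    r = X @ w - y
    return jnp.sum(r ** 2)

X = jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = jnp.array([1.0, 2.0, 3.0])
w = jnp.array([0.5, -0.5])

g = jax.grad(loss)(w, X, y)       # reverse-mode AD (§05) yields ∇L in one pass
g_closed = 2 * X.T @ (X @ w - y)  # analytic gradient 2Xᵀ(Xw − y) for comparison
print(g, g_closed)                # the two agree to floating-point precision
```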
Subsection Map
| # | Subsection | What It Covers | Canonical Topics |
|---|---|---|---|
| 01 | Partial Derivatives and Gradients | Extending differentiation to many-variable functions; the gradient as the direction of steepest ascent | Partial derivatives, gradient vector, directional derivatives, gradient fields, level curves, steepest ascent/descent, gradient in ML (loss landscapes) |
| 02 | Jacobians and Hessians | Matrix generalizations of the derivative for vector-valued functions and second-order curvature | Jacobian matrix, Hessian matrix, symmetry (Clairaut's theorem), positive/negative definiteness, curvature of loss landscapes, Newton's method curvature |
| 03 | Chain Rule and Backpropagation | How derivatives compose through multivariable function chains; the mathematical engine behind neural network training | Multivariate chain rule, computation graphs, forward pass, backward pass, backpropagation algorithm, gradient flow through layers, vanishing/exploding gradients |
| 04 | Optimality Conditions | First- and second-order conditions for local/global minima; constrained optimization via Lagrange multipliers and KKT theory | Critical points, first-order necessary conditions, second-order sufficient conditions, saddle points, Lagrange multipliers, KKT conditions, convex vs non-convex, SVM dual |
| 05 | Automatic Differentiation | Systematic exact differentiation of arbitrary computer programs; the foundation of PyTorch, JAX, and TensorFlow | Forward mode AD, reverse mode AD, dual numbers, Wengert lists/tape, computational complexity, higher-order AD, implementation in modern frameworks |
Reading Order and Dependencies
01-Partial-Derivatives-and-Gradients (foundation: what a derivative means in ℝⁿ)
↓
02-Jacobians-and-Hessians (matrix form: Jacobian = stacked gradients; Hessian = gradient of gradient)
↓
03-Chain-Rule-and-Backpropagation (composition: how gradients flow through function chains)
↓
04-Optimality-Conditions (theory: when does ∇f = 0 guarantee a minimum?)
↓
05-Automatic-Differentiation (practice: how modern frameworks compute all of the above exactly)
↓
08-Optimization (Chapter 8) (algorithms: gradient descent, Adam, Newton, constrained methods)
What Belongs Where — Canonical Homes
This table is the authoritative scoping guide. If a topic has a canonical home, every other section must give at most a 1–2 paragraph preview with a forward/backward reference — never a full treatment.
| Topic | Canonical Home | Preview Only In |
|---|---|---|
| Partial derivatives (definition, computation) | §01 | — |
| Gradient vector | §01 | §02 (as row of Jacobian), §03 (as backprop output) |
| Directional derivative, gradient as steepest ascent | §01 | §04 (gradient condition for optimality) |
| Level curves and contour plots | §01 | §04 (visualizing optimality) |
| Gradient in ML: loss w.r.t. weights | §01 | §03 (as special case of backprop) |
| Jacobian matrix | §02 | §03 (Jacobian of layer = weight matrix) |
| Hessian matrix | §02 | §04 (second-order optimality conditions) |
| Clairaut's theorem (symmetry of mixed partials) | §02 | — |
| Positive definiteness of Hessian | §02 | §04 (sufficient condition for local min) |
| Newton's method curvature | §02 | §04 (Newton step derivation) |
| Multivariate chain rule | §03 | §01 (brief forward ref only) |
| Computation graphs | §03 | §05 (AD implementation uses same graphs) |
| Backpropagation algorithm | §03 | §05 (AD is the systematic implementation) |
| Vanishing/exploding gradients | §03 | — |
| Critical points ($\nabla f = 0$) | §04 | §01 (gradient = 0 as motivation) |
| First-order necessary conditions | §04 | — |
| Second-order sufficient conditions | §04 | §02 (Hessian role previewed) |
| Saddle points, local vs global minima | §04 | §03 (gradient flow around saddle points) |
| Lagrange multipliers | §04 | — |
| KKT conditions | §04 | — |
| Convex vs non-convex landscapes | §04 | §01 (preview only) |
| Forward mode automatic differentiation | §05 | — |
| Reverse mode automatic differentiation | §05 | §03 (conceptual link to backprop) |
| Dual numbers | §05 | — |
| Wengert list / tape | §05 | — |
| PyTorch autograd / JAX grad | §05 | §03 (brief forward reference) |
| Higher-order AD (Hessian-vector products) | §05 | §02 (Hessian computation preview) |
Overlap Danger Zones
These are the most commonly duplicated topics across sections.
1. Gradient ↔ Jacobian
- §01 defines the gradient for scalar-valued $f: \mathbb{R}^n \to \mathbb{R}$.
- §02 defines the Jacobian for vector-valued $f: \mathbb{R}^n \to \mathbb{R}^m$. The gradient is the special case $m = 1$ (a row of the Jacobian). §02 should backward-reference §01 rather than re-derive the gradient. (Illustrated in the first sketch after this list.)
2. Backpropagation ↔ Chain Rule ↔ Automatic Differentiation
- §03 derives the multivariate chain rule and applies it to neural network layers, giving the backpropagation algorithm as a direct consequence.
- §05 shows how automatic differentiation implements backpropagation systematically using computation graphs and a tape. §05 should backward-reference §03 for the mathematical derivation and focus on the algorithmic/implementation perspective.
- Do NOT re-derive the chain rule in §05, and do NOT describe the tape/dual-number mechanism in §03.
3. Hessian ↔ Optimality Conditions
- §02 defines the Hessian, proves Clairaut's theorem, and connects Hessian definiteness to curvature.
- §04 uses the Hessian as a tool for classifying critical points (positive definite → local minimum); the second sketch after this list shows this classification in code. §04 must not re-derive the Hessian — only apply it.
4. Optimality Conditions ↔ Optimization Algorithms
- §04 covers the theory: necessary and sufficient conditions, Lagrange multipliers, KKT. It does NOT cover specific algorithms (gradient descent, Adam, Newton's method with step rules).
- Chapter 8 (Optimization) covers the algorithms. §04 may preview that gradient descent follows $-\nabla f$ but must not describe convergence rates, learning rate schedules, or adaptive methods.
5. Single-variable chain rule ↔ Multivariate chain rule
- Chapter 4, §02 (Derivatives and Differentiation) derives the single-variable chain rule and applies it to backpropagation through scalar computations.
- §03 of this chapter owns the full multivariate chain rule and the general Jacobian chain rule $J_{f \circ g}(\mathbf{x}) = J_f(g(\mathbf{x}))\, J_g(\mathbf{x})$, checked numerically in the third sketch below.
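To make danger zone 1 concrete, here is a small sketch, assuming JAX and two invented toy functions: for scalar-valued f the framework's Jacobian coincides with the gradient, while vector-valued g yields one gradient row per output.

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(x ** 2)                            # f: R^3 -> R
g = lambda x: jnp.array([jnp.sum(x ** 2), x[0] * x[1]])  # g: R^3 -> R^2

x = jnp.array([1.0, 2.0, 3.0])
print(jax.grad(f)(x))      # shape (3,): the gradient of §01
print(jax.jacobian(f)(x))  # identical values: the m = 1 Jacobian of §02
print(jax.jacobian(g)(x))  # shape (2, 3): one gradient per output component
```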
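For danger zone 3, a sketch of how §04 applies the §02 object, assuming JAX and a textbook saddle function (invented for illustration): evaluate the Hessian at a critical point and classify it by the signs of its eigenvalues.

```python
import jax
import jax.numpy as jnp

# f(p) = p0^2 - p1^2 has a critical point at the origin (∇f = 0 there),
# but its Hessian has eigenvalues of both signs: a saddle, not a minimum.
f = lambda p: p[0] ** 2 - p[1] ** 2

p0 = jnp.zeros(2)
print(jax.grad(f)(p0))         # [0., 0.]: the first-order condition holds
H = jax.hessian(f)(p0)         # defined in §02, applied as in §04
print(jnp.linalg.eigvalsh(H))  # [-2., 2.]: indefinite, hence a saddle point
```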
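And for danger zone 5, a numerical check of the Jacobian chain rule above, again assuming JAX, with invented maps g: ℝ² → ℝ² and f: ℝ² → ℝ³.

```python
import jax
import jax.numpy as jnp

g = lambda x: jnp.array([x[0] * x[1], jnp.sin(x[1])])
f = lambda u: jnp.array([u[0] + u[1], u[0] * u[1], u[1] ** 2])

x = jnp.array([0.7, 1.3])
J_comp = jax.jacobian(lambda x: f(g(x)))(x)          # Jacobian of f ∘ g
J_prod = jax.jacobian(f)(g(x)) @ jax.jacobian(g)(x)  # J_f(g(x)) J_g(x)
print(jnp.allclose(J_comp, J_prod))                  # True: the identity holds
```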
Forward and Backward Reference Format
Forward reference (full treatment is later):
> **Preview: Backpropagation**
> The multivariate chain rule, applied to computation graphs, yields backpropagation:
> derivatives propagate backward layer by layer via $\delta^{(l)} = (W^{(l+1)})^\top \delta^{(l+1)} \odot \sigma'(z^{(l)})$.
>
> → _Full treatment: [Chain Rule and Backpropagation](../03-Chain-Rule-and-Backpropagation/notes.md)_
Backward reference (builds on earlier section):
> **Recall:** The gradient $\nabla f(\mathbf{x}) \in \mathbb{R}^n$ was defined in
> [Partial Derivatives and Gradients](../01-Partial-Derivatives-and-Gradients/notes.md)
> as the vector of partial derivatives. The Hessian stacks the gradients of each
> $\partial f / \partial x_i$ to form a matrix.
Key Cross-Chapter Dependencies
From Chapter 4 — Calculus Fundamentals:
- Single-variable chain rule (§02) → multivariate chain rule in §03
- Single-variable critical points (§02) → optimality conditions in §04
- Taylor series (§04) → multivariate Taylor expansion (Hessian appears as the second-order term) in §02 and §04
From Chapter 3 — Advanced Linear Algebra:
- Matrix–vector products → Jacobian–vector products in §02 and §03
- Positive definite matrices (§07) → Hessian definiteness in §02 and §04
- Eigenvalues (§01) → curvature analysis (eigenvalues of the Hessian) in §02
Into Chapter 8 — Optimization:
- §01 (gradient) → gradient descent direction
- §02 (Hessian) → Newton's method, Gauss-Newton, curvature preconditioning
- §03 (backpropagation) → efficient gradient computation for all optimizers
- §04 (KKT conditions) → constrained optimization algorithms, SVM training
- §05 (automatic differentiation) → all gradient-based learning in practice
Into Chapter 6 — Probability Theory:
- §01 (gradients) → score functions, Fisher information matrix
- §04 (Lagrange multipliers) → maximum entropy distributions
ML Concept Map
| ML Concept | Multivariate Calculus Foundation | Section |
|---|---|---|
| Gradient descent parameter update | Gradient of scalar loss w.r.t. parameter vector | §01 |
| Backpropagation through a dense layer | Multivariate chain rule: $\delta^{(l)} = (W^{(l+1)})^\top \delta^{(l+1)} \odot \sigma'(z^{(l)})$ | §03 |
| Jacobian of a layer's output w.r.t. input | Jacobian matrix of vector-valued function | §02 |
| Second-order optimizer (Newton, K-FAC) | Hessian matrix and its inverse | §02 |
| Saddle points in loss landscapes | Second-order optimality conditions | §04 |
| LoRA / low-rank adaptation | Jacobian rank structure of weight updates | §02 |
| Support vector machines (hard/soft margin) | Lagrange multipliers + KKT conditions | §04 |
| Batch normalization gradients | Chain rule through normalization layer | §03 |
| Mixed-precision training (gradient scaling) | Jacobian condition number, numerical stability | §02 |
| PyTorch .backward(), JAX grad() | Reverse-mode automatic differentiation | §05 |
| Gradient checkpointing | Memory–compute trade-off in reverse-mode AD | §05 |
| Hessian-vector products (HVP) | Forward-over-reverse AD (sketch after this table) | §05 |
| Fisher information matrix | Negative expected Hessian of the log-likelihood; natural gradient | §02, §04 |
| Attention score gradient | Chain rule through softmax + dot-product | §03 |
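Since "forward-over-reverse" in the HVP row is terse, here is a minimal sketch, assuming JAX and an invented test function: push a tangent vector v through the gradient function with jax.jvp, so the Hessian-vector product is computed without ever materializing the Hessian.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 3) + x[0] * x[1]  # any smooth scalar function works

def hvp(f, x, v):
    # Forward mode over reverse mode: differentiate grad(f) along direction v.
    # Costs a small constant multiple of one gradient; no n×n Hessian is built.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, 0.0])
print(hvp(f, x, v))           # Hessian-vector product, matrix-free
print(jax.hessian(f)(x) @ v)  # dense check for small n: should match
```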
Prerequisites
Before starting this chapter, ensure you are comfortable with:
- Single-variable derivatives: chain rule, product rule, implicit differentiation — §02-Derivatives-and-Differentiation
- Taylor series: first- and second-order approximations — §04-Series-and-Sequences
- Matrix–vector multiplication: $A\mathbf{x}$, transpose, dot products — §02-Linear-Algebra-Basics
- Positive definite matrices: definiteness, eigenvalue sign conditions — §07-Positive-Definite-Matrices
← Previous Chapter: Calculus Fundamentals | Next Chapter: Probability Theory →