A precise mapping from mathematical concepts to their concrete roles in modern
ML systems. Each entry cites the specific model, paper, or algorithm where the
mathematics is load-bearing — not merely present.
How to Read This Document
Each section identifies a mathematical domain, lists its core concepts, and maps
each concept to: (a) the ML context, (b) the specific operation or formula, and
(c) a concrete 2024–2026 example. This is a research-grade reference, not a survey.
1. Linear Algebra → ML
1.1 Matrix Multiplication

| Math concept | ML operation | Role / cost | Example |
|---|---|---|---|
| $\mathbf{y} = W\mathbf{x} + \mathbf{b}$ | Fully-connected layer | Forward pass | Every dense layer in every neural network |
| $\text{Attention} = \text{softmax}(QK^\top/\sqrt{d_k})V$ | Self-attention | $O(n^2 d)$ | Transformers: GPT-4, LLaMA-3, Gemini |
| $H^{(l+1)} = \sigma(\hat{A} H^{(l)} W^{(l)})$ | Graph convolution | Message passing | GCN (Kipf & Welling, 2017) |

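
As a minimal sketch of the attention row above, the following pure-Python function evaluates $\text{softmax}(QK^\top/\sqrt{d_k})V$ on small lists of row vectors. The function names and the toy matrices are illustrative assumptions, not taken from any particular library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q is n x d_k, K is m x d_k, V is m x d_v (lists of rows).
    Returns (output, weights); the n x m score matrix is the O(n^2 d) term.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    weights = []
    for q in Q:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    output = [
        [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
        for w in weights
    ]
    return output, weights

# Toy inputs (assumed for illustration): identical keys give uniform weights.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[2.0, 0.0], [0.0, 4.0]]
out, w = attention(Q, K, V)
```

Because both keys are identical here, every query attends equally to both values, so each output row is the average of the rows of `V`.
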
1.2 Eigendecomposition

| Concept | ML role | Where |
|---|---|---|
| $A\mathbf{v} = \lambda \mathbf{v}$ | Stability of RNN hidden state | Spectral radius $\rho(W_h) < 1$ rules out exploding gradients in the linearised recurrence (long-range gradients then decay instead) |
| Top eigenvector of $X^\top X$ (centred $X$) | First principal component | PCA for dimensionality reduction |
| Eigenvalues of graph Laplacian $L$ | Spectral graph convolution | ChebNet, spectral GNNs |
| Hessian spectrum $\nabla^2 \mathcal{L}$ | Loss landscape sharpness | Sharpness-aware minimisation (Foret et al., 2021) |
| Neural Tangent Kernel eigenvalues | Training speed of wide networks | NTK theory (Jacot et al., 2018) |

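
The PCA row can be illustrated with power iteration, one simple way to find the top eigenvector of $X^\top X$. This is a sketch under stated assumptions: the data matrix is a made-up toy example, and production code would use a library eigensolver or SVD instead.

```python
import math

def centre(X):
    """Subtract the column means from a data matrix (samples as rows)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def gram(X):
    """Return X^T X for a data matrix X with samples as rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def power_iteration(A, iters=200):
    """Dominant eigenpair of a symmetric positive semi-definite matrix."""
    n = len(A)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = matvec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(vi * wi for vi, wi in zip(v, matvec(A, v)))  # Rayleigh quotient
    return lam, v

# Toy data (assumed): points scattered closely around the direction (1, 1).
X = [[2.0, 1.9], [-2.0, -2.1], [1.0, 1.1], [-1.0, -0.9]]
G = gram(centre(X))
lam, pc1 = power_iteration(G)
```

The returned `pc1` is the first principal direction, here close to $(1, 1)/\sqrt{2}$, and `lam` is the variance captured along it (up to the $1/n$ normalisation).
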
1.3 SVD

| Concept | ML role | Where |
|---|---|---|
| $A = U\Sigma V^\top$ | Low-rank weight decomposition | LoRA (Hu et al., 2022): $\Delta W = BA$ |
| Eckart-Young theorem | Optimal rank-$k$ approximation | Matrix factorisation for recommenders |
| Pseudoinverse $A^\dagger$ | Least-squares solution | Normal equations; ridge regression |
| Singular value spectrum | Weight matrix health | WeightWatcher (Martin & Mahoney, 2021) |
| Randomised SVD | Scalable approximation | Halko, Martinsson & Tropp (2011) |

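
To make the LoRA row concrete, the sketch below builds a rank-1 update $\Delta W = BA$ and compares parameter counts for a full update against a low-rank one. The dimensions ($d = k = 4096$, $r = 8$) and helper names are illustrative assumptions, not values from a specific model.

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_param_counts(d, k, r):
    """Parameters in a full d x k update versus B (d x r) plus A (r x k)."""
    full = d * k
    low_rank = r * (d + k)
    return full, low_rank

# Rank-1 example: every row of delta_W is a multiple of the single row of A.
B = [[1.0], [2.0], [3.0]]        # d x r with d = 3, r = 1
A = [[1.0, 0.0, -1.0, 2.0]]      # r x k with k = 4
delta_W = matmul(B, A)

# Hypothetical layer size chosen only to show the scale of the saving.
full, low_rank = lora_param_counts(4096, 4096, 8)
```

For these assumed sizes the low-rank factors hold 65,536 parameters against 16,777,216 for the full matrix, a 256-fold reduction.
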
1.4 Norms and Regularisation

| Norm | Regulariser | Effect | Used in |
|---|---|---|---|
| $\lVert \boldsymbol{\theta} \rVert_2^2$ | L2 / weight decay | Penalises large weights | AdamW (as decoupled weight decay); most modern LLMs |
| $\lVert \boldsymbol{\theta} \rVert_1$ | L1 / Lasso | Induces sparsity | Sparse fine-tuning |
| $\lVert W \rVert_2 = \sigma_{\max}(W)$ | Spectral normalisation | Lipschitz constraint | GANs (Miyato et al., 2018) |
| $\lVert A \rVert_*$ (nuclear) | Nuclear norm | Low-rank inductive bias | Matrix completion |

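
One way to see why L1 induces sparsity while L2 only shrinks is to compare their proximal operators. The sketch below assumes the penalties $\lambda\lVert\boldsymbol{\theta}\rVert_1$ and $\tfrac{\lambda}{2}\lVert\boldsymbol{\theta}\rVert_2^2$; the example vector is made up.

```python
import math

def prox_l1(theta, lam):
    """Soft-thresholding: argmin_x 0.5*(x - t)^2 + lam*|x|, element-wise."""
    return [math.copysign(max(abs(t) - lam, 0.0), t) for t in theta]

def prox_l2(theta, lam):
    """Shrinkage: argmin_x 0.5*(x - t)^2 + 0.5*lam*x^2, element-wise."""
    return [t / (1.0 + lam) for t in theta]

theta = [3.0, -0.2, 0.05, -1.5]
sparse = prox_l1(theta, 0.5)   # small entries are set exactly to zero
shrunk = prox_l2(theta, 0.5)   # every entry is scaled, none becomes zero
```

Entries with magnitude below the threshold vanish under the L1 operator, whereas the L2 operator rescales every entry by the same factor.
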
2. Calculus → ML
2.1 Chain Rule = Backpropagation
The chain rule is not merely used in backpropagation — it is backpropagation.
$$\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[L]}} \cdot \prod_{k=l+1}^{L} \frac{\partial \mathbf{a}^{[k]}}{\partial \mathbf{a}^{[k-1]}} \cdot \frac{\partial \mathbf{a}^{[l]}}{\partial W^{[l]}}$$
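Every automatic differentiation framework (PyTorch, JAX, TensorFlow) implements
this identity through reverse-mode automatic differentiation, which evaluates the
product numerically from the loss backwards rather than manipulating symbolic expressions.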
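
A hand-written backward pass makes the identity tangible. The sketch below assumes a tiny scalar network, $a_1 = \tanh(w_1 x)$, $a_2 = w_2 a_1$, $\mathcal{L} = \tfrac{1}{2}(a_2 - y)^2$, chosen only for illustration, and checks the chain-rule gradients against central finite differences.

```python
import math

def forward(w1, w2, x, y):
    """Return the loss and the intermediate activations."""
    a1 = math.tanh(w1 * x)
    a2 = w2 * a1
    loss = 0.5 * (a2 - y) ** 2
    return loss, a1, a2

def backward(w1, w2, x, y):
    """Apply the chain rule from the loss back to each weight."""
    _, a1, a2 = forward(w1, w2, x, y)
    dL_da2 = a2 - y                       # dL/da2
    dL_dw2 = dL_da2 * a1                  # da2/dw2 = a1
    dL_da1 = dL_da2 * w2                  # da2/da1 = w2
    dL_dw1 = dL_da1 * (1.0 - a1 ** 2) * x # tanh'(u) = 1 - tanh(u)^2, du/dw1 = x
    return dL_dw1, dL_dw2

def finite_difference(w1, w2, x, y, eps=1e-6):
    """Central-difference estimate of the same two gradients."""
    g1 = (forward(w1 + eps, w2, x, y)[0] - forward(w1 - eps, w2, x, y)[0]) / (2 * eps)
    g2 = (forward(w1, w2 + eps, x, y)[0] - forward(w1, w2 - eps, x, y)[0]) / (2 * eps)
    return g1, g2

analytic = backward(0.7, -1.3, 0.5, 0.2)
numeric = finite_difference(0.7, -1.3, 0.5, 0.2)
```

The two gradient pairs should agree to within finite-difference error, which is the same check commonly used to validate custom backward passes.
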
2.2 Key Derivatives in Practice

| Function | Derivative | Why it matters |
|---|---|---|
| $\sigma(x) = 1/(1+e^{-x})$ | $\sigma(x)(1-\sigma(x))$ | LSTM gate updates; vanishes for large $\lvert x \rvert$ |
| $\tanh(x)$ | $1 - \tanh^2(x)$ | RNN hidden states; range $(-1,1)$ |
| $\text{ReLU}(x) = \max(0,x)$ | $\mathbb{1}[x > 0]$ | Avoids vanishing gradient for $x > 0$ |
| $\text{GELU}(x) = x\Phi(x)$ | $\Phi(x) + x\phi(x)$ | GPT-2, BERT, modern transformers |
| $s_i = \text{softmax}(\mathbf{z})_i$ | $\partial s_i/\partial z_j = s_i(\delta_{ij} - s_j)$ | Cross-entropy gradient; compute via log-sum-exp for numerical stability |

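
The softmax row is where numerical care matters most. The following sketch shows the log-sum-exp trick and the resulting cross-entropy gradient $\mathbf{s} - \mathbf{e}_{\text{target}}$, which follows from the Jacobian above. The logits are deliberately large, an assumed stress case in which a naive `math.exp(1000.0)` would overflow.

```python
import math

def log_softmax(z):
    """log softmax(z) computed with the log-sum-exp shift by max(z)."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def cross_entropy(z, target):
    """Negative log-probability of the target class."""
    return -log_softmax(z)[target]

def cross_entropy_grad(z, target):
    """Gradient with respect to the logits: softmax(z) minus one-hot(target)."""
    s = [math.exp(v) for v in log_softmax(z)]
    return [si - (1.0 if i == target else 0.0) for i, si in enumerate(s)]

z = [1000.0, 1001.0, 1002.0]
loss = cross_entropy(z, target=2)
grad = cross_entropy_grad(z, target=2)
```

The gradient entries sum to zero because the softmax probabilities sum to one, and only the target entry is negative.
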
2.3 Gradient Flow and Vanishing/Exploding
For a depth-$L$ network: $\frac{\partial \mathcal{L}}{\partial W^{[1]}} \propto \prod_{l=2}^{L} W^{[l]} \cdot \sigma'(\mathbf{z}^{[l]})$.

| Condition | Effect | Fix |
|---|---|---|
| $\lVert W \sigma' \rVert < 1$ repeated | Vanishing gradient | Residual connections, LSTM gates, ReLU |
| $\lVert W \sigma' \rVert > 1$ repeated | Exploding gradient | Gradient clipping $\lVert g \rVert \le c$ |

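
The two rows can be made quantitative with a scalar caricature: if each layer scales the gradient by a fixed factor, the total scale is that factor raised to the depth. The sketch below, with assumed factors 0.9 and 1.1 and depth 50, also includes clipping by global norm.

```python
import math

def gradient_scale(factor, depth):
    """Scale of a gradient after passing through `depth` identical layers."""
    return factor ** depth

def clip_by_norm(g, c):
    """Rescale g so that its Euclidean norm is at most c, keeping its direction."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= c:
        return list(g)
    return [x * c / norm for x in g]

vanishing = gradient_scale(0.9, 50)   # roughly 0.005
exploding = gradient_scale(1.1, 50)   # roughly 117
clipped = clip_by_norm([3.0, 4.0], 1.0)
```

Clipping leaves the direction of the gradient unchanged and only caps its length, which is why it addresses explosion but does nothing for vanishing gradients.
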
2.4 Second-Order Methods

| Concept | Formula | ML application |
|---|---|---|
| Hessian $H = \nabla^2 \mathcal{L}$ | $(H)_{ij} = \partial^2 \mathcal{L}/\partial\theta_i\partial\theta_j$ | Newton's method; Fisher information matrix |
| Gauss-Newton approximation | $H \approx J^\top J$ | K-FAC (Martens & Grosse, 2015) |
| Sharpness $\lambda_{\max}(H)$ | Largest Hessian eigenvalue | SAM (Foret et al., 2021); flatter minima are associated with better generalisation |

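
On a quadratic loss $\mathcal{L}(\boldsymbol{\theta}) = \tfrac{1}{2}\boldsymbol{\theta}^\top H \boldsymbol{\theta} - \mathbf{b}^\top\boldsymbol{\theta}$ both ideas are exact: one Newton step lands on the minimiser, and plain gradient descent converges only when the learning rate is below $2/\lambda_{\max}(H)$. The sketch below uses an assumed $2 \times 2$ example; the matrix, vector, and learning rates are illustrative.

```python
import math

def grad(H, b, theta):
    """Gradient H @ theta - b of the quadratic loss."""
    return [sum(h * t for h, t in zip(row, theta)) - bi for row, bi in zip(H, b)]

def solve_2x2(H, g):
    """Return H^{-1} g for a 2x2 matrix via the explicit inverse."""
    (a, b_), (c, d) = H
    det = a * d - b_ * c
    return [(d * g[0] - b_ * g[1]) / det, (-c * g[0] + a * g[1]) / det]

def newton_step(H, b, theta):
    step = solve_2x2(H, grad(H, b, theta))
    return [t - s for t, s in zip(theta, step)]

def gradient_descent(H, b, theta, lr, steps):
    for _ in range(steps):
        g = grad(H, b, theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

H = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
theta0 = [5.0, -3.0]
lam_max = (5.0 + math.sqrt(5.0)) / 2.0   # largest eigenvalue of this H

theta_newton = newton_step(H, b, theta0)                         # minimiser [0.2, 0.4]
theta_stable = gradient_descent(H, b, theta0, lr=0.5, steps=100)   # lr < 2/lam_max
theta_unstable = gradient_descent(H, b, theta0, lr=0.6, steps=100) # lr > 2/lam_max
```

Here $2/\lambda_{\max} \approx 0.553$, so the run with learning rate 0.5 converges and the run with 0.6 diverges, which is the sense in which sharpness limits the usable step size.
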
3. Probability → ML
3.1 Probabilistic View of Supervised Learning

| Framework | Objective | Equivalent to |
|---|---|---|
| MLE | $\max \sum \log p(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta})$ | Minimise cross-entropy loss |
| MAP | $\max \sum \log p(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})$ | L2 reg. (Gaussian prior), L1 reg. (Laplace prior) |
| ELBO | $\mathbb{E}_{q}[\log p(\mathbf{x} \mid \mathbf{z})] - D_\text{KL}(q(\mathbf{z}\mid\mathbf{x}) \,\Vert\, p(\mathbf{z}))$ | VAE training objective (Kingma & Welling, 2014) |

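
For the common VAE choice of a diagonal Gaussian posterior and a standard normal prior, the KL term of the ELBO has a closed form per latent dimension: $\tfrac{1}{2}(\mu^2 + \sigma^2 - 1 - \log\sigma^2)$. The sketch below checks that formula against a midpoint-rule integral; the integration range and step count are assumptions chosen for this toy check.

```python
import math

def kl_gaussian_to_standard(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for one dimension."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - math.log(sigma ** 2))

def kl_numeric(mu, sigma, lo=-12.0, hi=12.0, n=20000):
    """Midpoint-rule estimate of the integral of q(x) * log(q(x) / p(x))."""
    h = (hi - lo) / n
    log_norm = 0.5 * math.log(2.0 * math.pi)
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        log_q = -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - log_norm
        log_p = -0.5 * x ** 2 - log_norm
        total += math.exp(log_q) * (log_q - log_p) * h
    return total

closed = kl_gaussian_to_standard(1.0, 0.5)
numeric = kl_numeric(1.0, 0.5)
```

The term is zero exactly when $\mu = 0$ and $\sigma = 1$, which is why it acts as a regulariser pulling the approximate posterior towards the prior.
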
3.2 Distributions in Active Use (2026)

| Distribution | Where | Formula |
|---|---|---|
| $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ | Weight init (He/Xavier), VAE latent, diffusion | $p(\mathbf{x}) \propto \exp(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}))$ |
| Categorical / softmax | LLM output token distribution | $p(x_i) = e^{z_i}/\sum_j e^{z_j}$ |
| Bernoulli | Dropout mask, binary classification | $P(X=1) = p$ |
| Dirichlet | Topic models, LDA, concentration prior | Conjugate to Categorical |

3.3 Bayes' Theorem in ML
$$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\mathcal{D})}$$
- Likelihood $p(\mathcal{D} \mid \boldsymbol{\theta})$: its negative log is the loss function
- Prior $p(\boldsymbol{\theta})$: its negative log is the regulariser
- Posterior $p(\boldsymbol{\theta} \mid \mathcal{D})$: what we actually want
- Evidence $p(\mathcal{D})$: usually intractable; lower-bounded by the ELBO, approximated by Laplace's method, or sidestepped by MCMC
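
A conjugate pair is the one setting where all four terms are available in closed form. The sketch below assumes a Beta prior on a Bernoulli parameter, the two-class case of the Dirichlet-Categorical pair; the prior pseudo-counts and the data are made up for illustration.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def posterior(a, b, heads, tails):
    """Beta(a, b) prior plus Bernoulli counts gives Beta(a + heads, b + tails)."""
    return a + heads, b + tails

def beta_mode(a, b):
    """Mode of Beta(a, b); valid for a > 1 and b > 1."""
    return (a - 1.0) / (a + b - 2.0)

def log_evidence(a, b, heads, tails):
    """log p(D) for one specific sequence with the given counts."""
    return log_beta(a + heads, b + tails) - log_beta(a, b)

a_post, b_post = posterior(2.0, 2.0, heads=7, tails=3)
mle = 7 / 10
map_estimate = beta_mode(a_post, b_post)   # 8/12, pulled towards the prior mode 0.5
```

The MAP estimate sits between the prior mode and the maximum-likelihood estimate, which is the regularising effect of the prior in miniature.
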
4. Information Theory → ML

| Concept | Formula | ML role |
|---|---|---|
| Cross-entropy | $H(p,q) = -\sum p \log q$ | The classification loss; next-token training objective for LLMs |
| KL divergence | $D_\text{KL}(p \,\Vert\, q) = \sum p \log(p/q)$ | VAE regulariser; knowledge distillation; RLHF KL penalty |
| Mutual information | $I(X;Y) = H(X) - H(X\mid Y)$ | InfoNCE loss; contrastive learning (SimCLR, CLIP) |
| Perplexity | $\exp(-\frac{1}{T}\sum_t \log p(x_t\mid x_{<t}))$ | LLM evaluation; lower = better language model |
| Bits-back coding | $-\mathbb{E}_q[\log p(\mathbf{x}\mid\mathbf{z})] + D_\text{KL}(q \,\Vert\, p)$ | VAE ELBO reinterpreted as compression |

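
These quantities are linked by the identity $H(p,q) = H(p) + D_\text{KL}(p \,\Vert\, q)$, and perplexity is the exponential of the average per-token cross-entropy. The sketch below checks both on made-up distributions.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

p = [0.7, 0.2, 0.1]   # assumed "true" distribution
q = [0.5, 0.3, 0.2]   # assumed model distribution
gap = cross_entropy(p, q) - entropy(p)   # equals KL(p || q)

# A model that assigns probability 1/50 to every token has perplexity 50.
uniform_ppl = perplexity([1.0 / 50.0] * 10)
```

Minimising cross-entropy against fixed data therefore minimises the KL divergence to the data distribution, since $H(p)$ does not depend on the model.
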
5. Optimisation → ML
5.1 Gradient Descent Variants

| Algorithm | Update rule | Used in |
|---|---|---|
| SGD + momentum | $\mathbf{v}_t = \beta\mathbf{v}_{t-1} + \nabla\mathcal{L}$; $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\mathbf{v}_t$ | Vision models |
| Adam (Kingma & Ba, 2015) | Adaptive per-parameter $\eta$; bias-corrected moments | Default for LLMs |
| AdamW (Loshchilov & Hutter, 2019) | Adam + decoupled weight decay | GPT-3, LLaMA; most published LLM training recipes |
| Muon (2024) | Nesterov momentum update, approximately orthogonalised by Newton-Schulz iteration | NanoGPT speedruns; Moonlight (Moonshot AI, 2025) |
| SOAP (2024) | Adam run in the eigenbasis of Shampoo's preconditioner | Reported to need fewer steps than AdamW in LLM pretraining experiments |

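
The first three rows can be written out in a few lines. This is a minimal sketch on plain Python lists, assuming the common textbook form of each update; the hyperparameter defaults are typical values rather than settings from any specific model.

```python
import math

def sgd_momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """v_t = beta * v_{t-1} + grad; theta <- theta - lr * v_t."""
    v = [beta * vi + gi for vi, gi in zip(v, grad)]
    theta = [ti - lr * vi for ti, vi in zip(theta, v)]
    return theta, v

def adamw_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One Adam step (t starts at 1); weight_decay > 0 gives decoupled AdamW."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]   # bias correction
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [
        ti - lr * (mh / (math.sqrt(vh) + eps) + weight_decay * ti)
        for ti, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v

# Gradients that differ by four orders of magnitude (assumed toy values).
theta, m, v = adamw_step([1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [100.0, -0.01], t=1)
theta_sgd, v_sgd = sgd_momentum_step([1.0], [0.0], [2.0])
```

After the first Adam step both parameters move by almost exactly `lr` despite the difference in gradient scale, which is what "adaptive per-parameter" step size means in practice. With `weight_decay > 0` the decay term is applied to the weights directly instead of being folded into the gradient.
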
5.2 Learning Rate Schedules

| Schedule | Formula | Used in |
|---|---|---|
| Cosine annealing | $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})(1+\cos\frac{\pi t}{T})$ | GPT, LLaMA pretraining |
| Linear warmup | $\eta_t = \eta_{\max} \cdot t / T_\text{warm}$ | Most large models (typically the first few thousand steps) |
| WSD (warmup-stable-decay) | Warmup, long constant phase, then a short final decay | MiniCPM (Hu et al., 2024) |

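
Warmup and cosine annealing are usually chained. The sketch below combines the two formulas from the table, restarting the cosine clock at the end of warmup; the step counts and learning rates in the example are assumptions for illustration.

```python
import math

def lr_at(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup to lr_max, then cosine decay to lr_min at total_steps."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 10,000 steps, 1,000 of them warmup.
schedule = [lr_at(s, 10_000, 1_000, lr_max=3e-4, lr_min=3e-5) for s in range(10_001)]
```

The schedule starts at zero, peaks at `lr_max` when warmup ends, and reaches `lr_min` on the final step.
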
6. Curriculum Map — Math to Model
This table maps each repository chapter to the specific models and papers
that use it as load-bearing mathematics.

| Chapter | Core math | Primary models / papers |
|---|---|---|
| 02 Linear Algebra Basics | Matrix ops, rank, projections | Every neural network |
| 03 Advanced Linear Algebra | SVD, eigenvalues | LoRA, PCA, WeightWatcher |
| 04 Calculus Fundamentals | Derivatives, chain rule | Backpropagation (Rumelhart et al., 1986) |
| 05 Multivariate Calculus | Jacobian, Hessian | Adam, K-FAC, SAM |
| 06 Probability Theory | Distributions, Bayes | VAE, DDPM, Bayesian deep learning |
| 07 Statistics | MLE, MAP, hypothesis tests | Training objectives, model selection |
| 08 Optimisation | Convexity, GD, constraints | All training algorithms |
| 09 Information Theory | Entropy, KL, MI | Cross-entropy loss, RLHF, contrastive learning |
| 10 Numerical Methods | Condition number, stability | Mixed precision, numerical autograd |
| 11 Graph Theory | Laplacian, random walks | GCN, GAT, Node2Vec |
| 12 Functional Analysis | Hilbert spaces, RKHS | SVMs, kernel methods, NTK theory |
| 13 ML-Specific Math | Attention math, normalisation | Transformers (Vaswani et al., 2017) |
| 14 Math for Specific Models | RNN/LSTM, CNN, GAN | Sequence models, generative models |

This map is updated with each new section added to the curriculum.
For standard references on mathematics for ML, see:
Goodfellow, Bengio & Courville (2016); Bishop (2006); Shalev-Shwartz & Ben-David (2014).