Mathematics for AI/ML/LLM — Cheatsheet

Every formula that actually matters. No filler. Organized by how you encounter them in practice.

1 · Linear Algebra

Vectors

Operation	Formula	Why It Matters
Dot Product	$\mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i b_i$	Attention scores, similarity
Cosine Similarity	$\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\\|\mathbf{a}\\| \\|\mathbf{b}\\|}$	Embeddings, RAG retrieval, semantic search
L2 Norm	$\\|\mathbf{a}\\| = \sqrt{\sum a_i^2}$	Weight decay, distance metrics
L1 Norm	$\\|\mathbf{a}\\|_1 = \sum \|a_i\|$	Sparsity, Lasso regularization
Projection	$\text{proj}_{\mathbf{b}}\mathbf{a} = \frac{\mathbf{a} \cdot \mathbf{b}}{\\|\mathbf{b}\\|^2}\mathbf{b}$	Orthogonal decomposition, GS process

Matrices

Operation	Formula	NumPy / PyTorch
Matrix Multiply	$(AB)_{ij} = \sum_k A_{ik}B_{kj}$	`A @ B`
Transpose	$(A^T)_{ij} = A_{ji}$	`A.T`
Inverse	$AA^{-1} = I$	`np.linalg.inv(A)`
Trace	$\text{tr}(A) = \sum_i A_{ii}$	`torch.trace(A)`
Hadamard (element-wise)	$(A \odot B)_{ij} = A_{ij} B_{ij}$	`A * B`

Decompositions

Eigendecomposition — Used in: PCA, spectral clustering, graph analysis

$$A\mathbf{v} = \lambda\mathbf{v} \qquad A = PDP^{-1}$$

SVD — Used in: PCA, LoRA, matrix compression, recommender systems

$$A = U\Sigma V^T$$

$U$ ($m \times m$): left singular vectors
$\Sigma$ ($m \times n$): singular values on diagonal
$V^T$ ($n \times n$): right singular vectors
Low-rank approximation: $A_k = U_k \Sigma_k V_k^T$ (keep top-$k$ values)

PCA via SVD: Center data $X$, compute SVD, project onto top-$k$ columns of $V$.

Matrix Properties That Matter

Property	Meaning	Where You See It
Positive Definite	$\mathbf{x}^T A \mathbf{x} > 0, \forall \mathbf{x} \neq 0$	Covariance matrices, convex loss
Orthogonal	$A^T A = I$	Rotation matrices, SVD components
Symmetric	$A = A^T$	Covariance, Hessians, kernels
Rank	$\text{rank}(A) = $ # linearly independent rows/cols	LoRA exploits low rank

2 · Calculus

Derivatives You Need to Know

Function	Derivative	Used In
$x^n$	$nx^{n-1}$	Polynomial features
$e^x$	$e^x$	Softmax, exponential LR decay
$\ln(x)$	$1/x$	Log-likelihood, cross-entropy
$\sigma(x) = \frac{1}{1+e^{-x}}$	$\sigma(x)(1-\sigma(x))$	Sigmoid activation, logistic regression
$\tanh(x)$	$1 - \tanh^2(x)$	RNN/LSTM gates

Rules That Drive Backpropagation

Rule	Formula
Chain Rule	$(f \circ g)' = f'(g(x)) \cdot g'(x)$
Product Rule	$(fg)' = f'g + fg'$
Sum Rule	$(f + g)' = f' + g'$

The chain rule IS backpropagation. Every layer computes local gradient × upstream gradient.

Multivariate Calculus

Gradient — direction of steepest ascent, used in every optimizer:

$$\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]^T$$

Jacobian ($\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$) — used in normalizing flows, diffusion models:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$

Hessian — second-order info, used in loss landscape analysis, Newton's method:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

$H \succ 0$ (positive definite) → local minimum
$H \prec 0$ (negative definite) → local maximum
Mixed eigenvalues → saddle point (common in deep learning!)

3 · Probability & Statistics

Core Rules

Formula	Name
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$	Addition
$P(A \cap B) = P(A) \cdot P(B \mid A)$	Multiplication
$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$	Bayes' Theorem

Bayes is everywhere: Naive Bayes, MAP estimation, Bayesian neural nets, posterior inference.

Key Distributions

Distribution	PDF / PMF	Mean	Variance	Used In
Bernoulli	$P(X=1) = p$	$p$	$p(1-p)$	Binary classification
Categorical	$P(X=k) = p_k$	—	—	Multi-class output
Gaussian	$\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$	$\mu$	$\sigma^2$	Weight init, noise, VAE
Multivariate Gaussian	$\frac{1}{(2\pi)^{n/2}\\|\Sigma\\|^{1/2}} e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})}$	$\boldsymbol{\mu}$	$\Sigma$	Latent spaces, GMMs

Estimation

MLE — maximize likelihood of observed data:

$$\hat{\theta}_{MLE} = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta)$$

MAP — MLE + prior belief (= regularization!):

$$\hat{\theta}_{MAP} = \arg\max_\theta \left[\sum_{i=1}^n \log p(x_i \mid \theta) + \log p(\theta)\right]$$

Gaussian prior on $\theta$ → L2 regularization. Laplace prior → L1 regularization.

Expectation & Variance

$$E[X] = \sum_x x \cdot P(x) \qquad \text{Var}(X) = E[X^2] - (E[X])^2$$

Covariance

$$\text{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)] \qquad \rho = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

4 · Activation Functions

Function	Formula	Derivative	Used In
ReLU	$\max(0, x)$	$$	CNNs, default choice
Leaky ReLU	$\max(\alpha x, x)$	$$	Avoids dead neurons
GELU	$x \cdot \Phi(x) \approx x \cdot \sigma(1.702x)$	(complex)	GPT, BERT, modern transformers
SiLU / Swish	$x \cdot \sigma(x)$	$\sigma(x) + x\sigma(x)(1-\sigma(x))$	LLaMA, modern LLMs
Sigmoid	$\frac{1}{1+e^{-x}}$	$\sigma(x)(1-\sigma(x))$	Gates (LSTM), binary output
Tanh	$\frac{e^x - e^{-x}}{e^x + e^{-x}}$	$1 - \tanh^2(x)$	RNN hidden states
Softmax	$\frac{e^{z_i}}{\sum_j e^{z_j}}$	$s_i(\delta_{ij} - s_j)$	Classification output, attention

5 · Loss Functions

Classification

Cross-Entropy (multi-class) — THE standard classification loss:

$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

Binary Cross-Entropy:

$$L = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]$$

Focal Loss — handles class imbalance (used in object detection):

$$L = -\alpha_t (1 - \hat{y}_t)^\gamma \log(\hat{y}_t)$$

Regression

MSE:

$$L = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$

MAE / L1 Loss:

$$L = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$$

Huber Loss — smooth transition between MSE and MAE:

$$L_\delta = $$

Contrastive & Embedding Losses

Contrastive Loss (SimCLR / CLIP):

$$L = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}$$

Where $\text{sim}$ = cosine similarity, $\tau$ = temperature

Triplet Loss:

$$L = \max(0, \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha)$$

$a$ = anchor, $p$ = positive, $n$ = negative, $\alpha$ = margin

6 · Optimization

Gradient Descent Variants

Vanilla GD:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

SGD with Momentum:

$$v_t = \gamma v_{t-1} + \eta \nabla L(\theta_t) \qquad \theta_{t+1} = \theta_t - v_t$$

Adam (the default optimizer for most deep learning):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \qquad \text{(1st moment)}$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \qquad \text{(2nd moment)}$$ $$\hat{m}_t = \frac{m_t}{1-\beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \qquad \text{(bias correction)}$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$

AdamW (Adam with decoupled weight decay — used in LLM training):

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t\right)$$

Key difference from Adam: weight decay $\lambda\theta_t$ is applied directly, not through the gradient.

Gradient Clipping

By norm (prevents exploding gradients — standard in LLM training):

$$\hat{g} = $$

Learning Rate Schedules

Cosine Annealing (used in most modern LLM training):

$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\frac{t\pi}{T}\right)$$

Warmup + Cosine Decay (the LLM standard):

$$\eta_t = $$

Regularization

Method	Effect	Formula
L2 / Weight Decay	Small weights	$L + \lambda\sum\theta_i^2$
L1	Sparse weights	$L + \lambda\sum\\|\theta_i\\|$
Dropout	Random neuron masking	$h = \text{mask} \odot f(x) / (1-p)$

Convexity

$$f \text{ convex} \iff f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$$

Convex → single global minimum (guaranteed convergence)
Deep learning losses are non-convex → saddle points, local minima

7 · Normalization

Batch Normalization

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad y_i = \gamma \hat{x}_i + \beta$$

Normalizes across the batch dimension. Used in CNNs.

Layer Normalization

$$\hat{x}_i = \frac{x_i - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}} \qquad y_i = \gamma \hat{x}_i + \beta$$

Normalizes across the feature dimension. Used in original Transformers, BERT.

RMSNorm (Root Mean Square Normalization)

$$\hat{x}_i = \frac{x_i}{\text{RMS}(x)} \cdot \gamma \qquad \text{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2}$$

No mean subtraction, no bias $\beta$. Faster than LayerNorm. Used in LLaMA, GPT-4, modern LLMs.

8 · Transformer & Attention Math

Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

$Q = XW_Q$ (queries), $K = XW_K$ (keys), $V = XW_V$ (values)
$d_k$ = key dimension. Scaling by $\sqrt{d_k}$ prevents softmax saturation.
Complexity: $O(n^2 d)$ where $n$ = sequence length

Why $\sqrt{d_k}$?

If $q, k$ have components with variance 1, then $q \cdot k$ has variance $d_k$. Dividing by $\sqrt{d_k}$ restores unit variance → softmax gets reasonable gradients.

Multi-Head Attention

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$ $$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

$h$ heads, each with $d_k = d_{model}/h$
Lets the model attend to different representation subspaces

Causal (Autoregressive) Masking

$$\text{mask}_{ij} = $$

Applied before softmax: $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \text{mask}\right)V$

Prevents token $i$ from attending to future tokens $j > i$. Used in GPT, LLaMA, all decoder models.

Transformer Block

Input
  → RMSNorm / LayerNorm
  → Multi-Head Self-Attention + Residual
  → RMSNorm / LayerNorm
  → Feed-Forward Network + Residual
Output

Feed-Forward Network (FFN):

$$\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2$$

$W_1$: $d_{model} \to d_{ff}$ (typically $d_{ff} = 4 \times d_{model}$)
$W_2$: $d_{ff} \to d_{model}$

SwiGLU FFN (used in LLaMA, PaLM, modern LLMs):

$$\text{SwiGLU}(x) = (\text{SiLU}(W_1 x) \odot W_3 x) \cdot W_2$$

Residual Connection

$$\text{output} = x + \text{Sublayer}(x)$$

Enables gradient flow through deep networks (50+ layers in LLMs).

9 · LLM-Specific Math

Positional Encoding — Sinusoidal (Original Transformer)

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

Rotary Position Embedding (RoPE) — Used in LLaMA, GPT-NeoX

Rotates query and key vectors based on position:

$$f(x_m, m) = R_m x_m \qquad R_m = $$

Key property: $\langle f(q, m), f(k, n) \rangle$ depends only on $q$, $k$, and relative position $m - n$.

Tokenization (BPE)

Byte-Pair Encoding greedily merges the most frequent adjacent token pair:

$$\text{pair}^* = \arg\max_{(a,b)} \text{count}(a, b)$$

Repeat until vocabulary size reached. Subword tokenization balances vocabulary size vs sequence length.

Language Model Probability

$$P(x_1, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_1, \ldots, x_{t-1})$$

Training objective (minimize negative log-likelihood):

$$L = -\frac{1}{T}\sum_{t=1}^T \log P(x_t \mid x_{Temperature Scaling $$P(x_i) = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}$$

$\tau \to 0$: greedy (picks highest logit)
$\tau = 1$: standard sampling
$\tau > 1$: more random / creative

Top-k and Top-p (Nucleus) Sampling

Top-k: Sample from the $k$ highest-probability tokens only
Top-p: Sample from smallest set where $\sum P(x_i) \geq p$

Scaling Laws (Chinchilla)

$$L(N, D) \approx A \cdot N^{-\alpha} + B \cdot D^{-\beta} + L_\infty$$

$N$ = number of parameters, $D$ = number of training tokens
Chinchilla-optimal: $D \approx 20 \times N$ (train on 20 tokens per parameter)

LoRA (Low-Rank Adaptation)

$$W' = W_0 + \Delta W = W_0 + BA$$

$W_0 \in \mathbb{R}^{d \times d}$ (frozen pretrained weights)
$B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$ where $r \ll d$
Only $B$ and $A$ are trained. Parameters: $2dr$ instead of $d^2$
Example: $d = 4096$, $r = 16$ → 99.2% fewer trainable parameters

Quantization

Uniform quantization (maps float → int):

$$x_q = \text{round}\left(\frac{x}{s}\right) + z \qquad x_{dequant} = s(x_q - z)$$

$s$ = scale factor, $z$ = zero point
INT8: 4× memory reduction, INT4: 8× memory reduction

KV Cache

During autoregressive generation, cache Key and Value matrices to avoid recomputation:

Without cache: $O(n^2)$ per token
With cache: $O(n)$ per new token
Memory: $2 \times n_{layers} \times n_{heads} \times d_{head} \times \text{seq\_len} \times \text{precision}$

10 · Information Theory

Concept	Formula	Used In
Entropy	$H(X) = -\sum p(x) \log p(x)$	Uncertainty measure, decision trees
Cross-Entropy	$H(p, q) = -\sum p(x) \log q(x)$	THE classification loss
KL Divergence	$D_{KL}(p \\| q) = \sum p(x) \log \frac{p(x)}{q(x)}$	VAE loss, knowledge distillation
Perplexity	$\text{PPL} = \exp\left(-\frac{1}{T}\sum \log P(x_t \mid x_{	LLM evaluation metric
Mutual Info	$I(X;Y) = H(X) - H(X \mid Y)$	Feature selection, info bottleneck

Cross-entropy = Entropy + KL Divergence: $H(p, q) = H(p) + D_{KL}(p \| q)$

Minimizing cross-entropy loss = minimizing KL divergence from true distribution.

11 · Embeddings & Similarity

Embedding Lookup

$$\text{embed}(x) = E[x, :] \qquad E \in \mathbb{R}^{|V| \times d}$$

One-hot × embedding matrix = row lookup. $|V|$ = vocab size, $d$ = embedding dim.

Similarity Metrics

Metric	Formula	Range	Used In
Cosine Similarity	$\frac{\mathbf{a} \cdot \mathbf{b}}{\\|\mathbf{a}\\| \\|\mathbf{b}\\|}$	$[-1, 1]$	RAG, semantic search, CLIP
Dot Product	$\mathbf{a} \cdot \mathbf{b}$	$(-\infty, \infty)$	Attention scores
Euclidean (L2)	$\\|\mathbf{a} - \mathbf{b}\\|_2$	$[0, \infty)$	k-NN, clustering

Approximate Nearest Neighbor

For retrieval at scale (millions of vectors):

HNSW: Hierarchical Navigable Small World graphs
IVF: Inverted File Index — cluster then search within clusters
PQ: Product Quantization — compress vectors, approximate distance

12 · Generative Models

VAE (Variational Autoencoder)

ELBO (Evidence Lower Bound):

$$\log p(x) \geq E_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))$$

First term: reconstruction quality
Second term: regularize latent space toward prior $p(z) = \mathcal{N}(0, I)$

Reparameterization Trick (enables backprop through sampling):

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

GAN (Generative Adversarial Network)

$$\min_G \max_D \; E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p_z}[\log(1 - D(G(z)))]$$

Optimal discriminator: $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$
At equilibrium: generator distribution = data distribution

Diffusion Models (DDPM)

Forward process — add Gaussian noise over $T$ steps:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} \, x_{t-1}, \beta_t I)$$

Direct jump to any timestep:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) I) \qquad \bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s)$$

Reverse process — neural network learns to denoise:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$$

Training loss — predict the noise:

$$L = E_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

13 · Alignment & RLHF

Reward Modeling

Train reward model $r_\phi$ on human preferences:

$$P(\text{response}_w \succ \text{response}_l) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$

Bradley-Terry model: probability that response $w$ is preferred over $l$.

PPO Objective (Proximal Policy Optimization)

$$L_{PPO} = E_t\left[\min\left(r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]$$ $$r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$$

With KL penalty to stay close to base model:

$$\text{objective} = E[r(x, y)] - \beta \cdot D_{KL}(\pi_\theta \| \pi_{ref})$$

DPO (Direct Preference Optimization)

Skips the reward model entirely:

$$L_{DPO} = -E\left[\log \sigma\left(\beta \log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]$$

Simpler than PPO pipeline (no reward model, no RL loop)
Equivalent to PPO under certain assumptions

SFT (Supervised Fine-Tuning)

Standard next-token prediction on instruction-response pairs:

$$L_{SFT} = -\sum_{t=1}^T \log P_\theta(y_t \mid x, y_{

14 · Backpropagation

For layer $l$ with $z^l = W^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$:

$$\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$$ $$\frac{\partial L}{\partial W^l} = \delta^l (a^{l-1})^T \qquad \frac{\partial L}{\partial b^l} = \delta^l$$

Vanishing / Exploding Gradients

After $n$ layers: $\frac{\partial L}{\partial W^1} \propto \prod_{l=1}^{n} W^l \cdot \sigma'(z^l)$

Problem	Cause	Solution
Vanishing	$\\|W \cdot \sigma'\\| < 1$ repeated	Residual connections, LSTM gates, ReLU
Exploding	$\\|W \cdot \sigma'\\| > 1$ repeated	Gradient clipping, proper initialization

Weight Initialization

Method	Variance	Best For
Xavier / Glorot	$\frac{2}{n_{in} + n_{out}}$	Sigmoid, Tanh
He / Kaiming	$\frac{2}{n_{in}}$	ReLU and variants

15 · Model-Specific Math

CNN — Convolution

$$(f * g)(t) = \sum_{\tau} f(\tau) \cdot g(t - \tau)$$

Output size: $\left\lfloor\frac{n + 2p - k}{s}\right\rfloor + 1$

$n$ = input size, $k$ = kernel size, $p$ = padding, $s$ = stride

RNN / LSTM

RNN:

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$

LSTM gates:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \qquad \text{(forget)}$$ $$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \qquad \text{(input)}$$ $$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \qquad \text{(candidate)}$$ $$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad \text{(cell state)}$$ $$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \qquad \text{(output)}$$ $$h_t = o_t \odot \tanh(c_t) \qquad \text{(hidden state)}$$

GNN (Graph Convolutional Network)

$$H^{(l+1)} = \sigma\left(\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

Where $\hat{A} = A + I$ (adjacency + self-loops), $\hat{D}$ = degree matrix of $\hat{A}$.

16 · PyTorch Quick Reference

import torch
import torch.nn as nn
import torch.nn.functional as F

# === Tensors ===
x = torch.randn(batch, seq_len, d_model)       # Random tensor
x.shape, x.dtype, x.device                      # Inspect

# === Linear Algebra ===
A @ B                                            # Matrix multiply
torch.linalg.svd(A)                              # SVD
torch.linalg.eig(A)                              # Eigendecomposition
torch.linalg.norm(x, dim=-1)                     # Norm

# === Autograd ===
x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 3*x
y.backward()
x.grad                                           # dy/dx = 2x + 3 = 7.0

# === Key Layers ===
nn.Linear(d_in, d_out)                           # Fully connected
nn.Embedding(vocab_size, d_model)                # Embedding lookup
nn.MultiheadAttention(d_model, num_heads)        # Multi-head attention
nn.LayerNorm(d_model)                            # Layer normalization
nn.Dropout(p=0.1)                                # Dropout

# === Loss Functions ===
nn.CrossEntropyLoss()                            # Classification
nn.MSELoss()                                     # Regression
nn.BCEWithLogitsLoss()                           # Binary classification
F.cosine_similarity(a, b)                        # Similarity

# === Optimizers ===
torch.optim.Adam(model.parameters(), lr=1e-4)
torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# === LR Schedulers ===
torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-3, total_steps=10000)

17 · Numbers to Know

Metric	Value	Context
GPT-3 parameters	175B	$d_{model}=12288$, 96 layers, 96 heads
LLaMA-2 70B	70B	$d_{model}=8192$, 80 layers, 64 heads
Typical batch size (LLM)	1-4M tokens	Per gradient step
Typical learning rate (LLM)	$1\text{e-}4$ to $3\text{e-}4$	With cosine decay
Chinchilla-optimal tokens	$20 \times N$	For $N$ parameters
Float16 memory per param	2 bytes	70B model ≈ 140 GB
INT4 memory per param	0.5 bytes	70B model ≈ 35 GB
Attention FLOPs	$2n^2d$	$n$ = seq len, $d$ = dim
FFN FLOPs	$16nd^2$	Dominates for short sequences

This cheatsheet covers Ch.01–25 of the Math for AI/ML/LLM curriculum. See also: Notation Guide · ML Math Map · Interview Prep