Mathematics for AI/ML/LLM — Cheatsheet

Every formula that actually matters. No filler. Organized by how you encounter them in practice.


1 · Linear Algebra

Vectors

| Operation | Formula | Why It Matters |
|---|---|---|
| Dot Product | $\mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i b_i$ | Attention scores, similarity |
| Cosine Similarity | $\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert\mathbf{a}\rVert \lVert\mathbf{b}\rVert}$ | Embeddings, RAG retrieval, semantic search |
| L2 Norm | $\lVert\mathbf{a}\rVert = \sqrt{\sum a_i^2}$ | Weight decay, distance metrics |
| L1 Norm | $\lVert\mathbf{a}\rVert_1 = \sum \lvert a_i\rvert$ | Sparsity, L1 regularization |
| Projection | $\text{proj}_{\mathbf{b}}\mathbf{a} = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert\mathbf{b}\rVert^2}\mathbf{b}$ | Orthogonal decomposition, Gram–Schmidt process |
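
A minimal PyTorch sketch of these operations; the 768-dimensional random vectors are arbitrary example data:

```python
import torch
import torch.nn.functional as F

a = torch.randn(768)
b = torch.randn(768)

dot  = a @ b                              # dot product
cos  = F.cosine_similarity(a, b, dim=0)   # cosine similarity, in [-1, 1]
l2   = torch.linalg.norm(a)               # L2 norm
l1   = torch.linalg.norm(a, ord=1)        # L1 norm
proj = (a @ b) / (b @ b) * b              # projection of a onto b
```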

Matrices

| Operation | Formula | NumPy / PyTorch |
|---|---|---|
| Matrix Multiply | $(AB)_{ij} = \sum_k A_{ik}B_{kj}$ | `A @ B` |
| Transpose | $(A^T)_{ij} = A_{ji}$ | `A.T` |
| Inverse | $AA^{-1} = I$ | `np.linalg.inv(A)` |
| Trace | $\text{tr}(A) = \sum_i A_{ii}$ | `torch.trace(A)` |
| Hadamard (element-wise) | $(A \odot B)_{ij} = A_{ij} B_{ij}$ | `A * B` |

Decompositions

Eigendecomposition — Used in: PCA, spectral clustering, graph analysis

$$A\mathbf{v} = \lambda\mathbf{v} \qquad A = PDP^{-1}$$

SVD — Used in: PCA, LoRA, matrix compression, recommender systems

$$A = U\Sigma V^T$$
  • $U$ ($m \times m$): left singular vectors
  • $\Sigma$ ($m \times n$): singular values on the diagonal
  • $V^T$ ($n \times n$): right singular vectors
  • Low-rank approximation: $A_k = U_k \Sigma_k V_k^T$ (keep the top-$k$ singular values)

PCA via SVD: center the data $X$, compute the SVD, project onto the top-$k$ columns of $V$.
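
A short sketch of exactly those steps; the data shape and `k` are arbitrary choices for illustration:

```python
import torch

X = torch.randn(500, 64)                       # 500 samples, 64 features
Xc = X - X.mean(dim=0, keepdim=True)           # center each feature
U, S, Vh = torch.linalg.svd(Xc, full_matrices=False)
k = 8
X_proj = Xc @ Vh[:k].T                         # project onto top-k principal directions
explained = (S[:k] ** 2) / (S ** 2).sum()      # fraction of variance per component
```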

Matrix Properties That Matter

| Property | Meaning | Where You See It |
|---|---|---|
| Positive Definite | $\mathbf{x}^T A \mathbf{x} > 0 \;\; \forall \mathbf{x} \neq 0$ | Covariance matrices, convex loss |
| Orthogonal | $A^T A = I$ | Rotation matrices, SVD components |
| Symmetric | $A = A^T$ | Covariance, Hessians, kernels |
| Rank | $\text{rank}(A)$ = # linearly independent rows/cols | LoRA exploits low rank |

2 · Calculus

Derivatives You Need to Know

| Function | Derivative | Used In |
|---|---|---|
| $x^n$ | $nx^{n-1}$ | Polynomial features |
| $e^x$ | $e^x$ | Softmax, exponential LR decay |
| $\ln(x)$ | $1/x$ | Log-likelihood, cross-entropy |
| $\sigma(x) = \frac{1}{1+e^{-x}}$ | $\sigma(x)(1-\sigma(x))$ | Sigmoid activation, logistic regression |
| $\tanh(x)$ | $1 - \tanh^2(x)$ | RNN/LSTM gates |

Rules That Drive Backpropagation

| Rule | Formula |
|---|---|
| Chain Rule | $(f \circ g)' = f'(g(x)) \cdot g'(x)$ |
| Product Rule | $(fg)' = f'g + fg'$ |
| Sum Rule | $(f + g)' = f' + g'$ |

The chain rule IS backpropagation. Every layer computes local gradient × upstream gradient.

Multivariate Calculus

Gradient — direction of steepest ascent, used in every optimizer:

$$\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]^T$$

Jacobian ($\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$) — used in normalizing flows, diffusion models:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$

Hessian — second-order info, used in loss landscape analysis, Newton's method:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
  • $H \succ 0$ (positive definite) → local minimum
  • $H \prec 0$ (negative definite) → local maximum
  • Mixed eigenvalues → saddle point (common in deep learning!)

3 · Probability & Statistics

Core Rules

| Formula | Name |
|---|---|
| $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ | Addition |
| $P(A \cap B) = P(A) \cdot P(B \mid A)$ | Multiplication |
| $P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$ | Bayes' Theorem |

Bayes is everywhere: Naive Bayes, MAP estimation, Bayesian neural nets, posterior inference.

Key Distributions

| Distribution | PDF / PMF | Mean | Variance | Used In |
|---|---|---|---|---|
| Bernoulli | $P(X=1) = p$ | $p$ | $p(1-p)$ | Binary classification |
| Categorical | $P(X=k) = p_k$ | — | — | Multi-class output |
| Gaussian | $\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\mu$ | $\sigma^2$ | Weight init, noise, VAE |
| Multivariate Gaussian | $\frac{1}{(2\pi)^{n/2}\lvert\Sigma\rvert^{1/2}} e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})}$ | $\boldsymbol{\mu}$ | $\Sigma$ | Latent spaces, GMMs |

Estimation

MLE — maximize likelihood of observed data:

$$\hat{\theta}_{MLE} = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta)$$

MAP — MLE + prior belief (= regularization!):

$$\hat{\theta}_{MAP} = \arg\max_\theta \left[\sum_{i=1}^n \log p(x_i \mid \theta) + \log p(\theta)\right]$$

Gaussian prior on $\theta$ → L2 regularization. Laplace prior → L1 regularization.

Expectation & Variance

$$E[X] = \sum_x x \cdot P(x) \qquad \text{Var}(X) = E[X^2] - (E[X])^2$$

Covariance

$$\text{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)] \qquad \rho = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

4 · Activation Functions

| Function | Formula | Derivative | Used In |
|---|---|---|---|
| ReLU | $\max(0, x)$ | $\begin{cases}1 & x > 0 \\ 0 & x \leq 0\end{cases}$ | CNNs, default choice |
| Leaky ReLU | $\max(\alpha x, x)$ | $\begin{cases}1 & x > 0 \\ \alpha & x \leq 0\end{cases}$ | Avoids dead neurons |
| GELU | $x \cdot \Phi(x) \approx x \cdot \sigma(1.702x)$ | (complex) | GPT, BERT, modern transformers |
| SiLU / Swish | $x \cdot \sigma(x)$ | $\sigma(x) + x\sigma(x)(1-\sigma(x))$ | LLaMA, modern LLMs |
| Sigmoid | $\frac{1}{1+e^{-x}}$ | $\sigma(x)(1-\sigma(x))$ | Gates (LSTM), binary output |
| Tanh | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $1 - \tanh^2(x)$ | RNN hidden states |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $s_i(\delta_{ij} - s_j)$ | Classification output, attention |

5 · Loss Functions

Classification

Cross-Entropy (multi-class) — THE standard classification loss:

$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

Binary Cross-Entropy:

$$L = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]$$

Focal Loss — handles class imbalance (used in object detection):

$$L = -\alpha_t (1 - \hat{y}_t)^\gamma \log(\hat{y}_t)$$

Regression

MSE:

$$L = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$

MAE / L1 Loss:

$$L = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$$

Huber Loss — smooth transition between MSE and MAE:

$$L_\delta = \begin{cases}\frac{1}{2}(y - \hat{y})^2 & |y - \hat{y}| \leq \delta \\ \delta|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise}\end{cases}$$

Contrastive & Embedding Losses

Contrastive Loss (SimCLR / CLIP):

$$L = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}$$

where $\text{sim}$ = cosine similarity and $\tau$ = temperature.

Triplet Loss:

$$L = \max(0, \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha)$$
  • $a$ = anchor, $p$ = positive, $n$ = negative, $\alpha$ = margin

6 · Optimization

Gradient Descent Variants

Vanilla GD:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

SGD with Momentum:

$$v_t = \gamma v_{t-1} + \eta \nabla L(\theta_t) \qquad \theta_{t+1} = \theta_t - v_t$$

Adam (the default optimizer for most deep learning):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \qquad \text{(1st moment)}$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \qquad \text{(2nd moment)}$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \qquad \text{(bias correction)}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$

Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$

AdamW (Adam with decoupled weight decay — used in LLM training):

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t\right)$$

Key difference from Adam: the weight-decay term $\lambda\theta_t$ is applied directly to the parameters, not folded into the gradient.

Gradient Clipping

By norm (prevents exploding gradients — standard in LLM training):

$$\hat{g} = \begin{cases}g & \text{if } \|g\| \leq c \\ c \cdot \frac{g}{\|g\|} & \text{otherwise}\end{cases}$$
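
A usage sketch with a toy model; `max_norm=1.0` is a commonly used value, not a prescription:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if global norm > 1.0
opt.step()
opt.zero_grad()
```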

Learning Rate Schedules

Cosine Annealing (used in most modern LLM training):

$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\frac{t\pi}{T}\right)$$

Warmup + Cosine Decay (the LLM standard):

$$\eta_t = \begin{cases}\eta_{max} \cdot \frac{t}{T_{warmup}} & t < T_{warmup} \\ \text{cosine decay} & t \geq T_{warmup}\end{cases}$$
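
One way to express this as a `LambdaLR` multiplier (a simplified sketch that decays to 0 rather than to an $\eta_{min}$ floor; the step counts are illustrative):

```python
import math
import torch

warmup, total = 2_000, 100_000

def lr_lambda(step):
    if step < warmup:
        return step / max(1, warmup)                       # linear warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay

model = torch.nn.Linear(8, 8)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)  # call sched.step() once per step
```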

Regularization

| Method | Effect | Formula |
|---|---|---|
| L2 / Weight Decay | Small weights | $L + \lambda\sum\theta_i^2$ |
| L1 | Sparse weights | $L + \lambda\sum\lvert\theta_i\rvert$ |
| Dropout | Random neuron masking | $h = \text{mask} \odot f(x) / (1-p)$ |

Convexity

$$f \text{ convex} \iff f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$$
  • Convex → single global minimum (guaranteed convergence)
  • Deep learning losses are non-convex → saddle points, local minima

7 · Normalization

Batch Normalization

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad y_i = \gamma \hat{x}_i + \beta$$

Normalizes across the batch dimension. Used in CNNs.

Layer Normalization

$$\hat{x}_i = \frac{x_i - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}} \qquad y_i = \gamma \hat{x}_i + \beta$$

Normalizes across the feature dimension. Used in original Transformers, BERT.

RMSNorm (Root Mean Square Normalization)

$$\hat{x}_i = \frac{x_i}{\text{RMS}(x)} \cdot \gamma \qquad \text{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2}$$

No mean subtraction, no bias $\beta$. Faster than LayerNorm. Used in LLaMA and most modern open LLMs.
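
A minimal RMSNorm sketch matching the formula above (the `eps` value is an assumption):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d))   # learned scale, no bias

    def forward(self, x):
        # No mean subtraction: scale by the root-mean-square of the features.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gamma
```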


8 · Transformer & Attention Math

Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
  • $Q = XW_Q$ (queries), $K = XW_K$ (keys), $V = XW_V$ (values)
  • $d_k$ = key dimension. Scaling by $\sqrt{d_k}$ prevents softmax saturation.
  • Complexity: $O(n^2 d)$, where $n$ = sequence length

Why $\sqrt{d_k}$?

If the components of $q$ and $k$ have variance 1, then $q \cdot k$ has variance $d_k$. Dividing by $\sqrt{d_k}$ restores unit variance → the softmax gets reasonable gradients.
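
A sketch of scaled dot-product attention as defined above; the batch/head/sequence shapes in the example call are arbitrary:

```python
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Q, K, V: (..., n, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., n, n)
    return F.softmax(scores, dim=-1) @ V

out = attention(torch.randn(2, 8, 16, 64),
                torch.randn(2, 8, 16, 64),
                torch.randn(2, 8, 16, 64))
```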

Multi-Head Attention

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
  • $h$ heads, each with $d_k = d_{model}/h$
  • Lets the model attend to different representation subspaces

Causal (Autoregressive) Masking

$$\text{mask}_{ij} = \begin{cases}0 & i \geq j \\ -\infty & i < j\end{cases}$$

Applied before the softmax: $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \text{mask}\right)V$

Prevents token $i$ from attending to future tokens $j > i$. Used in GPT, LLaMA, and all decoder-only models.
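
A sketch of building the additive causal mask with `torch.triu` (sequence length is arbitrary):

```python
import torch

n = 8
# -inf above the diagonal, 0 on and below: position i cannot attend to j > i.
mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
# scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k) + mask, then softmax as usual
```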

Transformer Block

Input
  → RMSNorm / LayerNorm
  → Multi-Head Self-Attention + Residual
  → RMSNorm / LayerNorm
  → Feed-Forward Network + Residual
Output

Feed-Forward Network (FFN):

$$\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2$$
  • $W_1$: $d_{model} \to d_{ff}$ (typically $d_{ff} = 4 \times d_{model}$)
  • $W_2$: $d_{ff} \to d_{model}$

SwiGLU FFN (used in LLaMA, PaLM, modern LLMs):

$$\text{SwiGLU}(x) = (\text{SiLU}(W_1 x) \odot W_3 x) \cdot W_2$$
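
A minimal SwiGLU FFN sketch, assuming bias-free projections as in LLaMA-style blocks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)   # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```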

Residual Connection

$$\text{output} = x + \text{Sublayer}(x)$$

Enables gradient flow through deep networks (50+ layers in LLMs).


9 · LLM-Specific Math

Positional Encoding — Sinusoidal (Original Transformer)

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

Rotary Position Embedding (RoPE) — Used in LLaMA, GPT-NeoX

Rotates query and key vectors based on position:

$$f(x_m, m) = R_m x_m \qquad R_m = \begin{pmatrix}\cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta\end{pmatrix}$$

Key property: $\langle f(q, m), f(k, n) \rangle$ depends only on $q$, $k$, and the relative position $m - n$.
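
A rough RoPE sketch that rotates consecutive (even, odd) channel pairs by position-dependent angles; the base of 10000 follows the formula above, while the pair layout and shapes are illustrative simplifications:

```python
import torch

def rope_rotate(x, positions, theta_base=10000.0):
    # x: (seq_len, d) with d even; positions: (seq_len,) integer positions.
    seq_len, d = x.shape
    freqs = theta_base ** (-torch.arange(0, d, 2).float() / d)   # (d/2,)
    angles = positions[:, None].float() * freqs[None, :]         # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin      # 2D rotation of each channel pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q_rot = rope_rotate(torch.randn(16, 64), torch.arange(16))
```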

Tokenization (BPE)

Byte-Pair Encoding greedily merges the most frequent adjacent token pair:

$$\text{pair}^* = \arg\max_{(a,b)} \text{count}(a, b)$$

Repeat until vocabulary size reached. Subword tokenization balances vocabulary size vs sequence length.
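
A toy sketch of one merge step on character-level words; the corpus and the `merge` helper are illustrative only:

```python
from collections import Counter

corpus = [list("low"), list("lower"), list("lowest")]

# Count adjacent symbol pairs and pick the most frequent one.
pairs = Counter((a, b) for w in corpus for a, b in zip(w, w[1:]))
best = max(pairs, key=pairs.get)

def merge(word, pair):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1]); i += 2
        else:
            out.append(word[i]); i += 1
    return out

corpus = [merge(w, best) for w in corpus]   # repeat until the target vocab size is reached
```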

Language Model Probability

$$P(x_1, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_1, \ldots, x_{t-1})$$

Training objective (minimize negative log-likelihood):

$$L = -\frac{1}{T}\sum_{t=1}^T \log P(x_t \mid x_{<t})$$

Temperature Scaling

$$P(x_i) = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}$$
  • $\tau \to 0$: greedy (picks the highest logit)
  • $\tau = 1$: standard sampling
  • $\tau > 1$: more random / creative

Top-k and Top-p (Nucleus) Sampling

  • Top-k: sample from the $k$ highest-probability tokens only
  • Top-p: sample from the smallest set of tokens whose cumulative probability satisfies $\sum P(x_i) \geq p$ (see the sketch below)
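
A sketch combining temperature scaling with nucleus (top-p) sampling for a single decoding step; the default values are illustrative:

```python
import torch
import torch.nn.functional as F

def sample(logits, temperature=0.8, top_p=0.9):
    # logits: (vocab,) for one decoding step.
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p          # smallest set with mass >= top_p
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum()  # renormalize over the nucleus
    choice = torch.multinomial(sorted_probs, 1)
    return sorted_idx[choice]

next_token = sample(torch.randn(32000))
```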

Scaling Laws (Chinchilla)

$$L(N, D) \approx A \cdot N^{-\alpha} + B \cdot D^{-\beta} + L_\infty$$
  • $N$ = number of parameters, $D$ = number of training tokens
  • Chinchilla-optimal: $D \approx 20 \times N$ (train on ~20 tokens per parameter)

LoRA (Low-Rank Adaptation)

$$W' = W_0 + \Delta W = W_0 + BA$$
  • $W_0 \in \mathbb{R}^{d \times d}$ (frozen pretrained weights)
  • $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$ where $r \ll d$
  • Only $B$ and $A$ are trained. Parameters: $2dr$ instead of $d^2$
  • Example: $d = 4096$, $r = 16$ → 99.2% fewer trainable parameters (see the sketch below)
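
A minimal LoRA layer sketch; the `alpha / r` scaling is a common convention beyond the formula above, and the initialization (zero $B$, small random $A$) makes the update start at zero:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=16, alpha=32):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)            # frozen pretrained weight W0
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))       # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```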

Quantization

Uniform quantization (maps float → int):

$$x_q = \text{round}\left(\frac{x}{s}\right) + z \qquad x_{dequant} = s(x_q - z)$$
  • $s$ = scale factor, $z$ = zero point
  • INT8: 4× memory reduction vs FP32; INT4: 8× (sketch below)
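
A sketch of asymmetric uniform INT8 quantization following the formula above; the min/max calibration is the simplest possible choice:

```python
import torch

def quantize_int8(x):
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    x_q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return x_q, scale, zero_point

def dequantize(x_q, scale, zero_point):
    return scale * (x_q.float() - zero_point)

x = torch.randn(1024)
x_q, s, z = quantize_int8(x)
err = (x - dequantize(x_q, s, z)).abs().max()   # quantization error
```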

KV Cache

During autoregressive generation, cache Key and Value matrices to avoid recomputation:

  • Without cache: $O(n^2)$ work per generated token
  • With cache: $O(n)$ per new token
  • Memory: $2 \times n_{layers} \times n_{heads} \times d_{head} \times \text{seq\_len} \times \text{bytes per value}$ (worked example below)
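
Plugging illustrative numbers into that memory formula (a hypothetical 80-layer model with 64 heads of head dimension 128, fp16; none of these are claims about a specific model's cache layout):

```python
n_layers, n_heads, d_head, seq_len, bytes_per = 80, 64, 128, 4096, 2
kv_bytes = 2 * n_layers * n_heads * d_head * seq_len * bytes_per
print(kv_bytes / 1e9)   # ≈ 10.7 GB per 4096-token sequence
```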

10 · Information Theory

| Concept | Formula | Used In |
|---|---|---|
| Entropy | $H(X) = -\sum p(x) \log p(x)$ | Uncertainty measure, decision trees |
| Cross-Entropy | $H(p, q) = -\sum p(x) \log q(x)$ | THE classification loss |
| KL Divergence | $D_{KL}(p \parallel q) = \sum p(x) \log \frac{p(x)}{q(x)}$ | VAE loss, knowledge distillation |
| Perplexity | $\text{PPL} = \exp\left(-\frac{1}{T}\sum \log P(x_t \mid x_{<t})\right)$ | LLM evaluation metric |
| Mutual Info | $I(X;Y) = H(X) - H(X \mid Y)$ | Feature selection, info bottleneck |

Cross-entropy = entropy + KL divergence: $H(p, q) = H(p) + D_{KL}(p \parallel q)$

Minimizing cross-entropy loss = minimizing KL divergence from true distribution.
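
A perplexity sketch built from token-level cross-entropy; the shapes and vocab size are arbitrary:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 12, 32000)              # (batch, seq_len, vocab)
targets = torch.randint(0, 32000, (1, 12))
nll = F.cross_entropy(logits.view(-1, 32000), targets.view(-1))   # mean -log P(x_t | x_<t)
ppl = torch.exp(nll)                            # perplexity = exp(mean NLL)
```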


11 · Embeddings & Similarity

Embedding Lookup

$$\text{embed}(x) = E[x, :] \qquad E \in \mathbb{R}^{|V| \times d}$$

One-hot × embedding matrix = row lookup. $|V|$ = vocab size, $d$ = embedding dim.

Similarity Metrics

| Metric | Formula | Range | Used In |
|---|---|---|---|
| Cosine Similarity | $\frac{\mathbf{a} \cdot \mathbf{b}}{\lVert\mathbf{a}\rVert \lVert\mathbf{b}\rVert}$ | $[-1, 1]$ | RAG, semantic search, CLIP |
| Dot Product | $\mathbf{a} \cdot \mathbf{b}$ | $(-\infty, \infty)$ | Attention scores |
| Euclidean (L2) | $\lVert\mathbf{a} - \mathbf{b}\rVert_2$ | $[0, \infty)$ | k-NN, clustering |

Approximate Nearest Neighbor

For retrieval at scale (millions of vectors):

  • HNSW: Hierarchical Navigable Small World graphs
  • IVF: Inverted File Index — cluster then search within clusters
  • PQ: Product Quantization — compress vectors, approximate distance

12 · Generative Models

VAE (Variational Autoencoder)

ELBO (Evidence Lower Bound):

$$\log p(x) \geq E_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \,\|\, p(z))$$
  • First term: reconstruction quality
  • Second term: regularizes the latent space toward the prior $p(z) = \mathcal{N}(0, I)$

Reparameterization Trick (enables backprop through sampling):

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
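
The trick in two lines; random `mu`/`log_var` stand in for encoder outputs here:

```python
import torch

mu = torch.randn(32, 16, requires_grad=True)
log_var = torch.randn(32, 16, requires_grad=True)
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps   # gradients flow to mu and log_var, not through sampling
```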

GAN (Generative Adversarial Network)

$$\min_G \max_D \; E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p_z}[\log(1 - D(G(z)))]$$
  • Optimal discriminator: $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$
  • At equilibrium: generator distribution = data distribution

Diffusion Models (DDPM)

Forward process — add Gaussian noise over $T$ steps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$$

Direct jump to any timestep:

$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I) \qquad \bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s)$$

Reverse process — neural network learns to denoise:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$$

Training loss — predict the noise:

$$L = E_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$
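
A sketch of the forward-noising half of one training step; `eps_model` is a hypothetical denoising network, and the linear beta schedule and tensor shapes are illustrative assumptions:

```python
import torch

x0 = torch.randn(16, 3, 32, 32)                  # a batch of clean images
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alpha_bar = torch.cumprod(1 - betas, dim=0)

t = torch.randint(0, T, (x0.size(0),))           # random timestep per sample
eps = torch.randn_like(x0)
ab = alpha_bar[t].view(-1, 1, 1, 1)
x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps     # direct jump q(x_t | x_0)
# loss = F.mse_loss(eps_model(x_t, t), eps)      # train the (hypothetical) denoiser to predict eps
```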

13 · Alignment & RLHF

Reward Modeling

Train a reward model $r_\phi$ on human preferences:

$$P(\text{response}_w \succ \text{response}_l) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$

Bradley–Terry model: the probability that response $w$ is preferred over response $l$.

PPO Objective (Proximal Policy Optimization)

$$L_{PPO} = E_t\left[\min\left(r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]$$
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$

With KL penalty to stay close to base model:

$$\text{objective} = E[r(x, y)] - \beta \cdot D_{KL}(\pi_\theta \,\|\, \pi_{ref})$$

DPO (Direct Preference Optimization)

Skips the reward model entirely:

$$L_{DPO} = -E\left[\log \sigma\left(\beta \log\frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log\frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right]$$
  • Simpler than the PPO pipeline (no reward model, no RL loop) — see the loss sketch below
  • Optimizes the same KL-regularized objective as PPO-based RLHF under the Bradley–Terry preference model
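
A sketch of the DPO loss, assuming the sequence log-probabilities of the chosen ($w$) and rejected ($l$) responses under the policy and the frozen reference model have already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # All inputs: (batch,) sequence log-probabilities.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```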

SFT (Supervised Fine-Tuning)

Standard next-token prediction on instruction-response pairs:

$$L_{SFT} = -\sum_{t=1}^T \log P_\theta(y_t \mid x, y_{<t})$$

14 · Backpropagation

For layer $l$ with $z^l = W^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$:

$$\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$$
$$\frac{\partial L}{\partial W^l} = \delta^l (a^{l-1})^T \qquad \frac{\partial L}{\partial b^l} = \delta^l$$

Vanishing / Exploding Gradients

After $n$ layers: $\frac{\partial L}{\partial W^1} \propto \prod_{l=1}^{n} W^l \cdot \sigma'(z^l)$

| Problem | Cause | Solution |
|---|---|---|
| Vanishing | $\lVert W \cdot \sigma'\rVert < 1$ repeated | Residual connections, LSTM gates, ReLU |
| Exploding | $\lVert W \cdot \sigma'\rVert > 1$ repeated | Gradient clipping, proper initialization |

Weight Initialization

| Method | Variance | Best For |
|---|---|---|
| Xavier / Glorot | $\frac{2}{n_{in} + n_{out}}$ | Sigmoid, Tanh |
| He / Kaiming | $\frac{2}{n_{in}}$ | ReLU and variants |
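
The corresponding built-in PyTorch initializers, applied here to arbitrary example layers:

```python
import torch.nn as nn

tanh_layer = nn.Linear(512, 512)
relu_layer = nn.Linear(512, 512)
nn.init.xavier_normal_(tanh_layer.weight)                        # Xavier / Glorot
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")  # He / Kaiming
```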

15 · Model-Specific Math

CNN — Convolution

$$(f * g)(t) = \sum_{\tau} f(\tau) \cdot g(t - \tau)$$

Output size: $\left\lfloor\frac{n + 2p - k}{s}\right\rfloor + 1$

  • $n$ = input size, $k$ = kernel size, $p$ = padding, $s$ = stride (worked example below)
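
A quick arithmetic check of the output-size formula, using a ResNet-style 7×7 stem (stride 2, padding 3) as the example:

```python
n, k, p, s = 224, 7, 3, 2
out = (n + 2 * p - k) // s + 1
print(out)   # 112: a 224x224 input halves to 112x112
```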

RNN / LSTM

RNN:

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$

LSTM gates:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \qquad \text{(forget)}$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \qquad \text{(input)}$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \qquad \text{(candidate)}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad \text{(cell state)}$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \qquad \text{(output)}$$
$$h_t = o_t \odot \tanh(c_t) \qquad \text{(hidden state)}$$

GNN (Graph Convolutional Network)

$$H^{(l+1)} = \sigma\left(\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

Where $\hat{A} = A + I$ (adjacency + self-loops) and $\hat{D}$ is the degree matrix of $\hat{A}$.


16 · PyTorch Quick Reference

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# === Tensors ===
x = torch.randn(batch, seq_len, d_model)       # Random tensor
x.shape, x.dtype, x.device                      # Inspect

# === Linear Algebra ===
A @ B                                            # Matrix multiply
torch.linalg.svd(A)                              # SVD
torch.linalg.eig(A)                              # Eigendecomposition
torch.linalg.norm(x, dim=-1)                     # Norm

# === Autograd ===
x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 3*x
y.backward()
x.grad                                           # dy/dx = 2x + 3 = 7.0

# === Key Layers ===
nn.Linear(d_in, d_out)                           # Fully connected
nn.Embedding(vocab_size, d_model)                # Embedding lookup
nn.MultiheadAttention(d_model, num_heads)        # Multi-head attention
nn.LayerNorm(d_model)                            # Layer normalization
nn.Dropout(p=0.1)                                # Dropout

# === Loss Functions ===
nn.CrossEntropyLoss()                            # Classification
nn.MSELoss()                                     # Regression
nn.BCEWithLogitsLoss()                           # Binary classification
F.cosine_similarity(a, b)                        # Similarity

# === Optimizers ===
torch.optim.Adam(model.parameters(), lr=1e-4)
torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# === LR Schedulers ===
torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-3, total_steps=10000)
```

17 · Numbers to Know

| Metric | Value | Context |
|---|---|---|
| GPT-3 parameters | 175B | $d_{model}=12288$, 96 layers, 96 heads |
| LLaMA-2 70B | 70B | $d_{model}=8192$, 80 layers, 64 heads |
| Typical batch size (LLM) | 1-4M tokens | Per gradient step |
| Typical learning rate (LLM) | 1e-4 to 3e-4 | With cosine decay |
| Chinchilla-optimal tokens | $20 \times N$ | For $N$ parameters |
| Float16 memory per param | 2 bytes | 70B model ≈ 140 GB |
| INT4 memory per param | 0.5 bytes | 70B model ≈ 35 GB |
| Attention FLOPs | $2n^2 d$ | $n$ = seq len, $d$ = dim |
| FFN FLOPs | $16nd^2$ | Dominates for short sequences |

This cheatsheet covers Ch.01–25 of the Math for AI/ML/LLM curriculum. See also: Notation Guide · ML Math Map · Interview Prep