Mathematics for AI/ML/LLM — Cheatsheet
Every formula that actually matters. No filler. Organized by how you encounter them in practice.
1 · Linear Algebra
Vectors
| Operation | Formula | Why It Matters |
|---|---|---|
| Dot Product | $\mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i b_i$ | Attention scores, similarity |
| Cosine Similarity | $\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$ | Embeddings, RAG retrieval, semantic search |
| L2 Norm | $\|\mathbf{a}\| = \sqrt{\sum a_i^2}$ | Weight decay, distance metrics |
| L1 Norm | $\|\mathbf{a}\|_1 = \sum |a_i|$ | Sparsity, Lasso regularization |
| Projection | $\text{proj}_{\mathbf{b}}\mathbf{a} = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{b}\|^2}\mathbf{b}$ | Orthogonal decomposition, GS process |
Matrices
| Operation | Formula | NumPy / PyTorch |
|---|---|---|
| Matrix Multiply | $(AB)_{ij} = \sum_k A_{ik}B_{kj}$ | A @ B |
| Transpose | $(A^T)_{ij} = A_{ji}$ | A.T |
| Inverse | $AA^{-1} = I$ | np.linalg.inv(A) |
| Trace | $\text{tr}(A) = \sum_i A_{ii}$ | torch.trace(A) |
| Hadamard (element-wise) | $(A \odot B)_{ij} = A_{ij} B_{ij}$ | A * B |
Decompositions
Eigendecomposition — Used in: PCA, spectral clustering, graph analysis
$$A\mathbf{v} = \lambda\mathbf{v} \qquad A = PDP^{-1}$$SVD — Used in: PCA, LoRA, matrix compression, recommender systems
$$A = U\Sigma V^T$$- $U$ ($m \times m$): left singular vectors
- $\Sigma$ ($m \times n$): singular values on diagonal
- $V^T$ ($n \times n$): right singular vectors
- Low-rank approximation: $A_k = U_k \Sigma_k V_k^T$ (keep top-$k$ values)
PCA via SVD: Center data $X$, compute SVD, project onto top-$k$ columns of $V$.
Matrix Properties That Matter
| Property | Meaning | Where You See It |
|---|---|---|
| Positive Definite | $\mathbf{x}^T A \mathbf{x} > 0, \forall \mathbf{x} \neq 0$ | Covariance matrices, convex loss |
| Orthogonal | $A^T A = I$ | Rotation matrices, SVD components |
| Symmetric | $A = A^T$ | Covariance, Hessians, kernels |
| Rank | $\text{rank}(A) = $ # linearly independent rows/cols | LoRA exploits low rank |
2 · Calculus
Derivatives You Need to Know
| Function | Derivative | Used In |
|---|---|---|
| $x^n$ | $nx^{n-1}$ | Polynomial features |
| $e^x$ | $e^x$ | Softmax, exponential LR decay |
| $\ln(x)$ | $1/x$ | Log-likelihood, cross-entropy |
| $\sigma(x) = \frac{1}{1+e^{-x}}$ | $\sigma(x)(1-\sigma(x))$ | Sigmoid activation, logistic regression |
| $\tanh(x)$ | $1 - \tanh^2(x)$ | RNN/LSTM gates |
Rules That Drive Backpropagation
| Rule | Formula |
|---|---|
| Chain Rule | $(f \circ g)' = f'(g(x)) \cdot g'(x)$ |
| Product Rule | $(fg)' = f'g + fg'$ |
| Sum Rule | $(f + g)' = f' + g'$ |
The chain rule IS backpropagation. Every layer computes local gradient × upstream gradient.
Multivariate Calculus
Gradient — direction of steepest ascent, used in every optimizer:
$$\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]^T$$Jacobian ($\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$) — used in normalizing flows, diffusion models:
$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$Hessian — second-order info, used in loss landscape analysis, Newton's method:
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$- $H \succ 0$ (positive definite) → local minimum
- $H \prec 0$ (negative definite) → local maximum
- Mixed eigenvalues → saddle point (common in deep learning!)
3 · Probability & Statistics
Core Rules
| Formula | Name |
|---|---|
| $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ | Addition |
| $P(A \cap B) = P(A) \cdot P(B \mid A)$ | Multiplication |
| $P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$ | Bayes' Theorem |
Bayes is everywhere: Naive Bayes, MAP estimation, Bayesian neural nets, posterior inference.
Key Distributions
| Distribution | PDF / PMF | Mean | Variance | Used In |
|---|---|---|---|---|
| Bernoulli | $P(X=1) = p$ | $p$ | $p(1-p)$ | Binary classification |
| Categorical | $P(X=k) = p_k$ | — | — | Multi-class output |
| Gaussian | $\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\mu$ | $\sigma^2$ | Weight init, noise, VAE |
| Multivariate Gaussian | $\frac{1}{(2\pi)^{n/2}\|\Sigma\|^{1/2}} e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})}$ | $\boldsymbol{\mu}$ | $\Sigma$ | Latent spaces, GMMs |
Estimation
MLE — maximize likelihood of observed data:
$$\hat{\theta}_{MLE} = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta)$$MAP — MLE + prior belief (= regularization!):
$$\hat{\theta}_{MAP} = \arg\max_\theta \left[\sum_{i=1}^n \log p(x_i \mid \theta) + \log p(\theta)\right]$$Gaussian prior on $\theta$ → L2 regularization. Laplace prior → L1 regularization.
Expectation & Variance
$$E[X] = \sum_x x \cdot P(x) \qquad \text{Var}(X) = E[X^2] - (E[X])^2$$Covariance
$$\text{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)] \qquad \rho = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$4 · Activation Functions
| Function | Formula | Derivative | Used In |
|---|---|---|---|
| ReLU | $\max(0, x)$ | $$ | CNNs, default choice |
| Leaky ReLU | $\max(\alpha x, x)$ | $$ | Avoids dead neurons |
| GELU | $x \cdot \Phi(x) \approx x \cdot \sigma(1.702x)$ | (complex) | GPT, BERT, modern transformers |
| SiLU / Swish | $x \cdot \sigma(x)$ | $\sigma(x) + x\sigma(x)(1-\sigma(x))$ | LLaMA, modern LLMs |
| Sigmoid | $\frac{1}{1+e^{-x}}$ | $\sigma(x)(1-\sigma(x))$ | Gates (LSTM), binary output |
| Tanh | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $1 - \tanh^2(x)$ | RNN hidden states |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $s_i(\delta_{ij} - s_j)$ | Classification output, attention |
5 · Loss Functions
Classification
Cross-Entropy (multi-class) — THE standard classification loss:
$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$Binary Cross-Entropy:
$$L = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]$$Focal Loss — handles class imbalance (used in object detection):
$$L = -\alpha_t (1 - \hat{y}_t)^\gamma \log(\hat{y}_t)$$Regression
MSE:
$$L = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$MAE / L1 Loss:
$$L = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$$Huber Loss — smooth transition between MSE and MAE:
$$L_\delta = $$Contrastive & Embedding Losses
Contrastive Loss (SimCLR / CLIP):
$$L = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}$$Where $\text{sim}$ = cosine similarity, $\tau$ = temperature
Triplet Loss:
$$L = \max(0, \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha)$$- $a$ = anchor, $p$ = positive, $n$ = negative, $\alpha$ = margin
6 · Optimization
Gradient Descent Variants
Vanilla GD:
$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$SGD with Momentum:
$$v_t = \gamma v_{t-1} + \eta \nabla L(\theta_t) \qquad \theta_{t+1} = \theta_t - v_t$$Adam (the default optimizer for most deep learning):
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \qquad \text{(1st moment)}$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \qquad \text{(2nd moment)}$$ $$\hat{m}_t = \frac{m_t}{1-\beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \qquad \text{(bias correction)}$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
AdamW (Adam with decoupled weight decay — used in LLM training):
$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t\right)$$Key difference from Adam: weight decay $\lambda\theta_t$ is applied directly, not through the gradient.
Gradient Clipping
By norm (prevents exploding gradients — standard in LLM training):
$$\hat{g} = $$Learning Rate Schedules
Cosine Annealing (used in most modern LLM training):
$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\frac{t\pi}{T}\right)$$Warmup + Cosine Decay (the LLM standard):
$$\eta_t = $$Regularization
| Method | Effect | Formula |
|---|---|---|
| L2 / Weight Decay | Small weights | $L + \lambda\sum\theta_i^2$ |
| L1 | Sparse weights | $L + \lambda\sum\|\theta_i\|$ |
| Dropout | Random neuron masking | $h = \text{mask} \odot f(x) / (1-p)$ |
Convexity
$$f \text{ convex} \iff f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$$- Convex → single global minimum (guaranteed convergence)
- Deep learning losses are non-convex → saddle points, local minima
7 · Normalization
Batch Normalization
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad y_i = \gamma \hat{x}_i + \beta$$Normalizes across the batch dimension. Used in CNNs.
Layer Normalization
$$\hat{x}_i = \frac{x_i - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}} \qquad y_i = \gamma \hat{x}_i + \beta$$Normalizes across the feature dimension. Used in original Transformers, BERT.
RMSNorm (Root Mean Square Normalization)
$$\hat{x}_i = \frac{x_i}{\text{RMS}(x)} \cdot \gamma \qquad \text{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2}$$No mean subtraction, no bias $\beta$. Faster than LayerNorm. Used in LLaMA, GPT-4, modern LLMs.
8 · Transformer & Attention Math
Scaled Dot-Product Attention
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$- $Q = XW_Q$ (queries), $K = XW_K$ (keys), $V = XW_V$ (values)
- $d_k$ = key dimension. Scaling by $\sqrt{d_k}$ prevents softmax saturation.
- Complexity: $O(n^2 d)$ where $n$ = sequence length
Why $\sqrt{d_k}$?
If $q, k$ have components with variance 1, then $q \cdot k$ has variance $d_k$. Dividing by $\sqrt{d_k}$ restores unit variance → softmax gets reasonable gradients.
Multi-Head Attention
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$ $$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$- $h$ heads, each with $d_k = d_{model}/h$
- Lets the model attend to different representation subspaces
Causal (Autoregressive) Masking
$$\text{mask}_{ij} = $$Applied before softmax: $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \text{mask}\right)V$
Prevents token $i$ from attending to future tokens $j > i$. Used in GPT, LLaMA, all decoder models.
Transformer Block
Input
→ RMSNorm / LayerNorm
→ Multi-Head Self-Attention + Residual
→ RMSNorm / LayerNorm
→ Feed-Forward Network + Residual
Output
Feed-Forward Network (FFN):
$$\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2$$- $W_1$: $d_{model} \to d_{ff}$ (typically $d_{ff} = 4 \times d_{model}$)
- $W_2$: $d_{ff} \to d_{model}$
SwiGLU FFN (used in LLaMA, PaLM, modern LLMs):
$$\text{SwiGLU}(x) = (\text{SiLU}(W_1 x) \odot W_3 x) \cdot W_2$$Residual Connection
$$\text{output} = x + \text{Sublayer}(x)$$Enables gradient flow through deep networks (50+ layers in LLMs).
9 · LLM-Specific Math
Positional Encoding — Sinusoidal (Original Transformer)
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$Rotary Position Embedding (RoPE) — Used in LLaMA, GPT-NeoX
Rotates query and key vectors based on position:
$$f(x_m, m) = R_m x_m \qquad R_m = $$Key property: $\langle f(q, m), f(k, n) \rangle$ depends only on $q$, $k$, and relative position $m - n$.
Tokenization (BPE)
Byte-Pair Encoding greedily merges the most frequent adjacent token pair:
$$\text{pair}^* = \arg\max_{(a,b)} \text{count}(a, b)$$Repeat until vocabulary size reached. Subword tokenization balances vocabulary size vs sequence length.
Language Model Probability
$$P(x_1, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_1, \ldots, x_{t-1})$$Training objective (minimize negative log-likelihood):
$$L = -\frac{1}{T}\sum_{t=1}^T \log P(x_t \mid x_{- $\tau \to 0$: greedy (picks highest logit)
- $\tau = 1$: standard sampling
- $\tau > 1$: more random / creative
Top-k and Top-p (Nucleus) Sampling
- Top-k: Sample from the $k$ highest-probability tokens only
- Top-p: Sample from smallest set where $\sum P(x_i) \geq p$
Scaling Laws (Chinchilla)
$$L(N, D) \approx A \cdot N^{-\alpha} + B \cdot D^{-\beta} + L_\infty$$- $N$ = number of parameters, $D$ = number of training tokens
- Chinchilla-optimal: $D \approx 20 \times N$ (train on 20 tokens per parameter)
LoRA (Low-Rank Adaptation)
$$W' = W_0 + \Delta W = W_0 + BA$$- $W_0 \in \mathbb{R}^{d \times d}$ (frozen pretrained weights)
- $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$ where $r \ll d$
- Only $B$ and $A$ are trained. Parameters: $2dr$ instead of $d^2$
- Example: $d = 4096$, $r = 16$ → 99.2% fewer trainable parameters
Quantization
Uniform quantization (maps float → int):
$$x_q = \text{round}\left(\frac{x}{s}\right) + z \qquad x_{dequant} = s(x_q - z)$$- $s$ = scale factor, $z$ = zero point
- INT8: 4× memory reduction, INT4: 8× memory reduction
KV Cache
During autoregressive generation, cache Key and Value matrices to avoid recomputation:
- Without cache: $O(n^2)$ per token
- With cache: $O(n)$ per new token
- Memory: $2 \times n_{layers} \times n_{heads} \times d_{head} \times \text{seq\_len} \times \text{precision}$
10 · Information Theory
| Concept | Formula | Used In |
|---|---|---|
| Entropy | $H(X) = -\sum p(x) \log p(x)$ | Uncertainty measure, decision trees |
| Cross-Entropy | $H(p, q) = -\sum p(x) \log q(x)$ | THE classification loss |
| KL Divergence | $D_{KL}(p \| q) = \sum p(x) \log \frac{p(x)}{q(x)}$ | VAE loss, knowledge distillation |
| Perplexity | $\text{PPL} = \exp\left(-\frac{1}{T}\sum \log P(x_t \mid x_{| LLM evaluation metric |
|
| Mutual Info | $I(X;Y) = H(X) - H(X \mid Y)$ | Feature selection, info bottleneck |
Cross-entropy = Entropy + KL Divergence: $H(p, q) = H(p) + D_{KL}(p \| q)$
Minimizing cross-entropy loss = minimizing KL divergence from true distribution.
11 · Embeddings & Similarity
Embedding Lookup
$$\text{embed}(x) = E[x, :] \qquad E \in \mathbb{R}^{|V| \times d}$$One-hot × embedding matrix = row lookup. $|V|$ = vocab size, $d$ = embedding dim.
Similarity Metrics
| Metric | Formula | Range | Used In |
|---|---|---|---|
| Cosine Similarity | $\frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$ | $[-1, 1]$ | RAG, semantic search, CLIP |
| Dot Product | $\mathbf{a} \cdot \mathbf{b}$ | $(-\infty, \infty)$ | Attention scores |
| Euclidean (L2) | $\|\mathbf{a} - \mathbf{b}\|_2$ | $[0, \infty)$ | k-NN, clustering |
Approximate Nearest Neighbor
For retrieval at scale (millions of vectors):
- HNSW: Hierarchical Navigable Small World graphs
- IVF: Inverted File Index — cluster then search within clusters
- PQ: Product Quantization — compress vectors, approximate distance
12 · Generative Models
VAE (Variational Autoencoder)
ELBO (Evidence Lower Bound):
$$\log p(x) \geq E_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))$$- First term: reconstruction quality
- Second term: regularize latent space toward prior $p(z) = \mathcal{N}(0, I)$
Reparameterization Trick (enables backprop through sampling):
$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$GAN (Generative Adversarial Network)
$$\min_G \max_D \; E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p_z}[\log(1 - D(G(z)))]$$- Optimal discriminator: $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$
- At equilibrium: generator distribution = data distribution
Diffusion Models (DDPM)
Forward process — add Gaussian noise over $T$ steps:
$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} \, x_{t-1}, \beta_t I)$$Direct jump to any timestep:
$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) I) \qquad \bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s)$$Reverse process — neural network learns to denoise:
$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$$Training loss — predict the noise:
$$L = E_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$13 · Alignment & RLHF
Reward Modeling
Train reward model $r_\phi$ on human preferences:
$$P(\text{response}_w \succ \text{response}_l) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$Bradley-Terry model: probability that response $w$ is preferred over $l$.
PPO Objective (Proximal Policy Optimization)
$$L_{PPO} = E_t\left[\min\left(r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]$$ $$r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$$With KL penalty to stay close to base model:
$$\text{objective} = E[r(x, y)] - \beta \cdot D_{KL}(\pi_\theta \| \pi_{ref})$$DPO (Direct Preference Optimization)
Skips the reward model entirely:
$$L_{DPO} = -E\left[\log \sigma\left(\beta \log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]$$- Simpler than PPO pipeline (no reward model, no RL loop)
- Equivalent to PPO under certain assumptions
SFT (Supervised Fine-Tuning)
Standard next-token prediction on instruction-response pairs:
$$L_{SFT} = -\sum_{t=1}^T \log P_\theta(y_t \mid x, y_{14 · Backpropagation
For layer $l$ with $z^l = W^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$:
$$\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$$ $$\frac{\partial L}{\partial W^l} = \delta^l (a^{l-1})^T \qquad \frac{\partial L}{\partial b^l} = \delta^l$$Vanishing / Exploding Gradients
After $n$ layers: $\frac{\partial L}{\partial W^1} \propto \prod_{l=1}^{n} W^l \cdot \sigma'(z^l)$
| Problem | Cause | Solution |
|---|---|---|
| Vanishing | $\|W \cdot \sigma'\| < 1$ repeated | Residual connections, LSTM gates, ReLU |
| Exploding | $\|W \cdot \sigma'\| > 1$ repeated | Gradient clipping, proper initialization |
Weight Initialization
| Method | Variance | Best For |
|---|---|---|
| Xavier / Glorot | $\frac{2}{n_{in} + n_{out}}$ | Sigmoid, Tanh |
| He / Kaiming | $\frac{2}{n_{in}}$ | ReLU and variants |
15 · Model-Specific Math
CNN — Convolution
$$(f * g)(t) = \sum_{\tau} f(\tau) \cdot g(t - \tau)$$Output size: $\left\lfloor\frac{n + 2p - k}{s}\right\rfloor + 1$
- $n$ = input size, $k$ = kernel size, $p$ = padding, $s$ = stride
RNN / LSTM
RNN:
$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$LSTM gates:
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \qquad \text{(forget)}$$ $$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \qquad \text{(input)}$$ $$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \qquad \text{(candidate)}$$ $$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad \text{(cell state)}$$ $$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \qquad \text{(output)}$$ $$h_t = o_t \odot \tanh(c_t) \qquad \text{(hidden state)}$$GNN (Graph Convolutional Network)
$$H^{(l+1)} = \sigma\left(\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2} H^{(l)} W^{(l)}\right)$$Where $\hat{A} = A + I$ (adjacency + self-loops), $\hat{D}$ = degree matrix of $\hat{A}$.
16 · PyTorch Quick Reference
import torch
import torch.nn as nn
import torch.nn.functional as F
# === Tensors ===
x = torch.randn(batch, seq_len, d_model) # Random tensor
x.shape, x.dtype, x.device # Inspect
# === Linear Algebra ===
A @ B # Matrix multiply
torch.linalg.svd(A) # SVD
torch.linalg.eig(A) # Eigendecomposition
torch.linalg.norm(x, dim=-1) # Norm
# === Autograd ===
x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 3*x
y.backward()
x.grad # dy/dx = 2x + 3 = 7.0
# === Key Layers ===
nn.Linear(d_in, d_out) # Fully connected
nn.Embedding(vocab_size, d_model) # Embedding lookup
nn.MultiheadAttention(d_model, num_heads) # Multi-head attention
nn.LayerNorm(d_model) # Layer normalization
nn.Dropout(p=0.1) # Dropout
# === Loss Functions ===
nn.CrossEntropyLoss() # Classification
nn.MSELoss() # Regression
nn.BCEWithLogitsLoss() # Binary classification
F.cosine_similarity(a, b) # Similarity
# === Optimizers ===
torch.optim.Adam(model.parameters(), lr=1e-4)
torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
# === LR Schedulers ===
torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-3, total_steps=10000)
17 · Numbers to Know
| Metric | Value | Context |
|---|---|---|
| GPT-3 parameters | 175B | $d_{model}=12288$, 96 layers, 96 heads |
| LLaMA-2 70B | 70B | $d_{model}=8192$, 80 layers, 64 heads |
| Typical batch size (LLM) | 1-4M tokens | Per gradient step |
| Typical learning rate (LLM) | $1\text{e-}4$ to $3\text{e-}4$ | With cosine decay |
| Chinchilla-optimal tokens | $20 \times N$ | For $N$ parameters |
| Float16 memory per param | 2 bytes | 70B model ≈ 140 GB |
| INT4 memory per param | 0.5 bytes | 70B model ≈ 35 GB |
| Attention FLOPs | $2n^2d$ | $n$ = seq len, $d$ = dim |
| FFN FLOPs | $16nd^2$ | Dominates for short sequences |
This cheatsheet covers Ch.01–25 of the Math for AI/ML/LLM curriculum. See also: Notation Guide · ML Math Map · Interview Prep