Training at Scale: Parts 1-5, from Scale as a Constraint Problem to Parallelism Strategies
1. Scale as a Constraint Problem
This part treats scale as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.
| Subtopic | Operational question | Formula |
|---|---|---|
| The same loss at larger cost | training at scale still minimizes next-token cross-entropy | $L(\theta)=-\frac{1}{N}\sum_{t}\log p_\theta(x_t\mid x_{<t})$ |
| Four limiting resources | memory, compute, bandwidth, and data quality each become a bottleneck | |
| Parameter, optimizer, and activation memory | weights are only one part of training memory | $M_\mathrm{train}\approx M_\mathrm{params}+M_\mathrm{grads}+M_\mathrm{opt}+M_\mathrm{acts}$ |
| Throughput versus convergence | fast tokens per second are useful only if loss improves | $\mathrm{tokens/sec}$ must be read with $L(\mathrm{tokens})$ |
| Failure modes | large training fails by divergence, stalls, bad data, communication bottlenecks, or checkpoint loss | $\Delta L>0$ for many steps is a symptom, not a diagnosis |
1.1 The same loss at larger cost
Main idea. Training at scale still minimizes next-token cross-entropy.
Core relation:
$L(\theta)=-\frac{1}{N}\sum_{t=1}^{N}\log p_\theta(x_t\mid x_{<t})$
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
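To make the objective concrete, here is a minimal numpy sketch of next-token cross-entropy on a toy batch; the shapes, vocabulary size, and function name are illustrative rather than taken from any particular codebase.

```python
import numpy as np

def next_token_cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target token at each position.

    logits:  (batch, seq_len, vocab) unnormalized scores
    targets: (batch, seq_len) integer token ids (already shifted by one)
    """
    # Stable log-softmax over the vocabulary.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each target token.
    b, t = np.indices(targets.shape)
    return -log_probs[b, t, targets].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 11))               # toy batch: 2 sequences, 4 positions, vocab 11
targets = rng.integers(0, 11, size=(2, 4))
print(next_token_cross_entropy(logits, targets))   # ~ln(11) ≈ 2.4 for near-random logits
```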
1.2 Four limiting resources
Main idea. Memory, compute, bandwidth, and data quality each become a bottleneck.
Core relation:
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
1.3 Parameter, optimizer, and activation memory
Main idea. Weights are only one part of training memory.
Core relation:
$M_\mathrm{train}\approx M_\mathrm{params}+M_\mathrm{grads}+M_\mathrm{opt}+M_\mathrm{acts}$
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
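A back-of-the-envelope estimator makes the three memory terms explicit. This is a sketch under one common convention (bf16 weights and gradients, two fp32 Adam moments); the per-parameter byte counts are assumptions you should adjust to your own setup.

```python
def training_memory_gb(params_billion, bytes_weights=2, bytes_grads=2,
                       bytes_opt=8, activation_gb=0.0):
    """Rough per-replica training memory, ignoring fragmentation and framework buffers.

    Assumes bf16 weights/gradients (2 B each) and two fp32 Adam moments (8 B total)
    per parameter; change the byte counts for other precisions.
    """
    p = params_billion * 1e9
    state_gb = p * (bytes_weights + bytes_grads + bytes_opt) / 1e9
    return state_gb + activation_gb

# 7B dense model: ~84 GB of model/optimizer state before any activations.
print(training_memory_gb(7))
```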
1.4 Throughput versus convergence
Main idea. Fast tokens per second are useful only if loss improves.
Core relation:
$\mathrm{tokens/sec}$ must be read with $L(\mathrm{tokens})$
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
1.5 Failure modes
Main idea. Large training fails by divergence, stalls, bad data, communication bottlenecks, or checkpoint loss.
Core relation:
$\Delta L>0$ for many steps is a symptom, not a diagnosis
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
2. Optimization Core
This part focuses on the optimization core as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.
| Subtopic | Operational question | Formula |
|---|---|---|
| Mini-batch gradient | distributed workers estimate the same gradient with different data shards | $\hat g_B=\frac{1}{B}\sum_{i=1}^{B}\nabla_\theta\,\ell(x_i;\theta)$ |
| Adam moments | first and second moment estimates adapt update scale | $m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t$, $v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2$ |
| Bias correction | early moments are corrected because they start at zero | $\hat m_t=m_t/(1-\beta_1^t)$, $\hat v_t=v_t/(1-\beta_2^t)$ |
| AdamW | weight decay is applied outside the adaptive gradient ratio | $\theta_{t+1}=\theta_t-\eta\big(\hat m_t/(\sqrt{\hat v_t}+\epsilon)+\lambda\theta_t\big)$ |
| Gradient clipping | cap update norm when rare batches produce spikes | $g\leftarrow g\cdot\min(1,\ c/\lVert g\rVert_2)$ |
2.1 Mini-batch gradient
Main idea. Distributed workers estimate the same gradient with different data shards.
Core relation:
$\hat g_B=\frac{1}{B}\sum_{i=1}^{B}\nabla_\theta\,\ell(x_i;\theta)$
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
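The shard-averaging claim can be checked directly: for a mean loss over equal shards, averaging per-shard gradients reproduces the full-batch gradient. The toy linear model below exists only to illustrate that identity.

```python
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(8, 3)), rng.normal(size=(8,))
w = rng.normal(size=(3,))

def grad(X, y, w):
    # Gradient of the mean squared error 0.5*mean((Xw - y)^2) with respect to w.
    return X.T @ (X @ w - y) / len(y)

full = grad(X, y, w)
# Split the batch into 4 equal "data-parallel" shards and average shard gradients.
shards = [grad(Xs, ys, w) for Xs, ys in zip(np.split(X, 4), np.split(y, 4))]
print(np.allclose(full, np.mean(shards, axis=0)))  # True: same estimator
```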
2.2 Adam moments
Main idea. First and second moment estimates adapt update scale.
Core relation:
$m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t,\qquad v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2$
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
2.3 Bias correction
Main idea. Early moments are corrected because they start at zero.
Core relation:
$\hat m_t=m_t/(1-\beta_1^t),\qquad \hat v_t=v_t/(1-\beta_2^t)$
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
2.4 AdamW
Main idea. Weight decay is applied outside the adaptive gradient ratio.
Core relation:
$\theta_{t+1}=\theta_t-\eta\big(\hat m_t/(\sqrt{\hat v_t}+\epsilon)+\lambda\theta_t\big)$
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
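The moment updates, bias correction, and decoupled decay compose into a single step. Below is a minimal single-tensor AdamW step in numpy, written as a sketch of the textbook update rather than any library's exact implementation; the hyperparameter values are placeholders.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update. t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment
    v = beta2 * v + (1 - beta2) * grad**2       # second moment
    m_hat = m / (1 - beta1**t)                  # bias correction
    v_hat = v / (1 - beta2**t)
    # Decoupled weight decay: applied directly to the weights, not folded into the gradient.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

theta, m, v = np.ones(4), np.zeros(4), np.zeros(4)
theta, m, v = adamw_step(theta, grad=np.array([0.1, -0.2, 0.3, 0.0]), m=m, v=v, t=1)
print(theta)
```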
2.5 Gradient clipping
Main idea. Cap update norm when rare batches produce spikes.
Core relation:
$g\leftarrow g\cdot\min\big(1,\ c/\lVert g\rVert_2\big)$
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. Clipping is not a cure for a bad run, but it can prevent one rare batch from destroying useful optimizer state.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
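A sketch of clipping by global norm across a list of gradient tensors; the threshold and tensor shapes are illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a list of gradient tensors so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float((g**2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.full((2, 2), 3.0), np.full((3,), 4.0)]
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm, np.sqrt(sum((g**2).sum() for g in clipped)))  # original norm, then ~1.0
```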
3. Batching and Schedules
This part focuses on batching and schedules as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.
| Subtopic | Operational question | Formula |
|---|---|---|
| Effective batch size | global batch combines devices and accumulation steps | $B_\mathrm{global}=B_\mathrm{micro}\times N_\mathrm{accum}\times N_\mathrm{DP}$ |
| Gradient accumulation | several micro-batches approximate one larger batch | $g=\frac{1}{K}\sum_{k=1}^{K} g_k$ |
| Linear warmup | the learning rate starts small to avoid early instability | $\eta_t=\eta_\max t/T_\mathrm{warmup}$ |
| Cosine decay | the learning rate anneals smoothly after warmup | $\eta_t=\eta_\min+\frac{1}{2}(\eta_\max-\eta_\min)(1+\cos(\pi s))$ |
| Critical batch intuition | past a point, larger batches waste compute rather than reducing noise usefully | $\mathrm{noise}\propto 1/B$ only in the useful regime |
3.1 Effective batch size
Main idea. Global batch combines devices and accumulation steps.
Core relation:
$B_\mathrm{global}=B_\mathrm{micro}\times N_\mathrm{accum}\times N_\mathrm{DP}$
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
3.2 Gradient accumulation
Main idea. Several micro-batches approximate one larger batch.
Core relation:
$g=\frac{1}{K}\sum_{k=1}^{K} g_k$
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
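The sketch below shows why averaging K micro-batch gradients reproduces the larger-batch gradient, and how the effective batch size is counted; names such as micro_bs and the accumulation factor are illustrative.

```python
import numpy as np

def grad(X, y, w):
    # Mean-squared-error gradient of 0.5*mean((Xw - y)^2) with respect to w.
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(2)
X, y, w = rng.normal(size=(16, 3)), rng.normal(size=(16,)), rng.normal(size=(3,))

micro_bs, accum_steps, dp_ranks = 4, 4, 1
effective_batch = micro_bs * accum_steps * dp_ranks   # 16 samples per optimizer step

acc = np.zeros_like(w)
for k in range(accum_steps):
    sl = slice(k * micro_bs, (k + 1) * micro_bs)
    acc += grad(X[sl], y[sl], w) / accum_steps         # average, not sum
print(effective_batch, np.allclose(acc, grad(X, y, w)))  # 16 True
```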
3.3 Linear warmup
Main idea. The learning rate starts small to avoid early instability.
Core relation:
$\eta_t=\eta_\max t/T_\mathrm{warmup}$
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
3.4 Cosine decay
Main idea. The learning rate anneals smoothly after warmup.
Core relation:
$\eta_t=\eta_\min+\frac{1}{2}(\eta_\max-\eta_\min)(1+\cos(\pi s))$, where $s\in[0,1]$ is the fraction of the decay phase completed.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
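In practice warmup and cosine decay are composed into one schedule. A minimal sketch, with step counts and peak learning rate chosen only for illustration:

```python
import math

def lr_at(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup to lr_max, then cosine decay to lr_min."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    # s: fraction of the post-warmup (decay) phase completed, in [0, 1].
    s = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * s))

for step in (0, 100, 200, 1000, 2000):
    print(step, round(lr_at(step, total_steps=2000, warmup_steps=200, lr_max=3e-4), 6))
```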
3.5 Critical batch intuition
Main idea. Past a point, larger batches waste compute rather than reducing noise usefully.
Core relation:
$\mathrm{noise}\propto 1/B$ only in the useful regime
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
4. Memory Accounting
This part focuses on memory accounting as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.
| Subtopic | Operational question | Formula |
|---|---|---|
| Bytes per parameter | bf16 weights use 2 bytes but Adam states are often fp32 | $M_\mathrm{Adam}\approx 2P + 2P + 8P$ bytes |
| Activation memory | stored forward activations can dominate at long context | |
| Activation checkpointing | save memory by recomputing intermediate activations | |
| Optimizer state sharding | ZeRO/FSDP shard model states across data-parallel ranks | $M_\mathrm{per\ rank}\approx M/N$ for fully sharded states |
| Offload boundary | CPU or NVMe offload trades memory for bandwidth and latency | $T_\mathrm{step}$ can become transfer-bound |
4.1 Bytes per parameter
Main idea. bf16 weights use 2 bytes per parameter, but Adam states are often fp32.
Core relation:
$M_\mathrm{Adam}\approx 2P + 2P + 8P$ bytes
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
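Plugging the 7B example into the relation above gives a quick breakdown; the same bf16/fp32 assumptions as in the worked micro-example apply.

```python
P = 7e9  # parameters in the worked example
parts = {"weights (bf16)": 2 * P, "grads (bf16)": 2 * P, "Adam moments (fp32 x2)": 8 * P}
for name, b in parts.items():
    print(f"{name:24s} {b / 1e9:6.1f} GB")
print(f"{'total state':24s} {sum(parts.values()) / 1e9:6.1f} GB")  # ~84 GB before activations
```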
4.2 Activation memory
Main idea. Stored forward activations can dominate at long context.
Core relation:
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
4.3 Activation checkpointing
Main idea. Save memory by recomputing intermediate activations.
Core relation:
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
4.4 Optimizer state sharding
Main idea. ZeRO/FSDP shard model states across data-parallel ranks.
Core relation:
$M_\mathrm{per\ rank}\approx M/N$ for fully sharded states
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This is why a model that cannot fit on one accelerator can still be trained across many accelerators.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
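The per-rank effect of sharding is plain arithmetic. The sketch below mirrors the usual ZeRO-style staging (shard optimizer state, then gradients, then weights), but it is only an estimate under the same byte-count assumptions as before.

```python
def per_rank_gb(params_billion, dp_ranks, shard_opt=True, shard_grads=True, shard_weights=True):
    """Rough per-rank model-state memory with bf16 weights/grads and fp32 Adam moments."""
    p = params_billion * 1e9
    opt, grads, weights = 8 * p, 2 * p, 2 * p
    total = ((opt / dp_ranks if shard_opt else opt)
             + (grads / dp_ranks if shard_grads else grads)
             + (weights / dp_ranks if shard_weights else weights))
    return total / 1e9

print(per_rank_gb(7, dp_ranks=1))   # ~84 GB: too much for a typical 80 GB accelerator
print(per_rank_gb(7, dp_ranks=8))   # ~10.5 GB per rank when everything is sharded
```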
4.5 Offload boundary
Main idea. CPU or NVMe offload trades memory for bandwidth and latency.
Core relation:
$T_\mathrm{step}$ can become transfer-bound
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
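A rough feasibility check for offload compares bytes moved per step with link bandwidth. The 16 GB transfer, 25 GB/s effective bandwidth, and 0.8 s compute time below are placeholder assumptions, not measurements.

```python
def transfer_time_s(bytes_moved, bandwidth_gbps):
    """Lower bound on transfer time; real offload adds latency and contention."""
    return bytes_moved / (bandwidth_gbps * 1e9)

compute_time_s = 0.8                                   # assumed accelerator time per step
offload_s = transfer_time_s(16e9, bandwidth_gbps=25)   # 16 GB moved over a ~25 GB/s link
print(offload_s, "transfer-bound" if offload_s > compute_time_s else "compute-bound")
```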
5. Parallelism Strategies
This part focuses on parallelism strategies as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.
| Subtopic | Operational question | Formula |
|---|---|---|
| Data parallelism | replicate the model and all-reduce gradients | $g=\frac{1}{N_\mathrm{DP}}\sum_{i=1}^{N_\mathrm{DP}} g_i$ |
| Tensor parallelism | split matrix multiplications across devices | $Y=X[W_1\ W_2]$ or $Y=XW_1+XW_2$ depending on layout |
| Pipeline parallelism | place layers on stages and stream micro-batches | $\mathrm{bubble}\approx (p-1)/(m+p-1)$ |
| Sequence parallelism | split sequence-length work when activations are too large | $T$ is partitioned across ranks |
| Parallelism product | large jobs combine data, tensor, pipeline, and sometimes sequence parallelism | $N_\mathrm{devices}=N_\mathrm{DP}\times N_\mathrm{TP}\times N_\mathrm{PP}$ |
5.1 Data parallelism
Main idea. Replicate the model and all-reduce gradients.
Core relation:
$g=\frac{1}{N_\mathrm{DP}}\sum_{i=1}^{N_\mathrm{DP}} g_i$ (computed with an all-reduce)
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
5.2 Tensor parallelism
Main idea. Split matrix multiplications across devices.
Core relation:
$Y=X[W_1\ W_2]$ or $Y=XW_1+XW_2$ depending on layout
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
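A numpy sketch of the two layouts in the relation: a column split concatenates per-device outputs, a row split sums partial products. Shapes are toy values.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 6))          # (tokens, hidden)
W = rng.normal(size=(6, 8))          # full weight matrix

# Column parallel: each device holds a slice of W's output columns.
W1, W2 = W[:, :4], W[:, 4:]
col_parallel = np.concatenate([X @ W1, X @ W2], axis=1)       # Y = X [W1 W2]

# Row parallel: each device holds a slice of W's input rows (and the matching slice of X).
Wa, Wb = W[:3, :], W[3:, :]
row_parallel = X[:, :3] @ Wa + X[:, 3:] @ Wb                   # Y = X1 W1 + X2 W2

print(np.allclose(col_parallel, X @ W), np.allclose(row_parallel, X @ W))  # True True
```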
5.3 Pipeline parallelism
Main idea. Place layers on stages and stream micro-batches.
Core relation:
$\mathrm{bubble}\approx\frac{p-1}{m+p-1}$ for $p$ stages and $m$ micro-batches
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
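A sketch of the standard fill-and-drain bubble estimate for p stages and m micro-batches; treat it as an approximation for simple schedules, since interleaved schedules change the numbers.

```python
def bubble_fraction(stages, micro_batches):
    """Fraction of step time spent idle in a simple fill-drain pipeline schedule."""
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 16, 64):
    print(f"stages=8 micro_batches={m:3d} bubble={bubble_fraction(8, m):.2f}")
```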
5.4 Sequence parallelism
Main idea. Split sequence-length work when activations are too large.
Core relation:
$T$ is partitioned across ranks
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
5.5 Parallelism product
Main idea. Large jobs combine data, tensor, pipeline, and sometimes sequence parallelism.
Core relation:
$N_\mathrm{devices}=N_\mathrm{DP}\times N_\mathrm{TP}\times N_\mathrm{PP}$ (times a sequence-parallel degree when used)
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has $7$ billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}=1.4\times 10^{10}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}=5.6\times 10^{10}$ bytes, or roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
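A small consistency check for the product constraint: the per-axis degrees must tile the device count exactly. The cluster size and degrees below are illustrative.

```python
def check_layout(devices, dp, tp, pp, sp=1):
    """The parallelism degrees must tile the cluster exactly: dp * tp * pp * sp == devices."""
    product = dp * tp * pp * sp
    assert product == devices, f"layout uses {product} ranks but cluster has {devices}"
    return {"data": dp, "tensor": tp, "pipeline": pp, "sequence": sp}

print(check_layout(devices=512, dp=16, tp=8, pp=4))   # 16 * 8 * 4 = 512
```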