Training at Scale, Part 6 (Communication Math) to Why This Matters for AI

6. Communication Math

This part focuses on communication math as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.

Subtopic | Operational question | Formula
All-reduce cost | gradient synchronization costs latency plus bandwidth | $T \approx \alpha \log N + \beta S$
Reduce-scatter and all-gather | sharded training replaces one all-reduce with state movement primitives | $\mathrm{allreduce} = \mathrm{reduce\ scatter} + \mathrm{all\ gather}$
Overlap | hide communication under backward computation when dependencies allow it | $T_\mathrm{step} \approx \max(T_\mathrm{compute}, T_\mathrm{comm})$
Bandwidth hierarchy | intra-node links are much faster than inter-node links | $T_\mathrm{inter} > T_\mathrm{intra}$ for the same payload
Straggler sensitivity | synchronous steps wait for the slowest rank | $T_\mathrm{step} = \max_r T_r$

6.1 All-reduce cost

Main idea. Gradient synchronization costs latency plus bandwidth.

Core relation:

$T \approx \alpha \log N + \beta S$

Here $N$ is the number of participating ranks, $S$ is the message size in bytes, $\alpha$ captures per-message latency, and $\beta$ is the inverse of the effective bandwidth (seconds per byte).

At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.

Worked micro-example. Suppose a dense model has $P = 7$ billion parameters. bf16 weights alone require about $2P$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8P$ bytes, before activations. That is why "weights fit" is not the same as "training fits."

Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.

AI connection. This formula is part of the control surface for a large training run.

Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
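
To make the two terms concrete, here is a minimal back-of-envelope sketch in Python. The latency and bandwidth constants, and the 7B-parameter example, are illustrative assumptions rather than measurements of any particular interconnect.

```python
import math

# Latency + bandwidth model from the text: T ~ alpha * log2(N) + beta * S.
# alpha_s and bytes_per_s are assumed, illustrative interconnect numbers.
def allreduce_seconds(num_ranks, message_bytes, alpha_s=5e-6, bytes_per_s=100e9):
    beta = 1.0 / bytes_per_s                 # seconds per byte
    return alpha_s * math.log2(num_ranks) + beta * message_bytes

# Example: one full bf16 gradient all-reduce for a 7B-parameter model on 64 ranks.
grad_bytes = 7e9 * 2                         # 2 bytes per parameter in bf16
print(f"{allreduce_seconds(64, grad_bytes) * 1e3:.1f} ms per gradient all-reduce")
```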

6.2 Reduce-scatter and all-gather

Main idea. Sharded training replaces one all-reduce with state movement primitives.

Core relation:

$\mathrm{allreduce} = \mathrm{reduce\ scatter} + \mathrm{all\ gather}$

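The identity can be checked numerically. The sketch below simulates ranks with plain NumPy arrays; no real communication library is involved, and the sharding is purely illustrative.

```python
import numpy as np

def simulated_allreduce(per_rank_grads):
    # Every rank ends up with the full sum of all gradients.
    return [sum(per_rank_grads) for _ in per_rank_grads]

def simulated_reduce_scatter_then_allgather(per_rank_grads):
    n = len(per_rank_grads)
    shards = [np.array_split(g, n) for g in per_rank_grads]             # shard each rank's gradient
    reduced = [sum(shards[r][i] for r in range(n)) for i in range(n)]   # reduce-scatter: rank i owns shard i
    full = np.concatenate(reduced)                                      # all-gather: reassemble the shards
    return [full for _ in range(n)]

grads = [np.random.randn(12) for _ in range(4)]
a = simulated_allreduce(grads)
b = simulated_reduce_scatter_then_allgather(grads)
print(np.allclose(a[0], b[0]))   # True: same result, but the sharded path never
                                 # materializes an unreduced full gradient on one rank
```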

6.3 Overlap

Main idea. Hide communication under backward computation when dependencies allow it.

Core relation:

$T_\mathrm{step} \approx \max(T_\mathrm{compute}, T_\mathrm{comm})$

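A minimal sketch of the step-time model, assuming the backward pass can hide gradient communication; the timings are placeholders.

```python
# Step time with and without communication/computation overlap.
def step_time(t_compute_s, t_comm_s, overlap):
    if overlap:
        return max(t_compute_s, t_comm_s)   # hidden: the longer of the two dominates
    return t_compute_s + t_comm_s           # serialized: the costs add up

t_compute, t_comm = 0.80, 0.35              # illustrative seconds per step
print("serialized:", step_time(t_compute, t_comm, overlap=False), "s")
print("overlapped:", step_time(t_compute, t_comm, overlap=True), "s")
```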

6.4 Bandwidth hierarchy

Main idea. Intra-node links are much faster than inter-node links.

Core relation:

$T_\mathrm{inter} > T_\mathrm{intra}$ for the same payload

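A tiny illustration of why the hierarchy matters, with assumed bandwidth numbers rather than vendor specifications.

```python
# Same payload, two links: an assumed fast intra-node link and a slower
# assumed inter-node link.  The numbers are placeholders, not hardware specs.
payload_gb = 14.0                    # e.g. the bf16 weights of a 7B-parameter model
intra_gb_per_s = 300.0               # assumed intra-node bandwidth
inter_gb_per_s = 25.0                # assumed inter-node bandwidth
print(f"intra-node transfer: {payload_gb / intra_gb_per_s:.2f} s")
print(f"inter-node transfer: {payload_gb / inter_gb_per_s:.2f} s")
```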

6.5 Straggler sensitivity

Main idea. Synchronous steps wait for the slowest rank.

Core relation:

$T_\mathrm{step} = \max_r T_r$

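A small sketch showing why the mean of per-rank times is misleading; the timings are invented.

```python
# Synchronous data parallelism waits for the slowest rank: the step time is the
# max, not the mean, of per-rank times.
per_rank_seconds = [0.51, 0.50, 0.52, 0.50, 0.50, 0.93, 0.51, 0.50]  # one straggler
mean_t = sum(per_rank_seconds) / len(per_rank_seconds)
step_t = max(per_rank_seconds)
print(f"mean rank time {mean_t:.2f} s, actual step time {step_t:.2f} s")
```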

7. Compute and Scaling Laws

This part focuses on compute and scaling laws as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.

Subtopic | Operational question | Formula
Training FLOPs estimate | dense transformer training is often approximated by six times parameters times tokens | $C \approx 6ND$
Kaplan-style power laws | loss follows predictable power trends over model, data, and compute in a range | $L(C) = L_\infty + aC^{-\alpha}$
Compute-optimal tradeoff | for a fixed budget, model size and token count must be balanced | $C \approx 6ND$ with both $N$ and $D$ chosen
MFU | model FLOPs utilization compares achieved useful FLOPs to hardware peak | $\mathrm{MFU} = \mathrm{model\ FLOPs/sec} / \mathrm{peak\ FLOPs/sec}$
Inference-aware training | overtraining a smaller model can reduce serving cost even if it is not pure compute-optimal pretraining | $\mathrm{train\ cost} + \mathrm{serve\ cost}$ matters

7.1 Training FLOPs estimate

Main idea. Dense transformer training is often approximated by six times parameters times tokens.

Core relation:

$C \approx 6ND$

Here $N$ is the parameter count, $D$ is the number of training tokens, and $C$ is total training FLOPs; the factor of 6 counts roughly $2ND$ for the forward pass and $4ND$ for the backward pass.

AI connection. This simple estimate is often the first line in a training-budget spreadsheet.

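A budget-spreadsheet style sketch, assuming a per-device peak throughput and an MFU value that are illustrative only.

```python
# Training FLOPs from C ~ 6 * N * D, then a rough wall-clock figure.
n_params   = 7e9            # N: parameters
n_tokens   = 2e12           # D: training tokens
peak_flops = 1e15           # assumed peak FLOPs/s per accelerator (illustrative)
mfu        = 0.40           # assumed model FLOPs utilization
n_devices  = 256

total_flops = 6 * n_params * n_tokens
seconds = total_flops / (n_devices * peak_flops * mfu)
print(f"C ~ {total_flops:.2e} FLOPs, ~ {seconds / 86400:.1f} days on {n_devices} devices")
```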

7.2 Kaplan-style power laws

Main idea. Loss follows predictable power trends over model, data, and compute in a range.

Core relation:

$L(C) = L_\infty + aC^{-\alpha}$

Here $C$ is training compute, $L_\infty$ is the irreducible loss, and $a$, $\alpha$ are constants fitted from smaller runs; the trend is only trustworthy inside the range it was fitted on.

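A toy evaluation of the power law with made-up coefficients; real coefficients come from fitting sweep data.

```python
# Evaluate a toy power-law loss curve L(C) = L_inf + a * C**(-alpha).
L_inf, a, alpha = 1.7, 12.0, 0.05          # illustrative fitted constants
for compute in (1e20, 1e21, 1e22, 1e23):
    loss = L_inf + a * compute ** (-alpha)
    print(f"C = {compute:.0e}  ->  predicted loss {loss:.3f}")
```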

7.3 Compute-optimal tradeoff

Main idea. For a fixed budget, model size and token count must be balanced.

Core relation:

$C \approx 6ND$ with both $N$ and $D$ chosen

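A sketch of the tradeoff for a fixed budget. The loop just walks $D = C/(6N)$; the roughly 20 tokens-per-parameter line in the comment is an often-quoted rough heuristic, not a law.

```python
# For a fixed budget C, choosing N fixes D = C / (6N).
budget_flops = 1e23
for n_params in (3e9, 7e9, 13e9, 30e9, 70e9):
    d_tokens = budget_flops / (6 * n_params)
    # Rough heuristic: compute-optimal points often land near ~20 tokens/parameter.
    print(f"N = {n_params:.0e}  ->  D = {d_tokens:.2e} tokens "
          f"({d_tokens / n_params:.0f} tokens per parameter)")
```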

7.4 MFU

Main idea. Model FLOPs utilization compares achieved useful FLOPs to hardware peak.

Core relation:

$\mathrm{MFU} = \mathrm{model\ FLOPs/sec} / \mathrm{peak\ FLOPs/sec}$

AI connection. This separates a model that is slow because it is mathematically large from a run that is slow because the system is wasting hardware.

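An MFU calculation from achieved throughput, using the $6ND$ per-token estimate; the peak and throughput numbers are assumptions.

```python
# MFU = achieved model FLOPs/s divided by aggregate peak FLOPs/s.
n_params           = 7e9
tokens_per_second  = 500_000         # measured training throughput (assumed)
n_devices          = 64
peak_flops_per_dev = 1e15            # assumed hardware peak per device

model_flops_per_s = 6 * n_params * tokens_per_second
mfu = model_flops_per_s / (n_devices * peak_flops_per_dev)
print(f"MFU ~ {mfu:.1%}")
```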

7.5 Inference-aware training

Main idea. Overtraining a smaller model can reduce serving cost even if it is not pure compute-optimal pretraining.

Core relation:

$\mathrm{train\ cost} + \mathrm{serve\ cost}$ matters

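A total-cost sketch under placeholder prices and token volumes, using roughly $2N$ FLOPs per served token for the forward pass; every number here is an assumption for illustration.

```python
# Total-cost view: a smaller, overtrained model can win once serving volume is large.
def total_cost_usd(n_params, train_tokens, serve_tokens, usd_per_flop=1e-18):
    train_flops = 6 * n_params * train_tokens
    serve_flops = 2 * n_params * serve_tokens   # ~2N FLOPs per served token (forward only)
    return (train_flops + serve_flops) * usd_per_flop

lifetime_served = 5e13                          # assumed lifetime served tokens
print(f"70B near compute-optimal: ${total_cost_usd(70e9, 1.4e12, lifetime_served):,.0f}")
print(f" 7B overtrained:          ${total_cost_usd(7e9, 6.0e12, lifetime_served):,.0f}")
```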

8. Numerical Stability

This part focuses on numerical stability as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.

Subtopic | Operational question | Formula
Mixed precision | bf16/fp16 reduce memory and increase throughput but require stable reductions | $\theta$ may be bf16 while optimizer states stay fp32
Loss scaling | fp16 may need scaling to avoid underflow | $\tilde L = sL, \quad \tilde g = sg$
Attention stability | score scaling and stable softmax matter more at long sequence lengths | $QK^\top/\sqrt{d}$
Loss spikes | spikes can come from data, optimizer state, numerical overflow, or synchronization problems | $L_t \gg \mathrm{median}(L_{t-k:t})$
Resume correctness | checkpoint reload must restore model, optimizer, scheduler, RNG, and dataloader state | $\theta, m, v, t, \mathrm{rng}$ all matter

8.1 Mixed precision

Main idea. bf16 and fp16 reduce memory and increase throughput but require stable reductions.

Core relation:

$\theta$ may be bf16 while optimizer states stay fp32

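A per-parameter byte count under one common mixed-precision layout (bf16 weights and gradients, fp32 master weights and Adam moments). Whether a given stack keeps fp32 master weights varies, so treat the layout as an assumption.

```python
# Bytes of training state per parameter under an assumed mixed-precision layout.
def training_bytes_per_param(bf16_weight=2, bf16_grad=2, fp32_master=4, fp32_m=4, fp32_v=4):
    return bf16_weight + bf16_grad + fp32_master + fp32_m + fp32_v

n_params = 7e9
per_param = training_bytes_per_param()
print(f"{per_param} bytes/param -> ~{per_param * n_params / 1e9:.0f} GB before activations")
```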

8.2 Loss scaling

Main idea. fp16 may need loss scaling to avoid gradient underflow.

Core relation:

$\tilde L = sL, \quad \tilde g = sg$

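A conceptual dynamic loss-scaling step in the style fp16 training uses, not any framework's real API; real implementations usually grow the scale only after a run of successful steps, while this sketch grows it every good step for brevity.

```python
import math

def scaled_step(params, grads_scaled, scale, lr=1e-3):
    grads = [g / scale for g in grads_scaled]              # undo the loss scale
    if any(math.isnan(g) or math.isinf(g) for g in grads):
        return params, scale / 2, False                    # overflow: skip step, halve scale
    params = [p - lr * g for p, g in zip(params, grads)]
    return params, min(scale * 2, 2.0 ** 16), True         # grow the scale cautiously

params, scale = [0.5, -1.2], 2.0 ** 10
true_grads = [0.03, -0.01]
grads_scaled = [g * scale for g in true_grads]             # what backward on s*L would produce
params, scale, applied = scaled_step(params, grads_scaled, scale)
print(params, scale, applied)
```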

8.3 Attention stability

Main idea. Score scaling and stable softmax matter more at long sequence lengths.

Core relation:

$QK^\top/\sqrt{d}$

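A NumPy sketch of the two standard ingredients: scaling scores by $1/\sqrt d$ and subtracting the row maximum before the softmax.

```python
import numpy as np

def stable_attention_weights(Q, K):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # QK^T / sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)      # shift so exp() cannot overflow
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 64)), rng.normal(size=(4, 64))
W = stable_attention_weights(Q.astype(np.float32), K.astype(np.float32))
print(W.sum(axis=-1))                                  # each row sums to 1
```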

8.4 Loss spikes

Main idea. Spikes can come from data, optimizer state, numerical overflow, or synchronization problems.

Core relation:

$L_t \gg \mathrm{median}(L_{t-k:t})$

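A simple trailing-median spike detector; the window size and threshold are arbitrary choices, not tuned values.

```python
from statistics import median

def find_spikes(losses, window=20, ratio=1.5):
    spikes = []
    for t in range(window, len(losses)):
        ref = median(losses[t - window:t])            # robust trailing reference
        if losses[t] > ratio * ref:
            spikes.append((t, losses[t], ref))
    return spikes

trace = [2.5 - 0.001 * t for t in range(200)]
trace[150] = 7.9                                      # injected spike
print(find_spikes(trace))
```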

8.5 Resume correctness

Main idea. Checkpoint reload must restore model, optimizer, scheduler, RNG, and dataloader state.

Core relation:

$\theta, m, v, t, \mathrm{rng}$ all matter

AI connection. A bad resume can silently fork the training trajectory even when the checkpoint file loads.

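A sketch of what a complete checkpoint has to carry for a faithful resume; the field names are illustrative, not any framework's schema.

```python
import pickle, random

def save_checkpoint(path, theta, m, v, step, lr_state, dataloader_pos):
    state = {
        "theta": theta, "m": m, "v": v,            # model weights + Adam moments
        "step": step, "lr_state": lr_state,        # scheduler depends on step/state
        "rng": random.getstate(),                  # data order and dropout depend on this
        "dataloader_pos": dataloader_pos,          # avoid replaying or skipping tokens
    }
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        state = pickle.load(f)
    random.setstate(state["rng"])                  # restore RNG, not just tensors
    return state

save_checkpoint("ckpt.pkl", [0.1], [0.0], [0.0], 1000, {"eta": 3e-4}, 123456)
print(load_checkpoint("ckpt.pkl")["step"])
```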

9. Data and Checkpoint Operations

This part focuses on data and checkpoint operations as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.

Subtopic | Operational question | Formula
Token budget | data is counted in tokens, not documents | $D = \sum_{\mathrm{docs}} \mathrm{tokens}(\mathrm{doc})$
Packing | short examples are packed to reduce padding waste | $\mathrm{utilization} = \mathrm{real\ tokens} / \mathrm{allocated\ tokens}$
Deduplication and filtering | bad repeated data can improve train loss while hurting generalization | $p_\mathrm{train}$ can drift from the desired $p_\mathrm{deploy}$
Checkpoint frequency | the optimal interval balances lost work and checkpoint overhead | $\mathrm{overhead} \approx T_\mathrm{ckpt}/K + \mathrm{failure\ loss}(K)$
Validation cadence | held-out loss catches overfitting, data bugs, and regression after resume | $L_\mathrm{val}$ is the early warning signal

9.1 Token budget

Main idea. Data is counted in tokens, not documents.

Core relation:

$D = \sum_{\mathrm{docs}} \mathrm{tokens}(\mathrm{doc})$

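A trivial but useful habit: compute the token budget directly from per-document token counts. The counts below are made-up stand-ins for real tokenizer output.

```python
# Token budget from a corpus manifest, compared against the planned budget.
doc_token_counts = [1_200, 850, 15_000, 430, 2_900]   # tokens(doc) per document
D = sum(doc_token_counts)
target_D = 2_000_000_000_000                          # planned training budget in tokens
print(f"corpus so far: {D:,} tokens ({D / target_D:.2e} of the target budget)")
```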

9.2 Packing

Main idea. Short examples are packed to reduce padding waste.

Core relation:

$\mathrm{utilization} = \mathrm{real\ tokens} / \mathrm{allocated\ tokens}$

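A utilization comparison with and without greedy first-fit packing, on toy document lengths.

```python
seq_len = 2048
doc_lengths = [300, 1700, 900, 120, 1500, 600, 2000, 80]
real_tokens = sum(doc_lengths)

# Unpacked: one document per row, the remainder of every row is padding.
unpacked_alloc = len(doc_lengths) * seq_len

# Packed: greedy first-fit of documents into rows of at most seq_len tokens.
rows = []
for n in doc_lengths:
    for i, used in enumerate(rows):
        if used + n <= seq_len:
            rows[i] += n
            break
    else:
        rows.append(n)
packed_alloc = len(rows) * seq_len

print(f"unpacked utilization: {real_tokens / unpacked_alloc:.1%}")
print(f"packed utilization:   {real_tokens / packed_alloc:.1%}")
```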

9.3 Deduplication and filtering

Main idea. Bad repeated data can improve train loss while hurting generalization.

Core relation:

$p_\mathrm{train}$ can drift from the desired $p_\mathrm{deploy}$

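A first-pass exact-duplicate filter that hashes normalized text; near-duplicate detection needs more machinery, and the corpus here is a toy.

```python
import hashlib

docs = ["The cat sat.", "the  cat sat.", "An unrelated document.", "The cat sat."]
seen, kept = set(), []
for doc in docs:
    key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
    if key not in seen:                 # keep the first copy, drop later verbatim repeats
        seen.add(key)
        kept.append(doc)
print(kept)
```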

9.4 Checkpoint frequency

Main idea. The optimal interval balances lost work and checkpoint overhead.

Core relation:

$\mathrm{overhead} \approx T_\mathrm{ckpt}/K + \mathrm{failure\ loss}(K)$

Here $K$ is the number of steps between checkpoints: saving more often raises the amortized write cost, while saving less often raises the expected rework after a failure.

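A sweep of the checkpoint interval under the simple model above; the step time, checkpoint time, and failure rate are all assumptions.

```python
t_step_s = 2.0            # seconds per training step (assumed)
t_ckpt_s = 120.0          # seconds to write a checkpoint (assumed)
mtbf_s   = 6 * 3600.0     # mean time between failures (assumed: 6 hours)

for k in (50, 200, 500, 1000, 5000, 20000):     # steps between checkpoints
    ckpt_overhead = t_ckpt_s / k                              # amortized write cost per step
    lost_work     = (k * t_step_s / 2) * (t_step_s / mtbf_s)  # expected rework per step
    print(f"K={k:>6}: overhead per step ~ {ckpt_overhead + lost_work:.3f} s")
```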

9.5 Validation cadence

Main idea. Held-out loss catches overfitting, data bugs, and regression after resume.

Core relation:

$L_\mathrm{val}$ is the early warning signal


10. Operational Debugging

This part focuses on operational debugging as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.

Subtopic | Operational question | Formula
Shape and mask checks | wrong labels or masks can produce plausible but meaningless loss | $\mathrm{target}_i = \mathrm{input}_{i+1}$
Gradient norm traces | track global norms before and after clipping | $\Vert g \Vert_2$
Learning-rate traces | optimizer behavior must match the intended schedule | $\eta_t$
Throughput decomposition | separate dataloader, forward, backward, communication, optimizer, and checkpoint time | $T_\mathrm{step} = \sum_j T_j$
Reproducible small run | scale only after a small deterministic run learns and resumes correctly | $L_{100} < L_0$ is a smoke test

10.1 Shape and mask checks

Main idea. Wrong labels or masks can produce plausible but meaningless loss.

Core relation:

$\mathrm{target}_i = \mathrm{input}_{i+1}$

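A quick sanity check that targets are inputs shifted by one and that padding is excluded from the loss; the pad convention here (0) is a toy stand-in for a real ignore index.

```python
import numpy as np

PAD = 0
input_ids = np.array([5, 17, 42, 9, PAD, PAD])
targets   = np.array([17, 42, 9, PAD, PAD, PAD])   # what the dataloader produced

shifted_ok = np.array_equal(targets[:-1], input_ids[1:])   # target_i == input_{i+1}
loss_mask  = (targets != PAD)                              # pads must not contribute to loss
print("shift correct:", shifted_ok, "| tokens in loss:", int(loss_mask.sum()))
```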

10.2 Gradient norm traces

Main idea. Track global norms before and after clipping.

Core relation:

$\Vert g \Vert_2$

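A global-norm clip with both the pre-clip and post-clip norms printed, which is exactly what is worth logging at every step.

```python
import math

def global_norm(grads):
    return math.sqrt(sum(g * g for g in grads))

def clip_by_global_norm(grads, max_norm=1.0):
    norm = global_norm(grads)
    scale = min(1.0, max_norm / (norm + 1e-6))     # only shrink, never amplify
    return [g * scale for g in grads], norm

grads = [0.3, -2.1, 0.7, 1.4]
clipped, pre_norm = clip_by_global_norm(grads, max_norm=1.0)
print(f"pre-clip {pre_norm:.3f}, post-clip {global_norm(clipped):.3f}")
```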

10.3 Learning-rate traces

Main idea. Optimizer behavior must match the intended schedule.

Core relation:

$\eta_t$

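A warmup-then-cosine schedule whose printed values can be compared against what the optimizer actually reports; all hyperparameters are illustrative.

```python
import math

def eta(step, peak_lr=3e-4, warmup=2000, total=100_000, min_lr=3e-5):
    if step < warmup:
        return peak_lr * step / warmup                      # linear warmup
    progress = (step - warmup) / max(1, total - warmup)     # 0 -> 1 over the decay
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 1000, 2000, 50_000, 100_000):
    print(s, f"{eta(s):.2e}")
```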

10.4 Throughput decomposition

Main idea. Separate dataloader, forward, backward, communication, optimizer, and checkpoint time.

Core relation:

$T_\mathrm{step} = \sum_j T_j$

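A decomposition with illustrative per-step timings; the residual between the measured step time and the sum of parts is itself a useful signal.

```python
components_s = {                       # illustrative per-step timings
    "dataloader": 0.04, "forward": 0.31, "backward": 0.62,
    "grad_comm": 0.18, "optimizer": 0.07, "ckpt_amortized": 0.02,
}
measured_step_s = 1.30
accounted = sum(components_s.values())
for name, t in sorted(components_s.items(), key=lambda kv: -kv[1]):
    print(f"{name:>15}: {t:.2f} s ({t / measured_step_s:.0%})")
print(f"{'unaccounted':>15}: {measured_step_s - accounted:.2f} s")
```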

10.5 Reproducible small run

Main idea. Scale only after a small deterministic run learns and resumes correctly.

Core relation:

$L_{100} < L_0$ is a smoke test

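A self-contained smoke test on a toy scalar problem: the run must learn over 100 steps, and resuming from a mid-run state must reproduce the uninterrupted trajectory exactly. The real version swaps in your training loop and checkpoint path.

```python
def run(steps, state=None):
    w, losses = (5.0 if state is None else state), []
    for _ in range(steps):
        loss = (w - 1.0) ** 2        # toy objective with minimum at w = 1
        w -= 0.1 * 2 * (w - 1.0)     # plain SGD step
        losses.append(loss)
    return w, losses

w_full, losses_full = run(100)
w_ckpt, losses_a = run(50)                      # first half of the run
_, losses_b = run(50, state=w_ckpt)             # resume from the "checkpoint"
assert losses_full[99] < losses_full[0]         # it learns
assert losses_a + losses_b == losses_full       # resume does not fork the trajectory
print("smoke test passed")
```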


Practice Exercises

  1. Compute one AdamW update by hand for a scalar parameter (a worked sketch follows this list).
  2. Clip a gradient vector to a target norm.
  3. Build a warmup plus cosine learning-rate schedule.
  4. Compute effective batch size in tokens.
  5. Estimate memory for Adam training with and without sharding.
  6. Compute a pipeline bubble fraction.
  7. Determine tensor-parallel shard shapes for a linear layer.
  8. Estimate training FLOPs from parameter and token counts.
  9. Compute model FLOPs utilization from achieved throughput.
  10. Create a launch checklist for a small reproducible training run.
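
A worked version of exercise 1, using standard default hyperparameters, with every intermediate printed so it can be checked against a hand calculation.

```python
import math

theta, g = 0.5, 0.2
m, v, t = 0.0, 0.0, 1
lr, beta1, beta2, eps, wd = 1e-3, 0.9, 0.999, 1e-8, 0.01

m = beta1 * m + (1 - beta1) * g            # first moment
v = beta2 * v + (1 - beta2) * g * g        # second moment
m_hat = m / (1 - beta1 ** t)               # bias correction
v_hat = v / (1 - beta2 ** t)
# AdamW: decoupled weight decay is applied directly to theta.
theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)

print(f"m={m:.4f} v={v:.6f} m_hat={m_hat:.4f} v_hat={v_hat:.6f} theta={theta:.6f}")
```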

Why This Matters for AI

Good LLM training is not only about choosing a model architecture. The optimizer can diverge, the memory plan can be impossible, the communication plan can waste the cluster, the data stream can repeat contaminated text, and the checkpoint can fail to restore optimizer state. The mathematics in this section lets you reason about those failures before the run burns budget.
