Training at Scale, Parts 6-10: Communication Math to Why This Matters for AI
6. Communication Math
This part focuses on communication math as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.
| Subtopic | Operational question | Formula |
|---|---|---|
| All-reduce cost | gradient synchronization costs latency plus bandwidth | $T_\mathrm{allreduce}\approx 2(p-1)\,\alpha+\frac{2(p-1)}{p}\cdot\frac{S}{B}$ |
| Reduce-scatter and all-gather | sharded training replaces one all-reduce with state movement primitives | $T_\mathrm{allreduce}=T_\mathrm{reduce\text{-}scatter}+T_\mathrm{all\text{-}gather}$ |
| Overlap | hide communication under backward computation when dependencies allow it | $T_\mathrm{step}\approx\max(T_\mathrm{backward},T_\mathrm{comm})$ |
| Bandwidth hierarchy | intra-node links are much faster than inter-node links | $T_\mathrm{inter}>T_\mathrm{intra}$ for the same payload |
| Straggler sensitivity | synchronous steps wait for the slowest rank | $T_\mathrm{step}=\max_i T_i$ |
6.1 All-reduce cost
Main idea. Gradient synchronization costs latency plus bandwidth.
Core relation:
$T_\mathrm{allreduce}\approx 2(p-1)\,\alpha+\frac{2(p-1)}{p}\cdot\frac{S}{B}$ for a ring all-reduce over $p$ ranks with message size $S$ bytes, per-link bandwidth $B$, and per-hop latency $\alpha$.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
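A minimal Python sketch of the cost relation above, assuming a ring all-reduce; the bandwidth and latency values below are illustrative placeholders, not measurements:

```python
def ring_allreduce_seconds(size_bytes, ranks, link_bandwidth_Bps, hop_latency_s):
    """Estimate ring all-reduce time: latency term plus bandwidth term."""
    latency = 2 * (ranks - 1) * hop_latency_s
    bandwidth = (2 * (ranks - 1) / ranks) * size_bytes / link_bandwidth_Bps
    return latency + bandwidth

# Illustrative numbers: 14 GB of bf16 gradients, 8 ranks,
# 100 GB/s per link, 10 microseconds per hop (assumed values).
print(ring_allreduce_seconds(14e9, 8, 100e9, 10e-6))
```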
6.2 Reduce-scatter and all-gather
Main idea. Sharded training replaces one all-reduce with state movement primitives.
Core relation:
$T_\mathrm{allreduce}=T_\mathrm{reduce\text{-}scatter}+T_\mathrm{all\text{-}gather}$, with each phase moving about $\frac{p-1}{p}S$ bytes per rank, which is why sharded training can schedule the two primitives independently.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
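A minimal sketch of the bytes each rank moves under this decomposition, assuming a ring schedule; the payload size is illustrative:

```python
def bytes_per_rank(size_bytes, ranks):
    """Bytes a single rank sends for each collective under a ring schedule."""
    reduce_scatter = (ranks - 1) / ranks * size_bytes
    all_gather = (ranks - 1) / ranks * size_bytes
    all_reduce = reduce_scatter + all_gather  # same total traffic, split into two phases
    return {"reduce_scatter": reduce_scatter,
            "all_gather": all_gather,
            "all_reduce": all_reduce}

print(bytes_per_rank(14e9, 8))
```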
6.3 Overlap
Main idea. Hide communication under backward computation when dependencies allow it.
Core relation:
$T_\mathrm{step}\approx\max(T_\mathrm{backward},T_\mathrm{comm})$ with full overlap, instead of $T_\mathrm{backward}+T_\mathrm{comm}$ with none.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
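A minimal sketch of the difference overlap makes, treating perfect overlap as an upper bound; the timings are placeholders:

```python
def step_time(backward_s, comm_s, overlap_fraction):
    """Step time when a fraction of communication hides under backward compute."""
    hidden = min(comm_s * overlap_fraction, backward_s)
    exposed = comm_s - hidden
    return backward_s + exposed

print(step_time(0.80, 0.30, 0.0))  # no overlap: 1.10 s
print(step_time(0.80, 0.30, 1.0))  # full overlap: 0.80 s
```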
6.4 Bandwidth hierarchy
Main idea. Intra-node links are much faster than inter-node links.
Core relation:
$T_\mathrm{inter}>T_\mathrm{intra}$ for the same payload.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
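A minimal sketch comparing transfer time across the hierarchy; the intra-node and inter-node bandwidths below are assumed round numbers, not vendor specifications:

```python
def transfer_seconds(size_bytes, bandwidth_Bps):
    """Time to move a payload at a given sustained bandwidth."""
    return size_bytes / bandwidth_Bps

payload = 14e9                           # 14 GB payload
print(transfer_seconds(payload, 400e9))  # intra-node link, ~400 GB/s (assumed)
print(transfer_seconds(payload, 50e9))   # inter-node link, ~50 GB/s (assumed)
```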
6.5 Straggler sensitivity
Main idea. Synchronous steps wait for the slowest rank.
Core relation:
$T_\mathrm{step}=\max_i T_i$ over ranks $i$, so one slow rank sets the pace for the whole job.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
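A minimal sketch showing how the max over ranks drifts above the mean as world size grows; the per-rank jitter model is an assumption for illustration only:

```python
import random

def synchronous_step_time(ranks, base_s=1.0, jitter_s=0.05, seed=0):
    """Step time is the max over per-rank times; the jitter is illustrative."""
    rng = random.Random(seed)
    times = [base_s + rng.uniform(0.0, jitter_s) for _ in range(ranks)]
    return max(times), sum(times) / ranks

for world in (8, 64, 512):
    worst, mean = synchronous_step_time(world)
    print(world, round(worst, 4), round(mean, 4))
```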
7. Compute and Scaling Laws
This part focuses on compute and scaling laws as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.
| Subtopic | Operational question | Formula |
|---|---|---|
| Training FLOPs estimate | dense transformer training is often approximated by six times parameters times tokens | $C\approx 6ND$ |
| Kaplan-style power laws | loss follows predictable power trends over model, data, and compute in a range | $L(N)\approx(N_c/N)^{\alpha_N}$, $L(D)\approx(D_c/D)^{\alpha_D}$ |
| Compute-optimal tradeoff | for a fixed budget, model size and token count must be balanced | $C\approx 6ND$ with both $N$ and $D$ chosen |
| MFU | model FLOPs utilization compares achieved useful FLOPs to hardware peak | $\mathrm{MFU}=\frac{6N\times\mathrm{tokens/s}}{\mathrm{peak\ FLOPs/s}}$ |
| Inference-aware training | overtraining a smaller model can reduce serving cost even if it is not pure compute-optimal pretraining | $\mathrm{train\ cost}+\mathrm{serve\ cost}$ matters |
7.1 Training FLOPs estimate
Main idea. Dense transformer training is often approximated by six times parameters times tokens.
Core relation:
$C\approx 6ND$ training FLOPs for a dense transformer with $N$ parameters trained on $D$ tokens (roughly $2ND$ for the forward pass and $4ND$ for the backward pass).
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This simple estimate is often the first line in a training-budget spreadsheet.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
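A minimal sketch of the budget arithmetic; the parameter and token counts are illustrative:

```python
def training_flops(params, tokens):
    """Rule-of-thumb dense-transformer training cost: C ~ 6 * N * D."""
    return 6 * params * tokens

# Illustrative: a 7B-parameter model trained on 2T tokens.
print(f"{training_flops(7e9, 2e12):.2e} FLOPs")
```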
7.2 Kaplan-style power laws
Main idea. Loss follows predictable power trends over model, data, and compute in a range.
Core relation:
$L(N)\approx\left(\frac{N_c}{N}\right)^{\alpha_N}$ and $L(D)\approx\left(\frac{D_c}{D}\right)^{\alpha_D}$ within a wide but bounded range, so loss falls as a power law in parameters and in tokens when the other factor is not the bottleneck.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
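A minimal sketch of fitting such a power law to (size, loss) pairs in log-log space; the sample points and the fitted exponent are illustrative, not published coefficients:

```python
import numpy as np

# Illustrative (parameters, validation loss) pairs, NOT real measurements.
sizes = np.array([1e8, 3e8, 1e9, 3e9])
losses = np.array([3.20, 2.95, 2.72, 2.52])

# L(N) ~ (Nc / N)^alpha  =>  log L is linear in log N with slope -alpha.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
alpha = -slope
print("fitted exponent alpha_N ~", round(alpha, 3))
print("extrapolated loss at 7e9 params ~",
      round(float(np.exp(intercept + slope * np.log(7e9))), 3))
```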
7.3 Compute-optimal tradeoff
Main idea. For a fixed budget, model size and token count must be balanced.
Core relation:
$C\approx 6ND$ with both $N$ and $D$ chosen.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
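A minimal sketch of the tradeoff: for a fixed compute budget, every choice of $N$ implies a token count $D=C/(6N)$. The budget and candidate sizes are illustrative:

```python
def tokens_for_budget(total_flops, params):
    """Tokens affordable for a given model size under C ~ 6 * N * D."""
    return total_flops / (6 * params)

budget = 1e23  # illustrative compute budget in FLOPs
for n in (1e9, 3e9, 7e9, 13e9, 30e9):
    print(f"N={n:.0e}  D={tokens_for_budget(budget, n):.2e} tokens")
```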
7.4 MFU
Main idea. Model FLOPs utilization compares achieved useful FLOPs to hardware peak.
Core relation:
$\mathrm{MFU}=\frac{6N\times\mathrm{tokens/s}}{\mathrm{peak\ FLOPs/s}}$, achieved useful model FLOPs per second divided by the hardware peak.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This separates a slow model because it is mathematically large from a slow run because the system is wasting hardware.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
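A minimal sketch of the MFU computation; the throughput and peak numbers are assumed values for illustration:

```python
def mfu(params, tokens_per_second, peak_flops_per_second, flops_per_token_factor=6):
    """Model FLOPs utilization: useful training FLOPs/s over hardware peak FLOPs/s."""
    achieved = flops_per_token_factor * params * tokens_per_second
    return achieved / peak_flops_per_second

# Illustrative: 7B model, 100k tokens/s across the job, 8 accelerators at 1e15 FLOP/s each.
print(round(mfu(7e9, 1.0e5, 8 * 1e15), 3))
```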
7.5 Inference-aware training
Main idea. Overtraining a smaller model can reduce serving cost even if it is not pure compute-optimal pretraining.
Core relation:
$\mathrm{train\ cost}+\mathrm{serve\ cost}$ matters.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
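A minimal sketch of the lifetime-cost comparison, using $6ND$ for training FLOPs and roughly $2N$ FLOPs per generated token for dense inference; the model sizes and token volumes are assumptions:

```python
def lifetime_flops(params, train_tokens, served_tokens):
    """Total FLOPs over the model's life: training plus dense inference."""
    return 6 * params * train_tokens + 2 * params * served_tokens

served = 5e13  # assumed lifetime serving volume in tokens
print(f"compute-optimal-ish 13B: {lifetime_flops(13e9, 0.3e12, served):.2e}")
print(f"overtrained smaller 7B:  {lifetime_flops(7e9, 2.0e12, served):.2e}")
```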
8. Numerical Stability
This part focuses on numerical stability as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.
| Subtopic | Operational question | Formula |
|---|---|---|
| Mixed precision | bf16/fp16 reduce memory and increase throughput but require stable reductions | $\theta$ may be bf16 while optimizer states stay fp32 |
| Loss scaling | fp16 may need scaling to avoid underflow | $g=\frac{1}{s}\nabla_\theta(s\,L)$ |
| Attention stability | score scaling and stable softmax matter more at long sequence lengths | $\mathrm{softmax}\!\left(QK^\top/\sqrt{d_k}\right)$ with max subtraction |
| Loss spikes | spikes can come from data, optimizer state, numerical overflow, or synchronization problems | |
| Resume correctness | checkpoint reload must restore model, optimizer, scheduler, RNG, and dataloader state | $\theta,m,v,t,\mathrm{rng}$ all matter |
8.1 Mixed precision
Main idea. bf16 and fp16 reduce memory and increase throughput but require numerically stable reductions.
Core relation:
$\theta$ may be bf16 while optimizer states stay fp32.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
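A minimal sketch of the per-parameter memory accounting this policy implies. It assumes bf16 weights and gradients with fp32 master weights and Adam moments, which is one common mixed-precision layout rather than the only one:

```python
def mixed_precision_bytes_per_param():
    """Bytes per parameter for one common bf16 + fp32-optimizer layout (assumed)."""
    return {
        "bf16 weights": 2,
        "bf16 gradients": 2,
        "fp32 master weights": 4,
        "fp32 Adam m": 4,
        "fp32 Adam v": 4,
    }

breakdown = mixed_precision_bytes_per_param()
params = 7e9
total = sum(breakdown.values()) * params
print(breakdown, f"total ~ {total / 1e9:.0f} GB for a 7B model (before activations)")
```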
8.2 Loss scaling
Main idea. fp16 may need loss scaling to avoid gradient underflow.
Core relation:
$g=\frac{1}{s}\nabla_\theta(s\,L)$: scale the loss by $s$ before the backward pass so small gradient values survive fp16's limited range, then unscale before the update.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
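A minimal numpy sketch of why scaling helps: a tiny gradient value flushes to zero in fp16 but survives when the loss (and hence the gradient) is scaled up first and unscaled in fp32 afterwards. The specific magnitudes are illustrative:

```python
import numpy as np

true_grad = 1e-8                          # tiny gradient, below fp16's smallest subnormal
scale = 2.0 ** 16

unscaled = np.float16(true_grad)          # underflows to 0.0
scaled = np.float16(true_grad * scale)    # representable after scaling
recovered = np.float32(scaled) / scale    # unscale in fp32

print(unscaled, float(recovered))
```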
8.3 Attention stability
Main idea. Score scaling and stable softmax matter more at long sequence lengths.
Core relation:
$\mathrm{scores}=\frac{QK^\top}{\sqrt{d_k}}$ and $\mathrm{softmax}(x)_i=\frac{e^{x_i-\max_j x_j}}{\sum_k e^{x_k-\max_j x_j}}$, so scores stay in range and the softmax cannot overflow.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
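A minimal numpy sketch of the max-subtracted softmax next to the naive version; the example logits are chosen to overflow fp32 in the naive form:

```python
import numpy as np

def softmax_stable(x):
    """Subtract the rowwise max before exponentiating to avoid overflow."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

scores = np.array([[1000.0, 999.0, 998.0]], dtype=np.float32)
# Naive form overflows (emits warnings) and yields nan.
naive = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
print(naive)
print(softmax_stable(scores))
```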
8.4 Loss spikes
Main idea. Spikes can come from data, optimizer state, numerical overflow, or synchronization problems.
Core relation:
a spike is a sudden jump in $L_t$ (or in the gradient norm) relative to its recent history with no matching change in the data stream; there is no single closed-form test.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
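A minimal sketch of a spike flag based on recent history; the window length and threshold are arbitrary illustrative choices, not tuned values:

```python
from collections import deque

def make_spike_detector(window=50, ratio=1.5):
    """Flag a step whose loss exceeds `ratio` times the median of the recent window."""
    history = deque(maxlen=window)
    def check(loss):
        spike = len(history) == window and loss > ratio * sorted(history)[window // 2]
        history.append(loss)
        return spike
    return check

check = make_spike_detector()
losses = [2.0 - 0.001 * t for t in range(100)] + [5.0]   # smooth decay, then a jump
print([t for t, l in enumerate(losses) if check(l)])      # -> [100]
```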
8.5 Resume correctness
Main idea. Checkpoint reload must restore model, optimizer, scheduler, RNG, and dataloader state.
Core relation:
$\theta,m,v,t,\mathrm{rng}$ all matter.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. A bad resume can silently fork the training trajectory even when the checkpoint file loads.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
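A minimal sketch of what a full checkpoint payload needs to carry for an exact resume; the dictionary keys are illustrative names, not a specific framework's API:

```python
import random

def make_checkpoint(model_state, opt_m, opt_v, step, lr_sched_state, data_position):
    """Everything needed to resume: theta, m, v, t, RNG, scheduler, and data cursor."""
    return {
        "model": model_state,           # theta
        "adam_m": opt_m,                # first moment
        "adam_v": opt_v,                # second moment
        "step": step,                   # t, drives bias correction and the LR schedule
        "lr_scheduler": lr_sched_state,
        "rng": random.getstate(),       # plus any framework and device RNG states
        "data_position": data_position, # so the loader does not replay or skip tokens
    }

ckpt = make_checkpoint({"w": [0.1, 0.2]}, {"w": [0.0, 0.0]}, {"w": [0.0, 0.0]},
                       step=1000, lr_sched_state={"t": 1000}, data_position=123456)
print(sorted(ckpt.keys()))
```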
9. Data and Checkpoint Operations
This part focuses on data and checkpoint operations as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.
| Subtopic | Operational question | Formula |
|---|---|---|
| Token budget | data is counted in tokens, not documents | $D=\mathrm{steps}\times B_\mathrm{seq}\times L_\mathrm{seq}$ |
| Packing | short examples are packed to reduce padding waste | $\mathrm{efficiency}=\frac{\mathrm{non\text{-}pad\ tokens}}{B\cdot L_\mathrm{seq}}$ |
| Deduplication and filtering | bad repeated data can improve train loss while hurting generalization | $p_\mathrm{train}$ can drift from desired $p_\mathrm{deploy}$ |
| Checkpoint frequency | the optimal interval balances lost work and checkpoint overhead | $\tau^\ast\approx\sqrt{2\,\delta\,M}$ |
| Validation cadence | held-out loss catches overfitting, data bugs, and regression after resume | $L_\mathrm{val}$ is the early warning signal |
9.1 Token budget
Main idea. Data is counted in tokens, not documents.
Core relation:
$D=\mathrm{steps}\times B_\mathrm{seq}\times L_\mathrm{seq}$, the number of optimizer steps times sequences per global batch times tokens per sequence.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
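A minimal sketch of the token-budget arithmetic in both directions; the batch and schedule numbers are illustrative:

```python
def total_tokens(steps, sequences_per_batch, sequence_length):
    """Total training tokens implied by a schedule."""
    return steps * sequences_per_batch * sequence_length

def steps_needed(token_budget, sequences_per_batch, sequence_length):
    """Optimizer steps needed to consume a token budget."""
    return token_budget / (sequences_per_batch * sequence_length)

print(f"{total_tokens(500_000, 1024, 4096):.2e} tokens")
print(f"{steps_needed(2e12, 1024, 4096):,.0f} steps for a 2T-token budget")
```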
9.2 Packing
Main idea. Short examples are packed to reduce padding waste.
Core relation:
$\mathrm{efficiency}=\frac{\mathrm{non\text{-}pad\ tokens}}{B\cdot L_\mathrm{seq}}$, and packing several short examples into one sequence pushes this ratio toward 1.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
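A minimal sketch of a greedy first-fit packer and the efficiency it recovers; the example lengths are illustrative:

```python
def greedy_pack(lengths, seq_len):
    """First-fit packing of example lengths into fixed-size sequences."""
    bins = []
    for n in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + n <= seq_len:
                b.append(n)
                break
        else:
            bins.append([n])
    used = sum(sum(b) for b in bins)
    return bins, used / (len(bins) * seq_len)

lengths = [900, 350, 300, 1800, 120, 60, 2048, 500]
bins, efficiency = greedy_pack(lengths, seq_len=2048)
print(len(bins), "sequences, efficiency", round(efficiency, 3))
```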
9.3 Deduplication and filtering
Main idea. Bad repeated data can improve train loss while hurting generalization.
Core relation:
$p_\mathrm{train}$ can drift from the desired $p_\mathrm{deploy}$.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
9.4 Checkpoint frequency
Main idea. The optimal interval balances lost work and checkpoint overhead.
Core relation:
$\tau^\ast\approx\sqrt{2\,\delta\,M}$ (Young's approximation) for checkpoint write cost $\delta$ and mean time between failures $M$.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
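A minimal sketch of the interval estimate and the overhead it implies; the checkpoint cost and failure rate are assumed values:

```python
import math

def checkpoint_interval(write_cost_s, mtbf_s):
    """Young's approximation for the checkpoint interval."""
    return math.sqrt(2 * write_cost_s * mtbf_s)

delta = 120.0        # assumed: 2 minutes to write a checkpoint
mtbf = 24 * 3600.0   # assumed: one failure per day on average
tau = checkpoint_interval(delta, mtbf)
print(f"checkpoint every ~{tau / 60:.0f} min, overhead ~{delta / tau:.1%} of runtime")
```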
9.5 Validation cadence
Main idea. Held-out loss catches overfitting, data bugs, and regression after resume.
Core relation:
$L_\mathrm{val}$ is the early warning signal.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
10. Operational Debugging
This part focuses on operational debugging as a practical mathematical constraint in LLM training. The goal is not to memorize infrastructure names, but to understand the formulas that determine whether a run fits, learns, communicates, and resumes.
| Subtopic | Operational question | Formula |
|---|---|---|
| Shape and mask checks | wrong labels or masks can produce plausible but meaningless loss | $L=\frac{\sum_i m_i\,\ell_i}{\sum_i m_i}$ |
| Gradient norm traces | track global norms before and after clipping | $\lVert g\rVert_2=\sqrt{\sum_j\lVert g_j\rVert_2^2}$ |
| Learning-rate traces | optimizer behavior must match the intended schedule | $\eta_t^\mathrm{logged}=\eta_\mathrm{sched}(t)$ |
| Throughput decomposition | separate dataloader, forward, backward, communication, optimizer, and checkpoint time | $T_\mathrm{step}=\sum_k T_k-T_\mathrm{overlap}$ |
| Reproducible small run | scale only after a small deterministic run learns and resumes correctly | $L_{100}<L_0$ is a smoke test |
10.1 Shape and mask checks
Main idea. Wrong labels or masks can produce plausible but meaningless loss.
Core relation:
$L=\frac{\sum_i m_i\,\ell_i}{\sum_i m_i}$ for label mask $m_i\in\{0,1\}$ and per-token loss $\ell_i$, so padded and non-target positions contribute nothing.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
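A minimal numpy sketch of the masked-mean check; the toy tensors and the loud failure on an empty mask are illustrative:

```python
import numpy as np

def masked_mean_loss(per_token_loss, mask):
    """Average loss over label positions only; fail loudly if the mask is empty."""
    assert mask.any(), "mask selects no tokens; labels or masking are broken"
    return float((per_token_loss * mask).sum() / mask.sum())

loss = np.array([2.3, 0.1, 4.0, 0.0])
mask = np.array([1, 1, 1, 0])          # last position is padding
print(masked_mean_loss(loss, mask))    # ~2.133
```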
10.2 Gradient norm traces
Main idea. Track global norms before and after clipping.
Core relation:
$\lVert g\rVert_2=\sqrt{\sum_j\lVert g_j\rVert_2^2}$ over parameter groups $j$, with clipping rescaling by $\min\!\left(1,\,c/\lVert g\rVert_2\right)$ for threshold $c$.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
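A minimal numpy sketch of the global norm and clip, reporting the value before and after as the text recommends; the gradient vectors are toy data:

```python
import numpy as np

def clip_global_norm(grads, max_norm):
    """Scale all gradient tensors by min(1, max_norm / global_norm)."""
    global_norm = float(np.sqrt(sum(float((g ** 2).sum()) for g in grads)))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm, global_norm * scale

grads = [np.array([3.0, 4.0]), np.array([12.0])]
_, before, after = clip_global_norm(grads, max_norm=1.0)
print(before, after)   # 13.0 -> ~1.0
```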
10.3 Learning-rate traces
Main idea. Optimizer behavior must match the intended schedule.
Core relation:
$\eta_t^\mathrm{logged}=\eta_\mathrm{sched}(t)$ at every step, including across a resume.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
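A minimal sketch that compares a logged learning rate against a warmup-plus-cosine schedule; the schedule parameters are illustrative:

```python
import math

def scheduled_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

def check_lr(step, logged_lr, **sched):
    """Assert that the value read from the optimizer matches the intended schedule."""
    expected = scheduled_lr(step, **sched)
    assert math.isclose(logged_lr, expected, rel_tol=1e-6), (step, logged_lr, expected)

check_lr(100, 3e-5, peak_lr=3e-4, warmup_steps=1000, total_steps=100_000)
```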
10.4 Throughput decomposition
Main idea. Separate dataloader, forward, backward, communication, optimizer, and checkpoint time.
Core relation:
$T_\mathrm{step}=T_\mathrm{data}+T_\mathrm{fwd}+T_\mathrm{bwd}+T_\mathrm{comm}+T_\mathrm{opt}+T_\mathrm{ckpt}-T_\mathrm{overlap}$, measured per bucket rather than guessed.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
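A minimal sketch of a timing-bucket report; the measured values are placeholders for what a profiler or simple timers would record:

```python
def throughput_report(buckets_s, overlap_s=0.0):
    """Summarize per-step time by bucket and the fraction each bucket takes."""
    step = sum(buckets_s.values()) - overlap_s
    for name, t in sorted(buckets_s.items(), key=lambda kv: -kv[1]):
        print(f"{name:12s} {t:6.3f}s  {t / step:5.1%}")
    print(f"{'step total':12s} {step:6.3f}s")

# Placeholder measurements for one step, in seconds.
throughput_report({"dataloader": 0.02, "forward": 0.31, "backward": 0.62,
                   "comm": 0.25, "optimizer": 0.05, "checkpoint": 0.00},
                  overlap_s=0.20)
```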
10.5 Reproducible small run
Main idea. Scale only after a small deterministic run learns and resumes correctly.
Core relation:
$L_{100}<L_0$ is a smoke test.
At small scale, this relation may feel like bookkeeping. At LLM scale, it becomes a hard constraint. A missing factor of two in a memory estimate can decide whether the job starts. A wrong batch-size convention can change the optimization regime. A poor communication plan can leave expensive accelerators idle.
Worked micro-example. Suppose a dense model has 7 billion parameters. bf16 weights alone require about $2\times 7\times 10^{9}$ bytes, or roughly 14 GB. Training with Adam usually also needs gradients and two optimizer moment tensors. If the moments are fp32, the optimizer state adds about $8\times 7\times 10^{9}$ bytes, roughly 56 GB, before activations. That is why "weights fit" is not the same as "training fits."
Implementation check. Write down the unit. Is the number per parameter, per token, per device, per data-parallel rank, per step, or per full run? Most scale-training bugs are not exotic math errors; they are unit and axis errors.
AI connection. This formula is part of the control surface for a large training run.
Common mistake. Do not optimize one metric in isolation. More tokens per second can be bad if validation loss stops improving, and lower memory can be bad if recomputation makes the step too slow.
Practice Exercises
- Compute one AdamW update by hand for a scalar parameter.
- Clip a gradient vector to a target norm.
- Build a warmup plus cosine learning-rate schedule.
- Compute effective batch size in tokens.
- Estimate memory for Adam training with and without sharding.
- Compute a pipeline bubble fraction.
- Determine tensor-parallel shard shapes for a linear layer.
- Estimate training FLOPs from parameter and token counts.
- Compute model FLOPs utilization from achieved throughput.
- Create a launch checklist for a small reproducible training run.
Why This Matters for AI
Good LLM training is not only about choosing a model architecture. The optimizer can diverge, the memory plan can be impossible, the communication plan can waste the cluster, the data stream can repeat contaminated text, and the checkpoint can fail to restore optimizer state. The mathematics in this section lets you reason about those failures before the run burns budget.