Mixture-of-experts models separate total capacity from active per-token computation. A router chooses a small number of expert networks for each token, giving the model many parameters without using all of them on every token.
Overview
A dense transformer FFN applies the same network to every token. An MoE FFN replaces that one network with many experts:
$y_t = \sum_{i \in S_t} p_{t,i}\,E_i(x_t)$
where $S_t$ is the selected expert set for token $t$, $p_{t,i}$ is the router weight, and $E_i$ is expert $i$. The central math is not only the weighted sum. It is the accounting around it: active parameters, total parameters, router probabilities, expert capacity, load balancing, all-to-all dispatch, drop rate, and latency.
Prerequisites
- Transformer FFN shapes
- Softmax and top-k selection
- Training-at-scale memory and parallelism
- Efficient inference and serving latency vocabulary
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Demonstrates router softmax, top-k dispatch, capacity overflow, auxiliary balance loss, expert histograms, all-to-all traffic, active parameter counts, and router-collapse diagnostics. |
| exercises.ipynb | Ten practice problems for routing probabilities, capacity, drop rate, load balancing, expert counts, and MoE debugging. |
Learning Objectives
After this section, you should be able to:
- Explain total parameters versus active parameters.
- Compute router probabilities and top-k selected experts.
- Count MoE expert parameters and active expert compute.
- Compute expert capacity and token overflow.
- Define importance, load, auxiliary load-balancing loss, entropy, and z-loss.
- Explain why MoE training often needs all-to-all communication.
- Diagnose expert collapse with histograms, drop rate, entropy, and gradient norms.
- Explain the inference tradeoff: lower active compute but higher memory and routing complexity.
Table of Contents
- Dense versus Sparse Computation
- 1.1 Dense FFN
- 1.2 Expert bank
- 1.3 Sparse activation
- 1.4 Total versus active parameters
- 1.5 Memory caveat
- Router Mathematics
- 2.1 Router logits
- 2.2 Routing probabilities
- 2.3 Top-k selection
- 2.4 Gated combination
- 2.5 Top-1 Switch routing
- Parameter and FLOP Accounting
- 3.1 FFN parameter count
- 3.2 MoE total parameters
- 3.3 MoE active parameters
- 3.4 Router overhead
- 3.5 Compute ratio
- Capacity and Token Dropping
- 4.1 Expected tokens per expert
- 4.2 Capacity factor
- 4.3 Overflow
- 4.4 Batch sensitivity
- 4.5 Expert collapse
- Load Balancing Losses
- 5.1 Importance
- 5.2 Load
- 5.3 Auxiliary loss
- 5.4 Entropy encouragement
- 5.5 Z-loss
- Expert Parallelism
- 6.1 Expert placement
- 6.2 All-to-all dispatch
- 6.3 Combine step
- 6.4 Communication bottleneck
- 6.5 Locality
- Training Dynamics
- 7.1 Specialization
- 7.2 Cold experts
- 7.3 Router noise
- 7.4 Top-2 gradients
- 7.5 Stability tradeoff
- Inference Behavior
- 8.1 Active compute
- 8.2 Weight memory
- 8.3 Batch routing variance
- 8.4 Cache interaction
- 8.5 Latency tails
- MoE Design Variants
- 9.1 Sparsely gated MoE
- 9.2 GShard
- 9.3 Switch Transformer
- 9.4 Top-2 MoE
- 9.5 Shared experts
- Diagnostics
- 10.1 Expert histogram
- 10.2 Drop rate
- 10.3 Router entropy
- 10.4 Per-expert gradients
- 10.5 Ablations
One-Layer MoE Shape Flow
tokens x_t
|
router logits r_t = W_r x_t
|
top-k experts S_t
|
dispatch tokens to experts
|
expert FFNs E_i(x_t)
|
weighted combine and restore token order
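Code sketch. A minimal NumPy version of the shape flow above; the sizes (8 tokens, 4 experts, top-2), ReLU experts, and initialization scales are illustrative assumptions, not any particular model's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_ff, M, k = 8, 16, 32, 4, 2       # tokens, widths, experts, top-k

x = rng.normal(size=(T, d_model))               # token activations x_t
W_r = rng.normal(size=(d_model, M)) * 0.02      # router projection W_r
W1 = rng.normal(size=(M, d_model, d_ff)) * 0.02 # expert FFN first projections
W2 = rng.normal(size=(M, d_ff, d_model)) * 0.02 # expert FFN second projections

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = x @ W_r                                # router logits r_t, shape (T, M)
probs = softmax(logits)                         # routing probabilities p_{t,i}
topk = np.argsort(-probs, axis=1)[:, :k]        # selected expert set S_t

y = np.zeros_like(x)
for i in range(M):                              # "dispatch" tokens to expert i
    token_ids, _ = np.nonzero(topk == i)
    if token_ids.size == 0:
        continue
    h = np.maximum(x[token_ids] @ W1[i], 0.0) @ W2[i]   # E_i(x_t), ReLU expert
    y[token_ids] += probs[token_ids, i][:, None] * h    # weighted combine

print("output shape:", y.shape)                 # (T, d_model), token order kept
```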
1. Dense versus Sparse Computation
This part studies dense versus sparse computation in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.
| Subtopic | Main idea | Formula |
|---|---|---|
| Dense FFN | every token uses the same feed-forward network | $y_t = \mathrm{FFN}(x_t)$ |
| Expert bank | replace one FFN with many candidate FFNs | $\{E_1, \dots, E_M\}$ |
| Sparse activation | each token uses only k experts | $\lvert S_t \rvert = k \ll M$ |
| Total versus active parameters | MoE increases capacity without proportional per-token compute | $P_\mathrm{active} \ll P_\mathrm{total}$ |
| Memory caveat | inactive experts still occupy memory | memory $\propto P_\mathrm{total}$ |
1.1 Dense FFN
Main idea. Every token uses the same feed-forward network.
Core relation: $y_t = \mathrm{FFN}(x_t)$ for every token $t$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
1.2 Expert bank
Main idea. Replace one FFN with many candidate FFNs.
Core relation: the single FFN becomes a bank $\{E_1, \dots, E_M\}$ plus a router.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
1.3 Sparse activation
Main idea. Each token uses only k experts.
Core relation: $y_t = \sum_{i \in S_t} p_{t,i}\,E_i(x_t)$ with $\lvert S_t \rvert = k \ll M$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
1.4 Total versus active parameters
Main idea. MoE increases capacity without proportional per-token compute.
Core relation: $P_\mathrm{active} \ll P_\mathrm{total}$ whenever $k \ll M$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This distinction is the reason MoE models can have large total capacity with smaller per-token compute.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
1.5 Memory caveat
Main idea. Inactive experts still occupy memory.
Core relation: weight memory scales with $P_\mathrm{total}$, not $P_\mathrm{active}$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
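Code sketch. The total-versus-active accounting from 1.4 and the memory caveat from 1.5, with assumed sizes chosen only to make the arithmetic concrete.

```python
d_model, d_ff = 4096, 16384
M, k = 8, 2                              # experts per MoE layer, experts per token

p_ffn = 2 * d_model * d_ff               # one expert FFN (two projections, no bias)
p_router = d_model * M                   # router projection, tiny by comparison

p_total_layer = M * p_ffn + p_router     # what must sit in memory
p_active_layer = k * p_ffn + p_router    # what one token actually uses

print(f"one expert FFN  : {p_ffn/1e6:.1f}M params")
print(f"total per layer : {p_total_layer/1e6:.1f}M params (all {M} experts resident)")
print(f"active per token: {p_active_layer/1e6:.1f}M params (only k={k} experts)")
print(f"memory in bf16  : {2*p_total_layer/1e9:.2f} GB for this one layer")
```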
2. Router Mathematics
This part studies router mathematics in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.
| Subtopic | Main idea | Formula |
|---|---|---|
| Router logits | a small projection scores experts for each token | $r_t = W_r x_t$ |
| Routing probabilities | softmax turns router logits into expert probabilities | $p_{t,i} = \mathrm{softmax}(r_t)_i$ |
| Top-k selection | only the highest scoring experts receive the token | $S_t = \mathrm{TopK}(p_t, k)$ |
| Gated combination | selected expert outputs are weighted by router probabilities | $y_t = \sum_{i \in S_t} p_{t,i}\,E_i(x_t)$ |
| Top-1 Switch routing | route each token to one expert for simplicity | $S_t = \{\arg\max_i p_{t,i}\}$ |
2.1 Router logits
Main idea. A small projection scores experts for each token.
Core relation: $r_t = W_r x_t \in \mathbb{R}^M$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
2.2 Routing probabilities
Main idea. Softmax turns router logits into expert probabilities.
Core relation: $p_{t,i} = \mathrm{softmax}(r_t)_i$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
2.3 Top-k selection
Main idea. Only the highest scoring experts receive the token.
Core relation: $S_t = \mathrm{TopK}(p_t, k)$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
2.4 Gated combination
Main idea. Selected expert outputs are weighted by router probabilities.
Core relation: $y_t = \sum_{i \in S_t} p_{t,i}\,E_i(x_t)$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
2.5 Top-1 Switch routing
Main idea. Route each token to one expert for simplicity.
Core relation: $S_t = \{\arg\max_i p_{t,i}\}$, so a single expert processes the token with gate $p_{t,i^{*}}$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
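Code sketch. Router math from 2.2 through 2.5 for one token: softmax probabilities, then top-1 and top-2 gates. The logits are made-up numbers, and renormalizing top-2 gates over the selected pair is one common convention, not the only one.

```python
import numpy as np

r_t = np.array([1.2, 0.3, -0.5, 0.9])     # router logits for M = 4 experts
p_t = np.exp(r_t - r_t.max())
p_t /= p_t.sum()                           # routing probabilities

# Top-1 (Switch-style): one expert, gate equals its probability
top1 = int(np.argmax(p_t))
print("top-1 expert:", top1, "gate:", round(float(p_t[top1]), 3))

# Top-2: two experts, gates often renormalized over the selected pair
top2 = np.argsort(-p_t)[:2]
gates = p_t[top2] / p_t[top2].sum()
print("top-2 experts:", top2.tolist(), "gates:", np.round(gates, 3).tolist())
```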
3. Parameter and FLOP Accounting
This part studies parameter and FLOP accounting in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.
| Subtopic | Main idea | Formula |
|---|---|---|
| FFN parameter count | a transformer FFN has two large projections | $P_\mathrm{FFN} \approx 2\,d_\mathrm{model}\,d_\mathrm{ff}$ |
| MoE total parameters | experts multiply the FFN parameter count | $P_\mathrm{experts} = M\,P_\mathrm{FFN}$ |
| MoE active parameters | only selected experts are used per token | $P_\mathrm{active} = k\,P_\mathrm{FFN}$ |
| Router overhead | router cost is usually small compared with expert FFNs | $P_\mathrm{router} = d_\mathrm{model}\,M$ |
| Compute ratio | sparse compute scales with k, not M | $\mathrm{FLOPs}_\mathrm{MoE}/\mathrm{FLOPs}_\mathrm{dense}\approx k$ if expert size matches the dense FFN |
3.1 FFN parameter count
Main idea. A transformer FFN has two large projections.
Core relation: $P_\mathrm{FFN} \approx 2\,d_\mathrm{model}\,d_\mathrm{ff}$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
3.2 MoE total parameters
Main idea. Experts multiply the FFN parameter count.
Core relation: $P_\mathrm{experts} = M\,P_\mathrm{FFN}$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
3.3 MoE active parameters
Main idea. Only selected experts are used per token.
Core relation: $P_\mathrm{active} = k\,P_\mathrm{FFN}$ per token, plus the router.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
3.4 Router overhead
Main idea. Router cost is usually small compared with expert FFNs.
Core relation: $P_\mathrm{router} = d_\mathrm{model}\,M \ll P_\mathrm{FFN}$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
3.5 Compute ratio
Main idea. Sparse compute scales with k, not M.
Core relation: $\mathrm{FLOPs}_\mathrm{MoE}/\mathrm{FLOPs}_\mathrm{dense}\approx k$ if expert size matches the dense FFN.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
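Code sketch. The Section 3 accounting with assumed sizes; the two-FLOPs-per-parameter convention is an approximation.

```python
d_model, d_ff, M, k = 4096, 16384, 8, 2

p_ffn = 2 * d_model * d_ff                  # 3.1 dense FFN parameters
p_total = M * p_ffn                         # 3.2 MoE total expert parameters
p_active = k * p_ffn                        # 3.3 active expert parameters
p_router = d_model * M                      # 3.4 router overhead

flops_dense_per_token = 2 * p_ffn           # ~2 FLOPs per weight
flops_moe_per_token = 2 * p_active + 2 * p_router

print("router / expert params :", p_router / p_ffn)   # tiny fraction
print("compute ratio MoE/dense:", flops_moe_per_token / flops_dense_per_token)
# ~k (2.0 here, plus a negligible router term), matching Section 3.5
```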
4. Capacity and Token Dropping
This part studies capacity and token dropping in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.
| Subtopic | Main idea | Formula |
|---|---|---|
| Expected tokens per expert | balanced routing sends roughly T/M tokens to each expert | $\mathbb{E}[\mathrm{load}_i] = kT/M$ |
| Capacity factor | reserve extra slots beyond the expected load | $C = c_f \cdot kT/M$ |
| Overflow | tokens above capacity are dropped or rerouted | $\mathrm{dropped}_i = \max(0, \mathrm{load}_i - C)$ |
| Batch sensitivity | small batches have noisier expert loads | load noise grows as $T$ shrinks |
| Expert collapse | if the router favors a few experts, capacity and learning both suffer | $p_i \approx 0$ for many experts |
4.1 Expected tokens per expert
Main idea. Balanced routing sends roughly T/M tokens to each expert.
Core relation: $\mathbb{E}[\mathrm{load}_i] = kT/M$ for $T$ tokens with top-$k$ routing.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
4.2 Capacity factor
Main idea. Reserve extra slots beyond the expected load.
Core relation: $C = c_f \cdot kT/M$ slots per expert, with capacity factor $c_f \ge 1$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. Capacity is the serving and training contract between the router and the expert bank.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
4.3 Overflow
Main idea. Tokens above capacity are dropped or rerouted.
Core relation: $\mathrm{dropped}_i = \max(0, \mathrm{load}_i - C)$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
4.4 Batch sensitivity
Main idea. Small batches have noisier expert loads.
Core relation: relative load fluctuations shrink roughly like $1/\sqrt{kT/M}$, so small batches are noisier.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
4.5 Expert collapse
Main idea. If the router favors a few experts, capacity and learning both suffer.
Core relation: $p_i \approx 0$ for many experts.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
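Code sketch. Expected load, capacity, and drop rate from 4.1 through 4.3 under a deliberately skewed synthetic router; real routing distributions are data-dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M, k, capacity_factor = 1024, 8, 2, 1.25

expected_load = k * T / M                                  # 4.1 balanced load
capacity = int(np.ceil(capacity_factor * expected_load))   # 4.2 slots per expert

# Skewed routing: two "hot" experts get most of the probability mass.
p = np.array([0.30, 0.25, 0.10, 0.10, 0.08, 0.07, 0.05, 0.05])
assignments = np.stack(
    [rng.choice(M, size=k, replace=False, p=p) for _ in range(T)]
)

load = np.bincount(assignments.ravel(), minlength=M)       # tokens per expert
dropped = np.maximum(load - capacity, 0)                   # 4.3 overflow
print("capacity per expert:", capacity)
print("load per expert    :", load.tolist())
print("drop rate          :", dropped.sum() / (k * T))
```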
5. Load Balancing Losses
This part studies load balancing losses in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.
| Subtopic | Main idea | Formula |
|---|---|---|
| Importance | sum of router probabilities assigned to each expert | $\mathrm{Imp}_i = \sum_t p_{t,i}$ |
| Load | number of tokens actually routed to each expert | $\mathrm{Load}_i = \sum_t \mathbf{1}[i \in S_t]$ |
| Auxiliary loss | penalize uneven routing | $L_\mathrm{aux} = \alpha\,M \sum_i f_i\,P_i$ |
| Entropy encouragement | router entropy can discourage overconfident early routing | $H(p_t) = -\sum_i p_{t,i}\log p_{t,i}$ |
| Z-loss | penalize large router logits for stability | $L_z = \tfrac{1}{T}\sum_t (\log\sum_i e^{r_{t,i}})^2$ |
5.1 Importance
Main idea. Sum of router probabilities assigned to each expert.
Core relation: $\mathrm{Imp}_i = \sum_t p_{t,i}$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
5.2 Load
Main idea. Number of tokens actually routed to each expert.
Core relation: $\mathrm{Load}_i = \sum_t \mathbf{1}[i \in S_t]$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
5.3 Auxiliary loss
Main idea. Penalize uneven routing.
Core relation: $L_\mathrm{aux} = \alpha\,M \sum_i f_i\,P_i$ (Switch-style), where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is its mean router probability.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. Without balancing, a router can discover a few experts and ignore the rest.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
5.4 Entropy encouragement
Main idea. Router entropy can discourage overconfident early routing.
Core relation: $H(p_t) = -\sum_i p_{t,i} \log p_{t,i}$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
5.5 Z-loss
Main idea. Penalize large router logits for stability.
Core relation: $L_z = \frac{1}{T}\sum_t \big(\log \sum_i e^{r_{t,i}}\big)^2$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
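Code sketch. The Section 5 quantities computed on synthetic router probabilities. The auxiliary loss uses the Switch-style form $\alpha M \sum_i f_i P_i$; exact formulations and coefficients vary across papers, so treat these as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M, alpha = 512, 8, 0.01

logits = rng.normal(size=(T, M))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

top1 = probs.argmax(axis=1)                       # top-1 routing for simplicity

importance = probs.sum(axis=0)                    # 5.1 sum of probabilities
load = np.bincount(top1, minlength=M)             # 5.2 tokens per expert

f = load / T                                      # fraction routed to expert i
P = probs.mean(axis=0)                            # mean router probability
aux_loss = alpha * M * float(np.sum(f * P))       # 5.3 auxiliary loss

entropy = float(-(probs * np.log(probs + 1e-9)).sum(axis=1).mean())  # 5.4
z_loss = float((np.log(np.exp(logits).sum(axis=1)) ** 2).mean())     # 5.5

print(importance.round(1), load, round(aux_loss, 4), round(entropy, 3), round(z_loss, 3))
```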
6. Expert Parallelism
This part studies expert parallelism in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.
| Subtopic | Main idea | Formula |
|---|---|---|
| Expert placement | different devices own different experts | expert $i$ lives on device $d(i)$ |
| All-to-all dispatch | tokens move to the devices that own their experts | traffic $\approx T\,k\,d_\mathrm{model}$ per direction |
| Combine step | expert outputs return to original token order | inverse of the dispatch permutation |
| Communication bottleneck | MoE speed depends on token traffic, not only FLOPs | step time $\approx \max(\mathrm{compute}, \mathrm{comm})$ |
| Locality | routing and placement choices can reduce cross-device movement | |
6.1 Expert placement
Main idea. Different devices own different experts.
Core relation:
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
6.2 All-to-all dispatch
Main idea. Tokens move to the devices that own their experts.
Core relation: dispatch traffic per MoE layer is roughly $T\,k\,d_\mathrm{model}$ activation values each way.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. MoE turns part of the model into a distributed routing problem.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
6.3 Combine step
Main idea. Expert outputs return to original token order.
Core relation:
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
6.4 Communication bottleneck
Main idea. MoE speed depends on token traffic, not only FLOPs.
Core relation:
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
6.5 Locality
Main idea. Routing and placement choices can reduce cross-device movement.
Core relation:
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
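Code sketch. A back-of-envelope estimate of all-to-all traffic for 6.2 and 6.4. The approximation bytes ≈ 2 · T · k · d_model · bytes_per_element ignores padding to capacity, routing metadata, and tokens whose expert is local, so it is an intuition aid, not a measurement.

```python
def all_to_all_bytes(tokens, k, d_model, bytes_per_elem=2):
    """Rough activation bytes moved by one MoE layer in one step."""
    one_way = tokens * k * d_model * bytes_per_elem   # dispatch
    return 2 * one_way                                # plus the combine

T, k, d_model = 32_768, 2, 4096          # assumed tokens per step and width
gb = all_to_all_bytes(T, k, d_model) / 1e9
print(f"~{gb:.1f} GB of activation traffic per MoE layer per step")
# Compare this against interconnect bandwidth to see when communication,
# not expert FLOPs, sets the step time.
```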
7. Training Dynamics
This part studies training dynamics in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.
| Subtopic | Main idea | Formula |
|---|---|---|
| Specialization | experts differentiate because they receive different token subsets | $\nabla_{\theta_i}L$ only from routed tokens |
| Cold experts | rarely selected experts learn slowly | update rate $\propto \mathrm{Load}_i$ |
| Router noise | noise can encourage exploration early in training | $r_t = W_r x_t + \epsilon_t$ |
| Top-2 gradients | top-2 routing gives more experts gradient signal than top-1 | two experts per token receive $\nabla_{\theta_i}L$ |
| Stability tradeoff | strong balancing can fight useful specialization | |
7.1 Specialization
Main idea. Experts differentiate because they receive different token subsets.
Core relation: $\nabla_{\theta_i}L$ is nonzero only for tokens routed to expert $i$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
7.2 Cold experts
Main idea. Rarely selected experts learn slowly.
Core relation: the update rate of expert $i$ is proportional to $\mathrm{Load}_i$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
7.3 Router noise
Main idea. Noise can encourage exploration early in training.
Core relation: $r_t = W_r x_t + \epsilon_t$ with small random noise $\epsilon_t$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
7.4 Top-2 gradients
Main idea. Top-2 routing gives more experts gradient signal than top-1.
Core relation: with $k = 2$, two experts per token receive nonzero $\nabla_{\theta_i}L$ instead of one.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
7.5 Stability tradeoff
Main idea. Strong balancing can fight useful specialization.
Core relation:
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
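Code sketch. A toy illustration of 7.2 and 7.3: with skewed logits and no noise, one expert takes every token and the rest stay cold; added logit noise spreads early updates. The skew and noise scale are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M, noise_std = 4096, 8, 1.0

base_logits = np.array([2.0, 1.5, 0.0, 0.0, -0.5, -0.5, -1.0, -1.0])
logits = np.tile(base_logits, (T, 1))            # same skewed scores for all tokens

def top1_counts(z):
    """Tokens (and hence gradient updates) each expert receives under top-1."""
    return np.bincount(z.argmax(axis=1), minlength=M)

noisy = logits + rng.normal(scale=noise_std, size=logits.shape)
print("updates without noise:", top1_counts(logits).tolist())
print("updates with noise   :", top1_counts(noisy).tolist())
# Without noise every token picks expert 0, so experts 1..7 stay cold;
# with noise the counts spread out and more experts receive gradient signal.
```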
8. Inference Behavior
This part studies inference behavior in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.
| Subtopic | Main idea | Formula |
|---|---|---|
| Active compute | per-token expert compute depends on k | $k$ selected experts |
| Weight memory | serving must store or page all experts that may be routed | $P_\mathrm{total}$ resident or streamed |
| Batch routing variance | different requests can activate different experts | $S_t$ varies by token |
| Cache interaction | MoE changes FFN compute but not attention KV cache math directly | $M_\mathrm{KV}$ unchanged by experts |
| Latency tails | hot experts and cross-device traffic can increase p95 latency | |
8.1 Active compute
Main idea. Per-token expert compute depends on k.
Core relation: per-token FFN compute scales with the $k$ selected experts, not with $M$.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
8.2 Weight memory
Main idea. Serving must store or page all experts that may be routed.
Core relation: $P_\mathrm{total}$ must be resident or streamed.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
8.3 Batch routing variance
Main idea. Different requests can activate different experts.
Core relation: $S_t$ varies by token.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
8.4 Cache interaction
Main idea. MoE changes FFN compute but not attention KV cache math directly.
Core relation: $M_\mathrm{KV}$ is unchanged by experts.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
8.5 Latency tails
Main idea. Hot experts and cross-device traffic can increase p95 latency.
Core relation:
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
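Code sketch. The Section 8 serving tradeoff with assumed sizes: small active FFN compute per token against weight memory for every expert. The layer count and bf16 assumption are placeholders.

```python
n_layers, d_model, d_ff, M, k = 32, 4096, 16384, 8, 2

p_ffn = 2 * d_model * d_ff
expert_weight_gb = n_layers * M * p_ffn * 2 / 1e9        # all experts, bf16 (8.2)
active_flops_per_token = n_layers * 2 * k * p_ffn         # FFN part only (8.1)

print(f"expert weights to keep servable: {expert_weight_gb:.0f} GB")
print(f"active FFN compute per token   : {active_flops_per_token/1e9:.1f} GFLOPs")
# The KV cache (8.4) is governed by attention shapes and sequence length,
# not by M or k, so it matches a dense model with the same d_model.
```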
9. MoE Design Variants
This part studies MoE design variants in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.
| Subtopic | Main idea | Formula |
|---|---|---|
| Sparsely gated MoE | learned gates select a sparse expert subset | top-$k$ of learned gate values |
| GShard | scaled conditional computation with automatic sharding | |
| Switch Transformer | top-1 routing simplifies dispatch | $k = 1$ |
| Top-2 MoE | two experts can improve quality at higher compute | $k = 2$ |
| Shared experts | some designs combine routed experts with always-on shared experts | $y_t = E_\mathrm{shared}(x_t) + \sum_{i \in S_t} p_{t,i}E_i(x_t)$ |
9.1 Sparsely gated MoE
Main idea. Learned gates select a sparse expert subset.
Core relation:
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
9.2 GShard
Main idea. Scaled conditional computation with automatic sharding.
Core relation:
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
9.3 Switch Transformer
Main idea. Top-1 routing simplifies dispatch.
Core relation:
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
9.4 Top-2 MoE
Main idea. Two experts can improve quality at higher compute.
Core relation:
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has $P_\mathrm{FFN}$ parameters and an MoE layer has $M = 8$ experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
9.5 Shared experts
Main idea. Some designs combine routed experts with always-on shared experts.
Core relation: $y_t = E_\mathrm{shared}(x_t) + \sum_{i \in S_t} p_{t,i}\,E_i(x_t)$ in one common formulation.
An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.
Worked micro-example. If a dense FFN has parameters and an MoE layer has experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.
Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.
AI connection. This is a practical MoE control variable.
Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
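A minimal sketch of the shared-plus-routed combine, assuming the expert outputs and gate weights have already been computed; the stand-in shared expert here is just a random projection.

```python
# Shared-plus-routed combine: y_t = shared(x_t) + sum_i w_i * expert_i(x_t)
import numpy as np

d = 16
x = np.random.randn(4, d)                          # 4 tokens
shared_out = np.tanh(x @ np.random.randn(d, d))    # stand-in for the always-on shared expert
routed_out = np.random.randn(4, 2, d)              # outputs of the 2 routed experts per token
w = np.array([[0.7, 0.3]] * 4)                     # renormalized top-2 gate weights

y = shared_out + (w[:, :, None] * routed_out).sum(axis=1)
print(y.shape)                                     # (4, 16)
```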
10. Diagnostics
This part studies diagnostics in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.
| Subtopic | Question | Formula |
|---|---|---|
| Expert histogram | plot token counts per expert | $n_i = \lvert \{ t : i \in S_t \} \rvert$ |
| Drop rate | measure overflowed tokens | $d = \frac{1}{kT} \sum_i \max(0, n_i - C)$ |
| Router entropy | track whether routing is collapsing or too diffuse | $H_t = -\sum_i p_{t,i} \log p_{t,i}$ |
| Per-expert gradients | cold experts have small or zero gradient norms | $\lVert \nabla_{\theta_i} \mathcal{L} \rVert$ |
| Ablations | compare dense, top-1, top-2, and capacity factors | compare at matched active parameters |
10.1 Expert histogram
Main idea. Plot token counts per expert.
Core relation: for expert $i$, count the tokens assigned to it, $n_i = \lvert \{ t : i \in S_t \} \rvert$; under perfectly balanced top-$k$ routing the expectation is $kT/E$ tokens per expert.
The histogram of $n_i$ across experts, per layer, is the most direct picture of router behavior. A flat histogram suggests balanced routing; a few tall bars with many near-empty ones suggests collapse; a histogram that changes sharply between steps suggests an unstable router.
Worked micro-example. With $T = 1024$ tokens, $E = 8$ experts, and top-2 routing, the balanced count is $2 \cdot 1024 / 8 = 256$ per expert. A histogram of (900, 700, 200, 150, 50, 40, 5, 3) points to collapse onto two experts.
Implementation check. Log the histogram per MoE layer, not just a global average; routers in different layers can collapse independently.
AI connection. The histogram is the first picture to look at when an MoE model behaves strangely.
Common mistake. Do not average token counts across layers or over a whole epoch; imbalance is often local to a layer or a burst of steps.
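A minimal histogram sketch; `expert_ids` is a placeholder for the router's top-k assignments.

```python
# Expert-assignment histogram: tokens per expert versus the balanced target.
import numpy as np

T, E, k = 1024, 8, 2
expert_ids = np.random.randint(0, E, size=(T, k))   # placeholder assignments from the router
counts = np.bincount(expert_ids.ravel(), minlength=E)
print("tokens per expert:", counts)
print("balanced target:  ", k * T // E)
```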
10.2 Drop rate
Main idea. Measure overflowed tokens.
Core relation: with per-expert capacity $C$ and token counts $n_i$, the drop rate is $d = \frac{1}{kT} \sum_i \max(0, n_i - C)$, the fraction of routed assignments that exceed capacity.
Dropped tokens are not processed by the overflowing expert; in most implementations they pass through the residual connection, which silently degrades quality without raising an error.
Worked micro-example. With $kT = 2048$ assignments, $C = 160$, and two experts receiving 200 and 190 assignments, the overflow is $40 + 30 = 70$, so $d = 70 / 2048 \approx 3.4\%$.
Implementation check. Alert on the drop rate per layer; a rate that climbs during training usually means the router is concentrating load faster than the auxiliary loss can correct it.
AI connection. Drop rate is the quantity that links routing imbalance to a measurable quality loss at fixed compute.
Common mistake. Do not fix a high drop rate only by raising the capacity factor; that increases padding and compute, and it can hide a router that is collapsing.
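A minimal drop-rate computation from per-expert assignment counts; the counts and capacity below are illustrative numbers.

```python
# Drop rate = overflowed assignments / total assignments.
import numpy as np

def drop_rate(counts, capacity):
    counts = np.asarray(counts)
    overflow = np.clip(counts - capacity, 0, None).sum()  # assignments above capacity are dropped
    return overflow / counts.sum()

# 2048 assignments across 8 experts, capacity 256 per expert
print(drop_rate([300, 280, 256, 256, 240, 240, 238, 238], capacity=256))
```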
10.3 Router entropy
Main idea. Track whether routing is collapsing or too diffuse.
Core relation: for token $t$ with routing probabilities $p_{t,i}$, the entropy is $H_t = -\sum_{i=1}^{E} p_{t,i} \log p_{t,i}$, bounded between $0$ and $\log E$.
Average entropy near zero means the router is nearly deterministic, which is expected late in training but is a collapse warning if only a few experts ever win. Entropy near $\log E$ means routing is close to uniform and the experts are probably not specializing.
Worked micro-example. With $E = 8$, the maximum is $\log 8 \approx 2.08$ nats. A batch-average entropy of 0.05 together with a histogram concentrated on two experts indicates collapse; an average near 2.0 indicates the router is barely distinguishing tokens.
Implementation check. Track batch-average entropy per layer alongside the histogram; entropy alone cannot distinguish "confident and balanced" from "confident and collapsed".
AI connection. Entropy is the cheapest scalar summary of router confidence and a standard early-warning signal for collapse.
Common mistake. Do not push entropy toward a target with a strong penalty without checking quality; routing that is too diffuse wastes the benefits of sparsity.
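A minimal per-token entropy computation from router logits.

```python
# Router entropy in nats per token, compared against the log(E) upper bound.
import numpy as np

def router_entropy(logits):
    """logits: (T, E). Returns entropy in nats for each token."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

logits = np.random.randn(5, 8)
print(router_entropy(logits), "max possible:", np.log(8))
```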
10.4 Per-expert gradients
Main idea. Cold experts have small or zero gradient norms.
Core relation: an expert's parameters only receive gradient from the tokens routed to it, so the per-expert gradient norm $\lVert \nabla_{\theta_i} \mathcal{L} \rVert$ scales with the tokens (and gate weights) expert $i$ actually processed.
An expert that receives no tokens gets zero gradient for that step; if this persists, the expert stops adapting, the router has even less reason to select it, and the expert goes cold permanently.
Worked micro-example. If expert 7 receives 3 of 2048 assignments while the others receive around 290 each, its gradient norm will be a tiny fraction of the others, and its parameters drift only through weight decay.
Implementation check. Log per-expert gradient norms every few hundred steps; a norm that stays near zero for many steps identifies a cold expert before the histogram makes it obvious.
AI connection. Gradient norms tell you whether a lightly used expert is merely rare or effectively dead.
Common mistake. Do not conclude an expert is useless because its gradient norm is small; it may serve a rare token type that still matters for quality.
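A hedged PyTorch sketch for logging per-expert gradient norms; the `experts` attribute and the `model.layers[0].moe` path are hypothetical names, not any particular framework's API.

```python
# Per-expert gradient-norm logging, assuming a hypothetical MoE layer whose
# routed experts live in an nn.ModuleList called `experts`.
import torch

def expert_grad_norms(moe_layer):
    norms = []
    for expert in moe_layer.experts:                   # hypothetical attribute name
        sq = 0.0
        for p in expert.parameters():
            if p.grad is not None:
                sq += p.grad.pow(2).sum().item()       # accumulate squared gradient entries
        norms.append(sq ** 0.5)
    return norms

# Usage after loss.backward():
# print(expert_grad_norms(model.layers[0].moe))        # near-zero entries flag cold experts
```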
10.5 Ablations
Main idea. Compare dense, top-1, top-2, and capacity factors.
Core relation: a fair ablation holds active parameters (or training FLOPs per token) fixed and varies one factor at a time: dense versus MoE, top-1 versus top-2, number of experts, and capacity factor.
Because sparsity changes capacity, compute, and routing at the same time, a quality gain can come from more total parameters, from better routing, or from a larger effective training budget; only controlled comparisons separate these.
Worked micro-example. Comparing a dense model to a top-2 MoE with 8 experts at the same total parameter count is not the interesting comparison. Matching active parameters and tokens seen, then asking how much quality the 8x of mostly inactive parameters buys, is.
Implementation check. Record capacity factor, drop rate, and auxiliary loss weight with every ablation run; two runs that differ only in drop rate are not an ablation of routing.
AI connection. The Switch Transformer comparisons against dense baselines are made at matched FLOPs per token, which is the pattern to copy.
Common mistake. Do not attribute an MoE quality gain to "more parameters" without checking how many of those parameters were active per token and how many tokens were dropped.
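A minimal active-versus-total parameter accounting sketch for ablation planning, assuming each expert FFN matches the dense FFN's shape and ignoring biases and normalization parameters.

```python
# Compare total and active FFN parameters for a few routing configs.

def ffn_params(d_model, d_ff):
    return 2 * d_model * d_ff                 # up- and down-projection weights only

def moe_params(d_model, d_ff, n_experts, k):
    expert = ffn_params(d_model, d_ff)
    router = d_model * n_experts              # one logit per expert
    total = n_experts * expert + router
    active = k * expert + router              # only k experts run per token
    return total, active

dense = ffn_params(4096, 16384)
for n_experts, k in [(8, 1), (8, 2), (64, 2)]:
    total, active = moe_params(4096, 16384, n_experts, k)
    print(f"E={n_experts} top-{k}: total={total / dense:.1f}x dense, active={active / dense:.2f}x dense")
```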
Practice Exercises
- Compute top-k experts from router probabilities.
- Count dense FFN and MoE expert parameters.
- Compute active versus total expert parameters.
- Compute expert capacity from tokens, experts, and capacity factor.
- Compute drop rate from expert loads.
- Compute a Switch-style auxiliary balancing term.
- Compute router entropy for a token.
- Estimate all-to-all token traffic by expert placement.
- Compare top-1 and top-2 active compute.
- Write an MoE debugging checklist.
Why This Matters for AI
MoE models are attractive because they can increase capacity without a proportional increase in active compute. But they are not free. They create routing, balancing, communication, memory, and serving problems. Learning MoE math means learning to ask precise questions: which experts were active, how balanced were they, how many tokens were dropped, how much traffic moved, and how much quality came from sparsity rather than raw parameter count?
Bridge to Quantization and Distillation
Quantization and distillation also change the relationship between quality, memory, and compute. The next section studies how precision reduction and teacher-student training compress models while trying to preserve behavior.
References
- Noam Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", 2017: https://arxiv.org/abs/1701.06538
- Dmitry Lepikhin et al., "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding", 2020: https://arxiv.org/abs/2006.16668
- William Fedus, Barret Zoph, and Noam Shazeer, "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity", 2021: https://arxiv.org/abs/2101.03961
- Albert Q. Jiang et al. (Mistral AI), "Mixtral of Experts", 2024: https://arxiv.org/abs/2401.04088