Math for LLMs: Notes

Mixture of Experts and Routing

Mixture-of-experts models separate total capacity from active per-token computation. A router chooses a small number of expert networks for each token, giving the model many parameters without using all of them on every token.

Overview

A dense transformer FFN applies the same network to every token. An MoE FFN replaces that one network with many experts:

y_t=\sum_{i\in S_t} g_{t,i}E_i(x_t),

where S_t is the selected expert set for token t, g_{t,i} is the router weight, and E_i is expert i. The central math is not only the weighted sum. It is the accounting around it: active parameters, total parameters, router probabilities, expert capacity, load balancing, all-to-all dispatch, drop rate, and latency.

Prerequisites

  • Transformer FFN shapes
  • Softmax and top-k selection
  • Training-at-scale memory and parallelism
  • Efficient inference and serving latency vocabulary

Companion Notebooks

Notebook | Purpose
theory.ipynb | Demonstrates router softmax, top-k dispatch, capacity overflow, auxiliary balance loss, expert histograms, all-to-all traffic, active parameter counts, and router-collapse diagnostics.
exercises.ipynb | Ten practice problems for routing probabilities, capacity, drop rate, load balancing, expert counts, and MoE debugging.

Learning Objectives

After this section, you should be able to:

  • Explain total parameters versus active parameters.
  • Compute router probabilities and top-k selected experts.
  • Count MoE expert parameters and active expert compute.
  • Compute expert capacity and token overflow.
  • Define importance, load, auxiliary load-balancing loss, entropy, and z-loss.
  • Explain why MoE training often needs all-to-all communication.
  • Diagnose expert collapse with histograms, drop rate, entropy, and gradient norms.
  • Explain the inference tradeoff: lower active compute but higher memory and routing complexity.

Table of Contents

  1. Dense versus Sparse Computation
  2. Router Mathematics
  3. Parameter and FLOP Accounting
  4. Capacity and Token Dropping
  5. Load Balancing Losses
  6. Expert Parallelism
  7. Training Dynamics
  8. Inference Behavior
  9. MoE Design Variants
  10. Diagnostics

One-Layer MoE Shape Flow

tokens x_t
   |
router logits r_t = W_r x_t
   |
top-k experts S_t
   |
dispatch tokens to experts
   |
expert FFNs E_i(x_t)
   |
weighted combine and restore token order

1. Dense versus Sparse Computation

This part studies dense versus sparse computation in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Dense FFN | every token uses the same feed-forward network | y=\mathrm{FFN}(x)
Expert bank | replace one FFN with many candidate FFNs | E_1,\ldots,E_M
Sparse activation | each token uses only k experts | k\ll M
Total versus active parameters | MoE increases capacity without proportional per-token compute | P_\mathrm{active}\ll P_\mathrm{total}
Memory caveat | inactive experts still occupy memory | M_\mathrm{weights}\propto P_\mathrm{total}

1.1 Dense FFN

Main idea. Every token uses the same feed-forward network.

Core relation:

y=\mathrm{FFN}(x)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

1.2 Expert bank

Main idea. Replace one FFN with many candidate FFNs.

Core relation:

E_1,\ldots,E_M

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

1.3 Sparse activation

Main idea. Each token uses only k experts.

Core relation:

k\ll M

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

1.4 Total versus active parameters

Main idea. MoE increases capacity without proportional per-token compute.

Core relation:

P_\mathrm{active}\ll P_\mathrm{total}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This distinction is the reason MoE models can have large total capacity with smaller per-token compute.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

1.5 Memory caveat

Main idea. Inactive experts still occupy memory.

Core relation:

M_\mathrm{weights}\propto P_\mathrm{total}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

2. Router Mathematics

This part studies router mathematics in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Router logits | a small projection scores experts for each token | r=W_r x
Routing probabilities | softmax turns router logits into expert probabilities | p_i=\exp(r_i)/\sum_j\exp(r_j)
Top-k selection | only the highest scoring experts receive the token | S=\mathrm{TopK}(p,k)
Gated combination | selected expert outputs are weighted by router probabilities | y=\sum_{i\in S}\tilde p_i E_i(x)
Top-1 Switch routing | route each token to one expert for simplicity | y=E_{\arg\max_i p_i}(x)

2.1 Router logits

Main idea. A small projection scores experts for each token.

Core relation:

r=W_r x

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

2.2 Routing probabilities

Main idea. Softmax turns router logits into expert probabilities.

Core relation:

p_i=\exp(r_i)/\sum_j\exp(r_j)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

2.3 Top-k selection

Main idea. Only the highest scoring experts receive the token.

Core relation:

S=\mathrm{TopK}(p,k)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

2.4 Gated combination

Main idea. Selected expert outputs are weighted by router probabilities.

Core relation:

y=\sum_{i\in S}\tilde p_i E_i(x)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
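The sketch below walks one token through the routing math in this part: logits, softmax, top-k selection, renormalized gates, and the gated combination. It is a minimal NumPy version, not any particular framework's API; the dimensions, weight scales, and names such as moe_forward are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, M, k = 16, 64, 8, 2                          # model dim, expert hidden dim, experts, top-k

W_r = rng.normal(scale=0.02, size=(M, d))             # router projection W_r
experts = [(rng.normal(scale=0.02, size=(d_ff, d)),   # up projection
            rng.normal(scale=0.02, size=(d, d_ff)))   # down projection
           for _ in range(M)]

def moe_forward(x):
    """Route one token x of shape [d] through a top-k MoE FFN (illustrative only)."""
    r = W_r @ x                                       # router logits r = W_r x
    p = np.exp(r - r.max())
    p /= p.sum()                                      # softmax probabilities p_i
    S = np.argsort(p)[-k:]                            # top-k expert indices
    g = p[S] / p[S].sum()                             # renormalized gates
    y = np.zeros(d)
    for gate, i in zip(g, S):
        W1, W2 = experts[i]
        y += gate * (W2 @ np.maximum(W1 @ x, 0.0))    # gate * E_i(x) with a ReLU FFN
    return y, S, g

y, S, g = moe_forward(rng.normal(size=d))
print("selected experts:", S, "gates:", np.round(g, 3))
```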

2.5 Top-1 Switch routing

Main idea. Route each token to one expert for simplicity.

Core relation:

y=E_{\arg\max_i p_i}(x)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

3. Parameter and FLOP Accounting

This part studies parameter and FLOP accounting in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
FFN parameter count | a transformer FFN has two large projections | P_\mathrm{ffn}\approx 2dd_\mathrm{ff}
MoE total parameters | experts multiply the FFN parameter count | P_\mathrm{experts}\approx M\cdot 2dd_\mathrm{ff}
MoE active parameters | only selected experts are used per token | P_\mathrm{active}\approx k\cdot 2dd_\mathrm{ff}
Router overhead | router cost is usually small compared with expert FFNs | P_\mathrm{router}=dM
Compute ratio | sparse compute scales with k, not M | \mathrm{FLOPs}_\mathrm{MoE}/\mathrm{FLOPs}_\mathrm{dense}\approx k if expert size matches the dense FFN

3.1 FFN parameter count

Main idea. A transformer FFN has two large projections.

Core relation:

P_\mathrm{ffn}\approx 2dd_\mathrm{ff}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

3.2 MoE total parameters

Main idea. Experts multiply the FFN parameter count.

Core relation:

P_\mathrm{experts}\approx M\cdot 2dd_\mathrm{ff}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

3.3 MoE active parameters

Main idea. Only selected experts are used per token.

Core relation:

P_\mathrm{active}\approx k\cdot 2dd_\mathrm{ff}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

3.4 Router overhead

Main idea. Router cost is usually small compared with expert FFNs.

Core relation:

P_\mathrm{router}=dM

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

3.5 Compute ratio

Main idea. Sparse compute scales with k, not M.

Core relation:

\mathrm{FLOPs}_\mathrm{MoE}/\mathrm{FLOPs}_\mathrm{dense}\approx k if expert size matches the dense FFN

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
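As a quick worked instance of the accounting in this part, the snippet below plugs illustrative values of d, d_ff, M, and k into the formulas above. The "about 2 FLOPs per parameter per token" rule of thumb is an assumption for the sketch, not a measured number.

```python
# Illustrative parameter and FLOP accounting for one MoE layer versus one dense FFN.
d, d_ff, M, k = 4096, 16384, 8, 2

p_ffn     = 2 * d * d_ff            # P_ffn ~ 2 d d_ff (up + down projection)
p_experts = M * p_ffn               # P_experts ~ M * 2 d d_ff (all resident)
p_active  = k * p_ffn               # P_active ~ k * 2 d d_ff (used per token)
p_router  = d * M                   # P_router = d M (tiny by comparison)

flops_dense = 2 * p_ffn             # assumed ~2 FLOPs per parameter per token
flops_moe   = 2 * (p_active + p_router)

print(f"total expert params : {p_experts/1e9:.2f} B")
print(f"active params/token : {p_active/1e9:.2f} B")
print(f"FLOP ratio MoE/dense: {flops_moe/flops_dense:.2f}")  # ~k when expert size matches
```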

4. Capacity and Token Dropping

This part studies capacity and token dropping in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Expected tokens per expert | balanced routing sends roughly T/M tokens to each expert | E[n_i]=T/M
Capacity factor | reserve extra slots beyond the expected load | C_i=\lceil\mathrm{capacity\ factor}\cdot T/M\rceil
Overflow | tokens above capacity are dropped or rerouted | \max(0,n_i-C_i)
Batch sensitivity | small batches have noisier expert loads | \mathrm{Var}(n_i)=Tp_i(1-p_i)
Expert collapse | if the router favors a few experts, capacity and learning both suffer | p_i\approx 0 for many experts

4.1 Expected tokens per expert

Main idea. Balanced routing sends roughly T/M tokens to each expert.

Core relation:

E[n_i]=T/M

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

4.2 Capacity factor

Main idea. Reserve extra slots beyond the expected load.

Core relation:

C_i=\lceil\mathrm{capacity\ factor}\cdot T/M\rceil

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. Capacity is the serving and training contract between the router and the expert bank.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

4.3 Overflow

Main idea. Tokens above capacity are dropped or rerouted.

Core relation:

\max(0,n_i-C_i)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
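A small simulation of the capacity and overflow formulas above, assuming top-1 routing and a deliberately skewed routing distribution so that drops actually occur; the probabilities and capacity factor are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M, capacity_factor = 4096, 8, 1.25

capacity = int(np.ceil(capacity_factor * T / M))     # C_i = ceil(capacity_factor * T / M)

# Skewed top-1 routing probabilities so that some experts overflow.
p = np.array([0.30, 0.20, 0.10, 0.10, 0.10, 0.08, 0.07, 0.05])
assignments = rng.choice(M, size=T, p=p)             # expert index per token
counts = np.bincount(assignments, minlength=M)       # realized loads n_i

overflow = np.maximum(0, counts - capacity)          # max(0, n_i - C_i)
print("per-expert load:", counts)
print("capacity per expert:", capacity)
print(f"dropped tokens: {overflow.sum()} ({overflow.sum()/T:.1%} drop rate)")
```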

4.4 Batch sensitivity

Main idea. Small batches have noisier expert loads.

Core relation:

\mathrm{Var}(n_i)=Tp_i(1-p_i)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
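The following sketch checks the variance formula numerically by simulating the load n_i as a binomial draw; the batch sizes are arbitrary and chosen only to show the relative spread shrinking as T grows.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 8
p_i = 1 / M                                          # balanced routing probability

for T in (256, 4096, 65536):
    loads = rng.binomial(T, p_i, size=10_000)        # simulated n_i over many batches
    print(f"T={T:6d}  E[n_i]={T*p_i:7.0f}  observed std={loads.std():6.1f}  "
          f"predicted std={np.sqrt(T*p_i*(1-p_i)):6.1f}")
```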

4.5 Expert collapse

Main idea. If the router favors a few experts, capacity and learning both suffer.

Core relation:

p_i\approx 0 for many experts

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

5. Load Balancing Losses

This part studies load balancing losses in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Importance | sum of router probabilities assigned to each expert | I_i=\sum_t p_{t,i}
Load | number of tokens actually routed to each expert | L_i=\sum_t \mathbf{1}\{i\in S_t\}
Auxiliary loss | penalize uneven routing | L_\mathrm{aux}\propto M\sum_i f_i P_i
Entropy encouragement | router entropy can discourage overconfident early routing | H(p_t)=-\sum_i p_{t,i}\log p_{t,i}
Z-loss | penalize large router logits for stability | L_z=(\log\sum_i e^{r_i})^2

5.1 Importance

Main idea. Sum of router probabilities assigned to each expert.

Core relation:

I_i=\sum_t p_{t,i}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

5.2 Load

Main idea. Number of tokens actually routed to each expert.

Core relation:

L_i=\sum_t \mathbf{1}\{i\in S_t\}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

5.3 Auxiliary loss

Main idea. Penalize uneven routing.

Core relation:

L_\mathrm{aux}\propto M\sum_i f_i P_i

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. Without balancing, a router can discover a few experts and ignore the rest.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
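A minimal sketch of a Switch-style auxiliary balance loss, assuming top-1 routing, with f_i the fraction of tokens sent to expert i and P_i the mean router probability for expert i; the function name and shapes are illustrative, not any library's API.

```python
import numpy as np

def aux_balance_loss(router_probs, expert_index, M):
    """Switch-style L_aux proportional to M * sum_i f_i P_i for one batch (sketch only).

    router_probs: [T, M] softmax probabilities; expert_index: [T] top-1 choices.
    """
    f = np.bincount(expert_index, minlength=M) / len(expert_index)  # token fraction f_i
    P = router_probs.mean(axis=0)                                   # mean probability P_i
    return M * float(f @ P)

rng = np.random.default_rng(0)
T, M = 1024, 8
logits = rng.normal(size=(T, M))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
print("aux loss:", aux_balance_loss(probs, probs.argmax(axis=1), M))  # ~1 when balanced
```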

5.4 Entropy encouragement

Main idea. Router entropy can discourage overconfident early routing.

Core relation:

H(p_t)=-\sum_i p_{t,i}\log p_{t,i}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

5.5 Z-loss

Main idea. Penalize large router logits for stability.

Core relation:

L_z=(\log\sum_i e^{r_i})^2

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
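The helpers below compute the mean router entropy and a z-loss-style logit penalty for a batch of router logits; they are sketches of the formulas above under assumed shapes, not any framework's implementation.

```python
import numpy as np

def router_entropy(p):
    """Mean over tokens of H(p_t) = -sum_i p_{t,i} log p_{t,i}."""
    return float(-(p * np.log(p + 1e-9)).sum(axis=1).mean())

def router_z_loss(logits):
    """Mean over tokens of (log sum_i exp(r_i))^2, penalizing large router logits."""
    z = np.log(np.exp(logits).sum(axis=1))
    return float((z ** 2).mean())

rng = np.random.default_rng(0)
logits = rng.normal(scale=3.0, size=(1024, 8))       # wide logits inflate both diagnostics
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
print("mean router entropy:", round(router_entropy(p), 3))
print("router z-loss      :", round(router_z_loss(logits), 3))
```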

6. Expert Parallelism

This part studies expert parallelism in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Expert placement | different devices own different experts | E_i\rightarrow\mathrm{rank}(i)
All-to-all dispatch | tokens move to the devices that own their experts | \mathrm{tokens}\rightarrow\mathrm{experts}
Combine step | expert outputs return to the original token order | y_t=\sum_i g_{t,i}E_i(x_t)
Communication bottleneck | MoE speed depends on token traffic, not only FLOPs | T_\mathrm{step}\approx\max(T_\mathrm{expert},T_\mathrm{alltoall})
Locality | routing and placement choices can reduce cross-device movement | \mathrm{traffic}\downarrow

6.1 Expert placement

Main idea. Different devices own different experts.

Core relation:

E_i\rightarrow\mathrm{rank}(i)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

6.2 All-to-all dispatch

Main idea. Tokens move to the devices that own their experts.

Core relation:

\mathrm{tokens}\rightarrow\mathrm{experts}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. MoE turns part of the model into a distributed routing problem.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

6.3 Combine step

Main idea. Expert outputs return to original token order.

Core relation:

y_t=\sum_i g_{t,i}E_i(x_t)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

6.4 Communication bottleneck

Main idea. MoE speed depends on token traffic, not only FLOPs.

Core relation:

T_\mathrm{step}\approx\max(T_\mathrm{expert},T_\mathrm{alltoall})

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
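A back-of-envelope step-time estimate in the spirit of the formula above: dispatch-and-combine traffic divided by link bandwidth, versus expert FLOPs divided by achievable throughput. Every constant here (bandwidth, throughput, sizes) is an assumed placeholder, not a measurement of any real system.

```python
# Rough step time for one expert-parallel MoE layer: ~max(expert compute, all-to-all).
tokens_per_device = 8192
d_model, d_ff, k  = 4096, 16384, 2
bytes_per_value   = 2            # bf16 activations
link_bandwidth    = 50e9         # assumed effective all-to-all bandwidth, bytes/s
matmul_throughput = 150e12       # assumed achieved FLOP/s on expert matmuls

# Dispatch and combine each move k copies of every token across the network.
traffic_bytes = 2 * k * tokens_per_device * d_model * bytes_per_value
t_alltoall    = traffic_bytes / link_bandwidth

flops_per_token = k * 2 * (2 * d_model * d_ff)       # two projections per expert
t_expert        = tokens_per_device * flops_per_token / matmul_throughput

print(f"all-to-all ~ {t_alltoall*1e3:.1f} ms, expert compute ~ {t_expert*1e3:.1f} ms, "
      f"step ~ {max(t_alltoall, t_expert)*1e3:.1f} ms")
```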

6.5 Locality

Main idea. Routing and placement choices can reduce cross-device movement.

Core relation:

\mathrm{traffic}\downarrow

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

7. Training Dynamics

This part studies training dynamics in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Specialization | experts differentiate because they receive different token subsets | \nabla_{\theta_i}L only from routed tokens
Cold experts | rarely selected experts learn slowly | n_i\approx 0\Rightarrow\nabla_{\theta_i}\approx 0
Router noise | noise can encourage exploration early in training | r'=r+\epsilon
Top-2 gradients | top-2 routing gives more experts gradient signal than top-1 | |S|=2
Stability tradeoff | strong balancing can fight useful specialization | L=L_\mathrm{task}+\lambda L_\mathrm{aux}

7.1 Specialization

Main idea. Experts differentiate because they receive different token subsets.

Core relation:

\nabla_{\theta_i}L only from routed tokens

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

7.2 Cold experts

Main idea. Rarely selected experts learn slowly.

Core relation:

n_i\approx 0\Rightarrow\nabla_{\theta_i}\approx 0

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

7.3 Router noise

Main idea. Noise can encourage exploration early in training.

Core relation:

r'=r+\epsilon

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
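A toy demonstration of r' = r + epsilon: with one expert's logits artificially inflated, Gaussian noise on the router logits spreads top-1 assignments across more experts. The bias, noise scale, and shapes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 4096, 8
bias = np.array([2.0] + [0.0] * (M - 1))             # expert 0 is currently favored
logits = rng.normal(size=(T, M)) + bias

def top1_counts(r):
    """Histogram of top-1 expert assignments for a batch of logits."""
    return np.bincount(r.argmax(axis=1), minlength=M)

noise = rng.normal(scale=1.5, size=logits.shape)     # r' = r + eps
print("no noise  :", top1_counts(logits))
print("with noise:", top1_counts(logits + noise))
```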

7.4 Top-2 gradients

Main idea. Top-2 routing gives more experts gradient signal than top-1.

Core relation:

|S|=2

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

7.5 Stability tradeoff

Main idea. Strong balancing can fight useful specialization.

Core relation:

L=L_\mathrm{task}+\lambda L_\mathrm{aux}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

8. Inference Behavior

This part studies inference behavior in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Active compute | per-token expert compute depends on k | k selected experts
Weight memory | serving must store or page all experts that may be routed | P_\mathrm{total} resident or streamed
Batch routing variance | different requests can activate different experts | S_t varies by token
Cache interaction | MoE changes FFN compute but not attention KV cache math directly | M_\mathrm{KV} unchanged by experts
Latency tails | hot experts and cross-device traffic can increase p95 latency | Q_{0.95}(T)

8.1 Active compute

Main idea. Per-token expert compute depends on k.

Core relation:

k selected experts

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

8.2 Weight memory

Main idea. Serving must store or page all experts that may be routed.

Core relation:

P_\mathrm{total} resident or streamed

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
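The snippet below turns the memory caveat into numbers: every expert in every MoE layer must be resident (or paged) even though only k are touched per token. The layer count, expert count, and bf16 weights are assumed values chosen only for illustration.

```python
# Serving-memory sketch: all experts stay resident even though each token uses only k.
d, d_ff, M, k, n_moe_layers = 4096, 16384, 64, 2, 32
bytes_per_param = 2                                    # bf16 weights

p_layer_total  = M * 2 * d * d_ff                      # expert params in one MoE layer
p_layer_active = k * 2 * d * d_ff                      # expert params used per token

total_gib  = n_moe_layers * p_layer_total  * bytes_per_param / 2**30
active_gib = n_moe_layers * p_layer_active * bytes_per_param / 2**30
print(f"resident expert weights: {total_gib:.0f} GiB; touched per token: {active_gib:.0f} GiB")
```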

8.3 Batch routing variance

Main idea. Different requests can activate different experts.

Core relation:

S_t varies by token

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

8.4 Cache interaction

Main idea. MoE changes FFN compute but not attention KV cache math directly.

Core relation:

M_\mathrm{KV} unchanged by experts

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

8.5 Latency tails

Main idea. Hot experts and cross-device traffic can increase p95 latency.

Core relation:

Q_{0.95}(T)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
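A synthetic illustration of the latency-tail point: a small fraction of steps hitting a hot expert barely moves the mean step time but clearly shifts Q_{0.95}. The latency distribution is invented for the example, not measured.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
base = rng.normal(loc=20.0, scale=1.0, size=n)            # ms per step when balanced
hot  = rng.random(n) < 0.05                               # 5% of steps hit a hot expert
latency = base + hot * rng.normal(loc=15.0, scale=3.0, size=n)

print(f"mean: {latency.mean():.1f} ms   p95 (Q_0.95): {np.quantile(latency, 0.95):.1f} ms")
```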

9. MoE Design Variants

This part studies MoE design variants in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Sparsely gated MoE | learned gates select a sparse expert subset | y=\sum_i g_i E_i(x)
GShard | scaled conditional computation with automatic sharding | \mathrm{expert\ parallelism}
Switch Transformer | top-1 routing simplifies dispatch | k=1
Top-2 MoE | two experts can improve quality at higher compute | k=2
Shared experts | some designs combine routed experts with always-on shared experts | y=E_\mathrm{shared}(x)+E_\mathrm{routed}(x)

9.1 Sparsely gated MoE

Main idea. Learned gates select a sparse expert subset.

Core relation:

y=\sum_i g_i E_i(x)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

9.2 GShard

Main idea. Scaled conditional computation with automatic sharding.

Core relation:

\mathrm{expert\ parallelism}

GShard scaled conditional computation to very large translation models by placing different experts on different devices (expert parallelism) and expressing the placement through lightweight sharding annotations that the compiler turns into a distributed program. It also introduced much of the vocabulary used in the rest of these notes: top-2 gating, a per-expert capacity derived from a capacity factor, and an auxiliary loss that pushes the router toward balanced expert loads.

Common mistake. Do not confuse expert parallelism with tensor parallelism. In expert parallelism whole experts live on different devices, so tokens, not weight shards, move across the network.
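A minimal sketch, with assumed placements and counts, of estimating how many tokens must cross devices when experts are spread over an expert-parallel group; this is the quantity behind all-to-all dispatch volume:

```python
import numpy as np

rng = np.random.default_rng(0)
M, D, T = 8, 4, 1024                      # experts, devices, tokens on this device
expert_device = np.arange(M) % D          # assumed placement: experts striped over devices
local_device = 0

# Assumed routing outcome on this device: tokens per expert.
tokens_per_expert = rng.multinomial(T, np.full(M, 1.0 / M))

remote = sum(n for i, n in enumerate(tokens_per_expert)
             if expert_device[i] != local_device)
print("tokens routed to remote experts:", remote, f"({remote / T:.0%} of the batch)")
# With experts striped over D devices and balanced routing,
# roughly (D-1)/D of the tokens leave the device each layer.
```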

9.3 Switch Transformer

Main idea. Top-1 routing simplifies dispatch.

Core relation:

k=1

Routing each token to a single expert removes the need to combine multiple expert outputs, halves dispatch traffic and active expert compute relative to top-2, and simplifies capacity bookkeeping. The Switch Transformer pairs top-1 routing with a simple auxiliary balancing term: for each expert, multiply the fraction of tokens it receives by the mean router probability it is assigned, sum over experts, and scale by the number of experts, so the term is minimized when both quantities are uniform.

Common mistake. Top-1 routing does not make the router unimportant; it still has to spread tokens well, and with k=1 every misrouted token is processed by exactly one wrong expert.
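A minimal numpy sketch of that Switch-style auxiliary term, assuming toy router probabilities; f_i is the fraction of tokens whose top-1 choice is expert i and P_i is the mean router probability on expert i:

```python
import numpy as np

probs = np.array([                 # assumed router probabilities, 4 tokens x 3 experts
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
    [0.40, 0.45, 0.15],
])
T, M = probs.shape
top1 = probs.argmax(axis=-1)

f = np.bincount(top1, minlength=M) / T     # fraction of tokens per expert
P = probs.mean(axis=0)                     # mean router probability per expert
aux = M * np.sum(f * P)                    # Switch-style balance term (1.0 when both are uniform)

print("f:", f, " P:", P, " aux:", round(float(aux), 3))
```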

9.4 Top-2 MoE

Main idea. Two experts can improve quality at higher compute.

Core relation:

k=2

Blending two experts per token roughly doubles active FFN compute and dispatch traffic relative to top-1, but the second expert usually improves quality, makes routing less brittle, and keeps more experts receiving gradient.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Common mistake. Do not compare a top-2 MoE to a dense model by total parameters alone; the fair comparisons hold either active compute or total memory fixed.
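A minimal sketch of the active-versus-total accounting for k=1 versus k=2, counting only expert FFN parameters and using assumed toy dimensions:

```python
d, d_ff, M = 4096, 16384, 8            # assumed hidden size, expert width, expert count

dense_ffn = 2 * d * d_ff               # one dense FFN (two weight matrices)
total_moe = M * dense_ffn              # all experts, resident in memory
active = {k: k * dense_ffn for k in (1, 2)}   # per-token active expert parameters

print(f"dense FFN params:       {dense_ffn / 1e6:.0f}M")
print(f"total expert params:    {total_moe / 1e6:.0f}M  ({M}x dense)")
for k, a in active.items():
    print(f"active params (top-{k}): {a / 1e6:.0f}M  ({a // dense_ffn}x dense)")
```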

9.5 Shared experts

Main idea. Some designs combine routed experts with always-on shared experts.

Core relation:

y=E_\mathrm{shared}(x)+E_\mathrm{routed}(x)

Designs such as DeepSeekMoE pass every token through one or more always-on shared experts in addition to its routed experts. The shared expert absorbs features that every token needs, which frees the routed experts to specialize, and it adds a fixed amount of compute per token, so it counts toward active parameters regardless of the routing decision.

Common mistake. When quoting active parameters for a shared-expert design, include the shared experts; they run for every token even though the router never selects them.
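A minimal sketch, extending the previous accounting with assumed toy sizes, of how an always-on shared expert changes the active-parameter count:

```python
d, d_ff, M, k = 4096, 16384, 8, 2       # assumed hidden size, expert width, routed experts, top-k
n_shared = 1                            # assumed number of always-on shared experts

expert_params = 2 * d * d_ff
total = (M + n_shared) * expert_params
active = (k + n_shared) * expert_params          # shared experts run for every token

print(f"total expert params: {total / 1e9:.2f}B")
print(f"active per token:    {active / 1e9:.2f}B  (top-{k} routed + {n_shared} shared)")
```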

10. Diagnostics

This part studies diagnostics in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Expert histogram | plot token counts per expert | n_i |
| Drop rate | measure overflowed tokens | \mathrm{drop}=\sum_i\max(0,n_i-C_i)/T |
| Router entropy | track whether routing is collapsing or too diffuse | H(p) |
| Per-expert gradients | cold experts have small or zero gradient norms | \Vert g_i\Vert |
| Ablations | compare dense, top-1, top-2, and capacity factors | \Delta L,\ \Delta T,\ \Delta M |

10.1 Expert histogram

Main idea. Plot token counts per expert.

Core relation:

n_i

The counts n_i of tokens routed to each expert form the histogram, and it is the first picture to look at when an MoE model behaves strangely. A healthy top-k router keeps every n_i within a small factor of the mean load Tk/M; a collapsed router piles most tokens onto a few experts and leaves the rest near zero, which also predicts capacity overflow on the hot experts.

Common mistake. A flat training loss does not imply a flat histogram; a low loss curve can hide a collapsed router.
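A minimal numpy sketch, with a made-up routing outcome standing in for a partly collapsed router, of the per-expert token counts and a simple imbalance ratio:

```python
import numpy as np

M, T, k = 8, 4096, 2
rng = np.random.default_rng(0)

# Assumed skewed routing distribution: two experts receive most of the traffic.
p = np.array([0.30, 0.25, 0.12, 0.10, 0.08, 0.07, 0.05, 0.03])
assignments = rng.choice(M, size=T * k, p=p)     # k expert slots per token

n = np.bincount(assignments, minlength=M)        # n_i: tokens per expert
mean_load = T * k / M
print("n_i:", n)
print("mean load:", mean_load, " max/mean imbalance:", round(n.max() / mean_load, 2))
```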

10.2 Drop rate

Main idea. Measure overflowed tokens.

Core relation:

\mathrm{drop}=\sum_i\max(0,n_i-C_i)/T

Each expert can process at most its capacity C_i per batch, typically C=\mathrm{CF}\cdot Tk/M for capacity factor CF. Tokens routed to a full expert overflow; depending on the implementation they are dropped (the MoE layer contributes nothing and the token rides the residual connection) or rerouted to a lower-choice expert. The drop rate divides total overflow by the number of tokens, so a rising drop rate means the sparse layer is quietly skipping part of the batch.

Common mistake. Raising the capacity factor hides drops but raises compute, memory, and all-to-all volume; it is a tradeoff, not a free fix.
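A minimal sketch of the drop-rate formula, reusing skewed counts like those in the histogram example and an assumed capacity factor:

```python
import numpy as np

M, T, k, CF = 8, 4096, 2, 1.25
n = np.array([2458, 2048, 983, 819, 655, 573, 410, 246])   # assumed n_i (sums to T*k)

C = int(CF * T * k / M)                     # per-expert capacity
overflow = np.maximum(0, n - C)
drop_rate = overflow.sum() / T

print("capacity per expert:", C)
print("overflow per expert:", overflow)
print(f"drop rate: {drop_rate:.3f}")
```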

10.3 Router entropy

Main idea. Track whether routing is collapsing or too diffuse.

Core relation:

H(p)=-\sum_i p_i\log p_i

Per-token router entropy measures how peaked the routing distribution is, and its batch mean is a cheap collapse signal. Entropy near zero means almost all probability mass sits on one expert per token, which is fine if different tokens pick different experts but suspicious when the histogram is also skewed; entropy near \log M means routing is nearly uniform and experts are probably not specializing. The trend over training matters more than the absolute value.

Common mistake. Low entropy by itself is not collapse; collapse is low entropy combined with a concentrated expert histogram.
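A minimal sketch of per-token router entropy and its batch mean, with assumed probabilities for a very peaked, a moderately peaked, and a uniform token:

```python
import numpy as np

probs = np.array([                  # assumed router probabilities over M=4 experts
    [0.97, 0.01, 0.01, 0.01],       # very peaked
    [0.60, 0.30, 0.05, 0.05],       # moderately peaked
    [0.25, 0.25, 0.25, 0.25],       # uniform (entropy = log M)
])

H = -(probs * np.log(probs)).sum(axis=-1)   # per-token entropy in nats
print("per-token entropy:", np.round(H, 3))
print("batch mean:", round(float(H.mean()), 3), " log(M) =", round(float(np.log(4)), 3))
```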

10.4 Per-expert gradients

Main idea. Cold experts have small or zero gradient norms.

Core relation:

\Vert g_i\Vert

An expert that receives no tokens in a step receives no gradient in that step, so a per-expert gradient norm that stays near zero over many steps is direct evidence of a cold expert rather than a temporarily unlucky one. Router noise and auxiliary balance losses exist largely to keep these norms from pinning at zero.

Common mistake. Averaging gradient norms across experts defeats the diagnostic; it is the per-expert values, especially the smallest ones, that carry the signal.
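A minimal PyTorch sketch (assumed toy modules, with the router deliberately biased so one expert hogs the traffic) showing that experts with few or no routed tokens end up with small or zero gradient norms:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, M, T = 16, 4, 64                              # hidden size, experts, tokens
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(M)])
router = nn.Linear(d, M)
with torch.no_grad():
    router.bias[0] = 4.0                         # bias the router so expert 0 dominates

x = torch.randn(T, d)
probs = torch.softmax(router(x), dim=-1)
top1 = probs.argmax(dim=-1)                      # top-1 routing for simplicity

# Masked dense compute: wasteful, but keeps the sketch short and differentiable.
y = torch.zeros(T, d)
for i, expert in enumerate(experts):
    keep = (top1 == i).float().unsqueeze(-1)
    y = y + keep * probs[:, i:i + 1] * expert(x)

loss = y.pow(2).mean()
loss.backward()

for i, expert in enumerate(experts):
    g2 = sum(p.grad.pow(2).sum() for p in expert.parameters())
    print(f"expert {i}: tokens={int((top1 == i).sum())}  ||g_i||={float(g2.sqrt()):.5f}")
```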

10.5 Ablations

Main idea. Compare dense, top-1, top-2, and capacity factors.

Core relation:

\Delta L,\ \Delta T,\ \Delta M

The decisive MoE experiments are controlled ablations: hold the data and either active compute or total memory fixed, then compare dense versus top-1 versus top-2 routing and a small sweep of capacity factors. Reporting the resulting changes in loss, throughput, and memory side by side is what separates "sparsity helped" from "we just added parameters."

Common mistake. Comparing an MoE model to a dense baseline with the same total parameters, or with the same active parameters, answers two different questions; state which one the ablation holds fixed.


Practice Exercises

  1. Compute top-k experts from router probabilities.
  2. Count dense FFN and MoE expert parameters.
  3. Compute active versus total expert parameters.
  4. Compute expert capacity from tokens, experts, and capacity factor.
  5. Compute drop rate from expert loads.
  6. Compute a Switch-style auxiliary balancing term.
  7. Compute router entropy for a token.
  8. Estimate all-to-all token traffic by expert placement.
  9. Compare top-1 and top-2 active compute.
  10. Write an MoE debugging checklist.

Why This Matters for AI

MoE models are attractive because they can increase capacity without a proportional increase in active compute. But they are not free. They create routing, balancing, communication, memory, and serving problems. Learning MoE math means learning to ask precise questions: which experts were active, how balanced were they, how many tokens were dropped, how much traffic moved, and how much quality came from sparsity rather than raw parameter count?

Bridge to Quantization and Distillation

Quantization and distillation also change the relationship between quality, memory, and compute. The next section studies how precision reduction and teacher-student training compress models while trying to preserve behavior.

References