Math for LLMs: Notes

Mixture of Experts and Routing

Mixture-of-experts models separate total capacity from active per-token computation. A router chooses a small number of expert networks for each token, giving the model many parameters without using all of them on every token.

Overview

A dense transformer FFN applies the same network to every token. An MoE FFN replaces that one network with many experts:

y_t=\sum_{i\in S_t} g_{t,i}E_i(x_t),

where S_t is the selected expert set for token t, g_{t,i} is the router weight, and E_i is expert i. The central math is not only the weighted sum. It is the accounting around it: active parameters, total parameters, router probabilities, expert capacity, load balancing, all-to-all dispatch, drop rate, and latency.

Prerequisites

  • Transformer FFN shapes
  • Softmax and top-k selection
  • Training-at-scale memory and parallelism
  • Efficient inference and serving latency vocabulary

Companion Notebooks

Notebook | Purpose
theory.ipynb | Demonstrates router softmax, top-k dispatch, capacity overflow, auxiliary balance loss, expert histograms, all-to-all traffic, active parameter counts, and router-collapse diagnostics.
exercises.ipynb | Ten practice problems for routing probabilities, capacity, drop rate, load balancing, expert counts, and MoE debugging.

Learning Objectives

After this section, you should be able to:

  • Explain total parameters versus active parameters.
  • Compute router probabilities and top-k selected experts.
  • Count MoE expert parameters and active expert compute.
  • Compute expert capacity and token overflow.
  • Define importance, load, auxiliary load-balancing loss, entropy, and z-loss.
  • Explain why MoE training often needs all-to-all communication.
  • Diagnose expert collapse with histograms, drop rate, entropy, and gradient norms.
  • Explain the inference tradeoff: lower active compute but higher memory and routing complexity.

Table of Contents

  1. Dense versus Sparse Computation
  2. Router Mathematics
  3. Parameter and FLOP Accounting
  4. Capacity and Token Dropping
  5. Load Balancing Losses
  6. Expert Parallelism
  7. Training Dynamics
  8. Inference Behavior
  9. MoE Design Variants
  10. Diagnostics

One-Layer MoE Shape Flow

tokens x_t
   |
router logits r_t = W_r x_t
   |
top-k experts S_t
   |
dispatch tokens to experts
   |
expert FFNs E_i(x_t)
   |
weighted combine and restore token order

1. Dense versus Sparse Computation

This part studies dense versus sparse computation in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Dense FFN | every token uses the same feed-forward network | y=\mathrm{FFN}(x)
Expert bank | replace one FFN with many candidate FFNs | E_1,\ldots,E_M
Sparse activation | each token uses only k experts | k\ll M
Total versus active parameters | MoE increases capacity without proportional per-token compute | P_\mathrm{active}\ll P_\mathrm{total}
Memory caveat | inactive experts still occupy memory | M_\mathrm{weights}\propto P_\mathrm{total}

1.1 Dense FFN

Main idea. Every token uses the same feed-forward network.

Core relation:

y=\mathrm{FFN}(x)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

1.2 Expert bank

Main idea. Replace one FFN with many candidate FFNs.

Core relation:

E_1,\ldots,E_M

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

1.3 Sparse activation

Main idea. Each token uses only k experts.

Core relation:

k\ll M

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

1.4 Total versus active parameters

Main idea. MoE increases capacity without proportional per-token compute.

Core relation:

P_\mathrm{active}\ll P_\mathrm{total}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This distinction is the reason MoE models can have large total capacity with smaller per-token compute.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

1.5 Memory caveat

Main idea. Inactive experts still occupy memory.

Core relation:

M_\mathrm{weights}\propto P_\mathrm{total}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

2. Router Mathematics

This part studies router mathematics in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Router logits | a small projection scores experts for each token | r=W_r x
Routing probabilities | softmax turns router logits into expert probabilities | p_i=\exp(r_i)/\sum_j\exp(r_j)
Top-k selection | only the highest scoring experts receive the token | S=\mathrm{TopK}(p,k)
Gated combination | selected expert outputs are weighted by router probabilities | y=\sum_{i\in S}\tilde p_i E_i(x)
Top-1 Switch routing | route each token to one expert for simplicity | y=E_{\arg\max_i p_i}(x)

2.1 Router logits

Main idea. A small projection scores experts for each token.

Core relation:

r=W_r x

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

2.2 Routing probabilities

Main idea. Softmax turns router logits into expert probabilities.

Core relation:

p_i=\exp(r_i)/\sum_j\exp(r_j)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

2.3 Top-k selection

Main idea. Only the highest scoring experts receive the token.

Core relation:

S=\mathrm{TopK}(p,k)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

2.4 Gated combination

Main idea. Selected expert outputs are weighted by router probabilities.

Core relation:

y=\sum_{i\in S}\tilde p_i E_i(x)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
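The sketch below walks one token through the routing math in this part: logits, softmax, top-k selection, renormalized gates, and the gated combination. It is a minimal NumPy version, not any particular framework's API; the dimensions, weight scales, and names such as moe_forward are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, M, k = 16, 64, 8, 2                          # model dim, expert hidden dim, experts, top-k

W_r = rng.normal(scale=0.02, size=(M, d))             # router projection W_r
experts = [(rng.normal(scale=0.02, size=(d_ff, d)),   # up projection
            rng.normal(scale=0.02, size=(d, d_ff)))   # down projection
           for _ in range(M)]

def moe_forward(x):
    """Route one token x of shape [d] through a top-k MoE FFN (illustrative only)."""
    r = W_r @ x                                       # router logits r = W_r x
    p = np.exp(r - r.max())
    p /= p.sum()                                      # softmax probabilities p_i
    S = np.argsort(p)[-k:]                            # top-k expert indices
    g = p[S] / p[S].sum()                             # renormalized gates
    y = np.zeros(d)
    for gate, i in zip(g, S):
        W1, W2 = experts[i]
        y += gate * (W2 @ np.maximum(W1 @ x, 0.0))    # gate * E_i(x) with a ReLU FFN
    return y, S, g

y, S, g = moe_forward(rng.normal(size=d))
print("selected experts:", S, "gates:", np.round(g, 3))
```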

2.5 Top-1 Switch routing

Main idea. Route each token to one expert for simplicity.

Core relation:

y=E_{\arg\max_i p_i}(x)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

3. Parameter and FLOP Accounting

This part studies parameter and FLOP accounting in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
FFN parameter count | a transformer FFN has two large projections | P_\mathrm{ffn}\approx 2dd_\mathrm{ff}
MoE total parameters | experts multiply the FFN parameter count | P_\mathrm{experts}\approx M\cdot 2dd_\mathrm{ff}
MoE active parameters | only selected experts are used per token | P_\mathrm{active}\approx k\cdot 2dd_\mathrm{ff}
Router overhead | router cost is usually small compared with expert FFNs | P_\mathrm{router}=dM
Compute ratio | sparse compute scales with k, not M | \mathrm{FLOPs}_\mathrm{MoE}/\mathrm{FLOPs}_\mathrm{dense}\approx k if expert size matches the dense FFN

3.1 FFN parameter count

Main idea. A transformer FFN has two large projections.

Core relation:

P_\mathrm{ffn}\approx 2dd_\mathrm{ff}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

3.2 MoE total parameters

Main idea. Experts multiply the FFN parameter count.

Core relation:

P_\mathrm{experts}\approx M\cdot 2dd_\mathrm{ff}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

3.3 MoE active parameters

Main idea. Only selected experts are used per token.

Core relation:

P_\mathrm{active}\approx k\cdot 2dd_\mathrm{ff}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

3.4 Router overhead

Main idea. Router cost is usually small compared with expert FFNs.

Core relation:

P_\mathrm{router}=dM

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

3.5 Compute ratio

Main idea. Sparse compute scales with k, not M.

Core relation:

\mathrm{FLOPs}_\mathrm{MoE}/\mathrm{FLOPs}_\mathrm{dense}\approx k if expert size matches the dense FFN

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
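As a quick worked instance of the accounting in this part, the snippet below plugs illustrative values of d, d_ff, M, and k into the formulas above. The "about 2 FLOPs per parameter per token" rule of thumb is an assumption for the sketch, not a measured number.

```python
# Illustrative parameter and FLOP accounting for one MoE layer versus one dense FFN.
d, d_ff, M, k = 4096, 16384, 8, 2

p_ffn     = 2 * d * d_ff            # P_ffn ~ 2 d d_ff (up + down projection)
p_experts = M * p_ffn               # P_experts ~ M * 2 d d_ff (all resident)
p_active  = k * p_ffn               # P_active ~ k * 2 d d_ff (used per token)
p_router  = d * M                   # P_router = d M (tiny by comparison)

flops_dense = 2 * p_ffn             # assumed ~2 FLOPs per parameter per token
flops_moe   = 2 * (p_active + p_router)

print(f"total expert params : {p_experts/1e9:.2f} B")
print(f"active params/token : {p_active/1e9:.2f} B")
print(f"FLOP ratio MoE/dense: {flops_moe/flops_dense:.2f}")  # ~k when expert size matches
```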

4. Capacity and Token Dropping

This part studies capacity and token dropping in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Expected tokens per expert | balanced routing sends roughly T/M tokens to each expert | E[n_i]=T/M
Capacity factor | reserve extra slots beyond the expected load | C_i=\lceil\mathrm{capacity\ factor}\cdot T/M\rceil
Overflow | tokens above capacity are dropped or rerouted | \max(0,n_i-C_i)
Batch sensitivity | small batches have noisier expert loads | \mathrm{Var}(n_i)=Tp_i(1-p_i)
Expert collapse | if the router favors a few experts, capacity and learning both suffer | p_i\approx 0 for many experts

4.1 Expected tokens per expert

Main idea. Balanced routing sends roughly T/M tokens to each expert.

Core relation:

E[n_i]=T/M

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

4.2 Capacity factor

Main idea. Reserve extra slots beyond the expected load.

Core relation:

C_i=\lceil\mathrm{capacity\ factor}\cdot T/M\rceil

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. Capacity is the serving and training contract between the router and the expert bank.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

4.3 Overflow

Main idea. Tokens above capacity are dropped or rerouted.

Core relation:

\max(0,n_i-C_i)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
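A small simulation of the capacity and overflow formulas above, assuming top-1 routing and a deliberately skewed routing distribution so that drops actually occur; the probabilities and capacity factor are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M, capacity_factor = 4096, 8, 1.25

capacity = int(np.ceil(capacity_factor * T / M))     # C_i = ceil(capacity_factor * T / M)

# Skewed top-1 routing probabilities so that some experts overflow.
p = np.array([0.30, 0.20, 0.10, 0.10, 0.10, 0.08, 0.07, 0.05])
assignments = rng.choice(M, size=T, p=p)             # expert index per token
counts = np.bincount(assignments, minlength=M)       # realized loads n_i

overflow = np.maximum(0, counts - capacity)          # max(0, n_i - C_i)
print("per-expert load:", counts)
print("capacity per expert:", capacity)
print(f"dropped tokens: {overflow.sum()} ({overflow.sum()/T:.1%} drop rate)")
```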

4.4 Batch sensitivity

Main idea. Small batches have noisier expert loads.

Core relation:

\mathrm{Var}(n_i)=Tp_i(1-p_i)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
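The following sketch checks the variance formula numerically by simulating the load n_i as a binomial draw; the batch sizes are arbitrary and chosen only to show the relative spread shrinking as T grows.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 8
p_i = 1 / M                                          # balanced routing probability

for T in (256, 4096, 65536):
    loads = rng.binomial(T, p_i, size=10_000)        # simulated n_i over many batches
    print(f"T={T:6d}  E[n_i]={T*p_i:7.0f}  observed std={loads.std():6.1f}  "
          f"predicted std={np.sqrt(T*p_i*(1-p_i)):6.1f}")
```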

4.5 Expert collapse

Main idea. If the router favors a few experts, capacity and learning both suffer.

Core relation:

p_i\approx 0 for many experts

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

5. Load Balancing Losses

This part studies load balancing losses in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Importance | sum of router probabilities assigned to each expert | I_i=\sum_t p_{t,i}
Load | number of tokens actually routed to each expert | L_i=\sum_t \mathbf{1}\{i\in S_t\}
Auxiliary loss | penalize uneven routing | L_\mathrm{aux}\propto M\sum_i f_i P_i
Entropy encouragement | router entropy can discourage overconfident early routing | H(p_t)=-\sum_i p_{t,i}\log p_{t,i}
Z-loss | penalize large router logits for stability | L_z=(\log\sum_i e^{r_i})^2

5.1 Importance

Main idea. Sum of router probabilities assigned to each expert.

Core relation:

I_i=\sum_t p_{t,i}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

5.2 Load

Main idea. Number of tokens actually routed to each expert.

Core relation:

L_i=\sum_t \mathbf{1}\{i\in S_t\}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

5.3 Auxiliary loss

Main idea. Penalize uneven routing.

Core relation:

L_\mathrm{aux}\propto M\sum_i f_i P_i

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. Without balancing, a router can discover a few experts and ignore the rest.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
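A minimal sketch of a Switch-style auxiliary balance loss, assuming top-1 routing, with f_i the fraction of tokens sent to expert i and P_i the mean router probability for expert i; the function name and shapes are illustrative, not any library's API.

```python
import numpy as np

def aux_balance_loss(router_probs, expert_index, M):
    """Switch-style L_aux proportional to M * sum_i f_i P_i for one batch (sketch only).

    router_probs: [T, M] softmax probabilities; expert_index: [T] top-1 choices.
    """
    f = np.bincount(expert_index, minlength=M) / len(expert_index)  # token fraction f_i
    P = router_probs.mean(axis=0)                                   # mean probability P_i
    return M * float(f @ P)

rng = np.random.default_rng(0)
T, M = 1024, 8
logits = rng.normal(size=(T, M))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
print("aux loss:", aux_balance_loss(probs, probs.argmax(axis=1), M))  # ~1 when balanced
```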

5.4 Entropy encouragement

Main idea. Router entropy can discourage overconfident early routing.

Core relation:

H(p_t)=-\sum_i p_{t,i}\log p_{t,i}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

5.5 Z-loss

Main idea. Penalize large router logits for stability.

Core relation:

L_z=(\log\sum_i e^{r_i})^2

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
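The helpers below compute the mean router entropy and a z-loss-style logit penalty for a batch of router logits; they are sketches of the formulas above under assumed shapes, not any framework's implementation.

```python
import numpy as np

def router_entropy(p):
    """Mean over tokens of H(p_t) = -sum_i p_{t,i} log p_{t,i}."""
    return float(-(p * np.log(p + 1e-9)).sum(axis=1).mean())

def router_z_loss(logits):
    """Mean over tokens of (log sum_i exp(r_i))^2, penalizing large router logits."""
    z = np.log(np.exp(logits).sum(axis=1))
    return float((z ** 2).mean())

rng = np.random.default_rng(0)
logits = rng.normal(scale=3.0, size=(1024, 8))       # wide logits inflate both diagnostics
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
print("mean router entropy:", round(router_entropy(p), 3))
print("router z-loss      :", round(router_z_loss(logits), 3))
```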

6. Expert Parallelism

This part studies expert parallelism in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Expert placement | different devices own different experts | E_i\rightarrow\mathrm{rank}(i)
All-to-all dispatch | tokens move to the devices that own their experts | \mathrm{tokens}\rightarrow\mathrm{experts}
Combine step | expert outputs return to the original token order | y_t=\sum_i g_{t,i}E_i(x_t)
Communication bottleneck | MoE speed depends on token traffic, not only FLOPs | T_\mathrm{step}\approx\max(T_\mathrm{expert},T_\mathrm{alltoall})
Locality | routing and placement choices can reduce cross-device movement | \mathrm{traffic}\downarrow

6.1 Expert placement

Main idea. Different devices own different experts.

Core relation:

E_i\rightarrow\mathrm{rank}(i)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

6.2 All-to-all dispatch

Main idea. Tokens move to the devices that own their experts.

Core relation:

\mathrm{tokens}\rightarrow\mathrm{experts}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. MoE turns part of the model into a distributed routing problem.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

6.3 Combine step

Main idea. Expert outputs return to original token order.

Core relation:

y_t=\sum_i g_{t,i}E_i(x_t)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

6.4 Communication bottleneck

Main idea. MoE speed depends on token traffic, not only FLOPs.

Core relation:

T_\mathrm{step}\approx\max(T_\mathrm{expert},T_\mathrm{alltoall})

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
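A back-of-envelope step-time estimate in the spirit of the formula above: dispatch-and-combine traffic divided by link bandwidth, versus expert FLOPs divided by achievable throughput. Every constant here (bandwidth, throughput, sizes) is an assumed placeholder, not a measurement of any real system.

```python
# Rough step time for one expert-parallel MoE layer: ~max(expert compute, all-to-all).
tokens_per_device = 8192
d_model, d_ff, k  = 4096, 16384, 2
bytes_per_value   = 2            # bf16 activations
link_bandwidth    = 50e9         # assumed effective all-to-all bandwidth, bytes/s
matmul_throughput = 150e12       # assumed achieved FLOP/s on expert matmuls

# Dispatch and combine each move k copies of every token across the network.
traffic_bytes = 2 * k * tokens_per_device * d_model * bytes_per_value
t_alltoall    = traffic_bytes / link_bandwidth

flops_per_token = k * 2 * (2 * d_model * d_ff)       # two projections per expert
t_expert        = tokens_per_device * flops_per_token / matmul_throughput

print(f"all-to-all ~ {t_alltoall*1e3:.1f} ms, expert compute ~ {t_expert*1e3:.1f} ms, "
      f"step ~ {max(t_alltoall, t_expert)*1e3:.1f} ms")
```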

6.5 Locality

Main idea. Routing and placement choices can reduce cross-device movement.

Core relation:

\mathrm{traffic}\downarrow

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

7. Training Dynamics

This part studies training dynamics in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Specialization | experts differentiate because they receive different token subsets | \nabla_{\theta_i}L only from routed tokens
Cold experts | rarely selected experts learn slowly | n_i\approx 0\Rightarrow\nabla_{\theta_i}\approx 0
Router noise | noise can encourage exploration early in training | r'=r+\epsilon
Top-2 gradients | top-2 routing gives more experts gradient signal than top-1 | |S|=2
Stability tradeoff | strong balancing can fight useful specialization | L=L_\mathrm{task}+\lambda L_\mathrm{aux}

7.1 Specialization

Main idea. Experts differentiate because they receive different token subsets.

Core relation:

\nabla_{\theta_i}L only from routed tokens

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

7.2 Cold experts

Main idea. Rarely selected experts learn slowly.

Core relation:

n_i\approx 0\Rightarrow\nabla_{\theta_i}\approx 0

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

7.3 Router noise

Main idea. Noise can encourage exploration early in training.

Core relation:

r'=r+\epsilon

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
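A toy demonstration of r' = r + epsilon: with one expert's logits artificially inflated, Gaussian noise on the router logits spreads top-1 assignments across more experts. The bias, noise scale, and shapes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 4096, 8
bias = np.array([2.0] + [0.0] * (M - 1))             # expert 0 is currently favored
logits = rng.normal(size=(T, M)) + bias

def top1_counts(r):
    """Histogram of top-1 expert assignments for a batch of logits."""
    return np.bincount(r.argmax(axis=1), minlength=M)

noise = rng.normal(scale=1.5, size=logits.shape)     # r' = r + eps
print("no noise  :", top1_counts(logits))
print("with noise:", top1_counts(logits + noise))
```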

7.4 Top-2 gradients

Main idea. Top-2 routing gives more experts gradient signal than top-1.

Core relation:

|S|=2

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

7.5 Stability tradeoff

Main idea. Strong balancing can fight useful specialization.

Core relation:

L=L_\mathrm{task}+\lambda L_\mathrm{aux}

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

8. Inference Behavior

This part studies inference behavior in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Active compute | per-token expert compute depends on k | k selected experts
Weight memory | serving must store or page all experts that may be routed | P_\mathrm{total} resident or streamed
Batch routing variance | different requests can activate different experts | S_t varies by token
Cache interaction | MoE changes FFN compute but not attention KV cache math directly | M_\mathrm{KV} unchanged by experts
Latency tails | hot experts and cross-device traffic can increase p95 latency | Q_{0.95}(T)

8.1 Active compute

Main idea. Per-token expert compute depends on k.

Core relation:

k selected experts

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

8.2 Weight memory

Main idea. Serving must store or page all experts that may be routed.

Core relation:

P_\mathrm{total} resident or streamed

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
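The snippet below turns the memory caveat into numbers: every expert in every MoE layer must be resident (or paged) even though only k are touched per token. The layer count, expert count, and bf16 weights are assumed values chosen only for illustration.

```python
# Serving-memory sketch: all experts stay resident even though each token uses only k.
d, d_ff, M, k, n_moe_layers = 4096, 16384, 64, 2, 32
bytes_per_param = 2                                    # bf16 weights

p_layer_total  = M * 2 * d * d_ff                      # expert params in one MoE layer
p_layer_active = k * 2 * d * d_ff                      # expert params used per token

total_gib  = n_moe_layers * p_layer_total  * bytes_per_param / 2**30
active_gib = n_moe_layers * p_layer_active * bytes_per_param / 2**30
print(f"resident expert weights: {total_gib:.0f} GiB; touched per token: {active_gib:.0f} GiB")
```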

8.3 Batch routing variance

Main idea. Different requests can activate different experts.

Core relation:

S_t varies by token

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

8.4 Cache interaction

Main idea. MoE changes FFN compute but not attention KV cache math directly.

Core relation:

M_\mathrm{KV} unchanged by experts

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

8.5 Latency tails

Main idea. Hot experts and cross-device traffic can increase p95 latency.

Core relation:

Q_{0.95}(T)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.
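A synthetic illustration of the latency-tail point: a small fraction of steps hitting a hot expert barely moves the mean step time but clearly shifts Q_{0.95}. The latency distribution is invented for the example, not measured.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
base = rng.normal(loc=20.0, scale=1.0, size=n)            # ms per step when balanced
hot  = rng.random(n) < 0.05                               # 5% of steps hit a hot expert
latency = base + hot * rng.normal(loc=15.0, scale=3.0, size=n)

print(f"mean: {latency.mean():.1f} ms   p95 (Q_0.95): {np.quantile(latency, 0.95):.1f} ms")
```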

9. MoE Design Variants

This part studies MoE design variants in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

Subtopic | Main idea | Formula
Sparsely gated MoE | learned gates select a sparse expert subset | y=\sum_i g_i E_i(x)
GShard | scaled conditional computation with automatic sharding | \mathrm{expert\ parallelism}
Switch Transformer | top-1 routing simplifies dispatch | k=1
Top-2 MoE | two experts can improve quality at higher compute | k=2
Shared experts | some designs combine routed experts with always-on shared experts | y=E_\mathrm{shared}(x)+E_\mathrm{routed}(x)

9.1 Sparsely gated MoE

Main idea. Learned gates select a sparse expert subset.

Core relation:

y=\sum_i g_i E_i(x)

An MoE layer replaces a single dense feed-forward block with a bank of experts and a router. The router decides which experts see each token. This gives the model more total parameters than a dense model with similar active compute, but it adds routing instability, capacity limits, communication, and memory pressure.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Implementation check. For every MoE run, log expert token counts, drop rate, router entropy, auxiliary loss, and per-expert gradient norms. A low loss curve can hide a collapsed router.

AI connection. This is a practical MoE control variable.

Common mistake. Do not say an MoE model "uses all its parameters" for one token. The correct statement is total parameters versus active parameters per token.

9.2 GShard

Main idea. Scaled conditional computation with automatic sharding.

Core relation:

\mathrm{expert\ parallelism}

GShard scaled conditional computation to very large translation models by placing different experts on different devices (expert parallelism) and expressing the placement through lightweight sharding annotations that the compiler turns into a distributed program. It also introduced much of the vocabulary used in the rest of these notes: top-2 gating, a per-expert capacity derived from a capacity factor, and an auxiliary loss that pushes the router toward balanced expert loads.

Common mistake. Do not confuse expert parallelism with tensor parallelism. In expert parallelism whole experts live on different devices, so tokens, not weight shards, move across the network.
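A minimal sketch, with assumed placements and counts, of estimating how many tokens must cross devices when experts are spread over an expert-parallel group; this is the quantity behind all-to-all dispatch volume:

```python
import numpy as np

rng = np.random.default_rng(0)
M, D, T = 8, 4, 1024                      # experts, devices, tokens on this device
expert_device = np.arange(M) % D          # assumed placement: experts striped over devices
local_device = 0

# Assumed routing outcome on this device: tokens per expert.
tokens_per_expert = rng.multinomial(T, np.full(M, 1.0 / M))

remote = sum(n for i, n in enumerate(tokens_per_expert)
             if expert_device[i] != local_device)
print("tokens routed to remote experts:", remote, f"({remote / T:.0%} of the batch)")
# With experts striped over D devices and balanced routing,
# roughly (D-1)/D of the tokens leave the device each layer.
```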

9.3 Switch Transformer

Main idea. Top-1 routing simplifies dispatch.

Core relation:

k=1

Routing each token to a single expert removes the need to combine multiple expert outputs, halves dispatch traffic and active expert compute relative to top-2, and simplifies capacity bookkeeping. The Switch Transformer pairs top-1 routing with a simple auxiliary balancing term: for each expert, multiply the fraction of tokens it receives by the mean router probability it is assigned, sum over experts, and scale by the number of experts, so the term is minimized when both quantities are uniform.

Common mistake. Top-1 routing does not make the router unimportant; it still has to spread tokens well, and with k=1 every misrouted token is processed by exactly one wrong expert.
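A minimal numpy sketch of that Switch-style auxiliary term, assuming toy router probabilities; f_i is the fraction of tokens whose top-1 choice is expert i and P_i is the mean router probability on expert i:

```python
import numpy as np

probs = np.array([                 # assumed router probabilities, 4 tokens x 3 experts
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
    [0.40, 0.45, 0.15],
])
T, M = probs.shape
top1 = probs.argmax(axis=-1)

f = np.bincount(top1, minlength=M) / T     # fraction of tokens per expert
P = probs.mean(axis=0)                     # mean router probability per expert
aux = M * np.sum(f * P)                    # Switch-style balance term (1.0 when both are uniform)

print("f:", f, " P:", P, " aux:", round(float(aux), 3))
```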

9.4 Top-2 MoE

Main idea. Two experts can improve quality at higher compute.

Core relation:

k=2

Blending two experts per token roughly doubles active FFN compute and dispatch traffic relative to top-1, but the second expert usually improves quality, makes routing less brittle, and keeps more experts receiving gradient.

Worked micro-example. If a dense FFN has 2dd_\mathrm{ff} parameters and an MoE layer has M=8 experts of the same size, total expert parameters increase by 8x. With top-2 routing, each token uses only two experts, so the active FFN compute is about 2x a single dense FFN, not 8x.

Common mistake. Do not compare a top-2 MoE to a dense model by total parameters alone; the fair comparisons hold either active compute or total memory fixed.
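A minimal sketch of the active-versus-total accounting for k=1 versus k=2, counting only expert FFN parameters and using assumed toy dimensions:

```python
d, d_ff, M = 4096, 16384, 8            # assumed hidden size, expert width, expert count

dense_ffn = 2 * d * d_ff               # one dense FFN (two weight matrices)
total_moe = M * dense_ffn              # all experts, resident in memory
active = {k: k * dense_ffn for k in (1, 2)}   # per-token active expert parameters

print(f"dense FFN params:       {dense_ffn / 1e6:.0f}M")
print(f"total expert params:    {total_moe / 1e6:.0f}M  ({M}x dense)")
for k, a in active.items():
    print(f"active params (top-{k}): {a / 1e6:.0f}M  ({a // dense_ffn}x dense)")
```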

9.5 Shared experts

Main idea. Some designs combine routed experts with always-on shared experts.

Core relation:

y=E_\mathrm{shared}(x)+E_\mathrm{routed}(x)

Designs such as DeepSeekMoE pass every token through one or more always-on shared experts in addition to its routed experts. The shared expert absorbs features that every token needs, which frees the routed experts to specialize, and it adds a fixed amount of compute per token, so it counts toward active parameters regardless of the routing decision.

Common mistake. When quoting active parameters for a shared-expert design, include the shared experts; they run for every token even though the router never selects them.
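A minimal sketch, extending the previous accounting with assumed toy sizes, of how an always-on shared expert changes the active-parameter count:

```python
d, d_ff, M, k = 4096, 16384, 8, 2       # assumed hidden size, expert width, routed experts, top-k
n_shared = 1                            # assumed number of always-on shared experts

expert_params = 2 * d * d_ff
total = (M + n_shared) * expert_params
active = (k + n_shared) * expert_params          # shared experts run for every token

print(f"total expert params: {total / 1e9:.2f}B")
print(f"active per token:    {active / 1e9:.2f}B  (top-{k} routed + {n_shared} shared)")
```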

10. Diagnostics

This part studies diagnostics in mixture-of-experts LLMs. The useful habit is to separate probability, capacity, compute, memory, and communication.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Expert histogram | plot token counts per expert | n_i |
| Drop rate | measure overflowed tokens | \mathrm{drop}=\sum_i\max(0,n_i-C_i)/T |
| Router entropy | track whether routing is collapsing or too diffuse | H(p) |
| Per-expert gradients | cold experts have small or zero gradient norms | \Vert g_i\Vert |
| Ablations | compare dense, top-1, top-2, and capacity factors | \Delta L,\ \Delta T,\ \Delta M |

10.1 Expert histogram

Main idea. Plot token counts per expert.

Core relation:

n_i

The counts n_i of tokens routed to each expert form the histogram, and it is the first picture to look at when an MoE model behaves strangely. A healthy top-k router keeps every n_i within a small factor of the mean load Tk/M; a collapsed router piles most tokens onto a few experts and leaves the rest near zero, which also predicts capacity overflow on the hot experts.

Common mistake. A flat training loss does not imply a flat histogram; a low loss curve can hide a collapsed router.
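A minimal numpy sketch, with a made-up routing outcome standing in for a partly collapsed router, of the per-expert token counts and a simple imbalance ratio:

```python
import numpy as np

M, T, k = 8, 4096, 2
rng = np.random.default_rng(0)

# Assumed skewed routing distribution: two experts receive most of the traffic.
p = np.array([0.30, 0.25, 0.12, 0.10, 0.08, 0.07, 0.05, 0.03])
assignments = rng.choice(M, size=T * k, p=p)     # k expert slots per token

n = np.bincount(assignments, minlength=M)        # n_i: tokens per expert
mean_load = T * k / M
print("n_i:", n)
print("mean load:", mean_load, " max/mean imbalance:", round(n.max() / mean_load, 2))
```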

10.2 Drop rate

Main idea. Measure overflowed tokens.

Core relation:

\mathrm{drop}=\sum_i\max(0,n_i-C_i)/T

Each expert can process at most its capacity C_i per batch, typically C=\mathrm{CF}\cdot Tk/M for capacity factor CF. Tokens routed to a full expert overflow; depending on the implementation they are dropped (the MoE layer contributes nothing and the token rides the residual connection) or rerouted to a lower-choice expert. The drop rate divides total overflow by the number of tokens, so a rising drop rate means the sparse layer is quietly skipping part of the batch.

Common mistake. Raising the capacity factor hides drops but raises compute, memory, and all-to-all volume; it is a tradeoff, not a free fix.
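A minimal sketch of the drop-rate formula, reusing skewed counts like those in the histogram example and an assumed capacity factor:

```python
import numpy as np

M, T, k, CF = 8, 4096, 2, 1.25
n = np.array([2458, 2048, 983, 819, 655, 573, 410, 246])   # assumed n_i (sums to T*k)

C = int(CF * T * k / M)                     # per-expert capacity
overflow = np.maximum(0, n - C)
drop_rate = overflow.sum() / T

print("capacity per expert:", C)
print("overflow per expert:", overflow)
print(f"drop rate: {drop_rate:.3f}")
```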

10.3 Router entropy

Main idea. Track whether routing is collapsing or too diffuse.

Core relation:

H(p)=-\sum_i p_i\log p_i

Per-token router entropy measures how peaked the routing distribution is, and its batch mean is a cheap collapse signal. Entropy near zero means almost all probability mass sits on one expert per token, which is fine if different tokens pick different experts but suspicious when the histogram is also skewed; entropy near \log M means routing is nearly uniform and experts are probably not specializing. The trend over training matters more than the absolute value.

Common mistake. Low entropy by itself is not collapse; collapse is low entropy combined with a concentrated expert histogram.
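A minimal sketch of per-token router entropy and its batch mean, with assumed probabilities for a very peaked, a moderately peaked, and a uniform token:

```python
import numpy as np

probs = np.array([                  # assumed router probabilities over M=4 experts
    [0.97, 0.01, 0.01, 0.01],       # very peaked
    [0.60, 0.30, 0.05, 0.05],       # moderately peaked
    [0.25, 0.25, 0.25, 0.25],       # uniform (entropy = log M)
])

H = -(probs * np.log(probs)).sum(axis=-1)   # per-token entropy in nats
print("per-token entropy:", np.round(H, 3))
print("batch mean:", round(float(H.mean()), 3), " log(M) =", round(float(np.log(4)), 3))
```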

10.4 Per-expert gradients

Main idea. Cold experts have small or zero gradient norms.

Core relation:

\Vert g_i\Vert

An expert that receives no tokens in a step receives no gradient in that step, so a per-expert gradient norm that stays near zero over many steps is direct evidence of a cold expert rather than a temporarily unlucky one. Router noise and auxiliary balance losses exist largely to keep these norms from pinning at zero.

Common mistake. Averaging gradient norms across experts defeats the diagnostic; it is the per-expert values, especially the smallest ones, that carry the signal.
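A minimal PyTorch sketch (assumed toy modules, with the router deliberately biased so one expert hogs the traffic) showing that experts with few or no routed tokens end up with small or zero gradient norms:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, M, T = 16, 4, 64                              # hidden size, experts, tokens
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(M)])
router = nn.Linear(d, M)
with torch.no_grad():
    router.bias[0] = 4.0                         # bias the router so expert 0 dominates

x = torch.randn(T, d)
probs = torch.softmax(router(x), dim=-1)
top1 = probs.argmax(dim=-1)                      # top-1 routing for simplicity

# Masked dense compute: wasteful, but keeps the sketch short and differentiable.
y = torch.zeros(T, d)
for i, expert in enumerate(experts):
    keep = (top1 == i).float().unsqueeze(-1)
    y = y + keep * probs[:, i:i + 1] * expert(x)

loss = y.pow(2).mean()
loss.backward()

for i, expert in enumerate(experts):
    g2 = sum(p.grad.pow(2).sum() for p in expert.parameters())
    print(f"expert {i}: tokens={int((top1 == i).sum())}  ||g_i||={float(g2.sqrt()):.5f}")
```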

10.5 Ablations

Main idea. Compare dense, top-1, top-2, and capacity factors.

Core relation:

\Delta L,\ \Delta T,\ \Delta M

The decisive MoE experiments are controlled ablations: hold the data and either active compute or total memory fixed, then compare dense versus top-1 versus top-2 routing and a small sweep of capacity factors. Reporting the resulting changes in loss, throughput, and memory side by side is what separates "sparsity helped" from "we just added parameters."

Common mistake. Comparing an MoE model to a dense baseline with the same total parameters, or with the same active parameters, answers two different questions; state which one the ablation holds fixed.


Practice Exercises

  1. Compute top-k experts from router probabilities.
  2. Count dense FFN and MoE expert parameters.
  3. Compute active versus total expert parameters.
  4. Compute expert capacity from tokens, experts, and capacity factor.
  5. Compute drop rate from expert loads.
  6. Compute a Switch-style auxiliary balancing term.
  7. Compute router entropy for a token.
  8. Estimate all-to-all token traffic by expert placement.
  9. Compare top-1 and top-2 active compute.
  10. Write an MoE debugging checklist.

Why This Matters for AI

MoE models are attractive because they can increase capacity without a proportional increase in active compute. But they are not free. They create routing, balancing, communication, memory, and serving problems. Learning MoE math means learning to ask precise questions: which experts were active, how balanced were they, how many tokens were dropped, how much traffic moved, and how much quality came from sparsity rather than raw parameter count?

Bridge to Quantization and Distillation

Quantization and distillation also change the relationship between quality, memory, and compute. The next section studies how precision reduction and teacher-student training compress models while trying to preserve behavior.

References