Math for LLMs

Mixture of Experts and Routing

Converted from exercises.ipynb for web reading.

Mixture of Experts and Routing: Exercises

Ten exercises cover the routing and accounting math behind MoE layers: top-k selection, parameter counts, capacity, drop rate, load balancing, entropy, communication, and diagnostics.

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Exercise 1: Top-k routing

Given one token's router probabilities, return the indices of the two highest-probability experts, largest first.

Code cell 4

# Your Solution
p = np.array([0.10, 0.40, 0.25, 0.25])
print("Starter: argsort and take the last two indices.")

Code cell 5

# Solution
p = np.array([0.10, 0.40, 0.25, 0.25])
# Experts 2 and 3 tie at 0.25; a stable sort makes the tie-break deterministic.
top2 = np.argsort(p, kind="stable")[-2:][::-1]
print("top2:", top2)
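
Selecting the experts is only half of routing: the selected probabilities are usually turned into gate weights for mixing the expert outputs. The sketch below assumes the layer renormalizes the top-k probabilities so they sum to 1 (some designs use the raw probabilities instead). The helper name `top_k_gates` is hypothetical.

```python
import numpy as np

def top_k_gates(p, k):
    """Return (expert indices, renormalized gate weights) for the k largest entries of p."""
    # Same argsort pattern as the exercise; the stable sort fixes the tie-break.
    idx = np.argsort(p, kind="stable")[-k:][::-1]
    # Assumption: gates are renormalized over the selected experts only.
    weights = p[idx] / p[idx].sum()
    return idx, weights

p = np.array([0.10, 0.40, 0.25, 0.25])
idx, w = top_k_gates(p, k=2)
print("experts:", idx)
print("gate weights:", w)
```

With this input the two gate weights are 0.40/0.65 and 0.25/0.65, so the stronger expert contributes roughly 62% of the mixed output.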

Exercise 2: MoE parameter count

Count the total parameters across all experts, assuming each expert is a two-matrix FFN (d x d_ff and d_ff x d) with no biases.

Code cell 7

# Your Solution
d, d_ff, experts = 1024, 4096, 8
print("Starter: experts * 2*d*d_ff.")

Code cell 8

# Solution
d, d_ff, experts = 1024, 4096, 8
total = experts * 2 * d * d_ff
print("total expert params:", total)

Exercise 3: Active parameters

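Count the expert parameters a single token actually uses under top-2 routing, assuming each expert is a two-matrix FFN with no biases.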

Code cell 10

# Your Solution
d, d_ff, k = 1024, 4096, 2
print("Starter: k * 2*d*d_ff.")

Code cell 11

# Solution
d, d_ff, k = 1024, 4096, 2
active = k * 2 * d * d_ff
print("active expert params:", active)
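
Exercises 2 and 3 are most informative side by side, because the gap between stored and used parameters is the point of an MoE layer. This is a minimal sketch under the same assumption as the starters (two weight matrices per expert, no biases, router parameters ignored).

```python
d, d_ff, experts, k = 1024, 4096, 8, 2

# Assumption: each expert is a two-matrix FFN without biases.
per_expert = 2 * d * d_ff
total = experts * per_expert   # parameters that must be stored
active = k * per_expert        # parameters one token touches

print("per expert:", per_expert)
print("total:", total)
print("active:", active)
print("active fraction:", active / total)
```

The active fraction is simply k / experts, here 2 / 8, so each token uses a quarter of the stored expert parameters.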

Exercise 4: Capacity

Compute the per-expert capacity for 100 tokens, 8 experts, and a capacity factor of 1.25.

Code cell 13

# Your Solution
T, M, factor = 100, 8, 1.25
print("Starter: ceil(factor*T/M).")

Code cell 14

# Solution
T, M, factor = 100, 8, 1.25
capacity = int(np.ceil(factor * T / M))
print("capacity:", capacity)

Exercise 5: Drop rate

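Compute the fraction of routed tokens that are dropped, given per-expert loads and a fixed capacity. Tokens beyond an expert's capacity count as dropped.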

Code cell 16

# Your Solution
loads = np.array([20, 10, 15, 5])
capacity = 12
print("Starter: sum(max(0, load-capacity))/sum(loads).")

Code cell 17

# Solution
loads = np.array([20, 10, 15, 5])
capacity = 12
drop = np.maximum(0, loads - capacity).sum() / loads.sum()
print("drop rate:", drop)
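
Capacity and drop rate are linked through the capacity factor, so it can help to sweep the factor and watch the drop rate fall. The sketch below reuses the loads from this exercise and the ceil(factor*T/M) rule from Exercise 4. It assumes the loads stay fixed as the factor changes, which is a simplification; the factor values are illustrative.

```python
import numpy as np

loads = np.array([20, 10, 15, 5])
T, M = int(loads.sum()), len(loads)

def drop_rate(loads, capacity):
    """Fraction of routed tokens that exceed per-expert capacity."""
    return np.maximum(0, loads - capacity).sum() / loads.sum()

# Illustrative capacity factors; the loads are held fixed for simplicity.
for factor in [1.0, 1.25, 1.5, 2.0]:
    capacity = int(np.ceil(factor * T / M))
    print(f"factor={factor:.2f}  capacity={capacity}  drop={drop_rate(loads, capacity):.2f}")
```

Once the capacity reaches the largest load (20 here), no tokens are dropped, at the cost of reserving unused slots on the lightly loaded experts.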

Exercise 6: Auxiliary loss

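Compute the load-balancing auxiliary loss M * sum(f_i * P_i), where f_i is the fraction of tokens routed to expert i, P_i is the mean router probability for expert i, and M is the number of experts.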

Code cell 19

# Your Solution
f = np.array([0.5, 0.25, 0.25, 0.0])
P = np.array([0.4, 0.3, 0.2, 0.1])
print("Starter: M * dot(f, P).")

Code cell 20

# Solution
f = np.array([0.5, 0.25, 0.25, 0.0])
P = np.array([0.4, 0.3, 0.2, 0.1])
aux = len(f) * np.dot(f, P)
print("aux:", aux)
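
Two reference points make the auxiliary loss value easier to read: perfectly balanced routing and full collapse onto one expert. The sketch below evaluates M * sum(f_i * P_i) at both. The helper name `aux_loss` is hypothetical.

```python
import numpy as np

def aux_loss(f, P):
    """M * sum(f_i * P_i) for dispatch fractions f and mean router probabilities P."""
    return len(f) * np.dot(f, P)

M = 4
uniform = np.full(M, 1.0 / M)
collapsed = np.array([1.0, 0.0, 0.0, 0.0])

print("balanced:", aux_loss(uniform, uniform))
print("collapsed:", aux_loss(collapsed, collapsed))
```

Balanced routing gives 1 and full collapse gives M, so the exercise's value of 1.3 sits close to the balanced end.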

Exercise 7: Router entropy

Compute the Shannon entropy, in nats, of one token's router distribution.

Code cell 22

# Your Solution
p = np.array([0.7, 0.1, 0.1, 0.1])
print("Starter: -sum p log p.")

Code cell 23

# Solution
p = np.array([0.7, 0.1, 0.1, 0.1])
entropy = -np.sum(p * np.log(p))
print("entropy:", entropy)
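
A raw entropy value is hard to interpret without a scale. One option is to divide by the maximum possible entropy, log(M), reached by the uniform distribution over M experts. The sketch below does that and also skips zero probabilities, which would otherwise produce NaN in p * log(p). The helper name `entropy` is hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

p = np.array([0.7, 0.1, 0.1, 0.1])
h = entropy(p)
h_max = np.log(len(p))  # entropy of the uniform distribution over M experts

print("entropy (nats):", h)
print("normalized entropy:", h / h_max)
```

A normalized value near 1 means the router is nearly indifferent between experts; a value near 0 means it is nearly deterministic.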

Exercise 8: All-to-all traffic

Count the tokens whose chosen expert lives on a different rank than the token's origin rank.

Code cell 25

# Your Solution
origin = np.array([0, 0, 1, 1])
expert_rank = np.array([0, 1, 1, 0])
chosen_expert = np.array([0, 1, 2, 3])
print("Starter: compare origin to expert_rank[chosen_expert].")

Code cell 26

# Solution
origin = np.array([0, 0, 1, 1])
expert_rank = np.array([0, 1, 1, 0])
chosen_expert = np.array([0, 1, 2, 3])
dest = expert_rank[chosen_expert]
cross = origin != dest
print("cross-rank tokens:", cross.sum())
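
The single cross-rank count hides which pairs of ranks exchange tokens. A source-by-destination traffic matrix keeps that detail. This is a minimal sketch using the same arrays as the exercise; the variable names `n_ranks` and `traffic` are introduced here for illustration.

```python
import numpy as np

origin = np.array([0, 0, 1, 1])
expert_rank = np.array([0, 1, 1, 0])
chosen_expert = np.array([0, 1, 2, 3])
dest = expert_rank[chosen_expert]

# traffic[s, r] counts tokens that start on rank s and are sent to rank r.
n_ranks = int(max(origin.max(), expert_rank.max())) + 1
traffic = np.zeros((n_ranks, n_ranks), dtype=int)
np.add.at(traffic, (origin, dest), 1)

print(traffic)
print("cross-rank tokens:", traffic.sum() - np.trace(traffic))
```

The diagonal holds tokens that stay local, so the off-diagonal sum reproduces the cross-rank count from the solution.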

Exercise 9: Top-k compute

Compare active expert compute for top-1 and top-2 routing, measured in units of one dense FFN.

Code cell 28

# Your Solution
dense_ffn = 1.0
print("Starter: active compute is k*dense_ffn.")

Code cell 29

# Solution
dense_ffn = 1.0
for k in [1, 2]:
    print(f"top-{k} active compute:", k * dense_ffn)

Exercise 10: MoE checklist

List four diagnostics worth monitoring when training or debugging an MoE layer.

Code cell 31

# Your Solution
print("Starter: include load histogram, drop rate, entropy, and all-to-all traffic.")

Code cell 32

# Solution
checks = [
    "expert load histogram",
    "drop rate versus capacity factor",
    "router entropy and z-loss",
    "all-to-all traffic and per-expert gradient norms",
]
for check in checks:
    print("-", check)
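
Several of these checks can be computed from one batch of router outputs. The sketch below does so for a top-1 router, using random logits as a stand-in for a real model. The batch size, expert count, and capacity factor are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M, factor = 512, 8, 1.25          # hypothetical sizes, not from a real model
logits = rng.normal(size=(T, M))     # stand-in for real router logits

# Numerically stable softmax over experts, then top-1 assignment.
z = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
choice = probs.argmax(axis=1)

loads = np.bincount(choice, minlength=M)                # expert load histogram
capacity = int(np.ceil(factor * T / M))                 # per-expert capacity
drop = np.maximum(0, loads - capacity).sum() / T        # drop rate
mean_entropy = -(probs * np.log(probs)).sum(axis=1).mean()  # router entropy

print("loads:", loads)
print("capacity:", capacity)
print("drop rate:", drop)
print("mean router entropy:", mean_entropy)
```

In a real training run the same quantities would be logged per layer and per step rather than computed once.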

Closing Reflection

MoE is conditional computation plus the bookkeeping that routing demands. When analyzing an MoE layer, track total parameters, active compute, load balance, capacity, and communication as separate quantities.