Exercises Notebook
Converted from exercises.ipynb for web reading.
Mixture of Experts and Routing: Exercises
Ten exercises cover the routing and accounting math behind MoE layers: top-k selection, parameter counts, capacity, drop rate, load balancing, entropy, communication, and diagnostics.
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False
mpl.rcParams.update({
    "figure.figsize": (10, 6),
    "figure.dpi": 120,
    "font.size": 13,
    "axes.titlesize": 15,
    "axes.labelsize": 13,
    "xtick.labelsize": 11,
    "ytick.labelsize": 11,
    "legend.fontsize": 11,
    "legend.framealpha": 0.85,
    "lines.linewidth": 2.0,
    "axes.spines.top": False,
    "axes.spines.right": False,
    "savefig.bbox": "tight",
    "savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Exercise 1: Top-k routing
Find the indices of the two highest-probability experts, ordered from highest to lowest probability.
Code cell 4
# Your Solution
p = np.array([0.10, 0.40, 0.25, 0.25])
print("Starter: argsort and take the last two indices.")
Code cell 5
# Solution
p = np.array([0.10, 0.40, 0.25, 0.25])
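# argsort is ascending, so take the last two indices and reverse them.
# Experts 2 and 3 tie at 0.25; which of them ranks second depends on the sort order.
top2 = np.argsort(p)[-2:][::-1]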
print("top2:", top2)
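The same selection extends to a batch of tokens. The sketch below is a minimal illustration rather than the notebook's method: `topk_route` is a hypothetical helper, the gate weights are renormalized over the selected experts (a common choice, assumed here), and a stable sort breaks ties toward the lower expert index.

```python
import numpy as np

def topk_route(probs, k):
    """Return top-k expert indices and renormalized gate weights per token.

    Hypothetical helper for illustration; not part of the notebook.
    """
    probs = np.asarray(probs, dtype=float)
    # Stable sort on negated probabilities: descending order,
    # with ties resolved toward the lower expert index.
    order = np.argsort(-probs, axis=-1, kind="stable")
    idx = order[..., :k]
    # Gather the selected probabilities and renormalize them to sum to 1
    # (assumption: some routers use the raw probabilities instead).
    gates = np.take_along_axis(probs, idx, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)
    return idx, gates

probs = np.array([[0.10, 0.40, 0.25, 0.25],
                  [0.60, 0.05, 0.30, 0.05]])
idx, gates = topk_route(probs, k=2)
print("indices:\n", idx)
print("gates:\n", gates)
```

For the first token, experts 2 and 3 tie at 0.25, and the stable sort selects expert 2. A router that needs a different tie rule would change only the sort step.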
Exercise 2: MoE parameter count
Count the total expert parameters, assuming each expert is a two-matrix FFN (d x d_ff and d_ff x d) with no biases.
Code cell 7
# Your Solution
d, d_ff, experts = 1024, 4096, 8
print("Starter: experts * 2*d*d_ff.")
Code cell 8
# Solution
d, d_ff, experts = 1024, 4096, 8
total = experts * 2 * d * d_ff
print("total expert params:", total)
Exercise 3: Active parameters
Count the expert parameters that are active per token under top-2 routing.
Code cell 10
# Your Solution
d, d_ff, k = 1024, 4096, 2
print("Starter: k * 2*d*d_ff.")
Code cell 11
# Solution
d, d_ff, k = 1024, 4096, 2
active = k * 2 * d * d_ff
print("active expert params:", active)
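Exercises 2 and 3 are easiest to read side by side. The sketch below uses a hypothetical helper, `moe_ffn_params`, and keeps the same simplification as the exercises: two weight matrices per expert, with biases and the router ignored.

```python
def moe_ffn_params(d, d_ff, experts, k):
    """Return (total, active, active fraction) for the expert FFNs of one MoE layer.

    Hypothetical helper; counts only the two weight matrices per expert.
    """
    per_expert = 2 * d * d_ff      # up-projection plus down-projection
    total = experts * per_expert   # parameters stored
    active = k * per_expert        # parameters touched by one token
    return total, active, active / total

total, active, fraction = moe_ffn_params(d=1024, d_ff=4096, experts=8, k=2)
print(f"total expert params:  {total:,}")
print(f"active expert params: {active:,}")
print(f"active fraction:      {fraction:.2%}")
```

With 8 experts and top-2 routing the active fraction is k / experts, so a quarter of the stored expert weights are used for any single token.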
Exercise 4: Capacity
Compute the per-expert capacity for 100 tokens, 8 experts, and a capacity factor of 1.25.
Code cell 13
# Your Solution
T, M, factor = 100, 8, 1.25
print("Starter: ceil(factor*T/M).")
Code cell 14
# Solution
T, M, factor = 100, 8, 1.25
capacity = int(np.ceil(factor * T / M))
print("capacity:", capacity)
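To see how the capacity factor moves the limit, the sketch below wraps the same formula in a hypothetical helper, `expert_capacity`. The optional `k` multiplier is an assumption: some implementations scale capacity by the number of experts chosen per token, others do not.

```python
import math

def expert_capacity(tokens, experts, factor, k=1):
    """Per-expert token limit: ceil(factor * k * tokens / experts).

    Hypothetical helper. The k multiplier is an assumption; with k=1 this
    matches the formula used in the exercise.
    """
    return math.ceil(factor * k * tokens / experts)

for factor in (1.0, 1.25, 2.0):
    print(f"factor {factor}: capacity {expert_capacity(100, 8, factor)}")

print("top-2, factor 1.25:", expert_capacity(100, 8, 1.25, k=2))
```

A factor of 1.0 leaves no slack beyond a perfectly even split, so any imbalance drops tokens; larger factors trade memory and padding for fewer drops.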
Exercise 5: Drop rate
Compute the drop rate, the fraction of routed tokens that exceed their expert's capacity, from per-expert loads and a shared capacity.
Code cell 16
# Your Solution
loads = np.array([20, 10, 15, 5])
capacity = 12
print("Starter: sum(max(0, load-capacity))/sum(loads).")
Code cell 17
# Solution
loads = np.array([20, 10, 15, 5])
capacity = 12
drop = np.maximum(0, loads - capacity).sum() / loads.sum()
print("drop rate:", drop)
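The load-based formula says how many tokens are dropped but not which ones. The sketch below marks individual tokens, assuming a first-come, first-served policy in token order (one possible policy, not necessarily the one a given framework uses). `keep_mask` is a hypothetical helper.

```python
import numpy as np

def keep_mask(assignments, num_experts, capacity):
    """True for tokens that fit within their expert's capacity.

    Hypothetical helper; assumes tokens are admitted in order until the
    expert is full.
    """
    assignments = np.asarray(assignments)
    one_hot = np.eye(num_experts, dtype=int)[assignments]   # shape (T, M)
    # 1-based position of each token in its own expert's queue.
    position = (np.cumsum(one_hot, axis=0) * one_hot).sum(axis=1)
    return position <= capacity

# Rebuild the exercise's loads: 20, 10, 15, and 5 tokens for experts 0..3.
assignments = np.repeat(np.arange(4), [20, 10, 15, 5])
mask = keep_mask(assignments, num_experts=4, capacity=12)
dropped = int((~mask).sum())
drop_rate = dropped / mask.size
print("dropped tokens:", dropped)
print("drop rate:", drop_rate)
```

The count of dropped tokens agrees with the load-based formula because each expert over capacity drops exactly its overflow.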
Exercise 6: Auxiliary loss
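Compute the load-balancing auxiliary loss M * sum(f_i * P_i), where M is the number of experts, f_i is the fraction of tokens dispatched to expert i, and P_i is the mean router probability for expert i.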
Code cell 19
# Your Solution
f = np.array([0.5, 0.25, 0.25, 0.0])
P = np.array([0.4, 0.3, 0.2, 0.1])
print("Starter: M * dot(f, P).")
Code cell 20
# Solution
f = np.array([0.5, 0.25, 0.25, 0.0])
P = np.array([0.4, 0.3, 0.2, 0.1])
aux = len(f) * np.dot(f, P)
print("aux:", aux)
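In practice f and P are derived from the router's output rather than given. The sketch below computes both from a matrix of router probabilities, assuming top-1 dispatch via argmax; `aux_loss` is a hypothetical helper, and the two input matrices are made-up illustrations.

```python
import numpy as np

def aux_loss(router_probs):
    """M * sum_i f_i * P_i, with f from top-1 dispatch (assumed) and P as the column mean."""
    router_probs = np.asarray(router_probs, dtype=float)
    T, M = router_probs.shape
    chosen = router_probs.argmax(axis=1)
    f = np.bincount(chosen, minlength=M) / T   # fraction of tokens per expert
    P = router_probs.mean(axis=0)              # mean router probability per expert
    return M * np.dot(f, P)

# Balanced: each of four tokens prefers a different expert.
balanced = np.full((4, 4), 0.1) + 0.6 * np.eye(4)
# Collapsed: every token prefers expert 0.
collapsed = np.tile([0.7, 0.1, 0.1, 0.1], (4, 1))

print("balanced aux: ", aux_loss(balanced))
print("collapsed aux:", aux_loss(collapsed))
```

The balanced case sits at the reference value of 1, and the collapsed case is higher, which is the signal this loss gives the optimizer to spread load.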
Exercise 7: Router entropy
Compute the Shannon entropy, in nats, of one router distribution.
Code cell 22
# Your Solution
p = np.array([0.7, 0.1, 0.1, 0.1])
print("Starter: -sum p log p.")
Code cell 23
# Solution
p = np.array([0.7, 0.1, 0.1, 0.1])
entropy = -np.sum(p * np.log(p))
print("entropy:", entropy)
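The direct formula returns NaN when any probability is exactly zero, because 0 * log(0) is undefined in floating point. A minimal sketch of a zero-safe variant follows; `router_entropy` is a hypothetical helper, and dividing by log(M) to get a value in [0, 1] is an optional normalization assumed here for readability.

```python
import numpy as np

def router_entropy(p):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log(p[nonzero]))

p = np.array([0.7, 0.1, 0.1, 0.1])
h = router_entropy(p)
h_max = np.log(len(p))            # entropy of the uniform distribution
print("entropy (nats):", h)
print("normalized entropy:", h / h_max)
print("one-hot entropy:", router_entropy([1.0, 0.0, 0.0, 0.0]))
```

A normalized entropy near 1 means the router is close to uniform; a value near 0 means it is close to deterministic.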
Exercise 8: All-to-all traffic
Count the tokens whose chosen expert lives on a different rank than the token's origin rank.
Code cell 25
# Your Solution
origin = np.array([0, 0, 1, 1])
expert_rank = np.array([0, 1, 1, 0])
chosen_expert = np.array([0, 1, 2, 3])
print("Starter: compare origin to expert_rank[chosen_expert].")
Code cell 26
# Solution
origin = np.array([0, 0, 1, 1])
expert_rank = np.array([0, 1, 1, 0])
chosen_expert = np.array([0, 1, 2, 3])
dest = expert_rank[chosen_expert]
cross = origin != dest
print("cross-rank tokens:", cross.sum())
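A single count hides which rank pairs carry the traffic. The sketch below builds a rank-to-rank traffic matrix from the same arrays; the variable names `num_ranks` and `traffic` are introduced here for illustration.

```python
import numpy as np

origin = np.array([0, 0, 1, 1])          # rank holding each token
expert_rank = np.array([0, 1, 1, 0])     # rank hosting each expert
chosen_expert = np.array([0, 1, 2, 3])   # expert picked by each token

dest = expert_rank[chosen_expert]
num_ranks = int(max(origin.max(), expert_rank.max())) + 1

# traffic[src, dst] counts tokens sent from rank src to rank dst.
traffic = np.zeros((num_ranks, num_ranks), dtype=int)
np.add.at(traffic, (origin, dest), 1)

# Off-diagonal entries are the tokens that must cross ranks.
cross_rank = int(traffic.sum() - np.trace(traffic))
print(traffic)
print("cross-rank tokens:", cross_rank)
```

`np.add.at` is used instead of fancy-index assignment so that repeated (source, destination) pairs accumulate correctly.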
Exercise 9: Top-k compute
Compare active expert compute for top-1 and top-2 routing, in units of one dense FFN forward pass.
Code cell 28
# Your Solution
dense_ffn = 1.0
print("Starter: active compute is k*dense_ffn.")
Code cell 29
# Solution
dense_ffn = 1.0
for k in [1, 2]:
    print(f"top-{k} active compute:", k * dense_ffn)
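The relative units can be turned into a rough FLOP count. The sketch below assumes the usual approximation of 2 FLOPs per multiply-add and the two-matrix FFN from the parameter exercises, and it ignores the router, activation function, and biases. `ffn_flops_per_token` is a hypothetical helper.

```python
def ffn_flops_per_token(d, d_ff, k=1):
    """Approximate forward FLOPs per token for k active two-matrix FFN experts.

    Assumes 2 FLOPs per multiply-add; router and activation costs are ignored.
    """
    macs_per_expert = 2 * d * d_ff   # d*d_ff for each of the two matmuls
    return k * 2 * macs_per_expert

dense = ffn_flops_per_token(1024, 4096)
for k in (1, 2):
    flops = ffn_flops_per_token(1024, 4096, k=k)
    print(f"top-{k}: {flops:,} FLOPs per token ({flops / dense:.1f}x dense)")
```

Note that the count depends on k but not on the number of experts, which is why total parameters and active compute have to be tracked separately.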
Exercise 10: MoE checklist
Write four MoE diagnostics.
Code cell 31
# Your Solution
print("Starter: include load histogram, drop rate, entropy, and all-to-all traffic.")
Code cell 32
# Solution
checks = [
    "expert load histogram",
    "drop rate versus capacity factor",
    "router entropy and z-loss",
    "all-to-all traffic and per-expert gradient norms",
]
for check in checks:
    print("-", check)
Closing Reflection
MoE is conditional computation plus routing accountability. When analyzing an MoE layer, always report total parameters, active compute, load balance, capacity, and communication as separate quantities.