Exercises Notebook
Converted from exercises.ipynb for web reading.
Serving and Systems Tradeoffs: Exercises
Ten exercises cover practical serving math: queueing, latency, KV memory, cost, batch choices, autoscaling, SLO budgets, and traces.
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False
mpl.rcParams.update({
    "figure.figsize": (10, 6),
    "figure.dpi": 120,
    "font.size": 13,
    "axes.titlesize": 15,
    "axes.labelsize": 13,
    "xtick.labelsize": 11,
    "ytick.labelsize": 11,
    "legend.fontsize": 11,
    "legend.framealpha": 0.85,
    "lines.linewidth": 2.0,
    "axes.spines.top": False,
    "axes.spines.right": False,
    "savefig.bbox": "tight",
    "savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Exercise 1: Little's law
Compute concurrency from arrival rate and latency.
Code cell 4
# Your Solution
lam = 12
W = 2.0
print("Starter: L = lam * W.")
Code cell 5
# Solution
lam = 12
W = 2.0
L = lam * W
print("concurrency:", L)
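As a sanity check beyond the exercise, Little's law can be verified empirically: simulate Poisson arrivals at the exercise's rate, hold each request in the system for W seconds, and measure average concurrency. This is a sketch I'm adding; the simulation horizon and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, W = 12.0, 2.0            # arrivals per second, seconds in system

# Simulate Poisson arrivals over a long horizon; each request stays W seconds.
horizon = 10_000.0
arrivals = np.cumsum(rng.exponential(1.0 / lam, size=int(lam * horizon * 1.2)))
arrivals = arrivals[arrivals < horizon]
departures = arrivals + W

# Average concurrency = total time-in-system divided by the horizon.
L_sim = np.sum(departures - arrivals) / horizon
print(f"simulated L = {L_sim:.2f}, Little's law predicts {lam * W:.2f}")
```

The simulated value should land within a few percent of lam * W = 24.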
Exercise 2: Utilization
Compute utilization from arrival and service rates.
Code cell 7
# Your Solution
lam = 30
mu = 50
print("Starter: rho=lam/mu.")
Code cell 8
# Solution
lam = 30
mu = 50
rho = lam / mu
print("utilization:", rho)
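If the server is modeled as an M/M/1 queue (an assumption beyond this exercise, not something the exercise requires), the same utilization also determines the expected time in system, W = 1/(mu - lam), and the expected queue length, Lq = rho^2 / (1 - rho):

```python
lam, mu = 30.0, 50.0          # requests/s arriving, requests/s capacity
rho = lam / mu                # utilization, as in the exercise

# M/M/1 results: time in system blows up as rho approaches 1.
W = 1.0 / (mu - lam)          # expected time in system, seconds
Lq = rho**2 / (1.0 - rho)     # expected number waiting in queue
print(f"rho={rho:.2f}, W={W * 1e3:.1f} ms, Lq={Lq:.2f}")
```

At rho = 0.6 the system is comfortable; rerunning with lam = 45 shows why serving capacity planning avoids utilization near 1.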
Exercise 3: Latency budget
Compute end-to-end request latency (ms) from the queue, prefill, decode, and post-processing components.
Code cell 10
# Your Solution
queue, prefill, output_tokens, tpot, post = 20, 100, 40, 15, 5
print("Starter: queue+prefill+output_tokens*tpot+post.")
Code cell 11
# Solution
queue, prefill, output_tokens, tpot, post = 20, 100, 40, 15, 5
total = queue + prefill + output_tokens * tpot + post
print("total latency ms:", total)
Exercise 4: KV concurrency
Compute max requests from available KV memory.
Code cell 13
# Your Solution
available_gb = 32
kv_per_request = 0.8
print("Starter: floor available/kv_per_request.")
Code cell 14
# Solution
available_gb = 32
kv_per_request = 0.8
max_req = int(available_gb // kv_per_request)
print("max requests:", max_req)
Exercise 5: CPM
Compute cost per million tokens.
Code cell 16
# Your Solution
hour_cost = 3.0
tok_per_sec = 1000
print("Starter: 1e6*hour_cost/(3600*tok_per_sec).")
Code cell 17
# Solution
hour_cost = 3.0
tok_per_sec = 1000
cpm = 1e6 * hour_cost / (3600 * tok_per_sec)
print("CPM:", cpm)
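A small extension of the solution above (my sketch, using the exercise's $3/hour instance cost): sweeping throughput shows that cost per million tokens is inversely proportional to tokens per second, which is why batching wins on cost.

```python
hour_cost = 3.0               # $/hour for the instance (exercise value)
for tok_per_sec in (250, 500, 1000, 2000):
    cpm = 1e6 * hour_cost / (3600 * tok_per_sec)
    print(f"{tok_per_sec:>5} tok/s -> ${cpm:.3f} per 1M tokens")
```

Doubling throughput halves CPM; at the exercise's 1000 tok/s the cost is about $0.833 per million tokens.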
Exercise 6: Batch choice
Pick the highest-throughput batch size whose latency stays within the limit.
Code cell 19
# Your Solution
throughput = np.array([100, 180, 240])
latency = np.array([200, 450, 900])
limit = 500
print("Starter: filter latency<=limit, choose max throughput.")
Code cell 20
# Solution
throughput = np.array([100, 180, 240])
latency = np.array([200, 450, 900])
limit = 500
eligible = np.where(latency <= limit)[0]
best = eligible[np.argmax(throughput[eligible])]
print("best batch index:", best)
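The solution above assumes at least one batch size meets the limit; with a tighter limit, `eligible` would be empty and the `argmax` would raise. A defensive variant (my addition, not part of the original exercise) guards that case:

```python
import numpy as np

throughput = np.array([100, 180, 240])   # tokens/s at each batch size
latency = np.array([200, 450, 900])      # latency ms at each batch size
limit = 500

eligible = np.where(latency <= limit)[0]
if eligible.size == 0:
    # No batch meets the SLO; fall back rather than crash.
    print("no batch meets the latency limit; fall back to smallest batch")
else:
    best = eligible[np.argmax(throughput[eligible])]
    print("best batch index:", best)
```

With the exercise's numbers this still selects index 1 (180 tok/s at 450 ms).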
Exercise 7: Autoscaling
Compute the number of replicas needed to keep per-replica utilization at or below a target.
Code cell 22
# Your Solution
lam = 120
mu = 30
target = 0.75
print("Starter: ceil(lam/(target*mu)).")
Code cell 23
# Solution
lam = 120
mu = 30
target = 0.75
replicas = int(np.ceil(lam / (target * mu)))
print("replicas:", replicas)
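Because the ceiling rounds up, the resulting utilization lands at or below the target rather than exactly on it. A quick check (my sketch, exercise values):

```python
import numpy as np

lam, mu, target = 120.0, 30.0, 0.75      # exercise values
replicas = int(np.ceil(lam / (target * mu)))

# Verify the chosen replica count actually respects the target.
rho = lam / (replicas * mu)
print(f"replicas={replicas}, resulting utilization={rho:.2f} (target {target})")
```

Here 120 / (0.75 * 30) = 5.33 rounds up to 6 replicas, which run at utilization 0.67, leaving headroom below the 0.75 target.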
Exercise 8: Error budget
Compute allowed failures for 99.9 percent SLO and 1M requests.
Code cell 25
# Your Solution
requests = 1_000_000
slo = 0.999
print("Starter: (1-slo)*requests.")
Code cell 26
# Solution
requests = 1_000_000
slo = 0.999
allowed = int((1 - slo) * requests)
print("allowed failures:", allowed)
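The same budget can be read in time rather than requests. A sketch (my addition, assuming a 30-day month) converting the availability SLO into allowed downtime:

```python
slo = 0.999
minutes_per_month = 30 * 24 * 60          # assuming a 30-day month
downtime = (1 - slo) * minutes_per_month
print(f"{slo:.3%} availability allows about {downtime:.1f} minutes of downtime per month")
```

A 99.9% SLO allows roughly 43 minutes of monthly downtime; one extra nine shrinks that to about 4.3 minutes.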
Exercise 9: Degradation action
Choose best quality action under latency SLO.
Code cell 28
# Your Solution
latency = np.array([2000, 1200, 900])
quality = np.array([1.0, 0.95, 0.90])
slo = 1300
print("Starter: among latency<=slo, choose max quality.")
Code cell 29
# Solution
latency = np.array([2000, 1200, 900])
quality = np.array([1.0, 0.95, 0.90])
slo = 1300
eligible = np.where(latency <= slo)[0]
best = eligible[np.argmax(quality[eligible])]
print("best action index:", best)
Exercise 10: Trace checklist
Write four serving trace fields.
Code cell 31
# Your Solution
print("Starter: include queue, prefill, decode, status.")
Code cell 32
# Solution
fields = [
    "queue time",
    "prefill time",
    "decode TPOT and output tokens",
    "status and fallback path",
]
for field in fields:
    print("-", field)
Closing Reflection
Serving is the final contract between model capability and user experience. Measure the whole request path, then optimize the bottleneck that matters.