Serving and Systems Tradeoffs

Exercises Notebook

Converted from exercises.ipynb for web reading.

Serving and Systems Tradeoffs: Exercises

Ten exercises cover practical serving math: queueing, latency, KV memory, cost, batch choices, autoscaling, SLO budgets, and traces.

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Exercise 1: Little's law

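Use Little's law, L = lam * W, to compute the average number of in-flight requests from the arrival rate lam and the mean latency W, both expressed in the same time unit.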

Code cell 4

# Your Solution
lam = 12
W = 2.0
print("Starter: L = lam * W.")

Code cell 5

# Solution
lam = 12
W = 2.0
L = lam * W
print("concurrency:", L)
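Little's law can also be read in the other direction: if concurrency is capped, the same relation bounds the arrival rate the system can sustain. The sketch below assumes a hypothetical cap of 40 in-flight requests (for instance one imposed by KV memory) and reuses W = 2.0 from the exercise. The helper name max_arrival_rate is illustrative and not part of the notebook.

```python
def max_arrival_rate(max_concurrency, mean_latency):
    """Little's law rearranged: lam = L / W (long-run averages, stable system)."""
    if mean_latency <= 0:
        raise ValueError("mean_latency must be positive")
    return max_concurrency / mean_latency

# Hypothetical cap of 40 in-flight requests with the exercise's W = 2.0.
lam_max = max_arrival_rate(40, 2.0)
print("max sustainable arrival rate:", lam_max)
```

Any arrival rate above this value would push average concurrency past the cap, so requests would have to queue or be rejected.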

Exercise 2: Utilization

Compute utilization rho = lam / mu from the arrival rate lam and the service rate mu. A single server is stable only when rho is below 1.

Code cell 7

# Your Solution
lam = 30
mu = 50
print("Starter: rho=lam/mu.")

Code cell 8

# Solution
lam = 30
mu = 50
rho = lam / mu
print("utilization:", rho)
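To see why utilization matters for latency, the sketch below assumes an M/M/1 queue (Poisson arrivals, exponential service times, one server). The exercise itself does not specify a queueing model, so treat this as one illustrative choice. Under that assumption the mean time in system is W = 1 / (mu - lam), and Little's law from Exercise 1 gives the mean number of requests in the system.

```python
def mm1_metrics(lam, mu):
    """Utilization, mean time in system, and mean number in system for M/M/1.

    Assumes Poisson arrivals and exponential service times; real LLM
    serving traffic only approximates this.
    """
    if lam >= mu:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    rho = lam / mu
    W = 1.0 / (mu - lam)   # mean time in system, in the time unit of the rates
    L = lam * W            # Little's law; equals rho / (1 - rho)
    return rho, W, L

# Sweep the arrival rate toward the service rate mu = 50 from the exercise.
for lam in (30, 40, 45, 49):
    rho, W, L = mm1_metrics(lam, mu=50)
    print(f"rho={rho:.2f}  W={W:.3f}  L={L:.1f}")
```

Latency grows slowly at moderate utilization and sharply as rho approaches 1, which is why autoscaling targets are usually set well below full utilization.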

Exercise 3: Latency budget

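Compute end-to-end latency in milliseconds: queue time, plus prefill time, plus decode time (output tokens times the time per output token, TPOT), plus post-processing.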

Code cell 10

# Your Solution
queue, prefill, output_tokens, tpot, post = 20, 100, 40, 15, 5
print("Starter: queue+prefill+output_tokens*tpot+post.")

Code cell 11

# Solution
queue, prefill, output_tokens, tpot, post = 20, 100, 40, 15, 5
total = queue + prefill + output_tokens * tpot + post
print("total latency ms:", total)
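A total alone does not show where the time goes. The sketch below breaks the same budget into shares and picks out the largest component. It also reports time to first token (TTFT) as queue plus prefill, which assumes a streaming server that emits the first token when prefill finishes; the notebook does not state this.

```python
# Latency components in milliseconds, using the values from Exercise 3.
components = {
    "queue": 20,
    "prefill": 100,
    "decode": 40 * 15,   # output_tokens * tpot
    "post": 5,
}
total = sum(components.values())
for name, ms in components.items():
    print(f"{name:8s} {ms:5d} ms  {ms / total:6.1%}")

# Largest single contributor to end-to-end latency.
bottleneck = max(components, key=components.get)
# Assumption: the first token is emitted as soon as prefill completes.
ttft = components["queue"] + components["prefill"]
print("total ms:", total, "| bottleneck:", bottleneck, "| TTFT ms:", ttft)
```

With these numbers decode dominates, so shortening outputs or lowering TPOT would move the total far more than trimming queue time.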

Exercise 4: KV concurrency

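Compute the maximum number of concurrent requests that fit in the available KV-cache memory, given the KV footprint of a single request in the same unit.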

Code cell 13

# Your Solution
available_gb = 32
kv_per_request = 0.8
print("Starter: floor available/kv_per_request.")

Code cell 14

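# Solution
available_gb = 32
kv_per_request = 0.8
# Float floor division gives 32 // 0.8 == 39.0 because 0.8 is not exactly
# representable; divide first, then floor with a small tolerance.
max_req = int(np.floor(available_gb / kv_per_request + 1e-9))
print("max requests:", max_req)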
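The per-request KV footprint is given as a constant in the exercise. As a rough sketch of where such a number comes from, the code below uses the standard per-token estimate of 2 (keys and values) times layers times KV heads times head dimension times bytes per element. The model shape and context length are hypothetical, and the result is in GiB rather than the exercise's GB.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2 covers the key tensor and the value tensor in each layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical model shape and context length, not taken from the notebook.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
context_tokens = 4096
per_request_gib = per_token * context_tokens / 2**30

print("KV bytes per token:", per_token)
print("KV GiB per request:", per_request_gib)
print("max requests in 32 GiB:", int(32 // per_request_gib))
```

Because the footprint scales linearly with context length, halving the maximum context doubles the number of requests that fit.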

Exercise 5: CPM

Compute cost per million tokens (CPM) from the hourly instance cost and the sustained throughput in tokens per second.

Code cell 16

# Your Solution
hour_cost = 3.0
tok_per_sec = 1000
print("Starter: 1e6*hour_cost/(3600*tok_per_sec).")

Code cell 17

# Solution
hour_cost = 3.0
tok_per_sec = 1000
cpm = 1e6 * hour_cost / (3600 * tok_per_sec)
print("CPM:", cpm)
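The CPM formula assumes the hardware produces tokens at the stated rate for the whole hour. A minimal sketch of the same calculation with a utilization factor follows. The utilization parameter is an addition for illustration and does not appear in the exercise.

```python
def cost_per_million(hour_cost, tok_per_sec, utilization=1.0):
    """Cost per million tokens; utilization is the fraction of the hour spent serving."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = 3600 * tok_per_sec * utilization
    return 1e6 * hour_cost / tokens_per_hour

print("CPM at full utilization:", round(cost_per_million(3.0, 1000), 4))
print("CPM at 50% utilization: ", round(cost_per_million(3.0, 1000, utilization=0.5), 4))
```

Idle capacity is paid for but produces no tokens, so halving utilization doubles the effective cost per token.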

Exercise 6: Batch choice

Pick the batch configuration with the highest throughput among those whose latency stays within the limit.

Code cell 19

# Your Solution
throughput = np.array([100, 180, 240])
latency = np.array([200, 450, 900])
limit = 500
print("Starter: filter latency<=limit, choose max throughput.")

Code cell 20

# Solution
throughput = np.array([100, 180, 240])
latency = np.array([200, 450, 900])
limit = 500
eligible = np.where(latency <= limit)[0]
best = eligible[np.argmax(throughput[eligible])]
print("best batch index:", best)

Exercise 7: Autoscaling

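Compute the number of replicas needed to serve arrival rate lam when each replica has service rate mu and should run at or below the target utilization.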

Code cell 22

# Your Solution
lam = 120
mu = 30
target = 0.75
print("Starter: ceil(lam/(target*mu)).")

Code cell 23

# Solution
lam = 120
mu = 30
target = 0.75
replicas = int(np.ceil(lam / (target * mu)))
print("replicas:", replicas)
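It is worth checking that the rounded-up replica count actually meets the target. The sketch below wraps the same formula in a helper (the name replicas_needed is illustrative) and compares per-replica utilization with the chosen count and with one fewer.

```python
import math

def replicas_needed(lam, mu, target):
    """Smallest replica count that keeps per-replica utilization at or below target."""
    return math.ceil(lam / (target * mu))

lam, mu, target = 120, 30, 0.75
n = replicas_needed(lam, mu, target)
rho_per_replica = lam / (n * mu)
print("replicas:", n, "| utilization per replica:", round(rho_per_replica, 3))

# One fewer replica would overshoot the utilization target.
rho_one_fewer = lam / ((n - 1) * mu)
print("with one fewer replica:", round(rho_one_fewer, 3))
```

This assumes load is spread evenly across replicas; uneven routing would push some replicas above the average.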

Exercise 8: Error budget

Compute the number of failed requests allowed by a 99.9 percent SLO over 1M requests.

Code cell 25

# Your Solution
requests = 1_000_000
slo = 0.999
print("Starter: (1-slo)*requests.")

Code cell 26

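# Solution
requests = 1_000_000
slo = 0.999
# int() truncates, so a product that lands just below a whole number because of
# float error would lose one failure; floor with a small tolerance instead.
allowed = int(np.floor((1 - slo) * requests + 1e-9))
print("allowed failures:", allowed)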
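An error budget is most useful when tracked against failures seen so far. The sketch below computes the budget and the remaining allowance. The observed failure count is a made-up value for illustration, and the helper name error_budget is not from the notebook.

```python
import math

def error_budget(requests, slo):
    """Whole number of failures allowed; the tolerance absorbs float rounding."""
    return math.floor((1 - slo) * requests + 1e-9)

budget = error_budget(1_000_000, 0.999)
observed_failures = 250   # hypothetical count so far in the window
remaining = budget - observed_failures
consumed = observed_failures / budget

print("budget:", budget)
print("remaining:", remaining)
print(f"consumed: {consumed:.0%}")
```

A negative remaining value would mean the SLO is already violated for the window.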

Exercise 9: Degradation action

Choose the highest-quality degradation action whose latency meets the SLO.

Code cell 28

# Your Solution
latency = np.array([2000, 1200, 900])
quality = np.array([1.0, 0.95, 0.90])
slo = 1300
print("Starter: among latency<=slo, choose max quality.")

Code cell 29

# Solution
latency = np.array([2000, 1200, 900])
quality = np.array([1.0, 0.95, 0.90])
slo = 1300
eligible = np.where(latency <= slo)[0]
best = eligible[np.argmax(quality[eligible])]
print("best action index:", best)
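Exercises 6 and 9 share one pattern: filter options by a latency constraint, then maximize a score. Both solutions would raise an error if no option met the constraint. A minimal sketch of a shared helper that returns None in that case follows. The function name best_under_limit is illustrative.

```python
import numpy as np

def best_under_limit(score, latency, limit):
    """Index of the highest-score option with latency <= limit, or None if none qualify."""
    score = np.asarray(score)
    latency = np.asarray(latency)
    eligible = np.flatnonzero(latency <= limit)
    if eligible.size == 0:
        return None   # caller decides: reject, queue, or relax the limit
    return int(eligible[np.argmax(score[eligible])])

print(best_under_limit([100, 180, 240], [200, 450, 900], 500))       # Exercise 6 data
print(best_under_limit([1.0, 0.95, 0.90], [2000, 1200, 900], 1300))  # Exercise 9 data
print(best_under_limit([1.0, 0.95, 0.90], [2000, 1200, 900], 500))   # nothing fits
```

Returning None keeps the "no feasible option" case explicit instead of letting np.argmax fail on an empty array.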

Exercise 10: Trace checklist

List four fields that a serving trace should record for every request.

Code cell 31

# Your Solution
print("Starter: include queue, prefill, decode, status.")

Code cell 32

# Solution
fields = [
    "queue time",
    "prefill time",
    "decode TPOT and output tokens",
    "status and fallback path",
]
for field in fields:
    print("-", field)
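The checklist can be turned into a small record type so that every request logs the same fields. The sketch below is one possible shape. The class name ServingTrace and its field names are illustrative and do not follow any particular tracing standard.

```python
from dataclasses import dataclass, asdict

@dataclass
class ServingTrace:
    # Field names are illustrative, not a standard schema.
    queue_ms: float
    prefill_ms: float
    tpot_ms: float
    output_tokens: int
    status: str
    fallback_path: str = "none"

    @property
    def total_ms(self):
        # Queue + prefill + decode; post-processing is not tracked in this sketch.
        return self.queue_ms + self.prefill_ms + self.tpot_ms * self.output_tokens

trace = ServingTrace(queue_ms=20, prefill_ms=100, tpot_ms=15,
                     output_tokens=40, status="ok")
print(asdict(trace))
print("total ms:", trace.total_ms)
```

Keeping the components separate, rather than logging only the total, is what makes the latency breakdown from Exercise 3 recoverable from production traces.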

Closing Reflection

Serving is the final contract between model capability and user experience. Measure the whole request path, then optimize the bottleneck that matters.