Math for LLMs / Serving and Systems Tradeoffs

Notes

Serving is where LLM math becomes a live system. A deployed model must manage stochastic demand, queueing, prefill, decode, KV cache memory, batching, scheduling, cost, observability, and reliability.

Overview

The serving equation is not one equation. It is a budget:

T_\mathrm{user}=T_q+T_p+n_\mathrm{out}\mathrm{TPOT}+T_o,

plus memory:

M_w+M_\mathrm{KV}+M_\mathrm{work}\le M_\mathrm{GPU},

plus cost:

\mathrm{CPM}=10^6C_\mathrm{hour}/(3600Q).

Serving tradeoffs are practical: low latency, high throughput, low cost, high quality, and high reliability cannot all be maximized independently.
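
A minimal sketch of the three budgets, using made-up numbers for every input (the latency components, memory sizes, price, and throughput below are illustrative assumptions, not measurements):

```python
# Illustrative serving budgets; every constant below is an assumed example value.
T_q, T_p, T_o = 0.05, 0.40, 0.02        # queue, prefill, postprocess (seconds)
TPOT, n_out = 0.03, 300                  # seconds per output token, output tokens
T_user = T_q + T_p + n_out * TPOT + T_o
print(f"latency budget:  T_user = {T_user:.2f} s")

M_w, M_kv, M_work, M_gpu = 14e9, 40e9, 6e9, 80e9   # bytes
print("memory budget ok:", M_w + M_kv + M_work <= M_gpu)

C_hour, Q = 2.50, 5000                   # $/GPU-hour, sustained tokens/sec
CPM = 1e6 * C_hour / (3600 * Q)
print(f"cost budget:     CPM = ${CPM:.2f} per million tokens")
```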

Prerequisites

  • Efficient attention and inference metrics
  • KV cache memory math
  • RAG retrieval latency and context budget
  • Basic probability for percentiles and queueing

Companion Notebooks

| Notebook | Purpose |
| --- | --- |
| theory.ipynb | Demonstrates Little's law, utilization curves, latency decomposition, batch tradeoffs, memory concurrency, cost per million tokens, autoscaling, and SLO budgets. |
| exercises.ipynb | Ten practice problems for queueing, latency, memory, cost, scheduling, and observability. |

Learning Objectives

After this section, you should be able to:

  • Compute TTFT, TPOT, total latency, throughput, and cost per million tokens.
  • Use Little's law for serving capacity planning.
  • Explain why high utilization increases queueing delay.
  • Estimate max concurrency from weight memory and KV cache memory.
  • Compare static batching, continuous batching, and chunked prefill.
  • Explain serving parallelism choices and phase splitting.
  • Build a cost model for tokens per dollar.
  • Define SLOs, error budgets, and observability traces.
  • Choose operational fallbacks under overload.

Table of Contents

  1. Serving Objectives
  2. Queueing Basics
  3. Latency Decomposition
  4. Batching Tradeoffs
  5. Memory and Concurrency
  6. Parallelism for Serving
  7. Cost Modeling
  8. Scheduling Policies
  9. Observability and SLOs
  10. Operational Tradeoffs

Serving Control Loop

traffic -> admission -> queue -> scheduler -> prefill/decode workers -> postprocess -> response
             |            |          |                 |                  |
          rate limit    metrics   batching          memory             traces

1. Serving Objectives

This part studies serving objectives as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.

| Subtopic | Question | Formula |
| --- | --- | --- |
| User latency | Interactive systems care about time to first token and total response time | T_\mathrm{user}=\mathrm{TTFT}+n_\mathrm{out}\mathrm{TPOT} |
| Throughput | Operators care about tokens completed per second | Q=\mathrm{tokens}/\mathrm{sec} |
| Cost | Product decisions depend on cost per useful token | \mathrm{cost/token}=T_\mathrm{gpu}\cdot \mathrm{price}/\mathrm{tokens} |
| Quality | Systems changes must preserve model behavior | \Delta S bounded |
| Reliability | Serving must meet SLOs under variable load | P(T\le T_\mathrm{SLO})\ge 0.95 |

1.1 User latency

Main idea. Interactive systems care about time to first token and total response time.

Core relation:

T_\mathrm{user}=\mathrm{TTFT}+n_\mathrm{out}\mathrm{TPOT}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

1.2 Throughput

Main idea. Operators care about tokens completed per second.

Core relation:

Q=\mathrm{tokens}/\mathrm{sec}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

1.3 Cost

Main idea. Product decisions depend on cost per useful token.

Core relation:

\mathrm{cost/token}=T_\mathrm{gpu}\cdot \mathrm{price}/\mathrm{tokens}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

1.4 Quality

Main idea. Systems changes must preserve model behavior.

Core relation:

\Delta S bounded

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

1.5 Reliability

Main idea. Serving must meet SLOs under variable load.

Core relation:

P(T\le T_\mathrm{SLO})\ge 0.95

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

2. Queueing Basics

This part studies queueing basics as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Arrival rate | Requests arrive stochastically | \lambda requests/sec |
| Service rate | The system completes work at rate \mu | \mu requests/sec |
| Utilization | High utilization increases queueing delay | \rho=\lambda/\mu |
| Little's law | Average concurrency equals arrival rate times latency | L=\lambda W |
| Tail latency | p95 and p99 grow before averages look alarming | Q_{0.95}(T) |

2.1 Arrival rate

Main idea. Requests arrive stochastically.

Core relation:

\lambda requests/sec

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

2.2 Service rate

Main idea. The system completes work at rate μ.

Core relation:

\mu requests/sec

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

2.3 Utilization

Main idea. High utilization increases queueing delay.

Core relation:

\rho=\lambda/\mu

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.
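
To see why pushing utilization toward 1 is dangerous, here is a sketch using the textbook M/M/1 result W = 1/(μ − λ). The service rate is an assumed number and real serving systems are not M/M/1, so treat the curve as qualitative:

```python
# M/M/1 mean time in system: W = 1 / (mu - lam), valid for rho = lam/mu < 1.
mu = 10.0                                 # assumed service rate (requests/sec)
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    lam = rho * mu
    W = 1.0 / (mu - lam)
    print(f"rho = {rho:.2f}   W = {W:6.2f} s")
```

The last step from 95% to 99% utilization multiplies waiting time by five, which is the shape behind most tail-latency incidents.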

2.4 Little's law

Main idea. Average concurrency equals arrival rate times latency.

Core relation:

L=\lambda W

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is the smallest useful equation for capacity planning.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.
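
The worked micro-example as two lines of arithmetic; the per-request KV figure is an assumed value for illustration:

```python
lam, W = 20.0, 1.5                        # arrivals/sec and mean end-to-end latency (s)
L = lam * W                               # Little's law: average requests in flight
kv_per_request = 0.5e9                    # assumed bytes of KV cache per active request
print(f"L = {L:.0f} concurrent requests ≈ {L * kv_per_request / 1e9:.0f} GB of KV cache")
```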

2.5 Tail latency

Main idea. Tail percentiles such as p95 and p99 grow before averages look alarming.

Core relation:

Q_{0.95}(T)

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.
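
A small sketch of why percentiles must be inspected directly; the latency samples are synthetic (log-normal), chosen only to show a heavy right tail:

```python
import numpy as np

rng = np.random.default_rng(0)
latencies = rng.lognormal(mean=-0.5, sigma=0.8, size=10_000)   # synthetic latencies (s)
print(f"mean = {latencies.mean():.2f} s")
for p in (50, 95, 99):
    print(f"p{p}  = {np.percentile(latencies, p):.2f} s")
```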

3. Latency Decomposition

This part studies latency decomposition as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Queue time | Waiting before GPU work starts | T_q |
| Prefill time | Process the prompt and build the KV cache | T_p |
| Decode time | Generate output tokens serially | T_d=n_\mathrm{out}\mathrm{TPOT} |
| Postprocess time | Sampling, detokenization, filters, and transport add overhead | T_o |
| End-to-end latency | Measure the whole path users experience | T=T_q+T_p+T_d+T_o |

3.1 Queue time

Main idea. Waiting before GPU work starts.

Core relation:

T_q

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

3.2 Prefill time

Main idea. Process the prompt and build the KV cache.

Core relation:

T_p

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

3.3 Decode time

Main idea. Generate output tokens serially.

Core relation:

T_d=n_\mathrm{out}\mathrm{TPOT}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

3.4 Postprocess time

Main idea. Sampling, detokenization, filters, and transport add overhead.

Core relation:

T_o

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

3.5 End-to-end latency

Main idea. Measure the whole path users experience.

Core relation:

T=T_q+T_p+T_d+T_o

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.
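
A sketch that decomposes one request's latency into the four buckets of the formula above; the span durations are assumed example values:

```python
# T = T_q + T_p + T_d + T_o, reported per bucket and as a share of the total.
spans = {"queue": 0.08, "prefill": 0.35, "decode": 300 * 0.03, "postprocess": 0.02}
total = sum(spans.values())
for name, t in spans.items():
    print(f"{name:<12} {t:6.2f} s  ({100 * t / total:4.1f}%)")
print(f"{'total':<12} {total:6.2f} s")
```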

4. Batching Tradeoffs

This part studies batching tradeoffs as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Static batching | Wait to form a batch and run it together | B fixed |
| Continuous batching | Insert and remove requests between decode steps | B_t changes |
| Batch efficiency | Larger batches increase utilization until memory or latency limits | Q(B) |
| Head-of-line blocking | Long requests can delay short requests | T_\mathrm{short} increases |
| Chunked prefill | Split long prompts to avoid starving decode | T_p=\sum_c T_{p,c} |

4.1 Static batching

Main idea. Wait to form a batch and run it together.

Core relation:

B fixed

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

4.2 Continuous batching

Main idea. Insert and remove requests between decode steps.

Core relation:

B_t changes

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is why modern LLM serving does not behave like ordinary fixed-batch inference.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

4.3 Batch efficiency

Main idea. Larger batches increase utilization until memory or latency limits.

Core relation:

Q(B)

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.
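
A toy decode-step model of the batching tradeoff. The fixed overhead and per-sequence cost below are assumptions, and real kernels are not exactly linear in batch size, but the qualitative shape is the point: throughput rises with the batch while per-request TPOT worsens.

```python
# step_time(B) = overhead + B * marginal_cost; aggregate throughput = B / step_time.
t_fixed, t_per_seq = 0.020, 0.0015        # seconds, assumed constants
for B in (1, 4, 16, 64, 256):
    step = t_fixed + B * t_per_seq
    print(f"B = {B:3d}   tokens/s = {B / step:7.1f}   per-request TPOT = {step * 1e3:5.1f} ms")
```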

4.4 Head-of-line blocking

Main idea. Long requests can delay short requests.

Core relation:

T_\mathrm{short} increases

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

4.5 Chunked prefill

Main idea. Split long prompts to avoid starving decode.

Core relation:

T_p=\sum_c T_{p,c}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

5. Memory and Concurrency

This part studies memory and concurrency as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Weight memory | Model weights consume a fixed resident footprint | M_w |
| KV cache memory | Active requests consume context-dependent memory | M_\mathrm{KV}=2BLTH_{kv}d_hb |
| Workspace memory | Kernels and temporary buffers also need headroom | M_\mathrm{work} |
| Max concurrency | Available memory bounds active tokens | M_w+M_\mathrm{KV}+M_\mathrm{work}\le M_\mathrm{GPU} |
| Fragmentation | Variable request lengths waste reserved cache blocks | M_\mathrm{reserved}-M_\mathrm{used} |

5.1 Weight memory

Main idea. Model weights consume a fixed resident footprint.

Core relation:

M_w

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

5.2 KV cache memory

Main idea. Active requests consume context-dependent memory.

Core relation:

M_\mathrm{KV}=2BLTH_{kv}d_hb

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.
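
A direct evaluation of the formula above. The layer count, KV head count, head dimension, and context length describe an assumed 7B-class configuration stored in fp16:

```python
def kv_bytes(B, L, T, H_kv, d_h, b=2):
    """M_KV = 2 * B * L * T * H_kv * d_h * b  (factor 2 covers keys and values)."""
    return 2 * B * L * T * H_kv * d_h * b

# Assumed config: 32 layers, 8 KV heads, head dim 128, 4k context, fp16 (b = 2 bytes).
print(kv_bytes(B=32, L=32, T=4096, H_kv=8, d_h=128) / 1e9, "GB for 32 concurrent 4k requests")
```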

5.3 Workspace memory

Main idea. Kernels and temporary buffers also need headroom.

Core relation:

M_\mathrm{work}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

5.4 Max concurrency

Main idea. Available memory bounds active tokens.

Core relation:

M_w+M_\mathrm{KV}+M_\mathrm{work}\le M_\mathrm{GPU}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. Most practical serving limits are memory limits before they are pure compute limits.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

5.5 Fragmentation

Main idea. Variable request lengths waste reserved cache blocks.

Core relation:

M_\mathrm{reserved}-M_\mathrm{used}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

6. Parallelism for Serving

This part studies parallelism for serving as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Tensor parallelism | Split matrix operations across devices | N_\mathrm{tp} |
| Pipeline parallelism | Place layers on different devices | N_\mathrm{pp} |
| Data parallel replicas | Replicate the full serving stack for more throughput | N_\mathrm{replica} |
| Phase splitting | Prefill and decode may run on different pools | \mathrm{prefill\ pool},\mathrm{decode\ pool} |
| Network cost | Multi-node serving pays communication latency | T_\mathrm{net} |

6.1 Tensor parallelism

Main idea. Split matrix operations across devices.

Core relation:

N_\mathrm{tp}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.
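
A back-of-envelope view of why tensor parallelism is often chosen for memory fit as much as for speed. The total weight size is an assumed 70B-class model in fp16, and communication cost is ignored:

```python
M_w = 140e9                                    # assumed total weight bytes (70B params, fp16)
for N_tp in (1, 2, 4, 8):
    print(f"TP = {N_tp}:  {M_w / N_tp / 1e9:6.1f} GB of weights per GPU")
```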

6.2 Pipeline parallelism

Main idea. Place layers on different devices.

Core relation:

N_\mathrm{pp}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

6.3 Data parallel replicas

Main idea. Replicate the full serving stack for more throughput.

Core relation:

N_\mathrm{replica}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

6.4 Phase splitting

Main idea. Prefill and decode may run on different pools.

Core relation:

\mathrm{prefill\ pool},\mathrm{decode\ pool}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

6.5 Network cost

Main idea. Multi-node serving pays communication latency.

Core relation:

T_\mathrm{net}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

7. Cost Modeling

This part studies cost modeling as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.

| Subtopic | Question | Formula |
| --- | --- | --- |
| GPU-hour cost | Hardware price turns time into dollars | C_\mathrm{hour} |
| Tokens per dollar | Throughput divided by hourly cost | \mathrm{tokens}/\$=3600Q/C_\mathrm{hour} |
| Cost per million tokens | Standardize cost reporting | \mathrm{CPM}=10^6C_\mathrm{hour}/(3600Q) |
| Utilization | Idle capacity increases effective cost | C_\mathrm{effective}=C_\mathrm{nominal}/u |
| Quality-adjusted cost | Cheaper systems can be worse if quality falls | J=\mathrm{cost}+\lambda(1-S) |

7.1 GPU-hour cost

Main idea. Hardware price turns time into dollars.

Core relation:

C_\mathrm{hour}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

7.2 Tokens per dollar

Main idea. Throughput divided by hourly cost.

Core relation:

\mathrm{tokens}/\$=3600Q/C_\mathrm{hour}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

7.3 Cost per million tokens

Main idea. Standardize cost reporting.

Core relation:

\mathrm{CPM}=10^6C_\mathrm{hour}/(3600Q)

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is the number that connects kernel work to product economics.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

7.4 Utilization

Main idea. Idle capacity increases effective cost.

Core relation:

C_\mathrm{effective}=C_\mathrm{nominal}/u

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.
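
Combining the tokens-per-dollar, CPM, and utilization relations above, with assumed price, throughput, and fleet utilization:

```python
C_hour, Q, u = 2.50, 5000, 0.60                # $/GPU-hour, busy tokens/sec, utilization (assumed)
tokens_per_dollar = 3600 * Q / C_hour
cpm_busy = 1e6 * C_hour / (3600 * Q)
cpm_effective = cpm_busy / u                   # idle capacity inflates the real cost per token
print(f"{tokens_per_dollar:,.0f} tokens per dollar")
print(f"CPM at full utilization: ${cpm_busy:.3f}   effective CPM at u = 0.6: ${cpm_effective:.3f}")
```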

7.5 Quality-adjusted cost

Main idea. Cheaper systems can be worse if quality falls.

Core relation:

J=\mathrm{cost}+\lambda(1-S)

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

8. Scheduling Policies

This part studies scheduling policies as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.

| Subtopic | Question | Formula |
| --- | --- | --- |
| FIFO | Serve requests in arrival order | \mathrm{order}=t_\mathrm{arrival} |
| Shortest remaining work | Favor small jobs to reduce mean latency | \min \hat T_\mathrm{remaining} |
| Priority queues | Separate interactive and batch traffic | p_i priority |
| Admission control | Reject or defer work when queues exceed budget | L_q>L_\mathrm{max} |
| Autoscaling | Add replicas when load exceeds target utilization | n\ge \lambda/(\rho_\mathrm{target}\mu) |

8.1 FIFO

Main idea. Serve requests in arrival order.

Core relation:

\mathrm{order}=t_\mathrm{arrival}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

8.2 Shortest remaining work

Main idea. Favor small jobs to reduce mean latency.

Core relation:

\min \hat T_\mathrm{remaining}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

8.3 Priority queues

Main idea. Separate interactive and batch traffic.

Core relation:

p_i priority

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

8.4 Admission control

Main idea. Reject or defer work when queues exceed budget.

Core relation:

L_q>L_\mathrm{max}

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.

8.5 Autoscaling

Main idea. Add replicas when load exceeds target utilization.

Core relation:

n\ge \lambda/(\rho_\mathrm{target}\mu)

LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.

Worked micro-example. If requests arrive at λ = 20 per second and average end-to-end latency is W = 1.5 seconds, Little's law gives average concurrency L = 30 requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.

Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.

AI connection. This is a practical serving control variable.

Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.
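
The replica-count rule as code, with assumed arrival and per-replica service rates:

```python
import math

lam = 120.0            # assumed arrivals (requests/sec)
mu = 8.0               # assumed requests/sec one replica sustains
rho_target = 0.7
n = math.ceil(lam / (rho_target * mu))
print(n, "replicas keep target utilization at or below", rho_target)
```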

9. Observability and SLOs

This part studies observability and SLOs as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Metrics | Log TTFT, TPOT, total latency, queue time, throughput, memory, and errors | m_t |
| Percentiles | Track p50, p95, and p99 separately | Q_p(T) |
| Error budget | Allowed failures over a window | B=(1-\mathrm{SLO})N |
| Tracing | Attach per-request spans for queue, prefill, decode, retrieval, and postprocess | \mathrm{trace} |
| Canarying | Roll out changes to a small fraction and compare metrics | \Delta m |

9.1 Metrics

Main idea. Log TTFT, TPOT, total latency, queue time, throughput, memory, and errors.

Core relation:

m_t

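A minimal per-request metrics record. The field names are an illustrative schema, not a standard; the point is that every request carries enough fields to reconstruct the latency budget later:

```python
from dataclasses import dataclass, asdict

@dataclass
class RequestMetrics:
    queue_s: float        # time spent waiting before prefill
    ttft_s: float         # time to first token
    tpot_s: float         # average time per output token
    output_tokens: int
    kv_bytes: int         # peak KV cache memory attributed to this request
    status: str           # "ok", "timeout", "rejected", ...

    @property
    def total_s(self) -> float:
        # Total user-visible latency: TTFT plus decode time for all output tokens.
        return self.ttft_s + self.output_tokens * self.tpot_s

m = RequestMetrics(queue_s=0.12, ttft_s=0.45, tpot_s=0.03,
                   output_tokens=200, kv_bytes=64_000_000, status="ok")
print(asdict(m), round(m.total_s, 2))
```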

9.2 Percentiles

Main idea. Track p50, p95, and p99 separately.

Core relation:

Q_p(T)

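A small sketch of why percentiles matter, using a synthetic latency distribution with a heavy tail (all numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# 95% fast requests, 5% slow tail (illustrative distribution).
latencies = np.concatenate([rng.normal(0.8, 0.1, 950), rng.normal(4.0, 0.5, 50)])

p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
print(f"mean={latencies.mean():.2f}s  p50={p50:.2f}s  p95={p95:.2f}s  p99={p99:.2f}s")
# The mean and median look healthy; p95 and p99 expose the tail users actually feel.
```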

9.3 Error budget

Main idea. The SLO fixes how many failures are allowed over a measurement window.

Core relation:

B=(1-\mathrm{SLO})N

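The error-budget arithmetic as a short calculation; the SLO target, traffic volume, and violation count are illustrative:

```python
# Error budget B = (1 - SLO) * N over a measurement window.
slo = 0.995                      # 99.5% of requests must meet the target (assumed)
requests_per_month = 10_000_000  # monthly traffic (assumed)

budget = (1 - slo) * requests_per_month
print(f"allowed bad requests this month: {budget:,.0f}")   # 50,000

# Spend tracking: once violations exceed the budget, freeze risky rollouts.
violations_so_far = 37_500
print(f"budget remaining: {budget - violations_so_far:,.0f}")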

9.4 Tracing

Main idea. Attach per-request spans for queue, prefill, decode, retrieval, and postprocess.

Core relation:

\mathrm{trace}

AI connection. Without per-request traces, serving optimization becomes guesswork.

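A minimal tracing sketch: wrap each phase of a request in a timed span so the latency decomposition can be reconstructed afterwards. The span names and the `time.sleep` placeholders are illustrative stand-ins for real work:

```python
import time
from contextlib import contextmanager

spans = []   # collected (name, duration) pairs for one request (illustrative storage)

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, time.perf_counter() - start))

# One request decomposed into the phases worth tracing.
with span("queue"):    time.sleep(0.01)
with span("prefill"):  time.sleep(0.02)
with span("decode"):   time.sleep(0.05)
print(spans)
```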

9.5 Canarying

Main idea. Roll out changes to a small fraction and compare metrics.

Core relation:

\Delta m

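A small canary-comparison sketch: compute the metric deltas between canary and baseline and check them against allowed regressions. The metric names and thresholds are illustrative, not a standard rollout policy:

```python
baseline = {"p95_latency_s": 1.9, "error_rate": 0.004, "quality_score": 0.82}
canary   = {"p95_latency_s": 2.1, "error_rate": 0.005, "quality_score": 0.81}

# Allowed delta per metric: positive = allowed increase, negative = allowed decrease.
limits = {"p95_latency_s": 0.3, "error_rate": 0.002, "quality_score": -0.02}

def canary_ok(baseline, canary, limits) -> bool:
    for metric, allowed in limits.items():
        delta = canary[metric] - baseline[metric]
        if allowed >= 0 and delta > allowed:
            return False     # regression beyond the allowed increase
        if allowed < 0 and delta < allowed:
            return False     # drop beyond the allowed decrease
    return True

print(canary_ok(baseline, canary, limits))   # True for these numbers
```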

10. Operational Tradeoffs

This part studies operational tradeoffs as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.

Subtopic | Question | Formula
Fallback models | route to smaller models under overload | M_\mathrm{large}\rightarrow M_\mathrm{small}
Caching | reuse repeated prompts or retrieved context when safe | y=f(x) on a cache hit
Rate limits | protect service health by limiting demand | \lambda\le\lambda_\mathrm{max}
Graceful degradation | shorter outputs, lower-k retrieval, or a smaller model can preserve responsiveness | T\downarrow with bounded \Delta S
Rollback | keep a fast path back to the previous stable serving configuration | v_\mathrm{new}\rightarrow v_\mathrm{old}

10.1 Fallback models

Main idea. Route to smaller models under overload.

Core relation:

M_\mathrm{large}\rightarrow M_\mathrm{small}

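A minimal routing sketch: send traffic to a smaller model when the large model's queue exceeds its budget. The model names and threshold are illustrative:

```python
QUEUE_LIMIT = 40   # queue depth at which the large model's latency budget breaks (assumed)

def choose_model(queue_depth_large: int) -> str:
    if queue_depth_large > QUEUE_LIMIT:
        return "small-model"      # degraded quality, preserved responsiveness
    return "large-model"          # preferred quality path

print(choose_model(12), choose_model(75))
```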

10.2 Caching

Main idea. Reuse repeated prompts or retrieved context when safe.

Core relation:

y=f(x) on a cache hit

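A minimal exact-match prompt cache sketch. This is only safe when decoding is deterministic (for example temperature 0) and the full prompt matches byte for byte; the `generate` function is a placeholder for a real model call:

```python
from functools import lru_cache

def generate(prompt: str) -> str:
    # Placeholder for the expensive model call.
    return f"response to: {prompt}"

@lru_cache(maxsize=10_000)
def cached_generate(prompt: str) -> str:
    return generate(prompt)

print(cached_generate("What is Little's law?"))   # miss: runs the model
print(cached_generate("What is Little's law?"))   # hit: served from the cache
print(cached_generate.cache_info())               # hits=1, misses=1
```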

10.3 Rate limits

Main idea. Protect service health by limiting demand.

Core relation:

\lambda\le\lambda_\mathrm{max}

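A token-bucket rate limiter is one common way to enforce the bound above; this sketch uses illustrative capacity and refill values:

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter (rate and capacity are assumed numbers)."""
    def __init__(self, rate_per_s: float, capacity: float):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False           # caller should reject (e.g. 429 with retry-after)

limiter = TokenBucket(rate_per_s=20.0, capacity=40.0)   # lambda_max = 20 req/s
print(limiter.allow())
```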

10.4 Graceful degradation

Main idea. Shorter outputs, lower-k retrieval, or a smaller model can preserve responsiveness.

Core relation:

T\downarrow with bounded \Delta S

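A minimal degradation policy sketch: step down output length, retrieval depth, and finally model size as pressure grows. The thresholds, parameter names, and actions are illustrative:

```python
def degrade(queue_depth: int, params: dict) -> dict:
    """Return serving parameters adjusted for the current load (illustrative policy)."""
    params = dict(params)
    if queue_depth > 30:
        params["max_output_tokens"] = min(params["max_output_tokens"], 256)
    if queue_depth > 60:
        params["retrieval_k"] = min(params["retrieval_k"], 2)
    if queue_depth > 90:
        params["model"] = "small-model"
    return params

defaults = {"model": "large-model", "max_output_tokens": 1024, "retrieval_k": 8}
print(degrade(70, defaults))   # shorter outputs and lower-k retrieval, same model
```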

10.5 Rollback

Main idea. Keep a fast path back to the previous stable serving configuration.

Core relation:

v_\mathrm{new}\rightarrow v_\mathrm{old}

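A tiny sketch of the idea: keep the previous stable configuration loaded next to the active one so rollback is a pointer swap rather than a redeploy. The config contents and names are illustrative:

```python
configs = {
    "v_old": {"model": "model-stable", "max_batch": 32},   # previous stable config (assumed)
    "v_new": {"model": "model-candidate", "max_batch": 48}, # current active config (assumed)
}
active = "v_new"

def rollback() -> dict:
    """Switch the active pointer back to the previous stable configuration."""
    global active
    active = "v_old"
    return configs[active]

print(rollback())
```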


Practice Exercises

  1. Use Little's law to compute average concurrency.
  2. Compute utilization from arrival and service rates.
  3. Build an end-to-end latency budget.
  4. Compute max concurrent requests from KV cache memory.
  5. Estimate cost per million tokens.
  6. Compare batch choices under a latency budget.
  7. Compute autoscaling replica count.
  8. Compute an SLO error budget.
  9. Choose a graceful degradation action under overload.
  10. Write a serving trace checklist.

Why This Matters for AI

LLMs are not useful only because they are trained. They become useful when they can answer real requests within latency, cost, and reliability limits. Serving math keeps deployment decisions honest: every model, context length, retrieval choice, quantization format, and batching policy has a measurable tradeoff.

References