Serving is where LLM math becomes a live system. A deployed model must manage stochastic demand, queueing, prefill, decode, KV cache memory, batching, scheduling, cost, observability, and reliability.
Overview
The serving equation is not one equation. It is a latency budget:
$T_\mathrm{total} = T_\mathrm{queue} + T_\mathrm{prefill} + T_\mathrm{decode} + T_\mathrm{post}$
plus a memory budget:
$M_\mathrm{weights} + M_\mathrm{KV} + M_\mathrm{work} \le M_\mathrm{GPU}$
plus a cost budget:
$C_\mathrm{1M} = 10^6 \cdot C_\mathrm{hour} / (3600\,Q)$
Serving tradeoffs are practical: low latency, high throughput, low cost, high quality, and high reliability cannot all be maximized independently.
Prerequisites
- Efficient attention and inference metrics
- KV cache memory math
- RAG retrieval latency and context budget
- Basic probability for percentiles and queueing
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Demonstrates Little's law, utilization curves, latency decomposition, batch tradeoffs, memory concurrency, cost per million tokens, autoscaling, and SLO budgets. |
| exercises.ipynb | Ten practice problems for queueing, latency, memory, cost, scheduling, and observability. |
Learning Objectives
After this section, you should be able to:
- Compute TTFT, TPOT, total latency, throughput, and cost per million tokens.
- Use Little's law for serving capacity planning.
- Explain why high utilization increases queueing delay.
- Estimate max concurrency from weight memory and KV cache memory.
- Compare static batching, continuous batching, and chunked prefill.
- Explain serving parallelism choices and phase splitting.
- Build a cost model for tokens per dollar.
- Define SLOs, error budgets, and observability traces.
- Choose operational fallbacks under overload.
Table of Contents
- Serving Objectives
- 1.1 User latency
- 1.2 Throughput
- 1.3 Cost
- 1.4 Quality
- 1.5 Reliability
- Queueing Basics
- 2.1 Arrival rate
- 2.2 Service rate
- 2.3 Utilization
- 2.4 Little's law
- 2.5 Tail latency
- Latency Decomposition
- 3.1 Queue time
- 3.2 Prefill time
- 3.3 Decode time
- 3.4 Postprocess time
- 3.5 End-to-end latency
- Batching Tradeoffs
- 4.1 Static batching
- 4.2 Continuous batching
- 4.3 Batch efficiency
- 4.4 Head-of-line blocking
- 4.5 Chunked prefill
- Memory and Concurrency
- 5.1 Weight memory
- 5.2 KV cache memory
- 5.3 Workspace memory
- 5.4 Max concurrency
- 5.5 Fragmentation
- Parallelism for Serving
- 6.1 Tensor parallelism
- 6.2 Pipeline parallelism
- 6.3 Data parallel replicas
- 6.4 Phase splitting
- 6.5 Network cost
- Cost Modeling
- 7.1 GPU-hour cost
- 7.2 Tokens per dollar
- 7.3 Cost per million tokens
- 7.4 Utilization
- 7.5 Quality-adjusted cost
- Scheduling Policies
- 8.1 FIFO
- 8.2 Shortest remaining work
- 8.3 Priority queues
- 8.4 Admission control
- 8.5 Autoscaling
- Observability and SLOs
- 9.1 Metrics
- 9.2 Percentiles
- 9.3 Error budget
- 9.4 Tracing
- 9.5 Canarying
- Operational Tradeoffs
- 10.1 Fallback models
- 10.2 Caching
- 10.3 Rate limits
- 10.4 Graceful degradation
- 10.5 Rollback
Serving Control Loop
traffic -> admission (rate limit) -> queue (metrics) -> scheduler (batching) -> prefill/decode workers (memory) -> postprocess (traces) -> response
1. Serving Objectives
This part studies serving objectives as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.
| Subtopic | Main idea | Formula |
|---|---|---|
| User latency | interactive systems care about time to first token and total response time | $T_\mathrm{total} = T_\mathrm{queue} + T_\mathrm{prefill} + T_\mathrm{decode} + T_\mathrm{post}$ |
| Throughput | operators care about tokens completed per second | $Q$ tokens/sec |
| Cost | product decisions depend on cost per useful token | $C_\mathrm{1M} = 10^6 C_\mathrm{hour} / (3600\,Q)$ |
| Quality | systems changes must preserve model behavior | $\Delta S$ bounded |
| Reliability | serving must meet SLOs under variable load | $\Pr[T_\mathrm{total} \le T_\mathrm{SLO}] \ge$ target |
1.1 User latency
Main idea. Interactive systems care about time to first token and total response time.
Core relation: $\mathrm{TTFT} \approx T_\mathrm{queue} + T_\mathrm{prefill}$ and $T_\mathrm{total} \approx \mathrm{TTFT} + N_\mathrm{out} \cdot \mathrm{TPOT}$.
LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.
Worked micro-example. If requests arrive at rate $\lambda$ per second and average end-to-end latency is $W$ seconds, Little's law gives average concurrency $L = \lambda W$ requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.
Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.
AI connection. This is a practical serving control variable.
Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.
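As a sanity check, the latency pieces above compose by simple addition. A minimal Python sketch, with all timings and token counts as assumed example values rather than measurements:

```python
# Illustrative latency bookkeeping for a single request.
# All numbers are assumed example values, not benchmarks.

queue_s = 0.05    # time waiting before GPU work starts
prefill_s = 0.40  # prompt processing time (builds the KV cache)
tpot_s = 0.03     # time per output token during decode
n_out = 200       # number of generated tokens
post_s = 0.02     # sampling/detokenization/transport overhead

ttft = queue_s + prefill_s + tpot_s                      # time to first token
total = queue_s + prefill_s + n_out * tpot_s + post_s    # end-to-end latency

print(f"TTFT  = {ttft:.2f} s")
print(f"Total = {total:.2f} s")
print(f"Per-request decode rate = {1.0 / tpot_s:.1f} tok/s")
```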
1.2 Throughput
Main idea. Operators care about tokens completed per second.
Core relation: throughput $Q$ = output tokens completed per second, summed across all concurrent requests.
1.3 Cost
Main idea. Product decisions depend on cost per useful token.
Core relation: cost per useful token $= C_\mathrm{hour} / (3600\,Q)$.
1.4 Quality
Main idea. Systems changes must preserve model behavior.
Core relation: $\Delta S$ bounded, i.e. the quality change from any serving optimization stays within an accepted tolerance.
1.5 Reliability
Main idea. Serving must meet SLOs under variable load.
Core relation: $\Pr[T_\mathrm{total} \le T_\mathrm{SLO}] \ge$ target.
2. Queueing Basics
This part studies queueing basics as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.
| Subtopic | Main idea | Formula |
|---|---|---|
| Arrival rate | requests arrive stochastically | $\lambda$ requests/sec |
| Service rate | the system completes work at rate $\mu$ | $\mu$ requests/sec |
| Utilization | high utilization increases queueing delay | $\rho = \lambda / \mu$ |
| Little's law | average concurrency equals arrival rate times latency | $L = \lambda W$ |
| Tail latency | p95 and p99 grow before averages look alarming | $p_{95}$, $p_{99}$ |
2.1 Arrival rate
Main idea. Requests arrive stochastically.
Core relation: arrival rate $\lambda$ requests/sec.
2.2 Service rate
Main idea. The system completes work at rate $\mu$.
Core relation: service rate $\mu$ requests/sec.
2.3 Utilization
Main idea. High utilization increases queueing delay.
Core relation: $\rho = \lambda / \mu$; queueing delay grows sharply as $\rho \to 1$.
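LLM serving is not literally an M/M/1 queue, but the qualitative effect is the same: mean delay explodes as utilization approaches 1. A small Python sketch, with an assumed service rate:

```python
# Why high utilization hurts latency: a single-server M/M/1 sketch.
# Mean time in system is W = 1 / (mu - lambda); it blows up as rho -> 1.
# The service rate below is an assumed example value.

mu = 10.0  # requests/sec the server can complete
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    lam = rho * mu
    wait = 1.0 / (mu - lam)  # mean latency in an M/M/1 queue
    print(f"rho={rho:.2f}  mean latency={wait:6.2f} s")
```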
2.4 Little's law
Main idea. Average concurrency equals arrival rate times latency.
Core relation: $L = \lambda W$.
AI connection. This is the smallest useful equation for capacity planning.
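A minimal capacity-planning sketch of Little's law; the arrival rate, latency, and per-request KV size are assumed example values:

```python
# Little's law: average concurrency L = lambda * W.

lam = 4.0             # requests/sec (assumed)
w = 12.0              # average end-to-end latency in seconds (assumed)
kv_gb_per_req = 1.5   # assumed KV cache per active request, GB

concurrency = lam * w
print(f"Average concurrency L = {concurrency:.1f} requests")
print(f"KV memory needed      = {concurrency * kv_gb_per_req:.1f} GB")
```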
2.5 Tail latency
Main idea. Tail percentiles p95 and p99 grow before averages look alarming.
Core relation: $p_{95}$ and $p_{99}$ are the latencies below which 95% and 99% of requests complete.
3. Latency Decomposition
This part studies latency decomposition as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.
| Subtopic | Main idea | Formula |
|---|---|---|
| Queue time | waiting before GPU work starts | $T_\mathrm{queue}$ |
| Prefill time | process prompt and build KV cache | $T_\mathrm{prefill}$ |
| Decode time | generate output tokens serially | $T_\mathrm{decode} = N_\mathrm{out} \cdot \mathrm{TPOT}$ |
| Postprocess time | sampling, detokenization, filters, and transport add overhead | $T_\mathrm{post}$ |
| End-to-end latency | measure the whole path users experience | $T_\mathrm{total} = T_\mathrm{queue} + T_\mathrm{prefill} + T_\mathrm{decode} + T_\mathrm{post}$ |
3.1 Queue time
Main idea. Waiting before GPU work starts.
Core relation: $T_\mathrm{queue}$ = time from request arrival until prefill begins.
3.2 Prefill time
Main idea. Process the prompt and build the KV cache.
Core relation: $T_\mathrm{prefill}$ grows with prompt length $L_\mathrm{prompt}$ and is typically compute-bound.
3.3 Decode time
Main idea. Generate output tokens serially.
Core relation: $T_\mathrm{decode} = N_\mathrm{out} \cdot \mathrm{TPOT}$.
3.4 Postprocess time
Main idea. Sampling, detokenization, filters, and transport add overhead.
Core relation: $T_\mathrm{post}$ = sampling, detokenization, filtering, and transport time after the last token is generated.
3.5 End-to-end latency
Main idea. Measure the whole path users experience.
Core relation: $T_\mathrm{total} = T_\mathrm{queue} + T_\mathrm{prefill} + T_\mathrm{decode} + T_\mathrm{post}$.
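A small sketch of checking a latency decomposition against an SLO; the stage budgets and the SLO itself are assumed example targets:

```python
# Decompose an end-to-end latency budget and check it against an SLO.
# All component budgets are assumed example targets.

slo_s = 2.0
budget = {"queue": 0.10, "prefill": 0.50, "decode": 1.20, "postprocess": 0.10}

total = sum(budget.values())
print(f"total = {total:.2f} s vs SLO {slo_s:.2f} s -> {'OK' if total <= slo_s else 'over budget'}")
for stage, t in budget.items():
    print(f"  {stage:12s} {t:.2f} s ({100 * t / total:.0f}% of total)")
```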
4. Batching Tradeoffs
This part studies batching tradeoffs as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.
| Subtopic | Main idea | Formula |
|---|---|---|
| Static batching | wait to form a batch and run it together | $B$ fixed |
| Continuous batching | insert and remove requests between decode steps | $B_t$ changes |
| Batch efficiency | larger batches increase utilization until memory or latency limits | $Q(B)$ sub-linear in $B$ |
| Head-of-line blocking | long requests can delay short requests | $T_\mathrm{short}$ increases |
| Chunked prefill | split long prompts to avoid starving decode | prompt split into chunks of $c$ tokens |
4.1 Static batching
Main idea. Wait to form a batch and run it together.
Core relation: batch size $B$ is fixed for the whole batch.
4.2 Continuous batching
Main idea. Insert and remove requests between decode steps.
Core relation: the active batch $B_t$ changes between decode steps.
AI connection. This is why modern LLM serving does not behave like ordinary fixed-batch inference.
4.3 Batch efficiency
Main idea. Larger batches increase utilization until memory or latency limits.
Core relation: throughput $Q(B)$ rises sub-linearly with batch size $B$ until memory or latency limits bind.
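A toy model of the batching tradeoff, assuming decode step time grows linearly with batch size; this is an illustrative model with assumed coefficients, not a measured kernel profile:

```python
# Toy model: per-step decode time grows slowly with batch size, so aggregate
# tokens/sec rises sub-linearly while per-request TPOT also rises.

base_step_s = 0.030   # decode step time at batch size 1 (assumed)
marginal_s = 0.002    # extra step time per additional sequence (assumed)

for batch in (1, 4, 8, 16, 32, 64):
    step = base_step_s + marginal_s * (batch - 1)
    throughput = batch / step  # tokens/sec across the batch (1 token/seq/step)
    print(f"B={batch:3d}  step={step*1e3:5.1f} ms  "
          f"tokens/s={throughput:6.1f}  per-request TPOT={step*1e3:5.1f} ms")
```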
4.4 Head-of-line blocking
Main idea. Long requests can delay short requests.
Core relation: $T_\mathrm{short}$ increases when long requests hold the batch or the head of the queue.
4.5 Chunked prefill
Main idea. Split long prompts to avoid starving decode.
Core relation: a long prompt is split into chunks of $c$ tokens that are interleaved with decode steps of other requests.
5. Memory and Concurrency
This part studies memory and concurrency as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.
| Subtopic | Main idea | Formula |
|---|---|---|
| Weight memory | model weights consume a fixed resident footprint | $M_\mathrm{weights} = N_\mathrm{params} \cdot b$ |
| KV cache memory | active requests consume context-dependent memory | $M_\mathrm{KV} \propto$ layers $\times$ KV heads $\times$ head dim $\times$ context |
| Workspace memory | kernels and temporary buffers also need headroom | $M_\mathrm{work}$ |
| Max concurrency | available memory bounds active tokens | $C_\mathrm{max} \approx (M_\mathrm{GPU} - M_\mathrm{weights} - M_\mathrm{work}) / M_\mathrm{KV,req}$ |
| Fragmentation | variable request lengths waste reserved cache blocks | wasted blocks reduce usable $M_\mathrm{KV}$ |
5.1 Weight memory
Main idea. Model weights consume a fixed resident footprint.
Core relation: $M_\mathrm{weights} = N_\mathrm{params} \cdot b$ bytes, with $b$ bytes per parameter.
5.2 KV cache memory
Main idea. Active requests consume context-dependent memory.
Core relation: per-request KV memory $\approx 2 \cdot n_\mathrm{layers} \cdot n_\mathrm{kv\,heads} \cdot d_\mathrm{head} \cdot L_\mathrm{context} \cdot b$ bytes.
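A back-of-envelope KV-cache size estimate per request, using an assumed example model shape and fp16 cache values:

```python
# Approximate KV cache per request:
# bytes ~= 2 (K and V) * layers * kv_heads * head_dim * context_len * bytes_per_value
# The model shape below is an assumed example configuration.

layers, kv_heads, head_dim = 32, 8, 128
context_len = 8192
bytes_per_value = 2  # fp16/bf16 KV cache

kv_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_value
print(f"KV cache per request ~= {kv_bytes / 2**30:.2f} GiB")
```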
5.3 Workspace memory
Main idea. Kernels and temporary buffers also need headroom.
Core relation: $M_\mathrm{work}$ = headroom for activations, kernel workspaces, and allocator overhead.
5.4 Max concurrency
Main idea. Available memory bounds active tokens.
Core relation: $C_\mathrm{max} \approx (M_\mathrm{GPU} - M_\mathrm{weights} - M_\mathrm{work}) / M_\mathrm{KV,\,per\ request}$.
AI connection. Most practical serving limits are memory limits before they are pure compute limits.
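A minimal sketch of the concurrency bound implied by the memory budget; all sizes are assumed example values:

```python
# Estimate max concurrency from the memory budget:
# C_max ~= (GPU memory - weights - workspace) / KV per request.
# All sizes below are assumed example values.

gpu_gib = 80.0
weights_gib = 26.0        # assumed resident weight footprint
workspace_gib = 6.0       # activations, kernel buffers, allocator headroom
kv_per_request_gib = 1.0  # e.g. from the KV-cache estimate above

free_gib = gpu_gib - weights_gib - workspace_gib
max_concurrency = int(free_gib // kv_per_request_gib)
print(f"KV budget = {free_gib:.1f} GiB -> roughly {max_concurrency} concurrent requests at this context length")
```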
5.5 Fragmentation
Main idea. Variable request lengths waste reserved cache blocks.
Core relation: fragmentation = KV blocks reserved but never filled; paged allocation reduces the waste.
6. Parallelism for Serving
This part studies parallelism for serving as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.
| Subtopic | Main idea | Formula |
|---|---|---|
| Tensor parallelism | split matrix operations across devices | per-device memory $\approx 1/t$ of the model |
| Pipeline parallelism | place layers on different devices | $p$ pipeline stages |
| Data parallel replicas | replicate the full serving stack for more throughput | $Q \approx R \cdot Q_\mathrm{replica}$ |
| Phase splitting | prefill and decode may run on different pools | separate prefill and decode pools |
| Network cost | multi-node serving pays communication latency | $T_\mathrm{comm} \approx$ latency $+$ bytes/bandwidth |
6.1 Tensor parallelism
Main idea. Split matrix operations across devices.
Core relation: each layer's matrix multiplies are split across $t$ devices, with a collective communication step every layer.
6.2 Pipeline parallelism
Main idea. Place layers on different devices.
Core relation: the layer stack is partitioned into $p$ pipeline stages placed on different devices.
6.3 Data parallel replicas
Main idea. Replicate the full serving stack for more throughput.
Core relation: total throughput $\approx R \cdot Q_\mathrm{replica}$ for $R$ independent replicas.
6.4 Phase splitting
Main idea. Prefill and decode may run on different pools.
Core relation: prefill and decode run on separate worker pools, with the KV cache handed off between them.
6.5 Network cost
Main idea. Multi-node serving pays communication latency.
Core relation: $T_\mathrm{comm} \approx$ per-message latency $+$ bytes / bandwidth, paid on every cross-device collective.
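A rough sketch of the latency-plus-bandwidth communication model applied to tensor-parallel decode; the link numbers and the per-layer collective count are assumptions for illustration:

```python
# Simple communication-cost model for multi-device serving:
# time ~= per-message latency + bytes / bandwidth, paid on every collective.
# All link and model numbers below are assumed example values.

latency_s = 10e-6       # per-collective launch latency
bandwidth_Bps = 300e9   # effective interconnect bandwidth, bytes/sec
hidden, layers = 8192, 80
bytes_per_value = 2

payload = hidden * bytes_per_value          # per-token activation payload per layer
per_layer = latency_s + payload / bandwidth_Bps
per_token = layers * 2 * per_layer          # assume ~2 collectives per transformer layer
print(f"communication overhead per decoded token ~= {per_token * 1e3:.3f} ms")
```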
7. Cost Modeling
This part studies cost modeling as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.
| Subtopic | Main idea | Formula |
|---|---|---|
| GPU-hour cost | hardware price turns time into dollars | $C_\mathrm{hour}$ |
| Tokens per dollar | throughput divided by hourly cost | $\mathrm{tokens}/\$ = 3600\,Q / C_\mathrm{hour}$ |
| Cost per million tokens | standardize cost reporting | $C_\mathrm{1M} = 10^6 C_\mathrm{hour} / (3600\,Q)$ |
| Utilization | idle capacity increases effective cost | effective cost = billed time / useful tokens |
| Quality-adjusted cost | cheaper systems can be worse if quality falls | cost compared at matched quality |
7.1 GPU-hour cost
Main idea. Hardware price turns time into dollars.
Core relation: $C_\mathrm{hour}$ = price per GPU-hour $\times$ number of GPUs serving the model.
7.2 Tokens per dollar
Main idea. Throughput divided by hourly cost.
Core relation: $\mathrm{tokens}/\$ = 3600\,Q / C_\mathrm{hour}$.
7.3 Cost per million tokens
Main idea. Standardize cost reporting.
Core relation: $C_\mathrm{1M} = 10^6 \cdot C_\mathrm{hour} / (3600\,Q)$.
AI connection. This is the number that connects kernel work to product economics.
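A minimal cost-model sketch tying GPU-hour price to tokens per dollar and cost per million tokens; the price and sustained throughput are assumed example values:

```python
# Cost per million tokens from hourly price and sustained throughput.
# Price and throughput are assumed example values.

cost_per_hour = 4.00       # $/GPU-hour
gpus = 1
throughput_tok_s = 2500.0  # sustained output tokens/sec across the replica

tokens_per_hour = throughput_tok_s * 3600
hourly_cost = cost_per_hour * gpus
cost_per_million = hourly_cost / tokens_per_hour * 1e6

print(f"tokens per dollar  = {tokens_per_hour / hourly_cost:,.0f}")
print(f"cost per 1M tokens = ${cost_per_million:.2f}")
```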
7.4 Utilization
Main idea. Idle capacity increases effective cost.
Core relation: effective cost per token = billed GPU time / useful tokens, so idle capacity raises it.
7.5 Quality-adjusted cost
Main idea. Cheaper systems can be worse if quality falls.
Core relation: compare cost per token at matched quality, for example cost divided by (tokens $\times$ quality score).
8. Scheduling Policies
This part studies scheduling policies as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.
| Subtopic | Main idea | Formula |
|---|---|---|
| FIFO | serve requests in arrival order | arrival order |
| Shortest remaining work | favor small jobs to reduce mean latency | smallest remaining work first |
| Priority queues | separate interactive and batch traffic | $p_i$ priority |
| Admission control | reject or defer work when queues exceed budget | queue length $\le$ budget |
| Autoscaling | add replicas when load exceeds target utilization | $R = \lceil \lambda / (\rho_\mathrm{target}\,\mu_\mathrm{replica}) \rceil$ |
8.1 FIFO
Main idea. Serve requests in arrival order.
Core relation: requests are served strictly in arrival order.
8.2 Shortest remaining work
Main idea. Favor small jobs to reduce mean latency.
Core relation: schedule the request with the smallest remaining work first to reduce mean latency.
8.3 Priority queues
Main idea. Separate interactive and batch traffic.
Core relation: each request carries a priority $p_i$; interactive classes are scheduled ahead of batch classes.
8.4 Admission control
Main idea. Reject or defer work when queues exceed budget.
Core relation: admit a request only if queue length or predicted wait stays within budget; otherwise defer or reject it.
8.5 Autoscaling
Main idea. Add replicas when load exceeds target utilization.
Core relation: replicas $R = \lceil \lambda / (\rho_\mathrm{target} \cdot \mu_\mathrm{replica}) \rceil$.
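A minimal replica-sizing sketch using the target-utilization rule above; the arrival rate and per-replica capacity are assumed example values:

```python
# Size the replica count so each replica runs at or below a target utilization.
# Arrival rate and per-replica capacity are assumed example values.

import math

lam = 120.0        # incoming requests/sec across the service
mu_replica = 15.0  # requests/sec one replica can complete
rho_target = 0.7   # keep headroom so queueing delay stays bounded

replicas = math.ceil(lam / (rho_target * mu_replica))
print(f"replicas needed = {replicas} "
      f"(resulting utilization = {lam / (replicas * mu_replica):.2f})")
```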
9. Observability and SLOs
This part studies observability and SLOs as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.
| Subtopic | Main idea | Formula |
|---|---|---|
| Metrics | log TTFT, TPOT, total latency, queue time, throughput, memory, and errors | per-request records |
| Percentiles | track p50, p95, and p99 separately | $p_{50}$, $p_{95}$, $p_{99}$ |
| Error budget | allowed failures over a window | $(1 - \mathrm{SLO}) \times$ requests |
| Tracing | attach per-request spans for queue, prefill, decode, retrieval, and postprocess | spans sum to total latency |
| Canarying | roll out changes to a small fraction and compare metrics | canary fraction $f$ |
9.1 Metrics
Main idea. Log TTFT, TPOT, total latency, queue time, throughput, memory, and errors.
Core relation: every request produces a structured metrics record; all aggregates are computed from these records.
9.2 Percentiles
Main idea. Track p50, p95, and p99 separately.
Core relation: track $p_{50}$, $p_{95}$, and $p_{99}$ latency separately rather than a single average.
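A small sketch of computing latency percentiles from per-request samples; the synthetic latency distribution is an assumption for illustration:

```python
# Percentiles from per-request latency samples; tails move before the mean does.
# The synthetic distribution below is assumed for illustration only.

import random
import statistics

random.seed(0)
# Mostly fast requests with an occasional slow tail.
samples = [
    random.uniform(0.3, 0.8) if random.random() < 0.95 else random.uniform(2.0, 6.0)
    for _ in range(10_000)
]

qs = statistics.quantiles(samples, n=100)  # 99 cut points: qs[49]=p50, qs[94]=p95, qs[98]=p99
print(f"mean = {statistics.fmean(samples):.2f} s")
print(f"p50  = {qs[49]:.2f} s, p95 = {qs[94]:.2f} s, p99 = {qs[98]:.2f} s")
```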
9.3 Error budget
Main idea. Allowed failures over a window.
Core relation: error budget $= (1 - \mathrm{SLO\ target}) \times$ requests in the window.
LLM serving is a queueing and memory-management problem wrapped around transformer inference. The model must be fast enough, cheap enough, reliable enough, and good enough at the same time. Improving one axis can hurt another: larger batches improve throughput but can increase latency; longer contexts improve answer quality but consume KV memory; quantization saves memory but can change quality.
Worked micro-example. If requests arrive at per second and average end-to-end latency is seconds, Little's law gives average concurrency requests. If each active request consumes KV cache proportional to its context length, concurrency directly becomes a memory-planning number.
Implementation check. Measure queue time, prefill time, decode TPOT, output length, memory use, and final status for each request. Then inspect percentiles, not only means.
AI connection. This is a practical serving control variable.
Common mistake. Do not report tokens/sec alone. A serving system can have high throughput while p95 latency is unacceptable.
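A one-function sketch of the error-budget arithmetic from the micro-example. The SLO target, window size, and violation count are the assumed inputs.

```python
def error_budget(slo_target: float, requests_in_window: int) -> int:
    """Allowed SLO violations (errors or over-latency responses) in the window."""
    return int((1.0 - slo_target) * requests_in_window)

def budget_remaining(slo_target: float, requests_in_window: int, violations_so_far: int) -> int:
    """How much of the budget is left before risky changes should pause."""
    return error_budget(slo_target, requests_in_window) - violations_so_far

# 99.5% success SLO over 1,000,000 monthly requests -> 5,000 allowed violations.
print(error_budget(0.995, 1_000_000))            # 5000
print(budget_remaining(0.995, 1_000_000, 3200))  # 1800
```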
9.4 Tracing
Main idea. Attach per-request spans for queue, prefill, decode, retrieval, and postprocess.
Core relation: per-request spans should account for the observed total, $T_{\text{total}} \approx T_{\text{queue}} + T_{\text{prefill}} + T_{\text{decode}} + T_{\text{retrieval}} + T_{\text{post}}$.
Worked micro-example. If a request shows 3 s total latency with 0.2 s of prefill and 1.1 s of decode, the trace immediately says whether the remaining 1.7 s was queueing, retrieval, or postprocessing.
Implementation check. Attach queue, prefill, decode, retrieval, and postprocess spans to every request ID, and keep full traces for a sample of requests even when aggregate metrics look healthy.
AI connection. Without per-request traces, serving optimization becomes guesswork.
Common mistake. Do not debug tail latency from aggregate dashboards alone; the slow requests need their own traces.
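A sketch of a per-request trace whose named spans roughly sum to the total. The `RequestTrace` class, span names, and timing helper are illustrative, not a specific tracing library's API.

```python
import time
from contextlib import contextmanager

class RequestTrace:
    """Collects named spans (queue, prefill, decode, retrieval, postprocess) per request."""
    def __init__(self, request_id: str):
        self.request_id = request_id
        self.spans: dict[str, float] = {}

    @contextmanager
    def span(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans[name] = self.spans.get(name, 0.0) + time.perf_counter() - start

    def report(self) -> str:
        total = sum(self.spans.values())
        parts = ", ".join(f"{k}={v * 1000:.1f}ms" for k, v in self.spans.items())
        return f"{self.request_id}: {parts}, total~{total * 1000:.1f}ms"

# Usage sketch with stand-in work in place of real prefill/decode.
trace = RequestTrace("req-42")
with trace.span("queue"):
    time.sleep(0.01)
with trace.span("prefill"):
    time.sleep(0.02)
with trace.span("decode"):
    time.sleep(0.05)
print(trace.report())
```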
9.5 Canarying
Main idea. Roll out changes to a small fraction and compare metrics.
Core relation: route a small fraction $f$ of traffic (often 1-5%) to the new configuration and compare its TTFT, TPOT, error rate, and quality signals against the baseline before expanding the rollout.
Worked micro-example. A canary at $f = 0.05$ on 100 requests per second produces 5 requests per second of canary traffic: enough to compare $p_{95}$ latency within hours while limiting the blast radius of a regression.
Implementation check. Compare the same percentile metrics on canary and baseline over the same window, and define the rollback trigger before the rollout starts.
AI connection. For LLMs, canaries should also watch quality signals such as refusal rate, truncation, and eval scores, because a kernel or quantization change can shift outputs without raising any errors.
Common mistake. Do not call a canary healthy from mean latency alone; check tail percentiles and error budget burn.
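A minimal sketch of a canary comparison on tail latency. The 20% regression threshold, the sample latencies, and the pass/fail rule are assumptions made to keep the example concrete.

```python
def p95(values):
    """Nearest-rank 95th percentile."""
    s = sorted(values)
    return s[max(0, int(round(0.95 * len(s))) - 1)]

def canary_ok(baseline_lat, canary_lat, max_regression=0.20):
    """Pass the canary only if its p95 latency is within 20% of the baseline's."""
    base, can = p95(baseline_lat), p95(canary_lat)
    return can <= base * (1.0 + max_regression), base, can

baseline = [0.8, 0.9, 1.0, 1.1, 1.2, 0.95, 1.05, 1.0, 0.9, 1.1]
canary   = [0.9, 1.0, 1.6, 1.1, 1.4, 1.2, 1.5, 1.3, 1.0, 1.7]

ok, base_p95, can_p95 = canary_ok(baseline, canary)
print(f"baseline p95={base_p95:.2f}s canary p95={can_p95:.2f}s -> "
      f"{'keep rolling out' if ok else 'roll back'}")
```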
10. Operational Tradeoffs
This part studies operational tradeoffs as systems math for LLM deployment. The useful habit is to connect every serving choice to latency, throughput, memory, cost, quality, or reliability.
| Subtopic | Question | Formula |
|---|---|---|
| Fallback models | route to smaller models under overload | |
| Caching | reuse repeated prompts or retrieved context when safe | reuse $y = f(x)$ on a cache hit |
| Rate limits | protect service health by limiting demand | |
| Graceful degradation | shorter outputs, lower-$k$ retrieval, or a smaller model can preserve responsiveness | $T\downarrow$ with bounded $\Delta S$ |
| Rollback | keep a fast path back to the previous stable serving configuration | |
10.1 Fallback models
Main idea. Route to smaller models under overload.
Core relation: when utilization nears saturation, routing part of the traffic to a smaller model raises the effective service rate and pulls $\rho = \lambda / \mu$ back below the region where queueing delay explodes.
Worked micro-example. If the fallback model serves each request in roughly half the time of the primary model, shifting overload traffic to it shortens the queue for every request that remains on the primary.
Implementation check. Decide in advance which request classes may be downgraded, log which model actually served each request, and measure the quality delta on downgraded traffic.
AI connection. Fallback routing trades a bounded quality drop for predictable latency during load spikes.
Common mistake. Do not fall back silently; unrecorded downgrades make quality regressions look like failures of the primary model.
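A sketch of utilization-triggered fallback routing. The 0.85 overload threshold, the priority classes, and the model names are illustrative assumptions.

```python
def route_request(utilization: float, request_priority: str) -> str:
    """Route to the primary model normally; under overload, send low-priority
    traffic to a smaller, faster fallback model to pull utilization back down."""
    OVERLOAD = 0.85  # illustrative utilization threshold near the queueing knee
    if utilization < OVERLOAD or request_priority == "high":
        return "large-model"
    return "small-model"

for rho, prio in [(0.60, "low"), (0.90, "low"), (0.90, "high")]:
    print(f"utilization={rho:.2f} priority={prio} -> {route_request(rho, prio)}")
```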
10.2 Caching
Main idea. Reuse repeated prompts or retrieved context when safe.
Core relation: on a cache hit, return the stored $y = f(x)$ (or reuse a stored prefix KV cache) instead of recomputing, so expected compute per request scales with the miss rate.
Worked micro-example. With hit rate $h$ on exact-match prompts, compute for that traffic falls to roughly $(1 - h)$ of the uncached cost; a 30% hit rate removes about 30% of its prefill and decode work.
Implementation check. Cache only when reuse is safe (same model, same prompt or prefix, no private per-user context), and track hit rate and staleness.
AI connection. Prompt and prefix caching directly reduce prefill compute and KV memory pressure, which is why shared system prompts are cheap to serve.
Common mistake. Do not cache across users when prompts contain private data, or when sampling is expected to produce varied outputs.
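A sketch of an exact-match response cache keyed on model and prompt, with the safety condition from the implementation check. The `PromptCache` class, its key scheme, and the example strings are assumptions, not a production design.

```python
import hashlib

class PromptCache:
    """Exact-match response cache: safe only when the same (model, prompt) pair
    should deterministically produce a reusable answer (e.g. temperature 0)."""
    def __init__(self):
        self.store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str):
        key = self._key(model, prompt)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        return None

    def put(self, model: str, prompt: str, response: str):
        self.store[self._key(model, prompt)] = response

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = PromptCache()
if cache.get("m-7b", "What is Little's law?") is None:       # miss: compute and store
    cache.put("m-7b", "What is Little's law?", "L = lambda * W")
print(cache.get("m-7b", "What is Little's law?"), f"hit_rate={cache.hit_rate():.2f}")
```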
10.3 Rate limits
Main idea. Protect service health by limiting demand.
Core relation: cap admitted load so that $\lambda_{\text{admitted}} \le \rho_{\max}\,\mu$, keeping utilization below the point where queueing delay grows sharply.
Worked micro-example. If a replica sustains $\mu = 5$ requests per second and the target utilization is $\rho_{\max} = 0.7$, admit at most 3.5 requests per second per replica and defer or reject the rest.
Implementation check. Rate-limit per tenant or API key, return explicit backoff signals, and alert on the rejection rate instead of hiding it.
AI connection. Token-based limits (prompt plus expected output tokens) track real cost better than request counts, because LLM request sizes vary enormously.
Common mistake. Do not size limits from the average request; long-context requests consume far more prefill compute and KV memory than the mean suggests.
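A token-bucket sketch for the admission rule above, limiting admitted tokens per second rather than raw request counts. The refill rate, burst size, and per-request token estimates are illustrative assumptions.

```python
import time

class TokenBucket:
    """Admit a request only if its estimated token cost fits in the bucket;
    the refill rate is chosen so admitted load stays below rho_max * mu."""
    def __init__(self, tokens_per_s: float, burst: float):
        self.rate = tokens_per_s
        self.capacity = burst
        self.level = burst
        self.last = time.monotonic()

    def admit(self, token_cost: float) -> bool:
        now = time.monotonic()
        # Refill since the last check, capped at the burst capacity.
        self.level = min(self.capacity, self.level + (now - self.last) * self.rate)
        self.last = now
        if token_cost <= self.level:
            self.level -= token_cost
            return True
        return False

# Illustrative limit: 4,000 tokens/s sustained, bursts up to 8,000 tokens.
bucket = TokenBucket(tokens_per_s=4_000, burst=8_000)
for prompt_tokens, expected_output in [(1_000, 500), (6_000, 2_000), (500, 200)]:
    cost = prompt_tokens + expected_output
    print(cost, "tokens ->", "admit" if bucket.admit(cost) else "reject")
```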
10.4 Graceful degradation
Main idea. Shorter outputs, lower-$k$ retrieval, or a smaller model can preserve responsiveness.
Core relation: choose actions that give $T\downarrow$ with bounded $\Delta S$, that is, lower latency at a quality change small enough to remain acceptable.
Worked micro-example. Halving the output-token cap roughly halves decode time for responses that hit the cap, and cutting retrieval from $k = 8$ to $k = 4$ passages shortens prefill, both without swapping the model.
Implementation check. Rank degradation actions in advance (shorter outputs, lower $k$, smaller model), trigger them from utilization or queue depth, and log which action fired for each request.
AI connection. Graceful degradation keeps TTFT and queue time bounded during spikes instead of letting every request time out equally.
Common mistake. Do not degrade invisibly; record the degraded mode so quality dashboards can separate normal and degraded traffic.
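A sketch of an ordered degradation policy driven by utilization. The actions follow the list above (output cap, retrieval $k$, fallback model), but the thresholds, parameter names, and model names are assumptions.

```python
def degradation_plan(utilization: float) -> dict:
    """Return serving parameters for the current load level.
    Actions are applied in order of increasing quality impact."""
    plan = {"max_output_tokens": 1024, "retrieval_k": 8, "model": "large-model"}
    if utilization > 0.75:   # first: cap output length (cuts decode time)
        plan["max_output_tokens"] = 512
    if utilization > 0.85:   # then: shrink retrieved context (cuts prefill length)
        plan["retrieval_k"] = 4
    if utilization > 0.95:   # last resort: route to the smaller model
        plan["model"] = "small-model"
    return plan

for rho in (0.6, 0.8, 0.9, 0.97):
    print(rho, degradation_plan(rho))
```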
10.5 Rollback
Main idea. Keep a fast path back to the previous stable serving configuration.
Core relation: time to recover is roughly detection time plus rollback time, so both must be short and rehearsed.
Worked micro-example. If a canary comparison flags a regression 10 minutes after rollout and the previous serving configuration can be restored in 5 minutes, the incident consumes about 15 minutes of error budget instead of hours.
Implementation check. Keep the last known-good weights, serving configuration, and runtime image deployable in a single step, and exercise the rollback path regularly.
AI connection. Model, kernel, and quantization changes can regress quality without raising errors, so rollback triggers must include quality and tail-latency signals, not just error rate.
Common mistake. Do not treat rollback as a rare emergency path; an untested rollback is the slowest one.
Practice Exercises
- Use Little's law to compute average concurrency.
- Compute utilization from arrival and service rates.
- Build an end-to-end latency budget.
- Compute max concurrent requests from KV cache memory.
- Estimate cost per million tokens.
- Compare batch choices under a latency budget.
- Compute autoscaling replica count.
- Compute an SLO error budget.
- Choose a graceful degradation action under overload.
- Write a serving trace checklist.
Why This Matters for AI
LLMs are not useful only because they are trained. They become useful when they can answer real requests within latency, cost, and reliability limits. Serving math keeps deployment decisions honest: every model, context length, retrieval choice, quantization format, and batching policy has a measurable tradeoff.
References
- Gyeong-In Yu et al., "Orca: A Distributed Serving System for Transformer-Based Generative Models", 2022: https://www.usenix.org/conference/osdi22/presentation/yu
- Woosuk Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention", 2023: https://arxiv.org/abs/2309.06180
- Reiner Pope et al., "Efficiently Scaling Transformer Inference", 2022: https://arxiv.org/abs/2211.05102
- Ying Sheng et al., "FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU", 2023: https://arxiv.org/abs/2303.06865
- Amey Agrawal et al., "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve", 2024: https://arxiv.org/abs/2403.02310