Efficient inference is the mathematics of turning a trained LLM into a responsive service. The main objects are attention cost, KV cache memory, prefill and decode latency, batching, scheduling, and probability-preserving acceleration.
Overview
Autoregressive inference has two phases:
prompt tokens -> prefill -> KV cache -> decode one token -> append -> decode one token -> ...
Prefill processes the prompt in parallel and is often compute-heavy. Decode emits one token at a time and is often memory-bandwidth-heavy because each step reads weights and KV cache. Efficient serving is therefore not just "make attention faster." It is a coordinated memory, kernel, cache, batching, and scheduling problem.
The central cache formula is:
$M_\mathrm{KV} = 2 \cdot B \cdot L \cdot T \cdot n_\mathrm{kv} \cdot d_\mathrm{head} \cdot p$
The factor 2 is for keys and values. $B$ is batch size, $L$ is layers, $T$ is cached context length, $n_\mathrm{kv}$ is the number of key-value heads, $d_\mathrm{head}$ is head dimension, and $p$ is bytes per cache element.
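The theory.ipynb notebook covers this computation; as a quick standalone check, here is a minimal Python sketch of the same formula. The model shape in the example is illustrative, not taken from any particular model.

```python
def kv_cache_bytes(batch, layers, context_len, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in bytes: 2 (keys and values) * B * L * T * n_kv * d_head * p."""
    return 2 * batch * layers * context_len * kv_heads * head_dim * bytes_per_elem

# Illustrative shape: 32 layers, one 4096-token context, 32 KV heads of dim 128, bf16 cache.
size = kv_cache_bytes(batch=1, layers=32, context_len=4096,
                      kv_heads=32, head_dim=128, bytes_per_elem=2)
print(f"{size / 2**30:.2f} GiB per request")  # 2.00 GiB
```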
Prerequisites
- Attention mechanism math and causal masking
- Positional encodings and autoregressive decoding
- Scaling-law and training-at-scale cost vocabulary
- Basic memory units and tensor shapes
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Computes prefill/decode costs, KV cache memory, MHA/MQA/GQA savings, FlashAttention memory intuition, paged-cache waste, speculative speedup, and latency budgets. |
| exercises.ipynb | Ten practice problems for KV cache sizing, GQA savings, pipeline latency, page waste, speculative decoding, and serving diagnostics. |
Learning Objectives
After this section, you should be able to:
- Distinguish prefill latency, decode latency, TTFT, TPOT, throughput, and tail latency.
- Estimate dense attention and decode attention cost.
- Compute KV cache memory for MHA, MQA, and GQA.
- Explain why FlashAttention is exact even though it restructures attention around IO.
- Explain why PagedAttention-style cache management improves serving memory utilization.
- Reason about continuous batching, chunked prefill, and head-of-line blocking.
- Estimate speculative decoding speedup from acceptance rate and draft latency.
- Build an inference debugging checklist that checks both speed and correctness.
Table of Contents
- Inference Phases
- 1.1 Prefill
- 1.2 Decode
- 1.3 TTFT
- 1.4 TPOT
- 1.5 Throughput
- Attention Cost
- 2.1 Training attention
- 2.2 Decode attention
- 2.3 Causal mask
- 2.4 Memory traffic
- 2.5 Arithmetic intensity
- KV Cache Math
- 3.1 Cache contents
- 3.2 Cache memory
- 3.3 MHA
- 3.4 MQA
- 3.5 GQA
- FlashAttention and IO Awareness
- 4.1 Naive attention memory
- 4.2 Tiling
- 4.3 Online softmax
- 4.4 Exactness
- 4.5 Long context
- Serving Memory Management
- 5.1 Request lengths vary
- 5.2 Paged KV cache
- 5.3 Continuous batching
- 5.4 Prefix sharing
- 5.5 Eviction and recompute
- Decode Acceleration
- Batching and Scheduling
- 7.1 Static batching
- 7.2 Dynamic batching
- 7.3 Head-of-line blocking
- 7.4 Chunked prefill
- 7.5 SLA tradeoff
- Quantization and Bandwidth
- 8.1 Weight bandwidth
- 8.2 KV quantization
- 8.3 Accuracy tradeoff
- 8.4 Kernel support
- 8.5 Mixed systems
- Latency Metrics
- 9.1 Queue time
- 9.2 First-token latency
- 9.3 Inter-token latency
- 9.4 Tail latency
- 9.5 Cost per token
- Debugging Efficient Inference
- 10.1 Shape checks
- 10.2 Cache correctness
- 10.3 Memory accounting
- 10.4 Bottleneck attribution
- 10.5 Quality regression
Mental Model
```
quality target
      |
model weights -- kernels -- memory bandwidth -- KV cache -- scheduler -- user latency
                    |               |               |            |
             FlashAttention    quantization       paging      batching
```
Every speedup lives somewhere in this chain. The math tells you which bottleneck it can actually improve.
1. Inference Phases
This part studies inference phases as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.
| Subtopic | Main idea | Formula |
|---|---|---|
| Prefill | process the prompt in parallel and create the KV cache | $T_\mathrm{prefill}\propto T_\mathrm{prompt}^2$ for dense attention |
| Decode | generate one token at a time using cached keys and values | $T_\mathrm{decode}\propto T_\mathrm{context}$ per output token |
| TTFT | time to first token includes scheduling plus prefill | $T_\mathrm{TTFT}\approx T_\mathrm{queue} + T_\mathrm{prefill}$ |
| TPOT | time per output token measures decode speed | $T_\mathrm{TPOT}\approx T_\mathrm{decode,total}/N_\mathrm{out}$ |
| Throughput | serving systems balance tokens per second against latency | tokens/sec $\approx B/T_\mathrm{TPOT}$ |
1.1 Prefill
Main idea. Process the prompt in parallel and create the KV cache.
Core relation: $T_\mathrm{prefill}\propto T_\mathrm{prompt}^2$ for dense attention.
Inference math is different from training math because the bottleneck shifts. Training usually wants maximum throughput over large batches. Interactive inference must manage first-token latency, per-token latency, KV cache memory, and request scheduling. The same transformer can feel fast or slow depending on prompt length, output length, batch shape, cache layout, and kernel support.
AI connection. Prefill cost is the main driver of time to first token for long prompts.
1.2 Decode
Main idea. Generate one token at a time using cached keys and values.
Core relation: $T_\mathrm{decode}\propto T_\mathrm{context}$ per output token.
AI connection. Decode is usually memory-bandwidth-bound because every step re-reads the weights and the growing KV cache.
Implementation check. Compare cached decode against full-prefix recomputation on a tiny example. Then measure memory and latency separately for prefill and decode; an aggregate tokens/sec number hides too much.
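To make the prefill/decode asymmetry concrete, here is a rough back-of-the-envelope sketch in Python. All shapes and lengths are illustrative assumptions, and only attention score and value FLOPs are counted.

```python
# Rough attention-cost comparison. Illustrative sketch only: head count,
# head dimension, prompt length, and output length are made up.
d_head, n_heads = 128, 32
T_prompt, T_out = 4096, 256

# Prefill: every prompt token attends over the prompt (roughly O(T^2 d) FLOPs).
prefill_score_flops = 2 * n_heads * T_prompt * T_prompt * d_head

# Decode: each new token attends to the whole cached context (O(T d) per token).
decode_score_flops_per_token = 2 * n_heads * (T_prompt + T_out) * d_head

print(f"prefill score FLOPs        ~ {prefill_score_flops:.2e}")
print(f"decode score FLOPs / token ~ {decode_score_flops_per_token:.2e}")
print(f"ratio ~ {prefill_score_flops / decode_score_flops_per_token:.0f}x")
```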
1.3 TTFT
Main idea. Time to first token includes scheduling plus prefill.
Core relation: $T_\mathrm{TTFT}\approx T_\mathrm{queue} + T_\mathrm{prefill}$.
1.4 TPOT
Main idea. Time per output token measures decode speed.
Core relation: $T_\mathrm{TPOT}\approx T_\mathrm{decode,total}/N_\mathrm{out}$ for $N_\mathrm{out}$ generated tokens.
1.5 Throughput
Main idea. Serving systems balance tokens per second against latency.
Core relation: tokens per second $\approx B/T_\mathrm{TPOT}$ for a batch of $B$ concurrent requests.
Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.
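A minimal sketch of how TTFT, TPOT, and throughput fall out of per-token timestamps; the timestamps themselves are invented for illustration.

```python
# Derive TTFT, TPOT, and throughput for one request from output-token timestamps.
request_arrival = 0.00
token_times = [0.90, 0.95, 1.01, 1.06, 1.12]  # seconds at which each output token arrived

ttft = token_times[0] - request_arrival
tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
throughput = len(token_times) / (token_times[-1] - request_arrival)

print(f"TTFT {ttft*1000:.0f} ms, TPOT {tpot*1000:.0f} ms/token, {throughput:.1f} tok/s")
```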
2. Attention Cost
This part studies attention cost as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.
| Subtopic | Main idea | Formula |
|---|---|---|
| Training attention | dense attention over a full sequence forms a T by T score matrix | $O(T^2 d)$ per layer |
| Decode attention | one query attends to all cached keys | $O(Td)$ per generated token |
| Causal mask | future tokens are invisible but the triangular work is still large in prefill | $\approx T^2/2$ unmasked scores |
| Memory traffic | attention can be limited by reads and writes, not only FLOPs | $O(T^2)$ score bytes if materialized |
| Arithmetic intensity | roofline reasoning compares FLOPs to bytes moved | intensity $=$ FLOPs / bytes moved |
2.1 Training attention
Main idea. Dense attention over a full sequence forms a T by T score matrix.
Core relation: $O(T^2 d)$ attention FLOPs per layer.
2.2 Decode attention
Main idea. One query attends to all cached keys.
Core relation: $O(Td)$ per generated token.
2.3 Causal mask
Main idea. Future tokens are invisible, but the triangular work is still large in prefill.
Core relation: causal masking leaves roughly $T^2/2$ score entries, so prefill cost still grows quadratically.
2.4 Memory traffic
Main idea. Attention can be limited by reads and writes, not only FLOPs.
Core relation: materializing the score matrix moves $O(T^2)$ values between HBM and on-chip memory.
2.5 Arithmetic intensity
Main idea. Roofline reasoning compares FLOPs to bytes moved.
Core relation: arithmetic intensity $=$ FLOPs / bytes moved; kernels below the hardware's balance point are bandwidth-bound.
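A minimal roofline sketch for a single decode-attention head. The peak FLOP/s and bandwidth numbers are illustrative placeholders rather than the specification of any particular accelerator.

```python
# Roofline check for one decode step of one attention head.
peak_flops = 300e12        # FLOP/s, illustrative
peak_bandwidth = 2e12      # bytes/s, illustrative
balance_point = peak_flops / peak_bandwidth  # FLOPs per byte needed to be compute-bound

# One query attends over T cached keys/values in bf16.
T, d_head, bytes_per_elem = 4096, 128, 2
flops = 2 * 2 * T * d_head                      # QK^T plus the weighted sum over V
bytes_moved = 2 * T * d_head * bytes_per_elem   # read K and V once

intensity = flops / bytes_moved
print(f"intensity {intensity:.1f} FLOP/byte vs balance point {balance_point:.0f}")
print("bandwidth-bound" if intensity < balance_point else "compute-bound")
```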
3. KV Cache Math
This part studies KV cache math as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.
| Subtopic | Main idea | Formula |
|---|---|---|
| Cache contents | store keys and values for every layer, token, and KV head | $2 \times L \times T \times n_\mathrm{kv} \times d_\mathrm{head}$ elements per sequence |
| Cache memory | KV memory grows linearly with context and batch | $M_\mathrm{KV} = 2 \cdot B \cdot L \cdot T \cdot n_\mathrm{kv} \cdot d_\mathrm{head} \cdot p$ |
| MHA | standard multi-head attention has one KV head per query head | $n_\mathrm{kv} = n_q$ |
| MQA | multi-query attention shares one KV head across query heads | $n_\mathrm{kv} = 1$ |
| GQA | grouped-query attention interpolates between MHA and MQA | $1 < n_\mathrm{kv} < n_q$ |
3.1 Cache contents
Main idea. Store keys and values for every layer, token, and KV head.
Core relation: the cache holds $2 \times L \times T \times n_\mathrm{kv} \times d_\mathrm{head}$ elements per sequence.
3.2 Cache memory
Main idea. KV memory grows linearly with context and batch.
Core relation: $M_\mathrm{KV} = 2 \cdot B \cdot L \cdot T \cdot n_\mathrm{kv} \cdot d_\mathrm{head} \cdot p$.
Worked micro-example. For a model with $L$ layers, batch $B$, context $T$, $n_\mathrm{kv}$ KV heads, head dimension $d_\mathrm{head}$, and bf16 cache values with $p = 2$ bytes, the KV cache is $2 B L T n_\mathrm{kv} d_\mathrm{head} p$ bytes. Reducing $n_\mathrm{kv}$ from 32 to 8 cuts this cache by 4x.
AI connection. This one formula often decides how many requests fit on a serving GPU.
3.3 MHA
Main idea. Standard multi-head attention has one KV head per query head.
Core relation: $n_\mathrm{kv} = n_q$, so the cache stores keys and values for every query head.
3.4 MQA
Main idea. Multi-query attention shares one KV head across query heads.
Core relation: $n_\mathrm{kv} = 1$, shrinking cache memory by a factor of $n_q$ relative to MHA.
3.5 GQA
Main idea. Grouped-query attention interpolates between MHA and MQA.
Core relation: $1 < n_\mathrm{kv} < n_q$, with query heads sharing KV heads in groups of $n_q/n_\mathrm{kv}$.
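A small sketch comparing cache sizes across the three layouts; the helper mirrors the central cache formula, and the model shape is illustrative.

```python
def kv_cache_bytes(batch, layers, context_len, kv_heads, head_dim, bytes_per_elem=2):
    """Central cache formula: 2 * B * L * T * n_kv * d_head * p."""
    return 2 * batch * layers * context_len * kv_heads * head_dim * bytes_per_elem

n_q = 32  # query heads (illustrative)
for name, n_kv in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(batch=8, layers=32, context_len=8192,
                         kv_heads=n_kv, head_dim=128) / 2**30
    print(f"{name:6s} n_kv={n_kv:2d}  cache = {gib:6.1f} GiB")
# MHA 32 GiB, GQA-8 8 GiB, MQA 1 GiB for this shape.
```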
4. FlashAttention and IO Awareness
This part studies FlashAttention and IO awareness as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.
| Subtopic | Main idea | Formula |
|---|---|---|
| Naive attention memory | materializing attention scores costs T squared memory | $O(T^2)$ per head |
| Tiling | FlashAttention computes blocks while keeping partial statistics | $\mathrm{softmax}(QK^\top)V$ without storing all scores |
| Online softmax | blockwise max and denominator keep softmax exact | running max $m$ and denominator $\ell$ |
| Exactness | FlashAttention changes memory access, not the attention definition | output $= \mathrm{softmax}(QK^\top/\sqrt{d})V$ |
| Long context | IO savings matter more as T grows | $T^2$ score storage becomes the wall |
4.1 Naive attention memory
Main idea. Materializing attention scores costs $T^2$ memory.
Core relation: the full score matrix holds $O(T^2)$ entries per head.
4.2 Tiling
Main idea. FlashAttention computes blocks while keeping partial statistics.
Core relation: it produces $\mathrm{softmax}(QK^\top)V$ without storing all scores.
AI connection. FlashAttention is fast because it respects the memory hierarchy, not because it approximates attention.
4.3 Online softmax
Main idea. Blockwise max and denominator keep softmax exact.
Core relation: a running max $m$ and running denominator $\ell$ are rescaled as each block arrives, so the final normalization matches the full softmax.
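A minimal NumPy sketch of the blockwise statistics, checked against the direct stable-softmax denominator; the vector length and block count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1024)          # one row of attention scores
blocks = np.split(scores, 8)            # process in 8 blocks

m, l = -np.inf, 0.0                     # running max and running denominator
for block in blocks:
    m_new = max(m, block.max())
    l = l * np.exp(m - m_new) + np.exp(block - m_new).sum()  # rescale old sum, add new block
    m = m_new

direct = np.exp(scores - scores.max()).sum()   # denominator of the usual stable softmax
print(np.allclose(l, direct))                  # True: blockwise statistics are exact
```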
4.4 Exactness
Main idea. FlashAttention changes memory access, not the attention definition.
Core relation: the output is exactly $\mathrm{softmax}(QK^\top/\sqrt{d})V$, up to floating-point rounding.
4.5 Long context
Main idea. IO savings matter more as $T$ grows.
Core relation: $T^2$ score storage becomes the wall for long contexts.
5. Serving Memory Management
This part studies serving memory management as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.
| Subtopic | Main idea | Formula |
|---|---|---|
| Request lengths vary | static allocation wastes KV memory for short or finished requests | waste $= T_\mathrm{reserved} - T_\mathrm{used}$ per request |
| Paged KV cache | allocate KV cache in blocks and map logical tokens to physical blocks | waste $<$ one block per sequence |
| Continuous batching | add and remove requests between decode steps | $B_t$ changes over time |
| Prefix sharing | shared prompts can reuse KV cache | one cached copy of $T_\mathrm{shared}$ tokens |
| Eviction and recompute | memory pressure can force swapping or rebuilding cache | $T_\mathrm{recompute}$ trades against $M_\mathrm{free}$ |
5.1 Request lengths vary
Main idea. Static allocation wastes KV memory for short or finished requests.
Core relation: reserving $T_\mathrm{max}$ cache slots per request wastes $T_\mathrm{max} - T_\mathrm{used}$ slots whenever the request is shorter.
5.2 Paged KV cache
Main idea. Allocate KV cache in blocks and map logical tokens to physical blocks.
Core relation: internal fragmentation drops to less than one block per sequence.
AI connection. Paged allocation makes serving look more like an operating-system memory problem.
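A minimal sketch of the fragmentation math; block size, request lengths, and the static maximum context are illustrative assumptions.

```python
import math

block_size = 16                        # tokens per KV block
requests = [37, 512, 3, 1900, 260]     # current cached lengths in tokens (illustrative)

allocated = sum(math.ceil(t / block_size) * block_size for t in requests)
used = sum(requests)
print(f"paged: allocated {allocated} token-slots, used {used}, "
      f"waste {(allocated - used) / allocated:.1%}")

# Contrast with static allocation at a fixed maximum context length.
T_max = 4096
static_alloc = T_max * len(requests)
print(f"static: reserved {static_alloc} slots, "
      f"waste {(static_alloc - used) / static_alloc:.1%}")
```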
5.3 Continuous batching
Main idea. Add and remove requests between decode steps.
Core relation: the batch size $B_t$ changes over time as requests finish and new ones join.
5.4 Prefix sharing
Main idea. Shared prompts can reuse KV cache.
Core relation: a shared prefix of $T_\mathrm{shared}$ tokens is cached once instead of once per request.
5.5 Eviction and recompute
Main idea. Memory pressure can force swapping or rebuilding cache.
Core relation: $T_\mathrm{recompute}$ trades against $M_\mathrm{free}$.
6. Decode Acceleration
This part studies decode acceleration as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.
| Subtopic | Main idea | Formula |
|---|---|---|
| Greedy serial bottleneck | autoregressive decoding normally emits one token per target-model step | $K$ tokens require $K$ target passes |
| Speculative decoding | a draft model proposes tokens and the target verifies them | $k$ drafted tokens checked in one target pass |
| Acceptance rate | speedup depends on draft quality and draft latency | expected tokens per pass grows with acceptance rate $\alpha$ |
| Parallel heads | methods such as Medusa predict multiple future tokens with extra heads | $y_{t+1:t+k}$ candidates |
| Distribution preservation | exact speculative schemes preserve the target distribution when acceptance rules are correct | accept with probability $\min(1, p(x)/q(x))$ |
6.1 Greedy serial bottleneck
Main idea. Autoregressive decoding normally emits one token per target-model step.
Core relation: $K$ tokens require $K$ target passes.
6.2 Speculative decoding
Main idea. A draft model proposes tokens and the target verifies them.
Core relation: $k$ drafted tokens can be verified in a single target forward pass.
AI connection. The target model can verify several proposed tokens in one pass, reducing serial decode steps.
6.3 Acceptance rate
Main idea. Speedup depends on draft quality and draft latency.
Core relation: with per-token acceptance rate $\alpha$ and $k$ drafted tokens (assuming independent acceptances), the expected tokens emitted per target pass is roughly $(1-\alpha^{k+1})/(1-\alpha)$; the net speedup also depends on the draft model's cost.
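A minimal sketch of the expected speedup under the common idealization of a constant per-token acceptance rate; the acceptance rate, draft length, and relative draft cost below are assumptions, not measurements.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per draft-then-verify cycle with k drafted tokens."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

alpha = 0.8          # per-token acceptance rate (illustrative)
k = 4                # drafted tokens per cycle
c_draft = 0.1        # draft step cost relative to one target step (illustrative)

tokens = expected_tokens_per_pass(alpha, k)
cycle_cost = 1 + k * c_draft            # one target verify pass plus k draft steps
speedup = tokens / cycle_cost
print(f"{tokens:.2f} tokens per cycle, ~{speedup:.2f}x over plain decode")
```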
6.4 Parallel heads
Main idea. Methods such as Medusa predict multiple future tokens with extra heads.
Core relation: extra heads propose $y_{t+1:t+k}$ candidates that the model then verifies.
6.5 Distribution preservation
Main idea. Exact speculative schemes preserve the target distribution when acceptance rules are correct.
Core relation: accept a drafted token $x$ with probability $\min(1, p(x)/q(x))$ and resample from the residual distribution on rejection; the emitted tokens then follow the target distribution $p$.
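A toy NumPy sketch of the standard accept/resample rule on a four-symbol vocabulary, checking empirically that emitted samples follow the target distribution; the two distributions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.2, 0.2, 0.1])   # target probabilities
q = np.array([0.4, 0.4, 0.1, 0.1])   # draft probabilities

def speculative_step(p, q, rng):
    x = rng.choice(len(q), p=q)                    # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):       # accept with prob min(1, p/q)
        return x
    residual = np.maximum(p - q, 0.0)              # otherwise resample from (p - q)+
    return rng.choice(len(p), p=residual / residual.sum())

samples = [speculative_step(p, q, rng) for _ in range(50_000)]
print(np.bincount(samples, minlength=4) / len(samples))  # ~ [0.5, 0.2, 0.2, 0.1]
```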
7. Batching and Scheduling
This part studies batching and scheduling as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.
| Subtopic | Main idea | Formula |
|---|---|---|
| Static batching | wait for a batch, run together, then finish together | $B$ fixed |
| Dynamic batching | merge compatible requests at runtime | $B_t$ variable |
| Head-of-line blocking | one long request can delay short ones | $T_\mathrm{latency}$ depends on schedule |
| Chunked prefill | split long prompts so decode traffic is not starved | $T_\mathrm{prompt}$ split into chunks |
| SLA tradeoff | higher throughput can increase tail latency | $p95$ latency versus tokens/sec |
7.1 Static batching
Main idea. Wait for a batch, run together, then finish together.
Core relation: $B$ is fixed for the lifetime of the batch, so fast requests wait for slow ones.
7.2 Dynamic batching
Main idea. Merge compatible requests at runtime.
Core relation: $B_t$ is variable, chosen by the scheduler at each step.
7.3 Head-of-line blocking
Main idea. One long request can delay short ones.
Core relation: $T_\mathrm{latency}$ for a short request depends on the schedule, not only on its own work.
7.4 Chunked prefill
Main idea. Split long prompts so decode traffic is not starved.
Core relation: a prompt of $T_\mathrm{prompt}$ tokens is processed in fixed-size chunks interleaved with decode steps.
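A minimal sketch of a per-step token budget shared between decode and chunked prefill; the budget, request counts, and prompt length are illustrative, and real schedulers enforce many more constraints.

```python
import math

step_token_budget = 512        # tokens of work allowed per engine step (illustrative)
decode_requests = 48           # active decode requests, one token each per step
prompt_len = 6000              # new long prompt waiting to be prefilled

prefill_budget = step_token_budget - decode_requests      # leftover budget per step
steps_to_finish_prefill = math.ceil(prompt_len / prefill_budget)
print(f"prefill spread over {steps_to_finish_prefill} steps; "
      f"decode requests still advance every step")
```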
7.5 SLA tradeoff
Main idea. Higher throughput can increase tail latency.
Core relation: tune the scheduler against $p95$ latency versus tokens per second, not either alone.
8. Quantization and Bandwidth
This part studies quantization and bandwidth as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.
| Subtopic | Main idea | Formula |
|---|---|---|
| Weight bandwidth | decode often reloads weights for each generated token | $T_\mathrm{step} \gtrsim$ weight bytes / bandwidth |
| KV quantization | compressing KV cache can increase batch or context capacity | $M_\mathrm{KV} \propto p$ |
| Accuracy tradeoff | lower precision can change probabilities | |
| Kernel support | a quantized format helps only if kernels are fast | $T_\mathrm{kernel}$ matters |
| Mixed systems | weights, activations, and KV cache may use different precision | |
8.1 Weight bandwidth
Main idea. Decode often reloads weights for each generated token.
Core relation: at small batch sizes, per-token step time is bounded below by roughly weight bytes divided by memory bandwidth.
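A minimal sketch of that floor; the parameter count, weight precision, and bandwidth are illustrative placeholders.

```python
# Bandwidth floor on decode speed at batch size 1, assuming every weight is read per token.
params = 7e9                 # parameters (illustrative)
bytes_per_param = 2          # bf16 weights
bandwidth = 2e12             # bytes/s of memory bandwidth (illustrative)

step_floor = params * bytes_per_param / bandwidth   # seconds per token
print(f"decode floor ~ {step_floor*1000:.1f} ms/token "
      f"(~{1/step_floor:.0f} tok/s) if every weight is read once per token")
```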
8.2 KV quantization
Main idea. Compressing the KV cache can increase batch or context capacity.
Core relation: $M_\mathrm{KV} \propto p$, so halving bytes per cache element roughly doubles the context length or batch size that fits in the same memory.
8.3 Accuracy tradeoff
Main idea. Lower precision can change probabilities.
Core relation: quantization perturbs logits, so the served distribution can drift from the full-precision model; measure the drift rather than assuming it away.
8.4 Kernel support
Main idea. A quantized format helps only if kernels are fast.
Core relation: end-to-end $T_\mathrm{kernel}$ matters; a format without fast kernels can be slower despite using fewer bytes.
8.5 Mixed systems
Main idea. Weights, activations, and KV cache may use different precision.
Core relation: each tensor class can use its own bytes per element, so memory and accuracy budgets are set per component rather than globally.
9. Latency Metrics
This part studies latency metrics as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.
| Subtopic | Main idea | Formula |
|---|---|---|
| Queue time | user latency includes time before GPU work starts | $T_\mathrm{total} = T_\mathrm{queue} + T_\mathrm{prefill} + N_\mathrm{out} T_\mathrm{TPOT}$ |
| First-token latency | long prompts mainly hurt TTFT | $T_\mathrm{TTFT}$ grows with $T_\mathrm{prompt}$ |
| Inter-token latency | long outputs mainly accumulate TPOT | total decode time $\approx N_\mathrm{out} T_\mathrm{TPOT}$ |
| Tail latency | p95 and p99 matter more than average for products | $p95$, $p99$ of the latency distribution |
| Cost per token | serving cost combines hardware time and utilization | cost per token $\approx$ hardware cost per second / tokens per second |
9.1 Queue time
Main idea. User latency includes time before GPU work starts.
Core relation: $T_\mathrm{total} = T_\mathrm{queue} + T_\mathrm{prefill} + N_\mathrm{out} T_\mathrm{TPOT}$.
9.2 First-token latency
Main idea. Long prompts mainly hurt ttft.
Core relation:
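A rough TTFT estimator under the common approximation that prefill is compute-bound and costs about 2 FLOPs per parameter per prompt token; the parameter count, peak FLOP/s, and utilization below are illustrative assumptions:

```python
def estimate_ttft_s(prompt_tokens, n_params, peak_flops, mfu=0.4, queue_s=0.0):
    """TTFT ~ queue time + compute-bound prefill time.

    Uses the rough approximation of ~2 FLOPs per parameter per prompt token
    (attention-score FLOPs ignored) at a given model FLOPs utilization (mfu).
    """
    prefill_flops = 2.0 * n_params * prompt_tokens
    return queue_s + prefill_flops / (peak_flops * mfu)

# Illustrative: 7e9 parameters, 4096-token prompt, ~3e14 FLOP/s peak, 40% MFU.
print(f"TTFT ~ {estimate_ttft_s(4096, 7e9, 3e14):.2f} s")
```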
9.3 Inter-token latency
Main idea. Long outputs mainly accumulate TPOT.
Core relation: decode time $\approx N_\mathrm{out} \cdot \mathrm{TPOT}$ for $N_\mathrm{out}$ output tokens. A 1000-token answer at 20 ms per token spends about 20 s in decode no matter how fast prefill was.
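A rough TPOT estimator under the assumption that small-batch decode is memory-bandwidth-bound, so each step must at least stream the weights and the KV cache; all numbers are illustrative:

```python
def estimate_tpot_s(n_params, bytes_per_param, kv_cache_bytes, hbm_bytes_per_s, efficiency=0.6):
    """TPOT ~ bytes streamed per decode step / achieved memory bandwidth.

    Assumes the batch is small enough that decode is bandwidth-bound: every step
    reads all weights plus the current KV cache once.
    """
    bytes_per_step = n_params * bytes_per_param + kv_cache_bytes
    return bytes_per_step / (hbm_bytes_per_s * efficiency)

# Illustrative: 7e9 bf16 parameters, a 2 GiB KV cache, ~2e12 B/s HBM, 60% efficiency.
tpot = estimate_tpot_s(7e9, 2, 2 * 2**30, 2e12)
print(f"TPOT ~ {tpot * 1e3:.1f} ms/token")
```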
9.4 Tail latency
Main idea. p95 and p99 matter more than the average for products.
Core relation: track the latency quantiles $p_{50}$, $p_{95}$, and $p_{99}$. A service whose mean latency looks fine can still time out for its slowest few percent of requests.
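A small sketch of why the mean hides tail behavior, using simulated latencies; the exponential distribution here is only a stand-in for a real latency trace:

```python
import random

def percentile(samples, q):
    """Empirical quantile (q in [0, 1]) by sorting."""
    ordered = sorted(samples)
    idx = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[idx]

# Simulated per-request latencies in seconds: most requests are fast, a few straggle.
random.seed(0)
latencies = [random.expovariate(1 / 0.4) for _ in range(10_000)]

mean = sum(latencies) / len(latencies)
print(f"mean {mean:.2f}s  p50 {percentile(latencies, 0.50):.2f}s  "
      f"p95 {percentile(latencies, 0.95):.2f}s  p99 {percentile(latencies, 0.99):.2f}s")
```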
9.5 Cost per token
Main idea. Serving cost combines hardware time and utilization.
Core relation: cost per token = hardware cost per hour divided by tokens served per hour. An idle accelerator costs the same per hour as a busy one, so utilization matters as much as raw speed.
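A minimal cost-per-token sketch; the dollar rate, throughput, and utilization are placeholder assumptions:

```python
def cost_per_million_tokens(gpu_dollars_per_hour, tokens_per_s, utilization=1.0):
    """Serving cost: hardware dollars per hour divided by tokens actually served per hour."""
    tokens_per_hour = tokens_per_s * utilization * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1e6

# Illustrative: a $2/hour accelerator serving 2500 tokens/s at 60% utilization.
print(f"~${cost_per_million_tokens(2.0, 2500, utilization=0.6):.2f} per million tokens")
```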
10. Debugging Efficient Inference
This part studies debugging efficient inference as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.
| Subtopic | Main idea | Formula |
|---|---|---|
| Shape checks | KV heads, query heads, and head dimension must align | $H_q / H_{kv}$ integer for GQA |
| Cache correctness | cached decode must match full recomputation | $\max \lvert y_\mathrm{cache} - y_\mathrm{full} \rvert$ small |
| Memory accounting | estimate weights plus KV cache plus workspace | $M_\mathrm{total} \approx M_\mathrm{weights} + M_\mathrm{KV} + M_\mathrm{workspace}$ |
| Bottleneck attribution | separate compute, memory, scheduler, and network time | $T_\mathrm{user} \approx T_\mathrm{queue} + T_\mathrm{prefill} + T_\mathrm{decode} + T_\mathrm{network}$ |
| Quality regression | speedups must preserve target quality unless explicitly approximate | $\Delta L, \Delta S$ tracked |
10.1 Shape checks
Main idea. KV heads, query heads, and head dimension must align.
Core relation: $H_q / H_{kv}$ must be an integer for GQA, and every layer must agree on $d_{head}$ so cached keys and values line up with the query heads that read them.
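A minimal shape-check sketch, assuming the common layout in which query heads together span the model width; the configuration values are illustrative:

```python
def check_attention_shapes(n_query_heads, n_kv_heads, d_model, head_dim):
    """Shape sanity checks to run before wiring up a cached GQA decode path."""
    assert n_query_heads % n_kv_heads == 0, "query heads must be a multiple of KV heads"
    # Typical layout: query heads together span the model width.
    assert n_query_heads * head_dim == d_model, "query heads * head_dim != d_model"
    return n_query_heads // n_kv_heads  # queries sharing each KV head

# Illustrative configuration: 32 query heads sharing 8 KV heads in groups of 4.
print("group size:", check_attention_shapes(32, 8, 4096, 128))
```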
10.2 Cache correctness
Main idea. Cached decode must match full recomputation.
Core relation: $\max \lvert y_\mathrm{cache} - y_\mathrm{full} \rvert$ should be small, at or near floating-point rounding error, when the same prefix is processed with and without the KV cache.
AI connection. A fast cached path is worthless if it disagrees with full recomputation.
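A self-contained NumPy sketch of this check on a tiny single-head attention layer, comparing cached decode against full-prefix recomputation (toy random weights, no positional encoding):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    """One query vector attending over all cached keys and values (single head)."""
    scores = (K @ q) / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d, T = 16, 12
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
x = rng.standard_normal((T, d))

# Full recomputation: rebuild the prefix K and V for every position t.
full = np.stack([attend(x[t] @ Wq, x[: t + 1] @ Wk, x[: t + 1] @ Wv) for t in range(T)])

# Cached decode: append one K and V row per step and reuse the rest.
K_cache, V_cache, cached = [], [], []
for t in range(T):
    K_cache.append(x[t] @ Wk)
    V_cache.append(x[t] @ Wv)
    cached.append(attend(x[t] @ Wq, np.stack(K_cache), np.stack(V_cache)))

print("max abs diff:", np.abs(np.stack(cached) - full).max())  # ~0, within rounding
```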
10.3 Memory accounting
Main idea. Estimate weights plus KV cache plus workspace.
Core relation: $M_\mathrm{total} \approx M_\mathrm{weights} + M_\mathrm{KV} + M_\mathrm{workspace}$. If this estimate does not fit in device memory with headroom, the server must shrink the batch, shorten context, or evict cache.
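A minimal memory-accounting sketch; the workspace term is a placeholder that should be measured on the real engine, and the configuration numbers are illustrative:

```python
def serving_memory_gib(n_params, weight_bytes, batch, layers, context,
                       kv_heads, head_dim, cache_bytes, workspace_gib=2.0):
    """Weights + KV cache + a workspace allowance, in GiB.

    The workspace term (activations, graphs, fragmentation) is a placeholder;
    on a real engine it should be measured rather than derived.
    """
    weights = n_params * weight_bytes
    kv = 2 * batch * layers * context * kv_heads * head_dim * cache_bytes
    return (weights + kv) / 2**30 + workspace_gib

# Illustrative: 7e9 bf16 parameters, batch 16, 4096 context, GQA with 8 KV heads.
print(f"~{serving_memory_gib(7e9, 2, 16, 32, 4096, 8, 128, 2):.1f} GiB")
```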
10.4 Bottleneck attribution
Main idea. Separate compute, memory, scheduler, and network time.
Core relation: $T_\mathrm{user} \approx T_\mathrm{queue} + T_\mathrm{prefill} + T_\mathrm{decode} + T_\mathrm{network}$. Attribute measured latency to these components before optimizing, because a faster kernel cannot fix queueing or network time.
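A minimal attribution sketch using a per-phase wall-clock timer; for real GPU work you would synchronize the device around each phase, and the sleeps below are stand-ins for actual serving stages:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

phase_seconds = defaultdict(float)

@contextmanager
def phase(name):
    """Accumulate wall-clock time per serving phase for coarse attribution."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[name] += time.perf_counter() - start

# Stand-in sleeps where queueing, prefill, and decode would actually run.
with phase("queue"):
    time.sleep(0.05)
with phase("prefill"):
    time.sleep(0.20)
with phase("decode"):
    time.sleep(0.40)

total = sum(phase_seconds.values())
for name, secs in phase_seconds.items():
    print(f"{name:8s} {secs * 1e3:6.1f} ms  ({secs / total:5.1%})")
```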
10.5 Quality regression
Main idea. Speedups must preserve target quality unless explicitly approximate.
Core relation: $\Delta L$ (loss) and $\Delta S$ (task score) are tracked between the baseline and the accelerated path, so that exact optimizations show no change and approximate ones stay within an explicit quality budget.
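A minimal regression-check sketch; the metric names and tolerances are placeholder assumptions to be replaced by your own evaluation suite:

```python
def quality_regression(baseline, optimized, loss_tol=0.01, score_tol=0.5):
    """Compare eval metrics between the baseline and an accelerated serving path.

    `baseline` and `optimized` are dicts like {"loss": ..., "score": ...};
    tolerances are placeholders for whatever the product actually requires.
    """
    delta_loss = optimized["loss"] - baseline["loss"]
    delta_score = optimized["score"] - baseline["score"]
    within_budget = (delta_loss <= loss_tol) and (delta_score >= -score_tol)
    return within_budget, {"delta_loss": delta_loss, "delta_score": delta_score}

ok, deltas = quality_regression({"loss": 2.031, "score": 71.2},
                                {"loss": 2.034, "score": 70.9})
print("within budget:", ok, deltas)
```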
Practice Exercises
- Compute KV cache memory for a model configuration.
- Compare MHA, GQA, and MQA cache sizes.
- Estimate prefill versus decode attention operations.
- Compute naive attention-score memory and FlashAttention savings intuition.
- Estimate roofline bandwidth-limited runtime.
- Compute page/block waste for variable request lengths.
- Estimate continuous batching utilization.
- Compute speculative decoding expected target passes.
- Build a latency budget from queue, prefill, decode, and sampling.
- Write a cached-decode correctness checklist.
Why This Matters for AI
Training makes a model capable. Inference makes it usable. A model that is excellent but too slow, too expensive, or too memory-hungry cannot serve real users well. Efficient inference math helps you decide whether to change attention kernels, KV head count, cache layout, quantization, batching, or decoding strategy.
Bridge to Mixture of Experts and Routing
Mixture-of-experts models change the inference problem by activating only some parameters per token while keeping many parameters in memory. The next section studies routing, expert capacity, load balancing, and the difference between total parameters and active parameters.
References
- Ashish Vaswani et al., "Attention Is All You Need", 2017: https://arxiv.org/abs/1706.03762
- Noam Shazeer, "Fast Transformer Decoding: One Write-Head is All You Need", 2019: https://arxiv.org/abs/1911.02150
- Tri Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness", 2022: https://arxiv.org/abs/2205.14135
- Joshua Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints", 2023: https://arxiv.org/abs/2305.13245
- Woosuk Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention", 2023: https://arxiv.org/abs/2309.06180
- Yaniv Leviathan, Matan Kalman, and Yossi Matias, "Fast Inference from Transformers via Speculative Decoding", 2023: https://proceedings.mlr.press/v202/leviathan23a.html
- Tianle Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", 2024: https://arxiv.org/abs/2401.10774