Efficient Attention and Inference

Efficient inference is the mathematics of turning a trained LLM into a responsive service. The main objects are attention cost, KV cache memory, prefill and decode latency, batching, scheduling, and probability-preserving acceleration.

Overview

Autoregressive inference has two phases:

prompt tokens -> prefill -> KV cache -> decode one token -> append -> decode one token -> ...

Prefill processes the prompt in parallel and is often compute-heavy. Decode emits one token at a time and is often memory-bandwidth-heavy because each step reads weights and KV cache. Efficient serving is therefore not just "make attention faster." It is a coordinated memory, kernel, cache, batching, and scheduling problem.

The central cache formula is:

M_\mathrm{KV}=2BLTH_{kv}d_hb.

The factor 2 is for keys and values. B is batch size, L is the number of layers, T is the cached context length, H_{kv} is the number of key-value heads, d_h is the head dimension, and b is bytes per cache element.
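
A minimal sketch in plain Python turns the formula into a quick calculator (the function name and defaults are illustrative, not from any framework). It reproduces the worked micro-example used throughout these notes.

```python
# Minimal sketch of the cache formula; names and defaults are illustrative.
def kv_cache_bytes(batch: int, layers: int, context: int,
                   kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """M_KV = 2 * B * L * T * H_kv * d_h * b; the factor 2 covers keys and values."""
    return 2 * batch * layers * context * kv_heads * head_dim * bytes_per_elem

# The worked micro-example from Part 1: 32 layers, batch 8, 4096 tokens of
# context, 8 KV heads, head dimension 128, bf16 cache (2 bytes per element).
print(kv_cache_bytes(8, 32, 4096, 8, 128) / 2**30, "GiB")  # 4.0 GiB
```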

Prerequisites

  • Attention mechanism math and causal masking
  • Positional encodings and autoregressive decoding
  • Scaling-law and training-at-scale cost vocabulary
  • Basic memory units and tensor shapes

Companion Notebooks

| Notebook | Purpose |
| --- | --- |
| theory.ipynb | Computes prefill/decode costs, KV cache memory, MHA/MQA/GQA savings, FlashAttention memory intuition, paged-cache waste, speculative speedup, and latency budgets. |
| exercises.ipynb | Ten practice problems for KV cache sizing, GQA savings, pipeline latency, page waste, speculative decoding, and serving diagnostics. |

Learning Objectives

After this section, you should be able to:

  • Distinguish prefill latency, decode latency, TTFT, TPOT, throughput, and tail latency.
  • Estimate dense attention and decode attention cost.
  • Compute KV cache memory for MHA, MQA, and GQA.
  • Explain why FlashAttention can be IO-aware yet still exact.
  • Explain why PagedAttention-style cache management improves serving memory utilization.
  • Reason about continuous batching, chunked prefill, and head-of-line blocking.
  • Estimate speculative decoding speedup from acceptance rate and draft latency.
  • Build an inference debugging checklist that checks both speed and correctness.

Table of Contents

  1. Inference Phases
  2. Attention Cost
  3. KV Cache Math
  4. FlashAttention and IO Awareness
  5. Serving Memory Management
  6. Decode Acceleration
  7. Batching and Scheduling
  8. Quantization and Bandwidth
  9. Latency Metrics
  10. Debugging Efficient Inference

Mental Model

quality target
   |
model weights -- kernels -- memory bandwidth -- KV cache -- scheduler -- user latency
                   |              |             |             |
              FlashAttention   quantization   paging      batching

Every speedup lives somewhere in this chain. The math tells you which bottleneck it can actually improve.

1. Inference Phases

This part studies inference phases as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Prefill | process the prompt in parallel and create the KV cache | T_\mathrm{prefill}\propto T_\mathrm{prompt}^2 for dense attention |
| Decode | generate one token at a time using cached keys and values | T_\mathrm{decode}\propto T_\mathrm{context} per output token |
| TTFT | time to first token includes scheduling plus prefill | \mathrm{TTFT}=T_\mathrm{queue}+T_\mathrm{prefill}+T_\mathrm{sample} |
| TPOT | time per output token measures decode speed | \mathrm{TPOT}=T_\mathrm{decode\ step} |
| Throughput | serving systems balance tokens per second against latency | \mathrm{tokens/sec}=\mathrm{completed\ tokens}/T |

1.1 Prefill

Main idea. Process the prompt in parallel and create the KV cache.

Core relation:

T_\mathrm{prefill}\propto T_\mathrm{prompt}^2 for dense attention

Inference math is different from training math because the bottleneck shifts. Training usually wants maximum throughput over large batches. Interactive inference must manage first-token latency, per-token latency, KV cache memory, and request scheduling. The same transformer can feel fast or slow depending on prompt length, output length, batch shape, cache layout, and kernel support.

Worked micro-example. For a model with L=32 layers, batch B=8, context T=4096, H_{kv}=8 KV heads, head dimension d_h=128, and bf16 cache values with b=2 bytes, the KV cache is 2BLTH_{kv}d_hb bytes, about 4 GiB. Reducing H_{kv} from 32 to 8 cuts this cache by 4x.

Implementation check. Compare cached decode against full-prefix recomputation on a tiny example. Then measure memory and latency separately for prefill and decode; an aggregate tokens/sec number hides too much.

AI connection. Prefill cost sets the floor on time to first token, so prompt handling is a practical lever for LLM inference latency.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

1.2 Decode

Main idea. Generate one token at a time using cached keys and values.

Core relation:

T_\mathrm{decode}\propto T_\mathrm{context} per output token

1.3 TTFT

Main idea. Time to first token includes scheduling plus prefill.

Core relation:

\mathrm{TTFT}=T_\mathrm{queue}+T_\mathrm{prefill}+T_\mathrm{sample}

1.4 TPOT

Main idea. Time per output token measures decode speed.

Core relation:

\mathrm{TPOT}=T_\mathrm{decode\ step}

1.5 Throughput

Main idea. Serving systems balance tokens per second against latency.

Core relation:

\mathrm{tokens/sec}=\mathrm{completed\ tokens}/T

2. Attention Cost

This part studies attention cost as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Training attention | dense attention over a full sequence forms a T-by-T score matrix | O(T^2d) |
| Decode attention | one query attends to all cached keys | O(Td) per generated token |
| Causal mask | future tokens are invisible but the triangular work is still large in prefill | j\le i |
| Memory traffic | attention can be limited by reads and writes, not only FLOPs | T\approx\max(T_\mathrm{compute},T_\mathrm{memory}) |
| Arithmetic intensity | roofline reasoning compares FLOPs to bytes moved | I=\mathrm{FLOPs}/\mathrm{bytes} |

2.1 Training attention

Main idea. Dense attention over a full sequence forms a T-by-T score matrix.

Core relation:

O(T^2d)

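A rough FLOP count makes the contrast with decode concrete. The sketch below (illustrative Python) counts only the QK^T and AV matmuls at two multiply-adds each, ignoring projections and softmax; it shows the O(T^2d) prefill term against the O(Td) per-token decode term covered next.

```python
# Rough attention FLOPs for one layer, counting only the QK^T and AV matmuls
# (2 multiply-adds each); projections and softmax are ignored.
def prefill_attention_flops(batch: int, heads: int, seq: int, head_dim: int) -> float:
    return 4.0 * batch * heads * seq * seq * head_dim       # O(T^2 d)

def decode_attention_flops(batch: int, heads: int, context: int, head_dim: int) -> float:
    return 4.0 * batch * heads * context * head_dim         # O(T d) per token

print(f"prefill: {prefill_attention_flops(1, 32, 4096, 128) / 1e9:.0f} GFLOPs/layer")
print(f"decode:  {decode_attention_flops(1, 32, 4096, 128) / 1e6:.0f} MFLOPs/layer/token")
```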

2.2 Decode attention

Main idea. One query attends to all cached keys.

Core relation:

O(Td) per generated token

2.3 Causal mask

Main idea. Future tokens are invisible but the triangular work is still large in prefill.

Core relation:

j\le i

2.4 Memory traffic

Main idea. Attention can be limited by reads and writes, not only FLOPs.

Core relation:

T\approx\max(T_\mathrm{compute},T_\mathrm{memory})

2.5 Arithmetic intensity

Main idea. Roofline reasoning compares FLOPs to bytes moved.

Core relation:

I=\mathrm{FLOPs}/\mathrm{bytes}

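As a minimal sketch under assumed hardware numbers, the code below compares the arithmetic intensity of one decode attention step against a machine balance point. With bf16 KV entries the intensity works out to about 1 FLOP per byte, far below typical balance points, which is why decode attention is bandwidth-bound.

```python
# Roofline sketch with assumed hardware numbers: ~100 TFLOP/s of compute and
# ~2 TB/s of memory bandwidth give a balance point near 50 FLOPs per byte.
peak_flops = 100e12   # FLOP/s, assumed
peak_bw = 2e12        # bytes/s, assumed
balance = peak_flops / peak_bw

# One decode step of attention: read K and V once, do ~4*T*H*d_h FLOPs.
T, H, d_h, b = 4096, 8, 128, 2
kv_bytes = 2 * T * H * d_h * b      # keys + values for one layer, bf16
flops = 4 * T * H * d_h
print(f"balance: {balance:.0f} FLOPs/byte, decode intensity: {flops / kv_bytes:.1f}")
# intensity ~1 << balance ~50: decode attention is memory-bandwidth-bound.
```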

3. KV Cache Math

This part studies KV cache math as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Cache contents | store keys and values for every layer, token, and KV head | K,V\in\mathbb{R}^{L\times T\times H_{kv}\times d_h} |
| Cache memory | KV memory grows linearly with context and batch | M_\mathrm{KV}=2BLTH_{kv}d_hb |
| MHA | standard multi-head attention has one KV head per query head | H_{kv}=H_q |
| MQA | multi-query attention shares one KV head across query heads | H_{kv}=1 |
| GQA | grouped-query attention interpolates between MHA and MQA | 1\le H_{kv}\le H_q |

3.1 Cache contents

Main idea. Store keys and values for every layer, token, and KV head.

Core relation:

K,V\in\mathbb{R}^{L\times T\times H_{kv}\times d_h}

3.2 Cache memory

Main idea. KV memory grows linearly with context and batch.

Core relation:

M_\mathrm{KV}=2BLTH_{kv}d_hb

AI connection. This one formula often decides how many requests fit on a serving GPU.

3.3 MHA

Main idea. Standard multi-head attention has one KV head per query head.

Core relation:

H_{kv}=H_q

3.4 MQA

Main idea. Multi-query attention shares one KV head across query heads.

Core relation:

H_{kv}=1

3.5 GQA

Main idea. Grouped-query attention interpolates between MHA and MQA.

Core relation:

1\le H_{kv}\le H_q

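The savings are easiest to see side by side. A minimal sketch, reusing the shapes from the worked micro-example (all values illustrative):

```python
# KV cache under MHA, GQA, and MQA differs only through H_kv.
# Shapes follow the worked micro-example; bf16 cache, b=2 bytes per element.
def kv_bytes(h_kv: int, B=8, L=32, T=4096, d_h=128, b=2) -> int:
    return 2 * B * L * T * h_kv * d_h * b

base = kv_bytes(32)  # MHA baseline with H_q = 32 query heads
for name, h_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    m = kv_bytes(h_kv)
    print(f"{name}: H_kv={h_kv:>2} -> {m / 2**30:5.2f} GiB ({base // m}x below MHA)")
```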

4. FlashAttention and IO Awareness

This part studies FlashAttention and IO awareness as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Naive attention memory | materializing attention scores costs T^2 memory | M_\mathrm{scores}=BT^2H |
| Tiling | FlashAttention computes blocks while keeping partial statistics | \mathrm{softmax}(QK^\top)V without storing all scores |
| Online softmax | blockwise max and denominator keep softmax exact | m=\max_i s_i,\quad l=\sum_i e^{s_i-m} |
| Exactness | FlashAttention changes memory access, not the attention definition | \mathrm{output}_\mathrm{flash}=\mathrm{output}_\mathrm{dense} |
| Long context | IO savings matter more as T grows | T^2 score storage becomes the wall |

4.1 Naive attention memory

Main idea. Materializing attention scores costs T^2 memory.

Core relation:

M_\mathrm{scores}=BT^2H

4.2 Tiling

Main idea. FlashAttention computes blocks while keeping partial statistics.

Core relation:

\mathrm{softmax}(QK^\top)V without storing all scores

AI connection. FlashAttention is fast because it respects the memory hierarchy, not because it approximates attention.

4.3 Online softmax

Main idea. Blockwise max and denominator keep softmax exact.

Core relation:

m=\max_i s_i,\quad l=\sum_i e^{s_i-m}

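The recurrence is easy to verify numerically. Below is a minimal numpy sketch of streaming attention for a single query row: it keeps only the running max m, denominator l, and a rescaled output accumulator, and matches the dense computation to floating-point tolerance. Real kernels tile this over blocks of queries and heads.

```python
import numpy as np

def online_attention(scores: np.ndarray, V: np.ndarray, block: int = 4) -> np.ndarray:
    """Streaming softmax(scores) @ V for one query row, never storing all scores."""
    m, l = -np.inf, 0.0
    o = np.zeros(V.shape[1])
    for start in range(0, len(scores), block):
        s, v = scores[start:start + block], V[start:start + block]
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)        # rescale earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ v
        m = m_new
    return o / l

rng = np.random.default_rng(0)
scores, V = rng.normal(size=16), rng.normal(size=(16, 8))
weights = np.exp(scores - scores.max())
assert np.allclose(online_attention(scores, V), (weights / weights.sum()) @ V)
```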

4.4 Exactness

Main idea. FlashAttention changes memory access, not the attention definition.

Core relation:

\mathrm{output}_\mathrm{flash}=\mathrm{output}_\mathrm{dense}

4.5 Long context

Main idea. IO savings matter more as T grows.

Core relation:

T^2 score storage becomes the wall

5. Serving Memory Management

This part studies serving memory management as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Request lengths vary | static allocation wastes KV memory for short or finished requests | \mathrm{waste}=M_\mathrm{reserved}-M_\mathrm{used} |
| Paged KV cache | allocate KV cache in blocks and map logical tokens to physical blocks | \mathrm{token}\rightarrow(\mathrm{block\ id},\mathrm{offset}) |
| Continuous batching | add and remove requests between decode steps | B_t changes over time |
| Prefix sharing | shared prompts can reuse KV cache | KV(x,y)=KV(x)+KV(y\mid x) |
| Eviction and recompute | memory pressure can force swapping or rebuilding cache | T_\mathrm{recompute} trades against M_\mathrm{free} |

5.1 Request lengths vary

Main idea. Static allocation wastes KV memory for short or finished requests.

Core relation:

\mathrm{waste}=M_\mathrm{reserved}-M_\mathrm{used}

5.2 Paged KV cache

Main idea. Allocate the KV cache in blocks and map logical tokens to physical blocks.

Core relation:

\mathrm{token}\rightarrow(\mathrm{block\ id},\mathrm{offset})

AI connection. Paged allocation makes serving look more like an operating-system memory problem.

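A toy comparison of static versus paged reservation (block size and request lengths are illustrative):

```python
# Static allocation reserves max_len for every request; paging wastes at most
# one partially filled block per request. All numbers are assumed.
import math

block = 16        # tokens per KV block, assumed
max_len = 4096    # static per-request reservation, assumed
lengths = [93, 511, 1500, 4096, 37]   # live tokens per request, illustrative

static_slots = len(lengths) * max_len
paged_slots = sum(math.ceil(t / block) * block for t in lengths)
print(f"live tokens: {sum(lengths)}")
print(f"static reservation: {static_slots} token slots")
print(f"paged reservation:  {paged_slots} token slots "
      f"(waste < {len(lengths) * block} slots, one partial block per request)")
```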

5.3 Continuous batching

Main idea. Add and remove requests between decode steps.

Core relation:

B_t changes over time

5.4 Prefix sharing

Main idea. Shared prompts can reuse the KV cache.

Core relation:

KV(x,y)=KV(x)+KV(y\mid x)

5.5 Eviction and recompute

Main idea. Memory pressure can force swapping or rebuilding cache.

Core relation:

T_\mathrm{recompute} trades against M_\mathrm{free}

6. Decode Acceleration

This part studies decode acceleration as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Greedy serial bottleneck | autoregressive decoding normally emits one token per target-model step | K tokens require K target passes |
| Speculative decoding | a draft model proposes tokens and the target verifies them | \mathrm{speedup}\approx\mathrm{accepted\ tokens}/\mathrm{target\ pass} |
| Acceptance rate | speedup depends on draft quality and draft latency | a=P(\mathrm{accept}) |
| Parallel heads | methods such as Medusa predict multiple future tokens with extra heads | y_{t+1:t+k} candidates |
| Distribution preservation | exact speculative schemes preserve the target distribution when acceptance rules are correct | p_\mathrm{output}=p_\mathrm{target} |

6.1 Greedy serial bottleneck

Main idea. Autoregressive decoding normally emits one token per target-model step.

Core relation:

K tokens require K target passes

6.2 Speculative decoding

Main idea. A draft model proposes tokens and the target verifies them.

Core relation:

\mathrm{speedup}\approx \mathrm{accepted\ tokens}/\mathrm{target\ pass}

AI connection. The target model can verify several proposed tokens in one pass, reducing serial decode steps.

6.3 Acceptance rate

Main idea. Speedup depends on draft quality and draft latency.

Core relation:

a=P(\mathrm{accept})

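Under a simplified model where each draft token is accepted independently with probability a and the target verifies k drafts per pass, the expected tokens emitted per pass form a truncated geometric series, E = (1 - a^{k+1})/(1 - a). The sketch below tabulates it; net speedup must still subtract the draft model's own latency.

```python
# Expected tokens per target pass with k draft tokens, each accepted
# independently with probability a (an i.i.d. simplification):
# E = (1 - a**(k + 1)) / (1 - a), a truncated geometric series.
def expected_tokens_per_pass(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.8, 0.9):
    row = ", ".join(f"k={k}: {expected_tokens_per_pass(a, k):.2f}" for k in (2, 4, 8))
    print(f"a={a}: {row}")
```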

6.4 Parallel heads

Main idea. Methods such as Medusa predict multiple future tokens with extra heads.

Core relation:

y_{t+1:t+k} candidates

6.5 Distribution preservation

Main idea. Exact speculative schemes preserve the target distribution when acceptance rules are correct.

Core relation:

p_\mathrm{output}=p_\mathrm{target}

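The standard acceptance rule can be checked empirically on a toy vocabulary: accept a draft sample x ~ q with probability min(1, p(x)/q(x)); on rejection, resample from the normalized residual max(p - q, 0). The empirical output distribution converges to p. A minimal numpy sketch (both distributions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # target next-token distribution, illustrative
q = np.array([0.2, 0.3, 0.5])   # draft distribution, illustrative

residual = np.maximum(p - q, 0)
residual /= residual.sum()       # resampling distribution on rejection

counts = np.zeros(3)
n = 100_000
for _ in range(n):
    x = rng.choice(3, p=q)                     # draft proposes
    if rng.random() < min(1.0, p[x] / q[x]):   # target accepts or rejects
        counts[x] += 1
    else:
        counts[rng.choice(3, p=residual)] += 1

print(counts / n)   # converges to p = [0.5, 0.3, 0.2]
```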

7. Batching and Scheduling

This part studies batching and scheduling as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Static batching | wait for a batch, run together, then finish together | B fixed |
| Dynamic batching | merge compatible requests at runtime | B_t variable |
| Head-of-line blocking | one long request can delay short ones | T_\mathrm{latency} depends on schedule |
| Chunked prefill | split long prompts so decode traffic is not starved | T_\mathrm{prompt} split into chunks |
| SLA tradeoff | higher throughput can increase tail latency | p95 latency versus tokens/sec |

7.1 Static batching

Main idea. Wait for a batch, run together, then finish together.

Core relation:

B fixed

7.2 Dynamic batching

Main idea. Merge compatible requests at runtime.

Core relation:

B_t variable

7.3 Head-of-line blocking

Main idea. One long request can delay short ones.

Core relation:

T_\mathrm{latency} depends on schedule

7.4 Chunked prefill

Main idea. Split long prompts so decode traffic is not starved.

Core relation:

T_\mathrm{prompt} split into chunks

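A toy latency model shows why chunking helps: decode steps can be interleaved between prefill chunks, so the worst-case stall a decoding request sees drops from the whole prefill to a single chunk. All timings below are assumed, not measured.

```python
# Toy stall model: a decode step can run between prefill chunks, so a
# decoding request waits at most one chunk of prefill work. Timings assumed.
prefill_tokens = 4096
ms_per_prefill_token = 0.1   # assumed
decode_step_ms = 20.0        # assumed, for scale

for chunk in (4096, 1024, 512, 128):
    stall = min(chunk, prefill_tokens) * ms_per_prefill_token
    print(f"chunk={chunk:>4}: worst decode stall ~{stall:6.1f} ms "
          f"(one decode step is {decode_step_ms} ms)")
```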

7.5 SLA tradeoff

Main idea. Higher throughput can increase tail latency.

Core relation:

p95 latency versus tokens/sec

8. Quantization and Bandwidth

This part studies quantization and bandwidth as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Weight bandwidth | decode often reloads weights for each generated token | T_\mathrm{weights}\approx M_\mathrm{weights}/\mathrm{bandwidth} |
| KV quantization | compressing the KV cache can increase batch or context capacity | M_\mathrm{KV}\downarrow |
| Accuracy tradeoff | lower precision can change probabilities | D_\mathrm{KL}(p_\mathrm{fp},p_\mathrm{quant}) |
| Kernel support | a quantized format helps only if kernels are fast | T_\mathrm{kernel} matters |
| Mixed systems | weights, activations, and KV cache may use different precision | b_\mathrm{weight}\ne b_\mathrm{KV} |

8.1 Weight bandwidth

Main idea. Decode often reloads weights for each generated token.

Core relation:

T_\mathrm{weights}\approx M_\mathrm{weights}/\mathrm{bandwidth}

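This gives a simple lower bound on TPOT whenever decode is bandwidth-bound: every token must stream the weights (plus the batch's KV cache) from memory at least once. A sketch with illustrative model and hardware numbers:

```python
# Bandwidth floor on decode: each token streams the weights and the batch's
# KV cache from memory at least once. Model and hardware numbers assumed.
weights_gb = 14.0       # e.g. a 7B-parameter model in bf16
kv_gb = 4.0             # KV cache for the active batch (Part 1 example)
bandwidth_gbs = 2000.0  # ~2 TB/s accelerator memory bandwidth, assumed

tpot_floor_ms = (weights_gb + kv_gb) / bandwidth_gbs * 1e3
print(f"TPOT >= {tpot_floor_ms:.1f} ms, i.e. <= {1e3 / tpot_floor_ms:.0f} steps/s")
```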

8.2 KV quantization

Main idea. Compressing the KV cache can increase batch or context capacity.

Core relation:

M_\mathrm{KV}\downarrow

8.3 Accuracy tradeoff

Main idea. Lower precision can change probabilities.

Core relation:

D_\mathrm{KL}(p_\mathrm{fp},p_\mathrm{quant})

Quantization perturbs logits, so the next-token distribution shifts. The KL divergence between full-precision and quantized distributions on held-out prompts measures this shift directly and is far more sensitive than task accuracy on a small eval set.

Worked micro-example. A cheap proxy is greedy agreement: run both models over the same prefixes and count how often the argmax token matches. Agreement near 100% with small per-token KL is strong evidence the quantization is behaviorally benign.

Implementation check. Track the distribution of per-token KL values, not just the mean; rare large divergences are the usual source of visible regressions.

AI connection. This is the quantity that separates quantization that is lossless in practice from quantization that silently changes model behavior.

Common mistake. Do not validate only on short prompts; measure divergence at the context lengths you actually serve.
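
A sketch of the divergence measurement, assuming you can obtain logits from both model variants at the same positions; the `noisy` logits here just stand in for a quantized model's output.

```python
import numpy as np

# Sketch: per-token KL(p_fp || p_quant) from two logit vectors,
# computed via log-softmax for numerical stability.
def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_fp_vs_quant(logits_fp, logits_q):
    log_p = log_softmax(np.asarray(logits_fp, dtype=np.float64))
    log_q = log_softmax(np.asarray(logits_q, dtype=np.float64))
    p = np.exp(log_p)
    return (p * (log_p - log_q)).sum(axis=-1)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 32000))                         # 5 positions, toy vocab
noisy = logits + rng.normal(scale=0.05, size=logits.shape)   # stand-in "quantized" logits
print(kl_fp_vs_quant(logits, noisy))                         # small per-token KLs
```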

8.4 Kernel support

Main idea. A quantized format helps only if kernels are fast.

Core relation:

$T_\mathrm{kernel}$ matters

A format only helps if there is a fast fused kernel for it at your shapes. If a quantized matmul dequantizes through a slow generic path, end-to-end step time can be worse than the higher-precision baseline even though memory traffic is lower.

Worked micro-example. 4-bit weights read a quarter of the bytes of bf16, so the bandwidth argument predicts up to a 4x speedup on weight-bound decode; an unfused kernel that first materializes bf16 weights forfeits most of that gain.

Implementation check. Benchmark the actual kernels at your serving batch sizes and sequence lengths; microbenchmarks at unrealistic shapes routinely mispredict production step time.

AI connection. Kernel availability, not theoretical compression ratio, usually decides which quantization formats are deployable.

Common mistake. Do not pick a format from a speedup table measured on different hardware; the same format can be fast on one GPU generation and slow on another.

8.5 Mixed systems

Main idea. Weights, activations, and KV cache may use different precision.

Core relation:

b_\mathrm{weight}\ne b_\mathrm{KV}

Production stacks routinely mix precisions, for example 4- or 8-bit weights, bf16 activations, and an 8-bit KV cache. Each choice targets a different bottleneck, so each needs its own memory and accuracy accounting with its own byte width $b$.

Worked micro-example. A 7B model with int8 weights (about 7 GB) and the cache configuration from 8.2 at $b=1$ (2 GiB) fits comfortably where the all-bf16 version (14 GB of weights plus a 4 GiB cache) would crowd out batch headroom on a 24 GB device.

Implementation check. Account for weights, KV cache, and workspace separately with their actual byte widths; a single model-size number hides which component exhausts memory first.

AI connection. Mixed precision is how serving stacks attack the bandwidth floor (weights) and the capacity ceiling (cache) at the same time.

Common mistake. Do not assume one global precision flag describes the system; verify what precision each tensor class actually uses at runtime.

9. Latency Metrics

This part studies latency metrics as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

| Subtopic | Question | Formula |
|---|---|---|
| Queue time | user latency includes time before GPU work starts | $T_\mathrm{queue}$ |
| First-token latency | long prompts mainly hurt TTFT | $\mathrm{TTFT}$ |
| Inter-token latency | long outputs mainly accumulate TPOT | $T_\mathrm{output}\approx n_\mathrm{out}\cdot\mathrm{TPOT}$ |
| Tail latency | p95 and p99 matter more than average for products | $Q_{0.95}(T)$ |
| Cost per token | serving cost combines hardware time and utilization | $\mathrm{cost/token}=T_\mathrm{gpu}\cdot\mathrm{price}/\mathrm{tokens}$ |

9.1 Queue time

Main idea. User latency includes time before GPU work starts.

Core relation:

T_\mathrm{queue}

A request can wait for a batch slot, for KV cache memory, or for a prefill gap before any GPU work starts, and the user perceives that wait as part of TTFT. Queue time grows sharply as the system approaches saturation, so it is the first casualty of running at high utilization.

Worked micro-example. If prefill plus sampling takes 300 ms but the request waited 700 ms for cache memory to free up, TTFT is 1 s, and no kernel optimization will touch the dominant term.

Implementation check. Timestamp arrival, schedule, first token, and last token per request; attribute TTFT as $T_\mathrm{queue}+T_\mathrm{prefill}+T_\mathrm{sample}$ before optimizing anything.

AI connection. Admission control and cache sizing show up to users as queue time, not as a GPU metric.

Common mistake. Do not benchmark with a closed, steady load and conclude queue time is negligible; bursty production arrivals are what create queues.

9.2 First-token latency

Main idea. Long prompts mainly hurt TTFT.

Core relation:

\mathrm{TTFT}

Prefill cost grows with prompt length, quadratically for dense attention, and all of it lands before the first output token. Long prompts therefore dominate TTFT while barely affecting per-token decode speed afterward.

Worked micro-example. If prefill sustains 10k prompt tokens per second, an 8k-token prompt contributes about 0.8 s to TTFT on its own; chunked prefill can spread that work across the schedule but cannot remove it.

Implementation check. Plot TTFT against prompt length across real traffic; a roughly increasing curve confirms prefill-bound TTFT, while a flat-but-high one points at queueing.

AI connection. Prompt caching and prefix sharing exist largely to take prefill work out of TTFT.

Common mistake. Do not average TTFT over a benchmark of short prompts when production prompts are long; the distribution of prompt lengths is the load.

9.3 Inter-token latency

Main idea. Long outputs mainly accumulate TPOT.

Core relation:

T_\mathrm{output}\approx n_\mathrm{out}\cdot\mathrm{TPOT}

Once the first token is out, total response time accumulates one TPOT per generated token. Long outputs are therefore a decode problem: a 20 ms TPOT feels instant for 30 tokens and sluggish for 2000.

Worked micro-example. At TPOT $=20$ ms, a 500-token answer takes about 10 s after the first token; cutting TPOT to 12 ms saves 4 s, far more than shaving 100 ms off prefill.

Implementation check. Report TPOT as a per-step distribution over the decode; a TPOT that drifts upward as context grows indicates attention-over-cache cost, not a constant per-token cost.

AI connection. Speculative decoding and KV-cache bandwidth work both attack TPOT directly.

Common mistake. Do not quote a single tokens/sec figure; TTFT and TPOT move independently, and users feel them differently.

9.4 Tail latency

Main idea. Tail percentiles (p95, p99) matter more than the average for products.

Core relation:

Q_{0.95}(T)

Users experience individual requests, not averages, and a product that is fast on average but slow at p99 feels unreliable. Tail latency is usually created by a minority of requests with long prompts, long outputs, or bad scheduling luck, so it responds to different fixes than the mean does.

Worked micro-example. If 95% of requests finish in 2 s and 5% take 20 s, mean latency is 2.9 s, but p95 already touches the 20 s cliff; optimizing the common path does nothing for the users who complain.

Implementation check. Track p50, p95, and p99 for TTFT and TPOT separately, segmented by prompt-length bucket.

AI connection. Head-of-line blocking from long prefills is a classic tail-latency mechanism in LLM serving.

Common mistake. Do not tune against the mean; a scheduler change can improve average latency while making p99 worse.

9.5 Cost per token

Main idea. Serving cost combines hardware time and utilization.

Core relation:

\mathrm{cost/token}=T_\mathrm{gpu}\cdot\mathrm{price}/\mathrm{tokens}

Cost per token is GPU-seconds consumed times the GPU price, divided by tokens delivered, so it rewards exactly what throughput benchmarks measure: high utilization and large batches. The tension is that the same levers often degrade TTFT and tail latency.

Worked micro-example. A GPU billed at 2 USD/hour and delivering 1000 tokens/s costs about 0.56 USD per million tokens; the same GPU half idle costs twice that per token.

Implementation check. Compute cost from measured GPU-time per request, including idle and queued gaps, not from peak benchmark throughput.

AI connection. Pricing, batching policy, and latency SLOs are one coupled optimization, and this formula is the coupling.

Common mistake. Do not compare cost per token across systems at different latency targets; cheaper tokens at triple the TTFT are a different product.

10. Debugging Efficient Inference

This part studies debugging efficient inference as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

| Subtopic | Question | Formula |
|---|---|---|
| Shape checks | KV heads, query heads, and head dimension must align | $H_q/H_{kv}$ integer for GQA |
| Cache correctness | cached decode must match full recomputation | $\max\lvert y_\mathrm{cache}-y_\mathrm{full}\rvert$ small |
| Memory accounting | estimate weights plus KV cache plus workspace | $M=M_\mathrm{weights}+M_\mathrm{KV}+M_\mathrm{workspace}$ |
| Bottleneck attribution | separate compute, memory, scheduler, and network time | $T=\sum_j T_j$ |
| Quality regression | speedups must preserve target quality unless explicitly approximate | $\Delta L,\Delta S$ tracked |

10.1 Shape checks

Main idea. KV heads, query heads, and head dimension must align.

Core relation:

$H_q/H_{kv}$ integer for GQA

GQA shares each KV head across a group of query heads, so $H_q$ must be an integer multiple of $H_{kv}$, and both must agree with the cache layout $[B,H_{kv},T,d_h]$. Many silent correctness bugs in custom attention paths are head-mapping bugs, and a handful of assertions catches them before any numerics are inspected.

Worked micro-example. With $H_q=32$ and $H_{kv}=8$, each KV head serves a group of 4 query heads; MQA is the $H_{kv}=1$ extreme and MHA the $H_{kv}=H_q$ extreme.

Implementation check. Assert $H_q \bmod H_{kv}=0$ and that query head $i$ reads KV head $\lfloor i/(H_q/H_{kv})\rfloor$; an off-by-one in the group mapping still produces plausible-looking outputs.

AI connection. The same shape discipline is what makes the KV cache formula trustworthy for capacity planning.

Common mistake. Do not rely on broadcasting to "fix" mismatched head counts; it can silently expand the wrong axis.

10.2 Cache correctness

Main idea. Cached decode must match full recomputation.

Core relation:

$\max\lvert y_\mathrm{cache}-y_\mathrm{full}\rvert$ small

The cached decode path must produce the same outputs as recomputing attention over the full prefix; caching is an exact optimization, not an approximation. Any drift beyond numerical noise means the cache is storing, indexing, or masking something incorrectly.

Worked micro-example. Run a 10-token prompt through (a) one full forward pass and (b) prefill plus token-by-token cached decode, and compare logits at every position; with matching dtypes the difference should sit at accumulation-order noise, roughly 1e-3 or below in fp16-class precision.

Implementation check. When the comparison fails, bisect by layer and by step: errors that shift with position implicate positional-encoding indexing, while a first-divergent-step error implicates the cache append logic.

AI connection. A fast cached path is worthless if it disagrees with full recomputation.

Common mistake. Do not test correctness only at temperature 0 on one prompt; masking bugs can hide on short prompts and appear only past a boundary length.

10.3 Memory accounting

Main idea. Estimate weights plus KV cache plus workspace.

Core relation:

M=M_\mathrm{weights}+M_\mathrm{KV}+M_\mathrm{workspace}

Device memory at serving time is roughly weights plus KV cache plus kernel workspace and fragmentation. Accounting for the terms separately tells you whether to quantize weights, shrink or page the cache, or cap the batch when you hit the limit.

Worked micro-example. A 7B bf16 model (about 13 GiB of weights) with the 4 GiB cache configuration from 8.2 and 1-2 GiB of workspace already sits near 18-19 GiB; on a 24 GB device, the remaining headroom, not compute, caps the batch.

Implementation check. Compare the $M=M_\mathrm{weights}+M_\mathrm{KV}+M_\mathrm{workspace}$ estimate against the allocator's reported peak; a persistent gap usually means fragmentation or an unaccounted buffer.

AI connection. This decomposition turns out-of-memory errors from surprises into predictions.

Common mistake. Do not size the system from weights alone; at long contexts and high batch, the KV cache is frequently the larger term.

10.4 Bottleneck attribution

Main idea. Separate compute, memory, scheduler, and network time.

Core relation:

T=\sum_j T_j

End-to-end request time decomposes into queue, prefill compute, decode compute, cache and weight memory traffic, scheduler gaps, and network. Optimizing a component that is 5% of the sum cannot yield more than a 5% win, so attribution must precede optimization.

Worked micro-example. If a 12 s request spends 0.5 s queued, 1 s in prefill, 10 s in decode, and 0.5 s elsewhere, decode is 83% of the total, and TPOT is the lever worth pulling first.

Implementation check. Instrument per-phase timestamps and verify the components actually sum to the measured total; an unexplained residual is itself a finding, often scheduler gaps.

AI connection. This is Amdahl's law applied to a serving pipeline.

Common mistake. Do not attribute from one profiler trace of one request; attribution shifts with prompt length, batch composition, and load.

10.5 Quality regression

Main idea. Speedups must preserve target quality unless explicitly approximate.

Core relation:

$\Delta L,\Delta S$ tracked

Exact optimizations (caching, FlashAttention, paging) must leave outputs comparable up to floating-point noise; approximate ones (quantization, cache eviction) must ship with a measured quality delta. Tracking loss and task-score deltas ($\Delta L$, $\Delta S$) per change keeps the speed/quality ledger honest.

Worked micro-example. Record held-out loss and a small task suite before and after each serving change; a change advertised as exact that moves $\Delta L$ beyond numerical noise is a bug, not a tradeoff.

Implementation check. Gate deployments on both a latency budget and a quality budget, evaluated at production prompt lengths.

AI connection. Speculative decoding is the canonical case: correctly implemented, it is probability-preserving, so a measured $\Delta S$ indicates an implementation error.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.


Practice Exercises

  1. Compute KV cache memory for a model configuration.
  2. Compare MHA, GQA, and MQA cache sizes.
  3. Estimate prefill versus decode attention operations.
  4. Compute naive attention-score memory and FlashAttention savings intuition.
  5. Estimate roofline bandwidth-limited runtime.
  6. Compute page/block waste for variable request lengths.
  7. Estimate continuous batching utilization.
  8. Compute speculative decoding expected target passes.
  9. Build a latency budget from queue, prefill, decode, and sampling.
  10. Write a cached-decode correctness checklist.

Why This Matters for AI

Training makes a model capable. Inference makes it usable. A model that is excellent but too slow, too expensive, or too memory-hungry cannot serve real users well. Efficient inference math helps you decide whether to change attention kernels, KV head count, cache layout, quantization, batching, or decoding strategy.

Bridge to Mixture of Experts and Routing

Mixture-of-experts models change the inference problem by activating only some parameters per token while keeping many parameters in memory. The next section studies routing, expert capacity, load balancing, and the difference between total parameters and active parameters.
