Efficient Attention and Inference

Efficient inference is the mathematics of turning a trained LLM into a responsive service. The main objects are attention cost, KV cache memory, prefill and decode latency, batching, scheduling, and probability-preserving acceleration.

Overview

Autoregressive inference has two phases:

prompt tokens -> prefill -> KV cache -> decode one token -> append -> decode one token -> ...

Prefill processes the prompt in parallel and is often compute-heavy. Decode emits one token at a time and is often memory-bandwidth-heavy because each step reads weights and KV cache. Efficient serving is therefore not just "make attention faster." It is a coordinated memory, kernel, cache, batching, and scheduling problem.

The central cache formula is:

$$ M_\mathrm{KV}=2BLTH_{kv}d_hb. $$

The factor 2 is for keys and values. $B$ is batch size, $L$ is layers, $T$ is cached context length, $H_{kv}$ is the number of key-value heads, $d_h$ is head dimension, and $b$ is bytes per cache element.

Prerequisites

Attention mechanism math and causal masking
Positional encodings and autoregressive decoding
Scaling-law and training-at-scale cost vocabulary
Basic memory units and tensor shapes

Learning Objectives

After this section, you should be able to:

Distinguish prefill latency, decode latency, TTFT, TPOT, throughput, and tail latency.
Estimate dense attention and decode attention cost.
Compute KV cache memory for MHA, MQA, and GQA.
Explain why FlashAttention is exact but IO-aware.
Explain why PagedAttention-style cache management improves serving memory utilization.
Reason about continuous batching, chunked prefill, and head-of-line blocking.
Estimate speculative decoding speedup from acceptance rate and draft latency.
Build an inference debugging checklist that checks both speed and correctness.

Inference Phases
1.1 Prefill
1.2 Decode
1.3 TTFT
1.4 TPOT
1.5 Throughput
Attention Cost
2.1 Training attention
2.2 Decode attention
2.3 Causal mask
2.4 Memory traffic
2.5 Arithmetic intensity
KV Cache Math
3.1 Cache contents
3.2 Cache memory
3.3 MHA
3.4 MQA
3.5 GQA
FlashAttention and IO Awareness
4.1 Naive attention memory
4.2 Tiling
4.3 Online softmax
4.4 Exactness
4.5 Long context
Serving Memory Management
5.1 Request lengths vary
5.2 Paged KV cache
5.3 Continuous batching
5.4 Prefix sharing
5.5 Eviction and recompute
Decode Acceleration
6.1 Greedy serial bottleneck
6.2 Speculative decoding
6.3 Acceptance rate
6.4 Parallel heads
6.5 Distribution preservation
Batching and Scheduling
7.1 Static batching
7.2 Dynamic batching
7.3 Head-of-line blocking
7.4 Chunked prefill
7.5 SLA tradeoff
Quantization and Bandwidth
8.1 Weight bandwidth
8.2 KV quantization
8.3 Accuracy tradeoff
8.4 Kernel support
8.5 Mixed systems
Latency Metrics
9.1 Queue time
9.2 First-token latency
9.3 Inter-token latency
9.4 Tail latency
9.5 Cost per token
Debugging Efficient Inference
10.1 Shape checks
10.2 Cache correctness
10.3 Memory accounting
10.4 Bottleneck attribution
10.5 Quality regression

Mental Model

quality target
   |
model weights -- kernels -- memory bandwidth -- KV cache -- scheduler -- user latency
                   |              |             |             |
              FlashAttention   quantization   paging      batching

Every speedup lives somewhere in this chain. The math tells you which bottleneck it can actually improve.

1. Inference Phases

This part studies inference phases as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

Subtopic	Question	Formula
Prefill	process the prompt in parallel and create the KV cache	$T_\mathrm{prefill}\propto T_\mathrm{prompt}^2$ for dense attention
Decode	generate one token at a time using cached keys and values	$T_\mathrm{decode}\propto T_\mathrm{context}$ per output token
TTFT	time to first token includes scheduling plus prefill	$\mathrm{TTFT}=T_\mathrm{queue}+T_\mathrm{prefill}+T_\mathrm{sample}$
TPOT	time per output token measures decode speed	$\mathrm{TPOT}=T_\mathrm{decode\ step}$
Throughput	serving systems balance tokens per second against latency	$\mathrm{tokens/sec}=\mathrm{completed\ tokens}/T$

1.1 Prefill

Main idea. Process the prompt in parallel and create the kv cache.

Core relation:

$$T_\mathrm{prefill}\propto T_\mathrm{prompt}^2$ for dense attention$$

Inference math is different from training math because the bottleneck shifts. Training usually wants maximum throughput over large batches. Interactive inference must manage first-token latency, per-token latency, KV cache memory, and request scheduling. The same transformer can feel fast or slow depending on prompt length, output length, batch shape, cache layout, and kernel support.

Worked micro-example. For a model with $L=32$ layers, batch $B=8$, context $T=4096$, $H_{kv}=8$ KV heads, head dimension $d_h=128$, and bf16 cache values with $b=2$ bytes, the KV cache is $2BLTH_{kv}d_hb$ bytes. Reducing $H_{kv}$ from 32 to 8 cuts this cache by 4x.

Implementation check. Compare cached decode against full-prefix recomputation on a tiny example. Then measure memory and latency separately for prefill and decode; an aggregate tokens/sec number hides too much.

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

1.2 Decode

Main idea. Generate one token at a time using cached keys and values.

Core relation:

$$T_\mathrm{decode}\propto T_\mathrm{context}$ per output token$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

1.3 TTFT

Main idea. Time to first token includes scheduling plus prefill.

Core relation:

$$\mathrm{TTFT}=T_\mathrm{queue}+T_\mathrm{prefill}+T_\mathrm{sample}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

1.4 TPOT

Main idea. Time per output token measures decode speed.

Core relation:

$$\mathrm{TPOT}=T_\mathrm{decode\ step}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

1.5 Throughput

Main idea. Serving systems balance tokens per second against latency.

Core relation:

$$\mathrm{tokens/sec}=\mathrm{completed\ tokens}/T$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

2. Attention Cost

This part studies attention cost as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

Subtopic	Question	Formula
Training attention	dense attention over a full sequence forms a T by T score matrix	$O(T^2d)$
Decode attention	one query attends to all cached keys	$O(Td)$ per generated token
Causal mask	future tokens are invisible but the triangular work is still large in prefill	$j\le i$
Memory traffic	attention can be limited by reads and writes, not only FLOPs	$T\approx\max(T_\mathrm{compute},T_\mathrm{memory})$
Arithmetic intensity	roofline reasoning compares FLOPs to bytes moved	$I=\mathrm{FLOPs}/\mathrm{bytes}$

2.1 Training attention

Main idea. Dense attention over a full sequence forms a t by t score matrix.

Core relation:

$$O(T^2d)$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

2.2 Decode attention

Main idea. One query attends to all cached keys.

Core relation:

$$O(Td)$ per generated token$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

2.3 Causal mask

Main idea. Future tokens are invisible but the triangular work is still large in prefill.

Core relation:

$$j\le i$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

2.4 Memory traffic

Main idea. Attention can be limited by reads and writes, not only flops.

Core relation:

$$T\approx\max(T_\mathrm{compute},T_\mathrm{memory})$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

2.5 Arithmetic intensity

Main idea. Roofline reasoning compares flops to bytes moved.

Core relation:

$$I=\mathrm{FLOPs}/\mathrm{bytes}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

3. KV Cache Math

This part studies kv cache math as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

Subtopic	Question	Formula
Cache contents	store keys and values for every layer, token, and KV head	$K,V\in\mathbb{R}^{L\times T\times H_{kv}\times d_h}$
Cache memory	KV memory grows linearly with context and batch	$M_\mathrm{KV}=2BLTH_{kv}d_hb$
MHA	standard multi-head attention has one KV head per query head	$H_{kv}=H_q$
MQA	multi-query attention shares one KV head across query heads	$H_{kv}=1$
GQA	grouped-query attention interpolates between MHA and MQA	$1\le H_{kv}\le H_q$

3.1 Cache contents

Main idea. Store keys and values for every layer, token, and kv head.

Core relation:

$$K,V\in\mathbb{R}^{L\times T\times H_{kv}\times d_h}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

3.2 Cache memory

Main idea. Kv memory grows linearly with context and batch.

Core relation:

$$M_\mathrm{KV}=2BLTH_{kv}d_hb$$

AI connection. This one formula often decides how many requests fit on a serving GPU.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

3.3 MHA

Main idea. Standard multi-head attention has one kv head per query head.

Core relation:

$$H_{kv}=H_q$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

3.4 MQA

Main idea. Multi-query attention shares one kv head across query heads.

Core relation:

$$H_{kv}=1$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

3.5 GQA

Main idea. Grouped-query attention interpolates between mha and mqa.

Core relation:

$$1\le H_{kv}\le H_q$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

4. FlashAttention and IO Awareness

This part studies flashattention and io awareness as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

Subtopic	Question	Formula
Naive attention memory	materializing attention scores costs T squared memory	$M_\mathrm{scores}=BT^2H$
Tiling	FlashAttention computes blocks while keeping partial statistics	$\mathrm{softmax}(QK^\top)V$ without storing all scores
Online softmax	blockwise max and denominator keep softmax exact	$m=\max_i s_i,\quad l=\sum_i e^{s_i-m}$
Exactness	FlashAttention changes memory access, not the attention definition	$\mathrm{output}_\mathrm{flash}=\mathrm{output}_\mathrm{dense}$
Long context	IO savings matter more as T grows	$T^2$ score storage becomes the wall

4.1 Naive attention memory

Main idea. Materializing attention scores costs t squared memory.

Core relation:

$$M_\mathrm{scores}=BT^2H$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

4.2 Tiling

Main idea. Flashattention computes blocks while keeping partial statistics.

Core relation:

$$\mathrm{softmax}(QK^\top)V$ without storing all scores$$

AI connection. FlashAttention is fast because it respects the memory hierarchy, not because it approximates attention.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

4.3 Online softmax

Main idea. Blockwise max and denominator keep softmax exact.

Core relation:

$$m=\max_i s_i,\quad l=\sum_i e^{s_i-m}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

4.4 Exactness

Main idea. Flashattention changes memory access, not the attention definition.

Core relation:

$$\mathrm{output}_\mathrm{flash}=\mathrm{output}_\mathrm{dense}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

4.5 Long context

Main idea. Io savings matter more as t grows.

Core relation:

$$T^2$ score storage becomes the wall$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

5. Serving Memory Management

This part studies serving memory management as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

Subtopic	Question	Formula
Request lengths vary	static allocation wastes KV memory for short or finished requests	$\mathrm{waste}=M_\mathrm{reserved}-M_\mathrm{used}$
Paged KV cache	allocate KV cache in blocks and map logical tokens to physical blocks	$\mathrm{token}\rightarrow\mathrm{block\ id},\mathrm{offset}$
Continuous batching	add and remove requests between decode steps	$B_t$ changes over time
Prefix sharing	shared prompts can reuse KV cache	$KV(x,y)=KV(x)+KV(y\mid x)$
Eviction and recompute	memory pressure can force swapping or rebuilding cache	$T_\mathrm{recompute}$ trades against $M_\mathrm{free}$

5.1 Request lengths vary

Main idea. Static allocation wastes kv memory for short or finished requests.

Core relation:

$$\mathrm{waste}=M_\mathrm{reserved}-M_\mathrm{used}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

5.2 Paged KV cache

Main idea. Allocate kv cache in blocks and map logical tokens to physical blocks.

Core relation:

$$\mathrm{token}\rightarrow\mathrm{block\ id},\mathrm{offset}$$

AI connection. Paged allocation makes serving look more like an operating-system memory problem.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

5.3 Continuous batching

Main idea. Add and remove requests between decode steps.

Core relation:

$$B_t$ changes over time$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

5.4 Prefix sharing

Main idea. Shared prompts can reuse kv cache.

Core relation:

$$KV(x,y)=KV(x)+KV(y\mid x)$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

5.5 Eviction and recompute

Main idea. Memory pressure can force swapping or rebuilding cache.

Core relation:

$$T_\mathrm{recompute}$ trades against $M_\mathrm{free}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

6. Decode Acceleration

This part studies decode acceleration as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

Subtopic	Question	Formula
Greedy serial bottleneck	autoregressive decoding normally emits one token per target-model step	$K$ tokens require $K$ target passes
Speculative decoding	a draft model proposes tokens and the target verifies them	$\mathrm{speedup}\approx \mathrm{accepted\ tokens}/\mathrm{target\ pass}$
Acceptance rate	speedup depends on draft quality and draft latency	$a=P(\mathrm{accept})$
Parallel heads	methods such as Medusa predict multiple future tokens with extra heads	$y_{t+1:t+k}$ candidates
Distribution preservation	exact speculative schemes preserve the target distribution when acceptance rules are correct	$p_\mathrm{output}=p_\mathrm{target}$

6.1 Greedy serial bottleneck

Main idea. Autoregressive decoding normally emits one token per target-model step.

Core relation:

$$K$ tokens require $K$ target passes$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

6.2 Speculative decoding

Main idea. A draft model proposes tokens and the target verifies them.

Core relation:

$$\mathrm{speedup}\approx \mathrm{accepted\ tokens}/\mathrm{target\ pass}$$

AI connection. The target model can verify several proposed tokens in one pass, reducing serial decode steps.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

6.3 Acceptance rate

Main idea. Speedup depends on draft quality and draft latency.

Core relation:

$$a=P(\mathrm{accept})$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

6.4 Parallel heads

Main idea. Methods such as medusa predict multiple future tokens with extra heads.

Core relation:

$$y_{t+1:t+k}$ candidates$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

6.5 Distribution preservation

Main idea. Exact speculative schemes preserve the target distribution when acceptance rules are correct.

Core relation:

$$p_\mathrm{output}=p_\mathrm{target}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

7. Batching and Scheduling

This part studies batching and scheduling as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

Subtopic	Question	Formula
Static batching	wait for a batch, run together, then finish together	$B$ fixed
Dynamic batching	merge compatible requests at runtime	$B_t$ variable
Head-of-line blocking	one long request can delay short ones	$T_\mathrm{latency}$ depends on schedule
Chunked prefill	split long prompts so decode traffic is not starved	$T_\mathrm{prompt}$ chunks
SLA tradeoff	higher throughput can increase tail latency	$p95\ \mathrm{latency}$ versus tokens/sec

7.1 Static batching

Main idea. Wait for a batch, run together, then finish together.

Core relation:

$$B$ fixed$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

7.2 Dynamic batching

Main idea. Merge compatible requests at runtime.

Core relation:

$$B_t$ variable$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

7.3 Head-of-line blocking

Main idea. One long request can delay short ones.

Core relation:

$$T_\mathrm{latency}$ depends on schedule$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

7.4 Chunked prefill

Main idea. Split long prompts so decode traffic is not starved.

Core relation:

$$T_\mathrm{prompt}$ chunks$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

7.5 SLA tradeoff

Main idea. Higher throughput can increase tail latency.

Core relation:

$$p95\ \mathrm{latency}$ versus tokens/sec$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

8. Quantization and Bandwidth

This part studies quantization and bandwidth as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

Subtopic	Question	Formula
Weight bandwidth	decode often reloads weights for each generated token	$T_\mathrm{weights}\approx M_\mathrm{weights}/\mathrm{bandwidth}$
KV quantization	compressing KV cache can increase batch or context capacity	$M_\mathrm{KV}\downarrow$
Accuracy tradeoff	lower precision can change probabilities	$D_\mathrm{KL}(p_\mathrm{fp},p_\mathrm{quant})$
Kernel support	a quantized format helps only if kernels are fast	$T_\mathrm{kernel}$ matters
Mixed systems	weights, activations, and KV cache may use different precision	$b_\mathrm{weight}\ne b_\mathrm{KV}$

8.1 Weight bandwidth

Main idea. Decode often reloads weights for each generated token.

Core relation:

$$T_\mathrm{weights}\approx M_\mathrm{weights}/\mathrm{bandwidth}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

8.2 KV quantization

Main idea. Compressing kv cache can increase batch or context capacity.

Core relation:

$$M_\mathrm{KV}\downarrow$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

8.3 Accuracy tradeoff

Main idea. Lower precision can change probabilities.

Core relation:

$$D_\mathrm{KL}(p_\mathrm{fp},p_\mathrm{quant})$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

8.4 Kernel support

Main idea. A quantized format helps only if kernels are fast.

Core relation:

$$T_\mathrm{kernel}$ matters$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

8.5 Mixed systems

Main idea. Weights, activations, and kv cache may use different precision.

Core relation:

$$b_\mathrm{weight}\ne b_\mathrm{KV}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

9. Latency Metrics

This part studies latency metrics as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

Subtopic	Question	Formula
Queue time	user latency includes time before GPU work starts	$T_\mathrm{queue}$
First-token latency	long prompts mainly hurt TTFT	$\mathrm{TTFT}$
Inter-token latency	long outputs mainly accumulate TPOT	$T_\mathrm{output}\approx n_\mathrm{out}\mathrm{TPOT}$
Tail latency	p95 and p99 matter more than average for products	$Q_{0.95}(T)$
Cost per token	serving cost combines hardware time and utilization	$\mathrm{cost/token}=T_\mathrm{gpu}\mathrm{price}/\mathrm{tokens}$

9.1 Queue time

Main idea. User latency includes time before gpu work starts.

Core relation:

$$T_\mathrm{queue}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

9.2 First-token latency

Main idea. Long prompts mainly hurt ttft.

Core relation:

$$\mathrm{TTFT}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

9.3 Inter-token latency

Main idea. Long outputs mainly accumulate tpot.

Core relation:

$$T_\mathrm{output}\approx n_\mathrm{out}\mathrm{TPOT}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

9.4 Tail latency

Main idea. P95 and p99 matter more than average for products.

Core relation:

$$Q_{0.95}(T)$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

9.5 Cost per token

Main idea. Serving cost combines hardware time and utilization.

Core relation:

$$\mathrm{cost/token}=T_\mathrm{gpu}\mathrm{price}/\mathrm{tokens}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

10. Debugging Efficient Inference

This part studies debugging efficient inference as serving math. The goal is to connect latency and memory behavior to concrete tensor shapes, not to memorize system names.

Subtopic	Question	Formula
Shape checks	KV heads, query heads, and head dimension must align	$H_q/H_{kv}$ integer for GQA
Cache correctness	cached decode must match full recomputation	$\max\|y_\mathrm{cache}-y_\mathrm{full}\|$ small
Memory accounting	estimate weights plus KV cache plus workspace	$M=M_\mathrm{weights}+M_\mathrm{KV}+M_\mathrm{workspace}$
Bottleneck attribution	separate compute, memory, scheduler, and network time	$T=\sum_j T_j$
Quality regression	speedups must preserve target quality unless explicitly approximate	$\Delta L,\Delta S$ tracked

10.1 Shape checks

Main idea. Kv heads, query heads, and head dimension must align.

Core relation:

$$H_q/H_{kv}$ integer for GQA$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

10.2 Cache correctness

Main idea. Cached decode must match full recomputation.

Core relation:

$$\max|y_\mathrm{cache}-y_\mathrm{full}|$ small$$

AI connection. A fast cached path is worthless if it disagrees with full recomputation.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

10.3 Memory accounting

Main idea. Estimate weights plus kv cache plus workspace.

Core relation:

$$M=M_\mathrm{weights}+M_\mathrm{KV}+M_\mathrm{workspace}$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

10.4 Bottleneck attribution

Main idea. Separate compute, memory, scheduler, and network time.

Core relation:

$$T=\sum_j T_j$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

10.5 Quality regression

Main idea. Speedups must preserve target quality unless explicitly approximate.

Core relation:

$$\Delta L,\Delta S$ tracked$$

AI connection. This quantity is a practical lever for LLM inference.

Common mistake. Do not optimize benchmark throughput while ignoring p95 latency, prompt length distribution, output length distribution, and quality regression.

Practice Exercises

Compute KV cache memory for a model configuration.
Compare MHA, GQA, and MQA cache sizes.
Estimate prefill versus decode attention operations.
Compute naive attention-score memory and FlashAttention savings intuition.
Estimate roofline bandwidth-limited runtime.
Compute page/block waste for variable request lengths.
Estimate continuous batching utilization.
Compute speculative decoding expected target passes.
Build a latency budget from queue, prefill, decode, and sampling.
Write a cached-decode correctness checklist.

Why This Matters for AI

Training makes a model capable. Inference makes it usable. A model that is excellent but too slow, too expensive, or too memory-hungry cannot serve real users well. Efficient inference math helps you decide whether to change attention kernels, KV head count, cache layout, quantization, batching, or decoding strategy.

Bridge to Mixture of Experts and Routing

Mixture-of-experts models change the inference problem by activating only some parameters per token while keeping many parameters in memory. The next section studies routing, expert capacity, load balancing, and the difference between total parameters and active parameters.

References

Ashish Vaswani et al., "Attention Is All You Need", 2017: https://arxiv.org/abs/1706.03762
Noam Shazeer, "Fast Transformer Decoding: One Write-Head is All You Need", 2019: https://arxiv.org/abs/1911.02150
Tri Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness", 2022: https://arxiv.org/abs/2205.14135
Joshua Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints", 2023: https://arxiv.org/abs/2305.13245
Woosuk Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention", 2023: https://arxiv.org/abs/2309.06180
Yaniv Leviathan, Matan Kalman, and Yossi Matias, "Fast Inference from Transformers via Speculative Decoding", 2023: https://proceedings.mlr.press/v202/leviathan23a.html
Tianle Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", 2024: https://arxiv.org/abs/2401.10774

Efficient Attention and Inference

Overview

Prerequisites

Learning Objectives

Table of Contents

Mental Model

1. Inference Phases

1.1 Prefill

1.2 Decode

1.3 TTFT

1.4 TPOT

1.5 Throughput

2. Attention Cost

2.1 Training attention

2.2 Decode attention

2.3 Causal mask

2.4 Memory traffic

2.5 Arithmetic intensity

3. KV Cache Math

3.1 Cache contents

3.2 Cache memory

3.3 MHA

3.4 MQA

3.5 GQA

4. FlashAttention and IO Awareness

4.1 Naive attention memory

4.2 Tiling

4.3 Online softmax

4.4 Exactness

4.5 Long context

5. Serving Memory Management

5.1 Request lengths vary

5.2 Paged KV cache

5.3 Continuous batching

5.4 Prefix sharing

5.5 Eviction and recompute

6. Decode Acceleration

6.1 Greedy serial bottleneck

6.2 Speculative decoding

6.3 Acceptance rate

6.4 Parallel heads

6.5 Distribution preservation

7. Batching and Scheduling

7.1 Static batching

7.2 Dynamic batching

7.3 Head-of-line blocking

7.4 Chunked prefill

7.5 SLA tradeoff

8. Quantization and Bandwidth

8.1 Weight bandwidth

8.2 KV quantization

8.3 Accuracy tradeoff

8.4 Kernel support

8.5 Mixed systems

9. Latency Metrics

9.1 Queue time

9.2 First-token latency

9.3 Inter-token latency

9.4 Tail latency

9.5 Cost per token

10. Debugging Efficient Inference

10.1 Shape checks

10.2 Cache correctness

10.3 Memory accounting

10.4 Bottleneck attribution

10.5 Quality regression

Practice Exercises

Why This Matters for AI

Bridge to Mixture of Experts and Routing

References