
Attention Mechanism Math, Part 2: Multi-Head Attention through Complexity and Efficient Attention

4. Multi-Head Attention

Multi-Head Attention explains how transformer layers route information across sequence positions using differentiable, mask-aware retrieval.

4.1 Head dimensions

Purpose. Head dimensions focuses on splitting the representation into multiple subspaces. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

Y = AV.

Operational definition.

Multi-head attention runs several attention mechanisms in parallel using different learned projections.

Worked reading.

Each head has width d_k = d_{\mathrm{model}}/h in the standard design; the head outputs are then concatenated and projected.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. syntax-like heads.
  2. copy heads.
  3. multi-query attention.

Non-examples:

  1. one monolithic attention map only.
  2. duplicating the same head without learned projections.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
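
As a concrete check on the head split described above, here is a minimal NumPy sketch; the dimensions, random weights, and variable names are illustrative assumptions rather than values from the lesson.

```python
import numpy as np

# Toy sizes chosen only for illustration.
T, d_model, h = 5, 16, 4
d_k = d_model // h                         # standard design: d_k = d_model / h

rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))          # hidden states entering the layer
W_Q = rng.normal(size=(d_model, d_model))  # packed query projection for all heads

Q = X @ W_Q                                        # (T, d_model)
Q_heads = Q.reshape(T, h, d_k).transpose(1, 0, 2)  # (h, T, d_k): one subspace per head

assert Q_heads.shape == (h, T, d_k)
print(Q_heads.shape)
```

Splitting is just a reshape of the projected matrix, so no information is added or removed; each head simply reads a different d_k-wide slice.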

4.2 Parallel heads

Purpose. Parallel heads focuses on different learned projections over the same sequence. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

\operatorname{MHA}(X) = \operatorname{Concat}(H_1, \ldots, H_h) W_O.

Operational definition.

Multi-head attention runs several attention mechanisms in parallel using different learned projections.

Worked reading.

Each head has width d_k = d_{\mathrm{model}}/h in the standard design; the head outputs are then concatenated and projected.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. syntax-like heads.
  2. copy heads.
  3. multi-query attention.

Non-examples:

  1. one monolithic attention map only.
  2. duplicating the same head without learned projections.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
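
To make the parallel-heads picture concrete, the following NumPy sketch runs one scaled dot-product attention per head, each with its own randomly initialized projections; all sizes and weights are stand-ins for illustration, not the lesson's parameters.

```python
import numpy as np

def softmax(S, axis=-1):
    S = S - S.max(axis=axis, keepdims=True)   # subtract the row max for stability
    E = np.exp(S)
    return E / E.sum(axis=axis, keepdims=True)

T, d_model, h = 5, 16, 4
d_k = d_model // h
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))

head_outputs = []
for _ in range(h):
    # Each head gets its own learned projections (random stand-ins here).
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # (T, d_k) each
    S = Q @ K.T / np.sqrt(d_k)                # (T, T) compatibility scores
    A = softmax(S)                            # every row sums to one
    head_outputs.append(A @ V)                # (T, d_k) mixed values for this head

print(len(head_outputs), head_outputs[0].shape)   # h heads, each (T, d_k)
```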

4.3 Concatenation and output projection

Purpose. Concatenation and output projection focuses on returning to model width. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

\operatorname{cost}_{\mathrm{scores}} \in O(T^2 d_k), \qquad \operatorname{memory}_{\mathrm{scores}} \in O(T^2).

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
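
Continuing the same toy setup, this short sketch shows the return to model width: concatenate the per-head outputs and apply an output projection W_O. The head outputs here are random placeholders standing in for H_1, ..., H_h.

```python
import numpy as np

T, d_model, h = 5, 16, 4
d_v = d_model // h
rng = np.random.default_rng(0)

# Placeholder per-head outputs H_1..H_h, each (T, d_v).
head_outputs = [rng.normal(size=(T, d_v)) for _ in range(h)]

concat = np.concatenate(head_outputs, axis=-1)   # (T, h * d_v) = (T, d_model)
W_O = rng.normal(size=(h * d_v, d_model))        # output projection back to model width
Y = concat @ W_O                                 # (T, d_model), ready for the residual stream

assert Y.shape == (T, d_model)
```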

4.4 Head specialization and redundancy

Purpose. Head specialization and redundancy focuses on why heads can be interpretable or unused. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

H(A_i) = -\sum_j A_{ij} \log A_{ij}.

Operational definition.

Multi-head attention runs several attention mechanisms in parallel using different learned projections.

Worked reading.

Each head has width d_k = d_{\mathrm{model}}/h in the standard design; the head outputs are then concatenated and projected.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. syntax-like heads.
  2. copy heads.
  3. multi-query attention.

Non-examples:

  1. one monolithic attention map only.
  2. duplicating the same head without learned projections.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
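
The entropy formula above has a direct numerical reading. The sketch below computes H(A_i) for a hand-made attention matrix; the numbers are invented purely to show a peaked row versus a diffuse row.

```python
import numpy as np

# Toy attention weights for one head; each row already sums to one.
A = np.array([[0.98, 0.01, 0.01],
              [0.34, 0.33, 0.33],
              [0.10, 0.80, 0.10]])

# Row entropy H(A_i) = -sum_j A_ij log A_ij.
eps = 1e-12                                   # avoid log(0)
H = -(A * np.log(A + eps)).sum(axis=-1)
print(H)   # row 0 is near 0 (peaked, specialized-looking); row 1 is near log(3) (diffuse)
```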

4.5 Grouped query and multi-query attention

Purpose. Grouped query and multi-query attention focuses on sharing K/V for inference efficiency. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

\operatorname{causal\ mask}_{ij} = 0 \text{ if } j \le i, \quad -\infty \text{ otherwise}.

Operational definition.

Multi-head attention runs several attention mechanisms in parallel using different learned projections.

Worked reading.

Each head has width d_k = d_{\mathrm{model}}/h in the standard design; the head outputs are then concatenated and projected.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. syntax-like heads.
  2. copy heads.
  3. multi-query attention.

Non-examples:

  1. one monolithic attention map only.
  2. duplicating the same head without learned projections.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
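
The sketch below shows the K/V-sharing idea behind grouped-query and multi-query attention: many query heads read from a small number of shared key/value heads, which is what shrinks the KV cache. The head counts and random tensors are illustrative assumptions.

```python
import numpy as np

def softmax(S, axis=-1):
    S = S - S.max(axis=axis, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=axis, keepdims=True)

T, d_k = 6, 8
h_q, h_kv = 8, 2                       # 8 query heads share 2 K/V heads (group size 4); h_kv = 1 is MQA
rng = np.random.default_rng(0)

Q = rng.normal(size=(h_q, T, d_k))     # one query tensor per query head
K = rng.normal(size=(h_kv, T, d_k))    # only h_kv key heads ever need to be cached
V = rng.normal(size=(h_kv, T, d_k))

group = h_q // h_kv
Y = np.empty_like(Q)
for i in range(h_q):
    g = i // group                     # which shared K/V head this query head reads
    S = Q[i] @ K[g].T / np.sqrt(d_k)
    Y[i] = softmax(S) @ V[g]

print(Y.shape)                         # (h_q, T, d_k); the KV cache holds only h_kv heads
```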

5. Decoder Attention in LLMs

Decoder Attention in LLMs explains how transformer layers route information across sequence positions using differentiable, mask-aware retrieval.

5.1 Autoregressive causal attention

Purpose. Autoregressive causal attention focuses on preventing future-token leakage. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

\operatorname{MHA}(X) = \operatorname{Concat}(H_1, \ldots, H_h) W_O.

Operational definition.

A mask changes which key positions a query is allowed to see by adding large negative values to forbidden logits before softmax.

Worked reading.

In decoder-only language modeling, token i may attend to positions j \le i but not to future positions j > i.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. causal masks.
  2. padding masks.
  3. structured prompt masks.

Non-examples:

  1. zeroing output after softmax.
  2. trusting data order without a mask.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
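
Here is a minimal sketch of the causal mask in action, with zero scores used as stand-ins so only the mask matters: the mask is added before softmax, masked positions get zero weight, and each row still normalizes over its visible keys.

```python
import numpy as np

T = 4
# M_ij = 0 if j <= i, -inf otherwise: token i must not see future positions.
M = np.where(np.tril(np.ones((T, T), dtype=bool)), 0.0, -np.inf)

S = np.zeros((T, T))                           # stand-in scores
logits = S + M                                 # mask added BEFORE softmax
A = np.exp(logits - logits.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)

print(np.triu(A, k=1).max())                   # 0.0: no weight ever lands on a future key
print(A.sum(axis=-1))                          # every row sums to one over visible keys
```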

5.2 KV cache

Purpose. KV cache focuses on reusing past keys and values during generation. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

\operatorname{cost}_{\mathrm{scores}} \in O(T^2 d_k), \qquad \operatorname{memory}_{\mathrm{scores}} \in O(T^2).

Operational definition.

During autoregressive generation, past keys and values can be cached so each new token computes attention against old K/V instead of recomputing the entire prefix.

Worked reading.

Prefill processes the whole prompt; decode appends one token at a time and reuses cached K/V tensors.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. LLM serving.
  2. streaming decode.
  3. multi-query attention.

Non-examples:

  1. training with full parallel sequence processing.
  2. recomputing all previous keys every token.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
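
The decode loop below is a hedged sketch of a KV cache for a single head: each step projects only the newest token, appends its key and value to the cache, and attends over everything cached so far. The projection matrices and hidden states are random placeholders.

```python
import numpy as np

def softmax(s):
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

d_k = 8
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.normal(size=(d_k, d_k)) for _ in range(3))

K_cache = np.empty((0, d_k))                   # grows by one row per generated token
V_cache = np.empty((0, d_k))

for step in range(3):                          # decode three tokens, one query at a time
    x = rng.normal(size=(d_k,))                # hidden state of the newest token (placeholder)
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    K_cache = np.vstack([K_cache, k])          # append instead of recomputing the whole prefix
    V_cache = np.vstack([V_cache, v])
    A = softmax(q @ K_cache.T / np.sqrt(d_k))  # (step + 1,) weights over cached keys
    y = A @ V_cache                            # output for the new token only

print(K_cache.shape)                           # (3, d_k)
```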

5.3 Prefill versus decode

Purpose. Prefill versus decode focuses on two different attention workloads. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

H(A_i) = -\sum_j A_{ij} \log A_{ij}.

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
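
A rough score-matrix accounting makes the two workloads tangible. The prompt length and head width below are illustrative numbers only, chosen to keep the arithmetic easy to check.

```python
# Per-head accounting for prefill versus decode (illustrative numbers only).
T_prompt, d_k = 2048, 128

prefill_score_entries = T_prompt * T_prompt     # full T x T score matrix in one parallel pass
prefill_flops = 2 * T_prompt * T_prompt * d_k   # the QK^T matmul dominates

decode_score_entries = T_prompt                 # one new query against the cached keys
decode_flops = 2 * T_prompt * d_k               # per generated token, plus the cost of reading the cache

print(prefill_score_entries, prefill_flops)     # 4194304 entries, roughly 1.1e9 FLOPs
print(decode_score_entries, decode_flops)       # 2048 entries, roughly 5.2e5 FLOPs per token
```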

5.4 Attention with positional encodings

Purpose. Attention with positional encodings focuses on how RoPE and ALiBi modify scores. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

\operatorname{causal\ mask}_{ij} = 0 \text{ if } j \le i, \quad -\infty \text{ otherwise}.

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
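
As one concrete example of a score-level position signal, here is an ALiBi-style sketch: a distance-proportional penalty is added to the logits of one head before softmax, alongside the causal mask. The slope and sizes are invented for illustration; real models use a fixed per-head slope schedule, and RoPE instead rotates Q and K before the dot product.

```python
import numpy as np

T = 6
slope = 0.5                                    # illustrative per-head slope
i = np.arange(T)[:, None]                      # query positions
j = np.arange(T)[None, :]                      # key positions

bias = -slope * np.maximum(i - j, 0)           # 0 on the diagonal, more negative for distant past keys
causal = np.where(j <= i, 0.0, -np.inf)        # the causal mask is still required

S = np.zeros((T, T))                           # stand-in scores
logits = S + bias + causal                     # position enters through the scores, not the values
print(logits[-1])                              # the last query sees a graded penalty over the past
```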

5.5 Cross-attention preview

Purpose. Cross-attention preview focuses on encoder-decoder and retrieval-conditioned variants. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V.

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
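
The cross-attention sketch below differs from self-attention in exactly one place: queries come from the decoder states while keys and values come from another sequence (an encoder output or retrieved context), so the attention matrix is rectangular. All tensors are random placeholders.

```python
import numpy as np

def softmax(S, axis=-1):
    S = S - S.max(axis=axis, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=axis, keepdims=True)

T_dec, T_enc, d_model, d_k = 4, 7, 16, 8
rng = np.random.default_rng(0)
X_dec = rng.normal(size=(T_dec, d_model))      # decoder hidden states supply the queries
X_enc = rng.normal(size=(T_enc, d_model))      # encoder / retrieved context supplies keys and values

W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q = X_dec @ W_Q
K, V = X_enc @ W_K, X_enc @ W_V

A = softmax(Q @ K.T / np.sqrt(d_k))            # (T_dec, T_enc): rectangular, not square
Y = A @ V                                      # (T_dec, d_k)
print(A.shape, Y.shape)
```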

6. Complexity and Efficient Attention

Complexity and Efficient Attention explains how transformer layers route information across sequence positions using differentiable, mask-aware retrieval.

6.1 Quadratic token cost

Purpose. Quadratic token cost focuses on why T^2 score matrices dominate long-context workloads. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

\operatorname{cost}_{\mathrm{scores}} \in O(T^2 d_k), \qquad \operatorname{memory}_{\mathrm{scores}} \in O(T^2).

Operational definition.

Standard attention forms all pairwise query-key scores, so score memory grows quadratically with sequence length.

Worked reading.

Doubling context length roughly quadruples the score matrix size, even before considering layer count and batch size.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. long-context training.
  2. KV-cache sizing.
  3. FlashAttention kernels.

Non-examples:

  1. linear cost assumptions.
  2. ignoring memory traffic.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
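
A quick back-of-the-envelope script shows why the T x T matrices dominate at long context. The head count, batch size, and fp16 element size below are illustrative assumptions.

```python
# Materialized score-matrix memory for one layer (illustrative numbers only).
def score_bytes(T, heads=32, batch=1, bytes_per_elem=2):   # 2 bytes per element, roughly fp16
    return batch * heads * T * T * bytes_per_elem

for T in (4_096, 8_192, 16_384):
    print(T, f"{score_bytes(T) / 2**30:.1f} GiB")
# Doubling T roughly quadruples the memory needed to hold the scores of a single layer.
```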

6.2 Memory layout and IO

Purpose. Memory layout and IO focuses on why exact attention can be slow despite simple formulas. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

H(A_i) = -\sum_j A_{ij} \log A_{ij}.

Operational definition.

Standard attention forms all pairwise query-key scores, so score memory grows quadratically with sequence length.

Worked reading.

Doubling context length roughly quadruples the score matrix size, even before considering layer count and batch size.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. long-context training.
  2. KV-cache sizing.
  3. FlashAttention kernels.

Non-examples:

  1. linear cost assumptions.
  2. ignoring memory traffic.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
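
The IO point can be made with the same kind of rough arithmetic: for realistic lengths, reading and writing a materialized T x T matrix moves far more data than reading Q, K, and V themselves. The sizes and fp16 assumption are illustrative.

```python
# Rough per-head IO accounting in fp16 (illustrative numbers only).
T, d_k, bytes_per_elem = 8_192, 128, 2

qkv_bytes = 3 * T * d_k * bytes_per_elem       # read Q, K, V once
out_bytes = T * d_k * bytes_per_elem           # write Y once
score_bytes = T * T * bytes_per_elem           # materializing S (and A) costs this much each time

print(f"Q/K/V plus Y: {(qkv_bytes + out_bytes) / 2**20:.0f} MiB")
print(f"one T x T score matrix: {score_bytes / 2**20:.0f} MiB")
# The T x T traffic dwarfs the Q/K/V traffic, which is why the simple formula can still be slow.
```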

6.3 FlashAttention intuition

Purpose. FlashAttention intuition focuses on tiling exact attention to reduce memory traffic. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

\operatorname{causal\ mask}_{ij} = 0 \text{ if } j \le i, \quad -\infty \text{ otherwise}.

Operational definition.

FlashAttention computes exact attention while avoiding materializing the full attention matrix in high-bandwidth memory.

Worked reading.

It tiles Q, K, and V blocks and maintains online softmax statistics so memory traffic is lower even though the mathematical result is exact.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. long-context training.
  2. GPU attention kernels.
  3. memory-efficient exact attention.

Non-examples:

  1. approximate sparse attention.
  2. changing the attention formula.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
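
The key trick can be shown in a few lines: process keys and values block by block while keeping a running max, a running normalizer, and a running weighted sum, then compare against the plain full-matrix result. This is only an online-softmax sketch of the idea, not the actual FlashAttention kernel; the block size and tensors are illustrative.

```python
import numpy as np

T, d_k, block = 16, 8, 4
rng = np.random.default_rng(0)
q = rng.normal(size=(d_k,))                    # a single query row, for clarity
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_k))

m, l, acc = -np.inf, 0.0, np.zeros(d_k)        # running max, normalizer, weighted-value sum
for start in range(0, T, block):
    s = q @ K[start:start + block].T / np.sqrt(d_k)   # scores for this block only
    m_new = max(m, s.max())
    scale = np.exp(m - m_new)                  # rescale what has been accumulated so far
    p = np.exp(s - m_new)
    l = l * scale + p.sum()
    acc = acc * scale + p @ V[start:start + block]
    m = m_new
y_tiled = acc / l

# Reference: the straightforward full-matrix result is identical.
s_full = q @ K.T / np.sqrt(d_k)
a_full = np.exp(s_full - s_full.max())
a_full /= a_full.sum()
print(np.allclose(y_tiled, a_full @ V))        # True: tiling is exact, only memory traffic changes
```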

6.4 Sparse and local attention preview

Purpose. Sparse and local attention preview focuses on approximating visibility patterns. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V.

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
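
A sliding-window mask is the simplest sparse pattern to write down: each query sees only its most recent few positions, so per-row work stops growing with T. The window size below is an illustrative choice.

```python
import numpy as np

T, window = 8, 3
i = np.arange(T)[:, None]                      # query positions
j = np.arange(T)[None, :]                      # key positions

# Visible iff the key is in the causal past and within the window.
visible = (j <= i) & (j > i - window)
M_local = np.where(visible, 0.0, -np.inf)      # added to scores before softmax, like any other mask

print(visible.astype(int))                     # band of width `window` along the diagonal
# Per-row score work drops from O(T) to O(window), trading exact visibility for cost.
```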

6.5 Long-context diagnostics

Purpose. Long-context diagnostics focuses on checking quality, cost, and position behavior together. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

S = \frac{QK^\top}{\sqrt{d_k}}.

Operational definition.

Attention diagnostics inspect weights, entropy, masks, and head importance, but they do not by themselves prove causal explanations.

Worked reading.

A low-entropy row means one or a few keys dominate; a high-entropy row means information is mixed broadly.

Object | Shape | Meaning
X | T \times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T \times d_k | query and key address vectors
V | T \times d_v | value payload vectors
S = QK^\top / \sqrt{d_k} | T \times T | compatibility scores
A = \operatorname{softmax}(S + M) | T \times T | attention weights
Y = AV | T \times d_v | mixed output values

Examples:

  1. attention heatmaps.
  2. head ablations.
  3. entropy dashboards.

Non-examples:

  1. claiming attention weight equals explanation.
  2. inspecting only one prompt.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
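
The two cheapest diagnostics from the derivation habit, row normalization over visible keys and no leakage onto masked positions, plus the per-row entropy used earlier, fit in a few lines. The scores here are random placeholders.

```python
import numpy as np

def softmax(S, axis=-1):
    S = S - S.max(axis=axis, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=axis, keepdims=True)

T = 6
rng = np.random.default_rng(0)
S = rng.normal(size=(T, T))                    # placeholder scores
i, j = np.arange(T)[:, None], np.arange(T)[None, :]
M = np.where(j <= i, 0.0, -np.inf)             # causal mask

A = softmax(S + M)
assert np.allclose(A.sum(axis=-1), 1.0)        # every row sums to one over visible keys
assert np.triu(A, k=1).max() == 0.0            # no probability leaks onto future positions

H = -(A * np.log(A + 1e-12)).sum(axis=-1)      # per-row entropy: low = peaked, high = broad mixing
print(H)
```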
