The transformer is a stack of attention, feed-forward, residual, normalization, and position mechanisms. Its core advantage is that training can process all token positions in parallel while attention lets each token condition on other tokens.
Overview
The central attention operation is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_{\text{head}}}}\right)V$$
This formula turns token representations into queries, keys, and values. Queries ask, keys match, values carry content. Multi-head attention repeats this operation in several subspaces. Residual blocks and normalization make the stack trainable. Positional information restores order. Masks define which tokens are allowed to interact.
Prerequisites
- Matrix multiplication and softmax
- Embeddings and positional encodings
- Sequence probability and autoregressive masking
- RNN section for the recurrence-to-attention comparison
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Demonstrates attention shapes, causal masks, multi-head splitting, MLP blocks, Pre-LN/Post-LN, positional signals, KV cache size, and diagnostics. |
| exercises.ipynb | Ten practice problems for attention arithmetic, masks, shapes, parameter counts, and transformer debugging. |
Learning Objectives
After this section, you should be able to:
- Compute scaled dot-product attention from Q, K, and V.
- Explain multi-head attention and head dimension arithmetic.
- Distinguish attention token mixing from position-wise MLP channel mixing.
- Explain residual streams, LayerNorm, Pre-LN, Post-LN, and RMSNorm.
- Compare encoder, decoder, encoder-decoder, encoder-only, and decoder-only architectures.
- Explain positional signals, attention masks, and KV cache memory.
- Estimate attention and MLP complexity.
- Build shape, mask, attention-entropy, and residual-norm diagnostics.
Table of Contents
- Transformer Design Goal
- 1.1 Sequence transduction
- 1.2 Parallel token processing
- 1.3 Context mixing
- 1.4 Position information
- 1.5 Stacked blocks
- Scaled Dot-Product Attention
- 2.1 Queries keys values
- 2.2 Score matrix
- 2.3 Masking
- 2.4 Attention weights
- 2.5 Weighted values
- Multi-Head Attention
- 3.1 Head splitting
- 3.2 Per-head dimension
- 3.3 Concatenation
- 3.4 Specialization
- 3.5 GQA and MQA bridge
- Feed-Forward Network
- 4.1 Position-wise MLP
- 4.2 Expansion ratio
- 4.3 Activation
- 4.4 Parameter count
- 4.5 Token independence
- Residuals and Normalization
- 5.1 Residual stream
- 5.2 Layer normalization
- 5.3 Pre-LN block
- 5.4 Post-LN block
- 5.5 RMSNorm
- Encoder, Decoder, and Decoder-Only Forms
- 6.1 Encoder self-attention
- 6.2 Decoder causal self-attention
- 6.3 Cross-attention
- 6.4 Encoder-only
- 6.5 Decoder-only
- Positional Information
- 7.1 Absolute positions
- 7.2 Sinusoidal positions
- 7.3 Relative bias
- 7.4 RoPE
- 7.5 Length behavior
- Complexity and Memory
- 8.1 Attention FLOPs
- 8.2 Attention score memory
- 8.3 MLP FLOPs
- 8.4 KV cache
- 8.5 FlashAttention bridge
- Training Signals
- 9.1 Language-model loss
- 9.2 Masked LM loss
- 9.3 Teacher forcing
- 9.4 Attention masks
- 9.5 Weight tying
- Diagnostics
- 10.1 Shape checks
- 10.2 Mask tests
- 10.3 Attention entropy
- 10.4 Residual norms
- 10.5 Ablations
Block Shape Map
- hidden states: H, shape (B, T, d_model)
- queries: Q, shape (B, heads, T, d_head)
- keys: K, shape (B, heads, T, d_head)
- values: V, shape (B, heads, T, d_head)
- attention scores: S, shape (B, heads, T, T)
- attention output: O, shape (B, T, d_model)
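A quick NumPy sanity check of this shape map; a minimal sketch with illustrative sizes, reusing Q as K only to exercise the shapes:

```python
import numpy as np

B, T, d_model, n_heads = 2, 128, 768, 12
d_head = d_model // n_heads  # 64

H = np.random.randn(B, T, d_model)

# Split the model dimension into heads: (B, T, d_model) -> (B, heads, T, d_head)
Q = H.reshape(B, T, n_heads, d_head).transpose(0, 2, 1, 3)
assert Q.shape == (B, n_heads, T, d_head)

# Score matrix is (B, heads, T, T): one score per query-key pair.
# Reusing Q in place of K here, purely to check the shapes.
S = Q @ Q.transpose(0, 1, 3, 2) / np.sqrt(d_head)
assert S.shape == (B, n_heads, T, T)
```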
1. Transformer Design Goal
This part studies the transformer design goal as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.
| Subtopic | Question | Formula |
|---|---|---|
| Sequence transduction | map one sequence representation to another | $(x_1, \dots, x_T) \mapsto (y_1, \dots, y_T)$ |
| Parallel token processing | avoid recurrence during training | $O(1)$ sequential steps per layer |
| Context mixing | let every token read other tokens through attention | $y_i = \sum_j A_{ij} v_j$ |
| Position information | inject order because attention alone is permutation equivariant | $h_t = e_t + p_t$ |
| Stacked blocks | repeat attention and MLP transformations | $H^{(\ell+1)} = \mathrm{Block}(H^{(\ell)})$ |
1.1 Sequence transduction
Main idea. Map one sequence representation to another.
Core relation: learn $f_\theta : (x_1, \dots, x_T) \mapsto (y_1, \dots, y_T)$, producing one output representation per input position.
Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.
Worked micro-example. If $d_{\text{model}} = 768$ (the width used by BERT-base and GPT-2 small) and there are 12 attention heads, each head has $d_{\text{head}} = 768 / 12 = 64$. The attention score matrix for one head and one sequence of length 128 has shape 128 by 128. That score matrix is why attention is powerful and also why long-context attention is expensive.
Implementation check. Write the shapes for $Q$, $K$, $V$, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.
AI connection. This split between token mixing and per-token transformation is the design template for every modern LLM block.
Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
1.2 Parallel token processing
Main idea. Avoid recurrence during training.
Core relation: a layer computes all $T$ positions in one pass, so sequential depth per layer is $O(1)$ in $T$, unlike an RNN's $h_t = f(h_{t-1}, x_t)$.
1.3 Context mixing
Main idea. Let every token read other tokens through attention.
Core relation: $y_i = \sum_j A_{ij} v_j$, so each token's update is a weighted mixture of information from other tokens.
1.4 Position information
Main idea. Inject order because attention alone is permutation equivariant.
Core relation: self-attention without positional signals is permutation equivariant, so order must be injected explicitly, for example as $h_t = e_t + p_t$.
1.5 Stacked blocks
Main idea. Repeat attention and MLP transformations.
Core relation: $H^{(\ell+1)} = \mathrm{Block}(H^{(\ell)})$ for $\ell = 1, \dots, L$; each block mixes tokens with attention, then transforms them with a position-wise MLP, both behind residual connections.
2. Scaled Dot-Product Attention
This part studies scaled dot-product attention as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.
| Subtopic | Question | Formula |
|---|---|---|
| Queries keys values | project hidden states into matching and content spaces | $Q = HW_Q$, $K = HW_K$, $V = HW_V$ |
| Score matrix | compare every query with every key | $S = QK^\top / \sqrt{d_{\text{head}}}$ |
| Masking | block padding or future positions before softmax | $S_{ij} \leftarrow -\infty$ when masked |
| Attention weights | softmax gives a distribution over source positions | $A = \mathrm{softmax}(S)$ |
| Weighted values | mix value vectors by attention weights | $O = AV$ |
2.1 Queries keys values
Main idea. Project hidden states into matching and content spaces.
Core relation: $Q = H W_Q$, $K = H W_K$, $V = H W_V$, projecting hidden states $H$ of shape $(B, T, d_{\text{model}})$ into query, key, and value spaces.
2.2 Score matrix
Main idea. Compare every query with every key.
Core relation: $S = QK^\top / \sqrt{d_{\text{head}}}$; the scaling keeps dot products from growing with dimension and saturating the softmax.
AI connection. This is the computation that lets each token ask which other tokens matter.
2.3 Masking
Main idea. Block padding or future positions before softmax.
Core relation: $S_{ij} \leftarrow -\infty$ when position $j$ is masked for query $i$, so those positions receive exactly zero attention weight after the softmax.
2.4 Attention weights
Main idea. Softmax gives a distribution over source positions.
Core relation: $A = \mathrm{softmax}(S)$ row-wise, so each query position holds a probability distribution over source positions.
2.5 Weighted values
Main idea. Mix value vectors by attention weights.
Core relation: $O = AV$; every output is a convex combination of value vectors, weighted by attention.
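Subtopics 2.1 through 2.5 compose into a few lines of code. A minimal NumPy sketch, with `-1e9` standing in for the $-\infty$ mask fill and a boolean mask where True means the position may be attended:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (..., T_q, d_head), K and V: (..., T_k, d_head).
    mask: boolean, broadcastable to (..., T_q, T_k); True = may attend."""
    d_head = Q.shape[-1]
    S = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_head)  # score matrix
    if mask is not None:
        S = np.where(mask, S, -1e9)                   # stand-in for -inf
    S = S - S.max(axis=-1, keepdims=True)             # numerically stable softmax
    A = np.exp(S)
    A = A / A.sum(axis=-1, keepdims=True)             # attention weights
    return A @ V, A                                   # weighted values, weights

# Causal usage: token i may attend only to positions j <= i.
rng = np.random.default_rng(0)
T, d = 5, 8
Q = K = V = rng.normal(size=(T, d))
O, A = scaled_dot_product_attention(Q, K, V, mask=np.tril(np.ones((T, T), dtype=bool)))
assert np.allclose(A.sum(axis=-1), 1.0) and A[0, 1:].max() == 0.0
```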
3. Multi-Head Attention
This part studies multi-head attention as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.
| Subtopic | Question | Formula |
|---|---|---|
| Head splitting | several smaller attention heads run in parallel | $\text{head}_h = \mathrm{Attention}(Q_h, K_h, V_h)$ |
| Per-head dimension | model dimension is split across heads | $d_{\text{head}} = d_{\text{model}} / n_{\text{heads}}$ |
| Concatenation | head outputs are concatenated and projected | $O = \mathrm{Concat}(\text{head}_1, \dots, \text{head}_{n_{\text{heads}}}) W_O$ |
| Specialization | different heads can learn different relation patterns | $A_h$ varies by head |
| GQA and MQA bridge | serving variants share key-value heads | $n_{kv} < n_{\text{heads}}$ |
3.1 Head splitting
Main idea. Several smaller attention heads run in parallel.
Core relation: $\text{head}_h = \mathrm{Attention}(Q_h, K_h, V_h)$ for $h = 1, \dots, n_{\text{heads}}$, each operating in its own $d_{\text{head}}$-dimensional subspace.
3.2 Per-head dimension
Main idea. Model dimension is split across heads.
Core relation: $d_{\text{head}} = d_{\text{model}} / n_{\text{heads}}$, so adding heads narrows each head rather than adding compute.
3.3 Concatenation
Main idea. Head outputs are concatenated and projected.
Core relation: $O = \mathrm{Concat}(\text{head}_1, \dots, \text{head}_{n_{\text{heads}}}) W_O$; the output projection lets the model remix information across heads.
3.4 Specialization
Main idea. Different heads can learn different relation patterns.
Core relation: $A_h$ varies by head, so different heads can track different relations such as adjacency, syntax, or long-range reference.
3.5 GQA and MQA bridge
Main idea. Serving variants share key-value heads.
Core relation: grouped-query attention (GQA) and multi-query attention (MQA) keep $n_{\text{heads}}$ query heads but only $n_{kv} < n_{\text{heads}}$ key-value heads, shrinking the KV cache by a factor of $n_{\text{heads}} / n_{kv}$.
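A minimal NumPy sketch of the split and concatenation steps; the projection matrices $W_Q$, $W_K$, $W_V$, $W_O$ are omitted to keep the shape logic visible:

```python
import numpy as np

def split_heads(X, n_heads):
    # (B, T, d_model) -> (B, n_heads, T, d_head)
    B, T, d_model = X.shape
    d_head = d_model // n_heads
    return X.reshape(B, T, n_heads, d_head).transpose(0, 2, 1, 3)

def merge_heads(X):
    # (B, n_heads, T, d_head) -> (B, T, d_model): the concatenation step
    B, n_heads, T, d_head = X.shape
    return X.transpose(0, 2, 1, 3).reshape(B, T, n_heads * d_head)

X = np.random.randn(2, 16, 768)
# Splitting then merging is an exact round trip.
assert np.allclose(merge_heads(split_heads(X, 12)), X)
```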
4. Feed-Forward Network
This part studies feed-forward network as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.
| Subtopic | Question | Formula |
|---|---|---|
| Position-wise MLP | apply the same MLP to each token independently | $\mathrm{FFN}(x_t) = W_2\, \phi(W_1 x_t + b_1) + b_2$ |
| Expansion ratio | hidden width is often larger than model width | $d_{\text{ff}} \approx 4\, d_{\text{model}}$ |
| Activation | GELU or SwiGLU controls nonlinear feature mixing | $\phi = \mathrm{GELU}$ or $\mathrm{SwiGLU}$ |
| Parameter count | MLP parameters often dominate each block | $\approx 2\, d_{\text{model}} d_{\text{ff}}$ |
| Token independence | cross-token mixing happens in attention, not the MLP | $x_t$ processed separately |
4.1 Position-wise MLP
Main idea. Apply the same MLP to each token independently.
Core relation: $\mathrm{FFN}(x_t) = W_2\, \phi(W_1 x_t + b_1) + b_2$, applied at every position $t$ with shared weights.
4.2 Expansion ratio
Main idea. Hidden width is often larger than model width.
Core relation: $d_{\text{ff}} \approx 4\, d_{\text{model}}$ in the original transformer; the expansion gives each token more nonlinear capacity before projecting back down.
4.3 Activation
Main idea. GELU or SwiGLU controls nonlinear feature mixing.
Core relation: $\phi$ is the elementwise nonlinearity; gated variants compute $\mathrm{SwiGLU}(x) = (\mathrm{Swish}(W_1 x) \odot W_3 x)\, W_2$ with a third projection.
4.4 Parameter count
Main idea. MLP parameters often dominate each block.
Core relation: the two projections hold $\approx 2\, d_{\text{model}} d_{\text{ff}}$ parameters, versus $\approx 4\, d_{\text{model}}^2$ for attention's $Q$, $K$, $V$, and output projections; at $d_{\text{ff}} = 4\, d_{\text{model}}$ that is twice as many.
4.5 Token independence
Main idea. Cross-token mixing happens in attention, not the mlp.
Core relation: each $x_t$ is processed separately, so the MLP changes what a token represents but never which tokens interact.
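A minimal NumPy sketch of the position-wise MLP with the tanh GELU approximation, plus the parameter comparison from 4.4; the 0.02 init scale is illustrative:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # Acts on the last axis only: every token gets the same weights,
    # and no information crosses the T axis.
    return gelu(x @ W1 + b1) @ W2 + b2

d_model, d_ff = 768, 3072   # the common 4x expansion
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)

mlp_params = W1.size + b1.size + W2.size + b2.size  # ~4.7M
attn_params = 4 * d_model * d_model                 # Q, K, V, O projections, ~2.4M
print(mlp_params, attn_params)  # the MLP holds roughly 2x the attention parameters
```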
5. Residuals and Normalization
This part studies residuals and normalization as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.
| Subtopic | Question | Formula |
|---|---|---|
| Residual stream | blocks add updates to a persistent representation | $x \leftarrow x + \mathrm{Sublayer}(x)$ |
| Layer normalization | normalize features within each token | $\mathrm{LN}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta$ |
| Pre-LN block | normalize before sublayer for more stable deep training | $x + \mathrm{Sublayer}(\mathrm{LN}(x))$ |
| Post-LN block | original transformer normalized after residual addition | $\mathrm{LN}(x + \mathrm{Sublayer}(x))$ |
| RMSNorm | normalize by root mean square without mean subtraction | $\gamma \odot x / \mathrm{RMS}(x)$ |
5.1 Residual stream
Main idea. Blocks add updates to a persistent representation.
Core relation: $x \leftarrow x + \mathrm{Sublayer}(x)$; each block writes an additive update into a persistent residual stream, which keeps gradients flowing through identity paths.
5.2 Layer normalization
Main idea. Normalize features within each token.
Core relation: $\mathrm{LN}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta$, with $\mu$ and $\sigma$ computed over the feature dimension of each token independently.
5.3 Pre-LN block
Main idea. Normalize before sublayer for more stable deep training.
Core relation: $x \leftarrow x + \mathrm{Sublayer}(\mathrm{LN}(x))$; normalizing the input to the sublayer leaves the residual path as a clean identity map.
AI connection. Pre-LN is a key reason very deep transformer stacks are easier to optimize.
5.4 Post-LN block
Main idea. Original transformer normalized after residual addition.
Core relation: $x \leftarrow \mathrm{LN}(x + \mathrm{Sublayer}(x))$, the original arrangement; it typically needs learning-rate warmup to train stably at depth.
5.5 RMSNorm
Main idea. Normalize by root mean square without mean subtraction.
Core relation: $\mathrm{RMSNorm}(x) = \gamma \odot x / \mathrm{RMS}(x)$ with $\mathrm{RMS}(x) = \sqrt{\tfrac{1}{d} \sum_i x_i^2 + \epsilon}$; it drops mean subtraction and the $\beta$ bias.
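A minimal NumPy sketch of both normalizations; note that RMSNorm's only differences are dropping the mean and the bias term:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # No mean subtraction and no bias: just rescale by the root mean square.
    rms = np.sqrt((x**2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

d = 8
x = np.random.randn(d)
print(layer_norm(x, np.ones(d), np.zeros(d)))
print(rms_norm(x, np.ones(d)))
# A Pre-LN residual step would then read: x = x + sublayer(rms_norm(x, gamma))
```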
6. Encoder, Decoder, and Decoder-Only Forms
This part studies encoder, decoder, and decoder-only forms as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.
| Subtopic | Question | Formula |
|---|---|---|
| Encoder self-attention | source tokens attend bidirectionally | $A_{ij}$ allowed for all source positions |
| Decoder causal self-attention | target tokens cannot see future target tokens | $A_{ij} = 0$ for $j > i$ |
| Cross-attention | decoder queries attend to encoder keys and values | $Q$ from decoder, $K, V$ from encoder |
| Encoder-only | BERT-style models produce contextual representations | bidirectional mask |
| Decoder-only | GPT-style models predict next tokens autoregressively | $p(x) = \prod_t p(x_t \mid x_{<t})$ |
6.1 Encoder self-attention
Main idea. Source tokens attend bidirectionally.
Core relation: $A_{ij}$ is allowed for all source positions, so every source token reads full bidirectional context.
6.2 Decoder causal self-attention
Main idea. Target tokens cannot see future target tokens.
Core relation: $A_{ij} = 0$ for $j > i$, enforced by setting future scores to $-\infty$ before the softmax.
6.3 Cross-attention
Main idea. Decoder queries attend to encoder keys and values.
Core relation: queries come from the decoder stream while keys and values come from the encoder output, so every target position can read the whole source sequence.
6.4 Encoder-only
Main idea. BERT-style models produce contextual representations.
Core relation: bidirectional attention over the full input yields one contextual vector per token, consumed by classification, tagging, or retrieval heads.
6.5 Decoder-only
Main idea. GPT-style models predict next tokens autoregressively.
Core relation: $p(x) = \prod_t p(x_t \mid x_{<t})$, implemented with causal self-attention and no encoder.
AI connection. This is the architecture behind most modern autoregressive LLMs.
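A minimal NumPy sketch of the mask ingredients behind these variants, using the convention from the attention sketch above that True means attention is allowed:

```python
import numpy as np

def causal_mask(T):
    # True = allowed: token i may see positions j <= i
    return np.tril(np.ones((T, T), dtype=bool))

def padding_mask(lengths, T):
    # True where the key position is a real token, per sequence: (B, T)
    return np.arange(T)[None, :] < np.asarray(lengths)[:, None]

def decoder_self_attention_mask(lengths, T):
    # Combine: causal AND not-padding, shaped (B, 1, T, T) to broadcast over heads
    return (causal_mask(T)[None, None] &
            padding_mask(lengths, T)[:, None, None, :])

# An encoder would use padding_mask alone; cross-attention masks only source padding.
print(decoder_self_attention_mask([3, 5], 5).shape)  # (2, 1, 5, 5)
```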
7. Positional Information
This part studies positional information as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.
| Subtopic | Question | Formula |
|---|---|---|
| Absolute positions | add learned or fixed position vectors | $h_t = e_t + p_t$ |
| Sinusoidal positions | fixed frequencies encode index information | $p_{t,2i} = \sin(t / 10000^{2i/d})$ |
| Relative bias | attention scores depend on distance | $S_{ij} \leftarrow S_{ij} + b_{j-i}$ |
| RoPE | rotate query and key pairs by position-dependent angles | $q_i^\top k_j$ depends on $i - j$ |
| Length behavior | position design affects long-context extrapolation | design dependent |
7.1 Absolute positions
Main idea. Add learned or fixed position vectors.
Core relation: $h_t = e_t + p_t$, where $p_t$ is a learned or fixed vector indexed by absolute position $t$.
7.2 Sinusoidal positions
Main idea. Fixed frequencies encode index information.
Core relation: $p_{t,2i} = \sin(t / 10000^{2i/d})$ and $p_{t,2i+1} = \cos(t / 10000^{2i/d})$, so each dimension pair oscillates at its own frequency.
7.3 Relative bias
Main idea. Attention scores depend on distance.
Core relation: $S_{ij} \leftarrow S_{ij} + b_{j-i}$, a learned scalar bias indexed by relative offset rather than absolute position.
7.4 RoPE
Main idea. Rotate query and key pairs by position-dependent angles.
Core relation: queries and keys are rotated in two-dimensional pairs by position-dependent angles, so the score $q_i^\top k_j$ depends on position only through the relative offset $i - j$.
7.5 Length behavior
Main idea. Position design affects long-context extrapolation.
Core relation: there is no single formula here; whether a model degrades past its training length is a property of the positional method, which is why relative and rotary schemes and their scaling tricks matter for long context.
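A minimal NumPy sketch of the sinusoidal table from 7.2; learned absolute variants would replace this function with an embedding lookup:

```python
import numpy as np

def sinusoidal_positions(T, d_model):
    # p[t, 2i] = sin(t / 10000^(2i/d)), p[t, 2i+1] = cos(t / 10000^(2i/d))
    pos = np.arange(T)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    P = np.zeros((T, d_model))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

P = sinusoidal_positions(128, 768)
assert P.shape == (128, 768)  # added to token embeddings: h_t = e_t + p_t
```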
8. Complexity and Memory
This part studies complexity and memory as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.
| Subtopic | Question | Formula |
|---|---|---|
| Attention FLOPs | dense attention is quadratic in sequence length | $O(T^2 d_{\text{model}})$ |
| Attention score memory | naive attention materializes T by T scores | $O(B\, n_{\text{heads}}\, T^2)$ |
| MLP FLOPs | MLP cost is linear in sequence length but large in width | $O(T\, d_{\text{model}}\, d_{\text{ff}})$ |
| KV cache | decoder inference stores keys and values per layer | $2\, L\, T\, n_{kv}\, d_{\text{head}}$ values |
| FlashAttention bridge | IO-aware attention computes exact attention without storing all scores | $\mathrm{softmax}(QK^\top)V$ tiled |
8.1 Attention FLOPs
Main idea. Dense attention is quadratic in sequence length.
Core relation:
Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.
Worked micro-example. If and there are 12 attention heads, each head has . The attention score matrix for one head and one sequence of length 128 has shape by . That score matrix is why attention is powerful and also why long-context attention is expensive.
Implementation check. Write the shapes for , , , scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.
AI connection. This is a practical transformer design variable.
Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
8.2 Attention score memory
Main idea. Naive attention materializes a T-by-T score matrix.
Core relation: scores occupy $O(B\, n_{\text{heads}}\, T^2)$ memory per layer, which dominates activation memory at long context.
8.3 MLP FLOPs
Main idea. MLP cost is linear in sequence length but large in width.
Core relation: $O(T\, d_{\text{model}}\, d_{\text{ff}})$ FLOPs per layer, so at short context the MLP, not attention, dominates total compute.
8.4 KV cache
Main idea. Decoder inference stores keys and values per layer.
Core relation: the cache stores $2 \times L \times T \times n_{kv} \times d_{\text{head}}$ values (keys and values, per layer, per position), times the bytes per element.
AI connection. This is the central memory object for fast decoder inference.
8.5 FlashAttention bridge
Main idea. IO-aware attention computes exact attention without storing all scores.
Core relation: $\mathrm{softmax}(QK^\top)V$ is evaluated in tiles with a running softmax, so the full $T \times T$ score matrix never needs to be written to slow memory.
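A back-of-envelope KV cache calculator for 8.4; the layer, head, and context numbers below are illustrative, not taken from any specific model card:

```python
# KV cache size = 2 (keys + values) x layers x positions x kv heads x head dim
n_layers, n_kv_heads, d_head = 32, 8, 128
T, bytes_per_elem = 8192, 2          # fp16/bf16 elements

kv_values = 2 * n_layers * T * n_kv_heads * d_head
print(kv_values * bytes_per_elem / 2**30, "GiB per sequence")  # 1.0 GiB here
```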
9. Training Signals
This part studies training signals as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.
| Subtopic | Question | Formula |
|---|---|---|
| Language-model loss | decoder-only models train by next-token prediction | $\mathcal{L} = -\sum_t \log p_\theta(x_{t+1} \mid x_{\le t})$ |
| Masked LM loss | encoder-only models predict hidden masked tokens | $\mathcal{L} = -\sum_{t \in M} \log p_\theta(x_t \mid x_{\setminus M})$ |
| Teacher forcing | decoder training conditions on gold previous tokens | condition on gold $x_{<t}$ |
| Attention masks | loss and attention masks must agree with the task | mask must match objective |
| Weight tying | input embedding and output projection can share weights | $W_{\text{out}} = E^\top$ |
9.1 Language-model loss
Main idea. Decoder-only models train by next-token prediction.
Core relation: $\mathcal{L} = -\sum_t \log p_\theta(x_{t+1} \mid x_{\le t})$, one cross-entropy term per position, all evaluated in a single parallel pass.
9.2 Masked LM loss
Main idea. Encoder-only models predict hidden masked tokens.
Core relation: $\mathcal{L} = -\sum_{t \in M} \log p_\theta(x_t \mid x_{\setminus M})$, where $M$ is the set of masked positions and $x_{\setminus M}$ is the corrupted input.
9.3 Teacher forcing
Main idea. Decoder training conditions on gold previous tokens.
Core relation: training evaluates $p_\theta(x_t \mid x_{<t})$ with the gold prefix $x_{<t}$ rather than model samples, which is what makes the loss computable in one parallel pass.
9.4 Attention masks
Main idea. Loss and attention masks must agree with the task.
Core relation: the attention mask and the loss mask must encode the same task; for example, pairing bidirectional attention with a next-token loss leaks each target into its own prediction.
9.5 Weight tying
Main idea. Input embedding and output projection can share weights.
Core relation: $W_{\text{out}} = E^\top$ ties the output projection to the input embedding matrix $E$, saving roughly $V \times d_{\text{model}}$ parameters.
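A minimal NumPy sketch of the next-token loss with the off-by-one shift that teacher forcing implies (position $t$ predicts token $t{+}1$):

```python
import numpy as np

def next_token_loss(logits, tokens):
    """logits: (T, V) from a causal model; tokens: (T,) input ids.
    Position t's logits predict token t+1, so shift both by one."""
    logits, targets = logits[:-1], tokens[1:]
    logits = logits - logits.max(axis=-1, keepdims=True)          # stability
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))  # log-softmax
    return -logp[np.arange(len(targets)), targets].mean()

T, V = 6, 100
rng = np.random.default_rng(0)
loss = next_token_loss(rng.normal(size=(T, V)), rng.integers(0, V, size=T))
print(loss)  # roughly log(V) ~ 4.6 for random logits
```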
10. Diagnostics
This part studies diagnostics as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.
| Subtopic | Question | Formula |
|---|---|---|
| Shape checks | track batch, time, heads, head dimension, and model dimension | $(B, T, n_{\text{heads}}, d_{\text{head}})$ |
| Mask tests | future leakage breaks autoregressive training | $A_{ij} = 0$ for $j > i$ |
| Attention entropy | overly sharp or diffuse heads can indicate issues | $H_i = -\sum_j A_{ij} \log A_{ij}$ |
| Residual norms | monitor update size relative to residual stream | $\lVert \Delta x \rVert / \lVert x \rVert$ |
| Ablations | compare head count, MLP width, norm placement, and position method | empirical comparison |
10.1 Shape checks
Main idea. Track batch, time, heads, head dimension, and model dimension.
Core relation: hidden states are $(B, T, d_{\text{model}})$ and per-head tensors are $(B, n_{\text{heads}}, T, d_{\text{head}})$; assert these shapes at every module boundary.
10.2 Mask tests
Main idea. Future leakage breaks autoregressive training.
Core relation: $A_{ij} = 0$ for $j > i$; test it by perturbing a future token and checking that earlier outputs do not change.
AI connection. A single mask bug can make a model appear to learn while leaking future answers.
10.3 Attention entropy
Main idea. Overly sharp or diffuse heads can indicate issues.
Core relation: $H_i = -\sum_j A_{ij} \log A_{ij}$; entropy near zero means a head collapsed onto one position, entropy near $\log T$ means it may not be selecting anything.
10.4 Residual norms
Main idea. Monitor update size relative to residual stream.
Core relation: track $\lVert \Delta x \rVert / \lVert x \rVert$ per block, where $\Delta x$ is the sublayer update added to the residual stream; updates collapsing toward zero or growing with depth flag normalization or initialization problems.
10.5 Ablations
Main idea. Compare head count, MLP width, norm placement, and position method.
Core relation: change one design variable at a time and compare validation loss under matched parameter and compute budgets; most architecture claims rest on this kind of controlled comparison.
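Minimal NumPy sketches of two of these diagnostics; `model_fn` is a hypothetical stand-in for any causal sequence model under test:

```python
import numpy as np

def attention_entropy(A, eps=1e-12):
    # A: (..., T, T) attention weights; returns entropy per query row
    return -(A * np.log(A + eps)).sum(-1)

def leaks_future(model_fn, T, d_model, pos=0, seed=0):
    """Perturb only the last token and check whether an earlier output moves.
    model_fn: any callable mapping (T, d_model) -> (T, d_model)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(T, d_model))
    y1 = model_fn(x)
    x2 = x.copy()
    x2[-1] += 1.0                      # change only the future token
    y2 = model_fn(x2)
    return not np.allclose(y1[pos], y2[pos])  # True indicates a mask bug
```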
Practice Exercises
- Compute scaled dot-product attention for tiny matrices.
- Build a causal mask and apply it to attention scores.
- Split model dimension into heads.
- Count parameters in Q, K, V, and output projections.
- Count MLP parameters and compare to attention parameters.
- Compute LayerNorm and RMSNorm for one vector.
- Compare Pre-LN and Post-LN block equations.
- Compute KV cache memory for a decoder.
- Identify encoder-only, decoder-only, and encoder-decoder masks.
- Write a transformer debugging checklist.
Why This Matters for AI
The transformer is the architectural base of modern LLMs and many multimodal models. Understanding it at the level of shapes, masks, residuals, and costs prevents vague explanations and catches real implementation mistakes.
Bridge to Reinforcement Learning
The next model-specific section studies reinforcement learning. In modern LLM systems, transformer policies are often optimized further with preference or reward-based objectives, so architecture and optimization meet there.
References
- Ashish Vaswani et al., "Attention Is All You Need", 2017: https://arxiv.org/abs/1706.03762
- Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey Hinton, "Layer Normalization", 2016: https://arxiv.org/abs/1607.06450
- Ruibin Xiong et al., "On Layer Normalization in the Transformer Architecture", 2020: https://arxiv.org/abs/2002.04745
- Biao Zhang and Rico Sennrich, "Root Mean Square Layer Normalization", 2019: https://arxiv.org/abs/1910.07467