Math for LLMs

Transformer Architecture

Math for Specific Models / Transformer Architecture

Notes

The transformer is a stack of attention, feed-forward, residual, normalization, and position mechanisms. Its core advantage is that training can process all token positions in parallel while attention lets each token condition on other tokens.

Overview

The central attention operation is:

\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.

This formula turns token representations into queries, keys, and values. Queries ask, keys match, values carry content. Multi-head attention repeats this operation in several subspaces. Residual blocks and normalization make the stack trainable. Positional information restores order. Masks define which tokens are allowed to interact.

Prerequisites

  • Matrix multiplication and softmax
  • Embeddings and positional encodings
  • Sequence probability and autoregressive masking
  • RNN section for the recurrence-to-attention comparison

Companion Notebooks

Notebook | Purpose
theory.ipynb | Demonstrates attention shapes, causal masks, multi-head splitting, MLP blocks, Pre-LN/Post-LN, positional signals, KV cache size, and diagnostics.
exercises.ipynb | Ten practice problems for attention arithmetic, masks, shapes, parameter counts, and transformer debugging.

Learning Objectives

After this section, you should be able to:

  • Compute scaled dot-product attention from Q, K, and V.
  • Explain multi-head attention and head dimension arithmetic.
  • Distinguish attention token mixing from position-wise MLP channel mixing.
  • Explain residual streams, LayerNorm, Pre-LN, Post-LN, and RMSNorm.
  • Compare encoder-decoder, encoder-only, and decoder-only architectures.
  • Explain positional signals, attention masks, and KV cache memory.
  • Estimate attention and MLP complexity.
  • Build shape, mask, attention-entropy, and residual-norm diagnostics.

Table of Contents

  1. Transformer Design Goal
  2. Scaled Dot-Product Attention
  3. Multi-Head Attention
  4. Feed-Forward Network
  5. Residuals and Normalization
  6. Encoder-Decoder and Decoder-Only Forms
  7. Positional Information
  8. Complexity and Memory
  9. Training Signals
  10. Diagnostics

Block Shape Map

hidden states:       H       shape (B, T, d_model)
queries:             Q       shape (B, heads, T, d_head)
keys:                K       shape (B, heads, T, d_head)
values:              V       shape (B, heads, T, d_head)
attention scores:    S       shape (B, heads, T, T)
attention output:    O       shape (B, T, d_model)
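
Code sketch. A minimal NumPy check of the reshape that produces the per-head shapes above; the toy sizes (B=2, T=4, d_model=8, 2 heads) are assumptions for illustration, not values from the notes.

import numpy as np

B, T, d_model, n_heads = 2, 4, 8, 2      # illustrative toy sizes
d_head = d_model // n_heads

H = np.random.randn(B, T, d_model)       # hidden states (B, T, d_model)
Q = H.reshape(B, T, n_heads, d_head).transpose(0, 2, 1, 3)   # (B, heads, T, d_head)
S = Q @ Q.transpose(0, 1, 3, 2)          # score-shaped tensor (B, heads, T, T)

print(H.shape, Q.shape, S.shape)         # (2, 4, 8) (2, 2, 4, 2) (2, 2, 4, 4)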

1. Transformer Design Goal

This part studies transformer design goal as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.

Subtopic | Question | Formula
Sequence transduction | map one sequence representation to another | x_{1:T}\rightarrow y_{1:M}
Parallel token processing | avoid recurrence during training | H^{(0)}\in\mathbb{R}^{B\times T\times d}
Context mixing | let every token read other tokens through attention | \mathrm{Attention}(Q,K,V)
Position information | inject order because attention alone is permutation equivariant | h_i=x_i+p_i
Stacked blocks | repeat attention and MLP transformations | H^{(\ell+1)}=\mathrm{Block}_\ell(H^{(\ell)})

1.1 Sequence transduction

Main idea. Map one sequence representation to another.

Core relation:

x_{1:T}\rightarrow y_{1:M}

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If d_\mathrm{model}=768 and there are 12 attention heads, each head has d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128 by 128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for Q, K, V, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

1.2 Parallel token processing

Main idea. Avoid recurrence during training.

Core relation:

H^{(0)}\in\mathbb{R}^{B\times T\times d}

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

1.3 Context mixing

Main idea. Let every token read other tokens through attention.

Core relation:

\mathrm{Attention}(Q,K,V)

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

1.4 Position information

Main idea. Inject order because attention alone is permutation equivariant.

Core relation:

h_i=x_i+p_i

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

1.5 Stacked blocks

Main idea. Repeat attention and MLP transformations.

Core relation:

H^{(\ell+1)}=\mathrm{Block}_\ell(H^{(\ell)})

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

2. Scaled Dot-Product Attention

This part studies scaled dot-product attention as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.

Subtopic | Question | Formula
Queries, keys, values | project hidden states into matching and content spaces | Q=HW_Q,\ K=HW_K,\ V=HW_V
Score matrix | compare every query with every key | S=QK^\top/\sqrt{d_k}
Masking | block padding or future positions before softmax | S_{ij}\leftarrow-\infty when masked
Attention weights | softmax gives a distribution over source positions | A=\mathrm{softmax}(S)
Weighted values | mix value vectors by attention weights | O=AV

2.1 Queries keys values

Main idea. Project hidden states into matching and content spaces.

Core relation:

Q=HW_Q,\ K=HW_K,\ V=HW_V

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

2.2 Score matrix

Main idea. Compare every query with every key.

Core relation:

S=QK^\top/\sqrt{d_k}

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is the computation that lets each token ask which other tokens matter.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

2.3 Masking

Main idea. Block padding or future positions before softmax.

Core relation:

S_{ij}\leftarrow-\infty when masked

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
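
Code sketch. A short illustration of building a causal mask and applying it before the softmax; the -1e9 sentinel is an assumption standing in for -\infty.

import numpy as np

T = 5
S = np.random.randn(T, T)                      # raw scores for one head

causal = np.tril(np.ones((T, T), dtype=bool))  # True where j <= i
S = np.where(causal, S, -1e9)                  # block future positions

A = np.exp(S - S.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)          # row-wise softmax

assert np.allclose(np.triu(A, k=1), 0.0)       # no weight lands on future tokens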

2.4 Attention weights

Main idea. Softmax gives a distribution over source positions.

Core relation:

A=\mathrm{softmax}(S)

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

2.5 Weighted values

Main idea. Mix value vectors by attention weights.

Core relation:

O=AV

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
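
Code sketch. A compact single-head implementation of the whole chain in this section, assuming the Q, K, V projections have already been applied; the sizes are illustrative.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    # Q, K, V: (T, d_k); mask: (T, T) boolean, True where attention is allowed.
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)           # score matrix (T, T)
    if mask is not None:
        S = np.where(mask, S, -1e9)      # masking happens before the softmax
    A = softmax(S)                       # attention weights; each row sums to 1
    return A @ V, A                      # weighted values and the weights

T, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d_k)) for _ in range(3))
O, A = attention(Q, K, V, mask=np.tril(np.ones((T, T), dtype=bool)))
print(O.shape, A.sum(axis=-1))           # (4, 8), rows of A sum to 1.0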

3. Multi-Head Attention

This part studies multi-head attention as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.

Subtopic | Question | Formula
Head splitting | several smaller attention heads run in parallel | h=1,\ldots,H
Per-head dimension | model dimension is split across heads | d_h=d_\mathrm{model}/H
Concatenation | head outputs are concatenated and projected | \mathrm{MHA}(H)=\mathrm{Concat}(O_1,\ldots,O_H)W_O
Specialization | different heads can learn different relation patterns | A_h varies by head
GQA and MQA bridge | serving variants share key-value heads | H_{kv}\le H_q

3.1 Head splitting

Main idea. Several smaller attention heads run in parallel.

Core relation:

h=1,\ldots,H

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

3.2 Per-head dimension

Main idea. Model dimension is split across heads.

Core relation:

d_h=d_\mathrm{model}/H

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

3.3 Concatenation

Main idea. Head outputs are concatenated and projected.

Core relation:

\mathrm{MHA}(H)=\mathrm{Concat}(O_1,\ldots,O_H)W_O

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
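
Code sketch. A minimal batched multi-head pass showing the split-heads / attend / concatenate / project sequence; the weight scale and sizes are illustrative assumptions.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

B, T, d_model, H = 2, 4, 8, 2            # illustrative sizes
d_h = d_model // H
rng = np.random.default_rng(0)

X = rng.standard_normal((B, T, d_model))
W_Q, W_K, W_V, W_O = (0.1 * rng.standard_normal((d_model, d_model)) for _ in range(4))

def split_heads(M):                      # (B, T, d_model) -> (B, H, T, d_h)
    return M.reshape(B, T, H, d_h).transpose(0, 2, 1, 3)

Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
S = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_h)       # (B, H, T, T)
O = softmax(S) @ V                                   # (B, H, T, d_h)
O = O.transpose(0, 2, 1, 3).reshape(B, T, d_model)   # concatenate heads
print((O @ W_O).shape)                               # (2, 4, 8)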

3.4 Specialization

Main idea. Different heads can learn different relation patterns.

Core relation:

A_h varies by head

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

3.5 GQA and MQA bridge

Main idea. Serving variants share key-value heads.

Core relation:

H_{kv}\le H_q

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
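
Code sketch. How grouped-query attention shares key-value heads: only H_kv heads are stored, and each is broadcast to a group of query heads for the score computation. The head counts are illustrative (MQA would be H_kv = 1).

import numpy as np

B, T, d_h = 1, 4, 8
H_q, H_kv = 8, 2                          # 8 query heads share 2 KV heads
group = H_q // H_kv                       # query heads per KV head

K_kv = np.random.randn(B, H_kv, T, d_h)   # what the KV cache actually stores
V_kv = np.random.randn(B, H_kv, T, d_h)

K = np.repeat(K_kv, group, axis=1)        # broadcast to (B, H_q, T, d_h) for scoring
V = np.repeat(V_kv, group, axis=1)

print(K.shape)                            # (1, 8, 4, 8); cache shrinks by H_kv / H_q = 1/4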

4. Feed-Forward Network

This part studies feed-forward network as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.

Subtopic | Question | Formula
Position-wise MLP | apply the same MLP to each token independently | \mathrm{FFN}(x)=W_2\,\phi(W_1x+b_1)+b_2
Expansion ratio | hidden width is often larger than model width | d_\mathrm{ff}\approx 4d_\mathrm{model}
Activation | GELU or SwiGLU controls nonlinear feature mixing | \phi
Parameter count | MLP parameters often dominate each block | 2d_\mathrm{model}d_\mathrm{ff}
Token independence | cross-token mixing happens in attention, not the MLP | x_t processed separately

4.1 Position-wise MLP

Main idea. Apply the same MLP to each token independently.

Core relation:

\mathrm{FFN}(x)=W_2\,\phi(W_1x+b_1)+b_2

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
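
Code sketch. A position-wise MLP with a tanh approximation of GELU; weights and sizes are illustrative. The same W1 and W2 act on every position, so nothing mixes across T.

import numpy as np

def gelu(x):                              # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # x: (B, T, d_model); identical weights are applied at every token position.
    return gelu(x @ W1 + b1) @ W2 + b2

B, T, d_model, d_ff = 2, 4, 8, 32
rng = np.random.default_rng(0)
x = rng.standard_normal((B, T, d_model))
W1, b1 = 0.1 * rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.1 * rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

print(ffn(x, W1, b1, W2, b2).shape)       # (2, 4, 8): no mixing across T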

4.2 Expansion ratio

Main idea. Hidden width is often larger than model width.

Core relation:

d_\mathrm{ff}\approx 4d_\mathrm{model}

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

4.3 Activation

Main idea. GELU or SwiGLU controls nonlinear feature mixing.

Core relation:

\phi

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

4.4 Parameter count

Main idea. MLP parameters often dominate each block.

Core relation:

2d_\mathrm{model}d_\mathrm{ff}

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
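
Code sketch. Worked arithmetic for the count above at d_model = 768 with the common 4x expansion; the numbers illustrate 2*d_model*d_ff and ignore biases and normalization parameters.

d_model = 768
d_ff = 4 * d_model                        # 3072

mlp_params = 2 * d_model * d_ff           # W1: d_model x d_ff, W2: d_ff x d_model
attn_params = 4 * d_model * d_model       # W_Q, W_K, W_V, W_O

print(mlp_params, attn_params)            # 4718592 2359296 -> the MLP is ~2x the attention weights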

4.5 Token independence

Main idea. Cross-token mixing happens in attention, not the MLP.

Core relation:

x_t processed separately

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

5. Residuals and Normalization

This part studies residuals and normalization as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.

Subtopic | Question | Formula
Residual stream | blocks add updates to a persistent representation | x\leftarrow x+F(x)
Layer normalization | normalize features within each token | \hat x=(x-\mu)/\sqrt{\sigma^2+\epsilon}
Pre-LN block | normalize before the sublayer for more stable deep training | x\leftarrow x+F(\mathrm{LN}(x))
Post-LN block | original transformer normalized after the residual addition | x\leftarrow\mathrm{LN}(x+F(x))
RMSNorm | normalize by root mean square without mean subtraction | x/\sqrt{\mathrm{mean}(x^2)+\epsilon}

5.1 Residual stream

Main idea. Blocks add updates to a persistent representation.

Core relation:

x\leftarrow x+F(x)

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

5.2 Layer normalization

Main idea. Normalize features within each token.

Core relation:

\hat x=(x-\mu)/\sqrt{\sigma^2+\epsilon}

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

5.3 Pre-LN block

Main idea. Normalize before sublayer for more stable deep training.

Core relation:

x\leftarrow x+F(\mathrm{LN}(x))

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. Pre-LN is a key reason very deep transformer stacks are easier to optimize.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
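
Code sketch. A schematic of the two residual arrangements; layer_norm is the per-token normalization from 5.2 (without the learned gain and bias), and F is a stand-in for either sublayer (attention or MLP), an assumption for illustration.

import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def pre_ln_block(x, F):
    return x + F(layer_norm(x))     # normalize the input to the sublayer

def post_ln_block(x, F):
    return layer_norm(x + F(x))     # normalize after the residual addition

F = lambda h: 0.1 * h               # stand-in sublayer (attention or MLP)
x = np.random.randn(2, 4, 8)
print(pre_ln_block(x, F).shape, post_ln_block(x, F).shape)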

5.4 Post-LN block

Main idea. Original transformer normalized after residual addition.

Core relation:

x\leftarrow\mathrm{LN}(x+F(x))

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

5.5 RMSNorm

Main idea. Normalize by root mean square without mean subtraction.

Core relation:

x/\sqrt{\mathrm{mean}(x^2)+\epsilon}

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
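
Code sketch. LayerNorm and RMSNorm side by side over the feature axis; the learnable scale (and LayerNorm's bias) are omitted for brevity.

import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-5):
    # No mean subtraction: divide by the root mean square of the features.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

x = np.random.randn(2, 4, 8)
print(np.allclose(np.mean(rms_norm(x)**2, axis=-1), 1.0, atol=1e-3))  # True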

6. Encoder-Decoder and Decoder-Only Forms

This part studies encoder-decoder and decoder-only forms as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.

Subtopic | Question | Formula
Encoder self-attention | source tokens attend bidirectionally | A_{ij} allowed for all source positions
Decoder causal self-attention | target tokens cannot see future target tokens | j\le i
Cross-attention | decoder queries attend to encoder keys and values | Q=H_\mathrm{dec}W_Q,\ K,V=H_\mathrm{enc}W_{K,V}
Encoder-only | BERT-style models produce contextual representations | p(\mathrm{masked\ token}\mid x)
Decoder-only | GPT-style models predict next tokens autoregressively | p(x_t\mid x_{<t})

6.1 Encoder self-attention

Main idea. Source tokens attend bidirectionally.

Core relation:

A_{ij} allowed for all source positions

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

6.2 Decoder causal self-attention

Main idea. Target tokens cannot see future target tokens.

Core relation:

j\le i

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

6.3 Cross-attention

Main idea. Decoder queries attend to encoder keys and values.

Core relation:

Q=H_\mathrm{dec}W_Q,\ K,V=H_\mathrm{enc}W_{K,V}

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

6.4 Encoder-only

Main idea. BERT-style models produce contextual representations.

Core relation:

p(\mathrm{masked\ token}\mid x)

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

6.5 Decoder-only

Main idea. GPT-style models predict next tokens autoregressively.

Core relation:

p(x_t\mid x_{<t})

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is the architecture behind most modern autoregressive LLMs.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
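
Code sketch. A toy autoregressive loop that makes the factorization p(x_t | x_{<t}) concrete; toy_logits is a hypothetical stand-in for a decoder-only transformer and may only read the prefix.

import numpy as np

vocab = 10

def toy_logits(prefix):
    # Hypothetical stand-in for a decoder-only model: any function of the prefix only.
    rng = np.random.default_rng(len(prefix))
    return rng.standard_normal(vocab)

tokens = [0]                               # start token
for _ in range(6):
    logits = toy_logits(tokens)            # conditions on x_{<t} only
    tokens.append(int(np.argmax(logits)))  # greedy choice of the next token
print(tokens)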

7. Positional Information

This part studies positional information as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.

Subtopic | Question | Formula
Absolute positions | add learned or fixed position vectors | h_i=x_i+p_i
Sinusoidal positions | fixed frequencies encode index information | \sin(i/\omega_k),\ \cos(i/\omega_k)
Relative bias | attention scores depend on distance | S_{ij}\leftarrow S_{ij}+b_{i-j}
RoPE | rotate query and key pairs by position-dependent angles | q_i^\top k_j depends on i-j
Length behavior | position design affects long-context extrapolation | T_\mathrm{test}>T_\mathrm{train}

7.1 Absolute positions

Main idea. Add learned or fixed position vectors.

Core relation:

h_i=x_i+p_i

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

7.2 Sinusoidal positions

Main idea. Fixed frequencies encode index information.

Core relation:

\sin(i/\omega_k),\ \cos(i/\omega_k)

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
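
Code sketch. Building the fixed sinusoidal table; base 10000 follows the original convention, and d_model is assumed even.

import numpy as np

def sinusoidal_positions(T, d_model, base=10000.0):
    # pe[i, 2k]   = sin(i / base**(2k/d_model))
    # pe[i, 2k+1] = cos(i / base**(2k/d_model))
    pos = np.arange(T)[:, None]
    k = np.arange(0, d_model, 2)[None, :]
    angles = pos / base ** (k / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(T=128, d_model=64)
print(pe.shape)       # (128, 64); added to token embeddings as h_i = x_i + p_i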

7.3 Relative bias

Main idea. Attention scores depend on distance.

Core relation:

S_{ij}\leftarrow S_{ij}+b_{i-j}

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

7.4 RoPE

Main idea. Rotate query and key pairs by position-dependent angles.

Core relation:

q_i^\top k_j depends on i-j

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
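
Code sketch. A minimal RoPE that rotates adjacent feature pairs by position-dependent angles; scores built from rotated q and k then depend on the offset i - j. Shapes and the base are illustrative assumptions.

import numpy as np

def rope(x, base=10000.0):
    # x: (T, d) with d even; rotate each feature pair by a position-dependent angle.
    T, d = x.shape
    pos = np.arange(T)[:, None]                      # (T, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)        # (d/2,)
    ang = pos * freqs                                # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(8, 16)
k = np.random.randn(8, 16)
S = rope(q) @ rope(k).T       # scores now carry relative-position information
print(S.shape)                # (8, 8)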

7.5 Length behavior

Main idea. Position design affects long-context extrapolation.

Core relation:

T_\mathrm{test}>T_\mathrm{train}

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

8. Complexity and Memory

This part studies complexity and memory as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.

Subtopic | Question | Formula
Attention FLOPs | dense attention is quadratic in sequence length | O(BHT^2d_h)
Attention score memory | naive attention materializes a T-by-T score matrix | O(BHT^2)
MLP FLOPs | MLP cost is linear in sequence length but large in width | O(BTd_\mathrm{model}d_\mathrm{ff})
KV cache | decoder inference stores keys and values per layer | M_\mathrm{KV}=2BLTH_{kv}d_hb
FlashAttention bridge | IO-aware attention computes exact attention without storing all scores | \mathrm{softmax}(QK^\top)V tiled

8.1 Attention FLOPs

Main idea. Dense attention is quadratic in sequence length.

Core relation:

O(BHT^2d_h)

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

8.2 Attention score memory

Main idea. Naive attention materializes a T-by-T score matrix.

Core relation:

O(BHT^2)

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

8.3 MLP FLOPs

Main idea. MLP cost is linear in sequence length but large in width.

Core relation:

O(BTd_\mathrm{model}d_\mathrm{ff})

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

8.4 KV cache

Main idea. Decoder inference stores keys and values per layer.

Core relation:

M_\mathrm{KV}=2BLTH_{kv}d_hb

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is the central memory object for fast decoder inference.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
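
Code sketch. Plugging illustrative numbers into M_KV = 2*B*L*T*H_kv*d_h*b; the configuration below is an assumption for the arithmetic, not a specific model.

B, L, T = 1, 32, 8192          # batch, layers, context length (illustrative)
H_kv, d_h = 8, 128             # KV heads and head dimension (illustrative)
bytes_per_elem = 2             # fp16 / bf16

M_kv = 2 * B * L * T * H_kv * d_h * bytes_per_elem
print(M_kv / 2**30, "GiB")     # 1.0 GiB for this configuration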

8.5 FlashAttention bridge

Main idea. IO-aware attention computes exact attention without storing all scores.

Core relation:

\mathrm{softmax}(QK^\top)V computed in tiles

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

9. Training Signals

This part studies training signals as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.

Subtopic | Question | Formula
Language-model loss | decoder-only models train by next-token prediction | L=-\sum_t\log p_\theta(x_t\mid x_{<t})
Masked LM loss | encoder-only models predict hidden masked tokens | L=-\sum_{i\in M}\log p_\theta(x_i\mid x_{\setminus M})
Teacher forcing | decoder training conditions on gold previous tokens | y_{<t}^\star
Attention masks | loss and attention masks must agree with the task | M_\mathrm{attn},\ M_\mathrm{loss}
Weight tying | input embedding and output projection can share weights | W_\mathrm{out}=E^\top

9.1 Language-model loss

Main idea. Decoder-only models train by next-token prediction.

Core relation:

L=-\sum_t\log p_\theta(x_t\mid x_{<t})

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.
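
Code sketch. The next-token cross-entropy with the usual one-position shift between logits and targets; the logits and tokens here are random placeholders.

import numpy as np

def next_token_loss(logits, tokens):
    # logits: (T, vocab); the prediction at position t is scored against x_{t+1}.
    preds, targets = logits[:-1], tokens[1:]
    m = preds.max(axis=-1, keepdims=True)
    log_z = m.squeeze(-1) + np.log(np.exp(preds - m).sum(axis=-1))   # stable logsumexp
    log_p = preds[np.arange(len(targets)), targets] - log_z
    return -log_p.mean()                                             # mean negative log-likelihood

T, vocab = 6, 10
rng = np.random.default_rng(0)
print(next_token_loss(rng.standard_normal((T, vocab)), rng.integers(0, vocab, size=T)))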

9.2 Masked LM loss

Main idea. Encoder-only models predict hidden masked tokens.

Core relation:

L=-\sum_{i\in M}\log p_\theta(x_i\mid x_{\setminus M})

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

9.3 Teacher forcing

Main idea. Decoder training conditions on gold previous tokens.

Core relation:

y_{<t}^\star

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

9.4 Attention masks

Main idea. Loss and attention masks must agree with the task.

Core relation:

M_\mathrm{attn},\ M_\mathrm{loss}

Transformers separate two jobs. Attention mixes information across token positions. The feed-forward network transforms each token independently. Residual connections preserve a running representation, and normalization keeps optimization stable.

Worked micro-example. If dmodel=768d_\mathrm{model}=768 and there are 12 attention heads, each head has dh=64d_h=64. The attention score matrix for one head and one sequence of length 128 has shape 128128 by 128128. That score matrix is why attention is powerful and also why long-context attention is expensive.

Implementation check. Write the shapes for QQ, KK, VV, scores, attention weights, and output before coding. Most transformer bugs are wrong axes or wrong masks.

AI connection. This is a practical transformer design variable.

Common mistake. Do not say the MLP mixes tokens. In a standard transformer block, token mixing happens in attention; the MLP is position-wise.

9.5 Weight tying

Main idea. Input embedding and output projection can share weights.

Core relation:

W_\mathrm{out}=E^\top

The input embedding $E$ maps token ids to $d_\mathrm{model}$-dimensional vectors, and the output projection maps final hidden states to vocabulary logits. Tying sets $W_\mathrm{out}=E^\top$, so one matrix serves both directions.

Worked micro-example. With a vocabulary of 50,000 and $d_\mathrm{model}=768$, the embedding holds $50{,}000\times 768 = 38.4$M parameters; tying removes an equally large output matrix, a substantial fraction of a small model.

Implementation check. After tying, confirm the two layers share storage, so an update to one is an update to the other, rather than holding equal but independent copies.

AI connection. Many language models tie these weights to cut parameter count and memory; the shared matrix also couples the input and output token spaces.

Common mistake. Tying only works when the hidden states feeding the output projection have the same dimension as the embeddings; if the dimensions differ, an extra projection is needed and the matrices cannot be shared directly.
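A minimal sketch of tied output logits (NumPy; the demo vocabulary and dimensions are deliberately small, with the savings printed for the sizes from the worked micro-example):

```python
import numpy as np

vocab, d_model, T = 1_000, 64, 4        # small demo sizes; real models use e.g. 50k x 768
rng = np.random.default_rng(0)

E = rng.standard_normal((vocab, d_model)) * 0.02   # input embedding matrix
h = rng.standard_normal((T, d_model))              # final hidden states for T positions

# Untied: a separate W_out of shape (d_model, vocab) adds vocab * d_model parameters.
# Tied:   reuse the embedding as W_out = E^T, so no extra output matrix is stored.
logits = h @ E.T                                    # shape (T, vocab)
print(logits.shape)

# Savings at the sizes from the worked micro-example above:
print(f"parameters saved by tying at 50,000 x 768: {50_000 * 768:,}")   # 38,400,000
```

In a training framework, tying is usually done by aliasing the same parameter tensor in both layers so that gradient updates flow into a single shared matrix.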

10. Diagnostics

This part studies diagnostics as transformer block math. Keep track of token axes, head axes, masks, residual updates, and where cross-token mixing happens.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Shape checks | track batch, time, heads, head dimension, and model dimension | $(B, H, T, d_h)$ |
| Mask tests | future leakage breaks autoregressive training | $A_{ij}=0$ for $j>i$ |
| Attention entropy | overly sharp or diffuse heads can indicate issues | $H(A_i)$ |
| Residual norms | monitor update size relative to residual stream | $\Vert F(x)\Vert/\Vert x\Vert$ |
| Ablations | compare head count, MLP width, norm placement, and position method | $\Delta L, \Delta T$ |

10.1 Shape checks

Main idea. Track batch, time, heads, head dimension, and model dimension.

Core relation:

(B, H, T, d_h)

Every tensor in a block has a predictable shape: hidden states $(B, T, d_\mathrm{model})$; per-head queries, keys, and values $(B, H, T, d_h)$; scores and attention weights $(B, H, T, T)$; and the merged output $(B, T, d_\mathrm{model})$, with $d_\mathrm{model}=H\cdot d_h$.

Worked micro-example. With $B=4$, $T=128$, $d_\mathrm{model}=768$, and $H=12$, the per-head query tensor has shape $(4, 12, 128, 64)$ and the score tensor has shape $(4, 12, 128, 128)$.

Implementation check. Assert shapes after every projection, reshape, and transpose; a wrong axis often still broadcasts and runs, but the result is meaningless.

AI connection. Shape assertions cost almost nothing at runtime and catch the most common class of transformer bugs before any compute is spent on training.

Common mistake. Merge heads by transposing back to $(B, T, H, d_h)$ and then reshaping to $(B, T, d_\mathrm{model})$; reshaping in the wrong order interleaves head channels and scrambles the output projection.
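A minimal sketch of multi-head attention with shape assertions at each step (NumPy; the batch and model sizes are the illustrative values above, and the projections are random placeholders):

```python
import numpy as np

B, T, d_model, H = 4, 128, 768, 12
d_h = d_model // H
rng = np.random.default_rng(0)

x = rng.standard_normal((B, T, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(3))

def split_heads(t):                     # (B, T, d_model) -> (B, H, T, d_h)
    return t.reshape(B, T, H, d_h).transpose(0, 2, 1, 3)

Q, K, V = (split_heads(x @ W) for W in (Wq, Wk, Wv))
assert Q.shape == (B, H, T, d_h)

S = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_h)     # attention scores
assert S.shape == (B, H, T, T)

A = np.exp(S - S.max(-1, keepdims=True))
A /= A.sum(-1, keepdims=True)                      # attention weights, rows sum to 1

O = (A @ V).transpose(0, 2, 1, 3).reshape(B, T, d_model)   # merge heads back
assert O.shape == (B, T, d_model)
print("all shape checks passed")
```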

10.2 Mask tests

Main idea. Future leakage breaks autoregressive training.

Core relation:

A_{ij}=0 \text{ for } j>i

In a causal decoder the attention weight from query position $i$ to key position $j$ must be exactly zero whenever $j>i$. A mask test checks this numerically after a forward pass; a stronger version perturbs a future token and verifies that outputs at earlier positions do not change.

Worked micro-example. For $T=4$ the weight matrix of every head must be lower triangular, so the six entries above the diagonal must be zero up to floating-point error.

Implementation check. Test at the score level (masked scores are very negative before the softmax) and at the weight level (masked weights are zero after it), and confirm the mask survives any reshaping across heads and batches.

AI connection. A single mask bug can make a model appear to learn while leaking future answers, giving excellent training loss and useless generation.

Common mistake. Do not mask by multiplying scores by zero; a score of zero still receives softmax weight. Mask by adding a large negative value before the softmax.
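A minimal sketch of both mask tests, using a small single-head causal attention defined inline (NumPy; the sequence length, dimension, and projections are illustrative):

```python
import numpy as np

def causal_attention(x, projections):
    """Single-head causal attention over x of shape (T, d)."""
    T, d = x.shape
    Wq, Wk, Wv = projections
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    S = Q @ K.T / np.sqrt(d)
    S = np.where(np.tril(np.ones((T, T), dtype=bool)), S, -1e9)
    A = np.exp(S - S.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)
    return A @ V, A

T, d = 8, 16
rng = np.random.default_rng(0)
proj = tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(3))

x = rng.standard_normal((T, d))
x_perturbed = x.copy()
x_perturbed[5] += 10.0                       # change a "future" token at position 5

out_a, A = causal_attention(x, proj)
out_b, _ = causal_attention(x_perturbed, proj)

assert np.allclose(np.triu(A, k=1), 0.0)     # weight test: no mass above the diagonal
assert np.allclose(out_a[:5], out_b[:5])     # leakage test: positions 0..4 unchanged
print("causal mask passes both tests")
```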

10.3 Attention entropy

Main idea. Overly sharp or diffuse heads can indicate issues.

Core relation:

H(A_i)

Each query position $i$ induces a distribution $A_i$ over key positions, and its entropy $H(A_i)=-\sum_j A_{ij}\log A_{ij}$ measures how spread out the head is. Entropy near zero means the head focuses on a single token; entropy near $\log T$ means it attends almost uniformly.

Worked micro-example. With 128 visible keys the maximum entropy is $\log 128 \approx 4.85$ nats; a head whose average entropy stays near 4.8 for every input is close to uniform and may be doing little useful selection.

Implementation check. Compute entropy over the key axis of the attention weights, then average over queries and batch for each head separately, so a collapsed head is visible rather than averaged away.

AI connection. Entropy profiles across layers and heads are a cheap diagnostic for degenerate heads and for comparing positional or normalization choices.

Common mistake. Under a causal mask, position $i$ only sees $i$ keys, so compare its entropy against $\log i$ rather than against $\log T$.
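A minimal sketch of per-head attention entropy (NumPy; the attention weights here are random placeholders with the correct normalization over the key axis):

```python
import numpy as np

B, H, T = 2, 12, 128
rng = np.random.default_rng(0)

# Placeholder attention weights: random positive values normalized over the key axis.
A = rng.random((B, H, T, T))
A /= A.sum(-1, keepdims=True)

entropy = -(A * np.log(A + 1e-12)).sum(-1)    # H(A_i) per (batch, head, query), in nats
per_head = entropy.mean(axis=(0, 2))          # average over batch and query positions

print(f"max possible entropy log(T) = {np.log(T):.2f} nats")
print(np.round(per_head, 2))                  # one value per head; near log(T) means diffuse
```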

10.4 Residual norms

Main idea. Monitor update size relative to residual stream.

Core relation:

\Vert F(x)\Vert/\Vert x\Vert

Each sublayer writes an update $F(x)$ into the residual stream, producing $x+F(x)$. The ratio $\Vert F(x)\Vert/\Vert x\Vert$ says how much one block changes the stream: values near zero mean the block is nearly a no-op, while very large values mean the stream is being overwritten, which often precedes instability.

Worked micro-example. If the residual stream has norm 30 and the attention sublayer's output has norm 3, the ratio is 0.1; a ratio of 10 at a single layer is worth investigating before it shows up as a loss spike.

Implementation check. Log the ratio per layer and per sublayer (attention and MLP separately) during training; a sudden jump often appears before the loss diverges.

AI connection. Residual-norm ratios, alongside gradient norms, are a common stability signal when monitoring pretraining runs.

Common mistake. Do not compare raw ratios across Pre-LN and Post-LN models; the two placements produce very different residual-stream growth, so compare like with like.
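A minimal sketch of the residual-norm diagnostic for one sublayer update (NumPy; the residual stream and the sublayer output are random placeholders):

```python
import numpy as np

B, T, d_model = 4, 128, 768
rng = np.random.default_rng(0)

x = rng.standard_normal((B, T, d_model))          # residual stream entering the sublayer
F = 0.1 * rng.standard_normal((B, T, d_model))    # sublayer output written into the stream

ratio = np.linalg.norm(F, axis=-1) / np.linalg.norm(x, axis=-1)   # per-token ratio
print(f"mean ||F(x)|| / ||x|| = {ratio.mean():.3f}")              # ~0.1 for this placeholder

x_next = x + F                                     # the residual update itself
```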

10.5 Ablations

Main idea. Compare head count, MLP width, norm placement, and position method.

Core relation:

\Delta L,\ \Delta T

An ablation changes one architectural variable at a time, for example head count, MLP width, norm placement, or the positional-encoding method, and records the change in validation loss $\Delta L$ and throughput $\Delta T$ against a fixed baseline.

Worked micro-example. Widening the MLP from $4 d_\mathrm{model}$ to $8 d_\mathrm{model}$ roughly doubles MLP parameters and FLOPs, so a small loss improvement has to be weighed against that throughput cost.

Implementation check. Hold the tokenizer, data order, sequence length, optimizer settings, and token budget fixed and change exactly one variable per run; otherwise $\Delta L$ cannot be attributed to the architecture.

AI connection. Published architecture choices such as norm placement and positional method are usually justified by exactly this kind of controlled comparison.

Common mistake. Do not compare runs at different token budgets or batch sizes and call the difference an architectural effect; the confound often dominates the signal.
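A minimal sketch of one ablation axis that can be checked before any training: how attention and MLP parameter counts move with head count and MLP width (plain Python; the 4x MLP ratio and the weights-only count, ignoring biases and norms, are simplifying assumptions):

```python
def block_params(d_model: int, n_heads: int, d_ff: int) -> dict:
    """Weight-only parameter counts for one transformer block (biases and norms ignored)."""
    attn = 4 * d_model * d_model          # W_q, W_k, W_v, W_o with n_heads * d_h == d_model
    mlp = 2 * d_model * d_ff              # up- and down-projection
    return {"heads": n_heads, "d_head": d_model // n_heads, "d_ff": d_ff,
            "attn": attn, "mlp": mlp, "total": attn + mlp}

for cfg in (block_params(d_model=768, n_heads=12, d_ff=4 * 768),   # baseline
            block_params(768, 16, 4 * 768),                        # more heads
            block_params(768, 12, 8 * 768)):                       # double MLP width
    print(cfg)
```

With the projections fixed at $d_\mathrm{model}\times d_\mathrm{model}$, changing the head count leaves the attention parameter count unchanged, while doubling the MLP width doubles the MLP parameters; the loss and throughput effects of either change still have to be measured empirically.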


Practice Exercises

  1. Compute scaled dot-product attention for tiny matrices.
  2. Build a causal mask and apply it to attention scores.
  3. Split model dimension into heads.
  4. Count parameters in Q, K, V, and output projections.
  5. Count MLP parameters and compare to attention parameters.
  6. Compute LayerNorm and RMSNorm for one vector.
  7. Compare Pre-LN and Post-LN block equations.
  8. Compute KV cache memory for a decoder.
  9. Identify encoder-only, decoder-only, and encoder-decoder masks.
  10. Write a transformer debugging checklist.

Why This Matters for AI

The transformer is the architectural base of modern LLMs and many multimodal models. Understanding it at the level of shapes, masks, residuals, and costs prevents vague explanations and catches real implementation mistakes.

Bridge to Reinforcement Learning

The next model-specific section studies reinforcement learning. In modern LLM systems, transformer policies are often optimized further with preference or reward-based objectives, so architecture and optimization meet there.

References