
Converted from exercises.ipynb for web reading.

Transformer Architecture: Exercises

Ten exercises cover attention arithmetic, causal masks, head dimensions, parameter counts, normalization, pre-LN residual blocks, KV cache sizing, architecture masks, and debugging.

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Exercise 1: Scaled attention

Compute attention weights for one query.

Code cell 4

# Your Solution
q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
print("Starter: scores=q@K.T/sqrt(2), then softmax.")

Code cell 5

# Solution
q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
scores = q @ K.T / np.sqrt(2)        # scale by sqrt(d_k), with d_k = 2
e = np.exp(scores - scores.max())    # numerically stable softmax
w = e / e.sum()
print("weights:", w)

Exercise 2: Causal mask

Create a 4 by 4 future mask.

Code cell 7

# Your Solution
T = 4
print("Starter: np.triu(np.ones((T,T)), k=1).")

Code cell 8

# Solution
T = 4
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True marks blocked future positions
print(mask.astype(int))
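A hedged follow-up showing one way to use the mask: set blocked positions to -inf in an illustrative score matrix, then softmax each row so future positions receive exactly zero weight.

# Extension (sketch): apply the causal mask to illustrative random scores.
T = 4
mask = np.triu(np.ones((T, T), dtype=bool), k=1)       # True = blocked future position
scores = np.random.randn(T, T)                         # illustrative scores
masked = np.where(mask, -np.inf, scores)               # blocked entries get -inf
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(np.round(weights, 3))                            # upper triangle is exactly zero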

Exercise 3: Head dimension

Compute head dimension.

Code cell 10

# Your Solution
d_model = 1024
heads = 16
print("Starter: d_model // heads.")

Code cell 11

# Solution
d_model = 1024
heads = 16
d_head = d_model // heads
print("d_head:", d_head)

Exercise 4: Attention params

Count the Q, K, V, and O projection parameters.

Code cell 13

# Your Solution
d = 512
print("Starter: 4*d*d.")

Code cell 14

# Solution
d = 512
params = 4 * d * d   # Q, K, V, and O projections, biases excluded
print("params:", params)

Exercise 5: MLP params

Count the parameters of two linear layers with d_ff = 4*d.

Code cell 16

# Your Solution
d = 512
d_ff = 4 * d
print("Starter: 2*d*d_ff.")

Code cell 17

# Solution
d = 512
d_ff = 4 * d
params = 2 * d * d_ff   # up-projection and down-projection, biases excluded
print("params:", params)

Exercise 6: LayerNorm

Normalize one vector.

Code cell 19

# Your Solution
x = np.array([1.0, 2.0, 3.0])
print("Starter: subtract mean and divide by std.")

Code cell 20

# Solution
x = np.array([1.0, 2.0, 3.0])
y = (x - x.mean()) / np.sqrt(x.var() + 1e-5)   # 1e-5 guards against division by zero
print("LayerNorm:", y)

Exercise 7: Pre-LN block

Compute x + F(LN(x)) for simple F.

Code cell 22

# Your Solution
x = np.array([1.0, -1.0])
print("Starter: define LN, then add 0.5*LN(x).")

Code cell 23

# Solution
x = np.array([1.0, -1.0])
ln = (x - x.mean()) / np.sqrt(x.var() + 1e-5)   # normalize first (pre-LN)
y = x + 0.5 * ln                                # residual add, with F(z) = 0.5 * z
print("y:", y)

Exercise 8: KV cache

Compute KV cache bytes.

Code cell 25

# Your Solution
B, L, T, H_kv, d_h, b = 2, 4, 128, 8, 64, 2
print("Starter: 2*B*L*T*H_kv*d_h*b.")

Code cell 26

# Solution
B, L, T, H_kv, d_h, b = 2, 4, 128, 8, 64, 2
bytes_total = 2 * B * L * T * H_kv * d_h * b   # factor 2: one K and one V tensor per layer
print("bytes:", bytes_total)

Exercise 9: Architecture masks

State which mask a decoder-only LM uses.

Code cell 28

# Your Solution
print("Starter: decoder-only next-token models use causal self-attention.")

Code cell 29

# Solution
mask_type = "causal self-attention mask plus padding mask when padded"
print(mask_type)
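A hedged sketch of how the two masks combine: the causal mask blocks future positions, and the padding mask blocks padded key positions for every query. The padded position below is illustrative.

# Extension (sketch): combine a causal mask with an illustrative padding mask.
T = 4
causal = np.triu(np.ones((T, T), dtype=bool), k=1)     # True = future position blocked
padding = np.array([False, False, False, True])        # last token is padding (illustrative)
combined = causal | padding[None, :]                   # block future and padded keys
print(combined.astype(int))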

Exercise 10: Debug checklist

Write four transformer checks.

Code cell 31

# Your Solution
print("Starter: include shapes, masks, head dimension, and KV cache.")

Code cell 32

# Solution
checks = [
    "Q,K,V shapes are correct",
    "causal and padding masks are tested",
    "d_model is divisible by number of heads",
    "KV cache memory is estimated for serving",
]
for check in checks:
    print("-", check)

Closing Reflection

Transformer implementation is mostly shape, mask, residual, and cost discipline. The equations are compact; the axes are where mistakes hide.