Exercises Notebook
Converted from exercises.ipynb for web reading.
Transformer Architecture: Exercises
Ten exercises cover attention arithmetic, causal masks, head dimensions, parameter counts, layer normalization, Pre-LN residual blocks, the KV cache, architecture-level masking, and debugging.
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize": (10, 6),
    "figure.dpi": 120,
    "font.size": 13,
    "axes.titlesize": 15,
    "axes.labelsize": 13,
    "xtick.labelsize": 11,
    "ytick.labelsize": 11,
    "legend.fontsize": 11,
    "legend.framealpha": 0.85,
    "lines.linewidth": 2.0,
    "axes.spines.top": False,
    "axes.spines.right": False,
    "savefig.bbox": "tight",
    "savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Exercise 1: Scaled attention
Compute attention weights for one query.
Code cell 4
# Your Solution
q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
print("Starter: scores=q@K.T/sqrt(2), then softmax.")
Code cell 5
# Solution
q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
scores = q @ K.T / np.sqrt(2)       # scaled dot-product scores
e = np.exp(scores - scores.max())   # subtract the max for numerical stability
w = e / e.sum()                     # softmax over the keys
print("weights:", w)
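As an optional extra check, beyond what the exercise asks and assuming the solution cell above has run: the weights should sum to 1, and multiplying them with a value matrix gives the attended output. The V below is an illustrative choice, not part of the exercise.
V = np.array([[10.0, 0.0], [0.0, 10.0]])  # illustrative value matrix, not from the exercise
assert np.isclose(w.sum(), 1.0)           # softmax weights form a probability distribution
print("attention output:", w @ V)         # weighted combination of the value rows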
Exercise 2: Causal mask
Create a 4 by 4 future mask.
Code cell 7
# Your Solution
T = 4
print("Starter: np.triu(np.ones((T,T)), k=1).")
Code cell 8
# Solution
T = 4
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
print(mask.astype(int))
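To see the mask in action, one common pattern, sketched here with random scores rather than anything from the exercise and reusing T and mask from the solution cell, is to set masked positions to a large negative number before the softmax so they receive zero weight.
scores = np.random.randn(T, T)                           # illustrative attention scores
masked = np.where(mask, -1e9, scores)                    # block attention to future positions
attn = np.exp(masked - masked.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)                 # row-wise softmax
print(np.round(attn, 3))                                 # upper triangle is ~0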
Exercise 3: Head dimension
Compute head dimension.
Code cell 10
# Your Solution
d_model = 1024
heads = 16
print("Starter: d_model // heads.")
Code cell 11
# Solution
d_model = 1024
heads = 16
d_head = d_model // heads
print("d_head:", d_head)
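A small guard worth adding in practice, though the exercise does not ask for it: d_model must divide evenly by the head count, otherwise the per-head split is ill-defined.
assert d_model % heads == 0, "d_model must be divisible by the number of heads"
print("heads * d_head recovers d_model:", heads * d_head == d_model)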
Exercise 4: Attention params
Count Q,K,V,O projection parameters.
Code cell 13
# Your Solution
d = 512
print("Starter: 4*d*d.")
Code cell 14
# Solution
d = 512
params = 4 * d * d
print("params:", params)
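The count above assumes bias-free projections. If each of the four projections also carries a bias vector, a common but not universal choice, the total picks up an extra 4 * d parameters, as sketched below.
params_with_bias = 4 * d * d + 4 * d  # one bias vector of length d per projection (assumption)
print("params with biases:", params_with_bias)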
Exercise 5: MLP params
Count two linear layers with d_ff=4d.
Code cell 16
# Your Solution
d = 512
d_ff = 4 * d
print("Starter: 2*d*d_ff.")
Code cell 17
# Solution
d = 512
d_ff = 4 * d
params = 2 * d * d_ff
print("params:", params)
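Written out per layer, the same arithmetic is an up-projection of d x d_ff followed by a down-projection of d_ff x d; bias vectors, if present, add d_ff + d more. This is a sketch of the standard two-layer MLP under the exercise's assumptions.
up, down = d * d_ff, d_ff * d                # the two linear layers of the MLP
print("weights only:", up + down)            # equals 2 * d * d_ff
print("with biases:", up + down + d_ff + d)  # only if bias terms are used (assumption)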
Exercise 6: LayerNorm
Normalize one vector.
Code cell 19
# Your Solution
x = np.array([1.0, 2.0, 3.0])
print("Starter: subtract mean and divide by std.")
Code cell 20
# Solution
x = np.array([1.0, 2.0, 3.0])
y = (x - x.mean()) / np.sqrt(x.var() + 1e-5)
print("LayerNorm:", y)
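A follow-up check, assuming the solution cell above has run: the normalized vector should have near-zero mean and unit variance. In a trained LayerNorm a learned scale and shift would then be applied; placeholder values stand in for them here.
print("mean:", round(y.mean(), 6), "var:", round(y.var(), 6))  # ~0 and ~1
gamma, beta = np.ones_like(y), np.zeros_like(y)                # placeholder learned parameters
print("scaled and shifted:", gamma * y + beta)                 # identity here; learned in practice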
Exercise 7: Pre-LN block
Compute x + F(LN(x)) for simple F.
Code cell 22
# Your Solution
x = np.array([1.0, -1.0])
print("Starter: define LN, then add 0.5*LN(x).")
Code cell 23
# Solution
x = np.array([1.0, -1.0])
ln = (x - x.mean()) / np.sqrt(x.var() + 1e-5)
y = x + 0.5 * ln
print("y:", y)
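For contrast, though the exercise only asks for Pre-LN: the Post-LN ordering normalizes after the residual addition. With the same toy F(z) = 0.5 * z and the x from the solution cell, the two orderings give different results, as this sketch shows.
def ln(z):
    return (z - z.mean()) / np.sqrt(z.var() + 1e-5)

y_pre = x + 0.5 * ln(x)     # Pre-LN: x + F(LN(x))
y_post = ln(x + 0.5 * x)    # Post-LN: LN(x + F(x))
print("pre-LN :", y_pre)
print("post-LN:", y_post)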
Exercise 8: KV cache
Compute KV cache bytes.
Code cell 25
# Your Solution
B, L, T, H_kv, d_h, b = 2, 4, 128, 8, 64, 2
print("Starter: 2*B*L*T*H_kv*d_h*b.")
Code cell 26
# Solution
B, L, T, H_kv, d_h, b = 2, 4, 128, 8, 64, 2   # batch, layers, seq length, KV heads, head dim, bytes per element
bytes_total = 2 * B * L * T * H_kv * d_h * b  # factor of 2 for keys and values
print("bytes:", bytes_total)
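To make the raw byte count easier to read, a small extension beyond the exercise (assuming the solution cell above has run) converts it to mebibytes and shows the linear growth with sequence length, using illustrative longer contexts.
print(f"{bytes_total / 2**20:.2f} MiB at T={T}")
for T_long in (1024, 4096):                          # illustrative longer contexts
    longer = 2 * B * L * T_long * H_kv * d_h * b
    print(f"{longer / 2**20:.2f} MiB at T={T_long}")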
Exercise 9: Architecture masks
State which mask a decoder-only LM uses.
Code cell 28
# Your Solution
print("Starter: decoder-only next-token models use causal self-attention.")
Code cell 29
# Solution
mask_type = "causal self-attention mask, plus a padding mask when inputs are padded"
print(mask_type)
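As a small sketch of how the two masks combine, with illustrative shapes rather than anything prescribed by the exercise: a key position is blocked if it lies in the future or is padding.
T = 4
causal = np.triu(np.ones((T, T), dtype=bool), k=1)   # block future positions
padding = np.array([False, False, False, True])      # example: last token is padding
blocked = causal | padding[None, :]                  # block padded keys for every query
print(blocked.astype(int))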
Exercise 10: Debug checklist
Write four transformer checks.
Code cell 31
# Your Solution
print("Starter: include shapes, masks, head dimension, and KV cache.")
Code cell 32
# Solution
checks = [
    "Q,K,V shapes are correct",
    "causal and padding masks are tested",
    "d_model is divisible by number of heads",
    "KV cache memory is estimated for serving",
]
for check in checks:
    print("-", check)
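One way to make the checklist executable, sketched here with an illustrative configuration rather than anything from the exercise: turn each item into an assertion or an estimate that runs before training or serving.
cfg = {"d_model": 512, "heads": 8, "T": 128, "layers": 6, "batch": 1}  # illustrative config
assert cfg["d_model"] % cfg["heads"] == 0                    # head-dimension check
d_head = cfg["d_model"] // cfg["heads"]
print("expected Q/K/V shape:", (cfg["batch"], cfg["heads"], cfg["T"], d_head))
mask = np.triu(np.ones((cfg["T"], cfg["T"]), dtype=bool), k=1)
assert mask[0, -1] and not mask[-1, 0]                       # future blocked, past visible
kv_bytes = 2 * cfg["batch"] * cfg["layers"] * cfg["T"] * cfg["heads"] * d_head * 2
print("KV cache estimate:", kv_bytes / 2**20, "MiB")         # assumes fp16 and full MHA caching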
Closing Reflection
Transformer implementation is mostly shape, mask, residual, and cost discipline. The equations are compact; the axes are where mistakes hide.