Converted from exercises.ipynb for web reading.

RNN and LSTM Math: Exercises

Ten exercises cover the recurring mechanics: hidden-state updates, sequence probability, gradient scale under backpropagation through time (BPTT), gradient clipping, LSTM and GRU gates, padding masks, output shapes, attention context, and debugging diagnostics.

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Exercise 1: Vanilla RNN update

Compute one tanh hidden-state update.

Code cell 4

# Your Solution
x = np.array([1.0, -1.0])
h_prev = np.array([0.5, 0.0])
W_xh = np.eye(2)
W_hh = 0.5 * np.eye(2)
print("Starter: h=tanh(W_xh@x + W_hh@h_prev).")

Code cell 5

# Solution
x = np.array([1.0, -1.0])
h_prev = np.array([0.5, 0.0])
W_xh = np.eye(2)
W_hh = 0.5 * np.eye(2)
h = np.tanh(W_xh @ x + W_hh @ h_prev)
print("h:", h)

Exercise 2: Sequence probability

Multiply conditional probabilities for a sequence.

Code cell 7

# Your Solution
probs = np.array([0.8, 0.6, 0.5])
print("Starter: product for probability, sum logs for log probability.")

Code cell 8

# Solution
probs = np.array([0.8, 0.6, 0.5])
p = probs.prod()
logp = np.log(probs).sum()
print("p:", p, "logp:", logp)

Exercise 3: Gradient product

Compute the scalar gradient scale accumulated over 10 steps.

Code cell 10

# Your Solution
scale = 0.8
steps = 10
print("Starter: scale ** steps.")

Code cell 11

# Solution
scale = 0.8
steps = 10
print("gradient scale:", scale ** steps)

Exercise 4: Gradient clipping

Clip the vector [6, 8] to a maximum norm of 5.

Code cell 13

# Your Solution
g = np.array([6.0, 8.0])
print("Starter: multiply by min(1, 5/norm(g)).")

Code cell 14

# Solution
g = np.array([6.0, 8.0])
scale = min(1.0, 5.0 / np.linalg.norm(g))
clipped = g * scale
print("clipped:", clipped, "norm:", np.linalg.norm(clipped))

Exercise 5: LSTM cell update

Compute c_t from gates and candidate.

Code cell 16

# Your Solution
f = np.array([0.9, 0.2])
i = np.array([0.1, 0.8])
c_prev = np.array([1.0, -1.0])
cand = np.array([0.5, 0.25])
print("Starter: c=f*c_prev + i*cand.")

Code cell 17

# Solution
f = np.array([0.9, 0.2])
i = np.array([0.1, 0.8])
c_prev = np.array([1.0, -1.0])
cand = np.array([0.5, 0.25])
c = f * c_prev + i * cand
print("c:", c)

Exercise 6: GRU update

Blend old state and candidate with update gate.

Code cell 19

# Your Solution
z = np.array([0.25, 0.75])
h_prev = np.array([1.0, -1.0])
h_tilde = np.array([0.0, 0.5])
print("Starter: h=(1-z)*h_prev + z*h_tilde.")

Code cell 20

# Solution
z = np.array([0.25, 0.75])
h_prev = np.array([1.0, -1.0])
h_tilde = np.array([0.0, 0.5])
h = (1 - z) * h_prev + z * h_tilde
print("h:", h)

Exercise 7: Masked loss

Average losses over real tokens only.

Code cell 22

# Your Solution
losses = np.array([0.4, 0.6, 0.0])
mask = np.array([1, 1, 0])
print("Starter: sum(losses*mask)/sum(mask).")

Code cell 23

# Solution
losses = np.array([0.4, 0.6, 0.0])
mask = np.array([1, 1, 0])
masked = (losses * mask).sum() / mask.sum()
print("masked loss:", masked)

Exercise 8: Task shapes

Identify the logits shape for a many-to-many task with batch B, sequence length T, and vocabulary V.

Code cell 25

# Your Solution
B, T, V = 3, 5, 100
print("Starter: logits shape is (B,T,V).")

Code cell 26

# Solution
B, T, V = 3, 5, 100
logits_shape = (B, T, V)
print("logits shape:", logits_shape)

Exercise 9: Attention context

Compute attention context from weights and encoder states.

Code cell 28

# Your Solution
weights = np.array([0.2, 0.3, 0.5])
encoder = np.array([[1.,0.], [0.,1.], [1.,1.]])
print("Starter: context = weights @ encoder.")

Code cell 29

# Solution
weights = np.array([0.2, 0.3, 0.5])
encoder = np.array([[1.,0.], [0.,1.], [1.,1.]])
context = weights @ encoder
print("context:", context)

Exercise 10: Debug checklist

Write four RNN diagnostics.

Code cell 31

# Your Solution
print("Starter: include masks, gradient norms, gate stats, and length tests.")

Code cell 32

# Solution
checks = [
    "padding masks are applied before averaging loss",
    "gradient norms are tracked through time",
    "LSTM/GRU gate saturation is monitored",
    "short and long sequence metrics are reported separately",
]
for check in checks:
    print("-", check)

Closing Reflection

RNNs teach the core sequence-learning tension: memory needs long paths, but gradients dislike long paths. Gates and attention are two different answers to that tension.
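
A rough numeric sketch of that tension, with made-up per-step factors: a tanh RNN multiplies derivatives bounded below 1, while a forget gate held near 1 keeps the product alive over the same path length.

T = 50
rnn_factor = 0.9    # made-up per-step derivative magnitude for a tanh RNN
gate_factor = 0.99  # made-up forget-gate value held close to 1
print("RNN path product:  ", rnn_factor ** T)
print("gated path product:", gate_factor ** T)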