Math for LLMs

RNN and LSTM Math


Notes

Recurrent neural networks model sequences by reusing the same transition at every time step. Their power is memory; their difficulty is training through long chains of time.

Overview

The basic recurrent equation is:

h_t = f_\theta(h_{t-1}, x_t)

This state update lets a model process variable-length sequences. But training the model means backpropagating through the unrolled computation graph. Long products of recurrent Jacobians can vanish or explode. LSTM and GRU cells add gates and additive memory paths to make sequence learning more stable.

Prerequisites

  • Matrix multiplication and nonlinear activations
  • Chain rule and backpropagation
  • Cross-entropy for sequence prediction
  • Basic transformer attention intuition for the bridge section

Companion Notebooks

Notebook | Purpose
theory.ipynb | Demonstrates vanilla recurrence, BPTT gradient products, clipping, LSTM/GRU gates, masks, teacher forcing, attention weights, and diagnostics.
exercises.ipynb | Ten practice problems for recurrent equations, BPTT, gates, masking, and sequence-task shapes.

Learning Objectives

After this section, you should be able to:

  • Write vanilla RNN, LSTM, and GRU update equations.
  • Explain recurrence as a parameter-shared deep network along time.
  • Derive why gradients vanish or explode through repeated Jacobian products.
  • Apply gradient clipping and truncated BPTT.
  • Interpret LSTM forget/input/output gates and GRU update/reset gates.
  • Distinguish many-to-one, many-to-many, seq2seq, and bidirectional tasks.
  • Explain why attention was introduced to reduce the fixed-context bottleneck.
  • Build diagnostics for masks, gate saturation, length generalization, and gradient norms.

Table of Contents

  1. Sequence Modeling View
  2. Vanilla RNN Equations
  3. Backpropagation Through Time
  4. Vanishing and Exploding Gradients
  5. LSTM Cell
  6. GRU Cell
  7. Sequence Tasks
  8. Attention Bridge
  9. Training Practice
  10. Diagnostics

Shape Map

input sequence:       X      shape (B, T, d_x)
hidden state:         h_t    shape (B, d_h)
hidden sequence:      H      shape (B, T, d_h)
logits per token:     O      shape (B, T, |V|)
padding mask:         M      shape (B, T)
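
A quick shape check, as a minimal NumPy sketch; the concrete sizes below are illustrative assumptions rather than values fixed by this section:

```python
import numpy as np

B, T, d_x, d_h, V = 4, 7, 16, 32, 100          # illustrative sizes

X = np.random.randn(B, T, d_x)                 # input sequence
h_t = np.zeros((B, d_h))                       # one hidden state
H = np.zeros((B, T, d_h))                      # full hidden sequence
O = np.zeros((B, T, V))                        # logits per token
lengths = np.array([7, 5, 3, 6])               # true lengths within the batch
M = (np.arange(T)[None, :] < lengths[:, None]).astype(float)  # padding mask

assert X.shape == (B, T, d_x) and H.shape == (B, T, d_h)
assert O.shape == (B, T, V) and M.shape == (B, T)
```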

1. Sequence Modeling View

This part studies the sequence-modeling view through the lens of sequence learning. The central question is how information and gradients move through time.

Subtopic | Question | Formula
Sequential data | observations arrive with order and history | x_{1:T} = (x_1, \ldots, x_T)
Hidden state | the model carries a summary of the past | h_t = f_\theta(h_{t-1}, x_t)
Output distribution | each state can produce a prediction | p(y_t \mid x_{\le t}) = g_\theta(h_t)
Autoregressive generation | the previous output can become the next input | p(x_{1:T}) = \prod_t p(x_t \mid x_{<t})
RNN versus transformer | RNNs compress history into a state, transformers keep token states visible | h_t versus H_{1:t}

1.1 Sequential data

Main idea. Observations arrive with order and history.

Core relation:

x_{1:T} = (x_1, \ldots, x_T)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

1.2 Hidden state

Main idea. The model carries a summary of the past.

Core relation:

h_t = f_\theta(h_{t-1}, x_t)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

1.3 Output distribution

Main idea. Each state can produce a prediction.

Core relation:

p(y_t \mid x_{\le t}) = g_\theta(h_t)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

1.4 Autoregressive generation

Main idea. The previous output can become the next input.

Core relation:

p(x_{1:T}) = \prod_t p(x_t \mid x_{<t})

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
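
A minimal PyTorch sketch of autoregressive generation with a recurrent cell, where the sampled token is fed back as the next input; the module sizes and the start-token id are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab, d_h = 30, 16                              # illustrative sizes
embed = nn.Embedding(vocab, d_h)
cell = nn.RNNCell(d_h, d_h)
head = nn.Linear(d_h, vocab)

h = torch.zeros(1, d_h)
tok = torch.zeros(1, dtype=torch.long)           # assumed start-token id 0
sample = []
for _ in range(10):
    h = cell(embed(tok), h)                      # h_t = f_theta(h_{t-1}, x_t)
    probs = torch.softmax(head(h), dim=-1)       # p(x_t | x_{<t})
    tok = torch.multinomial(probs, num_samples=1).squeeze(-1)
    sample.append(int(tok))                      # previous output becomes the next input
print(sample)
```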

1.5 RNN versus transformer

Main idea. RNNs compress history into a state; transformers keep token states visible.

Core relation:

h_t versus H_{1:t}

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

2. Vanilla RNN Equations

This part studies the vanilla RNN equations through the lens of sequence learning. The central question is how information and gradients move through time.

Subtopic | Question | Formula
State update | combine current input and previous hidden state | h_t = \phi(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
Output head | map hidden state to logits or regression output | o_t = W_{hy} h_t + b_y
Parameter sharing | the same weights are reused at every time step | \theta_t = \theta
Unrolled graph | an RNN is a deep network along time | h_1 \rightarrow h_2 \rightarrow \cdots \rightarrow h_T
Initial state | start from zeros or a learned state | h_0 = 0 or trainable

2.1 State update

Main idea. Combine current input and previous hidden state.

Core relation:

h_t = \phi(W_{xh} x_t + W_{hh} h_{t-1} + b_h)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
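
A minimal NumPy sketch of the unrolled state update, using tanh for phi; the weight shapes and batch sizes are assumptions chosen only for illustration:

```python
import numpy as np

def rnn_forward(X, W_xh, W_hh, b_h, h0=None):
    """Unroll h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) over a batch-first input X."""
    B, T, d_x = X.shape
    d_h = W_hh.shape[0]
    h = np.zeros((B, d_h)) if h0 is None else h0
    H = np.zeros((B, T, d_h))
    for t in range(T):
        h = np.tanh(X[:, t] @ W_xh.T + h @ W_hh.T + b_h)   # (B, d_h)
        H[:, t] = h
    return H, h                                             # hidden sequence and final state

# usage sketch with illustrative sizes
B, T, d_x, d_h = 2, 5, 3, 4
rng = np.random.default_rng(0)
H, h_T = rnn_forward(rng.normal(size=(B, T, d_x)),
                     rng.normal(size=(d_h, d_x)) * 0.1,
                     rng.normal(size=(d_h, d_h)) * 0.1,
                     np.zeros(d_h))
```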

2.2 Output head

Main idea. Map hidden state to logits or regression output.

Core relation:

o_t = W_{hy} h_t + b_y

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

2.3 Parameter sharing

Main idea. The same weights are reused at every time step.

Core relation:

\theta_t = \theta

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

2.4 Unrolled graph

Main idea. An RNN is a deep network along time.

Core relation:

h_1 \rightarrow h_2 \rightarrow \cdots \rightarrow h_T

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

2.5 Initial state

Main idea. Start from zeros or a learned state.

Core relation:

h_0 = 0 or trainable

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

3. Backpropagation Through Time

This part studies backpropagation through time through the lens of sequence learning. The central question is how information and gradients move through time.

Subtopic | Question | Formula
Sequence loss | sum or average losses over time | L = \sum_{t=1}^{T} \ell_t
Temporal chain rule | future losses depend on earlier states through recurrence | \partial L/\partial h_t = \partial \ell_t/\partial h_t + (\partial h_{t+1}/\partial h_t)^\top \partial L/\partial h_{t+1}
Weight gradient | shared weights collect gradient from every time step | \partial L/\partial W = \sum_t \partial L_t/\partial W
Truncated BPTT | backpropagate through a limited window for efficiency | t-k, \ldots, t
State detachment | detach hidden state between chunks to control graph length | h_t \leftarrow \mathrm{stopgrad}(h_t)

3.1 Sequence loss

Main idea. Sum or average losses over time.

Core relation:

L = \sum_{t=1}^{T} \ell_t

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

3.2 Temporal chain rule

Main idea. Future losses depend on earlier states through recurrence.

Core relation:

\partial L/\partial h_t = \partial \ell_t/\partial h_t + (\partial h_{t+1}/\partial h_t)^\top \partial L/\partial h_{t+1}

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is why RNN training is training a very deep network along time.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
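
A tiny worked sketch of the temporal chain rule for a scalar linear RNN h_t = w h_{t-1} + x_t with loss L = 0.5 * sum_t h_t^2; both choices are assumptions made so the backward pass fits in a few lines, and the accumulated dL/dw is checked against a finite difference:

```python
import numpy as np

def forward_backward(w, x):
    """Forward pass plus manual BPTT for h_t = w*h_{t-1} + x_t, L = 0.5 * sum_t h_t**2."""
    T = len(x)
    h = np.zeros(T)
    prev = 0.0
    for t in range(T):
        h[t] = w * prev + x[t]
        prev = h[t]
    dL_dw = 0.0
    dL_dh_next = 0.0
    for t in reversed(range(T)):
        dL_dh = h[t] + w * dL_dh_next          # local term + term flowing back from t+1
        h_prev = h[t - 1] if t > 0 else 0.0
        dL_dw += dL_dh * h_prev                # the shared weight collects gradient at every step
        dL_dh_next = dL_dh
    return 0.5 * np.sum(h ** 2), dL_dw

x = np.array([1.0, -0.5, 0.3, 0.8])
w = 0.9
L, g = forward_backward(w, x)
eps = 1e-6
L_plus, _ = forward_backward(w + eps, x)
print(g, (L_plus - L) / eps)                   # analytic and finite-difference gradients agree
```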

3.3 Weight gradient

Main idea. Shared weights collect gradient from every time step.

Core relation:

\partial L/\partial W = \sum_t \partial L_t/\partial W

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

3.4 Truncated BPTT

Main idea. Backpropagate through a limited window for efficiency.

Core relation:

t-k, \ldots, t

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

3.5 State detachment

Main idea. Detach hidden state between chunks to control graph length.

Core relation:

h_t \leftarrow \mathrm{stopgrad}(h_t)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
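
A truncated-BPTT sketch in PyTorch, assuming nn.RNN and synthetic data: the hidden state is carried across chunks but detached so the backward graph never grows past the window k:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

stream = torch.randn(4, 100, 8)    # (B, T_total, d_x), illustrative data
target = torch.randn(4, 100, 1)
k = 20                             # truncation window
h = None
for start in range(0, stream.size(1), k):
    chunk = stream[:, start:start + k]
    out, h = rnn(chunk, h)
    loss = ((head(out) - target[:, start:start + k]) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()                 # stopgrad: gradients do not flow across chunk boundaries
```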

4. Vanishing and Exploding Gradients

This part studies vanishing and exploding gradients through the lens of sequence learning. The central question is how information and gradients move through time.

Subtopic | Question | Formula
Jacobian product | long-range gradients multiply many recurrent Jacobians | \prod_{i=s+1}^{t} \partial h_i/\partial h_{i-1}
Spectral radius | gradient scale depends on recurrent dynamics | \rho(W_{hh})
Vanishing | singular values below one shrink long-range gradients | \Vert g_s \Vert \rightarrow 0
Exploding | singular values above one can blow up gradients | \Vert g_s \Vert \rightarrow \infty
Gradient clipping | cap gradient norm before optimizer update | g \leftarrow g \min(1, c/\Vert g \Vert)

4.1 Jacobian product

Main idea. Long-range gradients multiply many recurrent Jacobians.

Core relation:

\prod_{i=s+1}^{t} \partial h_i/\partial h_{i-1}

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
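
A minimal NumPy sketch of the Jacobian-product effect: with W = sigma * Q for an orthogonal Q (an assumption that makes the norms exact), the 20-step product has spectral norm sigma^{20}, matching the 0.8 versus 1.2 micro-example:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))     # orthogonal factor

for sigma in (0.8, 1.2):
    W = sigma * Q
    P = np.linalg.matrix_power(W, 20)              # stand-in for prod_i dh_i/dh_{i-1} in a linear RNN
    print(sigma, np.linalg.norm(P, 2), sigma ** 20)  # spectral norm of the product equals sigma**20
```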

4.2 Spectral radius

Main idea. Gradient scale depends on recurrent dynamics.

Core relation:

\rho(W_{hh})

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

4.3 Vanishing

Main idea. Singular values below one shrink long-range gradients.

Core relation:

\Vert g_s \Vert \rightarrow 0

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

4.4 Exploding

Main idea. Singular values above one can blow up gradients.

Core relation:

\Vert g_s \Vert \rightarrow \infty

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

4.5 Gradient clipping

Main idea. Cap gradient norm before optimizer update.

Core relation:

g \leftarrow g \min(1, c/\Vert g \Vert)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. Clipping became a standard tool because recurrent gradient products can explode suddenly.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
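
A short PyTorch sketch of norm clipping placed between the backward pass and the optimizer step; the model and loss are placeholders chosen only to produce gradients:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=32, batch_first=True)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 50, 8)
out, _ = model(x)
loss = out.pow(2).mean()                 # placeholder loss, just to get gradients

opt.zero_grad()
loss.backward()
c = 1.0
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=c)
opt.step()                               # update uses g * min(1, c / ||g||)
print(float(total_norm))                 # gradient norm measured before clipping
```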

5. LSTM Cell

This part studies the LSTM cell through the lens of sequence learning. The central question is how information and gradients move through time.

Subtopic | Question | Formula
Forget gate | decide what memory to keep | f_t = \sigma(W_f[x_t, h_{t-1}] + b_f)
Input gate | decide what new content to write | i_t = \sigma(W_i[x_t, h_{t-1}] + b_i)
Candidate memory | propose new cell content | \tilde c_t = \tanh(W_c[x_t, h_{t-1}] + b_c)
Cell update | additive memory path improves gradient flow | c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t
Output gate | expose part of cell memory as hidden state | h_t = o_t \odot \tanh(c_t)

5.1 Forget gate

Main idea. Decide what memory to keep.

Core relation:

f_t = \sigma(W_f[x_t, h_{t-1}] + b_f)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

5.2 Input gate

Main idea. Decide what new content to write.

Core relation:

i_t = \sigma(W_i[x_t, h_{t-1}] + b_i)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

5.3 Candidate memory

Main idea. Propose new cell content.

Core relation:

\tilde c_t = \tanh(W_c[x_t, h_{t-1}] + b_c)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

5.4 Cell update

Main idea. Additive memory path improves gradient flow.

Core relation:

c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. The additive cell path is the mathematical reason LSTMs can carry information longer than a plain tanh RNN.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

5.5 Output gate

Main idea. Expose part of cell memory as hidden state.

Core relation:

h_t = o_t \odot \tanh(c_t)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
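
A minimal NumPy sketch of one LSTM step built from the gate relations above; the output gate o_t is assumed to take the same sigmoid form sigma(W_o[x_t, h_{t-1}] + b_o) as the other gates, and all sizes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step; every weight matrix acts on the concatenation [x_t, h_prev]."""
    z = np.concatenate([x_t, h_prev], axis=-1)      # (B, d_x + d_h)
    f_t = sigmoid(z @ W_f.T + b_f)                  # forget gate
    i_t = sigmoid(z @ W_i.T + b_i)                  # input gate
    c_tilde = np.tanh(z @ W_c.T + b_c)              # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde              # additive cell update
    o_t = sigmoid(z @ W_o.T + b_o)                  # output gate (assumed sigmoid form)
    h_t = o_t * np.tanh(c_t)                        # exposed hidden state
    return h_t, c_t

# usage sketch with illustrative sizes
B, d_x, d_h = 2, 3, 4
rng = np.random.default_rng(0)
W = lambda: rng.normal(size=(d_h, d_x + d_h)) * 0.1
b = np.zeros(d_h)
h, c = np.zeros((B, d_h)), np.zeros((B, d_h))
h, c = lstm_step(rng.normal(size=(B, d_x)), h, c, W(), W(), W(), W(), b, b, b, b)
```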

6. GRU Cell

This part studies the GRU cell through the lens of sequence learning. The central question is how information and gradients move through time.

Subtopic | Question | Formula
Update gate | interpolate old and candidate states | z_t = \sigma(W_z[x_t, h_{t-1}])
Reset gate | control how much past enters the candidate | r_t = \sigma(W_r[x_t, h_{t-1}])
Candidate hidden | build proposed new hidden state | \tilde h_t = \tanh(W_h[x_t, r_t \odot h_{t-1}])
Hidden update | blend old state and candidate | h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t
GRU versus LSTM | GRU is smaller; LSTM has separate cell and hidden state | h_t only versus (c_t, h_t)

6.1 Update gate

Main idea. Interpolate old and candidate states.

Core relation:

z_t = \sigma(W_z[x_t, h_{t-1}])

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

6.2 Reset gate

Main idea. Control how much past enters the candidate.

Core relation:

r_t = \sigma(W_r[x_t, h_{t-1}])

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

6.3 Candidate hidden

Main idea. Build proposed new hidden state.

Core relation:

\tilde h_t = \tanh(W_h[x_t, r_t \odot h_{t-1}])

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

6.4 Hidden update

Main idea. Blend old state and candidate.

Core relation:

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
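
A minimal NumPy sketch of one GRU step following the update, reset, candidate, and blend relations above; bias terms are omitted to match the formulas as written, and the sizes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; each weight matrix acts on a concatenation of input and (gated) state."""
    z_in = np.concatenate([x_t, h_prev], axis=-1)
    z_t = sigmoid(z_in @ W_z.T)                        # update gate
    r_t = sigmoid(z_in @ W_r.T)                        # reset gate
    h_in = np.concatenate([x_t, r_t * h_prev], axis=-1)
    h_tilde = np.tanh(h_in @ W_h.T)                    # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_tilde        # blend old state and candidate

B, d_x, d_h = 2, 3, 4
rng = np.random.default_rng(0)
W = lambda: rng.normal(size=(d_h, d_x + d_h)) * 0.1
h = gru_step(rng.normal(size=(B, d_x)), np.zeros((B, d_h)), W(), W(), W())
```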

6.5 GRU versus LSTM

Main idea. A GRU is smaller; an LSTM has separate cell and hidden states.

Core relation:

h_t only versus (c_t, h_t)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

7. Sequence Tasks

This part studies sequence tasks through the lens of sequence learning. The central question is how information and gradients move through time.

Subtopic | Question | Formula
Many-to-one | classify a whole sequence from final or pooled state | \hat y = g(h_T)
Many-to-many | predict at every time step | \hat y_t = g(h_t)
Seq2seq | encode one sequence and decode another | p(y_{1:M} \mid x_{1:T}) = \prod_j p(y_j \mid y_{<j}, c)
Bidirectional RNN | use past and future context for non-causal tasks | h_t = [\overrightarrow h_t; \overleftarrow h_t]
Teacher forcing | decoder conditions on gold previous outputs during training | p(y_t \mid y_{<t}^\star, c)

7.1 Many-to-one

Main idea. Classify a whole sequence from final or pooled state.

Core relation:

\hat y = g(h_T)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

7.2 Many-to-many

Main idea. Predict at every time step.

Core relation:

\hat y_t = g(h_t)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

7.3 Seq2seq

Main idea. Encode one sequence and decode another.

Core relation:

p(y_{1:M} \mid x_{1:T}) = \prod_j p(y_j \mid y_{<j}, c)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

7.4 Bidirectional RNN

Main idea. Use past and future context for non-causal tasks.

Core relation:

h_t = [\overrightarrow h_t; \overleftarrow h_t]

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

7.5 Teacher forcing

Main idea. Decoder conditions on gold previous outputs during training.

Core relation:

p(y_t \mid y_{<t}^\star, c)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
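
A short PyTorch sketch of teacher forcing with a GRUCell decoder: the loss at each step is computed from the model's logits, but the next input is the gold token rather than a sample. The module names, sizes, and zero initial state are assumptions:

```python
import torch
import torch.nn as nn

vocab, d_h = 50, 32
embed = nn.Embedding(vocab, d_h)
cell = nn.GRUCell(d_h, d_h)
head = nn.Linear(d_h, vocab)

B, M = 4, 6
gold = torch.randint(0, vocab, (B, M))       # gold target sequence y*_{1:M}
s = torch.zeros(B, d_h)                      # decoder state (could be set from the encoder context c)
prev = torch.zeros(B, dtype=torch.long)      # assumed start-token id 0

loss = 0.0
for t in range(M):
    s = cell(embed(prev), s)
    logits = head(s)
    loss = loss + nn.functional.cross_entropy(logits, gold[:, t])
    prev = gold[:, t]                        # teacher forcing: condition on the gold token
print(float(loss / M))
```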

8. Attention Bridge

This part studies the attention bridge through the lens of sequence learning. The central question is how information and gradients move through time.

Subtopic | Question | Formula
Fixed context bottleneck | a single encoder vector struggles with long inputs | c = h_T
Alignment scores | decoder state scores every encoder state | e_{tj} = a(s_{t-1}, h_j)
Attention weights | softmax turns scores into a distribution over positions | \alpha_{tj} = \mathrm{softmax}_j(e_{tj})
Context vector | weighted sum exposes relevant encoder states | c_t = \sum_j \alpha_{tj} h_j
Transformer bridge | self-attention removes recurrence and exposes all token states directly | \mathrm{Attention}(Q, K, V)

8.1 Fixed context bottleneck

Main idea. A single encoder vector struggles with long inputs.

Core relation:

c = h_T

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

8.2 Alignment scores

Main idea. Decoder state scores every encoder state.

Core relation:

e_{tj} = a(s_{t-1}, h_j)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

8.3 Attention weights

Main idea. Softmax turns scores into a distribution over positions.

Core relation:

\alpha_{tj} = \mathrm{softmax}_j(e_{tj})

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is the historical bridge from seq2seq RNNs to transformer attention.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

8.4 Context vector

Main idea. Weighted sum exposes relevant encoder states.

Core relation:

c_t = \sum_j \alpha_{tj} h_j

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
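
A NumPy sketch of the weighted sum for a batch at a single decoder step; shapes are batch-first and the attention weights here are random stand-ins.

```python
import numpy as np

B, T_src, d_h = 2, 5, 4
rng = np.random.default_rng(1)
H = rng.normal(size=(B, T_src, d_h))               # encoder states
alpha = rng.random(size=(B, T_src))
alpha = alpha / alpha.sum(axis=1, keepdims=True)   # attention weights per example

# c_t = sum_j alpha_{tj} h_j: one context vector per example for this decoder step
c = np.einsum("bt,btd->bd", alpha, H)
print(c.shape)   # (2, 4)
```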

8.5 Transformer bridge

Main idea. Self-attention removes recurrence and exposes all token states directly.

Core relation:

\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(QK^\top/\sqrt{d_k}\right)V

Self-attention drops the recurrence entirely: queries, keys, and values are all projections of the same token states, and every position can read from every other position in a single step. The path between any two tokens has length 1 instead of up to T, which is the main reason gradients travel so much more easily than through a recurrent chain.

Implementation check. Scores form a (T, T) matrix per head, so memory and compute grow quadratically with sequence length.

AI connection. This is the computation at the core of every transformer layer in current LLMs.

Common mistake. Attention is not free: it trades the O(T) sequential chain for O(T^2) pairwise interactions.
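
A single-head, unbatched NumPy sketch of the formula above; real transformer layers add multiple heads, masking, and output projections.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a (T, d_model) sequence.
    Every position attends to every other position in one step -- no recurrence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (T, T) pairwise scores
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)             # softmax over key positions
    return A @ V                                      # (T, d_v)

rng = np.random.default_rng(2)
T, d_model, d_k = 6, 8, 4
X = rng.normal(size=(T, d_model))
out = self_attention(X, rng.normal(size=(d_model, d_k)),
                        rng.normal(size=(d_model, d_k)),
                        rng.normal(size=(d_model, d_k)))
print(out.shape)   # (6, 4)
```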

9. Training Practice

This part studies training practice through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Padding masks | sequence batches have variable lengths | m_t\in\{0,1\} |
| Packed sequences | avoid computing loss on padding | L=\sum_t m_t\ell_t/\sum_t m_t |
| Stateful streaming | carry state across chunks for long streams | h_\mathrm{next}=h_T |
| Initialization | orthogonal recurrent matrices can stabilize early dynamics | W_{hh}^\top W_{hh}=I |
| Regularization | dropout, weight decay, and clipping fight overfit and instability | L+\lambda\Vert\theta\Vert^2 |

9.1 Padding masks

Main idea. Sequence batches have variable lengths.

Core relation:

m_t\in\{0,1\}

Sequences in a batch rarely share a length, so shorter ones are padded to the longest and a mask m_t marks real tokens (1) versus padding (0). The mask is used wherever time steps are aggregated: the loss, pooling over time, and attention scores.

Implementation check. The mask has shape (B, T), matching the time axis of the logits; it is usually built by comparing a position index against per-example lengths.

AI connection. The same padding and attention masks appear in transformer training batches.

Common mistake. Masking the loss but not the pooling (or the attention) still lets padding leak into the prediction.
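
A minimal PyTorch sketch that builds the mask from per-example lengths, assuming batch-first layout.

```python
import torch

def length_mask(lengths, T):
    """Build a (B, T) padding mask: 1.0 for real tokens, 0.0 for padding."""
    positions = torch.arange(T).unsqueeze(0)           # (1, T)
    return (positions < lengths.unsqueeze(1)).float()  # broadcast to (B, T)

lengths = torch.tensor([5, 3, 1])
M = length_mask(lengths, T=5)
print(M)
# tensor([[1., 1., 1., 1., 1.],
#         [1., 1., 1., 0., 0.],
#         [1., 0., 0., 0., 0.]])
```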

9.2 Packed sequences

Main idea. Avoid computing loss on padding.

Core relation:

L=\sum_t m_t\ell_t/\sum_t m_t

Packing concatenates only the real tokens so the recurrent kernel never runs on padding (PyTorch exposes this as pack_padded_sequence), and masked averaging makes the loss a mean over real tokens only, so heavily padded batches are not silently diluted.

Implementation check. The denominator is \sum_t m_t, the count of real tokens, not B\times T.

Common mistake. Dividing by B\times T makes the loss depend on how much padding happens to be in the batch.
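
A PyTorch sketch of the masked average; the logits, targets, and mask here are random stand-ins.

```python
import torch
import torch.nn.functional as F

B, T, V = 3, 5, 10
logits = torch.randn(B, T, V)                      # per-token logits
targets = torch.randint(0, V, (B, T))              # padded targets
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0],
                     [1, 0, 0, 0, 0]]).float()

# Per-token cross-entropy, then average over real tokens only.
per_token = F.cross_entropy(logits.reshape(B * T, V),
                            targets.reshape(B * T),
                            reduction="none").reshape(B, T)
loss = (per_token * mask).sum() / mask.sum()
print(loss)
```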

9.3 Stateful streaming

Main idea. Carry state across chunks for long streams.

Core relation:

h_\mathrm{next}=h_T

When a stream is too long to unroll at once, it is split into chunks and the final hidden state of one chunk initializes the next. Detaching the carried state between chunks truncates backpropagation at the chunk boundary, which is exactly truncated BPTT.

Implementation check. Reset the carried state (to zeros) whenever a new, unrelated sequence starts; otherwise the model treats the end of one document as context for the next.

Common mistake. Carrying the state without detaching it keeps the whole computation graph alive across chunks, and memory use grows without bound.
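
A PyTorch sketch of carrying and detaching state across chunks; the chunk list and the single linear head are synthetic stand-ins for a real stream and task.

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)

# Synthetic stand-in for one long stream, split into chunks of 20 steps.
chunks = [(torch.randn(4, 20, 8), torch.randn(4, 1)) for _ in range(5)]

h = None                                    # state carried across chunks of the same stream
for x, y in chunks:
    out, h = rnn(x, h)
    loss = ((head(out[:, -1]) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()                          # truncate BPTT at the chunk boundary
```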

9.4 Initialization

Main idea. Orthogonal recurrent matrices can stabilize early dynamics.

Core relation:

W_{hh}^\top W_{hh}=I

An orthogonal recurrent matrix has all singular values equal to 1, so at initialization the repeated Jacobian products in BPTT neither shrink nor blow up. A related LSTM trick is to initialize the forget-gate bias to a positive value (often 1) so the cell starts out remembering.

Implementation check. Apply orthogonal initialization to the hidden-to-hidden weights only; input-to-hidden weights can keep a standard fan-in scheme.

Common mistake. Orthogonality is an initialization property, not an invariant: the matrix drifts after a few gradient steps, so this helps early dynamics rather than guaranteeing stability.
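
A PyTorch sketch applying orthogonal initialization to the hidden-to-hidden weights of a single-layer RNN.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=32, batch_first=True)

# Orthogonal init for the recurrent (hidden-to-hidden) weights only.
for name, param in rnn.named_parameters():
    if "weight_hh" in name:
        nn.init.orthogonal_(param)

W = rnn.weight_hh_l0.detach()
print(torch.allclose(W.T @ W, torch.eye(32), atol=1e-4))  # True: singular values are all 1
```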

9.5 Regularization

Main idea. Dropout, weight decay, and clipping fight overfit and instability.

Core relation:

L+\lambda\Vert\theta\Vert^2

Weight decay penalizes parameter norms, dropout regularizes activations, and gradient clipping caps the update size; together they target both overfitting and the instability specific to recurrent training.

Implementation check. Standard dropout goes on the inputs and outputs of the recurrent layer, not naively on the hidden-to-hidden connection; if the recurrent path is dropped at all, reuse the same mask across time steps (variational dropout).

Common mistake. Resampling a dropout mask on the recurrent connection at every step destroys the memory the cell is supposed to carry.
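
A PyTorch sketch of one training step combining decoupled weight decay (AdamW, which plays the role of the L2 penalty above) and gradient clipping; the data is random and the single linear head is illustrative.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)
params = list(model.parameters()) + list(head.parameters())
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=1e-2)   # decoupled weight decay

x = torch.randn(4, 20, 8)
y = torch.randn(4, 1)

out, _ = model(x)
loss = ((head(out[:, -1]) - y) ** 2).mean()
opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)          # guard against exploding gradients
opt.step()
```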

10. Diagnostics

This part studies diagnostics through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Shape checks | inputs, hidden states, gates, and outputs have distinct axes | (B,T,d) and (B,h) |
| Gradient norms | track exploding or vanishing gradients through time | \Vert g_t\Vert |
| Gate statistics | saturated gates indicate stuck memory behavior | f_t, i_t, z_t near 0 or 1 |
| Length tests | evaluate short and long sequences separately | S(T) |
| Ablations | compare vanilla RNN, GRU, LSTM, and attention bridge | \Delta L,\ \Delta S |

10.1 Shape checks

Main idea. Inputs, hidden states, gates, and outputs have distinct axes.

Core relation:

(B,T,d) and (B,h)

Most silent sequence bugs are shape bugs, so assert the expected axes at module boundaries: inputs (B, T, d_x), per-step outputs (B, T, d_h), and the final hidden state (layers × directions, B, d_h).

Implementation check. In PyTorch-style APIs, batch_first changes the layout of the input and output tensors but not of the final hidden state, which keeps the batch axis in the middle.

Common mistake. Confusing the last per-step output with the final hidden state of a multi-layer or bidirectional model; they coincide only for a single-layer, unidirectional cell.
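
A PyTorch sketch of the assertions, using a single-layer unidirectional GRU so the final-state identity in the last line actually holds.

```python
import torch
import torch.nn as nn

B, T, d_x, d_h = 4, 7, 8, 16
rnn = nn.GRU(input_size=d_x, hidden_size=d_h, batch_first=True)

x = torch.randn(B, T, d_x)
out, h_n = rnn(x)

assert x.shape == (B, T, d_x)          # input sequence
assert out.shape == (B, T, d_h)        # hidden state at every time step
assert h_n.shape == (1, B, d_h)        # (layers * directions, B, d_h), even with batch_first
assert torch.allclose(out[:, -1], h_n[0])   # last per-step output equals the final state
```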

10.2 Gradient norms

Main idea. Track exploding or vanishing gradients through time.

Core relation:

\Vert g_t\Vert

Logging the global gradient norm (and a few per-parameter norms) every step is the cheapest window into BPTT health: spikes signal exploding gradients, while norms sitting orders of magnitude below the rest of training signal vanishing ones.

Implementation check. Measure the norm after backward() and before clipping, then clip; otherwise the log only shows the clipped value.

Common mistake. A gradient norm pinned exactly at the clipping threshold means clipping fires on every step, usually a sign the learning rate or initialization needs attention.
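
A small helper sketch, assuming a PyTorch module whose gradients have already been populated by backward(); the thresholds in the commented usage are arbitrary.

```python
import torch

def grad_norms(model):
    """Per-parameter and total gradient norms, measured before clipping."""
    norms = {name: p.grad.norm().item()
             for name, p in model.named_parameters() if p.grad is not None}
    total = sum(v ** 2 for v in norms.values()) ** 0.5
    return norms, total

# Typical use inside a training step, after loss.backward() and before clipping:
#   norms, total = grad_norms(model)
#   if total > 100 or total < 1e-6:
#       print("suspicious gradient norm:", total)
```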

10.3 Gate statistics

Main idea. Saturated gates indicate stuck memory behavior.

Core relation:

f_t, i_t, z_t near 0 or 1

Gate activations live in (0, 1), so a histogram of forget, input, and update gates shows how the memory is being used: a forget gate pinned near 1 never forgets, near 0 it never remembers, and a distribution stuck at one end regardless of the input means the gate has stopped doing its job.

Implementation check. Collect gate values over a validation batch and report the fraction below 0.05 and above 0.95 for each gate.

AI connection. Gate histograms quickly reveal whether an LSTM or GRU is actually using its memory.

Common mistake. Some saturation is normal and even useful; the warning sign is gates that do not respond to different inputs, so compare histograms across batches rather than reading a single average.
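
A PyTorch sketch that recomputes the gates of a single-layer LSTM from its own weights for one step; it relies on PyTorch's documented i, f, g, o gate ordering, and the 0.05 saturation threshold is arbitrary.

```python
import torch
import torch.nn as nn

def lstm_gate_saturation(lstm, x_t, h_t, thresh=0.05):
    """Fraction of input/forget/output gate activations near 0 or 1 for one step."""
    W_ih, W_hh = lstm.weight_ih_l0, lstm.weight_hh_l0
    b = lstm.bias_ih_l0 + lstm.bias_hh_l0
    gates = x_t @ W_ih.T + h_t @ W_hh.T + b          # (B, 4*d_h), PyTorch order: i, f, g, o
    i, f, g, o = gates.chunk(4, dim=-1)
    stats = {}
    for name, gate in [("input", torch.sigmoid(i)),
                       ("forget", torch.sigmoid(f)),
                       ("output", torch.sigmoid(o))]:
        saturated = ((gate < thresh) | (gate > 1 - thresh)).float().mean().item()
        stats[name] = saturated
    return stats

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x_t, h_t = torch.randn(4, 8), torch.zeros(4, 16)
print(lstm_gate_saturation(lstm, x_t, h_t))
```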

10.4 Length tests

Main idea. Evaluate short and long sequences separately.

Core relation:

S(T)

A single pooled metric hides length effects, so report the score S(T) separately for short, medium, and long sequences, and include lengths longer than anything seen in training.

Implementation check. Bucket the evaluation set by length (or pad each bucket separately) so that each point on the curve corresponds to one length range.

Common mistake. A test set dominated by short sequences can report a healthy average while the model fails on every long input.
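
A framework-agnostic sketch of the bucketing; evaluate is a hypothetical callable and each batch is assumed to hold sequences of a single length.

```python
from collections import defaultdict

def score_by_length(model, batches, evaluate):
    """Average a metric per sequence length instead of pooling everything.
    `evaluate` is a hypothetical callable: (model, batch) -> scalar loss or score."""
    buckets = defaultdict(list)
    for batch in batches:
        T = batch["inputs"].shape[1]            # time axis of a (B, T, d) tensor
        buckets[T].append(evaluate(model, batch))
    return {T: sum(vals) / len(vals) for T, vals in sorted(buckets.items())}

# A flat curve over T suggests length generalization; a score that degrades with T
# points at a memory bottleneck or at lengths never seen during training.
```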

10.5 Ablations

Main idea. Compare vanilla RNN, GRU, LSTM, and the attention bridge.

Core relation:

\Delta L,\ \Delta S

An ablation holds the data, optimizer, and budget fixed and changes only the cell, so differences in loss \Delta L and in the length curve \Delta S can be attributed to the architecture: vanilla RNN versus GRU versus LSTM, with and without the attention bridge.

Implementation check. Match parameter counts (or at least hidden sizes) across variants, and run more than one seed before reading small deltas.

Common mistake. Changing the hidden size together with the cell type confounds capacity with architecture.
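
A sketch of the comparison loop; train_and_eval is a hypothetical callable that trains a model and returns its final loss and length curve.

```python
import torch.nn as nn

def ablate(train_and_eval):
    """Run the same training/eval routine over different cells.
    `train_and_eval` is a hypothetical callable: model -> (final_loss, length_curve)."""
    cells = {
        "rnn":  nn.RNN(input_size=8, hidden_size=32, batch_first=True),
        "gru":  nn.GRU(input_size=8, hidden_size=32, batch_first=True),
        "lstm": nn.LSTM(input_size=8, hidden_size=32, batch_first=True),
    }
    results = {name: train_and_eval(cell) for name, cell in cells.items()}
    base_loss = results["rnn"][0]
    for name, (loss, _) in results.items():
        print(f"{name}: loss={loss:.3f}  delta={loss - base_loss:+.3f}")
    return results
```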


Practice Exercises

  1. Compute one vanilla RNN hidden update.
  2. Compute a sequence log probability from conditional probabilities.
  3. Show a scalar gradient product that vanishes or explodes.
  4. Clip a gradient vector by norm.
  5. Compute one LSTM cell update.
  6. Compute one GRU hidden update.
  7. Apply a padding mask to sequence losses.
  8. Identify shapes for many-to-one and many-to-many tasks.
  9. Compute attention weights and context over encoder states.
  10. Write an RNN debugging checklist.

Why This Matters for AI

Transformers dominate current LLMs, but RNNs still teach the core sequence-learning problems: hidden state, recurrence, long-range credit assignment, gradient stability, teacher forcing, and attention as a solution to fixed-context bottlenecks. Understanding RNNs makes transformer design feel less arbitrary.

Bridge to Transformer Architecture

Transformers replace recurrent state updates with attention over token states. The next section studies how self-attention, residual streams, normalization, and feed-forward blocks solve many RNN bottlenecks while introducing their own memory and compute tradeoffs.

References