Recurrent neural networks model sequences by reusing the same transition at every time step. Their power is memory; their difficulty is training through long chains of time.
Overview
The basic recurrent equation is $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$.
This state update lets a model process variable-length sequences. But training the model means backpropagating through the unrolled computation graph. Long products of recurrent Jacobians can vanish or explode. LSTM and GRU cells add gates and additive memory paths to make sequence learning more stable.
Prerequisites
- Matrix multiplication and nonlinear activations
- Chain rule and backpropagation
- Cross-entropy for sequence prediction
- Basic transformer attention intuition for the bridge section
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Demonstrates vanilla recurrence, BPTT gradient products, clipping, LSTM/GRU gates, masks, teacher forcing, attention weights, and diagnostics. |
| exercises.ipynb | Ten practice problems for recurrent equations, BPTT, gates, masking, and sequence-task shapes. |
Learning Objectives
After this section, you should be able to:
- Write vanilla RNN, LSTM, and GRU update equations.
- Explain recurrence as a parameter-shared deep network along time.
- Derive why gradients vanish or explode through repeated Jacobian products.
- Apply gradient clipping and truncated BPTT.
- Interpret LSTM forget/input/output gates and GRU update/reset gates.
- Distinguish many-to-one, many-to-many, seq2seq, and bidirectional tasks.
- Explain why attention was introduced to reduce the fixed-context bottleneck.
- Build diagnostics for masks, gate saturation, length generalization, and gradient norms.
Table of Contents
- Sequence Modeling View
- 1.1 Sequential data
- 1.2 Hidden state
- 1.3 Output distribution
- 1.4 Autoregressive generation
- 1.5 RNN versus transformer
- Vanilla RNN Equations
- 2.1 State update
- 2.2 Output head
- 2.3 Parameter sharing
- 2.4 Unrolled graph
- 2.5 Initial state
- Backpropagation Through Time
- 3.1 Sequence loss
- 3.2 Temporal chain rule
- 3.3 Weight gradient
- 3.4 Truncated BPTT
- 3.5 State detachment
- Vanishing and Exploding Gradients
- 4.1 Jacobian product
- 4.2 Spectral radius
- 4.3 Vanishing
- 4.4 Exploding
- 4.5 Gradient clipping
- LSTM Cell
- 5.1 Forget gate
- 5.2 Input gate
- 5.3 Candidate memory
- 5.4 Cell update
- 5.5 Output gate
- GRU Cell
- 6.1 Update gate
- 6.2 Reset gate
- 6.3 Candidate hidden
- 6.4 Hidden update
- 6.5 GRU versus LSTM
- Sequence Tasks
- 7.1 Many-to-one
- 7.2 Many-to-many
- 7.3 Seq2seq
- 7.4 Bidirectional RNN
- 7.5 Teacher forcing
- Attention Bridge
- 8.1 Fixed context bottleneck
- 8.2 Alignment scores
- 8.3 Attention weights
- 8.4 Context vector
- 8.5 Transformer bridge
- Training Practice
- 9.1 Padding masks
- 9.2 Packed sequences
- 9.3 Stateful streaming
- 9.4 Initialization
- 9.5 Regularization
- Diagnostics
- 10.1 Shape checks
- 10.2 Gradient norms
- 10.3 Gate statistics
- 10.4 Length tests
- 10.5 Ablations
Shape Map
input sequence: X shape (B, T, d_x)
hidden state: h_t shape (B, d_h)
hidden sequence: H shape (B, T, d_h)
logits per token: O shape (B, T, |V|)
padding mask: M shape (B, T)
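The shape map above can be checked in a few lines. This is a minimal sketch assuming PyTorch is available; the sizes B, T, d_x, d_h below are placeholders, not values used elsewhere in this module.

```python
import torch
import torch.nn as nn

# Placeholder sizes; only the axis layout (batch, time, feature) matters here.
B, T, d_x, d_h = 4, 7, 16, 32

rnn = nn.RNN(input_size=d_x, hidden_size=d_h, batch_first=True)
X = torch.randn(B, T, d_x)        # input sequence (B, T, d_x)
H, h_T = rnn(X)                   # H: hidden sequence, h_T: final hidden state

assert H.shape == (B, T, d_h)     # hidden sequence (B, T, d_h)
assert h_T.shape == (1, B, d_h)   # leading axis is num_layers * num_directions
```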
1. Sequence Modeling View
This part studies the sequence-modeling view through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| Sequential data | observations arrive with order and history | $x_{1:T} = (x_1, \dots, x_T)$ |
| Hidden state | the model carries a summary of the past | $h_t = f(h_{t-1}, x_t)$ |
| Output distribution | each state can produce a prediction | $p(y_t \mid x_{1:t}) = \mathrm{softmax}(W_{hy} h_t + b_y)$ |
| Autoregressive generation | the previous output can become the next input | $x_{t+1} = \hat{y}_t$ |
| RNN versus transformer | RNNs compress history into a state; transformers keep token states visible | $h_t$ versus $H_{1:t}$ |
1.1 Sequential data
Main idea. Observations arrive with order and history.
Core relation: $x_{1:T} = (x_1, \dots, x_T)$, where the order of observations matters.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
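The two numbers in the worked micro-example are plain arithmetic and can be reproduced directly:

```python
# Repeated scaling over 20 steps, as in the worked micro-example.
steps = 20
print(0.8 ** steps)   # ~0.012: the signal has almost vanished
print(1.2 ** steps)   # ~38.3: the signal has exploded
```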
1.2 Hidden state
Main idea. The model carries a summary of the past.
Core relation: $h_t = f(h_{t-1}, x_t)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
1.3 Output distribution
Main idea. Each state can produce a prediction.
Core relation: $p(y_t \mid x_{1:t}) = \mathrm{softmax}(W_{hy} h_t + b_y)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
1.4 Autoregressive generation
Main idea. The previous output can become the next input.
Core relation: $x_{t+1} = \hat{y}_t$ with $\hat{y}_t \sim p(\cdot \mid h_t)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
1.5 RNN versus transformer
Main idea. RNNs compress history into a state; transformers keep token states visible.
Core relation: $h_t$ versus $H_{1:t}$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
2. Vanilla RNN Equations
This part studies the vanilla RNN equations through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| State update | combine current input and previous hidden state | $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ |
| Output head | map hidden state to logits or regression output | $o_t = W_{hy} h_t + b_y$ |
| Parameter sharing | the same weights are reused at every time step | same $(W_{xh}, W_{hh}, b_h)$ for all $t$ |
| Unrolled graph | an RNN is a deep network along time | $h_T = f(\cdots f(f(h_0, x_1), x_2) \cdots, x_T)$ |
| Initial state | start from zeros or a learned state | $h_0 = 0$ or trainable |
2.1 State update
Main idea. Combine current input and previous hidden state.
Core relation: $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
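The state update is a one-line function. Below is a minimal NumPy sketch of $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$; the dimensions and random weights are illustrative only, not tied to the companion notebooks.

```python
import numpy as np

d_x, d_h = 8, 16
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(d_h, d_x))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
b_h = np.zeros(d_h)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h): the same weights at every step
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(d_h)                       # initial state h_0 = 0
for x_t in rng.normal(size=(5, d_x)):   # a length-5 toy sequence
    h = rnn_step(x_t, h)
```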
2.2 Output head
Main idea. Map hidden state to logits or regression output.
Core relation: $o_t = W_{hy} h_t + b_y$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
2.3 Parameter sharing
Main idea. The same weights are reused at every time step.
Core relation: the same $(W_{xh}, W_{hh}, b_h)$ are applied at every time step $t$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
2.4 Unrolled graph
Main idea. An RNN is a deep network along time.
Core relation: $h_T = f(\cdots f(f(h_0, x_1), x_2) \cdots, x_T)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
2.5 Initial state
Main idea. Start from zeros or a learned state.
Core relation: $h_0 = 0$ or trainable.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
3. Backpropagation Through Time
This part studies backpropagation through time (BPTT) from the sequence-learning point of view. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| Sequence loss | sum or average losses over time | $L = \frac{1}{T} \sum_{t=1}^{T} \ell_t$ |
| Temporal chain rule | future losses depend on earlier states through recurrence | $\frac{\partial L}{\partial h_t} = \frac{\partial \ell_t}{\partial h_t} + \left(\frac{\partial h_{t+1}}{\partial h_t}\right)^{\top} \frac{\partial L}{\partial h_{t+1}}$ |
| Weight gradient | shared weights collect gradient from every time step | $\frac{\partial L}{\partial W_{hh}} = \sum_{t} \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial W_{hh}}$ |
| Truncated BPTT | backpropagate through a limited window for efficiency | window of $k$ steps |
| State detachment | detach hidden state between chunks to control graph length | $h_0^{\text{next}} = \mathrm{stopgrad}(h_T^{\text{prev}})$ |
3.1 Sequence loss
Main idea. Sum or average losses over time.
Core relation: $L = \frac{1}{T} \sum_{t=1}^{T} \ell_t$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
3.2 Temporal chain rule
Main idea. Future losses depend on earlier states through recurrence.
Core relation: $\frac{\partial L}{\partial h_t} = \frac{\partial \ell_t}{\partial h_t} + \left(\frac{\partial h_{t+1}}{\partial h_t}\right)^{\top} \frac{\partial L}{\partial h_{t+1}}$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is why RNN training is training a very deep network along time.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
3.3 Weight gradient
Main idea. Shared weights collect gradient from every time step.
Core relation: $\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial W_{hh}}$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
3.4 Truncated BPTT
Main idea. Backpropagate through a limited window for efficiency.
Core relation: gradients flow back through at most $k$ recent steps per update.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
3.5 State detachment
Main idea. Detach hidden state between chunks to control graph length.
Core relation: $h_0^{\text{next chunk}} = \mathrm{stopgrad}\!\left(h_T^{\text{previous chunk}}\right)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
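Truncated BPTT and state detachment usually appear together in a training loop. The sketch below assumes PyTorch; the model, window size, stream, and optimizer are placeholders chosen for illustration, not part of this module's notebooks.

```python
import torch
import torch.nn as nn

# Placeholder setup: a small RNN over one long stream.
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
head = nn.Linear(20, 10)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

stream = torch.randn(1, 1000, 10)        # one long sequence (B=1, T=1000, d_x=10)
targets = torch.randn(1, 1000, 10)
k = 50                                   # truncation window
h = torch.zeros(1, 1, 20)

for start in range(0, 1000, k):
    x = stream[:, start:start + k]
    y = targets[:, start:start + k]
    h = h.detach()                       # state detachment: cut the graph between chunks
    out, h = rnn(x, h)
    loss = ((head(out) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()                      # gradients flow through at most k steps
    opt.step()
```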
4. Vanishing and Exploding Gradients
This part studies vanishing and exploding gradients through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| Jacobian product | long-range gradients multiply many recurrent Jacobians | $\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}$ |
| Spectral radius | gradient scale depends on recurrent dynamics | $\left\lVert \frac{\partial h_T}{\partial h_t} \right\rVert \lesssim \sigma_{\max}(W_{hh})^{T-t}$ |
| Vanishing | singular values below one shrink long-range gradients | $\sigma_{\max} < 1$ |
| Exploding | singular values above one can blow up gradients | $\sigma_{\max} > 1$ |
| Gradient clipping | cap gradient norm before optimizer update | $g \leftarrow g \cdot \min\!\left(1, \frac{\tau}{\lVert g \rVert}\right)$ |
4.1 Jacobian product
Main idea. Long-range gradients multiply many recurrent Jacobians.
Core relation: $\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
4.2 Spectral radius
Main idea. Gradient scale depends on recurrent dynamics.
Core relation: $\left\lVert \frac{\partial h_T}{\partial h_t} \right\rVert \lesssim \sigma_{\max}(W_{hh})^{T-t}$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
4.3 Vanishing
Main idea. Singular values below one shrink long-range gradients.
Core relation: $\sigma_{\max}(W_{hh}) < 1$ drives long-range gradients toward zero.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
4.4 Exploding
Main idea. Singular values above one can blow up gradients.
Core relation: $\sigma_{\max}(W_{hh}) > 1$ can make long-range gradients grow without bound.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
4.5 Gradient clipping
Main idea. Cap gradient norm before optimizer update.
Core relation: $g \leftarrow g \cdot \min\!\left(1, \frac{\tau}{\lVert g \rVert}\right)$ for a clipping threshold $\tau$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. Clipping became a standard tool because recurrent gradient products can explode suddenly.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
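In PyTorch the clipping rule above is one call between backward and the optimizer step. The sketch below uses a throwaway stand-in model; the threshold max_norm=1.0 is an assumed value, not a recommendation from this module.

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=4, hidden_size=8, batch_first=True)   # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

out, _ = model(torch.randn(2, 30, 4))
loss = out.pow(2).mean()
optimizer.zero_grad()
loss.backward()

# Rescale the global gradient norm to at most max_norm before the update:
# g <- g * min(1, max_norm / ||g||)
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```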
5. LSTM Cell
This part studies the LSTM cell through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| Forget gate | decide what memory to keep | $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ |
| Input gate | decide what new content to write | $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ |
| Candidate memory | propose new cell content | $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$ |
| Cell update | additive memory path improves gradient flow | $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ |
| Output gate | expose part of cell memory as hidden state | $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$, $h_t = o_t \odot \tanh(c_t)$ |
5.1 Forget gate
Main idea. Decide what memory to keep.
Core relation: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
5.2 Input gate
Main idea. Decide what new content to write.
Core relation: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
5.3 Candidate memory
Main idea. Propose new cell content.
Core relation: $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
5.4 Cell update
Main idea. Additive memory path improves gradient flow.
Core relation: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. The additive cell path is the mathematical reason LSTMs can carry information longer than a plain tanh RNN.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
5.5 Output gate
Main idea. Expose part of cell memory as hidden state.
Core relation: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ and $h_t = o_t \odot \tanh(c_t)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
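The five LSTM equations in this part fit in one function. A minimal NumPy sketch follows; the gate parameters act on the concatenation $[h_{t-1}, x_t]$, and the sizes and random weights are placeholders for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W and b hold the four gate parameter blocks, each acting on [h_{t-1}, x_t].
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate: what memory to keep
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate: what new content to write
    c_hat = np.tanh(W["c"] @ z + b["c"])     # candidate memory
    c_t = f_t * c_prev + i_t * c_hat         # additive cell update
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # exposed hidden state
    return h_t, c_t

d_x, d_h = 3, 5
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d_h, d_h + d_x)) for k in "fico"}
b = {k: np.zeros(d_h) for k in "fico"}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_x), h, c, W, b)
```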
6. GRU Cell
This part studies the GRU cell through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| Update gate | interpolate old and candidate states | $z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$ |
| Reset gate | control how much past enters the candidate | $r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$ |
| Candidate hidden | build proposed new hidden state | $\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$ |
| Hidden update | blend old state and candidate | $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ |
| GRU versus LSTM | GRU is smaller; LSTM has separate cell and hidden state | $h_t$ only versus $(c_t, h_t)$ |
6.1 Update gate
Main idea. Interpolate old and candidate states.
Core relation: $z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
6.2 Reset gate
Main idea. Control how much past enters the candidate.
Core relation: $r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
6.3 Candidate hidden
Main idea. Build proposed new hidden state.
Core relation: $\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
6.4 Hidden update
Main idea. Blend old state and candidate.
Core relation: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
6.5 GRU versus LSTM
Main idea. The GRU is smaller; the LSTM has separate cell and hidden states.
Core relation: $h_t$ only versus $(c_t, h_t)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
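The GRU equations above can be written as one step function, mirroring the LSTM sketch. Note that the gating convention shown here ($1 - z_t$ on the old state) is the one used in this part; some libraries swap the roles of $z_t$ and $1 - z_t$. Sizes and weights are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, b):
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W["z"] @ z_in + b["z"])    # update gate
    r_t = sigmoid(W["r"] @ z_in + b["r"])    # reset gate
    h_hat = np.tanh(W["h"] @ np.concatenate([r_t * h_prev, x_t]) + b["h"])  # candidate
    return (1.0 - z_t) * h_prev + z_t * h_hat   # blend old state and candidate

d_x, d_h = 3, 5
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d_h, d_h + d_x)) for k in "zrh"}
b = {k: np.zeros(d_h) for k in "zrh"}
h = np.zeros(d_h)
h = gru_step(rng.normal(size=d_x), h, W, b)
```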
7. Sequence Tasks
This part studies sequence tasks through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| Many-to-one | classify a whole sequence from final or pooled state | $\hat{y} = g(h_T)$ |
| Many-to-many | predict at every time step | $\hat{y}_t = g(h_t)$ |
| Seq2seq | encode one sequence and decode another | $p(y_{1:T'} \mid x_{1:T})$ |
| Bidirectional RNN | use past and future context for non-causal tasks | $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$ |
| Teacher forcing | decoder conditions on gold previous outputs during training | $p(y_t \mid y^{\star}_{<t}, c)$ |
7.1 Many-to-one
Main idea. Classify a whole sequence from final or pooled state.
Core relation: $\hat{y} = g(h_T)$ or $\hat{y} = g\!\left(\frac{1}{T}\sum_t h_t\right)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
7.2 Many-to-many
Main idea. Predict at every time step.
Core relation: $\hat{y}_t = g(h_t)$ for every time step $t$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
7.3 Seq2seq
Main idea. Encode one sequence and decode another.
Core relation: $p(y_{1:T'} \mid x_{1:T}) = \prod_{t=1}^{T'} p(y_t \mid y_{<t}, c)$, with $c$ produced by the encoder.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
7.4 Bidirectional RNN
Main idea. Use past and future context for non-causal tasks.
Core relation: $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
7.5 Teacher forcing
Main idea. Decoder conditions on gold previous outputs during training.
Core relation: the decoder input at step $t$ is the gold token $y^{\star}_{t-1}$ rather than the model's own prediction.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
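Teacher forcing is easiest to see in a decoder loop. The sketch below assumes PyTorch; the vocabulary size, dimensions, the choice of a GRUCell decoder, and the start-token id 0 are all illustrative assumptions rather than anything fixed by this module.

```python
import torch
import torch.nn as nn

# Placeholder decoder: vocabulary of 20 tokens, embedding size 8, hidden size 16.
V, d_e, d_h = 20, 8, 16
embed = nn.Embedding(V, d_e)
cell = nn.GRUCell(d_e, d_h)
head = nn.Linear(d_h, V)

gold = torch.randint(0, V, (4, 6))        # (B, T') gold target tokens
h = torch.zeros(4, d_h)                   # decoder state, e.g. initialized from an encoder
prev = torch.zeros(4, dtype=torch.long)   # assumed start-token id 0
loss = 0.0

for t in range(gold.size(1)):
    h = cell(embed(prev), h)
    logits = head(h)
    loss = loss + nn.functional.cross_entropy(logits, gold[:, t])
    prev = gold[:, t]                     # teacher forcing: feed the gold token, not argmax
```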
8. Attention Bridge
This part studies the attention bridge through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| Fixed context bottleneck | a single encoder vector struggles with long inputs | $c = h_T^{\text{enc}}$ |
| Alignment scores | decoder state scores every encoder state | $e_{t,s} = \mathrm{score}(s_{t-1}, h_s)$ |
| Attention weights | softmax turns scores into a distribution over positions | $\alpha_{t,s} = \frac{\exp(e_{t,s})}{\sum_{s'} \exp(e_{t,s'})}$ |
| Context vector | weighted sum exposes relevant encoder states | $c_t = \sum_s \alpha_{t,s} h_s$ |
| Transformer bridge | self-attention removes recurrence and exposes all token states directly | $\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$ |
8.1 Fixed context bottleneck
Main idea. A single encoder vector struggles with long inputs.
Core relation: $c = h_T^{\text{enc}}$, a single vector summarizing the whole input.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
8.2 Alignment scores
Main idea. Decoder state scores every encoder state.
Core relation: $e_{t,s} = \mathrm{score}(s_{t-1}, h_s)$ for each encoder position $s$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
8.3 Attention weights
Main idea. Softmax turns scores into a distribution over positions.
Core relation: $\alpha_{t,s} = \frac{\exp(e_{t,s})}{\sum_{s'} \exp(e_{t,s'})}$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is the historical bridge from seq2seq RNNs to transformer attention.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
8.4 Context vector
Main idea. Weighted sum exposes relevant encoder states.
Core relation: $c_t = \sum_{s} \alpha_{t,s} h_s$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
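Scores, weights, and the context vector fit in three lines. This NumPy sketch uses dot-product scoring, which is one common choice for the score function (additive scoring is another); the encoder states and decoder state are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
S, d_h = 6, 4                        # S encoder positions, hidden size d_h
enc = rng.normal(size=(S, d_h))      # encoder hidden states h_1..h_S
s_dec = rng.normal(size=d_h)         # current decoder state

scores = enc @ s_dec                 # alignment scores (dot-product scoring)
alpha = softmax(scores)              # attention weights: a distribution over positions
context = alpha @ enc                # context vector c_t = sum_s alpha_s h_s
```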
8.5 Transformer bridge
Main idea. Self-attention removes recurrence and exposes all token states directly.
Core relation: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
9. Training Practice
This part studies training practice through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| Padding masks | sequence batches have variable lengths | $M \in \{0,1\}^{B \times T}$ |
| Packed sequences | avoid computing loss on padding | |
| Stateful streaming | carry state across chunks for long streams | $h_0^{(i+1)} = \mathrm{stopgrad}(h_T^{(i)})$ |
| Initialization | orthogonal recurrent matrices can stabilize early dynamics | $W_{hh}^{\top} W_{hh} = I$ |
| Regularization | dropout, weight decay, and clipping fight overfitting and instability | |
9.1 Padding masks
Main idea. Sequence batches have variable lengths.
Core relation: $L = \frac{\sum_{b,t} M_{b,t}\, \ell_{b,t}}{\sum_{b,t} M_{b,t}}$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
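A masked sequence loss applies the mask before averaging, so padding positions contribute nothing. A minimal PyTorch sketch, with placeholder sizes and randomly generated targets:

```python
import torch
import torch.nn.functional as F

B, T, V = 3, 5, 10
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
lengths = torch.tensor([5, 3, 2])                    # true length of each sequence

# mask M: 1 for real tokens, 0 for padding positions
mask = (torch.arange(T).unsqueeze(0) < lengths.unsqueeze(1)).float()   # (B, T)

loss_per_token = F.cross_entropy(logits.reshape(B * T, V),
                                 targets.reshape(B * T),
                                 reduction="none").reshape(B, T)
loss = (loss_per_token * mask).sum() / mask.sum()    # average over real tokens only
```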
9.2 Packed sequences
Main idea. Avoid computing loss on padding.
Core relation:
$\mathcal{L} = \sum_{b} \sum_{t=1}^{T_b} \ell_{b,t}$, where $T_b$ is the true length of sequence $b$
Packing goes further than masking: the padded timesteps are skipped entirely, so no compute and no state updates are spent on them.
Implementation check. Record the true lengths before packing, and unpack back to a padded tensor before computing a per-token loss so the output lines up with the targets.
Common mistake. The final hidden state of a packed batch corresponds to each sequence's true last step, not to index $T-1$ of the padded tensor; do not re-index the output by hand.
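A small sketch using PyTorch's pack_padded_sequence / pad_packed_sequence utilities with a GRU; the lengths and sizes are illustrative.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

B, T, d_x, d_h = 2, 5, 4, 8
X = torch.randn(B, T, d_x)
lengths = torch.tensor([5, 3])          # true lengths; positions beyond them are padding

gru = nn.GRU(d_x, d_h, batch_first=True)

# Pack so the GRU skips the padded timesteps entirely.
packed = pack_padded_sequence(X, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = gru(packed)

# Unpack back to a padded (B, T, d_h) tensor for the per-token loss.
H, out_lengths = pad_packed_sequence(packed_out, batch_first=True, total_length=T)

# h_n holds each sequence's state at its true last step, not at index T-1.
print(H.shape, h_n.shape)               # (B, T, d_h), (1, B, d_h)
```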
9.3 Stateful streaming
Main idea. Carry state across chunks for long streams.
Core relation:
$h_0^{\text{next chunk}} = \operatorname{detach}\!\left(h_T^{\text{current chunk}}\right)$
When a stream is too long to unroll in one graph, process it in chunks and pass the final hidden state of one chunk in as the initial state of the next. Gradients then flow only within a chunk: this is truncated BPTT in streaming form.
Implementation check. Detach the carried state at every chunk boundary; otherwise the computation graph keeps growing and memory use climbs with the length of the stream.
Common mistake. Reset the carried state whenever the underlying stream actually restarts (a new document or episode), or the model conditions on unrelated history.
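A minimal sketch of stateful streaming with truncated BPTT, assuming a toy regression target; the key line is the detach at the chunk boundary.

```python
import torch
from torch import nn

d_x, d_h, B, chunk_len = 4, 8, 2, 20
rnn = nn.GRU(d_x, d_h, batch_first=True)
head = nn.Linear(d_h, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)

stream = torch.randn(B, 100, d_x)        # stand-in for a long stream
target = torch.randn(B, 100, 1)
h = None                                 # initial state defaults to zeros

for start in range(0, stream.size(1), chunk_len):
    x = stream[:, start:start + chunk_len]
    y = target[:, start:start + chunk_len]

    out, h = rnn(x, h)
    loss = ((head(out) - y) ** 2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

    # Carry the state forward, but cut the graph at the chunk boundary:
    # gradients never flow across chunks (truncated BPTT).
    h = h.detach()
```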
9.4 Initialization
Main idea. Orthogonal recurrent matrices can stabilize early dynamics.
Core relation:
$W_{hh}^{\top} W_{hh} = I$
An orthogonal matrix has all singular values equal to 1, so repeated multiplication by the recurrent weights neither shrinks nor amplifies the linear part of the signal at initialization.
Implementation check. Apply orthogonal initialization to the hidden-to-hidden weights; the input-to-hidden weights can keep a standard scheme such as Xavier/Glorot.
Common mistake. Initialization only shapes the early dynamics; nothing keeps the matrix orthogonal during training, so clipping and gating remain necessary.
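A short sketch of orthogonal initialization for the recurrent weights of a PyTorch RNN, leaving the input weights on a standard scheme; the parameter names follow PyTorch's weight_hh / weight_ih convention.

```python
import torch
from torch import nn

rnn = nn.RNN(input_size=4, hidden_size=8, num_layers=1, batch_first=True)

for name, param in rnn.named_parameters():
    if "weight_hh" in name:
        # Orthogonal recurrent matrix: all singular values equal to 1.
        nn.init.orthogonal_(param)
    elif "weight_ih" in name:
        nn.init.xavier_uniform_(param)
    elif "bias" in name:
        nn.init.zeros_(param)

W = rnn.weight_hh_l0.detach()
print(torch.linalg.svdvals(W))   # all close to 1.0 at initialization
```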
9.5 Regularization
Main idea. Dropout, weight decay, and clipping fight overfit and instability.
Core relation:
$g \leftarrow g \cdot \min\!\left(1, \tau / \lVert g \rVert\right)$
Dropout combats overfitting, weight decay keeps the weights small, and gradient clipping caps the update when an exploding gradient appears; the three address different failure modes and are usually combined.
Implementation check. Prefer dropout on the inputs and outputs of the recurrent layer (or a single mask reused across time, as in variational dropout) over resampling a fresh mask on the recurrent connection at every step.
Common mistake. Clipping by global norm rescales the whole gradient and preserves its direction; clipping each element independently does not.
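A sketch combining the three levers on a small LSTM: inter-layer dropout, weight decay through the optimizer, and global gradient-norm clipping before the update; the data and sizes are placeholders.

```python
import torch
from torch import nn

model = nn.LSTM(input_size=4, hidden_size=8, num_layers=2,
                batch_first=True, dropout=0.3)   # dropout applies between stacked layers
head = nn.Linear(8, 1)
params = list(model.parameters()) + list(head.parameters())

# Weight decay via the optimizer.
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=1e-2)

x = torch.randn(2, 10, 4)
y = torch.randn(2, 10, 1)

out, _ = model(x)
loss = ((head(out) - y) ** 2).mean()

opt.zero_grad()
loss.backward()

# Clip the global gradient norm before stepping to tame exploding gradients.
grad_norm = nn.utils.clip_grad_norm_(params, max_norm=1.0)
opt.step()
print(float(grad_norm))
```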
10. Diagnostics
This part studies diagnostics through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Question | Formula |
|---|---|---|
| Shape checks | inputs, hidden states, gates, and outputs have distinct axes | $(B, T, d)$ and $(B, h)$ |
| Gradient norms | track exploding or vanishing gradients through time | $\lVert \partial \mathcal{L} / \partial h_t \rVert$ |
| Gate statistics | saturated gates indicate stuck memory behavior | $f_t, i_t, z_t$ near 0 or 1 |
| Length tests | evaluate short and long sequences separately | |
| Ablations | compare vanilla RNN, GRU, LSTM, and attention bridge | |
10.1 Shape checks
Main idea. Inputs, hidden states, gates, and outputs have distinct axes.
Core relation:
$(B, T, d)$ for sequences and $(B, h)$ for per-step states
A single misplaced axis rarely raises an error; broadcasting can hide a swapped batch and time dimension until training quietly fails to converge.
Implementation check. Assert shapes at module boundaries: inputs $(B, T, d_x)$, per-step state $(B, d_h)$, hidden sequence $(B, T, d_h)$, logits $(B, T, |V|)$, matching the Shape Map above.
Common mistake. Some RNN APIs default to time-major layout $(T, B, d)$ rather than batch-first; mixing the two conventions is the classic source of silent shape bugs.
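A minimal sketch of boundary assertions against the Shape Map, using a GRU and a linear head with illustrative sizes.

```python
import torch
from torch import nn

B, T, d_x, d_h, V = 2, 7, 4, 8, 10
X = torch.randn(B, T, d_x)

rnn = nn.GRU(d_x, d_h, batch_first=True)
head = nn.Linear(d_h, V)

H, h_n = rnn(X)
O = head(H)

# Shape Map assertions: fail loudly instead of broadcasting silently.
assert X.shape == (B, T, d_x),   "input sequence must be (B, T, d_x)"
assert H.shape == (B, T, d_h),   "hidden sequence must be (B, T, d_h)"
assert h_n.shape == (1, B, d_h), "final state is (num_layers, B, d_h) even in batch-first mode"
assert O.shape == (B, T, V),     "logits must be (B, T, |V|)"
```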
10.2 Gradient norms
Main idea. Track exploding or vanishing gradients through time.
Core relation:
$\left\lVert \partial \mathcal{L} / \partial h_t \right\rVert$ tracked across $t$
The norm of the gradient with respect to each hidden state $h_t$ shows directly whether credit assignment survives the trip backward through time.
Implementation check. Log the global gradient norm at every optimization step; recurring spikes motivate a clipping threshold, while norms that decay by orders of magnitude from late to early timesteps signal vanishing gradients.
Common mistake. A small gradient norm late in training can simply mean convergence; to diagnose vanishing gradients, compare norms across timesteps, not only across training steps.
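A sketch that unrolls an RNN cell step by step and records the gradient norm with respect to each hidden state, using a toy loss on the final state only; decay of the norms toward $t = 0$ is the vanishing-gradient signature.

```python
import torch
from torch import nn

B, T, d_x, d_h = 2, 30, 4, 8
cell = nn.RNNCell(d_x, d_h)     # step-by-step so each h_t is a graph node we can probe
X = torch.randn(B, T, d_x)

h = torch.zeros(B, d_h)
states = []
for t in range(T):
    h = cell(X[:, t], h)
    h.retain_grad()             # keep .grad on this intermediate state
    states.append(h)

loss = states[-1].pow(2).mean() # toy loss on the final state only
loss.backward()

# Norm of dL/dh_t as a function of t: shrinking toward t = 0 indicates vanishing gradients.
norms = [s.grad.norm().item() for s in states]
print(norms[0], norms[T // 2], norms[-1])
```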
10.3 Gate statistics
Main idea. Saturated gates indicate stuck memory behavior.
Core relation:
$f_t, i_t, z_t$ near 0 or 1
Gate activations are sigmoids, so their histograms tell a clear story: values pinned at 0 or 1 mean the gate has effectively stopped modulating, while values stuck near 0.5 mean it never learned to gate at all.
Implementation check. Log the mean activation and the fraction of saturated units separately for the forget, input/update, and output gates over a validation batch.
AI connection. Gate histograms quickly reveal whether an LSTM or GRU is actually using its memory.
Common mistake. A forget gate saturated near 1 is not automatically a problem; it can be exactly what long-term retention requires. Judge saturation together with task performance.
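A sketch of a single manual LSTM step that exposes its gates so saturation can be measured; the packed gate order (input, forget, candidate, output) mirrors the convention PyTorch uses for its LSTM weights, and the weights here are random placeholders.

```python
import torch

def lstm_step_with_gates(x, h, c, W_ih, W_hh, b):
    """One LSTM step that also returns its gate activations (gate order i, f, g, o)."""
    z = x @ W_ih.T + h @ W_hh.T + b
    i, f, g, o = z.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_new = f * c + i * g
    h_new = o * torch.tanh(c_new)
    return h_new, c_new, {"input": i, "forget": f, "output": o}

B, d_x, d_h = 32, 4, 8
W_ih, W_hh = torch.randn(4 * d_h, d_x), torch.randn(4 * d_h, d_h)
b = torch.zeros(4 * d_h)
h = c = torch.zeros(B, d_h)

h, c, gates = lstm_step_with_gates(torch.randn(B, d_x), h, c, W_ih, W_hh, b)

# Saturation diagnostic: fraction of gate units pinned near 0 or 1.
for name, g in gates.items():
    saturated = ((g < 0.05) | (g > 0.95)).float().mean().item()
    print(f"{name}: mean={g.mean():.2f} saturated={saturated:.2%}")
```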
10.4 Length tests
Main idea. Evaluate short and long sequences separately.
Core relation:
A model that fits the training length distribution can still fail on longer inputs, because small errors in the recurrent state compound with every extra step.
Implementation check. Bucket the evaluation set by sequence length and report metrics per bucket, including lengths longer than anything seen during training.
Common mistake. A single metric averaged over a length-skewed test set hides length-generalization failures behind the many short examples.
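A sketch of a per-length evaluation helper, assuming a many-to-one classifier called as model(X) that returns one logit vector per sequence; the (X, y, lengths) batching format is hypothetical.

```python
import torch
from collections import defaultdict

def accuracy_by_length(model, batches):
    """batches: iterable of (X, y, lengths); returns accuracy per sequence length."""
    correct, total = defaultdict(int), defaultdict(int)
    model.eval()
    with torch.no_grad():
        for X, y, lengths in batches:
            logits = model(X)                 # assumed many-to-one head: (B, num_classes)
            pred = logits.argmax(dim=1)
            for p, t, L in zip(pred, y, lengths):
                L = int(L)
                correct[L] += int(p == t)
                total[L] += 1
    return {L: correct[L] / total[L] for L in sorted(total)}

# Usage idea: run on buckets that include lengths longer than those seen in training
# and plot accuracy against L to expose length-generalization failures.
```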
10.5 Ablations
Main idea. Compare vanilla RNN, GRU, LSTM, and the attention bridge.
Core relation:
An ablation isolates what each ingredient buys: gates versus no gates, and attention versus a fixed-size context, under the same data and training budget.
Implementation check. Hold the dataset, optimizer, schedule, and parameter budget fixed and change only the recurrent cell or the presence of attention.
Common mistake. Comparing cells with very different parameter counts confounds capacity with architecture; match hidden sizes or total parameters before drawing conclusions.
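A sketch of an ablation harness that swaps only the recurrent cell while keeping the head and hidden size fixed; the parameter-count printout is a reminder to match budgets before comparing results.

```python
import torch
from torch import nn

def make_model(cell: str, d_x: int = 8, d_h: int = 64, n_classes: int = 2) -> nn.Module:
    """Same head and hidden size for every variant; only the recurrent cell changes."""
    rnn_cls = {"rnn": nn.RNN, "gru": nn.GRU, "lstm": nn.LSTM}[cell]

    class Classifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.rnn = rnn_cls(d_x, d_h, batch_first=True)
            self.head = nn.Linear(d_h, n_classes)

        def forward(self, x):
            H, _ = self.rnn(x)
            return self.head(H[:, -1])        # many-to-one: classify from the last state

    return Classifier()

for cell in ["rnn", "gru", "lstm"]:
    model = make_model(cell)
    n_params = sum(p.numel() for p in model.parameters())
    print(cell, n_params)   # counts differ: match hidden sizes or budgets before comparing
```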
Practice Exercises
- Compute one vanilla RNN hidden update.
- Compute a sequence log probability from conditional probabilities.
- Show a scalar gradient product that vanishes or explodes.
- Clip a gradient vector by norm.
- Compute one LSTM cell update.
- Compute one GRU hidden update.
- Apply a padding mask to sequence losses.
- Identify shapes for many-to-one and many-to-many tasks.
- Compute attention weights and context over encoder states.
- Write an RNN debugging checklist.
Why This Matters for AI
Transformers dominate current LLMs, but RNNs still teach the core sequence-learning problems: hidden state, recurrence, long-range credit assignment, gradient stability, teacher forcing, and attention as a solution to fixed-context bottlenecks. Understanding RNNs makes transformer design feel less arbitrary.
Bridge to Transformer Architecture
Transformers replace recurrent state updates with attention over token states. The next section studies how self-attention, residual streams, normalization, and feed-forward blocks solve many RNN bottlenecks while introducing their own memory and compute tradeoffs.
References
- Sepp Hochreiter and Jurgen Schmidhuber, "Long Short-Term Memory", Neural Computation, 1997: https://doi.org/10.1162/neco.1997.9.8.1735
- Kyunghyun Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", 2014: https://arxiv.org/abs/1406.1078
- Alex Graves, "Generating Sequences With Recurrent Neural Networks", 2013: https://arxiv.org/abs/1308.0850
- Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, "On the difficulty of training Recurrent Neural Networks", 2013: https://arxiv.org/abs/1211.5063
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", 2014: https://arxiv.org/abs/1409.0473