RNN and LSTM Math: Parts 1 (Sequence Modeling View) to 5 (LSTM Cell)
1. Sequence Modeling View
This part studies the sequence modeling view through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| Sequential data | observations arrive with order and history | $x_{1:T} = (x_1, \dots, x_T)$ |
| Hidden state | the model carries a summary of the past | $h_t = f(h_{t-1}, x_t)$ |
| Output distribution | each state can produce a prediction | $p(y_t \mid h_t)$ |
| Autoregressive generation | the previous output can become the next input | $x_{t+1} \sim p(\cdot \mid h_t)$ |
| RNN versus transformer | RNNs compress history into a state, transformers keep token states visible | $h_t$ versus $H_{1:t}$ |
1.1 Sequential data
Main idea. Observations arrive with order and history.
Core relation: $x_{1:T} = (x_1, x_2, \dots, x_T)$, an ordered sequence of observations.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
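To make the axis convention concrete, here is a minimal shape check with PyTorch's `nn.RNN` (the layer and the sizes are illustrative choices, not part of the lesson):

```python
import torch
import torch.nn as nn

batch, time, feature, hidden = 4, 20, 8, 16

x = torch.randn(batch, time, feature)   # batch-first input: (batch, time, feature)
rnn = nn.RNN(input_size=feature, hidden_size=hidden, batch_first=True)

output, h_n = rnn(x)
print(output.shape)  # torch.Size([4, 20, 16]) -> (batch, time, hidden), the full sequence
print(h_n.shape)     # torch.Size([1, 4, 16])  -> (num_layers, batch, hidden), the final state
```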
1.2 Hidden state
Main idea. The model carries a summary of the past.
Core relation: $h_t = f(h_{t-1}, x_t)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
1.3 Output distribution
Main idea. Each state can produce a prediction.
Core relation: $p(y_t \mid h_t)$, a prediction read off from each state.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
1.4 Autoregressive generation
Main idea. The previous output can become the next input.
Core relation: $x_{t+1} \sim p(\cdot \mid h_t)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
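A minimal sketch of the autoregressive loop, assuming a hypothetical `step(x_t, h)` function that returns next-token logits and the updated hidden state (the names are placeholders introduced here, not defined in the lesson):

```python
import torch

def generate(step, x0, h0, num_steps):
    """Greedy autoregressive rollout: each prediction is fed back as the next input."""
    x, h = x0, h0
    tokens = []
    for _ in range(num_steps):
        logits, h = step(x, h)        # one recurrent update from the current input
        x = logits.argmax(dim=-1)     # previous output becomes the next input
        tokens.append(x)
    return torch.stack(tokens, dim=-1)  # (batch, num_steps) of generated token ids
```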
1.5 RNN versus transformer
Main idea. RNNs compress history into a state; transformers keep token states visible.
Core relation: $h_t$ versus $H_{1:t}$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
2. Vanilla RNN Equations
This part studies the vanilla RNN equations through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| State update | combine current input and previous hidden state | $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ |
| Output head | map hidden state to logits or regression output | $\hat{y}_t = W_{hy} h_t + b_y$ |
| Parameter sharing | the same weights are reused at every time step | same $W_{xh}, W_{hh}, b_h$ for all $t$ |
| Unrolled graph | an RNN is a deep network along time | $h_T = f(\dots f(h_0, x_1) \dots, x_T)$ |
| Initial state | start from zeros or a learned state | $h_0 = 0$ or trainable |
2.1 State update
Main idea. Combine current input and previous hidden state.
Core relation: $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
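A minimal NumPy sketch of the state update, matching the column-vector form $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
feature, hidden, T = 8, 16, 20

W_xh = rng.normal(scale=0.1, size=(hidden, feature))
W_hh = rng.normal(scale=0.1, size=(hidden, hidden))
b_h = np.zeros(hidden)

x_seq = rng.normal(size=(T, feature))
h = np.zeros(hidden)                              # h_0 = 0

states = []
for x_t in x_seq:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)      # same weights reused at every step
    states.append(h)
states = np.stack(states)                         # (T, hidden)
```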
2.2 Output head
Main idea. Map hidden state to logits or regression output.
Core relation: $\hat{y}_t = W_{hy} h_t + b_y$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
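On top of the states from the previous sketch, the output head is one linear map applied at every time step; the vocabulary size here is a made-up example value:

```python
import numpy as np

rng = np.random.default_rng(1)
T, hidden, vocab = 20, 16, 10

states = rng.normal(size=(T, hidden))               # stand-in for the RNN hidden states
W_hy = rng.normal(scale=0.1, size=(vocab, hidden))
b_y = np.zeros(vocab)

logits = states @ W_hy.T + b_y                      # (T, vocab): one prediction per time step
```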
2.3 Parameter sharing
Main idea. The same weights are reused at every time step.
Core relation: the same $W_{xh}$, $W_{hh}$, and $b_h$ are reused at every time step $t$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
2.4 Unrolled graph
Main idea. An RNN is a deep network along time.
Core relation: $h_T = f(\dots f(f(h_0, x_1), x_2) \dots, x_T)$, a composition of depth $T$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
2.5 Initial state
Main idea. Start from zeros or a learned state.
Core relation: $h_0 = 0$ or trainable.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
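A hedged PyTorch sketch of a trainable initial state: `h0` is stored as a parameter and expanded across the batch at the start of every sequence (a common pattern, not prescribed by the lesson text):

```python
import torch
import torch.nn as nn

class RNNWithLearnedInit(nn.Module):
    def __init__(self, feature, hidden):
        super().__init__()
        self.rnn = nn.RNN(feature, hidden, batch_first=True)
        # Learned initial state, updated by the optimizer like any other weight.
        self.h0 = nn.Parameter(torch.zeros(1, 1, hidden))

    def forward(self, x):
        batch = x.size(0)
        h0 = self.h0.expand(1, batch, -1).contiguous()   # (num_layers, batch, hidden)
        return self.rnn(x, h0)

model = RNNWithLearnedInit(feature=8, hidden=16)
output, h_n = model(torch.randn(4, 20, 8))
```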
3. Backpropagation Through Time
This part studies backpropagation through time, viewed through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| Sequence loss | sum or average losses over time | $\mathcal{L} = \sum_{t=1}^{T} \ell_t$ |
| Temporal chain rule | future losses depend on earlier states through recurrence | $\frac{\partial \mathcal{L}}{\partial h_t} = \frac{\partial \ell_t}{\partial h_t} + \left(\frac{\partial h_{t+1}}{\partial h_t}\right)^{\top} \frac{\partial \mathcal{L}}{\partial h_{t+1}}$ |
| Weight gradient | shared weights collect gradient from every time step | $\frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t} \frac{\partial \mathcal{L}}{\partial h_t} \frac{\partial h_t}{\partial W_{hh}}$ |
| Truncated BPTT | backpropagate through a limited window for efficiency | window of $k$ steps |
| State detachment | detach hidden state between chunks to control graph length | $h \leftarrow \operatorname{detach}(h)$ |
3.1 Sequence loss
Main idea. Sum or average losses over time.
Core relation: $\mathcal{L} = \sum_{t=1}^{T} \ell_t$ (or the average over $T$).
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
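A minimal sketch of summing or averaging per-step losses, assuming classification targets and logits of shape (batch, time, vocab):

```python
import torch
import torch.nn.functional as F

def sequence_loss(logits, targets, average=True):
    """Cross-entropy at every time step, summed (or averaged) over time, averaged over the batch."""
    batch, time, vocab = logits.shape
    per_step = F.cross_entropy(
        logits.reshape(batch * time, vocab),
        targets.reshape(batch * time),
        reduction="none",
    ).reshape(batch, time)
    per_sequence = per_step.sum(dim=1)                   # L = sum over t of l_t for each sequence
    return (per_sequence / time).mean() if average else per_sequence.mean()

loss = sequence_loss(torch.randn(4, 20, 10), torch.randint(0, 10, (4, 20)))
```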
3.2 Temporal chain rule
Main idea. Future losses depend on earlier states through recurrence.
Core relation: $\frac{\partial \mathcal{L}}{\partial h_t} = \frac{\partial \ell_t}{\partial h_t} + \left(\frac{\partial h_{t+1}}{\partial h_t}\right)^{\top} \frac{\partial \mathcal{L}}{\partial h_{t+1}}$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is why training an RNN amounts to training a very deep network along the time axis.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
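Written out for the vanilla update $h_{t+1} = \tanh(W_{xh} x_{t+1} + W_{hh} h_t + b_h)$, the recursion and the one-step Jacobian it relies on are:

```latex
\frac{\partial \mathcal{L}}{\partial h_t}
  = \frac{\partial \ell_t}{\partial h_t}
  + \left(\frac{\partial h_{t+1}}{\partial h_t}\right)^{\!\top}
    \frac{\partial \mathcal{L}}{\partial h_{t+1}},
\qquad
\frac{\partial h_{t+1}}{\partial h_t}
  = \operatorname{diag}\!\left(1 - h_{t+1}^{2}\right) W_{hh}.
% Unrolling this recursion produces the product of Jacobians analyzed in Part 4.
```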
3.3 Weight gradient
Main idea. Shared weights collect gradient from every time step.
Core relation: $\frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}}{\partial h_t} \frac{\partial h_t}{\partial W_{hh}}$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
3.4 Truncated BPTT
Main idea. Backpropagate through a limited window for efficiency.
Core relation: each loss term backpropagates through at most $k$ time steps.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
3.5 State detachment
Main idea. Detach hidden state between chunks to control graph length.
Core relation: $h \leftarrow \operatorname{detach}(h)$ between chunks.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
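A hedged PyTorch sketch combining truncation and detachment: `model`, `criterion`, and `optimizer` are placeholders for an RNN whose forward takes and returns a hidden state (as `nn.RNN` does), and `k` is the truncation window:

```python
import torch

def train_truncated_bptt(model, criterion, optimizer, x, y, k):
    """Process a long sequence in windows of k steps, detaching the state between windows."""
    h = None
    total = 0.0
    for start in range(0, x.size(1), k):
        chunk_x = x[:, start:start + k]
        chunk_y = y[:, start:start + k]

        output, h = model(chunk_x, h)   # forward through one window only
        loss = criterion(output, chunk_y)

        optimizer.zero_grad()
        loss.backward()                 # gradients stop at the window boundary
        optimizer.step()

        h = h.detach()                  # cut the graph so the next window starts fresh
        total += loss.item()
    return total
```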
4. Vanishing and Exploding Gradients
This part studies vanishing and exploding gradients through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| Jacobian product | long-range gradients multiply many recurrent Jacobians | $\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}$ |
| Spectral radius | gradient scale depends on recurrent dynamics | $\rho(W_{hh})$ |
| Vanishing | singular values below one shrink long-range gradients | $\sigma_{\max} < 1$ |
| Exploding | singular values above one can blow up gradients | $\sigma_{\max} > 1$ |
| Gradient clipping | cap gradient norm before optimizer update | $g \leftarrow g \cdot \min(1, \tau / \lVert g \rVert)$ |
4.1 Jacobian product
Main idea. Long-range gradients multiply many recurrent jacobians.
Core relation: $\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
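A small NumPy check of the Jacobian product. To keep the arithmetic exact, the recurrent matrix is a scaled rotation, so every singular value equals the chosen value and the 20-step product has norm exactly $\sigma^{20}$ (for a general matrix, $\sigma_{\max}$ only gives an upper bound):

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_rotation(n, sigma):
    """Random orthogonal matrix scaled by sigma: all singular values equal sigma."""
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return sigma * Q

for sigma in (0.8, 1.2):
    W = scaled_rotation(16, sigma)
    J = np.linalg.matrix_power(W, 20)     # stands in for 20 chained Jacobians
    print(sigma, np.linalg.norm(J, 2))    # ~0.012 for 0.8, ~38 for 1.2
```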
4.2 Spectral radius
Main idea. Gradient scale depends on recurrent dynamics.
Core relation: the scale of $\lVert \partial h_T / \partial h_t \rVert$ is governed by $\rho(W_{hh})$ and bounded through $\sigma_{\max}(W_{hh})$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
4.3 Vanishing
Main idea. Singular values below one shrink long-range gradients.
Core relation: $\sigma_{\max}(W_{hh}) < 1 \Rightarrow \lVert \partial h_T / \partial h_t \rVert \to 0$ as $T - t$ grows.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
4.4 Exploding
Main idea. Singular values above one can blow up gradients.
Core relation: $\sigma_{\max}(W_{hh}) > 1 \Rightarrow \lVert \partial h_T / \partial h_t \rVert$ can grow exponentially in $T - t$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
4.5 Gradient clipping
Main idea. Cap gradient norm before optimizer update.
Core relation: $g \leftarrow g \cdot \min\!\left(1, \tau / \lVert g \rVert\right)$ for a clipping threshold $\tau$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. Clipping became a standard tool because recurrent gradient products can explode suddenly.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
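In PyTorch the cap is a single call between `backward()` and `step()`; a minimal runnable sketch with an arbitrary model and loss:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 20, 8)
target = torch.randn(4, 20, 16)

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)

optimizer.zero_grad()
loss.backward()

# Rescale all gradients in place so their combined norm is at most max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```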
5. LSTM Cell
This part studies the LSTM cell through the lens of sequence learning. The central question is how information and gradients move through time.
| Subtopic | Main idea | Formula |
|---|---|---|
| Forget gate | decide what memory to keep | $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ |
| Input gate | decide what new content to write | $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ |
| Candidate memory | propose new cell content | $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$ |
| Cell update | additive memory path improves gradient flow | $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ |
| Output gate | expose part of cell memory as hidden state | $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o),\ h_t = o_t \odot \tanh(c_t)$ |
5.1 Forget gate
Main idea. Decide what memory to keep.
Core relation: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
5.2 Input gate
Main idea. Decide what new content to write.
Core relation: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
5.3 Candidate memory
Main idea. Propose new cell content.
Core relation: $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
5.4 Cell update
Main idea. Additive memory path improves gradient flow.
Core relation: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. The additive cell path is the mathematical reason LSTMs can carry information longer than a plain tanh RNN.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
5.5 Output gate
Main idea. Expose part of cell memory as hidden state.
Core relation: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$, $h_t = o_t \odot \tanh(c_t)$.
An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.
Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20} \approx 0.012$. If it is 1.2, the same path grows like $1.2^{20} \approx 38$. This is the vanishing and exploding gradient problem in one line.
Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
AI connection. This is a practical sequence-modeling control variable.
Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
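Putting the five relations together, a minimal NumPy sketch of one LSTM cell step (sizes and initialization are illustrative; framework implementations fuse the four gate matrices into one for speed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the gate equations in this part."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)               # forget gate: what memory to keep
    i_t = sigmoid(W_i @ z + b_i)               # input gate: what new content to write
    c_tilde = np.tanh(W_c @ z + b_c)           # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde         # additive cell update
    o_t = sigmoid(W_o @ z + b_o)               # output gate
    h_t = o_t * np.tanh(c_t)                   # exposed hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
feature, hidden = 8, 16
params = []
for _ in range(4):                             # W_f, W_i, W_c, W_o and their biases
    params += [rng.normal(scale=0.1, size=(hidden, hidden + feature)), np.zeros(hidden)]

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(20, feature)):     # run the cell over a short sequence
    h, c = lstm_step(x_t, h, c, params)
```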