RNN and LSTM Math, Part 2: GRU Cell to References

6. GRU Cell

This part studies the GRU cell through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Update gate | Interpolate old and candidate states | z_t = \sigma(W_z[x_t, h_{t-1}]) |
| Reset gate | Control how much past enters the candidate | r_t = \sigma(W_r[x_t, h_{t-1}]) |
| Candidate hidden | Build proposed new hidden state | \tilde h_t = \tanh(W_h[x_t, r_t \odot h_{t-1}]) |
| Hidden update | Blend old state and candidate | h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t |
| GRU versus LSTM | GRU is smaller; LSTM has separate cell and hidden states | h_t only versus (c_t, h_t) |
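
The sketch below steps once through all four GRU equations in NumPy on toy dimensions. The random weights, the input size of 3, and the hidden size of 2 are illustrative assumptions, and bias terms are omitted to match the lesson's formulas.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, h = 3, 2                      # assumed toy sizes: input dim 3, hidden dim 2
x_t = rng.standard_normal(d)     # current input x_t
h_prev = rng.standard_normal(h)  # previous hidden state h_{t-1}

# Each gate acts on the concatenation [x_t, h_{t-1}], so every weight is (h, d + h).
W_z = rng.standard_normal((h, d + h)) * 0.1
W_r = rng.standard_normal((h, d + h)) * 0.1
W_h = rng.standard_normal((h, d + h)) * 0.1

xh = np.concatenate([x_t, h_prev])
z_t = sigmoid(W_z @ xh)                                        # update gate
r_t = sigmoid(W_r @ xh)                                        # reset gate
h_tilde = np.tanh(W_h @ np.concatenate([x_t, r_t * h_prev]))   # candidate state
h_t = (1 - z_t) * h_prev + z_t * h_tilde                       # blend of old state and candidate

print(h_t.shape)  # (2,): same shape as h_prev
```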

6.1 Update gate

Main idea. Interpolate old and candidate states.

Core relation:

z_t = \sigma(W_z[x_t, h_{t-1}])

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.
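
The two powers quoted above can be checked directly:

```python
print(0.8 ** 20)  # about 0.012: the signal all but vanishes over 20 steps
print(1.2 ** 20)  # about 38: the same path explodes
```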

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

6.2 Reset gate

Main idea. Control how much past enters the candidate.

Core relation:

r_t = \sigma(W_r[x_t, h_{t-1}])

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

6.3 Candidate hidden

Main idea. Build proposed new hidden state.

Core relation:

\tilde h_t = \tanh(W_h[x_t, r_t \odot h_{t-1}])

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

6.4 Hidden update

Main idea. Blend old state and candidate.

Core relation:

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

6.5 GRU versus LSTM

Main idea. The GRU is smaller; the LSTM keeps separate cell and hidden states.

Core relation:

h_t only versus (c_t, h_t)
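
One concrete way to see the size difference is to count parameters for a single-layer GRU and LSTM at the same width. This is a minimal sketch in PyTorch (used here only as an example framework); the sizes 128 and 256 are arbitrary.

```python
import torch.nn as nn

d, h = 128, 256  # example input and hidden sizes
gru = nn.GRU(d, h, batch_first=True)
lstm = nn.LSTM(d, h, batch_first=True)

count = lambda m: sum(p.numel() for p in m.parameters())
# The GRU stacks 3 gate blocks, the LSTM stacks 4, so the ratio is roughly 3/4.
print(count(gru), count(lstm))
```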

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

7. Sequence Tasks

This part studies sequence tasks through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Many-to-one | Classify a whole sequence from final or pooled state | \hat y = g(h_T) |
| Many-to-many | Predict at every time step | \hat y_t = g(h_t) |
| Seq2seq | Encode one sequence and decode another | p(y_{1:M} \mid x_{1:T}) = \prod_j p(y_j \mid y_{<j}, c) |
| Bidirectional RNN | Use past and future context for non-causal tasks | h_t = [\overrightarrow h_t; \overleftarrow h_t] |
| Teacher forcing | Decoder conditions on gold previous outputs during training | p(y_t \mid y_{<t}^\star, c) |

7.1 Many-to-one

Main idea. Classify a whole sequence from final or pooled state.

Core relation:

\hat y = g(h_T)
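
A minimal PyTorch sketch of this relation: run a GRU over a batch and classify from the final hidden state. All sizes and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

B, T, d, h, C = 4, 12, 8, 16, 3          # batch, time, feature, hidden, classes (assumed)
x = torch.randn(B, T, d)                 # (batch, time, feature)
gru = nn.GRU(d, h, batch_first=True)
head = nn.Linear(h, C)                   # g(.) as a linear classifier

out, h_n = gru(x)                        # out: (B, T, h), h_n: (1, B, h)
logits = head(h_n[-1])                   # classify from the final state h_T
print(logits.shape)                      # (B, C): one prediction per sequence
```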

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

7.2 Many-to-many

Main idea. Predict at every time step.

Core relation:

\hat y_t = g(h_t)
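
The many-to-many case only changes where the output head is applied: every step's hidden state gets a prediction. A minimal PyTorch sketch with the same kind of assumed toy sizes as before:

```python
import torch
import torch.nn as nn

B, T, d, h, C = 4, 12, 8, 16, 3          # assumed toy sizes
x = torch.randn(B, T, d)
gru = nn.GRU(d, h, batch_first=True)
head = nn.Linear(h, C)

out, _ = gru(x)                          # out holds h_t for every step: (B, T, h)
step_logits = head(out)                  # nn.Linear maps the last axis: (B, T, C)
print(step_logits.shape)                 # one prediction per time step
```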

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

7.3 Seq2seq

Main idea. Encode one sequence and decode another.

Core relation:

p(y_{1:M} \mid x_{1:T}) = \prod_j p(y_j \mid y_{<j}, c)
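
In practice the product over conditionals is computed as a sum of log probabilities. A small numeric sketch with made-up per-step probabilities for a three-token output:

```python
import math

# Hypothetical per-step probabilities p(y_j | y_<j, c) from a decoder
step_probs = [0.7, 0.5, 0.9]

log_prob = sum(math.log(p) for p in step_probs)
print(log_prob)            # log 0.7 + log 0.5 + log 0.9 ≈ -1.16
print(math.exp(log_prob))  # back to the product: 0.315
```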

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

7.4 Bidirectional RNN

Main idea. Use past and future context for non-causal tasks.

Core relation:

h_t = [\overrightarrow h_t; \overleftarrow h_t]
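
In PyTorch, bidirectional=True runs a forward and a backward pass and concatenates them along the feature axis, which is exactly the concatenation above. A minimal sketch with assumed sizes:

```python
import torch
import torch.nn as nn

B, T, d, h = 4, 10, 8, 16
x = torch.randn(B, T, d)
birnn = nn.GRU(d, h, batch_first=True, bidirectional=True)

out, h_n = birnn(x)
print(out.shape)   # (B, T, 2*h): forward and backward states concatenated per step
print(h_n.shape)   # (2, B, h): final state of each direction
```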

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

7.5 Teacher forcing

Main idea. Decoder conditions on gold previous outputs during training.

Core relation:

p(y_t \mid y_{<t}^\star, c)
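
Teacher forcing is mostly a data-preparation detail: the decoder input at step t is the gold token from step t-1, so the inputs are the targets shifted right by one position. A minimal PyTorch sketch, assuming a start-of-sequence token id of 0:

```python
import torch

targets = torch.tensor([[5, 7, 2, 9],     # gold output tokens y* for two sequences
                        [3, 3, 8, 1]])
sos = torch.zeros(targets.size(0), 1, dtype=torch.long)  # assumed <sos> id = 0

decoder_inputs = torch.cat([sos, targets[:, :-1]], dim=1)  # shift right: feed y*_{<t}
print(decoder_inputs)
# At step t the decoder sees the gold y*_{t-1}, not its own previous prediction.
```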

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

8. Attention Bridge

This part studies the attention bridge through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Fixed context bottleneck | A single encoder vector struggles with long inputs | c = h_T |
| Alignment scores | Decoder state scores every encoder state | e_{tj} = a(s_{t-1}, h_j) |
| Attention weights | Softmax turns scores into a distribution over positions | \alpha_{tj} = \mathrm{softmax}_j(e_{tj}) |
| Context vector | Weighted sum exposes relevant encoder states | c_t = \sum_j \alpha_{tj} h_j |
| Transformer bridge | Self-attention removes recurrence and exposes all token states directly | \mathrm{Attention}(Q, K, V) |
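
The sketch below runs the whole bridge once in NumPy: alignment scores against each encoder state, a softmax over positions, and the weighted-sum context vector. The dot-product form of the scoring function a(.,.) and the toy sizes are illustrative assumptions; the lesson leaves a(.,.) generic.

```python
import numpy as np

rng = np.random.default_rng(0)
T, h = 5, 4                              # assumed: 5 encoder positions, size-4 states
enc = rng.standard_normal((T, h))        # encoder states h_1..h_T
s_prev = rng.standard_normal(h)          # previous decoder state s_{t-1}

scores = enc @ s_prev                    # e_{tj} = a(s_{t-1}, h_j), dot-product scoring
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # alpha_{tj} = softmax_j(e_{tj})
context = weights @ enc                  # c_t = sum_j alpha_{tj} h_j

print(weights.round(3), weights.sum())   # a distribution over the T positions
print(context.shape)                     # (h,): same size as one encoder state
```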

8.1 Fixed context bottleneck

Main idea. A single encoder vector struggles with long inputs.

Core relation:

c = h_T

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

8.2 Alignment scores

Main idea. Decoder state scores every encoder state.

Core relation:

e_{tj} = a(s_{t-1}, h_j)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

8.3 Attention weights

Main idea. Softmax turns scores into a distribution over positions.

Core relation:

\alpha_{tj} = \mathrm{softmax}_j(e_{tj})

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is the historical bridge from seq2seq RNNs to transformer attention.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

8.4 Context vector

Main idea. Weighted sum exposes relevant encoder states.

Core relation:

c_t = \sum_j \alpha_{tj} h_j

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

8.5 Transformer bridge

Main idea. Self-attention removes recurrence and exposes all token states directly.

Core relation:

\mathrm{Attention}(Q, K, V)
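
A minimal NumPy sketch of the standard scaled dot-product form, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, that replaces recurrence; the scaling by √d_k follows the usual transformer convention and the sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_k = 6, 8                              # assumed: 6 tokens, key size 8
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_k))

A = softmax(Q @ K.T / np.sqrt(d_k))        # every token attends to every token
out = A @ V                                # no chain through time: one matmul exposes all states
print(A.shape, out.shape)                  # (T, T), (T, d_k)
```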

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

9. Training Practice

This part studies training practice through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Padding masks | Sequence batches have variable lengths | m_t \in \{0, 1\} |
| Packed sequences | Avoid computing loss on padding | L = \sum_t m_t \ell_t / \sum_t m_t |
| Stateful streaming | Carry state across chunks for long streams | h_\mathrm{next} = h_T |
| Initialization | Orthogonal recurrent matrices can stabilize early dynamics | W_{hh}^\top W_{hh} = I |
| Regularization | Dropout, weight decay, and clipping fight overfitting and instability | L + \lambda \Vert\theta\Vert^2 |

9.1 Padding masks

Main idea. Sequence batches have variable lengths.

Core relation:

m_t \in \{0, 1\}
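
A minimal PyTorch sketch of building the binary mask m_t from sequence lengths and averaging per-step losses only over real tokens (this also realizes the masked-loss formula in the next subsection). The lengths and losses are made up.

```python
import torch

lengths = torch.tensor([4, 2, 3])                     # true lengths in a padded batch (assumed)
B, T = lengths.size(0), 4
mask = (torch.arange(T).unsqueeze(0) < lengths.unsqueeze(1)).float()  # (B, T), 1 on real steps

step_loss = torch.rand(B, T)                          # stand-in for per-step losses l_t
masked_loss = (mask * step_loss).sum() / mask.sum()   # L = sum_t m_t * l_t / sum_t m_t
print(mask)
print(masked_loss)
```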

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

9.2 Packed sequences

Main idea. Avoid computing loss on padding.

Core relation:

L = \sum_t m_t \ell_t / \sum_t m_t
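
In PyTorch the usual tool here is pack_padded_sequence, which lets the recurrent kernel skip padded steps entirely rather than masking them afterwards. A minimal sketch with assumed shapes and lengths:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

B, T, d, h = 3, 5, 4, 6
x = torch.randn(B, T, d)
lengths = torch.tensor([5, 3, 2])                 # true lengths (assumed)

gru = nn.GRU(d, h, batch_first=True)
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = gru(packed)                     # recurrence never touches padding
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape, out_lengths)                     # (B, T, h) with zeros past each length
```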

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

9.3 Stateful streaming

Main idea. Carry state across chunks for long streams.

Core relation:

h_\mathrm{next} = h_T
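
When a long stream is processed in chunks, the final state of one chunk seeds the next, but it should be detached so backpropagation stays within the current chunk (truncated BPTT). A minimal PyTorch sketch; the stream length and chunk size are arbitrary.

```python
import torch
import torch.nn as nn

d, h, B = 4, 8, 2
gru = nn.GRU(d, h, batch_first=True)
stream = torch.randn(B, 100, d)          # one long stream, split into chunks of 20 (assumed)

state = None                             # h_0 defaults to zeros
for chunk in stream.split(20, dim=1):
    out, state = gru(chunk, state)       # h_next = h_T of the previous chunk
    # ... compute the loss on `out` and backprop here ...
    state = state.detach()               # keep the value, cut the gradient history
```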

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

9.4 Initialization

Main idea. Orthogonal recurrent matrices can stabilize early dynamics.

Core relation:

W_{hh}^\top W_{hh} = I
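
In PyTorch the recurrent weights can be re-initialized orthogonally after the module is built; the sketch below follows the standard nn.GRU parameter naming, and the layer sizes are arbitrary.

```python
import torch.nn as nn

gru = nn.GRU(input_size=16, hidden_size=32, num_layers=2, batch_first=True)

for name, param in gru.named_parameters():
    if "weight_hh" in name:              # recurrent (hidden-to-hidden) matrices
        nn.init.orthogonal_(param)       # columns orthonormal: W_hh^T W_hh = I at init
```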

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

9.5 Regularization

Main idea. Dropout, weight decay, and clipping fight overfitting and instability.

Core relation:

L + \lambda \Vert\theta\Vert^2
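
Weight decay is usually passed to the optimizer (for plain SGD this matches the L + λ‖θ‖² penalty), and gradient clipping is applied between backward and the optimizer step. A minimal sketch of that wiring; the model, loss, and hyperparameters are stand-ins.

```python
import torch
import torch.nn as nn

model = nn.GRU(8, 16, batch_first=True)
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # lambda * ||theta||^2

x = torch.randn(4, 10, 8)
out, _ = model(x)
loss = out.pow(2).mean()                 # stand-in loss L

opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # guard against exploding gradients
opt.step()
```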

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

10. Diagnostics

This part studies diagnostics through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Shape checks | Inputs, hidden states, gates, and outputs have distinct axes | (B, T, d) and (B, h) |
| Gradient norms | Track exploding or vanishing gradients through time | \Vert g_t \Vert |
| Gate statistics | Saturated gates indicate stuck memory behavior | f_t, i_t, z_t near 0 or 1 |
| Length tests | Evaluate short and long sequences separately | S(T) |
| Ablations | Compare vanilla RNN, GRU, LSTM, and attention bridge | \Delta L, \Delta S |

10.1 Shape checks

Main idea. Inputs, hidden states, gates, and outputs have distinct axes.

Core relation:

(B, T, d) and (B, h)
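
Shape checks are cheapest as plain assertions right where the tensors are produced. A minimal PyTorch sketch, assuming a batch-first layout and arbitrary sizes:

```python
import torch
import torch.nn as nn

B, T, d, h = 4, 12, 8, 16
x = torch.randn(B, T, d)
gru = nn.GRU(d, h, batch_first=True)
out, h_n = gru(x)

assert x.shape == (B, T, d)        # inputs: (batch, time, feature)
assert out.shape == (B, T, h)      # full output sequence: (batch, time, hidden)
assert h_n.shape == (1, B, h)      # final state per layer and direction
```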

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

10.2 Gradient norms

Main idea. Track exploding or vanishing gradients through time.

Core relation:

\Vert g_t \Vert
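
A minimal way to watch for exploding or vanishing gradients is to log the norm of each parameter's gradient (and the total norm) after backward. A PyTorch sketch with a stand-in loss:

```python
import torch
import torch.nn as nn

model = nn.GRU(8, 16, batch_first=True)
out, _ = model(torch.randn(4, 30, 8))
out.mean().backward()                    # stand-in loss, just to populate gradients

total = 0.0
for name, p in model.named_parameters():
    g = p.grad.norm().item()             # ||g|| per parameter tensor
    total += g ** 2
    print(f"{name}: {g:.4e}")
print("total grad norm:", total ** 0.5)  # spikes suggest exploding, near-zero suggests vanishing
```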

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

10.3 Gate statistics

Main idea. Saturated gates indicate stuck memory behavior.

Core relation:

f_t, i_t, z_t near 0 or 1
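
Library RNN modules do not expose gate activations directly, so a quick diagnostic is to compute a gate by hand and count how often it sits near 0 or 1. A minimal NumPy sketch for the GRU update gate z_t over a batch of states; the weights are random stand-ins and the 0.01/0.99 saturation thresholds are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
B, d, h = 256, 8, 16                      # assumed batch of states to inspect
x_t = rng.standard_normal((B, d))
h_prev = rng.standard_normal((B, h))
W_z = rng.standard_normal((h, d + h))     # stand-in for a trained update-gate matrix

z_t = sigmoid(np.concatenate([x_t, h_prev], axis=1) @ W_z.T)   # (B, h) update-gate values

saturated = ((z_t < 0.01) | (z_t > 0.99)).mean()
print("mean z:", z_t.mean(), "fraction saturated:", saturated)
# Gates pinned near 0 or 1 across the whole batch suggest the memory path is stuck.
```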

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. Gate histograms quickly reveal whether an LSTM or GRU is actually using its memory.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

10.4 Length tests

Main idea. Evaluate short and long sequences separately.

Core relation:

S(T)

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

10.5 Ablations

Main idea. Compare the vanilla RNN, GRU, LSTM, and the attention bridge.

Core relation:

\Delta L, \Delta S

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about 0.8^{20}. If it is 1.2, the same path grows like 1.2^{20}. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.


Practice Exercises

  1. Compute one vanilla RNN hidden update.
  2. Compute a sequence log probability from conditional probabilities.
  3. Show a scalar gradient product that vanishes or explodes.
  4. Clip a gradient vector by norm.
  5. Compute one LSTM cell update.
  6. Compute one GRU hidden update.
  7. Apply a padding mask to sequence losses.
  8. Identify shapes for many-to-one and many-to-many tasks.
  9. Compute attention weights and context over encoder states.
  10. Write an RNN debugging checklist.

Why This Matters for AI

Transformers dominate current LLMs, but RNNs still teach the core sequence-learning problems: hidden state, recurrence, long-range credit assignment, gradient stability, teacher forcing, and attention as a solution to fixed-context bottlenecks. Understanding RNNs makes transformer design feel less arbitrary.

Bridge to Transformer Architecture

Transformers replace recurrent state updates with attention over token states. The next section studies how self-attention, residual streams, normalization, and feed-forward blocks solve many RNN bottlenecks while introducing their own memory and compute tradeoffs.

