
RNN and LSTM Math, Part 1: Sequence Modeling View to the LSTM Cell


1. Sequence Modeling View

This part studies the sequence modeling view through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Sequential data | observations arrive with order and history | $x_{1:T}=(x_1,\ldots,x_T)$ |
| Hidden state | the model carries a summary of the past | $h_t=f_\theta(h_{t-1},x_t)$ |
| Output distribution | each state can produce a prediction | $p(y_t\mid x_{\le t})=g_\theta(h_t)$ |
| Autoregressive generation | the previous output can become the next input | $p(x_{1:T})=\prod_t p(x_t\mid x_{<t})$ |
| RNN versus transformer | RNNs compress history into a state; transformers keep token states visible | $h_t$ versus $H_{1:t}$ |

1.1 Sequential data

Main idea. Observations arrive with order and history.

Core relation:

$x_{1:T}=(x_1,\ldots,x_T)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
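
As a minimal sketch of that shape check, assuming a PyTorch-style `nn.RNN` with `batch_first=True`; the sizes (batch 4, time 20, feature 8, hidden 16) are arbitrary toy values, not taken from the lesson:

```python
import torch
import torch.nn as nn

batch, time, features, hidden = 4, 20, 8, 16          # arbitrary toy sizes

rnn = nn.RNN(input_size=features, hidden_size=hidden, batch_first=True)
x = torch.randn(batch, time, features)                 # (batch, time, feature)

output, h_n = rnn(x)
print(output.shape)  # torch.Size([4, 20, 16]) -> (batch, time, hidden), one state per step
print(h_n.shape)     # torch.Size([1, 4, 16])  -> (layers, batch, hidden); h_n[0] is the (batch, hidden) final state
```

The per-step state with shape (batch, hidden) that the lesson refers to is `output[:, t, :]` at a given step, or `h_n[0]` for the final step.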

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

1.2 Hidden state

Main idea. The model carries a summary of the past.

Core relation:

$h_t=f_\theta(h_{t-1},x_t)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
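
A tiny sketch of the recurrence as a loop, with the transition written as an opaque `step` function standing in for $f_\theta$; the function itself is a placeholder, not a specific model:

```python
def run_rnn(step, h0, xs):
    """Apply h_t = step(h_{t-1}, x_t) along the sequence and collect every state."""
    h, states = h0, []
    for x_t in xs:
        h = step(h, x_t)       # everything the model knows about the past is inside h
        states.append(h)
    return states
```

However long the sequence is, only the fixed-size state `h` crosses from one step to the next; that compression is exactly what the rest of the lesson analyzes.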

1.3 Output distribution

Main idea. Each state can produce a prediction.

Core relation:

$p(y_t\mid x_{\le t})=g_\theta(h_t)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
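
A hedged NumPy sketch of the readout $g_\theta$: project the hidden state to logits, then apply a softmax to get $p(y_t\mid x_{\le t})$. The weight names and shapes are illustrative assumptions:

```python
import numpy as np

def output_distribution(h_t, W_hy, b_y):
    """Map a hidden state of shape (hidden,) to a probability vector over classes."""
    logits = W_hy @ h_t + b_y                  # (num_classes,)
    logits = logits - logits.max()             # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs
```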

1.4 Autoregressive generation

Main idea. The previous output can become the next input.

Core relation:

$p(x_{1:T})=\prod_t p(x_t\mid x_{<t})$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
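
A sketch of the generation loop under clearly labeled assumptions: `embed`, `cell`, and `head` are hypothetical modules standing in for the input embedding, the recurrence, and the output head, so this shows the control flow rather than any particular library API:

```python
import torch

@torch.no_grad()
def generate(embed, cell, head, start_token, steps, h):
    """Sample x_t ~ p(x_t | x_{<t}) and feed each sample back in as the next input."""
    token, out = start_token, [start_token]
    for _ in range(steps):
        h = cell(embed(token), h)                      # fold the previous token into the state
        probs = torch.softmax(head(h), dim=-1)         # p(x_t | x_{<t})
        token = torch.multinomial(probs, num_samples=1).squeeze(-1)
        out.append(token)
    return out
```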

1.5 RNN versus transformer

Main idea. RNNs compress history into a state; transformers keep token states visible.

Core relation:

$h_t$ versus $H_{1:t}$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

2. Vanilla RNN Equations

This part studies the vanilla RNN equations through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| State update | combine current input and previous hidden state | $h_t=\phi(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$ |
| Output head | map hidden state to logits or regression output | $o_t=W_{hy}h_t+b_y$ |
| Parameter sharing | the same weights are reused at every time step | $\theta_t=\theta$ |
| Unrolled graph | an RNN is a deep network along time | $h_1\rightarrow h_2\rightarrow\cdots\rightarrow h_T$ |
| Initial state | start from zeros or a learned state | $h_0=0$ or trainable |

2.1 State update

Main idea. Combine current input and previous hidden state.

Core relation:

$h_t=\phi(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
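
A minimal NumPy sketch of the update with $\phi=\tanh$; the dimensions (input 3, hidden 5, sequence length 7) are illustrative only:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(5, 3))       # input-to-hidden weights
W_hh = rng.normal(size=(5, 5))       # hidden-to-hidden (recurrent) weights
b_h = np.zeros(5)

h = np.zeros(5)                      # h_0 = 0
for x_t in rng.normal(size=(7, 3)):  # a length-7 toy sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
```

Note that the same three parameter arrays are reused at every step, which is the parameter sharing discussed in 2.3.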

2.2 Output head

Main idea. Map hidden state to logits or regression output.

Core relation:

$o_t=W_{hy}h_t+b_y$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

2.3 Parameter sharing

Main idea. The same weights are reused at every time step.

Core relation:

$\theta_t=\theta$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

2.4 Unrolled graph

Main idea. An RNN is a deep network along time.

Core relation:

$h_1\rightarrow h_2\rightarrow\cdots\rightarrow h_T$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

2.5 Initial state

Main idea. Start from zeros or a learned state.

Core relation:

$h_0=0$ or trainable

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
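
A hedged PyTorch sketch of making $h_0$ trainable instead of zeros; the module name and sizes are illustrative:

```python
import torch
import torch.nn as nn

class RNNWithLearnedInit(nn.Module):
    """Wrap nn.RNN so the initial hidden state is a learned parameter rather than zeros."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.h0 = nn.Parameter(torch.zeros(1, 1, hidden_size))    # (layers, 1, hidden), trained with the model

    def forward(self, x):                                          # x: (batch, time, feature)
        h0 = self.h0.expand(-1, x.size(0), -1).contiguous()        # repeat the learned state across the batch
        return self.rnn(x, h0)
```

Passing no initial state to `nn.RNN` is equivalent to the $h_0=0$ choice.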

3. Backpropagation Through Time

This part studies backpropagation through time through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Sequence loss | sum or average losses over time | $L=\sum_{t=1}^{T}\ell_t$ |
| Temporal chain rule | future losses depend on earlier states through recurrence | $\partial L/\partial h_t=\partial\ell_t/\partial h_t+(\partial h_{t+1}/\partial h_t)^\top\,\partial L/\partial h_{t+1}$ |
| Weight gradient | shared weights collect gradient from every time step | $\partial L/\partial W=\sum_t\partial\ell_t/\partial W$ |
| Truncated BPTT | backpropagate through a limited window for efficiency | $t-k,\ldots,t$ |
| State detachment | detach hidden state between chunks to control graph length | $h_t\leftarrow\mathrm{stopgrad}(h_t)$ |

3.1 Sequence loss

Main idea. Sum or average losses over time.

Core relation:

$L=\sum_{t=1}^{T}\ell_t$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
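
A short PyTorch sketch of the per-step cross-entropy averaged over every position; the shape layout (batch, time, vocab) is an assumption about how the logits and targets are stored:

```python
import torch
import torch.nn.functional as F

def sequence_loss(logits, targets):
    """logits: (batch, time, vocab) float scores, targets: (batch, time) integer labels.

    Flatten batch and time so every step contributes one cross-entropy term,
    i.e. L = (1 / (batch * time)) * sum of the per-step losses.
    """
    b, t, v = logits.shape
    return F.cross_entropy(logits.reshape(b * t, v), targets.reshape(b * t))
```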

3.2 Temporal chain rule

Main idea. Future losses depend on earlier states through recurrence.

Core relation:

$\partial L/\partial h_t=\partial\ell_t/\partial h_t+(\partial h_{t+1}/\partial h_t)^\top\,\partial L/\partial h_{t+1}$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is why training an RNN amounts to training a very deep network along the time axis.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
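
A NumPy sketch of the recursion for a linearized RNN, where the step Jacobian $\partial h_{t+1}/\partial h_t$ is simply $W_{hh}$ (for a tanh RNN it would pick up an extra diagonal factor from the nonlinearity); the input is assumed to be a list of local gradients $\partial\ell_t/\partial h_t$:

```python
import numpy as np

def backprop_hidden_grads(W_hh, local_grads):
    """Run dL/dh_t = dl_t/dh_t + W_hh^T dL/dh_{t+1}, from the last step back to the first."""
    grads = [None] * len(local_grads)
    future = np.zeros_like(local_grads[-1])        # dL/dh_{T+1} = 0: nothing flows in from beyond the sequence
    for t in reversed(range(len(local_grads))):
        future = local_grads[t] + W_hh.T @ future  # local term plus gradient arriving from later losses
        grads[t] = future
    return grads
```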

3.3 Weight gradient

Main idea. Shared weights collect gradient from every time step.

Core relation:

$\partial L/\partial W=\sum_t\partial\ell_t/\partial W$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

3.4 Truncated BPTT

Main idea. Backpropagate through a limited window for efficiency.

Core relation:

$t-k,\ldots,t$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

3.5 State detachment

Main idea. Detach hidden state between chunks to control graph length.

Core relation:

$h_t\leftarrow\mathrm{stopgrad}(h_t)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
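
A hedged PyTorch training-loop sketch combining the truncation window of 3.4 with the detachment of 3.5; `model`, `loss_fn`, `optimizer`, `seq`, and `targets` are assumed to exist, and the window length is illustrative:

```python
k = 32                                                # truncation window (illustrative)
hidden = None
for start in range(0, seq.size(1), k):                # seq: (batch, time, feature)
    chunk = seq[:, start:start + k]
    output, hidden = model(chunk, hidden)             # hypothetical recurrent model returning (output, state)
    loss = loss_fn(output, targets[:, start:start + k])

    optimizer.zero_grad()
    loss.backward()                                    # gradients only flow back through this chunk
    optimizer.step()

    hidden = hidden.detach()                           # stopgrad: carry the value forward, cut the graph
```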

4. Vanishing and Exploding Gradients

This part studies vanishing and exploding gradients through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Jacobian product | long-range gradients multiply many recurrent Jacobians | $\prod_{i=s+1}^{t}\partial h_i/\partial h_{i-1}$ |
| Spectral radius | gradient scale depends on recurrent dynamics | $\rho(W_{hh})$ |
| Vanishing | singular values below one shrink long-range gradients | $\Vert g_s\Vert\rightarrow 0$ |
| Exploding | singular values above one can blow up gradients | $\Vert g_s\Vert\rightarrow\infty$ |
| Gradient clipping | cap gradient norm before optimizer update | $g\leftarrow g\min(1,c/\Vert g\Vert)$ |

4.1 Jacobian product

Main idea. Long-range gradients multiply many recurrent Jacobians.

Core relation:

$\prod_{i=s+1}^{t}\partial h_i/\partial h_{i-1}$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
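
A NumPy sketch of the Jacobian product for a linearized RNN whose recurrent matrix has all singular values equal to `sigma`; building it from a random orthogonal matrix is an illustrative construction, not a claim about real trained weights:

```python
import numpy as np

def long_range_gain(sigma, steps=20, dim=8):
    """Norm of a unit gradient pushed through `steps` Jacobian factors with top singular value sigma."""
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))   # orthogonal matrix: all singular values equal 1
    W = sigma * Q                                      # now every singular value equals sigma
    g = rng.normal(size=dim)
    g = g / np.linalg.norm(g)
    for _ in range(steps):
        g = W.T @ g                                    # one factor of prod_i dh_i/dh_{i-1}
    return np.linalg.norm(g)

print(long_range_gain(0.8))   # about 0.8**20 ~ 0.01: vanishing
print(long_range_gain(1.2))   # about 1.2**20 ~ 38:   exploding
```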

4.2 Spectral radius

Main idea. Gradient scale depends on recurrent dynamics.

Core relation:

$\rho(W_{hh})$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

4.3 Vanishing

Main idea. Singular values below one shrink long-range gradients.

Core relation:

$\Vert g_s\Vert\rightarrow 0$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

4.4 Exploding

Main idea. Singular values above one can blow up gradients.

Core relation:

$\Vert g_s\Vert\rightarrow\infty$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

4.5 Gradient clipping

Main idea. Cap gradient norm before optimizer update.

Core relation:

$g\leftarrow g\min(1,c/\Vert g\Vert)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. Clipping became a standard tool because recurrent gradient products can explode suddenly.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
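
A small sketch of the clipping rule $g\leftarrow g\min(1,c/\Vert g\Vert)$ applied to a list of gradient tensors; in PyTorch the same effect is usually obtained with the built-in `torch.nn.utils.clip_grad_norm_` helper:

```python
import torch

def clip_by_global_norm(grads, c):
    """Rescale gradients so their joint norm is at most c: g <- g * min(1, c / ||g||)."""
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    scale = min(1.0, c / (total_norm.item() + 1e-12))   # small epsilon guards against a zero norm
    return [g * scale for g in grads]

# Equivalent built-in call on a model's parameters (after loss.backward()):
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=c)
```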

5. LSTM Cell

This part studies the LSTM cell through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Forget gate | decide what memory to keep | $f_t=\sigma(W_f[x_t,h_{t-1}]+b_f)$ |
| Input gate | decide what new content to write | $i_t=\sigma(W_i[x_t,h_{t-1}]+b_i)$ |
| Candidate memory | propose new cell content | $\tilde c_t=\tanh(W_c[x_t,h_{t-1}]+b_c)$ |
| Cell update | additive memory path improves gradient flow | $c_t=f_t\odot c_{t-1}+i_t\odot\tilde c_t$ |
| Output gate | expose part of cell memory as hidden state | $h_t=o_t\odot\tanh(c_t)$ |

5.1 Forget gate

Main idea. Decide what memory to keep.

Core relation:

$f_t=\sigma(W_f[x_t,h_{t-1}]+b_f)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

5.2 Input gate

Main idea. Decide what new content to write.

Core relation:

$i_t=\sigma(W_i[x_t,h_{t-1}]+b_i)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

5.3 Candidate memory

Main idea. Propose new cell content.

Core relation:

$\tilde c_t=\tanh(W_c[x_t,h_{t-1}]+b_c)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

5.4 Cell update

Main idea. Additive memory path improves gradient flow.

Core relation:

$c_t=f_t\odot c_{t-1}+i_t\odot\tilde c_t$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. The additive cell path is the mathematical reason LSTMs can carry information longer than a plain tanh RNN.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
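
To see why the additive path helps, hold the gates fixed for a moment (a simplification, since in a real LSTM the gates also depend on the states): then $\partial c_t/\partial c_{t-1}=\mathrm{diag}(f_t)$, so a gradient travelling $k$ steps along the cell path is scaled per unit by $f_t^k$ rather than by powers of a full weight matrix. A tiny NumPy check of that scaling, with illustrative gate values:

```python
import numpy as np

f = np.array([0.99, 0.95, 0.50])    # illustrative per-unit forget-gate values, held constant
grad_at_T = np.ones(3)              # gradient arriving at the cell state at the last step
grad_at_0 = grad_at_T * f ** 20     # 20 steps along the additive path scale by f per step

print(grad_at_0)                    # ~[0.82, 0.36, 1e-6]: units with f near 1 keep most of the gradient
```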

5.5 Output gate

Main idea. Expose part of cell memory as hidden state.

Core relation:

$h_t=o_t\odot\tanh(c_t)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
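
Pulling the five equations of this section together, here is a hedged NumPy sketch of a single LSTM step; the packed weight layout (forget, input, candidate, and output rows stacked into one matrix) and all sizes are illustrative choices, not a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: gates and candidate from [x_t, h_prev], then the cell and hidden updates."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b   # stacked pre-activations, shape (4 * hidden,)
    f_t = sigmoid(z[0:H])                       # forget gate
    i_t = sigmoid(z[H:2 * H])                   # input gate
    c_tilde = np.tanh(z[2 * H:3 * H])           # candidate memory
    o_t = sigmoid(z[3 * H:4 * H])               # output gate
    c_t = f_t * c_prev + i_t * c_tilde          # additive cell update
    h_t = o_t * np.tanh(c_t)                    # exposed hidden state
    return h_t, c_t

# toy usage: input_dim = 3, hidden_dim = 4 (illustrative)
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(16, 7))              # 4 * hidden rows, input + hidden columns
b = np.zeros(16)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(5, 3)):
    h, c = lstm_step(x_t, h, c, W, b)
```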
