
RNN and LSTM Math, Part 1: Sequence Modeling View to the LSTM Cell


1. Sequence Modeling View

This part studies the sequence modeling view through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Sequential data | observations arrive with order and history | $x_{1:T}=(x_1,\ldots,x_T)$ |
| Hidden state | the model carries a summary of the past | $h_t=f_\theta(h_{t-1},x_t)$ |
| Output distribution | each state can produce a prediction | $p(y_t\mid x_{\le t})=g_\theta(h_t)$ |
| Autoregressive generation | the previous output can become the next input | $p(x_{1:T})=\prod_t p(x_t\mid x_{<t})$ |
| RNN versus transformer | RNNs compress history into a state; transformers keep token states visible | $h_t$ versus $H_{1:t}$ |

1.1 Sequential data

Main idea. Observations arrive with order and history.

Core relation:

$x_{1:T}=(x_1,\ldots,x_T)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).
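
As a minimal sketch of that shape check, assuming a PyTorch-style `nn.RNN` with `batch_first=True`; the sizes (batch 4, time 20, feature 8, hidden 16) are arbitrary toy values, not taken from the lesson:

```python
import torch
import torch.nn as nn

batch, time, features, hidden = 4, 20, 8, 16          # arbitrary toy sizes

rnn = nn.RNN(input_size=features, hidden_size=hidden, batch_first=True)
x = torch.randn(batch, time, features)                 # (batch, time, feature)

output, h_n = rnn(x)
print(output.shape)  # torch.Size([4, 20, 16]) -> (batch, time, hidden), one state per step
print(h_n.shape)     # torch.Size([1, 4, 16])  -> (layers, batch, hidden); h_n[0] is the (batch, hidden) final state
```

The per-step state with shape (batch, hidden) that the lesson refers to is `output[:, t, :]` at a given step, or `h_n[0]` for the final step.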

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

1.2 Hidden state

Main idea. The model carries a summary of the past.

Core relation:

$h_t=f_\theta(h_{t-1},x_t)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
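
A tiny sketch of the recurrence as a loop, with the transition written as an opaque `step` function standing in for $f_\theta$; the function itself is a placeholder, not a specific model:

```python
def run_rnn(step, h0, xs):
    """Apply h_t = step(h_{t-1}, x_t) along the sequence and collect every state."""
    h, states = h0, []
    for x_t in xs:
        h = step(h, x_t)       # everything the model knows about the past is inside h
        states.append(h)
    return states
```

However long the sequence is, only the fixed-size state `h` crosses from one step to the next; that compression is exactly what the rest of the lesson analyzes.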

1.3 Output distribution

Main idea. Each state can produce a prediction.

Core relation:

$p(y_t\mid x_{\le t})=g_\theta(h_t)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
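
A hedged NumPy sketch of the readout $g_\theta$: project the hidden state to logits, then apply a softmax to get $p(y_t\mid x_{\le t})$. The weight names and shapes are illustrative assumptions:

```python
import numpy as np

def output_distribution(h_t, W_hy, b_y):
    """Map a hidden state of shape (hidden,) to a probability vector over classes."""
    logits = W_hy @ h_t + b_y                  # (num_classes,)
    logits = logits - logits.max()             # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs
```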

1.4 Autoregressive generation

Main idea. The previous output can become the next input.

Core relation:

$p(x_{1:T})=\prod_t p(x_t\mid x_{<t})$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
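
A sketch of the generation loop under clearly labeled assumptions: `embed`, `cell`, and `head` are hypothetical modules standing in for the input embedding, the recurrence, and the output head, so this shows the control flow rather than any particular library API:

```python
import torch

@torch.no_grad()
def generate(embed, cell, head, start_token, steps, h):
    """Sample x_t ~ p(x_t | x_{<t}) and feed each sample back in as the next input."""
    token, out = start_token, [start_token]
    for _ in range(steps):
        h = cell(embed(token), h)                      # fold the previous token into the state
        probs = torch.softmax(head(h), dim=-1)         # p(x_t | x_{<t})
        token = torch.multinomial(probs, num_samples=1).squeeze(-1)
        out.append(token)
    return out
```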

1.5 RNN versus transformer

Main idea. RNNs compress history into a state; transformers keep token states visible.

Core relation:

$h_t$ versus $H_{1:t}$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

2. Vanilla RNN Equations

This part studies the vanilla RNN equations through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| State update | combine current input and previous hidden state | $h_t=\phi(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$ |
| Output head | map hidden state to logits or regression output | $o_t=W_{hy}h_t+b_y$ |
| Parameter sharing | the same weights are reused at every time step | $\theta_t=\theta$ |
| Unrolled graph | an RNN is a deep network along time | $h_1\rightarrow h_2\rightarrow\cdots\rightarrow h_T$ |
| Initial state | start from zeros or a learned state | $h_0=0$ or trainable |

2.1 State update

Main idea. Combine current input and previous hidden state.

Core relation:

$h_t=\phi(W_{xh}x_t+W_{hh}h_{t-1}+b_h)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
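
A minimal NumPy sketch of the update with $\phi=\tanh$; the dimensions (input 3, hidden 5, sequence length 7) are illustrative only:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(5, 3))       # input-to-hidden weights
W_hh = rng.normal(size=(5, 5))       # hidden-to-hidden (recurrent) weights
b_h = np.zeros(5)

h = np.zeros(5)                      # h_0 = 0
for x_t in rng.normal(size=(7, 3)):  # a length-7 toy sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
```

Note that the same three parameter arrays are reused at every step, which is the parameter sharing discussed in 2.3.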

2.2 Output head

Main idea. Map hidden state to logits or regression output.

Core relation:

$o_t=W_{hy}h_t+b_y$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

2.3 Parameter sharing

Main idea. The same weights are reused at every time step.

Core relation:

$\theta_t=\theta$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

2.4 Unrolled graph

Main idea. An RNN is a deep network along time.

Core relation:

$h_1\rightarrow h_2\rightarrow\cdots\rightarrow h_T$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

2.5 Initial state

Main idea. Start from zeros or a learned state.

Core relation:

$h_0=0$ or trainable

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
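
A hedged PyTorch sketch of making $h_0$ trainable instead of zeros; the module name and sizes are illustrative:

```python
import torch
import torch.nn as nn

class RNNWithLearnedInit(nn.Module):
    """Wrap nn.RNN so the initial hidden state is a learned parameter rather than zeros."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.h0 = nn.Parameter(torch.zeros(1, 1, hidden_size))    # (layers, 1, hidden), trained with the model

    def forward(self, x):                                          # x: (batch, time, feature)
        h0 = self.h0.expand(-1, x.size(0), -1).contiguous()        # repeat the learned state across the batch
        return self.rnn(x, h0)
```

Passing no initial state to `nn.RNN` is equivalent to the $h_0=0$ choice.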

3. Backpropagation Through Time

This part studies backpropagation through time through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Sequence loss | sum or average losses over time | $L=\sum_{t=1}^{T}\ell_t$ |
| Temporal chain rule | future losses depend on earlier states through recurrence | $\partial L/\partial h_t=\partial\ell_t/\partial h_t+(\partial h_{t+1}/\partial h_t)^\top\,\partial L/\partial h_{t+1}$ |
| Weight gradient | shared weights collect gradient from every time step | $\partial L/\partial W=\sum_t\partial\ell_t/\partial W$ |
| Truncated BPTT | backpropagate through a limited window for efficiency | $t-k,\ldots,t$ |
| State detachment | detach hidden state between chunks to control graph length | $h_t\leftarrow\mathrm{stopgrad}(h_t)$ |

3.1 Sequence loss

Main idea. Sum or average losses over time.

Core relation:

$L=\sum_{t=1}^{T}\ell_t$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
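
A short PyTorch sketch of the per-step cross-entropy averaged over every position; the shape layout (batch, time, vocab) is an assumption about how the logits and targets are stored:

```python
import torch
import torch.nn.functional as F

def sequence_loss(logits, targets):
    """logits: (batch, time, vocab) float scores, targets: (batch, time) integer labels.

    Flatten batch and time so every step contributes one cross-entropy term,
    i.e. L = (1 / (batch * time)) * sum of the per-step losses.
    """
    b, t, v = logits.shape
    return F.cross_entropy(logits.reshape(b * t, v), targets.reshape(b * t))
```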

3.2 Temporal chain rule

Main idea. Future losses depend on earlier states through recurrence.

Core relation:

$\partial L/\partial h_t=\partial\ell_t/\partial h_t+(\partial h_{t+1}/\partial h_t)^\top\,\partial L/\partial h_{t+1}$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is why training an RNN amounts to training a very deep network along the time axis.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
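
A NumPy sketch of the recursion for a linearized RNN, where the step Jacobian $\partial h_{t+1}/\partial h_t$ is simply $W_{hh}$ (for a tanh RNN it would pick up an extra diagonal factor from the nonlinearity); the input is assumed to be a list of local gradients $\partial\ell_t/\partial h_t$:

```python
import numpy as np

def backprop_hidden_grads(W_hh, local_grads):
    """Run dL/dh_t = dl_t/dh_t + W_hh^T dL/dh_{t+1}, from the last step back to the first."""
    grads = [None] * len(local_grads)
    future = np.zeros_like(local_grads[-1])        # dL/dh_{T+1} = 0: nothing flows in from beyond the sequence
    for t in reversed(range(len(local_grads))):
        future = local_grads[t] + W_hh.T @ future  # local term plus gradient arriving from later losses
        grads[t] = future
    return grads
```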

3.3 Weight gradient

Main idea. Shared weights collect gradient from every time step.

Core relation:

$\partial L/\partial W=\sum_t\partial\ell_t/\partial W$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

3.4 Truncated BPTT

Main idea. Backpropagate through a limited window for efficiency.

Core relation:

$t-k,\ldots,t$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

3.5 State detachment

Main idea. Detach hidden state between chunks to control graph length.

Core relation:

$h_t\leftarrow\mathrm{stopgrad}(h_t)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
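
A hedged PyTorch training-loop sketch combining the truncation window of 3.4 with the detachment of 3.5; `model`, `loss_fn`, `optimizer`, `seq`, and `targets` are assumed to exist, and the window length is illustrative:

```python
k = 32                                                # truncation window (illustrative)
hidden = None
for start in range(0, seq.size(1), k):                # seq: (batch, time, feature)
    chunk = seq[:, start:start + k]
    output, hidden = model(chunk, hidden)             # hypothetical recurrent model returning (output, state)
    loss = loss_fn(output, targets[:, start:start + k])

    optimizer.zero_grad()
    loss.backward()                                    # gradients only flow back through this chunk
    optimizer.step()

    hidden = hidden.detach()                           # stopgrad: carry the value forward, cut the graph
```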

4. Vanishing and Exploding Gradients

This part studies vanishing and exploding gradients through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Jacobian product | long-range gradients multiply many recurrent Jacobians | $\prod_{i=s+1}^{t}\partial h_i/\partial h_{i-1}$ |
| Spectral radius | gradient scale depends on recurrent dynamics | $\rho(W_{hh})$ |
| Vanishing | singular values below one shrink long-range gradients | $\Vert g_s\Vert\rightarrow 0$ |
| Exploding | singular values above one can blow up gradients | $\Vert g_s\Vert\rightarrow\infty$ |
| Gradient clipping | cap gradient norm before optimizer update | $g\leftarrow g\min(1,c/\Vert g\Vert)$ |

4.1 Jacobian product

Main idea. Long-range gradients multiply many recurrent Jacobians.

Core relation:

$\prod_{i=s+1}^{t}\partial h_i/\partial h_{i-1}$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
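
A NumPy sketch of the Jacobian product for a linearized RNN whose recurrent matrix has all singular values equal to `sigma`; building it from a random orthogonal matrix is an illustrative construction, not a claim about real trained weights:

```python
import numpy as np

def long_range_gain(sigma, steps=20, dim=8):
    """Norm of a unit gradient pushed through `steps` Jacobian factors with top singular value sigma."""
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))   # orthogonal matrix: all singular values equal 1
    W = sigma * Q                                      # now every singular value equals sigma
    g = rng.normal(size=dim)
    g = g / np.linalg.norm(g)
    for _ in range(steps):
        g = W.T @ g                                    # one factor of prod_i dh_i/dh_{i-1}
    return np.linalg.norm(g)

print(long_range_gain(0.8))   # about 0.8**20 ~ 0.01: vanishing
print(long_range_gain(1.2))   # about 1.2**20 ~ 38:   exploding
```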

4.2 Spectral radius

Main idea. Gradient scale depends on recurrent dynamics.

Core relation:

$\rho(W_{hh})$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

4.3 Vanishing

Main idea. Singular values below one shrink long-range gradients.

Core relation:

$\Vert g_s\Vert\rightarrow 0$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

4.4 Exploding

Main idea. Singular values above one can blow up gradients.

Core relation:

$\Vert g_s\Vert\rightarrow\infty$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

4.5 Gradient clipping

Main idea. Cap gradient norm before optimizer update.

Core relation:

$g\leftarrow g\min(1,c/\Vert g\Vert)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. Clipping became a standard tool because recurrent gradient products can explode suddenly.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
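
A small sketch of the clipping rule $g\leftarrow g\min(1,c/\Vert g\Vert)$ applied to a list of gradient tensors; in PyTorch the same effect is usually obtained with the built-in `torch.nn.utils.clip_grad_norm_` helper:

```python
import torch

def clip_by_global_norm(grads, c):
    """Rescale gradients so their joint norm is at most c: g <- g * min(1, c / ||g||)."""
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    scale = min(1.0, c / (total_norm.item() + 1e-12))   # small epsilon guards against a zero norm
    return [g * scale for g in grads]

# Equivalent built-in call on a model's parameters (after loss.backward()):
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=c)
```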

5. LSTM Cell

This part studies the LSTM cell through the lens of sequence learning. The central question is how information and gradients move through time.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Forget gate | decide what memory to keep | $f_t=\sigma(W_f[x_t,h_{t-1}]+b_f)$ |
| Input gate | decide what new content to write | $i_t=\sigma(W_i[x_t,h_{t-1}]+b_i)$ |
| Candidate memory | propose new cell content | $\tilde c_t=\tanh(W_c[x_t,h_{t-1}]+b_c)$ |
| Cell update | additive memory path improves gradient flow | $c_t=f_t\odot c_{t-1}+i_t\odot\tilde c_t$ |
| Output gate | expose part of cell memory as hidden state | $h_t=o_t\odot\tanh(c_t)$ |

5.1 Forget gate

Main idea. Decide what memory to keep.

Core relation:

$f_t=\sigma(W_f[x_t,h_{t-1}]+b_f)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

5.2 Input gate

Main idea. Decide what new content to write.

Core relation:

$i_t=\sigma(W_i[x_t,h_{t-1}]+b_i)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

5.3 Candidate memory

Main idea. Propose new cell content.

Core relation:

$\tilde c_t=\tanh(W_c[x_t,h_{t-1}]+b_c)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.

5.4 Cell update

Main idea. Additive memory path improves gradient flow.

Core relation:

$c_t=f_t\odot c_{t-1}+i_t\odot\tilde c_t$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. The additive cell path is the mathematical reason LSTMs can carry information longer than a plain tanh RNN.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
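
To see why the additive path helps, hold the gates fixed for a moment (a simplification, since in a real LSTM the gates also depend on the states): then $\partial c_t/\partial c_{t-1}=\mathrm{diag}(f_t)$, so a gradient travelling $k$ steps along the cell path is scaled per unit by $f_t^k$ rather than by powers of a full weight matrix. A tiny NumPy check of that scaling, with illustrative gate values:

```python
import numpy as np

f = np.array([0.99, 0.95, 0.50])    # illustrative per-unit forget-gate values, held constant
grad_at_T = np.ones(3)              # gradient arriving at the cell state at the last step
grad_at_0 = grad_at_T * f ** 20     # 20 steps along the additive path scale by f per step

print(grad_at_0)                    # ~[0.82, 0.36, 1e-6]: units with f near 1 keep most of the gradient
```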

5.5 Output gate

Main idea. Expose part of cell memory as hidden state.

Core relation:

$h_t=o_t\odot\tanh(c_t)$

An RNN is a parameter-shared computation graph unrolled across time. The hidden state is useful because it carries information forward, but it also creates a long chain for gradients to travel backward. LSTM and GRU gates are designed to make this information path more controllable.

Worked micro-example. In a plain linearized RNN, the contribution of an early hidden state to a later hidden state contains powers of the recurrent matrix. If the dominant singular value is 0.8, a signal over 20 steps is scaled by about $0.8^{20}\approx 0.01$. If it is 1.2, the same path grows like $1.2^{20}\approx 38$. This is the vanishing and exploding gradient problem in one line.

Implementation check. For a batch-first tensor, keep the axes explicit: batch, time, feature. A hidden state usually has shape (batch, hidden), while a full output sequence has shape (batch, time, hidden).

AI connection. This is a practical sequence-modeling control variable.

Common mistake. Do not treat the final hidden state as magic memory. For long sequences it can become a bottleneck, which is exactly why attention was introduced in seq2seq systems.
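
Pulling the five equations of this section together, here is a hedged NumPy sketch of a single LSTM step; the packed weight layout (forget, input, candidate, and output rows stacked into one matrix) and all sizes are illustrative choices, not a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: gates and candidate from [x_t, h_prev], then the cell and hidden updates."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b   # stacked pre-activations, shape (4 * hidden,)
    f_t = sigmoid(z[0:H])                       # forget gate
    i_t = sigmoid(z[H:2 * H])                   # input gate
    c_tilde = np.tanh(z[2 * H:3 * H])           # candidate memory
    o_t = sigmoid(z[3 * H:4 * H])               # output gate
    c_t = f_t * c_prev + i_t * c_tilde          # additive cell update
    h_t = o_t * np.tanh(c_t)                    # exposed hidden state
    return h_t, c_t

# toy usage: input_dim = 3, hidden_dim = 4 (illustrative)
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(16, 7))              # 4 * hidden rows, input + hidden columns
b = np.zeros(16)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(5, 3)):
    h, c = lstm_step(x_t, h, c, W, b)
```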
