Reinforcement Learning: 5. Dynamic Programming and 6. Sampling-Based Prediction

5. Dynamic Programming

Dynamic Programming is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

5.1 Policy evaluation

Purpose. Policy evaluation focuses on computing V^\pi when the model is known. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

V^\pi(s)=\sum_a\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V^\pi(s')\right).

Operational definition.

Policy evaluation computes V^\pi for a fixed policy under a known model: either solve the linear system V^\pi=r_\pi+\gamma P_\pi V^\pi directly, or iterate the Bellman expectation backup until it reaches its fixed point.

Worked reading.

Each backup averages over two sources of randomness: the policy's choice of action and the environment's transition. The value of s is the expected immediate reward plus the discounted value of wherever the policy lands next; because the operator is a \gamma-contraction, repeated backups converge to V^\pi.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| \mathcal{S} | state space | task context, simulator state, dialogue state |
| \mathcal{A} | action space | moves, controls, generated tokens, tool choices |
| P(s'\mid s,a) | transition kernel | environment dynamics or next-context distribution |
| r(s,a,s') | reward function | scalar training signal, preference score, task score |
| \pi(a\mid s) | policy | behavior rule or neural action distribution |
| V^\pi, Q^\pi | value functions | estimates of future performance |

Examples:

  1. iterative policy evaluation on a tabular gridworld.
  2. solving the linear system V^\pi=r_\pi+\gamma P_\pi V^\pi exactly.
  3. evaluating a fixed random policy before any improvement step.

Non-examples:

  1. Q-learning, which uses samples instead of a model and changes the policy.
  2. Monte Carlo prediction, which estimates V^\pi from episodes rather than from P.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using V, Q, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
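Following that advice, the backup above can be run on a tiny, hypothetical chain MDP small enough to print. The environment, names, and numbers below are illustrative (not the lesson's notebook), and the sketch assumes NumPy:

```python
import numpy as np

# Hypothetical chain MDP: states 0 and 1, state 2 terminal.
# Action 0 stays put, action 1 moves right; entering the terminal state pays 1.
gamma = 0.9
P = np.zeros((3, 2, 3))   # P[s, a, s'] = transition probability
R = np.zeros((3, 2))      # R[s, a] = expected immediate reward
for s in range(2):
    P[s, 0, s] = 1.0      # stay
    P[s, 1, s + 1] = 1.0  # move right
R[1, 1] = 1.0             # stepping from state 1 into the terminal state
P[2, :, 2] = 1.0          # terminal self-loop with zero reward

pi = np.full((3, 2), 0.5)  # uniform random policy pi(a|s)

def policy_evaluation(P, R, pi, gamma, tol=1e-10):
    """Iterate the Bellman expectation backup to its fixed point V^pi."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)          # Q[s,a] = r(s,a) + gamma * E[V(s')]
        V_new = (pi * Q).sum(axis=1)     # average over the policy's actions
        V_new[2] = 0.0                   # mask the terminal state
        if np.abs(V_new - V).max() < tol:
            return V_new
        V = V_new

V = policy_evaluation(P, R, pi, gamma)
print(V)  # V[1] = 0.5 / (1 - 0.45) ≈ 0.9091 under the uniform policy
```

Every object in the derivation has a printable counterpart here: P is the kernel, R the reward table, pi the policy, and the loop is the expectation backup with the terminal state masked.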

5.2 Policy improvement

Purpose. Policy improvement focuses on turning a value function into a better greedy policy. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

\pi'(s)=\arg\max_a Q^\pi(s,a),\qquad Q^\pi(s,\pi'(s))\ge V^\pi(s)\ \text{for all}\ s.

Operational definition.

Policy improvement builds a new policy by acting greedily with respect to the current value function: at each state, pick the action with the largest one-step lookahead value. The policy improvement theorem guarantees \pi' is at least as good as \pi in every state.

Worked reading.

Greedy improvement is a one-step lookahead through the model: compute Q^\pi(s,a)=\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V^\pi(s')\right) for every action and keep the argmax. If the greedy policy equals the old policy everywhere, the old policy is already optimal.

Examples:

  1. greedy improvement \pi'(s)=\arg\max_a Q^\pi(s,a) from a tabular Q.
  2. the improvement half of policy iteration.
  3. \epsilon-greedy improvement, which keeps a little exploration.

Non-examples:

  1. re-evaluating the same fixed policy (that is evaluation, not improvement).
  2. behavior cloning without reward feedback.


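A minimal sketch of the greedy improvement step on the same kind of hypothetical chain MDP (all names and values illustrative); the input V is the uniform-random-policy value that policy evaluation produces:

```python
import numpy as np

# Hypothetical chain MDP: state 2 terminal; action 0 stays, action 1 moves right.
gamma = 0.9
P = np.zeros((3, 2, 3))
R = np.zeros((3, 2))
for s in range(2):
    P[s, 0, s] = 1.0       # stay
    P[s, 1, s + 1] = 1.0   # move right
R[1, 1] = 1.0              # entering terminal state 2 pays 1
P[2, :, 2] = 1.0

# V^pi of the uniform random policy on this chain (from policy evaluation).
V = np.array([0.45 * (0.5 / 0.55) / 0.55, 0.5 / 0.55, 0.0])

def greedy_improvement(P, R, V, gamma):
    """One-step lookahead: pi'(s) = argmax_a [ r(s,a) + gamma * E[V(s')] ]."""
    Q = R + gamma * (P @ V)
    return Q.argmax(axis=1)

pi_new = greedy_improvement(P, R, V, gamma)
print(pi_new)  # [1 1 0]: the improved policy moves right wherever it matters
```

The improvement step never touches the policy that generated V; it only reads V through the model, which is exactly why it needs P and R.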
5.3 Policy iteration

Purpose. Policy iteration focuses on alternating evaluation and improvement. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

\pi_0\ \xrightarrow{\ \text{evaluate}\ }\ V^{\pi_0}\ \xrightarrow{\ \text{improve}\ }\ \pi_1\ \xrightarrow{\ \text{evaluate}\ }\ V^{\pi_1}\ \xrightarrow{\ \text{improve}\ }\ \cdots\ \longrightarrow\ \pi^*.

Operational definition.

Policy iteration alternates two sub-steps: evaluate the current policy to get V^\pi, then improve it greedily. In a finite MDP each improvement is strict until the optimal policy is reached, so the loop terminates after finitely many rounds.

Worked reading.

Each round is monotone: improvement never decreases value in any state, and there are only finitely many deterministic policies, so the evaluate-improve loop cannot cycle. When the greedy policy stops changing, the Bellman optimality equation holds and the policy is optimal.

Examples:

  1. classical policy iteration on a tabular MDP.
  2. modified policy iteration with truncated evaluation sweeps.
  3. generalized policy iteration, where evaluation and improvement interleave.

Non-examples:

  1. pure prediction with no improvement step.
  2. behavior cloning without reward feedback.


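The evaluate-improve loop can be sketched end to end on a hypothetical chain MDP (illustrative environment, not the lesson's notebook):

```python
import numpy as np

# Hypothetical chain MDP: state 2 terminal; action 0 stays, action 1 moves right.
gamma = 0.9
P = np.zeros((3, 2, 3))
R = np.zeros((3, 2))
for s in range(2):
    P[s, 0, s] = 1.0
    P[s, 1, s + 1] = 1.0
R[1, 1] = 1.0              # entering terminal state 2 pays 1
P[2, :, 2] = 1.0

def evaluate(pi, tol=1e-10):
    """Bellman expectation backups for a deterministic policy pi[s] -> a."""
    V = np.zeros(3)
    while True:
        Q = R + gamma * (P @ V)
        V_new = Q[np.arange(3), pi]
        V_new[2] = 0.0             # mask the terminal state
        if np.abs(V_new - V).max() < tol:
            return V_new
        V = V_new

pi = np.zeros(3, dtype=int)        # start from "always stay"
while True:
    V = evaluate(pi)                                # evaluation half
    pi_new = (R + gamma * (P @ V)).argmax(axis=1)   # improvement half
    if np.array_equal(pi_new, pi):                  # greedy policy stable
        break
    pi = pi_new

print(pi, V)  # optimal policy moves right; V*[0] = 0.9, V*[1] = 1.0
```

The stopping test is the policy, not the value: once greedy improvement returns the same policy, the Bellman optimality equation is satisfied and the loop exits.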
5.4 Value iteration

Purpose. Value iteration focuses on combining backup and improvement into one operator. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

V_{k+1}(s)=\max_a\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V_k(s')\right).

Operational definition.

Value iteration folds the improvement max into the backup itself: apply the Bellman optimality operator repeatedly, and only extract a greedy policy once V has converged. No intermediate policy is ever represented explicitly.

Worked reading.

The optimality operator is a \gamma-contraction in the sup norm, so the iterates converge to V^* from any initialization, and each sweep shrinks the worst-case error by a factor of \gamma.

Examples:

  1. synchronous value iteration sweeps over all states.
  2. asynchronous, in-place backups that reuse fresh values immediately.
  3. Q-value iteration, the same operator applied to Q^*(s,a).

Non-examples:

  1. policy evaluation for a fixed \pi, which has no max over actions.
  2. model-free updates from sampled transitions.


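The optimality backup fits in a few lines on a hypothetical chain MDP (illustrative names and numbers):

```python
import numpy as np

# Hypothetical chain MDP: state 2 terminal; action 0 stays, action 1 moves right.
gamma = 0.9
P = np.zeros((3, 2, 3))
R = np.zeros((3, 2))
for s in range(2):
    P[s, 0, s] = 1.0
    P[s, 1, s + 1] = 1.0
R[1, 1] = 1.0              # entering terminal state 2 pays 1
P[2, :, 2] = 1.0

V = np.zeros(3)
while True:
    # Bellman optimality backup: the max over actions sits inside the update.
    V_new = (R + gamma * (P @ V)).max(axis=1)
    V_new[2] = 0.0                      # mask the terminal state
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

pi_star = (R + gamma * (P @ V)).argmax(axis=1)  # greedy policy read off at the end
print(V, pi_star)  # V* = [0.9, 1.0, 0.0]; pi* moves right
```

Compare with policy iteration: no policy appears until the very last line, which is exactly the "backup and improvement combined into one operator" idea.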
5.5 Planning versus learning

Purpose. Planning versus learning focuses on why model access changes the algorithm. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1},\qquad 0\le\gamma<1.

Operational definition.

Dynamic programming uses a known model to compute values and policies through Bellman backups.

Worked reading.

Policy iteration alternates evaluation and greedy improvement; value iteration applies optimal backups directly.

Examples:

  1. policy evaluation.
  2. policy iteration.
  3. value iteration.

Non-examples:

  1. model-free Q-learning.
  2. one-step supervised classification.


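The planning-versus-learning split can be seen in one comparison: with the model, a backup is an exact expectation over P; without it, the same quantity must be estimated from sampled next states. A hypothetical single state-action sketch (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# One hypothetical state-action pair with a known transition distribution.
next_states = np.array([0, 1, 2])
probs = np.array([0.2, 0.5, 0.3])      # P(s' | s, a)
rewards = np.array([0.0, 1.0, 2.0])    # r(s, a, s')
V = np.array([0.5, 1.0, 0.0])          # current value estimates

# Planning: the model is available, so the backup is an exact expectation.
exact = np.sum(probs * (rewards + gamma * V))

# Learning: no model access, only samples of s' -- estimate the same target.
samples = rng.choice(next_states, size=100_000, p=probs)
sampled = np.mean(rewards[samples] + gamma * V[samples])

print(exact, sampled)  # the sample average approaches the exact backup
```

Same target, two access models: dynamic programming sums over P once; a learner pays for the missing model with sampling noise that shrinks only as more transitions arrive.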
6. Sampling-Based Prediction

Sampling-Based Prediction is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

6.1 Monte Carlo returns

Purpose. Monte Carlo returns focus on learning from complete sampled episodes. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

V(S_t)\leftarrow V(S_t)+\alpha\left(G_t-V(S_t)\right),\qquad G_t=\sum_{k=0}^{T-t-1}\gamma^k R_{t+k+1}.

Operational definition.

Sampling-based prediction learns values from trajectories. Monte Carlo waits for full returns; TD bootstraps from the next estimate.

Worked reading.

A Monte Carlo target is the complete sampled return: no current estimate appears on the right-hand side, so the target is unbiased for V^\pi, but its variance grows with episode length because every future reward contributes noise.

Examples:

  1. first-visit Monte Carlo prediction.
  2. every-visit Monte Carlo prediction.
  3. sequence-level return estimates for full LLM responses.

Non-examples:

  1. TD(0), which bootstraps instead of waiting for the episode to end.
  2. solving the Bellman linear system exactly.


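A Monte Carlo estimate of V^\pi(1) on the hypothetical chain used throughout these sketches (uniform policy; from state 1, "right" ends the episode with reward 1, "stay" does nothing):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

def episode_rewards():
    """Rewards of one episode from state 1 under a uniform random policy."""
    rewards, s = [], 1
    while s != 2:
        if rng.random() < 0.5:     # action 1: move right
            s += 1
            rewards.append(1.0)    # the only move from 1 enters the terminal state
        else:                      # action 0: stay
            rewards.append(0.0)
    return rewards

# First-visit Monte Carlo: average the full return G_0 over many episodes.
returns = []
for _ in range(20_000):
    rewards = episode_rewards()
    G = 0.0
    for r in reversed(rewards):    # G_t = R_{t+1} + gamma * G_{t+1}
        G = r + gamma * G
    returns.append(G)

print(np.mean(returns))  # analytic value: 0.5 / (1 - 0.45) ≈ 0.9091
```

Note that no update happens mid-episode: the estimator must wait for termination, which is exactly what distinguishes Monte Carlo from TD.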
6.2 Temporal-difference learning

Purpose. Temporal-difference learning focuses on learning from bootstrapped one-step targets. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

V(S_t)\leftarrow V(S_t)+\alpha\left(R_{t+1}+\gamma V(S_{t+1})-V(S_t)\right).

Operational definition.

Sampling-based prediction learns values from trajectories. Monte Carlo waits for full returns; TD bootstraps from the next estimate.

Worked reading.

The TD error \delta_t=R_{t+1}+\gamma V(S_{t+1})-V(S_t) is a local surprise signal.

Examples:

  1. TD(0).
  2. n-step returns.
  3. TD(\lambda).

Non-examples:

  1. solving the Bellman linear system exactly.
  2. using labels independent of the current policy.


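TD(0) on the same hypothetical chain, updating after every transition instead of every episode; the 1/n step size per state is one standard Robbins-Monro choice:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
V = np.zeros(3)                  # V[2] is the terminal state, kept at 0
visits = np.zeros(3)             # per-state counts for 1/n step sizes

# Hypothetical chain under a uniform policy: from states 0 and 1, a coin flip
# either stays or moves right; entering state 2 ends the episode with reward 1.
for _ in range(20_000):
    s = 0
    while s != 2:
        if rng.random() < 0.5:
            s_next, r = s + 1, (1.0 if s + 1 == 2 else 0.0)
        else:
            s_next, r = s, 0.0
        visits[s] += 1
        alpha = 1.0 / visits[s]                   # decaying step size
        delta = r + gamma * V[s_next] - V[s]      # TD error: the local surprise
        V[s] += alpha * delta                     # bootstrap from V[s_next]
        s = s_next

print(V)  # approaches V^pi ≈ [0.7438, 0.9091, 0] from policy evaluation
```

The target here contains V itself, which is the bootstrapping that makes TD low-variance but biased while V is still wrong.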
6.3 Bias, variance, and bootstrapping

Purpose. Bias, variance, and bootstrapping focus on why MC and TD make different errors. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

G_t^{\mathrm{MC}}=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1},\qquad G_t^{\mathrm{TD}}=R_{t+1}+\gamma V(S_{t+1}).

Operational definition.

Monte Carlo targets are unbiased for V^\pi but high-variance, since every reward in the episode contributes noise. TD targets have low variance, since only one reward is random, but they are biased whenever the bootstrap value V(S_{t+1}) is itself wrong.

Worked reading.

Bootstrapping trades variance for bias: the earlier the target cuts over to an estimate, the less sampling noise it carries and the more it inherits the current errors of V.

Examples:

  1. Monte Carlo targets: unbiased, high variance.
  2. TD(0) targets: low variance, biased through the bootstrap.
  3. n-step targets: intermediate on both axes.

Non-examples:

  1. exact dynamic programming backups, which have neither bias nor variance.
  2. supervised regression on fixed labels with no bootstrapping.


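The variance half of the trade-off can be measured directly on the hypothetical chain: compare the spread of MC and TD targets at state 1, with the true V plugged into the TD target so that both are centered on V^\pi(1):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
V_true = np.array([0.45 * (0.5 / 0.55) / 0.55, 0.5 / 0.55, 0.0])  # exact V^pi

def step(s):
    """One uniform-policy step of the hypothetical chain from state s."""
    if rng.random() < 0.5:
        return s + 1, (1.0 if s + 1 == 2 else 0.0)
    return s, 0.0

mc_targets, td_targets = [], []
for _ in range(50_000):
    # Monte Carlo target from state 1: the full sampled return G.
    s, G, discount = 1, 0.0, 1.0
    while s != 2:
        s, r = step(s)
        G += discount * r
        discount *= gamma
    mc_targets.append(G)
    # TD target from state 1, with the *true* V: R + gamma * V(s').
    s_next, r = step(1)
    td_targets.append(r + gamma * V_true[s_next])

# Both targets are centered on V^pi(1) here, but the TD target sums only one
# random reward, so its variance is smaller. TD's bias appears only once the
# bootstrap uses an estimated V instead of the true one.
print(np.var(mc_targets), np.var(td_targets))
```

With an estimated V the TD histogram would shift (bias) while staying narrow; the MC histogram stays centered but wide.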
6.4 N-step returns

Purpose. N-step returns focus on interpolating between TD(0) and Monte Carlo. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

G_t^{(n)}=\sum_{k=0}^{n-1}\gamma^k R_{t+k+1}+\gamma^n V(S_{t+n}).

Operational definition.

The n-step return sums n real rewards before bootstrapping: n=1 recovers TD(0), n\to\infty recovers Monte Carlo, and intermediate n trades bias against variance along the way.

Worked reading.

Reading the formula: the first term is real sampled experience, the second is the current estimate standing in for everything after step t+n; larger n means more noise but less reliance on a possibly wrong V.

Examples:

  1. the 2-step target R_{t+1}+\gamma R_{t+2}+\gamma^2 V(S_{t+2}).
  2. n-step SARSA targets.
  3. the \lambda-return, a geometric mixture of n-step returns.

Non-examples:

  1. the full Monte Carlo return, which never bootstraps.
  2. instant reward only.


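The n-step target is easy to hand-check on a short recorded episode. A sketch (episode, values, and \gamma are illustrative), with the convention that `rewards[t]` is R_{t+1} and `values[t]` is V(S_t):

```python
# n-step return from a recorded episode: sum n real rewards, then bootstrap.
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n) = sum_{k<n} gamma^k R_{t+k+1} + gamma^n V(S_{t+n}).
    If the episode ends before step t+n, this degenerates into the
    full Monte Carlo return (the terminal state contributes V = 0)."""
    T = len(rewards)
    G = 0.0
    for k in range(min(n, T - t)):
        G += gamma**k * rewards[t + k]
    if t + n < T:
        G += gamma**n * values[t + n]   # bootstrap from the current estimate
    return G

# Hypothetical episode: rewards R_1..R_3 and value estimates for S_0..S_2.
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.8]
gamma = 0.5

print(n_step_return(rewards, values, t=0, n=2, gamma=gamma))  # 0.25 * 0.8 ≈ 0.2
print(n_step_return(rewards, values, t=0, n=3, gamma=gamma))  # full return: 0.25
```

Sweeping n from 1 to the episode length walks the same target from pure TD to pure Monte Carlo.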
6.5 Eligibility traces and TD lambda

Purpose. Eligibility traces and TD(\lambda) focus on credit assignment across recent visits. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

e_t(s)=\gamma\lambda\,e_{t-1}(s)+\mathbf{1}[S_t=s],\qquad V(s)\leftarrow V(s)+\alpha\,\delta_t\,e_t(s).

Operational definition.

Sampling-based prediction learns values from trajectories. Monte Carlo waits for full returns; TD bootstraps from the next estimate.

Worked reading.

The TD error \delta_t=R_{t+1}+\gamma V(S_{t+1})-V(S_t) is a local surprise signal.

Examples:

  1. TD(0).
  2. n-step returns.
  3. TD(\lambda).

Non-examples:

  1. solving the Bellman linear system exactly.
  2. using labels independent of the current policy.


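Tabular TD(\lambda) with accumulating traces can be hand-checked on a single deterministic episode of a hypothetical two-state chain (visit state 0, then state 1, then terminate with reward 1; all constants illustrative):

```python
import numpy as np

gamma, lam, alpha = 0.9, 0.8, 0.5
V = np.zeros(3)                    # V[2] is terminal, fixed at 0
e = np.zeros(3)                    # eligibility traces

transitions = [(0, 1, 0.0), (1, 2, 1.0)]      # (s, s', reward)
for s, s_next, r in transitions:
    e[s] += 1.0                               # accumulate on visit
    delta = r + gamma * V[s_next] - V[s]      # TD error at this step
    V += alpha * delta * e                    # every traced state shares credit
    e *= gamma * lam                          # traces decay between steps

# The final TD error (delta = 1) reaches state 0 through its decayed trace
# e[0] = gamma * lambda = 0.72, so V ≈ [0.36, 0.5, 0].
print(V)
```

The trace is the mechanism of credit assignment: a single surprise at the end of the episode updates every recently visited state at once, weighted by how long ago it was visited.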