Reinforcement Learning

"An RL algorithm is a way to turn consequences into better future decisions."

Overview

Reinforcement learning studies agents that learn by acting. The data are not a fixed table of labeled examples; the policy changes which states are visited, which rewards are observed, and which mistakes are possible. That single fact is why RL needs Markov decision processes, Bellman equations, temporal-difference learning, exploration, and policy-gradient estimators.

For AI systems, RL matters in two directions. Classical RL explains games, robotics, recommender feedback loops, and online control. Modern LLM work uses the same math when a model is fine-tuned from human preferences, constrained by KL divergence, or evaluated through interactive feedback rather than static labels.

This section is written as LaTeX Markdown. Inline math uses $...$, display math uses $$...$$, and the notebooks use small synthetic MDPs so every update can be inspected without external data.

Prerequisites

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Executable demonstrations of MDPs, Bellman backups, TD learning, Q-learning, policy gradients, PPO, and preference optimization. |
| exercises.ipynb | Ten graded practice problems with runnable scaffolds and full checked solutions. |

Learning Objectives

After completing this section, you will be able to:

  • Define a finite Markov decision process and identify its state, action, reward, transition, and discount components.
  • Compute discounted returns and explain temporal credit assignment.
  • Derive Bellman expectation and optimality equations.
  • Run policy evaluation, policy iteration, and value iteration in a tabular MDP.
  • Explain Monte Carlo, TD(0), n-step, and eligibility-trace prediction.
  • Implement SARSA and Q-learning updates and distinguish on-policy from off-policy control.
  • Explain why replay buffers and target networks stabilize DQN-style learning.
  • Derive the policy-gradient estimator and explain baseline variance reduction.
  • Interpret actor-critic, GAE, PPO clipping, and KL regularization.
  • Connect RL math to RLHF, reward modeling, DPO, and reward hacking risk in LLM systems.

1. Intuition and Motivation

Intuition and Motivation is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

1.1 Sequential decision making

Purpose. Sequential decision making focuses on why actions change future data. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma),\qquad P(s'\mid s,a)=\Pr(S_{t+1}=s'\mid S_t=s,A_t=a).$$

Operational definition.

RL is the mathematics of decisions whose consequences alter later observations. A training example is no longer fixed before learning; the policy helps create the future dataset.

Worked reading.

At time $t$, the agent sees $S_t$, chooses $A_t$, receives $R_{t+1}$, and reaches $S_{t+1}$. The learning signal is attached to a trajectory, not a single independent example.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s'\mid s,a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s,a,s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a\mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. robot navigation.
  2. game play.
  3. dialogue policy tuning.

Non-examples:

  1. ordinary regression with fixed labels.
  2. a bandit problem with no state evolution.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.
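As a concrete starting point, here is a minimal sketch of such a tabular MDP. The chain layout, sizes, and reward placement are illustrative assumptions, not the notebooks' exact environment:

```python
import numpy as np

# A 3-state chain MDP with 2 actions (0 = left, 1 = right); illustrative only.
n_states, n_actions = 3, 2
P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] = Pr(s' | s, a)
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0                 # "left" moves down the chain
    P[s, 1, min(s + 1, n_states - 1)] = 1.0      # "right" moves up the chain

r = np.zeros((n_states, n_actions))
r[n_states - 2, 1] = 1.0  # reward for stepping into the rightmost state
gamma = 0.9

assert np.allclose(P.sum(axis=-1), 1.0)  # every (s, a) row is a distribution
print(P[0], r, sep="\n")
```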

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using VV, QQ, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.

1.2 Rewards, returns, and delayed credit

Purpose. Rewards, returns, and delayed credit focuses on why scalar feedback is hard to assign. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1},\qquad 0\le \gamma<1.$$

Operational definition.

The reward signal is local, but the objective is cumulative. The central difficulty is assigning a future return back to earlier actions.

Worked reading.

The discounted return is $G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots$.
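A short sketch makes the backward recursion behind this sum concrete (illustrative rewards, with $\gamma=0.9$ assumed):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k R_{t+k+1}, computed by a backward sweep."""
    G = 0.0
    for reward in reversed(rewards):
        G = reward + gamma * G
    return G

rewards = [0.0, 0.0, 1.0]          # reward arrives only at the end
print(discounted_return(rewards))  # 0.9**2 = 0.81: credit is discounted back
```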

Examples:

  1. sparse game win/loss rewards.
  2. human preference scores for full responses.
  3. robot task completion bonuses.

Non-examples:

  1. per-token supervised labels.
  2. a deterministic lookup table with no delayed consequence.

1.3 Exploration versus exploitation

Purpose. Exploration versus exploitation focuses on why the learner must choose data. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$V^\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s],\qquad Q^\pi(s,a)=\mathbb{E}_\pi[G_t\mid S_t=s,A_t=a].$$

Operational definition.

The agent must balance actions that look good now with actions that reveal useful information. This makes data collection part of the optimization problem.

Worked reading.

An $\epsilon$-greedy policy chooses a greedy action with probability $1-\epsilon$ and explores otherwise.
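A minimal sketch of this rule; the Q-values and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """Greedy action with probability 1 - epsilon, uniform random otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

q = np.array([0.1, 0.5, 0.2])
actions = [epsilon_greedy(q) for _ in range(1000)]
print(np.bincount(actions, minlength=3) / 1000)  # mostly action 1
```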

Examples:

  1. optimistic initialization.
  2. Boltzmann exploration.
  3. UCB-style uncertainty bonuses.

Non-examples:

  1. evaluating one fixed logged policy only.
  2. training on a static supervised corpus.

1.4 RL versus supervised learning

Purpose. RL versus supervised learning focuses on why labels are not enough. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$V^\pi(s)=\sum_a\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V^\pi(s')\right).$$

Operational definition.

Supervised learning assumes labeled targets. RL instead observes consequences from an interactive process and must handle shifting state-action distributions.

Worked reading.

A policy update changes $d^\pi(s)$, the visitation distribution, so future data are policy-dependent.
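The dependence of $d^\pi(s)$ on the policy can be checked empirically. This sketch re-uses the illustrative chain MDP from section 1.1 and estimates visit frequencies by rollout; the uniform policy is an assumption for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, n_states - 1)] = 1.0
pi = np.full((n_states, n_actions), 0.5)  # uniform random policy

counts = np.zeros(n_states)
s = 0
for _ in range(10_000):
    counts[s] += 1
    a = rng.choice(n_actions, p=pi[s])      # sample A_t ~ pi(. | s)
    s = rng.choice(n_states, p=P[s, a])     # sample S_{t+1} ~ P(. | s, a)
print(counts / counts.sum())  # empirical d^pi: change pi and this shifts
```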

Examples:

  1. offline RL from logged data.
  2. online policy improvement.
  3. preference-tuned language models.

Non-examples:

  1. i.i.d. image classification.
  2. least-squares regression with fixed design.

1.5 Where RL appears in LLM systems

Purpose. Where RL appears in LLM systems focuses on why preference optimization uses this math. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$Q^*(s,a)=\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma\max_{a'}Q^*(s',a')\right).$$

Operational definition.

RL enters LLM systems through preference learning, reward modeling, KL-regularized policy updates, and evaluation policies that adapt from feedback.

Worked reading.

RLHF typically optimizes a reward model while penalizing divergence from a reference model with $D_{\mathrm{KL}}(\pi_\theta\Vert\pi_{\mathrm{ref}})$.
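A minimal sketch of the KL-shaped reward commonly used in PPO-style RLHF; the function name, array shapes, and $\beta$ are illustrative assumptions rather than any library's API:

```python
import numpy as np

def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Sequence-level reward minus a per-token KL-style penalty.

    logp_policy / logp_ref are log-probs of the *sampled* tokens under the
    current policy and the frozen reference model; beta trades reward against
    staying close to the reference.
    """
    kl_per_token = logp_policy - logp_ref  # sample estimate of log(pi/pi_ref)
    return reward - beta * kl_per_token.sum()

logp_policy = np.array([-1.2, -0.7, -2.0])
logp_ref = np.array([-1.5, -0.9, -1.8])
print(kl_shaped_reward(1.0, logp_policy, logp_ref))
```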

Examples:

  1. PPO-style RLHF.
  2. DPO-style preference optimization.
  3. bandit feedback for ranking.

Non-examples:

  1. plain next-token pretraining.
  2. static instruction tuning without preference feedback.

2. Formal MDP Setup

Formal MDP Setup is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

2.1 States, actions, rewards, and transitions

Purpose. States, actions, rewards, and transitions focuses on the tuple $(\mathcal{S},\mathcal{A},P,r,\gamma)$. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1},\qquad 0\le \gamma<1.$$

Operational definition.

An MDP is the formal object that makes sequential decision-making mathematically precise.

Worked reading.

The Markov property says the current state contains the predictive information needed for the next transition and reward.
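The tuple becomes executable once $P$ and $r$ are arrays. This sketch, with an illustrative two-state MDP, checks that $P$ is a valid kernel and exposes the sampling interface:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state, 2-action MDP; each row P[s, a] must be a distribution.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [0.0, 0.0]])
assert np.all(P >= 0) and np.allclose(P.sum(axis=-1), 1.0)

def step(s, a):
    """Sample s' ~ P(. | s, a) and return (s', reward): the MDP interface."""
    s_next = rng.choice(P.shape[-1], p=P[s, a])
    return s_next, r[s, a]

print(step(0, 1))
```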

Examples:

  1. finite gridworld.
  2. inventory control.
  3. dialogue state tracking.

Non-examples:

  1. raw observations that omit hidden state.
  2. a static labeled dataset with no actions.

2.2 The Markov property

Purpose. The Markov property focuses on why the present state summarizes the useful past. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$V^\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s],\qquad Q^\pi(s,a)=\mathbb{E}_\pi[G_t\mid S_t=s,A_t=a].$$

Operational definition.

An MDP is the formal object that makes sequential decision-making mathematically precise.

Worked reading.

The Markov property says the current state contains the predictive information needed for the next transition and reward.

Examples:

  1. finite gridworld.
  2. inventory control.
  3. dialogue state tracking.

Non-examples:

  1. raw observations that omit hidden state.
  2. a static labeled dataset with no actions.

2.3 Episodic, continuing, finite- and infinite-horizon tasks

Purpose. Episodic, continuing, finite- and infinite-horizon tasks focuses on how time changes the objective. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$V^\pi(s)=\sum_a\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V^\pi(s')\right).$$

Operational definition.

This concept is part of the bridge from sequential probability models to practical RL algorithms.

Worked reading.

The key habit is to name the state, action, reward, transition, policy, and value object before writing an update.
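Masking terminal states is what separates episodic from continuing returns in code. A sketch with illustrative rewards and done flags:

```python
def returns_with_termination(rewards, dones, gamma=0.99):
    """Backward recursion G_t = r_t + gamma * (1 - done_t) * G_{t+1}.

    The (1 - done) mask stops value from leaking across episode boundaries,
    which is the practical difference between episodic and continuing tasks.
    """
    G, out = 0.0, []
    for reward, done in zip(reversed(rewards), reversed(dones)):
        G = reward + gamma * (1.0 - done) * G
        out.append(G)
    return out[::-1]

print(returns_with_termination([0, 1, 0, 2], [0, 1, 0, 1]))
```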

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.

2.4 Transition kernels and reward functions

Purpose. Transition kernels and reward functions focuses on how stochastic environments are represented. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$Q^*(s,a)=\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma\max_{a'}Q^*(s',a')\right).$$

Operational definition.

The transition kernel $P(s'\mid s,a)$ and the reward function $r(s,a,s')$ together specify the stochastic environment: $P$ gives a distribution over next states for every state-action pair, and $r$ assigns a scalar to each transition.

Worked reading.

In a finite MDP, $P$ can be stored as a row-stochastic array of shape $|\mathcal{S}|\times|\mathcal{A}|\times|\mathcal{S}|$, and the expected immediate reward is $r(s,a)=\sum_{s'}P(s'\mid s,a)\,r(s,a,s')$.
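A sketch under these definitions (the arrays and numbers below are illustrative assumptions, not notebook code): the expected immediate reward can be read off the kernel directly.

```python
import numpy as np

# Illustrative kernel P[s, a, s'] and transition reward r3[s, a, s'].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r3 = np.zeros_like(P)
r3[:, :, 1] = 1.0  # reward 1 for landing in state 1

# Expected immediate reward r(s, a) = sum_{s'} P(s' | s, a) * r(s, a, s').
r_sa = (P * r3).sum(axis=-1)
print(r_sa)  # [[0.1, 0.8], [0.5, 1.0]]
```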

Examples:

  1. DQN target networks.
  2. experience replay.
  3. critic networks.

Non-examples:

  1. exact dynamic programming in a tiny known MDP.
  2. memorizing every state-action value in a table.

2.5 Partial observability and belief states

Purpose. Partial observability and belief states focuses on what breaks when observations are not states. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,A^{\pi_\theta}(S_t,A_t)\right].$$

Operational definition.

When observations do not determine the underlying state, the process is a POMDP rather than an MDP, and value functions of raw observations are no longer well defined.

Worked reading.

A belief state $b_t(s)=\Pr(S_t=s\mid O_{1:t},A_{1:t-1})$ summarizes the history; updating it by Bayes' rule restores the Markov property at the cost of tracking a distribution over states.
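A minimal Bayes-filter sketch for the belief update; the kernel $P$, the observation model $O$, and all shapes are illustrative assumptions:

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """Bayes filter: b'(s') is proportional to O[s', o] * sum_s P[s, a, s'] b[s].

    P[s, a, s'] is the transition kernel and O[s', o] the observation model.
    """
    predicted = b @ P[:, a, :]          # push the belief through the dynamics
    posterior = O[:, o] * predicted     # weight by observation likelihood
    return posterior / posterior.sum()

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
O = np.array([[0.8, 0.2],   # state 0 mostly emits observation 0
              [0.3, 0.7]])  # state 1 mostly emits observation 1
print(belief_update(np.array([0.5, 0.5]), a=1, o=1, P=P, O=O))
```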

Examples:

  1. finite gridworld.
  2. inventory control.
  3. dialogue state tracking.

Non-examples:

  1. raw observations that omit hidden state.
  2. a static labeled dataset with no actions.

3. Returns Policies and Value Functions

Returns Policies and Value Functions is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

3.1 Discounted return

Purpose. Discounted return focuses on why $G_t=\sum_{k\ge 0}\gamma^k R_{t+k+1}$ is central. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$V^\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s],\qquad Q^\pi(s,a)=\mathbb{E}_\pi[G_t\mid S_t=s,A_t=a].$$

Operational definition.

Value functions summarize future consequences. Advantage functions compare an action with the policy's average behavior at the same state.

Worked reading.

Occupancy measures explain why RL gradients weight states by how often the current policy visits them.

Examples:

  1. $V^\pi$.
  2. $Q^\pi$.
  3. $A^\pi$.

Non-examples:

  1. instant reward only.
  2. a metric computed on states never reached by the policy.

3.2 Deterministic and stochastic policies

Purpose. Deterministic and stochastic policies focuses on how $\pi(a\mid s)$ controls the data distribution. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$V^\pi(s)=\sum_a\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V^\pi(s')\right).$$

Operational definition.

This concept is part of the bridge from sequential probability models to practical RL algorithms.

Worked reading.

The key habit is to name the state, action, reward, transition, policy, and value object before writing an update.

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.

3.3 State-value and action-value functions

Purpose. State-value and action-value functions focuses on what $V^\pi$ and $Q^\pi$ estimate. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$Q^*(s,a)=\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma\max_{a'}Q^*(s',a')\right).$$

Operational definition.

Function approximation replaces tables with parameterized models so the agent can generalize across large state spaces.

Worked reading.

Deep RL is powerful because neural networks share statistical strength, but unstable because approximate bootstrapping can amplify errors.

Examples:

  1. DQN target networks.
  2. experience replay.
  3. critic networks.

Non-examples:

  1. exact dynamic programming in a tiny known MDP.
  2. memorizing every state-action value in a table.

3.4 Advantage functions

Purpose. Advantage functions focuses on why $A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s)$ reduces variance. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,A^{\pi_\theta}(S_t,A_t)\right].$$

Operational definition.

Function approximation replaces tables with parameterized models so the agent can generalize across large state spaces.

Worked reading.

Deep RL is powerful because neural networks share statistical strength, but unstable because approximate bootstrapping can amplify errors.
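A tabular sketch of the advantage identity $A^\pi=Q^\pi-V^\pi$, with illustrative numbers. The rows average to zero under $\pi$, which is why subtracting the baseline recenters the gradient signal without biasing it:

```python
import numpy as np

# Illustrative tabular values for one policy pi.
Q = np.array([[1.0, 2.0],
              [0.5, 0.0]])          # Q^pi(s, a)
pi = np.array([[0.5, 0.5],
               [0.9, 0.1]])         # pi(a | s)

V = (pi * Q).sum(axis=1)            # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
A = Q - V[:, None]                  # A^pi(s, a) = Q^pi(s, a) - V^pi(s)
print(A)
print((pi * A).sum(axis=1))         # ~[0, 0]: the baseline removes the mean
```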

Examples:

  1. DQN target networks.
  2. experience replay.
  3. critic networks.

Non-examples:

  1. exact dynamic programming in a tiny known MDP.
  2. memorizing every state-action value in a table.

3.5 Occupancy measures

Purpose. Occupancy measures focuses on how policies induce weighted state-action distributions. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$L^{\mathrm{CLIP}}(\theta)=\mathbb{E}\left[\min\!\big(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\,\hat{A}_t\big)\right].$$

Operational definition.

Value functions summarize future consequences. Advantage functions compare an action with the policy's average behavior at the same state.

Worked reading.

Occupancy measures explain why RL gradients weight states by how often the current policy visits them.

Examples:

  1. $V^\pi$.
  2. $Q^\pi$.
  3. $A^\pi$.

Non-examples:

  1. instant reward only.
  2. a metric computed on states never reached by the policy.

4. Bellman Equations

Bellman Equations is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

4.1 Bellman expectation equation

Purpose. Bellman expectation equation focuses on recursive consistency for a fixed policy. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$V^\pi(s)=\sum_a\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V^\pi(s')\right).$$

Operational definition.

Bellman equations express a global return as immediate reward plus the value of the next state. They are recursive consistency equations.

Worked reading.

The Bellman backup replaces a value estimate by a reward-plus-next-value target under either a fixed policy or an optimal action.
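A minimal iterative policy evaluation sketch on the illustrative chain MDP from section 1.1; repeated expectation backups converge to $V^\pi$:

```python
import numpy as np

# Illustrative chain MDP (see section 1.1) and a uniform random policy.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, n_states - 1)] = 1.0
r = np.zeros((n_states, n_actions))
r[1, 1] = 1.0
pi = np.full((n_states, n_actions), 0.5)

V = np.zeros(n_states)
for _ in range(200):                   # repeated Bellman expectation backups
    Q = r + gamma * P @ V              # Q[s, a] = r(s, a) + gamma * E[V(s')]
    V_new = (pi * Q).sum(axis=1)       # average over pi(a | s)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)
```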

Examples:

  1. policy evaluation.
  2. value iteration.
  3. TD target construction.

Non-examples:

  1. a one-step supervised label.
  2. a loss that ignores future value.

4.2 Bellman optimality equation

Purpose. Bellman optimality equation focuses on recursive consistency for the best policy. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$Q^*(s,a)=\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma\max_{a'}Q^*(s',a')\right).$$

Operational definition.

Bellman equations express a global return as immediate reward plus the value of the next state. They are recursive consistency equations.

Worked reading.

The Bellman backup replaces a value estimate by a reward-plus-next-value target under either a fixed policy or an optimal action.
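The same loop with a max over actions becomes value iteration; again a sketch on the illustrative chain MDP:

```python
import numpy as np

# Same illustrative chain MDP arrays P (S, A, S') and r (S, A) as before.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, n_states - 1)] = 1.0
r = np.zeros((n_states, n_actions))
r[1, 1] = 1.0

V = np.zeros(n_states)
for _ in range(200):
    Q = r + gamma * P @ V          # one-step lookahead for every (s, a)
    V_new = Q.max(axis=1)          # Bellman optimality backup: max over a
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V, Q.argmax(axis=1))         # optimal values and a greedy policy
```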

Examples:

  1. policy evaluation.
  2. value iteration.
  3. TD target construction.

Non-examples:

  1. a one-step supervised label.
  2. a loss that ignores future value.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
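A minimal sketch of the optimality backup iterated to a fixed point, on the same style of illustrative two-state MDP (all names and numbers are assumptions for the sketch):

```python
import numpy as np

# Minimal sketch: iterate Q <- r + gamma * P max_a' Q to approximate Q*.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[a, s, s']
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])    # R[a, s]
gamma = 0.9

Q = np.zeros((2, 2))                      # Q[s, a]
for _ in range(200):
    V = Q.max(axis=1)                     # greedy value max_a' Q(s', a')
    Q = np.array([[R[a, s] + gamma * P[a, s] @ V for a in range(2)]
                  for s in range(2)])
print(Q)  # approximate Q*; the greedy policy is Q.argmax(axis=1)
```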

4.3 Contraction mapping intuition

Purpose. Contraction mapping intuition focuses on why dynamic programming converges when $\gamma<1$. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$\|T^*V-T^*V'\|_\infty\le\gamma\,\|V-V'\|_\infty.$$

Operational definition.

The Bellman backup operators are $\gamma$-contractions in the sup norm: applying a backup to two value functions brings them closer together by at least a factor of $\gamma$.

Worked reading.

Because the backup is a contraction, repeated application converges to a unique fixed point ($V^\pi$ for the expectation operator, $V^*$ for the optimality operator) from any starting estimate, and the error shrinks geometrically.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. iterative policy evaluation converging to $V^\pi$.
  2. value iteration converging to $V^*$.
  3. geometric error bounds of the form $\|V_k-V^*\|_\infty\le\gamma^k\|V_0-V^*\|_\infty$.

Non-examples:

  1. undiscounted settings with $\gamma=1$ and no guaranteed termination.
  2. approximate off-policy updates, which need not be contractions.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
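A minimal numerical check of the contraction property, under the same illustrative-MDP assumptions (`P`, `R`, and the random value vectors are made up for the sketch):

```python
import numpy as np

# Minimal sketch: the optimal backup shrinks the sup-norm distance between
# two arbitrary value vectors by at least a factor of gamma.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
gamma = 0.9

def backup(V):
    # (T*V)(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    return np.array([max(R[a, s] + gamma * P[a, s] @ V for a in range(2))
                     for s in range(2)])

rng = np.random.default_rng(0)
V1, V2 = rng.normal(size=2) * 10, rng.normal(size=2) * 10
before = np.abs(V1 - V2).max()
after = np.abs(backup(V1) - backup(V2)).max()
print(after <= gamma * before + 1e-12)  # True: gamma-contraction in sup norm
```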

4.4 Matrix form for finite MDPs

Purpose. Matrix form for finite MDPs focuses on how policy evaluation becomes a linear system. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$V^\pi=r^\pi+\gamma P^\pi V^\pi\quad\Longrightarrow\quad V^\pi=(I-\gamma P^\pi)^{-1}r^\pi.$$

Operational definition.

For a finite MDP and a fixed policy, policy evaluation is a linear system: stack the values into a vector, the transitions under $\pi$ into a matrix $P^\pi$, and the expected one-step rewards into a vector $r^\pi$.

Worked reading.

Since $\gamma<1$ and $P^\pi$ is a stochastic matrix, $I-\gamma P^\pi$ is invertible, so $V^\pi$ has a closed-form solution that iterative policy evaluation approaches.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. direct linear solve for $V^\pi$ in a chain MDP.
  2. iterative policy evaluation as a fixed-point method for the same system.
  3. checking a sampled TD estimate against the exact solve.

Non-examples:

  1. model-free learning, where $P^\pi$ is never formed explicitly.
  2. continuous state spaces, where the matrix form is unavailable.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
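A minimal sketch of the closed-form solve, assuming an illustrative two-state MDP and a uniform policy:

```python
import numpy as np

# Minimal sketch: exact policy evaluation as the linear solve
# V^pi = (I - gamma * P^pi)^{-1} r^pi.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[a, s, s']
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])    # R[a, s]
gamma, pi = 0.9, np.full((2, 2), 0.5)

P_pi = np.einsum('sa,asp->sp', pi, P)     # state-to-state matrix under pi
r_pi = np.einsum('sa,as->s', pi, R)       # expected one-step reward under pi
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V)  # exact V^pi, no iteration needed
```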

4.5 Bellman residuals

Purpose. Bellman residuals focus on diagnosing approximate value functions. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$\mathrm{res}^\pi(s)=\sum_a\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V(s')\right)-V(s).$$

Operational definition.

Bellman equations express a global return as immediate reward plus the value of the next state. They are recursive consistency equations.

Worked reading.

The Bellman backup replaces a value estimate by a reward-plus-next-value target under either a fixed policy or an optimal action.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. policy evaluation.
  2. value iteration.
  3. TD target construction.

Non-examples:

  1. a one-step supervised label.
  2. a loss that ignores future value.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
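A minimal sketch of computing the residual for a guessed value function, under the same illustrative-MDP assumptions (the guess in `V_approx` is made up):

```python
import numpy as np

# Minimal sketch: Bellman residual of an approximate V under a fixed policy.
# A residual of zero everywhere would mean V solves the expectation equation.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
gamma, pi = 0.9, np.full((2, 2), 0.5)

V_approx = np.array([5.0, 9.0])           # illustrative guess at V^pi
P_pi = np.einsum('sa,asp->sp', pi, P)
r_pi = np.einsum('sa,as->s', pi, R)
residual = r_pi + gamma * P_pi @ V_approx - V_approx
print(residual, np.abs(residual).max())   # per-state error and its sup norm
```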

5. Dynamic Programming

Dynamic Programming is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

5.1 Policy evaluation

Purpose. Policy evaluation focuses on computing $V^\pi$ when the model is known. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$V^\pi(s)=\sum_a\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V^\pi(s')\right).$$

Operational definition.

Dynamic programming uses a known model to compute values and policies through Bellman backups. Policy evaluation is the prediction half: the policy is held fixed and only $V^\pi$ is computed.

Worked reading.

The Bellman expectation equation becomes an update rule: repeatedly replace $V(s)$ by the expected reward-plus-next-value under the fixed policy until the values stop changing.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. iterative policy evaluation by repeated sweeps.
  2. exact evaluation by solving the linear system.
  3. the evaluation step inside policy iteration.

Non-examples:

  1. model-free Q-learning.
  2. one-step supervised classification.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
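A minimal sketch of iterative policy evaluation on an illustrative two-state MDP (all arrays and the tolerance are assumptions for the sketch):

```python
import numpy as np

# Minimal sketch: sweep the Bellman expectation backup until the
# sup-norm change falls below a tolerance.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
gamma, pi = 0.9, np.full((2, 2), 0.5)
P_pi = np.einsum('sa,asp->sp', pi, P)
r_pi = np.einsum('sa,as->s', pi, R)

V = np.zeros(2)
while True:
    V_new = r_pi + gamma * P_pi @ V       # one expectation backup
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new
print(V)  # matches the exact linear-solve answer for V^pi
```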

5.2 Policy improvement

Purpose. Policy improvement focuses on turning a value function into a better greedy policy. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$\pi'(s)=\arg\max_a\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V^\pi(s')\right).$$

Operational definition.

Policy improvement turns a value function into a better policy: compute $Q^\pi$ from $V^\pi$ and the model, then act greedily.

Worked reading.

The policy improvement theorem guarantees $V^{\pi'}(s)\ge V^\pi(s)$ for every state, with strict improvement somewhere unless $\pi$ is already optimal.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. greedy improvement from an evaluated $V^\pi$.
  2. the improvement step inside policy iteration.
  3. $\epsilon$-greedy improvement in on-policy control.

Non-examples:

  1. gradient updates to a parameterized policy without a greedy step.
  2. behavior cloning without reward feedback.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
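A minimal sketch of the greedy improvement step, assuming an illustrative model and an already-evaluated $V^\pi$ (the numbers in `V_pi` are made up):

```python
import numpy as np

# Minimal sketch: compute Q^pi from V^pi and the model, then act greedily.
# The improved policy is guaranteed to be no worse than pi at every state.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
gamma = 0.9
V_pi = np.array([7.0, 12.0])              # illustrative evaluation result

Q = np.array([[R[a, s] + gamma * P[a, s] @ V_pi for a in range(2)]
              for s in range(2)])          # Q[s, a]
greedy = Q.argmax(axis=1)
print(greedy)  # improved deterministic policy: one action index per state
```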

5.3 Policy iteration

Purpose. Policy iteration focuses on alternating evaluation and improvement. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$\pi_0\xrightarrow{\text{evaluate}}V^{\pi_0}\xrightarrow{\text{improve}}\pi_1\xrightarrow{\text{evaluate}}V^{\pi_1}\xrightarrow{\text{improve}}\cdots\longrightarrow\pi^*.$$

Operational definition.

Policy iteration alternates evaluation and greedy improvement: evaluate the current policy to get $V^{\pi_k}$, then improve greedily to get $\pi_{k+1}$.

Worked reading.

Each improvement step is monotone and a finite MDP admits only finitely many deterministic policies, so the alternation terminates at an optimal policy after finitely many iterations.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. exact policy iteration with a linear-solve evaluation step.
  2. modified policy iteration with truncated evaluation sweeps.
  3. generalized policy iteration as the pattern behind most control algorithms.

Non-examples:

  1. value iteration, which never evaluates a fixed policy to convergence.
  2. one-step supervised classification.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
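A minimal sketch of the full alternation on an illustrative two-state MDP (arrays and the starting policy are assumptions):

```python
import numpy as np

# Minimal sketch of policy iteration: exact evaluation (linear solve)
# alternated with greedy improvement until the policy stops changing.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
gamma = 0.9

policy = np.zeros(2, dtype=int)            # start with action 0 everywhere
while True:
    P_pi = P[policy, np.arange(2)]         # P^pi[s, s'] for deterministic pi
    r_pi = R[policy, np.arange(2)]
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)   # evaluate
    Q = np.array([[R[a, s] + gamma * P[a, s] @ V for a in range(2)]
                  for s in range(2)])
    new_policy = Q.argmax(axis=1)          # improve greedily
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
print(policy, V)  # optimal deterministic policy and its value
```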

5.4 Value iteration

Purpose. Value iteration focuses on combining backup and improvement into one operator. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$V_{k+1}(s)=\max_a\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V_k(s')\right).$$

Operational definition.

Value iteration fuses backup and improvement into one operator: each sweep applies the Bellman optimality backup directly to $V$.

Worked reading.

Because the optimality operator is a $\gamma$-contraction, the sweeps converge to $V^*$; the greedy policy with respect to the converged $V$ is optimal.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. tabular value iteration in a gridworld.
  2. value iteration on a small chain MDP.
  3. a stopping rule based on the sup-norm change between sweeps.

Non-examples:

  1. policy evaluation of a fixed policy.
  2. model-free Q-learning, which samples transitions instead of summing over the model.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
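A minimal sketch of value iteration with a sup-norm stopping rule, under the same illustrative-MDP assumptions:

```python
import numpy as np

# Minimal sketch: apply the optimal backup directly to V until the
# sup-norm change is small, then read off the greedy policy.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
gamma = 0.9

V = np.zeros(2)
while True:
    Q = np.array([[R[a, s] + gamma * P[a, s] @ V for a in range(2)]
                  for s in range(2)])
    V_new = Q.max(axis=1)                 # backup and improvement in one step
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new
print(V, Q.argmax(axis=1))  # V* estimate and the greedy policy
```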

5.5 Planning versus learning

Purpose. Planning versus learning focuses on why model access changes the algorithm. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1},\qquad 0\le\gamma<1.$$

Operational definition.

Dynamic programming uses a known model to compute values and policies through Bellman backups.

Worked reading.

Policy iteration alternates evaluation and greedy improvement; value iteration applies optimal backups directly.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. policy evaluation.
  2. policy iteration.
  3. value iteration.

Non-examples:

  1. model-free Q-learning.
  2. one-step supervised classification.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
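A minimal sketch of the planning/learning split: the same backup target computed once as an exact expectation over a known model and once from a single sampled transition (all arrays and the state-action pair are illustrative):

```python
import numpy as np

# Minimal sketch: exact expected backup (planning) versus a sampled
# one-step target (learning) for the same state-action pair.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
gamma, V = 0.9, np.array([7.0, 12.0])     # illustrative value estimate
s, a = 0, 1
rng = np.random.default_rng(0)

planned = R[a, s] + gamma * P[a, s] @ V   # exact expectation over the model
s_next = rng.choice(2, p=P[a, s])         # one sampled transition
sampled = R[a, s] + gamma * V[s_next]     # noisy sample of the same target
print(planned, sampled)  # the sampled target is unbiased for the planned one
```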

6. Sampling-Based Prediction

Sampling-Based Prediction is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

6.1 Monte Carlo returns

Purpose. Monte Carlo returns focus on learning from complete sampled episodes. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$V(S_t)\leftarrow V(S_t)+\alpha\left(G_t-V(S_t)\right),\qquad G_t=\sum_{k=0}^{T-t-1}\gamma^k R_{t+k+1}.$$

Operational definition.

Sampling-based prediction learns values from trajectories. Monte Carlo waits for full returns; TD bootstraps from the next estimate.

Worked reading.

The TD error $\delta_t=R_{t+1}+\gamma V(S_{t+1})-V(S_t)$ is a local surprise signal.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. TD(0).
  2. n-step returns.
  3. TD(lambda).

Non-examples:

  1. solving the Bellman linear system exactly.
  2. using labels independent of the current policy.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
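A minimal sketch of computing Monte Carlo targets for one episode with a single backward pass (the reward sequence is illustrative):

```python
# Minimal sketch: discounted returns for every step of one episode,
# computed backward so each G_t reuses G_{t+1}.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]   # R_1 ... R_T from one episode

returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G                 # G_t = R_{t+1} + gamma * G_{t+1}
    returns.append(G)
returns.reverse()
print(returns)                        # Monte Carlo targets G_0 ... G_{T-1}
```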

6.2 Temporal-difference learning

Purpose. Temporal-difference learning focuses on learning from bootstrapped one-step targets. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$V(S_t)\leftarrow V(S_t)+\alpha\left(R_{t+1}+\gamma V(S_{t+1})-V(S_t)\right).$$

Operational definition.

Sampling-based prediction learns values from trajectories. Monte Carlo waits for full returns; TD bootstraps from the next estimate.

Worked reading.

The TD error $\delta_t=R_{t+1}+\gamma V(S_{t+1})-V(S_t)$ is a local surprise signal.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. TD(0).
  2. n-step returns.
  3. TD(lambda).

Non-examples:

  1. solving the Bellman linear system exactly.
  2. using labels independent of the current policy.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
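A minimal sketch of TD(0) prediction on an illustrative two-state chain with a fixed policy (the transition matrix, rewards, and step size are assumptions):

```python
import numpy as np

# Minimal sketch: update V toward the bootstrapped target
# R + gamma * V(s') after every sampled step of a continuing chain.
rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.1, 0.9]])   # fixed-policy transition matrix
r = np.array([1.0, 0.0])                 # expected reward per state
gamma, alpha = 0.9, 0.05

V, s = np.zeros(2), 0
for _ in range(20000):
    s_next = rng.choice(2, p=P[s])
    delta = r[s] + gamma * V[s_next] - V[s]   # TD error
    V[s] += alpha * delta
    s = s_next
print(V)  # approximates V^pi for the chain
```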

6.3 Bias variance and bootstrapping

Purpose. Bias, variance, and bootstrapping focus on why MC and TD make different errors. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$G_t^{\mathrm{MC}}=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1},\qquad G_t^{\mathrm{TD}}=R_{t+1}+\gamma V(S_{t+1}).$$

Operational definition.

Monte Carlo targets are unbiased samples of the true return but accumulate noise from every remaining step; TD targets bootstrap from the current estimate, cutting that variance at the cost of bias while $V$ is still wrong.

Worked reading.

The choice of target is a bias-variance dial: full returns sit at one end, one-step bootstrapped targets at the other, with n-step and $\lambda$-returns in between.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. Monte Carlo prediction with full returns.
  2. TD(0) with one-step targets.
  3. n-step and TD(lambda) interpolations.

Non-examples:

  1. static regression with fixed labels.
  2. solving the Bellman linear system exactly, which has neither sampling variance nor bootstrap bias.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
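A minimal sketch comparing the empirical spread of MC and TD targets at one state, on an illustrative chain (all numbers, including the current estimate `V_est`, are assumptions; returns are truncated at 60 steps, where $\gamma^{60}$ is negligible):

```python
import numpy as np

# Minimal sketch: MC targets are noisier than TD targets for the same
# state, but TD targets lean on the possibly-wrong current estimate.
rng = np.random.default_rng(0)
gamma, V_est = 0.9, np.array([4.0, 6.0])  # illustrative current estimate
P = np.array([[0.5, 0.5], [0.5, 0.5]])
r = np.array([1.0, 2.0])

def episode_return(s, steps=60):
    G, g = 0.0, 1.0
    for _ in range(steps):                # truncated discounted return
        G += g * r[s]
        g *= gamma
        s = rng.choice(2, p=P[s])
    return G

mc = [episode_return(0) for _ in range(2000)]
td = [r[0] + gamma * V_est[rng.choice(2, p=P[0])] for _ in range(2000)]
print(np.std(mc), np.std(td))  # MC target spread exceeds TD target spread here
```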

6.4 N-step returns

Purpose. N-step returns focus on interpolating between TD(0) and Monte Carlo. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$G_t^{(n)}=\sum_{k=0}^{n-1}\gamma^k R_{t+k+1}+\gamma^n V(S_{t+n}).$$

Operational definition.

An n-step target sums $n$ sampled rewards and then bootstraps from the estimate at the $n$-th successor state; $n=1$ recovers TD(0) and $n\to\infty$ recovers Monte Carlo.

Worked reading.

Increasing $n$ moves the target along the bias-variance dial: more sampled rewards mean more variance but less reliance on the current estimate.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. TD(0).
  2. n-step returns.
  3. TD(lambda).

Non-examples:

  1. solving the Bellman linear system exactly.
  2. using labels independent of the current policy.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
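A minimal sketch of assembling one n-step target from an illustrative trajectory fragment (the rewards, states, and `V` are made-up placeholders):

```python
import numpy as np

# Minimal sketch: the n-step target sums n discounted rewards and then
# bootstraps from the current estimate V at the n-th successor state.
gamma, n = 0.9, 3
rewards = [0.0, 1.0, 0.0, 2.0, 1.0]   # R_{t+1}, R_{t+2}, ... (illustrative)
states = [0, 1, 1, 0, 1, 0]           # S_t, S_{t+1}, ..., S_{t+5}
V = np.array([4.0, 6.0])              # illustrative current estimate

t = 0
G_n = sum(gamma**k * rewards[t + k] for k in range(n))
G_n += gamma**n * V[states[t + n]]    # bootstrap tail
print(G_n)  # 3-step target; n=1 is TD(0), large n approaches Monte Carlo
```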

6.5 Eligibility traces and TD lambda

Purpose. Eligibility traces and TD lambda focus on credit assignment across recent visits. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$e_t(s)=\gamma\lambda\,e_{t-1}(s)+\mathbf{1}[S_t=s],\qquad V(s)\leftarrow V(s)+\alpha\,\delta_t\,e_t(s).$$

Operational definition.

Sampling-based prediction learns values from trajectories. Monte Carlo waits for full returns; TD bootstraps from the next estimate.

Worked reading.

The TD error $\delta_t=R_{t+1}+\gamma V(S_{t+1})-V(S_t)$ is a local surprise signal.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. TD(0).
  2. n-step returns.
  3. TD(lambda).

Non-examples:

  1. solving the Bellman linear system exactly.
  2. using labels independent of the current policy.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
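A minimal sketch of TD($\lambda$) with accumulating traces on the same illustrative chain (all constants are assumptions):

```python
import numpy as np

# Minimal sketch: every recently visited state shares credit for each
# TD error, weighted by its decaying eligibility trace.
rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.1, 0.9]])
r = np.array([1.0, 0.0])
gamma, lam, alpha = 0.9, 0.8, 0.05

V, e, s = np.zeros(2), np.zeros(2), 0
for _ in range(20000):
    s_next = rng.choice(2, p=P[s])
    delta = r[s] + gamma * V[s_next] - V[s]
    e *= gamma * lam                  # decay all traces
    e[s] += 1.0                       # accumulate trace for the visited state
    V += alpha * delta * e            # broadcast the TD error along the traces
    s = s_next
print(V)
```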

7. Control Algorithms

Control Algorithms is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

7.1 SARSA

Purpose. SARSA focuses on on-policy TD control. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha\left(R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)\right).$$

Operational definition.

Control algorithms learn how to choose actions, not just how to evaluate a fixed policy.

Worked reading.

SARSA uses the action actually sampled by the behavior policy; Q-learning uses a greedy target and is therefore off-policy.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. tabular gridworld control.
  2. DQN-style value learning.
  3. epsilon-greedy exploration.

Non-examples:

  1. estimating $V^\pi$ only.
  2. planning with a perfect model and no samples.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
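A minimal sketch of tabular SARSA with $\epsilon$-greedy behavior on an illustrative two-state, two-action MDP (arrays and constants are assumptions); note that the bootstrap uses the action actually sampled at the next state:

```python
import numpy as np

# Minimal sketch of on-policy SARSA: the target bootstraps from the
# next action the behavior policy really takes.
rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
gamma, alpha, eps = 0.9, 0.05, 0.1

def eps_greedy(Q, s):
    return rng.integers(2) if rng.random() < eps else int(Q[s].argmax())

Q, s = np.zeros((2, 2)), 0
a = eps_greedy(Q, s)
for _ in range(50000):
    s_next = rng.choice(2, p=P[a, s])
    a_next = eps_greedy(Q, s_next)        # the action SARSA will really take
    target = R[a, s] + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    s, a = s_next, a_next
print(Q)
```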

7.2 Q-learning

Purpose. Q-learning focuses on off-policy TD control with greedy targets. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha\left(R_{t+1}+\gamma\max_{a'}Q(S_{t+1},a')-Q(S_t,A_t)\right).$$

Operational definition.

Control algorithms learn how to choose actions, not just how to evaluate a fixed policy.

Worked reading.

SARSA uses the action actually sampled by the behavior policy; Q-learning uses a greedy target and is therefore off-policy.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. tabular gridworld control.
  2. DQN-style value learning.
  3. epsilon-greedy exploration.

Non-examples:

  1. estimating $V^\pi$ only.
  2. planning with a perfect model and no samples.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
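A minimal sketch of tabular Q-learning on the same illustrative MDP; the loop differs from the SARSA sketch only in the greedy `max` inside the target:

```python
import numpy as np

# Minimal sketch of off-policy Q-learning: the target maximizes over
# next actions regardless of what the behavior policy samples.
rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
gamma, alpha, eps = 0.9, 0.05, 0.1

Q, s = np.zeros((2, 2)), 0
for _ in range(50000):
    a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())
    s_next = rng.choice(2, p=P[a, s])
    target = R[a, s] + gamma * Q[s_next].max()   # greedy target: off-policy
    Q[s, a] += alpha * (target - Q[s, a])
    s = s_next
print(Q, Q.argmax(axis=1))  # approximates Q* despite epsilon-greedy behavior
```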

7.3 Exploration schedules

Purpose. Exploration schedules focus on $\epsilon$-greedy, softmax, and optimism. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$\pi_\epsilon(a\mid s)=\begin{cases}1-\epsilon+\epsilon/|\mathcal{A}| & a=\arg\max_{a'}Q(s,a'),\\ \epsilon/|\mathcal{A}| & \text{otherwise.}\end{cases}$$

Operational definition.

An exploration schedule controls how much probability the behavior policy gives to non-greedy actions, and how that probability decays as the value estimates improve.

Worked reading.

$\epsilon$-greedy mixes uniform noise into the greedy choice; softmax weights actions by estimated value through a temperature; optimistic initialization makes untried actions look valuable until they are visited.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. $\epsilon$-greedy with a decaying $\epsilon$.
  2. softmax (Boltzmann) action selection with a temperature schedule.
  3. optimistic initial values.

Non-examples:

  1. a purely greedy policy from the first step.
  2. a fixed uniform-random policy that never exploits.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
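A minimal sketch of a decaying $\epsilon$ schedule and the $\epsilon$-greedy draw it controls (the exponential-decay form and its constants are assumptions, one common choice among many):

```python
import numpy as np

# Minimal sketch: anneal epsilon from a high start toward a floor,
# and use it to mix uniform exploration into the greedy choice.
rng = np.random.default_rng(0)

def epsilon(t, start=1.0, end=0.05, decay=2000.0):
    # illustrative schedule: exponential decay toward the floor `end`
    return end + (start - end) * np.exp(-t / decay)

def eps_greedy(Q_row, eps):
    return rng.integers(len(Q_row)) if rng.random() < eps else int(Q_row.argmax())

Q_row = np.array([0.2, 0.8])              # illustrative action values
for t in [0, 1000, 5000, 20000]:
    print(t, round(float(epsilon(t)), 3), eps_greedy(Q_row, epsilon(t)))
```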

7.4 Double Q-learning

Purpose. Double Q-learning focuses on reducing maximization bias. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$Q_1(S_t,A_t)\leftarrow Q_1(S_t,A_t)+\alpha\left(R_{t+1}+\gamma\,Q_2\!\left(S_{t+1},\arg\max_{a'}Q_1(S_{t+1},a')\right)-Q_1(S_t,A_t)\right).$$

Operational definition.

Control algorithms learn how to choose actions, not just how to evaluate a fixed policy.

Worked reading.

SARSA uses the action actually sampled by the behavior policy; Q-learning uses a greedy target and is therefore off-policy.

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s' \mid s, a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s, a, s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a \mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi, Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. tabular gridworld control.
  2. DQN-style value learning.
  3. epsilon-greedy exploration.

Non-examples:

  1. estimating $V^\pi$ only.
  2. planning with a perfect model and no samples.

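To make the decoupling concrete, here is a minimal tabular sketch of the update. The environment is left abstract: `double_q_update` consumes a hypothetical transition `(s, a, r, s_next, done)`, and the table sizes, step size, and discount are illustrative choices rather than prescribed values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
alpha, gamma = 0.1, 0.9

# Two independent tables; which one is updated is chosen at random each step.
Q = [np.zeros((n_states, n_actions)), np.zeros((n_states, n_actions))]

def double_q_update(s, a, r, s_next, done):
    """One tabular double Q-learning update."""
    i = int(rng.integers(2))               # table i is updated and selects
    j = 1 - i                              # table j evaluates the selected action
    a_star = int(np.argmax(Q[i][s_next]))  # greedy selection with table i
    target = r + (0.0 if done else gamma * Q[j][s_next, a_star])
    Q[i][s, a] += alpha * (target - Q[i][s, a])

double_q_update(s=0, a=1, r=1.0, s_next=2, done=False)
print(Q[0][0], Q[1][0])                    # exactly one table moved at (0, 1)
```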

7.5 Convergence conditions

Purpose. Convergence conditions make explicit what the tabular guarantees assume: every state-action pair must keep being visited, and step sizes must decay at the right rate.

$$\sum_{t=0}^{\infty}\alpha_t(s,a)=\infty,\qquad \sum_{t=0}^{\infty}\alpha_t^2(s,a)<\infty.$$

Operational definition.

Tabular Q-learning converges to $Q^*$ with probability 1 when each $(s,a)$ is updated infinitely often and the step sizes satisfy the Robbins-Monro conditions above.

Worked reading.

A constant step size violates the second condition, so the estimates keep fluctuating around the fixed point; a step size that decays too quickly violates the first, so learning can stall before reaching it. A common compliant choice is $\alpha_t(s,a)=1/N_t(s,a)$, the inverse visit count.

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.

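A small sketch of the conditions in action, using a hypothetical two-state MDP whose transitions, rewards, and constants are all illustrative. The visit-count step size $\alpha_t(s,a)=1/N_t(s,a)$ satisfies both Robbins-Monro conditions, and the uniform behavior policy guarantees infinite visitation, so tabular Q-learning should settle near $Q^*$.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9

# Hypothetical 2-state, 2-action MDP: action 1 pays 1 and leads to state 1,
# action 0 pays 0 and leads to state 0 (deterministic, for inspectability).
P = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 1.0}

Q = np.zeros((2, 2))
visits = np.zeros((2, 2))

s = 0
for t in range(50_000):
    a = int(rng.integers(2))             # uniform behavior policy: infinite visitation
    s_next, r = P[(s, a)], R[(s, a)]
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]           # satisfies both Robbins-Monro conditions
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(np.round(Q, 2))                    # approaches Q* = [[9, 10], [9, 10]]
```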

8. Function Approximation and Deep RL

Function approximation and deep RL continue the core mathematical path from Markov chains to modern AI agents. The emphasis is on which update equations survive, and which guarantees break, when tables are replaced by parameterized models.

8.1 Why tables fail

Purpose. Why tables fail focuses on large state spaces and generalization: a table stores one independent cell per state-action pair, so it cannot share evidence between similar states, and most cells are never visited.

$$\hat{Q}(s,a;\theta)\approx Q^\pi(s,a),\qquad \dim(\theta)\ll|\mathcal{S}|\,|\mathcal{A}|.$$

Operational definition.

When the state is an image, a sensor vector, or a token prefix, $|\mathcal{S}|$ is astronomically large; the agent must estimate values through a parameterized function of features rather than a lookup table.

Worked reading.

An update to $\theta$ changes the estimate at many states at once. This sharing is what makes learning possible at scale, and it is also the mechanism through which bootstrapped errors can propagate.

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.


8.2 Linear value approximation

Purpose. Linear value approximation focuses on the projected Bellman equation: with features $\phi(s)$ and weights $\mathbf{w}$, on-policy TD converges to the fixed point of the Bellman operator composed with projection onto the feature span.

$$\hat{V}(s;\mathbf{w})=\mathbf{w}^\top\phi(s),\qquad \mathbf{w}\leftarrow\mathbf{w}+\alpha\big(R_{t+1}+\gamma\,\hat{V}(S_{t+1};\mathbf{w})-\hat{V}(S_t;\mathbf{w})\big)\,\phi(S_t).$$

Operational definition.

The estimate is linear in fixed features; learning adjusts only $\mathbf{w}$. The update is a semi-gradient: the bootstrapped target is treated as a constant even though it depends on $\mathbf{w}$.

Worked reading.

Occupancy measures explain why the fixed point weights approximation error by how often the current policy visits each state: states the policy never reaches contribute nothing to the projection.

Examples:

  1. $V^\pi$.
  2. $Q^\pi$.
  3. $A^\pi$.

Non-examples:

  1. instant reward only.
  2. a metric computed on states never reached by the policy.

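A minimal sketch of semi-gradient TD(0) with fixed random features on a hypothetical five-state chain; the feature dimension, step size, and episode count are illustrative. With fewer features than states the fixed point is a projection, so some approximation error is expected.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, alpha = 0.9, 0.05
n_states, n_features = 5, 3

# Hypothetical chain: deterministic move right; reward 1 on entering the final state.
phi = rng.normal(size=(n_states, n_features))   # fixed random features
w = np.zeros(n_features)

def v_hat(s):
    return phi[s] @ w

for episode in range(2_000):
    s = 0
    while s < n_states - 1:
        s_next = s + 1
        done = s_next == n_states - 1
        r = 1.0 if done else 0.0
        target = r + (0.0 if done else gamma * v_hat(s_next))  # bootstrap masked at terminal
        w += alpha * (target - v_hat(s)) * phi[s]              # semi-gradient: target held fixed
        s = s_next

print([round(float(v_hat(s)), 3) for s in range(n_states - 1)])
# True values for states 0..3 are [0.729, 0.81, 0.9, 1.0]; with 3 features
# fitting 4 values, some projection error is expected.
```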

8.3 The deadly triad

Purpose. The deadly triad names the combination that can make value learning diverge: function approximation, bootstrapping, and off-policy data. Any two are typically safe; all three together void the tabular guarantees.

$$\hat{V}\leftarrow\Pi_\mu T^\pi\hat{V},\qquad \Pi_\mu T^\pi\ \text{need not be a contraction when}\ \mu\neq d^\pi.$$

Operational definition.

Approximate TD repeatedly applies the Bellman operator $T^\pi$ followed by a projection $\Pi_\mu$ onto the representable functions, weighted by the data distribution $\mu$. Baird's counterexample shows this iteration can diverge with linear features and off-policy updates.

Worked reading.

$T^\pi$ contracts in the norm weighted by the on-policy distribution $d^\pi$, while $\Pi_\mu$ is non-expansive only in the $\mu$-weighted norm; when $\mu\neq d^\pi$ the two norms disagree and the composition can expand. This is why replay distributions, importance corrections, and target networks matter.

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.


8.4 DQN stabilization

Purpose. DQN stabilization focuses on replay buffers and target networks, the two mechanisms that make approximate Q-learning trainable: replay decorrelates successive samples, and a frozen target network stops the regression target from chasing itself.

$$L(\theta)=\mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\Big[\big(r+\gamma\max_{a'}Q_{\theta^-}(s',a')-Q_\theta(s,a)\big)^2\Big],$$

where $\mathcal{D}$ is the replay buffer and $\theta^-$ is a periodically synchronized copy of $\theta$.

Operational definition.

Function approximation replaces tables with parameterized models so the agent can generalize across large state spaces; the loss above turns the Bellman optimality target into a regression problem on sampled transitions.

Worked reading.

Deep RL is powerful because neural networks share statistical strength, but unstable because approximate bootstrapping can amplify errors; freezing $\theta^-$ between syncs makes each regression problem temporarily stationary.

Examples:

  1. DQN target networks.
  2. experience replay.
  3. critic networks.

Non-examples:

  1. exact dynamic programming in a tiny known MDP.
  2. memorizing every state-action value in a table.

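The following sketch shows both mechanisms with a linear "network" so every quantity stays printable; the function names, sizes, and schedule constants are illustrative, not a prescribed architecture. The point to notice is which parameters produce the target and which receive the gradient.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(3)
gamma, alpha = 0.9, 0.01
n_features, n_actions = 4, 2

theta = np.zeros((n_actions, n_features))   # online parameters (linear "network")
theta_target = theta.copy()                 # frozen copy used for bootstrap targets
buffer = deque(maxlen=10_000)               # experience replay: decorrelates samples

def train_step(step, batch_size=32, sync_every=100):
    if len(buffer) < batch_size:
        return
    for i in rng.choice(len(buffer), size=batch_size, replace=False):
        x, a, r, x_next, done = buffer[i]
        # The target uses theta_target, so the regression target only moves at syncs.
        bootstrap = 0.0 if done else gamma * (theta_target @ x_next).max()
        td_error = r + bootstrap - theta[a] @ x
        theta[a] += alpha * td_error * x    # semi-gradient: no gradient through target
    if step % sync_every == 0:
        theta_target[:] = theta             # periodic hard update of the target network

# Usage: after each environment step, buffer.append((x, a, r, x_next, done)),
# then call train_step(step) with the global step counter.
```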

8.5 Representation learning for policies and critics

Purpose. Representation learning for policies and critics focuses on how neural networks change the math: gradients couple states through shared parameters, so an update computed at one state moves values and action probabilities at many others.

$$Q^*(s,a)=\sum_{s'}P(s'\mid s,a)\big(r(s,a,s')+\gamma\max_{a'}Q^*(s',a')\big).$$

Operational definition.

The network must represent a function satisfying the recursion above, but it is trained only on visited states; the quality of the learned features determines how well the recursion holds elsewhere.

Worked reading.

Actor-critic systems often share a representation trunk between the policy head and the value head, so gradients from the value loss shape the features available to the policy and vice versa. Good features make the problem nearly linear; poor features alias distinct states.

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.


9. Policy Gradients

Policy gradients continue the core mathematical path from Markov chains to modern AI agents by optimizing the action distribution directly. The emphasis is on the estimator derivations and update equations a learner must be able to inspect in code.

9.1 Policy objective

Purpose. Policy objective focuses on maximizing $J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[G_0]$, the expected return of trajectories generated by the policy itself. The subscript matters: the data distribution depends on $\theta$.

$$J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[G_0],\qquad G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1},\quad 0\le\gamma<1.$$

Operational definition.

Policy methods optimize the action distribution directly. This is essential when actions are continuous, structured, or generated token by token.

Worked reading.

The score-function identity converts a derivative of an expectation into an expectation of $\nabla_\theta\log\pi_\theta(a\mid s)$ times a return-like signal.

Examples:

  1. REINFORCE.
  2. actor-critic.
  3. PPO for RLHF.

Non-examples:

  1. choosing $\arg\max_a Q(s,a)$ from a tiny action table.
  2. behavior cloning without reward feedback.


9.2 Log-derivative trick

Purpose. The log-derivative trick turns the gradient of an expectation over trajectory probabilities into an expectation of score functions, which can be estimated from samples of the current policy.

$$\nabla_\theta\,\mathbb{E}_{x\sim p_\theta}[f(x)]=\mathbb{E}_{x\sim p_\theta}\big[f(x)\,\nabla_\theta\log p_\theta(x)\big].$$

Operational definition.

The identity follows from $\nabla_\theta p_\theta=p_\theta\nabla_\theta\log p_\theta$. It requires sampling from $p_\theta$ and differentiating the log-density; it does not require $f$ to be differentiable, which is why discrete rewards pose no problem.

Worked reading.

Applied to trajectory probabilities, the dynamics terms $\log P(s'\mid s,a)$ do not depend on $\theta$, so only the policy terms $\sum_t\nabla_\theta\log\pi_\theta(A_t\mid S_t)$ survive in the gradient.

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.

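Because the identity is easy to state and easy to get wrong in code, a numeric cross-check is worth having. This sketch compares the exact gradient of $\mathbb{E}_{a\sim\pi_\theta}[f(a)]$ for a softmax policy against the score-function estimate; the payoff vector and logits are arbitrary illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(4)

# Softmax policy over 3 actions with logits theta; f is an arbitrary payoff.
theta = np.array([0.2, -0.1, 0.4])
f = np.array([1.0, 5.0, 2.0])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Exact gradient of E_{a~pi}[f(a)] via the softmax Jacobian (symmetric).
pi = softmax(theta)
jac = np.diag(pi) - np.outer(pi, pi)
exact = jac @ f

# Score-function estimate: mean of f(a) * grad_theta log pi(a).
n = 200_000
actions = rng.choice(3, size=n, p=pi)
score = np.eye(3)[actions] - pi              # grad of log softmax at each sample
estimate = (f[actions][:, None] * score).mean(axis=0)

print(np.round(exact, 4), np.round(estimate, 4))  # should agree to ~2 decimals
```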

9.3 Policy gradient theorem

Purpose. The policy gradient theorem explains why the gradient of $J(\theta)$ can weight score functions by $Q^{\pi_\theta}(s,a)$ without differentiating the state distribution, even though the state distribution also depends on $\theta$.

$$\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\big[\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,Q^{\pi_\theta}(S_t,A_t)\big].$$

Operational definition.

The environment's dependence on $\theta$ through visitation contributes no extra term: the gradient needs only the score of chosen actions, weighted by their long-run value under the current policy.

Worked reading.

Replacing $Q^{\pi_\theta}$ with the sampled return $G_t$ gives REINFORCE; replacing it with a learned critic gives actor-critic methods. Both are instances of the same identity with different estimators of the weight.

Examples:

  1. REINFORCE.
  2. actor-critic.
  3. PPO for RLHF.

Non-examples:

  1. choosing $\arg\max_a Q(s,a)$ from a tiny action table.
  2. behavior cloning without reward feedback.

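A minimal sketch of the theorem in its REINFORCE form on a hypothetical two-armed bandit: each pull is a one-step episode, so the sampled reward plays the role of $Q^{\pi_\theta}$; the arm means, step size, and horizon are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
theta = np.zeros(2)                      # softmax logits over two arms
alpha = 0.1
true_means = np.array([0.3, 0.7])        # hypothetical arm rewards

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for t in range(5_000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)
    r = rng.normal(true_means[a], 0.1)   # sampled return for this one-step episode
    score = np.eye(2)[a] - pi            # grad_theta log pi(a)
    theta += alpha * r * score           # REINFORCE: score times return

print(np.round(softmax(theta), 3))       # mass concentrates on the higher-mean arm
```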

9.4 Baselines and variance reduction

Purpose. Baselines and variance reduction focus on why subtracting a state-dependent $b(s)$ from the return leaves the gradient estimator unbiased while shrinking its variance.

$$\mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}\big[\nabla_\theta\log\pi_\theta(a\mid s)\,b(s)\big]=b(s)\,\nabla_\theta\sum_a\pi_\theta(a\mid s)=b(s)\,\nabla_\theta 1=0.$$

Operational definition.

A baseline is any function of the state subtracted from the return inside the estimator; the common choice $b(s)=V^\pi(s)$ turns the weight into the advantage $A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s)$.

Worked reading.

Because the score function integrates to zero over actions, the baseline shifts every action's weight by the same amount and cancels in expectation, while centering the sampled weights near zero reduces their spread.

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.

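The claim is directly checkable by simulation. This sketch estimates the same policy gradient twice on a hypothetical two-armed bandit, with and without a constant baseline (estimated from the same batch for simplicity, which introduces only a negligible bias at this sample size); the means agree while the variance drops sharply.

```python
import numpy as np

rng = np.random.default_rng(6)
theta = np.array([0.0, 0.0])                       # uniform softmax policy
pi = np.exp(theta) / np.exp(theta).sum()
true_means = np.array([0.3, 0.7])                  # hypothetical arm rewards

n = 100_000
actions = rng.choice(2, size=n, p=pi)
rewards = rng.normal(true_means[actions], 0.1)
score = np.eye(2)[actions] - pi                    # grad log pi for each sample

g_plain = rewards[:, None] * score                 # no baseline
baseline = rewards.mean()                          # constant baseline ~ E[R]
g_base = (rewards - baseline)[:, None] * score     # baseline-subtracted

print("means     :", g_plain.mean(0).round(4), g_base.mean(0).round(4))  # agree
print("variances :", g_plain.var(0).round(4), g_base.var(0).round(4))    # baseline smaller
```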

9.5 Entropy regularization

Purpose. Entropy regularization explains why stochastic policies are explicitly encouraged: an entropy bonus penalizes premature collapse onto a single action before the value landscape is well estimated.

$$J_\beta(\theta)=\mathbb{E}_{\pi_\theta}\Big[\sum_{t}\gamma^t\big(R_{t+1}+\beta\,\mathcal{H}(\pi_\theta(\cdot\mid S_t))\big)\Big],\qquad \mathcal{H}(\pi)=-\sum_a\pi(a)\log\pi(a).$$

Operational definition.

A bonus proportional to the policy's entropy is added to the reward or subtracted from the loss; the coefficient $\beta$ trades exploration against exploitation.

Worked reading.

As $\beta\to 0$ the objective recovers the standard return; large $\beta$ pushes the policy toward uniform. In deep actor-critic code the term usually appears as an extra $-\beta\,\mathcal{H}$ contribution to the policy loss.

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.

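A small sketch of the bonus term itself for a softmax policy, using the exact gradient $\partial\mathcal{H}/\partial\theta_j=-\pi_j(\log\pi_j+\mathcal{H})$ (derived from the softmax Jacobian) rather than a sampled estimate; the logits and $\beta$ are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy_bonus_grad(theta, beta=0.01):
    """Gradient of beta * H(pi_theta) for a softmax policy, computed exactly."""
    pi = softmax(theta)
    log_pi = np.log(pi)
    H = -(pi * log_pi).sum()
    grad_H = -pi * (log_pi + H)       # dH/dtheta_j = -pi_j (log pi_j + H)
    return beta * grad_H, H

theta = np.array([2.0, 0.0, -1.0])    # a fairly peaked 3-action policy
g, H = entropy_bonus_grad(theta)
print(round(H, 3), np.round(g, 4))    # the gradient pushes logits toward uniform
```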

10. Actor-Critic PPO and RLHF

Actor-critic methods, PPO, and RLHF carry the core mathematical path from Markov chains into modern LLM practice. The emphasis is on the objectives and update equations a learner must be able to inspect in code.

10.1 Actor-critic decomposition

Purpose. The actor-critic decomposition couples policy and value learning: the critic estimates how good states are, and the actor follows a gradient weighted by the critic's advantage estimate.

$$A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s),\qquad \delta_t=R_{t+1}+\gamma V_w(S_{t+1})-V_w(S_t).$$

Operational definition.

The actor is the policy $\pi_\theta$; the critic is a value estimate $V_w$. When $V_w=V^\pi$, the TD error $\delta_t$ is an unbiased estimate of the advantage at $(S_t,A_t)$.

Worked reading.

Each transition yields two updates: $w\leftarrow w+\alpha_w\,\delta_t\,\nabla_w V_w(S_t)$ for the critic and $\theta\leftarrow\theta+\alpha_\theta\,\delta_t\,\nabla_\theta\log\pi_\theta(A_t\mid S_t)$ for the actor.

Examples:

  1. REINFORCE.
  2. actor-critic.
  3. PPO for RLHF.

Non-examples:

  1. choosing $\arg\max_a Q(s,a)$ from a tiny action table.
  2. behavior cloning without reward feedback.

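A one-step actor-critic sketch on a hypothetical two-state MDP where action 1 always pays 1; all sizes, dynamics, and step sizes are illustrative. Both updates consume the same TD error, which is the decomposition in miniature.

```python
import numpy as np

rng = np.random.default_rng(7)
gamma, alpha_w, alpha_th = 0.9, 0.1, 0.05
n_states, n_actions = 2, 2

V = np.zeros(n_states)                    # tabular critic
theta = np.zeros((n_states, n_actions))   # tabular softmax actor

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical MDP: action 1 pays 1 and moves to state 1; action 0 pays 0, state 0.
def env_step(s, a):
    return (1.0, 1) if a == 1 else (0.0, 0)

s = 0
for t in range(20_000):
    pi = softmax(theta[s])
    a = rng.choice(n_actions, p=pi)
    r, s_next = env_step(s, a)
    delta = r + gamma * V[s_next] - V[s]                        # TD error ~ advantage
    V[s] += alpha_w * delta                                     # critic update
    theta[s] += alpha_th * delta * (np.eye(n_actions)[a] - pi)  # actor: score * delta
    s = s_next

print(np.round(V, 2), softmax(theta[0]).round(3), softmax(theta[1]).round(3))
# The recurrent state's value approaches 1/(1-gamma) = 10, and both softmax
# policies shift their mass toward action 1.
```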

10.2 Generalized advantage estimation

Purpose. Generalized advantage estimation trades bias against variance with a single parameter $\lambda$, exponentially averaging the family of $n$-step advantage estimates.

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}=\sum_{l=0}^{\infty}(\gamma\lambda)^l\,\delta_{t+l},\qquad \delta_t=R_{t+1}+\gamma V(S_{t+1})-V(S_t).$$

Operational definition.

$\lambda=0$ recovers the one-step TD error (low variance, biased by the critic); $\lambda=1$ recovers the Monte Carlo advantage $G_t-V(S_t)$ (unbiased given full trajectories, high variance).

Worked reading.

Implementations compute the sum with a single backward pass, $\hat{A}_t=\delta_t+\gamma\lambda\,\hat{A}_{t+1}$, masking the recursion at terminal states so value is not propagated across episode boundaries.

Examples:

  1. $V^\pi$.
  2. $Q^\pi$.
  3. $A^\pi$.

Non-examples:

  1. instant reward only.
  2. a metric computed on states never reached by the policy.

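The backward pass described above fits in a few lines. This sketch assumes `values` carries one extra entry for the bootstrap state and uses `dones` to mask terminals; the check at the bottom confirms that $\lambda=1$ with a zero critic reduces to discounted returns.

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward-pass GAE: A_t = delta_t + gamma*lam*A_{t+1}, masked at terminals.

    `values` has one extra entry for the state after the last transition."""
    T = len(rewards)
    adv = np.zeros(T)
    a_next = 0.0
    for t in reversed(range(T)):
        mask = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        a_next = delta + gamma * lam * mask * a_next
        adv[t] = a_next
    return adv

# Tiny check: with lam=1 and a zero critic, GAE equals discounted returns.
r = np.array([1.0, 0.0, 1.0])
v = np.zeros(4)
d = np.array([False, False, True])
print(gae(r, v, d, gamma=0.9, lam=1.0))   # [1 + 0.9*(0 + 0.9*1), 0.9, 1.0]
```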

10.3 Trust regions and KL control

Purpose. Trust regions and KL control limit how far a single update can move the policy, so that advantages estimated under the old policy remain approximately valid for the new one.

$$\max_\theta\ \mathbb{E}\!\left[\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{\mathrm{old}}}(a\mid s)}\,\hat{A}\right]\quad\text{s.t.}\quad\mathbb{E}\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s)\big)\big]\le\delta.$$

Operational definition.

The surrogate objective importance-weights old-policy data; the KL constraint, or an equivalent penalty, keeps the new policy inside the neighborhood where that reweighting is trustworthy.

Worked reading.

In RLHF the same mathematics appears as a per-token KL penalty against a frozen reference model; it both regularizes the optimization and anchors the tuned model's behavior to the reference distribution.

Examples:

  1. tabular gridworld control.
  2. DQN-style value learning.
  3. epsilon-greedy exploration.

Non-examples:

  1. estimating $V^\pi$ only.
  2. planning with a perfect model and no samples.

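A small sketch of the two ways the KL shows up: as a monitored scalar between successive softmax policies, and as an RLHF-style per-token penalty folded into the reward. The logits, log-probabilities, and coefficients are hypothetical placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return float((p * (np.log(p) - np.log(q))).sum())

pi_old = softmax(np.array([1.0, 0.0, -1.0]))
pi_new = softmax(np.array([2.5, 0.0, -1.0]))    # a large single-step policy move
print(round(kl(pi_old, pi_new), 4))             # compare against the budget delta

# RLHF-style per-token reward shaping (a sketch with hypothetical numbers):
beta = 0.1
logp_policy, logp_ref = -1.2, -2.0              # per-token log-probs: policy / reference
r_task = 0.5                                    # reward-model score share for this token
r_shaped = r_task - beta * (logp_policy - logp_ref)
print(round(r_shaped, 3))                       # 0.5 - 0.1 * 0.8 = 0.42
```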

10.4 PPO clipped surrogate objective

Purpose. The PPO clipped surrogate is a practical trust-region approximation: instead of an explicit KL constraint, it clips the importance ratio so the objective gains nothing from moving the ratio outside $[1-\epsilon,1+\epsilon]$.

$$L^{\mathrm{CLIP}}(\theta)=\mathbb{E}\Big[\min\big(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\,\hat{A}_t\big)\Big],\qquad r_t(\theta)=\frac{\pi_\theta(A_t\mid S_t)}{\pi_{\theta_{\mathrm{old}}}(A_t\mid S_t)}.$$

Operational definition.

For positive advantages the surrogate is capped once $r_t>1+\epsilon$; for negative advantages it is capped once $r_t<1-\epsilon$. The outer $\min$ makes the surrogate a pessimistic bound, so gradients vanish exactly where the update would exploit stale data.

Worked reading.

Clipping bounds the incentive, not the policy itself: the achieved KL can still drift, which is why practical PPO runs also monitor the empirical KL and the fraction of clipped ratios.

Examples:

  1. REINFORCE.
  2. actor-critic.
  3. PPO for RLHF.

Non-examples:

  1. choosing $\arg\max_a Q(s,a)$ from a tiny action table.
  2. behavior cloning without reward feedback.

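The surrogate itself is a one-liner worth evaluating by hand; this sketch checks the two regimes described above with illustrative numbers.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Per-sample clipped surrogate; the loss to minimize is its negative mean."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return np.minimum(unclipped, clipped)

# Positive advantage: gains are capped once the ratio exceeds 1 + eps.
print(ppo_clip_objective(np.log(1.5), np.log(1.0), adv=1.0))   # 1.2, not 1.5
# Negative advantage: the min keeps the more negative value, so large
# probability drops are still penalized rather than rewarded.
print(ppo_clip_objective(np.log(0.5), np.log(1.0), adv=-1.0))  # -0.8, not -0.5
```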

10.5 Reward modeling and preference optimization

Purpose. Reward modeling and preference optimization form the RLHF bridge to language models: pairwise human preferences are distilled into a scalar reward, which the policy then optimizes under a KL anchor to a reference model.

$$\Pr(y_w\succ y_l\mid x)=\sigma\big(r_\phi(x,y_w)-r_\phi(x,y_l)\big),\qquad \max_\pi\ \mathbb{E}\big[r_\phi(x,y)\big]-\beta\,D_{\mathrm{KL}}\big(\pi\,\|\,\pi_{\mathrm{ref}}\big).$$

Operational definition.

A reward model $r_\phi$ is fit to preference pairs with the Bradley-Terry likelihood on the left, where $\sigma$ is the logistic sigmoid; the policy then maximizes modeled reward minus a KL penalty to $\pi_{\mathrm{ref}}$. DPO folds both steps into a single classification-style loss on the policy's log-ratios.

Worked reading.

Because $r_\phi$ is trustworthy only near its training distribution, unconstrained optimization invites reward hacking; the KL term is the mathematical guardrail that keeps the policy where the reward model generalizes.

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.

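A minimal sketch of both losses on a single preference pair; the reward values and sequence log-probabilities are hypothetical numbers, and in practice each log-probability is a sum over response tokens.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Bradley-Terry reward-model loss on one preference pair (chosen, rejected).
def rm_loss(r_chosen, r_rejected):
    return -np.log(sigmoid(r_chosen - r_rejected))

# DPO folds the reward model into the policy: the implicit reward is the
# beta-scaled log-ratio between policy and reference on each response.
def dpo_loss(logp_pi_c, logp_ref_c, logp_pi_r, logp_ref_r, beta=0.1):
    margin = beta * ((logp_pi_c - logp_ref_c) - (logp_pi_r - logp_ref_r))
    return -np.log(sigmoid(margin))

print(round(float(rm_loss(2.0, 0.5)), 4))
# Hypothetical sequence log-probs: the policy already prefers the chosen response.
print(round(float(dpo_loss(-10.0, -12.0, -15.0, -14.0)), 4))  # -log(sigmoid(0.3))
```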

11. Common Mistakes

| # | Mistake | Why it is wrong | Fix |
|---|---------|-----------------|-----|
| 1 | Confusing rewards with returns | A reward is local; a return is accumulated over time. | Always write $G_t$ before deriving an update target. |
| 2 | Ignoring the data distribution shift | Changing the policy changes which states are visited. | Name whether the data are on-policy or off-policy. |
| 3 | Treating Bellman equations as supervised labels | Bellman targets contain estimates and bootstrapping. | Track target networks, stop-gradient choices, or tabular guarantees. |
| 4 | Using Q-learning for every problem | Continuous action spaces and large state spaces often need different methods. | Choose value-based, policy-gradient, or actor-critic methods from the action and data structure. |
| 5 | Forgetting exploration | A greedy policy may never see better actions. | Use explicit exploration or uncertainty-aware data collection. |
| 6 | Trusting average reward without variance | RL estimates are noisy and seed-sensitive. | Report confidence intervals, seeds, and learning curves. |
| 7 | Mixing offline and online assumptions | Logged data may not cover actions needed by the learned policy. | Check coverage and use conservative offline RL when needed. |
| 8 | Over-optimizing a reward model | The policy may exploit reward model errors. | Use KL control, held-out preference evaluation, and adversarial tests. |
| 9 | Calling PPO a magic stabilizer | PPO still depends on advantage quality, clipping, normalization, and KL monitoring. | Audit ratios, advantages, entropy, KL, and value loss. |
| 10 | Forgetting terminal states | Bootstrapping through terminal states creates false future value. | Mask terminal transitions in TD targets. |

12. Exercises

  1. (*) Solve a Bellman policy-evaluation system for a three-state chain.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  2. (*) Run three value-iteration backups and track the sup-norm change.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  3. (*) Compute a Monte Carlo return and compare it with a TD target.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  4. (**) Apply one SARSA update and one Q-learning update to the same transition.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  5. (**) Compute an epsilon-greedy action distribution.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  6. (**) Estimate a policy-gradient direction for a two-action softmax policy.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  7. (**) Compute generalized advantages from rewards and value estimates.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  8. (***) Evaluate the PPO clipped surrogate for positive and negative advantages.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  9. (***) Fit a Bradley-Terry reward-model probability for preference data.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  10. (***) Compute a DPO-style preference loss and explain the KL-control intuition.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.

13. Why This Matters for AI

| Concept | AI impact |
|---------|-----------|
| MDPs | Provide the mathematical contract for agents, simulators, robotics tasks, games, and dialogue policies. |
| Bellman equations | Turn long-horizon objectives into one-step recursive learning targets. |
| TD learning | Explains bootstrapping, credit assignment, and value-learning targets used in deep RL. |
| Q-learning | Powers value-based control and explains why replay and target networks stabilize DQN-style systems. |
| Policy gradients | Give the gradient estimator behind REINFORCE, actor-critic, PPO, and many RLHF implementations. |
| Advantage estimation | Reduces variance and makes policy updates more sample-efficient. |
| KL regularization | Keeps a learned policy close to a reference model in RLHF and safe fine-tuning (see the sketch below). |
| Reward modeling | Connects human preferences to scalar optimization, while exposing reward hacking risks. |
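The KL regularization row can be made concrete with one common shaping. This sketch uses a per-token log-ratio estimator folded into the terminal reward; the coefficient $\beta$, the per-token versus sequence-level placement, and all numbers are assumptions that vary across implementations:

```python
import numpy as np

# KL-controlled reward shaping, RLHF-style: penalize the reward-model score
# by how far the policy drifts from a frozen reference model.
beta = 0.1                                   # KL coefficient, illustrative
logp_policy = np.array([-1.2, -0.8, -2.0])   # log pi(a_t | s_t) on sampled tokens
logp_ref    = np.array([-1.0, -1.0, -1.5])   # log pi_ref(a_t | s_t), frozen reference
rm_score = 0.7                               # sequence-level reward-model score

# Per-sample log-ratio estimate of KL; nonnegative only in expectation.
kl_terms = logp_policy - logp_ref
shaped_reward = rm_score - beta * kl_terms.sum()
print(kl_terms, shaped_reward)               # [-0.2  0.2 -0.5] 0.75
```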

14. Conceptual Bridge

The backward bridge is probability and Markov chains. A Markov chain has transitions but no actions. An MDP adds actions, rewards, and optimization. Once actions enter the process, the learner must reason about both inference and control.

The forward bridge is alignment and interactive systems. RLHF, DPO, preference models, online experiments, and agentic tool-use loops all reuse RL ideas: reward signals, policies, KL constraints, distribution shift, and evaluation under feedback.

+-------------------+      +-------------------------+      +----------------------+
| Markov chains     | ---> | Markov decision process | ---> | policy optimization  |
| transitions only  |      | actions and rewards     |      | RLHF, agents, games  |
+-------------------+      +-------------------------+      +----------------------+

The most important practical lesson is that RL is not just an optimizer. It is a complete data-generating loop. When a policy changes, the future dataset changes. That is why careful RL work always audits rewards, policies, value estimates, exploration, and evaluation together.

References