Multi-Agent Systems: Part 3 (Stochastic and Markov Games) to Part 4 (Learning Dynamics)
3. Stochastic and Markov Games
Stochastic and Markov Games develops the part of multi-agent systems specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
3.1 State transitions
State transitions belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $P(s' \mid s, a_1, \dots, a_n)$ gives the mathematical handle for state transitions. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A Markov game extends an MDP by replacing one action and one reward with joint actions and agent-specific rewards.
Worked reading.
At state $s$, agents sample a joint action $\mathbf{a} = (a_1, \dots, a_n)$, the environment transitions by $P(s' \mid s, \mathbf{a})$, and each agent $i$ receives $r_i(s, \mathbf{a})$.
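A minimal sketch of this worked reading, using a hypothetical two-agent, two-state Markov game with randomly generated transition and reward tables (the tables and the random policies are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2          # states {0, 1}, per-agent actions {0, 1}
# P[s, a1, a2] is a distribution over next states given the joint action.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
# r[i, s, a1, a2] is agent i's reward for joint action (a1, a2) in state s.
r = rng.normal(size=(2, n_states, n_actions, n_actions))

def step(s, a1, a2):
    """One Markov-game step: joint action in, next state and per-agent rewards out."""
    s_next = rng.choice(n_states, p=P[s, a1, a2])
    return s_next, (r[0, s, a1, a2], r[1, s, a1, a2])

s = 0
for t in range(3):
    a1, a2 = rng.integers(n_actions), rng.integers(n_actions)   # stand-ins for policies
    s, rewards = step(s, a1, a2)
    print(t, (a1, a2), s, rewards)
```

The point of the sketch is that the transition and both rewards are indexed by the joint action, so neither agent controls its own dynamics in isolation.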
Three examples of state transitions:
- Two dialogue agents sharing a tool environment.
- Robot teams with shared state and individual rewards.
- Self-play systems where the opponent policy is part of the transition distribution.
Two non-examples clarify the boundary:
- A single-agent MDP with a fixed environment.
- A static normal-form game with no state.
Proof or verification habit for state transitions:
Bellman-style reasoning still applies, but values are indexed by both agent and joint policy.
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, state transitions are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Multi-agent RL inherits all MDP difficulty and adds strategic nonstationarity.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using state transitions responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether rewards are common, opposed, or mixed; the answer changes the solution concept.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. State transitions give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
3.2 Value functions for agents
Value functions for agents belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $V_i^{\boldsymbol{\pi}}(s) = \mathbb{E}\big[\sum_{t \ge 0} \gamma^{t} r_i(s_t, \mathbf{a}_t) \mid s_0 = s\big]$ gives the mathematical handle for value functions for agents. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A Markov game extends an MDP by replacing one action and one reward with joint actions and agent-specific rewards.
Worked reading.
At state $s$, agents sample a joint action $\mathbf{a} = (a_1, \dots, a_n)$, the environment transitions by $P(s' \mid s, \mathbf{a})$, and each agent $i$ receives $r_i(s, \mathbf{a})$.
Three examples of value functions for agents:
- Two dialogue agents sharing a tool environment.
- Robot teams with shared state and individual rewards.
- Self-play systems where the opponent policy is part of the transition distribution.
Two non-examples clarify the boundary:
- A single-agent MDP with a fixed environment.
- A static normal-form game with no state.
Proof or verification habit for value functions for agents:
Bellman-style reasoning still applies, but values are indexed by both agent and joint policy.
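A minimal sketch of that indexing, using the same kind of hypothetical two-state, two-agent tables as before (the tables, the uniform joint policy, and the discount factor are illustrative assumptions): each agent's value function is computed by sweeping its own Bellman equation under the fixed joint policy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
r = rng.normal(size=(2, n_states, n_actions, n_actions))

# Fixed joint policy: pi[i, s] is agent i's distribution over its own actions in state s.
pi = np.full((2, n_states, n_actions), 1.0 / n_actions)

def evaluate(i, sweeps=500):
    """Bellman evaluation of V_i under the fixed joint policy (pi_1, pi_2)."""
    V = np.zeros(n_states)
    for _ in range(sweeps):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a1 in range(n_actions):
                for a2 in range(n_actions):
                    prob = pi[0, s, a1] * pi[1, s, a2]      # joint action probability
                    V_new[s] += prob * (r[i, s, a1, a2] + gamma * P[s, a1, a2] @ V)
        V = V_new
    return V

print("V_1:", evaluate(0))
print("V_2:", evaluate(1))
```

Changing either agent's policy changes both value functions, which is exactly why a value is meaningless without naming the joint policy it was evaluated under.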
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, value functions for agents are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Multi-agent RL inherits all MDP difficulty and adds strategic nonstationarity.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using value functions for agents responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether rewards are common, opposed, or mixed; the answer changes the solution concept.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Value functions for agents give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
3.3 Nash policies
Nash policies belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $u_i(\pi_i^{*}, \pi_{-i}^{*}) \ge u_i(\pi_i, \pi_{-i}^{*})$ for all $\pi_i$ gives the mathematical handle for Nash policies. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A Nash equilibrium is a profile of strategies where no player can improve by changing its own strategy while all other strategies remain fixed.
Worked reading.
In the prisoner's dilemma payoff convention, mutual defection is a Nash equilibrium even though mutual cooperation is better for both players. This is the central warning: stability and desirability are different properties.
Three examples of Nash policies:
- A self-play policy pair where neither side has a profitable unilateral exploit.
- A GAN fixed point where the generator distribution matches data and the discriminator cannot improve classification.
- A routing market where no model provider benefits from changing only its bid.
Two non-examples clarify the boundary:
- A high-welfare outcome with a profitable unilateral deviation.
- A training checkpoint with low loss but a large best-response exploit.
Proof or verification habit for Nash policies:
The proof is a universal deviation check: for each player $i$, hold $\pi_{-i}^{*}$ fixed and show $u_i(\pi_i^{*}, \pi_{-i}^{*}) \ge u_i(\pi_i, \pi_{-i}^{*})$ for all allowed $\pi_i$.
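A minimal sketch of the deviation check on a standard prisoner's dilemma payoff matrix (the specific payoff numbers follow a common textbook convention and are assumed here, not taken from the text):

```python
import numpy as np

# Row/column payoffs; action 0 = cooperate, action 1 = defect.
A = np.array([[-1., -3.],    # row player's payoffs
              [ 0., -2.]])
B = A.T                       # symmetric game: column player's payoffs

def is_pure_nash(a_row, a_col):
    """Universal deviation check: neither player gains by switching alone."""
    row_ok = A[a_row, a_col] >= A[:, a_col].max()
    col_ok = B[a_row, a_col] >= B[a_row, :].max()
    return row_ok and col_ok

print("(defect, defect) Nash?", is_pure_nash(1, 1))        # True: stable
print("(cooperate, cooperate) Nash?", is_pure_nash(0, 0))  # False: profitable deviation
```

The check confirms the worked reading above: mutual defection passes the deviation test while the higher-welfare mutual cooperation does not.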
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, Nash policies are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
For AI agents, Nash is a stability diagnostic. It does not guarantee safety, alignment, fairness, or global efficiency unless those objectives are encoded in the game.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using Nash policies responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask: if one deployed model, user, or attacker changed behavior alone, would it gain?
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Nash policies give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
3.4 Cooperative team games
Cooperative team games belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The payoff relation gives the mathematical handle for cooperative team games: $r_1 = r_2 = \dots = r_n$ in the common-payoff case, $\sum_i r_i = 0$ in the zero-sum case. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Cooperation and competition are payoff-structure choices. Cooperative games align rewards; competitive games put rewards in conflict; mixed-motive games do both.
Worked reading.
A common-payoff team has $r_1 = r_2 = \dots = r_n$. A zero-sum game has $\sum_i r_i = 0$. Most deployed multi-agent systems sit between these extremes.
Three examples of cooperative team games:
- A team of tool agents sharing a task reward.
- A self-play opponent trained to expose weaknesses.
- A market of model providers competing while still serving user welfare.
Two non-examples clarify the boundary:
- Calling agents cooperative because they are in the same codebase.
- Calling a game competitive because agents are different processes.
Proof or verification habit for cooperative team games:
Classify the payoff relation before selecting an equilibrium or learning method.
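A minimal sketch of that classification for two-player matrix games (the helper name, tolerance, and example matrices are illustrative assumptions):

```python
import numpy as np

def classify(R1, R2, tol=1e-9):
    """Label the payoff relation before choosing a solution concept."""
    if np.allclose(R1, R2, atol=tol):
        return "common-payoff (cooperative team)"
    if np.allclose(R1 + R2, 0.0, atol=tol):
        return "zero-sum (strictly competitive)"
    return "general-sum (mixed motive)"

team = np.array([[1., 0.], [0., 1.]])
print(classify(team, team))          # common-payoff coordination game
mp = np.array([[1., -1.], [-1., 1.]])
print(classify(mp, -mp))             # zero-sum (matching pennies)
pd = np.array([[-1., -3.], [0., -2.]])
print(classify(pd, pd.T))            # general-sum (prisoner's dilemma)
```

The classification is a property of the reward vectors, not of the code organization or deployment setup, which is exactly the boundary drawn by the non-examples above.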
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, cooperative team games are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
The same algorithm can look safe or unsafe depending on whether rewards create cooperation, competition, or collusion.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using cooperative team games responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Write the reward vector, not just the environment reward.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Cooperative team games give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
3.5 Partially observed settings preview
Partially observed settings preview belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $\pi_i(a_i \mid o_i)$ gives the mathematical handle for partially observed settings. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Partial observability means agents condition decisions on observations rather than the full state.
Worked reading.
A policy becomes $\pi_i(a_i \mid o_i)$ instead of $\pi_i(a_i \mid s)$, and beliefs or histories may be needed to act well.
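A minimal sketch of acting from observations rather than states: a hypothetical noisy observation channel over a hidden binary state, a Bayes belief update, and a policy that conditions on the belief (the probabilities, threshold, and action names are made-up for illustration):

```python
import numpy as np

# Hidden state: 0 = benign user, 1 = adversarial user.
O = np.array([[0.8, 0.2],             # O[s, o] = P(observation o | hidden state s)
              [0.3, 0.7]])

def update_belief(belief, obs):
    """Bayes filter: condition the belief over hidden states on a new observation."""
    posterior = belief * O[:, obs]
    return posterior / posterior.sum()

def policy(belief):
    """Decision made from the belief, because the true state is never observed."""
    return "escalate" if belief[1] > 0.5 else "answer"

belief = np.array([0.5, 0.5])         # uniform prior over the hidden state
for obs in [1, 1, 0]:                 # a short stream of observations
    belief = update_belief(belief, obs)
    print(obs, belief.round(3), policy(belief))
```

The information structure (what each agent can condition on at decision time) is specified before the policy, matching the verification habit below.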
Three examples of partially observed settings preview:
- Agents with different tool logs.
- A defender that sees alerts but not the attacker's full plan.
- A dialogue agent that sees conversation text but not hidden user intent.
Two non-examples clarify the boundary:
- A fully observed Markov game.
- A static matrix game.
Proof or verification habit for partially observed settings preview:
The proof habit is to specify observation functions and information sets before writing policies.
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, partially observed settings preview is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many AI security and coordination problems are partially observed because intent, hidden prompts, and private tools are not directly visible.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using partially observed settings preview responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: State what each agent observes at decision time.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Partially observed settings preview gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
4. Learning Dynamics
Learning Dynamics develops the part of multi-agent systems specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
4.1 Independent learners
Independent learners belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $Q_i(s, a_i) \leftarrow Q_i(s, a_i) + \alpha \big[r_i + \gamma \max_{a_i'} Q_i(s', a_i') - Q_i(s, a_i)\big]$, applied by each agent as if the others were part of the environment, gives the mathematical handle for independent learners. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Learning dynamics study how strategies move over time, not just where equilibrium points are located.
Worked reading.
In fictitious play, each player tracks empirical frequencies of the opponent's past actions and best-responds to those beliefs.
Three examples of independent learners:
- Rock-paper-scissors empirical play approaching the mixed region.
- Independent Q-learners chasing each other's changing policies.
- GAN gradients rotating around a saddle-like point.
Two non-examples clarify the boundary:
- A static equilibrium certificate.
- A supervised learner trained against an immutable dataset.
Proof or verification habit for independent learners:
Analyze updates as a dynamical system: fixed points, cycles, regret, and exploitability are different diagnostics.
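A minimal sketch of that dynamical-system view: two independent learners on matching pennies, each updating its own action values as if the other were a fixed environment (the learning rate, softmax temperature, and horizon are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1., -1.], [-1., 1.]])   # row payoff; column payoff is -A (matching pennies)
alpha, temp, T = 0.1, 0.2, 5000

def softmax(q):
    z = np.exp(q / temp)
    return z / z.sum()

Q = np.zeros((2, 2))                    # Q[i, a]: agent i's value estimate for action a
trajectory = []
for t in range(T):
    p_row, p_col = softmax(Q[0]), softmax(Q[1])
    a_row = rng.choice(2, p=p_row)
    a_col = rng.choice(2, p=p_col)
    r_row = A[a_row, a_col]
    # Each agent updates independently, ignoring that the other is also learning.
    Q[0, a_row] += alpha * (r_row - Q[0, a_row])
    Q[1, a_col] += alpha * (-r_row - Q[1, a_col])
    trajectory.append(p_row[0])

# One snapshot is not evidence of convergence; inspect the whole trajectory.
print("P(row plays heads), early vs late:",
      round(np.mean(trajectory[:500]), 3), round(np.mean(trajectory[-500:]), 3))
```

Plotting `trajectory` rather than reading a single checkpoint is the point of the local diagnostic below: the policies keep moving because each learner's target moves with the other learner.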
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, independent learning is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many AI failures are dynamic failures: the target moves while the learner is trying to fit it.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using independent learners responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Plot trajectories or regret; do not infer convergence from one snapshot.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Independent learners give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
4.2 Fictitious play
Fictitious play belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $\pi_i^{t+1} \in \mathrm{BR}_i(\hat{\pi}_{-i}^{\,t})$, where $\hat{\pi}_{-i}^{\,t}$ is the empirical distribution of the opponents' past actions, gives the mathematical handle for fictitious play. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Learning dynamics study how strategies move over time, not just where equilibrium points are located.
Worked reading.
In fictitious play, each player tracks empirical frequencies of the opponent's past actions and best-responds to those beliefs.
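A minimal sketch of fictitious play on rock-paper-scissors (the payoff convention, count smoothing, and horizon are illustrative assumptions); the empirical frequencies, not necessarily the per-round actions, drift toward the uniform mixed strategy:

```python
import numpy as np

# Row player's payoff for rock-paper-scissors; the game is zero-sum.
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

counts = [np.ones(3), np.ones(3)]      # smoothed counts of each player's past actions

for t in range(20000):
    beliefs = [c / c.sum() for c in counts]
    # Each player best-responds to the empirical frequency of the opponent's play.
    a_row = int(np.argmax(A @ beliefs[1]))
    a_col = int(np.argmax(-(A.T) @ beliefs[0]))
    counts[0][a_row] += 1
    counts[1][a_col] += 1

print("row empirical mix:", (counts[0] / counts[0].sum()).round(3))
print("col empirical mix:", (counts[1] / counts[1].sum()).round(3))
```

This matches the first example below: the per-round play keeps cycling among pure actions, while the empirical averages approach the mixed region.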
Three examples of fictitious play:
- Rock-paper-scissors empirical play approaching the mixed region.
- Independent Q-learners chasing each other's changing policies.
- GAN gradients rotating around a saddle-like point.
Two non-examples clarify the boundary:
- A static equilibrium certificate.
- A supervised learner trained against an immutable dataset.
Proof or verification habit for fictitious play:
Analyze updates as a dynamical system: fixed points, cycles, regret, and exploitability are different diagnostics.
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, fictitious play is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many AI failures are dynamic failures: the target moves while the learner is trying to fit it.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using fictitious play responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Plot trajectories or regret; do not infer convergence from one snapshot.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Fictitious play gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
4.3 No-regret learning
No-regret learning belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $R_i^{T} = \max_{a} \sum_{t=1}^{T} u_i(a, a_{-i}^{t}) - \sum_{t=1}^{T} u_i(a_i^{t}, a_{-i}^{t})$ gives the mathematical handle for no-regret learning. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
No-regret learning turns repeated play into approximate equilibrium guarantees by making average regret small.
Worked reading.
If both players in a zero-sum game have average regret at most $\epsilon$, the average strategies are $2\epsilon$-approximate minimax strategies.
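A minimal sketch of the guarantee: both players run multiplicative weights on a small zero-sum matrix and the exploitability of the averaged strategies shrinks (the matrix, step size, and horizon are illustrative assumptions):

```python
import numpy as np

A = np.array([[ 0., -1.,  1.],     # row player's payoffs (zero-sum: column gets -A)
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])
eta, T = 0.05, 20000

def mw_update(w, payoff_vector):
    """Multiplicative-weights step on one player's mixed strategy."""
    w = w * np.exp(eta * payoff_vector)
    return w / w.sum()

x = np.ones(3) / 3                  # row mixed strategy
y = np.ones(3) / 3                  # column mixed strategy
x_avg, y_avg = np.zeros(3), np.zeros(3)

for t in range(T):
    x_new = mw_update(x, A @ y)         # row payoff of each pure action vs current y
    y_new = mw_update(y, -(A.T @ x))    # column payoff of each pure action vs current x
    x, y = x_new, y_new
    x_avg += x
    y_avg += y

x_avg, y_avg = x_avg / T, y_avg / T
# Exploitability: how much a best-responding opponent could gain against the averages.
exploitability = (A @ y_avg).max() - (x_avg @ A).min()
print("averaged strategies:", x_avg.round(3), y_avg.round(3))
print("exploitability of averages:", round(exploitability, 4))
```

Note that the guarantee is about the averaged strategies; the last iterates can keep moving, which is why the local diagnostic below asks which iterate a claim refers to.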
Three examples of no-regret learning:
- Multiplicative weights for action probabilities.
- Self-play policies averaged over training.
- Exploitability curves used to track poker or board-game agents.
Two non-examples clarify the boundary:
- A decreasing supervised loss curve with no opponent model.
- A single final policy checkpoint without averaging or regret accounting.
Proof or verification habit for no-regret learning:
The proof decomposes the average payoff gap into the row player's regret plus the column player's regret.
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, no-regret learning is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
This is why practical game-playing systems track exploitability and regret-like quantities instead of only reward.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using no-regret learning responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Report whether the guarantee applies to last iterate, averaged iterate, or best checkpoint.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. No-regret learning gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
4.4 Policy gradients in games
Policy gradients in games belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $\theta_i \leftarrow \theta_i + \eta \, \nabla_{\theta_i} J_i(\theta_1, \dots, \theta_n)$, applied simultaneously by every agent, gives the mathematical handle for policy gradients in games. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Learning dynamics study how strategies move over time, not just where equilibrium points are located.
Worked reading.
In fictitious play, each player tracks empirical frequencies of the opponent's past actions and best-responds to those beliefs.
Three examples of policy gradients in games:
- Rock-paper-scissors empirical play approaching the mixed region.
- Independent Q-learners chasing each other's changing policies.
- GAN gradients rotating around a saddle-like point.
Two non-examples clarify the boundary:
- A static equilibrium certificate.
- A supervised learner trained against an immutable dataset.
Proof or verification habit for policy gradients in games:
Analyze updates as a dynamical system: fixed points, cycles, regret, and exploitability are different diagnostics.
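A minimal sketch of that analysis on the classic toy for rotating gradients: simultaneous descent/ascent on the bilinear saddle $\min_x \max_y \, xy$ does not settle at the equilibrium $(0, 0)$ but spirals outward (the step size and horizon are arbitrary choices):

```python
import numpy as np

eta = 0.1
x, y = 1.0, 1.0                         # min player x, max player y; equilibrium is (0, 0)
radii = []
for t in range(200):
    gx, gy = y, x                       # d(xy)/dx = y, d(xy)/dy = x
    x, y = x - eta * gx, y + eta * gy   # simultaneous gradient descent/ascent step
    radii.append(np.hypot(x, y))

print("distance from equilibrium at steps 1, 100, 200:",
      round(radii[0], 3), round(radii[99], 3), round(radii[-1], 3))
```

The fixed point exists, yet the simultaneous-gradient dynamics never reach it; this is the cycling behavior referenced in the GAN example above and the reason convergence claims need trajectory-level evidence.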
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, policy gradients in games are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many AI failures are dynamic failures: the target moves while the learner is trying to fit it.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using policy gradients in games responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Plot trajectories or regret; do not infer convergence from one snapshot.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Policy gradients in games give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
4.5 Equilibrium selection
Equilibrium selection belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The best-response condition $\pi_i^{*} \in \mathrm{BR}_i(\pi_{-i}^{*})$ can hold for several distinct strategy profiles at once; that multiplicity gives the mathematical handle for equilibrium selection. In game theory, this condition should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Equilibrium selection asks which equilibrium appears when several are mathematically possible.
Worked reading.
In a coordination game with two stable conventions, initialization, communication, history, or payoff-dominance can determine the selected convention.
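A minimal sketch of selection in a 2x2 common-payoff coordination game with two stable conventions; which one best-response dynamics reach depends only on the starting mixture (the payoff numbers and initial points are illustrative assumptions):

```python
import numpy as np

# Common-payoff coordination game: both prefer to match; convention B pays more.
A = np.array([[1., 0.],
              [0., 2.]])   # actions: 0 = convention A, 1 = convention B

def best_response_dynamics(p, q, steps=50):
    """p, q = probability that each player currently plays convention B."""
    for _ in range(steps):
        # Expected payoff of each pure action against the other player's mixture.
        p_new = 1.0 if A[1, 1] * q > A[0, 0] * (1 - q) else 0.0
        q_new = 1.0 if A[1, 1] * p > A[0, 0] * (1 - p) else 0.0
        p, q = p_new, q_new
    return p, q

for p0, q0 in [(0.1, 0.1), (0.6, 0.6)]:
    selected = "B" if best_response_dynamics(p0, q0) == (1.0, 1.0) else "A"
    print("start", (p0, q0), "-> selected convention:", selected)
```

Both conventions pass the deviation check, so the deciding factor is the basin of attraction the initialization falls into, which is exactly the selection mechanism the checklist asks you to make explicit.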
Three examples of equilibrium selection:
- Two agents converging to the same API schema.
- Self-play selecting one opening strategy among many stable ones.
- A market standard emerging from repeated routing choices.
Two non-examples clarify the boundary:
- The proof that at least one equilibrium exists.
- A claim that all equilibria are equally safe.
Proof or verification habit for equilibrium selection:
First verify the candidate equilibria, then study the basins of attraction or the selection criterion.
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, equilibrium selection is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
In AI systems, multiple stable behaviors can differ sharply in safety and usefulness.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using equilibrium selection responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Report which equilibrium is selected and why that selection mechanism is credible.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Equilibrium selection gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |