Multi-Agent Systems
"When many learners share an environment, every policy becomes part of someone else's data distribution."
Overview
Multi-agent systems study strategic learning when multiple agents act, adapt, communicate, and optimize in a shared environment.
Game theory is the part of the curriculum that studies adaptive decision makers. It asks what happens when each model, user, attacker, defender, or agent optimizes while anticipating the choices of others.
This section is written in LaTeX Markdown. Inline mathematics uses $...$, and display equations use $$...$$. The notes emphasize strategy, payoff, best response, equilibrium, exploitability, and adversarial adaptation.
Prerequisites
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Executable demonstrations for multi-agent systems |
| exercises.ipynb | Graded practice for multi-agent systems |
Learning Objectives
After completing this section, you will be able to:
- Define Markov games using states, joint actions, transition kernels, rewards, and discounting
- Compute simple joint-action transitions and agent-specific value functions
- Explain why independent learning creates nonstationary data for every other learner
- Relate Nash policies to equilibrium concepts in stochastic games
- Simulate fictitious play and interpret empirical strategy trajectories
- Compare cooperative, competitive, and mixed-motive multi-agent settings
- Analyze communication, conventions, and credit assignment in team games
- Connect multi-agent learning dynamics to self-play and LLM-agent orchestration
- Use welfare and fairness criteria without confusing them with equilibrium
- Identify when partial observability changes the mathematical model
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Stochastic and Markov Games
- 4. Learning Dynamics
- 5. Coordination and Communication
- 6. AI Applications
- 7. Common Mistakes
- 8. Exercises
- 9. Why This Matters for AI
- 10. Conceptual Bridge
- References
1. Intuition
Intuition develops the part of multi-agent systems specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
1.1 many learners sharing one environment
Many learners sharing one environment belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The best-response condition $\pi_i^* \in \arg\max_{\pi_i} V_i(\pi_i, \pi_{-i})$ gives the mathematical handle for many learners sharing one environment. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Many-agent learning means every learner's policy is part of the environment seen by the others.
Worked reading.
If agent 1 changes its policy, agent 2's data distribution changes even when the physical simulator is unchanged. That is the core nonstationarity of multi-agent learning.
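The sketch below makes this concrete with a synthetic $2 \times 2$ payoff table (the numbers and policies are assumptions for illustration, not the notebook's exact setup): agent 2's empirical estimate of its own action values shifts when agent 1 changes policy, even though the payoff table never changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Agent 2's payoff table (a matching game): rows = agent 1's action, cols = agent 2's action.
payoff_2 = np.array([[1.0, 0.0],
                     [0.0, 1.0]])

def estimated_action_values(pi_1, pi_2, n=10_000):
    """Sample joint play and return agent 2's empirical value for each of its own actions."""
    a1 = rng.choice(2, size=n, p=pi_1)
    a2 = rng.choice(2, size=n, p=pi_2)
    rewards = payoff_2[a1, a2]
    return np.array([rewards[a2 == k].mean() for k in range(2)])

pi_2 = np.array([0.5, 0.5])                                  # agent 2 keeps its policy fixed
print(estimated_action_values(np.array([0.9, 0.1]), pi_2))   # agent 1 mostly plays action 0
print(estimated_action_values(np.array([0.1, 0.9]), pi_2))   # agent 1 switches; agent 2's data shifts
```

Nothing in agent 2's code changed between the two calls; only the other learner's policy did, and that is enough to flip which of agent 2's actions looks best.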
Three examples of many learners sharing one environment:
- Self-play agents improving by training against earlier or current versions.
- LLM tool agents changing each other's context and options.
- A routing marketplace where traffic shifts after one provider changes quality.
Two non-examples clarify the boundary:
- A single-agent RL problem with a fixed transition kernel.
- Batch supervised learning on immutable labels.
Proof or verification habit for many learners sharing one environment:
Analyze the joint policy trajectory, not only individual losses.
single-agent optimization: choose theta to minimize L(theta)
game-theoretic optimization: choose pi_i while others choose pi_-i
adversarial objective: choose defense against best attack
multi-agent learning: policies change the environment itself
In AI systems, many learners sharing one environment is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Agentic LLM systems make multi-agent math practical: prompts, tools, memory, and policies interact in a shared state.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using many learners sharing one environment responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask which part of another agent's behavior enters this agent's observation or reward.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Many learners sharing one environment gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
1.2 nonstationarity
Nonstationarity belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The time-indexed objective $V_i^{(t)}(\pi_i) = V_i(\pi_i, \pi_{-i}^{(t)})$ gives the mathematical handle for nonstationarity: agent $i$'s objective shifts whenever the other policies $\pi_{-i}^{(t)}$ are updated. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Learning dynamics study how strategies move over time, not just where equilibrium points are located.
Worked reading.
In fictitious play, each player tracks empirical frequencies of the opponent's past actions and best-responds to those beliefs.
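A minimal fictitious-play simulation illustrates this update rule; the rock-paper-scissors payoff matrix, smoothing constant, and iteration count below are assumptions for illustration, not the notebook's exact setup.

```python
import numpy as np

# Rock-paper-scissors payoff for player 1; player 2's payoff is the negative (zero-sum).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

counts = [np.ones(3), np.ones(3)]   # smoothed action counts for players 1 and 2

for t in range(5_000):
    beliefs = [c / c.sum() for c in counts]
    a1 = int(np.argmax(A @ beliefs[1]))       # best response to belief about player 2
    a2 = int(np.argmax(-(A.T) @ beliefs[0]))  # player 2 maximizes -A^T against belief about player 1
    counts[0][a1] += 1
    counts[1][a2] += 1

print("empirical strategies:", counts[0] / counts[0].sum(), counts[1] / counts[1].sum())
# In this zero-sum game, the empirical frequencies drift toward (1/3, 1/3, 1/3)
# even though the instantaneous best responses keep cycling.
```

The snapshot best responses never converge; it is the empirical frequencies, the object fictitious play tracks, that approach the mixed equilibrium.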
Three examples of nonstationarity:
- Rock-paper-scissors empirical play approaching the mixed region.
- Independent Q-learners chasing each other's changing policies.
- GAN gradients rotating around a saddle-like point.
Two non-examples clarify the boundary:
- A static equilibrium certificate.
- A supervised learner trained against an immutable dataset.
Proof or verification habit for nonstationarity:
Analyze updates as a dynamical system: fixed points, cycles, regret, and exploitability are different diagnostics.
single-agent optimization: choose theta to minimize L(theta)
game-theoretic optimization: choose pi_i while others choose pi_-i
adversarial objective: choose defense against best attack
multi-agent learning: policies change the environment itself
In AI systems, nonstationarity is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many AI failures are dynamic failures: the target moves while the learner is trying to fit it.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using nonstationarity responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Plot trajectories or regret; do not infer convergence from one snapshot.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Nonstationarity gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
1.3 cooperation vs competition
Cooperation vs competition belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The relation among the reward functions $r_1, \dots, r_n$ (identical, summing to zero, or neither) gives the mathematical handle for cooperation vs competition. In game theory, this relation should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Cooperation and competition are payoff-structure choices. Cooperative games align rewards; competitive games put rewards in conflict; mixed-motive games do both.
Worked reading.
A common-payoff team has $r_1 = r_2 = \cdots = r_n$. A zero-sum game has $\sum_i r_i = 0$. Most deployed multi-agent systems sit between these extremes.
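A short helper makes the classification mechanical; the payoff tables below are illustrative assumptions, not data from any deployed system.

```python
import numpy as np

def classify(r1, r2, tol=1e-9):
    """Classify a two-player matrix game by the relation between the two reward tables."""
    r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
    if np.allclose(r1, r2, atol=tol):
        return "common-payoff (cooperative)"
    if np.allclose(r1 + r2, 0.0, atol=tol):
        return "zero-sum (strictly competitive)"
    return "general-sum (mixed motive)"

team      = [[1, 0], [0, 1]]
contest   = [[0, -1], [1, 0]]
dilemma_1 = [[3, 0], [5, 1]]   # prisoner's-dilemma style payoffs for player 1
dilemma_2 = [[3, 5], [0, 1]]   # and for player 2

print(classify(team, team))                          # common-payoff
print(classify(contest, -np.asarray(contest)))       # zero-sum
print(classify(dilemma_1, dilemma_2))                # general-sum
```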
Three examples of cooperation vs competition:
- A team of tool agents sharing a task reward.
- A self-play opponent trained to expose weaknesses.
- A market of model providers competing while still serving user welfare.
Two non-examples clarify the boundary:
- Calling agents cooperative because they are in the same codebase.
- Calling a game competitive because agents are different processes.
Proof or verification habit for cooperation vs competition:
Classify the payoff relation before selecting an equilibrium or learning method.
single-agent optimization: choose theta to minimize L(theta)
game-theoretic optimization: choose pi_i while others choose pi_-i
adversarial objective: choose defense against best attack
multi-agent learning: policies change the environment itself
In AI systems, cooperation vs competition is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
The same algorithm can look safe or unsafe depending on whether rewards create cooperation, competition, or collusion.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using cooperation vs competition responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Write the reward vector, not just the environment reward.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Cooperation vs competition gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
1.4 communication and coordination
Communication and coordination belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
A coordination payoff structure, in which matched actions earn strictly more than mismatched ones for every player, gives the mathematical handle for communication and coordination. In game theory, this structure should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Coordination games contain multiple stable outcomes, so the mathematical problem is not only existence but selection.
Worked reading.
If two agents both prefer choosing the same protocol, both matched profiles $(A, A)$ and $(B, B)$ can be equilibria. Which one appears may depend on initialization, communication, history, or focal points.
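The deviation check for such a coordination game is easy to automate; the payoff numbers below are assumed for illustration.

```python
import numpy as np

# Both agents prefer to match on protocol A (index 0) or protocol B (index 1).
r1 = np.array([[2, 0],
               [0, 1]])
r2 = np.array([[2, 0],
               [0, 1]])

def pure_nash(r1, r2):
    """Return all pure joint actions from which neither player gains by deviating alone."""
    eqs = []
    for i in range(r1.shape[0]):
        for j in range(r1.shape[1]):
            if r1[i, j] >= r1[:, j].max() and r2[i, j] >= r2[i, :].max():
                eqs.append((i, j))
    return eqs

print(pure_nash(r1, r2))   # [(0, 0), (1, 1)]: both matched profiles are stable
```

Both matched profiles pass the check, which is exactly why selection, not existence, is the hard question here.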
Three examples of communication and coordination:
- LLM agents agree on a tool-call schema.
- Distributed learners converge to a shared convention for labels.
- A team of models selects the same plan representation before acting.
Two non-examples clarify the boundary:
- A zero-sum contest where one player's gain is the other's loss.
- A single model choosing a format without another agent needing to match it.
Proof or verification habit for communication and coordination:
Equilibrium verification is easy; equilibrium selection is the hard part. Show each matched profile is stable, then analyze basins, signals, or welfare.
single-agent optimization: choose theta to minimize L(theta)
game-theoretic optimization: choose pi_i while others choose pi_-i
adversarial objective: choose defense against best attack
multi-agent learning: policies change the environment itself
In AI systems, communication and coordination are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Coordination failures are common in agentic systems because technically correct local policies can still fail to align interfaces.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using communication and coordination responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether agents need the same convention, and whether the convention is observable before action.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Communication and coordination give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
1.5 emergent behavior in AI systems
Emergent behavior in AI systems belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The joint update map $\pi^{(t+1)} = F(\pi^{(t)})$ gives the mathematical handle for emergent behavior in AI systems: system-level behavior is a property of the coupled map, not of any single agent's rule. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Many-agent learning means every learner's policy is part of the environment seen by the others.
Worked reading.
If agent 1 changes its policy, agent 2's data distribution changes even when the physical simulator is unchanged. That is the core nonstationarity of multi-agent learning.
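A small best-response iteration shows one emergent pattern that neither agent's individual rule contains: a persistent cycle. The matching-pennies payoffs below are an assumed example.

```python
import numpy as np

A = np.array([[ 1, -1],
              [-1,  1]], dtype=float)   # player 1's payoff; player 2 receives -A

a1, a2 = 0, 0
trajectory = []
for t in range(8):
    a1 = int(np.argmax(A[:, a2]))    # player 1 best-responds to player 2's last action
    a2 = int(np.argmax(-A[a1, :]))   # player 2 best-responds to player 1's new action
    trajectory.append((a1, a2))

print(trajectory)   # the joint action cycles between (0, 1) and (1, 0); no pure profile is stable
```

Each agent follows a simple, sensible rule; the cycling is a property of the coupled system.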
Three examples of emergent behavior in AI systems:
- Self-play agents improving by training against earlier or current versions.
- LLM tool agents changing each other's context and options.
- A routing marketplace where traffic shifts after one provider changes quality.
Two non-examples clarify the boundary:
- A single-agent RL problem with a fixed transition kernel.
- Batch supervised learning on immutable labels.
Proof or verification habit for emergent behavior in AI systems:
Analyze the joint policy trajectory, not only individual losses.
single-agent optimization: choose theta to minimize L(theta)
game-theoretic optimization: choose pi_i while others choose pi_-i
adversarial objective: choose defense against best attack
multi-agent learning: policies change the environment itself
Emergent behavior matters in AI systems because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Agentic LLM systems make multi-agent math practical: prompts, tools, memory, and policies interact in a shared state.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using emergent behavior in AI systems responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask which part of another agent's behavior enters this agent's observation or reward.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Emergent behavior in AI systems gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
2. Formal Definitions
Formal Definitions develops the part of multi-agent systems specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
2.1 agent set
Agent set belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The game interface $(\mathcal{N}, \{A_i\}_{i \in \mathcal{N}}, \{u_i\}_{i \in \mathcal{N}})$, built around the agent set $\mathcal{N} = \{1, \dots, n\}$, gives the mathematical handle for this subsection. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Players, actions, and payoffs define the interface of a game. If any one of them is vague, the equilibrium claim is usually vague too.
Worked reading.
A payoff matrix is a compact table: rows are one player's actions, columns are another player's actions, and entries are utilities or losses induced by the joint action.
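One way to keep that reading concrete is to store the table as arrays and look up payoffs from a joint action. The player roles, action names, and numbers below are assumptions for illustration.

```python
import numpy as np

actions_row = ["randomize_defense", "static_defense"]   # row player: a defender
actions_col = ["prompt_injection", "timing_probe"]      # column player: an attacker

u_defender = np.array([[ 0.8,  0.6],
                       [ 0.2,  0.7]])
u_attacker = np.array([[-0.8, -0.6],
                       [-0.2, -0.7]])

def payoffs(joint_action):
    """Map a named joint action to each player's payoff."""
    i = actions_row.index(joint_action[0])
    j = actions_col.index(joint_action[1])
    return {"defender": u_defender[i, j], "attacker": u_attacker[i, j]}

print(payoffs(("static_defense", "prompt_injection")))   # {'defender': 0.2, 'attacker': -0.2}
```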
Three examples of specifying the agent set and its interface:
- A row action chooses a defense, while a column action chooses an attack family.
- An agent set lists every model or tool-using process that can affect reward.
- A utility function converts accuracy, safety, latency, and cost into strategic incentives.
Two non-examples clarify the boundary:
- A metric with no actor who optimizes it.
- An action that is impossible in deployment but included for convenience.
Proof or verification habit for the agent set:
Before proving anything, audit the model specification: every allowed action must map to a payoff for every player.
single-agent optimization: choose theta to minimize L(theta)
game-theoretic optimization: choose pi_i while others choose pi_-i
adversarial objective: choose defense against best attack
multi-agent learning: policies change the environment itself
In AI systems, the agent set is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Payoff design is AI system design. The game will faithfully optimize the incentives it is given, including bad incentives.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using the agent set responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Can you name each player, enumerate or parameterize its actions, and compute its payoff from a joint action?
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. The agent set gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
2.2 joint action
Joint action belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The joint action $a = (a_1, \dots, a_n) \in A_1 \times \cdots \times A_n$ gives the mathematical handle for this subsection. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A Markov game extends an MDP by replacing one action and one reward with joint actions and agent-specific rewards.
Worked reading.
At state $s_t$, agents sample a joint action $a_t = (a_t^1, \dots, a_t^n)$, the environment transitions according to $P(s_{t+1} \mid s_t, a_t)$, and each agent $i$ receives $r_i(s_t, a_t)$.
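A minimal sketch of one such step, with a synthetic kernel and reward tensor standing in for a real environment (all numbers are assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_a1, n_a2 = 3, 2, 2

# P[s, a1, a2] is a distribution over next states; R[i, s, a1, a2] is agent i's reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_a1, n_a2))
R = rng.normal(size=(2, n_states, n_a1, n_a2))

def step(s, a1, a2):
    """One Markov-game step: sample the next state and return the reward vector."""
    s_next = rng.choice(n_states, p=P[s, a1, a2])
    return s_next, np.array([R[0, s, a1, a2], R[1, s, a1, a2]])

s = 0
s, rewards = step(s, a1=1, a2=0)
print("next state:", s, "reward vector:", rewards)
```

Note that both the transition and the rewards are indexed by the joint action, not by either agent's action alone.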
Three examples of joint actions in practice:
- Two dialogue agents sharing a tool environment.
- Robot teams with shared state and individual rewards.
- Self-play systems where the opponent policy is part of the transition distribution.
Two non-examples clarify the boundary:
- A single-agent MDP with a fixed environment.
- A static normal-form game with no state.
Proof or verification habit for joint actions:
Bellman-style reasoning still applies, but values are indexed by both agent and joint policy.
single-agent optimization: choose theta to minimize L(theta)
game-theoretic optimization: choose pi_i while others choose pi_-i
adversarial objective: choose defense against best attack
multi-agent learning: policies change the environment itself
In AI systems, the joint action is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Multi-agent RL inherits all MDP difficulty and adds strategic nonstationarity.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using joint actions responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether rewards are common, opposed, or mixed; the answer changes the solution concept.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. The joint action gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
2.3 reward vector
Reward vector belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The reward vector $r(s, a) = (r_1(s, a), \dots, r_n(s, a))$ gives the mathematical handle for agent-specific payoffs. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A Markov game extends an MDP by replacing one action and one reward with joint actions and agent-specific rewards.
Worked reading.
At state $s_t$, agents sample a joint action $a_t = (a_t^1, \dots, a_t^n)$, the environment transitions according to $P(s_{t+1} \mid s_t, a_t)$, and each agent $i$ receives $r_i(s_t, a_t)$.
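The same step can be read through the reward vector alone. The toy reward tensor below is assumed; the helper checks whether rewards are common, opposed, or mixed, which is the local diagnostic at the end of this subsection.

```python
import numpy as np

R = np.zeros((2, 2, 2, 2))                  # R[i, s, a1, a2] for agents i in {0, 1}
R[0] = np.array([[[1, 0], [0, 1]],          # agent 0's rewards in states 0 and 1
                 [[2, 0], [0, 2]]])
R[1] = -R[0]                                # agent 1's rewards exactly oppose agent 0's

def payoff_relation(R, tol=1e-9):
    """Classify a reward tensor as common, opposed (zero-sum), or mixed."""
    if np.allclose(R[0], R[1], atol=tol):
        return "common"
    if np.allclose(R.sum(axis=0), 0.0, atol=tol):
        return "opposed (zero-sum)"
    return "mixed"

print(payoff_relation(R))                    # opposed (zero-sum)
print("reward vector at (s=0, a=(0, 0)):", R[:, 0, 0, 0])
```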
Three examples of reward vectors in practice:
- Two dialogue agents sharing a tool environment.
- Robot teams with shared state and individual rewards.
- Self-play systems where the opponent policy is part of the transition distribution.
Two non-examples clarify the boundary:
- A single-agent MDP with a fixed environment.
- A static normal-form game with no state.
Proof or verification habit for reward vectors:
Bellman-style reasoning still applies, but values are indexed by both agent and joint policy.
single-agent optimization: choose theta to minimize L(theta)
game-theoretic optimization: choose pi_i while others choose pi_-i
adversarial objective: choose defense against best attack
multi-agent learning: policies change the environment itself
In AI systems, the reward vector is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Multi-agent RL inherits all MDP difficulty and adds strategic nonstationarity.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using reward vectors responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether rewards are common, opposed, or mixed; the answer changes the solution concept.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. The reward vector gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
2.4 Markov game
Markov game belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The tuple $(\mathcal{S}, \{A_i\}_{i=1}^n, P, \{r_i\}_{i=1}^n, \gamma)$ gives the mathematical handle for the Markov game: states, per-agent action sets, a transition kernel over joint actions, agent-specific rewards, and a discount factor. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A Markov game extends an MDP by replacing one action and one reward with joint actions and agent-specific rewards.
Worked reading.
At state $s_t$, agents sample a joint action $a_t = (a_t^1, \dots, a_t^n)$, the environment transitions according to $P(s_{t+1} \mid s_t, a_t)$, and each agent $i$ receives $r_i(s_t, a_t)$.
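Packaging the whole tuple in one object keeps the definition honest. The container below is an assumed convenience, not a standard API, and the numbers are synthetic.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MarkovGame:
    n_states: int
    n_actions: tuple      # (|A_1|, |A_2|)
    P: np.ndarray         # P[s, a1, a2] -> distribution over next states
    R: np.ndarray         # R[i, s, a1, a2] -> agent i's reward
    gamma: float

rng = np.random.default_rng(2)
game = MarkovGame(
    n_states=3,
    n_actions=(2, 2),
    P=rng.dirichlet(np.ones(3), size=(3, 2, 2)),
    R=rng.normal(size=(2, 3, 2, 2)),
    gamma=0.9,
)

def rollout(game, policies, s0=0, horizon=20):
    """Discounted returns for both agents under fixed per-agent policies pi_i[s]."""
    s, returns = s0, np.zeros(2)
    for t in range(horizon):
        a = [rng.choice(n, p=pi[s]) for n, pi in zip(game.n_actions, policies)]
        returns += game.gamma ** t * game.R[:, s, a[0], a[1]]
        s = rng.choice(game.n_states, p=game.P[s, a[0], a[1]])
    return returns

uniform = [np.full((3, 2), 0.5), np.full((3, 2), 0.5)]   # pi_i[s] is a distribution over A_i
print("discounted returns per agent:", rollout(game, uniform))
```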
Three examples of Markov games:
- Two dialogue agents sharing a tool environment.
- Robot teams with shared state and individual rewards.
- Self-play systems where the opponent policy is part of the transition distribution.
Two non-examples clarify the boundary:
- A single-agent MDP with a fixed environment.
- A static normal-form game with no state.
Proof or verification habit for Markov games:
Bellman-style reasoning still applies, but values are indexed by both agent and joint policy.
single-agent optimization: choose theta to minimize L(theta)
game-theoretic optimization: choose pi_i while others choose pi_-i
adversarial objective: choose defense against best attack
multi-agent learning: policies change the environment itself
In AI systems, the Markov game model is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Multi-agent RL inherits all MDP difficulty and adds strategic nonstationarity.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using the Markov game model responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether rewards are common, opposed, or mixed; the answer changes the solution concept.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. The Markov game model gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
2.5 joint policy
Joint policy belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The joint policy $\pi = (\pi_1, \dots, \pi_n)$, which factorizes as $\pi(a \mid s) = \prod_{i=1}^n \pi_i(a_i \mid s)$ when agents act independently, gives the mathematical handle for this subsection. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A Markov game extends an MDP by replacing one action and one reward with joint actions and agent-specific rewards.
Worked reading.
At state $s_t$, agents sample a joint action $a_t = (a_t^1, \dots, a_t^n)$, the environment transitions according to $P(s_{t+1} \mid s_t, a_t)$, and each agent $i$ receives $r_i(s_t, a_t)$.
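A short sketch contrasts an independent (product-form) joint policy with a correlated one that does not factorize; the probability numbers are assumed.

```python
import numpy as np

pi_1 = np.array([0.7, 0.3])       # agent 1's action distribution at some state s
pi_2 = np.array([0.2, 0.8])       # agent 2's action distribution at the same state

joint = np.outer(pi_1, pi_2)       # joint[a1, a2] = pi_1[a1] * pi_2[a2]
print(joint)
print("sums to one:", np.isclose(joint.sum(), 1.0))

# A correlated joint policy need not factorize: this one puts all mass on matched actions.
correlated = np.array([[0.5, 0.0],
                       [0.0, 0.5]])
product_of_marginals = np.outer(correlated.sum(axis=1), correlated.sum(axis=0))
print("factorizes:", np.allclose(correlated, product_of_marginals))   # False
```

Independent learners can only realize product-form joint policies, which is one reason coordination can require communication or a shared signal.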
Three examples of joint policies in practice:
- Two dialogue agents sharing a tool environment.
- Robot teams with shared state and individual rewards.
- Self-play systems where the opponent policy is part of the transition distribution.
Two non-examples clarify the boundary:
- A single-agent MDP with a fixed environment.
- A static normal-form game with no state.
Proof or verification habit for joint policies:
Bellman-style reasoning still applies, but values are indexed by both agent and joint policy.
single-agent optimization: choose theta to minimize L(theta)
game-theoretic optimization: choose pi_i while others choose pi_-i
adversarial objective: choose defense against best attack
multi-agent learning: policies change the environment itself
In AI systems, the joint policy is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Multi-agent RL inherits all MDP difficulty and adds strategic nonstationarity.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using joint policies responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether rewards are common, opposed, or mixed; the answer changes the solution concept.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. The joint policy gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
3. Stochastic and Markov Games
Stochastic and Markov Games develops the part of multi-agent systems specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
3.1 state transitions
State transitions belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The transition kernel $P(s' \mid s, a_1, \dots, a_n)$ gives the mathematical handle for state transitions. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A Markov game extends an MDP by replacing one action and one reward with joint actions and agent-specific rewards.
Worked reading.
At state $s_t$, agents sample a joint action $a_t = (a_t^1, \dots, a_t^n)$, the environment transitions according to $P(s_{t+1} \mid s_t, a_t)$, and each agent $i$ receives $r_i(s_t, a_t)$.
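Freezing the other agent's policy collapses the joint kernel into an ordinary single-agent kernel by marginalizing over the opponent's action; the synthetic arrays below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(3), size=(3, 2, 2))   # P[s, a1, a2] -> distribution over next states
pi_2 = rng.dirichlet(np.ones(2), size=3)        # pi_2[s] -> opponent's action distribution at s

# Effective kernel for agent 1: P_eff[s, a1] = sum_{a2} pi_2[s, a2] * P[s, a1, a2]
P_eff = np.einsum("sb,sabn->san", pi_2, P)
print(P_eff.shape)                               # (3, 2, 3): an ordinary MDP kernel for agent 1
print(np.allclose(P_eff.sum(axis=-1), 1.0))      # each row is still a probability distribution
# If pi_2 changes, P_eff changes: agent 1's "environment" includes the opponent's policy.
```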
Three examples of state transitions:
- Two dialogue agents sharing a tool environment.
- Robot teams with shared state and individual rewards.
- Self-play systems where the opponent policy is part of the transition distribution.
Two non-examples clarify the boundary:
- A single-agent MDP with a fixed environment.
- A static normal-form game with no state.
Proof or verification habit for state transitions:
Bellman-style reasoning still applies, but values are indexed by both agent and joint policy.
single-agent optimization: choose theta to minimize L(theta)
game-theoretic optimization: choose pi_i while others choose pi_-i
adversarial objective: choose defense against best attack
multi-agent learning: policies change the environment itself
In AI systems, state transitions are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Multi-agent RL inherits all MDP difficulty and adds strategic nonstationarity.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using state transitions responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether rewards are common, opposed, or mixed; the answer changes the solution concept.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. State transitions give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
3.2 value functions for agents
Value functions for agents belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The per-agent value function $V_i^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^t r_i(s_t, a_t) \mid s_0 = s\right]$ gives the mathematical handle for value functions for agents. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A Markov game extends an MDP by replacing one action and one reward with joint actions and agent-specific rewards.
Worked reading.
At state $s_t$, agents sample a joint action $a_t = (a_t^1, \dots, a_t^n)$, the environment transitions according to $P(s_{t+1} \mid s_t, a_t)$, and each agent $i$ receives $r_i(s_t, a_t)$.
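Exact policy evaluation under one fixed joint policy makes the dual indexing (by agent and by joint policy) concrete. The synthetic game below is assumed; the linear-solve form $V_i = (I - \gamma P_\pi)^{-1} r_\pi^i$ is the standard evaluation identity.

```python
import numpy as np

rng = np.random.default_rng(4)
nS, gamma = 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, 2, 2))   # P[s, a1, a2] -> distribution over next states
R = rng.normal(size=(2, nS, 2, 2))                # R[i, s, a1, a2]
pi_1 = rng.dirichlet(np.ones(2), size=nS)         # pi_1[s] over agent 1's actions
pi_2 = rng.dirichlet(np.ones(2), size=nS)         # pi_2[s] over agent 2's actions

# State-to-state kernel and per-agent expected rewards induced by the joint policy.
P_pi = np.einsum("sa,sb,sabn->sn", pi_1, pi_2, P)
r_pi = np.einsum("sa,sb,isab->is", pi_1, pi_2, R)

V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi.T).T   # V[i, s]
print(V)
# Each row is one agent's value function; both rows are indexed by the same joint policy.
```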
Three examples of value functions for agents:
- Two dialogue agents sharing a tool environment.
- Robot teams with shared state and individual rewards.
- Self-play systems where the opponent policy is part of the transition distribution.
Two non-examples clarify the boundary:
- A single-agent MDP with a fixed environment.
- A static normal-form game with no state.
Proof or verification habit for value functions for agents:
Bellman-style reasoning still applies, but values are indexed by both agent and joint policy.
single-agent optimization: choose theta to minimize L(theta)
game-theoretic optimization: choose pi_i while others choose pi_-i
adversarial objective: choose defense against best attack
multi-agent learning: policies change the environment itself
In AI systems, per-agent value functions are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Multi-agent RL inherits all MDP difficulty and adds strategic nonstationarity.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using value functions for agents responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether rewards are common, opposed, or mixed; the answer changes the solution concept.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Value functions for agents give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
3.3 Nash policies
Nash policies belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
$$u_i(\pi_i^*, \pi_{-i}^*) \;\ge\; u_i(\pi_i, \pi_{-i}^*) \qquad \text{for every player } i \text{ and every admissible } \pi_i$$
The formula gives the mathematical handle for Nash policies. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A Nash equilibrium is a profile of strategies where no player can improve by changing its own strategy while all other strategies remain fixed.
Worked reading.
In the prisoner's dilemma payoff convention, mutual defection can be a Nash equilibrium even when mutual cooperation is better for both players. This is the central warning: stability and desirability are different properties.
Three examples of Nash policies:
- A self-play policy pair where neither side has a profitable unilateral exploit.
- A GAN fixed point where the generator distribution matches data and the discriminator cannot improve classification.
- A routing market where no model provider benefits from changing only its bid.
Two non-examples clarify the boundary:
- A high-welfare outcome with a profitable unilateral deviation.
- A training checkpoint with low loss but a large best-response exploit.
Proof or verification habit for Nash policies:
The proof is a universal deviation check: for each player $i$, hold $\pi_{-i}^*$ fixed and show $u_i(\pi_i^*, \pi_{-i}^*) \ge u_i(\pi_i, \pi_{-i}^*)$ for all allowed $\pi_i$.
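The deviation check can be run mechanically on a small bimatrix game. The sketch below uses prisoner's-dilemma-style payoffs as an illustrative assumption and reports the largest unilateral improvement available at a candidate profile; a value of zero certifies a Nash equilibrium of the stated game.

```python
import numpy as np

# Payoff matrices: A[i, j] is player 1's payoff and B[i, j] is player 2's payoff
# when player 1 plays row i and player 2 plays column j (illustrative dilemma numbers).
A = np.array([[3.0, 0.0], [5.0, 1.0]])
B = np.array([[3.0, 5.0], [0.0, 1.0]])

def deviation_gain(A, B, x, y):
    """Largest unilateral improvement available to either player at profile (x, y)."""
    u1, u2 = x @ A @ y, x @ B @ y
    gain1 = (A @ y).max() - u1           # best pure deviation for player 1
    gain2 = (x @ B).max() - u2           # best pure deviation for player 2
    return max(gain1, gain2)

cooperate = np.array([1.0, 0.0])
defect = np.array([0.0, 1.0])

print("mutual defection  :", deviation_gain(A, B, defect, defect))        # ~0 -> Nash
print("mutual cooperation:", deviation_gain(A, B, cooperate, cooperate))  # >0 -> not Nash
```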
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: the policies themselves change the environment.
In AI systems, Nash policies are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
For AI agents, Nash is a stability diagnostic. It does not guarantee safety, alignment, fairness, or global efficiency unless those objectives are encoded in the game.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using Nash policies responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask: if one deployed model, user, or attacker changed behavior alone, would it gain?
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Nash policies give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
3.4 cooperative team games
Cooperative team games belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for cooperative team games. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Cooperation and competition are payoff-structure choices. Cooperative games align rewards; competitive games put rewards in conflict; mixed-motive games do both.
Worked reading.
A common-payoff team has $r_1 = r_2 = \cdots = r_n$. A zero-sum game has $\sum_i r_i = 0$. Most deployed multi-agent systems sit between these extremes.
Three examples of cooperative team games:
- A team of tool agents sharing a task reward.
- A self-play opponent trained to expose weaknesses.
- A market of model providers competing while still serving user welfare.
Two non-examples clarify the boundary:
- Calling agents cooperative because they are in the same codebase.
- Calling a game competitive because agents are different processes.
Proof or verification habit for cooperative team games:
Classify the payoff relation before selecting an equilibrium or learning method.
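As a sketch of that classification habit (payoff numbers are illustrative assumptions), the helper below inspects per-agent payoff tensors over the same joint-action index and reports whether the game is common-payoff, zero-sum, or general-sum.

```python
import numpy as np

def classify_payoffs(rewards, tol=1e-9):
    """rewards: list of arrays, one per agent, all indexed by the same joint action."""
    stacked = np.stack(rewards)
    if np.allclose(stacked, stacked[0], atol=tol):
        return "common-payoff (cooperative)"
    if np.allclose(stacked.sum(axis=0), 0.0, atol=tol):
        return "zero-sum (strictly competitive)"
    return "general-sum (mixed motives)"

team  = [np.array([[1, 0], [0, 1]]), np.array([[1, 0], [0, 1]])]
zs    = [np.array([[1, -1], [-1, 1]]), np.array([[-1, 1], [1, -1]])]
mixed = [np.array([[3, 0], [5, 1]]), np.array([[3, 5], [0, 1]])]

for name, game in [("team game", team), ("matching pennies", zs), ("dilemma", mixed)]:
    print(f"{name:16s} -> {classify_payoffs(game)}")
```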
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: the policies themselves change the environment.
In AI systems, cooperative team games are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
The same algorithm can look safe or unsafe depending on whether rewards create cooperation, competition, or collusion.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using cooperative team games responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Write the reward vector, not just the environment reward.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Cooperative team games give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
3.5 partially observed settings preview
Partially observed settings preview belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for partially observed settings preview. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Partial observability means agents condition decisions on observations rather than the full state.
Worked reading.
A policy becomes $\pi_i(a_i \mid o_i)$ instead of $\pi_i(a_i \mid s)$, and beliefs or histories may be needed to act well.
Three examples of partially observed settings preview:
- Agents with different tool logs.
- A defender that sees alerts but not the attacker's full plan.
- A dialogue agent that sees conversation text but not hidden user intent.
Two non-examples clarify the boundary:
- A fully observed Markov game.
- A static matrix game.
Proof or verification habit for partially observed settings preview:
The proof habit is to specify observation functions and information sets before writing policies.
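A minimal sketch of what the restriction means (the state and observation spaces are illustrative assumptions): the observation function collapses two distinct states into one observation, so an observation-conditioned policy $\pi_i(a_i \mid o_i)$ is forced to play identically in states it cannot distinguish, while a state-conditioned policy is not.

```python
import numpy as np

n_states, n_obs, n_actions = 3, 2, 2

# Observation function O(s): states 1 and 2 are indistinguishable to the agent.
obs_of_state = np.array([0, 1, 1])

# A fully observed policy may act differently in every state.
pi_state = np.array([[0.9, 0.1],
                     [0.2, 0.8],
                     [0.6, 0.4]])

# A partially observed policy must share one distribution across states 1 and 2.
pi_obs = np.array([[0.9, 0.1],
                   [0.5, 0.5]])

for s in range(n_states):
    o = obs_of_state[s]
    print(f"state {s}: pi(a|s) = {pi_state[s]},  pi(a|o={o}) = {pi_obs[o]}")
```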
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: the policies themselves change the environment.
In AI systems, partially observed settings preview is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many AI security and coordination problems are partially observed because intent, hidden prompts, and private tools are not directly visible.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using partially observed settings preview responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: State what each agent observes at decision time.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Partially observed settings preview gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
4. Learning Dynamics
Learning Dynamics develops the part of multi-agent systems specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
4.1 independent learners
Independent learners belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for independent learners. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Learning dynamics study how strategies move over time, not just where equilibrium points are located.
Worked reading.
In fictitious play, each player tracks empirical frequencies of the opponent's past actions and best-responds to those beliefs.
Three examples of independent learners:
- Rock-paper-scissors empirical play approaching the mixed region.
- Independent Q-learners chasing each other's changing policies.
- GAN gradients rotating around a saddle-like point.
Two non-examples clarify the boundary:
- A static equilibrium certificate.
- A supervised learner trained against an immutable dataset.
Proof or verification habit for independent learners:
Analyze updates as a dynamical system: fixed points, cycles, regret, and exploitability are different diagnostics.
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: the policies themselves change the environment.
In AI systems, independent learners are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many AI failures are dynamic failures: the target moves while the learner is trying to fit it.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
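A minimal sketch of the nonstationarity problem (hyperparameters are illustrative assumptions): two independent $\epsilon$-greedy Q-learners play repeated matching pennies, each treating the other as part of a fixed environment. Neither learner's target stays still, so the action values keep chasing each other instead of converging to a fixed optimum.

```python
import numpy as np

rng = np.random.default_rng(1)

# Matching pennies: the row player wants actions to match, the column player does not.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])

q1, q2 = np.zeros(2), np.zeros(2)
alpha, eps, T = 0.1, 0.1, 5000
actions1 = []

def eps_greedy(q):
    return int(rng.integers(2)) if rng.random() < eps else int(np.argmax(q))

for t in range(T):
    a1, a2 = eps_greedy(q1), eps_greedy(q2)
    r1 = A[a1, a2]
    q1[a1] += alpha * (r1 - q1[a1])    # each learner ignores the other's adaptation
    q2[a2] += alpha * (-r1 - q2[a2])
    actions1.append(a1)

# The empirical frequency keeps drifting instead of locking onto the 0.5/0.5 mix.
print("player 1 action-0 frequency over the last 1000 steps:",
      round(1 - float(np.mean(actions1[-1000:])), 3))
```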
Checklist for using independent learners responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Plot trajectories or regret; do not infer convergence from one snapshot.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Independent learners give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
4.2 fictitious play
Fictitious play belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for fictitious play. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Learning dynamics study how strategies move over time, not just where equilibrium points are located.
Worked reading.
In fictitious play, each player tracks empirical frequencies of the opponent's past actions and best-responds to those beliefs.
Three examples of fictitious play:
- Rock-paper-scissors empirical play approaching the mixed region.
- Independent Q-learners chasing each other's changing policies.
- GAN gradients rotating around a saddle-like point.
Two non-examples clarify the boundary:
- A static equilibrium certificate.
- A supervised learner trained against an immutable dataset.
Proof or verification habit for fictitious play:
Analyze updates as a dynamical system: fixed points, cycles, regret, and exploitability are different diagnostics.
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: the policies themselves change the environment.
In AI systems, fictitious play is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many AI failures are dynamic failures: the target moves while the learner is trying to fit it.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
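A minimal fictitious-play sketch on rock-paper-scissors (the zero-sum payoff convention and the smoothing counts are illustrative assumptions): each player best-responds to the empirical frequencies of the opponent's past actions, and the empirical strategies approach the uniform mixed equilibrium.

```python
import numpy as np

# Rock-paper-scissors payoff for the row player; the game is zero-sum.
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

counts1 = np.ones(3)   # smoothed counts of player 1's past actions
counts2 = np.ones(3)   # smoothed counts of player 2's past actions

for t in range(20000):
    belief2 = counts2 / counts2.sum()       # player 1's belief about player 2
    belief1 = counts1 / counts1.sum()       # player 2's belief about player 1
    a1 = int(np.argmax(A @ belief2))        # best response to the belief
    a2 = int(np.argmax(-(belief1 @ A)))     # column player maximizes -A
    counts1[a1] += 1
    counts2[a2] += 1

print("empirical strategy of player 1:", np.round(counts1 / counts1.sum(), 3))
print("empirical strategy of player 2:", np.round(counts2 / counts2.sum(), 3))
```

The instantaneous best responses keep jumping between pure actions; it is the empirical averages, not the last actions, that settle near $(1/3, 1/3, 1/3)$.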
Checklist for using fictitious play responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Plot trajectories or regret; do not infer convergence from one snapshot.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Fictitious play gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
4.3 no-regret learning
No-regret learning belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for no-regret learning. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
No-regret learning turns repeated play into approximate equilibrium guarantees by making average regret small.
Worked reading.
If both players in a zero-sum game have average regret at most $\epsilon$, the pair of time-averaged strategies is a $2\epsilon$-approximate minimax solution.
Three examples of no-regret learning:
- Multiplicative weights for action probabilities.
- Self-play policies averaged over training.
- Exploitability curves used to track poker or board-game agents.
Two non-examples clarify the boundary:
- A decreasing supervised loss curve with no opponent model.
- A single final policy checkpoint without averaging or regret accounting.
Proof or verification habit for no-regret learning:
The proof decomposes the average payoff gap into the row player's regret plus the column player's regret.
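A minimal sketch of that decomposition (step size and horizon are illustrative assumptions): both players run multiplicative weights on rock-paper-scissors, and the exploitability of the time-averaged strategies shrinks together with the two average regrets.

```python
import numpy as np

A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])           # row player's payoff in a zero-sum game

T, eta = 5000, 0.05
w1, w2 = np.ones(3), np.ones(3)
avg1, avg2 = np.zeros(3), np.zeros(3)
best1, best2, realized = np.zeros(3), np.zeros(3), 0.0

for t in range(T):
    x, y = w1 / w1.sum(), w2 / w2.sum()
    avg1 += x
    avg2 += y
    u1, u2 = A @ y, -(x @ A)              # per-action expected payoffs this round
    realized += x @ A @ y                 # row player's realized expected payoff
    best1 += u1                           # cumulative payoff of each fixed row action
    best2 += u2                           # cumulative payoff of each fixed column action
    w1 *= np.exp(eta * u1)                # multiplicative-weights (Hedge) updates
    w2 *= np.exp(eta * u2)
    w1, w2 = w1 / w1.sum(), w2 / w2.sum()

x_bar, y_bar = avg1 / T, avg2 / T
regret1 = (best1.max() - realized) / T
regret2 = (best2.max() + realized) / T    # the column player's realized payoff is -realized
exploitability = (A @ y_bar).max() - (x_bar @ A).min()
print(f"average regrets: {regret1:.4f}, {regret2:.4f}")
print(f"exploitability of averaged strategies: {exploitability:.4f}")
```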
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: the policies themselves change the environment.
In AI systems, no-regret learning is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
This is why practical game-playing systems track exploitability and regret-like quantities instead of only reward.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using no-regret learning responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Report whether the guarantee applies to last iterate, averaged iterate, or best checkpoint.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. No-regret learning gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
4.4 policy gradients in games
Policy gradients in games belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for policy gradients in games. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Learning dynamics study how strategies move over time, not just where equilibrium points are located.
Worked reading.
In fictitious play, each player tracks empirical frequencies of the opponent's past actions and best-responds to those beliefs.
Three examples of policy gradients in games:
- Rock-paper-scissors empirical play approaching the mixed region.
- Independent Q-learners chasing each other's changing policies.
- GAN gradients rotating around a saddle-like point.
Two non-examples clarify the boundary:
- A static equilibrium certificate.
- A supervised learner trained against an immutable dataset.
Proof or verification habit for policy gradients in games:
Analyze updates as a dynamical system: fixed points, cycles, regret, and exploitability are different diagnostics.
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: the policies themselves change the environment.
In AI systems, policy gradients in games are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many AI failures are dynamic failures: the target moves while the learner is trying to fit it.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
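A minimal sketch of the cycling behavior (learning rate, logit parameterization, and starting point are illustrative assumptions): both players of matching pennies run simultaneous gradient ascent on their own expected payoff with the opponent held fixed at the current iterate, and the joint iterate orbits the mixed equilibrium at $p = q = 1/2$ instead of settling on it.

```python
import numpy as np

A = np.array([[1., -1.], [-1., 1.]])    # matching pennies, row player's payoff

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta1, theta2, lr = 1.0, -1.0, 0.2
trajectory = []

for t in range(200):
    p = sigmoid(theta1)                  # P(row plays action 0)
    q = sigmoid(theta2)                  # P(column plays action 0)
    x, y = np.array([p, 1 - p]), np.array([q, 1 - q])
    # Gradients of each player's expected payoff w.r.t. its own logit
    # (chain rule through the sigmoid; the column player's payoff is -x A y).
    du1_dp = (A @ y)[0] - (A @ y)[1]
    du2_dq = -((x @ A)[0] - (x @ A)[1])
    theta1 += lr * du1_dp * p * (1 - p)
    theta2 += lr * du2_dq * q * (1 - q)
    trajectory.append((p, q))

print("first iterate (p, q):", tuple(round(v, 3) for v in trajectory[0]))
print("last  iterate (p, q):", tuple(round(v, 3) for v in trajectory[-1]))
```

Plotting the trajectory (as the notebook would) makes the rotation around $(1/2, 1/2)$ visible; a single snapshot cannot.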
Checklist for using policy gradients in games responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Plot trajectories or regret; do not infer convergence from one snapshot.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Policy gradients in games give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
4.5 equilibrium selection
Equilibrium selection belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for equilibrium selection. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Equilibrium selection asks which equilibrium appears when several are mathematically possible.
Worked reading.
In a coordination game with two stable conventions, initialization, communication, history, or payoff-dominance can determine the selected convention.
Three examples of equilibrium selection:
- Two agents converging to the same API schema.
- Self-play selecting one opening strategy among many stable ones.
- A market standard emerging from repeated routing choices.
Two non-examples clarify the boundary:
- The proof that at least one equilibrium exists.
- A claim that all equilibria are equally safe.
Proof or verification habit for equilibrium selection:
First verify the candidate equilibria, then study basin of attraction or selection criterion.
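A minimal sketch of that two-step habit on a pure coordination game (payoffs are illustrative assumptions): both matched profiles are stable, and alternating best-response dynamics select whichever convention the initial joint action already leans toward, so initialization acts as the de facto selection mechanism.

```python
import numpy as np

# Pure coordination: both players share payoff 1 if they match on a convention, else 0.
U = np.array([[1.0, 0.0],
              [0.0, 1.0]])

def best_response_dynamics(a1, a2, steps=10):
    """Alternating best responses from an initial joint action."""
    for _ in range(steps):
        a1 = int(np.argmax(U[:, a2]))   # player 1 responds to player 2
        a2 = int(np.argmax(U[a1, :]))   # player 2 responds to player 1
    return a1, a2

for start in [(0, 0), (1, 1), (0, 1), (1, 0)]:
    print(f"start {start} -> selected convention {best_response_dynamics(*start)}")
```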
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: the policies themselves change the environment.
In AI systems, equilibrium selection is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
In AI systems, multiple stable behaviors can differ sharply in safety and usefulness.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using equilibrium selection responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Report which equilibrium is selected and why that selection mechanism is credible.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Equilibrium selection gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
5. Coordination and Communication
Coordination and Communication develops the part of multi-agent systems specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
5.1 common-payoff games
Common-payoff games belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for common-payoff games. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Coordination games contain multiple stable outcomes, so the mathematical problem is not only existence but selection.
Worked reading.
If two agents both prefer choosing the same protocol, both matched profiles $(A, A)$ and $(B, B)$ can be equilibria. Which one appears may depend on initialization, communication, history, or focal points.
Three examples of common-payoff games:
- LLM agents agree on a tool-call schema.
- Distributed learners converge to a shared convention for labels.
- A team of models selects the same plan representation before acting.
Two non-examples clarify the boundary:
- A zero-sum contest where one player's gain is the other's loss.
- A single model choosing a format without another agent needing to match it.
Proof or verification habit for common-payoff games:
Equilibrium verification is easy; equilibrium selection is the hard part. Show each matched profile is stable, then analyze basins, signals, or welfare.
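A minimal sketch of verify-then-select (payoffs are illustrative assumptions): in a common-payoff game where convention B pays more than convention A, both matched profiles pass the unilateral-deviation check, and only the welfare comparison separates them.

```python
import numpy as np

# Common payoff shared by both agents: matching on B (action 1) is worth more.
U = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def is_pure_nash(U, a1, a2):
    """Neither agent can gain by a unilateral pure deviation in a common-payoff game."""
    ok1 = U[a1, a2] >= U[:, a2].max()
    ok2 = U[a1, a2] >= U[a1, :].max()
    return bool(ok1 and ok2)

for profile in [(0, 0), (1, 1), (0, 1)]:
    print(profile, "Nash:", is_pure_nash(U, *profile), "welfare:", U[profile])
```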
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: the policies themselves change the environment.
In AI systems, common-payoff games are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Coordination failures are common in agentic systems because technically correct local policies can still fail to align interfaces.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using common-payoff games responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether agents need the same convention, and whether the convention is observable before action.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Common-payoff games give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
5.2 conventions
Conventions belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for conventions. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Coordination games contain multiple stable outcomes, so the mathematical problem is not only existence but selection.
Worked reading.
If two agents both prefer choosing the same protocol, both matched profiles $(A, A)$ and $(B, B)$ can be equilibria. Which one appears may depend on initialization, communication, history, or focal points.
Three examples of conventions:
- LLM agents agree on a tool-call schema.
- Distributed learners converge to a shared convention for labels.
- A team of models selects the same plan representation before acting.
Two non-examples clarify the boundary:
- A zero-sum contest where one player's gain is the other's loss.
- A single model choosing a format without another agent needing to match it.
Proof or verification habit for conventions:
Equilibrium verification is easy; equilibrium selection is the hard part. Show each matched profile is stable, then analyze basins, signals, or welfare.
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: the policies themselves change the environment.
In AI systems, conventions are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Coordination failures are common in agentic systems because technically correct local policies can still fail to align interfaces.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using conventions responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether agents need the same convention, and whether the convention is observable before action.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Conventions give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
5.3 communication protocols
Communication protocols belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for communication protocols. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Communication changes the information structure of a game; credit assignment changes how global outcomes are mapped back to individual actions.
Worked reading.
In a common-payoff team, all agents may receive the same reward, but each still needs enough signal to know which local decision helped.
Three examples of communication protocols:
- Agents exchange intermediate plans before tool use.
- A debate system where messages reveal evidence.
- A cooperative safety monitor assigns responsibility to specialized agents.
Two non-examples clarify the boundary:
- A hidden side channel not included in the game model.
- A global score used as if it directly explains each agent's contribution.
Proof or verification habit for communication protocols:
Model messages as actions, observations, or signals, then analyze how they alter feasible strategies and incentives.
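A minimal sketch of why a message channel is part of the game (the task structure, one-bit message, and policies are illustrative assumptions): with a truthful message the receiver can condition its action on the sender's private observation and the common payoff roughly doubles; without the channel the receiver must commit to a single action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Common payoff: the receiver's action must match the sender's private observation.
def payoff(task, action):
    return 1.0 if action == task else 0.0

def expected_payoff(receiver_policy, use_message, trials=10000):
    total = 0.0
    for _ in range(trials):
        task = int(rng.integers(2))                # sender's private observation
        message = task if use_message else None    # truthful one-bit message, or nothing
        action = receiver_policy(message)
        total += payoff(task, action)
    return total / trials

no_comm = expected_payoff(lambda m: 0, use_message=False)     # receiver must guess
with_comm = expected_payoff(lambda m: m, use_message=True)    # receiver conditions on message
print(f"expected team payoff without communication: {no_comm:.3f}")
print(f"expected team payoff with a truthful message: {with_comm:.3f}")
```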
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: the policies themselves change the environment.
In AI systems, communication protocols are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
LLM-agent systems often fail at interfaces before they fail at individual reasoning, so communication is a mathematical object, not an implementation detail.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using communication protocols responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Specify who observes each message and when.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Communication protocols give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
5.4 credit assignment
Credit assignment belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for credit assignment. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Communication changes the information structure of a game; credit assignment changes how global outcomes are mapped back to individual actions.
Worked reading.
In a common-payoff team, all agents may receive the same reward, but each still needs enough signal to know which local decision helped.
Three examples of credit assignment:
- Agents exchange intermediate plans before tool use.
- A debate system where messages reveal evidence.
- A cooperative safety monitor assigns responsibility to specialized agents.
Two non-examples clarify the boundary:
- A hidden side channel not included in the game model.
- A global score used as if it directly explains each agent's contribution.
Proof or verification habit for credit assignment:
Model messages as actions, observations, or signals, then analyze how they alter feasible strategies and incentives.
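One standard credit-assignment device is a difference reward. The sketch below (the synthetic team reward and the default action are illustrative assumptions) scores each agent by how much the global reward drops when that agent's action is replaced by a fixed default, which separates contributors from passengers even though everyone receives the same global signal.

```python
import numpy as np

def team_reward(actions):
    """Synthetic common payoff: agent 0 matters a lot, agent 2 not at all."""
    a = np.asarray(actions, dtype=float)
    return 3.0 * a[0] + 1.0 * a[1] + 0.0 * a[2]

def difference_rewards(actions, default=0):
    """D_i = G(a) - G(a with agent i's action replaced by the default)."""
    g = team_reward(actions)
    credits = []
    for i in range(len(actions)):
        counterfactual = list(actions)
        counterfactual[i] = default
        credits.append(g - team_reward(counterfactual))
    return g, credits

g, credits = difference_rewards([1, 1, 1])
print("global reward:", g, " per-agent difference rewards:", credits)
```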
- Single-agent optimization: choose $\theta$ to minimize $L(\theta)$.
- Game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: the policies themselves change the environment.
In AI systems, credit assignment is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
LLM-agent systems often fail at interfaces before they fail at individual reasoning, so communication is a mathematical object, not an implementation detail.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using credit assignment responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Specify who observes each message and when.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Credit assignment gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
5.5 social welfare and fairness
Social welfare and fairness belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for social welfare and fairness. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Pareto and welfare criteria evaluate outcomes across players; equilibrium evaluates unilateral incentives.
Worked reading.
An outcome is Pareto inefficient if another feasible outcome makes at least one player better off and no player worse off. A Nash equilibrium can fail this test.
Three examples of social welfare and fairness:
- Mutual cooperation in a prisoner's dilemma improves both players but may be unstable.
- A routing policy raises total quality but gives one provider incentive to deviate.
- A safety policy improves social welfare but reduces one actor's private payoff.
Two non-examples clarify the boundary:
- A unilateral deviation check on its own; that tests equilibrium stability, not welfare.
- A fairness claim without specifying the social objective.
Proof or verification habit for social welfare and fairness:
Separate the two predicates: first test deviations for equilibrium, then compare feasible outcomes for welfare.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
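The two predicates in the verification habit can be checked mechanically on a small payoff matrix. The sketch below uses illustrative prisoner's-dilemma payoffs (the numbers are assumptions, not canon) and tests every pure outcome first for unilateral deviations and then for Pareto efficiency.

```python
import numpy as np

# Illustrative prisoner's dilemma: entry [a1, a2] holds (payoff to player 1, payoff to player 2).
# Action 0 = cooperate, action 1 = defect; the numbers are assumptions for demonstration only.
payoffs = np.array([
    [[3, 3], [0, 4]],
    [[4, 0], [1, 1]],
])

def is_nash(a1, a2):
    """No player gains from a unilateral pure-strategy deviation."""
    u1, u2 = payoffs[a1, a2]
    best1 = max(payoffs[b, a2][0] for b in range(2))
    best2 = max(payoffs[a1, b][1] for b in range(2))
    return u1 >= best1 and u2 >= best2

def is_pareto_efficient(a1, a2):
    """No other pure outcome makes someone better off and nobody worse off."""
    u1, u2 = payoffs[a1, a2]
    for b1 in range(2):
        for b2 in range(2):
            v1, v2 = payoffs[b1, b2]
            if v1 >= u1 and v2 >= u2 and (v1 > u1 or v2 > u2):
                return False
    return True

for a1 in range(2):
    for a2 in range(2):
        print((a1, a2), "Nash:", is_nash(a1, a2), "Pareto efficient:", is_pareto_efficient(a1, a2))
```

With these assumed numbers, mutual defection is the only pure Nash equilibrium and also the only Pareto-inefficient outcome, which is exactly the gap between stability and welfare described above.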
In AI systems, social welfare and fairness is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
AI alignment often lives in the gap between private incentive and social objective, so this distinction is not philosophical decoration.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using social welfare and fairness responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: If an equilibrium is bad, changing incentives or constraints is usually required; wishing for cooperation is not a proof.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Social welfare and fairness gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6. AI Applications
AI Applications develops the part of multi-agent systems specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
6.1 multi-agent RL
Multi-agent RL belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for multi-agent RL. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Many-agent learning means every learner's policy is part of the environment seen by the others.
Worked reading.
If agent 1 changes its policy, agent 2's data distribution changes even when the physical simulator is unchanged. That is the core nonstationarity of multi-agent learning.
Three examples of multi-agent RL:
- Self-play agents improving by training against earlier or current versions.
- LLM tool agents changing each other's context and options.
- A routing marketplace where traffic shifts after one provider changes quality.
Two non-examples clarify the boundary:
- A single-agent RL problem with a fixed transition kernel.
- Batch supervised learning on immutable labels.
Proof or verification habit for multi-agent RL:
Analyze the joint policy trajectory, not only individual losses.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
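To make the nonstationarity concrete, here is a minimal sketch (not a full multi-agent RL implementation) of two independent stateless Q-learners on an illustrative 2x2 general-sum game. The payoff numbers, learning rate, and exploration rate are assumptions; the point is that agent 2's value estimates drift as agent 1's greedy action changes, even though the payoff matrices never change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2x2 general-sum payoffs: reward[i][a1, a2] is agent i's reward (assumed numbers).
reward = [
    np.array([[3.0, 0.0], [4.0, 1.0]]),   # agent 1: prisoner's-dilemma-like incentives
    np.array([[3.0, 4.0], [0.0, 1.0]]),   # agent 2
]

q = [np.zeros(2), np.zeros(2)]            # stateless Q-values, one entry per own action
eps, alpha = 0.1, 0.1

for t in range(5001):
    # Each agent acts epsilon-greedily on its own Q-values and ignores the other agent entirely.
    greedy = (int(np.argmax(q[0])), int(np.argmax(q[1])))
    acts = [g if rng.random() > eps else int(rng.integers(2)) for g in greedy]
    for i in range(2):
        r = reward[i][acts[0], acts[1]]
        q[i][acts[i]] += alpha * (r - q[i][acts[i]])
    if t % 1000 == 0:
        # Agent 2's estimates drift because the "environment" it faces is agent 1's current policy.
        print(t, "agent 1 greedy action:", int(np.argmax(q[0])), "agent 2 Q-values:", np.round(q[1], 2))
```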
In AI systems, multi-agent RL is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Agentic LLM systems make multi-agent math practical: prompts, tools, memory, and policies interact in a shared state.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using multi-agent RL responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask which part of another agent's behavior enters this agent's observation or reward.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Multi-agent RL gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.2 self-play systems
Self-play systems belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for self-play systems. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Many-agent learning means every learner's policy is part of the environment seen by the others.
Worked reading.
If agent 1 changes its policy, agent 2's data distribution changes even when the physical simulator is unchanged. That is the core nonstationarity of multi-agent learning.
Three examples of self-play systems:
- Self-play agents improving by training against earlier or current versions.
- LLM tool agents changing each other's context and options.
- A routing marketplace where traffic shifts after one provider changes quality.
Two non-examples clarify the boundary:
- A single-agent RL problem with a fixed transition kernel.
- Batch supervised learning on immutable labels.
Proof or verification habit for self-play systems:
Analyze the joint policy trajectory, not only individual losses.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
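A minimal self-play sketch under an assumed game: fictitious play on matching pennies, where each side best-responds to the other's empirical mixture of past actions. The empirical mixtures drift toward the mixed equilibrium, and the best payoff an adversary can earn against the averaged row strategy shrinks toward the game value.

```python
import numpy as np

# Matching pennies payoff for the row player; the column player receives the negative (zero-sum).
A = np.array([[1.0, -1.0], [-1.0, 1.0]])

counts = [np.ones(2), np.ones(2)]   # empirical action counts for row and column players

for t in range(10000):
    # Fictitious play: each side best-responds to the opponent's empirical mixture so far.
    row_mix = counts[0] / counts[0].sum()
    col_mix = counts[1] / counts[1].sum()
    row_action = int(np.argmax(A @ col_mix))       # row maximizes expected payoff
    col_action = int(np.argmax(-(row_mix @ A)))    # column maximizes its own (negated) payoff
    counts[0][row_action] += 1
    counts[1][col_action] += 1

row_mix = counts[0] / counts[0].sum()
col_mix = counts[1] / counts[1].sum()
print("row empirical mixture:   ", np.round(row_mix, 3))
print("column empirical mixture:", np.round(col_mix, 3))
# Exploitability of the averaged row strategy: the column player's best-response payoff against it.
print("best adversarial payoff vs averaged row strategy:", np.round(np.max(-(row_mix @ A)), 3))
```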
In AI systems, self-play systems are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Agentic LLM systems make multi-agent math practical: prompts, tools, memory, and policies interact in a shared state.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using self-play systems responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask which part of another agent's behavior enters this agent's observation or reward.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Self-play systems give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.3 tool-using LLM swarms
Tool-using LLM swarms belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for tool-using LLM swarms. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Generative, evaluation, and deployment games arise when model behavior changes in response to the measurement or defense mechanism.
Worked reading.
In a GAN, the discriminator improves its classifier while the generator improves samples to fool it. In red-team evaluation, the attacker improves examples after seeing failures of the defense.
Three examples of tool-using LLM swarms:
- GAN generator-discriminator training.
- Jailbreak discovery against a deployed policy layer.
- Benchmark gaming where systems optimize for the public metric instead of the intended task.
Two non-examples clarify the boundary:
- One-time evaluation on a frozen hidden test set.
- A content filter measured only against historical prompts.
Proof or verification habit for tool-using LLM swarms:
The mathematical proof obligation is to identify the adaptive loop and the payoff each side optimizes.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
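The adaptive evaluation loop can be caricatured in a few lines. In the sketch below, a hypothetical defender sets a blocking threshold from previously observed attack scores, and the attacker then concentrates new attempts just below that threshold; the score distributions and quantile rule are invented for illustration. Measured robustness against historical attacks stays high while robustness against the adapted attacker collapses, which is benchmark gaming in miniature.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: attacks carry a scalar "suspicion score"; the defender blocks scores
# above a threshold fit to past attacks; the attacker adapts to sit just below that threshold.
history = list(rng.uniform(0.6, 1.0, size=200))   # initial attacks are obviously suspicious

for round_ in range(5):
    # Defender best-responds to history: pick the threshold that blocks 95% of observed attacks.
    threshold = float(np.quantile(history, 0.05))
    # Attacker best-responds to the defense: sample new attacks just below the threshold.
    new_attacks = rng.uniform(max(0.0, threshold - 0.1), threshold, size=200)
    blocked_old = np.mean(np.array(history) >= threshold)
    blocked_new = np.mean(new_attacks >= threshold)
    print(f"round {round_}: threshold={threshold:.3f}, "
          f"block rate on historical attacks={blocked_old:.2f}, on adapted attacks={blocked_new:.2f}")
    history.extend(new_attacks.tolist())
```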
In AI systems, tool-using LLM swarms are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many LLM safety and evaluation failures are game failures: optimizing the metric changes the population of attempts.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using tool-using LLM swarms responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask who can observe the metric, adapt to it, and benefit from adaptation.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Tool-using LLM swarms give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.4 market-style model routing
Market-style model routing belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for market-style model routing. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Generative, evaluation, and deployment games arise when model behavior changes in response to the measurement or defense mechanism.
Worked reading.
In a GAN, the discriminator improves its classifier while the generator improves samples to fool it. In red-team evaluation, the attacker improves examples after seeing failures of the defense.
Three examples of market-style model routing:
- GAN generator-discriminator training.
- Jailbreak discovery against a deployed policy layer.
- Benchmark gaming where systems optimize for the public metric instead of the intended task.
Two non-examples clarify the boundary:
- One-time evaluation on a frozen hidden test set.
- A content filter measured only against historical prompts.
Proof or verification habit for market-style model routing:
The mathematical proof obligation is to identify the adaptive loop and the payoff each side optimizes.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
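A toy routing market makes the deviation incentive explicit. In this sketch, two hypothetical providers choose a quality level, the router sends traffic in proportion to quality, and quality is costly; every number is an assumption. Best-response dynamics settle into a quality race whose fixed point pays each provider less than mutual restraint would, mirroring the example above.

```python
# Toy routing market (illustrative numbers only): two providers pick a quality level,
# traffic is split in proportion to quality, and higher quality costs effort.
levels = [0, 1, 2]
market, cost, base = 10.0, 0.8, 1.0

def payoff(e_i, e_j):
    q_i, q_j = base + e_i, base + e_j
    share = q_i / (q_i + q_j)          # traffic share awarded by the router
    return market * share - cost * e_i

def best_response(e_j):
    return max(levels, key=lambda e: payoff(e, e_j))

e = [0, 0]                             # start from mutual restraint
for _ in range(20):
    e = [best_response(e[1]), best_response(e[0])]

print("best-response fixed point:", e)
print("payoffs at fixed point:", round(payoff(e[0], e[1]), 2), round(payoff(e[1], e[0]), 2))
print("payoffs at (0, 0):     ", round(payoff(0, 0), 2), round(payoff(0, 0), 2))
```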
In AI systems, market-style model routing is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many LLM safety and evaluation failures are game failures: optimizing the metric changes the population of attempts.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using market-style model routing responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask who can observe the metric, adapt to it, and benefit from adaptation.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Market-style model routing gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.5 cooperative safety and oversight
Cooperative safety and oversight belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula gives the mathematical handle for cooperative safety and oversight. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Communication changes the information structure of a game; credit assignment changes how global outcomes are mapped back to individual actions.
Worked reading.
In a common-payoff team, all agents may receive the same reward, but each still needs enough signal to know which local decision helped.
Three examples of cooperative safety and oversight:
- Agents exchange intermediate plans before tool use.
- A debate system where messages reveal evidence.
- A cooperative safety monitor assigns responsibility to specialized agents.
Two non-examples clarify the boundary:
- A hidden side channel not included in the game model.
- A global score used as if it directly explains each agent's contribution.
Proof or verification habit for cooperative safety and oversight:
Model messages as actions, observations, or signals, then analyze how they alter feasible strategies and incentives.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
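Credit assignment can be illustrated with a difference-reward style counterfactual: replace one agent's action with a baseline and measure how much the shared reward drops. The team-reward function, the subtask names, and the "noop" baseline below are all hypothetical; the sketch only shows why a shared global score does not by itself reveal which agent helped.

```python
# Hypothetical common-payoff team: the shared reward counts how many required subtasks
# are covered by at least one agent's action.
subtasks = {"search", "verify", "summarize"}

def team_reward(actions):
    return len(subtasks & set(actions))

actions = ["search", "search", "verify"]   # two agents duplicate the same subtask
baseline = "noop"                          # assumed do-nothing action used for the counterfactual

g = team_reward(actions)
print("shared global reward:", g)
for i, a in enumerate(actions):
    # Difference reward: how much the team loses if agent i's action is swapped for the baseline.
    counterfactual = actions[:i] + [baseline] + actions[i + 1:]
    print(f"agent {i} ({a}): marginal credit = {g - team_reward(counterfactual)}")
```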
In AI systems, cooperative safety and oversight is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
LLM-agent systems often fail at interfaces before they fail at individual reasoning, so communication is a mathematical object, not an implementation detail.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using cooperative safety and oversight responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Specify who observes each message and when.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Cooperative safety and oversight gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
7. Common Mistakes
| # | Mistake | Why It Is Wrong | Fix |
|---|---|---|---|
| 1 | Treating equilibrium as social optimality | A Nash equilibrium can be inefficient or unfair. | Compare equilibrium outcomes with Pareto and welfare criteria. |
| 2 | Checking only one player's incentive | Equilibrium requires every player to lack profitable unilateral deviation. | Compute best responses for all players. |
| 3 | Ignoring mixed strategies | Some finite games have no pure equilibrium. | Use probability distributions over actions and the indifference principle. |
| 4 | Applying minimax to non-zero-sum games blindly | Minimax value is a zero-sum guarantee, not a general welfare solution. | State whether payoffs are strictly opposed before using minimax. |
| 5 | Confusing learning convergence with equilibrium | A learning process can cycle, diverge, or converge to a non-equilibrium behavior. | Track regret, exploitability, and stationarity separately. |
| 6 | Forgetting that other agents adapt | In multi-agent systems, each learner changes the data distribution of the others. | Model policies jointly and monitor nonstationarity. |
| 7 | Using average-case metrics against adaptive attackers | An adaptive opponent targets the worst exploitable gap. | Define threat sets and robust objectives. |
| 8 | Equating red teaming with complete security | Red-team examples are samples, not proofs against all attacks. | Use adaptive evaluation and explicit threat models. |
| 9 | Treating GAN instability as ordinary optimization only | GANs are games whose gradients can rotate instead of descend. | Analyze generator and discriminator objectives jointly. |
| 10 | Letting game abstractions erase values | Payoff design determines incentives and side effects. | Audit utility functions, constraints, and welfare implications. |
8. Exercises
- (*) Work through a game-theory task for multi-agent systems.
  - (a) State the players, actions, and payoffs.
  - (b) Compute or characterize best responses.
  - (c) Decide whether the proposed joint strategy is stable.
  - (d) Interpret the result for an AI, LLM, or adversarial system.
- (*) Work through a game-theory task for multi-agent systems.
  - (a) State the players, actions, and payoffs.
  - (b) Compute or characterize best responses.
  - (c) Decide whether the proposed joint strategy is stable.
  - (d) Interpret the result for an AI, LLM, or adversarial system.
- (*) Work through a game-theory task for multi-agent systems.
  - (a) State the players, actions, and payoffs.
  - (b) Compute or characterize best responses.
  - (c) Decide whether the proposed joint strategy is stable.
  - (d) Interpret the result for an AI, LLM, or adversarial system.
- (**) Work through a game-theory task for multi-agent systems.
  - (a) State the players, actions, and payoffs.
  - (b) Compute or characterize best responses.
  - (c) Decide whether the proposed joint strategy is stable.
  - (d) Interpret the result for an AI, LLM, or adversarial system.
- (**) Work through a game-theory task for multi-agent systems.
  - (a) State the players, actions, and payoffs.
  - (b) Compute or characterize best responses.
  - (c) Decide whether the proposed joint strategy is stable.
  - (d) Interpret the result for an AI, LLM, or adversarial system.
- (**) Work through a game-theory task for multi-agent systems.
  - (a) State the players, actions, and payoffs.
  - (b) Compute or characterize best responses.
  - (c) Decide whether the proposed joint strategy is stable.
  - (d) Interpret the result for an AI, LLM, or adversarial system.
- (***) Work through a game-theory task for multi-agent systems.
  - (a) State the players, actions, and payoffs.
  - (b) Compute or characterize best responses.
  - (c) Decide whether the proposed joint strategy is stable.
  - (d) Interpret the result for an AI, LLM, or adversarial system.
- (***) Work through a game-theory task for multi-agent systems.
  - (a) State the players, actions, and payoffs.
  - (b) Compute or characterize best responses.
  - (c) Decide whether the proposed joint strategy is stable.
  - (d) Interpret the result for an AI, LLM, or adversarial system.
- (***) Work through a game-theory task for multi-agent systems.
  - (a) State the players, actions, and payoffs.
  - (b) Compute or characterize best responses.
  - (c) Decide whether the proposed joint strategy is stable.
  - (d) Interpret the result for an AI, LLM, or adversarial system.
- (***) Work through a game-theory task for multi-agent systems.
  - (a) State the players, actions, and payoffs.
  - (b) Compute or characterize best responses.
  - (c) Decide whether the proposed joint strategy is stable.
  - (d) Interpret the result for an AI, LLM, or adversarial system.
9. Why This Matters for AI
| Concept | AI Impact |
|---|---|
| Best response | Explains how users, attackers, or agents adapt to a model policy |
| Nash equilibrium | Defines strategic stability for GANs, self-play, routing, and agent systems |
| Mixed strategy | Motivates randomized defenses, stochastic policies, and exploration |
| Minimax value | Formalizes robust worst-case guarantees |
| Exploitability | Measures how far a policy is from strategic stability |
| No-regret learning | Connects repeated play to approximate equilibrium |
| Security game | Models limited defensive resources against adaptive threats |
| Payoff design | Shows why objective misspecification creates strategic side effects |
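The no-regret row of the table can be made executable. Below is a minimal regret-matching sketch on matching pennies (the game and horizon are assumptions chosen so the equilibrium is easy to recognize): each player mixes in proportion to positive cumulative regret, and the time-averaged strategies approach the (0.5, 0.5) mixed equilibrium, illustrating how repeated no-regret play connects to approximate equilibrium.

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # row payoff in matching pennies; column receives -A
regret = [np.zeros(2), np.zeros(2)]
avg = [np.zeros(2), np.zeros(2)]
rng = np.random.default_rng(0)
T = 20000

def strategy(r):
    # Regret matching: play each action with probability proportional to its positive regret.
    pos = np.maximum(r, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(2, 0.5)

for _ in range(T):
    s = [strategy(regret[0]), strategy(regret[1])]
    a = [int(rng.choice(2, p=s[0])), int(rng.choice(2, p=s[1]))]
    u_row = A[:, a[1]]                 # row payoff per row action, given column's realized action
    u_col = -A[a[0], :]                # column payoff per column action, given row's realized action
    regret[0] += u_row - float(s[0] @ u_row)
    regret[1] += u_col - float(s[1] @ u_col)
    avg[0] += s[0]
    avg[1] += s[1]

print("time-averaged row strategy:   ", np.round(avg[0] / T, 3))
print("time-averaged column strategy:", np.round(avg[1] / T, 3))
```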
10. Conceptual Bridge
Multi-Agent Systems follows causal inference because interventions often change incentives. Chapter 22 asks what changes when an action is taken. Chapter 23 asks what happens when other agents see that action, learn from it, and respond strategically.
The backward bridge is intervention. A policy change can have a causal effect, but if users or attackers adapt, the effect becomes part of a game. The forward bridge is measure theory: later probability foundations make the stochastic strategies, repeated games, and distributional assumptions more rigorous.
+--------------------------------------------------------------+
| Chapter 22: intervention and causal mechanisms |
| Chapter 23: strategic adaptation and adversarial objectives |
| Chapter 24: rigorous probability and measure foundations |
+--------------------------------------------------------------+
References
- Shoham and Leyton-Brown. Multiagent Systems. https://www.masfoundations.org/toc.pdf
- Littman. Markov games as a framework for multi-agent reinforcement learning. https://www.cs.rutgers.edu/~mlittman/papers/ml94-final.pdf
- Nisan et al. Algorithmic Game Theory. https://doi.org/10.1017/CBO9780511800481
- Cesa-Bianchi and Lugosi. Prediction, Learning, and Games. https://www.cambridge.org/core/books/prediction-learning-and-games/30E375A151AD4A73012C9BA075E6C482