5. Coordination and Communication
Coordination and Communication develops the part of multi-agent systems specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
5.1 Common-Payoff Games
Common-payoff games belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The best-response condition gives the mathematical handle for common-payoff games: pi_i is a best response to pi_-i when u_i(pi_i, pi_-i) >= u_i(pi_i', pi_-i) for every alternative pi_i'. In game theory, this condition should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Coordination games contain multiple stable outcomes, so the mathematical problem is not only existence but selection.
Worked reading.
If two agents both prefer choosing the same protocol, both matched profiles (both choose A, or both choose B) can be equilibria. Which one appears may depend on initialization, communication, history, or focal points.
Three examples of common-payoff games:
- LLM agents agree on a tool-call schema.
- Distributed learners converge to a shared convention for labels.
- A team of models selects the same plan representation before acting.
Two non-examples clarify the boundary:
- A zero-sum contest where one player's gain is the other's loss.
- A single model choosing a format without another agent needing to match it.
Proof or verification habit for common-payoff games:
Equilibrium verification is easy; equilibrium selection is the hard part. Show each matched profile is stable, then analyze basins, signals, or welfare.
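To make the habit concrete, here is a minimal sketch in the synthetic-payoff style of the planned notebook: it enumerates the pure profiles of a 2x2 common-payoff coordination game and applies the unilateral-deviation test. The payoff numbers are illustrative assumptions, not values from the chapter.

```python
import numpy as np

# Common payoff: both agents receive payoff[a1, a2].
# Matching on protocol A (action 0) or protocol B (action 1) pays;
# mismatching pays nothing. Numbers are illustrative assumptions.
payoff = np.array([[2.0, 0.0],
                   [0.0, 1.0]])

def is_pure_nash(payoff, a1, a2):
    """True if neither agent can gain by deviating unilaterally."""
    u = payoff[a1, a2]
    best_dev_1 = max(payoff[a, a2] for a in range(payoff.shape[0]))
    best_dev_2 = max(payoff[a1, a] for a in range(payoff.shape[1]))
    return u >= best_dev_1 and u >= best_dev_2

for a1 in range(2):
    for a2 in range(2):
        status = "equilibrium" if is_pure_nash(payoff, a1, a2) else "not stable"
        print(f"profile ({a1}, {a2}): {status}")

# Both matched profiles pass the deviation test, which is exactly why
# verification alone cannot say which convention a system will select.
```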
- Single-agent optimization: choose theta to minimize L(theta).
- Game-theoretic optimization: choose pi_i while others choose pi_-i.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, common-payoff games are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Coordination failures are common in agentic systems because technically correct local policies can still fail to align interfaces.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using common-payoff games responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether agents need the same convention, and whether the convention is observable before action.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Common-payoff games give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
5.2 Conventions
Conventions belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The best-response condition gives the mathematical handle for conventions: pi_i is a best response to pi_-i when u_i(pi_i, pi_-i) >= u_i(pi_i', pi_-i) for every alternative pi_i'. In game theory, this condition should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Coordination games contain multiple stable outcomes, so the mathematical problem is not only existence but selection.
Worked reading.
If two agents both prefer choosing the same protocol, both matched profiles (both choose A, or both choose B) can be equilibria. Which one appears may depend on initialization, communication, history, or focal points.
Three examples of conventions:
- LLM agents agree on a tool-call schema.
- Distributed learners converge to a shared convention for labels.
- A team of models selects the same plan representation before acting.
Two non-examples clarify the boundary:
- A zero-sum contest where one player's gain is the other's loss.
- A single model choosing a format without another agent needing to match it.
Proof or verification habit for conventions:
Equilibrium verification is easy; equilibrium selection is the hard part. Show each matched profile is stable, then analyze basins, signals, or welfare.
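A minimal sketch of selection under learning dynamics, with assumed payoffs: two symmetric agents repeatedly best-respond to the empirical frequency of past play (fictitious play), and the convention they settle on depends on the initial belief.

```python
import numpy as np

# Common payoff for matching on convention A (0) or B (1); A pays more.
payoff = np.array([[2.0, 0.0],
                   [0.0, 1.0]])

def converged_convention(p0, steps=200):
    """p0: initial belief that the other agent plays A (a pseudo-count prior)."""
    counts = np.array([p0, 1.0 - p0])
    action = None
    for _ in range(steps):
        beliefs = counts / counts.sum()
        action = int(np.argmax(payoff @ beliefs))  # best response to belief
        counts[action] += 1.0  # agents are symmetric, so one counter stands in for both
    return "A" if action == 0 else "B"

for p0 in (0.9, 0.5, 0.2):
    print(f"initial belief P(A) = {p0}: settles on convention {converged_convention(p0)}")

# Stability did not change between runs; only the starting point did.
# Selection is a property of the dynamics, not of the payoff matrix alone.
```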
- Single-agent optimization: choose theta to minimize L(theta).
- Game-theoretic optimization: choose pi_i while others choose pi_-i.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, conventions are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Coordination failures are common in agentic systems because technically correct local policies can still fail to align interfaces.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using conventions responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether agents need the same convention, and whether the convention is observable before action.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Conventions give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
5.3 Communication Protocols
Communication protocols belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The best-response condition gives the mathematical handle for communication protocols: pi_i is a best response to pi_-i when u_i(pi_i, pi_-i) >= u_i(pi_i', pi_-i) for every alternative pi_i'. In game theory, this condition should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Communication changes the information structure of a game; credit assignment changes how global outcomes are mapped back to individual actions.
Worked reading.
In a common-payoff team, all agents may receive the same reward, but each still needs enough signal to know which local decision helped.
Three examples of communication protocols:
- Agents exchange intermediate plans before tool use.
- A debate system where messages reveal evidence.
- A cooperative safety monitor assigns responsibility to specialized agents.
Two non-examples clarify the boundary:
- A hidden side channel not included in the game model.
- A global score used as if it directly explains each agent's contribution.
Proof or verification habit for communication protocols:
Model messages as actions, observations, or signals, then analyze how they alter feasible strategies and incentives.
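The following sketch shows the modeling move on an assumed one-bit signaling task: treat the message as an action and compare the achievable team payoff with and without the channel.

```python
import random

random.seed(0)

def team_payoff(state, action):
    # Common payoff: the team scores only if the action matches the state.
    return 1.0 if action == state else 0.0

def average_payoff(n_rounds, channel_open):
    total = 0.0
    for _ in range(n_rounds):
        state = random.randint(0, 1)       # which convention pays today
        if channel_open:
            message = state                # sender reports truthfully
            action = message               # receiver follows the message
        else:
            action = random.randint(0, 1)  # receiver can only guess
        total += team_payoff(state, action)
    return total / n_rounds

print("with the message channel:   ", average_payoff(10_000, channel_open=True))
print("without the message channel:", average_payoff(10_000, channel_open=False))

# Same physical task, same payoffs; removing one observable message
# changes the feasible joint strategies and halves the achievable value.
```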
- Single-agent optimization: choose theta to minimize L(theta).
- Game-theoretic optimization: choose pi_i while others choose pi_-i.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, communication protocols are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
LLM-agent systems often fail at interfaces before they fail at individual reasoning, so communication is a mathematical object, not an implementation detail.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using communication protocols responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Specify who observes each message and when.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Communication protocols give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
5.4 Credit Assignment
Credit assignment belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The best-response condition gives the mathematical handle for credit assignment: pi_i is a best response to pi_-i when u_i(pi_i, pi_-i) >= u_i(pi_i', pi_-i) for every alternative pi_i'. In game theory, this condition should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Communication changes the information structure of a game; credit assignment changes how global outcomes are mapped back to individual actions.
Worked reading.
In a common-payoff team, all agents may receive the same reward, but each still needs enough signal to know which local decision helped.
Three examples of credit assignment:
- Agents exchange intermediate plans before tool use.
- A debate system where messages reveal evidence.
- A cooperative safety monitor assigns responsibility to specialized agents.
Two non-examples clarify the boundary:
- A hidden side channel not included in the game model.
- A global score used as if it directly explains each agent's contribution.
Proof or verification habit for credit assignment:
Model messages as actions, observations, or signals, then analyze how they alter feasible strategies and incentives.
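One standard handle on this mapping is the difference reward, sketched below on an assumed two-agent score function: subtracting the global score with one agent's action replaced by a default isolates that agent's contribution.

```python
def global_score(a1, a2):
    # Team outcome: agent 1's action drives success; agent 2 adds a bonus.
    return (1.0 if a1 == 1 else 0.0) + (0.2 if a2 == 1 else 0.0)

def difference_reward(agent, a1, a2, default=0):
    """Global score minus the score with this agent's action replaced
    by a fixed default -- a standard way to isolate one agent's effect."""
    if agent == 1:
        return global_score(a1, a2) - global_score(default, a2)
    return global_score(a1, a2) - global_score(a1, default)

a1, a2 = 1, 0   # agent 1 acted well, agent 2 did not
print("shared global reward:      ", global_score(a1, a2))
print("difference reward, agent 1:", difference_reward(1, a1, a2))
print("difference reward, agent 2:", difference_reward(2, a1, a2))

# Both agents see the same global 1.0, but the difference rewards
# (1.0 versus 0.0) expose which local decision actually helped.
```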
- Single-agent optimization: choose theta to minimize L(theta).
- Game-theoretic optimization: choose pi_i while others choose pi_-i.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, credit assignment is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
LLM-agent systems often fail at interfaces before they fail at individual reasoning, so communication is a mathematical object, not an implementation detail.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using credit assignment responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Specify who observes each message and when.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Credit assignment gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
5.5 Social Welfare and Fairness
Social welfare and fairness belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The best-response condition gives the mathematical handle for social welfare and fairness: pi_i is a best response to pi_-i when u_i(pi_i, pi_-i) >= u_i(pi_i', pi_-i) for every alternative pi_i'. In game theory, this condition should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Pareto and welfare criteria evaluate outcomes across players; equilibrium evaluates unilateral incentives.
Worked reading.
An outcome is Pareto inefficient if another feasible outcome makes at least one player better off and no player worse off. A Nash equilibrium can fail this test.
Three examples of social welfare and fairness:
- Mutual cooperation in a prisoner's dilemma improves both players but may be unstable.
- A routing policy raises total quality but gives one provider incentive to deviate.
- A safety policy improves social welfare but reduces one actor's private payoff.
Two non-examples clarify the boundary:
- A unilateral deviation check.
- A fairness claim without specifying the social objective.
Proof or verification habit for social welfare and fairness:
Separate the two predicates: first test deviations for equilibrium, then compare feasible outcomes for welfare.
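A minimal sketch of the two predicates on the standard prisoner's dilemma payoffs: every profile is tested once for unilateral deviations and once for Pareto dominance.

```python
import itertools

# payoffs[(a1, a2)] = (u1, u2); action 0 = cooperate, 1 = defect.
payoffs = {
    (0, 0): (3, 3), (0, 1): (0, 5),
    (1, 0): (5, 0), (1, 1): (1, 1),
}

def is_nash(profile):
    a1, a2 = profile
    u1, u2 = payoffs[profile]
    return (u1 >= max(payoffs[(d, a2)][0] for d in (0, 1))
            and u2 >= max(payoffs[(a1, d)][1] for d in (0, 1)))

def is_pareto_efficient(profile):
    u = payoffs[profile]
    for v in payoffs.values():
        if all(x >= y for x, y in zip(v, u)) and any(x > y for x, y in zip(v, u)):
            return False
    return True

for p in itertools.product((0, 1), repeat=2):
    print(p, f"nash={is_nash(p)}", f"pareto={is_pareto_efficient(p)}",
          f"welfare={sum(payoffs[p])}")

# (1, 1) is the unique equilibrium yet is Pareto-dominated by (0, 0):
# the deviation test and the welfare comparison are different predicates.
```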
- Single-agent optimization: choose theta to minimize L(theta).
- Game-theoretic optimization: choose pi_i while others choose pi_-i.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, social welfare and fairness are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
AI alignment often lives in the gap between private incentive and social objective, so this distinction is not philosophical decoration.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using social welfare and fairness responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: If an equilibrium is bad, changing incentives or constraints is usually required; wishing for cooperation is not a proof.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Social welfare and fairness give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6. AI Applications
AI Applications develops the part of multi-agent systems specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
6.1 Multi-Agent RL
Multi-agent RL belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The best-response condition gives the mathematical handle for multi-agent RL: pi_i is a best response to pi_-i when u_i(pi_i, pi_-i) >= u_i(pi_i', pi_-i) for every alternative pi_i'. In game theory, this condition should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Many-agent learning means every learner's policy is part of the environment seen by the others.
Worked reading.
If agent 1 changes its policy, agent 2's data distribution changes even when the physical simulator is unchanged. That is the core nonstationarity of multi-agent learning.
Three examples of multi-agent rl:
- Self-play agents improving by training against earlier or current versions.
- LLM tool agents changing each other's context and options.
- A routing marketplace where traffic shifts after one provider changes quality.
Two non-examples clarify the boundary:
- A single-agent RL problem with a fixed transition kernel.
- Batch supervised learning on immutable labels.
Proof or verification habit for multi-agent rl:
Analyze the joint policy trajectory, not only individual losses.
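A minimal sketch of that nonstationarity, with assumed Q-learning hyperparameters: agent 2 learns action values while agent 1's policy is switched partway through, standing in for "another learner changed".

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a1, a2):
    # Common payoff: 1 when the two actions match, else 0.
    return 1.0 if a1 == a2 else 0.0

q2 = np.zeros(2)          # agent 2's action values
alpha, eps = 0.1, 0.1     # learning rate and exploration rate

for t in range(5000):
    # Agent 1's policy changes halfway through: it mostly plays 0, then
    # mostly plays 1. From agent 2's point of view the "environment"
    # has shifted even though reward() itself never changed.
    preferred = 0 if t < 2500 else 1
    a1 = preferred if rng.random() > eps else int(rng.integers(2))
    a2 = int(np.argmax(q2)) if rng.random() > eps else int(rng.integers(2))
    q2[a2] += alpha * (reward(a1, a2) - q2[a2])
    if t in (2499, 4999):
        print(f"t = {t}: agent 2's Q-values = {np.round(q2, 2)}")

# The values agent 2 learned in the first phase are overturned in the
# second: each learner's policy is part of every other learner's data.
```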
- Single-agent optimization: choose theta to minimize L(theta).
- Game-theoretic optimization: choose pi_i while others choose pi_-i.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, multi-agent RL is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Agentic LLM systems make multi-agent math practical: prompts, tools, memory, and policies interact in a shared state.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using multi-agent RL responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask which part of another agent's behavior enters this agent's observation or reward.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Multi-agent RL gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.2 Self-Play Systems
Self-play systems belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The best-response condition gives the mathematical handle for self-play systems: pi_i is a best response to pi_-i when u_i(pi_i, pi_-i) >= u_i(pi_i', pi_-i) for every alternative pi_i'. In game theory, this condition should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Many-agent learning means every learner's policy is part of the environment seen by the others.
Worked reading.
If agent 1 changes its policy, agent 2's data distribution changes even when the physical simulator is unchanged. That is the core nonstationarity of multi-agent learning.
Three examples of self-play systems:
- Self-play agents improving by training against earlier or current versions.
- LLM tool agents changing each other's context and options.
- A routing marketplace where traffic shifts after one provider changes quality.
Two non-examples clarify the boundary:
- A single-agent RL problem with a fixed transition kernel.
- Batch supervised learning on immutable labels.
Proof or verification habit for self-play systems:
Analyze the joint policy trajectory, not only individual losses.
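A minimal sketch using matching pennies, a standard zero-sum example: fictitious-play self-play never settles on a pure action and instead drives the empirical strategies toward the mixed equilibrium.

```python
import numpy as np

# Row player's payoff in matching pennies; the column player receives
# the negative, so the game is zero-sum with no pure equilibrium.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

counts_row = np.ones(2)   # pseudo-counts of each side's past actions
counts_col = np.ones(2)

for _ in range(20_000):
    # Each side best-responds to the opponent's empirical mixture.
    br_row = int(np.argmax(A @ (counts_col / counts_col.sum())))
    br_col = int(np.argmax(-(counts_row / counts_row.sum()) @ A))
    counts_row[br_row] += 1.0
    counts_col[br_col] += 1.0

print("row empirical strategy:", np.round(counts_row / counts_row.sum(), 3))
print("col empirical strategy:", np.round(counts_col / counts_col.sum(), 3))

# No pure action is stable; the empirical strategies approach the mixed
# equilibrium (0.5, 0.5), which is the object self-play actually finds.
```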
- Single-agent optimization: choose theta to minimize L(theta).
- Game-theoretic optimization: choose pi_i while others choose pi_-i.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, self-play systems are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Agentic LLM systems make multi-agent math practical: prompts, tools, memory, and policies interact in a shared state.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using self-play systems responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask which part of another agent's behavior enters this agent's observation or reward.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Self-play systems give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.3 Tool-Using LLM Swarms
Tool-using LLM swarms belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The best-response condition gives the mathematical handle for tool-using LLM swarms: pi_i is a best response to pi_-i when u_i(pi_i, pi_-i) >= u_i(pi_i', pi_-i) for every alternative pi_i'. In game theory, this condition should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Generative, evaluation, and deployment games arise when model behavior changes in response to the measurement or defense mechanism.
Worked reading.
In a GAN, the discriminator improves its classifier while the generator improves samples to fool it. In red-team evaluation, the attacker improves examples after seeing failures of the defense.
Three examples of tool-using LLM swarms:
- GAN generator-discriminator training.
- Jailbreak discovery against a deployed policy layer.
- Benchmark gaming where systems optimize for the public metric instead of the intended task.
Two non-examples clarify the boundary:
- One-time evaluation on a frozen hidden test set.
- A content filter measured only against historical prompts.
Proof or verification habit for tool-using LLM swarms:
The mathematical proof obligation is to identify the adaptive loop and the payoff each side optimizes.
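A minimal sketch of that loop with an assumed one-dimensional attack model: the attacker best-responds to whatever threshold the defender commits to, and the defense's real score is the value of that best response.

```python
import numpy as np

rng = np.random.default_rng(1)
candidate_attacks = rng.uniform(0.0, 1.0, size=200)   # attack intensities

def attacker_payoff(x, threshold):
    # Attacks at or above the threshold are blocked; attacks below it
    # slip through, and stronger ones are worth more to the attacker.
    return x if x < threshold else 0.0

def exploitability(threshold):
    # Value of the attacker's best response to a fixed, observed defense.
    return max(attacker_payoff(x, threshold) for x in candidate_attacks)

for threshold in (0.9, 0.5, 0.1):
    print(f"defense threshold {threshold}: "
          f"best-response attack value = {exploitability(threshold):.3f}")

# Scoring the defense against yesterday's attacks understates this loop:
# the relevant number is the best response's value, and it moves
# whenever the defense moves.
```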
- Single-agent optimization: choose theta to minimize L(theta).
- Game-theoretic optimization: choose pi_i while others choose pi_-i.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, tool-using LLM swarms are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many LLM safety and evaluation failures are game failures: optimizing the metric changes the population of attempts.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using tool-using LLM swarms responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask who can observe the metric, adapt to it, and benefit from adaptation.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Tool-using LLM swarms give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.4 Market-Style Model Routing
Market-style model routing belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The best-response condition gives the mathematical handle for market-style model routing: pi_i is a best response to pi_-i when u_i(pi_i, pi_-i) >= u_i(pi_i', pi_-i) for every alternative pi_i'. In game theory, this condition should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Generative, evaluation, and deployment games arise when model behavior changes in response to the measurement or defense mechanism.
Worked reading.
In a GAN, the discriminator improves its classifier while the generator improves samples to fool it. In red-team evaluation, the attacker improves examples after seeing failures of the defense.
Three examples of market-style model routing:
- GAN generator-discriminator training.
- Jailbreak discovery against a deployed policy layer.
- Benchmark gaming where systems optimize for the public metric instead of the intended task.
Two non-examples clarify the boundary:
- One-time evaluation on a frozen hidden test set.
- A content filter measured only against historical prompts.
Proof or verification habit for market-style model routing:
The mathematical proof obligation is to identify the adaptive loop and the payoff each side optimizes.
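A minimal sketch of the routing side of this loop, with an assumed softmax router and illustrative quality scores: when one provider improves, every provider's traffic share moves.

```python
import numpy as np

def route(quality, temperature=0.1):
    """Softmax router: traffic share rises sharply with measured quality."""
    logits = np.asarray(quality) / temperature
    w = np.exp(logits - logits.max())
    return w / w.sum()

quality_before = [0.70, 0.65, 0.60]
quality_after  = [0.70, 0.78, 0.60]   # provider 1 improves its quality

print("traffic shares before:", np.round(route(quality_before), 3))
print("traffic shares after: ", np.round(route(quality_after), 3))

# Provider 0 changed nothing, yet its traffic (and so its payoff) moved
# because another player acted. Any stability claim about a routing
# policy must be checked against the router's response to deviations,
# not against a fixed traffic split.
```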
- Single-agent optimization: choose theta to minimize L(theta).
- Game-theoretic optimization: choose pi_i while others choose pi_-i.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, market-style model routing is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many LLM safety and evaluation failures are game failures: optimizing the metric changes the population of attempts.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using market-style model routing responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask who can observe the metric, adapt to it, and benefit from adaptation.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Market-style model routing gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.5 Cooperative Safety and Oversight
Cooperative safety and oversight belong to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The best-response condition gives the mathematical handle for cooperative safety and oversight: pi_i is a best response to pi_-i when u_i(pi_i, pi_-i) >= u_i(pi_i', pi_-i) for every alternative pi_i'. In game theory, this condition should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Communication changes the information structure of a game; credit assignment changes how global outcomes are mapped back to individual actions.
Worked reading.
In a common-payoff team, all agents may receive the same reward, but each still needs enough signal to know which local decision helped.
Three examples of cooperative safety and oversight:
- Agents exchange intermediate plans before tool use.
- A debate system where messages reveal evidence.
- A cooperative safety monitor assigns responsibility to specialized agents.
Two non-examples clarify the boundary:
- A hidden side channel not included in the game model.
- A global score used as if it directly explains each agent's contribution.
Proof or verification habit for cooperative safety and oversight:
Model messages as actions, observations, or signals, then analyze how they alter feasible strategies and incentives.
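A minimal sketch of oversight as a common-payoff game, with assumed risk rates and signal noise: the monitor's veto is an action, and its value is visible only through the shared payoff.

```python
import random

random.seed(0)

def team_payoff(risky, executed):
    if not executed:
        return 0.0                     # vetoed: no gain, no harm
    return -5.0 if risky else 1.0      # shared by actor and monitor alike

def average_payoff(n_rounds, veto_threshold):
    total = 0.0
    for _ in range(n_rounds):
        risky = random.random() < 0.2                            # base rate of risk
        signal = (0.8 if risky else 0.2) + random.gauss(0.0, 0.15)
        executed = signal < veto_threshold                       # monitor's policy
        total += team_payoff(risky, executed)
    return total / n_rounds

for threshold in (1.5, 0.5, 0.0):      # never veto / calibrated / always veto
    print(f"veto threshold {threshold}: "
          f"average team payoff = {average_payoff(10_000, threshold):.3f}")

# The calibrated monitor beats both extremes: oversight is itself a
# strategy inside the game, rewarded only through the common payoff.
```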
- Single-agent optimization: choose theta to minimize L(theta).
- Game-theoretic optimization: choose pi_i while others choose pi_-i.
- Adversarial objective: choose a defense against the best attack.
- Multi-agent learning: policies change the environment itself.
In AI systems, cooperative safety and oversight are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
LLM-agent systems often fail at interfaces before they fail at individual reasoning, so communication is a mathematical object, not an implementation detail.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using cooperative safety and oversight responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Specify who observes each message and when.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Cooperative safety and oversight give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |