Multi-Agent Systems, Part 1: Intuition and Formal Definitions

1. Intuition

Section 1 develops the intuition portion of multi-agent systems specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.

1.1 Many learners sharing one environment

Many learners sharing one environment belongs to the canonical scope of Multi-Agent Systems. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is Markov games, joint actions, multi-agent value functions, nonstationarity, learning dynamics, coordination, and AI-agent systems. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

\mathcal{G} = (N, \mathcal{S}, (A_i)_{i \in N}, P, (r_i)_{i \in N}, \gamma).

The formula gives the mathematical handle for many learners sharing one environment. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
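To make the tuple concrete, here is a minimal sketch in Python of the game object and a tiny synthetic instance; the class name, field names, and payoff numbers are illustrative assumptions, not a fixed API.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

JointAction = Tuple[int, ...]  # one action index per agent

@dataclass
class MarkovGame:
    n_agents: int
    states: List[int]
    n_actions: List[int]                               # |A_i| for each agent i
    P: Callable[[int, JointAction], Dict[int, float]]  # (s, a) -> distribution over s'
    r: Callable[[int, JointAction], List[float]]       # (s, a) -> one reward per agent
    gamma: float

# A tiny two-agent instance with a single state, so payoffs depend
# only on the joint action (a normal-form game in Markov-game clothing).
payoff = {
    (0, 0): (1.0, 1.0),
    (0, 1): (0.0, 0.0),
    (1, 0): (0.0, 0.0),
    (1, 1): (1.0, 1.0),
}

game = MarkovGame(
    n_agents=2,
    states=[0],
    n_actions=[2, 2],
    P=lambda s, a: {0: 1.0},        # one state: the game always stays there
    r=lambda s, a: list(payoff[a]),
    gamma=0.9,
)

Note that the reward function returns a vector, one entry per agent; that single change from the MDP signature is where all the strategic structure enters.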

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

Many-agent learning means every learner's policy is part of the environment seen by the others.

Worked reading.

If agent 1 changes its policy, agent 2's data distribution changes even when the physical simulator is unchanged. That is the core nonstationarity of multi-agent learning.

Three examples of many learners sharing one environment:

  1. Self-play agents improving by training against earlier or current versions.
  2. LLM tool agents changing each other's context and options.
  3. A routing marketplace where traffic shifts after one provider changes quality.

Two non-examples clarify the boundary:

  1. A single-agent RL problem with a fixed transition kernel.
  2. Batch supervised learning on immutable labels.

Proof or verification habit for many learners sharing one environment:

Analyze the joint policy trajectory, not only individual losses.

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, many learners sharing one environment is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Agentic LLM systems make multi-agent math practical: prompts, tools, memory, and policies interact in a shared state.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using many learners sharing one environment responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Ask which part of another agent's behavior enters this agent's observation or reward.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Many learners sharing one environment gives the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

1.2 Nonstationarity

\mathbf{a}_t = (a_{1,t}, \ldots, a_{n,t}).

The joint action $\mathbf{a}_t$ is the mathematical handle for nonstationarity: whenever one agent changes its component of $\mathbf{a}_t$, every other agent's effective environment changes with it.

Operational definition.

Learning dynamics study how strategies move over time, not just where equilibrium points are located.

Worked reading.

In fictitious play, each player tracks empirical frequencies of the opponent's past actions and best-responds to those beliefs.
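A minimal executable sketch of fictitious play on rock-paper-scissors, in the spirit of the notebook plan; the smoothing constants and iteration count are illustrative assumptions.

import numpy as np

# Rock-paper-scissors payoff for the row player; the column player gets -A.
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

counts1 = np.ones(3)  # smoothed counts of player 1's past actions
counts2 = np.ones(3)  # smoothed counts of player 2's past actions

for t in range(20000):
    belief_about_2 = counts2 / counts2.sum()
    belief_about_1 = counts1 / counts1.sum()
    a1 = int(np.argmax(A @ belief_about_2))       # best response to beliefs
    a2 = int(np.argmax(-(A.T) @ belief_about_1))  # column player maximizes -A
    counts1[a1] += 1
    counts2[a2] += 1

print(counts1 / counts1.sum())  # empirical mixtures drift toward (1/3, 1/3, 1/3),
print(counts2 / counts2.sum())  # even though per-round play keeps cycling

The per-round actions never settle down; only the empirical frequencies approach the mixed region, which is exactly the distinction between trajectories and equilibrium points.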

Three examples of nonstationarity:

  1. Rock-paper-scissors empirical play approaching the mixed region.
  2. Independent Q-learners chasing each other's changing policies.
  3. GAN gradients rotating around a saddle-like point.

Two non-examples clarify the boundary:

  1. A static equilibrium certificate.
  2. A supervised learner trained against an immutable dataset.

Proof or verification habit for nonstationarity:

Analyze updates as a dynamical system: fixed points, cycles, regret, and exploitability are different diagnostics.

Many AI failures are dynamic failures: the target moves while the learner is trying to fit it.

Local diagnostic: Plot trajectories or regret; do not infer convergence from one snapshot.


1.3 Cooperation vs competition

V_i^{\boldsymbol{\pi}}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r_i(s_t,\mathbf{a}_t) \,\middle|\, s_0=s\right].

The multi-agent value function is the mathematical handle for cooperation vs competition: all agents' values are driven by the same joint actions, so the relation among the rewards $r_i$ determines whether interests align, conflict, or mix.

Operational definition.

Cooperation and competition are payoff-structure choices. Cooperative games align rewards; competitive games put rewards in conflict; mixed-motive games do both.

Worked reading.

A common-payoff team has $r_1 = \cdots = r_n$. A zero-sum game has $r_1 = -r_2$. Most deployed multi-agent systems sit between these extremes.
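A minimal sketch of that classification for two-player matrix games; the function name and example payoffs are illustrative assumptions.

import numpy as np

def classify_payoffs(R1, R2, tol=1e-9):
    """Classify a two-player game by the relation between reward matrices."""
    R1, R2 = np.asarray(R1, float), np.asarray(R2, float)
    if np.allclose(R1, R2, atol=tol):
        return "common-payoff (cooperative team)"
    if np.allclose(R1, -R2, atol=tol):
        return "zero-sum (strictly competitive)"
    return "general-sum (mixed motives)"

print(classify_payoffs([[1, 0], [0, 1]], [[1, 0], [0, 1]]))      # common-payoff
print(classify_payoffs([[0, -1], [1, 0]], [[0, 1], [-1, 0]]))    # zero-sum
print(classify_payoffs([[3, 0], [5, 1]], [[3, 5], [0, 1]]))      # general-sum

Running the classifier before choosing a solution concept is the habit: minimax reasoning suits the second game, team coordination the first, and the third needs general-sum tools.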

Three examples of cooperation vs competition:

  1. A team of tool agents sharing a task reward.
  2. A self-play opponent trained to expose weaknesses.
  3. A market of model providers competing while still serving user welfare.

Two non-examples clarify the boundary:

  1. Calling agents cooperative because they are in the same codebase.
  2. Calling a game competitive because agents are different processes.

Proof or verification habit for cooperation vs competition:

Classify the payoff relation before selecting an equilibrium or learning method.

The same algorithm can look safe or unsafe depending on whether rewards create cooperation, competition, or collusion.

Local diagnostic: Write the reward vector, not just the environment reward.


1.4 Communication and coordination

\boldsymbol{\pi}^* \text{ is Nash if } V_i^{\pi_i^*,\boldsymbol{\pi}_{-i}^*}(s) \ge V_i^{\pi_i,\boldsymbol{\pi}_{-i}^*}(s) \text{ for all agents } i, \text{ all alternatives } \pi_i, \text{ and all states } s.

The Nash condition is the mathematical handle for communication and coordination: a shared convention is stable exactly when no agent gains by unilaterally deviating from it.

Operational definition.

Coordination games contain multiple stable outcomes, so the mathematical problem is not only existence but selection.

Worked reading.

If two agents both prefer choosing the same protocol, both $(A,A)$ and $(B,B)$ can be equilibria. Which one appears may depend on initialization, communication, history, or focal points.
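A minimal sketch that verifies the two matched profiles in the worked reading are pure equilibria; the payoff numbers are illustrative assumptions.

import itertools
import numpy as np

# Protocol-coordination game: actions 0 = A, 1 = B, and matching pays off.
R1 = np.array([[2., 0.],
               [0., 1.]])  # row player's payoff
R2 = np.array([[2., 0.],
               [0., 1.]])  # column player's payoff (common interest here)

def pure_nash(R1, R2):
    eqs = []
    for i, j in itertools.product(range(2), range(2)):
        no_row_deviation = R1[i, j] >= R1[:, j].max()
        no_col_deviation = R2[i, j] >= R2[i, :].max()
        if no_row_deviation and no_col_deviation:
            eqs.append((i, j))
    return eqs

print(pure_nash(R1, R2))  # [(0, 0), (1, 1)]: both (A,A) and (B,B) are stable

The check confirms stability of both matched profiles; it says nothing about which one a learning process will reach, which is the selection problem this subsection emphasizes.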

Three examples of communication and coordination:

  1. LLM agents agree on a tool-call schema.
  2. Distributed learners converge to a shared convention for labels.
  3. A team of models selects the same plan representation before acting.

Two non-examples clarify the boundary:

  1. A zero-sum contest where one player's gain is the other's loss.
  2. A single model choosing a format without another agent needing to match it.

Proof or verification habit for communication and coordination:

Equilibrium verification is easy; equilibrium selection is the hard part. Show each matched profile is stable, then analyze basins, signals, or welfare.

Coordination failures are common in agentic systems because technically correct local policies can still fail to align interfaces.

Local diagnostic: Ask whether agents need the same convention, and whether the convention is observable before action.


1.5 Emergent behavior in AI systems

\mathcal{G} = (N, \mathcal{S}, (A_i)_{i \in N}, P, (r_i)_{i \in N}, \gamma).

The game tuple is the mathematical handle for emergent behavior in AI systems: collective patterns arise from the interaction of the action sets, transition kernel, and rewards, not from any single agent's objective in isolation.

Three examples of emergent behavior in AI systems:

  1. Self-play agents improving by training against earlier or current versions.
  2. LLM tool agents changing each other's context and options.
  3. A routing marketplace where traffic shifts after one provider changes quality.

Two non-examples clarify the boundary:

  1. A single-agent RL problem with a fixed transition kernel.
  2. Batch supervised learning on immutable labels.

Proof or verification habit for emergent behavior in AI systems:

Analyze the joint policy trajectory, not only individual losses.
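A minimal sketch of that habit on matching pennies, assuming two independent gradient-style learners; the learning rate, initialization, and clipping bounds are illustrative. Neither update rule mentions cycling, yet the joint trajectory orbits the mixed point, a small example of emergent dynamics.

import numpy as np

A = np.array([[ 1., -1.],
              [-1.,  1.]])  # player 1's payoff; player 2 receives -A

p, q = 0.9, 0.2  # each player's probability of playing "heads"
lr = 0.05
trajectory = []
for t in range(2000):
    u1 = A @ np.array([q, 1 - q])        # player 1's payoff per own action
    u2 = -(A.T) @ np.array([p, 1 - p])   # player 2's payoff per own action
    p = float(np.clip(p + lr * (u1[0] - u1[1]), 0.01, 0.99))
    q = float(np.clip(q + lr * (u2[0] - u2[1]), 0.01, 0.99))
    trajectory.append((round(p, 3), round(q, 3)))

print(trajectory[::400])  # (p, q) orbits (0.5, 0.5) instead of converging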


2. Formal Definitions

Section 2 develops the formal definitions of multi-agent systems specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.

2.1 Agent set $N$

\mathbf{a}_t = (a_{1,t}, \ldots, a_{n,t}).

The joint action is the mathematical handle for the agent set $N$: it carries one coordinate per player, so the choice of $N$ fixes the very shape of $\mathbf{a}_t$.

Operational definition.

Players, actions, and payoffs define the interface of a game. If any one of them is vague, the equilibrium claim is usually vague too.

Worked reading.

A payoff matrix is a compact table: rows are one player's actions, columns are another player's actions, and entries are utilities or losses induced by the joint action.

Three examples of the agent set $N$:

  1. A row action chooses a defense, while a column action chooses an attack family.
  2. An agent set lists every model or tool-using process that can affect reward.
  3. A utility function converts accuracy, safety, latency, and cost into strategic incentives.

Two non-examples clarify the boundary:

  1. A metric with no actor who optimizes it.
  2. An action that is impossible in deployment but included for convenience.

Proof or verification habit for the agent set $N$:

Before proving anything, audit the model specification: every allowed action must map to a payoff for every player.
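A minimal sketch of that audit for a normal-form specification; the player names, actions, and payoffs are illustrative assumptions.

import itertools

def audit_game(players, actions, payoffs):
    """Check that every joint action maps to one payoff per player."""
    for joint in itertools.product(*(actions[p] for p in players)):
        if joint not in payoffs:
            raise ValueError(f"no payoff defined for joint action {joint}")
        if len(payoffs[joint]) != len(players):
            raise ValueError(f"payoff vector for {joint} misses a player")
    return True

players = ["defender", "attacker"]
actions = {"defender": ["patch", "monitor"], "attacker": ["exploit", "wait"]}
payoffs = {
    ("patch", "exploit"):   ( 1.0, -1.0),
    ("patch", "wait"):      ( 0.5,  0.0),
    ("monitor", "exploit"): (-1.0,  1.0),
    ("monitor", "wait"):    ( 0.0,  0.0),
}
print(audit_game(players, actions, payoffs))  # True: the specification is total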

Payoff design is AI system design. The game will faithfully optimize the incentives it is given, including bad incentives.

Local diagnostic: Can you name each player, enumerate or parameterize its actions, and compute its payoff from a joint action?


2.2 Joint action $\mathbf{a}$

V_i^{\boldsymbol{\pi}}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r_i(s_t,\mathbf{a}_t) \,\middle|\, s_0=s\right].

The value function is the mathematical handle for the joint action $\mathbf{a}$: rewards and transitions depend on what all agents do together, never on one agent's action in isolation.

Worked reading.

At state $s$, agents sample a joint action $\mathbf{a}$, the environment transitions by $P(s' \mid s, \mathbf{a})$, and each agent receives $r_i(s, \mathbf{a})$.
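A minimal sketch of one step of that loop with tabular $P$ and $r$; the dictionaries below are illustrative assumptions, not data from the chapter.

import random

# P[(s, a)]: distribution over next states; r[(s, a)]: one reward per agent.
P = {
    (0, (0, 0)): {0: 1.0},           (0, (0, 1)): {1: 1.0},
    (0, (1, 0)): {1: 1.0},           (0, (1, 1)): {0: 1.0},
    (1, (0, 0)): {0: 0.5, 1: 0.5},   (1, (0, 1)): {1: 1.0},
    (1, (1, 0)): {1: 1.0},           (1, (1, 1)): {0: 1.0},
}
r = {key: (1.0 if key[0] == 0 else -1.0, 0.5) for key in P}

def step(s, joint_action):
    """Sample s' ~ P(. | s, a) and return (s', reward vector)."""
    dist = P[(s, joint_action)]
    s_next = random.choices(list(dist), weights=list(dist.values()))[0]
    return s_next, r[(s, joint_action)]

s, rewards = step(0, (0, 1))  # agent 1 plays 0, agent 2 plays 1
print(s, rewards)             # next state plus one reward per agent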

Three examples of joint action $\mathbf{a}$:

  1. Two dialogue agents sharing a tool environment.
  2. Robot teams with shared state and individual rewards.
  3. Self-play systems where the opponent policy is part of the transition distribution.

Two non-examples clarify the boundary:

  1. A single-agent MDP with a fixed environment.
  2. A static normal-form game with no state.

Proof or verification habit for joint action $\mathbf{a}$:

Bellman-style reasoning still applies, but values are indexed by both agent and joint policy.

Multi-agent RL inherits all MDP difficulty and adds strategic nonstationarity.



2.3 Reward vector $\mathbf{r}$

\boldsymbol{\pi}^* \text{ is Nash if } V_i^{\pi_i^*,\boldsymbol{\pi}_{-i}^*}(s) \ge V_i^{\pi_i,\boldsymbol{\pi}_{-i}^*}(s) \text{ for all agents } i, \text{ all alternatives } \pi_i, \text{ and all states } s.

The Nash condition is the mathematical handle for the reward vector $\mathbf{r}$: each inequality is stated in one agent's own reward $r_i$, so changing any component of $\mathbf{r}$ changes which joint policies are stable.
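A minimal sketch of the local diagnostic below: given a tabular reward function, decide whether the reward vectors are common, opposed, or mixed. The tables are illustrative assumptions.

def reward_structure(r_table, tol=1e-9):
    """Classify reward vectors as common, opposed (zero-sum), or mixed."""
    vectors = list(r_table.values())
    if all(max(v) - min(v) < tol for v in vectors):
        return "common"
    if all(abs(sum(v)) < tol for v in vectors):
        return "opposed (zero-sum)"
    return "mixed"

r_team  = {(0, (0, 0)): (1.0, 1.0),  (0, (0, 1)): (0.0, 0.0)}
r_duel  = {(0, (0, 0)): (1.0, -1.0), (0, (0, 1)): (-1.0, 1.0)}
r_blend = {(0, (0, 0)): (2.0, 1.0),  (0, (0, 1)): (0.0, 0.0)}
print(reward_structure(r_team), reward_structure(r_duel), reward_structure(r_blend))
# common opposed (zero-sum) mixed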

Local diagnostic: Ask whether rewards are common, opposed, or mixed; the answer changes the solution concept.


2.4 Markov game

\mathcal{G} = (N, \mathcal{S}, (A_i)_{i \in N}, P, (r_i)_{i \in N}, \gamma).

The tuple is the mathematical handle for the Markov game: it fixes the players $N$, the shared state space $\mathcal{S}$, the per-agent action sets $A_i$, the transition kernel $P$, the per-agent rewards $r_i$, and the discount $\gamma$.

Operational definition.

A Markov game extends an MDP by replacing one action and one reward with joint actions and agent-specific rewards.
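A minimal sketch of value evaluation under a fixed joint policy, the Bellman-style habit from earlier in the chapter; the two-state game and the deterministic joint policy are illustrative assumptions.

import numpy as np

states = [0, 1]
pi = {0: (0, 1), 1: (1, 0)}  # a fixed deterministic joint policy
P = {(0, (0, 1)): {1: 1.0}, (1, (1, 0)): {0: 0.3, 1: 0.7}}
r = {(0, (0, 1)): (1.0, 0.0), (1, (1, 0)): (-1.0, 2.0)}
gamma = 0.9

V = np.zeros((2, len(states)))  # V[i, s]: agent i's value at state s
for _ in range(500):            # fixed-point iteration, one sweep per pass
    newV = np.zeros_like(V)
    for s in states:
        a = pi[s]
        for i in range(2):
            newV[i, s] = r[(s, a)][i] + gamma * sum(
                prob * V[i, s2] for s2, prob in P[(s, a)].items())
    V = newV
print(V)  # with the joint policy frozen, this is ordinary policy evaluation

The moment any $\pi_i$ is allowed to change, the iteration above is no longer enough: each agent's value target moves with the others' policies, which is exactly the strategic nonstationarity the definition isolates.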


2.5 Joint policy $\boldsymbol{\pi}$

\mathbf{a}_t = (a_{1,t}, \ldots, a_{n,t}).

The joint action is the mathematical handle for the joint policy $\boldsymbol{\pi}$: at each state, $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n)$ induces a distribution over joint actions, and stability of $\boldsymbol{\pi}$ is judged by whether any single $\pi_i$ can profitably change.
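A minimal sketch of the exploitability check from the chapter's checklist, for a zero-sum matrix game and candidate mixed strategies; the candidate strategies are illustrative assumptions.

import numpy as np

A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])  # payoff to the row player; zero-sum

x = np.array([1/3, 1/3, 1/3])    # candidate row strategy
y = np.array([0.5, 0.25, 0.25])  # candidate column strategy

value = x @ A @ y
row_gain = (A @ y).max() - value  # row's best unilateral improvement
col_gain = value - (x @ A).min()  # column's best unilateral improvement
exploitability = row_gain + col_gain
print(exploitability)  # 0 only at equilibrium; here y is exploitable

This is the computation behind the checklist item "compute best responses or exploitability before claiming stability": a joint policy is Nash exactly when no unilateral deviation gains anything.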

