
Nash Equilibria, Part 3: Pure-Strategy Equilibria and Mixed-Strategy Equilibria

3. Pure-Strategy Equilibria

Pure-Strategy Equilibria develops the part of Nash equilibria specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.

3.1 Best-response tables

Best-response tables belong to the canonical scope of Nash Equilibria. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is normal-form games, pure and mixed strategies, best responses, Nash equilibria, existence, computation, and AI equilibrium failures. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

\boldsymbol{\pi}_i \in \Delta(A_i), \qquad \sum_{a_i \in A_i} \pi_i(a_i) = 1.

The formula gives the mathematical handle for best-response tables. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

A pure strategy chooses one action deterministically. Best-response tables mark which pure actions are optimal against each opponent action.

Worked reading.

For every column, highlight the row entries with maximal row payoff; for every row, highlight the column entries with maximal column payoff. A cell highlighted for both players is a pure Nash equilibrium.

Three examples of best-response tables:

  1. A deterministic guardrail mode.
  2. A fixed model route.
  3. A single action chosen by each player in a coordination game.

Two non-examples clarify the boundary:

  1. A probability distribution over actions.
  2. A randomized audit policy.

Proof or verification habit for best-response tables:

The proof is finite enumeration: compare every row within a column and every column within a row.
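
As a minimal executable sketch of that enumeration, the snippet below marks best responses in a small two-player game. The NumPy payoff matrices are illustrative assumptions, not values taken from this chapter.

```python
import numpy as np

# Illustrative 2x2 coordination-style payoffs; rows are player 1's actions,
# columns are player 2's actions. A[i, j] pays player 1, B[i, j] pays player 2.
A = np.array([[2, 0],
              [0, 1]])
B = np.array([[2, 0],
              [0, 1]])

# Best-response table: mark player 1's best rows within each column,
# and player 2's best columns within each row.
br1 = A == A.max(axis=0, keepdims=True)   # compare every row within a column
br2 = B == B.max(axis=1, keepdims=True)   # compare every column within a row

# A cell marked for both players is a pure Nash equilibrium.
pure_nash = np.argwhere(br1 & br2)
print(pure_nash)  # [[0 0], [1 1]] for these payoffs
```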

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, best-response tables are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Pure-strategy analysis is the fastest sanity check before moving to mixed strategies or dynamic learning.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using best-response tables responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: If no cell is jointly best-response highlighted, search for mixed equilibria instead of forcing a pure one.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Best-response tables give the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

3.2 Dominant strategies

Dominant strategies belong to the canonical scope of Nash Equilibria. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is normal-form games, pure and mixed strategies, best responses, Nash equilibria, existence, computation, and AI equilibrium failures. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

u_i(\boldsymbol{\pi}_i^*, \boldsymbol{\pi}_{-i}^*) \ge u_i(\boldsymbol{\pi}_i, \boldsymbol{\pi}_{-i}^*) \quad \forall\, \boldsymbol{\pi}_i.

The formula gives the mathematical handle for dominant strategies. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

A dominant strategy is best regardless of the other players' actions. It is stronger than being a best response to one particular opponent strategy.

Worked reading.

If action D gives player 1 at least as much payoff as action C for every opponent action, and strictly more for at least one opponent action, then D weakly dominates C.

Three examples of dominant strategies:

  1. Always rejecting a malicious input class when the false-positive cost is explicitly lower than the exploit cost.
  2. A bid strategy that wins under every competitor bid in a simplified auction.
  3. A safe fallback tool that dominates risky tool use under every audited state.

Two non-examples clarify the boundary:

  1. An action that is best only against the current opponent.
  2. An action with higher average payoff but lower payoff in some opponent response.

Proof or verification habit for dominant strategies:

Prove dominance by comparing payoffs row-by-row or column-by-column across every opponent action.
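
A sketch of that habit, assuming an illustrative row-player payoff matrix; the helper name `dominates` is ours, not from the text.

```python
import numpy as np

def dominates(payoffs, a, b):
    """Does row action `a` dominate row action `b` for the row player?

    Compares payoffs entry-by-entry across every opponent column,
    exactly as the row-by-row verification habit prescribes.
    """
    ge = payoffs[a] >= payoffs[b]
    gt = payoffs[a] > payoffs[b]
    if gt.all():
        return "strict"
    if ge.all() and gt.any():
        return "weak"
    return None  # neither: `a` is worse against some opponent action

# Illustrative prisoner's-dilemma row payoffs: action 1 (defect) dominates 0.
U = np.array([[3, 0],
              [5, 1]])
print(dominates(U, 1, 0))  # "strict": 5 > 3 and 1 > 0
```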

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, dominant strategies are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Dominance is rare in rich AI systems, but when present it simplifies analysis before searching for equilibria.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using dominant strategies responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Do not infer dominance from one cell or from expected payoff under one distribution.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Dominant strategies give the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

3.3 Coordination games

Coordination games belong to the canonical scope of Nash Equilibria. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is normal-form games, pure and mixed strategies, best responses, Nash equilibria, existence, computation, and AI equilibrium failures. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

G = (N, (A_i)_{i \in N}, (u_i)_{i \in N}).

The formula gives the mathematical handle for coordination games. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

Coordination games contain multiple stable outcomes, so the mathematical problem is not only existence but selection.

Worked reading.

If two agents both prefer choosing the same protocol, both (A,A) and (B,B) can be equilibria. Which one appears may depend on initialization, communication, history, or focal points.

Three examples of coordination games:

  1. LLM agents agree on a tool-call schema.
  2. Distributed learners converge to a shared convention for labels.
  3. A team of models selects the same plan representation before acting.

Two non-examples clarify the boundary:

  1. A zero-sum contest where one player's gain is the other's loss.
  2. A single model choosing a format without another agent needing to match it.

Proof or verification habit for coordination games:

Equilibrium verification is easy; equilibrium selection is the hard part. Show each matched profile is stable, then analyze basins, signals, or welfare.
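
A minimal verification sketch, assuming illustrative matching payoffs; it confirms both matched profiles are stable while saying nothing about which one is selected.

```python
import numpy as np

# Illustrative pure-coordination payoffs: both players want to match.
A = np.array([[1, 0],
              [0, 1]])  # player 1
B = A.copy()            # player 2 (identical interests)

def is_pure_nash(A, B, i, j):
    # Stability means no profitable unilateral deviation for either player.
    return A[i, j] >= A[:, j].max() and B[i, j] >= B[i, :].max()

for i in range(2):
    for j in range(2):
        if is_pure_nash(A, B, i, j):
            print((i, j))  # (0, 0) and (1, 1): two equilibria, so a selection problem
```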

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, coordination games are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Coordination failures are common in agentic systems because technically correct local policies can still fail to align interfaces.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using coordination games responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Ask whether agents need the same convention, and whether the convention is observable before action.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Coordination games give the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

3.4 Pareto inefficiency

Pareto inefficiency belongs to the canonical scope of Nash Equilibria. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is normal-form games, pure and mixed strategies, best responses, Nash equilibria, existence, computation, and AI equilibrium failures. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

u_i(a_i, a_{-i}) \ge u_i(a_i', a_{-i}) \quad \forall\, a_i' \in A_i.

The formula gives the mathematical handle for Pareto inefficiency. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

Pareto and welfare criteria evaluate outcomes across players; equilibrium evaluates unilateral incentives.

Worked reading.

An outcome is Pareto inefficient if another feasible outcome makes at least one player better off and no player worse off. A Nash equilibrium can fail this test.

Three examples of Pareto inefficiency:

  1. Mutual cooperation in a prisoner's dilemma improves both players but may be unstable.
  2. A routing policy raises total quality but gives one provider incentive to deviate.
  3. A safety policy improves social welfare but reduces one actor's private payoff.

Two non-examples clarify the boundary:

  1. A unilateral deviation check.
  2. A fairness claim without specifying the social objective.

Proof or verification habit for Pareto inefficiency:

Separate the two predicates: first test deviations for equilibrium, then compare feasible outcomes for welfare.
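
The separation can be made executable. The sketch below, assuming standard prisoner's-dilemma payoffs, tests each profile with the deviation predicate first and the Pareto predicate second.

```python
import numpy as np
from itertools import product

# Illustrative prisoner's dilemma; 0 = cooperate, 1 = defect.
A = np.array([[3, 0], [5, 1]])   # row player's payoffs
B = np.array([[3, 5], [0, 1]])   # column player's payoffs

def is_nash(i, j):
    # Equilibrium predicate: no profitable unilateral deviation.
    return A[i, j] >= A[:, j].max() and B[i, j] >= B[i, :].max()

def is_pareto_efficient(i, j):
    # Welfare predicate: inefficient if some other feasible profile
    # helps at least one player and hurts no one.
    for k, l in product(range(2), repeat=2):
        if (A[k, l] >= A[i, j] and B[k, l] >= B[i, j]
                and (A[k, l] > A[i, j] or B[k, l] > B[i, j])):
            return False
    return True

for i, j in product(range(2), repeat=2):
    print((i, j),
          "Nash" if is_nash(i, j) else "not Nash",
          "Pareto-efficient" if is_pareto_efficient(i, j) else "Pareto-inefficient")
# (1, 1) is the unique Nash profile yet Pareto-inefficient: (0, 0) improves both.
```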

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, Pareto inefficiency is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

AI alignment often lives in the gap between private incentive and social objective, so this distinction is not philosophical decoration.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using Pareto inefficiency responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: If an equilibrium is bad, changing incentives or constraints is usually required; wishing for cooperation is not a proof.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Pareto inefficiency gives the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

3.5 No-pure-equilibrium examples

No-pure-equilibrium examples belong to the canonical scope of Nash Equilibria. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is normal-form games, pure and mixed strategies, best responses, Nash equilibria, existence, computation, and AI equilibrium failures. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

\boldsymbol{\pi}_i \in \Delta(A_i), \qquad \sum_{a_i \in A_i} \pi_i(a_i) = 1.

The formula gives the mathematical handle for no-pure-equilibrium examples. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

A Nash equilibrium is a profile of strategies where no player can improve by changing its own strategy while all other strategies remain fixed.

Worked reading.

In the prisoner's dilemma payoff convention, mutual defection can be a Nash equilibrium even when mutual cooperation is better for both players. This is the central warning: stability and desirability are different properties. By contrast, in matching pennies every pure profile leaves one player with a profitable deviation, so no pure equilibrium exists at all.

Three examples of equilibrium profiles that pass the deviation check:

  1. A self-play policy pair where neither side has a profitable unilateral exploit.
  2. A GAN fixed point where the generator distribution matches data and the discriminator cannot improve classification.
  3. A routing market where no model provider benefits from changing only its bid.

Two non-examples clarify the boundary:

  1. A high-welfare outcome with a profitable unilateral deviation.
  2. A training checkpoint with low loss but a large best-response exploit.

Proof or verification habit for no-pure-equilibrium examples:

The proof is a universal deviation check: for each player i, hold \pi_{-i}^* fixed and show u_i(\pi_i^*, \pi_{-i}^*) \ge u_i(\pi_i, \pi_{-i}^*) for all allowed \pi_i.
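
A sketch of that deviation check over pure profiles, using the standard matching-pennies payoffs as an assumed example; the empty result is exactly what "no pure equilibrium" means.

```python
import numpy as np
from itertools import product

# Matching pennies, zero-sum: row player wants to match, column player to mismatch.
A = np.array([[1, -1], [-1, 1]])
B = -A

def pure_nash_profiles(A, B):
    profiles = []
    for i, j in product(range(A.shape[0]), range(A.shape[1])):
        # Universal deviation check: hold the opponent fixed, try every deviation.
        if A[i, j] >= A[:, j].max() and B[i, j] >= B[i, :].max():
            profiles.append((i, j))
    return profiles

print(pure_nash_profiles(A, B))  # []: every cell gives someone a profitable deviation
```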

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, no-pure-equilibrium examples are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

For AI agents, Nash is a stability diagnostic. It does not guarantee safety, alignment, fairness, or global efficiency unless those objectives are encoded in the game.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using no-pure-equilibrium examples responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Ask: if one deployed model, user, or attacker changed behavior alone, would it gain?

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. No-pure-equilibrium examples give the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

4. Mixed-Strategy Equilibria

Mixed-Strategy Equilibria develops the part of Nash equilibria specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.

4.1 Probability simplex

The probability simplex belongs to the canonical scope of Nash Equilibria. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is normal-form games, pure and mixed strategies, best responses, Nash equilibria, existence, computation, and AI equilibrium failures. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

u_i(\boldsymbol{\pi}_i^*, \boldsymbol{\pi}_{-i}^*) \ge u_i(\boldsymbol{\pi}_i, \boldsymbol{\pi}_{-i}^*) \quad \forall\, \boldsymbol{\pi}_i.

The formula gives the mathematical handle for probability simplex. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

A mixed strategy is a probability distribution over actions. In equilibrium, actions used with positive probability must usually give the same expected payoff; otherwise probability can move to the better action.

Worked reading.

In matching pennies, the row player is indifferent only when the column player randomizes heads and tails equally. The same calculation makes the column player indifferent, giving the (1/2, 1/2) equilibrium.

Three examples of probability simplex:

  1. Randomized audits that make attackers uncertain.
  2. Stochastic decoding policies that prevent deterministic exploitation.
  3. Exploration policies in self-play where pure repetition would be exploited.

Two non-examples clarify the boundary:

  1. Adding noise after choosing a deterministic losing action.
  2. A distribution that assigns probability to an action with strictly lower payoff while another supported action is better.

Proof or verification habit for probability simplex:

Set expected payoffs of supported actions equal, solve for probabilities, then verify unsupported actions are not better.
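
As a minimal sketch of that recipe for matching pennies, with the payoffs assumed as in the standard convention:

```python
import numpy as np

# Matching pennies payoffs for the row player; the game is zero-sum.
A = np.array([[1, -1], [-1, 1]])

# Indifference: find the column mix (q, 1-q) equalizing the row player's
# expected payoff across both supported rows. For a 2x2 game this is one
# linear equation in q: a*q + b*(1-q) = c*q + d*(1-q).
a, b = A[0]
c, d = A[1]
q = (d - b) / (a - b - c + d)
print(q)  # 0.5: the column player must mix 50/50 to keep the row player indifferent

# Verify: both rows now earn the same expected payoff, so any row mix is a
# best response, and no unsupported action exists in a 2-action game.
mix = np.array([q, 1 - q])
print(A @ mix)  # [0. 0.]
```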

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, the probability simplex is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Mixed strategies explain why robust systems often randomize: predictability can be a vulnerability when opponents adapt.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using probability simplex responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Check both support equality and off-support inequalities.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. The probability simplex gives the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

4.2 Indifference principle

The indifference principle belongs to the canonical scope of Nash Equilibria. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is normal-form games, pure and mixed strategies, best responses, Nash equilibria, existence, computation, and AI equilibrium failures. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

G = (N, (A_i)_{i \in N}, (u_i)_{i \in N}).

The formula gives the mathematical handle for indifference principle. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

A mixed strategy is a probability distribution over actions. In equilibrium, actions used with positive probability must usually give the same expected payoff; otherwise probability can move to the better action.

Worked reading.

In matching pennies, the row player is indifferent only when the column player randomizes heads and tails equally. The same calculation makes the column player indifferent, giving the (1/2, 1/2) equilibrium.

Three examples of indifference principle:

  1. Randomized audits that make attackers uncertain.
  2. Stochastic decoding policies that prevent deterministic exploitation.
  3. Exploration policies in self-play where pure repetition would be exploited.

Two non-examples clarify the boundary:

  1. Adding noise after choosing a deterministic losing action.
  2. A distribution that assigns probability to an action with strictly lower payoff while another supported action is better.

Proof or verification habit for indifference principle:

Set expected payoffs of supported actions equal, solve for probabilities, then verify unsupported actions are not better.
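
The same calculation generalizes to any 2x2 game, with the caveat that the closed-form mix is only a candidate: it must land strictly inside the simplex and the denominators must be nonzero. A sketch; the helper name `interior_mixed_2x2` is ours, not from the text.

```python
import numpy as np

def interior_mixed_2x2(A, B):
    """Candidate fully mixed equilibrium of a 2x2 game via indifference.

    The row mix p makes the COLUMN player indifferent, and the column mix q
    makes the ROW player indifferent: each player's randomization is pinned
    down by the opponent's payoffs. Assumes nonzero denominators.
    """
    q = (A[1, 1] - A[0, 1]) / (A[0, 0] - A[0, 1] - A[1, 0] + A[1, 1])
    p = (B[1, 1] - B[1, 0]) / (B[0, 0] - B[0, 1] - B[1, 0] + B[1, 1])
    return p, q  # valid only if both land strictly inside (0, 1)

# Matching pennies again: the unique equilibrium is fully mixed.
A = np.array([[1, -1], [-1, 1]])
print(interior_mixed_2x2(A, -A))  # (0.5, 0.5)
```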

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, the indifference principle is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Mixed strategies explain why robust systems often randomize: predictability can be a vulnerability when opponents adapt.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using indifference principle responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Check both support equality and off-support inequalities.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. The indifference principle gives the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

4.3 Matching pennies

Matching pennies belongs to the canonical scope of Nash Equilibria. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is normal-form games, pure and mixed strategies, best responses, Nash equilibria, existence, computation, and AI equilibrium failures. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

u_i(a_i, a_{-i}) \ge u_i(a_i', a_{-i}) \quad \forall\, a_i' \in A_i.

The formula gives the mathematical handle for matching pennies. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

A mixed strategy is a probability distribution over actions. In equilibrium, actions used with positive probability must usually give the same expected payoff; otherwise probability can move to the better action.

Worked reading.

In matching pennies, the row player is indifferent only when the column player randomizes heads and tails equally. The same calculation makes the column player indifferent, giving the (1/2, 1/2) equilibrium.

Three examples of matching pennies:

  1. Randomized audits that make attackers uncertain.
  2. Stochastic decoding policies that prevent deterministic exploitation.
  3. Exploration policies in self-play where pure repetition would be exploited.

Two non-examples clarify the boundary:

  1. Adding noise after choosing a deterministic losing action.
  2. A distribution that assigns probability to an action with strictly lower payoff while another supported action is better.

Proof or verification habit for matching pennies:

Set expected payoffs of supported actions equal, solve for probabilities, then verify unsupported actions are not better.
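
One way to make the predictability point concrete is to measure what a best-responding opponent extracts from a given row mix. The sweep below uses the standard matching-pennies payoffs as an assumed example.

```python
import numpy as np

# Matching pennies payoffs for the row player (column player gets the negative).
A = np.array([[1, -1], [-1, 1]])

def guaranteed_payoff(p):
    """Row player's payoff against a best-responding column player.

    The column player observes the mix (p, 1-p) and, since the game is
    zero-sum, picks the column minimizing the row player's expected payoff.
    """
    row_mix = np.array([p, 1 - p])
    expected_per_column = row_mix @ A
    return expected_per_column.min()

for p in [1.0, 0.8, 0.5]:
    print(p, guaranteed_payoff(p))
# 1.0 -> -1.0 (fully predictable, fully exploited)
# 0.8 -> -0.6
# 0.5 ->  0.0 (the equilibrium mix guarantees the game value)
```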

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, matching pennies is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Mixed strategies explain why robust systems often randomize: predictability can be a vulnerability when opponents adapt.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using matching pennies responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Check both support equality and off-support inequalities.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Matching pennies gives the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

4.4 Support enumeration preview

Support enumeration preview belongs to the canonical scope of Nash Equilibria. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is normal-form games, pure and mixed strategies, best responses, Nash equilibria, existence, computation, and AI equilibrium failures. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

\boldsymbol{\pi}_i \in \Delta(A_i), \qquad \sum_{a_i \in A_i} \pi_i(a_i) = 1.

The formula gives the mathematical handle for support enumeration preview. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

A mixed strategy is a probability distribution over actions. In equilibrium, actions used with positive probability must usually give the same expected payoff; otherwise probability can move to the better action.

Worked reading.

In matching pennies, the row player is indifferent only when the column player randomizes heads and tails equally. The same calculation makes the column player indifferent, giving the (1/2, 1/2) equilibrium.

Three examples of support enumeration preview:

  1. Randomized audits that make attackers uncertain.
  2. Stochastic decoding policies that prevent deterministic exploitation.
  3. Exploration policies in self-play where pure repetition would be exploited.

Two non-examples clarify the boundary:

  1. Adding noise after choosing a deterministic losing action.
  2. A distribution that assigns probability to an action with strictly lower payoff while another supported action is better.

Proof or verification habit for support enumeration preview:

Set expected payoffs of supported actions equal, solve for probabilities, then verify unsupported actions are not better.
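
That habit is precisely support enumeration in miniature. A sketch for a 2x2 game: check the four size-one supports by deviation, then solve the full-support indifference equations and keep the solution only if it is feasible. The battle-of-the-sexes-style payoffs are illustrative assumptions.

```python
import numpy as np
from itertools import product

# Illustrative 2x2 game with two pure equilibria plus one fully mixed
# equilibrium that only support enumeration finds.
A = np.array([[2, 0], [0, 1]])   # row player
B = np.array([[1, 0], [0, 2]])   # column player

equilibria = []

# Supports of size 1 for both players: plain pure-profile deviation checks.
for i, j in product(range(2), repeat=2):
    if A[i, j] >= A[:, j].max() and B[i, j] >= B[i, :].max():
        equilibria.append(((i,), (j,)))

# Full supports: solve both indifference equations, then check feasibility.
den_q = A[0, 0] - A[0, 1] - A[1, 0] + A[1, 1]
den_p = B[0, 0] - B[0, 1] - B[1, 0] + B[1, 1]
if den_q != 0 and den_p != 0:
    q = (A[1, 1] - A[0, 1]) / den_q   # column mix making the row player indifferent
    p = (B[1, 1] - B[1, 0]) / den_p   # row mix making the column player indifferent
    if 0 < p < 1 and 0 < q < 1:       # every supported action must be genuinely used
        equilibria.append(((p, 1 - p), (q, 1 - q)))

print(equilibria)
# [((0,), (0,)), ((1,), (1,)), ((2/3, 1/3), (1/3, 2/3))] up to float formatting
```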

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, support enumeration preview is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Mixed strategies explain why robust systems often randomize: predictability can be a vulnerability when opponents adapt.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using support enumeration preview responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Check both support equality and off-support inequalities.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Support enumeration preview gives the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

4.5 Entropy and stochastic policies

Entropy and stochastic policies belong to the canonical scope of Nash Equilibria. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is normal-form games, pure and mixed strategies, best responses, Nash equilibria, existence, computation, and AI equilibrium failures. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

u_i(\boldsymbol{\pi}_i^*, \boldsymbol{\pi}_{-i}^*) \ge u_i(\boldsymbol{\pi}_i, \boldsymbol{\pi}_{-i}^*) \quad \forall\, \boldsymbol{\pi}_i.

The formula gives the mathematical handle for entropy and stochastic policies. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

Entropy measures how spread out a mixed strategy or stochastic policy is. Strategic entropy can make behavior less predictable to opponents.

Worked reading.

A deterministic policy on two actions has entropy 0; a uniform policy has entropy \log 2. In a game, adding entropy changes both exploration and exploitability.

Three examples of entropy and stochastic policies:

  1. Entropy-regularized self-play.
  2. Randomized security audits.
  3. Stochastic decoding that avoids always exposing the same response pattern.

Two non-examples clarify the boundary:

  1. Noise added without considering payoffs.
  2. Randomness that violates constraints.

Proof or verification habit for entropy and stochastic policies:

Check whether the entropy term appears in the payoff, the learning algorithm, or only in the modeler's description.
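 
A small sweep, assuming matching-pennies payoffs, that reports entropy alongside the payoff a best-responding opponent can force; it shows the entropy doing real strategic work rather than decorating the description.

```python
import numpy as np

# Matching pennies for the row player; compare policy entropy against the
# payoff guaranteed when the opponent best-responds (illustrative sweep).
A = np.array([[1, -1], [-1, 1]])

def entropy(mix):
    mix = mix[mix > 0]                      # 0 * log 0 is treated as 0
    return float(-(mix * np.log(mix)).sum())

for p in [1.0, 0.9, 0.7, 0.5]:
    mix = np.array([p, 1 - p])
    guaranteed = (mix @ A).min()            # zero-sum: opponent picks the worst column
    print(f"p={p:.1f}  entropy={entropy(mix):.3f}  guaranteed={guaranteed:+.1f}")
# Entropy rises from 0 to log 2 while the guaranteed payoff climbs to the
# game value 0: here unpredictability is strategically load-bearing.
```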

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, entropy and stochastic policies are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Many AI policies are stochastic, but game theory asks whether that stochasticity improves strategic robustness or just hides deterministic weakness.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using entropy and stochastic policies responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Compare expected payoff and exploitability as entropy changes.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Entropy and stochastic policies give the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective
