Adversarial Game Theory: Part 1: Intuition
1. Intuition
This part develops the intuition for adversarial game theory specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
1.1 Attacker-defender thinking
Attacker-defender thinking belongs to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The nested objective min_theta max_{a in A} L(theta, a) gives the mathematical handle for attacker-defender thinking: the defender chooses parameters theta, and the attacker answers with the worst feasible move a from the threat set A. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
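The objects in the table can be pinned down as a small data structure. A minimal Python sketch; the guard-a-channel game and all names here are illustrative, not from the chapter:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MatrixGame:
    """A two-player normal-form game: payoffs[i][a][b] is player i's
    payoff when player 0 plays action a and player 1 plays action b."""
    actions: tuple   # action labels shared by both players
    payoffs: tuple   # (payoff matrix for player 0, payoff matrix for player 1)

# Toy attacker-defender game: the defender picks a channel to guard,
# the attacker picks a channel to hit. Zero-sum: the defender scores +1
# when the attack hits the guarded channel, -1 when it slips through.
guard = MatrixGame(
    actions=("channel_A", "channel_B"),
    payoffs=(((1, -1), (-1, 1)),    # defender's payoffs
             ((-1, 1), (1, -1))),   # attacker's payoffs (the negation)
)
```

Writing the payoffs down explicitly already enforces the table's discipline: the players, actions, and objectives must all be stated before any stability claim can be made.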
Operational definition.
Strategic interaction begins when another decision maker can react to the policy being studied. Stability means the modeled behavior remains defensible after that reaction is allowed.
Worked reading.
Start with one proposed joint behavior. Freeze everyone except one player, compute that player's best alternative, then repeat for every player. If a profitable switch exists, the behavior is not strategically stable.
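The worked reading is directly executable for a finite two-player game. A minimal sketch of the deviation check; the prisoner's dilemma payoffs below are a standard illustration, not from the text:

```python
def has_profitable_deviation(payoffs, profile):
    """payoffs[i][a][b]: payoff to player i at joint actions (a, b).
    profile: the proposed joint behavior (a, b). Returns True if either
    player can gain by switching alone, i.e. the profile is not a pure
    Nash equilibrium of the stated game."""
    a, b = profile
    n_actions = len(payoffs[0])
    # Freeze player 1 at b, scan player 0's alternatives to a.
    if any(payoffs[0][a2][b] > payoffs[0][a][b] for a2 in range(n_actions)):
        return True
    # Freeze player 0 at a, scan player 1's alternatives to b.
    if any(payoffs[1][a][b2] > payoffs[1][a][b] for b2 in range(n_actions)):
        return True
    return False

# Prisoner's dilemma: action 0 = cooperate, 1 = defect.
pd = (((3, 0), (5, 1)),   # row player's payoffs
      ((3, 5), (0, 1)))   # column player's payoffs

# Mutual cooperation admits a profitable switch; mutual defection does not.
```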
Three examples of attacker-defender thinking:
- A guardrail remains effective after attackers see examples of blocked prompts.
- A model-routing policy remains attractive after providers update prices.
- A self-play policy cannot be easily exploited by a newly trained opponent.
Two non-examples clarify the boundary:
- A high average score on a fixed dataset.
- A local minimum of one model's loss with no opponent.
Proof or verification habit for attacker-defender thinking:
The verification habit is adversarial: search for profitable deviations rather than only confirming the proposed behavior works in the original scenario.
- Single-agent optimization: choose theta to minimize L(theta).
- Game-theoretic optimization: choose pi_i while the other players choose pi_-i.
- Adversarial objective: choose the defense against the best attack.
- Multi-agent learning: policies change the environment itself.
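The gap between the first and last regimes can be seen on the simplest saddle objective L(x, y) = x * y, where the defender lowers L in x while the attacker raises it in y. A toy sketch; the step size and horizon are arbitrary:

```python
def frozen_descent(x, lr=0.1, steps=200):
    """Single-agent view: the opponent is frozen at y = 1, so
    dL/dx = 1 and plain descent just walks x downhill."""
    for _ in range(steps):
        x = x - lr * 1.0
    return x

def gradient_play(x, y, lr=0.1, steps=200):
    """Game view: simultaneous gradient descent/ascent on L(x, y) = x * y.
    Each player's update changes the other's effective objective."""
    for _ in range(steps):
        gx, gy = y, x                      # dL/dx = y, dL/dy = x
        x, y = x - lr * gx, y + lr * gy
    return x, y

x_frozen = frozen_descent(1.0)             # descends without resistance
x_end, y_end = gradient_play(1.0, 1.0)     # spirals away from the saddle
```

With the opponent frozen, descent works; with both players adapting, simultaneous gradient play spirals outward from the saddle at (0, 0) instead of settling, which is exactly the "policies change the environment" effect the list describes.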
In AI systems, attacker-defender thinking is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
This is the mathematical shift from offline ML to deployed AI systems where users, competitors, and automated attacks learn from the model.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using attacker-defender thinking responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: State the adaptation channel: what can the other side observe, change, and optimize?
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Attacker-defender thinking gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
1.2 Threat models
A threat model belongs to the canonical scope of Adversarial Game Theory: it is the formal statement of what the adversary is allowed to observe, change, and optimize, together with any budget on those moves. Every robustness claim is relative to such a set; "robust" with no stated threat set is an empty claim. The nested objective min_theta max_{a in A} L(theta, a) makes the dependence explicit: the inner maximum ranges only over the stated feasible set A.
Operational definition.
A threat model defines the attacker's allowed moves. Robust optimization then trains or evaluates against the worst allowed move.
Worked reading.
For an l_inf perturbation set, PGD repeatedly steps in the gradient-sign direction and projects back into the allowed box.
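That reading can be sketched in one dimension; a toy scalar loss stands in for a model, and eps, the step size, and the loss itself are illustrative, not a real image attack:

```python
def pgd_linf(loss_grad, x0, eps, step, iters):
    """Projected gradient ascent for the inner maximization: repeatedly
    step in the gradient-sign direction to raise the loss, then clip the
    point back into the allowed box [x0 - eps, x0 + eps]."""
    x = x0
    for _ in range(iters):
        g = loss_grad(x)
        x = x + step * (1.0 if g > 0 else -1.0)   # ascend the loss
        x = max(x0 - eps, min(x0 + eps, x))       # project into the box
    return x

# Toy loss L(x) = (x - 2)**2 around clean input x0 = 0. Within the box
# [-0.5, 0.5], the worst allowed point is the left edge x = -0.5.
x_adv = pgd_linf(lambda x: 2 * (x - 2), x0=0.0, eps=0.5, step=0.1, iters=20)
```

The attack lands on the boundary point of the feasible box with the highest loss, not on an arbitrary corruption, which is the whole content of the threat model.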
Three examples of threat models:
- Image perturbations bounded by a norm.
- Prompt transformations allowed by a jailbreak policy.
- Retrieval poisoning constrained by an index-insertion budget.
Two non-examples clarify the boundary:
- Any attack the modeler can imagine but has not formalized.
- Random corruption treated as an adaptive attack.
Proof or verification habit for threat models:
The nested objective is proved meaningful only after the feasible attack set is stated. The inner maximum is over that set, not over all possible bad events.
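The same habit is executable: state the feasible set first, then take the inner maximum over it, and note that enlarging the set can only raise the worst case. A toy sketch with an illustrative loss:

```python
def worst_case_loss(loss, attack_set):
    """Inner maximum of the robust objective: the adversary picks the
    worst attack from the *stated* feasible set, nothing outside it."""
    return max(loss(a) for a in attack_set)

loss = lambda a: a * a           # toy loss as a function of attack strength
stated = [-1.0, 0.0, 1.0]        # the formalized threat set
enlarged = stated + [2.0]        # a strictly richer threat model

inner_max_stated = worst_case_loss(loss, stated)      # 1.0
inner_max_enlarged = worst_case_loss(loss, enlarged)  # 4.0
```

The comparison makes the verification habit concrete: a worst-case guarantee is only as strong as the set it was taken over.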
In AI systems, explicit threat models are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Adversarial training improves robustness to the modeled threat, not to every strategic behavior.
Local diagnostic: Write the set before writing the max.
1.3 Strategic adaptation
Strategic adaptation is the dynamic side of attacker-defender thinking: once a policy is deployed, the other decision makers observe it and re-optimize against it, so the environment the policy faces is partly a product of the policy itself. The question shifts from whether a joint behavior is stable on paper to whether it stays defensible while opponents keep adapting.
Operational definition.
Strategic adaptation occurs when another decision maker can observe the deployed policy, change its own behavior, and optimize against what it sees. Stability means the policy remains defensible after that loop is allowed to run.
Worked reading.
Run the deviation check as a process rather than a one-shot test: let each player in turn switch to its best response against the others' current behavior, and watch whether joint play settles, cycles, or drifts. A policy is robust to adaptation only if it survives this loop.
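The loop can be simulated. In matching pennies, alternating pure best responses never settle, a standard example of adaptation producing a chase rather than an equilibrium; a minimal sketch:

```python
def best_response_cycle(steps=8):
    """Alternating pure best responses in matching pennies. The matcher
    wants to match the mismatcher's action; the mismatcher wants to
    differ. Each round, one player best-responds to the other's last move."""
    history = []
    matcher, mismatcher = 0, 0          # actions in {0, 1}
    for t in range(steps):
        if t % 2 == 0:
            mismatcher = 1 - matcher    # mismatcher flips away
        else:
            matcher = mismatcher        # matcher copies
        history.append((matcher, mismatcher))
    return history

play = best_response_cycle()
# Joint play visits all four profiles and returns to where it started:
# naive adaptation produces a cycle, not a stable point.
```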
In AI systems, reasoning about strategic adaptation is unavoidable because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
1.4 Robust vs average-case performance
Robust and average-case performance answer different questions about the same system. Average-case performance scores behavior under a fixed distribution; robust performance scores it under the worst feasible move of an adversary, so the two can disagree sharply. The nested objective min_theta max_{a in A} L(theta, a) is the robust reading: the expectation of offline evaluation is replaced by an inner maximum over the threat set A.
Operational definition.
Average-case performance is the expected loss under a fixed data distribution. Robust performance is the worst loss over the stated threat set; robust optimization trains or evaluates against that worst allowed move.
Worked reading.
Read min_theta max_{a in A} L(theta, a) from the inside out: first the adversary picks the worst allowed move a against the current theta, then the defender chooses theta against that worst case.
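The disagreement between the two scores shows up even in a toy comparison of a brittle high-accuracy model and a flat mediocre one; all numbers here are illustrative:

```python
def avg_loss(loss, inputs):
    """Average-case: mean loss on the clean inputs."""
    return sum(loss(x) for x in inputs) / len(inputs)

def robust_loss(loss, inputs, attacks):
    """Robust: worst loss over the stated attack set, averaged over inputs."""
    return sum(max(loss(x + a) for a in attacks) for x in inputs) / len(inputs)

sharp = lambda x: 0.0 if abs(x) < 0.1 else 1.0   # perfect clean, brittle
flat = lambda x: 0.2                              # mediocre everywhere

inputs = [0.0, 0.0, 0.0]          # clean evaluation points
attacks = [-0.3, 0.0, 0.3]        # stated perturbation set

avg_sharp = avg_loss(sharp, inputs)                  # 0.0
avg_flat = avg_loss(flat, inputs)
rob_sharp = robust_loss(sharp, inputs, attacks)      # 1.0
rob_flat = robust_loss(flat, inputs, attacks)
```

The sharp model wins the average-case comparison and loses the robust one; neither score alone determines the other.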
Proof or verification habit for robust vs average-case performance:
The nested objective is proved meaningful only after the feasible attack set is stated. The inner maximum is over that set, not over all possible bad events.
In AI systems, the robust vs average-case distinction matters because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
1.5 Adversarial AI as game design
Adversarial AI can be read as game design: the defender does not merely pick a policy inside a fixed game, it chooses the game itself, i.e. the action sets, information structure, payoffs, and commitment order that attackers will then optimize against. Guardrails, rate limits, randomized defenses, and evaluation protocols are all rule and payoff choices, and each should be judged by the best response it provokes.
Operational definition.
Game design is the inverse of equilibrium analysis: instead of predicting behavior in a fixed game, choose the rules so that the behavior you want is stable in the game the attacker actually faces.
Worked reading.
Before shipping a defense, run the deviation check on the designed game: freeze the rules, compute the attacker's best response, and ask whether the intended behavior survives it. If the attacker's best response still wins, the design, not the play, is what must change.
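One standard design lever is defense randomization. In a toy 2x2 zero-sum guard-a-channel game, a committed pure strategy is exploitable while a 50/50 mixture guarantees the game's value; a minimal sketch with illustrative payoffs:

```python
def defender_value(p, M):
    """Expected defender payoff when the defender guards channel A with
    probability p and the attacker best-responds, i.e. picks whichever
    column is worse for the defender."""
    col_A = p * M[0][0] + (1 - p) * M[1][0]   # attacker hits channel A
    col_B = p * M[0][1] + (1 - p) * M[1][1]   # attacker hits channel B
    return min(col_A, col_B)

# Defender payoffs: +1 if the attack hits the guarded channel,
# -1 if it slips through the unguarded one.
M = ((1, -1), (-1, 1))

pure = defender_value(1.0, M)    # always guard A: fully exploitable
mixed = defender_value(0.5, M)   # randomize 50/50: guarantees the value
```

The design lesson is that the improvement comes from changing the defender's strategy space (allowing randomization), not from playing the fixed game better.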
In AI systems, treating adversarial AI as game design is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.