Adversarial Game Theory, Part 6: AI Applications to References
6. AI Applications
This part develops the AI applications of adversarial game theory specified by the Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
6.1 Jailbreak Defenses
Jailbreak defenses belong to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The minimax objective $\min_{\theta} \max_{a \in \mathcal{A}} L(\theta, a)$ gives the mathematical handle for jailbreak defenses. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Generative, evaluation, and deployment games arise when model behavior changes in response to the measurement or defense mechanism.
Worked reading.
In a GAN, the discriminator improves its classifier while the generator improves samples to fool it. In red-team evaluation, the attacker improves examples after seeing failures of the defense.
Three examples of such adaptive games:
- GAN generator-discriminator training.
- Jailbreak discovery against a deployed policy layer.
- Benchmark gaming where systems optimize for the public metric instead of the intended task.
Two non-examples clarify the boundary:
- One-time evaluation on a frozen hidden test set.
- A content filter measured only against historical prompts.
Proof or verification habit for jailbreak defenses:
The mathematical proof obligation is to identify the adaptive loop and the payoff each side optimizes.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$
- adversarial objective: choose the defense against the best attack, $\min_{\theta} \max_{a \in \mathcal{A}} L(\theta, a)$
- multi-agent learning: policy updates change the environment the other learners face
In AI systems, jailbreak defenses are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many LLM safety and evaluation failures are game failures: optimizing the metric changes the population of attempts.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
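As a minimal sketch of that style, the toy below sets up a 2x2 zero-sum game between a filter defender and a jailbreak attacker and measures the exploitability of a candidate defense; every payoff number is invented for illustration.

```python
# A minimal sketch with a synthetic 2x2 zero-sum payoff matrix
# (rows: defender actions, columns: attacker actions; entries: defender payoff).
import numpy as np

# Hypothetical payoffs: a strict filter blocks direct attacks but not paraphrases.
#                 direct  paraphrase
A = np.array([[  1.0,   -1.0],   # strict filter
              [ -0.5,    0.5]])  # lenient filter

def exploitability(p):
    """Defender loss against the attacker's best response to mixed strategy p."""
    value_per_attack = p @ A           # defender payoff for each attacker column
    return -(value_per_attack.min())   # attacker picks the worst column for us

# A pure strategy is exploitable; the indifference mix is not.
print(exploitability(np.array([1.0, 0.0])))   # always strict: attacker paraphrases
p_star = np.array([1.0, 2.0]) / 3.0           # makes the attacker indifferent
print(exploitability(p_star))
```

The pure "always strict" policy is fully exploitable, while the indifference mix leaves the attacker nothing to adapt to; that gap is exactly what the checklist below asks you to compute before claiming stability.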
Checklist for using jailbreak defenses responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask who can observe the metric, adapt to it, and benefit from adaptation.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Jailbreak defenses give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.2 Adversarial Training
Adversarial training belongs to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The minimax objective $\min_{\theta} \max_{a \in \mathcal{A}} L(\theta, a)$ gives the mathematical handle for adversarial training. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Adversarial training and governance both treat the opponent as adaptive rather than as a fixed noise source.
Worked reading.
Training solves an approximate inner attack problem, then updates the model on those attacks. Governance designs rules and monitoring under the expectation that actors respond strategically.
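A minimal sketch of that two-step loop, assuming a synthetic linear classifier, a hinge-style loss, and an $\ell_\infty$ threat set; the data and constants are invented for illustration.

```python
# Adversarial training sketch: approximate the inner attack with a one-step
# gradient-sign perturbation, then update the model on the attacked points.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])          # synthetic labels
w, eps, lr = np.zeros(2), 0.2, 0.1

def input_grad(w, x, yi):
    # Gradient of the hinge loss max(0, 1 - yi * w.x) with respect to the input.
    return -yi * w if yi * (w @ x) < 1 else np.zeros_like(x)

for _ in range(50):
    grad_w = np.zeros(2)
    for x, yi in zip(X, y):
        x_adv = x + eps * np.sign(input_grad(w, x, yi))   # inner max, one step
        if yi * (w @ x_adv) < 1:                          # outer min on x_adv
            grad_w -= yi * x_adv
    w -= lr * grad_w / len(X)

print("clean accuracy after adversarial training:", np.mean(np.sign(X @ w) == y))
```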
Three examples of this adaptive stance:
- PGD adversarial training.
- Adaptive jailbreak evaluation.
- Policy rules that anticipate model providers optimizing around metrics.
Two non-examples clarify the boundary:
- A static checklist.
- One red-team run treated as exhaustive.
Proof or verification habit for adversarial training:
The argument must connect the adaptation model to the defense or policy mechanism.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$
- adversarial objective: choose the defense against the best attack, $\min_{\theta} \max_{a \in \mathcal{A}} L(\theta, a)$
- multi-agent learning: policy updates change the environment the other learners face
In AI systems, adversarial training is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Robust AI governance needs game-theoretic assumptions because rules create incentives.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
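For the learning-dynamics half, a sketch of fictitious play on matching pennies (an assumed stand-in game): each player best-responds to the opponent's empirical action frequencies, and the average strategies drift toward the mixed equilibrium.

```python
# Fictitious play on matching pennies: each side best-responds to the
# opponent's empirical action frequencies; averages approach (0.5, 0.5).
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])               # row player's payoff matrix

row_counts = np.array([1.0, 0.0])          # arbitrary initial beliefs
col_counts = np.array([0.0, 1.0])

for _ in range(10000):
    row_br = np.argmax(A @ (col_counts / col_counts.sum()))   # row best response
    col_br = np.argmin((row_counts / row_counts.sum()) @ A)   # column best response
    row_counts[row_br] += 1
    col_counts[col_br] += 1

print("row average strategy:", np.round(row_counts / row_counts.sum(), 3))
print("column average strategy:", np.round(col_counts / col_counts.sum(), 3))
```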
Checklist for using adversarial training responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Specify the adaptive opponent, not only the defense.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Adversarial training gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.3 Model Extraction and Poisoning
Model extraction and poisoning belong to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The minimax objective $\min_{\theta} \max_{a \in \mathcal{A}} L(\theta, a)$ gives the mathematical handle for model extraction and poisoning. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Generative, evaluation, and deployment games arise when model behavior changes in response to the measurement or defense mechanism.
Worked reading.
In a GAN, the discriminator improves its classifier while the generator improves samples to fool it. In red-team evaluation, the attacker improves examples after seeing failures of the defense.
Three examples of such adaptive games:
- GAN generator-discriminator training.
- Jailbreak discovery against a deployed policy layer.
- Benchmark gaming where systems optimize for the public metric instead of the intended task.
Two non-examples clarify the boundary:
- One-time evaluation on a frozen hidden test set.
- A content filter measured only against historical prompts.
Proof or verification habit for model extraction and poisoning:
The mathematical proof obligation is to identify the adaptive loop and the payoff each side optimizes.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$
- adversarial objective: choose the defense against the best attack, $\min_{\theta} \max_{a \in \mathcal{A}} L(\theta, a)$
- multi-agent learning: policy updates change the environment the other learners face
In AI systems, model extraction and poisoning are useful to study because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Many LLM safety and evaluation failures are game failures: optimizing the metric changes the population of attempts.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
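A minimal extraction-game sketch in that spirit, assuming a hypothetical linear victim model and an output-coarsening defense; the model, budgets, and precision levels are all synthetic.

```python
# Model extraction sketch: the attacker queries a black-box scorer under a budget,
# fits a surrogate, and the defender coarsens outputs to slow extraction.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0, 0.5])      # hidden victim model (synthetic)

def victim(X, digits):
    # Defender action: round returned scores to `digits` decimal places.
    return np.round(X @ w_true, digits)

def extraction_payoff(n_queries, digits):
    X = rng.normal(size=(n_queries, 3))
    w_hat, *_ = np.linalg.lstsq(X, victim(X, digits), rcond=None)
    X_test = rng.normal(size=(5000, 3))
    # Attacker payoff: sign agreement between surrogate and victim on fresh inputs.
    return np.mean(np.sign(X_test @ w_hat) == np.sign(X_test @ w_true))

for budget in (10, 100, 1000):           # attacker action: query budget
    for digits in (3, 0):                # defender action: output precision
        print(budget, digits, round(extraction_payoff(budget, digits), 3))
```

Higher query budgets raise the attacker's agreement score, while coarser outputs lower it at some cost to honest users; the equilibrium question is which precision level the defender can sustain.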
Checklist for reasoning about model extraction and poisoning responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask who can observe the metric, adapt to it, and benefit from adaptation.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Model extraction and poisoning give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.4 Robust Retrieval and Tool Gates
Robust retrieval and tool gates belong to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The minimax objective $\min_{\theta} \max_{a \in \mathcal{A}} L(\theta, a)$ gives the mathematical handle for robust retrieval and tool gates. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A threat model defines the attacker's allowed moves. Robust optimization then trains or evaluates against the worst allowed move.
Worked reading.
For an $\ell_\infty$ perturbation set, PGD repeatedly steps in the gradient-sign direction and projects back into the allowed box.
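A sketch of that loop, where `grad_fn` is a stand-in for the loss gradient with respect to the input rather than a fixed library API:

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=0.1, alpha=0.02, steps=10):
    """Projected gradient ascent of the loss inside an l_inf ball of radius eps."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))  # gradient-sign ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)         # project back into the box
    return x_adv
```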
Three examples of formalized threat sets:
- Image perturbations bounded by a norm.
- Prompt transformations allowed by a jailbreak policy.
- Retrieval poisoning constrained by an index-insertion budget.
Two non-examples clarify the boundary:
- Any attack the modeler can imagine but has not formalized.
- Random corruption treated as an adaptive attack.
Proof or verification habit for robust retrieval and tool gates:
The nested objective is proved meaningful only after the feasible attack set is stated. The inner maximum is over that set, not over all possible bad events.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$
- adversarial objective: choose the defense against the best attack, $\min_{\theta} \max_{a \in \mathcal{A}} L(\theta, a)$
- multi-agent learning: policy updates change the environment the other learners face
In AI systems, robust retrieval and tool gates are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Adversarial training improves robustness to the modeled threat, not to every strategic behavior.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
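One such executable fragment, with an invented loss and three invented attacks, makes the discipline concrete: the threat set is written down as a finite list before any maximum is taken.

```python
import numpy as np

# Write the set before writing the max: the allowed attacks, stated explicitly.
threat_set = [
    lambda x: x,          # no attack
    lambda x: x + 0.1,    # bounded shift
    lambda x: -x,         # sign flip
]

def loss(theta, x):
    return (theta * x - 1.0) ** 2

def robust_loss(theta, x):
    # The inner maximum ranges over the stated set, not over all bad events.
    return max(loss(theta, a(x)) for a in threat_set)

thetas = np.linspace(-2, 2, 401)
robust = [max(robust_loss(t, x) for x in (0.5, 1.0)) for t in thetas]
print("robust-optimal theta:", thetas[int(np.argmin(robust))])
```

Here robustness to the sign-flip attack forces the conservative choice $\theta = 0$; enlarging or shrinking the stated set changes the answer, which is the point of the local diagnostic below.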
Checklist for using robust retrieval and tool gates responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Write the set before writing the max.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Robust retrieval and tool gates give the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.5 Governance Under Adaptive Opponents
Governance under adaptive opponents belongs to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The minimax objective $\min_{\theta} \max_{a \in \mathcal{A}} L(\theta, a)$ gives the mathematical handle for governance under adaptive opponents. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Adversarial training and governance both treat the opponent as adaptive rather than as a fixed noise source.
Worked reading.
Training solves an approximate inner attack problem, then updates the model on those attacks. Governance designs rules and monitoring under the expectation that actors respond strategically.
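A toy Stackelberg sketch of that design stance, with every payoff number invented: the regulator commits first to an audit probability, and the provider best-responds by complying or by gaming the metric.

```python
import numpy as np

# Leader (regulator) commits to an audit probability; follower (provider)
# best-responds. All payoff numbers are synthetic.
def provider_complies(p_audit):
    comply_payoff = 1.0                  # modest but safe return
    game_payoff = 3.0 - 10.0 * p_audit   # gaming pays until audits bite
    return comply_payoff >= game_payoff

def regulator_payoff(p_audit):
    audit_cost = 2.0 * p_audit
    outcome = 1.0 if provider_complies(p_audit) else -5.0
    return outcome - audit_cost

grid = np.linspace(0.0, 1.0, 101)
best = grid[int(np.argmax([regulator_payoff(p) for p in grid]))]
print(f"optimal audit commitment: {best:.2f}")   # commits just enough to deter
```

The value of commitment comes entirely from the follower's best response: below the deterrence threshold the audits are wasted, and above it they are pure cost.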
Three examples of this adaptive stance:
- PGD adversarial training.
- Adaptive jailbreak evaluation.
- Policy rules that anticipate model providers optimizing around metrics.
Two non-examples clarify the boundary:
- A static checklist.
- One red-team run treated as exhaustive.
Proof or verification habit for governance under adaptive opponents:
The argument must connect the adaptation model to the defense or policy mechanism.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while the other players choose $\pi_{-i}$
- adversarial objective: choose the defense against the best attack, $\min_{\theta} \max_{a \in \mathcal{A}} L(\theta, a)$
- multi-agent learning: policy updates change the environment the other learners face
In AI systems, governance under adaptive opponents is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Robust AI governance needs game-theoretic assumptions because rules create incentives.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using governance under adaptive opponents responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Specify the adaptive opponent, not only the defense.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Governance under adaptive opponents gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
7. Common Mistakes
| # | Mistake | Why It Is Wrong | Fix |
|---|---|---|---|
| 1 | Treating equilibrium as social optimality | A Nash equilibrium can be inefficient or unfair. | Compare equilibrium outcomes with Pareto and welfare criteria. |
| 2 | Checking only one player's incentive | Equilibrium requires every player to lack profitable unilateral deviation. | Compute best responses for all players. |
| 3 | Ignoring mixed strategies | Some finite games have no pure equilibrium. | Use probability distributions over actions and the indifference principle. |
| 4 | Applying minimax to non-zero-sum games blindly | Minimax value is a zero-sum guarantee, not a general welfare solution. | State whether payoffs are strictly opposed before using minimax. |
| 5 | Confusing learning convergence with equilibrium | A learning process can cycle, diverge, or converge to a non-equilibrium behavior. | Track regret, exploitability, and stationarity separately. |
| 6 | Forgetting that other agents adapt | In multi-agent systems, each learner changes the data distribution of the others. | Model policies jointly and monitor nonstationarity. |
| 7 | Using average-case metrics against adaptive attackers | An adaptive opponent targets the worst exploitable gap. | Define threat sets and robust objectives. |
| 8 | Equating red teaming with complete security | Red-team examples are samples, not proofs against all attacks. | Use adaptive evaluation and explicit threat models. |
| 9 | Treating GAN instability as ordinary optimization only | GANs are games whose gradients can rotate instead of descend. | Analyze generator and discriminator objectives jointly. |
| 10 | Letting game abstractions erase values | Payoff design determines incentives and side effects. | Audit utility functions, constraints, and welfare implications. |
8. Exercises
For every exercise: (a) state the players, actions, and payoffs; (b) compute or characterize best responses; (c) decide whether the proposed joint strategy is stable; (d) interpret the result for an AI, LLM, or adversarial system.
- (*) Model a jailbreak attacker and a prompt filter as a 2x2 zero-sum game.
- (*) Write the inner maximization and outer minimization of adversarial training against an explicit threat set.
- (*) Model benchmark gaming as a game between a system that optimizes the public metric and an evaluator who can update the rubric.
- (**) Show why a deterministic content filter is exploitable, then construct a randomized defense with the indifference principle.
- (**) Set up a Stackelberg security game in which a defender commits to an audit probability before providers respond.
- (**) Model extraction as a game between a query-budgeted attacker and a defender who coarsens model outputs.
- (***) Formalize retrieval poisoning under an index-insertion budget and compute the worst case over that set.
- (***) Treat GAN training as a two-player game and measure the exploitability of a frozen generator against a best-response discriminator.
- (***) Run fictitious play on a small matrix game and compare the empirical average strategies with an equilibrium.
- (***) Design a governance rule that stays effective after providers adapt, separating its equilibrium analysis from its welfare analysis.
9. Why This Matters for AI
| Concept | AI Impact |
|---|---|
| Best response | Explains how users, attackers, or agents adapt to a model policy |
| Nash equilibrium | Defines strategic stability for GANs, self-play, routing, and agent systems |
| Mixed strategy | Motivates randomized defenses, stochastic policies, and exploration |
| Minimax value | Formalizes robust worst-case guarantees |
| Exploitability | Measures how far a policy is from strategic stability |
| No-regret learning | Connects repeated play to approximate equilibrium |
| Security game | Models limited defensive resources against adaptive threats |
| Payoff design | Shows why objective misspecification creates strategic side effects |
10. Conceptual Bridge
Adversarial Game Theory follows causal inference because interventions often change incentives. Chapter 22 asks what changes when an action is taken. Chapter 23 asks what happens when other agents see that action, learn from it, and respond strategically.
The backward bridge is intervention. A policy change can have a causal effect, but if users or attackers adapt, the effect becomes part of a game. The forward bridge is measure theory: later probability foundations make the stochastic strategies, repeated games, and distributional assumptions more rigorous.
+--------------------------------------------------------------+
| Chapter 22: intervention and causal mechanisms |
| Chapter 23: strategic adaptation and adversarial objectives |
| Chapter 24: rigorous probability and measure foundations |
+--------------------------------------------------------------+
References
- Madry et al. Towards Deep Learning Models Resistant to Adversarial Attacks. https://arxiv.org/abs/1706.06083
- Goodfellow et al. Generative Adversarial Nets. https://arxiv.org/abs/1406.2661
- Nisan et al. Algorithmic Game Theory. https://doi.org/10.1017/CBO9780511800481
- Brown and Sandholm. Superhuman AI for multiplayer poker. https://www.science.org/doi/10.1126/science.aay2400