Math for LLMs

Adversarial Game Theory, Part 3: Strategic Security Games and Generative and Evaluation Games


4. Strategic Security Games

Strategic Security Games develops the part of adversarial game theory specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.

4.1 defender resource allocation

Defender resource allocation belongs to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

$$\min_G\max_D \mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}}\log D(\mathbf{x})+\mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}\log(1-D(G(\mathbf{z}))).$$

The formula gives the mathematical handle for defender resource allocation. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

Security games often have timing: a defender commits to a randomized allocation, then an attacker chooses a best response.

Worked reading.

A defender with two monitors and three targets chooses coverage probabilities; the attacker chooses the target with highest expected utility after observing the commitment rule.
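
A minimal numeric sketch of this worked reading, with illustrative payoffs and coverage numbers (all assumptions, not canonical values): the defender's commitment fixes a coverage probability per target, and the attacker best-responds in expectation.

# Minimal sketch of the two-monitor, three-target coverage game (Python).
# All payoffs and coverage probabilities are illustrative assumptions.
c = [0.9, 0.7, 0.4]              # committed coverage probability per target
u_uncovered = [5.0, 4.0, 3.0]    # attacker utility if target t is uncovered
u_caught = -1.0                  # attacker utility if caught

# The attacker observes the commitment rule and best-responds in expectation.
expected_u = [c[t] * u_caught + (1 - c[t]) * u_uncovered[t] for t in range(3)]
best_target = max(range(3), key=lambda t: expected_u[t])
print(expected_u, "-> attacker hits target", best_target)

Note that the attack lands on the least-covered target even though it carries the smallest raw prize: coverage, not prize size, drives the best response here.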

Three examples of defender resource allocation:

  1. Random audits over model outputs.
  2. Rate-limit allocation over API endpoints.
  3. Canary documents placed to detect extraction.

Two non-examples clarify the boundary:

  1. A simultaneous zero-sum matrix game with no commitment.
  2. A fixed checklist that attackers cannot observe or learn from.

Proof or verification habit for defender resource allocation:

Stackelberg analysis proves optimal commitment by solving the follower's best-response constraints inside the leader's optimization.
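
A hedged sketch of that habit: enumerate the leader's coverage vectors on a coarse grid, solve the follower's best response inside the loop, and keep the commitment with the best leader value. The payoffs and grid are illustrative assumptions, and follower ties are broken arbitrarily rather than in the leader's favor.

# Sketch of Stackelberg commitment by grid search (Python).
# Payoff numbers and the coverage grid are illustrative assumptions.
import itertools

targets = range(3)
u_att = [5.0, 4.0, 3.0]      # attacker gain if the chosen target is uncovered
d_loss = [-5.0, -4.0, -3.0]  # defender loss if an uncovered target is hit
grid = [i / 10 for i in range(11)]

best_value, best_cov = float("-inf"), None
for cov in itertools.product(grid, repeat=3):
    if sum(cov) > 2.0:       # two monitors give a total coverage budget of 2
        continue
    # Follower best-response constraint, solved by enumeration (caught payoff -1).
    att_u = [(1 - cov[t]) * u_att[t] + cov[t] * (-1.0) for t in targets]
    t_star = max(targets, key=lambda t: att_u[t])
    value = (1 - cov[t_star]) * d_loss[t_star]
    if value > best_value:
        best_value, best_cov = value, cov

print("committed coverage:", best_cov, "defender value:", best_value)

The inner maximization is the follower constraint; richer formulations replace the grid with one linear program per candidate attacked target, but the structure is the same.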

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, defender resource allocation is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

For AI security, commitment and observability matter because attackers often adapt after seeing public defenses.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using defender resource allocation responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: State what the attacker knows about the defense.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Defender resource allocation gives the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

4.2 attacker best response

Attacker best response belongs to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

$$a_A\in A_A,\qquad a_D\in A_D.$$

The formula gives the mathematical handle for attacker best response. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

A best response is an action or policy that maximizes one player's payoff while the other players' strategies are held fixed.

Worked reading.

For a payoff matrix $A$, if the column player chooses column $j$, the row player's best responses are the rows attaining $\max_i A_{ij}$. In mixed play, the best response maximizes expected payoff against the opponent's distribution.
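
The same reading in executable form, on an illustrative 3x2 payoff matrix (an assumption, not a canonical example):

# Sketch: pure and mixed best responses in a small matrix game (Python).
# The payoff matrix is an illustrative assumption.
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0],
              [2.0, 2.0]])   # A[i, j]: row player's payoff at (row i, column j)

j = 0                        # column player commits to column 0
pure_br = np.flatnonzero(A[:, j] == A[:, j].max())

q = np.array([0.5, 0.5])     # column player mixes uniformly
expected = A @ q             # expected payoff of each row against q
mixed_br = np.flatnonzero(expected == expected.max())

print("pure BR rows:", pure_br, "| expected payoffs:", expected, "| mixed BR rows:", mixed_br)

Against the mixed strategy, rows 0 and 2 tie, so the best response is a set, not a single action.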

Three examples of attacker best response:

  1. A discriminator chooses the classifier update that most separates generated and real samples.
  2. An attacker chooses the prompt family with highest bypass rate against a fixed guardrail.
  3. A retrieval system chooses the route with highest utility against the current user distribution.

Two non-examples clarify the boundary:

  1. The globally highest payoff cell when the opponent is not fixed.
  2. A socially preferred action that is not payoff-maximizing for the player.

Proof or verification habit for attacker best response:

To prove a response is best, compare it to every allowed unilateral deviation under the same opponent strategy.

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, attacker best response is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Best-response thinking is how exploitability is measured: ask what an adaptive user, attacker, or agent could gain by switching strategy alone.
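
A sketch of that measurement on matching pennies, with a candidate profile that is deliberately off-equilibrium (the strategies are illustrative assumptions):

# Sketch: exploitability of a strategy profile in a zero-sum matrix game (Python).
# Exploitability = total gain available to the players from unilateral deviations.
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])   # matching pennies: row maximizes, column minimizes

p = np.array([0.6, 0.4])      # row player's mixed strategy (off-equilibrium)
q = np.array([0.5, 0.5])      # column player's mixed strategy

value = p @ A @ q
row_gain = (A @ q).max() - value   # best row deviation against fixed q
col_gain = value - (p @ A).min()   # best column deviation against fixed p
print("value:", value, "exploitability:", row_gain + col_gain)

Exploitability of zero would certify the profile as an equilibrium; here the column player can gain 0.2 by switching alone.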

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using attacker best response responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Never call an outcome stable until every player has passed the same best-response check.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Attacker best response gives the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

4.3 Stackelberg equilibrium

Stackelberg equilibrium belongs to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

$$R_{\mathrm{rob}}(\theta)=\mathbb{E}_{(\mathbf{x},y)}\left[\max_{\boldsymbol{\delta}\in\mathcal{S}}\mathcal{L}(f_\theta(\mathbf{x}+\boldsymbol{\delta}),y)\right].$$

The formula gives the mathematical handle for Stackelberg equilibrium: the robust risk has leader-follower structure, with the learner committing to $\theta$ and the perturbation $\boldsymbol{\delta}$ best-responding inside the expectation. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

A Nash equilibrium is a profile of strategies where no player can improve by changing its own strategy while all other strategies remain fixed. A Stackelberg equilibrium is the commitment counterpart: the leader's strategy is optimal given that the follower plays a best response to it.

Worked reading.

In the prisoner's dilemma payoff convention, mutual defection can be a Nash equilibrium even when mutual cooperation is better for both players. This is the central warning: stability and desirability are different properties.
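
That warning in executable form, using the standard prisoner's dilemma payoff convention:

# Sketch: prisoner's dilemma deviation check (Python).
# Payoffs follow the usual textbook convention.
C, D = 0, 1
payoff = {  # payoff[(row_action, col_action)] = (row_payoff, col_payoff)
    (C, C): (3, 3), (C, D): (0, 5),
    (D, C): (5, 0), (D, D): (1, 1),
}

def is_nash(a_row, a_col):
    # No player gains by a unilateral deviation.
    row_ok = all(payoff[(a_row, a_col)][0] >= payoff[(d, a_col)][0] for d in (C, D))
    col_ok = all(payoff[(a_row, a_col)][1] >= payoff[(a_row, d)][1] for d in (C, D))
    return row_ok and col_ok

print("(D, D) Nash?", is_nash(D, D))   # True: stable
print("(C, C) Nash?", is_nash(C, C))   # False: each side gains by defecting
print("welfare (D,D) vs (C,C):", sum(payoff[(D, D)]), "vs", sum(payoff[(C, C)]))

(D, D) passes the deviation check and (C, C) fails it, yet (C, C) has strictly higher total welfare: equilibrium analysis and welfare analysis answer different questions.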

Three examples of Stackelberg equilibrium:

  1. A self-play policy pair where neither side has a profitable unilateral exploit.
  2. A GAN fixed point where the generator distribution matches data and the discriminator cannot improve classification.
  3. A routing market where no model provider benefits from changing only its bid.

Two non-examples clarify the boundary:

  1. A high-welfare outcome with a profitable unilateral deviation.
  2. A training checkpoint with low loss but a large best-response exploit.

Proof or verification habit for Stackelberg equilibrium:

The proof is a universal deviation check: for each player $i$, hold $\pi_{-i}^*$ fixed and show $u_i(\pi_i^*,\pi_{-i}^*)\ge u_i(\pi_i,\pi_{-i}^*)$ for all allowed $\pi_i$.

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, Stackelberg equilibrium is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

For AI agents, Nash is a stability diagnostic. It does not guarantee safety, alignment, fairness, or global efficiency unless those objectives are encoded in the game.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using Stackelberg equilibrium responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Ask: if one deployed model, user, or attacker changed behavior alone, would it gain?

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Stackelberg equilibrium gives the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

4.4 deception and randomization

Deception and randomization belong to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

$$a_A^*(a_D)\in\arg\max_{a_A}u_A(a_A,a_D).$$

The formula gives the mathematical handle for deception and randomization. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

Security games often have timing: a defender commits to a randomized allocation, then an attacker chooses a best response.

Worked reading.

A defender with two monitors and three targets chooses coverage probabilities; the attacker chooses the target with highest expected utility after observing the commitment rule.
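
A sketch of why the commitment is randomized, reduced to one monitor and two targets so the indifference condition is visible (all payoffs are illustrative assumptions): the defender's mixing probability is chosen so the attacker gains nothing from predicting the defense.

# Sketch: randomized coverage via the attacker-indifference condition (Python).
# Payoff numbers are illustrative assumptions.
u = [6.0, 3.0]    # attacker gain at target t if it is uncovered
caught = -1.0     # attacker payoff if caught

# Coverage p on target 0 and 1 - p on target 1 gives attacker utilities
#   U0(p) = (1 - p) * u[0] + p * caught
#   U1(p) = p * u[1] + (1 - p) * caught
# Indifference U0(p) = U1(p) pins down p:
p = (u[0] - caught) / (u[0] + u[1] - 2 * caught)
U0 = (1 - p) * u[0] + p * caught
U1 = p * u[1] + (1 - p) * caught
print("coverage on target 0:", round(p, 3), "attacker utilities:", round(U0, 3), round(U1, 3))

At this coverage the attacker's expected utilities are equal, so no single target is a predictable weak spot, which is exactly what randomization buys the defender.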

Three examples of deception and randomization:

  1. Random audits over model outputs.
  2. Rate-limit allocation over API endpoints.
  3. Canary documents placed to detect extraction.

Two non-examples clarify the boundary:

  1. A simultaneous zero-sum matrix game with no commitment.
  2. A fixed checklist that attackers cannot observe or learn from.

Proof or verification habit for deception and randomization:

Stackelberg analysis proves optimal commitment by solving the follower's best-response constraints inside the leader's optimization.

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, deception and randomization are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

For AI security, commitment and observability matter because attackers often adapt after seeing public defenses.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using deception and randomization responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: State what the attacker knows about the defense.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Deception and randomization give the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

4.5 audit and monitoring

Audit and monitoring belong to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

$$\min_G\max_D \mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}}\log D(\mathbf{x})+\mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}\log(1-D(G(\mathbf{z}))).$$

The formula gives the mathematical handle for audit and monitoring. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

Security games often have timing: a defender commits to a randomized allocation, then an attacker chooses a best response.

Worked reading.

A defender with two monitors and three targets chooses coverage probabilities; the attacker chooses the target with highest expected utility after observing the commitment rule.
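
A sketch of the simplest audit-as-inspection-game calculation, with illustrative numbers: the auditor commits to an audit probability, and deterrence requires the violator's expected gain to be non-positive.

# Sketch: random audits as an inspection game (Python).
# Gain and penalty values are illustrative assumptions.
g, f = 4.0, 10.0   # violator's gain if unaudited, penalty if caught

# Expected violator payoff: (1 - p) * g - p * f <= 0  =>  p >= g / (g + f)
p_min = g / (g + f)
print("minimum audit rate to deter:", p_min)

for p in (0.1, p_min, 0.5):
    ev = (1 - p) * g - p * f
    print(f"p = {p:.3f}: violator expected value = {ev:+.2f}",
          "(deterred)" if ev <= 0 else "(not deterred)")

The commitment matters: the deterrence threshold only binds if the violator believes the published audit rate, which is why observability keeps appearing in this section.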

Three examples of audit and monitoring:

  1. Random audits over model outputs.
  2. Rate-limit allocation over API endpoints.
  3. Canary documents placed to detect extraction.

Two non-examples clarify the boundary:

  1. A simultaneous zero-sum matrix game with no commitment.
  2. A fixed checklist that attackers cannot observe or learn from.

Proof or verification habit for audit and monitoring:

Stackelberg analysis proves optimal commitment by solving the follower's best-response constraints inside the leader's optimization.

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, audit and monitoring are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

For AI security, commitment and observability matter because attackers often adapt after seeing public defenses.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using audit and monitoring responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: State what the attacker knows about the defense.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Audit and monitoring give the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

5. Generative and Evaluation Games

Generative and Evaluation Games develops the part of adversarial game theory specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.

5.1 GAN discriminator-generator game

The GAN discriminator-generator game belongs to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

$$a_A\in A_A,\qquad a_D\in A_D.$$

The formula gives the mathematical handle for the GAN discriminator-generator game. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

Generative, evaluation, and deployment games arise when model behavior changes in response to the measurement or defense mechanism.

Worked reading.

In a GAN, the discriminator improves its classifier while the generator improves samples to fool it. In red-team evaluation, the attacker improves examples after seeing failures of the defense.
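
The loop has a clean best-response structure on a discrete toy space; here is a sketch assuming a three-point sample space with illustrative probabilities. For a fixed generator, the discriminator's best response is $D^*(\mathbf{x})=p_{\mathrm{data}}(\mathbf{x})/(p_{\mathrm{data}}(\mathbf{x})+p_g(\mathbf{x}))$, and matching the data distribution is the generator's best reply to that.

# Sketch: the GAN game on a discrete toy space (Python).
# The two distributions are illustrative assumptions.
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])   # data distribution over 3 points
p_g = np.array([0.2, 0.3, 0.5])      # generator distribution before training

d_star = p_data / (p_data + p_g)     # discriminator's best response to G
value = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))
print("D* =", d_star.round(3), "value:", round(value, 4))

p_g_matched = p_data.copy()          # generator's best response: match the data
d_matched = p_data / (p_data + p_g_matched)
value_matched = np.sum(p_data * np.log(d_matched)) + np.sum(p_g_matched * np.log(1 - d_matched))
print("matched D* =", d_matched, "value:", round(value_matched, 4))

At the matched point $D^*$ is $1/2$ everywhere and the value is $-2\log 2$: the GAN fixed point cited among the examples in this subsection.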

Three examples of the GAN discriminator-generator game:

  1. GAN generator-discriminator training.
  2. Jailbreak discovery against a deployed policy layer.
  3. Benchmark gaming where systems optimize for the public metric instead of the intended task.

Two non-examples clarify the boundary:

  1. One-time evaluation on a frozen hidden test set.
  2. A content filter measured only against historical prompts.

Proof or verification habit for the GAN discriminator-generator game:

The mathematical proof obligation is to identify the adaptive loop and the payoff each side optimizes.

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, the GAN discriminator-generator game is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Many LLM safety and evaluation failures are game failures: optimizing the metric changes the population of attempts.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using the GAN discriminator-generator game responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Ask who can observe the metric, adapt to it, and benefit from adaptation.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. The GAN discriminator-generator game gives the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

5.2 red-team blue-team loops

Red-team blue-team loops belong to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

$$R_{\mathrm{rob}}(\theta)=\mathbb{E}_{(\mathbf{x},y)}\left[\max_{\boldsymbol{\delta}\in\mathcal{S}}\mathcal{L}(f_\theta(\mathbf{x}+\boldsymbol{\delta}),y)\right].$$

The formula gives the mathematical handle for red-team blue-team loops. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

Generative, evaluation, and deployment games arise when model behavior changes in response to the measurement or defense mechanism.

Worked reading.

In a GAN, the discriminator improves its classifier while the generator improves samples to fool it. In red-team evaluation, the attacker improves examples after seeing failures of the defense.
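
A sketch of that loop as alternating best responses on a tiny zero-sum matrix (the payoffs are illustrative assumptions): each side overfits to the other's last move, which is why the loop can cycle rather than converge.

# Sketch: a red-team/blue-team loop as alternating best responses (Python).
# Rows are defenses, columns are attacks; entries are the defender's payoff
# (the attacker gets the negative). Numbers are illustrative.
import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 1.0]])

defense, history = 0, []
for _ in range(6):
    attack = int(np.argmin(A[defense]))      # red team best-responds to the defense
    defense = int(np.argmax(A[:, attack]))   # blue team best-responds to the attack
    history.append((defense, attack))

print(history)  # alternating best responses can cycle instead of settling

This matrix has no pure saddle point, so the alternating responses cycle; stabilizing such loops is exactly where mixed strategies and equilibrium analysis enter.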

Three examples of red-team blue-team loops:

  1. GAN generator-discriminator training.
  2. Jailbreak discovery against a deployed policy layer.
  3. Benchmark gaming where systems optimize for the public metric instead of the intended task.

Two non-examples clarify the boundary:

  1. One-time evaluation on a frozen hidden test set.
  2. A content filter measured only against historical prompts.

Proof or verification habit for red-team blue-team loops:

The mathematical proof obligation is to identify the adaptive loop and the payoff each side optimizes.

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, red-team blue-team loops are useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Many LLM safety and evaluation failures are game failures: optimizing the metric changes the population of attempts.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using red-team blue-team loops responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Ask who can observe the metric, adapt to it, and benefit from adaptation.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Red-team blue-team loops give the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

5.3 benchmark gaming

Benchmark gaming belongs to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

$$a_A^*(a_D)\in\arg\max_{a_A}u_A(a_A,a_D).$$

The formula gives the mathematical handle for benchmark gaming. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

Generative, evaluation, and deployment games arise when model behavior changes in response to the measurement or defense mechanism.

Worked reading.

In a GAN, the discriminator improves its classifier while the generator improves samples to fool it. In red-team evaluation, the attacker improves examples after seeing failures of the defense.
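
A sketch of the selection effect with synthetic numbers (all distributions are illustrative assumptions): the public metric is a proxy that also rewards effort spent gaming it, and selecting on the proxy picks winners whose true quality is mediocre.

# Sketch of benchmark gaming as a selection effect (Python).
# All quantities are synthetic; "gaming" is effort aimed at the metric itself.
import random

random.seed(0)
candidates = []
for _ in range(1000):
    true_quality = random.gauss(0, 1)
    gaming = random.gauss(0, 1)
    proxy = true_quality + 2.0 * gaming   # the public metric rewards gaming too
    candidates.append((proxy, true_quality))

best_by_proxy = max(candidates, key=lambda c: c[0])
best_by_truth = max(candidates, key=lambda c: c[1])
print("proxy winner: proxy=%.2f true=%.2f" % best_by_proxy)
print("true  winner: proxy=%.2f true=%.2f" % (best_by_truth[0], best_by_truth[1]))

The proxy winner's true quality typically falls well below the true winner's; the harder the pool optimizes the proxy, the more strongly the gap is selected for.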

Three examples of benchmark gaming:

  1. GAN generator-discriminator training.
  2. Jailbreak discovery against a deployed policy layer.
  3. Benchmark gaming where systems optimize for the public metric instead of the intended task.

Two non-examples clarify the boundary:

  1. One-time evaluation on a frozen hidden test set.
  2. A content filter measured only against historical prompts.

Proof or verification habit for benchmark gaming:

The mathematical proof obligation is to identify the adaptive loop and the payoff each side optimizes.

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, benchmark gaming is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Many LLM safety and evaluation failures are game failures: optimizing the metric changes the population of attempts.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using benchmark gaming responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Ask who can observe the metric, adapt to it, and benefit from adaptation.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Benchmark gaming gives the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

5.4 reward hacking

Reward hacking belongs to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

$$\min_G\max_D \mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}}\log D(\mathbf{x})+\mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}\log(1-D(G(\mathbf{z}))).$$

The formula gives the mathematical handle for reward hacking. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

Generative, evaluation, and deployment games arise when model behavior changes in response to the measurement or defense mechanism.

Worked reading.

In a GAN, the discriminator improves its classifier while the generator improves samples to fool it. In red-team evaluation, the attacker improves examples after seeing failures of the defense.
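
A sketch with a hand-built action set (the action names and reward numbers are invented for illustration): the agent maximizes the measured reward, and the argmax lands where the scorer over-credits rather than where the intended reward is highest.

# Sketch of reward hacking: measured reward diverges from intended reward (Python).
# Action names and reward values are invented for illustration.
intended = {"solve_task": 1.0, "flatter_rater": 0.2, "exploit_scorer_bug": 0.0}
measured = {"solve_task": 1.0, "flatter_rater": 1.3, "exploit_scorer_bug": 2.0}

chosen = max(measured, key=measured.get)   # the agent optimizes what is measured
print("agent chooses:", chosen)
print("measured reward:", measured[chosen], "intended reward:", intended[chosen])

The game-theoretic reading: the agent is best-responding to the scorer, so fixing the behavior means changing the scorer's payoff structure, not exhorting the agent.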

Three examples of reward hacking:

  1. GAN generator-discriminator training.
  2. Jailbreak discovery against a deployed policy layer.
  3. Benchmark gaming where systems optimize for the public metric instead of the intended task.

Two non-examples clarify the boundary:

  1. One-time evaluation on a frozen hidden test set.
  2. A content filter measured only against historical prompts.

Proof or verification habit for reward hacking:

The mathematical proof obligation is to identify the adaptive loop and the payoff each side optimizes.

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, reward hacking is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Many LLM safety and evaluation failures are game failures: optimizing the metric changes the population of attempts.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using reward hacking responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Ask who can observe the metric, adapt to it, and benefit from adaptation.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Reward hacking gives the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective

5.5 adaptive evaluation

Adaptive evaluation belongs to the canonical scope of Adversarial Game Theory. The central object is not a single optimizer but a system of decision makers whose objectives interact.

For this subsection, the working scope is attacker-defender games, threat sets, robust optimization, Stackelberg security games, adversarial examples, and adaptive evaluation. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.

$$a_A\in A_A,\qquad a_D\in A_D.$$

The formula gives the mathematical handle for adaptive evaluation. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.

Game object | Meaning | AI interpretation
Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent
Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample
Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy
Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget
Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game

Operational definition.

Generative, evaluation, and deployment games arise when model behavior changes in response to the measurement or defense mechanism.

Worked reading.

In a GAN, the discriminator improves its classifier while the generator improves samples to fool it. In red-team evaluation, the attacker improves examples after seeing failures of the defense.
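
A sketch contrasting a frozen test set with an adaptive attacker, on a one-dimensional stand-in for a real detector (the threshold, mutation scale, and scoring rule are all illustrative assumptions):

# Sketch: frozen vs adaptive evaluation of a blocking filter (Python).
# The "detector" is a synthetic stand-in; all constants are assumptions.
import random

random.seed(0)
threshold = 0.8

def score(x):
    # Detector score of a "prompt" x, clamped to [0, 1].
    return max(0.0, min(1.0, x))

def blocked(x):
    return score(x) > threshold

# Frozen evaluation: one fixed sample of attack prompts.
frozen = [random.random() for _ in range(1000)]
print("frozen bypass rate:", sum(not blocked(x) for x in frozen) / 1000)

# Adaptive evaluation: the attacker keeps bypassing prompts and mutates them.
pool = [random.random() for _ in range(50)]
for _ in range(20):
    survivors = [x for x in pool if not blocked(x)] or pool
    pool = [score(random.choice(survivors) + random.gauss(0, 0.05)) for _ in range(50)]
print("adaptive bypass rate:", sum(not blocked(x) for x in pool) / 50)

The frozen rate stays near its sampling value while the adaptive pool's bypass rate climbs, which is why a number from a frozen set cannot certify robustness against an adapting opponent.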

Three examples of adaptive evaluation:

  1. GAN generator-discriminator training.
  2. Jailbreak discovery against a deployed policy layer.
  3. Benchmark gaming where systems optimize for the public metric instead of the intended task.

Two non-examples clarify the boundary:

  1. One-time evaluation on a frozen hidden test set.
  2. A content filter measured only against historical prompts.

Proof or verification habit for adaptive evaluation:

The mathematical proof obligation is to identify the adaptive loop and the payoff each side optimizes.

single-agent optimization:    choose theta to minimize L(theta)
game-theoretic optimization:  choose pi_i while others choose pi_-i
adversarial objective:        choose defense against best attack
multi-agent learning:         policies change the environment itself

In AI systems, adaptive evaluation is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.

Many LLM safety and evaluation failures are game failures: optimizing the metric changes the population of attempts.

Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.

Checklist for using adaptive evaluation responsibly:

  • State the players and their objectives.
  • State the action spaces and information structure.
  • Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
  • Identify pure, mixed, or policy strategies.
  • Compute best responses or exploitability before claiming stability.
  • Separate equilibrium analysis from welfare analysis.
  • Explain what changes if opponents adapt.

Local diagnostic: Ask who can observe the metric, adapt to it, and benefit from adaptation.

This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.

Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Adaptive evaluation gives the language to reason about that pressure.

A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.

Diagnostic question | Game-theoretic discipline it tests
Who can respond? | Player modeling
What can they change? | Action space
What do they want? | Payoff design
Can one side commit first? | Stackelberg structure
Is the worst case important? | Minimax or robust objective
