5. Algorithms and Approximation
Algorithms and Approximation develops the part of the minimax theorem material specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
5.1 Linear programming solution
Linear programming solution belongs to the canonical scope of the minimax theorem. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is zero-sum matrix games, maximin and minimax values, saddle points, LP duality, no-regret approximation, and robust AI objectives. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $v^* = \max_{x \in \Delta_m} \min_{j} (x^{\top} A)_j = \min_{y \in \Delta_n} \max_{i} (A y)_i$ gives the mathematical handle for linear programming solution. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Finite zero-sum games can be solved as primal-dual linear programs over probability simplices and value variables.
Worked reading.
The row player maximizes $v$ subject to every column giving expected payoff at least $v$. The column player minimizes $w$ subject to every row giving expected payoff at most $w$.
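A minimal executable sketch of this primal LP, assuming `numpy` and `scipy` are available; the helper name `solve_matrix_game` and the rock-paper-scissors payoff matrix are illustrative choices, not fixed course code.

```python
# Solve a finite zero-sum matrix game as a linear program (a sketch).
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(A):
    """Return the row player's optimal mixed strategy and the game value."""
    m, n = A.shape
    # Decision vector z = (x_1, ..., x_m, v); maximize v by minimizing -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every column j: v - (x^T A)_j <= 0, i.e. each column pays at least v.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # The strategy x lives on the probability simplex.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)  # rock-paper-scissors
x, v = solve_matrix_game(A)
print(x, v)  # approximately [1/3, 1/3, 1/3] and value 0
```

Running the same helper on `-A.T` recovers the column player's optimal strategy, which is the duality reading used below.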
Three examples of linear programming solution:
- Solving rock-paper-scissors as an LP.
- Computing an adversarial test mixture with linear constraints.
- Finding a randomized defense allocation under budget constraints.
Two non-examples clarify the boundary:
- Using an unconstrained optimizer that ignores the constraint that probabilities sum to one.
- Applying LP duality to a nonlinear continuous attack without convexity assumptions.
Proof or verification habit for linear programming solution:
The dual variables are the opponent's mixed strategy; complementary slackness explains why supported actions tie in payoff.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
In AI systems, linear programming solution is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
LP form makes game values auditable: constraints correspond directly to opponent deviations.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using linear programming solution responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Inspect primal feasibility, dual feasibility, and matching objective values.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Linear programming solution gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
5.2 No-regret dynamics
No-regret dynamics belongs to the canonical scope of the minimax theorem. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is zero-sum matrix games, maximin and minimax values, saddle points, LP duality, no-regret approximation, and robust AI objectives. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $R_T = \max_{i} \sum_{t=1}^{T} u(i, y_t) - \sum_{t=1}^{T} u(x_t, y_t)$ gives the mathematical handle for no-regret dynamics. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
No-regret learning turns repeated play into approximate equilibrium guarantees by making average regret small.
Worked reading.
If both players in a zero-sum game have average regret at most $\epsilon$, the average strategies are $2\epsilon$-approximate minimax strategies.
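A minimal sketch of the regret accounting itself, assuming play has been logged: `U[t, i]` holds the payoff action `i` would have earned at round `t` against the opponent's actual play, and `X[t]` is the mixed strategy actually used. Names and numbers are illustrative.

```python
# Average external regret from logged play (a sketch).
import numpy as np

def average_regret(U, X):
    """U: (T, m) counterfactual payoffs; X: (T, m) strategies actually played."""
    T = U.shape[0]
    realized = np.sum(U * X)            # sum_t <x_t, u_t>, the realized payoff
    best_fixed = np.max(U.sum(axis=0))  # best single action in hindsight
    return (best_fixed - realized) / T

U = np.array([[1.0, 0.0], [0.0, 1.0]])  # two rounds, two actions
X = np.array([[0.5, 0.5], [0.5, 0.5]])  # uniform play each round
print(average_regret(U, X))             # (1 - 1) / 2 = 0.0
```

If both players' average regrets are below $\epsilon$, the duality gap of the averaged strategies is at most the sum of the two regrets, which is the guarantee quoted above.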
Three examples of no-regret dynamics:
- Multiplicative weights for action probabilities.
- Self-play policies averaged over training.
- Exploitability curves used to track poker or board-game agents.
Two non-examples clarify the boundary:
- A decreasing supervised loss curve with no opponent model.
- A single final policy checkpoint without averaging or regret accounting.
Proof or verification habit for no-regret dynamics:
The proof decomposes the average payoff gap into the row player's regret plus the column player's regret.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
In AI systems, no-regret dynamics is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
This is why practical game-playing systems track exploitability and regret-like quantities instead of only reward.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using no-regret dynamics responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Report whether the guarantee applies to last iterate, averaged iterate, or best checkpoint.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. No-regret dynamics gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
5.3 Multiplicative weights preview
Multiplicative weights preview belongs to the canonical scope of the minimax theorem. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is zero-sum matrix games, maximin and minimax values, saddle points, LP duality, no-regret approximation, and robust AI objectives. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $x_{t+1}(i) \propto x_t(i)\, e^{\eta\, u(i, y_t)}$ gives the mathematical handle for multiplicative weights preview. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
No-regret learning turns repeated play into approximate equilibrium guarantees by making average regret small.
Worked reading.
If both players in a zero-sum game have average regret at most $\epsilon$, the average strategies are $2\epsilon$-approximate minimax strategies.
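A minimal self-play sketch of the update, assuming `numpy`; the step size `eta`, the horizon `T`, and the helper name are illustrative, and it is the averaged iterates, not the last ones, that carry the guarantee.

```python
# Multiplicative weights self-play on a zero-sum matrix game (a sketch).
import numpy as np

def mw_self_play(A, T=5000, eta=0.05):
    m, n = A.shape
    x, y = np.ones(m) / m, np.ones(n) / n
    x_avg, y_avg = np.zeros(m), np.zeros(n)
    for _ in range(T):
        x_avg += x / T
        y_avg += y / T
        x = x * np.exp(eta * (A @ y))    # row player reweights by payoffs A y
        x /= x.sum()
        y = y * np.exp(-eta * (x @ A))   # column player reweights by costs x^T A
        y /= y.sum()
    return x_avg, y_avg

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
x_bar, y_bar = mw_self_play(A)
print(x_bar, y_bar)  # both approach the uniform equilibrium of rock-paper-scissors
```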
Three examples of multiplicative weights preview:
- Multiplicative weights for action probabilities.
- Self-play policies averaged over training.
- Exploitability curves used to track poker or board-game agents.
Two non-examples clarify the boundary:
- A decreasing supervised loss curve with no opponent model.
- A single final policy checkpoint without averaging or regret accounting.
Proof or verification habit for multiplicative weights preview:
The proof decomposes the average payoff gap into the row player's regret plus the column player's regret.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
In AI systems, multiplicative weights preview is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
This is why practical game-playing systems track exploitability and regret-like quantities instead of only reward.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using multiplicative weights preview responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Report whether the guarantee applies to last iterate, averaged iterate, or best checkpoint.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Multiplicative weights preview gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
5.4 Exploitability
Exploitability belongs to the canonical scope of the minimax theorem. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is zero-sum matrix games, maximin and minimax values, saddle points, LP duality, no-regret approximation, and robust AI objectives. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $\mathrm{expl}(x, y) = \max_{x'} x'^{\top} A y - \min_{y'} x^{\top} A y'$ gives the mathematical handle for exploitability. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
No-regret learning turns repeated play into approximate equilibrium guarantees by making average regret small.
Worked reading.
If both players in a zero-sum game have average regret at most $\epsilon$, the average strategies are $2\epsilon$-approximate minimax strategies.
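A minimal sketch of the exploitability computation for a matrix game, assuming `numpy`; the strategy pair below is illustrative.

```python
# Exploitability (duality gap) of a strategy pair in a zero-sum game (a sketch).
import numpy as np

def exploitability(A, x, y):
    """Zero iff (x, y) is a minimax pair; larger means more exploitable."""
    row_best = np.max(A @ y)    # best pure response of the row player to y
    col_best = np.min(x @ A)    # best pure response of the column player to x
    return row_best - col_best  # always >= 0

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
x = np.array([1/3, 1/3, 1/3])   # optimal for the row player
y = np.array([0.5, 0.3, 0.2])   # biased toward rock, hence exploitable
print(exploitability(A, x, y))  # 0.3: the row player can gain by playing paper
```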
Three examples of exploitability:
- Multiplicative weights for action probabilities.
- Self-play policies averaged over training.
- Exploitability curves used to track poker or board-game agents.
Two non-examples clarify the boundary:
- A decreasing supervised loss curve with no opponent model.
- A single final policy checkpoint without averaging or regret accounting.
Proof or verification habit for exploitability:
The proof decomposes the average payoff gap into the row player's regret plus the column player's regret.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
In AI systems, exploitability is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
This is why practical game-playing systems track exploitability and regret-like quantities instead of only reward.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using exploitability responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Report whether the guarantee applies to last iterate, averaged iterate, or best checkpoint.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Exploitability gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
5.5 Approximate equilibrium
Approximate equilibrium belongs to the canonical scope of the minimax theorem. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is zero-sum matrix games, maximin and minimax values, saddle points, LP duality, no-regret approximation, and robust AI objectives. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $u_i(s_i, s_{-i}) \ge u_i(s_i', s_{-i}) - \epsilon$ for all players $i$ and all allowed deviations $s_i'$ gives the mathematical handle for approximate equilibrium. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A Nash equilibrium is a profile of strategies where no player can improve by changing its own strategy while all other strategies remain fixed.
Worked reading.
In the prisoner's dilemma payoff convention, mutual defection can be a Nash equilibrium even when mutual cooperation is better for both players. This is the central warning: stability and desirability are different properties.
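A minimal sketch of the deviation check behind that warning, assuming `numpy`; the matrices encode one standard prisoner's dilemma payoff convention, and the helper name and `eps` are illustrative.

```python
# Epsilon-Nash deviation check for a two-player bimatrix game (a sketch).
import numpy as np

def is_eps_nash(Ra, Rb, x, y, eps=1e-9):
    """True if neither player gains more than eps by deviating alone."""
    ua, ub = x @ Ra @ y, x @ Rb @ y  # current expected payoffs
    best_a = np.max(Ra @ y)          # A's best unilateral deviation against y
    best_b = np.max(x @ Rb)          # B's best unilateral deviation against x
    return best_a - ua <= eps and best_b - ub <= eps

Ra = np.array([[3, 0], [5, 1]], dtype=float)  # rows/cols: cooperate, defect
Rb = Ra.T                                     # symmetric payoffs for player B
defect = np.array([0.0, 1.0])
cooperate = np.array([1.0, 0.0])
print(is_eps_nash(Ra, Rb, defect, defect))        # True: stable but undesirable
print(is_eps_nash(Ra, Rb, cooperate, cooperate))  # False: defection is profitable
```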
Three examples of approximate equilibrium:
- A self-play policy pair where neither side has a profitable unilateral exploit.
- A GAN fixed point where the generator distribution matches data and the discriminator cannot improve classification.
- A routing market where no model provider benefits from changing only its bid.
Two non-examples clarify the boundary:
- A high-welfare outcome with a profitable unilateral deviation.
- A training checkpoint with low loss but a large best-response exploit.
Proof or verification habit for approximate equilibrium:
The proof is a universal deviation check: for each player $i$, hold $s_{-i}$ fixed and show $u_i(s_i, s_{-i}) \ge u_i(s_i', s_{-i})$ for all allowed $s_i'$.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
In AI systems, approximate equilibrium is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
For AI agents, Nash is a stability diagnostic. It does not guarantee safety, alignment, fairness, or global efficiency unless those objectives are encoded in the game.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using approximate equilibrium responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Ask whether any single deployed model, user, or attacker would gain by changing its behavior alone.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Approximate equilibrium gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6. AI Applications
AI Applications develops the part of the minimax theorem material specified by the approved Chapter 23 table of contents. The treatment is game-theoretic, not merely an optimization recipe.
6.1 Adversarial robustness objective
Adversarial robustness objective belongs to the canonical scope of the minimax theorem. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is zero-sum matrix games, maximin and minimax values, saddle points, LP duality, no-regret approximation, and robust AI objectives. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $\min_{\theta} \mathbb{E}_{(x, y)} \big[ \max_{\delta \in \mathcal{S}} L(f_\theta(x + \delta), y) \big]$ gives the mathematical handle for adversarial robustness objective. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A threat model defines the attacker's allowed moves. Robust optimization then trains or evaluates against the worst allowed move.
Worked reading.
For an $\ell_\infty$ perturbation set, PGD repeatedly steps in the gradient-sign direction and projects back into the allowed box.
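A minimal `numpy` sketch of the PGD loop, with a hand-supplied input gradient standing in for a real model's backward pass; all names and constants are illustrative assumptions of this sketch.

```python
# Projected gradient ascent on the loss within an l_inf ball (a sketch).
import numpy as np

def pgd_linf(x, grad_loss, eps=0.1, alpha=0.02, steps=20):
    """Take sign-of-gradient steps on the loss, projecting back into the box."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        delta = delta + alpha * np.sign(grad_loss(x + delta))
        delta = np.clip(delta, -eps, eps)  # projection onto the l_inf ball
    return x + delta

# Toy target: raise the loss L(z) = -(w . z) of a linear scorer.
w = np.array([1.0, -2.0, 0.5])
grad = lambda z: -w        # gradient of L with respect to the input
x0 = np.zeros(3)
print(pgd_linf(x0, grad))  # each coordinate pushed to +/- eps against w
```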
Three examples of adversarial robustness objective:
- Image perturbations bounded by an $\ell_p$ norm.
- Prompt transformations allowed by a jailbreak policy.
- Retrieval poisoning constrained by an index-insertion budget.
Two non-examples clarify the boundary:
- Any attack the modeler can imagine but has not formalized.
- Random corruption treated as adaptive attack.
Proof or verification habit for adversarial robustness objective:
The nested objective is proved meaningful only after the feasible attack set is stated. The inner maximum is over that set, not over all possible bad events.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
In AI systems, adversarial robustness objective is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Adversarial training improves robustness to the modeled threat, not to every strategic behavior.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using adversarial robustness objective responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Write the attack set $\mathcal{S}$ before writing the max.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Adversarial robustness objective gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.2 GAN minimax loss
GAN minimax loss belongs to the canonical scope of the minimax theorem. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is zero-sum matrix games, maximin and minimax values, saddle points, LP duality, no-regret approximation, and robust AI objectives. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$ gives the mathematical handle for GAN minimax loss. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Minimax reasoning chooses a strategy by its guaranteed performance against the strongest opponent response in a zero-sum game.
Worked reading.
The row player computes $\max_{x} \min_{y} x^{\top} A y$ while the column player computes $\min_{y} \max_{x} x^{\top} A y$. The minimax theorem says these values agree for finite zero-sum games.
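The GAN objective is a continuous min-max, but the value statement is easiest to see on a matrix toy. A minimal sketch, assuming `numpy`: pure strategies leave a gap between the two computations, and mixing closes it.

```python
# Lower and upper values: pure strategies gap, mixed strategies agree (a sketch).
import numpy as np

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)

lower_pure = np.max(np.min(A, axis=1))  # row player's best pure guarantee: -1
upper_pure = np.min(np.max(A, axis=0))  # column player's best pure guarantee: +1
print(lower_pure, upper_pure)           # a gap: randomization is essential

x = y = np.ones(3) / 3                  # uniform mixed strategies
print(np.max(A @ y), np.min(x @ A))     # both 0: the mixed values coincide
```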
Three examples of GAN minimax loss:
- Robust classification against bounded perturbations.
- A discriminator maximizing the generator's loss in a simplified GAN objective.
- Worst-case evaluation where the tester chooses the hardest valid case.
Two non-examples clarify the boundary:
- Average validation loss over a fixed dataset.
- General-sum bargaining where both players can gain together.
Proof or verification habit for GAN minimax loss:
The LP-duality proof writes each player's guarantee as a linear program; strong duality equates the two optimal values.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
In AI systems, GAN minimax loss is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Minimax is the mathematical backbone of adversarial robustness, but only relative to the stated threat model.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using GAN minimax loss responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Check zero-sum structure before importing minimax conclusions.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. The GAN minimax loss gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.3 Worst-case decoding and evaluation
Worst-case decoding and evaluation belongs to the canonical scope of the minimax theorem. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is zero-sum matrix games, maximin and minimax values, saddle points, LP duality, no-regret approximation, and robust AI objectives. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $\min_{\pi} \max_{x \in \mathcal{X}_{\text{valid}}} L(\pi, x)$ gives the mathematical handle for worst-case decoding and evaluation. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
Minimax reasoning chooses a strategy by its guaranteed performance against the strongest opponent response in a zero-sum game.
Worked reading.
The row player computes $\max_{x} \min_{y} x^{\top} A y$ while the column player computes $\min_{y} \max_{x} x^{\top} A y$. The minimax theorem says these values agree for finite zero-sum games.
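A minimal sketch of the two evaluation views, assuming per-case losses have already been computed; the helper name and numbers are illustrative.

```python
# Average-case versus worst-case evaluation of per-case losses (a sketch).
import numpy as np

def evaluate(losses_by_case):
    losses = np.asarray(losses_by_case, dtype=float)
    return {
        "average_case": float(losses.mean()),  # fixed-dataset view
        "worst_case": float(losses.max()),     # adversarial-tester view
        "hardest_case": int(losses.argmax()),  # which case to inspect next
    }

print(evaluate([0.02, 0.05, 0.91, 0.04]))  # one hard case dominates the worst case
```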
Three examples of worst-case decoding and evaluation:
- Robust classification against bounded perturbations.
- A discriminator maximizing the generator's loss in a simplified GAN objective.
- Worst-case evaluation where the tester chooses the hardest valid case.
Two non-examples clarify the boundary:
- Average validation loss over a fixed dataset.
- General-sum bargaining where both players can gain together.
Proof or verification habit for worst-case decoding and evaluation:
The LP-duality proof writes each player's guarantee as a linear program; strong duality equates the two optimal values.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
In AI systems, worst-case decoding and evaluation is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Minimax is the mathematical backbone of adversarial robustness, but only relative to the stated threat model.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using worst-case decoding and evaluation responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Check zero-sum structure before importing minimax conclusions.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Worst-case decoding and evaluation gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.4 Yao-style lower bounds preview
Yao-style lower bounds preview belongs to the canonical scope of the minimax theorem. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is zero-sum matrix games, maximin and minimax values, saddle points, LP duality, no-regret approximation, and robust AI objectives. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $\max_{\mu} \min_{A} \mathbb{E}_{x \sim \mu}[c(A, x)] = \min_{\rho} \max_{x} \mathbb{E}_{A \sim \rho}[c(A, x)]$ gives the mathematical handle for Yao-style lower bounds preview. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
The game value is the payoff both sides can guarantee in a finite zero-sum game when they use optimal mixed strategies.
Worked reading.
The row player's guarantee is the lower value; the column player's guarantee is the upper value. The minimax theorem states these agree in finite zero-sum games.
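A minimal sketch of the Yao-style reading on a small cost matrix, assuming `numpy`; the matrix, the input distribution `mu`, and the algorithm mixture `rho` are illustrative.

```python
# Yao-style bound: rows are deterministic algorithms, columns are inputs (a sketch).
import numpy as np

C = np.array([[1.0, 3.0],   # algorithm 0's cost on inputs 0 and 1
              [4.0, 1.0]])  # algorithm 1's cost on inputs 0 and 1

mu = np.array([0.5, 0.5])         # adversary's chosen input distribution
yao_lower_bound = np.min(C @ mu)  # best deterministic algorithm against mu

rho = np.array([0.6, 0.4])        # a randomized algorithm: a mix of the rows
worst_case_cost = np.max(rho @ C) # adversary then picks the worst input
print(yao_lower_bound, worst_case_cost)  # 2.0 <= 2.2, as the principle requires
```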
Three examples of Yao-style lower bounds preview:
- The value of rock-paper-scissors is zero.
- A robust classifier's worst-case loss is the value of a specified attack game.
- Yao-style lower bounds swap randomized algorithms and input distributions under minimax reasoning.
Two non-examples clarify the boundary:
- A general-sum welfare score.
- A last-iterate training reward.
Proof or verification habit for Yao-style lower bounds preview:
The finite proof can be read through LP duality: the column player's dual program certifies the same scalar value as the row player's primal program.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
In AI systems, Yao-style lower bounds preview is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
The value is useful because it is a guarantee, not a hope about average-case behavior.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using Yao-style lower bounds preview responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Check the assumptions: finite action sets or the right compactness, convexity, and continuity conditions for extensions.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Yao-style lower bounds preview gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |
6.5 Robust policy learning
Robust policy learning belongs to the canonical scope of the minimax theorem. The central object is not a single optimizer but a system of decision makers whose objectives interact.
For this subsection, the working scope is zero-sum matrix games, maximin and minimax values, saddle points, LP duality, no-regret approximation, and robust AI objectives. We use players, action sets, strategies, payoffs, and response rules. The key question is whether a proposed behavior is stable when another agent adapts.
The formula $\max_{\pi} \min_{e \in \mathcal{E}} J(\pi, e)$ gives the mathematical handle for robust policy learning. In game theory, this expression should always be read with the opponent's decision rule in mind. A policy that is optimal in isolation may be exploitable once another player observes and responds to it.
| Game object | Meaning | AI interpretation |
|---|---|---|
| Player | Decision maker with an objective | Model, user, attacker, defender, generator, evaluator, tool-using agent |
| Action | Choice available to a player | Prompt, route, attack, defense, bid, policy update, generated sample |
| Strategy | Rule or distribution over actions | Stochastic policy, decoding policy, defense randomization, routing policy |
| Payoff | Utility or negative loss | Accuracy, reward, cost, safety score, exploitability, compute budget |
| Equilibrium | Stable joint behavior | No agent can improve by changing alone under the stated game |
Operational definition.
A threat model defines the attacker's allowed moves. Robust optimization then trains or evaluates against the worst allowed move.
Worked reading.
For an $\ell_\infty$ perturbation set, PGD repeatedly steps in the gradient-sign direction and projects back into the allowed box.
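A minimal sketch of robust policy selection over a declared set of environment variants, assuming returns `R[p, e]` have already been estimated; all numbers are illustrative.

```python
# Max-min policy selection over environment variants (a sketch).
import numpy as np

R = np.array([[0.95, 0.2, 0.9],   # policy 0: strong on average, fragile on env 1
              [0.70, 0.6, 0.7],   # policy 1: balanced across environments
              [0.50, 0.5, 0.5]])  # policy 2: uniformly mediocre

avg_best = int(np.argmax(R.mean(axis=1)))    # average-case winner: policy 0
robust_best = int(np.argmax(R.min(axis=1)))  # worst-case (max-min) winner: policy 1
print(avg_best, robust_best)  # the robust choice trades mean return for a floor
```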
Three examples of robust policy learning:
- Image perturbations bounded by an $\ell_p$ norm.
- Prompt transformations allowed by a jailbreak policy.
- Retrieval poisoning constrained by an index-insertion budget.
Two non-examples clarify the boundary:
- Any attack the modeler can imagine but has not formalized.
- Random corruption treated as adaptive attack.
Proof or verification habit for robust policy learning:
The nested objective is proved meaningful only after the feasible attack set is stated. The inner maximum is over that set, not over all possible bad events.
- single-agent optimization: choose $\theta$ to minimize $L(\theta)$
- game-theoretic optimization: choose $\pi_i$ while others choose $\pi_{-i}$
- adversarial objective: choose defense against best attack
- multi-agent learning: policies change the environment itself
In AI systems, robust policy learning is useful because modern models are deployed into adaptive environments: users learn prompt tricks, attackers search for failures, evaluators change rubrics, and other agents compete for resources.
Adversarial training improves robustness to the modeled threat, not to every strategic behavior.
Notebook implementation will use small synthetic payoff matrices and learning dynamics. This keeps the mathematics executable while avoiding external datasets or heavyweight game solvers.
Checklist for using robust policy learning responsibly:
- State the players and their objectives.
- State the action spaces and information structure.
- Decide whether the game is zero-sum, general-sum, cooperative, or adversarial.
- Identify pure, mixed, or policy strategies.
- Compute best responses or exploitability before claiming stability.
- Separate equilibrium analysis from welfare analysis.
- Explain what changes if opponents adapt.
Local diagnostic: Write the attack set $\mathcal{S}$ before writing the max.
This chapter follows Chapter 22 by adding strategic adaptation. Causal inference asks what happens when we intervene. Game theory asks what happens when other decision makers anticipate or respond to that intervention.
Modern AI makes the distinction practical. A deployed model can be optimized against by users, attackers, competitors, automated evaluators, and other models. Robust policy learning gives the language to reason about that pressure.
A final diagnostic question is whether a decision remains good after another agent learns from it. If not, the analysis needs game theory, not just prediction, causality, or optimization.
| Diagnostic question | Game-theoretic discipline it tests |
|---|---|
| Who can respond? | Player modeling |
| What can they change? | Action space |
| What do they want? | Payoff design |
| Can one side commit first? | Stackelberg structure |
| Is the worst case important? | Minimax or robust objective |