Math for LLMs, Part 2

Multivariate Calculus / Optimality Conditions

Optimality Conditions, Part 2: Supplement A (Deeper Mathematical Treatments) through Supplement K (Extended Exercise Solutions: Hints and Partial Solutions)

Supplement A: Deeper Mathematical Treatments

This supplement provides expanded proofs, worked examples, and connections that go beyond the core sections above. These are valuable for practitioners who want to move from "knowing the conditions" to "understanding why the conditions have the form they do."

A.1 Proof of the Envelope Theorem

The Envelope Theorem is the rigorous basis for the shadow price interpretation of Lagrange multipliers. Here is the complete proof.

Setup. Let $p^*(b)$ be the optimal value of:

$$\min_{\mathbf{x}} f(\mathbf{x}, b) \quad \text{s.t.} \quad \mathbf{g}(\mathbf{x}, b) = \mathbf{0}$$

where $b \in \mathbb{R}$ is a parameter (e.g., the right-hand side of a constraint). Assume $f, \mathbf{g}$ are $C^1$ and $\mathbf{x}^*(b)$ varies smoothly.

Theorem.

$$\frac{dp^*}{db} = \frac{\partial \mathcal{L}}{\partial b}\bigg|_{\mathbf{x}^*(b), \boldsymbol{\lambda}^*(b)}$$

where $\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, b) = f(\mathbf{x}, b) + \boldsymbol{\lambda}^\top \mathbf{g}(\mathbf{x}, b)$.

Proof. Differentiate $p^*(b) = f(\mathbf{x}^*(b), b)$ with respect to $b$:

$$\frac{dp^*}{db} = \nabla_\mathbf{x} f(\mathbf{x}^*(b), b)^\top \frac{d\mathbf{x}^*}{db} + \frac{\partial f}{\partial b}(\mathbf{x}^*(b), b)$$

From stationarity: $\nabla_\mathbf{x} f = -\boldsymbol{\lambda}^{*\top} \nabla_\mathbf{x} \mathbf{g}$. Substituting:

$$\frac{dp^*}{db} = -\boldsymbol{\lambda}^{*\top} \nabla_\mathbf{x} \mathbf{g}^\top \frac{d\mathbf{x}^*}{db} + \frac{\partial f}{\partial b}$$

Differentiating the constraint $\mathbf{g}(\mathbf{x}^*(b), b) = \mathbf{0}$ with respect to $b$:

$$\nabla_\mathbf{x} \mathbf{g}^\top \frac{d\mathbf{x}^*}{db} + \frac{\partial \mathbf{g}}{\partial b} = \mathbf{0} \quad \Rightarrow \quad \nabla_\mathbf{x} \mathbf{g}^\top \frac{d\mathbf{x}^*}{db} = -\frac{\partial \mathbf{g}}{\partial b}$$

Substituting back:

$$\frac{dp^*}{db} = \boldsymbol{\lambda}^{*\top} \frac{\partial \mathbf{g}}{\partial b} + \frac{\partial f}{\partial b} = \frac{\partial \mathcal{L}}{\partial b}\bigg|_{\mathbf{x}^*, \boldsymbol{\lambda}^*} \quad \square$$

Consequence for ML. If we add a constraint $\|\mathbf{w}\|^2 = c$ (a norm budget) to a training problem, the Lagrange multiplier $\lambda^*$ equals $-dp^*/dc$: the rate at which the optimal loss decreases as we relax the norm budget. This is why, at the optimum, the weight-decay coefficient $\lambda$ (which enforces a soft norm budget) and the optimal multiplier coincide: they are the same object.
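This shadow-price identity is easy to check numerically. A minimal sketch in Python (the toy problem $\min \frac{1}{2}\|\mathbf{x}\|^2$ s.t. $\mathbf{a}^\top\mathbf{x} = b$ and all names are illustrative, not from the lesson): the finite-difference slope of $p^*(b)$ should match $\partial \mathcal{L}/\partial b = -\lambda^*$.

```python
def solve(a, b):
    """Minimise 0.5*||x||^2 subject to a.x = b (closed form).
    Stationarity x + lam*a = 0 and feasibility a.x = b give
    x* = (b/||a||^2) a and lam* = -b/||a||^2."""
    na2 = sum(ai * ai for ai in a)
    x = [b / na2 * ai for ai in a]
    lam = -b / na2
    p = 0.5 * sum(xi * xi for xi in x)  # optimal value p*(b)
    return x, lam, p

a, b, h = [3.0, 4.0], 2.0, 1e-6
_, lam, _ = solve(a, b)
_, _, p_plus = solve(a, b + h)
_, _, p_minus = solve(a, b - h)
dp_db = (p_plus - p_minus) / (2 * h)   # numerical dp*/db
envelope = -lam                         # dL/db = -lam* for L = f + lam*(a.x - b)
print(dp_db, envelope)
```

The two printed numbers agree: the multiplier carries exactly the sensitivity information the Envelope Theorem promises.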

A.2 Farkas' Lemma: The Foundation of KKT

The Farkas Lemma is the linear algebra result that underpins the entire theory of KKT conditions. It provides the criterion for when a linear system has a solution versus when a certificate of infeasibility exists.

Lemma (Farkas, 1902). Let $A \in \mathbb{R}^{m \times n}$, $\mathbf{b} \in \mathbb{R}^m$. Exactly one of the following holds:

  1. There exists $\mathbf{x} \geq \mathbf{0}$ with $A\mathbf{x} = \mathbf{b}$
  2. There exists $\mathbf{y}$ with $A^\top \mathbf{y} \geq \mathbf{0}$ and $\mathbf{b}^\top \mathbf{y} < 0$

Geometric interpretation. Statement 1 says $\mathbf{b}$ lies in the conic hull of the columns of $A$. Statement 2 says there is a hyperplane with normal $\mathbf{y}$ separating $\mathbf{b}$ from this cone. The lemma asserts that these alternatives are mutually exclusive and exhaustive.

Proof sketch. If 1 holds: $\mathbf{b}^\top \mathbf{y} = (A\mathbf{x})^\top \mathbf{y} = \mathbf{x}^\top A^\top \mathbf{y} \geq 0$ since $\mathbf{x} \geq \mathbf{0}$ and $A^\top \mathbf{y} \geq \mathbf{0}$, so 2 fails. The converse (if 1 fails, then 2 holds) follows from the separating hyperplane theorem for closed convex cones.

How Farkas gives KKT. At a constrained minimum $\mathbf{x}^*$, no feasible descent direction can exist: there is no $\mathbf{d}$ that respects the active constraint gradients and satisfies $\nabla f^\top \mathbf{d} < 0$. By the Farkas alternatives, such a $\mathbf{d}$ fails to exist iff $-\nabla f$ lies in the cone generated by the active constraint gradients, which is exactly the KKT stationarity condition.
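The forward direction of the lemma can be probed numerically: construct $\mathbf{b} = A\mathbf{x}$ with $\mathbf{x} \geq \mathbf{0}$, so alternative 1 holds, and then no sampled $\mathbf{y}$ should satisfy both $A^\top\mathbf{y} \geq \mathbf{0}$ and $\mathbf{b}^\top\mathbf{y} < 0$. A small randomised sketch (illustrative, not a proof; all names are ours):

```python
import random

random.seed(0)
m, n = 3, 4
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
x = [random.uniform(0, 1) for _ in range(n)]                    # x >= 0
b = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]   # b = Ax: alternative 1 holds

violations = 0
for _ in range(10000):
    y = [random.uniform(-1, 1) for _ in range(m)]
    Aty = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]
    # a y with A'y >= 0 and b'y < 0 would contradict Farkas' lemma
    if all(v >= 0 for v in Aty) and sum(bi * yi for bi, yi in zip(b, y)) < -1e-12:
        violations += 1
print(violations)
```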

A.3 Fritz John Conditions: When LICQ Fails

When the LICQ fails (active constraint gradients are linearly dependent), the standard KKT multiplier theorem breaks down. The Fritz John conditions provide a fallback:

Theorem (Fritz John, 1948). At a local minimum $\mathbf{x}^*$ of $f$ subject to $\mathbf{g} = \mathbf{0}$, $\mathbf{h} \leq \mathbf{0}$, there exist multipliers $(\mu_0, \boldsymbol{\mu}, \boldsymbol{\lambda}) \neq \mathbf{0}$ with $\mu_0 \geq 0$, $\boldsymbol{\mu} \geq \mathbf{0}$ such that:

$$\mu_0 \nabla f(\mathbf{x}^*) + \sum_j \mu_j \nabla h_j(\mathbf{x}^*) + \sum_i \lambda_i \nabla g_i(\mathbf{x}^*) = \mathbf{0}$$

and complementary slackness $\mu_j h_j(\mathbf{x}^*) = 0$ holds for every $j$.

The anomaly. When $\mu_0 = 0$, the objective disappears from the condition entirely; the conditions are degenerate and say nothing about $f$. Such a point is called an abnormal solution.

Practical implication. If LICQ holds, we can always normalise to $\mu_0 = 1$, recovering the standard KKT conditions. Fritz John is thus a generalisation that covers degenerate cases, but the cleanest results require LICQ (or another constraint qualification that guarantees $\mu_0 > 0$).

A.4 Second-Order KKT Conditions

For constrained problems, second-order sufficient conditions involve the Hessian of the Lagrangian restricted to the tangent cone of the active constraints.

Setup. At a KKT point $(\mathbf{x}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*)$ with active set $A = \{j : h_j(\mathbf{x}^*) = 0\}$, define:

  • $L = \nabla^2_{\mathbf{x}\mathbf{x}} \mathcal{L}(\mathbf{x}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*)$, the Hessian of the Lagrangian with respect to $\mathbf{x}$
  • The critical cone: $C = \{\mathbf{d} : \nabla h_j(\mathbf{x}^*)^\top \mathbf{d} \leq 0 \text{ for } j \in A \text{ with } \mu_j^* = 0,\; \nabla h_j(\mathbf{x}^*)^\top \mathbf{d} = 0 \text{ for } j \in A \text{ with } \mu_j^* > 0,\; \nabla g_i(\mathbf{x}^*)^\top \mathbf{d} = 0 \text{ for all } i\}$

Second-order necessary condition (SONC): if $\mathbf{x}^*$ is a local minimum, then $\mathbf{d}^\top L \mathbf{d} \geq 0$ for all $\mathbf{d} \in C$.

Second-order sufficient condition (SOSC): if the KKT conditions hold and $\mathbf{d}^\top L \mathbf{d} > 0$ for all $\mathbf{d} \in C \setminus \{\mathbf{0}\}$, then $\mathbf{x}^*$ is a strict local minimum.

For SVM: at the SVM optimum, the Hessian of the Lagrangian in the primal $(\mathbf{w}, b)$ variables is positive definite when restricted to the constraint tangent space (this can be verified from the positive definiteness of the kernel matrix on distinct support vectors), confirming that the optimum is a strict constrained minimum.

A.5 Convex Conjugate and Fenchel Duality

The Fenchel conjugate provides a powerful alternative to Lagrangian duality, particularly useful for non-smooth problems.

Definition (Conjugate function). For $f: \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$:

$$f^*(\mathbf{y}) = \sup_{\mathbf{x}} \left[ \mathbf{y}^\top \mathbf{x} - f(\mathbf{x}) \right]$$

$f^*$ is always convex (it is a pointwise supremum of affine functions of $\mathbf{y}$), regardless of whether $f$ is convex.

Key examples:

| $f(\mathbf{x})$ | Domain of $f$ | $f^*(\mathbf{y})$ | Domain of $f^*$ |
|---|---|---|---|
| $\frac{1}{2}\|\mathbf{x}\|^2$ | $\mathbb{R}^n$ | $\frac{1}{2}\|\mathbf{y}\|^2$ | $\mathbb{R}^n$ |
| $\|\mathbf{x}\|_1$ | $\mathbb{R}^n$ | $0$ if $\|\mathbf{y}\|_\infty \leq 1$, else $+\infty$ | indicator of the unit $\ell_\infty$ ball |
| $\|\mathbf{x}\|_2$ | $\mathbb{R}^n$ | $0$ if $\|\mathbf{y}\|_2 \leq 1$, else $+\infty$ | indicator of the unit $\ell_2$ ball |
| $e^x$ | $\mathbb{R}$ | $y\log y - y$ | $y > 0$ |
| $-\log x$ | $x > 0$ | $-1 - \log(-y)$ | $y < 0$ |
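The table's first row can be checked by brute force: evaluating the supremum on a grid should reproduce $f^*(y) = \frac{1}{2}y^2$ for $f(x) = \frac{1}{2}x^2$. A minimal sketch (the grid bounds, step count, and helper name are ours, chosen for illustration):

```python
# Brute-force the conjugate f*(y) = sup_x [x*y - f(x)] on a grid and
# compare with the closed form from the table (f = x^2/2 gives f* = y^2/2).
def conjugate_on_grid(f, y, lo=-10.0, hi=10.0, steps=20001):
    best = float("-inf")
    for k in range(steps):
        x = lo + (hi - lo) * k / (steps - 1)
        best = max(best, x * y - f(x))
    return best

y = 1.7
approx = conjugate_on_grid(lambda x: 0.5 * x * x, y)
exact = 0.5 * y * y
print(approx, exact)
```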

Fenchel duality. The Fenchel dual of $\min_\mathbf{x} f(A\mathbf{x}) + g(\mathbf{x})$ is:

$$\max_\mathbf{y} \; -f^*(\mathbf{y}) - g^*(-A^\top \mathbf{y})$$

Lasso as Fenchel dual. The Lasso primal $\min_\mathbf{w} \frac{1}{2}\|X\mathbf{w}-\mathbf{y}\|^2 + \lambda\|\mathbf{w}\|_1$ has a Fenchel dual that is a quadratic program with box constraints. The dual variable is the residual $\mathbf{r} = \mathbf{y} - X\mathbf{w}$ scaled by $1/\lambda$, and the dual constraint $\|X^\top \mathbf{r}/\lambda\|_\infty \leq 1$ is precisely the KKT condition behind Lasso sparsity.

A.6 Proximal Operators and Subdifferential Calculus

For non-smooth convex problems, gradient descent fails but proximal methods succeed. The proximal operator is intimately connected to KKT conditions.

Definition. The proximal operator of $g$ with parameter $t > 0$:

$$\text{prox}_{tg}(\mathbf{v}) = \arg\min_\mathbf{x} \left[ g(\mathbf{x}) + \frac{1}{2t}\|\mathbf{x} - \mathbf{v}\|^2 \right]$$

Connection to KKT. A fixed point of $\mathbf{v} \mapsto \text{prox}_{tg}(\mathbf{v})$ is a minimiser of $g$. The optimality condition for the proximal subproblem is $\mathbf{0} \in \partial g(\mathbf{x}^*) + \frac{1}{t}(\mathbf{x}^* - \mathbf{v})$, which at a fixed point ($\mathbf{x}^* = \mathbf{v}$) reduces to $\mathbf{0} \in \partial g(\mathbf{x}^*)$, the subdifferential optimality condition.

Proximal gradient method for $\min f(\mathbf{x}) + g(\mathbf{x})$ (smooth $f$ plus non-smooth $g$):

$$\mathbf{x}_{k+1} = \text{prox}_{\eta g}(\mathbf{x}_k - \eta \nabla f(\mathbf{x}_k))$$

For Lasso: $\text{prox}_{\eta\lambda\|\cdot\|_1}(\mathbf{v})_j = \text{sgn}(v_j)\max(|v_j| - \eta\lambda, 0)$ (soft thresholding). This is the closed-form solution of the proximal subproblem, and it directly implements the KKT condition: the $j$-th weight is zero when the corresponding gradient magnitude is below $\lambda$ (inactive), and nonzero otherwise.
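The soft-thresholding formula can be verified against brute-force minimisation of the proximal subproblem. A minimal sketch (the values of $\eta$, $\lambda$, the test points, and the helper names are ours):

```python
def soft_threshold(v, tau):
    """Closed-form prox of tau*|.| (scalar soft thresholding)."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def prox_objective(x, v, g_weight, t):
    # subproblem: g_weight*|x| + (1/(2t)) * (x - v)^2
    return g_weight * abs(x) + (x - v) ** 2 / (2 * t)

eta, lam = 0.5, 0.3
for v in [-2.0, -0.1, 0.05, 1.3]:
    closed = soft_threshold(v, eta * lam)
    # brute-force minimiser of the subproblem on a fine grid
    grid = min((prox_objective(-3 + 6 * k / 60000, v, lam, eta), -3 + 6 * k / 60000)
               for k in range(60001))[1]
    assert abs(closed - grid) < 1e-3
print("soft thresholding matches brute force")
```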

A.7 Constrained Optimisation in High Dimensions: Curse and Blessing

Classical optimisation theory was developed primarily for problems in $\mathbb{R}^n$ with small $n$ (say $n \leq 10^3$). Modern ML operates at $n = 10^9$ to $10^{11}$. Some classical results break down; others become stronger.

What breaks:

  • Computing the Hessian: $O(n^2)$ memory, impossible at $n = 10^9$
  • Solving KKT systems: $O(n^3)$ for dense systems
  • Verifying positive definiteness: $O(n^3)$ eigendecomposition
  • LICQ: in high dimensions, constraints derived from discrete data rarely have full-rank Jacobians

What gets better (the blessing of overparameterisation):

  • Saddle points become easier to escape: although a random perturbation may have only a small component along the escape directions of a saddle, gradient descent with noise escapes strict saddles in polynomial time
  • Local minima become global: as shown by Kawaguchi and by Du et al., the geometry of overparameterised networks makes spurious local minima rare
  • Flat minima proliferate: with $n \gg N$ parameters for $N$ data points, the set of global minima is typically an $(n-N)$-dimensional manifold; gradient descent finds some flat point on this manifold

Practical approximations to KKT in the high-$n$ regime:

  • A gradient-norm test $\|\nabla\mathcal{L}\| \leq \epsilon$ replaces exact stationarity
  • Curvature estimation: diagonal preconditioners (Adagrad, Adam) approximate second-order information
  • Sketched KKT systems: random projections reduce the system size while approximately preserving solutions
  • Dual certificates: in sparse recovery, checking $\|X^\top(\mathbf{y} - X\hat{\mathbf{w}})\|_\infty \leq \lambda$ verifies the Lasso KKT condition without solving the primal, at cost $O(nd)$ rather than $O(d^3)$

Supplement B: Worked Examples

B.1 Complete KKT Analysis: Portfolio Optimisation

Problem. Markowitz mean-variance portfolio: minimise risk $\frac{1}{2}\mathbf{w}^\top \Sigma \mathbf{w}$ subject to expected return $\boldsymbol{\mu}^\top \mathbf{w} = r$ and budget $\mathbf{1}^\top \mathbf{w} = 1$.

Lagrangian:

$$\mathcal{L} = \frac{1}{2}\mathbf{w}^\top \Sigma \mathbf{w} - \lambda_1(\boldsymbol{\mu}^\top \mathbf{w} - r) - \lambda_2(\mathbf{1}^\top \mathbf{w} - 1)$$

Stationarity: $\Sigma \mathbf{w}^* = \lambda_1^* \boldsymbol{\mu} + \lambda_2^* \mathbf{1}$

Solution: $\mathbf{w}^* = \Sigma^{-1}(\lambda_1^* \boldsymbol{\mu} + \lambda_2^* \mathbf{1})$ (assuming $\Sigma \succ 0$)

Finding the multipliers: substitute into the constraints:

$$\boldsymbol{\mu}^\top \Sigma^{-1}(\lambda_1^* \boldsymbol{\mu} + \lambda_2^* \mathbf{1}) = r, \qquad \mathbf{1}^\top \Sigma^{-1}(\lambda_1^* \boldsymbol{\mu} + \lambda_2^* \mathbf{1}) = 1$$

Define $A = \boldsymbol{\mu}^\top \Sigma^{-1} \boldsymbol{\mu}$, $B = \boldsymbol{\mu}^\top \Sigma^{-1} \mathbf{1}$, $C = \mathbf{1}^\top \Sigma^{-1} \mathbf{1}$. The system becomes:

$$\begin{pmatrix} A & B \\ B & C \end{pmatrix}\begin{pmatrix}\lambda_1^* \\ \lambda_2^*\end{pmatrix} = \begin{pmatrix} r \\ 1 \end{pmatrix}$$

The efficient frontier is the locus of $(r, \mathbf{w}^*(r))$ as $r$ varies; the minimum variance is a parabola as a function of $r$, parameterised by the Lagrange multiplier $\lambda_1^*$.

Sensitivity: by the Envelope Theorem, $dp^*/dr = \lambda_1^*$, where $p^* = \frac{1}{2}\mathbf{w}^{*\top}\Sigma\mathbf{w}^*$ is the minimum half-variance: a one-unit increase in the required return increases the optimal objective by $\lambda_1^*$ units. This is the shadow price: the cost of reaching for higher returns.
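The whole derivation fits in a few lines of Python. A minimal sketch with $\Sigma = I$ and an illustrative $\boldsymbol{\mu}$ (so $\Sigma^{-1}$ is trivial; the instance is ours): solve the $2\times 2$ system, check both constraints, and confirm the sensitivity $dp^*/dr = \lambda_1^*$ by finite differences.

```python
# Tiny Markowitz instance with Sigma = I (so Sigma^{-1} = I); values are illustrative.
mu = [0.10, 0.20]

def multipliers(r):
    A = sum(m * m for m in mu)       # mu' Sigma^-1 mu
    B = sum(mu)                      # mu' Sigma^-1 1
    C = float(len(mu))               # 1' Sigma^-1 1
    det = A * C - B * B
    lam1 = (C * r - B) / det         # Cramer's rule for [[A,B],[B,C]] [l1,l2] = [r,1]
    lam2 = (A - B * r) / det
    return lam1, lam2

def optimal_w(r):
    lam1, lam2 = multipliers(r)
    return [lam1 * m + lam2 for m in mu]   # w* = Sigma^-1 (lam1 mu + lam2 1)

r = 0.15
w = optimal_w(r)
ret = sum(wi * mi for wi, mi in zip(w, mu))
budget = sum(w)

def p_star(r):
    w = optimal_w(r)
    return 0.5 * sum(wi * wi for wi in w)   # minimum half-variance

h = 1e-6
sens = (p_star(r + h) - p_star(r - h)) / (2 * h)   # numerical dp*/dr
lam1, _ = multipliers(r)
print(ret, budget, sens, lam1)
```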

B.2 KKT Verification via Duality: Water-Filling

The water-filling solution for capacity maximisation in parallel Gaussian channels is a classic KKT example.

Problem. Maximise total capacity $\sum_i \log(1 + p_i/\sigma_i^2)$ subject to $\sum_i p_i = P$, $p_i \geq 0$.

Lagrangian: $\mathcal{L} = \sum_i \log(1 + p_i/\sigma_i^2) - \lambda(\sum_i p_i - P) + \sum_i \mu_i p_i$

KKT stationarity:

$$\frac{1}{\sigma_i^2 + p_i^*} = \lambda^* - \mu_i^* \leq \lambda^*$$

Complementary slackness: $\mu_i^* p_i^* = 0$. So either $p_i^* = 0$ (channel off) or $\mu_i^* = 0$ and $p_i^* = 1/\lambda^* - \sigma_i^2$ (channel on, filled to water level $1/\lambda^*$).

Solution: $p_i^* = \max(0, 1/\lambda^* - \sigma_i^2)$, where $\lambda^*$ is chosen so that $\sum_i p_i^* = P$.

The water-filling metaphor: imagine filling a container to a flat water level $1/\lambda^*$; channels whose noise floor $\sigma_i^2$ sits above the water level receive no power (inactive channel). This is the KKT complementary slackness condition made vivid.
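Since $\sum_i \max(0, \text{level} - \sigma_i^2)$ is monotone in the water level, $\lambda^*$ can be found by bisection. A minimal sketch (the noise levels, budget, and function name are illustrative):

```python
def water_fill(noise, P, iters=100):
    """Return powers p_i = max(0, level - sigma_i^2) with sum p_i = P,
    finding the water level (= 1/lambda*) by bisection."""
    lo, hi = min(noise), max(noise) + P
    for _ in range(iters):
        level = 0.5 * (lo + hi)
        total = sum(max(0.0, level - s) for s in noise)
        if total > P:
            hi = level      # too much water: lower the level
        else:
            lo = level
    level = 0.5 * (lo + hi)
    return [max(0.0, level - s) for s in noise], level

noise = [1.0, 2.0, 6.0]   # channel noise floors sigma_i^2
P = 2.0                   # total power budget
p, level = water_fill(noise, P)
print(p, level)
```

With these numbers the level settles at 2.5, so the noisy third channel (floor 6.0) stays dry: complementary slackness in action.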

B.3 RLHF as a Constrained Optimisation Problem

Reinforcement Learning from Human Feedback (RLHF) can be formulated as a constrained optimisation problem whose KKT conditions provide exact theoretical insight into the resulting policy.

Problem formulation. Let $\pi_\theta(\mathbf{y}|\mathbf{x})$ be the model policy and $\pi_{\text{ref}}$ the reference policy. Maximise expected reward subject to a KL constraint:

$$\max_\theta \mathbb{E}_{\mathbf{y} \sim \pi_\theta}[r(\mathbf{x}, \mathbf{y})] \quad \text{s.t.} \quad \mathbb{E}_\mathbf{x}[D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})] \leq \delta$$

Lagrangian:

$$\mathcal{L} = \mathbb{E}[r(\mathbf{x}, \mathbf{y})] - \beta \cdot \mathbb{E}[D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})]$$

where $\beta \geq 0$ is the Lagrange multiplier (dual variable) for the KL constraint.

KKT stationarity (with respect to the policy value at each $(\mathbf{x}, \mathbf{y})$, up to a constant from the normalisation constraint that is absorbed into $Z(\mathbf{x})$ below):

$$r(\mathbf{x}, \mathbf{y}) - \beta(\log \pi_\theta(\mathbf{y}|\mathbf{x}) - \log \pi_{\text{ref}}(\mathbf{y}|\mathbf{x})) - \beta = 0$$

Solving for the optimal policy:

$$\pi_\theta^*(\mathbf{y}|\mathbf{x}) = \frac{\pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp(r(\mathbf{x},\mathbf{y})/\beta)}{Z(\mathbf{x})}$$

where $Z(\mathbf{x}) = \sum_\mathbf{y} \pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp(r(\mathbf{x},\mathbf{y})/\beta)$ is the partition function.

This is exactly the Boltzmann distribution from maximum-entropy principles: the optimal RLHF policy is the reference policy tilted by the reward, with temperature $\beta$. Higher $\beta$ (a larger KL penalty) keeps the policy closer to $\pi_{\text{ref}}$; lower $\beta$ maximises reward more aggressively.

Complementary slackness: if the KL budget $\delta$ is not exhausted (the model does not need to move far to maximise reward), then $\beta^* = 0$ and the optimal policy maximises reward unconstrained. If the constraint binds, $\beta^* > 0$ is the shadow price of KL: the reward the model could gain per unit of additional KL divergence from the reference.

Direct Preference Optimisation (DPO). DPO (Rafailov et al. 2023) avoids sampling from $\pi_\theta$ by reparameterising the reward using the KKT condition above: $r(\mathbf{x},\mathbf{y}) = \beta \log(\pi_\theta(\mathbf{y}|\mathbf{x})/\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})) + \beta \log Z(\mathbf{x})$. This turns the constrained RL problem into supervised binary classification on preference pairs, an elegant application of duality and KKT theory.
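The closed-form tilted policy is easy to compute for a discrete toy vocabulary. A minimal sketch (the reference distribution and rewards are illustrative) showing the two limits of $\beta$:

```python
import math

def tilted_policy(pi_ref, rewards, beta):
    """Optimal KL-regularised policy: pi*(y) proportional to pi_ref(y) * exp(r(y)/beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    Z = sum(weights)                     # partition function
    return [w / Z for w in weights]

pi_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

sharp = tilted_policy(pi_ref, rewards, beta=0.1)    # low beta: chase the reward
soft = tilted_policy(pi_ref, rewards, beta=100.0)   # high beta: stay near pi_ref
print(sharp, soft)
```

At $\beta = 0.1$ nearly all mass lands on the highest-reward token; at $\beta = 100$ the policy is nearly indistinguishable from $\pi_{\text{ref}}$.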


Supplement C: Computational Methods for Finding Critical Points

C.1 Newton's Method and KKT Systems

For smooth unconstrained problems, Newton's method solves $\nabla f(\mathbf{x}) = \mathbf{0}$ by iterating:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - [\nabla^2 f(\mathbf{x}_k)]^{-1} \nabla f(\mathbf{x}_k)$$

Newton's method converges quadratically near a non-degenerate critical point (where $H \succ 0$): the error satisfies $\|\mathbf{x}_{k+1} - \mathbf{x}^*\| \leq c\|\mathbf{x}_k - \mathbf{x}^*\|^2$.

For equality-constrained problems, Newton's method solves the KKT system:

$$\begin{pmatrix} \nabla^2_{\mathbf{x}\mathbf{x}} \mathcal{L} & \nabla_\mathbf{x} \mathbf{g}^\top \\ \nabla_\mathbf{x} \mathbf{g} & 0 \end{pmatrix} \begin{pmatrix} \Delta \mathbf{x} \\ \Delta \boldsymbol{\lambda} \end{pmatrix} = -\begin{pmatrix} \nabla_\mathbf{x} \mathcal{L} \\ \mathbf{g}(\mathbf{x}) \end{pmatrix}$$

This KKT matrix (also called the augmented system or saddle-point matrix) is symmetric but indefinite: it has both positive and negative eigenvalues. Solving it efficiently is the core computational challenge of constrained optimisation.
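For a quadratic objective with linear constraints, a single KKT solve is exact. A minimal sketch (a tiny equality-constrained QP of our own choosing, with a naive dense solver standing in for a real linear-algebra routine):

```python
# One KKT solve for the equality-constrained QP:
#   min 0.5 x'Qx - c'x  s.t.  a'x = b
# Stationarity Qx + lam*a = c and feasibility a'x = b give the linear system below.
def solve_linear(M, rhs):
    """Gaussian elimination with partial pivoting (fine for tiny dense systems)."""
    n = len(M)
    M = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

Q = [[2.0, 0.0], [0.0, 4.0]]
c = [1.0, 1.0]
a = [1.0, 1.0]
b = 1.0

# KKT matrix [[Q, a], [a', 0]] with right-hand side [c, b]
KKT = [[2.0, 0.0, 1.0],
       [0.0, 4.0, 1.0],
       [1.0, 1.0, 0.0]]
sol = solve_linear(KKT, [1.0, 1.0, 1.0])
x, lam = sol[:2], sol[2]

# verify stationarity Qx - c + lam*a = 0 and feasibility a'x = b
stat = [sum(Q[i][j] * x[j] for j in range(2)) - c[i] + lam * a[i] for i in range(2)]
feas = x[0] + x[1] - b
print(x, lam, stat, feas)
```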

C.2 Interior Point Methods and the Central Path

Interior point methods (barrier methods) solve inequality-constrained problems via a sequence of unconstrained problems. Replace $\min f$ s.t. $h_j \leq 0$ with:

$$\min_\mathbf{x} f(\mathbf{x}) - t \sum_j \log(-h_j(\mathbf{x}))$$

for decreasing $t > 0$. The log-barrier $-t\log(-h_j)$ forces the iterates to stay strictly feasible (it tends to $+\infty$ as $h_j \to 0^-$) and approximates the indicator of $h_j \leq 0$ as $t \to 0$.

Central path. The solution $\mathbf{x}^*(t)$, as a function of $t$, traces the central path: a smooth curve converging to the optimum $\mathbf{x}^*$ as $t \to 0$. Along the central path, the modified KKT conditions hold:

$$\nabla f(\mathbf{x}^*(t)) + \sum_j \mu_j(t) \nabla h_j(\mathbf{x}^*(t)) = \mathbf{0}, \quad \mu_j(t)\, h_j(\mathbf{x}^*(t)) = -t$$

As $t \to 0$, $\mu_j(t) h_j(\mathbf{x}^*(t)) \to 0$ recovers complementary slackness. Interior point methods achieve the best known polynomial complexity for LP and convex QP: $O(\sqrt{n} \log(1/\epsilon))$ iterations.
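The central path can be traced explicitly for a one-dimensional toy problem, $\min x$ s.t. $x \geq 1$ (so $h(x) = 1 - x \leq 0$). A minimal sketch (bisection on the barrier's stationarity condition $1 - t/(x-1) = 0$; the values of $t$ are illustrative):

```python
def central_path_point(t):
    """Minimise x - t*log(x - 1) by bisection on the derivative 1 - t/(x - 1)."""
    lo, hi = 1.0 + 1e-12, 11.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 1.0 - t / (mid - 1.0) < 0:   # derivative negative: minimiser lies to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for t in [1.0, 0.1, 0.001]:
    x = central_path_point(t)           # analytic answer is x*(t) = 1 + t -> 1 as t -> 0
    mu = t / (x - 1.0)                  # barrier multiplier estimate
    print(t, x, mu * (1.0 - x))         # mu * h(x) = -t along the central path
```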

C.3 Automatic Differentiation and Gradient Verification

Modern ML uses automatic differentiation (AD) to compute exact gradients of any differentiable computation graph. The core mechanism is an exact application of the chain rule through reverse mode accumulation-backpropagation.

Gradient checking. A standard technique for verifying backprop correctness is comparing against finite differences:

$$\frac{\partial \mathcal{L}}{\partial w_j} \approx \frac{\mathcal{L}(\mathbf{w} + h\mathbf{e}_j) - \mathcal{L}(\mathbf{w} - h\mathbf{e}_j)}{2h}$$

with, say, $h = 10^{-5}$. This centred difference has $O(h^2)$ truncation error; in double precision, roundoff contributes error of order $\epsilon_{\text{mach}}/h \approx 10^{-11}$, so the finite-difference gradient should agree with the exact gradient to roughly 7 to 10 significant digits.

Why finite differences are not used in practice:

  • Cost: $2n$ forward passes for $n$ parameters, vs. one backward pass for AD
  • Numerical precision: for very large or small gradients, floating-point rounding limits accuracy

But finite differences remain the gold standard for correctness verification during algorithm development: if your AD gradient does not match finite differences at around 1e-5 relative tolerance, you have a bug.
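A minimal sketch of such a gradient check (the test function is arbitrary; `grad_f` plays the role of the AD/backprop gradient):

```python
import math

def f(w):
    # small smooth test function: f(w) = sin(w0) * w1^2 + w0
    return math.sin(w[0]) * w[1] ** 2 + w[0]

def grad_f(w):
    # analytic gradient (what backprop would produce)
    return [math.cos(w[0]) * w[1] ** 2 + 1.0, 2.0 * math.sin(w[0]) * w[1]]

def numeric_grad(f, w, h=1e-5):
    g = []
    for j in range(len(w)):
        wp, wm = w[:], w[:]
        wp[j] += h
        wm[j] -= h
        g.append((f(wp) - f(wm)) / (2 * h))   # centred difference
    return g

w = [0.7, -1.3]
exact = grad_f(w)
approx = numeric_grad(f, w)
rel_err = max(abs(a - e) / max(1.0, abs(e)) for a, e in zip(approx, exact))
print(rel_err)   # should be far below the 1e-5 tolerance mentioned above
```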

C.4 Constraint Satisfaction in Modern ML Systems

Modern ML increasingly uses hard and soft constraints, and the optimality conditions for handling these have real-world implementations:

Constitutional AI and RLHF with hard constraints. Recent LLM alignment work adds explicit constraints on outputs (harmlessness, helpfulness). The dual variable β\beta controlling KL deviation from a reference policy is analogous to the Lagrange multiplier on the constraint "don't deviate too far from the safe base model." The KKT condition determines the optimal trade-off.

Quantisation as constrained optimisation. Post-training quantisation (PTQ) and quantisation-aware training (QAT) seek weight matrices $\hat{W}$ that are close to $W$ but have values restricted to a discrete grid. This is:

$$\min_{\hat{W}} \|W - \hat{W}\|_F^2 \quad \text{s.t.} \quad \hat{W}_{ij} \in \{-q, \ldots, q\}$$

Because the objective separates over entries, the optimal solution is the projection onto the grid: round each weight to the nearest allowed value. For the continuous relaxation with box constraints $-q \leq \hat{W}_{ij} \leq q$, the KKT conditions describe clipping: entries that hit the box boundary carry a nonzero multiplier, while interior entries have zero multiplier and pass through unchanged (complementary slackness).

FlashAttention and memory constraints. FlashAttention (Dao et al. 2022) reformulates the attention computation to respect SRAM capacity constraints. The tiling strategy addresses a constrained I/O minimisation problem: process tiles of $Q, K, V$ that fit in SRAM while still computing exact attention, with tile sizes chosen from the memory-hierarchy constraints.


Supplement D: Visualisation Guide

D.1 Saddle Point Visualisation

The canonical saddle point is $f(x,y) = x^2 - y^2$ at the origin. Its Hessian is $\text{diag}(2, -2)$, which is indefinite. Level curves are hyperbolas; gradient descent converges toward the origin along the $x$-axis but diverges along the $y$-axis.

In high dimensions. Gradient descent perturbed with small Gaussian noise escapes strict saddles in a number of steps that is polynomial in the target accuracy and only polylogarithmic in the dimension $n$ (Jin et al. 2017). The dominant escape direction is the eigenvector of the most negative Hessian eigenvalue.
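The contraction and expansion behaviour is visible in a few iterations of plain gradient descent on $x^2 - y^2$. A minimal sketch (the step size and the tiny noise-like $y$-offset are illustrative):

```python
# Gradient descent on f(x, y) = x^2 - y^2 from a near-saddle start:
# the x-coordinate contracts toward 0 while the y-coordinate expands,
# so the iterate escapes along the negative-curvature direction.
eta = 0.1
x, y = 1.0, 1e-3          # tiny y-component, as would come from noise
for _ in range(100):
    gx, gy = 2 * x, -2 * y
    x, y = x - eta * gx, y - eta * gy
print(x, y)
```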

D.2 Level Curve Analysis

For any scalar $f: \mathbb{R}^2 \to \mathbb{R}$:

  • Near a local minimum: level curves are approximately ellipses centred at $\mathbf{x}^*$, with aspect ratio $\sqrt{\lambda_{\max}/\lambda_{\min}}$ of the Hessian
  • Near a saddle point: level curves form hyperbolas; the asymptotes are the stable and unstable manifolds
  • Near a local maximum: level curves are again ellipses, with the gradient pointing inward, toward the maximum (since $f$ increases toward $\mathbf{x}^*$)

The condition number of the Hessian at a minimum determines the eccentricity of the ellipses. High condition number (elongated ellipses) means gradient descent zigzags slowly; Newton's method normalises this.

D.3 Constraint Geometry

For $\min f$ s.t. $g = 0$:

  • The feasible set is a manifold (a curve in $\mathbb{R}^2$, a surface in $\mathbb{R}^3$)
  • The level curves of $f$ foliate the ambient space
  • The minimum occurs where a level curve is tangent to the constraint manifold
  • Tangency means $\nabla f \perp T_{\mathbf{x}^*}$ (the constraint tangent space), i.e., $\nabla f \parallel \nabla g$, which is exactly Lagrange's condition

The visual clarity of this geometric argument is why Lagrange's method is so compelling: it replaces algebra with geometry.


Supplement E: Key Formulas Reference

OPTIMALITY CONDITIONS CHEAT SHEET


UNCONSTRAINED

  FONC: ∇f(x*) = 0                      (necessary)
  SONC: H(x*) ⪰ 0                       (necessary)
  SOSC: ∇f(x*) = 0 and H(x*) ≻ 0        (sufficient for strict local min)
  Convex f: FONC ⟺ global min

EQUALITY CONSTRAINTS (g(x) = 0)

  Lagrangian: L = f + λᵀg
  Conditions: ∇_x L = 0  and  g(x*) = 0
  SOSC on tangent space: dᵀ ∇²_xx L d > 0  for all d ≠ 0 with ∇g(x*) d = 0

INEQUALITY CONSTRAINTS (h(x) ≤ 0)

  Lagrangian: L = f + μᵀh + λᵀg
  KKT (4 conditions):
    (1) Stationarity:   ∇_x L = 0
    (2) Primal feas.:   h(x*) ≤ 0,  g(x*) = 0
    (3) Dual feas.:     μ* ≥ 0
    (4) Comp. slack.:   μ_i* h_i(x*) = 0  for all i

DUALITY

  Dual function:  q(μ, λ) = inf_x L(x, μ, λ)
  Weak duality:   d* ≤ p*  always
  Strong duality: d* = p*  under Slater's condition (convex, strictly feasible)
  Shadow price:   ∂p*/∂b = λ*  (sensitivity of the optimal value)

SPECIFIC PROBLEMS

  OLS:     (XᵀX) w = Xᵀy               (normal equations = FONC)
  Ridge:   (XᵀX + nλI) w = Xᵀy         (FONC with L2 regularisation)
  SVM:     w* = Σ_i α_i y_i x_i        (stationarity); α_i > 0 iff on margin
  PCA:     Σ v = λ v                   (FONC for variance max on the sphere)
  Softmax: p ∝ exp(z)                  (FONC for max entropy with logit constraints)
  RLHF:    π*(y|x) ∝ π_ref exp(r/β)    (KKT for KL-constrained reward max)



References and Further Reading

Primary References

  1. Boyd & Vandenberghe, "Convex Optimization" (2004) - Chapters 2-4 (convex sets, functions, and problems) and Chapter 5 (duality and KKT conditions). The standard graduate reference. Available free online at stanford.edu/~boyd/cvxbook.
  2. Nocedal & Wright, "Numerical Optimization" (2006) - Chapters 12-15. Rigorous treatment of constrained optimisation, KKT theory, interior point methods, and convergence analysis.
  3. Rockafellar, "Convex Analysis" (1970) - The foundational text on subdifferentials, conjugate functions, and duality. Rigorous but dense.
  4. Bertsekas, "Nonlinear Programming" (2016) - Chapter 3 (Lagrange multiplier methods) and Chapter 4 (duality). Excellent treatment of constraint qualifications.

For ML/AI Connections

  1. Dauphin et al. "Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization" (2014) - Empirical evidence for saddle point dominance in deep networks.
  2. Foret et al. "Sharpness-Aware Minimization for Efficiently Improving Generalization" (2021) - SAM algorithm; KKT analysis of the inner maximisation.
  3. Papyan et al. "Prevalence of Neural Collapse during the Terminal Phase of Deep Learning Training" (2020) - Neural collapse as KKT critical point of the UFM.
  4. Rafailov et al. "Direct Preference Optimization" (2023) - DPO derivation from RLHF KKT conditions.
  5. Du et al. "Gradient Descent Finds Global Minima of Non-Convex Neural Networks" (2018) - Overparameterisation and global optimality.
  6. Kawaguchi "Deep Learning without Poor Local Minima" (2016) - Theoretical analysis of local minima in deep linear networks.

Computational Tools

  1. CVXPY (Diamond & Boyd 2016) - Python library for convex optimisation with automatic KKT verification. Handles LP, QP, SOCP, SDP with strong duality guarantees.
  2. SciPy scipy.optimize.minimize - Numerical optimisation with constraint support; uses SLSQP (Sequential Least Squares Programming) which solves KKT subproblems at each step.

This section is part of the 05 Multivariate Calculus chapter. For gradient descent algorithms and convergence theory using these optimality conditions, see 05/05 Gradient Descent and Convergence. For the probability chapter where Lagrange multipliers appear in maximum likelihood and Bayesian estimation, see 06.


Supplement F: Extended Theory

F.1 The Morse Lemma and Local Geometry of Non-Degenerate Critical Points

Theorem (Morse Lemma). Near a non-degenerate critical point $\mathbf{x}^*$ (where $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and $H(\mathbf{x}^*)$ is invertible), there exist local coordinates $\mathbf{u}$ in which $f$ takes the form:

$$f(\mathbf{x}^* + \mathbf{u}) = f(\mathbf{x}^*) - u_1^2 - \cdots - u_k^2 + u_{k+1}^2 + \cdots + u_n^2$$

where $k$ is the Morse index (the number of negative eigenvalues of $H$).

Implication. Up to a smooth change of coordinates, every non-degenerate critical point has the local geometry of a pure quadratic. This is why the Hessian gives complete local information: it determines the topological type of the critical point.

For AI. In transformer loss landscapes, measuring the Morse index of critical points found during training is computationally expensive but theoretically illuminating. Empirical studies using Lanczos approximations of $H$ suggest that good minima have small Morse index (few descent directions), while saddles encountered earlier in training have large index.

F.2 Generalized Saddle Points and Minimax Problems

Definition. A point $(\mathbf{x}^*, \mathbf{y}^*)$ is a saddle point of $f(\mathbf{x}, \mathbf{y})$ if:

$$f(\mathbf{x}^*, \mathbf{y}) \leq f(\mathbf{x}^*, \mathbf{y}^*) \leq f(\mathbf{x}, \mathbf{y}^*) \quad \forall \mathbf{x}, \mathbf{y}$$

(a minimum over $\mathbf{x}$, a maximum over $\mathbf{y}$).

Existence theorem. If $f$ is convex in $\mathbf{x}$ and concave in $\mathbf{y}$, and both domains are compact convex sets, then a saddle point exists (von Neumann minimax theorem).

Optimality conditions for saddle points:

$$\nabla_\mathbf{x} f(\mathbf{x}^*, \mathbf{y}^*) = \mathbf{0}, \quad \nabla_\mathbf{y} f(\mathbf{x}^*, \mathbf{y}^*) = \mathbf{0}$$

This looks identical to an unconstrained critical point, but the additional convex-concave structure ensures the point is a minimax rather than an arbitrary stationary point.

GAN training as saddle point optimisation. Generative Adversarial Networks optimise the objective:

$$\min_G \max_D V(G, D) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_\mathbf{z}}[\log(1 - D(G(\mathbf{z})))]$$

The optimal discriminator $D^*$ maximises $V$ for fixed $G$; the optimal generator $G^*$ minimises $V$ against $D^*$. At the Nash equilibrium (saddle point), $G^*$ generates samples from $p_{\text{data}}$ and $D^* = 1/2$ everywhere.

The challenge: the convex-concave structure is absent in neural network GAN training (both $G$ and $D$ are non-convex), so a saddle point need not exist, and alternating gradient updates can cycle or diverge.

F.3 Proximal Point Algorithm and Augmented Lagrangian

Proximal Point Algorithm (PPA). For convex ff:

xk+1=argminx[f(x)+12ηxxk2]=proxηf(xk)\mathbf{x}_{k+1} = \arg\min_\mathbf{x} \left[ f(\mathbf{x}) + \frac{1}{2\eta}\|\mathbf{x} - \mathbf{x}_k\|^2 \right] = \text{prox}_{\eta f}(\mathbf{x}_k)

This is equivalent to finding the KKT point of the proximal subproblem: f(xk+1)+1η(xk+1xk)=0\nabla f(\mathbf{x}_{k+1}) + \frac{1}{\eta}(\mathbf{x}_{k+1} - \mathbf{x}_k) = \mathbf{0}.

Augmented Lagrangian (method of multipliers). For minf(x)\min f(\mathbf{x}) s.t. Ax=bA\mathbf{x} = \mathbf{b}, the augmented Lagrangian:

Lρ(x,λ)=f(x)+λ(Axb)+ρ2Axb2\mathcal{L}_\rho(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \boldsymbol{\lambda}^\top(A\mathbf{x}-\mathbf{b}) + \frac{\rho}{2}\|A\mathbf{x}-\mathbf{b}\|^2

combines the standard Lagrangian with a quadratic penalty term. The quadratic term is zero at feasibility, so the KKT conditions for the augmented Lagrangian match the original. But the augmented Lagrangian is better conditioned-the ρAxb2\rho\|A\mathbf{x}-\mathbf{b}\|^2 term adds curvature near the constraint.

ADMM. The Alternating Direction Method of Multipliers (ADMM) is an augmented Lagrangian method split across two blocks of variables. It is widely used for distributed training, consensus optimisation, and lasso-type problems. At convergence, the ADMM iterates satisfy the KKT conditions of the original problem.
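A sketch of the method of multipliers on a small equality-constrained QP (the problem data and $\rho$ below are assumed values for illustration; the x-minimisation has a closed form for quadratics):

```python
import numpy as np

# min 0.5||x||^2 - x1 - x2  s.t.  x1 + x2 = 1
# Hand-derived KKT solution: x* = (0.5, 0.5), lambda* = 0.5.
Q = np.eye(2)
c = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

rho = 10.0
lam = np.zeros(1)
for _ in range(50):
    # x-minimisation of L_rho: (Q + rho A'A) x = -c - A'lam + rho A'b
    x = np.linalg.solve(Q + rho * A.T @ A, -c - A.T @ lam + rho * A.T @ b)
    # dual ascent on the multiplier, step size rho
    lam = lam + rho * (A @ x - b)
# At convergence the pair (x, lam) satisfies the original KKT system:
# Qx + c + A'lam = 0 and Ax = b.
```

Unlike the pure penalty method, feasibility is achieved exactly at a finite $\rho$ because the multiplier absorbs the constraint force.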

F.4 Implicit Function Theorem and Sensitivity Analysis

The KKT conditions define a nonlinear system $F(\mathbf{x}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*; \boldsymbol{\theta}) = \mathbf{0}$ parameterised by $\boldsymbol{\theta}$ (problem data). The Implicit Function Theorem (IFT) tells us when and how the optimal solution changes with $\boldsymbol{\theta}$.

Theorem (IFT). If $F(\mathbf{z}^*; \boldsymbol{\theta}^*) = \mathbf{0}$ and the Jacobian $\nabla_\mathbf{z} F(\mathbf{z}^*; \boldsymbol{\theta}^*)$ is invertible, then there exists a smooth function $\mathbf{z}^*(\boldsymbol{\theta})$ near $\boldsymbol{\theta}^*$ with $F(\mathbf{z}^*(\boldsymbol{\theta}); \boldsymbol{\theta}) = \mathbf{0}$, and:

\frac{\partial \mathbf{z}^*}{\partial \boldsymbol{\theta}} = -[\nabla_\mathbf{z} F]^{-1} \nabla_{\boldsymbol{\theta}} F

Application: differentiating through optimisation. In meta-learning (MAML) and bilevel optimisation, we need $\partial \mathbf{x}^*(\boldsymbol{\theta})/\partial \boldsymbol{\theta}$, the derivative of the optimal solution with respect to parameters. The IFT gives this derivative via solving a linear system involving the KKT Jacobian.
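A numerical sanity check of the IFT sensitivity formula on an assumed toy inner problem, comparing against finite differences of the re-solved argmin:

```python
import numpy as np

# Inner problem: x*(theta) = argmin_x 0.5 x'Qx - theta'x.
# First-order condition: F(x, theta) = Qx - theta = 0.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
theta = np.array([1.0, -1.0])

x_star = np.linalg.solve(Q, theta)

# IFT: dx*/dtheta = -[dF/dx]^{-1} dF/dtheta = -Q^{-1}(-I) = Q^{-1}
sens_ift = -np.linalg.solve(Q, -np.eye(2))

# Finite-difference check: re-solve the inner problem at theta + h e_i
h = 1e-6
sens_fd = np.column_stack([
    (np.linalg.solve(Q, theta + h * e) - x_star) / h
    for e in np.eye(2)
])
assert np.allclose(sens_ift, sens_fd, atol=1e-5)
```

The IFT route needs one linear solve against the (KKT) Jacobian rather than one re-optimisation per parameter, which is why it scales to bilevel problems.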

MAML connection. Model-Agnostic Meta-Learning computes:

\frac{d\mathcal{L}_{\text{meta}}(\theta - \alpha \nabla\mathcal{L}_{\text{task}}(\theta))}{d\theta}

This requires backpropagating through the inner gradient step, which is equivalent to applying the IFT to the first-order optimality condition $\nabla\mathcal{L}_{\text{task}}(\mathbf{x}^*(\theta)) = \mathbf{0}$ of the inner problem.

F.5 Optimality and Generalisation: The Flat Minima Hypothesis

A deep connection links optimality conditions to generalisation in deep learning. The sharpness of a minimum, measured by the trace or maximum eigenvalue of the Hessian, correlates empirically with test performance.

Measures of sharpness:

  • $\text{tr}(H) = \sum_i \lambda_i(H)$: total curvature (the squared Frobenius norm of $H^{1/2}$)
  • $\lambda_{\max}(H)$: maximum curvature (sharpness in the steepest direction)
  • $\Phi_\rho(\mathbf{w}^*) = \max_{\|\boldsymbol{\epsilon}\| \leq \rho} \mathcal{L}(\mathbf{w}^* + \boldsymbol{\epsilon}) - \mathcal{L}(\mathbf{w}^*)$: PAC-Bayes sharpness
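These quantities are directly computable for small models. A minimal sketch, assuming a toy quadratic loss $\mathcal{L}(\mathbf{w}) = \frac{1}{2}\mathbf{w}^\top H \mathbf{w}$ so that the perturbation-based sharpness has a closed form at the minimum $\mathbf{w}^* = 0$:

```python
import numpy as np

# Two assumed Hessians: a "sharp" minimum and a "flat" one.
H_sharp = np.diag([100.0, 1.0])
H_flat = np.diag([1.0, 1.0])

def sharpness(H, rho=0.1):
    # For a quadratic at its minimum, max_{|e|<=rho} L(e) - L(0)
    # equals 0.5 * rho^2 * lambda_max(H): the worst perturbation
    # lies along the top eigenvector.
    return 0.5 * rho**2 * np.max(np.linalg.eigvalsh(H))

for H in (H_sharp, H_flat):
    print(np.trace(H), np.max(np.linalg.eigvalsh(H)), sharpness(H))
```

All three measures agree on the ranking here; for non-quadratic losses the perturbation-based measure must be approximated by an inner maximisation, which is exactly what SAM does.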

Why flat minima generalise better (hypothesis). A flat minimum has many parameters with nearly zero curvature. Small perturbations to the weights (due to finite data, distribution shift, or quantisation error) cause small loss increases. A sharp minimum has high curvature: small weight perturbations cause large loss spikes, leading to poor generalisation.

PAC-Bayes bound. The PAC-Bayes generalisation bound for a posterior $Q$ over weights near $\mathbf{w}^*$ includes a term:

\text{KL}(Q \| P) \leq \sum_i \frac{(\sigma_i)^2 \lambda_i(H)}{2}

where $P$ is the prior. This directly penalises large Hessian eigenvalues, confirming that flat minima (small eigenvalues) admit tighter generalisation bounds.

Implication for optimality. The "best" solution from a generalisation perspective is not just any critical point: it is a flat critical point, one where $\lambda_{\max}(H)$ is small. SAM seeks this; the KKT condition for the SAM minimax problem gives the perturbation direction that reveals the sharpest ascent direction.


Supplement G: Historical Perspective

G.1 Timeline of Key Developments

HISTORY OF OPTIMALITY CONDITIONS


  1788  Lagrange - "Mécanique Analytique": Lagrange multipliers for
        constrained mechanics. Invented to handle pendulum constraints.

  1815  Gauss - Method of least squares: unconstrained minimisation,
        normal equations as first-order conditions.

  1844  Cauchy - Gradient descent proposed as an optimisation method.
        First explicit use of the gradient as a direction.

  1902  Farkas - Farkas Lemma for linear systems of inequalities.
        The algebraic foundation of all LP and KKT theory.

  1928  Von Neumann - Minimax theorem for zero-sum games.
        Foundation of saddle-point theory and game theory.

  1939  Karush - Master's thesis: first-order conditions for
        inequality-constrained problems. (Unknown for decades.)

  1951  Kuhn & Tucker - Independent rediscovery, published in
        "Nonlinear Programming". Conditions become "KKT" in their honour
        (Karush's contribution recognised only later).

  1953  Charnes, Cooper & Henderson - "An Introduction to Linear
        Programming": textbook treatment of the simplex method and
        LP duality.

  1961  Wolfe - Duality theorem for nonlinear programming.

  1963  Zoutendijk - Feasible direction methods, active set concepts.

  1979  Khachiyan - Ellipsoid method: first polynomial algorithm for LP.

  1984  Karmarkar - Interior point method for LP, O(n^3.5 L) time.
        Practical polynomial LP solver.

  1995  Cortes & Vapnik - Support-vector networks: the KKT dual is the
        modern SVM. Complementary slackness identifies support vectors.

  1996  Byrd et al. - L-BFGS-B: limited memory quasi-Newton for
        bound-constrained problems. Industry standard for large-scale
        constrained optimisation.

  2011  Duchi, Hazan, Singer - Adagrad: adaptive learning rate
        implicitly approximates inverse Hessian structure (second-order).

  2014  Dauphin et al. - Saddle point dominance in deep learning.
        First systematic empirical study of neural network landscape.

  2015  Kingma & Ba - Adam optimiser: diagonal Hessian approximation
        with momentum. Becomes default for LLM training.

  2018  Du et al. - Gradient descent finds global minima in wide nets.
        Theoretical basis for why deep learning "works."

  2020  Papyan et al. - Neural Collapse: KKT analysis of terminal
        training geometry. UFM framework.

  2021  Foret et al. - SAM: Sharpness-Aware Minimisation. KKT-derived
        update for flatter minima. New SOTA on multiple benchmarks.

  2023  Rafailov et al. - DPO: RLHF framed as KKT-constrained
        optimisation. Replaces PPO for LLM alignment.


G.2 The Karush Omission

The Karush-Kuhn-Tucker conditions are named for Kuhn and Tucker (1951), but Karush derived essentially the same conditions in his 1939 master's thesis at the University of Chicago under L.M. Graves. Karush never published his thesis, and his work remained unknown until the 1970s when Kuhn acknowledged the priority.

This is a cautionary tale for researchers: unpublished work does not enter the scientific record. Karush's conditions are mathematically equivalent to KKT and were derived first, but priority in science is determined by publication and dissemination.

Modern convention is to include Karush's name (KKT), and some authors use "FOOC" (first-order optimality conditions) to avoid the naming entirely.

G.3 From Constrained Mechanics to Language Models

Lagrange invented his multiplier method to handle constraints in mechanics-specifically, the rigid-rod constraint in a pendulum. The constraint that "the rod length is fixed" is $|\mathbf{r}|^2 = L^2$, and the Lagrange multiplier is the tension in the rod.

Two centuries later, the same mathematical structure appears in:

  • The tension (KL penalty) in RLHF that prevents language models from deviating too far from the base policy
  • The margin constraint in SVMs, where the Lagrange multiplier is the "tension" pulling the decision boundary toward the support vectors
  • The softmax temperature, which is the Lagrange multiplier for the entropy constraint in maximum entropy next-token prediction

The deep unity of Lagrange's insight-that constraints can be absorbed into an objective via auxiliary multipliers-is one of the most productive ideas in the history of mathematics.


Supplement H: Problem Reduction Patterns

H.1 Reducing Constrained Problems to Unconstrained

Many constrained problems have hidden structure that allows elimination of constraints, converting a constrained problem to unconstrained.

Pattern 1: Substitution. If constraint $g(\mathbf{x}) = 0$ can be solved for one variable, substitute to eliminate it.

Example: $\min f(x,y)$ s.t. $y = x^2$. Substitute: $\min_{x} f(x, x^2)$. The constraint disappears; the FONC $\frac{d}{dx}f(x, x^2) = 0$ is that of an unconstrained problem.

Pattern 2: Reparameterisation. If the constraint defines a manifold $\mathcal{M}$, parameterise $\mathcal{M}$ directly.

Example: Unit sphere $\|\mathbf{x}\|=1$ in $\mathbb{R}^3$: parameterise as $\mathbf{x} = (\sin\theta\cos\phi, \sin\theta\sin\phi, \cos\theta)$. Optimise over $(\theta, \phi) \in \mathbb{R}^2$ without any constraint.

For ML: Stiefel manifold optimisation (orthonormal matrices) uses geodesic gradient descent on the manifold, avoiding explicit constraint handling. Used in RNN training and low-rank optimisation.

Pattern 3: Lagrangian method. When substitution is impractical, Lagrange multipliers convert the constrained problem to a (possibly harder) unconstrained problem in augmented variables.

Pattern 4: Penalty methods. Replace $\min f$ s.t. $g = 0$ with $\min f + \frac{\rho}{2}\|g\|^2$. As $\rho \to \infty$, the penalised solution approaches the constrained solution. Less accurate than Lagrange multipliers but simpler to implement.
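A sketch of the penalty method's behaviour on an assumed one-dimensional example, using the closed-form penalised minimiser:

```python
import numpy as np

# min x^2 subject to x = 1; true constrained solution x* = 1.
# Penalised problem: min x^2 + (rho/2)(x - 1)^2. Setting the
# derivative 2x + rho(x - 1) to zero gives x(rho) = rho / (2 + rho).
errs = []
for rho in [1.0, 10.0, 100.0, 1000.0]:
    x_rho = rho / (2 + rho)
    errs.append(abs(x_rho - 1.0))
    print(rho, x_rho)
# The constraint violation shrinks like O(1/rho): the penalty method
# is exact only in the limit, which is why augmented Lagrangians
# (Supplement F.3) are preferred when accuracy matters.
```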

H.2 Problem Recognition Guide

When you encounter an optimisation problem, use this decision tree:

OPTIMIZATION PROBLEM RECOGNITION


  Does it have constraints?
   NO -> Unconstrained
           Is f convex?
             YES -> FONC (∇f = 0) is sufficient; any local = global
            NO -> FONC necessary; check SOSC; may have local optima
  
   YES -> Constrained
             Are constraints equality only?
              YES -> Lagrange multipliers; solve n+m equations
              NO (has inequalities) -> KKT conditions
                  Is problem convex?
                   YES -> KKT iff global optimum (Slater's -> strong duality)
                   NO -> KKT necessary (LICQ needed); local optimum only


H.3 Standard Problem Forms and Their KKT Systems

Linear Program (LP):

\min \mathbf{c}^\top\mathbf{x} \quad \text{s.t.} \quad A\mathbf{x} = \mathbf{b},\; \mathbf{x} \geq 0

KKT: $\mathbf{c} - A^\top \boldsymbol{\lambda} - \boldsymbol{\mu} = \mathbf{0}$, $\mu_i x_i = 0$, $\boldsymbol{\mu} \geq 0$

Quadratic Program (QP):

\min \frac{1}{2}\mathbf{x}^\top Q\mathbf{x} + \mathbf{c}^\top \mathbf{x} \quad \text{s.t.} \quad A\mathbf{x} = \mathbf{b},\; G\mathbf{x} \leq \mathbf{h}

KKT: $Q\mathbf{x}^* + \mathbf{c} + A^\top\boldsymbol{\lambda}^* + G^\top\boldsymbol{\mu}^* = \mathbf{0}$, plus standard feasibility and complementary slackness

Second-Order Cone Program (SOCP):

\min \mathbf{c}^\top\mathbf{x} \quad \text{s.t.} \quad \|A_i \mathbf{x} + \mathbf{b}_i\| \leq \mathbf{c}_i^\top \mathbf{x} + d_i

Subsumes QP and LP; solvable in polynomial time; KKT conditions involve second-order cone complementarity

Semidefinite Program (SDP):

\min \langle C, X \rangle \quad \text{s.t.} \quad \langle A_i, X \rangle = b_i,\; X \succeq 0

KKT: matrix complementarity $XZ = 0$ where $Z$ is the dual variable; powerful for combinatorial relaxations and spectral graph problems

These standard forms are recognised and solved by CVXPY and other convex optimisation packages. Expressing your ML problem in one of these forms gives you access to provably correct, polynomial-time solvers.
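As a concrete instance, a hand-solved LP (an assumed toy problem) can be checked against the LP KKT system listed above, using nothing but NumPy:

```python
import numpy as np

# LP: min c'x  s.t.  Ax = b, x >= 0
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

# Solved by inspection: put all mass on the cheaper coordinate.
x = np.array([1.0, 0.0])
lam = np.array([1.0])        # equality multiplier (forced by mu_1 = 0)
mu = c - A.T @ lam           # stationarity rearranged: mu = c - A'lam

# The LP KKT system: c - A'lam - mu = 0, mu_i x_i = 0, mu >= 0
assert np.allclose(c - A.T @ lam - mu, 0)         # stationarity
assert np.allclose(A @ x, b) and np.all(x >= 0)   # primal feasibility
assert np.all(mu >= 0)                            # dual feasibility
assert np.allclose(mu * x, 0)                     # complementary slackness
```

Note how complementary slackness pins down the multipliers: the active coordinate ($x_1 > 0$) forces its $\mu_1 = 0$, which determines $\lambda$, which determines $\mu_2$.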


Supplement I: Numerical Experiments and Verification

I.1 Verifying KKT Conditions Numerically

In practice, KKT conditions are verified numerically after solving an optimisation problem. Here is a systematic protocol:

import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient of scalar f at x."""
    g = np.zeros(len(x))
    for i in range(len(x)):
        e = np.zeros(len(x)); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def numerical_jacobian(F, x, eps=1e-6):
    """Central-difference Jacobian of vector-valued F at x, shape (len(F(x)), len(x))."""
    F0 = np.atleast_1d(F(x)).astype(float)
    J = np.zeros((F0.size, len(x)))
    for i in range(len(x)):
        e = np.zeros(len(x)); e[i] = eps
        J[:, i] = (np.atleast_1d(F(x + e)) - np.atleast_1d(F(x - e))) / (2 * eps)
    return J

def verify_kkt(x_star, lambda_star, mu_star, f, g, h, tol=1e-6):
    """
    Verify KKT conditions for problem:
      min f(x) s.t. g(x) = 0, h(x) <= 0
    at candidate solution (x_star, lambda_star, mu_star).
    """
    # 1. Stationarity: grad f + J_g' lambda + J_h' mu = 0
    grad_f = numerical_gradient(f, x_star)
    grad_g = numerical_jacobian(g, x_star)  # (m, n)
    grad_h = numerical_jacobian(h, x_star)  # (p, n)
    stationarity = grad_f + lambda_star @ grad_g + mu_star @ grad_h
    cond1 = np.allclose(stationarity, 0, atol=tol)

    # 2. Primal feasibility
    eq_residual = g(x_star)
    ineq_residual = h(x_star)
    cond2 = np.allclose(eq_residual, 0, atol=tol) and np.all(ineq_residual <= tol)

    # 3. Dual feasibility
    cond3 = np.all(mu_star >= -tol)

    # 4. Complementary slackness
    cs = mu_star * h(x_star)
    cond4 = np.allclose(cs, 0, atol=tol)

    print(f"Stationarity:        {'PASS' if cond1 else 'FAIL'}")
    print(f"Primal feasibility:  {'PASS' if cond2 else 'FAIL'}")
    print(f"Dual feasibility:    {'PASS' if cond3 else 'FAIL'}")
    print(f"Complementary slack: {'PASS' if cond4 else 'FAIL'}")
    return cond1 and cond2 and cond3 and cond4

This systematic verification, checking all four KKT conditions numerically, is a key debugging tool when implementing constrained optimisation algorithms.
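As a self-contained instance of this protocol, here is the same four-condition check spelled out for an assumed toy problem, projection onto a halfspace, where the KKT point is known in closed form:

```python
import numpy as np

# Toy problem (assumed example): project p onto {x : a'x <= 1}.
#   min 0.5 ||x - p||^2   s.t.   a'x - 1 <= 0
p = np.array([1.0, 1.0])
a = np.array([1.0, 1.0])

# Closed-form projection onto the (active) halfspace boundary:
x = p - a * (a @ p - 1.0) / (a @ a)     # (0.5, 0.5)
mu = (a @ p - 1.0) / (a @ a)            # 0.5

# The four KKT conditions, spelled out:
stationarity = (x - p) + mu * a         # grad f + mu * grad h
assert np.allclose(stationarity, 0)
assert a @ x - 1.0 <= 1e-12             # primal feasibility
assert mu >= 0                          # dual feasibility
assert abs(mu * (a @ x - 1.0)) < 1e-12  # complementary slackness
```

The multiplier $\mu = 0.5$ here is the shadow price: relaxing the halfspace by $\delta$ would reduce the squared-distance objective at rate $\mu$.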

I.2 Duality Gap Monitoring

For convex problems, monitoring the duality gap during optimisation provides both a convergence certificate and a diagnostic:

def duality_gap(primal_val, dual_val):
    """
    Compute duality gap. Should be >= 0 (weak duality).
    Converges to 0 at optimality for convex problems (strong duality).
    """
    gap = primal_val - dual_val
    assert gap >= -1e-8, f"Negative duality gap: {gap}"  # weak duality violation
    return gap

A duality gap of $\epsilon$ certifies that both the primal and dual solutions are within $\epsilon$ of optimality, a certificate of near-optimality without knowing the true optimal value.

I.3 Critical Point Classification Checklist

When you find a critical point $\mathbf{x}^*$ numerically, this checklist determines its type:

| Step | Check | How | Interpretation |
|------|-------|-----|----------------|
| 1 | Verify stationarity | $\lVert\nabla f(\mathbf{x}^*)\rVert < 10^{-6}$ | True critical point (not just near-critical) |
| 2 | Compute Hessian | $H = \nabla^2 f(\mathbf{x}^*)$ | Via AD or finite differences |
| 3 | Eigenvalue decomposition | $H = Q\Lambda Q^\top$ | Sort eigenvalues |
| 4 | Count negatives | $k = \#\{\lambda_i < -10^{-8}\}$ | Morse index |
| 5 | Count near-zero | $m = \#\{\lvert\lambda_i\rvert \leq 10^{-8}\}$ | Degenerate / flat directions |
| 6 | Classify | $k=0$: min; $k=n$: max; else: saddle | Standard classification |
| 7 | Check SOSC | All $\lambda_i > 0$ strictly | Strict local min |
| 8 | Verify global (if convex) | FONC alone suffices | No need for SOSC |

This procedure is the numerical implementation of the second-order test and is embedded in standard optimisation post-processing.
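The classification steps of the checklist can be sketched as a small helper (the tolerance and the labels are illustrative choices):

```python
import numpy as np

def classify(H, tol=1e-8):
    # Morse-index classification of a critical point from its Hessian.
    eig = np.linalg.eigvalsh(H)           # eigenvalues, ascending
    neg = int(np.sum(eig < -tol))         # Morse index k
    zero = int(np.sum(np.abs(eig) <= tol))
    if zero > 0:
        return "degenerate (higher-order test needed)"
    if neg == 0:
        return "strict local min"
    if neg == len(eig):
        return "local max"
    return "saddle"

print(classify(np.diag([2.0, 3.0])))    # strict local min
print(classify(np.diag([2.0, -3.0])))   # saddle
print(classify(np.diag([0.0, 1.0])))    # degenerate
```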


Summary: The Three Pillars of Optimality

This section has developed three interconnected pillars that together form the mathematical foundation of constrained optimisation:

Pillar 1: First and Second-Order Conditions. The gradient and Hessian encode all local information about a critical point. FONC ($\nabla f = 0$) is necessary; SOSC ($H \succ 0$) is sufficient. For convex problems, FONC is sufficient for global optimality: the Hessian never needs to be checked.

Pillar 2: Lagrange Multipliers and KKT. Constraints are not obstacles but information: each active constraint generates a Lagrange multiplier (shadow price) measuring the sensitivity of the objective to that constraint. The KKT conditions generalise FONC to handle both equality and inequality constraints, with complementary slackness providing the geometric signature of which constraints matter.

Pillar 3: Duality. Every constrained problem has a dual that provides a lower bound (weak duality) and-for convex problems with a Slater point-an exact alternative formulation (strong duality). The dual reveals hidden structure (SVM kernel trick), provides certificates of optimality (duality gap), and connects constrained optimisation to game theory (saddle points).

These three pillars appear throughout AI and ML: gradient-based training (Pillar 1), SVM/PCA/attention (Pillar 2), and RL/alignment (Pillar 3). Understanding them deeply is not optional for a serious ML researcher-they are the grammar of the field.


Supplement J: Connections to Information Geometry

J.1 Fisher Information and the Natural Gradient

The gradient $\nabla_\theta \mathcal{L}$ is computed with respect to the Euclidean metric on parameter space. But parameter spaces of probability distributions carry a natural Riemannian metric, the Fisher information metric:

g_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\frac{\partial \log p_\theta(x)}{\partial \theta_i} \frac{\partial \log p_\theta(x)}{\partial \theta_j}\right] = F_{ij}(\theta)

Natural gradient. The gradient of $\mathcal{L}$ in the Fisher metric is:

\tilde{\nabla}_\theta \mathcal{L} = F(\theta)^{-1} \nabla_\theta \mathcal{L}

The natural gradient step $\theta \leftarrow \theta - \eta F^{-1} \nabla\mathcal{L}$ is the KKT condition for:

\min_{\Delta\theta} \nabla\mathcal{L}^\top \Delta\theta + \frac{1}{2\eta}\Delta\theta^\top F \Delta\theta

(minimise the linearised loss subject to a KL-ball constraint $\Delta\theta^\top F \Delta\theta \leq \text{const}$).

K-FAC and Adam as natural gradient approximations. K-FAC (Kronecker-Factored Approximate Curvature) approximates $F^{-1}$ via Kronecker products for efficiency. Adam's adaptive learning rates approximate $\text{diag}(F)^{-1}$, a diagonal approximation to the Fisher inverse. Both are principled from the viewpoint of natural gradient descent, whose KKT formulation makes the Fisher metric (the second-order constraint) explicit.
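A sketch verifying the Fisher information definition by Monte Carlo for an assumed Bernoulli model $p_\theta(x) = \theta^x(1-\theta)^{1-x}$, then forming a natural gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.2  # Bernoulli parameter

# Fisher information F(theta) = E[(d/dtheta log p(x))^2], estimated
# by Monte Carlo. The score is x/theta - (1-x)/(1-theta).
x = rng.random(200_000) < theta
score = x / theta - (1 - x) / (1 - theta)
F_mc = np.mean(score**2)
F_exact = 1.0 / (theta * (1 - theta))   # closed form: 6.25 at theta=0.2
assert abs(F_mc - F_exact) / F_exact < 0.05

# Natural gradient step on an assumed toy gradient value: the
# preconditioning by 1/F rescales the step to the distribution geometry.
grad = -1.0
nat_grad = grad / F_exact
```

Near $\theta = 0$ or $\theta = 1$ the Fisher information blows up, so the natural gradient automatically takes smaller steps where the distribution is most sensitive to the parameter.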

J.2 Optimal Transport and Wasserstein Duality

The Wasserstein distance between distributions $\mu$ and $\nu$ over $\mathbb{R}^d$ is defined as:

W_2(\mu, \nu)^2 = \min_{\gamma \in \Pi(\mu,\nu)} \mathbb{E}_{(x,y) \sim \gamma}[\|x-y\|^2]

where $\Pi(\mu,\nu)$ is the set of joint distributions with marginals $\mu$ and $\nu$.

Kantorovich duality. By Lagrange duality (the constraint is linear in $\gamma$), the Wasserstein distance admits a dual:

W_2(\mu,\nu)^2 = \max_{\phi,\psi:\, \phi(x)+\psi(y) \leq \|x-y\|^2} \int \phi\, d\mu + \int \psi\, d\nu

The dual functions $\phi$ and $\psi$ are the Lagrange multipliers for the marginal constraints $\int \gamma(x, \cdot)\, dy = d\mu(x)$ and $\int \gamma(\cdot, y)\, dx = d\nu(y)$.
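Weak duality for the Kantorovich problem can be checked numerically on 1-D empirical distributions, where the optimal coupling is the sorted (monotone) matching; the dual pair below is a feasible but loose choice obtained by a c-transform of $\phi = 0$ (distributions and sample sizes are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.sort(rng.normal(0.0, 1.0, 500))   # samples from mu
ys = np.sort(rng.normal(2.0, 1.0, 500))   # samples from nu

# Primal: in 1-D the optimal coupling matches sorted samples,
# so W2^2 is the mean squared gap between matched pairs (~4 here).
primal = np.mean((xs - ys)**2)

# A feasible dual pair via the c-transform of phi = 0:
#   psi(y) = min_x (x - y)^2, so phi(x) + psi(y) <= (x - y)^2 holds.
cost = (xs[:, None] - ys[None, :])**2
psi = cost.min(axis=0)
dual = psi.mean()   # phi contributes 0

# Weak duality: any feasible dual pair lower-bounds the primal value.
assert dual <= primal
```

This particular dual pair gives a weak bound; the maximising pair (the Kantorovich potentials) would close the gap entirely, which is strong duality.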

For LLMs. Wasserstein GANs (WGANs) use the Kantorovich dual of the 1-Wasserstein distance to train discriminators under a 1-Lipschitz constraint (the simplified form of the dual constraint). The dual formulation is more stable than the Jensen-Shannon divergence of standard GANs, a direct application of duality theory to generative model training.


End of 04 Optimality Conditions notes. This section is 2000+ lines covering unconstrained/constrained optimality from first principles through modern AI applications.



Supplement K: Extended Exercise Solutions (Hints and Partial Solutions)

Exercise 1 Hints

(a) Finding critical points of $f(x,y) = x^3 - 3xy^2 + y^4$:

\partial f/\partial x = 3x^2 - 3y^2 = 3(x-y)(x+y) = 0

\partial f/\partial y = -6xy + 4y^3 = 2y(2y^2 - 3x) = 0

Cases: ($x = \pm y$ from the first equation) combined with ($y=0$ or $x = 2y^2/3$ from the second).

For $y=0$: from $\partial f/\partial x = 0$ get $x=0$. Critical point: $(0,0)$.

For $y \neq 0$, $x = y$ or $x = -y$: combine with $x = 2y^2/3$.

(b) At $(0,0)$: $H = \begin{pmatrix}6x & -6y \\ -6y & -6x+12y^2\end{pmatrix}\bigg|_{(0,0)} = \begin{pmatrix}0 & 0 \\ 0 & 0\end{pmatrix}$. The Hessian is zero: a degenerate critical point, so the second-order test is inconclusive. Must use higher-order analysis (the origin is a saddle of $f$).
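The higher-order analysis can be checked numerically by probing $f$ along lines through the origin (a minimal sketch):

```python
def f(x, y):
    return x**3 - 3 * x * y**2 + y**4

# The Hessian at the origin is the zero matrix, so the second-order
# test is silent. Probe along the x-axis instead:
t = 1e-3
print(f(+t, 0.0))   # +t^3 > 0
print(f(-t, 0.0))   # -t^3 < 0
# f takes both signs arbitrarily close to (0,0) while f(0,0) = 0,
# so the origin is neither a local min nor a local max: a saddle.
```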

Exercise 4 Hints

(c) $f(A) = -\log \det A$ on $\mathbb{S}_{++}^n$: this is convex. Key: compute the Hessian in the matrix sense. For a perturbation $\Delta$: $-\log\det(A+\Delta) \approx -\log\det A - \text{tr}(A^{-1}\Delta) + \frac{1}{2}\text{tr}(A^{-1}\Delta A^{-1}\Delta) + \ldots$ The second-order term is $\frac{1}{2}\|A^{-1/2}\Delta A^{-1/2}\|_F^2 \geq 0$.

(d) $f(x,y) = x^2/y$ for $y > 0$. Hessian: $H = \begin{pmatrix} 2/y & -2x/y^2 \\ -2x/y^2 & 2x^2/y^3 \end{pmatrix}$. Check: $\det H = (2/y)(2x^2/y^3) - (2x/y^2)^2 = 4x^2/y^4 - 4x^2/y^4 = 0$. PSD but not PD, yet still convex: $f$ is the perspective of the convex function $g(x)=x^2$.

Exercise 7 Extended Notes: SAM Convergence

The SAM update $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla\mathcal{L}(\mathbf{w} + \hat{\boldsymbol{\epsilon}})$ can be viewed as gradient descent on the perturbed objective $\mathcal{L}^{\text{SAM}}(\mathbf{w}) \approx \mathcal{L}(\mathbf{w}) + \rho \|\nabla\mathcal{L}(\mathbf{w})\|$.

The KKT condition of the inner maximisation shows that $\hat{\boldsymbol{\epsilon}} = \rho \nabla\mathcal{L}/\|\nabla\mathcal{L}\|$ is correct when the constraint $\|\boldsymbol{\epsilon}\| \leq \rho$ is active (which it always is for nonzero gradients). The multiplier $\mu^* = \|\nabla\mathcal{L}\|/(2\rho)$ is the shadow price of relaxing the perturbation budget.

This connection shows SAM is not heuristic: it is the exact solution to a well-defined KKT problem. The sharpness regularisation effect comes from the fact that at a sharp minimum, $\|\nabla\mathcal{L}(\mathbf{w} + \hat{\boldsymbol{\epsilon}})\|$ is large (large Hessian eigenvalue $\times$ gradient), so SAM takes a larger effective step away from sharp directions.
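A sketch of the SAM update on an assumed toy quadratic loss; note that with a fixed perturbation radius the iterates hover in a small $O(\eta\rho\lambda_{\max})$ neighbourhood of the minimum rather than converging exactly:

```python
import numpy as np

# Toy loss L(w) = 0.5 w'Hw with one sharp and one flat direction.
# H, rho, eta are illustrative values.
H = np.diag([10.0, 1.0])
rho, eta = 0.05, 0.05

def grad(w):
    return H @ w   # gradient of the quadratic loss

w = np.array([1.0, 1.0])
for _ in range(200):
    g = grad(w)
    # Inner maximisation: the KKT solution puts epsilon-hat on the
    # boundary of the rho-ball, along the gradient direction.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Outer step: descend using the gradient at the perturbed point.
    w = w - eta * grad(w + eps)

loss = 0.5 * w @ H @ w   # far below the initial value of 5.5
```

The perturbed gradient is inflated most along the sharp ($\lambda = 10$) axis, which is exactly the extra push away from sharp directions described above.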

