Math for LLMs / Multivariate Calculus

Optimality Conditions: Part 1 - Intuition to Conceptual Bridge

1. Intuition

1.1 The Optimization Landscape

Imagine standing in a hilly landscape and trying to find the lowest valley. The gradient \nabla f(\mathbf{x}) tells you the direction of steepest ascent at your current position - to descend, move against the gradient. But this strategy can strand you in a local valley that is not the deepest one, or trap you at a mountain pass (a saddle point) where the gradient is zero but you are not at a minimum.

THE OPTIMIZATION LANDSCAPE IN 1D AND 2D


  1D loss curve:                 2D loss surface (contour view):
  local min vs global min        local min, saddle point (x), global min

  Key players:

   Point type       Characterisation
   Local minimum    ∇f = 0,  H ≻ 0  (all eigenvalues > 0)
   Local maximum    ∇f = 0,  H ≺ 0  (all eigenvalues < 0)
   Saddle point     ∇f = 0,  H indefinite (mixed signs)
   Regular point    ∇f ≠ 0  (not a critical point)


The core difficulty: the gradient condition \nabla f(\mathbf{x}^*) = \mathbf{0} is necessary but not sufficient for a minimum. A zero gradient could mark a minimum, a maximum, or a saddle point. Second-order conditions (the Hessian) are needed to distinguish between them.

Constraints add further complexity. If the solution must lie on a curve or within a region, the unconstrained minimum may be infeasible, and the optimum occurs at a very different location - possibly on the boundary of the feasible region.

1.2 Why ML Needs Optimality Theory

Every core ML algorithm is secretly an optimisation problem, and the optimality conditions determine its solution structure:

ML Algorithm        | Optimisation Problem                                                        | Optimality Condition Used
Linear regression   | \min_\mathbf{w} \Vert X\mathbf{w} - \mathbf{y}\Vert^2                       | \nabla = 0 -> normal equations
Ridge regression    | \min_\mathbf{w} \Vert X\mathbf{w}-\mathbf{y}\Vert^2 + \lambda\Vert\mathbf{w}\Vert^2 | Lagrangian of constrained form
Lasso               | \min_\mathbf{w} \Vert X\mathbf{w}-\mathbf{y}\Vert^2 + \lambda\Vert\mathbf{w}\Vert_1 | Subdifferential conditions
SVM                 | \min \tfrac{1}{2}\Vert\mathbf{w}\Vert^2 s.t. margin \geq 1                  | KKT -> support vectors
PCA                 | \max_\mathbf{v} \mathbf{v}^\top C \mathbf{v} s.t. \Vert\mathbf{v}\Vert=1    | Lagrange -> eigenvalue problem
Logistic regression | \min -\sum \log p_i                                                         | Convex -> unique global minimum
Neural network      | \min \mathcal{L}(\theta) (non-convex)                                       | Saddle point dominance; NTK
Attention (softmax) | \max_\mathbf{p} \mathbf{s}^\top\mathbf{p} s.t. \mathbf{p} \in \Delta        | Maximum entropy via Lagrange
SAM training        | \min_\theta \max_{\Vert\epsilon\Vert\leq\rho} \mathcal{L}(\theta+\epsilon)  | KKT of inner max

For AI: Understanding that the SVM dual problem depends only on inner products \mathbf{x}_i^\top \mathbf{x}_j (from the KKT conditions) is what enables the kernel trick - the foundation of kernel methods. Understanding that softmax is the solution to a maximum entropy problem under linear constraints connects attention to thermodynamics and information theory.

1.3 Historical Context

Year | Person                | Contribution
1788 | Lagrange              | Mécanique Analytique - method of multipliers for equality constraints
1847 | Cauchy                | Steepest descent algorithm
1939 | Karush                | Master's thesis: inequality constraints with multipliers (KKT conditions - unpublished)
1951 | Kuhn & Tucker         | Independent rediscovery and publication of KKT conditions
1952 | Arrow, Hurwicz, Uzawa | Saddle point theorem for convex optimisation
1970 | Rockafellar           | Convex Analysis - definitive treatment of duality and subdifferentials
1992 | Boser, Guyon, Vapnik  | SVM uses the dual to derive the kernel trick
2004 | Boyd & Vandenberghe   | Convex Optimization (textbook; full text freely available online)
2014 | Dauphin et al.        | Saddle point dominance in deep networks
2020 | Papyan et al.         | Neural collapse: KKT characterisation of the terminal training phase
2021 | Foret et al.          | SAM: sharpness-aware minimisation as constrained min-max

1.4 Unconstrained vs Constrained

There are three fundamental problem classes:

THREE PROBLEM CLASSES


  UNCONSTRAINED                     EQUALITY CONSTRAINED

  min f(x)                          min f(x)
  x ∈ R^n                           s.t. g_i(x) = 0,  i = 1,...,m

  Solution: ∇f = 0                  Solution: Lagrange multipliers
  Theory: 2nd-order conditions      ∇f + Σ λ_i ∇g_i = 0,  g_i(x) = 0

  INEQUALITY CONSTRAINED

  min f(x)
  s.t. g_i(x) = 0,  i = 1,...,m
       h_j(x) <= 0,  j = 1,...,p

  Solution: KKT conditions
  ∇f + Σ λ_i ∇g_i + Σ μ_j ∇h_j = 0,  g_i = 0,  h_j <= 0,  μ_j >= 0,  μ_j h_j = 0 for all j

  KEY INSIGHT: Lagrange conditions are the special case of KKT with no
  inequality constraints (no μ terms; the λ_i stay free for equalities)


The feasible set \mathcal{F} = \{\mathbf{x} : g_i(\mathbf{x}) = 0,\, h_j(\mathbf{x}) \leq 0\} is the set of points satisfying all constraints. The optimisation problem asks for the minimum of f over \mathcal{F}.


2. First-Order Necessary Conditions

2.1 The Gradient Condition

Theorem (First-Order Necessary Condition): Let f: \mathbb{R}^n \to \mathbb{R} be differentiable. If \mathbf{x}^* is a local minimum of f, then:

\nabla f(\mathbf{x}^*) = \mathbf{0}

Proof: Suppose \mathbf{x}^* is a local minimum but \nabla f(\mathbf{x}^*) \neq \mathbf{0}. Define the direction \mathbf{d} = -\nabla f(\mathbf{x}^*). By the directional derivative formula:

\frac{d}{dt} f(\mathbf{x}^* + t\mathbf{d})\Big|_{t=0} = \nabla f(\mathbf{x}^*)^\top \mathbf{d} = -\|\nabla f(\mathbf{x}^*)\|^2 < 0

Since this one-sided derivative is negative, there exists \epsilon > 0 such that f(\mathbf{x}^* + t\mathbf{d}) < f(\mathbf{x}^*) for all t \in (0, \epsilon). But this contradicts \mathbf{x}^* being a local minimum. \square

Warning - necessity only: The gradient condition is necessary but not sufficient. The function f(x) = x^3 has f'(0) = 0 but no local extremum at x = 0 (it is an inflection point). The function f(x,y) = x^2 - y^2 has \nabla f(0,0) = \mathbf{0} but a saddle point there. Second-order conditions are needed to determine which type of critical point we have.

Critical points (also called stationary points) are all \mathbf{x} satisfying \nabla f(\mathbf{x}) = \mathbf{0}. They are candidates for minima, maxima, or saddle points - the Hessian distinguishes between them.

2.2 Critical Points and Their Classification

In \mathbb{R}^2, a function f(x,y) has four types of critical points, determined by the sign of the Hessian determinant \det H = f_{xx}f_{yy} - f_{xy}^2 and the sign of f_{xx}:

SECOND-DERIVATIVE TEST IN R^2


  At a critical point (∇f = 0):

  det(H) > 0 and f_xx > 0  ->  LOCAL MINIMUM
                               (all eigenvalues of H positive)

  det(H) > 0 and f_xx < 0  ->  LOCAL MAXIMUM
                               (all eigenvalues of H negative)

  det(H) < 0               ->  SADDLE POINT
                               (H has mixed-sign eigenvalues)

  det(H) = 0               ->  DEGENERATE (test inconclusive)
                               (need higher-order analysis)

  Geometric picture: min = bowl shape, max = hill shape,
  saddle = mountain pass, degenerate = flat region.


Standard examples:

  • f(x,y) = x^2 + y^2: unique global minimum at the origin; H = 2I \succ 0
  • f(x,y) = -(x^2 + y^2): unique global maximum at the origin; H = -2I \prec 0
  • f(x,y) = x^2 - y^2: saddle point at the origin; H = \text{diag}(2,-2) indefinite
  • f(x,y) = x^4 + y^4: minimum at the origin; H(0,0) = 0 (degenerate - minimum confirmed by inspection)

In higher dimensions (n > 2): A critical point is a strict local minimum if H(\mathbf{x}^*) \succ 0 (all eigenvalues positive), and a strict local maximum if H(\mathbf{x}^*) \prec 0. (These conditions are sufficient but not necessary - x^4 + y^4 has a strict minimum with H = 0.) Any other case (indefinite or semidefinite H) requires further analysis.
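The eigenvalue-based classification above is easy to check numerically. A minimal sketch (the helper name `classify_critical_point` is ours, not from the lesson):

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point of f from the eigenvalues of its Hessian H."""
    eig = np.linalg.eigvalsh(np.asarray(H, dtype=float))
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "degenerate"

# Hessians at the origin for the four standard examples
print(classify_critical_point([[2, 0], [0, 2]]))    # f = x^2 + y^2
print(classify_critical_point([[-2, 0], [0, -2]]))  # f = -(x^2 + y^2)
print(classify_critical_point([[2, 0], [0, -2]]))   # f = x^2 - y^2
print(classify_critical_point([[0, 0], [0, 0]]))    # f = x^4 + y^4 (H = 0)
```

Note that the degenerate branch matches the text: a zero (or merely semidefinite) Hessian cannot decide, so the function falls back to "degenerate" rather than guessing.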

2.3 Saddle Points in Deep Learning

The classical concern was that neural networks might converge to poor local minima. Empirically, gradient descent finds solutions of similar quality regardless of initialisation for overparameterised networks. This was theoretically explained in a landmark 2014 paper.

Dauphin et al. (2014) - Identifying and attacking the saddle point problem: For a function f: \mathbb{R}^n \to \mathbb{R} with n large, a random critical point with loss \epsilon above the global minimum is a saddle point with overwhelming probability (under a statistical physics model). The fraction of critical points that are local minima decreases exponentially with both \epsilon and n.

SADDLE POINT DOMINANCE


  Distribution of critical points at energy level epsilon above global min:

  Fraction that are local minima:  ~exp(-n * c(epsilon))

  For n on the order of 10^11 (GPT-3-scale),  c(epsilon) > 0  ->  essentially zero local minima

  What gradient descent actually encounters:
  
    High-loss region:   many saddle points, few local minima  
    Low-loss region:    many saddle points, even fewer local  
    Near global min:    flat regions (many equivalent minima) 
  

  SGD noise + momentum helps escape saddles (saddle-free Newton
  methods explicitly use the negative curvature direction).


Why does SGD escape saddle points? At a saddle point, the Hessian has at least one negative eigenvalue. The corresponding eigenvector is a descent direction - the function decreases along it. SGD's noise perturbations project onto this direction with nonzero probability, enabling escape. Deterministic gradient descent can be trapped at strict saddle points, but this is a measure-zero set.

For AI: The practical implication is that for overparameterised models (where parameters \gg data), gradient descent on a non-convex loss converges to global or near-global optima. This is one theoretical explanation for why training large language models with SGD/Adam works so well.

2.4 Stationary Points of Common ML Loss Functions

MSE (Linear Regression): f(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2

\nabla_\mathbf{w} f = 2X^\top(X\mathbf{w} - \mathbf{y}) = \mathbf{0} \implies X^\top X \mathbf{w} = X^\top \mathbf{y}

This is the normal equation. It has the unique solution \mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y} when X has full column rank. The Hessian H = 2X^\top X \succeq 0 - MSE is convex, so the critical point is the global minimum.
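The derivation can be verified numerically; the sketch below solves the normal equations on synthetic noiseless data and checks both optimality conditions at the solution (variable names like `w_star` are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # full column rank with probability 1
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                        # noiseless, so w* should recover w_true

# Normal equations: X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

grad = 2 * X.T @ (X @ w_star - y)     # gradient of ||Xw - y||^2 vanishes at w*
eigs = np.linalg.eigvalsh(2 * X.T @ X)  # Hessian eigenvalues: all >= 0 (convex)
print(np.allclose(grad, 0, atol=1e-8), np.all(eigs > 0))
```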

Cross-Entropy (Logistic Regression): f(\mathbf{w}) = -\sum_{i=1}^n [y_i \log \sigma(\mathbf{w}^\top \mathbf{x}_i) + (1-y_i)\log(1-\sigma(\mathbf{w}^\top \mathbf{x}_i))]

\nabla_\mathbf{w} f = -\sum_{i=1}^n (y_i - \sigma(\mathbf{w}^\top \mathbf{x}_i))\mathbf{x}_i = X^\top(\boldsymbol{\sigma} - \mathbf{y})

No closed-form solution (a transcendental equation). The Hessian H = X^\top \text{diag}(\sigma_i(1-\sigma_i))\, X \succeq 0 confirms convexity. Gradient descent (or Newton's method) converges to the unique global minimum if one exists (it may not if the data is linearly separable).
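The PSD-everywhere claim can be spot-checked at random weight vectors (a sketch with synthetic data; `sigmoid` is our helper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = (rng.random(40) < 0.5).astype(float)

# The Hessian X^T diag(sigma_i (1 - sigma_i)) X is PSD at *any* w
for _ in range(20):
    w = rng.normal(size=2)
    p = sigmoid(X @ w)
    grad = X.T @ (p - y)                 # X^T (sigma - y)
    H = X.T @ np.diag(p * (1 - p)) @ X
    min_eig = np.linalg.eigvalsh(H).min()
    assert min_eig >= -1e-10             # PSD everywhere -> convex
```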

Cross-Entropy (Softmax / LLM output): Setting \nabla_\mathbf{z} [-\log p_y] = \mathbf{p} - \mathbf{e}_y = \mathbf{0} (as derived in 03) gives p_y = 1 and p_k = 0 for k \neq y. This is a perfectly confident prediction - the loss approaches zero but never reaches it for finite logits.
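The gradient formula \mathbf{p} - \mathbf{e}_y can be checked against central finite differences (a minimal sketch; `softmax` and `loss` are our own helper names):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

def loss(z, y):
    return -np.log(softmax(z)[y])        # cross-entropy on the true class y

z = np.array([1.0, -0.5, 2.0])
y = 1
analytic = softmax(z) - np.eye(3)[y]     # p - e_y

eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[k], y) - loss(z - eps * np.eye(3)[k], y)) / (2 * eps)
    for k in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))
```

The entries of p - e_y sum to zero, which is why the logits can grow without bound: the gradient never vanishes at finite z, matching the "approaches zero but never reaches it" remark.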


3. Second-Order Conditions

First-order conditions identify candidates; second-order conditions distinguish minima from maxima from saddle points. The Hessian matrix - the matrix of second partial derivatives - encodes all the local curvature information needed for this classification.

3.1 Second-Order Necessary Condition (SONC)

Theorem (SONC). If \mathbf{x}^* is a local minimum of f: \mathbb{R}^n \to \mathbb{R} and f \in C^2 near \mathbf{x}^*, then:

\nabla^2 f(\mathbf{x}^*) \succeq 0

(the Hessian is positive semi-definite).

Proof. Let \mathbf{d} \in \mathbb{R}^n be any direction. Taylor expansion gives:

f(\mathbf{x}^* + t\mathbf{d}) = f(\mathbf{x}^*) + t \underbrace{\nabla f(\mathbf{x}^*)^\top \mathbf{d}}_{=0} + \frac{t^2}{2} \mathbf{d}^\top \nabla^2 f(\mathbf{x}^*) \mathbf{d} + O(t^3)

Since \mathbf{x}^* is a local minimum, f(\mathbf{x}^* + t\mathbf{d}) \geq f(\mathbf{x}^*) for all small t. Thus:

\frac{t^2}{2} \mathbf{d}^\top H \mathbf{d} + O(t^3) \geq 0

Dividing by t^2 and letting t \to 0^+:

\mathbf{d}^\top H \mathbf{d} \geq 0 \quad \forall \mathbf{d} \in \mathbb{R}^n

which is exactly H \succeq 0. \square

Note: SONC is necessary but not sufficient. f(x,y) = x^2 - y^3 has H = \text{diag}(2, 0) \succeq 0 at the origin, yet the origin is not a minimum (f decreases along the positive y-axis). Conversely, f(x) = x^4 has f''(0) = 0 \succeq 0 and x^* = 0 is a minimum - when the Hessian is only semidefinite, the test cannot decide either way.

3.2 Second-Order Sufficient Condition (SOSC)

Theorem (SOSC). If \nabla f(\mathbf{x}^*) = \mathbf{0} and \nabla^2 f(\mathbf{x}^*) \succ 0 (positive definite), then \mathbf{x}^* is a strict local minimum.

Proof. Since H = \nabla^2 f(\mathbf{x}^*) \succ 0, its smallest eigenvalue satisfies \lambda_{\min}(H) > 0. By continuity of the Hessian, there exists \delta > 0 such that \nabla^2 f(\mathbf{x}) \succ \frac{\lambda_{\min}}{2} I for all \|\mathbf{x} - \mathbf{x}^*\| < \delta.

For any \mathbf{d} with \|\mathbf{d}\| = 1 and small t > 0:

f(\mathbf{x}^* + t\mathbf{d}) = f(\mathbf{x}^*) + \frac{t^2}{2} \mathbf{d}^\top H \mathbf{d} + O(t^3) \geq f(\mathbf{x}^*) + \frac{t^2 \lambda_{\min}}{2} + O(t^3) > f(\mathbf{x}^*)

for sufficiently small t. \square

Gap between SONC and SOSC: The boundary case H \succeq 0 but H \not\succ 0 (a degenerate, semidefinite Hessian) requires higher-order analysis. This is common in deep learning, where flat directions proliferate.

SECOND-ORDER TEST SUMMARY


  Critical point x*: ∇f(x*) = 0

  H = ∇²f(x*)      Result                      Name

  H ≻ 0            Strict local minimum        SOSC satisfied
  H ≺ 0            Strict local maximum        (max version of SOSC)
  H indefinite     Saddle point                Eigenvalues of both signs
  H ⪰ 0, H ≠ 0     Might be min (degenerate)   Higher order needed
  H = 0            Need Taylor term >= 3       Flat critical point

  For R^2 via determinant test (when ∇f = 0):
    det(H) > 0, H_11 > 0  ->  local minimum
    det(H) > 0, H_11 < 0  ->  local maximum
    det(H) < 0            ->  saddle point
    det(H) = 0            ->  inconclusive


3.3 Indefinite Hessian and Saddle Points

When H = \nabla^2 f(\mathbf{x}^*) has both positive and negative eigenvalues, \mathbf{x}^* is a saddle point: a local minimum along some directions, a local maximum along others.

Morse Theory Preview. For a smooth function f: \mathbb{R}^n \to \mathbb{R}, a critical point \mathbf{x}^* is non-degenerate if H(\mathbf{x}^*) is invertible. The Morse index of \mathbf{x}^* is:

k = \#\{\lambda_i(H) < 0\}

(the number of descent directions). A non-degenerate minimum has index 0; a saddle has index 1, 2, \ldots, n-1; a maximum has index n. Morse theory relates the topology of the sublevel sets \{f \leq c\} to the critical points and their indices.

For deep learning: The loss landscape of an n-parameter network at a critical point near a good solution tends to have many positive eigenvalues and few (often negligible) negative ones. The ratio k/n is typically very small at good solutions - confirming that most critical points found in practice are near-minima, not true saddles with large negative Hessian components.

3.4 Global Optimality from Convexity

For convex functions, the local-global distinction disappears entirely:

Theorem. If f is convex and \mathbf{x}^* is a local minimum, then \mathbf{x}^* is a global minimum.

Proof. Suppose \mathbf{x}^* is a local min but not global: there exists \mathbf{y} with f(\mathbf{y}) < f(\mathbf{x}^*). For t \in (0,1), convexity gives:

f((1-t)\mathbf{x}^* + t\mathbf{y}) \leq (1-t)f(\mathbf{x}^*) + tf(\mathbf{y}) < f(\mathbf{x}^*)

As t \to 0^+, the point (1-t)\mathbf{x}^* + t\mathbf{y} approaches \mathbf{x}^* arbitrarily closely while having a lower function value - contradicting \mathbf{x}^* being a local minimum. \square

Corollary. For differentiable convex f: \nabla f(\mathbf{x}^*) = \mathbf{0} iff \mathbf{x}^* is a global minimum.

This is the fundamental reason convex optimization is "easy" compared to non-convex: any local search algorithm that finds a stationary point has found the global optimum.

3.5 Hessian at ML Optima

Understanding the Hessian at key ML critical points provides geometric insight into learning:

Linear Regression. For f(\mathbf{w}) = \frac{1}{2n}\|X\mathbf{w} - \mathbf{y}\|^2:

\nabla^2 f(\mathbf{w}) = \frac{1}{n} X^\top X

independent of \mathbf{w}. The Hessian is constant - this is a quadratic. Its eigenvalues are \sigma_i^2(X)/n (squared singular values, scaled by 1/n). If X has full column rank, H \succ 0 everywhere, so the unique critical point is a global minimum.
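The eigenvalue claim is directly checkable with numpy (synthetic data; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 4
X = rng.normal(size=(n, d))

H = X.T @ X / n                                    # constant Hessian of 1/(2n)||Xw - y||^2
eigs = np.sort(np.linalg.eigvalsh(H))
svals = np.sort(np.linalg.svd(X, compute_uv=False) ** 2) / n   # sigma_i(X)^2 / n
print(np.allclose(eigs, svals))
```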

Logistic Regression. For f(\mathbf{w}) = \frac{1}{n}\sum_i [-y_i \mathbf{w}^\top \mathbf{x}_i + \log(1 + e^{\mathbf{w}^\top \mathbf{x}_i})]:

\nabla^2 f(\mathbf{w}) = \frac{1}{n} X^\top \text{diag}(\sigma_i(1-\sigma_i)) X

where \sigma_i = \sigma(\mathbf{w}^\top \mathbf{x}_i). Since \sigma(z)(1-\sigma(z)) \in (0, 1/4], the Hessian is PSD everywhere (and PD if X has full column rank), making logistic regression a convex problem.

Neural Networks. At a minimum \mathbf{w}^*, the Hessian H(\mathbf{w}^*) typically has:

  • A bulk of near-zero eigenvalues (flat landscape in many directions)
  • A few large positive eigenvalues (steep curvature in a small subspace)
  • Occasional small negative eigenvalues (slight non-convexity)

The ratio of the largest to the smallest eigenvalue is the condition number - high condition numbers cause slow gradient descent convergence and motivate adaptive methods like Adam.
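The effect of the condition number on convergence speed can be illustrated on a diagonal quadratic, where gradient descent decouples coordinate-wise (a sketch; `gd_steps` is our own helper):

```python
import numpy as np

def gd_steps(eigs, lr, tol=1e-6, max_iter=200_000):
    """Steps for gradient descent to reach ||w|| < tol on
    f(w) = 1/2 w^T diag(eigs) w, starting from w = (1, ..., 1)."""
    eigs = np.asarray(eigs, dtype=float)
    w = np.ones(len(eigs))
    for k in range(max_iter):
        if np.linalg.norm(w) < tol:
            return k
        w = w - lr * eigs * w              # gradient of the quadratic is diag(eigs) w
    return max_iter

well_conditioned = gd_steps([1.0, 2.0], lr=0.5)    # kappa = 2
ill_conditioned = gd_steps([0.01, 2.0], lr=0.5)    # kappa = 200
print(well_conditioned, ill_conditioned)
```

With the step size fixed at 1/L, the slow coordinate contracts by only (1 - mu/L) per step, so the ill-conditioned run needs on the order of kappa times more iterations.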


4. Convex Analysis

Convexity is the mathematical property that makes optimization tractable. Understanding convex sets and functions - their definitions, characterisations, and preservation rules - is a prerequisite to deploying convex duality and KKT theory.

4.1 Convex Sets

Definition. A set C \subseteq \mathbb{R}^n is convex if for all \mathbf{x}, \mathbf{y} \in C and \theta \in [0,1]:

\theta \mathbf{x} + (1-\theta)\mathbf{y} \in C

Geometrically: the line segment between any two points stays inside the set.

Standard examples:

  • Hyperplanes \{\mathbf{x} : \mathbf{a}^\top \mathbf{x} = b\} and halfspaces \{\mathbf{x} : \mathbf{a}^\top \mathbf{x} \leq b\}
  • Balls \{\mathbf{x} : \|\mathbf{x} - \mathbf{c}\| \leq r\} under any norm
  • Polyhedra \{\mathbf{x} : A\mathbf{x} \leq \mathbf{b}\} (intersections of halfspaces)
  • The positive semidefinite cone \mathbb{S}_+^n = \{S \in \mathbb{R}^{n\times n} : S = S^\top,\, S \succeq 0\}
  • The probability simplex \Delta^n = \{\mathbf{p} \geq 0 : \mathbf{1}^\top \mathbf{p} = 1\}

Non-examples:

  • The unit sphere \mathbb{S}^{n-1} (the boundary only, not the ball): the segment between two points on the sphere passes through the interior
  • The circle \{(x,y) : x^2 + y^2 = 1\}: the midpoint of (1,0) and (-1,0) is the origin, which is not on the circle. (Beware: \{(x,y) : xy \geq 1,\, x,y > 0\} looks non-convex but is in fact convex - it is the region above the convex curve y = 1/x.)

Convexity-preserving operations:

  • Intersection: C_1, C_2 convex \Rightarrow C_1 \cap C_2 convex
  • Affine image: f(C) = \{A\mathbf{x} + \mathbf{b} : \mathbf{x} \in C\} is convex if C is convex
  • Cartesian products and Minkowski sums also preserve convexity

4.2 Convex Functions: Definition and First-Order Characterisation

Definition. A function f: C \to \mathbb{R} on a convex set C is convex if:

f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta)f(\mathbf{y}) \quad \forall \mathbf{x},\mathbf{y} \in C,\; \theta \in [0,1]

Strict convexity: replace \leq with < for \mathbf{x} \neq \mathbf{y} and \theta \in (0,1).

First-Order Characterisation. For f \in C^1, f is convex iff its graph lies above all tangent hyperplanes:

f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x}) \quad \forall \mathbf{x}, \mathbf{y} \in C

This is the inequality that makes \nabla f(\mathbf{x}^*) = \mathbf{0} sufficient for global minimality in the convex case: if \nabla f(\mathbf{x}^*) = \mathbf{0}, then f(\mathbf{y}) \geq f(\mathbf{x}^*) + 0 = f(\mathbf{x}^*) for all \mathbf{y}.

Proof sketch. (\Rightarrow) Fix \mathbf{x}, \mathbf{y} and let \phi(t) = f(\mathbf{x} + t(\mathbf{y}-\mathbf{x})). Convexity implies \phi(1) \geq \phi(0) + \phi'(0), which is the tangent inequality. (\Leftarrow) Apply the tangent inequality at \mathbf{z} = \theta\mathbf{x} + (1-\theta)\mathbf{y} with each of \mathbf{x} and \mathbf{y} in turn, then combine the two bounds with weights \theta and 1-\theta to recover the convexity definition.

4.3 Second-Order Characterisation

Theorem. For f \in C^2: f is convex iff \nabla^2 f(\mathbf{x}) \succeq 0 for all \mathbf{x} in the domain.

Proof. (\Rightarrow) If f is convex, the first-order characterisation gives f(\mathbf{x} + t\mathbf{d}) \geq f(\mathbf{x}) + t \nabla f(\mathbf{x})^\top \mathbf{d}. Expanding the left side in a Taylor series and comparing terms yields \mathbf{d}^\top H(\mathbf{x})\mathbf{d} \geq 0. (\Leftarrow) With H \succeq 0 everywhere, the second-order Taylor remainder along the segment from \mathbf{x} to \mathbf{y} is non-negative, which establishes the first-order characterisation.

Examples of convex functions:

  • Affine: f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x} + b (both convex and concave)
  • Quadratic: f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\top Q \mathbf{x} when Q \succeq 0
  • Norms: f(\mathbf{x}) = \|\mathbf{x}\|_p for p \geq 1 (the triangle inequality gives convexity)
  • Log-sum-exp: f(\mathbf{x}) = \log \sum_i e^{x_i} (a smooth convex approximation to \max)
  • Negative entropy: f(\mathbf{p}) = \sum_i p_i \log p_i on the simplex
  • Cross-entropy loss: -y \log \sigma(z) - (1-y)\log(1-\sigma(z)) in z

Non-convex in ML:

  • f(w) = \sin(w); any neural network loss; f(A,B) = \|AB - M\|_F^2 (matrix factorisation)
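A numerical spot check of convexity for one entry in the list above: log-sum-exp satisfies the defining inequality on randomly sampled pairs (evidence, not a proof; the sketch and helper name are ours):

```python
import numpy as np

def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))   # numerically stable form

rng = np.random.default_rng(3)
violations = 0
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    t = rng.random()
    lhs = logsumexp(t * x + (1 - t) * y)
    rhs = t * logsumexp(x) + (1 - t) * logsumexp(y)
    violations += lhs > rhs + 1e-12            # convexity: lhs <= rhs
print(violations)
```

Running the same loop on a non-convex function such as sin(w) would report violations, which makes this a cheap first filter before attempting a Hessian argument.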

4.4 Strongly Convex Functions

Definition. f is \mu-strongly convex (\mu > 0) if:

f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^\top(\mathbf{y}-\mathbf{x}) + \frac{\mu}{2}\|\mathbf{y}-\mathbf{x}\|^2

Equivalently: f(\mathbf{x}) - \frac{\mu}{2}\|\mathbf{x}\|^2 is convex; or, for f \in C^2, \nabla^2 f(\mathbf{x}) \succeq \mu I everywhere.

Strong convexity has powerful consequences:

  1. Unique minimiser: the quadratic lower bound forces a unique optimal point
  2. Linear convergence: gradient descent converges geometrically at rate (1 - \mu/L) per step, where L is the Lipschitz constant of \nabla f
  3. No flat directions: curvature is bounded below by \mu, so the function cannot become arbitrarily flat

For AI: Ridge regression f(\mathbf{w}) = \frac{1}{2n}\|X\mathbf{w} - \mathbf{y}\|^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2 is \lambda-strongly convex (the \ell_2 regulariser adds \lambda I to the Hessian). This is why L_2 regularisation speeds up convergence and ensures a unique solution - crucial for ill-conditioned problems.
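The "adds \lambda I to the Hessian" claim is directly checkable (a numpy sketch with synthetic data; names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
n, lam = 30, 0.1

H_ols = X.T @ X / n                      # Hessian of the data-fit term
H_ridge = H_ols + lam * np.eye(5)        # l2 regulariser adds lambda * I

# Every eigenvalue shifts up by exactly lambda, so mu >= lambda > 0
shift = np.linalg.eigvalsh(H_ridge) - np.linalg.eigvalsh(H_ols)
print(np.allclose(shift, lam))
```

Since the unregularised Hessian is already PSD, the smallest eigenvalue of the ridge Hessian is at least \lambda: the problem is \lambda-strongly convex no matter how ill-conditioned X is.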

Condition number: For \mu-strongly convex, L-smooth functions, \kappa = L/\mu governs the convergence rate. In transformer training, the very high condition number of the loss landscape motivates adaptive optimisers.

4.5 Preservation Rules and Calculus of Convex Functions

Convexity is preserved under many operations, making it composable:

Operation                | Condition                                                    | Result
Non-negative combination | \alpha_i \geq 0, f_i convex                                  | \sum_i \alpha_i f_i convex
Composition with affine  | f convex; A, \mathbf{b} arbitrary                            | g(\mathbf{x}) = f(A\mathbf{x}+\mathbf{b}) convex
Pointwise max            | f_i convex                                                   | g(\mathbf{x}) = \max_i f_i(\mathbf{x}) convex
Composition              | f convex nondecreasing, g convex                             | f \circ g convex
Partial min              | f(\mathbf{x}, \mathbf{y}) convex in (\mathbf{x},\mathbf{y})  | g(\mathbf{x}) = \inf_\mathbf{y} f(\mathbf{x},\mathbf{y}) convex
Perspective              | f convex                                                     | g(\mathbf{x},t) = t f(\mathbf{x}/t) convex on \{t>0\}

For AI: The cross-entropy loss \ell(\mathbf{w}) = -\log \sigma(\mathbf{w}^\top \mathbf{x}) is convex in \mathbf{w} because -\log\sigma(z) = \log(1 + e^{-z}) is a log-sum-exp of \{0, -z\} (hence convex in z), composed with the affine map \mathbf{w} \mapsto \mathbf{w}^\top \mathbf{x}. Convexity preservation rules let us verify this without computing the Hessian directly.
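This convexity can also be spot-checked numerically via the defining inequality (a sketch; `nll` is our helper, built on the stable identity -log(sigmoid(z)) = log(1 + e^{-z})):

```python
import numpy as np

def nll(w, x):
    """-log(sigmoid(w . x)) for a positive example, computed stably."""
    z = np.dot(w, x)
    return np.logaddexp(0.0, -z)          # log(1 + exp(-z))

rng = np.random.default_rng(5)
x = rng.normal(size=3)
worst_gap = 0.0
for _ in range(500):
    w1, w2 = rng.normal(size=3), rng.normal(size=3)
    t = rng.random()
    lhs = nll(t * w1 + (1 - t) * w2, x)
    rhs = t * nll(w1, x) + (1 - t) * nll(w2, x)
    worst_gap = max(worst_gap, lhs - rhs)  # convexity: lhs - rhs <= 0
print(worst_gap <= 1e-12)
```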


5. Lagrange Multipliers

When optimization problems come with constraints, unconstrained optimality conditions no longer apply directly. Lagrange multipliers transform constrained problems into unconstrained ones by incorporating the constraint into the objective - at the cost of introducing auxiliary variables.

5.1 Setup: Equality-Constrained Problems

Problem form:

\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) = 0, \quad i = 1, \ldots, m

where f, g_i \in C^1 and m < n (fewer constraints than variables).

The Lagrangian: Define \mathcal{L}: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R} by:

\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_{i=1}^m \lambda_i g_i(\mathbf{x}) = f(\mathbf{x}) + \boldsymbol{\lambda}^\top \mathbf{g}(\mathbf{x})

The scalars \lambda_i are Lagrange multipliers (or dual variables).

5.2 Geometric Derivation

The deepest way to understand why Lagrange's method works is geometric. At a constrained minimum \mathbf{x}^*:

Claim: \nabla f(\mathbf{x}^*) must lie in the span of \{\nabla g_1(\mathbf{x}^*), \ldots, \nabla g_m(\mathbf{x}^*)\}.

Why: Near \mathbf{x}^*, the feasible set is approximately the linear manifold \{\mathbf{x} : \nabla g_i(\mathbf{x}^*)^\top (\mathbf{x} - \mathbf{x}^*) = 0 \;\forall i\} - the tangent plane to each constraint surface. Any feasible direction \mathbf{d} from \mathbf{x}^* must satisfy \nabla g_i(\mathbf{x}^*)^\top \mathbf{d} = 0 for all i.

If \nabla f(\mathbf{x}^*) had a component orthogonal to all the \nabla g_i(\mathbf{x}^*), that component would be a feasible direction along which f decreases - contradicting \mathbf{x}^* being a local constrained minimum.

Therefore \nabla f(\mathbf{x}^*) has no component in the tangent space; it lies entirely in the normal space spanned by the \nabla g_i:

\nabla f(\mathbf{x}^*) = -\sum_{i=1}^m \lambda_i^* \nabla g_i(\mathbf{x}^*) \quad \Longleftrightarrow \quad \nabla_\mathbf{x} \mathcal{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*) = \mathbf{0}
LAGRANGE MULTIPLIER GEOMETRY (R^2)


  Constraint: g(x,y) = 0  (a curve in the plane)
  Objective:  minimize f(x,y)

  Level curves f = c_1, c_2, c_3 sweep across the plane;
  at the optimum x*, the level curve f = c_1 is tangent
  to the constraint curve g = 0.

  At x*: the level curve of f is tangent to the constraint g = 0
         ∇f ∥ ∇g,  i.e.  ∇f = -λ∇g

  ∇g points normal to g = 0; ∇f points normal to f = c_1.
  Tangency means these normals are parallel.


5.3 The Lagrange Multiplier Theorem

Theorem (Lagrange, 1788). Let \mathbf{x}^* be a local minimum of f subject to \mathbf{g}(\mathbf{x}) = \mathbf{0}. If the Linear Independence Constraint Qualification (LICQ) holds at \mathbf{x}^* - i.e., \{\nabla g_1(\mathbf{x}^*), \ldots, \nabla g_m(\mathbf{x}^*)\} are linearly independent - then there exists \boldsymbol{\lambda}^* \in \mathbb{R}^m such that:

\nabla_\mathbf{x} \mathcal{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*) = \nabla f(\mathbf{x}^*) + \boldsymbol{\lambda}^{*\top} \nabla \mathbf{g}(\mathbf{x}^*) = \mathbf{0}

\mathbf{g}(\mathbf{x}^*) = \mathbf{0}

Together these are n + m equations in the n + m unknowns (\mathbf{x}^*, \boldsymbol{\lambda}^*).

LICQ matters: Without LICQ the theorem can fail. Example: \min x subject to x^2 = 0 and x^3 = 0. Both constraint gradients vanish at the only feasible point x^* = 0 - a linearly dependent set - and stationarity 1 + \lambda_1 \cdot 0 + \lambda_2 \cdot 0 = 0 has no solution: no Lagrange multiplier exists.
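As a concrete instance of the theorem, consider \min\, x + y subject to x^2 + y^2 = 1 (our illustration, not from the lesson). Solving the stationarity system by hand gives \lambda^* = 1/\sqrt{2} and x = y = -1/\sqrt{2}; the sketch below verifies both conditions and that the candidate beats sampled feasible points:

```python
import numpy as np

# min f(x, y) = x + y   s.t.   g(x, y) = x^2 + y^2 - 1 = 0
# Stationarity: (1, 1) + lambda * (2x, 2y) = 0, together with g = 0,
# gives lambda = 1/sqrt(2) and x = y = -1/sqrt(2) at the minimum.
lam = 1 / np.sqrt(2)
x = y = -1 / (2 * lam)                   # = -1/sqrt(2)

assert np.isclose(x**2 + y**2, 1.0)                         # primal feasibility
assert np.allclose([1 + 2 * lam * x, 1 + 2 * lam * y], 0)   # stationarity

# The candidate beats every feasible point sampled on the circle
theta = np.linspace(0, 2 * np.pi, 2000)
f_min = x + y                            # = -sqrt(2)
print(np.all(np.cos(theta) + np.sin(theta) >= f_min - 1e-9))
```

Note the system is n + m = 3 equations (two stationarity, one constraint) in the three unknowns (x, y, \lambda), exactly as the theorem counts.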

5.4 Sensitivity Interpretation: Shadow Prices

The Lagrange multiplier \lambda_i^* has a precise economic interpretation: it measures how much the optimal value changes as constraint i is relaxed.

Theorem (Envelope). Let p^*(\mathbf{b}) be the optimal value of \min_\mathbf{x} f(\mathbf{x}) subject to g_i(\mathbf{x}) = b_i. With the Lagrangian convention used here, \mathcal{L} = f + \sum_i \lambda_i (g_i - b_i):

\frac{\partial p^*}{\partial b_i} = -\lambda_i^*

(texts that define \mathcal{L} = f - \sum_i \lambda_i (g_i - b_i) obtain +\lambda_i^*; only the sign convention differs).

Proof sketch. At the optimum, p^*(\mathbf{b}) = \mathcal{L}(\mathbf{x}^*(\mathbf{b}), \boldsymbol{\lambda}^*(\mathbf{b}); \mathbf{b}). Differentiating with respect to b_i, the stationarity conditions annihilate the terms involving d\mathbf{x}^*/db_i, leaving only the explicit dependence \partial \mathcal{L}/\partial b_i (the Implicit Function Theorem controls how \mathbf{x}^* moves with b_i).

For AI: In constrained training (e.g., "minimise loss subject to \|\mathbf{w}\|^2 = c"), |\lambda^*| tells you how much more you could improve the loss by relaxing the norm constraint by one unit. This motivates choosing the right regularisation strength: \lambda^* is the value of the weight-decay penalty that enforces the constraint.

5.5 Multiple Constraints and Second-Order Conditions

Multiple equality constraints: With m equality constraints, the stationarity system has n + m equations. Second-order analysis requires the bordered Hessian or, equivalently, the Hessian of the Lagrangian restricted to the tangent space of the constraints:

\mathbf{d}^\top \nabla^2_{\mathbf{x}\mathbf{x}} \mathcal{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*)\, \mathbf{d} > 0 \quad \forall\, \mathbf{d} \neq \mathbf{0} \text{ with } \nabla \mathbf{g}(\mathbf{x}^*)^\top \mathbf{d} = \mathbf{0}

is the second-order sufficient condition for a constrained local minimum.

5.6 ML Applications of Lagrange Multipliers

PCA as constrained optimisation. Find \mathbf{v}_1 \in \mathbb{R}^d maximising the variance \mathbf{v}_1^\top \Sigma \mathbf{v}_1 subject to \|\mathbf{v}_1\|^2 = 1:

\mathcal{L} = \mathbf{v}^\top \Sigma \mathbf{v} - \lambda(\mathbf{v}^\top \mathbf{v} - 1)

Stationarity: 2\Sigma \mathbf{v} = 2\lambda \mathbf{v}, i.e., \Sigma \mathbf{v} = \lambda \mathbf{v}. The optimal direction is the top eigenvector; \lambda^* equals the top eigenvalue, which equals the maximum variance. PCA is literally solving the Lagrange conditions.
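The PCA stationarity condition can be verified numerically (a sketch on a random PSD matrix standing in for \Sigma; names are ours):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(6, 6))
C = A @ A.T                               # symmetric PSD "covariance"

eigvals, eigvecs = np.linalg.eigh(C)      # ascending eigenvalues
v_star, lam_star = eigvecs[:, -1], eigvals[-1]

# Stationarity from the Lagrangian: C v = lambda v
assert np.allclose(C @ v_star, lam_star * v_star)

# No random unit vector achieves more variance than the top eigenvector
best_random = 0.0
for _ in range(200):
    v = rng.normal(size=6)
    v /= np.linalg.norm(v)
    best_random = max(best_random, v @ C @ v)
print(best_random <= lam_star + 1e-9)
```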

Unit-norm attention. In some attention formulations, query/key vectors are L2-normalized before the dot product. The normalization constraint $\|\mathbf{q}\| = 1$ can be enforced via a Lagrange multiplier; the shadow price reveals how much the attention energy would increase if the norm bound were relaxed.

LoRA rank constraints. Low-Rank Adaptation constrains the weight update $\Delta W = AB$ where $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times k}$, with $r \ll \min(d,k)$. The rank-$r$ constraint is implicit in the factorised parameterisation; its standard convex relaxation constrains the nuclear norm (the sum of singular values), and the Lagrange multiplier view of that relaxation helps explain why the singular values of $\Delta W$ concentrate.


6. KKT Conditions

Lagrange multipliers handle equality constraints. When inequality constraints are present-which is the norm in machine learning (budget constraints, margin constraints, non-negativity)-the Karush-Kuhn-Tucker (KKT) conditions provide the generalisation.

6.1 The Full Problem and Lagrangian

General form:

minxf(x)s.t.hj(x)0,  j=1,,pandgi(x)=0,  i=1,,m\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad h_j(\mathbf{x}) \leq 0,\; j = 1,\ldots,p \quad \text{and} \quad g_i(\mathbf{x}) = 0,\; i = 1,\ldots,m

Lagrangian:

L(x,μ,λ)=f(x)+j=1pμjhj(x)+i=1mλigi(x)\mathcal{L}(\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_{j=1}^p \mu_j h_j(\mathbf{x}) + \sum_{i=1}^m \lambda_i g_i(\mathbf{x})

where μj0\mu_j \geq 0 are the multipliers for inequality constraints and λiR\lambda_i \in \mathbb{R} for equality constraints.

6.2 The Four KKT Conditions

At an optimal x\mathbf{x}^* (under a suitable constraint qualification):

1. Stationarity:

xL(x,μ,λ)=f(x)+jμjhj(x)+iλigi(x)=0\nabla_\mathbf{x} \mathcal{L}(\mathbf{x}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*) = \nabla f(\mathbf{x}^*) + \sum_j \mu_j^* \nabla h_j(\mathbf{x}^*) + \sum_i \lambda_i^* \nabla g_i(\mathbf{x}^*) = \mathbf{0}

2. Primal Feasibility:

hj(x)0jandgi(x)=0ih_j(\mathbf{x}^*) \leq 0 \quad \forall j \qquad \text{and} \qquad g_i(\mathbf{x}^*) = 0 \quad \forall i

3. Dual Feasibility:

μj0j\mu_j^* \geq 0 \quad \forall j

4. Complementary Slackness:

μjhj(x)=0j\mu_j^* h_j(\mathbf{x}^*) = 0 \quad \forall j

Interpreting complementary slackness: For each inequality constraint jj, either:

  • hj(x)=0h_j(\mathbf{x}^*) = 0: the constraint is active (the optimum is on the boundary)-μj\mu_j^* can be nonzero
  • μj=0\mu_j^* = 0: the constraint is inactive (the optimum is strictly interior)-the constraint doesn't affect the optimum

This is the geometric signature of which constraints "matter" at the solution.
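The active/inactive dichotomy can be seen on a one-dimensional toy problem (an illustrative SciPy sketch; the closed-form recovery of $\mu$ from stationarity is specific to this example): minimising $(x-2)^2$ subject to $x \leq b$ gives an interior optimum with $\mu^* = 0$ when $b = 3$ and a boundary optimum with $\mu^* = 2$ when $b = 1$.

```python
from scipy.optimize import minimize

def solve(b):
    """Solve min (x-2)^2 s.t. x <= b and recover the KKT multiplier."""
    # scipy's 'ineq' convention is fun(x) >= 0, so x <= b becomes b - x >= 0
    res = minimize(lambda x: (x[0] - 2.0) ** 2, x0=[0.0], method="SLSQP",
                   constraints={"type": "ineq", "fun": lambda x: b - x[0]})
    x = res.x[0]
    # Stationarity 2(x - 2) + mu = 0 with dual feasibility mu >= 0
    mu = max(0.0, 2.0 * (2.0 - x))
    return x, mu

x_slack, mu_slack = solve(3.0)    # interior optimum: x* = 2, h < 0, mu* = 0
x_active, mu_active = solve(1.0)  # boundary optimum: x* = 1, h = 0, mu* = 2
print((x_slack, mu_slack), (x_active, mu_active))
```

In both cases the product $\mu^* h(x^*)$ vanishes, exactly as complementary slackness requires.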

6.3 Geometric Interpretation

The KKT conditions say: at optimum, you cannot improve ff while satisfying all constraints. The stationarity condition generalises Lagrange's condition: f(x)-\nabla f(\mathbf{x}^*) must lie in the cone generated by the active constraint gradients.

KKT COMPLEMENTARY SLACKNESS GEOMETRY


  Case 1: Constraint inactive (h(x*) < 0)
  
  The optimum is in the interior of the feasible region.
  The inequality constraint plays no role.
  mu* = 0 (it would be wrong to "push" against a slack constraint).

  Case 2: Constraint active (h(x*) = 0)
  
  The optimum lies ON the constraint boundary.
  The gradient nablaf(x*) points into the infeasible region.
  mu* > 0 "pushes back" to prevent crossing the boundary.
  The objective would improve if the constraint were relaxed.

  In both cases: mu* * h(x*) = 0 * (neg) = 0  
                          or:  (pos) * 0 = 0  


6.4 KKT as Necessary Conditions: LICQ Proof

Theorem. If x\mathbf{x}^* is a local minimum and the LICQ holds (active constraint gradients are linearly independent), then the KKT conditions hold.

Proof sketch. Let A={j:hj(x)=0}A = \{j : h_j(\mathbf{x}^*) = 0\} be the active set. By LICQ, {hj(x)}jA{gi(x)}i\{\nabla h_j(\mathbf{x}^*)\}_{j \in A} \cup \{\nabla g_i(\mathbf{x}^*)\}_i are linearly independent.

Any feasible descent direction d\mathbf{d} (satisfying hj(x)d0\nabla h_j(\mathbf{x}^*)^\top \mathbf{d} \leq 0 for jAj \in A and gi(x)d=0\nabla g_i(\mathbf{x}^*)^\top \mathbf{d} = 0) cannot have f(x)d<0\nabla f(\mathbf{x}^*)^\top \mathbf{d} < 0 (otherwise x\mathbf{x}^* not local min).

By Farkas' lemma, f(x)-\nabla f(\mathbf{x}^*) is a conic combination of active constraint gradients: f=jAμjhj+iλigi-\nabla f = \sum_{j \in A} \mu_j \nabla h_j + \sum_i \lambda_i \nabla g_i with μj0\mu_j \geq 0. Setting μj=0\mu_j = 0 for jAj \notin A gives all four conditions. \square

Other constraint qualifications: LICQ is sufficient but not necessary for the KKT conditions to hold at a minimum. Weaker alternatives include Mangasarian-Fromovitz (MFCQ), Slater's condition (for convex problems), and linearity: when all constraints are affine, no further qualification is needed.

6.5 KKT as Sufficient Conditions (Convex Case)

For convex problems, KKT conditions are not just necessary-they are also sufficient:

Theorem. If ff and hjh_j are convex, gig_i are affine, and (x,μ,λ)(\mathbf{x}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*) satisfy all four KKT conditions, then x\mathbf{x}^* is a global minimum.

Proof. For any feasible y\mathbf{y}:

f(y)f(x)+f(x)(yx)(convexity of f)f(\mathbf{y}) \geq f(\mathbf{x}^*) + \nabla f(\mathbf{x}^*)^\top(\mathbf{y}-\mathbf{x}^*) \quad \text{(convexity of } f) =f(x)(jμjhj(x)+iλigi(x))(yx)= f(\mathbf{x}^*) - \left(\sum_j \mu_j^* \nabla h_j(\mathbf{x}^*) + \sum_i \lambda_i^* \nabla g_i(\mathbf{x}^*)\right)^\top (\mathbf{y}-\mathbf{x}^*) f(x)jμjhj(y)+jμjhj(x)iλi(gi(y)gi(x))\geq f(\mathbf{x}^*) - \sum_j \mu_j^* h_j(\mathbf{y}) + \sum_j \mu_j^* h_j(\mathbf{x}^*) - \sum_i \lambda_i^* (g_i(\mathbf{y}) - g_i(\mathbf{x}^*))

Using primal feasibility (hj(y)0h_j(\mathbf{y}) \leq 0), dual feasibility (μj0\mu_j^* \geq 0), complementary slackness (μjhj(x)=0\mu_j^* h_j(\mathbf{x}^*) = 0), and gi(y)=gi(x)=0g_i(\mathbf{y}) = g_i(\mathbf{x}^*) = 0, each term is 0\geq 0, so f(y)f(x)f(\mathbf{y}) \geq f(\mathbf{x}^*). \square

6.6 LP Worked Example

Linear Program: minx12x2\min -x_1 - 2x_2 subject to x1+x24x_1 + x_2 \leq 4, x1,x20x_1, x_2 \geq 0.

Rewriting as h1=x1+x240h_1 = x_1 + x_2 - 4 \leq 0, h2=x10h_2 = -x_1 \leq 0, h3=x20h_3 = -x_2 \leq 0.

Lagrangian: L=x12x2+μ1(x1+x24)μ2x1μ3x2\mathcal{L} = -x_1 - 2x_2 + \mu_1(x_1+x_2-4) - \mu_2 x_1 - \mu_3 x_2.

Stationarity: 1+μ1μ2=0-1 + \mu_1 - \mu_2 = 0, 2+μ1μ3=0-2 + \mu_1 - \mu_3 = 0.

The constraint $x_2 \geq 0$ is never active at the optimum (the objective rewards large $x_2$), so $\mu_3 = 0$. Stationarity then gives $\mu_1 = 2$ and $\mu_2 = 1 > 0$, so by complementary slackness both $h_1$ and $h_2$ are active: $x_1 + x_2 = 4$ and $x_1 = 0$. Optimal: $(x_1^*, x_2^*) = (0, 4)$, objective $= -8$.
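The worked example can be cross-checked with an off-the-shelf LP solver (an illustrative sketch using scipy.optimize.linprog; the multipliers plugged in below are the hand-derived values from the text).

```python
import numpy as np
from scipy.optimize import linprog

# min -x1 - 2x2  s.t.  x1 + x2 <= 4,  x1, x2 >= 0
c = np.array([-1.0, -2.0])
res = linprog(c, A_ub=[[1.0, 1.0]], b_ub=[4.0], bounds=[(0, None), (0, None)])
x_opt, obj = res.x, res.fun

# Hand-derived multipliers: mu1 = 2 (budget), mu2 = 1 (x1 >= 0), mu3 = 0 (x2 >= 0)
mu = np.array([2.0, 1.0, 0.0])
# Stationarity: c + mu1*grad(h1) + mu2*grad(h2) + mu3*grad(h3)
#             = c + mu1*[1,1] + mu2*[-1,0] + mu3*[0,-1] = 0
stationarity = c + mu[0] * np.array([1.0, 1.0]) - np.array([mu[1], mu[2]])

print(x_opt, obj, stationarity)  # x* = [0, 4], objective -8, stationarity ~ [0, 0]
```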


7. Duality Theory

The Lagrangian dual offers a second approach to constrained optimisation: instead of minimising over x\mathbf{x}, we maximise over the multipliers (μ,λ)(\boldsymbol{\mu}, \boldsymbol{\lambda}). The resulting dual problem often has better structure (always convex, regardless of the primal), reveals hidden geometric properties, and-for convex problems-gives exactly the same optimal value.

7.1 The Dual Function and Dual Problem

Definition (Lagrangian dual function):

q(μ,λ)=infxL(x,μ,λ)=infx[f(x)+μh(x)+λg(x)]q(\boldsymbol{\mu}, \boldsymbol{\lambda}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\lambda}) = \inf_{\mathbf{x}} \left[ f(\mathbf{x}) + \boldsymbol{\mu}^\top \mathbf{h}(\mathbf{x}) + \boldsymbol{\lambda}^\top \mathbf{g}(\mathbf{x}) \right]

Key property: qq is always concave in (μ,λ)(\boldsymbol{\mu}, \boldsymbol{\lambda})-it is a pointwise infimum of affine functions of the multipliers.

Dual problem:

d=maxμ0,λq(μ,λ)d^* = \max_{\boldsymbol{\mu} \geq 0, \boldsymbol{\lambda}} q(\boldsymbol{\mu}, \boldsymbol{\lambda})

This is always a convex optimisation problem (maximising concave = minimising convex).

7.2 Weak Duality

Theorem (Weak Duality). dpd^* \leq p^* always, where pp^* is the primal optimal value.

Proof. For any feasible primal x\mathbf{x} (satisfying hj(x)0h_j(\mathbf{x}) \leq 0, gi(x)=0g_i(\mathbf{x}) = 0) and any dual-feasible (μ,λ)(\boldsymbol{\mu}, \boldsymbol{\lambda}) (with μ0\boldsymbol{\mu} \geq 0):

q(μ,λ)=infyL(y,μ,λ)L(x,μ,λ)=f(x)+μh(x)0+λg(x)=0f(x)q(\boldsymbol{\mu}, \boldsymbol{\lambda}) = \inf_{\mathbf{y}} \mathcal{L}(\mathbf{y}, \boldsymbol{\mu}, \boldsymbol{\lambda}) \leq \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \underbrace{\boldsymbol{\mu}^\top \mathbf{h}(\mathbf{x})}_{\leq 0} + \underbrace{\boldsymbol{\lambda}^\top \mathbf{g}(\mathbf{x})}_{=0} \leq f(\mathbf{x})

Taking supremum over dual and infimum over primal: dpd^* \leq p^*. \square

The gap pd0p^* - d^* \geq 0 is the duality gap.
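Weak duality is easy to see numerically on a one-variable problem (an illustrative sketch, not from the lesson): for $\min x^2$ s.t. $x \geq 1$, i.e. $h(x) = 1 - x \leq 0$, the primal optimum is $p^* = 1$ and the dual function evaluates in closed form (inner minimum at $x = \mu/2$) to $q(\mu) = \mu - \mu^2/4$.

```python
import numpy as np

p_star = 1.0                       # primal optimum of min x^2 s.t. x >= 1
mus = np.linspace(0.0, 6.0, 601)
q = mus - mus ** 2 / 4.0           # dual function q(mu) = inf_x [x^2 + mu(1 - x)]

gap_violations = int((q > p_star + 1e-12).sum())  # q must never exceed p*
d_star = q.max()
mu_star = mus[q.argmax()]
print(gap_violations, d_star, mu_star)  # 0, 1.0, 2.0 -> zero gap (convex problem)
```

Here $d^* = p^* = 1$ at $\mu^* = 2$: the duality gap is zero, as strong duality for this convex problem predicts.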

7.3 Strong Duality and Slater's Condition

Theorem (Slater's Condition -> Strong Duality). For convex ff and hjh_j, affine gig_i: if there exists a strictly feasible point x^\hat{\mathbf{x}} (with hj(x^)<0h_j(\hat{\mathbf{x}}) < 0 strictly for all jj), then d=pd^* = p^* (zero duality gap) and the dual optimum is attained.

Slater's condition is a constraint qualification: it says the feasible region is non-degenerate. For LP and QP (quadratic programs), strong duality holds under much weaker conditions.

Implications for ML:

  • The SVM dual problem is equivalent to the primal (strong duality holds by Slater)
  • The dual variables μ\boldsymbol{\mu}^* from strong duality are exactly the KKT multipliers
  • Duality gap as a stopping criterion: if primal value - dual value <ϵ< \epsilon, we have an ϵ\epsilon-optimal solution

7.4 Saddle Point Characterisation

Theorem. (x,μ,λ)(\mathbf{x}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*) is primal-dual optimal with zero duality gap iff it is a saddle point of the Lagrangian:

L(x,μ,λ)L(x,μ,λ)L(x,μ,λ)x,μ0,λ\mathcal{L}(\mathbf{x}^*, \boldsymbol{\mu}, \boldsymbol{\lambda}) \leq \mathcal{L}(\mathbf{x}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*) \leq \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*) \quad \forall \mathbf{x}, \boldsymbol{\mu} \geq 0, \boldsymbol{\lambda}

The minimax equals the maximin: minxmaxμ0,λL=maxμ0,λminxL\min_\mathbf{x} \max_{\boldsymbol{\mu} \geq 0, \boldsymbol{\lambda}} \mathcal{L} = \max_{\boldsymbol{\mu} \geq 0, \boldsymbol{\lambda}} \min_\mathbf{x} \mathcal{L}.

This saddle-point view is foundational for adversarial training in ML: GAN training is exactly seeking a saddle point of the value function V(θG,θD)V(\theta_G, \theta_D).

7.5 SVM Dual: A Complete Example

The SVM is the canonical example of duality in ML. Start with the hard-margin SVM primal:

minw,b12w2s.t.yi(wxi+b)1,i=1,,n\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, n

Rewrite constraints as hi=1yi(wxi+b)0h_i = 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b) \leq 0. Lagrangian:

L(w,b,α)=12w2+iαi(1yi(wxi+b))\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_i \alpha_i (1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b))

KKT stationarity conditions:

Lw=wiαiyixi=0w=iαiyixi\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = \mathbf{0} \quad \Rightarrow \quad \mathbf{w}^* = \sum_i \alpha_i^* y_i \mathbf{x}_i Lb=iαiyi=0\frac{\partial \mathcal{L}}{\partial b} = -\sum_i \alpha_i y_i = 0

Substituting back into L\mathcal{L} to form the dual:

q(α)=iαi12i,jαiαjyiyjxixjq(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j

Dual problem: maxα0q(α)\max_{\boldsymbol{\alpha} \geq 0} q(\boldsymbol{\alpha}) subject to iαiyi=0\sum_i \alpha_i y_i = 0.

This depends only on inner products xixj\mathbf{x}_i^\top \mathbf{x}_j-the kernel trick replaces these with k(xi,xj)k(\mathbf{x}_i, \mathbf{x}_j) for nonlinear boundaries without ever computing features explicitly.

Support vectors: By complementary slackness, $\alpha_i^* > 0$ only when $h_i(\mathbf{w}^*, b^*) = 0$, i.e., when the constraint is active: $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) = 1$. These are exactly the support vectors - the training points on the margin boundary that determine $\mathbf{w}^*$.
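
The dual can be solved numerically for a tiny separable dataset (an illustrative sketch; the four data points and the SLSQP solver choice are arbitrary), recovering $\mathbf{w}^*$ from the multipliers and confirming that only margin points carry $\alpha_i > 0$.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T    # G_ij = y_i y_j x_i . x_j

def neg_dual(a):                             # maximise q(alpha) = minimise -q(alpha)
    return -(a.sum() - 0.5 * a @ G @ a)

res = minimize(neg_dual, x0=np.full(4, 0.1), method="SLSQP",
               bounds=[(0, None)] * 4,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
w = (alpha * y) @ X                          # KKT stationarity: w = sum alpha_i y_i x_i
sv = np.where(alpha > 1e-4)[0]               # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)               # from active constraints y_i(w.x_i + b) = 1
margins = y * (X @ w + b)
print(alpha.round(4), w.round(4), margins.round(4))
```

For these points the optimiser should recover $\mathbf{w}^* \approx (0.4, 0.8)$, $b^* \approx -1.4$, with only the two margin points carrying positive $\alpha$.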


8. Machine Learning Applications

The optimality conditions developed above appear directly in the mathematics underlying every major ML system. This section demonstrates these connections concretely.

8.1 Linear and Ridge Regression

Ordinary Least Squares. Minimise f(w)=12nXwy2f(\mathbf{w}) = \frac{1}{2n}\|X\mathbf{w} - \mathbf{y}\|^2. Setting the gradient to zero:

f(w)=1nX(Xwy)=0XXw=Xy\nabla f(\mathbf{w}) = \frac{1}{n} X^\top(X\mathbf{w} - \mathbf{y}) = \mathbf{0} \quad \Rightarrow \quad X^\top X \mathbf{w} = X^\top \mathbf{y}

These are the normal equations. When XX has full column rank, (XX)0(X^\top X) \succ 0 (SOSC), and the unique solution is w=(XX)1Xy\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}.

Ridge Regression. Adding 2\ell_2 regularisation: minw12nXwy2+λ2w2\min_\mathbf{w} \frac{1}{2n}\|X\mathbf{w}-\mathbf{y}\|^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2.

Normal equations: $(X^\top X + n\lambda I)\mathbf{w} = X^\top \mathbf{y}$. The regulariser shifts all eigenvalues of $X^\top X$ up by $n\lambda$, making the system always well-conditioned (the objective is strongly convex with modulus $\mu \geq \lambda$).
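Both sets of normal equations are two lines of NumPy; the check below (illustrative, on synthetic data) confirms first-order optimality by evaluating the gradient at each solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# OLS: X^T X w = X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
grad_ols = X.T @ (X @ w_ols - y) / n

# Ridge: (X^T X + n*lam*I) w = X^T y
lam = 0.5
w_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
grad_ridge = X.T @ (X @ w_ridge - y) / n + lam * w_ridge

print(np.linalg.norm(grad_ols), np.linalg.norm(grad_ridge))  # both ~ 0
```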

8.2 Lasso and the Subdifferential

Lasso: minw12nXwy2+λw1\min_\mathbf{w} \frac{1}{2n}\|X\mathbf{w}-\mathbf{y}\|^2 + \lambda\|\mathbf{w}\|_1.

The 1\ell_1 norm is not differentiable at wj=0w_j = 0. The first-order optimality condition uses the subdifferential w1\partial \|\mathbf{w}\|_1:

01nX(Xwy)+λw10 \in \frac{1}{n} X^\top(X\mathbf{w}^* - \mathbf{y}) + \lambda \, \partial \|\mathbf{w}^*\|_1

Coordinate-wise, for the jj-th weight:

  • If wj0w_j^* \neq 0: 1n(X(Xwy))j+λsgn(wj)=0\frac{1}{n}(X^\top(X\mathbf{w}^* - \mathbf{y}))_j + \lambda \, \text{sgn}(w_j^*) = 0
  • If wj=0w_j^* = 0: 1n(X(Xwy))jλ\left|\frac{1}{n}(X^\top(X\mathbf{w}^* - \mathbf{y}))_j\right| \leq \lambda

The second condition (correlation of feature jj with residual is small enough) determines sparsity: feature jj is excluded when its correlation with the residual is below the threshold λ\lambda. This is the mathematical source of Lasso's sparsity-inducing property.
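These coordinate-wise conditions can be verified directly (an illustrative sketch: proximal gradient / ISTA on synthetic data; the step size and iteration count are arbitrary choices, not from the lesson).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])       # sparse ground truth
y = X @ w_true + 0.05 * rng.normal(size=n)
lam = 0.2

step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()  # 1/L for the smooth part
w = np.zeros(d)
for _ in range(5000):                               # ISTA: gradient step + soft-threshold
    z = w - step * X.T @ (X @ w - y) / n
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

corr = X.T @ (X @ w - y) / n                        # correlation with residual
active = np.abs(w) > 1e-10
# Subdifferential conditions: active coords have corr_j = -lam*sign(w_j);
# inactive coords have |corr_j| <= lam
ok_active = np.allclose(corr[active], -lam * np.sign(w[active]), atol=1e-6)
ok_inactive = np.all(np.abs(corr[~active]) <= lam + 1e-8)
print(w.round(3), ok_active, ok_inactive)
```

Exactly the zeroed coordinates sit below the correlation threshold $\lambda$: the soft-thresholding step enforces the subdifferential condition at every iterate.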

8.3 SVM: Full KKT Analysis

The SVM soft-margin formulation extends the hard-margin case with slack variables ξi0\xi_i \geq 0:

minw,b,ξ12w2+Ciξis.t.yi(wxi+b)1ξi,ξi0\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0

The KKT conditions include multipliers αi\alpha_i for the margin constraints and βi\beta_i for ξi0\xi_i \geq 0. Complementary slackness gives:

  • αi=0\alpha_i = 0: correctly classified, not a support vector
  • 0<αi<C0 < \alpha_i < C: on the margin, ξi=0\xi_i = 0 (standard support vector)
  • αi=C\alpha_i = C: inside the margin or misclassified, ξi>0\xi_i > 0

The parameter CC controls the trade-off between margin width and training error through the dual feasibility constraint 0αiC0 \leq \alpha_i \leq C.

8.4 PCA via Constrained Optimisation

Principal Component Analysis seeks directions of maximum variance subject to orthonormality. The full PCA problem is:

maxVRd×ktr(VΣV)s.t.VV=Ik\max_{V \in \mathbb{R}^{d \times k}} \text{tr}(V^\top \Sigma V) \quad \text{s.t.} \quad V^\top V = I_k

Lagrangian: L=tr(VΣV)tr(Λ(VVI))\mathcal{L} = \text{tr}(V^\top \Sigma V) - \text{tr}(\Lambda(V^\top V - I)) where Λ\Lambda is a k×kk \times k symmetric multiplier matrix.

Stationarity: 2ΣV=2VΛ2\Sigma V = 2V\Lambda, i.e., ΣV=VΛ\Sigma V = V\Lambda. Each column vj\mathbf{v}_j satisfies Σvj=λjvj\Sigma \mathbf{v}_j = \lambda_j \mathbf{v}_j-an eigenvalue equation. The optimal VV is the matrix of top-kk eigenvectors; the multipliers λj\lambda_j are the eigenvalues (= captured variance).

8.5 Maximum Entropy and Softmax

Maximum Entropy Principle. Given constraints Ep[fk]=ck\mathbb{E}_p[f_k] = c_k for k=1,,Kk=1,\ldots,K, the distribution maximising entropy H(p)=ipilogpiH(p) = -\sum_i p_i \log p_i subject to ipi=1\sum_i p_i = 1 has the Boltzmann/Gibbs form:

pi=exp(kλkfk(i))Z(λ)p_i^* = \frac{\exp(\sum_k \lambda_k^* f_k(i))}{Z(\boldsymbol{\lambda}^*)}

Derivation. Lagrangian with multipliers λ0\lambda_0 (normalisation) and λk\lambda_k (feature constraints):

L=ipilogpiλ0(ipi1)kλk(ipifk(i)ck)\mathcal{L} = -\sum_i p_i \log p_i - \lambda_0 \left(\sum_i p_i - 1\right) - \sum_k \lambda_k \left(\sum_i p_i f_k(i) - c_k\right)

Stationarity in $p_i$: $-\log p_i - 1 - \lambda_0 - \sum_k \lambda_k f_k(i) = 0$, giving $p_i^* \propto \exp(-\sum_k \lambda_k f_k(i))$; absorbing the sign into the multipliers ($\lambda_k \to -\lambda_k$) yields the stated form $p_i^* \propto \exp(\sum_k \lambda_k^* f_k(i))$.

Softmax as max entropy. The softmax function $p_i = e^{z_i}/\sum_j e^{z_j}$ is exactly the max-entropy distribution over $\{1,\ldots,n\}$ in exponential-family form: with indicator features $f_k(i) = \mathbb{1}\{i = k\}$, the Lagrange multipliers are precisely the logits $z_i$. Equivalently, softmax solves $\max_{\mathbf{p} \in \Delta^n} \mathbf{p}^\top \mathbf{z} + H(\mathbf{p})$, whose optimal value is the log-partition function $\log \sum_j e^{z_j}$. This gives the softmax a principled statistical interpretation beyond "normalised exponentials."
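A numerical confirmation of the variational view (an illustrative NumPy sketch): softmax(z) should dominate every other distribution on the objective $\mathbf{p}^\top \mathbf{z} + H(\mathbf{p})$, and its optimal value should equal the log-partition function $\log \sum_j e^{z_j}$.

```python
import numpy as np

rng = np.random.default_rng(3)
z = np.array([1.0, -0.5, 2.0, 0.3])

def objective(p):
    """Linear term plus entropy: p.z + H(p)."""
    p = np.clip(p, 1e-300, 1.0)
    return p @ z - np.sum(p * np.log(p))

p_soft = np.exp(z - z.max())
p_soft /= p_soft.sum()
best = objective(p_soft)

competitors = rng.dirichlet(np.ones(4), size=5000)     # random simplex points
gap = best - max(objective(p) for p in competitors)

lse = np.log(np.exp(z).sum())                          # optimal value is log Z
print(round(best, 6), round(lse, 6), gap > 0)
```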

8.6 Attention as Constrained Optimisation

The attention mechanism in transformers can be viewed as a constrained optimisation problem. Given queries QQ, keys KK, values VV, standard attention computes:

Attn(Q,K,V)=softmax(QKdk)V\text{Attn}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

Constrained interpretation. For a single query $\mathbf{q}$, the attention weights solve an entropy-regularised retrieval problem:

$\mathbf{p}^* = \arg\max_{\mathbf{p} \in \Delta^n} \mathbf{p}^\top K \mathbf{q} - \tau \sum_i p_i \log p_i$

where $\tau = \sqrt{d_k}$ is a temperature and $\Delta^n$ is the probability simplex; the retrieved value is $\mathbf{v}^* = V^\top \mathbf{p}^*$. The KKT conditions for this maximum-entropy retrieval give exactly the softmax attention weights $\mathbf{p}^* = \text{softmax}(K\mathbf{q}/\sqrt{d_k})$.

The Lagrange multiplier for the simplex constraint ipi=1\sum_i p_i = 1 becomes the log-partition function logZ\log Z (the log-sum-exp normaliser), confirming that attention is a principled probabilistic retrieval operation.
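The claim can be checked for one query (an illustrative sketch; the dimensions and the SLSQP solver are arbitrary choices): the closed-form softmax weights coincide with the numerical maximiser of the entropy-regularised score $\mathbf{p}^\top K\mathbf{q} + \sqrt{d_k}\,H(\mathbf{p})$ over the simplex.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
d, n_keys = 8, 5
q = rng.normal(size=d)
K = rng.normal(size=(n_keys, d))
V = rng.normal(size=(n_keys, d))

scores = K @ q
tau = np.sqrt(d)                                   # temperature sqrt(d_k)

p_closed = np.exp(scores / tau - (scores / tau).max())
p_closed /= p_closed.sum()                         # softmax(Kq / sqrt(d))

def neg_obj(p):
    p = np.clip(p, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum()
    return -(p @ scores + tau * entropy)           # minimise the negative objective

res = minimize(neg_obj, x0=np.full(n_keys, 1.0 / n_keys), method="SLSQP",
               bounds=[(0, 1)] * n_keys,
               constraints={"type": "eq", "fun": lambda p: p.sum() - 1.0})
p_numeric = res.x

retrieved = V.T @ p_closed                         # attention output for this query
print(np.abs(p_closed - p_numeric).max())          # ~ 0
```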


9. Non-Convex Landscapes in Deep Learning

Deep learning operates almost entirely in the non-convex regime. Yet neural networks train successfully-why? The answer lies in the special geometric structure of high-dimensional non-convex landscapes.

9.1 Loss Landscape Geometry

An nn-parameter neural network defines a loss landscape L:RnR+\mathcal{L}: \mathbb{R}^n \to \mathbb{R}_+ with structure that differs fundamentally from classical non-convex functions:

Key empirical observations:

  • No spurious local minima (approximately): for sufficiently overparameterised networks on tractable data, all local minima have near-identical loss values. Gradient descent doesn't get "trapped" because there are no deep local minima to get trapped in.
  • Saddle point dominance: most critical points are saddle points, not local minima. The index (number of negative eigenvalue directions) of these saddles is typically large.
  • Loss barriers between solutions: two independently trained solutions are typically separated by a high loss barrier along the straight line between them in weight space, yet connected by simple low-loss curves (mode connectivity, Section 9.5)

For AI: The practical consequence is that SGD with good initialisation reliably finds good solutions, and model merging (linear combination of two trained models) often produces competitive performance-evidence of basin connectivity.

9.2 The Role of Overparameterisation

Theorem (informal, Kawaguchi 2016; Du et al. 2018). For a wide class of networks trained on generic data, when the number of parameters nn exceeds the number of training points NN, every local minimum is a global minimum.

Intuition. With nNn \gg N parameters, the loss function has nNn - N approximate degrees of freedom. The "level set" {w:L(w)0}\{\mathbf{w} : \mathcal{L}(\mathbf{w}) \approx 0\} is a high-dimensional manifold. Any point trying to be a local minimum with non-zero loss would need the gradient to vanish while the Hessian is positive definite in all nn directions-this becomes geometrically impossible when the loss can be driven to zero by moving in any of the nNn - N flat directions.

Neural Tangent Kernel (NTK) perspective. In the infinite-width limit, training dynamics become linear: the vector of residuals $\mathbf{e} = f(\mathbf{w}) - \mathbf{y}$ evolves as $\dot{\mathbf{e}} = -\eta K^\infty \mathbf{e}$, where $K^\infty$ is the (constant) NTK matrix. If $K^\infty \succ 0$, gradient descent converges globally at a linear rate - a convex-analysis result applied to a formally non-convex problem.
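The interpolation phenomenon is easy to reproduce in the linear case (an illustrative sketch, not from the lesson): with more random features than data points, plain gradient descent drives the least-squares loss to numerical zero, because the zero-loss set is a $(d - N)$-dimensional affine subspace.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 5, 50                       # 5 data points, 50 parameters: overparameterised
X = rng.normal(size=(N, d))
y = rng.normal(size=N)             # arbitrary labels are still interpolable

w = np.zeros(d)
L = np.linalg.eigvalsh(X @ X.T).max() / N   # Lipschitz constant of the gradient
lr = 1.0 / L
for _ in range(2000):
    w -= lr * X.T @ (X @ w - y) / N         # plain gradient descent
final_loss = 0.5 * np.mean((X @ w - y) ** 2)
print(final_loss)                           # effectively zero: GD interpolates
```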

9.3 Sharpness-Aware Minimisation (SAM)

Motivated by the observation that flat minima generalise better than sharp ones, Sharpness-Aware Minimisation (Foret et al. 2021) solves:

minwmaxϵ2ρL(w+ϵ)\min_\mathbf{w} \max_{\|\boldsymbol{\epsilon}\|_2 \leq \rho} \mathcal{L}(\mathbf{w} + \boldsymbol{\epsilon})

KKT analysis of the inner problem. Fix w\mathbf{w}. The inner maximisation maxϵρL(w+ϵ)\max_{\|\boldsymbol{\epsilon}\| \leq \rho} \mathcal{L}(\mathbf{w} + \boldsymbol{\epsilon}) is a constrained problem with one inequality constraint h(ϵ)=ϵ2ρ20h(\boldsymbol{\epsilon}) = \|\boldsymbol{\epsilon}\|^2 - \rho^2 \leq 0.

KKT conditions: $\nabla_{\boldsymbol{\epsilon}} \mathcal{L}(\mathbf{w} + \boldsymbol{\epsilon}^*) = 2\mu^* \boldsymbol{\epsilon}^*$, so the optimal perturbation is parallel to the gradient at the perturbed point; to first order in $\rho$, $\boldsymbol{\epsilon}^* \parallel \nabla_\mathbf{w} \mathcal{L}(\mathbf{w})$.

The resulting first-order adversarial perturbation is: $\hat{\boldsymbol{\epsilon}} = \rho \cdot \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}) / \|\nabla_\mathbf{w} \mathcal{L}(\mathbf{w})\|$

SAM gradient step: wwηwL(w+ϵ^)\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_\mathbf{w} \mathcal{L}(\mathbf{w} + \hat{\boldsymbol{\epsilon}}).

This is a direct application of KKT: solving the constrained inner problem analytically gives the SAM update formula.
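The update can be exercised on a toy quadratic (an illustrative NumPy sketch; the curvature matrix $H$, radius $\rho$, and step size $\eta$ are arbitrary choices, and the comparison with a vanilla step is specific to this example).

```python
import numpy as np

H = np.diag([100.0, 1.0])          # curvature: sharp axis (100) vs flat axis (1)

def grad(w):
    return H @ w                   # gradient of f(w) = 0.5 * w.H.w

w0 = np.array([1.0, 1.0])
rho, eta = 0.05, 0.009

g = grad(w0)
eps_hat = rho * g / np.linalg.norm(g)   # closed-form inner maximiser (first order)
w_sam = w0 - eta * grad(w0 + eps_hat)   # SAM step: gradient at the perturbed point
w_gd = w0 - eta * g                     # vanilla gradient-descent step

f = lambda w: 0.5 * w @ H @ w
print(f(w_sam), f(w_gd))  # on this quadratic the SAM step descends the sharp axis faster
```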

9.4 Neural Collapse

At the terminal phase of training, when networks reach near-zero training loss, a remarkable geometric structure called Neural Collapse (Papyan et al. 2020) emerges:

  1. Within-class variability collapse: all training examples of class cc have the same last-layer representation h=μc\mathbf{h} = \boldsymbol{\mu}_c
  2. Equinorm ETF: the class means μ1,,μC\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_C form an Equiangular Tight Frame-they have equal norms and equal pairwise cosine similarities =1/(C1)= -1/(C-1)
  3. Self-duality: the weight vectors wc\mathbf{w}_c align with the class means up to scaling

KKT characterisation. Neural collapse is the KKT point of the Unconstrained Features Model (UFM):

minH,W,bLCE(WH+b)+λ2(HF2+WF2)\min_{\mathbf{H}, \mathbf{W}, \mathbf{b}} \mathcal{L}_{\text{CE}}(\mathbf{W}\mathbf{H} + \mathbf{b}) + \frac{\lambda}{2}(\|\mathbf{H}\|_F^2 + \|\mathbf{W}\|_F^2)

The KKT stationarity conditions, combined with symmetry of the cross-entropy loss at balanced class distributions, force the ETF structure. Neural collapse is not an empirical coincidence-it is the unique KKT point of this simplified training problem.

9.5 Mode Connectivity

Loss valley hypothesis: Two local minima w1\mathbf{w}_1^* and w2\mathbf{w}_2^* of a neural network can be connected by a low-loss path (Garipov et al. 2018). Specifically, there exists a piecewise linear or quadratic curve ϕ:[0,1]Rn\phi: [0,1] \to \mathbb{R}^n with ϕ(0)=w1\phi(0) = \mathbf{w}_1^*, ϕ(1)=w2\phi(1) = \mathbf{w}_2^* and L(ϕ(t))L(w1)\mathcal{L}(\phi(t)) \approx \mathcal{L}(\mathbf{w}_1^*) for all t[0,1]t \in [0,1].

Implication for model merging. The success of weight averaging (WA) methods like Model Soups and SLERP model merging is explained by mode connectivity: if the merged model lies near the connecting path in weight space, it inherits the performance of both endpoints.

Optimality connection. Mode connectivity is a non-convex analogue of a basic fact from convex analysis: convex functions have connected (indeed convex) sublevel sets. For "sufficiently trained" neural networks, the sublevel set $\{\mathbf{w} : \mathcal{L}(\mathbf{w}) \leq \mathcal{L}(\mathbf{w}_1^*) + \epsilon\}$ is approximately path-connected - not because the landscape is convex, but because overparameterisation creates high-dimensional flat regions.


10. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Treating $\nabla f(\mathbf{x}^*) = \mathbf{0}$ as sufficient for a minimum | It's necessary, not sufficient: saddle points and maxima also satisfy it | Always check second-order conditions (Hessian PSD/PD) or verify a global minimum via convexity arguments |
| 2 | Forgetting to check a constraint qualification | KKT conditions require LICQ (or another CQ); without a CQ, a minimum may exist with no Lagrange multiplier | Verify that active constraint gradients are linearly independent at the candidate point |
| 3 | Setting $\mu_j < 0$ for inequality constraints | Negative multipliers on $h_j \leq 0$ constraints violate dual feasibility; the Lagrangian is then unbounded below | Dual feasibility requires $\mu_j \geq 0$ for all inequality constraints (the convention matters: $h \leq 0$ needs $\mu \geq 0$) |
| 4 | Ignoring complementary slackness | Missing the condition $\mu_j h_j = 0$ leads to a wrong determination of which constraints are active | For each inequality constraint, at least one of $\mu_j = 0$ or $h_j = 0$ must hold |
| 5 | Concluding strong duality without Slater | Weak duality always holds, but strong duality ($d^* = p^*$) requires a constraint qualification like Slater's condition | Verify strict feasibility (a Slater point exists) before asserting a zero duality gap |
| 6 | Using the $\det(H)$ test in $\mathbb{R}^n$, $n > 2$ | The determinant test ($\det > 0$ and $H_{11} > 0$) is specific to $\mathbb{R}^2$ | In $\mathbb{R}^n$: compute all eigenvalues of $H$, or use Sylvester's criterion (all $n$ leading principal minors $> 0$ iff PD) |
| 7 | Confusing local and global optimality in non-convex problems | For non-convex functions, local minima may not be global; KKT conditions identify local critical points, not global optima | Use global analysis: prove convexity, use branch-and-bound, or accept local optimality |
| 8 | Wrong sign convention for the Lagrangian | Different texts define $\mathcal{L} = f + \lambda g$ vs. $f - \lambda g$; mixing conventions gives wrong multiplier signs | Pick one convention and be consistent: for $\min f$ s.t. $g = 0$, use $\mathcal{L} = f + \lambda g$ (add constraints to the objective) |
| 9 | Forgetting that the Lagrange multiplier theorem assumes $C^1$ functions | At non-smooth points (e.g., $\ell_1$ constraints), the standard gradient condition fails | Use subdifferentials and subgradient conditions, or smooth the problem with a differentiable approximation |
| 10 | Concluding "no constrained minimum exists" when KKT has no solution | KKT having no solution means the constraint qualification fails OR no minimum exists - these are different | First check whether the feasible set is closed and bounded (Weierstrass guarantees existence); then debug the CQ |
| 11 | Misidentifying support vectors | Only points with $\alpha_i > 0$ (active margin constraint) are support vectors; correctly classified points strictly outside the margin are not | Check complementary slackness: $\alpha_i > 0 \Rightarrow y_i(\mathbf{w}^\top\mathbf{x}_i + b) = 1$ (on the margin boundary) |
| 12 | Applying first-order conditions to discrete or combinatorial constraints | Gradient $= 0$ requires differentiability; discrete feasible sets (e.g., integer programs) don't provide it | For discrete/combinatorial problems, use integer programming methods, branch-and-bound, or relaxations |

11. Exercises

Exercise 1 - Critical Point Classification

Let f(x,y)=x33xy2+y4f(x,y) = x^3 - 3xy^2 + y^4.

(a) Find all critical points by solving f=0\nabla f = \mathbf{0}.

(b) Compute the Hessian at each critical point and classify using the second-order test.

(c) Verify numerically: check that gradient descent from each critical point stays put (up to numerical noise).

(d) Sketch the level curves of ff near each critical point.

Exercise 2 - Lagrange Multipliers: Constrained Extremum

Maximise f(x)=x1x2x3f(\mathbf{x}) = x_1 x_2 x_3 subject to x1+x2+x3=12x_1 + x_2 + x_3 = 12, x0\mathbf{x} \geq \mathbf{0}.

(a) Write the Lagrangian and derive KKT conditions.

(b) Solve analytically and verify that the maximum is f=64f^* = 64.

(c) Interpret the Lagrange multiplier: by how much does ff^* change if the constraint becomes x1+x2+x3=13x_1 + x_2 + x_3 = 13?

(d) Confirm numerically using scipy.optimize.minimize.

Exercise 3 - KKT Conditions for Quadratic Program

Solve: min12(x12+x22)\min \frac{1}{2}(x_1^2 + x_2^2) subject to x1+x23x_1 + x_2 \geq 3, x1,x20x_1, x_2 \geq 0.

(a) Write the KKT conditions in full.

(b) Determine the active constraint(s) and solve the KKT system.

(c) Verify the solution is a global minimum by checking convexity.

(d) Compute the duality gap (should be zero).

Exercise 4 - Convexity Analysis

For each function, determine if it is convex, strictly convex, or neither, on the specified domain:

(a) f(x)=exx1f(x) = e^x - x - 1 on R\mathbb{R}

(b) f(x)=x22+x1f(\mathbf{x}) = \|\mathbf{x}\|_2^2 + \|\mathbf{x}\|_1 on Rn\mathbb{R}^n

(c) f(A)=logdetAf(A) = -\log \det A on S++n\mathbb{S}_{++}^n (positive definite matrices)

(d) f(x,y)=x2/yf(x,y) = x^2/y for y>0y > 0

(e) f(w)=LCE(w)f(\mathbf{w}) = \mathcal{L}_{\text{CE}}(\mathbf{w}) for logistic regression with linearly separable data

Exercise 5 - SVM Dual Derivation

Derive the dual of the hard-margin SVM from scratch.

(a) Write the primal as a standard form QP with inequality constraints.

(b) Form the Lagrangian and derive the dual function q(α)q(\boldsymbol{\alpha}).

(c) Write the dual problem and verify its constraints.

(d) Show that the dual is concave in α\boldsymbol{\alpha}.

(e) Implement and solve both primal and dual for a small dataset; verify they give the same optimal value.

Exercise 6 - Maximum Entropy Distribution

Find the maximum entropy distribution on {1,2,3,4}\{1, 2, 3, 4\} subject to: ipi=1\sum_i p_i = 1 and E[X]=2.5\mathbb{E}[X] = 2.5.

(a) Write the Lagrangian with multipliers λ0,λ1\lambda_0, \lambda_1.

(b) Derive that pi=eλ0λ1i/Zp_i^* = e^{-\lambda_0 - \lambda_1 i}/Z.

(c) Find λ0,λ1\lambda_0, \lambda_1 numerically by solving the constraint equations.

(d) Compare to the uniform distribution: which has higher entropy?

(e) Verify: compute $H(p^*)$ and confirm that no other distribution on $\{1,2,3,4\}$ satisfying both constraints has higher entropy. (Hint: the uniform distribution already satisfies $\mathbb{E}[X] = 2.5$ - what does that imply about $p^*$?)

Exercise 7 - SAM: KKT Analysis

Analyse Sharpness-Aware Minimisation rigorously.

(a) Write the inner maximisation maxϵρL(w+ϵ)\max_{\|\boldsymbol{\epsilon}\| \leq \rho} \mathcal{L}(\mathbf{w} + \boldsymbol{\epsilon}) as a constrained problem.

(b) Write the KKT conditions and solve for ϵ\boldsymbol{\epsilon}^* in closed form.

(c) Show that ϵ^=ρL(w)/L(w)\hat{\boldsymbol{\epsilon}} = \rho \nabla\mathcal{L}(\mathbf{w})/\|\nabla\mathcal{L}(\mathbf{w})\| satisfies the KKT conditions.

(d) Implement one SAM step and compare to vanilla gradient descent on a sharp quadratic.

(e) Empirically verify that SAM finds flatter minima on a small neural network.

Exercise 8 - Duality Gap and Convergence

Implement a primal-dual interior-point method for a small LP.

(a) Formulate: mincx\min \mathbf{c}^\top \mathbf{x} s.t. Ax=bA\mathbf{x} = \mathbf{b}, x0\mathbf{x} \geq 0 as a standard LP.

(b) Implement the primal-dual path-following method tracking both primal and dual variables.

(c) Plot the duality gap vs. iteration and verify it converges to zero.

(d) Verify the optimal solution satisfies all KKT conditions numerically.

(e) Compare convergence to simplex method on the same problem.


12. Why This Matters for AI (2026 Perspective)

| Concept | Impact on AI/ML |
| --- | --- |
| First-order necessary conditions | Every modern optimiser (SGD, Adam, AdamW) terminates at approximate stationarity \nabla\mathcal{L} \approx \mathbf{0}; training stopping criteria are gradient-norm thresholds. |
| Hessian spectrum | Adaptive learning-rate methods (Adam, Adagrad) implicitly approximate H^{-1}\nabla\mathcal{L} to normalise curvature. Sharpness-aware methods (SAM, ASAM) explicitly penalise sharpness, which is closely tied to the spectral norm of H. |
| Saddle points | The prevalence of saddle points (rather than local minima) in deep-network landscapes explains why gradient descent with noise (SGD, dropout) escapes them efficiently. Perturbed gradient descent has provable saddle-escape guarantees. |
| Convexity | Convex relaxations of combinatorial ML problems (e.g. the \ell_1 relaxation of sparse recovery) are solvable to global optimality. Convex loss functions (logistic, squared error) give unique solutions independent of initialisation. |
| Lagrange multipliers | PCA, LDA, and CCA are all Lagrange-multiplier problems. QLoRA's 4-bit quantisation uses a constrained optimisation in which the quantisation constraint is relaxed via Lagrange multipliers in the mixed-precision framework. |
| KKT conditions | SVM training, LP relaxations of beam search, constrained decoding (e.g. budget-forcing), and RLHF reward-constrained optimisation all use KKT theory for optimality analysis. |
| Duality | The SVM dual gives the kernel trick. Max-margin training of LLMs (via RLHF) can be analysed as a dual problem in which the Lagrange multiplier on the KL constraint is the reward temperature \beta. |
| Strong convexity | L_2 regularisation makes the loss strongly convex, giving linear convergence guarantees. The condition number \kappa = L/\mu governs convergence speed; warmup schedules and weight decay are practitioner responses to ill-conditioning. |
| Max entropy / softmax | Language-model next-token prediction is max-entropy inference subject to observed statistics. Temperature scaling adjusts the implicit Lagrange multiplier on the cross-entropy constraint, controlling distribution sharpness. |
| SAM / sharpness | Flat minima empirically generalise better. SAM (Foret et al. 2021) is now standard in SOTA image classification and LLM fine-tuning. The optimality condition for its inner maximisation is the KKT stationarity condition. |
| Neural collapse | The ETF structure at convergence (proven via the KKT conditions of the unconstrained features model, UFM) informs classifier weight-initialisation strategies and explains why the last layer of a fine-tuned model can be reset and retrained cheaply. |
| Mode connectivity | Model merging (Model Soups, TIES, DARE) and model interpolation are enabled by the path-connectivity of loss basins, a direct consequence of overparameterised non-convex landscape geometry. |
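The max-entropy/softmax connection can be made concrete: dividing the logits by a temperature T rescales the implicit Lagrange multiplier, so higher T yields a higher-entropy (flatter) distribution. A small sketch, with assumed toy logits:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T               # T rescales the implicit Lagrange multiplier
    z = z - z.max()              # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])   # assumed toy next-token logits
for T in (0.5, 1.0, 2.0):
    p = softmax(logits, T)
    H = -(p * np.log(p)).sum()
    print(f"T={T}: entropy={H:.3f}, max prob={p.max():.3f}")
```

Running this shows entropy increasing monotonically with T: low temperature sharpens the distribution toward the argmax, high temperature flattens it toward uniform.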

Conceptual Bridge

This section occupies the pivot between two phases of the curriculum. Everything before 04 established the tools for computing derivatives: limits, continuity, partial derivatives, gradients, Jacobians, and the chain rule. This section asks the deeper question: what do these derivatives tell us about where the best solutions are?

Looking backward. The first-order conditions derive directly from the gradient machinery of 01-03. The Hessian, the matrix of second partial derivatives, extends the single-variable second-derivative test to functions of many variables, connecting to the matrix theory of 02. The chain rule and product rule for derivatives underpin every step of the Lagrangian analysis.

Looking forward. The optimality conditions here are used constantly in the chapters ahead:

  • 05/05 (Gradient Descent) and 08 (Optimisation Algorithms): gradient descent is precisely the process of iterating toward \nabla f = \mathbf{0}; convergence analysis requires the strong convexity and Lipschitz-smoothness properties defined here
  • 06 (Probability & Statistics): maximum likelihood estimation is an unconstrained minimisation whose optimality conditions give the maximum likelihood equations; Bayesian MAP estimation introduces prior constraints handled by Lagrange multipliers
  • 07 (Information Theory): the maximum entropy principle (8.5) is a direct application of Lagrange multipliers; the KL-divergence minimisation underlying variational inference is a constrained optimisation whose Lagrangian yields the ELBO
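The gradient-descent connection can be seen directly in code: on a strongly convex quadratic, gradient descent with step size 1/L drives the gradient norm toward zero at a linear rate. A minimal sketch, with an assumed toy matrix Q:

```python
import numpy as np

# Strongly convex quadratic f(x) = 0.5 x^T Q x with an assumed toy matrix Q
Q = np.array([[3.0, 0.5],
              [0.5, 1.0]])               # symmetric positive definite
grad = lambda x: Q @ x

x = np.array([5.0, -3.0])
lr = 1.0 / np.linalg.eigvalsh(Q).max()   # step size 1/L, L = largest eigenvalue
norms = []
for _ in range(100):
    x = x - lr * grad(x)
    norms.append(np.linalg.norm(grad(x)))

print(f"final gradient norm: {norms[-1]:.2e}")   # approximate stationarity
```

The gradient norm contracts by a factor of at most 1 - \mu/L = 1 - 1/\kappa per step, which is exactly the role the condition number plays in the convergence analysis of 05.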
POSITION IN THE CALCULUS CURRICULUM


  04: Calculus Fundamentals
   01: Limits and Continuity           Foundation for 02
   02: Derivatives and Differentiation  Foundation for 03, 04
   03: Integration                     Foundation for 06 (probability)
   04: Optimality Conditions  YOU ARE HERE
          
            Uses: gradients, Hessians, chain rule, Jacobians
            Introduces: critical points, KKT, duality, convexity
          
          
  05: Multivariate Calculus
   01: Partial Derivatives (prerequisite)
   02: Gradient and Hessian (prerequisite)
   03: Chain Rule & Backpropagation (prerequisite)
   04: Optimality Conditions  YOU ARE HERE
   05: Gradient Descent and Convergence  Uses 04's convexity theory

  08: Optimisation Algorithms  Deep dive into algorithms enabled by 04
  09: Probabilistic Graphical Models  MAP/MLE via 04's Lagrange theory


The central insight. Optimality conditions are not merely a tool for finding optima; they are a language for describing what it means for a solution to be right. The KKT conditions don't just tell you where to stop searching; they tell you why the answer is the answer: which constraints are binding, how sensitive the solution is to perturbations, what trade-offs are implicit in the optimal choice. Every time a language model outputs a softmax distribution, every time a recommender system solves a constrained relevance problem, every time a reinforcement learning agent balances reward against a KL penalty, the KKT conditions are operating in the background, whether or not the engineer knows it.

