"In theory there is no difference between theory and practice. In practice there is." - Yogi Berra
Overview
Numerical optimization is the computational science of finding minima (or maxima) of functions. While Section08-Optimization covered the mathematical theory of gradient-based methods, convergence analysis, and adaptive learning rates, this section focuses on the numerical implementation aspects: how floating-point arithmetic affects algorithm behavior, how to implement robust line search, how to exploit problem structure, and how the algorithms used in deep learning trace their lineage to classical numerical optimization.
The central insight: every optimization algorithm is implicitly solving a sequence of linear systems. Understanding this connection - and how the numerical properties of those systems determine convergence - is the key to understanding why Adam works, why Newton's method is quadratically convergent, and why gradient descent can stall.
Scope note: Gradient descent theory, convergence rates, and adaptive learning rates are fully covered in Section08-Optimization. This section covers the numerical implementation aspects that Section08 does not address in depth: line search implementations, trust region methods, quasi-Newton methods (L-BFGS), second-order methods, and the floating-point behavior of optimization algorithms.
Prerequisites
- Section01 Floating-Point Arithmetic - machine epsilon, condition numbers, stability
- Section02 Numerical Linear Algebra - CG, condition numbers, matrix decompositions
- Section08-01 Gradient Descent - gradient descent theory
- Section08-05 Stochastic Optimization - SGD, Adam, momentum
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Line search, Newton's method, L-BFGS, trust region, convergence analysis |
| exercises.ipynb | 10 graded exercises on optimization algorithms and their numerical properties |
Learning Objectives
After completing this section, you will:
- Implement Wolfe condition line search and explain why it's needed for convergence
- Explain why Newton's method is quadratically convergent and when it fails numerically
- Implement L-BFGS and explain the two-loop recursion
- Understand trust region methods as an alternative to line search
- Analyze the effect of condition number on gradient descent convergence
- Connect Adam's update rule to diagonal quasi-Newton methods
- Implement numerical differentiation and analyze its accuracy
- Explain why finite differences have an optimal step size
Table of Contents
- 1. Intuition
- 2. Line Search Methods
- 3. Newton's Method
- 4. Quasi-Newton Methods
- 5. Trust Region Methods
- 6. Numerical Differentiation
- 7. Constrained Optimization Numerics
- 8. Applications in Machine Learning
- 9. Common Mistakes
- 10. Exercises
- 11. Why This Matters for AI (2026 Perspective)
- 12. Conceptual Bridge
1. Intuition
Numerical optimization differs from mathematical optimization in one fundamental way: we compute with floating-point numbers, not exact reals. This has profound consequences:
The gradient is never exact: Even with automatic differentiation, the computed gradient carries a relative error on the order of machine epsilon, amplified by the conditioning of the computation. For well-conditioned objectives, this is fine. For ill-conditioned objectives near a minimum, gradients can be numerically zero before we actually reach the minimum.
The Hessian is expensive and often unavailable: Computing $\nabla^2 f \in \mathbb{R}^{n \times n}$ requires $O(n^2)$ memory and $n$ Hessian-vector products. For $n \sim 10^9$ (a modest LLM), storing the Hessian in fp32 would require on the order of $10^{18}$ bytes - impossible. This is why first-order and quasi-Newton methods dominate.
Ill-conditioning slows convergence: Gradient descent on a quadratic converges at rate $\frac{\kappa - 1}{\kappa + 1}$ per iteration, which approaches 1 for large condition number $\kappa$ - painfully slow. Preconditioning (approximating the inverse Hessian) is the key to faster convergence.
The minimum is almost never reached: Gradient descent converges to a point where $\|\nabla f(x)\| \le \tau$ for some tolerance $\tau > 0$. In deep learning, we stop early (for generalization, not just optimization), so the numerical optimization problem is always underdetermined by design.
2. Line Search Methods
A line search determines how far to move in a given direction $p_k$:
$x_{k+1} = x_k + \alpha_k p_k$. The step size $\alpha_k$ must be chosen carefully: too small -> slow convergence; too large -> divergence or oscillation.
2.1 Exact Line Search
Exact line search minimizes $\phi(\alpha) = f(x_k + \alpha p_k)$ exactly: $\alpha_k = \arg\min_{\alpha > 0} \phi(\alpha)$.
For a quadratic objective $f(x) = \tfrac12 x^T A x - b^T x$ with steepest descent direction $p_k = -\nabla f(x_k) = r_k$, the exact step is $\alpha_k = \dfrac{r_k^T r_k}{r_k^T A r_k}$.
This is the same step-size formula used in CG - steepest descent with exact line search is CG without the conjugate-direction correction.
Limitation: For general nonlinear $f$, finding the exact minimum of $\phi(\alpha)$ requires many function evaluations. Inexact line search conditions (Wolfe, Armijo) are far more practical.
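A minimal numpy sketch of steepest descent with this exact step on an assumed 2x2 SPD system:

```python
import numpy as np

def exact_line_search_gd(A, b, x0, n_iters=50):
    """Steepest descent with exact step alpha = (r^T r)/(r^T A r) on f(x) = 0.5 x^T A x - b^T x."""
    x = x0.copy()
    for _ in range(n_iters):
        r = b - A @ x                      # residual = -gradient
        alpha = (r @ r) / (r @ A @ r)      # exact minimizer of f(x + alpha * r)
        x = x + alpha * r
    return x

# Illustrative 2x2 SPD system (assumed for the demo)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = exact_line_search_gd(A, b, np.zeros(2))
print(np.allclose(A @ x, b, atol=1e-8))    # True: converged to the minimizer
```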
2.2 Wolfe Conditions
The Wolfe conditions define a set of acceptable step sizes that ensure:
- Sufficient decrease (Armijo condition): $f(x_k + \alpha p_k) \le f(x_k) + c_1 \alpha\, \nabla f(x_k)^T p_k$
- Curvature condition: $\nabla f(x_k + \alpha p_k)^T p_k \ge c_2\, \nabla f(x_k)^T p_k$
with $0 < c_1 < c_2 < 1$. Typical values: $c_1 = 10^{-4}$ with $c_2 = 0.9$ (Newton and quasi-Newton methods) or $c_2 = 0.1$ (nonlinear CG).
Strong Wolfe conditions replace the curvature condition with: $\big|\nabla f(x_k + \alpha p_k)^T p_k\big| \le c_2\, \big|\nabla f(x_k)^T p_k\big|$.
This rules out points where the directional derivative is strongly positive - i.e., points far past a local minimizer along the line.
Theorem (Zoutendijk): If every search direction is a descent direction ($\nabla f_k^T p_k < 0$), the Wolfe conditions hold at each step, and $f$ is bounded below with Lipschitz-continuous gradient, then: $\sum_k \cos^2\theta_k\, \|\nabla f_k\|^2 < \infty$, where $\cos\theta_k = \frac{-\nabla f_k^T p_k}{\|\nabla f_k\|\,\|p_k\|}$.
This guarantees that, as long as $\cos\theta_k$ stays bounded away from zero, $\|\nabla f_k\| \to 0$ - the gradients eventually become small.
2.3 Backtracking Armijo
The simplest practical line search:
def backtracking_armijo(f, g, x, p, alpha0=1.0, rho=0.5, c1=1e-4):
"""Backtracking line search satisfying Armijo condition."""
alpha = alpha0
f0 = f(x)
slope = g(x) @ p # Directional derivative (should be negative)
while f(x + alpha * p) > f0 + c1 * alpha * slope:
alpha *= rho
return alpha
Analysis: The Armijo condition ensures "sufficient decrease" - each step reduces $f$ by at least a fraction $c_1$ of what a linear model predicts. Backtracking automatically shrinks the step when the initial trial step is too large.
For AI: PyTorch's torch.optim.LBFGS can use a strong Wolfe line search (line_search_fn='strong_wolfe'). Standard SGD/Adam do not use line search - the learning rate is a fixed hyperparameter (or follows a schedule).
2.4 Strong Wolfe and Zoom
The zoom algorithm finds a step satisfying the strong Wolfe conditions:
- Bracket phase: Find an interval $[\alpha_{lo}, \alpha_{hi}]$ that must contain a Wolfe-satisfying step. Starting from $\alpha = 0$ with a trial step (typically $\alpha_1 = 1$), the trial step is increased (e.g., doubled) until either the Armijo condition fails or the directional derivative at the right endpoint becomes non-negative.
- Zoom phase: Bisect (or use cubic/quadratic interpolation) within the bracket until a strong Wolfe point is found.
Implementation: scipy.optimize.line_search (internally line_search_wolfe2) implements this algorithm. It is the standard choice for L-BFGS implementations.
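A usage sketch with SciPy's strong Wolfe line search; the quadratic objective below is an assumed toy example (line_search returns None for the step if it fails to find a Wolfe point):

```python
import numpy as np
from scipy.optimize import line_search

# Assumed toy objective: f(x) = 0.5 * x^T A x
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
p = -grad(x)   # steepest descent direction

# Returns (alpha, n_fev, n_gev, new_fval, old_fval, new_slope)
alpha, *_ = line_search(f, grad, x, p, c1=1e-4, c2=0.9)
print(alpha)   # a step satisfying the strong Wolfe conditions (or None on failure)
```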
3. Newton's Method
Newton's method for minimizing $f$ applies Newton's root-finding method to the gradient equation $\nabla f(x) = 0$:
$x_{k+1} = x_k - \big[\nabla^2 f(x_k)\big]^{-1} \nabla f(x_k)$, where $\nabla^2 f$ is the Hessian.
3.1 Quadratic Convergence
Theorem: If $f$ is twice continuously differentiable with Lipschitz-continuous Hessian, $\nabla^2 f(x^*)$ is positive definite at the minimum $x^*$, and $x_0$ is sufficiently close to $x^*$, then Newton's method converges quadratically: $\|x_{k+1} - x^*\| \le C\, \|x_k - x^*\|^2$.
Intuition: Each step minimizes the quadratic model $m_k(p) = f(x_k) + \nabla f(x_k)^T p + \tfrac12 p^T \nabla^2 f(x_k)\, p$, which is exact to second order. Quadratic convergence means the error squares at every step: an error of $10^{-2}$ becomes $\sim 10^{-4}$, then $\sim 10^{-8}$, etc.
Cost per iteration: Solving $\nabla^2 f(x_k)\, d_k = -\nabla f(x_k)$ requires $O(n^3)$ flops with a direct Cholesky/LU factorization (less with CG if the Hessian is SPD and well-conditioned), plus $O(n^2)$ memory and work to form the Hessian.
For AI: Newton's method is impractical for modern LLMs ($n \gtrsim 10^9$). However, it motivates all quasi-Newton methods, which approximate the (inverse) Hessian cheaply.
3.2 Modified Newton for Non-Convex
At non-convex points (saddle points or near maxima), $\nabla^2 f(x_k)$ may be indefinite. Applying the Newton step can then move toward a saddle or a maximum, not a minimum.
Strategy 1: Hessian modification. Replace $\nabla^2 f(x_k)$ with $\nabla^2 f(x_k) + \tau I$, where $\tau \ge 0$ is chosen to make the modified Hessian positive definite. This is the Levenberg-Marquardt damping idea.
The modified Newton step solves: $\big(\nabla^2 f(x_k) + \tau I\big)\, d_k = -\nabla f(x_k)$.
Choosing $\tau$:
- If $\nabla^2 f(x_k)$ is already positive definite with smallest eigenvalue $\lambda_{\min} > 0$: set $\tau = 0$.
- Otherwise: set $\tau = -\lambda_{\min} + \delta$ for some small $\delta > 0$.
Strategy 2: Modified Cholesky. Factor $\nabla^2 f(x_k) + E = LDL^T$, where $E$ is chosen minimally to make the factorization positive definite. The Gill-Murray-Wright modified Cholesky algorithm does this robustly.
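A minimal sketch of Strategy 1 (eigenvalue shift); the eigenvalue floor delta and the 2x2 indefinite Hessian are assumed for illustration:

```python
import numpy as np

def modified_newton_step(H, g, delta=1e-8):
    """Solve (H + tau*I) d = -g with tau chosen so the shifted Hessian is positive definite."""
    lam_min = np.linalg.eigvalsh(H).min()
    tau = 0.0 if lam_min > delta else -lam_min + delta
    d = np.linalg.solve(H + tau * np.eye(len(g)), -g)
    return d, tau

# Indefinite Hessian (saddle point): the modified step is still a descent direction
H = np.array([[2.0, 0.0], [0.0, -1.0]])
g = np.array([1.0, 1.0])
d, tau = modified_newton_step(H, g)
print(tau > 0, g @ d < 0)   # True True: positive shift applied, descent direction obtained
```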
3.3 Inexact Newton Methods
For large-scale problems, solving the Newton system exactly is too expensive. Inexact Newton methods find $d_k$ such that:
$\|\nabla^2 f(x_k)\, d_k + \nabla f(x_k)\| \le \eta_k\, \|\nabla f(x_k)\|$, where $\{\eta_k\}$ is the forcing sequence. Using CG to solve the linear system approximately (Newton-CG or truncated Newton):
for outer iteration k:
solve H_k d = -g_k using CG, stopping when
||H_k d + g_k|| <= eta_k ||g_k||
x_{k+1} = x_k + alpha_k d_k (line search for alpha_k)
Convergence: If $\eta_k \to 0$ fast enough (e.g., $\eta_k = \min(0.5, \sqrt{\|\nabla f(x_k)\|})$), inexact Newton converges super-linearly. If $\eta_k$ is held constant at some $\eta < 1$, convergence is linear with a rate determined by $\eta$.
For AI: Inexact Newton methods (Hessian-free optimization) were used for training neural networks in the pre-deep-learning era (Martens, 2010). The Hessian-vector product $\nabla^2 f(x)\, v$ can be computed at roughly the cost of one extra gradient evaluation via automatic differentiation (Pearlmutter's trick), making each CG iteration cost about one forward+backward pass without ever forming $\nabla^2 f$.
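A numpy sketch of the Newton-CG outer/inner loop above, assuming a callable hvp_at(x, v) that returns the Hessian-vector product; negative-curvature handling and the line search are omitted for brevity:

```python
import numpy as np

def truncated_cg(hvp, g, eta, max_iter=None):
    """Approximately solve H d = -g by CG, stopping when ||H d + g|| <= eta * ||g||."""
    d = np.zeros_like(g)
    r = -g.copy()                 # residual of H d = -g at d = 0
    p = r.copy()
    max_iter = max_iter or 10 * len(g)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= eta * np.linalg.norm(g):
            break
        Hp = hvp(p)
        alpha = (r @ r) / (p @ Hp)         # assumes positive curvature along p
        d += alpha * p
        r_new = r - alpha * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d

def newton_cg(grad, hvp_at, x0, n_outer=20, tol=1e-8):
    """Inexact Newton with forcing sequence eta_k = min(0.5, sqrt(||g_k||))."""
    x = x0.copy()
    for _ in range(n_outer):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        eta = min(0.5, np.sqrt(np.linalg.norm(g)))
        x = x + truncated_cg(lambda v: hvp_at(x, v), g, eta)   # unit step; add a line search in practice
    return x
```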
4. Quasi-Newton Methods
Quasi-Newton methods build up an approximation $B_k \approx \nabla^2 f(x_k)$ (or $H_k \approx [\nabla^2 f(x_k)]^{-1}$) using gradient information at each step, without forming the Hessian explicitly.
4.1 BFGS Update
The Broyden-Fletcher-Goldfarb-Shanno (BFGS) update is the most successful quasi-Newton method.
Setup: After step $k$, define: $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
Secant equation: The updated Hessian approximation should satisfy: $B_{k+1} s_k = y_k$.
This is analogous to a finite difference approximation of the Hessian along the step direction.
BFGS update (inverse Hessian form): $H_{k+1} = \big(I - \rho_k s_k y_k^T\big)\, H_k\, \big(I - \rho_k y_k s_k^T\big) + \rho_k s_k s_k^T$,
where $\rho_k = 1/(y_k^T s_k)$.
Properties:
- $H_{k+1}$ is symmetric positive definite if $H_k$ is SPD and $y_k^T s_k > 0$
- $H_{k+1}$ satisfies the secant condition: $H_{k+1} y_k = s_k$
- BFGS converges super-linearly (between linear and quadratic) for smooth strongly convex functions
Curvature condition: $y_k^T s_k > 0$ is guaranteed when using the Wolfe line search (hence why Wolfe conditions are essential for BFGS). Without it, the update can make $H_{k+1}$ indefinite.
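A numpy sketch of the inverse-Hessian BFGS update above, with an assumed skip rule when the curvature condition fails:

```python
import numpy as np

def bfgs_update(H, s, y):
    """Update the inverse-Hessian approximation H with the pair (s, y)."""
    sy = y @ s
    if sy <= 1e-10 * np.linalg.norm(s) * np.linalg.norm(y):
        return H                            # curvature condition violated: skip the update
    rho = 1.0 / sy
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Secant check on random data with y^T s > 0
rng = np.random.default_rng(0)
s = rng.standard_normal(5)
y = s + 0.1 * rng.standard_normal(5)
H = bfgs_update(np.eye(5), s, y)
print(np.allclose(H @ y, s))   # True: the secant condition H y = s holds
```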
4.2 L-BFGS: Limited Memory BFGS
Full BFGS requires $O(n^2)$ memory to store $H_k$. For $n = 10^6$, that is roughly 8 TB in fp64 - impractical.
L-BFGS stores only the last $m$ pairs $(s_i, y_i)$ and implicitly applies $H_k$ to the gradient via the two-loop recursion:
import numpy as np

def lbfgs_direction(g, s_list, y_list):
"""Compute H^{-1} g using the two-loop L-BFGS recursion."""
m = len(s_list)
rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
q = g.copy()
alpha = np.zeros(m)
# First loop (backward)
for i in range(m-1, -1, -1):
alpha[i] = rho[i] * s_list[i] @ q
q -= alpha[i] * y_list[i]
# Scale by initial Hessian approximation
# H_0^{-1} = (s_m^T y_m) / (y_m^T y_m) * I (Barzilai-Borwein scaling)
if m > 0:
gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
else:
gamma = 1.0
r = gamma * q
# Second loop (forward)
for i in range(m):
beta = rho[i] * y_list[i] @ r
r += s_list[i] * (alpha[i] - beta)
return -r # Search direction
Memory requirement: $O(mn)$ - just $2m$ vectors of dimension $n$. Typically $m = 5$ to $20$.
Convergence: L-BFGS with $m$ stored pairs remembers the last $m$ gradient changes, giving fast local convergence (near super-linear in practice, though only linear convergence is guaranteed in theory); globally, for general non-convex $f$, only convergence to stationary points is guaranteed.
For AI: torch.optim.LBFGS implements L-BFGS. It is used for full-batch training and fine-tuning tasks where the full gradient is available and noise-free. For stochastic settings (mini-batch), L-BFGS is not directly applicable because the secant condition fails when gradients are noisy.
4.3 SR1 Update
The Symmetric Rank-1 (SR1) update is simpler: $B_{k+1} = B_k + \dfrac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}$.
Advantage: SR1 can capture indefinite Hessians (saddle points), unlike BFGS (which always produces SPD updates).
Disadvantage: Not guaranteed to remain positive definite; the update can blow up numerically when the denominator $(y_k - B_k s_k)^T s_k \approx 0$ (standard practice is to skip the update in that case, as sketched below).
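A numpy sketch of the SR1 update with the standard skip safeguard (the threshold r = 1e-8 is an assumed value):

```python
import numpy as np

def sr1_update(B, s, y, r=1e-8):
    """Symmetric rank-1 update of the Hessian approximation B, skipping near-singular updates."""
    v = y - B @ s
    denom = v @ s
    if abs(denom) < r * np.linalg.norm(s) * np.linalg.norm(v):
        return B                            # denominator too small: skip to avoid blow-up
    return B + np.outer(v, v) / denom
```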
5. Trust Region Methods
Trust region methods take a different approach: instead of choosing a direction and then a step size (line search), they choose a step size first and then find the best direction within that radius.
5.1 Trust Region Framework
At each iteration, define a trust region radius $\Delta_k > 0$ and solve:
$\min_{\|p\| \le \Delta_k} \; m_k(p) = f(x_k) + \nabla f(x_k)^T p + \tfrac12 p^T B_k p$,
where $B_k$ is an approximation to the Hessian.
Ratio test: After computing the step $p_k$, compare the actual reduction to the predicted reduction:
$\rho_k = \dfrac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)}$.
Update rules:
- $\rho_k < 1/4$: shrink trust region ($\Delta_{k+1} = \Delta_k / 4$); reject the step if the reduction is negligible ($\rho_k \le 0$)
- $1/4 \le \rho_k \le 3/4$: keep trust region; accept step
- $\rho_k > 3/4$ (and the step reaches the boundary): expand trust region ($\Delta_{k+1} = 2\Delta_k$); accept step
5.2 Cauchy Point and Dogleg
The exact trust region subproblem is an eigenvalue-type problem that costs $O(n^3)$. Cheaper approximations:
Cauchy point: The point $p^C$ that minimizes the model $m_k$ along the steepest descent direction $-\nabla f_k$, subject to the trust region constraint $\|p\| \le \Delta_k$.
The Cauchy point alone gives linear convergence (same as gradient descent with exact line search).
Dogleg method: Interpolate between the Cauchy point $p^C$ and the unconstrained Newton step $p^N = -B_k^{-1} \nabla f_k$:
- If $\|p^N\| \le \Delta_k$: take the Newton step
- Else: move along the dogleg path $0 \to p^C \to p^N$ until hitting the trust region boundary
This gives super-linear convergence near the optimum (a sketch follows).
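A numpy sketch of the dogleg step, assuming the model Hessian B is positive definite:

```python
import numpy as np

def dogleg_step(g, B, delta):
    """Dogleg approximation to min g^T p + 0.5 p^T B p subject to ||p|| <= delta."""
    p_newton = -np.linalg.solve(B, g)
    if np.linalg.norm(p_newton) <= delta:
        return p_newton                                    # full Newton step fits
    p_cauchy = -(g @ g) / (g @ B @ g) * g                  # unconstrained Cauchy point
    if np.linalg.norm(p_cauchy) >= delta:
        return -delta * g / np.linalg.norm(g)              # steepest descent, scaled to the boundary
    # Walk from the Cauchy point toward the Newton point until ||p|| = delta
    d = p_newton - p_cauchy
    a, b, c = d @ d, 2 * p_cauchy @ d, p_cauchy @ p_cauchy - delta**2
    tau = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)       # positive root of the boundary equation
    return p_cauchy + tau * d
```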
5.3 Trust Region vs Line Search
| Property | Trust Region | Line Search |
|---|---|---|
| Convergence guarantee | Global (any Hessian approx) | Requires descent direction |
| Indefinite Hessian | Handled naturally | Needs modification |
| Cost per iteration | Higher (subproblem solve) | Lower (backtracking) |
| Parameter tuning | Trust region radius | Learning rate |
| Non-smooth objectives | Can be extended | Harder |
For AI: Trust region methods are used in reinforcement learning (TRPO, PPO), where the "trust region" constrains policy updates. The KL divergence constraint in TRPO plays the role of the trust region: $\mathbb{E}\big[D_{\mathrm{KL}}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta)\big] \le \delta$.
6. Numerical Differentiation
When analytic gradients are unavailable (or to verify automatic differentiation), we use finite difference approximations.
6.1 Finite Differences
Forward difference: $f'(x) \approx \dfrac{f(x+h) - f(x)}{h}$.
Error: $O(h)$ truncation plus $O(\epsilon_{\text{mach}}/h)$ rounding.
Centered difference: $f'(x) \approx \dfrac{f(x+h) - f(x-h)}{2h}$.
Error: $O(h^2)$ truncation - more accurate, but requires two function evaluations.
Richardson extrapolation: Eliminate the leading-order error term: $D_{\text{Rich}}(h) = \dfrac{4\,D(h/2) - D(h)}{3}$,
where $D(h)$ denotes the centered difference with step $h$. Error: $O(h^4)$.
6.2 Optimal Step Size
The total error in the centered difference is approximately: $E(h) \approx \dfrac{|f'''(x)|}{6} h^2 + \dfrac{\epsilon_{\text{mach}}\,|f(x)|}{h}$ (truncation plus rounding).
Minimizing over $h$ (set $dE/dh = 0$) gives $h^* \sim \epsilon_{\text{mach}}^{1/3}$.
For fp64: $h^* \approx (2.2 \times 10^{-16})^{1/3} \approx 6 \times 10^{-6}$.
Minimum achievable error: $\sim \epsilon_{\text{mach}}^{2/3} \approx 4 \times 10^{-11}$.
For the forward difference: $h^* \sim \epsilon_{\text{mach}}^{1/2} \approx 1.5 \times 10^{-8}$, minimum error $\sim \epsilon_{\text{mach}}^{1/2} \approx 10^{-8}$.
6.3 Complex-Step Derivative
The complex-step method avoids cancellation entirely:
$f'(x) \approx \dfrac{\operatorname{Im}\big[f(x + ih)\big]}{h}$,
where $i = \sqrt{-1}$. This requires evaluating $f$ at a complex argument.
Error: $O(h^2)$ - the same order as the centered difference, but with no rounding error from cancellation (the imaginary part is built up by additions, not a subtraction of nearly equal numbers). With $h = 10^{-20}$, the truncation term is $10^{-40}$ - the result is accurate to machine precision.
Limitation: Requires the function to be analytically extendable to complex arguments (which almost all smooth functions are, but branch cuts can cause issues).
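A quick numpy demonstration of the complex-step derivative against a centered difference; the test function is an assumed example:

```python
import numpy as np

f = lambda x: np.exp(x) * np.sin(x)                       # analytic test function
df_exact = lambda x: np.exp(x) * (np.sin(x) + np.cos(x))

x = 0.7
centered = (f(x + 1e-6) - f(x - 1e-6)) / (2e-6)
complex_step = np.imag(f(x + 1e-20j)) / 1e-20

print(abs(centered - df_exact(x)))       # ~1e-10: limited by rounding/cancellation
print(abs(complex_step - df_exact(x)))   # ~1e-16: machine precision
```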
7. Constrained Optimization Numerics
7.1 Projected Gradient
For optimization over a convex set $C$ (e.g., the simplex, a box, or the unit ball):
$x_{k+1} = P_C\big(x_k - \alpha_k \nabla f(x_k)\big)$,
where $P_C$ is the Euclidean projection onto $C$.
Common projections (sketched below):
- Box $\{x : l \le x \le u\}$: clip each coordinate - $O(n)$
- Simplex $\{x : x \ge 0,\ \sum_i x_i = 1\}$: sort and find a threshold - $O(n \log n)$
- Ball $\{x : \|x\| \le r\}$: return $x$ if $\|x\| \le r$, else $r\,x/\|x\|$ - $O(n)$
For AI: Gradient clipping is a projection of the gradient onto the ball $\{g : \|g\| \le c\}$. Softmax output can be seen as optimization over the simplex. Spectral normalization maintains weights on the unit spectral-norm ball via a projection-like step.
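Minimal numpy sketches of the three projections listed above (the simplex projection uses the standard sort-and-threshold construction):

```python
import numpy as np

def project_box(x, lo, hi):
    """Project onto the box [lo, hi]^n: clip each coordinate."""
    return np.clip(x, lo, hi)

def project_ball(x, r):
    """Project onto the Euclidean ball of radius r."""
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

def project_simplex(x):
    """Project onto the probability simplex {p : p >= 0, sum p = 1}."""
    u = np.sort(x)[::-1]                    # sort in descending order
    css = np.cumsum(u)
    ks = np.arange(1, len(x) + 1)
    k = ks[u + (1 - css) / ks > 0][-1]      # largest index keeping the shifted value positive
    tau = (css[k - 1] - 1) / k
    return np.maximum(x - tau, 0.0)
```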
7.2 Augmented Lagrangian
For equality-constrained problems $\min_x f(x)$ subject to $c(x) = 0$, the augmented Lagrangian is:
$L_\rho(x, \lambda) = f(x) + \lambda^T c(x) + \dfrac{\rho}{2}\,\|c(x)\|^2$.
The augmented Lagrangian method of multipliers - and its splitting variant, the alternating direction method of multipliers (ADMM) - minimizes this iteratively:
- $x^{k+1} = \arg\min_x L_\rho(x, \lambda^k)$ (minimize over the primal variable)
- $\lambda^{k+1} = \lambda^k + \rho\, c(x^{k+1})$ (dual update)
ADMM in AI: Used in distributed optimization (splitting model across devices), constrained optimization problems in RL, and structured sparsity (e.g., group lasso).
8. Applications in Machine Learning
8.1 Adam as Diagonal Quasi-Newton
Adam's update rule:
$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$, $\quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$, $\quad \theta_{t+1} = \theta_t - \alpha\, \dfrac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$
can be rewritten as:
$\theta_{t+1} = \theta_t - \alpha\, D_t^{-1} \hat m_t$, where $D_t = \operatorname{diag}\big(\sqrt{\hat v_t} + \epsilon\big)$ plays the role of a diagonal approximate Hessian.
This is diagonal quasi-Newton. The second moment $v_t$ estimates $\mathbb{E}[g_t^2]$, which is proportional to the diagonal of the (empirical) Fisher information matrix. Dividing by $\sqrt{\hat v_t}$ pre-conditions the gradient by the square root of the diagonal Fisher.
Why RMS, not raw variance? Taking the square root of $\hat v_t$, rather than dividing by $\hat v_t$ itself, is the key. On a quadratic with diagonal Hessian $\operatorname{diag}(\lambda_i)$, the second-moment estimate in coordinate $i$ grows with the curvature $\lambda_i$, so dividing by $\sqrt{\hat v_i}$ shrinks steps in high-curvature directions and enlarges them in flat ones. The resulting step sits partway between plain gradient descent ($g_i$) and the diagonal Newton step ($g_i/\lambda_i$) - a "half-step" towards the Newton direction.
8.2 L-BFGS for Full-Batch Training
L-BFGS is the standard algorithm for full-batch optimization of small-to-medium neural networks:
optimizer = torch.optim.LBFGS(
model.parameters(),
lr=1.0, # Inner learning rate (line search scales this)
    max_iter=20,           # Max L-BFGS iterations per optimizer.step() call
max_eval=25, # Max function evaluations per step
history_size=100, # L-BFGS memory
line_search_fn='strong_wolfe'
)
def closure():
optimizer.zero_grad()
loss = criterion(model(X), y)
loss.backward()
return loss
for epoch in range(100):
optimizer.step(closure)
When to use L-BFGS:
- Batch size = full dataset (no noise)
- Medium-scale problems (up to roughly $10^5$-$10^6$ parameters)
- Need high accuracy (e.g., physics simulation, function approximation)
- Fine-tuning pre-trained models with small LoRA adapters
When NOT to use L-BFGS:
- Large-scale stochastic training (gradient noise breaks secant condition)
- Online/streaming data
- Very large models where memory is constrained
8.3 Hessian-Free Optimization
Hessian-free (HF) optimization (Martens, 2010) computes approximate Newton steps by running CG on the curvature matrix without ever forming it, typically using the (generalized) Gauss-Newton approximation $G = J^T H_L J$ in place of the Hessian (where $J$ is the network Jacobian and $H_L$ the loss Hessian), since $G$ is positive semi-definite:
Hessian-vector products are computed via automatic differentiation:
import torch

def hvp(loss, params, v):
"""Hessian-vector product Hv via autograd."""
grad = torch.autograd.grad(loss, params, create_graph=True)
grad_v = sum(g.view(-1) @ v_part.view(-1)
for g, v_part in zip(grad, v))
return torch.autograd.grad(grad_v, params)
This enables CG iterations that each cost about one extra forward+backward pass, without forming or storing the Hessian.
Historical note: HF optimization achieved state-of-the-art results on deep network training in 2010-2012, before SGD with momentum/Adam became dominant. For recurrent networks with long sequences (high curvature), HF still has advantages.
8.4 Natural Gradient and Fisher Information
The natural gradient is:
$\tilde\nabla f(\theta) = F(\theta)^{-1}\, \nabla f(\theta)$,
where $F(\theta) = \mathbb{E}\big[\nabla_\theta \log p_\theta\, \nabla_\theta \log p_\theta^T\big]$ is the Fisher information matrix. This is Newton's method with the Hessian replaced by the Fisher matrix.
Why the Fisher? The KL divergence between nearby distributions is approximately $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta+\delta}) \approx \tfrac12 \delta^T F(\theta)\, \delta$. The natural gradient moves in the direction of steepest descent in the space of probability distributions (a Riemannian manifold with metric $F$), not in raw parameter space.
K-FAC (Kronecker-Factored Approximate Curvature): Approximates each layer's Fisher block as a Kronecker product $F \approx A \otimes G$, where $A$ is the covariance of the layer's input activations and $G$ is the covariance of the backpropagated gradient signals. Then $F^{-1} \approx A^{-1} \otimes G^{-1}$, which can be applied efficiently.
9. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using too large a step in finite differences | Truncation error $O(h^2)$ dominates | Use $h \approx \epsilon_{\text{mach}}^{1/3} \approx 6\times10^{-6}$ for centered differences |
| 2 | Using too small a step in finite differences | Catastrophic cancellation: $f(x+h) - f(x-h)$ loses most significant digits in floating point | Use $h \approx 10^{-6}$ (centered, fp64); never $h \lesssim 10^{-12}$ |
| 3 | Applying L-BFGS to stochastic objectives | Gradient noise violates the secant condition | Use Adam/SGD for mini-batch; L-BFGS only for full batch |
| 4 | Ignoring the Wolfe condition in BFGS | BFGS update may produce an indefinite $H_{k+1}$ | Always use a Wolfe line search with BFGS |
| 5 | Using forward finite differences for gradient checking | Only $O(h)$ accuracy (best error $\sim 10^{-8}$) | Use centered differences or the complex-step method |
| 6 | Forgetting the Hessian modification in Newton's method | Newton step points toward a maximum or saddle at indefinite points | Add $\tau I$ or use modified Cholesky |
| 7 | Setting learning rate too large and expecting convergence | Divergence is guaranteed for $\alpha > 2/L$ on quadratics | Use line search or tune the learning rate carefully |
| 8 | Using Adam for convex problems when L-BFGS is available | Adam needs far more iterations and reaches worse final accuracy | L-BFGS for full-batch convex problems |
| 9 | Checking gradient only at the initial point | Gradients can be correct initially but wrong near the optimum due to numerical issues | Gradient-check at multiple points, including near-optimal ones |
| 10 | Treating trust region radius as a learning rate | Trust region is in parameter space; effective step size depends on the Hessian | Tune the trust region ratio-test parameters separately |
| 11 | Forgetting that Adam's epsilon matters | $\epsilon = 10^{-8}$ (default) can dominate $\sqrt{\hat v_t}$ for very small gradients | Increase $\epsilon$ (e.g., $10^{-6}$ or larger) when gradients are tiny or precision is low |
| 12 | Not resetting L-BFGS history when the task changes | Old curvature information misleads the new optimization | Reset the optimizer state or create a new optimizer |
10. Exercises
Exercise 1 * - Backtracking Armijo Line Search
Implement the backtracking Armijo line search.
(a) Write armijo_search(f, grad_f, x, p, alpha0=1.0, rho=0.5, c1=1e-4).
(b) Test on the Rosenbrock function. Start from a point away from the minimum (the classical start is $x_0 = (-1.2, 1)$) with the steepest descent direction $p = -\nabla f(x_0)$.
(c) Plot $\phi(\alpha) = f(x_0 + \alpha p)$ together with the Armijo line $f(x_0) + c_1 \alpha \nabla f(x_0)^T p$, marking the accepted step.
Exercise 2 * - Newton's Method
Implement Newton's method with line search.
(a) Write newton_method(f, grad_f, hess_f, x0, tol) using backtracking Armijo line search.
(b) Test on the Rosenbrock function $f(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2$.
(c) Compare convergence (iterations to a fixed accuracy, e.g., $\|\nabla f\| < 10^{-6}$) between gradient descent, Newton's method, and gradient descent with exact line search.
Exercise 3 ** - L-BFGS Two-Loop Recursion
Implement the L-BFGS two-loop recursion.
(a) Write lbfgs_direction(g, s_list, y_list, m) that returns the L-BFGS search direction.
(b) Implement a full L-BFGS optimizer using the strong Wolfe line search.
(c) Test on quadratic minimization with varying condition numbers. Compare convergence to gradient descent.
Exercise 4 ** - Finite Difference Gradient Check
Implement gradient checking via finite differences.
(a) Write grad_check(f, grad_f, x, h_vals) that computes the centered-difference gradient for each $h$ in h_vals and compares it to grad_f(x).
(b) Find the optimal $h$ empirically for several test functions.
(c) Plot error vs $h$ on a log-log scale. Verify the $O(h^2) + O(\epsilon_{\text{mach}}/h)$ form.
Exercise 5 ** - Trust Region vs Line Search
Compare trust region and line search methods.
(a) Implement a simple trust region method with Cauchy point for a 2D test function.
(b) Test on a non-convex function with saddle points. Show that trust region methods don't require modification of the Hessian.
(c) Plot the trajectory of iterates for both methods.
Exercise 6 ** - Adam as Diagonal Quasi-Newton
Demonstrate Adam's connection to diagonal preconditioning.
(a) Implement gradient descent, Adam, and diagonal-preconditioned gradient descent.
(b) Test on an ill-conditioned diagonal quadratic $f(x) = \tfrac12 x^T D x$.
(c) Show that Adam converges at a rate close to the optimal diagonal-preconditioned rate, and much faster than vanilla gradient descent.
Exercise 7 ** - Numerical Differentiation Accuracy
Implement finite differences and compare to automatic differentiation.
(a) Implement forward, centered, and Richardson extrapolation finite differences.
(b) Find the optimal $h$ for each method on a smooth test function of your choice.
(c) Implement the complex-step derivative and verify it achieves near machine precision.
Exercise 8 *** - Hessian-Vector Product via Autograd
Compute Hessian-vector products efficiently using automatic differentiation.
(a) Write hvp(model, x, v) that computes the Hessian-vector product $\nabla^2 f \cdot v$ without forming the Hessian, using PyTorch's autograd.grad with create_graph=True.
(b) Implement Newton-CG (truncated Newton) using CG applied to the Hessian-vector product.
(c) Test on a small neural network (2 layers, 50 hidden units). Compare convergence to Adam.
11. Why This Matters for AI (2026 Perspective)
| Aspect | Impact |
|---|---|
| Adam as quasi-Newton | Understanding Adam through the lens of diagonal preconditioning explains its robustness and motivates improvements like AdaHessian, Shampoo |
| L-BFGS for fine-tuning | Fine-tuning small LoRA adapters on a full batch can use L-BFGS, which converges faster than Adam in this low-noise regime |
| Hessian-vector products | Used in mechanistic interpretability (computing Hessian eigenvalue spectra), continual learning (Fisher-based regularization), and meta-learning |
| Natural gradient | K-FAC and Shampoo implement approximate natural gradient; used in LLM pre-training at Google (TPU-optimized K-FAC) |
| Trust region in RL | TRPO/PPO use trust region methods to constrain policy updates; direct precursor to RLHF stability |
| Conjugate gradient in optimization | Newton-CG underlies Hessian-free optimization; CG is used in solving the trust region subproblem |
| Line search in LBFGS | Wolfe conditions ensure the secant condition holds; critical for L-BFGS stability |
| Gradient checking | Essential for debugging custom autodiff implementations; every major framework implements it |
| Fisher information | Quantifies "how much information" each training example provides; central to active learning, Bayesian deep learning, and continual learning |
12. Conceptual Bridge
What you learned: This section bridges the gap between theoretical optimization (Section08) and practical computation. The key themes:
- Line search ensures convergence: Without proper step size selection (Armijo, Wolfe), gradient-based methods may diverge or stagnate. Adam uses a fixed learning rate with momentum - it lacks a formal convergence guarantee for general non-convex problems but works well in practice.
- Quasi-Newton approximates the second-order structure: BFGS/L-BFGS use past gradient differences to build up a Hessian approximation. Adam does this diagonally, using running second-moment estimates. Shampoo/K-FAC do it in a structured block form.
- Condition number determines convergence rate: For gradient descent on a quadratic with the optimal constant step, the convergence rate is exactly $\frac{\kappa - 1}{\kappa + 1}$ per iteration. Preconditioning (approximating the inverse Hessian) reduces the effective condition number and accelerates convergence.
- Floating-point limits gradient accuracy: For smooth functions, automatic differentiation gives near-exact gradients. Finite differences are limited to roughly $\epsilon_{\text{mach}}^{2/3}$ accuracy (centered). This matters for gradient checking and numerical Hessian computation.
NUMERICAL OPTIMIZATION IN CONTEXT
========================================================================
Section08 Optimization (theory) -> Section10-03 Numerical Optimization
+-- Gradient descent --> Line search, convergence guarantees
+-- Adam/momentum --> Diagonal quasi-Newton interpretation
+-- Convergence theory --> Condition number analysis
+-- Constrained opt. --> Projected gradient, ADMM
Section10-02 Numerical Linear Algebra -> Section10-03 Numerical Optimization
+-- CG method --> Newton-CG, Hessian-free opt
+-- Condition numbers --> GD convergence rate
+-- Iterative refinement --> Inexact Newton methods
+-- Sparse systems --> Large-scale quasi-Newton
========================================================================
<- Back to Section02 Numerical Linear Algebra | Next: Section04 Interpolation and Approximation ->
Appendix A: Wolfe Conditions - Detailed Analysis
A.1 Why the Armijo Condition Alone Is Insufficient
The Armijo condition prevents steps that are too long (the function must decrease by at least $c_1$ times the predicted linear decrease). But it doesn't prevent steps that are too short!
Example: Consider $f(x) = x^2$ at $x_0 = 1$ with direction $p = -1$ (a descent direction). The Armijo condition with $c_1 = 10^{-4}$ is satisfied for any sufficiently small $\alpha$ - even $\alpha = 10^{-12}$, for which the step is effectively zero and we make no progress.
The curvature condition prevents this: it requires $\nabla f(x + \alpha p)^T p \ge c_2\, \nabla f(x)^T p$, i.e., the slope at the new point must have increased (become less negative) by a definite amount. Tiny steps leave the slope essentially unchanged and therefore fail the condition, so the accepted step cannot be too short.
A.2 Wolfe Conditions Ensure BFGS Remains Well-Defined
The BFGS update requires $y_k^T s_k > 0$ to maintain positive definiteness of the Hessian approximation. The curvature condition guarantees this:
The curvature condition says $\nabla f(x_{k+1})^T p_k \ge c_2\, \nabla f(x_k)^T p_k$. Since $s_k = \alpha_k p_k$ with $\alpha_k > 0$, $p_k$ is a descent direction ($\nabla f(x_k)^T p_k < 0$), and $c_2 < 1$:
$y_k^T s_k = \alpha_k \big(\nabla f(x_{k+1}) - \nabla f(x_k)\big)^T p_k \ge \alpha_k (c_2 - 1)\, \nabla f(x_k)^T p_k > 0.$
So $y_k^T s_k > 0$ - the curvature condition ensures the BFGS update remains positive definite.
Appendix B: L-BFGS Two-Loop Recursion - Derivation
The L-BFGS direction is $p_k = -H_k \nabla f_k$, where $H_k$ approximates the inverse Hessian. $H_k$ is defined implicitly by applying the last $m$ BFGS updates to an initial matrix $H_k^0$:
$H_k = \big(V_{k-1}^T \cdots V_{k-m}^T\big)\, H_k^0\, \big(V_{k-m} \cdots V_{k-1}\big) + \text{rank-one correction terms}$, where $V_i = I - \rho_i y_i s_i^T$ and $\rho_i = 1/(y_i^T s_i)$.
Expanding this product to form $H_k$ explicitly would cost $O(n^2)$ memory and work. The two-loop recursion instead computes the matrix-vector product $H_k \nabla f_k$ in $O(mn)$ operations without ever forming $H_k$.
Two-loop recursion (Nocedal, 1980):
First loop (backward, $i = k-1, \dots, k-m$): starting with $q = \nabla f_k$, compute $\alpha_i = \rho_i\, s_i^T q$ and update $q \leftarrow q - \alpha_i y_i$.
Middle: Scale by the initial Hessian approximation: $r = \gamma_k q$ with $\gamma_k = \dfrac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}$.
Second loop (forward, $i = k-m, \dots, k-1$): compute $\beta = \rho_i\, y_i^T r$ and update $r \leftarrow r + s_i(\alpha_i - \beta)$.
The result is $r = H_k \nabla f_k$; the L-BFGS search direction is $-r$.
Appendix C: Trust Region Subproblem - Eigenvalue Approach
The exact trust region subproblem:
$\min_{\|p\| \le \Delta} \; g^T p + \tfrac12 p^T B p$
is solved by the More-Sorensen algorithm (1983):
KKT conditions: The optimal $p^*$ satisfies, for some Lagrange multiplier $\lambda \ge 0$:
$(B + \lambda I)\, p^* = -g$, $\quad \lambda\,(\Delta - \|p^*\|) = 0$, $\quad B + \lambda I \succeq 0$.
Algorithm:
- If $\|B^{-1} g\| \le \Delta$ (Newton step within the trust region): $p^* = -B^{-1} g$, $\lambda = 0$
- Otherwise, solve for $\lambda > 0$ such that $\|p(\lambda)\| = \Delta$, where $p(\lambda) = -(B + \lambda I)^{-1} g$
Step 2 is a one-dimensional root-finding problem. Since $\|p(\lambda)\|$ is monotonically decreasing in $\lambda$ (for $\lambda > -\lambda_{\min}(B)$), it can be solved by Newton's method in 1D.
Cost: One factorization or eigendecomposition of $B$ ($O(n^3)$) plus a few 1D Newton iterations.
Appendix D: Gradient Checking Best Practices
D.1 What to Check
A gradient check compares the gradient from autograd with the finite difference approximation. As a rule of thumb (thresholds are indicative, for fp64 with centered differences), the relative error should be:
- $< 10^{-7}$: essentially correct (autograd is accurate to ~10 digits)
- $10^{-7}$ to $10^{-5}$: probably correct (some numerical noise)
- $10^{-5}$ to $10^{-3}$: potential bug in the gradient implementation
- $> 10^{-3}$: definite bug
D.2 Pitfalls
- Non-differentiable operations: ReLU at zero, abs(0), max(a, b) at the boundary - finite differences may give a subgradient, but not the same subgradient as autograd.
- Numerically sensitive operations: Softmax with very large logits, log of a tiny number - both methods may have errors here, but they may differ.
- Wrong random seed: Always check with a fixed random seed; bugs can be masked by lucky initialization.
- Checking only the gradient, not the Jacobian: For vector-valued functions, check the full Jacobian, not just a scalar projection.
D.3 Implementation
import numpy as np

def gradient_check(f, x, grad_f, eps=1e-5, rtol=1e-5, atol=1e-8):
"""
Compare analytical gradient to centered finite differences.
Returns max relative error across all coordinates.
"""
n = len(x.ravel())
g_analytic = grad_f(x).ravel()
g_numerical = np.zeros(n)
for i in range(n):
x_plus = x.copy().ravel()
x_minus = x.copy().ravel()
x_plus[i] += eps
x_minus[i] -= eps
g_numerical[i] = (f(x_plus.reshape(x.shape)) -
f(x_minus.reshape(x.shape))) / (2 * eps)
# Relative error (with absolute tolerance floor)
denom = np.maximum(np.abs(g_analytic), atol)
rel_err = np.abs(g_analytic - g_numerical) / denom
return rel_err.max(), rel_err
Appendix E: Adam - Complete Update Rule and Variants
E.1 Standard Adam
import numpy as np

def adam_step(g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
"""One step of Adam optimizer."""
m = beta1 * m + (1 - beta1) * g # First moment
v = beta2 * v + (1 - beta2) * g**2 # Second moment
m_hat = m / (1 - beta1**t) # Bias correction
v_hat = v / (1 - beta2**t) # Bias correction
delta = lr * m_hat / (np.sqrt(v_hat) + eps) # Update
return m, v, delta
E.2 AdamW - Weight Decay Decoupling
Standard Adam with L2 regularization: add $\frac{\lambda}{2}\|\theta\|^2$ to the loss, so the gradient becomes $g_t + \lambda\theta_t$ and the update is approximately:
$\theta_{t+1} = \theta_t - \alpha\, \dfrac{\hat m_t + \lambda \theta_t}{\sqrt{\hat v_t} + \epsilon}$.
The weight decay term is scaled by $1/(\sqrt{\hat v_t} + \epsilon)$ - large for coordinates with small gradient variance. AdamW decouples weight decay from the adaptive scaling:
$\theta_{t+1} = \theta_t - \alpha\, \dfrac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} - \alpha\, \lambda\, \theta_t$.
This is a better regularizer: all weights decay at the same rate, not faster for rarely-updated weights. A sketch comparing the two follows.
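A numpy sketch contrasting the two updates; the interface mirrors the adam_step function in E.1 above, and the weight-decay coefficient wd is an assumed hyperparameter:

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step: the weight decay bypasses the adaptive 1/sqrt(v) scaling."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * theta   # decoupled decay
    return theta, m, v

def adam_l2_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """Adam with L2 folded into the gradient: the decay gets rescaled by 1/sqrt(v)."""
    g = g + wd * theta                                      # L2 term enters the moment estimates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```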
E.3 AdaFactor - Memory-Efficient Adam
AdaFactor (Shazeer & Stern, 2018) reduces Adam's memory for second moments by storing only per-row and per-column statistics and approximating the second-moment matrix as a rank-1 outer product:
For a weight matrix $W \in \mathbb{R}^{m \times n}$ with second-moment matrix $V \in \mathbb{R}^{m \times n}$, use $V \approx \dfrac{r\, c^T}{\mathbf{1}^T r}$ with row sums $r = V\mathbf{1}$ and column sums $c = V^T\mathbf{1}$; this reduces memory from $O(mn)$ to $O(m + n)$.
Trade-off: Less accurate second moment estimate -> slightly worse convergence in practice, but viable for very large models where Adam's memory cost is prohibitive.
E.4 Shampoo - Block Preconditioning
Shampoo (Gupta et al., 2018) applies a full-matrix preconditioner to each layer:
For a layer weight $W \in \mathbb{R}^{m \times n}$ with gradient $G_t$, accumulate $L_t = L_{t-1} + G_t G_t^T$ and $R_t = R_{t-1} + G_t^T G_t$.
Update: $W_{t+1} = W_t - \alpha\, L_t^{-1/4}\, G_t\, R_t^{-1/4}$.
This approximates the natural gradient (full Fisher) with only $O(m^2 + n^2)$ memory per layer, vs $O(m^2 n^2)$ for the full Fisher. Shampoo is used in Google's LLM pre-training (via the distributed Shampoo implementation). A single-layer sketch follows.
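A single-matrix numpy sketch of the Shampoo update written above; the eigendecomposition-based matrix roots and the damping term eps are assumed implementation details (real implementations refresh the roots only periodically):

```python
import numpy as np

def matrix_power(M, p, eps=1e-6):
    """Symmetric matrix power M^p via eigendecomposition, damped for invertibility."""
    w, Q = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
    return (Q * w**p) @ Q.T

def shampoo_step(W, G, L, R, lr=1e-2):
    """One Shampoo step for a weight matrix W with gradient G."""
    L = L + G @ G.T                        # left (row-space) statistics,  m x m
    R = R + G.T @ G                        # right (column-space) statistics, n x n
    precond_grad = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W - lr * precond_grad, L, R
```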
Appendix F: Convergence Analysis Review
F.1 Gradient Descent Convergence
Smooth convex case ($f$ convex with $L$-Lipschitz gradient), step $\alpha = 1/L$:
$f(x_k) - f(x^*) \le \dfrac{L\,\|x_0 - x^*\|^2}{2k}$.
Strongly convex case ($f$ is $\mu$-strongly convex and $L$-smooth):
$\|x_k - x^*\|^2 \le \Big(1 - \dfrac{\mu}{L}\Big)^k \|x_0 - x^*\|^2$.
Convergence rate: $O(\kappa \log(1/\epsilon))$ iterations to reduce the error to $\epsilon$, with $\kappa = L/\mu$.
Non-convex case: $\min_{k \le K} \|\nabla f(x_k)\|^2 \le \dfrac{2L\,\big(f(x_0) - f^*\big)}{K}$.
F.2 Newton's Method Convergence
Near a non-degenerate minimum $x^*$:
$\|x_{k+1} - x^*\| \le \dfrac{M}{2\mu}\, \|x_k - x^*\|^2$,
where $M$ is the Lipschitz constant of $\nabla^2 f$ and $\mu$ is the smallest eigenvalue of the Hessian at $x^*$.
Basin of convergence: Newton's method converges from any $x_0$ satisfying $\|x_0 - x^*\| \lesssim \mu / M$.
F.3 BFGS Convergence
For strongly convex $f$ with Lipschitz-continuous Hessian, BFGS converges super-linearly:
$\lim_{k \to \infty} \dfrac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0$,
and is globally R-linearly convergent:
$\|x_k - x^*\| \le C\, r^k$ for some $C > 0$ and $r \in (0, 1)$.
Appendix G: Software Ecosystem
G.1 SciPy Optimize
from scipy.optimize import minimize, check_grad
# L-BFGS-B
result = minimize(f, x0, jac=grad_f, method='L-BFGS-B',
options={'maxiter': 1000, 'ftol': 1e-15, 'gtol': 1e-10})
# Newton-CG
result = minimize(f, x0, jac=grad_f, hess=hess_f, method='Newton-CG')
# Trust region methods
result = minimize(f, x0, jac=grad_f, hess=hess_f, method='trust-ncg')
result = minimize(f, x0, jac=grad_f, hess=hess_f, method='trust-exact')
# Gradient checking
err = check_grad(f, grad_f, x0) # Should be < 1e-5 for correct gradient
G.2 PyTorch Optimizers
import torch.optim as optim
# Standard first-order
sgd = optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=1e-4)
adam = optim.Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
adamw = optim.AdamW(params, lr=1e-3, weight_decay=0.01)
# Second-order
lbfgs = optim.LBFGS(params, lr=1, max_iter=20, history_size=100,
line_search_fn='strong_wolfe')
# Usage with closure (required for LBFGS and some line search methods)
def closure():
optimizer.zero_grad()
loss = criterion(model(X), y)
loss.backward()
return loss
optimizer.step(closure)
G.3 JAX Optimizers (Optax)
import jax
import optax
# Standard optimizers
optimizer = optax.adam(learning_rate=1e-3)
opt_state = optimizer.init(params)
# One step
grads = jax.grad(loss_fn)(params)
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
# Composing transformations
optimizer = optax.chain(
optax.clip_by_global_norm(1.0), # Gradient clipping
optax.scale_by_adam(), # Adam scaling
optax.scale(-lr), # Learning rate
)
Appendix H: Summary and Quick Reference
NUMERICAL OPTIMIZATION - ALGORITHM SELECTION GUIDE
========================================================================
STOCHASTIC (mini-batch, noisy gradients)
-----------------------------------------
Adam (default for LLMs/transformers)
AdamW (with weight decay - preferred for fine-tuning)
SGD + momentum (slower to converge, better generalization sometimes)
Adafactor (when memory is very constrained)
Shampoo/K-FAC (when second-order info is affordable)
FULL-BATCH (all data, low noise)
---------------------------------
L-BFGS (the gold standard for full-batch)
Nonlinear CG (memory efficient, similar to L-BFGS)
Newton-CG (when Hessian-vector products are cheap)
Trust-region (when Hessian is indefinite)
CONVERGENCE RATE COMPARISON (quadratic objective, kappa = 100)
---------------------------------------------------------
GD (optimal step): ~700 iters to 1e-6 (rate = (kappa-1)/(kappa+1) = 0.98)
CG: ~70 iters to 1e-6 (rate ~ (sqrt(kappa)-1)/(sqrt(kappa)+1) = 0.82)
L-BFGS (m=10): ~tens of iters to 1e-6 (super-linear behavior)
Newton: 1 iter (exact for a quadratic)
GRADIENT CHECKING STEP SIZE
----------------------------
Forward differences: h = sqrt(eps_mach) ~ 1.5e-8
Centered differences: h = cbrt(eps_mach) ~ 6e-6
Complex step: h = 1e-20 (essentially exact)
========================================================================
<- Back to Section02 Numerical Linear Algebra | Next: Section04 Interpolation and Approximation ->
Appendix I: Extended Case Studies
I.1 Why Adam Doesn't Use Line Search
Deterministic optimization (with exact gradients) benefits greatly from line search: at each step, we can evaluate as many times as needed to find a good step size. But stochastic optimization with mini-batches cannot use line search:
Problem 1: Inconsistent function values. The loss value computed on one mini-batch differs from the value computed on the next. We cannot meaningfully compare $f(x + \alpha p)$ to $f(x)$ if the batch changes between evaluations.
Problem 2: Gradient noise breaks the secant condition. The L-BFGS secant condition requires the gradient change $y_k$ to be consistent with the step $s_k$ (in particular $y_k^T s_k > 0$). With stochastic gradients, $y_k$ is corrupted by noise and the condition may fail.
Adam's solution: Instead of adapting the step size using extra function evaluations (line search), Adam adapts using the history of squared gradients ($v_t$). This acts as an "implicit line search" - steps are small in coordinates where gradients have been large historically, and large where gradients have been small.
I.2 When L-BFGS Beats Adam
Task: Minimize the loss of a 2-layer network on a small dataset (N=1000, d=100).
| Method | Iterations | Final loss | Time |
|---|---|---|---|
| Adam | 10,000 | higher | 45s |
| Adam (alternative learning rate) | 10,000 | higher | 45s |
| L-BFGS (full batch) | 150 | lowest | 12s |
L-BFGS reaches much lower loss faster when:
- Gradient is exact (full batch)
- The loss function is smooth (no noise)
- High accuracy is needed (physics simulation, calibration)
But for LLM pretraining (dataset size and parameter count both in the billions or more): L-BFGS is impractical. Adam is the right choice.
I.3 The Role of Condition Number in Transformer Training
During LLM pretraining, the condition number of the loss Hessian at different points in training:
- Early training: at random initialization, the layer-wise curvature is comparatively benign and gradients flow well.
- Mid training: parameters have adapted, and the condition number can grow by several orders of magnitude for some layers. This is where gradient clipping matters.
- Late training: near convergence, the local condition number stabilizes. The learning rate schedule (cosine decay, linear decay) effectively adapts the step size to the local curvature.
Weight decay effect on condition number: Adding L2 regularization with coefficient $\lambda$ shifts all Hessian eigenvalues by $\lambda$, changing the condition number from $\frac{\lambda_{\max}}{\lambda_{\min}}$ to $\frac{\lambda_{\max} + \lambda}{\lambda_{\min} + \lambda}$, which is smaller (better conditioned). This is one reason AdamW (with explicit weight decay) often trains more stably than Adam with L2 in the loss.
Appendix J: Numerical Analysis of Backpropagation
J.1 Floating-Point Error in Backprop
The backpropagation algorithm computes $\nabla_\theta L$ by the chain rule:
$\dfrac{\partial L}{\partial \theta_\ell} = \dfrac{\partial L}{\partial a_N}\, \dfrac{\partial a_N}{\partial a_{N-1}} \cdots \dfrac{\partial a_{\ell+1}}{\partial a_\ell}\, \dfrac{\partial a_\ell}{\partial \theta_\ell}$.
In floating point, each multiplication and activation application introduces $O(\epsilon_{\text{mach}})$ relative error. For an $N$-layer network, the accumulated gradient error is roughly:
$O\big(N\, \epsilon_{\text{mach}}\, \kappa_J\big)$,
where $\kappa_J$ is the condition number of the Jacobian of the network (which can be large for deep networks with poor initialization).
J.2 Mixed Precision and Gradient Accuracy
In bf16 training, gradients are computed in bf16 with $\epsilon_{\text{mach}} \approx 2^{-8} \approx 4 \times 10^{-3}$. This means:
- Individual gradient values carry only ~2-3 significant decimal digits
- Accumulated over many steps, this can introduce drift
Why it works despite low precision: The optimizer (Adam) effectively computes a smoothed gradient over many steps via the exponential moving average $m_t$. Noise in individual gradient estimates is averaged out: over an effective window of roughly $1/(1-\beta_1)$ steps, the averaged gradient is considerably more accurate than any single-step estimate.
J.3 Gradient Accumulation and Precision
When accumulating gradients over $K$ mini-batches before an update: $g_{\text{acc}} = \sum_{j=1}^{K} g_j$.
In fp32 with Kahan summation: error $O(\epsilon_{\text{mach}})$, independent of $K$. In bf16 with naive summation: error $O(K\, \epsilon_{\text{mach}})$ - grows with the number of accumulation steps.
Fix: Accumulate gradients in fp32, even if the forward/backward pass uses bf16. This is the default behavior in PyTorch AMP when scaler.scale(loss).backward() is used.
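A small torch experiment illustrating the accumulation-precision point; the synthetic per-microbatch gradients and the number of accumulation steps are assumed:

```python
import torch

torch.manual_seed(0)
K = 256                                                     # accumulation steps
grads = [torch.randn(10_000) * 1e-3 for _ in range(K)]      # synthetic microbatch gradients

ref = torch.zeros(10_000, dtype=torch.float64)              # high-precision reference accumulator
acc_fp32 = torch.zeros(10_000, dtype=torch.float32)
acc_bf16 = torch.zeros(10_000, dtype=torch.bfloat16)
for g in grads:
    ref += g.double()
    acc_fp32 += g
    acc_bf16 += g.bfloat16()

print((acc_fp32.double() - ref).abs().max())   # tiny: fp32 accumulation barely drifts
print((acc_bf16.double() - ref).abs().max())   # orders of magnitude larger: bf16 accumulation drifts
```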
Appendix K: Connection to Statistical Learning Theory
Numerical optimization connects to statistical learning theory through the bias-variance tradeoff:
Optimization error: $L_{\text{train}}(\hat\theta) - \min_\theta L_{\text{train}}(\theta)$ - how far we are from the training loss minimum.
Approximation error: $\min_{f \in \mathcal{F}} L(f) - L(f_{\text{Bayes}})$ - how far the best model in our function class is from the Bayes optimal.
Estimation error: $L_{\text{test}}(\hat\theta) - L_{\text{train}}(\hat\theta)$ (error on new data) - the generalization gap.
The over-optimization paradox: Running optimization to convergence (small optimization error) often leads to overfitting (large estimation error). This is why:
- We use early stopping (stop before convergence)
- We use weight decay (adds regularization, which changes the objective and its minimizer)
- We use dropout and other regularizers (change the optimization landscape)
Implicit regularization in SGD: Stochastic gradient descent, by virtue of its noise, converges to "flat minima" - regions where the Hessian has small eigenvalues. Such minima generalize better (Hochreiter & Schmidhuber, 1997; Keskar et al., 2016). This is NOT a numerical issue - it's a feature of the stochastic optimization process.
Appendix L: Historical Timeline of Optimization Methods
OPTIMIZATION METHODS: TIMELINE
========================================================================
1847 Cauchy - steepest descent method (gradient descent)
1937 Kaczmarz - projection-based iterative method
1944 Levenberg - damped least squares (LM algorithm)
1952 Hestenes & Stiefel - conjugate gradient method
1963 Marquardt - generalized LM algorithm
1966 Armijo - sufficient decrease condition
1969 Wolfe - curvature conditions for line search
1970 Broyden, Fletcher, Goldfarb, Shanno - the BFGS quasi-Newton update
1970 Sargent & Sebastian - first L-BFGS-like method
1980 Nocedal - limited-memory BFGS
1994 Byrd, Lu, Nocedal, Zhu - L-BFGS-B (bounds constrained)
1997 Hochreiter & Schmidhuber - flat minima and generalization
2010 Martens - Hessian-free optimization for deep nets
2011 Duchi et al. - Adagrad
2014 Kingma & Ba - Adam
2015 Bottou - stochastic optimization in ML survey
2016 Keskar et al. - sharp vs flat minima for generalization
2017 Loshchilov & Hutter - AdamW (decoupled weight decay)
2018 Gupta et al. - Shampoo (block-diagonal preconditioning)
2018 Anil et al. - Scalable second-order optimization
2021+ Vyas et al., Dettmers - training efficiency, mixed precision
========================================================================
Appendix M: Chapter Position and Summary
Section10 NUMERICAL METHODS - SECTION 03 SUMMARY
========================================================================
PREREQUISITE CHAIN:
Section01 Floating Point -> Section02 Numerical LA -> Section03 Numerical Optimization
CONNECTIONS FROM THIS SECTION:
+--------------------------------------------------------------+
| Line search (Wolfe) -> Used in L-BFGS, BFGS |
| Newton's method -> Quasi-Newton (BFGS, L-BFGS) |
| L-BFGS -> torch.optim.LBFGS |
| Trust region -> TRPO, PPO (RL policy updates) |
| Diagonal quasi-Newton -> Adam, AdaGrad, RMSProp |
| Natural gradient -> K-FAC, Shampoo |
| Condition numbers -> Section02 Numerical LA |
| Gradient checking -> Debugging autodiff implementations|
+--------------------------------------------------------------+
KEY FORMULAS:
GD convergence rate: (kappa-1)/(kappa+1) per iteration
CG convergence rate: (sqrt(kappa)-1)/(sqrt(kappa)+1) per iteration
Newton convergence: quadratic near the minimum
L-BFGS convergence: super-linear (between GD and Newton)
Optimal finite-diff h: cbrt(eps_mach) ~ 6e-6 (centered)
Finite-diff error: O(h^2) + O(eps_mach/h)
Complex-step error: O(h^2) - no cancellation error
========================================================================
Section statistics: 12 main sections, 22 subsections, 12 common mistakes, 10 exercises, appendices A-X.
<- Back to Section02 Numerical Linear Algebra | Next: Section04 Interpolation and Approximation ->
Appendix N: Additional Optimization Algorithms
N.1 Nonlinear Conjugate Gradient
The nonlinear conjugate gradient (NCG) method extends CG to nonlinear objectives by using the same direction-update structure but with gradient changes instead of exact Hessian-vector products.
The search directions are:
$p_k = -\nabla f_k + \beta_k\, p_{k-1}$,
where the scalar $\beta_k$ determines the variant of NCG:
| Formula | Name | Best for |
|---|---|---|
| $\beta_k^{FR} = \dfrac{\|\nabla f_k\|^2}{\|\nabla f_{k-1}\|^2}$ | Fletcher-Reeves | Smooth convex |
| $\beta_k^{PR} = \dfrac{\nabla f_k^T(\nabla f_k - \nabla f_{k-1})}{\|\nabla f_{k-1}\|^2}$ | Polak-Ribiere | General nonlinear |
| $\beta_k^{HS} = \dfrac{\nabla f_k^T(\nabla f_k - \nabla f_{k-1})}{(\nabla f_k - \nabla f_{k-1})^T p_{k-1}}$ | Hestenes-Stiefel | Non-convex |
NCG vs L-BFGS:
- NCG: $O(n)$ memory (just the previous direction)
- L-BFGS ($m$ stored pairs): $O(mn)$ memory, usually faster convergence
- NCG: simpler to implement, similar asymptotic convergence order
N.2 Stochastic Gradient Descent with Momentum
SGD + Momentum: $v_{t+1} = \mu\, v_t - \alpha \nabla f(x_t)$, $\quad x_{t+1} = x_t + v_{t+1}$.
Nesterov accelerated gradient (NAG): $v_{t+1} = \mu\, v_t - \alpha \nabla f(x_t + \mu v_t)$, $\quad x_{t+1} = x_t + v_{t+1}$.
Connection to CG: For quadratic objectives, the heavy-ball method (momentum SGD) with optimally tuned constants attains the same asymptotic convergence rate as CG. For non-quadratic objectives, momentum helps navigate ravines (valleys where the gradient is large perpendicular to the valley but small along it).
Optimal momentum for quadratics: $\mu^* = \Big(\dfrac{\sqrt\kappa - 1}{\sqrt\kappa + 1}\Big)^2$. With this momentum, gradient descent with momentum converges at rate $\dfrac{\sqrt\kappa - 1}{\sqrt\kappa + 1}$ per iteration (the CG rate), vs $\dfrac{\kappa - 1}{\kappa + 1}$ without momentum. A small demo follows.
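A numpy sketch comparing plain GD with heavy-ball momentum at these optimal constants on an assumed diagonal quadratic with kappa = 100:

```python
import numpy as np

lam = np.array([1.0, 100.0])                  # Hessian eigenvalues of f(x) = 0.5 x^T diag(lam) x
L, mu = lam.max(), lam.min()                  # kappa = 100

def run(momentum, n_iters=200):
    if momentum:
        alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu))**2
        beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu)))**2
    else:
        alpha, beta = 2.0 / (L + mu), 0.0
    x = x_prev = np.ones(2)
    for _ in range(n_iters):
        x, x_prev = x - alpha * lam * x + beta * (x - x_prev), x
    return np.linalg.norm(x)

print(run(momentum=False))   # ~1e-2: rate (kappa-1)/(kappa+1) = 0.98 per iteration
print(run(momentum=True))    # ~1e-14: rate ~ (sqrt(kappa)-1)/(sqrt(kappa)+1) = 0.82
```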
N.3 Stochastic Variance Reduction
For finite-sum objectives $f(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x)$, variance-reduced methods achieve better convergence than SGD by occasionally computing the full gradient:
SVRG (Johnson & Zhang, 2013):
for outer loop s = 1, 2, ..., S:
    compute full gradient mu = grad f(x_s)
    for inner loop t = 1, ..., m:
        sample i uniformly
        g_t = grad f_i(x_t) - grad f_i(x_s) + mu    # variance-reduced gradient
        x_{t+1} = x_t - alpha * g_t
    x_{s+1} = x_{t_m} (some iterate from the inner loop)
The key: $\mathbb{E}[g_t] = \nabla f(x_t)$ (unbiased), and the variance of $g_t$ vanishes as both $x_t$ and the snapshot $x_s$ approach $x^*$. SVRG achieves linear convergence (like GD) with per-iteration cost similar to SGD.
For AI: SVRG is rarely used for deep learning because the gradient computation involves complex non-linearities (making the variance reduction calculation non-trivial) and because large datasets make computing the full gradient expensive. Adam with decaying learning rate achieves comparable practical performance without the exact full gradient.
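A numpy sketch of SVRG on an assumed least-squares finite sum (toy data, step size, and loop lengths):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 10
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])         # gradient of 0.5*(a_i^T x - b_i)^2
full_grad = lambda x: A.T @ (A @ x - b) / N

def svrg(x0, step=0.01, n_outer=30, n_inner=2 * N):
    x_snap = x0.copy()
    for _ in range(n_outer):
        mu = full_grad(x_snap)                          # full gradient at the snapshot
        x = x_snap.copy()
        for _ in range(n_inner):
            i = rng.integers(N)
            g = grad_i(x, i) - grad_i(x_snap, i) + mu   # variance-reduced gradient
            x -= step * g
        x_snap = x
    return x_snap

x = svrg(np.zeros(d))
print(np.linalg.norm(full_grad(x)))   # small: near-linear convergence despite stochastic inner steps
```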
Appendix O: Extended Exercises
O.1 Implementing the Wolfe Zoom Algorithm (***)
The Wolfe zoom algorithm is the industry-standard line search for L-BFGS. Implement it from scratch:
def wolfe_zoom(f, g, x, p, alpha_lo, alpha_hi, phi_lo, phi0, dphi0,
c1=1e-4, c2=0.9, max_iter=20):
"""
Find alpha satisfying strong Wolfe conditions within [alpha_lo, alpha_hi].
phi_lo = f(x + alpha_lo * p), phi0 = f(x), dphi0 = g(x) @ p
"""
for i in range(max_iter):
# Cubic interpolation to find trial step
alpha_j = 0.5 * (alpha_lo + alpha_hi) # Simplified: bisection
phi_j = f(x + alpha_j * p)
if phi_j > phi0 + c1 * alpha_j * dphi0 or phi_j >= phi_lo:
alpha_hi = alpha_j
else:
dphi_j = g(x + alpha_j * p) @ p
if abs(dphi_j) <= -c2 * dphi0:
return alpha_j # Strong Wolfe satisfied
if dphi_j * (alpha_hi - alpha_lo) >= 0:
alpha_hi = alpha_lo
alpha_lo = alpha_j
phi_lo = phi_j
return alpha_j # Return best found
O.2 Numerical Gradient Check via PyTorch (*)
import torch
from torch.autograd import gradcheck
# Define a custom function
class MyFunc(torch.autograd.Function):
@staticmethod
def forward(ctx, x):
ctx.save_for_backward(x)
return x ** 3 + torch.sin(x)
@staticmethod
def backward(ctx, grad_output):
x, = ctx.saved_tensors
return grad_output * (3 * x**2 + torch.cos(x))
# Gradient check with double precision
x = torch.randn(5, dtype=torch.float64, requires_grad=True)
result = gradcheck(MyFunc.apply, x, eps=1e-6, atol=1e-4, rtol=1e-3)
print(f"Gradient check passed: {result}")
O.3 Computing Hessian Spectrum via Lanczos (***)
The Hessian spectrum of a trained neural network reveals the curvature structure of the loss landscape:
import numpy as np
import torch

def hessian_spectrum_lanczos(loss, params, n_iters=100):
"""
Compute the extremal eigenvalues of the Hessian via Lanczos iteration.
Each iteration requires one Hessian-vector product (one backward pass).
"""
d = sum(p.numel() for p in params)
# Initial vector
v = [torch.randn_like(p) for p in params]
v_norm = sum(torch.sum(vi**2) for vi in v).sqrt()
v = [vi / v_norm for vi in v]
alphas, betas = [], []
for i in range(n_iters):
# Hessian-vector product
Hv = hvp(loss, params, v)
alpha = sum(torch.sum(vi * Hvi) for vi, Hvi in zip(v, Hv)).item()
alphas.append(alpha)
# Orthogonalize
w = [Hvi - alpha * vi for Hvi, vi in zip(Hv, v)]
if i > 0:
w = [wi - beta * vi_old for wi, vi_old in zip(w, v_old)]
beta = sum(torch.sum(wi**2) for wi in w).sqrt().item()
betas.append(beta)
v_old = v
v = [wi / beta for wi in w]
# The Lanczos tridiagonal matrix T has diag=alphas, off-diag=betas
T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
eigvals = np.linalg.eigvalsh(T)
return eigvals
Appendix P: Key Theorems Reference
P.1 Convergence Theorems
Theorem 1 (Gradient Descent): For $L$-smooth $f$ with step $\alpha = 1/L$:
- Convex: $f(x_k) - f^* \le \dfrac{L\,\|x_0 - x^*\|^2}{2k}$
- $\mu$-strongly convex: $\|x_k - x^*\|^2 \le \big(1 - \mu/L\big)^k\, \|x_0 - x^*\|^2$
Theorem 2 (CG): For SPD $A$ with condition number $\kappa$: $\|x_k - x^*\|_A \le 2\Big(\dfrac{\sqrt\kappa - 1}{\sqrt\kappa + 1}\Big)^k \|x_0 - x^*\|_A$
Theorem 3 (Newton): Near a non-degenerate minimum, Newton's method converges quadratically: $\|x_{k+1} - x^*\| \le C\,\|x_k - x^*\|^2$
Theorem 4 (BFGS): For strongly convex $f$ with Lipschitz Hessian, BFGS converges super-linearly.
Theorem 5 (Zoutendijk): For descent directions with a Wolfe line search: $\sum_k \cos^2\theta_k\, \|\nabla f_k\|^2 < \infty$
P.2 Optimality Conditions
First-order necessary: $\nabla f(x^*) = 0$
Second-order sufficient: $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$ (positive definite Hessian)
KKT conditions (constrained): See Section08-04 Constrained Optimization
End of Section03 Numerical Optimization
<- Back to Section02 Numerical Linear Algebra | Next: Section04 Interpolation and Approximation ->
Appendix Q: Deep Dive - The Rosenbrock Function
The Rosenbrock function is the canonical test for optimization algorithms:
$f(x) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2$
Properties:
- Minimum at $x^* = (1, 1)$ with $f(x^*) = 0$
- The valley is nearly flat along the curve $x_2 = x_1^2$ but steep perpendicular to it
- Condition number of the Hessian at the minimum: $\approx 2508$ (for the standard form)
- Not convex globally, but has a unique global minimum
Hessian at the minimum $x^* = (1, 1)$:
$\nabla^2 f(x^*) = \begin{pmatrix} 802 & -400 \\ -400 & 200 \end{pmatrix}$
Eigenvalues: $\lambda_{\max} \approx 1001.6$, $\lambda_{\min} \approx 0.4$.
Performance comparison on Rosenbrock:
| Method | Iterations to |
|---|---|
| Gradient descent (optimal ) | |
| CG (Fletcher-Reeves) | |
| L-BFGS () | |
| Newton (with line search) | |
| BFGS |
Why it's hard for gradient descent: The "banana-shaped" valley means gradient descent oscillates across the valley and makes tiny progress along the valley. Line search helps but doesn't fix the fundamental curvature mismatch.
Why L-BFGS works well: After a few steps, L-BFGS builds up good curvature information along the valley direction, allowing it to take large steps along the valley.
Q.1 Numerical Analysis of Rosenbrock
import numpy as np
def rosenbrock(x):
return 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2
def rosenbrock_grad(x):
g = np.zeros(2)
g[0] = -400 * x[0] * (x[1] - x[0]**2) - 2 * (1 - x[0])
g[1] = 200 * (x[1] - x[0]**2)
return g
def rosenbrock_hess(x):
H = np.zeros((2, 2))
H[0, 0] = -400 * (x[1] - x[0]**2) + 800 * x[0]**2 + 2
H[0, 1] = -400 * x[0]
H[1, 0] = -400 * x[0]
H[1, 1] = 200
return H
# Condition number at the minimum
H_star = rosenbrock_hess(np.array([1.0, 1.0]))
eigvals = np.linalg.eigvalsh(H_star)
kappa = eigvals.max() / eigvals.min()
print(f"kappa(H*) = {kappa:.2f}")
# Expected: ~2508
Appendix R: Optimization Landscape Analysis for LLMs
Recent research (2020-2024) has revealed interesting properties of the optimization landscape for large language models:
R.1 The Edge of Stability
Edge of Stability (EoS) phenomenon (Cohen et al., 2021): During neural network training with a constant learning rate $\alpha$, the largest Hessian eigenvalue $\lambda_{\max}(\nabla^2 f)$ tends to increase until it reaches $\approx 2/\alpha$ and then hovers there.
Implication: The loss oscillates but decreases on average. The learning rate implicitly controls how much curvature the region of the loss landscape that training "settles into" can have.
Numerical analysis view: $\lambda_{\max} = 2/\alpha$ is the stability threshold for gradient descent (convergence on a quadratic requires $\alpha < 2/\lambda_{\max}$). The EoS phenomenon shows that neural networks often train right at the boundary of instability - aggressive but not diverging.
R.2 Catastrophic Forgetting and Flat Minima
Observation (Hochreiter & Schmidhuber, 1997; Keskar et al., 2016): Models trained with small batches (high gradient noise) tend to find "flatter" minima, where $\lambda_{\max}(\nabla^2 f)$ is small. Models trained with large batches find "sharper" minima with large $\lambda_{\max}$.
Flat minima generalize better because small perturbations to the weights (due to test distribution shift, quantization, etc.) cause smaller changes in the loss.
Numerical measure of sharpness: $\max_{\|\epsilon\| \le \rho} f(\theta + \epsilon) - f(\theta)$.
SAM (Sharpness-Aware Minimization, Foret et al., 2021) explicitly minimizes sharpness by finding an approximate worst-case perturbation and taking the gradient there: $\min_\theta \max_{\|\epsilon\| \le \rho} f(\theta + \epsilon)$, with $\epsilon^* \approx \rho\, \nabla f(\theta)/\|\nabla f(\theta)\|$.
This requires two gradient evaluations per step but consistently improves generalization.
R.3 Loss Landscape at Scale
For billion-parameter models (GPT-3, LLaMA, etc.), direct computation of the Hessian is impossible. Researchers use:
- Implicit Hessian-vector products (via Pearlmutter's trick) to estimate Hessian properties
- Lanczos iteration to find the leading eigenvalues of the Hessian
- Local quadratic approximation using finite differences around a trained model
Key findings (Ghorbani et al., 2019; Yao et al., 2020):
- The Hessian has a "bulk + outlier" structure: most eigenvalues are near zero (flat directions), with a few large eigenvalues (curved directions)
- The number of large eigenvalues grows with model capacity
- The condition number at convergence is surprisingly manageable for well-trained models (far smaller than naive worst-case expectations)
Appendix S: Practical Implementation Notes
S.1 Warm Starting L-BFGS
When fine-tuning a pre-trained model with L-BFGS, warm-start the optimizer by providing the final state from a previous run:
# Load pre-trained weights
model.load_state_dict(torch.load('pretrained.pt'))
# Create L-BFGS optimizer
optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0, history_size=20)
# Optionally load optimizer state for warm start
try:
optimizer.load_state_dict(torch.load('optimizer_state.pt'))
print("Warm-starting from previous optimizer state")
except FileNotFoundError:
print("Starting fresh L-BFGS")
S.2 Debugging Failed Optimization
Symptom: Loss not decreasing after several steps
- Check the gradient norm: `g_norm = sum(p.grad.norm()**2 for p in params) ** 0.5`
- If g_norm is tiny (e.g., below ~1e-7): you are already at a (local) minimum, or the learning rate is too small to make progress
- If g_norm is large: check for NaN gradients and verify the loss computation
Symptom: Loss oscillating or diverging
- Reduce learning rate by 10x
- Check the condition number: `kappa = max_sv / min_sv` (ratio of largest to smallest singular value) for key weight matrices
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
Symptom: L-BFGS says "function evaluation limit reached"
- Increase
max_evalparameter - Reduce model complexity or regularize more strongly
- Check if the closure is deterministic (L-BFGS requires consistent function values)
S.3 Hyperparameter Sensitivity
For Adam in practice:
- Learning rate $\alpha$: most sensitive; typical values $\sim 10^{-5}$ to $10^{-3}$ for LLMs
- $\beta_1$, $\beta_2$: largely insensitive; lowering $\beta_2$ (e.g., to 0.95) helps with very noisy gradients
- $\epsilon$: increase to $10^{-6}$ or larger for numerical stability in fp16/bf16
- Weight decay: treat as regularization, not a numerical parameter; values around 0.01-0.1 are common for large models
For L-BFGS:
- history_size: 10-100; larger = better curvature approximation but more memory
- max_iter: 5-20 L-BFGS iterations per optimizer step
- line_search_fn: always use 'strong_wolfe' for robustness
End of Section03 Numerical Optimization Notes (1600+ lines)
<- Section02 Numerical Linear Algebra | -> Section04 Interpolation and Approximation
Appendix T: Convergence Rate Summary Tables
T.1 First-Order Methods
| Method | Cost/Iter | Conv. Rate (smooth convex) | Conv. Rate (strongly convex, $\kappa = L/\mu$) |
|---|---|---|---|
| Gradient descent | 1 gradient | $O(1/k)$ | $(1 - 1/\kappa)^k$ |
| GD (optimal step) | 1 gradient | $O(1/k)$ | $\big(\tfrac{\kappa-1}{\kappa+1}\big)^k$ |
| GD + momentum | 1 gradient | $O(1/k^2)$ (Nesterov) | $\big(\tfrac{\sqrt\kappa-1}{\sqrt\kappa+1}\big)^k$ |
| CG | 1 matvec | - | $2\big(\tfrac{\sqrt\kappa-1}{\sqrt\kappa+1}\big)^k$ |
| SGD | 1 mini-batch gradient | $O(1/\sqrt{k})$ | $O(1/k)$ |
| Adam | 1 mini-batch gradient | - | Empirically fast |
T.2 Second-Order and Quasi-Newton Methods
| Method | Memory | Cost/Iter | Convergence |
|---|---|---|---|
| Newton | $O(n^2)$ | $O(n^3)$ | Quadratic near $x^*$ |
| Newton-CG | $O(n)$ | $O(n)$ per CG step | Super-linear |
| BFGS | $O(n^2)$ | $O(n^2)$ | Super-linear |
| L-BFGS | $O(mn)$ | $O(mn)$ | Super-linear locally |
| Diagonal precond. | $O(n)$ | $O(n)$ | Linear, rate $\tfrac{\tilde\kappa - 1}{\tilde\kappa + 1}$ |
where $\tilde\kappa$ is the condition number after diagonal preconditioning.
T.3 Trust Region vs Line Search
| Aspect | Line Search | Trust Region |
|---|---|---|
| Convergence guarantee | Yes (Wolfe) | Yes (ratio test) |
| Non-convex Hessian | Needs modification | Natural handling |
| Implementation complexity | Moderate | Higher |
| Typical algorithms | BFGS, L-BFGS | Dog-leg, GLTR |
Appendix U: Worked Numerical Example - L-BFGS Step
Setup: Minimize starting from .
Step 0: , . Since , initial direction .
Line search: . Minimize: .
After step 0: .
.
Update L-BFGS history:
Step 1 direction (L-BFGS, ):
First loop:
Scale:
Second loop:
Direction: .
This is much better aligned with the steepest descent direction scaled by approximate Hessian inverse: true Newton direction is , so within a scaling factor.
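A runnable numpy version of the same two-step computation on an assumed diagonal quadratic, reusing the lbfgs_direction function from Section 4.2:

```python
import numpy as np

# Assumed quadratic: f(x) = 0.5 * x^T diag(1, 10) x, minimum at the origin
H = np.diag([1.0, 10.0])
grad = lambda x: H @ x

# Step 0: steepest descent with exact line search
x0 = np.array([1.0, 1.0])
g0 = grad(x0)
alpha0 = (g0 @ g0) / (g0 @ H @ g0)        # exact step along p0 = -g0
x1 = x0 - alpha0 * g0
g1 = grad(x1)

# L-BFGS history from step 0
s0, y0 = x1 - x0, g1 - g0

# Step 1 direction from the two-loop recursion (m = 1), vs the true Newton direction
p1 = lbfgs_direction(g1, [s0], [y0])      # from Section 4.2
newton = -np.linalg.solve(H, g1)
print(p1 / np.linalg.norm(p1))
print(newton / np.linalg.norm(newton))    # nearly identical: the history captures the curvature
```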
<- Section02 Numerical Linear Algebra | -> Section04 Interpolation and Approximation
Appendix V: Extended Exercises with Solutions
V.1 Verifying Adam's Diagonal Preconditioning (**)
Problem: On the quadratic $f(x) = \tfrac12 x^T \operatorname{diag}(\lambda)\, x$ with widely spread eigenvalues $\lambda_i$:
(a) Argue that after many steps of Adam (small mini-batches with per-coordinate gradient noise), the second moment satisfies $v_i \approx \mathbb{E}[g_i^2]$.
(b) Show that Adam's effective step in coordinate $i$ is approximately:
$\Delta\theta_i \approx -\alpha\, \dfrac{g_i}{\sqrt{\mathbb{E}[g_i^2]} + \epsilon}$.
This is gradient descent with a per-coordinate step size $\alpha / (\sqrt{\mathbb{E}[g_i^2]} + \epsilon)$ in coordinate $i$.
(c) Compare to Newton: the Newton step in coordinate $i$ is $g_i / \lambda_i$, whereas Adam divides by a quantity that only tracks the square root of the curvature information - a "half-Newton" step.
This is why Adam is not full Newton: Adam uses $\operatorname{diag}(\sqrt{v})$ as the preconditioner, not the Hessian diagonal $\operatorname{diag}(\lambda)$. For a quadratic, the optimal diagonal preconditioner is the Hessian diagonal (Newton), not its square root.
Consequence for convergence: With Adam on such a quadratic:
- Gradient descent rate: $\tfrac{\kappa - 1}{\kappa + 1}$, where $\kappa = \lambda_{\max}/\lambda_{\min}$
- Adam rate: roughly that of a problem with effective condition number $\sqrt\kappa$ - better, but not optimal
- Optimal diagonal (Newton) preconditioning: effective condition number 1, independent of $\kappa$
V.2 Finite Difference Gradient Check (*)
Implement a gradient check for a custom loss function:
import numpy as np

def custom_loss(W, X, y):
"""Binary cross-entropy with sigmoid activation."""
z = X @ W
p = 1 / (1 + np.exp(-z))
return -np.mean(y * np.log(p + 1e-12) + (1-y) * np.log(1 - p + 1e-12))
def custom_grad(W, X, y):
"""Analytical gradient."""
z = X @ W
p = 1 / (1 + np.exp(-z))
return X.T @ (p - y) / len(y)
# Test
n, d = 100, 5
X = np.random.randn(n, d)
y = (np.random.randn(n) > 0).astype(float)
W = np.random.randn(d)
# Numerical gradient
h = 1e-5
g_num = np.zeros(d)
for i in range(d):
W_plus = W.copy(); W_plus[i] += h
W_minus = W.copy(); W_minus[i] -= h
g_num[i] = (custom_loss(W_plus, X, y) - custom_loss(W_minus, X, y)) / (2*h)
g_ana = custom_grad(W, X, y)
rel_err = np.abs(g_num - g_ana).max() / (np.abs(g_ana).max() + 1e-12)
print(f"Max relative error: {rel_err:.2e}") # Should be < 1e-5
V.3 Trust Region Radius Adaptation (**)
Implement the trust region ratio test and radius update:
import numpy as np

def trust_region_update(f, x, p, delta, f_k, m_decrease):
"""Update trust region radius based on ratio test."""
f_new = f(x + p)
actual = f_k - f_new
rho = actual / m_decrease # Ratio: actual/predicted decrease
# Accept or reject step
x_new = x + p if rho > 0.1 else x
# Update radius
if rho < 0.25:
delta_new = 0.25 * delta
elif rho > 0.75 and abs(np.linalg.norm(p) - delta) < 1e-8:
delta_new = min(2 * delta, 1e6) # Cap at maximum
else:
delta_new = delta
return x_new, delta_new, rho
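A hedged usage sketch on a 2-D quadratic, pairing trust_region_update with the Cauchy point as a deliberately simple subproblem solver; m_decrease is the decrease predicted by the quadratic model $m(p) = f_k + g^\top p + \tfrac{1}{2} p^\top H p$:
```python
import numpy as np

H = np.array([[2.0, 0.0], [0.0, 10.0]])
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

x, delta = np.array([1.0, 1.0]), 1.0
for k in range(20):
    g = grad(x)
    # Cauchy point: minimize the quadratic model along -g within the trust region
    gHg = g @ H @ g
    tau = min(1.0, np.linalg.norm(g) ** 3 / (delta * gHg)) if gHg > 0 else 1.0
    p = -tau * delta / np.linalg.norm(g) * g
    m_decrease = -(g @ p + 0.5 * p @ H @ p)      # predicted decrease f_k - m(p)
    x, delta, rho = trust_region_update(f, x, p, delta, f(x), m_decrease)

print(x, f(x))  # should approach the minimizer at the origin
```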
Appendix W: Self-Assessment Checklist
After completing Section03, you should be able to:
Core algorithms:
- Implement backtracking Armijo line search
- Explain why the Wolfe curvature condition is needed for BFGS
- Implement the L-BFGS two-loop recursion
- Implement the trust region ratio test and radius update
- Compute numerical gradients and find the optimal step size
Conceptual understanding:
- Explain the connection between Adam and diagonal quasi-Newton methods
- Explain why L-BFGS doesn't work for stochastic optimization
- Explain why Newton's method is quadratically convergent
- State the convergence rate of CG vs gradient descent
- Explain when to use line search vs trust region
AI connections:
- Explain Adam's adaptive learning rate as diagonal preconditioning
- Explain why TRPO/PPO use trust region constraints
- Explain the role of the Wolfe line search in PyTorch's L-BFGS
- Describe Shampoo/K-FAC as block quasi-Newton methods
- Connect gradient clipping to projected gradient descent
Appendix X: GPU-Accelerated Optimization
X.1 Batch Size and Optimization Geometry
The batch size has a profound effect on optimization:
Large batch: The gradient approximates the full gradient with low variance. This enables line search, quasi-Newton methods, and faster convergence per epoch.
Small batch: High gradient noise. Adam's second moment estimate is noisy. The training trajectory is stochastic, which (counterintuitively) often improves generalization.
Linear scaling rule (Goyal et al., 2017): if you increase the batch size by a factor of $k$, increase the learning rate by the same factor $k$ (for SGD with momentum), and ramp the learning rate up over a warmup phase to keep early training stable; a square-root scaling ($\sqrt{k}$) is sometimes recommended for adaptive optimizers. This keeps the signal-to-noise ratio of the gradient estimates roughly constant.
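A minimal sketch of the rule (the helper name and warmup length are illustrative assumptions, not from any particular library):
```python
def scaled_lr(base_lr, base_batch, batch, step, warmup_steps=500):
    """Linear scaling rule with warmup: the target learning rate grows in
    proportion to the batch size and is ramped up linearly at the start."""
    target = base_lr * batch / base_batch
    warmup = min(1.0, (step + 1) / warmup_steps)
    return target * warmup

# e.g. base_lr = 0.1 tuned at batch 256; training at batch 2048 after warmup:
print(scaled_lr(0.1, 256, 2048, step=10_000))  # -> 0.8
```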
X.2 Gradient Accumulation as Larger Batch
For very large models that don't fit in GPU memory, gradient accumulation simulates a larger batch by summing gradients over multiple forward-backward passes before applying the optimizer update:
```python
accumulation_steps = 4
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        loss = model(batch) / accumulation_steps  # scale so the accumulated grad is the mean
    # bf16 shares fp32's exponent range, so no GradScaler / loss scaling is needed
    loss.backward()  # gradients accumulate into the parameter .grad buffers (fp32 for fp32 params)
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
```
Numerical consideration: Accumulating gradients in fp32 (not bf16) is critical. If the gradient accumulator itself is in bf16, the accumulated error grows linearly with accumulation steps.
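A toy illustration of this effect (the gradient value and step count are arbitrary assumptions): summing 1000 micro-batch "gradients" of 1e-3 in a bf16 accumulator versus an fp32 accumulator.
```python
import torch

g = 1e-3
acc_bf16 = torch.tensor(0.0, dtype=torch.bfloat16)
acc_fp32 = torch.tensor(0.0, dtype=torch.float32)
for _ in range(1000):
    acc_bf16 = acc_bf16 + torch.tensor(g, dtype=torch.bfloat16)
    acc_fp32 = acc_fp32 + torch.tensor(g, dtype=torch.float32)

# The exact sum is 1.0. The fp32 accumulator stays accurate; the bf16
# accumulator stalls once the increment drops below half an ulp of the
# running sum, so a large fraction of the "gradient" is silently lost.
print("bf16:", acc_bf16.item(), "fp32:", acc_fp32.item())
```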
X.3 ZeRO Optimization and Distributed Adam
ZeRO (Zero Redundancy Optimizer, Rajbhandari et al., 2020) shards the optimizer state across GPUs:
| ZeRO Stage | What's Sharded | Memory Savings |
|---|---|---|
| Stage 1 | Optimizer states ($m$, $v$ in Adam) | 4x |
| Stage 2 | + Gradients | 8x |
| Stage 3 | + Parameters | Scales linearly with the number of GPUs |
For a 7B parameter model with Adam (fp32 parameters and optimizer states; gradients and activations excluded):
- fp32 params: 28 GB
- fp32 Adam states ($m$, $v$): 56 GB
- Total: 84 GB of optimizer-related state - more than a single 80 GB A100 can hold without sharding
- With ZeRO-3 across 64 GPUs: ~1.3 GB per GPU
Numerical impact: ZeRO doesn't change the mathematical update - it just distributes the storage. The optimizer state is identical to non-distributed Adam.
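A back-of-the-envelope sketch (the helper below is illustrative, not a DeepSpeed API) reproducing the per-GPU arithmetic above: fp32 parameters plus Adam's $m$ and $v$, with optional sharding of each across GPUs.
```python
def adam_state_gb(n_params, n_gpus=1, shard_params=True, shard_states=True):
    """Approximate per-GPU memory (GB) for fp32 params + fp32 Adam m, v.
    Gradients and activations are excluded, matching the figures above."""
    bytes_params = 4 * n_params
    bytes_states = 2 * 4 * n_params          # m and v, fp32
    per_gpu = (bytes_params / (n_gpus if shard_params else 1)
               + bytes_states / (n_gpus if shard_states else 1))
    return per_gpu / 1e9

print(adam_state_gb(7e9))                                  # ~84 GB, single GPU, no sharding
print(adam_state_gb(7e9, n_gpus=64))                       # ZeRO-3-like: ~1.3 GB per GPU
print(adam_state_gb(7e9, n_gpus=64, shard_params=False))   # ZeRO-1-like: ~28.9 GB per GPU
```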
Appendix Y: Connections to Bayesian Optimization
Bayesian optimization is a black-box optimization method that uses a probabilistic model (usually a Gaussian Process) to model the objective function and selects the next evaluation point by maximizing an acquisition function.
Connection to numerical optimization:
- The acquisition function maximization is itself a classical optimization problem (e.g., L-BFGS on the acquisition function)
- The surrogate model (GP) relies on a Cholesky factorization of the kernel matrix, which must be numerically stable
- Ill-conditioning of the kernel matrix -> add jitter to the diagonal (numerical regularization); see the sketch below
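A minimal sketch of the jitter strategy (the RBF kernel, length scale, and helper name are illustrative assumptions):
```python
import numpy as np

def stable_cholesky(K, jitter=1e-10, max_tries=8):
    """Cholesky with increasing diagonal jitter until the factorization succeeds."""
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K + jitter * np.eye(len(K))), jitter
        except np.linalg.LinAlgError:
            jitter *= 10  # increase the regularization and retry
    raise np.linalg.LinAlgError("kernel matrix is too ill-conditioned")

X = np.linspace(0, 1, 50)[:, None]
K = np.exp(-0.5 * ((X - X.T) / 0.3) ** 2)   # RBF kernel: nearly rank-deficient
L, jitter = stable_cholesky(K)
print("jitter used:", jitter, "| max reconstruction error:", np.abs(L @ L.T - K).max())
```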
For AI hyperparameter tuning: Bayesian optimization is used to tune:
- Learning rate, batch size, weight decay
- Architecture hyperparameters (number of layers, hidden dimensions)
- Training duration (early stopping threshold)
Tools: Optuna (Python), GPyOpt, BoTorch (PyTorch-native).
Appendix Z: Final Notes
This section covered:
- Line search methods (Armijo, Wolfe, backtracking) - essential for convergence guarantees
- Newton's method - quadratic convergence, modified Newton for non-convex settings
- Quasi-Newton methods (BFGS, L-BFGS) - practical second-order optimization
- Trust region methods - alternative to line search for non-convex landscapes
- Numerical differentiation - finite differences, optimal step size, complex-step
- Constrained optimization - projected gradient, ADMM
- AI applications - Adam as diagonal quasi-Newton, L-BFGS for full-batch training, Hessian-free (HF) optimization
Key takeaways:
- Adam is not an arbitrary heuristic: it approximates diagonal preconditioning of the gradient
- L-BFGS is the gold standard for full-batch optimization but fails for stochastic settings
- The condition number of the loss Hessian fundamentally limits gradient descent convergence
- Trust region methods (TRPO, PPO) provide a sound theoretical framework for policy gradient RL
- Gradient checking is essential for debugging custom autodiff implementations
Notes length: ~2000 lines. Companion: theory.ipynb (50+ cells) | exercises.ipynb (10 exercises)
Appendix AA: Worked Examples
AA.1 Computing the Optimal Step Size for Gradient Descent
Problem: Minimize $f(x) = x_1^2 + 50\, x_2^2$ with gradient descent. Find the optimal fixed step size.
Analysis:
- Hessian: $H = \operatorname{diag}(2, 100)$
- $\mu = \lambda_{\min} = 2$, $L = \lambda_{\max} = 100$, $\kappa = L/\mu = 50$
- Optimal step: $\eta^* = \frac{2}{\mu + L} = \frac{2}{102} \approx 0.0196$
- Convergence rate: $\frac{\kappa - 1}{\kappa + 1} = \frac{49}{51} \approx 0.9608$ per iteration
- Iterations to reduce the error by a factor of $10^{-6}$: $k \approx \frac{\ln 10^6}{\ln(51/49)} \approx 345$
Python verification:
```python
import numpy as np

H_diag = np.array([2., 100.])           # Hessian diagonal
kappa = 50                              # condition number 100 / 2
eta = 2 / (100 + 2)                     # optimal fixed step 2 / (L + mu)
rate = (kappa - 1) / (kappa + 1)        # theoretical per-iteration contraction

x = np.array([1., 1.])
errors = [np.linalg.norm(x)]
for _ in range(500):
    g = H_diag * x                      # gradient of f(x) = x1^2 + 50 x2^2
    x -= eta * g
    errors.append(np.linalg.norm(x))

# Check: errors should decrease geometrically at rate ~0.961
actual_rate = errors[10] / errors[9]
print(f"Theoretical rate: {rate:.4f}")
print(f"Actual rate: {actual_rate:.4f}")  # should match
```
AA.2 L-BFGS on the Rosenbrock Function
```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    return 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2

def rosenbrock_grad(x):
    return np.array([-400 * x[0] * (x[1] - x[0]**2) - 2 * (1 - x[0]),
                     200 * (x[1] - x[0]**2)])

x0 = np.array([-1.0, 1.0])
result = minimize(rosenbrock, x0, jac=rosenbrock_grad, method='L-BFGS-B',
                  options={'maxiter': 1000, 'gtol': 1e-12})
print(f"Solution: {result.x}")           # should be [1., 1.]
print(f"Function value: {result.fun}")   # should be ~0
print(f"Iterations: {result.nit}")       # typically ~30-50
```
Appendix BB: Glossary
| Term | Definition |
|---|---|
| Armijo condition | Sufficient decrease condition: $f(x_k + \alpha p_k) \le f(x_k) + c_1 \alpha \nabla f(x_k)^\top p_k$ |
| Backtracking line search | Start from a trial step and multiply it by $\beta \in (0, 1)$ until the Armijo condition is satisfied |
| BFGS | Quasi-Newton method that builds an inverse Hessian approximation from $(s_k, y_k)$ pairs |
| Conjugate direction | $p_i, p_j$ are $A$-conjugate if $p_i^\top A p_j = 0$ for $i \ne j$ |
| Curvature condition | $\nabla f(x_k + \alpha p_k)^\top p_k \ge c_2 \nabla f(x_k)^\top p_k$; ensures the step is not too small and that $y_k^\top s_k > 0$, required for BFGS stability |
| Descent direction | $p$ such that $\nabla f(x)^\top p < 0$ |
| Fletcher-Reeves | NCG formula: $\beta_k^{FR} = \frac{\|\nabla f_{k+1}\|^2}{\|\nabla f_k\|^2}$ |
| L-BFGS | BFGS with limited memory ($m$ vector pairs); cost $O(mn)$ per iteration |
| Secant condition | $B_{k+1} s_k = y_k$; ensures the Hessian approximation matches the observed curvature |
| Super-linear convergence | $\|x_{k+1} - x^*\| / \|x_k - x^*\| \to 0$; faster than linear, slower than quadratic |
| Trust region | Sphere of radius $\Delta_k$ within which the quadratic model is trusted |
| Two-loop recursion | Algorithm for efficiently computing the L-BFGS direction $-H_k \nabla f_k$ from the stored $(s_i, y_i)$ pairs |
| Wolfe conditions | Armijo + curvature conditions; standard for line search in quasi-Newton methods |
Appendix CC: Further Reading
Books
- Nocedal & Wright, "Numerical Optimization" (2nd ed., 2006) - The definitive reference. Chapters 2-7 cover all algorithms in this section in depth.
- Boyd & Vandenberghe, "Convex Optimization" - Available free online. Rigorous treatment of convex optimization theory.
- Bertsekas, "Nonlinear Programming" - Comprehensive treatment including constrained methods.
Papers
- Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2014) - Original Adam paper
- Nocedal, "Updating quasi-Newton matrices with limited storage" (1980) - Original L-BFGS paper
- Martens, "Deep learning via Hessian-free optimization" (2010) - Hessian-free for neural networks
- Gupta, Koren, Singer, "Shampoo: Preconditioned stochastic tensor optimization" (2018) - Shampoo
- Foret et al., "Sharpness-Aware Minimization" (2021) - SAM optimizer
- Cohen et al., "Gradient descent on neural networks typically occurs at the edge of stability" (2021) - EoS phenomenon
Software
- scipy.optimize - Reference implementations of L-BFGS-B, Newton-CG, trust region methods
- torch.optim - Adam, AdamW, LBFGS, SGD with momentum, Adagrad, RMSProp
- optax (JAX) - Composable optimizer building blocks
- bfgs (Julia) - Clean reference implementation of BFGS
- Optuna - Bayesian hyperparameter optimization
Appendix DD: Summary Statistics
Section03 NUMERICAL OPTIMIZATION - SUMMARY
========================================================================
Main sections: 12 (Intuition through Conceptual Bridge)
Subsections: 22
Common mistakes: 12
Exercises: 8 (* to ***)
Appendices: 32 (A through FF)
Algorithms covered: 15 (GD, CG, NCG, BFGS, L-BFGS, SR1, Newton,
inexact Newton, trust region, Cauchy point,
dogleg, Armijo, Wolfe zoom, ADMM, Adam)
AI connections: 10+ (Adam, Shampoo, K-FAC, L-BFGS, TRPO/PPO,
gradient clipping, SAM, natural gradient,
HF optimization, edge of stability)
Approximate notes length: 2000 lines
Theory cells: 50+ (planned)
Exercises: 10 graded problems (24+ cells)
========================================================================
Appendix EE: Numerical Experiments Catalog
EE.1 Comparing Optimizers on Standard Test Functions
The following test functions are standard benchmarks for optimization algorithms:
| Function | Formula | Dimension | Challenge |
|---|---|---|---|
| Rosenbrock | $\sum_{i=1}^{n-1}\left[100(x_{i+1} - x_i^2)^2 + (1 - x_i)^2\right]$ | $n \ge 2$ | Ill-conditioned valley |
| Rastrigin | $10n + \sum_{i=1}^{n}\left[x_i^2 - 10\cos(2\pi x_i)\right]$ | $n$ | Many local minima |
| Booth | $(x_1 + 2x_2 - 7)^2 + (2x_1 + x_2 - 5)^2$ | 2 | Simple quadratic |
| Beale | $(1.5 - x_1 + x_1 x_2)^2 + (2.25 - x_1 + x_1 x_2^2)^2 + (2.625 - x_1 + x_1 x_2^3)^2$ | 2 | Nonlinear |
| Himmelblau | $(x_1^2 + x_2 - 11)^2 + (x_1 + x_2^2 - 7)^2$ | 2 | Multiple global minima |
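Minimal NumPy definitions of these benchmarks (2-D versions, with an arbitrary starting point), ready to pass to scipy.optimize.minimize:
```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x): return 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2
def rastrigin(x):  return 20 + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))
def booth(x):      return (x[0] + 2*x[1] - 7)**2 + (2*x[0] + x[1] - 5)**2
def beale(x):      return ((1.5 - x[0] + x[0]*x[1])**2
                           + (2.25 - x[0] + x[0]*x[1]**2)**2
                           + (2.625 - x[0] + x[0]*x[1]**3)**2)
def himmelblau(x): return (x[0]**2 + x[1] - 11)**2 + (x[0] + x[1]**2 - 7)**2

x0 = np.array([-1.0, 1.0])
for name, fn in [("Rosenbrock", rosenbrock), ("Rastrigin", rastrigin),
                 ("Booth", booth), ("Beale", beale), ("Himmelblau", himmelblau)]:
    res = minimize(fn, x0, method="L-BFGS-B")   # finite-difference gradients
    print(f"{name:11s} x* = {res.x}, f* = {res.fun:.2e}")
# Note: from this start, a gradient method reaches only a local minimum of Rastrigin.
```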
EE.2 Expected Results
Running L-BFGS, gradient descent, and conjugate gradient on the Rosenbrock function ($n = 2$):
Starting from x0 = (-1, 0):
| Method | Iterations | Final Loss | Final Gradient Norm |
|---|---|---|---|
| GD (optimal) | ~10,000 | ~1e-5 | ~1e-3 |
| CG (FR) | ~500 | ~1e-8 | ~1e-4 |
| BFGS | ~50 | ~1e-10 | ~1e-5 |
| L-BFGS (m=10) | ~40 | ~1e-10 | ~1e-5 |
| Newton | ~20 | ~1e-12 | ~1e-6 |
These numbers are approximate and depend on the line search implementation and tolerance settings.
EE.3 Optimizer Sensitivity Analysis
For Adam on a quadratic loss with varying condition numbers:
| $\kappa(H)$ | Steps to $\varepsilon = 10^{-4}$ | Steps to $\varepsilon = 10^{-8}$ |
|---|---|---|
| 1 | ~100 | ~200 |
| 10 | ~200 | ~400 |
| 100 | ~600 | ~1200 |
| 1000 | ~2000 | ~4000 |
| 10000 | ~6000 | ~12000 |
Adam improves on vanilla gradient descent but still scales poorly with the condition number. The iteration count scales roughly as $O(\sqrt{\kappa})$ (from the square-root diagonal preconditioning, see V.1) vs $O(\kappa)$ for gradient descent.
Appendix FF: Practice Problems
1. Implement backtracking Armijo line search and test it on a smooth 2-D test function (e.g. the Rosenbrock function).
2. Derive the BFGS update from the minimal-change principle: minimize $\|H_{k+1} - H_k\|_W$ subject to $H_{k+1} = H_{k+1}^\top$ and the secant condition $H_{k+1} y_k = s_k$.
3. Show that for the quadratic $f(x) = \tfrac{1}{2} x^\top H x$ the optimal fixed step for gradient descent is $\eta^* = \frac{2}{\lambda_{\min}(H) + \lambda_{\max}(H)}$.
4. Implement the complex-step derivative for a smooth scalar function and verify it against the analytical derivative and a central finite difference.
5. Implement L-BFGS with a strong Wolfe line search for the Rosenbrock function and count the iterations needed to reach a small gradient norm.
6. Prove that if $B_k \succ 0$ and $y_k^\top s_k > 0$, then the BFGS update $B_{k+1} \succ 0$.
7. Implement the Cauchy point computation for the trust region subproblem on a 2-D quadratic.
8. Show that Adam converges to a critical point for convex $f$ with a decaying learning rate $\eta_t \propto 1/\sqrt{t}$, under standard bounded-gradient assumptions.
9. Implement gradient accumulation with fp32 accumulators and bf16 forward/backward passes, and verify that the accumulated gradient matches the full-batch gradient.
10. Implement ZeRO stage 1 (shard optimizer states across 2 processes using Python multiprocessing) and verify that both processes end up with the same parameters.
Section complete.
<- Section02 Numerical Linear Algebra | -> Section04 Interpolation and Approximation
End of file.