Second-Order Methods: 1. Intuition and 2. Formal Definitions
1. Intuition
This block develops intuition for Second-Order Methods. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
1.1 Why Second-Order Methods matters for training systems
In this section, damped Newton is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Why Second-Order Methods matters for training systems" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, damped Newton is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate theta_t, the gradient g_t = ∇f(theta_t), the curvature matrix H_t = ∇²f(theta_t), the step size eta_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where damped Newton can be computed directly and compared with theory.
- A logistic-regression or softmax objective where damped Newton affects optimization but the model remains interpretable.
- A transformer training diagnostic where damped Newton appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating damped Newton as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: the damped Newton update is theta_{t+1} = theta_t - eta_t * H_t^{-1} g_t, where g_t = ∇f(theta_t) is the gradient, H_t = ∇²f(theta_t) is the Hessian, and the damping factor eta_t in (0, 1] is set by a line search or a fixed schedule; eta_t = 1 recovers the pure Newton step.
Proof sketch or reasoning pattern:
Start with the local model around theta_t, isolate the term involving damped Newton, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
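As a concrete illustration, here is a minimal pure-Python sketch of damped Newton on a 1-D convex objective. The objective cosh(x), the starting point, and the backtracking damping rule are illustrative choices, not anything prescribed by this section.

```python
import math

def f(x):
    # a smooth, strictly convex 1-D test objective with minimizer at x = 0
    return math.cosh(x)

def damped_newton(x, steps=20):
    """Damped Newton in 1-D: x <- x - eta * f'(x)/f''(x),
    halving eta until the objective decreases (a backtracking damping rule)."""
    for _ in range(steps):
        g = math.sinh(x)        # f'(x)
        h = math.cosh(x)        # f''(x) > 0, so the Newton direction is a descent direction
        p = -g / h              # pure Newton step
        eta = 1.0
        while f(x + eta * p) > f(x):   # damp until the step actually decreases f
            eta *= 0.5
        x = x + eta * p
    return x

x_star = damped_newton(3.0)
```

Because cosh is strictly convex, the undamped step is accepted here; the backtracking loop matters on objectives where a full Newton step would overshoot.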
Implementation consequence:
- Log a metric that makes damped Newton visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about damped Newton is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
1.2 The optimization object: parameters, objective, algorithm, and diagnostic
In this section, modified Cholesky is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "The optimization object: parameters, objective, algorithm, and diagnostic" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, modified Cholesky is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate theta_t, the gradient g_t = ∇f(theta_t), the curvature matrix H_t = ∇²f(theta_t), the step size eta_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where modified Cholesky can be computed directly and compared with theory.
- A logistic-regression or softmax objective where modified Cholesky affects optimization but the model remains interpretable.
- A transformer training diagnostic where modified Cholesky appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating modified Cholesky as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: a modified Cholesky factorization computes H_t + E_t = L_t L_t^T, where H_t = ∇²f(theta_t) and E_t is a positive semidefinite correction kept as small as possible, often E_t = tau * I with tau increased until the factorization succeeds; the step then solves (H_t + E_t) p_t = -∇f(theta_t).
Proof sketch or reasoning pattern:
Start with the local model around theta_t, isolate the term involving modified Cholesky, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
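The following sketch illustrates the simplest modified-Cholesky idea on a 2x2 indefinite matrix: add tau * I and grow tau until a plain Cholesky factorization succeeds. The example matrix, the initial shift, and the doubling rule are illustrative assumptions.

```python
def chol2(a, b, c):
    # Cholesky factor of the symmetric 2x2 matrix [[a, b], [b, c]]; None on failure
    if a <= 0:
        return None
    l11 = a ** 0.5
    l21 = b / l11
    d = c - l21 * l21
    if d <= 0:
        return None
    return (l11, l21, d ** 0.5)

def modified_cholesky(a, b, c, tau0=1e-3):
    """Add tau * I to an indefinite 2x2 Hessian until Cholesky succeeds,
    growing the shift geometrically (the simplest modification scheme)."""
    tau = 0.0
    while chol2(a + tau, b, c + tau) is None:
        tau = max(2 * tau, tau0)
    return tau, chol2(a + tau, b, c + tau)

# indefinite example: the eigenvalues of [[1, 2], [2, 1]] are 3 and -1
tau, L = modified_cholesky(1.0, 2.0, 1.0)
```

A shift slightly larger than the magnitude of the most negative eigenvalue (here 1) is enough, which is what the geometric growth finds.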
Implementation consequence:
- Log a metric that makes modified Cholesky visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about modified Cholesky is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
1.3 Historical arc from classical optimization to modern AI
In this section, trust-region preview is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Historical arc from classical optimization to modern AI" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, trust-region preview is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate theta_t, the gradient g_t = ∇f(theta_t), the curvature matrix H_t = ∇²f(theta_t), the step size eta_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where trust-region preview can be computed directly and compared with theory.
- A logistic-regression or softmax objective where trust-region preview affects optimization but the model remains interpretable.
- A transformer training diagnostic where trust-region preview appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating trust-region preview as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: the trust-region subproblem is min_p m_t(p) = f(theta_t) + ∇f(theta_t)^T p + (1/2) p^T B_t p subject to ||p|| <= Delta_t, where B_t is a Hessian approximation and the radius Delta_t grows or shrinks according to how well m_t predicted the actual decrease in f.
Proof sketch or reasoning pattern:
Start with the local model around theta_t, isolate the term involving trust-region preview, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
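As a sketch of the algorithmic side, here is the standard trust-region radius update driven by the agreement ratio rho (actual reduction over model-predicted reduction); the thresholds 0.25 and 0.75 and the shrink/grow factors are conventional but adjustable choices.

```python
def trust_region_radius(rho, delta, step_norm, delta_max=10.0):
    """Standard trust-region radius update: shrink the region when the model
    predicts the objective poorly, grow it when agreement is very good and
    the accepted step pressed against the boundary."""
    if rho < 0.25:
        return 0.25 * delta                    # poor agreement: shrink
    if rho > 0.75 and abs(step_norm - delta) < 1e-12:
        return min(2.0 * delta, delta_max)     # good agreement at the boundary: grow
    return delta                               # otherwise leave the radius alone
```

A usage example: a step with rho = 0.1 quarters the radius, while rho = 0.9 on a boundary step doubles it (up to delta_max).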
Implementation consequence:
- Log a metric that makes trust-region preview visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about trust-region preview is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
1.4 What this section treats as canonical scope
In this section, Gauss-Newton is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "What this section treats as canonical scope" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, Gauss-Newton is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate theta_t, the gradient g_t = ∇f(theta_t), the curvature matrix H_t = ∇²f(theta_t), the step size eta_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where Gauss-Newton can be computed directly and compared with theory.
- A logistic-regression or softmax objective where Gauss-Newton affects optimization but the model remains interpretable.
- A transformer training diagnostic where Gauss-Newton appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating Gauss-Newton as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: for a least-squares objective f(theta) = (1/2) ||r(theta)||^2 with residual vector r and Jacobian J_t = ∂r/∂theta evaluated at theta_t, the Gauss-Newton step solves (J_t^T J_t) p_t = -J_t^T r(theta_t); the full Hessian J_t^T J_t + sum_i r_i(theta_t) ∇²r_i(theta_t) is approximated by its first term.
Proof sketch or reasoning pattern:
Start with the local model around theta_t, isolate the term involving Gauss-Newton, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
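To make this concrete, here is a minimal Gauss-Newton loop for a one-parameter exponential model fit to noiseless synthetic data; the model y = exp(theta * x), the data points, and the step count are illustrative assumptions.

```python
import math

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.exp(0.5 * x) for x in xs]   # noiseless data with true theta = 0.5

def gauss_newton(theta, steps=30):
    """Gauss-Newton for the scalar model y = exp(theta * x):
    each step solves (J^T J) p = -J^T r, dropping the residual-times-
    second-derivative term of the true Hessian."""
    for _ in range(steps):
        r = [math.exp(theta * x) - y for x, y in zip(xs, ys)]   # residuals
        J = [x * math.exp(theta * x) for x in xs]               # d r_i / d theta
        JtJ = sum(j * j for j in J)
        Jtr = sum(j * ri for j, ri in zip(J, r))
        theta -= Jtr / JtJ
    return theta

theta_hat = gauss_newton(0.0)   # recovers a value close to the true theta = 0.5
```

Because the residuals vanish at the solution, the dropped Hessian term is zero there and Gauss-Newton converges quadratically on this example.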
Implementation consequence:
- Log a metric that makes Gauss-Newton visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about Gauss-Newton is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
1.5 A first mental model for LLM training
In this section, Levenberg-Marquardt is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "A first mental model for LLM training" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, Levenberg-Marquardt is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate theta_t, the gradient g_t = ∇f(theta_t), the curvature matrix H_t = ∇²f(theta_t), the step size eta_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where Levenberg-Marquardt can be computed directly and compared with theory.
- A logistic-regression or softmax objective where Levenberg-Marquardt affects optimization but the model remains interpretable.
- A transformer training diagnostic where Levenberg-Marquardt appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating Levenberg-Marquardt as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: the Levenberg-Marquardt step solves (J_t^T J_t + lambda_t I) p_t = -J_t^T r(theta_t), where the damping lambda_t is decreased after a successful step and increased after a rejected one; lambda_t -> 0 recovers Gauss-Newton, while large lambda_t gives a short gradient-descent-like step.
Proof sketch or reasoning pattern:
Start with the local model around theta_t, isolate the term involving Levenberg-Marquardt, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
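Here is a minimal scalar Levenberg-Marquardt sketch on a one-parameter exponential fit; the damping schedule (halve lambda on an accepted step, double it on a rejected one) is one common heuristic among several, and the model and data are illustrative.

```python
import math

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.exp(0.5 * x) for x in xs]           # noiseless data, true theta = 0.5

def loss(theta):
    return 0.5 * sum((math.exp(theta * x) - y) ** 2 for x, y in zip(xs, ys))

def levenberg_marquardt(theta, lam=1.0, steps=50):
    """Scalar Levenberg-Marquardt: solve (J^T J + lam) p = -J^T r, then
    shrink lam after a successful step and grow it after a rejected one."""
    for _ in range(steps):
        r = [math.exp(theta * x) - y for x, y in zip(xs, ys)]
        J = [x * math.exp(theta * x) for x in xs]
        JtJ = sum(j * j for j in J)
        Jtr = sum(j * ri for j, ri in zip(J, r))
        p = -Jtr / (JtJ + lam)
        if loss(theta + p) < loss(theta):   # accepted: trust the model more
            theta += p
            lam *= 0.5
        else:                               # rejected: damp harder
            lam *= 2.0
    return theta

theta_hat = levenberg_marquardt(0.0)
```

The acceptance test makes the iteration monotone in the loss, which is the practical difference from the raw Gauss-Newton loop.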
Implementation consequence:
- Log a metric that makes Levenberg-Marquardt visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about Levenberg-Marquardt is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
2. Formal Definitions
This block develops formal definitions for Second-Order Methods. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
2.1 Primary definition: Hessian matrix
In this section, Gauss-Newton is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Primary definition: Hessian matrix" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, Gauss-Newton is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate theta_t, the gradient g_t = ∇f(theta_t), the curvature matrix H_t = ∇²f(theta_t), the step size eta_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where Gauss-Newton can be computed directly and compared with theory.
- A logistic-regression or softmax objective where Gauss-Newton affects optimization but the model remains interpretable.
- A transformer training diagnostic where Gauss-Newton appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating Gauss-Newton as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: the Hessian of f : R^d -> R at theta is the d x d matrix of second partial derivatives, [∇²f(theta)]_{ij} = ∂²f / ∂theta_i ∂theta_j; when f is twice continuously differentiable the Hessian is symmetric, and its eigenvalues measure curvature along the corresponding eigenvector directions.
Proof sketch or reasoning pattern:
Start with the local model around theta_t, isolate the term involving Gauss-Newton, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
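A quick way to make the Hessian tangible is to approximate it by central finite differences and check it against the analytic answer; the test function, evaluation point, and step size h below are illustrative choices.

```python
def f(x, y):
    # test function f(x, y) = x^2 y + y^3, with analytic Hessian [[2y, 2x], [2x, 6y]]
    return x * x * y + y ** 3

def fd_hessian(x, y, h=1e-4):
    """Central finite-difference Hessian of a scalar function of two variables."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return [[fxx, fxy], [fxy, fyy]]   # symmetric by construction

H = fd_hessian(1.0, 2.0)   # analytic Hessian at (1, 2) is [[4, 2], [2, 12]]
```

The step size trades truncation error against floating-point cancellation; h around 1e-4 is a reasonable compromise for this function at this point.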
Implementation consequence:
- Log a metric that makes Gauss-Newton visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about Gauss-Newton is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
2.2 Secondary definition: quadratic model
In this section, Levenberg-Marquardt is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Secondary definition: quadratic model" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, Levenberg-Marquardt is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate theta_t, the gradient g_t = ∇f(theta_t), the curvature matrix H_t = ∇²f(theta_t), the step size eta_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where Levenberg-Marquardt can be computed directly and compared with theory.
- A logistic-regression or softmax objective where Levenberg-Marquardt affects optimization but the model remains interpretable.
- A transformer training diagnostic where Levenberg-Marquardt appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating Levenberg-Marquardt as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: the second-order Taylor model of f around theta_t is m_t(p) = f(theta_t) + ∇f(theta_t)^T p + (1/2) p^T ∇²f(theta_t) p, with error O(||p||^3) for three-times differentiable f; when the Hessian is positive definite, the model's unique minimizer is the Newton step.
Proof sketch or reasoning pattern:
Start with the local model around theta_t, isolate the term involving Levenberg-Marquardt, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
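The accuracy claim behind the quadratic model can be checked numerically in one dimension: for f(x) = exp(x) at x = 0, the model error at displacement p should be close to the cubic Taylor term p^3 / 6. The following minimal sketch does exactly that; the function and displacement are illustrative choices.

```python
import math

def quadratic_model(fx, gx, hx, p):
    """Second-order Taylor model m(p) = f + g p + (1/2) h p^2 in one dimension."""
    return fx + gx * p + 0.5 * hx * p * p

# f(x) = exp(x) at x = 0 has f = f' = f'' = 1, so the model error is ~ p^3 / 6
p = 0.1
err = math.exp(p) - quadratic_model(1.0, 1.0, 1.0, p)
```

The residual between err and p^3 / 6 is itself of order p^4, which is why halving p should shrink the model error roughly eightfold.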
Implementation consequence:
- Log a metric that makes Levenberg-Marquardt visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about Levenberg-Marquardt is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
2.3 Algorithmic object: Newton step
In this section, the secant equation is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Algorithmic object: Newton step" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the secant equation is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate theta_t, the gradient g_t = ∇f(theta_t), the curvature matrix H_t = ∇²f(theta_t), the step size eta_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the secant equation can be checked directly and compared with theory.
- A logistic-regression or softmax objective where the secant equation affects optimization but the model remains interpretable.
- A transformer training diagnostic where the secant equation appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the secant equation as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: the secant equation B_{t+1} s_t = y_t, with s_t = theta_{t+1} - theta_t and y_t = ∇f(theta_{t+1}) - ∇f(theta_t), asks the new curvature approximation B_{t+1} to reproduce the most recent observed gradient change; quasi-Newton methods choose B_{t+1} among the matrices satisfying it.
Proof sketch or reasoning pattern:
Start with the local model around theta_t, isolate the term involving the secant equation, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
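The secant equation can be verified exactly on a quadratic objective, where the gradient change equals the true Hessian times the step; the objective, iterates, and Hessian below are illustrative choices.

```python
def grad(x, y):
    # gradient of f(x, y) = x^2 + x*y + 2*y^2, whose Hessian is [[2, 1], [1, 4]]
    return (2 * x + y, x + 4 * y)

# two iterates and the induced secant pair (s, y)
x0, y0 = 1.0, -1.0
x1, y1 = 0.3, 0.2
s = (x1 - x0, y1 - y0)
g0, g1 = grad(x0, y0), grad(x1, y1)
yvec = (g1[0] - g0[0], g1[1] - g0[1])

# for a quadratic, y = H s exactly, so B = H satisfies the secant equation
H = [[2.0, 1.0], [1.0, 4.0]]
Hs = (H[0][0] * s[0] + H[0][1] * s[1], H[1][0] * s[0] + H[1][1] * s[1])
```

On a non-quadratic objective the identity holds only approximately, which is exactly why the secant equation is a curvature measurement rather than a curvature definition.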
Implementation consequence:
- Log a metric that makes the secant equation visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the secant equation is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
2.4 Examples, non-examples, and boundary cases
In this section, BFGS is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Examples, non-examples, and boundary cases" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, BFGS is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate theta_t, the gradient g_t = ∇f(theta_t), the curvature matrix H_t = ∇²f(theta_t), the step size eta_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where BFGS can be computed directly and compared with theory.
- A logistic-regression or softmax objective where BFGS affects optimization but the model remains interpretable.
- A transformer training diagnostic where BFGS appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating BFGS as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: the BFGS update is B_{t+1} = B_t - (B_t s_t s_t^T B_t) / (s_t^T B_t s_t) + (y_t y_t^T) / (y_t^T s_t), with s_t = theta_{t+1} - theta_t and y_t = ∇f(theta_{t+1}) - ∇f(theta_t); it satisfies the secant equation B_{t+1} s_t = y_t and preserves positive definiteness whenever the curvature condition y_t^T s_t > 0 holds.
Proof sketch or reasoning pattern:
Start with the local model around theta_t, isolate the term involving BFGS, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
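Here is a minimal 2x2 BFGS update that can be checked against the secant equation it is built to satisfy; the starting matrix B = I and the chosen (s, y) pair are illustrative.

```python
def bfgs_update(B, s, y):
    """One BFGS update of a 2x2 curvature approximation B:
    B' = B - (B s s^T B)/(s^T B s) + (y y^T)/(y^T s)."""
    Bs = [B[0][0]*s[0] + B[0][1]*s[1], B[1][0]*s[0] + B[1][1]*s[1]]
    sBs = s[0]*Bs[0] + s[1]*Bs[1]
    ys = y[0]*s[0] + y[1]*s[1]
    assert ys > 0, "curvature condition y^T s > 0 required"
    return [[B[i][j] - Bs[i]*Bs[j]/sBs + y[i]*y[j]/ys for j in range(2)]
            for i in range(2)]

B1 = bfgs_update([[1.0, 0.0], [0.0, 1.0]], s=[1.0, 0.0], y=[2.0, 1.0])
# the updated matrix satisfies the secant equation: B1 s = y
```

In a full quasi-Newton solver this update runs once per accepted step, with (s, y) taken from consecutive iterates and gradients.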
Implementation consequence:
- Log a metric that makes BFGS visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about BFGS is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
2.5 Notation, dimensions, and assumptions
In this section, L-BFGS is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Notation, dimensions, and assumptions" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, L-BFGS is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate theta_t, the gradient g_t = ∇f(theta_t), the curvature matrix H_t = ∇²f(theta_t), the step size eta_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where L-BFGS can be computed directly and compared with theory.
- A logistic-regression or softmax objective where L-BFGS affects optimization but the model remains interpretable.
- A transformer training diagnostic where L-BFGS appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating L-BFGS as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: L-BFGS stores only the last m pairs (s_i, y_i) and applies the inverse-Hessian approximation implicitly through the two-loop recursion, computing the search direction p_t = -H_t g_t in O(m d) time and memory instead of the O(d^2) cost of storing a dense approximation.
Proof sketch or reasoning pattern:
Start with the local model around theta_t, isolate the term involving L-BFGS, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
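The two-loop recursion can be sketched in a few lines of pure Python; the history format (a list of (s, y) pairs) and the common gamma scaling of the initial matrix are assumptions of this sketch.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_loop(g, pairs):
    """L-BFGS two-loop recursion: apply the implicit inverse-Hessian
    approximation built from stored (s, y) pairs to the gradient g."""
    q = list(g)
    alphas = []
    for s, y in reversed(pairs):            # first loop: newest to oldest
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        alphas.append((a, rho, s, y))
        q = [qi - a * yi for qi, yi in zip(q, y)]
    if pairs:                               # initial scaling H0 = (s^T y / y^T y) I
        s, y = pairs[-1]
        gamma = dot(s, y) / dot(y, y)
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    for a, rho, s, y in reversed(alphas):   # second loop: oldest to newest
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return r                                # r approximates H^{-1} g; the step is -r

# sanity check on a quadratic with Hessian 2*I: H^{-1} g = g / 2
direction = two_loop([2.0, 0.0], [([1.0, 0.0], [2.0, 0.0])])
```

With a single stored pair from a quadratic with Hessian 2*I, the recursion reproduces the exact inverse-Hessian product, which is a useful unit test for any implementation.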
Implementation consequence:
- Log a metric that makes L-BFGS visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about L-BFGS is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.