Second-Order Methods, Parts 5–6: Core Theory III (Practical Variants) and Advanced Topics
5. Core Theory III: Practical Variants
This block develops core theory iii: practical variants for Second-Order Methods. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
5.1 Variant built around Kronecker factorization: K-FAC
In this section, K-FAC is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "variant built around Kronecker factorization" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, K-FAC is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the gradient g_t = ∇L(θ_t), the curvature estimate (or its factors), the resulting update Δθ_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where K-FAC can be computed directly and compared with theory.
- A logistic-regression or softmax objective where K-FAC affects optimization but the model remains interpretable.
- A transformer training diagnostic where K-FAC appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating K-FAC as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: for a layer with input second moment A = E[a aᵀ] and backpropagated-gradient second moment G = E[δ δᵀ], K-FAC approximates the layer's Fisher block as A ⊗ G and takes the damped step ΔW = -η (G + λI)^{-1} (∇_W L) (A + λI)^{-1}.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving K-FAC, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
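To make the preceding pattern concrete, here is a minimal numpy sketch of a K-FAC-style layer step on synthetic data. The batch size, damping λ, and step size η below are illustrative choices, not recommendations, and the final check verifies the Kronecker identity the method relies on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic single-layer data: inputs a and backpropagated output gradients d.
n, d_in, d_out = 256, 8, 4
a = rng.normal(size=(n, d_in))
d = rng.normal(size=(n, d_out))
W_grad = d.T @ a / n                      # weight gradient, shape (d_out, d_in)

# Kronecker factors of the curvature proxy: A = E[a a^T], G = E[d d^T], damped.
lam = 1e-2
A = a.T @ a / n + lam * np.eye(d_in)
G = d.T @ d / n + lam * np.eye(d_out)

# K-FAC layer step: dW = -eta * G^{-1} (grad W) A^{-1}.
eta = 0.1
dW = -eta * np.linalg.solve(G, W_grad) @ np.linalg.inv(A)

# Sanity check: under column-major vec, this equals the Kronecker-preconditioned
# step -eta * (A kron G)^{-1} vec(grad W), which would be far too large to form
# for a real layer.
vec_step = -eta * np.linalg.solve(np.kron(A, G), W_grad.flatten(order="F"))
```

The point of the two-factor form is cost: the d_out x d_out and d_in x d_in solves replace one solve against a (d_out·d_in)-dimensional matrix.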
Implementation consequence:
- Log a metric that makes K-FAC visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update under "Useful formula" above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about K-FAC is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
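The L-BFGS bullet above hinges on the two-loop recursion, which applies the inverse-Hessian approximation using only stored curvature pairs. A minimal sketch, checked against the explicit one-pair BFGS inverse-update formula; the quadratic and pair values are illustrative:

```python
import numpy as np

def lbfgs_two_loop(g, s_list, y_list):
    """Return H_k @ g, where H_k is the L-BFGS inverse-Hessian approximation
    built from curvature pairs s_i = x_{i+1} - x_i, y_i = g_{i+1} - g_i."""
    q = g.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    # Standard initial scaling H_0 = (s^T y / y^T y) * I from the newest pair.
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r = r + (a - b) * s
    return r

# One-pair check on a quadratic with Hessian H, where y = H s holds exactly.
H = np.diag([1.0, 4.0, 9.0])
rng = np.random.default_rng(5)
s = rng.normal(size=3)
y = H @ s
g = rng.normal(size=3)
r = lbfgs_two_loop(g, [s], [y])

# Explicit one-pair BFGS inverse-Hessian update for comparison.
rho = 1.0 / (y @ s)
gamma = (s @ y) / (y @ y)
I3 = np.eye(3)
H1 = (I3 - rho * np.outer(s, y)) @ (gamma * I3) @ (I3 - rho * np.outer(y, s)) \
     + rho * np.outer(s, s)
```

With m stored pairs the recursion costs O(m·d) per step, which is why L-BFGS remains usable when forming any d x d matrix is not.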
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
5.2 Variant built around factored matrix preconditioning: Shampoo
In this section, Shampoo is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "variant built around factored matrix preconditioning" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, Shampoo is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the gradient g_t = ∇L(θ_t), the curvature estimate (or its factors), the resulting update Δθ_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where Shampoo can be computed directly and compared with theory.
- A logistic-regression or softmax objective where Shampoo affects optimization but the model remains interpretable.
- A transformer training diagnostic where Shampoo appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating Shampoo as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: for a matrix-shaped parameter with gradient G_t, Shampoo accumulates the statistics L_t = Σ_s G_s G_sᵀ and R_t = Σ_s G_sᵀ G_s and takes the step ΔW = -η (L_t + εI)^{-1/4} G_t (R_t + εI)^{-1/4}.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving Shampoo, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
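A minimal sketch of the Shampoo step for one matrix-shaped parameter, using an eigendecomposition for the fractional matrix powers. The damping ε and step size are illustrative, and real implementations update the statistics as running sums across steps:

```python
import numpy as np

def mat_power(M, p):
    """Power of a symmetric PSD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 0.0) ** p) @ V.T

rng = np.random.default_rng(1)
m, k = 5, 3
Gmat = rng.normal(size=(m, k))           # gradient of a matrix-shaped parameter

eps = 1e-4                               # damping keeps the powers well-defined
Lpre = Gmat @ Gmat.T + eps * np.eye(m)   # left (row-space) statistic
Rpre = Gmat.T @ Gmat + eps * np.eye(k)   # right (column-space) statistic

# Shampoo step for a rank-2 (matrix) parameter: exponent -1/4 on each side,
# so the combined preconditioner acts like an inverse square root overall.
eta = 0.1
dW = -eta * mat_power(Lpre, -0.25) @ Gmat @ mat_power(Rpre, -0.25)
```

Storing Lpre and Rpre costs O(m² + k²) rather than the O(m²k²) a full-matrix preconditioner would need.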
Implementation consequence:
- Log a metric that makes Shampoo visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update under "Useful formula" above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about Shampoo is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
5.3 Variant built around preconditioner eigenbases: SOAP
In this section, SOAP is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "variant built around preconditioner eigenbases" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, SOAP is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the gradient g_t = ∇L(θ_t), the curvature estimate (or its factors), the resulting update Δθ_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where SOAP can be computed directly and compared with theory.
- A logistic-regression or softmax objective where SOAP affects optimization but the model remains interpretable.
- A transformer training diagnostic where SOAP appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating SOAP as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: SOAP maintains the eigenbases Q_L, Q_R of the Shampoo statistics and runs an Adam-style update in that rotated basis: ΔW = -η Q_L · Adam(Q_Lᵀ (∇_W L) Q_R) · Q_Rᵀ.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving SOAP, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
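SOAP can be caricatured in a single step: form the eigenbases of the Shampoo statistics, rotate the gradient into that basis, apply an elementwise Adam-style normalization there, and rotate back. A minimal one-step numpy sketch; real SOAP maintains running first and second moments and refreshes the eigenbases only periodically, details omitted here:

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 4, 3
Gmat = rng.normal(size=(m, k))           # gradient of a matrix-shaped parameter

# Eigenbases of the (damped) Shampoo statistics.
eps = 1e-4
_, QL = np.linalg.eigh(Gmat @ Gmat.T + eps * np.eye(m))
_, QR = np.linalg.eigh(Gmat.T @ Gmat + eps * np.eye(k))

# SOAP idea: do the Adam-style work in the rotated basis, then rotate back.
G_rot = QL.T @ Gmat @ QR                 # gradient in the preconditioner eigenbasis
v = G_rot ** 2                           # second-moment estimate (one step, no beta2)
step_rot = G_rot / (np.sqrt(v) + 1e-8)   # elementwise Adam-style normalization
eta = 0.01
dW = -eta * QL @ step_rot @ QR.T         # rotate the update back
```

The design choice is to keep Adam's cheap diagonal adaptation while aligning its axes with directions the gradient statistics actually favor.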
Implementation consequence:
- Log a metric that makes SOAP visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update under "Useful formula" above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about SOAP is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
5.4 Implementation constraints and numerical stability
In this section, Muon preview is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Implementation constraints and numerical stability" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, Muon preview is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the gradient g_t = ∇L(θ_t), the curvature estimate (or its factors), the resulting update Δθ_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where Muon preview can be computed directly and compared with theory.
- A logistic-regression or softmax objective where Muon preview affects optimization but the model remains interpretable.
- A transformer training diagnostic where Muon preview appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating Muon preview as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: the damped curvature solve Δθ = -η (H + λI)^{-1} g caps the amplification of any gradient direction at 1/λ; this is the basic numerical-stability lever shared by Levenberg–Marquardt damping, trust regions, and the ε terms in adaptive methods.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving Muon preview, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
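A tiny numeric illustration of why damping matters: with ill-conditioned curvature, the undamped Newton step amplifies the gradient by the inverse of the smallest eigenvalue, while (H + λI)^{-1} caps the amplification at 1/λ. The eigenvalues and λ below are illustrative:

```python
import numpy as np

# Ill-conditioned curvature: eigenvalues spanning eight orders of magnitude.
H = np.diag([1e4, 1.0, 1e-4])
g = np.ones(3)

raw = np.linalg.solve(H, g)                    # undamped Newton step
lam = 1e-2                                     # damping strength (illustrative)
damped = np.linalg.solve(H + lam * np.eye(3), g)

# Undamped: the 1e-4-curvature direction is amplified by a factor of 1e4.
# Damped: no direction is amplified by more than 1/lam = 100.
```

Logging the ratio between update norm and gradient norm is the practical counterpart of this calculation.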
Implementation consequence:
- Log a metric that makes Muon preview visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update under "Useful formula" above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about Muon preview is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
5.5 What belongs here versus neighboring sections
In this section, curvature diagnostics is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "What belongs here versus neighboring sections" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, curvature diagnostics is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the gradient g_t = ∇L(θ_t), the curvature estimate (or its factors), the resulting update Δθ_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where curvature diagnostics can be computed directly and compared with theory.
- A logistic-regression or softmax objective where curvature diagnostics affects optimization but the model remains interpretable.
- A transformer training diagnostic where curvature diagnostics appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating curvature diagnostics as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: the Hessian-vector product Hv = ∇_θ (∇_θ L(θ)ᵀ v) costs about as much as two gradient evaluations, and the Rayleigh quotient (vᵀ H v)/(vᵀ v) measures curvature along the direction v.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving curvature diagnostics, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
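One way to make curvature visible without forming the Hessian is a finite-difference Hessian-vector product built from two gradient calls. On a quadratic the central difference is exact up to roundoff, which makes it a good unit test; the objective below is synthetic:

```python
import numpy as np

# Quadratic test objective L(x) = 0.5 x^T H x with a known SPD Hessian.
rng = np.random.default_rng(4)
M = rng.normal(size=(6, 6))
H = M @ M.T + np.eye(6)
grad = lambda x: H @ x                    # exact gradient of the quadratic

def hvp_fd(grad_fn, x, v, h=1e-5):
    """Hessian-vector product via a central finite difference of the gradient.
    Costs two gradient evaluations; the Hessian is never materialized."""
    return (grad_fn(x + h * v) - grad_fn(x - h * v)) / (2 * h)

x = rng.normal(size=6)
v = rng.normal(size=6)
approx = hvp_fd(grad, x, v)
exact = H @ v
```

On non-quadratic losses the finite-difference version carries O(h²) error, so autodiff HVPs are preferred when available; the finite-difference form remains a useful cross-check.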
Implementation consequence:
- Log a metric that makes curvature diagnostics visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update under "Useful formula" above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about curvature diagnostics is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
6. Advanced Topics
This block develops advanced topics for Second-Order Methods. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
6.1 Advanced view of orthogonalized updates (Muon preview)
In this section, Muon preview is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "advanced view of orthogonalized updates" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, Muon preview is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the gradient g_t = ∇L(θ_t), the curvature estimate (or its factors), the resulting update Δθ_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where Muon preview can be computed directly and compared with theory.
- A logistic-regression or softmax objective where Muon preview affects optimization but the model remains interpretable.
- A transformer training diagnostic where Muon preview appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating Muon preview as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: writing the reduced SVD of the (momentum-averaged) gradient as M = UΣVᵀ, a Muon-style update replaces M by its orthogonalized form U Vᵀ and steps ΔW = -η U Vᵀ, with U Vᵀ computed by a cheap Newton–Schulz iteration rather than an explicit SVD.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving Muon preview, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
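The orthogonalization at the heart of Muon-style updates can be sketched with a plain Newton–Schulz iteration that pushes the singular values of the gradient toward 1. The normalization and iteration count below are illustrative; practical implementations use tuned polynomial coefficients and run in low precision:

```python
import numpy as np

def orthogonalize(G, steps=30):
    """Newton-Schulz iteration driving the singular values of G toward 1.
    The spectral-norm rescaling puts the start inside the convergence region."""
    X = G / (np.linalg.norm(G, 2) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(3)
G = rng.normal(size=(4, 3))
O = orthogonalize(G)

# O should match U V^T from the reduced SVD of G (the orthogonalized gradient),
# which a Muon-style update then scales by a learning rate.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
```

Each iteration is just matrix multiplies, which is why this route suits accelerators better than an SVD.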
Implementation consequence:
- Log a metric that makes Muon preview visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update under "Useful formula" above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about Muon preview is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
6.2 Advanced view of Hessian-vector products
In this section, curvature diagnostics is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Advanced view of Hessian-vector products" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, curvature diagnostics is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the gradient g_t = ∇L(θ_t), the curvature estimate (or its factors), the resulting update Δθ_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where curvature diagnostics can be computed directly and compared with theory.
- A logistic-regression or softmax objective where curvature diagnostics affects optimization but the model remains interpretable.
- A transformer training diagnostic where curvature diagnostics appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating curvature diagnostics as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: Hv = lim_{h→0} (∇L(θ + hv) - ∇L(θ))/h; with automatic differentiation the same quantity is obtained exactly as ∇_θ (∇L(θ)ᵀ v) for roughly the cost of two gradient passes (Pearlmutter's trick).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving curvature diagnostics, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
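Combining Hessian-vector products with power iteration gives the standard sharpness probe: the top Hessian eigenvalue, computed without ever materializing H. A sketch on a matrix with a known spectrum, where the hvp closure stands in for an autodiff Hessian-vector product; the spectrum and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
Q = np.linalg.qr(rng.normal(size=(5, 5)))[0]
H = Q @ np.diag([10.0, 3.0, 1.0, 0.5, 0.1]) @ Q.T   # known spectrum, top = 10

hvp = lambda v: H @ v        # stand-in for an autodiff Hessian-vector product

# Power iteration on hvp: estimate the top eigenvalue ("sharpness").
v = rng.normal(size=5)
v /= np.linalg.norm(v)
for _ in range(200):
    w = hvp(v)
    v = w / np.linalg.norm(w)
lam_max = v @ hvp(v)
```

Convergence speed depends on the gap between the top two eigenvalues; in practice a few dozen HVPs usually suffice for a diagnostic-quality estimate.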
Implementation consequence:
- Log a metric that makes curvature diagnostics visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update under "Useful formula" above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about curvature diagnostics is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
6.3 Advanced view of Fisher information
In this section, large-model feasibility is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Advanced view of Fisher information" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, large-model feasibility is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the gradient g_t = ∇L(θ_t), the curvature estimate (or its factors), the resulting update Δθ_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where large-model feasibility can be computed directly and compared with theory.
- A logistic-regression or softmax objective where large-model feasibility affects optimization but the model remains interpretable.
- A transformer training diagnostic where large-model feasibility appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating large-model feasibility as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: F(θ) = E_x E_{y ~ p_θ(·|x)} [∇_θ log p_θ(y|x) ∇_θ log p_θ(y|x)ᵀ]; the natural-gradient step is Δθ = -η F(θ)^{-1} ∇L(θ), and feasibility at scale hinges on how F is factored or approximated.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving large-model feasibility, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
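For logistic regression the Fisher information has a closed form, which makes it a good unit test for Fisher estimators: sampling labels from the model and averaging outer products of the per-example score vectors (y - p)x should recover Xᵀ diag(p(1-p)) X / n. A sketch with illustrative sizes, weights, and sample counts:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 500, 3
X = rng.normal(size=(n, d))
w = np.array([1.0, -2.0, 0.5])
p = 1.0 / (1.0 + np.exp(-X @ w))          # model probabilities at w

# Closed form: Fisher of logistic regression is X^T diag(p(1-p)) X / n,
# because E_{y~p}[(y - p)^2] = p(1 - p).
F_closed = (X * (p * (1 - p))[:, None]).T @ X / n

# Monte Carlo estimate: sample labels FROM THE MODEL (not the data) and
# average outer products of per-example scores g_i = (y_i - p_i) x_i.
F_mc = np.zeros((d, d))
for _ in range(200):
    y = rng.binomial(1, p)
    G = (y - p)[:, None] * X              # per-example score vectors, (n, d)
    F_mc += G.T @ G / n
F_mc /= 200
```

Replacing the model-sampled labels with the observed ones gives the "empirical Fisher", a different object that only coincides with F under model correctness; confusing the two is a common diagnostic error.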
Implementation consequence:
- Log a metric that makes large-model feasibility visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update under "Useful formula" above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about large-model feasibility is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
6.4 Infinite-dimensional or large-scale interpretation
In this section, Hessian matrix is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Infinite-dimensional or large-scale interpretation" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, Hessian matrix is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the gradient g_t = ∇L(θ_t), the curvature estimate (or its factors), the resulting update Δθ_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where Hessian matrix can be computed directly and compared with theory.
- A logistic-regression or softmax objective where Hessian matrix affects optimization but the model remains interpretable.
- A transformer training diagnostic where Hessian matrix appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating Hessian matrix as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: the Newton step θ⁺ = θ - (∇²L(θ))^{-1} ∇L(θ), computed in practice by solving the linear system ∇²L(θ) p = ∇L(θ) so that only Hessian-vector products are needed.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving Hessian matrix, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
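At scale the Newton system H p = g is solved iteratively with conjugate gradients, touching H only through Hessian-vector products ("Hessian-free" optimization). A sketch on a small quadratic, where one CG-based Newton step lands on the exact minimizer; dimensions and tolerances are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
M = rng.normal(size=(6, 6))
H = M @ M.T + np.eye(6)                # SPD Hessian of a quadratic objective
b = rng.normal(size=6)
grad = lambda x: H @ x - b             # gradient of 0.5 x^T H x - b^T x

def cg(hvp, g, iters=50, tol=1e-10):
    """Conjugate gradients on H p = g using only Hessian-vector products:
    the matrix-free route that scales to large models."""
    p_sol = np.zeros_like(g)
    r = g.copy()
    d = r.copy()
    for _ in range(iters):
        Hd = hvp(d)
        alpha = (r @ r) / (d @ Hd)
        p_sol += alpha * d
        r_new = r - alpha * Hd
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p_sol

x0 = rng.normal(size=6)
step = cg(lambda v: H @ v, grad(x0))   # Newton direction H^{-1} grad
x1 = x0 - step                         # one Hessian-free Newton step
```

On an exact quadratic, CG terminates in at most dim(x) iterations; on real losses the inner solve is truncated early and damped.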
Implementation consequence:
- Log a metric that makes Hessian matrix visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update under "Useful formula" above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about Hessian matrix is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
6.5 Open questions for frontier model training
In this section, quadratic model is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Open questions for frontier model training" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, quadratic model is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the gradient g_t = ∇L(θ_t), the curvature estimate (or its factors), the resulting update Δθ_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where quadratic model can be computed directly and compared with theory.
- A logistic-regression or softmax objective where quadratic model affects optimization but the model remains interpretable.
- A transformer training diagnostic where quadratic model appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating quadratic model as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: m(θ + p) = L(θ) + ∇L(θ)ᵀ p + ½ pᵀ ∇²L(θ) p, minimized (when the Hessian is positive definite) at the Newton step p* = -(∇²L(θ))^{-1} ∇L(θ).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving quadratic model, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
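The quadratic model's defining property is its error order: around θ it matches L to third order in the step, while the linear model matches only to second order. A small numeric check on a smooth synthetic objective with closed-form derivatives; the objective and step size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(9)

# Smooth synthetic objective with closed-form gradient and Hessian.
L = lambda x: np.sum(np.cos(x)) + 0.5 * np.sum(x ** 2)
grad = lambda x: -np.sin(x) + x
hess = lambda x: np.diag(1.0 - np.cos(x))

x = rng.normal(size=4)
p = 1e-3 * rng.normal(size=4)              # small step

# Quadratic model m(p) = L(x) + g^T p + 0.5 p^T H p around x,
# and the linear (first-order) model for comparison.
m_quad = L(x) + grad(x) @ p + 0.5 * p @ hess(x) @ p
m_lin = L(x) + grad(x) @ p

err_quad = abs(L(x + p) - m_quad)          # O(||p||^3) remainder
err_lin = abs(L(x + p) - m_lin)            # O(||p||^2) remainder
```

Halving the step should shrink err_quad by roughly 8x and err_lin by roughly 4x, a quick empirical test of whether logged curvature is trustworthy.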
Implementation consequence:
- Log a metric that makes quadratic model visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update under "Useful formula" above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about quadratic model is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- K-FAC and natural-gradient style preconditioners for neural networks.
- L-BFGS for small-batch fine-tuning and classical ML objectives.
- Hessian-vector products for sharpness and interpretability diagnostics.
- Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.