
Part 2

Second-Order Methods: 3. Core Theory I (Geometry and Guarantees) through 4. Core Theory II (Algorithms and Dynamics)

3. Core Theory I: Geometry and Guarantees

This block develops Core Theory I (Geometry and Guarantees) for Second-Order Methods. It keeps the scope local to this section and points forward when a neighboring topic owns the full treatment.

3.1 Geometry of Newton decrement

In this section, BFGS is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Geometry of Newton decrement" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, BFGS is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where BFGS can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where BFGS affects optimization but the model remains interpretable.
  • A transformer training diagnostic where BFGS appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating BFGS as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - H_f(\boldsymbol{\theta}_t)^{-1}\,\nabla f(\boldsymbol{\theta}_t)$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving BFGS, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
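As a minimal check of this pattern, the sketch below (assuming NumPy and a synthetic two-dimensional quadratic, not anything from the formal statement) performs one BFGS inverse-Hessian update and verifies the secant condition $H y = s$ that the proof machinery relies on:

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x - b^T x: gradient A x - b, constant Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b

x0 = np.zeros(2)
H = np.eye(2)                      # inverse-Hessian estimate, started at identity
p = -H @ grad(x0)                  # quasi-Newton direction
x1 = x0 + p
s, y = x1 - x0, grad(x1) - grad(x0)

rho = 1.0 / (y @ s)                # requires the curvature condition y^T s > 0
I = np.eye(2)
# BFGS inverse update: preserves symmetry and positive definiteness.
H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

# The update is constructed to satisfy the secant condition H y = s exactly.
print(np.allclose(H @ y, s))       # True
print(np.allclose(H, H.T))         # True: symmetry is preserved
```

The secant condition is exactly the measurable quantity a diagnostic can log: if `H @ y` drifts from `s`, the curvature-pair bookkeeping is broken.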

Implementation consequence:

  • Log a metric that makes BFGS visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$H_f(\boldsymbol{\theta}_t)\,\mathbf{p}_t = -\nabla f(\boldsymbol{\theta}_t)$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about BFGS is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • K-FAC and natural-gradient style preconditioners for neural networks.
  • L-BFGS for small-batch fine-tuning and classical ML objectives.
  • Hessian-vector products for sharpness and interpretability diagnostics.
  • Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

3.2 Key inequality for damped Newton

In this section, L-BFGS is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Key inequality for damped Newton" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, L-BFGS is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where L-BFGS can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where L-BFGS affects optimization but the model remains interpretable.
  • A transformer training diagnostic where L-BFGS appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating L-BFGS as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - H_f(\boldsymbol{\theta}_t)^{-1}\,\nabla f(\boldsymbol{\theta}_t)$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving L-BFGS, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
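A hedged illustration of the damped update this subsection's key inequality governs: full Newton directions with backtracking until the Armijo sufficient-decrease inequality holds. The objective, constants, and tolerances below are illustrative choices, not part of the formal statement:

```python
import numpy as np

# Convex test objective with non-constant curvature:
# f(θ) = 0.5 θ^T A θ + 0.25 ||θ||^4, minimized at θ* = 0.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
f = lambda t: 0.5 * t @ A @ t + 0.25 * (t @ t) ** 2
grad = lambda t: A @ t + (t @ t) * t
hess = lambda t: A + (t @ t) * np.eye(2) + 2.0 * np.outer(t, t)

theta = np.array([2.0, -1.5])
for _ in range(30):
    g = grad(theta)
    p = np.linalg.solve(hess(theta), -g)   # Newton direction (Hessian is PD here)
    eta = 1.0
    # Damping via backtracking: shrink eta until the Armijo inequality
    # f(θ + ηp) <= f(θ) + 1e-4 * η * g^T p certifies sufficient decrease.
    while f(theta + eta * p) > f(theta) + 1e-4 * eta * (g @ p):
        eta *= 0.5
    theta = theta + eta * p

print(np.linalg.norm(grad(theta)) < 1e-8)  # True: damped phase, then fast local phase
```

The accepted step size `eta` is itself a useful logged diagnostic: a long run of `eta < 1` signals that the local quadratic model is not yet trusted.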

Implementation consequence:

  • Log a metric that makes L-BFGS visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$H_f(\boldsymbol{\theta}_t)\,\mathbf{p}_t = -\nabla f(\boldsymbol{\theta}_t)$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about L-BFGS is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • K-FAC and natural-gradient style preconditioners for neural networks.
  • L-BFGS for small-batch fine-tuning and classical ML objectives.
  • Hessian-vector products for sharpness and interpretability diagnostics.
  • Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

3.3 Role of modified Cholesky

In this section, two-loop recursion is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Role of modified Cholesky" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, two-loop recursion is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where two-loop recursion can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where two-loop recursion affects optimization but the model remains interpretable.
  • A transformer training diagnostic where two-loop recursion appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating two-loop recursion as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - H_f(\boldsymbol{\theta}_t)^{-1}\,\nabla f(\boldsymbol{\theta}_t)$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving two-loop recursion, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
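The two-loop recursion can be sketched directly; the check below (synthetic curvature pairs, NumPy) confirms that it reproduces the dense BFGS inverse update it is meant to replace without ever forming the matrix:

```python
import numpy as np

def two_loop(g, pairs, gamma=1.0):
    """L-BFGS two-loop recursion: returns H g without forming H.
    pairs = [(s_i, y_i), ...], oldest first; the initial matrix is gamma * I."""
    q = g.copy()
    stack = []
    for s, y in reversed(pairs):              # first loop: newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        stack.append((rho, a, s, y))
    r = gamma * q                              # apply the initial matrix H0
    for rho, a, s, y in reversed(stack):       # second loop: oldest to newest
        beta = rho * (y @ r)
        r += (a - beta) * s
    return r

# Reference: the dense BFGS inverse update applied with the same (s, y) pairs.
rng = np.random.default_rng(0)
pairs, H, I = [], np.eye(3), np.eye(3)
for _ in range(4):
    s = rng.standard_normal(3)
    y = s + 0.1 * rng.standard_normal(3)       # keeps the curvature condition y^T s > 0
    rho = 1.0 / (y @ s)
    H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
    pairs.append((s, y))

g = rng.standard_normal(3)
print(np.allclose(two_loop(g, pairs), H @ g))  # True: same operator, O(mn) memory
```

This equivalence is the practical point of the recursion: memory drops from $O(n^2)$ to $O(mn)$ for $m$ stored pairs, which is what makes limited-memory methods feasible at model scale.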

Implementation consequence:

  • Log a metric that makes two-loop recursion visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$H_f(\boldsymbol{\theta}_t)\,\mathbf{p}_t = -\nabla f(\boldsymbol{\theta}_t)$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about two-loop recursion is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • K-FAC and natural-gradient style preconditioners for neural networks.
  • L-BFGS for small-batch fine-tuning and classical ML objectives.
  • Hessian-vector products for sharpness and interpretability diagnostics.
  • Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

3.4 Proof template and what the proof actually buys

In this section, Hessian-vector products are treated as concrete optimization objects rather than slogans. The goal is to understand how they change the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Proof template and what the proof actually buys" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, Hessian-vector products are the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track them through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where Hessian-vector products can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where Hessian-vector products affect optimization but the model remains interpretable.
  • A transformer training diagnostic where Hessian-vector products appear through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating Hessian-vector products as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - H_f(\boldsymbol{\theta}_t)^{-1}\,\nabla f(\boldsymbol{\theta}_t)$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving Hessian-vector products, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
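One concrete way to make a Hessian-vector product measurable without forming the Hessian is a central finite difference of the gradient; on a quadratic the formula is exact, which gives a clean unit test. Everything below is an illustrative sketch, not the section's formal object:

```python
import numpy as np

def hvp_fd(grad, theta, v, eps=1e-5):
    """Central-difference Hessian-vector product: H v ≈ (∇f(θ+εv) - ∇f(θ-εv)) / 2ε.
    Autodiff frameworks compute the same quantity exactly with a
    forward-over-reverse pass at roughly one extra gradient's cost."""
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2.0 * eps)

# Quadratic f(θ) = 0.5 θ^T A θ - b^T θ: gradient A θ - b, Hessian A,
# so the finite-difference formula is exact up to rounding.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda t: A @ t - b

theta = np.array([0.5, -0.5])
v = np.array([1.0, 2.0])
print(np.allclose(hvp_fd(grad, theta, v), A @ v, atol=1e-6))  # True
```

Repeated products like this are the workhorse behind sharpness probes and Lanczos-style top-eigenvalue estimates, since they touch the curvature only through matrix-vector actions.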

Implementation consequence:

  • Log a metric that makes Hessian-vector products visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$H_f(\boldsymbol{\theta}_t)\,\mathbf{p}_t = -\nabla f(\boldsymbol{\theta}_t)$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about Hessian-vector products is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • K-FAC and natural-gradient style preconditioners for neural networks.
  • L-BFGS for small-batch fine-tuning and classical ML objectives.
  • Hessian-vector products for sharpness and interpretability diagnostics.
  • Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

3.5 Failure modes when assumptions are removed

In this section, Fisher information is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Failure modes when assumptions are removed" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, Fisher information is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where Fisher information can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where Fisher information affects optimization but the model remains interpretable.
  • A transformer training diagnostic where Fisher information appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating Fisher information as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - H_f(\boldsymbol{\theta}_t)^{-1}\,\nabla f(\boldsymbol{\theta}_t)$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving Fisher information, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
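A small sketch of Fisher information for a Bernoulli (logistic) model, computed two ways that must agree: the closed-form $p(1-p)$ weighting and the exact expectation of the score outer product over $y \in \{0, 1\}$. The data and parameter vector are synthetic placeholders:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n = 200
X = rng.standard_normal((n, 3))
theta = np.array([0.5, -1.0, 0.25])
p = sigmoid(X @ theta)

# (a) Closed form: F = (1/n) sum_i p_i (1 - p_i) x_i x_i^T, which for this
# model also equals the Hessian of the mean negative log-likelihood.
F_closed = (X * (p * (1 - p))[:, None]).T @ X / n

# (b) Definition: F = (1/n) sum_i E_y[ score_i score_i^T ], with the
# per-example score (p_i - y) x_i and labels y ~ Bernoulli(p_i).
F_def = np.zeros((3, 3))
for xi, pi in zip(X, p):
    for y_val, prob in ((1.0, pi), (0.0, 1.0 - pi)):
        score = (pi - y_val) * xi
        F_def += prob * np.outer(score, score) / n

print(np.allclose(F_closed, F_def))  # True: E[(p - y)^2] = p(1 - p)
```

The agreement rests on taking the expectation under the model's own label distribution; swapping in observed labels gives the empirical Fisher, which is a different (and only approximate) object.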

Implementation consequence:

  • Log a metric that makes Fisher information visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$H_f(\boldsymbol{\theta}_t)\,\mathbf{p}_t = -\nabla f(\boldsymbol{\theta}_t)$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about Fisher information is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • K-FAC and natural-gradient style preconditioners for neural networks.
  • L-BFGS for small-batch fine-tuning and classical ML objectives.
  • Hessian-vector products for sharpness and interpretability diagnostics.
  • Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

4. Core Theory II: Algorithms and Dynamics

This block develops Core Theory II (Algorithms and Dynamics) for Second-Order Methods. It keeps the scope local to this section and points forward when a neighboring topic owns the full treatment.

4.1 Algorithmic update for trust-region preview

In this section, Hessian-vector products are treated as concrete optimization objects rather than slogans. The goal is to understand how they change the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Algorithmic update for trust-region preview" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, Hessian-vector products are the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track them through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where Hessian-vector products can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where Hessian-vector products affect optimization but the model remains interpretable.
  • A transformer training diagnostic where Hessian-vector products appear through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating Hessian-vector products as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - H_f(\boldsymbol{\theta}_t)^{-1}\,\nabla f(\boldsymbol{\theta}_t)$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving Hessian-vector products, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
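The trust-region accept/shrink logic can be previewed with the simplest possible step, the Cauchy point, on a convex quadratic. The radii, acceptance thresholds, and iteration count below are illustrative defaults rather than canonical values:

```python
import numpy as np

# Convex quadratic test problem: f(θ) = 0.5 θ^T A θ - b^T θ.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda t: 0.5 * t @ A @ t - b @ t
grad = lambda t: A @ t - b

theta, radius = np.array([4.0, 4.0]), 1.0
for _ in range(100):
    g = grad(theta)
    if np.linalg.norm(g) < 1e-7:
        break
    # Cauchy point: minimize the quadratic model along -g inside the region.
    tau = min((g @ g) / (g @ A @ g), radius / np.linalg.norm(g))
    p = -tau * g
    predicted = -(g @ p + 0.5 * p @ A @ p)   # model decrease (positive)
    actual = f(theta) - f(theta + p)          # realized decrease
    rho = actual / predicted                  # model-agreement ratio
    if rho > 0.75:
        radius = min(2.0 * radius, 100.0)     # model trusted: grow the region
    elif rho < 0.25:
        radius *= 0.25                        # model poor: shrink the region
    if rho > 0.1:
        theta = theta + p                     # accept only sufficiently good steps

print(np.linalg.norm(grad(theta)) < 1e-6)     # True
```

On a quadratic the model is exact, so `rho` stays near 1 and the radius only grows; the interesting dynamics appear when curvature changes and `rho` forces the region to contract.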

Implementation consequence:

  • Log a metric that makes Hessian-vector products visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$H_f(\boldsymbol{\theta}_t)\,\mathbf{p}_t = -\nabla f(\boldsymbol{\theta}_t)$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about Hessian-vector products is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • K-FAC and natural-gradient style preconditioners for neural networks.
  • L-BFGS for small-batch fine-tuning and classical ML objectives.
  • Hessian-vector products for sharpness and interpretability diagnostics.
  • Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

4.2 Stability role of Gauss-Newton

In this section, Fisher information is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Stability role of Gauss-Newton" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, Fisher information is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where Fisher information can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where Fisher information affects optimization but the model remains interpretable.
  • A transformer training diagnostic where Fisher information appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating Fisher information as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - H_f(\boldsymbol{\theta}_t)^{-1}\,\nabla f(\boldsymbol{\theta}_t)$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving Fisher information, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
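A minimal Gauss-Newton sketch on synthetic data: for linear residuals the Jacobian is constant, so a single step solving $(J^\top J)\,\mathbf{p} = -J^\top \mathbf{r}$ lands exactly on the least-squares solution, which makes the stability claim easy to verify:

```python
import numpy as np

# For residuals r(θ) = X θ - y the Jacobian is the constant J = X, so one
# Gauss-Newton step from any starting point reaches the least-squares optimum.
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)

theta = np.zeros(3)
r = X @ theta - y
p = np.linalg.solve(X.T @ X, -X.T @ r)   # Gauss-Newton system: (J^T J) p = -J^T r
theta = theta + p

theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta, theta_star))    # True
```

The stability point is that $J^\top J$ is always positive semidefinite, so the Gauss-Newton direction never points uphill the way a full Newton direction can near indefinite curvature.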

Implementation consequence:

  • Log a metric that makes Fisher information visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$H_f(\boldsymbol{\theta}_t)\,\mathbf{p}_t = -\nabla f(\boldsymbol{\theta}_t)$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about Fisher information is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • K-FAC and natural-gradient style preconditioners for neural networks.
  • L-BFGS for small-batch fine-tuning and classical ML objectives.
  • Hessian-vector products for sharpness and interpretability diagnostics.
  • Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

4.3 Rate or complexity controlled by Levenberg-Marquardt

In this section, natural gradient is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Rate or complexity controlled by Levenberg-Marquardt" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, natural gradient is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where natural gradient can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where natural gradient affects optimization but the model remains interpretable.
  • A transformer training diagnostic where natural gradient appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating natural gradient as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - H_f(\boldsymbol{\theta}_t)^{-1}\,\nabla f(\boldsymbol{\theta}_t)$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving natural gradient, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
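The role of the Levenberg-Marquardt damping parameter is easy to probe numerically: the step $(J^\top J + \lambda I)\,\mathbf{p} = -J^\top \mathbf{r}$ interpolates between the Gauss-Newton step (small $\lambda$) and a short move along the negative gradient (large $\lambda$). The matrices below are random placeholders, not a real model:

```python
import numpy as np

def lm_step(J, r, lam):
    """Levenberg-Marquardt step: solve (J^T J + λ I) p = -J^T r."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + lam * np.eye(n), -J.T @ r)

rng = np.random.default_rng(3)
J = rng.standard_normal((40, 3))
r = rng.standard_normal(40)

gn = np.linalg.solve(J.T @ J, -J.T @ r)   # Gauss-Newton step (λ = 0)
g = J.T @ r                                # gradient of 0.5 ||r||^2

# Small damping recovers Gauss-Newton; large damping shrinks the step
# toward a short move along the negative gradient, -g / λ.
print(np.allclose(lm_step(J, r, 1e-10), gn, atol=1e-6))        # True
print(np.allclose(lm_step(J, r, 1e8), -g / 1e8, atol=1e-12))   # True
```

This interpolation is exactly the rate-versus-robustness dial: $\lambda$ is raised after rejected steps (slower, safer, gradient-like) and lowered after accepted ones (faster, Newton-like).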

Implementation consequence:

  • Log a metric that makes natural gradient visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$H_f(\boldsymbol{\theta}_t)\,\mathbf{p}_t = -\nabla f(\boldsymbol{\theta}_t)$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about natural gradient is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • K-FAC and natural-gradient style preconditioners for neural networks.
  • L-BFGS for small-batch fine-tuning and classical ML objectives.
  • Hessian-vector products for sharpness and interpretability diagnostics.
  • Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

4.4 Diagnostic interpretation of the update path

In this section, K-FAC is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Diagnostic interpretation of the update path" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, K-FAC is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where K-FAC can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where K-FAC affects optimization but the model remains interpretable.
  • A transformer training diagnostic where K-FAC appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating K-FAC as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - H_f(\boldsymbol{\theta}_t)^{-1}\,\nabla f(\boldsymbol{\theta}_t)$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving K-FAC, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
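The core K-FAC mechanism, under the assumption that a layer's curvature block factors as a Kronecker product of two small matrices, is that one large solve collapses into two small ones via the identity $(B \otimes A)\,\mathrm{vec}(X) = \mathrm{vec}(A X B)$ for symmetric factors and column-major vec. A self-contained sketch with synthetic SPD factors:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 4, 3
A = np.eye(m) + 0.1 * rng.standard_normal((m, m)); A = A @ A.T   # SPD "input activation" factor
B = np.eye(n) + 0.1 * rng.standard_normal((n, n)); B = B @ B.T   # SPD "output gradient" factor
G = rng.standard_normal((m, n))                                   # layer gradient (m x n)

# K-FAC-style solve: (B ⊗ A) vec(X) = vec(G)  =>  X = A^{-1} G B^{-1},
# i.e. two small solves instead of one (mn x mn) solve.
X_kfac = np.linalg.solve(A, np.linalg.solve(B, G.T).T)

# Reference: form the full Kronecker system and solve it directly
# (column-major vec convention; B is symmetric).
x = np.linalg.solve(np.kron(B, A), G.flatten(order="F"))
X_full = x.reshape(m, n, order="F")

print(np.allclose(X_kfac, X_full))  # True
```

For a layer with a 4096-dimensional input and output, the full block would be roughly 16M x 16M; the Kronecker factorization reduces the work to two 4096 x 4096 inverses, which is the entire practical appeal.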

Implementation consequence:

  • Log a metric that makes K-FAC visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$H_f(\boldsymbol{\theta}_t)\mathbf{p}_t = -\nabla f(\boldsymbol{\theta}_t)$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
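The comparison in the second bullet can be sketched in a few lines: solve the linear system for the mathematical direction (never form the inverse explicitly) and log the relative deviation of the measured update. The Hessian, gradient, and logged delta below are hypothetical stand-ins for values a real training loop would record.

```python
import numpy as np

H = np.array([[4.0, 0.5], [0.5, 1.0]])  # local Hessian estimate (SPD here, assumed)
grad = np.array([0.8, -0.3])            # measured gradient at θ_t (assumed)

# Mathematical update: solve H p = -∇f rather than computing H^{-1} explicitly.
p_math = np.linalg.solve(H, -grad)

# Hypothetical parameter change logged by the training loop:
p_measured = p_math + 1e-4 * np.array([1.0, -1.0])

rel_err = np.linalg.norm(p_measured - p_math) / np.linalg.norm(p_math)
print(f"relative deviation from Newton direction: {rel_err:.2e}")
```

A small, stable relative deviation suggests the implementation matches the math; a drifting one is the kind of number worth logging a thousand steps before divergence.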

Diagnostic questions:

  • Which assumption about K-FAC is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • K-FAC and natural-gradient style preconditioners for neural networks.
  • L-BFGS for small-batch fine-tuning and classical ML objectives.
  • Hessian-vector products for sharpness and interpretability diagnostics.
  • Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

4.5 Connection to the next section in the chapter

In this section, Shampoo is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Second-Order Methods, the phrase "Connection to the next section in the chapter" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, Shampoo is the part of Second-Order Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where Shampoo can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where Shampoo affects optimization but the model remains interpretable.
  • A transformer training diagnostic where Shampoo appears through gradient norms, update norms, curvature, or validation loss.
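For the small-quadratic example above, the Shampoo update itself can be written in a few lines. This is a minimal single-step sketch for one matrix-shaped gradient, with illustrative sizes and a plain eigendecomposition for the inverse fourth roots; real implementations accumulate statistics over steps and amortize the root computation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in = 4, 6
G = rng.normal(size=(d_out, d_in))   # gradient of a weight matrix (assumed data)

eps = 1e-6
# Shampoo tracks left and right curvature statistics of the gradient:
L = G @ G.T + eps * np.eye(d_out)
R = G.T @ G + eps * np.eye(d_in)

def inv_quarter_root(M):
    """M^{-1/4} via eigendecomposition; M is symmetric positive definite here."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.25) @ V.T

# Shampoo-preconditioned update: L^{-1/4} G R^{-1/4}
update = inv_quarter_root(L) @ G @ inv_quarter_root(R)
print(update.shape)  # same shape as G
```

On a quadratic this update can be compared numerically against the exact Newton direction, which is what the first example asks for.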

Non-examples:

  • Treating Shampoo as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - H_f(\boldsymbol{\theta}_t)^{-1}\nabla f(\boldsymbol{\theta}_t)$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving Shampoo, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes Shampoo visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$H_f(\boldsymbol{\theta}_t)\mathbf{p}_t = -\nabla f(\boldsymbol{\theta}_t)$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
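One cheap curvature diagnostic mentioned later in this lesson is the Hessian-vector product, which never forms the Hessian. A minimal sketch using central finite differences of the gradient, on a toy quadratic with assumed values (autograd frameworks provide exact equivalents):

```python
import numpy as np

def grad_f(theta):
    """Gradient of a toy quadratic f(θ) = 0.5 θ^T H θ with a fixed H (assumed)."""
    H = np.array([[2.0, 0.3], [0.3, 1.0]])
    return H @ theta

def hvp(grad_fn, theta, v, eps=1e-5):
    """Hessian-vector product ≈ (∇f(θ + εv) - ∇f(θ - εv)) / (2ε)."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

theta = np.array([1.0, -2.0])
v = np.array([1.0, 0.0])
print(hvp(grad_f, theta, v))  # ≈ H v, here the first column of H
```

Two gradient evaluations per product is what makes this usable as a logged sharpness metric on large models.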

Diagnostic questions:

  • Which assumption about Shampoo is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • K-FAC and natural-gradient style preconditioners for neural networks.
  • L-BFGS for small-batch fine-tuning and classical ML objectives.
  • Hessian-vector products for sharpness and interpretability diagnostics.
  • Structured preconditioning proposals such as Shampoo, SOAP, and Muon-family methods.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
