Optimization Landscape, Part 2: 3. Core Theory I (Geometry and Guarantees) to 4. Core Theory II (Algorithms and Dynamics)

3. Core Theory I: Geometry and Guarantees

This block develops Core Theory I: Geometry and Guarantees for Optimization Landscape. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.

3.1 Geometry of strict saddle

In this section, basin of attraction is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Geometry of strict saddle" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, basin of attraction is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where basin of attraction can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where basin of attraction affects optimization but the model remains interpretable.
  • A transformer training diagnostic where basin of attraction appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating basin of attraction as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving basin of attraction, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
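
As a concrete instance of this pattern, here is the standard one-step bound, a sketch assuming $f$ is $L$-smooth (its gradient is $L$-Lipschitz) and the update is plain gradient descent $\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t-\eta\nabla f(\boldsymbol{\theta}_t)$:

$$f(\boldsymbol{\theta}_{t+1}) \le f(\boldsymbol{\theta}_t) - \eta\left(1-\tfrac{L\eta}{2}\right)\|\nabla f(\boldsymbol{\theta}_t)\|^2$$

The step is therefore guaranteed descent whenever $\eta < 2/L$; near a minimum the role of $L$ is played by $\lambda_{\max}(H_f)$, which is exactly the threshold quoted in the implementation consequence below.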

Implementation consequence:

  • Log a metric that makes basin of attraction visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\lambda_{\max}\big(H_f(\boldsymbol{\theta}_t)\big)\, \eta \approx 2$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
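
To make the first example above concrete, here is a minimal NumPy sketch, not reference code for this lesson: the double-well objective $f(\theta)=(\theta^2-1)^2$ is an assumed stand-in with two minima at $\pm 1$ and an unstable stationary point at $0$, so the basin of attraction of each minimum is visible directly from where gradient descent lands.

```python
import numpy as np

def f(theta):
    return (theta**2 - 1.0) ** 2          # double well: minima at +/-1, unstable point at 0

def grad(theta):
    return 4.0 * theta * (theta**2 - 1.0)

def run_gd(theta0, eta, steps=200):
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Map starting points to the minimum they reach: the basin of attraction becomes
# a measurable object instead of a slogan. Note that theta0 = 0 sits exactly on
# the unstable stationary point and never moves.
for theta0 in np.linspace(-2.0, 2.0, 9):
    print(f"theta0 = {theta0:+.2f}  ->  gradient descent lands near {run_gd(theta0, eta=0.02):+.3f}")
```

Raising $\eta$ toward and past $2/\lambda_{\max}$ (here $f''(\pm 1)=8$, so roughly $\eta \approx 0.25$) destabilizes the minima, which ties this picture back to the $\lambda_{\max}\eta \approx 2$ check above.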

Diagnostic questions:

  • Which assumption about basin of attraction is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

3.2 Key inequality for plateau

In this section, barrier is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Key inequality for plateau" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, barrier is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where barrier can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where barrier affects optimization but the model remains interpretable.
  • A transformer training diagnostic where barrier appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating barrier as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving barrier, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes barrier visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\lambda_{\max}\big(H_f(\boldsymbol{\theta}_t)\big)\, \eta \approx 2$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
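
To turn the barrier into a logged number rather than a slogan, the sketch below (a toy under assumed choices, not the canonical treatment) evaluates the loss along the straight segment between two minima of a simple one-dimensional non-convex objective and reports how far the loss rises above the worse endpoint.

```python
import numpy as np

def loss(theta):
    return (theta**2 - 1.0) ** 2          # toy non-convex loss with minima at +/-1

theta_a, theta_b = -1.0, 1.0              # two solutions, e.g. from independent runs
alphas = np.linspace(0.0, 1.0, 101)
path = (1.0 - alphas) * theta_a + alphas * theta_b
path_losses = loss(path)

# Barrier height: the rise of the loss along the segment, measured against the
# worse of the two endpoint losses.
barrier = path_losses.max() - max(loss(theta_a), loss(theta_b))
print(f"endpoint losses: {loss(theta_a):.3f}, {loss(theta_b):.3f}")
print(f"peak on the segment: {path_losses.max():.3f}  ->  barrier = {barrier:.3f}")
```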

Diagnostic questions:

  • Which assumption about barrier is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

3.3 Role of Hessian spectrum

In this section, sharpness is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Role of Hessian spectrum" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, sharpness is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where sharpness can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where sharpness affects optimization but the model remains interpretable.
  • A transformer training diagnostic where sharpness appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating sharpness as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving sharpness, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes sharpness visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\lambda_{\max}\big(H_f(\boldsymbol{\theta}_t)\big)\, \eta \approx 2$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
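
A practical way to expose the top of the Hessian spectrum without forming the Hessian is power iteration on Hessian-vector products computed by double backpropagation. The sketch below is a minimal PyTorch version under simplifying assumptions (a single flat parameter tensor and a dominant eigenvalue that is positive); the function name `top_hessian_eigenvalue` is illustrative, not a library API.

```python
import torch

def top_hessian_eigenvalue(loss_fn, params, iters=100):
    # Power iteration on Hessian-vector products obtained by double backprop.
    v = torch.randn_like(params)
    v = v / v.norm()
    lam = 0.0
    for _ in range(iters):
        loss = loss_fn(params)
        (g,) = torch.autograd.grad(loss, params, create_graph=True)
        (hv,) = torch.autograd.grad(g @ v, params)
        lam = float(v @ hv)               # Rayleigh-quotient estimate of lambda_max
        v = hv / (hv.norm() + 1e-12)
    return lam

# Toy check on a quadratic whose Hessian (and hence lambda_max = 10) is known exactly.
A = torch.diag(torch.tensor([10.0, 1.0, 0.1]))
theta = torch.randn(3, requires_grad=True)
lam_max = top_hessian_eigenvalue(lambda th: 0.5 * th @ A @ th, theta)
eta = 0.05
print(f"estimated lambda_max = {lam_max:.3f},  lambda_max * eta = {lam_max * eta:.3f}  (threshold ~ 2)")
```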

Diagnostic questions:

  • Which assumption about sharpness is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

3.4 Proof template and what the proof actually buys

In this section, flatness is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Proof template and what the proof actually buys" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, flatness is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where flatness can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where flatness affects optimization but the model remains interpretable.
  • A transformer training diagnostic where flatness appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating flatness as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving flatness, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes flatness visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\lambda_{\max}\big(H_f(\boldsymbol{\theta}_t)\big)\, \eta \approx 2$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
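
One way to make flatness visible in a log is to measure how much the loss rises under small random parameter perturbations. The NumPy sketch below does that on a toy quadratic with hand-picked eigenvalues; the helper name `flatness_probe` and the radius `rho` are illustrative assumptions, not a standard metric.

```python
import numpy as np

rng = np.random.default_rng(0)
lams = np.array([50.0, 1.0, 0.01])        # sharp, moderate, and nearly flat directions

def loss(theta):
    return 0.5 * np.sum(lams * theta**2)  # toy quadratic with known curvature

def flatness_probe(theta, rho=0.05, n_samples=256):
    # Average loss increase under random perturbations of radius rho:
    # larger values mean a sharper neighborhood around theta.
    base = loss(theta)
    increases = []
    for _ in range(n_samples):
        u = rng.standard_normal(theta.shape)
        u *= rho / np.linalg.norm(u)
        increases.append(loss(theta + u) - base)
    return float(np.mean(increases))

theta_star = np.zeros(3)                   # the exact minimizer of the toy loss
print(f"mean loss increase at radius 0.05: {flatness_probe(theta_star):.5f}")
```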

Diagnostic questions:

  • Which assumption about flatness is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

3.5 Failure modes when assumptions are removed

In this section, reparameterization caveat is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Failure modes when assumptions are removed" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, reparameterization caveat is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where reparameterization caveat can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where reparameterization caveat affects optimization but the model remains interpretable.
  • A transformer training diagnostic where reparameterization caveat appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating reparameterization caveat as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving the reparameterization caveat, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes reparameterization caveat visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\lambda_{\max}\big(H_f(\boldsymbol{\theta}_t)\big)\, \eta \approx 2$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
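
The caveat fits in a few lines: for a ReLU unit, scaling one layer by $\alpha>0$ and the next by $1/\alpha$ leaves the computed function, and therefore the loss, unchanged, yet a naive parameter-space sharpness measure changes with $\alpha$. The sketch below is a deliberately tiny illustration with one hidden unit and a perturbation-based probe; all names and constants are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(w1, w2, x):
    return w2 * np.maximum(w1 * x, 0.0)    # one-hidden-unit ReLU "network"

def loss(w1, w2, x, y):
    return np.mean((predict(w1, w2, x) - y) ** 2)

def sharpness_probe(w1, w2, x, y, rho=0.05, n=512):
    # Mean loss increase under random parameter perturbations of radius rho.
    base = loss(w1, w2, x, y)
    deltas = rng.standard_normal((n, 2))
    deltas *= rho / np.linalg.norm(deltas, axis=1, keepdims=True)
    return np.mean([loss(w1 + d1, w2 + d2, x, y) - base for d1, d2 in deltas])

x = rng.standard_normal(64)
y = 2.0 * np.maximum(x, 0.0)               # target exactly realizable by the network
for alpha in [1.0, 10.0, 0.1]:
    w1, w2 = alpha * 1.0, 2.0 / alpha      # same function, different parameterization
    print(f"alpha = {alpha:5.1f}   loss = {loss(w1, w2, x, y):.1e}   "
          f"sharpness probe = {sharpness_probe(w1, w2, x, y):.4f}")
```

The loss stays at zero for every $\alpha$ while the probe value moves by orders of magnitude, which is why sharpness comparisons across runs need a parameterization-aware reading.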

Diagnostic questions:

  • Which assumption about reparameterization caveat is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

4. Core Theory II: Algorithms and Dynamics

This block develops Core Theory II: Algorithms and Dynamics for Optimization Landscape. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.

4.1 Algorithmic update for negative curvature

In this section, flatness is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Algorithmic update for negative curvature" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, flatness is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where flatness can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where flatness affects optimization but the model remains interpretable.
  • A transformer training diagnostic where flatness appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating flatness as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving flatness, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes flatness visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\lambda_{\max}\big(H_f(\boldsymbol{\theta}_t)\big)\, \eta \approx 2$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
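
A minimal illustration of the update this subsection has in mind: at a strict saddle the gradient vanishes, so plain gradient descent stalls, but a step along the eigenvector of most negative curvature decreases the objective. The toy objective $f(x,y)=x^2-y^2$ below is an assumed example with a constant, known Hessian, not a general-purpose saddle-escape routine.

```python
import numpy as np

def f(theta):
    x, y = theta
    return x**2 - y**2                     # strict saddle at the origin

def grad(theta):
    x, y = theta
    return np.array([2.0 * x, -2.0 * y])

H = np.diag([2.0, -2.0])                   # constant Hessian of this toy objective
eigvals, eigvecs = np.linalg.eigh(H)
v_neg = eigvecs[:, np.argmin(eigvals)]     # direction of most negative curvature

theta = np.zeros(2)                        # exactly at the saddle: the gradient vanishes
print("gradient norm at the saddle:", np.linalg.norm(grad(theta)))

# A negative-curvature step lowers the objective even though gradient descent stalls here.
for step in (0.1 * v_neg, -0.1 * v_neg):
    print(f"f after step {np.round(step, 2)} = {f(theta + step):+.4f}   (f at saddle = {f(theta):+.4f})")
```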

Diagnostic questions:

  • Which assumption about flatness is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

4.2 Stability role of degeneracy

In this section, reparameterization caveat is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Stability role of degeneracy" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, reparameterization caveat is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where reparameterization caveat can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where reparameterization caveat affects optimization but the model remains interpretable.
  • A transformer training diagnostic where reparameterization caveat appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating reparameterization caveat as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving the reparameterization caveat, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes reparameterization caveat visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\lambda_{\max}\big(H_f(\boldsymbol{\theta}_t)\big)\, \eta \approx 2$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
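
Degeneracy is easiest to see coordinate-wise on a diagonal quadratic, where gradient descent multiplies each coordinate by $1-\eta\lambda_i$ per step. The sketch below (a toy with hand-picked eigenvalues, not a realistic model) shows a stiff direction converging in one step, a nearly degenerate direction crawling, and an exactly flat direction never moving even though the loss looks converged.

```python
import numpy as np

lams = np.array([2.0, 1e-3, 0.0])          # stiff, nearly degenerate, exactly flat
theta = np.array([1.0, 1.0, 1.0])
eta = 0.5                                  # eta * lambda_max = 1 < 2, so the stiff direction is stable

for t in range(2001):
    theta = theta - eta * lams * theta     # GD on f(theta) = 0.5 * sum(lams * theta^2)
    if t % 500 == 0:
        print(f"step {t:5d}   theta = {np.round(theta, 4)}   loss = {0.5 * np.sum(lams * theta**2):.6f}")
```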

Diagnostic questions:

  • Which assumption about reparameterization caveat is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

4.3 Rate or complexity controlled by symmetry

In this section, mode connectivity is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Rate or complexity controlled by symmetry" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, mode connectivity is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where mode connectivity can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where mode connectivity affects optimization but the model remains interpretable.
  • A transformer training diagnostic where mode connectivity appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating mode connectivity as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving mode connectivity, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes mode connectivity visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\lambda_{\max}\big(H_f(\boldsymbol{\theta}_t)\big)\, \eta \approx 2$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
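
One concrete symmetry that shapes mode connectivity is permutation of hidden units: permuting the columns of the first weight matrix and the rows of the second leaves the network's function unchanged, yet naively averaging an original and a permuted copy (as a checkpoint-averaging or model-soup step would) lands between basins. The sketch below illustrates this on an untrained toy MLP; the sizes and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(W1, W2, x):
    return np.maximum(x @ W1, 0.0) @ W2    # two-layer ReLU network

x = rng.standard_normal((8, 4))
W1 = rng.standard_normal((4, 16))
W2 = rng.standard_normal((16, 1))

perm = rng.permutation(16)                  # permute hidden units: a loss-preserving symmetry
W1p, W2p = W1[:, perm], W2[perm, :]

same = np.allclose(mlp(W1, W2, x), mlp(W1p, W2p, x))
print("permuted network computes the same function:", same)

# Naive weight averaging ignores the symmetry and lands between basins.
W1_avg, W2_avg = 0.5 * (W1 + W1p), 0.5 * (W2 + W2p)
gap = np.max(np.abs(mlp(W1_avg, W2_avg, x) - mlp(W1, W2, x)))
print(f"max output gap after averaging the two 'identical' networks: {gap:.3f}")
```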

Diagnostic questions:

  • Which assumption about mode connectivity is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

4.4 Diagnostic interpretation of the update path

In this section, linear interpolation is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Diagnostic interpretation of the update path" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, linear interpolation is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where linear interpolation can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where linear interpolation affects optimization but the model remains interpretable.
  • A transformer training diagnostic where linear interpolation appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating linear interpolation as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving linear interpolation, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes linear interpolation visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\lambda_{\max}\big(H_f(\boldsymbol{\theta}_t)\big)\, \eta \approx 2$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
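
The diagnostic itself is short to write: evaluate the loss at $\boldsymbol{\theta}(\alpha)=(1-\alpha)\boldsymbol{\theta}_A+\alpha\boldsymbol{\theta}_B$ for a grid of $\alpha$ between two independently trained solutions. The sketch below uses a small synthetic logistic-regression problem, an assumption chosen so everything runs in milliseconds; because that objective is convex, the path should stay low, which is the contrast case for the barriers seen in deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic logistic-regression problem.
X = rng.standard_normal((200, 5))
y = (X @ rng.standard_normal(5) + 0.3 * rng.standard_normal(200) > 0).astype(float)

def loss(w):
    z = X @ w                              # numerically stable logistic loss
    return np.mean(np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z))) - y * z)

def grad(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

def train(w0, steps=500, eta=0.5):
    w = w0.copy()
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

w_a = train(rng.standard_normal(5))        # two runs from different initializations
w_b = train(rng.standard_normal(5))

# Loss along the straight segment between the two solutions.
for alpha in np.linspace(0.0, 1.0, 5):
    w = (1.0 - alpha) * w_a + alpha * w_b
    print(f"alpha = {alpha:.2f}   loss = {loss(w):.4f}")
```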

Diagnostic questions:

  • Which assumption about linear interpolation is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

4.5 Connection to the next section in the chapter

In this section, curve finding is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Connection to the next section in the chapter" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, curve finding is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where curve finding can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where curve finding affects optimization but the model remains interpretable.
  • A transformer training diagnostic where curve finding appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating curve finding as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving curve finding, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes curve finding visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\lambda_{\max}\big(H_f(\boldsymbol{\theta}_t)\big)\, \eta \approx 2$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
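
Curve finding goes one step beyond linear interpolation: parameterize a path between two solutions (here a quadratic Bezier with a trainable control point) and move the control point to lower the average loss along the path. The sketch below is a deliberately crude finite-difference version on a ring-shaped toy loss, meant only to show the shape of the procedure, not the published algorithm.

```python
import numpy as np

def loss(theta):
    x, y = theta
    return (x**2 + y**2 - 1.0) ** 2        # ring-shaped valley of minima

theta_a = np.array([-1.0, 0.0])            # two minima on the ring
theta_b = np.array([+1.0, 0.0])

def bezier(c, t):
    # Quadratic Bezier path from theta_a to theta_b with control point c.
    t = t[:, None]
    return (1 - t)**2 * theta_a + 2 * t * (1 - t) * c + t**2 * theta_b

def path_loss(c, n=21):
    t = np.linspace(0.0, 1.0, n)
    return np.mean([loss(p) for p in bezier(c, t)])

# Crude curve finding: nudge the control point with finite-difference gradients.
# The straight line corresponds to c = (0, 0); we start slightly off-axis to
# break the symmetry of this toy loss.
c, eps, eta = np.array([0.0, 0.5]), 1e-4, 0.2
for _ in range(300):
    g = np.array([(path_loss(c + eps * e) - path_loss(c - eps * e)) / (2 * eps)
                  for e in np.eye(2)])
    c = c - eta * g

print("control point found:", np.round(c, 3))
print(f"mean loss along straight line = {path_loss(np.array([0.0, 0.0])):.3f}   "
      f"along curved path = {path_loss(c):.3f}")
```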

Diagnostic questions:

  • Which assumption about curve finding is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
