
Optimization Landscape: Applications in Machine Learning through References

7. Applications in Machine Learning

This block develops applications in machine learning for Optimization Landscape. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.

7.1 sharpness-aware minimization and flat-minimum heuristics

In this section, critical point is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "sharpness-aware minimization and flat-minimum heuristics" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, critical point is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where critical point can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where critical point affects optimization but the model remains interpretable.
  • A transformer training diagnostic where critical point appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating critical point as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving critical point, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes critical point visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
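
A minimal sketch of the SAM-style perturb-then-update step on a synthetic quadratic, assuming an $\ell_2$ neighborhood of illustrative radius rho; the constants are toy choices, not tuned recommendations. It also prints the $\lambda_{\max}\,\eta$ product from the threshold above as the loggable stability proxy.

```python
import numpy as np

H = np.diag([10.0, 1.0])              # toy quadratic f(theta) = 0.5 theta^T H theta
grad = lambda th: H @ th

theta = np.array([1.0, 1.0])
eta, rho = 0.05, 0.1                  # step size and (assumed) SAM radius

for _ in range(100):
    g = grad(theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # ascent step within an L2 ball
    theta -= eta * grad(theta + eps)              # descend from the perturbed point

print(np.linalg.eigvalsh(H).max() * eta)          # 0.5 < 2: stable regime
```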

Diagnostic questions:

  • Which assumption about critical point is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

7.2 mode connectivity behind checkpoint averaging and model soups

In this section, local minimum is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "mode connectivity behind checkpoint averaging and model soups" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, local minimum is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where local minimum can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where local minimum affects optimization but the model remains interpretable.
  • A transformer training diagnostic where local minimum appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating local minimum as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving local minimum, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes local minimum visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
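
A minimal sketch of checkpoint averaging on a convex toy objective, assuming both "checkpoints" land in one basin; real model soups average trained networks, which this tiny quadratic only caricatures.

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([10.0, 1.0])
f = lambda th: 0.5 * th @ H @ th          # one convex basin with minimum at 0

theta_a = rng.normal(scale=0.5, size=2)   # two "checkpoints" near the minimum
theta_b = rng.normal(scale=0.5, size=2)
soup = 0.5 * (theta_a + theta_b)          # uniform soup of the two checkpoints

print(f(theta_a), f(theta_b), f(soup))
# Convexity gives f(soup) <= (f(theta_a) + f(theta_b)) / 2; mode connectivity
# is the empirical claim that real checkpoints often behave this way too.
```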

Diagnostic questions:

  • Which assumption about local minimum is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

7.3 edge-of-stability behavior in large neural-network training

In this section, saddle point is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "edge-of-stability behavior in large neural-network training" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, saddle point is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where saddle point can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where saddle point affects optimization but the model remains interpretable.
  • A transformer training diagnostic where saddle point appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating saddle point as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving saddle point, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes saddle point visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
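
A minimal numerical check of the $\lambda_{\max}\,\eta \approx 2$ threshold above, assuming exact full-batch gradient descent on a quadratic; the two step sizes are chosen to straddle $2/\lambda_{\max}$.

```python
import numpy as np

H = np.diag([10.0, 1.0])                  # lambda_max = 10, threshold eta = 0.2
f = lambda th: 0.5 * th @ H @ th
grad = lambda th: H @ th

for eta in (0.15, 0.25):                  # straddle 2 / lambda_max
    theta = np.array([1.0, 1.0])
    for _ in range(50):
        theta -= eta * grad(theta)
    print(eta, 10.0 * eta, f(theta))
# eta = 0.15: lambda_max * eta = 1.5 < 2, the loss decays.
# eta = 0.25: lambda_max * eta = 2.5 > 2, the sharp direction blows up.
```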

Diagnostic questions:

  • Which assumption about saddle point is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

7.4 Hessian-spectrum diagnostics for loss spikes and instability

In this section, strict saddle is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Hessian-spectrum diagnostics for loss spikes and instability" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, strict saddle is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where strict saddle can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where strict saddle affects optimization but the model remains interpretable.
  • A transformer training diagnostic where strict saddle appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating strict saddle as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving strict saddle, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes strict saddle visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
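
A minimal sketch of estimating $\lambda_{\max}$ without materializing the Hessian, assuming finite-difference Hessian-vector products are accurate enough on this toy objective; the function name hvp is an illustrative helper, not a library call.

```python
import numpy as np

H = np.diag([10.0, 1.0])                  # stand-in for an implicit Hessian
grad = lambda th: H @ th

def hvp(theta, v, eps=1e-4):
    """Finite-difference Hessian-vector product: H v without forming H."""
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2.0 * eps)

theta = np.array([1.0, 1.0])
v = np.random.default_rng(0).normal(size=2)
for _ in range(50):                       # power iteration on the HVP operator
    v = hvp(theta, v)
    v /= np.linalg.norm(v)

print(v @ hvp(theta, v))                  # Rayleigh quotient, approx 10
```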

Diagnostic questions:

  • Which assumption about strict saddle is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

7.5 Diagnostic checklist for real experiments

In this section, plateau is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Diagnostic checklist for real experiments" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, plateau is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where plateau can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where plateau affects optimization but the model remains interpretable.
  • A transformer training diagnostic where plateau appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating plateau as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving plateau, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes plateau visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
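
A minimal sketch of a plateau diagnostic on a quartic bowl, assuming the signature of interest is "loss barely moves while the gradient norm is tiny"; the thresholds for actually flagging a plateau are left to the reader.

```python
import numpy as np

f = lambda th: np.sum(th ** 4)            # quartic bowl: gradients vanish fast
grad = lambda th: 4.0 * th ** 3

theta, eta = np.array([0.5, -0.5]), 0.1
prev_loss = f(theta)
for step in range(1, 201):
    g = grad(theta)
    theta -= eta * g
    loss = f(theta)
    if step % 50 == 0:
        # Plateau signature: loss nearly flat AND gradient norm tiny.
        print(step, loss, abs(prev_loss - loss), np.linalg.norm(g))
    prev_loss = loss
```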

Diagnostic questions:

  • Which assumption about plateau is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

8. Implementation and Diagnostics

This block develops implementation and diagnostics for Optimization Landscape. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.

8.1 Minimal NumPy experiment for mode connectivity

In this section, strict saddle is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Minimal NumPy experiment for mode connectivity" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, strict saddle is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where strict saddle can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where strict saddle affects optimization but the model remains interpretable.
  • A transformer training diagnostic where strict saddle appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating strict saddle as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving strict saddle, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes strict saddle visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
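
A minimal NumPy experiment for this subsection, assuming a one-dimensional double-well stands in for two trained checkpoints; a positive barrier along the segment is exactly the failure of linear mode connectivity.

```python
import numpy as np

f = lambda x: (x ** 2 - 1.0) ** 2         # double-well: minima at x = -1, +1

theta_a, theta_b = -1.0, 1.0
alphas = np.linspace(0.0, 1.0, 21)
path = [f((1 - a) * theta_a + a * theta_b) for a in alphas]

barrier = max(path) - max(f(theta_a), f(theta_b))
print(barrier)    # 1.0 > 0: the two minima are not linearly mode-connected
```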

Diagnostic questions:

  • Which assumption about strict saddle is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

8.2 Monitoring signal for linear interpolation

In this section, plateau is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Monitoring signal for linear interpolation" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, plateau is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where plateau can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where plateau affects optimization but the model remains interpretable.
  • A transformer training diagnostic where plateau appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating plateau as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving plateau, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes plateau visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
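
A minimal sketch of the interpolation-barrier scalar one could log during training, assuming the helper name interpolation_barrier is a hypothetical convenience, not a library function.

```python
import numpy as np

def interpolation_barrier(loss_fn, theta_a, theta_b, n=11):
    """Scalar monitoring signal: max loss on the segment minus endpoint max."""
    alphas = np.linspace(0.0, 1.0, n)
    path = [loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas]
    return max(path) - max(path[0], path[-1])

H = np.diag([10.0, 1.0])
loss = lambda th: 0.5 * th @ H @ th
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(interpolation_barrier(loss, a, b))   # 0.0: convex, no barrier to report
```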

Diagnostic questions:

  • Which assumption about plateau is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

8.3 Failure signature for curve finding

In this section, Hessian spectrum is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Failure signature for curve finding" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, Hessian spectrum is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where Hessian spectrum can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where Hessian spectrum affects optimization but the model remains interpretable.
  • A transformer training diagnostic where Hessian spectrum appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating Hessian spectrum as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving Hessian spectrum, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes Hessian spectrum visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
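
A minimal sketch of the failure signature, assuming the same double-well used for mode connectivity above: negative curvature at the straight path's midpoint is the signal that a curved connector has to bend around a ridge.

```python
# Double-well f(x) = (x**2 - 1)**2 has second derivative f''(x) = 12 x**2 - 4.
def curvature(x):
    return 12.0 * x ** 2 - 4.0

theta_a, theta_b = -1.0, 1.0
midpoint = 0.5 * (theta_a + theta_b)
print(curvature(midpoint))   # -4.0: negative curvature on the straight path,
                             # the failure signature that curve finding targets
```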

Diagnostic questions:

  • Which assumption about Hessian spectrum is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

8.4 Framework-level implementation pattern

In this section, negative curvature is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Framework-level implementation pattern" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, negative curvature is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where negative curvature can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where negative curvature affects optimization but the model remains interpretable.
  • A transformer training diagnostic where negative curvature appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating negative curvature as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving negative curvature, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes negative curvature visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
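
A minimal sketch of the wrap-the-step logging pattern in plain Python, assuming the function name and diagnostic keys are hypothetical; real frameworks expose analogous step-end callbacks for exactly this kind of per-step record.

```python
import numpy as np

def sgd_step_with_diagnostics(theta, grad_fn, eta, log):
    """One SGD step that also appends the diagnostics this section names."""
    g = grad_fn(theta)
    log.append({
        "grad_norm": float(np.linalg.norm(g)),
        "update_norm": float(eta * np.linalg.norm(g)),
        "rel_update": float(eta * np.linalg.norm(g) /
                            (np.linalg.norm(theta) + 1e-12)),
    })
    return theta - eta * g

H = np.diag([10.0, 1.0])
theta, log = np.array([1.0, 1.0]), []
for _ in range(5):
    theta = sgd_step_with_diagnostics(theta, lambda t: H @ t, 0.05, log)
print(log[-1])
```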

Diagnostic questions:

  • Which assumption about negative curvature is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

8.5 Reproducibility and logging checklist

In this section, degeneracy is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Reproducibility and logging checklist" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, degeneracy is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where degeneracy can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where degeneracy affects optimization but the model remains interpretable.
  • A transformer training diagnostic where degeneracy appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating degeneracy as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$\nabla f(\boldsymbol{\theta}^*)=\mathbf{0}, \qquad H_f(\boldsymbol{\theta}^*) \text{ determines local curvature}$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving degeneracy, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes degeneracy visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
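
A minimal reproducibility sketch, assuming NumPy's Generator seeding and an .npz file as the checkpoint format; in a real run the checkpoint would also carry optimizer moments and scheduler state, as the mistakes table below stresses.

```python
import numpy as np

seed = 1234                        # fix the seed and store it with the run
rng = np.random.default_rng(seed)

theta = rng.normal(size=4)
checkpoint = {
    "seed": seed,
    "step": 0,
    "theta": theta.copy(),
    "eta": 0.05,
}
np.savez("ckpt_step0.npz", **{k: np.asarray(v) for k, v in checkpoint.items()})

restored = dict(np.load("ckpt_step0.npz"))
assert np.allclose(restored["theta"], theta)   # round-trip sanity check
```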

Diagnostic questions:

  • Which assumption about degeneracy is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • sharpness-aware minimization and flat-minimum heuristics.
  • mode connectivity behind checkpoint averaging and model soups.
  • edge-of-stability behavior in large neural-network training.
  • Hessian-spectrum diagnostics for loss spikes and instability.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

9. Common Mistakes

Each entry lists the mistake, why it is wrong, and the fix:

  1. Using a recipe without checking assumptions. Why it is wrong: optimization guarantees depend on smoothness, convexity, stochasticity, or feasibility assumptions. Fix: write the assumptions next to the update rule before choosing hyperparameters.
  2. Confusing objective decrease with validation improvement. Why it is wrong: the optimizer sees the training objective; validation behavior also depends on generalization and data-split quality. Fix: track objective, train metric, validation metric, and update norm separately.
  3. Treating all norms as interchangeable. Why it is wrong: the geometry changes when the norm changes, especially for constraints and regularizers. Fix: state whether you use $\ell_1$, $\ell_2$, Frobenius, spectral, or another norm.
  4. Ignoring scale. Why it is wrong: learning rates, penalties, curvature, and gradient norms are all scale-sensitive. Fix: normalize units and inspect the effective update size $\lVert \Delta\boldsymbol{\theta}\rVert_2 / \lVert\boldsymbol{\theta}\rVert_2$, as sketched after this list.
  5. Overfitting to a single seed. Why it is wrong: optimization can look stable for one seed and fail under another. Fix: run small seed sweeps for important claims.
  6. Hiding instability behind smoothed plots. Why it is wrong: a moving average can hide spikes, divergence, and bad curvature events. Fix: plot raw metrics alongside smoothed metrics.
  7. Using test data during tuning. Why it is wrong: this contaminates the final evaluation. Fix: reserve test data until after model and hyperparameter selection.
  8. Assuming large models make theory irrelevant. Why it is wrong: large models often make diagnostics more important because failures are expensive. Fix: use theory to decide what to log, not to pretend every theorem applies exactly.
  9. Mixing optimizer state with model state carelessly. Why it is wrong: state corruption changes the effective algorithm. Fix: checkpoint parameters, gradients if needed, optimizer moments, scheduler state, and random seeds.
  10. Not checking numerical precision. Why it is wrong: BF16, FP16, FP8, and accumulation choices can change the observed optimizer. Fix: cross-check suspicious runs against higher precision on a small batch.
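
A minimal sketch of the effective update size from mistake 4, assuming synthetic parameter and update vectors; the 1e-3 figure in the comment is a commonly quoted heuristic, not a guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=1000)              # current parameters
delta = -1e-3 * rng.normal(size=1000)      # one step's parameter update

rel_update = np.linalg.norm(delta) / np.linalg.norm(theta)
print(rel_update)   # often-watched heuristic: values near 1e-3 per step
```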

10. Exercises

  1. Exercise 1 [*] - Saddle Point. (a) Define saddle point using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.

$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  2. Exercise 2 [*] - Plateau. (a) Define plateau using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.

$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  3. Exercise 3 [*] - Negative Curvature. (a) Define negative curvature using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.

$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  4. Exercise 4 [**] - Symmetry. (a) Define symmetry using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.

$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  5. Exercise 5 [**] - Basin of Attraction. (a) Define basin of attraction using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.

$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  6. Exercise 6 [**] - Sharpness. (a) Define sharpness using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.

$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  7. Exercise 7 [**] - Reparameterization Caveat. (a) Define reparameterization caveat using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.

$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  8. Exercise 8 [***] - Linear Interpolation. (a) Define linear interpolation using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.

$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  9. Exercise 9 [***] - SWA. (a) Define SWA using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.

$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  10. Exercise 10 [***] - Edge of Stability. (a) Define edge of stability using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.

$\lambda_{\max}(H_f(\boldsymbol{\theta}_t))\, \eta \approx 2$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
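
A minimal starter for part (d) of Exercise 1, assuming the classic two-dimensional saddle $f(x,y)=x^2-y^2$; the remaining exercises follow the same pattern with their own synthetic objectives.

```python
import numpy as np

f = lambda th: th[0] ** 2 - th[1] ** 2     # classic 2-D saddle at the origin
H = np.array([[2.0, 0.0], [0.0, -2.0]])    # constant Hessian of f

theta0 = np.zeros(2)
eigs = np.linalg.eigvalsh(H)
print(f(theta0), eigs)                     # 0.0 [-2. 2.]
assert eigs.min() < 0 < eigs.max()         # mixed signs: strict saddle
```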

11. Why This Matters for AI (2026 Perspective)

Concept and its AI impact:

  • critical point: sharpness-aware minimization and flat-minimum heuristics
  • local minimum: mode connectivity behind checkpoint averaging and model soups
  • saddle point: edge-of-stability behavior in large neural-network training
  • strict saddle: Hessian-spectrum diagnostics for loss spikes and instability
  • plateau: sharpness-aware minimization and flat-minimum heuristics
  • Hessian spectrum: mode connectivity behind checkpoint averaging and model soups
  • negative curvature: edge-of-stability behavior in large neural-network training
  • degeneracy: Hessian-spectrum diagnostics for loss spikes and instability
  • symmetry: sharpness-aware minimization and flat-minimum heuristics
  • overparameterization: mode connectivity behind checkpoint averaging and model soups

12. Conceptual Bridge

Optimization Landscape sits inside a chain. Earlier sections give the calculus, probability, and linear algebra needed to write the objective and interpret the update. Later sections use this material to reason about noisy gradients, adaptive state, regularization, tuning, schedules, and finally information-theoretic losses.

Backward link: Stochastic Optimization supplies the immediate prerequisite vocabulary.

Forward link: Adaptive Learning Rate uses this section as a building block.

+------------------------------------------------------------+
| Chapter 8: Optimization                                    |
|    01-Convex-Optimization          Convex Optimization    |
|    02-Gradient-Descent             Gradient Descent       |
|    03-Second-Order-Methods         Second-Order Methods   |
|    04-Constrained-Optimization     Constrained Optimization |
|    05-Stochastic-Optimization      Stochastic Optimization |
| >> 06-Optimization-Landscape       Optimization Landscape |
|    07-Adaptive-Learning-Rate       Adaptive Learning Rate |
|    08-Regularization-Methods       Regularization Methods |
|    09-Hyperparameter-Optimization  Hyperparameter Optimization |
|    10-Learning-Rate-Schedules      Learning Rate Schedules |
+------------------------------------------------------------+

Appendix A. Extended Derivation and Diagnostic Cards

References

  • Keskar et al., On Large-Batch Training for Deep Learning.
  • Foret et al., Sharpness-Aware Minimization.
  • Garipov et al., Loss Surfaces, Mode Connectivity, and Fast Ensembling.
  • Cohen et al., Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability.
  • Goodfellow, Bengio, and Courville, Deep Learning.
  • Bottou, Curtis, and Nocedal, Optimization Methods for Large-Scale Machine Learning.
  • PyTorch optimizer and scheduler documentation.
  • Optax documentation for composable optimizer transformations.
