Optimization Landscape: Part 5: Core Theory III: Practical Variants to 6. Advanced Topics
5. Core Theory III: Practical Variants
This block develops Core Theory III: Practical Variants for Optimization Landscape. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
5.1 Variant built around overparameterization
In this section, linear interpolation is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Variant built around overparameterization" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, linear interpolation is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the parameters, the objective value, the gradient, the update, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where linear interpolation can be computed directly and compared with theory.
- A logistic-regression or softmax objective where linear interpolation affects optimization but the model remains interpretable.
- A transformer training diagnostic where linear interpolation appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating linear interpolation as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving linear interpolation, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
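For the synthetic-quadratic case, this reasoning pattern can be checked numerically. The sketch below assumes a quadratic objective f(theta) = 0.5 * theta^T A theta - b^T theta with a symmetric positive-definite A, and interpolates theta(alpha) = (1 - alpha) * theta_a + alpha * theta_b between two arbitrary points. For any quadratic, the loss along that line equals the chord between the endpoint losses minus 0.5 * alpha * (1 - alpha) * d^T A d, where d = theta_b - theta_a. The function names and the random test problem are illustrative assumptions, not part of the lesson.

```python
import numpy as np

def quadratic_loss(theta, A, b):
    """f(theta) = 0.5 * theta^T A theta - b^T theta."""
    return 0.5 * theta @ A @ theta - b @ theta

def interpolation_profile(theta_a, theta_b, A, b, num_points=11):
    """Loss along theta(alpha) = (1 - alpha) * theta_a + alpha * theta_b."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([
        quadratic_loss((1.0 - a) * theta_a + a * theta_b, A, b) for a in alphas
    ])
    return alphas, losses

def predicted_profile(theta_a, theta_b, A, b, alphas):
    """Closed form for a quadratic: chord minus 0.5 * alpha * (1 - alpha) * d^T A d."""
    d = theta_b - theta_a
    chord = ((1.0 - alphas) * quadratic_loss(theta_a, A, b)
             + alphas * quadratic_loss(theta_b, A, b))
    return chord - 0.5 * alphas * (1.0 - alphas) * (d @ A @ d)

# Hypothetical test problem: a random symmetric positive-definite Hessian.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)
b = rng.normal(size=4)
theta_a = rng.normal(size=4)
theta_b = rng.normal(size=4)

alphas, measured = interpolation_profile(theta_a, theta_b, A, b)
predicted = predicted_profile(theta_a, theta_b, A, b, alphas)

# Largest disagreement between measurement and theory along the path.
max_gap = np.max(np.abs(measured - predicted))
# Barrier: how far the path rises above the chord between the endpoint losses.
barrier = np.max(measured - ((1 - alphas) * measured[0] + alphas * measured[-1]))
```

Because A is positive definite here, the measured profile never rises above the chord, so the barrier is zero. A positive barrier can only appear when the objective is non-convex along the path, which is the case that matters for neural networks.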
Implementation consequence:
- Log a metric that makes linear interpolation visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update rule before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about linear interpolation is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
5.2 Variant built around basin of attraction
In this section, curve finding is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Variant built around basin of attraction" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, curve finding is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the parameters, the objective value, the gradient, the update, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where curve finding can be computed directly and compared with theory.
- A logistic-regression or softmax objective where curve finding affects optimization but the model remains interpretable.
- A transformer training diagnostic where curve finding appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating curve finding as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving curve finding, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes curve finding visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update rule before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
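One metric that makes curve finding visible is the mean loss along a path between two minima. The sketch below is a toy version under stated assumptions: the loss is a synthetic ring-shaped function with minima on the unit circle, the path is a quadratic Bezier curve with one trainable bend point (a parameterization commonly used in mode-connectivity work, assumed here rather than taken from the lesson), and the bend is trained by plain gradient descent. All function names are hypothetical.

```python
import numpy as np

def ring_loss(p):
    """Synthetic loss with a ring of minima at radius 1: f(p) = (|p|^2 - 1)^2."""
    return (np.sum(p * p, axis=-1) - 1.0) ** 2

def ring_grad(p):
    """Gradient of ring_loss, evaluated row-wise."""
    r2 = np.sum(p * p, axis=-1, keepdims=True)
    return 4.0 * (r2 - 1.0) * p

def bezier(theta_a, bend, theta_b, t):
    """Quadratic Bezier path; t has shape (T,), the result has shape (T, 2)."""
    t = t[:, None]
    return (1 - t) ** 2 * theta_a + 2 * t * (1 - t) * bend + t ** 2 * theta_b

def fit_bend(theta_a, theta_b, bend0, lr=0.1, steps=500, num_t=101):
    """Gradient descent on the mean loss along the path, over the bend point only."""
    t = np.linspace(0.0, 1.0, num_t)
    weight = (2 * t * (1 - t))[:, None]      # derivative of the path w.r.t. the bend
    bend = bend0.astype(float).copy()
    for _ in range(steps):
        path = bezier(theta_a, bend, theta_b, t)
        bend -= lr * np.mean(weight * ring_grad(path), axis=0)
    return bend, t

theta_a = np.array([-1.0, 0.0])
theta_b = np.array([1.0, 0.0])
# Start slightly off the straight line; at exactly (0, 0) symmetry gives zero gradient.
bend, t = fit_bend(theta_a, theta_b, bend0=np.array([0.0, 0.1]))

midpoint = 0.5 * (theta_a + theta_b)   # a bend at the midpoint is the straight line
linear_mean_loss = np.mean(ring_loss(bezier(theta_a, midpoint, theta_b, t)))
curve_mean_loss = np.mean(ring_loss(bezier(theta_a, bend, theta_b, t)))
```

In this toy problem the straight line crosses the high-loss center of the ring, while the fitted curve bends around it. The gap between `linear_mean_loss` and `curve_mean_loss` is the kind of number worth logging.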
Diagnostic questions:
- Which assumption about curve finding is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
5.3 Variant built around barrier
In this section, SWA is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Variant built around barrier" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, SWA is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the parameters, the objective value, the gradient, the update, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where SWA can be computed directly and compared with theory.
- A logistic-regression or softmax objective where SWA affects optimization but the model remains interpretable.
- A transformer training diagnostic where SWA appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating SWA as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving SWA, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
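The stochastic case can be illustrated on a noisy quadratic. The sketch below assumes SWA means keeping a running average of the weights visited by SGD after a burn-in period, and uses Gaussian noise added to the exact gradient as a stand-in for minibatch noise. The curvatures, step size, and burn-in length are arbitrary illustrative choices.

```python
import numpy as np

def noisy_grad(w, curvatures, rng, noise_std=1.0):
    """Gradient of 0.5 * sum(c_i * w_i^2) plus Gaussian noise."""
    return curvatures * w + noise_std * rng.normal(size=w.shape)

def sgd_with_swa(w0, curvatures, lr=0.1, steps=2500, swa_start=500, seed=0):
    """Run SGD and keep a running average of the iterates visited after swa_start."""
    rng = np.random.default_rng(seed)
    w = w0.astype(float).copy()
    w_swa = np.zeros_like(w)
    n_averaged = 0
    for step in range(steps):
        w = w - lr * noisy_grad(w, curvatures, rng)
        if step >= swa_start:
            # Incremental mean: w_swa <- w_swa + (w - w_swa) / (n + 1)
            w_swa += (w - w_swa) / (n_averaged + 1)
            n_averaged += 1
    return w, w_swa, n_averaged

def loss(w, curvatures):
    """Noise-free objective, used only for evaluation."""
    return 0.5 * np.sum(curvatures * w ** 2)

curvatures = np.linspace(0.5, 2.0, 10)
w_last, w_swa, n_averaged = sgd_with_swa(np.full(10, 3.0), curvatures)
loss_last = loss(w_last, curvatures)
loss_swa = loss(w_swa, curvatures)
```

With a constant step size the last iterate keeps bouncing around the minimum, while the average cancels much of that noise. On this convex toy problem the averaged weights should therefore reach a lower noise-free loss than the last iterate; on a non-convex model the same comparison is an empirical question.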
Implementation consequence:
- Log a metric that makes SWA visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update rule before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about SWA is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
5.4 Implementation constraints and numerical stability
In this section, model soups are treated as a concrete optimization object rather than a slogan. The goal is to understand how they change the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Implementation constraints and numerical stability" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the model-soup construction is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the parameters, the objective value, the gradient, the update, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where model soups can be computed directly and compared with theory.
- A logistic-regression or softmax objective where model soups affect optimization but the model remains interpretable.
- A transformer training diagnostic where model soups appear through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating model soups as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving model soups, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
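The geometric case has a clean instance for model soups. If the loss is convex, Jensen's inequality says the loss of the averaged weights is at most the average of the member losses. The sketch below checks that inequality on least-squares regression, assuming a uniform soup (a plain mean of the weights) over members trained from a shared initialization. The (learning rate, steps) pairs are hypothetical.

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the linear model X @ w."""
    return np.mean((X @ w - y) ** 2)

def finetune(w_init, X, y, lr, steps):
    """Plain gradient descent on mean squared error from a shared initialization."""
    w = w_init.copy()
    for _ in range(steps):
        w -= lr * (2.0 / len(y)) * (X.T @ (X @ w - y))
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=200)

w_init = np.zeros(5)
configs = [(0.01, 20), (0.05, 10), (0.1, 5)]   # hypothetical (lr, steps) choices
members = [finetune(w_init, X, y, lr, steps) for lr, steps in configs]

soup = np.mean(members, axis=0)                # uniform soup: average the weights
member_losses = [mse(w, X, y) for w in members]
soup_loss = mse(soup, X, y)
mean_member_loss = float(np.mean(member_losses))
```

The inequality `soup_loss <= mean_member_loss` is guaranteed only because this objective is convex. For neural networks there is no such guarantee, and the same comparison becomes a test of whether the members sit in a region where the loss is close to convex along the segments joining them.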
Implementation consequence:
- Log a metric that makes model soups visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update rule before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about model soups is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
5.5 What belongs here versus neighboring sections
In this section, edge of stability is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "What belongs here versus neighboring sections" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, edge of stability is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the parameters, the objective value, the gradient, the update, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where edge of stability can be computed directly and compared with theory.
- A logistic-regression or softmax objective where edge of stability affects optimization but the model remains interpretable.
- A transformer training diagnostic where edge of stability appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating edge of stability as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving edge of stability, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
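The algorithmic case is easiest to see on a one-dimensional quadratic. For f(x) = 0.5 * c * x^2, gradient descent with step size lr multiplies x by (1 - lr * c) at every step, so it is a descent method exactly when lr < 2 / c. The sketch below runs the iteration just below and just above that threshold; the curvature value and the 5% margins are arbitrary illustrative choices.

```python
import numpy as np

def run_gd(curvature, lr, x0=1.0, steps=50):
    """Gradient descent on f(x) = 0.5 * curvature * x^2; returns |x_t| at every step."""
    x = x0
    trajectory = [abs(x)]
    for _ in range(steps):
        x = x - lr * curvature * x      # x_{t+1} = (1 - lr * curvature) * x_t
        trajectory.append(abs(x))
    return np.array(trajectory)

curvature = 10.0
threshold = 2.0 / curvature             # stability boundary for the step size

stable = run_gd(curvature, lr=0.95 * threshold)    # |1 - lr * curvature| = 0.9
unstable = run_gd(curvature, lr=1.05 * threshold)  # |1 - lr * curvature| = 1.1
```

The same threshold read the other way gives the quantity usually logged in practice: for a fixed step size lr, compare the largest Hessian eigenvalue with 2 / lr.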
Implementation consequence:
- Log a metric that makes edge of stability visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update rule before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about edge of stability is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
6. Advanced Topics
This block develops advanced topics for Optimization Landscape. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
6.1 Advanced view of sharpness
In this section, model soups are treated as a concrete optimization object rather than a slogan. The goal is to understand how they change the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Advanced view of sharpness" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the model-soup construction is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the parameters, the objective value, the gradient, the update, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where model soups can be computed directly and compared with theory.
- A logistic-regression or softmax objective where model soups affect optimization but the model remains interpretable.
- A transformer training diagnostic where model soups appear through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating model soups as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving model soups, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes model soups visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update rule before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
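One metric of the kind this list asks for is the sharpness named in the subsection heading, taken here to mean the largest Hessian eigenvalue. The sketch below estimates it by power iteration on Hessian-vector products, which are approximated with a central difference of gradients. The callable `grad_fn` is a hypothetical stand-in for whatever returns the full-batch gradient; in a real framework an autodiff Hessian-vector product would normally replace the finite difference. Power iteration finds the eigenvalue of largest magnitude, so the sketch assumes that eigenvalue is positive.

```python
import numpy as np

def hvp_finite_diff(grad_fn, theta, v, eps=1e-4):
    """Approximate H @ v with a central difference of gradients."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

def top_hessian_eigenvalue(grad_fn, theta, iters=200, seed=0):
    """Power iteration on Hessian-vector products; returns the Rayleigh quotient."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=theta.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(iters):
        hv = hvp_finite_diff(grad_fn, theta, v)
        eigenvalue = float(v @ hv)          # v has unit norm, so this is v^T H v
        v = hv / np.linalg.norm(hv)
    return eigenvalue

# Synthetic check: a quadratic whose Hessian spectrum is known in advance.
H = np.diag([5.0, 2.0, 1.0, 0.5])
grad_fn = lambda theta: H @ theta
sharpness = top_hessian_eigenvalue(grad_fn, np.ones(4))

lr = 0.3
sharpness_times_lr = sharpness * lr         # compare against 2 for gradient descent
```

Logging `sharpness_times_lr` alongside the loss gives a single dimensionless number: for plain gradient descent on a quadratic model, values above 2 indicate a step size that is too large for the current curvature.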
Diagnostic questions:
- Which assumption about model soups is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
6.2 Advanced view of flatness
In this section, edge of stability is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Advanced view of flatness" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, edge of stability is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the parameters, the objective value, the gradient, the update, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where edge of stability can be computed directly and compared with theory.
- A logistic-regression or softmax objective where edge of stability affects optimization but the model remains interpretable.
- A transformer training diagnostic where edge of stability appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating edge of stability as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving edge of stability, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
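For the flatness named in this subsection's heading, the geometric picture becomes a simple inequality at a quadratic minimum. If the Hessian H is positive semidefinite, moving a distance of at most rho away from the minimum raises the loss by at most 0.5 * rho^2 * lambda_max(H). The sketch below compares that bound for a flat and a sharp Hessian and checks it against random directions. Both Hessians and the radius rho are made-up illustrative values, and this worst-case measure is one of several possible flatness definitions.

```python
import numpy as np

def worst_case_increase(H, rho):
    """Max over ||d|| <= rho of 0.5 * d^T H d, at a minimum with PSD Hessian H."""
    top = np.linalg.eigvalsh(H)[-1]         # eigvalsh returns ascending eigenvalues
    return 0.5 * rho ** 2 * top

def sampled_increase(H, rho, num_samples=2000, seed=0):
    """Largest loss increase seen over random directions on the sphere of radius rho."""
    rng = np.random.default_rng(seed)
    d = rng.normal(size=(num_samples, H.shape[0]))
    d *= rho / np.linalg.norm(d, axis=1, keepdims=True)
    return np.max(0.5 * np.einsum("ni,ij,nj->n", d, H, d))

H_flat = np.diag([0.2, 0.1])
H_sharp = np.diag([20.0, 0.1])
rho = 0.05

flat_bound = worst_case_increase(H_flat, rho)
sharp_bound = worst_case_increase(H_sharp, rho)
sharp_sampled = sampled_increase(H_sharp, rho)
```

Random sampling can only approach the bound from below. In high dimensions it approaches very slowly, which is one reason gradient-based perturbations are usually preferred over random ones when estimating worst-case sharpness.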
Implementation consequence:
- Log a metric that makes edge of stability visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update rule before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about edge of stability is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
6.3 Advanced view of reparameterization caveat
In this section, catapult dynamics is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Advanced view of reparameterization caveat" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, catapult dynamics is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the parameters, the objective value, the gradient, the update, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where catapult dynamics can be computed directly and compared with theory.
- A logistic-regression or softmax objective where catapult dynamics affects optimization but the model remains interpretable.
- A transformer training diagnostic where catapult dynamics appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating catapult dynamics as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving catapult dynamics, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes catapult dynamics visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update rule before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
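The advice to keep units straight is exactly what the reparameterization caveat in this subsection's heading is about: curvature is measured in units of the parameters, so it changes when the parameters are rescaled even if the function does not. The sketch below uses a hypothetical two-parameter model whose output is the product a * b, fit to the target 1. Rescaling (a, b) to (10 * a, b / 10) leaves the output and the loss unchanged but changes the largest Hessian eigenvalue.

```python
import numpy as np

def loss(a, b):
    """Two-parameter model whose output is the product a * b, fit to target 1."""
    return (a * b - 1.0) ** 2

def hessian(a, b):
    """Analytic Hessian of (a * b - 1)^2 with respect to (a, b)."""
    off_diag = 4.0 * a * b - 2.0
    return np.array([[2.0 * b ** 2, off_diag],
                     [off_diag, 2.0 * a ** 2]])

def sharpness(a, b):
    """Largest Hessian eigenvalue at the point (a, b)."""
    return np.linalg.eigvalsh(hessian(a, b))[-1]

balanced = (1.0, 1.0)
rescaled = (10.0, 0.1)      # same product a * b, hence the same function and loss

sharp_balanced = sharpness(*balanced)    # 2 * (a^2 + b^2) = 4 at (1, 1)
sharp_rescaled = sharpness(*rescaled)    # 2 * (a^2 + b^2) = 200.02 at (10, 0.1)
```

Both points are global minima of the same function, yet their sharpness differs by a factor of about fifty. A sharpness number is therefore only comparable across runs that share a parameterization, or after normalizing for scale.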
Diagnostic questions:
- Which assumption about catapult dynamics is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
6.4 Infinite-dimensional or large-scale interpretation
In this section, the critical point is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Infinite-dimensional or large-scale interpretation" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the critical point is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the parameters, the objective value, the gradient, the update, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the critical point can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the critical-point structure affects optimization but the model remains interpretable.
- A transformer training diagnostic where a critical point appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating critical-point analysis as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving the critical point, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
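For the synthetic-quadratic case, the local model at a critical point is determined by the Hessian, and the signs of its eigenvalues give the standard second-order classification. The sketch below is a minimal version, assuming a symmetric Hessian and a hypothetical tolerance `tol` below which an eigenvalue is treated as zero. The three example Hessians are hand-picked.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Label a critical point from the signs of the Hessian eigenvalues."""
    eigenvalues = np.linalg.eigvalsh(hessian)
    if np.any(np.abs(eigenvalues) <= tol):
        return "degenerate"             # the second-order test is inconclusive
    if np.all(eigenvalues > 0):
        return "local minimum"
    if np.all(eigenvalues < 0):
        return "local maximum"
    return "saddle"

def negative_fraction(hessian, tol=1e-8):
    """Fraction of eigenvalues below -tol, a simple index-style diagnostic."""
    eigenvalues = np.linalg.eigvalsh(hessian)
    return float(np.mean(eigenvalues < -tol))

bowl = np.diag([2.0, 1.0])       # Hessian of f = x^2 + 0.5 * y^2
saddle = np.diag([2.0, -2.0])    # Hessian of f = x^2 - y^2
flat = np.diag([2.0, 0.0])       # Hessian of f = x^2, flat along y

labels = [classify_critical_point(h) for h in (bowl, saddle, flat)]
```

At large scale the full eigendecomposition is not affordable, so in practice only a few extreme eigenvalues are estimated. The classification logic stays the same, but "all eigenvalues positive" can then no longer be certified directly.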
Implementation consequence:
- Log a metric that makes a critical point visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update rule before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the critical point is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
6.5 Open questions for frontier model training
In this section, the local minimum is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Open questions for frontier model training" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the local minimum is the part of Optimization Landscape that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the parameters, the objective value, the gradient, the update, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the local minimum can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the local minimum affects optimization but the model remains interpretable.
- A transformer training diagnostic where a local minimum appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating a local minimum as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving the local minimum, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
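A one-dimensional example shows why a descent argument is local. The sketch below assumes a tilted double-well objective f(x) = (x^2 - 1)^2 + 0.3 * x, chosen only for illustration, which has two local minima of different depth. Gradient descent is a descent method from either starting point, yet the minimum it reaches depends on where it starts.

```python
def f(x):
    """Tilted double well with two local minima of different depth."""
    return (x ** 2 - 1.0) ** 2 + 0.3 * x

def grad(x):
    """First derivative of f."""
    return 4.0 * x * (x ** 2 - 1.0) + 0.3

def curvature(x):
    """Second derivative of f."""
    return 12.0 * x ** 2 - 4.0

def descend(x0, lr=0.05, steps=200):
    """Plain gradient descent from x0 with a fixed step size."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

x_left = descend(-1.5)     # ends in the deeper well, on the negative side
x_right = descend(1.5)     # ends in the shallower well, on the positive side
```

Both end points satisfy the first- and second-order conditions for a local minimum (near-zero gradient, positive curvature), but only `x_left` is the global minimum. The local checks cannot tell the two apart; only the comparison of `f(x_left)` and `f(x_right)` can.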
Implementation consequence:
- Log a metric that makes a local minimum visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update rule before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the local minimum is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.