Optimization Landscape: Part 1. Intuition and Part 2. Formal Definitions
1. Intuition
This block develops intuition for Optimization Landscape. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
1.1 Why Optimization Landscape matters for training systems
In this section, a plateau is treated as a concrete optimization object rather than a slogan: a region of parameter space where the gradient is nearly zero even though the objective is still far from its minimum. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Why Optimization Landscape matters for training systems" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
A plateau of an objective L is a connected region P of parameter space on which the gradient is uniformly small, ‖∇L(θ)‖ ≤ ε for all θ in P, while the suboptimality gap L(θ) − inf L stays bounded below by some δ > 0.
Symbolically, we track it through the iterate θ_t, the objective L(θ_t), the gradient ∇L(θ_t), the Hessian ∇²L(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic objective such as L(θ) = 1 − exp(−θ²), whose gradient vanishes for large |θ| while the loss stays near 1, so the plateau can be computed directly and compared with theory.
- A logistic-regression or softmax objective with saturated units, where the plateau slows optimization but the model remains interpretable.
- A transformer training diagnostic where a plateau appears as a long stretch of small gradient norms and nearly flat training and validation loss.
Non-examples:
- Treating a plateau as a hyperparameter recipe without checking the objective assumptions.
- Inferring a plateau from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
If ‖∇L(θ)‖ ≤ ε on P, a gradient step with rate η decreases L by roughly η‖∇L(θ)‖² ≤ ηε² to first order, so progress across a plateau is slow in proportion to ε².
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the gradient term, and use the section assumptions to bound the change in objective value: L(θ_t − η∇L(θ_t)) ≈ L(θ_t) − η‖∇L(θ_t)‖², which on a plateau is a change of at most ηε². If the assumption is geometric, the proof turns a picture into an inequality; if stochastic, take conditional expectation before applying the bound; if algorithmic, check that the update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes plateau visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about plateau is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
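To make the plateau definition concrete, here is a minimal numpy sketch. The toy objective L(θ) = 1 − exp(−θ²) is an illustrative choice, not something fixed by this lesson: far from the origin its gradient is tiny even though the loss gap is nearly 1, which is exactly the plateau signature a logged gradient norm would reveal.

```python
import numpy as np

def loss(theta):
    # Toy objective with a plateau: flat for large |theta|, curved near 0.
    return 1.0 - np.exp(-theta**2)

def grad(theta):
    # Analytic gradient of the toy objective.
    return 2.0 * theta * np.exp(-theta**2)

# On the plateau (theta = 3): tiny gradient, but the loss gap is almost 1.
plateau_grad = abs(grad(3.0))
plateau_gap = loss(3.0) - loss(0.0)

# In the active region (theta = 0.5): the gradient is roughly 1000x larger.
active_grad = abs(grad(0.5))
```

The pair (small gradient norm, large loss gap) is the measurable plateau diagnostic; either number alone is ambiguous.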
1.2 The optimization object: parameters, objective, algorithm, and diagnostic
In this section, the Hessian spectrum is treated as a concrete optimization object rather than a slogan: the eigenvalues λ₁ ≥ … ≥ λ_d of ∇²L(θ), which measure curvature along every direction at θ. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "The optimization object: parameters, objective, algorithm, and diagnostic" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
The Hessian spectrum at θ is the multiset of eigenvalues of ∇²L(θ). Its extremes control optimization: λ_max bounds the stable learning rate, the sign of λ_min classifies critical points, and for a positive-definite Hessian the ratio λ_max/λ_min is the local condition number.
Symbolically, we track it through the iterate θ_t, the objective L(θ_t), the gradient ∇L(θ_t), the Hessian ∇²L(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic L(θ) = ½θᵀAθ, where the spectrum is just the eigenvalues of A and can be computed directly and compared with theory.
- A logistic-regression or softmax objective, whose Hessian spectrum affects optimization while the model remains interpretable.
- A transformer training diagnostic where the spectrum is probed indirectly through Hessian-vector products, gradient norms, update norms, or the timing of loss spikes.
Non-examples:
- Treating the spectrum as a hyperparameter recipe without checking the objective assumptions.
- Inferring the full spectrum from one noisy minibatch estimate when the section requires a population or full-batch statement.
Useful formula:
For the quadratic L(θ) = ½θᵀAθ, gradient descent with step η multiplies each eigencoordinate by (1 − ηλ_i) per step, so the iteration is stable exactly when η < 2/λ_max.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, diagonalize the Hessian, and use the section assumptions to bound the change in objective value coordinate by coordinate: the component along eigenvector q_i contracts by |1 − ηλ_i|, which exceeds 1 when η > 2/λ_i. If the assumption is geometric, the proof turns a picture into an inequality; if stochastic, take conditional expectation before applying the bound; if algorithmic, check that the update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes Hessian spectrum visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about Hessian spectrum is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
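The stability threshold η < 2/λ_max can be checked numerically on a tiny quadratic. The matrix and step sizes below are illustrative values chosen to sit on either side of the threshold; they are not from the lesson text.

```python
import numpy as np

A = np.diag([1.0, 10.0])            # Hessian of L(theta) = 0.5 * theta^T A theta
lam_max = np.linalg.eigvalsh(A).max()  # sharpest curvature, here 10

def run_gd(eta, steps=100):
    # Plain gradient descent on the quadratic; gradient is A @ theta.
    theta = np.array([1.0, 1.0])
    for _ in range(steps):
        theta = theta - eta * A @ theta
    return np.linalg.norm(theta)

stable = run_gd(0.19)    # 0.19 < 2/10: every eigencoordinate contracts
unstable = run_gd(0.21)  # 0.21 > 2/10: the sharp direction oscillates and grows
```

Only λ_max matters for stability; the flat direction (λ = 1) converges happily at both step sizes.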
1.3 Historical arc from classical optimization to modern AI
In this section, negative curvature is treated as a concrete optimization object rather than a slogan: a direction v with vᵀ∇²L(θ)v < 0, along which the objective bends downward. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Historical arc from classical optimization to modern AI" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
A direction of negative curvature at θ is a unit vector v with vᵀ∇²L(θ)v < 0. At a critical point, such a direction certifies that θ is a saddle rather than a local minimum, because the objective decreases to second order along ±v.
Symbolically, we track it through the iterate θ_t, the objective L(θ_t), the gradient ∇L(θ_t), the Hessian ∇²L(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic saddle such as L(x, y) = x² − y², where the negative-curvature direction can be computed directly and compared with theory.
- A two-layer linear network L(w₁, w₂) = (w₁w₂ − 1)², whose saddle at the origin has an explicit negative-curvature direction while the model remains interpretable.
- A transformer training diagnostic where negative curvature appears through Hessian-vector probes, stalled gradient norms, or sudden loss drops after long plateaus.
Non-examples:
- Treating negative curvature as a hyperparameter recipe without checking the objective assumptions.
- Inferring negative curvature from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
L(θ + tv) ≈ L(θ) + t∇L(θ)ᵀv + ½t²vᵀ∇²L(θ)v; at a critical point with vᵀ∇²L(θ)v = λ < 0, the objective drops by roughly ½t²|λ|.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the curvature term, and use the section assumptions to bound the change in objective value: at a critical point the gradient term vanishes, so the quadratic term ½t²vᵀ∇²L(θ_t)v dominates and is negative along v. If the assumption is geometric, the proof turns a picture into an inequality; if stochastic, take conditional expectation before applying the bound; if algorithmic, check that the update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes negative curvature visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about negative curvature is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
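The second-order drop along a negative-curvature direction can be verified on the classic saddle L(x, y) = x² − y² (an illustrative toy): the Hessian's most negative eigenvector points along y, and a small step along it strictly decreases the loss even though the gradient at the origin is zero.

```python
import numpy as np

def loss(p):
    x, y = p
    return x**2 - y**2          # classic saddle at the origin

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])     # Hessian (constant for this quadratic)

lam, V = np.linalg.eigh(H)       # eigh returns eigenvalues in ascending order
v = V[:, 0]                      # eigenvector of the most negative eigenvalue

origin = np.zeros(2)
stepped = origin + 0.1 * v       # step of size t = 0.1 along +/- v
drop = loss(origin) - loss(stepped)   # predicted: 0.5 * t^2 * |lambda| = 0.01
```

The measured drop matches the formula ½t²|λ| exactly here because the objective is quadratic.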
1.4 What this section treats as canonical scope
In this section, degeneracy is treated as a concrete optimization object rather than a slogan: a critical point at which the Hessian has a zero eigenvalue, so second-order information cannot classify it. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "What this section treats as canonical scope" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
A critical point θ* is degenerate when ∇²L(θ*) has a zero eigenvalue. The corresponding null directions are flat to second order, the second-order test is inconclusive, and whole continua of critical points become possible.
Symbolically, we track it through the iterate θ_t, the objective L(θ_t), the gradient ∇L(θ_t), the Hessian ∇²L(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic objective such as L(x, y) = x², which is completely flat in y, so the degeneracy can be computed directly and compared with theory.
- An overparameterized least-squares or softmax objective with redundant features, where minimizers form a subspace but the model remains interpretable.
- A transformer training diagnostic where degeneracy appears as a large fraction of near-zero Hessian eigenvalues alongside gradient norms, update norms, or validation loss.
Non-examples:
- Treating degeneracy as a hyperparameter recipe without checking the objective assumptions.
- Inferring degeneracy from one noisy minibatch estimate when the section requires a population or full-batch statement.
Useful formula:
det ∇²L(θ*) = 0, equivalently 0 is an eigenvalue of ∇²L(θ*); along a null eigenvector v, L(θ* + tv) = L(θ*) + O(t³).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the curvature term along the null eigenvector, and use the section assumptions to bound the change in objective value: both the gradient and quadratic terms vanish along v, so any change is third order or smaller. If the assumption is geometric, the proof turns a picture into an inequality; if stochastic, take conditional expectation before applying the bound; if algorithmic, check that the update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes degeneracy visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about degeneracy is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
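Degeneracy is easiest to see on the toy objective L(x, y) = x² (an illustrative choice): the Hessian has eigenvalues 2 and 0, and moving any distance along the null eigenvector leaves the loss unchanged, which is what a flat direction means operationally.

```python
import numpy as np

def loss(p):
    x, y = p
    return x**2                  # degenerate: completely flat along y

H = np.array([[2.0, 0.0],
              [0.0, 0.0]])       # Hessian; the zero eigenvalue signals degeneracy

lam = np.linalg.eigvalsh(H)
flat_dir = np.array([0.0, 1.0])  # null eigenvector: a flat direction

base = loss(np.array([0.0, 0.0]))
moved = loss(np.array([0.0, 0.0]) + 5.0 * flat_dir)  # loss unchanged along it
```

Every point (0, y) is a minimizer here, so "the minimum" is a one-dimensional set, not a point.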
1.5 A first mental model for LLM training
In this section, symmetry is treated as a concrete optimization object rather than a slogan: a transformation of the parameters that leaves the objective unchanged. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "A first mental model for LLM training" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
A symmetry of the objective is a transformation T with L(Tθ) = L(θ) for every θ. Neural objectives carry many: permuting hidden units, and, for ReLU and other positively homogeneous activations, rescaling a unit's input weights by c > 0 while dividing its output weight by c. Symmetries map minima to minima, producing whole families of equivalent solutions.
Symbolically, we track it through the iterate θ_t, the objective L(θ_t), the gradient ∇L(θ_t), the Hessian ∇²L(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic network where the permutation symmetry can be verified directly and compared with theory.
- A softmax objective, whose shift invariance softmax(z + c1) = softmax(z) is a symmetry of the logits while the model remains interpretable.
- A transformer training diagnostic where symmetry appears through the interaction of weight decay with rescaling invariance, or through permutation alignment needed before checkpoint averaging.
Non-examples:
- Treating symmetry as a hyperparameter recipe without checking the objective assumptions.
- Inferring a symmetry from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
L(Tθ) = L(θ) for all θ; for a one-hidden-layer ReLU network, (W₁, w₂) and (diag(c)W₁, w₂ diag(1/c)) with c > 0 compute the same function.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term invariant under T, and use the section assumptions to bound the change in objective value: positive homogeneity gives relu(cz) = c·relu(z), so the rescaling cancels between layers and the output is unchanged. If the assumption is geometric, the proof turns a picture into an inequality; if stochastic, take conditional expectation before applying the bound; if algorithmic, check that the update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes symmetry visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about symmetry is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
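Both symmetries in the formula above can be verified mechanically on a tiny random network. The three-unit architecture, seed, and scaling vector below are illustrative choices for the check, not anything prescribed by the lesson.

```python
import numpy as np

def relu_net(x, W1, w2):
    # One-hidden-layer ReLU network: w2 . relu(W1 @ x)
    return w2 @ np.maximum(W1 @ x, 0.0)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
w2 = rng.normal(size=3)
x = rng.normal(size=2)

out = relu_net(x, W1, w2)

# Permutation symmetry: reorder hidden units consistently in both layers.
perm = [2, 0, 1]
out_perm = relu_net(x, W1[perm], w2[perm])

# Positive rescaling symmetry: scale each unit's input row up by c,
# its output weight down by c; ReLU's positive homogeneity makes it cancel.
c = np.array([2.0, 0.5, 3.0])
out_scaled = relu_net(x, c[:, None] * W1, w2 / c)
```

Any loss built on the network output inherits both invariances, so every minimum comes with a whole orbit of equivalent minima.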
2. Formal Definitions
This block develops formal definitions for Optimization Landscape. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
2.1 Primary definition: critical point
In this section, the critical point is treated as a concrete optimization object rather than a slogan: a parameter vector θ* at which the gradient vanishes. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Primary definition: critical point" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
A critical point of L is any θ* with ∇L(θ*) = 0. The Hessian spectrum at θ* classifies it: all eigenvalues positive gives a strict local minimum, all negative a strict local maximum, mixed signs a saddle, and a zero eigenvalue leaves the point degenerate and unclassified by second-order information.
Symbolically, we track it through the iterate θ_t, the objective L(θ_t), the gradient ∇L(θ_t), the Hessian ∇²L(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic objective such as f(x) = x³ − 3x, whose critical points x = ±1 can be computed directly and classified by the second derivative.
- A regularized logistic-regression objective, whose unique critical point is the global minimizer while the model remains interpretable.
- A transformer training diagnostic where proximity to a critical point is read off the gradient norm relative to parameter and update norms.
Non-examples:
- Treating critical points as a hyperparameter recipe without checking the objective assumptions.
- Declaring convergence to a critical point from one noisy minibatch gradient when the section requires a population or full-batch statement.
Useful formula:
∇L(θ*) = 0, with classification by the signs of the eigenvalues of ∇²L(θ*).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the gradient term, and use the section assumptions to bound the change in objective value: near a critical point the first-order term vanishes, so the Hessian term decides whether nearby points are higher, lower, or both. If the assumption is geometric, the proof turns a picture into an inequality; if stochastic, take conditional expectation before applying the bound; if algorithmic, check that the update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes proximity to a critical point visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the critical point is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
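The find-then-classify routine can be run end to end on the one-dimensional example f(x) = x³ − 3x (an illustrative toy): both critical points have exactly zero gradient, and the sign of the second derivative separates the local minimum from the local maximum.

```python
import numpy as np

def f(x):   return x**3 - 3.0 * x
def df(x):  return 3.0 * x**2 - 3.0   # gradient
def d2f(x): return 6.0 * x            # curvature

# Critical points solve df(x) = 0, i.e. x = +1 and x = -1.
crit = np.array([1.0, -1.0])
grads = df(crit)                      # both exactly zero

# Second-order test: d2f > 0 -> local min, d2f < 0 -> local max.
is_min = d2f(crit) > 0
```

In higher dimensions the scalar second derivative becomes the Hessian spectrum, but the logic is identical.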
2.2 Secondary definition: local minimum
In this section, the local minimum is treated as a concrete optimization object rather than a slogan: a point whose objective value is no larger than that of any nearby point. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Secondary definition: local minimum" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
A local minimum is a point θ* with L(θ*) ≤ L(θ) for all θ in some neighborhood of θ*. Necessary conditions: ∇L(θ*) = 0 and ∇²L(θ*) positive semidefinite. Sufficient condition: ∇L(θ*) = 0 and ∇²L(θ*) positive definite, which gives a strict local minimum.
Symbolically, we track it through the iterate θ_t, the objective L(θ_t), the gradient ∇L(θ_t), the Hessian ∇²L(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic whose unique local minimum can be computed directly and compared with theory.
- The objective L(x, y) = x² + y⁴ at the origin, a boundary case where the Hessian is semidefinite but not definite yet the point is still a minimum, with the model remaining interpretable.
- A transformer training diagnostic where the local geometry of a minimum is probed through sharpness measures, gradient norms, or validation loss under perturbation.
Non-examples:
- Treating local minima as a hyperparameter recipe without checking the objective assumptions.
- Declaring a local minimum from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
∇L(θ*) = 0 and ∇²L(θ*) ⪰ 0 are necessary; ∇L(θ*) = 0 and ∇²L(θ*) ≻ 0 are sufficient for a strict local minimum.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the quadratic term, and use the section assumptions to bound the change in objective value: if ∇²L(θ*) ≻ 0, every direction has positive curvature, so L(θ* + tv) − L(θ*) ≈ ½t²vᵀ∇²L(θ*)v > 0 for small t. If the assumption is geometric, the proof turns a picture into an inequality; if stochastic, take conditional expectation before applying the bound; if algorithmic, check that the update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the local geometry of the minimum visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the local minimum is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
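The gap between the necessary and sufficient conditions shows up already on the boundary-case toy L(x, y) = x² + y⁴: at the origin the Hessian is positive semidefinite (necessary condition holds) but not positive definite (sufficient condition fails), yet the point is a genuine minimum. A minimal sketch, with a small tolerance standing in for exact semidefiniteness:

```python
import numpy as np

def loss(p):
    x, y = p
    return x**2 + y**4        # the origin is a (global) minimum

def hessian_at_origin():
    # d2/dx2 = 2; d2/dy2 = 12*y^2 = 0 at the origin; cross terms vanish.
    return np.array([[2.0, 0.0],
                     [0.0, 0.0]])

lam = np.linalg.eigvalsh(hessian_at_origin())
necessary_ok = lam.min() >= -1e-12   # PSD (up to numerical tolerance): holds
sufficient_ok = lam.min() > 1e-12    # PD: fails, because of the flat y direction

# The origin really is a minimum: nearby points all have larger loss.
worse = loss(np.array([0.01, 0.01])) > loss(np.zeros(2))
```

When the second-order test is inconclusive like this, higher-order terms (here y⁴) decide the classification.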
2.3 Algorithmic object: saddle point
In this section, the saddle point is treated as a concrete optimization object rather than a slogan: a critical point with curvature of both signs. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Algorithmic object: saddle point" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
A saddle point is a critical point whose Hessian has at least one positive and one negative eigenvalue: ∇L(θ*) = 0 with λ_min(∇²L(θ*)) < 0 < λ_max(∇²L(θ*)). Gradient descent is attracted along the positive-curvature directions and repelled along the negative ones, which is why it can stall near a saddle before escaping.
Symbolically, we track it through the iterate θ_t, the objective L(θ_t), the gradient ∇L(θ_t), the Hessian ∇²L(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic saddle such as L(x, y) = x² − y², where the escape dynamics can be computed directly and compared with theory.
- A two-layer linear network L(w₁, w₂) = (w₁w₂ − 1)², whose saddle at the origin is explicit while the model remains interpretable.
- A transformer training diagnostic where long plateaus in the loss curve, read alongside gradient norms and update norms, are consistent with time spent near saddle regions.
Non-examples:
- Treating saddle points as a hyperparameter recipe without checking the objective assumptions.
- Declaring a saddle from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
∇L(θ*) = 0 and λ_min(∇²L(θ*)) < 0 < λ_max(∇²L(θ*)); along a negative eigenvector, a deviation of size ε grows by the factor (1 + η|λ_min|) per gradient step.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, diagonalize the Hessian, and use the section assumptions to bound the change in objective value: positive-curvature coordinates contract by (1 − ηλ_i) while negative-curvature coordinates expand by (1 + η|λ_i|), so any component off the stable manifold eventually dominates. If the assumption is geometric, the proof turns a picture into an inequality; if stochastic, take conditional expectation before applying the bound; if algorithmic, check that the update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes saddle-point behavior visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the saddle point is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
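The attract-then-repel dynamics can be simulated directly on L(x, y) = x² − y² (an illustrative toy). Initialized exactly on the stable manifold y = 0, gradient descent converges to the saddle; a perturbation of 10⁻⁶ in y grows geometrically and carries the iterate away.

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])   # gradient of x^2 - y^2

def run_gd(p0, eta=0.1, steps=50):
    # Plain gradient descent from p0.
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - eta * grad(p)
    return p

on_manifold = run_gd([1.0, 0.0])    # y = 0 exactly: x contracts, y stays 0
perturbed = run_gd([1.0, 1e-6])     # tiny y grows by (1 + 0.2) per step
```

This is the sense in which saddles are unstable for gradient descent: escape happens, but its onset is exponentially slow in the size of the initial off-manifold component, which is what a long plateau in the loss curve looks like.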
2.4 Examples, non-examples, and boundary cases
In this section, the basin of attraction is treated as a concrete optimization object rather than a slogan: the set of starting points from which a given algorithm reaches a given minimizer. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Examples, non-examples, and boundary cases" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
The basin of attraction of a minimizer θ*, for a given algorithm, is the set of initializations from which the iterates converge to θ*: B(θ*) = {θ₀ : θ_t → θ*}. Basins partition most of parameter space and explain why initialization and early training largely determine which solution is reached.
Symbolically, we track it through the iterate θ_t, the objective L(θ_t), the gradient ∇L(θ_t), the Hessian ∇²L(θ_t), and any auxiliary state used by the algorithm.
Examples:
- The double well L(x) = (x² − 1)², whose basins x > 0 and x < 0 can be computed directly and compared with theory.
- A mixture-model likelihood where different interpretable solutions are reached from different starts while the model remains interpretable.
- A transformer training diagnostic where seed-to-seed variance and checkpoint divergence indicate which basin a run has committed to.
Non-examples:
- Treating basins as a hyperparameter recipe without checking the objective assumptions.
- Inferring the basin boundary from one noisy minibatch trajectory when the section requires a population or full-batch statement.
Useful formula:
B(θ*) = {θ₀ : lim_{t→∞} θ_t = θ*}, where θ_{t+1} = θ_t − η∇L(θ_t) or whatever update rule is under study; note that the basin depends on the algorithm and step size, not only on the landscape.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, identify which minimizer's region of contraction contains it, and use the section assumptions to bound the change in objective value: monotone descent plus local strong convexity near θ* traps the iterates once they enter a sublevel set around θ*. If the assumption is geometric, the proof turns a picture into an inequality; if stochastic, take conditional expectation before applying the bound; if algorithmic, check that the update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes basin of attraction visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about basin of attraction is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
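The double-well basins can be traced numerically (toy objective, illustrative step size): gradient descent started to the right of 0 lands at +1, started to the left it lands at −1, so the sign of the initialization decides the solution.

```python
import numpy as np

def grad(x):
    return 4.0 * x * (x**2 - 1.0)   # gradient of (x^2 - 1)^2

def run_gd(x0, eta=0.01, steps=500):
    # Plain gradient descent on the double well from x0.
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Initializations on either side of 0 fall into different basins.
right = run_gd(0.3)    # converges to +1
left = run_gd(-0.3)    # converges to -1
```

The basin boundary here is the single point x = 0 (the saddle of this landscape); in high dimensions the boundary is the stable manifold of the saddle set.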
2.5 Notation, dimensions, and assumptions
In this section, the barrier is treated as a concrete optimization object rather than a slogan: the extra loss that must be crossed to travel between two minima. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Optimization Landscape, the phrase "Notation, dimensions, and assumptions" names a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
Given minima θ_a and θ_b and a path γ with γ(0) = θ_a and γ(1) = θ_b, the barrier along γ is max_s L(γ(s)) − max(L(θ_a), L(θ_b)); the barrier between the minima is the infimum of this quantity over paths. Mode connectivity is the empirical finding that independently trained networks are often joined by paths of near-zero barrier.
Symbolically, we track it through the iterate θ_t, the objective L(θ_t), the gradient ∇L(θ_t), the Hessian ∇²L(θ_t), and any auxiliary state used by the algorithm.
Examples:
- The double well L(x) = (x² − 1)², whose barrier between the minima ±1 is exactly L(0) = 1 and can be computed directly and compared with theory.
- A convex objective such as regularized logistic regression, a boundary case where the barrier between any two optima is zero while the model remains interpretable.
- A transformer training diagnostic where the loss along a linear interpolation between checkpoints estimates the barrier and predicts whether checkpoint averaging is safe.
Non-examples:
- Treating barriers as a hyperparameter recipe without checking the objective assumptions.
- Estimating a barrier from one noisy minibatch evaluation of the path when the section requires a population or full-batch statement.
Useful formula:
B(θ_a, θ_b) = inf_γ max_{s∈[0,1]} L(γ(s)) − max(L(θ_a), L(θ_b)).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, evaluate the objective along a candidate path, and use the section assumptions to bound the change in objective value: any single path gives an upper bound on the barrier, and exhibiting a low-loss path is exactly how mode-connectivity results are proved. If the assumption is geometric, the proof turns a picture into an inequality; if stochastic, take conditional expectation before applying the bound; if algorithmic, check that the update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes barrier visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about barrier is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- sharpness-aware minimization and flat-minimum heuristics.
- mode connectivity behind checkpoint averaging and model soups.
- edge-of-stability behavior in large neural-network training.
- Hessian-spectrum diagnostics for loss spikes and instability.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
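The barrier formula can be evaluated exactly on the double well (toy example; the linear path and grid resolution are illustrative choices): the worst loss on the straight path between the minima ±1 occurs at x = 0, giving a barrier of 1.

```python
import numpy as np

def loss(x):
    return (x**2 - 1.0)**2    # two global minima, at x = -1 and x = +1

# Linear path between the minima; the barrier is the worst loss on the path
# minus the worse of the two endpoint losses (here both endpoints are 0).
path = np.linspace(-1.0, 1.0, 201)
barrier = loss(path).max() - max(loss(-1.0), loss(1.0))
```

The same interpolation applied to two transformer checkpoints, with the loss evaluated on a fixed batch, is the standard quick test behind checkpoint averaging and model soups: a near-zero measured barrier suggests the checkpoints sit in a connected low-loss region.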