Gradient Descent: Part 1 (Intuition) to Part 2 (Formal Definitions)
1. Intuition
This block develops intuition for Gradient Descent. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
1.1 Why Gradient Descent matters for training systems
In this section, backtracking line search is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Why Gradient Descent matters for training systems" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, backtracking line search is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where backtracking line search can be computed directly and compared with theory.
- A logistic-regression or softmax objective where backtracking line search affects optimization but the model remains interpretable.
- A transformer training diagnostic where backtracking line search appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating backtracking line search as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Starting from a trial step η = η₀, backtracking repeatedly shrinks η ← βη with β ∈ (0, 1) until the sufficient-decrease test f(θ_t − η∇f(θ_t)) ≤ f(θ_t) − c·η·‖∇f(θ_t)‖² holds for a fixed c ∈ (0, 1).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving backtracking line search, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes backtracking line search visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about backtracking line search is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
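To make this concrete, here is a minimal sketch of backtracking line search on a small ill-conditioned quadratic; the function, constants (β = 0.5, c = 1e-4), and variable names are illustrative choices, not prescribed by this lesson:

```python
def backtracking_step(f, grad, x, eta0=1.0, beta=0.5, c=1e-4, max_halvings=50):
    """Shrink eta geometrically until the sufficient-decrease test passes."""
    g = grad(x)
    g_sq = sum(gi * gi for gi in g)
    eta = eta0
    for _ in range(max_halvings):
        x_new = [xi - eta * gi for xi, gi in zip(x, g)]
        if f(x_new) <= f(x) - c * eta * g_sq:  # Armijo sufficient decrease
            return x_new, eta
        eta *= beta                             # backtrack
    return x, 0.0                               # give up: no accepted step

# Ill-conditioned quadratic: curvature 1 in x, 100 in y, so a fixed
# eta = 1.0 would diverge; the line search finds a stable step itself.
f = lambda p: 0.5 * (p[0] ** 2 + 100.0 * p[1] ** 2)
grad = lambda p: [p[0], 100.0 * p[1]]

x, history = [1.0, 1.0], []
for _ in range(100):
    x, eta = backtracking_step(f, grad, x)
    history.append(f(x))
```

Logging the accepted η alongside the loss is exactly the kind of diagnostic this section asks for: a collapsing accepted step size flags a curvature problem long before the scalar loss diverges.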
1.2 The optimization object: parameters, objective, algorithm, and diagnostic
In this section, the Armijo condition is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "The optimization object: parameters, objective, algorithm, and diagnostic" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the Armijo condition is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the Armijo condition can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the Armijo condition affects optimization but the model remains interpretable.
- A transformer training diagnostic where the Armijo condition appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the Armijo condition as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
f(θ_t − η∇f(θ_t)) ≤ f(θ_t) − c·η·‖∇f(θ_t)‖² for a fixed c ∈ (0, 1); for an L-smooth objective the test is guaranteed to pass whenever η ≤ 2(1 − c)/L.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving the Armijo condition, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the Armijo condition visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the Armijo condition is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
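A minimal sketch of the Armijo test itself, on a 1-D quadratic where the acceptance threshold 2(1 − c)/L can be checked against theory; the constants and names are illustrative:

```python
def armijo_holds(f, grad_f, x, eta, c=0.1):
    """Sufficient decrease along -grad: f(x - eta*g) <= f(x) - c*eta*g^2 (1-D)."""
    g = grad_f(x)
    return f(x - eta * g) <= f(x) - c * eta * g * g

# 1-D quadratic with curvature L = 100: theory predicts acceptance
# exactly for eta <= 2*(1 - c)/L = 0.018 here.
L = 100.0
f = lambda x: 0.5 * L * x * x
grad_f = lambda x: L * x

accepted = [eta for eta in (0.001, 0.01, 0.017, 0.02, 0.05, 1.0)
            if armijo_holds(f, grad_f, 1.0, eta)]
print(accepted)  # only the step sizes below the theoretical threshold
```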
1.3 Historical arc from classical optimization to modern AI
In this section, the Wolfe conditions are treated as a concrete optimization object rather than a slogan. The goal is to understand how they change the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Historical arc from classical optimization to modern AI" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the Wolfe conditions are the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track them through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the Wolfe conditions can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the Wolfe conditions affect optimization but the model remains interpretable.
- A transformer training diagnostic where the Wolfe conditions appear through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the Wolfe conditions as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
For a descent direction d: sufficient decrease f(θ_t + ηd) ≤ f(θ_t) + c₁·η·∇f(θ_t)ᵀd together with the curvature condition ∇f(θ_t + ηd)ᵀd ≥ c₂·∇f(θ_t)ᵀd, where 0 < c₁ < c₂ < 1.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving the Wolfe conditions, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the Wolfe conditions visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the Wolfe conditions is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
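The two Wolfe tests can be sketched in a few lines for a 1-D steepest-descent step; c₁ = 1e-4 and c₂ = 0.9 are conventional choices, and the quadratic is illustrative:

```python
def wolfe_flags(f, grad_f, x, eta, c1=1e-4, c2=0.9):
    """Return (sufficient_decrease, curvature) for a steepest-descent step in 1-D."""
    g = grad_f(x)
    d = -g                                      # steepest-descent direction
    x_new = x + eta * d
    armijo = f(x_new) <= f(x) + c1 * eta * g * d
    curvature = grad_f(x_new) * d >= c2 * g * d
    return armijo, curvature

L = 100.0
f = lambda x: 0.5 * L * x * x
grad_f = lambda x: L * x

tiny = wolfe_flags(f, grad_f, 1.0, 1e-5)   # step too timid
good = wolfe_flags(f, grad_f, 1.0, 0.015)  # both conditions hold
print(tiny, good)
```

A tiny step passes sufficient decrease but fails curvature, which is exactly why the second Wolfe condition exists: it rules out steps that barely move.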
1.4 What this section treats as canonical scope
In this section, convex convergence is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "What this section treats as canonical scope" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, convex convergence is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where convex convergence can be computed directly and compared with theory.
- A logistic-regression or softmax objective where convex convergence affects optimization but the model remains interpretable.
- A transformer training diagnostic where convex convergence appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating convex convergence as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
For convex, L-smooth f run with η = 1/L: f(θ_T) − f(θ*) ≤ L·‖θ₀ − θ*‖² / (2T).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving convex convergence, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes convex convergence visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about convex convergence is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
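A quick numerical check of the convex O(1/T) rate on a separable quadratic; the curvatures and horizon are illustrative:

```python
# Gradient descent with eta = 1/L on the convex quadratic
# f(x, y) = 0.5*(x^2 + 10*y^2), whose minimum value is f* = 0.
eta, T = 0.1, 100                      # eta = 1/L with L = 10
x, y = 1.0, 1.0
f = lambda x, y: 0.5 * (x * x + 10.0 * y * y)

for _ in range(T):
    x, y = x - eta * x, y - eta * 10.0 * y

# Classical convex rate: f(x_T) - f* <= ||x_0 - x*||^2 / (2*eta*T)
bound = (1.0 ** 2 + 1.0 ** 2) / (2 * eta * T)
gap = f(x, y)
print(gap <= bound)
```

The measured gap lands far below the bound, which is typical: worst-case rates are guarantees, not predictions.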
1.5 A first mental model for LLM training
In this section, strongly convex convergence is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "A first mental model for LLM training" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, strongly convex convergence is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where strongly convex convergence can be computed directly and compared with theory.
- A logistic-regression or softmax objective where strongly convex convergence affects optimization but the model remains interpretable.
- A transformer training diagnostic where strongly convex convergence appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating strongly convex convergence as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
For μ-strongly convex, L-smooth f run with η = 1/L: ‖θ_t − θ*‖² ≤ (1 − μ/L)^t · ‖θ₀ − θ*‖².
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving strongly convex convergence, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes strongly convex convergence visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about strongly convex convergence is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
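The linear rate under strong convexity, checked on the same style of diagonal quadratic (μ = 1, L = 10 are illustrative; for a diagonal quadratic with η = 1/L each coordinate contracts by at most 1 − μ/L per step):

```python
import math

# f(x, y) = 0.5*(mu*x^2 + L*y^2) with mu = 1, L = 10, eta = 1/L.
eta, mu, L = 0.1, 1.0, 10.0
x, y = 1.0, 1.0
dists = []
for t in range(1, 31):
    x, y = x - eta * mu * x, y - eta * L * y
    dists.append((t, math.hypot(x, y)))   # distance to the minimizer (0, 0)

# Linear (geometric) convergence: dist_t <= (1 - mu/L)^t * dist_0
rate_ok = all(d <= (1 - mu / L) ** t * math.sqrt(2.0) + 1e-12 for t, d in dists)
print(rate_ok)
```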
2. Formal Definitions
This block develops formal definitions for Gradient Descent. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
2.1 Primary definition: gradient direction
In this section, convex convergence is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Primary definition: gradient direction" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, convex convergence is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where convex convergence can be computed directly and compared with theory.
- A logistic-regression or softmax objective where convex convergence affects optimization but the model remains interpretable.
- A transformer training diagnostic where convex convergence appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating convex convergence as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
θ_{t+1} = θ_t − η_t·∇f(θ_t); among unit vectors d, the directional derivative ∇f(θ_t)ᵀd is minimized by d = −∇f(θ_t)/‖∇f(θ_t)‖.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving convex convergence, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes convex convergence visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about convex convergence is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
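A sketch of why the negative gradient is the canonical direction: probe many unit directions and compare their directional derivatives against the Cauchy–Schwarz bound (the objective here is illustrative):

```python
import math

# Directional derivative of f along a unit vector d is <grad f, d>; by
# Cauchy-Schwarz it is at least -||grad f||, with equality at d = -grad f / ||grad f||.
grad = lambda x, y: (2.0 * x, 6.0 * y)   # gradient of f(x, y) = x^2 + 3*y^2

gx, gy = grad(1.0, 1.0)
gnorm = math.hypot(gx, gy)

slopes = [gx * math.cos(a) + gy * math.sin(a)
          for a in (2 * math.pi * k / 1000 for k in range(1000))]
print(min(slopes) >= -gnorm - 1e-9)   # no direction beats -grad
```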
2.2 Secondary definition: descent lemma
In this section, strongly convex convergence is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Secondary definition: descent lemma" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, strongly convex convergence is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where strongly convex convergence can be computed directly and compared with theory.
- A logistic-regression or softmax objective where strongly convex convergence affects optimization but the model remains interpretable.
- A transformer training diagnostic where strongly convex convergence appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating strongly convex convergence as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
If ∇f is L-Lipschitz, then f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)·‖y − x‖²; taking y = x − (1/L)∇f(x) yields f(y) ≤ f(x) − ‖∇f(x)‖²/(2L).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving strongly convex convergence, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes strongly convex convergence visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about strongly convex convergence is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
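A numerical spot-check of the descent lemma, using the logistic loss as a convenient 1-D function known to be L-smooth with L = 1/4; the grid and test point are illustrative:

```python
import math

# f(x) = log(1 + e^x) has f''(x) = s*(1 - s) <= 1/4, so L = 1/4.
f = lambda x: math.log1p(math.exp(x))
df = lambda x: 1.0 / (1.0 + math.exp(-x))
L = 0.25

# Descent lemma: f(y) <= f(x) + f'(x)*(y - x) + (L/2)*(y - x)^2 on a grid.
grid = [i * 0.5 - 5.0 for i in range(21)]
lemma_ok = all(
    f(y) <= f(x) + df(x) * (y - x) + 0.5 * L * (y - x) ** 2 + 1e-12
    for x in grid for y in grid
)

# Plugging y = x - (1/L)*f'(x) into the lemma guarantees a decrease
# of at least |f'(x)|^2 / (2*L).
x0 = 2.0
decrease = f(x0) - f(x0 - (1.0 / L) * df(x0))
guaranteed = df(x0) ** 2 / (2 * L)
print(lemma_ok, decrease >= guaranteed)
```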
2.3 Algorithmic object: constant step size
In this section, nonconvex stationarity is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Algorithmic object: constant step size" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, nonconvex stationarity is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where nonconvex stationarity can be computed directly and compared with theory.
- A logistic-regression or softmax objective where nonconvex stationarity affects optimization but the model remains interpretable.
- A transformer training diagnostic where nonconvex stationarity appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating nonconvex stationarity as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
For L-smooth f bounded below by f_inf, gradient descent with η = 1/L satisfies min_{t<T} ‖∇f(θ_t)‖² ≤ 2L·(f(θ₀) − f_inf)/T.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving nonconvex stationarity, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes nonconvex stationarity visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about nonconvex stationarity is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
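A sketch of the nonconvex stationarity guarantee using f = sin, which is 1-smooth and bounded below by −1; the start point and horizon are illustrative:

```python
import math

# On a smooth nonconvex objective, GD with eta = 1/L guarantees that the
# *best* iterate is near-stationary:
#   min_{t<T} |f'(x_t)|^2 <= 2*L*(f(x_0) - inf f) / T.
f, df, L = math.sin, math.cos, 1.0   # |f''| <= 1, inf f = -1

x, T = 1.0, 200
grad_sq = []
for _ in range(T):
    g = df(x)
    grad_sq.append(g * g)
    x -= (1.0 / L) * g

bound = 2 * L * (f(1.0) - (-1.0)) / T
print(min(grad_sq) <= bound)
```

Note the guarantee is about the best iterate, not the last one; logging the running minimum of the gradient norm is the matching diagnostic.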
2.4 Examples, non-examples, and boundary cases
In this section, the PL condition is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Examples, non-examples, and boundary cases" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the PL condition is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the PL condition can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the PL condition affects optimization but the model remains interpretable.
- A transformer training diagnostic where the PL condition appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the PL condition as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
PL inequality: ½·‖∇f(θ)‖² ≥ μ·(f(θ) − f*); combined with L-smoothness and η = 1/L it gives f(θ_t) − f* ≤ (1 − μ/L)^t·(f(θ₀) − f*).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving the PL condition, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the PL condition visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the PL condition is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
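A sketch using the standard nonconvex PL example f(x) = x² + 3·sin²(x); the constant μ = 1/32 is commonly cited for this function and is taken here as an assumption, spot-checked numerically rather than proved:

```python
import math

# Nonconvex (f'' = 2 + 6*cos(2x) dips negative) yet PL:
#   0.5*|f'(x)|^2 >= mu*(f(x) - f*), with f* = 0 at x = 0.
f = lambda x: x * x + 3.0 * math.sin(x) ** 2
df = lambda x: 2.0 * x + 3.0 * math.sin(2.0 * x)
mu = 1.0 / 32.0   # assumed PL constant for this example

# Numeric spot-check of the PL inequality on a grid.
pl_ok = all(0.5 * df(x) ** 2 >= mu * f(x) - 1e-12
            for x in (i * 0.01 - 10.0 for i in range(2001)))

# Despite nonconvexity, GD still reaches the global minimum.
x = 8.0
for _ in range(500):
    x -= 0.05 * df(x)
print(pl_ok, f(x) < 1e-6)
```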
2.5 Notation, dimensions, and assumptions
In this section, the condition number is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Notation, dimensions, and assumptions" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the condition number is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the condition number can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the condition number affects optimization but the model remains interpretable.
- A transformer training diagnostic where the condition number appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the condition number as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
κ = L/μ; with the optimal constant step η = 2/(L + μ), ‖θ_t − θ*‖ ≤ ((κ − 1)/(κ + 1))^t · ‖θ₀ − θ*‖.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving the condition number, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the condition number visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the condition number is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
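A sketch of how the condition number shows up as iteration cost, on diagonal quadratics with illustrative κ = 10 and κ = 100:

```python
# On f = 0.5*(mu*x^2 + L*y^2), GD with the optimal constant step
# eta = 2/(L + mu) contracts the error by (kappa - 1)/(kappa + 1) per step,
# so iteration counts grow roughly linearly with kappa = L/mu.
def steps_to_tol(L, mu, tol=1e-6):
    eta = 2.0 / (L + mu)
    x, y, steps = 1.0, 1.0, 0
    while x * x + y * y > tol * tol:
        x, y = x - eta * mu * x, y - eta * L * y
        steps += 1
    return steps

easy = steps_to_tol(10.0, 1.0)    # kappa = 10
hard = steps_to_tol(100.0, 1.0)   # kappa = 100: roughly 10x the steps
print(easy, hard)
```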