3. Core Theory I: Geometry and Guarantees
This block develops Core Theory I (Geometry and Guarantees) for Stochastic Optimization. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
3.1 Geometry of unbiased gradient oracle
In this section, strongly convex SGD is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Geometry of unbiased gradient oracle" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, strongly convex SGD is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate w_t, the step size eta_t, the stochastic gradient g_t, the objective value f(w_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where strongly convex SGD can be computed directly and compared with theory.
- A logistic-regression or softmax objective where strongly convex SGD affects optimization but the model remains interpretable.
- A transformer training diagnostic where strongly convex SGD appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating strongly convex SGD as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Assuming f is mu-strongly convex and L-smooth, the oracle is unbiased with variance at most sigma^2, and eta_t <= 1/L, one standard contraction is E||w_{t+1} - w*||^2 <= (1 - eta_t mu) E||w_t - w*||^2 + eta_t^2 sigma^2.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate w_t, isolate the term involving strongly convex SGD, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes strongly convex SGD visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update rule written above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about strongly convex SGD is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- minibatch training for deep networks and transformers.
- batch-size and learning-rate coupling in large-scale pretraining.
- distributed gradient averaging under data parallelism.
- variance reduction ideas behind efficient fine-tuning and classical ML solvers.
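The synthetic-quadratic example above can be sketched in a few lines. All constants here (mu, L, sigma, eta, the minimizer) are illustrative assumptions, not values from this lesson; the point is that the squared distance to the optimum contracts toward a noise floor, as the contraction inequality predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, L = 0.5, 2.0                     # assumed strong-convexity / smoothness constants
w_star = np.array([1.0, -2.0])       # known minimizer of the synthetic quadratic
H = np.diag([mu, L])                 # Hessian: f(w) = 0.5 (w - w*)^T H (w - w*)
sigma = 0.1                          # oracle noise level

def oracle(w):
    # Unbiased gradient oracle: true gradient plus zero-mean noise.
    return H @ (w - w_star) + sigma * rng.standard_normal(2)

eta = 0.1                            # eta <= 1/L keeps the contraction factor below 1
w = np.zeros(2)
sq_dist = []
for _ in range(2000):
    w = w - eta * oracle(w)
    sq_dist.append(float(np.sum((w - w_star) ** 2)))

early, late = np.mean(sq_dist[:50]), np.mean(sq_dist[-50:])
print(early, late)  # late should sit near the noise floor, far below early
```

Logging the squared distance (or a proxy such as the update norm) is exactly the kind of diagnostic the implementation bullets above recommend.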
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
3.2 Key inequality for gradient variance
In this section, nonconvex SGD is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Key inequality for gradient variance" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, nonconvex SGD is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate w_t, the step size eta_t, the stochastic gradient g_t, the objective value f(w_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where nonconvex SGD can be computed directly and compared with theory.
- A logistic-regression or softmax objective where nonconvex SGD affects optimization but the model remains interpretable.
- A transformer training diagnostic where nonconvex SGD appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating nonconvex SGD as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
For an L-smooth (possibly nonconvex) objective and an unbiased oracle with variance at most sigma^2, the key per-step inequality is E[f(w_{t+1})] <= f(w_t) - eta (1 - L eta / 2) ||grad f(w_t)||^2 + (L eta^2 / 2) sigma^2; telescoping it over T steps bounds min_t E||grad f(w_t)||^2 at rate O(1/sqrt(T)) for eta proportional to 1/sqrt(T).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate w_t, isolate the term involving nonconvex SGD, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes nonconvex SGD visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update rule written above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about nonconvex SGD is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- minibatch training for deep networks and transformers.
- batch-size and learning-rate coupling in large-scale pretraining.
- distributed gradient averaging under data parallelism.
- variance reduction ideas behind efficient fine-tuning and classical ML solvers.
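A minimal sketch of the nonconvex story, on an assumed double-well objective f(x) = 0.25 x^4 - 0.5 x^2 with illustrative noise and step-size constants: what the theory controls is the best squared gradient norm seen along the trajectory, not the loss gap to a global minimum.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_f(x):
    # Gradient of the nonconvex double well f(x) = 0.25 x^4 - 0.5 x^2.
    return x**3 - x

sigma, eta = 0.05, 0.05
x = 2.0
best_grad_sq = float("inf")
for _ in range(5000):
    g = grad_f(x) + sigma * rng.standard_normal()   # unbiased noisy gradient
    x -= eta * g
    # Nonconvex guarantees bound the best-seen squared gradient norm.
    best_grad_sq = min(best_grad_sq, grad_f(x) ** 2)

print(best_grad_sq)  # small: the run visits a near-stationary point
```

Note that the iterate settles near one of the two stationary wells; which one depends on initialization and noise, which is exactly why the guarantee speaks only of gradient norms.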
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
3.3 Role of minibatch estimator
In this section, gradient noise scale is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Role of minibatch estimator" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, gradient noise scale is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate w_t, the step size eta_t, the stochastic gradient g_t, the objective value f(w_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where gradient noise scale can be computed directly and compared with theory.
- A logistic-regression or softmax objective where gradient noise scale affects optimization but the model remains interpretable.
- A transformer training diagnostic where gradient noise scale appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating gradient noise scale as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
For a minibatch S of size B drawn with replacement, the estimator g_B = (1/B) sum_{i in S} grad f_i(w) satisfies E[g_B] = grad f(w) and Var(g_B) = sigma^2 / B, where sigma^2 is the single-example gradient variance.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate w_t, isolate the term involving gradient noise scale, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes gradient noise scale visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update rule written above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about gradient noise scale is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- minibatch training for deep networks and transformers.
- batch-size and learning-rate coupling in large-scale pretraining.
- distributed gradient averaging under data parallelism.
- variance reduction ideas behind efficient fine-tuning and classical ML solvers.
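The 1/B variance scaling can be checked directly. The per-example gradients below are synthetic (a common mean plus unit noise), an assumption made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
# Per-example gradients at a fixed point: common mean 1.0 plus unit noise.
per_example = 1.0 + rng.standard_normal(n)

def minibatch_mean(B, reps=2000):
    # Sample `reps` minibatches of size B with replacement and average each.
    idx = rng.integers(0, n, size=(reps, B))
    return per_example[idx].mean(axis=1)

var_1 = minibatch_mean(1).var()
var_16 = minibatch_mean(16).var()
print(var_1, var_16)  # expect roughly a factor-of-16 reduction
```

This is the measurement to log in practice: an empirical minibatch-gradient variance tells you where on the sigma^2 / B curve a training run actually sits.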
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
3.4 Proof template and what the proof actually buys
In this section, SVRG is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Proof template and what the proof actually buys" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, SVRG is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate w_t, the step size eta_t, the stochastic gradient g_t, the objective value f(w_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where SVRG can be computed directly and compared with theory.
- A logistic-regression or softmax objective where SVRG affects optimization but the model remains interpretable.
- A transformer training diagnostic where SVRG appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating SVRG as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
The SVRG estimator at snapshot point w~ is g_t = grad f_i(w_t) - grad f_i(w~) + grad f(w~); it is unbiased, and its variance vanishes as both w_t and w~ approach the minimizer.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate w_t, isolate the term involving SVRG, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes SVRG visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update rule written above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about SVRG is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- minibatch training for deep networks and transformers.
- batch-size and learning-rate coupling in large-scale pretraining.
- distributed gradient averaging under data parallelism.
- variance reduction ideas behind efficient fine-tuning and classical ML solvers.
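On a synthetic finite sum of identical quadratics (an assumption chosen so the effect is exact), the SVRG estimator keeps the plain estimator's mean while collapsing its variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
a = rng.standard_normal(n)           # component minima: f_i(w) = 0.5 (w - a_i)^2

def grad_i(w, i):
    return w - a[i]

def full_grad(w):
    return w - a.mean()

w, snapshot = 0.7, 0.5               # current iterate and SVRG snapshot point
i = np.arange(n)                     # evaluate the estimator for every component
plain = grad_i(w, i)                                        # plain SGD estimator
svrg = grad_i(w, i) - grad_i(snapshot, i) + full_grad(snapshot)

print(plain.var(), svrg.var())  # both unbiased; SVRG's variance collapses here
```

With identical curvatures the component-specific terms cancel exactly, so the SVRG variance is zero; for general objectives it is merely small once w_t and the snapshot are close, which is the content of the formula above.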
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
3.5 Failure modes when assumptions are removed
In this section, SAGA is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Failure modes when assumptions are removed" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, SAGA is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate w_t, the step size eta_t, the stochastic gradient g_t, the objective value f(w_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where SAGA can be computed directly and compared with theory.
- A logistic-regression or softmax objective where SAGA affects optimization but the model remains interpretable.
- A transformer training diagnostic where SAGA appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating SAGA as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
The SAGA estimator is g_t = grad f_j(w_t) - alpha_j + (1/n) sum_i alpha_i, followed by the table refresh alpha_j <- grad f_j(w_t); it is unbiased because the subtracted alpha_j is cancelled in expectation by the stored average.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate w_t, isolate the term involving SAGA, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes SAGA visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update rule written above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about SAGA is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- minibatch training for deep networks and transformers.
- batch-size and learning-rate coupling in large-scale pretraining.
- distributed gradient averaging under data parallelism.
- variance reduction ideas behind efficient fine-tuning and classical ML solvers.
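A minimal SAGA loop on a synthetic finite sum (all constants illustrative): despite a constant step size, the gradient memory table lets the error keep shrinking instead of stalling at a noise floor, which is where plain SGD would stop.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
a = rng.standard_normal(n)           # f_i(w) = 0.5 (w - a_i)^2, minimizer mean(a)

alpha = np.zeros(n)                  # stored per-component gradients (the table)
w, eta = 0.0, 0.1                    # constant step size
for _ in range(4000):
    j = rng.integers(n)
    g_j = w - a[j]
    g = g_j - alpha[j] + alpha.mean()    # unbiased SAGA estimator
    alpha[j] = g_j                       # refresh the memory table entry
    w -= eta * g

print(abs(w - a.mean()))  # keeps shrinking under a constant step size
```

The extra O(n) memory for the table is the price of variance reduction; that trade-off is why SAGA-style methods appear in classical ML solvers more often than in large-scale deep learning.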
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
4. Core Theory II: Algorithms and Dynamics
This block develops Core Theory II (Algorithms and Dynamics) for Stochastic Optimization. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
4.1 Algorithmic update for batch-size scaling
In this section, SVRG is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Algorithmic update for batch-size scaling" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, SVRG is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate w_t, the step size eta_t, the stochastic gradient g_t, the objective value f(w_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where SVRG can be computed directly and compared with theory.
- A logistic-regression or softmax objective where SVRG affects optimization but the model remains interpretable.
- A transformer training diagnostic where SVRG appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating SVRG as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
With minibatch size B, Var(g_B) = sigma^2 / B, so the noise term in the smooth descent inequality scales as L eta^2 sigma^2 / (2B); holding eta / B fixed (the linear-scaling heuristic) keeps that term roughly constant in the small-batch regime.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate w_t, isolate the term involving SVRG, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes SVRG visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update rule written above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about SVRG is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- minibatch training for deep networks and transformers.
- batch-size and learning-rate coupling in large-scale pretraining.
- distributed gradient averaging under data parallelism.
- variance reduction ideas behind efficient fine-tuning and classical ML solvers.
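A hedged sketch of the linear-scaling heuristic on a synthetic finite sum (batch sizes, step sizes, and epoch count are illustrative assumptions, and the heuristic itself only holds in the small-batch regime): scaling the step size with the batch size yields comparable final error for the same number of epochs.

```python
import numpy as np

def run(batch_size, eta, epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    n = 256
    a = rng.standard_normal(n)               # f_i(w) = 0.5 (w - a_i)^2
    w = 5.0
    for _ in range(epochs):
        perm = rng.permutation(n)            # one shuffled pass over the data
        for start in range(0, n, batch_size):
            batch = a[perm[start:start + batch_size]]
            w -= eta * (w - batch.mean())    # minibatch gradient step
    return abs(w - a.mean())

small = run(batch_size=4, eta=0.05)
large = run(batch_size=16, eta=0.20)         # 4x batch, 4x learning rate
print(small, large)  # comparable final error under the same epoch budget
```

The same seed gives both runs the same data, so the comparison isolates the (B, eta) coupling rather than sampling luck.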
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
4.2 Stability role of critical batch size
In this section, SAGA is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Stability role of critical batch size" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, SAGA is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate w_t, the step size eta_t, the stochastic gradient g_t, the objective value f(w_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where SAGA can be computed directly and compared with theory.
- A logistic-regression or softmax objective where SAGA affects optimization but the model remains interpretable.
- A transformer training diagnostic where SAGA appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating SAGA as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
A simple gradient-noise-scale estimate is B_noise ≈ tr(Sigma) / ||grad f(w)||^2, where Sigma is the per-example gradient covariance; batch sizes well below B_noise trade steps for batch size nearly linearly, while batch sizes far above it yield diminishing returns.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate w_t, isolate the term involving SAGA, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes SAGA visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update rule written above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about SAGA is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- minibatch training for deep networks and transformers.
- batch-size and learning-rate coupling in large-scale pretraining.
- distributed gradient averaging under data parallelism.
- variance reduction ideas behind efficient fine-tuning and classical ML solvers.
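The gradient-noise-scale estimate can be computed from per-example gradients. The gradients below are synthetic, constructed so that tr(Sigma) / ||grad f||^2 = 4 by design; in a real run you would use sampled per-example gradients at the current iterate instead.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 5, 20_000
true_grad = np.zeros(d)
true_grad[0] = 1.0                            # ||grad f||^2 = 1 by construction
noise_std = np.sqrt(0.8)                      # tr(Sigma) = d * 0.8 = 4.0
per_example = true_grad + noise_std * rng.standard_normal((n, d))

g_hat = per_example.mean(axis=0)              # estimated full gradient
Sigma_trace = per_example.var(axis=0, ddof=1).sum()
B_noise = Sigma_trace / np.sum(g_hat ** 2)    # simple noise-scale estimate
print(B_noise)  # lands near tr(Sigma) / ||grad f||^2 = 4
```

In practice ||g_hat||^2 is itself biased upward by sampling noise, so production estimators subtract a correction term; this sketch omits that refinement.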
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
4.3 Rate or complexity controlled by Robbins-Monro schedule
In this section, control variates are treated as a concrete optimization tool rather than a slogan. The goal is to understand how they change the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Rate or complexity controlled by Robbins-Monro schedule" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, control variates are the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track them through the iterate w_t, the step size eta_t, the stochastic gradient g_t, the objective value f(w_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where control variates can be computed directly and compared with theory.
- A logistic-regression or softmax objective where control variates affect optimization but the model remains interpretable.
- A transformer training diagnostic where control variates appear through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating control variates as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
The Robbins-Monro conditions on the step-size schedule are sum_t eta_t = infinity and sum_t eta_t^2 < infinity; for example, eta_t = c / t^a satisfies both for 1/2 < a <= 1.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate w_t, isolate the term involving control variates, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes control variates visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update rule written above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about control variates is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- minibatch training for deep networks and transformers.
- batch-size and learning-rate coupling in large-scale pretraining.
- distributed gradient averaging under data parallelism.
- variance reduction ideas behind efficient fine-tuning and classical ML solvers.
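A numeric check of the two Robbins-Monro conditions, plus a short SGD run under an eta_t = 1/(mu (t+1)) schedule (the quadratic and noise level are illustrative assumptions):

```python
import numpy as np

t = np.arange(1, 1_000_001, dtype=float)
harmonic = np.sum(1.0 / t)        # diverges (slowly): sum eta_t = infinity holds
square = np.sum(1.0 / t**2)       # converges: sum eta_t^2 < infinity holds
print(harmonic, square)           # eta_t = c/t meets both conditions

# SGD on a noisy quadratic with the eta_t = 1/(mu * (t+1)) schedule.
rng = np.random.default_rng(6)
mu, w_star, w = 1.0, 3.0, 0.0
for step in range(20_000):
    g = mu * (w - w_star) + 0.5 * rng.standard_normal()
    w -= (1.0 / (mu * (step + 1))) * g
print(abs(w - w_star))            # the decaying schedule drives the error down
```

A constant step size would satisfy the first condition but not the second, and the iterate would stall at a noise floor instead of converging; that contrast is the practical content of the schedule conditions.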
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
4.4 Diagnostic interpretation of the update path
In this section, Polyak averaging is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Diagnostic interpretation of the update path" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, Polyak averaging is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate w_t, the step size eta_t, the stochastic gradient g_t, the objective value f(w_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where Polyak averaging can be computed directly and compared with theory.
- A logistic-regression or softmax objective where Polyak averaging affects optimization but the model remains interpretable.
- A transformer training diagnostic where Polyak averaging appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating Polyak averaging as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Polyak-Ruppert averaging tracks w_bar_T = (1/T) sum_{t=1}^T w_t alongside the raw iterates; for strongly convex problems the averaged iterate attains the optimal O(1/T) rate even under robust step sizes such as eta_t proportional to 1/sqrt(t).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate w_t, isolate the term involving Polyak averaging, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes Polyak averaging visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update rule written above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about Polyak averaging is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- minibatch training for deep networks and transformers.
- batch-size and learning-rate coupling in large-scale pretraining.
- distributed gradient averaging under data parallelism.
- variance reduction ideas behind efficient fine-tuning and classical ML solvers.
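A sketch of Polyak-Ruppert averaging on a noisy quadratic (all constants illustrative): the tail average of the iterates typically suppresses the constant-step noise floor that the last iterate sits on.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, w_star, sigma, eta = 1.0, 2.0, 0.5, 0.1
w, iterates = 0.0, []
for _ in range(5000):
    g = mu * (w - w_star) + sigma * rng.standard_normal()
    w -= eta * g                     # constant step: the iterate hovers at a floor
    iterates.append(w)

last_err = abs(iterates[-1] - w_star)
avg_err = abs(np.mean(iterates[1000:]) - w_star)   # average after a burn-in
print(last_err, avg_err)  # averaging typically suppresses the noise floor
```

The burn-in matters: averaging from step zero would mix in the transient phase, which is why practical weight-averaging schemes (including EMA variants in deep learning) discount early iterates.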
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
4.5 Connection to the next section in the chapter
In this section, distributed SGD is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Connection to the next section in the chapter" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, distributed SGD is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate w_t, the step size eta_t, the stochastic gradient g_t, the objective value f(w_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where distributed SGD can be computed directly and compared with theory.
- A logistic-regression or softmax objective where distributed SGD affects optimization but the model remains interpretable.
- A transformer training diagnostic where distributed SGD appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating distributed SGD as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
With K workers each computing an unbiased minibatch gradient of size B, the averaged gradient g_t = (1/K) sum_k g_t^(k) is unbiased with Var(g_t) = sigma^2 / (K B), statistically equivalent to a single batch of size K B.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate w_t, isolate the term involving distributed SGD, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes distributed SGD visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update rule written above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about distributed SGD is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- minibatch training for deep networks and transformers.
- batch-size and learning-rate coupling in large-scale pretraining.
- distributed gradient averaging under data parallelism.
- variance reduction ideas behind efficient fine-tuning and classical ML solvers.
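The variance claim for gradient averaging can be simulated directly; the worker gradients below are synthetic draws around an assumed true gradient, standing in for the all-reduce average in data-parallel training.

```python
import numpy as np

rng = np.random.default_rng(8)
K, reps, sigma = 8, 5000, 1.0
true_grad = 2.0

# Each of K workers returns an unbiased gradient with variance sigma^2.
worker_grads = true_grad + sigma * rng.standard_normal((reps, K))
averaged = worker_grads.mean(axis=1)       # the all-reduce average

print(worker_grads[:, 0].var(), averaged.var())  # drops by about a factor of K
```

Because averaging K workers is statistically a K-fold larger batch, the batch-size and learning-rate coupling from the previous sections applies unchanged to the data-parallel setting, which is the bridge to the next part of the chapter.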
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.