Regularization Methods: Part 3 (Core Theory I: Geometry and Guarantees) to Part 4 (Core Theory II: Algorithms and Dynamics)
3. Core Theory I: Geometry and Guarantees
This block develops Core Theory I (Geometry and Guarantees) for Regularization Methods. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
3.1 Geometry of weight decay
In this section, weight decay is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Regularization Methods, the phrase "Geometry of weight decay" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, weight decay is the part of Regularization Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate $\theta_t$, the gradient $g_t = \nabla f(\theta_t)$, the step size $\eta_t$, the decay strength $\lambda$, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where weight decay can be computed directly and compared with theory.
- A logistic-regression or softmax objective where weight decay affects optimization but the model remains interpretable.
- A transformer training diagnostic where weight decay appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating weight decay as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
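One standard statement, assuming a local quadratic model $f(\theta) \approx \tfrac{1}{2}\theta^\top H\theta - b^\top\theta$ with $H \succeq 0$ and eigenvalues $s_i$: the L2-regularized minimizer shrinks coordinatewise in the eigenbasis of $H$,

$$\hat\theta_\lambda = (H + \lambda I)^{-1} H\,\hat\theta_0, \qquad \hat\theta_{\lambda,i} = \frac{s_i}{s_i + \lambda}\,\hat\theta_{0,i},$$

so flat (low-curvature) directions are shrunk hardest. That picture is the geometry of weight decay.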
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving weight decay, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory; the sketch below checks the geometric claim numerically.
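A minimal sketch, assuming NumPy and a synthetic positive-definite quadratic with hypothetical sizes; it verifies the eigenbasis shrinkage factors $s_i/(s_i+\lambda)$ from the formula above.

```python
# Minimal sketch: weight decay on a quadratic shrinks the minimizer
# along Hessian eigendirections by s_i / (s_i + lam).
import numpy as np

rng = np.random.default_rng(0)
d, lam = 5, 0.1
A = rng.normal(size=(d, d))
H = A @ A.T + 0.01 * np.eye(d)          # a positive-definite Hessian
theta0 = rng.normal(size=d)             # unregularized minimizer
b = H @ theta0                          # f(th) = 0.5 th^T H th - b^T th

theta_lam = np.linalg.solve(H + lam * np.eye(d), b)   # regularized minimizer

s, U = np.linalg.eigh(H)                # eigenvalues s, eigenvectors U
shrink = s / (s + lam)
print(np.allclose(U.T @ theta_lam, shrink * (U.T @ theta0)))  # True
```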
Implementation consequence:
- Log a metric that makes weight decay visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about weight decay is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- weight decay in AdamW-based transformer training.
- dropout and stochastic regularization for neural networks.
- spectral normalization in GANs and Lipschitz-controlled models.
- SAM as a regularizer that penalizes sharp local neighborhoods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
3.2 Key inequality for AdamW decay
In this section, AdamW's decoupled weight decay is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Regularization Methods, the phrase "Key inequality for AdamW decay" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, decoupled weight decay is the part of Regularization Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate $\theta_t$, the Adam direction $u_t$, the step size $\eta_t$, the decay coefficient $\lambda$, and any auxiliary state used by the algorithm (here the moment estimates $m_t, v_t$).
Examples:
- A small synthetic quadratic where the decoupled decay update can be computed directly and compared with theory.
- A logistic-regression or softmax objective where decoupled weight decay affects optimization but the model remains interpretable.
- A transformer training diagnostic where decoupled weight decay appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating decoupled weight decay as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
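One standard statement, assuming the decoupled (AdamW-style) update $\theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta\,u_t$ with bounded direction $\|u_t\| \le B$ and $\eta\lambda \in (0,1)$:

$$\|\theta_{t+1}\| \le (1 - \eta\lambda)\|\theta_t\| + \eta B \quad\Longrightarrow\quad \limsup_{t\to\infty} \|\theta_t\| \le \frac{B}{\lambda},$$

so decay and update size balance at an equilibrium norm set by $B/\lambda$, independent of the initialization.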
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving decoupled weight decay, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory; the sketch below checks the resulting norm bound numerically.
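A minimal sketch, assuming NumPy, stand-in Gaussian gradients, and hypothetical hyperparameters; it iterates a decoupled AdamW-style update and compares the final parameter norm with the $B/\lambda$ ceiling from the inequality above.

```python
# Minimal sketch: decoupled decay pulls ||theta|| below ||u|| / lam.
import numpy as np

rng = np.random.default_rng(1)
d, eta, lam, eps = 100, 1e-2, 0.1, 1e-8
beta1, beta2 = 0.9, 0.999
theta, m, v = 10.0 * rng.normal(size=d), np.zeros(d), np.zeros(d)

for t in range(1, 20001):
    g = rng.normal(size=d)                      # stand-in stochastic gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    u = (m / (1 - beta1**t)) / (np.sqrt(v / (1 - beta2**t)) + eps)
    theta = (1 - eta * lam) * theta - eta * u   # decay is decoupled from u

# Per-coordinate |u_i| is O(1), so B is roughly sqrt(d), bound sqrt(d)/lam.
print(np.linalg.norm(theta), "<=", np.sqrt(d) / lam)
```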
Implementation consequence:
- Log a metric that makes decoupled weight decay visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about decoupled weight decay is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- weight decay in AdamW-based transformer training.
- dropout and stochastic regularization for neural networks.
- spectral normalization in GANs and Lipschitz-controlled models.
- SAM as a regularizer that penalizes sharp local neighborhoods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
3.3 Role of L1 penalty
In this section, the L1 penalty is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Regularization Methods, the phrase "Role of L1 penalty" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the L1 penalty is the part of Regularization Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate $\theta_t$, the gradient $g_t = \nabla f(\theta_t)$, the step size $\eta_t$, the penalty weight $\lambda$, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the L1 penalty can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the L1 penalty affects optimization but the model remains interpretable.
- A transformer training diagnostic where the L1 penalty appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the L1 penalty as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
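One standard statement for the composite objective $F(\theta) = f(\theta) + \lambda\|\theta\|_1$ with convex differentiable $f$: the first-order optimality condition is

$$0 \in \nabla f(\theta^\star) + \lambda\,\partial\|\theta^\star\|_1,$$

so any coordinate with $|\nabla_i f(\theta^\star)| < \lambda$ must satisfy $\theta^\star_i = 0$. This is how the L1 penalty manufactures exact sparsity rather than mere shrinkage.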
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving the L1 penalty, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory; the sketch below checks the optimality condition numerically.
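A minimal sketch, assuming NumPy and hypothetical least-squares data; it computes the smallest $\lambda$ at which the optimality condition above forces the all-zeros solution.

```python
# Minimal sketch: theta = 0 minimizes ||y - X th||^2/(2n) + lam*||th||_1
# exactly when lam >= ||grad f(0)||_inf = ||X^T y||_inf / n.
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 10
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

grad0 = -X.T @ y / n                 # gradient of the data term at theta = 0
lam_max = np.abs(grad0).max()
print(lam_max)                       # any lam >= lam_max zeroes every coefficient
```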
Implementation consequence:
- Log a metric that makes the L1 penalty visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the L1 penalty is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- weight decay in AdamW-based transformer training.
- dropout and stochastic regularization for neural networks.
- spectral normalization in GANs and Lipschitz-controlled models.
- SAM as a regularizer that penalizes sharp local neighborhoods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
3.4 Proof template and what the proof actually buys
In this section, spectral normalization is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Regularization Methods, the phrase "Proof template and what the proof actually buys" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, spectral normalization is the part of Regularization Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the weight matrix $W$, its top singular value $\sigma_{\max}(W)$, the step size $\eta_t$, the power-iteration vector $u$, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where spectral normalization can be computed directly and compared with theory.
- A logistic-regression or softmax objective where spectral normalization affects optimization but the model remains interpretable.
- A transformer training diagnostic where spectral normalization appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating spectral normalization as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
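One standard statement, assuming 1-Lipschitz activations between linear layers $W_1, \dots, W_L$:

$$\mathrm{Lip}(f) \le \prod_{l=1}^{L} \sigma_{\max}(W_l), \qquad \bar W_l = \frac{W_l}{\sigma_{\max}(W_l)},$$

so spectral normalization caps each factor at 1 and hence the whole product, which is exactly the quantity a Lipschitz-based proof needs to control.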
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving spectral normalization, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory; the sketch below estimates the quantity the proof needs.
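A minimal sketch, assuming NumPy; it estimates $\sigma_{\max}(W)$ by power iteration, the estimator spectral normalization typically relies on, and checks it against a full SVD on a hypothetical weight matrix.

```python
# Minimal sketch: power iteration for sigma_max(W), then normalize.
import numpy as np

def spectral_norm(W, u=None, n_iter=50):
    """Estimate the top singular value of W; u can be carried across steps."""
    u = np.random.default_rng(3).normal(size=W.shape[0]) if u is None else u
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ W @ v, u              # sigma estimate and carried vector

W = np.random.default_rng(4).normal(size=(64, 32))
sigma, _ = spectral_norm(W)
print(sigma, np.linalg.svd(W, compute_uv=False)[0])   # should agree closely
W_sn = W / sigma                     # the spectrally normalized weight
```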
Implementation consequence:
- Log a metric that makes spectral normalization visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about spectral normalization is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- weight decay in AdamW-based transformer training.
- dropout and stochastic regularization for neural networks.
- spectral normalization in GANs and Lipschitz-controlled models.
- SAM as a regularizer that penalizes sharp local neighborhoods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
3.5 Failure modes when assumptions are removed
In this section, gradient clipping is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Regularization Methods, the phrase "Failure modes when assumptions are removed" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, gradient clipping is the part of Regularization Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate $\theta_t$, the gradient $g_t = \nabla f(\theta_t)$, the step size $\eta_t$, the clip threshold $c$, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where gradient clipping can be computed directly and compared with theory.
- A logistic-regression or softmax objective where gradient clipping affects optimization but the model remains interpretable.
- A transformer training diagnostic where gradient clipping appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating gradient clipping as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
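One standard statement: global-norm clipping replaces the gradient $g_t$ by

$$\mathrm{clip}_c(g_t) = g_t \cdot \min\!\Big(1, \frac{c}{\|g_t\|}\Big),$$

which preserves the direction and caps the update norm at $\eta_t c$. Once the smoothness assumption is removed, this cap is what rules out a single step destroying the iterate.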
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving gradient clipping, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory; the sketch below implements the guard the proof assumes.
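A minimal sketch, assuming NumPy; global-norm clipping exactly as in the formula above.

```python
# Minimal sketch: cap the gradient norm at c without changing its direction.
import numpy as np

def clip_by_global_norm(g, c):
    norm = np.linalg.norm(g)
    return g * min(1.0, c / (norm + 1e-12))   # small eps guards against g = 0

g = np.array([3.0, 4.0])                      # norm 5
print(clip_by_global_norm(g, 1.0))            # [0.6, 0.8], norm 1
```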
Implementation consequence:
- Log a metric that makes gradient clipping visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about gradient clipping is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- weight decay in AdamW-based transformer training.
- dropout and stochastic regularization for neural networks.
- spectral normalization in GANs and Lipschitz-controlled models.
- SAM as a regularizer that penalizes sharp local neighborhoods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
4. Core Theory II: Algorithms and Dynamics
This block develops Core Theory II (Algorithms and Dynamics) for Regularization Methods. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
4.1 Algorithmic update for soft thresholding
In this section, soft thresholding is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Regularization Methods, the phrase "Algorithmic update for soft thresholding" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, soft thresholding is the part of Regularization Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate $\theta_t$, the gradient $g_t = \nabla f(\theta_t)$, the step size $\eta$, the threshold $\tau = \eta\lambda$, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the soft-thresholding update can be computed directly and compared with theory.
- A logistic-regression or softmax objective where soft thresholding affects optimization but the model remains interpretable.
- A transformer training diagnostic where soft thresholding appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating soft thresholding as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
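One standard statement: the proximal operator of $\tau\|\cdot\|_1$ is the soft-thresholding map

$$S_\tau(z)_i = \mathrm{sign}(z_i)\,\max(|z_i| - \tau,\, 0),$$

and the resulting ISTA update is $\theta_{t+1} = S_{\eta\lambda}\big(\theta_t - \eta\nabla f(\theta_t)\big)$: a gradient step followed by an exact shrink toward zero.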
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving soft thresholding, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory; the sketch below runs the update end to end.
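A minimal sketch, assuming NumPy and hypothetical sparse-regression data; it runs ISTA, a gradient step followed by the soft-thresholding update above, and reports the recovered support.

```python
# Minimal sketch: ISTA for f(th) = ||y - X th||^2 / (2n) + lam * ||th||_1.
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

rng = np.random.default_rng(5)
n, d, lam = 100, 20, 0.1
X = rng.normal(size=(n, d))
theta_true = np.zeros(d)
theta_true[:3] = [2.0, -1.5, 1.0]             # sparse ground truth
y = X @ theta_true + 0.01 * rng.normal(size=n)

eta = n / np.linalg.norm(X, 2) ** 2           # 1/L for the data term
theta = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ theta - y) / n
    theta = soft_threshold(theta - eta * grad, eta * lam)

print(np.nonzero(theta)[0])                   # expected: the true support [0 1 2]
```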
Implementation consequence:
- Log a metric that makes soft thresholding visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about soft thresholding is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- weight decay in AdamW-based transformer training.
- dropout and stochastic regularization for neural networks.
- spectral normalization in GANs and Lipschitz-controlled models.
- SAM as a regularizer that penalizes sharp local neighborhoods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
4.2 Stability role of elastic net
In this section, the elastic net is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Regularization Methods, the phrase "Stability role of elastic net" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the elastic net is the part of Regularization Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate $\theta_t$, the gradient $g_t = \nabla f(\theta_t)$, the step size $\eta_t$, the penalty weights $\lambda_1, \lambda_2$, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the elastic net can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the elastic net affects optimization but the model remains interpretable.
- A transformer training diagnostic where the elastic net appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the elastic net as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
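One standard statement for the elastic-net penalty $R(\theta) = \lambda_1\|\theta\|_1 + \tfrac{\lambda_2}{2}\|\theta\|_2^2$: its proximal step factors as

$$\mathrm{prox}_{\eta R}(z) = \frac{S_{\eta\lambda_1}(z)}{1 + \eta\lambda_2},$$

and the quadratic term adds $\lambda_2$ to every curvature eigenvalue, making $f + R$ strongly convex even when $f$ is degenerate. That added curvature is the stability.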
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving the elastic net, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory; the sketch below implements the proximal step.
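A minimal sketch, assuming NumPy; the elastic-net proximal step from the formula above: soft threshold first, then the uniform shrink contributed by the L2 term.

```python
# Minimal sketch: prox of lam1*||.||_1 + (lam2/2)*||.||_2^2 with step eta.
import numpy as np

def prox_elastic_net(z, eta, lam1, lam2):
    st = np.sign(z) * np.maximum(np.abs(z) - eta * lam1, 0.0)
    return st / (1.0 + eta * lam2)            # the L2 part is a pure shrink

print(prox_elastic_net(np.array([2.0, -0.05, 0.5]), 0.1, 1.0, 1.0))
```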
Implementation consequence:
- Log a metric that makes the elastic net visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the elastic net is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- weight decay in AdamW-based transformer training.
- dropout and stochastic regularization for neural networks.
- spectral normalization in GANs and Lipschitz-controlled models.
- SAM as a regularizer that penalizes sharp local neighborhoods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
4.3 Rate or complexity controlled by nuclear norm
In this section, the nuclear norm is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Regularization Methods, the phrase "Rate or complexity controlled by nuclear norm" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the nuclear norm is the part of Regularization Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the matrix iterate $W_t$, the gradient $G_t$, the step size $\eta_t$, the threshold $\tau$, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic low-rank matrix problem where the nuclear norm can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the nuclear norm affects optimization but the model remains interpretable.
- A transformer training diagnostic where the nuclear norm appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the nuclear norm as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
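One standard statement: the nuclear norm $\|W\|_* = \sum_i \sigma_i(W)$ is the convex surrogate for rank, and its proximal operator is singular value thresholding,

$$\mathrm{prox}_{\tau\|\cdot\|_*}(Z) = U\, S_\tau(\Sigma)\, V^\top \quad \text{where } Z = U\Sigma V^\top,$$

so rate and complexity statements scale with $\|W\|_*$ (an effective-rank quantity) rather than with the raw parameter count.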
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $W_t$, isolate the term involving the nuclear norm, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory; the sketch below implements the spectral version of the step.
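A minimal sketch, assuming NumPy and a hypothetical matrix; singular value thresholding applies the scalar soft threshold to the spectrum, which is exactly the proximal step above.

```python
# Minimal sketch: prox of the nuclear norm = soft threshold on singular values.
import numpy as np

def svt(Z, tau):
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

Z = np.random.default_rng(6).normal(size=(8, 5))
print(np.linalg.matrix_rank(Z), np.linalg.matrix_rank(svt(Z, 1.0)))
```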
Implementation consequence:
- Log a metric that makes the nuclear norm visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the nuclear norm is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- weight decay in AdamW-based transformer training.
- dropout and stochastic regularization for neural networks.
- spectral normalization in GANs and Lipschitz-controlled models.
- SAM as a regularizer that penalizes sharp local neighborhoods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
4.4 Diagnostic interpretation of the update path
In this section, implicit regularization is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Regularization Methods, the phrase "Diagnostic interpretation of the update path" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, implicit regularization is the part of Regularization Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate $\theta_t$, the gradient $g_t = \nabla f(\theta_t)$, the step size $\eta_t$, the initialization $\theta_0$, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where implicit regularization can be computed directly and compared with theory.
- A logistic-regression or softmax objective where implicit regularization affects optimization but the model remains interpretable.
- A transformer training diagnostic where implicit regularization appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating implicit regularization as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
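One standard statement, assuming underdetermined least squares $f(\theta) = \tfrac{1}{2}\|X\theta - y\|^2$ with full-row-rank $X$ and initialization $\theta_0 = 0$: every gradient-descent iterate stays in the row space of $X$, and the limit is the minimum-norm interpolant

$$\theta_\infty = X^{+}y = X^\top (XX^\top)^{-1} y.$$

The regularizer lives in the update path, not in the objective.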
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving implicit regularization, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory; the sketch below exhibits the effect on a toy problem.
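A minimal sketch, assuming NumPy and a hypothetical underdetermined least-squares problem; gradient descent from zero initialization should land on the minimum-norm interpolant from the formula above, with no explicit penalty anywhere in the loop.

```python
# Minimal sketch: implicit regularization of GD on underdetermined least squares.
import numpy as np

rng = np.random.default_rng(7)
n, d = 20, 50                                 # more parameters than equations
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

theta = np.zeros(d)                           # zero init keeps theta in row(X)
eta = 1.0 / np.linalg.norm(X, 2) ** 2
for _ in range(20000):
    theta -= eta * X.T @ (X @ theta - y)

theta_minnorm = np.linalg.pinv(X) @ y
print(np.linalg.norm(theta - theta_minnorm))  # ~0
```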
Implementation consequence:
- Log a metric that makes implicit regularization visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about implicit regularization is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- weight decay in AdamW-based transformer training.
- dropout and stochastic regularization for neural networks.
- spectral normalization in GANs and Lipschitz-controlled models.
- SAM as a regularizer that penalizes sharp local neighborhoods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
4.5 Connection to the next section in the chapter
In this section, the Bayesian MAP view is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Regularization Methods, the phrase "Connection to the next section in the chapter" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the Bayesian MAP view is the part of Regularization Methods that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the parameter $\theta$, the likelihood $p(D \mid \theta)$, the prior $p(\theta)$, the induced penalty weight $\lambda$, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the MAP estimate can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the MAP view affects optimization but the model remains interpretable.
- A transformer training diagnostic where the MAP view appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the Bayesian MAP view as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
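One standard statement: with likelihood $p(D \mid \theta)$ and prior $p(\theta)$, the MAP estimate is

$$\hat\theta_{\mathrm{MAP}} = \arg\min_\theta\; \big[-\log p(D \mid \theta) - \log p(\theta)\big],$$

so a Gaussian prior $\mathcal N(0, \tau^2 I)$ reproduces the L2 penalty with $\lambda = \sigma^2/\tau^2$ (for Gaussian noise of variance $\sigma^2$), and a Laplace prior reproduces the L1 penalty.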
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving the Bayesian MAP view, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory; the sketch below verifies the correspondence numerically.
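A minimal sketch, assuming NumPy and a hypothetical Gaussian model; it checks that the ridge solution with $\lambda = \sigma^2/\tau^2$ coincides with the MAP estimate from the formula above.

```python
# Minimal sketch: ridge with lam = sigma^2 / tau^2 equals the Gaussian MAP.
import numpy as np

rng = np.random.default_rng(8)
n, d, sigma, tau = 40, 8, 0.5, 2.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + sigma * rng.normal(size=n)

lam = sigma**2 / tau**2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP: minimize ||y - X th||^2 / (2 sigma^2) + ||th||^2 / (2 tau^2)
map_est = np.linalg.solve(X.T @ X / sigma**2 + np.eye(d) / tau**2,
                          X.T @ y / sigma**2)
print(np.allclose(ridge, map_est))            # True
```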
Implementation consequence:
- Log a metric that makes the Bayesian MAP correspondence visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the Bayesian MAP view is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- weight decay in AdamW-based transformer training.
- dropout and stochastic regularization for neural networks.
- spectral normalization in GANs and Lipschitz-controlled models.
- SAM as a regularizer that penalizes sharp local neighborhoods.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.