Learning Rate Schedules: Part 7: Applications in Machine Learning to References
7. Applications in Machine Learning
This block develops applications in machine learning for Learning Rate Schedules. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
7.1 linear warmup plus cosine decay for transformer pretraining
This section treats the schedule function as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. Under the heading "linear warmup plus cosine decay for transformer pretraining", that means following a fixed habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the schedule function is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the quantities in the update rule and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the schedule function can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the schedule function affects optimization but the model remains interpretable.
- A transformer training diagnostic where the schedule function appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the schedule function as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
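One common form, written here with assumed symbols (peak rate $\eta_{\max}$, floor $\eta_{\min}$, warmup length $T_w$, total length $T$), is $\eta_t = \eta_{\max}\, t / T_w$ for $t < T_w$ and $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\bigl(1 + \cos(\pi (t - T_w)/(T - T_w))\bigr)$ afterwards. A minimal Python sketch of that function follows; the name `warmup_cosine_lr` and its default values are illustrative and not taken from any library.

```python
import math

def warmup_cosine_lr(step, peak_lr=3e-4, min_lr=3e-5,
                     warmup_steps=100, total_steps=1000):
    """Learning rate at `step` (0-indexed): linear warmup, then cosine decay.

    All names and defaults are hypothetical, chosen for illustration.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over warmup_steps.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Evaluate the whole schedule once so it can be plotted or logged.
schedule = [warmup_cosine_lr(t) for t in range(1001)]
```

Because the schedule is a pure function of the step, the logged learning rate can be checked against it at any point in a run.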
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving the schedule function, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes a conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the schedule function visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update the math prescribes before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the schedule function is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
7.2 warmup-stable-decay schedules for long LLM runs
This section treats a constant learning rate as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. Under the heading "warmup-stable-decay schedules for long LLM runs", that means following a fixed habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, a constant learning rate is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the quantities in the update rule and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the effect of a constant learning rate can be computed directly and compared with theory.
- A logistic-regression or softmax objective where a constant learning rate affects optimization but the model remains interpretable.
- A transformer training diagnostic where a constant learning rate appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating a constant learning rate as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
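One way to write a warmup-stable-decay schedule, using assumed symbols (peak rate $\eta_{\max}$, warmup length $T_w$, decay start $T_d$, total length $T$): ramp linearly to $\eta_{\max}$ for $t < T_w$, hold $\eta_t = \eta_{\max}$ for $T_w \le t < T_d$, then decay to a floor over the remaining $T - T_d$ steps. The sketch below uses a linear decay, which is an illustrative choice since other decay shapes are also used; `wsd_lr` and its defaults are hypothetical.

```python
def wsd_lr(step, peak_lr=1e-3, min_lr=1e-4,
           warmup_steps=100, decay_start=800, total_steps=1000):
    """Warmup-stable-decay learning rate at `step` (hypothetical helper)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps      # warmup phase
    if step < decay_start:
        return peak_lr                            # stable (constant) phase
    # Decay phase: linear interpolation from peak_lr down to min_lr.
    frac = min(1.0, (step - decay_start) / (total_steps - decay_start))
    return peak_lr + frac * (min_lr - peak_lr)

# Sample a few points from each phase.
samples = {t: wsd_lr(t) for t in (0, 50, 100, 500, 900, 1000)}
```

In this sketch the stable phase is exactly a constant learning rate, and moving `decay_start` later leaves every earlier step unchanged, which is one reason the shape is convenient when the final run length is not fixed in advance.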
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving the constant learning rate, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes a conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the constant learning rate visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update the math prescribes before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the constant learning rate is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
7.3 one-cycle schedules for fast supervised training
This section treats step decay as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. Under the heading "one-cycle schedules for fast supervised training", that means following a fixed habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, step decay is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the quantities in the update rule and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where step decay can be computed directly and compared with theory.
- A logistic-regression or softmax objective where step decay affects optimization but the model remains interpretable.
- A transformer training diagnostic where step decay appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating step decay as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
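A one-cycle schedule, written with assumed symbols: the rate rises from a start value $\eta_0$ to a peak $\eta_{\max}$ over the first fraction $p$ of training, then anneals to a final value $\eta_{\text{end}} \ll \eta_0$. The sketch below uses cosine interpolation in both phases, which is one common choice; `one_cycle_lr` and its defaults are hypothetical and do not reproduce any particular library's scheduler.

```python
import math

def one_cycle_lr(step, total_steps=1000, max_lr=1e-2,
                 start_lr=4e-4, final_lr=4e-6, pct_up=0.3):
    """One-cycle learning rate at `step` using cosine interpolation.

    Hypothetical helper: rises start_lr -> max_lr over the first
    pct_up of training, then falls max_lr -> final_lr.
    """
    up_steps = round(pct_up * total_steps)

    def cos_interp(a, b, frac):
        # Moves smoothly from a (frac = 0) to b (frac = 1).
        return b + 0.5 * (a - b) * (1.0 + math.cos(math.pi * frac))

    if step < up_steps:
        return cos_interp(start_lr, max_lr, step / up_steps)
    frac = min(1.0, (step - up_steps) / (total_steps - up_steps))
    return cos_interp(max_lr, final_lr, frac)

cycle = [one_cycle_lr(t) for t in range(1001)]
```

Unlike step decay, which only ever lowers the rate at fixed milestones, this shape deliberately raises the rate first, so the peak value is the number to watch when a run becomes unstable.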
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving step decay, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes a conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes step decay visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update the math prescribes before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about step decay is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
7.4 batch-size and gradient-accumulation coupling in distributed training
This section treats exponential decay as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. Under the heading "batch-size and gradient-accumulation coupling in distributed training", that means following a fixed habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, exponential decay is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the quantities in the update rule and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where exponential decay can be computed directly and compared with theory.
- A logistic-regression or softmax objective where exponential decay affects optimization but the model remains interpretable.
- A transformer training diagnostic where exponential decay appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating exponential decay as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
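A heuristic often used for this coupling is the linear scaling rule. With assumed symbols, the effective batch size is $B_{\text{eff}} = b \cdot k \cdot n$ (per-device batch $b$, accumulation steps $k$, data-parallel workers $n$), and the peak rate is rescaled as $\eta = \eta_{\text{ref}}\, B_{\text{eff}} / B_{\text{ref}}$. This is a rule of thumb rather than a guarantee, and square-root scaling is sometimes preferred with adaptive optimizers; the sketch below exposes both through a hypothetical `scaled_lr` helper.

```python
def effective_batch_size(per_device_batch, accum_steps, num_workers):
    """Number of examples contributing to one optimizer step."""
    return per_device_batch * accum_steps * num_workers

def scaled_lr(ref_lr, ref_batch, eff_batch, rule="linear"):
    """Rescale a reference learning rate to a new effective batch size.

    Hypothetical helper; `rule` selects linear or square-root scaling.
    """
    ratio = eff_batch / ref_batch
    if rule == "linear":
        return ref_lr * ratio
    if rule == "sqrt":
        return ref_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")

# Illustrative numbers: the reference run used batch 256 at lr 1e-3;
# the new run uses 32 per device, 4 accumulation steps, 8 workers.
b_eff = effective_batch_size(32, 4, 8)
lr_linear = scaled_lr(1e-3, 256, b_eff)
lr_sqrt = scaled_lr(1e-3, 256, b_eff, rule="sqrt")
```

Whichever rule is chosen, the rescaled value replaces the peak of the schedule; the shape (warmup, decay) is then applied on top of it, counted in optimizer steps rather than micro-batches.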
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving exponential decay, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes a conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes exponential decay visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update the math prescribes before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about exponential decay is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
7.5 Diagnostic checklist for real experiments
This section treats polynomial decay as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. Under the heading "Diagnostic checklist for real experiments", that means following a fixed habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, polynomial decay is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the quantities in the update rule and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where polynomial decay can be computed directly and compared with theory.
- A logistic-regression or softmax objective where polynomial decay affects optimization but the model remains interpretable.
- A transformer training diagnostic where polynomial decay appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating polynomial decay as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving polynomial decay, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes a conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes polynomial decay visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update the math prescribes before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
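The logging advice above can be made concrete with a small helper. The sketch below assumes plain SGD on NumPy arrays and reports gradient norm, update norm, parameter norm, and the update-to-parameter ratio as separate numbers; `step_diagnostics` is a hypothetical name.

```python
import numpy as np

def step_diagnostics(params, grad, lr):
    """Return per-step norms for one plain-SGD update (illustrative)."""
    update = -lr * grad
    param_norm = float(np.linalg.norm(params))
    update_norm = float(np.linalg.norm(update))
    return {
        "lr": lr,
        "grad_norm": float(np.linalg.norm(grad)),
        "update_norm": update_norm,
        "param_norm": param_norm,
        # Relative step size; the small constant avoids division by zero.
        "update_ratio": update_norm / (param_norm + 1e-12),
    }

params = np.array([3.0, 4.0])
grad = np.array([0.6, 0.8])
stats = step_diagnostics(params, grad, lr=0.1)
```

Logging these as distinct fields keeps the units apart: a flat loss with a rising `update_ratio` tells a different story from a flat loss with a vanishing one.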
Diagnostic questions:
- Which assumption about polynomial decay is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
8. Implementation and Diagnostics
This block develops implementation and diagnostics for Learning Rate Schedules. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
8.1 Minimal NumPy experiment for learning-rate rewinding
This section treats exponential decay as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. Under the heading "Minimal NumPy experiment for learning-rate rewinding", that means following a fixed habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, exponential decay is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the quantities in the update rule and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where exponential decay can be computed directly and compared with theory.
- A logistic-regression or softmax objective where exponential decay affects optimization but the model remains interpretable.
- A transformer training diagnostic where exponential decay appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating exponential decay as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
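Here is one possible version of the experiment named in the heading. It assumes "learning-rate rewinding" in the pruning sense: keep the trained weights, prune, then retrain while replaying the last `k` steps of the original schedule instead of continuing at the final small rate. The exponential schedule $\eta_t = \eta_0 \gamma^t$, the noiseless least-squares problem, and all names and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:5] = rng.normal(size=5)           # only 5 features matter
y = X @ w_true

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w):
    return X.T @ (X @ w - y) / len(y)

def exp_lr(t, lr0=0.1, gamma=0.97):
    return lr0 * gamma ** t               # exponential decay

def train(w, steps, mask):
    # `steps` lists schedule indices; mask keeps pruned weights at zero.
    for t in steps:
        w = (w - exp_lr(t) * grad(w)) * mask
    return w

T, k = 200, 100
dense = train(np.zeros(10), range(T), np.ones(10))

# Prune: keep the 5 largest-magnitude weights.
mask = np.zeros(10)
mask[np.argsort(np.abs(dense))[-5:]] = 1.0
pruned = dense * mask

# (a) Fine-tuning: k more steps at the final (smallest) learning rate.
finetuned = train(pruned, [T - 1] * k, mask)
# (b) Learning-rate rewinding: replay schedule steps T-k .. T-1.
rewound = train(pruned, range(T - k, T), mask)

print(loss(pruned), loss(finetuned), loss(rewound))
```

On this toy quadratic the rewound run takes larger steps for the same retraining budget, so it ends at a loss no higher than the fine-tuned run. Whether the same ordering holds for a real network is an empirical question that this sketch does not answer.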
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving exponential decay, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes a conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes exponential decay visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update the math prescribes before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about exponential decay is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
8.2 Monitoring signal for batch-size scaling
This section treats polynomial decay as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. Under the heading "Monitoring signal for batch-size scaling", that means following a fixed habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, polynomial decay is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the quantities in the update rule and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where polynomial decay can be computed directly and compared with theory.
- A logistic-regression or softmax objective where polynomial decay affects optimization but the model remains interpretable.
- A transformer training diagnostic where polynomial decay appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating polynomial decay as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
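One candidate monitoring signal, offered as an assumption about what this subsection intends, is a simple gradient noise scale: the ratio of the per-example gradient variance to the squared norm of the mean gradient, $B_{\text{noise}} \approx \operatorname{tr}(\Sigma) / \lVert \bar g \rVert^2$. When the batch size is far below this value, gradient noise dominates and larger batches tend to help; far above it, returns diminish. The estimator below is a naive plug-in version on per-example gradients, and all names are hypothetical.

```python
import numpy as np

def simple_noise_scale(per_example_grads):
    """Plug-in estimate of tr(Sigma) / ||mean gradient||^2.

    per_example_grads: array of shape (num_examples, num_params).
    Naive estimator: the denominator is inflated by noise when
    num_examples is small.
    """
    g_bar = per_example_grads.mean(axis=0)
    # Sum over parameters of the unbiased per-coordinate variance.
    trace_sigma = per_example_grads.var(axis=0, ddof=1).sum()
    return float(trace_sigma / (g_bar @ g_bar))

# Synthetic check: true gradient has squared norm 4, noise has trace 8 * 25.
rng = np.random.default_rng(0)
true_grad = np.full(8, np.sqrt(0.5))             # ||g||^2 = 8 * 0.5 = 4
grads = true_grad + 5.0 * rng.normal(size=(20000, 8))
estimate = simple_noise_scale(grads)             # expected to land near 200 / 4 = 50
```

Tracked over training, a rising value of this ratio is one hint that a larger batch (with a correspondingly rescaled learning rate) may be worth testing.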
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving polynomial decay, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes a conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes polynomial decay visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update the math prescribes before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about polynomial decay is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
8.3 Failure signature for gradient accumulation coupling
This section treats linear warmup as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. Under the heading "Failure signature for gradient accumulation coupling", that means following a fixed habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, linear warmup is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the quantities in the update rule and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where linear warmup can be computed directly and compared with theory.
- A logistic-regression or softmax objective where linear warmup affects optimization but the model remains interpretable.
- A transformer training diagnostic where linear warmup appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating linear warmup as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
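One failure signature worth checking, stated as an assumption about what this subsection targets: if micro-batch gradients are summed over $k$ accumulation steps without dividing by $k$, the applied update under plain SGD is $k$ times the intended one. That looks like a learning rate $k$ times too large and can undo the protection linear warmup was meant to provide. The NumPy sketch below demonstrates the mismatch on a least-squares problem; names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = rng.normal(size=64)
w = rng.normal(size=5)

def mean_grad(Xb, yb, w):
    """Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)**2)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

k = 4                                           # accumulation steps
micro_batches = zip(np.split(X, k), np.split(y, k))
micro_grads = [mean_grad(Xb, yb, w) for Xb, yb in micro_batches]

full = mean_grad(X, y, w)                       # intended large-batch gradient
buggy = np.sum(micro_grads, axis=0)             # summed, never divided by k
fixed = np.sum(micro_grads, axis=0) / k         # each micro-loss scaled by 1/k

ratio_buggy = np.linalg.norm(buggy) / np.linalg.norm(full)
ratio_fixed = np.linalg.norm(fixed) / np.linalg.norm(full)
```

In a real run the same check amounts to comparing the logged update norm with the learning rate times the gradient norm of an equivalent large batch; a ratio close to `k` points at the accumulation code rather than the schedule.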
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving linear warmup, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes a conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes linear warmup visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update the math prescribes before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about linear warmup is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
8.4 Framework-level implementation pattern
This section treats the warmup ratio as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. Under the heading "Framework-level implementation pattern", that means following a fixed habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the warmup ratio is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the quantities in the update rule and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the effect of the warmup ratio can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the warmup ratio affects optimization but the model remains interpretable.
- A transformer training diagnostic where the warmup ratio appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the warmup ratio as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
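A framework-agnostic sketch of the pattern: keep the schedule as a pure function of the step counter, derive the warmup length as $T_w = r\,T$ from a warmup ratio $r$ and total steps $T$, and wrap it in a small object that can be checkpointed. The class below imitates the step and state-dict convention found in common deep learning frameworks but is not any framework's API; every name is hypothetical, and linear decay after warmup is an arbitrary illustrative choice.

```python
class WarmupRatioSchedule:
    """Linear warmup for warmup_ratio * total_steps, then linear decay to 0."""

    def __init__(self, peak_lr, total_steps, warmup_ratio=0.05):
        self.peak_lr = peak_lr
        self.total_steps = total_steps
        # At least one step in each phase so the divisions below are safe.
        self.warmup_steps = max(1, round(warmup_ratio * total_steps))
        self.decay_steps = max(1, total_steps - self.warmup_steps)
        self.step_count = 0

    def lr_at(self, step):
        # Pure function of `step`: easy to plot, test, and resume.
        if step < self.warmup_steps:
            return self.peak_lr * (step + 1) / self.warmup_steps
        remaining = max(0, self.total_steps - step)
        return self.peak_lr * remaining / self.decay_steps

    def step(self):
        """Advance once per optimizer step and return the new rate."""
        self.step_count += 1
        return self.lr_at(self.step_count)

    def state_dict(self):
        return {"step_count": self.step_count}

    def load_state_dict(self, state):
        self.step_count = state["step_count"]

sched = WarmupRatioSchedule(peak_lr=1e-3, total_steps=1000, warmup_ratio=0.1)
lrs = [sched.lr_at(0)] + [sched.step() for _ in range(999)]
```

Calling `step()` once per optimizer step (not once per micro-batch) and saving `state_dict()` with each checkpoint are the two habits this pattern is meant to enforce.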
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving the warmup ratio, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes a conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the warmup ratio visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update the math prescribes before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the warmup ratio is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
8.5 Reproducibility and logging checklist
This section treats cosine annealing as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. Under the heading "Reproducibility and logging checklist", that means following a fixed habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, cosine annealing is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the quantities in the update rule and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where cosine annealing can be computed directly and compared with theory.
- A logistic-regression or softmax objective where cosine annealing affects optimization but the model remains interpretable.
- A transformer training diagnostic where cosine annealing appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating cosine annealing as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around the current iterate, isolate the term involving cosine annealing, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes a conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes cosine annealing visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the update the math prescribes before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
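A minimal sketch of what such logging might look like, assuming JSON-lines output and a cosine-annealing schedule $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t / T))$. The idea is to record the seed and the full schedule configuration once, then the learning rate at every logged step, so the curve can be regenerated from the log alone. Field names and the `make_log_lines` helper are hypothetical.

```python
import json
import math

def cosine_lr(step, max_lr, min_lr, total_steps):
    """Cosine annealing from max_lr (step 0) to min_lr (step total_steps)."""
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * step / total_steps))

def make_log_lines(config, log_every):
    """Return JSON lines: one config header, then one record per logged step."""
    lines = [json.dumps({"type": "config", **config}, sort_keys=True)]
    sched = config["schedule"]
    for step in range(0, sched["total_steps"] + 1, log_every):
        record = {
            "type": "step",
            "step": step,
            "lr": cosine_lr(step, sched["max_lr"], sched["min_lr"], sched["total_steps"]),
        }
        lines.append(json.dumps(record, sort_keys=True))
    return lines

config = {
    "seed": 1234,
    "schedule": {"name": "cosine", "max_lr": 1e-3, "min_lr": 1e-5, "total_steps": 100},
}
lines = make_log_lines(config, log_every=25)
```

Writing the configuration into the same file as the per-step records means a later reader does not have to guess which schedule produced a given curve.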
Diagnostic questions:
- Which assumption about cosine annealing is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
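Three of the items above can be sketched as small pure functions of the step index. This is an illustrative sketch under stated assumptions, not a reference implementation: the function and parameter names are hypothetical, warmup is taken to reach the peak on step `warmup_steps - 1`, and the decay phase of the warmup-stable-decay schedule is assumed linear, although other decay shapes are also used in practice.

```python
import math

def warmup_cosine(step, total_steps, warmup_steps, lr_peak, lr_min=0.0):
    """Linear warmup to lr_peak, then cosine decay to lr_min."""
    if step < warmup_steps:
        return lr_peak * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return lr_min + 0.5 * (lr_peak - lr_min) * (1.0 + math.cos(math.pi * progress))

def warmup_stable_decay(step, total_steps, warmup_steps, decay_steps, lr_peak, lr_min=0.0):
    """Linear warmup, constant plateau, then linear decay over the last decay_steps."""
    if step < warmup_steps:
        return lr_peak * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return lr_peak
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return lr_peak + (lr_min - lr_peak) * progress

def effective_batch_size(micro_batch, accumulation_steps, num_devices):
    """Examples consumed per optimizer step in data-parallel training."""
    return micro_batch * accumulation_steps * num_devices

# Usage: sample each schedule at a few steps of a 100-step run.
for s in (0, 9, 10, 55, 100):
    print(s, warmup_cosine(s, 100, 10, 1e-3), warmup_stable_decay(s, 100, 10, 20, 1e-3))
print(effective_batch_size(micro_batch=8, accumulation_steps=4, num_devices=16))
```

Because the schedule is indexed by optimizer steps, changing the accumulation count or device count changes how many examples each step consumes, so the step budget and the schedule length should be revisited together.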
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
9. Common Mistakes
| # | Mistake | Why It Is Wrong | Fix |
|---|---|---|---|
| 1 | Using a recipe without checking assumptions | Optimization guarantees depend on smoothness, convexity, stochasticity, or feasibility assumptions. | Write the assumptions next to the update rule before choosing hyperparameters. |
| 2 | Confusing objective decrease with validation improvement | The optimizer sees the training objective; validation behavior also depends on generalization and data split quality. | Track objective, train metric, validation metric, and update norm separately. |
| 3 | Treating all norms as interchangeable | The geometry changes when the norm changes, especially for constraints and regularizers. | State whether you use the $\ell_1$, $\ell_2$, Frobenius, spectral, or another norm. |
| 4 | Ignoring scale | Learning rates, penalties, curvature, and gradient norms are all scale-sensitive. | Normalize units and inspect the effective update size, for example the learning rate times the gradient norm for plain SGD. |
| 5 | Overfitting to a single seed | Optimization can look stable for one seed and fail under another. | Run small seed sweeps for important claims. |
| 6 | Hiding instability behind smoothed plots | A moving average can hide spikes, divergence, and bad curvature events. | Plot raw metrics alongside smoothed metrics. |
| 7 | Using test data during tuning | This contaminates the final evaluation. | Reserve test data until after model and hyperparameter selection. |
| 8 | Assuming large models make theory irrelevant | Large models often make diagnostics more important because failures are expensive. | Use theory to decide what to log, not to pretend every theorem applies exactly. |
| 9 | Mixing optimizer state with model state carelessly | State corruption changes the effective algorithm. | Checkpoint parameters, gradients if needed, optimizer moments, scheduler state, and random seeds. |
| 10 | Not checking numerical precision | BF16, FP16, FP8, and accumulation choices can change the observed optimizer. | Cross-check suspicious runs against higher precision on a small batch. |
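Row 9 of the table can be made concrete with a resume test: a run that is checkpointed and restored should produce exactly the same parameters as an uninterrupted run. The sketch below is a toy illustration using only the standard library. The checkpoint layout, the function names, and the noisy one-dimensional-per-coordinate objective are all hypothetical stand-ins for a real training loop, and the schedule is stored as a pure function of the step so that saving the step index is enough to restore it.

```python
import json
import math
import random

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def train(start_step, end_step, params, schedule_config, rng):
    """Noisy gradient descent on f(x) = 0.5 * ||x||^2, a stand-in for real training."""
    for step in range(start_step, end_step):
        lr = cosine_lr(step, **schedule_config)
        noisy_grads = [p + rng.gauss(0.0, 0.1) for p in params]
        params = [p - lr * g for p, g in zip(params, noisy_grads)]
    return params

def make_checkpoint(step, params, schedule_config, rng):
    """Serialize the step, parameters, schedule settings, and RNG state together."""
    return json.dumps({
        "step": step,
        "params": params,
        "schedule": schedule_config,
        "rng_state": rng.getstate(),
    })

def restore_checkpoint(blob):
    data = json.loads(blob)
    version, internal, gauss_next = data["rng_state"]
    rng = random.Random()
    # JSON turns tuples into lists, so rebuild the tuple that setstate expects.
    rng.setstate((version, tuple(internal), gauss_next))
    return data, rng

config = {"total_steps": 10, "lr_max": 0.5, "lr_min": 0.0}

# Uninterrupted run.
full = train(0, 10, [1.0, -2.0], config, random.Random(0))

# Interrupted run: stop at step 5, checkpoint, restore, and continue.
rng = random.Random(0)
half = train(0, 5, [1.0, -2.0], config, rng)
data, restored_rng = restore_checkpoint(make_checkpoint(5, half, config, rng))
resumed = train(data["step"], 10, data["params"], data["schedule"], restored_rng)

print("resume matches uninterrupted run:", resumed == full)
```

Dropping the step index or the RNG state from the checkpoint makes the resumed run diverge from the uninterrupted one, which is a quick way to see which pieces of state the effective algorithm depends on. A real optimizer would also need its moment estimates saved alongside these fields.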
10. Exercises
- Exercise 1 [*] - Step Decay. (a) Define step decay using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update rule or inequality for step decay. (d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 2 [*] - Polynomial Decay. (a) Define polynomial decay using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update rule or inequality for polynomial decay. (d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 3 [*] - Warmup Ratio. (a) Define warmup ratio using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update rule or inequality for the warmup ratio. (d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 4 [**] - Cosine With Restarts. (a) Define cosine with restarts using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update rule or inequality for cosine with restarts. (d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 5 [**] - One-Cycle Policy. (a) Define one-cycle policy using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update rule or inequality for the one-cycle policy. (d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 6 [**] - Inverse-Square-Root Decay. (a) Define inverse-square-root decay using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update rule or inequality for inverse-square-root decay. (d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 7 [**] - Cooldown. (a) Define cooldown using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update rule or inequality for cooldown. (d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 8 [***] - Batch-Size Scaling. (a) Define batch-size scaling using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update rule or inequality for batch-size scaling. (d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 9 [***] - Token-Budget Scheduling. (a) Define token-budget scheduling using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update rule or inequality for token-budget scheduling. (d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 10 [***] - LLM Pretraining Schedule Design. (a) Define LLM pretraining schedule design using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update rule or inequality for LLM pretraining schedule design. (d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
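Part (d) of each exercise asks for a NumPy check on a synthetic two-dimensional objective. A possible starting scaffold is sketched below, using step decay as the example schedule. The matrix `A`, the step-decay parameters, and the function names are illustrative assumptions; the step-decay form used is the common "multiply by a factor every fixed number of steps" rule, which may differ from the repository's notation. Swapping in another schedule only requires replacing the function passed to `run_gd`.

```python
import numpy as np

def step_decay(step, lr0, drop_factor, drop_every):
    """Multiply lr0 by drop_factor once every drop_every steps."""
    return lr0 * drop_factor ** (step // drop_every)

# Hypothetical symmetric positive definite matrix defining f(x) = 0.5 * x^T A x.
A = np.array([[3.0, 0.5], [0.5, 1.0]])

def objective(x):
    return 0.5 * x @ A @ x

def run_gd(schedule, num_steps, x0):
    """Full-batch gradient descent with a step-indexed learning rate."""
    x = np.array(x0, dtype=float)
    history = []
    for step in range(num_steps):
        lr = schedule(step)
        grad = A @ x
        x = x - lr * grad
        history.append((lr, objective(x)))
    return x, history

def schedule(step):
    return step_decay(step, lr0=0.3, drop_factor=0.5, drop_every=20)

x_final, history = run_gd(schedule, num_steps=60, x0=[1.0, -1.0])
losses = np.array([loss for _, loss in history])

# With lr0 below 1 / lambda_max(A), every step should decrease the objective.
print("largest eigenvalue of A:", np.linalg.eigvalsh(A).max())
print("monotone decrease:", bool(np.all(np.diff(losses) <= 1e-12)))
print("final objective:", losses[-1])
```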
11. Why This Matters for AI (2026 Perspective)
| Concept | AI Impact |
|---|---|
| schedule function | linear warmup plus cosine decay for transformer pretraining |
| constant learning rate | warmup-stable-decay schedules for long LLM runs |
| step decay | one-cycle schedules for fast supervised training |
| exponential decay | batch-size and gradient-accumulation coupling in distributed training |
| polynomial decay | linear warmup plus cosine decay for transformer pretraining |
| linear warmup | warmup-stable-decay schedules for long LLM runs |
| warmup ratio | one-cycle schedules for fast supervised training |
| cosine annealing | batch-size and gradient-accumulation coupling in distributed training |
| cosine with restarts | linear warmup plus cosine decay for transformer pretraining |
| cyclic learning rate | warmup-stable-decay schedules for long LLM runs |
12. Conceptual Bridge
Learning Rate Schedules sits inside a chain. Earlier sections give the calculus, probability, and linear algebra needed to write the objective and interpret the update. Later sections use this material to reason about noisy gradients, adaptive state, regularization, tuning, schedules, and finally information-theoretic losses.
Backward link: Hyperparameter Optimization supplies the immediate prerequisite vocabulary.
Forward link: Chapter 9 turns optimization objectives into information-theoretic quantities such as entropy, KL divergence, cross-entropy, and Fisher information.
+----------------------------------------------------------------+
| Chapter 8: Optimization                                        |
|    01-Convex-Optimization         Convex Optimization          |
|    02-Gradient-Descent            Gradient Descent             |
|    03-Second-Order-Methods        Second-Order Methods         |
|    04-Constrained-Optimization    Constrained Optimization     |
|    05-Stochastic-Optimization     Stochastic Optimization      |
|    06-Optimization-Landscape      Optimization Landscape       |
|    07-Adaptive-Learning-Rate      Adaptive Learning Rate       |
|    08-Regularization-Methods      Regularization Methods       |
|    09-Hyperparameter-Optimization Hyperparameter Optimization  |
| >> 10-Learning-Rate-Schedules     Learning Rate Schedules      |
+----------------------------------------------------------------+
Appendix A. Extended Derivation and Diagnostic Cards
References
- Smith, Cyclical Learning Rates for Training Neural Networks.
- Loshchilov and Hutter, SGDR: Stochastic Gradient Descent with Warm Restarts.
- Vaswani et al., Attention Is All You Need.
- Recent work on warmup-stable-decay schedules for large language models.
- Goodfellow, Bengio, and Courville, Deep Learning.
- Bottou, Curtis, and Nocedal, Optimization Methods for Large-Scale Machine Learning.
- PyTorch optimizer and scheduler documentation.
- Optax documentation for composable optimizer transformations.