Learning Rate Schedules, Part 1: Intuition and Formal Definitions
1. Intuition
This block develops intuition for Learning Rate Schedules. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
1.1 Why Learning Rate Schedules matters for training systems
In this section, polynomial decay is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Why Learning Rate Schedules matters for training systems" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, polynomial decay is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the step index t, the learning rate η_t, the iterate θ_t, the gradient g_t = ∇f(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where polynomial decay can be computed directly and compared with theory.
- A logistic-regression or softmax objective where polynomial decay affects optimization but the model remains interpretable.
- A transformer training diagnostic where polynomial decay appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating polynomial decay as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: η_t = η_0 · (1 − t/T)^p for t = 0, …, T, where η_0 is the initial learning rate, T the step horizon, and p > 0 the decay power.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving polynomial decay, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
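As a minimal sketch, polynomial decay can be written as a pure function of the step index; the defaults below (eta0 = 0.1, power = 2, total_steps = 1000) are illustrative choices, not values taken from this lesson.

```python
def polynomial_decay(t, eta0=0.1, power=2.0, total_steps=1000):
    """Polynomial decay: eta_t = eta0 * (1 - t/T)**p, held at zero past the horizon."""
    frac = min(t, total_steps) / total_steps   # progress through the run, in [0, 1]
    return eta0 * (1.0 - frac) ** power
```

Logging this value alongside the gradient norm makes the schedule's effect on the update norm directly visible.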
Implementation consequence:
- Log a metric that makes polynomial decay visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update given by the formula above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about polynomial decay is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
1.2 The optimization object: parameters, objective, algorithm, and diagnostic
In this section, linear warmup is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "The optimization object: parameters, objective, algorithm, and diagnostic" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, linear warmup is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the step index t, the learning rate η_t, the iterate θ_t, the gradient g_t = ∇f(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where linear warmup can be computed directly and compared with theory.
- A logistic-regression or softmax objective where linear warmup affects optimization but the model remains interpretable.
- A transformer training diagnostic where linear warmup appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating linear warmup as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: η_t = η_peak · t/T_w for 0 ≤ t ≤ T_w, after which the schedule hands off to its decay phase.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving linear warmup, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
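A minimal sketch of linear warmup as code; the defaults (peak_lr = 3e-4, warmup_steps = 100) are illustrative assumptions, not values from this lesson.

```python
def linear_warmup(t, peak_lr=3e-4, warmup_steps=100):
    """Linear warmup: eta_t = peak_lr * t / T_w for t < T_w, then held at peak_lr."""
    if t >= warmup_steps:
        return peak_lr
    return peak_lr * t / warmup_steps
```

In practice this phase is composed with a decay phase; here it is isolated so the ramp itself can be inspected.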
Implementation consequence:
- Log a metric that makes linear warmup visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update given by the formula above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about linear warmup is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
1.3 Historical arc from classical optimization to modern AI
In this section, warmup ratio is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Historical arc from classical optimization to modern AI" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, warmup ratio is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the step index t, the learning rate η_t, the iterate θ_t, the gradient g_t = ∇f(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where warmup ratio can be computed directly and compared with theory.
- A logistic-regression or softmax objective where warmup ratio affects optimization but the model remains interpretable.
- A transformer training diagnostic where warmup ratio appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating warmup ratio as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: r = T_w / T, the fraction of a T-step run spent in warmup, so T_w = rT warmup steps.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving warmup ratio, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
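A minimal sketch of a warmup ratio deriving the warmup length from the total run; the linear decay after warmup is one common choice among several, and all default values here are illustrative assumptions.

```python
def lr_from_warmup_ratio(t, total_steps=10000, warmup_ratio=0.01, peak_lr=1e-3):
    """Warmup length derived as a ratio of the total run, then linear decay to zero."""
    warmup_steps = max(1, int(warmup_ratio * total_steps))
    if t < warmup_steps:
        return peak_lr * t / warmup_steps              # linear warmup phase
    frac = (t - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, 1.0 - frac)              # linear decay phase
```

Expressing warmup as a ratio keeps the schedule shape stable when the run length changes, which is why it appears as a knob in many training frameworks.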
Implementation consequence:
- Log a metric that makes warmup ratio visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update given by the formula above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about warmup ratio is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
1.4 What this section treats as canonical scope
In this section, cosine annealing is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "What this section treats as canonical scope" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, cosine annealing is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the step index t, the learning rate η_t, the iterate θ_t, the gradient g_t = ∇f(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where cosine annealing can be computed directly and compared with theory.
- A logistic-regression or softmax objective where cosine annealing affects optimization but the model remains interpretable.
- A transformer training diagnostic where cosine annealing appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating cosine annealing as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: η_t = η_min + ½ (η_max − η_min)(1 + cos(π t / T)) for t = 0, …, T.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving cosine annealing, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
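A minimal sketch of cosine annealing; the defaults (eta_max = 1e-3, eta_min = 1e-5, total_steps = 1000) are illustrative assumptions, not values from this lesson.

```python
import math

def cosine_annealing(t, total_steps=1000, eta_max=1e-3, eta_min=1e-5):
    """eta_t = eta_min + 0.5*(eta_max - eta_min)*(1 + cos(pi*t/T)), held at eta_min past T."""
    t = min(t, total_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / total_steps))
```

The slow start and slow finish of the cosine shape are visible directly in the values: the rate barely moves near t = 0 and near t = T, and changes fastest in the middle of the run.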
Implementation consequence:
- Log a metric that makes cosine annealing visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update given by the formula above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about cosine annealing is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
1.5 A first mental model for LLM training
In this section, cosine with restarts is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "A first mental model for LLM training" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, cosine with restarts is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the step index t, the learning rate η_t, the iterate θ_t, the gradient g_t = ∇f(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where cosine with restarts can be computed directly and compared with theory.
- A logistic-regression or softmax objective where cosine with restarts affects optimization but the model remains interpretable.
- A transformer training diagnostic where cosine with restarts appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating cosine with restarts as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: η_t = η_min + ½ (η_max − η_min)(1 + cos(π T_cur / T_i)), where T_cur counts steps since the last restart and T_i is the current cycle length.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving cosine with restarts, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
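A minimal sketch of cosine annealing with restarts, using fixed-length cycles for simplicity; SGDR-style variants grow the cycle length by a multiplier after each restart. All default values are illustrative assumptions.

```python
import math

def cosine_with_restarts(t, cycle_steps=200, eta_max=1e-3, eta_min=1e-5):
    """Cosine annealing restarted from eta_max every cycle_steps steps (fixed cycles)."""
    t_cur = t % cycle_steps            # steps since the last restart
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / cycle_steps))
```

The restart is visible as a discontinuous jump back to eta_max at every cycle boundary, which is exactly the event to look for in logged learning-rate curves.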
Implementation consequence:
- Log a metric that makes cosine with restarts visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update given by the formula above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about cosine with restarts is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
2. Formal Definitions
This block develops formal definitions for Learning Rate Schedules. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
2.1 Primary definition: schedule function
In this section, the schedule function itself is the formal object, with cosine annealing treated as a concrete running example rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Primary definition: schedule function" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, cosine annealing is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the step index t, the learning rate η_t, the iterate θ_t, the gradient g_t = ∇f(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where cosine annealing can be computed directly and compared with theory.
- A logistic-regression or softmax objective where cosine annealing affects optimization but the model remains interpretable.
- A transformer training diagnostic where cosine annealing appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating cosine annealing as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: a schedule is a map η : {0, 1, …, T} → ℝ_{>0}, t ↦ η_t; cosine annealing η_t = η_min + ½ (η_max − η_min)(1 + cos(π t / T)) is the running example.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving cosine annealing, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
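The view of a schedule as a plain map from step index to step size can be sketched directly: a factory returns the function t ↦ η_t, here with cosine annealing as the example body. The parameter names and defaults are illustrative assumptions.

```python
import math

# A schedule is just a map from step index to step size: Callable[[int], float].
def make_cosine_schedule(total_steps, eta_max, eta_min=0.0):
    """Return the schedule t -> eta_t as a plain function, with a cosine example body."""
    def schedule(t):
        t = min(t, total_steps)
        return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / total_steps))
    return schedule
```

Treating the schedule as a first-class function keeps it decoupled from the optimizer: the training loop asks for η_t each step, and swapping schedules never touches the update rule.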
Implementation consequence:
- Log a metric that makes cosine annealing visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update given by the formula above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about cosine annealing is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
2.2 Secondary definition: constant learning rate
In this section, the constant learning rate is the baseline definition, with cosine with restarts treated as a concrete contrasting example rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Secondary definition: constant learning rate" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, cosine with restarts is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the step index t, the learning rate η_t, the iterate θ_t, the gradient g_t = ∇f(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where cosine with restarts can be computed directly and compared with theory.
- A logistic-regression or softmax objective where cosine with restarts affects optimization but the model remains interpretable.
- A transformer training diagnostic where cosine with restarts appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating cosine with restarts as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: the constant baseline is η_t = η_0 for all t; cosine with restarts replaces it with a cosine that is periodically reset to η_max.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving cosine with restarts, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
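The constant baseline is the degenerate schedule and is worth writing out explicitly, since every other schedule is measured against it; the default eta0 is an illustrative assumption.

```python
def constant_lr(t, eta0=1e-3):
    """The trivial schedule: eta_t = eta0, independent of the step index t."""
    return eta0
```

Under stochastic gradients, a constant rate typically stalls at a variance-determined loss floor; decaying or restarting schedules are designed precisely to move past that floor.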
Implementation consequence:
- Log a metric that makes cosine with restarts visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update given by the formula above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about cosine with restarts is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
2.3 Algorithmic object: step decay
In this section, step decay is the algorithmic object, with the cyclic learning rate treated as a closely related running example rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Algorithmic object: step decay" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, cyclic learning rate is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the step index t, the learning rate η_t, the iterate θ_t, the gradient g_t = ∇f(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where cyclic learning rate can be computed directly and compared with theory.
- A logistic-regression or softmax objective where cyclic learning rate affects optimization but the model remains interpretable.
- A transformer training diagnostic where cyclic learning rate appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating cyclic learning rate as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: step decay is η_t = η_0 · γ^⌊t/s⌋ with drop factor γ ∈ (0, 1) applied every s steps; the cyclic learning rate instead sweeps η_t back and forth between η_min and η_max.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving cyclic learning rate, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
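A minimal sketch of step decay as code; the defaults (eta0 = 0.1, a 10x drop every 30 steps) are illustrative assumptions, not values from this lesson.

```python
def step_decay(t, eta0=0.1, drop_factor=0.1, drop_every=30):
    """Step decay: eta_t = eta0 * drop_factor ** floor(t / drop_every)."""
    return eta0 * drop_factor ** (t // drop_every)
```

The piecewise-constant plateaus make step decay easy to audit in logs: the learning rate should change only at multiples of drop_every, and any drift between drops indicates a schedule bug.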
Implementation consequence:
- Log a metric that makes cyclic learning rate visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update given by the formula above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about cyclic learning rate is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
2.4 Examples, non-examples, and boundary cases
In this section, one-cycle policy is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Examples, non-examples, and boundary cases" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, one-cycle policy is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the step index t, the learning rate η_t, the iterate θ_t, the gradient g_t = ∇f(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where one-cycle policy can be computed directly and compared with theory.
- A logistic-regression or softmax objective where one-cycle policy affects optimization but the model remains interpretable.
- A transformer training diagnostic where one-cycle policy appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating one-cycle policy as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: η_t ramps from η_0 up to η_peak over the first phase of the run, then back down to η_final ≪ η_0 over the remainder (with momentum often varied inversely).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving one-cycle policy, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
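A minimal sketch of the one-cycle shape using linear ramps; the original policy also anneals momentum inversely to the learning rate, which is omitted here, and all default values are illustrative assumptions.

```python
def one_cycle(t, total_steps=1000, peak_lr=1e-2, start_lr=4e-4, end_lr=1e-5, pct_up=0.3):
    """Linear one-cycle: ramp start_lr -> peak_lr, then ramp peak_lr -> end_lr."""
    up_steps = int(pct_up * total_steps)
    if t < up_steps:
        return start_lr + (peak_lr - start_lr) * t / up_steps   # ascent phase
    frac = min(1.0, (t - up_steps) / (total_steps - up_steps))
    return peak_lr + (end_lr - peak_lr) * frac                  # descent phase
```

The single peak in the middle of the run is the policy's signature; if a logged learning-rate curve shows more than one peak, the run is not actually one-cycle.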
Implementation consequence:
- Log a metric that makes one-cycle policy visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update given by the formula above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about one-cycle policy is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
2.5 Notation, dimensions, and assumptions
In this section, linear decay is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Notation, dimensions, and assumptions" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, linear decay is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the step index t, the learning rate η_t, the iterate θ_t, the gradient g_t = ∇f(θ_t), and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where linear decay can be computed directly and compared with theory.
- A logistic-regression or softmax objective where linear decay affects optimization but the model remains interpretable.
- A transformer training diagnostic where linear decay appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating linear decay as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula: η_t = η_0 · (1 − t/T) for t = 0, …, T, reaching zero (or a chosen floor η_final) at the horizon.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving linear decay, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
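A minimal sketch of linear decay, generalized to an optional final floor; the defaults (eta0 = 1e-3, total_steps = 1000, eta_final = 0) are illustrative assumptions.

```python
def linear_decay(t, eta0=1e-3, total_steps=1000, eta_final=0.0):
    """Linear decay: eta_t interpolates from eta0 at t = 0 to eta_final at t = T."""
    frac = min(t, total_steps) / total_steps   # progress through the run, in [0, 1]
    return eta0 + (eta_final - eta0) * frac
```

Because the decrement per step is constant, linear decay makes unit checks easy: the difference between successive logged rates should be exactly (eta_final - eta0) / total_steps.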
Implementation consequence:
- Log a metric that makes linear decay visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update given by the formula above before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about linear decay is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- linear warmup plus cosine decay for transformer pretraining.
- warmup-stable-decay schedules for long LLM runs.
- one-cycle schedules for fast supervised training.
- batch-size and gradient-accumulation coupling in distributed training.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.