Part 2


Learning Rate Schedules: Part 2: 3. Core Theory I: Geometry and Guarantees to 4. Core Theory II: Algorithms and Dynamics

3. Core Theory I: Geometry and Guarantees

This block develops Core Theory I: Geometry and Guarantees for Learning Rate Schedules. It keeps the scope local to this section and points forward when a neighboring topic owns the full treatment.

3.1 Geometry of exponential decay

In this section, one-cycle policy is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Geometry of exponential decay" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, one-cycle policy is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

A small synthetic quadratic where one-cycle policy can be computed directly and compared with theory.
A logistic-regression or softmax objective where one-cycle policy affects optimization but the model remains interpretable.
A transformer training diagnostic where one-cycle policy appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

Treating one-cycle policy as a hyperparameter recipe without checking the objective assumptions.
Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula (cosine annealing from $\eta_{\max}$ at $t = 0$ to $\eta_{\min}$ at $t = T$):

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\frac{\pi t}{T}\right)$$
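The cosine formula above can be evaluated directly. The following is a minimal sketch, not a reference implementation: the function name `cosine_annealing` and the choice to clamp `t` to the range `[0, T]` are assumptions made for illustration.

```python
import math


def cosine_annealing(t: int, T: int, eta_max: float, eta_min: float = 0.0) -> float:
    """Evaluate eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))."""
    if T <= 0:
        raise ValueError("T must be positive")
    # Assumption: clamp t so the rate holds at eta_min once the horizon T is reached.
    t = min(max(t, 0), T)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))


# Print a few points along a hypothetical 1000-step run.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_annealing(step, T=1000, eta_max=3e-4, eta_min=3e-5))
```

The endpoints are the useful sanity check: the rate equals `eta_max` at `t = 0`, equals `eta_min` at `t = T`, and passes through the midpoint of the two at `t = T / 2`.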

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving one-cycle policy, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
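As one worked instance of the geometric case, assume (these assumptions are added here for illustration and are not stated in the section) that $f$ is $L$-smooth and that the step is plain gradient descent, $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \nabla f(\boldsymbol{\theta}_t)$. The standard quadratic upper bound then turns into a one-step inequality:

```latex
\begin{aligned}
f(\boldsymbol{\theta}_{t+1})
&\le f(\boldsymbol{\theta}_t)
   + \nabla f(\boldsymbol{\theta}_t)^\top (\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t)
   + \frac{L}{2}\,\|\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t\|^2 \\
&= f(\boldsymbol{\theta}_t)
   - \eta_t \left(1 - \frac{L \eta_t}{2}\right) \|\nabla f(\boldsymbol{\theta}_t)\|^2 .
\end{aligned}
```

Under these assumptions the objective is guaranteed not to increase whenever $0 < \eta_t \le 2/L$, so the bound constrains the largest value a schedule may take rather than its shape.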

Implementation consequence:

Log a metric that makes one-cycle policy visible; otherwise a training run can fail while the scalar loss hides the cause.
Compare the measured update with the mathematical update below before blaming data or architecture.

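$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{u}_t$$

Here $\mathbf{u}_t$ denotes the update direction: the gradient for plain gradient descent, or the optimizer's transformed gradient otherwise.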

Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
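The comparison between the measured and the mathematical update can be sketched on a small synthetic quadratic. Everything below is an illustrative assumption: the objective $f(\boldsymbol{\theta}) = \frac{1}{2}\sum_i c_i \theta_i^2$, plain gradient descent (so the update direction is the gradient), and the names `run_quadratic` and `cosine_lr`.

```python
import math


def cosine_lr(t, T, eta_max, eta_min):
    """Cosine annealing from eta_max at t = 0 toward eta_min at t = T."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))


def run_quadratic(curvatures=(1.0, 10.0), theta0=(1.0, 1.0), T=50,
                  eta_max=0.15, eta_min=0.01):
    """Plain gradient descent on f = 0.5 * sum(c_i * theta_i**2), logging per-step norms."""
    theta = list(theta0)
    log = []
    for t in range(T):
        eta = cosine_lr(t, T, eta_max, eta_min)
        grad = [c * x for c, x in zip(curvatures, theta)]  # u_t is the full gradient here
        new_theta = [x - eta * g for x, g in zip(theta, grad)]
        log.append({
            "step": t,
            "lr": eta,
            "loss": 0.5 * sum(c * x * x for c, x in zip(curvatures, theta)),
            "grad_norm": math.sqrt(sum(g * g for g in grad)),
            # Measured from the parameters, not recomputed from the gradient.
            "update_norm": math.sqrt(sum((a - b) ** 2 for a, b in zip(new_theta, theta))),
        })
        theta = new_theta
    return log


for row in run_quadratic()[::10]:
    print(row)
```

For plain gradient descent the measured `update_norm` should match `lr * grad_norm` up to rounding. With an adaptive optimizer the two would differ, which is exactly why they are logged as separate quantities.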

Diagnostic questions:

Which assumption about one-cycle policy is most fragile in the current training setup?
What number would you log to catch the failure one thousand steps before divergence?

AI connection:

linear warmup plus cosine decay for transformer pretraining.
warmup-stable-decay schedules for long LLM runs.
one-cycle schedules for fast supervised training.
batch-size and gradient-accumulation coupling in distributed training.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

3.2 Key inequality for polynomial decay

In this section, linear decay is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Key inequality for polynomial decay" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, linear decay is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

A small synthetic quadratic where linear decay can be computed directly and compared with theory.
A logistic-regression or softmax objective where linear decay affects optimization but the model remains interpretable.
A transformer training diagnostic where linear decay appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

Treating linear decay as a hyperparameter recipe without checking the objective assumptions.
Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula (cosine annealing from $\eta_{\max}$ at $t = 0$ to $\eta_{\min}$ at $t = T$):

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\frac{\pi t}{T}\right)$$
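For comparison with the cosine shape above, linear decay is the power-1 case of polynomial decay. The sketch below assumes the parameterization $\eta_t = \eta_{\min} + (\eta_{\max}-\eta_{\min})(1 - t/T)^p$, which is one common form and is not taken from this lesson. The name `polynomial_decay` and the `power` argument are illustrative.

```python
def polynomial_decay(t: int, T: int, eta_max: float, eta_min: float = 0.0,
                     power: float = 1.0) -> float:
    """eta_min + (eta_max - eta_min) * (1 - t/T)**power; power = 1 gives linear decay."""
    if T <= 0:
        raise ValueError("T must be positive")
    frac = min(max(t, 0), T) / T  # fraction of the horizon already used, clamped to [0, 1]
    return eta_min + (eta_max - eta_min) * (1.0 - frac) ** power


# Linear (power=1) versus quadratic (power=2) decay on a hypothetical 100-step run.
for step in (0, 25, 50, 75, 100):
    print(step, polynomial_decay(step, 100, 1.0), polynomial_decay(step, 100, 1.0, power=2.0))
```

With the same endpoints, the cosine schedule sits above the linear ramp during the first half of training and below it during the second half, so the two spend their step budget differently even though both start at `eta_max` and end at `eta_min`.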

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving linear decay, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

Log a metric that makes linear decay visible; otherwise a training run can fail while the scalar loss hides the cause.
Compare the measured update with the mathematical update below before blaming data or architecture.

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{u}_t$$

Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

Which assumption about linear decay is most fragile in the current training setup?
What number would you log to catch the failure one thousand steps before divergence?

AI connection:

linear warmup plus cosine decay for transformer pretraining.
warmup-stable-decay schedules for long LLM runs.
one-cycle schedules for fast supervised training.
batch-size and gradient-accumulation coupling in distributed training.

3.3 Role of linear warmup

In this section, inverse-square-root decay is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Role of linear warmup" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, inverse-square-root decay is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

A small synthetic quadratic where inverse-square-root decay can be computed directly and compared with theory.
A logistic-regression or softmax objective where inverse-square-root decay affects optimization but the model remains interpretable.
A transformer training diagnostic where inverse-square-root decay appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

Treating inverse-square-root decay as a hyperparameter recipe without checking the objective assumptions.
Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula (cosine annealing from $\eta_{\max}$ at $t = 0$ to $\eta_{\min}$ at $t = T$):

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\frac{\pi t}{T}\right)$$
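In contrast to the cosine shape above, linear warmup followed by inverse-square-root decay has no fixed horizon $T$. The sketch below assumes the normalized form $\eta_t = \eta_{\max}\min\left(t/W,\ \sqrt{W/t}\right)$ with $W$ warmup steps. The name `warmup_inverse_sqrt` and the convention of counting steps from 1 are illustrative choices, not something this lesson specifies.

```python
import math


def warmup_inverse_sqrt(t: int, warmup_steps: int, eta_max: float) -> float:
    """Linear warmup to eta_max over warmup_steps, then eta_max * sqrt(warmup_steps / t)."""
    if warmup_steps <= 0:
        raise ValueError("warmup_steps must be positive")
    t = max(t, 1)  # assumption: steps are counted from 1 so the first update is nonzero
    if t <= warmup_steps:
        return eta_max * t / warmup_steps
    return eta_max * math.sqrt(warmup_steps / t)


# The rate peaks at t = warmup_steps and then decays without reaching zero.
for step in (1, 500, 1000, 4000, 16000):
    print(step, warmup_inverse_sqrt(step, warmup_steps=1000, eta_max=1e-3))
```

The two branches meet at `t = warmup_steps`, so the schedule is continuous there. After the peak, quadrupling the step count halves the rate.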

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving inverse-square-root decay, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

Log a metric that makes inverse-square-root decay visible; otherwise a training run can fail while the scalar loss hides the cause.
Compare the measured update with the mathematical update below before blaming data or architecture.

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{u}_t$$

Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

Which assumption about inverse-square-root decay is most fragile in the current training setup?
What number would you log to catch the failure one thousand steps before divergence?

AI connection:

linear warmup plus cosine decay for transformer pretraining.
warmup-stable-decay schedules for long LLM runs.
one-cycle schedules for fast supervised training.
batch-size and gradient-accumulation coupling in distributed training.

3.4 Proof template and what the proof actually buys

In this section, WSD schedule is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Proof template and what the proof actually buys" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, WSD schedule is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

A small synthetic quadratic where WSD schedule can be computed directly and compared with theory.
A logistic-regression or softmax objective where WSD schedule affects optimization but the model remains interpretable.
A transformer training diagnostic where WSD schedule appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

Treating WSD schedule as a hyperparameter recipe without checking the objective assumptions.
Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula (cosine annealing from $\eta_{\max}$ at $t = 0$ to $\eta_{\min}$ at $t = T$):

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\frac{\pi t}{T}\right)$$
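Unlike the single smooth curve above, a warmup-stable-decay (WSD) schedule is piecewise. The sketch below assumes linear warmup, a constant plateau at `eta_max`, and a linear final decay. The decay shape, the function name `wsd_schedule`, and all argument names are illustrative assumptions, since practitioners use several decay shapes.

```python
def wsd_schedule(t: int, total_steps: int, eta_max: float, warmup_steps: int,
                 decay_steps: int, eta_min: float = 0.0) -> float:
    """Warmup-stable-decay: linear warmup, constant plateau, linear decay to eta_min."""
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases must fit inside total_steps")
    t = min(max(t, 0), total_steps)
    decay_start = total_steps - decay_steps
    if t < warmup_steps:
        # (t + 1) so that step 0 already uses a nonzero rate.
        return eta_max * (t + 1) / warmup_steps
    if decay_steps == 0 or t < decay_start:
        return eta_max
    frac = (t - decay_start) / decay_steps  # 0 at the start of the decay, 1 at the end
    return eta_min + (eta_max - eta_min) * (1.0 - frac)


# Hypothetical 1000-step run: 100 warmup steps, 700 stable steps, 200 decay steps.
for step in (0, 99, 500, 800, 900, 1000):
    print(step, wsd_schedule(step, 1000, 1.0, warmup_steps=100, decay_steps=200))
```

One design consequence of the plateau is that the schedule before `decay_start` does not depend on `decay_steps`, so a checkpoint taken during the stable phase can be decayed later without rerunning the earlier steps.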

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving WSD schedule, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

Log a metric that makes WSD schedule visible; otherwise a training run can fail while the scalar loss hides the cause.
Compare the measured update with the mathematical update below before blaming data or architecture.

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{u}_t$$

Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

Which assumption about WSD schedule is most fragile in the current training setup?
What number would you log to catch the failure one thousand steps before divergence?

AI connection:

linear warmup plus cosine decay for transformer pretraining.
warmup-stable-decay schedules for long LLM runs.
one-cycle schedules for fast supervised training.
batch-size and gradient-accumulation coupling in distributed training.

3.5 Failure modes when assumptions are removed

In this section, cooldown is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Failure modes when assumptions are removed" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, cooldown is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

A small synthetic quadratic where cooldown can be computed directly and compared with theory.
A logistic-regression or softmax objective where cooldown affects optimization but the model remains interpretable.
A transformer training diagnostic where cooldown appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

Treating cooldown as a hyperparameter recipe without checking the objective assumptions.
Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula (cosine annealing from $\eta_{\max}$ at $t = 0$ to $\eta_{\min}$ at $t = T$):

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\frac{\pi t}{T}\right)$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving cooldown, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

Log a metric that makes cooldown visible; otherwise a training run can fail while the scalar loss hides the cause.
Compare the measured update with the mathematical update below before blaming data or architecture.

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{u}_t$$

Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
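One failure mode can be checked by hand. On a hypothetical one-dimensional quadratic $f(\theta) = \frac{1}{2}\lambda\theta^2$, plain gradient descent gives $\theta_{t+1} = (1 - \eta\lambda)\theta_t$, so the iterates shrink only when $0 < \eta < 2/\lambda$. The sketch below assumes a constant rate and a known curvature. In a real model the curvature would have to be estimated, and the names `gd_on_quadratic` and `stability_ratio` are illustrative.

```python
def gd_on_quadratic(eta: float, curvature: float, theta0: float = 1.0,
                    steps: int = 100) -> float:
    """Return |theta| after `steps` plain gradient-descent updates on f = 0.5 * curvature * theta**2."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * curvature * theta  # gradient of f is curvature * theta
    return abs(theta)


def stability_ratio(eta: float, curvature: float) -> float:
    """eta * curvature; on this quadratic, values of 2 or more mean the iterates do not contract."""
    return eta * curvature


# Two rates on either side of the threshold 2 / curvature = 0.2.
for eta in (0.19, 0.21):
    print(eta, stability_ratio(eta, 10.0), gd_on_quadratic(eta, 10.0))
```

The product of learning rate and an estimate of the largest curvature is one candidate for a logged quantity that moves toward its threshold before the loss itself diverges. Treat it as a diagnostic for the local quadratic model, not as a guarantee for a nonconvex network.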

Diagnostic questions:

Which assumption about cooldown is most fragile in the current training setup?
What number would you log to catch the failure one thousand steps before divergence?

AI connection:

linear warmup plus cosine decay for transformer pretraining.
warmup-stable-decay schedules for long LLM runs.
one-cycle schedules for fast supervised training.
batch-size and gradient-accumulation coupling in distributed training.

4. Core Theory II: Algorithms and Dynamics

This block develops Core Theory II: Algorithms and Dynamics for Learning Rate Schedules. It keeps the scope local to this section and points forward when a neighboring topic owns the full treatment.

4.1 Algorithmic update for warmup ratio

In this section, WSD schedule is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Algorithmic update for warmup ratio" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, WSD schedule is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

A small synthetic quadratic where WSD schedule can be computed directly and compared with theory.
A logistic-regression or softmax objective where WSD schedule affects optimization but the model remains interpretable.
A transformer training diagnostic where WSD schedule appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

Treating WSD schedule as a hyperparameter recipe without checking the objective assumptions.
Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula (cosine annealing from $\eta_{\max}$ at $t = 0$ to $\eta_{\min}$ at $t = T$):

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\frac{\pi t}{T}\right)$$
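A warmup ratio converts into a step count and then composes with the cosine formula above. The sketch below is one way to do that, assuming the ratio is a fraction of the total step budget and that the cosine phase is re-indexed to start when warmup ends. The name `warmup_cosine`, the default ratio of 0.05, and the rounding rule are illustrative assumptions.

```python
import math


def warmup_cosine(t: int, total_steps: int, eta_max: float,
                  warmup_ratio: float = 0.05, eta_min: float = 0.0) -> float:
    """Linear warmup for warmup_ratio * total_steps steps, then cosine decay to eta_min."""
    if not 0.0 <= warmup_ratio < 1.0:
        raise ValueError("warmup_ratio must be in [0, 1)")
    warmup_steps = int(round(warmup_ratio * total_steps))
    t = min(max(t, 0), total_steps)
    if t < warmup_steps:
        # (t + 1) so that step 0 already uses a nonzero rate.
        return eta_max * (t + 1) / warmup_steps
    # Re-index the cosine phase so that it starts at progress 0 right after warmup.
    progress = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical 10000-step run with 5% warmup (500 steps).
for step in (0, 499, 500, 5250, 10000):
    print(step, warmup_cosine(step, 10000, 1.0))
```

Because the warmup length is tied to `total_steps`, changing the training budget silently changes the warmup length as well, which is worth logging alongside the rate itself.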

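Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving WSD schedule, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.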

Implementation consequence:

Log a metric that makes WSD schedule visible; otherwise a training run can fail while the scalar loss hides the cause.
Compare the measured update with the mathematical update below before blaming data or architecture.

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{u}_t$$

Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

Which assumption about WSD schedule is most fragile in the current training setup?
What number would you log to catch the failure one thousand steps before divergence?

AI connection:

linear warmup plus cosine decay for transformer pretraining.
warmup-stable-decay schedules for long LLM runs.
one-cycle schedules for fast supervised training.
batch-size and gradient-accumulation coupling in distributed training.

4.2 Stability role of cosine annealing

In this section, cooldown is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Stability role of cosine annealing" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, cooldown is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

A small synthetic quadratic where cooldown can be computed directly and compared with theory.
A logistic-regression or softmax objective where cooldown affects optimization but the model remains interpretable.
A transformer training diagnostic where cooldown appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

Treating cooldown as a hyperparameter recipe without checking the objective assumptions.
Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula (cosine annealing from $\eta_{\max}$ at $t = 0$ to $\eta_{\min}$ at $t = T$):

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\frac{\pi t}{T}\right)$$
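One way to see what lowering the rate late in training buys is a toy noisy quadratic. The model below is an assumption made only for illustration: a one-dimensional objective with curvature $\lambda$, and a gradient corrupted by independent zero-mean noise of standard deviation $\sigma$, so that the second moment $v_t = \mathbb{E}[\theta_t^2]$ obeys $v_{t+1} = (1-\eta_t\lambda)^2 v_t + \eta_t^2\sigma^2$. The names `final_second_moment` and `cosine_lr` are hypothetical.

```python
import math


def cosine_lr(t, T, eta_max, eta_min):
    """Cosine annealing from eta_max at t = 0 toward eta_min at t = T."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))


def final_second_moment(schedule, steps, curvature=1.0, noise_std=1.0, v0=1.0):
    """Propagate E[theta_t**2] under theta <- (1 - eta * curvature) * theta - eta * noise."""
    v = v0
    for t in range(steps):
        eta = schedule(t)
        # Contraction of the old second moment plus the noise injected by this step.
        v = (1.0 - eta * curvature) ** 2 * v + (eta * noise_std) ** 2
    return v


T = 1000
v_const = final_second_moment(lambda t: 0.5, T)                          # constant rate
v_cos = final_second_moment(lambda t: cosine_lr(t, T, 0.5, 0.01), T)     # annealed rate
print("constant:", v_const, "cosine:", v_cos)
```

In this toy model the constant-rate run settles at a noise floor set by the rate, while the annealed run ends at a much smaller second moment because the late steps inject less noise. Whether the same mechanism dominates in a real training run is an empirical question that the logged metrics have to answer.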

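Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving cooldown, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.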

Implementation consequence:

Log a metric that makes cooldown visible; otherwise a training run can fail while the scalar loss hides the cause.
Compare the measured update with the mathematical update below before blaming data or architecture.

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{u}_t$$

Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

Which assumption about cooldown is most fragile in the current training setup?
What number would you log to catch the failure one thousand steps before divergence?

AI connection:

linear warmup plus cosine decay for transformer pretraining.
warmup-stable-decay schedules for long LLM runs.
one-cycle schedules for fast supervised training.
batch-size and gradient-accumulation coupling in distributed training.

4.3 Rate or complexity controlled by cosine with restarts

In this section, learning-rate rewinding is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Rate or complexity controlled by cosine with restarts" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, learning-rate rewinding is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

A small synthetic quadratic where learning-rate rewinding can be computed directly and compared with theory.
A logistic-regression or softmax objective where learning-rate rewinding affects optimization but the model remains interpretable.
A transformer training diagnostic where learning-rate rewinding appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

Treating learning-rate rewinding as a hyperparameter recipe without checking the objective assumptions.
Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula (cosine annealing from $\eta_{\max}$ at $t = 0$ to $\eta_{\min}$ at $t = T$):

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\frac{\pi t}{T}\right)$$
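Cosine with restarts reuses the formula above inside each cycle and resets the step counter at the cycle boundary. The sketch below assumes a fixed cycle length and the same `eta_max` in every cycle. Variants with growing cycle lengths or shrinking peaks exist but are not modeled here, and the name `cosine_with_restarts` is illustrative.

```python
import math


def cosine_with_restarts(t: int, period: int, eta_max: float, eta_min: float = 0.0) -> float:
    """Cosine annealing within each cycle of `period` steps; the rate jumps back to eta_max at each restart."""
    if period <= 0:
        raise ValueError("period must be positive")
    t_cur = t % period  # steps elapsed since the most recent restart
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / period))


# Two cycles of 100 steps: the rate at step 100 equals the rate at step 0.
for step in (0, 50, 99, 100, 150, 199):
    print(step, cosine_with_restarts(step, period=100, eta_max=1.0))
```

Each restart is a discontinuous jump in the rate, so update norms logged around multiples of `period` are expected to spike. That spike is part of the schedule and should not be mistaken for instability on its own.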

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving learning-rate rewinding, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

Log a metric that makes learning-rate rewinding visible; otherwise a training run can fail while the scalar loss hides the cause.
Compare the measured update with the mathematical update below before blaming data or architecture.

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{u}_t$$

Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

Which assumption about learning-rate rewinding is most fragile in the current training setup?
What number would you log to catch the failure one thousand steps before divergence?

AI connection:

linear warmup plus cosine decay for transformer pretraining.
warmup-stable-decay schedules for long LLM runs.
one-cycle schedules for fast supervised training.
batch-size and gradient-accumulation coupling in distributed training.

4.4 Diagnostic interpretation of the update path

In this section, batch-size scaling is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Diagnostic interpretation of the update path" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, batch-size scaling is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

A small synthetic quadratic where batch-size scaling can be computed directly and compared with theory.
A logistic-regression or softmax objective where batch-size scaling affects optimization but the model remains interpretable.
A transformer training diagnostic where batch-size scaling appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

Treating batch-size scaling as a hyperparameter recipe without checking the objective assumptions.
Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula (cosine annealing from $\eta_{\max}$ at $t = 0$ to $\eta_{\min}$ at $t = T$):

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\frac{\pi t}{T}\right)$$
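The peak rate $\eta_{\max}$ in the formula above is usually retuned when the batch size changes. The sketch below encodes two commonly cited heuristics, linear scaling and square-root scaling, purely as starting points. Neither is guaranteed to hold for a given model or optimizer, and the name `scale_peak_lr` and its arguments are illustrative.

```python
import math


def scale_peak_lr(base_lr: float, base_batch: int, batch: int, rule: str = "linear") -> float:
    """Rescale a peak learning rate tuned at base_batch for a new batch size (heuristic only)."""
    if base_batch <= 0 or batch <= 0:
        raise ValueError("batch sizes must be positive")
    ratio = batch / base_batch
    if rule == "linear":
        return base_lr * ratio             # rate grows in proportion to batch size
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)  # rate grows with the square root of batch size
    raise ValueError(f"unknown rule: {rule!r}")


# A peak rate tuned at batch size 256, moved to batch size 1024.
print(scale_peak_lr(0.1, 256, 1024, "linear"))
print(scale_peak_lr(0.1, 256, 1024, "sqrt"))
```

Whichever rule is used, the scaled value is a hypothesis to test against the logged update norms and validation loss, not a substitute for them.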

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving batch-size scaling, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

Log a metric that makes batch-size scaling visible; otherwise a training run can fail while the scalar loss hides the cause.
Compare the measured update with the mathematical update below before blaming data or architecture.

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{u}_t$$

Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

Which assumption about batch-size scaling is most fragile in the current training setup?
What number would you log to catch the failure one thousand steps before divergence?

AI connection:

linear warmup plus cosine decay for transformer pretraining.
warmup-stable-decay schedules for long LLM runs.
one-cycle schedules for fast supervised training.
batch-size and gradient-accumulation coupling in distributed training.

4.5 Connection to the next section in the chapter

In this section, gradient accumulation coupling is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Learning Rate Schedules, the phrase "Connection to the next section in the chapter" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, gradient accumulation coupling is the part of Learning Rate Schedules that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

A small synthetic quadratic where gradient accumulation coupling can be computed directly and compared with theory.
A logistic-regression or softmax objective where gradient accumulation coupling affects optimization but the model remains interpretable.
A transformer training diagnostic where gradient accumulation coupling appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

Treating gradient accumulation coupling as a hyperparameter recipe without checking the objective assumptions.
Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula (cosine annealing from $\eta_{\max}$ at $t = 0$ to $\eta_{\min}$ at $t = T$):

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\frac{\pi t}{T}\right)$$
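Gradient accumulation couples to the schedule because the step index $t$ in the formula above counts optimizer updates, not micro-batches, and because the way micro-batch gradients are combined changes the effective rate. The sketch below uses a hypothetical one-parameter least-squares model with equal-sized micro-batches. The function names and data are illustrative.

```python
def mean_grad(w, xs, ys):
    """Gradient with respect to w of the mean of 0.5 * (w * x - y)**2 over the batch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch, average=True):
    """Sum per-micro-batch mean gradients; divide by the number of micro-batches if average=True."""
    chunks = [(xs[i:i + micro_batch], ys[i:i + micro_batch])
              for i in range(0, len(xs), micro_batch)]
    total = sum(mean_grad(w, cx, cy) for cx, cy in chunks)
    return total / len(chunks) if average else total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = mean_grad(w, xs, ys)                                          # one batch of 8
avg = accumulated_grad(w, xs, ys, micro_batch=2, average=True)       # 4 micro-batches, averaged
summed = accumulated_grad(w, xs, ys, micro_batch=2, average=False)   # 4 micro-batches, summed
print(full, avg, summed)
```

With equal-sized micro-batches, averaging reproduces the full-batch gradient, while summing multiplies it by the number of accumulation steps, which has the same effect on a plain gradient step as multiplying the learning rate by that factor. Warmup and decay lengths also need to be expressed in optimizer updates, or they shift when the accumulation count changes.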

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving gradient accumulation coupling, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

Log a metric that makes gradient accumulation coupling visible; otherwise a training run can fail while the scalar loss hides the cause.
Compare the measured update with the mathematical update below before blaming data or architecture.

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{u}_t$$

Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

Which assumption about gradient accumulation coupling is most fragile in the current training setup?
What number would you log to catch the failure one thousand steps before divergence?

AI connection:

linear warmup plus cosine decay for transformer pretraining.
warmup-stable-decay schedules for long LLM runs.
one-cycle schedules for fast supervised training.
batch-size and gradient-accumulation coupling in distributed training.

