Gradient Descent: Part 1 (Intuition) to Part 2 (Formal Definitions)
1. Intuition
This block develops intuition for Gradient Descent. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
1.1 Why Gradient Descent matters for training systems
In this section, backtracking line search is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Why Gradient Descent matters for training systems" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, backtracking line search is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where backtracking line search can be computed directly and compared with theory.
- A logistic-regression or softmax objective where backtracking line search affects optimization but the model remains interpretable.
- A transformer training diagnostic where backtracking line search appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating backtracking line search as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Starting from a trial step η = η₀, backtracking repeatedly shrinks η ← βη with β ∈ (0, 1) until the sufficient-decrease test f(θ_t − η∇f(θ_t)) ≤ f(θ_t) − c·η·‖∇f(θ_t)‖² holds for a fixed c ∈ (0, 1).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving backtracking line search, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes backtracking line search visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about backtracking line search is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
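To make this concrete, here is a minimal sketch of backtracking line search on a small ill-conditioned quadratic; the function, constants (β = 0.5, c = 1e-4), and variable names are illustrative choices, not prescribed by this lesson:

```python
def backtracking_step(f, grad, x, eta0=1.0, beta=0.5, c=1e-4, max_halvings=50):
    """Shrink eta geometrically until the sufficient-decrease test passes."""
    g = grad(x)
    g_sq = sum(gi * gi for gi in g)
    eta = eta0
    for _ in range(max_halvings):
        x_new = [xi - eta * gi for xi, gi in zip(x, g)]
        if f(x_new) <= f(x) - c * eta * g_sq:  # Armijo sufficient decrease
            return x_new, eta
        eta *= beta                             # backtrack
    return x, 0.0                               # give up: no accepted step

# Ill-conditioned quadratic: curvature 1 in x, 100 in y, so a fixed
# eta = 1.0 would diverge; the line search finds a stable step itself.
f = lambda p: 0.5 * (p[0] ** 2 + 100.0 * p[1] ** 2)
grad = lambda p: [p[0], 100.0 * p[1]]

x, history = [1.0, 1.0], []
for _ in range(100):
    x, eta = backtracking_step(f, grad, x)
    history.append(f(x))
```

Logging the accepted η alongside the loss is exactly the kind of diagnostic this section asks for: a collapsing accepted step size flags a curvature problem long before the scalar loss diverges.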
1.2 The optimization object: parameters, objective, algorithm, and diagnostic
In this section, the Armijo condition is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "The optimization object: parameters, objective, algorithm, and diagnostic" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the Armijo condition is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the Armijo condition can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the Armijo condition affects optimization but the model remains interpretable.
- A transformer training diagnostic where the Armijo condition appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the Armijo condition as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
f(θ_t − η∇f(θ_t)) ≤ f(θ_t) − c·η·‖∇f(θ_t)‖² for a fixed c ∈ (0, 1); for an L-smooth objective the test is guaranteed to pass whenever η ≤ 2(1 − c)/L.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving the Armijo condition, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the Armijo condition visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the Armijo condition is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
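A minimal sketch of the Armijo test itself, on a 1-D quadratic where the acceptance threshold 2(1 − c)/L can be checked against theory; the constants and names are illustrative:

```python
def armijo_holds(f, grad_f, x, eta, c=0.1):
    """Sufficient decrease along -grad: f(x - eta*g) <= f(x) - c*eta*g^2 (1-D)."""
    g = grad_f(x)
    return f(x - eta * g) <= f(x) - c * eta * g * g

# 1-D quadratic with curvature L = 100: theory predicts acceptance
# exactly for eta <= 2*(1 - c)/L = 0.018 here.
L = 100.0
f = lambda x: 0.5 * L * x * x
grad_f = lambda x: L * x

accepted = [eta for eta in (0.001, 0.01, 0.017, 0.02, 0.05, 1.0)
            if armijo_holds(f, grad_f, 1.0, eta)]
print(accepted)  # only the step sizes below the theoretical threshold
```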
1.3 Historical arc from classical optimization to modern AI
In this section, the Wolfe conditions are treated as a concrete optimization object rather than a slogan. The goal is to understand how they change the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Historical arc from classical optimization to modern AI" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the Wolfe conditions are the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track them through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the Wolfe conditions can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the Wolfe conditions affect optimization but the model remains interpretable.
- A transformer training diagnostic where the Wolfe conditions appear through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the Wolfe conditions as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
For a descent direction d: sufficient decrease f(θ_t + ηd) ≤ f(θ_t) + c₁·η·∇f(θ_t)ᵀd together with the curvature condition ∇f(θ_t + ηd)ᵀd ≥ c₂·∇f(θ_t)ᵀd, where 0 < c₁ < c₂ < 1.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving the Wolfe conditions, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the Wolfe conditions visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the Wolfe conditions is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
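The two Wolfe tests can be sketched in a few lines for a 1-D steepest-descent step; c₁ = 1e-4 and c₂ = 0.9 are conventional choices, and the quadratic is illustrative:

```python
def wolfe_flags(f, grad_f, x, eta, c1=1e-4, c2=0.9):
    """Return (sufficient_decrease, curvature) for a steepest-descent step in 1-D."""
    g = grad_f(x)
    d = -g                                      # steepest-descent direction
    x_new = x + eta * d
    armijo = f(x_new) <= f(x) + c1 * eta * g * d
    curvature = grad_f(x_new) * d >= c2 * g * d
    return armijo, curvature

L = 100.0
f = lambda x: 0.5 * L * x * x
grad_f = lambda x: L * x

tiny = wolfe_flags(f, grad_f, 1.0, 1e-5)   # step too timid
good = wolfe_flags(f, grad_f, 1.0, 0.015)  # both conditions hold
print(tiny, good)
```

A tiny step passes sufficient decrease but fails curvature, which is exactly why the second Wolfe condition exists: it rules out steps that barely move.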
1.4 What this section treats as canonical scope
In this section, convex convergence is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "What this section treats as canonical scope" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, convex convergence is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where convex convergence can be computed directly and compared with theory.
- A logistic-regression or softmax objective where convex convergence affects optimization but the model remains interpretable.
- A transformer training diagnostic where convex convergence appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating convex convergence as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
For convex, L-smooth f run with η = 1/L: f(θ_T) − f(θ*) ≤ L·‖θ₀ − θ*‖² / (2T).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving convex convergence, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes convex convergence visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about convex convergence is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
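A quick numerical check of the convex O(1/T) rate on a separable quadratic; the curvatures and horizon are illustrative:

```python
# Gradient descent with eta = 1/L on the convex quadratic
# f(x, y) = 0.5*(x^2 + 10*y^2), whose minimum value is f* = 0.
eta, T = 0.1, 100                      # eta = 1/L with L = 10
x, y = 1.0, 1.0
f = lambda x, y: 0.5 * (x * x + 10.0 * y * y)

for _ in range(T):
    x, y = x - eta * x, y - eta * 10.0 * y

# Classical convex rate: f(x_T) - f* <= ||x_0 - x*||^2 / (2*eta*T)
bound = (1.0 ** 2 + 1.0 ** 2) / (2 * eta * T)
gap = f(x, y)
print(gap <= bound)
```

The measured gap lands far below the bound, which is typical: worst-case rates are guarantees, not predictions.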
1.5 A first mental model for LLM training
In this section, strongly convex convergence is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "A first mental model for LLM training" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, strongly convex convergence is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where strongly convex convergence can be computed directly and compared with theory.
- A logistic-regression or softmax objective where strongly convex convergence affects optimization but the model remains interpretable.
- A transformer training diagnostic where strongly convex convergence appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating strongly convex convergence as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
For μ-strongly convex, L-smooth f run with η = 1/L: ‖θ_t − θ*‖² ≤ (1 − μ/L)^t · ‖θ₀ − θ*‖².
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving strongly convex convergence, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes strongly convex convergence visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about strongly convex convergence is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
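The linear rate under strong convexity, checked on the same style of diagonal quadratic (μ = 1, L = 10 are illustrative; for a diagonal quadratic with η = 1/L each coordinate contracts by at most 1 − μ/L per step):

```python
import math

# f(x, y) = 0.5*(mu*x^2 + L*y^2) with mu = 1, L = 10, eta = 1/L.
eta, mu, L = 0.1, 1.0, 10.0
x, y = 1.0, 1.0
dists = []
for t in range(1, 31):
    x, y = x - eta * mu * x, y - eta * L * y
    dists.append((t, math.hypot(x, y)))   # distance to the minimizer (0, 0)

# Linear (geometric) convergence: dist_t <= (1 - mu/L)^t * dist_0
rate_ok = all(d <= (1 - mu / L) ** t * math.sqrt(2.0) + 1e-12 for t, d in dists)
print(rate_ok)
```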
2. Formal Definitions
This block develops formal definitions for Gradient Descent. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
2.1 Primary definition: gradient direction
In this section, convex convergence is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Primary definition: gradient direction" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, convex convergence is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where convex convergence can be computed directly and compared with theory.
- A logistic-regression or softmax objective where convex convergence affects optimization but the model remains interpretable.
- A transformer training diagnostic where convex convergence appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating convex convergence as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
θ_{t+1} = θ_t − η_t·∇f(θ_t); among unit vectors d, the directional derivative ∇f(θ_t)ᵀd is minimized by d = −∇f(θ_t)/‖∇f(θ_t)‖.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving convex convergence, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes convex convergence visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about convex convergence is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
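A sketch of why the negative gradient is the canonical direction: probe many unit directions and compare their directional derivatives against the Cauchy–Schwarz bound (the objective here is illustrative):

```python
import math

# Directional derivative of f along a unit vector d is <grad f, d>; by
# Cauchy-Schwarz it is at least -||grad f||, with equality at d = -grad f / ||grad f||.
grad = lambda x, y: (2.0 * x, 6.0 * y)   # gradient of f(x, y) = x^2 + 3*y^2

gx, gy = grad(1.0, 1.0)
gnorm = math.hypot(gx, gy)

slopes = [gx * math.cos(a) + gy * math.sin(a)
          for a in (2 * math.pi * k / 1000 for k in range(1000))]
print(min(slopes) >= -gnorm - 1e-9)   # no direction beats -grad
```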
2.2 Secondary definition: descent lemma
In this section, strongly convex convergence is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Secondary definition: descent lemma" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, strongly convex convergence is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where strongly convex convergence can be computed directly and compared with theory.
- A logistic-regression or softmax objective where strongly convex convergence affects optimization but the model remains interpretable.
- A transformer training diagnostic where strongly convex convergence appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating strongly convex convergence as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
If ∇f is L-Lipschitz, then f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)·‖y − x‖²; taking y = x − (1/L)∇f(x) yields f(y) ≤ f(x) − ‖∇f(x)‖²/(2L).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving strongly convex convergence, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes strongly convex convergence visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about strongly convex convergence is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
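A numerical spot-check of the descent lemma, using the logistic loss as a convenient 1-D function known to be L-smooth with L = 1/4; the grid and test point are illustrative:

```python
import math

# f(x) = log(1 + e^x) has f''(x) = s*(1 - s) <= 1/4, so L = 1/4.
f = lambda x: math.log1p(math.exp(x))
df = lambda x: 1.0 / (1.0 + math.exp(-x))
L = 0.25

# Descent lemma: f(y) <= f(x) + f'(x)*(y - x) + (L/2)*(y - x)^2 on a grid.
grid = [i * 0.5 - 5.0 for i in range(21)]
lemma_ok = all(
    f(y) <= f(x) + df(x) * (y - x) + 0.5 * L * (y - x) ** 2 + 1e-12
    for x in grid for y in grid
)

# Plugging y = x - (1/L)*f'(x) into the lemma guarantees a decrease
# of at least |f'(x)|^2 / (2*L).
x0 = 2.0
decrease = f(x0) - f(x0 - (1.0 / L) * df(x0))
guaranteed = df(x0) ** 2 / (2 * L)
print(lemma_ok, decrease >= guaranteed)
```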
2.3 Algorithmic object: constant step size
In this section, nonconvex stationarity is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Algorithmic object: constant step size" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, nonconvex stationarity is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where nonconvex stationarity can be computed directly and compared with theory.
- A logistic-regression or softmax objective where nonconvex stationarity affects optimization but the model remains interpretable.
- A transformer training diagnostic where nonconvex stationarity appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating nonconvex stationarity as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
For L-smooth f bounded below by f_inf, gradient descent with η = 1/L satisfies min_{t<T} ‖∇f(θ_t)‖² ≤ 2L·(f(θ₀) − f_inf)/T.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving nonconvex stationarity, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes nonconvex stationarity visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about nonconvex stationarity is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
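A sketch of the nonconvex stationarity guarantee using f = sin, which is 1-smooth and bounded below by −1; the start point and horizon are illustrative:

```python
import math

# On a smooth nonconvex objective, GD with eta = 1/L guarantees that the
# *best* iterate is near-stationary:
#   min_{t<T} |f'(x_t)|^2 <= 2*L*(f(x_0) - inf f) / T.
f, df, L = math.sin, math.cos, 1.0   # |f''| <= 1, inf f = -1

x, T = 1.0, 200
grad_sq = []
for _ in range(T):
    g = df(x)
    grad_sq.append(g * g)
    x -= (1.0 / L) * g

bound = 2 * L * (f(1.0) - (-1.0)) / T
print(min(grad_sq) <= bound)
```

Note the guarantee is about the best iterate, not the last one; logging the running minimum of the gradient norm is the matching diagnostic.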
2.4 Examples, non-examples, and boundary cases
In this section, the PL condition is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Examples, non-examples, and boundary cases" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the PL condition is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the PL condition can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the PL condition affects optimization but the model remains interpretable.
- A transformer training diagnostic where the PL condition appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the PL condition as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
PL inequality: ½·‖∇f(θ)‖² ≥ μ·(f(θ) − f*); combined with L-smoothness and η = 1/L it gives f(θ_t) − f* ≤ (1 − μ/L)^t·(f(θ₀) − f*).
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving the PL condition, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the PL condition visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the PL condition is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
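A sketch using the standard nonconvex PL example f(x) = x² + 3·sin²(x); the constant μ = 1/32 is commonly cited for this function and is taken here as an assumption, spot-checked numerically rather than proved:

```python
import math

# Nonconvex (f'' = 2 + 6*cos(2x) dips negative) yet PL:
#   0.5*|f'(x)|^2 >= mu*(f(x) - f*), with f* = 0 at x = 0.
f = lambda x: x * x + 3.0 * math.sin(x) ** 2
df = lambda x: 2.0 * x + 3.0 * math.sin(2.0 * x)
mu = 1.0 / 32.0   # assumed PL constant for this example

# Numeric spot-check of the PL inequality on a grid.
pl_ok = all(0.5 * df(x) ** 2 >= mu * f(x) - 1e-12
            for x in (i * 0.01 - 10.0 for i in range(2001)))

# Despite nonconvexity, GD still reaches the global minimum.
x = 8.0
for _ in range(500):
    x -= 0.05 * df(x)
print(pl_ok, f(x) < 1e-6)
```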
2.5 Notation, dimensions, and assumptions
In this section, the condition number is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Gradient Descent, the phrase "Notation, dimensions, and assumptions" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, the condition number is the part of Gradient Descent that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through the iterate θ_t, the objective f, the gradient ∇f(θ_t), the step size η_t, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where the condition number can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the condition number affects optimization but the model remains interpretable.
- A transformer training diagnostic where the condition number appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the condition number as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
κ = L/μ; with the optimal constant step η = 2/(L + μ), ‖θ_t − θ*‖ ≤ ((κ − 1)/(κ + 1))^t · ‖θ₀ − θ*‖.
Proof sketch or reasoning pattern:
Start with the local model around the current iterate θ_t, isolate the term involving the condition number, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the condition number visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the condition number is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- the basic training loop used by every neural-network optimizer.
- step-size stability for cross-entropy and mean-squared-error objectives.
- momentum as the ancestor of Adam's first-moment accumulator.
- line-search logic as a debugging model for divergence and oscillation.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
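A sketch of how the condition number shows up as iteration cost, on diagonal quadratics with illustrative κ = 10 and κ = 100:

```python
# On f = 0.5*(mu*x^2 + L*y^2), GD with the optimal constant step
# eta = 2/(L + mu) contracts the error by (kappa - 1)/(kappa + 1) per step,
# so iteration counts grow roughly linearly with kappa = L/mu.
def steps_to_tol(L, mu, tol=1e-6):
    eta = 2.0 / (L + mu)
    x, y, steps = 1.0, 1.0, 0
    while x * x + y * y > tol * tol:
        x, y = x - eta * mu * x, y - eta * L * y
        steps += 1
    return steps

easy = steps_to_tol(10.0, 1.0)    # kappa = 10
hard = steps_to_tol(100.0, 1.0)   # kappa = 100: roughly 10x the steps
print(easy, hard)
```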