
Part 1


Stochastic Optimization, Part 1: from 1. Intuition to 2. Formal Definitions

1. Intuition

This block develops intuition for Stochastic Optimization. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.

1.1 Why Stochastic Optimization matters for training systems

In this section, gradient variance is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Why Stochastic Optimization matters for training systems" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, gradient variance is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where gradient variance can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where gradient variance affects optimization but the model remains interpretable.
  • A transformer training diagnostic where gradient variance appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating gradient variance as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\mathbf{g}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t}\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}_t; \mathbf{x}^{(i)}, y^{(i)})$$
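
The formula above can be sanity-checked numerically. The following sketch uses a toy synthetic least-squares problem (the dimensions, seed, and noise level are illustrative assumptions, not canonical values) to measure the spread of the minibatch estimator and compare it with the theoretical value $\operatorname{tr}(\Sigma)/B$ for sampling with replacement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic least-squares problem.
n, d = 2000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + rng.normal(scale=0.5, size=n)

theta = np.zeros(d)
G = X * (X @ theta - y)[:, None]   # per-example gradients of 0.5 * (x.theta - y)^2
full_grad = G.mean(axis=0)

# Sampling B examples with replacement, theory predicts
# E||g_t - full_grad||^2 = trace(per-example covariance) / B.
B = 32
draws = [G[rng.integers(0, n, B)].mean(axis=0) for _ in range(5000)]
emp_var = float(np.mean([np.sum((g - full_grad) ** 2) for g in draws]))
theory_var = float(G.var(axis=0).sum() / B)
```

The measured and predicted variances should agree to within Monte Carlo error, which is exactly the kind of theory-versus-measurement check this section advocates.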

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving gradient variance, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes gradient variance visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{g}_t$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about gradient variance is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • minibatch training for deep networks and transformers.
  • batch-size and learning-rate coupling in large-scale pretraining.
  • distributed gradient averaging under data parallelism.
  • variance reduction ideas behind efficient fine-tuning and classical ML solvers.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

1.2 The optimization object: parameters, objective, algorithm, and diagnostic

In this section, minibatch estimator is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "The optimization object: parameters, objective, algorithm, and diagnostic" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, minibatch estimator is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where minibatch estimator can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where minibatch estimator affects optimization but the model remains interpretable.
  • A transformer training diagnostic where minibatch estimator appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating minibatch estimator as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\mathbf{g}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t}\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}_t; \mathbf{x}^{(i)}, y^{(i)})$$
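
The defining property of the minibatch estimator above is unbiasedness: averaged over the random index draw, it equals the full-batch gradient. The sketch below checks this on toy logistic-regression data (all sizes, labels, and the seed are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy logistic-regression data.
n, d = 1000, 4
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)
theta = rng.normal(size=d)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def grad_full(theta):
    # Full-batch gradient of the mean cross-entropy loss.
    return X.T @ (sigmoid(X @ theta) - y) / n

def grad_minibatch(theta, B):
    # Minibatch estimator g_t: B indices sampled with replacement.
    idx = rng.integers(0, n, B)
    return X[idx].T @ (sigmoid(X[idx] @ theta) - y[idx]) / B

# Averaging many independent minibatch gradients recovers the full gradient.
avg = np.mean([grad_minibatch(theta, 16) for _ in range(20000)], axis=0)
```

If the average drifted away from `grad_full(theta)` by more than Monte Carlo error, the estimator (or the sampling scheme) would be biased, and the convergence story of the section would not apply.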

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving the minibatch estimator, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes minibatch estimator visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{g}_t$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about minibatch estimator is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • minibatch training for deep networks and transformers.
  • batch-size and learning-rate coupling in large-scale pretraining.
  • distributed gradient averaging under data parallelism.
  • variance reduction ideas behind efficient fine-tuning and classical ML solvers.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

1.3 Historical arc from classical optimization to modern AI

In this section, batch-size scaling is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Historical arc from classical optimization to modern AI" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, batch-size scaling is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where batch-size scaling can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where batch-size scaling affects optimization but the model remains interpretable.
  • A transformer training diagnostic where batch-size scaling appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating batch-size scaling as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\mathbf{g}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t}\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}_t; \mathbf{x}^{(i)}, y^{(i)})$$
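
For batch-size scaling specifically, the formula implies that doubling $B$ should roughly halve the estimator variance. A minimal numerical sketch (toy least-squares data; sizes and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative least-squares data; G holds per-example gradients at theta = 0.
n, d = 4000, 3
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + rng.normal(size=n)
theta = np.zeros(d)
G = X * (X @ theta - y)[:, None]
mu = G.mean(axis=0)  # full-batch gradient

def trace_var(B, trials=4000):
    # Monte Carlo estimate of trace Var[g_t] at batch size B (with replacement).
    sq = [np.sum((G[rng.integers(0, n, B)].mean(axis=0) - mu) ** 2)
          for _ in range(trials)]
    return float(np.mean(sq))

# Doubling the batch size should roughly halve the estimator variance.
ratio = trace_var(16) / trace_var(32)
```

The $1/B$ scaling is what couples batch size to learning rate in practice: a larger batch tolerates a larger step before noise dominates.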

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving batch-size scaling, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes batch-size scaling visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{g}_t$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about batch-size scaling is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • minibatch training for deep networks and transformers.
  • batch-size and learning-rate coupling in large-scale pretraining.
  • distributed gradient averaging under data parallelism.
  • variance reduction ideas behind efficient fine-tuning and classical ML solvers.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

1.4 What this section treats as canonical scope

In this section, critical batch size is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "What this section treats as canonical scope" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, critical batch size is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where critical batch size can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where critical batch size affects optimization but the model remains interpretable.
  • A transformer training diagnostic where critical batch size appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating critical batch size as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\mathbf{g}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t}\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}_t; \mathbf{x}^{(i)}, y^{(i)})$$
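
One common heuristic for a critical batch size (sometimes called the gradient noise scale) compares the trace of the per-example gradient covariance with the squared norm of the full gradient; the sketch below computes it on toy data. The data, sizes, and the heuristic itself are illustrative assumptions, not this lesson's canonical definition:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative regression data with deliberately noisy labels.
n, d = 3000, 6
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=2.0, size=n)
theta = np.zeros(d)

G = X * (X @ theta - y)[:, None]   # per-example gradients
g = G.mean(axis=0)                 # full-batch gradient
trace_cov = G.var(axis=0).sum()    # trace of per-example gradient covariance

# Noise-scale heuristic: for B well below B_noise the minibatch gradient is
# noise-dominated; well above it, larger batches give diminishing returns.
B_noise = trace_cov / np.dot(g, g)
```

Note that this quantity depends on the current iterate, so in a real run it would be logged as a trajectory, not computed once.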

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving the critical batch size, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes critical batch size visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{g}_t$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about critical batch size is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • minibatch training for deep networks and transformers.
  • batch-size and learning-rate coupling in large-scale pretraining.
  • distributed gradient averaging under data parallelism.
  • variance reduction ideas behind efficient fine-tuning and classical ML solvers.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

1.5 A first mental model for LLM training

In this section, Robbins-Monro schedule is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "A first mental model for LLM training" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, Robbins-Monro schedule is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where Robbins-Monro schedule can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where Robbins-Monro schedule affects optimization but the model remains interpretable.
  • A transformer training diagnostic where Robbins-Monro schedule appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating Robbins-Monro schedule as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\mathbf{g}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t}\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}_t; \mathbf{x}^{(i)}, y^{(i)})$$
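
The Robbins-Monro conditions require step sizes with $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$; the schedule $\eta_t = 1/t$ is the classical instance. The one-dimensional sketch below (toy objective and seed chosen for illustration) shows SGD with this schedule converging despite gradient noise that never vanishes:

```python
import numpy as np

rng = np.random.default_rng(4)

# SGD on the stochastic objective F(theta) = E[0.5 * (theta - z)^2] with
# z ~ N(1, 1); the population minimizer is E[z] = 1.
theta = 10.0
for t in range(1, 50001):
    z = rng.normal(loc=1.0, scale=1.0)
    grad = theta - z                 # unbiased stochastic gradient
    theta -= (1.0 / t) * grad

# With eta_t = 1/t on this objective, the iterate equals the running mean of
# the samples, so it converges to the minimizer at the usual O(1/sqrt(t)) rate.
```

A constant step size on the same problem would instead stall at a noise floor, which is the contrast the schedule is designed to avoid.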

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving the Robbins-Monro schedule, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes Robbins-Monro schedule visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{g}_t$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about Robbins-Monro schedule is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • minibatch training for deep networks and transformers.
  • batch-size and learning-rate coupling in large-scale pretraining.
  • distributed gradient averaging under data parallelism.
  • variance reduction ideas behind efficient fine-tuning and classical ML solvers.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

2. Formal Definitions

This block develops formal definitions for Stochastic Optimization. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.

2.1 Primary definition: stochastic objective

In this section, critical batch size is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Primary definition: stochastic objective" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, critical batch size is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where critical batch size can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where critical batch size affects optimization but the model remains interpretable.
  • A transformer training diagnostic where critical batch size appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating critical batch size as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\mathbf{g}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t}\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}_t; \mathbf{x}^{(i)}, y^{(i)})$$
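
A stochastic objective is an expectation, $F(\boldsymbol{\theta}) = \mathbb{E}_z[\ell(\boldsymbol{\theta}; z)]$, and any quantity we compute from samples only estimates it. The sketch below (a one-dimensional toy with a closed-form population objective; the distribution and seed are illustrative assumptions) checks a Monte Carlo estimate against the exact value:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stochastic objective F(theta) = E_z[0.5 * (theta - z)^2] with z ~ N(3, 1),
# which has the closed form F(theta) = 0.5 * ((theta - 3)^2 + 1).
z = rng.normal(loc=3.0, scale=1.0, size=100000)

def F_hat(theta, m):
    # Monte Carlo estimate of the population objective from m samples.
    return 0.5 * float(np.mean((theta - z[:m]) ** 2))

approx = F_hat(2.0, 100000)
exact = 0.5 * ((2.0 - 3.0) ** 2 + 1.0)
```

The gap between `approx` and `exact` is exactly the sampling error that separates a measured training loss from the population objective the theory talks about.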

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving the critical batch size, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes critical batch size visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{g}_t$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about critical batch size is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • minibatch training for deep networks and transformers.
  • batch-size and learning-rate coupling in large-scale pretraining.
  • distributed gradient averaging under data parallelism.
  • variance reduction ideas behind efficient fine-tuning and classical ML solvers.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

2.2 Secondary definition: empirical risk

In this section, Robbins-Monro schedule is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Secondary definition: empirical risk" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, Robbins-Monro schedule is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where Robbins-Monro schedule can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where Robbins-Monro schedule affects optimization but the model remains interpretable.
  • A transformer training diagnostic where Robbins-Monro schedule appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating Robbins-Monro schedule as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\mathbf{g}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t}\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}_t; \mathbf{x}^{(i)}, y^{(i)})$$
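
Empirical risk is the finite-sample average $\frac{1}{n}\sum_i \ell(\boldsymbol{\theta}; \mathbf{x}^{(i)}, y^{(i)})$, and its minimizer approaches the population minimizer as $n$ grows. The sketch below illustrates this with least squares, where the empirical minimizer has a closed form (the generating parameters, noise level, and sizes are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(6)

# Data-generating parameters for an illustrative linear model.
d = 3
theta_star = np.array([1.0, -2.0, 0.5])

def erm_error(n):
    # Minimize the empirical risk (least squares has a closed-form minimizer)
    # and measure the distance to the data-generating parameters.
    X = rng.normal(size=(n, d))
    y = X @ theta_star + rng.normal(scale=0.5, size=n)
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.linalg.norm(theta_hat - theta_star))

# More data shrinks the gap between empirical and population minimizers.
small_n_err, large_n_err = erm_error(50), erm_error(50000)
```

This is the distinction the non-examples warn about: a statement proved for the empirical risk at small $n$ need not transfer to the population.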

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving the Robbins-Monro schedule, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes Robbins-Monro schedule visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{g}_t$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about Robbins-Monro schedule is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • minibatch training for deep networks and transformers.
  • batch-size and learning-rate coupling in large-scale pretraining.
  • distributed gradient averaging under data parallelism.
  • variance reduction ideas behind efficient fine-tuning and classical ML solvers.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

2.3 Algorithmic object: population risk

In this section, SGD convergence is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Algorithmic object: population risk" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, SGD convergence is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where SGD convergence can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where SGD convergence affects optimization but the model remains interpretable.
  • A transformer training diagnostic where SGD convergence appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating SGD convergence as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\mathbf{g}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t}\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}_t; \mathbf{x}^{(i)}, y^{(i)})$$
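
A basic SGD convergence fact worth seeing numerically: with a constant step size, the iterates do not converge to the minimizer but fluctuate around it at a noise floor that scales with the step size. The sketch below (one-dimensional quadratic, unit gradient noise, hypothetical step sizes) makes the trade-off visible:

```python
import numpy as np

rng = np.random.default_rng(7)

def noise_floor(eta, steps=20000):
    # Constant-step SGD on f(theta) = 0.5 * theta^2 with additive N(0, 1)
    # gradient noise; average theta^2 over the second half of the run.
    theta = 5.0
    errs = []
    for t in range(steps):
        g = theta + rng.normal()
        theta -= eta * g
        if t > steps // 2:
            errs.append(theta ** 2)
    return float(np.mean(errs))

# For this recursion the stationary error is eta / (2 - eta): a smaller step
# buys a lower floor at the cost of slower initial progress.
f_big, f_small = noise_floor(0.2), noise_floor(0.05)
```

Logging such a plateau in squared update or distance-to-target terms is the diagnostic analogue of the theory: a loss that stops improving may be sitting at its step-size-determined floor rather than at a minimum.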

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving SGD convergence, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes SGD convergence visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{g}_t$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about SGD convergence is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • minibatch training for deep networks and transformers.
  • batch-size and learning-rate coupling in large-scale pretraining.
  • distributed gradient averaging under data parallelism.
  • variance reduction ideas behind efficient fine-tuning and classical ML solvers.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

2.4 Examples, non-examples, and boundary cases

In this section, strongly convex SGD is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Examples, non-examples, and boundary cases" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, strongly convex SGD is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where strongly convex SGD can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where strongly convex SGD affects optimization but the model remains interpretable.
  • A transformer training diagnostic where strongly convex SGD appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating strongly convex SGD as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\mathbf{g}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t}\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}_t; \mathbf{x}^{(i)}, y^{(i)})$$
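
For a $\mu$-strongly convex objective, SGD with the schedule $\eta_t = 1/(\mu t)$ achieves the classical $\mathbb{E}\|\boldsymbol{\theta}_t - \boldsymbol{\theta}^\star\|^2 = O(1/t)$ rate. The sketch below checks the rate empirically on a one-dimensional quadratic (the constant, horizons, and repeat count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
mu = 1.0  # strong-convexity constant of f(theta) = 0.5 * mu * theta^2

def mean_sq_err(T, repeats=200):
    # SGD with eta_t = 1/(mu * t) and unit-variance gradient noise,
    # averaged over independent repeats; minimizer is theta = 0.
    out = []
    for _ in range(repeats):
        theta = 5.0
        for t in range(1, T + 1):
            g = mu * theta + rng.normal()
            theta -= (1.0 / (mu * t)) * g
        out.append(theta ** 2)
    return float(np.mean(out))

# Classical O(1/t) rate: quadrupling the horizon should roughly quarter
# the mean squared error.
e_short, e_long = mean_sq_err(250), mean_sq_err(1000)
```

Measuring the slope of error versus $t$ on log-log axes is the standard way to confirm a rate like this on a real problem.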

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving strongly convex SGD, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes strongly convex SGD visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{g}_t
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
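A minimal sketch of this logging discipline, on an assumed strongly convex quadratic f(\boldsymbol{\theta}) = \tfrac{1}{2}\boldsymbol{\theta}^\top A\boldsymbol{\theta} with synthetic gradient noise and the classic \eta_t = 1/(\mu(t+1)) schedule (every constant here is an illustrative choice, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
mu, L = 0.5, 4.0
A = np.diag(np.linspace(mu, L, d))   # Hessian of a mu-strongly convex quadratic
sigma = 0.1                          # gradient-noise scale, stand-in for sigma/sqrt(B)

theta = rng.normal(size=d)
f = lambda th: 0.5 * th @ A @ th     # objective; minimum value is 0 at theta = 0

for t in range(2000):
    g = A @ theta + sigma * rng.normal(size=d)   # noisy gradient g_t
    eta = 1.0 / (mu * (t + 1))                   # classic O(1/t) schedule
    update = eta * g
    theta = theta - update
    if t % 500 == 0:
        # Log the distinct quantities separately, as the bullets above insist:
        # objective value, gradient norm, and update norm are different objects.
        print(f"t={t:4d}  f={f(theta):.5f}  "
              f"|grad|={np.linalg.norm(A @ theta):.4f}  "
              f"|update|={np.linalg.norm(update):.4f}")

print("final suboptimality:", f(theta))
```

Watching the update norm alongside the gradient norm shows the schedule doing its job: the gradient stays noisy at the floor set by sigma, while the shrinking step size drives the update norm, and hence the objective, toward zero.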

Diagnostic questions:

  • Which assumption about strongly convex SGD is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • minibatch training for deep networks and transformers.
  • batch-size and learning-rate coupling in large-scale pretraining.
  • distributed gradient averaging under data parallelism.
  • variance reduction ideas behind efficient fine-tuning and classical ML solvers.
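The data-parallel bullet can be checked numerically: averaging gradients from W workers, each unbiased with independent noise, shrinks the variance by roughly 1/W. The Gaussian worker-noise model below is a simplifying assumption used only for the simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
d, W, trials = 8, 16, 2000
true_grad = rng.normal(size=d)       # the gradient every worker estimates
sigma = 1.0                          # per-worker noise scale (assumed Gaussian)

def worker_grad():
    # One worker's minibatch gradient: unbiased but noisy.
    return true_grad + sigma * rng.normal(size=d)

# Compare the variance of one worker's gradient with the all-reduce average.
single = np.array([worker_grad() for _ in range(trials)])
averaged = np.array([np.mean([worker_grad() for _ in range(W)], axis=0)
                     for _ in range(trials)])

var_single = single.var(axis=0).mean()
var_avg = averaged.var(axis=0).mean()
print(f"single-worker variance ~ {var_single:.3f}, "
      f"averaged over {W} workers ~ {var_avg:.3f}")
```

This is the mechanism behind batch-size and learning-rate coupling: averaging over workers is statistically the same as enlarging the batch, so the tolerable step size grows with it.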

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

2.5 Notation, dimensions, and assumptions

In this section, nonconvex SGD is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Stochastic Optimization, the phrase "Notation, dimensions, and assumptions" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, nonconvex SGD is the part of Stochastic Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through f, \boldsymbol{\theta}, \eta, \nabla f(\boldsymbol{\theta}), and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where nonconvex SGD can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where nonconvex SGD affects optimization but the model remains interpretable.
  • A transformer training diagnostic where nonconvex SGD appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating nonconvex SGD as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

\mathbf{g}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t}\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}_t; \mathbf{x}^{(i)}, y^{(i)})
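The dimension bookkeeping in this formula can be made tangible by measuring how the estimator's variance scales with the batch size B. Sampling with replacement below is an assumption made so the clean 1/B law holds exactly; the quadratic per-example loss is again an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 4
X = rng.normal(size=(n, d))
theta = rng.normal(size=d)

# Per-example gradients of 0.5 * ||theta - x_i||^2, one row per example.
per_example = theta - X                  # shape (n, d)
full_grad = per_example.mean(axis=0)

def minibatch_variance(B, trials=3000):
    """Mean squared deviation of the size-B minibatch gradient."""
    dev = 0.0
    for _ in range(trials):
        idx = rng.choice(n, size=B, replace=True)
        g = per_example[idx].mean(axis=0)
        dev += np.sum((g - full_grad) ** 2)
    return dev / trials

v8, v64 = minibatch_variance(8), minibatch_variance(64)
print(f"E||g_B - grad||^2 at B=8: {v8:.4f}, at B=64: {v64:.4f}, "
      f"ratio ~ {v8 / v64:.1f}")
```

An eightfold increase in B cuts the squared deviation by roughly a factor of eight, which is the sigma^2/B assumption the convergence bounds in this section rely on.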

Proof sketch or reasoning pattern:

Start with the local model around θt\boldsymbol{\theta}_t, isolate the term involving nonconvex SGD, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
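In the nonconvex case the same conditional-expectation step yields the descent lemma with no convexity used at all; summing it over t then bounds the best gradient norm rather than the suboptimality. Assuming L-smoothness, unbiased gradients with variance at most \sigma^2/B, a constant step \eta \le 1/L, and a lower bound f^\star on f:

```latex
\mathbb{E}\!\left[f(\boldsymbol{\theta}_{t+1}) \mid \boldsymbol{\theta}_t\right]
  \le f(\boldsymbol{\theta}_t)
    - \eta\!\left(1 - \tfrac{L\eta}{2}\right)\|\nabla f(\boldsymbol{\theta}_t)\|^2
    + \frac{L\eta^2 \sigma^2}{2B},
\qquad
\min_{t < T}\,\mathbb{E}\|\nabla f(\boldsymbol{\theta}_t)\|^2
  \le \frac{2\,(f(\boldsymbol{\theta}_0) - f^{\star})}{\eta T} + \frac{L\eta\sigma^2}{B}
```

Note the change of target: without convexity the guarantee is about finding an approximately stationary point, not an approximate minimizer, which is why the diagnostics below emphasize gradient norms.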

Implementation consequence:

  • Log a metric that makes nonconvex SGD visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{g}_t
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
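A sketch of the nonconvex version of this logging discipline, on an assumed double-well objective f(\boldsymbol{\theta}) = \tfrac{1}{4}\sum_i(\theta_i^2 - 1)^2 where the right quantity to track is the smallest gradient norm seen so far, matching the stationarity guarantee rather than a global-optimality one:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
# Nonconvex double-well objective: minima at theta_i = +/-1, local max at 0.
f = lambda th: np.sum((th**2 - 1.0) ** 2) / 4.0
grad = lambda th: th * (th**2 - 1.0)

theta = rng.normal(size=d) * 0.5
eta, sigma = 0.05, 0.05        # illustrative constant step and noise scale
min_gnorm = np.inf

for t in range(3000):
    g = grad(theta) + sigma * rng.normal(size=d)   # noisy gradient
    theta -= eta * g
    gnorm = np.linalg.norm(grad(theta))            # true gradient norm (diagnostic)
    min_gnorm = min(min_gnorm, gnorm)
    if t % 1000 == 0:
        print(f"t={t:4d}  f={f(theta):.4f}  |grad|={gnorm:.4f}")

# The nonconvex guarantee concerns min_t ||grad||, not global optimality.
print("smallest gradient norm seen:", min_gnorm)
```

Each coordinate settles into one of the two wells; which well depends on initialization and noise, which is exactly why a scalar loss alone cannot distinguish the basins.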

Diagnostic questions:

  • Which assumption about nonconvex SGD is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • minibatch training for deep networks and transformers.
  • batch-size and learning-rate coupling in large-scale pretraining.
  • distributed gradient averaging under data parallelism.
  • variance reduction ideas behind efficient fine-tuning and classical ML solvers.
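The variance-reduction bullet can be illustrated with a control-variate estimator in the style of SVRG: subtract the per-example gradient at a snapshot point and add back the snapshot's full gradient. The least-squares loss, snapshot construction, and constants below are all assumptions made for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 500, 4
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

theta = rng.normal(size=d)                    # current iterate
snapshot = theta + 0.3 * rng.normal(size=d)   # anchor point theta_tilde

def per_grad(th, i):
    # Gradient of 0.5 * (x_i . th - y_i)^2 for one example.
    return (X[i] @ th - y[i]) * X[i]

# Full gradient at the snapshot, computed once and reused.
full_at_snapshot = ((X @ snapshot - y)[:, None] * X).mean(axis=0)

plain, reduced = [], []
for _ in range(4000):
    i = rng.integers(n)
    plain.append(per_grad(theta, i))
    # Control-variate estimator: same expectation, lower variance
    # whenever theta is close to the snapshot.
    reduced.append(per_grad(theta, i) - per_grad(snapshot, i) + full_at_snapshot)

plain, reduced = np.array(plain), np.array(reduced)
print("plain estimator variance:  ", plain.var(axis=0).sum())
print("reduced estimator variance:", reduced.var(axis=0).sum())
```

Both estimators are unbiased, but the control-variate one is far less noisy near the snapshot, which is the mechanism classical solvers and some fine-tuning methods exploit.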

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
