
Convex Optimization, Part 1: Intuition and Formal Definitions

1. Intuition

This block develops intuition for Convex Optimization. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.

1.1 Why Convex Optimization matters for training systems

In this section, first-order characterization is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Convex Optimization, the phrase "Why Convex Optimization matters for training systems" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, first-order characterization is the part of Convex Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where first-order characterization can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where first-order characterization affects optimization but the model remains interpretable.
  • A transformer training diagnostic where first-order characterization appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating first-order characterization as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving first-order characterization, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
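This pattern can be probed numerically. The sketch below, assuming a small synthetic quadratic like the first example above (the matrix is invented for illustration), checks the first-order characterization of convexity, $f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^\top(\mathbf{y} - \mathbf{x})$, on random point pairs:

```python
import numpy as np

# Hypothetical convex quadratic f(x) = 0.5 x^T A x; this A is positive
# definite (both eigenvalues are positive), so f is convex.
A = np.array([[2.0, 0.5], [0.5, 1.0]])

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

# First-order characterization: f(y) >= f(x) + grad_f(x)^T (y - x) for all x, y.
rng = np.random.default_rng(0)
violations = 0
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    if f(y) < f(x) + grad_f(x) @ (y - x) - 1e-12:
        violations += 1

print(violations)  # 0 for a convex objective
```

For a quadratic the gap is exactly $\tfrac{1}{2}(\mathbf{y}-\mathbf{x})^\top A (\mathbf{y}-\mathbf{x}) \geq 0$, so no violations should appear.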

Implementation consequence:

  • Log a metric that makes first-order characterization visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about first-order characterization is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • logistic regression and softmax regression as convex baselines.
  • support-vector machines through primal and dual convex programs.
  • nuclear-norm relaxations behind low-rank matrix recovery and LoRA intuition.
  • regularized empirical risk minimization with explicit certificates.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

1.2 The optimization object: parameters, objective, algorithm, and diagnostic

In this section, second-order characterization is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Convex Optimization, the phrase "The optimization object: parameters, objective, algorithm, and diagnostic" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, second-order characterization is the part of Convex Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where second-order characterization can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where second-order characterization affects optimization but the model remains interpretable.
  • A transformer training diagnostic where second-order characterization appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating second-order characterization as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving second-order characterization, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
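The second-order characterization states that a twice-differentiable $f$ is convex iff $\nabla^2 f(\boldsymbol{\theta}) \succeq 0$ everywhere. A minimal sketch, using the per-sample logistic loss (whose Hessian is $\sigma(z)(1-\sigma(z))\,\mathbf{x}\mathbf{x}^\top$, a standard fact), checks positive semidefiniteness numerically:

```python
import numpy as np

# Per-sample logistic loss l(theta) = log(1 + exp(-x @ theta)) for label +1.
# Its Hessian is s(1 - s) * outer(x, x) with s = sigmoid(x @ theta):
# a nonnegative scalar times a rank-one PSD matrix, hence PSD.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_hessian(theta, x):
    s = sigmoid(x @ theta)
    return s * (1.0 - s) * np.outer(x, x)

rng = np.random.default_rng(1)
theta, x = rng.normal(size=3), rng.normal(size=3)
eigs = np.linalg.eigvalsh(logistic_hessian(theta, x))
print(eigs.min() >= -1e-12)  # True: the Hessian is PSD, so the loss is convex
```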

Implementation consequence:

  • Log a metric that makes second-order characterization visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about second-order characterization is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • logistic regression and softmax regression as convex baselines.
  • support-vector machines through primal and dual convex programs.
  • nuclear-norm relaxations behind low-rank matrix recovery and LoRA intuition.
  • regularized empirical risk minimization with explicit certificates.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

1.3 Historical arc from classical optimization to modern AI

In this section, smoothness is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Convex Optimization, the phrase "Historical arc from classical optimization to modern AI" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, smoothness is the part of Convex Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where smoothness can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where smoothness affects optimization but the model remains interpretable.
  • A transformer training diagnostic where smoothness appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating smoothness as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving smoothness, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
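For a quadratic $f(\boldsymbol{\theta}) = \tfrac{1}{2}\boldsymbol{\theta}^\top A \boldsymbol{\theta}$, the smoothness constant $L$ is the largest eigenvalue of $A$, and gradient descent with $\eta = 1/L$ decreases the objective at every step (the descent lemma). A toy check on an invented diagonal quadratic:

```python
import numpy as np

# Synthetic L-smooth quadratic; here L = lambda_max(A) = 10.
A = np.diag([10.0, 1.0])
L = np.linalg.eigvalsh(A).max()

def f(x):
    return 0.5 * x @ A @ x

x = np.array([1.0, 1.0])
losses = [f(x)]
for _ in range(50):
    x = x - (1.0 / L) * (A @ x)   # gradient step with the safe 1/L step size
    losses.append(f(x))

monotone = all(b <= a for a, b in zip(losses, losses[1:]))
print(monotone)  # True: eta = 1/L guarantees monotone descent here
```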

Implementation consequence:

  • Log a metric that makes smoothness visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about smoothness is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • logistic regression and softmax regression as convex baselines.
  • support-vector machines through primal and dual convex programs.
  • nuclear-norm relaxations behind low-rank matrix recovery and LoRA intuition.
  • regularized empirical risk minimization with explicit certificates.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

1.4 What this section treats as canonical scope

In this section, strong convexity is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Convex Optimization, the phrase "What this section treats as canonical scope" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, strong convexity is the part of Convex Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where strong convexity can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where strong convexity affects optimization but the model remains interpretable.
  • A transformer training diagnostic where strong convexity appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating strong convexity as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving strong convexity, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
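For a $\mu$-strongly convex, $L$-smooth objective, this reasoning pattern yields the linear rate $f(\boldsymbol{\theta}_T) - f^* \leq (1 - \mu/L)^T \big(f(\boldsymbol{\theta}_0) - f^*\big)$ under $\eta = 1/L$. A toy check on a diagonal quadratic with values chosen for illustration:

```python
import numpy as np

# mu-strongly convex, L-smooth quadratic with mu = 1, L = 4 (diagonal entries).
A = np.diag([4.0, 1.0])
mu, L = 1.0, 4.0

def f(x):
    return 0.5 * x @ A @ x   # minimizer is x* = 0 with f* = 0

x = np.array([1.0, 1.0])
f0, T = f(x), 30
for _ in range(T):
    x = x - (1.0 / L) * (A @ x)

gap = f(x)                       # f(x_T) - f*, since f* = 0
bound = (1.0 - mu / L) ** T * f0
print(gap <= bound)  # True: the observed gap sits under the linear-rate bound
```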

Implementation consequence:

  • Log a metric that makes strong convexity visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about strong convexity is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • logistic regression and softmax regression as convex baselines.
  • support-vector machines through primal and dual convex programs.
  • nuclear-norm relaxations behind low-rank matrix recovery and LoRA intuition.
  • regularized empirical risk minimization with explicit certificates.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

1.5 A first mental model for LLM training

In this section, condition number is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Convex Optimization, the phrase "A first mental model for LLM training" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, condition number is the part of Convex Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where condition number can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where condition number affects optimization but the model remains interpretable.
  • A transformer training diagnostic where condition number appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating condition number as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving the condition number, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes condition number visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
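The condition number $\kappa = L/\mu$ is one such measurable quantity: with the $1/L$ step size, gradient descent needs on the order of $\kappa \log(1/\epsilon)$ iterations. A sketch comparing two invented quadratics that differ only in conditioning:

```python
import numpy as np

# Count gradient steps until the objective drops below a tolerance.
def steps_to_tol(eigs, tol=1e-6, cap=100_000):
    A = np.diag(eigs)
    x = np.ones(len(eigs))
    eta = 1.0 / eigs.max()          # the standard 1/L step size
    steps = 0
    while 0.5 * x @ A @ x > tol and steps < cap:
        x = x - eta * (A @ x)
        steps += 1
    return steps

well = steps_to_tol(np.array([1.0, 2.0]))     # kappa = 2
ill = steps_to_tol(np.array([1.0, 100.0]))    # kappa = 100
print(well < ill)  # True: iteration count grows with the condition number
```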

Diagnostic questions:

  • Which assumption about condition number is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • logistic regression and softmax regression as convex baselines.
  • support-vector machines through primal and dual convex programs.
  • nuclear-norm relaxations behind low-rank matrix recovery and LoRA intuition.
  • regularized empirical risk minimization with explicit certificates.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

2. Formal Definitions

This block develops formal definitions for Convex Optimization. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.

2.1 Primary definition: convex sets

In this section, strong convexity is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Convex Optimization, the phrase "Primary definition: convex sets" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, strong convexity is the part of Convex Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where strong convexity can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where strong convexity affects optimization but the model remains interpretable.
  • A transformer training diagnostic where strong convexity appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating strong convexity as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving strong convexity, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
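This subsection's headline object, the convex set, is defined by the property that $\alpha \mathbf{x} + (1-\alpha)\mathbf{y} \in \mathcal{C}$ whenever $\mathbf{x}, \mathbf{y} \in \mathcal{C}$ and $\alpha \in [0,1]$. A randomized sanity check on the unit Euclidean ball, chosen here only as a familiar convex set:

```python
import numpy as np

# Membership test for the closed unit Euclidean ball, a standard convex set.
def in_ball(p):
    return np.linalg.norm(p) <= 1.0 + 1e-12

# Sample pairs of points in the ball and check that every tested point on the
# segment between them stays inside (the defining property of convexity).
rng = np.random.default_rng(2)
inside = True
for _ in range(1000):
    x = rng.normal(size=3)
    x = x / max(1.0, np.linalg.norm(x))   # pull into the ball if needed
    y = rng.normal(size=3)
    y = y / max(1.0, np.linalg.norm(y))
    a = rng.uniform()
    inside = inside and in_ball(a * x + (1.0 - a) * y)

print(inside)  # True: the ball is convex
```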

Implementation consequence:

  • Log a metric that makes strong convexity visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about strong convexity is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • logistic regression and softmax regression as convex baselines.
  • support-vector machines through primal and dual convex programs.
  • nuclear-norm relaxations behind low-rank matrix recovery and LoRA intuition.
  • regularized empirical risk minimization with explicit certificates.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

2.2 Secondary definition: convex combinations

In this section, condition number is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Convex Optimization, the phrase "Secondary definition: convex combinations" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, condition number is the part of Convex Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where condition number can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where condition number affects optimization but the model remains interpretable.
  • A transformer training diagnostic where condition number appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating condition number as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving the condition number, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
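The convex combinations named in this subsection's title require weights on the probability simplex: $w_i \geq 0$ and $\sum_i w_i = 1$. Softmax outputs are one familiar source of such weights, which is one way attention outputs can be read as convex combinations of value vectors. A minimal sketch with invented data:

```python
import numpy as np

# Softmax maps any real vector to the probability simplex: w_i >= 0, sum w_i = 1.
def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(3)
w = softmax(rng.normal(size=5))
points = rng.normal(size=(5, 2))
combo = w @ points           # convex combination of the five rows

on_simplex = bool(np.all(w >= 0.0) and np.isclose(w.sum(), 1.0))
in_box = bool(np.all(points.min(axis=0) - 1e-12 <= combo)
              and np.all(combo <= points.max(axis=0) + 1e-12))
print(on_simplex and in_box)  # True: valid weights, combo inside the hull's box
```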

Implementation consequence:

  • Log a metric that makes condition number visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about condition number is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • logistic regression and softmax regression as convex baselines.
  • support-vector machines through primal and dual convex programs.
  • nuclear-norm relaxations behind low-rank matrix recovery and LoRA intuition.
  • regularized empirical risk minimization with explicit certificates.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

2.3 Algorithmic object: convex functions

In this section, convex problem classes are treated as concrete optimization objects rather than slogans. The goal is to understand how they change the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Convex Optimization, the phrase "Algorithmic object: convex functions" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, convex problem classes are the parts of Convex Optimization that control how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track them through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where convex problem classes can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where convex problem classes affect optimization but the model remains interpretable.
  • A transformer training diagnostic where convex problem classes appear through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating convex problem classes as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving convex problem classes, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
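The chord inequality above can be probed numerically for this subsection's convex functions. The sketch below uses the scalar logistic loss $f(z) = \log(1 + e^{-z})$, a convex function chosen for illustration:

```python
import numpy as np

# Scalar logistic loss f(z) = log(1 + exp(-z)), computed stably; it is convex.
def f(z):
    return np.logaddexp(0.0, -z)

# Probe the chord inequality f(a x + (1-a) y) <= a f(x) + (1-a) f(y).
rng = np.random.default_rng(4)
ok = True
for _ in range(1000):
    x, y = 5.0 * rng.normal(size=2)
    a = rng.uniform()
    ok = ok and f(a * x + (1.0 - a) * y) <= a * f(x) + (1.0 - a) * f(y) + 1e-9

print(ok)  # True: every random chord sits on or above the graph
```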

Implementation consequence:

  • Log a metric that makes convex problem classes visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about convex problem classes is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • logistic regression and softmax regression as convex baselines.
  • support-vector machines through primal and dual convex programs.
  • nuclear-norm relaxations behind low-rank matrix recovery and LoRA intuition.
  • regularized empirical risk minimization with explicit certificates.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

2.4 Examples, non-examples, and boundary cases

In this section, linear programs are treated as concrete optimization objects rather than slogans. The goal is to understand how they change the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Convex Optimization, the phrase "Examples, non-examples, and boundary cases" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, linear programs are the parts of Convex Optimization that control how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track them through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where linear programs can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where linear programs affect optimization but the model remains interpretable.
  • A transformer training diagnostic where linear programs appear through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating linear programs as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving linear programs, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
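A linear program attains its optimum, when one exists, at a vertex of the feasible polytope, so a toy two-variable instance can be solved by enumerating constraint intersections. The instance below is invented for illustration:

```python
import numpy as np
from itertools import combinations

# Toy LP: maximize x + y, i.e. minimize c @ v with c = (-1, -1),
# subject to x <= 1, y <= 1, x + y <= 1.5, x >= 0, y >= 0.
c = np.array([-1.0, -1.0])
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 1.5, 0.0, 0.0])

# Enumerate intersections of constraint pairs; keep feasible vertices.
best, best_x = np.inf, None
for i, j in combinations(range(len(b)), 2):
    M = A[[i, j]]
    if abs(np.linalg.det(M)) < 1e-12:
        continue                          # parallel constraints: no vertex
    v = np.linalg.solve(M, b[[i, j]])
    if np.all(A @ v <= b + 1e-9) and c @ v < best:
        best, best_x = c @ v, v

print(best)  # -1.5: the maximum of x + y over this polytope is 1.5
```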

Implementation consequence:

  • Log a metric that makes linear programs visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
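
A minimal sketch of the comparison in the second bullet, assuming plain gradient descent on a toy quadratic (the objective, iterate, and step size are illustrative assumptions): compute the measured update from consecutive iterates and compare it with the mathematical update $-\eta \nabla f(\boldsymbol{\theta})$.

```python
import numpy as np

# Illustrative objective f(x) = 0.5 * x^T A x and step size.
eta = 0.1
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def grad(theta):
    return A @ theta

theta = np.array([1.0, -2.0])
g = grad(theta)
theta_next = theta - eta * g

measured_update = theta_next - theta   # what the run actually did
predicted_update = -eta * g            # what the math says it should do

# These are the distinct quantities worth logging separately.
print("parameter norm:", np.linalg.norm(theta))
print("gradient norm: ", np.linalg.norm(g))
print("update norm:   ", np.linalg.norm(measured_update))
print("mismatch:      ", np.linalg.norm(measured_update - predicted_update))
# mismatch is ~0 here; in a real run, a large mismatch points at the
# optimizer (momentum, weight decay, clipping), not at data or architecture.
```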

Diagnostic questions:

  • Which assumption about linear programs is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • logistic regression and softmax regression as convex baselines.
  • support-vector machines through primal and dual convex programs.
  • nuclear-norm relaxations behind low-rank matrix recovery and LoRA intuition.
  • regularized empirical risk minimization with explicit certificates.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

2.5 Notation, dimensions, and assumptions

In this section, quadratic programs are treated as concrete optimization objects rather than slogans. The goal is to understand how they change the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Convex Optimization, the phrase "Notation, dimensions, and assumptions" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, quadratic programs are the part of Convex Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track them through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where a quadratic program can be solved directly and compared with theory.
  • A logistic-regression or softmax objective where quadratic programs affect optimization but the model remains interpretable.
  • A transformer training diagnostic where quadratic programs appear through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating quadratic programs as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y})$$
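
The constrained-minimizer formula below the implementation notes can be exercised on a tiny quadratic program. The sketch here, with assumed problem data chosen so the answer is easy to audit, minimizes a quadratic over the box $[0,1]^2$ by projected gradient descent:

```python
import numpy as np

# Tiny assumed QP: minimize 0.5 * x^T A x - b^T x  subject to  0 <= x <= 1.
A = np.array([[2.0, 0.0],
              [0.0, 2.0]])
b = np.array([3.0, -1.0])

def grad(x):
    return A @ x - b

def project(x):
    # Euclidean projection onto the box [0, 1]^2
    return np.clip(x, 0.0, 1.0)

x = np.zeros(2)
eta = 0.25                       # safe step: eta <= 1/lambda_max(A) = 0.5
for _ in range(200):
    x = project(x - eta * grad(x))

# Unconstrained minimizer is A^{-1} b = (1.5, -0.5); the box clips it,
# and the constrained solution is (1, 0).
print(x)   # approx [1., 0.]
```

The data `A`, `b`, and the box are assumptions; the point is that the iterate converges to the $\arg\min$ over the feasible set, which can be confirmed coordinate by coordinate for this separable problem.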

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving quadratic programs, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
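
The geometric case can be turned into a checkable inequality: projection onto a convex set is non-expansive. The sketch below verifies $\|P(\mathbf{x}) - P(\mathbf{y})\| \le \|\mathbf{x} - \mathbf{y}\|$ numerically for the box $[0,1]^d$, an assumed feasible region used only for illustration:

```python
import numpy as np

def project_box(x):
    # Euclidean projection onto the box [0, 1]^d
    return np.clip(x, 0.0, 1.0)

rng = np.random.default_rng(2)
for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    dist_before = np.linalg.norm(x - y)
    dist_after = np.linalg.norm(project_box(x) - project_box(y))
    # non-expansiveness: projection never increases distances
    assert dist_after <= dist_before + 1e-12
print("projection onto the box is non-expansive on all sampled pairs")
```

Non-expansiveness is what lets a projected update inherit the descent guarantees of the unconstrained step; the box is just the simplest convex set on which to see it.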

Implementation consequence:

  • Log a metric that makes quadratic programs visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about quadratic programs is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • logistic regression and softmax regression as convex baselines.
  • support-vector machines through primal and dual convex programs.
  • nuclear-norm relaxations behind low-rank matrix recovery and LoRA intuition.
  • regularized empirical risk minimization with explicit certificates.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

Skill Check

Answer four quick questions to lock in the lesson and feed your adaptive practice queue:

  1. Which module does this lesson belong to?
  2. Which section is covered in this lesson content?
  3. Which term is most central to this lesson?
  4. What is the best way to use this lesson for real learning?