Constrained Optimization, Part 1: 1. Intuition to 2. Formal Definitions
1. Intuition
This block develops intuition for Constrained Optimization. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
1.1 Why Constrained Optimization matters for training systems
In this section, the Lagrangian is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the heading "Why Constrained Optimization matters for training systems" stands for a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
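For this section, the Lagrangian is the function that folds the constraints into the objective by adding each constraint function weighted by a multiplier, so that constrained optimality can be studied through one scalar function.
Symbolically, we track it through the parameters $\theta$, the objective $f$, the constraint functions $g_i$ and $h_j$, the multipliers $\lambda_i$ and $\nu_j$, and any auxiliary state used by the algorithm.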
Examples:
- A small synthetic quadratic where the Lagrangian can be computed directly and compared with theory.
- A logistic-regression or softmax objective where the Lagrangian affects optimization but the model remains interpretable.
- A transformer training diagnostic where the Lagrangian appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the Lagrangian as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
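A standard candidate for this slot, written in notation assumed here (parameters $\theta$, objective $f$, inequality constraints $g_i(\theta) \le 0$ for $i = 1, \dots, m$, equality constraints $h_j(\theta) = 0$ for $j = 1, \dots, p$, multipliers $\lambda_i \ge 0$ and $\nu_j$), is the Lagrangian itself:

```latex
L(\theta, \lambda, \nu)
  = f(\theta)
  + \sum_{i=1}^{m} \lambda_i \, g_i(\theta)
  + \sum_{j=1}^{p} \nu_j \, h_j(\theta)
```

At any feasible $\theta$ with $\lambda_i \ge 0$, every term $\lambda_i g_i(\theta)$ is nonpositive and every $h_j(\theta)$ is zero, so $L(\theta, \lambda, \nu) \le f(\theta)$.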
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving the Lagrangian, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the Lagrangian visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
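The logging advice above can be sketched on a toy problem. Everything here is an assumption made for illustration: the two-variable quadratic, the single equality constraint, and the names `objective`, `constraint`, `lagrangian`, and `metrics` are hypothetical and not part of the lesson.

```python
# Toy problem (an assumption for illustration): minimize f(x, y) = x^2 + y^2
# subject to the single equality constraint h(x, y) = x + y - 1 = 0.

def objective(x, y):
    return x * x + y * y

def constraint(x, y):
    # Equality constraint residual; zero means the point is feasible.
    return x + y - 1.0

def lagrangian(x, y, nu):
    # L = f + nu * h for one equality constraint.
    return objective(x, y) + nu * constraint(x, y)

def lagrangian_grad(x, y, nu):
    # Analytic gradient of L with respect to (x, y).
    return (2.0 * x + nu, 2.0 * y + nu)

# Closed-form solution of the toy problem: x = y = 1/2 with multiplier nu = -1.
x_star, y_star, nu_star = 0.5, 0.5, -1.0

# Each quantity is logged separately so that they are not confused.
metrics = {
    "objective": objective(x_star, y_star),
    "constraint_residual": constraint(x_star, y_star),
    "lagrangian": lagrangian(x_star, y_star, nu_star),
    "grad_norm": sum(g * g for g in lagrangian_grad(x_star, y_star, nu_star)) ** 0.5,
}
print(metrics)
```

At the solution the constraint residual and the gradient norm of the Lagrangian both vanish, and the Lagrangian equals the objective; away from it, the three numbers separate and show which part of the problem is off.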
Diagnostic questions:
- Which assumption about the Lagrangian is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
1.2 The optimization object: parameters, objective, algorithm, and diagnostic
In this section, stationarity is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "The optimization object: parameters, objective, algorithm, and diagnostic" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
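For this section, stationarity is the requirement that the gradient of the Lagrangian with respect to the parameters vanishes at a candidate solution, so that the objective gradient is balanced by a weighted combination of constraint gradients.
Symbolically, we track it through the parameters $\theta$, the objective $f$, the constraint functions $g_i$ and $h_j$, the multipliers $\lambda_i$ and $\nu_j$, and any auxiliary state used by the algorithm.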
Examples:
- A small synthetic quadratic where stationarity can be computed directly and compared with theory.
- A logistic-regression or softmax objective where stationarity affects optimization but the model remains interpretable.
- A transformer training diagnostic where stationarity appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating stationarity as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
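A standard candidate for this slot, in notation assumed here (candidate solution $\theta^\star$, objective $f$, inequality constraints $g_i(\theta) \le 0$, equality constraints $h_j(\theta) = 0$, multipliers $\lambda_i^\star$ and $\nu_j^\star$), is the stationarity condition:

```latex
\nabla_\theta L(\theta^\star, \lambda^\star, \nu^\star)
  = \nabla f(\theta^\star)
  + \sum_{i=1}^{m} \lambda_i^\star \, \nabla g_i(\theta^\star)
  + \sum_{j=1}^{p} \nu_j^\star \, \nabla h_j(\theta^\star)
  = 0
```

With no constraints, this reduces to the familiar $\nabla f(\theta^\star) = 0$.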
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving stationarity, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes stationarity visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
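One way to make stationarity visible is to log the norm of the Lagrangian gradient. The sketch below does this on a toy problem chosen for illustration (squared distance to a point `C` under a unit-ball constraint); the problem, the multiplier formula, and the function name `stationarity_residual` are assumptions, not part of the lesson.

```python
import math

# Toy problem (an assumption): minimize ||theta - c||^2 subject to
# g(theta) = ||theta||^2 - 1 <= 0, with c = (2, 1) outside the unit ball.
C = (2.0, 1.0)

def stationarity_residual(theta, lam):
    # Norm of grad f(theta) + lam * grad g(theta).
    grad_f = [2.0 * (t - c) for t, c in zip(theta, C)]
    grad_g = [2.0 * t for t in theta]
    return math.sqrt(sum((gf + lam * gg) ** 2 for gf, gg in zip(grad_f, grad_g)))

norm_c = math.hypot(*C)
theta_star = tuple(c / norm_c for c in C)  # projection of c onto the unit ball
lam_star = norm_c - 1.0                    # follows from theta * (1 + lam) = c

res_opt = stationarity_residual(theta_star, lam_star)  # near zero at the solution
res_bad = stationarity_residual((0.0, 0.0), 0.0)       # large at the origin
print(res_opt, res_bad)
```

The residual is a gradient-type quantity, so it should be compared with other gradient norms and not with the objective value.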
Diagnostic questions:
- Which assumption about stationarity is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
1.3 Historical arc from classical optimization to modern AI
In this section, primal feasibility is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "Historical arc from classical optimization to modern AI" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, primal feasibility is the requirement that the parameters satisfy every constraint: each inequality constraint holds and each equality constraint holds exactly.
Symbolically, we track it through the parameters $\theta$, the objective $f$, the constraint functions $g_i$ and $h_j$, the multipliers $\lambda_i$ and $\nu_j$, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where primal feasibility can be computed directly and compared with theory.
- A logistic-regression or softmax objective where primal feasibility affects optimization but the model remains interpretable.
- A transformer training diagnostic where primal feasibility appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating primal feasibility as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
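A standard candidate for this slot, in notation assumed here (parameters $\theta$, inequality constraints $g_i$, equality constraints $h_j$), is the feasibility condition together with a scalar violation that can be logged:

```latex
g_i(\theta) \le 0 \quad (i = 1, \dots, m), \qquad
h_j(\theta) = 0 \quad (j = 1, \dots, p)

v(\theta) = \max\Bigl( \max_i \, \max\bigl(0, g_i(\theta)\bigr), \; \max_j \, \lvert h_j(\theta) \rvert \Bigr)
```

The violation $v(\theta)$ is zero exactly when $\theta$ is primal feasible.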
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving primal feasibility, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes primal feasibility visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
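For a norm-constrained parameter vector, primal feasibility can be both enforced and logged in a few lines. This is a minimal sketch assuming a Euclidean-ball constraint $\lVert v \rVert \le r$; the helper names `violation`, `project_l2_ball`, and `projected_gradient_step` are hypothetical.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def violation(v, radius):
    # Primal feasibility violation for the constraint ||v|| <= radius.
    return max(0.0, l2_norm(v) - radius)

def project_l2_ball(v, radius):
    # Euclidean projection onto {v : ||v|| <= radius}.
    n = l2_norm(v)
    if n <= radius:
        return list(v)
    return [x * radius / n for x in v]

def projected_gradient_step(v, grad, lr, radius):
    # Gradient step followed by projection back onto the feasible set.
    stepped = [x - lr * g for x, g in zip(v, grad)]
    return project_l2_ball(stepped, radius)

before = [3.0, 4.0]
after = project_l2_ball(before, 1.0)
print(violation(before, 1.0), violation(after, 1.0), after)
```

Logging `violation` before and after the projection shows how far the raw update left the feasible set, which the scalar loss alone does not reveal.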
Diagnostic questions:
- Which assumption about primal feasibility is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
1.4 What this section treats as canonical scope
In this section, dual feasibility is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "What this section treats as canonical scope" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, dual feasibility is the requirement that every multiplier attached to an inequality constraint is nonnegative; multipliers attached to equality constraints are unrestricted in sign.
Symbolically, we track it through the parameters $\theta$, the objective $f$, the constraint functions $g_i$ and $h_j$, the multipliers $\lambda_i$ and $\nu_j$, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where dual feasibility can be computed directly and compared with theory.
- A logistic-regression or softmax objective where dual feasibility affects optimization but the model remains interpretable.
- A transformer training diagnostic where dual feasibility appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating dual feasibility as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
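A standard candidate for this slot, in notation assumed here (Lagrangian $L(\theta, \lambda, \nu) = f(\theta) + \sum_i \lambda_i g_i(\theta) + \sum_j \nu_j h_j(\theta)$, optimal constrained value $p^\star$), is dual feasibility together with the lower bound it buys:

```latex
\lambda_i \ge 0 \quad (i = 1, \dots, m)
\quad \Longrightarrow \quad
q(\lambda, \nu) = \inf_{\theta} L(\theta, \lambda, \nu) \le p^\star
```

The bound holds because, at any feasible $\theta$, nonnegative $\lambda_i$ make each term $\lambda_i g_i(\theta)$ nonpositive; with a negative multiplier the inequality can fail.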
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving dual feasibility, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes dual feasibility visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
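A common way to keep dual feasibility in an algorithm is to clip the multiplier at zero after each dual update. The sketch below shows projected dual ascent on a one-dimensional toy problem; the problem, the step size, and the function names are assumptions chosen so the iteration can be checked by hand.

```python
# Toy problem (an assumption): minimize x^2 subject to g(x) = 1 - x <= 0.
# For fixed lam, the Lagrangian x^2 + lam * (1 - x) is minimized at x = lam / 2.

def primal_minimizer(lam):
    return lam / 2.0

def constraint(x):
    return 1.0 - x

def dual_ascent(steps=200, step_size=0.5):
    lam = 0.0
    for _ in range(steps):
        x = primal_minimizer(lam)
        # Ascent step on the dual, then projection onto lam >= 0
        # so that dual feasibility holds at every iteration.
        lam = max(0.0, lam + step_size * constraint(x))
    return primal_minimizer(lam), lam

x_final, lam_final = dual_ascent()
print(x_final, lam_final)
```

With this step size the update is $\lambda \leftarrow 0.75\,\lambda + 0.5$, whose fixed point is $\lambda = 2$, giving $x = 1$ on the constraint boundary.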
Diagnostic questions:
- Which assumption about dual feasibility is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
1.5 A first mental model for LLM training
In this section, complementary slackness is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "A first mental model for LLM training" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, complementary slackness is the requirement that, for each inequality constraint, either the constraint holds with equality or its multiplier is zero.
Symbolically, we track it through the parameters $\theta$, the objective $f$, the constraint functions $g_i$ and $h_j$, the multipliers $\lambda_i$ and $\nu_j$, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where complementary slackness can be computed directly and compared with theory.
- A logistic-regression or softmax objective where complementary slackness affects optimization but the model remains interpretable.
- A transformer training diagnostic where complementary slackness appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating complementary slackness as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
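A standard candidate for this slot, in notation assumed here (candidate solution $\theta^\star$, inequality constraints $g_i(\theta) \le 0$, multipliers $\lambda_i^\star \ge 0$), is the complementary slackness condition:

```latex
\lambda_i^\star \, g_i(\theta^\star) = 0 \qquad (i = 1, \dots, m)
```

Since $\lambda_i^\star \ge 0$ and $g_i(\theta^\star) \le 0$, each product is nonpositive, so the condition is equivalent to $\sum_i \lambda_i^\star g_i(\theta^\star) = 0$.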
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving complementary slackness, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes complementary slackness visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
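A complementary slackness metric can be as simple as the largest product of a multiplier and its constraint value. The sketch below uses a one-dimensional toy problem with one active and one inactive constraint; the problem and the helper names are assumptions for illustration.

```python
# Toy problem (an assumption): minimize (x - 3)^2 subject to
# g1(x) = x - 2 <= 0 and g2(x) = -x - 5 <= 0.
# The solution is x = 2 with multipliers lam1 = 2 and lam2 = 0.

def constraints(x):
    return [x - 2.0, -x - 5.0]

def complementary_slackness_residual(lams, g_values):
    # Largest |lam_i * g_i|; zero when each product vanishes.
    return max(abs(l * g) for l, g in zip(lams, g_values))

def active_indices(g_values, tol=1e-8):
    # Inequality constraints that hold with equality, up to a tolerance.
    return [i for i, g in enumerate(g_values) if abs(g) <= tol]

x_star, lams_star = 2.0, [2.0, 0.0]
g_star = constraints(x_star)
print(complementary_slackness_residual(lams_star, g_star), active_indices(g_star))
```

Only the first constraint is active, and only its multiplier is nonzero; putting weight on the inactive second constraint would make the residual jump to the size of its slack.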
Diagnostic questions:
- Which assumption about complementary slackness is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
2. Formal Definitions
This block develops formal definitions for Constrained Optimization. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
2.1 Primary definition: feasible set
In this section, dual feasibility is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "Primary definition: feasible set" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
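For this section, the feasible set is the set of parameters that satisfy every constraint. Dual feasibility is the companion requirement on the multipliers: every multiplier attached to an inequality constraint is nonnegative.
Symbolically, we track it through the parameters $\theta$, the objective $f$, the constraint functions $g_i$ and $h_j$, the multipliers $\lambda_i$ and $\nu_j$, and any auxiliary state used by the algorithm.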
Examples:
- A small synthetic quadratic where dual feasibility can be computed directly and compared with theory.
- A logistic-regression or softmax objective where dual feasibility affects optimization but the model remains interpretable.
- A transformer training diagnostic where dual feasibility appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating dual feasibility as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
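A standard candidate for this slot, in notation assumed here (parameters $\theta \in \mathbb{R}^n$, inequality constraints $g_i$, equality constraints $h_j$, multipliers $\lambda \in \mathbb{R}^m$ and $\nu \in \mathbb{R}^p$), pairs the feasible set with its dual counterpart:

```latex
\mathcal{F} = \bigl\{ \theta \in \mathbb{R}^n : g_i(\theta) \le 0 \ (i = 1, \dots, m), \;
                      h_j(\theta) = 0 \ (j = 1, \dots, p) \bigr\}

(\lambda, \nu) \in \mathbb{R}^m_{+} \times \mathbb{R}^p
```

The constrained problem is then $\min_{\theta \in \mathcal{F}} f(\theta)$, and only multipliers in the second set are admissible in the dual.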
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving dual feasibility, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes dual feasibility visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about dual feasibility is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
2.2 Secondary definition: active constraint
In this section, complementary slackness is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "Secondary definition: active constraint" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, an inequality constraint is active at a point when it holds there with equality. Complementary slackness ties this to the multipliers: a multiplier can be nonzero only on an active constraint.
Symbolically, we track it through the parameters $\theta$, the objective $f$, the constraint functions $g_i$ and $h_j$, the multipliers $\lambda_i$ and $\nu_j$, and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where complementary slackness can be computed directly and compared with theory.
- A logistic-regression or softmax objective where complementary slackness affects optimization but the model remains interpretable.
- A transformer training diagnostic where complementary slackness appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating complementary slackness as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
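A standard candidate for this slot, in notation assumed here (inequality constraints $g_i(\theta) \le 0$, candidate solution $\theta^\star$, multipliers $\lambda_i^\star$), is the active set and its link to the multipliers:

```latex
\mathcal{A}(\theta) = \bigl\{ i \in \{1, \dots, m\} : g_i(\theta) = 0 \bigr\}

i \notin \mathcal{A}(\theta^\star) \;\Longrightarrow\; \lambda_i^\star = 0
```

The implication is complementary slackness restated: an inactive constraint has $g_i(\theta^\star) < 0$, so $\lambda_i^\star g_i(\theta^\star) = 0$ forces $\lambda_i^\star = 0$.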
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving complementary slackness, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes complementary slackness visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about complementary slackness is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
2.3 Algorithmic object: equality constraints
In this section, the KKT conditions are treated as concrete optimization objects rather than a slogan. The goal is to understand how they change the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the heading "Algorithmic object: equality constraints" stands for a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
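For this section, the KKT conditions are the first-order conditions a constrained optimum must satisfy under a constraint qualification: stationarity, primal feasibility, dual feasibility, and complementary slackness. With equality constraints only, they reduce to stationarity of the Lagrangian plus feasibility.
Symbolically, we track them through the parameters $\theta$, the objective $f$, the constraint functions $g_i$ and $h_j$, the multipliers $\lambda_i$ and $\nu_j$, and any auxiliary state used by the algorithm.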
Examples:
- A small synthetic quadratic where the KKT conditions can be checked directly and compared with theory.
- A logistic-regression or softmax objective where the KKT conditions affect optimization but the model remains interpretable.
- A transformer training diagnostic where the KKT conditions appear through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating KKT conditions as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
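A standard candidate for this slot, in notation assumed here (objective $f$, equality constraints $h_j(\theta) = 0$, multipliers $\nu_j$), is the KKT system for the equality-constrained case, followed by its form for a quadratic objective $f(\theta) = \tfrac{1}{2}\theta^\top Q \theta - c^\top \theta$ with linear constraints $A\theta = b$:

```latex
\nabla f(\theta^\star) + \sum_{j=1}^{p} \nu_j^\star \, \nabla h_j(\theta^\star) = 0,
\qquad
h_j(\theta^\star) = 0 \quad (j = 1, \dots, p)

\begin{bmatrix} Q & A^\top \\ A & 0 \end{bmatrix}
\begin{bmatrix} \theta^\star \\ \nu^\star \end{bmatrix}
=
\begin{bmatrix} c \\ b \end{bmatrix}
```

In the quadratic case, stationarity reads $Q\theta^\star - c + A^\top \nu^\star = 0$, which is the first block row of the linear system.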
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving the KKT conditions, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes KKT conditions visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
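For an equality-constrained quadratic, the KKT conditions form one linear system, so the "mathematical update" can be computed exactly and compared with what a training loop produces. The sketch below assumes a two-variable toy problem and a small hand-written solver; `solve_linear_system`, `kkt_matrix`, and `kkt_rhs` are hypothetical names.

```python
def solve_linear_system(a, b):
    # Gaussian elimination with partial pivoting on a small dense system.
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    sol = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * sol[k] for k in range(i + 1, n))
        sol[i] = (m[i][n] - tail) / m[i][i]
    return sol

# Toy problem (an assumption): minimize 0.5 * (x^2 + y^2) subject to x + y = 1.
# Stationarity and feasibility form one linear system in (x, y, nu):
#   x + nu = 0,  y + nu = 0,  x + y = 1.
kkt_matrix = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
]
kkt_rhs = [0.0, 0.0, 1.0]

x, y, nu = solve_linear_system(kkt_matrix, kkt_rhs)
print(x, y, nu)
```

The solver is only meant for tiny, nonsingular systems like this one; a numerical library would be the usual choice for anything larger.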
Diagnostic questions:
- Which assumption about KKT conditions is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
2.4 Examples, non-examples, and boundary cases
In this section, constraint qualifications are treated as concrete optimization objects rather than a slogan. The goal is to understand how they change the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the heading "Examples, non-examples, and boundary cases" stands for a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
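For this section, constraint qualifications are conditions on the constraint functions at a point that guarantee the KKT conditions are necessary for local optimality there. Common examples are linear independence of the active constraint gradients (LICQ) and, for convex problems, the Slater condition.
Symbolically, we track them through the parameters $\theta$, the objective $f$, the constraint functions $g_i$ and $h_j$, the multipliers $\lambda_i$ and $\nu_j$, and any auxiliary state used by the algorithm.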
Examples:
- A small synthetic quadratic where constraint qualifications can be checked directly and compared with theory.
- A logistic-regression or softmax objective where constraint qualifications affect optimization but the model remains interpretable.
- A transformer training diagnostic where constraint qualifications appear through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating constraint qualifications as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
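A standard candidate for this slot, in notation assumed here (active set $\mathcal{A}(\theta^\star) = \{ i : g_i(\theta^\star) = 0 \}$, equality constraints $h_j$), is the linear independence constraint qualification, followed by a boundary case where it fails:

```latex
\text{LICQ at } \theta^\star: \quad
\bigl\{ \nabla g_i(\theta^\star) : i \in \mathcal{A}(\theta^\star) \bigr\}
\cup
\bigl\{ \nabla h_j(\theta^\star) : j = 1, \dots, p \bigr\}
\ \text{is linearly independent}

\text{Boundary case:} \quad
\min_{\theta \in \mathbb{R}} \theta \ \ \text{s.t.} \ \ \theta^2 \le 0
\quad \Longrightarrow \quad
\theta^\star = 0, \qquad
1 + \lambda \cdot 2\theta^\star = 1 \ne 0 \ \text{for every } \lambda
```

In the boundary case the only feasible point is $\theta^\star = 0$, the active constraint gradient there is zero, LICQ fails, and no multiplier satisfies stationarity even though $\theta^\star$ is the minimizer.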
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving the constraint qualification, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes constraint qualifications visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about constraint qualifications is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
2.5 Notation, dimensions, and assumptions
In this section, the Slater condition is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the heading "Notation, dimensions, and assumptions" stands for a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
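For this section, the Slater condition is the requirement, for a convex problem, that some point satisfies every inequality constraint strictly while also satisfying the affine equality constraints. When it holds, strong duality holds.
Symbolically, we track it through the parameters $\theta$, the objective $f$, the constraint functions $g_i$ and $h_j$, the multipliers $\lambda_i$ and $\nu_j$, and any auxiliary state used by the algorithm.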
Examples:
- A small synthetic quadratic where the Slater condition can be checked directly and compared with theory.
- A logistic-regression or softmax objective where the Slater condition affects optimization but the model remains interpretable.
- A transformer training diagnostic where the Slater condition appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating the Slater condition as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
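A standard candidate for this slot, in notation assumed here (convex objective $f$, convex inequality constraints $g_i$, affine equality constraints $A\theta = b$, primal optimal value $p^\star$, dual optimal value $d^\star$), is the Slater condition and its consequence:

```latex
\exists \, \tilde{\theta} : \quad
g_i(\tilde{\theta}) < 0 \ (i = 1, \dots, m), \quad A\tilde{\theta} = b
\quad \Longrightarrow \quad
p^\star = d^\star
```

Dimensions under these assumptions: $\theta \in \mathbb{R}^n$, $A \in \mathbb{R}^{p \times n}$, $b \in \mathbb{R}^p$, one multiplier $\lambda_i \ge 0$ per inequality and one $\nu_j$ per row of $A$.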
Proof sketch or reasoning pattern:
Start with the local model around the current iterate $\theta_t$, isolate the term involving the Slater condition, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes the Slater condition visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about the Slater condition is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.