Lesson overview | Previous part | Lesson overview
Constrained Optimization: Part 7: Applications in Machine Learning to References
7. Applications in Machine Learning
This block develops applications in machine learning for Constrained Optimization. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
7.1 support-vector machines through KKT and dual variables
In this section, feasible set is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "support-vector machines through KKT and dual variables" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, feasible set is the part of Constrained Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through , , , , and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where feasible set can be computed directly and compared with theory.
- A logistic-regression or softmax objective where feasible set affects optimization but the model remains interpretable.
- A transformer training diagnostic where feasible set appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating feasible set as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around , isolate the term involving feasible set, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes feasible set visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about feasible set is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
7.2 fairness, safety, and resource-constrained model training
In this section, active constraint is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "fairness, safety, and resource-constrained model training" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, active constraint is the part of Constrained Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through , , , , and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where active constraint can be computed directly and compared with theory.
- A logistic-regression or softmax objective where active constraint affects optimization but the model remains interpretable.
- A transformer training diagnostic where active constraint appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating active constraint as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around , isolate the term involving active constraint, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes active constraint visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about active constraint is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
7.3 projection layers for nonnegative or norm-constrained parameters
In this section, equality constraints is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "projection layers for nonnegative or norm-constrained parameters" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, equality constraints is the part of Constrained Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through , , , , and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where equality constraints can be computed directly and compared with theory.
- A logistic-regression or softmax objective where equality constraints affects optimization but the model remains interpretable.
- A transformer training diagnostic where equality constraints appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating equality constraints as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around , isolate the term involving equality constraints, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes equality constraints visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about equality constraints is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
7.4 ADMM-style splitting for distributed and federated objectives
In this section, inequality constraints is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "ADMM-style splitting for distributed and federated objectives" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, inequality constraints is the part of Constrained Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through , , , , and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where inequality constraints can be computed directly and compared with theory.
- A logistic-regression or softmax objective where inequality constraints affects optimization but the model remains interpretable.
- A transformer training diagnostic where inequality constraints appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating inequality constraints as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around , isolate the term involving inequality constraints, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes inequality constraints visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about inequality constraints is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
7.5 Diagnostic checklist for real experiments
In this section, Lagrangian is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "Diagnostic checklist for real experiments" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, Lagrangian is the part of Constrained Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through , , , , and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where Lagrangian can be computed directly and compared with theory.
- A logistic-regression or softmax objective where Lagrangian affects optimization but the model remains interpretable.
- A transformer training diagnostic where Lagrangian appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating Lagrangian as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around , isolate the term involving Lagrangian, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes Lagrangian visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about Lagrangian is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
8. Implementation and Diagnostics
This block develops implementation and diagnostics for Constrained Optimization. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.
8.1 Minimal NumPy experiment for penalty methods
In this section, inequality constraints is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "Minimal NumPy experiment for penalty methods" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, inequality constraints is the part of Constrained Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through , , , , and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where inequality constraints can be computed directly and compared with theory.
- A logistic-regression or softmax objective where inequality constraints affects optimization but the model remains interpretable.
- A transformer training diagnostic where inequality constraints appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating inequality constraints as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around , isolate the term involving inequality constraints, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes inequality constraints visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about inequality constraints is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
8.2 Monitoring signal for barrier methods
In this section, Lagrangian is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "Monitoring signal for barrier methods" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, Lagrangian is the part of Constrained Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through , , , , and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where Lagrangian can be computed directly and compared with theory.
- A logistic-regression or softmax objective where Lagrangian affects optimization but the model remains interpretable.
- A transformer training diagnostic where Lagrangian appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating Lagrangian as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around , isolate the term involving Lagrangian, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes Lagrangian visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about Lagrangian is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
8.3 Failure signature for augmented Lagrangian
In this section, stationarity is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "Failure signature for augmented Lagrangian" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, stationarity is the part of Constrained Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through , , , , and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where stationarity can be computed directly and compared with theory.
- A logistic-regression or softmax objective where stationarity affects optimization but the model remains interpretable.
- A transformer training diagnostic where stationarity appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating stationarity as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around , isolate the term involving stationarity, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes stationarity visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about stationarity is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
8.4 Framework-level implementation pattern
In this section, primal feasibility is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "Framework-level implementation pattern" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, primal feasibility is the part of Constrained Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through , , , , and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where primal feasibility can be computed directly and compared with theory.
- A logistic-regression or softmax objective where primal feasibility affects optimization but the model remains interpretable.
- A transformer training diagnostic where primal feasibility appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating primal feasibility as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around , isolate the term involving primal feasibility, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes primal feasibility visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about primal feasibility is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
8.5 Reproducibility and logging checklist
In this section, dual feasibility is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Constrained Optimization, the phrase "Reproducibility and logging checklist" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.
Definition.
For this section, dual feasibility is the part of Constrained Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.
Symbolically, we track it through , , , , and any auxiliary state used by the algorithm.
Examples:
- A small synthetic quadratic where dual feasibility can be computed directly and compared with theory.
- A logistic-regression or softmax objective where dual feasibility affects optimization but the model remains interpretable.
- A transformer training diagnostic where dual feasibility appears through gradient norms, update norms, curvature, or validation loss.
Non-examples:
- Treating dual feasibility as a hyperparameter recipe without checking the objective assumptions.
- Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.
Useful formula:
Proof sketch or reasoning pattern:
Start with the local model around , isolate the term involving dual feasibility, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
Implementation consequence:
- Log a metric that makes dual feasibility visible; otherwise a training run can fail while the scalar loss hides the cause.
- Compare the measured update with the mathematical update below before blaming data or architecture.
- Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
Diagnostic questions:
- Which assumption about dual feasibility is most fragile in the current training setup?
- What number would you log to catch the failure one thousand steps before divergence?
AI connection:
- support-vector machines through KKT and dual variables.
- fairness, safety, and resource-constrained model training.
- projection layers for nonnegative or norm-constrained parameters.
- ADMM-style splitting for distributed and federated objectives.
Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.
9. Common Mistakes
| # | Mistake | Why It Is Wrong | Fix |
|---|---|---|---|
| 1 | Using a recipe without checking assumptions | Optimization guarantees depend on smoothness, convexity, stochasticity, or feasibility assumptions. | Write the assumptions next to the update rule before choosing hyperparameters. |
| 2 | Confusing objective decrease with validation improvement | The optimizer sees the training objective; validation behavior also depends on generalization and data split quality. | Track objective, train metric, validation metric, and update norm separately. |
| 3 | Treating all norms as interchangeable | The geometry changes when the norm changes, especially for constraints and regularizers. | State whether you use , , Frobenius, spectral, or another norm. |
| 4 | Ignoring scale | Learning rates, penalties, curvature, and gradient norms are all scale-sensitive. | Normalize units and inspect effective update size . |
| 5 | Overfitting to a single seed | Optimization can look stable for one seed and fail under another. | Run small seed sweeps for important claims. |
| 6 | Hiding instability behind smoothed plots | A moving average can hide spikes, divergence, and bad curvature events. | Plot raw metrics alongside smoothed metrics. |
| 7 | Using test data during tuning | This contaminates the final evaluation. | Reserve test data until after model and hyperparameter selection. |
| 8 | Assuming large models make theory irrelevant | Large models often make diagnostics more important because failures are expensive. | Use theory to decide what to log, not to pretend every theorem applies exactly. |
| 9 | Mixing optimizer state with model state carelessly | State corruption changes the effective algorithm. | Checkpoint parameters, gradients if needed, optimizer moments, scheduler state, and random seeds. |
| 10 | Not checking numerical precision | BF16, FP16, FP8, and accumulation choices can change the observed optimizer. | Cross-check suspicious runs against higher precision on a small batch. |
10. Exercises
- Exercise 1 [*] - Equality Constraints (a) Define equality constraints using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 2 [*] - Lagrangian (a) Define Lagrangian using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 3 [*] - Primal Feasibility (a) Define primal feasibility using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 4 [] - Complementary Slackness** (a) Define complementary slackness using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 5 [] - Constraint Qualifications** (a) Define constraint qualifications using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 6 [] - Dual Problem** (a) Define dual problem using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 7 [] - Euclidean Projection** (a) Define Euclidean projection using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 8 [*] - Barrier Methods** (a) Define barrier methods using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 9 [*] - Admm** (a) Define ADMM using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
- Exercise 10 [*] - Fairness Constraints** (a) Define fairness constraints using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
11. Why This Matters for AI (2026 Perspective)
| Concept | AI Impact |
|---|---|
| feasible set | support-vector machines through KKT and dual variables |
| active constraint | fairness, safety, and resource-constrained model training |
| equality constraints | projection layers for nonnegative or norm-constrained parameters |
| inequality constraints | ADMM-style splitting for distributed and federated objectives |
| Lagrangian | support-vector machines through KKT and dual variables |
| stationarity | fairness, safety, and resource-constrained model training |
| primal feasibility | projection layers for nonnegative or norm-constrained parameters |
| dual feasibility | ADMM-style splitting for distributed and federated objectives |
| complementary slackness | support-vector machines through KKT and dual variables |
| KKT conditions | fairness, safety, and resource-constrained model training |
12. Conceptual Bridge
Constrained Optimization sits inside a chain. Earlier sections give the calculus, probability, and linear algebra needed to write the objective and interpret the update. Later sections use this material to reason about noisy gradients, adaptive state, regularization, tuning, schedules, and finally information-theoretic losses.
Backward link: Second-Order Methods supplies the immediate prerequisite vocabulary.
Forward link: Stochastic Optimization uses this section as a building block.
+------------------------------------------------------------+
| Chapter 8: Optimization |
| 01-Convex-Optimization Convex Optimization |
| 02-Gradient-Descent Gradient Descent |
| 03-Second-Order-Methods Second-Order Methods |
| >> 04-Constrained-Optimization Constrained Optimization |
| 05-Stochastic-Optimization Stochastic Optimization |
| 06-Optimization-Landscape Optimization Landscape |
| 07-Adaptive-Learning-Rate Adaptive Learning Rate |
| 08-Regularization-Methods Regularization Methods |
| 09-Hyperparameter-Optimization Hyperparameter Optimization |
| 10-Learning-Rate-Schedules Learning Rate Schedules |
+------------------------------------------------------------+
Appendix A. Extended Derivation and Diagnostic Cards
References
- Boyd and Vandenberghe, Convex Optimization.
- Bertsekas, Constrained Optimization and Lagrange Multiplier Methods.
- Boyd et al., Distributed Optimization and Statistical Learning via ADMM.
- Cortes and Vapnik, Support-vector Networks.
- Goodfellow, Bengio, and Courville, Deep Learning.
- Bottou, Curtis, and Nocedal, Optimization Methods for Large-Scale Machine Learning.
- PyTorch optimizer and scheduler documentation.
- Optax documentation for composable optimizer transformations.