Hyperparameter Optimization: Applications in Machine Learning to References

7. Applications in Machine Learning

This block develops applications in machine learning for Hyperparameter Optimization. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.

7.1 Learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning

In this section, configuration space is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Hyperparameter Optimization, the phrase "learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, configuration space is the part of Hyperparameter Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where configuration space can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where configuration space affects optimization but the model remains interpretable.
  • A transformer training diagnostic where configuration space appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating configuration space as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\boldsymbol{\lambda}^* \in \arg\min_{\boldsymbol{\lambda}\in\Lambda} \mathcal{V}(\mathcal{A}(\boldsymbol{\lambda}), \mathcal{D}_{\mathrm{val}})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving configuration space, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
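As a minimal sketch of the selection objective $\mathcal{V}(\mathcal{A}(\boldsymbol{\lambda}), \mathcal{D}_{\mathrm{val}})$, the toy experiment below tunes a single weight-decay value for ridge regression, where $\mathcal{A}(\boldsymbol{\lambda})$ has a closed form. The task, split sizes, and candidate grid are illustrative assumptions, not a prescription.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task: y = X w_true + noise.
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=200)

# Simple train/validation split.
X_tr, X_val = X[:150], X[150:]
y_tr, y_val = y[:150], y[150:]

def train(lam):
    """A(lambda): ridge regression with weight decay lam (closed form)."""
    d = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

def val_loss(w):
    """V(., D_val): mean squared error on the held-out split."""
    return float(np.mean((X_val @ w - y_val) ** 2))

# Outer loop: evaluate V(A(lambda), D_val) on a log-spaced candidate set.
lambdas = np.logspace(-4, 2, 13)
losses = [val_loss(train(lam)) for lam in lambdas]
best_lam = lambdas[int(np.argmin(losses))]
```

The same two-function decomposition (inner trainer, outer validator) carries over unchanged when the trainer is an LLM fine-tuning run rather than a closed-form solve.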

Implementation consequence:

  • Log a metric that makes configuration space visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about configuration space is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • Learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning.
  • Hyperband and ASHA for neural architecture and training-budget search.
  • Bayesian optimization for expensive, low-dimensional continuous tuning.
  • Validation leakage prevention in model-selection pipelines.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

7.2 Hyperband and ASHA for neural architecture and training-budget search

In this section, conditional parameter is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Hyperparameter Optimization, the phrase "Hyperband and ASHA for neural architecture and training-budget search" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, conditional parameter is the part of Hyperparameter Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where conditional parameter can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where conditional parameter affects optimization but the model remains interpretable.
  • A transformer training diagnostic where conditional parameter appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating conditional parameter as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\boldsymbol{\lambda}^* \in \arg\min_{\boldsymbol{\lambda}\in\Lambda} \mathcal{V}(\mathcal{A}(\boldsymbol{\lambda}), \mathcal{D}_{\mathrm{val}})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving conditional parameter, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
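The budget logic behind Hyperband and ASHA can be sketched with plain successive halving: train many configurations briefly, promote the better half, and double the budget. The noise model below, with evaluation noise shrinking like $1/\sqrt{\text{budget}}$, is an illustrative assumption standing in for partial training runs.

```python
import numpy as np

rng = np.random.default_rng(1)

def eval_config(cfg, budget):
    """Stand-in for a partial training run: lower is better.
    True quality is cfg; observation noise shrinks as budget grows."""
    return cfg + rng.normal(scale=1.0 / np.sqrt(budget))

# Successive halving: start many configs on a small budget,
# keep the best half at each rung, and double the budget.
configs = list(rng.uniform(0.0, 1.0, size=16))
budget = 1
while len(configs) > 1:
    scores = [eval_config(c, budget) for c in configs]
    order = np.argsort(scores)                      # ascending: lower loss first
    configs = [configs[i] for i in order[: len(configs) // 2]]
    budget *= 2

winner = configs[0]
```

Hyperband runs several such brackets with different starting budgets, and ASHA promotes configurations asynchronously instead of waiting for whole rungs; the halving step above is the shared core.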

Implementation consequence:

  • Log a metric that makes conditional parameter visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about conditional parameter is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • Learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning.
  • Hyperband and ASHA for neural architecture and training-budget search.
  • Bayesian optimization for expensive, low-dimensional continuous tuning.
  • Validation leakage prevention in model-selection pipelines.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

7.3 Bayesian optimization for expensive, low-dimensional continuous tuning

In this section, log-uniform sampling is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Hyperparameter Optimization, the phrase "Bayesian optimization for expensive, low-dimensional continuous tuning" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, log-uniform sampling is the part of Hyperparameter Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where log-uniform sampling can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where log-uniform sampling affects optimization but the model remains interpretable.
  • A transformer training diagnostic where log-uniform sampling appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating log-uniform sampling as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\boldsymbol{\lambda}^* \in \arg\min_{\boldsymbol{\lambda}\in\Lambda} \mathcal{V}(\mathcal{A}(\boldsymbol{\lambda}), \mathcal{D}_{\mathrm{val}})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving log-uniform sampling, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
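Before any surrogate modeling, the search space itself matters: scale-type hyperparameters such as learning rates should be sampled log-uniformly, so that each decade of the range receives roughly equal probability mass. A minimal sketch, with the bounds chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_uniform(low, high, size, rng):
    """Sample log-uniformly on [low, high]: uniform in log space, then exp."""
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))

# Learning-rate candidates between 1e-5 and 1e-1 (four decades).
samples = log_uniform(1e-5, 1e-1, size=10_000, rng=rng)

# Under log-uniform sampling the top decade [1e-2, 1e-1] gets ~1/4 of
# the mass; uniform sampling would put almost all samples there.
frac_top_decade = float(np.mean(samples >= 1e-2))
```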

Implementation consequence:

  • Log a metric that makes log-uniform sampling visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about log-uniform sampling is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • Learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning.
  • Hyperband and ASHA for neural architecture and training-budget search.
  • Bayesian optimization for expensive, low-dimensional continuous tuning.
  • Validation leakage prevention in model-selection pipelines.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

7.4 Validation leakage prevention in model-selection pipelines

In this section, grid search is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Hyperparameter Optimization, the phrase "validation leakage prevention in model-selection pipelines" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, grid search is the part of Hyperparameter Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where grid search can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where grid search affects optimization but the model remains interpretable.
  • A transformer training diagnostic where grid search appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating grid search as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\boldsymbol{\lambda}^* \in \arg\min_{\boldsymbol{\lambda}\in\Lambda} \mathcal{V}(\mathcal{A}(\boldsymbol{\lambda}), \mathcal{D}_{\mathrm{val}})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving grid search, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
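The most common leakage bug is computing preprocessing statistics on data that includes the validation split, which silently optimistic-biases every $\mathcal{V}(\mathcal{A}(\boldsymbol{\lambda}), \mathcal{D}_{\mathrm{val}})$ comparison. A minimal sketch of the correct pattern, with a synthetic dataset standing in for real features:

```python
import numpy as np

rng = np.random.default_rng(3)

X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_tr, X_val = X[:80], X[80:]

# Leaky: statistics computed on ALL rows, including the validation split.
mu_leak, sd_leak = X.mean(axis=0), X.std(axis=0)

# Correct: statistics computed on the training split only,
# then re-used unchanged when transforming the validation split.
mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
X_val_scaled = (X_val - mu) / sd

# The two sets of statistics differ; any model selected with the leaky
# version was scored on a validation set it partially "saw".
leak_gap = float(np.abs(mu_leak - mu).max())
```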

Implementation consequence:

  • Log a metric that makes grid search visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about grid search is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • Learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning.
  • Hyperband and ASHA for neural architecture and training-budget search.
  • Bayesian optimization for expensive, low-dimensional continuous tuning.
  • Validation leakage prevention in model-selection pipelines.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

7.5 Diagnostic checklist for real experiments

In this section, random search is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Hyperparameter Optimization, the phrase "Diagnostic checklist for real experiments" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, random search is the part of Hyperparameter Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where random search can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where random search affects optimization but the model remains interpretable.
  • A transformer training diagnostic where random search appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating random search as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\boldsymbol{\lambda}^* \in \arg\min_{\boldsymbol{\lambda}\in\Lambda} \mathcal{V}(\mathcal{A}(\boldsymbol{\lambda}), \mathcal{D}_{\mathrm{val}})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving random search, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
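A diagnostic-friendly random-search loop logs every trial, not just the winner, so the full search trace can be inspected afterward. The synthetic objective and search ranges below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def objective(lr, wd):
    """Stand-in validation loss with a known optimum near lr=1e-2, wd=1e-4."""
    return (np.log10(lr) + 2) ** 2 + (np.log10(wd) + 4) ** 2

# Random search: sample both axes log-uniformly and record every trial.
trials = []
for t in range(50):
    lr = 10 ** rng.uniform(-5, -1)
    wd = 10 ** rng.uniform(-6, -2)
    trials.append({"trial": t, "lr": lr, "wd": wd, "loss": objective(lr, wd)})

best = min(trials, key=lambda r: r["loss"])
```

With the trace in hand, the checklist questions above become answerable: plot loss against each axis to see which hyperparameter actually mattered, and check whether the best trials cluster against a boundary of the search range (a sign the range was set too narrow).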

Implementation consequence:

  • Log a metric that makes random search visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about random search is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • Learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning.
  • Hyperband and ASHA for neural architecture and training-budget search.
  • Bayesian optimization for expensive, low-dimensional continuous tuning.
  • Validation leakage prevention in model-selection pipelines.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

8. Implementation and Diagnostics

This block develops implementation and diagnostics for Hyperparameter Optimization. It keeps the scope local to this section while pointing forward when a neighboring topic owns the full treatment.

8.1 Minimal NumPy experiment for BOHB

In this section, grid search is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Hyperparameter Optimization, the phrase "Minimal NumPy experiment for BOHB" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, grid search is the part of Hyperparameter Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where grid search can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where grid search affects optimization but the model remains interpretable.
  • A transformer training diagnostic where grid search appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating grid search as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\boldsymbol{\lambda}^* \in \arg\min_{\boldsymbol{\lambda}\in\Lambda} \mathcal{V}(\mathcal{A}(\boldsymbol{\lambda}), \mathcal{D}_{\mathrm{val}})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving grid search, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
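On a synthetic quadratic, grid search can be checked against theory exactly, which is the point of a minimal NumPy experiment: the measured minimizer should match the analytic one to grid resolution. The objective and grid below are illustrative assumptions:

```python
import numpy as np

def f(lam):
    """Synthetic quadratic with known minimizer lam=0.3 and minimum 0.5."""
    return (lam - 0.3) ** 2 + 0.5

grid = np.linspace(0.0, 1.0, 101)        # step 0.01 covers the optimum
values = f(grid)
lam_star = float(grid[np.argmin(values)])
```

Only after this sanity check passes is it worth moving to noisy, budget-aware variants such as BOHB, where the same grid becomes the initial design for a model-based search.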

Implementation consequence:

  • Log a metric that makes grid search visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about grid search is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • Learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning.
  • Hyperband and ASHA for neural architecture and training-budget search.
  • Bayesian optimization for expensive, low-dimensional continuous tuning.
  • Validation leakage prevention in model-selection pipelines.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

8.2 Monitoring signal for population-based training

In this section, random search is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Hyperparameter Optimization, the phrase "Monitoring signal for population-based training" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, random search is the part of Hyperparameter Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where random search can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where random search affects optimization but the model remains interpretable.
  • A transformer training diagnostic where random search appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating random search as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\boldsymbol{\lambda}^* \in \arg\min_{\boldsymbol{\lambda}\in\Lambda} \mathcal{V}(\mathcal{A}(\boldsymbol{\lambda}), \mathcal{D}_{\mathrm{val}})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving random search, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
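A usable monitoring signal for population-based training is the spread of scores across the population: if it collapses to zero early, the exploit step has destroyed diversity; if it grows without bound, exploitation is not happening. The toy dynamics below (score growth as a function of learning rate) are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy population-based training: each worker carries a learning rate, and
# "training" improves a score at a rate depending on how good the lr is.
pop_lr = 10 ** rng.uniform(-4, -1, size=8)
score = np.zeros(8)
spread_log = []   # monitoring signal: score spread across the population

for step in range(20):
    score += -np.abs(np.log10(pop_lr) + 2.5) + 2.5 + 0.05 * rng.normal(size=8)
    spread_log.append(float(score.max() - score.min()))
    if step % 5 == 4:                          # periodic exploit/explore
        worst, best = np.argmin(score), np.argmax(score)
        pop_lr[worst] = pop_lr[best] * 10 ** rng.uniform(-0.2, 0.2)
        score[worst] = score[best]
```

Logging `spread_log` alongside the best score makes the exploit/explore trade-off visible in a single curve.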

Implementation consequence:

  • Log a metric that makes random search visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about random search is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • Learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning.
  • Hyperband and ASHA for neural architecture and training-budget search.
  • Bayesian optimization for expensive, low-dimensional continuous tuning.
  • Validation leakage prevention in model-selection pipelines.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

8.3 Failure signature for multi-objective tuning

In this section, Sobol initialization is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Hyperparameter Optimization, the phrase "Failure signature for multi-objective tuning" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, Sobol initialization is the part of Hyperparameter Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where Sobol initialization can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where Sobol initialization affects optimization but the model remains interpretable.
  • A transformer training diagnostic where Sobol initialization appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating Sobol initialization as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\boldsymbol{\lambda}^* \in \arg\min_{\boldsymbol{\lambda}\in\Lambda} \mathcal{V}(\mathcal{A}(\boldsymbol{\lambda}), \mathcal{D}_{\mathrm{val}})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving Sobol initialization, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
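One concrete failure signature in multi-objective tuning is reporting a dominated configuration as the winner. The sketch below extracts the Pareto set for two competing objectives and checks that a positively weighted scalarization cannot select a dominated point; the toy trade-off data is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(6)

# Candidate configs scored on two competing objectives (both minimized).
loss = rng.uniform(0.0, 1.0, size=40)
latency = 1.0 - loss + 0.1 * rng.normal(size=40)   # roughly a trade-off

def is_dominated(i):
    """Config i is dominated if some other j is at least as good on both
    objectives and strictly better on at least one."""
    better = (loss <= loss[i]) & (latency <= latency[i])
    strictly = (loss < loss[i]) | (latency < latency[i])
    return bool(np.any(better & strictly & (np.arange(len(loss)) != i)))

pareto = [i for i in range(40) if not is_dominated(i)]

# A positively weighted scalarization always lands on the Pareto set;
# a tuner that reports a dominated config is leaving a free improvement.
best_scalarized = int(np.argmin(loss + latency))
```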

Implementation consequence:

  • Log a metric that makes Sobol initialization visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.

Diagnostic questions:

  • Which assumption about Sobol initialization is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • Learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning.
  • Hyperband and ASHA for neural architecture and training-budget search.
  • Bayesian optimization for expensive, low-dimensional continuous tuning.
  • Validation leakage prevention in model-selection pipelines.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

8.4 Framework-level implementation pattern

In this section, surrogate model is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Hyperparameter Optimization, the phrase "Framework-level implementation pattern" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, surrogate model is the part of Hyperparameter Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where surrogate model can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where surrogate model affects optimization but the model remains interpretable.
  • A transformer training diagnostic where surrogate model appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating surrogate model as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\boldsymbol{\lambda}^* \in \arg\min_{\boldsymbol{\lambda}\in\Lambda} \mathcal{V}(\mathcal{A}(\boldsymbol{\lambda}), \mathcal{D}_{\mathrm{val}})$$

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving surrogate model, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.
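The last check in this pattern, that a proposed update is actually a descent step, can be verified numerically. The sketch below assumes a toy quadratic with curvature matrix `A` and a step size below the classical $2/L$ threshold:

```python
import numpy as np

rng = np.random.default_rng(3)
A = np.diag([1.0, 4.0])            # curvature of the toy quadratic, L = 4
theta = rng.normal(size=2)

def f(th):
    return 0.5 * th @ A @ th

def grad(th):
    return A @ th

eta = 0.2                          # below 2 / L = 0.5, so descent is guaranteed
new = theta - eta * grad(theta)
print(f(new) < f(theta))           # the descent check the proof pattern asks for
```

If `eta` were pushed above `2 / L`, the same check would start failing, which is exactly the kind of diagnostic this section recommends logging.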

Implementation consequence:

  • Log a metric that makes surrogate model visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
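One way to make the last bullet concrete is to log the relative update size each step. The SGD loop and the quadratic gradient below are illustrative assumptions, not a prescribed training setup:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=4)          # parameters theta
eta = 0.1                           # learning rate eta

def grad(theta):
    """Gradient of the toy quadratic f(theta) = 0.5 * ||theta||^2."""
    return theta

for step in range(5):
    update = -eta * grad(theta)
    # Relative update size ||Delta theta|| / ||theta||: a unit-free diagnostic
    # that separates "update norm" from "parameter norm" as distinct objects.
    rel = np.linalg.norm(update) / np.linalg.norm(theta)
    theta = theta + update
    print(f"step {step}: relative update size = {rel:.3f}")
```

For this toy objective the ratio is exactly `eta`; in a real run, drift in this ratio is often visible long before the scalar loss misbehaves.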

Diagnostic questions:

  • Which assumption about surrogate model is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning.
  • Hyperband and ASHA for neural architecture and training-budget search.
  • Bayesian optimization for expensive, low-dimensional continuous tuning.
  • validation leakage prevention in model-selection pipelines.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

8.5 Reproducibility and logging checklist

In this section, Gaussian process is treated as a concrete optimization object rather than a slogan. The goal is to understand how it changes the objective, the update rule, the convergence story, and the diagnostics a practitioner should inspect when training a modern model. For Hyperparameter Optimization, the phrase "Reproducibility and logging checklist" means a precise mathematical habit: state the assumptions, write the update, identify what can be measured, and connect the result to a real AI training decision.

Definition.

For this section, Gaussian process is the part of Hyperparameter Optimization that controls how the objective, feasible region, or update rule behaves under the assumptions currently in force.

Symbolically, we track it through $f$, $\boldsymbol{\theta}$, $\eta$, $\nabla f(\boldsymbol{\theta})$, and any auxiliary state used by the algorithm.

Examples:

  • A small synthetic quadratic where Gaussian process can be computed directly and compared with theory.
  • A logistic-regression or softmax objective where Gaussian process affects optimization but the model remains interpretable.
  • A transformer training diagnostic where Gaussian process appears through gradient norms, update norms, curvature, or validation loss.

Non-examples:

  • Treating Gaussian process as a hyperparameter recipe without checking the objective assumptions.
  • Inferring global behavior from one noisy minibatch when the section requires a population or full-batch statement.

Useful formula:

$$\boldsymbol{\lambda}^* \in \arg\min_{\boldsymbol{\lambda}\in\Lambda} \mathcal{V}(\mathcal{A}(\boldsymbol{\lambda}), \mathcal{D}_{\mathrm{val}})$$
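As a minimal sketch of the Gaussian-process surrogate that typically sits behind this selection rule, the snippet below computes a GP posterior with a squared-exponential kernel on a 1-D toy objective. The kernel length scale, the noise level, and the `sin` objective are all assumptions for illustration:

```python
import numpy as np

def rbf(a, b, length=0.5):
    """Squared-exponential kernel k(a, b) for 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

# Observed (lambda, validation loss) pairs on a 1-D search space.
X = np.array([0.1, 0.4, 0.8])
y = np.sin(3 * X)                 # toy objective values, an assumption
Xs = np.linspace(0, 1, 101)       # query grid over the search space

noise = 1e-6
K = rbf(X, X) + noise * np.eye(len(X))
Ks = rbf(X, Xs)

# GP posterior mean and variance at the query points via Cholesky solves.
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mu = Ks.T @ alpha                              # posterior mean
v = np.linalg.solve(L, Ks)
var = rbf(Xs, Xs).diagonal() - np.sum(v**2, axis=0)   # posterior variance
```

The posterior mean interpolates the observed points and the posterior variance collapses there, which is what an acquisition function later exploits.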

Proof sketch or reasoning pattern:

Start with the local model around $\boldsymbol{\theta}_t$, isolate the term involving Gaussian process, and use the section assumptions to bound the change in objective value. If the assumption is geometric, the proof turns a picture into an inequality. If the assumption is stochastic, the proof takes conditional expectation before applying the bound. If the assumption is algorithmic, the proof checks that the proposed update is a descent, projection, or preconditioning step. This pattern is reusable across optimization theory.

Implementation consequence:

  • Log a metric that makes Gaussian process visible; otherwise a training run can fail while the scalar loss hides the cause.
  • Compare the measured update with the mathematical update below before blaming data or architecture.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$
  • Keep units straight: parameter norm, gradient norm, update norm, objective value, and validation metric are different objects.
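The acquisition-maximization update in the second bullet can be sketched by scoring a candidate set with expected improvement under assumed posterior means and standard deviations. All of the numbers below are illustrative, standing in for a fitted surrogate:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def pdf(z):
    """Standard normal density."""
    return exp(-0.5 * z * z) / sqrt(2 * pi)

def cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2)))

def expected_improvement(mu, sigma, best):
    """EI for minimization: E[max(best - f, 0)] when f ~ N(mu, sigma^2)."""
    z = (best - mu) / sigma
    return (best - mu) * cdf(z) + sigma * pdf(z)

# Assumed posterior summaries at five candidate configurations.
mu = np.array([0.50, 0.42, 0.47, 0.55, 0.40])
sigma = np.array([0.01, 0.10, 0.05, 0.20, 0.02])
best_so_far = 0.45

ei = np.array([expected_improvement(m, s, best_so_far)
               for m, s in zip(mu, sigma)])
next_idx = int(np.argmax(ei))      # lambda_{t+1} = argmax_lambda a_t(lambda)
```

Note that the argmax need not pick the lowest posterior mean: a candidate with a slightly worse mean but much larger uncertainty can win, which is the exploration side of the acquisition trade-off.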

Diagnostic questions:

  • Which assumption about Gaussian process is most fragile in the current training setup?
  • What number would you log to catch the failure one thousand steps before divergence?

AI connection:

  • learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning.
  • Hyperband and ASHA for neural architecture and training-budget search.
  • Bayesian optimization for expensive, low-dimensional continuous tuning.
  • validation leakage prevention in model-selection pipelines.

Local scope boundary: This subsection may reference neighboring material, but the full canonical treatment stays in its own folder. For example, stochastic gradient noise belongs to Stochastic Optimization, external schedule shapes belong to Learning Rate Schedules, and cross-entropy as an information measure belongs to Cross-Entropy.

9. Common Mistakes

| # | Mistake | Why It Is Wrong | Fix |
|---|---------|-----------------|-----|
| 1 | Using a recipe without checking assumptions | Optimization guarantees depend on smoothness, convexity, stochasticity, or feasibility assumptions. | Write the assumptions next to the update rule before choosing hyperparameters. |
| 2 | Confusing objective decrease with validation improvement | The optimizer sees the training objective; validation behavior also depends on generalization and data split quality. | Track objective, train metric, validation metric, and update norm separately. |
| 3 | Treating all norms as interchangeable | The geometry changes when the norm changes, especially for constraints and regularizers. | State whether you use $\ell_1$, $\ell_2$, Frobenius, spectral, or another norm. |
| 4 | Ignoring scale | Learning rates, penalties, curvature, and gradient norms are all scale-sensitive. | Normalize units and inspect the effective update size $\lVert \Delta\boldsymbol{\theta}\rVert_2 / \lVert\boldsymbol{\theta}\rVert_2$. |
| 5 | Overfitting to a single seed | Optimization can look stable for one seed and fail under another. | Run small seed sweeps for important claims. |
| 6 | Hiding instability behind smoothed plots | A moving average can hide spikes, divergence, and bad curvature events. | Plot raw metrics alongside smoothed metrics. |
| 7 | Using test data during tuning | This contaminates the final evaluation. | Reserve test data until after model and hyperparameter selection. |
| 8 | Assuming large models make theory irrelevant | Large models often make diagnostics more important because failures are expensive. | Use theory to decide what to log, not to pretend every theorem applies exactly. |
| 9 | Mixing optimizer state with model state carelessly | State corruption changes the effective algorithm. | Checkpoint parameters, gradients if needed, optimizer moments, scheduler state, and random seeds. |
| 10 | Not checking numerical precision | BF16, FP16, FP8, and accumulation choices can change the observed optimizer. | Cross-check suspicious runs against higher precision on a small batch. |

10. Exercises

  1. Exercise 1 [*] - Log-Uniform Sampling (a) Define log-uniform sampling using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
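For part (d) of Exercise 1, one possible starting point is the sketch below, which draws learning rates so that their logarithm is uniform; the range endpoints and sample count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def log_uniform(low, high, size, rng):
    """Sample x so that log(x) is uniform on [log(low), log(high))."""
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))

lrs = log_uniform(1e-5, 1e-1, size=10_000, rng=rng)

# Under log-uniform sampling, each decade should receive roughly the
# same share of samples, unlike plain uniform sampling on [1e-5, 1e-1].
decades = np.floor(np.log10(lrs)).astype(int)
counts = {d: int((decades == d).sum()) for d in sorted(set(decades))}
print(counts)
```

The near-equal per-decade counts are the check: uniform sampling on the raw scale would put almost all mass in the top decade.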

  2. Exercise 2 [*] - Random Search (a) Define random search using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  3. Exercise 3 [*] - Surrogate Model (a) Define surrogate model using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  4. Exercise 4 [**] - Expected Improvement (a) Define expected improvement using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  5. Exercise 5 [**] - Thompson Sampling (a) Define Thompson sampling using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  6. Exercise 6 [**] - Successive Halving (a) Define successive halving using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.
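For Exercise 6(d), successive halving can be sketched as below: start many configurations on a small budget and keep the best half at each rung. The synthetic per-budget losses, the noise model, and the halving factor are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n, eta = 16, 2                      # 16 configs, halving factor 2

# True quality of each configuration (lower is better), hidden from the tuner.
quality = rng.uniform(0.0, 1.0, size=n)
configs = np.arange(n)
budget = 1

while len(configs) > 1:
    # Noisy evaluation: more budget means less noise around the true quality.
    noise = rng.normal(0, 0.3 / np.sqrt(budget), size=len(configs))
    scores = quality[configs] + noise
    keep = len(configs) // eta
    configs = configs[np.argsort(scores)[:keep]]   # promote the best half
    budget *= eta

print("survivor:", configs[0], "true best:", int(np.argmin(quality)))
```

Because early rungs are noisy, the survivor is not always the true best configuration; comparing the two printed indices across seeds is itself an instructive part of the exercise.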

  7. Exercise 7 [**] - ASHA (a) Define ASHA using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  8. Exercise 8 [*] - Population-Based Training (a) Define population-based training using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  9. Exercise 9 [*] - Pareto Frontier (a) Define Pareto frontier using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

  10. Exercise 10 [*] - Nested Validation (a) Define nested validation using the notation of this repository. (b) Give three valid examples and two non-examples. (c) Derive the relevant update or inequality shown below.
$$\boldsymbol{\lambda}_{t+1} = \arg\max_{\boldsymbol{\lambda}\in\Lambda} a_t(\boldsymbol{\lambda})$$

(d) Implement a NumPy check on a synthetic two-dimensional objective. (e) Explain what metric you would log in a real LLM or fine-tuning run.

11. Why This Matters for AI (2026 Perspective)

| Concept | AI Impact |
|---------|-----------|
| configuration space | learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning |
| conditional parameter | Hyperband and ASHA for neural architecture and training-budget search |
| log-uniform sampling | Bayesian optimization for expensive, low-dimensional continuous tuning |
| grid search | validation leakage prevention in model-selection pipelines |
| random search | learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning |
| Sobol initialization | Hyperband and ASHA for neural architecture and training-budget search |
| surrogate model | Bayesian optimization for expensive, low-dimensional continuous tuning |
| Gaussian process | validation leakage prevention in model-selection pipelines |
| expected improvement | learning-rate, weight-decay, batch-size, and schedule tuning for LLM fine-tuning |
| upper confidence bound | Hyperband and ASHA for neural architecture and training-budget search |

12. Conceptual Bridge

Hyperparameter Optimization sits inside a chain. Earlier sections give the calculus, probability, and linear algebra needed to write the objective and interpret the update. Later sections use this material to reason about noisy gradients, adaptive state, regularization, tuning, schedules, and finally information-theoretic losses.

Backward link: Regularization Methods supplies the immediate prerequisite vocabulary.

Forward link: Learning Rate Schedules uses this section as a building block.

+------------------------------------------------------------+
| Chapter 8: Optimization                                    |
|    01-Convex-Optimization          Convex Optimization    |
|    02-Gradient-Descent             Gradient Descent       |
|    03-Second-Order-Methods         Second-Order Methods   |
|    04-Constrained-Optimization     Constrained Optimization |
|    05-Stochastic-Optimization      Stochastic Optimization |
|    06-Optimization-Landscape       Optimization Landscape |
|    07-Adaptive-Learning-Rate       Adaptive Learning Rate |
|    08-Regularization-Methods       Regularization Methods |
| >> 09-Hyperparameter-Optimization  Hyperparameter Optimization |
|    10-Learning-Rate-Schedules      Learning Rate Schedules |
+------------------------------------------------------------+

Appendix A. Extended Derivation and Diagnostic Cards

References

  • Bergstra and Bengio, Random Search for Hyper-Parameter Optimization.
  • Snoek, Larochelle, and Adams, Practical Bayesian Optimization of Machine Learning Algorithms.
  • Li et al., Hyperband.
  • Jaderberg et al., Population Based Training of Neural Networks.
  • Goodfellow, Bengio, and Courville, Deep Learning.
  • Bottou, Curtis, and Nocedal, Optimization Methods for Large-Scale Machine Learning.
  • PyTorch optimizer and scheduler documentation.
  • Optax documentation for composable optimizer transformations.
