Human in the Loop and Monitoring: 3. Feedback Collection through 4. Active Learning
3. Feedback Collection
Feedback Collection develops the part of human in the loop and monitoring that the approved TOC assigns to Chapter 18. The emphasis is on alignment behavior, safety constraints, and feedback loops, not on generic fine-tuning or production monitoring.
3.1 Rankings
Rankings belongs in the canonical scope of human in the loop and monitoring. The object is the human feedback loop, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol (x_i, y_i, h_i): an input or prompt x_i, a model response y_i, and the human feedback h_i attached to it. It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For rankings, this tuple should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat rankings as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
       policy text/rubric
                |
                v
    training or guardrail data -> objective/threshold -> aligned system
                |                                              |
                v                                              v
         audit metadata                              held-out safety eval
Worked reasoning pattern for rankings:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
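To make this pattern concrete for rankings, the sketch below assumes a Bradley-Terry style pairwise objective, one common (but not the only) way to turn human rankings into a trainable loss; the scores and comparison values are illustrative stand-ins for a real reward model's outputs.

```python
# Minimal sketch of a Bradley-Terry style pairwise loss over ranked responses.
# Assumes we already have scalar scores for the preferred and rejected response;
# names such as score_chosen are illustrative, not taken from this lesson.
import math

def pairwise_ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """Equivalent to -log(sigmoid(score_chosen - score_rejected)); small when
    the chosen response outranks the rejected one."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

# Toy comparison set: (chosen_score, rejected_score) pairs derived from human rankings.
comparisons = [(1.8, 0.3), (0.2, 0.9), (2.5, 2.4)]
losses = [pairwise_ranking_loss(c, r) for c, r in comparisons]
print(sum(losses) / len(losses))  # average loss over the ranked pairs
```

Which pairs enter `comparisons` determines what the loss rewards, which is exactly the masking and sampling concern raised above.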
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Rankings are part of the post-training stack used by modern assistant systems. They link the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
3.2 Ratings
Ratings belongs in the canonical scope of human in the loop and monitoring. The object is the human feedback loop, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol (x_i,y_i,h_i). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For ratings, this tuple should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat ratings as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
       policy text/rubric
                |
                v
    training or guardrail data -> objective/threshold -> aligned system
                |                                              |
                v                                              v
         audit metadata                              held-out safety eval
Worked reasoning pattern for ratings:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
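As one hedged illustration for ratings, the sketch below normalizes per-rater scores before they become training targets, so a harsh and a lenient rater do not silently shift the objective; the 1-to-5 scale, the field names, and the z-score normalization are assumptions, not requirements of this lesson.

```python
# Minimal sketch: per-rater normalization of Likert-style ratings before they
# become reward or filtering targets. Scale and names are illustrative.
from collections import defaultdict
from statistics import mean, pstdev

ratings = [  # (rater_id, response_id, score on an assumed 1-5 scale)
    ("r1", "a", 5), ("r1", "b", 4), ("r1", "c", 2),
    ("r2", "a", 3), ("r2", "b", 2), ("r2", "c", 1),
]

by_rater = defaultdict(list)
for rater, _, score in ratings:
    by_rater[rater].append(score)

def z_score(rater: str, score: int) -> float:
    """Center and scale a score by that rater's own distribution."""
    mu, sd = mean(by_rater[rater]), pstdev(by_rater[rater]) or 1.0
    return (score - mu) / sd

per_response = defaultdict(list)
for rater, resp, score in ratings:
    per_response[resp].append(z_score(rater, score))

targets = {resp: mean(vals) for resp, vals in per_response.items()}
print(targets)  # per-response targets after removing rater-level offsets
```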
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Ratings are part of the post-training stack used by modern assistant systems. They link the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
3.3 Edits
Edits belongs in the canonical scope of human in the loop and monitoring. The object is the human feedback loop, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol (x_i,y_i,h_i). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For edits, this tuple should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat edits as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
       policy text/rubric
                |
                v
    training or guardrail data -> objective/threshold -> aligned system
                |                                              |
                v                                              v
         audit metadata                              held-out safety eval
Worked reasoning pattern for edits:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
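For edits, one way to apply the pattern is to turn the human edit into a token-level supervision mask so the loss concentrates on what the reviewer actually changed. The sketch below uses a standard diff for this; the specific weights (1.0 for edited tokens, 0.1 for untouched ones) are an illustrative assumption.

```python
# Minimal sketch: converting a human edit into (target token, loss weight) pairs.
# Tokens the editor changed get full weight; untouched tokens get a small weight.
import difflib

model_output = "The capitol of France is Paris .".split()
human_edit   = "The capital of France is Paris .".split()

matcher = difflib.SequenceMatcher(a=model_output, b=human_edit)
weighted_targets = []
for op, _, _, j1, j2 in matcher.get_opcodes():
    for j in range(j1, j2):  # iterate over tokens on the edited (target) side
        weighted_targets.append((human_edit[j], 1.0 if op != "equal" else 0.1))

print(weighted_targets)  # the corrected token "capital" carries weight 1.0
```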
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Edits are part of the post-training stack used by modern assistant systems. They link the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
3.4 Demonstrations
Demonstrations belongs in the canonical scope of human in the loop and monitoring. The object is the human feedback loop, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol (x_i,y_i,h_i). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For demonstrations, this tuple should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat demonstrations as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
       policy text/rubric
                |
                v
    training or guardrail data -> objective/threshold -> aligned system
                |                                              |
                v                                              v
         audit metadata                              held-out safety eval
Worked reasoning pattern for demonstrations:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
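For demonstrations, the decision rule in step four is typically a masked supervised loss: only the demonstrated response tokens contribute, and prompt tokens are excluded. The sketch below shows just that masking arithmetic with stand-in per-token losses; it is not a full training loop, and the numbers are illustrative.

```python
# Minimal sketch: demonstration (SFT-style) loss masking. Prompt positions carry
# mask 0 and are excluded; demonstrated response positions carry mask 1.

def masked_demonstration_loss(token_losses, response_mask):
    """Average per-token loss over response positions only."""
    assert len(token_losses) == len(response_mask)
    kept = [loss for loss, m in zip(token_losses, response_mask) if m]
    return sum(kept) / max(len(kept), 1)

token_losses = [2.1, 1.9, 0.7, 0.5, 0.4]  # stand-ins for -log p(token)
response_mask = [0, 0, 1, 1, 1]           # first two positions are the prompt
print(masked_demonstration_loss(token_losses, response_mask))  # 0.533...
```

Changing the mask changes which tokens receive gradient weight, which is the "effective objective" point made earlier in this subsection.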
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Demonstrations are part of the post-training stack used by modern assistant systems. They link the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
3.5 Natural-language critiques
Natural-language critiques belongs in the canonical scope of human in the loop and monitoring. The object is the human feedback loop, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol (x_i,y_i,h_i). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For natural-language critiques, this tuple should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat natural-language critiques as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
       policy text/rubric
                |
                v
    training or guardrail data -> objective/threshold -> aligned system
                |                                              |
                v                                              v
         audit metadata                              held-out safety eval
Worked reasoning pattern for natural-language critiques:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
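For natural-language critiques, the feedback event is free text rather than a score, so the practical questions are how the critique is stored, versioned, and fed into a revision step. The sketch below shows one possible record layout and revision prompt; the field names and prompt format are assumptions, not a prescribed schema.

```python
# Minimal sketch: a critique event stored with provenance fields so it can both
# drive a revision step and be audited later. Layout is an illustrative assumption.
from dataclasses import dataclass, asdict

@dataclass
class CritiqueEvent:
    prompt: str          # x_i: the original request
    response: str        # y_i: the model response under critique
    critique: str        # h_i: the reviewer's natural-language feedback
    policy_version: str  # which rubric the critique was written under
    reviewer_id: str

def revision_input(event: CritiqueEvent) -> str:
    """Build the text a revision model would condition on (format is assumed)."""
    return (f"Request: {event.prompt}\n"
            f"Draft: {event.response}\n"
            f"Critique: {event.critique}\n"
            f"Rewrite the draft so the critique no longer applies.")

event = CritiqueEvent(
    prompt="Summarize the memo in three bullets.",
    response="The memo, circulated on Monday, discusses ... (two verbatim paragraphs)",
    critique="Too long and copies sentences verbatim; paraphrase and cut to three bullets.",
    policy_version="rubric-v7",
    reviewer_id="anon-12",
)
print(revision_input(event))
print(asdict(event))  # exactly what gets logged for provenance and consent checks
```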
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Natural-language critiques are part of the post-training stack used by modern assistant systems. They link the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
4. Active Learning
Active Learning develops the part of human in the loop and monitoring that the approved TOC assigns to Chapter 18. The emphasis is on alignment behavior, safety constraints, and feedback loops, not on generic fine-tuning or production monitoring.
4.1 Uncertainty sampling
Uncertainty sampling belongs in the canonical scope of human in the loop and monitoring. The object is the human feedback loop, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol (x_i,y_i,h_i). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For uncertainty sampling, this tuple should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat uncertainty sampling as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
       policy text/rubric
                |
                v
    training or guardrail data -> objective/threshold -> aligned system
                |                                              |
                v                                              v
         audit metadata                              held-out safety eval
Worked reasoning pattern for uncertainty sampling:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
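Applied to uncertainty sampling, the decision rule in step four is a selection rule rather than a loss: route the items the model is least sure about to human review. The sketch below scores a toy pool by predictive entropy; entropy is one common uncertainty signal among several (margin, ensemble disagreement), and the pool values are invented for illustration.

```python
# Minimal sketch: entropy-based uncertainty sampling over a pool of predictions,
# selecting the k items to send to human review next. Values are toy numbers.
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

pool = {  # item_id -> predicted class distribution from the current model
    "q1": [0.98, 0.02],
    "q2": [0.55, 0.45],
    "q3": [0.70, 0.30],
}

k = 2
review_queue = sorted(pool, key=lambda item: entropy(pool[item]), reverse=True)[:k]
print(review_queue)  # highest-entropy items first: ['q2', 'q3']
```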
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Uncertainty sampling is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
4.2 Diversity sampling
Diversity sampling belongs in the canonical scope of human in the loop and monitoring. The object is the human feedback loop, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol (x_i,y_i,h_i). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For diversity sampling, this tuple should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat diversity sampling as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
       policy text/rubric
                |
                v
    training or guardrail data -> objective/threshold -> aligned system
                |                                              |
                v                                              v
         audit metadata                              held-out safety eval
Worked reasoning pattern for diversity sampling:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
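For diversity sampling, the selection rule favors coverage of the input space over raw uncertainty, so the labeling budget is not spent on near-duplicates. The sketch below uses a greedy farthest-point heuristic over toy 2-D embeddings; the distance metric and the seeding choice are assumptions.

```python
# Minimal sketch: greedy farthest-point (k-center style) diversity sampling over
# item embeddings. Embeddings are toy 2-D vectors standing in for real features.
import math

embeddings = {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (5.0, 5.0), "d": (5.1, 4.9)}

def diverse_subset(items, k):
    chosen = [next(iter(items))]  # seed with an arbitrary item
    while len(chosen) < k:
        # pick the candidate farthest from everything already chosen
        best = max((i for i in items if i not in chosen),
                   key=lambda i: min(math.dist(items[i], items[c]) for c in chosen))
        chosen.append(best)
    return chosen

print(diverse_subset(embeddings, 2))  # ['a', 'd']: the second pick is far from the first
```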
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Diversity sampling is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
4.3 Risk-weighted sampling
Risk-weighted sampling belongs in the canonical scope of human in the loop and monitoring. The object is the human feedback loop, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol (x_i,y_i,h_i). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For risk-weighted sampling, this tuple should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat risk-weighted sampling as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
       policy text/rubric
                |
                v
    training or guardrail data -> objective/threshold -> aligned system
                |                                              |
                v                                              v
         audit metadata                              held-out safety eval
Worked reasoning pattern for risk-weighted sampling:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
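For risk-weighted sampling, the selection rule scales the chance of human review by estimated harm as well as by model uncertainty, so a borderline medical answer is reviewed before an equally uncertain piece of small talk. The sketch below combines an assumed severity table with an uncertainty score; the categories, weights, and the multiplicative rule are illustrative.

```python
# Minimal sketch: risk-weighted sampling of items for human review.
# Severity values and the weighting rule are assumptions for illustration.
import random

severity = {"medical": 3.0, "billing": 1.5, "chitchat": 0.2}

def review_weight(category: str, uncertainty: float) -> float:
    """Higher severity and higher uncertainty both raise the review weight."""
    return severity.get(category, 1.0) * uncertainty

items = [("medical", 0.4), ("chitchat", 0.9), ("billing", 0.5)]
weights = [review_weight(category, u) for category, u in items]

random.seed(0)
picked = random.choices(items, weights=weights, k=1)
print(weights, picked)  # severe categories are oversampled relative to uncertainty alone
```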
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Risk-weighted sampling is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
4.4 Marginal value of labels
Marginal value of labels belongs in the canonical scope of human in the loop and monitoring. The object is the human feedback loop, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol (x_i,y_i,h_i). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For marginal value of labels, this tuple should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat marginal value of labels as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
       policy text/rubric
                |
                v
    training or guardrail data -> objective/threshold -> aligned system
                |                                              |
                v                                              v
         audit metadata                              held-out safety eval
Worked reasoning pattern for marginal value of labels:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
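For the marginal value of labels, the regression metric from step five doubles as a stopping signal: when another labeling batch no longer moves the held-out metric, the budget is better spent on a different slice or a different feedback channel. The sketch below reads that signal off a toy learning curve; the accuracy numbers and the cost threshold are assumptions.

```python
# Minimal sketch: estimating the marginal value of labels from an observed
# learning curve and flagging when the next batch is no longer worth its cost.

history = [  # (total labels collected, held-out safety-eval accuracy after retraining)
    (1000, 0.72), (2000, 0.80), (3000, 0.84), (4000, 0.85),
]

min_gain_per_1k = 0.02  # assumed threshold below which labeling stops paying off

for (n0, m0), (n1, m1) in zip(history, history[1:]):
    gain_per_1k = (m1 - m0) / ((n1 - n0) / 1000)
    keep_labeling = gain_per_1k >= min_gain_per_1k
    print(f"{n0}->{n1} labels: +{gain_per_1k:.3f} per 1k, keep labeling={keep_labeling}")
```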
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Marginal value of labels is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
4.5 Exploration budget
Exploration budget belongs in the canonical scope of human in the loop and monitoring. The object is the human feedback loop, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol (x_i,y_i,h_i). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For exploration budget, this tuple should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat exploration budget as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
       policy text/rubric
                |
                v
    training or guardrail data -> objective/threshold -> aligned system
                |                                              |
                v                                              v
         audit metadata                              held-out safety eval
Worked reasoning pattern for exploration budget:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
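For the exploration budget, the decision rule reserves part of the review capacity for items chosen independently of any targeting heuristic, so monitoring can still surface failure modes the heuristics miss. The sketch below splits a toy review budget; the 20 percent exploration fraction and the item names are assumptions.

```python
# Minimal sketch: splitting a fixed human-review budget between targeted items
# and a reserved, randomly drawn exploration slice.
import random

random.seed(1)
budget = 10
explore_fraction = 0.2  # assumed share of reviews kept for exploration

pool = [f"item-{i}" for i in range(100)]
targeted = sorted(pool)[: int(budget * (1 - explore_fraction))]  # stand-in for a risk-ranked shortlist
remaining = [item for item in pool if item not in targeted]
explored = random.sample(remaining, budget - len(targeted))

review_queue = targeted + explored
print(len(targeted), "targeted +", len(explored), "explored =", len(review_queue), "reviews")
```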
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Exploration budget is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.