"Instruction tuning turns a language model into a cooperative interface."
Overview
Supervised fine-tuning aligns a pretrained next-token model with demonstrated instruction-following behavior by optimizing response tokens under a curated chat protocol.
Alignment and safety sit between model training and model deployment. The chapter studies how desired behavior is specified, learned, stress-tested, constrained, and improved through feedback.
This section is written in LaTeX Markdown. Inline mathematics uses $...$, and display
equations use `
`. The notes use math to expose the operational contract: which behavior is optimized, which behavior is forbidden, and which evidence proves the system changed.
Prerequisites
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Executable demonstrations for instruction tuning and sft |
| exercises.ipynb | Graded practice for instruction tuning and sft |
Learning Objectives
After completing this section, you will be able to:
- Define the core mathematical objects in instruction tuning and sft
- Write alignment losses and decision rules in repo notation
- Separate data, objective, policy, and evaluation responsibilities
- Explain how safety failures become training or guardrail updates
- Identify reward hacking, over-refusal, proxy misspecification, and drift
- Design synthetic experiments that demonstrate alignment tradeoffs
- Connect human feedback to SFT, reward modeling, DPO, or runtime policies
- State the boundary with Chapter 17 evaluation and Chapter 19 production MLOps
- Read modern alignment papers with attention to assumptions and proxies
- Build exercises that test both formulas and system-level judgment
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Instruction Data Design
- 4. SFT Objective
- 5. Alignment Behavior
- 6. SFT Limits
- 7. Synthetic and Self-Generated Instructions
- 8. Common Mistakes
- 9. Exercises
- 10. Why This Matters for AI
- 11. Conceptual Bridge
- References
1. Intuition
Intuition develops the part of instruction tuning and sft that the approved TOC assigns to Chapter 18. The emphasis is alignment behavior, safety constraints, and feedback loops, not generic fine-tuning or production monitoring.
1.1 Why pretrained next-token models need instruction-following alignment
Why pretrained next-token models need instruction-following alignment belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For why pretrained next-token models need instruction-following alignment, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat why pretrained next-token models need instruction-following alignment as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for why pretrained next-token models need instruction-following alignment:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Why pretrained next-token models need instruction-following alignment is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
1.2 Instruction following as behavior conditioning
Instruction following as behavior conditioning belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For instruction following as behavior conditioning, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat instruction following as behavior conditioning as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for instruction following as behavior conditioning:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Instruction following as behavior conditioning is part of the post- training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
1.3 Demonstrations versus preferences
Demonstrations versus preferences belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For demonstrations versus preferences, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat demonstrations versus preferences as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for demonstrations versus preferences:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Demonstrations versus preferences is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
1.4 Chat templates as part of the objective
Chat templates as part of the objective belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For chat templates as part of the objective, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat chat templates as part of the objective as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for chat templates as part of the objective:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Chat templates as part of the objective is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
1.5 Historical arc from FLAN to InstructGPT
Historical arc from FLAN to InstructGPT belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For historical arc from flan to instructgpt, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat historical arc from flan to instructgpt as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for historical arc from flan to instructgpt:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Historical arc from FLAN to InstructGPT is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
2. Formal Definitions
Formal Definitions develops the part of instruction tuning and sft that the approved TOC assigns to Chapter 18. The emphasis is alignment behavior, safety constraints, and feedback loops, not generic fine-tuning or production monitoring.
2.1 Prompt and response
Prompt and response belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For prompt and response , this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat prompt and response as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for prompt and response :
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Prompt and response is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
2.2 Demonstration dataset
Demonstration dataset belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For demonstration dataset , this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat demonstration dataset as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for demonstration dataset :
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Demonstration dataset is part of the post- training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
2.3 Policy
Policy belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For policy , this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat policy as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for policy :
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Policy is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
2.4 Response-token mask
Response-token mask belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For response-token mask, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat response-token mask as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for response-token mask:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Response-token mask is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
2.5 Instruction distribution and validation split
Instruction distribution and validation split belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For instruction distribution and validation split, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat instruction distribution and validation split as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for instruction distribution and validation split:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Instruction distribution and validation split is part of the post- training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
3. Instruction Data Design
Instruction Data Design develops the part of instruction tuning and sft that the approved TOC assigns to Chapter 18. The emphasis is alignment behavior, safety constraints, and feedback loops, not generic fine-tuning or production monitoring.
3.1 Task diversity
Task diversity belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For task diversity, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat task diversity as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for task diversity:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Task diversity is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
3.2 Chat templates
Chat templates belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For chat templates, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat chat templates as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for chat templates:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Chat templates is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
3.3 Role tokens
Role tokens belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For role tokens, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat role tokens as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for role tokens:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Role tokens is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
3.4 Refusal examples
Refusal examples belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For refusal examples, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat refusal examples as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for refusal examples:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Refusal examples is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
3.5 Multi-turn demonstrations
Multi-turn demonstrations belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For multi-turn demonstrations, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat multi-turn demonstrations as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for multi-turn demonstrations:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Multi-turn demonstrations is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
4. SFT Objective
SFT Objective develops the part of instruction tuning and sft that the approved TOC assigns to Chapter 18. The emphasis is alignment behavior, safety constraints, and feedback loops, not generic fine-tuning or production monitoring.
4.1 Response-only cross-entropy
Response-only cross-entropy belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For response-only cross-entropy, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat response-only cross-entropy as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for response-only cross-entropy:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Response-only cross-entropy is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
4.2 Packed examples
Packed examples belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For packed examples, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat packed examples as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for packed examples:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Packed examples is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
4.3 Loss masks
Loss masks belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For loss masks, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat loss masks as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for loss masks:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Loss masks is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
4.4 Class imbalance
Class imbalance belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For class imbalance, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat class imbalance as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for class imbalance:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Class imbalance is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
4.5 Validation curves
Validation curves belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For validation curves, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat validation curves as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for validation curves:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Validation curves is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
5. Alignment Behavior
Alignment Behavior develops the part of instruction tuning and sft that the approved TOC assigns to Chapter 18. The emphasis is alignment behavior, safety constraints, and feedback loops, not generic fine-tuning or production monitoring.
5.1 Helpfulness
Helpfulness belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For helpfulness, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat helpfulness as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for helpfulness:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Helpfulness is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
5.2 Honesty
Honesty belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For honesty, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat honesty as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for honesty:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Honesty is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
5.3 Harmlessness
Harmlessness belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For harmlessness, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat harmlessness as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for harmlessness:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Harmlessness is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
5.4 Style control
Style control belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For style control, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat style control as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for style control:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Style control is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
5.5 Sycophancy risk
Sycophancy risk belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For sycophancy risk, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat sycophancy risk as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for sycophancy risk:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Sycophancy risk is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
6. SFT Limits
SFT Limits develops the part of instruction tuning and sft that the approved TOC assigns to Chapter 18. The emphasis is alignment behavior, safety constraints, and feedback loops, not generic fine-tuning or production monitoring.
6.1 Imitation ceiling
Imitation ceiling belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For imitation ceiling, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat imitation ceiling as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for imitation ceiling:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Imitation ceiling is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
6.2 Hallucination
Hallucination belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For hallucination, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat hallucination as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for hallucination:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Hallucination is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
6.3 Over-refusal
Over-refusal belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For over-refusal, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat over-refusal as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for over-refusal:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Over-refusal is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
6.4 Catastrophic forgetting
Catastrophic forgetting belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For catastrophic forgetting, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat catastrophic forgetting as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for catastrophic forgetting:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Catastrophic forgetting is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
6.5 Distribution mismatch
Distribution mismatch belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For distribution mismatch, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat distribution mismatch as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for distribution mismatch:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Distribution mismatch is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
7. Synthetic and Self-Generated Instructions
Synthetic and Self-Generated Instructions develops the part of instruction tuning and sft that the approved TOC assigns to Chapter 18. The emphasis is alignment behavior, safety constraints, and feedback loops, not generic fine-tuning or production monitoring.
7.1 FLAN-style multitask instruction tuning
FLAN-style multitask instruction tuning belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For flan-style multitask instruction tuning, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat flan-style multitask instruction tuning as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for flan-style multitask instruction tuning:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: FLAN-style multitask instruction tuning is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
7.2 Self-Instruct bootstrapping
Self-Instruct bootstrapping belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For self-instruct bootstrapping, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat self-instruct bootstrapping as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for self-instruct bootstrapping:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Self-Instruct bootstrapping is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
7.3 Data filtering
Data filtering belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For data filtering, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat data filtering as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for data filtering:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Data filtering is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
7.4 Quality scoring
Quality scoring belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For quality scoring, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat quality scoring as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for quality scoring:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Quality scoring is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
7.5 Bootstrapping loops
Bootstrapping loops belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For bootstrapping loops, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat bootstrapping loops as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for bootstrapping loops:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Bootstrapping loops is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
8. Common Mistakes
| # | Mistake | Why It Is Wrong | Fix |
|---|---|---|---|
| 1 | Treating SFT as full alignment | SFT imitates demonstrations but does not optimize preferences or robust safety. | Use preference optimization and safety evals after SFT. |
| 2 | Masking prompt tokens incorrectly | The model is trained to copy user prompts instead of answer them. | Use response-only loss masks for chat SFT. |
| 3 | Trusting reward scores as truth | Reward models are learned proxies with bias and calibration error. | Evaluate reward models on held-out preference and safety sets. |
| 4 | Ignoring KL drift | A policy can become high reward but lose language quality or capability. | Track KL to the reference policy and capability regressions. |
| 5 | Optimizing only refusal rate | High refusal can hide low helpfulness and overblocking. | Measure safe compliance and benign refusal separately. |
| 6 | Using public jailbreaks as the only red team | Static attacks overfit quickly. | Mix human, automated, private, and adaptive attacks. |
| 7 | Changing policy text without versioning | Labels become incomparable across time. | Version policy, rubric, data, and model together. |
| 8 | Skipping reviewer calibration | Human feedback becomes noisy and inconsistent. | Use gold tasks, overlap, adjudication, and disagreement analysis. |
| 9 | Letting guardrails replace model training | Runtime filters cannot fix every model behavior. | Use layered defenses: data, training, policies, and gates. |
| 10 | Confusing safety monitoring with production observability | Chapter 18 feedback loops are not full MLOps dashboards. | Hand production telemetry to Chapter 19 while preserving safety feedback evidence. |
9. Exercises
-
(*) Why pretrained next-token models need instruction-following alignment. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.
-
(*) Instruction following as behavior conditioning. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.
-
(*) Demonstrations versus preferences. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.
-
(**) Chat templates as part of the objective. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.
-
(**) Historical arc from FLAN to InstructGPT. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.
-
(**) Prompt and response . Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.
-
(***) Demonstration dataset . Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.
-
(***) Policy . Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.
-
(***) Response-token mask. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.
-
(***) Instruction distribution and validation split. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.
10. Why This Matters for AI
| Concept | AI Impact |
|---|---|
| Instruction tuning | Converts raw next-token prediction into usable assistant behavior |
| Preference learning | Optimizes choices that are hard to express as reference answers |
| KL control | Limits destructive policy drift during reward optimization |
| Red teaming | Finds harmful behavior before deployment and creates regression cases |
| Guardrails | Adds runtime control when training alone is insufficient |
| Policy versioning | Keeps safety labels auditable across changing rules |
| Human feedback | Supplies sparse but high-value evidence about user intent and risk |
| Release gates | Connects alignment work to measurable safety and capability thresholds |
11. Conceptual Bridge
Chapter 17 taught how to measure model behavior with benchmarks, uncertainty, robustness tests, ablations, and online experiments. Chapter 18 uses those measurements to change behavior through data, objectives, policies, guardrails, and human feedback.
Chapter 15 remains the home for general fine-tuning mechanics: parameter-efficient updates, memory cost, and broad training details. This chapter narrows the focus to post-training methods whose purpose is alignment with instructions, preferences, and safety policies.
Chapter 19 will pick up production lineage, monitoring, observability, drift, and serving systems. Chapter 18 stops at the safety feedback loop: how evidence becomes alignment data or runtime policy, not how every deployed metric is stored forever.
15 LLM training and fine-tuning math
-> objectives and update mechanics
17 Evaluation and Reliability
-> evidence about model behavior
18 Alignment and Safety
-> SFT, preferences, red teams, policies, feedback
19 Production ML and MLOps
-> deployment, observability, drift, retraining