Private notes
0/8000

Notes stay private to your browser until account sync is configured.

Part 4
30 min read12 headingsSplit lesson page

Lesson overview | Previous part | Next part

Instruction Tuning and SFT: Part 5: Alignment Behavior to 6. SFT Limits

5. Alignment Behavior

Alignment Behavior develops the part of instruction tuning and sft that the approved TOC assigns to Chapter 18. The emphasis is alignment behavior, safety constraints, and feedback loops, not generic fine-tuning or production monitoring.

5.1 Helpfulness

Helpfulness belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LSFT(θ)=1Ni=1NtRilogπθ(yi,txi,yi,<t).\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t \in R_i}\log \pi_\theta(y_{i,t} \mid x_i,y_{i,<t}).

For helpfulness, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat helpfulness as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for helpfulness:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Helpfulness is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

5.2 Honesty

Honesty belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LSFT(θ)=1Ni=1NtRilogπθ(yi,txi,yi,<t).\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t \in R_i}\log \pi_\theta(y_{i,t} \mid x_i,y_{i,<t}).

For honesty, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat honesty as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for honesty:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Honesty is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

5.3 Harmlessness

Harmlessness belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LSFT(θ)=1Ni=1NtRilogπθ(yi,txi,yi,<t).\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t \in R_i}\log \pi_\theta(y_{i,t} \mid x_i,y_{i,<t}).

For harmlessness, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat harmlessness as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for harmlessness:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Harmlessness is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

5.4 Style control

Style control belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LSFT(θ)=1Ni=1NtRilogπθ(yi,txi,yi,<t).\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t \in R_i}\log \pi_\theta(y_{i,t} \mid x_i,y_{i,<t}).

For style control, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat style control as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for style control:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Style control is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

5.5 Sycophancy risk

Sycophancy risk belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LSFT(θ)=1Ni=1NtRilogπθ(yi,txi,yi,<t).\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t \in R_i}\log \pi_\theta(y_{i,t} \mid x_i,y_{i,<t}).

For sycophancy risk, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat sycophancy risk as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for sycophancy risk:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Sycophancy risk is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

6. SFT Limits

SFT Limits develops the part of instruction tuning and sft that the approved TOC assigns to Chapter 18. The emphasis is alignment behavior, safety constraints, and feedback loops, not generic fine-tuning or production monitoring.

6.1 Imitation ceiling

Imitation ceiling belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LSFT(θ)=1Ni=1NtRilogπθ(yi,txi,yi,<t).\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t \in R_i}\log \pi_\theta(y_{i,t} \mid x_i,y_{i,<t}).

For imitation ceiling, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat imitation ceiling as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for imitation ceiling:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Imitation ceiling is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

6.2 Hallucination

Hallucination belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LSFT(θ)=1Ni=1NtRilogπθ(yi,txi,yi,<t).\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t \in R_i}\log \pi_\theta(y_{i,t} \mid x_i,y_{i,<t}).

For hallucination, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat hallucination as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for hallucination:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Hallucination is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

6.3 Over-refusal

Over-refusal belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LSFT(θ)=1Ni=1NtRilogπθ(yi,txi,yi,<t).\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t \in R_i}\log \pi_\theta(y_{i,t} \mid x_i,y_{i,<t}).

For over-refusal, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat over-refusal as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for over-refusal:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Over-refusal is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

6.4 Catastrophic forgetting

Catastrophic forgetting belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LSFT(θ)=1Ni=1NtRilogπθ(yi,txi,yi,<t).\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t \in R_i}\log \pi_\theta(y_{i,t} \mid x_i,y_{i,<t}).

For catastrophic forgetting, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat catastrophic forgetting as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for catastrophic forgetting:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Catastrophic forgetting is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

6.5 Distribution mismatch

Distribution mismatch belongs in the canonical scope of instruction tuning and sft. The object is the instruction-following policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol \pi_\theta(y \mid x). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LSFT(θ)=1Ni=1NtRilogπθ(yi,txi,yi,<t).\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t \in R_i}\log \pi_\theta(y_{i,t} \mid x_i,y_{i,<t}).

For distribution mismatch, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat distribution mismatch as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for distribution mismatch:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Distribution mismatch is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

Skill Check

Test this lesson

Answer 4 quick questions to lock in the lesson and feed your adaptive practice queue.

--
Score
0/4
Answered
Not attempted
Status
1

Which module does this lesson belong to?

2

Which section is covered in this lesson content?

3

Which term is most central to this lesson?

4

What is the best way to use this lesson for real learning?

Your answers save locally first, then sync when account storage is available.
Practice queue