Part 7Math for LLMs

Preference Optimization RLHF and DPO: Part 7 - Failure Modes To References

Alignment and Safety / Preference Optimization RLHF and DPO

Private notes
0/8000

Notes stay private to your browser until account sync is configured.

Part 7
20 min read11 headingsSplit lesson page

Lesson overview | Previous part | Lesson overview

Preference Optimization RLHF and DPO: Part 7: Failure Modes to References

7. Failure Modes

Failure Modes develops the part of preference optimization rlhf and dpo that the approved TOC assigns to Chapter 18. The emphasis is alignment behavior, safety constraints, and feedback loops, not generic fine-tuning or production monitoring.

7.1 Reward hacking

Reward hacking belongs in the canonical scope of preference optimization rlhf and dpo. The object is the preference-aligned policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol (x,y_w,y_l). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LDPO(θ)=Elogσ(βlogπθ(ywx)πref(ywx)βlogπθ(ylx)πref(ylx)).\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right).

For reward hacking, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat reward hacking as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for reward hacking:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Reward hacking is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

7.2 Preference overfitting

Preference overfitting belongs in the canonical scope of preference optimization rlhf and dpo. The object is the preference-aligned policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol (x,y_w,y_l). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LDPO(θ)=Elogσ(βlogπθ(ywx)πref(ywx)βlogπθ(ylx)πref(ylx)).\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right).

For preference overfitting, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat preference overfitting as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for preference overfitting:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Preference overfitting is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

7.3 Length bias

Length bias belongs in the canonical scope of preference optimization rlhf and dpo. The object is the preference-aligned policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol (x,y_w,y_l). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LDPO(θ)=Elogσ(βlogπθ(ywx)πref(ywx)βlogπθ(ylx)πref(ylx)).\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right).

For length bias, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat length bias as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for length bias:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Length bias is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

7.4 Judge bias

Judge bias belongs in the canonical scope of preference optimization rlhf and dpo. The object is the preference-aligned policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol (x,y_w,y_l). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LDPO(θ)=Elogσ(βlogπθ(ywx)πref(ywx)βlogπθ(ylx)πref(ylx)).\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right).

For judge bias, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat judge bias as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for judge bias:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Judge bias is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

7.5 Alignment tax

Alignment tax belongs in the canonical scope of preference optimization rlhf and dpo. The object is the preference-aligned policy, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.

A compact way to read this subsection is through the local symbol (x,y_w,y_l). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.

LDPO(θ)=Elogσ(βlogπθ(ywx)πref(ywx)βlogπθ(ylx)πref(ylx)).\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right).

For alignment tax, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.

Alignment objectMathematical questionEngineering question
DataWhich examples define the target behavior?Who wrote, filtered, and approved them?
ObjectiveWhich terms receive weight?Are masks, margins, and thresholds logged?
PolicyWhich actions are allowed or disallowed?Can reviewers reproduce the decision?
EvaluationWhich metric detects regression?Is the test private, stable, and sliced?
FeedbackWhich new evidence changes training?How does it enter the next dataset version?

Examples:

  • Treat alignment tax as part of the model contract and store the exact data version.
  • Record the prompt template, role format, policy version, and decoder settings.
  • Compare aligned and reference policies on both helpfulness and safety slices.
  • Use held-out examples that were not used to tune refusals or rewards.
  • Inspect failure cases before declaring the objective successful.

Non-examples:

  • Calling a model aligned because it sounds polite on a few prompts.
  • Training on refusals without measuring over-refusal on benign requests.
  • Using a reward model as ground truth without calibration or adversarial checks.
  • Shipping a guardrail threshold without measuring false positive and false negative rates.
  • Letting feedback logs change training without provenance or consent controls.

A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.

policy text/rubric
      |
      v
training or guardrail data  ->  objective/threshold  ->  aligned system
      |                                                   |
      v                                                   v
audit metadata                                      held-out safety eval

Worked reasoning pattern for alignment tax:

  1. Name the target behavior in plain language.
  2. Write the mathematical variable that represents it.
  3. Specify which examples or comparisons estimate it.
  4. Choose the optimization loss or runtime decision rule.
  5. Define the regression metric that would prove the change became worse.

Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.

Failure pressureTypical symptomMitigation
Proxy rewardHigh reward but worse human judgmentHoldout preferences and adversarial review
Refusal shortcutSafe but unhelpful responsesMeasure benign refusal rate separately
Template overfitGood on training chat format onlyEvaluate alternate templates and languages
Policy ambiguityInconsistent labelsAdjudication and rubric revision
Feedback driftNew labels change old policy silentlyVersion policy, rubric, and dataset together

AI connection: Alignment tax is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.

8. Common Mistakes

#MistakeWhy It Is WrongFix
1Treating SFT as full alignmentSFT imitates demonstrations but does not optimize preferences or robust safety.Use preference optimization and safety evals after SFT.
2Masking prompt tokens incorrectlyThe model is trained to copy user prompts instead of answer them.Use response-only loss masks for chat SFT.
3Trusting reward scores as truthReward models are learned proxies with bias and calibration error.Evaluate reward models on held-out preference and safety sets.
4Ignoring KL driftA policy can become high reward but lose language quality or capability.Track KL to the reference policy and capability regressions.
5Optimizing only refusal rateHigh refusal can hide low helpfulness and overblocking.Measure safe compliance and benign refusal separately.
6Using public jailbreaks as the only red teamStatic attacks overfit quickly.Mix human, automated, private, and adaptive attacks.
7Changing policy text without versioningLabels become incomparable across time.Version policy, rubric, data, and model together.
8Skipping reviewer calibrationHuman feedback becomes noisy and inconsistent.Use gold tasks, overlap, adjudication, and disagreement analysis.
9Letting guardrails replace model trainingRuntime filters cannot fix every model behavior.Use layered defenses: data, training, policies, and gates.
10Confusing safety monitoring with production observabilityChapter 18 feedback loops are not full MLOps dashboards.Hand production telemetry to Chapter 19 while preserving safety feedback evidence.

9. Exercises

  1. (*) Preferences optimize choices rather than demonstrations. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.

  2. (*) Reward models as learned proxies. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.

  3. (*) Policy shift under a KL budget. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.

  4. (**) DPO as direct reward-model-free training. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.

  5. (**) Why preference data is noisy but valuable. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.

  6. (**) Preference pair (x,yw,yl)(x,y_w,y_l). Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.

  7. (***) Reward model rϕ(x,y)r_\phi(x,y). Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.

  8. (***) Bradley-Terry preference model. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.

  9. (***) Reference policy πref\pi_{\mathrm{ref}}. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.

  10. (***) KL coefficient and inverse temperature β\beta. Define the alignment object, write the relevant loss or decision rule, give one safe example and one unsafe edge case, then explain which held-out metric would catch regression.

10. Why This Matters for AI

ConceptAI Impact
Instruction tuningConverts raw next-token prediction into usable assistant behavior
Preference learningOptimizes choices that are hard to express as reference answers
KL controlLimits destructive policy drift during reward optimization
Red teamingFinds harmful behavior before deployment and creates regression cases
GuardrailsAdds runtime control when training alone is insufficient
Policy versioningKeeps safety labels auditable across changing rules
Human feedbackSupplies sparse but high-value evidence about user intent and risk
Release gatesConnects alignment work to measurable safety and capability thresholds

11. Conceptual Bridge

Chapter 17 taught how to measure model behavior with benchmarks, uncertainty, robustness tests, ablations, and online experiments. Chapter 18 uses those measurements to change behavior through data, objectives, policies, guardrails, and human feedback.

Chapter 15 remains the home for general fine-tuning mechanics: parameter-efficient updates, memory cost, and broad training details. This chapter narrows the focus to post-training methods whose purpose is alignment with instructions, preferences, and safety policies.

Chapter 19 will pick up production lineage, monitoring, observability, drift, and serving systems. Chapter 18 stops at the safety feedback loop: how evidence becomes alignment data or runtime policy, not how every deployed metric is stored forever.

15 LLM training and fine-tuning math
        -> objectives and update mechanics
17 Evaluation and Reliability
        -> evidence about model behavior
18 Alignment and Safety
        -> SFT, preferences, red teams, policies, feedback
19 Production ML and MLOps
        -> deployment, observability, drift, retraining

References

Skill Check

Test this lesson

Answer 4 quick questions to lock in the lesson and feed your adaptive practice queue.

--
Score
0/4
Answered
Not attempted
Status
1

Which module does this lesson belong to?

2

Which section is covered in this lesson content?

3

Which term is most central to this lesson?

4

What is the best way to use this lesson for real learning?

Your answers save locally first, then sync when account storage is available.
Practice queue