Lesson overview | Previous part | Lesson overview
Error Analysis and Ablations: Part 7: Regression-Test Workflow to References
7. Regression-Test Workflow
Regression-Test Workflow is the part of error analysis and ablations that turns the approved TOC into a concrete learning path. The subsections below keep the focus on Chapter 17's canonical job: measurement, reliability, uncertainty, and decision support for AI systems.
7.1 Converting failures into eval cases
Converting failures into eval cases is part of the canonical scope of error analysis and ablations. In this chapter, the object under study is not merely a dataset or a model, but the full failure slice and causal comparison: the items, prompts, outputs, graders, uncertainty statements, and decision rules that turn model behavior into evidence.
The basic mathematical pattern is an empirical estimator. For a model or system evaluated on items , the local estimate is written
The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For converting failures into eval cases, those choices determine whether the reported number is evidence or decoration.
A useful invariant is that every evaluation claim should be reproducible as a tuple , where is the system, is the task sample, is the prompt or intervention policy, is the grader, and is the aggregation rule. If any part of this tuple is missing, the number cannot be audited.
| Component | What to record | Why it matters |
|---|---|---|
| Item definition | IDs, source, split, and allowed transformations | Prevents accidental drift in converting failures into eval cases |
| Scoring rule | Exact formula for \mathbb{1}{e_i \in E} | Makes comparisons repeatable |
| Aggregation | Mean, weighted mean, worst group, or pairwise model | Determines the scientific claim |
| Uncertainty | Standard error, interval, or posterior summary | Separates signal from sampling noise |
| Audit trail | Code version and random seeds | Makes failures debuggable |
Examples of correct use:
- Report converting failures into eval cases with item count, prompt protocol, grader version, and a confidence interval.
- Use paired comparisons when two models answer the same evaluation items.
- Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
- Store raw outputs so future graders can be replayed without querying the model again.
- Document whether the metric is measuring capability, reliability, user value, or risk.
Non-examples:
- A leaderboard point estimate without sample size.
- A benchmark score produced with an undocumented prompt template.
- A model-graded result without judge identity, rubric, or agreement check.
- A robustness claim measured only on the easiest in-distribution examples.
- An online win declared before the randomization and logging checks pass.
Worked evaluation pattern for converting failures into eval cases:
- Define the evaluation population in words before writing code.
- Choose the smallest metric set that answers the decision question.
- Compute the point estimate and an uncertainty statement together.
- Run a slice or paired analysis to check whether the aggregate hides structure.
- Archive raw outputs, scores, and seeds before changing the prompt or grader.
For AI systems, converting failures into eval cases is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.
| AI connection | Evaluation consequence |
|---|---|
| Prompting | Treat prompt templates as part of the protocol, not as invisible setup |
| Decoding | Temperature and sampling change both mean score and variance |
| Retrieval | Retrieved context creates an extra source of failure and leakage |
| Tool use | Tool errors need separate attribution from model reasoning errors |
| Safety layer | Guardrail behavior can improve risk metrics while changing capability metrics |
Implementation checklist:
- Use deterministic seeds for synthetic or sampled evaluation subsets.
- Print metric denominators, not only percentages.
- Keep missing, invalid, timeout, and refusal outcomes explicit.
- Prefer typed result records over loose CSV columns.
- Separate raw model outputs from normalized grader inputs.
- Track the smallest reproducible command that generated the result.
- Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
- Write the decision rule before seeing the final score whenever the result will guide a release.
The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. Converting failures into eval cases is one place where that habit becomes concrete.
7.2 Versioning error suites
Versioning error suites is part of the canonical scope of error analysis and ablations. In this chapter, the object under study is not merely a dataset or a model, but the full failure slice and causal comparison: the items, prompts, outputs, graders, uncertainty statements, and decision rules that turn model behavior into evidence.
The basic mathematical pattern is an empirical estimator. For a model or system evaluated on items , the local estimate is written
The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For versioning error suites, those choices determine whether the reported number is evidence or decoration.
A useful invariant is that every evaluation claim should be reproducible as a tuple , where is the system, is the task sample, is the prompt or intervention policy, is the grader, and is the aggregation rule. If any part of this tuple is missing, the number cannot be audited.
| Component | What to record | Why it matters |
|---|---|---|
| Item definition | IDs, source, split, and allowed transformations | Prevents accidental drift in versioning error suites |
| Scoring rule | Exact formula for \mathbb{1}{e_i \in E} | Makes comparisons repeatable |
| Aggregation | Mean, weighted mean, worst group, or pairwise model | Determines the scientific claim |
| Uncertainty | Standard error, interval, or posterior summary | Separates signal from sampling noise |
| Audit trail | Code version and random seeds | Makes failures debuggable |
Examples of correct use:
- Report versioning error suites with item count, prompt protocol, grader version, and a confidence interval.
- Use paired comparisons when two models answer the same evaluation items.
- Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
- Store raw outputs so future graders can be replayed without querying the model again.
- Document whether the metric is measuring capability, reliability, user value, or risk.
Non-examples:
- A leaderboard point estimate without sample size.
- A benchmark score produced with an undocumented prompt template.
- A model-graded result without judge identity, rubric, or agreement check.
- A robustness claim measured only on the easiest in-distribution examples.
- An online win declared before the randomization and logging checks pass.
Worked evaluation pattern for versioning error suites:
- Define the evaluation population in words before writing code.
- Choose the smallest metric set that answers the decision question.
- Compute the point estimate and an uncertainty statement together.
- Run a slice or paired analysis to check whether the aggregate hides structure.
- Archive raw outputs, scores, and seeds before changing the prompt or grader.
For AI systems, versioning error suites is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.
| AI connection | Evaluation consequence |
|---|---|
| Prompting | Treat prompt templates as part of the protocol, not as invisible setup |
| Decoding | Temperature and sampling change both mean score and variance |
| Retrieval | Retrieved context creates an extra source of failure and leakage |
| Tool use | Tool errors need separate attribution from model reasoning errors |
| Safety layer | Guardrail behavior can improve risk metrics while changing capability metrics |
Implementation checklist:
- Use deterministic seeds for synthetic or sampled evaluation subsets.
- Print metric denominators, not only percentages.
- Keep missing, invalid, timeout, and refusal outcomes explicit.
- Prefer typed result records over loose CSV columns.
- Separate raw model outputs from normalized grader inputs.
- Track the smallest reproducible command that generated the result.
- Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
- Write the decision rule before seeing the final score whenever the result will guide a release.
The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. Versioning error suites is one place where that habit becomes concrete.
7.3 Avoiding overfit fixes
Avoiding overfit fixes is part of the canonical scope of error analysis and ablations. In this chapter, the object under study is not merely a dataset or a model, but the full failure slice and causal comparison: the items, prompts, outputs, graders, uncertainty statements, and decision rules that turn model behavior into evidence.
The basic mathematical pattern is an empirical estimator. For a model or system evaluated on items , the local estimate is written
The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For avoiding overfit fixes, those choices determine whether the reported number is evidence or decoration.
A useful invariant is that every evaluation claim should be reproducible as a tuple , where is the system, is the task sample, is the prompt or intervention policy, is the grader, and is the aggregation rule. If any part of this tuple is missing, the number cannot be audited.
| Component | What to record | Why it matters |
|---|---|---|
| Item definition | IDs, source, split, and allowed transformations | Prevents accidental drift in avoiding overfit fixes |
| Scoring rule | Exact formula for \mathbb{1}{e_i \in E} | Makes comparisons repeatable |
| Aggregation | Mean, weighted mean, worst group, or pairwise model | Determines the scientific claim |
| Uncertainty | Standard error, interval, or posterior summary | Separates signal from sampling noise |
| Audit trail | Code version and random seeds | Makes failures debuggable |
Examples of correct use:
- Report avoiding overfit fixes with item count, prompt protocol, grader version, and a confidence interval.
- Use paired comparisons when two models answer the same evaluation items.
- Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
- Store raw outputs so future graders can be replayed without querying the model again.
- Document whether the metric is measuring capability, reliability, user value, or risk.
Non-examples:
- A leaderboard point estimate without sample size.
- A benchmark score produced with an undocumented prompt template.
- A model-graded result without judge identity, rubric, or agreement check.
- A robustness claim measured only on the easiest in-distribution examples.
- An online win declared before the randomization and logging checks pass.
Worked evaluation pattern for avoiding overfit fixes:
- Define the evaluation population in words before writing code.
- Choose the smallest metric set that answers the decision question.
- Compute the point estimate and an uncertainty statement together.
- Run a slice or paired analysis to check whether the aggregate hides structure.
- Archive raw outputs, scores, and seeds before changing the prompt or grader.
For AI systems, avoiding overfit fixes is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.
| AI connection | Evaluation consequence |
|---|---|
| Prompting | Treat prompt templates as part of the protocol, not as invisible setup |
| Decoding | Temperature and sampling change both mean score and variance |
| Retrieval | Retrieved context creates an extra source of failure and leakage |
| Tool use | Tool errors need separate attribution from model reasoning errors |
| Safety layer | Guardrail behavior can improve risk metrics while changing capability metrics |
Implementation checklist:
- Use deterministic seeds for synthetic or sampled evaluation subsets.
- Print metric denominators, not only percentages.
- Keep missing, invalid, timeout, and refusal outcomes explicit.
- Prefer typed result records over loose CSV columns.
- Separate raw model outputs from normalized grader inputs.
- Track the smallest reproducible command that generated the result.
- Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
- Write the decision rule before seeing the final score whenever the result will guide a release.
The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. Avoiding overfit fixes is one place where that habit becomes concrete.
7.4 Budgeting regression tests
Budgeting regression tests is part of the canonical scope of error analysis and ablations. In this chapter, the object under study is not merely a dataset or a model, but the full failure slice and causal comparison: the items, prompts, outputs, graders, uncertainty statements, and decision rules that turn model behavior into evidence.
The basic mathematical pattern is an empirical estimator. For a model or system evaluated on items , the local estimate is written
The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For budgeting regression tests, those choices determine whether the reported number is evidence or decoration.
A useful invariant is that every evaluation claim should be reproducible as a tuple , where is the system, is the task sample, is the prompt or intervention policy, is the grader, and is the aggregation rule. If any part of this tuple is missing, the number cannot be audited.
| Component | What to record | Why it matters |
|---|---|---|
| Item definition | IDs, source, split, and allowed transformations | Prevents accidental drift in budgeting regression tests |
| Scoring rule | Exact formula for \mathbb{1}{e_i \in E} | Makes comparisons repeatable |
| Aggregation | Mean, weighted mean, worst group, or pairwise model | Determines the scientific claim |
| Uncertainty | Standard error, interval, or posterior summary | Separates signal from sampling noise |
| Audit trail | Code version and random seeds | Makes failures debuggable |
Examples of correct use:
- Report budgeting regression tests with item count, prompt protocol, grader version, and a confidence interval.
- Use paired comparisons when two models answer the same evaluation items.
- Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
- Store raw outputs so future graders can be replayed without querying the model again.
- Document whether the metric is measuring capability, reliability, user value, or risk.
Non-examples:
- A leaderboard point estimate without sample size.
- A benchmark score produced with an undocumented prompt template.
- A model-graded result without judge identity, rubric, or agreement check.
- A robustness claim measured only on the easiest in-distribution examples.
- An online win declared before the randomization and logging checks pass.
Worked evaluation pattern for budgeting regression tests:
- Define the evaluation population in words before writing code.
- Choose the smallest metric set that answers the decision question.
- Compute the point estimate and an uncertainty statement together.
- Run a slice or paired analysis to check whether the aggregate hides structure.
- Archive raw outputs, scores, and seeds before changing the prompt or grader.
For AI systems, budgeting regression tests is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.
| AI connection | Evaluation consequence |
|---|---|
| Prompting | Treat prompt templates as part of the protocol, not as invisible setup |
| Decoding | Temperature and sampling change both mean score and variance |
| Retrieval | Retrieved context creates an extra source of failure and leakage |
| Tool use | Tool errors need separate attribution from model reasoning errors |
| Safety layer | Guardrail behavior can improve risk metrics while changing capability metrics |
Implementation checklist:
- Use deterministic seeds for synthetic or sampled evaluation subsets.
- Print metric denominators, not only percentages.
- Keep missing, invalid, timeout, and refusal outcomes explicit.
- Prefer typed result records over loose CSV columns.
- Separate raw model outputs from normalized grader inputs.
- Track the smallest reproducible command that generated the result.
- Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
- Write the decision rule before seeing the final score whenever the result will guide a release.
The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. Budgeting regression tests is one place where that habit becomes concrete.
7.5 Closing the debug loop
Closing the debug loop is part of the canonical scope of error analysis and ablations. In this chapter, the object under study is not merely a dataset or a model, but the full failure slice and causal comparison: the items, prompts, outputs, graders, uncertainty statements, and decision rules that turn model behavior into evidence.
The basic mathematical pattern is an empirical estimator. For a model or system evaluated on items , the local estimate is written
The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For closing the debug loop, those choices determine whether the reported number is evidence or decoration.
A useful invariant is that every evaluation claim should be reproducible as a tuple , where is the system, is the task sample, is the prompt or intervention policy, is the grader, and is the aggregation rule. If any part of this tuple is missing, the number cannot be audited.
| Component | What to record | Why it matters |
|---|---|---|
| Item definition | IDs, source, split, and allowed transformations | Prevents accidental drift in closing the debug loop |
| Scoring rule | Exact formula for \mathbb{1}{e_i \in E} | Makes comparisons repeatable |
| Aggregation | Mean, weighted mean, worst group, or pairwise model | Determines the scientific claim |
| Uncertainty | Standard error, interval, or posterior summary | Separates signal from sampling noise |
| Audit trail | Code version and random seeds | Makes failures debuggable |
Examples of correct use:
- Report closing the debug loop with item count, prompt protocol, grader version, and a confidence interval.
- Use paired comparisons when two models answer the same evaluation items.
- Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
- Store raw outputs so future graders can be replayed without querying the model again.
- Document whether the metric is measuring capability, reliability, user value, or risk.
Non-examples:
- A leaderboard point estimate without sample size.
- A benchmark score produced with an undocumented prompt template.
- A model-graded result without judge identity, rubric, or agreement check.
- A robustness claim measured only on the easiest in-distribution examples.
- An online win declared before the randomization and logging checks pass.
Worked evaluation pattern for closing the debug loop:
- Define the evaluation population in words before writing code.
- Choose the smallest metric set that answers the decision question.
- Compute the point estimate and an uncertainty statement together.
- Run a slice or paired analysis to check whether the aggregate hides structure.
- Archive raw outputs, scores, and seeds before changing the prompt or grader.
For AI systems, closing the debug loop is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.
| AI connection | Evaluation consequence |
|---|---|
| Prompting | Treat prompt templates as part of the protocol, not as invisible setup |
| Decoding | Temperature and sampling change both mean score and variance |
| Retrieval | Retrieved context creates an extra source of failure and leakage |
| Tool use | Tool errors need separate attribution from model reasoning errors |
| Safety layer | Guardrail behavior can improve risk metrics while changing capability metrics |
Implementation checklist:
- Use deterministic seeds for synthetic or sampled evaluation subsets.
- Print metric denominators, not only percentages.
- Keep missing, invalid, timeout, and refusal outcomes explicit.
- Prefer typed result records over loose CSV columns.
- Separate raw model outputs from normalized grader inputs.
- Track the smallest reproducible command that generated the result.
- Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
- Write the decision rule before seeing the final score whenever the result will guide a release.
The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. Closing the debug loop is one place where that habit becomes concrete.
8. Common Mistakes
| # | Mistake | Why It Is Wrong | Fix |
|---|---|---|---|
| 1 | Treating a point estimate as exact | Every finite evaluation has sampling error in error analysis and ablations. | Report uncertainty with the point estimate. |
| 2 | Changing prompts between models | The protocol changed with the treatment in error analysis and ablations. | Lock prompt, decoding, and grader before comparison. |
| 3 | Ignoring invalid outputs | Missingness can be correlated with model quality in error analysis and ablations. | Track invalid, timeout, refusal, and parse-failure rates. |
| 4 | Overfitting to a public leaderboard | Repeated testing leaks information from the benchmark in error analysis and ablations. | Use private holdouts and regression suites. |
| 5 | Averaging incomparable metrics | Different scales do not have shared units in error analysis and ablations. | Normalize by a stated decision rule or report separately. |
| 6 | Forgetting paired structure | Two models often answer the same items in error analysis and ablations. | Use paired bootstrap or paired tests where possible. |
| 7 | Reporting only aggregate performance | Subgroup failures can be hidden in error analysis and ablations. | Add slice and tail-risk views. |
| 8 | Trusting model judges blindly | LLM judges have position, verbosity, and self-preference biases in error analysis and ablations. | Calibrate judges against human labels. |
| 9 | Peeking during online experiments | Optional stopping inflates false positives in error analysis and ablations. | Use fixed horizons or sequential-valid methods. |
| 10 | Conflating evaluation with monitoring | Chapter 17 measures controlled evidence; production monitoring is ongoing operations in error analysis and ablations. | Hand off drift dashboards to Chapter 19 concepts. |
9. Exercises
-
(*) Aggregate metrics hide failure modes. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(*) Failures as structured data. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(*) Ablations as causal probes. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(**) Debugging without benchmark overfitting. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(**) From one bug to a regression suite. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(**) Error set. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(***) Confusion matrix. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(***) Slice and subgroup. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(***) Counterfactual example. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
-
(***) Ablation effect and interaction. (a) Define the relevant evaluation object. (b) Write the estimator in LaTeX notation. (c) Give one example where the estimator is reliable. (d) Give one example where the same number would be misleading. (e) Describe what the theory notebook should verify computationally.
10. Why This Matters for AI
| Concept | AI Impact |
|---|---|
| Protocol as measurement | Prevents hidden prompt or grader changes from masquerading as model progress |
| Uncertainty intervals | Keeps model rankings honest when differences are smaller than sampling noise |
| Slice metrics | Reveals failures on languages, domains, formats, or user groups hidden by averages |
| Calibration | Lets systems decide when to answer, abstain, ask for help, or escalate |
| Robustness | Tests whether behavior survives realistic perturbations and distribution shift |
| Ablations | Separates real improvements from accidental metric movement |
| Online tests | Measures causal user impact rather than offline proxy success |
| Audit trails | Turns evaluation from a screenshot into reproducible scientific evidence |
11. Conceptual Bridge
This section sits after the training-data pipeline because evaluation depends on clean holdouts, contamination audits, and well-documented data provenance. It does not repeat those pipeline mechanics; it consumes their outputs as the basis for credible measurement.
It also sits before alignment and production chapters. Alignment asks how to shape model behavior with supervised data, preferences, policies, and feedback. Production MLOps asks how deployed systems are observed and maintained over time. Error Analysis and Ablations supplies the measurement discipline both chapters need.
The recurring mathematical pattern is empirical risk with uncertainty. Whether the object is a benchmark item, a calibrated probability, a shifted subgroup, an ablation comparison, or an online treatment effect, the learner should ask: what distribution generated this evidence, what estimator did we compute, and what decision is justified by the uncertainty?
16 Data Pipeline
-> clean eval data, manifests, decontamination
17 Evaluation and Reliability
-> benchmarks, calibration, robustness, ablations, online tests
18 Alignment and Safety
-> SFT, preferences, policies, human feedback
19 Production ML and MLOps
-> monitoring, serving, retraining, observability