Capability Benchmarks: Part 3: Benchmark Design

3. Benchmark Design

Benchmark design covers the choices that turn a capability benchmark into a concrete protocol: the task taxonomy, the item sample, the prompt policy, the graders, and the provenance records. The subsections below keep the focus on Chapter 17's canonical job: measurement, reliability, uncertainty, and decision support for AI systems.

3.1 Task taxonomy and coverage

Task taxonomy and coverage determine which capabilities a benchmark can make claims about: the taxonomy names the task categories, and coverage records how many items fall into each. In this chapter, the object under study is not merely a dataset or a model, but the full benchmark protocol: the items, prompts, outputs, graders, uncertainty statements, and decision rules that turn model behavior into evidence.

The basic mathematical pattern is an empirical estimator. For a model or system $m$ evaluated on items $z_1,\ldots,z_n$ drawn from task $t$, with $s_m(z_i)$ the score that $m$ receives on item $z_i$, the local estimate is written

$$\hat{\mu}_{m,t} = \frac{1}{n}\sum_{i=1}^n s_m(z_i).$$

The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For task taxonomy and coverage, those choices determine whether the reported number is evidence or decoration.
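As a minimal sketch of the estimator above, the following Python computes the mean score together with a standard error and a normal-approximation interval. The score list is hypothetical, and the interval assumes the items are independent draws; with correlated items or a small $n$ it is only a rough guide.

```python
import math
from statistics import mean, stdev

def estimate_with_se(scores):
    """Return (mean, standard error, n) for per-item scores s_m(z_i)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two items to estimate a standard error")
    mu_hat = mean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    se = stdev(scores) / math.sqrt(n)
    return mu_hat, se, n

# Hypothetical 0/1 correctness scores for ten items.
scores = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
mu_hat, se, n = estimate_with_se(scores)
low, high = mu_hat - 1.96 * se, mu_hat + 1.96 * se
print(f"accuracy = {mu_hat:.2f} (n={n}), approx. 95% interval [{low:.2f}, {high:.2f}]")
```

Printing the item count next to the estimate keeps the denominator visible whenever the number is quoted.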

A useful invariant is that every evaluation claim should be reproducible as a tuple $(m,\mathcal{T},\pi,g,\rho)$, where $m$ is the system, $\mathcal{T}$ is the task sample, $\pi$ is the prompt or intervention policy, $g$ is the grader, and $\rho$ is the aggregation rule. If any part of this tuple is missing, the number cannot be audited.
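One way to make the tuple checkable is to store it as a typed record and refuse to report a number when a component is absent. The sketch below is illustrative only: the class name, field names, and example identifiers are assumptions, not part of any particular evaluation framework.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass(frozen=True)
class EvalClaim:
    system: Optional[str]         # m: model plus configuration identifier
    task_sample: Optional[str]    # T: identifier or hash of the item set
    prompt_policy: Optional[str]  # pi: prompt template or intervention version
    grader: Optional[str]         # g: grader name and version
    aggregation: Optional[str]    # rho: for example "item-weighted mean"

def missing_components(claim):
    """List the tuple components that are absent or empty."""
    return [f.name for f in fields(claim) if not getattr(claim, f.name)]

# Hypothetical claim with an undocumented prompt policy.
claim = EvalClaim("model-a@temp0", "tasks-v1", None, "exact-match-v2", "item-weighted mean")
print(missing_components(claim))  # ['prompt_policy']
```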

Component	What to record	Why it matters
Item definition	IDs, source, split, and allowed transformations	Prevents silent drift in the task categories and their items
Scoring rule	Exact formula for $s_m(z_i)$	Makes comparisons repeatable
Aggregation	Mean, weighted mean, worst group, or pairwise model	Determines the scientific claim
Uncertainty	Standard error, interval, or posterior summary	Separates signal from sampling noise
Audit trail	Code version and random seeds	Makes failures debuggable

Examples of correct use:

Report per-category results with item count, prompt protocol, grader version, and a confidence interval.
Use paired comparisons when two models answer the same evaluation items.
Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
Store raw outputs so future graders can be replayed without querying the model again.
Document whether the metric is measuring capability, reliability, user value, or risk.

Non-examples:

A leaderboard point estimate without sample size.
A benchmark score produced with an undocumented prompt template.
A model-graded result without judge identity, rubric, or agreement check.
A robustness claim measured only on the easiest in-distribution examples.
An online win declared before the randomization and logging checks pass.

Worked evaluation pattern for task taxonomy and coverage:

Define the evaluation population in words before writing code.
Choose the smallest metric set that answers the decision question.
Compute the point estimate and an uncertainty statement together.
Run a slice or paired analysis to check whether the aggregate hides structure.
Archive raw outputs, scores, and seeds before changing the prompt or grader.
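Before any scores are computed, coverage itself can be checked. The following sketch counts items per taxonomy category and flags categories that are thin or not in the taxonomy at all. The category names, the `min_items` threshold, and the record layout are hypothetical.

```python
from collections import Counter

def coverage_report(items, taxonomy, min_items):
    """Count items per taxonomy category and flag thin or unknown categories."""
    counts = Counter(item["category"] for item in items)
    per_category = {cat: counts.get(cat, 0) for cat in taxonomy}
    under_covered = [cat for cat, k in per_category.items() if k < min_items]
    unknown = sorted(set(counts) - set(taxonomy))
    return per_category, under_covered, unknown

# Hypothetical taxonomy and item list.
taxonomy = ["arithmetic", "retrieval", "code"]
items = [
    {"id": "a1", "category": "arithmetic"},
    {"id": "a2", "category": "arithmetic"},
    {"id": "a3", "category": "arithmetic"},
    {"id": "r1", "category": "retrieval"},
    {"id": "x1", "category": "trivia"},
]
per_category, under_covered, unknown = coverage_report(items, taxonomy, min_items=2)
print(per_category)   # items per declared category
print(under_covered)  # categories below the threshold
print(unknown)        # labels that are not in the taxonomy
```

An aggregate score over this item set would mostly reflect arithmetic, which is the kind of structure a slice analysis is meant to expose.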

For AI systems, task taxonomy and coverage is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.

AI connection	Evaluation consequence
Prompting	Treat prompt templates as part of the protocol, not as invisible setup
Decoding	Temperature and sampling change both mean score and variance
Retrieval	Retrieved context creates an extra source of failure and leakage
Tool use	Tool errors need separate attribution from model reasoning errors
Safety layer	Guardrail behavior can improve risk metrics while changing capability metrics

Implementation checklist:

Use deterministic seeds for synthetic or sampled evaluation subsets.
Print metric denominators, not only percentages.
Keep missing, invalid, timeout, and refusal outcomes explicit.
Prefer typed result records over loose CSV columns.
Separate raw model outputs from normalized grader inputs.
Track the smallest reproducible command that generated the result.
Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
Write the decision rule before seeing the final score whenever the result will guide a release.
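Several checklist items (typed records, explicit non-answer outcomes, printed denominators) can be combined in a small result schema. This is a sketch under assumptions: the outcome categories mirror the checklist, the record fields are hypothetical, and counting refusals and timeouts in the denominator is one possible policy rather than the only defensible one.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class Outcome(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    REFUSAL = "refusal"
    TIMEOUT = "timeout"
    INVALID = "invalid"

@dataclass(frozen=True)
class Result:
    item_id: str
    outcome: Outcome
    raw_output: str  # kept verbatim so a later grader can be replayed

def summarize(results):
    """Return counts with an explicit denominator; non-answers stay in n."""
    counts = Counter(r.outcome for r in results)
    return {
        "correct": counts[Outcome.CORRECT],
        "n": len(results),
        "by_outcome": {o.value: counts[o] for o in Outcome},
    }

# Hypothetical results for five items.
results = [
    Result("q1", Outcome.CORRECT, "4"),
    Result("q2", Outcome.INCORRECT, "5"),
    Result("q3", Outcome.REFUSAL, "I cannot help with that."),
    Result("q4", Outcome.TIMEOUT, ""),
    Result("q5", Outcome.CORRECT, "Paris"),
]
summary = summarize(results)
print(f"accuracy = {summary['correct']}/{summary['n']}")
print(summary["by_outcome"])
```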

The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. Task taxonomy and coverage is one place where that habit becomes concrete.

3.2 Dataset sampling and item independence

Dataset sampling and item independence determine which population a score generalizes to and how much independent information each item contributes. As in 3.1, the object under study is the full benchmark protocol, not just a dataset or a model.

The basic mathematical pattern is an empirical estimator. For a model or system $m$ evaluated on items $z_1,\ldots,z_n$ drawn from task $t$, with $s_m(z_i)$ the score that $m$ receives on item $z_i$, the local estimate is written

$$\hat{\mu}_{m,t} = \frac{1}{n}\sum_{i=1}^n s_m(z_i).$$

The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For dataset sampling and item independence, those choices determine whether the reported number is evidence or decoration.
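When items share a source (several questions about the same passage, for example), they are not independent, and a standard error that treats them as independent is too small. A cluster bootstrap, which resamples whole groups rather than single items, is one common way to account for this. The sketch below uses hypothetical scores grouped by source document; the cluster labels and the number of resamples are assumptions.

```python
import math
import random
from statistics import mean, stdev

def cluster_bootstrap_se(scores_by_cluster, n_boot=2000, seed=0):
    """Standard error of the mean score, resampling whole clusters."""
    rng = random.Random(seed)
    clusters = list(scores_by_cluster.values())
    estimates = []
    for _ in range(n_boot):
        # Draw as many clusters as we have, with replacement.
        resampled = rng.choices(clusters, k=len(clusters))
        estimates.append(mean(s for cluster in resampled for s in cluster))
    return stdev(estimates)

# Hypothetical data: four questions per document, perfectly correlated within it.
clusters = {
    "doc1": [1, 1, 1, 1],
    "doc2": [0, 0, 0, 0],
    "doc3": [1, 1, 1, 1],
    "doc4": [0, 0, 0, 0],
    "doc5": [1, 1, 1, 1],
}
flat = [s for c in clusters.values() for s in c]
naive_se = stdev(flat) / math.sqrt(len(flat))
cluster_se = cluster_bootstrap_se(clusters)
print(f"naive SE = {naive_se:.3f}, cluster bootstrap SE = {cluster_se:.3f}")
```

In this extreme example the twenty items carry roughly the information of five, so the cluster-aware standard error comes out clearly larger than the naive one.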

Component	What to record	Why it matters
Item definition	IDs, source, split, and allowed transformations	Prevents silent drift in the sampled item set
Scoring rule	Exact formula for $s_m(z_i)$	Makes comparisons repeatable
Aggregation	Mean, weighted mean, worst group, or pairwise model	Determines the scientific claim
Uncertainty	Standard error, interval, or posterior summary	Separates signal from sampling noise
Audit trail	Code version and random seeds	Makes failures debuggable

Examples of correct use:

Report each score with the sampling procedure, item count, prompt protocol, grader version, and a confidence interval.
Use paired comparisons when two models answer the same evaluation items.
Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
Store raw outputs so future graders can be replayed without querying the model again.
Document whether the metric is measuring capability, reliability, user value, or risk.
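The paired-comparison advice above can be sketched as follows: when two models answer the same items, the per-item score differences are analyzed directly, which removes the shared item-difficulty variance. The score lists are hypothetical, and the standard error again assumes independent items.

```python
import math
from statistics import mean, stdev

def paired_difference(scores_a, scores_b):
    """Mean per-item difference (A minus B) and its standard error."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs the same items for both models")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    d_bar = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))
    return d_bar, se

# Hypothetical 0/1 scores for two models on the same eight items, in the same order.
scores_a = [1, 1, 0, 1, 1, 0, 1, 1]
scores_b = [1, 0, 0, 1, 0, 0, 1, 1]
d_bar, se = paired_difference(scores_a, scores_b)
print(f"mean difference = {d_bar:.3f}, SE = {se:.3f}")
```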

Non-examples:

A leaderboard point estimate without sample size.
A benchmark score produced with an undocumented prompt template.
A model-graded result without judge identity, rubric, or agreement check.
A robustness claim measured only on the easiest in-distribution examples.
An online win declared before the randomization and logging checks pass.

Worked evaluation pattern for dataset sampling and item independence:

Define the evaluation population in words before writing code.
Choose the smallest metric set that answers the decision question.
Compute the point estimate and an uncertainty statement together.
Run a slice or paired analysis to check whether the aggregate hides structure.
Archive raw outputs, scores, and seeds before changing the prompt or grader.

For AI systems, dataset sampling and item independence is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.

AI connection	Evaluation consequence
Prompting	Treat prompt templates as part of the protocol, not as invisible setup
Decoding	Temperature and sampling change both mean score and variance
Retrieval	Retrieved context creates an extra source of failure and leakage
Tool use	Tool errors need separate attribution from model reasoning errors
Safety layer	Guardrail behavior can improve risk metrics while changing capability metrics

Implementation checklist:

Use deterministic seeds for synthetic or sampled evaluation subsets.
Print metric denominators, not only percentages.
Keep missing, invalid, timeout, and refusal outcomes explicit.
Prefer typed result records over loose CSV columns.
Separate raw model outputs from normalized grader inputs.
Track the smallest reproducible command that generated the result.
Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
Write the decision rule before seeing the final score whenever the result will guide a release.
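For the first checklist item, a seeded sampler makes an evaluation subset reproducible. The sketch below sorts the IDs before sampling so the result depends only on the set of IDs, the subset size, and the seed, not on the order in which items were loaded. The ID format and seed value are hypothetical.

```python
import random

def sample_items(item_ids, k, seed):
    """Draw a reproducible subset of k item IDs."""
    rng = random.Random(seed)      # local generator; global random state is untouched
    ordered = sorted(item_ids)     # remove dependence on input order
    return sorted(rng.sample(ordered, k))

# Hypothetical pool of twenty item IDs.
item_ids = [f"item-{i:03d}" for i in range(20)]
subset = sample_items(item_ids, k=5, seed=17)
print(subset)
```

Recording the seed and the subset size alongside the score is then enough to regenerate the exact subset later.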

The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. Dataset sampling and item independence is one place where that habit becomes concrete.

3.3 Prompt templates and few-shot policy

Prompt templates and few-shot policy fix how each item is presented to the model: the wording wrapped around the item and which worked examples, if any, precede it. As in 3.1, the object under study is the full benchmark protocol, not just a dataset or a model.

The basic mathematical pattern is an empirical estimator. For a model or system $m$ evaluated on items $z_1,\ldots,z_n$ drawn from task $t$, with $s_m(z_i)$ the score that $m$ receives on item $z_i$, the local estimate is written

$$\hat{\mu}_{m,t} = \frac{1}{n}\sum_{i=1}^n s_m(z_i).$$

The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For prompt templates and few-shot policy, those choices determine whether the reported number is evidence or decoration.
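Because the prompt template is part of what is being measured, it helps to give each template a fingerprint that travels with every reported score. The sketch below uses the standard-library `string.Template` and a truncated SHA-256 digest; the template text, the placeholder name, and the 12-character truncation are illustrative assumptions.

```python
import hashlib
from string import Template

# Hypothetical zero-shot template with a single placeholder.
TEMPLATE = Template("Question: $question\nAnswer:")

def template_fingerprint(template):
    """Short, stable identifier derived from the exact template text."""
    digest = hashlib.sha256(template.template.encode("utf-8")).hexdigest()
    return digest[:12]

def render(template, item):
    """Fill the template; substitute() raises KeyError if a placeholder is missing."""
    return template.substitute(question=item["question"])

item = {"id": "q1", "question": "What is 2 + 2?"}
prompt = render(TEMPLATE, item)
print(template_fingerprint(TEMPLATE))
print(prompt)
```

Any edit to the template, even a changed space, yields a different fingerprint, so two scores with different fingerprints are visibly not comparable without further checks.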

Component	What to record	Why it matters
Item definition	IDs, source, split, and allowed transformations	Prevents silent drift in the items each prompt template is applied to
Scoring rule	Exact formula for $s_m(z_i)$	Makes comparisons repeatable
Aggregation	Mean, weighted mean, worst group, or pairwise model	Determines the scientific claim
Uncertainty	Standard error, interval, or posterior summary	Separates signal from sampling noise
Audit trail	Code version and random seeds	Makes failures debuggable

Examples of correct use:

Report each score with the prompt template, few-shot policy, item count, grader version, and a confidence interval.
Use paired comparisons when two models answer the same evaluation items.
Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
Store raw outputs so future graders can be replayed without querying the model again.
Document whether the metric is measuring capability, reliability, user value, or risk.

Non-examples:

A leaderboard point estimate without sample size.
A benchmark score produced with an undocumented prompt template.
A model-graded result without judge identity, rubric, or agreement check.
A robustness claim measured only on the easiest in-distribution examples.
An online win declared before the randomization and logging checks pass.

Worked evaluation pattern for prompt templates and few-shot policy:

Define the evaluation population in words before writing code.
Choose the smallest metric set that answers the decision question.
Compute the point estimate and an uncertainty statement together.
Run a slice or paired analysis to check whether the aggregate hides structure.
Archive raw outputs, scores, and seeds before changing the prompt or grader.

For AI systems, prompt templates and few-shot policy is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.
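A few-shot policy is one piece of that system configuration. As a sketch, the function below picks demonstration examples deterministically per test item and never includes the test item itself among its own demonstrations. The pool layout, the seed scheme, and the function name are hypothetical; a real policy might also balance labels or draw only from a separate development split.

```python
import random

def choose_shots(pool, test_item_id, k, seed):
    """Pick k demonstrations for one test item, reproducibly and without leakage."""
    candidates = sorted(
        (ex for ex in pool if ex["id"] != test_item_id),
        key=lambda ex: ex["id"],
    )
    if len(candidates) < k:
        raise ValueError("not enough demonstration examples in the pool")
    # String seeds are hashed deterministically by random.Random.
    rng = random.Random(f"{seed}:{test_item_id}")
    return rng.sample(candidates, k)

# Hypothetical demonstration pool.
pool = [{"id": f"q{i}", "question": f"question {i}", "answer": f"answer {i}"} for i in range(1, 7)]
shots = choose_shots(pool, test_item_id="q3", k=2, seed=0)
print([ex["id"] for ex in shots])
```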

AI connection	Evaluation consequence
Prompting	Treat prompt templates as part of the protocol, not as invisible setup
Decoding	Temperature and sampling change both mean score and variance
Retrieval	Retrieved context creates an extra source of failure and leakage
Tool use	Tool errors need separate attribution from model reasoning errors
Safety layer	Guardrail behavior can improve risk metrics while changing capability metrics

Implementation checklist:

Use deterministic seeds for synthetic or sampled evaluation subsets.
Print metric denominators, not only percentages.
Keep missing, invalid, timeout, and refusal outcomes explicit.
Prefer typed result records over loose CSV columns.
Separate raw model outputs from normalized grader inputs.
Track the smallest reproducible command that generated the result.
Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
Write the decision rule before seeing the final score whenever the result will guide a release.

The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. Prompt templates and few-shot policy is one place where that habit becomes concrete.

3.4 Grading functions and rubrics

Grading functions and rubrics define how a model output becomes a score: the grading function maps an output to a number, and the rubric states the criteria that mapping is meant to apply. As in 3.1, the object under study is the full benchmark protocol, not just a dataset or a model.

The basic mathematical pattern is an empirical estimator. For a model or system $m$ evaluated on items $z_1,\ldots,z_n$ drawn from task $t$, with $s_m(z_i)$ the score that $m$ receives on item $z_i$, the local estimate is written

$$\hat{\mu}_{m,t} = \frac{1}{n}\sum_{i=1}^n s_m(z_i).$$

The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For grading functions and rubrics, those choices determine whether the reported number is evidence or decoration.
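To make the scoring rule concrete, here is a minimal sketch of one possible grading function: exact match after normalization. The normalization steps (lowercasing, removing ASCII punctuation, collapsing whitespace) are assumptions chosen for illustration. They are themselves part of the grader and should be versioned, since, for instance, removing punctuation would make "3.14" and "314" compare equal.

```python
import re
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()

def grade_exact_match(raw_output, reference):
    """Return 1 if the normalized output equals the normalized reference, else 0."""
    return 1 if normalize(raw_output) == normalize(reference) else 0

# The raw output is kept as-is; only the grader sees the normalized form.
print(grade_exact_match("  Paris. ", "paris"))
print(grade_exact_match("Lyon", "Paris"))
```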

Component	What to record	Why it matters
Item definition	IDs, source, split, and allowed transformations	Prevents silent drift in the items being graded
Scoring rule	Exact formula for $s_m(z_i)$	Makes comparisons repeatable
Aggregation	Mean, weighted mean, worst group, or pairwise model	Determines the scientific claim
Uncertainty	Standard error, interval, or posterior summary	Separates signal from sampling noise
Audit trail	Code version and random seeds	Makes failures debuggable

Examples of correct use:

Report graded results with item count, prompt protocol, grader and rubric version, and a confidence interval.
Use paired comparisons when two models answer the same evaluation items.
Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
Store raw outputs so future graders can be replayed without querying the model again.
Document whether the metric is measuring capability, reliability, user value, or risk.

Non-examples:

A leaderboard point estimate without sample size.
A benchmark score produced with an undocumented prompt template.
A model-graded result without judge identity, rubric, or agreement check.
A robustness claim measured only on the easiest in-distribution examples.
An online win declared before the randomization and logging checks pass.
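One of the non-examples above is a model-graded result with no agreement check. A simple check is to compare the judge's labels with human labels on a shared subset and report Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The label lists below are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[lab] * count_b[lab] for lab in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail labels on ten shared items.
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
judge = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

Here raw agreement is 8 of 10, but because both raters say "pass" 60% of the time, chance agreement is 0.52 and kappa is noticeably lower than 0.8.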

Worked evaluation pattern for grading functions and rubrics:

Define the evaluation population in words before writing code.
Choose the smallest metric set that answers the decision question.
Compute the point estimate and an uncertainty statement together.
Run a slice or paired analysis to check whether the aggregate hides structure.
Archive raw outputs, scores, and seeds before changing the prompt or grader.

For AI systems, grading functions and rubrics is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.

AI connection	Evaluation consequence
Prompting	Treat prompt templates as part of the protocol, not as invisible setup
Decoding	Temperature and sampling change both mean score and variance
Retrieval	Retrieved context creates an extra source of failure and leakage
Tool use	Tool errors need separate attribution from model reasoning errors
Safety layer	Guardrail behavior can improve risk metrics while changing capability metrics

Implementation checklist:

Use deterministic seeds for synthetic or sampled evaluation subsets.
Print metric denominators, not only percentages.
Keep missing, invalid, timeout, and refusal outcomes explicit.
Prefer typed result records over loose CSV columns.
Separate raw model outputs from normalized grader inputs.
Track the smallest reproducible command that generated the result.
Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
Write the decision rule before seeing the final score whenever the result will guide a release.

The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. Grading functions and rubrics is one place where that habit becomes concrete.

3.5 Contamination flags and eval provenance

Contamination flags and eval provenance record whether an item may have been seen during training and where each item and result came from. As in 3.1, the object under study is the full benchmark protocol, not just a dataset or a model.

The basic mathematical pattern is an empirical estimator. For a model or system $m$ evaluated on items $z_1,\ldots,z_n$ drawn from task $t$, with $s_m(z_i)$ the score that $m$ receives on item $z_i$, the local estimate is written

$$\hat{\mu}_{m,t} = \frac{1}{n}\sum_{i=1}^n s_m(z_i).$$

The formula is intentionally simple. The difficulty lies in deciding what counts as an item, which loss or score is meaningful, whether the items are independent, and whether the estimate answers the real product or research question. For contamination flags and eval provenance, those choices determine whether the reported number is evidence or decoration.
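One simple heuristic for a contamination flag is word n-gram overlap between an evaluation item and a reference corpus. The sketch below is a toy version under stated assumptions: whitespace tokenization, a small in-memory corpus, and an arbitrary threshold. It detects only near-verbatim overlap, not paraphrases or translations, and real training corpora need indexed lookups rather than in-memory sets.

```python
def ngrams(text, n):
    """Set of word n-grams from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_flag(item_text, corpus_docs, n=8, threshold=0.5):
    """Return (flag, overlap), where overlap is the share of item n-grams found in the corpus."""
    item_grams = ngrams(item_text, n)
    if not item_grams:
        return False, 0.0  # item shorter than n words: no evidence either way
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    overlap = len(item_grams & corpus_grams) / len(item_grams)
    return overlap >= threshold, overlap

# Hypothetical corpus and items; n=3 keeps the toy example short.
corpus = ["quiz: what is the capital of france answer paris"]
leaked = contamination_flag("What is the capital of France", corpus, n=3)
clean = contamination_flag("Name the largest moon of Saturn", corpus, n=3)
print(leaked, clean)
```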

Component	What to record	Why it matters
Item definition	IDs, source, split, and allowed transformations	Prevents silent drift in the item set whose provenance is tracked
Scoring rule	Exact formula for $s_m(z_i)$	Makes comparisons repeatable
Aggregation	Mean, weighted mean, worst group, or pairwise model	Determines the scientific claim
Uncertainty	Standard error, interval, or posterior summary	Separates signal from sampling noise
Audit trail	Code version and random seeds	Makes failures debuggable

Examples of correct use:

Report results together with contamination flags and provenance, alongside item count, prompt protocol, grader version, and a confidence interval.
Use paired comparisons when two models answer the same evaluation items.
Inspect at least one meaningful slice before concluding that the aggregate result is reliable.
Store raw outputs so future graders can be replayed without querying the model again.
Document whether the metric is measuring capability, reliability, user value, or risk.

Non-examples:

A leaderboard point estimate without sample size.
A benchmark score produced with an undocumented prompt template.
A model-graded result without judge identity, rubric, or agreement check.
A robustness claim measured only on the easiest in-distribution examples.
An online win declared before the randomization and logging checks pass.

Worked evaluation pattern for contamination flags and eval provenance:

Define the evaluation population in words before writing code.
Choose the smallest metric set that answers the decision question.
Compute the point estimate and an uncertainty statement together.
Run a slice or paired analysis to check whether the aggregate hides structure.
Archive raw outputs, scores, and seeds before changing the prompt or grader.
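The slice step in the pattern above applies naturally to contamination flags: scoring flagged and unflagged items separately shows whether the aggregate leans on possibly memorized items. The records below are hypothetical, and the field names are assumptions.

```python
def score_by_flag(records):
    """Correct counts and denominators for clean and flagged items."""
    groups = {"clean": [], "flagged": []}
    for r in records:
        groups["flagged" if r["contaminated"] else "clean"].append(r["score"])
    return {name: {"correct": sum(s), "n": len(s)} for name, s in groups.items()}

# Hypothetical per-item records with 0/1 scores and a contamination flag.
records = [
    {"id": "q1", "score": 1, "contaminated": True},
    {"id": "q2", "score": 1, "contaminated": True},
    {"id": "q3", "score": 0, "contaminated": False},
    {"id": "q4", "score": 1, "contaminated": False},
    {"id": "q5", "score": 0, "contaminated": False},
]
split = score_by_flag(records)
for name, stats in split.items():
    print(f"{name}: {stats['correct']}/{stats['n']}")
```

With slices this small the gap is not statistically meaningful on its own; the point is that both denominators are visible next to the aggregate.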

For AI systems, contamination flags and eval provenance is especially delicate because the same model can be used with many prompts, decoding policies, tools, retrieval contexts, and safety filters. The measured quantity is therefore a property of the system configuration, not just the base weights.

AI connection	Evaluation consequence
Prompting	Treat prompt templates as part of the protocol, not as invisible setup
Decoding	Temperature and sampling change both mean score and variance
Retrieval	Retrieved context creates an extra source of failure and leakage
Tool use	Tool errors need separate attribution from model reasoning errors
Safety layer	Guardrail behavior can improve risk metrics while changing capability metrics

Implementation checklist:

Use deterministic seeds for synthetic or sampled evaluation subsets.
Print metric denominators, not only percentages.
Keep missing, invalid, timeout, and refusal outcomes explicit.
Prefer typed result records over loose CSV columns.
Separate raw model outputs from normalized grader inputs.
Track the smallest reproducible command that generated the result.
Record whether the estimate is item-weighted, token-weighted, user-weighted, or domain-weighted.
Write the decision rule before seeing the final score whenever the result will guide a release.
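Provenance can be captured in a small manifest written next to every result. The sketch below hashes a canonical JSON form of the items so that any change to the item set changes the digest, while a mere reordering does not. The manifest fields, version strings, and item layout are hypothetical.

```python
import hashlib
import json

def items_digest(items):
    """SHA-256 over a canonical JSON form, independent of item order."""
    canonical = json.dumps(
        sorted(items, key=lambda it: it["id"]),
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def build_manifest(items, code_version, seed, grader_version):
    """Collect what is needed to identify the run that produced a score."""
    return {
        "n_items": len(items),
        "items_sha256": items_digest(items),
        "code_version": code_version,
        "seed": seed,
        "grader_version": grader_version,
    }

# Hypothetical items and version identifiers.
items = [{"id": "q2", "text": "2 + 2 = ?"}, {"id": "q1", "text": "Capital of France?"}]
manifest = build_manifest(items, code_version="abc1234", seed=17, grader_version="exact-match-v2")
print(json.dumps(manifest, indent=2))
```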

The mathematical habit to build is skepticism with structure. A score is not ignored because it is noisy; it is interpreted through the design that produced it. Contamination flags and eval provenance is one place where that habit becomes concrete.
