Math for LLMs

Quality Checks

LLM Training Data Pipeline / Quality Checks

"Filtering is not cleaning; it is choosing which errors the model is allowed to learn from."

Overview

Quality checks estimate which records increase effective training signal and which records inject noise, risk, or distributional distortion. In an LLM training run, data is not an inert pile of text; it is the empirical distribution that defines the examples, losses, risks, and capabilities the model will see.

This section is written as LaTeX Markdown. Inline mathematics uses $...$ and display equations use $$...$$. The goal is to connect data engineering decisions to mathematical objects such as records $r_i$, token sequences $x_{1:T}$, filters $f(x)$, hashes $h(x)$, mixture weights $\boldsymbol{\alpha}$, and empirical expectations.

The scope is deliberately narrow: this chapter owns the training-data pipeline. Tokenizer design, GPU training systems, benchmark methodology, alignment objectives, and production MLOps each have their own canonical chapters. Here we study the data objects that those later systems consume.

Prerequisites

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Executable demonstrations for quality checks |
| exercises.ipynb | Graded practice for quality checks |

Learning Objectives

After completing this section, you will be able to:

  • Define quality scores, filter functions, acceptance rates, and filter cascades
  • Implement length, repetition, language, and character-ratio filters
  • Explain model-based quality filtering and threshold calibration
  • Analyze the tradeoff between quality, toxicity, diversity, and coverage
  • Detect PII-like and secret-like patterns with conservative regex audits
  • Summarize filter behavior by source, language, length, and time slice
  • Design human audit rubrics and filter ablation reports
  • Connect quality filtering to effective token count $D_{\mathrm{eff}}$

Table of Contents


1. Intuition

This section builds the intuition behind quality checks. Read the local variables as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

1.1 Quality as effective token multiplier $D_{\mathrm{eff}} = qD$

If a pretraining corpus contains $D$ raw tokens, not all of them contribute equally to learning. A convenient abstraction is an average quality factor $q \in [0,1]$ that converts raw tokens into effective tokens:

$$D_{\mathrm{eff}} = qD.$$

Under this view, filtering is not about shrinking the corpus; it is about raising $q$ faster than it lowers $D$. We model the corpus as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$, and we ask of every transformation whether it preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

The invariant should be explicit enough that a checker can fail fast. If it is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this collection in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter a stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
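The document-weighted and token-weighted rates can be computed side by side. A minimal sketch on synthetic records; `Record` and `length_ok` are illustrative stand-ins, not part of any pipeline API:

```python
from dataclasses import dataclass

@dataclass
class Record:
    text: str
    tokens: int  # deterministic token-count estimate T_i

def length_ok(r: Record, min_tokens: int = 5) -> bool:
    # Hypothetical filter f(r): keep records with enough tokens.
    return r.tokens >= min_tokens

records = [
    Record("short", 2),
    Record("a medium length document " * 4, 20),
    Record("a long document " * 12, 200),
]

kept = [r for r in records if length_ok(r)]

# Document-weighted acceptance rate a = n_out / n_in.
a_doc = len(kept) / len(records)

# Token-weighted acceptance rate a_tok = sum_i f(r_i) T_i / sum_i T_i.
a_tok = sum(r.tokens for r in kept) / sum(r.tokens for r in records)

print(f"a_doc={a_doc:.3f} a_tok={a_tok:.3f}")
```

Here one short record is one third of the documents but under one percent of the tokens, so the two rates diverge sharply.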

1.2 Filtering as precision/recall tradeoff

Every filter is an implicit binary classifier of a latent label: each record $r_i$ is "good" or "bad" for training, and nobody observes that label directly. Take "flag for removal" as the positive class. Precision is the fraction of removed records that are truly bad; recall is the fraction of truly bad records that get removed:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}.$$

Tightening a threshold typically raises precision and lowers recall; loosening it does the reverse. The asymmetry of the two errors drives the operating point: low recall leaves noise, spam, and boilerplate in the corpus, while low precision throws away good tokens and distorts the distribution of what remains. Because the true labels are latent, both quantities must be estimated from human audits of samples drawn from the kept and removed sets, stratified by score so the contested boundary region is actually represented.
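Given audit labels on a sample, the estimate is a direct count. A minimal sketch; the toy `flagged`/`bad` arrays stand in for real filter decisions and human judgments:

```python
def precision_recall(flagged, bad):
    """Estimate filter precision/recall from an audited sample.

    flagged[i]: the filter marked record i for removal (predicted positive).
    bad[i]:     a human auditor judged record i low quality (true positive).
    """
    tp = sum(1 for f, b in zip(flagged, bad) if f and b)
    fp = sum(1 for f, b in zip(flagged, bad) if f and not b)
    fn = sum(1 for f, b in zip(flagged, bad) if b and not f)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

# Toy audit: 8 records, the filter flags 4, the auditor agrees on 3 of them
# and finds one bad record the filter missed.
flagged = [1, 1, 1, 1, 0, 0, 0, 0]
bad     = [1, 1, 1, 0, 1, 0, 0, 0]
p, r = precision_recall(flagged, bad)
print(f"precision={p:.2f} recall={r:.2f}")
```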

1.3 Quality vs diversity

A corpus can score well on every per-document quality metric and still be a poor training set if filtering collapses its diversity. Model-based quality classifiers are trained to recognize some reference distribution, often formal, high-resource prose, so aggressive thresholds systematically downweight dialects, forums, informal registers, and low-resource languages. A cheap distributional check is the entropy of the source or domain distribution before and after filtering,

$$H(p) = -\sum_{s} p_s \log p_s,$$

where $p_s$ is the token share of source $s$. A large entropy drop after filtering means the filter is acting as a domain selector, not just a quality selector. Per-slice acceptance rates give the same warning at finer grain: a filter with overall acceptance 0.9 but acceptance 0.2 on one language is a diversity failure, whatever the aggregate quality numbers say.
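The entropy check above fits in a few lines. A sketch over hypothetical per-document source labels; the `web`/`forum`/`code` labels and counts are invented for illustration:

```python
import math
from collections import Counter

def source_entropy(sources):
    """Shannon entropy (in nats) of the empirical source distribution."""
    counts = Counter(sources)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Source labels before and after a filter that (hypothetically) gutted
# the forum and code slices while barely touching web text.
before = ["web"] * 50 + ["forum"] * 30 + ["code"] * 20
after  = ["web"] * 48 + ["forum"] * 5 + ["code"] * 2

h_before = source_entropy(before)
h_after = source_entropy(after)
print(f"H_before={h_before:.3f} H_after={h_after:.3f}")
```

The drop in entropy, despite a high overall acceptance rate, is exactly the diversity warning described above.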

1.4 Safety vs capability

Safety filters (toxicity, PII, secrets) and capability collide because the risky slices of a corpus overlap with slices that carry signal. A blunt toxicity blocklist removes abuse, but also medical, legal, and harm-reduction text that merely mentions the blocked terms. A maximal PII purge removes leaked personal data, but also the ordinary addresses, phone formats, and identifiers that let a model handle forms and code. The practical stance is to separate detection from deletion: detect conservatively, route matches to audit queues, and make every drop attributable to a named rule. Each safety filter should then report two numbers, a risk-reduction estimate (flag rate on a known-risky audit slice) and a capability cost (token loss per domain), so the tradeoff is explicit rather than implicit.
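A conservative audit pass can be sketched as below. The two regexes, a generic email shape and an `AKIA`-prefixed access-key-like shape, are deliberately narrow illustrative patterns, not a vetted detector set; the point is that matches are flagged for review rather than silently dropped:

```python
import re

# Deliberately narrow, illustrative patterns. The goal is a conservative
# audit that flags records for human review, not a silent deletion pass.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "aws_key_like": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def audit_record(text: str) -> list[str]:
    """Return the names of patterns that match; an empty list means clean."""
    return [name for name, pat in PATTERNS.items() if pat.search(text)]

docs = [
    "contact me at jane.doe@example.com for details",
    "aws_access_key_id = AKIAABCDEFGHIJKLMNOP",
    "plain prose with no sensitive content",
]
flags = [audit_record(d) for d in docs]
print(flags)
```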

1.5 Lessons from C4, Dolma, FineWeb, and DCLM

Public pretraining corpora are the best available case studies in filter design.

  • C4 applied aggressive rule-based heuristics: terminal-punctuation line filtering, a bad-words blocklist, boilerplate removal, minimum-length rules. Later audits showed the blocklist disproportionately removed non-offensive text about and by minority groups, a reminder that precision failures concentrate in specific slices.
  • Dolma emphasized transparency: a documented pipeline with released open-source tooling, so every filter decision is reproducible and auditable rather than folklore.
  • FineWeb selected filters by ablation: candidate filters were justified by training small models on filtered and unfiltered variants and comparing downstream metrics, not by intuition alone.
  • DCLM turned data curation itself into a benchmark with a fixed training recipe, and found that a simple model-based quality classifier outperformed many handcrafted heuristic stacks.

The common thread: filters are hypotheses. The strongest pipelines treat them as experiments with controls, ablations, and audits, not as fixed preprocessing.

2. Formal Definitions

This section pins down the formal objects behind quality checks: scores, filters, acceptance rates, error types, and cascades. As throughout, local variables denote pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

2.1 Quality score $q(x)$

A quality score is a measurable function $q : \mathcal{X} \to \mathbb{R}$ on document content, with the convention that larger is better. Scores come from heuristics (length, repetition, symbol ratios), from perplexity under a reference language model, or from a trained classifier that predicts membership in a high-quality reference corpus. The score itself makes no decision; decisions come from comparing $q(x)$ to a threshold $\tau$, and the operating point should be chosen against an explicit budget, either a target token acceptance rate or an audited precision target. Calibrating $\tau$ as a quantile of the empirical score distribution is often more stable across crawls than fixing an absolute score value, because classifier scores drift as the input distribution drifts.
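Quantile-based calibration can be sketched in a few lines; the score values below are synthetic, and the keep-count rounding policy is one arbitrary but reasonable choice:

```python
def calibrate_threshold(scores, target_accept):
    """Pick tau so that roughly a target fraction of records pass q(x) >= tau.

    Uses the empirical quantile of the scores: a quantile-based tau tracks
    score drift across crawls better than a fixed absolute cutoff.
    """
    ranked = sorted(scores)
    n_keep = round(target_accept * len(ranked))  # records we intend to keep
    n_keep = min(max(n_keep, 1), len(ranked))
    return ranked[len(ranked) - n_keep]

scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
tau = calibrate_threshold(scores, target_accept=0.3)
kept = [s for s in scores if s >= tau]
print(tau, len(kept) / len(scores))
```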

2.2 Filter function $f(x) \in \{0,1\}$

A filter is a binary decision $f : \mathcal{X} \to \{0, 1\}$, with $f(x) = 1$ meaning keep. A thresholded score induces a filter,

$$f_\tau(x) = \mathbf{1}[\, q(x) \ge \tau \,],$$

and the kept set is $\mathcal{D}' = \{\, r_i \in \mathcal{D} : f(x_i) = 1 \,\}$. Two properties matter operationally. First, determinism: $f$ must depend only on the record and versioned configuration, never on wall-clock time, worker identity, or iteration order, or reruns will not reproduce. Second, locality: a record-level filter shards trivially, whereas a filter that depends on corpus statistics (for example a quantile threshold) needs a two-pass design in which the statistic is computed, frozen, and versioned before the filtering pass.

2.3 Acceptance rate

For a stage that takes $n_{\mathrm{in}}$ records to $n_{\mathrm{out}}$, the document acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}},$$

and the token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

with $T_i$ a deterministic token-count estimate. Both should be logged per stage and per slice (source, language, length bucket, time window), because the aggregate can stay flat while one slice collapses. The acceptance rate also doubles as a cheap monitoring statistic: if a stage that historically accepts 85 percent of tokens suddenly accepts 60 percent, either the input distribution changed or the filter's dependencies changed, and the run should stop before the anomaly propagates downstream.
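The monitoring idea can be sketched as a crude control-limit check, assuming a history of per-run acceptance rates for the same stage; the three-sigma cutoff is an illustrative default, not a recommendation:

```python
def acceptance_drift(history, current, k=3.0):
    """Flag the current acceptance rate if it sits more than k standard
    deviations from the historical mean (a crude control-limit check)."""
    n = len(history)
    mean = sum(history) / n
    var = sum((a - mean) ** 2 for a in history) / n
    std = var ** 0.5
    return abs(current - mean) > k * std

# Hypothetical acceptance rates from previous runs of the same stage.
history = [0.84, 0.86, 0.85, 0.83, 0.85, 0.86, 0.84]
print(acceptance_drift(history, 0.85))  # a run inside the normal band
print(acceptance_drift(history, 0.60))  # a collapsed acceptance rate
```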

2.4 False positives and false negatives

Fix the convention that a filter's positive class is "remove." A false positive is then a good record the filter discards, which costs tokens and skews the distribution; a false negative is a bad record that survives, which costs training signal and can inject toxicity, PII, or duplicated boilerplate. The two error costs are asymmetric and stage-dependent: for a secrets filter, false negatives are the expensive error; for a quality filter on a scarce language, false positives are. Since true labels are latent, estimate error rates from human audits: sample from the removed set to estimate the false-positive rate and from the kept set to estimate the false-negative rate, stratifying by score so that the rare boundary cases are actually represented in the audit.

2.5 Filter cascade

A practical pipeline composes filters in sequence. If stage $j$ has acceptance rate $a_j$ on the stream it actually receives, the cascade's overall acceptance is

$$a_{\mathrm{total}} = \prod_{j=1}^{k} a_j,$$

where each $a_j$ is conditional on survival of stages $1, \dots, j-1$. The stages are generally not independent, so per-stage rates measured on the raw corpus cannot simply be multiplied. Order matters twice over: cheap filters (length, character ratios) should run before expensive ones (model-based scores) to control compute, and every stage should log which rule rejected each record so that rejections can be attributed during ablations. A cascade without per-stage attribution is nearly impossible to debug.
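A cascade with per-stage counts and rates can be sketched as below; the two toy stages are illustrative rules, not a recommended filter set:

```python
def run_cascade(records, stages):
    """Apply named filters in order, logging per-stage counts and rates.

    stages: list of (name, predicate) pairs; a predicate returns True to keep.
    """
    report = []
    stream = list(records)
    for name, keep in stages:
        n_in = len(stream)
        stream = [r for r in stream if keep(r)]
        rate = len(stream) / n_in if n_in else 1.0
        report.append((name, n_in, len(stream), rate))
    return stream, report

docs = ["ok text here", "x", "spam spam spam spam", "a normal sentence about data"]
stages = [
    ("min_length", lambda d: len(d.split()) >= 3),
    ("max_repetition", lambda d: len(set(d.split())) / len(d.split()) > 0.5),
]
kept, report = run_cascade(docs, stages)
for name, n_in, n_out, rate in report:
    print(f"{name}: {n_in} -> {n_out} (a={rate:.2f})")
```

Note that `max_repetition` is measured only on the survivors of `min_length`, which is exactly why per-stage rates are conditional.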

3. Rule-Based Filters

Rule-based filters are the cheapest and most auditable layer of the quality stack: deterministic functions of surface statistics that remove gross failures before any model-based scoring runs.

3.1 Length filters

Length filters reject documents outside a band $[\ell_{\min}, \ell_{\max}]$ measured in characters, words, or estimated tokens. Very short documents are dominated by navigation fragments, titles, and error pages; extremely long ones are often logs, concatenated dumps, or machine-generated listings. Because length correlates with source and genre, a length filter is never distributionally neutral: tightening $\ell_{\min}$ disproportionately removes chat and forum text, while tightening $\ell_{\max}$ removes books and documentation. The token-weighted acceptance rate is the number to watch here, since the documents a length filter touches are, by construction, the ones carrying unusually few or unusually many tokens.
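A minimal length filter, using whitespace word count as a crude stand-in for a deterministic token estimate; the band limits are illustrative defaults:

```python
def length_filter(text, min_words=5, max_words=100_000):
    """Keep documents whose word count lies in [min_words, max_words]."""
    n = len(text.split())
    return min_words <= n <= max_words

docs = [
    "404 not found",       # navigation/error fragment, too short
    "word " * 50,          # ordinary mid-length document
    "line\n" * 500_000,    # machine-generated dump, too long
]
print([length_filter(d) for d in docs])
```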

3.2 Language ID

Language identification assigns each document a language label and a confidence, typically from a character n-gram classifier. Two failure modes dominate. First, short or code-mixed documents receive low-confidence, unstable labels, so the filter should treat confidence as part of the decision, keeping $\langle \text{lang}, p \rangle$ in metadata rather than a bare label. Second, language ID errors are not uniform: closely related languages and low-resource languages are confused far more often, so a hard threshold calibrated on English silently changes acceptance rates elsewhere. Per-language acceptance-rate slices are the standard audit.
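The decision shape, a label plus a confidence proxy that callers must threshold, can be shown with a toy stopword-ratio heuristic. This is a stand-in for a real character n-gram classifier, and the `STOPWORDS` lists are tiny illustrative samples, not usable vocabularies:

```python
# Toy language-ID heuristic: score a document by the fraction of its words
# found in a small per-language stopword list.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "a", "for"},
    "es": {"el", "la", "de", "que", "y", "en", "un", "para"},
}

def detect_language(text, min_ratio=0.1):
    """Return (language, confidence proxy); language is None below min_ratio."""
    words = text.lower().split()
    if not words:
        return None, 0.0
    best_lang, best_ratio = None, 0.0
    for lang, sw in STOPWORDS.items():
        ratio = sum(w in sw for w in words) / len(words)
        if ratio > best_ratio:
            best_lang, best_ratio = lang, ratio
    if best_ratio < min_ratio:
        return None, best_ratio
    return best_lang, best_ratio

print(detect_language("the quality of the data is key to the model"))
print(detect_language("la calidad de los datos es la clave"))
```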

3.3 Repetition ratios

Repetition ratios quantify how much of a document is self-copied text. We model the corpus as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$; for a record with $T_i$ tokens, a common statistic is the fraction of tokens covered by $n$-grams that occur more than once within the document:

$$\mathrm{rep}_n(x_i) = \frac{\#\{t : x_t \text{ lies inside a repeated } n\text{-gram of } x_i\}}{T_i}.$$

The filter is $f(r_i) = \mathbf{1}[\mathrm{rep}_n(x_i) \le \tau]$ for a calibrated threshold $\tau$. High ratios flag scraped navigation text, template boilerplate, and generation loops; clean prose sits near zero.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For repetition filtering, the invariant should be explicit enough that a checker can fail fast: each surviving record stores its computed ratio and the $(n, \tau)$ used in its metadata $m_i$. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.
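A minimal sketch of the repeated-$n$-gram statistic, in pure Python. The threshold $\tau = 0.3$ and $n = 3$ are illustrative defaults, not tuned values:

```python
from collections import Counter

def repetition_ratio(tokens, n=3):
    """Fraction of tokens covered by an n-gram that occurs more than once."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    covered = [False] * len(tokens)
    for i, g in enumerate(ngrams):
        if counts[g] > 1:  # mark every token inside a duplicated n-gram
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(tokens)

def repetition_filter(tokens, tau=0.3, n=3):
    """Accept a document iff its repetition ratio is at most tau."""
    return repetition_ratio(tokens, n) <= tau
```

Normal prose scores near zero, while a generation loop like `"buy now " * 10` scores near one and is rejected.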

3.4 Character/script ratios

Character and script ratios summarize the raw composition of a document: the fraction of alphabetic characters, of digits and punctuation, and the mix of Unicode scripts. For content $x_i$ of $|x_i|$ characters, define

$$\rho_{\mathrm{alpha}}(x_i) = \frac{\#\{c \in x_i : c \text{ is alphabetic}\}}{|x_i|},$$

with analogous ratios per script (Latin, Cyrillic, CJK, and so on). A document whose declared language is English but whose Latin-script ratio is low is mislabeled, encoding-damaged, or markup-heavy; the filter $f(r_i) = \mathbf{1}[\rho_{\mathrm{alpha}}(x_i) \ge \tau]$ catches tables of numbers, minified code, and binary spill masquerading as text.

For character ratios, the invariant should be explicit enough that a checker can fail fast: the script histogram of every surviving record must be consistent with the language label in its metadata $m_i$. A ratio computed once and discarded cannot protect a long-running data build; store it with the record so audits can re-slice by it.
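The ratio $\rho_{\mathrm{alpha}}$ can be computed directly with `str` predicates. The 0.6 acceptance threshold below is an assumption for illustration:

```python
def char_ratios(text):
    """Alphabetic, digit, and whitespace character ratios of a document."""
    if not text:
        return {"alpha": 0.0, "digit": 0.0, "space": 0.0}
    n = len(text)
    return {
        "alpha": sum(c.isalpha() for c in text) / n,
        "digit": sum(c.isdigit() for c in text) / n,
        "space": sum(c.isspace() for c in text) / n,
    }

def alpha_ratio_filter(text, tau=0.6):
    """Accept a document iff its alphabetic ratio is at least tau."""
    return char_ratios(text)["alpha"] >= tau
```

Ordinary prose passes easily; a row of numeric table cells fails, which is exactly the intended behavior for a text-quality gate.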

3.5 Boilerplate and markup filters

Boilerplate and markup filters target text that is structurally valid but informationally empty: navigation menus, cookie banners, tag soup, template footers. Two cheap signals are markup density (the fraction of characters inside HTML-like tags) and the terminal-punctuation rate (the fraction of lines ending in sentence punctuation, which is low for link lists and high for prose). The filter composes them: accept $r_i$ only when markup density is below one threshold and the punctuation rate is above another.

For boilerplate removal, the invariant should be explicit enough that a checker can fail fast: cleaning happens before deduplication and statistics, and every record carries a flag in $m_i$ recording which boilerplate rules fired. Otherwise a later audit cannot distinguish "this page was clean" from "this page was cleaned," and the empirical distribution the model sees is undocumented.
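A sketch of the two signals, composed into one gate. The tag regex is a crude heuristic (not a real HTML parser), and both thresholds are illustrative values to be calibrated on an audited sample:

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML-like tag matcher, heuristic only

def markup_density(text):
    """Fraction of characters that sit inside HTML-like tags."""
    if not text:
        return 0.0
    tag_chars = sum(len(m.group(0)) for m in TAG_RE.finditer(text))
    return tag_chars / len(text)

def terminal_punct_rate(text):
    """Fraction of non-empty lines ending in sentence punctuation."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(ln.endswith((".", "!", "?")) for ln in lines) / len(lines)

def boilerplate_filter(text, max_markup=0.1, min_punct=0.5):
    """Accept prose-like documents; reject markup- or link-heavy ones."""
    return markup_density(text) <= max_markup and terminal_punct_rate(text) >= min_punct
```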

4. Model-Based Filters

This section covers model-based filters: scoring functions computed by a learned model rather than a hand-written rule. The local variables should still be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

4.1 Perplexity filters

A perplexity filter scores each document with a reference language model and rejects documents the model finds too surprising (or, in some setups, too predictable). For a token sequence $x_{1:T}$ under reference model $q$,

$$\mathrm{PPL}(x_{1:T}) = \exp\!\left(-\frac{1}{T} \sum_{t=1}^{T} \log q(x_t \mid x_{<t})\right),$$

and the filter is $f(r_i) = \mathbf{1}[\mathrm{PPL}(x_i) \le \tau]$. Very high perplexity usually means garbled or off-language text; very low perplexity often means repetitive boilerplate, so two-sided thresholds are common.

For perplexity filtering, the invariant should be explicit enough that a checker can fail fast: the reference model, its tokenizer, and the threshold are pinned in the manifest, because a perplexity score is meaningless without the model that produced it. A score column whose model version is unknown cannot survive an audit.
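The formula can be exercised with a deliberately weak stand-in for the neural reference model: an add-$\alpha$-smoothed unigram model. The tiny reference sample and vocabulary size below are illustrative:

```python
import math
from collections import Counter

def unigram_perplexity(tokens, ref_counts, vocab_size, alpha=1.0):
    """Perplexity under an add-alpha smoothed unigram reference model.

    Same formula as the text, with q(x_t | x_<t) replaced by a unigram q(x_t).
    """
    total = sum(ref_counts.values()) + alpha * vocab_size
    log_prob = 0.0
    for tok in tokens:
        p = (ref_counts.get(tok, 0) + alpha) / total
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))

# Reference counts from a tiny in-domain sample (illustrative).
ref = Counter("the cat sat on the mat the dog sat on the rug".split())
ppl_in = unigram_perplexity("the cat sat on the rug".split(), ref, vocab_size=1000)
ppl_out = unigram_perplexity("zq xv lorem ipsum qq zz".split(), ref, vocab_size=1000)
```

In-domain text scores far lower than off-domain gibberish, which is the separation a perplexity threshold exploits.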

4.2 Quality classifiers

A quality classifier maps a document to a score $s(x_i) \in [0, 1]$ estimating how much it resembles a curated positive set (reference corpora such as encyclopedias or vetted books) versus the raw crawl. The filter is $f(r_i) = \mathbf{1}[s(x_i) \ge \tau]$; in soft variants the score becomes a sampling weight, so borderline documents are downweighted rather than dropped.

For classifier filtering, the invariant should be explicit enough that a checker can fail fast: the positive and negative training sets of the classifier are themselves versioned datasets with manifests, because the classifier silently encodes their biases. A quality score whose training data cannot be reproduced is an opinion, not a measurement; store the classifier version alongside every score in $m_i$.
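A toy version of such a classifier: a logistic score over three handcrafted features. Both the feature set and the weights are illustrative stand-ins for a trained model, chosen only to make the mechanics concrete:

```python
import math

def features(text):
    """Handcrafted features: average word length, stopword rate, punctuation rate."""
    words = text.split()
    if not words:
        return [0.0, 0.0, 0.0]
    avg_len = sum(len(w) for w in words) / len(words)
    stop_rate = sum(w.lower() in {"the", "a", "of", "and", "to"} for w in words) / len(words)
    punct_rate = sum(text.count(p) for p in ".,;:") / len(text)
    return [avg_len, stop_rate, punct_rate]

def quality_score(text, weights=(0.2, 3.0, 2.0), bias=-1.5):
    """Logistic score in (0, 1); weights here are illustrative, not trained."""
    z = bias + sum(w * f for w, f in zip(weights, features(text)))
    return 1.0 / (1.0 + math.exp(-z))
```

Well-formed prose, with normal stopword and punctuation rates, scores above keyword spam.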

4.3 Educational-value classifiers

Educational-value classifiers refine generic quality scores with a narrower question: does this document teach something? Scores are typically ordinal (for example 0 to 5, in the style of FineWeb-Edu-like annotation schemes) and are produced by a small classifier distilled from LLM judgments on a seed sample. The filter keeps documents at or above a grade threshold, and the threshold trades corpus size against average instructional density.

For grade thresholds, the invariant should be explicit enough that a checker can fail fast: the rubric that defined the grades is part of the dataset documentation. Two annotation rounds with subtly different rubrics produce incomparable score columns, so the manifest must record rubric version, judge model, and seed-sample hash next to the scores.
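The size-versus-density tradeoff is easy to tabulate before committing to a threshold. The sketch below assumes records reduced to (grade, token-count) pairs, an illustrative schema:

```python
def retention_by_threshold(records, max_grade=5):
    """Token retention at each educational-grade threshold.

    records: list of (grade, token_count) pairs (illustrative schema).
    Returns {threshold: fraction of tokens with grade >= threshold}.
    """
    total = sum(t for _, t in records)
    out = {}
    for tau in range(max_grade + 1):
        kept = sum(t for g, t in records if g >= tau)
        out[tau] = kept / total
    return out
```

Running it over a candidate corpus gives the curve a compute-budget discussion needs: how many tokens survive at each grade cut.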

4.4 Embedding outliers

Embedding-based outlier detection maps each document to a vector $e(x_i) \in \mathbb{R}^d$ and flags documents far from the bulk of the corpus, for example by cosine distance to the centroid of their source or cluster:

$$d_i = 1 - \frac{\langle e(x_i), \bar{e} \rangle}{\lVert e(x_i) \rVert \, \lVert \bar{e} \rVert}, \qquad \bar{e} = \frac{1}{n} \sum_{j} e(x_j).$$

Large $d_i$ can indicate mislabeled language, spam templates new enough to evade rule filters, or genuinely rare but valuable content, so embedding outliers are better routed to audit than silently dropped.

For outlier detection, the invariant should be explicit enough that a checker can fail fast: the embedding model version is pinned, and distances are only compared within one embedding space. Mixing vectors from two encoder versions in a single centroid makes every downstream distance meaningless.
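The centroid-distance rule in pure Python, on toy 2-dimensional vectors. The threshold $\tau = 0.5$ is illustrative:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def centroid(vectors):
    """Componentwise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def flag_outliers(vectors, tau=0.5):
    """Indices of vectors whose cosine distance to the centroid exceeds tau."""
    c = centroid(vectors)
    return [i for i, v in enumerate(vectors) if cosine_distance(v, c) > tau]
```

Three vectors clustered near one direction and one orthogonal stray: only the stray is flagged.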

4.5 Calibration of filter thresholds

Threshold calibration turns a raw score distribution into a filter with a predictable budget. Instead of picking $\tau$ by eye, fix a target acceptance rate $a^{*}$ and set $\tau$ to the corresponding quantile of the score distribution on a held-out sample:

$$\tau = Q_{1 - a^{*}}\bigl(s(x_1), \ldots, s(x_m)\bigr).$$

This decouples two decisions that are often conflated: how much data to keep (a compute-budget question) and which data to keep (a quality question). Recalibrate per source and per language, because a single global $\tau$ applied to heterogeneous score distributions silently reweights the mixture $\boldsymbol{\alpha}$.

For calibration, the invariant should be explicit enough that a checker can fail fast: the calibration sample, its hash, and the resulting $\tau$ all live in the manifest. A threshold with no recorded provenance cannot be audited, and a threshold recalibrated mid-run without a version bump corrupts the comparability of shards.
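A minimal sketch of quantile-based calibration, assuming scores where higher is better and acceptance means $s \ge \tau$:

```python
def calibrate_threshold(scores, target_acceptance):
    """Pick tau so that roughly target_acceptance of scores satisfy s >= tau."""
    ranked = sorted(scores, reverse=True)
    k = max(1, round(target_acceptance * len(ranked)))
    return ranked[k - 1]  # k-th highest score becomes the cut
```

With ten evenly spaced scores and a 30 percent target, the threshold lands so that exactly three scores pass.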

5. Safety and Privacy Filters

This section covers safety and privacy filters: checks whose failure modes are legal and ethical, not merely statistical. The local variables should still be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

5.1 PII detection

PII detection scans text for personally identifying information: email addresses, phone numbers, government ID patterns, physical addresses. Regex detectors are conservative by design, preferring false positives that route records to review over false negatives that leak identities into model weights. Formally the detector is a flag $\mathrm{pii}(x_i) \in \{0, 1\}$ plus a span list, and the pipeline may drop, redact, or quarantine flagged records depending on source policy.

For PII handling, the invariant should be explicit enough that a checker can fail fast: redaction is recorded, never silent. A record whose text was modified must say so in $m_i$ (which patterns fired, which spans were replaced), or the dataset's empirical distribution diverges from its documentation in exactly the places an auditor will check.
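A conservative regex audit in the spirit described above. The patterns are deliberately broad, intended as review flags rather than precise extractors, and the pattern set is illustrative, not exhaustive:

```python
import re

# Deliberately broad patterns: audit flags, not precise extractors.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(
        r"\b(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}\b"
    ),
}

def pii_audit(text):
    """Return {pattern_name: match_count} for every pattern that fired."""
    hits = {}
    for name, pat in PII_PATTERNS.items():
        n = len(pat.findall(text))
        if n:
            hits[name] = n
    return hits
```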

5.2 Toxicity/hate filters

Toxicity and hate filters estimate the probability that a document contains abusive, hateful, or harassing content, usually with a small classifier, and drop or downweight records above a threshold. The central tension: aggressive toxicity filtering reduces toxic generations, but it also disproportionately removes dialectal and identity-related text, shrinking coverage for exactly the populations the model most often mishandles. Thresholds therefore need per-slice evaluation, not a single global number.

For toxicity filtering, the invariant should be explicit enough that a checker can fail fast: removal rates are reported per source, per language, and per identity-term slice, so a reviewer can see whom the filter silences. A toxicity threshold justified only by an aggregate rate is not evidence of a safe tradeoff.
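The per-slice reporting the invariant demands is a small aggregation. The `tox` score and `source` slice fields below are an illustrative record schema, and the 0.8 threshold is an assumption:

```python
def removal_rates_by_slice(records, score_key="tox", slice_key="source", tau=0.8):
    """Removal rate per slice for a toxicity threshold tau.

    records: list of dicts carrying score_key and slice_key (illustrative schema).
    """
    totals, removed = {}, {}
    for r in records:
        s = r[slice_key]
        totals[s] = totals.get(s, 0) + 1
        if r[score_key] >= tau:
            removed[s] = removed.get(s, 0) + 1
    return {s: removed.get(s, 0) / totals[s] for s in totals}
```

A single aggregate removal rate of 33 percent could hide that one source loses half its records while another loses none; the per-slice view makes that visible.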

5.3 Secrets/API keys in code data

Code corpora leak credentials: API keys, tokens, private keys, and connection strings committed by accident and scraped faithfully. A model trained on them can regurgitate them. Detection combines pattern matching for well-known key formats with entropy screening, since random secrets have much higher per-character Shannon entropy than identifiers or prose:

$$H(s) = -\sum_{c} p_c \log_2 p_c,$$

where $p_c$ is the frequency of character $c$ in the candidate string. Strings that match a key-like shape and exceed an entropy threshold are redacted, or the whole record is quarantined.

For secrets handling, the invariant should be explicit enough that a checker can fail fast: a detected secret is never written to logs, metrics, or audit samples in the clear; the pipeline records only the pattern name and a hash of the span. A secrets filter that copies its matches into a dashboard has recreated the leak it was built to prevent.
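A sketch combining the two signals. The key-shape regex and the 4.0-bit entropy threshold are illustrative heuristics, not an exhaustive ruleset, and the example "key" in the test is synthetic:

```python
import math
import re

# Long token-like strings; a shape heuristic, not an exhaustive key ruleset.
KEY_SHAPE = re.compile(r"\b[A-Za-z0-9_\-]{24,}\b")

def shannon_entropy(s):
    """Per-character Shannon entropy of a string, in bits."""
    n = len(s)
    freqs = {c: s.count(c) / n for c in set(s)}
    return -sum(p * math.log2(p) for p in freqs.values())

def secret_candidates(text, min_entropy=4.0):
    """Key-shaped substrings whose entropy suggests random material."""
    return [m.group(0) for m in KEY_SHAPE.finditer(text)
            if shannon_entropy(m.group(0)) >= min_entropy]
```

An English-like identifier is long enough to match the shape but falls below the entropy bar; a random-looking token clears both and is flagged.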

5.4 Malware/code safety preview

Malware and code-safety screening asks whether a code record, if learned, teaches the model to emit actively dangerous patterns: obfuscated payload droppers, credential exfiltration, exploit boilerplate. This chapter only previews the problem, since full treatment belongs with alignment and red-teaming. At the data layer the practical tool is a pattern audit that flags suspicious constructs (dynamic code execution on decoded strings, shell invocation built from network input) for human review rather than silent deletion, because the same constructs appear in legitimate security tooling.

For code-safety screening, the invariant should be explicit enough that a checker can fail fast: flagged code is quarantined with its lineage intact, so a reviewer can see the repository context before deciding. A keyword match stripped of context has an unacceptable false-positive rate on security research, packers, and test fixtures.
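A sketch of such a pattern audit. These regexes are crude heuristics that produce review flags, not a malware classifier, and the pattern names and list are illustrative:

```python
import re

# Heuristic flags for human review, not a malware classifier.
SUSPICIOUS = {
    "exec_on_decoded": re.compile(r"\bexec\s*\(\s*base64|\beval\s*\(\s*base64"),
    "shell_from_input": re.compile(r"os\.system\s*\(.*(?:input|argv|recv)"),
    "chr_obfuscation": re.compile(r"chr\s*\(\s*\d+\s*\)\s*\+\s*chr"),
}

def code_safety_flags(source):
    """Names of suspicious patterns present in a code record."""
    return [name for name, pat in SUSPICIOUS.items() if pat.search(source)]
```

Records with any flag are routed to quarantine with lineage attached, as the invariant requires, rather than being deleted outright.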

5.5 Quarantine policies

A quarantine policy is the routing rule that decides, for every record a safety filter touches, between three outcomes: accept, reject, or quarantine for review. Quarantine exists because safety filters are uncertain classifiers; hard-deleting everything they flag destroys recall evidence and makes false-positive rates unmeasurable. Formally the policy is a function $\pi(r_i) \in \{\text{accept}, \text{reject}, \text{quarantine}\}$ driven by filter scores and source policy, and quarantined records are stored out-of-band with full lineage but are never visible to training jobs.

For quarantine, the invariant should be explicit enough that a checker can fail fast: the quarantine store is append-only and auditable. A record released from quarantine carries the reviewer decision in $m_i$; a record that simply reappears in the training pool is a provenance failure.
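The three-way routing rule can be sketched as a small function that always returns a decision plus a loggable reason. The `risk` score field and both thresholds are illustrative assumptions:

```python
def quarantine_route(record, reject_tau=0.95, review_tau=0.7, score_key="risk"):
    """Route a record to accept / quarantine / reject with a logged reason.

    Thresholds and the `risk` score field are illustrative, not policy.
    """
    s = record[score_key]
    if s >= reject_tau:
        return ("reject", f"{score_key}={s:.2f} >= {reject_tau}")
    if s >= review_tau:
        return ("quarantine", f"{score_key}={s:.2f} in review band")
    return ("accept", f"{score_key}={s:.2f} below {review_tau}")
```

Returning the reason string alongside the decision is what makes the policy auditable: the manifest can store it per record without re-running the filter.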

6. Monitoring and Human Audit

This section covers monitoring and human audit: the reporting layer that makes filter behavior observable. The local variables should still be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

6.1 Distribution summaries

Distribution summaries is part of the canonical scope of quality checks. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For quality score, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
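The document and token acceptance rates can be computed side by side. A minimal sketch, assuming records are dicts with a `text` field and using a whitespace split as the deterministic token-count estimate (both choices are illustrative, not a prescribed schema):

```python
def token_count(record):
    """Deterministic token-count estimate: whitespace split (illustrative)."""
    return len(record["text"].split())

def acceptance_rates(records, keep):
    """Return (document rate a, token rate a_tok) for a filter `keep`."""
    n_in = len(records)
    tok_in = sum(token_count(r) for r in records)
    kept = [r for r in records if keep(r)]
    a = len(kept) / n_in if n_in else 0.0
    a_tok = sum(token_count(r) for r in kept) / tok_in if tok_in else 0.0
    return a, a_tok

records = [
    {"text": "short doc"},                     # 2 tokens
    {"text": "a much longer document " * 50},  # 200 tokens
]
# Dropping long documents removes half the documents but almost all tokens.
a, a_tok = acceptance_rates(records, keep=lambda r: token_count(r) < 100)
```

Here $a = 0.5$ while $a_{\mathrm{tok}} \approx 0.01$, which is exactly the document-versus-token divergence described above.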

6.2 Sample review rubric

A filter threshold without an audit sample is not evidence of quality. A sample review rubric turns human judgment into a reproducible measurement: draw a seeded sample of accepted and rejected records, have reviewers label each record against a fixed set of questions (is the text usable, is the filter's decision correct), and record where reviewers and the filter disagree.

If the sample contains $n$ records and the observed filter-reviewer disagreement rate is $\hat{p}$, the standard error of that estimate is

$$\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$$

so the sample must be large enough for the rubric to distinguish an acceptable error rate from an unacceptable one. The rubric itself should be versioned with the filter: changing the questions changes the measurement.
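An audit sample is only evidence if it can be redrawn: seeding makes the sample reproducible across reruns. A minimal sketch, where the record list, sample size, and the disagreement rate `p_hat` are illustrative placeholders:

```python
import math
import random

def audit_sample(records, n, seed=0):
    """Draw a reproducible review sample (seeded, so audits can be rerun)."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))

def disagreement_stderr(p_hat, n):
    """Standard error of an observed reviewer-filter disagreement rate."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

sample = audit_sample(list(range(1000)), n=100, seed=7)
se = disagreement_stderr(p_hat=0.1, n=100)
```

With $n = 100$ and $\hat{p} = 0.1$, the standard error is 0.03, so this sample cannot distinguish a 10 percent error rate from a 7 percent one; the rubric's tolerances dictate the sample size, not the other way around.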

6.3 Slice-based audit

Aggregate acceptance rates hide structure, so a slice-based audit recomputes the acceptance rate within each slice of interest, typically source, language, length bucket, and time window:

$$a_s = \frac{1}{|s|} \sum_{i \in s} f(r_i),$$

where $s$ indexes the records assigned to a slice. A filter with a healthy global rate can still remove almost everything from a small domain, which is why optimizing one aggregate score appears among the common mistakes below. Slice rates belong in the same manifest as the global rates so an audit can compare them without rerunning the pipeline.
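Per-slice acceptance rates take only a few lines to tabulate. A minimal sketch, assuming each record carries a `source` field and a precomputed quality `score` (both field names are illustrative):

```python
from collections import defaultdict

def slice_acceptance(records, keep, slice_key):
    """Acceptance rate per slice (e.g. source, language, length bucket)."""
    n_in, n_out = defaultdict(int), defaultdict(int)
    for r in records:
        s = slice_key(r)
        n_in[s] += 1
        n_out[s] += bool(keep(r))
    return {s: n_out[s] / n_in[s] for s in n_in}

records = [
    {"source": "web",  "score": 0.2},
    {"source": "web",  "score": 0.9},
    {"source": "code", "score": 0.8},
    {"source": "code", "score": 0.7},
]
rates = slice_acceptance(records,
                         keep=lambda r: r["score"] >= 0.5,
                         slice_key=lambda r: r["source"])
# The global rate is 0.75, which hides that "web" passes only half its records.
```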

6.4 Drift by source/time

A sudden change in the acceptance rate $a$ is a data-drift signal even when the code still runs, so drift audits track $a$, token counts, and per-filter rejection reasons per source and per time bucket. Comparing consecutive buckets,

$$\Delta_t = a_t - a_{t-1},$$

a jump in $|\Delta_t|$ beyond a pre-registered tolerance should pause the build for review rather than silently shift the training distribution. Upstream sources change on their own schedule: a threshold calibrated on last month's crawl can be systematically wrong on this month's, and only per-source, per-time summaries make that visible.
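A drift check over time buckets can be sketched as follows, assuming acceptance rates have already been aggregated per week; the bucket labels and the 0.05 tolerance are illustrative, not recommended values:

```python
def drift_alerts(rates_by_bucket, tolerance=0.05):
    """Flag consecutive time buckets whose acceptance rate jumps by more
    than `tolerance` -- a data-drift signal even when the code still runs."""
    buckets = sorted(rates_by_bucket)
    alerts = []
    for prev, cur in zip(buckets, buckets[1:]):
        delta = rates_by_bucket[cur] - rates_by_bucket[prev]
        if abs(delta) > tolerance:
            alerts.append((prev, cur, delta))
    return alerts

weekly_a = {"2024-W01": 0.81, "2024-W02": 0.80, "2024-W03": 0.62}
alerts = drift_alerts(weekly_a, tolerance=0.05)
# W02 -> W03 dropped by 0.18: halt and audit before the build continues.
```

Sorting the bucket keys makes the comparison deterministic, which matters when the rates are collected by parallel workers in arbitrary order.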

6.5 Filter ablation reports

Filters encode preferences and tradeoffs, so no major filter should be treated as neutral: each one should be ablated and audited. A filter ablation report reruns the cascade with one filter disabled at a time and records, per filter, the marginal share of documents and tokens it removes beyond what the other filters already remove. Filters often overlap, so the sum of marginal removals is typically smaller than the total removed by the full cascade, and a filter whose marginal contribution is near zero is a candidate for simplification. Because document-weighted and token-weighted rates can diverge, the report should include both.
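A leave-one-out ablation over a small filter cascade can be sketched as follows, assuming each record carries a precomputed token count and boolean filter features (all field and filter names are illustrative):

```python
def cascade_keep(record, filters):
    """A record survives the cascade only if every filter keeps it."""
    return all(f(record) for f in filters)

def ablation_report(records, filters):
    """Leave-one-out ablation: token share kept by the full cascade, plus
    each filter's marginal token share removed beyond the other filters."""
    total = sum(r["tokens"] for r in records)
    kept_full = sum(r["tokens"] for r in records
                    if cascade_keep(r, list(filters.values())))
    report = {"full_cascade": kept_full / total}
    for name in filters:
        rest = [f for n, f in filters.items() if n != name]
        kept = sum(r["tokens"] for r in records if cascade_keep(r, rest))
        report[name] = (kept - kept_full) / total  # marginal removal
    return report

records = [{"tokens": 100, "len_ok": True,  "lang_ok": False},
           {"tokens": 300, "len_ok": True,  "lang_ok": True},
           {"tokens": 600, "len_ok": False, "lang_ok": True}]
filters = {"length":   lambda r: r["len_ok"],
           "language": lambda r: r["lang_ok"]}
report = ablation_report(records, filters)
```

Here the length filter's marginal removal (0.6 of tokens) dwarfs the language filter's (0.1), even though each rejects one document: exactly the document-versus-token divergence the report must surface.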

7. Common Mistakes

| # | Mistake | Why It Is Wrong | Fix |
|---|---------|-----------------|-----|
| 1 | Trusting a file because it exists | A zero-byte or unparsable artifact can still pass a loose path check | Validate content and parseability |
| 2 | Counting documents but not tokens | Long documents dominate compute | Report both document and token rates |
| 3 | Changing schemas without versioning | Old and new records become indistinguishable | Pin schema versions in every record |
| 4 | Dropping metadata during transforms | Audits and removals become impossible | Preserve source and transform lineage |
| 5 | Using nondeterministic ordering | Rebuilds cannot be compared | Seed and record ordering rules |
| 6 | Ignoring failed records | Silent loss can bias the corpus | Quarantine and summarize failures |
| 7 | Treating filters as neutral | Filters encode preferences and tradeoffs | Ablate and audit every major filter |
| 8 | Mixing train and eval sources | Evaluation becomes contaminated | Run overlap audits before release |
| 9 | Optimizing one aggregate score | Small domains can regress | Track slice metrics |
| 10 | Skipping data cards | Users cannot judge intended use or risk | Publish structured documentation |
| 11 | Assuming licenses are uniform | Source terms can conflict | Track license at source and record level |
| 12 | Forgetting reproducible manifests | The same name can refer to different data | Use hashes and version pins |

8. Exercises

  1. (*) Build a synthetic quality score example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  2. (*) Build a synthetic filter example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  3. (*) Build a synthetic acceptance rate example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  4. (**) Build a synthetic PII example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  5. (**) Build a synthetic toxicity example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  6. (**) Build a synthetic threshold example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  7. (**) Build a synthetic audit example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  8. (***) Repeat exercise 1 at sharded scale: show that the validation signal stays deterministic across restarts and parallel workers.
  9. (***) Repeat exercise 2 across a schema change, using pinned schema versions to keep old and new records distinguishable.
  10. (***) Repeat exercise 3 with both document and token acceptance rates, and construct a case where the two diverge sharply.

9. Why This Matters for AI

| Concept | AI impact |
|---------|-----------|
| quality score | Decides which documents are judged worth training on and how $D_{\mathrm{eff}}$ is spent |
| filter | Encodes which errors and preferences the model is allowed to learn from |
| acceptance rate | Tracks how much of the corpus, and therefore the compute budget, survives each stage |
| PII | Personal data that survives filtering becomes memorization and privacy risk |
| toxicity | Harmful text that survives filtering shapes what the model can generate |
| threshold | Sets the tradeoff between quality, diversity, and coverage |
| audit | Produces the evidence needed to trust, tune, or roll back a filter before release |

Data pipeline quality is model quality in delayed form. The model eventually converts these records into gradients; any unresolved ambiguity becomes either wasted compute, misleading evaluation, memorization risk, or irreproducible science.

10. Conceptual Bridge

This section connects the previous and next pieces of the curriculum as follows:

raw sources -> records -> validation -> assembly -> audits -> documentation -> mixture

The next section is Full Dataset Assembly. It uses the contracts established here and moves one step further through the LLM data pipeline.

References