"If a dataset cannot explain where it came from, it cannot explain what it taught."
Overview
Documentation and governance convert a data pipeline into a reproducible, reviewable, and accountable artifact. In an LLM training run, data is not an inert pile of text; it is the empirical distribution that defines the examples, losses, risks, and capabilities the model will see.
This section is written as LaTeX Markdown. Inline mathematics uses `$...$`, and display equations use `$$...$$`. The goal is to connect data engineering decisions to mathematical objects such as records, token sequences, filters, hashes, mixture weights, and empirical expectations.
The scope is deliberately narrow: this chapter owns the training-data pipeline. Tokenizer design, GPU training systems, benchmark methodology, alignment objectives, and production MLOps each have their own canonical chapters. Here we study the data objects that those later systems consume.
Prerequisites
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Executable demonstrations for documentation and governance |
| exercises.ipynb | Graded practice for documentation and governance |
Learning Objectives
After completing this section, you will be able to:
- Define data cards, provenance graphs, license fields, risk registers, and release checklists
- Write dataset documentation that explains intent, sources, processing, and limitations
- Represent lineage through source URIs, hashes, transforms, and version pins
- Design governance controls for access, PII review, takedown, and release approval
- Track dataset versions and diff reports
- Link trained models back to exact data manifests
- Explain why documentation is a user-facing product, not a README afterthought
- Prepare evidence needed for reproducibility and responsible release
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Dataset Documentation
- 4. Provenance and Lineage
- 5. Governance Controls
- 6. Dataset Versioning
- 7. Common Mistakes
- 8. Exercises
- 9. Why This Matters for AI
- 10. Conceptual Bridge
- References
1. Intuition
Intuition gives the conceptual and mathematical layer for documentation and governance. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.
1.1 A dataset without documentation is not reproducible
The claim that a dataset without documentation is not reproducible is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For data card, the invariant should be explicit enough that a checker can fail fast.
If the invariant is only written in a notebook comment or an engineer's memory, it will
not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
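The gap between the document-weighted and token-weighted views can be checked on a toy corpus. The sketch below is illustrative only: the `Record` type, the `acceptance_rates` helper, and the 500-token threshold are assumptions made for this example, and the corpus is synthetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    doc_id: str
    n_tokens: int  # token count or a deterministic token-count estimate


def acceptance_rates(records, keep):
    """Return (document acceptance rate, token acceptance rate) for one filter."""
    if not records:
        raise ValueError("stage received no records")
    total_tokens = sum(r.n_tokens for r in records)
    if total_tokens == 0:
        raise ValueError("stage received no tokens")
    kept = [r for r in records if keep(r)]
    doc_rate = len(kept) / len(records)
    tok_rate = sum(r.n_tokens for r in kept) / total_tokens
    return doc_rate, tok_rate


# Synthetic corpus: 19 short documents and one long document.
corpus = [Record(f"short-{i}", 70) for i in range(19)] + [Record("long-0", 570)]

# Hypothetical length filter that drops only the long document.
doc_rate, tok_rate = acceptance_rates(corpus, keep=lambda r: r.n_tokens <= 500)
print(f"documents kept: {doc_rate:.2f}, tokens kept: {tok_rate:.2f}")
```

On this corpus the filter keeps 19 of 20 documents but only 1330 of 1900 tokens, so a stage that looks mild by document count removes 30 percent of the token budget.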
1.2 Governance as risk control
Governance as risk control is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For provenance, the invariant should be explicit enough that a checker can fail fast.
If the invariant is only written in a notebook comment or an engineer's memory, it will
not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
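One way to turn the drift signal into a control is a checker that fails fast when a stage's acceptance rate leaves an agreed band. This is a minimal sketch under stated assumptions: the function name, the exception type, and the baseline and tolerance values are hypothetical, and the stage is assumed to be a pure filter, so it never emits more records than it receives.

```python
class AcceptanceDriftError(RuntimeError):
    """Raised when a stage's acceptance rate leaves its expected band."""


def check_acceptance_rate(stage, n_in, n_out, expected, tolerance):
    """Return the acceptance rate, or raise if it drifts past the tolerance."""
    if n_in <= 0:
        raise ValueError(f"{stage}: no records entered the stage")
    if not 0 <= n_out <= n_in:
        raise ValueError(f"{stage}: n_out={n_out} is inconsistent with n_in={n_in}")
    rate = n_out / n_in
    if abs(rate - expected) > tolerance:
        raise AcceptanceDriftError(
            f"{stage}: acceptance rate {rate:.3f} is outside "
            f"{expected:.3f} +/- {tolerance:.3f}"
        )
    return rate


# Hypothetical baseline: this filter normally keeps about 90% of records.
ok_rate = check_acceptance_rate("quality_filter", 1000, 905, expected=0.90, tolerance=0.02)

try:
    check_acceptance_rate("quality_filter", 1000, 700, expected=0.90, tolerance=0.02)
    drift_detected = False
except AcceptanceDriftError:
    drift_detected = True
```

Raising instead of logging is a deliberate choice here: a long-running build stops at the stage that drifted rather than propagating a shifted distribution downstream.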
1.3 Dataset users as stakeholders
Treating dataset users as stakeholders is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For lineage, the invariant should be explicit enough that a checker can fail fast. If
the invariant is only written in a notebook comment or an engineer's memory, it will not
protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
1.4 Responsible release
Responsible release is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For license, the invariant should be explicit enough that a checker can fail fast. If
the invariant is only written in a notebook comment or an engineer's memory, it will not
protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
1.5 Data cards and data statements
Data cards and data statements are part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For risk register, the invariant should be explicit enough that a checker can fail
fast. If the invariant is only written in a notebook comment or an engineer's memory, it
will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
2. Formal Definitions
Formal Definitions gives the conceptual and mathematical layer for documentation and governance. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.
2.1 Dataset card
The dataset card is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For data card, the invariant should be explicit enough that a checker can fail fast.
If the invariant is only written in a notebook comment or an engineer's memory, it will
not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
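A dataset card becomes checkable once it is a typed object rather than free text. The sketch below assumes a hypothetical `DataCard` schema whose fields follow the learning objectives of this section (intent, sources, processing, limitations); the field names and the example values are illustrative, not a standard.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class DataCard:
    name: str
    version: str
    intended_use: str
    sources: tuple
    processing: tuple
    limitations: tuple


def missing_fields(card):
    """Return the names of fields that are empty strings or empty tuples."""
    return [f.name for f in fields(card) if not getattr(card, f.name)]


def validate_card(card):
    """Fail fast when any documentation field is empty."""
    missing = missing_fields(card)
    if missing:
        raise ValueError(f"data card {card.name!r} is missing: {', '.join(missing)}")


# Synthetic example card; every value here is made up for illustration.
card = DataCard(
    name="synthetic-web",
    version="0.1.0",
    intended_use="Pretraining experiments on synthetic text.",
    sources=("synthetic generator",),
    processing=("drop_empty", "exact_dedup"),
    limitations=("English only", "no real user data"),
)
validate_card(card)  # passes silently

incomplete = replace(card, limitations=())
print(missing_fields(incomplete))
```

Running the validator in the build, not only at release time, keeps the card from drifting out of sync with the pipeline it describes.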
2.2 Provenance graph
The provenance graph is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For provenance, the invariant should be explicit enough that a checker can fail fast.
If the invariant is only written in a notebook comment or an engineer's memory, it will
not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
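A provenance graph can be sketched as a directed acyclic graph whose nodes are artifacts and whose edges point to parents. The example below is a sketch under assumptions: the `Artifact` fields, the artifact ids, and the transform names with version pins are all hypothetical, and the payloads are synthetic bytes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    artifact_id: str
    sha256: str       # content hash of the artifact's bytes
    transform: str    # name and pinned version of the step that produced it
    parents: tuple = ()


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def ancestors(graph, artifact_id):
    """Return every upstream artifact id, failing fast on a missing node."""
    seen = []
    stack = list(graph[artifact_id].parents)
    while stack:
        node = stack.pop()
        if node not in graph:
            raise KeyError(f"lineage broken: {node!r} has no provenance record")
        if node not in seen:
            seen.append(node)
            stack.extend(graph[node].parents)
    return sorted(seen)


raw = Artifact("raw/web", sha256_hex(b"synthetic raw text"), "fetch@1.0.0")
clean = Artifact("clean/web", sha256_hex(b"synthetic clean text"), "normalize@2.1.0", ("raw/web",))
dedup = Artifact("dedup/web", sha256_hex(b"synthetic deduped text"), "exact_dedup@0.3.0", ("clean/web",))
graph = {a.artifact_id: a for a in (raw, clean, dedup)}

upstream = ancestors(graph, "dedup/web")
print(upstream)
```

The walk raises as soon as an artifact names a parent with no record, which is the record-level lineage property that a metric dashboard alone cannot provide.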
2.3 License vector
The license vector is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For lineage, the invariant should be explicit enough that a checker can fail fast. If
the invariant is only written in a notebook comment or an engineer's memory, it will not
protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
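One possible reading of a license vector is a count of records per license value, compared against a release policy. The sketch below assumes a per-record `license` field and a hypothetical allowlist written with SPDX-style identifiers; neither the field name nor the policy comes from this section.

```python
from collections import Counter

# Hypothetical allowlist, written with SPDX-style identifiers.
ALLOWED_LICENSES = frozenset({"CC0-1.0", "CC-BY-4.0", "MIT"})


def license_counts(records):
    """Count records per license value; a missing or empty field counts as UNKNOWN."""
    return Counter(r.get("license") or "UNKNOWN" for r in records)


def blocked_licenses(counts, allowed=ALLOWED_LICENSES):
    """Return license values that the policy does not allow, in sorted order."""
    return sorted(name for name in counts if name not in allowed)


# Synthetic records for illustration.
records = [
    {"id": "a", "license": "CC0-1.0"},
    {"id": "b", "license": "CC-BY-4.0"},
    {"id": "c", "license": "CC-BY-4.0"},
    {"id": "d", "license": "CC-BY-NC-4.0"},
    {"id": "e"},  # license field missing
]
counts = license_counts(records)
blocked = blocked_licenses(counts)
print(dict(counts), blocked)
```

Counting `UNKNOWN` explicitly, rather than dropping such records silently, keeps the missing-license rate visible in the same report as the allowed and blocked categories.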
2.4 Consent/permission field
The consent/permission field is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For license, the invariant should be explicit enough that a checker can fail fast. If
the invariant is only written in a notebook comment or an engineer's memory, it will not
protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
2.5 Risk register
The risk register is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For risk register, the invariant should be explicit enough that a checker can fail
fast. If the invariant is only written in a notebook comment or an engineer's memory, it
will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
3. Dataset Documentation
Dataset Documentation gives the conceptual and mathematical layer for documentation and governance. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.
3.1 Intended use
Intended use is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For data card, the invariant should be explicit enough that a checker can fail fast.
If the invariant is only written in a notebook comment or an engineer's memory, it will
not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
3.2 Collection process
The collection process is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For provenance, the invariant should be explicit enough that a checker can fail fast.
If the invariant is only written in a notebook comment or an engineer's memory, it will
not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
3.3 Processing pipeline
The processing pipeline is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For lineage, the invariant should be explicit enough that a checker can fail fast. If
the invariant is only written in a notebook comment or an engineer's memory, it will not
protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
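For a multi-stage pipeline, the count-and-rate rule applies per stage, so the documentation can show where records were lost. The following is a minimal sketch: `run_pipeline`, the stage names, and the word-count threshold are assumptions for illustration, and the documents are synthetic.

```python
def run_pipeline(records, stages):
    """Apply named filters in order; return survivors and a per-stage report."""
    report = []
    current = list(records)
    for name, keep in stages:
        n_in = len(current)
        current = [r for r in current if keep(r)]
        n_out = len(current)
        # An empty input has no meaningful rate, so report NaN instead of dividing by zero.
        rate = n_out / n_in if n_in else float("nan")
        report.append({"stage": name, "n_in": n_in, "n_out": n_out, "rate": rate})
    return current, report


docs = [
    "",
    "short",
    "a much longer synthetic document",
    "short",
    "another synthetic document",
]
stages = [
    ("drop_empty", lambda d: bool(d.strip())),
    ("drop_short", lambda d: len(d.split()) >= 3),
]
survivors, report = run_pipeline(docs, stages)
for row in report:
    print(row)
```

Because each row records both `n_in` and `n_out`, the report can be stored next to the dataset version and compared between builds.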
3.4 Known limitations
Known limitations are part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For license, the invariant should be explicit enough that a checker can fail fast. If
the invariant is only written in a notebook comment or an engineer's memory, it will not
protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
3.5 Evaluation and audit results
Evaluation and audit results are part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For risk register, the invariant should be explicit enough that a checker can fail
fast. If the invariant is only written in a notebook comment or an engineer's memory, it
will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
a = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
a_{\text{tok}} = \frac{\sum_{i \in \text{kept}} T_i}{\sum_{i=1}^{n_{\text{in}}} T_i},
$$
where $T_i$ is the token count of record $i$ (or a deterministic token-count estimate) and the numerator sums over the $n_{\text{out}}$ records that leave the stage. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
4. Provenance and Lineage
Provenance and Lineage gives the conceptual and mathematical layer for documentation and governance. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.
4.1 Source URI
The source URI is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection of records, each carrying record-level metadata and text or token content. The practical question is whether each transformation applied to that collection preserves the intended empirical distribution.
A useful local invariant is:
For data card, the invariant should be explicit enough that a checker can fail fast.
If the invariant is only written in a notebook comment or an engineer's memory, it will
not protect a long-running data build.
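As one possible fail-fast check for source URIs, the sketch below rejects records whose URI is missing or uses an unexpected scheme. The field names `id` and `source_uri` and the allowed-scheme set are assumptions, not part of any fixed schema.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"https", "s3", "file"}  # assumption: adjust per pipeline

def check_source_uri(record: dict) -> None:
    """Raise ValueError if the record lacks a usable source URI."""
    uri = record.get("source_uri", "")
    parsed = urlparse(uri)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"record {record.get('id')!r}: bad or missing scheme in {uri!r}")
    if not (parsed.netloc or parsed.path):
        raise ValueError(f"record {record.get('id')!r}: empty location in {uri!r}")

good = {"id": "r1", "source_uri": "https://example.org/corpus/page-1.html"}
bad = {"id": "r2", "source_uri": ""}

check_source_uri(good)  # passes silently
try:
    check_source_uri(bad)
    failed = False
except ValueError:
    failed = True
```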
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
4.2 Snapshot time
Snapshot time records when each source was captured; it is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For provenance, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
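A snapshot-time check might look like the following sketch. It assumes timestamps are stored as ISO 8601 strings with an explicit `+00:00` offset in a hypothetical `snapshot_time` field, and that no record may be captured after the build starts.

```python
from datetime import datetime, timezone

def parse_snapshot(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp and require an explicit UTC offset."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None or dt.utcoffset().total_seconds() != 0:
        raise ValueError(f"snapshot time must be timezone-aware UTC: {ts!r}")
    return dt

def check_snapshot(record: dict, build_time: datetime) -> None:
    """Fail fast if the record claims a capture time later than the build."""
    snap = parse_snapshot(record["snapshot_time"])
    if snap > build_time:
        raise ValueError(f"record {record['id']!r} was captured after the build started")

build_time = datetime(2024, 1, 2, tzinfo=timezone.utc)
check_snapshot({"id": "r1", "snapshot_time": "2024-01-01T12:00:00+00:00"}, build_time)
```

Rejecting naive timestamps is a design choice: two crawls recorded in different local time zones cannot be ordered reliably otherwise.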
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
4.3 Transform history
Transform history is the ordered list of processing steps applied to each record; it is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For lineage, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
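One way to keep transform history checkable is to append a step entry every time a stage touches a record, then verify that the recorded steps form a prefix of the declared pipeline. The stage names, versions, and the `history` field below are hypothetical.

```python
# Declared pipeline: (stage name, stage version). Hypothetical values.
PIPELINE = [("extract_text", "1.2.0"), ("dedup", "0.9.1"), ("quality_filter", "2.0.0")]

def apply_stage(record: dict, name: str, version: str, fn) -> dict:
    """Apply fn to the text and append the step to a copy of the record."""
    out = dict(record)
    out["text"] = fn(record["text"])
    out["history"] = record.get("history", []) + [{"name": name, "version": version}]
    return out

def check_history(record: dict) -> None:
    """Fail fast if the recorded steps are not a prefix of the declared pipeline."""
    steps = [(h["name"], h["version"]) for h in record.get("history", [])]
    if steps != PIPELINE[: len(steps)]:
        raise ValueError(f"history {steps} does not match the declared pipeline")

rec = {"id": "r1", "text": "  Hello  "}
rec = apply_stage(rec, "extract_text", "1.2.0", str.strip)
check_history(rec)  # passes: one step, matching the first declared stage
```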
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
4.4 Hash chain
A hash chain links each stage's output hash to the hash of the stage before it; it is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For licenses, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
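A hash chain can be sketched with SHA-256 from the standard library: each link commits to the previous link, the step name, and a hash of the payload. The `genesis` seed string, the `|` separator, and the step names are assumptions made for this example.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def chain(prev_hash: str, step: str, payload: bytes) -> str:
    """Link hash: commits to the previous link, the step name, and the payload."""
    return sha256_hex(f"{prev_hash}|{step}|{sha256_hex(payload)}".encode("utf-8"))

def build_chain(steps):
    """steps: list of (step_name, payload_bytes). Returns the list of link hashes."""
    links, prev = [], "genesis"
    for name, payload in steps:
        prev = chain(prev, name, payload)
        links.append(prev)
    return links

steps = [("raw", b"Hello  world"), ("normalize", b"hello world")]
links = build_chain(steps)

# Changing the raw payload changes every later link, even if later payloads match.
tampered = build_chain([("raw", b"Hello, world"), ("normalize", b"hello world")])
```

Because each link depends on the one before it, comparing only the final link is enough to detect a change anywhere upstream.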
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
4.5 Rebuild command
A rebuild command is the pinned invocation that regenerates the dataset from its sources; it is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For a risk register, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
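A rebuild command is only useful if everything it depends on is pinned. The sketch below renders a command from a spec and refuses to do so when a pin is missing. The pin names and the `build_dataset.py` entry point are hypothetical placeholders, not a real tool.

```python
import shlex

# Hypothetical pin names: everything the build output depends on.
REQUIRED_PINS = ("code_commit", "config_sha256", "seed", "input_manifest_sha256")

def rebuild_command(spec: dict) -> str:
    """Render a shell command that pins everything the build depends on."""
    missing = [k for k in REQUIRED_PINS if k not in spec]
    if missing:
        raise ValueError(f"rebuild spec is missing pins: {missing}")
    args = ["python", "build_dataset.py"]  # hypothetical entry point
    for key in REQUIRED_PINS:
        args += ["--" + key.replace("_", "-"), str(spec[key])]
    return shlex.join(args)  # quotes arguments safely for a POSIX shell

spec = {"code_commit": "abc123", "config_sha256": "ef56", "seed": 0,
        "input_manifest_sha256": "cd34"}
cmd = rebuild_command(spec)
print(cmd)
```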
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
5. Governance Controls
Governance Controls gives the conceptual and mathematical layer for documentation and governance. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.
5.1 Access control
Access control determines who may read or modify each dataset artifact; it is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For a data card, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
5.2 License compatibility
License compatibility asks whether each source's terms permit the intended use; it is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For provenance, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
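A record-level license check could be sketched as an allowlist per intended use. The license strings follow SPDX identifier naming, but the policy table itself is an assumption for illustration; a real policy comes from legal review.

```python
# Hypothetical policy: which license identifiers are accepted for each use.
POLICY = {
    "research": {"CC0-1.0", "CC-BY-4.0", "CC-BY-NC-4.0", "MIT"},
    "commercial": {"CC0-1.0", "CC-BY-4.0", "MIT"},
}

def incompatible(records, intended_use):
    """Return ids of records whose license is missing or not allowed for the use."""
    allowed = POLICY[intended_use]
    return [r["id"] for r in records if r.get("license") not in allowed]

records = [
    {"id": "r1", "license": "CC-BY-4.0"},
    {"id": "r2", "license": "CC-BY-NC-4.0"},
    {"id": "r3"},  # license unknown: treated as incompatible
]

print(incompatible(records, "research"))
print(incompatible(records, "commercial"))
```

Treating a missing license as a failure, rather than as permission, keeps unknown sources out of a release by default.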
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
5.3 PII review
PII review checks records for personally identifiable information before release; it is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For lineage, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
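A first-pass PII screen can be sketched with two regular expressions that route records to a review queue. The patterns are deliberately crude assumptions (an email shape and a US-style phone shape); they will miss many cases and are not a substitute for a proper detector or human review.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")  # crude US-style pattern

def pii_flags(text: str) -> dict:
    """Count candidate PII matches per pattern."""
    return {"email": len(EMAIL.findall(text)), "phone": len(PHONE.findall(text))}

records = [
    {"id": "r1", "text": "Contact jane@example.com or 555-010-1234."},
    {"id": "r2", "text": "No personal details here."},
]

# Records with any match go to a review queue instead of being dropped silently.
review_queue = [r["id"] for r in records if any(pii_flags(r["text"]).values())]
```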
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
5.4 Takedown/removal workflow
A takedown/removal workflow removes specific records on request and logs the removal; it is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For licenses, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
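A takedown step can be sketched as a split of the dataset into kept records and tombstones. This only works if records still carry their source URI, which is why dropping metadata during transforms makes removals impossible. Field names and the reason string are assumptions.

```python
def apply_takedown(records, blocked_uris, reason):
    """Split records into kept ones and tombstones for the takedown log."""
    kept, tombstones = [], []
    for r in records:
        if r["source_uri"] in blocked_uris:
            # Keep id and reason, not the content, so the removal stays auditable.
            tombstones.append({"id": r["id"], "source_uri": r["source_uri"], "reason": reason})
        else:
            kept.append(r)
    return kept, tombstones

records = [
    {"id": "r1", "source_uri": "https://example.org/a", "text": "..."},
    {"id": "r2", "source_uri": "https://example.org/b", "text": "..."},
]
kept, tombstones = apply_takedown(records, {"https://example.org/b"}, "takedown request")

# Invariant: nothing from a blocked source survives.
assert all(r["source_uri"] != "https://example.org/b" for r in kept)
```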
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
5.5 Release approval checklist
A release approval checklist lists the conditions that must hold before a dataset is released; it is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For a risk register, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
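A release checklist becomes enforceable once it is data rather than prose. In the sketch below, an item blocks release unless it is explicitly `True`. The item names are hypothetical, loosely based on the controls named in this section.

```python
# Hypothetical checklist items.
CHECKLIST = [
    "data_card_published", "licenses_reviewed", "pii_review_signed_off",
    "takedown_queue_empty", "manifest_hash_recorded",
]

def release_blockers(status: dict) -> list:
    """Return checklist items that are missing or not explicitly True."""
    return [item for item in CHECKLIST if status.get(item) is not True]

status = {"data_card_published": True, "licenses_reviewed": True,
          "pii_review_signed_off": False, "manifest_hash_recorded": True}
blockers = release_blockers(status)
print(blockers)
```

An item absent from `status` counts as a blocker, so a new checklist entry cannot be skipped by older tooling that does not know about it.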
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
6. Dataset Versioning
Dataset Versioning gives the conceptual and mathematical layer for documentation and governance. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.
6.1 Semantic dataset versions
Semantic dataset versions encode the kind of change between releases in the version number; they are part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For a data card, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
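One way to apply MAJOR.MINOR.PATCH numbering to datasets is sketched below. The mapping from change type to bump level is an assumed convention, not a standard: schema changes are treated as breaking, record changes as minor, and documentation-only fixes as patches.

```python
def bump(version: str, change: str) -> str:
    """Bump MAJOR.MINOR.PATCH under an assumed convention for datasets."""
    major, minor, patch = (int(p) for p in version.split("."))
    if change == "schema":    # breaking: consumers must update their parsers
        return f"{major + 1}.0.0"
    if change == "records":   # content changed: records added or removed
        return f"{major}.{minor + 1}.0"
    if change == "docs":      # documentation or metadata-only fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")

print(bump("1.4.2", "schema"))
print(bump("1.4.2", "records"))
print(bump("1.4.2", "docs"))
```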
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
6.2 Diff reports
Diff reports summarize what changed between two dataset versions; they are part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For provenance, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
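A diff report between two versions can be sketched as set operations on manifests that map record id to content hash. The ids and hash strings below are synthetic placeholders.

```python
def diff_report(old: dict, new: dict) -> dict:
    """old/new map record id -> content hash. Returns sorted id lists per change type."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

v1 = {"r1": "aa", "r2": "bb", "r3": "cc"}
v2 = {"r1": "aa", "r2": "b2", "r4": "dd"}
report = diff_report(v1, v2)
print(report)
```

Sorting the id lists keeps the report itself deterministic, so two runs of the diff can be compared byte for byte.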
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
6.3 Deprecation
Deprecation marks a dataset version as no longer recommended for new training runs; it is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For lineage, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
6.4 Reproducible manifests
Reproducible manifests list every shard with its hash so that a dataset name resolves to exact bytes; they are part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For licenses, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
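A manifest is reproducible only if its own hash does not depend on incidental ordering. The sketch below hashes a canonical JSON encoding of the shard list. The shard paths, placeholder hashes, and field names are assumptions.

```python
import hashlib
import json

def manifest_hash(shards: list) -> str:
    """Hash a canonical JSON encoding so the result ignores listing order."""
    canonical = sorted(shards, key=lambda s: s["path"])
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

shards = [
    {"path": "shard-00001.jsonl", "sha256": "ab12", "records": 1000},
    {"path": "shard-00000.jsonl", "sha256": "cd34", "records": 998},
]

h1 = manifest_hash(shards)
h2 = manifest_hash(list(reversed(shards)))  # same shards, different listing order
```

Sorting by path and fixing the key order and separators removes the usual sources of spurious hash differences between two otherwise identical builds.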
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
6.5 Model-to-data linkage
Model-to-data linkage records which exact data manifest each trained model consumed; it is part of the canonical scope of documentation and governance. We model the relevant object as a finite collection $\mathcal{D} = \{(m_i, x_i)\}_{i=1}^{n}$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For a risk register, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
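Linking a model to its data can be as small as storing the manifest hash next to the checkpoint and verifying it before any audit or retraining. The field names, model name, and hash values in this sketch are hypothetical.

```python
def link_model(model_name: str, dataset_version: str, manifest_sha256: str) -> dict:
    """Metadata stored next to the checkpoint (field names are hypothetical)."""
    return {"model": model_name, "dataset_version": dataset_version,
            "manifest_sha256": manifest_sha256}

def verify_link(link: dict, manifest_sha256: str) -> None:
    """Fail fast if the manifest at hand is not the one the model was trained on."""
    if link["manifest_sha256"] != manifest_sha256:
        raise ValueError(
            f"{link['model']} was trained on manifest {link['manifest_sha256']}, "
            f"found {manifest_sha256}")

link = link_model("toy-lm", "1.5.0", "cd34ef")
verify_link(link, "cd34ef")  # passes: hashes match
```

Storing the hash rather than only the version string guards against the case where the same version name was reused for different data.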
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$
\alpha = \frac{n_{\text{out}}}{n_{\text{in}}}.
$$
A sudden change in $\alpha$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$
\alpha_{\text{tok}} = \frac{\sum_{i \in \mathcal{D}_{\text{out}}} T_i}{\sum_{i \in \mathcal{D}_{\text{in}}} T_i},
$$
where $\mathcal{D}_{\text{in}}$ and $\mathcal{D}_{\text{out}}$ are the records entering and leaving the stage, and $T_i$ is the token count or a deterministic token-count estimate of record $i$. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
7. Common Mistakes
| # | Mistake | Why It Is Wrong | Fix |
|---|---|---|---|
| 1 | Trusting a file because it exists | A zero-byte or unparsable artifact can still pass a loose path check | Validate content and parseability |
| 2 | Counting documents but not tokens | Long documents dominate compute | Report both document and token rates |
| 3 | Changing schemas without versioning | Old and new records become indistinguishable | Pin schema versions in every record |
| 4 | Dropping metadata during transforms | Audits and removals become impossible | Preserve source and transform lineage |
| 5 | Using nondeterministic ordering | Rebuilds cannot be compared | Seed and record ordering rules |
| 6 | Ignoring failed records | Silent loss can bias the corpus | Quarantine and summarize failures |
| 7 | Treating filters as neutral | Filters encode preferences and tradeoffs | Ablate and audit every major filter |
| 8 | Mixing train and eval sources | Evaluation becomes contaminated | Run overlap audits before release |
| 9 | Optimizing one aggregate score | Small domains can regress | Track slice metrics |
| 10 | Skipping data cards | Users cannot judge intended use or risk | Publish structured documentation |
| 11 | Assuming licenses are uniform | Source terms can conflict | Track license at source and record level |
| 12 | Forgetting reproducible manifests | The same name can refer to different data | Use hashes and version pins |
8. Exercises
- (*) Build a synthetic `data card` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (*) Build a synthetic `provenance` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (*) Build a synthetic `lineage` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (**) Build a synthetic `license` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (**) Build a synthetic `risk register` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (**) Build a synthetic `version` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (**) Build a synthetic `governance` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (***) Build a synthetic `data card` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (***) Build a synthetic `provenance` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (***) Build a synthetic `lineage` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
9. Why This Matters for AI
| Concept | AI impact |
|---|---|
| data card | Documents intent, sources, processing, and limitations so users can judge whether the data fits their training run |
| provenance | Ties each record to a source URI and snapshot, so audits and removals can find what the model actually saw |
| lineage | Records the transforms and hashes between raw source and training shard, so a rebuild can be compared with the original |
| license | Determines which records may be used for training and release, since source terms can conflict |
| risk register | Lists known risks, such as PII and train/eval overlap, together with their review status |
| version | Distinguishes dataset builds so that the same name never refers to different data |
| governance | Controls access, PII review, takedown, and release approval before data reaches a training run |
Data pipeline quality is model quality in delayed form. The model eventually converts these records into gradients; any unresolved ambiguity becomes either wasted compute, misleading evaluation, memorization risk, or irreproducible science.
10. Conceptual Bridge
This section connects the previous and next pieces of the curriculum as follows:
raw sources -> records -> validation -> assembly -> audits -> documentation -> mixture
The next section is [Data Mixture Optimization](../07-Data-Mixture-Optimization/notes.md). It uses the contracts established here and moves one step further through the LLM data pipeline.