Full Dataset Assembly, Part 2: Assembly Algorithms to References

4. Assembly Algorithms

This section gives the conceptual and mathematical layer of the assembly algorithms used in full dataset assembly. Read its local variables as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

4.1 Concatenation

Concatenation is part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant, with $\mathcal{S}$ read as the contract the stage enforces (for example, a schema), is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For a source set, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
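The fail-fast idea can be sketched as a small checker. This is only an illustration under stated assumptions: the field names (`id`, `source`, `text`) and the `REQUIRED_FIELDS` contract are hypothetical stand-ins for $\mathcal{S}$, not a schema taken from the lesson.

```python
# Hypothetical contract: field name -> required Python type.
REQUIRED_FIELDS = {"id": str, "source": str, "text": str}

def valid(record: dict, contract: dict = REQUIRED_FIELDS) -> bool:
    """Return True when the record satisfies every field/type requirement."""
    return all(
        name in record and isinstance(record[name], typ)
        for name, typ in contract.items()
    )

def check_all(records: list[dict]) -> int:
    """Fail fast: raise on the first invalid record, else return the count."""
    for index, record in enumerate(records):
        if not valid(record):
            raise ValueError(f"record {index} violates the stage contract")
    return len(records)

good = [{"id": "a", "source": "web", "text": "hello"}]
bad = good + [{"id": "b", "source": "web"}]  # second record is missing "text"
```

Calling `check_all(good)` returns the record count, while `check_all(bad)` raises before any downstream stage sees the malformed record.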

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the two rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
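A short sketch makes the gap between the two rates concrete. The corpus and the length filter below are synthetic assumptions chosen for illustration; with these numbers the filter drops 5 percent of documents but roughly a third of the tokens.

```python
def acceptance_rates(token_counts, keep):
    """Return (document rate a, token rate a_tok) for a keep/drop decision."""
    kept = [keep(t) for t in token_counts]
    a = sum(kept) / len(token_counts)
    a_tok = sum(t for t, k in zip(token_counts, kept) if k) / sum(token_counts)
    return a, a_tok

# Synthetic corpus: 95 short documents and 5 long ones.
token_counts = [100] * 95 + [1000] * 5
# Hypothetical length filter that drops documents above 500 tokens.
a, a_tok = acceptance_rates(token_counts, keep=lambda t: t <= 500)
```

Here `a` is 0.95 while `a_tok` is 9500/14500, so a report that shows only the document rate would hide most of the loss.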

4.2 Weighted sampling

Weighted sampling is part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For a mixture weight, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
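One way to picture weighted sampling is to draw a source name for each slot with probability proportional to its mixture weight. The sketch below assumes a hypothetical three-source mixture and uses the standard library's seeded generator; it is not the lesson's sampler.

```python
import random
from collections import Counter

def sample_sources(weights: dict[str, float], n: int, seed: int) -> list[str]:
    """Draw n source names with probability proportional to their weights."""
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    names = sorted(weights)    # fixed order, independent of dict insertion
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

weights = {"web": 0.6, "code": 0.3, "books": 0.1}  # hypothetical mixture
draws = sample_sources(weights, n=10_000, seed=0)
observed = {k: v / len(draws) for k, v in Counter(draws).items()}
```

Comparing `observed` with `weights` is the sampling analogue of checking an acceptance rate: the draw is random, but the realized proportions should stay close to the intended ones.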

4.3 Stratified sampling

Stratified sampling is part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For a token budget, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
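Stratified sampling can be sketched as grouping records by a metadata key and sampling the same fraction from each group. The `lang` field, the 10 percent fraction, and the rule of keeping at least one record per stratum are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed):
    """Sample the same fraction from every stratum defined by key(record)."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    rng = random.Random(seed)
    sample = []
    for name in sorted(strata):  # sorted so RNG consumption order is stable
        group = strata[name]
        k = max(1, round(fraction * len(group)))  # keep small strata alive
        sample.extend(rng.sample(group, k))
    return sample

# Synthetic records: 180 tagged "en" and 20 tagged "de".
records = [{"id": i, "lang": "en" if i % 10 else "de"} for i in range(200)]
sample = stratified_sample(records, key=lambda r: r["lang"], fraction=0.1, seed=0)
```

The sample holds 18 `en` and 2 `de` records, so the minority stratum keeps its share instead of depending on the luck of a uniform draw.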

4.4 Train/validation/test split

Train/validation/test split is part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For a manifest, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
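A split that survives restarts and parallel workers can be derived from a stable hash of the record ID rather than from processing order. The sketch below is one such scheme; the salt string, the 5 percent fractions, and the `doc-<n>` IDs are hypothetical.

```python
import hashlib

def assign_split(record_id: str, salt: str = "split-v1",
                 val_frac: float = 0.05, test_frac: float = 0.05) -> str:
    """Map a record ID to train/validation/test using a stable hash."""
    digest = hashlib.sha256(f"{salt}:{record_id}".encode("utf-8")).digest()
    # First 8 bytes as a number that is approximately uniform on [0, 1].
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "validation"
    return "train"

splits = {f"doc-{i}": assign_split(f"doc-{i}") for i in range(10_000)}
counts = {name: list(splits.values()).count(name)
          for name in ("train", "validation", "test")}
```

Because the assignment depends only on the ID and the salt, a record lands in the same split on every rebuild. Changing the salt is a deliberate, versioned way to reshuffle the split.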

4.5 Deterministic shuffling

Deterministic shuffling is part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For a shard, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
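One way to make a shuffle independent of input order is to sort records by a hash of the seed, the epoch, and the record ID. This is a sketch of that idea with hypothetical IDs, not a claim about how any particular data loader shuffles.

```python
import hashlib

def shuffle_key(record_id: str, seed: int, epoch: int) -> str:
    """Stable sort key: hash of (seed, epoch, record ID)."""
    return hashlib.sha256(f"{seed}:{epoch}:{record_id}".encode("utf-8")).hexdigest()

def deterministic_shuffle(record_ids, seed: int, epoch: int = 0):
    """Order IDs by their hashed key; input order does not affect the result."""
    return sorted(record_ids, key=lambda rid: shuffle_key(rid, seed, epoch))

ids = [f"doc-{i}" for i in range(8)]
order_a = deterministic_shuffle(ids, seed=123)
order_b = deterministic_shuffle(list(reversed(ids)), seed=123)  # same result
```

Two workers that receive the same IDs in different orders produce the same permutation, which is what makes rebuilds comparable. Passing a different `epoch` gives a new ordering without changing the seed.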

5. Tokenization and Packing Interface

This section gives the conceptual and mathematical layer of the tokenization and packing interface in full dataset assembly. Read its local variables as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

5.1 Token-count budgets

Token-count budgets are part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
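A token budget can be illustrated by splitting a total across sources in proportion to mixture weights and then asking how many passes over each source that implies. The weights and the available token counts below are synthetic assumptions.

```python
def allocate_budget(total_tokens: int, weights: dict[str, float]) -> dict[str, int]:
    """Split a total token budget across sources in proportion to weights."""
    norm = sum(weights.values())
    budget = {k: int(total_tokens * w / norm) for k, w in weights.items()}
    # Give tokens lost to rounding down to the largest-weight source.
    budget[max(weights, key=weights.get)] += total_tokens - sum(budget.values())
    return budget

weights = {"web": 6, "code": 3, "books": 1}                       # hypothetical
available = {"web": 900_000, "code": 150_000, "books": 200_000}   # tokens on hand
budget = allocate_budget(1_000_000, weights)
# Epochs above 1 mean the source must be repeated to meet its budget.
epochs = {k: budget[k] / available[k] for k in budget}
```

In this example `code` needs two full passes to fill its share, which is the kind of repetition a budget report should surface before training starts.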

5.2 Sequence packing

Sequence packing is part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
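Packing can be sketched as a bin-packing problem over document lengths. The version below uses a simple first-fit rule and refuses documents longer than the sequence length; both choices are assumptions for illustration, and production packers often split or truncate instead.

```python
def pack_documents(doc_lengths: list[int], seq_len: int) -> list[list[int]]:
    """First-fit packing: place each document index in the first bin with room."""
    bins: list[list[int]] = []   # document indices per packed sequence
    free: list[int] = []         # remaining capacity per packed sequence
    for index, length in enumerate(doc_lengths):
        if length > seq_len:
            raise ValueError(f"document {index} exceeds seq_len; truncate or split first")
        for b, room in enumerate(free):
            if length <= room:
                bins[b].append(index)
                free[b] -= length
                break
        else:  # no existing bin had room
            bins.append([index])
            free.append(seq_len - length)
    return bins

doc_lengths = [5, 3, 4, 2, 6]
packed = pack_documents(doc_lengths, seq_len=8)  # -> [[0, 1], [2, 3], [4]]
```

Every document index appears exactly once in the output, so the packed sequences can be checked against the input with a plain count.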

5.3 Document-boundary masks

Document-boundary masks are part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
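A document-boundary mask for a packed sequence can be built from per-position segment IDs. The sketch below assumes causal attention and uses plain Python lists so it runs without dependencies; a real pipeline would more likely store the segment IDs and build the mask on the accelerator.

```python
def segment_ids(doc_lengths: list[int]) -> list[int]:
    """Label each token position with the index of its document."""
    return [doc for doc, n in enumerate(doc_lengths) for _ in range(n)]

def boundary_mask(segments: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j:
    j is not in the future and both positions belong to the same document."""
    n = len(segments)
    return [[j <= i and segments[i] == segments[j] for j in range(n)]
            for i in range(n)]

segments = segment_ids([2, 3])   # -> [0, 0, 1, 1, 1]
mask = boundary_mask(segments)
```

The first token of the second document (position 2) attends only to itself, so no information leaks across the boundary between the two packed documents.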

5.4 Padding/truncation

Padding/truncation is part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
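Padding and truncation both change the token accounting, so a helper that reports what it added and dropped is easier to audit than one that silently reshapes. The pad ID of 0 and the right-side truncation below are assumptions for this sketch.

```python
def pad_or_truncate(tokens: list[int], seq_len: int, pad_id: int = 0):
    """Return (fixed-length tokens, loss mask, number of tokens dropped)."""
    kept = tokens[:seq_len]
    dropped = len(tokens) - len(kept)
    n_pad = seq_len - len(kept)
    # Mask is 1 for real tokens and 0 for padding.
    return kept + [pad_id] * n_pad, [1] * len(kept) + [0] * n_pad, dropped

short_doc = pad_or_truncate([7, 8, 9], seq_len=5)
long_doc = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], seq_len=5)
```

Summing the `dropped` counts over a shard gives the numerator needed for a token-level loss rate, and the mask keeps padded positions out of the training loss.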

5.5 Packed-shard statistics

Packed-shard statistics are part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
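Packed-shard statistics can be as simple as a fill rate and a document density per sequence. The sketch below assumes each packed sequence is described by the list of document lengths it contains; the statistic names are illustrative, not a fixed schema.

```python
def packed_stats(packed_lengths: list[list[int]], seq_len: int) -> dict[str, float]:
    """Summarize a packed shard given per-sequence lists of document lengths."""
    real = sum(sum(seq) for seq in packed_lengths)
    capacity = seq_len * len(packed_lengths)
    docs = sum(len(seq) for seq in packed_lengths)
    return {
        "sequences": len(packed_lengths),
        "real_tokens": real,
        "fill_rate": real / capacity,
        "padding_fraction": 1 - real / capacity,
        "docs_per_sequence": docs / len(packed_lengths),
    }

stats = packed_stats([[5, 3], [4, 2], [6]], seq_len=8)
```

A fill rate of 20/24 here means one sixth of the positions are padding, which is compute spent on tokens that carry no training signal.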

6. Final Verification

This section gives the conceptual and mathematical layer of final verification in full dataset assembly. Read its local variables as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

6.1 Shard count

Shard count is part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
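A shard-count check is more useful when it names the shards that differ instead of comparing two integers. The sketch below compares the names a manifest expects with the names actually present; the `shard-NNNNN.jsonl` naming is a hypothetical convention.

```python
def check_shards(expected: list[str], found: list[str]) -> dict[str, list[str]]:
    """Compare manifest shard names with the names actually present."""
    missing = sorted(set(expected) - set(found))
    unexpected = sorted(set(found) - set(expected))
    return {"missing": missing, "unexpected": unexpected}

expected = [f"shard-{i:05d}.jsonl" for i in range(4)]  # hypothetical naming
found = ["shard-00000.jsonl", "shard-00001.jsonl", "shard-00003.jsonl", "tmp.part"]
report = check_shards(expected, found)
```

Both lists hold four names, so a plain count comparison would pass even though one shard is missing and a leftover partial file has taken its place.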

6.2 Token count

Token count is part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
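Token-count verification can be sketched as two checks: the per-shard counts must add up to the total recorded in the manifest, and that total must sit near the planned budget. The 1 percent tolerance and the shard names are assumptions for this example.

```python
def verify_token_total(shard_tokens: dict[str, int], manifest_total: int,
                       target: int, rel_tol: float = 0.01) -> int:
    """Raise if shard totals disagree with the manifest or miss the target."""
    total = sum(shard_tokens.values())
    if total != manifest_total:
        raise ValueError(f"shards sum to {total}, manifest says {manifest_total}")
    if abs(total - target) > rel_tol * target:
        raise ValueError(f"{total} tokens is outside {rel_tol:.0%} of target {target}")
    return total

total = verify_token_total(
    {"shard-00000": 500, "shard-00001": 500},  # hypothetical per-shard counts
    manifest_total=1000,
    target=1005,
)
```

The first check is exact because it compares two records of the same build; only the comparison with the planned budget gets a tolerance.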

6.3 Source proportions

Source proportions are part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
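Checking source proportions amounts to comparing the realized token share of each source with its target share. The target mixture, the token counts, and the 1.5 percentage-point tolerance below are all synthetic assumptions.

```python
def proportion_drift(token_counts: dict[str, int],
                     target: dict[str, float]) -> dict[str, float]:
    """Return observed minus target token share for every source."""
    total = sum(token_counts.values())
    return {k: token_counts.get(k, 0) / total - target[k] for k in target}

target = {"web": 0.6, "code": 0.3, "books": 0.1}               # hypothetical mixture
token_counts = {"web": 590_000, "code": 320_000, "books": 90_000}
drift = proportion_drift(token_counts, target)
violations = [k for k, d in drift.items() if abs(d) > 0.015]   # hypothetical tolerance
```

Only `code` exceeds the tolerance here, at about two percentage points over target, so the check points at one source instead of failing the whole build without explanation.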

6.4 Reproducible rebuild

Reproducible rebuild is part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
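One way to test whether a rebuild reproduced the dataset is to compare digests of a canonical manifest that records each shard's content hash. The manifest layout below (`seed`, `shards`, `name`, `bytes`, `sha256`) is a hypothetical structure chosen for this sketch.

```python
import hashlib
import json

def manifest_digest(manifest: dict) -> str:
    """Hash a canonical JSON encoding so equal manifests give equal digests."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def shard_entry(name: str, payload: bytes) -> dict:
    """Record a shard's name, size, and content hash."""
    return {"name": name, "bytes": len(payload),
            "sha256": hashlib.sha256(payload).hexdigest()}

build_a = {"seed": 0, "shards": [shard_entry("shard-00000", b"alpha\n")]}
build_b = {"shards": [shard_entry("shard-00000", b"alpha\n")], "seed": 0}
build_c = {"seed": 0, "shards": [shard_entry("shard-00000", b"alpha!\n")]}
```

`build_a` and `build_b` differ only in key order and share a digest, while the single changed byte in `build_c` changes its digest. Sorting keys before hashing is what keeps incidental ordering from looking like a data change.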

6.5 Smoke-test data loader

Smoke-test data loader is part of the canonical scope of full dataset assembly. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

As in Section 4.1, the token-weighted acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $f(r_i) \in \{0, 1\}$ is 1 when the stage keeps record $r_i$, and $T_i$ is the token count or a deterministic token-count estimate. The distinction between the document-weighted and token-weighted rates matters for compute budgets, mixture proportions, and scaling-law interpretation.
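A smoke test for a data loader reads a small shard end to end and asserts basic shape and range properties on every batch. The sketch below writes a synthetic JSONL shard to a temporary directory; the `tokens` field, the shard name, and the vocabulary size of 50 are assumptions.

```python
import json
import tempfile
from pathlib import Path

def iter_batches(path: Path, batch_size: int):
    """Yield lists of token-ID lists read from a JSONL shard."""
    batch = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            batch.append(json.loads(line)["tokens"])
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final partial batch

def smoke_test(path: Path, batch_size: int, seq_len: int, vocab_size: int) -> int:
    """Read every batch once and assert basic shape and range properties."""
    n_batches = 0
    for batch in iter_batches(path, batch_size):
        assert 0 < len(batch) <= batch_size
        for tokens in batch:
            assert len(tokens) == seq_len
            assert all(0 <= t < vocab_size for t in tokens)
        n_batches += 1
    return n_batches

with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard-00000.jsonl"  # hypothetical shard name
    rows = [{"tokens": [(i + j) % 50 for j in range(8)]} for i in range(10)]
    shard.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
    n_batches = smoke_test(shard, batch_size=4, seq_len=8, vocab_size=50)
```

Ten rows with a batch size of four give three batches, the last one partial. The test is cheap enough to run on every build before any accelerator time is spent.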

7. Common Mistakes

| # | Mistake | Why It Is Wrong | Fix |
| --- | --- | --- | --- |
| 1 | Trusting a file because it exists | A zero-byte or unparsable artifact can still pass a loose path check | Validate content and parseability |
| 2 | Counting documents but not tokens | Long documents dominate compute | Report both document and token rates |
| 3 | Changing schemas without versioning | Old and new records become indistinguishable | Pin schema versions in every record |
| 4 | Dropping metadata during transforms | Audits and removals become impossible | Preserve source and transform lineage |
| 5 | Using nondeterministic ordering | Rebuilds cannot be compared | Seed and record ordering rules |
| 6 | Ignoring failed records | Silent loss can bias the corpus | Quarantine and summarize failures |
| 7 | Treating filters as neutral | Filters encode preferences and tradeoffs | Ablate and audit every major filter |
| 8 | Mixing train and eval sources | Evaluation becomes contaminated | Run overlap audits before release |
| 9 | Optimizing one aggregate score | Small domains can regress | Track slice metrics |
| 10 | Skipping data cards | Users cannot judge intended use or risk | Publish structured documentation |
| 11 | Assuming licenses are uniform | Source terms can conflict | Track license at source and record level |
| 12 | Forgetting reproducible manifests | The same name can refer to different data | Use hashes and version pins |

8. Exercises

(*) Build a synthetic source set example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
(*) Build a synthetic mixture weight example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
(*) Build a synthetic token budget example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
(**) Build a synthetic manifest example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
(**) Build a synthetic shard example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
(**) Build a synthetic split example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
(**) Build a synthetic packing example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
(***) Build a synthetic source set example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
(***) Build a synthetic mixture weight example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
(***) Build a synthetic token budget example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.

9. Why This Matters for AI

| Concept | AI impact |
| --- | --- |
| source set | Determines which examples can reach the model at all |
| mixture weight | Sets how often each source contributes examples, and therefore gradients |
| token budget | Bounds how many tokens, and so how much training compute, each source consumes |
| manifest | Determines what can be audited and rebuilt after the fact |
| shard | Defines the unit that parallel workers load, validate, and retry |
| split | Keeps evaluation data out of training so that metrics stay meaningful |
| packing | Decides which documents share a training sequence and where their boundaries fall |

Data pipeline quality is model quality in delayed form. The model eventually converts these records into gradients; any unresolved ambiguity becomes wasted compute, misleading evaluation, memorization risk, or irreproducible science.

10. Conceptual Bridge

This section connects the previous and next pieces of the curriculum as follows:

raw sources -> records -> validation -> assembly -> audits -> documentation -> mixture

The next section is [Contamination and Dedup Audits](../05-Contamination-and-Dedup-Audits/notes.md). It uses the contracts established here and moves one step further through the LLM data pipeline.
