NotesMath for LLMs

Data Format Standards

LLM Training Data Pipeline / Data Format Standards

Notes

"A training record is a small object with a large blast radius."

Overview

Data format standards define the mathematical and engineering contract between raw examples and the training loop. In an LLM training run, data is not an inert pile of text; it is the empirical distribution that defines the examples, losses, risks, and capabilities the model will see.

This section is written as LaTeX Markdown. Inline mathematics uses $...$, and display equations use `

......

`. The goal is to connect data engineering decisions to mathematical objects such as records rir_i, token sequences x1:Tx_{1:T}, filters f(x)f(x), hashes h(x)h(x), mixture weights α\boldsymbol{\alpha}, and empirical expectations.

The scope is deliberately narrow: this chapter owns the training-data pipeline. Tokenizer design, GPU training systems, benchmark methodology, alignment objectives, and production MLOps each have their own canonical chapters. Here we study the data objects that those later systems consume.

Prerequisites

Companion Notebooks

NotebookDescription
theory.ipynbExecutable demonstrations for data format standards
exercises.ipynbGraded practice for data format standards

Learning Objectives

After completing this section, you will be able to:

  • Define records, schemas, token streams, shards, and provenance identifiers
  • Distinguish raw documents, pretraining records, SFT messages, and preference pairs
  • Validate JSONL-style examples with deterministic type and key checks
  • Explain when JSONL, Parquet, Arrow, or tokenized binary formats are appropriate
  • Use stable hashes to identify records and preserve reproducibility
  • Design metadata fields for source, license, language, quality, and split information
  • Connect schema design to downstream loss computation and evaluation isolation
  • Recognize format errors that silently change the training objective

Table of Contents


1. Intuition

Intuition gives the conceptual and mathematical layer for data format standards. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

1.1 Data as a training contract

Data as a training contract is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For record, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.2 Records vs documents vs token streams

Records vs documents vs token streams is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For schema, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.3 Why format bugs become model bugs

Why format bugs become model bugs is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For JSONL, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.4 Pretraining, SFT, and preference formats

Pretraining, SFT, and preference formats is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For metadata, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.5 Pipeline history from raw web text to curated corpora

Pipeline history from raw web text to curated corpora is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For provenance, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2. Formal Definitions

Formal Definitions gives the conceptual and mathematical layer for data format standards. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

2.1 Record rir_i

Record rir_i is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For record, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2.2 Schema S\mathcal{S}

Schema S\mathcal{S} is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record- level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For schema, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2.3 Text field and metadata field

Text field and metadata field is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For JSONL, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2.4 Token sequence x1:Tx_{1:T}

Token sequence x1:Tx_{1:T} is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For metadata, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2.5 Source, split, shard, and provenance identifiers

Source, split, shard, and provenance identifiers is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For provenance, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3. Canonical Schemas

Canonical Schemas gives the conceptual and mathematical layer for data format standards. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

3.1 Raw text document schema

Raw text document schema is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For record, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3.2 Pretraining document schema

Pretraining document schema is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For schema, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3.3 Chat/SFT messages schema

Chat/SFT messages schema is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For JSONL, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3.4 Pairwise preference schema

Pairwise preference schema is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For metadata, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3.5 Evaluation-holdout schema

Evaluation-holdout schema is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For provenance, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

4. Storage Formats

Storage Formats gives the conceptual and mathematical layer for data format standards. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

4.1 JSONL

JSONL is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For record, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

4.2 Parquet and Arrow

Parquet and Arrow is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For schema, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

4.3 Tokenized binary arrays

Tokenized binary arrays is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For JSONL, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

4.4 Sharded compressed files

Sharded compressed files is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For metadata, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

4.5 Manifest files and checksums

Manifest files and checksums is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For provenance, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

5. Validation Rules

Validation Rules gives the conceptual and mathematical layer for data format standards. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

5.1 Required keys

Required keys is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For record, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

5.2 Type validation

Type validation is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For schema, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

5.3 Unicode and whitespace normalization

Unicode and whitespace normalization is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For JSONL, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

5.4 Text-length constraints

Text-length constraints is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For metadata, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

5.5 Deterministic IDs and hashes

Deterministic IDs and hashes is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For provenance, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

6. Applications

Applications gives the conceptual and mathematical layer for data format standards. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

6.1 Pretraining corpora

Pretraining corpora is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record- level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For record, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

6.2 Continual pretraining

Continual pretraining is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record- level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For schema, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

6.3 SFT

SFT is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For JSONL, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

6.4 Preference data

Preference data is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For metadata, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

6.5 Data release packages

Data release packages is part of the canonical scope of data format standards. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record- level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For provenance, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

7. Common Mistakes

#MistakeWhy It Is WrongFix
1Trusting a file because it existsA zero-byte or unparsable artifact can still pass a loose path checkValidate content and parseability
2Counting documents but not tokensLong documents dominate computeReport both document and token rates
3Changing schemas without versioningOld and new records become indistinguishablePin schema versions in every record
4Dropping metadata during transformsAudits and removals become impossiblePreserve source and transform lineage
5Using nondeterministic orderingRebuilds cannot be comparedSeed and record ordering rules
6Ignoring failed recordsSilent loss can bias the corpusQuarantine and summarize failures
7Treating filters as neutralFilters encode preferences and tradeoffsAblate and audit every major filter
8Mixing train and eval sourcesEvaluation becomes contaminatedRun overlap audits before release
9Optimizing one aggregate scoreSmall domains can regressTrack slice metrics
10Skipping data cardsUsers cannot judge intended use or riskPublish structured documentation
11Assuming licenses are uniformSource terms can conflictTrack license at source and record level
12Forgetting reproducible manifestsThe same name can refer to different dataUse hashes and version pins

8. Exercises

  1. (*) Build a synthetic record example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  2. (*) Build a synthetic schema example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  3. (*) Build a synthetic JSONL example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  4. (**) Build a synthetic metadata example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  5. (**) Build a synthetic provenance example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  6. (**) Build a synthetic token sequence example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  7. (**) Build a synthetic manifest example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  8. (***) Build a synthetic record example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  9. (***) Build a synthetic schema example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  10. (***) Build a synthetic JSONL example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.

9. Why This Matters for AI

ConceptAI impact
recordControls what examples, gradients, risks, or audits the model pipeline can represent
schemaControls what examples, gradients, risks, or audits the model pipeline can represent
JSONLControls what examples, gradients, risks, or audits the model pipeline can represent
metadataControls what examples, gradients, risks, or audits the model pipeline can represent
provenanceControls what examples, gradients, risks, or audits the model pipeline can represent
token sequenceControls what examples, gradients, risks, or audits the model pipeline can represent
manifestControls what examples, gradients, risks, or audits the model pipeline can represent

Data pipeline quality is model quality in delayed form. The model eventually converts these records into gradients; any unresolved ambiguity becomes either wasted compute, misleading evaluation, memorization risk, or irreproducible science.

10. Conceptual Bridge

This section connects the previous and next pieces of the curriculum as follows:

raw sources -> records -> validation -> assembly -> audits -> documentation -> mixture

The next section is JSONL Generation. It uses the contracts established here and moves one step further through the LLM data pipeline.

References