JSONL Generation, Part 1: 1. Intuition to 3. Source Extraction

1. Intuition

This section builds the conceptual and mathematical intuition for JSONL generation. Read the variables in this section as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

1.1 JSONL as streamable training data

JSONL as streamable training data is part of the canonical scope of JSONL generation. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For a generator, the invariant should be explicit enough that a checker can fail fast. If the invariant exists only in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

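where $f(r_i) \in \{0, 1\}$ indicates whether the filter keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.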
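The gap between the two rates is easy to reproduce on synthetic numbers. The sketch below assumes a made-up corpus of 100 documents (95 short, 5 long) and a hypothetical length filter; the function name `acceptance_rates` and the 500-token threshold are illustrative choices, not part of any library.

```python
from typing import Callable, Sequence, Tuple


def acceptance_rates(
    token_counts: Sequence[int], keep: Callable[[int], bool]
) -> Tuple[float, float]:
    """Return (document acceptance rate a, token acceptance rate a_tok)."""
    n_in = len(token_counts)
    total_tokens = sum(token_counts)
    if n_in == 0 or total_tokens == 0:
        raise ValueError("need at least one document with a nonzero token count")
    kept = [t for t in token_counts if keep(t)]
    return len(kept) / n_in, sum(kept) / total_tokens


# Synthetic corpus (assumed numbers): 95 documents of 70 tokens, 5 of 570 tokens.
token_counts = [70] * 95 + [570] * 5

# Hypothetical filter: drop any document longer than 500 tokens.
a, a_tok = acceptance_rates(token_counts, keep=lambda t: t <= 500)
print(f"a = {a:.2f}, a_tok = {a_tok:.2f}")  # a = 0.95, a_tok = 0.70
```

Here the filter drops 5 percent of documents but 30 percent of tokens, which is why a stage should log both numbers.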

1.2 Serialization as deterministic map $g(r_i)$

Serialization as a deterministic map $g(r_i)$ is part of the canonical scope of JSONL generation. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For serialization, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

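For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.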
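One possible reading of the map $g$ is "record in, one JSONL line out, with no dependence on dictionary insertion order". The sketch below uses Python's standard `json` module; the field names (`id`, `text`, `meta`) are assumptions made for illustration, not a schema taken from this lesson.

```python
import json
from typing import Any, Mapping


def g(record: Mapping[str, Any]) -> str:
    """Serialize one record to one JSONL line (without the trailing newline)."""
    return json.dumps(
        record,
        sort_keys=True,         # output no longer depends on key insertion order
        separators=(",", ":"),  # fixed separators, no optional whitespace
        ensure_ascii=False,     # keep non-ASCII text as-is; encode as UTF-8 on write
    )


# Two dicts with the same content but different key order (hypothetical fields).
r1 = {"id": "doc-1", "text": "hello", "meta": {"source": "synthetic", "lang": "en"}}
r2 = {"meta": {"lang": "en", "source": "synthetic"}, "text": "hello", "id": "doc-1"}

line1, line2 = g(r1), g(r2)
assert line1 == line2  # same record content maps to the same line
print(line1)
```

Because `sort_keys=True` also applies to nested objects, the nested `meta` dictionary serializes identically in both cases.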

1.3 Why one-record-per-line matters

Why one-record-per-line matters is part of the canonical scope of JSONL generation. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For a shard, the invariant should be explicit enough that a checker can fail fast. If the invariant exists only in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

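For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.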
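The one-record-per-line property can be checked directly. The sketch below is a minimal illustration with two made-up records held in an in-memory buffer: it shows that a newline inside the text is escaped by the serializer, and that a reader can process the file one line at a time while counting bad lines instead of aborting.

```python
import io
import json

records = [
    {"id": "a", "text": "first line\nsecond line"},  # newline embedded in the text
    {"id": "b", "text": "single line"},
]

# Write: json.dumps escapes the embedded newline as the two characters \ and n,
# so each record occupies exactly one physical line.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read: stream line by line; a malformed line is counted, not fatal.
buffer.seek(0)
parsed, bad_lines = [], 0
for line in buffer:
    try:
        parsed.append(json.loads(line))
    except json.JSONDecodeError:
        bad_lines += 1

assert parsed == records and bad_lines == 0
assert buffer.getvalue().count("\n") == len(records)
```

The same loop works on a real file handle, so memory use stays bounded by the longest line rather than by the file size.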

1.4 Reproducibility

Reproducibility is part of the canonical scope of JSONL generation. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For quarantine, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

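For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.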
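Since a path on disk without a manifest is not a reproducible dataset, it may help to see what a small manifest could contain. The sketch below is one possible layout, not a standard format: the function `build_manifest`, the field names, and the shard names are all hypothetical, and the shards are held in memory to keep the example self-contained.

```python
import hashlib
from typing import Dict


def build_manifest(shards: Dict[str, bytes], generator_version: str) -> dict:
    """Describe each shard by record count, byte size, and SHA-256 digest."""
    entries = []
    for name in sorted(shards):  # sorted so the manifest itself is deterministic
        data = shards[name]
        entries.append({
            "name": name,
            "num_records": data.count(b"\n"),  # one newline-terminated line per record
            "num_bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return {"generator_version": generator_version, "shards": entries}


# Synthetic shards (assumed content), deliberately listed out of order.
shards = {
    "shard-00001.jsonl": b'{"id":"c"}\n',
    "shard-00000.jsonl": b'{"id":"a"}\n{"id":"b"}\n',
}
manifest = build_manifest(shards, generator_version="0.1.0")

# Rebuilding from the same bytes gives the same manifest, whatever the dict order.
assert manifest == build_manifest(dict(reversed(list(shards.items()))), "0.1.0")
```

A later run can recompute the digests and compare them with the stored manifest to detect silent changes in a shard.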

1.5 Failure modes in large generation jobs

Failure modes in large generation jobs are part of the canonical scope of JSONL generation. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For idempotence, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

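For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.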
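The invariant $\text{valid}(r_i, \mathcal{S})$ can be turned into an executable check. The sketch below assumes that $\mathcal{S}$ is a schema mapping required field names to types, which is an interpretation rather than something this lesson states; the field names and the `partition` helper are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema S: required field name -> expected Python type.
SCHEMA: Dict[str, type] = {"id": str, "text": str, "num_tokens": int}


def valid(record: Any, schema: Dict[str, type]) -> bool:
    """True if the record is a dict with every required field of the expected type."""
    if not isinstance(record, dict):
        return False
    return all(
        field in record and isinstance(record[field], expected)
        for field, expected in schema.items()
    )


def partition(records: List[Any], schema: Dict[str, type]) -> Tuple[list, list]:
    """Split records into (accepted, quarantined) without dropping anything silently."""
    accepted, quarantined = [], []
    for record in records:
        (accepted if valid(record, schema) else quarantined).append(record)
    return accepted, quarantined


records = [
    {"id": "a", "text": "ok", "num_tokens": 1},
    {"id": "b", "num_tokens": 3},                  # missing "text"
    {"id": "c", "text": "bad", "num_tokens": "7"}, # wrong type for "num_tokens"
]
accepted, quarantined = partition(records, SCHEMA)
rate = len(accepted) / len(records)
print(f"accepted {len(accepted)}/{len(records)} (rate {rate:.2f})")
```

Keeping the rejected records in a quarantine list, rather than discarding them, leaves an audit sample for explaining a sudden change in the acceptance rate.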

2. Formal Definitions

This section states the formal definitions used for JSONL generation. Read the variables in this section as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

2.1 Generator function

The generator function is part of the canonical scope of JSONL generation. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

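For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.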
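In Python, a natural way to express a generator function is a literal generator that yields one record at a time. The sketch below is an assumption about what such a function might look like: the name `generate_records`, the id format, and the fields are hypothetical and chosen only to keep the example small.

```python
from typing import Dict, Iterable, Iterator


def generate_records(texts: Iterable[str], source: str) -> Iterator[Dict[str, object]]:
    """Lazily map raw texts to records; ids depend only on source name and position."""
    for index, text in enumerate(texts):
        if not text.strip():
            continue  # skip empty inputs; a real pipeline would also count them
        yield {
            "id": f"{source}-{index:08d}",
            "text": text,
            "num_chars": len(text),
        }


# Synthetic input: the second item is whitespace only and is skipped.
records = list(generate_records(["alpha", "  ", "gamma"], source="synthetic"))
ids = [r["id"] for r in records]
print(ids)  # ['synthetic-00000000', 'synthetic-00000002']
```

Because the id is derived from the input position rather than from a counter of emitted records, skipping an input does not shift the ids of later records.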

2.2 Canonical JSON encoding

Canonical JSON encoding is part of the canonical scope of JSONL generation. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

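For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.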
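A canonical encoding should produce the same bytes for the same record every time, which also makes content hashes meaningful. The sketch below is a simplified convention built on the standard `json` module, not a full implementation of a standard such as RFC 8785; the helper names `canonical_bytes` and `content_id` are hypothetical.

```python
import hashlib
import json
import math


def canonical_bytes(record: dict) -> bytes:
    """Encode a record to UTF-8 bytes with sorted keys and fixed separators."""
    return json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
        allow_nan=False,  # NaN and Infinity are not valid JSON; raise instead of emitting them
    ).encode("utf-8")


def content_id(record: dict) -> str:
    """SHA-256 hex digest of the canonical bytes."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


# Key order does not change the bytes or the digest.
assert canonical_bytes({"b": 1, "a": 2}) == b'{"a":2,"b":1}'
assert content_id({"b": 1, "a": 2}) == content_id({"a": 2, "b": 1})

# A non-finite float is rejected at encoding time.
try:
    canonical_bytes({"score": math.nan})
    nan_rejected = False
except ValueError:
    nan_rejected = True
assert nan_rejected
```

Rejecting NaN early is a design choice: it fails at the writer, where the offending record is known, instead of at a downstream parser.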

2.3 Idempotence

Idempotence is part of the canonical scope of JSONL generation. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

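For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.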
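A transformation $h$ is idempotent when $h(h(x)) = h(x)$, so re-running a stage on its own output changes nothing. The sketch below tests that property on a hypothetical text normalizer; the specific normalization steps are assumptions chosen for illustration, not the ones prescribed by this lesson.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFC, unify line endings, and trim trailing and outer whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return "\n".join(line.rstrip() for line in text.split("\n")).strip()


# Synthetic input: decomposed accent, Windows line endings, stray whitespace.
raw = "  Cafe\u0301 au lait  \r\nsecond line\t\r\n"

once = normalize_text(raw)
twice = normalize_text(once)

assert once == "Caf\u00e9 au lait\nsecond line"
assert once == twice  # applying the normalizer again changes nothing
```

An idempotence check like the final assertion is cheap to run on an audit sample after every change to the normalizer.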

2.4 Deterministic ordering

Deterministic ordering is part of the canonical scope of JSONL generation. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

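For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.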
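Deterministic ordering usually needs two ingredients: a stable sort key and a shard assignment that does not depend on the process or on arrival order. The sketch below assumes each record has a unique string `id` (a hypothetical field) and uses SHA-256 instead of Python's built-in `hash()`, because string hashes from `hash()` are randomized per process by default.

```python
import hashlib
from typing import Dict, List


def shard_index(record_id: str, num_shards: int) -> int:
    """Map an id to a shard using a hash that is stable across processes."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def assign_shards(records: List[dict], num_shards: int) -> Dict[int, List[str]]:
    """Return shard -> ordered list of ids, independent of the input order."""
    shards: Dict[int, List[str]] = {i: [] for i in range(num_shards)}
    for record in sorted(records, key=lambda r: r["id"]):  # stable total order
        shards[shard_index(record["id"], num_shards)].append(record["id"])
    return shards


# Synthetic records, processed in two different arrival orders.
records = [{"id": f"doc-{i:03d}"} for i in range(20)]
forward = assign_shards(records, num_shards=4)
backward = assign_shards(list(reversed(records)), num_shards=4)

assert forward == backward
assert sum(len(ids) for ids in forward.values()) == len(records)
```

With this layout, parallel workers that receive records in different orders still produce identical shard contents.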

2.5 Atomic write contract

The atomic write contract is part of the canonical scope of JSONL generation. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

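For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.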
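A common way to approximate an atomic write on a local filesystem is to write to a temporary file in the same directory and then rename it over the target. The sketch below follows that pattern with `os.replace`; the function name `write_jsonl_atomic` and the shard name are hypothetical, and behavior on network or object storage may differ.

```python
import json
import os
import tempfile
from typing import Iterable


def write_jsonl_atomic(path: str, records: Iterable[dict]) -> None:
    """Write records to a temp file, then rename it so readers never see a partial shard."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as tmp:
            for record in records:
                tmp.write(json.dumps(record, sort_keys=True) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the rename
        os.replace(tmp_path, path)  # same-directory rename replaces the target in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)     # do not leave half-written temp files behind
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "shard-00000.jsonl")
    write_jsonl_atomic(target, [{"id": "a"}, {"id": "b"}])
    with open(target, encoding="utf-8") as f:
        lines = f.read().splitlines()
    leftovers = [name for name in os.listdir(workdir) if name.endswith(".tmp")]

assert lines == ['{"id": "a"}', '{"id": "b"}']
assert leftovers == []
```

If the job is killed mid-write, the target path either holds the previous complete shard or does not exist, so a restart can safely regenerate it.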

3. Source Extraction

This section covers source extraction for JSONL generation. Read the variables in this section as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

3.1 Plain text

Plain text is part of the canonical scope of JSONL generation. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

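For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.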
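For plain-text sources, the first place a record can become invalid is decoding. The sketch below is one conservative approach, assuming UTF-8 input: decode strictly and return `None` on failure so the caller can quarantine the file. The function name `extract_plain_text` is hypothetical.

```python
from typing import Optional


def extract_plain_text(raw: bytes) -> Optional[str]:
    """Decode UTF-8 strictly; return None instead of guessing on invalid bytes."""
    try:
        text = raw.decode("utf-8-sig")  # also strips a leading byte-order mark
    except UnicodeDecodeError:
        return None
    return text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings


good = extract_plain_text(b"\xef\xbb\xbfline one\r\nline two\r\n")
bad = extract_plain_text(b"\xff\xfe\x00")  # not valid UTF-8

assert good == "line one\nline two\n"
assert bad is None
```

Decoding with replacement characters would keep the acceptance rate high but hide the failure, so strict decoding plus a quarantine count is the more auditable choice.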

3.2 HTML extraction preview

HTML extraction is part of the canonical scope of JSONL generation; this subsection is a preview. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

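For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.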
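As a preview of HTML extraction, the sketch below uses only the standard library's `html.parser` to keep visible text and drop `script` and `style` content. It is a deliberately naive baseline under that assumption; production extractors typically also handle boilerplate removal, which is not attempted here. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text chunks, ignoring script and style elements."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Tom &amp; Jerry</p>"
    "<script>var x = 1;</script></body></html>"
)
extracted = html_to_text(page)
assert extracted == "Title\nTom & Jerry"
```

Comparing token counts before and after extraction gives a token acceptance rate for this stage, in the same sense as $a_{\mathrm{tok}}$.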

3.3 Code/source files

Code and source files are part of the canonical scope of JSONL generation. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

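For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.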

3.4 PDF/OCR preview

PDF and OCR extraction is part of the canonical scope of JSONL generation; this subsection is a preview. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

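For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.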

3.5 Document boundaries

Document boundaries are part of the canonical scope of JSONL generation. We model the dataset as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ of records, where record $r_i$ carries metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

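For LLM work, the token-weighted acceptance rate is often more informative than the document-level rate $a$:

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i) \in \{0, 1\}$ indicates whether the stage keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.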
