JSONL Generation: Part 4: Record Construction to References
4. Record Construction
Record Construction gives the conceptual and mathematical layer for JSONL generation. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.
4.1 Field mapping
Field mapping is part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For the generator, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
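The gap between the two rates can be reproduced on synthetic data. The sketch below is only an illustration: the function name `acceptance_rates`, the `n_tokens` field, and the corpus sizes are assumptions chosen so that a length filter removes 5 percent of documents and 30 percent of tokens.

```python
def acceptance_rates(records_in, records_out, count_tokens):
    """Return (document acceptance rate, token acceptance rate) for one stage."""
    if not records_in:
        raise ValueError("stage received no records")
    tokens_in = sum(count_tokens(r) for r in records_in)
    tokens_out = sum(count_tokens(r) for r in records_out)
    doc_rate = len(records_out) / len(records_in)
    tok_rate = tokens_out / tokens_in if tokens_in else 0.0
    return doc_rate, tok_rate


# Synthetic corpus: 95 short documents and 5 long ones.
# The token counts are stored on each record (hypothetical `n_tokens` field).
short_docs = [{"id": f"s{i}", "n_tokens": 70} for i in range(95)]
long_docs = [{"id": f"l{i}", "n_tokens": 570} for i in range(5)]
corpus = short_docs + long_docs

# A length filter that drops every document above 500 tokens.
kept = [r for r in corpus if r["n_tokens"] <= 500]

doc_rate, tok_rate = acceptance_rates(corpus, kept, lambda r: r["n_tokens"])
print(f"document acceptance rate: {doc_rate:.2f}")
print(f"token acceptance rate:    {tok_rate:.2f}")
```

Logging both numbers per stage makes it visible when a filter that looks mild by document count is removing a large share of the training tokens.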
4.2 Metadata preservation
Metadata preservation is part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For serialization, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
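One way to make metadata preservation checkable is a small fail-fast guard around each transform. This is a minimal sketch under stated assumptions: the required field names (`id`, `source`, `license`) and both function names are hypothetical, not taken from the lesson's notebook.

```python
REQUIRED_METADATA = ("id", "source", "license")  # hypothetical field names


def check_metadata_preserved(before, after, keys=REQUIRED_METADATA):
    """Raise ValueError if a transform dropped or altered a required field."""
    for key in keys:
        if key not in after:
            raise ValueError(f"metadata field {key!r} was dropped")
        if before.get(key) != after[key]:
            raise ValueError(f"metadata field {key!r} changed")
    return after


def normalize_text(record):
    """Example transform: collapses whitespace and copies every other field."""
    out = dict(record)
    out["text"] = " ".join(record["text"].split())
    return out


record = {"id": "doc-1", "source": "synthetic", "license": "CC0",
          "text": "hello   world"}
clean = check_metadata_preserved(record, normalize_text(record))
```

Running the guard inside the pipeline, rather than in a later audit, means a transform that silently drops lineage fields stops the build at the first affected record.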
4.3 Token count estimates
Token count estimates are part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For a shard, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
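A deterministic token-count estimate can be as simple as a fixed characters-per-token ratio. The sketch below assumes a ratio of 4 characters per token purely for illustration; the real ratio depends on the tokenizer and the language mix, so it should be calibrated against exact counts on a sample.

```python
import math

# Assumed ratio for illustration only; calibrate against the real tokenizer.
CHARS_PER_TOKEN = 4.0


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Deterministic estimate: the same text always yields the same count."""
    if not text:
        return 0
    return max(1, math.ceil(len(text) / chars_per_token))


def relative_error(estimate, exact):
    """Relative error of an estimate against an exact token count (exact > 0)."""
    return abs(estimate - exact) / exact


sample = "a deterministic estimate keeps rebuilds comparable"
print(estimate_tokens(sample))
```

Because the estimate has no randomness and no external state, two builds of the same corpus report identical token totals, which keeps token acceptance rates comparable across runs.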
4.4 Source trace
Source trace is part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For quarantine, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
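A source trace can be carried on the record itself. The following sketch is one possible layout, not a prescribed schema: the field names `source_uri`, `raw_sha256`, and `transforms`, and the `synthetic://` URI, are all hypothetical.

```python
import hashlib


def attach_source_trace(text, source_uri):
    """Wrap raw text in a record that remembers where it came from."""
    return {
        "text": text,
        "source_uri": source_uri,  # hypothetical field name
        "raw_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "transforms": [],
    }


def apply_traced(record, name, fn):
    """Apply a text transform and append its name to the record's lineage."""
    out = dict(record)
    out["text"] = fn(record["text"])
    out["transforms"] = record["transforms"] + [name]
    return out


raw = attach_source_trace("  Hello World  ", "synthetic://example/1")
traced = apply_traced(apply_traced(raw, "strip", str.strip), "lower", str.lower)
print(traced["transforms"])
```

The hash of the raw text stays fixed while the text changes, so an auditor can match a processed record back to its original source even after several transforms.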
4.5 Error quarantine
Error quarantine is part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For idempotence, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
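Quarantine means that failed records are set aside with a reason instead of being dropped silently. The sketch below assumes a record is valid when it parses as a JSON object with a `text` field; that rule and the function name are illustrative assumptions.

```python
import json


def parse_with_quarantine(lines):
    """Split raw lines into accepted records and quarantined failures."""
    accepted, quarantined = [], []
    for line_no, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
            if not isinstance(record, dict) or "text" not in record:
                raise ValueError("record must be an object with a 'text' field")
            accepted.append(record)
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            quarantined.append({"line_no": line_no, "raw": line, "error": str(exc)})
    return accepted, quarantined


raw_lines = ['{"text": "ok"}', '{"text": ', "[1, 2]", '{"id": 1}']
accepted, quarantined = parse_with_quarantine(raw_lines)
print(f"accepted={len(accepted)} quarantined={len(quarantined)} "
      f"rate={len(accepted) / len(raw_lines):.2f}")
```

Keeping the line number and the raw payload in each quarantine entry lets the failure summary be audited later without rerunning the stage.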
5. Streaming and Sharding
Streaming and Sharding gives the conceptual and mathematical layer for JSONL generation. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.
5.1 Python generators
Python generators are part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For the generator, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
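A Python generator lets the writer consume records one at a time and return the count that the stage should report. This is a minimal sketch with synthetic records; the names `generate_records` and `write_jsonl` are illustrative, and an in-memory `io.StringIO` stands in for a real file.

```python
import io
import json


def generate_records(n):
    """Yield synthetic records lazily; nothing is materialized as a list."""
    for i in range(n):
        yield {"id": f"doc-{i:06d}", "text": f"synthetic document {i}"}


def write_jsonl(records, fh):
    """Write one JSON object per line and return the number of records written."""
    count = 0
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


buffer = io.StringIO()
written = write_jsonl(generate_records(3), buffer)
print(written)
```

`json.dumps` escapes newlines inside strings, so each record occupies exactly one line, which is the property JSONL readers rely on.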
5.2 Memory-safe iteration
Memory-safe iteration is part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For serialization, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
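Memory-safe iteration reads one line at a time and keeps only running totals. The sketch below assumes each record has a `text` field and uses an in-memory buffer in place of a file on disk; both function names are illustrative.

```python
import io
import json


def iter_jsonl(fh):
    """Yield one parsed record per non-empty line of a file-like object."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)


def running_totals(records):
    """Accumulate counts in constant memory, whatever the stream length."""
    n_records = 0
    n_chars = 0
    for record in records:
        n_records += 1
        n_chars += len(record["text"])
    return n_records, n_chars


data = io.StringIO('{"text": "abc"}\n\n{"text": "de"}\n')
totals = running_totals(iter_jsonl(data))
print(totals)
```

The same two functions work unchanged on a file handle opened over a multi-gigabyte shard, because neither ever holds more than one record.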
5.3 Shard rotation
Shard rotation is part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For a shard, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
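Shard rotation closes the current output file and opens the next one when a limit is reached. The sketch below rotates on a record count; the `shard-00000.jsonl` naming scheme and the `write_shards` function are assumptions for illustration, and a real build might rotate on bytes or tokens instead.

```python
import json
import tempfile
from pathlib import Path


def write_shards(records, out_dir, max_records_per_shard):
    """Write records into numbered JSONL shards and return the shard paths."""
    out_dir = Path(out_dir)
    shard_paths = []
    fh = None
    in_shard = 0
    try:
        for record in records:
            # Rotate when there is no open shard or the current one is full.
            if fh is None or in_shard >= max_records_per_shard:
                if fh is not None:
                    fh.close()
                path = out_dir / f"shard-{len(shard_paths):05d}.jsonl"
                fh = path.open("w", encoding="utf-8")
                shard_paths.append(path)
                in_shard = 0
            fh.write(json.dumps(record) + "\n")
            in_shard += 1
    finally:
        if fh is not None:
            fh.close()
    return shard_paths


with tempfile.TemporaryDirectory() as tmp:
    paths = write_shards(({"id": i} for i in range(5)), tmp, max_records_per_shard=2)
    shard_sizes = [len(p.read_text(encoding="utf-8").splitlines()) for p in paths]

print(shard_sizes)
```

Zero-padded shard indices keep lexicographic and numeric order identical, which matters when a later stage lists the directory to rebuild the record order.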
5.4 Compression
Compression is part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For quarantine, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
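Compressed shards can still be streamed line by line. The sketch below uses gzip from the Python standard library as one option (the lesson does not name a codec, so gzip is an assumption) and checks that a write followed by a read returns the same records.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_jsonl_gz(records, path):
    """Write records as gzip-compressed JSONL in text mode."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Stream records back without decompressing the whole file into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)


records = [{"id": i, "text": "synthetic " * 20} for i in range(3)]
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard-00000.jsonl.gz"
    write_jsonl_gz(records, shard)
    round_trip = list(read_jsonl_gz(shard))

print(round_trip == records)
```

A round-trip check of this kind is a cheap invariant to run on every shard before it is recorded in a manifest.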
5.5 Resume/restart logic
Resume/restart logic is part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For idempotence, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
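One common way to make a shard writer safe to restart is to write to a temporary name and rename only when the shard is complete. The sketch below assumes that pattern; `build_shard` and the file naming are hypothetical, and the skip rule treats any existing final file as finished.

```python
import json
import os
import tempfile
from pathlib import Path


def build_shard(shard_id, records, out_dir):
    """Write one shard atomically; return False if it was already built."""
    final_path = Path(out_dir) / f"shard-{shard_id:05d}.jsonl"
    if final_path.exists():
        return False  # finished in an earlier run, so a rerun is a no-op
    tmp_path = final_path.parent / (final_path.name + ".tmp")
    with tmp_path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    # The rename happens only after every record is written, so a crash
    # leaves a .tmp file behind rather than a truncated final shard.
    os.replace(tmp_path, final_path)
    return True


with tempfile.TemporaryDirectory() as tmp:
    first_run = build_shard(0, [{"id": 1}, {"id": 2}], tmp)
    second_run = build_shard(0, [{"id": 1}, {"id": 2}], tmp)
    leftover_tmp = list(Path(tmp).glob("*.tmp"))

print(first_run, second_run)
```

With this rule, rerunning the whole build after a failure repeats only the shards that never reached their final name, which is the idempotence property the section describes.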
6. Validation and Performance
Validation and Performance gives the conceptual and mathematical layer for JSONL generation. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.
6.1 Line-level JSON parse
Line-level JSON parsing is part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For the generator, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
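A line-level parse check can fail fast and name the offending line. This sketch assumes that every line must be a JSON object; the function name `validate_jsonl_lines` is illustrative.

```python
import json


def validate_jsonl_lines(lines):
    """Return the number of valid lines; raise ValueError on the first bad one."""
    count = 0
    for line_no, line in enumerate(lines, start=1):
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc
        if not isinstance(value, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        count += 1
    return count


good = ['{"id": 1}\n', '{"id": 2}\n']
print(validate_jsonl_lines(good))
```

Failing on the first bad line suits a release gate; a stage that must keep going would route the same error into a quarantine file instead.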
6.2 Duplicate ID detection
Duplicate ID detection is part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For serialization, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
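Duplicate IDs can be found in a single pass with a counter. The sketch below assumes an `id` field and a stream small enough that the set of IDs fits in memory; at larger scale the same check is usually run per shard on hashed IDs.

```python
from collections import Counter


def find_duplicate_ids(records, id_field="id"):
    """Return a mapping from each repeated ID to its occurrence count."""
    counts = Counter(record[id_field] for record in records)
    return {rid: n for rid, n in counts.items() if n > 1}


stream = [{"id": "a"}, {"id": "b"}, {"id": "a"}, {"id": "c"}, {"id": "a"}]
duplicates = find_duplicate_ids(stream)
print(duplicates)
```

An empty result is the invariant to assert before release; a non-empty one identifies exactly which records need to be reconciled.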
6.3 Throughput metrics
Throughput metrics are part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For a shard, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
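Throughput is most useful when reported in both records per second and tokens per second. The sketch below is one way to measure it; the injectable `clock` parameter is an assumption added so the function can be tested with a fake timer, and whitespace splitting stands in for a real token counter.

```python
import time


def measure_throughput(records, count_tokens, clock=time.perf_counter):
    """Consume `records` and report record and token throughput."""
    start = clock()
    n_records = 0
    n_tokens = 0
    for record in records:
        n_records += 1
        n_tokens += count_tokens(record)
    elapsed = clock() - start
    # Avoid dividing by zero on very fast runs; rates become NaN instead.
    safe = elapsed if elapsed > 0 else float("nan")
    return {
        "records": n_records,
        "tokens": n_tokens,
        "seconds": elapsed,
        "records_per_sec": n_records / safe,
        "tokens_per_sec": n_tokens / safe,
    }


docs = [{"text": "a b c"}, {"text": "d e"}]
stats = measure_throughput(docs, lambda r: len(r["text"].split()))
print(stats["records"], stats["tokens"])
```

Reporting tokens per second alongside records per second keeps the metric stable when the average document length shifts between sources.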
6.4 Multiprocessing
Multiprocessing is part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For quarantine, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
6.5 Deterministic tests
Deterministic tests are part of the canonical scope of JSONL generation. We model the relevant object as a finite collection of records, each with record-level metadata and text or token content. The practical question is whether the transformation preserves the intended empirical distribution.
A useful local invariant is:
For idempotence, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
Examples:
- A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
- The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
- The notebook for this section uses synthetic data so the same ideas can be executed without external files.
Non-examples:
- A path on disk without a manifest is not a reproducible dataset.
- A metric dashboard without record-level lineage is not a provenance system.
- A filter threshold without an audit sample is not evidence of quality.
Implementation consequence: every transformation should report both a count and a rate. If $n_{\text{in}}$ records enter the stage and $n_{\text{out}}$ records leave, the acceptance rate is
$$\rho = \frac{n_{\text{out}}}{n_{\text{in}}}.$$
A sudden change in $\rho$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is
$$\rho_{\text{tok}} = \frac{\sum_{i \in \text{out}} T_i}{\sum_{i \in \text{in}} T_i},$$
where $T_i$ is the token count of record $i$ or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
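A deterministic test compares a digest of the output across two builds with the same seed. The sketch below assumes a canonical form (records sorted by `id`, keys sorted, compact separators) so that the digest does not depend on generation order; the function names and the `score` field are illustrative.

```python
import hashlib
import json
import random


def build_corpus(seed, n=5):
    """Build a small synthetic corpus whose content depends only on the seed."""
    rng = random.Random(seed)
    records = [{"id": f"doc-{i}", "score": rng.random()} for i in range(n)]
    rng.shuffle(records)  # simulate workers finishing in arbitrary order
    return records


def corpus_digest(records):
    """Hash the corpus in canonical form: sorted by id, with sorted keys."""
    h = hashlib.sha256()
    for record in sorted(records, key=lambda r: r["id"]):
        line = json.dumps(record, sort_keys=True, separators=(",", ":"))
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


same = corpus_digest(build_corpus(seed=0)) == corpus_digest(build_corpus(seed=0))
different = corpus_digest(build_corpus(seed=0)) != corpus_digest(build_corpus(seed=1))
print(same, different)
```

Pinning the expected digest in a test turns any unintended change in the build, including a change in ordering rules or serialization, into a visible failure.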
7. Common Mistakes
| # | Mistake | Why It Is Wrong | Fix |
|---|---|---|---|
| 1 | Trusting a file because it exists | A zero-byte or unparsable artifact can still pass a loose path check | Validate content and parseability |
| 2 | Counting documents but not tokens | Long documents dominate compute | Report both document and token rates |
| 3 | Changing schemas without versioning | Old and new records become indistinguishable | Pin schema versions in every record |
| 4 | Dropping metadata during transforms | Audits and removals become impossible | Preserve source and transform lineage |
| 5 | Using nondeterministic ordering | Rebuilds cannot be compared | Seed and record ordering rules |
| 6 | Ignoring failed records | Silent loss can bias the corpus | Quarantine and summarize failures |
| 7 | Treating filters as neutral | Filters encode preferences and tradeoffs | Ablate and audit every major filter |
| 8 | Mixing train and eval sources | Evaluation becomes contaminated | Run overlap audits before release |
| 9 | Optimizing one aggregate score | Small domains can regress | Track slice metrics |
| 10 | Skipping data cards | Users cannot judge intended use or risk | Publish structured documentation |
| 11 | Assuming licenses are uniform | Source terms can conflict | Track license at source and record level |
| 12 | Forgetting reproducible manifests | The same name can refer to different data | Use hashes and version pins |
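The fixes in rows 3 and 12 can be combined in a small manifest that pins a schema version and a content hash per shard. This is a sketch under stated assumptions: the manifest layout, the `build_manifest` name, and the version string are hypothetical, and each shard is read fully into memory, which only suits small files.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(shard_paths, schema_version):
    """Describe each shard by name, size, record count, and SHA-256 hash."""
    entries = []
    for path in sorted(Path(p) for p in shard_paths):
        data = path.read_bytes()
        entries.append({
            "name": path.name,
            "bytes": len(data),
            "records": data.count(b"\n"),  # one record per newline-terminated line
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return {"schema_version": schema_version, "shards": entries}


content = b'{"id": 1}\n{"id": 2}\n'
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard-00000.jsonl"
    shard.write_bytes(content)
    manifest = build_manifest([shard], schema_version="1.0.0")

print(manifest["shards"][0]["records"])
```

Two datasets with the same name but different manifests are then distinguishable by hash, which is the property a bare path on disk cannot provide.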
8. Exercises
- (*) Build a synthetic `generator` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (*) Build a synthetic `serialization` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (*) Build a synthetic `shard` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (**) Build a synthetic `quarantine` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (**) Build a synthetic `idempotence` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (**) Build a synthetic `throughput` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (**) Build a synthetic `resume` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (***) Build a synthetic `generator` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (***) Build a synthetic `serialization` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
- (***) Build a synthetic `shard` example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
9. Why This Matters for AI
| Concept | AI impact |
|---|---|
| generator | Controls what examples, gradients, risks, or audits the model pipeline can represent |
| serialization | Controls what examples, gradients, risks, or audits the model pipeline can represent |
| shard | Controls what examples, gradients, risks, or audits the model pipeline can represent |
| quarantine | Controls what examples, gradients, risks, or audits the model pipeline can represent |
| idempotence | Controls what examples, gradients, risks, or audits the model pipeline can represent |
| throughput | Controls what examples, gradients, risks, or audits the model pipeline can represent |
| resume | Controls what examples, gradients, risks, or audits the model pipeline can represent |
Data pipeline quality is model quality in delayed form. The model eventually converts these records into gradients; any unresolved ambiguity becomes either wasted compute, misleading evaluation, memorization risk, or irreproducible science.
10. Conceptual Bridge
This section connects the previous and next pieces of the curriculum as follows:
raw sources -> records -> validation -> assembly -> audits -> documentation -> mixture
The next section is Quality Checks. It uses the contracts established here and moves one step further through the LLM data pipeline.