All Courses
16 · LLM TRAINING DATA PIPELINE

Data Mixture Optimization

"A data mixture is a prior over what the model should become good at."

Overview

Data mixture optimization treats corpus composition as a constrained optimization problem over domains, objectives, and risk constraints. In an LLM training run, data is not an inert pile of text; it is the empirical distribution that defines the examples, losses, risks, and capabilities the model will see.

This section is written as LaTeX Markdown. Inline mathematics uses $...$, and display equations use $$...$$. The goal is to connect data engineering decisions to mathematical objects such as records $r_i$, token sequences $x_{1:T}$, filters $f(x)$, hashes $h(x)$, mixture weights $\boldsymbol{\alpha}$, and empirical expectations.

The scope is deliberately narrow: this chapter owns the training-data pipeline. Tokenizer design, GPU training systems, benchmark methodology, alignment objectives, and production MLOps each have their own canonical chapters. Here we study the data objects that those later systems consume.

Prerequisites

Learning Objectives

After completing this section, you will be able to:

  • Define domain mixtures as vectors on the simplex $\Delta^{K-1}$
  • Compare uniform, source-proportional, hand-tuned, and temperature-smoothed mixtures
  • Use validation losses to evaluate mixture candidates
  • Explain effective tokens, data age, repeated-data scaling, and domain coverage
  • Implement grid-search and proxy-model mixture selection
  • Describe DoReMi/group-DRO intuition without duplicating full optimization chapters
  • Apply constraints for safety, quality, license, and downstream target mismatch
  • Connect mixture optimization to Chapter 17 evaluation and Chapter 18 alignment data


1. Intuition

Intuition gives the conceptual and mathematical layer for data mixture optimization. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

1.1 Mixture weights determine model skill profile

Mixture weights determine model skill profile is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For mixture, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.2 Not all tokens have equal value

Not all tokens have equal value is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For simplex, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.3 Mixture as constrained optimization

Mixture as constrained optimization is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For domain, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.4 Proxy models

Proxy models is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For proxy model, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.5 DataComp, DoReMi, and data mixing laws context

DataComp, DoReMi, and data mixing laws context is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For DRO, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2. Formal Definitions

Formal Definitions gives the conceptual and mathematical layer for data mixture optimization. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

2.1 Domains $\mathcal{D}_1,\ldots,\mathcal{D}_K$

Domains $\mathcal{D}_1,\ldots,\mathcal{D}_K$ is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For mixture, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2.2 Mixture vector $\boldsymbol{\alpha}\in\Delta^{K-1}$

Mixture vector $\boldsymbol{\alpha}\in\Delta^{K-1}$ is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection

$\mathcal{D} = \{r_i\}_{i=1}^n$

with record-level metadata and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For simplex, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2.3 Sampling distribution

Sampling distribution is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For domain, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2.4 Validation objective

Validation objective is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For proxy model, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2.5 Token budget constraint

Token budget constraint is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For DRO, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3. Baseline Mixtures

Baseline Mixtures gives the conceptual and mathematical layer for data mixture optimization. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

3.1 Uniform by document

Uniform by document is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For mixture, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3.2 Uniform by token

Uniform by token is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record- level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For simplex, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3.3 Source-proportional

Source-proportional is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For domain, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3.4 Hand-tuned domain weights

Hand-tuned domain weights is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For proxy model, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3.5 Temperature-smoothed mixtures

Temperature-smoothed mixtures is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For DRO, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

4. Quality and Scaling Effects

Quality and Scaling Effects gives the conceptual and mathematical layer for data mixture optimization. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

4.1 Effective tokens $D_{\mathrm{eff}}$

Effective tokens $D_{\mathrm{eff}}$ is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For mixture, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

4.2 Repeated-data scaling

Repeated-data scaling is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For simplex, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

4.3 Domain coverage

Domain coverage is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record- level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For domain, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

4.4 Data age

Data age is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For proxy model, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

4.5 Synthetic data preview

Synthetic data preview is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For DRO, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

5. Optimization Methods

Optimization Methods gives the conceptual and mathematical layer for data mixture optimization. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

5.1 Grid search

Grid search is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For mixture, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

5.2 Bayesian optimization preview

Bayesian optimization preview is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For simplex, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

5.3 Proxy-model sweeps

Proxy-model sweeps is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record- level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For domain, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

5.4 DoReMi/group DRO

DoReMi/group DRO is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record- level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For proxy model, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

5.5 Data mixing law extrapolation

Data mixing law extrapolation is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For DRO, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

6. Evaluation and Constraints

Evaluation and Constraints gives the conceptual and mathematical layer for data mixture optimization. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

6.1 Multi-domain validation loss

Multi-domain validation loss is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For mixture, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

6.2 Downstream score aggregation

Downstream score aggregation is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For simplex, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

6.3 Safety/quality constraints

Safety/quality constraints is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For domain, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

6.4 License constraints

License constraints is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For proxy model, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

6.5 Robustness to target mismatch

Robustness to target mismatch is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$ \text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.} $$

For DRO, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples: - A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records. - The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits. - The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples: - A path on disk without a manifest is not a reproducible dataset. - A metric dashboard without record-level lineage is not a provenance system. - A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$ a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}. $$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$ a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i}, $$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

7. Common Mistakes

# Mistake Why It Is Wrong Fix
1 Trusting a file because it exists A zero-byte or unparsable artifact can still pass a loose path check Validate content and parseability
2 Counting documents but not tokens Long documents dominate compute Report both document and token rates
3 Changing schemas without versioning Old and new records become indistinguishable Pin schema versions in every record
4 Dropping metadata during transforms Audits and removals become impossible Preserve source and transform lineage
5 Using nondeterministic ordering Rebuilds cannot be compared Seed and record ordering rules
6 Ignoring failed records Silent loss can bias the corpus Quarantine and summarize failures
7 Treating filters as neutral Filters encode preferences and tradeoffs Ablate and audit every major filter
8 Mixing train and eval sources Evaluation becomes contaminated Run overlap audits before release
9 Optimizing one aggregate score Small domains can regress Track slice metrics
10 Skipping data cards Users cannot judge intended use or risk Publish structured documentation
11 Assuming licenses are uniform Source terms can conflict Track license at source and record level
12 Forgetting reproducible manifests The same name can refer to different data Use hashes and version pins

8. Exercises

  1. (*) Build a synthetic mixture example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  2. (*) Build a synthetic simplex example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  3. (*) Build a synthetic domain example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  4. (**) Build a synthetic proxy model example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  5. (**) Build a synthetic DRO example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  6. (**) Build a synthetic effective tokens example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  7. (**) Build a synthetic validation loss example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  8. (***) Build a synthetic mixture example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  9. (***) Build a synthetic simplex example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  10. (***) Build a synthetic domain example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.

9. Why This Matters for AI

Concept AI impact
mixture Controls what examples, gradients, risks, or audits the model pipeline can represent
simplex Controls what examples, gradients, risks, or audits the model pipeline can represent
domain Controls what examples, gradients, risks, or audits the model pipeline can represent
proxy model Controls what examples, gradients, risks, or audits the model pipeline can represent
DRO Controls what examples, gradients, risks, or audits the model pipeline can represent
effective tokens Controls what examples, gradients, risks, or audits the model pipeline can represent
validation loss Controls what examples, gradients, risks, or audits the model pipeline can represent

Data pipeline quality is model quality in delayed form. The model eventually converts these records into gradients; any unresolved ambiguity becomes either wasted compute, misleading evaluation, memorization risk, or irreproducible science.

10. Conceptual Bridge

This section connects the previous and next pieces of the curriculum as follows:

raw sources -> records -> validation -> assembly -> audits -> documentation -> mixture

The next section is Capability Benchmarks. It uses the contracts established here and moves one step further through the LLM data pipeline.

References

$m_i$