Part 2


VC Dimension: Part 3: Computing VC Dimension to 4. Growth Functions

3. Computing VC Dimension

Computing VC Dimension develops the part of VC dimension theory specified by the approved Chapter 21 table of contents. The emphasis is statistical learning theory, not generic statistics, optimization recipes, or benchmark operations.

3.1 thresholds on $\mathbb{R}$

Thresholds on $\mathbb{R}$ are part of the canonical scope of VC Dimension. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

In this subsection the working scope is shattering, growth functions, Sauer-Shelah bounds, VC sample complexity, and capacity control beyond parameter count. We use a distribution $\mathcal{D}$, a sample $S$, a hypothesis class $\mathcal{H}$, and a loss-derived risk. The core question is whether the behavior on $S$ can control the behavior under $\mathcal{D}$.

$$\Pi_{\mathcal{H}}(m)\le \sum_{i=0}^{d}\binom{m}{i}\le \left(\frac{em}{d}\right)^d \quad \text{for } m \ge d \ge 1, \text{ where } d=\operatorname{VCdim}(\mathcal{H}).$$

The formula should be read operationally. For thresholds on $\mathbb{R}$, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory object	Meaning	AI interpretation
$\mathcal{D}$	Unknown data distribution	User prompts, images, tokens, labels, or tasks the system will face
$S$	Finite training or evaluation sample	The observed examples available to the learner or auditor
$\mathcal{H}$	Hypothesis class	Classifiers, probes, reward models, safety filters, or predictors
$L_S(h)$	Empirical risk	Error measured on the observed sample
$L_{\mathcal{D}}(h)$	True risk	Error on the distribution that matters after deployment

Three examples of thresholds on $\mathbb{R}$:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for thresholds on $\mathbb{R}$ is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

The gap between the last two quantities is the reason this chapter exists. Chapter 17 measures it empirically with benchmark protocols. Chapter 21 studies when mathematics can control it before all future examples are observed.

Implementation note for the companion notebook: thresholds on $\mathbb{R}$ will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.
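As a minimal sketch of such a demonstration (not the companion notebook's code), the following enumerates the labelings that threshold classifiers induce on a few synthetic points. It assumes the convention $h_a(x)=1$ if $x \ge a$ and $0$ otherwise, and that the points are distinct; the function name `threshold_dichotomies` is hypothetical.

```python
def threshold_dichotomies(points):
    """Return the set of label patterns that h_a(x) = 1[x >= a] induces on points.

    Assumes the points are distinct real numbers.
    """
    xs = sorted(points)
    # Only the position of a relative to the sorted points matters, so it is
    # enough to try one threshold below all points, one in each gap, one above.
    candidates = [xs[0] - 1.0]
    candidates += [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])]
    candidates += [xs[-1] + 1.0]
    return {tuple(int(x >= a) for x in points) for a in candidates}


one_point = threshold_dichotomies([0.0])
two_points = threshold_dichotomies([0.0, 1.0])

print(len(one_point))        # 2: a single point is shattered
print(len(two_points))       # 3: fewer than 2**2 patterns
print((1, 0) in two_points)  # False: left point 1, right point 0 is impossible
```

Under this convention one point is shattered and no pair is, which is consistent with a VC dimension of 1 for one-sided thresholds.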

The modern AI caution is that very large models often violate the cleanest textbook assumptions. That does not make the mathematics useless. It means the reader should distinguish theorem-level guarantees from diagnostic metaphors and engineering heuristics.

Checklist for using thresholds on $\mathbb{R}$ responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

For AI systems, this discipline prevents a common confusion: empirical success is evidence, but learnability theory explains which kinds of evidence should scale with sample size, class capacity, margins, norms, and noise.

The subsection also prepares the later material. PAC learning motivates VC dimension. VC dimension motivates generalization bounds. Bias-variance decomposition gives a different error accounting. Rademacher complexity gives a data-dependent complexity view.

3.2 intervals

Intervals are part of the canonical scope of VC Dimension. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

$$m = O\left(\frac{d\log(1/\epsilon)+\log(1/\delta)}{\epsilon}\right).$$

The formula should be read operationally. For intervals, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory object	Meaning	AI interpretation
$\mathcal{D}$	Unknown data distribution	User prompts, images, tokens, labels, or tasks the system will face
$S$	Finite training or evaluation sample	The observed examples available to the learner or auditor
$\mathcal{H}$	Hypothesis class	Classifiers, probes, reward models, safety filters, or predictors
$L_S(h)$	Empirical risk	Error measured on the observed sample
$L_{\mathcal{D}}(h)$	True risk	Error on the distribution that matters after deployment

Three examples of intervals:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for intervals is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

Implementation note for the companion notebook: intervals will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.
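A sketch of what such a demonstration could look like, assuming the class of closed intervals $h_{a,b}(x)=1$ if $a \le x \le b$ and $0$ otherwise, on distinct synthetic points. The key observation is that the positives always form one contiguous run in sorted order. The name `interval_dichotomies` is hypothetical.

```python
def interval_dichotomies(points):
    """Label patterns induced by h_{a,b}(x) = 1[a <= x <= b] on distinct real points."""
    m = len(points)
    order = sorted(range(m), key=lambda i: points[i])  # indices, left to right
    patterns = {(0,) * m}  # an interval that misses every point
    for start in range(m):
        for stop in range(start, m):
            labels = [0] * m
            for k in range(start, stop + 1):
                labels[order[k]] = 1  # positives form one contiguous run
            patterns.add(tuple(labels))
    return patterns


two = interval_dichotomies([0.0, 1.0])
three = interval_dichotomies([0.0, 1.0, 2.0])

print(len(two))            # 4: two points are shattered
print(len(three))          # 7: one of the 8 labelings is missing
print((1, 0, 1) in three)  # False: an interval cannot skip the middle point
```

Two points are shattered and three are not, which is consistent with a VC dimension of 2 for intervals.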

Checklist for using intervals responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

3.3 linear separators in $\mathbb{R}^d$

Linear separators in $\mathbb{R}^d$ are part of the canonical scope of VC Dimension. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

$$\Pi_{\mathcal{H}}(m)=\max_{\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(m)}}\lvert\{(h(\mathbf{x}^{(1)}),\ldots,h(\mathbf{x}^{(m)})):h\in\mathcal{H}\}\rvert.$$

The formula should be read operationally. For linear separators in $\mathbb{R}^d$, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory object	Meaning	AI interpretation
$\mathcal{D}$	Unknown data distribution	User prompts, images, tokens, labels, or tasks the system will face
$S$	Finite training or evaluation sample	The observed examples available to the learner or auditor
$\mathcal{H}$	Hypothesis class	Classifiers, probes, reward models, safety filters, or predictors
$L_S(h)$	Empirical risk	Error measured on the observed sample
$L_{\mathcal{D}}(h)$	True risk	Error on the distribution that matters after deployment

Three examples of linear separators in $\mathbb{R}^d$:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for linear separators in $\mathbb{R}^d$ is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

Implementation note for the companion notebook: linear separators in $\mathbb{R}^d$ will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.
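One possible small-scale demonstration, offered as a sketch rather than the notebook's method: count how many labelings of a point set in $\mathbb{R}^2$ a halfspace with bias can realize. Separability is checked with a plain perceptron and an epoch cap, which is an assumption made here for simplicity. Finding a separator is conclusive; hitting the cap is only evidence of non-separability, not a proof. The helper names are hypothetical.

```python
from itertools import product


def separable(points, labels, max_epochs=1000):
    """Perceptron with a bias term; True if a strict separator is found within the cap."""
    w = [0.0] * (len(points[0]) + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            z = list(x) + [1.0]        # append constant feature for the bias
            s = 1 if y == 1 else -1    # map {0, 1} labels to {-1, +1}
            if s * sum(wi * zi for wi, zi in zip(w, z)) <= 0:
                w = [wi + s * zi for wi, zi in zip(w, z)]
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def count_dichotomies(points):
    """Number of labelings for which the perceptron finds a separator."""
    return sum(separable(points, labels)
               for labels in product([0, 1], repeat=len(points)))


triangle = [(0, 0), (1, 0), (0, 1)]
square = [(0, 0), (1, 0), (0, 1), (1, 1)]

print(count_dichotomies(triangle))  # 8: all labelings of three non-collinear points
print(count_dichotomies(square))    # 14: the two XOR labelings are not found
```

The triangle is shattered, matching the standard result that halfspaces in $\mathbb{R}^d$ have VC dimension $d+1$ (here $3$). The square is only one four-point configuration; the general statement that no four points in the plane are shattered needs a proof, not a simulation.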

Checklist for using linear separators in $\mathbb{R}^d$ responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

3.4 axis-aligned rectangles

Axis-aligned rectangles are part of the canonical scope of VC Dimension. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

$$\operatorname{VCdim}(\mathcal{H})=\max\{m:\Pi_{\mathcal{H}}(m)=2^m\}.$$

The formula should be read operationally. For axis-aligned rectangles, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory object	Meaning	AI interpretation
$\mathcal{D}$	Unknown data distribution	User prompts, images, tokens, labels, or tasks the system will face
$S$	Finite training or evaluation sample	The observed examples available to the learner or auditor
$\mathcal{H}$	Hypothesis class	Classifiers, probes, reward models, safety filters, or predictors
$L_S(h)$	Empirical risk	Error measured on the observed sample
$L_{\mathcal{D}}(h)$	True risk	Error on the distribution that matters after deployment

Three examples of axis-aligned rectangles:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for axis-aligned rectangles is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

Implementation note for the companion notebook: axis-aligned rectangles will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.
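A hedged sketch of such a demonstration in the plane: a labeling is realizable by an axis-aligned rectangle exactly when the bounding box of the positive points contains no negative point. The code assumes closed rectangles that label interior and boundary points 1, and it allows an empty rectangle for the all-zero labeling. Function names are hypothetical.

```python
from itertools import product


def rectangle_realizable(points, labels):
    """True if some axis-aligned rectangle contains exactly the points labeled 1."""
    positives = [p for p, y in zip(points, labels) if y == 1]
    if not positives:
        return True  # a rectangle placed away from every point
    xs = [p[0] for p in positives]
    ys = [p[1] for p in positives]
    for p, y in zip(points, labels):
        # The tightest candidate rectangle is the bounding box of the positives.
        inside = min(xs) <= p[0] <= max(xs) and min(ys) <= p[1] <= max(ys)
        if y == 0 and inside:
            return False
    return True


def count_dichotomies(points):
    return sum(rectangle_realizable(points, labels)
               for labels in product([0, 1], repeat=len(points)))


diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]

print(count_dichotomies(diamond))             # 16: the diamond is shattered
print(count_dichotomies(diamond + [(0, 0)]))  # fewer than 32
```

The four diamond points are shattered, consistent with a VC dimension of 4 for axis-aligned rectangles in the plane. The five-point check covers only one configuration; the general upper bound comes from the argument that the extreme points in each direction force any fifth point inside the bounding box.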

Checklist for using axis-aligned rectangles responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

3.5 finite classes

Finite classes are part of the canonical scope of VC Dimension. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

$$\Pi_{\mathcal{H}}(m)\le \sum_{i=0}^{d}\binom{m}{i}\le \left(\frac{em}{d}\right)^d \quad \text{for } m \ge d \ge 1, \text{ where } d=\operatorname{VCdim}(\mathcal{H}).$$

The formula should be read operationally. For finite classes, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory object	Meaning	AI interpretation
$\mathcal{D}$	Unknown data distribution	User prompts, images, tokens, labels, or tasks the system will face
$S$	Finite training or evaluation sample	The observed examples available to the learner or auditor
$\mathcal{H}$	Hypothesis class	Classifiers, probes, reward models, safety filters, or predictors
$L_S(h)$	Empirical risk	Error measured on the observed sample
$L_{\mathcal{D}}(h)$	True risk	Error on the distribution that matters after deployment

Three examples of finite classes:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for finite classes is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

Implementation note for the companion notebook: finite classes will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.
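For a finite class on a finite domain, the VC dimension can be computed by brute force, and it can never exceed $\log_2\lvert\mathcal{H}\rvert$ because shattering $k$ points requires at least $2^k$ distinct hypotheses. The sketch below illustrates this on a toy class; the function `vc_dimension` and the grid of seven thresholds are assumptions chosen for illustration.

```python
from itertools import combinations
from math import log2


def vc_dimension(hypotheses, domain):
    """Largest k such that some k-subset of domain is shattered by hypotheses."""
    best = 0
    for k in range(1, len(domain) + 1):
        shattered = any(
            len({tuple(h(x) for x in subset) for h in hypotheses}) == 2 ** k
            for subset in combinations(domain, k)
        )
        if not shattered:
            break  # subsets of shattered sets are shattered, so larger k fails too
        best = k
    return best


domain = list(range(6))
# Seven threshold rules on the grid; a=a freezes each threshold in its lambda.
hypotheses = [lambda x, a=a: int(x >= a) for a in range(7)]

d = vc_dimension(hypotheses, domain)
print(d, log2(len(hypotheses)))  # the first number is at most the second
```

The brute-force search is exponential in the domain size, so it is only practical for toy examples like this one.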

Checklist for using finite classes responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

4. Growth Functions

Growth Functions develops the part of VC dimension theory specified by the approved Chapter 21 table of contents. The emphasis is statistical learning theory, not generic statistics, optimization recipes, or benchmark operations.

4.1 dichotomy counts

Dichotomy counts are part of the canonical scope of VC Dimension. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

$$m = O\left(\frac{d\log(1/\epsilon)+\log(1/\delta)}{\epsilon}\right).$$

The formula should be read operationally. For dichotomy counts, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory object	Meaning	AI interpretation
$\mathcal{D}$	Unknown data distribution	User prompts, images, tokens, labels, or tasks the system will face
$S$	Finite training or evaluation sample	The observed examples available to the learner or auditor
$\mathcal{H}$	Hypothesis class	Classifiers, probes, reward models, safety filters, or predictors
$L_S(h)$	Empirical risk	Error measured on the observed sample
$L_{\mathcal{D}}(h)$	True risk	Error on the distribution that matters after deployment

Three examples of dichotomy counts:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for dichotomy counts is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

Implementation note for the companion notebook: dichotomy counts will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.
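A dichotomy count is the number of distinct label vectors a class produces on a fixed sample, which can be far smaller than the number of hypotheses. The following sketch restricts a grid of 701 threshold rules to five synthetic points; the helper `restrict` and the particular grid are assumptions made for illustration.

```python
def restrict(hypotheses, sample):
    """Set of distinct label vectors (dichotomies) the class induces on the sample."""
    return {tuple(h(x) for x in sample) for h in hypotheses}


sample = [0, 1, 2, 3, 4]
# Thresholds a = -1.00, -0.99, ..., 6.00; a=k / 100 freezes each value in its lambda.
hypotheses = [lambda x, a=k / 100: int(x >= a) for k in range(-100, 601)]

patterns = restrict(hypotheses, sample)
print(len(hypotheses), len(patterns))  # 701 hypotheses, 6 dichotomies
```

Many hypotheses collapse onto the same dichotomy, which is why the growth function, rather than the raw size of the class, is the quantity that enters the bounds.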

Checklist for using dichotomy counts responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

4.2 polynomial versus exponential growth

Polynomial versus exponential growth is part of the canonical scope of VC Dimension. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

$$\Pi_{\mathcal{H}}(m)=\max_{\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(m)}}\lvert\{(h(\mathbf{x}^{(1)}),\ldots,h(\mathbf{x}^{(m)})):h\in\mathcal{H}\}\rvert.$$

The formula should be read operationally. For polynomial versus exponential growth, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory object	Meaning	AI interpretation
$\mathcal{D}$	Unknown data distribution	User prompts, images, tokens, labels, or tasks the system will face
$S$	Finite training or evaluation sample	The observed examples available to the learner or auditor
$\mathcal{H}$	Hypothesis class	Classifiers, probes, reward models, safety filters, or predictors
$L_S(h)$	Empirical risk	Error measured on the observed sample
$L_{\mathcal{D}}(h)$	True risk	Error on the distribution that matters after deployment

Three examples of polynomial versus exponential growth:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for polynomial versus exponential growth is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

Implementation note for the companion notebook: polynomial versus exponential growth will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.
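To make the contrast concrete, the sketch below tabulates two known polynomial growth functions on the real line, $m+1$ for one-sided thresholds and $\binom{m+1}{2}+1$ for intervals, against the unrestricted count $2^m$. The function names are hypothetical, and the closed forms assume distinct sample points.

```python
def thresholds_growth(m):
    """Growth function of one-sided thresholds on the real line."""
    return m + 1


def intervals_growth(m):
    """Growth function of intervals on the real line: C(m + 1, 2) + 1."""
    return m * (m + 1) // 2 + 1


def unrestricted_growth(m):
    """Number of labelings available to a class that shatters every sample."""
    return 2 ** m


print("m  thresholds  intervals  2^m")
for m in (1, 2, 5, 10, 20):
    print(m, thresholds_growth(m), intervals_growth(m), unrestricted_growth(m))
```

The fraction of labelings a polynomial-growth class can realize shrinks toward zero as $m$ grows, which is the mechanism that lets a finite sample constrain the class.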

Checklist for using polynomial versus exponential growth responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

4.3 Sauer-Shelah bound

The Sauer-Shelah bound is part of the canonical scope of VC Dimension. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

$$\operatorname{VCdim}(\mathcal{H})=\max\{m:\Pi_{\mathcal{H}}(m)=2^m\}.$$

The formula should be read operationally. For the Sauer-Shelah bound, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory object	Meaning	AI interpretation
$\mathcal{D}$	Unknown data distribution	User prompts, images, tokens, labels, or tasks the system will face
$S$	Finite training or evaluation sample	The observed examples available to the learner or auditor
$\mathcal{H}$	Hypothesis class	Classifiers, probes, reward models, safety filters, or predictors
$L_S(h)$	Empirical risk	Error measured on the observed sample
$L_{\mathcal{D}}(h)$	True risk	Error on the distribution that matters after deployment

Three examples for the Sauer-Shelah bound:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for the Sauer-Shelah bound is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

Implementation note for the companion notebook: the Sauer-Shelah bound will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.
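As one way to compute the bound, the sketch below evaluates the Sauer-Shelah sum $\sum_{i=0}^{d}\binom{m}{i}$ and the looser closed form $(em/d)^d$, which is valid for $m \ge d \ge 1$, and prints both next to $2^m$. The function names are hypothetical.

```python
from math import comb, e


def sauer_sum(m, d):
    """Sauer-Shelah bound on the growth function for VC dimension d."""
    return sum(comb(m, i) for i in range(d + 1))


def sauer_closed_form(m, d):
    """Looser bound (e * m / d) ** d, valid only for m >= d >= 1."""
    if not (m >= d >= 1):
        raise ValueError("closed form requires m >= d >= 1")
    return (e * m / d) ** d


print("m  sum  closed_form  2^m")
for m in (2, 5, 10, 20):
    print(m, sauer_sum(m, 2), round(sauer_closed_form(m, 2), 1), 2 ** m)
```

For $d=2$ the sum equals $m(m+1)/2+1$, which is exactly the growth function of intervals on the real line, so the bound is tight for that class.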

Checklist for using the Sauer-Shelah bound responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

4.4 break points

Break points are part of the canonical scope of VC Dimension. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

$$\Pi_{\mathcal{H}}(m)\le \sum_{i=0}^{d}\binom{m}{i}\le \left(\frac{em}{d}\right)^d \quad \text{for } m \ge d \ge 1, \text{ where } d=\operatorname{VCdim}(\mathcal{H}).$$

The formula should be read operationally. For break points, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory object	Meaning	AI interpretation
$\mathcal{D}$	Unknown data distribution	User prompts, images, tokens, labels, or tasks the system will face
$S$	Finite training or evaluation sample	The observed examples available to the learner or auditor
$\mathcal{H}$	Hypothesis class	Classifiers, probes, reward models, safety filters, or predictors
$L_S(h)$	Empirical risk	Error measured on the observed sample
$L_{\mathcal{D}}(h)$	True risk	Error on the distribution that matters after deployment

Three examples of break points:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for break points is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

Implementation note for the companion notebook: break points will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.
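A break point is a sample size $m$ at which the growth function first falls below $2^m$; the smallest break point is one more than the VC dimension. The sketch below finds it from a growth function given as a callable. The function `break_point`, its search cap `max_m`, and the two closed-form growth functions are assumptions for illustration.

```python
def break_point(growth, max_m=50):
    """Smallest m in 1..max_m with growth(m) < 2 ** m, or None if none is found."""
    for m in range(1, max_m + 1):
        if growth(m) < 2 ** m:
            return m
    return None  # no break point up to the cap; the class may shatter every size tried


def thresholds_growth(m):
    return m + 1  # one-sided thresholds on the real line


def intervals_growth(m):
    return m * (m + 1) // 2 + 1  # intervals on the real line


print(break_point(thresholds_growth))  # 2, so the VC dimension is 1
print(break_point(intervals_growth))   # 3, so the VC dimension is 2
print(break_point(lambda m: 2 ** m))   # None: no break point within the cap
```

A return value of `None` only says no break point was found up to `max_m`; it does not prove the VC dimension is infinite.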

Checklist for using break points responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

4.5 sample complexity role

The role of sample complexity is part of the canonical scope of VC Dimension. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

$$m = O\left(\frac{d\log(1/\epsilon)+\log(1/\delta)}{\epsilon}\right).$$

The formula should be read operationally. For the role of sample complexity, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory object	Meaning	AI interpretation
$\mathcal{D}$	Unknown data distribution	User prompts, images, tokens, labels, or tasks the system will face
$S$	Finite training or evaluation sample	The observed examples available to the learner or auditor
$\mathcal{H}$	Hypothesis class	Classifiers, probes, reward models, safety filters, or predictors
$L_S(h)$	Empirical risk	Error measured on the observed sample
$L_{\mathcal{D}}(h)$	True risk	Error on the distribution that matters after deployment

Three examples of the role of sample complexity:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for the role of sample complexity is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

Implementation note for the companion notebook: the role of sample complexity will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.
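To see how the sample-size expression $\bigl(d\log(1/\epsilon)+\log(1/\delta)\bigr)/\epsilon$ scales, the sketch below evaluates it for a few tolerances. The $O(\cdot)$ hides a universal constant that the text does not specify, so the `constant` parameter (default 1.0) is a placeholder assumption and the outputs show scaling only, not certified sample sizes. This form is usually stated for the realizable case. The function name is hypothetical.

```python
from math import ceil, log


def realizable_sample_size(d, epsilon, delta, constant=1.0):
    """Evaluate constant * (d * log(1/eps) + log(1/delta)) / eps, rounded up.

    constant stands in for the unspecified factor hidden by the O(.) notation.
    """
    if d < 1 or not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("need d >= 1 and epsilon, delta in (0, 1)")
    return ceil(constant * (d * log(1 / epsilon) + log(1 / delta)) / epsilon)


for eps in (0.1, 0.05, 0.01):
    print(eps, realizable_sample_size(d=2, epsilon=eps, delta=0.05))
```

Tightening $\epsilon$ raises the requirement roughly like $1/\epsilon$ up to a logarithmic factor, while tightening $\delta$ costs only $\log(1/\delta)$, which is why confidence is usually cheaper than accuracy.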

Checklist for using sample complexity responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.
