Part 3

PAC Learning: Part 3 - 5. Sample Complexity to 6. Applications in ML and LLMs

5. Sample Complexity

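This section develops the sample-complexity part of PAC learning, following the Chapter 21 table of contents: how the number of examples needed for a PAC guarantee depends on the accuracy tolerance $\epsilon$, the confidence parameter $\delta$, and the size of the hypothesis class $\mathcal{H}$. The emphasis is statistical learning theory, not generic statistics, optimization recipes, or benchmark operations.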

5.1 dependence on $\epsilon$

Dependence on $\epsilon$ concerns the accuracy tolerance: the largest true risk a returned hypothesis is allowed to have. In the realizable finite-class bound of Section 5.3 the required sample size grows like $1/\epsilon$, so halving the tolerance doubles the bound. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

In this subsection the working scope is probably approximately correct (PAC) guarantees, finite-class sample complexity, realizable and agnostic learning, and distribution-free learnability. We use a distribution $\mathcal{D}$, a sample $S$, a hypothesis class $\mathcal{H}$, and a loss-derived risk. The core question is whether the behavior on $S$ can control the behavior under $\mathcal{D}$. For the 0-1 loss, the true risk of a hypothesis $h$ is its probability of misclassifying a fresh example:

$$L_{\mathcal{D}}(h)=P_{(\mathbf{x},y)\sim\mathcal{D}}[h(\mathbf{x})\ne y].$$

The formula should be read operationally. For dependence on $\epsilon$, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.
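To make the definition concrete, here is a minimal sketch that estimates $L_{\mathcal{D}}(h)$ by Monte Carlo on a synthetic distribution. The setup is an assumption made for illustration and is not part of the lesson: $x$ is uniform on $[0,1]$, the true label is $\mathbb{1}[x \ge 0.5]$, and the hypothesis is a threshold at $0.6$, so the exact risk is $0.1$.

```python
import random

def true_label(x, target=0.5):
    # Assumed labeling rule for the synthetic distribution.
    return int(x >= target)

def threshold_hypothesis(t):
    # h_t(x) = 1 if x >= t, else 0.
    return lambda x: int(x >= t)

def monte_carlo_risk(h, n_draws, rng):
    # Average 0-1 loss over fresh draws from the assumed distribution D.
    errors = 0
    for _ in range(n_draws):
        x = rng.random()
        errors += h(x) != true_label(x)
    return errors / n_draws

rng = random.Random(0)
h = threshold_hypothesis(0.6)
estimate = monte_carlo_risk(h, 200_000, rng)
exact = abs(0.6 - 0.5)  # h and the target disagree exactly on [0.5, 0.6)
print(f"Monte Carlo risk ~ {estimate:.4f}, exact risk = {exact:.4f}")
```

In a real deployment $\mathcal{D}$ is unknown, so this kind of direct estimate is only available in simulation; that is why the theory works through the sample $S$ instead.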

Theory object	Meaning	AI interpretation
$\mathcal{D}$	Unknown data distribution	User prompts, images, tokens, labels, or tasks the system will face
$S$	Finite training or evaluation sample	The observed examples available to the learner or auditor
$\mathcal{H}$	Hypothesis class	Classifiers, probes, reward models, safety filters, or predictors
$L_S(h)$	Empirical risk	Error measured on the observed sample
$L_{\mathcal{D}}(h)$	True risk	Error on the distribution that matters after deployment

Three examples of dependence on $\epsilon$:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for dependence on $\epsilon$ is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.
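As a sketch of the case where the randomness is the sample $S$, the simulation below fixes one hypothesis, redraws $S$ many times, and counts how often the empirical risk misses the true risk by more than a tolerance $t$. It compares that frequency with Hoeffding's inequality, $P[\lvert L_S(h)-L_{\mathcal{D}}(h)\rvert > t] \le 2e^{-2mt^2}$, which applies because $h$ is fixed before $S$ is drawn. The Bernoulli error model and all parameter values are illustrative assumptions.

```python
import math
import random

def empirical_risk_of_fixed_h(true_risk, m, rng):
    # For a fixed h, each example is misclassified independently
    # with probability L_D(h), so L_S(h) is a Bernoulli average.
    mistakes = sum(rng.random() < true_risk for _ in range(m))
    return mistakes / m

rng = random.Random(1)
true_risk, m, t, trials = 0.2, 200, 0.08, 5000

misses = 0
for _ in range(trials):
    if abs(empirical_risk_of_fixed_h(true_risk, m, rng) - true_risk) > t:
        misses += 1

observed = misses / trials
hoeffding = 2 * math.exp(-2 * m * t * t)
print(f"observed miss rate {observed:.4f}, Hoeffding bound {hoeffding:.4f}")
```

The observed rate is typically far below the bound, which is expected: Hoeffding's inequality holds for every distribution and is therefore conservative for any particular one.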

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

The gap between the last two quantities is the reason this chapter exists. Chapter 17 measures it empirically with benchmark protocols. Chapter 21 studies when mathematics can control it before all future examples are observed.

Implementation note for the companion notebook: dependence on $\epsilon$ will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.
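One hedged sketch of the kind of bound computation described above uses the realizable finite-class bound from Section 5.3, $m \ge \frac{1}{\epsilon}\left(\log\lvert\mathcal{H}\rvert + \log\frac{1}{\delta}\right)$ with natural logarithms, and varies only $\epsilon$. The function name and the values $\lvert\mathcal{H}\rvert = 1000$ and $\delta = 0.05$ are assumptions chosen for illustration.

```python
import math

def finite_class_sample_size(epsilon, delta, class_size):
    """Smallest integer m with m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((math.log(class_size) + math.log(1 / delta)) / epsilon)

# Sweep the accuracy tolerance while holding delta and |H| fixed.
sizes = {eps: finite_class_sample_size(eps, delta=0.05, class_size=1000)
         for eps in (0.2, 0.1, 0.05, 0.025)}
for eps, m in sizes.items():
    print(f"epsilon = {eps:<5} -> m >= {m}")
```

Each halving of $\epsilon$ roughly doubles the bound, which is the $1/\epsilon$ dependence in numerical form.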

The modern AI caution is that very large models often violate the cleanest textbook assumptions. That does not make the mathematics useless. It means the reader should distinguish theorem-level guarantees from diagnostic metaphors and engineering heuristics.

Checklist for using dependence on $\epsilon$ responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

For AI systems, this discipline prevents a common confusion: empirical success is evidence, but learnability theory explains which kinds of evidence should scale with sample size, class capacity, margins, norms, and noise.

The subsection also prepares the later material. PAC learning motivates VC dimension. VC dimension motivates generalization bounds. Bias-variance decomposition gives a different error accounting. Rademacher complexity gives a data-dependent complexity view.

5.2 dependence on $\delta$

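Dependence on $\delta$ concerns the confidence parameter: the probability, over the random draw of the sample $S$, with which the guarantee is allowed to fail. In the finite-class bound of Section 5.3 the required sample size grows only like $\log(1/\delta)$, so extra confidence is much cheaper than extra accuracy. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring. The quantity whose sample-to-sample fluctuation $\delta$ accounts for is the empirical risk on a sample of size $m$:

$$L_S(h)=\frac{1}{m}\sum_{i=1}^{m}\mathbb{1}[h(\mathbf{x}^{(i)})\ne y^{(i)}].$$

The formula should be read operationally. For dependence on $\delta$, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.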
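Read operationally, the empirical risk is just an average of 0-1 losses over the sample. A minimal sketch, in which the threshold classifier and the five hand-made examples are hypothetical:

```python
def empirical_risk(h, sample):
    """Fraction of (x, y) pairs in the sample that h misclassifies."""
    if not sample:
        raise ValueError("sample must contain at least one example")
    return sum(h(x) != y for x, y in sample) / len(sample)

# Hypothetical threshold classifier and a hand-made sample of m = 5 pairs.
h = lambda x: int(x >= 0.5)
S = [(0.1, 0), (0.4, 1), (0.6, 1), (0.7, 0), (0.9, 1)]
risk = empirical_risk(h, S)  # h errs on (0.4, 1) and (0.7, 0)
print(risk)
```

A different draw of $S$ would give a different value for the same $h$; $\delta$ bounds how often that value is misleading.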

Three examples of dependence on $\delta$:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for dependence on $\delta$ is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

Implementation note for the companion notebook: dependence on $\delta$ will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.
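A worked instance of the $\delta$ dependence, assuming the realizable finite-class bound from Section 5.3 with natural logarithms and the illustrative values $\lvert\mathcal{H}\rvert = 1000$ and $\epsilon = 0.05$ (neither value comes from the lesson):

```latex
\begin{aligned}
\delta = 10^{-1}:&\quad m \ge \tfrac{1}{0.05}\left(\ln 1000 + \ln 10\right) \approx 20\,(6.908 + 2.303) \approx 184.2, \\
\delta = 10^{-3}:&\quad m \ge \tfrac{1}{0.05}\left(\ln 1000 + \ln 1000\right) \approx 20\,(6.908 + 6.908) \approx 276.3.
\end{aligned}
```

Cutting the failure probability by a factor of 100 raises the bound from 185 to 277 examples, about 50 percent, which is what logarithmic dependence on $1/\delta$ predicts.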

Checklist for using dependence on $\delta$ responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

5.3 logarithmic class-size dependence

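Logarithmic class-size dependence means that the sample size in the finite-class bound grows with $\log\lvert\mathcal{H}\rvert$ rather than with $\lvert\mathcal{H}\rvert$ itself: squaring the number of hypotheses only doubles that term. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring. In the realizable setting, a learner that returns a hypothesis from a finite class $\mathcal{H}$ with zero empirical risk satisfies the PAC guarantee with tolerance $\epsilon$ and confidence $1-\delta$ whenever the sample size $m$ satisfies the following bound, with $\log$ denoting the natural logarithm:

$$m \ge \frac{1}{\epsilon}\left(\log\lvert\mathcal{H}\rvert + \log\frac{1}{\delta}\right).$$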

The formula should be read operationally. For logarithmic class-size dependence, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.
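A short sketch of the class-size dependence, evaluating the bound for three class sizes while holding $\epsilon$ and $\delta$ fixed. The function name and all numerical values are assumptions for illustration.

```python
import math

def finite_class_sample_size(epsilon, delta, class_size):
    """Realizable finite-class bound: ceil((ln|H| + ln(1/delta)) / epsilon)."""
    return math.ceil((math.log(class_size) + math.log(1 / delta)) / epsilon)

epsilon, delta = 0.05, 0.05
# Class sizes 10^3, 10^6, 10^12, keyed by the exponent.
bounds = {k: finite_class_sample_size(epsilon, delta, 10 ** k)
          for k in (3, 6, 12)}
for k, m in bounds.items():
    print(f"|H| = 10^{k:<2} -> m >= {m}")
```

Growing the class by a factor of a billion, from $10^3$ to $10^{12}$ hypotheses, only about triples the bound.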

Three examples of logarithmic class-size dependence:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for logarithmic class-size dependence is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

Implementation note for the companion notebook: logarithmic class-size dependence will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

Checklist for using logarithmic class-size dependence responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

5.4 infinite-class motivation

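Infinite-class motivation starts from the observation that the finite-class bound of Section 5.3 says nothing when $\lvert\mathcal{H}\rvert$ is infinite, because $\log\lvert\mathcal{H}\rvert$ is then unbounded. Simple infinite classes, such as thresholds on the real line, are nevertheless learnable, so a capacity measure other than class size is needed; this is what motivates VC dimension. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring. The guarantee being sought is unchanged, with the probability taken over the draw of $S \sim \mathcal{D}^m$:

$$P_{S\sim\mathcal{D}^m}[L_{\mathcal{D}}(h_S)\le \epsilon] \ge 1-\delta.$$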

The formula should be read operationally. For infinite-class motivation, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.
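The probability statement can be checked by simulation in a toy setting. The sketch below assumes a realizable problem (uniform $x$ on $[0,1]$, target threshold $0.5$), a finite grid of 101 thresholds as the hypothesis class, and an empirical risk minimizer; the sample size is taken from the finite-class bound of Section 5.3. All of these choices are illustrative, not the lesson's own experiment.

```python
import math
import random

def erm_threshold(sample, grid):
    """Return the grid threshold with the fewest mistakes on the sample."""
    def mistakes(t):
        return sum(int(x >= t) != y for x, y in sample)
    return min(grid, key=mistakes)

rng = random.Random(2)
grid = [i / 100 for i in range(101)]      # finite class, |H| = 101
target = 0.5                              # realizable: the target is in the grid
epsilon, delta = 0.1, 0.05
m = math.ceil((math.log(len(grid)) + math.log(1 / delta)) / epsilon)

trials, successes = 300, 0
for _ in range(trials):
    xs = [rng.random() for _ in range(m)]
    sample = [(x, int(x >= target)) for x in xs]
    h_S = erm_threshold(sample, grid)
    true_risk = abs(h_S - target)         # exact risk under uniform x
    successes += true_risk <= epsilon

success_rate = successes / trials
print(f"m = {m}, fraction of samples with L_D(h_S) <= epsilon: {success_rate:.3f}")
```

The fraction of successful samples should be at least $1-\delta$. Making the grid finer leaves the learner's behavior almost unchanged while the bound keeps growing with $\log\lvert\mathcal{H}\rvert$, which is the mismatch that class-size counting cannot resolve.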

Three examples of infinite-class motivation:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for infinite-class motivation is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

Implementation note for the companion notebook: infinite-class motivation will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

Checklist for using infinite-class motivation responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

5.5 AI-scale interpretation

AI-scale interpretation asks how the sample-complexity statements of this section should be read for large models, whose hypothesis classes are far too large for the finite-class bound to give a usable number. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring. The quantity being controlled is still the true risk:

$$L_{\mathcal{D}}(h)=P_{(\mathbf{x},y)\sim\mathcal{D}}[h(\mathbf{x})\ne y].$$

The formula should be read operationally. For AI-scale interpretation, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.
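A rough sketch of why class-size counting stops being informative at scale: a model whose $p$ parameters are each stored in $b$ bits can represent at most $2^{pb}$ distinct hypotheses, so $\log\lvert\mathcal{H}\rvert \le p\,b\,\ln 2$. The parameter counts, bit width, and tolerances below are hypothetical values chosen for illustration, and the bound used is the realizable finite-class bound from Section 5.3.

```python
import math

def bit_count_sample_bound(num_params, bits_per_param, epsilon, delta):
    """Finite-class bound with ln|H| replaced by its upper bound p * b * ln 2."""
    log_class_size = num_params * bits_per_param * math.log(2)
    return math.ceil((log_class_size + math.log(1 / delta)) / epsilon)

for p in (10**3, 10**6, 10**9):
    m = bit_count_sample_bound(p, bits_per_param=16, epsilon=0.01, delta=0.05)
    print(f"p = {p:>13,} parameters -> bound asks for m >= {m:,}")

small = bit_count_sample_bound(10**3, 16, 0.01, 0.05)
large = bit_count_sample_bound(10**9, 16, 0.01, 0.05)
```

The bound grows linearly in the parameter count, so at a billion parameters it asks for far more labeled examples than are typically available. That is a statement about this particular bound, not about whether such models generalize.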

Three examples of AI-scale interpretation:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for AI-scale interpretation is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

Implementation note for the companion notebook: AI-scale interpretation will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

Checklist for using AI-scale interpretation responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

6. Applications in ML and LLMs

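This section applies the PAC viewpoint to ML and LLM practice, following the Chapter 21 table of contents: choosing among classifiers, judging prompt and safety classifiers, sizing evaluation sets, and recognizing where the bounds stop being informative for deep networks. The emphasis is statistical learning theory, not generic statistics, optimization recipes, or benchmark operations.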

6.1 classifier selection

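Classifier selection applies the finite-class viewpoint to model choice: when one of $k$ candidate classifiers is picked by its error on a shared sample, the candidates form a hypothesis class with $\lvert\mathcal{H}\rvert = k$, and the selection step must be accounted for in any guarantee. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring. The selection criterion is the empirical risk:

$$L_S(h)=\frac{1}{m}\sum_{i=1}^{m}\mathbb{1}[h(\mathbf{x}^{(i)})\ne y^{(i)}].$$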

The formula should be read operationally. For classifier selection, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.
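A sketch of selection among $k$ candidates with an explicit probability statement. It uses Hoeffding's inequality with a union bound over the candidates: with probability at least $1-\delta$, every candidate satisfies $\lvert L_S(h)-L_{\mathcal{D}}(h)\rvert \le \sqrt{\log(2k/\delta)/(2m)}$. The candidate error rates, the sample size, and the independent-error simulation are assumptions made for illustration.

```python
import math
import random

def uniform_deviation(num_candidates, m, delta):
    """Hoeffding plus union bound: with probability >= 1 - delta, every
    candidate has |L_S(h) - L_D(h)| <= this value."""
    return math.sqrt(math.log(2 * num_candidates / delta) / (2 * m))

rng = random.Random(3)
# Hypothetical candidates, each summarized by its true error rate.
true_risks = [0.12, 0.15, 0.20, 0.30]
m, delta = 2000, 0.05

# Simulate one validation sample: each candidate errs independently per
# example with its own true error rate (an illustrative simplification).
empirical = [sum(rng.random() < r for _ in range(m)) / m for r in true_risks]
chosen = min(range(len(true_risks)), key=lambda i: empirical[i])
slack = uniform_deviation(len(true_risks), m, delta)

print(f"chosen candidate {chosen}, empirical risk {empirical[chosen]:.3f}")
print(f"its true risk is within {slack:.3f} of that value "
      f"with probability >= {1 - delta}")
```

Because the deviation holds for all candidates at once, the chosen classifier's true risk is at most $2 \times$ `slack` above the best candidate's true risk on the same high-probability event.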

Three examples of classifier selection:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for classifier selection is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

Implementation note for the companion notebook: classifier selection will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

Checklist for using classifier selection responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

6.2 prompt classifier reliability

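Prompt classifier reliability asks how many labeled prompts are needed before a prompt classifier chosen from a finite class can be trusted to error tolerance $\epsilon$ with confidence $1-\delta$. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring. In the realizable setting, with $\log$ denoting the natural logarithm, the finite-class bound from Section 5.3 applies:

$$m \ge \frac{1}{\epsilon}\left(\log\lvert\mathcal{H}\rvert + \log\frac{1}{\delta}\right).$$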

The formula should be read operationally. For prompt classifier reliability, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.
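The same bound can be read in the other direction: given a fixed budget of $m$ labeled prompts, solve for the tolerance $\epsilon$ that the bound certifies. The sketch below assumes the realizable setting and a classifier with zero empirical risk; the class size $2^{20}$ and $\delta = 0.05$ are hypothetical.

```python
import math

def guaranteed_error_tolerance(m, class_size, delta):
    """Solve m = (ln|H| + ln(1/delta)) / epsilon for epsilon."""
    return (math.log(class_size) + math.log(1 / delta)) / m

# Hypothetical setting: 2**20 candidate prompt classifiers, 5% failure probability.
for m in (500, 5_000, 50_000):
    eps = guaranteed_error_tolerance(m, class_size=2**20, delta=0.05)
    print(f"m = {m:>6} labeled prompts -> epsilon = {eps:.4f}")

eps_5000 = guaranteed_error_tolerance(5_000, 2**20, 0.05)
```

Ten times more labeled prompts buys a tenfold smaller certified error in this setting; if the classifier makes mistakes on the sample, the realizable bound no longer applies and an agnostic bound is needed instead.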

Three examples of prompt classifier reliability:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for prompt classifier reliability is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

Implementation note for the companion notebook: prompt classifier reliability will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

Checklist for using prompt classifier reliability responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

6.3 safety classifier sample needs

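Safety classifier sample needs turns the PAC statement into a requirement: fix the tolerated error rate $\epsilon$ on future prompts and the failure probability $\delta$, then ask how large the sample size $m$ must be for the guarantee to hold. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring. The guarantee, with the probability taken over the draw of $S \sim \mathcal{D}^m$, is:

$$P_{S\sim\mathcal{D}^m}[L_{\mathcal{D}}(h_S)\le \epsilon] \ge 1-\delta.$$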

The formula should be read operationally. For safety classifier sample needs, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.
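How many samples the requirement translates into depends on whether the setting is realizable or agnostic. The sketch below compares two standard finite-class bounds: the realizable bound from Section 5.3, and an agnostic bound $m \ge 2\log(2\lvert\mathcal{H}\rvert/\delta)/\epsilon^2$ that follows from Hoeffding's inequality with a union bound and guarantees that empirical risk minimization is within $\epsilon$ of the best hypothesis in the class. The class size and $\delta$ are illustrative assumptions.

```python
import math

def realizable_sample_size(epsilon, delta, class_size):
    """m >= (ln|H| + ln(1/delta)) / epsilon, for a consistent learner."""
    return math.ceil((math.log(class_size) + math.log(1 / delta)) / epsilon)

def agnostic_sample_size(epsilon, delta, class_size):
    """m >= 2 ln(2|H|/delta) / epsilon**2, so that ERM is within epsilon
    of the best hypothesis in the class (Hoeffding plus union bound)."""
    return math.ceil(2 * math.log(2 * class_size / delta) / epsilon ** 2)

class_size, delta = 10_000, 0.01   # illustrative values
table = {eps: (realizable_sample_size(eps, delta, class_size),
               agnostic_sample_size(eps, delta, class_size))
         for eps in (0.05, 0.01)}
for eps, (m_real, m_agn) in table.items():
    print(f"epsilon = {eps}: realizable m >= {m_real}, agnostic m >= {m_agn}")
```

The agnostic requirement scales like $1/\epsilon^2$ rather than $1/\epsilon$, which matters for safety classifiers because noisy or contested labels usually rule out the realizable assumption.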

Three examples of safety classifier sample needs:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for safety classifier sample needs is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

Implementation note for the companion notebook: safety classifier sample needs will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

Checklist for using safety classifier sample needs responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

6.4 evaluation-set sizing preview

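Evaluation-set sizing preview asks how large a held-out sample must be for the measured error of a fixed classifier to approximate its true risk to within a stated tolerance; the evaluation protocols themselves belong to Chapter 17. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring. The target of the estimate is the true risk:

$$L_{\mathcal{D}}(h)=P_{(\mathbf{x},y)\sim\mathcal{D}}[h(\mathbf{x})\ne y].$$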

The formula should be read operationally. For evaluation-set sizing preview, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.
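For a single classifier that is fixed before the evaluation set is drawn, Hoeffding's inequality gives a direct sizing rule: $m \ge \log(2/\delta)/(2\epsilon^2)$ examples suffice for $\lvert L_S(h)-L_{\mathcal{D}}(h)\rvert \le \epsilon$ with probability at least $1-\delta$. A minimal sketch, with the function name and the tolerances chosen for illustration:

```python
import math

def eval_set_size(epsilon, delta):
    """Hoeffding: m >= ln(2/delta) / (2 epsilon^2) examples make
    |L_S(h) - L_D(h)| <= epsilon with probability >= 1 - delta,
    for one classifier h fixed before the evaluation set is drawn."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

sizes = {eps: eval_set_size(eps, delta=0.05) for eps in (0.05, 0.02, 0.01)}
for eps, m in sizes.items():
    print(f"+/- {eps:.2f} at 95% confidence -> m >= {m}")
```

The rule assumes i.i.d. evaluation examples and no tuning on the evaluation set; if several classifiers are compared on the same set, the class size enters again as in Section 6.1.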

Three examples of evaluation-set sizing preview:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for evaluation-set sizing preview is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

Implementation note for the companion notebook: evaluation-set sizing preview will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

Checklist for using evaluation-set sizing preview responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.

6.5 limits for deep nets

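The limits for deep nets arise where the finite-class argument stops being informative: a deep network defines a hypothesis class so large that the bound can demand more samples than are available, whether or not the network generalizes in practice. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring. What is actually measured remains the empirical risk:

$$L_S(h)=\frac{1}{m}\sum_{i=1}^{m}\mathbb{1}[h(\mathbf{x}^{(i)})\ne y^{(i)}].$$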

The formula should be read operationally. For limits for deep nets, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.
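A toy illustration of why zero empirical risk certifies nothing when the class is rich enough to fit any labeling. The sketch uses the class of all functions on a finite domain and a "memorizer" drawn from it; the random-label distribution, domain size, and sample size are assumptions chosen to make the effect visible, not a model of any particular network.

```python
import random

rng = random.Random(4)
domain_size, m = 10_000, 200

# Illustrative worst case: labels are independent fair coin flips,
# so nothing seen in the sample predicts the label of an unseen point.
labels = [rng.randrange(2) for _ in range(domain_size)]
sample = [(x, labels[x]) for x in (rng.randrange(domain_size) for _ in range(m))]

# A "memorizer" from the class of all functions on the domain:
# repeat stored labels, predict 0 everywhere else.
memory = dict(sample)
def memorizer(x):
    return memory.get(x, 0)

empirical_risk = sum(memorizer(x) != y for x, y in sample) / m
true_risk = sum(memorizer(x) != labels[x] for x in range(domain_size)) / domain_size
print(f"L_S = {empirical_risk:.3f}, L_D under uniform x = {true_risk:.3f}")
```

The empirical risk is exactly zero while the true risk stays close to chance. Deep networks often do far better than this on real data, which is why explaining their generalization requires more than counting hypotheses.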

Three examples of limits for deep nets:

A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

A leaderboard rank without a distributional statement is not a learnability guarantee.
A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for limits for deep nets is to identify the random object first. Sometimes the randomness is the sample $S$. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

Implementation note for the companion notebook: limits for deep nets will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

Checklist for using limits for deep nets responsibly:

State the sample space and label space.
State the hypothesis or function class.
State the loss and risk definition.
State whether the setting is realizable or agnostic.
Track both accuracy tolerance and confidence.
Identify whether the bound is distribution-free or data-dependent.
Separate the theorem from the empirical measurement.
