Private notes
0/8000

Notes stay private to your browser until account sync is configured.

Part 1
30 min read12 headingsSplit lesson page

Lesson overview | Lesson overview | Next part

PAC Learning: Part 1: Intuition to 2. Formal Definitions

1. Intuition

Intuition develops the part of pac learning specified by the approved Chapter 21 table of contents. The emphasis is statistical learning theory, not generic statistics, optimization recipes, or benchmark operations.

1.1 learning as selecting a hypothesis from data

Learning as selecting a hypothesis from data is part of the canonical scope of PAC Learning. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

In this subsection the working scope is probably approximately correct guarantees, finite-class sample complexity, realizable and agnostic learning, and distribution-free learnability. We use a distribution D\mathcal{D}, a sample SS, a hypothesis class H\mathcal{H}, and a loss-derived risk. The core question is whether the behavior on SS can control the behavior under D\mathcal{D}.

LD(h)=P(x,y)D[h(x)y].L_{\mathcal{D}}(h)=P_{(\mathbf{x},y)\sim\mathcal{D}}[h(\mathbf{x})\ne y].

The formula should be read operationally. For learning as selecting a hypothesis from data, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory objectMeaningAI interpretation
D\mathcal{D}Unknown data distributionUser prompts, images, tokens, labels, or tasks the system will face
SSFinite training or evaluation sampleThe observed examples available to the learner or auditor
H\mathcal{H}Hypothesis classClassifiers, probes, reward models, safety filters, or predictors
LS(h)L_S(h)Empirical riskError measured on the observed sample
LD(h)L_{\mathcal{D}}(h)True riskError on the distribution that matters after deployment

Three examples of learning as selecting a hypothesis from data:

  1. A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
  2. A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
  3. A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

  1. A leaderboard rank without a distributional statement is not a learnability guarantee.
  2. A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for learning as selecting a hypothesis from data is to identify the random object first. Sometimes the randomness is the sample SS. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

The gap between the last two quantities is the reason this chapter exists. Chapter 17 measures it empirically with benchmark protocols. Chapter 21 studies when mathematics can control it before all future examples are observed.

Implementation note for the companion notebook: learning as selecting a hypothesis from data will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

The modern AI caution is that very large models often violate the cleanest textbook assumptions. That does not make the mathematics useless. It means the reader should distinguish theorem-level guarantees from diagnostic metaphors and engineering heuristics.

Checklist for using learning as selecting a hypothesis from data responsibly:

  • State the sample space and label space.
  • State the hypothesis or function class.
  • State the loss and risk definition.
  • State whether the setting is realizable or agnostic.
  • Track both accuracy tolerance and confidence.
  • Identify whether the bound is distribution-free or data-dependent.
  • Separate the theorem from the empirical measurement.

For AI systems, this discipline prevents a common confusion: empirical success is evidence, but learnability theory explains which kinds of evidence should scale with sample size, class capacity, margins, norms, and noise.

The subsection also prepares the later material. PAC learning motivates VC dimension. VC dimension motivates generalization bounds. Bias-variance decomposition gives a different error accounting. Rademacher complexity gives a data-dependent complexity view.

1.2 probably approximately correct guarantee

Probably approximately correct guarantee is part of the canonical scope of PAC Learning. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

In this subsection the working scope is probably approximately correct guarantees, finite-class sample complexity, realizable and agnostic learning, and distribution-free learnability. We use a distribution D\mathcal{D}, a sample SS, a hypothesis class H\mathcal{H}, and a loss-derived risk. The core question is whether the behavior on SS can control the behavior under D\mathcal{D}.

LS(h)=1mi=1m1[h(x(i))y(i)].L_S(h)=\frac{1}{m}\sum_{i=1}^{m}\mathbb{1}[h(\mathbf{x}^{(i)})\ne y^{(i)}].

The formula should be read operationally. For probably approximately correct guarantee, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory objectMeaningAI interpretation
D\mathcal{D}Unknown data distributionUser prompts, images, tokens, labels, or tasks the system will face
SSFinite training or evaluation sampleThe observed examples available to the learner or auditor
H\mathcal{H}Hypothesis classClassifiers, probes, reward models, safety filters, or predictors
LS(h)L_S(h)Empirical riskError measured on the observed sample
LD(h)L_{\mathcal{D}}(h)True riskError on the distribution that matters after deployment

Three examples of probably approximately correct guarantee:

  1. A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
  2. A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
  3. A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

  1. A leaderboard rank without a distributional statement is not a learnability guarantee.
  2. A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for probably approximately correct guarantee is to identify the random object first. Sometimes the randomness is the sample SS. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

The gap between the last two quantities is the reason this chapter exists. Chapter 17 measures it empirically with benchmark protocols. Chapter 21 studies when mathematics can control it before all future examples are observed.

Implementation note for the companion notebook: probably approximately correct guarantee will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

The modern AI caution is that very large models often violate the cleanest textbook assumptions. That does not make the mathematics useless. It means the reader should distinguish theorem-level guarantees from diagnostic metaphors and engineering heuristics.

Checklist for using probably approximately correct guarantee responsibly:

  • State the sample space and label space.
  • State the hypothesis or function class.
  • State the loss and risk definition.
  • State whether the setting is realizable or agnostic.
  • Track both accuracy tolerance and confidence.
  • Identify whether the bound is distribution-free or data-dependent.
  • Separate the theorem from the empirical measurement.

For AI systems, this discipline prevents a common confusion: empirical success is evidence, but learnability theory explains which kinds of evidence should scale with sample size, class capacity, margins, norms, and noise.

The subsection also prepares the later material. PAC learning motivates VC dimension. VC dimension motivates generalization bounds. Bias-variance decomposition gives a different error accounting. Rademacher complexity gives a data-dependent complexity view.

1.3 error confidence and sample size

Error confidence and sample size is part of the canonical scope of PAC Learning. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

In this subsection the working scope is probably approximately correct guarantees, finite-class sample complexity, realizable and agnostic learning, and distribution-free learnability. We use a distribution D\mathcal{D}, a sample SS, a hypothesis class H\mathcal{H}, and a loss-derived risk. The core question is whether the behavior on SS can control the behavior under D\mathcal{D}.

m1ϵ(logH+log1δ).m \ge \frac{1}{\epsilon}\left(\log\lvert\mathcal{H}\rvert + \log\frac{1}{\delta}\right).

The formula should be read operationally. For error confidence and sample size, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory objectMeaningAI interpretation
D\mathcal{D}Unknown data distributionUser prompts, images, tokens, labels, or tasks the system will face
SSFinite training or evaluation sampleThe observed examples available to the learner or auditor
H\mathcal{H}Hypothesis classClassifiers, probes, reward models, safety filters, or predictors
LS(h)L_S(h)Empirical riskError measured on the observed sample
LD(h)L_{\mathcal{D}}(h)True riskError on the distribution that matters after deployment

Three examples of error confidence and sample size:

  1. A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
  2. A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
  3. A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

  1. A leaderboard rank without a distributional statement is not a learnability guarantee.
  2. A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for error confidence and sample size is to identify the random object first. Sometimes the randomness is the sample SS. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

The gap between the last two quantities is the reason this chapter exists. Chapter 17 measures it empirically with benchmark protocols. Chapter 21 studies when mathematics can control it before all future examples are observed.

Implementation note for the companion notebook: error confidence and sample size will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

The modern AI caution is that very large models often violate the cleanest textbook assumptions. That does not make the mathematics useless. It means the reader should distinguish theorem-level guarantees from diagnostic metaphors and engineering heuristics.

Checklist for using error confidence and sample size responsibly:

  • State the sample space and label space.
  • State the hypothesis or function class.
  • State the loss and risk definition.
  • State whether the setting is realizable or agnostic.
  • Track both accuracy tolerance and confidence.
  • Identify whether the bound is distribution-free or data-dependent.
  • Separate the theorem from the empirical measurement.

For AI systems, this discipline prevents a common confusion: empirical success is evidence, but learnability theory explains which kinds of evidence should scale with sample size, class capacity, margins, norms, and noise.

The subsection also prepares the later material. PAC learning motivates VC dimension. VC dimension motivates generalization bounds. Bias-variance decomposition gives a different error accounting. Rademacher complexity gives a data-dependent complexity view.

1.4 why PAC is distribution-free

Why pac is distribution-free is part of the canonical scope of PAC Learning. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

In this subsection the working scope is probably approximately correct guarantees, finite-class sample complexity, realizable and agnostic learning, and distribution-free learnability. We use a distribution D\mathcal{D}, a sample SS, a hypothesis class H\mathcal{H}, and a loss-derived risk. The core question is whether the behavior on SS can control the behavior under D\mathcal{D}.

P[LD(hS)ϵ]1δ.P[L_{\mathcal{D}}(h_S)\le \epsilon] \ge 1-\delta.

The formula should be read operationally. For why pac is distribution-free, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory objectMeaningAI interpretation
D\mathcal{D}Unknown data distributionUser prompts, images, tokens, labels, or tasks the system will face
SSFinite training or evaluation sampleThe observed examples available to the learner or auditor
H\mathcal{H}Hypothesis classClassifiers, probes, reward models, safety filters, or predictors
LS(h)L_S(h)Empirical riskError measured on the observed sample
LD(h)L_{\mathcal{D}}(h)True riskError on the distribution that matters after deployment

Three examples of why pac is distribution-free:

  1. A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
  2. A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
  3. A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

  1. A leaderboard rank without a distributional statement is not a learnability guarantee.
  2. A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for why pac is distribution-free is to identify the random object first. Sometimes the randomness is the sample SS. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

The gap between the last two quantities is the reason this chapter exists. Chapter 17 measures it empirically with benchmark protocols. Chapter 21 studies when mathematics can control it before all future examples are observed.

Implementation note for the companion notebook: why pac is distribution-free will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

The modern AI caution is that very large models often violate the cleanest textbook assumptions. That does not make the mathematics useless. It means the reader should distinguish theorem-level guarantees from diagnostic metaphors and engineering heuristics.

Checklist for using why pac is distribution-free responsibly:

  • State the sample space and label space.
  • State the hypothesis or function class.
  • State the loss and risk definition.
  • State whether the setting is realizable or agnostic.
  • Track both accuracy tolerance and confidence.
  • Identify whether the bound is distribution-free or data-dependent.
  • Separate the theorem from the empirical measurement.

For AI systems, this discipline prevents a common confusion: empirical success is evidence, but learnability theory explains which kinds of evidence should scale with sample size, class capacity, margins, norms, and noise.

The subsection also prepares the later material. PAC learning motivates VC dimension. VC dimension motivates generalization bounds. Bias-variance decomposition gives a different error accounting. Rademacher complexity gives a data-dependent complexity view.

1.5 what PAC does not promise

What pac does not promise is part of the canonical scope of PAC Learning. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

In this subsection the working scope is probably approximately correct guarantees, finite-class sample complexity, realizable and agnostic learning, and distribution-free learnability. We use a distribution D\mathcal{D}, a sample SS, a hypothesis class H\mathcal{H}, and a loss-derived risk. The core question is whether the behavior on SS can control the behavior under D\mathcal{D}.

LD(h)=P(x,y)D[h(x)y].L_{\mathcal{D}}(h)=P_{(\mathbf{x},y)\sim\mathcal{D}}[h(\mathbf{x})\ne y].

The formula should be read operationally. For what pac does not promise, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory objectMeaningAI interpretation
D\mathcal{D}Unknown data distributionUser prompts, images, tokens, labels, or tasks the system will face
SSFinite training or evaluation sampleThe observed examples available to the learner or auditor
H\mathcal{H}Hypothesis classClassifiers, probes, reward models, safety filters, or predictors
LS(h)L_S(h)Empirical riskError measured on the observed sample
LD(h)L_{\mathcal{D}}(h)True riskError on the distribution that matters after deployment

Three examples of what pac does not promise:

  1. A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
  2. A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
  3. A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

  1. A leaderboard rank without a distributional statement is not a learnability guarantee.
  2. A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for what pac does not promise is to identify the random object first. Sometimes the randomness is the sample SS. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

The gap between the last two quantities is the reason this chapter exists. Chapter 17 measures it empirically with benchmark protocols. Chapter 21 studies when mathematics can control it before all future examples are observed.

Implementation note for the companion notebook: what pac does not promise will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

The modern AI caution is that very large models often violate the cleanest textbook assumptions. That does not make the mathematics useless. It means the reader should distinguish theorem-level guarantees from diagnostic metaphors and engineering heuristics.

Checklist for using what pac does not promise responsibly:

  • State the sample space and label space.
  • State the hypothesis or function class.
  • State the loss and risk definition.
  • State whether the setting is realizable or agnostic.
  • Track both accuracy tolerance and confidence.
  • Identify whether the bound is distribution-free or data-dependent.
  • Separate the theorem from the empirical measurement.

For AI systems, this discipline prevents a common confusion: empirical success is evidence, but learnability theory explains which kinds of evidence should scale with sample size, class capacity, margins, norms, and noise.

The subsection also prepares the later material. PAC learning motivates VC dimension. VC dimension motivates generalization bounds. Bias-variance decomposition gives a different error accounting. Rademacher complexity gives a data-dependent complexity view.

2. Formal Definitions

Formal Definitions develops the part of pac learning specified by the approved Chapter 21 table of contents. The emphasis is statistical learning theory, not generic statistics, optimization recipes, or benchmark operations.

2.1 instance space X\mathcal{X}

Instance space x\mathcal{x} is part of the canonical scope of PAC Learning. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

In this subsection the working scope is probably approximately correct guarantees, finite-class sample complexity, realizable and agnostic learning, and distribution-free learnability. We use a distribution D\mathcal{D}, a sample SS, a hypothesis class H\mathcal{H}, and a loss-derived risk. The core question is whether the behavior on SS can control the behavior under D\mathcal{D}.

LS(h)=1mi=1m1[h(x(i))y(i)].L_S(h)=\frac{1}{m}\sum_{i=1}^{m}\mathbb{1}[h(\mathbf{x}^{(i)})\ne y^{(i)}].

The formula should be read operationally. For instance space x\mathcal{x}, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory objectMeaningAI interpretation
D\mathcal{D}Unknown data distributionUser prompts, images, tokens, labels, or tasks the system will face
SSFinite training or evaluation sampleThe observed examples available to the learner or auditor
H\mathcal{H}Hypothesis classClassifiers, probes, reward models, safety filters, or predictors
LS(h)L_S(h)Empirical riskError measured on the observed sample
LD(h)L_{\mathcal{D}}(h)True riskError on the distribution that matters after deployment

Three examples of instance space x\mathcal{x}:

  1. A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
  2. A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
  3. A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

  1. A leaderboard rank without a distributional statement is not a learnability guarantee.
  2. A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for instance space x\mathcal{x} is to identify the random object first. Sometimes the randomness is the sample SS. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

The gap between the last two quantities is the reason this chapter exists. Chapter 17 measures it empirically with benchmark protocols. Chapter 21 studies when mathematics can control it before all future examples are observed.

Implementation note for the companion notebook: instance space x\mathcal{x} will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

The modern AI caution is that very large models often violate the cleanest textbook assumptions. That does not make the mathematics useless. It means the reader should distinguish theorem-level guarantees from diagnostic metaphors and engineering heuristics.

Checklist for using instance space x\mathcal{x} responsibly:

  • State the sample space and label space.
  • State the hypothesis or function class.
  • State the loss and risk definition.
  • State whether the setting is realizable or agnostic.
  • Track both accuracy tolerance and confidence.
  • Identify whether the bound is distribution-free or data-dependent.
  • Separate the theorem from the empirical measurement.

For AI systems, this discipline prevents a common confusion: empirical success is evidence, but learnability theory explains which kinds of evidence should scale with sample size, class capacity, margins, norms, and noise.

The subsection also prepares the later material. PAC learning motivates VC dimension. VC dimension motivates generalization bounds. Bias-variance decomposition gives a different error accounting. Rademacher complexity gives a data-dependent complexity view.

2.2 label space Y\mathcal{Y}

Label space y\mathcal{y} is part of the canonical scope of PAC Learning. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

In this subsection the working scope is probably approximately correct guarantees, finite-class sample complexity, realizable and agnostic learning, and distribution-free learnability. We use a distribution D\mathcal{D}, a sample SS, a hypothesis class H\mathcal{H}, and a loss-derived risk. The core question is whether the behavior on SS can control the behavior under D\mathcal{D}.

m1ϵ(logH+log1δ).m \ge \frac{1}{\epsilon}\left(\log\lvert\mathcal{H}\rvert + \log\frac{1}{\delta}\right).

The formula should be read operationally. For label space y\mathcal{y}, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory objectMeaningAI interpretation
D\mathcal{D}Unknown data distributionUser prompts, images, tokens, labels, or tasks the system will face
SSFinite training or evaluation sampleThe observed examples available to the learner or auditor
H\mathcal{H}Hypothesis classClassifiers, probes, reward models, safety filters, or predictors
LS(h)L_S(h)Empirical riskError measured on the observed sample
LD(h)L_{\mathcal{D}}(h)True riskError on the distribution that matters after deployment

Three examples of label space y\mathcal{y}:

  1. A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
  2. A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
  3. A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

  1. A leaderboard rank without a distributional statement is not a learnability guarantee.
  2. A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for label space y\mathcal{y} is to identify the random object first. Sometimes the randomness is the sample SS. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

The gap between the last two quantities is the reason this chapter exists. Chapter 17 measures it empirically with benchmark protocols. Chapter 21 studies when mathematics can control it before all future examples are observed.

Implementation note for the companion notebook: label space y\mathcal{y} will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

The modern AI caution is that very large models often violate the cleanest textbook assumptions. That does not make the mathematics useless. It means the reader should distinguish theorem-level guarantees from diagnostic metaphors and engineering heuristics.

Checklist for using label space y\mathcal{y} responsibly:

  • State the sample space and label space.
  • State the hypothesis or function class.
  • State the loss and risk definition.
  • State whether the setting is realizable or agnostic.
  • Track both accuracy tolerance and confidence.
  • Identify whether the bound is distribution-free or data-dependent.
  • Separate the theorem from the empirical measurement.

For AI systems, this discipline prevents a common confusion: empirical success is evidence, but learnability theory explains which kinds of evidence should scale with sample size, class capacity, margins, norms, and noise.

The subsection also prepares the later material. PAC learning motivates VC dimension. VC dimension motivates generalization bounds. Bias-variance decomposition gives a different error accounting. Rademacher complexity gives a data-dependent complexity view.

2.3 hypothesis class H\mathcal{H}

Hypothesis class h\mathcal{h} is part of the canonical scope of PAC Learning. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

In this subsection the working scope is probably approximately correct guarantees, finite-class sample complexity, realizable and agnostic learning, and distribution-free learnability. We use a distribution D\mathcal{D}, a sample SS, a hypothesis class H\mathcal{H}, and a loss-derived risk. The core question is whether the behavior on SS can control the behavior under D\mathcal{D}.

P[LD(hS)ϵ]1δ.P[L_{\mathcal{D}}(h_S)\le \epsilon] \ge 1-\delta.

The formula should be read operationally. For hypothesis class h\mathcal{h}, a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory objectMeaningAI interpretation
D\mathcal{D}Unknown data distributionUser prompts, images, tokens, labels, or tasks the system will face
SSFinite training or evaluation sampleThe observed examples available to the learner or auditor
H\mathcal{H}Hypothesis classClassifiers, probes, reward models, safety filters, or predictors
LS(h)L_S(h)Empirical riskError measured on the observed sample
LD(h)L_{\mathcal{D}}(h)True riskError on the distribution that matters after deployment

Three examples of hypothesis class h\mathcal{h}:

  1. A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
  2. A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
  3. A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

  1. A leaderboard rank without a distributional statement is not a learnability guarantee.
  2. A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for hypothesis class h\mathcal{h} is to identify the random object first. Sometimes the randomness is the sample SS. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

The gap between the last two quantities is the reason this chapter exists. Chapter 17 measures it empirically with benchmark protocols. Chapter 21 studies when mathematics can control it before all future examples are observed.

Implementation note for the companion notebook: hypothesis class h\mathcal{h} will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

The modern AI caution is that very large models often violate the cleanest textbook assumptions. That does not make the mathematics useless. It means the reader should distinguish theorem-level guarantees from diagnostic metaphors and engineering heuristics.

Checklist for using hypothesis class h\mathcal{h} responsibly:

  • State the sample space and label space.
  • State the hypothesis or function class.
  • State the loss and risk definition.
  • State whether the setting is realizable or agnostic.
  • Track both accuracy tolerance and confidence.
  • Identify whether the bound is distribution-free or data-dependent.
  • Separate the theorem from the empirical measurement.

For AI systems, this discipline prevents a common confusion: empirical success is evidence, but learnability theory explains which kinds of evidence should scale with sample size, class capacity, margins, norms, and noise.

The subsection also prepares the later material. PAC learning motivates VC dimension. VC dimension motivates generalization bounds. Bias-variance decomposition gives a different error accounting. Rademacher complexity gives a data-dependent complexity view.

2.4 true risk LD(h)L_{\mathcal{D}}(h) and empirical risk LS(h)L_S(h)

True risk ld(h)l_{\mathcal{d}}(h) and empirical risk ls(h)l_s(h) is part of the canonical scope of PAC Learning. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

In this subsection the working scope is probably approximately correct guarantees, finite-class sample complexity, realizable and agnostic learning, and distribution-free learnability. We use a distribution D\mathcal{D}, a sample SS, a hypothesis class H\mathcal{H}, and a loss-derived risk. The core question is whether the behavior on SS can control the behavior under D\mathcal{D}.

LD(h)=P(x,y)D[h(x)y].L_{\mathcal{D}}(h)=P_{(\mathbf{x},y)\sim\mathcal{D}}[h(\mathbf{x})\ne y].

The formula should be read operationally. For true risk ld(h)l_{\mathcal{d}}(h) and empirical risk ls(h)l_s(h), a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory objectMeaningAI interpretation
D\mathcal{D}Unknown data distributionUser prompts, images, tokens, labels, or tasks the system will face
SSFinite training or evaluation sampleThe observed examples available to the learner or auditor
H\mathcal{H}Hypothesis classClassifiers, probes, reward models, safety filters, or predictors
LS(h)L_S(h)Empirical riskError measured on the observed sample
LD(h)L_{\mathcal{D}}(h)True riskError on the distribution that matters after deployment

Three examples of true risk ld(h)l_{\mathcal{d}}(h) and empirical risk ls(h)l_s(h):

  1. A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
  2. A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
  3. A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

  1. A leaderboard rank without a distributional statement is not a learnability guarantee.
  2. A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for true risk ld(h)l_{\mathcal{d}}(h) and empirical risk ls(h)l_s(h) is to identify the random object first. Sometimes the randomness is the sample SS. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

The gap between the last two quantities is the reason this chapter exists. Chapter 17 measures it empirically with benchmark protocols. Chapter 21 studies when mathematics can control it before all future examples are observed.

Implementation note for the companion notebook: true risk ld(h)l_{\mathcal{d}}(h) and empirical risk ls(h)l_s(h) will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

The modern AI caution is that very large models often violate the cleanest textbook assumptions. That does not make the mathematics useless. It means the reader should distinguish theorem-level guarantees from diagnostic metaphors and engineering heuristics.

Checklist for using true risk ld(h)l_{\mathcal{d}}(h) and empirical risk ls(h)l_s(h) responsibly:

  • State the sample space and label space.
  • State the hypothesis or function class.
  • State the loss and risk definition.
  • State whether the setting is realizable or agnostic.
  • Track both accuracy tolerance and confidence.
  • Identify whether the bound is distribution-free or data-dependent.
  • Separate the theorem from the empirical measurement.

For AI systems, this discipline prevents a common confusion: empirical success is evidence, but learnability theory explains which kinds of evidence should scale with sample size, class capacity, margins, norms, and noise.

The subsection also prepares the later material. PAC learning motivates VC dimension. VC dimension motivates generalization bounds. Bias-variance decomposition gives a different error accounting. Rademacher complexity gives a data-dependent complexity view.

2.5 PAC learner with (ϵ,δ)(\epsilon,\delta)

Pac learner with (ϵ,δ)(\epsilon,\delta) is part of the canonical scope of PAC Learning. The purpose is to understand when finite data can justify a claim about unseen examples, not to replace empirical evaluation or production monitoring.

In this subsection the working scope is probably approximately correct guarantees, finite-class sample complexity, realizable and agnostic learning, and distribution-free learnability. We use a distribution D\mathcal{D}, a sample SS, a hypothesis class H\mathcal{H}, and a loss-derived risk. The core question is whether the behavior on SS can control the behavior under D\mathcal{D}.

LS(h)=1mi=1m1[h(x(i))y(i)].L_S(h)=\frac{1}{m}\sum_{i=1}^{m}\mathbb{1}[h(\mathbf{x}^{(i)})\ne y^{(i)}].

The formula should be read operationally. For pac learner with (ϵ,δ)(\epsilon,\delta), a learner is not certified by a story about model architecture. It is certified by assumptions, a class of hypotheses, a loss, a sample size, and a probability statement.

Theory objectMeaningAI interpretation
D\mathcal{D}Unknown data distributionUser prompts, images, tokens, labels, or tasks the system will face
SSFinite training or evaluation sampleThe observed examples available to the learner or auditor
H\mathcal{H}Hypothesis classClassifiers, probes, reward models, safety filters, or predictors
LS(h)L_S(h)Empirical riskError measured on the observed sample
LD(h)L_{\mathcal{D}}(h)True riskError on the distribution that matters after deployment

Three examples of pac learner with (ϵ,δ)(\epsilon,\delta):

  1. A binary safety classifier is evaluated on a sample of labeled prompts, but the team needs a bound on future violation-detection error.
  2. A linear probe is trained on hidden states, and learning theory asks how much the probe's validation behavior depends on sample size and class capacity.
  3. A small model is fine-tuned on limited domain data, and the practitioner wants to separate approximation error from estimation error.

Two non-examples are just as important:

  1. A leaderboard rank without a distributional statement is not a learnability guarantee.
  2. A production incident report without a hypothesis class, loss, or sampling assumption is not a statistical learning theorem.

The proof habit for pac learner with (ϵ,δ)(\epsilon,\delta) is to identify the random object first. Sometimes the randomness is the sample SS. Sometimes it is Rademacher signs. Sometimes it is label noise. Once the random object is explicit, concentration and symmetrization tools can be used without hand-waving.

A useful ASCII picture for this subsection is:

unknown distribution D
        | sample S
        v
 empirical learner h_S ----> empirical risk L_S(h_S)
        |
        v
 true deployment risk L_D(h_S)

The gap between the last two quantities is the reason this chapter exists. Chapter 17 measures it empirically with benchmark protocols. Chapter 21 studies when mathematics can control it before all future examples are observed.

Implementation note for the companion notebook: pac learner with (ϵ,δ)(\epsilon,\delta) will be demonstrated with synthetic finite samples. The code will not depend on external datasets; it will compute bounds, simulate class behavior, or plot risk decompositions so the theorem-level object is visible.

The modern AI caution is that very large models often violate the cleanest textbook assumptions. That does not make the mathematics useless. It means the reader should distinguish theorem-level guarantees from diagnostic metaphors and engineering heuristics.

Checklist for using pac learner with (ϵ,δ)(\epsilon,\delta) responsibly:

  • State the sample space and label space.
  • State the hypothesis or function class.
  • State the loss and risk definition.
  • State whether the setting is realizable or agnostic.
  • Track both accuracy tolerance and confidence.
  • Identify whether the bound is distribution-free or data-dependent.
  • Separate the theorem from the empirical measurement.

For AI systems, this discipline prevents a common confusion: empirical success is evidence, but learnability theory explains which kinds of evidence should scale with sample size, class capacity, margins, norms, and noise.

The subsection also prepares the later material. PAC learning motivates VC dimension. VC dimension motivates generalization bounds. Bias-variance decomposition gives a different error accounting. Rademacher complexity gives a data-dependent complexity view.

Skill Check

Test this lesson

Answer 4 quick questions to lock in the lesson and feed your adaptive practice queue.

--
Score
0/4
Answered
Not attempted
Status
1

Which module does this lesson belong to?

2

Which section is covered in this lesson content?

3

Which term is most central to this lesson?

4

What is the best way to use this lesson for real learning?

Your answers save locally first, then sync when account storage is available.
Practice queue