Hypothesis Testing

<- Back to Chapter 7: Statistics | Next: Bayesian Inference ->


"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of."

  • Sir Ronald A. Fisher, Presidential Address to the First Indian Statistical Congress (1938)

Overview

Hypothesis testing is the art of making principled decisions from data under uncertainty. Where estimation theory (Section02) asks "what is the value of this parameter?", hypothesis testing asks a sharper question: "is this parameter consistent with a specific claim?" This inversion - from continuous estimation to binary decision - is the formal machinery behind every A/B experiment that ships a product feature, every clinical trial that approves a drug, and every benchmark comparison that claims one model outperforms another.

The discipline has two intertwined origins. Ronald Fisher developed the p-value framework in the 1920s: compute how surprising the data would be if the null hypothesis were true, and report that probability as evidence. Jerzy Neyman and Egon Pearson developed the decision-theoretic framework in 1933: pre-commit to a decision rule with controlled error rates before seeing the data. Modern practice blends both - using p-values as a continuous measure of evidence while respecting the Neyman-Pearson discipline of pre-specified significance levels, power analysis, and sample size planning.

For AI and ML, hypothesis testing has never been more important. Every benchmark leaderboard is an implicit multiple-comparison experiment susceptible to false-discovery inflation. Every online A/B test deployed at scale faces the sequential testing problem. Every data pipeline needs distributional drift detection. This section builds the complete framework: from the formal definition of a test statistic through the Neyman-Pearson lemma, classical t/\chi^2/F tests, likelihood ratio tests, multiple testing correction, nonparametric methods, and the sequential A/B testing infrastructure that powers modern ML deployment.

Prerequisites

Companion Notebooks

Notebook         Description
theory.ipynb     Interactive derivations: t-tests, chi-squared tests, power curves, NP lemma, GLRT, Bonferroni/BH correction, permutation tests, KS drift detection, sequential SPRT
exercises.ipynb  10 graded exercises from one-sample t-tests through sequential A/B testing and KS-based LLM drift detection

Learning Objectives

After completing this section, you will:

  • State the formal definition of a statistical hypothesis and distinguish simple from composite hypotheses
  • Define a test statistic, rejection region, and p-value and explain what each does and does not mean
  • Quantify Type I error (\alpha), Type II error (\beta), and power (1-\beta), and explain why they cannot all be minimised simultaneously
  • Derive sample size requirements from desired \alpha, \beta, and effect size (Cohen's d)
  • Derive and apply one-sample t-test, two-sample Welch t-test, chi-squared goodness-of-fit, and one-way ANOVA
  • State and prove the Neyman-Pearson lemma and identify when a UMP test exists
  • Apply Wilks' theorem to construct generalized likelihood ratio tests for composite hypotheses
  • Explain the multiple testing problem and apply Bonferroni, Holm, and Benjamini-Hochberg corrections
  • Implement permutation tests, Wilcoxon rank tests, and KS tests for nonparametric inference
  • Design a sequential A/B test using SPRT and explain why it avoids the peeking problem
  • Identify statistical pitfalls in NLP benchmark comparisons and LLM evaluation leaderboards



1. Intuition

1.1 The Core Question: Evidence or Noise?

Imagine you flip a coin 100 times and observe 63 heads. Is the coin biased, or is 63 just a chance fluctuation from a fair coin? You cannot answer this by staring at the number 63. You need a framework that asks: how often would a fair coin produce 63 or more heads in 100 flips? If the answer is "1 in 1,000 times", you have strong evidence for bias. If the answer is "1 in 5 times", the result is easily explained by chance.
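A quick numerical check of this intuition (a minimal sketch using scipy; the counts 63 out of 100 come from the example above):

```python
from scipy import stats

n, heads = 100, 63
# One-sided: how often does a fair coin give 63 or more heads in 100 flips?
p_one_sided = stats.binom.sf(heads - 1, n, 0.5)
# Two-sided exact binomial test: a deviation this large in either direction
p_two_sided = stats.binomtest(heads, n, 0.5, alternative="two-sided").pvalue

print(f"P(X >= 63 | fair coin) = {p_one_sided:.4f}")  # roughly 0.006
print(f"two-sided p-value      = {p_two_sided:.4f}")  # roughly 0.012
```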

This is the essence of hypothesis testing: quantify how surprising the data would be if the "nothing interesting happened" explanation were true. The "nothing interesting happened" explanation is the null hypothesis H_0. The alternative - something systematic is going on - is the alternative hypothesis H_1.

The court-room analogy is exact and instructive. In criminal law, the null hypothesis is innocence (H_0: defendant is innocent). The prosecution must present evidence so overwhelming that innocence becomes implausible. The defendant is never "proven innocent" - the court simply fails to accumulate enough evidence to reject H_0. Similarly, in statistics, we never "prove" the null hypothesis true; we can only fail to reject it. The asymmetry is deliberate: falsely convicting an innocent person (Type I error) is considered worse than failing to convict a guilty one (Type II error), so we set the bar for conviction (rejection) very high.

For AI: Every time you report "model A achieves 87.3% accuracy vs. model B's 86.1% - a statistically significant improvement at p < 0.05", you are running a hypothesis test. The null hypothesis is H_0: \mu_A = \mu_B (no real difference). The question is whether the 1.2% gap is real signal or sampling noise from a finite test set.

1.2 Two Schools of Thought

Modern statistical testing is a marriage of two incompatible philosophies that practitioners blend without always realising it.

Fisher's approach (1925): Compute the p-value - the probability of observing data at least as extreme as what was obtained, under H_0. Report it as a continuous measure of evidence against H_0. Never pre-specify H_1. Never pre-specify a decision threshold. The p-value is just one piece of evidence to weigh alongside domain knowledge and replication. Fisher rejected the idea of a fixed significance threshold as "absurdly academic".

Neyman-Pearson approach (1933): Pre-specify both H_0 and H_1, a significance level \alpha (Type I error rate), and a desired power 1 - \beta (sensitivity to H_1). Compute the most powerful test for those hypotheses. Make a binary decision: reject or do not reject. The p-value is irrelevant - what matters is whether T > c_\alpha. This framework optimises long-run decision quality across many repeated experiments.

What practitioners actually do: Use the Neyman-Pearson machinery (pre-specify \alpha, compute a test statistic, check whether p < \alpha) while interpreting the p-value in Fisher's spirit (as a continuous measure of evidence). This hybrid is coherent enough for most purposes but creates confusions - particularly the widespread misinterpretation of p-values as "the probability that H_0 is true" (which is Bayesian thinking, belonging to neither school).

FISHER vs. NEYMAN-PEARSON COMPARISON
========================================================================

  Property           | Fisher                  | Neyman-Pearson
  -------------------+-------------------------+----------------------
  Goal               | Measure evidence        | Make optimal decision
  Pre-specify H_1?    | No                      | Yes
  Pre-specify \\alpha?     | No                      | Yes (before seeing data)
  Output             | p-value (continuous)    | Reject / Do not reject
  Power              | Not part of framework   | Central to design
  Philosophical base | Inductive reasoning     | Long-run frequency
  Use case           | Exploratory science     | Industrial quality control

========================================================================

1.3 Historical Timeline

Year   Contributor                  Contribution
1710   John Arbuthnot               First known significance test (sex ratio at birth)
1900   Karl Pearson                 Chi-squared goodness-of-fit test
1908   William Gosset ("Student")   t-distribution for small samples
1922   Ronald Fisher                Formalises likelihood, degrees of freedom
1925   Ronald Fisher                Statistical Methods for Research Workers - p-values, F-test, ANOVA
1933   Neyman & Pearson             Power, UMP tests, Neyman-Pearson lemma
1943   Abraham Wald                 Sequential probability ratio test (SPRT)
1950   Abraham Wald                 Statistical decision theory
1979   Sture Holm                   Holm-Bonferroni step-down procedure for multiple testing
1995   Benjamini & Hochberg         False discovery rate (FDR) - transformative for genomics and ML
2005   Ioannidis                    "Why Most Published Research Findings Are False" - catalyses replication crisis
2016   ASA statement                Formal warning against p-value misuse
2019   Nature editorial             800+ scientists call for retiring "statistical significance"
2022+  Ramdas et al.                Always-valid p-values - sequential testing for online A/B experiments

1.4 Why Hypothesis Testing Matters for AI

Model evaluation: Reporting a test accuracy without a confidence interval or significance test is meaningless for model comparison. Is 87.3% vs. 86.1% real? On 1,000 test examples, that 1.2-point gap corresponds to a two-proportion z-test with p \approx 0.43 - not significant. On 10,000 examples, the same gap gives p \approx 0.01. Sample size is everything.
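The two-proportion z-test behind those numbers can be reproduced in a few lines (a sketch assuming statsmodels is available; the accuracies 87.3% and 86.1% come from the text):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Same 1.2-point accuracy gap, evaluated on test sets of different sizes
for n in (1_000, 10_000):
    correct = np.array([round(0.873 * n), round(0.861 * n)])  # model A, model B
    z, p = proportions_ztest(correct, np.array([n, n]))
    print(f"n = {n:>6}: z = {z:.2f}, two-sided p = {p:.3f}")
# n = 1000: p ~ 0.43 (not significant);  n = 10000: p ~ 0.01 (significant)
```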

A/B testing at scale: Tech companies run thousands of simultaneous A/B experiments. Each one is a two-sample hypothesis test. The infrastructure problem is: how do you test without pre-committing to a fixed sample size (you want to stop early if the effect is clear), while controlling false discovery rate across simultaneous tests?

Data drift detection: Production ML systems degrade when the input distribution changes. Detecting this is a two-sample test: is the distribution of today's features statistically different from training data? The Kolmogorov-Smirnov test, Maximum Mean Discrepancy, and Population Stability Index are all hypothesis tests under the hood.

LLM benchmark evaluation: The 2024-2026 era of LLM leaderboards (MMLU, HumanEval, BIG-Bench, LMSYS Arena) suffers from massive multiple-comparison inflation. If you test 100 models on 50 benchmarks, you expect 250 false discoveries at \alpha = 0.05 even if no model truly differs. Proper evaluation requires FDR correction and bootstrap confidence intervals.

Causal inference for RLHF: When measuring whether RLHF improves output quality, you need a randomised controlled design and a proper two-sample test. Confounded comparisons (different prompts, different raters) can produce entirely spurious "improvements".


2. The Formal Framework

2.1 Hypotheses: Null and Alternative

Definition (Statistical Hypothesis). A statistical hypothesis is a claim about the parameter \theta of a probability model \{P_\theta : \theta \in \Omega\}. Formally, a hypothesis specifies a subset \Theta_0 \subseteq \Omega:

H_0: \theta \in \Theta_0 \quad \text{vs.} \quad H_1: \theta \in \Theta_1 = \Omega \setminus \Theta_0

Simple vs. composite hypotheses:

  • A simple hypothesis pins \theta to a single value: H_0: \theta = \theta_0.
  • A composite hypothesis specifies a range: H_0: \theta \leq \theta_0 or H_0: \theta \neq \theta_0.

One-sided vs. two-sided tests:

  • One-sided (directional): H_1: \theta > \theta_0 or H_1: \theta < \theta_0. Use when the direction of the effect is theoretically specified in advance.
  • Two-sided (non-directional): H_1: \theta \neq \theta_0. Use when any deviation from \theta_0 is of interest, or when the direction is unknown.

The asymmetry between H_0 and H_1: The null hypothesis is the "default" - the claim we assume true unless data provide sufficient evidence against it. This asymmetry has important consequences:

  • We control the probability of falsely rejecting H_0 (Type I error).
  • We do not automatically control the probability of falsely accepting H_0 (Type II error) - that requires separate power analysis.
  • "Fail to reject H_0" is NOT the same as "accept H_0". Absence of evidence is not evidence of absence.

Standard examples:

Setting                             H_0                                              H_1
Coin fairness                       p = 0.5                                          p \neq 0.5
New drug effectiveness              \mu_{\text{treatment}} = \mu_{\text{control}}    \mu_{\text{treatment}} > \mu_{\text{control}}
Model improvement                   \text{acc}_A = \text{acc}_B                      \text{acc}_A \neq \text{acc}_B
Feature distribution shift          F_{\text{new}} = F_{\text{train}}                F_{\text{new}} \neq F_{\text{train}}
Independence in contingency table   Variables independent                            Variables associated

2.2 Test Statistics and Sampling Distributions

Definition (Test Statistic). A test statistic T = T(X_1, \ldots, X_n) is a function of the data that summarises the evidence against H_0. A good test statistic:

  1. Has a known distribution under H_0 (enables exact p-value computation).
  2. Takes extreme values when H_1 is true (enables detection).

The sampling distribution of T under H_0 is central. For example:

  • If X_1, \ldots, X_n \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2) with \sigma known, then under H_0: \mu = \mu_0:
Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1)
  • If \sigma is unknown, replacing it with the sample standard deviation S introduces extra variability:
T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t_{n-1}

The shift from \mathcal{N}(0,1) to t_{n-1} is not a minor detail - for small n, the t-distribution has much heavier tails, making it much harder to reject H_0 unless the evidence is very strong.

Standardisation principle: Most test statistics are of the form:

T = \frac{\text{Estimator} - \text{Null value}}{\text{Standard error of estimator}}

This form ensures T is dimensionless and has a tractable distribution under H_0.

2.3 Rejection Regions and Critical Values

Definition (Rejection Region). For a test of size \alpha, the rejection region \mathcal{R}_\alpha is a subset of the sample space such that:

P_{H_0}(T \in \mathcal{R}_\alpha) = \alpha

For a two-sided test of a Gaussian mean with known \sigma:

\mathcal{R}_\alpha = \{z : \lvert z \rvert > z_{\alpha/2}\}

where z_{\alpha/2} is the (1 - \alpha/2) quantile of \mathcal{N}(0, 1). At \alpha = 0.05, z_{0.025} = 1.96.

Critical value: The boundary c_\alpha such that P_{H_0}(T > c_\alpha) = \alpha (one-sided) or P_{H_0}(\lvert T \rvert > c_\alpha) = \alpha (two-sided). The test rejects H_0 iff T > c_\alpha (or \lvert T \rvert > c_\alpha).

Exact vs. approximate rejection regions:

  • For normal populations with known \sigma: exact (z-test).
  • For normal populations with unknown \sigma: exact (t-test, using the t distribution).
  • For non-normal populations, large n: approximate (the CLT makes Z approximately standard normal).
  • For small n, non-normal: nonparametric tests (Section 7).

2.4 The p-Value

Definition (p-Value). The p-value of a test with observed statistic T = t_{\text{obs}} is:

p = P_{H_0}(T \geq t_{\text{obs}})

for a one-sided test, or

p = P_{H_0}(\lvert T \rvert \geq \lvert t_{\text{obs}} \rvert)

for a two-sided test. Equivalently, p is the smallest significance level \alpha at which the observed data would lead to rejection of H_0.

Key properties of p-values:

  1. Under H_0, p \sim \mathcal{U}(0,1). This is a fundamental result: if H_0 is true and the test is exact, the p-value is uniformly distributed. This enables calibration checks.
  2. Under H_1, p is stochastically smaller - it tends toward 0 as sample size grows or effect size increases.
  3. p is a random variable. Running the same experiment twice will give different p-values. The p-value quantifies how surprising the specific data are, not how true or false H_0 is.

The six most important p-value misinterpretations:

Misinterpretation                               Why it's wrong                                           Correct statement
"p = probability H_0 is true"                   Frequentist p makes no probability claim about H_0       p = prob of data this extreme under H_0
"1 - p = probability H_1 is true"               Same error                                                Not a probability about hypotheses
"p < 0.05 means the effect is large"            p conflates effect size with sample size                  Report effect size separately
"p > 0.05 means no effect"                      Absence of evidence \neq evidence of absence              Report power and CI
"p < 0.05 means the finding replicates"         A single-study p is unreliable                            Need replication studies
"We found p = 0.049, thus significant"          Arbitrary threshold; p = 0.051 is equally evidential      Report exact p; don't dichotomize

2.5 Duality: Tests and Confidence Intervals

There is an exact correspondence between hypothesis tests and confidence intervals - a fact that is both theoretically beautiful and practically useful.

The Inversion Principle: Given a size-\alpha test for H_0: \theta = \theta_0, the (1-\alpha) confidence interval for \theta is:

\text{CI}_{1-\alpha} = \{\theta_0 : H_0 \text{ is not rejected by the size-}\alpha \text{ test}\}

Conversely, the size-\alpha test rejects H_0: \theta = \theta_0 if and only if \theta_0 \notin \text{CI}_{1-\alpha}.

Concrete example: The 95% CI for a Gaussian mean with known \sigma is:

\text{CI}_{0.95} = \left[\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}},\; \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}\right]

The corresponding z-test rejects H_0: \mu = \mu_0 at \alpha = 0.05 iff \mu_0 falls outside this interval - exactly the inversion principle.

Recall: Confidence intervals were derived in Section02 Estimation Theory. The CI for \mu was constructed by pivoting on the standard normal. Here, we see that same CI is the set of null values we would fail to reject.

Practical implication: Reporting a CI is strictly more informative than reporting a p-value. The CI tells you the effect size and uncertainty; the p-value alone only tells you whether a point null is rejected. Always prefer CIs over p-values where possible.


3. Errors, Power, and Sample Size

3.1 Type I and Type II Errors

Any binary decision procedure applied to random data will sometimes make mistakes. There are exactly two ways to err:

Definition (Type I Error). Rejecting H_0 when H_0 is true. Also called a false positive. Probability = \alpha (the significance level).

Definition (Type II Error). Failing to reject H_0 when H_1 is true. Also called a false negative. Probability = \beta (which depends on the specific alternative \theta \in \Theta_1).

ERROR TYPE TABLE
========================================================================

                      |  H_0 True             |  H_1 True
  --------------------+----------------------+----------------------
  Reject H_0           |  Type I Error (\\alpha)    |  CORRECT (Power 1-\\beta)
  Do not reject H_0    |  CORRECT (1-\\alpha)       |  Type II Error (\\beta)

  Analogy:            |  Convict innocent    |  Free the guilty
  Medical test:       |  False positive      |  False negative

========================================================================

The fundamental trade-off: For a fixed sample size n, decreasing \alpha (requiring stronger evidence to reject) increases \beta (making it harder to detect real effects). To decrease both simultaneously, you must increase n.

Conventional thresholds (and their limitations):

  • \alpha = 0.05: Fisher's suggestion from 1925, now a near-universal convention despite having no theoretical justification.
  • \alpha = 0.01: More stringent; used in physics and genomics.
  • \alpha = 0.005: Proposed by Benjamin et al. (2018) as a new standard to reduce false discoveries.
  • For AI deployment decisions: The appropriate \alpha depends on the cost of each error type. With H_0: "the candidate model is safe and effective", rejecting a good model is a Type I error, while deploying a harmful model (failing to reject H_0 when the model is in fact harmful) is a Type II error. These costs are application-specific.

3.2 The Power Function

Definition (Power Function). The power function of a test is:

\pi(\theta) = P_\theta(\text{reject } H_0) = P_\theta(T \in \mathcal{R}_\alpha)

evaluated at every \theta \in \Omega.

Key properties of a well-designed power function:

  • \pi(\theta) \leq \alpha for \theta \in \Theta_0, with equality on the boundary of \Theta_0 (the test has correct size).
  • \pi(\theta) \to 1 as \theta moves far from \Theta_0 (the test is consistent).
  • \pi(\theta) is large for \theta \in \Theta_1 of practical interest (the test is powerful).

Power at a specific alternative: For the one-sample z-test with H_0: \mu = \mu_0 vs. H_1: \mu = \mu_1 > \mu_0:

\pi(\mu_1) = P_{\mu_1}\left(Z > z_\alpha\right) = P\left(\mathcal{N}(0,1) > z_\alpha - \frac{(\mu_1 - \mu_0)\sqrt{n}}{\sigma}\right) = 1 - \Phi\left(z_\alpha - \frac{(\mu_1 - \mu_0)\sqrt{n}}{\sigma}\right)

This formula reveals exactly how power depends on:

  • Effect size \delta = (\mu_1 - \mu_0)/\sigma: larger effect -> higher power.
  • Sample size n: larger n -> higher power (through \sqrt{n}).
  • Significance level \alpha: larger \alpha -> higher power (but more Type I errors).

Minimum detectable effect (MDE): The smallest standardised effect \delta such that \pi(\mu_0 + \delta\sigma) \geq 1 - \beta_{\text{target}}. Solving for \delta:

\delta_{\min} = \frac{z_\alpha + z_\beta}{\sqrt{n}}

where z_\beta = \Phi^{-1}(1-\beta). At \alpha = 0.05, \beta = 0.20: z_{0.05} + z_{0.20} = 1.645 + 0.842 = 2.487.

3.3 Effect Size

The problem with raw differences: A mean difference of 2 points on an exam is huge if the standard deviation is 1, but negligible if it is 100. Effect sizes standardise the comparison.

Cohen's d (for means):

d = \frac{\mu_1 - \mu_2}{\sigma_{\text{pooled}}} \quad \text{where} \quad \sigma_{\text{pooled}} = \sqrt{\frac{(n_1-1)\sigma_1^2 + (n_2-1)\sigma_2^2}{n_1+n_2-2}}

Benchmarks (Cohen 1988): d = 0.2 small, d = 0.5 medium, d = 0.8 large.

Cohen's h (for proportions):

h = 2\arcsin\!\sqrt{p_1} - 2\arcsin\!\sqrt{p_2}

The arcsine transform stabilises variance. Benchmarks: h = 0.2 small, h = 0.5 medium, h = 0.8 large.

Cramer's V (for contingency tables with \chi^2):

V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}}

where r, c are the numbers of rows and columns. V \in [0, 1], with 0 = no association.

For AI: When comparing two models' accuracies, report Cohen's h for proportions. A 1% absolute accuracy gain with h = 0.03 is small and may not justify the deployment cost; with h = 0.20 it is meaningful. Rigorous ML papers report effect size alongside the p-value.

3.4 Sample Size Calculation

Solving the power equation for n gives the required sample size to detect effect size \delta with power 1-\beta at level \alpha:

n = \left(\frac{z_\alpha + z_\beta}{\delta}\right)^2

(one-sample, one-sided). For two-sided tests replace z_\alpha with z_{\alpha/2}.

Two-sample comparison of means (equal group sizes):

n_{\text{per group}} = \frac{2(z_{\alpha/2} + z_\beta)^2}{\delta^2}

where \delta = (\mu_1 - \mu_2)/\sigma.

Two-proportion z-test (comparing accuracy rates p_1 vs. p_2):

n = \frac{\left(z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_\beta\sqrt{p_1(1-p_1) + p_2(1-p_2)}\right)^2}{(p_1 - p_2)^2}

where \bar{p} = (p_1+p_2)/2.

Worked example: You want to detect a 2% accuracy improvement (from 85% to 87%) with 80% power at \alpha = 0.05.

  • \bar{p} = 0.86, z_{0.025} = 1.96, z_{0.20} = 0.842.
  • n \approx \frac{(1.96\sqrt{2(0.86)(0.14)} + 0.842\sqrt{0.85(0.15)+0.87(0.13)})^2}{(0.02)^2} \approx 4{,}700 per group.

This reveals why benchmark comparisons on small test sets are inconclusive: with only 1,000 examples per group, the same test has power of roughly 25-30%.
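The worked example can be checked numerically (a sketch of the two-proportion formula above, not a production power calculator):

```python
import numpy as np
from scipy import stats

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-proportion z-test."""
    z_a = stats.norm.ppf(1 - alpha / 2)      # 1.96 for alpha = 0.05
    z_b = stats.norm.ppf(power)              # 0.842 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_a * np.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return int(np.ceil(numerator / (p1 - p2) ** 2))

print(n_per_group(0.85, 0.87))   # roughly 4,700 examples per group
```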

3.5 ROC Analogy

The error rate trade-off in hypothesis testing is structurally identical to the ROC curve in binary classification:

Hypothesis Testing               Binary Classification
Significance level \alpha        False positive rate (FPR)
Power 1 - \beta                  True positive rate (TPR) / Recall
Critical value c_\alpha          Classification threshold
Type I error                     False positive
Type II error                    False negative
Rejection region                 Predicted positive region

In both settings, you trace out a curve by varying the threshold (critical value / classification threshold), and the curve represents the complete trade-off between sensitivity and specificity. The AUC of a classifier measures the same thing as the integrated power function of a test: how well the score separates the two classes.

For AI: The ROC analogy makes hypothesis testing intuitive for ML practitioners. Choosing α=0.05\alpha = 0.05 is exactly like choosing a classification threshold to achieve 5% FPR. Power analysis is like calculating recall at that threshold.


4. Classical Parametric Tests

4.1 The Z-Test

Setting: X_1, \ldots, X_n \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2) with \sigma^2 known. Test H_0: \mu = \mu_0.

Test statistic:

Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \overset{H_0}{\sim} \mathcal{N}(0, 1)

Rejection regions and p-values:

  • Two-sided (H_1: \mu \neq \mu_0): reject iff \lvert Z \rvert > z_{\alpha/2}; p = 2(1 - \Phi(\lvert Z \rvert)).
  • Upper-tailed (H_1: \mu > \mu_0): reject iff Z > z_\alpha; p = 1 - \Phi(Z).
  • Lower-tailed (H_1: \mu < \mu_0): reject iff Z < -z_\alpha; p = \Phi(Z).

Two-sample z-test for proportions: Compare p_1 (proportion in group 1) vs. p_2 (group 2).

Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}} \overset{H_0}{\approx} \mathcal{N}(0,1)

where \hat{p} = (n_1\hat{p}_1 + n_2\hat{p}_2)/(n_1+n_2) is the pooled proportion. This is the standard test for A/B experiments comparing click-through rates or model accuracy.

Validity: Requires n_1\hat{p}(1-\hat{p}) \geq 5 and n_2\hat{p}(1-\hat{p}) \geq 5. For rare events or small samples, use Fisher's exact test.

4.2 Student's t-Test

The t-test is the workhorse of applied statistics: it handles the realistic case where σ\sigma is unknown.

One-sample t-test: X_1, \ldots, X_n \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2), \sigma unknown. Test H_0: \mu = \mu_0.

T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \overset{H_0}{\sim} t_{n-1}

where S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 is the sample variance. Reject at level \alpha iff \lvert T \rvert > t_{n-1, \alpha/2}.

Gosset's insight: Why t_{n-1} and not \mathcal{N}(0,1)? Because S is estimated from data, not known. Substituting S for \sigma introduces additional randomness. The t-distribution has heavier tails to account for this extra uncertainty. As n \to \infty, t_{n-1} \to \mathcal{N}(0,1).

Paired t-test: When observations come in natural pairs (before/after measurements, matched subjects), compute differences D_i = X_i - Y_i and apply the one-sample t-test to D_1, \ldots, D_n. This removes between-pair variability and dramatically increases power.

Two-sample Welch t-test: Compare means from two independent groups with possibly unequal variances (\sigma_1^2 \neq \sigma_2^2):

T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}} \overset{H_0}{\approx} t_\nu

where the Welch-Satterthwaite degrees of freedom are:

\nu = \frac{(S_1^2/n_1 + S_2^2/n_2)^2}{(S_1^2/n_1)^2/(n_1-1) + (S_2^2/n_2)^2/(n_2-1)}

Always use the Welch t-test (not the pooled t-test) unless you have strong prior evidence that \sigma_1 = \sigma_2. The pooled t-test's assumption of equal variances is rarely justified and can badly inflate the Type I error rate.

Robustness: The t-test is remarkably robust to non-normality for n \geq 30 by the CLT. For small n with strongly skewed or heavy-tailed distributions, use nonparametric alternatives (Wilcoxon signed-rank or Mann-Whitney).

For AI: Use the paired t-test when comparing two models evaluated on the same test examples (paired observations). Use Welch's t-test when comparing models evaluated on different test sets.
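A minimal comparison of the two designs on synthetic per-example scores (hypothetical data; the point is the power gain from pairing, not the specific numbers):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-example scores for two models on the SAME 200 test items
scores_a = rng.normal(0.70, 0.10, size=200)
scores_b = scores_a + rng.normal(0.02, 0.05, size=200)   # B slightly better

# Paired t-test: same examples, so test the per-example differences
t_paired, p_paired = stats.ttest_rel(scores_b, scores_a)
# Welch t-test: what you would use if the test sets were different
t_welch, p_welch = stats.ttest_ind(scores_b, scores_a, equal_var=False)

print(f"paired: t = {t_paired:.2f}, p = {p_paired:.2e}")
print(f"Welch:  t = {t_welch:.2f}, p = {p_welch:.3f}")
```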

4.3 The Chi-Squared Test

Goodness-of-fit test: Observed counts O_1, \ldots, O_k from n observations, expected counts E_i = np_{i0} under H_0.

\chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} \overset{H_0}{\approx} \chi^2_{k-1}

Valid when E_i \geq 5 for all i. The \chi^2 approximation improves with n.

Test of independence: An r \times c contingency table with counts O_{ij}. Under H_0 (row and column variables independent):

E_{ij} = \frac{R_i C_j}{n}, \quad \chi^2 = \sum_{i=1}^r\sum_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \overset{H_0}{\approx} \chi^2_{(r-1)(c-1)}

Worked example: A model is evaluated on 4 topic categories. Observed errors: [12, 8, 25, 5]. Under H_0 (equal error rates): E_i = 50/4 = 12.5 each. \chi^2 = (12-12.5)^2/12.5 + \ldots = 18.64, df = 3, p \approx 0.0003. Strong evidence the error rate varies by topic.
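The worked example in one call (a sketch using scipy.stats.chisquare):

```python
from scipy import stats

observed = [12, 8, 25, 5]                  # errors per topic category
expected = [sum(observed) / 4] * 4         # equal error rates under H_0
chi2, p = stats.chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, df = 3, p = {p:.4f}")   # chi2 ~ 18.64, p ~ 0.0003
```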

For AI: Chi-squared tests are natural for:

  • Testing whether a language model's errors are uniformly distributed across categories.
  • Testing whether a tokenizer's vocabulary coverage is uniform across languages.
  • Detecting systematic biases in model outputs (contingency table: output category vs. demographic group).

4.4 The F-Test and ANOVA

F-test for two variances: H_0: \sigma_1^2 = \sigma_2^2.

F = \frac{S_1^2}{S_2^2} \overset{H_0}{\sim} F_{n_1-1, n_2-1}

Rarely used directly; it arises naturally inside ANOVA.

One-way ANOVA: Compare means across k groups with n_j observations in group j. Total N = \sum n_j.

Decompose the total variation into between-group and within-group components:

\underbrace{\sum_{j=1}^k \sum_{i=1}^{n_j} (X_{ij} - \bar{X})^2}_{\text{SST}} = \underbrace{\sum_{j=1}^k n_j(\bar{X}_j - \bar{X})^2}_{\text{SSB}} + \underbrace{\sum_{j=1}^k \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2}_{\text{SSW}}

Test statistic:

F = \frac{\text{SSB}/(k-1)}{\text{SSW}/(N-k)} = \frac{\text{MSB}}{\text{MSW}} \overset{H_0}{\sim} F_{k-1, N-k}

Reject H_0: \mu_1 = \cdots = \mu_k when F > F_{k-1, N-k, \alpha}.

Post-hoc tests: A significant ANOVA F-test says "at least one mean differs" but not which ones. Post-hoc comparisons (Tukey HSD, Bonferroni-corrected t-tests) identify the specific differences while controlling FWER.

ANOVA assumptions: Normality within groups, equal variances (homoscedasticity), independence. Use Welch's ANOVA or Kruskal-Wallis test when homoscedasticity fails.
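A short sketch of one-way ANOVA and its rank-based fallback on synthetic data (hypothetical latency numbers, chosen only for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical latency samples (ms) from three serving configurations
g1 = rng.normal(120, 15, 40)
g2 = rng.normal(125, 15, 40)
g3 = rng.normal(135, 15, 40)

f_stat, p_anova = stats.f_oneway(g1, g2, g3)   # classical one-way ANOVA
h_stat, p_kw = stats.kruskal(g1, g2, g3)       # Kruskal-Wallis alternative
print(f"ANOVA:          F = {f_stat:.2f}, p = {p_anova:.4f}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
```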

4.5 Which Test When

TEST SELECTION FLOWCHART
========================================================================

  How many groups?
  +-- One group
  |   +-- Normal / large n -> One-sample t-test or z-test
  |   +-- Non-normal, small n -> Wilcoxon signed-rank
  +-- Two groups
  |   +-- Paired observations?
  |   |   +-- Yes: Normal -> Paired t-test
  |   |   +-- Yes: Non-normal -> Wilcoxon signed-rank
  |   +-- Independent observations?
  |       +-- Normal: Welch two-sample t-test
  |       +-- Non-normal: Mann-Whitney U test
  +-- Three or more groups
      +-- Normal, equal variances -> One-way ANOVA
      +-- Normal, unequal variances -> Welch's ANOVA
      +-- Non-normal -> Kruskal-Wallis test

  Categorical / count data?
  +-- One sample, counts vs. expected -> Chi-squared GoF
  +-- Two categorical variables -> Chi-squared independence
  +-- Small expected counts (< 5) -> Fisher's exact test

========================================================================

5. Likelihood Ratio Tests and UMP Tests

5.1 The Neyman-Pearson Lemma

The Neyman-Pearson lemma answers a fundamental question: among all tests with size at most \alpha, which one maximises power at a specific alternative \theta_1?

Theorem (Neyman-Pearson, 1933). Consider testing H_0: \theta = \theta_0 vs. H_1: \theta = \theta_1 (both simple). The most powerful size-\alpha test rejects H_0 when:

\Lambda(\mathbf{x}) = \frac{p(\mathbf{x} \mid \theta_1)}{p(\mathbf{x} \mid \theta_0)} > k_\alpha

where the constant k_\alpha is chosen so that P_{\theta_0}(\Lambda > k_\alpha) = \alpha.

Proof sketch: Let \phi^* be the likelihood ratio test and \phi any other test with \mathbb{E}_{\theta_0}[\phi] \leq \alpha. We want to show \mathbb{E}_{\theta_1}[\phi^*] \geq \mathbb{E}_{\theta_1}[\phi].

By construction of \phi^*: (\phi^*(\mathbf{x}) - \phi(\mathbf{x}))(p(\mathbf{x}|\theta_1) - k_\alpha p(\mathbf{x}|\theta_0)) \geq 0 for all \mathbf{x} (both factors have the same sign). Integrating over \mathbf{x} and rearranging:

\mathbb{E}_{\theta_1}[\phi^*] - \mathbb{E}_{\theta_1}[\phi] \geq k_\alpha(\mathbb{E}_{\theta_0}[\phi^*] - \mathbb{E}_{\theta_0}[\phi]) \geq 0

since \mathbb{E}_{\theta_0}[\phi^*] = \alpha \geq \mathbb{E}_{\theta_0}[\phi]. \square

Intuition: The likelihood ratio \Lambda(\mathbf{x}) ranks data points by how much more likely they are under H_1 than H_0. Including the most H_1-likely data points in the rejection region maximises power. No other region of the same size can do better.

Example - Gaussian mean: Testing H_0: \mu = 0 vs. H_1: \mu = 1 with known \sigma = 1, n observations.

\Lambda(\mathbf{x}) = \frac{\prod_i \mathcal{N}(x_i; 1, 1)}{\prod_i \mathcal{N}(x_i; 0, 1)} = \exp\!\left(\sum_i x_i - \frac{n}{2}\right)

Rejecting when \Lambda > k is equivalent to rejecting when \bar{X} > c for some threshold c. The NP lemma tells us the one-sided z-test is the most powerful test for this specific H_1.

5.2 Uniformly Most Powerful Tests

The NP lemma gives the most powerful test against a single specific alternative. Can we find a test that is simultaneously most powerful against all alternatives in \Theta_1?

Definition (UMP Test). A size-\alpha test \phi^* is uniformly most powerful (UMP) if for every other size-\alpha test \phi and every \theta \in \Theta_1:

\pi_{\phi^*}(\theta) \geq \pi_\phi(\theta)

Monotone Likelihood Ratio (MLR): A family \{p(\mathbf{x}|\theta)\} has the MLR property in a statistic T(\mathbf{x}) if for \theta_1 > \theta_2, the ratio p(\mathbf{x}|\theta_1)/p(\mathbf{x}|\theta_2) is a non-decreasing function of T(\mathbf{x}).

Theorem (Karlin-Rubin). If the family has MLR in T, then for H_0: \theta \leq \theta_0 vs. H_1: \theta > \theta_0, the test that rejects when T > c_\alpha is UMP.

Exponential families have MLR: The natural exponential family p(\mathbf{x}|\eta) \propto \exp(\eta T(\mathbf{x}) - A(\eta)) has MLR in the sufficient statistic T(\mathbf{x}). This means UMP tests exist for one-sided hypotheses about the natural parameters of the Gaussian (mean), Bernoulli (logit), Poisson (log-rate), Exponential (rate), and Gamma families.

When UMP tests do NOT exist: For two-sided alternatives H_1: \theta \neq \theta_0, UMP tests generally do not exist. The best we can do is a UMP unbiased test (UMPU), which has power \geq \alpha everywhere in \Theta_1.

5.3 The Generalized Likelihood Ratio Test

For composite hypotheses involving multiple parameters, the Neyman-Pearson approach does not directly apply. The GLRT provides a general-purpose solution.

Definition (GLRT). The generalised likelihood ratio is:

\Lambda(\mathbf{x}) = \frac{\sup_{\theta \in \Theta_0} \mathcal{L}(\theta)}{\sup_{\theta \in \Omega} \mathcal{L}(\theta)} = \frac{\mathcal{L}(\hat{\theta}_0)}{\mathcal{L}(\hat{\theta}_{\text{MLE}})}

where \hat{\theta}_0 is the restricted MLE (constrained to \Theta_0) and \hat{\theta}_{\text{MLE}} is the unrestricted MLE.

Note that 0 \leq \Lambda \leq 1. Small \Lambda means the constrained model fits much worse than the unconstrained model - evidence against H_0.

Wilks' Theorem. Under H_0 and regularity conditions, as n \to \infty:

-2\log\Lambda(\mathbf{x}) \overset{d}{\to} \chi^2_k

where k = \dim(\Omega) - \dim(\Theta_0) is the number of constraints imposed by H_0.

Proof idea: Taylor-expand \log \mathcal{L}(\hat{\theta}_0) around \hat{\theta}_{\text{MLE}}. The first-order term vanishes at the MLE, and the second-order term yields a quadratic form in (\hat{\theta}_0 - \hat{\theta}_{\text{MLE}}) scaled by the Fisher information. By asymptotic normality of the MLE (Section02), this quadratic form is \chi^2_k. \square

Example: Testing H_0: \mu = 0, \sigma^2 = 1 in a Gaussian model imposes 2 constraints, so -2\log\Lambda \sim \chi^2_2.

For AI: Wilks' theorem underlies model comparison via likelihood. Any time you compare a restricted architecture (fewer parameters) to a full model using their log-likelihoods, you are implicitly using a GLRT. The \chi^2 approximation provides a principled p-value.
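A concrete GLRT for the example above, testing H_0: \mu = 0, \sigma^2 = 1 against an unrestricted Gaussian (a sketch; the data are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.3, 1.2, size=200)        # true parameters violate H_0

# Unrestricted MLE: sample mean and (biased) sample variance
mu_hat, sigma2_hat = x.mean(), x.var()
loglik_full = stats.norm.logpdf(x, mu_hat, np.sqrt(sigma2_hat)).sum()
# Restricted model under H_0: mu = 0, sigma^2 = 1 (no free parameters)
loglik_null = stats.norm.logpdf(x, 0.0, 1.0).sum()

wilks = -2 * (loglik_null - loglik_full)  # -2 log Lambda, df = 2 constraints
p = stats.chi2.sf(wilks, df=2)
print(f"-2 log Lambda = {wilks:.2f}, p = {p:.3g}")
```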

5.4 Score and Wald Tests

The GLRT requires fitting both the restricted and unrestricted models. Two alternatives - the score test and the Wald test - each require fitting only one model. Together with the GLRT, they form the trinity of asymptotic tests, all asymptotically equivalent under H_0 and local alternatives.

Wald Test: Fit the unrestricted MLE \hat{\theta} and check whether it is far from \Theta_0.

W = (\hat{\theta} - \theta_0)^\top \hat{I}(\hat{\theta})(\hat{\theta} - \theta_0) \overset{H_0}{\to} \chi^2_k

where \hat{I}(\hat{\theta}) is the observed Fisher information at the MLE. For scalar \theta: W = (\hat{\theta} - \theta_0)^2 / \widehat{\operatorname{Var}}(\hat{\theta}), which is the square of a z-score.

Score (Rao) Test: Fit only the restricted MLE \hat{\theta}_0 and check whether the score function (the gradient of the log-likelihood) is far from zero there.

S = \mathbf{s}(\hat{\theta}_0)^\top I(\hat{\theta}_0)^{-1} \mathbf{s}(\hat{\theta}_0) \overset{H_0}{\to} \chi^2_k

where \mathbf{s}(\theta) = \nabla_\theta \log \mathcal{L}(\theta) is the score. Under H_0, the score at \hat{\theta}_0 should be near zero; a large score indicates the null constraint is straining the model.

TRINITY OF ASYMPTOTIC TESTS
========================================================================

  Test       | Fits model under | Statistic              | Geometric intuition
  -----------+------------------+------------------------+--------------------
  Wald       | H_1 (unrestr.)    | Distance from \\thetahat to \\Theta_0  | How far is MLE from H_0?
  Score      | H_0 (restr.)      | Gradient at \\thetahat_0         | Is restricted fit stable?
  LRT (GLRT) | Both             | Ratio of likelihoods    | How much does H_0 cost?

  All three -> \\chi^2_k under H_0, with same asymptotic power under H_1

========================================================================

When they differ: For small n, the three tests can give different p-values. The LRT is generally most accurate; the Wald test can be anti-conservative (over-rejects) for parameters near boundaries. The score test is preferred when fitting the unconstrained model is computationally expensive.

For AI: The Wald test is used to test whether individual neural network weights are significantly different from zero (a form of pruning criterion). The score test is used in online learning to detect if the current gradient is significantly non-zero (an adaptive stopping criterion).


6. Multiple Testing

6.1 The Multiple Testing Problem

Conduct m independent hypothesis tests, each at level \alpha. If all m null hypotheses are true, what is the probability of making at least one false rejection?

P(\text{at least one false positive}) = 1 - (1-\alpha)^m

For m = 20 tests at \alpha = 0.05: 1 - 0.95^{20} \approx 0.64. You expect about one false discovery just by chance, even if nothing is real. This is the multiple testing problem - the fundamental challenge underlying the replication crisis in science and the benchmark arms race in ML.

The error metrics:

Metric                              Definition                                                    Controls
Per-comparison error rate (PCER)    \alpha per test                                               Nothing about joint errors
Family-wise error rate (FWER)       P(\geq 1 false rejection)                                     Strict; few false positives
False discovery rate (FDR)          \mathbb{E}[\text{FP} / \max(\text{total rejections}, 1)]      Balanced; allows some false positives
False discovery proportion (FDP)    Actual FP / total rejections                                  A random variable, not directly controlled

m_0 and m_1: Of the m tests, let m_0 be the number of true nulls and m_1 = m - m_0 the number of true alternatives. Define V = false positives, S = true positives, and R = V + S = total rejections. Then FWER = P(V \geq 1) and FDR = \mathbb{E}[V/R] (with V/R = 0 if R = 0).

6.2 Bonferroni and Holm Corrections

Bonferroni correction: Test each hypothesis at level \alpha/m. By the union bound:

P(V \geq 1) \leq m \cdot \frac{\alpha}{m} = \alpha

This guarantees FWER \leq \alpha regardless of the dependency structure between tests.

Procedure: Compute p-values p_1, \ldots, p_m. Reject H_{0i} if p_i < \alpha/m.

Conservative when tests are positively correlated: If tests share the same data, the union bound is loose. Bonferroni wastes power in such settings.

Holm-Bonferroni step-down procedure (1979):

  1. Order the p-values: p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}.
  2. Find the smallest j such that p_{(j)} > \alpha / (m - j + 1).
  3. Reject H_{0,(1)}, \ldots, H_{0,(j-1)}.

Claim: Holm controls FWER at level \alpha and is uniformly more powerful than Bonferroni - it never rejects fewer hypotheses.

Proof sketch: Let i be the first position in the ordering occupied by a true null. Since at most m - m_0 false nulls can precede it, i \leq m - m_0 + 1, so the threshold used at step i satisfies \alpha/(m-i+1) \leq \alpha/m_0. A false rejection therefore requires the smallest true-null p-value to fall below \alpha/m_0, which happens with probability at most \alpha by the union bound over the m_0 true nulls. \square

Sidak correction: For independent tests, the exact per-test threshold is 1 - (1-\alpha)^{1/m} (slightly larger than \alpha/m, hence slightly more powerful).

6.3 False Discovery Rate

The Benjamini-Hochberg (BH) procedure (1995):

Given p-values p_{(1)} \leq \cdots \leq p_{(m)} ordered from smallest to largest:

  1. Find k = \max\!\left\{i : p_{(i)} \leq \frac{i}{m} \alpha\right\}.
  2. Reject H_{0,(1)}, \ldots, H_{0,(k)}.

If no such k exists, reject nothing.

Theorem (Benjamini-Hochberg). Under independence (or positive dependence, PRDS), BH controls FDR at level \frac{m_0}{m}\alpha \leq \alpha.

Proof idea (Storey, 2002): Write \mathbb{E}[V/R] as a sum over the true nulls of \mathbb{E}[\mathbf{1}\{H_i \text{ rejected}\}/R]. Under independence and the BH threshold, each term is bounded by \alpha/m, giving \mathbb{E}[V/R] \leq m_0\alpha/m \leq \alpha. \square
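The step-up rule is only a few lines of NumPy (a sketch; statsmodels.stats.multitest.multipletests with method="fdr_bh" provides a maintained implementation):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up procedure."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest i with p_(i) <= (i/m) alpha
        reject[order[:k + 1]] = True       # reject every hypothesis up to rank k
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.300, 0.740]
print(benjamini_hochberg(pvals))   # rejects the two smallest p-values here
```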

BH vs. Bonferroni comparison:

Property                    Bonferroni                    BH
Controls                    FWER                          FDR
Stringency                  Very strict                   Moderate
Power at large m            Very low                      Much higher
False positives allowed     None (probabilistically)      Some (controlled on average)
Best for                    Few tests, each critical      Many tests, some FP acceptable

q-values: For each rejected hypothesis, the q-value q_i is the minimum FDR level at which H_i would be rejected - the FDR analogue of a p-value. Introduced by Storey (2002).

For AI: In genomics (the original motivation for BH), researchers test 20,000 gene-expression differences - Bonferroni would require p < 0.0000025. In ML, testing 100 models across 50 benchmarks creates 5,000 comparisons - FDR control via BH is the appropriate framework.

6.4 NLP Benchmark Comparisons

The leaderboard problem (2024-2026): The major LLM leaderboards (MMLU, HellaSwag, HumanEval, GSM8K, LMSYS Arena, LiveBench) face severe multiple testing issues:

  1. Model selection bias: Model developers report best results across many runs, architectures, and prompting strategies. This is implicit p-hacking at the model level.
  2. Benchmark contamination: Test sets get into training data over time. Reported improvements may reflect memorisation rather than generalisation.
  3. Multiple comparisons across benchmarks: A model scoring highest on 3 of 10 benchmarks is not necessarily best - with 100 models and 10 benchmarks, 50 false "wins" are expected by chance at α=0.05\alpha = 0.05.
  4. Non-stationary test sets: Rolling evaluation windows mean the effective sample size is unclear.

Rigorous evaluation practices:

  • Report bootstrap CIs (Section 7.5) on aggregate scores.
  • Apply BH correction when comparing mm models.
  • Use held-out evaluation sets not seen during model selection.
  • Report McNemar's test for paired model comparisons on the same instances.
  • Require pre-registration of evaluation protocols before model training.

Significance thresholds for benchmarks: At m = 100 comparisons, BH at \alpha = 0.05 requires p_{(k)} \leq 0.05k/100. For the top-ranked model to be significantly different from the second, you typically need n \geq 5{,}000 test examples per benchmark.

6.5 Bayesian Alternative Preview

Classical multiple testing corrections (Bonferroni, BH) are explicitly frequentist: they control long-run error rates without asking "what is the probability that H_i is true?" The Bayesian framework offers a fundamentally different approach.

Preview: Bayesian Model Comparison

Given observed data \mathbf{x}, the Bayes factor for H_0 vs. H_1 is:

B_{01} = \frac{P(\mathbf{x} \mid H_0)}{P(\mathbf{x} \mid H_1)} = \frac{\int p(\mathbf{x}|\theta)p(\theta|H_0)d\theta}{\int p(\mathbf{x}|\theta)p(\theta|H_1)d\theta}

The Bayes factor naturally accounts for model complexity (Occam's razor) and provides a direct measure of evidence. In the multiple testing setting, Bayesian methods control the posterior expected FDR by placing a prior on the proportion of true nulls \pi_0.

-> Full treatment: Section04 Bayesian Inference


7. Nonparametric Tests

7.1 Why Nonparametric?

Classical tests (t, F, z) assume the data follow a specific parametric family (usually Gaussian). What if:

  • The data are ordinal (rankings, Likert scales)?
  • The sample size is small (n<15n < 15) and normality is implausible?
  • The data contain extreme outliers that violate distributional assumptions?
  • You want an exact test without large-sample approximations?

Nonparametric tests make no (or minimal) distributional assumptions. The trade-off: they are typically less powerful than their parametric counterparts when the parametric assumptions hold, but more robust when those assumptions fail.

Distribution-free vs. nonparametric: A test is distribution-free if its null distribution is the same regardless of the data distribution. Permutation tests and rank tests are distribution-free. A test is nonparametric in the sense that it estimates a non-finite-dimensional quantity. The terms are often used interchangeably.

7.2 Permutation and Randomization Tests

Motivation: If H_0: F_1 = F_2 (the two groups have the same distribution), then under H_0 the group labels are exchangeable. We can compute the null distribution of any test statistic exactly by enumerating all \binom{n_1+n_2}{n_1} label permutations.

Algorithm (two-sample permutation test):

  1. Compute the observed test statistic T_{\text{obs}} (e.g., the difference in means \bar{X}_1 - \bar{X}_2).
  2. For b = 1, \ldots, B: randomly permute the combined group labels; recompute T^{(b)}.
  3. Estimate the p-value: p = (\#\{b : T^{(b)} \geq T_{\text{obs}}\} + 1) / (B + 1).

Properties:

  • Exact (not asymptotic) when all permutations are enumerated.
  • Valid for any test statistic, no distributional assumptions.
  • Computationally expensive for large samples (use B \approx 10{,}000 Monte Carlo permutations).
  • Any statistic: Unlike t-tests, permutation tests work for medians, trimmed means, Gini coefficients, AUC, or custom ML metrics.

For AI: When comparing two LLMs on a shared benchmark, a permutation test on per-example score differences avoids all distributional assumptions. With n = 500 test examples, a permutation test with B = 10{,}000 resamples is better calibrated than a t-test.
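A direct implementation of the algorithm above (a sketch; scipy.stats.permutation_test offers a library version):

```python
import numpy as np

def perm_test(x, y, n_perm=10_000, seed=0):
    """One-sided two-sample permutation test for mean(x) - mean(y) > 0."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    t_obs = x.mean() - y.mean()
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        t_b = perm[:len(x)].mean() - perm[len(x):].mean()
        exceed += t_b >= t_obs
    return (exceed + 1) / (n_perm + 1)

rng = np.random.default_rng(42)
a = rng.normal(0.72, 0.10, 500)   # hypothetical per-example scores, model A
b = rng.normal(0.70, 0.10, 500)   # model B
print(perm_test(a, b))            # small p-value: the 0.02 gap is detectable
```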

7.3 Rank-Based Tests

Mann-Whitney U test (Wilcoxon rank-sum): Two-sample test. Combine and rank all n_1 + n_2 observations. Let W = the sum of ranks in group 1. Under H_0: \mathbb{E}[W] = n_1(n_1+n_2+1)/2.

U = W - \frac{n_1(n_1+1)}{2}, \quad Z = \frac{U - n_1n_2/2}{\sqrt{n_1n_2(n_1+n_2+1)/12}} \overset{H_0}{\approx} \mathcal{N}(0,1)

AUC connection: The Mann-Whitney U statistic has a beautiful probabilistic interpretation:

\hat{U} = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \mathbf{1}[X_i > Y_j]

This is exactly the empirical AUC - the probability that a random draw from group 1 exceeds a random draw from group 2. A Mann-Whitney test is equivalent to testing whether AUC = 0.5. This unifies hypothesis testing with classifier evaluation.

Wilcoxon signed-rank test: Paired two-sample test. Compute the differences D_i = X_i - Y_i. Rank the \lvert D_i \rvert. Let W^+ = the sum of ranks of the positive differences. Under H_0: \mathbb{E}[W^+] = n(n+1)/4.

Kruskal-Wallis test: Extension of the Mann-Whitney test to k \geq 2 groups. Rank all N observations jointly; the test statistic is based on the between-group variability of ranks. Under H_0 it is approximately \chi^2_{k-1}.

7.4 Kolmogorov-Smirnov Test

One-sample KS test: Test whether a sample comes from a specified distribution F_0.

D_n = \sup_x \lvert F_n(x) - F_0(x) \rvert

where F_n is the empirical CDF. Under H_0, D_n has a known distribution (the Kolmogorov distribution) that does not depend on F_0.

Two-sample KS test: Test whether two samples share the same distribution.

D_{n,m} = \sup_x \lvert F_n(x) - G_m(x) \rvert

Under H_0: \sqrt{\frac{nm}{n+m}} D_{n,m} \overset{d}{\to} K, where K follows the Kolmogorov distribution. Reject for large D_{n,m}.

Properties:

  • Sensitive to differences in location, scale, and shape - not just means.
  • Consistent against all continuous alternative distributions.
  • Less powerful than t-test against pure location shifts (it wastes power on shape/scale).
  • CDF-based and inherently univariate; multivariate extensions exist but lose the distribution-free property and are rarely used in practice.

For AI - Data drift detection: The two-sample KS test is the most widely used drift detector in production ML:

DRIFT DETECTION PIPELINE
========================================================================

  Training data distribution: F_train(x)
  Production batch (daily):   F_prod(x)

  For each feature j:
    Compute D_j = sup_x |F_train(x_j) - F_prod(x_j)|
    Compute p_j = KS test p-value
    Apply BH correction across all features

  Alert if: \\exists j with q_j < 0.05 (BH-adjusted)
  Report: Which features drifted and by how much

========================================================================

Limitations: KS tests features marginally (one at a time). For multivariate drift, use Maximum Mean Discrepancy (MMD) or domain classifier-based tests.
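A sketch of the per-feature monitoring loop from the pipeline above, with BH correction across features (assumes scipy and statsmodels; the feature data are synthetic):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def drift_report(train, prod, alpha=0.05):
    """Per-feature two-sample KS tests with BH correction.
    `train` and `prod` are (n_samples, n_features) arrays."""
    dstats, pvals = [], []
    for j in range(train.shape[1]):
        d, p = stats.ks_2samp(train[:, j], prod[:, j])
        dstats.append(d)
        pvals.append(p)
    reject, qvals, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return [(j, dstats[j], qvals[j]) for j in range(train.shape[1]) if reject[j]]

rng = np.random.default_rng(3)
train = rng.normal(size=(5_000, 10))
prod = rng.normal(size=(1_000, 10))
prod[:, 4] += 0.3                      # inject drift into feature 4
print(drift_report(train, prod))       # should flag feature index 4
```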

7.5 Bootstrap Hypothesis Tests

The bootstrap (Efron 1979) provides a general method for constructing null distributions without parametric assumptions. Reviewed in Section02 for CIs; here we use it for testing.

Bootstrap test for two-sample means (H_0: \mu_1 = \mu_2):

  1. Compute T_{\text{obs}} = \bar{X} - \bar{Y}.
  2. Shift both samples so they share the pooled mean \bar{Z}: \tilde{X}_i = X_i - \bar{X} + \bar{Z} and \tilde{Y}_j = Y_j - \bar{Y} + \bar{Z}. Now H_0 holds exactly in the shifted data.
  3. Draw bootstrap samples from \tilde{X} and \tilde{Y}, and compute T^{(b)} = \bar{X}^{(b)} - \bar{Y}^{(b)}.
  4. p = P(T^{(b)} \geq T_{\text{obs}}).

Bootstrap for complex statistics: The t-test requires normality for exact validity. Bootstrap tests work for any statistic: median differences, correlation coefficients, AUC, BLEU scores, F1 scores - anything you can compute on resampled data.

For AI: Bootstrap CI and tests are standard for NLP evaluation. When comparing BLEU or ROUGE scores, a paired bootstrap test (sampling test-set instances) is the gold standard, as used by Koehn (2004) and standard in MT evaluation.


8. A/B Testing and ML Evaluation

8.1 The A/B Testing Framework

An A/B test is a randomised controlled experiment comparing two (or more) versions of a system. The framework:

  1. Define the primary metric (CTR, revenue per user, model accuracy).
  2. Define guardrail metrics (latency, crash rate, user retention) that must not degrade.
  3. Pre-specify α\alpha, β\beta, and MDE (minimum detectable effect) before data collection.
  4. Randomise units (users, sessions, requests) to treatment and control.
  5. Run until the pre-specified sample size is reached (or sequential stopping criterion is met).
  6. Analyse with the appropriate test and report effect size + CI.

Unit of randomisation: The choice of randomisation unit is critical.

  • User-level: Each user sees only one variant. Avoids within-user interference. Used for UI changes.
  • Session-level: Users can see both variants in different sessions. Higher statistical power but potentially biased.
  • Request-level: Each request is independently assigned. Maximum power; appropriate for stateless ML inference.

The experimental design matters more than the test: Even the perfect hypothesis test cannot salvage a poorly designed experiment. Survivorship bias, Novelty effects, and Simpson's paradox are design problems, not statistical ones.

8.2 Sequential A/B Testing

The peeking problem: If you check p-values daily and stop when p < 0.05, you have not run an \alpha = 0.05 test. You have run a repeated testing procedure with inflated Type I error. At a nominal \alpha = 0.05, peeking 5 times inflates the actual error rate to roughly 0.14; peeking indefinitely drives the Type I error toward 1.

Sequential Probability Ratio Test (SPRT, Wald 1943): A test with no fixed sample size that guarantees both P(\text{false positive}) \leq \alpha and P(\text{false negative}) \leq \beta, while stopping as early as possible.

Given observations x_1, x_2, \ldots, compute the log likelihood ratio:

\ell_n = \log \frac{\prod_{i=1}^n p(x_i \mid H_1)}{\prod_{i=1}^n p(x_i \mid H_0)} = \sum_{i=1}^n \log \frac{p(x_i \mid H_1)}{p(x_i \mid H_0)}

Decision rule: At each step:

  • If \ell_n \geq \log\frac{1-\beta}{\alpha}: reject H_0 (accept H_1).
  • If \ell_n \leq \log\frac{\beta}{1-\alpha}: accept H_0.
  • Otherwise: continue sampling.

Wald's bounds: These thresholds guarantee P(\text{false positive}) \leq \alpha and P(\text{false negative}) \leq \beta, with minimal expected sample size compared to fixed-n tests.
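A minimal SPRT for a Bernoulli metric such as click-through rate (a sketch of the decision rule above; p0, p1 and the simulated stream are illustrative):

```python
import numpy as np

def sprt_bernoulli(stream, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H_0: p = p0 vs H_1: p = p1 on a stream of 0/1 outcomes.
    Returns (decision, number of observations used)."""
    upper = np.log((1 - beta) / alpha)      # reject H_0 at or above this
    lower = np.log(beta / (1 - alpha))      # accept H_0 at or below this
    llr = 0.0
    for n, x in enumerate(stream, start=1):
        llr += x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "accept H0", n
    return "undecided", n

rng = np.random.default_rng(4)
clicks = rng.binomial(1, 0.12, size=100_000)     # true CTR is 12%
print(sprt_bernoulli(clicks, p0=0.10, p1=0.12))  # usually stops long before 100k
```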

Mixture Sequential Ratio Test (mSPRT): An extension by Johari et al. (2022) that uses a mixture distribution over H1H_1, producing "always-valid p-values" - p-values that can be checked at any time without inflating error rates. This is the theoretical foundation for modern continuous A/B testing platforms (Spotify, Netflix, Booking.com).

For AI: The standard "wait N days, then look at p-value" A/B protocol is inefficient. Sequential testing with mSPRT or anytime-valid confidence sequences allows early stopping when effects are clear, reducing the cost of failed experiments by 30-50%.

8.3 Model Comparison Tests

Paired t-test on accuracy: Compare models A and B on the same n test examples. For each example i, record whether model A was correct (a_i \in \{0,1\}) and whether model B was correct (b_i \in \{0,1\}). Compute differences D_i = a_i - b_i and apply a one-sample t-test to D_1, \ldots, D_n.

McNemar's test: More appropriate for binary outcomes. Contingency table of (correct/incorrect) pairs:

|  | B correct | B incorrect |
|---|---|---|
| A correct | n_{11} | n_{10} |
| A incorrect | n_{01} | n_{00} |

The test statistic \chi^2 = (n_{10} - n_{01})^2 / (n_{10} + n_{01}) \sim \chi^2_1 under H_0 (models have equal accuracy). Only the discordant pairs (n_{10}, n_{01}) contribute - concordant pairs carry no information about which model is better.
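
A minimal sketch of the computation, using the discordant counts from the worked example in Appendix U.1:

```python
from scipy.stats import chi2

def mcnemar(n10, n01):
    """McNemar's chi-squared statistic and p-value from the discordant pair counts."""
    stat = (n10 - n01) ** 2 / (n10 + n01)
    return stat, chi2.sf(stat, df=1)      # upper tail of chi^2_1

stat, p = mcnemar(n10=95, n01=68)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")  # about 4.47 and 0.034
```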

Diebold-Mariano test: For comparing two forecasters. Test H_0: \mathbb{E}[d_t] = 0 where d_t = L(e_{1t}) - L(e_{2t}) is the loss differential at time t. Uses a HAC-robust variance estimator to handle serial correlation in d_t.

8.4 Data Drift Detection

Covariate shift: The input distribution P(X) changes between training and deployment, but the conditional P(Y \mid X) remains stable. This is the most common drift type in production ML.

Concept drift: The relationship P(Y \mid X) changes. Harder to detect without labels.

Statistical tests for drift:

| Test | Detects | Suitable for |
|---|---|---|
| KS test (per feature) | Distributional shift | Continuous features, univariate |
| Chi-squared (per feature) | Distributional shift | Categorical features |
| MMD | Multivariate shift | High-dimensional features |
| LSDD | Local shift | Detecting where distributions differ |
| PSI | Magnitude of shift | Production monitoring, tabular data |

Population Stability Index (PSI): A practitioner-favourite drift metric:

\text{PSI} = \sum_{b=1}^B (p_{\text{prod},b} - p_{\text{train},b}) \log\frac{p_{\text{prod},b}}{p_{\text{train},b}}

where b indexes histogram bins. PSI < 0.1: no drift; 0.1-0.25: moderate drift; > 0.25: significant drift requiring retraining. Structurally equivalent to a symmetrised KL divergence.
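
A minimal PSI sketch; the quantile binning, the bin count, and the clipping constant are implementation choices rather than part of the definition above:

```python
import numpy as np

def psi(train, prod, n_bins=10, eps=1e-6):
    """Population Stability Index of a production sample against a training sample."""
    # Interior bin edges from training-set quantiles, so each training bin has ~equal mass
    edges = np.quantile(train, np.linspace(0, 1, n_bins + 1))[1:-1]
    p_train = np.bincount(np.searchsorted(edges, train), minlength=n_bins) / len(train)
    p_prod = np.bincount(np.searchsorted(edges, prod), minlength=n_bins) / len(prod)
    p_train = np.clip(p_train, eps, None)     # avoid log(0) on empty bins
    p_prod = np.clip(p_prod, eps, None)
    return np.sum((p_prod - p_train) * np.log(p_prod / p_train))

rng = np.random.default_rng(2)
print(psi(rng.normal(0, 1, 50_000), rng.normal(0.4, 1, 5_000)))   # lands in the "moderate drift" band
```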

8.5 LLM Evaluation and Leaderboards

Current best practices (2025-2026) for rigorous LLM evaluation:

  1. Bootstrap confidence intervals on aggregate scores: Sample test instances with replacement B = 1{,}000 times, recomputing the benchmark score each time. Report the median score with a 95% CI (a paired-bootstrap sketch follows this list).

  2. McNemar's test for pairwise comparisons: For two LLMs on the same benchmark, use McNemar's test (paired binary outcomes) rather than an unpaired proportion test.

  3. BH-corrected comparisons across benchmarks: When reporting "Model X outperforms Model Y on k benchmarks", apply BH at \alpha = 0.05 and report the q-values.

  4. Effect sizes, not just p-values: Report Cohen's h (for accuracy differences), or normalised score differences, alongside p-values.

  5. Power analysis for benchmark design: A new benchmark should be designed with enough items n to detect a 1% accuracy difference with 80% power. For \alpha = 0.05, \beta = 0.20, this requires n \approx 7{,}200 per model comparison.

  6. Chatbot Arena / ELO ratings: LMSYS Arena uses pairwise preference data to estimate ELO ratings. The uncertainty in ELO estimates should be reported as CIs derived from bootstrap resampling of preference pairs.
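
A minimal paired-bootstrap sketch for comparing two models on the same test instances. The per-example accuracies and the metric (a simple mean) are placeholders; for corpus-level metrics such as BLEU, the metric would be recomputed on each resampled set of instances:

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_boot=1_000, seed=0):
    """Bootstrap over test instances: CI for the mean score difference and the
    fraction of replicates in which model B is at least as good as model A."""
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # same resampled indices for both models
        diffs[b] = scores_a[idx].mean() - scores_b[idx].mean()
    return np.median(diffs), np.percentile(diffs, [2.5, 97.5]), np.mean(diffs <= 0)

rng = np.random.default_rng(3)
a = rng.binomial(1, 0.76, 1_200).astype(float)   # per-example correctness, model A
b = rng.binomial(1, 0.74, 1_200).astype(float)   # per-example correctness, model B
print(paired_bootstrap(a, b))
```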


9. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Interpreting the p-value as P(H_0 \text{ is true}) | The p-value is a frequency, not a posterior probability | p = P(\text{data this extreme} \mid H_0 \text{ true}); use a Bayes factor for posterior claims |
| 2 | Claiming "no effect" from p > 0.05 | Absence of evidence \neq evidence of absence; the test may be underpowered | Report power and the 95% CI; use equivalence testing |
| 3 | Running many tests without correction | FWER inflates to near 1; produces spurious discoveries | Apply Bonferroni (few tests) or BH (many tests) |
| 4 | Peeking at data repeatedly and stopping at p < 0.05 | Actual Type I error rate far exceeds \alpha | Use sequential tests (SPRT, mSPRT) or pre-register a fixed n |
| 5 | Confusing statistical and practical significance | Large n can make trivial effects significant | Always report effect size (Cohen's d/h) alongside the p-value |
| 6 | HARKing: Hypothesising After Results Known | Converts exploratory analysis to confirmatory; p-values invalid | Pre-register hypotheses; treat post-hoc analysis as exploratory |
| 7 | Using the pooled t-test when variances differ | Can inflate Type I error dramatically | Default to Welch's t-test; test variance equality only if motivated |
| 8 | Applying chi-squared with small expected counts | The chi-squared approximation fails; invalid p-values | Use Fisher's exact test when any E_{ij} < 5 |
| 9 | Ignoring paired structure | Discards within-pair correlation; wastes power | Use a paired t-test or Wilcoxon signed-rank for paired data |
| 10 | Not checking normality for small samples | t-test assumptions violated; p-values inaccurate | For n < 30 with skewed data, use a nonparametric or bootstrap test |
| 11 | Reporting only "p < 0.05, significant" | Loses information; invites binary thinking | Report the exact p, effect size, CI, and power |
| 12 | One-tailed test chosen after seeing the data direction | Halves the p-value post hoc; inflates Type I error | Pre-register the test direction or use two-tailed by default |

10. Exercises

Exercise 1 * - One-Sample t-Test from Scratch

A language model's token latency (ms) is measured on 20 requests: mean = 47.3 ms, sample std = 8.1 ms. The SLA requires mean latency \leq 45 ms.

(a) State H_0 and H_1 precisely. Is this one-sided or two-sided? (b) Compute the t-statistic and degrees of freedom. (c) Find the critical value at \alpha = 0.05. (d) Compute the exact p-value. (e) State your conclusion in plain English.

Exercise 2 * - Chi-Squared Goodness-of-Fit

A text classifier should distribute predictions uniformly across 5 categories. On 500 test examples, observed counts are [87, 113, 95, 102, 103].

(a) State H_0 and compute the expected counts. (b) Compute the \chi^2 statistic. (c) Find the p-value (df = 4, \chi^2_{4, 0.05} = 9.49). (d) Is the distribution significantly non-uniform at \alpha = 0.05? (e) Compute Cramer's V and interpret the effect size.

Exercise 3 * - Power and Sample Size

You want to detect that model A has a higher accuracy than model B (p_A = 0.88, p_B = 0.85) with 80% power at \alpha = 0.05.

(a) Compute Cohen's h for this effect. (b) Derive the required sample size per group for a two-proportion z-test. (c) What is the power if you can only collect n = 2{,}000 per group? (d) Plot the power curve as a function of n (from 500 to 5000). (e) What sample size gives 95% power?

Exercise 4 ** - Neyman-Pearson Lemma for Exponential Distribution

Let X_1, \ldots, X_n \overset{iid}{\sim} \text{Exp}(\lambda). Test H_0: \lambda = \lambda_0 vs. H_1: \lambda = \lambda_1 > \lambda_0.

(a) Write the likelihood ratio \Lambda(\mathbf{x}) = \mathcal{L}(\lambda_1)/\mathcal{L}(\lambda_0). (b) Show that rejecting when \Lambda > k is equivalent to rejecting when \bar{X} < c for some c. (c) Find c in terms of \alpha, \lambda_0, and n using the fact that 2\lambda_0 n\bar{X} \sim \chi^2_{2n}. (d) Verify that this test has the correct size \alpha = 0.05 for \lambda_0 = 1, n = 10. (e) Is this test UMP for all \lambda_1 > \lambda_0? Justify using the MLR property.

Exercise 5 ** - Multiple Testing Correction

In an NLP evaluation, 50 hypothesis tests are conducted (comparing a new model to baseline on 50 benchmarks). The raw p-values are generated synthetically.

(a) Simulate 45 true nulls (p-values ~ Uniform[0,1]) and 5 true alternatives (p-values ~ Beta(0.2, 1)). (b) Count discoveries with no correction at \alpha = 0.05. (c) Apply Bonferroni correction and count discoveries. (d) Apply BH correction and count discoveries. (e) Across 1000 simulation replications, estimate the empirical FWER and FDR for each method. Plot the results.

Exercise 6 ** - Permutation Test for Two-Sample Means

Two LLMs (A and B) are evaluated on 30 shared test prompts. Model A scores: drawn from \mathcal{N}(0.72, 0.1^2). Model B scores: drawn from \mathcal{N}(0.68, 0.1^2).

(a) Compute the observed mean difference. (b) Implement a permutation test with B = 10{,}000 permutations. (c) Compute the permutation p-value. (d) Compare to a Welch t-test p-value on the same data. (e) Repeat 500 times and compare the empirical Type I error rates of both tests under H_0.

Exercise 7 *** - Sequential A/B Test with SPRT

Two model variants A and B are tested on streaming requests. H_0: p_A = p_B = 0.80 vs. H_1: p_A = 0.85, p_B = 0.80. Set \alpha = 0.05, \beta = 0.20.

(a) Derive the log-likelihood ratio \ell_n for Bernoulli outcomes. (b) Compute the Wald stopping boundaries A = (1-\beta)/\alpha and B = \beta/(1-\alpha). (c) Simulate the sequential process until stopping or n = 2{,}000. Plot \ell_n vs. n with the boundaries. (d) Compare the expected stopping time under H_0 and H_1. (e) Estimate the empirical Type I error rate over 1000 simulated experiments where H_0 is true. Verify it is \leq \alpha = 0.05.

Exercise 8 *** - KS-Based Data Drift Detector for LLM Features

Build a drift detection system for an LLM serving system. Reference distribution: sentence embedding norms \sim \mathcal{N}(10, 1^2). Production batches vary.

(a) Simulate a reference dataset of 5,000 embeddings and three daily production batches: no drift, moderate drift (\mu = 10.5), severe drift (\mu = 12). (b) Apply the two-sample KS test to each batch vs. reference. (c) Apply BH correction across the 3 batch comparisons. (d) Implement a sliding window detector: alert if the last 3 consecutive days all have KS p < 0.1. (e) Compare KS vs. a t-test drift detector: which is more sensitive to scale changes? Demonstrate with a scenario where the mean is unchanged but \sigma = 1.5.


11. Why This Matters for AI (2026 Perspective)

| Concept | AI / LLM Application | Impact |
|---|---|---|
| p-values and significance | Model comparison on benchmark leaderboards | Prevents claiming spurious improvements; requires n \geq 5{,}000 per comparison |
| Power analysis | Benchmark design; A/B experiment sizing | Determines minimum test set size to detect meaningful improvements |
| Type I / II error trade-off | Deployment gates (safety vs. capability) | Conservative \alpha (0.01) for safety tests; liberal \alpha (0.1) for early exploration |
| Multiple testing correction | Simultaneous evaluation across benchmarks | BH correction required when testing \geq 10 benchmarks |
| Welch t-test | Comparing model variants on different test sets | Default for unpaired, unequal-variance model comparisons |
| McNemar's test | Paired model comparison on shared test examples | Most powerful paired comparison for binary accuracy |
| GLRT / Wilks' theorem | Comparing nested model architectures by NLL | \chi^2_k test on the difference in log-likelihoods; model selection |
| Wald test | Pruning significance of neural network weights | Test if a weight significantly differs from zero before pruning |
| BH FDR correction | Multi-benchmark leaderboards (MMLU, HumanEval, etc.) | Controls false discovery rate across hundreds of simultaneous comparisons |
| Permutation test | LLM evaluation on custom metrics (BLEU, ROUGE, win rate) | Exact calibration without distributional assumptions |
| KS test | Production ML monitoring; data drift detection | Feature-level drift detection; triggers retraining pipelines |
| SPRT / mSPRT | Online A/B testing at scale (Spotify, Netflix, deployment) | Reduces experiment duration by 30-50% vs. fixed-n tests |
| Sequential testing | LLM RLHF reward model evaluation | Valid early stopping during human preference collection |
| PSI (Population Stability Index) | Model monitoring dashboards | Industry-standard drift metric for tabular features |
| Bootstrap hypothesis tests | Evaluation with small test sets | Valid inference without normality; standard in MT evaluation |

12. Conceptual Bridge

Looking Back: Estimation Theory (Section02)

Hypothesis testing builds directly on estimation theory (Section02). The estimators derived there - the sample mean \bar{X}, the MLE \hat{\theta}, the sample variance S^2 - reappear as the building blocks of every test statistic. The confidence interval duality (Section 2.5) makes this connection explicit: a confidence interval is the set of parameter values we would fail to reject, and the test is an inversion of the CI procedure.

The asymptotic normality of MLE (Section02 Section8) is the theoretical engine behind the Wald test and the asymptotic validity of the z-test for large samples. Fisher information (Section02 Section4) enters hypothesis testing through the score test and through the Cramer-Rao bound's role in characterising optimal tests.

Confidence intervals (Section02 Section7) and hypothesis tests are dual constructions: every confidence interval corresponds to a test, and every test corresponds to a confidence interval. Reporting CIs is strictly more informative, because CIs communicate effect size and precision, not just a binary reject/don't-reject decision.

Looking Forward: Bayesian Inference (Section04)

Section Section04 provides the Bayesian counterpart to every major concept in this section:

| Frequentist (Section03) | Bayesian (Section04) |
|---|---|
| p-value | Posterior probability P(H_0 \mid \mathbf{x}) |
| Significance test | Bayes factor B_{01} |
| Confidence interval | Credible interval |
| FWER / FDR control | Prior on the proportion of true nulls \pi_0 |
| Point null H_0: \theta = \theta_0 | Spike-and-slab prior centred at \theta_0 |

The philosophical divide is deep: frequentists refuse to assign probabilities to hypotheses (hypotheses are fixed; data are random). Bayesians treat parameters and hypotheses as random variables with prior distributions. The Bayesian framework provides a natural solution to multiple testing (the prior on \pi_0 automatically corrects for multiplicity), but requires specification of that prior - a potential source of subjectivity.

For AI practitioners, the practical choice is often dictated by computational constraints and domain norms. Frequentist tests are fast and require no prior specification; Bayesian methods provide richer inference at the cost of prior elicitation and posterior computation.

Looking Further Forward: Regression (Section06)

The F-test derived in Section 4.4 reappears in Section06 as the overall F-test for regression significance. The t-test for individual regression coefficients (H_0: \beta_j = 0) is a direct application of the Wald test from Section 5.4. The multiple testing problem reappears when testing many coefficients simultaneously in high-dimensional regression - LASSO regularisation can be seen as an implicit multiple testing correction that shrinks small coefficients to zero.

POSITION IN CURRICULUM
========================================================================

  Section02 ESTIMATION THEORY
    MLE, Fisher info, CIs, asymptotic normality
           |
           v (test statistics are functions of estimators)
  Section03 HYPOTHESIS TESTING  <-- YOU ARE HERE
    p-values, power, t/\\chi^2/F tests, LRT, multiple testing,
    nonparametric tests, A/B testing, sequential tests
           |                         |
           v                         v
  Section04 BAYESIAN INFERENCE    Section06 REGRESSION ANALYSIS
  (Bayes factors, posterior    (F-test, t-tests on
   probability of hypotheses)   regression coefficients)
           |
           v
  Ch8 OPTIMISATION
  (RLHF experiment design,
   model selection, early stopping)

========================================================================

Hypothesis testing is the formal language of scientific comparison. Every claim that "model A is better than model B", every statement that "this feature is significant", every assertion that "the distribution shifted" - all of these are hypothesis tests, whether or not they are recognised as such. Making these tests explicit, pre-specified, and properly corrected is the difference between rigorous science and post-hoc storytelling.


Appendix A: Key Distributions in Hypothesis Testing

| Distribution | PDF / PMF | Key role in testing |
|---|---|---|
| \mathcal{N}(0,1) | (2\pi)^{-1/2}e^{-z^2/2} | Z-test null distribution |
| t_{n-1} | \propto (1+t^2/(n-1))^{-n/2} | t-test null distribution |
| \chi^2_k | \propto x^{k/2-1}e^{-x/2} | Chi-squared, GLRT, Wald, score tests |
| F_{k,m} | \propto x^{k/2-1}(1+kx/m)^{-(k+m)/2} | F-test, ANOVA |
| Kolmogorov | \sum_k (-1)^{k-1}e^{-2k^2t^2} | KS test null distribution |

Relationships:

  • Z^2 \sim \chi^2_1
  • t_n^2 \sim F_{1,n} (the square of a t variable is F with 1 numerator df)
  • \frac{\chi^2_k/k}{\chi^2_m/m} \sim F_{k,m} (ratio of independent chi-squared variables, each divided by its df)
  • -2\log\Lambda \overset{d}{\to} \chi^2_k (Wilks' theorem)

Appendix B: Critical Values Reference

| Test | \alpha = 0.10 | \alpha = 0.05 | \alpha = 0.01 |
|---|---|---|---|
| \mathcal{N}(0,1) (two-sided) | \pm 1.645 | \pm 1.960 | \pm 2.576 |
| t_{30} (two-sided) | \pm 1.697 | \pm 2.042 | \pm 2.750 |
| \chi^2_1 | 2.706 | 3.841 | 6.635 |
| \chi^2_5 | 9.236 | 11.070 | 15.086 |
| \chi^2_{10} | 15.987 | 18.307 | 23.209 |
| F_{1,30} | 2.881 | 4.171 | 7.562 |
| F_{3,30} | 2.276 | 2.922 | 4.510 |

Appendix C: Statistical Testing in Python

```python
from scipy import stats
import numpy as np

# One-sample t-test
t_stat, p_val = stats.ttest_1samp(data, popmean=mu0)

# Welch two-sample t-test
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)

# Paired t-test
t_stat, p_val = stats.ttest_rel(before, after)

# Chi-squared goodness-of-fit
chi2, p_val = stats.chisquare(observed, expected)

# Chi-squared test of independence
chi2, p_val, dof, expected = stats.chi2_contingency(contingency_table)

# One-way ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)

# Mann-Whitney U test
u_stat, p_val = stats.mannwhitneyu(x, y, alternative='two-sided')

# Kolmogorov-Smirnov two-sample test
ks_stat, p_val = stats.ks_2samp(sample1, sample2)

# Wilcoxon signed-rank test
w_stat, p_val = stats.wilcoxon(differences)
```

Appendix D: Glossary

| Term | Definition |
|---|---|
| p-value | P_{H_0}(T \geq t_{\text{obs}}); probability of data this extreme under H_0 |
| Size | \sup_{\theta \in \Theta_0} P_\theta(\text{reject}); actual Type I error rate |
| Level | Upper bound on size; a test of level \alpha has size \leq \alpha |
| Power | P_{H_1}(\text{reject}); probability of correctly detecting the alternative |
| Consistent test | Power \to 1 as n \to \infty for all \theta \in \Theta_1 |
| UMP test | Uniformly most powerful; maximises power at every \theta \in \Theta_1 |
| FWER | Family-wise error rate; probability of at least one false rejection |
| FDR | False discovery rate; expected proportion of false rejections among all rejections |
| SPRT | Sequential probability ratio test; optimal sequential test (Wald 1943) |
| mSPRT | Mixture SPRT; produces always-valid p-values for continuous monitoring |
| MLR | Monotone likelihood ratio; condition guaranteeing existence of UMP tests |

Appendix E: Proof of the Bonferroni Inequality

Lemma (Bonferroni). Let A_1, \ldots, A_m be events. Then:

P\!\left(\bigcup_{i=1}^m A_i\right) \leq \sum_{i=1}^m P(A_i)

Proof: By induction on m using P(A \cup B) = P(A) + P(B) - P(A \cap B) \leq P(A) + P(B); equivalently, truncating the inclusion-exclusion expansion after the first-order terms can only overestimate the union probability:

P\!\left(\bigcup_{i=1}^m A_i\right) = \sum_i P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \cdots \leq \sum_{i=1}^m P(A_i). \quad \square

Application to FWER: Let A_i = \{\text{falsely reject } H_{0i}\}. If each test has size \alpha/m, then P(A_i) \leq \alpha/m and:

\text{FWER} = P\!\left(\bigcup_{i=1}^{m_0} A_i\right) \leq \sum_{i=1}^{m_0} P(A_i) \leq m_0 \cdot \frac{\alpha}{m} \leq m \cdot \frac{\alpha}{m} = \alpha

The Bonferroni correction is conservative because the union bound is tight only when the A_i are mutually exclusive - which is the worst case for the bound.

Simes' inequality (1986): For independent tests under the complete null, the probability that any p_{(i)} \leq i\alpha/m (the BH threshold) is exactly \alpha. This is sharper than Bonferroni and is the basis of the BH procedure's validity proof.


Appendix F: Derivation of t-Distribution

Setup: X_1, \ldots, X_n \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2). Show that T = (\bar{X} - \mu)/(S/\sqrt{n}) \sim t_{n-1}.

Step 1: \bar{X} \sim \mathcal{N}(\mu, \sigma^2/n), so \sqrt{n}(\bar{X}-\mu)/\sigma \sim \mathcal{N}(0,1).

Step 2: (n-1)S^2/\sigma^2 \sim \chi^2_{n-1} (Cochran's theorem).

Step 3: \bar{X} and S^2 are independent (also Cochran; this holds exactly for Gaussian data).

Step 4: By definition of the t-distribution, if Z \sim \mathcal{N}(0,1) and V \sim \chi^2_k are independent, then T = Z/\sqrt{V/k} \sim t_k. Apply with k = n-1:

T = \frac{(\bar{X}-\mu)/(\sigma/\sqrt{n})}{\sqrt{(n-1)S^2/(\sigma^2(n-1))}} = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}. \quad \square

Why t has heavier tails than the normal: The denominator S/\sqrt{n} is random. On lucky samples S is small, making |T| large; on unlucky samples S is large, making |T| small. This extra randomness spreads the distribution's tails. As n \to \infty, S \to \sigma by the LLN, and t_n \to \mathcal{N}(0,1).


Appendix G: Power Analysis - Detailed Derivations

One-sample z-test power derivation:

Under H_1: \mu = \mu_1, the test statistic Z = (\bar{X} - \mu_0)/(\sigma/\sqrt{n}) has distribution:

Z \sim \mathcal{N}\!\left(\frac{(\mu_1-\mu_0)\sqrt{n}}{\sigma}, 1\right) = \mathcal{N}(\delta\sqrt{n}, 1)

where \delta = (\mu_1 - \mu_0)/\sigma is Cohen's d.

Two-sided rejection region: \lvert Z \rvert > z_{\alpha/2}. Power:

\pi(\mu_1) = P_{\mu_1}(Z > z_{\alpha/2}) + P_{\mu_1}(Z < -z_{\alpha/2})

For \mu_1 > \mu_0 (so \delta > 0), the second term is negligible, giving:

\pi(\mu_1) \approx 1 - \Phi(z_{\alpha/2} - \delta\sqrt{n})

Setting \pi(\mu_1) = 1 - \beta (the desired power):

z_{\alpha/2} - \delta\sqrt{n} = -z_\beta \implies n = \left(\frac{z_{\alpha/2} + z_\beta}{\delta}\right)^2
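
The closed-form expressions translate directly into code. A small sketch using the two-sided normal approximation from the derivation above (the example effect size is a placeholder):

```python
from math import ceil, sqrt
from scipy.stats import norm

def power_one_sample(d, n, alpha=0.05):
    """Approximate power: pi(mu_1) ~ 1 - Phi(z_{alpha/2} - d * sqrt(n))."""
    return 1 - norm.cdf(norm.ppf(1 - alpha / 2) - d * sqrt(n))

def required_n(d, alpha=0.05, power=0.80):
    """Sample size from n = ((z_{alpha/2} + z_beta) / d)^2."""
    return ceil(((norm.ppf(1 - alpha / 2) + norm.ppf(power)) / d) ** 2)

n = required_n(0.5)                        # about 32 for a one-sample test at d = 0.5
print(n, round(power_one_sample(0.5, n), 3))
```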

Power table for a two-sample test (\alpha = 0.05, \beta = 0.20):

| Cohen's d | n per group |
|---|---|
| 0.20 (small) | 393 |
| 0.50 (medium) | 64 |
| 0.80 (large) | 26 |
| 1.00 (very large) | 17 |

For accuracy comparisons (proportions, \alpha = 0.05, \beta = 0.20):

| Accuracy gap | Baseline | n per group |
|---|---|---|
| 0.5% | 85% | ~28,000 |
| 1.0% | 85% | ~7,200 |
| 2.0% | 85% | ~1,800 |
| 5.0% | 85% | ~310 |

These numbers explain why ML benchmark evaluations are so often underpowered: a 5% absolute improvement requires only 310 examples per model, but a 1% improvement requires 7,200 - yet many benchmarks have 1,000-3,000 examples total.


Appendix H: Exact Permutation Distribution

For a two-sample test with n_1 = n_2 = n/2 observations, there are \binom{n}{n/2} possible label assignments under H_0. For n = 20: \binom{20}{10} = 184{,}756 permutations - feasible to enumerate exactly. For n = 100: \binom{100}{50} \approx 10^{29} - use Monte Carlo with B = 10{,}000 permutations.

Exactness: The permutation p-value \hat{p} = |\{b : T^{(b)} \geq T_{\text{obs}}\}| / B is an unbiased estimate of the true permutation p-value. Adding 1 to both numerator and denominator (standard practice) ensures \hat{p} > 0 and makes the test slightly conservative.

Validity without normality: The permutation test is exactly valid for any test statistic, any sample size, and any continuous distribution. The only assumption is exchangeability under H_0 - which is guaranteed by randomisation in designed experiments.
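
A minimal Monte Carlo permutation test for a difference in means (the group sizes and effect are placeholders); recent SciPy releases also expose this functionality as scipy.stats.permutation_test:

```python
import numpy as np

def permutation_test_mean(x, y, n_perm=10_000, seed=0):
    """Two-sided permutation test for the difference in means of two independent samples."""
    rng = np.random.default_rng(seed)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)                      # relabel under exchangeability
        diff = perm[:len(x)].mean() - perm[len(x):].mean()
        count += abs(diff) >= abs(observed)
    return observed, (count + 1) / (n_perm + 1)             # +1 keeps the p-value positive

rng = np.random.default_rng(4)
x, y = rng.normal(0.72, 0.1, 30), rng.normal(0.68, 0.1, 30)
print(permutation_test_mean(x, y))
```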


Appendix I: Benjamini-Hochberg Procedure - Step-by-Step

Input: p-values p_1, \ldots, p_m (unordered); target FDR level q.

Algorithm:

  1. Sort: p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}.
  2. For each i from m down to 1: check whether p_{(i)} \leq \frac{i \cdot q}{m}.
  3. Let k = \max\{i : p_{(i)} \leq iq/m\} (or k = 0 if no such i exists).
  4. Reject H_{(1)}, \ldots, H_{(k)}.

Example: m = 10 tests, q = 0.05. Sorted p-values: 0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.396, 0.950.

BH thresholds (iq/m): 0.005, 0.010, 0.015, 0.020, 0.025, 0.030, 0.035, 0.040, 0.045, 0.050.

| i | p_{(i)} | iq/m | p_{(i)} \leq iq/m? |
|---|---|---|---|
| 1 | 0.001 | 0.005 | Yes |
| 2 | 0.008 | 0.010 | Yes |
| 3 | 0.039 | 0.015 | No |
| 4 | 0.041 | 0.020 | No |

Working from the bottom: p_{(4)} = 0.041 > 0.020, p_{(3)} = 0.039 > 0.015, p_{(2)} = 0.008 \leq 0.010, so k = 2. Reject hypotheses 1 and 2.

Bonferroni would require p \leq 0.005 - only hypothesis 1 would be rejected. BH is more powerful.
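
A direct transcription of the algorithm above (statsmodels and recent SciPy versions provide equivalent built-in corrections):

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of hypotheses rejected by BH at FDR level q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= np.arange(1, m + 1) * q / m   # is p_(i) <= i*q/m ?
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])              # largest rank meeting its threshold
        reject[order[:k + 1]] = True                  # reject everything up to that rank
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.396, 0.950]
print(benjamini_hochberg(pvals, q=0.05))              # rejects only the two smallest p-values
```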


Appendix J: Sequential Testing and the Optional Stopping Problem

The optional stopping theorem (Doob): For a martingale \{M_n\} and a stopping time \tau: \mathbb{E}[M_\tau] = \mathbb{E}[M_0] under mild conditions. Under H_0, the likelihood ratio \Lambda_n = \prod_i p(x_i \mid H_1)/p(x_i \mid H_0) is a martingale, so \mathbb{E}[\Lambda_\tau] = 1.

Why peeking inflates error: If you peek at p-values and stop whenever p < \alpha, you are effectively running a random walk and stopping when it first crosses a boundary. Boundary crossings are more frequent than the fixed-n analysis assumes, inflating the Type I error.

Ville's inequality: For a non-negative martingale \{M_n\} with M_0 = 1 and any stopping time \tau:

P\!\left(\sup_{n \leq \tau} M_n \geq 1/\alpha\right) \leq \alpha

This is the key inequality behind always-valid p-values: if you stop when \Lambda_n \geq 1/\alpha (equivalently, when the always-valid p-value is \leq \alpha), the false positive rate is controlled at \alpha regardless of when you stop.

E-values: A recent (2020+) framework replaces p-values with e-values E \geq 0 satisfying \mathbb{E}_{H_0}[E] \leq 1. E-values can be combined multiplicatively across observations and across experiments, and Ville's inequality guarantees P_{H_0}(E \geq 1/\alpha) \leq \alpha at any stopping time. E-values are the natural language for sequential testing and meta-analysis.


Appendix K: Worked Examples - Common Tests

K.1 One-Sample t-Test

Problem: A new LLM fine-tune is tested on 15 reasoning problems. Mean score = 72.3, sample std = 8.7. Baseline score = 68.0. Is the improvement significant at \alpha = 0.05?

Solution:

  • H_0: \mu = 68, H_1: \mu > 68 (one-sided; the improvement was predicted).
  • T = (72.3 - 68)/(8.7/\sqrt{15}) = 4.3/2.247 = 1.913.
  • Critical value: t_{14, 0.05} = 1.761.
  • T = 1.913 > 1.761: reject H_0.
  • p-value = P(t_{14} > 1.913) \approx 0.038.
  • Conclusion: The fine-tune shows a statistically significant improvement (p = 0.038, one-sided t_{14}).
  • Effect size: Cohen's d = (72.3 - 68)/8.7 = 0.49 (medium effect).

K.2 Two-Proportion Z-Test for A/B Test

Problem: A chat interface is tested: the control group (n_1 = 5{,}000) has a 12% click-through rate; the treatment group (n_2 = 5{,}000) has a 13.5% CTR. Is the improvement significant?

Solution:

  • H_0: p_1 = p_2, H_1: p_1 \neq p_2 (two-sided, pre-specified).
  • Pooled \hat{p} = (600 + 675)/10{,}000 = 0.1275.
  • \text{SE} = \sqrt{0.1275(1-0.1275)(1/5000 + 1/5000)} = \sqrt{0.1275 \cdot 0.8725 \cdot 0.0004} = 0.00667.
  • Z = (0.135 - 0.120)/0.00667 = 2.25.
  • p-value = 2(1 - \Phi(2.25)) = 2(0.0122) = 0.024 < 0.05: significant.
  • Effect size: Cohen's h = 2\arcsin\sqrt{0.135} - 2\arcsin\sqrt{0.120} \approx 0.045 (very small).
  • Decision: Statistically significant, but the effect is tiny. Weigh the cost of deployment against the 1.5 percentage point CTR gain.

K.3 Chi-Squared Test of Independence

Problem: Test whether LLM output quality (good/bad) is independent of prompt language (English/French/Spanish/German). Contingency table:

|  | English | French | Spanish | German |
|---|---|---|---|---|
| Good | 420 | 310 | 290 | 180 |
| Bad | 80 | 90 | 110 | 70 |

Solution:

  • H_0: quality is independent of language.
  • Row totals: 1200, 350; column totals: 500, 400, 400, 250. Grand total: 1550.
  • E_{11} = 1200 \times 500/1550 = 387.1; compute all 8 expected counts the same way.
  • \chi^2 = \sum (O-E)^2/E \approx 22.1, df = (2-1)(4-1) = 3.
  • p = P(\chi^2_3 > 22.1) < 0.0001: strong evidence of language dependence.
  • Cramer's V = \sqrt{22.1/(1550 \cdot 1)} \approx 0.12 (small effect).
  • Conclusion: Quality differs significantly across languages; Spanish and German have notably higher error rates.

Appendix L: Further Reading

Core Textbooks

  1. Lehmann & Romano - Testing Statistical Hypotheses (3rd ed., 2005): The definitive theoretical reference. Covers NP lemma, UMP tests, unbiasedness, invariance, and asymptotic theory with full proofs. Essential for anyone wanting the complete frequentist theory.

  2. Casella & Berger - Statistical Inference (2nd ed., 2001): Chapters 8-9 cover hypothesis testing at the graduate textbook level. Excellent balance of theory and computation.

  3. Wasserman - All of Statistics (2004): Compressed, modern treatment with connections to ML. Chapters 10-14 cover testing, p-values, and multiple testing.

  4. Efron & Hastie - Computer Age Statistical Inference (2016): Covers bootstrap, FDR, empirical Bayes, and algorithmic inference. Free PDF from Stanford.

ML-Specific References

  1. Dror et al. - "Deep Dominance: How to Properly Compare Deep Neural Models" (ACL 2019): Comprehensive study of hypothesis tests for NLP model comparison. Advocates for bootstrap and permutation tests over t-tests.

  2. Demsar - "Statistical Comparisons of Classifiers over Multiple Datasets" (JMLR, 2006): Recommends Friedman test + Nemenyi post-hoc for comparing multiple classifiers across multiple datasets.

  3. Johari et al. - "Peeking at A/B Tests" (KDD 2017): The original paper on always-valid p-values and mSPRT for online A/B testing.

  4. Ramdas et al. - "Testing Exchangeability: Fork-Convex Hulls, Supermartingales and e-Processes" (2022): Modern framework for e-values and anytime-valid inference.

  5. Schaeffer et al. - "Are Emergent Abilities of Large Language Models a Mirage?" (NeurIPS 2023): Demonstrates that many claimed LLM phase transitions are statistical artifacts of discontinuous metrics + multiple comparisons.

  6. Koehn - "Statistical Significance Tests for Machine Translation Evaluation" (EMNLP 2004): The canonical reference for bootstrap resampling in MT evaluation. Introduced paired bootstrap testing to NLP.


Appendix M: Advanced Topics in Hypothesis Testing

M.1 Composite Hypotheses and Nuisance Parameters

Many practical testing problems involve nuisance parameters - parameters that appear in the model but are not the focus of the test. In the two-sample t-test, for example, the common variance \sigma^2 (or the two separate variances in Welch's test) is a nuisance parameter when the hypothesis concerns only the difference in means.

Problem: If H_0: \mu_1 = \mu_2 with unknown \sigma_1, \sigma_2 (the Behrens-Fisher problem), no exact test exists. The Welch t-test provides an approximate solution via the Satterthwaite degrees-of-freedom approximation.

Conditional tests: One approach is to condition on sufficient statistics for the nuisance parameters. Fisher's exact test conditions on the row and column marginals of a contingency table - the marginals are ancillary for the association parameter of interest.

Profile likelihood: Replace nuisance parameters by their profile MLEs. The profile likelihood ratio test then has the same \chi^2 asymptotic distribution as the full GLRT.

M.2 Equivalence Testing and Non-Inferiority Tests

Classical hypothesis testing asks: "is there an effect?" But in ML deployment, the question is often reversed: "is the new model at least as good as the old one?" This requires equivalence testing or non-inferiority testing.

TOST (Two One-Sided Tests): To test that \lvert \mu_1 - \mu_2 \rvert < \Delta (practical equivalence):

  1. Test H_{01}: \mu_1 - \mu_2 \geq \Delta at level \alpha (one-sided).
  2. Test H_{02}: \mu_1 - \mu_2 \leq -\Delta at level \alpha (one-sided).
  3. Conclude equivalence if both are rejected.

The equivalence margin \Delta must be pre-specified based on domain knowledge (e.g., "a difference of less than 0.5% accuracy is practically irrelevant").
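
A minimal Welch-style TOST sketch; the equivalence margin and the simulated accuracies are placeholders:

```python
import numpy as np
from scipy import stats

def tost_welch(x, y, margin, alpha=0.05):
    """Two one-sided Welch t-tests for equivalence of means within +/- margin."""
    v1, v2 = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    d, se = x.mean() - y.mean(), np.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (len(x) - 1) + v2 ** 2 / (len(y) - 1))  # Satterthwaite
    p_upper = stats.t.cdf((d - margin) / se, df)   # H01: mu_x - mu_y >= +margin
    p_lower = stats.t.sf((d + margin) / se, df)    # H02: mu_x - mu_y <= -margin
    return max(p_upper, p_lower) < alpha, p_lower, p_upper

rng = np.random.default_rng(5)
new, old = rng.normal(0.801, 0.02, 200), rng.normal(0.800, 0.02, 200)
print(tost_welch(new, old, margin=0.005))          # True means equivalence within the margin
```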

Non-inferiority test: Show that the new model is not worse than the baseline by more than \Delta:

H_0: \mu_{\text{new}} < \mu_{\text{baseline}} - \Delta \quad \text{vs.} \quad H_1: \mu_{\text{new}} \geq \mu_{\text{baseline}} - \Delta

Both frameworks are essential for responsible ML deployment: before retiring a production model, verify the replacement is not inferior beyond an acceptable margin.

M.3 Multiple Testing in Modern Machine Learning

Neural architecture search (NAS): Testing thousands of architectural variants involves extreme multiple comparisons. Without FDR correction, reported improvements are largely artifacts. Proper NAS evaluation requires:

  • Held-out final evaluation (not the search objective).
  • BH correction across all tried architectures.
  • Multiple random seeds per architecture.

Hyperparameter tuning: Grid search over k hyperparameters creates implicit multiple comparisons. Bayesian optimisation with proper uncertainty quantification (Gaussian processes) naturally avoids this by reasoning about the distribution over hyperparameter performance rather than making independent comparisons.

Neural network weight testing: Magnitude pruning implicitly tests whether each weight is significantly different from zero. The formal version is a Wald test W_j = \hat{\theta}_j^2 / \widehat{\text{Var}}(\hat{\theta}_j), where \widehat{\text{Var}} comes from the Fisher information matrix. Applying BH correction gives a principled sparse pruning criterion. This connects to the lottery ticket hypothesis: a subnetwork survives iff its weights are statistically distinguishable from zero.

M.4 Causal Inference and Hypothesis Testing

Standard hypothesis testing establishes association (P(Y \mid X) \neq P(Y)) but not causation (that \text{do}(X = x) changes P(Y)). The connection:

Randomised experiments: When treatments are randomly assigned (RCT), the two-sample t-test or Wilcoxon test on outcomes provides valid causal inference. Randomisation eliminates confounding, so association implies causation.

Observational studies: Without randomisation, a significant test only shows association. Causal inference requires additional assumptions (instrumental variables, regression discontinuity, difference-in-differences) and sensitivity analysis.

RLHF and causal testing: When evaluating whether RLHF improves a model, the "treatment" (RLHF fine-tuning) must be applied to otherwise identical models. Comparing a fine-tuned model to a different base model conflates the RLHF effect with base model differences.


Appendix N: Pitfalls in Benchmark Evaluation - Extended Analysis

N.1 The Evaluation Overfitting Problem

Adaptive data analysis: Every time a benchmark is used to select a model or tune hyperparameters, the benchmark becomes part of the training signal. The final evaluation on the same benchmark is biased upward.

Holdout sets: The standard remedy is a held-out test set that is never used for model selection. In practice, LLM benchmark contamination makes this extremely difficult - web-scraped training data often contains benchmark questions and answers.

Differential privacy approach: Dwork et al. (2015) proved that if researchers are allowed at most k adaptive queries to a holdout set of size n, they can answer up to O(n^{2/3}k^{1/3}) queries with valid statistical guarantees. This puts a hard limit on the number of models that can be compared on a single benchmark before results become meaningless.

N.2 The Multiple Metrics Problem

When a model is evaluated on 50 metrics (BLEU, ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, METEOR, ...) and reported as "best on 30 of 50", this is not a well-defined test result. The correct approach:

  1. Pre-specify the primary metric before evaluation.
  2. Report secondary metrics as exploratory with FDR-corrected p-values.
  3. Use a composite score (average normalised ranking across metrics) as the primary outcome.

N.3 Model Size and Benchmark Artefacts

Many apparent improvements in LLM evaluations are confounded with model size. Larger models score higher on essentially every benchmark not because of the specific training choices being evaluated, but because of the additional parameters. Proper evaluation must control for (or fix) model size.

Scaling law adjustments: When comparing models of different sizes, use scaling-law predictions (Chinchilla) to normalise scores to a common compute budget. A model that achieves score s at C FLOPs is better than one achieving s at 10C FLOPs, even if their raw scores are identical.


Appendix O: Practice Problems

Problem O.1: Show that the chi-squared goodness-of-fit statistic \chi^2 = \sum (O_i - E_i)^2/E_i can be written as \chi^2 = \sum O_i^2/E_i - n. Use this to show that \chi^2 = 0 iff O_i = E_i for all i.

Problem O.2: A coin is flipped 1000 times and 520 heads are observed. Compute the p-value for H_0: p = 0.5 vs. H_1: p \neq 0.5. Is the coin significantly biased at \alpha = 0.05? What is the 95% CI for p? Verify the CI/test duality.

Problem O.3: Two classifiers are evaluated on 200 test examples. Classifier A is correct on 156, B on 148. They agree on 130 correct and 26 incorrect predictions. Set up and perform McNemar's test. Compare to a naive two-proportion z-test on the aggregate counts.

Problem O.4: Prove that for the Mann-Whitney U statistic, \mathbb{E}[U/(n_1 n_2)] = P(X > Y), where X and Y are independent draws from the two populations. Conclude that \hat{U} = U/(n_1 n_2) is an unbiased estimator of P(X > Y), and that under H_0, P(X > Y) = 0.5.

Problem O.5: Consider m = 100 independent tests where the true null proportion is \pi_0 = 0.8. Under BH at q = 0.05, derive the expected number of true and false discoveries as a function of the effect size under H_1. Plot the expected FDP (false discovery proportion) as a function of \pi_0.

Problem O.6: Implement the SPRT for comparing two Bernoulli distributions with p_0 = 0.5 vs. p_1 = 0.6. Run the test 10,000 times under H_0 (true p = 0.5) and 10,000 times under H_1 (true p = 0.6). Report: (a) the empirical Type I error rate, (b) the empirical Type II error rate, (c) the distribution of stopping times under H_0 and H_1.

Problem O.7: You have 5,000 features in a fraud detection model. After retraining on new data, 847 features show distributional shift with raw p-values < 0.05 from KS tests. (a) How many false discoveries do you expect under H_0? (b) Apply BH at q = 0.05 and report the number of discoveries. (c) Which features would be worth investigating for a model update?

Problem O.8: A researcher claims that their new prompt engineering technique improves GPT-4 on MMLU from 86.4% to 87.1% (score on 14,000 questions). Perform a formal hypothesis test and compute the 95% CI. Is this improvement practically significant? What is the effect size (Cohen's h)?


Appendix P: Connection to Information Theory

Hypothesis testing has deep connections to information theory. These connections illuminate why certain tests are optimal and provide a geometric view of the testing problem.

P.1 KL Divergence and Test Power

The Chernoff information between two distributions P and Q is:

C(P, Q) = -\min_{0 \leq s \leq 1} \log \int p(\mathbf{x})^s q(\mathbf{x})^{1-s}\, d\mathbf{x}

For n independent observations, the minimal probability of error (when both error types are allowed to vanish) decays exponentially as e^{-nC(P,Q)}. Chernoff information determines the fundamental limit on how fast hypothesis testing errors vanish with sample size.

Special case: For a simple H_0 against a simple H_1 with the Type I error held fixed, Stein's lemma states:

\beta_n^* \approx e^{-n D_{\mathrm{KL}}(P \| Q)}

where D_{\mathrm{KL}}(P \| Q) is the KL divergence between the null distribution P and the alternative Q. More KL divergence means test errors vanish faster. This is why KL divergence is the natural measure of "distance" between distributions in testing.

P.2 Sufficient Statistics and Data Processing

The Data Processing Inequality states that processing data (applying a function f) cannot increase information:

I(X; Y) \geq I(f(X); Y)

In hypothesis testing: a test statistic T(\mathbf{x}) is a deterministic function of \mathbf{x}. By the DPI, T cannot be more informative about \theta than \mathbf{x} itself. Equality holds when T is a sufficient statistic - when T captures all the information in the data about \theta.

This gives an information-theoretic characterisation of sufficient statistics: T is sufficient for \theta iff I(T; \theta) = I(\mathbf{x}; \theta), i.e., no information about \theta is lost by replacing \mathbf{x} with T(\mathbf{x}). A test based on a sufficient statistic is just as powerful as one based on the full data.

P.3 Minimum Description Length and Hypothesis Selection

MDL (Minimum Description Length): The MDL principle selects the model that provides the shortest description of the data. For hypothesis testing:

  • The null model H_0 has description length L(H_0) + L(\mathbf{x} \mid H_0).
  • The alternative H_1 has description length L(H_1) + L(\mathbf{x} \mid H_1).

Reject H_0 if the alternative provides a shorter total description. This is equivalent to a GLRT when H_0 and H_1 are parametric models of different complexity - the MDL penalty for the more complex model plays the role of the chi-squared df in Wilks' theorem.

Connection to Bayes factors: If the priors on H_0 and H_1 are encoded as prefix codes, the Bayes factor equals 2^{L(H_0) - L(H_1)}, where L includes both model complexity and data description length. MDL, Bayes factors, and GLRT are three facets of the same information-theoretic principle: model complexity must be penalised when comparing models of different complexity.


Appendix Q: Numerical Examples - Power Curves

Q.1 Power Curve for One-Sample t-Test

For a one-sample test (H_0: \mu = 0, \sigma = 1 treated as known, n = 25, \alpha = 0.05):

| True \mu | Cohen's d | Power |
|---|---|---|
| 0.0 | 0.00 | 0.050 (= \alpha) |
| 0.1 | 0.10 | 0.098 |
| 0.2 | 0.20 | 0.212 |
| 0.3 | 0.30 | 0.398 |
| 0.4 | 0.40 | 0.594 |
| 0.5 | 0.50 | 0.761 |
| 0.6 | 0.60 | 0.876 |
| 0.8 | 0.80 | 0.977 |
| 1.0 | 1.00 | 0.998 |

Observation: 80% power requires d \approx 0.57, i.e., a true mean roughly 0.57\sigma away from the null value.

Q.2 Effect of Sample Size on Power

For a two-sample t-test detecting d = 0.5 (a medium effect) at \alpha = 0.05 (two-sided), the approximate power is:

| n per group | Power |
|---|---|
| 20 | 0.34 |
| 40 | 0.60 |
| 64 | 0.80 |
| 100 | 0.94 |
| 150 | 0.99 |
| 200 | > 0.99 |
| 300 | > 0.999 |

The "required n" of 64 from the standard formula delivers approximately 80% power, matching the d = 0.50 row of the Appendix G table; the exact value depends on whether the t-distribution or the normal approximation is used, but for these sample sizes the difference is under one percentage point.

Q.3 Multiple Testing Power Comparison

For m = 50 tests, 10 true alternatives with d = 0.5, and n = 100 per comparison:

| Method | FWER | FDR | Expected discoveries | Expected true discoveries |
|---|---|---|---|---|
| No correction | ~1.00 | ~0.30 | 12 | 8.5 |
| Bonferroni | 0.05 | ~0.01 | 6.5 | 6.4 |
| Holm | 0.05 | ~0.02 | 6.8 | 6.7 |
| BH (q = 0.05) | ~0.35 | 0.05 | 9.3 | 8.9 |

BH makes roughly 2.5 more true discoveries than Bonferroni, at the cost of a higher FWER.


Appendix R: Connection to Decision Theory

R.1 Minimax Hypothesis Testing

Hypothesis testing can be formulated as a decision problem. Let:

  • \ell_{ij} = loss of decision d_i when H_j is true.
  • Standard 0-1 loss: \ell_{00} = \ell_{11} = 0 (correct decision), \ell_{10} = 1 (reject when H_0 is true), \ell_{01} = 1 (fail to reject when H_1 is true).

The Bayes risk of a test \phi with prior (\pi_0, \pi_1) on the hypotheses:

r(\phi, \pi) = \pi_0 \mathbb{E}_{H_0}[\ell_{10}\,\phi(\mathbf{x})] + \pi_1 \mathbb{E}_{H_1}[\ell_{01}\,(1-\phi(\mathbf{x}))] = \pi_0 \alpha + \pi_1 \beta

Minimising the Bayes risk gives a likelihood ratio test (the NP lemma generalised to the Bayesian setting): reject when p(\mathbf{x} \mid H_1)/p(\mathbf{x} \mid H_0) > \pi_0 \ell_{10}/(\pi_1 \ell_{01}).

The minimax test minimises the maximum risk over all priors:

\phi^* = \arg\min_\phi \max_\pi r(\phi, \pi)

For 0-1 loss, the minimax test equalises the two error probabilities under H_0 and H_1 - equivalently, it is the Bayes test under the least favourable prior.

R.2 Asymptotic Relative Efficiency

How much more data does test A need compared to test B to achieve the same power? The asymptotic relative efficiency (ARE) of B relative to A is:

\text{ARE}(B, A) = \lim_{n \to \infty} \frac{n_A}{n_B}

where n_A, n_B are the sample sizes each test needs to achieve the same power \pi(\theta).

Pitman efficiency computes the ARE against local alternatives \theta_n = \theta_0 + c/\sqrt{n}:

\text{ARE}(B, A) = \frac{e_B}{e_A}

where e_T = \left[(\partial/\partial\theta)\,\mathbb{E}_\theta[T]\big|_{\theta_0}\right]^2 / \operatorname{Var}_{H_0}(T) is the efficacy of test statistic T.

Key results:

  • Wilcoxon vs. t-test for Gaussian data: \text{ARE} = 3/\pi \approx 0.955. Wilcoxon loses only about 4.5% efficiency.
  • Wilcoxon vs. t-test for heavy-tailed data: \text{ARE} > 1. Wilcoxon can be substantially more efficient.
  • Minimum ARE of Wilcoxon vs. t-test over all symmetric distributions: 0.864 - Wilcoxon never needs more than about 16% more data.

This remarkable result (Hodges-Lehmann, 1956) justifies using Wilcoxon as a default nonparametric test: even in the t-test's best case you need at most about 16% more data, while for non-normal distributions the efficiency gain can be large.
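
The efficiency comparison is easy to check empirically. A quick sketch (the shift, sample size, and noise distributions are placeholders) estimates the power of the Welch t-test and the Mann-Whitney test under a small location shift with Gaussian and heavy-tailed noise:

```python
import numpy as np
from scipy import stats

def empirical_power(sampler, shift=0.4, n=50, reps=2_000, alpha=0.05, seed=0):
    """Rejection rates of the t-test and the Mann-Whitney test under a location shift."""
    rng = np.random.default_rng(seed)
    t_rej = mw_rej = 0
    for _ in range(reps):
        x, y = sampler(rng, n) + shift, sampler(rng, n)
        t_rej += stats.ttest_ind(x, y, equal_var=False).pvalue < alpha
        mw_rej += stats.mannwhitneyu(x, y, alternative='two-sided').pvalue < alpha
    return t_rej / reps, mw_rej / reps

gaussian = lambda rng, n: rng.normal(0, 1, n)
heavy_tailed = lambda rng, n: rng.standard_t(3, n)
print("Gaussian     (t, Wilcoxon):", empirical_power(gaussian))       # nearly identical power
print("Heavy-tailed (t, Wilcoxon):", empirical_power(heavy_tailed))   # Wilcoxon typically ahead
```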

R.3 Sensitivity Analysis for Robust Testing

In observational studies, test validity depends on unverifiable assumptions. Sensitivity analysis asks: how strong would an unmeasured confounder need to be to explain the observed effect?

Rosenbaum's sensitivity parameter \Gamma: For a matched-pairs study, \Gamma is the odds ratio of treatment assignment that an unmeasured binary confounder could induce. A result is "significant at sensitivity level \Gamma" if it remains significant even when allowing for a confounder with odds ratio \Gamma.

Report sensitivity: "Our finding remains significant at the \Gamma = 2 level, meaning an unmeasured confounder would need to double the odds of treatment to explain away the effect."

For AI experiments, sensitivity analysis is essential when comparing models across different data pipelines, hardware, or evaluation setups - all of which are potential confounders.


Appendix S: Statistical Testing Checklist for ML Practitioners

Before reporting any hypothesis test in a paper or technical document, verify the following:

S.1 Pre-Analysis Checklist

  • Hypothesis pre-specified: H_0, H_1, and the primary metric were stated before data collection or model training.
  • Test choice pre-specified: The specific test (t-test, McNemar, permutation, etc.) was chosen based on the study design, not on which test gives the lowest p-value.
  • Sample size justified: A power analysis was performed and the required n was collected (or the power at the actual n is reported).
  • Significance level stated: \alpha was pre-specified (typically 0.05; consider 0.01 for high-stakes claims).
  • Multiple comparisons planned: If multiple tests are planned, the correction method (Bonferroni/BH) was pre-specified.

S.2 Analysis Checklist

  • Assumptions verified:
    • Normality (for t-tests): checked via Shapiro-Wilk or a Q-Q plot for n < 50.
    • Homoscedasticity (for pooled t-test/ANOVA): checked via Levene's test; use Welch if violated.
    • Independence: observations are not clustered, repeated, or time-dependent.
  • Correct test applied: Paired data -> paired test. Small expected counts -> Fisher's exact. Non-normal small nn -> nonparametric.
  • Effect size computed: Cohen's d/h/f or Cramer's V reported alongside p-value.
  • Confidence interval reported: 95% CI for the effect size (not just the p-value).

S.3 Reporting Checklist

  • Exact p-value reported: Not just "p < 0.05" but the exact value (e.g., p = 0.032).
  • Test statistic and df reported: "t(38) = 2.14, p = 0.039" is complete; "p < 0.05" is not.
  • Effect size and CI reported: "d = 0.68 (95% CI: [0.22, 1.14])".
  • Sample size reported: n per group for two-sample tests.
  • Multiple testing correction applied: Which method and at what level.
  • No HARK: Post-hoc analyses are clearly labelled as exploratory.

S.4 Interpretation Checklist

  • Statistical vs. practical significance distinguished: A significant result with d = 0.05 may not justify the deployment cost.
  • Null result properly qualified: "We failed to find evidence of X", not "We showed X does not exist". Power and the MDE are reported.
  • Replication recommended: A single significant result (especially p \approx 0.04) should be confirmed in an independent replication.

Appendix T: Quick Reference - Test Statistics and Null Distributions

One-Sample Tests

| Test | Statistic | Null Distribution | When to Use |
|---|---|---|---|
| Z-test | Z = (\bar{X}-\mu_0)/(\sigma/\sqrt{n}) | \mathcal{N}(0,1) | Normal, \sigma known |
| One-sample t | T = (\bar{X}-\mu_0)/(S/\sqrt{n}) | t_{n-1} | Normal, \sigma unknown |
| Sign test | B = \#\{X_i > \mu_0\} | \text{Binomial}(n, 0.5) | Any continuous, robust |
| Wilcoxon signed-rank | W^+ = \sum_{D_i>0} R_i | Wilcoxon distribution | Symmetric, non-normal |
| Chi-squared GoF | \sum (O_i-E_i)^2/E_i | \chi^2_{k-1} | Count data |

Two-Sample Tests (Independent)

| Test | Statistic | Null Distribution | When to Use |
|---|---|---|---|
| Welch t | See Section 4.2 | t_\nu (Satterthwaite) | Normal, unequal variances |
| Pooled t | T = (\bar{X}_1-\bar{X}_2)/(S_p\sqrt{1/n_1+1/n_2}) | t_{n_1+n_2-2} | Normal, equal variances |
| Z-test (proportions) | See Section 4.1 | \mathcal{N}(0,1) | Large n, proportions |
| Mann-Whitney | U statistic | Wilcoxon / normal approx | Non-normal |
| KS test | D_{n,m} | Kolmogorov dist | Any continuous |
| Permutation | Any statistic | Empirical (permuted) | Any statistic, any distribution |

Two-Sample Tests (Paired)

| Test | When to Use |
|---|---|
| Paired t-test | Normal differences |
| Wilcoxon signed-rank | Non-normal differences |
| Sign test | Ordinal or non-symmetric differences |
| McNemar | Binary outcomes (accuracy) |
| Permutation | Any paired statistic |

k-Sample Tests

| Test | Null distribution | When to Use |
|---|---|---|
| One-way ANOVA | F_{k-1, N-k} | Normal, equal variances |
| Welch ANOVA | F (adjusted df) | Normal, unequal variances |
| Kruskal-Wallis | \chi^2_{k-1} | Non-normal |
| Friedman | \chi^2_{k-1} | Repeated measures |

Appendix U: Extended Worked Examples - Machine Learning Scenarios

U.1 McNemar's Test for LLM Comparison

Setting: Two LLMs (Gemini Pro and GPT-4o) are evaluated on 1,200 coding problems. For each problem, each model either passes or fails the test suite.

Data:

  • Both pass: n_{11} = 820
  • GPT-4o passes, Gemini fails: n_{10} = 95
  • Gemini passes, GPT-4o fails: n_{01} = 68
  • Both fail: n_{00} = 217

McNemar's test:

\chi^2 = \frac{(n_{10} - n_{01})^2}{n_{10} + n_{01}} = \frac{(95 - 68)^2}{95 + 68} = \frac{729}{163} = 4.47

df = 1, \chi^2_{1, 0.05} = 3.84. Since 4.47 > 3.84: reject H_0 at \alpha = 0.05.

GPT-4o accuracy: (820 + 95)/1200 = 76.25\%. Gemini accuracy: (820 + 68)/1200 = 74.00\%. The 2.25 percentage point gap is statistically significant (p = 0.034).

If we had naively used a two-proportion z-test:

Z = \frac{0.7625 - 0.7400}{\sqrt{\bar{p}(1-\bar{p})(1/1200+1/1200)}} = \frac{0.0225}{0.01766} = 1.27

p = 0.20: not significant! The z-test ignores the correlation between paired responses. McNemar's test correctly uses only the 163 discordant pairs, which concentrate all the information about the performance difference.

U.2 Bootstrap Confidence Interval for BLEU Score Comparison

Setting: Two MT systems are evaluated on 500 test sentences. System A achieves BLEU = 28.3, System B achieves BLEU = 26.8. Is the 1.5 BLEU point difference significant?

Algorithm (paired bootstrap test):

  1. For b = 1, \ldots, 10{,}000:
    • Sample 500 sentence pairs with replacement (the same indices for both systems).
    • Compute \text{BLEU}(A^{(b)}) - \text{BLEU}(B^{(b)}) on the resampled set.
  2. Estimate p = P(\text{BLEU}(A^{(b)}) - \text{BLEU}(B^{(b)}) \leq 0) - the fraction of bootstrap replicates in which B is at least as good.

This is the Koehn (2004) paired bootstrap test, the standard for MT evaluation.

Why not t-test? BLEU is a corpus-level metric (not an average of per-sentence scores), so the CLT does not directly apply. Bootstrap resampling over sentences respects the actual data-generating process.

U.3 SPRT for Online Evaluation

Setting: A chat assistant is being A/B tested. Primary metric: thumbs-up rate. Control: p_0 = 0.72. Treatment hypothesis: p_1 = 0.75. Target: \alpha = 0.05, \beta = 0.10.

Wald boundaries:

  • Upper boundary: A = \log((1-\beta)/\alpha) = \log(0.90/0.05) = \log(18) = 2.890.
  • Lower boundary: B = \log(\beta/(1-\alpha)) = \log(0.10/0.95) = \log(0.105) = -2.254.

Log-likelihood ratio increment per observation:

For a thumbs-up (x = 1): \delta_1 = \log(p_1/p_0) = \log(0.75/0.72) = 0.040. For a thumbs-down (x = 0): \delta_0 = \log((1-p_1)/(1-p_0)) = \log(0.25/0.28) = -0.113.

Expected stopping times:

  • Under H_1 (p = 0.75): the LLR hits the upper boundary with probability 1-\beta = 0.90 and the lower one with probability \beta = 0.10, so \mathbb{E}[\tau] \approx \big((1-\beta)A + \beta B\big)/\mathbb{E}_1[\delta] \approx (0.90 \cdot 2.890 - 0.10 \cdot 2.254)/(0.75\,\delta_1 + 0.25\,\delta_0) \approx 2.38/0.002 \approx 1{,}200 observations.
  • A fixed-n test with the same \alpha, \beta requires approximately 2{,}100 observations.

SPRT requires ~43% fewer observations in this scenario by stopping early when evidence accumulates quickly.

U.4 KS-Based Feature Drift Alert

Setting: An NLP model processes document embeddings. Reference distribution of document lengths (tokens): \mathcal{N}(256, 80^2), fitted on 50,000 training documents. Daily monitoring with 1,000 production documents.

Drift events:

  • Week 1 (no drift): \bar{X}_{\text{prod}} = 258, S_{\text{prod}} = 82. KS statistic D = 0.021, p = 0.73. No alert.
  • Week 2 (mean shift): \bar{X}_{\text{prod}} = 310, S_{\text{prod}} = 85. D = 0.118, p = 0.00003. Alert: mean drift.
  • Week 3 (variance shift only): \bar{X}_{\text{prod}} = 257, S_{\text{prod}} = 150. D \approx 0.15, p < 0.0001. t-test: p = 0.84 (no mean shift). KS detects the variance drift that the t-test misses.

This demonstrates the key advantage of KS over t-test for drift detection: KS is sensitive to any distributional change (mean, variance, shape), while t-test only detects mean shifts.
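
A minimal per-feature drift check with the two-sample KS test; the reference/production samples and the alert threshold are placeholders:

```python
import numpy as np
from scipy import stats

def ks_drift_alert(reference, batch, alpha=0.01):
    """Two-sample KS test of a production batch against the reference sample."""
    res = stats.ks_2samp(reference, batch)
    return res.statistic, res.pvalue, res.pvalue < alpha   # (D, p, alert?)

rng = np.random.default_rng(6)
reference = rng.normal(256, 80, 50_000)        # training-time document lengths
no_drift = rng.normal(256, 80, 1_000)
variance_drift = rng.normal(256, 150, 1_000)   # mean unchanged, spread increased
print(ks_drift_alert(reference, no_drift))
print(ks_drift_alert(reference, variance_drift))   # KS flags the variance change
```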


Appendix V: Historical Notes

V.1 The Lady Tasting Tea

Fisher's canonical example (1935): A lady claims she can tell whether tea or milk was poured first. Fisher designs an experiment: 8 cups, 4 with tea first, 4 with milk first, presented in random order. The lady must identify the 4 tea-first cups.

Under H_0 (random guessing): P(\text{all 4 correct}) = 1/\binom{8}{4} = 1/70 \approx 0.014.

This tiny experiment - 8 cups, one run - is sufficient to achieve p \approx 0.014 if the lady classifies every cup correctly. Fisher's point: careful experimental design can yield strong statistical conclusions from minimal data.

For AI: The same logic applies to benchmark construction. A cleverly designed benchmark where random performance is exactly 25% (4-choice multiple choice) and human performance is 90% has high discriminating power. MMLU was designed with this principle in mind.

V.2 Gosset and the Brewery

William Sealy Gosset derived the t-distribution in 1908 while working as a statistician for Guinness Brewery. Guinness ran small-batch experiments (barley yields, hop compositions) where n was typically 3-10. The existing large-sample theory (which assumes normality and a known \sigma) was useless. Gosset published under the pseudonym "Student" because Guinness forbade employees from publishing (for fear of revealing industrial methods).

The t-test is thus directly connected to the practical problem of drawing conclusions from small samples - exactly the problem faced by ML researchers evaluating expensive models on small benchmark sets.

V.3 Neyman-Pearson and the Cigarette Industry

Jerzy Neyman and Egon Pearson developed their framework in the 1930s, partly motivated by quality control in manufacturing (testing whether a batch of products meets specifications). The framework is explicitly about decisions, not inference: you must ship or reject a batch based on a sample inspection. This decision-theoretic framing became the dominant paradigm in industrial statistics.

The cigarette industry later (1950s-70s) exploited the p-value/significance framework to manufacture doubt about cancer studies - repeatedly pointing out that individual studies did not achieve p < 0.05 while ignoring the overwhelming weight of evidence across hundreds of studies. This historical episode motivates modern emphasis on effect sizes, meta-analysis, and replication over single-study p-values.


Appendix W: Common Distributions - Moments and Quantiles

W.1 Standard Normal

Z \sim \mathcal{N}(0, 1): \quad f(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2}

Key quantiles: z_{0.10} = 1.282, z_{0.05} = 1.645, z_{0.025} = 1.960, z_{0.01} = 2.326, z_{0.005} = 2.576.

W.2 Student's t-Distribution

T \sim t_\nu: \quad f(t) = \frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\left(1+\frac{t^2}{\nu}\right)^{-(\nu+1)/2}

\mathbb{E}[T] = 0 for \nu > 1, \operatorname{Var}(T) = \nu/(\nu-2) for \nu > 2. Approaches \mathcal{N}(0,1) as \nu \to \infty.

  df       t_{\nu, 0.025}   t_{\nu, 0.005}
  5        2.571            4.032
  10       2.228            3.169
  20       2.086            2.845
  30       2.042            2.750
  60       2.000            2.660
  \infty   1.960            2.576

W.3 Chi-Squared Distribution

V \sim \chi^2_k: \quad f(v) = \frac{v^{k/2-1}e^{-v/2}}{2^{k/2}\Gamma(k/2)}, \quad v > 0

\mathbb{E}[V] = k, \operatorname{Var}(V) = 2k. Distribution of the sum of k independent squared standard normals.

W.4 F-Distribution

F \sim F_{k_1, k_2}: \quad F = \frac{\chi^2_{k_1}/k_1}{\chi^2_{k_2}/k_2}, \quad \text{with the two chi-squared variables independent}

Used in ANOVA and in comparing nested models. If T \sim t_\nu, then T^2 \sim F_{1, \nu}.
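These tabulated quantiles can be reproduced with scipy's inverse-CDF (ppf) functions; a quick sketch, with rounded values shown in the comments:

from scipy import stats

print(stats.norm.ppf(0.975))             # 1.960   = z_{0.025}
print(stats.t.ppf(0.975, df=10))         # 2.228   = t_{10, 0.025}
print(stats.chi2.ppf(0.95, df=5))        # 11.070  (95th percentile of chi-squared with 5 df)
print(stats.f.ppf(0.95, dfn=1, dfd=10))  # 4.965
print(stats.t.ppf(0.975, df=10) ** 2)    # 4.965   -- checks the F_{1,nu} = t_nu^2 identity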


Appendix X: Summary of Key Theorems

Theorem 1 (Neyman-Pearson Lemma). For simple H_0 vs. simple H_1, the most powerful size-\alpha test rejects when \mathcal{L}(\theta_1)/\mathcal{L}(\theta_0) > k_\alpha.

Theorem 2 (Wilks' Theorem). Under regularity conditions, -2\log\Lambda \overset{d}{\to} \chi^2_k as n \to \infty, where k is the number of equality constraints in H_0.

Theorem 3 (Benjamini-Hochberg). The BH procedure at level q controls \text{FDR} \leq q \cdot m_0/m \leq q under independence (and under PRDS).

Theorem 4 (Kolmogorov-Smirnov). For continuous F_0, \sqrt{n}\,D_n \overset{d}{\to} \sup_t \lvert B(F_0(t)) \rvert, where B is a Brownian bridge.

Theorem 5 (Wald, SPRT). With log-boundaries A = \log((1-\beta)/\alpha) and B = \log(\beta/(1-\alpha)), the SPRT's actual error rates \alpha', \beta' satisfy \alpha' \leq \alpha/(1-\beta), \beta' \leq \beta/(1-\alpha), and \alpha' + \beta' \leq \alpha + \beta (Wald's approximations, which ignore boundary overshoot). Among all tests with error probabilities at most \alpha and \beta, the SPRT minimises the expected sample size under both hypotheses (Wald-Wolfowitz).

Theorem 6 (Hodges-Lehmann). The asymptotic relative efficiency of the Wilcoxon signed-rank test relative to the t-test satisfies \text{ARE} \geq 108/125 = 0.864 over all continuous symmetric distributions; at the normal it equals 3/\pi \approx 0.955, and for heavy-tailed distributions it exceeds 1. The Wilcoxon test is never less than 86.4% as efficient as the t-test.

Theorem 7 (Karlin-Rubin). For families with monotone likelihood ratio in a statistic T - in particular, one-parameter exponential families in their natural parameter - the one-sided test that rejects when T(\mathbf{x}) > c_\alpha is UMP at level \alpha.

Theorem 8 (Equivalence of the Trinity Tests). Under H_0 and contiguous alternatives, the Wald, score, and likelihood ratio tests are asymptotically equivalent: they have the same asymptotic size \alpha and the same asymptotic power function against local alternatives.
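As a sanity check, Wilks' theorem (Theorem 2) can be verified by simulation; a minimal sketch for a Bernoulli mean test, where the choices of p_0, n, and the number of simulated datasets are arbitrary:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p0, n, n_sims = 0.3, 500, 20_000

k = rng.binomial(n, p0, size=n_sims)              # successes in each simulated dataset
p_hat = np.clip(k / n, 1e-12, 1 - 1e-12)          # MLE, clipped to avoid log(0)

def loglik(p, k, n):
    return k * np.log(p) + (n - k) * np.log(1 - p)

lam = 2 * (loglik(p_hat, k, n) - loglik(p0, k, n))   # -2 log Lambda, one value per dataset

# Under H0 this is approximately chi-squared with 1 df, so the rejection
# rate at the chi-squared 5% critical value should be close to 0.05.
print(np.mean(lam > stats.chi2.ppf(0.95, df=1)))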




Appendix Y: Statistical Software and Implementation Notes

Y.1 SciPy Reference for Common Tests

from scipy import stats
import numpy as np

# -- One-sample tests --------------------------------------------------
# Z-test (manually, since scipy has no z-test function)
z = (xbar - mu0) / (sigma / np.sqrt(n))
p_two = 2 * (1 - stats.norm.cdf(abs(z)))

# One-sample t-test
t, p = stats.ttest_1samp(x, popmean=mu0)

# Wilcoxon signed-rank test
w, p = stats.wilcoxon(x - mu0)

# -- Two-sample tests --------------------------------------------------
# Welch's t-test (ALWAYS use equal_var=False unless you have strong reason)
t, p = stats.ttest_ind(x, y, equal_var=False)

# Paired t-test
t, p = stats.ttest_rel(x, y)

# Mann-Whitney U
u, p = stats.mannwhitneyu(x, y, alternative='two-sided')

# Two-sample KS test
d, p = stats.ks_2samp(x, y)

# Permutation test (scipy >= 1.8)
result = stats.permutation_test((x, y),
    statistic=lambda a, b: a.mean() - b.mean(),
    n_resamples=10_000, alternative='two-sided')
p = result.pvalue

# -- Multi-sample tests ------------------------------------------------
# One-way ANOVA
f, p = stats.f_oneway(group1, group2, group3)

# Kruskal-Wallis
h, p = stats.kruskal(group1, group2, group3)

# -- Categorical tests -------------------------------------------------
# Chi-squared goodness-of-fit
chi2, p = stats.chisquare(observed, f_exp=expected)

# Chi-squared test of independence
chi2, p, dof, expected = stats.chi2_contingency(table)

# McNemar's test (statsmodels)
from statsmodels.stats.contingency_tables import mcnemar
result = mcnemar([[n11, n10], [n01, n00]])
p = result.pvalue

Y.2 Multiple Testing Correction

from statsmodels.stats.multitest import multipletests

# Bonferroni, Holm, BH corrections
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method='holm')
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')

# method options: 'bonferroni', 'holm', 'fdr_bh', 'fdr_by', 'sidak',
#                 'holm-sidak', 'simes-hochberg', 'hommel'

Y.3 Power Analysis

from statsmodels.stats.power import (
    TTestIndPower, TTestPower, NormalIndPower   # TTestPower covers the one-sample/paired case
)

# Required sample size for two-sample t-test
analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                          ratio=1.0, alternative='two-sided')

# Power at fixed n
power = analysis.power(effect_size=0.5, nobs1=64, alpha=0.05,
                        ratio=1.0, alternative='two-sided')

# For proportions
from statsmodels.stats.proportion import proportion_effectsize, zt_ind_solve_power
h = proportion_effectsize(0.87, 0.85)  # Cohen's h
n = zt_ind_solve_power(effect_size=h, alpha=0.05, power=0.80)

Y.4 Numerical Tips

  • Seed the random number generator before generating synthetic data (e.g. rng = np.random.default_rng(42), or np.random.seed(42) for legacy code) so results are reproducible.
  • For exact p-values from t-distribution: p = 2 * stats.t.sf(abs(t_stat), df=df) (two-sided).
  • For chi-squared p-value: p = stats.chi2.sf(chi2_stat, df=k-1).
  • The KS test is sensitive to sample size - even tiny real differences become "significant" at large n. Always report the KS statistic D alongside the p-value.
  • For permutation and bootstrap tests, use at least B = 9{,}999 resamples, so that the smallest achievable p-value is 1/(B+1) = 10^{-4} (see the helper below).
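The add-one convention from the last tip, as a small helper (the function name is illustrative):

import numpy as np

def permutation_pvalue(observed, resampled):
    """One-sided (upper-tail) permutation/bootstrap p-value with the add-one correction."""
    resampled = np.asarray(resampled)
    return (1 + np.sum(resampled >= observed)) / (1 + resampled.size)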