<- Back to Chapter 7: Statistics | Next: Bayesian Inference ->
"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of."
- Sir Ronald A. Fisher, Presidential Address to the First Indian Statistical Congress (1938)
Overview
Hypothesis testing is the art of making principled decisions from data under uncertainty. Where estimation theory (Section02) asks "what is the value of this parameter?", hypothesis testing asks a sharper question: "is this parameter consistent with a specific claim?" This inversion - from continuous estimation to binary decision - is the formal machinery behind every A/B experiment that ships a product feature, every clinical trial that approves a drug, and every benchmark comparison that claims one model outperforms another.
The discipline has two intertwined origins. Ronald Fisher developed the p-value framework in the 1920s: compute how surprising the data would be if the null hypothesis were true, and report that probability as evidence. Jerzy Neyman and Egon Pearson developed the decision-theoretic framework in 1933: pre-commit to a decision rule with controlled error rates before seeing the data. Modern practice blends both - using p-values as a continuous measure of evidence while respecting the Neyman-Pearson discipline of pre-specified significance levels, power analysis, and sample size planning.
For AI and ML, hypothesis testing has never been more important. Every benchmark leaderboard is an implicit multiple-comparison experiment susceptible to false-discovery inflation. Every online A/B test deployed at scale faces the sequential testing problem. Every data pipeline needs distributional drift detection. This section builds the complete framework: from the formal definition of a test statistic through the Neyman-Pearson lemma, classical t/\chi^2/F tests, likelihood ratio tests, multiple testing correction, nonparametric methods, and the sequential A/B testing infrastructure that powers modern ML deployment.
Prerequisites
- Confidence intervals and asymptotic normality of MLE - Section02 Estimation Theory
- Sampling distributions (t, \chi^2, F distributions) - Ch6 Section02 Common Distributions
- Law of large numbers and CLT - Ch6 Section05 Limit Theorems
- Expectation and variance - Ch6 Section04 Expectation and Moments
- Likelihood functions (log-likelihood, score function) - Section02 Section4-Section5
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations: t-tests, chi-squared tests, power curves, NP lemma, GLRT, Bonferroni/BH correction, permutation tests, KS drift detection, sequential SPRT |
| exercises.ipynb | 10 graded exercises from one-sample t-tests through sequential A/B testing and KS-based LLM drift detection |
Learning Objectives
After completing this section, you will:
- State the formal definition of a statistical hypothesis and distinguish simple from composite hypotheses
- Define a test statistic, rejection region, and p-value and explain what each does and does not mean
- Quantify Type I error (\alpha), Type II error (\beta), and power (1-\beta), and explain why they cannot all be minimised simultaneously
- Derive sample size requirements from desired \alpha, \beta, and effect size (Cohen's d)
- Derive and apply one-sample t-test, two-sample Welch t-test, chi-squared goodness-of-fit, and one-way ANOVA
- State and prove the Neyman-Pearson lemma and identify when a UMP test exists
- Apply Wilks' theorem to construct generalized likelihood ratio tests for composite hypotheses
- Explain the multiple testing problem and apply Bonferroni, Holm, and Benjamini-Hochberg corrections
- Implement permutation tests, Wilcoxon rank tests, and KS tests for nonparametric inference
- Design a sequential A/B test using SPRT and explain why it avoids the peeking problem
- Identify statistical pitfalls in NLP benchmark comparisons and LLM evaluation leaderboards
Table of Contents
- 1. Intuition
- 2. The Formal Framework
- 3. Errors, Power, and Sample Size
- 4. Classical Parametric Tests
- 5. Likelihood Ratio Tests and UMP Tests
- 6. Multiple Testing
- 7. Nonparametric Tests
- 8. A/B Testing and ML Evaluation
- 9. Common Mistakes
- 10. Exercises
- 11. Why This Matters for AI (2026 Perspective)
- 12. Conceptual Bridge
1. Intuition
1.1 The Core Question: Evidence or Noise?
Imagine you flip a coin 100 times and observe 63 heads. Is the coin biased, or is 63 just a chance fluctuation from a fair coin? You cannot answer this by staring at the number 63. You need a framework that asks: how often would a fair coin produce 63 or more heads in 100 flips? If the answer is "1 in 1,000 times", you have strong evidence for bias. If the answer is "1 in 5 times", the result is easily explained by chance.
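A quick numerical check of this example - a minimal sketch assuming scipy is available; only the 63-heads-in-100-flips setup comes from the text:

```python
from scipy import stats

# P(X >= 63) for a fair coin: survival function at 62 counts 63, 64, ..., 100
p_one_sided = stats.binom.sf(62, 100, 0.5)
print(f"P(X >= 63 | fair coin) = {p_one_sided:.4f}")   # roughly 0.006

# Exact two-sided binomial test (scipy >= 1.7)
result = stats.binomtest(63, n=100, p=0.5, alternative="two-sided")
print(f"two-sided p-value = {result.pvalue:.4f}")
```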
This is the essence of hypothesis testing: quantify how surprising the data would be if the "nothing interesting happened" explanation were true. The "nothing interesting happened" explanation is the null hypothesis . The alternative - something systematic is going on - is the alternative hypothesis .
The court-room analogy is exact and instructive. In criminal law, the null hypothesis is innocence (H_0: the defendant is innocent). The prosecution must present evidence so overwhelming that innocence becomes implausible. The defendant is never "proven innocent" - the court simply fails to accumulate enough evidence to reject H_0. Similarly, in statistics, we never "prove" the null hypothesis true; we can only fail to reject it. The asymmetry is deliberate: falsely convicting an innocent person (Type I error) is considered worse than failing to convict a guilty one (Type II error), so we set the bar for conviction (rejection) very high.
For AI: Every time you report "model A achieves 87.3% accuracy vs. model B's 86.1% - a statistically significant improvement at p < 0.05", you are running a hypothesis test. The null hypothesis is H_0: acc_A = acc_B (no real difference). The question is whether the 1.2% gap is real signal or sampling noise from a finite test set.
1.2 Two Schools of Thought
Modern statistical testing is a marriage of two incompatible philosophies that practitioners blend without always realising it.
Fisher's approach (1925): Compute the p-value - the probability of observing data at least as extreme as what was obtained, under H_0. Report it as a continuous measure of evidence against H_0. Never pre-specify an alternative hypothesis, and never pre-commit to a decision threshold. The p-value is just one piece of evidence to weigh alongside domain knowledge and replication. Fisher rejected the idea of a fixed significance threshold as "absurdly academic".
Neyman-Pearson approach (1933): Pre-specify both H_0 and H_1, a significance level \alpha (Type I error rate), and a desired power 1-\beta (sensitivity to H_1). Compute the most powerful test for those hypotheses. Make a binary decision: reject or do not reject. The exact p-value is irrelevant - what matters is whether p \le \alpha. This framework optimises long-run decision quality across many repeated experiments.
What practitioners actually do: Use the Neyman-Pearson machinery (pre-specify \alpha, compute a test statistic, check whether p \le \alpha) while interpreting the p-value in Fisher's spirit (as a continuous measure of evidence). This hybrid is coherent enough for most purposes but creates confusions - particularly the widespread misinterpretation of the p-value as "the probability that H_0 is true" (which is Bayesian thinking, belonging to neither school).
FISHER vs. NEYMAN-PEARSON COMPARISON
========================================================================
Property | Fisher | Neyman-Pearson
-------------------+-------------------------+----------------------
Goal | Measure evidence | Make optimal decision
Pre-specify H_1? | No | Yes
Pre-specify \\alpha? | No | Yes (before seeing data)
Output | p-value (continuous) | Reject / Do not reject
Power | Not part of framework | Central to design
Philosophical base | Inductive reasoning | Long-run frequency
Use case | Exploratory science | Industrial quality control
========================================================================
1.3 Historical Timeline
| Year | Contributor | Contribution |
|---|---|---|
| 1710 | John Arbuthnot | First known significance test (sex ratio at birth) |
| 1900 | Karl Pearson | Chi-squared goodness-of-fit test |
| 1908 | William Gosset ("Student") | t-distribution for small samples |
| 1922 | Ronald Fisher | Formalises likelihood, degrees of freedom |
| 1925 | Ronald Fisher | Statistical Methods for Research Workers - p-values, F-test, ANOVA |
| 1933 | Neyman & Pearson | Power, UMP tests, Neyman-Pearson lemma |
| 1943 | Abraham Wald | Sequential probability ratio test (SPRT) |
| 1951 | Wald | Statistical decision theory |
| 1979 | Sture Holm | Holm-Bonferroni step-down procedure for multiple testing |
| 1995 | Benjamini & Hochberg | False discovery rate (FDR) - transformative for genomics and ML |
| 2005 | Ioannidis | "Why Most Published Research Findings Are False" - catalyses replication crisis |
| 2016 | ASA statement | Formal warning against p-value misuse |
| 2019 | Nature editorial | 800+ scientists call for retiring "statistical significance" |
| 2022+ | Always-valid p-values | Ramdas et al. - sequential testing for online A/B experiments |
1.4 Why Hypothesis Testing Matters for AI
Model evaluation: Reporting a test accuracy without a confidence interval or significance test is meaningless for model comparison. Is 87.3% vs. 86.1% real? On 1,000 test examples, a 1.2% difference corresponds to a two-proportion z-test with z \approx 0.8 (p \approx 0.4) - not significant. On 10,000 examples, the same gap gives z \approx 2.5 (p \approx 0.01) - significant. Sample size is everything.
A/B testing at scale: Tech companies run thousands of simultaneous A/B experiments. Each one is a two-sample hypothesis test. The infrastructure problem is: how do you test without pre-committing to a fixed sample size (you want to stop early if the effect is clear), while controlling false discovery rate across simultaneous tests?
Data drift detection: Production ML systems degrade when the input distribution changes. Detecting this is a two-sample test: is the distribution of today's features statistically different from training data? The Kolmogorov-Smirnov test, Maximum Mean Discrepancy, and Population Stability Index are all hypothesis tests under the hood.
LLM benchmark evaluation: The 2024-2026 era of LLM leaderboards (MMLU, HumanEval, BIG-Bench, LMSYS Arena) suffers from massive multiple-comparison inflation. If you test 100 models on 50 benchmarks, you expect 250 false discoveries at \alpha = 0.05 even if no model truly differs. Proper evaluation requires FDR correction and bootstrap confidence intervals.
Causal inference for RLHF: When measuring whether RLHF improves output quality, you need a randomised controlled design and a proper two-sample test. Confounded comparisons (different prompts, different raters) can produce entirely spurious "improvements".
2. The Formal Framework
2.1 Hypotheses: Null and Alternative
Definition (Statistical Hypothesis). A statistical hypothesis is a claim about the parameter \theta of a probability model {P_\theta : \theta \in \Theta}. Formally, a hypothesis specifies a subset \Theta_0 \subseteq \Theta:
H_0: \theta \in \Theta_0   versus   H_1: \theta \in \Theta_1 = \Theta \setminus \Theta_0
Simple vs. composite hypotheses:
- A simple hypothesis pins \theta to a single value: H_0: \theta = \theta_0.
- A composite hypothesis specifies a range: H_0: \theta \le \theta_0 or H_1: \theta \ne \theta_0.
One-sided vs. two-sided tests:
- One-sided (directional): H_1: \theta > \theta_0 or H_1: \theta < \theta_0. Use when the direction of the effect is theoretically specified in advance.
- Two-sided (non-directional): H_1: \theta \ne \theta_0. Use when any deviation from \theta_0 is of interest, or when the direction is unknown.
The asymmetry between H_0 and H_1: The null hypothesis is the "default" - the claim we assume true unless data provide sufficient evidence against it. This asymmetry has important consequences:
- We control the probability of falsely rejecting H_0 (Type I error).
- We do not automatically control the probability of falsely accepting H_0 (Type II error) - that requires separate power analysis.
- "Fail to reject H_0" is NOT the same as "accept H_0". Absence of evidence is not evidence of absence.
Standard examples:
| Setting | H_0 | H_1 |
|---|---|---|
| Coin fairness | p = 0.5 | p \ne 0.5 |
| New drug effectiveness | \mu_treatment = \mu_control | \mu_treatment > \mu_control |
| Model improvement | acc_A = acc_B | acc_A > acc_B |
| Feature distribution shift | F_prod = F_train | F_prod \ne F_train |
| Independence in contingency table | Variables independent | Variables associated |
2.2 Test Statistics and Sampling Distributions
Definition (Test Statistic). A test statistic T = T(X_1, \dots, X_n) is a function of the data that summarises the evidence against H_0. A good test statistic:
- Has a known distribution under H_0 (enables exact p-value computation).
- Takes extreme values when H_1 is true (enables detection).
The sampling distribution of T under H_0 is central. For example:
- If X_i ~ N(\mu, \sigma^2) with \sigma known, then under H_0: \mu = \mu_0:
Z = (\bar X - \mu_0) / (\sigma / \sqrt n) ~ N(0, 1)
- If \sigma is unknown, replacing it with the sample standard deviation S introduces extra variability:
T = (\bar X - \mu_0) / (S / \sqrt n) ~ t_{n-1}
The shift from N(0,1) to t_{n-1} is not a minor detail - for small n, the t-distribution has much heavier tails, making it much harder to reject unless the evidence is very strong.
Standardisation principle: Most test statistics are of the form:
T = (estimate - hypothesised value) / (standard error of the estimate)
This form ensures T is dimensionless and has a tractable distribution under H_0.
2.3 Rejection Regions and Critical Values
Definition (Rejection Region). For a test of size \alpha, the rejection region R is the subset of the sample space for which H_0 is rejected, chosen so that:
P_{\theta_0}(T \in R) \le \alpha
For a two-sided test of a Gaussian mean with known \sigma:
R = { x : |Z| > z_{1-\alpha/2} },
where z_{1-\alpha/2} is the (1-\alpha/2) quantile of N(0,1). At \alpha = 0.05, z_{0.975} = 1.96.
Critical value: The boundary c such that P_{H_0}(T > c) = \alpha (one-sided) or P_{H_0}(|T| > c) = \alpha (two-sided). The test rejects iff T > c (or |T| > c).
Exact vs. approximate rejection regions:
- For normal populations with known \sigma: exact (z-test).
- For normal populations with unknown \sigma: exact (t-test, using the t_{n-1} distribution).
- For non-normal populations, large n: approximate (the CLT makes Z approximately standard normal).
- For small n, non-normal: nonparametric tests (Section 7).
2.4 The p-Value
Definition (p-Value). The p-value of a test with statistic T and observed value t_obs is:
p = P_{H_0}(T \ge t_obs)
for a one-sided test, or
p = P_{H_0}(|T| \ge |t_obs|)
for a two-sided test. Equivalently, p is the smallest significance level at which the observed data would lead to rejection of H_0.
Key properties of p-values:
- Under H_0, p ~ Uniform(0, 1). This is a fundamental result: if H_0 is true and the test is exact, the p-value is uniformly distributed. This enables calibration checks.
- Under H_1, p is stochastically smaller - it tends toward 0 as sample size grows or effect size increases.
- p is a random variable. Running the same experiment twice will give different p-values. The p-value quantifies how surprising the specific data are, not how true or false H_0 is.
The six most important p-value misinterpretations:
| Misinterpretation | Why it's wrong | Correct statement |
|---|---|---|
| "p = probability H_0 is true" | The frequentist p-value makes no probability claim about H_0 | p = prob of data this extreme under H_0 |
| "1 - p = probability H_1 is true" | Same error | Not a probability about hypotheses |
| "p < 0.05 means the effect is large" | p conflates effect size with sample size | Report effect size separately |
| "p > 0.05 means no effect" | Absence of evidence \neq evidence of absence | Report power and CI |
| "p < 0.05 means the finding replicates" | A single-study p is unreliable | Need replication studies |
| "We found p = 0.049, thus significant" | Arbitrary threshold; p = 0.051 is equally evidential | Report the exact p; don't dichotomize |
2.5 Duality: Tests and Confidence Intervals
There is an exact correspondence between hypothesis tests and confidence intervals - a fact that is both theoretically beautiful and practically useful.
The Inversion Principle: Given a size-\alpha test of H_0: \theta = \theta_0, the 1-\alpha confidence interval for \theta is the set of null values that would not be rejected:
CI_{1-\alpha} = { \theta_0 : the size-\alpha test does not reject H_0: \theta = \theta_0 }
Conversely, the size-\alpha test rejects H_0: \theta = \theta_0 if and only if \theta_0 \notin CI_{1-\alpha}.
Concrete example: The 95% CI for a Gaussian mean with known \sigma is:
\bar X \pm 1.96 \sigma / \sqrt n
The corresponding z-test rejects H_0: \mu = \mu_0 at \alpha = 0.05 iff \mu_0 falls outside this interval - exactly the inversion principle.
Recall: Confidence intervals were derived in Section02 Estimation Theory. The CI for \mu was constructed by pivoting on the standard normal. Here, we see that same CI is the set of null values we would fail to reject.
Practical implication: Reporting a CI is strictly more informative than reporting a p-value. The CI tells you the effect size and uncertainty; the p-value alone only tells you whether a point null is rejected. Always prefer CIs over p-values where possible.
3. Errors, Power, and Sample Size
3.1 Type I and Type II Errors
Any binary decision procedure applied to random data will sometimes make mistakes. There are exactly two ways to err:
Definition (Type I Error). Rejecting H_0 when H_0 is true. Also called a false positive. Probability = \alpha (the significance level).
Definition (Type II Error). Failing to reject H_0 when H_1 is true. Also called a false negative. Probability = \beta (which depends on the specific alternative \theta_1).
ERROR TYPE TABLE
========================================================================
| H_0 True | H_1 True
--------------------+----------------------+----------------------
Reject H_0 | Type I Error (\\alpha) | CORRECT (Power 1-\\beta)
Do not reject H_0 | CORRECT (1-\\alpha) | Type II Error (\\beta)
Analogy: | Convict innocent | Free the guilty
Medical test: | False positive | False negative
========================================================================
The fundamental trade-off: For a fixed sample size n, decreasing \alpha (requiring stronger evidence to reject) increases \beta (making it harder to detect real effects). To decrease both simultaneously, you must increase n.
Conventional thresholds (and their limitations):
- \alpha = 0.05: Fisher's suggestion from 1925, now a near-universal convention despite having no theoretical justification.
- \alpha = 0.01: More stringent; used in physics and genomics.
- \alpha = 0.005: Proposed by Benjamin et al. (2018) as a new standard to reduce false discoveries.
- For AI deployment decisions: The appropriate \alpha depends on the cost of each error type and on how the null is framed. If H_0 is "the new model is safe / no worse than the current one", then deploying a harmful model is a Type II error and blocking a good model is a Type I error. These costs are application-specific.
3.2 The Power Function
Definition (Power Function). The power function of a test with rejection region R is:
\pi(\theta) = P_\theta(T \in R) = P_\theta(reject H_0),
evaluated at every \theta \in \Theta.
Key properties of a well-designed power function:
- \pi(\theta) \le \alpha for \theta \in \Theta_0 (the test has correct size).
- \pi(\theta) -> 1 as \theta moves far from \Theta_0 (the test is consistent).
- \pi(\theta) is large for alternatives \theta of practical interest (the test is powerful).
Power at a specific alternative: For the one-sided, one-sample z-test of H_0: \mu = \mu_0 vs. H_1: \mu = \mu_0 + \delta (with \delta > 0):
Power = 1 - \beta = \Phi( \delta \sqrt n / \sigma - z_{1-\alpha} )
This formula reveals exactly how power depends on:
- Effect size \delta / \sigma: larger effect -> higher power.
- Sample size n: larger n -> higher power (power grows with \sqrt n).
- Significance level \alpha: larger \alpha -> higher power (but more Type I errors).
Minimum detectable effect (MDE): The smallest \delta at which the power reaches 1-\beta. Solving for \delta:
\delta_min = (z_{1-\alpha} + z_{1-\beta}) \sigma / \sqrt n
At \alpha = 0.05 (one-sided) and 1-\beta = 0.80: z_{0.95} + z_{0.80} \approx 1.645 + 0.84 = 2.49, so \delta_min \approx 2.49 \sigma / \sqrt n.
3.3 Effect Size
The problem with raw differences: A mean difference of 2 points on an exam is huge if the standard deviation is 1, but negligible if it is 100. Effect sizes standardise the comparison.
Cohen's d (for means):
d = (\bar X_1 - \bar X_2) / s_pooled
Benchmarks (Cohen 1988): d = 0.2 small, d = 0.5 medium, d = 0.8 large.
Cohen's h (for proportions):
h = 2 arcsin(\sqrt{p_1}) - 2 arcsin(\sqrt{p_2})
The arcsin transform stabilises variance. Benchmarks: h = 0.2 small, h = 0.5 medium, h = 0.8 large.
Cramer's V (for r x c contingency tables):
V = \sqrt{ \chi^2 / ( n \min(r-1, c-1) ) }
where r, c are the numbers of rows and columns. V \in [0, 1] with 0 = no association.
For AI: When comparing two models' accuracies, report Cohen's h for proportions. A 1% absolute accuracy gain near a 50% base rate corresponds to h \approx 0.02 - far below even Cohen's "small" benchmark - and may not justify the deployment cost. Effect size should always be reported alongside the p-value in rigorous ML papers.
3.4 Sample Size Calculation
Solving the power equation for n gives the required sample size to detect a standardised effect d = \delta/\sigma with power 1-\beta at level \alpha:
n = (z_{1-\alpha} + z_{1-\beta})^2 / d^2
(one-sample, one-sided). For two-sided tests replace z_{1-\alpha} with z_{1-\alpha/2}.
Two-sample comparison of means (equal group sizes):
n per group = 2 (z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2 / \delta^2
where \delta = \mu_1 - \mu_2.
Two-proportion z-test (comparing accuracy rates p_1 vs. p_2):
n per group = (z_{1-\alpha/2} + z_{1-\beta})^2 [ p_1(1-p_1) + p_2(1-p_2) ] / (p_1 - p_2)^2
where the effect is the raw difference p_1 - p_2.
Worked example: You want to detect a 2% accuracy improvement (from 85% to 87%) with 80% power at \alpha = 0.05 (two-sided).
- z_{0.975} = 1.96, z_{0.80} = 0.84, p_1(1-p_1) + p_2(1-p_2) = 0.2406.
- n \approx (1.96 + 0.84)^2 x 0.2406 / 0.02^2 \approx 4,700 per group.
This reveals why benchmark comparisons on small test sets are inconclusive: with only 1,000 examples per group, the same test has power of roughly 0.25.
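A small sketch of the same calculation in code, using the normal-approximation formulas above (the 85%/87% accuracies come from the worked example; the helper function names are ours):

```python
import numpy as np
from scipy import stats

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided two-proportion z-test."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return int(np.ceil((z_a + z_b) ** 2 * var_sum / (p1 - p2) ** 2))

def power_at_n(p1, p2, n, alpha=0.05):
    """Approximate power of the two-sided two-proportion z-test with n per group."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    se = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    z_effect = abs(p1 - p2) / se
    return stats.norm.cdf(z_effect - z_a) + stats.norm.cdf(-z_effect - z_a)

print(n_per_group(0.85, 0.87))                 # roughly 4,700 examples per group
print(f"{power_at_n(0.85, 0.87, 1000):.2f}")   # roughly 0.25 with only 1,000 per group
```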
3.5 ROC Analogy
The error rate trade-off in hypothesis testing is structurally identical to the ROC curve in binary classification:
| Hypothesis Testing | Binary Classification |
|---|---|
| Significance level \alpha | False positive rate (FPR) |
| Power 1-\beta | True positive rate (TPR) / Recall |
| Critical value | Classification threshold |
| Type I error | False positive |
| Type II error | False negative |
| Reject region | Predicted positive region |
In both settings, you trace out a curve by varying the threshold (critical value / classification threshold), and the curve represents the complete trade-off between sensitivity and specificity. The AUC of a classifier measures the same thing as the integrated power function of a test: how well the score separates the two classes.
For AI: The ROC analogy makes hypothesis testing intuitive for ML practitioners. Choosing \alpha = 0.05 is exactly like choosing a classification threshold to achieve 5% FPR. Power analysis is like calculating recall at that threshold.
4. Classical Parametric Tests
4.1 The Z-Test
Setting: X_1, \dots, X_n ~ N(\mu, \sigma^2) with \sigma known. Test H_0: \mu = \mu_0.
Test statistic:
Z = (\bar X - \mu_0) / (\sigma / \sqrt n) ~ N(0, 1) under H_0
Rejection regions and p-values:
- Two-sided (H_1: \mu \ne \mu_0): reject iff |Z| > z_{1-\alpha/2}; p = 2(1 - \Phi(|z_obs|)).
- Upper-tailed (H_1: \mu > \mu_0): reject iff Z > z_{1-\alpha}; p = 1 - \Phi(z_obs).
- Lower-tailed (H_1: \mu < \mu_0): reject iff Z < -z_{1-\alpha}; p = \Phi(z_obs).
Two-sample z-test for proportions: Compare \hat p_1 (proportion in group 1) vs. \hat p_2 (group 2).
Z = (\hat p_1 - \hat p_2) / \sqrt{ \bar p (1 - \bar p)(1/n_1 + 1/n_2) }
where \bar p is the pooled proportion. This is the standard test for A/B experiments comparing click-through rates or model accuracy.
Validity: Requires roughly n_i \hat p_i \ge 5 and n_i (1 - \hat p_i) \ge 5 in each group. For rare events or small samples, use Fisher's exact test.
4.2 Student's t-Test
The t-test is the workhorse of applied statistics: it handles the realistic case where is unknown.
One-sample t-test: X_1, \dots, X_n ~ N(\mu, \sigma^2) with \sigma unknown. Test H_0: \mu = \mu_0.
T = (\bar X - \mu_0) / (S / \sqrt n) ~ t_{n-1} under H_0,
where S^2 is the sample variance. Reject at level \alpha iff |T| > t_{n-1, 1-\alpha/2}.
Gosset's insight: Why t_{n-1} and not N(0,1)? Because \sigma is estimated from data, not known. Substituting S for \sigma introduces additional randomness. The t-distribution has heavier tails to account for this extra uncertainty. As n -> \infty, t_{n-1} -> N(0,1).
Paired t-test: When observations come in natural pairs (before/after measurements, matched subjects), compute differences D_i = X_i - Y_i and apply the one-sample t-test to the D_i. This removes between-pair variability and dramatically increases power.
Two-sample Welch t-test: Compare means from two independent groups with possibly unequal variances (\sigma_1^2 \ne \sigma_2^2):
T = (\bar X_1 - \bar X_2) / \sqrt{ S_1^2/n_1 + S_2^2/n_2 }
where the Welch-Satterthwaite degrees of freedom are:
\nu \approx (S_1^2/n_1 + S_2^2/n_2)^2 / [ (S_1^2/n_1)^2/(n_1-1) + (S_2^2/n_2)^2/(n_2-1) ]
Always use the Welch t-test (not the pooled t-test) unless you have strong prior evidence that \sigma_1^2 = \sigma_2^2. The pooled t-test's assumption of equal variances is rarely justified and can inflate Type I error badly.
Robustness: The t-test is remarkably robust to non-normality for n \gtrsim 30 by the CLT. For small n with strongly skewed or heavy-tailed distributions, use nonparametric alternatives (Wilcoxon signed-rank or Mann-Whitney).
For AI: Use the paired t-test when comparing two models evaluated on the same test examples (paired observations). Use Welch's t-test when comparing models evaluated on different test sets.
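A sketch of the paired-vs-unpaired distinction on synthetic per-example scores (all numbers below are made up for illustration; the point is that the paired test exploits shared per-example difficulty):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical per-example scores for two models on the SAME 200 test examples:
# both models share a per-example difficulty component plus independent noise.
difficulty = rng.normal(0.0, 1.0, size=200)
score_a = 0.75 + 0.15 * difficulty + rng.normal(0, 0.05, 200)
score_b = 0.73 + 0.15 * difficulty + rng.normal(0, 0.05, 200)

# Paired t-test: differences cancel the shared difficulty, leaving only the 0.02 gap
t_paired, p_paired = stats.ttest_rel(score_a, score_b)
# Welch t-test treats the columns as independent samples and keeps the difficulty variance
t_welch, p_welch = stats.ttest_ind(score_a, score_b, equal_var=False)

print(f"paired: t = {t_paired:.2f}, p = {p_paired:.4f}")   # typically significant
print(f"Welch:  t = {t_welch:.2f}, p = {p_welch:.4f}")     # typically not significant
```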
4.3 The Chi-Squared Test
Goodness-of-fit test: Observed counts O_1, \dots, O_k from n observations, expected counts E_i = n p_i under H_0:
\chi^2 = \sum_{i=1}^k (O_i - E_i)^2 / E_i ~ \chi^2_{k-1} approximately under H_0
Valid when E_i \ge 5 for all i. The approximation improves with n.
Test of independence: An r x c contingency table with counts O_{ij}. Under H_0 (row and column variables independent), E_{ij} = (row_i total)(column_j total)/n and
\chi^2 = \sum_{ij} (O_{ij} - E_{ij})^2 / E_{ij} ~ \chi^2_{(r-1)(c-1)}
Worked example: A model is evaluated on 4 topic categories. Observed errors: [12, 8, 25, 5]. Under H_0 (equal error rate): E_i = 50/4 = 12.5 each. \chi^2 = (0.25 + 20.25 + 156.25 + 56.25)/12.5 = 18.64, df = 3, p \approx 0.0003. Strong evidence the error rate varies by topic.
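The worked example can be checked with scipy.stats.chisquare:

```python
import numpy as np
from scipy import stats

observed = np.array([12, 8, 25, 5])          # errors per topic category
expected = np.full(4, observed.sum() / 4)    # 12.5 each under H_0: equal error rate

chi2, p = stats.chisquare(observed, expected)
print(f"chi2 = {chi2:.2f}, df = 3, p = {p:.4f}")   # chi2 ~ 18.64, p ~ 0.0003
```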
For AI: Chi-squared tests are natural for:
- Testing whether a language model's errors are uniformly distributed across categories.
- Testing whether a tokenizer's vocabulary coverage is uniform across languages.
- Detecting systematic biases in model outputs (contingency table: output category vs. demographic group).
4.4 The F-Test and ANOVA
F-test for two variances: H_0: \sigma_1^2 = \sigma_2^2, with test statistic F = S_1^2 / S_2^2 ~ F_{n_1-1, n_2-1} under H_0.
Rarely used directly; arises naturally inside ANOVA.
One-way ANOVA: Compare means across k groups with n_i observations in group i. Total N = \sum_i n_i. Test H_0: \mu_1 = \mu_2 = \dots = \mu_k.
Decompose total variance into between-group and within-group components:
SS_total = SS_between + SS_within
Test statistic:
F = [ SS_between / (k-1) ] / [ SS_within / (N-k) ] ~ F_{k-1, N-k} under H_0
Reject when F > F_{k-1, N-k, 1-\alpha}.
Post-hoc tests: A significant ANOVA F-test says "at least one mean differs" but not which ones. Post-hoc comparisons (Tukey HSD, Bonferroni-corrected t-tests) identify the specific differences while controlling FWER.
ANOVA assumptions: Normality within groups, equal variances (homoscedasticity), independence. Use Welch's ANOVA or Kruskal-Wallis test when homoscedasticity fails.
4.5 Which Test When
TEST SELECTION FLOWCHART
========================================================================
How many groups?
+-- One group
| +-- Normal / large n -> One-sample t-test or z-test
| +-- Non-normal, small n -> Wilcoxon signed-rank
+-- Two groups
| +-- Paired observations?
| | +-- Yes: Normal -> Paired t-test
| | +-- Yes: Non-normal -> Wilcoxon signed-rank
| +-- Independent observations?
| +-- Normal: Welch two-sample t-test
| +-- Non-normal: Mann-Whitney U test
+-- Three or more groups
+-- Normal, equal variances -> One-way ANOVA
+-- Normal, unequal variances -> Welch's ANOVA
+-- Non-normal -> Kruskal-Wallis test
Categorical / count data?
+-- One sample, counts vs. expected -> Chi-squared GoF
+-- Two categorical variables -> Chi-squared independence
+-- Small expected counts (< 5) -> Fisher's exact test
========================================================================
5. Likelihood Ratio Tests and UMP Tests
5.1 The Neyman-Pearson Lemma
The Neyman-Pearson lemma answers a fundamental question: among all tests with size at most \alpha, which one maximises power at a specific alternative \theta_1?
Theorem (Neyman-Pearson, 1933). Consider testing H_0: \theta = \theta_0 vs. H_1: \theta = \theta_1 (both simple). The most powerful size-\alpha test rejects when:
\Lambda(x) = L(\theta_1; x) / L(\theta_0; x) > k,
where the constant k is chosen so that P_{\theta_0}(\Lambda(X) > k) = \alpha.
Proof sketch: Let \phi be the likelihood ratio test (\phi(x) = 1 on the rejection region, 0 otherwise) and \psi any other test with E_{\theta_0}[\psi(X)] \le \alpha. We want to show E_{\theta_1}[\phi(X)] \ge E_{\theta_1}[\psi(X)].
By construction of \phi: [\phi(x) - \psi(x)][L(\theta_1; x) - k L(\theta_0; x)] \ge 0 for all x (both factors have the same sign). Integrating over x and rearranging:
E_{\theta_1}[\phi] - E_{\theta_1}[\psi] \ge k ( E_{\theta_0}[\phi] - E_{\theta_0}[\psi] ) \ge 0,
since E_{\theta_0}[\phi] = \alpha \ge E_{\theta_0}[\psi].
Intuition: The likelihood ratio ranks data points by how much more likely they are under H_1 than under H_0. Including the most H_1-favouring data points in the rejection region maximises power. No other region of the same size can do better.
Example - Gaussian mean: Testing H_0: \mu = 0 vs. H_1: \mu = \mu_1 > 0 with known \sigma and n observations:
\Lambda(x) = exp( (\mu_1/\sigma^2) \sum_i x_i - n \mu_1^2 / (2\sigma^2) )
Rejecting when \Lambda > k is equivalent to rejecting when \bar X > c for some threshold c. The NP lemma tells us the z-test is the most powerful test for this specific \mu_1.
5.2 Uniformly Most Powerful Tests
The NP lemma gives the most powerful test against a single specific alternative. Can we find a test that is simultaneously most powerful against all alternatives in \Theta_1?
Definition (UMP Test). A size-\alpha test \phi is uniformly most powerful (UMP) if for every other size-\alpha test \phi' and every \theta \in \Theta_1:
\pi_\phi(\theta) \ge \pi_{\phi'}(\theta)
Monotone Likelihood Ratio (MLR): A family { f(x; \theta) } has the MLR property in statistic T(x) if for \theta_2 > \theta_1, the ratio f(x; \theta_2)/f(x; \theta_1) is a non-decreasing function of T(x).
Theorem (Karlin-Rubin). If the family has MLR in T, then for H_0: \theta \le \theta_0 vs. H_1: \theta > \theta_0, the test that rejects when T(x) > c (with c chosen to give size \alpha) is UMP.
Exponential families have MLR: The natural exponential family f(x; \theta) = h(x) exp( \eta(\theta) T(x) - A(\theta) ) with increasing \eta has MLR in the sufficient statistic T(x). This means UMP tests exist for one-sided hypotheses about natural parameters of: Gaussian (mean), Bernoulli (logit), Poisson (log-rate), Exponential (rate), Gamma.
When UMP tests do NOT exist: For two-sided alternatives H_1: \theta \ne \theta_0, UMP tests generally do not exist. The best we can do is a UMP unbiased test (UMPU), which has power at least \alpha everywhere in \Theta_1.
5.3 The Generalized Likelihood Ratio Test
For composite hypotheses involving multiple parameters, the Neyman-Pearson approach does not directly apply. The GLRT provides a general-purpose solution.
Definition (GLRT). The generalised likelihood ratio is:
\Lambda = sup_{\theta \in \Theta_0} L(\theta) / sup_{\theta \in \Theta} L(\theta) = L(\hat\theta_0) / L(\hat\theta),
where \hat\theta_0 is the restricted MLE (constrained to \Theta_0) and \hat\theta is the unrestricted MLE.
Note 0 \le \Lambda \le 1. Small \Lambda means the constrained model fits much worse than the unconstrained model - evidence against H_0.
Wilks' Theorem. Under H_0 and regularity conditions, as n -> \infty:
-2 log \Lambda ->_d \chi^2_k,
where k = dim(\Theta) - dim(\Theta_0) is the number of constraints imposed by H_0.
Proof idea: Taylor-expand the log-likelihood around \hat\theta. The second-order term yields a quadratic form in (\hat\theta - \hat\theta_0) scaled by the Fisher information. By asymptotic normality of the MLE (Section02), this quadratic form is asymptotically \chi^2_k.
Example: Testing H_0: (\mu, \sigma^2) = (\mu_0, \sigma_0^2) in a Gaussian model imposes 2 constraints, so -2 log \Lambda ->_d \chi^2_2.
For AI: Wilks' theorem underlies model comparison via likelihood. Any time you compare a restricted neural architecture (fewer parameters) to a full model using their log-likelihoods, you are implicitly using a GLRT. The \chi^2_k approximation provides a principled p-value.
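A minimal GLRT sketch for a single constraint, assuming a Poisson model for per-variant error counts (the data and rates below are synthetic; the point is the -2 log \Lambda vs. \chi^2_1 comparison):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical error counts for two model variants, modelled as Poisson
x = rng.poisson(lam=4.0, size=50)   # variant A
y = rng.poisson(lam=5.0, size=50)   # variant B

def poisson_loglik(data, lam):
    return np.sum(stats.poisson.logpmf(data, lam))

# Restricted model (H_0: common rate) vs. unrestricted (separate rates); the MLE of a
# Poisson rate is the sample mean in both cases.
pooled = np.concatenate([x, y])
ll_restricted = poisson_loglik(pooled, pooled.mean())
ll_unrestricted = poisson_loglik(x, x.mean()) + poisson_loglik(y, y.mean())

lr_stat = -2 * (ll_restricted - ll_unrestricted)   # -2 log Lambda
p = stats.chi2.sf(lr_stat, df=1)                   # one constraint => chi^2_1 by Wilks
print(f"-2 log Lambda = {lr_stat:.2f}, p = {p:.4f}")
```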
5.4 Score and Wald Tests
The GLRT requires fitting both the restricted and unrestricted models. Two alternatives - the score test and the Wald test - each require fitting only one model. Together with the GLRT, they form the trinity of asymptotic tests, all asymptotically equivalent under and local alternatives.
Wald Test: Fit the unrestricted MLE \hat\theta and check whether it is far from the null value \theta_0.
W = (\hat\theta - \theta_0)^T I(\hat\theta) (\hat\theta - \theta_0) ->_d \chi^2_k under H_0,
where I(\hat\theta) is the observed Fisher information at the MLE. For scalar \theta: W = (\hat\theta - \theta_0)^2 / \widehat{Var}(\hat\theta), which is the square of a z-score.
Score (Rao) Test: Fit only the restricted MLE \hat\theta_0 and check if the score function (gradient of the log-likelihood) is far from zero there.
S = U(\hat\theta_0)^T I(\hat\theta_0)^{-1} U(\hat\theta_0) ->_d \chi^2_k under H_0,
where U(\theta) = \nabla_\theta log L(\theta) is the score. Under H_0, the score at \hat\theta_0 should be near zero; a large score indicates the null constraint is straining the model.
TRINITY OF ASYMPTOTIC TESTS
========================================================================
Test | Fits model under | Statistic | Geometric intuition
-----------+------------------+------------------------+--------------------
Wald | H_1 (unrestr.) | Distance from \\thetahat to \\Theta_0 | How far is MLE from H_0?
Score | H_0 (restr.) | Gradient at \\thetahat_0 | Is restricted fit stable?
LRT (GLRT) | Both | Ratio of likelihoods | How much does H_0 cost?
All three -> \\chi^2_k under H_0, with same asymptotic power under H_1
========================================================================
When they differ: For small , the three tests can give different p-values. The LRT is generally most accurate; the Wald test can be anti-conservative (over-rejects) for parameters near boundaries. The score test is preferred when fitting the unconstrained model is computationally expensive.
For AI: The Wald test is used to test whether individual neural network weights are significantly different from zero (a form of pruning criterion). The score test is used in online learning to detect if the current gradient is significantly non-zero (an adaptive stopping criterion).
6. Multiple Testing
6.1 The Multiple Testing Problem
Conduct m independent hypothesis tests, each at level \alpha. If all null hypotheses are true, what is the probability of making at least one false rejection?
FWER = 1 - (1 - \alpha)^m
For m = 20 tests at \alpha = 0.05: FWER = 1 - 0.95^20 \approx 0.64, and you expect about one false discovery just by chance, even if nothing is real. This is the multiple testing problem - the fundamental challenge underlying the replication crisis in science and the benchmark arms race in ML.
The error metrics:
| Metric | Definition | Controls |
|---|---|---|
| Per-comparison error rate (PCER) | P(Type I error) per individual test | Nothing about joint errors |
| Family-wise error rate (FWER) | P(at least one false rejection) | Strict; few false positives |
| False discovery rate (FDR) | E[V / max(R, 1)] | Balanced; allows some false positives |
| False discovery proportion (FDP) | Actual V/R in a given experiment | Random variable, not directly controllable |
m_0 and m_1: Of m tests, let m_0 be the number of true nulls and m_1 = m - m_0 the number of true alternatives. Define V = false positives, S = true positives, R = V + S = total rejections. Then FWER = P(V \ge 1) and FDR = E[V/R] (with V/R := 0 if R = 0).
6.2 Bonferroni and Holm Corrections
Bonferroni correction: Test each hypothesis at level \alpha/m. By the union bound:
FWER = P( \bigcup_i { falsely reject H_i } ) \le \sum_i \alpha/m = \alpha
This guarantees FWER \le \alpha regardless of the dependency structure between tests.
Procedure: Compute p-values p_1, \dots, p_m. Reject H_i if p_i \le \alpha/m.
Conservative when tests are positively correlated: If tests share the same data, the union bound is loose. Bonferroni wastes power in such settings.
Holm-Bonferroni step-down procedure (1979):
- Order p-values: p_(1) \le p_(2) \le \dots \le p_(m).
- Find the smallest k such that p_(k) > \alpha / (m - k + 1).
- Reject H_(1), \dots, H_(k-1).
Claim: Holm controls FWER at level \alpha and is uniformly more powerful than Bonferroni - it never rejects fewer hypotheses.
Proof sketch: Let j be the position of the smallest p-value among the m_0 true nulls in the ordered list. For Holm to falsely reject any true null, that p-value must fall below its threshold \alpha/(m - j + 1) \le \alpha/m_0; by the union bound over the m_0 true nulls, this event has probability at most m_0 \cdot \alpha/m_0 = \alpha.
Sidak correction: For independent tests, the exact per-test threshold is 1 - (1 - \alpha)^{1/m} (slightly larger than \alpha/m, hence slightly more powerful).
6.3 False Discovery Rate
The Benjamini-Hochberg (BH) procedure (1995):
Given p-values p_(1) \le p_(2) \le \dots \le p_(m) ordered from smallest to largest:
- Find the largest k such that p_(k) \le (k/m) \alpha.
- Reject H_(1), \dots, H_(k).
If no such k exists, reject nothing.
Theorem (Benjamini-Hochberg). Under independence (or positive dependence, PRDS), BH controls FDR at level (m_0/m) \alpha \le \alpha.
Proof idea (Storey, 2002): View the procedure as choosing a p-value threshold t. Under independence, the expected number of false rejections at threshold t is m_0 t, so the FDR is approximately m_0 t / R(t); the BH rule picks the largest threshold of the form k\alpha/m with R(t) \ge k, which keeps this ratio at most (m_0/m)\alpha \le \alpha.
BH vs. Bonferroni comparison:
| Property | Bonferroni | BH |
|---|---|---|
| Controls | FWER | FDR |
| Stringency | Very strict | Moderate |
| Power at large m | Very low | Much higher |
| False positives allowed | None (probabilistically) | Some (controlled on average) |
| Best for | Few tests, each critical | Many tests, some FP acceptable |
q-values: For each rejected hypothesis H_i, the q-value is the minimum FDR level at which H_i would be rejected. Analogous to the p-value, but for FDR control. Introduced by Storey (2002).
For AI: In genomics (the original motivation for BH), researchers test ~20,000 gene expression differences - Bonferroni would require p < 0.05/20,000 = 2.5 x 10^-6. In ML, testing 100 models across 50 benchmarks creates 5,000 comparisons - FDR control via BH is the appropriate framework.
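A small simulation in the spirit of Exercise 5, comparing uncorrected, Bonferroni, and BH rejections on synthetic p-values (the BH helper below is our own implementation of the step-up rule; the 45/5 null/alternative split is illustrative):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean rejection mask controlling FDR at level alpha (BH step-up)."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_star = np.nonzero(below)[0].max()        # largest k with p_(k) <= (k/m) alpha
        reject[order[: k_star + 1]] = True
    return reject

# Synthetic p-values: 45 true nulls (Uniform) + 5 true effects (concentrated near 0)
rng = np.random.default_rng(3)
pvals = np.concatenate([rng.uniform(size=45), rng.beta(0.2, 1.0, size=5)])

print("uncorrected:", int((pvals < 0.05).sum()), "rejections")
print("Bonferroni: ", int((pvals < 0.05 / 50).sum()), "rejections")
print("BH (FDR):   ", int(benjamini_hochberg(pvals, alpha=0.05).sum()), "rejections")
```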
6.4 NLP Benchmark Comparisons
The leaderboard problem (2024-2026): The major LLM leaderboards (MMLU, HellaSwag, HumanEval, GSM8K, LMSYS Arena, LiveBench) face severe multiple testing issues:
- Model selection bias: Model developers report best results across many runs, architectures, and prompting strategies. This is implicit p-hacking at the model level.
- Benchmark contamination: Test sets get into training data over time. Reported improvements may reflect memorisation rather than generalisation.
- Multiple comparisons across benchmarks: A model scoring highest on 3 of 10 benchmarks is not necessarily best - with 100 models and 10 benchmarks, 50 false "wins" are expected by chance at \alpha = 0.05.
- Non-stationary test sets: Rolling evaluation windows mean the effective sample size is unclear.
Rigorous evaluation practices:
- Report bootstrap CIs (Section 7.5) on aggregate scores.
- Apply BH correction when comparing models.
- Use held-out evaluation sets not seen during model selection.
- Report McNemar's test for paired model comparisons on the same instances.
- Require pre-registration of evaluation protocols before model training.
Significance thresholds for benchmarks: With thousands of pairwise comparisons, the BH-adjusted threshold for the smallest p-values falls far below 0.05. For the top-ranked model to be significantly different from the second, test sets of several thousand examples per benchmark are typically required (see the power analysis in Section 3.4).
6.5 Bayesian Alternative Preview
Classical multiple testing corrections (Bonferroni, BH) are explicitly frequentist: they control long-run error rates without asking "what is the probability that is true?" The Bayesian framework offers a fundamentally different approach.
Preview: Bayesian Model Comparison
Given observed data D, the Bayes factor for H_1 vs. H_0 is:
BF_10 = P(D | H_1) / P(D | H_0) = \int L(\theta; D) \pi_1(\theta) d\theta / \int L(\theta; D) \pi_0(\theta) d\theta
The Bayes factor naturally accounts for model complexity (Occam's razor) and provides a direct measure of evidence. In the multiple testing setting, Bayesian methods control the posterior expected FDR by placing a prior on the proportion of true nulls \pi_0.
-> Full treatment: Section04 Bayesian Inference
7. Nonparametric Tests
7.1 Why Nonparametric?
Classical tests (t, F, z) assume the data follow a specific parametric family (usually Gaussian). What if:
- The data are ordinal (rankings, Likert scales)?
- The sample size is small (say n < 30) and normality is implausible?
- The data contain extreme outliers that violate distributional assumptions?
- You want an exact test without large-sample approximations?
Nonparametric tests make no (or minimal) distributional assumptions. The trade-off: they are typically less powerful than their parametric counterparts when the parametric assumptions hold, but more robust when those assumptions fail.
Distribution-free vs. nonparametric: A test is distribution-free if its null distribution is the same regardless of the data distribution. Permutation tests and rank tests are distribution-free. A test is nonparametric in the sense that it estimates a non-finite-dimensional quantity. The terms are often used interchangeably.
7.2 Permutation and Randomization Tests
Motivation: If H_0: F_X = F_Y (the two groups have the same distribution), then under H_0 the group labels are exchangeable. We can compute the null distribution of any test statistic exactly by enumerating all label permutations.
Algorithm (two-sample permutation test):
- Compute the observed test statistic T_obs (e.g., the difference in means \bar X_1 - \bar X_2).
- For b = 1, \dots, B: randomly permute the combined group labels; recompute the statistic T_b.
- Estimate the p-value: \hat p = ( 1 + #{ b : |T_b| \ge |T_obs| } ) / (B + 1).
Properties:
- Exact (not asymptotic) when all permutations are enumerated.
- Valid for any test statistic, no distributional assumptions.
- Computationally expensive for large samples (use Monte Carlo permutations).
- Any statistic: Unlike t-tests, permutation tests work for medians, trimmed means, Gini coefficients, AUC, or custom ML metrics.
For AI: When comparing two LLMs on a shared benchmark, a permutation test on per-example score differences avoids all distributional assumptions. On small test sets, a permutation test with B \approx 10,000 resamples has better calibration than a t-test.
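A minimal permutation-test sketch on synthetic per-prompt scores (the score distributions, sample size, and B are illustrative):

```python
import numpy as np

def permutation_test_mean_diff(x, y, n_perm=10_000, seed=0):
    """Two-sample permutation test for a difference in means (labels exchangeable under H_0)."""
    rng = np.random.default_rng(seed)
    combined = np.concatenate([x, y])
    n_x = len(x)
    t_obs = x.mean() - y.mean()
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(combined)
        t_b = perm[:n_x].mean() - perm[n_x:].mean()
        count += abs(t_b) >= abs(t_obs)
    return (count + 1) / (n_perm + 1)       # add-one correction keeps the p-value > 0

# Hypothetical per-prompt scores for two LLMs on 30 shared prompts
rng = np.random.default_rng(4)
scores_a = rng.normal(0.70, 0.15, size=30)
scores_b = rng.normal(0.62, 0.15, size=30)
print(f"permutation p-value: {permutation_test_mean_diff(scores_a, scores_b):.4f}")
```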
7.3 Rank-Based Tests
Mann-Whitney U test (Wilcoxon rank-sum): Two-sample test. Combine and rank all n_1 + n_2 observations. Let R_1 = sum of ranks in group 1 and U = R_1 - n_1(n_1+1)/2. Under H_0: E[U] = n_1 n_2 / 2 and Var(U) = n_1 n_2 (n_1 + n_2 + 1)/12, with an approximately normal distribution for moderate samples.
AUC connection: The Mann-Whitney U statistic has a beautiful probabilistic interpretation:
U / (n_1 n_2) = \hat P(X > Y)
This is exactly the empirical AUC - the probability that a random draw from group 1 exceeds a random draw from group 2. A Mann-Whitney test is equivalent to testing whether AUC = 0.5. This unifies hypothesis testing with classifier evaluation.
Wilcoxon signed-rank test: Paired two-sample test. Compute differences D_i = X_i - Y_i. Rank the |D_i|. W = sum of the ranks of the positive differences. Under H_0: E[W] = n(n+1)/4.
Kruskal-Wallis test: Extension of Mann-Whitney to k groups. Rank all observations jointly; the test statistic is based on the between-group variability of ranks. Under H_0 it is approximately \chi^2_{k-1}.
7.4 Kolmogorov-Smirnov Test
One-sample KS test: Test whether a sample comes from a specified distribution F_0:
D_n = sup_x | \hat F_n(x) - F_0(x) |,
where \hat F_n is the empirical CDF. Under H_0, \sqrt n D_n has a known limiting distribution (the Kolmogorov distribution) that does not depend on F_0.
Two-sample KS test: Test whether two samples of sizes n and m share the same distribution:
D_{n,m} = sup_x | \hat F_n(x) - \hat G_m(x) |
Under H_0: \sqrt{ nm/(n+m) } D_{n,m} ->_d K, where K is the Kolmogorov distribution. Reject for large D_{n,m}.
Properties:
- Sensitive to differences in location, scale, and shape - not just means.
- Consistent against all continuous alternative distributions.
- Less powerful than t-test against pure location shifts (it wastes power on shape/scale).
- CDF-based, so naturally handles multivariate data via joint ECDFs (though the asymptotic distribution changes).
For AI - Data drift detection: The two-sample KS test is the most widely used drift detector in production ML:
DRIFT DETECTION PIPELINE
========================================================================
Training data distribution: F_train(x)
Production batch (daily): F_prod(x)
For each feature j:
Compute D_j = sup_x |F_train(x_j) - F_prod(x_j)|
Compute p_j = KS test p-value
Apply BH correction across all features
Alert if: \\exists j with q_j < 0.05 (BH-adjusted)
Report: Which features drifted and by how much
========================================================================
Limitations: KS tests features marginally (one at a time). For multivariate drift, use Maximum Mean Discrepancy (MMD) or domain classifier-based tests.
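A compact sketch of the pipeline above: per-feature two-sample KS tests with a BH step-up across features (the feature names, dictionary interface, and drifted distribution are hypothetical):

```python
import numpy as np
from scipy import stats

def detect_drift(train_cols, prod_cols, alpha=0.05):
    """Per-feature two-sample KS tests with a BH correction across features.

    train_cols / prod_cols: dict mapping feature name -> 1-D numpy array.
    Returns the set of features flagged as drifted.
    """
    names = list(train_cols)
    pvals = np.array([
        stats.ks_2samp(train_cols[f], prod_cols[f]).pvalue for f in names
    ])
    # BH step-up across the m features
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    flagged = set()
    if below.any():
        k = np.nonzero(below)[0].max()
        flagged = {names[i] for i in order[: k + 1]}
    return flagged

# Toy example: feature "x2" shifts in production, "x1" does not
rng = np.random.default_rng(5)
train = {"x1": rng.normal(0, 1, 5000), "x2": rng.normal(0, 1, 5000)}
prod = {"x1": rng.normal(0, 1, 1000), "x2": rng.normal(0.3, 1, 1000)}
print(detect_drift(train, prod))   # expect {'x2'}
```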
7.5 Bootstrap Hypothesis Tests
The bootstrap (Efron 1979) provides a general method for constructing null distributions without parametric assumptions. Reviewed in Section02 for CIs; here we use it for testing.
Bootstrap test for two-sample means (H_0: \mu_X = \mu_Y):
- Compute T_obs = \bar X - \bar Y.
- Shift both samples to have equal means: \tilde X_i = X_i - \bar X + \bar Z, \tilde Y_j = Y_j - \bar Y + \bar Z (where \bar Z is the pooled mean). Now H_0 holds exactly in the shifted data.
- Draw B bootstrap samples (with replacement) from \tilde X and \tilde Y and compute T_b = \bar X*_b - \bar Y*_b for each.
- \hat p = ( 1 + #{ b : |T_b| \ge |T_obs| } ) / (B + 1).
Bootstrap for complex statistics: The t-test requires normality for exact validity. Bootstrap tests work for any statistic: median differences, correlation coefficients, AUC, BLEU scores, F1 scores - anything you can compute on resampled data.
For AI: Bootstrap CI and tests are standard for NLP evaluation. When comparing BLEU or ROUGE scores, a paired bootstrap test (sampling test-set instances) is the gold standard, as used by Koehn (2004) and standard in MT evaluation.
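A sketch of a paired bootstrap test on per-instance score differences, following the shift-to-satisfy-H_0 recipe above (the function name, B, and the synthetic scores are ours):

```python
import numpy as np

def paired_bootstrap_pvalue(scores_a, scores_b, n_boot=10_000, seed=0):
    """Paired bootstrap test for H_0: equal mean score on the same test instances."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(diffs)
    t_obs = diffs.mean()
    centred = diffs - t_obs                      # impose H_0: mean difference = 0
    boot = np.array([
        centred[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)
    ])
    return (np.sum(np.abs(boot) >= abs(t_obs)) + 1) / (n_boot + 1)

# Hypothetical per-sentence metric scores for two MT systems on 500 shared sentences
rng = np.random.default_rng(9)
metric_a = rng.normal(0.31, 0.08, size=500)
metric_b = rng.normal(0.29, 0.08, size=500)
print(f"paired bootstrap p = {paired_bootstrap_pvalue(metric_a, metric_b):.4f}")
```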
8. A/B Testing and ML Evaluation
8.1 The A/B Testing Framework
An A/B test is a randomised controlled experiment comparing two (or more) versions of a system. The framework:
- Define the primary metric (CTR, revenue per user, model accuracy).
- Define guardrail metrics (latency, crash rate, user retention) that must not degrade.
- Pre-specify \alpha, power 1-\beta, and the MDE (minimum detectable effect) before data collection.
- Randomise units (users, sessions, requests) to treatment and control.
- Run until the pre-specified sample size is reached (or sequential stopping criterion is met).
- Analyse with the appropriate test and report effect size + CI.
Unit of randomisation: The choice of randomisation unit is critical.
- User-level: Each user sees only one variant. Avoids within-user interference. Used for UI changes.
- Session-level: Users can see both variants in different sessions. Higher statistical power but potentially biased.
- Request-level: Each request is independently assigned. Maximum power; appropriate for stateless ML inference.
The experimental design matters more than the test: Even the perfect hypothesis test cannot salvage a poorly designed experiment. Survivorship bias, Novelty effects, and Simpson's paradox are design problems, not statistical ones.
8.2 Sequential A/B Testing
The peeking problem: If you check p-values daily and stop as soon as p < 0.05, you have not run an \alpha = 0.05 test. You have run a repeated testing procedure with inflated Type I error. For \alpha = 0.05, peeking five times inflates the actual error rate to roughly 14%; peeking indefinitely drives the Type I error toward 1.
Sequential Probability Ratio Test (SPRT, Wald 1943): A test with no fixed sample size that guarantees both P(Type I error) \le \alpha and P(Type II error) \le \beta, while stopping as early as possible.
Given observations x_1, x_2, \dots, compute the running log likelihood ratio:
\Lambda_n = \sum_{i=1}^n log[ f(x_i; \theta_1) / f(x_i; \theta_0) ]
Decision rule: At each step:
- If \Lambda_n \ge log( (1-\beta)/\alpha ): reject H_0 (accept H_1).
- If \Lambda_n \le log( \beta/(1-\alpha) ): accept H_0.
- Otherwise: continue sampling.
Wald's bounds: The thresholds A = (1-\beta)/\alpha and B = \beta/(1-\alpha) guarantee (approximately) error rates \le \alpha and \le \beta, with minimal expected sample size compared to fixed-n tests.
Mixture Sequential Probability Ratio Test (mSPRT): An extension, analysed by Johari et al. (2022), that replaces the single alternative with a mixture distribution over alternatives, producing "always-valid p-values" - p-values that can be checked at any time without inflating error rates. This is the theoretical foundation for modern continuous A/B testing platforms (Spotify, Netflix, Booking.com).
For AI: The standard "wait N days, then look at p-value" A/B protocol is inefficient. Sequential testing with mSPRT or anytime-valid confidence sequences allows early stopping when effects are clear, reducing the cost of failed experiments by 30-50%.
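A minimal Bernoulli SPRT sketch with Wald's boundaries (the rates p0 = 0.10 and p1 = 0.12 and the error levels are illustrative):

```python
import numpy as np

def bernoulli_sprt(stream, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H_0: p = p0 vs. H_1: p = p1 on a stream of 0/1 outcomes."""
    upper = np.log((1 - beta) / alpha)      # crossing above -> reject H_0
    lower = np.log(beta / (1 - alpha))      # crossing below -> accept H_0
    llr = 0.0
    n = 0
    for n, x in enumerate(stream, start=1):
        llr += x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "accept H0", n
    return "undecided", n

rng = np.random.default_rng(6)
# Simulated conversions whose true rate matches the alternative p1 = 0.12
decision, n_used = bernoulli_sprt(rng.binomial(1, 0.12, size=20_000), p0=0.10, p1=0.12)
print(decision, "after", n_used, "observations")
```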
8.3 Model Comparison Tests
Paired t-test on accuracy: Compare models A and B on the same test examples. For each example i, record whether model A was correct (a_i \in {0,1}) and whether model B was correct (b_i \in {0,1}). Compute differences d_i = a_i - b_i and apply a one-sample t-test to H_0: E[d_i] = 0.
McNemar's test: More appropriate for binary outcomes. Contingency table of (correct/incorrect) pairs:
| | B correct | B incorrect |
|---|---|---|
| A correct | n_11 | n_10 |
| A incorrect | n_01 | n_00 |
The test statistic \chi^2 = (n_10 - n_01)^2 / (n_10 + n_01) ~ \chi^2_1 under H_0 (models have equal accuracy). Only the discordant pairs (n_10, n_01) contribute - concordant pairs carry no information about which model is better.
Diebold-Mariano test: For comparing two forecasters. Test H_0: E[d_t] = 0, where d_t is the loss differential between the two forecasts at time t. Uses a HAC-robust variance estimator to handle serial correlation in d_t.
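A small McNemar sketch on synthetic correctness indicators (the accuracies and independence of the two columns are illustrative only; real comparisons should use genuinely paired predictions on the same examples):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a_correct = rng.binomial(1, 0.87, size=1000)   # model A correct/incorrect per example
b_correct = rng.binomial(1, 0.85, size=1000)   # model B correct/incorrect per example

n10 = int(np.sum((a_correct == 1) & (b_correct == 0)))   # A right, B wrong
n01 = int(np.sum((a_correct == 0) & (b_correct == 1)))   # A wrong, B right

# McNemar's statistic uses only the discordant pairs
chi2 = (n10 - n01) ** 2 / (n10 + n01)
p = stats.chi2.sf(chi2, df=1)
print(f"n10 = {n10}, n01 = {n01}, chi2 = {chi2:.2f}, p = {p:.4f}")
```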
8.4 Data Drift Detection
Covariate shift: The input distribution P(X) changes between training and deployment, but the conditional P(Y | X) remains stable. This is the most common drift type in production ML.
Concept drift: The relationship P(Y | X) changes. Harder to detect without labels.
Statistical tests for drift:
| Test | Detects | Suitable for |
|---|---|---|
| KS test (per feature) | Distributional shift | Continuous features, univariate |
| Chi-squared (per feature) | Distributional shift | Categorical features |
| MMD | Multivariate shift | High-dimensional features |
| LSDD | Local shift | Detecting where distributions differ |
| PSI | Magnitude of shift | Production monitoring, tabular data |
Population Stability Index (PSI): A practitioner-favourite drift metric:
PSI = \sum_b ( p_b^{prod} - p_b^{train} ) ln( p_b^{prod} / p_b^{train} ),
where b indexes histogram bins and p_b denotes the fraction of observations in bin b. PSI < 0.1: no drift; 0.1-0.25: moderate drift; > 0.25: significant drift requiring retraining. Structurally, PSI is a symmetrised KL divergence between the binned distributions.
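A sketch of PSI computed from reference-quantile bins (the bin count, seeds, and drift magnitudes are illustrative):

```python
import numpy as np

def population_stability_index(reference, production, n_bins=10):
    """PSI between a reference (training) sample and a production sample of one feature."""
    reference = np.asarray(reference)
    production = np.asarray(production)
    # Bin edges from reference quantiles so each reference bin holds ~1/n_bins of the data
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0] = min(edges[0], production.min()) - 1e-9
    edges[-1] = max(edges[-1], production.max()) + 1e-9
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_frac = np.histogram(production, bins=edges)[0] / len(production)
    # Small floor avoids log(0) / division by zero in empty bins
    ref_frac = np.clip(ref_frac, 1e-6, None)
    prod_frac = np.clip(prod_frac, 1e-6, None)
    return float(np.sum((prod_frac - ref_frac) * np.log(prod_frac / ref_frac)))

rng = np.random.default_rng(8)
train = rng.normal(0, 1, 10_000)
print(f"{population_stability_index(train, rng.normal(0.0, 1, 5_000)):.3f}")  # near 0: no drift
print(f"{population_stability_index(train, rng.normal(0.5, 1, 5_000)):.3f}")  # large: drift
```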
8.5 LLM Evaluation and Leaderboards
Current best practices (2025-2026) for rigorous LLM evaluation:
- Bootstrap confidence intervals on aggregate scores: Sample test instances with replacement B times (B \approx 1,000-10,000), computing the benchmark score each time. Report the median +/- 95% CI.
- McNemar's test for pairwise comparisons: For two LLMs on the same benchmark, use McNemar's test (paired binary outcomes) rather than an unpaired proportion test.
- BH-corrected comparisons across benchmarks: When reporting "Model X outperforms Model Y on k benchmarks", apply BH at \alpha = 0.05 and report the q-values.
- Effect sizes, not just p-values: Report Cohen's h (for accuracy differences), or normalised score differences, alongside p-values.
- Power analysis for benchmark design: A new benchmark should contain enough items to detect a 1% accuracy difference with 80% power - typically several thousand to tens of thousands of items per model comparison (cf. Section 3.4).
- Chatbot Arena / ELO ratings: LMSYS Arena uses pairwise preference data to estimate ELO ratings. The uncertainty in ELO estimates should be reported as CIs derived from bootstrap resampling of preference pairs.
9. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Interpreting the p-value as P(H_0 given the data) | The p-value is a frequency, not a posterior probability | p = P(data this extreme | H_0 true); use a Bayes factor for posterior claims |
| 2 | Claiming "no effect" from p > 0.05 | Absence of evidence \neq evidence of absence; the test may be underpowered | Report power and 95% CI; use equivalence testing |
| 3 | Running many tests without correction | FWER inflates to near 1; produces spurious discoveries | Apply Bonferroni (few tests) or BH (many tests) |
| 4 | Peeking at data repeatedly and stopping at p < 0.05 | Actual Type I error rate far exceeds the nominal \alpha | Use sequential tests (SPRT, mSPRT) or pre-register a fixed n |
| 5 | Confusing statistical and practical significance | Large n can make trivial effects significant | Always report effect size (Cohen's d/h) alongside the p-value |
| 6 | HARKing: Hypothesising After Results Known | Converts exploratory analysis to confirmatory; p-values invalid | Pre-register hypotheses; treat post-hoc analysis as exploratory |
| 7 | Using pooled t-test when variances differ | Can inflate Type I error dramatically | Default to Welch's t-test; test variance equality only if motivated |
| 8 | Applying chi-squared with small expected counts | The chi-squared approximation fails; invalid p-values | Use Fisher's exact test when any E_i < 5 |
| 9 | Ignoring paired structure | Discards within-pair correlation; wastes power | Use paired t-test or Wilcoxon signed-rank for paired data |
| 10 | Not checking normality for small samples | t-test assumptions violated; p-values inaccurate | For n < 30 with skewed data, use a nonparametric or bootstrap test |
| 11 | Reporting only "p < 0.05, significant" | Loses information; invites binary thinking | Report exact p, effect size, CI, and power |
| 12 | One-tailed test chosen after seeing data direction | Halves the p-value post-hoc; inflates Type I error | Pre-register test direction or use two-tailed by default |
10. Exercises
Exercise 1 * - One-Sample t-Test from Scratch
A language model's token latency (ms) is measured on 20 requests: mean = 47.3 ms, sample std = 8.1 ms. The SLA requires mean latency \leq 45 ms.
(a) State H_0 and H_1 precisely. Is this one-sided or two-sided? (b) Compute the t-statistic and degrees of freedom. (c) Find the critical value at \alpha = 0.05. (d) Compute the exact p-value. (e) State your conclusion in plain English.
Exercise 2 * - Chi-Squared Goodness-of-Fit
A text classifier should distribute predictions uniformly across 5 categories. On 500 test examples, observed counts are [87, 113, 95, 102, 103].
(a) State H_0 and compute the expected counts. (b) Compute the \chi^2 statistic. (c) Find the p-value (df = 4). (d) Is the distribution significantly non-uniform at \alpha = 0.05? (e) Compute Cramer's V and interpret the effect size.
Exercise 3 * - Power and Sample Size
You want to detect that model A has a higher accuracy than model B (, ) with 80% power at .
(a) Compute Cohen's h for this effect. (b) Derive the required sample size per group for a two-proportion z-test. (c) What is the power if you can only collect per group? (d) Plot the power curve as a function of (from 500 to 5000). (e) What sample size gives 95% power?
Exercise 4 ** - Neyman-Pearson Lemma for Exponential Distribution
Let X_1, \dots, X_n be i.i.d. Exponential(\lambda). Test H_0: \lambda = \lambda_0 vs. H_1: \lambda = \lambda_1 > \lambda_0.
(a) Write the likelihood ratio \Lambda(x) = L(\lambda_1; x)/L(\lambda_0; x). (b) Show that rejecting when \Lambda > k is equivalent to rejecting when \sum_i X_i < c for some threshold c. (c) Find c in terms of \alpha, n, and \lambda_0 using the fact that \sum_i X_i ~ Gamma(n, \lambda_0) under H_0. (d) Verify numerically that this test has the correct size for a chosen n and \lambda_0. (e) Is this test UMP for all \lambda > \lambda_0? Justify using the MLR property.
Exercise 5 ** - Multiple Testing Correction
In an NLP evaluation, 50 hypothesis tests are conducted (comparing a new model to baseline on 50 benchmarks). The raw p-values are generated synthetically.
(a) Simulate 45 true nulls (p-values ~ Uniform[0,1]) and 5 true alternatives (p-values ~ Beta(0.2, 1)). (b) Count discoveries with no correction at . (c) Apply Bonferroni correction and count discoveries. (d) Apply BH correction and count discoveries. (e) Across 1000 simulation replications, estimate the empirical FWER and FDR for each method. Plot the results.
Exercise 6 ** - Permutation Test for Two-Sample Means
Two LLMs (A and B) are evaluated on 30 shared test prompts. Model A scores: drawn from . Model B scores: drawn from .
(a) Compute the observed mean difference. (b) Implement a permutation test with permutations. (c) Compute the permutation p-value. (d) Compare to a Welch t-test p-value on the same data. (e) Repeat 500 times and compare the empirical Type I error rates of both tests under .
Exercise 7 *** - Sequential A/B Test with SPRT
Two model variants A and B are tested on streaming requests. vs. . Set , .
(a) Derive the log-likelihood ratio for Bernoulli outcomes. (b) Compute the Wald stopping boundaries and . (c) Simulate the sequential process until stopping or . Plot vs. with the boundaries. (d) Compare the expected stopping time under and . (e) Estimate the empirical Type I error rate over 1000 simulated experiments where is true. Verify it is .
Exercise 8 *** - KS-Based Data Drift Detector for LLM Features
Build a drift detection system for an LLM serving system. Reference distribution: sentence embedding norms . Production batches vary.
(a) Simulate a reference dataset of 5,000 embeddings and three daily production batches: no drift, moderate drift (), severe drift (). (b) Apply the two-sample KS test to each batch vs. reference. (c) Apply BH correction across the 3 batch comparisons. (d) Implement a sliding window detector: alert if last 3 consecutive days all have KS . (e) Compare KS vs. a t-test drift detector: which is more sensitive to scale changes? Demonstrate with a scenario where the mean is unchanged but .
11. Why This Matters for AI (2026 Perspective)
| Concept | AI / LLM Application | Impact |
|---|---|---|
| p-values and significance | Model comparison on benchmark leaderboards | Prevents claiming spurious improvements; requires adequately sized test sets per comparison |
| Power analysis | Benchmark design; A/B experiment sizing | Determines minimum test set size to detect meaningful improvements |
| Type I / II error trade-off | Deployment gates (safety vs. capability) | Conservative \alpha (0.01) for safety tests; liberal \alpha (0.1) for early exploration |
| Multiple testing correction | Simultaneous evaluation across benchmarks | BH correction required when testing \geq 10 benchmarks |
| Welch t-test | Comparing model variants on different test sets | Default for unpaired, unequal-variance model comparisons |
| McNemar's test | Paired model comparison on shared test examples | Most powerful paired comparison for binary accuracy |
| GLRT / Wilks' theorem | Comparing nested model architectures by NLL | \chi^2 test on twice the difference in log-likelihoods; model selection |
| Wald test | Pruning significance of neural network weights | Test if weight significantly differs from zero before pruning |
| BH FDR correction | Multi-benchmark leaderboards (MMLU, HumanEval, etc.) | Controls false discovery rate across hundreds of simultaneous comparisons |
| Permutation test | LLM evaluation on custom metrics (BLEU, ROUGE, win rate) | Exact calibration without distributional assumptions |
| KS test | Production ML monitoring; data drift detection | Feature-level drift detection; triggers retraining pipeline |
| SPRT / mSPRT | Online A/B testing at scale (Spotify, Netflix, deployment) | Reduces experiment duration by 30-50% vs. fixed-n tests |
| Sequential testing | LLM RLHF reward model evaluation | Valid early stopping during human preference collection |
| PSI (Population Stability Index) | Model monitoring dashboards | Industry-standard drift metric for tabular features |
| Bootstrap hypothesis tests | Evaluation with small test sets | Valid inference without normality; standard in MT evaluation |
12. Conceptual Bridge
Looking Back: Estimation Theory (Section02)
Hypothesis testing builds directly on estimation theory (Section02). The estimators derived there - the sample mean \bar X, the MLE \hat\theta, the sample variance S^2 - reappear as the building blocks of every test statistic. The confidence interval duality (Section 2.5) makes this connection explicit: a confidence interval is the set of parameter values we would fail to reject, and the test is an inversion of the CI procedure.
The asymptotic normality of MLE (Section02 Section8) is the theoretical engine behind the Wald test and the asymptotic validity of the z-test for large samples. Fisher information (Section02 Section4) enters hypothesis testing through the score test and through the Cramer-Rao bound's role in characterising optimal tests.
Confidence intervals (Section02 Section7) and hypothesis tests are dual constructions: every confidence interval corresponds to a test, and every test corresponds to a confidence interval. Reporting CIs is strictly more informative, because CIs communicate effect size and precision, not just a binary reject/don't-reject decision.
Looking Forward: Bayesian Inference (Section04)
Section Section04 provides the Bayesian counterpart to every major concept in this section:
| Frequentist (Section03) | Bayesian (Section04) |
|---|---|
| p-value | Posterior probability P(H_0 given the data) |
| Significance test | Bayes factor |
| Confidence interval | Credible interval |
| FWER / FDR control | Prior on proportion of true nulls |
| Point null | Spike-and-slab prior centred at \theta_0 |
The philosophical divide is deep: frequentists refuse to assign probabilities to hypotheses (hypotheses are fixed; data are random). Bayesians treat parameters and hypotheses as random variables with prior distributions. The Bayesian framework provides a natural solution to multiple testing (the prior on the proportion of true nulls automatically corrects for multiplicity), but requires specification of that prior - a potential source of subjectivity.
For AI practitioners, the practical choice is often dictated by computational constraints and domain norms. Frequentist tests are fast and require no prior specification; Bayesian methods provide richer inference at the cost of prior elicitation and posterior computation.
Looking Further Forward: Regression (Section06)
The F-test derived in Section 4.4 reappears in Section06 as the overall F-test for regression significance. The t-test for individual regression coefficients (H_0: \beta_j = 0) is a direct application of the Wald test from Section 5.4. The multiple testing problem reappears when testing many coefficients simultaneously in high-dimensional regression - LASSO regularisation can be seen as an implicit multiple testing correction that shrinks small coefficients to zero.
POSITION IN CURRICULUM
========================================================================
Section02 ESTIMATION THEORY
MLE, Fisher info, CIs, asymptotic normality
|
v (test statistics are functions of estimators)
Section03 HYPOTHESIS TESTING <-- YOU ARE HERE
p-values, power, t/\\chi^2/F tests, LRT, multiple testing,
nonparametric tests, A/B testing, sequential tests
| |
v v
Section04 BAYESIAN INFERENCE Section06 REGRESSION ANALYSIS
(Bayes factors, posterior (F-test, t-tests on
probability of hypotheses) regression coefficients)
|
v
Ch8 OPTIMISATION
(RLHF experiment design,
model selection, early stopping)
========================================================================
Hypothesis testing is the formal language of scientific comparison. Every claim that "model A is better than model B", every statement that "this feature is significant", every assertion that "the distribution shifted" - all of these are hypothesis tests, whether or not they are recognised as such. Making these tests explicit, pre-specified, and properly corrected is the difference between rigorous science and post-hoc storytelling.
Appendix A: Key Distributions in Hypothesis Testing
| Distribution | PDF / PMF | Key role in testing |
|---|---|---|
| $N(0,1)$ | $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ | Z-test null distribution |
| $t_\nu$ | $\frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)} \left(1 + \frac{z^2}{\nu}\right)^{-(\nu+1)/2}$ | t-test null distribution |
| $\chi^2_k$ | $\frac{1}{2^{k/2}\Gamma(k/2)}\, x^{k/2 - 1} e^{-x/2}$ | Chi-squared, GLRT, Wald, Score test |
| $F_{d_1, d_2}$ | $\propto x^{d_1/2 - 1}\left(1 + \frac{d_1}{d_2}x\right)^{-(d_1 + d_2)/2}$ | F-test, ANOVA |
| Kolmogorov $K$ | CDF: $P(K \le x) = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 x^2}$ | KS test null distribution |
Relationships:
- $t_\nu^2 = F_{1,\nu}$ (the square of a t variable is F with df 1 in the numerator)
- $F_{d_1,d_2} = \dfrac{\chi^2_{d_1}/d_1}{\chi^2_{d_2}/d_2}$ (ratio of independent chi-squared variables divided by their df)
- $2\log\Lambda_n \xrightarrow{d} \chi^2_r$ (Wilks' theorem)
Appendix B: Critical Values Reference
| Test | $\alpha = 0.10$ | $\alpha = 0.05$ | $\alpha = 0.01$ |
|---|---|---|---|
| $z$ (two-sided) | 1.645 | 1.960 | 2.576 |
| $t_\nu$ (two-sided) | df-dependent; see Appendix W.2 | | |
| $\chi^2_1$ | 2.706 | 3.841 | 6.635 |
| $\chi^2_5$ | 9.236 | 11.070 | 15.086 |
| $\chi^2_{10}$ | 15.987 | 18.307 | 23.209 |
| $F_{1,30}$ | 2.881 | 4.171 | 7.562 |
| $F_{3,30}$ | 2.276 | 2.922 | 4.510 |
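These critical values can be reproduced directly from SciPy's quantile functions; a minimal sketch (the $\alpha$ levels and degrees of freedom match the table above):

```python
from scipy import stats

alphas = [0.10, 0.05, 0.01]

# Two-sided z critical values: z_{1 - alpha/2}
print([round(stats.norm.ppf(1 - a / 2), 3) for a in alphas])   # 1.645, 1.960, 2.576

# Chi-squared upper-tail critical values for df = 1, 5, 10
for df in (1, 5, 10):
    print(df, [round(stats.chi2.ppf(1 - a, df), 3) for a in alphas])

# F upper-tail critical values, e.g. F(1, 30) and F(3, 30)
for dfn, dfd in ((1, 30), (3, 30)):
    print((dfn, dfd), [round(stats.f.ppf(1 - a, dfn, dfd), 3) for a in alphas])
```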
Appendix C: Statistical Testing in Python
```python
from scipy import stats
import numpy as np

# One-sample t-test
t_stat, p_val = stats.ttest_1samp(data, popmean=mu0)

# Welch two-sample t-test
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)

# Paired t-test
t_stat, p_val = stats.ttest_rel(before, after)

# Chi-squared goodness-of-fit
chi2, p_val = stats.chisquare(observed, expected)

# Chi-squared test of independence
chi2, p_val, dof, expected = stats.chi2_contingency(contingency_table)

# One-way ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)

# Mann-Whitney U test
u_stat, p_val = stats.mannwhitneyu(x, y, alternative='two-sided')

# Kolmogorov-Smirnov two-sample test
ks_stat, p_val = stats.ks_2samp(sample1, sample2)

# Wilcoxon signed-rank test
w_stat, p_val = stats.wilcoxon(differences)
```
Appendix D: Glossary
| Term | Definition |
|---|---|
| p-value | $P_{H_0}(T \ge t_{\text{obs}})$; probability of data this extreme under $H_0$ |
| Size | $\sup_{\theta \in \Theta_0} P_\theta(\text{reject } H_0)$; actual Type I error rate |
| Level | Upper bound on size; a test of level $\alpha$ has size $\le \alpha$ |
| Power | $1 - \beta = P_{\theta_1}(\text{reject } H_0)$; probability of correctly detecting the alternative |
| Consistent test | Power $\to 1$ as $n \to \infty$ for all $\theta \in \Theta_1$ |
| UMP test | Uniformly most powerful; maximises power at every $\theta \in \Theta_1$ among level-$\alpha$ tests |
| FWER | Family-wise error rate; probability of at least one false rejection |
| FDR | False discovery rate; expected proportion of false rejections among all rejections |
| SPRT | Sequential probability ratio test; optimal sequential test (Wald 1943) |
| mSPRT | Mixture SPRT; produces always-valid p-values for continuous monitoring |
| MLR | Monotone likelihood ratio; condition guaranteeing existence of UMP tests |
Appendix E: Proof of the Bonferroni Inequality
Lemma (Bonferroni). Let $A_1, \ldots, A_m$ be events. Then:
$P\!\left(\bigcup_{i=1}^{m} A_i\right) \le \sum_{i=1}^{m} P(A_i).$
Proof: By induction from $P(A \cup B) = P(A) + P(B) - P(A \cap B) \le P(A) + P(B)$, since the intersection term being subtracted is non-negative.
Application to FWER: Let $A_i = \{\text{test } i \text{ falsely rejects } H_{0i}\}$. If each test has size $\alpha/m$, then $P(A_i) \le \alpha/m$ and:
$\text{FWER} = P\!\left(\bigcup_i A_i\right) \le \sum_i P(A_i) \le m \cdot \frac{\alpha}{m} = \alpha.$
The Bonferroni correction is conservative because the inequality is tight only when the $A_i$ are mutually exclusive - which is the worst case for the union bound.
Simes' inequality (1986): For independent tests under the complete null, the probability that any $p_{(i)} \le \frac{i}{m}\alpha$ (the BH threshold) is exactly $\alpha$. This is sharper than Bonferroni and is the basis of the BH procedure's validity proof.
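A quick Monte Carlo sketch of both bounds under the complete null (independent $U(0,1)$ p-values; the number of tests and simulations are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, alpha, n_sims = 20, 0.05, 20_000

bonf_reject = simes_reject = 0
for _ in range(n_sims):
    p = rng.uniform(size=m)                       # complete null: p-values are U(0,1)
    bonf_reject += (p.min() <= alpha / m)         # Bonferroni: any p below alpha/m
    p_sorted = np.sort(p)
    thresholds = alpha * np.arange(1, m + 1) / m  # BH / Simes thresholds i*alpha/m
    simes_reject += np.any(p_sorted <= thresholds)

print(f"Bonferroni FWER ~ {bonf_reject / n_sims:.3f}  (<= {alpha})")
print(f"Simes size     ~ {simes_reject / n_sims:.3f}  (= {alpha} for independent nulls)")
```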
Appendix F: Derivation of t-Distribution
Setup: $X_1, \ldots, X_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$. Show that $T = \dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$.
Step 1: $\bar{X} \sim N(\mu, \sigma^2/n)$, so $Z = \dfrac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0, 1)$.
Step 2: $\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$ (Cochran's theorem; requires normality of the $X_i$, which holds for Gaussian data).
Step 3: $\bar{X}$ and $S^2$ are independent (also Cochran).
Step 4: By definition of the t-distribution, if $Z \sim N(0,1)$ and $V \sim \chi^2_\nu$ are independent, then $\dfrac{Z}{\sqrt{V/\nu}} \sim t_\nu$. Apply with $\nu = n-1$:
$T = \dfrac{Z}{\sqrt{\frac{(n-1)S^2/\sigma^2}{n-1}}} = \dfrac{\sqrt{n}(\bar{X} - \mu)/\sigma}{S/\sigma} = \dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}.$
Why t has heavier tails than Normal: The denominator $S/\sqrt{n}$ is random. On lucky samples, $S$ is small, making $|T|$ large. On unlucky samples, $S$ is large, making $|T|$ small. This extra randomness spreads the distribution's tails. As $n \to \infty$, $S \to \sigma$ by the LLN, and $t_{n-1} \to N(0,1)$.
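A simulation sketch of the result: studentised means of Gaussian samples should match the $t_{n-1}$ tail probabilities and be visibly heavier-tailed than the standard normal (the sample size and parameters below are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 8, 100_000
x = rng.normal(loc=3.0, scale=2.0, size=(reps, n))     # any mu, sigma works
t = (x.mean(axis=1) - 3.0) / (x.std(axis=1, ddof=1) / np.sqrt(n))

# Compare simulated tail probabilities with t_{n-1} and N(0,1)
for q in (1.0, 2.0, 3.0):
    print(q, (t > q).mean(), stats.t.sf(q, df=n - 1), stats.norm.sf(q))
```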
Appendix G: Power Analysis - Detailed Derivations
One-sample z-test power derivation:
Under $H_1: \mu = \mu_0 + \delta$, the test statistic $Z = \dfrac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma}$ has distribution:
$Z \sim N(\sqrt{n}\,d,\ 1)$, where $d = \delta/\sigma$ is Cohen's d.
Two-sided rejection region: $|Z| > z_{1-\alpha/2}$. Power:
$1 - \beta = \Phi\!\left(\sqrt{n}\,d - z_{1-\alpha/2}\right) + \Phi\!\left(-\sqrt{n}\,d - z_{1-\alpha/2}\right).$
For $d > 0$ (so that the first term dominates), the second term is negligible, giving:
$1 - \beta \approx \Phi\!\left(\sqrt{n}\,d - z_{1-\alpha/2}\right).$
Setting the right-hand side equal to the desired power $1 - \beta$ and solving for $n$:
$n = \left(\dfrac{z_{1-\alpha/2} + z_{1-\beta}}{d}\right)^2.$
Power table for the two-sample test ($\alpha = 0.05$, power $= 0.80$, two-sided; the per-group requirement is roughly twice the one-sample $n$):
| Cohen's d | n per group |
|---|---|
| 0.20 (small) | 393 |
| 0.50 (medium) | 64 |
| 0.80 (large) | 26 |
| 1.00 (very large) | 17 |
For accuracy comparisons (two proportions, $\alpha = 0.05$, power $= 0.80$):
| Accuracy gap | Baseline | n per group |
|---|---|---|
| 0.5% | 85% | ~28,000 |
| 1.0% | 85% | ~7,200 |
| 2.0% | 85% | ~1,800 |
| 5.0% | 85% | ~310 |
These numbers explain why ML benchmark evaluations are so often underpowered: a 5% absolute improvement requires only 310 examples per model, but a 1% improvement requires 7,200 - yet many benchmarks have 1,000-3,000 examples total.
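The sample-size formula derived above can be cross-checked against statsmodels' exact noncentral-t computation; a minimal sketch:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation per-group sample size for a two-sample test."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    return 2 * ((z_a + z_b) / d) ** 2

for d in (0.2, 0.5, 0.8, 1.0):
    approx = int(np.ceil(n_per_group(d)))
    exact = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"d = {d:.1f}: normal approximation n = {approx}, exact t-based n = {exact:.1f}")
```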
Appendix H: Exact Permutation Distribution
For a two-sample test with $n_1 + n_2$ observations, there are $\binom{n_1 + n_2}{n_1}$ possible label assignments under $H_0$. For $n_1 = n_2 = 10$: $\binom{20}{10} = 184{,}756$ permutations - feasible to enumerate exactly. For $n_1 = n_2 = 20$: $\binom{40}{20} \approx 1.4 \times 10^{11}$ - use Monte Carlo with a large number $B$ of random permutations instead.
Exactness: The Monte Carlo permutation p-value $\hat{p} = \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}\{|T_b| \ge |T_{\text{obs}}|\}$ is an unbiased estimate of the true permutation p-value. Adding 1 to numerator and denominator (standard practice) ensures $\hat{p} \ge 1/(B+1) > 0$ and conservatism.
Validity without normality: The permutation test is exactly valid for any test statistic, any sample size, and any continuous distribution. The only assumption is exchangeability under - which is guaranteed by randomisation in designed experiments.
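For small samples the full permutation distribution can be enumerated directly; a minimal sketch for the difference-in-means statistic (the synthetic data are illustrative only):

```python
import numpy as np
from itertools import combinations

def exact_permutation_pvalue(x, y):
    """Exact two-sided permutation p-value for the difference in means."""
    pooled = np.concatenate([x, y])
    n1, n = len(x), len(pooled)
    observed = abs(x.mean() - y.mean())
    count, total = 0, 0
    for idx in combinations(range(n), n1):          # every possible label assignment
        mask = np.zeros(n, dtype=bool)
        mask[list(idx)] = True
        diff = abs(pooled[mask].mean() - pooled[~mask].mean())
        count += diff >= observed
        total += 1
    return count / total

rng = np.random.default_rng(2)
x, y = rng.normal(0.8, 1, 8), rng.normal(0.0, 1, 8)
print(exact_permutation_pvalue(x, y))   # enumerates C(16, 8) = 12,870 assignments
```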
Appendix I: Benjamini-Hochberg Procedure - Step-by-Step
Input: p-values $p_1, \ldots, p_m$ (unordered); target FDR level $q$.
Algorithm:
- Sort: $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$.
- For each $i$ from $m$ down to 1: check whether $p_{(i)} \le \frac{i}{m} q$.
- Let $k = \max\{i : p_{(i)} \le \frac{i}{m} q\}$ (or $k = 0$ if no such $i$ exists).
- Reject $H_{(1)}, \ldots, H_{(k)}$.
Example: $m = 10$ tests, $q = 0.05$. Sorted p-values: 0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.396, 0.950.
BH thresholds ($\frac{i}{m} q$): 0.005, 0.010, 0.015, 0.020, 0.025, 0.030, 0.035, 0.040, 0.045, 0.050.
| $i$ | $p_{(i)}$ | $iq/m$ | $p_{(i)} \le iq/m$? |
|---|---|---|---|
| 1 | 0.001 | 0.005 | Yes |
| 2 | 0.008 | 0.010 | Yes |
| 3 | 0.039 | 0.015 | No |
| 4 | 0.041 | 0.020 | No |
Working from the bottom: $p_{(5)}, p_{(4)}, p_{(3)}$ all exceed their thresholds, but $p_{(2)} = 0.008 \le 0.010$ -> $k = 2$. Reject hypotheses 1 and 2.
Bonferroni would require $p \le \alpha/m = 0.005$ - only hypothesis 1 would be rejected. BH is more powerful.
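The worked example can be reproduced with statsmodels (the p-values below are the ten from the example):

```python
from statsmodels.stats.multitest import multipletests

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.396, 0.950]

reject_bh, _, _, _ = multipletests(p, alpha=0.05, method='fdr_bh')
reject_bonf, _, _, _ = multipletests(p, alpha=0.05, method='bonferroni')

print("BH rejects:        ", [i + 1 for i, r in enumerate(reject_bh) if r])    # hypotheses 1 and 2
print("Bonferroni rejects:", [i + 1 for i, r in enumerate(reject_bonf) if r])  # hypothesis 1 only
```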
Appendix J: Sequential Testing and the Optional Stopping Problem
The optional stopping theorem (Doob): For a martingale $(M_t)$ and a stopping time $\tau$: $E[M_\tau] = E[M_0]$ under mild conditions. Under $H_0$, the likelihood ratio $\Lambda_t$ is a martingale with $E[\Lambda_0] = 1$, so $E[\Lambda_\tau] = 1$.
Why peeking inflates error: If you peek at p-values and stop whenever $p < 0.05$, you are effectively running a random walk and stopping when it first crosses a boundary. Boundary crossings are more frequent than the fixed-$n$ analysis assumes, inflating the Type I error.
Ville's inequality: For a non-negative supermartingale $(M_t)$ with $M_0 = 1$ and any stopping time:
$P\!\left(\sup_t M_t \ge \frac{1}{\alpha}\right) \le \alpha.$
This is the key inequality behind always-valid p-values: if you stop when $\Lambda_t \ge 1/\alpha$ (equivalently, when the always-valid p-value drops below $\alpha$), the false positive rate is controlled at $\alpha$ regardless of when you stop.
E-values: A recent (2020+) framework replaces p-values with e-values satisfying $E_{H_0}[E] \le 1$. E-values can be combined multiplicatively across observations and across experiments, and Ville's inequality guarantees $P_{H_0}(E \ge 1/\alpha) \le \alpha$ at any stopping time. E-values are the natural language for sequential testing and meta-analysis.
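A small simulation sketch of the peeking problem under $H_0$ (the number of looks, checkpoint spacing, and seed are arbitrary illustrative choices): repeatedly testing at interim checkpoints inflates the Type I error well above the nominal $\alpha$ of the fixed-$n$ test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_max, looks, alpha, sims = 1000, 20, 0.05, 2000
checkpoints = np.linspace(50, n_max, looks, dtype=int)
z_crit = stats.norm.ppf(1 - alpha / 2)

false_pos_fixed = false_pos_peek = 0
for _ in range(sims):
    x = rng.normal(size=n_max)                     # H0 is true: mean 0, sigma 1
    false_pos_fixed += abs(x.mean() * np.sqrt(n_max)) > z_crit
    # "Peeking": test at every checkpoint, stop at the first p < alpha
    for n in checkpoints:
        if abs(x[:n].mean() * np.sqrt(n)) > z_crit:
            false_pos_peek += 1
            break

print(f"Fixed-n Type I error ~ {false_pos_fixed / sims:.3f}")
print(f"Peeking Type I error ~ {false_pos_peek / sims:.3f}")   # well above alpha
```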
Appendix K: Worked Examples - Common Tests
K.1 One-Sample t-Test
Problem: A new LLM fine-tune is tested on 15 reasoning problems. Mean score = 72.3, sample std = 8.7. Baseline score = 68.0. Is the improvement significant at ?
Solution:
- $H_0: \mu = 68.0$, $H_1: \mu > 68.0$ (one-sided; the improvement was predicted).
- $t = \dfrac{72.3 - 68.0}{8.7/\sqrt{15}} = \dfrac{4.3}{2.246} = 1.91$.
- Critical value: $t_{0.95,\,14} = 1.761$.
- $1.91 > 1.761$: reject $H_0$.
- p-value $= P(t_{14} > 1.91) \approx 0.038$.
- Conclusion: The fine-tune shows a statistically significant improvement ($t(14) = 1.91$, one-sided $p \approx 0.04$).
- Effect size: Cohen's $d = 4.3/8.7 \approx 0.49$ (medium effect).
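The computation can be reproduced from the summary statistics alone; a minimal SciPy sketch:

```python
import numpy as np
from scipy import stats

# Summary statistics from K.1
n, xbar, s, mu0 = 15, 72.3, 8.7, 68.0

t = (xbar - mu0) / (s / np.sqrt(n))
p_one_sided = stats.t.sf(t, df=n - 1)
print(f"t = {t:.3f}, one-sided p = {p_one_sided:.3f}")     # t ~ 1.91, p ~ 0.04
print("critical value:", stats.t.ppf(0.95, df=n - 1))      # ~ 1.761
```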
K.2 Two-Proportion Z-Test for A/B Test
Problem: A chat interface is tested: the control group (size $n_C$) has a 12% click-through rate; the treatment group (size $n_T$) has a 13.5% CTR. Is the improvement significant?
Solution:
- $H_0: p_T = p_C$, $H_1: p_T \ne p_C$ (two-sided, pre-specified).
- Pooled proportion: $\hat{p} = \dfrac{x_C + x_T}{n_C + n_T}$.
- $SE = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_C} + \frac{1}{n_T}\right)}$.
- $z = \dfrac{\hat{p}_T - \hat{p}_C}{SE}$.
- With the group sizes used in this test, the p-value falls below 0.05: significant.
- Effect size: Cohen's $h = 2\left(\arcsin\sqrt{0.135} - \arcsin\sqrt{0.12}\right) \approx 0.045$ (very small).
- Decision: Statistically significant but effect is tiny. Consider cost of deployment vs. 1.5% CTR gain.
K.3 Chi-Squared Test of Independence
Problem: Test whether LLM output quality (good/bad) is independent of prompt language (English/French/Spanish/German). Contingency table:
| English | French | Spanish | German | |
|---|---|---|---|---|
| Good | 420 | 310 | 290 | 180 |
| Bad | 80 | 90 | 110 | 70 |
Solution:
- $H_0$: quality is independent of language.
- Row totals: 1200, 350; Column totals: 500, 400, 400, 250. Grand total: 1550.
- Expected counts: $E_{ij} = \dfrac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{1550}$; compute all 8 expected values.
- $\chi^2 = \sum_{ij} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}} \approx 22.1$, df $= (2-1)(4-1) = 3$.
- $p < 0.001$ (the $\chi^2_3$ critical value at $\alpha = 0.001$ is 16.27): strong evidence of language dependence.
- Cramer's V $= \sqrt{\dfrac{\chi^2}{N(\min(r,c) - 1)}} = \sqrt{22.1/1550} \approx 0.12$ (small to medium effect).
- Conclusion: Quality differs significantly across languages; Spanish and German have notably higher error rates.
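The same numbers fall out of scipy.stats.chi2_contingency; a minimal sketch:

```python
import numpy as np
from scipy import stats

table = np.array([[420, 310, 290, 180],    # Good
                  [ 80,  90, 110,  70]])   # Bad

chi2, p, dof, expected = stats.chi2_contingency(table)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2e}, Cramer's V = {cramers_v:.3f}")
```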
Appendix L: Further Reading
Core Textbooks
- Lehmann & Romano - Testing Statistical Hypotheses (3rd ed., 2005): The definitive theoretical reference. Covers the NP lemma, UMP tests, unbiasedness, invariance, and asymptotic theory with full proofs. Essential for anyone wanting the complete frequentist theory.
- Casella & Berger - Statistical Inference (2nd ed., 2001): Chapters 8-9 cover hypothesis testing at the graduate textbook level. Excellent balance of theory and computation.
- Wasserman - All of Statistics (2004): Compressed, modern treatment with connections to ML. Chapters 10-14 cover testing, p-values, and multiple testing.
- Efron & Hastie - Computer Age Statistical Inference (2016): Covers bootstrap, FDR, empirical Bayes, and algorithmic inference. Free PDF from Stanford.
ML-Specific References
- Dror et al. - "Deep Dominance: How to Properly Compare Deep Neural Models" (ACL 2019): Comprehensive study of hypothesis tests for NLP model comparison. Advocates for bootstrap and permutation tests over t-tests.
- Demsar - "Statistical Comparisons of Classifiers over Multiple Datasets" (JMLR, 2006): Recommends the Friedman test + Nemenyi post-hoc for comparing multiple classifiers across multiple datasets.
- Johari et al. - "Peeking at A/B Tests" (KDD 2017): The original paper on always-valid p-values and the mSPRT for online A/B testing.
- Ramdas et al. - "Testing Exchangeability: Fork-Convex Hulls, Supermartingales and e-Processes" (2022): Modern framework for e-values and anytime-valid inference.
- Schaeffer, Miranda & Koyejo - "Are Emergent Abilities of Large Language Models a Mirage?" (NeurIPS 2023): Demonstrates that many claimed LLM phase transitions are statistical artifacts of discontinuous metrics + multiple comparisons.
- Koehn - "Statistical Significance Tests for Machine Translation Evaluation" (EMNLP 2004): The canonical reference for bootstrap resampling in MT evaluation. Introduced paired bootstrap testing to NLP.
Appendix M: Advanced Topics in Hypothesis Testing
M.1 Composite Hypotheses and Nuisance Parameters
Many practical testing problems involve nuisance parameters - parameters that appear in the model but are not the focus of the test. For example, in the two-sample t-test, the common variance (or the two separate variances in Welch's test) are nuisance parameters when the hypothesis concerns the difference in means.
Problem: If $X_i \sim N(\mu_X, \sigma_X^2)$ and $Y_j \sim N(\mu_Y, \sigma_Y^2)$ with unknown, unequal variances (the Behrens-Fisher problem), there is no exact test. The Welch t-test provides an approximate solution via the Satterthwaite degrees-of-freedom approximation.
Conditional tests: One approach is to condition on sufficient statistics for the nuisance parameters. Fisher's exact test conditions on the row and column marginals of a contingency table - the marginals are ancillary for the association parameter of interest.
Profile likelihood: Replace nuisance parameters by their profile MLEs. The profile likelihood ratio test then has the same asymptotic distribution as the full GLRT.
M.2 Equivalence Testing and Non-Inferiority Tests
Classical hypothesis testing asks: "is there an effect?" But in ML deployment, the question is often reversed: "is the new model at least as good as the old one?" This requires equivalence testing or non-inferiority testing.
TOST (Two One-Sided Tests): To test that $|\mu_1 - \mu_2| < \Delta$ (practically equivalent):
- Test $H_{01}: \mu_1 - \mu_2 \le -\Delta$ at level $\alpha$ (one-sided).
- Test $H_{02}: \mu_1 - \mu_2 \ge \Delta$ at level $\alpha$ (one-sided).
- Conclude equivalence if both are rejected; a minimal sketch follows after this subsection.
The equivalence margin $\Delta$ must be pre-specified based on domain knowledge (e.g., "a difference of less than 0.5% accuracy is practically irrelevant").
Non-inferiority test: Show that the new model is not worse than the baseline by more than $\Delta$: test $H_0: \mu_{\text{new}} - \mu_{\text{old}} \le -\Delta$ against $H_1: \mu_{\text{new}} - \mu_{\text{old}} > -\Delta$.
Both frameworks are essential for responsible ML deployment: before retiring a production model, verify the replacement is not inferior beyond an acceptable margin.
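A minimal TOST sketch using two one-sided Welch t-tests; the margin `delta=0.01` and the synthetic accuracy scores are illustrative assumptions, not values from this section (statsmodels also ships a comparable helper, `ttost_ind`):

```python
import numpy as np
from scipy import stats

def tost(x, y, delta, alpha=0.05):
    """Two one-sided Welch t-tests for equivalence of means within +/- delta."""
    diff = x.mean() - y.mean()
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    se = np.sqrt(vx + vy)
    # Welch-Satterthwaite degrees of freedom
    df = se**4 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    p_lower = stats.t.sf((diff + delta) / se, df)    # H01: diff <= -delta
    p_upper = stats.t.cdf((diff - delta) / se, df)   # H02: diff >= +delta
    return p_lower, p_upper, (p_lower < alpha) and (p_upper < alpha)

rng = np.random.default_rng(4)
old = rng.normal(0.85, 0.05, 200)   # hypothetical accuracy scores
new = rng.normal(0.85, 0.05, 200)
print(tost(old, new, delta=0.01))
```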
M.3 Multiple Testing in Modern Machine Learning
Neural architecture search (NAS): Testing thousands of architectural variants involves extreme multiple comparisons. Without FDR correction, reported improvements are largely artifacts. Proper NAS evaluation requires:
- Held-out final evaluation (not the search objective).
- BH correction across all tried architectures.
- Multiple random seeds per architecture.
Hyperparameter tuning: Grid search over hyperparameters creates implicit multiple comparisons. Bayesian optimization with proper uncertainty quantification (Gaussian processes) naturally avoids this by reasoning about the distribution over hyperparameter performance rather than making independent comparisons.
Neural network weight testing: Magnitude pruning implicitly tests whether each weight is significantly different from zero. The formal version is a Wald test of $H_0: w_i = 0$ with statistic $\hat{w}_i / \widehat{se}(\hat{w}_i)$, where $\widehat{se}(\hat{w}_i)$ comes from the Fisher information matrix. Applying BH correction gives a principled sparse pruning criterion. This connects to the lottery ticket hypothesis: a subnetwork survives iff its weights are statistically distinguishable from zero.
M.4 Causal Inference and Hypothesis Testing
Standard hypothesis testing establishes association ($X$ and $Y$ are statistically dependent) but not causation (intervening on $X$ changes $Y$). The connection:
Randomised experiments: When treatments are randomly assigned (RCT), the two-sample t-test or Wilcoxon test on outcomes provides valid causal inference. Randomisation eliminates confounding, so association implies causation.
Observational studies: Without randomisation, a significant test only shows association. Causal inference requires additional assumptions (instrumental variables, regression discontinuity, difference-in-differences) and sensitivity analysis.
RLHF and causal testing: When evaluating whether RLHF improves a model, the "treatment" (RLHF fine-tuning) must be applied to otherwise identical models. Comparing a fine-tuned model to a different base model conflates the RLHF effect with base model differences.
Appendix N: Pitfalls in Benchmark Evaluation - Extended Analysis
N.1 The Evaluation Overfitting Problem
Adaptive data analysis: Every time a benchmark is used to select a model or tune hyperparameters, the benchmark becomes part of the training signal. The final evaluation on the same benchmark is biased upward.
Holdout sets: The standard remedy is a held-out test set that is never used for model selection. In practice, LLM benchmark contamination makes this extremely difficult - web-scraped training data often contains benchmark questions and answers.
Differential privacy approach: Dwork et al. (2015) showed that naive adaptive reuse of a holdout set quickly invalidates its statistical guarantees, and that mediating access through a differentially private mechanism (the "reusable holdout") lets researchers answer many more adaptive queries with valid guarantees. Either way, there is a hard limit on the number of models that can be compared on a single benchmark before results become meaningless.
N.2 The Multiple Metrics Problem
When a model is evaluated on 50 metrics (BLEU, ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, METEOR, ...) and reported as "best on 30 of 50", this is not a well-defined test result. The correct approach:
- Pre-specify the primary metric before evaluation.
- Report secondary metrics as exploratory with FDR-corrected p-values.
- Use a composite score (average normalised ranking across metrics) as the primary outcome.
N.3 Model Size and Benchmark Artefacts
Many apparent improvements in LLM evaluations are confounded with model size. Larger models score higher on essentially every benchmark not because of the specific training choices being evaluated, but because of the additional parameters. Proper evaluation must control for (or fix) model size.
Scaling law adjustments: When comparing models of different sizes, use scaling-law predictions (Chinchilla) to normalise scores to a common compute budget. A model that achieves score $S$ at compute $C$ FLOPs is better than one achieving the same score $S$ at a larger budget $C' > C$, even though their raw scores are identical.
Appendix O: Practice Problems
Problem O.1: Show that the chi-squared goodness-of-fit statistic can be written as $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = \sum_i \frac{O_i^2}{E_i} - n$. Use this to show that $\chi^2 = 0$ iff $O_i = E_i$ for all $i$.
Problem O.2: A coin is flipped 1000 times and 520 heads observed. Compute the p-value for $H_0: p = 0.5$ vs. $H_1: p \ne 0.5$. Is the coin significantly biased at $\alpha = 0.05$? What is the 95% CI for $p$? Verify the CI/test duality.
Problem O.3: Two classifiers are evaluated on 200 test examples. Classifier A is correct on 156, B on 148. They agree on 130 correct and 18 incorrect predictions. Set up and perform McNemar's test. Compare to a naive two-proportion z-test on the aggregate counts.
Problem O.4: Prove that $E[U] = n_1 n_2 \, P(X > Y)$ for the Mann-Whitney U statistic, where $X$ and $Y$ are independent draws from the two populations. Conclude that $U/(n_1 n_2)$ is an unbiased estimator of $P(X > Y)$, and that under $H_0$, $E[U] = n_1 n_2 / 2$.
Problem O.5: Consider $m$ independent tests where the true null proportion is $\pi_0$. Under BH at level $q$, derive the expected number of true and false discoveries as a function of the effect size under $H_1$. Plot the expected FDP (false discovery proportion) as a function of $\pi_0$.
Problem O.6: Implement the SPRT for comparing two Bernoulli distributions with $H_0: p = p_0$ vs. $H_1: p = p_1$. Run the test 10,000 times under $H_0$ (true $p = p_0$) and 10,000 times under $H_1$ (true $p = p_1$). Report: (a) empirical Type I error rate, (b) empirical Type II error rate, (c) distribution of stopping times under $H_0$ and $H_1$.
Problem O.7: You have 5,000 features in a fraud detection model. After retraining on new data, 847 features show distributional shift with raw p-value < 0.05 from KS tests. (a) How many false discoveries do you expect if no feature actually shifted? (b) Apply BH at your chosen FDR level $q$ and report the number of discoveries. (c) Which features would be worth investigating for a model update?
Problem O.8: A researcher claims that their new prompt engineering technique improves GPT-4 on MMLU from 86.4% to 87.1% (score on 14,000 questions). Perform a formal hypothesis test and compute the 95% CI. Is this improvement practically significant? What is the effect size (Cohen's h)?
Appendix P: Connection to Information Theory
Hypothesis testing has deep connections to information theory. These connections illuminate why certain tests are optimal and provide a geometric view of the testing problem.
P.1 KL Divergence and Test Power
The Chernoff information between two distributions $P$ and $Q$ is:
$C(P, Q) = -\min_{0 \le \lambda \le 1} \log \sum_x P(x)^\lambda Q(x)^{1-\lambda}.$
For $n$ independent observations, when both error probabilities are required to vanish, the best achievable error decays exponentially as $e^{-n\,C(P_0, P_1)}$. Chernoff information determines the fundamental limit on how fast hypothesis-testing errors vanish with sample size.
Special case: For testing simple $H_0: P_0$ against simple $H_1: P_1$ with the Type I error held fixed, Stein's lemma states:
$\lim_{n \to \infty} \frac{1}{n} \log \beta_n = -D(P_0 \,\|\, P_1),$
where $D(P_0 \,\|\, P_1)$ is the KL divergence from $P_0$ to $P_1$ and $\beta_n$ is the smallest achievable Type II error. More KL divergence -> test errors vanish faster. This is why KL divergence is the natural measure of "distance" between distributions in testing.
P.2 Sufficient Statistics and Data Processing
The Data Processing Inequality states that processing data (applying a function $T$) cannot increase information: $I(\theta; T(X)) \le I(\theta; X)$.
In hypothesis testing: a test statistic $T(X)$ is a deterministic function of $X$. By the DPI, $T(X)$ cannot be more informative about $\theta$ than $X$ itself. Equality holds when $T$ is a sufficient statistic - when $T(X)$ captures all the information in the data about $\theta$.
This gives an information-theoretic characterisation of sufficient statistics: $T$ is sufficient for $\theta$ iff $I(\theta; T(X)) = I(\theta; X)$, i.e., no information about $\theta$ is lost by replacing $X$ with $T(X)$. A test based on a sufficient statistic is just as powerful as one based on the full data.
P.3 Minimum Description Length and Hypothesis Selection
MDL (Minimum Description Length): The MDL principle selects the model that provides the shortest description of the data. For hypothesis testing:
- The null model describes the data with code length $L_0(x) = -\log p(x \mid H_0)$.
- The alternative describes the data with code length $L_1(x) = -\log p(x \mid \hat{\theta}_1) + (\text{complexity penalty for fitting } \hat{\theta}_1)$.
Reject $H_0$ if the alternative provides a shorter total description. This is equivalent to the GLRT when $H_0$ and $H_1$ are parametric models of different complexity - the MDL penalty for the more complex model plays the role of the chi-squared df in Wilks' theorem.
Connection to Bayes factors: If the priors on $H_0$ and $H_1$ are encoded as prefix codes, the Bayes factor $BF_{01}$ equals $e^{L_1 - L_0}$, where each $L_i$ is a total description length including both model complexity and the data given the model. MDL, Bayes factors, and the GLRT are three facets of the same information-theoretic principle: model complexity must be penalised when comparing models of different complexity.
Appendix Q: Numerical Examples - Power Curves
Q.1 Power Curve for One-Sample t-Test
For a one-sample test at a fixed sample size ($\sigma$ known, $\alpha = 0.05$, two-sided):
| True shift $\delta = \mu - \mu_0$ (in units of $\sigma$) | Cohen's d | Power |
|---|---|---|
| 0.0 | 0.00 | 0.050 (= $\alpha$) |
| 0.1 | 0.10 | 0.098 |
| 0.2 | 0.20 | 0.212 |
| 0.3 | 0.30 | 0.398 |
| 0.4 | 0.40 | 0.594 |
| 0.5 | 0.50 | 0.761 |
| 0.6 | 0.60 | 0.876 |
| 0.8 | 0.80 | 0.977 |
| 1.0 | 1.00 | 0.998 |
Observation: 80% power requires a Cohen's d between 0.5 and 0.6 at this sample size - a true mean roughly half a standard deviation away from the null.
Q.2 Effect of Sample Size on Power
For a two-sample t-test detecting $d = 0.5$ (medium effect) at $\alpha = 0.05$ (two-sided):
| $n$ per group | Power |
|---|---|
| 20 | 0.338 |
| 40 | 0.598 |
| 64 | 0.801 |
| 100 | 0.940 |
| 150 | 0.990 |
| 200 | 0.999 |
| 300 | > 0.999 |
The "required " of 64 achieves 70% power (not 80% - this is a common mistake in power calculation formulas; exact values depend on using the t-distribution vs. normal approximation).
Q.3 Multiple Testing Power Comparison
For $m = 100$ tests, of which 10 are true alternatives with a fixed moderate effect size, tested at $\alpha = 0.05$ per comparison:
| Method | FWER | FDR | Expected discoveries | Expected true disc. |
|---|---|---|---|---|
| No correction | ~1.00 | ~0.30 | 12 | 8.5 |
| Bonferroni | 0.05 | ~0.01 | 6.5 | 6.4 |
| Holm | 0.05 | ~0.02 | 6.8 | 6.7 |
| BH () | ~0.35 | 0.05 | 9.3 | 8.9 |
BH makes about 2.5 more true discoveries on average than Bonferroni (8.9 vs. 6.4 expected) at the cost of a higher FWER.
Appendix R: Connection to Decision Theory
R.1 Minimax Hypothesis Testing
Hypothesis testing can be formulated as a decision problem. Let:
- $L(d, H)$ = the loss of taking decision $d$ when hypothesis $H$ is true.
- Standard 0-1 loss: $L = 0$ for a correct decision, $L = 1$ for rejecting $H_0$ when it is true (Type I error), $L = 1$ for failing to reject when $H_1$ is true (Type II error).
The Bayes risk of a test $\delta$ with prior $(\pi_0, \pi_1)$ on the hypotheses:
$r(\delta) = \pi_0\,\alpha(\delta) + \pi_1\,\beta(\delta).$
Minimising the Bayes risk gives a likelihood ratio test (the NP lemma generalised to the Bayesian setting): reject when $\dfrac{f_1(x)}{f_0(x)} > \dfrac{\pi_0}{\pi_1}$.
The minimax test minimises the maximum risk over all priors:
$\delta^* = \arg\min_\delta \max\{\alpha(\delta),\ \beta(\delta)\}.$
For 0-1 loss, the minimax test is the one that equalises the two error probabilities ($\alpha = \beta$) - equivalently, the Bayes test under the least favourable prior.
R.2 Asymptotic Relative Efficiency
How much more data does one test need than another to achieve the same power? The asymptotic relative efficiency (ARE) of test B relative to test A is:
$\text{ARE}(B, A) = \lim \dfrac{n_A}{n_B},$
where $n_A$ and $n_B$ are the sample sizes the two tests need to achieve the same power against the same alternative.
Pitman efficiency computes the ARE against local alternatives $\theta_n = \theta_0 + h/\sqrt{n}$:
$\text{ARE}(B, A) = \dfrac{e_B}{e_A},$
where $e_T$ is the efficacy of test statistic $T$ (the squared rate of change of its mean under local alternatives, divided by its variance).
Key results:
- Wilcoxon vs. t-test for Gaussian data: $\text{ARE} = 3/\pi \approx 0.955$. The Wilcoxon test loses only about 4.5% efficiency.
- Wilcoxon vs. t-test for heavy-tailed data: $\text{ARE} > 1$. The Wilcoxon test can be substantially more efficient.
- Minimum ARE of Wilcoxon vs. t-test (over all symmetric continuous distributions): $0.864$ - the Wilcoxon test never needs more than about 16% more data.
This remarkable result (Hodges-Lehmann, 1956) justifies using the Wilcoxon test as a default nonparametric choice: even in the most favourable case for the t-test you lose at most 13.6% efficiency (at most ~16% more data), while potentially gaining large efficiency for non-normal distributions.
R.3 Sensitivity Analysis for Robust Testing
In observational studies, test validity depends on unverifiable assumptions. Sensitivity analysis asks: how strong would an unmeasured confounder need to be to explain the observed effect?
Rosenbaum's sensitivity parameter $\Gamma$: For a matched-pairs study, $\Gamma$ is the largest odds ratio of treatment assignment that an unmeasured binary confounder could induce. A result is "significant at sensitivity level $\Gamma$" if it remains significant even when allowing for a confounder with odds ratio up to $\Gamma$.
Report sensitivity: "Our finding remains significant at the $\Gamma = 2$ level, meaning an unmeasured confounder would need to double the odds of treatment to explain away the effect."
For AI experiments, sensitivity analysis is essential when comparing models across different data pipelines, hardware, or evaluation setups - all of which are potential confounders.
Appendix S: Statistical Testing Checklist for ML Practitioners
Before reporting any hypothesis test in a paper or technical document, verify the following:
S.1 Pre-Analysis Checklist
- Hypothesis pre-specified: $H_0$, $H_1$, and the primary metric were stated before data collection or model training.
- Test choice pre-specified: The specific test (t-test, McNemar, permutation, etc.) was chosen based on the study design, not on which test gives a lower p-value.
- Sample size justified: Power analysis was performed and the required $n$ was collected (or the power achieved at the actual $n$ is reported).
- Significance level stated: $\alpha$ was pre-specified (typically 0.05; consider 0.01 for high-stakes claims).
- Multiple comparisons planned: If multiple tests are planned, the correction method (Bonferroni/BH) was pre-specified.
S.2 Analysis Checklist
- Assumptions verified:
- Normality (for t-test): checked via Shapiro-Wilk or a Q-Q plot for small samples.
- Homoscedasticity (for pooled t-test/ANOVA): checked via Levene's test; use Welch if violated.
- Independence: observations are not clustered, repeated, or time-dependent.
- Correct test applied: Paired data -> paired test. Small expected counts -> Fisher's exact. Non-normal data with small $n$ -> nonparametric.
- Effect size computed: Cohen's d/h/f or Cramer's V reported alongside p-value.
- Confidence interval reported: 95% CI for the effect size (not just the p-value).
S.3 Reporting Checklist
- Exact p-value reported: Not just "p < 0.05" but the exact value (e.g., $p = 0.012$).
- Test statistic and df reported: "$t(\text{df}) = \ldots$, $p = \ldots$" is complete; a bare "$p < 0.05$" is not.
- Effect size and CI reported: e.g., "$d = 0.68$ (95% CI: [0.22, 1.14])".
- Sample size reported: $n$ per group for two-sample tests.
- Multiple testing correction applied: Which method and at what level.
- No HARK: Post-hoc analyses are clearly labelled as exploratory.
S.4 Interpretation Checklist
- Statistical vs. practical significance distinguished: A significant result with a tiny effect size may not justify deployment cost.
- Null result properly qualified: "We failed to find evidence of X" not "We showed X does not exist". Power and MDE are reported.
- Replication recommended: A single significant result (especially one with $p$ close to 0.05) should be confirmed in an independent replication.
Appendix T: Quick Reference - Test Statistics and Null Distributions
One-Sample Tests
| Test | Statistic | Null Distribution | When to Use |
|---|---|---|---|
| Z-test | $Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ | $N(0,1)$ | Normal, $\sigma$ known |
| One-sample t | $t = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}$ | $t_{n-1}$ | Normal, $\sigma$ unknown |
| Sign test | $S = \#\{i : X_i > \mu_0\}$ | $\text{Binomial}(n, 1/2)$ | Any continuous, robust |
| Wilcoxon signed-rank | $W^+$ (sum of positive ranks) | Wilcoxon distribution | Symmetric, non-normal |
| Chi-squared GoF | $\sum_i \dfrac{(O_i - E_i)^2}{E_i}$ | $\chi^2_{k-1}$ | Count data |
Two-Sample Tests (Independent)
| Test | Statistic | Null Distribution | When to Use |
|---|---|---|---|
| Welch t | See Section 4.2 | $t_\nu$ (Satterthwaite) | Normal, unequal var |
| Pooled t | $t = \dfrac{\bar{X} - \bar{Y}}{S_p\sqrt{1/n_1 + 1/n_2}}$ | $t_{n_1 + n_2 - 2}$ | Normal, equal var |
| Z-test (proportions) | See Section 4.1 | $N(0,1)$ | Large $n$, proportions |
| Mann-Whitney | $U$ statistic | Wilcoxon / Normal approx | Non-normal |
| KS test | $D = \sup_x |\hat{F}_1(x) - \hat{F}_2(x)|$ | Kolmogorov dist | Any continuous |
| Permutation | Any statistic | Empirical (permuted) | Any statistic, any dist |
Two-Sample Tests (Paired)
| Test | When to Use |
|---|---|
| Paired t-test | Normal differences |
| Wilcoxon signed-rank | Non-normal differences |
| Sign test | Ordinal or non-symmetric |
| McNemar | Binary outcomes (accuracy) |
| Permutation | Any paired statistic |
$k$-Sample Tests
| Test | Null distribution | When to Use |
|---|---|---|
| One-way ANOVA | $F_{k-1,\,N-k}$ | Normal, equal variances |
| Welch ANOVA | $F$ (adjusted df) | Normal, unequal variances |
| Kruskal-Wallis | $\chi^2_{k-1}$ (approx.) | Non-normal |
| Friedman | $\chi^2_{k-1}$ (approx.) | Repeated measures |
Appendix U: Extended Worked Examples - Machine Learning Scenarios
U.1 McNemar's Test for LLM Comparison
Setting: Two LLMs (Gemini Pro and GPT-4o) are evaluated on 1,200 coding problems. For each problem, each model either passes or fails the test suite.
Data:
- Both pass: $n_{11}$
- GPT-4o passes, Gemini fails: $n_{10}$
- Gemini passes, GPT-4o fails: $n_{01}$
- Both fail: $n_{00}$
McNemar's test uses only the discordant pairs:
$\chi^2 = \dfrac{(n_{10} - n_{01})^2}{n_{10} + n_{01}} \sim \chi^2_1 \text{ under } H_0.$ With the discordant counts observed in this evaluation, $\chi^2 > 3.841$ (the $\chi^2_1$ critical value at $\alpha = 0.05$): reject $H_0$.
GPT-4o accuracy: $(n_{11} + n_{10})/1200$. Gemini accuracy: $(n_{11} + n_{01})/1200$. The 2.25% gap (27 of the 1,200 problems) is statistically significant ($p < 0.05$).
If we had naively used a two-proportion z-test on the marginal accuracies:
the result would not reach significance. The z-test ignores the correlation between paired responses; McNemar correctly uses only the discordant pairs, which concentrate all the information about the performance difference.
U.2 Bootstrap Confidence Interval for BLEU Score Comparison
Setting: Two MT systems are evaluated on 500 test sentences. System A achieves BLEU = 28.3, System B achieves BLEU = 26.8. Is the 1.5 BLEU point difference significant?
Algorithm (paired bootstrap test):
- For $b = 1, \ldots, B$:
- Sample 500 sentence pairs with replacement (same indices for both systems).
- Compute $\Delta_b = \text{BLEU}_b(A) - \text{BLEU}_b(B)$ on the resampled set.
- Estimate $p \approx \frac{1}{B}\#\{b : \Delta_b \le 0\}$ - the fraction of bootstrap replicates where system B is at least as good as A.
This is the Koehn (2004) paired bootstrap test, the standard for MT evaluation.
Why not t-test? BLEU is a corpus-level metric (not an average of per-sentence scores), so the CLT does not directly apply. Bootstrap resampling over sentences respects the actual data-generating process.
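A minimal sketch of the paired bootstrap; the corpus-level metric function is an assumption (any implementation such as sacrebleu's corpus_bleu could be plugged in), and the function only defines the procedure rather than the original experiment:

```python
import numpy as np

def paired_bootstrap_test(metric, refs, hyps_a, hyps_b, n_boot=1000, seed=0):
    """Koehn-style paired bootstrap: resample sentences, recompute the corpus metric.

    `metric(refs, hyps)` is assumed to be any corpus-level score; it is not defined here.
    Returns the fraction of bootstrap replicates where system B is at least as good as A.
    """
    rng = np.random.default_rng(seed)
    n = len(refs)
    wins_b = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # resample sentence indices with replacement
        r = [refs[i] for i in idx]
        a = [hyps_a[i] for i in idx]
        b = [hyps_b[i] for i in idx]
        if metric(r, b) >= metric(r, a):
            wins_b += 1
    return wins_b / n_boot
```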
U.3 SPRT for Online Evaluation
Setting: A chat assistant is being A/B tested. Primary metric: thumbs-up rate. Control rate: $p_0$. Treatment hypothesis: the rate rises to $p_1 > p_0$. Target error rates: $\alpha$ (Type I) and $\beta$ (Type II).
Wald boundaries:
- Upper boundary: $A = \dfrac{1-\beta}{\alpha}$ - stop and reject $H_0$ when the likelihood ratio exceeds $A$.
- Lower boundary: $B = \dfrac{\beta}{1-\alpha}$ - stop and accept $H_0$ when the likelihood ratio falls below $B$.
Log-likelihood ratio increment per observation:
For a thumbs-up ($x = 1$): $\log\dfrac{p_1}{p_0}$. For a thumbs-down ($x = 0$): $\log\dfrac{1 - p_1}{1 - p_0}$.
Expected stopping times:
- Under $H_1$ (true rate $p_1$): Wald's approximation gives $E_1[N] \approx \dfrac{(1-\beta)\log A + \beta\log B}{E_1[\text{LLR increment}]}$ observations.
- A fixed-$n$ test with the same $\alpha$, $\beta$, and effect requires a larger, pre-committed sample size from the usual power formula.
SPRT requires ~43% fewer observations in this scenario by stopping early when evidence accumulates quickly.
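A minimal SPRT sketch for a Bernoulli metric; the rates, boundaries, and simulated stream below are hypothetical illustrative values, not the ones from this scenario:

```python
import numpy as np

def bernoulli_sprt(stream, p0, p1, alpha=0.05, beta=0.20):
    """Wald SPRT for H0: p = p0 vs H1: p = p1 on a stream of 0/1 outcomes."""
    upper = np.log((1 - beta) / alpha)     # cross upward -> reject H0
    lower = np.log(beta / (1 - alpha))     # cross downward -> accept H0
    llr, n = 0.0, 0
    for n, x in enumerate(stream, start=1):
        llr += np.log(p1 / p0) if x else np.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "accept H0", n
    return "no decision", n

rng = np.random.default_rng(5)
# Hypothetical thumbs-up rates chosen only for illustration
print(bernoulli_sprt(rng.binomial(1, 0.33, size=100_000), p0=0.30, p1=0.33))
```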
U.4 KS-Based Feature Drift Alert
Setting: An NLP model processes document embeddings. Reference distribution of document lengths (tokens): fitted on 50,000 training documents. Daily monitoring with 1,000 production documents.
Drift events:
- Week 1 (no drift): the production sample matches the reference distribution. The KS statistic is small and $p$ stays well above the alert threshold. No alert.
- Week 2 (mean shift): document lengths shift upward. The KS statistic is large and $p$ falls below the threshold. Alert: mean drift.
- Week 3 (variance shift only): the mean is unchanged but the spread widens. The KS test rejects, while a t-test on the same batch does not (no mean shift). KS detects variance drift that the t-test misses.
This demonstrates the key advantage of KS over t-test for drift detection: KS is sensitive to any distributional change (mean, variance, shape), while t-test only detects mean shifts.
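A minimal monitoring sketch; the reference distribution and batch below are synthetic stand-ins chosen to reproduce the week-3 pattern (a variance-only shift):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
reference = rng.normal(500, 100, size=50_000)      # hypothetical reference token lengths

def drift_check(batch, reference, alpha=0.01):
    """Alert on any distributional change via KS; report the Welch t-test for contrast."""
    ks = stats.ks_2samp(batch, reference)
    t = stats.ttest_ind(batch, reference, equal_var=False)
    return {"ks_p": float(ks.pvalue), "t_p": float(t.pvalue), "alert": ks.pvalue < alpha}

# Variance-only shift: the Welch t-test stays quiet, the KS test alerts
print(drift_check(rng.normal(500, 200, size=1_000), reference))
```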
Appendix V: Historical Notes
V.1 The Lady Tasting Tea
Fisher's canonical example (1935): A lady claims she can tell whether tea or milk was poured first. Fisher designs an experiment: 8 cups, 4 with tea first, 4 with milk first, presented in random order. The lady must identify the 4 tea-first cups.
Under $H_0$ (random guessing), the probability of identifying all 4 tea-first cups correctly is $1/\binom{8}{4} = 1/70 \approx 0.014$.
This tiny experiment - 8 cups, 1 run - is sufficient to achieve $p \approx 0.014 < 0.05$ if the lady guesses perfectly. Fisher's point: careful experimental design can yield strong statistical conclusions from minimal data.
For AI: The same logic applies to benchmark construction. A cleverly designed benchmark where random performance is exactly 25% (4-choice multiple choice) and human performance is 90% has high discriminating power. MMLU was designed with this principle.
V.2 Gosset and the Brewery
William Sealy Gosset derived the t-distribution in 1908 while working as a statistician for Guinness Brewery. Guinness had small-batch experiments (barley yields, hop compositions) where was typically 3-10. The existing large-sample theory (requiring normality and known ) was useless. Gosset published under the pseudonym "Student" because Guinness forbade employees from publishing (for fear of revealing industrial methods).
The t-test is thus directly connected to the practical problem of drawing conclusions from small samples - exactly the problem faced by ML researchers evaluating expensive models on small benchmark sets.
V.3 Neyman-Pearson and the Cigarette Industry
Jerzy Neyman and Egon Pearson developed their framework in the 1930s, partly motivated by quality control in manufacturing (testing whether a batch of products meets specifications). The framework is explicitly about decisions, not inference: you must ship or reject a batch based on a sample inspection. This decision-theoretic framing became the dominant paradigm in industrial statistics.
The cigarette industry later (1950s-70s) exploited the p-value/significance framework to manufacture doubt about cancer studies - repeatedly pointing out that individual studies did not achieve while ignoring the overwhelming weight of evidence across hundreds of studies. This historical episode motivates modern emphasis on effect sizes, meta-analysis, and replication over single-study p-values.
Appendix W: Common Distributions - Moments and Quantiles
W.1 Standard Normal
Key quantiles: $z_{0.90} = 1.282$, $z_{0.95} = 1.645$, $z_{0.975} = 1.960$, $z_{0.99} = 2.326$, $z_{0.995} = 2.576$.
W.2 Student's t-Distribution
$E[T] = 0$, $\mathrm{Var}(T) = \dfrac{\nu}{\nu - 2}$ for $\nu > 2$. Approaches $N(0,1)$ as $\nu \to \infty$.
| df | $t_{0.975}$ (two-sided $\alpha = 0.05$) | $t_{0.995}$ (two-sided $\alpha = 0.01$) |
|---|---|---|
| 5 | 2.571 | 4.032 |
| 10 | 2.228 | 3.169 |
| 20 | 2.086 | 2.845 |
| 30 | 2.042 | 2.750 |
| 60 | 2.000 | 2.660 |
| 1.960 | 2.576 |
W.3 Chi-Squared Distribution
$E[\chi^2_k] = k$, $\mathrm{Var}(\chi^2_k) = 2k$. Sum of $k$ independent squared standard normals.
W.4 F-Distribution
Used in ANOVA and in comparing nested model likelihoods. $F_{d_1,d_2} = \dfrac{\chi^2_{d_1}/d_1}{\chi^2_{d_2}/d_2}$ for independent chi-squared variables.
Appendix X: Summary of Key Theorems
Theorem 1 (Neyman-Pearson Lemma). For simple $H_0$ vs. simple $H_1$, the most powerful size-$\alpha$ test rejects when $\dfrac{f_1(x)}{f_0(x)} > k$, with $k$ chosen so that the test has size $\alpha$.
Theorem 2 (Wilks' Theorem). Under regularity conditions, $2\log\Lambda_n \xrightarrow{d} \chi^2_r$ as $n \to \infty$, where $r$ is the number of equality constraints in $H_0$.
Theorem 3 (Benjamini-Hochberg). The BH procedure at level $q$ controls $\text{FDR} \le \frac{m_0}{m} q \le q$ under independence (and PRDS).
Theorem 4 (Kolmogorov-Smirnov). For continuous $F$, $\sqrt{n}\,D_n \xrightarrow{d} \sup_{0 \le u \le 1} |B(u)|$, where $B$ is a Brownian bridge.
Theorem 5 (Wald, SPRT). The SPRT with boundaries $A = \frac{1-\beta}{\alpha}$ and $B = \frac{\beta}{1-\alpha}$ satisfies Type I error $\le \alpha$ and Type II error $\le \beta$ (approximately). The SPRT minimises expected sample size among all tests with the same error bounds.
Theorem 6 (Hodges-Lehmann). The ARE of the Wilcoxon signed-rank test relative to the t-test satisfies $\text{ARE} \ge 0.864$ for all symmetric continuous distributions, with the minimum attained by a particular parabolic density. The Wilcoxon test is never less than 86.4% as efficient as the t-test.
Theorem 7 (Karlin-Rubin). For families with monotone likelihood ratio - in particular, one-parameter exponential families in the natural parameter - one-sided tests are UMP: reject when $T(x) > c$ for the sufficient statistic $T$.
Theorem 8 (Equivalence of Trinity Tests). Under and contiguous alternatives, the Wald, Score, and Likelihood Ratio tests are all asymptotically equivalent: they have the same asymptotic size and the same asymptotic power function against local alternatives.
This section is part of the Math for LLMs curriculum. Previous: Section02 Estimation Theory | Next: Section04 Bayesian Inference
Appendix Y: Statistical Software and Implementation Notes
Y.1 SciPy Reference for Common Tests
```python
from scipy import stats
import numpy as np

# -- One-sample tests --------------------------------------------------
# Z-test (manually, since scipy has no z-test function)
z = (xbar - mu0) / (sigma / np.sqrt(n))
p_two = 2 * (1 - stats.norm.cdf(abs(z)))

# One-sample t-test
t, p = stats.ttest_1samp(x, popmean=mu0)

# Wilcoxon signed-rank test
w, p = stats.wilcoxon(x - mu0)

# -- Two-sample tests --------------------------------------------------
# Welch's t-test (ALWAYS use equal_var=False unless you have strong reason)
t, p = stats.ttest_ind(x, y, equal_var=False)

# Paired t-test
t, p = stats.ttest_rel(x, y)

# Mann-Whitney U
u, p = stats.mannwhitneyu(x, y, alternative='two-sided')

# Two-sample KS test
d, p = stats.ks_2samp(x, y)

# Permutation test (scipy >= 1.8)
result = stats.permutation_test((x, y),
                                statistic=lambda a, b: a.mean() - b.mean(),
                                n_resamples=10_000, alternative='two-sided')
p = result.pvalue

# -- Multi-sample tests ------------------------------------------------
# One-way ANOVA
f, p = stats.f_oneway(group1, group2, group3)

# Kruskal-Wallis
h, p = stats.kruskal(group1, group2, group3)

# -- Categorical tests -------------------------------------------------
# Chi-squared goodness-of-fit
chi2, p = stats.chisquare(observed, f_exp=expected)

# Chi-squared test of independence
chi2, p, dof, expected = stats.chi2_contingency(table)

# McNemar's test (statsmodels)
from statsmodels.stats.contingency_tables import mcnemar
result = mcnemar([[n11, n10], [n01, n00]])
p = result.pvalue
```
Y.2 Multiple Testing Correction
```python
from statsmodels.stats.multitest import multipletests

# Bonferroni, Holm, BH corrections
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method='holm')
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')

# method options: 'bonferroni', 'holm', 'fdr_bh', 'fdr_by', 'sidak',
#                 'holm-sidak', 'simes-hochberg', 'hommel'
```
Y.3 Power Analysis
```python
from statsmodels.stats.power import (
    TTestIndPower, TTestPower, NormalIndPower   # TTestPower is the one-sample class
)

# Required sample size for two-sample t-test
analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                         ratio=1.0, alternative='two-sided')

# Power at fixed n
power = analysis.power(effect_size=0.5, nobs1=64, alpha=0.05,
                       ratio=1.0, alternative='two-sided')

# For proportions
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import zt_ind_solve_power

h = proportion_effectsize(0.87, 0.85)  # Cohen's h
n = zt_ind_solve_power(effect_size=h, alpha=0.05, power=0.80)
```
Y.4 Numerical Tips
- Always set a seed (e.g., `np.random.seed(42)` or `np.random.default_rng(42)`) before generating synthetic data, for reproducibility.
- For exact two-sided p-values from the t-distribution: `p = 2 * stats.t.sf(abs(t_stat), df=df)`.
- For a chi-squared p-value: `p = stats.chi2.sf(chi2_stat, df=k-1)`.
- The KS test is sensitive to sample size - even tiny real differences are "significant" at large $n$. Always report the KS statistic alongside the p-value.
- For bootstrap and permutation tests, use enough resamples that the smallest achievable p-value ($\approx 1/(B+1)$) sits below your significance threshold.