Chapter 7 — Statistics

"Statistics is the grammar of science. Without it, data is just noise; with it, data becomes evidence."

Overview

Statistics is the discipline of drawing principled conclusions from data. Where probability theory asks "given a model, what data should we expect?", statistics inverts the question: "given data, what can we infer about the underlying model?" This inversion — inference from observations — is the foundation of every machine learning training algorithm.

This chapter builds statistical reasoning from data summarisation (§01), through classical estimation and hypothesis testing (§02–§03), Bayesian inference (§04), time-series analysis (§05), and regression (§06). Every concept is grounded in its ML application: MLE underpins cross-entropy training, confidence intervals govern model evaluation, Bayesian inference enables uncertainty quantification in neural networks, and regression is the blueprint for supervised learning.

The conceptual arc: summarise data (§01) → estimate parameters from data (§02) → test hypotheses about data (§03) → update beliefs from data (§04) → model sequential data (§05) → model relationships between variables (§06).

Subsection Map

#	Subsection	What It Covers	Canonical Topics
01	Descriptive Statistics	Summarising and characterising datasets before modelling	Mean, median, mode, variance, standard deviation, skewness, kurtosis, quantiles, IQR, correlation, covariance matrices, data visualisation, outlier detection
02	Estimation Theory	Inferring population parameters from samples; MLE and method of moments	Point estimation, bias, variance, MSE, consistency, efficiency, Cramér-Rao bound, MLE derivation, method of moments, confidence intervals, Fisher information
03	Hypothesis Testing	Formal decision-making under uncertainty; p-values, power, and error rates	Null/alternative hypotheses, Type I/II errors, p-values, significance level, power, t-tests, z-tests, chi-squared tests, ANOVA, multiple testing correction, A/B testing
04	Bayesian Inference	Treating parameters as random variables; posterior computation and uncertainty	Prior, likelihood, posterior, conjugate priors, MAP estimation, MCMC posterior sampling, variational inference, credible intervals, Bayesian model comparison
05	Time Series	Modelling and forecasting sequential, temporally dependent data	Stationarity, autocorrelation, AR/MA/ARMA/ARIMA models, spectral analysis, seasonal decomposition, Kalman filter, forecasting
06	Regression Analysis	Modelling relationships between variables; the blueprint for supervised learning	Simple and multiple linear regression, OLS derivation, Gauss-Markov theorem, regularisation (Ridge/Lasso), logistic regression, GLMs, model diagnostics

Reading Order and Dependencies

01-Descriptive-Statistics         (foundation: summarise before modelling)
        ↓
02-Estimation-Theory              (core inference: MLE, confidence intervals, Fisher info)
        ↓
03-Hypothesis-Testing             (decisions: p-values, power, A/B testing)
        ↓
04-Bayesian-Inference             (probabilistic view: posterior, MAP, MCMC)
        ↓
05-Time-Series                    (sequential data: AR/MA, forecasting, Kalman)
        ↓
06-Regression-Analysis            (supervised learning blueprint: OLS, Ridge, Lasso)
        ↓
Chapter 8 — Optimization          (gradient methods, convexity, training algorithms)

What Belongs Where — Canonical Homes

Topic	Canonical Home	Preview Only In
Mean, median, mode, quantiles	§01	§02 (sample mean as estimator)
Variance, standard deviation (sample)	§01	§06 (residual variance)
Correlation and covariance (empirical)	§01	§06 (design matrix structure)
Skewness, kurtosis, distribution shape	§01	—
Outlier detection methods	§01	—
Point estimators: bias, variance, MSE	§02	—
Cramér-Rao lower bound, Fisher information	§02	§04 (Laplace approximation)
MLE derivation	§02	§04 (MAP as regularised MLE)
Method of moments	§02	—
Confidence intervals (frequentist)	§02	§03 (duality with tests)
Asymptotic normality of MLE	§02	—
Null/alternative hypotheses, p-values	§03	—
Type I/II errors, power, significance	§03	—
t-test, z-test, chi-squared, ANOVA	§03	—
Multiple testing correction	§03	—
A/B testing framework	§03	—
Bayes' theorem (prior × likelihood)	§04	Ch6§03 (full derivation), §02 (preview)
Conjugate priors	§04	—
MAP estimation	§04	§02 (as regularised MLE)
Posterior predictive distribution	§04	—
Credible intervals	§04	—
Variational inference	§04	—
Stationarity, ACF/PACF	§05	—
AR/MA/ARIMA models	§05	—
Kalman filter	§05	—
Spectral density, Fourier in time series	§05	—
OLS derivation ($\hat{\beta} = (X^TX)^{-1}X^Ty$)	§06	—
Gauss-Markov theorem, BLUE	§06	—
Ridge and Lasso regularisation	§06	—
Logistic regression, GLMs	§06	—
Residual analysis, model diagnostics	§06	—

Overlap Danger Zones

1. Descriptive Statistics ↔ Probability Theory

§01 computes sample statistics from data (empirical mean, sample variance, sample correlation).
Ch6§04 defines population parameters (expected value, variance, covariance of a distribution).
§01 must not re-derive the expectation operator; it applies sample analogues. §01 should backward-reference Ch6§04 for the population versions.

2. MLE ↔ MAP ↔ Bayesian Inference

§02 derives MLE from the likelihood principle and treats it as a point estimate.
§04 introduces MAP as MLE regularised by a prior, then extends to full posterior inference.
§02 may note that adding a prior gives MAP, but the full treatment of priors, posteriors, and conjugacy belongs in §04.

3. Confidence Intervals ↔ Credible Intervals

§02 defines frequentist confidence intervals: in repeated experiments, 95% of such intervals contain the true parameter.
§04 defines Bayesian credible intervals: the posterior probability that the parameter lies in the interval is 95%.
The philosophical distinction belongs in §04; §02 covers only the frequentist construction.

4. Hypothesis Testing ↔ Bayesian Model Comparison

§03 is the canonical home for p-values, power, and classical tests.
§04 covers Bayes factors and Bayesian model selection as the Bayesian analogue.
§03 may note the Bayesian alternative briefly; §04 covers it in depth.

5. Regression ↔ Optimisation

§06 derives OLS and establishes regression as a statistical model (Gauss-Markov, residual assumptions).
Ch8 covers gradient-based optimisation, convexity, and SGD — the algorithmic machinery for fitting models.
§06 may solve OLS by the normal equations (closed form) without invoking gradient descent; the algorithmic treatment belongs in Ch8.

6. Regression ↔ Descriptive Statistics

§01 computes the empirical correlation coefficient $r$ as a summary statistic.
§06 derives the regression coefficient $\hat{\beta}_1 = r \cdot s_y / s_x$ and explains the relationship.
§01 must not derive OLS; §06 must backward-reference §01 for empirical correlation.

Key Cross-Chapter Dependencies

From Chapter 6 — Probability Theory: - §01 (Random Variables, CDF/PDF) → §02 likelihood functions and sampling distributions - §02 (Common Distributions) → §02/§03 named test statistics (t, chi-squared, F distributions) - §03 (Bayes' theorem, conditional distributions) → §04 Bayesian inference - §04 (Expectation, variance) → §02 estimator properties (bias, variance, MSE) - §05 (CLT, LLN) → §02 asymptotic normality of MLE; §03 large-sample tests - §06 (Stochastic Processes) → §05 time-series modelling (stationary processes, ACF) - §07 (Markov Chains, MH) → §04 MCMC posterior sampling

From Chapter 3 — Advanced Linear Algebra: - SVD → §02 Fisher information geometry; §06 OLS via pseudoinverse - Positive definite matrices → §06 covariance matrix of estimators; ridge regression

Into Chapter 8 — Optimisation: - §02 (MLE) → cross-entropy and NLL as loss functions - §06 (OLS) → normal equations as a linear system; extends to gradient descent - §04 (ELBO, variational inference) → variational autoencoder training objective

Into Chapter 9 — Information Theory: - §02 (Fisher information) → Cramér-Rao and the information-theoretic view of estimation - §04 (KL divergence as posterior approximation criterion) → variational inference

ML Concept Map

ML Concept	Statistics Foundation	Section
Cross-entropy loss $-\sum y \log \hat{p}$	Negative log-likelihood (MLE objective)	§02
Weight decay / L2 regularisation	Ridge regression; MAP with Gaussian prior	§04, §06
L1 / sparsity regularisation	Lasso regression; MAP with Laplace prior	§04, §06
Confidence in model evaluation	Confidence intervals for test accuracy	§02, §03
A/B testing for model comparison	Two-sample hypothesis test	§03
Bayesian neural networks	Posterior over weights; variational inference	§04
Dropout as Bayesian approximation	MC Dropout ↔ approximate posterior sampling	§04
Batch normalisation	Sample mean/variance of activations	§01
Layer normalisation	Sample statistics within layers	§01
Early stopping	Train/validation loss as statistical estimators	§02
Calibration (softmax temperature)	Posterior predictive calibration	§04
Anomaly / OOD detection	Hypothesis testing; statistical distance	§03
Data drift detection	Two-sample tests on feature distributions	§03
Time-series forecasting	ARIMA, Kalman filter	§05
Transformer positional encoding	Spectral methods, Fourier features	§05
Linear probe evaluation	Logistic regression on embeddings	§06
LoRA / low-rank adaptation	Regression on low-dimensional subspaces	§06
Reward modelling (RLHF)	Logistic regression (Bradley-Terry model)	§06

Prerequisites

Before starting this chapter, ensure you are comfortable with:

Probability distributions — PDFs, CDFs, named distributions (Gaussian, Bernoulli, Poisson, Beta) — Chapter 6 §01–§02
Expectation and variance — $\mathbb{E}[X]$, $\text{Var}(X)$, covariance — Chapter 6 §04
Bayes' theorem — $p(\theta|x) \propto p(x|\theta)p(\theta)$ — Chapter 6 §03
Central limit theorem — sample mean converges to Gaussian — Chapter 6 §06
Matrix algebra — matrix inverse, SVD, positive definiteness — Chapter 3
Calculus — derivatives, optimisation (for MLE derivations) — Chapter 4