<- Back to Chapter 7: Statistics | Next: Estimation Theory ->
"The first step in wisdom is knowing what you don't know - and in statistics, that begins by knowing what your data actually looks like."
- John Tukey
Overview
Descriptive statistics is the art and science of summarising data before you model it. It answers a deceptively simple question: what does this dataset look like? Before fitting a model, optimising a loss, or drawing any inference, you must know whether your features are symmetric or skewed, whether outliers will dominate gradient updates, whether two variables move together, and whether the scale differences between features will derail gradient descent. Descriptive statistics provides the vocabulary and tools for all of this.
The discipline divides naturally into three layers. The first layer is univariate: for each variable in isolation, compute its centre (mean, median, mode), its spread (variance, IQR, MAD), and its shape (skewness, kurtosis). The second layer is bivariate: measure how pairs of variables relate through covariance, Pearson correlation, and rank correlation. The third layer is multivariate: characterise the joint behaviour of all variables together through the covariance matrix, Mahalanobis distance, and correlation heatmaps.
Every modern ML pipeline touches descriptive statistics constantly, often invisibly. Batch normalisation computes sample means and variances of activations. Adam's momentum terms are exponentially-weighted sample statistics. Dataset documentation standards require engineers to report label distributions, feature histograms, and correlation structures. Data drift is detected by monitoring shifts in descriptive statistics over time. This section builds the rigorous foundation for all of these applications.
Prerequisites
- Expectation and variance ($\mathbb{E}[X]$, $\mathrm{Var}(X)$, covariance of random variables) - Ch6 Section04 Expectation and Moments
- Common distributions (Gaussian, Bernoulli, Poisson) - Ch6 Section02 Common Distributions
- Basic probability (sample spaces, random variables, CDF/PDF) - Ch6 Section01
- Matrix operations (matrix-vector products, transpose, positive definiteness) - Ch2 Section02
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive implementation of every statistic with visualisations and ML-case studies |
| exercises.ipynb | 10 graded exercises from basic EDA through Batch Norm and Adam momentum analysis |
Learning Objectives
After completing this section, you will:
- Compute and interpret all standard measures of central tendency, spread, and shape for a univariate sample
- Explain Bessel's correction and derive why dividing by $n-1$ gives an unbiased variance estimator
- Construct and interpret the empirical CDF, quantiles, five-number summary, and box plots
- Assess outliers using Z-score thresholds, Tukey fences, and MAD-based robust methods
- Define the breakdown point of a statistic and select between mean and median appropriately
- Compute Pearson, Spearman, and Kendall correlation and choose the right one for your data
- Build the empirical covariance matrix, check positive semi-definiteness, and compute Mahalanobis distance
- Explain how sample statistics relate to population parameters and what the iid assumption means
- Connect batch normalisation, layer normalisation, and RMS norm to their descriptive-statistics definitions
- Derive Adam's first and second moments as exponentially-weighted sample statistics
- Detect data drift by monitoring shifts in descriptive statistics over time
- Apply the full EDA workflow to a multivariate ML dataset
Table of Contents
- 1. Intuition
- 2. Formal Definitions and Notation
- 3. Robust Statistics
- 4. Bivariate Statistics
- 5. Multivariate Descriptive Statistics
- 6. Data Standardisation and Preprocessing
- 7. Visualisation for EDA
- 8. Applications in Machine Learning
- 9. Common Mistakes
- 10. Exercises
- 11. Why This Matters for AI (2026 Perspective)
- 12. Conceptual Bridge
1. Intuition
1.1 What Descriptive Statistics Does
Descriptive statistics does not make predictions. It does not test hypotheses. It does not fit models. Its sole purpose is to look at the data - to summarise its essential features in a form that a human can understand and that downstream methods can rely on.
The canonical demonstration of why this matters is Anscombe's Quartet (1973): four datasets with nearly identical means ($\bar{x} = 9$, $\bar{y} \approx 7.50$), variances ($s_x^2 = 11$, $s_y^2 \approx 4.12$), and Pearson correlations ($r \approx 0.816$) - yet one is a noisy but genuinely linear relationship, one is a curved parabola, one is a near-perfect line with a single outlier, and one is a vertical cluster with a single high-leverage point. A statistician who skips visualisation and reports only the summary statistics would fit the same regression line to all four, missing everything important.
The lesson for ML is immediate: if you skip EDA on your training data, you may train a model on a dataset where a single feature with extreme variance (unnormalised pixel intensity vs normalised embedding values) dominates gradient updates; where two features are so highly correlated that they provide redundant information and slow convergence; where a label is heavily imbalanced, making accuracy a misleading metric; or where systematic missing-data patterns encode the label and will cause the model to cheat.
Descriptive statistics is the gate through which all good modelling passes. Every subsequent step - estimation, hypothesis testing, Bayesian inference - implicitly assumes you already know your data's basic structure.
1.2 Why It Matters for AI
The connection between descriptive statistics and modern AI is deeper than "preprocessing". Sample statistics are computed billions of times per training run in contemporary neural networks.
Batch Normalisation (Ioffe & Szegedy, 2015) computes the sample mean and variance of each feature channel across a mini-batch, then normalises activations to zero mean and unit variance. This is exactly the z-score standardisation formula applied at every layer, every forward pass. Without it, deep networks suffer from internal covariate shift - the distribution of activations drifts as parameters update, causing gradient vanishing or explosion.
Layer Normalisation (Ba et al., 2016), used in every transformer architecture (GPT, BERT, LLaMA, Mistral, Gemini), computes the mean and variance across features within a single token's representation. For a token embedding $h \in \mathbb{R}^d$: $\hat{h}_j = (h_j - \mu)\,/\,\sqrt{\sigma^2 + \varepsilon}$, where $\mu$ and $\sigma^2$ are sample statistics over the $d$ components of that single token.
Adam optimiser (Kingma & Ba, 2015) maintains an exponentially-weighted moving average of past gradients (first moment, analogous to a running mean) and past squared gradients (second moment, analogous to a running variance). The update rule corrects for initialisation bias exactly as one corrects for Bessel bias in sample statistics.
Dataset cards required by Hugging Face, Google, and the EU AI Act mandate reporting of label distributions (a sample frequency table), feature statistics (mean, std, min, max per column), and known biases (correlations between sensitive attributes and labels).
For AI: Every time you decide how to initialise weights, normalise activations, monitor training stability, document a dataset, or detect distribution shift in production, you are doing descriptive statistics.
1.3 Historical Timeline
DESCRIPTIVE STATISTICS - HISTORICAL TIMELINE
========================================================================
1662 John Graunt - Bills of Mortality: first tabulation of population
statistics; birth/death rates; the discipline's birth
1805 Legendre, Gauss - Least squares method; residual analysis;
the mean as the least-squares centre
1885 Francis Galton - Percentiles, regression to the mean, rank
correlation; connected statistics to heredity and evolution
1896 Karl Pearson - Product-moment correlation coefficient r;
standard deviation (coined the term); chi-squared distribution
1919 Ronald Fisher - ANOVA, experimental design; p-values; the
distinction between sample and population statistics
1925 Fisher - "Statistical Methods for Research Workers"; codified
the sample mean, variance, and hypothesis testing framework
1951 John Tukey - Jackknife resampling; robust estimation begins;
"the future of data analysis" (1962) anticipates EDA
1973 Frank Anscombe - Anscombe's Quartet; the necessity of plotting
1977 John Tukey - "Exploratory Data Analysis" textbook; box plots,
stem-and-leaf plots; EDA as a discipline; look before you infer
1983 Peter Rousseeuw - Least Median of Squares; formal breakdown
point theory; robust statistics becomes rigorous
2015 Ioffe & Szegedy - Batch Normalization; sample statistics
embedded inside neural network training loops
2016 Ba et al. - Layer Normalization; per-token sample statistics
for transformer architectures
2020 Gebru et al. - "Datasheets for Datasets"; descriptive statistics
as ethical AI documentation requirement
2026 EU AI Act - Mandatory dataset statistics for high-risk AI systems
========================================================================
1.4 The EDA Workflow
A disciplined exploratory analysis follows five stages:
THE EDA WORKFLOW
========================================================================
Stage 1: LOAD AND AUDIT
-------------------------
- Check shapes, dtypes, memory usage
- Count missing values per column
- Identify data types: numeric / categorical / datetime / text
- Check for duplicate rows
- Verify expected value ranges (no age = -5, no probability > 1)
Stage 2: UNIVARIATE ANALYSIS
------------------------------
For each feature:
- Central tendency: mean, median, mode
- Spread: variance, std, IQR, range
- Shape: skewness, kurtosis
- Extremes: min, max, outliers
- Distribution: histogram, KDE, box plot, ECDF
Stage 3: BIVARIATE ANALYSIS
------------------------------
For each pair (feature, target) and (feature, feature):
- Pearson/Spearman correlation
- Scatter plot with regression line
- Conditional distributions P(Y | X = x)
Stage 4: MULTIVARIATE ANALYSIS
---------------------------------
- Correlation matrix heatmap
- Pair plot (scatter matrix)
- PCA scree plot (variance explained)
- Cluster structure via UMAP/t-SNE
Stage 5: DOCUMENT AND ACT
--------------------------
- Record all findings
- Plan transformations (log, standardise, impute)
- Flag problematic features for domain expert review
- Update dataset card
========================================================================
In ML practice, this workflow is applied to training data before modelling, to validation data at model evaluation time, and to production data continuously (as drift monitoring). The output is not a model - it is a map of the data landscape that guides every subsequent decision.
2. Formal Definitions and Notation
2.1 Sample Space and the Data Matrix
Let $\mathcal{P}$ be the population - the full set of all possible observations. A sample $\{x_1, x_2, \dots, x_n\}$ is a collection of $n$ realisations drawn from $\mathcal{P}$.
The iid assumption (independent, identically distributed) means each $x_i$ is drawn from the same distribution independently of all others. This is the foundation on which classical descriptive statistics rests - but it is violated routinely in practice:
- Time series data: $x_t$ depends on $x_{t-1}$ (autocorrelation)
- Clustered data: patients from the same hospital are correlated
- Sequential text: tokens in a sentence are deeply dependent
- Mini-batches in SGD: often drawn without replacement, not exactly iid
The data matrix for multivariate data is $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of observations (rows) and $d$ is the number of features (columns). Row $i$ is the feature vector $x_i \in \mathbb{R}^d$; column $j$ is the vector of all $n$ observed values of feature $j$.
A critical distinction runs throughout this section:
| Symbol | Name | Meaning |
|---|---|---|
| $\mu$ | Population mean | $\mathbb{E}[X]$ - theoretical, unknown |
| $\bar{x}$ | Sample mean | $\frac{1}{n}\sum_i x_i$ - computed from data |
| $\sigma^2$ | Population variance | $\mathrm{Var}(X)$ - theoretical, unknown |
| $s^2$ | Sample variance | $\frac{1}{n-1}\sum_i (x_i - \bar{x})^2$ - computed from data |
| $\rho$ | Population correlation | $\mathrm{Corr}(X, Y)$ - theoretical |
| $r$ | Sample correlation | Pearson's $r$ - computed from data |
For AI: The distinction matters in Batch Norm. During training, BN uses the mini-batch sample mean $\hat{\mu}_B$ and sample variance $\hat{\sigma}_B^2$. During inference, it uses running estimates of the population mean and variance accumulated during training. The two are different statistics with different properties.
2.2 Measures of Central Tendency
Definition (Sample Mean). Given a sample $x_1, \dots, x_n$, the sample mean is $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$.
The mean is the unique minimiser of $\sum_{i=1}^{n}(x_i - c)^2$ over $c$. It is the least-squares centre of the data. It is also an unbiased estimator of $\mu$: $\mathbb{E}[\bar{x}] = \mu$ (proof: by linearity of expectation).
Definition (Sample Median). Sort the sample as $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$. The sample median is $x_{((n+1)/2)}$ if $n$ is odd, and $\frac{1}{2}\big(x_{(n/2)} + x_{(n/2+1)}\big)$ if $n$ is even.
The median is a minimiser of $\sum_{i=1}^{n}|x_i - c|$ over $c$. It is the least-absolute-deviations centre.
Definition (Sample Mode). The value that appears most frequently. Not unique if two values tie (bimodal). Not well-defined for continuous data without binning.
Comparison:
CENTRAL TENDENCY COMPARISON
========================================================================
Statistic Loss minimised Sensitivity Use when
--------- ----------------- ------------- -------------------------
Mean \\Sigma(x_i - c)^2 High Symmetric, no outliers
(L2 loss) to outliers Gaussian-like data
Median \\Sigma|x_i - c| Low Skewed distributions,
(L1 loss) to outliers outliers present
Income, house prices
Mode 0/1 loss N/A Categorical data;
multi-modal distributions
========================================================================
Non-examples:
- A dataset of incomes in which one individual earns orders of magnitude more than the rest: the mean is pulled catastrophically toward the outlier and overstates the typical income, while the median remains representative.
- A bimodal dataset with two well-separated clusters: the mean and the median both fall in the empty region between the clusters, describing no actual observation. The mode is more informative here.
- An exactly symmetric dataset: all three coincide. This is the exception, not the rule.
For AI: In label smoothing and temperature scaling for language models, the "centre" of the output distribution is the mode (the argmax token). In KL divergence minimisation, the mean of the approximate distribution is what's optimised. In median-of-means estimation (used in Byzantine-robust distributed learning), the median replaces the mean to guard against corrupted gradient submissions from compromised workers.
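A minimal NumPy/SciPy sketch of the three centres on a small illustrative income sample (the values are made up for illustration); the single extreme value drags the mean but not the median:

```python
import numpy as np
from scipy import stats

# Illustrative incomes in thousands; one extreme earner dominates the mean.
incomes = np.array([32, 35, 38, 41, 45, 45, 52, 58, 61, 950], dtype=float)

print("mean  :", incomes.mean())                            # pulled far above the bulk
print("median:", np.median(incomes))                        # representative of the bulk
print("mode  :", stats.mode(incomes, keepdims=False).mode)  # most frequent value (45)
```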
2.3 Measures of Spread
Definition (Sample Variance). The sample variance with Bessel's correction is $s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$.
Theorem (Bessel's Correction). The estimator $s^2$ is unbiased for $\sigma^2$: $\mathbb{E}[s^2] = \sigma^2$.
Proof. Write $\sum_{i}(x_i - \bar{x})^2 = \sum_{i}(x_i - \mu)^2 - n(\bar{x} - \mu)^2$. Taking expectations, $\mathbb{E}\big[\sum_i (x_i - \mu)^2\big] = n\sigma^2$ and $\mathbb{E}\big[n(\bar{x} - \mu)^2\big] = n \cdot \frac{\sigma^2}{n} = \sigma^2$, so $\mathbb{E}\big[\sum_i (x_i - \bar{x})^2\big] = (n-1)\sigma^2$.
Dividing by $n-1$ gives $\mathbb{E}[s^2] = \sigma^2$.
The divisor $n$ gives the maximum likelihood estimator of $\sigma^2$ under a Gaussian assumption, which is biased. NumPy uses the divisor $n$ by default (ddof=0); use ddof=1 for the unbiased version.
Definition (Standard Deviation). $s = \sqrt{s^2}$. Note: $s$ is NOT an unbiased estimator of $\sigma$ (by Jensen's inequality, $\mathbb{E}[\sqrt{s^2}] \le \sqrt{\mathbb{E}[s^2]} = \sigma$), but the bias is negligible for large $n$.
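A small simulation, assuming only NumPy, that checks the unbiasedness claim from the theorem above: averaged over many repeated samples, the ddof=1 estimator lands near the true variance, while the ddof=0 estimator is low by the factor $(n-1)/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, true_var = 5, 200_000, 4.0            # a small n makes the bias visible

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
var_mle = samples.var(axis=1, ddof=0)            # divisor n   (MLE, biased low)
var_bessel = samples.var(axis=1, ddof=1)         # divisor n-1 (Bessel-corrected)

print(f"mean of divisor-n estimates    : {var_mle.mean():.3f}  (theory: {(n - 1) / n * true_var:.3f})")
print(f"mean of divisor-(n-1) estimates: {var_bessel.mean():.3f}  (theory: {true_var:.3f})")
```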
Definition (Range). $\mathrm{range} = x_{(n)} - x_{(1)} = \max_i x_i - \min_i x_i$. Maximal sensitivity to outliers; breakdown point $= 1/n \to 0$.
Definition (Interquartile Range). $\mathrm{IQR} = Q_3 - Q_1$. Contains the middle 50% of the data. Breakdown point = 25%.
Definition (Coefficient of Variation). $\mathrm{CV} = s / \bar{x}$ (dimensionless). Useful for comparing spread across variables with different units. Not meaningful when $\bar{x} \approx 0$.
For AI: In gradient clipping, the gradient norm is compared to a threshold - the threshold is often chosen based on the empirical standard deviation of gradient norms observed during early training. In the Adam optimiser, the second moment estimate $\hat{v}_t$ estimates the gradient's uncentred second moment, and $\alpha / (\sqrt{\hat{v}_t} + \varepsilon)$ is the adaptive per-parameter learning rate - a measure of spread in the gradient history.
2.4 Shape Statistics: Skewness and Kurtosis
Definition (Sample Skewness). Fisher's $g_1$ is $g_1 = \dfrac{m_3}{m_2^{3/2}}$,
where $m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^k$ is the $k$-th sample central moment.
- $g_1 > 0$: right-skewed (long right tail; mean > median)
- $g_1 < 0$: left-skewed (long left tail; mean < median)
- $g_1 \approx 0$: symmetric
Examples: Income distributions are right-skewed ($g_1 > 0$). Exam scores near a ceiling are left-skewed ($g_1 < 0$). Gradient norms in deep networks are often right-skewed - a few extreme gradients can destabilise training.
Definition (Sample Excess Kurtosis). Fisher's $g_2$ is $g_2 = \dfrac{m_4}{m_2^{2}} - 3$.
The subtraction of 3 normalises so that the Gaussian has $g_2 = 0$.
- $g_2 > 0$: leptokurtic (heavier tails than Gaussian; more extreme values)
- $g_2 < 0$: platykurtic (lighter tails; more concentrated)
- $g_2 \approx 0$: mesokurtic (Gaussian-like tails)
For AI: The distribution of gradients in neural networks is often leptokurtic - it has heavier tails than a Gaussian. This is why vanilla SGD with a Gaussian noise model is a poor fit; it's why gradient clipping is essential; and it's why the Student-$t$ distribution (which has heavier tails) is sometimes used to model stochastic gradient noise. Understanding kurtosis explains why "gradient explosion" is the right mental model for heavy-tailed (large-$g_2$) gradient distributions.
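A quick SciPy sketch computing $g_1$ and $g_2$ for three illustrative samples; the distributions are chosen only to exhibit the regimes described above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
samples = {
    "gaussian": rng.normal(size=100_000),                  # symmetric, mesokurtic (g1 ~ 0, g2 ~ 0)
    "lognormal": rng.lognormal(size=100_000),               # right-skewed (g1 > 0)
    "student-t(5)": rng.standard_t(df=5, size=100_000),     # heavy tails (g2 = 6 in theory)
}
for name, x in samples.items():
    g1 = stats.skew(x)        # m3 / m2^(3/2)
    g2 = stats.kurtosis(x)    # m4 / m2^2 - 3 (Fisher's excess kurtosis)
    print(f"{name:>12}: g1 = {g1:6.2f}, g2 = {g2:6.2f}")
```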
2.5 Order Statistics and Quantiles
Definition (Order Statistics). Given a sample $x_1, \dots, x_n$, the order statistics are the sorted values $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$.
Definition (Quantile / Percentile). The $p$-th quantile $Q(p)$, $p \in [0, 1]$, is a value such that at least a proportion $p$ of the data lies at or below it. Common choices: $Q(0.25) = Q_1$, $Q(0.5) =$ median, $Q(0.75) = Q_3$.
Definition (Empirical CDF). The ECDF is $\hat{F}_n(x) = \dfrac{1}{n}\sum_{i=1}^{n} \mathbf{1}[x_i \le x].$
This is a step function that jumps by $1/n$ at each observation. It is the non-parametric estimator of the true CDF $F$.
Theorem (Glivenko-Cantelli). For iid data, $\sup_x |\hat{F}_n(x) - F(x)| \to 0$ almost surely as $n \to \infty$.
This guarantees the ECDF converges uniformly to the true CDF - the empirical distribution is a consistent estimator of the population distribution.
Definition (Five-Number Summary). $\{\min,\, Q_1,\, \text{median},\, Q_3,\, \max\}$. The basis of the box plot.
For AI: Quantile functions are used in calibration evaluation (reliability diagrams), in quantile regression for uncertainty estimation, in fairness auditing (comparing median outcomes across demographic groups), and in the Wasserstein distance (used in GAN training and dataset comparison), which is computed from the quantile functions of two distributions.
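A short sketch of the ECDF, quantiles, and five-number summary on an illustrative right-skewed sample:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=500)          # illustrative right-skewed sample

def ecdf(data, t):
    """Empirical CDF: fraction of observations at or below t."""
    return np.mean(data <= t)

q1, med, q3 = np.quantile(x, [0.25, 0.50, 0.75])
five_number = np.round([x.min(), q1, med, q3, x.max()], 2)

print("five-number summary:", five_number)
print("F_hat at the median:", ecdf(x, med))        # ~0.5 by construction
```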
3. Robust Statistics
3.1 Breakdown Point and Influence Function
Classical statistics (mean, variance, Pearson correlation) is vulnerable to outliers because it is designed to be efficient under Gaussian assumptions. Robust statistics replaces efficiency under Gaussianity with resistance to contamination.
Definition (Breakdown Point). The breakdown point $\varepsilon^*$ of an estimator $T$ is the smallest fraction of observations that can be replaced by arbitrarily bad values to make $T$ take an arbitrarily extreme value.
- Sample mean: $\varepsilon^* = 1/n \to 0$. A single observation sent to $\pm\infty$ drives the mean to infinity.
- Sample median: $\varepsilon^* = 50\%$ (asymptotically). You must corrupt more than half the data to move the median arbitrarily far.
- Sample variance: $\varepsilon^* = 1/n \to 0$ (same as the mean - one outlier suffices).
- IQR: $\varepsilon^* = 25\%$.
- MAD: $\varepsilon^* = 50\%$.
Definition (Influence Function). The influence function measures the effect of a single infinitesimally contaminated observation at $x$ on the estimator $T$ of a distribution $F$: $\mathrm{IF}(x; T, F) = \lim_{\varepsilon \to 0} \dfrac{T\big((1-\varepsilon)F + \varepsilon\,\delta_x\big) - T(F)}{\varepsilon},$
where $\delta_x$ is a point mass at $x$.
- Mean: $\mathrm{IF}(x) = x - \mu$. Unbounded - a large $x$ causes a large change.
- Median: $\mathrm{IF}(x) = \dfrac{\mathrm{sign}(x - m)}{2 f(m)}$, where $m$ is the median and $f$ the density. Bounded - an extreme $x$ has only a finite effect.
For AI: Robust statistics is directly relevant to Byzantine-robust federated learning (e.g., Krum, median-of-means, geometric median aggregation), where some fraction of participating clients may send corrupted gradients. The median-of-means estimator retains a non-zero breakdown point against malicious workers (roughly half of the groups must be corrupted before it breaks), while naive FedAvg (a simple mean) has breakdown point $1/n \to 0$ - a single malicious client suffices.
3.2 Robust Location Estimators
Definition (Trimmed Mean). The $\alpha$-trimmed mean removes the bottom and top $\alpha$-fraction of observations before computing the mean: $\bar{x}_{\alpha} = \dfrac{1}{n - 2\lfloor \alpha n \rfloor} \sum_{i=\lfloor \alpha n \rfloor + 1}^{\,n - \lfloor \alpha n \rfloor} x_{(i)}.$
Breakdown point: $\alpha$. With $\alpha = 0.1$, this is the 10%-trimmed mean - it discards the most extreme 10% on each side before averaging.
Definition (Winsorised Mean). Rather than discarding extreme values, Winsorisation clips them to the $\alpha$ and $1-\alpha$ quantiles: $\tilde{x}_i = \min\!\big(\max(x_i,\, Q(\alpha)),\, Q(1-\alpha)\big).$
The Winsorised mean is the mean of the clipped values $\tilde{x}_1, \dots, \tilde{x}_n$. It retains all observations but limits the damage any single one can do.
For AI: Gradient clipping in neural network training is exactly Winsorisation of gradient norms. When $\|g\|_2 > c$, the gradient is rescaled to $c \cdot g / \|g\|_2$. This is the $\ell_2$-norm analogue of Winsorising each component. The threshold $c$ is a hyperparameter analogous to the Winsorisation quantile. LLaMA, GPT-4, and most modern LLMs are trained with gradient clipping (the LLaMA paper, for instance, reports a clipping norm of 1.0).
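A sketch, with made-up numbers, contrasting the trimmed mean, the Winsorised mean, and gradient-norm clipping; the clip_gradient helper and the threshold c=1.0 are illustrative, not a specific framework's API:

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 95), rng.normal(50, 1, 5)])   # 5% gross contamination

print("mean           :", round(x.mean(), 3))                           # dragged toward 50
print("10% trimmed    :", round(stats.trim_mean(x, 0.10), 3))           # drops extremes entirely
print("10% winsorised :", round(winsorize(x, (0.10, 0.10)).mean(), 3))  # clips, keeps all points

def clip_gradient(g, c=1.0):
    """Winsorisation of the gradient norm: rescale g when ||g|| exceeds c."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

print("clipped gradient:", clip_gradient(np.array([3.0, -4.0])))        # norm 5 -> norm 1
```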
3.3 Robust Spread: MAD and IQR
Definition (Median Absolute Deviation). The MAD is the median of the absolute deviations from the median: $\mathrm{MAD} = \mathrm{median}_i\big(|x_i - \mathrm{median}(x)|\big).$
For a Gaussian distribution, $\mathrm{MAD} \approx 0.6745\,\sigma$, so a consistent estimator of $\sigma$ is $\hat{\sigma} = 1.4826 \times \mathrm{MAD}$.
The MAD has breakdown point 50% - the best possible for a location-scale estimator.
Definition (IQR). $\mathrm{IQR} = Q_3 - Q_1$. For a Gaussian, $\mathrm{IQR} \approx 1.349\,\sigma$, so $\hat{\sigma} = \mathrm{IQR}/1.349$.
Comparison table:
SPREAD ESTIMATORS - PROPERTIES
========================================================================
Estimator Breakdown Efficiency Computation Best for
--------- --------- ---------- ------------ ---------------------
Std dev s ~0% 100% O(n) Clean Gaussian data
IQR 25% 37% O(n log n) Moderate contamination
MAD 50% 37% O(n log n) Heavy contamination
Trimmed s \\alpha depends O(n log n) Tunable robustness
========================================================================
For AI: Robust spread estimation is used in anomaly detection for ML systems monitoring. When a model's loss distribution suddenly widens (increased IQR or MAD of batch losses), it signals a data quality issue or distribution shift - potentially faster and more reliably than watching the mean loss alone, since the mean is pulled by a few extreme batches.
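A sketch comparing the three spread estimators on a clean and a contaminated sample (the contamination here is synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
clean = rng.normal(0, 1, 1000)
contaminated = np.concatenate([clean, np.full(50, 25.0)])       # ~5% gross outliers

for name, x in [("clean", clean), ("contaminated", contaminated)]:
    s = x.std(ddof=1)
    sigma_mad = stats.median_abs_deviation(x, scale="normal")   # 1.4826 * MAD
    sigma_iqr = stats.iqr(x) / 1.349
    print(f"{name:>13}: std = {s:5.2f}   1.4826*MAD = {sigma_mad:5.2f}   IQR/1.349 = {sigma_iqr:5.2f}")
```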
3.4 Outlier Detection Methods
Method 1: Z-score threshold. Flag $x_i$ as an outlier if $|z_i| > 3$, where $z_i = (x_i - \bar{x})/s$. The threshold 3 flags ~0.3% of a Gaussian. Weakness: breakdown point $\to 0$; the mean and std used to compute $z_i$ are themselves distorted by outliers.
Method 2: Tukey fences. Flag $x_i$ as an outlier if $x_i < Q_1 - 1.5\,\mathrm{IQR}$ or $x_i > Q_3 + 1.5\,\mathrm{IQR}$. This is the standard box plot definition. "Extreme" outliers: use factor 3.0 instead of 1.5. Breakdown point: 25%.
Method 3: Modified Z-score (Iglewicz-Hoaglin). Uses the MAD instead of the standard deviation: $M_i = \dfrac{0.6745\,(x_i - \mathrm{median}(x))}{\mathrm{MAD}}.$
Flag $x_i$ as an outlier if $|M_i| > 3.5$. This is robust: breakdown point 50%.
Summary:
| Method | Breakdown | When to use |
|---|---|---|
| Z-score | ~0% | Clean data; fast first pass |
| Tukey fences | 25% | General purpose; default choice |
| Modified Z-score | 50% | Heavily contaminated data |
For AI: Outlier detection on training data is a data-cleaning step before model training. But outlier detection also occurs at inference time: inputs far outside the training distribution (OOD inputs) can be flagged using Mahalanobis distance (Section5.3), energy scores, or model uncertainty. The descriptive statistics methods here are the simplest and most interpretable first pass.
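A sketch implementing the three detectors side by side on a synthetic sample with a few planted outliers:

```python
import numpy as np
from scipy import stats

def zscore_outliers(x, k=3.0):
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > k

def tukey_outliers(x, factor=1.5):
    q1, q3 = np.quantile(x, [0.25, 0.75])
    iqr = q3 - q1
    return (x < q1 - factor * iqr) | (x > q3 + factor * iqr)

def modified_zscore_outliers(x, threshold=3.5):
    med = np.median(x)
    mad = stats.median_abs_deviation(x)
    return np.abs(0.6745 * (x - med) / mad) > threshold

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 200), [12.0, 15.0, -14.0]])   # 3 planted outliers
for detect in (zscore_outliers, tukey_outliers, modified_zscore_outliers):
    print(f"{detect.__name__:>26}: flagged {int(detect(x).sum())} points")
```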
4. Bivariate Statistics
4.1 Sample Covariance and Pearson Correlation
Definition (Sample Covariance). For paired samples $(x_1, y_1), \dots, (x_n, y_n)$: $s_{xy} = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$
Covariance measures the direction of linear association: $s_{xy} > 0$ means $x$ and $y$ tend to increase together; $s_{xy} < 0$ means they move in opposite directions; $s_{xy} \approx 0$ means no linear association (but possibly a nonlinear one).
Definition (Pearson Correlation Coefficient). The sample Pearson correlation is $r = \dfrac{s_{xy}}{s_x s_y} = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}.$
Properties:
- $-1 \le r \le 1$ (by the Cauchy-Schwarz inequality)
- $r = 1$: perfect positive linear relationship; $r = -1$: perfect negative linear relationship
- Invariant under affine transformations: $r(aX + b,\, cY + d) = \mathrm{sign}(ac)\, r(X, Y)$
- $r = 0$ does NOT imply independence (only the absence of linear association)
Theorem (Cauchy-Schwarz bound on $r$). By the Cauchy-Schwarz inequality applied to the centred vectors $u = (x_1 - \bar{x}, \dots, x_n - \bar{x})$ and $v = (y_1 - \bar{y}, \dots, y_n - \bar{y})$: $|\langle u, v \rangle| \le \|u\|\,\|v\|,$
which gives $|s_{xy}| \le s_x s_y$, hence $|r| \le 1$. Equality holds iff $y_i - \bar{y} = c\,(x_i - \bar{x})$ for all $i$ (a perfect linear relationship).
Non-examples of $r = 0$ (or a large $r$) telling the full story:
- $X \sim \mathrm{Uniform}(-1, 1)$, $Y = X^2$: $r = 0$ exactly, yet $Y$ is a deterministic function of $X$.
- Anscombe's Dataset II: a smooth parabola with $r \approx 0.816$, yet the true relationship is entirely non-linear.
For AI: In transformer attention, the attention weights form a row of a correlation-like matrix between query and key vectors. The dot-product is an unnormalised, uncentred covariance between the query and key representations. The softmax normalisation maps these to probabilities. Understanding Pearson correlation provides intuition for what "attention scores" actually measure.
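A quick check, assuming only NumPy/SciPy, of the $Y = X^2$ non-example: the correlation is essentially zero even though $Y$ is completely determined by $X$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 10_000)
y = x ** 2                                              # deterministic, but not linear

print(f"Pearson r = {stats.pearsonr(x, y)[0]:+.3f}")    # ~0: no linear association at all
```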
4.2 Rank Correlation: Spearman and Kendall
Definition (Spearman's $\rho_s$). Compute ranks $R_i = \mathrm{rank}(x_i)$ and $S_i = \mathrm{rank}(y_i)$ within their respective sequences, then compute the Pearson correlation on the ranks: $\rho_s = r(R, S) = 1 - \dfrac{6\sum_i d_i^2}{n(n^2 - 1)},$
where $d_i = R_i - S_i$, and the second formula holds when there are no ties.
Definition (Kendall's $\tau$). Count concordant pairs $C$ (pairs in which $x$ and $y$ increase together) and discordant pairs $D$: $\tau = \dfrac{C - D}{n(n-1)/2}.$
When to use each:
| Method | Captures | Use when |
|---|---|---|
| Pearson | Linear association | Both vars approx. Gaussian, no outliers |
| Spearman | Monotone association | Ordinal data, outliers present, non-linear monotone |
| Kendall | Concordance probability | Small samples, ordinal data, robust inference |
For AI: When auditing ML models for fairness, Spearman correlation (rather than Pearson) between model outputs and sensitive attributes is more appropriate: the relationship between a model's confidence score and a demographic attribute is rarely linear. Rank correlation is also used in information retrieval evaluation - NDCG and Kendall's $\tau$ between predicted and true rankings.
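A sketch contrasting the three coefficients on a strictly monotone but non-linear relationship (the exponential transform is illustrative):

```python
import numpy as np
from scipy import stats

x = np.arange(1, 21, dtype=float)
y = np.exp(x / 4.0)                 # strictly monotone, highly non-linear

print("Pearson :", round(stats.pearsonr(x, y)[0], 3))    # < 1: penalised by the curvature
print("Spearman:", round(stats.spearmanr(x, y)[0], 3))   # 1.0: ranks agree perfectly
print("Kendall :", round(stats.kendalltau(x, y)[0], 3))  # 1.0: every pair is concordant
```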
4.3 Association Beyond Linear Correlation
Mutual Information (preview). For discrete random variables $X$ and $Y$: $I(X; Y) = \sum_{x, y} p(x, y) \log \dfrac{p(x, y)}{p(x)\,p(y)}.$
MI captures all statistical dependencies (linear and nonlinear). $I(X; Y) = 0$ iff $X$ and $Y$ are independent. Unlike Pearson correlation, it detects the $Y = X^2$ relationship.
Forward reference: Mutual information is the canonical measure of statistical dependence with no linearity assumption. Full treatment: Chapter 9 - Information Theory.
Distance Correlation. Szekely et al. (2007): $\mathrm{dCor}(X, Y) = 0$ iff $X$ and $Y$ are independent (for all distributions with finite first moments). Computed via pairwise distance matrices. Captures non-linear dependence while remaining a proper correlation measure in $[0, 1]$.
Cramer's V (categorical association). For two categorical variables with an $r \times c$ contingency table, Cramer's V normalises the chi-squared statistic to $[0, 1]$: $V = \sqrt{\dfrac{\chi^2}{n\,\min(r - 1,\, c - 1)}}.$
4.4 Correlation vs Causation
The maxim "correlation is not causation" is one of the most important lessons in statistics for ML practitioners - and one of the most frequently violated.
Simpson's Paradox (formal example). A dataset shows that treatment A appears better than treatment B when the full dataset is used, but treatment B is better within every subgroup (men and women separately). This reversal occurs when a confounding variable (sex) is correlated with both the treatment assignment and the outcome.
In ML: A model trained on internet text learns that "doctor" co-occurs more with "he" than "she" - not because being a doctor causes gender, but because historical data reflects a correlation induced by societal biases (the confounder). The model's correlation-based weights perpetuate the bias.
Key insight: Correlation between a feature and the label is sufficient to use that feature as a predictor only if the test distribution matches the training distribution. If the mechanism generating the correlation changes (distribution shift), the correlation can reverse or disappear - but a causal relationship would remain stable.
For AI: This is why OOD generalisation is fundamentally harder than in-distribution accuracy. Models trained to exploit spurious correlations (e.g., grass background -> outdoor scene) fail when those correlations break (grass in an indoor setting). Causal ML is an active research area aimed at learning invariant, causal features rather than spurious correlations.
5. Multivariate Descriptive Statistics
5.1 Empirical Mean Vector and Covariance Matrix
Definition (Sample Mean Vector). For a data matrix $X \in \mathbb{R}^{n \times d}$: $\hat{\mu} = \dfrac{1}{n}\sum_{i=1}^{n} x_i = \dfrac{1}{n} X^\top \mathbf{1}.$
Definition (Centred Data Matrix). $X_c = X - \mathbf{1}\hat{\mu}^\top$ (subtract the mean vector from each row).
Definition (Sample Covariance Matrix). With Bessel's correction: $\hat{\Sigma} = \dfrac{1}{n-1}\, X_c^\top X_c.$
Properties of $\hat{\Sigma}$:
- Symmetric: $\hat{\Sigma} = \hat{\Sigma}^\top$
- Positive semi-definite (PSD): $v^\top \hat{\Sigma} v \ge 0$ for all $v$. Proof: $v^\top \hat{\Sigma} v = \frac{1}{n-1}\|X_c v\|_2^2 \ge 0$.
- Rank: $\mathrm{rank}(\hat{\Sigma}) \le \min(n - 1,\, d)$. If $n \le d$, the matrix is rank-deficient - it cannot be inverted. This is the classical small-$n$, large-$d$ problem.
- Diagonal entries: $\hat{\Sigma}_{jj} = s_j^2$ (the sample variance of feature $j$)
For AI: The empirical covariance matrix is the key object in PCA (eigendecomposition of gives principal components), in Gaussian discriminant analysis, in the Mahalanobis distance (Section5.3), and in LoRA-style low-rank adaptation (the weight update is designed so that the effective covariance of updates is low-rank, matching the low intrinsic dimensionality of fine-tuning data).
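A NumPy sketch that builds $\hat{\Sigma}$ from the centred data matrix, checks it against np.cov, and confirms positive semi-definiteness (the true covariance used to generate the data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 500, 3
true_cov = np.array([[1.0, 0.8, 0.2],
                     [0.8, 1.0, 0.1],
                     [0.2, 0.1, 1.0]])
X = rng.multivariate_normal(np.zeros(d), true_cov, size=n)

mu_hat = X.mean(axis=0)
Xc = X - mu_hat                              # centred data matrix
sigma_hat = Xc.T @ Xc / (n - 1)              # Bessel-corrected covariance matrix

assert np.allclose(sigma_hat, np.cov(X, rowvar=False))    # matches NumPy's estimator
print("smallest eigenvalue:", round(np.linalg.eigvalsh(sigma_hat).min(), 4))  # >= 0 => PSD
print("estimated covariance:\n", np.round(sigma_hat, 2))
```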
5.2 Correlation Matrix and Visualisation
Definition (Sample Correlation Matrix). Standardise each column: $z_{ij} = (x_{ij} - \bar{x}_j)/s_j$. Then $R$ is the sample covariance matrix of the standardised columns; its entry $R_{jk}$ is the Pearson correlation $r_{jk}$ between features $j$ and $k$.
Properties: $-1 \le R_{jk} \le 1$ for all $j, k$; $R_{jj} = 1$; $R$ is PSD with diagonal ones.
The correlation matrix is related to the covariance matrix by $R = D^{-1/2}\, \hat{\Sigma}\, D^{-1/2}$, where $D = \mathrm{diag}(\hat{\Sigma}_{11}, \dots, \hat{\Sigma}_{dd})$.
Visualisation: A heatmap of with a diverging colormap (e.g., RdBu_r) centred at 0 is the standard tool. Hierarchical clustering of rows/columns groups correlated features.
For AI: Analysing the correlation matrix of attention heads in a transformer layer reveals which heads focus on similar token relationships (high $|R_{jk}|$) and which are functionally diverse (low $|R_{jk}|$). Correlation matrices of model weights across layers are used in mechanistic interpretability to study how information is processed. High inter-layer weight correlation often signals redundancy - the basis for pruning.
5.3 Mahalanobis Distance
Definition (Mahalanobis Distance). The Mahalanobis distance from a point $x$ to the sample centre $\hat{\mu}$ is $d_M(x) = \sqrt{(x - \hat{\mu})^\top \hat{\Sigma}^{-1} (x - \hat{\mu})}.$
Interpretation: The Mahalanobis distance is the Euclidean distance after whitening - transforming the data so that all features have unit variance and zero correlation. Equivalently, it accounts for the scale and correlation structure of the data: a point that is 3 standard deviations away along a low-variance direction is farther (in Mahalanobis terms) than 3 standard deviations along a high-variance direction.
- If $\hat{\Sigma} = I$, then $d_M(x) = \|x - \hat{\mu}\|_2$ (the ordinary Euclidean distance).
- For multivariate Gaussian data, $d_M^2(x)$ approximately follows a $\chi^2_d$ distribution - so comparing $d_M^2(x)$ to a high $\chi^2_d$ quantile (e.g., the 97.5th percentile) flags multivariate outliers.
For AI: The Mahalanobis distance is used as an out-of-distribution detection score (Lee et al., 2018): compute of a test example's penultimate-layer embedding relative to the training-set class means and covariances. OOD examples have large Mahalanobis distance. This outperforms simple Euclidean-distance methods and is used in production OOD monitoring for deployed LLMs.
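A sketch of Mahalanobis-based outlier flagging with a $\chi^2$ cutoff on synthetic correlated Gaussian data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
X = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=1000)

mu_hat = X.mean(axis=0)
sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

diff = X - mu_hat
d2 = np.einsum("ij,jk,ik->i", diff, sigma_inv, diff)     # squared Mahalanobis distances

cutoff = stats.chi2.ppf(0.975, df=X.shape[1])            # 97.5th percentile of chi2_d
print(f"flagged fraction: {(d2 > cutoff).mean():.3f}  (expect roughly 0.025)")
```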
5.4 Dimensionality and the Curse
As grows, the geometry of data changes in counterintuitive ways that make descriptive statistics increasingly unreliable.
Concentration of pairwise distances. In high dimensions, all pairwise Euclidean distances concentrate around the same value: for iid features with finite variance, the ratio of the largest to the smallest pairwise distance tends to 1 as $d \to \infty$. This means nearest-neighbour methods, clustering, and any distance-based statistic lose their discriminating power - everything is nearly equally far from everything else.
Covariance matrix estimation degrades. Estimating $\hat{\Sigma}$ requires $d(d+1)/2$ parameters from $n$ observations. For $n \lesssim d$, the estimator is rank-deficient and noise-dominated. This is the classical small-$n$, large-$d$ problem. Solutions: regularised covariance (Ledoit-Wolf shrinkage), a diagonal approximation, or sparse graphical models.
Implication for attention. The softmax attention mechanism computes $d_k$-dimensional dot products $q^\top k$. As $d_k$ grows, the variance of these dot products grows linearly with $d_k$, causing the softmax to saturate (near-uniform or near-one-hot attention). The $1/\sqrt{d_k}$ scaling in scaled dot-product attention directly addresses this - it is a variance normalisation step grounded in descriptive statistics.
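A quick numerical check of the variance argument: raw dot products of iid unit-variance vectors have variance roughly $d$, while the $1/\sqrt{d}$-scaled versions stay near variance 1:

```python
import numpy as np

rng = np.random.default_rng(9)
for d in (16, 256, 4096):
    q = rng.normal(size=(10_000, d))
    k = rng.normal(size=(10_000, d))
    scores = (q * k).sum(axis=1)                # raw dot products q.k
    scaled = scores / np.sqrt(d)                # scaled dot-product attention logits
    print(f"d = {d:5d}: var(q.k) = {scores.var():8.1f}   var(q.k / sqrt(d)) = {scaled.var():.2f}")
```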
6. Data Standardisation and Preprocessing
6.1 Z-Score Standardisation
Definition (Z-Score Standardisation / Standard Scaling). Transform each observation $x_i$ to $z_i = \dfrac{x_i - \bar{x}}{s}.$
The result has sample mean $0$ and sample standard deviation $1$.
Why it matters for ML: Gradient descent convergence depends critically on the scale of features. If one feature ranges over thousands of units and another over fractions of a unit, the gradient of the loss with respect to the first feature's weight is far larger - the loss landscape is elongated, and gradient descent follows a zigzag path rather than heading directly toward the minimum. Standardising features makes the loss landscape more spherical, dramatically improving gradient descent convergence.
Implementation note: Always fit the scaler on the training set only, then apply the same mean and standard deviation to the validation and test sets. Fitting on the full dataset (including test) is data leakage - it allows test-set information to influence preprocessing.
Properties:
- The z-score is dimensionless (units cancel).
- Does not change the shape of the distribution (skewness and kurtosis are invariant).
- Sensitive to outliers (outliers inflate $s$, compressing the bulk of the data toward zero).
6.2 Min-Max and Robust Scaling
Definition (Min-Max Scaling). Maps each observation into $[0, 1]$: $x_i' = \dfrac{x_i - \min_j x_j}{\max_j x_j - \min_j x_j}.$
Useful when features must lie in a fixed range (e.g., inputs to sigmoid activations, image pixel values). Breakdown point: $1/n \to 0$ (one extreme outlier shifts the entire scale).
Definition (Robust Scaling). Uses the median and IQR instead of the mean and standard deviation: $x_i' = \dfrac{x_i - \mathrm{median}(x)}{\mathrm{IQR}}.$
Result has sample median 0 and IQR 1. Breakdown point: 25%. Appropriate when the data is heavily contaminated.
Decision guide:
SCALING METHOD SELECTION
========================================================================
Data has outliers? --Yes--> Robust Scaling (median / IQR)
No outliers, and
need [0,1] range? --Yes--> Min-Max Scaling
No outliers, and
need zero mean,
unit variance? ------> Z-Score Standardisation (default)
========================================================================
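A sketch of the three scalers with the train-only fitting discipline from Section6.1; the fit_* helper names are illustrative, not a library API:

```python
import numpy as np

def fit_zscore(train):
    mu, sd = train.mean(axis=0), train.std(axis=0, ddof=1)
    return lambda X: (X - mu) / sd

def fit_minmax(train):
    lo, hi = train.min(axis=0), train.max(axis=0)
    return lambda X: (X - lo) / (hi - lo)

def fit_robust(train):
    med = np.median(train, axis=0)
    q1, q3 = np.quantile(train, [0.25, 0.75], axis=0)
    return lambda X: (X - med) / (q3 - q1)

rng = np.random.default_rng(10)
X = rng.lognormal(size=(1000, 2))              # illustrative skewed features
X_train, X_test = X[:800], X[800:]             # statistics must come from the train split only

scale = fit_robust(X_train)                    # swap in fit_zscore / fit_minmax as needed
print("test medians after robust scaling:", np.round(np.median(scale(X_test), axis=0), 2))
```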
6.3 Normalisation in Neural Networks
Modern neural networks embed descriptive statistics directly into their architecture through normalisation layers. Every such layer computes sample statistics - the difference lies in which samples are used.
Batch Normalisation (Ioffe & Szegedy, 2015). For a feature channel $j$ across a mini-batch of $m$ examples: $\hat{h}_{ij} = \dfrac{h_{ij} - \hat{\mu}_j^B}{\sqrt{(\hat{\sigma}_j^B)^2 + \varepsilon}}, \qquad y_{ij} = \gamma_j \hat{h}_{ij} + \beta_j,$
where $\hat{\mu}_j^B = \frac{1}{m}\sum_i h_{ij}$ and $(\hat{\sigma}_j^B)^2 = \frac{1}{m}\sum_i (h_{ij} - \hat{\mu}_j^B)^2$. These are sample statistics across the batch for a fixed feature. The parameters $\gamma_j, \beta_j$ are learned.
Layer Normalisation (Ba et al., 2016). For a single token embedding $h \in \mathbb{R}^d$: $\hat{h}_{j} = \dfrac{h_{j} - \hat{\mu}^L}{\sqrt{(\hat{\sigma}^L)^2 + \varepsilon}},$
where $\hat{\mu}^L = \frac{1}{d}\sum_j h_j$ and $(\hat{\sigma}^L)^2 = \frac{1}{d}\sum_j (h_j - \hat{\mu}^L)^2$. These are sample statistics across features for a fixed token. Used in all transformer models.
RMS Norm (Zhang & Sennrich, 2019). Used in LLaMA, Mistral, Gemma: $\hat{h}_j = \gamma_j\, \dfrac{h_j}{\mathrm{RMS}(h)}, \qquad \mathrm{RMS}(h) = \sqrt{\dfrac{1}{d}\sum_{j=1}^{d} h_j^2 + \varepsilon}.$
This is standardisation without mean-centering: divide by the root mean square (which equals the standard deviation when the mean is 0).
NORMALISATION LAYERS - WHAT STATISTICS THEY COMPUTE
========================================================================
Batch Norm Layer Norm RMS Norm Group Norm
-----------------------------------------------------------
Statistic mean, var mean, var rms only mean, var
Over batch x HxW feature dim feature dim feature groups
Per channel token token group per token
Train \\neq Yes (running No No No
Inference stats used)
Seq vary Breaks for Works fine Works fine Works fine
length variable len
-----------------------------------------------------------
Used in CNNs, ResNet All Transformers LLaMA CNNs+some
(GPT, BERT, etc) Mistral transformers
========================================================================
6.4 Handling Missing Data
Missing data is the rule rather than the exception in real ML datasets. The strategy for handling it depends on the missing data mechanism:
- MCAR (Missing Completely At Random): missingness is independent of all variables. Safe to delete rows or use simple imputation.
- MAR (Missing At Random): missingness depends on observed variables but not on the missing value itself. Model-based imputation works.
- MNAR (Missing Not At Random): missingness depends on the missing value (e.g., high-income people don't report income). Simple imputation is biased; need specialised methods.
Simple imputation strategies:
| Strategy | Use for | Advantage | Disadvantage |
|---|---|---|---|
| Mean imputation | Continuous, MCAR | Fast, simple | Reduces variance; biased for skewed data |
| Median imputation | Continuous, outliers | Robust | Reduces variance |
| Mode imputation | Categorical | Simple | Biases category frequencies |
| Forward fill | Time series | Preserves temporal structure | Wrong for gaps > 1 step |
For AI: Missing data in training datasets often encodes information that models will exploit as a spurious correlation. If high-income records are more likely to have missing income fields, a model that has access to missingness indicators may learn to predict high income from missing income - a label leakage that fails catastrophically when data collection improves.
7. Visualisation for EDA
7.1 Univariate Plots
Histogram. Bins the range into $k$ equal intervals and counts observations in each. The choice of $k$ (equivalently, the bin width $h$) is critical:
- Sturges' rule: $k = \lceil \log_2 n \rceil + 1$. Simple; gives too few bins (oversmooths) for large $n$.
- Scott's rule: $h = 3.49\, s\, n^{-1/3}$. Optimal (in mean integrated squared error) for Gaussian data.
- Freedman-Diaconis rule: $h = 2\,\mathrm{IQR}\, n^{-1/3}$. Robust to outliers; best general choice.
Kernel Density Estimate (KDE). A smooth non-parametric estimate of the PDF: $\hat{f}_h(x) = \dfrac{1}{nh}\sum_{i=1}^{n} K\!\left(\dfrac{x - x_i}{h}\right),$
where $K$ is a kernel (typically Gaussian) and $h$ is the bandwidth. The optimal bandwidth under a Gaussian kernel for Gaussian data is Silverman's rule: $h = 1.06\, s\, n^{-1/5}$.
Box Plot (Tukey, 1977). Shows the five-number summary with: a box from $Q_1$ to $Q_3$; a median line; whiskers extending to the most extreme non-outlier points (within $1.5\,\mathrm{IQR}$ of the box); dots for outliers beyond the fences.
Empirical CDF. Plot $\hat{F}_n(x)$ as a step function. Advantages over the histogram: no binning choice; shows quantiles directly; convergence to $F$ is guaranteed by Glivenko-Cantelli.
Violin Plot. Combines a box plot with a mirrored KDE - shows both summary statistics and distributional shape. Preferred over box plots when the sample size is sufficient to estimate the density reliably.
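A short sketch comparing Scott's and Freedman-Diaconis bin widths on an illustrative long-tailed sample (NumPy exposes both rules through np.histogram_bin_edges):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(14)
x = rng.lognormal(size=2000)                      # skewed sample with a long right tail

n = len(x)
h_scott = 3.49 * x.std(ddof=1) * n ** (-1 / 3)
h_fd = 2 * stats.iqr(x) * n ** (-1 / 3)           # Freedman-Diaconis: IQR is robust to the tail

print(f"Scott bin width            : {h_scott:.3f}")
print(f"Freedman-Diaconis bin width: {h_fd:.3f}")
print("bin counts via NumPy ('scott', 'fd'):",
      len(np.histogram_bin_edges(x, bins='scott')) - 1,
      len(np.histogram_bin_edges(x, bins='fd')) - 1)
```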
7.2 Bivariate and Multivariate Plots
Scatter Plot. Plot vs for all . Add a regression line and confidence band. Colour by a third variable. The go-to tool for bivariate exploratory analysis.
Pair Plot (Scatter Matrix). For a $d$-dimensional dataset, create a $d \times d$ grid where cell $(j, k)$ shows the scatter plot of feature $j$ vs feature $k$; diagonal cells show histograms or KDEs. Standard library: seaborn.pairplot.
Correlation Heatmap. Display $R$ as a colour-coded matrix. Use a diverging colormap centred at 0 (e.g., coolwarm or RdBu_r). Cluster rows/columns by hierarchical clustering to reveal block structure.
Parallel Coordinates. Each axis corresponds to one feature; each observation is a polyline connecting its values across all axes. Reveals clusters and correlations in high-dimensional data without reducing dimensionality.
7.3 Assessing Normality
Many statistical methods (t-tests, confidence intervals, OLS regression) assume normally distributed data. Assessing normality is therefore a prerequisite for applying them.
Q-Q Plot (Quantile-Quantile Plot). Plot theoretical Gaussian quantiles (x-axis) against sample quantiles (y-axis). If the data is Gaussian, the points lie on a straight line ($y = x$ after standardising). Deviations reveal:
- An S-shaped pattern with points below the line at the left end and above it at the right end -> heavier tails than Gaussian (leptokurtic, common in financial and gradient data)
- The reverse S (points above the line at the left end, below it at the right end) -> lighter tails (platykurtic)
- A curve bending upward, with points above the line at the top -> right skew
Probability Plot. Same idea as Q-Q but with standardised axes; equivalent to Q-Q for location-scale families.
Shapiro-Wilk Test (preview). A formal hypothesis test for normality. $H_0$: the data is Gaussian; a $W$ statistic close to 1 supports normality; a small $p$-value rejects normality.
Forward reference: Full hypothesis tests for distributional assumptions - Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling - are covered in Section03 Hypothesis Testing.
8. Applications in Machine Learning
8.1 Batch Normalisation and Layer Normalisation
The two dominant normalisation strategies in deep learning are directly grounded in the sample statistics developed in Section2.
Batch Norm workflow:
- For each feature channel $j$ in a mini-batch of size $m$:
- Compute $\hat{\mu}_j^B = \frac{1}{m}\sum_i h_{ij}$ (sample mean, divisor $m$, NOT $m-1$)
- Compute $(\hat{\sigma}_j^B)^2 = \frac{1}{m}\sum_i (h_{ij} - \hat{\mu}_j^B)^2$ (biased sample variance, the MLE)
- Normalise: $\hat{h}_{ij} = (h_{ij} - \hat{\mu}_j^B) \,/\, \sqrt{(\hat{\sigma}_j^B)^2 + \varepsilon}$
- Scale and shift: $y_{ij} = \gamma_j \hat{h}_{ij} + \beta_j$
- At test time, use exponential moving averages of $\hat{\mu}_j^B$ and $(\hat{\sigma}_j^B)^2$ accumulated during training.
Why Batch Norm uses the biased variance (divisor $m$, not $m-1$): Batch Norm is not trying to estimate the population variance; it is normalising the activations within the current mini-batch. Unbiasedness is not the goal - numerical stability and consistent normalisation are. The $\varepsilon$ term (typically $10^{-5}$) prevents division by zero.
Layer Norm workflow:
- For each token embedding $h \in \mathbb{R}^d$ in the sequence:
- Compute $\hat{\mu}^L = \frac{1}{d}\sum_j h_j$ (mean over the $d$ feature dimensions)
- Compute $(\hat{\sigma}^L)^2 = \frac{1}{d}\sum_j (h_j - \hat{\mu}^L)^2$
- Normalise: $\hat{h}_j = (h_j - \hat{\mu}^L) \,/\, \sqrt{(\hat{\sigma}^L)^2 + \varepsilon}$
- Apply learned $\gamma$ and $\beta$.
Why transformers use Layer Norm, not Batch Norm: Batch Norm requires a fixed batch size and breaks with variable-length sequences (because the normalisation statistics change with sequence length). Layer Norm normalises within each token independently, making it sequence-length-agnostic and suitable for autoregressive generation (where batch size = 1 at inference).
Why RMSNorm (LLaMA, Mistral, Gemma) drops the mean: Empirically, re-centering (subtracting the mean) provides marginal benefit compared to re-scaling (dividing by RMS). RMSNorm is cheaper to compute (no mean subtraction) and achieves comparable training stability.
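A minimal NumPy sketch of Layer Norm and RMS Norm as row-wise sample statistics; it mirrors the formulas above, whereas a real transformer implementation would operate on framework tensors:

```python
import numpy as np

def layer_norm(H, gamma, beta, eps=1e-5):
    """Normalise each row (token) over its feature dimension, then scale and shift."""
    mu = H.mean(axis=-1, keepdims=True)
    var = H.var(axis=-1, keepdims=True)            # divisor d, as in the definition
    return gamma * (H - mu) / np.sqrt(var + eps) + beta

def rms_norm(H, gamma, eps=1e-5):
    """No mean subtraction: divide each row by its root mean square."""
    rms = np.sqrt((H ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * H / rms

rng = np.random.default_rng(11)
H = rng.normal(loc=3.0, scale=2.0, size=(4, 8))    # 4 tokens, 8 features
gamma, beta = np.ones(8), np.zeros(8)

print("LN row means :", np.round(layer_norm(H, gamma, beta).mean(axis=1), 6))          # ~0
print("RMS-norm rows:", np.round(np.sqrt((rms_norm(H, gamma) ** 2).mean(axis=1)), 3))  # ~1
```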
8.2 Adam Optimiser as Online Sample Statistics
The Adam optimiser (Kingma & Ba, 2015) maintains running estimates of the first and second moments of the gradient distribution, updated at each step $t$: $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2.$
These are exponentially-weighted moving averages (EWMA) of $g_t$ and $g_t^2$. Initialising $m_0 = v_0 = 0$ introduces a bias toward zero at early steps (analogous to the bias of the divisor-$n$ variance estimator for finite samples). Adam corrects this with: $\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}.$
These are the bias-corrected moment estimates. The analogy to Bessel's correction: both corrections account for the fact that a short history (a small $t$, or a small $n$) underestimates the true statistic.
The Adam update is: $\theta_{t+1} = \theta_t - \alpha\, \dfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}.$
The denominator $\sqrt{\hat{v}_t} + \varepsilon$ is an estimate of the gradient's root-mean-square - an adaptive per-parameter learning rate that scales inversely with the gradient's historical magnitude. Parameters with high-variance gradients get smaller updates; parameters with low-variance gradients get larger updates.
Connection to descriptive statistics: $\hat{m}_t$ is a smoothed estimate of the gradient mean; $\hat{v}_t$ is a smoothed estimate of the uncentred second moment (equal to the variance plus the squared mean). Adam is performing online, exponentially-weighted descriptive statistics on the gradient stream.
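A sketch of Adam's moment tracking on a synthetic gradient stream, showing the bias-corrected estimates settling near the stream's true mean and uncentred second moment:

```python
import numpy as np

def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Bias-corrected EWMA estimates of the gradient mean and uncentred second moment."""
    m = v = 0.0
    m_hats, v_hats = [], []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g            # EWMA of gradients
        v = beta2 * v + (1 - beta2) * g ** 2       # EWMA of squared gradients
        m_hats.append(m / (1 - beta1 ** t))        # corrects the zero-initialisation bias
        v_hats.append(v / (1 - beta2 ** t))
    return np.array(m_hats), np.array(v_hats)

rng = np.random.default_rng(12)
grads = rng.normal(loc=0.5, scale=1.0, size=5000)  # true mean 0.5, true E[g^2] = 1.25

m_hat, v_hat = adam_moments(grads)
# m_hat has an effective window of ~1/(1-beta1) = 10 steps, so average away its fluctuation.
print("mean of m_hat over last 1000 steps:", round(m_hat[-1000:].mean(), 3), "(true mean 0.5)")
print("final v_hat:", round(v_hat[-1], 3), "(true uncentred second moment 1.25)")
```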
8.3 Dataset Characterisation and Model Cards
The Datasheets for Datasets framework (Gebru et al., 2020) and Hugging Face Dataset Cards require quantitative documentation of every training dataset. The required statistics are directly from descriptive statistics:
- Sample size: $n$ (rows), $d$ (features/columns)
- Label distribution: frequency table or histogram for each output class
- Feature statistics: mean, std, min, Q1, median, Q3, max for each numeric feature
- Missing values: count and fraction per column
- Correlation structure: correlation matrix or heatmap between features and between features and label
- Demographic representation: distribution of sensitive attributes (age, gender, race, geography)
- Temporal coverage: date range, recency bias
For embedding spaces (used in retrieval-augmented generation, semantic search, and vector databases): intra-cluster variance (how spread out a class's embeddings are), inter-cluster distance (how separated different classes are), and cosine similarity distribution (the histogram of pairwise cosine similarities across the corpus).
8.4 Data Drift and Distribution Shift
In production ML systems, the data generating distribution may diverge from the training distribution . This is called distribution shift or data drift. Monitoring descriptive statistics is the first line of defence.
Types of drift:
- Covariate shift: changes, stays the same. Feature means shift.
- Label shift: changes, stays the same. Label proportions change.
- Concept drift: changes. Same inputs now map to different outputs.
Detection via descriptive statistics:
| Drift type | Monitoring statistic | Alert when |
|---|---|---|
| Covariate shift | Feature means, variances | Mean shifts by more than a preset multiple of the training-set std |
| Skewness shift | Feature skewness | Absolute change in $g_1$ exceeds a preset threshold |
| Outlier rate | Fraction > Tukey fence | Rate doubles from baseline |
| Label imbalance | Class frequency table | Class proportions shift |
| Any distribution shift | PSI (Population Stability Index) | PSI > 0.2 |
The Population Stability Index (PSI) is the symmetric KL divergence between the training and production distributions, computed on binned data - a descriptive statistic that quantifies how much the overall distribution has shifted.
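A sketch of a PSI computation using quantile bins fitted on the training sample; the 0.2 alert threshold follows the table above:

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference (training) sample and a production sample."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                    # cover the whole real line
    p = np.histogram(expected, bins=edges)[0] / len(expected)
    q = np.histogram(actual, bins=edges)[0] / len(actual)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)    # avoid log(0) on empty bins
    return float(np.sum((p - q) * np.log(p / q)))            # symmetric KL on binned data

rng = np.random.default_rng(13)
train = rng.normal(0.0, 1.0, 10_000)
prod_same = rng.normal(0.0, 1.0, 10_000)
prod_shifted = rng.normal(0.5, 1.3, 10_000)                  # drifted mean and spread

print(f"PSI, no drift  : {psi(train, prod_same):.3f}")       # close to zero
print(f"PSI, with drift: {psi(train, prod_shifted):.3f}")    # above the 0.2 alert threshold
```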
Forward reference: Formal hypothesis tests for detecting drift (KS test, two-sample t-test, chi-squared test) are covered in Section03 Hypothesis Testing.
9. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using the mean for skewed or heavy-tailed data | A few extreme values (income, house prices, gradient norms) pull the mean far from the "typical" value; it no longer represents the centre | Report median alongside mean; use robust statistics for skewed features |
| 2 | Using $n$ instead of $n-1$ for sample variance | Dividing by $n$ gives the MLE of $\sigma^2$ under a Gaussian, which is biased: $\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2$ | Always use ddof=1 in NumPy/SciPy (np.var(x, ddof=1)) unless computing BN-style normalisation |
| 3 | Fitting the scaler on train+test data | The test-set statistics (mean, std) leak into the preprocessing, artificially improving reported test performance | Fit the scaler only on the training set; transform the test set using training-set parameters |
| 4 | Interpreting $r = 0$ as independence | Zero Pearson correlation means no linear association; nonlinear dependence can be arbitrarily strong | Also compute distance correlation or mutual information; always scatter-plot the pair |
| 5 | Concluding causation from correlation | Spurious correlations (Simpson's paradox, confounders, selection bias) produce $r \ne 0$ with no causal link | Use controlled experiments (A/B tests) or causal inference methods to test causal claims |
| 6 | Removing outliers without investigation | An outlier may be the most interesting data point (a fraud transaction, a rare disease case, a critical edge case for a safety model) | Investigate each outlier's source before deciding to remove vs. keep vs. separate |
| 7 | Standardising after train/test split but using test statistics | Same as mistake 3 in reverse - computing the standard deviation on the test set introduces leakage | Always use training statistics for all scaling |
| 8 | Using Pearson correlation for ordinal data | Ordinal data (ratings 1-5, Likert scales) has rank order but not equal intervals; Pearson assumes interval scale | Use Spearman or Kendall for ordinal data |
| 9 | Ignoring the dimensionality of the covariance matrix | For $n \le d$, $\hat{\Sigma}$ is rank-deficient and non-invertible; Mahalanobis distance and PCA are undefined or numerically unstable | Regularise: add $\lambda I$ (shrinkage), use a diagonal approximation, or reduce dimensionality first |
| 10 | Reporting only mean accuracy as the evaluation metric | Mean accuracy aggregates over all classes; for imbalanced data (99% negative class), a trivial classifier gets 99% accuracy | Also report median, per-class accuracy, precision, recall, F1; use stratified evaluation |
| 11 | Choosing histogram bin width arbitrarily | Too few bins hides multimodality; too many bins creates noise that looks like structure | Use Freedman-Diaconis rule; compare histogram to KDE; be explicit about bin choice |
| 12 | Confusing sample skewness and distribution skewness | High sample skewness for small $n$ may be noise, not a true distributional property | Check confidence intervals for $g_1$; use the bootstrap to assess sampling variability |
10. Exercises
Exercise 1 * - Central Tendency and Spread Given the dataset : (a) Compute mean, median, and mode. (b) Compute sample variance (with Bessel correction), standard deviation, and IQR. (c) Explain which measure of centre is most representative of the data and why. (d) Remove the outlier 100 and recompute all statistics. By what factor does the mean change? The median?
Exercise 2 * - Five-Number Summary and Outlier Detection Generate 200 observations from with 5 additional outliers at . (a) Compute the five-number summary. (b) Identify outliers using Tukey fences (). (c) Identify outliers using the modified Z-score with MAD. (d) Compare which method flags more false positives on the clean Gaussian data (benchmark: critical values for ).
Exercise 3 * - Empirical CDF and Q-Q Plot Generate 100 samples from a (Student- with 3 degrees of freedom) distribution. (a) Construct and plot the empirical CDF . (b) Construct a Q-Q plot against the Gaussian distribution. (c) Describe what the Q-Q plot reveals about the tails of the distribution relative to the Gaussian. (d) Compute the excess kurtosis and interpret the result.
Exercise 4 ** - Robust Statistics Under Contamination Start with 95 samples from . Add 5 outliers drawn from (contamination fraction 5%). (a) Compare mean vs median vs 10%-trimmed mean on the contaminated dataset. Which is closest to 0? (b) Compare std vs MAD vs IQR. Which is closest to 1? (c) Compute the influence of the single largest outlier on each statistic: replace it with and measure the change. (d) Plot the estimates as a function of contamination fraction .
Exercise 5 ** - Covariance Matrix and Mahalanobis Distance Generate 200 samples from with and . (a) Compute the sample mean vector and covariance matrix . Verify they are close to the true values. (b) Verify that is PSD. (c) Compute the Mahalanobis distance from each point to . Plot a histogram of . (d) Compare the histogram to a distribution. What percentage of points exceed the 97.5th percentile?
Exercise 6 ** - Anscombe's Quartet Implement all four of Anscombe's datasets from scratch (exact values from the original paper). (a) Compute and verify that all four datasets have the same means ($\bar{x} = 9$, $\bar{y} \approx 7.50$), variances ($s_x^2 = 11$, $s_y^2 \approx 4.12$), and Pearson correlation ($r \approx 0.816$). (b) Fit a linear regression to each dataset and verify all four give the same coefficients. (c) Plot all four scatter plots with the fitted line. (d) Compute skewness and kurtosis for each dataset's residuals. Do they differ? What does this reveal?
Exercise 7 *** - Batch Norm vs Layer Norm from Scratch
(a) Implement batch_norm(X, gamma, beta, eps=1e-5) where X has shape (batch, features), normalising across the batch dimension.
(b) Implement layer_norm(X, gamma, beta, eps=1e-5), normalising across the feature dimension.
(c) Implement rms_norm(X, gamma, eps=1e-5), normalising by RMS only (no mean subtraction).
(d) Apply all three to a random matrix of shape . Verify: for BN, each column (feature) has mean\approx0, std\approx1; for LN, each row (sample) has mean\approx0, std\approx1; for RMSNorm, each row has RMS\approx1.
(e) Verify that when applied to variable-batch-size inputs (1, 4, 32), BN statistics change but LN statistics are invariant. Discuss implications for inference.
Exercise 8 *** - Adam Momentum as Running Sample Statistics (a) For a gradient sequence drawn from , compute Adam's first moment and second moment with , . (b) Plot the bias-corrected estimates and alongside the true running mean and uncentred second moment. (c) Show that without bias correction, underestimates by a factor of at step . (d) Vary and plot how quickly converges to the true mean. Connect this to the "effective window size" .
11. Why This Matters for AI (2026 Perspective)
| Concept | AI / LLM Application | Impact |
|---|---|---|
| Sample mean and variance | Batch Norm, Layer Norm, RMS Norm in every transformer | Core to training stability; without it, deep networks diverge |
| Robust statistics (MAD, trimmed mean) | Byzantine-robust federated learning; gradient aggregation under adversarial workers | Enables privacy-preserving training that resists poisoning attacks |
| Gradient clipping = Winsorisation | Training GPT-4, LLaMA, Mistral; prevents loss spikes | Without gradient clipping, large models regularly diverge during training |
| Adam = online sample statistics | Universal optimiser for neural networks (90%+ of published LLM training) | Bias-corrected moment estimates make adaptive learning rates work |
| Outlier detection | OOD detection, anomaly detection in production model monitoring | Flagging adversarial or corrupted inputs before they cause harmful outputs |
| Covariance matrix + Mahalanobis distance | OOD detection from penultimate-layer embeddings (Lee et al. 2018) | State-of-the-art OOD scoring without retraining the model |
| Correlation matrix | Attention head redundancy analysis; mechanistic interpretability | Identifies which attention heads can be pruned to reduce inference cost |
| $1/\sqrt{d_k}$ attention scaling | Prevents softmax saturation in high-dimensional attention (FlashAttention, MLA) | Directly derived from variance concentration in high dimensions |
| Empirical CDF + quantile functions | Calibration evaluation; Wasserstein distance for dataset comparison | Measuring whether a model's confidence matches its accuracy |
| Standardisation (Z-score) | Feature preprocessing for all tabular ML; embedding normalisation | Ensures equal-scale features; required for SGD convergence |
| Skewness and kurtosis | Gradient noise modelling; heavy-tailed gradient distributions | Justifies Student-$t$ noise models over Gaussian for stochastic gradient methods |
| Label distribution (frequency table) | Class-weighted loss, oversampling, dataset cards for responsible AI | Prevents models from learning to predict the majority class trivially |
| Data drift (PSI, KL) | Production monitoring for deployed LLMs and recommendation systems | Early warning system for model degradation; triggers retraining |
| Simpson's paradox + confounders | Fairness auditing; understanding why a model exhibits bias | Correlation-based bias is masked by aggregation; subgroup analysis reveals it |
12. Conceptual Bridge
Descriptive statistics is the foundation on which all statistical inference rests. To draw any conclusion from data - to estimate a parameter, test a hypothesis, train a Bayesian model, or fit a regression - you must first understand what the data actually looks like. This section has provided the complete vocabulary: the mean, median, and mode as measures of centre; variance, standard deviation, IQR, and MAD as measures of spread; skewness and kurtosis as measures of shape; quantiles and the ECDF as complete distributional summaries; covariance and correlation as measures of association; and the covariance matrix and Mahalanobis distance as multivariate generalisations.
The next section, Section02 Estimation Theory, takes the sample statistics developed here and asks: how good are they as estimators of the underlying population parameters? The sample mean $\bar{x}$ is an estimator of $\mu$ - but what is its bias? What is its variance? Is there a better estimator? The Cramer-Rao lower bound sets a fundamental limit on how precisely any unbiased estimator can estimate a parameter, expressed in terms of Fisher information. Bessel's correction (Section2.3) is a preview of estimator theory: the divisor $n-1$ corrects the bias of the sample variance estimator.
Looking back, this section builds directly on probability theory. The population parameters $\mu$, $\sigma^2$, and $\rho$ were defined in Chapter 6 Section04 Expectation and Moments. What we compute here are their sample analogues - functions of observed data designed to estimate those population quantities. The Glivenko-Cantelli theorem (Section2.5) provides the theoretical guarantee connecting the ECDF to the population CDF: it is a specialisation of the law of large numbers proved in Ch6 Section06 Stochastic Processes.
POSITION IN THE CURRICULUM
========================================================================
CHAPTER 6: PROBABILITY THEORY
+----------------------------------------------------------------+
| Section01 Intro + RVs -> Section02 Distributions -> Section03 Joint |
| Section04 Expectation -> Section05 Concentration -> Section06 Stochastic |
| Section07 Markov Chains |
| (population parameters \\mu, \\sigma^2, \\rho defined here) |
+----------------------------------+-----------------------------+
| sample analogues
v
CHAPTER 7: STATISTICS
+----------------------------------------------------------------+
| Section01 DESCRIPTIVE STATISTICS <-- YOU ARE HERE |
| "Summarise data before you model it" |
| xbar, s^2, r, ECDF, IQR, MAD, \\Sigmahat, Mahalanobis, z-score |
+----------------------------------+-----------------------------+
| "how good are these estimators?"
v
+----------------------------------------------------------------+
| Section02 Estimation Theory |
| Bias, variance, MSE, Cramer-Rao, MLE, Fisher information |
+----------------------------------+-----------------------------+
| "make decisions from estimates"
v
+----------------------------------------------------------------+
| Section03 Hypothesis Testing -> Section04 Bayesian Inference |
| Section05 Time Series -> Section06 Regression Analysis |
+----------------------------------------------------------------+
========================================================================
<- Back to Chapter 7: Statistics | Next: Estimation Theory ->
Appendix A: Worked Examples
A.1 Computing All Univariate Statistics by Hand
Dataset: ()
Step 1 - Sort:
Step 2 - Central tendency:
- Mean:
- Median (, even):
- Mode: (appears 3 times)
Step 3 - Spread:
- Deviations:
- Squared deviations sum:
- Sample variance: ;
- Range:
- ; ; IQR
Step 4 - Shape:
- Skewness : the distribution is nearly symmetric (mean \approx median \approx mode ); (very slight right skew due to 14)
- All values within Tukey fences: - no outliers
Step 5 - Five-number summary:
A.2 Pearson vs Spearman on a Monotone Nonlinear Dataset
Consider for and :
- ,
- Pearson (strong, but is highly curved so )
- Spearman (perfect rank correlation - the mapping is monotone)
This illustrates: Spearman detects the perfect monotone relationship that Pearson misses.
Now consider and for :
- Pearson : looks like no association
- Spearman : same conclusion
But the relationship is highly structured - it's non-monotone. Both Pearson and Spearman fail to detect it. This is why visualisation is always the first step.
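The pattern in A.2 is easy to reproduce numerically. The sketch below uses illustrative datasets (a monotone exponential curve and a symmetric parabola - assumptions, not the exact data above) to show Spearman detecting a monotone relationship that Pearson understates, and both coefficients missing a non-monotone one.

```python
import numpy as np
from scipy import stats

# Monotone but nonlinear (illustrative choice, not the exact data from A.2)
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)
print("monotone:     Pearson r =", round(np.corrcoef(x, y)[0, 1], 3))            # strong, but < 1
print("monotone:     Spearman  =", stats.spearmanr(x, y).statistic)              # exactly 1.0

# Non-monotone: symmetric parabola (again an illustrative choice)
x2 = np.linspace(-3.0, 3.0, 50)
y2 = x2 ** 2
print("non-monotone: Pearson r =", round(np.corrcoef(x2, y2)[0, 1], 3))          # ~ 0
print("non-monotone: Spearman  =", round(stats.spearmanr(x2, y2).statistic, 3))  # ~ 0
```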
A.3 Covariance Matrix - A Complete Worked Example
Dataset: Standardised heights (cm), weights (kg), age (years) for 5 individuals:
Centred matrix :
Interpretation:
- Height and weight have high positive covariance (): taller people tend to weigh more
- Height and age have low positive covariance (): weak association in this sample
- Diagonal: variances are , , - weight is most variable
Correlation matrix: ; height and weight are very strongly correlated.
Appendix B: Key Formulas Quick Reference
DESCRIPTIVE STATISTICS - FORMULA REFERENCE
========================================================================
CENTRAL TENDENCY
-----------------
Sample mean: xbar = (1/n) \\Sigma x_i
Sample median: middle value of sorted data
Trimmed mean: mean of middle (1-2\\alpha)n observations
SPREAD
--------
Sample variance: s^2 = (1/(n-1)) \\Sigma(x_i - xbar)^2
Std deviation: s = \\sqrt(s^2)
IQR: Q_3 - Q_1
MAD: median(|x_i - median(x)|)
Robust \\sigmahat: 1.4826 x MAD
SHAPE
-------
Skewness: g_1 = m_3 / m_2^(3/2), m_k = (1/n)\\Sigma(x_i-xbar)^k
Excess kurtosis: g_2 = m_4 / m_2^2 - 3
QUANTILES
----------
Empirical CDF: Fhat_n(x) = (1/n)\\Sigma 1[x_i \\leq x]
p-quantile: Q(p) = inf{x : Fhat_n(x) \\geq p}
Five-number: {min, Q_1, median, Q_3, max}
BIVARIATE
----------
Covariance: sxy = (1/(n-1)) \\Sigma(x_i-xbar)(y_i-y_bar)
Pearson r: r = sxy / (sx sy)
Spearman \\rho: Pearson r applied to ranks
MULTIVARIATE
-------------
Mean vector: \\muhat = (1/n) X^T 1
Covariance matrix: \\Sigmahat = (1/(n-1)) Xc^T Xc
Correlation matrix: R = D^{-1/2} \\Sigmahat D^{-1/2}
Mahalanobis dist: d_M(x) = \\sqrt[(x-\\muhat)^T \\Sigmahat^{-1} (x-\\muhat)]
STANDARDISATION
-----------------
Z-score: z_i = (x_i - xbar) / s
Min-max: x_i_mm = (x_i - min) / (max - min)
Robust: x_i_rob = (x_i - median) / IQR
NORMALISATION LAYERS
---------------------
Batch Norm: hhat_{ij} = (h_{ij} - \\muhat_j^B) / \\sqrt((\\sigmahat_j^B)^2 + \\varepsilon) [over batch]
Layer Norm: hhat_{ij} = (h_{ij} - \\muhat_i^L) / \\sqrt((\\sigmahat_i^L)^2 + \\varepsilon) [over features]
RMS Norm: hhat_{ij} = h_{ij} / RMS(h_i) [no mean subtraction]
ADAM MOMENTS
-------------
First moment: m_t = \\beta_1 m_{t-1} + (1-\\beta_1) g_t
Second moment: v_t = \\beta_2 v_{t-1} + (1-\\beta_2) g_t^2
Bias correction: mhat_t = m_t/(1-\\beta_1^t), vhat_t = v_t/(1-\\beta_2^t)
Update: \\theta_{t+1} = \\theta_t - \\alpha \\cdot mhat_t / (\\sqrt(vhat_t) + \\varepsilon)
========================================================================
Appendix C: Proofs and Derivations
C.1 Proof: Sample Mean is Unbiased
Claim: $\mathbb{E}[\bar{x}] = \mu$, where $x_1, \dots, x_n$ are iid with mean $\mu$.
Proof:
$$\mathbb{E}[\bar{x}] = \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^n x_i\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[x_i] = \frac{1}{n}\, n\mu = \mu.$$
This uses linearity of expectation and the iid assumption ($\mathbb{E}[x_i] = \mu$ for all $i$).
C.2 Proof: Variance of the Sample Mean
Claim: $\operatorname{Var}(\bar{x}) = \sigma^2 / n$.
Proof: By independence ($\operatorname{Cov}(x_i, x_j) = 0$ for $i \neq j$):
$$\operatorname{Var}(\bar{x}) = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(x_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$$
This is the standard error result: $\operatorname{se}(\bar{x}) = \sigma/\sqrt{n}$. Doubling precision requires 4x the data.
C.3 Proof: Pearson Correlation is Bounded in $[-1, 1]$
Claim: $-1 \leq r \leq 1$.
Proof: Apply Cauchy-Schwarz to the centred vectors $a = (x_1 - \bar{x}, \dots, x_n - \bar{x})$ and $b = (y_1 - \bar{y}, \dots, y_n - \bar{y})$:
$$\left|\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\right| \;\leq\; \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}.$$
Dividing both sides by the right-hand side gives $|r| \leq 1$. Equality holds iff $b = c\,a$ for some constant $c$, i.e., $y_i - \bar{y} = c\,(x_i - \bar{x})$ - a perfect linear relationship.
C.4 Proof: Sample Covariance Matrix is PSD
Claim: $\hat{\Sigma} = \frac{1}{n-1} X_c^\top X_c$ is positive semi-definite.
Proof: For any $v \in \mathbb{R}^d$:
$$v^\top \hat{\Sigma}\, v = \frac{1}{n-1}\, v^\top X_c^\top X_c\, v = \frac{1}{n-1}\, \|X_c v\|_2^2 \;\geq\; 0.$$
The matrix is positive definite (not just semi-definite) iff $X_c$ has full column rank, which requires $n - 1 \geq d$ and no perfect linear dependence among features.
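A quick numerical check of C.4 on synthetic data (an assumption for illustration): the eigenvalues of the sample covariance matrix are non-negative, and when $n - 1 < d$ the matrix is rank-deficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)                      # centred data matrix
Sigma = (Xc.T @ Xc) / (n - 1)                # sample covariance matrix
eigvals = np.linalg.eigvalsh(Sigma)          # symmetric matrix -> eigvalsh
print(np.all(eigvals >= -1e-12))             # True: PSD up to round-off

# Rank deficiency: with fewer observations than features, Sigma is singular
Xs = rng.normal(size=(3, 5))                 # n = 3 < d = 5
Ss = np.cov(Xs, rowvar=False, ddof=1)
print(np.linalg.matrix_rank(Ss))             # at most n - 1 = 2
```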
Appendix D: Python Implementation Reference
import numpy as np
from scipy import stats
# -- Univariate statistics --------------------------------------
x = np.array([...])
n = len(x)
mean = np.mean(x)
median = np.median(x)
mode = stats.mode(x, keepdims=False).mode  # scalar mode; keepdims=False needed on recent SciPy
var = np.var(x, ddof=1) # unbiased (Bessel's correction)
std = np.std(x, ddof=1)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mad = np.median(np.abs(x - np.median(x)))
robust_s = 1.4826 * mad
skew = stats.skew(x)
kurt = stats.kurtosis(x) # excess kurtosis (Fisher)
# Five-number summary
five_num = np.array([x.min(), q1, median, q3, x.max()])
# Empirical CDF
def ecdf(x_eval, data):
    return np.mean(data[:, None] <= x_eval, axis=0)
# -- Outlier detection ------------------------------------------
# Z-score
z = (x - mean) / std
outliers_z = np.abs(z) > 3
# Tukey fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers_tukey = (x < lower) | (x > upper)
# Modified Z-score (MAD-based)
m_z = 0.6745 * (x - np.median(x)) / mad
outliers_mad = np.abs(m_z) > 3.5
# -- Bivariate statistics ---------------------------------------
x, y = np.array([...]), np.array([...])
cov = np.cov(x, y, ddof=1)[0, 1]
pearsr = np.corrcoef(x, y)[0, 1] # Pearson
rho_s = stats.spearmanr(x, y).statistic
tau = stats.kendalltau(x, y).statistic
# -- Multivariate statistics ------------------------------------
X = np.array([...]) # shape (n, d)
mu_hat = X.mean(axis=0)
Xc = X - mu_hat
Sigma = (Xc.T @ Xc) / (len(X) - 1) # covariance matrix
D = np.sqrt(np.diag(Sigma))
R = Sigma / np.outer(D, D) # correlation matrix
# Mahalanobis distance
Sigma_inv = np.linalg.inv(Sigma)
def mahal(x, mu, Sigma_inv):
    diff = x - mu
    return np.sqrt(diff @ Sigma_inv @ diff)
# -- Standardisation --------------------------------------------
z_score = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
q1s = np.percentile(X, 25, axis=0)
q3s = np.percentile(X, 75, axis=0)
robust_sc = (X - np.median(X, axis=0)) / (q3s - q1s)
Appendix E: Connections to Other Chapters
To Chapter 6 - Probability Theory: Every sample statistic here is an estimator of a population parameter defined in Ch6:
- $\bar{x}$ estimates $\mu$ (defined in Ch6 Section04)
- $s^2$ estimates $\sigma^2$ (defined in Ch6 Section04)
- $r$ estimates $\rho$ (defined in Ch6 Section04)
- $\hat{F}_n$ estimates the CDF $F$ (defined in Ch6 Section01)
- The Glivenko-Cantelli guarantee uses the Law of Large Numbers from Ch6 Section06
To Section02 - Estimation Theory: Descriptive statistics IS estimation theory applied to the simplest possible estimands:
- Is $\bar{x}$ a good estimator of $\mu$? (Yes - unbiased, consistent, efficient for Gaussian data)
- What is $\operatorname{Var}(\bar{x})$? ($\sigma^2/n$ - derived from the iid assumption)
- Can we do better than $\bar{x}$ for non-Gaussian distributions? (The median is more efficient for Laplace distributions) All of these questions are answered formally in Section02 Estimation Theory.
To Section03 - Hypothesis Testing: Many descriptive statistics have associated hypothesis tests:
- Is $\mu = \mu_0$? -> one-sample t-test using $\bar{x}$ and $s$
- Is the data Gaussian? -> Shapiro-Wilk test, or the Jarque-Bera test built from skewness and kurtosis
- Is there correlation? -> test that $\rho = 0$ using Fisher's $z$-transformation of $r$
- Has the distribution shifted? -> KS test comparing two ECDFs. Full treatment: Section03 Hypothesis Testing.
To Section06 - Regression Analysis: The simple-regression slope $\hat{\beta}_1 = s_{xy}/s_x^2$ is a function of descriptive statistics. The OLS estimator is derived from minimising the sample variance of the residuals. The coefficient of determination $R^2$ equals $r^2$ in simple linear regression. Full treatment: Section06 Regression Analysis.
To Chapter 3 - Advanced Linear Algebra: The covariance matrix $\hat\Sigma$ is a real symmetric PSD matrix. Its eigendecomposition $\hat\Sigma = U \Lambda U^\top$ gives the principal component directions (columns of $U$) and the variance explained per component (diagonal of $\Lambda$). This is PCA - the most important linear dimensionality reduction method in ML. Full treatment: Ch3 Section03 Eigenvalues and Eigenvectors.
Appendix F: Advanced Topics
F.1 Estimating the Population Mean: The Central Limit Theorem Connection
The sample mean is an excellent estimator, but how confident can we be in it? The Central Limit Theorem tells us that for large $n$:
$$\bar{x} \;\overset{\text{approx}}{\sim}\; \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right).$$
This means the sampling distribution of $\bar{x}$ is approximately Gaussian with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$ - regardless of the distribution of the individual $x_i$. The spread of $\bar{x}$ decreases as $1/\sqrt{n}$: to halve the uncertainty, you need 4x the data.
Backward reference: The CLT was proved formally in Ch6 Section06 Stochastic Processes. Here we use it to interpret the reliability of the sample mean.
Implication for ML: The validation loss computed over a test set of size $n$ has uncertainty on the order of $s/\sqrt{n}$, where $s$ is the standard deviation of the per-example losses. This gives a confidence interval for the true expected loss - a fundamental fact that is frequently ignored when comparing models. Two models whose mean losses differ by less than a couple of standard errors are not statistically distinguishable. See Section02 Estimation Theory and Section03 Hypothesis Testing for formal treatment.
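A minimal sketch of this point, using synthetic per-example losses (the distributions and sample size are assumptions): the standard error of the mean loss sets the resolution at which two models can be told apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
loss_a = rng.exponential(scale=1.00, size=n)   # synthetic per-example losses, model A
loss_b = rng.exponential(scale=0.98, size=n)   # synthetic per-example losses, model B

mean_a, mean_b = loss_a.mean(), loss_b.mean()
se_a = loss_a.std(ddof=1) / np.sqrt(n)         # standard error of the mean loss
se_b = loss_b.std(ddof=1) / np.sqrt(n)

print(f"model A: {mean_a:.4f} +/- {2*se_a:.4f}")   # ~95% CI half-width
print(f"model B: {mean_b:.4f} +/- {2*se_b:.4f}")
print("difference:", round(mean_a - mean_b, 4),
      "  vs combined 2*SE:", round(2 * np.sqrt(se_a**2 + se_b**2), 4))
```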
F.2 L-Estimators: Unifying the Zoo of Robust Statistics
The mean, trimmed mean, Winsorised mean, and median are all members of a single family called L-estimators: linear combinations of order statistics.
$$T = \sum_{i=1}^n w_i\, x_{(i)},$$
where $x_{(1)} \leq \dots \leq x_{(n)}$ are the order statistics and the weights $w_i$ depend on the estimator:
- Sample mean: $w_i = 1/n$ for all $i$
- Median ($n$ odd): $w_{(n+1)/2} = 1$, all others $0$
- $\alpha$-trimmed mean: $w_i = 1/(n - 2\lfloor \alpha n \rfloor)$ for $\lfloor \alpha n \rfloor < i \leq n - \lfloor \alpha n \rfloor$, zero otherwise
- Winsorised mean: interior observations keep weight $1/n$; the lowest and highest surviving order statistics absorb the weight of the clipped tails
This unification shows that robustness and efficiency are controlled by the shape of the weight sequence $w_i$: flat weights = maximum efficiency under Gaussian; zero-outside-middle weights = robustness. A small numerical comparison follows.
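The sketch below compares these L-estimators on a Gaussian sample contaminated with a few gross outliers (synthetic data, an assumption for illustration), using `scipy.stats.trim_mean` and `scipy.stats.mstats.winsorize`.

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=1.0, size=100)
x[:3] = [60.0, 75.0, 90.0]                    # inject gross outliers

print("mean          :", round(np.mean(x), 3))                                 # dragged up
print("median        :", round(np.median(x), 3))                               # barely moves
print("10% trimmed   :", round(stats.trim_mean(x, proportiontocut=0.1), 3))
print("10% winsorised:", round(float(winsorize(x, limits=(0.1, 0.1)).mean()), 3))
```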
F.3 The Efficiency-Robustness Tradeoff
For Gaussian data, the sample mean is the minimum variance unbiased estimator (MVUE) of $\mu$ - no other unbiased estimator can achieve lower variance. The asymptotic relative efficiency of the median versus the mean for Gaussian data is:
$$\operatorname{ARE}(\text{median}, \text{mean}) = \frac{2}{\pi} \approx 0.637.$$
The median therefore needs roughly $\pi/2 \approx 1.57$ times as much data to achieve the same precision as the mean for Gaussian data. But for heavy-tailed distributions (like Student-$t$ with small $\nu$), the median becomes more efficient than the mean. At the Cauchy distribution ($\nu = 1$), the sample mean is infinitely inefficient - its variance is infinite.
This tradeoff is fundamental to the choice between Adam and SGD: Adam's adaptive second-moment estimate essentially adapts to the local variance of the gradient distribution, making it more efficient than SGD with a fixed learning rate when gradient variance varies across parameters.
F.4 Bootstrap Confidence Intervals for Descriptive Statistics
The bootstrap (Efron, 1979) provides confidence intervals for any statistic without assuming a parametric form for the distribution. For a statistic $T$ computed from $n$ observations:
- Draw $B$ bootstrap samples: for $b = 1, \dots, B$, sample $n$ observations with replacement from the data.
- Compute $T_b^*$ on each bootstrap sample.
- The bootstrap distribution $\{T_1^*, \dots, T_B^*\}$ estimates the sampling distribution of $T$.
- The 95% bootstrap CI is the $[2.5\%, 97.5\%]$ percentile interval of $\{T_b^*\}$.
This works for the mean, median, correlation, MAD - any statistic you can compute. For the sample mean, the bootstrap CI coincides with the CLT-based CI for large $n$; for the median, it is more accurate.
For AI: Bootstrapping is used to compute confidence intervals on evaluation metrics (accuracy, F1, AUC) that do not have simple analytical distributions. When you report "model A has 94.2% accuracy vs model B's 93.8%", bootstrapping tells you whether this difference is statistically meaningful. Full treatment of hypothesis tests for model comparison: Section03 Hypothesis Testing.
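A minimal percentile-bootstrap sketch for the sample median, on a synthetic right-skewed sample (the data and the number of resamples are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # synthetic right-skewed sample

B = 5000
boot_medians = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=len(data), replace=True)  # sample n with replacement
    boot_medians[b] = np.median(resample)

lo, hi = np.percentile(boot_medians, [2.5, 97.5])               # 95% percentile interval
print(f"median = {np.median(data):.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```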
F.5 Multivariate Outlier Detection: The Mahalanobis Envelope
For multivariate Gaussian data $x \sim \mathcal{N}(\mu, \Sigma)$ in $d$ dimensions, the squared Mahalanobis distance satisfies $d_M^2(x) \sim \chi^2_d$.
This means that under normality, 95% of observations should have $d_M^2(x) \leq \chi^2_{d,\,0.95}$, the 95th percentile of the $\chi^2_d$ distribution. An observation exceeding this threshold is a multivariate outlier at the 5% per-observation level.
Example ($d = 2$): $\chi^2_{2,\,0.95} \approx 5.99$, so $d_M > \sqrt{5.99} \approx 2.45$ flags multivariate outliers. An observation can be within the univariate ranges of both features but still be an outlier in the bivariate sense - e.g., if the features are highly positively correlated, an observation with a large positive $x_1$ but a small $x_2$ is multivariate-unusual even if neither value is individually extreme.
Robustified version (MCD estimator): The Minimum Covariance Determinant estimator (Rousseeuw, 1984) computes the mean vector and covariance matrix using only the observations whose covariance matrix has minimum determinant. This gives a robust, high-breakdown-point version of the Mahalanobis distance, widely used in OOD detection for deployed ML models.
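A sketch of the Mahalanobis envelope and its MCD-robustified version on synthetic correlated data with a few planted outliers (the data, the 95% threshold, and the use of scikit-learn's `MinCovDet` are assumptions for illustration):

```python
import numpy as np
from scipy import stats
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
# Correlated 2-d Gaussian data plus a few points that violate the correlation pattern
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
X = rng.multivariate_normal(mean=[0, 0], cov=cov, size=500)
X[:5] = [[2.5, -2.5], [3.0, -2.0], [-2.5, 2.5], [2.0, -3.0], [-3.0, 2.0]]

mu = X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False, ddof=1))
d2 = np.einsum('ij,jk,ik->i', X - mu, Sigma_inv, X - mu)   # squared Mahalanobis distances

threshold = stats.chi2.ppf(0.95, df=X.shape[1])            # ~5.99 for d = 2
print("classical flags:", int(np.sum(d2 > threshold)))

# Robust version: MCD-based distances, resistant to the planted outliers themselves
mcd = MinCovDet(random_state=0).fit(X)
d2_robust = mcd.mahalanobis(X)                              # squared distances under the MCD fit
print("robust flags   :", int(np.sum(d2_robust > threshold)))
```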
Appendix G: Glossary
| Term | Definition |
|---|---|
| iid | Independent and identically distributed: each observation drawn independently from the same distribution |
| Sample statistic | A function of observed data; an estimator of a population parameter |
| Population parameter | A fixed but unknown property of the data-generating distribution (e.g., , ) |
| Breakdown point | Smallest fraction of corrupted observations that can cause a statistic to take an arbitrarily extreme value |
| Influence function | The infinitesimal effect of adding a single observation at on a statistic |
| ECDF | Empirical Cumulative Distribution Function: $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[x_i \leq x]$ |
| L-estimator | A linear combination of order statistics; unifies mean, median, trimmed mean |
| Bessel's correction | Using $n-1$ instead of $n$ as the divisor in the sample variance formula to make it unbiased |
| Mahalanobis distance | Distance metric that accounts for the scale and correlation structure of multivariate data |
| Anscombe's Quartet | Four datasets with identical standard summary statistics but dramatically different distributions |
| PSI | Population Stability Index: symmetric KL divergence between training and production distributions |
| ARE | Asymptotic Relative Efficiency: ratio of asymptotic variances of two estimators |
| KDE | Kernel Density Estimate: smooth non-parametric estimate of a probability density |
| Q-Q plot | Quantile-Quantile plot: compares empirical quantiles to theoretical quantiles to assess distributional fit |
| Covariate shift | Distribution shift where $P(X)$ changes but $P(Y \mid X)$ stays fixed |
| Concept drift | Distribution shift where $P(Y \mid X)$ changes |
| Batch Norm | Normalisation layer that standardises activations across the batch dimension |
| Layer Norm | Normalisation layer that standardises activations across the feature dimension per token |
| RMS Norm | Variant of Layer Norm that divides by root mean square without subtracting the mean |
Appendix H: Practice Problems with Solutions
H.1 Basic Problems
Problem 1. The following are response times (seconds) for 8 API calls: . Compute: (a) mean, (b) median, (c) variance with Bessel's correction, (d) identify outliers using Tukey fences.
Solution: (a) seconds (b) Sorted: ; median seconds (c) Deviations from mean: ; (d) , , IQR ; upper fence ; -> outlier
Note: the mean (s) is pulled above the upper fence - it is no longer representative of the typical response time. The median (s) is the appropriate summary.
Problem 2. Show that for any dataset, $f(c) = \sum_{i=1}^n (x_i - c)^2$ is minimised at $c = \bar{x}$.
Solution: $f'(c) = -2\sum_{i=1}^n (x_i - c) = 0 \;\Rightarrow\; \sum_{i=1}^n x_i = n c \;\Rightarrow\; c = \bar{x}$.
The second derivative $f''(c) = 2n > 0$ confirms a minimum.
Problem 3. The waiting times at a coffee shop follow an Exponential($\lambda$) distribution. What is the population skewness? How do sample estimates of it behave for small versus large $n$?
Solution: For Exponential($\lambda$), the skewness is always $2$ (independent of $\lambda$). The sample skewness has a standard error of roughly $\sqrt{6/n}$ under normality, so small-sample estimates are very noisy, while large samples give precise estimates. This illustrates why shape statistics require larger samples than location/scale statistics.
H.2 Intermediate Problems
Problem 4. A dataset has mean $\bar{x}$, standard deviation $s$, positive skewness, and large positive excess kurtosis (leptokurtic). Using Chebyshev's inequality, what fraction of observations must lie within $\bar{x} \pm 2s$? How does this compare to the Gaussian prediction?
Solution: Chebyshev: $P(|x - \bar{x}| \geq k s) \leq 1/k^2$. For $k = 2$: at least $1 - 1/4 = 75\%$ of observations lie within $2$ standard deviations. For a Gaussian: $\approx 95.45\%$. The leptokurtic distribution ($g_2 > 0$) has heavier tails than the Gaussian, so we expect a value between 75% and 95.45% to lie in this interval - Chebyshev's bound is not tight for leptokurtic distributions.
Backward reference: Chebyshev's inequality was proved rigorously in Ch6 Section05 Concentration Inequalities.
Problem 5. Two features $x$ (salary, thousands) and $y$ (years of experience) have covariance $s_{xy}$ and variances $s_x^2$, $s_y^2$. Compute the Pearson correlation. Now rescale: $x' = x/1000$ (salary in millions). How does $r$ change?
Solution: $r = s_{xy}/(s_x s_y)$. If $x' = x/1000$, then $s_{x'y} = s_{xy}/1000$ and $s_{x'} = s_x/1000$, so $r' = \dfrac{s_{xy}/1000}{(s_x/1000)\, s_y} = \dfrac{s_{xy}}{s_x s_y} = r$. Pearson correlation is invariant to scaling. This confirms the affine invariance property.
H.3 Advanced Problems
Problem 6 (Batch Norm during inference). During training, Batch Norm computes mini-batch statistics. During inference, it uses running averages. Why can't it use the current-batch statistics during inference for a batch of size 1?
Solution: For a batch of size 1, the batch statistics are $\hat\mu^B = h$ and $\hat\sigma^{B\,2} = 0$ (the variance of a single point is zero). The normalised activation would be $0$ for every input - the layer outputs a constant, destroying all information. This is why BN uses running averages accumulated during training for inference. Layer Norm does not have this problem because it normalises across the feature dimension ($d$), which is fixed regardless of batch size.
Problem 7 (Adam effective window size). Show that the effective window size of Adam's first moment estimator is approximately $1/(1 - \beta_1)$ time steps.
Solution: Unrolling the recursion $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ gives
$$m_t = (1 - \beta_1) \sum_{k=0}^{t-1} \beta_1^{\,k}\, g_{t-k}.$$
The weight given to the gradient $k$ steps in the past is $(1 - \beta_1)\beta_1^{k}$, which decays exponentially with the delay $k$. The average delay is $\beta_1/(1 - \beta_1)$, and the effective window is approximately $1/(1 - \beta_1)$ steps. For example, $\beta_1 = 0.9$ gives a window of $\approx 10$ steps and $\beta_1 = 0.99$ a window of $\approx 100$ steps.
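A quick numerical illustration of this window result (the $\beta_1$ values are the common Adam settings used above):

```python
import numpy as np

for beta1 in (0.9, 0.99):
    window = int(round(1 / (1 - beta1)))
    k = np.arange(10 * window)                 # delays 0, 1, 2, ...
    weights = (1 - beta1) * beta1 ** k         # weight on the gradient k steps in the past
    print(f"beta1={beta1}: window ~ {window} steps, "
          f"weight mass in last {window} steps = {weights[:window].sum():.2f}, "
          f"mean delay = {(k * weights).sum():.1f}")
```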
Appendix I: Software Reference
NumPy / SciPy
| Task | Function | Notes |
|---|---|---|
| Sample mean | np.mean(x) | Defaults to full array |
| Sample median | np.median(x) | |
| Sample variance | np.var(x, ddof=1) | ddof=1 for Bessel |
| Sample std | np.std(x, ddof=1) | |
| Quantiles | np.percentile(x, [25,50,75]) | or np.quantile |
| Skewness | scipy.stats.skew(x) | Fisher-Pearson $g_1$ |
| Kurtosis | scipy.stats.kurtosis(x) | Excess (Fisher), i.e. kurtosis $- 3$ |
| Pearson | scipy.stats.pearsonr(x,y) | Returns (r, p-value) |
| Spearman | scipy.stats.spearmanr(x,y) | |
| Kendall | scipy.stats.kendalltau(x,y) | |
| Covariance matrix | np.cov(X.T, ddof=1) | Rows = features |
| Correlation matrix | np.corrcoef(X.T) | |
| ECDF | scipy.stats.ecdf(x) | Or manual |
scikit-learn
| Task | Class | Notes |
|---|---|---|
| Z-score scaling | StandardScaler() | fit on train, transform on test |
| Min-max scaling | MinMaxScaler() | |
| Robust scaling | RobustScaler() | Uses median and IQR |
| Covariance (robust) | MinCovDet() | MCD estimator |
| Mahalanobis OOD | EllipticEnvelope() | Fits Gaussian, flags outliers |
Visualisation (matplotlib / seaborn)
| Plot | Code | Notes |
|---|---|---|
| Histogram + KDE | sns.histplot(x, kde=True) | |
| Box plot | sns.boxplot(x=x) | Tukey fences |
| Violin plot | sns.violinplot(x=x) | |
| ECDF | sns.ecdfplot(x) | |
| Q-Q plot | scipy.stats.probplot(x, plot=plt) | |
| Scatter + regression | sns.regplot(x=x, y=y) | |
| Pair plot | sns.pairplot(df) | |
| Correlation heatmap | sns.heatmap(R, annot=True, cmap='coolwarm') | |
Appendix J: Common Distributions and Their Descriptive Statistics
Understanding the theoretical descriptive statistics of named distributions (derived from population parameters) helps calibrate expectations when you compute sample statistics from real data.
| Distribution | Mean | Variance | Skewness | Excess Kurtosis |
|---|---|---|---|---|
| Gaussian $\mathcal{N}(\mu, \sigma^2)$ | $\mu$ | $\sigma^2$ | $0$ | $0$ |
| Exponential($\lambda$) | $1/\lambda$ | $1/\lambda^2$ | $2$ | $6$ |
| Laplace($\mu, b$) | $\mu$ | $2b^2$ | $0$ | $3$ |
| Student-$t_\nu$ | $0$ ($\nu > 1$) | $\nu/(\nu-2)$ ($\nu > 2$) | $0$ ($\nu > 3$) | $6/(\nu-4)$ ($\nu > 4$) |
| $\chi^2_k$ | $k$ | $2k$ | $\sqrt{8/k}$ | $12/k$ |
Key observations for ML:
- Exponential distribution has skewness $2$ and excess kurtosis $6$ - a heavy right tail and heavy overall tails. This resembles the distribution of inter-event times in Poisson processes (token generation times, request arrival times).
- Student-$t$ has excess kurtosis $6/(\nu - 4)$ for $\nu > 4$; it grows without bound as $\nu$ approaches 4 and tends to $0$ as $\nu \to \infty$ (approaches Gaussian). The leptokurtic behaviour of Student-$t$ for small $\nu$ models heavy-tailed phenomena like stochastic gradient noise in LLM training.
- Laplace distribution (double exponential) has excess kurtosis $3$. MAP estimation with a Gaussian likelihood and Laplace prior gives Lasso regression - the Laplace prior's heavier tails promote sparsity.
- $\chi^2_k$ is always right-skewed, with skewness $\sqrt{8/k}$ decreasing as $k$ increases. The sum of $k$ squared standardised Gaussian values follows $\chi^2_k$; gradient squared norms tend toward this shape.
Appendix K: Connections to Linear Algebra
The multivariate statistics developed in Section5 are deeply connected to linear algebra concepts:
Covariance matrix eigendecomposition (PCA):
$$\hat\Sigma = U \Lambda U^\top,$$
where $U$ is orthogonal (its columns are the eigenvectors = principal component directions) and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$ with $\lambda_1 \geq \dots \geq \lambda_d \geq 0$ (the eigenvalues = variance explained per component).
The fraction of total variance explained by the first $k$ components is $\sum_{j=1}^{k} \lambda_j \big/ \sum_{j=1}^{d} \lambda_j$.
Forward reference: PCA derivation, spectral theorem, and the relationship between SVD of the data matrix and eigendecomposition of the covariance matrix are covered in Ch3 Section01 Eigenvalues and Eigenvectors and Ch3 Section02 SVD.
Whitening (Mahalanobis transform): The transformation $z = \hat\Sigma^{-1/2}(x - \hat{\boldsymbol\mu})$ maps the data to have $\operatorname{Cov}(z) = I$ (identity covariance matrix). After whitening, Euclidean distance equals Mahalanobis distance: $\|z_1 - z_2\|_2 = d_M(x_1, x_2)$.
Computing $\hat\Sigma^{-1/2}$: via the eigendecomposition $\hat\Sigma = U \Lambda U^\top$, we get $\hat\Sigma^{-1/2} = U \Lambda^{-1/2} U^\top$, where $\Lambda^{-1/2} = \operatorname{diag}(\lambda_1^{-1/2}, \dots, \lambda_d^{-1/2})$.
Condition number and numerical issues: The condition number $\kappa(\hat\Sigma) = \lambda_{\max}/\lambda_{\min}$ measures how reliably $\hat\Sigma$ can be inverted. Large condition numbers (features with very different scales, highly correlated features) make the covariance matrix ill-conditioned - the Mahalanobis distance becomes numerically unreliable. This is why standardisation (Section6.1) is important before computing the covariance matrix. A whitening sketch follows.
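A minimal whitening sketch on synthetic 2-d data, assuming the eigendecomposition route described above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=1000)

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, ddof=1)

# Sigma^{-1/2} via the eigendecomposition Sigma = U Lambda U^T
lam, U = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = U @ np.diag(lam ** -0.5) @ U.T

Z = (X - mu) @ Sigma_inv_sqrt                            # whitened data
print(np.round(np.cov(Z, rowvar=False, ddof=1), 3))      # ~ identity covariance matrix
print("condition number of Sigma:", round(lam.max() / lam.min(), 2))
```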
Appendix L: Summary Diagram
DESCRIPTIVE STATISTICS - COMPLETE TAXONOMY
========================================================================
UNIVARIATE
----------
LOCATION SPREAD SHAPE DISTRIBUTION
--------- ------ ----- ------------
Mean xbar Variance s^2 Skewness g_1 Histogram
Median xtilde Std dev s Kurtosis g_2 KDE
Mode IQR 5-num summary Box plot
Trimmed mean MAD ECDF Violin plot
Winsorised mean Range Q-Q plot
BIVARIATE
---------
LINEAR RANK NONLINEAR
------ ---- ---------
Covariance sxy Spearman \\rho Distance corr
Pearson r Kendall \\tau Mutual information
Scatter plot Rank scatter Contingency table
Anscombe's Q (for ordinal) Cramer's V
MULTIVARIATE
------------
MATRIX GEOMETRY DIMENSION
------ -------- ---------
Mean vector \\muhat Mahalanobis dist Curse of dim
Cov matrix \\Sigmahat Whitening PCA/scree plot
Corr matrix R Ellipsoidal sets n>>d constraint
Heatmap Outlier envelope Rank(\\Sigmahat) = min(n-1,d)
PREPROCESSING
-------------
Z-SCORE MIN-MAX ROBUST
------ ------- ------
(x - xbar) / s (x-min)/(max-min) (x - xtilde) / IQR
mean=0, std=1 [0, 1] median=0, IQR=1
fits most cases bounded range resistant to outliers
AI APPLICATIONS
---------------
NORMALISATION OPTIMISATION MONITORING
------------- ------------ ----------
Batch Norm (per feat) Adam mhat_t (mean) Data drift (PSI)
Layer Norm (per token) Adam vhat_t (variance) Outlier detection
RMS Norm (no center) Gradient clipping OOD scoring
(Winsorisation) Calibration
========================================================================
Appendix M: The Statistics of Neural Network Activations
This appendix connects descriptive statistics directly to modern neural network analysis - what happens to activation distributions at each layer, and how normalisation methods control them.
M.1 Why Activations Drift (Internal Covariate Shift)
Consider a deep network with $L$ layers, where each layer computes $h^{(\ell)} = f\big(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}\big)$. As parameters are updated during training, the distribution of $h^{(\ell)}$ changes - even though the distribution of the input $h^{(0)} = x$ is fixed. This is internal covariate shift.
Descriptively: the sample mean and variance of channel $j$ at layer $\ell$ change with every gradient update. Deep networks without normalisation exhibit a cascade: a parameter update in layer 1 shifts the distribution fed to layer 2, which amplifies or distorts it, leading to gradients that vanish (activations -> 0) or explode (activations -> +/-\infty).
Batch Norm breaks this cascade by renormalising to zero mean and unit variance after every parameter update, making the first two moments of the inputs to layer $\ell + 1$ independent of the parameters of the layers before it.
M.2 Analysing Pre-Norm vs Post-Norm Transformers
Modern transformers use either Pre-LN (Layer Norm before the attention/FFN sublayer) or Post-LN (Layer Norm after adding the residual). GPT-2 used Post-LN; GPT-3, LLaMA, Mistral, and modern models use Pre-LN.
Why Pre-LN is more stable: In Post-LN, the gradient of the loss with respect to early layer parameters passes through many unnormalized residual connections, amplifying gradient norms. In Pre-LN, the gradient is normalised at each layer before propagating back, keeping the effective sample variance of gradients more uniform across layers - a descriptive statistics insight with profound training stability consequences.
Concretely: in a 96-layer transformer (like GPT-3), with Post-LN the gradient variance at layer 1 can differ from the gradient variance at layer 95 by orders of magnitude, whereas with Pre-LN the ratio stays close to 1. This is directly measurable as a descriptive statistic on the gradient distributions across layers.
M.3 The Scaling in Attention
In scaled dot-product attention:
$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V.$$
Why divide by $\sqrt{d_k}$? Each element of $QK^\top$ is a dot product of two $d_k$-dimensional vectors. If the components $q_i$ and $k_i$ are independent with mean 0 and variance 1, then:
$$\operatorname{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k.$$
Without scaling, the dot products have standard deviation $\sqrt{d_k}$, which grows with model dimension. Feeding large-magnitude values into softmax causes it to saturate - the output approaches a one-hot vector, blocking gradient flow. Dividing by $\sqrt{d_k}$ restores unit variance:
$$\operatorname{Var}\!\left(\frac{q^\top k}{\sqrt{d_k}}\right) = 1.$$
This is a variance normalisation step - standard scaling applied to the attention logits. It is derived directly from the sample variance formula and the property that the variance of a sum of independent unit-variance terms is $d_k$. A quick simulation follows.
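The simulation below draws unit-Gaussian queries and keys (dimensions chosen for illustration) and checks that the raw dot products have variance $\approx d_k$ while the scaled ones have variance $\approx 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    q = rng.normal(size=(20_000, d_k))           # unit-Gaussian query rows
    k = rng.normal(size=(20_000, d_k))           # unit-Gaussian key rows
    logits = np.sum(q * k, axis=1)               # raw dot products q . k
    print(f"d_k={d_k:4d}: var(q.k) = {logits.var(ddof=1):7.1f}, "
          f"var(q.k / sqrt(d_k)) = {(logits / np.sqrt(d_k)).var(ddof=1):.3f}")
```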
Multi-head Latent Attention (MLA, DeepSeek-V2, 2024): MLA compresses key-value representations into a low-dimensional latent space before attention. The projection reduces the effective key/value dimension and hence the variance of the raw dot products - a deliberate manipulation of the attention logit statistics to reduce KV cache memory while maintaining expressivity.
M.4 Gradient Statistics as a Training Health Monitor
Monitoring the following descriptive statistics of gradient distributions during LLM training provides an early warning system for training instabilities:
| Statistic | Healthy range | Warning sign |
|---|---|---|
| Gradient norm per layer | Decreasing from input to output layers | Exploding: norm growing exponentially |
| Gradient norm variance across layers | Low, uniform | Spike: sudden increase in one layer |
| Gradient skewness | Near 0 | Large positive: rare large positive gradients |
| Gradient kurtosis | Moderate and stable over time | Very large: heavy-tailed gradient distribution -> clip threshold too high |
| Loss spike detection | Batch loss close to its running mean | Single batch loss > mean + 5 MAD -> potentially corrupted batch |
| Parameter update scale | Small relative to the parameter norm | Large relative updates -> instability |
This is why modern LLM training frameworks (e.g., Megatron-LM, Nanotron) log running statistics of gradient norms, weight update scales, and activation statistics at every layer - real-time EDA of the training process.
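A minimal sketch of such monitoring on synthetic stand-ins for per-layer gradients and per-batch losses (all data, layer names, and thresholds here are illustrative assumptions, not a real training framework's API):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for per-layer gradient tensors collected after a backward pass
grads = {f"layer_{i}": rng.standard_t(df=4, size=10_000) * 10 ** -i for i in range(4)}

for name, g in grads.items():
    print(f"{name}: norm={np.linalg.norm(g):.3e}  "
          f"skew={stats.skew(g):+.2f}  excess_kurt={stats.kurtosis(g):.2f}")

# MAD-based loss-spike rule from the table above (illustrative 5-MAD threshold)
batch_losses = rng.gamma(shape=2.0, scale=1.0, size=200)
batch_losses[42] = 30.0                                    # planted corrupted batch
mad = np.median(np.abs(batch_losses - np.median(batch_losses)))
spikes = np.where(batch_losses > batch_losses.mean() + 5 * mad)[0]
print("suspicious batches:", spikes)
```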
Appendix N: Extended Exercise Solutions
N.1 Exercise 7 (Batch Norm vs Layer Norm) - Key Implementation Notes
The critical implementation detail in Batch Norm is the distinction between training and inference modes. During training:
# Training mode: use current mini-batch statistics
mu_B = X.mean(axis=0) # shape (d,)
var_B = X.var(axis=0) # biased variance (ddof=0)
X_norm = (X - mu_B) / np.sqrt(var_B + eps)
running_mean = momentum * running_mean + (1-momentum) * mu_B
running_var = momentum * running_var + (1-momentum) * var_B
During inference:
# Inference mode: use accumulated running statistics
X_norm = (X - running_mean) / np.sqrt(running_var + eps)
For Layer Norm, the statistics are computed per-sample along the feature dimension:
mu_L = X.mean(axis=1, keepdims=True) # shape (n, 1)
var_L = X.var(axis=1, keepdims=True) # shape (n, 1)
X_norm = (X - mu_L) / np.sqrt(var_L + eps)
The key test: for a batch of size 1 (n=1), Batch Norm produces mu_B = X and var_B = 0, so normalisation maps every activation to 0. Layer Norm is unaffected because it normalises over the feature dimension.
N.2 Exercise 8 (Adam Bias Correction) - Analytical Solution
At time step $t$ with $m_0 = 0$:
$$m_t = (1 - \beta_1) \sum_{k=1}^{t} \beta_1^{\,t-k}\, g_k.$$
Taking expectations (assuming iid gradients with $\mathbb{E}[g_k] = g$): $\mathbb{E}[m_t] = (1 - \beta_1^t)\, g$.
Without correction, $m_t$ underestimates $g$ by the factor $(1 - \beta_1^t)$. At $t = 1$: $m_1 = (1 - \beta_1)\, g_1$, which for $\beta_1 = 0.9$ gives $0.1\, g_1$ - only 10% of the first gradient. Bias correction: $\hat m_1 = m_1/(1 - \beta_1) = g_1$. This recovers the full first gradient as the initial estimate. As $t \to \infty$, $\beta_1^t \to 0$ and the correction factor $\to 1$ - the correction becomes negligible once the EWMA has warmed up.
Appendix O: EDA on Real ML Datasets - Checklist and Patterns
O.1 What to Always Report for a Tabular ML Dataset
Before any model development, the minimum descriptive statistics to report are listed below; a compact code sketch of the per-feature portion follows the checklist.
Dataset-level:
- Shape: rows x columns
- Memory footprint
- Number and fraction of missing values (total and per column)
- Number of duplicate rows
- Data types (numeric, categorical, datetime, text, binary)
Per numeric feature:
- Mean, median, mode (if applicable)
- Std dev, IQR, min, max
- Skewness ($g_1$) and kurtosis ($g_2$)
- Number and fraction of outliers (Tukey fence)
- Distribution plot (histogram + KDE)
Per categorical feature:
- Number of unique values (cardinality)
- Top-$k$ most frequent values with counts and fractions
- Frequency of "other" category
- Missing value rate
Bivariate with target variable:
- Pearson/Spearman correlation for each numeric feature vs target
- Conditional distribution plots: the feature's distribution within each class, for classification
- Point-biserial correlation for numeric feature vs binary target
Multivariate:
- Correlation heatmap of all numeric features
- Variance inflation factor (VIF) for collinearity assessment
- PCA scree plot: how many components explain 90% of variance
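The sketch below covers the per-numeric-feature portion of this checklist, assuming a pandas DataFrame; the column names and synthetic data are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

def numeric_feature_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-feature descriptive statistics for the numeric columns of df."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        x = df[col].dropna().to_numpy()
        q1, q3 = np.percentile(x, [25, 75])
        iqr = q3 - q1
        outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)   # Tukey fences
        rows.append({
            "feature": col,
            "missing_frac": df[col].isna().mean(),
            "mean": x.mean(), "median": np.median(x),
            "std": x.std(ddof=1), "iqr": iqr,
            "skew": stats.skew(x), "excess_kurt": stats.kurtosis(x),
            "outlier_frac": outliers.mean(),
        })
    return pd.DataFrame(rows).set_index("feature")

# Example with synthetic data (column names are illustrative)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"income": rng.lognormal(3, 1, 1000),
                     "age": rng.normal(40, 12, 1000)})
print(numeric_feature_report(demo).round(3))
```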
O.2 Common Patterns and What They Mean
| Pattern | Observation | Implication |
|---|---|---|
| Feature has $g_1 \gg 0$ | Right-skewed: long right tail | Consider log transform; mean >> median; outliers will inflate variance |
| Feature has $g_2 \gg 0$ | Leptokurtic: heavy tails | Outliers frequent; robust scaling preferred; may need Winsorisation |
| Two features with $\lvert r\rvert > 0.95$ | Near-duplicate (collinear) features | Multicollinearity; consider dropping one or relying on regularisation |
| Feature correlation with target near 0 | Weak linear association | Feature may still be useful (nonlinear); check mutual information |
| One label class has only a small fraction of samples | Severe imbalance | Accuracy misleading; use balanced accuracy, F1; consider oversampling |
| Feature has bimodal histogram | Two subpopulations | Consider splitting dataset or adding an interaction term |
| Many features have near-zero variance | Near-constant features | Remove or near-zero-variance filter (contributes nothing, hurts conditioning) |
| MAD $= 0$ but IQR > 0 | More than 50% identical values | Discrete-valued or heavily concentrated; robust stats unreliable |
| Correlation matrix has block structure | Correlated feature groups | Natural for grouped features (e.g., pixel neighbours, word $n$-grams) |
O.3 Pre-Training Checklist for LLMs
For large-scale pretraining datasets (web text, code, books), descriptive statistics take a different form:
- Token distribution: frequency table of most/least common tokens; vocabulary coverage
- Sequence length distribution: mean, median, and tail percentiles; a right-skewed distribution is typical
- Document quality scores: perplexity under a reference model; flagging low-quality text
- Language distribution: fraction of each language; imbalance affects multilingual capability
- Temporal distribution: date of creation; recency bias analysis
- Duplication rate: $n$-gram overlap statistics; near-duplicate detection
- Toxicity and bias statistics: frequency of flagged content; demographic representation in co-occurrence statistics
These are all descriptive statistics applied at scale - the same concepts (frequency tables, histograms, correlation structures, outlier detection) but applied to discrete text data rather than continuous measurements.
Appendix P: Notation Summary
Following docs/NOTATION_GUIDE.md:
| Symbol | Meaning | Type |
|---|---|---|
| $x_i$ | $i$-th scalar observation | Scalar, lowercase |
| $\mathbf{x}_i$ | $i$-th feature vector | Bold lowercase vector |
| $X$ | Data matrix ($n \times d$) | Uppercase matrix |
| $X_c$ | Centred data matrix | Uppercase matrix |
| $\bar{x}$ | Sample mean (scalar) | Scalar |
| $\hat{\boldsymbol\mu}$ | Sample mean vector | Bold lowercase, hat |
| $s^2$ | Sample variance | Scalar |
| $\hat{\Sigma}$ | Sample covariance matrix | Uppercase, hat |
| $R$ | Sample correlation matrix | Uppercase |
| $r$ | Pearson correlation coefficient | Scalar lowercase |
| $\rho$ | Population correlation (or Spearman) | Greek |
| $\tau$ | Kendall's tau | Greek |
| $\tilde{x}$ | Sample median | Scalar, tilde |
| $Q(p)$ | $p$-th quantile function | |
| $\hat{F}_n$ | Empirical CDF | Hat, subscript $n$ |
| $g_1, g_2$ | Sample skewness, excess kurtosis | Scalars |
| $d_M$ | Mahalanobis distance | Scalar |
| $n$ | Sample size | Positive integer |
| $d$ | Number of features/dimensions | Positive integer |
| $\mu, \sigma^2$ | Population mean, variance | Greek (theoretical) |
End of Section07-Statistics/01-Descriptive-Statistics/notes.md
Appendix Q: Worked ML Case Study - EDA on a Classification Dataset
This appendix walks through a complete EDA on a synthetic classification dataset representative of tabular ML problems: predicting loan default from applicant features.
Q.1 Dataset Overview
Features include age (years), income (thousands of dollars), credit score, debt-to-income ratio, and number of late payments; the target is loan default (binary). $n = 5000$ observations, $d = 6$ features.
Q.2 Univariate Analysis
Age: mean , median , , (nearly symmetric, slight right skew), (slightly platykurtic). Q-Q plot: slight deviation in upper tail only. No outliers by Tukey fences.
Income: mean k, median k, , (strongly right-skewed), (very leptokurtic). Tukey fence upper: k - 8.3% of observations flagged as outliers. Recommendation: log-transform income before modelling. After log transform: , - approximately Gaussian.
Credit score: mean , median , , (left-skewed - ceiling effect near 850). A few scores above 840 represent 98th percentile; no action needed but note the ceiling.
Late payments: mean , median , mode . Heavily right-skewed (). 62% of applicants have zero late payments - consider treating as a mixed discrete-continuous variable or Poisson-distributed.
Q.3 Bivariate Analysis with Target
| Feature | Pearson | Spearman | Note |
|---|---|---|---|
| Age | Negligible | ||
| log(Income) | Moderate negative | ||
| Credit score | Strongest predictor | ||
| Debt-to-income | Strong positive | ||
| Late payments | Moderate positive |
Pearson and Spearman are close for all features except late payments (the most skewed), where Spearman is meaningfully larger - confirming that the relationship is monotone but not linear.
Q.4 Multivariate Analysis
The correlation matrix reveals:
- Income and credit score: (expected: higher income -> better credit history)
- Debt-to-income and late payments: $r = 0.58$ (financially stressed applicants)
- Age and all others: (age is roughly independent of financial variables in this dataset)
The block structure (income-credit-debt cluster vs age) suggests two natural feature groups for regularisation. PCA on the standardised features: first two PCs explain 51% and 19% = 70% total variance. The first PC is approximately an equal-weight financial health score; the second PC is approximately age vs financial stress.
Q.5 Preprocessing Decisions
Based on EDA:
- Log-transform income (right-skewed, leptokurtic)
- Z-score standardise all numeric features (using training-set statistics only)
- No outlier removal - outliers in income are legitimate high-income applicants, not data errors
- Handle 3.2% missing credit scores via median imputation (MAR assumption - missing-at-random; lower-income applicants slightly more likely to have missing credit scores)
- Flag: debt-to-income and late payments are collinear ($r = 0.58$) - regularisation will be important
Q.6 Key EDA Findings Summary
EDA FINDINGS - LOAN DEFAULT PREDICTION
========================================================================
[ok] Class imbalance: 15% default -> use balanced accuracy, F1, AUC
[ok] Income: log-transform required (skew 2.8 -> 0.1 after transform)
[ok] Credit score: best single predictor (r = -0.42)
[ok] Collinearity: debt-to-income <-> late payments (r = 0.58)
[ok] Missing data: 3.2% credit score (MAR) -> median impute
[ok] Age: near-zero correlation with target -> likely weak feature
[ok] Scale: all features need standardisation (income in $k,
credit 0-850, DTI 0-1, late payments 0-20)
RECOMMENDED PIPELINE:
log(income) -> StandardScaler -> [classifier with L2 reg]
Evaluate: AUC-ROC, PR-AUC, balanced accuracy
Monitor: label distribution shift, credit score mean shift
========================================================================
Appendix R: Reading List
Foundational:
- Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley. - The founding text; invented box plots, stem-and-leaf, resistant methods.
- Huber, P.J. & Ronchetti, E.M. (2009). Robust Statistics (2nd ed.). Wiley. - Formal breakdown point theory; influence functions; M-estimators.
- Anscombe, F.J. (1973). "Graphs in Statistical Analysis." American Statistician, 27(1), 17-21. - The Quartet paper; 4 pages, required reading.
Applied Statistics:
- Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). Norton. - Best non-technical introduction; strong on intuition.
- Wackerly, D., Mendenhall, W., & Scheaffer, R. (2008). Mathematical Statistics with Applications (7th ed.). - Formal treatment; proofs; distributions.
Machine Learning Connections:
- Ioffe, S. & Szegedy, C. (2015). "Batch Normalization." ICML 2015. - Original BN paper.
- Ba, J.L., Kiros, J.R., & Hinton, G.E. (2016). "Layer Normalization." arXiv:1607.06450. - LN for transformers.
- Kingma, D.P. & Ba, J.L. (2015). "Adam: A Method for Stochastic Optimization." ICLR 2015. - Adam; bias correction derivation.
- Gebru, T. et al. (2020). "Datasheets for Datasets." Communications of the ACM, 64(12). - Descriptive statistics as documentation standard.
- Lee, K. et al. (2018). "A Simple Unified Framework for Detecting Out-of-Distribution Samples." NeurIPS 2018. - Mahalanobis distance for OOD.
Appendix S: Common Statistical Distributions - Shape at a Glance
Recognising the shape of a distribution from its descriptive statistics is a practical skill. This guide maps statistic ranges to likely distribution families.
S.1 Shape Decision Tree
IDENTIFYING DISTRIBUTIONS FROM DESCRIPTIVE STATISTICS
========================================================================
Start: Look at histogram + g_1 + g_2
g_1 \\approx 0, g_2 \\approx 0
+-- Unimodal, bell-shaped -> Gaussian (test with Shapiro-Wilk)
+-- Bimodal -> Mixture of Gaussians
g_1 > 1, support [0, \\infty)
+-- g_2 \\approx 6 -> Exponential (\\lambda = 1/mean)
+-- g_2 large, discrete -> Poisson (\\lambda = mean = variance?)
+-- g_1 = 2, g_2 = 6 -> exactly Exponential(\\lambda)
g_1 > 0, support bounded above
+-- [0, 1], hump shape -> Beta(\\alpha>1, \\beta>1)
+-- [0, 1], J/U shape -> Beta(\\alpha\\leq1 or \\beta\\leq1)
g_1 = 0, g_2 > 0 (heavy tails)
+-- Symmetric, very heavy -> Student-t (\\nu \\approx 6/(g_2) + 4)
+-- Symmetric, long tails -> Laplace (g_2 = 3)
g_1 < 0 (left-skewed)
+-- Bounded above, mode near max -> ceiling effect; truncated distribution
========================================================================
S.2 What Gradient Distributions Look Like
Based on empirical measurements in large LLM training runs:
- Individual parameter gradients: approximately symmetric, with $g_1 \approx 0$ but $g_2 \gg 0$ - leptokurtic, closer to Student-$t$ than Gaussian.
- Gradient norms (per-layer): always non-negative and right-skewed ($g_1 > 0$), roughly Exponential- or Gamma-shaped.
- Loss values within a training batch: usually right-skewed ($g_1 > 0$) with a heavy right tail - a few hard examples cause very high loss.
Implication: gradient norm thresholds for clipping should be set from a high quantile of the gradient norm distribution (e.g. the 95th or 99th percentile), not the mean - because the mean is pulled up by rare large-norm gradients. A short sketch follows.
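A short sketch of quantile-based clip-threshold selection on synthetic gradient norms (the distribution and the 99th-percentile choice are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-step gradient norms: right-skewed with occasional large spikes
grad_norms = rng.gamma(shape=2.0, scale=0.5, size=10_000)
grad_norms[rng.choice(10_000, size=20, replace=False)] *= 25    # rare huge gradients

clip_threshold = np.percentile(grad_norms, 99)    # quantile-based clip value
print("mean           :", round(grad_norms.mean(), 3))
print("99th percentile:", round(clip_threshold, 3))
print("fraction clipped:", np.mean(grad_norms > clip_threshold))
```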
Appendix T: Summary of All Mathematical Results
| Result | Statement | Reference |
|---|---|---|
| Sample mean is unbiased | $\mathbb{E}[\bar{x}] = \mu$ | Section2.2, App C.1 |
| Sample mean variance | $\operatorname{Var}(\bar{x}) = \sigma^2/n$ | App C.2 |
| Bessel's correction | $\mathbb{E}[s^2] = \sigma^2$ with divisor $n-1$ | Section2.3 |
| Cauchy-Schwarz bound on $r$ | $\lvert r\rvert \leq 1$ | App C.3 |
| $\hat\Sigma$ is PSD | $v^\top \hat\Sigma\, v \geq 0$ for all $v$ | Section5.1, App C.4 |
| Mahalanobis | $d_M^2(x) \sim \chi^2_d$ under multivariate normality | Section5.3 |
| Glivenko-Cantelli | ECDF converges to CDF uniformly a.s. | Section2.5 |
| CLT for $\bar{x}$ | $\bar{x}$ approximately $\mathcal{N}(\mu, \sigma^2/n)$ for large $n$ | App F.1 |
| Breakdown point of median | $\to 50\%$ as $n \to \infty$ | Section3.1 |
| Adam bias correction | $\hat m_t = m_t/(1-\beta_1^t)$, $\hat v_t = v_t/(1-\beta_2^t)$ | Section8.2, App N.2 |
| Attention variance | $\operatorname{Var}(q^\top k) = d_k$ for unit-Gaussian inputs | Section5.4, App M.3 |
| ARE(median, mean), Gaussian | $2/\pi \approx 0.637$ | App F.3 |
Appendix U: Practice Problems Bank
U.1 Quick Computation Problems
- Dataset: . Compute all measures of central tendency and spread.
- Prove that for any , . Interpret geometrically.
- Show that adding a constant $c$ to all observations changes $\bar{x}$ by $c$ but leaves $s^2$ unchanged.
- For samples and , express , , and in terms of , , and .
- Prove Spearman's formula $\rho_s = 1 - \dfrac{6 \sum_i d_i^2}{n(n^2 - 1)}$ (where $d_i$ is the difference between the ranks of $x_i$ and $y_i$) when there are no ties.
U.2 Conceptual Problems
- The median of is 6. You add a 6th observation. For what values of the new observation does the median stay at 6? Change?
- Two datasets have the same mean (50) and standard deviation (10). Dataset A has ; Dataset B has . Without knowing the data, what can you say about which dataset has more extreme values? Which would be safer to model with a Gaussian?
- A colleague reports that two features have a Pearson correlation very close to 1. They suggest dropping one of them from the model. Is this always the right move? What information do you need?
- You standardise your training data and achieve validation accuracy of 87%. A teammate says to also standardise the test data using test-set statistics for "fair comparison". What is wrong with this? What should you do instead?
- Explain why the sample variance of activations in a neural network might increase as training progresses, and how Batch Norm addresses this.
U.3 Implementation Challenges
- Implement the ECDF from scratch without using any statistical libraries. Verify it converges to the true CDF for samples from a known distribution (e.g. a standard Gaussian).
- Implement a bootstrap confidence interval (95%) for the sample median using a large number of bootstrap resamples. Apply it to the income variable from the loan default case study.
- Write a function `detect_drift(train, prod, alpha=0.05)` that returns a dictionary of descriptive statistics for both datasets and flags features where the PSI exceeds 0.2 or the mean has shifted by more than 2 standard errors.
- Reproduce Anscombe's Quartet from scratch and verify all four statistical summaries match within floating-point precision.
- Implement online (streaming) computation of mean, variance, and skewness using Welford's online algorithm - a numerically stable method that processes one observation at a time without storing the entire dataset.
Welford's algorithm hint: with running mean $M$ and running sum of squared deviations $S$, process each new $x_k$ via $\Delta = x_k - M$, $M \leftarrow M + \Delta/n$, $S \leftarrow S + \Delta\,(x_k - M)$.
This is the numerically stable online update rule; Appendix V gives the full algorithm and its relation to BN's running mean/variance accumulation.
Appendix V: Welford's Algorithm - Numerically Stable Online Statistics
Welford's one-pass algorithm (1962) computes the sample mean and variance numerically stably in a single pass, processing one observation at a time. This is essential for streaming data and large datasets.
Algorithm (mean and variance):
Initialize: n = 0, M_0 = 0, S = 0
For each observation x_k:
    n <- n + 1
    \\Delta <- x_k - M_{n-1}
    M_n <- M_{n-1} + \\Delta / n
    S <- S + \\Delta * (x_k - M_n)
Sample variance: s^2 = S / (n - 1)
Why numerically stable: The naive one-pass formula $s^2 = \frac{1}{n-1}\big(\sum_i x_i^2 - n\bar{x}^2\big)$ suffers catastrophic cancellation when the spread is tiny relative to the magnitude of the values (near-constant data): the two large terms are nearly equal, and their difference loses most of its significant digits. Welford's method maintains running deviations from the current mean, avoiding this subtraction. A minimal implementation follows.
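A minimal implementation of the recurrence above (a sketch, checked against NumPy's two-pass result on a deliberately hard near-constant sample):

```python
import numpy as np

class WelfordAccumulator:
    """Single-pass, numerically stable running mean and variance."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.S = 0.0          # running sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.S += delta * (x - self.mean)   # uses the old and the new mean

    @property
    def variance(self):
        return self.S / (self.n - 1) if self.n > 1 else float("nan")

# Check against the two-pass NumPy result on near-constant data (the hard case)
rng = np.random.default_rng(0)
data = rng.normal(loc=1e6, scale=1e-3, size=10_000)
acc = WelfordAccumulator()
for x in data:
    acc.update(x)
print("mean diff :", acc.mean - data.mean())
print("var ratio :", acc.variance / np.var(data, ddof=1))   # ~ 1.0
```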
Connection to Batch Norm running statistics:
# Batch Norm running mean update (PyTorch default momentum=0.1):
running_mean = (1-momentum) * running_mean + momentum * batch_mean
This is an EWMA (exponentially weighted), not Welford's - but the principle is the same: update the running estimate incrementally without recomputing from scratch.
Extension to skewness (Terriberry, 2007):
\\Phi_3 <- \\Phi_3 + \\Delta^2 * (x_k - M_n) * (1 - 2/n) - 3 * \\Delta * S / n   (apply before updating S)
g_1 = \\sqrt(n) * \\Phi_3 / S^(3/2)
This allows computing skewness in a single pass - useful for monitoring activation skewness during LLM inference.