
Descriptive Statistics


<- Back to Chapter 7: Statistics | Next: Estimation Theory ->


"The first step in wisdom is knowing what you don't know - and in statistics, that begins by knowing what your data actually looks like."

  • John Tukey

Overview

Descriptive statistics is the art and science of summarising data before you model it. It answers a deceptively simple question: what does this dataset look like? Before fitting a model, optimising a loss, or drawing any inference, you must know whether your features are symmetric or skewed, whether outliers will dominate gradient updates, whether two variables move together, and whether the scale differences between features will derail gradient descent. Descriptive statistics provides the vocabulary and tools for all of this.

The discipline divides naturally into three layers. The first layer is univariate: for each variable in isolation, compute its centre (mean, median, mode), its spread (variance, IQR, MAD), and its shape (skewness, kurtosis). The second layer is bivariate: measure how pairs of variables relate through covariance, Pearson correlation, and rank correlation. The third layer is multivariate: characterise the joint behaviour of all variables together through the covariance matrix, Mahalanobis distance, and correlation heatmaps.

Every modern ML pipeline touches descriptive statistics constantly, often invisibly. Batch normalisation computes sample means and variances of activations. Adam's momentum terms are exponentially-weighted sample statistics. Dataset documentation standards require engineers to report label distributions, feature histograms, and correlation structures. Data drift is detected by monitoring shifts in descriptive statistics over time. This section builds the rigorous foundation for all of these applications.

Prerequisites

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Interactive implementation of every statistic with visualisations and ML case studies |
| exercises.ipynb | 10 graded exercises from basic EDA through Batch Norm and Adam momentum analysis |

Learning Objectives

After completing this section, you will:

  • Compute and interpret all standard measures of central tendency, spread, and shape for a univariate sample
  • Explain Bessel's correction and derive why dividing by n-1 gives an unbiased variance estimator
  • Construct and interpret the empirical CDF, quantiles, five-number summary, and box plots
  • Assess outliers using Z-score thresholds, Tukey fences, and MAD-based robust methods
  • Define the breakdown point of a statistic and select between mean and median appropriately
  • Compute Pearson, Spearman, and Kendall correlation and choose the right one for your data
  • Build the empirical covariance matrix, check positive semi-definiteness, and compute Mahalanobis distance
  • Explain how sample statistics relate to population parameters and what the iid assumption means
  • Connect batch normalisation, layer normalisation, and RMS norm to their descriptive-statistics definitions
  • Derive Adam's first and second moments as exponentially-weighted sample statistics
  • Detect data drift by monitoring shifts in descriptive statistics over time
  • Apply the full EDA workflow to a multivariate ML dataset

Table of Contents


1. Intuition

1.1 What Descriptive Statistics Does

Descriptive statistics does not make predictions. It does not test hypotheses. It does not fit models. Its sole purpose is to look at the data - to summarise its essential features in a form that a human can understand and that downstream methods can rely on.

The canonical demonstration of why this matters is Anscombe's Quartet (1973): four datasets with nearly identical means (\bar{x} = 9.0, \bar{y} = 7.5), variances (s_x^2 = 11.0, s_y^2 = 4.12), and Pearson correlations (r = 0.816) - yet one is a noisy but clearly linear relationship, one is a curved parabola, one is a near-perfect line with a single outlier, and one is a vertical cluster with a single high-leverage point. A statistician who skips visualisation and reports only the summary statistics would fit the same regression line to all four, missing everything important.

The lesson for ML is immediate: if you skip EDA on your training data, you may train a model on a dataset where a single feature with extreme variance (unnormalised pixel intensity vs normalised embedding values) dominates gradient updates; where two features are so highly correlated that they provide redundant information and slow convergence; where a label is heavily imbalanced, making accuracy a misleading metric; or where systematic missing-data patterns encode the label and will cause the model to cheat.

Descriptive statistics is the gate through which all good modelling passes. Every subsequent step - estimation, hypothesis testing, Bayesian inference - implicitly assumes you already know your data's basic structure.

1.2 Why It Matters for AI

The connection between descriptive statistics and modern AI is deeper than "preprocessing". Sample statistics are computed billions of times per training run in contemporary neural networks.

Batch Normalisation (Ioffe & Szegedy, 2015) computes the sample mean and variance of each feature channel across a mini-batch, then normalises activations to zero mean and unit variance. This is exactly the z-score standardisation formula applied at every layer, every forward pass. Without it, deep networks suffer from internal covariate shift - the distribution of activations drifts as parameters update, causing gradient vanishing or explosion.

Layer Normalisation (Ba et al., 2016), used in every transformer architecture (GPT, BERT, LLaMA, Mistral, Gemini), computes the mean and variance across features within a single token's representation. For a token embedding \mathbf{h} \in \mathbb{R}^d: \text{LN}(\mathbf{h}) = \frac{\mathbf{h} - \hat{\mu}}{\hat{\sigma} + \epsilon} \odot \boldsymbol{\gamma} + \boldsymbol{\beta}. The \hat{\mu} and \hat{\sigma} are sample statistics over the d components.

Adam optimiser (Kingma & Ba, 2015) maintains an exponentially-weighted moving average of past gradients (first moment, analogous to a running mean) and past squared gradients (second moment, analogous to a running variance). The update rule corrects for initialisation bias exactly as one corrects for Bessel bias in sample statistics.

Dataset cards required by Hugging Face, Google, and the EU AI Act mandate reporting of label distributions (a sample frequency table), feature statistics (mean, std, min, max per column), and known biases (correlations between sensitive attributes and labels).

For AI: Every time you decide how to initialise weights, normalise activations, monitor training stability, document a dataset, or detect distribution shift in production, you are doing descriptive statistics.

1.3 Historical Timeline

DESCRIPTIVE STATISTICS - HISTORICAL TIMELINE
========================================================================

  1662  John Graunt - Bills of Mortality: first tabulation of population
        statistics; birth/death rates; the discipline's birth

  1805  Legendre, Gauss - Least squares method; residual analysis;
        the mean as the least-squares centre

  1885  Francis Galton - Percentiles, regression to the mean, rank
        correlation; connected statistics to heredity and evolution

  1896  Karl Pearson - Product-moment correlation coefficient r;
        standard deviation (coined the term); chi-squared distribution

  1919  Ronald Fisher - ANOVA, experimental design; p-values; the
        distinction between sample and population statistics

  1925  Fisher - "Statistical Methods for Research Workers"; codified
        the sample mean, variance, and hypothesis testing framework

  1951  John Tukey - Jackknife resampling; robust estimation begins;
        "the future of data analysis" (1962) anticipates EDA

  1973  Frank Anscombe - Anscombe's Quartet; the necessity of plotting

  1977  John Tukey - "Exploratory Data Analysis" textbook; box plots,
        stem-and-leaf plots; EDA as a discipline; look before you infer

  1983  Peter Rousseeuw - Least Median of Squares; formal breakdown
        point theory; robust statistics becomes rigorous

  2015  Ioffe & Szegedy - Batch Normalization; sample statistics
        embedded inside neural network training loops

  2016  Ba et al. - Layer Normalization; per-token sample statistics
        for transformer architectures

  2020  Gebru et al. - "Datasheets for Datasets"; descriptive statistics
        as ethical AI documentation requirement

  2026  EU AI Act - Mandatory dataset statistics for high-risk AI systems

========================================================================

1.4 The EDA Workflow

A disciplined exploratory analysis follows five stages:

THE EDA WORKFLOW
========================================================================

  Stage 1: LOAD AND AUDIT
  -------------------------
  - Check shapes, dtypes, memory usage
  - Count missing values per column
  - Identify data types: numeric / categorical / datetime / text
  - Check for duplicate rows
  - Verify expected value ranges (no age = -5, no probability > 1)

  Stage 2: UNIVARIATE ANALYSIS
  ------------------------------
  For each feature:
  - Central tendency: mean, median, mode
  - Spread: variance, std, IQR, range
  - Shape: skewness, kurtosis
  - Extremes: min, max, outliers
  - Distribution: histogram, KDE, box plot, ECDF

  Stage 3: BIVARIATE ANALYSIS
  ------------------------------
  For each pair (feature, target) and (feature, feature):
  - Pearson/Spearman correlation
  - Scatter plot with regression line
  - Conditional distributions P(Y | X = x)

  Stage 4: MULTIVARIATE ANALYSIS
  ---------------------------------
  - Correlation matrix heatmap
  - Pair plot (scatter matrix)
  - PCA scree plot (variance explained)
  - Cluster structure via UMAP/t-SNE

  Stage 5: DOCUMENT AND ACT
  --------------------------
  - Record all findings
  - Plan transformations (log, standardise, impute)
  - Flag problematic features for domain expert review
  - Update dataset card

========================================================================

In ML practice, this workflow is applied to training data before modelling, to validation data at model evaluation time, and to production data continuously (as drift monitoring). The output is not a model - it is a map of the data landscape that guides every subsequent decision.
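A minimal sketch of Stages 1 and 2 using pandas (the toy columns income and age and the thresholds are purely illustrative, not from any particular dataset):

```python
import numpy as np
import pandas as pd

def audit(df: pd.DataFrame) -> pd.DataFrame:
    """Stage 1: shapes, dtypes, missingness, duplicates."""
    print(f"shape={df.shape}, duplicated rows={df.duplicated().sum()}")
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean().round(3),
    })

def univariate_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Stage 2: centre, spread, and shape for each numeric column."""
    num = df.select_dtypes(include="number")
    return pd.DataFrame({
        "mean": num.mean(),
        "median": num.median(),
        "std": num.std(),            # ddof=1 by default in pandas
        "iqr": num.quantile(0.75) - num.quantile(0.25),
        "skew": num.skew(),
        "kurtosis": num.kurtosis(),  # excess kurtosis (Gaussian -> 0)
    })

# Illustrative toy data; replace with your own dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(3.5, 0.6, 500),
                   "age": rng.integers(18, 90, 500)})
print(audit(df))
print(univariate_summary(df))
```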


2. Formal Definitions and Notation

2.1 Sample Space and the Data Matrix

Let X\mathcal{X} be the population - the full set of all possible observations. A sample is a collection of nn realisations drawn from X\mathcal{X}:

\{x_1, x_2, \ldots, x_n\} \quad x_i \overset{\text{iid}}{\sim} P

The iid assumption (independent, identically distributed) means each xix_i is drawn from the same distribution PP independently of all others. This is the foundation on which classical descriptive statistics rests - but it is violated routinely in practice:

  • Time series data: xtx_t depends on xt1x_{t-1} (autocorrelation)
  • Clustered data: patients from the same hospital are correlated
  • Sequential text: tokens in a sentence are deeply dependent
  • Mini-batches in SGD: often drawn without replacement, not exactly iid

The data matrix for multivariate data is XRn×d\mathbf{X} \in \mathbb{R}^{n \times d}, where nn is the number of observations (rows) and dd is the number of features (columns). Row ii is the feature vector xiRd\mathbf{x}_i \in \mathbb{R}^d; column jj is the vector of all observed values of feature jj.

A critical distinction runs throughout this section:

| Symbol | Name | Meaning |
|---|---|---|
| μ | Population mean | E[X]; theoretical, unknown |
| x̄ | Sample mean | (1/n) Σᵢ xᵢ; computed from data |
| σ² | Population variance | Var(X); theoretical, unknown |
| s² | Sample variance | (1/(n−1)) Σᵢ (xᵢ − x̄)²; computed from data |
| ρ | Population correlation | Corr(X, Y); theoretical |
| r | Sample correlation | Pearson's r; computed from data |

For AI: The distinction matters in Batch Norm. During training, BN uses the mini-batch sample mean xˉ\bar{x} and sample variance s2s^2. During inference, it uses running estimates of the population mean and variance accumulated during training. The two are different statistics with different properties.

2.2 Measures of Central Tendency

Definition (Sample Mean). Given a sample x1,,xnRx_1, \ldots, x_n \in \mathbb{R}, the sample mean is:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

The mean is the unique minimiser of i=1n(xic)2\sum_{i=1}^n (x_i - c)^2 over cRc \in \mathbb{R}. It is the least-squares centre of the data. It is also an unbiased estimator of μ\mu: E[xˉ]=μ\mathbb{E}[\bar{x}] = \mu (proof: by linearity of expectation).

Definition (Sample Median). Sort the sample as x(1)x(2)x(n)x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}. The sample median is:

\tilde{x} = \begin{cases} x_{(m+1)} & \text{if } n = 2m+1 \text{ (odd)} \\ \frac{x_{(m)} + x_{(m+1)}}{2} & \text{if } n = 2m \text{ (even)} \end{cases}

The median is the minimiser of i=1nxic\sum_{i=1}^n |x_i - c| over cc. It is the least-absolute-deviations centre.

Definition (Sample Mode). The value that appears most frequently. Not unique if two values tie (bimodal). Not well-defined for continuous data without binning.

Comparison:

CENTRAL TENDENCY COMPARISON
========================================================================

  Statistic  Loss minimised     Sensitivity    Use when
  ---------  -----------------  -------------  -------------------------
  Mean       Σ(x_i - c)^2       High           Symmetric, no outliers
             (L2 loss)           to outliers    Gaussian-like data

  Median     Σ|x_i - c|         Low            Skewed distributions,
             (L1 loss)           to outliers    outliers present
                                                Income, house prices

  Mode       0/1 loss           N/A            Categorical data;
                                                multi-modal distributions

========================================================================

Non-examples:

  • A dataset of incomes [25, 30, 32, 35, 1000] thousand: the mean, 224.4k, is misleading; the median, 32k, is representative. The mean is pulled catastrophically by the outlier (see the sketch below).
  • A bimodal dataset [1, 1.1, 1.2, 8.9, 9, 9.1]: the mean (5.05) and median (5.05) both fall in an empty region between the two clusters. Reporting the two cluster locations (the modes after binning) is more informative here.
  • An exactly symmetric dataset: all three coincide. This is the exception, not the rule.
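A quick numerical check of the income example above (NumPy, values in thousands):

```python
import numpy as np

incomes = np.array([25, 30, 32, 35, 1000])  # thousands; one extreme value

print(incomes.mean())           # 224.4 - pulled far to the right by the outlier
print(np.median(incomes))       # 32.0  - still describes a typical observation

# Remove the outlier: the mean collapses, the median barely moves.
clean = incomes[incomes < 1000]
print(clean.mean(), np.median(clean))   # 30.5, 31.0
```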

For AI: In label smoothing and temperature scaling for language models, the "centre" of the output distribution is the mode (the argmax token). In KL divergence minimisation, the mean of the approximate distribution is what's optimised. In median-of-means estimation (used in Byzantine-robust distributed learning), the median replaces the mean to guard against corrupted gradient submissions from compromised workers.

2.3 Measures of Spread

Definition (Sample Variance). The sample variance with Bessel's correction is:

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

Theorem (Bessel's Correction). The estimator s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2 is unbiased for \sigma^2: \mathbb{E}[s^2] = \sigma^2.

Proof. Write \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2. Using \mathbb{E}[x_i^2] = \sigma^2 + \mu^2 and \mathbb{E}[\bar{x}^2] = \sigma^2/n + \mu^2, taking expectations gives:

\mathbb{E}\!\left[\sum_{i=1}^n (x_i - \bar{x})^2\right] = n(\sigma^2 + \mu^2) - n\!\left(\frac{\sigma^2}{n} + \mu^2\right) = (n-1)\sigma^2

Dividing by n-1 gives \mathbb{E}[s^2] = \sigma^2. \square

The divisor n gives the maximum likelihood estimator of \sigma^2 under a Gaussian assumption, which is biased. NumPy uses n by default (ddof=0); use ddof=1 for the unbiased version.
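A small simulation sketch of this point: averaging many size-5 samples shows ddof=0 systematically underestimating the variance while ddof=1 does not (the sample size and true variance of 4 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                                                     # true population variance
samples = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, 5))    # many samples of n = 5

var_mle  = samples.var(axis=1, ddof=0)   # divide by n   (biased, MLE)
var_bess = samples.var(axis=1, ddof=1)   # divide by n-1 (unbiased)

print(var_mle.mean())    # ~3.2 = (n-1)/n * sigma^2 = 0.8 * 4
print(var_bess.mean())   # ~4.0 = sigma^2
```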

Definition (Standard Deviation). s = \sqrt{s^2}. Note: s is NOT an unbiased estimator of \sigma (by Jensen's inequality, \mathbb{E}[\sqrt{s^2}] \leq \sqrt{\mathbb{E}[s^2]} = \sigma), but the bias is negligible for large n.

Definition (Range). R = x_{(n)} - x_{(1)}. Maximal sensitivity to outliers; breakdown point = 1/n \approx 0.

Definition (Interquartile Range). \text{IQR} = Q_3 - Q_1 = Q(0.75) - Q(0.25). Contains the middle 50% of the data. Breakdown point = 25%.

Definition (Coefficient of Variation). \text{CV} = s / \bar{x} (dimensionless). Useful for comparing spread across variables with different units. Not meaningful when \bar{x} \approx 0.

For AI: In gradient clipping, the gradient norm g\|\mathbf{g}\| is compared to a threshold - the threshold is often chosen based on the empirical standard deviation of gradient norms observed during early training. In the Adam optimiser, the second moment estimate vt=β2vt1+(1β2)gt2v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2 estimates E[g2]\mathbb{E}[g^2], and vt/(1β2t)\sqrt{v_t/(1-\beta_2^t)} is the adaptive per-parameter learning rate - a measure of spread in the gradient history.

2.4 Shape Statistics: Skewness and Kurtosis

Definition (Sample Skewness). Fisher's g1g_1:

g_1 = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^3}{s^3} = \frac{m_3}{m_2^{3/2}}

where mk=1n(xixˉ)km_k = \frac{1}{n}\sum (x_i - \bar{x})^k is the kk-th central moment.

  • g1>0g_1 > 0: right-skewed (long right tail; mean > median)
  • g1<0g_1 < 0: left-skewed (long left tail; mean < median)
  • g1=0g_1 = 0: symmetric

Examples: Income distributions are right-skewed (g1>0g_1 > 0). Exam scores near a ceiling are left-skewed (g1<0g_1 < 0). Gradient norms in deep networks are often right-skewed - a few extreme gradients can destabilise training.

Definition (Sample Excess Kurtosis). Fisher's g2g_2:

g_2 = \frac{m_4}{m_2^2} - 3

The subtraction of 3 normalises so that the Gaussian has g2=0g_2 = 0.

  • g2>0g_2 > 0: leptokurtic (heavier tails than Gaussian; more extreme values)
  • g2<0g_2 < 0: platykurtic (lighter tails; more concentrated)
  • g2=0g_2 = 0: mesokurtic (Gaussian-like tails)

For AI: The distribution of gradients in neural networks is often leptokurtic - it has heavier tails than a Gaussian. This is why vanilla SGD with a Gaussian noise model is a poor fit; it's why gradient clipping is essential; and it's why the Student-tt distribution (which has heavier tails) is sometimes used to model stochastic gradient noise. Understanding kurtosis explains why "gradient explosion" is the right mental model for large-g2g_2 gradient distributions.
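A short sketch computing g_1 and g_2 from central moments on an illustrative right-skewed sample and cross-checking against scipy.stats (the lognormal parameters are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(0.0, 0.75, 10_000)      # right-skewed, heavy-tailed sample

m = x - x.mean()
m2, m3, m4 = (m**2).mean(), (m**3).mean(), (m**4).mean()   # central moments

g1 = m3 / m2**1.5        # Fisher skewness
g2 = m4 / m2**2 - 3      # excess kurtosis (Gaussian -> 0)

print(g1, stats.skew(x))       # both > 0: long right tail
print(g2, stats.kurtosis(x))   # both > 0: heavier tails than Gaussian
```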

2.5 Order Statistics and Quantiles

Definition (Order Statistics). Given sample x1,,xnx_1, \ldots, x_n, the order statistics are x(1)x(2)x(n)x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}.

Definition (Quantile / Percentile). The pp-th quantile Q(p)Q(p), p[0,1]p \in [0,1], is a value such that at least proportion pp of the data lies at or below it. Common choices: Q1=Q(0.25)Q_1 = Q(0.25), Q2=Q(0.50)=x~Q_2 = Q(0.50) = \tilde{x}, Q3=Q(0.75)Q_3 = Q(0.75).

Definition (Empirical CDF). The ECDF is:

\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}[x_i \leq x]

This is a step function that jumps by 1/n1/n at each observation. It is the non-parametric estimator of the true CDF F(x)=P(Xx)F(x) = P(X \leq x).

Theorem (Glivenko-Cantelli). For iid data, \sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F(x)| \overset{a.s.}{\to} 0 as n \to \infty.

This guarantees the ECDF converges uniformly to the true CDF - the empirical distribution is a consistent estimator of the population distribution.

Definition (Five-Number Summary). {min,Q1,x~,Q3,max}\{\min, Q_1, \tilde{x}, Q_3, \max\}. The basis of the box plot.

For AI: Quantile functions are used in calibration evaluation (reliability diagrams), in quantile regression for uncertainty estimation, in fairness auditing (comparing median outcomes across demographic groups), and in the Wasserstein distance (used in GAN training and dataset comparison), which is computed from the quantile functions of two distributions.
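A minimal sketch of the five-number summary and the ECDF evaluated pointwise (the sample and evaluation points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(500)

# Five-number summary
q = np.quantile(x, [0.0, 0.25, 0.5, 0.75, 1.0])
print(dict(zip(["min", "Q1", "median", "Q3", "max"], q.round(3))))

# Empirical CDF evaluated at a point t: fraction of observations <= t
def ecdf(sample: np.ndarray, t: float) -> float:
    return float(np.mean(sample <= t))

print(ecdf(x, 0.0))    # ~0.5 for a symmetric sample centred at 0
print(ecdf(x, 1.96))   # ~0.975 if the sample is close to N(0, 1)
```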


3. Robust Statistics

3.1 Breakdown Point and Influence Function

Classical statistics (mean, variance, Pearson correlation) is vulnerable to outliers because it is designed to be efficient under Gaussian assumptions. Robust statistics replaces efficiency under Gaussianity with resistance to contamination.

Definition (Breakdown Point). The breakdown point ε\varepsilon^* of an estimator TT is the smallest fraction of observations that can be replaced by arbitrarily bad values to make TT take an arbitrarily extreme value:

\varepsilon^* = \min\left\{\frac{m}{n} : \sup_{\text{corrupt }m\text{ pts}} \|T(\text{corrupted}) - T(\text{clean})\| = \infty\right\}
  • Sample mean: ε=1/n0\varepsilon^* = 1/n \approx 0. A single observation replaced by ++\infty drives the mean to infinity.
  • Sample median: ε=(n1)/2/n50%\varepsilon^* = \lfloor(n-1)/2\rfloor / n \to 50\%. You must corrupt more than half the data to move the median arbitrarily far.
  • Sample variance: ε=1/n\varepsilon^* = 1/n (same as mean - one outlier suffices).
  • IQR: ε=25%\varepsilon^* = 25\%.
  • MAD: ε=50%\varepsilon^* = 50\%.

Definition (Influence Function). The influence function IF(x;T,F)\text{IF}(x; T, F) measures the effect of a single infinitesimally contaminated observation at xx on the estimator TT:

\text{IF}(x; T, F) = \lim_{\varepsilon \to 0} \frac{T((1-\varepsilon)F + \varepsilon\delta_x) - T(F)}{\varepsilon}

where δx\delta_x is a point mass at xx.

  • Mean: IF(x;xˉ,F)=xμ\text{IF}(x; \bar{x}, F) = x - \mu. Unbounded - large xx causes large change.
  • Median: IF(x;x~,F)=sign(xx~)/(2f(x~))\text{IF}(x; \tilde{x}, F) = \text{sign}(x - \tilde{x}) / (2f(\tilde{x})). Bounded - extreme xx has finite effect.

For AI: Robust statistics is directly relevant to Byzantine-robust federated learning (e.g., Krum, median-of-means, geometric median aggregation), where some fraction of participating clients may send corrupted gradients. The median-of-means estimator achieves breakdown point ε>0\varepsilon^* > 0 against malicious workers, while naive FedAvg (a simple mean) has ε=1/n\varepsilon^* = 1/n.
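A sketch illustrating the breakdown point and a toy median-of-means aggregator (the corruption value 1e9 and the block count are arbitrary choices, not a production federated-learning aggregator):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)

corrupted = x.copy()
corrupted[0] = 1e9                 # one arbitrarily bad observation

print(x.mean(), corrupted.mean())          # the mean is dragged to ~1e7
print(np.median(x), np.median(corrupted))  # the median barely moves

def median_of_means(v: np.ndarray, n_blocks: int = 10) -> float:
    """Split into blocks, average each block, take the median of the block means."""
    blocks = np.array_split(v, n_blocks)
    return float(np.median([b.mean() for b in blocks]))

print(median_of_means(corrupted))   # robust to a small fraction of corrupted points
```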

3.2 Robust Location Estimators

Definition (Trimmed Mean). The α\alpha-trimmed mean removes the bottom and top α\alpha-fraction of observations before computing the mean:

\bar{x}_\alpha = \frac{1}{n - 2\lfloor\alpha n\rfloor} \sum_{i=\lfloor\alpha n\rfloor + 1}^{n - \lfloor\alpha n\rfloor} x_{(i)}

Breakdown point: α\alpha. With α=0.1\alpha = 0.1, this is the 10%-trimmed mean - discards the most extreme 10% on each side before averaging.

Definition (Winsorised Mean). Rather than discarding extreme values, Winsorisation clips them to the α\alpha and 1α1-\alpha quantiles:

x_i^W = \max(Q_\alpha, \min(Q_{1-\alpha}, x_i))

The Winsorised mean is the mean of {x1W,,xnW}\{x_1^W, \ldots, x_n^W\}. It retains all nn observations but limits the damage any single one can do.

For AI: Gradient clipping in neural network training is exactly Winsorisation of gradient norms. When gt>τ\|\mathbf{g}_t\| > \tau, the gradient is rescaled to τgt/gt\tau \cdot \mathbf{g}_t / \|\mathbf{g}_t\|. This is the 2\ell_2-norm equivalent of Winsorising each component. The threshold τ\tau is a hyperparameter analogous to the Winsorisation quantile. LLaMA, GPT-4, and most modern LLMs use gradient clipping with τ{0.5,1.0}\tau \in \{0.5, 1.0\}.
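A sketch of trimming, Winsorising, and norm-based gradient clipping (the 5% contamination and the threshold tau = 1.0 are illustrative; clip_grad is a hypothetical helper, not a library function):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.concatenate([rng.standard_normal(95), np.full(5, 50.0)])  # 5% contamination

print(x.mean())                                  # pulled well away from 0 by the outliers
print(stats.trim_mean(x, proportiontocut=0.10))  # 10%-trimmed mean, close to 0

# Winsorise by clipping to the 10th/90th percentile instead of discarding
lo, hi = np.quantile(x, [0.10, 0.90])
print(np.clip(x, lo, hi).mean())                 # much closer to 0

# Gradient clipping is Winsorisation of the gradient norm (tau is a hyperparameter)
def clip_grad(g: np.ndarray, tau: float = 1.0) -> np.ndarray:
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)
```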

3.3 Robust Spread: MAD and IQR

Definition (Median Absolute Deviation). The MAD is the median of absolute deviations from the median:

\text{MAD} = \text{median}(|x_i - \tilde{x}|)

For a Gaussian distribution, σ1.4826MAD\sigma \approx 1.4826 \cdot \text{MAD}, so a consistent estimator of σ\sigma is σ^MAD=1.4826MAD\hat{\sigma}_{\text{MAD}} = 1.4826 \cdot \text{MAD}.

The MAD has breakdown point 50% - the best possible for a location-scale estimator.

Definition (IQR). IQR=Q3Q1\text{IQR} = Q_3 - Q_1. For a Gaussian, σIQR/1.349\sigma \approx \text{IQR} / 1.349.

Comparison table:

SPREAD ESTIMATORS - PROPERTIES
========================================================================

  Estimator  Breakdown  Efficiency  Computation   Best for
  ---------  ---------  ----------  ------------  ---------------------
  Std dev s  ~0%        100%        O(n)          Clean Gaussian data
  IQR        25%        37%         O(n log n)    Moderate contamination
  MAD        50%        37%         O(n log n)    Heavy contamination
  Trimmed s  α          depends     O(n log n)    Tunable robustness

========================================================================

For AI: Robust spread estimation is used in anomaly detection for ML systems monitoring. When a model's loss distribution suddenly widens (increased IQR or MAD of batch losses), it signals a data quality issue or distribution shift - potentially faster and more reliably than watching the mean loss alone, since the mean is pulled by a few extreme batches.

3.4 Outlier Detection Methods

Method 1: Z-score threshold. Flag xix_i as an outlier if zi>k|z_i| > k where zi=(xixˉ)/sz_i = (x_i - \bar{x})/s. Standard choice: k=3k = 3 (flags ~0.3% of a Gaussian). Weakness: breakdown point 0\approx 0; the mean and std used to compute ziz_i are themselves distorted by outliers.

Method 2: Tukey fences. Flag as an outlier if xi<Q11.5IQRx_i < Q_1 - 1.5 \cdot \text{IQR} or xi>Q3+1.5IQRx_i > Q_3 + 1.5 \cdot \text{IQR}. This is the standard box plot definition. "Extreme" outliers: use factor 3.03.0. Breakdown point: 25%.

Method 3: Modified Z-score (Iglewicz-Hoaglin). Uses the MAD instead of the standard deviation:

M_i = \frac{0.6745(x_i - \tilde{x})}{\text{MAD}}

Flag as outlier if Mi>3.5|M_i| > 3.5. This is robust: breakdown point 50%.

Summary:

| Method | Breakdown | When to use |
|---|---|---|
| Z-score | ~0% | Clean data; fast first pass |
| Tukey fences | 25% | General purpose; default choice |
| Modified Z-score | 50% | Heavily contaminated data |

For AI: Outlier detection on training data is a data-cleaning step before model training. But outlier detection also occurs at inference time: inputs far outside the training distribution (OOD inputs) can be flagged using Mahalanobis distance (Section 5.3), energy scores, or model uncertainty. The descriptive statistics methods here are the simplest and most interpretable first pass.
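A sketch implementing the three detectors side by side on a contaminated sample (the injected outliers are arbitrary):

```python
import numpy as np

def zscore_outliers(x, k=3.0):
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > k

def tukey_outliers(x, factor=1.5):
    q1, q3 = np.quantile(x, [0.25, 0.75])
    iqr = q3 - q1
    return (x < q1 - factor * iqr) | (x > q3 + factor * iqr)

def modified_zscore_outliers(x, threshold=3.5):
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    m = 0.6745 * (x - med) / mad
    return np.abs(m) > threshold

rng = np.random.default_rng(0)
x = np.concatenate([rng.standard_normal(200), [8.0, -9.0, 12.0]])
for detect in (zscore_outliers, tukey_outliers, modified_zscore_outliers):
    print(detect.__name__, int(detect(x).sum()))   # number of flagged points
```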


4. Bivariate Statistics

4.1 Sample Covariance and Pearson Correlation

Definition (Sample Covariance). For paired samples (x1,y1),,(xn,yn)(x_1, y_1), \ldots, (x_n, y_n):

s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

Covariance measures the direction of linear association: sxy>0s_{xy} > 0 means xx and yy tend to increase together; sxy<0s_{xy} < 0 means they move in opposite directions; sxy=0s_{xy} = 0 means no linear association (but possibly nonlinear).

Definition (Pearson Correlation Coefficient). The sample Pearson correlation is:

r = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}

Properties:

  • r[1,1]r \in [-1, 1] (by Cauchy-Schwarz inequality)
  • r=1r = 1: perfect positive linear relationship; r=1r = -1: perfect negative linear
  • Invariant under affine transformations: r(ax+b,cy+d)=sign(ac)r(x,y)r(ax+b, cy+d) = \text{sign}(ac) \cdot r(x,y)
  • r=0r = 0 does NOT imply independence (only linear independence)

Theorem (Cauchy-Schwarz bound on rr). By the Cauchy-Schwarz inequality applied to the centred vectors u=(xixˉ)\mathbf{u} = (x_i - \bar{x}) and v=(yiyˉ)\mathbf{v} = (y_i - \bar{y}):

\left|\sum_{i=1}^n u_i v_i\right| \leq \sqrt{\sum_{i=1}^n u_i^2} \cdot \sqrt{\sum_{i=1}^n v_i^2}

which gives sxysxsy|s_{xy}| \leq s_x s_y, hence r1|r| \leq 1. Equality holds iff yiyˉ=c(xixˉ)y_i - \bar{y} = c(x_i - \bar{x}) for all ii (perfect linear relationship). \square

Non-examples of r = 0 implying independence:

  • x_i \sim \text{Uniform}(-1,1), y_i = x_i^2: r = 0 exactly, yet y is a deterministic function of x.
  • Anscombe's Dataset II: a smooth parabola with r = 0.816 - here r is far from zero, but it says nothing about the entirely non-linear form of the relationship.

For AI: In transformer attention, the attention weights form a row of a correlation-like matrix between query and key vectors. The dot-product qk/dk\mathbf{q}^\top \mathbf{k} / \sqrt{d_k} is an unnormalised, uncentred covariance between the query and key representations. The softmax normalisation maps these to probabilities. Understanding Pearson correlation provides intuition for what "attention scores" actually measure.
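A brief sketch computing covariance and Pearson r by hand and checking them against np.corrcoef, plus the y = x² non-example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000)
y = 2.0 * x + rng.standard_normal(1_000)   # linearly related plus noise

s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)   # sample covariance
r = s_xy / (x.std(ddof=1) * y.std(ddof=1))                      # Pearson r

print(r, np.corrcoef(x, y)[0, 1])   # the two agree

# r = 0 does not imply independence: y = x^2 is a deterministic function of x
y2 = x**2
print(np.corrcoef(x, y2)[0, 1])     # near 0 despite perfect dependence
```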

4.2 Rank Correlation: Spearman and Kendall

Definition (Spearman's ρ\rho). Compute ranks R(xi)R(x_i) and R(yi)R(y_i) of each sample within their respective sequences, then compute Pearson correlation on the ranks:

\rho_S = r(\mathbf{R}(x), \mathbf{R}(y)) = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n(n^2-1)}

where di=R(xi)R(yi)d_i = R(x_i) - R(y_i) and the second formula holds when there are no ties.

Definition (Kendall's τ\tau). Count concordant pairs (both xx and yy increase together) and discordant pairs:

\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\binom{n}{2}}

When to use each:

| Method | Captures | Use when |
|---|---|---|
| Pearson r | Linear association | Both vars approx. Gaussian, no outliers |
| Spearman ρ_S | Monotone association | Ordinal data, outliers present, non-linear monotone |
| Kendall τ | Concordance probability | Small samples, ordinal data, robust inference |

For AI: When auditing ML models for fairness, Spearman correlation (rather than Pearson) between model outputs and sensitive attributes is more appropriate: the relationship between a model's confidence score and a demographic attribute is rarely linear. Rank correlation is also used in information retrieval evaluation - NDCG and Kendall's τ\tau between predicted and true rankings.

4.3 Association Beyond Linear Correlation

Mutual Information (preview). For random variables XX and YY:

I(X; Y) = \int\!\!\int p(x,y) \log \frac{p(x,y)}{p(x)p(y)}\, dx\, dy

MI captures all statistical dependencies (linear and nonlinear). I(X;Y)=0I(X;Y) = 0 iff XYX \perp Y. Unlike Pearson correlation, it detects the y=x2y = x^2 relationship.

Forward reference: Mutual information is the canonical measure of statistical dependence with no linearity assumption. Full treatment: Chapter 9 - Information Theory.

Distance Correlation. Szekely et al. (2007): dCor(X,Y)=0\text{dCor}(X,Y) = 0 iff XYX \perp Y (for all distributions with finite first moments). Computed via pairwise distance matrices. Captures non-linear dependence while remaining a proper correlation measure in [0,1][0,1].

Cramer's V (categorical association). For two categorical variables with a contingency table, Cramer's V normalises the chi-squared statistic to [0,1]: V = \sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}.

4.4 Correlation vs Causation

The maxim "correlation is not causation" is one of the most important lessons in statistics for ML practitioners - and one of the most frequently violated.

Simpson's Paradox (formal example). A dataset shows that treatment A appears better than treatment B when the full dataset is used (r(treatment,outcome)>0r(\text{treatment}, \text{outcome}) > 0), but treatment B is better within every subgroup (men and women separately). This reversal occurs when a confounding variable (sex) is correlated with both the treatment assignment and the outcome.

In ML: A model trained on internet text learns that "doctor" co-occurs more with "he" than "she" - not because being a doctor causes gender, but because historical data reflects a correlation induced by societal biases (the confounder). The model's correlation-based weights perpetuate the bias.

Key insight: Correlation between feature XX and label YY is sufficient to use XX as a predictor only if the test distribution matches the training distribution. If the mechanism generating the correlation changes (distribution shift), the correlation can reverse or disappear - but a causal relationship would remain stable.

For AI: This is why OOD generalisation is fundamentally harder than in-distribution accuracy. Models trained to exploit spurious correlations (e.g., grass background -> outdoor scene) fail when those correlations break (grass in an indoor setting). Causal ML is an active research area aimed at learning invariant, causal features rather than spurious correlations.


5. Multivariate Descriptive Statistics

5.1 Empirical Mean Vector and Covariance Matrix

Definition (Sample Mean Vector). For data matrix XRn×d\mathbf{X} \in \mathbb{R}^{n \times d}:

\hat{\boldsymbol{\mu}} = \frac{1}{n} \mathbf{X}^\top \mathbf{1}_n = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \in \mathbb{R}^d

Definition (Centred Data Matrix). \mathbf{X}_c = \mathbf{X} - \mathbf{1}_n \hat{\boldsymbol{\mu}}^\top (subtract the mean from each row).

Definition (Sample Covariance Matrix). With Bessel's correction:

\hat{\Sigma} = \frac{1}{n-1} \mathbf{X}_c^\top \mathbf{X}_c = \frac{1}{n-1} \sum_{i=1}^n (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^\top \in \mathbb{R}^{d \times d}

Properties of Σ^\hat{\Sigma}:

  • Symmetric: Σ^jk=Σ^kj=sjk\hat{\Sigma}_{jk} = \hat{\Sigma}_{kj} = s_{jk}
  • Positive semi-definite (PSD): vΣ^v0\mathbf{v}^\top \hat{\Sigma} \mathbf{v} \geq 0 for all v\mathbf{v}. Proof: vΣ^v=1n1Xcv20\mathbf{v}^\top \hat{\Sigma} \mathbf{v} = \frac{1}{n-1}\|\mathbf{X}_c \mathbf{v}\|^2 \geq 0.
  • Rank: rank(Σ^)min(d,n1)\text{rank}(\hat{\Sigma}) \leq \min(d, n-1). If n<dn < d, the matrix is rank-deficient - it cannot be inverted. This is the classical pnp \gg n problem.
  • Diagonal entries: Σ^jj=sj2\hat{\Sigma}_{jj} = s_j^2 (sample variance of feature jj)

For AI: The empirical covariance matrix is the key object in PCA (eigendecomposition of \hat\Sigma gives principal components), in Gaussian discriminant analysis, in the Mahalanobis distance (Section 5.3), and in LoRA-style low-rank adaptation (the weight update \Delta W = BA is designed so that the effective covariance of updates is low-rank, matching the low intrinsic dimensionality of fine-tuning data).
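A sketch building the mean vector and covariance matrix from the centred data matrix and checking positive semi-definiteness via eigenvalues (the 3-dimensional generating covariance is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.multivariate_normal([0.0, 0.0, 0.0],
                            [[4.0, 3.0, 0.0],
                             [3.0, 9.0, 0.0],
                             [0.0, 0.0, 1.0]], size=n)

mu_hat = X.mean(axis=0)
Xc = X - mu_hat                               # centred data matrix
Sigma_hat = Xc.T @ Xc / (n - 1)               # sample covariance with Bessel's correction

print(np.allclose(Sigma_hat, np.cov(X, rowvar=False)))   # matches NumPy's estimator
print(np.linalg.eigvalsh(Sigma_hat).min() >= 0)          # PSD: eigenvalues non-negative
```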

5.2 Correlation Matrix and Visualisation

Definition (Sample Correlation Matrix). Standardise each column: x~ij=(xijxˉj)/sj\tilde{x}_{ij} = (x_{ij} - \bar{x}_j)/s_j. Then:

R = \frac{1}{n-1} \tilde{\mathbf{X}}^\top \tilde{\mathbf{X}} \in \mathbb{R}^{d \times d}

Properties: Rjj=1R_{jj} = 1 for all jj; Rjk=rjk[1,1]R_{jk} = r_{jk} \in [-1,1]; RR is PSD with diagonal ones.

The correlation matrix is related to the covariance matrix by Rjk=Σ^jk/(σ^jσ^k)R_{jk} = \hat{\Sigma}_{jk} / (\hat{\sigma}_j \hat{\sigma}_k), i.e., R=D1/2Σ^D1/2R = D^{-1/2} \hat{\Sigma} D^{-1/2} where D=diag(Σ^)D = \text{diag}(\hat{\Sigma}).

Visualisation: A heatmap of RR with a diverging colormap (e.g., RdBu_r) centred at 0 is the standard tool. Hierarchical clustering of rows/columns groups correlated features.

For AI: Analysing the correlation matrix of attention heads in a transformer layer reveals which heads focus on similar token relationships (high rr) and which are functionally diverse (low rr). Correlation matrices of model weights across layers are used in mechanistic interpretability to study how information is processed. High inter-layer weight correlation often signals redundancy - the basis for pruning.

5.3 Mahalanobis Distance

Definition (Mahalanobis Distance). The Mahalanobis distance from a point xRd\mathbf{x} \in \mathbb{R}^d to the sample centre μ^\hat{\boldsymbol{\mu}} is:

d_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \hat{\boldsymbol{\mu}})^\top \hat{\Sigma}^{-1} (\mathbf{x} - \hat{\boldsymbol{\mu}})}

Interpretation: The Mahalanobis distance is the Euclidean distance after whitening - transforming the data so that all features have unit variance and zero correlation. Equivalently, it accounts for the scale and correlation structure of the data: a point that is 3 standard deviations away along a low-variance direction is farther (in Mahalanobis terms) than 3 standard deviations along a high-variance direction.

  • If Σ^=I\hat{\Sigma} = I, then dM(x)=xμ^d_M(\mathbf{x}) = \|\mathbf{x} - \hat{\boldsymbol{\mu}}\| (ordinary Euclidean distance).
  • For multivariate Gaussian data, dM(x)2χ2(d)d_M(\mathbf{x})^2 \sim \chi^2(d) - so dM>χ0.975,d2d_M > \sqrt{\chi^2_{0.975, d}} flags multivariate outliers.

For AI: The Mahalanobis distance is used as an out-of-distribution detection score (Lee et al., 2018): compute dM(h)d_M(\mathbf{h}) of a test example's penultimate-layer embedding relative to the training-set class means and covariances. OOD examples have large Mahalanobis distance. This outperforms simple Euclidean-distance methods and is used in production OOD monitoring for deployed LLMs.
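A sketch of Mahalanobis-based outlier flagging using the χ²(d) threshold (the 2-d Gaussian parameters mirror Exercise 5 and are otherwise arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.multivariate_normal([2.0, -1.0], [[4.0, 3.0], [3.0, 9.0]], size=500)

mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False)
Sigma_inv = np.linalg.inv(Sigma_hat)

diff = X - mu_hat
d2 = np.einsum("ij,jk,ik->i", diff, Sigma_inv, diff)   # squared Mahalanobis distances

threshold = stats.chi2.ppf(0.975, df=2)     # d_M^2 ~ chi^2(d) for Gaussian data
print((d2 > threshold).mean())              # ~2.5% of in-distribution points flagged
```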

5.4 Dimensionality and the Curse

As dd grows, the geometry of data changes in counterintuitive ways that make descriptive statistics increasingly unreliable.

Concentration of pairwise distances. In high dimensions, all pairwise Euclidean distances xixj\|x_i - x_j\| concentrate around the same value: Var(xixj)/E[xixj]20\text{Var}(\|x_i-x_j\|) / \mathbb{E}[\|x_i-x_j\|]^2 \to 0 as dd \to \infty. This means nearest-neighbour methods, clustering, and any distance-based statistic become meaningless - everything is equally far from everything else.

Covariance matrix estimation degrades. Estimating \hat\Sigma \in \mathbb{R}^{d \times d} means fitting O(d^2) parameters from n observations. Unless n is much larger than d the estimate is noise-dominated, and for n < d it is singular - the classical p \gg n problem. Solution: regularised covariance (Ledoit-Wolf), diagonal approximation, or sparse graphical models.

Implication for attention. The softmax attention mechanism computes d_k-dimensional dot products. As d_k grows, the variance of the dot products grows linearly with d_k, pushing some logits to large magnitudes and the softmax into saturation (near-one-hot attention with extremely small gradients). The 1/\sqrt{d_k} scaling in scaled dot-product attention directly addresses this - it is a variance normalisation step grounded in descriptive statistics.


6. Data Standardisation and Preprocessing

6.1 Z-Score Standardisation

Definition (Z-Score Standardisation / Standard Scaling). Transform each observation xix_i to:

z_i = \frac{x_i - \bar{x}}{s}

The result has sample mean zˉ=0\bar{z} = 0 and sample standard deviation sz=1s_z = 1.

Why it matters for ML: Gradient descent convergence depends critically on the scale of features. If one feature ranges over [0, 10^6] and another over [0, 1], the gradient of the loss with respect to the first weight is roughly 10^6 \times larger - the loss landscape is elongated, and gradient descent follows a zigzag path rather than heading directly toward the minimum. Standardising features makes the loss landscape more spherical, dramatically improving gradient descent convergence.

Implementation note: Always fit the scaler on the training set only, then apply the same mean and standard deviation to the validation and test sets. Fitting on the full dataset (including test) is data leakage - it allows test-set information to influence preprocessing.

Properties:

  • The z-score is dimensionless (units cancel).
  • Does not change the shape of the distribution (skewness and kurtosis are invariant).
  • Sensitive to outliers (outliers inflate s, compressing the bulk of the data toward zero).
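Following the implementation note above, a minimal leakage-free standardisation sketch: the statistics are fitted on the training split and reused on the test split (the synthetic splits are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(50.0, 10.0, size=(800, 4))
X_test  = rng.normal(50.0, 10.0, size=(200, 4))

# Fit statistics on the training set only
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0, ddof=1)

# Apply the SAME training statistics to both splits (no leakage)
X_train_z = (X_train - mu) / sd
X_test_z  = (X_test - mu) / sd

print(X_train_z.mean(axis=0).round(3), X_train_z.std(axis=0, ddof=1).round(3))
print(X_test_z.mean(axis=0).round(3))   # close to 0 but not exactly 0 - and that is correct
```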

6.2 Min-Max and Robust Scaling

Definition (Min-Max Scaling). Maps to [0,1][0, 1]:

x_i^{\text{mm}} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}

Useful when features must lie in a fixed range (e.g., inputs to sigmoid activations, image pixel values). Breakdown point: 1/n1/n (one extreme outlier shifts the entire scale).

Definition (Robust Scaling). Uses median and IQR instead of mean and standard deviation:

x_i^{\text{rob}} = \frac{x_i - \tilde{x}}{\text{IQR}}

Result has sample median 0 and IQR 1. Breakdown point: 25%. Appropriate when the data is heavily contaminated.

Decision guide:

SCALING METHOD SELECTION
========================================================================

  Data has outliers?  --Yes-->  Robust Scaling (median / IQR)

  No outliers, and
  need [0,1] range?  --Yes-->  Min-Max Scaling

  No outliers, and
  need zero mean,
  unit variance?     ------>  Z-Score Standardisation (default)

========================================================================

6.3 Normalisation in Neural Networks

Modern neural networks embed descriptive statistics directly into their architecture through normalisation layers. Every such layer computes sample statistics - the difference lies in which samples are used.

Batch Normalisation (Ioffe & Szegedy, 2015). For a feature channel jj across a mini-batch of nn examples:

\hat{h}_{ij} = \frac{h_{ij} - \hat{\mu}_j^{(B)}}{\sqrt{(\hat{\sigma}_j^{(B)})^2 + \epsilon}}, \quad \text{BN}(h_{ij}) = \gamma_j \hat{h}_{ij} + \beta_j

where \hat{\mu}_j^{(B)} = \frac{1}{n}\sum_{i=1}^n h_{ij} and (\hat{\sigma}_j^{(B)})^2 = \frac{1}{n}\sum_{i=1}^n (h_{ij} - \hat{\mu}_j^{(B)})^2. These are sample statistics across the batch for a fixed feature. Parameters \gamma_j, \beta_j are learned.

Layer Normalisation (Ba et al., 2016). For a single token embedding hiRd\mathbf{h}_i \in \mathbb{R}^d:

\hat{h}_{ij} = \frac{h_{ij} - \hat{\mu}_i^{(L)}}{\sqrt{(\hat{\sigma}_i^{(L)})^2 + \epsilon}}, \quad \text{LN}(\mathbf{h}_i) = \boldsymbol{\gamma} \odot \hat{\mathbf{h}}_i + \boldsymbol{\beta}

where \hat{\mu}_i^{(L)} = \frac{1}{d}\sum_{j=1}^d h_{ij} and (\hat{\sigma}_i^{(L)})^2 = \frac{1}{d}\sum_{j=1}^d (h_{ij} - \hat{\mu}_i^{(L)})^2. These are sample statistics across features for a fixed token. Used in all transformer models.

RMS Norm (Zhang & Sennrich, 2019). Used in LLaMA, Mistral, Gemma:

\text{RMSNorm}(\mathbf{h}_i) = \frac{\mathbf{h}_i}{\text{RMS}(\mathbf{h}_i)} \odot \boldsymbol{\gamma}, \quad \text{RMS}(\mathbf{h}_i) = \sqrt{\frac{1}{d}\sum_{j=1}^d h_{ij}^2}

This is standardisation without mean-centering: divide by the root mean square (which is the standard deviation when the mean is 0).
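A NumPy sketch of per-token Layer Norm and RMS Norm - a simplified re-implementation for intuition, not the exact kernels used by any particular framework:

```python
import numpy as np

def layer_norm(h: np.ndarray, gamma: np.ndarray, beta: np.ndarray, eps: float = 1e-5):
    """Normalise each token (row) by the mean and variance of its d features."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)          # biased variance, divisor d
    return (h - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(h: np.ndarray, gamma: np.ndarray, eps: float = 1e-5):
    """Re-scale only: divide by the root mean square, no mean subtraction."""
    rms = np.sqrt((h**2).mean(axis=-1, keepdims=True) + eps)
    return h / rms * gamma

d = 8
h = np.random.default_rng(0).normal(size=(4, d))     # 4 tokens, d features each
print(layer_norm(h, np.ones(d), np.zeros(d)).std(axis=-1).round(3))   # ~1 per token
print(rms_norm(h, np.ones(d)).round(2))
```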

NORMALISATION LAYERS - WHAT STATISTICS THEY COMPUTE
========================================================================

              Batch Norm     Layer Norm        RMS Norm     Group Norm
  ---------------------------------------------------------------------
  Statistic   mean, var      mean, var         rms only     mean, var
  Over        batch x HxW    feature dim       feature dim  feature groups
  Per         channel        token             token        group per token
  Train ≠     yes (running   no                no           no
  inference   stats used)
  Variable    breaks for     works fine        works fine   works fine
  seq length  variable len
  ---------------------------------------------------------------------
  Used in     CNNs, ResNet   all transformers  LLaMA,       CNNs + some
                             (GPT, BERT, ...)  Mistral      transformers

========================================================================

6.4 Handling Missing Data

Missing data is the rule rather than the exception in real ML datasets. The strategy for handling it depends on the missing data mechanism:

  • MCAR (Missing Completely At Random): missingness is independent of all variables. Safe to delete rows or use simple imputation.
  • MAR (Missing At Random): missingness depends on observed variables but not on the missing value itself. Model-based imputation works.
  • MNAR (Missing Not At Random): missingness depends on the missing value (e.g., high-income people don't report income). Simple imputation is biased; need specialised methods.

Simple imputation strategies:

| Strategy | Use for | Advantage | Disadvantage |
|---|---|---|---|
| Mean imputation | Continuous, MCAR | Fast, simple | Reduces variance; biased for skewed data |
| Median imputation | Continuous, outliers | Robust | Reduces variance |
| Mode imputation | Categorical | Simple | Biases category frequencies |
| Forward fill | Time series | Preserves temporal structure | Wrong for gaps > 1 step |

For AI: Missing data in training datasets often encodes information that models will exploit as a spurious correlation. If high-income records are more likely to have missing income fields, a model that has access to missingness indicators may learn to predict high income from missing income - a label leakage that fails catastrophically when data collection improves.


7. Visualisation for EDA

7.1 Univariate Plots

Histogram. Bins the range [x_{\min}, x_{\max}] into k equal intervals and counts observations in each. Choice of k is critical:

  • Sturges' rule: k = \lceil \log_2 n \rceil + 1. Simple; undersmooths for large n.
  • Scott's rule: h = 3.49 \cdot s \cdot n^{-1/3}. Optimal bin width for Gaussian data.
  • Freedman-Diaconis rule: h = 2 \cdot \text{IQR} \cdot n^{-1/3}. Robust to outliers; best general choice.

Kernel Density Estimate (KDE). A smooth non-parametric estimate of the PDF:

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)

where K is a kernel (typically Gaussian) and h is the bandwidth. The optimal bandwidth under a Gaussian kernel for Gaussian data is Silverman's rule: h = 1.06 \cdot s \cdot n^{-1/5}.

Box Plot (Tukey, 1977). Shows five-number summary with: box from Q1Q_1 to Q3Q_3; median line; whiskers extending to the most extreme non-outlier points (within 1.5×IQR1.5 \times \text{IQR} of the box); dots for outliers beyond the fences.

Empirical CDF. Plot F^n(x)\hat{F}_n(x) as a step function. Advantages over histogram: no binning choice; shows quantiles directly; convergence to FF is guaranteed by Glivenko-Cantelli.

Violin Plot. Combines a box plot with a mirrored KDE - shows both summary statistics and distributional shape. Preferred over box plots when sample size is sufficient to estimate the density reliably (n100n \gtrsim 100).

7.2 Bivariate and Multivariate Plots

Scatter Plot. Plot yiy_i vs xix_i for all ii. Add a regression line and confidence band. Colour by a third variable. The go-to tool for bivariate exploratory analysis.

Pair Plot (Scatter Matrix). For a dd-dimensional dataset, create a d×dd \times d grid where (j,k)(j,k) cell shows the scatter plot of feature jj vs feature kk; diagonal cells show histograms or KDEs. Standard library: seaborn.pairplot.

Correlation Heatmap. Display R[1,1]d×dR \in [-1,1]^{d \times d} as a colour-coded matrix. Use a diverging colormap centred at 0 (e.g., coolwarm or RdBu_r). Cluster rows/columns by hierarchical clustering to reveal block structure.

Parallel Coordinates. Each axis corresponds to one feature; each observation is a polyline connecting its values across all axes. Reveals clusters and correlations in high-dimensional data without reducing dimensionality.

7.3 Assessing Normality

Many statistical methods (t-tests, confidence intervals, OLS regression) assume normally distributed data. Assessing normality is therefore a prerequisite for applying them.

Q-Q Plot (Quantile-Quantile Plot). Plot theoretical Gaussian quantiles (x-axis) against sample quantiles (y-axis). If the data is Gaussian, the points lie on a straight line. Systematic deviations reveal:

  • Points below the line in the left tail and above it in the right tail -> heavier tails than Gaussian (leptokurtic, common in financial and gradient data)
  • The reverse pattern (left tail above the line, right tail below) -> lighter tails than Gaussian (platykurtic)
  • A consistently curved, bow-shaped pattern -> skewness (bending upward indicates right skew)

Probability Plot. Same idea as Q-Q but with standardised axes; equivalent to Q-Q for location-scale families.

Shapiro-Wilk Test (preview). A formal hypothesis test for normality. H0H_0: data is Gaussian; WW statistic close to 1 supports normality; small pp-value rejects normality.

Forward reference: Full hypothesis tests for distributional assumptions - Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling - are covered in Section 03 Hypothesis Testing.


8. Applications in Machine Learning

8.1 Batch Normalisation and Layer Normalisation

The two dominant normalisation strategies in deep learning are directly grounded in the sample statistics developed in Section 2.

Batch Norm workflow:

  1. For each feature channel j in a mini-batch of size n:
    • Compute \hat{\mu}_j = \frac{1}{n}\sum_i h_{ij} (sample mean, divisor n, NOT n-1)
    • Compute \hat{\sigma}_j^2 = \frac{1}{n}\sum_i (h_{ij} - \hat{\mu}_j)^2 (biased sample variance, the MLE)
    • Normalise: \hat{h}_{ij} = (h_{ij} - \hat{\mu}_j) / \sqrt{\hat{\sigma}_j^2 + \epsilon}
  2. Scale and shift: \text{BN}(h_{ij}) = \gamma_j \hat{h}_{ij} + \beta_j
  3. At test time, use exponential moving averages \bar{\mu}_j and \bar{\sigma}_j^2 accumulated during training.

Why Batch Norm uses biased variance (nn not n1n-1): Batch Norm is not trying to estimate the population variance; it is normalising the activations within the current mini-batch. Unbiasedness is not the goal - numerical stability and consistent normalisation are. The ϵ\epsilon term (typically 10510^{-5}) prevents division by zero.

Layer Norm workflow:

  1. For each token embedding \mathbf{h}_i \in \mathbb{R}^d in the sequence:
    • Compute \hat{\mu}_i = \frac{1}{d}\sum_j h_{ij} (mean over the d feature dimensions)
    • Compute \hat{\sigma}_i^2 = \frac{1}{d}\sum_j (h_{ij} - \hat{\mu}_i)^2
    • Normalise: \hat{h}_{ij} = (h_{ij} - \hat{\mu}_i) / \sqrt{\hat{\sigma}_i^2 + \epsilon}
  2. Apply learned \gamma_j and \beta_j.

Why transformers use Layer Norm, not Batch Norm: Batch Norm requires a fixed batch size and breaks with variable-length sequences (because the normalisation statistics change with sequence length). Layer Norm normalises within each token independently, making it sequence-length-agnostic and suitable for autoregressive generation (where batch size = 1 at inference).

Why RMSNorm (LLaMA, Mistral, Gemma) drops the mean: Empirically, re-centering (subtracting the mean) provides marginal benefit compared to re-scaling (dividing by RMS). RMSNorm is cheaper to compute (no mean subtraction) and achieves comparable training stability.

8.2 Adam Optimiser as Online Sample Statistics

The Adam optimiser (Kingma & Ba, 2015) maintains running estimates of the first and second moments of the gradient distribution, updated at each step tt:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment: running mean of the gradient)}

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(second moment: running mean of the squared gradient)}

These are exponentially-weighted moving averages (EWMA) of gtg_t and gt2g_t^2. Initialising m0=v0=0m_0 = v_0 = 0 introduces a bias toward zero at early steps (analogous to the bias in the nn vs n1n-1 estimator for finite samples). Adam corrects this with:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

These are the bias-corrected moment estimates. The analogy to Bessel's correction: both corrections account for the fact that a short history (tt steps or nn observations) underestimates the true statistic.

The Adam update is:

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

The denominator v^t\sqrt{\hat{v}_t} is an estimate of the gradient's root-mean-square - an adaptive per-parameter learning rate that scales inversely with the gradient's historical variance. Parameters with high-variance gradients get smaller updates; parameters with low-variance gradients get larger updates.

Connection to descriptive statistics: m^t\hat{m}_t is a smoothed estimate of the gradient mean; v^t\hat{v}_t is a smoothed estimate of the uncentred second moment (related to variance + mean^2). Adam is performing online, exponentially-weighted descriptive statistics on the gradient stream.
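A compact sketch of Adam's moment updates with bias correction, applied to a toy one-parameter quadratic (the hyperparameters are the usual defaults; adam_step is a hypothetical helper, not a framework API):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: EWMA moment estimates with bias correction (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g          # running mean of gradients
    v = beta2 * v + (1 - beta2) * g**2       # running mean of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction for zero initialisation
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(theta) = theta^2; the gradient is 2 * theta
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2.0 * theta
    theta, m, v = adam_step(theta, g, m, v, t, lr=0.05)
print(theta)    # settles near 0 (oscillating within roughly one learning-rate step)
```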

8.3 Dataset Characterisation and Model Cards

The Datasheets for Datasets framework (Gebru et al., 2020) and Hugging Face Dataset Cards require quantitative documentation of every training dataset. The required statistics are directly from descriptive statistics:

  • Sample size: nn (rows), dd (features/columns)
  • Label distribution: frequency table or histogram for each output class
  • Feature statistics: mean, std, min, Q1, median, Q3, max for each numeric feature
  • Missing values: count and fraction per column
  • Correlation structure: correlation matrix or heatmap between features and between features and label
  • Demographic representation: distribution of sensitive attributes (age, gender, race, geography)
  • Temporal coverage: date range, recency bias

For embedding spaces (used in retrieval-augmented generation, semantic search, and vector databases): intra-cluster variance (how spread out a class's embeddings are), inter-cluster distance (how separated different classes are), and cosine similarity distribution (the histogram of pairwise cosine similarities across the corpus).

8.4 Data Drift and Distribution Shift

In production ML systems, the data generating distribution Pprod(X,Y)P_{\text{prod}}(X, Y) may diverge from the training distribution Ptrain(X,Y)P_{\text{train}}(X, Y). This is called distribution shift or data drift. Monitoring descriptive statistics is the first line of defence.

Types of drift:

  • Covariate shift: P(X)P(X) changes, P(YX)P(Y|X) stays the same. Feature means shift.
  • Label shift: P(Y)P(Y) changes, P(XY)P(X|Y) stays the same. Label proportions change.
  • Concept drift: P(YX)P(Y|X) changes. Same inputs now map to different outputs.

Detection via descriptive statistics:

| Drift type | Monitoring statistic | Alert when |
|---|---|---|
| Covariate shift | Feature means, variances | Mean shift > kσ |
| Skewness shift | Feature skewness g₁ | Large change in g₁ from baseline |
| Outlier rate | Fraction beyond Tukey fences | Rate doubles from baseline |
| Label imbalance | Class frequency table | Class proportions shift |
| PSI (Population Stability Index) | Σₖ (pₖ − qₖ) log(pₖ/qₖ) | PSI > 0.2 |

The Population Stability Index (PSI) is the symmetric KL divergence between the training and production distributions, computed on binned data - a descriptive statistic that quantifies how much the overall distribution has shifted.
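A sketch of PSI computed on quantile bins of the training distribution (the 10-bin choice and the 0.2 alert threshold follow the table above; binning details vary across implementations):

```python
import numpy as np

def psi(train: np.ndarray, prod: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index on quantile bins of the training distribution."""
    edges = np.quantile(train, np.linspace(0, 1, n_bins + 1))[1:-1]   # interior edges
    p = np.bincount(np.searchsorted(edges, train), minlength=n_bins) / len(train)
    q = np.bincount(np.searchsorted(edges, prod), minlength=n_bins) / len(prod)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)             # avoid log(0)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)
print(psi(train, rng.normal(0.0, 1.0, 10_000)))   # ~0    : no drift
print(psi(train, rng.normal(0.5, 1.2, 10_000)))   # > 0.2 : actionable drift
```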

Forward reference: Formal hypothesis tests for detecting drift (KS test, two-sample t-test, chi-squared test) are covered in Section 03 Hypothesis Testing.


9. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using the mean for skewed or heavy-tailed data | A few extreme values (income, house prices, gradient norms) pull the mean far from the "typical" value; it no longer represents the centre | Report median alongside mean; use robust statistics for skewed features |
| 2 | Using n instead of n−1 for sample variance | Dividing by n gives the MLE of σ² under a Gaussian, which is biased: E[s²ₙ] = ((n−1)/n)σ² < σ² | Always use ddof=1 in NumPy/SciPy (np.var(x, ddof=1)) unless computing BN-style normalisation |
| 3 | Fitting the scaler on train+test data | The test-set statistics (mean, std) leak into the preprocessing, artificially improving reported test performance | Fit the scaler only on the training set; transform the test set using training-set parameters |
| 4 | Interpreting r = 0 as independence | Zero Pearson correlation means no linear association; nonlinear dependence can be arbitrarily strong | Also compute distance correlation or mutual information; always scatter-plot the pair |
| 5 | Concluding causation from correlation | Spurious correlations (Simpson's paradox, confounders, selection bias) produce r ≠ 0 with no causal link | Use controlled experiments (A/B tests) or causal inference methods to test causal claims |
| 6 | Removing outliers without investigation | An outlier may be the most interesting data point (a fraud transaction, a rare disease case, a critical edge case for a safety model) | Investigate each outlier's source before deciding to remove vs. keep vs. treat separately |
| 7 | Standardising after train/test split but using test statistics | Same as mistake 3 in reverse: computing the standard deviation on the test set introduces leakage | Always use training statistics for all scaling |
| 8 | Using Pearson correlation for ordinal data | Ordinal data (ratings 1-5, Likert scales) has rank order but not equal intervals; Pearson assumes an interval scale | Use Spearman ρ_S or Kendall τ for ordinal data |
| 9 | Ignoring the dimensionality of the covariance matrix | For n < d, Σ̂ is rank-deficient and non-invertible; Mahalanobis distance and PCA are undefined or numerically unstable | Regularise: add λI (ridge), use a diagonal approximation, or reduce dimensionality first |
| 10 | Reporting only mean accuracy as the evaluation metric | Mean accuracy aggregates over all classes; for imbalanced data (99% negative class), a trivial classifier gets 99% accuracy | Also report median, per-class accuracy, precision, recall, F1; use stratified evaluation |
| 11 | Choosing histogram bin width arbitrarily | Too few bins hides multimodality; too many bins creates noise that looks like structure | Use the Freedman-Diaconis rule; compare the histogram to a KDE; be explicit about bin choice |
| 12 | Confusing sample skewness and distribution skewness | High sample skewness g₁ for small n may be noise, not a true distributional property | Check confidence intervals for g₁; use the bootstrap to assess sampling variability |

10. Exercises

Exercise 1 * - Central Tendency and Spread Given the dataset {2,4,4,4,5,5,7,9,9,100}\{2, 4, 4, 4, 5, 5, 7, 9, 9, 100\}: (a) Compute mean, median, and mode. (b) Compute sample variance (with Bessel correction), standard deviation, and IQR. (c) Explain which measure of centre is most representative of the data and why. (d) Remove the outlier 100 and recompute all statistics. By what factor does the mean change? The median?

Exercise 2 * - Five-Number Summary and Outlier Detection Generate 200 observations from N(0,1)\mathcal{N}(0,1) with 5 additional outliers at {8,7,7,8,9}\{-8, -7, 7, 8, 9\}. (a) Compute the five-number summary. (b) Identify outliers using Tukey fences (1.5×IQR1.5 \times \text{IQR}). (c) Identify outliers using the modified Z-score with MAD. (d) Compare which method flags more false positives on the clean Gaussian data (benchmark: F^\hat F critical values for n=200n=200).

Exercise 3 * - Empirical CDF and Q-Q Plot Generate 100 samples from a t3t_3 (Student-tt with 3 degrees of freedom) distribution. (a) Construct and plot the empirical CDF F^n(x)\hat{F}_n(x). (b) Construct a Q-Q plot against the Gaussian distribution. (c) Describe what the Q-Q plot reveals about the tails of the t3t_3 distribution relative to the Gaussian. (d) Compute the excess kurtosis g2g_2 and interpret the result.

Exercise 4 ** - Robust Statistics Under Contamination Start with 95 samples from N(0,1)\mathcal{N}(0,1). Add 5 outliers drawn from N(20,1)\mathcal{N}(20, 1) (contamination fraction 5%). (a) Compare mean vs median vs 10%-trimmed mean on the contaminated dataset. Which is closest to 0? (b) Compare std vs MAD vs IQR. Which is closest to 1? (c) Compute the influence of the single largest outlier on each statistic: replace it with 10610^6 and measure the change. (d) Plot the estimates as a function of contamination fraction α{0,0.05,0.1,0.25,0.4,0.49}\alpha \in \{0, 0.05, 0.1, 0.25, 0.4, 0.49\}.

Exercise 5 ** - Covariance Matrix and Mahalanobis Distance Generate 200 samples from N(μ,Σ)\mathcal{N}(\boldsymbol{\mu}, \Sigma) with μ=(2,1)\boldsymbol{\mu} = (2, -1)^\top and Σ=(4339)\Sigma = \begin{pmatrix}4 & 3 \\ 3 & 9\end{pmatrix}. (a) Compute the sample mean vector μ^\hat{\boldsymbol{\mu}} and covariance matrix Σ^\hat{\Sigma}. Verify they are close to the true values. (b) Verify that Σ^\hat{\Sigma} is PSD. (c) Compute the Mahalanobis distance from each point to μ^\hat{\boldsymbol{\mu}}. Plot a histogram of dM2d_M^2. (d) Compare the histogram to a χ2(2)\chi^2(2) distribution. What percentage of points exceed the 97.5th percentile?

Exercise 6 ** - Anscombe's Quartet Implement all four of Anscombe's datasets from scratch (exact values from the original paper). (a) Compute and verify that all four datasets have the same mean xˉ=9\bar{x} = 9, yˉ=7.5\bar{y} = 7.5, variance sx2=11s_x^2 = 11, sy24.12s_y^2 \approx 4.12, and Pearson correlation r0.816r \approx 0.816. (b) Fit a linear regression y^=β^0+β^1x\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x to each dataset and verify all four give the same coefficients. (c) Plot all four scatter plots with the fitted line. (d) Compute skewness and kurtosis for each dataset's residuals. Do they differ? What does this reveal?

Exercise 7 *** - Batch Norm vs Layer Norm from Scratch (a) Implement batch_norm(X, gamma, beta, eps=1e-5) where X has shape (batch, features), normalising across the batch dimension. (b) Implement layer_norm(X, gamma, beta, eps=1e-5), normalising across the feature dimension. (c) Implement rms_norm(X, gamma, eps=1e-5), normalising by RMS only (no mean subtraction). (d) Apply all three to a random matrix of shape (32,512)(32, 512). Verify: for BN, each column (feature) has mean\approx0, std\approx1; for LN, each row (sample) has mean\approx0, std\approx1; for RMSNorm, each row has RMS\approx1. (e) Verify that when applied to variable-batch-size inputs (1, 4, 32), BN statistics change but LN statistics are invariant. Discuss implications for inference.

Exercise 8 *** - Adam Momentum as Running Sample Statistics (a) For a gradient sequence g1,,g1000g_1, \ldots, g_{1000} drawn from N(0.1,1.0)\mathcal{N}(0.1, 1.0), compute Adam's first moment mtm_t and second moment vtv_t with β1=0.9\beta_1 = 0.9, β2=0.999\beta_2 = 0.999. (b) Plot the bias-corrected estimates m^t\hat{m}_t and v^t\hat{v}_t alongside the true running mean and uncentred second moment. (c) Show that without bias correction, mtm_t underestimates E[g]\mathbb{E}[g] by a factor of (1β1t)(1-\beta_1^t) at step tt. (d) Vary β1{0.5,0.9,0.99}\beta_1 \in \{0.5, 0.9, 0.99\} and plot how quickly m^t\hat{m}_t converges to the true mean. Connect this to the "effective window size" 1/(1β1)1/(1-\beta_1).


11. Why This Matters for AI (2026 Perspective)

| Concept | AI / LLM Application | Impact |
|---|---|---|
| Sample mean and variance | Batch Norm, Layer Norm, RMS Norm in every transformer | Core to training stability; without it, deep networks diverge |
| Robust statistics (MAD, trimmed mean) | Byzantine-robust federated learning; gradient aggregation under adversarial workers | Enables privacy-preserving training that resists poisoning attacks |
| Gradient clipping = Winsorisation | Training GPT-4, LLaMA, Mistral; prevents loss spikes | Without gradient clipping, large models regularly diverge during training |
| Adam = online sample statistics | Universal optimiser for neural networks (90%+ of published LLM training) | Bias-corrected moment estimates make adaptive learning rates work |
| Outlier detection | OOD detection, anomaly detection in production model monitoring | Flagging adversarial or corrupted inputs before they cause harmful outputs |
| Covariance matrix + Mahalanobis distance | OOD detection from penultimate-layer embeddings (Lee et al. 2018) | State-of-the-art OOD scoring without retraining the model |
| Correlation matrix | Attention head redundancy analysis; mechanistic interpretability | Identifies which attention heads can be pruned to reduce inference cost |
| 1/√d_k attention scaling | Prevents softmax saturation in high-dimensional attention (FlashAttention, MLA) | Directly derived from variance concentration in high dimensions |
| Empirical CDF + quantile functions | Calibration evaluation; Wasserstein distance for dataset comparison | Measuring whether a model's confidence matches its accuracy |
| Standardisation (Z-score) | Feature preprocessing for all tabular ML; embedding normalisation | Ensures equal-scale features; required for SGD convergence |
| Skewness and kurtosis | Gradient noise modelling; heavy-tailed gradient distributions | Justifies Student-t noise models over Gaussian for stochastic gradient methods |
| Label distribution (frequency table) | Class-weighted loss, oversampling, dataset cards for responsible AI | Prevents models from learning to predict the majority class trivially |
| Data drift (PSI, KL) | Production monitoring for deployed LLMs and recommendation systems | Early warning system for model degradation; triggers retraining |
| Simpson's paradox + confounders | Fairness auditing; understanding why a model exhibits bias | Correlation-based bias is masked by aggregation; subgroup analysis reveals it |

12. Conceptual Bridge

Descriptive statistics is the foundation on which all statistical inference rests. To draw any conclusion from data - to estimate a parameter, test a hypothesis, train a Bayesian model, or fit a regression - you must first understand what the data actually looks like. This section has provided the complete vocabulary: the mean, median, and mode as measures of centre; variance, standard deviation, IQR, and MAD as measures of spread; skewness and kurtosis as measures of shape; quantiles and the ECDF as complete distributional summaries; covariance and correlation as measures of association; and the covariance matrix and Mahalanobis distance as multivariate generalisations.

The next section, Section02 Estimation Theory, takes the sample statistics developed here and asks: how good are they as estimators of the underlying population parameters? The sample mean xˉ\bar{x} is an estimator of μ\mu - but what is its bias? What is its variance? Is there a better estimator? The Cramer-Rao lower bound sets a fundamental limit on how precisely any unbiased estimator can estimate a parameter, expressed in terms of Fisher information. Bessel's correction (Section2.3) is a preview of estimator theory: the n1n-1 divisor corrects the bias of the sample variance estimator.

Looking back, this section builds directly on probability theory. The population parameters μ=E[X]\mu = \mathbb{E}[X], σ2=Var(X)\sigma^2 = \text{Var}(X), and ρ=Corr(X,Y)\rho = \text{Corr}(X,Y) were defined in Chapter 6 Section04 Expectation and Moments. What we compute here are their sample analogues - functions of observed data designed to estimate those population quantities. The Glivenko-Cantelli theorem (Section2.5) provides the theoretical guarantee connecting the ECDF to the population CDF: it is a specialisation of the law of large numbers proved in Ch6 Section06 Stochastic Processes.

POSITION IN THE CURRICULUM
========================================================================

  CHAPTER 6: PROBABILITY THEORY
  +----------------------------------------------------------------+
  |  Section01 Intro + RVs  ->  Section02 Distributions  ->  Section03 Joint         |
  |  Section04 Expectation  ->  Section05 Concentration  ->  Section06 Stochastic    |
  |  Section07 Markov Chains                                             |
  |  (population parameters mu, sigma^2, rho defined here)                      |
  +----------------------------------+-----------------------------+
                                     | sample analogues
                                     v
  CHAPTER 7: STATISTICS
  +----------------------------------------------------------------+
  |  Section01 DESCRIPTIVE STATISTICS  <-- YOU ARE HERE                  |
  |  "Summarise data before you model it"                          |
  |  xbar, s^2, r, ECDF, IQR, MAD, Sigma_hat, Mahalanobis, z-score           |
  +----------------------------------+-----------------------------+
                                     | "how good are these estimators?"
                                     v
  +----------------------------------------------------------------+
  |  Section02 Estimation Theory                                          |
  |  Bias, variance, MSE, Cramer-Rao, MLE, Fisher information      |
  +----------------------------------+-----------------------------+
                                     | "make decisions from estimates"
                                     v
  +----------------------------------------------------------------+
  |  Section03 Hypothesis Testing  ->  Section04 Bayesian Inference             |
  |  Section05 Time Series  ->  Section06 Regression Analysis                   |
  +----------------------------------------------------------------+

========================================================================

<- Back to Chapter 7: Statistics | Next: Estimation Theory ->


Appendix A: Worked Examples

A.1 Computing All Univariate Statistics by Hand

Dataset: {3,7,7,13,2,8,11,1,14,7}\{3, 7, 7, 13, 2, 8, 11, 1, 14, 7\} (n=10n = 10)

Step 1 - Sort: {1,2,3,7,7,7,8,11,13,14}\{1, 2, 3, 7, 7, 7, 8, 11, 13, 14\}

Step 2 - Central tendency:

  • Mean: xˉ=(1+2+3+7+7+7+8+11+13+14)/10=73/10=7.3\bar{x} = (1+2+3+7+7+7+8+11+13+14)/10 = 73/10 = 7.3
  • Median (n=10n=10, even): Q2=(x(5)+x(6))/2=(7+7)/2=7.0Q_2 = (x_{(5)} + x_{(6)})/2 = (7+7)/2 = 7.0
  • Mode: 77 (appears 3 times)

Step 3 - Spread:

  • Deviations: (6.3,5.3,4.3,0.3,0.3,0.3,0.7,3.7,5.7,6.7)(-6.3, -5.3, -4.3, -0.3, -0.3, -0.3, 0.7, 3.7, 5.7, 6.7)
  • Squared deviations sum: 39.69+28.09+18.49+0.09+0.09+0.09+0.49+13.69+32.49+44.89=178.1039.69 + 28.09 + 18.49 + 0.09 + 0.09 + 0.09 + 0.49 + 13.69 + 32.49 + 44.89 = 178.10
  • Sample variance: s2=178.10/9=19.79s^2 = 178.10/9 = 19.79; s=4.45s = 4.45
  • Range: 141=1314 - 1 = 13
  • Q1=(2+3)/2=2.5Q_1 = (2+3)/2 = 2.5; Q3=(11+13)/2=12.0Q_3 = (11+13)/2 = 12.0; IQR =9.5= 9.5

Step 4 - Shape:

  • Skewness g1g_1: the distribution is nearly symmetric (mean \approx median \approx mode =7= 7); g10.08g_1 \approx 0.08 (very slight right skew due to 14)
  • All values within Tukey fences: [2.514.25,12+14.25]=[11.75,26.25][2.5 - 14.25, 12 + 14.25] = [-11.75, 26.25] - no outliers

Step 5 - Five-number summary: {1,2.5,7.0,12.0,14}\{1, 2.5, 7.0, 12.0, 14\}
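The hand computation above can be checked with a few lines of NumPy/SciPy; this is a quick verification sketch, not part of the worked example.

import numpy as np
from scipy import stats

x = np.array([3, 7, 7, 13, 2, 8, 11, 1, 14, 7])
print(x.mean(), np.median(x))            # 7.3, 7.0
print(x.var(ddof=1), x.std(ddof=1))      # ~19.79, ~4.45
print(stats.skew(x))                     # ~0.08 (very slight right skew)
q1, q3 = np.percentile(x, [25, 75])
print(q1, q3, q3 - q1)                   # NumPy's default quantile interpolation differs slightly
                                         # from the median-of-halves convention used by hand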

A.2 Pearson vs Spearman on a Monotone Nonlinear Dataset

Consider xi=ix_i = i for i=1,,10i = 1, \ldots, 10 and yi=xi3y_i = x_i^3:

  • x=(1,2,3,4,5,6,7,8,9,10)x = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10), y=(1,8,27,64,125,216,343,512,729,1000)y = (1, 8, 27, 64, 125, 216, 343, 512, 729, 1000)
  • Pearson r \approx 0.93 (strong, but y is highly curved, so r < 1)
  • Spearman ρS=1.000\rho_S = 1.000 (perfect rank correlation - the mapping is monotone)

This illustrates the distinction: Spearman captures the perfect monotone relationship, while Pearson understates it because the relationship is monotone but not linear.

Now consider xi=ix_i = i and yi=sin(xi)y_i = \sin(x_i) for i=1,,20i = 1, \ldots, 20:

  • Pearson r0.17r \approx -0.17: looks like no association
  • Spearman ρS0.19\rho_S \approx -0.19: same conclusion

But the relationship is highly structured - it's non-monotone. Both Pearson and Spearman fail to detect it. This is why visualisation is always the first step.
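Both examples can be reproduced directly with SciPy; a quick check, noting that the stated values are rounded.

import numpy as np
from scipy import stats

x = np.arange(1, 11)
y = x.astype(float) ** 3
print(np.corrcoef(x, y)[0, 1])                 # Pearson ~0.93: strong but below 1
print(stats.spearmanr(x, y).correlation)       # exactly 1.0: perfect monotone relationship

x2 = np.arange(1, 21)
y2 = np.sin(x2)
print(np.corrcoef(x2, y2)[0, 1])               # near 0
print(stats.spearmanr(x2, y2).correlation)     # also near 0: non-monotone structure is invisible to both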

A.3 Covariance Matrix - A Complete Worked Example

Dataset: heights (cm), weights (kg), and ages (years) for 5 individuals:

X=(17065251655830180802817572351605522)\mathbf{X} = \begin{pmatrix} 170 & 65 & 25 \\ 165 & 58 & 30 \\ 180 & 80 & 28 \\ 175 & 72 & 35 \\ 160 & 55 & 22 \end{pmatrix}
  • μ^=(170,66,28)\hat{\boldsymbol{\mu}} = (170, 66, 28)

Centred matrix Xc\mathbf{X}_c:

\mathbf{X}_c = \begin{pmatrix} 0 & -1 & -3 \\ -5 & -8 & 2 \\ 10 & 14 & 0 \\ 5 & 6 & 7 \\ -10 & -11 & -6 \end{pmatrix} \qquad \hat{\Sigma} = \frac{1}{4} \mathbf{X}_c^\top \mathbf{X}_c = \begin{pmatrix} 62.5 & 80 & 21.25 \\ 80 & 104.5 & 23.75 \\ 21.25 & 23.75 & 24.5 \end{pmatrix}

Interpretation:

  • Height and weight have high positive covariance (80): taller people tend to weigh more
  • Height and age have a smaller positive covariance (21.25): a weaker association in this sample
  • Diagonal: variances are 62.5, 104.5, 24.5 - weight is the most variable feature

Correlation matrix: R_{12} = 80/\sqrt{62.5 \cdot 104.5} \approx 0.99; height and weight are very strongly correlated.
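These values can be verified with NumPy; a quick check of the matrix algebra above.

import numpy as np

X = np.array([[170, 65, 25],
              [165, 58, 30],
              [180, 80, 28],
              [175, 72, 35],
              [160, 55, 22]], dtype=float)
mu_hat = X.mean(axis=0)                  # [170., 66., 28.]
Xc = X - mu_hat
Sigma_hat = Xc.T @ Xc / (len(X) - 1)     # should reproduce the matrix above
D = np.sqrt(np.diag(Sigma_hat))
R = Sigma_hat / np.outer(D, D)
print(Sigma_hat)
print(R[0, 1])                           # height-weight correlation, ~0.99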


Appendix B: Key Formulas Quick Reference

DESCRIPTIVE STATISTICS - FORMULA REFERENCE
========================================================================

  CENTRAL TENDENCY
  -----------------
  Sample mean:       xbar = (1/n) sum x_i
  Sample median:     middle value of sorted data
  Trimmed mean:      mean of the middle (1 - 2*alpha)*n observations

  SPREAD
  --------
  Sample variance:   s^2 = (1/(n-1)) sum (x_i - xbar)^2
  Std deviation:     s = sqrt(s^2)
  IQR:               Q_3 - Q_1
  MAD:               median(|x_i - median(x)|)
  Robust sigma_hat:  1.4826 * MAD

  SHAPE
  -------
  Skewness:          g_1 = m_3 / m_2^(3/2),   m_k = (1/n) sum (x_i - xbar)^k
  Excess kurtosis:   g_2 = m_4 / m_2^2 - 3

  QUANTILES
  ----------
  Empirical CDF:     Fhat_n(x) = (1/n) sum 1[x_i <= x]
  p-quantile:        Q(p) = inf{x : Fhat_n(x) >= p}
  Five-number:       {min, Q_1, median, Q_3, max}

  BIVARIATE
  ----------
  Covariance:        s_xy = (1/(n-1)) sum (x_i - xbar)(y_i - ybar)
  Pearson r:         r = s_xy / (s_x * s_y)
  Spearman rho:      Pearson r applied to ranks

  MULTIVARIATE
  -------------
  Mean vector:       mu_hat = (1/n) X^T 1
  Covariance matrix: Sigma_hat = (1/(n-1)) Xc^T Xc
  Correlation matrix: R = D^{-1/2} Sigma_hat D^{-1/2}
  Mahalanobis dist:  d_M(x) = sqrt((x - mu_hat)^T Sigma_hat^{-1} (x - mu_hat))

  STANDARDISATION
  -----------------
  Z-score:           z_i = (x_i - xbar) / s
  Min-max:           x_i_mm = (x_i - min) / (max - min)
  Robust:            x_i_rob = (x_i - median) / IQR

  NORMALISATION LAYERS
  ---------------------
  Batch Norm:        hhat_ij = (h_ij - mu_hat_j^B) / sqrt((sigma_hat_j^B)^2 + eps)   [over batch]
  Layer Norm:        hhat_ij = (h_ij - mu_hat_i^L) / sqrt((sigma_hat_i^L)^2 + eps)   [over features]
  RMS Norm:          hhat_ij = h_ij / RMS(h_i)                                       [no mean subtraction]

  ADAM MOMENTS
  -------------
  First moment:      m_t = beta_1 * m_{t-1} + (1 - beta_1) * g_t
  Second moment:     v_t = beta_2 * v_{t-1} + (1 - beta_2) * g_t^2
  Bias correction:   mhat_t = m_t / (1 - beta_1^t),   vhat_t = v_t / (1 - beta_2^t)
  Update:            theta_{t+1} = theta_t - alpha * mhat_t / (sqrt(vhat_t) + eps)

========================================================================

Appendix C: Proofs and Derivations

C.1 Proof: Sample Mean is Unbiased

Claim: E[xˉ]=μ\mathbb{E}[\bar{x}] = \mu where xˉ=1ni=1nXi\bar{x} = \frac{1}{n}\sum_{i=1}^n X_i and XiiidPX_i \overset{\text{iid}}{\sim} P with mean μ\mu.

Proof:

E[xˉ]=E ⁣[1ni=1nXi]=1ni=1nE[Xi]=1nnμ=μ\mathbb{E}[\bar{x}] = \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i] = \frac{1}{n} \cdot n\mu = \mu \quad \square

This uses linearity of expectation and the iid assumption (E[Xi]=μ\mathbb{E}[X_i] = \mu for all ii).

C.2 Proof: Variance of the Sample Mean

Claim: Var(xˉ)=σ2/n\text{Var}(\bar{x}) = \sigma^2/n.

Proof: By independence (Cov(Xi,Xj)=0\text{Cov}(X_i, X_j) = 0 for iji \neq j):

Var(xˉ)=Var ⁣(1nXi)=1n2i=1nVar(Xi)=nσ2n2=σ2n\text{Var}(\bar{x}) = \text{Var}\!\left(\frac{1}{n}\sum X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \quad \square

This is the standard error result: SE(xˉ)=σ/n\text{SE}(\bar{x}) = \sigma/\sqrt{n}. Doubling precision requires 4x the data.

C.3 Proof: Pearson Correlation is Bounded in [1,1][-1,1]

Claim: r1|r| \leq 1.

Proof: Apply Cauchy-Schwarz to the centred vectors u=(x1xˉ,,xnxˉ)\mathbf{u} = (x_1-\bar{x}, \ldots, x_n-\bar{x}) and v=(y1yˉ,,ynyˉ)\mathbf{v} = (y_1-\bar{y}, \ldots, y_n-\bar{y}):

uvuv|\mathbf{u} \cdot \mathbf{v}| \leq \|\mathbf{u}\| \|\mathbf{v}\| i(xixˉ)(yiyˉ)i(xixˉ)2i(yiyˉ)2\left|\sum_i (x_i-\bar{x})(y_i-\bar{y})\right| \leq \sqrt{\sum_i (x_i-\bar{x})^2} \cdot \sqrt{\sum_i (y_i-\bar{y})^2}

Dividing both sides by (n1)(n-1) and by sxsys_x s_y gives r1|r| \leq 1. Equality iff u=cv\mathbf{u} = c\mathbf{v} for some c0c \neq 0, i.e., yiyˉ=c(xixˉ)y_i - \bar{y} = c(x_i - \bar{x}) - a perfect linear relationship. \square

C.4 Proof: Sample Covariance Matrix is PSD

Claim: Σ^=1n1XcXc\hat{\Sigma} = \frac{1}{n-1}\mathbf{X}_c^\top \mathbf{X}_c is positive semi-definite.

Proof: For any vRd\mathbf{v} \in \mathbb{R}^d:

vΣ^v=1n1vXcXcv=1n1Xcv20\mathbf{v}^\top \hat{\Sigma} \mathbf{v} = \frac{1}{n-1} \mathbf{v}^\top \mathbf{X}_c^\top \mathbf{X}_c \mathbf{v} = \frac{1}{n-1} \|\mathbf{X}_c \mathbf{v}\|^2 \geq 0 \quad \square

The matrix is positive definite (not just semi-definite) iff Xc\mathbf{X}_c has full column rank, which requires n>dn > d and no perfect linear dependence among features.


Appendix D: Python Implementation Reference

import numpy as np
from scipy import stats

# -- Univariate statistics --------------------------------------
x = np.array([...])
n = len(x)

mean     = np.mean(x)
median   = np.median(x)
mode     = stats.mode(x, keepdims=False).mode   # keepdims arg needs SciPy >= 1.9
var      = np.var(x, ddof=1)       # unbiased (Bessel's correction)
std      = np.std(x, ddof=1)
q1, q3   = np.percentile(x, [25, 75])
iqr      = q3 - q1
mad      = np.median(np.abs(x - np.median(x)))
robust_s = 1.4826 * mad
skew     = stats.skew(x)
kurt     = stats.kurtosis(x)       # excess kurtosis (Fisher)

# Five-number summary
five_num = np.array([x.min(), q1, median, q3, x.max()])

# Empirical CDF
def ecdf(x_eval, data):
    return np.mean(data[:, None] <= x_eval, axis=0)

# -- Outlier detection ------------------------------------------
# Z-score
z = (x - mean) / std
outliers_z = np.abs(z) > 3

# Tukey fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers_tukey = (x < lower) | (x > upper)

# Modified Z-score (MAD-based)
m_z = 0.6745 * (x - np.median(x)) / mad
outliers_mad = np.abs(m_z) > 3.5

# -- Bivariate statistics ---------------------------------------
x, y = np.array([...]), np.array([...])
cov    = np.cov(x, y, ddof=1)[0, 1]
pearsr = np.corrcoef(x, y)[0, 1]   # Pearson
rho_s  = stats.spearmanr(x, y).statistic
tau    = stats.kendalltau(x, y).statistic

# -- Multivariate statistics ------------------------------------
X = np.array([...])   # shape (n, d)
mu_hat  = X.mean(axis=0)
Xc      = X - mu_hat
Sigma   = (Xc.T @ Xc) / (len(X) - 1)    # covariance matrix
D       = np.sqrt(np.diag(Sigma))
R       = Sigma / np.outer(D, D)          # correlation matrix

# Mahalanobis distance
Sigma_inv = np.linalg.inv(Sigma)
def mahal(x, mu, Sigma_inv):
    diff = x - mu
    return np.sqrt(diff @ Sigma_inv @ diff)

# -- Standardisation --------------------------------------------
z_score   = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
minmax    = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
q1s       = np.percentile(X, 25, axis=0)
q3s       = np.percentile(X, 75, axis=0)
robust_sc = (X - np.median(X, axis=0)) / (q3s - q1s)

Appendix E: Connections to Other Chapters

To Chapter 6 - Probability Theory: Every sample statistic here is an estimator of a population parameter defined in Ch6:

  • xˉ\bar{x} estimates E[X]\mathbb{E}[X] (defined in Ch6 Section04)
  • s2s^2 estimates Var(X)\text{Var}(X) (defined in Ch6 Section04)
  • rr estimates Corr(X,Y)\text{Corr}(X,Y) (defined in Ch6 Section04)
  • F^n(x)\hat{F}_n(x) estimates F(x)=P(Xx)F(x) = P(X \leq x) (defined in Ch6 Section01)
  • The Glivenko-Cantelli guarantee uses the Law of Large Numbers from Ch6 Section06

To Section02 - Estimation Theory: Descriptive statistics IS estimation theory applied to the simplest possible estimands:

  • Is xˉ\bar{x} a good estimator of μ\mu? (Yes - unbiased, consistent, efficient for Gaussian data)
  • What is Var(xˉ)\text{Var}(\bar{x})? (σ2/n\sigma^2/n - derived from the iid assumption)
  • Can we do better than xˉ\bar{x} for non-Gaussian distributions? (Median is more efficient for Laplace distributions) All of these questions are answered formally in Section02 Estimation Theory.

To Section03 - Hypothesis Testing: Many descriptive statistics have associated hypothesis tests:

  • Is μ=μ0\mu = \mu_0? -> one-sample t-test using xˉ\bar{x}
  • Is the data Gaussian? -> Shapiro-Wilk test using skewness and kurtosis
  • Is there correlation? -> test that ρ=0\rho = 0 using Fisher's zz-transformation of rr
  • Has the distribution shifted? -> KS test comparing two ECDFs Full treatment: Section03 Hypothesis Testing.

To Section06 - Regression Analysis: The regression coefficient β^1=rsy/sx\hat{\beta}_1 = r \cdot s_y / s_x is a function of descriptive statistics. The OLS estimator is derived from minimising the sample variance of residuals. The R2R^2 coefficient of determination equals r2r^2 in simple linear regression. Full treatment: Section06 Regression Analysis.

To Chapter 3 - Advanced Linear Algebra: The covariance matrix Σ^\hat\Sigma is a real symmetric PSD matrix. Its eigendecomposition Σ^=QΛQ\hat\Sigma = Q\Lambda Q^\top gives the principal component directions (columns of QQ) and variance explained per component (diagonal of Λ\Lambda). This is PCA - the most important linear dimensionality reduction method in ML. Full treatment: Ch3 Section03 Eigenvalues and Eigenvectors.


Appendix F: Advanced Topics

F.1 Estimating the Population Mean: The Central Limit Theorem Connection

The sample mean xˉ\bar{x} is an excellent estimator, but how confident can we be in it? The Central Limit Theorem tells us that for large nn:

xˉμσ/ndN(0,1)\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \overset{d}{\to} \mathcal{N}(0, 1)

This means the sampling distribution of xˉ\bar{x} is approximately Gaussian with mean μ\mu and standard deviation σ/n\sigma/\sqrt{n} - regardless of the distribution of individual XiX_i. The spread of xˉ\bar{x} decreases as 1/n1/\sqrt{n}: to halve the uncertainty, you need 4x the data.

Backward reference: The CLT was proved formally in Ch6 Section06 Stochastic Processes. Here we use it to interpret the reliability of the sample mean.

Implication for ML: The validation loss computed over a test set of size nn has uncertainty ±zα/2σ^loss/n\pm z_{\alpha/2} \cdot \hat{\sigma}_{\text{loss}} / \sqrt{n}. This is a confidence interval for the true expected loss - a fundamental fact that is frequently ignored when comparing models. Two models that differ by less than ±2SE\pm 2\text{SE} are not statistically distinguishable. See Section02 Estimation Theory and Section03 Hypothesis Testing for formal treatment.
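A minimal sketch of this check for two models scored on the same test set; the per-example losses below are synthetic stand-ins, and the paired comparison assumes both models are evaluated on identical examples.

import numpy as np

def mean_and_se(values):
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std(ddof=1) / np.sqrt(len(values))

rng = np.random.default_rng(0)
loss_a = rng.normal(0.52, 0.30, 2_000)               # stand-in per-example test losses, model A
loss_b = loss_a - rng.normal(0.01, 0.25, 2_000)      # model B on the same examples

diff_mean, diff_se = mean_and_se(loss_a - loss_b)    # paired difference on the shared test set
print(diff_mean, 2 * diff_se)                        # a gap below ~2 SE is not statistically distinguishable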

F.2 L-Estimators: Unifying the Zoo of Robust Statistics

The mean, trimmed mean, Winsorised mean, and median are all members of a single family called L-estimators: linear combinations of order statistics.

TL=i=1nwix(i)T_L = \sum_{i=1}^n w_i x_{(i)}

where the weights wiw_i depend on the estimator:

  • Sample mean: wi=1/nw_i = 1/n for all ii
  • Median (nn odd): wm=1w_m = 1, all others 00
  • α\alpha-trimmed mean: wi=1/(n2k)w_i = 1/(n-2k) for i[k+1,nk]i \in [k+1, n-k], zero otherwise
  • Winsorised mean: wi=1/nw_i = 1/n for i[k+1,nk]i \in [k+1, n-k]; wk=(k+1)/nw_k = (k+1)/n; wnk=(k+1)/nw_{n-k} = (k+1)/n

This unification shows that robustness and efficiency are controlled by the shape of the weight sequence {wi}\{w_i\}: flat weights = maximum efficiency under Gaussian; zero-outside-middle weights = robustness.

F.3 The Efficiency-Robustness Tradeoff

For Gaussian data, the sample mean is the minimum variance unbiased estimator (MVUE) of μ\mu - no other unbiased estimator can achieve lower variance. The relative efficiency of the median vs the mean for Gaussian data is:

ARE(x~,xˉ)=Var(xˉ)Var(x~)=2π0.637\text{ARE}(\tilde{x}, \bar{x}) = \frac{\text{Var}(\bar{x})}{\text{Var}(\tilde{x})} = \frac{2}{\pi} \approx 0.637

The median needs 1/0.637 ≈ 1.57× more data to achieve the same precision as the mean - for Gaussian data. But for heavy-tailed distributions (like Student-t_ν with small ν), the median becomes more efficient than the mean. At the Cauchy distribution (ν = 1) the comparison breaks down entirely: the sample mean of Cauchy data is itself Cauchy distributed, so it never concentrates around the centre and its variance does not exist.
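A small simulation makes the tradeoff concrete; the sample size, replication count, and the t(2) stand-in for heavy-tailed data are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 5_000

gauss = rng.normal(0.0, 1.0, size=(reps, n))
print(gauss.mean(axis=1).var() / np.median(gauss, axis=1).var())   # ~0.64: the mean wins on Gaussian data

heavy = rng.standard_t(df=2, size=(reps, n))                       # heavy-tailed data
print(heavy.mean(axis=1).var() / np.median(heavy, axis=1).var())   # >> 1: the median wins on heavy tails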

This tradeoff is fundamental to the choice between Adam and SGD: Adam's adaptive second-moment estimate essentially adapts to the local variance of the gradient distribution, making it more efficient than SGD with a fixed learning rate when gradient variance varies across parameters.

F.4 Bootstrap Confidence Intervals for Descriptive Statistics

The bootstrap (Efron, 1979) provides confidence intervals for any statistic without assuming a parametric form for the distribution. For a statistic T(x1,,xn)T(x_1, \ldots, x_n):

  1. Draw BB bootstrap samples: for b=1,,Bb = 1, \ldots, B, sample nn observations with replacement from the data.
  2. Compute T(b)T^{(b)} on each bootstrap sample.
  3. The bootstrap distribution {T(1),,T(B)}\{T^{(1)}, \ldots, T^{(B)}\} estimates the sampling distribution of TT.
  4. The 95% bootstrap CI is the [2.5,97.5][2.5, 97.5] percentile interval of {T(b)}\{T^{(b)}\}.

This works for the mean, median, correlation, MAD - any statistic you can compute. For the sample mean, the bootstrap CI coincides with the CLT-based CI for large nn; for the median, it is more accurate.

For AI: Bootstrapping is used to compute confidence intervals on evaluation metrics (accuracy, F1, AUC) that do not have simple analytical distributions. When you report "model A has 94.2% accuracy vs model B's 93.8%", bootstrapping tells you whether this difference is statistically meaningful. Full treatment of hypothesis tests for model comparison: Section03 Hypothesis Testing.
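A minimal percentile-bootstrap sketch for an arbitrary statistic follows; B = 2000 and the 95% level are conventional but arbitrary choices, and the log-normal sample is a placeholder.

import numpy as np

def bootstrap_ci(data, stat_fn, B=2_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    # Resample with replacement, recompute the statistic, take the percentile interval
    boot = np.array([stat_fn(data[rng.integers(0, n, size=n)]) for _ in range(B)])
    return np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])

rng = np.random.default_rng(1)
x = rng.lognormal(0.0, 1.0, 300)               # skewed data where analytic CIs are awkward
print(bootstrap_ci(x, np.median))              # 95% CI for the median
print(bootstrap_ci(x, np.mean))                # 95% CI for the mean (close to the CLT interval here)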

F.5 Multivariate Outlier Detection: The Mahalanobis Envelope

For multivariate Gaussian data xiN(μ,Σ)\mathbf{x}_i \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma), the random variable dM(xi)2=(xiμ)Σ1(xiμ)χ2(d)d_M(\mathbf{x}_i)^2 = (\mathbf{x}_i - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \sim \chi^2(d).

This means that under normality, 95% of observations should have dM2χ0.95,d2d_M^2 \leq \chi^2_{0.95, d}. An observation with dM2>χ0.975,d2d_M^2 > \chi^2_{0.975, d} is a multivariate outlier at the 2.5% per-observation level.

Example (d=2d = 2): χ0.975,22=7.38\chi^2_{0.975, 2} = 7.38, so dM>2.72d_M > 2.72 flags multivariate outliers. An observation can be within the univariate ranges of both features but still be an outlier in the bivariate sense - e.g., if the features are highly positively correlated, an observation with a large positive x1x_1 but small x2x_2 is multivariate-unusual even if neither value is individually extreme.

Robustified version (MCD estimator): The Minimum Covariance Determinant estimator (Rousseeuw, 1984) computes the mean vector and covariance matrix using only the h=(n+d+1)/2h = \lfloor (n+d+1)/2 \rfloor observations whose covariance matrix has minimum determinant. This gives a robust, high-breakdown-point version of the Mahalanobis distance, widely used in OOD detection for deployed ML models.
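A sketch of the χ² envelope rule, with the robust MCD variant via scikit-learn; the 0.975 level follows the text, while the synthetic data and injected outliers are illustrative assumptions.

import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 3.0], [3.0, 9.0]], size=500)
X[:5] += 15.0                                              # inject a few multivariate outliers

mu = X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False, ddof=1))
d2 = np.einsum('ij,jk,ik->i', X - mu, Sigma_inv, X - mu)   # squared Mahalanobis distances
cutoff = chi2.ppf(0.975, df=X.shape[1])                    # 7.38 for d = 2
print(int(np.sum(d2 > cutoff)))                            # classical envelope rule

mcd = MinCovDet(random_state=0).fit(X)                     # robust location and scatter
print(int(np.sum(mcd.mahalanobis(X) > cutoff)))            # mahalanobis() returns squared distances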


Appendix G: Glossary

| Term | Definition |
|---|---|
| iid | Independent and identically distributed: each observation drawn independently from the same distribution |
| Sample statistic | A function of observed data; an estimator of a population parameter |
| Population parameter | A fixed but unknown property of the data-generating distribution (e.g., μ, σ²) |
| Breakdown point | Smallest fraction of corrupted observations that can cause a statistic to take an arbitrarily extreme value |
| Influence function | The infinitesimal effect of adding a single observation at x on a statistic |
| ECDF | Empirical Cumulative Distribution Function: F̂_n(x) = (1/n) Σ 1[x_i ≤ x] |
| L-estimator | A linear combination of order statistics; unifies mean, median, trimmed mean |
| Bessel's correction | Using n−1 instead of n in the sample variance formula to make it unbiased |
| Mahalanobis distance | Distance metric that accounts for the scale and correlation structure of multivariate data |
| Anscombe's Quartet | Four datasets with identical standard summary statistics but dramatically different distributions |
| PSI | Population Stability Index: symmetric KL divergence between training and production distributions |
| ARE | Asymptotic Relative Efficiency: ratio of asymptotic variances of two estimators |
| KDE | Kernel Density Estimate: smooth non-parametric estimate of a probability density |
| Q-Q plot | Quantile-Quantile plot: compares empirical quantiles to theoretical quantiles to assess distributional fit |
| Covariate shift | Distribution shift where P(X) changes but P(Y\|X) stays the same |
| Concept drift | Distribution shift where P(Y\|X) changes: the same inputs now map to different outputs |
| Batch Norm | Normalisation layer that standardises activations across the batch dimension |
| Layer Norm | Normalisation layer that standardises activations across the feature dimension per token |
| RMS Norm | Variant of Layer Norm that divides by root mean square without subtracting the mean |

Appendix H: Practice Problems with Solutions

H.1 Basic Problems

Problem 1. The following are response times (seconds) for 8 API calls: {0.12,0.15,0.09,0.87,0.11,0.14,0.13,0.10}\{0.12, 0.15, 0.09, 0.87, 0.11, 0.14, 0.13, 0.10\}. Compute: (a) mean, (b) median, (c) variance with Bessel's correction, (d) identify outliers using Tukey fences.

Solution: (a) \bar{x} = (0.12+0.15+0.09+0.87+0.11+0.14+0.13+0.10)/8 = 1.71/8 \approx 0.214 seconds (b) Sorted: \{0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.87\}; median = (0.12+0.13)/2 = 0.125 seconds (c) Deviations from the mean: (-0.094, -0.064, -0.124, 0.656, -0.104, -0.074, -0.084, -0.114); s^2 = \sum d_i^2/7 \approx 0.495/7 \approx 0.0707 (d) Q_1 = 0.105, Q_3 = 0.145, IQR = 0.04; upper fence = 0.145 + 0.06 = 0.205; 0.87 > 0.205 -> outlier

Note: the mean (0.2140.214s) is pulled above the upper fence - it is no longer representative of the typical response time. The median (0.1250.125s) is the appropriate summary.

Problem 2. Show that for any dataset, i=1n(xic)2\sum_{i=1}^n (x_i - c)^2 is minimised at c=xˉc = \bar{x}.

Solution:

ddc(xic)2=2(xic)=2(xinc)=0    c=xin=xˉ\frac{d}{dc} \sum (x_i - c)^2 = -2\sum(x_i - c) = -2\left(\sum x_i - nc\right) = 0 \implies c = \frac{\sum x_i}{n} = \bar{x}

Second derivative =2n>0= 2n > 0 confirms a minimum. \square

Problem 3. The waiting times at a coffee shop follow an Exponential(λ=2\lambda=2) distribution. What is the population skewness? How does this compare to sample estimates from n=20n = 20 vs n=1000n = 1000?

Solution: For Exponential(λ\lambda), the skewness is always g1=2g_1 = 2 (independent of λ\lambda). For n=20n=20, the sample skewness estimate has high variance - approximately Var(g1)6/n=0.3\text{Var}(g_1) \approx 6/n = 0.3 under normality, so a standard error of 0.55\approx 0.55. For n=1000n=1000, SE(g1)6/1000=0.077\text{SE}(g_1) \approx \sqrt{6/1000} = 0.077 - much more precise. This illustrates why shape statistics require larger samples than location/scale statistics.

H.2 Intermediate Problems

Problem 4. A dataset has xˉ=50\bar{x} = 50, s=10s = 10, g1=0g_1 = 0, g2=3g_2 = 3 (leptokurtic). Using Chebyshev's inequality, what fraction of observations must lie within [xˉ2s,xˉ+2s][\bar{x} - 2s, \bar{x} + 2s]? How does this compare to the Gaussian prediction?

Solution: Chebyshev: P(Xμkσ)1/k2P(|X - \mu| \geq k\sigma) \leq 1/k^2. For k=2k=2: at least 11/4=75%1 - 1/4 = 75\% lies within ±2\pm 2 standard deviations. For a Gaussian: 95.45%\approx 95.45\%. The leptokurtic distribution (g2=3>0g_2 = 3 > 0) has heavier tails than Gaussian, so we expect a value between 75% and 95.45% to lie in this interval - Chebyshev's bound is not tight for leptokurtic distributions.

Backward reference: Chebyshev's inequality was proved rigorously in Ch6 Section05 Concentration Inequalities.

Problem 5. Two features X1X_1 (salary, thousands) and X2X_2 (years of experience) have covariance s12=15s_{12} = 15, variances s12=900s_1^2 = 900, s22=4s_2^2 = 4. Compute Pearson correlation. Now rescale: X1=X1/1000X_1' = X_1/1000 (salary in millions). How does rr change?

Solution: r = s_{12}/(s_1 s_2) = 15/(30 \cdot 2) = 0.25. After rescaling X_1' = X_1/1000: s_1' = s_1/1000 = 0.03 and s_{12}' = s_{12}/1000 = 0.015, so r' = 0.015/(0.03 \cdot 2) = 0.25. Pearson correlation is unchanged - it is invariant to positive rescaling of either variable, confirming the affine invariance property.

H.3 Advanced Problems

Problem 6 (Batch Norm during inference). During training, Batch Norm computes mini-batch statistics. During inference, it uses running averages. Why can't it use the current-batch statistics during inference for a batch of size 1?

Solution: For a batch of size n=1: \hat{\mu}_j^{(B)} = h_{1j} and (\hat{\sigma}_j^{(B)})^2 = 0 (the variance of a single point is zero). The normalised activation would be 0/\sqrt{0+\epsilon} = 0 for every input - the layer outputs a constant, destroying all information. This is why BN uses running averages accumulated during training for inference. Layer Norm does not have this problem because it normalises across the feature dimension (d), which is fixed regardless of batch size.

Problem 7 (Adam effective window size). Show that the effective window size of Adam's first moment estimator is approximately 1/(1β1)1/(1-\beta_1) time steps.

Solution: Unrolling the recursion mt=β1mt1+(1β1)gtm_t = \beta_1 m_{t-1} + (1-\beta_1)g_t:

mt=(1β1)k=0t1β1kgtkm_t = (1-\beta_1)\sum_{k=0}^{t-1} \beta_1^k g_{t-k}

The weight given to gradient gtkg_{t-k} is (1β1)β1k(1-\beta_1)\beta_1^k, which decays exponentially with delay kk. The effective number of steps contributing significantly is approximately the "half-life" of the exponential: the average delay is E[k]=β1/(1β1)\mathbb{E}[k] = \beta_1/(1-\beta_1), and the effective window is approximately 1/(1β1)1/(1-\beta_1). For β1=0.9\beta_1 = 0.9: window 10\approx 10 steps. For β1=0.99\beta_1 = 0.99: window 100\approx 100 steps.


Appendix I: Software Reference

NumPy / SciPy

| Task | Function | Notes |
|---|---|---|
| Sample mean | np.mean(x) | Defaults to the full array |
| Sample median | np.median(x) | |
| Sample variance | np.var(x, ddof=1) | ddof=1 for Bessel's correction |
| Sample std | np.std(x, ddof=1) | |
| Quantiles | np.percentile(x, [25, 50, 75]) | or np.quantile |
| Skewness | scipy.stats.skew(x) | Fisher's g_1 |
| Kurtosis | scipy.stats.kurtosis(x) | Excess (Fisher), g_2 |
| Pearson r | scipy.stats.pearsonr(x, y) | Returns (r, p-value) |
| Spearman ρ | scipy.stats.spearmanr(x, y) | |
| Kendall τ | scipy.stats.kendalltau(x, y) | |
| Covariance matrix | np.cov(X.T, ddof=1) | Rows = features |
| Correlation matrix | np.corrcoef(X.T) | |
| ECDF | scipy.stats.ecdf(x) | Or manual |

scikit-learn

| Task | Class | Notes |
|---|---|---|
| Z-score scaling | StandardScaler() | Fit on train, transform on test |
| Min-max scaling | MinMaxScaler() | |
| Robust scaling | RobustScaler() | Uses median and IQR |
| Covariance (robust) | MinCovDet() | MCD estimator |
| Mahalanobis OOD | EllipticEnvelope() | Fits a Gaussian, flags outliers |

Visualisation (matplotlib / seaborn)

| Plot | Code | Notes |
|---|---|---|
| Histogram + KDE | sns.histplot(x, kde=True) | |
| Box plot | sns.boxplot(x=x) | Tukey fences |
| Violin plot | sns.violinplot(x=x) | |
| ECDF | sns.ecdfplot(x) | |
| Q-Q plot | scipy.stats.probplot(x, plot=plt) | |
| Scatter + regression | sns.regplot(x=x, y=y) | |
| Pair plot | sns.pairplot(df) | |
| Correlation heatmap | sns.heatmap(R, annot=True, cmap='coolwarm') | |

Appendix J: Common Distributions and Their Descriptive Statistics

Understanding the theoretical descriptive statistics of named distributions (derived from population parameters) helps calibrate expectations when you compute sample statistics from real data.

| Distribution | Mean μ | Variance σ² | Skewness g_1 | Excess kurtosis g_2 |
|---|---|---|---|---|
| Normal(μ, σ²) | μ | σ² | 0 | 0 |
| Uniform(a, b) | (a+b)/2 | (b−a)²/12 | 0 | −6/5 |
| Exponential(λ) | 1/λ | 1/λ² | 2 | 6 |
| Bernoulli(p) | p | p(1−p) | (1−2p)/√(p(1−p)) | (1−6p(1−p))/(p(1−p)) |
| Poisson(λ) | λ | λ | 1/√λ | 1/λ |
| t_ν (Student-t) | 0 (ν>1) | ν/(ν−2) (ν>2) | 0 | 6/(ν−4) (ν>4) |
| χ²(k) | k | 2k | √(8/k) | 12/k |
| Laplace(μ, b) | μ | 2b² | 0 | 3 |

Key observations for ML:

  1. Exponential distribution has skewness =2= 2 and excess kurtosis =6= 6 - heavy right tail and heavy overall tails. This resembles the distribution of inter-event times in Poisson processes (token generation times, request arrival times).

  2. Student-tνt_\nu has excess kurtosis 6/(ν4)6/(\nu-4) for ν>4\nu > 4. For ν=5\nu = 5: g2=6g_2 = 6. As ν\nu \to \infty, g20g_2 \to 0 (approaches Gaussian). The leptokurtic behaviour of tνt_\nu for small ν\nu models heavy-tailed phenomena like stochastic gradient noise in LLM training.

  3. Laplace distribution (double exponential) has excess kurtosis =3= 3. MAP estimation with a Gaussian likelihood and Laplace prior gives Lasso regression - the Laplace prior's heavier tails promote sparsity.

  4. χ2(k)\chi^2(k) is always right-skewed with decreasing skewness as kk increases. The sum of squared standardised Gaussian values follows χ2\chi^2; gradient squared norms tend toward this shape.


Appendix K: Connections to Linear Algebra

The multivariate statistics developed in Section5 are deeply connected to linear algebra concepts:

Covariance matrix eigendecomposition (PCA):

Σ^=QΛQ\hat{\Sigma} = Q \Lambda Q^\top

where QQ is orthogonal (columns are eigenvectors = principal component directions) and Λ=diag(λ1,,λd)\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_d) with λ1λd0\lambda_1 \geq \ldots \geq \lambda_d \geq 0 (eigenvalues = variance explained per component).

The fraction of total variance explained by the first kk components:

VarExplained(k)=j=1kλjj=1dλj=j=1kλjtr(Σ^)\text{VarExplained}(k) = \frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^d \lambda_j} = \frac{\sum_{j=1}^k \lambda_j}{\text{tr}(\hat{\Sigma})}

Forward reference: PCA derivation, spectral theorem, and the relationship between SVD of the data matrix and eigendecomposition of the covariance matrix are covered in Ch3 Section01 Eigenvalues and Eigenvectors and Ch3 Section02 SVD.

Whitening (Mahalanobis transform): The transformation z=Σ^1/2(xμ^)\mathbf{z} = \hat{\Sigma}^{-1/2}(\mathbf{x} - \hat{\boldsymbol{\mu}}) maps the data to have Σ^z=I\hat{\Sigma}_z = I (identity covariance matrix). After whitening, Euclidean distance equals Mahalanobis distance: zizj2=dM(xi,xj)2\|\mathbf{z}_i - \mathbf{z}_j\|^2 = d_M(\mathbf{x}_i, \mathbf{x}_j)^2.

Computing Σ^1/2\hat{\Sigma}^{-1/2}: via eigendecomposition Σ^=QΛQ\hat{\Sigma} = Q\Lambda Q^\top, we get Σ^1/2=QΛ1/2Q\hat{\Sigma}^{-1/2} = Q \Lambda^{-1/2} Q^\top where Λ1/2=diag(λj1/2)\Lambda^{-1/2} = \text{diag}(\lambda_j^{-1/2}).
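A short sketch of whitening via the eigendecomposition, checking that the whitened data has approximately identity covariance; the sample size and the 2-D Gaussian below are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([2.0, -1.0], [[4.0, 3.0], [3.0, 9.0]], size=1_000)

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, ddof=1)
evals, Q = np.linalg.eigh(Sigma)                      # Sigma = Q diag(evals) Q^T
Sigma_inv_sqrt = Q @ np.diag(evals ** -0.5) @ Q.T
Z = (X - mu) @ Sigma_inv_sqrt                         # whitened data (Sigma^{-1/2} is symmetric)
print(np.cov(Z, rowvar=False, ddof=1))                # approximately the identity matrix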

Condition number and numerical issues: The condition number κ(Σ^)=λmax/λmin\kappa(\hat{\Sigma}) = \lambda_{\max}/\lambda_{\min} measures how reliably Σ^\hat{\Sigma} can be inverted. Large condition numbers (features with very different scales, highly correlated features) make the covariance matrix ill-conditioned - the Mahalanobis distance becomes numerically unreliable. This is why standardisation (Section6.1) is important before computing the covariance matrix.


Appendix L: Summary Diagram

DESCRIPTIVE STATISTICS - COMPLETE TAXONOMY
========================================================================

  UNIVARIATE
  ----------
  LOCATION        SPREAD          SHAPE           DISTRIBUTION
  ---------       ------          -----           ------------
  Mean xbar          Variance s^2     Skewness g_1     Histogram
  Median xtilde        Std dev s       Kurtosis g_2     KDE
  Mode            IQR             5-num summary   Box plot
  Trimmed mean    MAD             ECDF            Violin plot
  Winsorised mean Range           Q-Q plot

  BIVARIATE
  ---------
  LINEAR              RANK              NONLINEAR
  ------              ----              ---------
  Covariance sxy      Spearman rho      Distance corr
  Pearson r           Kendall tau       Mutual information
  Scatter plot        Rank scatter      Contingency table
  Anscombe's Q        (for ordinal)     Cramer's V

  MULTIVARIATE
  ------------
  MATRIX              GEOMETRY          DIMENSION
  ------              --------          ---------
  Mean vector mu_hat  Mahalanobis dist  Curse of dim
  Cov matrix Sigma_hat Whitening        PCA/scree plot
  Corr matrix R       Ellipsoidal sets  n>>d constraint
  Heatmap             Outlier envelope  Rank(Sigma_hat) = min(n-1, d)

  PREPROCESSING
  -------------
  Z-SCORE           MIN-MAX         ROBUST
  ------            -------         ------
  (x - xbar) / s      (x-min)/(max-min) (x - xtilde) / IQR
  mean=0, std=1     [0, 1]           median=0, IQR=1
  fits most cases   bounded range    resistant to outliers

  AI APPLICATIONS
  ---------------
  NORMALISATION           OPTIMISATION        MONITORING
  -------------           ------------        ----------
  Batch Norm (per feat)   Adam mhat_t (mean)      Data drift (PSI)
  Layer Norm (per token)  Adam vhat_t (variance)  Outlier detection
  RMS Norm (no center)    Gradient clipping   OOD scoring
                          (Winsorisation)     Calibration

========================================================================

Appendix M: The Statistics of Neural Network Activations

This appendix connects descriptive statistics directly to modern neural network analysis - what happens to activation distributions at each layer, and how normalisation methods control them.

M.1 Why Activations Drift (Internal Covariate Shift)

Consider a deep network with LL layers. Each layer computes h(l)=f(W(l)h(l1)+b(l))\mathbf{h}^{(l)} = f(W^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}). As parameters W(l)W^{(l)} are updated during training, the distribution of h(l)\mathbf{h}^{(l)} changes - even though the distribution of the input h(0)=x\mathbf{h}^{(0)} = \mathbf{x} is fixed. This is internal covariate shift.

Descriptively: the sample mean μ^j(l)\hat{\mu}_j^{(l)} and variance (σ^j(l))2(\hat{\sigma}_j^{(l)})^2 of channel jj at layer ll change with every gradient update. Deep networks without normalisation exhibit a cascade: a parameter update in layer 1 shifts the distribution fed to layer 2, which amplifies or distorts, leading to gradients that vanish (activations -> 0) or explode (activations -> +/-\infty).

Batch Norm breaks this cascade by renormalising to zero mean and unit variance after every parameter update, making the distribution of inputs to layer l+1l+1 independent of the parameters of layer ll.

M.2 Analysing Pre-Norm vs Post-Norm Transformers

Modern transformers use either Pre-LN (Layer Norm before the attention/FFN sublayer) or Post-LN (Layer Norm after adding the residual). GPT-2 used Post-LN; GPT-3, LLaMA, Mistral, and modern models use Pre-LN.

Why Pre-LN is more stable: In Post-LN, the gradient of the loss with respect to early layer parameters passes through many unnormalised residual connections, amplifying gradient norms. In Pre-LN, the gradient is normalised at each layer before propagating back, keeping the effective sample variance of gradients more uniform across layers - a descriptive statistics insight with profound training stability consequences.

Concretely: in a 96-layer transformer (like GPT-3), with Post-LN, the gradient variance at layer 1 is approximately 96×96 \times the gradient variance at layer 95. With Pre-LN, the ratio is approximately 11. This is directly measurable as a descriptive statistic on the gradient distributions across layers.

M.3 The 1/dk1/\sqrt{d_k} Scaling in Attention

In scaled dot-product attention:

Attention(Q,K,V)=softmax ⁣(QKdk)V\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

Why divide by dk\sqrt{d_k}? Each element of QKQK^\top is a dot product of two dkd_k-dimensional vectors. If qi,kjiidN(0,1)q_i, k_j \overset{\text{iid}}{\sim} \mathcal{N}(0,1), then:

Var(qikj)=dk,Std(qikj)=dk\text{Var}(q_i \cdot k_j) = d_k, \quad \text{Std}(q_i \cdot k_j) = \sqrt{d_k}

Without scaling, the dot products have standard deviation dk\sqrt{d_k}, which grows with model dimension. Feeding large-magnitude values into softmax causes it to saturate - the output approaches a one-hot vector, blocking gradient flow. Dividing by dk\sqrt{d_k} restores unit variance:

Var ⁣(qikjdk)=1\text{Var}\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right) = 1

This is a variance normalisation step - standard scaling applied to the attention logits. It is derived directly from the sample variance formula and the property that the variance of a sum of dkd_k independent unit-variance terms is dkd_k.
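The variance claim is easy to verify numerically; d_k = 64 is a typical but arbitrary head dimension, and the Gaussian queries and keys match the assumption in the derivation.

import numpy as np

rng = np.random.default_rng(0)
d_k, n = 64, 100_000
q = rng.normal(0.0, 1.0, (n, d_k))
k = rng.normal(0.0, 1.0, (n, d_k))
logits = np.einsum('nd,nd->n', q, k)          # n independent query-key dot products
print(logits.std())                           # ~ sqrt(d_k) = 8
print((logits / np.sqrt(d_k)).std())          # ~ 1 after scaling - softmax stays in its responsive range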

Multi-head Latent Attention (MLA, DeepSeek-V2, 2024): MLA compresses key-value representations into a low-dimensional latent space before attention. The projection reduces dkd_k and hence the variance of dot products - a deliberate manipulation of the attention logit statistics to reduce KV cache memory while maintaining expressivity.

M.4 Gradient Statistics as a Training Health Monitor

Monitoring the following descriptive statistics of gradient distributions during LLM training provides an early warning system for training instabilities:

| Statistic | Healthy range | Warning sign |
|---|---|---|
| Gradient ℓ2 norm per layer | Decreasing from input to output layers | Exploding: norm growing exponentially |
| Gradient norm variance across layers | Low, uniform | Spike: sudden increase in one layer |
| Gradient skewness g_1 | Near 0 | Large positive: rare large positive gradients |
| Gradient kurtosis g_2 | ≲ 5 | Very large: heavy-tailed gradient distribution -> clip threshold too high |
| Loss spike detection | Batch loss close to the running mean | Single batch loss > mean + 5 MAD -> potentially corrupted batch |
| Parameter update scale ‖W_{t+1} − W_t‖ / ‖W_t‖ | Small (~10⁻³) | Large -> instability |

This is why modern LLM training frameworks (e.g., Megatron-LM, Nanotron) log running statistics of gradient norms, weight update scales, and activation statistics at every layer - real-time EDA of the training process.
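A minimal sketch of such per-layer monitoring for a dictionary of gradient arrays follows; the layer names and gradient values are placeholders, and a real framework would hook this into the training loop rather than use synthetic data.

import numpy as np
from scipy import stats

def gradient_health(grads):
    """grads: mapping from layer name to a flattened gradient array."""
    report = {}
    for name, g in grads.items():
        g = np.asarray(g, dtype=float).ravel()
        report[name] = {
            'l2_norm': float(np.linalg.norm(g)),
            'skewness': float(stats.skew(g)),
            'excess_kurtosis': float(stats.kurtosis(g)),
        }
    return report

rng = np.random.default_rng(0)
fake_grads = {
    'layer_01': rng.normal(0.0, 1e-3, 10_000),                 # well-behaved layer
    'layer_02': rng.standard_t(df=3, size=10_000) * 1e-3,      # heavy-tailed layer -> high kurtosis
}
for name, summary in gradient_health(fake_grads).items():
    print(name, summary)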


Appendix N: Extended Exercise Solutions

N.1 Exercise 7 (Batch Norm vs Layer Norm) - Key Implementation Notes

The critical implementation detail in Batch Norm is the distinction between training and inference modes. During training:

# Training mode: use current mini-batch statistics
mu_B = X.mean(axis=0)           # shape (d,)
var_B = X.var(axis=0)           # biased variance (ddof=0)
X_norm = (X - mu_B) / np.sqrt(var_B + eps)
running_mean = momentum * running_mean + (1-momentum) * mu_B
running_var  = momentum * running_var  + (1-momentum) * var_B

During inference:

# Inference mode: use accumulated running statistics
X_norm = (X - running_mean) / np.sqrt(running_var + eps)

For Layer Norm, the statistics are computed per-sample along the feature dimension:

mu_L = X.mean(axis=1, keepdims=True)   # shape (n, 1)
var_L = X.var(axis=1, keepdims=True)   # shape (n, 1)
X_norm = (X - mu_L) / np.sqrt(var_L + eps)

The key test: for a batch of size 1 (n=1), Batch Norm produces μ^=x1j\hat{\mu} = x_{1j} and σ^2=0\hat{\sigma}^2 = 0, so normalisation maps every activation to 0/0+ε00/\sqrt{0+\varepsilon} \approx 0. Layer Norm is unaffected because it normalises over the feature dimension.

N.2 Exercise 8 (Adam Bias Correction) - Analytical Solution

At time step tt with m0=0m_0 = 0:

mt=(1β1)k=0t1β1kgtkm_t = (1-\beta_1)\sum_{k=0}^{t-1} \beta_1^k g_{t-k}

Taking expectations (assuming iid gradients with E[g]=μg\mathbb{E}[g] = \mu_g):

E[mt]=(1β1)μgk=0t1β1k=(1β1)μg1β1t1β1=μg(1β1t)\mathbb{E}[m_t] = (1-\beta_1)\mu_g \sum_{k=0}^{t-1}\beta_1^k = (1-\beta_1)\mu_g \cdot \frac{1-\beta_1^t}{1-\beta_1} = \mu_g(1-\beta_1^t)

Without correction, mtm_t underestimates μg\mu_g by a factor (1β1t)(1-\beta_1^t). At t=1t=1: m1=(1β1)g1m_1 = (1-\beta_1)g_1, which for β1=0.9\beta_1 = 0.9 gives m1=0.1g1m_1 = 0.1 g_1 - only 10% of the first gradient. Bias correction: m^1=m1/(10.9)=g1\hat{m}_1 = m_1/(1-0.9) = g_1. This recovers the full first gradient as the initial estimate. As tt \to \infty, β1t0\beta_1^t \to 0 and the correction factor 1/(1β1t)11/(1-\beta_1^t) \to 1 - the correction becomes negligible once the EWMA has warmed up.


Appendix O: EDA on Real ML Datasets - Checklist and Patterns

O.1 What to Always Report for a Tabular ML Dataset

Before any model development, the minimum descriptive statistics to report are:

Dataset-level:

  • Shape: nn rows x dd columns
  • Memory footprint
  • Number and fraction of missing values (total and per column)
  • Number of duplicate rows
  • Data types (numeric, categorical, datetime, text, binary)

Per numeric feature:

  • Mean, median, mode (if applicable)
  • Std dev, IQR, min, max
  • Skewness (g1g_1) and kurtosis (g2g_2)
  • Number and fraction of outliers (Tukey fence)
  • Distribution plot (histogram + KDE)

Per categorical feature:

  • Number of unique values (cardinality)
  • Top-kk most frequent values with counts and fractions
  • Frequency of "other" category
  • Missing value rate

Bivariate with target variable:

  • Pearson/Spearman correlation for each numeric feature vs target
  • Conditional distribution plots: P(featureclass)P(\text{feature} | \text{class}) for classification
  • Point-biserial correlation for numeric feature vs binary target

Multivariate:

  • Correlation heatmap of all numeric features
  • Variance inflation factor (VIF) for collinearity assessment
  • PCA scree plot: how many components explain 90% of variance

O.2 Common Patterns and What They Mean

| Pattern | Observation | Implication |
|---|---|---|
| Feature has g_1 > 2 | Right-skewed: long right tail | Consider log transform; mean >> median; outliers will inflate variance |
| Feature has g_2 > 5 | Leptokurtic: heavy tails | Outliers frequent; robust scaling preferred; may need Winsorisation |
| Two features with r > 0.95 | Near-duplicate (strongly collinear) features | Consider dropping one or combining them; collinearity hurts conditioning and interpretability |
| Feature correlation with target near 0 | Weak linear association | Feature may still be useful (nonlinear); check mutual information |
| Label class k has < 5% of samples | Severe imbalance | Accuracy misleading; use balanced accuracy, F1; consider oversampling |
| Feature has bimodal histogram | Two subpopulations | Consider splitting dataset or adding an interaction term |
| Many features have near-zero variance | Near-constant features | Remove via a near-zero-variance filter (contributes nothing, hurts conditioning) |
| MAD ≈ 0 but IQR > 0 | More than 50% identical values | Discrete-valued or heavily concentrated; robust stats unreliable |
| Correlation matrix has block structure | Correlated feature groups | Natural for grouped features (e.g., pixel neighbours, word n-grams) |

O.3 Pre-Training Checklist for LLMs

For large-scale pretraining datasets (web text, code, books), descriptive statistics take a different form:

  • Token distribution: frequency table of most/least common tokens; vocabulary coverage
  • Sequence length distribution: mean, median, percentiles (P95P_{95}, P99P_{99}, P99.9P_{99.9}); right-skewed is typical
  • Document quality scores: perplexity under a reference model; flagging low-quality text
  • Language distribution: fraction of each language; imbalance affects multilingual capability
  • Temporal distribution: date of creation; recency bias analysis
  • Duplication rate: nn-gram overlap statistics; near-duplicate detection
  • Toxicity and bias statistics: frequency of flagged content; demographic representation in co-occurrence statistics

These are all descriptive statistics applied at scale - the same concepts (frequency tables, histograms, correlation structures, outlier detection) but applied to discrete text data rather than continuous measurements.
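For instance, the sequence-length summary reduces to a handful of NumPy calls; the synthetic log-normal token counts below are placeholders for real per-document counts from a tokenised corpus.

import numpy as np

rng = np.random.default_rng(0)
doc_token_counts = rng.lognormal(mean=6.0, sigma=1.0, size=100_000).astype(int)   # synthetic, right-skewed

print(doc_token_counts.mean(), np.median(doc_token_counts))     # mean >> median: the typical right skew
print(np.percentile(doc_token_counts, [95, 99, 99.9]))          # P95 / P99 / P99.9 inform context-length choices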


Appendix P: Notation Summary

Following docs/NOTATION_GUIDE.md:

| Symbol | Meaning | Type |
|---|---|---|
| x_i | i-th scalar observation | Scalar, lowercase |
| \mathbf{x}_i | i-th feature vector | Bold lowercase vector |
| \mathbf{X} | Data matrix (n × d) | Uppercase matrix |
| \mathbf{X}_c | Centred data matrix | Uppercase matrix |
| \bar{x} | Sample mean (scalar) | Scalar |
| \hat{\boldsymbol{\mu}} | Sample mean vector | Bold lowercase, hat |
| s^2 | Sample variance | Scalar |
| \hat{\Sigma} | Sample covariance matrix | Uppercase, hat |
| R | Sample correlation matrix | Uppercase |
| r | Pearson correlation coefficient | Scalar lowercase |
| \rho | Population correlation (or Spearman) | Greek |
| \tau | Kendall's tau | Greek |
| \tilde{x} | Sample median | Scalar, tilde |
| Q(p) | p-th quantile | Function |
| \hat{F}_n(x) | Empirical CDF | Hat, subscript n |
| g_1, g_2 | Sample skewness, excess kurtosis | Scalars |
| d_M(\mathbf{x}) | Mahalanobis distance | Scalar |
| n | Sample size | Positive integer |
| d | Number of features/dimensions | Positive integer |
| \mu, \sigma^2 | Population mean, variance | Greek (theoretical) |

End of Section07-Statistics/01-Descriptive-Statistics/notes.md


Appendix Q: Worked ML Case Study - EDA on a Classification Dataset

This appendix walks through a complete EDA on a synthetic classification dataset representative of tabular ML problems: predicting loan default from applicant features.

Q.1 Dataset Overview

Features: age (years), income ($k), credit score (0-850), debt-to-income ratio, number of accounts, late payments (count). Target: default (0/1), with 15% positive class. Size: n = 5000 applicants, d = 6 features.

Q.2 Univariate Analysis

Age: mean $= 42.3$, median $= 41.0$, $s = 11.8$, $g_1 = 0.2$ (nearly symmetric, slight right skew), $g_2 = -0.3$ (slightly platykurtic). Q-Q plot: slight deviation in the upper tail only. No outliers by Tukey fences.

Income: mean $= 68.5$k, median $= 52.3$k, $s = 51.2$, $g_1 = 2.8$ (strongly right-skewed), $g_2 = 11.4$ (very leptokurtic). Upper Tukey fence: $Q_3 + 1.5 \times \text{IQR} = 87 + 1.5 \times 45 = 154.5$k - 8.3% of observations flagged as outliers. Recommendation: log-transform income before modelling. After the log transform: $g_1 = 0.1$, $g_2 = 0.4$ - approximately Gaussian.

Credit score: mean $= 672$, median $= 685$, $s = 89$, $g_1 = -0.8$ (left-skewed - ceiling effect near 850). A few scores above 840 lie beyond the 98th percentile; no action needed, but note the ceiling.

Late payments: mean $= 1.2$, median $= 0$, mode $= 0$. Heavily right-skewed ($g_1 = 4.1$). 62% of applicants have zero late payments - consider treating this as a mixed discrete-continuous variable or as Poisson-distributed.
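
To make the computation concrete, here is a sketch of how such a univariate summary is produced with NumPy and SciPy (the lognormal sample below is synthetic and only stands in for the income feature):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
income = rng.lognormal(mean=4.0, sigma=0.6, size=5_000)   # synthetic, right-skewed

def univariate_summary(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return {
        "mean": x.mean(), "median": np.median(x), "s": x.std(ddof=1),
        "g1": stats.skew(x), "g2": stats.kurtosis(x),       # excess kurtosis
        "upper_fence": q3 + 1.5 * iqr,
        "outlier_frac": np.mean((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)),
    }

print(univariate_summary(income))           # strongly skewed, leptokurtic
print(univariate_summary(np.log(income)))   # skew and excess kurtosis drop sharply
```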

Q.3 Bivariate Analysis with Target

| Feature | Pearson $r$ | Spearman $\rho_S$ | Note |
|---|---|---|---|
| Age | $-0.03$ | $-0.04$ | Negligible |
| log(Income) | $-0.31$ | $-0.34$ | Moderate negative |
| Credit score | $-0.42$ | $-0.45$ | Strongest predictor |
| Debt-to-income | $+0.38$ | $+0.40$ | Strong positive |
| Late payments | $+0.28$ | $+0.35$ | Moderate positive |

Pearson and Spearman are close for all features except late payments (the most skewed), where Spearman is meaningfully larger - confirming that the relationship is monotone but not linear.
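
The same Pearson-vs-Spearman comparison is a two-liner with SciPy; the sketch below uses a synthetic skewed count feature with a monotone but nonlinear link to the target (illustrative only, not the case-study data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
late_payments = rng.poisson(lam=1.2, size=5_000)               # skewed count feature
p_default = 1.0 / (1.0 + np.exp(-(late_payments - 2.0)))       # monotone, nonlinear link
default = rng.binomial(1, p_default)

r, _ = stats.pearsonr(late_payments, default)
rho, _ = stats.spearmanr(late_payments, default)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
# When the relationship is monotone but not linear, Spearman tends to exceed Pearson.
```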

Q.4 Multivariate Analysis

The correlation matrix reveals:

  • Income and credit score: $r = 0.41$ (expected: higher income -> better credit history)
  • Debt-to-income and late payments: $r = 0.58$ (financially stressed applicants)
  • Age and all others: $|r| < 0.1$ (age is roughly independent of the financial variables in this dataset)

The block structure (income-credit-debt cluster vs age) suggests two natural feature groups for regularisation. PCA on the standardised features: the first two PCs explain 51% and 19% of the variance respectively, 70% in total. The first PC is approximately an equal-weight financial-health score; the second PC is approximately age vs financial stress.
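
A sketch of the corresponding computation (X below is any $n \times d$ feature matrix; the random data and component count are placeholders, so the 51%/19% figures are not reproduced):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def correlation_and_pca(X, n_components=2):
    """Return the d x d sample correlation matrix and PCA explained-variance ratios."""
    R = np.corrcoef(X, rowvar=False)            # correlation matrix of the columns
    Z = StandardScaler().fit_transform(X)       # z-score each feature
    pca = PCA(n_components=n_components).fit(Z)
    return R, pca.explained_variance_ratio_

X = np.random.default_rng(2).normal(size=(5_000, 6))   # placeholder feature matrix
R, ratios = correlation_and_pca(X)
print(np.round(R, 2))
print(ratios, f"total = {ratios.sum():.0%}")
```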

Q.5 Preprocessing Decisions

Based on EDA:

  1. Log-transform income (right-skewed, leptokurtic)
  2. Z-score standardise all numeric features (using training-set statistics only)
  3. No outlier removal - outliers in income are legitimate high-income applicants, not data errors
  4. Handle 3.2% missing credit scores via median imputation (MAR assumption - missing-at-random; lower-income applicants slightly more likely to have missing credit scores)
  5. Flag: debt-to-income and late payments are collinear ($r = 0.58$) - regularisation will be important

Q.6 Key EDA Findings Summary

EDA FINDINGS - LOAN DEFAULT PREDICTION
========================================================================

  [ok] Class imbalance: 15% default -> use balanced accuracy, F1, AUC
  [ok] Income: log-transform required (skew 2.8 -> 0.1 after transform)
  [ok] Credit score: best single predictor (r = -0.42)
  [ok] Collinearity: debt-to-income <-> late payments (r = 0.58)
  [ok] Missing data: 3.2% credit score (MAR) -> median impute
  [ok] Age: near-zero correlation with target -> likely weak feature
  [ok] Scale: all features need standardisation (income in $k,
    credit 0-850, DTI 0-1, late payments 0-20)

  RECOMMENDED PIPELINE:
  log(income) -> StandardScaler -> [classifier with L2 reg]
  Evaluate: AUC-ROC, PR-AUC, balanced accuracy
  Monitor: label distribution shift, credit score mean shift

========================================================================
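
One way to realise the recommended pipeline with scikit-learn is sketched below; the column names, the choice of logistic regression, and imputing all non-income numerics with the median are illustrative assumptions rather than part of the case study:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

other_numeric = ["age", "credit_score", "debt_to_income",
                 "num_accounts", "late_payments"]            # hypothetical column names

preprocess = ColumnTransformer([
    # log-transform income, then standardise (training-set statistics only, via fit)
    ("income", Pipeline([("log", FunctionTransformer(np.log1p)),
                         ("scale", StandardScaler())]), ["income"]),
    # median-impute and standardise the remaining numeric features
    ("rest", Pipeline([("impute", SimpleImputer(strategy="median")),
                       ("scale", StandardScaler())]), other_numeric),
])

clf = Pipeline([("prep", preprocess),
                ("model", LogisticRegression(penalty="l2", class_weight="balanced",
                                             max_iter=1000))])
# clf.fit(X_train, y_train); evaluate with roc_auc_score, average_precision_score,
# and balanced_accuracy_score on a held-out set.
```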

Appendix R: Reading List

Foundational:

  1. Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley. - The founding text; invented box plots, stem-and-leaf, resistant methods.
  2. Huber, P.J. & Ronchetti, E.M. (2009). Robust Statistics (2nd ed.). Wiley. - Formal breakdown point theory; influence functions; M-estimators.
  3. Anscombe, F.J. (1973). "Graphs in Statistical Analysis." American Statistician, 27(1), 17-21. - The Quartet paper; 4 pages, required reading.

Applied Statistics:

  4. Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). Norton. - Best non-technical introduction; strong on intuition.
  5. Wackerly, D., Mendenhall, W., & Scheaffer, R. (2008). Mathematical Statistics with Applications (7th ed.). - Formal treatment; proofs; distributions.

Machine Learning Connections:

  6. Ioffe, S. & Szegedy, C. (2015). "Batch Normalization." ICML 2015. - Original BN paper.
  7. Ba, J.L., Kiros, J.R., & Hinton, G.E. (2016). "Layer Normalization." arXiv:1607.06450. - LN for transformers.
  8. Kingma, D.P. & Ba, J.L. (2015). "Adam: A Method for Stochastic Optimization." ICLR 2015. - Adam; bias correction derivation.
  9. Gebru, T. et al. (2021). "Datasheets for Datasets." Communications of the ACM, 64(12). - Descriptive statistics as documentation standard.
  10. Lee, K. et al. (2018). "A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks." NeurIPS 2018. - Mahalanobis distance for OOD.


Appendix S: Common Statistical Distributions - Shape at a Glance

Recognising the shape of a distribution from its descriptive statistics is a practical skill. This guide maps statistic ranges to likely distribution families.

S.1 Shape Decision Tree

IDENTIFYING DISTRIBUTIONS FROM DESCRIPTIVE STATISTICS
========================================================================

  Start: Look at histogram + g_1 + g_2

  g_1 ≈ 0, g_2 ≈ 0
  +-- Unimodal, bell-shaped  -> Gaussian (test with Shapiro-Wilk)
  +-- Bimodal                -> Mixture of Gaussians

  g_1 > 1, support [0, ∞)
  +-- g_2 ≈ 6                -> Exponential (λ = 1/mean)
  +-- g_2 large, discrete    -> Poisson (λ = mean = variance?)
  +-- g_1 = 2, g_2 = 6       -> exactly Exponential(λ)

  g_1 > 0, support bounded above
  +-- [0, 1], hump shape     -> Beta(α > 1, β > 1)
  +-- [0, 1], J/U shape      -> Beta(α ≤ 1 or β ≤ 1)

  g_1 = 0, g_2 > 0 (heavy tails)
  +-- Symmetric, very heavy  -> Student-t (ν ≈ 6/g_2 + 4)
  +-- Symmetric, long tails  -> Laplace (g_2 = 3)

  g_1 < 0 (left-skewed)
  +-- Bounded above, mode near max -> ceiling effect; truncated distribution

========================================================================

S.2 What Gradient Distributions Look Like

Based on empirical measurements in large LLM training runs:

  • Individual parameter gradients: approximately $\mathcal{N}(0, \sigma^2)$ with $g_1 \approx 0$ but $g_2 \in [3, 15]$ - leptokurtic, closer to Student-$t_5$ than Gaussian.
  • Gradient $\ell_2$ norms (per-layer): approximately Exponential or $\chi$ distributed - always non-negative, right-skewed with $g_1 \approx 0.5$-$2$.
  • Loss values within a training batch: usually right-skewed ($g_1 > 0$) with $g_2 \gg 3$ - a few hard examples cause very high loss.

Implication: gradient norm thresholds for clipping should be set from a high quantile (e.g., $P_{95}$ or $P_{99}$) of the gradient norm distribution, not the mean - because the mean is pulled upward by rare large-norm gradients.
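
A small sketch of that quantile-based calibration (the synthetic norm history and the 99th-percentile choice are illustrative; in a PyTorch loop one would record the value returned by torch.nn.utils.clip_grad_norm_ during a warm-up window):

```python
import numpy as np

def calibrate_clip_threshold(grad_norm_history, percentile=99.0):
    """Choose a gradient-clipping threshold from a high quantile of observed norms,
    rather than from the mean, which rare large-norm gradients drag upward."""
    return float(np.percentile(grad_norm_history, percentile))

# Heavy-tailed synthetic stand-in for per-step global gradient norms:
norms = np.random.default_rng(3).lognormal(mean=0.0, sigma=0.5, size=2_000)
print(f"mean={norms.mean():.2f}  P95={np.percentile(norms, 95):.2f}  "
      f"P99={np.percentile(norms, 99):.2f}")
max_norm = calibrate_clip_threshold(norms)   # then pass as max_norm when clipping
```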


Appendix T: Summary of All Mathematical Results

| Result | Statement | Reference |
|---|---|---|
| Sample mean is unbiased | $\mathbb{E}[\bar{x}] = \mu$ | Section 2.2, App C.1 |
| Sample mean variance | $\text{Var}(\bar{x}) = \sigma^2/n$ | App C.2 |
| Bessel's correction | $\mathbb{E}[s^2] = \sigma^2$ with $n-1$ divisor | Section 2.3 |
| Cauchy-Schwarz bound on $r$ | $\lvert r \rvert \leq 1$ | |
| $\hat{\Sigma}$ is PSD | $\mathbf{v}^\top \hat{\Sigma} \mathbf{v} \geq 0$ | Section 5.1, App C.4 |
| Mahalanobis $\chi^2$ | $d_M^2 \sim \chi^2(d)$ under $\mathcal{N}(\mu, \Sigma)$ | Section 5.3 |
| Glivenko-Cantelli | ECDF converges to CDF uniformly a.s. | Section 2.5 |
| CLT for $\bar{x}$ | $\sqrt{n}(\bar{x}-\mu)/\sigma \to \mathcal{N}(0,1)$ | App F.1 |
| Breakdown point of median | $\varepsilon^* \to 50\%$ as $n \to \infty$ | Section 3.1 |
| Adam bias correction | $\hat{m}_t = m_t/(1-\beta_1^t)$ | Section 8.2, App N.2 |
| Attention variance | $\text{Var}(q \cdot k) = d_k$ for unit-Gaussian inputs | Section 5.4, App M.3 |
| ARE(median, mean), Gaussian | $2/\pi \approx 0.637$ | App F.3 |

Appendix U: Practice Problems Bank

U.1 Quick Computation Problems

  1. Dataset: $\{1, 2, 2, 3, 3, 3, 4, 4, 5\}$. Compute all measures of central tendency and spread.
  2. Prove that for any $c$, $\sum(x_i - c)^2 = (n-1)s^2 + n(\bar{x}-c)^2$. Interpret geometrically.
  3. Show that adding a constant $c$ to all observations changes $\bar{x}$ by $c$ but leaves $s^2$ unchanged.
  4. For samples $x_1, \ldots, x_n$ and $y_i = ax_i + b$, express $\bar{y}$, $s_y^2$, and $r_{xy}$ in terms of $\bar{x}$, $s_x^2$, and $a, b$.
  5. Prove Spearman's $\rho = 1 - 6\sum d_i^2 / (n(n^2-1))$ when there are no ties.

U.2 Conceptual Problems

  1. The median of $\{2, 4, 6, 8, 10\}$ is 6. You add a 6th observation. For what values of the new observation does the median stay at 6? Change?
  2. Two datasets have the same mean (50) and standard deviation (10). Dataset A has $g_2 = 0$; Dataset B has $g_2 = 5$. Without knowing the data, what can you say about which dataset has more extreme values? Which would be safer to model with a Gaussian?
  3. A colleague reports that features $X_1$ and $X_2$ have Pearson correlation $r = 0.98$. They suggest dropping $X_2$ from the model. Is this always the right move? What information do you need?
  4. You standardise your training data and achieve validation accuracy of 87%. A teammate says to also standardise the test data using test-set statistics for "fair comparison". What is wrong with this? What should you do instead?
  5. Explain why the sample variance of activations in a neural network might increase as training progresses, and how Batch Norm addresses this.

U.3 Implementation Challenges

  1. Implement the ECDF from scratch without using any statistical libraries. Verify it converges to the true CDF for $n \in \{10, 100, 1000\}$ samples from $\mathcal{N}(0,1)$.
  2. Implement a bootstrap confidence interval (95%) for the sample median using $B = 1000$ bootstrap resamples. Apply to the income dataset from the loan default case study.
  3. Write a function detect_drift(train, prod, alpha=0.05) that returns a dictionary of descriptive statistics for both datasets and flags features where the PSI exceeds 0.2 or the mean has shifted by more than 2 standard errors.
  4. Reproduce Anscombe's Quartet from scratch and verify all four statistical summaries match within floating-point precision.
  5. Implement online (streaming) computation of mean, variance, and skewness using Welford's online algorithm - a numerically stable method that processes one observation at a time without storing the entire dataset.

Welford's algorithm hint:

$$M_k = M_{k-1} + \frac{x_k - M_{k-1}}{k}, \qquad S_k = S_{k-1} + (x_k - M_{k-1})(x_k - M_k), \qquad s^2_k = \frac{S_k}{k-1} \quad \text{for } k \geq 2$$

This is the numerically stable online update rule; the same incremental-update idea underlies Batch Norm's running mean/variance accumulation (see Appendix V).




Appendix V: Welford's Algorithm - Numerically Stable Online Statistics

Welford's algorithm (1962) computes the sample mean and variance in a single, numerically stable pass, processing one observation at a time. This is essential for streaming data and large datasets.

Algorithm (mean and variance):

Initialize: n = 0, M_0 = 0, S = 0
For each observation x_k:
    n   <- n + 1
    Δ   <- x_k - M_{n-1}
    M_n <- M_{n-1} + Δ/n
    S   <- S + Δ·(x_k - M_n)
Sample variance: s^2 = S/(n-1)

Why numerically stable: The naive formula $s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n-1}$ suffers catastrophic cancellation when $\sum x_i^2 \approx n\bar{x}^2$ (near-constant data). Welford's maintains running deviations from the current mean, avoiding this subtraction.
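
A direct Python rendering of these updates, compared against the naive one-pass formula on data with a large common offset (the $10^8$ offset is an assumption chosen only to provoke cancellation):

```python
import numpy as np

def welford(xs):
    """Single-pass mean and Bessel-corrected variance via Welford's updates."""
    n, mean, S = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        S += delta * (x - mean)
    return mean, S / (n - 1)

def naive_variance(xs):
    """The (sum x^2 - n xbar^2)/(n-1) formula - prone to catastrophic cancellation."""
    xs = np.asarray(xs, dtype=np.float64)
    n = len(xs)
    return (np.sum(xs**2) - n * xs.mean()**2) / (n - 1)

x = 1e8 + np.random.default_rng(4).normal(0.0, 1.0, size=100_000)   # true variance ~ 1
print(welford(x)[1], naive_variance(x), np.var(x, ddof=1))
# Welford's and NumPy's two-pass result agree; the naive formula loses precision.
```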

Connection to Batch Norm running statistics:

# Batch Norm running mean update (PyTorch default momentum=0.1):
running_mean = (1-momentum) * running_mean + momentum * batch_mean

This is an EWMA (exponentially weighted moving average), not Welford's algorithm - but the principle is the same: update the running estimate incrementally without recomputing from scratch.

Extension to skewness (Terriberry, 2007):

Φ_3 <- Φ_3 + Δ²·(x_k - M_n)·(1 - 2/n) + 3·S·(-Δ/n)     (update Φ_3 before updating S)
g_1  = √n · Φ_3 / S^(3/2)

This allows computing skewness in a single pass - useful for monitoring activation skewness during LLM inference.
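
As a sketch, the Pébay/Terriberry-style updates can be packaged into a small streaming accumulator and checked against scipy.stats.skew, which uses the same biased $g_1$ definition (the lognormal test data is arbitrary):

```python
import numpy as np
from scipy import stats

class StreamingMoments:
    """Single-pass mean, variance, and skewness via Pebay/Terriberry updates."""
    def __init__(self):
        self.n, self.mean, self.M2, self.M3 = 0, 0.0, 0.0, 0.0

    def update(self, x):
        n_prev = self.n
        self.n += 1
        delta = x - self.mean
        delta_n = delta / self.n
        term1 = delta * delta_n * n_prev
        self.mean += delta_n
        # Update the third moment using the old M2, then update M2 itself.
        self.M3 += term1 * delta_n * (self.n - 2) - 3.0 * delta_n * self.M2
        self.M2 += term1

    def skewness(self):
        return np.sqrt(self.n) * self.M3 / self.M2 ** 1.5   # biased g_1

acc = StreamingMoments()
data = np.random.default_rng(5).lognormal(size=50_000)
for x in data:
    acc.update(x)
print(acc.skewness(), stats.skew(data))   # the two values should agree closely
```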