Principal Component Analysis (PCA)

"Give me the directions of maximum variance and I will show you the geometry of the data. Give me enough of them and I can reconstruct any measurement ever made."

Overview

Principal Component Analysis is the fundamental linear technique for understanding high-dimensional data. Given a dataset in \mathbb{R}^d, PCA finds a new coordinate system - the principal components - aligned with the directions of greatest variance. Projecting onto the top k components gives the best possible rank-k linear summary of the data, optimal in mean-squared reconstruction error.

PCA sits at the intersection of spectral theory, geometry, and statistics. Its derivation is an optimization problem (maximize projected variance) whose solution is an eigenvalue problem (diagonalize the covariance matrix) whose computation is best handled via the SVD. All three views - optimization, spectral, and computational - are essential for practitioners.

In 2026, PCA remains indispensable. It reveals the intrinsic dimensionality of LLM token representations, drives mechanistic interpretability analyses of transformer residual streams, and motivates the low-rank hypothesis behind LoRA. Understanding PCA deeply means understanding why high-dimensional models often live near low-dimensional manifolds - a fact that shapes training dynamics, generalization, and efficiency.

Prerequisites

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Interactive demos: geometry of PCA, scree plots, whitening, PPCA, kernel PCA, eigenfaces, LLM intrinsic dimensionality |
| exercises.ipynb | 8 graded exercises: from covariance derivation through kernel PCA and LoRA rank analysis |

Learning Objectives

After completing this section, you will:

  • Derive PCA from the variance-maximization principle using Lagrange multipliers
  • Explain the equivalence of the eigendecomposition and SVD routes to PCA
  • Compute variance explained, draw and interpret scree plots, and choose k systematically
  • Implement PCA whitening (PCA and ZCA variants) and understand when each is appropriate
  • State the probabilistic PCA generative model and its MLE solution
  • Implement kernel PCA and explain what the kernel matrix encodes
  • Describe sparse PCA, robust PCA, and randomized PCA and when each is preferred
  • Connect PCA to eigenfaces, LLM intrinsic dimension, mechanistic interpretability, LoRA, and loss landscape analysis
  • Identify the ten most common PCA mistakes and fix them

1. Intuition

1.1 The Big Picture - Variance, Compression, and Geometry

Imagine measuring 1,000 features on each of 10,000 patients. Most of those features are correlated - height and weight move together, blood pressure readings at different times are similar, gene expression patterns cluster. The data occupies a 10,000 × 1,000 matrix, but the effective "information content" may be much lower. PCA finds that smaller set of independent directions.

The key geometric idea: data in \mathbb{R}^d typically doesn't spread equally in all directions. It has a shape - an ellipsoid - with a longest axis, a second-longest axis orthogonal to the first, and so on. PCA finds those axes, in order of length (variance). Projecting onto the first k axes keeps the longest dimensions and discards the short ones, where variation is small and likely noise.

PCA AS AXIS ALIGNMENT
========================================================================

  Original frame          After PCA rotation
  (arbitrary x_1,x_2)      (aligned to data structure)

        x_2                         PC2 (low variance)
        |                            up
      - |-                           | *
    -   | -  -                  *   |  *
   -    |   -           ->     *    |    *
  -     |    -                *    |     *
 -      |     -             *      |      *
--------+------ x_1        -------------------> PC1 (high variance)

  The covariance ellipse    PC1 = major axis = max variance direction
  is tilted in (x_1,x_2)     PC2 = minor axis = min variance direction

========================================================================

What PCA does, step by step:

  1. Center the data (subtract the mean) - PCA measures variance around the centroid.
  2. Find the longest axis of the resulting cloud (PC1). This is the direction \mathbf{v}_1 maximizing \operatorname{Var}(\mathbf{X}\mathbf{v}_1).
  3. Find the next longest axis orthogonal to PC1 (PC2). Repeat for PC3, PC4, ..., PC_d.
  4. Project onto the top k axes to get a k-dimensional representation that preserves as much variance as possible.

For AI: A transformer with d_{\mathrm{model}} = 4096 has token representations in \mathbb{R}^{4096}. PCA of these representations typically shows that 50-200 components explain 90%+ of the variance - the model has learned a much lower-dimensional structure than its nominal dimension suggests. This is the empirical foundation of the intrinsic dimensionality hypothesis.

1.2 Three Equivalent Views

The same PCA solution arises from three completely different starting points. Understanding all three gives deep insight into why PCA is so fundamental.

THREE EQUIVALENT VIEWS OF PCA
========================================================================

  View 1: VARIANCE MAXIMIZATION
  -----------------------------
  Find the unit vector v that maximizes Var(Xv) = v^TCv
  where C = X^TX/(n-1) is the sample covariance matrix.

  Solution: v_1 = top eigenvector of C.
  Intuition: look in the direction where the data spreads most.

  View 2: RECONSTRUCTION MINIMIZATION
  -------------------------------------
  Find the k-dimensional subspace S minimizing mean reconstruction error:
      min_{S, dim(S)=k}  (1/n) \Sigma_i ||x^(i) - proj_S(x^(i))||^2

  Solution: S = span{v_1, ..., v_k} (top k eigenvectors of C).
  Intuition: project onto the flat that keeps points closest to original.

  View 3: COVARIANCE DIAGONALIZATION
  -----------------------------------
  Find orthogonal Q such that Q^TCQ = \Lambda is diagonal.

  Solution: Q = eigenvector matrix of C; \Lambda = eigenvalue matrix.
  Intuition: find a rotation that makes all features independent.

  All three yield the same Q = [v_1 | v_2 | ... | v_d] <- SAME ANSWER.

========================================================================
| View | Objective | Mathematical form | Who uses it |
|---|---|---|---|
| Variance maximization | Find most informative direction | \max_{\lVert\mathbf{v}\rVert=1} \mathbf{v}^\top C \mathbf{v} | Feature extraction, signal processing |
| Reconstruction | Best low-rank approximation | \min_{\operatorname{rank}(M)=k}\lVert X - M\rVert_F^2 | Compression, denoising |
| Decorrelation | Diagonalize the covariance | Q^\top C Q = \Lambda | Statistics, ICA preprocessing |

The equivalence of views 1 and 2 is not obvious - it is proved by the Eckart-Young theorem (see 02-SVD). The equivalence of views 1 and 3 follows directly from the spectral theorem for symmetric matrices (see 01-Eigenvalues).

1.3 Why PCA Matters for AI in 2026

PCA is not merely a classical preprocessing step - it is embedded in the mathematical foundations of modern deep learning:

Intrinsic dimensionality of representations. LLM token embeddings, despite living in \mathbb{R}^{4096} or larger, occupy a low-dimensional effective subspace. Studies of GPT-2 through GPT-4-class models consistently find that PCA explains 90% of embedding variance with far fewer components than the nominal dimension. This matters for compression, distillation, and understanding generalization.

Mechanistic interpretability. Researchers apply PCA to the residual stream of transformer layers to identify "directions" that encode specific features (sentiment, position, subject-verb agreement). Each interpretable direction is a principal component of the activations. The technique, used by Anthropic, DeepMind, and EleutherAI, depends entirely on PCA as a lens.

LoRA and low-rank adaptation. The LoRA paper (Hu et al., 2021) hypothesizes that fine-tuning weight updates have low intrinsic rank - i.e., the update matrix \Delta W is well-approximated by a rank-r matrix with r \ll d. PCA of \Delta W is the theoretical grounding. DoRA, GaLore, and Flora extend this to gradient-space PCA.

Loss landscape analysis. Goodfellow et al. showed that the loss landscape along random directions is complex, but along the top PCA directions of gradient trajectories, it is nearly one-dimensional. Li et al.'s filter normalization visualization uses PCA of parameter trajectories.

Eigenfaces and foundation models. Turk & Pentland (1991) showed that faces lie near a low-dimensional linear subspace - the "face space" spanned by eigenfaces, which are PCA components of face images. Modern vision transformers implicitly learn analogous patch-level PCA components in their first layers.

1.4 Historical Timeline

| Year | Who | Contribution |
|---|---|---|
| 1901 | Karl Pearson | First formulation: lines/planes of closest fit to point systems |
| 1933 | Harold Hotelling | Named "principal components"; Lagrange-multiplier derivation |
| 1936 | Fisher | Discriminant analysis (supervised analog) |
| 1936 | Eckart & Young | Best low-rank approximation theorem |
| 1998 | Scholkopf et al. | Kernel PCA - non-linear extension via RKHS |
| 1999 | Tipping & Bishop | Probabilistic PCA as a latent variable model |
| 2009 | Halko et al. | Randomized SVD for large-scale PCA |
| 2011 | Candes, Li, Ma, Wright | Robust PCA (nuclear norm minimization) |
| 2021 | Hu et al. | LoRA: low-rank adaptation hypothesis connects PCA to fine-tuning |
| 2022-26 | Many authors | PCA of LLM residual streams as mechanistic interpretability tool |

2. Formal Definitions

2.1 Setup: Centered Data and the Covariance Matrix

Data matrix. Let X \in \mathbb{R}^{n \times d} be a dataset with n observations and d features. Row i is \mathbf{x}^{(i)} \in \mathbb{R}^d, a single data point (a column vector stored as a row in X).

Centering. Define the sample mean:

\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}^{(i)}

The centered data matrix is:

\tilde{X} = X - \mathbf{1}\bar{\mathbf{x}}^\top

where \mathbf{1} \in \mathbb{R}^n is the all-ones vector. After centering, \tilde{X} has zero column means. PCA is always applied to centered data - failing to center is the most common mistake.

Sample covariance matrix. The covariance matrix C \in \mathbb{R}^{d \times d} is:

C = \frac{1}{n-1}\tilde{X}^\top \tilde{X}

Its (i,j) entry is the sample covariance between feature i and feature j:

C_{ij} = \frac{1}{n-1}\sum_{k=1}^n (\tilde{X}_{ki})(\tilde{X}_{kj})

Key properties of C:

  • C is symmetric: C^\top = C (since (\tilde{X}^\top\tilde{X})^\top = \tilde{X}^\top\tilde{X})
  • C is positive semidefinite: \mathbf{v}^\top C \mathbf{v} \geq 0 for all \mathbf{v} (since \mathbf{v}^\top C \mathbf{v} = \lVert\tilde{X}\mathbf{v}\rVert_2^2 / (n-1) \geq 0)
  • Diagonal entries C_{ii} = \operatorname{Var}(\text{feature } i) are non-negative
  • Off-diagonal C_{ij} measures linear co-movement between features i and j
  • \operatorname{tr}(C) = \sum_{i=1}^d C_{ii} = total variance = sum of all feature variances

Because C is real symmetric and PSD, the spectral theorem guarantees a full orthonormal eigenbasis with non-negative eigenvalues. This is the structural fact that makes PCA possible.

Recall (Spectral Theorem): Every real symmetric matrix C admits C = Q\Lambda Q^\top where Q is orthogonal and \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_d) with \lambda_1 \geq \cdots \geq \lambda_d \geq 0. Full treatment: 01-Eigenvalues-and-Eigenvectors.
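
The short sketch below (synthetic data; all names and the random construction are ours, not part of the notes) checks the listed properties numerically: symmetry, positive semidefiniteness, and trace = total variance.

import numpy as np

# Minimal check of the covariance-matrix properties listed above (synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated features

X_c = X - X.mean(axis=0)                  # centering
C = X_c.T @ X_c / (X.shape[0] - 1)        # sample covariance

assert np.allclose(C, C.T)                                       # symmetric
assert np.all(np.linalg.eigvalsh(C) >= -1e-10)                   # PSD up to round-off
assert np.isclose(np.trace(C), X_c.var(axis=0, ddof=1).sum())    # trace = total variance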

2.2 Variance Maximization - The Optimization Problem

What we want: a unit vector \mathbf{v} \in \mathbb{R}^d such that the projected data \tilde{X}\mathbf{v} \in \mathbb{R}^n has maximum variance.

Variance of the projection:

\operatorname{Var}(\tilde{X}\mathbf{v}) = \frac{1}{n-1}\lVert \tilde{X}\mathbf{v} \rVert_2^2 = \frac{1}{n-1}(\tilde{X}\mathbf{v})^\top(\tilde{X}\mathbf{v}) = \mathbf{v}^\top \left(\frac{\tilde{X}^\top\tilde{X}}{n-1}\right) \mathbf{v} = \mathbf{v}^\top C \mathbf{v}

The first principal component solves:

\mathbf{v}_1 = \arg\max_{\mathbf{v}} \mathbf{v}^\top C \mathbf{v} \quad \text{subject to} \quad \lVert\mathbf{v}\rVert_2 = 1

Lagrange multiplier solution. Form the Lagrangian:

\mathcal{L}(\mathbf{v}, \lambda) = \mathbf{v}^\top C \mathbf{v} - \lambda(\mathbf{v}^\top\mathbf{v} - 1)

Setting \partial\mathcal{L}/\partial\mathbf{v} = 0:

2C\mathbf{v} - 2\lambda\mathbf{v} = 0 \implies C\mathbf{v} = \lambda\mathbf{v}

This is the eigenvalue equation. The optimal \mathbf{v} is an eigenvector of C, and the achieved variance is the corresponding eigenvalue:

\mathbf{v}^\top C \mathbf{v} = \mathbf{v}^\top (\lambda \mathbf{v}) = \lambda \lVert \mathbf{v}\rVert_2^2 = \lambda

To maximize, we choose the eigenvector corresponding to the largest eigenvalue \lambda_1.

2.3 The Eigenvalue Equation and Principal Components

Definition (Principal Components): The i-th principal component direction \mathbf{v}_i is the i-th eigenvector of the sample covariance matrix C, corresponding to the i-th largest eigenvalue \lambda_i:

C \mathbf{v}_i = \lambda_i \mathbf{v}_i, \qquad \lVert\mathbf{v}_i\rVert_2 = 1, \qquad \langle\mathbf{v}_i, \mathbf{v}_j\rangle = 0 \text{ for } i \neq j

The eigenvalues satisfy \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0.

Sequential derivation. After finding \mathbf{v}_1, the second PC maximizes variance subject to being orthogonal to \mathbf{v}_1:

\mathbf{v}_2 = \arg\max_{\lVert\mathbf{v}\rVert=1,\, \mathbf{v}\perp\mathbf{v}_1} \mathbf{v}^\top C \mathbf{v}

By the spectral theorem, the solution is the second eigenvector. Repeating gives all d PCs. The key: they are exactly the eigenvectors of C, computed all at once by diagonalizing C = Q\Lambda Q^\top.

Principal component matrix. Stack all d eigenvectors as columns:

V = [\mathbf{v}_1 \mid \mathbf{v}_2 \mid \cdots \mid \mathbf{v}_d] \in \mathbb{R}^{d \times d}

V is orthogonal: V^\top V = VV^\top = I_d. Geometrically, V is a rotation matrix that aligns the coordinate axes with the principal directions of variance.

2.4 Projection and Reconstruction

Full projection (change of basis). Project the centered data onto all principal components:

Z = \tilde{X} V \in \mathbb{R}^{n \times d}

Z_{ij} is the coordinate of point i along PC direction j. The columns of Z are uncorrelated and have variances \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d.

Rank-k approximation. Keep only the top k components - define V_k = [\mathbf{v}_1 \mid \cdots \mid \mathbf{v}_k] \in \mathbb{R}^{d \times k}:

Z_k = \tilde{X} V_k \in \mathbb{R}^{n \times k} \quad \text{(scores / embedding)}

Z_k is the low-dimensional representation. Each row \mathbf{z}_k^{(i)} \in \mathbb{R}^k is the PCA embedding of data point i.

Reconstruction. Map back to the original space:

\hat{X} = Z_k V_k^\top + \mathbf{1}\bar{\mathbf{x}}^\top \in \mathbb{R}^{n \times d}

The reconstruction error is:

\lVert \tilde{X} - Z_k V_k^\top \rVert_F^2 = \sum_{i=k+1}^d \sigma_i^2 = (n-1)\sum_{i=k+1}^d \lambda_i

That is, the error equals the sum of the discarded squared singular values - equivalently, (n-1) times the sum of the discarded eigenvalues. Retaining more components reduces the error monotonically. A quick numerical check of this identity follows.
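
A minimal sketch (synthetic data, names ours) verifying the reconstruction-error identity above:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
X_c = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
k = 3
V_k = Vt[:k].T
Z_k = X_c @ V_k                        # scores
X_hat = Z_k @ V_k.T                    # rank-k reconstruction (still centered)

err = np.linalg.norm(X_c - X_hat) ** 2
assert np.isclose(err, (s[k:] ** 2).sum())   # error = sum of discarded sigma_i^2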

2.5 PCA via SVD - The Preferred Route

In practice, computing the covariance C = \tilde{X}^\top\tilde{X}/(n-1) explicitly is numerically inferior. The SVD of \tilde{X} directly gives all PCA quantities with better numerical stability.

SVD of centered data:

\tilde{X} = U\Sigma V^\top, \quad U \in \mathbb{R}^{n \times n},\; \Sigma \in \mathbb{R}^{n \times d},\; V \in \mathbb{R}^{d \times d}

where \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(n,d)} \geq 0 are the singular values.

Recall (SVD): For the full treatment of the SVD, its geometric meaning, and the Eckart-Young theorem, see 02-Singular-Value-Decomposition.

The connection:

C = \frac{\tilde{X}^\top\tilde{X}}{n-1} = \frac{(U\Sigma V^\top)^\top(U\Sigma V^\top)}{n-1} = V \frac{\Sigma^\top\Sigma}{n-1} V^\top = V \operatorname{diag}\!\left(\frac{\sigma_1^2}{n-1}, \ldots\right) V^\top

This is exactly the eigendecomposition of C - the columns of V are the principal components and \lambda_i = \sigma_i^2/(n-1).

Complete PCA quantities from SVD:

| PCA quantity | Formula via SVD |
|---|---|
| Principal component directions | Columns of V (rows of V^\top) |
| Sample variances | \lambda_i = \sigma_i^2 / (n-1) |
| Scores (projected data) | Z = U\Sigma or equivalently \tilde{X}V |
| Rank-k reconstruction | \hat{X} = U_k\Sigma_k V_k^\top + \mathbf{1}\bar{\mathbf{x}}^\top |
| Reconstruction error | \sum_{i>k}\sigma_i^2 |

Why SVD is numerically preferred:

| Issue | Covariance + Eigen | SVD |
|---|---|---|
| Forming \tilde{X}^\top\tilde{X} | Squares the condition number: \kappa(C) = \kappa(\tilde{X})^2 | Avoids squaring - works on \tilde{X} directly |
| Memory | O(d^2) for C | O(nd) for \tilde{X} |
| When n < d | C is rank-deficient, unstable | Thin SVD is efficient |
| Complex eigenvalues | Possible for np.linalg.eig | Never (singular values always real) |

Practical implementation:

import numpy as np

def pca_svd(X, k=None):
    """PCA via SVD - the numerically stable route."""
    n, d = X.shape
    k = k or d

    # Step 1: Center
    mean = X.mean(axis=0)
    X_c = X - mean

    # Step 2: Thin SVD (economy SVD)
    U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
    # U: (n, min(n,d)), s: (min(n,d),), Vt: (min(n,d), d)

    # Step 3: Extract PCA quantities
    V_k = Vt[:k].T            # Principal component directions (d, k)
    Z_k = U[:, :k] * s[:k]   # Scores = projected data (n, k)
    lambdas = s**2 / (n - 1)  # Explained variances

    var_ratio = lambdas[:k] / lambdas.sum()
    cumvar = np.cumsum(var_ratio)

    return {
        'scores': Z_k,           # (n, k) - low-dim representation
        'components': V_k,       # (d, k) - principal directions
        'explained_var': lambdas[:k],
        'explained_var_ratio': var_ratio,
        'cumulative_var': cumvar,
        'mean': mean,
    }

3. Properties of PCA

3.1 Decorrelation: Diagonalizing the Covariance Matrix

The most important algebraic property of PCA is that the projected scores are uncorrelated. If Z = \tilde{X}V is the full projection, then:

\operatorname{Cov}(Z) = \frac{Z^\top Z}{n-1} = \frac{V^\top \tilde{X}^\top \tilde{X} V}{n-1} = V^\top C V = V^\top (V\Lambda V^\top) V = \Lambda

The covariance matrix of the PCA scores is diagonal - equal to \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_d). This means:

  • All pairs of PCA scores are uncorrelated: \operatorname{Cov}(Z_{:,i}, Z_{:,j}) = 0 for i \neq j
  • Each score's variance equals the corresponding eigenvalue: \operatorname{Var}(Z_{:,i}) = \lambda_i
  • Total variance is preserved: \sum_i \lambda_i = \operatorname{tr}(C) = \sum_i C_{ii}

This is precisely view 3 from 1.2 - PCA performs a rotation that diagonalizes the covariance matrix. After PCA, each component carries independent information (in the linear sense; they are not necessarily statistically independent, only uncorrelated - independence requires a distributional assumption like Gaussianity).

Contrast with ICA: Independent Component Analysis (ICA) finds directions of statistical independence, not just decorrelation. PCA decorrelation is a necessary but not sufficient condition for independence.
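
The sketch below (synthetic data, names ours) confirms the diagonalization numerically: the covariance of the full score matrix matches \operatorname{diag}(\lambda_1, \ldots, \lambda_d).

import numpy as np

# Check that PCA scores are uncorrelated and carry variances lambda_i.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))
X_c = X - X.mean(axis=0)

_, s, Vt = np.linalg.svd(X_c, full_matrices=False)
Z = X_c @ Vt.T                                   # full projection onto all PCs
cov_Z = Z.T @ Z / (len(Z) - 1)

lambdas = s**2 / (len(Z) - 1)
assert np.allclose(cov_Z, np.diag(lambdas), atol=1e-6)                 # off-diagonals vanish
assert np.isclose(lambdas.sum(), np.trace(np.cov(X_c, rowvar=False)))  # total variance preserved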

3.2 Optimality - The Eckart-Young Connection

PCA achieves two distinct optimality guarantees, both following from the Eckart-Young-Mirsky theorem:

Theorem (Optimality of PCA, informal). Among all rank-k linear maps \mathbb{R}^d \to \mathbb{R}^k, the PCA projection minimizes mean reconstruction error:

\min_{\text{orthonormal } V_k \in \mathbb{R}^{d\times k}} \lVert \tilde{X} - \tilde{X}V_kV_k^\top \rVert_F^2 = \sum_{i=k+1}^d \sigma_i^2

The minimum is achieved by taking V_k = first k right singular vectors of \tilde{X} = first k eigenvectors of C.

Proof: This is a direct consequence of the Eckart-Young theorem applied to \tilde{X}. See 02-SVD 4.2 for the complete proof.

Two sides of the same coin:

| Optimality | Statement | Error |
|---|---|---|
| Maximum variance | PCA directions maximize total explained variance | \sum_{i\leq k}\lambda_i is maximized |
| Minimum reconstruction | PCA projection minimizes \lVert X - \hat{X}\rVert_F^2 | \sum_{i>k}\lambda_i is minimized |

These are equivalent: maximizing captured variance = minimizing discarded variance = minimizing reconstruction error. The optimal solution is the same in both cases.

Implication for choosing k: Setting a reconstruction error budget \epsilon directly determines k - find the smallest k such that \sum_{i>k}\sigma_i^2 \leq \epsilon \cdot \sum_i \sigma_i^2 (i.e., keep the fraction 1-\epsilon of total variance).

3.3 Sign Ambiguity and Uniqueness

Sign ambiguity. If \mathbf{v}_i is an eigenvector of C with eigenvalue \lambda_i, so is -\mathbf{v}_i. PCA directions are only defined up to sign. This is unavoidable: there is no canonical "positive" direction for an eigenvector.

Consequence: Two PCA implementations may return \mathbf{v}_i and -\mathbf{v}_i for the same data. Always compare directions via |\langle\mathbf{u}, \mathbf{v}\rangle| or \cos^2\theta, not the raw dot product.

Uniqueness of the subspace. While individual directions are sign-ambiguous, the subspace \operatorname{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\} is unique - as long as the eigenvalues are distinct (\lambda_k > \lambda_{k+1}). If there is an eigenvalue tie (\lambda_k = \lambda_{k+1}), any orthonormal basis for the corresponding eigenspace is a valid solution. This is not a flaw but a genuine mathematical degeneracy: all directions in that eigenspace explain equal variance.

When uniqueness fails:

  • Perfectly spherical data: C = \sigma^2 I -> every direction is a PC, the problem is entirely degenerate
  • Repeated eigenvalues: the corresponding eigenspace can be any rotation
  • Near-equal eigenvalues: direction is numerically unstable - small data perturbations change it dramatically

Practical fix for sign ambiguity (sklearn convention): flip \mathbf{v}_i so that its entry with the largest absolute value is positive:

# For each component (row of Vt), find the entry with the largest magnitude
# and flip the whole row so that this entry becomes positive.
max_abs_cols = np.abs(Vt).argmax(axis=1)
signs = np.sign(Vt[np.arange(Vt.shape[0]), max_abs_cols])
Vt = Vt * signs[:, None]

3.4 Scale Sensitivity and Standardization

PCA is not scale-invariant: multiplying a feature by a constant changes its variance and therefore its contribution to the PCA solution.

Example: Suppose feature 1 is height in centimetres (range 150-190) and feature 2 is weight in kilograms (range 50-100). Treating each as roughly uniform over its range, the variance of feature 1 is about 40^2/12 \approx 133 and of feature 2 about 50^2/12 \approx 208. But if we rescale height to metres (divide by 100), its variance drops by a factor of 10^4 to \approx 0.013 - and PC1 now points almost entirely in the weight direction.

Decision guide for standardization:

| Scenario | Recommendation | Reason |
|---|---|---|
| Features on same scale and unit | Center only | Preserves relative magnitudes; large variance is genuinely informative |
| Features on different scales/units | Standardize (z-score) | Equal contribution regardless of unit |
| Interpreting component loadings | Standardize | Loadings become comparable across features |
| Applying Kaiser criterion | Standardize | Criterion requires each feature to have unit variance |
| Image pixels, identical sensors | Center only | Same unit, variance differences are meaningful |

Z-score standardization:

x_{\text{std}} = \frac{x - \hat{\mu}}{\hat{\sigma}}, \qquad \hat{\mu} = \text{sample mean},\; \hat{\sigma} = \text{sample std}

After standardizing, C becomes the sample correlation matrix R_{ij} = C_{ij}/\sqrt{C_{ii}C_{jj}}, and PCA on standardized data is equivalent to PCA on the correlation matrix.

Important: Standardization parameters (\hat{\mu}, \hat{\sigma}) must be computed on training data only, then applied to test data. Never fit these on test data - this constitutes data leakage and inflates performance estimates.
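
One leakage-free way to wire this up is an sklearn Pipeline, so the scaler and the PCA basis are fitted on the training split only. A minimal sketch (the wine dataset and split parameters are illustrative choices, not from the notes):

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),        # mean/std estimated from X_train only
    ('pca', PCA(n_components=0.95)),    # keep 95% of the (standardized) variance
])
Z_train = pipe.fit_transform(X_train)   # fit + transform on training data
Z_test = pipe.transform(X_test)         # reuse the training statistics on test data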


4. Explained Variance and Component Selection

4.1 Variance Ratio and Cumulative Explained Variance

Total variance of the data equals the trace of the covariance matrix, which equals the sum of all eigenvalues:

\text{Total Var} = \operatorname{tr}(C) = \sum_{i=1}^d \lambda_i = \sum_{i=1}^d \frac{\sigma_i^2}{n-1}

Variance ratio for component i:

\rho_i = \frac{\lambda_i}{\sum_{j=1}^d \lambda_j} = \frac{\sigma_i^2}{\sum_{j=1}^d \sigma_j^2}

\rho_i \in [0,1] and \sum_i \rho_i = 1.

Cumulative explained variance for k components:

\text{CEV}(k) = \sum_{i=1}^k \rho_i = \frac{\sum_{i=1}^k \lambda_i}{\sum_{j=1}^d \lambda_j}

\text{CEV}(k) increases monotonically from 0 to 1 as k goes from 0 to d.

def explained_variance(X_centered):
    """Compute variance ratios and cumulative EVR from centered data."""
    _, s, _ = np.linalg.svd(X_centered, full_matrices=False)
    var_ratio = s**2 / (s**2).sum()
    cum_var = np.cumsum(var_ratio)
    return var_ratio, cum_var

4.2 Scree Plots and the Elbow Heuristic

A scree plot (from geology: the rubble at the base of a cliff) shows the eigenvalues \lambda_i plotted against the component index i.

SCREE PLOT ANATOMY
========================================================================

  lambda_i                     Individual eigenvalues
  |
  5.2 - *
  |
  3.1 -   *                 Cumulative EVR
  |         *               100%|              ---------
  1.8 -       *              90%|         ----
  |             *  *  *  *   80%|      ---
  0.2 -             -------   70%|   ---
  +-------------------- i      60%|---
      1  2  3  4  5  6  7      0  1  2  3  4  5  6

  "Elbow" at i=3: eigenvalues      Choose k at the "knee" of
  level off after here             the cumulative curve

========================================================================

Reading the scree plot:

  • A sharp drop followed by a "floor" of small eigenvalues indicates a clear intrinsic dimension
  • The elbow (knee) is the bend where eigenvalues stop decreasing steeply - a heuristic for the number of meaningful components
  • No elbow: data may genuinely be spread across many dimensions, or may need more samples

Limitations of the elbow heuristic: Subjective; different analysts may pick different elbows; fails when the spectrum decays smoothly (e.g., power-law decay). Cross-validation or parallel analysis is more rigorous.
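
A small plotting sketch (assuming matplotlib is available; the helper name scree_plot is ours) that draws both panels of the diagram above from centered data:

import matplotlib.pyplot as plt
import numpy as np

def scree_plot(X_centered, max_k=20):
    """Plot eigenvalues and cumulative explained variance side by side."""
    _, s, _ = np.linalg.svd(X_centered, full_matrices=False)
    lambdas = s**2 / (len(X_centered) - 1)
    ratio = lambdas / lambdas.sum()
    k = min(max_k, len(lambdas))

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
    ax1.plot(range(1, k + 1), lambdas[:k], 'o-')
    ax1.set(xlabel='component i', ylabel='lambda_i', title='Scree plot')
    ax2.plot(range(1, k + 1), np.cumsum(ratio)[:k], 'o-')
    ax2.axhline(0.95, ls='--', c='gray')        # common 95% threshold
    ax2.set(xlabel='k', ylabel='cumulative EVR', title='Cumulative explained variance')
    plt.tight_layout()
    plt.show()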

4.3 Threshold Methods

The most common method is to choose the smallest k such that the cumulative explained variance exceeds a threshold \tau:

k^* = \min\left\{k : \text{CEV}(k) \geq \tau\right\}

Typical thresholds by application:

| Threshold \tau | Use case |
|---|---|
| 90% | Exploratory analysis, rough compression |
| 95% | Standard ML preprocessing |
| 99% | Applications where accuracy matters (genomics, finance) |
| 99.9% | Near-lossless compression |

def components_for_threshold(s, tau=0.95):
    """Return minimum k achieving tau cumulative explained variance."""
    var_ratio = s**2 / (s**2).sum()
    cumvar = np.cumsum(var_ratio)
    k = int(np.searchsorted(cumvar, tau)) + 1
    return min(k, len(s))

sklearn shortcut: PCA(n_components=0.95) automatically chooses kk to explain 95% of variance.

4.4 Kaiser Criterion

For standardized data (each feature has variance 1), the total variance equals d (the number of features), and each eigenvalue \lambda_i measures how many "original features" worth of variance that component explains.

Kaiser criterion: Keep components with \lambda_i > 1 - components that explain more variance than a single original feature.

k_{\text{Kaiser}} = \#\{i : \lambda_i > 1\}

This is the default in many statistics packages. It works specifically because standardization makes the threshold \lambda = 1 meaningful - without standardization, the threshold has no natural interpretation.

When Kaiser fails:

  • Over-extraction (too many components) when d is large relative to n
  • Under-extraction when the first PC has very high eigenvalue and the rest decay slowly
  • Should not be applied to unstandardized data

4.5 Cross-Validation and Downstream-Task Selection

For predictive modelling, the optimal kk is application-specific. Cross-validation is the principled approach:

Reconstruction CV: Hold out observations, compute reconstruction error as a function of k. Select the k minimizing validation reconstruction error. This gives a principled but unsupervised criterion.

Downstream CV: Fit a model (classifier, regressor) on PCA-reduced features. Select the k maximizing cross-validation performance on the downstream task. This is the most principled choice for supervised settings.

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np

def find_k_cv(X, y, k_values, cv=5):
    """Find optimal k via downstream cross-validation."""
    scores = []
    for k in k_values:
        pipe = Pipeline([
            ('pca', PCA(n_components=k)),
            ('clf', LogisticRegression(max_iter=1000))
        ])
        score = cross_val_score(pipe, X, y, cv=cv).mean()
        scores.append(score)
    return k_values[np.argmax(scores)], scores

Parallel analysis (Horn's method): Compare each \lambda_i against the eigenvalues obtained from randomly generated data of the same shape. Keep components where the actual eigenvalue exceeds the random baseline. More statistically rigorous than the Kaiser criterion; less commonly implemented but available in some R packages. A minimal sketch is given below.
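
A possible numpy implementation of Horn's procedure (the function name and the 95th-percentile baseline are our choices; it assumes standardized input so the eigenvalues are comparable to the random baseline):

import numpy as np

def parallel_analysis(X, n_sims=100, quantile=0.95, seed=0):
    """Horn's parallel analysis (sketch): count leading components whose
    eigenvalue exceeds the chosen quantile of eigenvalues from random data
    of the same shape."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = min(n, d)
    X_c = X - X.mean(axis=0)
    obs = np.linalg.svd(X_c, compute_uv=False)**2 / (n - 1)     # observed eigenvalues

    random_eigs = np.empty((n_sims, m))
    for t in range(n_sims):
        R = rng.standard_normal((n, d))
        R -= R.mean(axis=0)
        random_eigs[t] = np.linalg.svd(R, compute_uv=False)**2 / (n - 1)

    threshold = np.quantile(random_eigs, quantile, axis=0)      # per-component baseline
    keep = obs > threshold
    # number of leading components above the baseline
    return int(keep.sum()) if keep.all() else int(np.argmax(~keep))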


5. PCA Whitening

5.1 What Whitening Achieves

Standard PCA projects data onto principal components and produces scores with variances \lambda_1 \geq \cdots \geq \lambda_k - the components are uncorrelated but have different magnitudes. Whitening goes one step further: it rescales each component to unit variance, producing data with covariance I_k.

A dataset Z_{\text{white}} is called white (or spherical) if:

\operatorname{Cov}(Z_{\text{white}}) = I

Geometrically, whitening maps the data ellipsoid to a hypersphere - every direction has equal variance.

PCA whitening transform:

Starting from centered data \tilde{X} with SVD \tilde{X} = U\Sigma V^\top:

Z_{\text{white}} = \tilde{X} V \Lambda^{-1/2} = U\Sigma V^\top V \Lambda^{-1/2} = U \Sigma \Lambda^{-1/2}

where \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_k) and \Lambda^{-1/2} = \operatorname{diag}(1/\sqrt{\lambda_1}, \ldots, 1/\sqrt{\lambda_k}).

Since \lambda_i = \sigma_i^2/(n-1), this simplifies to:

Z_{\text{white}} = \sqrt{n-1}\, U_k

The columns of U from the SVD are already orthonormal, so \operatorname{Cov}(U_k) = I_k/(n-1). After scaling by \sqrt{n-1}, each column has unit variance.

Verification:

\operatorname{Cov}(Z_{\text{white}}) = \frac{Z_{\text{white}}^\top Z_{\text{white}}}{n-1} = \frac{(n-1)U_k^\top U_k}{n-1} = I_k

5.2 PCA Whitening vs ZCA Whitening

PCA whitening rotates to the PC axes first, then scales:

Z_{\text{PCA-white}} = \tilde{X} V_k \Lambda_k^{-1/2}

The result is in a different coordinate system than the original data - the axes are the principal components.

ZCA whitening (Zero-phase Component Analysis) applies a further rotation to bring the data back to the original feature space:

Z_{\text{ZCA}} = \tilde{X} V_k \Lambda_k^{-1/2} V_k^\top = \tilde{X} C^{-1/2}

where C^{-1/2} = V\Lambda^{-1/2}V^\top is the symmetric square root of C^{-1}.

ZCA vs PCA WHITENING
========================================================================

  Original data:              PCA-white:               ZCA-white:
  correlated ellipse          decorrelated sphere       sphere in orig frame
                              (rotated)                 (NOT rotated)

       -- -                        -  -                    - -
      -   -                       -    -                  -   -
     -     -         ->           -      -        ->       -     -
      -   -                       -    -                  -   -
       -- -                        -  -                    - -

   x-- features --->            x--- PCs ----->             x-- features --->

  Best for images: ZCA preserves spatial locality.
  Best for ICA: PCA-white, since ICA works in PC space.

========================================================================
| Property | PCA whitening | ZCA whitening |
|---|---|---|
| Covariance | I_k | I_d (in original space) |
| Axes | PC axes | Original feature axes |
| Spatial locality | Destroyed | Preserved |
| Invertible? | No (if k < d) | Yes |
| Preferred for | ICA, neural nets | Image preprocessing, GANs |
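
A minimal sketch implementing both variants from the SVD formulas above (the helper name whiten, the eps jitter, and the synthetic data are our choices):

import numpy as np

def whiten(X, k=None, kind='pca', eps=1e-8):
    """PCA vs ZCA whitening from the SVD of centered data (sketch).
    kind='pca' returns scores in the PC basis; kind='zca' rotates back to
    the original feature axes (use the full rank, k=None, for ZCA)."""
    n, d = X.shape
    X_c = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
    k = k or len(s)

    lam = s[:k] ** 2 / (n - 1)                    # eigenvalues of C
    W_pca = Vt[:k].T / np.sqrt(lam + eps)         # d x k: V_k Lambda^{-1/2}
    if kind == 'pca':
        return X_c @ W_pca                        # covariance ~ I_k, PC axes
    return X_c @ W_pca @ Vt[:k]                   # ZCA: V_k Lambda^{-1/2} V_k^T

Z = whiten(np.random.default_rng(3).normal(size=(300, 4)) * [1, 5, 0.2, 2.0], kind='pca')
print(np.round(np.cov(Z, rowvar=False), 3))       # approximately the identity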

5.3 Applications of Whitening

ICA preprocessing. Independent Component Analysis assumes whitened inputs. Whitening reduces the ICA problem from d^2 parameters to d(d-1)/2 (rotation angles only), making it tractable.

Neural network training. Whitening input features can dramatically speed up SGD convergence by removing the elongated covariance ellipse that makes gradient descent take zig-zag paths. Batch Normalization is a learned approximation of per-layer whitening.

Image preprocessing. ZCA whitening (also called "global contrast normalization") is used in CNN preprocessing pipelines (CIFAR-10 baseline, etc.) to remove correlations between neighboring pixels.

Reparametrization tricks. To sample from \mathcal{N}(\boldsymbol{\mu}, \Sigma), first sample \mathbf{z} \sim \mathcal{N}(\mathbf{0}, I), then compute \mathbf{x} = L\mathbf{z} + \boldsymbol{\mu} where \Sigma = LL^\top (Cholesky). Whitening is the inverse: given \mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu},\Sigma), compute \mathbf{z} = L^{-1}(\mathbf{x}-\boldsymbol{\mu}) \sim \mathcal{N}(\mathbf{0},I).
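
A tiny numerical illustration of this pair of maps (the specific \boldsymbol{\mu} and \Sigma are arbitrary examples):

import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
L = np.linalg.cholesky(Sigma)                      # Sigma = L L^T

z = rng.standard_normal((5000, 2))
x = z @ L.T + mu                                   # reparametrization: x ~ N(mu, Sigma)
z_back = (x - mu) @ np.linalg.inv(L).T             # whitening: z = L^{-1}(x - mu)
print(np.round(np.cov(z_back, rowvar=False), 2))   # ~ identity covariance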


6. Probabilistic PCA (PPCA)

6.1 The Generative Model

Standard PCA is a geometric algorithm - it finds optimal directions but provides no probabilistic model. Probabilistic PCA (PPCA), introduced by Tipping & Bishop (1999), embeds PCA in a latent variable model that enables likelihood-based model selection, missing data handling, and Bayesian extensions.

The PPCA generative model:

\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I_k)

\mathbf{x} \mid \mathbf{z} \sim \mathcal{N}(W\mathbf{z} + \boldsymbol{\mu},\, \sigma^2 I_d)

where:

  • \mathbf{z} \in \mathbb{R}^k is the latent code (low-dimensional)
  • W \in \mathbb{R}^{d \times k} is the loading matrix (maps latent to observed)
  • \boldsymbol{\mu} \in \mathbb{R}^d is the mean of the observed data
  • \sigma^2 > 0 is the isotropic noise variance

Marginalizing over \mathbf{z}, the observed distribution is also Gaussian:

\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu},\, WW^\top + \sigma^2 I_d)

The key parameters to learn from data are W, \boldsymbol{\mu}, and \sigma^2.

Comparison to standard PCA:

| Aspect | Standard PCA | PPCA |
|---|---|---|
| Type | Geometric/algebraic | Probabilistic/generative |
| Parameters | V_k, mean | W, \boldsymbol{\mu}, \sigma^2 |
| Model selection | Heuristic (k via scree) | Likelihood-based (BIC/AIC) |
| Missing data | Not directly handled | EM handles naturally |
| Uncertainty | None | Posterior over \mathbf{z} |
| Limit \sigma^2 \to 0 | - | Recovers standard PCA |

6.2 MLE Solution and Connection to Standard PCA

The MLE for the PPCA model has a closed-form solution. Let C be the sample covariance matrix, \lambda_i its eigenvalues, and \mathbf{v}_i its eigenvectors. Then:

MLE for \boldsymbol{\mu}: \hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} (sample mean).

MLE for \sigma^2:

\hat{\sigma}^2 = \frac{1}{d-k}\sum_{i=k+1}^d \lambda_i

This is the average variance of the discarded components - the isotropic noise level.

MLE for W:

\hat{W} = V_k (\Lambda_k - \hat{\sigma}^2 I_k)^{1/2} R

where V_k contains the top k eigenvectors, \Lambda_k = \operatorname{diag}(\lambda_1,\ldots,\lambda_k), and R \in \mathbb{R}^{k\times k} is any orthogonal matrix (rotational ambiguity).

Connection to PCA: As \sigma^2 \to 0, \hat{W} \to V_k \operatorname{diag}(\sqrt{\lambda_1},\ldots,\sqrt{\lambda_k}), and the latent posterior concentrates on the standard PCA projection. PPCA is a proper generative model whose zero-noise limit recovers classical PCA.

Log-likelihood at the MLE:

\log p(X \mid \hat{W}, \hat{\boldsymbol{\mu}}, \hat{\sigma}^2) = -\frac{n}{2}\left[d\log(2\pi) + \log|\hat{C}| + d\right]

where \hat{C} = \hat{W}\hat{W}^\top + \hat{\sigma}^2 I. This provides a proper likelihood for model comparison (choosing k via AIC or BIC).
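
A sketch of the closed-form MLE above (the helper name ppca_mle and the choice R = I are ours; it assumes n \geq d so all d eigenvalues of C are available):

import numpy as np

def ppca_mle(X, k):
    """Closed-form PPCA maximum likelihood estimates (Tipping & Bishop), with R = I."""
    n, d = X.shape
    mu = X.mean(axis=0)
    X_c = X - mu
    _, s, Vt = np.linalg.svd(X_c, full_matrices=False)
    lambdas = s**2 / (n - 1)                           # eigenvalues of C

    sigma2 = lambdas[k:].mean()                        # average discarded variance
    W = Vt[:k].T * np.sqrt(np.maximum(lambdas[:k] - sigma2, 0.0))   # V_k (Lambda_k - sigma2 I)^{1/2}
    return W, mu, sigma2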

6.3 Handling Missing Data via EM

PPCA handles missing data naturally through the EM algorithm, unlike standard PCA which requires complete data.

E-step: Given current parameters W, \boldsymbol{\mu}, \sigma^2, compute the posterior over latent codes for each observation (even those with missing features):

\mathbb{E}[\mathbf{z}^{(i)} \mid \mathbf{x}_{\text{obs}}^{(i)}] = M^{-1}W_{\text{obs}}^\top(\mathbf{x}_{\text{obs}}^{(i)} - \boldsymbol{\mu}_{\text{obs}})

where M = W_{\text{obs}}^\top W_{\text{obs}} + \sigma^2 I and the subscript "obs" denotes the non-missing entries.

M-step: Update W, \boldsymbol{\mu}, \sigma^2 using the expected sufficient statistics from the E-step.

The EM converges to the MLE. For the complete-data case, EM recovers the exact closed-form MLE. For missing data, it provides principled imputation - predicting missing entries via the posterior mean.

For AI: Masked autoencoding (MAE, BERT) is conceptually related - learn representations that can reconstruct masked/missing inputs, which is a non-linear deep-learning analog of the PPCA EM algorithm.

6.4 Mixture of PPCA

A single PPCA model assumes the data is approximately Gaussian in \mathbb{R}^d. For multi-modal data, a Mixture of PPCA uses M local PPCA models:

p(\mathbf{x}) = \sum_{m=1}^M \pi_m \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_m, W_m W_m^\top + \sigma_m^2 I)

Each component m finds its own local principal subspace. The full EM algorithm alternates between:

  • E-step: Assign soft responsibilities r_{im} = p(m \mid \mathbf{x}^{(i)}) via Bayes' rule
  • M-step: Update \pi_m, \boldsymbol{\mu}_m, W_m, \sigma_m^2 weighted by responsibilities

This is a smooth, probabilistic generalization of K-means + PCA per cluster. It handles non-convex clusters and provides uncertainty estimates, at the cost of needing to choose both M (number of clusters) and k (number of components per cluster).


7. Kernel PCA

7.1 Motivation - When Linear PCA Fails

Standard PCA finds linear structure in data. If the meaningful variation lies along a non-linear manifold, PCA will fail to capture it - the top PCs will not align with the manifold's intrinsic directions.

WHEN LINEAR PCA FAILS
========================================================================

  Two-moons data:                Swiss roll:

      ***                        (3D spiral surface)
    **   **                      PCA finds the 3D extent.
   **     **       <-PCA->        Kernel PCA (RBF) unfolds the
   **     **      collapses      roll into a 2D plane.
    **   **       the two
      ***         moons!

  PCA sees two overlapping         Kernel PCA extracts
  blobs, not two crescents.        intrinsic 2D structure.

========================================================================

Kernel PCA (Scholkopf, Smola, Muller, 1998) solves this by implicitly mapping data to a high-dimensional feature space \mathcal{H} where the non-linear structure becomes linear, then applying standard PCA there.

7.2 The Kernel Trick and Feature-Space PCA

Feature map. Let \phi: \mathbb{R}^d \to \mathcal{H} be a (possibly infinite-dimensional) feature map. The feature-space covariance is:

C_\phi = \frac{1}{n}\sum_{i=1}^n \phi(\mathbf{x}^{(i)})\phi(\mathbf{x}^{(i)})^\top

We want the eigenvectors of C_\phi, but \mathcal{H} may be infinite-dimensional - we can never compute C_\phi explicitly.

Key insight. Every eigenvector \mathbf{v} of C_\phi lies in the span of \{\phi(\mathbf{x}^{(1)}), \ldots, \phi(\mathbf{x}^{(n)})\}:

\mathbf{v} = \sum_{i=1}^n \alpha_i \phi(\mathbf{x}^{(i)})

Substituting into the eigenvalue equation C_\phi \mathbf{v} = \lambda \mathbf{v} and multiplying both sides by \phi(\mathbf{x}^{(j)})^\top:

K\boldsymbol{\alpha} = n\lambda\boldsymbol{\alpha}

where K_{ij} = \langle\phi(\mathbf{x}^{(i)}), \phi(\mathbf{x}^{(j)})\rangle_\mathcal{H} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) is the kernel matrix (Gram matrix).

The kernel trick: We never need \phi explicitly. We only need the kernel function k(\mathbf{x}, \mathbf{y}) that computes inner products in \mathcal{H}.

Kernel PCA algorithm:

  1. Compute the n \times n kernel matrix K with K_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})
  2. Center the kernel matrix (see 7.3)
  3. Eigendecompose \tilde{K} = \Phi D \Phi^\top (or use the SVD)
  4. Extract the top k eigenvectors \boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(k)}, normalized so that \lVert\boldsymbol{\alpha}^{(i)}\rVert = 1/\sqrt{d_i}
  5. Project a new point \mathbf{x}: compute k_i = k(\mathbf{x}^{(i)}, \mathbf{x}) for all i, then scores = \Phi^\top \mathbf{k}

Computational cost:

  • Building K: O(n^2 d) - expensive for large n
  • Eigendecomposition: O(n^3) - the bottleneck
  • For large n: use the Nystrom approximation or random features
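
A compact RBF kernel PCA sketch following steps 1-5 above (the function name, the median-heuristic bandwidth default, and the numerical jitter are our choices):

import numpy as np

def rbf_kernel_pca(X, k=2, sigma=None):
    """Minimal RBF kernel PCA: build K, center it, eigendecompose, return scores."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)        # pairwise squared distances
    if sigma is None:
        sigma = np.sqrt(np.median(sq[sq > 0]))                  # rough median heuristic
    K = np.exp(-sq / (2 * sigma**2))

    one_n = np.ones((n, n)) / n                                 # centering (see 7.3)
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    eigvals, eigvecs = np.linalg.eigh(K_c)                      # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    alphas = eigvecs[:, :k] / np.sqrt(np.maximum(eigvals[:k], 1e-12))  # ||alpha_i|| = 1/sqrt(d_i)
    return K_c @ alphas                                         # scores of the training points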

7.3 Centering the Kernel Matrix

Standard PCA requires centered data. In feature space, centering means:

\tilde{\phi}(\mathbf{x}^{(i)}) = \phi(\mathbf{x}^{(i)}) - \frac{1}{n}\sum_{j=1}^n \phi(\mathbf{x}^{(j)})

The centered kernel matrix is:

\tilde{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n

where \mathbf{1}_n = \frac{1}{n}\mathbf{1}\mathbf{1}^\top \in \mathbb{R}^{n \times n} is the matrix whose entries are all 1/n.

In code:

def center_kernel(K):
    """Center the n\timesn kernel matrix."""
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

Centering for new data: When projecting a new point \mathbf{x}, the kernel vector \mathbf{k} = [k(\mathbf{x}^{(1)},\mathbf{x}), \ldots, k(\mathbf{x}^{(n)},\mathbf{x})]^\top must also be centered:

\tilde{\mathbf{k}} = \mathbf{k} - \frac{1}{n}K\mathbf{1} - \frac{1}{n}(\mathbf{1}^\top\mathbf{k})\,\mathbf{1} + \frac{1}{n^2}(\mathbf{1}^\top K \mathbf{1})\,\mathbf{1}

Failing to center new-point kernels is a common implementation error.

7.4 Common Kernels and Their Geometric Effects

| Kernel | Formula k(\mathbf{x}, \mathbf{y}) | Feature space | Captures |
|---|---|---|---|
| Linear | \mathbf{x}^\top\mathbf{y} | \mathbb{R}^d (identity map) | Standard PCA |
| Polynomial | (\mathbf{x}^\top\mathbf{y} + c)^p | Monomials up to degree p | Polynomial interactions |
| RBF / Gaussian | \exp(-\lVert\mathbf{x}-\mathbf{y}\rVert^2 / 2\sigma^2) | Infinite-dimensional | Smooth, local structure |
| Sigmoid | \tanh(\kappa\mathbf{x}^\top\mathbf{y} + \theta) | Neural-net-like | (Not always PD) |
| Laplacian | \exp(-\lVert\mathbf{x}-\mathbf{y}\rVert_1/\sigma) | Infinite-dimensional | Robust to outliers |

RBF kernel bandwidth \sigma:

  • Small \sigma: each point forms its own neighborhood - overfitting
  • Large \sigma: the kernel approaches a constant - loses discriminative power
  • Rule of thumb: set \sigma to the median pairwise distance (the "median heuristic")

For AI: The NTK (Neural Tangent Kernel) is the kernel induced by an infinitely wide neural network. Kernel PCA with the NTK is equivalent to PCA on neural network features at initialization - a theoretical tool for analyzing early training dynamics.

7.5 Limitations and Out-of-Sample Extension

Limitations of kernel PCA:

  1. No out-of-sample formula. Unlike standard PCA (where a new \mathbf{x} maps to V_k^\top(\mathbf{x}-\bar{\mathbf{x}})), kernel PCA requires the kernel vector against all training points - inference cost is O(n) per new point.

  2. Scalability. O(n^2) memory and O(n^3) compute. For n > 10{,}000, approximations are needed.

  3. Hyperparameter sensitivity. Kernel choice and bandwidth significantly affect results; requires cross-validation.

  4. No explicit feature map. Cannot interpret the kernel PCA components as directions in the original feature space.

Out-of-sample extension (Bengio et al., 2004). For a new point \mathbf{x}:

z_j(\mathbf{x}) = \frac{1}{d_j}\sum_{i=1}^n \alpha_j^{(i)} \tilde{k}(\mathbf{x}^{(i)}, \mathbf{x})

where d_j is the j-th eigenvalue and \tilde{k} denotes the centered kernel. This gives O(n) inference.

Nystrom approximation. Approximate K using a random subset of m \ll n landmark points:

K \approx K_{nm} K_{mm}^{-1} K_{mn}

This reduces the cost to O(nm^2) while approximately preserving the top eigenspectrum.


8. PCA Variants

8.1 Sparse PCA - Interpretable Loadings

Standard PCA components are generally dense - each PC is a linear combination of all original features. This makes interpretation difficult: "PC1 is 0.12*feature1 + 0.31*feature2 - 0.07*feature3 + ..." is hard to understand.

Sparse PCA constrains components to use only a few original features, trading reconstruction optimality for interpretability.

Formulation (LASSO-regularized):

\min_{V, Z} \lVert \tilde{X} - ZV^\top \rVert_F^2 + \lambda\lVert V \rVert_1 \quad \text{subject to} \quad Z^\top Z = I

The \ell_1 penalty on V forces most loadings to zero. The result: each PC depends on only a handful of original features - interpretable as a "factor" in the original variable space.

Algorithm (Zou, Hastie, Tibshirani, 2006 - Elastic Net formulation):

  • Fix Z: solve a Lasso problem for each column of V
  • Fix V: solve for Z via the SVD of \tilde{X}V
  • Iterate until convergence
from sklearn.decomposition import SparsePCA

spca = SparsePCA(n_components=5, alpha=0.1)  # alpha = L1 regularization
Z_sparse = spca.fit_transform(X)
# spca.components_ has many zeros - interpretable loadings

Applications: Gene expression analysis (each PC = a small gene module), finance (each PC = a sector factor), neuroscience (each PC = a brain region pattern).

8.2 Robust PCA - Low-Rank + Sparse Decomposition

Standard PCA is sensitive to outliers: a single corrupted data point can completely distort the principal components (because PCA minimizes squared Frobenius norm, and outliers contribute quadratically).

Robust PCA (Candes, Li, Ma, Wright, 2011) decomposes a matrix M into:

M = L + S

where L is low-rank (the "clean" signal) and S is sparse (the outliers or corruptions).

Formulation (Principal Component Pursuit):

\min_{L, S} \lVert L \rVert_* + \mu \lVert S \rVert_1 \quad \text{subject to} \quad L + S = M

where \lVert L\rVert_* is the nuclear norm (the sum of singular values - a convex surrogate for rank) and \lVert S\rVert_1 is the element-wise \ell_1 norm (a convex surrogate for sparsity).

Exact recovery theorem: Under mild conditions (incoherence conditions on L and a random support for S), the optimization exactly recovers (L, S) even when the sparse component has a constant fraction of corrupted entries.

Applications:

  • Video surveillance: M = surveillance video frames; L = static background (low-rank); S = moving objects (sparse). Robust PCA cleanly separates background from foreground.
  • Collaborative filtering: Robust PCA on rating matrices separates low-rank preference patterns from sparse anomalous ratings.
  • Medical imaging: Separate structured anatomical patterns (low-rank) from focal lesions (sparse).
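
A simple ADMM-style solver for the Principal Component Pursuit problem above, alternating singular value thresholding and element-wise shrinkage (a sketch; the function names, the step size mu, and the stopping rule are our choices):

import numpy as np

def shrink(X, tau):
    """Element-wise soft-thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: soft-threshold the spectrum of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def robust_pca(M, max_iter=500, tol=1e-7):
    """Principal Component Pursuit via a basic ADMM-style iteration.
    lam = 1/sqrt(max(n, d)) is the standard weighting from Candes et al. (2011)."""
    n, d = M.shape
    lam = 1.0 / np.sqrt(max(n, d))
    mu = n * d / (4.0 * np.abs(M).sum() + 1e-12)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    normM = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)            # low-rank update
        S = shrink(M - L + Y / mu, lam / mu)         # sparse update
        Y = Y + mu * (M - L - S)                     # dual update
        if np.linalg.norm(M - L - S) <= tol * normM:
            break
    return L, S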

8.3 Incremental and Randomized PCA

Incremental PCA (also called online PCA) processes data in mini-batches, updating the PCA solution without storing the full data matrix. Essential when n is too large for memory.

from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=50, batch_size=500)
for batch in data_generator():     # stream data in batches
    ipca.partial_fit(batch)

Z = ipca.transform(X_new)          # apply to new data

The algorithm maintains running estimates of the covariance matrix eigenvectors, using Gram-Schmidt to enforce orthogonality after each update.

Randomized SVD (Halko, Martinsson, Tropp, 2011) computes an approximate truncated SVD much faster than the full SVD when k \ll \min(n,d):

  1. Generate a random Gaussian matrix \Omega \in \mathbb{R}^{d \times (k+p)} (oversampling with p \approx 10)
  2. Compute Y = \tilde{X}\Omega \in \mathbb{R}^{n \times (k+p)} - projects the data to a random subspace
  3. QR-decompose Y = QR to get an orthonormal basis Q
  4. Compute B = Q^\top \tilde{X} \in \mathbb{R}^{(k+p) \times d} - a small matrix
  5. SVD of B = \hat{U}\Sigma V^\top; then U = Q\hat{U}

Cost: O(ndk) instead of O(nd\min(n,d)) - a speedup by a factor of roughly \min(n,d)/k.

from sklearn.utils.extmath import randomized_svd

U, s, Vt = randomized_svd(X_centered, n_components=50,
                           n_iter=5, random_state=42)

For AI: sklearn's PCA(svd_solver='randomized') uses randomized SVD, and the default 'auto' solver switches to it for large inputs. This is how PCA on ImageNet-scale datasets (n > 10^6) becomes tractable.


9. Applications in Machine Learning and AI

9.1 Eigenfaces and Image Compression

The classic application of PCA to images (Turk & Pentland, 1991) demonstrates the power of low-rank approximation in practice.

Setup: n face images of size h \times w pixels, each flattened to a vector in \mathbb{R}^{hw}. Form the data matrix X \in \mathbb{R}^{n \times hw}.

Eigenfaces: The principal components \mathbf{v}_1, \ldots, \mathbf{v}_k can be reshaped back to h \times w images - these are the "eigenfaces," the most important face patterns in the dataset.

def compute_eigenfaces(face_images, k=50):
    """
    face_images: (n, h, w) array
    Returns: eigenfaces (k, h, w), mean_face (h, w), scores (n, k)
    """
    n, h, w = face_images.shape
    X = face_images.reshape(n, -1).astype(float)

    mean_face = X.mean(axis=0)
    X_c = X - mean_face

    U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
    eigenfaces = Vt[:k].reshape(k, h, w)
    scores = U[:, :k] * s[:k]    # (n, k) - PCA coordinates

    return eigenfaces, mean_face.reshape(h, w), scores

Compression ratio: Storing k eigenfaces (k \cdot hw values) and n score vectors (n \cdot k values) vs. the original n \cdot hw values. For k = 50, n = 1000, hw = 4096: compression from 4M to 255K values - a 16x reduction.

For AI: ViT (Vision Transformers) learn patch embeddings that, in early layers, closely resemble the top PCA components of image patches. The first self-attention heads often perform a form of learned PCA on the patch token space.

9.2 Intrinsic Dimensionality of LLM Representations

A fundamental empirical question in LLM research: "how many dimensions does a language model actually use?"

Methodology:

  1. Pass n text samples through an LLM layer, collect activations X \in \mathbb{R}^{n \times d_{\mathrm{model}}}
  2. Compute PCA and examine the cumulative explained variance
  3. Define the intrinsic dimension k^* as the smallest k with CEV \geq 90\% (or use the participation ratio)

Empirical findings (across multiple studies):

  • GPT-2 (d=768): ~50-100 components explain 90% of variance at middle layers
  • LLaMA-2-7B (d=4096): ~200-400 components at most layers
  • The effective dimension grows with layer depth and shrinks in final layers
  • Instruction-tuned models tend to have lower intrinsic dimension than base models

Participation ratio - a continuous intrinsic dimensionality measure:

d_{\text{PR}} = \frac{(\sum_i \lambda_i)^2}{\sum_i \lambda_i^2}

This equals d when all eigenvalues are equal (maximum dimensionality) and 1 when a single eigenvalue dominates.

def participation_ratio(X_centered):
    """Continuous measure of effective dimensionality."""
    _, s, _ = np.linalg.svd(X_centered, full_matrices=False)
    lambdas = s**2
    return lambdas.sum()**2 / (lambdas**2).sum()

9.3 Mechanistic Interpretability: PCA of the Residual Stream

Mechanistic interpretability researchers use PCA to understand what information is stored and processed at each transformer layer.

The technique:

  1. Collect residual stream activations X^{[l]} \in \mathbb{R}^{n \times d} at layer l for a dataset of prompts
  2. Apply PCA: X^{[l]} = U^{[l]}\Sigma^{[l]}(V^{[l]})^\top
  3. Examine the top principal components \mathbf{v}^{[l]}_1, \mathbf{v}^{[l]}_2, \ldots by probing: which token properties predict high/low scores on each PC?

Examples from published work:

  • PC1 of early layers often tracks token position (Fourier modes of position)
  • PCs in middle layers encode syntactic structure (subject, verb, object positions)
  • PCs in late layers encode semantic content (entity type, sentiment, factual attributes)
  • The "residual stream basis" changes sharply at layers where attention heads write task-relevant information

Logit lens is a related technique: project residual stream activations at each layer back to vocabulary space using the unembedding matrix, then PCA on these logit distributions tracks information flow from token features to next-token predictions.

9.4 LoRA and the Low-Rank Hypothesis

LoRA (Low-Rank Adaptation, Hu et al., 2021) is the dominant parameter-efficient fine-tuning method for LLMs. Its mathematical grounding is directly the low-rank PCA hypothesis.

The hypothesis: During fine-tuning, the weight update matrix \Delta W = W_{\text{fine-tuned}} - W_{\text{pretrained}} has low intrinsic rank. That is, PCA of \Delta W reveals that most of its "information" lies in a rank-r subspace with r \ll \min(d_{\mathrm{in}}, d_{\mathrm{out}}).

LoRA parametrization:

\Delta W = BA, \quad B \in \mathbb{R}^{d_{\mathrm{out}} \times r},\; A \in \mathbb{R}^{r \times d_{\mathrm{in}}}

The modified forward pass: \mathbf{h} = W\mathbf{x} + \Delta W\mathbf{x} = W\mathbf{x} + BA\mathbf{x}.

Parameter savings: The full \Delta W has d_{\mathrm{out}} \cdot d_{\mathrm{in}} parameters; LoRA uses r(d_{\mathrm{out}} + d_{\mathrm{in}}) - a fraction r(d_{\mathrm{out}} + d_{\mathrm{in}})/(d_{\mathrm{out}} d_{\mathrm{in}}) of the full update, roughly 2r/d for square matrices (often r = 4 to 64 vs d = 4096).

Empirical validation: Aghajanyan et al. (2020) measured the intrinsic dimensionality of fine-tuning loss landscapes and found that most models can be fine-tuned with r < 200 free parameters in a random low-dimensional subspace - validating the hypothesis empirically.
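
A shapes-only numpy sketch of the LoRA parametrization above (dimensions, initialization scale, and variable names are illustrative; the zero initialization of B so that \Delta W = 0 at the start follows the LoRA paper):

import numpy as np

d_in, d_out, r = 4096, 4096, 8
rng = np.random.default_rng(5)

W = rng.normal(size=(d_out, d_in))              # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01           # trainable, small random init
B = np.zeros((d_out, r))                        # trainable, zero init -> DeltaW = 0 at start

x = rng.normal(size=(d_in,))
h = W @ x + B @ (A @ x)                         # never materializes the d_out x d_in DeltaW

full_params = d_out * d_in
lora_params = r * (d_out + d_in)
print(f"LoRA trains {lora_params / full_params:.2%} of the full-update parameters")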

Extensions:

  • DoRA (Weight Decomposed LoRA): decomposes weight update into magnitude and direction components
  • GaLore (Gradient Low-Rank Projection): computes running PCA of gradient matrices, projects updates to the dominant gradient subspace
  • Flora: applies random projections (randomized PCA) to gradients to reduce optimizer memory

9.5 Loss Landscape Geometry

PCA of gradient trajectories and weight perturbations reveals the geometry of the loss landscape.

Li et al. (2018) - Filter normalization visualization:

  1. Train a model, record the final weights \boldsymbol{\theta}^*
  2. Sample n perturbation directions; collect weights along the training trajectory \boldsymbol{\theta}_t
  3. Apply PCA to the trajectory matrix: find the 2 principal directions \mathbf{d}_1, \mathbf{d}_2
  4. Plot the loss surface in the 2D PCA plane: \mathcal{L}(\boldsymbol{\theta}^* + \alpha\mathbf{d}_1 + \beta\mathbf{d}_2)

This "loss landscape visualization" shows that well-trained models live in wide, flat valleys; poorly trained ones (skip connections removed) live in sharp minima with many barriers.

Dominant gradient subspace. Throughout training, the gradient \nabla\mathcal{L}(\boldsymbol{\theta}) lies almost entirely in a k-dimensional subspace with k \ll d. This subspace is also remarkably stable: the top-k PCA directions of gradient matrices computed at different training steps overlap by more than 90\%. This has implications for (a) why SGD finds flat minima, (b) why momentum methods work, and (c) why random projections preserve gradient structure.
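A minimal way to quantify that stability (a sketch on synthetic gradient matrices whose shared low-rank structure is planted by construction): compare the top-k right-singular subspaces of gradients from two different "steps" via the mean squared cosine of their principal angles.

import numpy as np

def subspace_overlap(G1, G2, k):
    """Mean squared cosine of the principal angles between the top-k
    right singular subspaces (1.0 = identical, ~k/d for random subspaces)."""
    V1 = np.linalg.svd(G1, full_matrices=False)[2][:k].T    # (d, k)
    V2 = np.linalg.svd(G2, full_matrices=False)[2][:k].T
    return np.linalg.norm(V1.T @ V2, 'fro')**2 / k

# Toy gradients that share a dominant k-dimensional subspace plus noise.
rng = np.random.default_rng(0)
d, n, k = 200, 50, 10
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]            # shared directions
G1 = rng.normal(size=(n, k)) @ basis.T * 5 + rng.normal(size=(n, d))
G2 = rng.normal(size=(n, k)) @ basis.T * 5 + rng.normal(size=(n, d))
print(subspace_overlap(G1, G2, k))                          # close to 1; ~k/d if unrelated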

9.6 PCA vs t-SNE vs UMAP for Visualization

When reducing to 2D/3D for visualization, practitioners must choose between linear and non-linear methods.

| Method | Type | Preserves | Time | Deterministic? | Best for |
|---|---|---|---|---|---|
| PCA | Linear | Global variance, large-scale structure | O(nd^2) or faster | Yes | Exploring bulk structure, preprocessing |
| t-SNE | Non-linear | Local neighborhoods (distances within clusters) | O(n^2) or O(n \log n) | No (random init) | Revealing cluster structure in 2D |
| UMAP | Non-linear | Local + some global topology | approx. O(n^{1.14}) | No (random init) | Faster t-SNE alternative; better global structure |

Key differences:

  • PCA: distances in 2D PCA plot are meaningful; axes have units (variance explained)
  • t-SNE/UMAP: cluster shapes and inter-cluster distances are NOT preserved - only within-cluster topology
  • PCA is reproducible; t-SNE/UMAP have random initialization (seed for reproducibility)
  • For LLM embedding visualization: PCA first (to ~50 dims), then t-SNE/UMAP on PCA scores is standard practice (PCA removes noise, t-SNE reveals cluster fine structure)

Common mistake: Interpreting t-SNE distances between clusters as meaningful. They are not - t-SNE cluster sizes and inter-cluster distances are artifacts of the perplexity parameter and optimization.


10. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Not centering before PCA | The covariance is computed around the mean; without centering, PC1 points toward the mean (not the direction of variance), giving spurious components | Always subtract X.mean(axis=0) before the SVD |
| 2 | Not standardizing when features have different units | A feature in metres dominates one in millimetres even if they carry equal information; PCA finds a direction dominated by the large-scale feature | Standardize with (X - mean) / std when units differ |
| 3 | Fitting mean/std on test data | Data leakage: test statistics partially determine the PCA basis, inflating generalization metrics | Fit mean, std, and V_k on training data; apply to test |
| 4 | Using np.linalg.eig on the covariance matrix | eig may return complex eigenvalues due to floating-point asymmetry; sorting eigenvalues requires extra code; the condition number is squared | Use np.linalg.svd on the centered data directly |
| 5 | Interpreting PCA scores as probabilities or distances without thought | PCA scores have different scales (\lambda_1 \gg \lambda_k); Euclidean distance in score space is not isotropic | Use whitened scores for distance-based methods (k-NN, k-means) |
| 6 | Choosing k once and never revisiting | The right k depends on the downstream task; the variance-threshold k may be too large or too small for a specific classifier | Cross-validate k on the downstream task |
| 7 | Applying the test-time kernel without centering | Kernel PCA centering involves the training kernel matrix; new-point kernel vectors must be centered using training statistics | Center the test kernel vector as \tilde{\mathbf{k}} = \mathbf{k} - \frac{1}{n}K\mathbf{1} - \frac{1}{n}(\mathbf{1}^\top\mathbf{k})\mathbf{1} + \frac{1}{n^2}(\mathbf{1}^\top K\mathbf{1})\mathbf{1} |
| 8 | Interpreting PC directions as independent (not just uncorrelated) | PCA guarantees zero correlation of the scores; it does NOT guarantee statistical independence (non-Gaussian data can have dependent PC scores) | For independence, use ICA after whitening |
| 9 | Using PCA for class separation before LDA | PCA discards directions with low variance, which may be the most discriminative; applying PCA before LDA can hurt classification | Use LDA directly, or use PCA only as mild preprocessing (e.g., to make the within-class scatter invertible), remembering LDA yields at most C-1 directions |
| 10 | Comparing PCA loadings across datasets without alignment | PCA directions are sign-ambiguous and may be permuted across datasets or runs | Align directions via Procrustes rotation or compare absolute values |
| 11 | Using PCA on count data without transformation | PCA assumes continuous, approximately Gaussian data; count data (word counts, genomic reads) violates this assumption | Apply log(1+x) or a variance-stabilizing transformation first |
| 12 | Reporting variance explained on the training set only | In-sample variance explained is optimistic; held-out variance explained gives a fair estimate of generalization | Report reconstruction error on held-out data |

11. Exercises

**Exercise 1 (★)** - Derivation from first principles

Consider centered data X~R3×3\tilde{X} \in \mathbb{R}^{3 \times 3} given below. Without using np.linalg.svd or np.linalg.eigh directly on CC:

(a) Compute the sample covariance matrix C=X~X~/(n1)C = \tilde{X}^\top\tilde{X}/(n-1) by hand.

(b) Find the characteristic polynomial det(CλI)\det(C - \lambda I) and solve for the eigenvalues.

(c) For each eigenvalue, find the corresponding eigenvector by solving (CλI)v=0(C - \lambda I)\mathbf{v} = \mathbf{0}.

(d) Verify that your eigenvectors are orthogonal and normalize them.

(e) Project the data onto the first 2 PCs and verify that the reconstruction error equals λ3\lambda_3.

**Exercise 2 (★)** - PCA via SVD and numerical equivalence

(a) Implement PCA using the covariance eigendecomposition (np.linalg.eigh).

(b) Implement PCA using np.linalg.svd on the centered data matrix.

(c) Compare the two implementations on a 100×20100 \times 20 random matrix. Do the loadings match? The explained variances? The scores?

(d) Construct a numerically ill-conditioned matrix (e.g., two nearly identical columns) and show that the covariance approach returns spurious imaginary components when using np.linalg.eig (not eigh), while the SVD approach is stable.

(e) Explain why eigh (symmetric eigenvalue solver) is better than eig for covariance matrices, and why svd on X~\tilde{X} is better than both.

**Exercise 3 (★)** - Explained variance and component selection

(a) Generate a correlated multivariate Gaussian dataset in R20\mathbb{R}^{20} with a covariance matrix that has a spectrum decaying as λi1/i\lambda_i \propto 1/i.

(b) Plot the scree plot (individual eigenvalues) and cumulative explained variance.

(c) Find the number of components needed for 80%, 90%, 95%, and 99% explained variance.

(d) Apply the Kaiser criterion and compare with the threshold methods.

(e) Fit a linear regression on the raw features vs. PCA scores with 90% variance retained. Which gives better cross-validated R2R^2? Explain.

**Exercise 4 (★★)** - Whitening and its effects

(a) Generate a 2D dataset with covariance \Sigma = \begin{pmatrix}9 & 6 \\ 6 & 5\end{pmatrix}. Verify the sample covariance numerically.

(b) Apply PCA whitening and verify that the whitened covariance is I2I_2.

(c) Apply ZCA whitening and verify the covariance is I2I_2 AND that the result is in the original feature axes (not rotated).

(d) Plot both whitened versions alongside the original. Describe geometrically what each transformation does.

(e) Time the convergence of gradient descent on a simple quadratic f(x)=12xΣxf(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\top\Sigma\mathbf{x} starting from (1,1)(1,1) - first with raw data, then with whitened inputs. Report the number of steps to reach f<104f < 10^{-4}.

**Exercise 5 (★★)** - Probabilistic PCA

(a) Implement PPCA via the closed-form MLE. Inputs: centered data X~\tilde{X}, number of components kk. Outputs: W^\hat{W}, σ^2\hat{\sigma}^2, log-likelihood.

(b) Generate data from a PPCA model with k=3k=3, d=20d=20, n=500n=500, σ2=0.1\sigma^2 = 0.1, and random WW. Fit PPCA with k=2,3,4,5k = 2, 3, 4, 5 and plot log-likelihood vs kk.

(c) Does the true k=3k=3 achieve the highest likelihood? Why or why not? (Hint: consider the bias-variance tradeoff in likelihood estimation.)

(d) Implement the MLE formula for σ^2\hat{\sigma}^2 and verify it equals the mean discarded eigenvalue.

(e) Compare the PPCA reconstruction (posterior mean x^=W^z^+μ^\hat{\mathbf{x}} = \hat{W}\hat{\mathbf{z}} + \hat{\boldsymbol{\mu}}) with standard PCA reconstruction on 20% of randomly masked features.

**Exercise 6 (★★)** - Kernel PCA

(a) Generate the two-moons dataset (sklearn.datasets.make_moons). Show that linear PCA fails to separate the classes in 2D.

(b) Implement kernel PCA with an RBF kernel from scratch: compute KK, center it, eigendecompose, extract top 2 components.

(c) Plot the kernel PCA embedding for σ{0.1,0.5,1.0,2.0,5.0}\sigma \in \{0.1, 0.5, 1.0, 2.0, 5.0\}. Which value separates the moons?

(d) Implement the out-of-sample projection formula for new test points. Verify that it matches projecting the test points alongside the training set.

(e) Compare kernel PCA with t-SNE and UMAP (from sklearn and umap-learn) on the same two-moons data. Discuss the qualitative and quantitative differences.

**Exercise 7 (★★★)** - LoRA and the low-rank hypothesis

(a) Generate a "weight matrix" WR100×80W \in \mathbb{R}^{100 \times 80} with a known low-rank structure: W=W0+UVW = W_0 + UV^\top where UR100×5U \in \mathbb{R}^{100 \times 5}, VR80×5V \in \mathbb{R}^{80 \times 5} are random, W0W_0 is small random noise.

(b) Compute the SVD of WW and plot the singular value spectrum. How many singular values are significantly above the noise floor?

(c) Implement LoRA: parametrize ΔW=BA\Delta W = BA with BR100×rB \in \mathbb{R}^{100 \times r}, ARr×80A \in \mathbb{R}^{r \times 80}. For r{2,5,10,20}r \in \{2, 5, 10, 20\}, minimize WW0BAF2\lVert W - W_0 - BA\rVert_F^2 using gradient descent.

(d) Compare the LoRA reconstruction error with the optimal rank-rr approximation error from SVD (Eckart-Young). How close does LoRA get?

(e) Explain why r=5r = 5 should achieve near-zero error here, and compute the compression ratio for each rr vs. storing ΔW\Delta W directly.

**Exercise 8 (★★★)** - Intrinsic dimensionality of representations

(a) Generate synthetic "LLM activations": sample n = 500 points in \mathbb{R}^{200} from a Gaussian with covariance \Sigma = Q\Lambda Q^\top, where Q is a random orthogonal matrix and \Lambda has eigenvalues \lambda_i = e^{-i/20}.

(b) Compute the participation ratio and compare with the kk for 90% EVR. Are they consistent?

(c) Simulate "layer depth" by applying successive random rotations + low-rank perturbations. Plot the intrinsic dimensionality at each "layer."

(d) Implement robust PCA (via sklearn or manual nuclear norm minimization) on a corrupted version of the activations (add sparse outliers to 10% of entries). Compare the recovered low-rank component with standard PCA.

(e) Write a function analyze_representations(X) that returns: participation ratio, 90%-EVR dimension, top-3 eigenvalue ratio, and spectrum decay rate (fit λiiα\lambda_i \propto i^{-\alpha} via linear regression on log-log scale). Apply to your synthetic "layers" and report how each statistic evolves.


12. Why This Matters for AI (2026 Perspective)

| Concept | AI/LLM Application | Concrete Example |
|---|---|---|
| PCA variance maximization | Understanding which directions in weight space matter | WeightWatcher: heavy-tailed singular value spectra predict model quality |
| Eckart-Young theorem | Justifies low-rank approximation everywhere | LoRA: \Delta W = BA, rank r = 4--64 recovers most fine-tuning capability |
| Intrinsic dimensionality | Characterizes "how much information" a model layer uses | GPT-4 class models: 90% EVR at ~200 dims in \mathbb{R}^{8192} |
| PCA of residual stream | Mechanistic interpretability | Anthropic's "linear representation hypothesis": features are directions in the residual stream |
| PPCA / missing data | Masked language modelling as probabilistic inference | BERT's MLM objective \approx EM for missing tokens in a PPCA-like model |
| Kernel PCA | Attention as implicit feature-space PCA | NTK theory: infinite-width networks perform kernel regression |
| Sparse PCA | Interpretable features in transformers | Superposition hypothesis: polysemantic neurons are superposed sparse features |
| Randomized SVD | Efficient computation on billion-parameter models | GaLore projects gradients to the dominant right-singular-vector subspace |
| Whitening / decorrelation | Normalization layers | Batch Norm \approx approximate feature whitening per layer |
| PCA of attention heads | Understanding head specialization | PCA of QK^\top matrices reveals head function (positional, syntactic, semantic) |
| Loss landscape PCA | Visualization of optimization dynamics | Filter-normalized loss landscape shows flat minima for residual networks |
| Robust PCA | Denoising corrupted activations | Separating adversarial perturbations (sparse) from clean representations (low-rank) |

13. Conceptual Bridge

Where we came from. This section builds directly on two prior foundations. From 01-Eigenvalues-and-Eigenvectors, we take the spectral theorem for symmetric matrices: every real symmetric positive semidefinite matrix CC diagonalizes as C=QΛQC = Q\Lambda Q^\top with real non-negative eigenvalues. PCA is precisely the application of this theorem to the sample covariance matrix. From 02-Singular-Value-Decomposition, we take the Eckart-Young theorem and the SVD computation: PCA is most efficiently and stably computed by taking the SVD of the centered data matrix X~\tilde{X} rather than computing CC explicitly.

What PCA adds to the story. The eigendecomposition and SVD are algebraic facts about matrices. PCA gives them statistical meaning: the data matrix X~\tilde{X} is not arbitrary - it is a sample from an unknown distribution, and the covariance matrix CC is an estimate of that distribution's second moment. Maximizing projected variance, minimizing reconstruction error, and diagonalizing the covariance are all expressions of the same statistical goal: find the coordinate system in which the data is most "efficiently" described. This is PCA's unique contribution - it marries linear algebra with statistical estimation.

Forward to Linear Transformations. The next section, 04-Linear-Transformations, develops the abstract theory of linear maps between vector spaces. PCA can be understood in this language: the PCA projection xVkx\mathbf{x} \mapsto V_k^\top \mathbf{x} is a linear map from Rd\mathbb{R}^d to Rk\mathbb{R}^k, and the reconstruction zVkz\mathbf{z} \mapsto V_k\mathbf{z} is its "almost-inverse" (a pseudoinverse). The kernel of the projection is the span of the discarded components {vk+1,,vd}\{\mathbf{v}_{k+1}, \ldots, \mathbf{v}_d\} - the null space of the PCA transform. This connection between PCA and kernel/image theory is explored in 04.

Forward to Orthogonality. 05-Orthogonality-and-Orthonormality develops the Gram-Schmidt algorithm for constructing orthonormal bases, which is closely related to PCA (both find orthonormal bases, but PCA optimizes which basis is chosen). The QR decomposition studied there is also the computational backbone of many iterative eigenvalue algorithms used in large-scale PCA.

PCA IN THE CURRICULUM
========================================================================

  01 Eigenvalues          02 SVD
  -----------------        ---------------
  Spectral theorem    -->  Eckart-Young
  Covariance matrix   -->  Singular values
  Eigenvectors        -->  Right singular vectors
          |                      |
          +----------+-----------+
                     v
            03 PCA  <-- THIS SECTION
            -----------------------------
            - Variance maximization
            - Explained variance
            - Whitening
            - Probabilistic PCA
            - Kernel PCA
            - AI applications
                     |
          +----------+-----------+
          v                      v
  04 Linear              05 Orthogonality
  Transformations         -----------------
  ---------------         Gram-Schmidt
  Kernel of PCA proj      QR decomposition
  Image = PC subspace     Orthogonal bases

========================================================================

The deeper pattern. PCA is the first section in this curriculum where algebra, geometry, optimization, and statistics come together in one algorithm. Every major dimension-reduction technique in modern AI - LoRA, attention head analysis, representation geometry, loss landscape visualization - is a variation on the PCA theme: find a low-dimensional linear subspace that captures the essential structure of high-dimensional data. Mastering PCA is mastering the mathematical language in which most of AI's structural properties are expressed.


<- Back to Advanced Linear Algebra | <- SVD | Linear Transformations ->


Appendix A: Complete PCA Algorithm and Complexity

A.1 Algorithm with Full Pseudocode

ALGORITHM: Principal Component Analysis
========================================================================

INPUT:
  X in R^(n x d)     Data matrix (n samples, d features)
  k                  Number of components (1 <= k <= min(n, d))
  standardize        Boolean: whether to z-score normalize (default: False)

PREPROCESSING:
  1. Compute mean: mu = (1/n) X^T 1            (mu in R^d)
  2. Center data: X <- X - 1 mu^T
  3. If standardize:
       sigma = std(X, axis=0)                  (per-feature std, in R^d)
       X <- X / sigma                          (divide each column by its std)
     Else:
       sigma = 1                               (no-op)

CORE COMPUTATION (via thin SVD):
  4. Compute thin SVD: X = U Sigma V^T
       U in R^(n x r),  Sigma = diag(sigma_1, ..., sigma_r),  V in R^(d x r)
       r = min(n, d),   sigma_1 >= sigma_2 >= ... >= sigma_r >= 0
  5. Select top k columns:
       V_k = V[:, :k]   in R^(d x k)           (principal component directions)
       U_k = U[:, :k]   in R^(n x k)
       s_k = (sigma_1, ..., sigma_k)           in R^k

OUTPUT COMPUTATION:
  6. Scores (low-dim representation): Z = U_k diag(s_k)   in R^(n x k)
       Equivalently: Z = X V_k
  7. Explained variances: lambda_i = sigma_i^2 / (n - 1)   for i = 1, ..., k
  8. Variance ratios: rho_i = lambda_i / sum_j lambda_j
  9. Cumulative EVR: CEV(k) = sum_{i=1}^{k} rho_i

RECONSTRUCTION (optional):
  10. X_hat = (Z V_k^T) * sigma + 1 mu^T       (undo standardization and centering)
  11. Reconstruction error = sum_{i=k+1}^{r} sigma_i^2

NEW-SAMPLE PROJECTION:
  Given x_new in R^d:
  12. x_new <- (x_new - mu) / sigma            (use TRAINING mu and sigma)
  13. z_new = V_k^T x_new                      (in R^k)

========================================================================
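The pseudocode above maps almost line-for-line onto NumPy. A minimal sketch (function and variable names are ours, not from any library):

import numpy as np

def pca_fit(X, k, standardize=False):
    """Minimal PCA following the pseudocode above (not production code)."""
    mu = X.mean(axis=0)                                   # step 1
    Xc = X - mu                                           # step 2
    sigma = Xc.std(axis=0, ddof=1) if standardize else np.ones(X.shape[1])
    Xc = Xc / sigma                                       # step 3
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)     # step 4
    Vk = Vt[:k].T                                         # step 5
    Z = Xc @ Vk                                           # step 6 (= U_k diag(s_k))
    lambdas = s**2 / (len(X) - 1)                         # step 7
    evr = lambdas / lambdas.sum()                         # steps 8-9
    return {"mean": mu, "scale": sigma, "components": Vk,
            "scores": Z, "explained_variance_ratio": evr[:k]}

def pca_transform(model, X_new):
    """Project new samples using TRAINING statistics (steps 12-13)."""
    return ((X_new - model["mean"]) / model["scale"]) @ model["components"]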

A.2 Time and Space Complexity

| Operation | Time complexity | Space complexity | Notes |
|---|---|---|---|
| Centering | O(nd) | O(nd) | In-place possible |
| Standardization | O(nd) | O(d) | Per-feature |
| Full SVD | O(\min(n^2 d, nd^2)) | O(nd) | np.linalg.svd |
| Thin (economy) SVD | O(nd \cdot \min(n,d)) | O(n \cdot \min(n,d)) | full_matrices=False |
| Truncated SVD (ARPACK) | O(ndk) | O(nk + dk) | scipy.sparse.linalg.svds |
| Randomized SVD | O(ndk + k^2(n+d)) | O(nk + dk) | sklearn default for large n, d |
| Projection | O(ndk) | O(nk) | Z = \tilde{X}V_k |
| Reconstruction | O(ndk) | O(nd) | \hat{X} = ZV_k^\top |

When to use which SVD:

| Setting | Recommendation |
|---|---|
| n, d \leq 10{,}000, any k | Full/thin SVD (np.linalg.svd, full_matrices=False) |
| k \ll \min(n,d), moderate size | Randomized SVD (sklearn.utils.extmath.randomized_svd) |
| n > 10^5 or streaming data | Incremental PCA (sklearn.decomposition.IncrementalPCA) |
| k \ll \min(n,d), sparse X | ARPACK (scipy.sparse.linalg.svds) |
| Distributed computation | Distributed SVD (Spark MLlib, Dask) |

A.3 Numerical Stability Details

Why squaring the data is dangerous. If X~\tilde{X} has condition number κ(X~)=σ1/σr\kappa(\tilde{X}) = \sigma_1/\sigma_r, then C=X~X~/(n1)C = \tilde{X}^\top\tilde{X}/(n-1) has condition number κ(C)=κ(X~)2\kappa(C) = \kappa(\tilde{X})^2. For ill-conditioned data (common in genomics, sensor fusion), this means:

  • Data with κ(X~)=104\kappa(\tilde{X}) = 10^4: SVD of X~\tilde{X} stable in double precision (ϵ1016\epsilon \approx 10^{-16})
  • But the covariance C has \kappa(C) = 10^8: the eigenvalue computation loses about 8 digits of precision
  • Near-zero eigenvalues of CC may become negative due to rounding - np.linalg.eig may return complex values

Regularization for ill-conditioned covariance. Add a small diagonal:

Cϵ=C+ϵI,ϵ=δtr(C)/dC_\epsilon = C + \epsilon I, \quad \epsilon = \delta \cdot \operatorname{tr}(C)/d

This "regularized PCA" or "ridge PCA" ensures all eigenvalues are at least ϵ\epsilon. Common in high-dimensional settings (d>nd > n) where CC is rank-deficient.

import numpy as np

def pca_regularized(X, k, delta=1e-5):
    """PCA with Tikhonov (ridge) regularization for ill-conditioned data."""
    n, d = X.shape
    X_c = X - X.mean(axis=0)
    C = X_c.T @ X_c / (n - 1)
    eps = delta * np.trace(C) / d
    C_reg = C + eps * np.eye(d)
    lambdas, V = np.linalg.eigh(C_reg)
    # eigh returns eigenvalues in ascending order; reverse to descending
    lambdas = lambdas[::-1]
    V = V[:, ::-1]
    return V[:, :k], lambdas[:k] - eps  # subtract the regularization offset

Appendix B: PCA Decision Guide

PCA DECISION FLOWCHART
========================================================================

  START: I need to reduce dimensionality
          |
          v
  Is the relationship between variables LINEAR?
  +-- YES -> Standard PCA (2-4)
  +-- NO  -> Is n manageable (< 50,000)?
             +-- YES -> Kernel PCA with RBF (7)
             +-- NO  -> Autoencoder or UMAP

  Chosen PCA. Should I STANDARDIZE?
  +-- Features on same scale, units meaningful -> Center only
  +-- Different scales or units -> Standardize (z-score)

  How to choose k?
  +-- Exploratory / visualization -> Scree plot elbow + 90% EVR
  +-- Preprocessing for classifier -> Cross-validate downstream metric
  +-- Compression -> Set reconstruction error budget
  +-- Statistical test needed -> Parallel analysis (Horn's method)

  Large dataset? (n > 10,000 or d > 10,000)
  +-- n >> d -> Incremental PCA (mini-batch)
  +-- k << min(n,d) -> Randomized SVD
  +-- Data has outliers -> Robust PCA first, then PCA on low-rank part

  Need uncertainty / missing data handling?
  +-- YES -> Probabilistic PCA (6) with EM

========================================================================

Appendix C: Key Formulas Reference

| Quantity | Formula | Code |
|---|---|---|
| Sample covariance | C = \tilde{X}^\top\tilde{X}/(n-1) | np.cov(X.T) |
| PCA via SVD | \tilde{X} = U\Sigma V^\top | np.linalg.svd(X_c, full_matrices=False) |
| PC directions | Columns of V (rows of V^\top) | Vt.T |
| Explained variance | \lambda_i = \sigma_i^2/(n-1) | s**2 / (n-1) |
| Variance ratio | \rho_i = \sigma_i^2 / \sum_j\sigma_j^2 | s**2 / (s**2).sum() |
| Scores | Z = \tilde{X}V_k = U_k\Sigma_k | X_c @ Vt[:k].T |
| Reconstruction | \hat{X} = ZV_k^\top + \bar{x} | Z @ Vt[:k] + mean |
| Reconstruction error | \sum_{i>k}\sigma_i^2 | (s[k:]**2).sum() |
| PCA whitening | Z_w = Z\Lambda_k^{-1/2} | Z / np.sqrt(lambdas) |
| ZCA whitening | Z_w = \tilde{X}V_k\Lambda_k^{-1/2}V_k^\top | See 5.2 |
| Participation ratio | (\sum\lambda_i)^2 / \sum\lambda_i^2 | Custom (9.2) |
| Centered kernel | \tilde{K} = K - \mathbf{1}_nK - K\mathbf{1}_n + \mathbf{1}_nK\mathbf{1}_n | See 7.3 |
| PPCA noise | \hat{\sigma}^2 = \sum_{i>k}\lambda_i/(d-k) | See 6.2 |


Appendix D: Worked Example - Complete PCA from Scratch

This section walks through a complete PCA computation on a small dataset, showing every intermediate result.

D.1 Setup

Consider a dataset of n=5n=5 samples and d=3d=3 features - temperature, humidity, and pressure readings from a weather station:

X = \begin{pmatrix}23 & 65 & 1013 \\ 27 & 72 & 1008 \\ 19 & 58 & 1020 \\ 25 & 68 & 1015 \\ 21 & 61 & 1018\end{pmatrix}

D.2 Step 1: Center the Data

Sample mean:

\bar{\mathbf{x}} = \frac{1}{5}\sum_{i=1}^{5}\mathbf{x}^{(i)} = (23,\ 64.8,\ 1014.8)^\top

After centering: \tilde{X} = X - \mathbf{1}\bar{\mathbf{x}}^\top. Note the difference in units and raw scale - pressure readings sit near 1000 while temperatures sit near 20 - which is why features measured in different units should be standardized before PCA.

D.3 Step 2: Effect of Scale

Without standardization, the covariance matrix weights features by their raw variances - humidity (variance \approx 31) and pressure (\approx 22) outweigh temperature (\approx 10) simply because of their units, not their information content. After standardization, all features have unit variance, and PCA is driven by the correlation structure alone.

import numpy as np

X = np.array([[23, 65, 1013],
              [27, 72, 1008],
              [19, 58, 1020],
              [25, 68, 1015],
              [21, 61, 1018]], dtype=float)

# Without standardization
X_c = X - X.mean(axis=0)
C = X_c.T @ X_c / (len(X) - 1)
print("Covariance matrix (no standardization):")
print(np.round(C, 2))
# [[ 10.    17.5  -13.5 ]
#  [ 17.5   30.7  -24.05]
#  [-13.5  -24.05  21.7 ]]

# With standardization
X_std = X_c / X_c.std(axis=0, ddof=1)
C_std = X_std.T @ X_std / (len(X) - 1)
print("\nCorrelation matrix (standardized):")
print(np.round(C_std, 3))
# [[ 1.     0.999 -0.916]
#  [ 0.999  1.    -0.932]
#  [-0.916 -0.932  1.   ]]

Insight from the correlation matrix: All three features are strongly correlated - temperature and humidity move together (r \approx 0.999), and both are anti-correlated with pressure (r \approx -0.92 and -0.93). This is physically meaningful: warm, humid days are low-pressure days. PCA should find one dominant component.

D.4 Step 3: SVD and Principal Components

U, s, Vt = np.linalg.svd(X_std, full_matrices=False)

# Singular values -> explained variance ratios
var_ratio = s**2 / (s**2).sum()
print("Explained variance ratios:", np.round(var_ratio, 4))
# approx. [0.966  0.034  0.000]
# -> PC1 explains about 96.6% of the variance

print("PC1 direction:", np.round(Vt[0], 3))
# approx. [ 0.581  0.584 -0.567]  (sign may be flipped) <- nearly equal weights on all three
# Temperature and humidity load positively, pressure negatively
# -> PC1 is a "warm-humid-low-pressure" index

Interpretation: PC1 \approx (0.58, 0.58, -0.57)^\top - a linear combination that increases with temperature and humidity and decreases with pressure. This is physically the "storm index." PC2 and PC3 together explain only about 3.4% of the variance.

D.5 Step 4: Reconstruction

k = 1  # Keep only PC1
Z_k = U[:, :k] * s[:k]       # Scores: (5, 1)
X_reconstructed = Z_k @ Vt[:k] + X_std.mean(axis=0)

recon_error = np.linalg.norm(X_std - Z_k @ Vt[:k])**2
print(f"Reconstruction error (k=1): {recon_error:.6f}")
print(f"Expected (discarded variance): {(s[1:]**2).sum():.6f}")
# Both ~0.41 - matches Eckart-Young

Appendix E: Connections to Information Theory and Statistics

E.1 PCA and Maximum Likelihood under Gaussianity

If the data is Gaussian, x(i)N(μ,Σ)\mathbf{x}^{(i)} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma), then the sample covariance CC is the maximum likelihood estimator of Σ\Sigma. PCA on CC finds the directions of greatest information concentration under the Gaussian assumption.

Under this model, the PCA projection scores Z=X~VkZ = \tilde{X}V_k are also Gaussian: Z:,iN(0,λi)Z_{:,i} \sim \mathcal{N}(0, \lambda_i). The optimal Bayesian compression of a Gaussian random vector to kk dimensions is exactly the PCA projection - PCA is the optimal linear encoder under squared loss and Gaussian data.
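A quick numerical check of the decorrelation claim (illustrative only; the Gaussian below is synthetic): the sample covariance of the score matrix is diagonal, with entries equal to the explained variances \lambda_i = \sigma_i^2/(n-1).

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
X = rng.multivariate_normal(np.zeros(5), A @ A.T, size=20000)
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                                  # scores on all d components
cov_Z = Z.T @ Z / (len(X) - 1)                 # should equal diag(lambda_1, ..., lambda_d)
print(np.allclose(cov_Z, np.diag(s**2 / (len(X) - 1)), atol=1e-6))   # True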

Non-Gaussian data. For non-Gaussian distributions, PCA minimizes reconstruction error but may not capture the most "interesting" structure. Alternative approaches:

  • ICA: maximize statistical independence (not just decorrelation)
  • NMF: non-negative matrix factorization for parts-based representations
  • Sparse coding: find sparse representations in overcomplete dictionaries
  • VAE: learn a non-linear latent space via variational inference

E.2 Information-Geometric View

The set of all d×dd \times d positive definite matrices forms a Riemannian manifold (S++d,g)(\mathbb{S}^d_{++}, g) with the Fisher information metric. The MLE CC is the Frechet mean on this manifold.

The PCA truncation CVkΛkVk+σ^2(IVkVk)C \to V_k\Lambda_k V_k^\top + \hat{\sigma}^2(I - V_kV_k^\top) (the PPCA approximation) is the projection of CC onto the set of matrices with at most kk large eigenvalues - the PPCA manifold. This gives information-geometric meaning to choosing kk: it is the number of "directions" the data needs to distinguish itself from noise.

E.3 PCA as a Linear Autoencoder

A linear autoencoder with bottleneck dimension kk consists of:

  • Encoder: z=WexRk\mathbf{z} = W_e\mathbf{x} \in \mathbb{R}^k, WeRk×dW_e \in \mathbb{R}^{k \times d}
  • Decoder: x^=WdzRd\hat{\mathbf{x}} = W_d\mathbf{z} \in \mathbb{R}^d, WdRd×kW_d \in \mathbb{R}^{d \times k}
  • Loss: L=xx^22\mathcal{L} = \lVert\mathbf{x} - \hat{\mathbf{x}}\rVert_2^2

Theorem. The global minimum of the linear autoencoder loss is achieved when col(Wd)=span{v1,,vk}\operatorname{col}(W_d) = \operatorname{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\} (the PCA subspace). The optimal reconstruction equals the PCA reconstruction.

Important caveat: The optimal WeW_e and WdW_d need not equal VkV_k^\top and VkV_k individually - any rotation RR works: Wd=VkRW_d = V_k R, We=RVkW_e = R^\top V_k^\top. The autoencoder can discover any orthogonal basis for the PCA subspace - it is unique as a subspace, not as specific directions.

Contrast with non-linear autoencoders: A neural autoencoder with non-linear activations can learn non-linear manifolds, capturing curvature that linear PCA misses. But for linear data, the two are equivalent, and the non-linear version is strictly harder to optimize.
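The rotation ambiguity is easy to verify numerically. A small sketch (synthetic data; the only point is that V_k R and R^\top V_k^\top reconstruct exactly as well as V_k itself):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))     # correlated data
Xc = X - X.mean(axis=0)
k = 3
Vk = np.linalg.svd(Xc, full_matrices=False)[2][:k].T        # (d, k) PCA basis
R = np.linalg.qr(rng.normal(size=(k, k)))[0]                # random orthogonal k x k
W_d = Vk @ R                                                # decoder
W_e = W_d.T                                                 # encoder = R^T V_k^T
recon_pca = Xc @ Vk @ Vk.T
recon_ae = Xc @ W_e.T @ W_d.T
print(np.allclose(recon_pca, recon_ae))                     # True: same projection V_k V_k^T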


Appendix F: PCA in Python Ecosystem

F.1 sklearn API

from sklearn.decomposition import PCA, KernelPCA, SparsePCA, IncrementalPCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Standard PCA with variance threshold
pca = PCA(n_components=0.95)    # keep 95% variance
pca.fit(X_train)
Z_train = pca.transform(X_train)
Z_test = pca.transform(X_test)  # uses training mean/components
print(f"Shape: {X_train.shape} -> {Z_train.shape}")
print(f"Components: {pca.n_components_}")
print(f"Explained var: {pca.explained_variance_ratio_.cumsum()[-1]:.3f}")

# Full pipeline with standardization
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
])
Z = pipe.fit_transform(X_train)
Z_test = pipe.transform(X_test)  # correct: uses training scaler+PCA

# Randomized PCA (faster for large n, d, small k)
pca_rand = PCA(n_components=50, svd_solver='randomized', random_state=42)

# Incremental PCA for streaming/large data
ipca = IncrementalPCA(n_components=50, batch_size=1000)
for batch in data_batches:
    ipca.partial_fit(batch)

# Kernel PCA
kpca = KernelPCA(n_components=10, kernel='rbf', gamma=0.1,
                  fit_inverse_transform=True)
Z_kpca = kpca.fit_transform(X)
X_back = kpca.inverse_transform(Z_kpca)  # approximate reconstruction

F.2 NumPy/SciPy Low-Level API

import numpy as np
import scipy.linalg as la

# Full SVD
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)

# Symmetric eigendecomposition (for SPD matrices - better than eig)
lambdas, Q = np.linalg.eigh(C)    # returns ascending order
lambdas = lambdas[::-1]; Q = Q[:, ::-1]  # -> descending

# Truncated SVD via ARPACK (for sparse matrices, large k)
from scipy.sparse.linalg import svds
U_k, s_k, Vt_k = svds(X_sparse, k=50)  # top-50 SVD of sparse X
idx = np.argsort(s_k)[::-1]             # ARPACK returns arbitrary order
U_k, s_k, Vt_k = U_k[:, idx], s_k[idx], Vt_k[idx]

F.3 PyTorch (GPU-accelerated)

import torch

X_t = torch.tensor(X_centered, dtype=torch.float32).cuda()

# Full SVD on GPU
U, s, Vh = torch.linalg.svd(X_t, full_matrices=False)

# Low-rank SVD (much faster for k << n,d)
U_k, s_k, Vh_k = torch.svd_lowrank(X_t, q=k, niter=4)

# Projection and reconstruction
Z_k = X_t @ Vh_k.T          # (n, k)
X_recon = Z_k @ Vh_k + mean_t  # (n, d)


Appendix G: PCA vs Factor Analysis

PCA and Factor Analysis (FA) both reduce dimensionality, but with fundamentally different models and goals. This confusion is common in practice.

G.1 The Factor Analysis Model

FA assumes:

x=Λf+ϵ,fN(0,Ik),ϵN(0,Ψ)\mathbf{x} = \Lambda \mathbf{f} + \boldsymbol{\epsilon}, \quad \mathbf{f} \sim \mathcal{N}(\mathbf{0}, I_k), \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \Psi)

where ΛRd×k\Lambda \in \mathbb{R}^{d \times k} is the loading matrix, f\mathbf{f} are the latent factors, and Ψ=diag(ψ1,,ψd)\Psi = \operatorname{diag}(\psi_1, \ldots, \psi_d) is a diagonal (not isotropic) noise covariance.

The observed covariance is: Σ=ΛΛ+Ψ\Sigma = \Lambda\Lambda^\top + \Psi

Key difference from PPCA: In PPCA, Ψ=σ2I\Psi = \sigma^2 I (isotropic noise - same noise in every direction). In FA, Ψ\Psi is diagonal but arbitrary - each feature has its own noise level. This makes FA more flexible but harder to fit (requires EM or iterative estimation).
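The difference shows up directly in sklearn (an illustrative sketch with synthetic heteroscedastic data; here PCA's noise_variance_ is the PPCA-style isotropic estimate and FactorAnalysis's noise_variance_ is per-feature):

import numpy as np
from sklearn.decomposition import FactorAnalysis, PCA

rng = np.random.default_rng(0)
n, d, k = 2000, 10, 2
Lambda = rng.normal(size=(d, k))                    # true loadings
psi = rng.uniform(0.05, 1.0, size=d)                # per-feature noise variances
X = rng.normal(size=(n, k)) @ Lambda.T + rng.normal(size=(n, d)) * np.sqrt(psi)

fa = FactorAnalysis(n_components=k).fit(X)
print(np.round(fa.noise_variance_, 2))              # tracks psi feature-by-feature

pca = PCA(n_components=k).fit(X)
print(round(float(pca.noise_variance_), 2))         # a single isotropic noise level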

G.2 Comparison Table

| Aspect | PCA | Factor Analysis |
|---|---|---|
| Goal | Explain maximum variance | Explain correlations via shared factors |
| Noise model | None (geometric) | \mathcal{N}(\mathbf{0}, \Psi), \Psi diagonal |
| Unique components | Yes, up to sign (for distinct eigenvalues) | No (rotation-ambiguous) |
| Scale-invariant | No | Yes (FA is correlation-based) |
| Interpretation | Components = linear combos of features | Factors = latent causes of correlations |
| Model selection | Scree, EVR, Kaiser | Likelihood ratio test, BIC |
| Handles unique variances | No (all noise enters identically) | Yes (\psi_i per feature) |
| Preferred when | Features are measured on the same scale; want linear compression | Features have different measurement errors; want causal factors |

G.3 When to Use Which

Use PCA when:

  • The goal is linear dimensionality reduction for downstream modelling
  • Features are on the same scale (or you standardize for scale-invariance)
  • You want the computationally simplest, most interpretable method
  • Reconstruction error is the primary criterion

Use Factor Analysis when:

  • Features have different measurement error variances
  • You believe latent factors causally generate the features
  • Statistical testing of the number of factors is needed
  • You are working in psychometrics, econometrics, or psychophysiology

Use PPCA when:

  • You want a probabilistic model that recovers standard PCA in the zero-noise limit
  • Missing data is present (EM handles it naturally)
  • You need likelihood scores for model comparison (choosing kk)

Appendix H: Geometric Interpretation Across Dimensions

H.1 PCA in 2D

In 2D, the data cloud is an ellipse. The two principal components are the major and minor axes of the ellipse. The aspect ratio of the ellipse equals λ1/λ2\sqrt{\lambda_1/\lambda_2}.

2D DATA ELLIPSE AND PRINCIPAL COMPONENTS
========================================================================

          x_2
          |       . . .
          |    .         .           v_1 = major axis = PC1
          |  .      up-right     .          \lambda_1 = half-length^2 (variance)
          | .    v_1       .
          |.  upv_2          .         v_2 = minor axis = PC2
          | .              .          \lambda_2 = half-length^2 (variance)
          |  .            .
          |    .         .
          |       . . .
          +-------------------- x_1

  PC1 direction: angle \theta where tan(2\theta) = 2C_1_2/(C_1_1 - C_2_2)
  (from characteristic polynomial of 2\times2 covariance)

========================================================================

For 2D Gaussian data with covariance C=(abbc)C = \begin{pmatrix}a & b \\ b & c\end{pmatrix}:

  • Eigenvalues: λ1,2=(a+c)±(ac)2+4b22\lambda_{1,2} = \frac{(a+c) \pm \sqrt{(a-c)^2 + 4b^2}}{2}
  • PC1 angle: \theta = \frac{1}{2}\arctan\!\left(\frac{2b}{a-c}\right) (both formulas are checked numerically in the sketch below)
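A quick numerical check of these closed forms, using the covariance from Exercise 4 as the example:

import numpy as np

a, b, c = 9.0, 6.0, 5.0                        # C = [[a, b], [b, c]]
C = np.array([[a, b], [b, c]])
lam1 = ((a + c) + np.sqrt((a - c)**2 + 4*b**2)) / 2
lam2 = ((a + c) - np.sqrt((a - c)**2 + 4*b**2)) / 2
theta = 0.5 * np.arctan2(2*b, a - c)           # arctan2 also handles the a = c case
print(lam1, lam2)                              # ~13.32, ~0.68
print(np.linalg.eigvalsh(C)[::-1])             # same eigenvalues from numpy
print(np.degrees(theta))                       # PC1 angle ~35.8 degrees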

H.2 PCA in 3D

In 3D, the data lives in an ellipsoid with three principal axes of lengths λ1λ2λ3\sqrt{\lambda_1} \geq \sqrt{\lambda_2} \geq \sqrt{\lambda_3}. Three cases:

| Spectrum | Shape | Example |
|---|---|---|
| \lambda_1 \gg \lambda_2 \approx \lambda_3 | Cigar (1D structure) | Collinear data, LoRA rank-1 |
| \lambda_1 \approx \lambda_2 \gg \lambda_3 | Pancake (2D structure) | Planar data, face images |
| \lambda_1 \approx \lambda_2 \approx \lambda_3 | Sphere (3D, no structure) | White noise |
| \lambda_1 > \lambda_2 > \lambda_3 | General ellipsoid | Typical real data |

Projection onto PC1: map to a line (1D cigar projection). Projection onto PC1, PC2: map to a plane (2D cross-section of ellipsoid).

H.3 High-Dimensional PCA: The Marchenko-Pastur Law

In the pure-noise baseline of the "spiked" random matrix model, the data has n samples and d features with no structure at all: X_{ij} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0,1). The eigenvalues of C = X^\top X/n converge (as n,d \to \infty with d/n \to \gamma) to the Marchenko-Pastur distribution:

p(λ)=(λ+λ)(λλ)2πγλ,λ[λ,λ+]p(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi\gamma\lambda}, \quad \lambda \in [\lambda_-, \lambda_+]

where λ±=(1±γ)2\lambda_\pm = (1 \pm \sqrt{\gamma})^2.

Critical implication: When eigenvalues from PCA fall within the Marchenko-Pastur bulk [λ,λ+][\lambda_-, \lambda_+], they correspond to pure noise - keeping these components adds noise, not signal. Only eigenvalues above λ+\lambda_+ are genuine signal components.

This provides a rigorous threshold for component selection:

  • Parallel analysis (Horn's method) approximates this empirically
  • For standardized data with n = 500, d = 100 (\gamma = 0.2): \lambda_+ = (1+\sqrt{0.2})^2 \approx 2.09 - keep only PCs with \lambda > 2.09, not just \lambda > 1 (the Kaiser criterion is far too lenient here); see the simulation sketch below
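A minimal simulation of this threshold (pure-noise data, no signal planted):

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 100
gamma = d / n
lam_plus = (1 + np.sqrt(gamma))**2             # ~2.09 for gamma = 0.2
X = rng.normal(size=(n, d))                    # no structure at all
eigs = np.linalg.eigvalsh(X.T @ X / n)
print(f"lambda_+ = {lam_plus:.2f}, largest noise eigenvalue = {eigs[-1]:.2f}")
# The largest pure-noise eigenvalue lands near lambda_+; only eigenvalues
# well above it should be treated as signal.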

Appendix I: Self-Check Questions

Before moving to the next section, verify you can answer these questions:

Conceptual:

  1. Why must data be centered before PCA? What goes wrong if you skip centering?
  2. Why do PCA principal components have sign ambiguity? Is the PCA subspace unique?
  3. In what sense is PCA the "optimal" linear dimensionality reduction? What is it optimizing?
  4. What is the difference between PCA scores and PCA loadings (components)?
  5. Why is SVD numerically preferred over eigendecomposition of the covariance matrix?
  6. When should you standardize before PCA? Give a concrete example where failing to standardize changes the answer qualitatively.
  7. What does the PPCA noise parameter σ^2\hat{\sigma}^2 estimate, and why is it the average of the discarded eigenvalues?
  8. Why does kernel PCA require centering the kernel matrix? What is the kernel matrix equivalent of "centering the data"?

Mathematical:

  1. Derive the PCA optimization: show that maximizing vCv\mathbf{v}^\top C\mathbf{v} subject to v=1\lVert\mathbf{v}\rVert = 1 gives the eigenvalue equation Cv=λvC\mathbf{v} = \lambda\mathbf{v}.
  2. Prove that the PCA scores are decorrelated: show Cov(Z)=Λ\operatorname{Cov}(Z) = \Lambda.
  3. State the Eckart-Young theorem and explain how it implies PCA minimizes reconstruction error.
  4. Show that the reconstruction error from keeping kk PCs equals i>kσi2\sum_{i>k}\sigma_i^2.
  5. For PPCA: verify that as σ20\sigma^2 \to 0, the PPCA MLE W^\hat{W} converges to VkΛk1/2V_k\Lambda_k^{1/2} (up to rotation).

Computational:

  1. Given a 100×50100 \times 50 data matrix, what is the shape of UU, Σ\Sigma, VV^\top from the thin SVD?
  2. How do you project a new test point onto the PCA subspace without data leakage?
  3. What does PCA(n_components=0.95) in sklearn compute?
  4. Why must you center the kernel vector (not just the kernel matrix) when projecting new points in kernel PCA?


Appendix J: PCA and the Four Fundamental Subspaces

The connection between PCA and the four fundamental subspaces (see 02-Linear-Algebra-Basics 05) is illuminating.

For the centered data matrix \tilde{X} \in \mathbb{R}^{n \times d}: centering forces the rows to sum to zero, so in generic position \operatorname{rank}(\tilde{X}) = r = \min(n-1, d).

| Subspace | For \tilde{X} | PCA interpretation |
|---|---|---|
| Column space \mathcal{C}(\tilde{X}) | Span of columns = span of U | Space of all possible score vectors |
| Row space \mathcal{C}(\tilde{X}^\top) | Span of rows = span of V | The PCA subspace - span of the principal components |
| Null space \mathcal{N}(\tilde{X}) | Vectors \mathbf{v} with \tilde{X}\mathbf{v}=\mathbf{0} | Discarded directions - zero-variance PCs |
| Left null space \mathcal{N}(\tilde{X}^\top) | Vectors \mathbf{u} with \tilde{X}^\top\mathbf{u}=\mathbf{0} | Directions in sample space with no data |

PCA projection in subspace language:

  • The PCA encoder xVkx\mathbf{x} \mapsto V_k^\top\mathbf{x} is an orthogonal projection onto the row space of X~\tilde{X}
  • The discarded components {vk+1,,vd}\{\mathbf{v}_{k+1},\ldots,\mathbf{v}_d\} span the part of the row space that is "noisy" or irrelevant
  • If n<dn < d, then X~\tilde{X} has a null space of dimension drd - r - these directions have zero variance and are automatically discarded

Dimension counting:

d=ksignal+knoise+dimN(X~)d = k_{\text{signal}} + k_{\text{noise}} + \dim\mathcal{N}(\tilde{X})

For n > d (more samples than features): \dim\mathcal{N}(\tilde{X}) = 0 in generic position, and all of the feature space carries variance. For n \leq d: \dim\mathcal{N}(\tilde{X}) = d - (n-1) - the centered data cannot fill the feature space, and PCA automatically collapses those zero-variance directions.
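A two-line check of this rank bookkeeping (synthetic wide data with n < d):

import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50
X = rng.normal(size=(n, d))
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print((s > 1e-10).sum())                       # 9 = n - 1: centering costs one rank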


Appendix K: References and Further Reading

Primary References

  1. Pearson, K. (1901). "On Lines and Planes of Closest Fit to Systems of Points in Space." Philosophical Magazine, 2(11), 559-572. - The original PCA paper.

  2. Hotelling, H. (1933). "Analysis of a Complex of Statistical Variables into Principal Components." Journal of Educational Psychology, 24(6), 417-441. - Named "principal components" and gave the Lagrange-multiplier derivation.

  3. Tipping, M.E. & Bishop, C.M. (1999). "Probabilistic Principal Component Analysis." Journal of the Royal Statistical Society B, 61(3), 611-622. - PPCA: the generative model and MLE.

  4. Scholkopf, B., Smola, A., & Muller, K.-R. (1998). "Nonlinear Component Analysis as a Kernel Eigenvalue Problem." Neural Computation, 10(5), 1299-1319. - Kernel PCA.

  5. Candes, E.J., Li, X., Ma, Y., & Wright, J. (2011). "Robust Principal Component Analysis?" Journal of the ACM, 58(3), 1-37. - Robust PCA via nuclear norm minimization.

  6. Halko, N., Martinsson, P.-G., & Tropp, J.A. (2011). "Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions." SIAM Review, 53(2), 217-288. - Randomized SVD.

AI/ML Applications

  1. Hu, E.J. et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. - The LoRA paper; low-rank hypothesis.

  2. Li, H. et al. (2018). "Visualizing the Loss Landscape of Neural Nets." NeurIPS 2018. - PCA of gradient trajectories for loss landscape visualization.

  3. Aghajanyan, A. et al. (2020). "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." ACL 2021. - Measuring intrinsic dimensionality of fine-tuning.

  4. Elhage, N. et al. (2022). "Toy Models of Superposition." Anthropic Research. - PCA and sparse representations in transformer models.

Textbooks

  1. Jolliffe, I.T. (2002). Principal Component Analysis (2nd ed.). Springer. - The definitive reference.

  2. Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. - Chapter 12: PPCA, mixtures, probabilistic view.

  3. Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. - Chapter 20: low-rank models, PCA, factor analysis.

  4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. - Chapter 14: unsupervised learning, PCA, kernel PCA.



Appendix L: PCA in Practice - Common Workflows

L.1 Preprocessing Pipeline for Supervised Learning

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Full pipeline: scale -> PCA -> classifier
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('clf', SVC(kernel='rbf'))
])

# Grid search over number of PCA components AND SVM hyperparameters
param_grid = {
    'pca__n_components': [10, 20, 50, 100, 0.95],  # 0.95 = 95% variance
    'clf__C': [0.1, 1.0, 10.0],
    'clf__gamma': ['scale', 'auto']
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)

print(f"Best k: {search.best_params_['pca__n_components']}")
print(f"Best accuracy: {search.best_score_:.4f}")

Best practice: Always search over n_components as a hyperparameter when using PCA for supervised learning. The optimal kk for downstream accuracy often differs from the kk chosen by variance threshold.

L.2 Visualization Workflow

Standard workflow for visualizing high-dimensional data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Step 1: PCA to ~50 dims (remove noise, keep structure)
pca = PCA(n_components=50, random_state=42)
X_pca50 = pca.fit_transform(StandardScaler().fit_transform(X))
print(f"50-dim PCA captures {pca.explained_variance_ratio_.sum():.1%} variance")

# Step 2: t-SNE on PCA output (not raw data - faster, more stable)
tsne = TSNE(n_components=2, perplexity=30, random_state=42, n_iter=1000)
X_2d = tsne.fit_transform(X_pca50)

# Step 3: Plot with class labels
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# PCA 2D (first 2 components)
ax = axes[0]
for c in np.unique(labels):
    mask = labels == c
    ax.scatter(X_pca50[mask, 0], X_pca50[mask, 1], label=f'Class {c}', alpha=0.7, s=20)
ax.set_title(f'PCA (PC1+PC2: {pca.explained_variance_ratio_[:2].sum():.1%} var)')
ax.set_xlabel('PC1'); ax.set_ylabel('PC2')
ax.legend()

# t-SNE 2D
ax = axes[1]
for c in np.unique(labels):
    mask = labels == c
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], label=f'Class {c}', alpha=0.7, s=20)
ax.set_title('t-SNE (on 50-dim PCA output)')
ax.legend()

plt.tight_layout()

L.3 Anomaly Detection with PCA

PCA-based anomaly detection: high reconstruction error = anomalous point.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_anomaly_detector(X_train, X_test, k, threshold_pct=95):
    """
    Detect anomalies as points with reconstruction error above threshold.
    Returns: anomaly scores for X_test, threshold value
    """
    # Fit PCA
    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train)
    X_test_s = scaler.transform(X_test)

    pca = PCA(n_components=k)
    pca.fit(X_train_s)

    # Reconstruction error for training data
    X_train_recon = pca.inverse_transform(pca.transform(X_train_s))
    train_errors = np.sum((X_train_s - X_train_recon)**2, axis=1)
    threshold = np.percentile(train_errors, threshold_pct)

    # Reconstruction error for test data
    X_test_recon = pca.inverse_transform(pca.transform(X_test_s))
    test_errors = np.sum((X_test_s - X_test_recon)**2, axis=1)

    anomalies = test_errors > threshold
    return test_errors, threshold, anomalies

How it works: Normal samples live near the PCA subspace (low reconstruction error). Anomalous samples deviate from the normal subspace (high error). The threshold is set from the distribution of training reconstruction errors.

L.4 Monitoring Distribution Shift

PCA scores can detect dataset drift between training and production:

import numpy as np
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def check_distribution_shift(X_train, X_prod, k=10, alpha=0.05):
    """
    Check whether production data has drifted from the training distribution.
    Runs a two-sample Kolmogorov-Smirnov test on each PCA score dimension.
    """

    scaler = StandardScaler()
    X_tr_s = scaler.fit_transform(X_train)
    X_pr_s = scaler.transform(X_prod)

    pca = PCA(n_components=k)
    Z_train = pca.fit_transform(X_tr_s)
    Z_prod = pca.transform(X_pr_s)

    # Test each component for distribution shift (Kolmogorov-Smirnov)
    drifted = []
    for i in range(k):
        _, p_value = stats.ks_2samp(Z_train[:, i], Z_prod[:, i])
        if p_value < alpha:
            drifted.append(i)

    if drifted:
        print(f"Distribution drift detected in PCA dimensions: {drifted}")
        print(f"PC{drifted[0]+1} explains {pca.explained_variance_ratio_[drifted[0]]:.1%} variance")
    else:
        print("No significant drift detected.")
    return drifted

For AI: Production monitoring of LLM inputs and outputs often uses PCA of embeddings. If the PCA score distribution of production inputs drifts from training, it signals distribution shift - triggering retraining or alerts.

L.5 Summary of Key Implementation Choices

IMPLEMENTATION DECISION TREE
========================================================================

  Dataset size?
  +--- Small (n,d < 10k) -------------------------------------------+
  |  np.linalg.svd(X_c, full_matrices=False)  -> exact, fast        |
  +-----------------------------------------------------------------+
  +--- Large, small k ----------------------------------------------+
  |  sklearn.utils.extmath.randomized_svd(k, n_iter=5)             |
  |  -> ~10x faster than full SVD                                   |
  +-----------------------------------------------------------------+
  +--- Streaming / memory-constrained ------------------------------+
  |  sklearn.decomposition.IncrementalPCA(batch_size=B)             |
  |  -> processes data in batches of size B                         |
  +-----------------------------------------------------------------+
  +--- Sparse data -------------------------------------------------+
  |  scipy.sparse.linalg.svds(X_sparse, k=k)                       |
  |  -> never forms dense matrices                                  |
  +-----------------------------------------------------------------+
  +--- GPU / large-scale -------------------------------------------+
  |  torch.svd_lowrank(X_tensor, q=k, niter=4)                     |
  |  -> GPU-accelerated randomized SVD                              |
  +-----------------------------------------------------------------+
  +--- Outliers present --------------------------------------------+
  |  sklearn.decomposition.MiniBatchSparsePCA or Robust PCA        |
  |  -> decompose X = L + S first, then PCA on L                   |
  +-----------------------------------------------------------------+

========================================================================