Math for LLMs — Advanced Linear Algebra: Matrix Decompositions

Notes

"The purpose of computing is insight, not numbers - and no insight is more powerful than factoring a matrix into pieces whose structure you understand."

Overview

Every practical computation with matrices reduces, ultimately, to a factorization. Solving $A\mathbf{x} = \mathbf{b}$? Factor $A = LU$ and solve two triangular systems. Fitting a linear model? Factor the data matrix $A = QR$ and solve a triangular system stably. Sampling from a multivariate Gaussian? Factor the covariance $\Sigma = LL^\top$ and transform independent normals. Estimating the rank of a nearly singular matrix? Factor with column pivoting and read off the diagonal of $R$.

This section develops the three foundational computational decompositions: LU factorization (Gaussian elimination with pivoting, the workhorse for general square systems), QR factorization via Householder and Givens (the stable method for least squares and rank-revealing problems), and a brief computational recap of Cholesky (the SPD specialist, developed fully in 07). The emphasis throughout is computational: how these algorithms work, why they are numerically stable, what can go wrong, and how to implement them efficiently.

The machine learning connections are direct and load-bearing. Every Newton step requires solving a Hessian system - that is LU or Cholesky. Every least-squares layer (linear probing, ridge regression on features) uses QR under the hood in numerically reliable implementations. Gaussian process inference - the backbone of Bayesian optimization and uncertainty-aware ML - is Cholesky factorization applied thousands of times. Differentiating through factorizations (for implicit differentiation of constrained problems) requires understanding the algebraic structure developed here. And the emerging field of randomized numerical linear algebra, which underlies methods like LoRA and sketch-and-solve preconditioning, is built on randomized LU and QR.

Prerequisites

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Interactive derivations: LU with pivoting, Householder QR, Givens rotations, RRQR, blocked algorithms, GP inference, randomized methods |
| exercises.ipynb | 8 graded exercises, from triangular solves through differentiating Cholesky for GP log-likelihood |

Learning Objectives

After completing this section, you will be able to:

  • Implement forward and backward substitution and analyze their numerical properties
  • Derive LU factorization from Gaussian elimination and prove the existence theorem
  • Explain why partial pivoting is necessary and implement $PA = LU$ with growth factor analysis
  • Distinguish partial, complete, and rook pivoting and state the stability guarantee of each
  • Implement Householder QR from scratch, including the sign convention for numerical stability
  • Construct Givens rotations and explain when to prefer them over Householder reflectors
  • Implement rank-revealing QR with column pivoting and use the R diagonal to estimate rank
  • Describe tall-skinny QR (TSQR) and explain its communication-optimal property
  • State the backward error theorem and compare the stability of LU, QR, and Cholesky
  • Explain blocked algorithms and why BLAS-3 operations dominate on modern hardware
  • Apply QR to solve least-squares problems and explain why it is more stable than normal equations
  • Differentiate through Cholesky factorization and compute gradients of log-determinants

1. Intuition

1.1 The Factorization Paradigm

Every problem in computational linear algebra is, at its heart, a search for structure. When you factor a matrix into simpler pieces - triangular, orthogonal, diagonal - you transform a hard problem into a sequence of easy ones. The word "easy" is precise: triangular systems can be solved in $O(n^2)$ operations by substitution; orthogonal transformations are numerically perfect (they preserve norms exactly); diagonal systems are trivially solvable. A decomposition is a way of peeling back complexity to reveal the structural skeleton underneath.

This perspective explains why factorizations are not merely computational conveniences. They are mathematical revelations. LU factorization reveals which rows are "heavy" (large pivots dominate) and which are "light" (small pivots signal near-singularity). QR factorization reveals the orthogonal complement of the column space via the zero rows of $R$. Cholesky factorization reveals the "matrix square root" $L$ that encodes the geometry of a positive definite quadratic form. Each decomposition is a lens tuned to a different structural feature.

The central insight: nearly every matrix algorithm of practical importance runs in $O(n^3)$ time, dominated by the factorization phase. After factorization, the actual problem-solving (applying $L^{-1}$, $U^{-1}$, $Q^\top$) costs only $O(n^2)$. This is the fundamental return on investment: spend $O(n^3)$ once to factor, then spend $O(n^2)$ as many times as needed for different right-hand sides.

For AI: this principle appears in natural gradient methods (factor the Fisher information matrix once per outer loop, solve multiple gradient updates cheaply), in Gaussian process regression (factor the covariance matrix once, evaluate posterior at many test points cheaply), and in iterative solvers (factor a preconditioner once, apply it repeatedly to accelerate convergence).

1.2 Triangular Systems: The Foundation

Before understanding factorizations, you must understand why triangular systems are considered "solved." A lower triangular system $L\mathbf{x} = \mathbf{b}$ with $L_{ii} \neq 0$ can be solved row by row from top to bottom:

$$x_1 = b_1 / L_{11}, \qquad x_2 = (b_2 - L_{21}x_1) / L_{22}, \qquad x_i = \left(b_i - \sum_{j=1}^{i-1} L_{ij} x_j\right) / L_{ii}$$

This is forward substitution: each unknown is determined uniquely by those already computed. The computation requires about $n^2/2$ multiplications and $n^2/2$ additions - $O(n^2)$ total. Similarly, an upper triangular system $U\mathbf{x} = \mathbf{b}$ is solved by backward substitution from bottom to top.

The key properties of triangular systems:

  1. Existence and uniqueness: a triangular system has a unique solution if and only if all diagonal entries are nonzero.
  2. Numerical stability: forward/backward substitution are backward stable; a small diagonal entry makes the triangular matrix ill-conditioned and amplifies the forward error.
  3. Parallelism: the sequential dependency (each unknown requires all previous) limits parallelism, but blocked implementations can exploit BLAS parallelism within panels.
  4. Structural exploitation: sparse triangular systems (band, arrowhead, bidiagonal) can be solved in $O(nk)$ time for bandwidth $k$.

Non-example: a general dense $n \times n$ system $A\mathbf{x} = \mathbf{b}$ cannot be solved by substitution directly - it requires $O(n^3)$ work. The entire purpose of LU, QR, and Cholesky is to reduce the general case to triangular cases.

1.3 Three Canonical Decompositions

The three decompositions in this section partition the space of structured matrix problems:

| Decomposition | Matrix class | Factored form | Primary use |
| --- | --- | --- | --- |
| LU | General square, non-singular | $PA = LU$ | Solving $A\mathbf{x} = \mathbf{b}$, computing $\det(A)$ |
| QR | Any rectangular $m \times n$, $m \geq n$ | $A = QR$ | Least squares, rank determination, eigenvalue algorithms |
| Cholesky | Symmetric positive definite | $A = LL^\top$ | SPD systems, Gaussian sampling, log-det computation |

Why three? Each exploits a different structural property. LU uses no special structure beyond invertibility; it is the most general and therefore requires more care (pivoting for stability). QR uses orthogonality, which is the most numerically pristine structure - $Q$ has condition number 1 by definition. Cholesky uses both symmetry and positive definiteness, which allows working with only half the matrix and guarantees the factorization exists without pivoting.

The hierarchy of stability: Cholesky > QR > LU (partially pivoted) > LU (no pivoting). As you sacrifice structure, you need more algorithmic sophistication to maintain numerical reliability.

1.4 Why This Matters for AI

Matrix factorizations are not background infrastructure - they are in the critical path of the most computationally expensive operations in modern AI.

Gaussian process regression (used in Bayesian optimization, uncertainty quantification, and hyperparameter tuning) requires computing $K^{-1}\mathbf{y}$ and $\log\det K$ for an $n \times n$ kernel matrix $K$. Both require Cholesky factorization. A single GP training step on $n = 10{,}000$ points costs on the order of $n^3 = 10^{12}$ operations - factorization is the bottleneck.

Second-order optimization (Newton's method, natural gradient, K-FAC) requires solving Hessian or Fisher information systems $H\boldsymbol{\delta} = \mathbf{g}$. For K-FAC (Kronecker-Factored Approximate Curvature, Martens & Grosse 2015), the Fisher approximation is block-diagonal with Kronecker structure, and each block is inverted via Cholesky - thousands of small Cholesky factorizations per training step.

Linear probing and fine-tuning - fitting a linear model on top of frozen features - is a least-squares problem. Numerically reliable implementations use QR (e.g., numpy.linalg.lstsq calls LAPACK's dgelsd which uses divide-and-conquer SVD, or dgelsy which uses column-pivoted QR).

Differentiating through factorizations is required whenever a factorization appears inside a differentiable computational graph. PyTorch's torch.linalg.cholesky_solve and JAX's jax.lax.linalg.cholesky support automatic differentiation through the factorization via implicit function theorem.

Randomized methods (LoRA, sketching, randomized preconditioning) use approximate LU and QR factorizations on sketched matrices. Understanding the exact factorizations is prerequisite to understanding their randomized variants.

1.5 Historical Timeline

| Year | Contribution | Author(s) |
| --- | --- | --- |
| 1809 | Gaussian elimination applied to geodesy and least squares (elimination method published 1810) | Carl Friedrich Gauss |
| 1938 | LU factorization formalized as a matrix decomposition | Tadeusz Banachiewicz |
| 1948 | Systematic analysis of elimination and error growth | Alan Turing |
| 1954 | Givens rotations for QR | Wallace Givens |
| 1958 | Householder reflections; modern QR factorization | Alston Householder |
| 1961 | QR algorithm for eigenvalues (QR iteration) | Francis; Kublanovskaya |
| 1971 | Handbook of ALGOL programs for matrix problems (LAPACK precursor) | Wilkinson, Reinsch |
| 1979 | LINPACK (first standardized library) | Dongarra et al. |
| 1987 | LAPACK project: BLAS-3 blocked algorithms | Anderson, Bai, Dongarra et al. |
| 1998 | Randomized low-rank approximation (sketch-then-factor) | Frieze, Kannan, Vempala |
| 2011 | Randomized low-rank factorization framework | Halko, Martinsson, Tropp |
| 2022 | LoRA: low-rank adaptation of weight matrices | Hu et al. |
| 2024 | FlashAttention-3: blocked tiling for attention | Shah et al. |

2. Background: Triangular Systems

2.1 Forward Substitution

Definition. Given a lower triangular matrix $L \in \mathbb{R}^{n \times n}$ with $L_{ii} \neq 0$ for all $i$, and $\mathbf{b} \in \mathbb{R}^n$, forward substitution computes $\mathbf{x}$ satisfying $L\mathbf{x} = \mathbf{b}$ via:

$$x_i = \frac{1}{L_{ii}}\left(b_i - \sum_{j=1}^{i-1} L_{ij} x_j\right), \quad i = 1, 2, \ldots, n$$

Algorithm (row-oriented):

for i = 1 to n:
    x[i] = b[i]
    for j = 1 to i-1:
        x[i] -= L[i,j] * x[j]
    x[i] /= L[i,i]

Complexity: The inner loop runs $0, 1, \ldots, n-1$ times, giving $\sum_{i=1}^{n}(i-1) = n(n-1)/2$ multiplications and the same number of additions. Total: about $n^2$ floating-point operations, i.e. $O(n^2)$.

Numerical properties: Forward substitution is backward stable: the computed solution $\hat{\mathbf{x}}$ satisfies $(L + \delta L)\hat{\mathbf{x}} = \mathbf{b}$ with $|\delta L_{ij}| \leq n u |L_{ij}| + O(u^2)$ componentwise. The forward error, however, scales with $\kappa(L)$, the condition number of $L$. The key source of numerical difficulty is a small diagonal entry $L_{ii}$: dividing by a near-zero diagonal makes $L$ ill-conditioned and amplifies errors in the solution.

Unit lower triangular: When $L_{ii} = 1$ for all $i$ (as in the LU factorization, where $L$ is unit lower triangular by convention), the divisions are trivial and the algorithm reduces to:

for i = 1 to n:
    x[i] = b[i] - sum(L[i,1:i-1] * x[1:i-1])

This eliminates $n$ divisions, slightly improving performance and stability.

For AI: Forward substitution appears in the final step of any matrix factorization-based solver. In neural network training with second-order methods (Newton, K-FAC), the natural gradient direction $\boldsymbol{\delta} = F^{-1}\nabla\mathcal{L}$ is obtained from the Cholesky factor $L$ of the Fisher matrix $F$ by solving $L\mathbf{y} = \nabla\mathcal{L}$ (forward substitution) and then $L^\top\boldsymbol{\delta} = \mathbf{y}$ (backward substitution).
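As a concrete sketch of the row-oriented algorithm (illustrative only - production code should call scipy.linalg.solve_triangular, which wraps optimized BLAS):

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L x = b for lower-triangular L with nonzero diagonal."""
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        # x_i = (b_i - sum_{j<i} L_ij x_j) / L_ii
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

rng = np.random.default_rng(0)
L = np.tril(rng.standard_normal((5, 5))) + 5.0 * np.eye(5)  # safely nonzero diagonal
b = rng.standard_normal(5)
x = forward_substitution(L, b)
```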

2.2 Backward Substitution

Definition. Given an upper triangular $U \in \mathbb{R}^{n \times n}$ with $U_{ii} \neq 0$, backward substitution computes $\mathbf{x}$ satisfying $U\mathbf{x} = \mathbf{b}$ via:

$$x_i = \frac{1}{U_{ii}}\left(b_i - \sum_{j=i+1}^{n} U_{ij} x_j\right), \quad i = n, n-1, \ldots, 1$$

The algorithm proceeds from the last equation (which has a single unknown, $x_n = b_n / U_{nn}$) backward to the first. Complexity and numerical properties mirror forward substitution.

Column-oriented backward substitution: An equivalent formulation updates the remaining components of $\mathbf{b}$ as each $x_i$ is resolved:

for i = n downto 1:
    x[i] = b[i] / U[i,i]
    b[1:i-1] -= x[i] * U[1:i-1, i]

This accesses $U$ by columns rather than rows, which can improve cache efficiency on column-major storage (Fortran/LAPACK convention).

Connection to matrix inverse: The columns of $U^{-1}$ can be computed by solving $n$ triangular systems $U\mathbf{x}^{(k)} = \mathbf{e}_k$, one per standard basis vector. This is how scipy.linalg.solve_triangular computes $U^{-1}$ when called with a matrix right-hand side.
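The column-oriented loop above translates directly to NumPy (a sketch; `backward_substitution` is an illustrative name, and scipy.linalg.solve_triangular is the production route):

```python
import numpy as np

def backward_substitution(U, b):
    """Column-oriented solve of U x = b for upper-triangular U."""
    n = U.shape[0]
    x = np.zeros(n)
    c = b.astype(float).copy()
    for i in range(n - 1, -1, -1):
        x[i] = c[i] / U[i, i]
        # Remove x_i's contribution from the equations still to be solved.
        c[:i] -= x[i] * U[:i, i]
    return x

rng = np.random.default_rng(1)
U = np.triu(rng.standard_normal((5, 5))) + 5.0 * np.eye(5)
b = rng.standard_normal(5)
x = backward_substitution(U, b)
```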

2.3 Gaussian Elimination Revisited

Gaussian elimination is the process of reducing a matrix $A$ to upper triangular form by applying elementary row operations (EROs). Each ERO of type "subtract $m$ times row $j$ from row $i$ (with $i > j$)" corresponds to left-multiplication by an elementary lower triangular matrix (an elementary elimination matrix or Frobenius matrix):

$$E_{ij} = I - m_{ij}\mathbf{e}_i\mathbf{e}_j^\top, \quad m_{ij} = A_{ij}/A_{jj}$$

The multiplier $m_{ij}$ is the ratio that zeros out entry $(i,j)$. Applying all elimination matrices to reduce $A$ to $U$ gives:

$$E_{n,n-1} \cdots E_{31} E_{21} A = U$$

Crucially, the product of elementary lower triangular matrices is lower triangular, and their inverses are trivial:

$$E_{ij}^{-1} = I + m_{ij}\mathbf{e}_i\mathbf{e}_j^\top$$

So the product of inverses, $L = E_{21}^{-1} E_{31}^{-1} \cdots E_{n,n-1}^{-1}$, is unit lower triangular with the multipliers $m_{ij}$ in their natural positions. This gives:

$$A = LU$$

The multipliers go directly into $L$: one of the most elegant facts in numerical linear algebra is that the multipliers $m_{ij}$ used during elimination appear, without any further computation, as the subdiagonal entries $L_{ij}$ of the factor $L$. No separate computation of $L$ is required - it is built up in place during elimination.


3. LU Factorization

3.1 LU Decomposition Theorem

Theorem (LU factorization). Let $A \in \mathbb{R}^{n \times n}$. Then $A$ has an LU factorization $A = LU$ (with $L$ unit lower triangular and $U$ upper triangular) if and only if all leading principal submatrices $A_k = A[1{:}k, 1{:}k]$ for $k = 1, \ldots, n-1$ are non-singular.

Proof sketch. The condition $A_k$ non-singular for all $k$ ensures that no zero pivot is encountered during Gaussian elimination. At step $k$, the pivot is $U_{kk} = \det(A_k)/\det(A_{k-1})$, so all pivots are nonzero iff all leading minors are nonzero. Conversely, if Gaussian elimination completes without zero pivots, the multipliers are finite and $L, U$ are well-defined. $\square$

Non-examples:

  • $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ fails: $A_{11} = 0$ (a zero pivot at step 1).
  • $A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$: $A_{11} = 1 \neq 0$ but $\det(A) = 0$, so the factorization runs but $U_{22} = 0$ (singular $U$).
  • Any singular matrix: $A = LU \Rightarrow \det(A) = \det(L)\det(U) = \prod_i U_{ii}$, so $\det(A) = 0$ iff some $U_{ii} = 0$.

Uniqueness: The factorization with $L_{ii} = 1$ (unit lower triangular) is unique when it exists. Without the normalization, one could rescale $L$ and $U$ by any diagonal matrix and its inverse.

The factorization in place: LU computation overwrites $A$: the upper triangle stores $U$, the strict lower triangle stores the multipliers (i.e., the strict lower triangle of $L$, since $L_{ii} = 1$ need not be stored). This halves the memory requirement.

3.2 The LU Algorithm: Outer Product Form

The standard view of LU elimination is column-by-column. But the outer product form (also called rank-1 update form) exposes the structure more clearly and maps naturally to BLAS-3 blocked implementations.

At step $k$ of the factorization, we have already reduced columns $1, \ldots, k-1$. The remaining $(n-k+1) \times (n-k+1)$ submatrix $A^{(k)}$ (the Schur complement after $k-1$ steps) is updated by:

$$A^{(k+1)} = A^{(k)}[k{+}1{:}n,\, k{+}1{:}n] - \mathbf{l}^{(k)} \mathbf{u}^{(k)\top}$$

where $\mathbf{l}^{(k)} = A^{(k)}[k{+}1{:}n, k] / A^{(k)}_{kk}$ (the multiplier column) and $\mathbf{u}^{(k)\top} = A^{(k)}[k, k{+}1{:}n]$ (the pivot row). This is a rank-1 update of the trailing submatrix.

Algorithm:

for k = 1 to n-1:
    # Compute multipliers
    L[k+1:n, k] = A[k+1:n, k] / A[k, k]    # size (n-k)
    # Rank-1 update of trailing submatrix
    A[k+1:n, k+1:n] -= L[k+1:n, k] * A[k, k+1:n]   # BLAS-2: ger
    U[k, k:n] = A[k, k:n]

Complexity: Step $k$ costs $(n-k)$ divisions and $(n-k)^2$ multiply-adds. Total:

$$\sum_{k=1}^{n-1} \left[(n-k) + (n-k)^2\right] = \frac{n(n-1)}{2} + \frac{(n-1)n(2n-1)}{6} \approx \frac{n^3}{3}$$

So LU factorization costs about $n^3/3$ multiply-add pairs, i.e. $\frac{2n^3}{3}$ floating-point operations (flops). For comparison: solving $A\mathbf{x} = \mathbf{b}$ after factorization costs $2n^2$ (two triangular solves). The ratio $n/3$ means factorization dominates for large $n$.

For AI: the rank-1 update structure maps directly to GPU tensor cores. Blocked implementations reformulate the trailing submatrix update as a matrix-matrix multiply (BLAS-3: DGEMM), which achieves near-peak GPU throughput.
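The outer-product form is a few lines of NumPy (a sketch: no pivoting, so the test matrix is made diagonally dominant to keep every pivot safe; `lu_nopivot` is an illustrative name):

```python
import numpy as np

def lu_nopivot(A):
    """Unpivoted LU via rank-1 (outer-product) updates. Assumes nonzero pivots."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k+1:, k] = A[k+1:, k] / A[k, k]                    # multipliers
        A[k+1:, k+1:] -= np.outer(L[k+1:, k], A[k, k+1:])    # rank-1 update
        A[k+1:, k] = 0.0
    return L, np.triu(A)

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6)) + 6.0 * np.eye(6)  # diagonally dominant: safe pivots
L, U = lu_nopivot(A)
```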

3.3 Failure Without Pivoting

Naive LU (no pivoting) fails catastrophically when:

  1. Zero pivots: If $A_{kk}^{(k)} = 0$ during step $k$, the division by the pivot is undefined. This happens even for non-singular matrices that happen to have a zero leading principal minor.

  2. Small pivots - numerical catastrophe: Consider:

$$A = \begin{pmatrix} \varepsilon & 1 \\ 1 & 1 \end{pmatrix}, \quad \varepsilon = 10^{-16}$$

Naive LU gives multiplier $m_{21} = 1/\varepsilon = 10^{16}$, and $U_{22} = 1 - 10^{16} \cdot 1 \approx -10^{16}$: in floating-point, the huge multiplier absorbs the original entry (the exact value $1 - 1/\varepsilon = (\varepsilon - 1)/\varepsilon$ rounds to $-1/\varepsilon$). The computed solution $\hat{\mathbf{x}}$ can have relative error of order 1 even when the problem is well-conditioned.

  3. Growth factor explosion: The growth factor $\rho_n$ is defined as:

$$\rho_n = \frac{\max_{i,j,k} |A_{ij}^{(k)}|}{\max_{i,j} |A_{ij}|}$$

Without pivoting, $\rho_n$ can be arbitrarily large (a tiny pivot inflates the trailing submatrix without bound), making the computed $U$ useless; even with partial pivoting, the worst case (Wilkinson's matrix) reaches $2^{n-1}$.

Example of catastrophic failure:

$$A = \begin{pmatrix} 10^{-20} & 1 \\ 1 & 2 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} 1 \\ 3 \end{pmatrix}$$

The exact solution is $\mathbf{x} \approx (1, 1)^\top$. Naive LU in IEEE double precision gives multiplier $m_{21} = 10^{20}$, yielding $U_{22} = 2 - 10^{20} \approx -10^{20}$: the entry 2 is rounded away entirely, and back-substitution returns a badly wrong $x_1$.

For AI: PyTorch's torch.linalg.lu uses partial pivoting by default. numpy.linalg.solve calls LAPACK dgesv which uses partial pivoting. Never use unpivoted LU for numerical computation.
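The $2 \times 2$ example above is easy to reproduce in IEEE double precision - naive elimination by hand against a pivoted library solve:

```python
import numpy as np

# The 2x2 example from the text: a tiny pivot destroys the unpivoted solve.
eps = 1e-20
A = np.array([[eps, 1.0], [1.0, 2.0]])
b = np.array([1.0, 3.0])

# Unpivoted LU by hand: m21 = 1/eps overwhelms everything downstream.
m21 = A[1, 0] / A[0, 0]            # 1e20
U22 = A[1, 1] - m21 * A[0, 1]      # 2 - 1e20: the '2' is rounded away
y2 = b[1] - m21 * b[0]
x2 = y2 / U22
x1 = (b[0] - A[0, 1] * x2) / A[0, 0]
x_naive = np.array([x1, x2])

# Pivoted solve (LAPACK dgesv under the hood) recovers x ~ (1, 1).
x_pivoted = np.linalg.solve(A, b)
```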

3.4 Partial Pivoting: PA = LU

Idea: Before processing column $k$, swap row $k$ with the row (among $k, k+1, \ldots, n$) having the largest absolute value in column $k$. This ensures $|m_{ij}| \leq 1$ for all multipliers.

Algorithm with partial pivoting:

for k = 1 to n-1:
    # Find pivot row
    p = argmax_{i >= k} |A[i, k]|
    swap rows k and p in A (and record in permutation P)
    # Now A[k,k] is the largest in column k below
    L[k+1:n, k] = A[k+1:n, k] / A[k, k]       # |L[i,k]| <= 1
    A[k+1:n, k+1:n] -= L[k+1:n, k] * A[k, k+1:n]

Result: $PA = LU$ where $P$ is a permutation matrix encoding all row swaps. The factor $L$ satisfies $|L_{ij}| \leq 1$ for all $i > j$.

Growth factor with partial pivoting: The theoretical bound is $\rho_n \leq 2^{n-1}$, but in practice the growth factor is almost always small ($\rho_n \approx n^{2/3}$ on average for random matrices). Matrices approaching the exponential worst case are essentially never encountered outside constructed examples.

Stability theorem (Wilkinson 1961): Partial pivoting produces a computed factorization $\hat{L}\hat{U}$ such that:

$$PA = \hat{L}\hat{U} + E, \quad \|E\|_\infty \leq 8n^3 u \, \rho_n \, \|A\|_\infty$$

where $u$ is the unit roundoff (e.g., $u = 2^{-53}$ for IEEE double). For typical $\rho_n$, this bound is well within practical tolerance.

Storage of the permutation: $P$ is stored as a pivot vector $\mathbf{p} \in \mathbb{Z}^{n-1}$ where $p_k$ is the row index selected at step $k$. Applying $P$ to $\mathbf{b}$ before the triangular solve costs $O(n)$.
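The pivoted algorithm above in NumPy (a sketch storing multipliers in place; production code should call scipy.linalg.lu_factor, and `lu_partial_pivot` is an illustrative name):

```python
import numpy as np

def lu_partial_pivot(A):
    """Return piv, L, U with A[piv] = L @ U and |L[i,j]| <= 1."""
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))       # largest entry in column k
        if p != k:
            A[[k, p]] = A[[p, k]]                 # full-row swap also swaps stored multipliers
            piv[[k, p]] = piv[[p, k]]
        A[k+1:, k] /= A[k, k]                     # multipliers stored in the strict lower triangle
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return piv, L, U

rng = np.random.default_rng(3)
A = rng.standard_normal((7, 7))
piv, L, U = lu_partial_pivot(A)
```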

Non-example: Partial pivoting does NOT guarantee that $\rho_n$ is small. Wilkinson's matrix, shown here for $n = 4$,

$$W = \begin{pmatrix} 1 & 0 & 0 & 1 \\ -1 & 1 & 0 & 1 \\ -1 & -1 & 1 & 1 \\ -1 & -1 & -1 & 1 \end{pmatrix}$$

(unit diagonal, $-1$ below the diagonal, ones in the last column) never triggers a row swap, yet each elimination step doubles the last column, giving $\rho_n = 2^{n-1}$.

3.5 Complete Pivoting: PAQ = LU

Complete pivoting selects both the row and column with the globally largest absolute entry at each step:

At step $k$: find $(p, q) = \arg\max_{i \geq k,\, j \geq k} |A_{ij}|$, then swap row $k \leftrightarrow p$ and column $k \leftrightarrow q$.

Result: $PAQ = LU$ where $P$ and $Q$ are permutation matrices (acting on rows and columns respectively).

Growth factor bound: Complete pivoting satisfies the tighter bound:

$$\rho_n \leq \left(n \cdot 2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)}\right)^{1/2}$$

which grows much more slowly than $2^{n-1}$ but is still superlinear. Numerically, complete pivoting is extremely stable - no practical example of large growth is known.

Cost: Finding the global maximum at each step requires examining the full trailing submatrix: $O((n-k)^2)$ comparisons at step $k$, about $n^3/3$ in total - comparable to the arithmetic cost itself, and with poor memory locality. This is why complete pivoting is rarely used in practice.

When to use complete pivoting:

  • When stability is paramount (e.g., certified computation, interval arithmetic)
  • When the matrix is suspected to be rank-deficient (complete pivoting exposes rank)
  • When solving very ill-conditioned systems where partial pivoting is known to fail

For AI: Complete pivoting is rarely used in mainstream ML workflows. Partial pivoting suffices for almost all practical problems. The theoretical importance of complete pivoting is in establishing lower bounds and in the analysis of rank-revealing factorizations.

3.6 Rook Pivoting

Rook pivoting is a middle ground between partial and complete pivoting. At step kk:

  1. Find the largest element in column kk (as in partial pivoting) - say row pp.
  2. Check if Apk|A_{pk}| is also the largest in row pp. If yes, use (p,k)(p, k) as pivot.
  3. If not, find the largest in row pp, swap columns, repeat until convergence.

The name comes from the chess rook: the pivot search alternates between column and row moves until it finds a position that is simultaneously the largest in its row and column - a "rook position."

Properties:

  • Growth factor: far better behaved than partial pivoting - Foster (1997) established a subexponential worst-case bound, and in practice growth is comparable to complete pivoting.
  • Cost: typically $O(n)$ additional comparisons per step (the rook search usually converges in 1-2 alternations), $O(n^2)$ per step in the worst case.
  • Rank-revealing: rook pivoting reveals rank approximately as well as complete pivoting: for a matrix of numerical rank $r$, the diagonal of $U$ shows a clear drop after position $r$.

3.7 LU Stability Analysis and Backward Error

The definitive framework for understanding LU stability is backward error analysis, developed by Wilkinson in the 1960s.

Definition (backward error). The backward error of a computed solution $\hat{\mathbf{x}}$ to $A\mathbf{x} = \mathbf{b}$ is the smallest perturbation $\delta A, \delta\mathbf{b}$ such that:

$$(A + \delta A)\hat{\mathbf{x}} = \mathbf{b} + \delta\mathbf{b}$$

A method is backward stable if $\|\delta A\| / \|A\|$ and $\|\delta\mathbf{b}\| / \|\mathbf{b}\|$ are of order $u$ (unit roundoff) regardless of the input.

Backward error theorem for LU with partial pivoting:

Theorem (Wilkinson 1961). Let $\hat{\mathbf{x}}$ be the solution computed by LU with partial pivoting in IEEE double precision. Then:

$$(A + \delta A)\hat{\mathbf{x}} = \mathbf{b}, \quad \|\delta A\|_\infty \leq c_n u \, \rho_n \, \|A\|_\infty$$

where $c_n = O(n^2)$ is a modest polynomial constant and $u = 2^{-53} \approx 1.1 \times 10^{-16}$.

Forward error bound: The forward error satisfies:

$$\frac{\|\mathbf{x} - \hat{\mathbf{x}}\|_\infty}{\|\mathbf{x}\|_\infty} \leq c_n u \, \rho_n \, \kappa_\infty(A) + O(u^2)$$

where $\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty$ is the condition number. This is the key identity: numerical error $=$ roundoff $\times$ growth factor $\times$ condition number.

Implication for practice: If $\kappa(A) \approx 10^k$ in double precision, you lose about $k$ digits of accuracy. With $u \approx 10^{-16}$, you can expect about $16 - k$ correct digits in $\hat{\mathbf{x}}$.

Backward stability of QR: Householder QR is backward stable with no growth-factor penalty ($\rho_n = 1$): Householder reflectors are orthogonal and preserve norms exactly. This is why QR is preferred for ill-conditioned systems.
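The distinction between backward and forward error can be checked numerically (a sketch: the matrix is built with a prescribed singular-value spectrum so that $\kappa_2(A) \approx 10^{10}$; the residual stays at roundoff level while the forward error grows with the conditioning):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
# Ill-conditioned matrix with known solution: kappa_2(A) ~ 1e10.
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q1 @ np.diag(np.logspace(0, -10, n)) @ Q2
x_true = rng.standard_normal(n)
b = A @ x_true

x_hat = np.linalg.solve(A, b)   # pivoted LU (LAPACK dgesv) under the hood

# Normwise relative residual (backward error proxy): O(u) for a stable solver.
backward = np.linalg.norm(b - A @ x_hat) / (np.linalg.norm(A) * np.linalg.norm(x_hat))
# Forward error: inflated by kappa(A), as the bound predicts.
forward = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
```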

3.8 Blocked LU for Cache Efficiency

Modern hardware has a deep memory hierarchy (L1: ~32 KB, L2: ~256 KB, L3: ~8 MB, DRAM beyond). The key metric for algorithmic performance is arithmetic intensity: floating-point operations per byte of data moved. BLAS operations differ sharply in intensity:

| Operation | BLAS level | Flops | Data | Intensity |
| --- | --- | --- | --- | --- |
| $\mathbf{y} \leftarrow A\mathbf{x}$ (DGEMV) | 2 | $2n^2$ | $n^2$ | $O(1)$ |
| $C \leftarrow AB + C$ (DGEMM) | 3 | $2n^3$ | $n^2$ | $O(n)$ |

Blocked LU reformulates the algorithm so that the dominant computation is DGEMM (matrix-matrix multiply), achieving $O(n)$ arithmetic intensity and near-peak CPU/GPU throughput.

Block algorithm (block size $b$):

for k = 0 to n/b - 1:
    # Factor the current panel A[kb:n, kb:(k+1)b] with partial pivoting
    # (unblocked, BLAS-2); split its L factor into the b-by-b block L_kk
    # and the block below it, L_21
    P_k, L_kk, L_21, U_kk = LU_factor_panel(A[kb:n, kb:(k+1)b])
    # Apply the panel's row swaps across the remaining columns
    A[kb:n, (k+1)b:n] = P_k @ A[kb:n, (k+1)b:n]
    # Solve for the U panel (triangular solve, BLAS-3: DTRSM)
    U_12 = solve_lower_triangular(L_kk, A[kb:(k+1)b, (k+1)b:n])
    # Update trailing submatrix (BLAS-3: DGEMM)
    A[(k+1)b:n, (k+1)b:n] -= L_21 @ U_12     # This is the bottleneck

The trailing submatrix update (A -= L_21 @ U_12) is a pure DGEMM and dominates the total computation. On a modern CPU/GPU with peak DGEMM performance, this block formulation achieves 80-95% of theoretical peak.

LAPACK routine: DGETRF implements blocked LU with partial pivoting. The block size bb is tuned automatically per platform via the ilaenv oracle.

For AI: Blocked LU is the algorithm behind PyTorch's torch.linalg.lu_factor, jax.scipy.linalg.lu, and scipy's lu_factor. For GPU, cuSOLVER implements batched blocked LU for thousands of small systems simultaneously - critical for K-FAC training.
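A minimal unpivoted version of the blocked scheme (a sketch only: real implementations like DGETRF pivot within each panel; here a strongly diagonally dominant matrix keeps the pivots safe, and `lu_blocked` is an illustrative name):

```python
import numpy as np
from scipy.linalg import solve_triangular

def lu_blocked(A, b=32):
    """Right-looking blocked LU without pivoting (assumes a strong diagonal).
    The trailing update is a single GEMM per block step."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # Unblocked factorization of the diagonal block A[k:e, k:e]
        for j in range(k, e - 1):
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        L_kk = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        U_kk = np.triu(A[k:e, k:e])
        # U panel: L_kk @ U12 = A12   (BLAS-3 TRSM)
        A[k:e, e:] = solve_triangular(L_kk, A[k:e, e:], lower=True, unit_diagonal=True)
        # L panel: L21 @ U_kk = A21   (TRSM from the right)
        A[e:, k:e] = solve_triangular(U_kk.T, A[e:, k:e].T, lower=True).T
        # Trailing update: one big GEMM
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return np.tril(A, -1) + np.eye(n), np.triu(A)

rng = np.random.default_rng(5)
n = 96
A = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally dominant
L, U = lu_blocked(A)
```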

3.9 Solving Ax = b via LU

Given the LU factorization $PA = LU$, solving $A\mathbf{x} = \mathbf{b}$ proceeds in four steps:

  1. Apply permutation: $\mathbf{c} = P\mathbf{b}$ ($O(n)$, just index lookup)
  2. Forward solve: solve $L\mathbf{y} = \mathbf{c}$ ($O(n^2)$, forward substitution)
  3. Backward solve: solve $U\mathbf{x} = \mathbf{y}$ ($O(n^2)$, backward substitution)
  4. (Optional) Iterative refinement: compute $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$, solve for the correction $\delta\mathbf{x}$, update

The total cost after factorization is $2n^2$ flops - negligible compared to the $\frac{2n^3}{3}$ factorization cost for large $n$.

Multiple right-hand sides: For $k$ different vectors $\mathbf{b}_1, \ldots, \mathbf{b}_k$, one factorization costing $\frac{2n^3}{3}$ is followed by $k$ triangular solve pairs, each costing $2n^2$. This is the key economic argument for LU: amortize the factorization cost over many solves.

Computing the determinant: $\det(A) = \det(P^{-1})\det(L)\det(U) = (-1)^s \prod_{i=1}^n U_{ii}$, where $s$ is the number of row swaps. Since $L$ is unit lower triangular, $\det(L) = 1$.

Computing $A^{-1}$: Solve $A\mathbf{x}^{(k)} = \mathbf{e}_k$ for $k = 1, \ldots, n$ (one factorization, $n$ triangular solves). But explicitly forming $A^{-1}$ is almost always unnecessary and should be avoided - solve the system directly instead.
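In SciPy the factor-once/solve-many workflow and the determinant recipe look like this (the swap count is read off the pivot vector in the convention exposed by scipy.linalg.lu_factor):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(6)
n = 8
A = rng.standard_normal((n, n))

# Factor once (O(n^3)); piv encodes the row interchanges.
lu, piv = lu_factor(A)

# Solve for several right-hand sides at O(n^2) each.
B = rng.standard_normal((n, 3))
X = lu_solve((lu, piv), B)

# Determinant from the factors: sign from the swap count, magnitude
# from the diagonal of U (stored in the upper triangle of lu).
swaps = np.sum(piv != np.arange(n))
det_from_lu = (-1.0) ** swaps * np.prod(np.diag(lu))
```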

3.10 Rank-Deficient LU

When $A$ is singular or numerically rank-deficient, the LU factorization (even with pivoting) produces a $U$ with some diagonal entries near zero. The factorization still "works" mechanically but produces $U_{kk} \approx 0$ at a position $k$ corresponding to a dependent column.

Signature of rank deficiency (heuristic; the pivoting order matters):

$$\operatorname{rank}(A) = r \quad\Longrightarrow\quad U_{11}, \ldots, U_{rr} \text{ are "large" and } U_{r+1,r+1}, \ldots, U_{nn} \approx 0$$

Problem: With partial pivoting, the threshold for "approximately zero" is ambiguous. If $|U_{kk}| / |U_{11}| < \varepsilon_{\text{tol}}$, we declare that the matrix has numerical rank $< k$. But the choice of $\varepsilon_{\text{tol}}$ is problem-dependent.

Rank-revealing LU: Column-pivoted LU (complete or rook pivoting) provides better rank-revelation: the diagonal entries of $U$ decay in a manner that reflects the true rank structure.

Better alternative for rank-deficient problems: Use rank-revealing QR (RRQR, 4.6) or the SVD. The SVD provides the definitive rank determination via singular value thresholding - at the same $O(n^3)$ asymptotic cost as LU, though with a larger constant. RRQR provides a cheaper, nearly-as-reliable alternative.

For AI: Neural network weight updates often have approximately low rank (Hu et al., 2022 argue that the weight updates made during fine-tuning are intrinsically low-rank). Rank-deficient LU can be used to detect this structure, but RRQR or randomized SVD are preferred in practice.
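Rank estimation from the diagonal of a column-pivoted $R$, as suggested above (a sketch; the tolerance $10^{-6}$ is an illustrative, problem-dependent choice):

```python
import numpy as np
from scipy.linalg import qr

# A 100x40 matrix of exact rank 10, plus tiny noise.
rng = np.random.default_rng(7)
A = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 40))
A += 1e-10 * rng.standard_normal(A.shape)

# Column-pivoted QR: |diag(R)| is non-increasing; a large gap marks the rank.
Q, R, piv = qr(A, mode='economic', pivoting=True)
d = np.abs(np.diag(R))
rank_est = int(np.sum(d > 1e-6 * d[0]))
```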


4. QR Factorization: Computational Algorithms

4.1 From Gram-Schmidt to Algorithms

Recall from 05: The QR factorization $A = QR$ of an $m \times n$ matrix ($m \geq n$) decomposes $A$ into an orthonormal factor $Q \in \mathbb{R}^{m \times n}$ (or $m \times m$ for the full QR) and an upper triangular factor $R \in \mathbb{R}^{n \times n}$. This was constructed via Gram-Schmidt orthogonalization in 05.

The problem with classical Gram-Schmidt: Classical Gram-Schmidt (CGS) is mathematically correct but numerically unstable for ill-conditioned matrices. The computed $Q$ can lose orthogonality catastrophically: the loss behaves like $\|Q^\top Q - I\|_F = O(\kappa(A)^2 \, u)$, so for a matrix with condition number $10^8$ the computed $Q$ can fail to be orthogonal at the $O(1)$ level, ruining subsequent computations.

Modified Gram-Schmidt (MGS): Reorders the orthogonalization to reduce error accumulation, giving $\|Q^\top Q - I\|_F = O(\kappa(A) \, u)$ - much better, but still insufficient for highly ill-conditioned problems.

Householder and Givens: Both achieve $\|Q^\top Q - I\|_F = O(u)$ regardless of $\kappa(A)$, making them the preferred algorithms for production use. The key is that they build $Q$ as a product of exactly orthogonal elementary transformations (reflectors or rotations), never losing orthogonality during construction.

For AI: numpy.linalg.qr and scipy.linalg.qr use LAPACK DGEQRF (Householder QR). torch.linalg.qr similarly calls cuSOLVER which uses Householder on GPU. Understanding the algorithmic basis helps you interpret condition number warnings and numerical precision behavior.
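The orthogonality loss is easy to observe directly (a NumPy sketch; the Vandermonde matrix is just a convenient ill-conditioned example):

```python
import numpy as np

def cgs(A):
    """Classical Gram-Schmidt QR (numerically unstable for ill-conditioned A)."""
    m, n = A.shape
    Q = np.zeros((m, n)); R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]   # classical: project against the ORIGINAL column
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

A = np.vander(np.linspace(0, 1, 60), 12, increasing=True)   # kappa(A) ~ 1e8
Q_cgs, _ = cgs(A)
Q_hh, _ = np.linalg.qr(A)                                   # LAPACK Householder QR

err_cgs = np.linalg.norm(Q_cgs.T @ Q_cgs - np.eye(12))
err_hh = np.linalg.norm(Q_hh.T @ Q_hh - np.eye(12))
# err_hh stays near machine epsilon; err_cgs is many orders of magnitude larger
```
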

4.2 Householder Reflections

Definition. A Householder reflector (or Householder transformation) is a matrix of the form:

H = I - 2\frac{\mathbf{v}\mathbf{v}^\top}{\mathbf{v}^\top\mathbf{v}} = I - \frac{2}{\|\mathbf{v}\|_2^2}\mathbf{v}\mathbf{v}^\top

where \mathbf{v} \in \mathbb{R}^m is a nonzero Householder vector. The matrix H is symmetric (H = H^\top) and orthogonal (H^\top H = H^2 = I), so H^{-1} = H - it is its own inverse.

Geometric interpretation: H reflects vectors across the hyperplane perpendicular to \mathbf{v}. Every point \mathbf{x} in the hyperplane satisfies H\mathbf{x} = \mathbf{x} (invariant). Every point c\mathbf{v} along the \mathbf{v} direction satisfies H(c\mathbf{v}) = -c\mathbf{v} (negated).

Key property: Given any nonzero vector \mathbf{a} \in \mathbb{R}^m, there exists a Householder reflector H such that H\mathbf{a} = -\operatorname{sign}(a_1)\|\mathbf{a}\|_2 \mathbf{e}_1. That is, H maps \mathbf{a} to a multiple of the first standard basis vector, zeroing every entry below the first.

Construction (numerical form): Given \mathbf{a} \in \mathbb{R}^m to be zeroed below its first entry:

\alpha = -\operatorname{sign}(a_1)\|\mathbf{a}\|_2

\mathbf{v} = \mathbf{a} - \alpha\mathbf{e}_1 = \begin{pmatrix} a_1 - \alpha \\ a_2 \\ \vdots \\ a_m \end{pmatrix}

H = I - \frac{2}{\|\mathbf{v}\|_2^2}\mathbf{v}\mathbf{v}^\top, \quad H\mathbf{a} = \alpha\mathbf{e}_1

The sign convention (critical for stability): We choose \alpha = -\operatorname{sign}(a_1)\|\mathbf{a}\|_2 to avoid cancellation in computing v_1 = a_1 - \alpha. If a_1 > 0, choosing \alpha = -\|\mathbf{a}\|_2 gives v_1 = a_1 + \|\mathbf{a}\|_2 > a_1, avoiding the catastrophic subtraction that would occur with \alpha = +\|\mathbf{a}\|_2.

Cost of applying H: Computing H\mathbf{x} via an explicit matrix-vector product costs O(m^2), but using the formula H\mathbf{x} = \mathbf{x} - 2\mathbf{v}(\mathbf{v}^\top\mathbf{x})/(\mathbf{v}^\top\mathbf{v}) costs only O(m) - first compute the scalar s = \mathbf{v}^\top\mathbf{x}, then update \mathbf{x} \leftarrow \mathbf{x} - (2s/\|\mathbf{v}\|^2)\mathbf{v}. This is the implicit representation of H.

4.3 Householder QR Algorithm

Algorithm: Apply Householder reflectors successively to zero out subdiagonal entries of each column.

for k = 1 to n:                                # n columns
    # Construct H_k to zero out A[k+1:m, k]
    a = A[k:m, k]                              # current column
    alpha = -sign(a[0]) * norm(a)
    v = a.copy(); v[0] -= alpha
    v /= norm(v)                               # normalized Householder vector
    # Apply H_k implicitly: rank-1 update A <- A - 2 v (v^T A)
    A[k:m, k:n] -= 2 * outer(v, v.T @ A[k:m, k:n])
    # Store v in lower triangle of A (LAPACK convention)
    A[k+1:m, k] = v[1:]

After n steps, A has been overwritten with R in the upper triangle and the Householder vectors in the lower triangle (the implicit storage of Q).
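A compact runnable version of the algorithm above (NumPy sketch; for checking it accumulates Q explicitly rather than using the LAPACK lower-triangle storage convention):

```python
import numpy as np

def householder_qr(A):
    """Full QR via Householder reflectors: returns Q (m x m), R (m x n)."""
    R = A.astype(float).copy()
    m, n = R.shape
    Q = np.eye(m)
    for k in range(n):
        a = R[k:, k]
        alpha = -np.copysign(np.linalg.norm(a), a[0])   # sign choice avoids cancellation
        v = a.copy()
        v[0] -= alpha
        vnorm = np.linalg.norm(v)
        if vnorm == 0.0:                                # column already zeroed out
            continue
        v /= vnorm
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])   # apply H_k implicitly
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)     # accumulate Q = H_1 H_2 ... H_n
    return Q, np.triu(R)

A = np.random.default_rng(1).standard_normal((7, 5))
Q, R = householder_qr(A)
```
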

Implicit Q representation: The full Q matrix is Q = H_1 H_2 \cdots H_n and has size m \times m; it is rarely formed explicitly. LAPACK's DORGQR computes the explicit Q when needed - for the thin QR, only the first n columns are required, costing O(mn^2) additional work and O(mn) storage.

Complexity:

  • Applying H_k to the trailing submatrix: O((m-k)(n-k)) per step
  • Total: \sum_{k=1}^{n} O((m-k)(n-k)) \approx 2mn^2 - 2n^3/3 flops
  • For square m = n: 4n^3/3 flops (vs 2n^3/3 for LU)

Numerical stability: Householder QR is backward stable. The computed \hat{Q}, \hat{R} satisfy:

A + E = \hat{Q}\hat{R}, \quad \|E\|_F \leq c_{mn} u \|A\|_F

with c_{mn} a modest polynomial constant and u machine epsilon. The orthogonality error satisfies \|\hat{Q}^\top\hat{Q} - I\|_F = O(u), independent of \kappa(A).

LAPACK: DGEQRF computes the Householder QR in blocked form (block Householder, using the WY representation). DORMQR applies Q or Q^\top to a matrix without forming Q explicitly.

For AI: The Householder algorithm is used in PyTorch's torch.linalg.qr, in scipy's linalg.qr, and in LAPACK DGEQRF. It is the foundation for numerically reliable QR-based least-squares solvers used in linear probing, weight matrix analysis, and second-order methods.

4.4 Givens Rotations

Definition. A Givens rotation G(i,j,\theta) \in \mathbb{R}^{n \times n} is the identity matrix with a 2 \times 2 rotation embedded in rows and columns i and j:

G(i,j,\theta) = \begin{pmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & s & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & -s & \cdots & c & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{pmatrix}

with c = \cos\theta and s = \sin\theta in positions (i,i), (i,j), (j,i), (j,j).

Key property: G(i,j,\theta) is orthogonal. Left-multiplying \mathbf{x} by G rotates the (x_i, x_j) plane by angle \theta.

Zeroing a specific entry: Given \mathbf{x} with entries x_i and x_j, choose:

r = \sqrt{x_i^2 + x_j^2}, \quad c = x_i/r, \quad s = x_j/r

Then (G\mathbf{x})_i = c\,x_i + s\,x_j = r and (G\mathbf{x})_j = -s\,x_i + c\,x_j = 0 - the rotation zeros out x_j while preserving x_i as r.

Numerically stable computation (LAPACK DLARTG):

if x_j == 0:
    c = 1, s = 0, r = x_i
elif |x_j| > |x_i|:
    t = x_i / x_j; s = 1/sqrt(1+t^2); c = s*t; r = x_j/s
else:
    t = x_j / x_i; c = 1/sqrt(1+t^2); s = c*t; r = x_i/c

This avoids overflow and underflow in computing \sqrt{x_i^2 + x_j^2}.
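A direct transcription of the stable construction (Python sketch, using the convention c = x_i/r, s = x_j/r from above):

```python
import math

def givens(xi, xj):
    """Return (c, s, r) such that [[c, s], [-s, c]] @ [xi, xj] = [r, 0]."""
    if xj == 0.0:
        return 1.0, 0.0, xi
    if abs(xj) > abs(xi):
        t = xi / xj                        # |t| <= 1, so 1 + t^2 cannot overflow
        s = 1.0 / math.sqrt(1.0 + t * t)
        c = s * t
        r = xj / s
    else:
        t = xj / xi
        c = 1.0 / math.sqrt(1.0 + t * t)
        s = c * t
        r = xi / c
    return c, s, r

c, s, r = givens(3.0, 4.0)   # rotation zeroing the second component
```
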

Cost: Applying G(i,j,\theta) on the left of an m \times n matrix costs about 6n flops (touching only rows i and j). Constructing G costs O(1).

For AI: Givens rotations appear in updating QR factorizations when a new row is appended (streaming QR), which is used in online learning and in updating Cholesky factors after rank-1 additions (Cholesky rank-1 update, relevant to incremental GP regression).

4.5 Givens QR and Sparse Matrices

Algorithm: Apply Givens rotations sequentially to zero out subdiagonal entries of A column by column. To zero A_{ij} (i > j), apply G(j, i, \theta) on the left.

For a dense m \times n matrix, the total number of Givens rotations needed is mn - n(n+1)/2 \approx mn; each touches two rows of the trailing submatrix at O(n) flops, for a total of roughly 3mn^2 - n^3 flops - about 50% more than Householder QR, which is why Householder wins for dense matrices.

Advantage for sparse matrices: Each Givens rotation touches exactly two rows and two columns. If A has a known sparsity pattern, Givens rotations can be sequenced to minimize fill-in (new nonzeros created by the transformation). In contrast, a single Householder reflector H_k is a rank-1 update that touches an entire panel, potentially creating dense fill.

Banded matrices: For a banded matrix with bandwidth b, Givens QR costs O(nb^2) instead of O(nb^2 + n^2 b) for Householder, and the resulting R remains banded. This is critical for signal processing and finite-element computations.

Online/streaming QR: When a new row \mathbf{a}^\top is appended to A, the existing QR factorization can be updated using a single sequence of n Givens rotations, costing O(n^2) rather than recomputing from scratch in O(mn^2).

For AI: Streaming QR via Givens is used in online learning settings where data arrives sequentially. It also underlies incremental SVD algorithms used for streaming PCA (e.g., in continual learning, where new tasks arrive without access to past data).

4.6 Column-Pivoted QR (RRQR)

Motivation: Standard QR doesn't reveal rank. The diagonal entries of R decay, but not necessarily in a way that makes rank determination reliable. Column-pivoted QR (RRQR) reorders the columns of A to expose rank structure in R.

Algorithm: At step k, before computing the k-th Householder reflector, swap column k with the column j \geq k having maximum 2-norm among the remaining columns. This produces:

A P = QR

where P is a permutation and R has the property that |R_{11}| \geq |R_{22}| \geq \cdots \geq |R_{nn}|.

Rank estimation: If A has numerical rank r, the diagonal of R typically shows a sharp drop: |R_{rr}| \gg |R_{r+1,r+1}|. (Plain column pivoting can fail to expose the gap on contrived examples such as the Kahan matrix, but it is reliable in practice.) The threshold \tau for declaring rank is:

r = \max\{k : |R_{kk}| > \tau \cdot |R_{11}|\}
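With SciPy's column-pivoted QR, this rank test is a few lines (sketch; the rank-3 test matrix and the tolerance 1e-6 are arbitrary choices):

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(2)
B = rng.standard_normal((60, 3))
C = rng.standard_normal((3, 8))
A = B @ C + 1e-10 * rng.standard_normal((60, 8))   # numerical rank 3 plus tiny noise

Q, R, piv = qr(A, mode='economic', pivoting=True)  # A[:, piv] = Q @ R
d = np.abs(np.diag(R))
tau = 1e-6
r = int(np.sum(d > tau * d[0]))                    # estimated numerical rank
```
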

RRQR theorem (Chandrasekaran & Ipsen 1994; Gu & Eisenstat 1996): Strong RRQR algorithms guarantee, for k = 1, \ldots, r,

\sigma_k(R_{11}) \geq \frac{\sigma_k(A)}{\sqrt{1 + f^2 k(n-k)}}

for a user-chosen constant f slightly greater than 1, ensuring that R_{11} (the leading r \times r block) captures essentially all the energy of the matrix.

Connection to SVD: RRQR is a cheap (O(mn^2)) approximation to the SVD. It provides the same rank information and a good basis for the column space, but only approximate singular values. For exact singular values, use the SVD at O(mn^2 + n^3) cost.

For AI:

  • LoRA (Hu et al., 2022): Low-Rank Adaptation uses rank-r decompositions of weight updates \Delta W = BA. Choosing r requires rank estimation - RRQR is one approach.
  • DoRA (Liu et al., 2024): Decomposes weight matrices into magnitude and direction components; rank-revealing factorizations are one tool for analyzing the low-rank structure of the directional part.
  • Intrinsic dimensionality: RRQR-based rank estimation reveals the intrinsic dimensionality of weight matrices, guiding compression decisions.

4.7 Tall-Skinny QR (TSQR)

Motivation: For matrices A \in \mathbb{R}^{m \times n} with m \gg n (tall and skinny - e.g., m = 10^9, n = 100), standard Householder QR communicates O(n^2) data between levels of the memory hierarchy at each of n steps, totaling O(n^3) words of communication. For distributed or GPU computation, this is the bottleneck.

TSQR algorithm (Demmel et al., 2008):

  1. Local factorization: Partition A into P panels A_1, \ldots, A_P (one per processor/GPU block).
  2. Local QR: Factor each A_i = Q_i R_i independently (no communication).
  3. Reduction tree: Stack \begin{pmatrix} R_1 \\ R_2 \end{pmatrix}, factor its QR to get R_{12}. Repeat up the tree.
  4. Result: R is the final upper triangular factor. Q can be recovered by traversing the reduction tree backward.

Communication cost: TSQR communicates O(n^2 \log P) words (vs O(mn \cdot n/b) for blocked Householder), achieving communication optimality.
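A two-block sketch of the reduction in NumPy (a real implementation would recurse over a tree of P blocks):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((1000, 8))
A1, A2 = A[:500], A[500:]

# Steps 1-2: independent local QRs (no communication between blocks)
Q1, R1 = np.linalg.qr(A1)
Q2, R2 = np.linalg.qr(A2)

# Step 3: one reduction step - QR of the stacked 16 x 8 R factors
Q12, R = np.linalg.qr(np.vstack([R1, R2]))

# Step 4: recover the thin Q by pushing the reduction Q back down the tree
Q = np.vstack([Q1 @ Q12[:8], Q2 @ Q12[8:]])
```
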

For AI:

  • Distributed training: TSQR is used in distributed least-squares and in computing the QR of gradient matrices across multiple GPUs.
  • Randomized methods: TSQR underlies the "sketch-then-QR" approach for randomized low-rank factorization.
  • Attention-efficient computation: The block-tile structure of FlashAttention-3 (Shah et al., 2024) follows the same communication-minimizing blocking principles that motivate TSQR.

4.8 QR for Least Squares

Problem: Given A \in \mathbb{R}^{m \times n} with m > n (overdetermined) and \mathbf{b} \in \mathbb{R}^m, find \mathbf{x}^* minimizing \|A\mathbf{x} - \mathbf{b}\|_2^2.

Method 1 (Normal equations): \mathbf{x}^* = (A^\top A)^{-1} A^\top \mathbf{b}. Form A^\top A, factor via Cholesky, solve. Cost: O(mn^2 + n^3). Problem: \kappa(A^\top A) = \kappa(A)^2 - squaring the condition number doubles the number of digits lost.

Method 2 (QR): Factor A = QR (thin QR), then:

\|A\mathbf{x} - \mathbf{b}\|_2^2 = \|QR\mathbf{x} - \mathbf{b}\|_2^2 = \|R\mathbf{x} - Q^\top\mathbf{b}\|_2^2 + \|(I - QQ^\top)\mathbf{b}\|_2^2

Minimizing over \mathbf{x}: solve R\mathbf{x} = Q^\top\mathbf{b} (upper triangular, backward substitution). Cost: O(mn^2) for the QR, O(mn) for Q^\top\mathbf{b}, O(n^2) for the backsolve. The condition number of R is \kappa(A), not \kappa(A)^2.
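The procedure in Method 2 as a NumPy sketch (random well-conditioned data, so the normal equations agree and serve as a cross-check):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((100, 4))
b = rng.standard_normal(100)

Q, R = np.linalg.qr(A)                 # thin QR: Q (100 x 4), R (4 x 4)
x_qr = np.linalg.solve(R, Q.T @ b)     # back-substitution on R x = Q^T b

# Normal equations for comparison (safe here only because A is well conditioned)
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
```
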

Stability comparison:

| Method | Condition amplification | Cost | When to use |
| --- | --- | --- | --- |
| Normal equations + Cholesky | \kappa(A)^2 | O(mn^2 + n^3) | Well-conditioned A, m \gg n |
| QR (Householder) | \kappa(A) | O(mn^2) | General purpose, numerically safe |
| RRQR | \kappa(A), reveals rank | O(mn^2) | Rank-deficient or near-singular A |
| SVD (truncated) | None (sets small SVs to zero) | O(mn^2 + n^3) | Ill-conditioned, rank-deficient |

For AI:

  • Linear probing: Fitting a linear classifier on top of frozen features is an overdetermined least-squares problem. sklearn.linear_model.LinearRegression uses LAPACK DGELSD (divide-and-conquer SVD) or DGELSY (column-pivoted QR).
  • Ridge regression: \min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 + \lambda\|\mathbf{x}\|_2^2 is equivalent to the augmented system \begin{pmatrix} A \\ \sqrt{\lambda}I \end{pmatrix} \mathbf{x} = \begin{pmatrix} \mathbf{b} \\ \mathbf{0} \end{pmatrix}, which QR solves stably.
  • Attention weight regression: Computing attention pattern regression (e.g., for mechanistic interpretability) uses QR-based least squares.

5. Cholesky Factorization (Computational Recap)

5.1 Brief Overview

Full treatment: The complete theory of Cholesky factorization - existence proofs, LDL^T, modified Cholesky, log-determinant, connection to PSD cone - is in 07: Positive Definite Matrices. This section covers only the computational aspects not covered in 07: the relationship to LU, blocked algorithms, and LAPACK routines.

For a symmetric positive definite matrix A \succ 0, the Cholesky factorization is:

A = LL^\top

where L is lower triangular with positive diagonal entries L_{ii} > 0. The factorization exists and is unique for every A \succ 0 (Theorem from 07).

Why Cholesky for SPD systems:

  1. Efficiency: Cholesky costs n^3/3 flops - exactly half of LU's 2n^3/3 - because symmetry halves the work.
  2. Storage: Only the lower triangle needs to be stored (n(n+1)/2 entries vs n^2 for LU).
  3. No pivoting needed: Positive definiteness guarantees L_{ii} > 0 at every step without any row interchanges.
  4. Stability: Cholesky is backward stable for SPD matrices without any pivoting.

5.2 Cholesky as Specialized LU

The connection between Cholesky and LU: for a symmetric positive definite A, the LU factorization (without pivoting, which exists since all leading principal minors of A are positive) gives A = LU with L unit lower triangular. But A = A^\top implies LU = U^\top L^\top, so U = DL^\top where D = \operatorname{diag}(U_{11}, \ldots, U_{nn}). Thus A = L D L^\top (the LDL^T form). Since A \succ 0, all diagonal entries D_{ii} > 0, and we can define \tilde{L} = L D^{1/2} to get A = \tilde{L}\tilde{L}^\top - the Cholesky factorization.

Cholesky algorithm (jth column):

L_{jj} = \sqrt{A_{jj} - \sum_{k=1}^{j-1} L_{jk}^2}

L_{ij} = \frac{1}{L_{jj}}\left(A_{ij} - \sum_{k=1}^{j-1} L_{ik}L_{jk}\right), \quad i = j+1, \ldots, n

The argument of the square root must be positive - this is guaranteed by positive definiteness.

Cost analysis:

  • Column j costs: 1 square root + O(j) flops for L_{jj}, plus (n-j) \cdot O(j) flops for the L_{ij}
  • Total: \sum_{j=1}^{n} O((n-j)j) = n^3/3 flops, exactly half of LU
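The column formulas translate directly into code (NumPy sketch, checked against numpy.linalg.cholesky):

```python
import numpy as np

def cholesky_unblocked(A):
    """Column-by-column Cholesky from the formulas above; A must be SPD."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        d = A[j, j] - L[j, :j] @ L[j, :j]        # A_jj minus the sum of L_jk^2
        if d <= 0.0:
            raise np.linalg.LinAlgError("matrix is not positive definite")
        L[j, j] = np.sqrt(d)
        L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]
    return L

rng = np.random.default_rng(5)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6.0 * np.eye(6)                    # SPD by construction
L = cholesky_unblocked(A)
```
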

5.3 LDL^T for Indefinite Systems

For symmetric indefinite matrices (not necessarily PD), Cholesky cannot be applied directly (negative square roots). The LDL^T factorization computes:

P A P^\top = L D L^\top

where L is unit lower triangular, D is block diagonal with 1 \times 1 and 2 \times 2 blocks (to handle negative eigenvalues), and P is a permutation.

Bunch-Kaufman pivoting: The 2 \times 2 blocks in D capture pairs of eigenvalues of opposite sign, avoiding the square root altogether. The pivoting strategy (Bunch & Kaufman 1977) uses the threshold parameter \alpha = (1 + \sqrt{17})/8 \approx 0.64, chosen to bound element growth - a role analogous to the bound |L_{ij}| \leq 1 enforced by partial pivoting in LU.

Applications in ML:

  • Indefinite Hessians: At saddle points (common early in neural network training), the Hessian is indefinite. LDL^T factors it without the positive-definiteness requirement, enabling second-order descent directions even at saddle points.
  • Modified Cholesky: Adding a diagonal shift \delta I to make A + \delta I \succ 0 before Cholesky factorization is a standard safeguard in quasi-Newton and trust-region methods. See 07 for the modified Cholesky algorithm.

LAPACK routine: DSYTRF implements LDL^T with Bunch-Kaufman pivoting.
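SciPy exposes this factorization as scipy.linalg.ldl (wrapping ?SYTRF); a sketch on a small indefinite matrix:

```python
import numpy as np
from scipy.linalg import ldl

# Symmetric indefinite: the zero diagonal entry forces Bunch-Kaufman pivoting
A = np.array([[0.0,  1.0, 2.0],
              [1.0, -1.0, 0.5],
              [2.0,  0.5, 3.0]])

# D is block diagonal with 1x1 / 2x2 blocks; the permutation is folded into L
L, D, perm = ldl(A)
```
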

5.4 Blocked Cholesky and LAPACK dpotrf

Blocked Cholesky follows the same blocked structure as blocked LU. For block size b:

for k = 0 to n/b - 1:
    # Factor diagonal block (unblocked Cholesky)
    L_kk = cholesky(A[kb:(k+1)b, kb:(k+1)b])              # O(b^3)
    # Panel solve: L_panel @ L_kk.T = A_panel  (DTRSM, BLAS-3)
    L_panel = solve_triangular(L_kk, A[(k+1)b:n, kb:(k+1)b].T, lower=True).T
    # Symmetric rank-b update of trailing submatrix (DSYRK)
    A[(k+1)b:n, (k+1)b:n] -= L_panel @ L_panel.T

The trailing submatrix update (A -= L_panel @ L_panel.T) is a symmetric rank-b update, computed by the BLAS-3 routine DSYRK (symmetric rank-k update), which exploits symmetry to halve the work relative to DGEMM.

LAPACK routine DPOTRF: The production Cholesky routine. On entry: the upper or lower triangle of A. On exit: L (or U for upper Cholesky). Returns an error flag INFO (= 0 for success; = k > 0 if the k-th leading minor is not positive definite).

For AI: DPOTRF is called by:

  • scipy.linalg.cholesky (wraps LAPACK)
  • numpy.linalg.cholesky (wraps LAPACK)
  • torch.linalg.cholesky (calls cuSOLVER cusolverDnDpotrf on GPU)
  • Every GP regression library (GPyTorch, GPflow, etc.) for computing K^{-1}\mathbf{y} and \log\det K

6. Numerical Stability and Error Analysis

6.1 Forward vs Backward Error

Forward error: The actual difference between the computed answer \hat{\mathbf{x}} and the true answer \mathbf{x}^*:

\text{forward error} = \|\hat{\mathbf{x}} - \mathbf{x}^*\| / \|\mathbf{x}^*\|

Backward error: The smallest perturbation to the input data that makes \hat{\mathbf{x}} an exact solution:

\text{backward error} = \min\{\|\delta A\| / \|A\| : (A + \delta A)\hat{\mathbf{x}} = \mathbf{b}\}

The fundamental inequality:

\text{forward error} \leq \kappa(A) \cdot \text{backward error} + O(u^2)

This separates the algorithm's contribution (backward error, controlled by stability) from the problem's inherent difficulty (condition number \kappa(A), intrinsic to the matrix). A backward-stable algorithm does the best possible: it cannot guarantee forward error better than \kappa(A) \cdot u.

Why backward error is the right measure: A backward-stable algorithm produces exact solutions to slightly perturbed problems. If the input data itself has error of order u (which it does - all real data has measurement error), then backward-stable computation introduces no additional error beyond what the input uncertainty already implies.

Example: Solve A\mathbf{x} = \mathbf{b} with a backward-stable LU solver when \kappa(A) = 10^{10} in double precision (u \approx 10^{-16}). The backward error is O(u): the computed \hat{\mathbf{x}} exactly solves a system whose data agree with (A, \mathbf{b}) to about 16 digits. Yet the forward error can be as large as \kappa(A) \cdot u \approx 10^{-6} - ten digits of \hat{\mathbf{x}} may be wrong through no fault of the algorithm.
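This separation shows up clearly on an ill-conditioned system (SciPy sketch using the Hilbert matrix, whose condition number at n = 10 is about 10^13):

```python
import numpy as np
from scipy.linalg import hilbert

n = 10
A = hilbert(n)                         # notoriously ill-conditioned
x_true = np.ones(n)
b = A @ x_true

x_hat = np.linalg.solve(A, b)          # backward-stable LU solve (LAPACK)

# Normwise backward error (residual-based) vs forward error
backward = np.linalg.norm(b - A @ x_hat) / (np.linalg.norm(A) * np.linalg.norm(x_hat))
forward = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
# backward ~ machine epsilon; forward ~ kappa(A) * machine epsilon
```
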

6.2 The Fundamental Theorem of Backward Stability

Theorem (Backward Stability, informal). An algorithm for computing f(x) is backward stable if for every input x, the computed output \hat{y} satisfies:

\hat{y} = f(x + \delta x) \text{ for some } \|\delta x\| \leq c \cdot u \cdot \|x\|

where c is a modest polynomial function of the problem size and u is machine epsilon.

Corollary: For a backward stable algorithm, the forward error satisfies:

\frac{\|\hat{y} - f(x)\|}{\|f(x)\|} \leq \kappa_f \cdot c \cdot u

where \kappa_f is the condition number of f. The condition number determines how much the backward perturbation \delta x amplifies into forward error.

LU with partial pivoting: backward stable up to the growth factor \rho_n - the backward error bound is proportional to \rho_n u. Not backward stable in theory (since \rho_n can reach 2^{n-1}), but backward stable in practice (\rho_n rarely exceeds about n^{2/3} for typical inputs).

Householder QR: backward stable unconditionally. The computed \hat{Q} satisfies \|\hat{Q}^\top\hat{Q} - I\| = O(u) regardless of \kappa(A).

Cholesky (for SPD): backward stable unconditionally. No pivoting needed; positive definiteness guarantees numerical stability.

6.3 Stability Comparison: LU vs QR vs Cholesky

| Property | LU (partial pivot) | Householder QR | Cholesky (SPD) |
| --- | --- | --- | --- |
| Growth factor \rho_n | \leq 2^{n-1} (worst case) | 1 (exact) | 1 (exact) |
| Backward stable? | In practice, yes | Unconditionally | Unconditionally |
| Orthogonality loss | N/A | O(u) | N/A |
| Requires pivoting? | Yes (partial pivot) | No (column pivot for RRQR) | No |
| Cost (leading term) | \frac{2}{3}n^3 | 2mn^2 - \frac{2}{3}n^3 | \frac{1}{3}n^3 |
| Memory | n^2 | mn (+ n^2 for R) | \frac{n(n+1)}{2} |
| Handles all square matrices? | Yes (nonsingular) | Yes | No (SPD only) |

Rule of thumb:

  • Have a symmetric positive definite matrix? Use Cholesky (twice as fast as LU, always stable).
  • Have a general square system? Use LU with partial pivoting (standard default in all libraries).
  • Have an overdetermined least-squares problem? Use Householder QR.
  • Need rank determination or have a rank-deficient matrix? Use RRQR or SVD.
  • Have an ill-conditioned system where every digit matters? Use QR (avoids condition number squaring).

6.4 Mixed Precision and Iterative Refinement

Mixed precision computation: Modern GPUs (NVIDIA A100, H100) support FP64, FP32, FP16, and BF16 with different throughputs - roughly 1:2:8:8 across the four formats. Training AI models almost exclusively uses FP16 or BF16 to maximize throughput; but this reduces precision from u \approx 10^{-16} (FP64) to u \approx 10^{-3} (FP16).

Iterative refinement algorithm:

1. Factor PA ≈ LU in low precision (FP16 or FP32)
2. Solve A x_0 = b using L, U in low precision           -> x_0
3. For k = 0, 1, 2, ...:
   a. Compute residual r_k = b - A x_k    (in HIGH precision, FP64)
   b. Solve LU d_k = r_k                  (in LOW precision - cheap)
   c. Update x_{k+1} = x_k + d_k
4. Converge when ||r_k|| / ||b|| < eps_target

Convergence: Iterative refinement converges in O(1) iterations if the factorization precision is high enough relative to \kappa(A). Specifically, if \kappa(A) \cdot u_{\text{factor}} < 1, refinement converges to FP64 precision in 2-3 steps.
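A minimal sketch in NumPy/SciPy, using FP32 as the "low" precision stand-in for FP16/BF16:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(6)
n = 200
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Factor and solve once in low precision (FP32)
lu, piv = lu_factor(A.astype(np.float32))
x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)

for _ in range(5):
    r = b - A @ x                                    # residual in FP64
    d = lu_solve((lu, piv), r.astype(np.float32))    # cheap low-precision correction
    x += d.astype(np.float64)

rel_res = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
```
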

NVIDIA cuSOLVER IR (Iterative Refinement): Implements exactly this pattern: BF16 factorization (8x throughput vs FP64) + FP64 residual computation + BF16 solve for correction. Used in NVIDIA's deep learning solver libraries and in the mixed-precision training of large language models.

For AI: NVIDIA's "GMRES-IR" (GMRES Iterative Refinement) in cuSOLVER applies this to poorly conditioned systems arising in transformer inference optimization. K-FAC implementations use this to factor Fisher information blocks quickly in FP16 and refine to FP32 accuracy.

6.5 Ill-Conditioned Systems and Regularization

When \kappa(A) is very large, even backward-stable algorithms produce solutions with large forward error. The cure is regularization - changing the problem to a nearby well-conditioned one.

Tikhonov regularization: Replace A\mathbf{x} = \mathbf{b} with \min_\mathbf{x} \|A\mathbf{x} - \mathbf{b}\|^2 + \lambda\|\mathbf{x}\|^2. The solution is \mathbf{x}_\lambda = (A^\top A + \lambda I)^{-1} A^\top \mathbf{b}. The regularized condition number is:

\kappa(A^\top A + \lambda I) = \frac{\sigma_1^2 + \lambda}{\sigma_n^2 + \lambda}

which decreases as \lambda increases. At \lambda = \sigma_n^2 the condition number is \approx (\sigma_1/\sigma_n)^2 / 2 = \kappa(A)^2/2; at \lambda = \sigma_1^2 it is \approx 2.
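A two-singular-value sketch makes the formula concrete (the choices \sigma_1 = 1 and \sigma_n = 10^{-4} are arbitrary):

```python
import numpy as np

A = np.diag([1.0, 1e-4])                 # sigma_1 = 1, sigma_n = 1e-4
conds = {}
for lam in [0.0, 1e-8, 1.0]:
    M = A.T @ A + lam * np.eye(2)
    # cond(M) = (sigma_1^2 + lam) / (sigma_n^2 + lam)
    conds[lam] = np.linalg.cond(M)
# lam = 0      -> cond = 1e8   (kappa(A)^2)
# lam = 1e-8   -> cond ~ 5e7   (= sigma_n^2: only halves the conditioning)
# lam = 1      -> cond ~ 2     (= sigma_1^2: fully regularized)
```
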

Truncated factorizations: Set all singular values (or diagonal entries of R or U) below a threshold to zero. This is the basis of truncated SVD and RRQR-based pseudoinverse computation.

For AI: The Adam optimizer's \epsilon parameter (default 10^{-8}) acts as Tikhonov-style damping of the per-parameter (diagonal) curvature estimate. K-FAC's damping parameter \lambda similarly regularizes the Fisher information matrix to prevent ill-conditioned natural gradient steps.


7. Blocked Algorithms and High-Performance Computing

7.1 Cache Hierarchy and Algorithm Design

Modern computer architecture has a deep memory hierarchy:

| Level | Size | Latency | Bandwidth |
| --- | --- | --- | --- |
| L1 cache | 32-64 KB | 4 cycles | 1 TB/s |
| L2 cache | 256-512 KB | 12 cycles | 400 GB/s |
| L3 cache | 8-32 MB | 40 cycles | 200 GB/s |
| DRAM | 16-512 GB | 100+ cycles | 50 GB/s |
| GPU HBM | 16-80 GB | 80 cycles | 2 TB/s |

The roofline model: An algorithm's achievable performance is limited by the minimum of:

  1. Peak arithmetic throughput (FLOP/s)
  2. Memory bandwidth × arithmetic intensity (flops per byte)

A naive matrix-vector multiply \mathbf{y} = A\mathbf{x} reads n^2 + 2n words and performs 2n^2 flops - arithmetic intensity O(1). It is memory bandwidth bound: DRAM bandwidth, not arithmetic units, limits performance.

A matrix-matrix multiply C = AB reads 2n^2 words and performs 2n^3 flops - arithmetic intensity O(n). For large n, it is compute bound and can achieve near-peak FLOP/s.

The key insight for blocked algorithms: Reformulate the bottleneck step as a matrix-matrix multiply (BLAS-3), not as a matrix-vector multiply (BLAS-2). This raises the arithmetic intensity from O(1) to O(n), giving up to an O(n)-fold speedup on modern hardware.

7.2 Blocked Factorizations

All three factorizations (LU, QR, Cholesky) can be reformulated as blocked algorithms that expose BLAS-3 matrix-matrix multiplies as the dominant computation.

Block LU (Panel + Update):

A = [A_{11} A_{12}]    Factor: [L_{11}   0  ] [U_{11} U_{12}]
    [A_{21} A_{22}]            [L_{21}  L_{22}] [  0   U_{22}]

Step 1: Factor A_11 = L_{11} U_{11}           (small, BLAS-2)
Step 2: L_{21} = A_{21} U_{11}^{-1}           (DTRSM, BLAS-3)
Step 3: U_{12} = L_{11}^{-1} A_{12}           (DTRSM, BLAS-3)  
Step 4: A_{22} -= L_{21} U_{12}               (DGEMM, BLAS-3) <- dominates
Step 5: Recurse on A_{22}

WY representation for blocked Householder QR: A product of b Householder reflectors H_1 \cdots H_b can be written as I - WY^\top where W, Y \in \mathbb{R}^{m \times b}. Applying I - WY^\top to a matrix C costs O(mbn) using BLAS-3 DGEMM. LAPACK's DGEQRF uses this representation for blocked Householder QR.

Performance: On a modern CPU with AVX-512, blocked DGEMM achieves > 80\% of peak FLOP/s for n > 500. Without blocking (using BLAS-2), performance drops to < 10\% of peak. The roughly 8\times speedup from blocking is why production factorization libraries use block sizes of 64-256.

7.3 LAPACK Routine Reference

| Routine | Operation | Notes |
| --- | --- | --- |
| DGETRF | LU with partial pivoting (PA = LU) | Returns pivot indices; in-place |
| DGETRS | Solve AX = B using DGETRF output | Multiple right-hand sides |
| DGETRI | Compute A^{-1} from DGETRF output | Rarely needed; use DGETRS instead |
| DGEQRF | QR factorization (Householder) | Returns H vectors; blocked (WY) |
| DORMQR | Apply Q or Q^\top from DGEQRF | Without forming Q explicitly |
| DORGQR | Form explicit Q from DGEQRF output | Costs O(mn^2) extra |
| DGELSY | Least squares via column-pivoted QR | Rank-deficient safe |
| DPOTRF | Cholesky (A = LL^\top or A = U^\top U) | Returns INFO error if not SPD |
| DPOTRS | Solve AX = B using DPOTRF output | SPD matrix only |
| DSYTRF | LDL^T for symmetric indefinite | Bunch-Kaufman pivoting |
| DSYTRS | Solve using DSYTRF output | Symmetric indefinite systems |
LAPACK naming convention: D = double precision, S = single, Z = complex double; GE = general, PO = positive definite, SY = symmetric; TRF = triangular factorization, TRS = triangular solve, TRI = triangular inverse.

In Python: scipy.linalg provides direct LAPACK access via scipy.linalg.lapack.dgetrf, etc. High-level wrappers (scipy.linalg.lu, scipy.linalg.qr, scipy.linalg.cholesky) call these automatically.

7.4 Sparse Factorizations

For sparse matrices (e.g., from finite-element discretization, graph Laplacians, or neural network weight matrices with structured sparsity), dense factorizations waste memory and computation on zeros.

Fill-in: When a sparse matrix is factored, the L and U factors typically contain more nonzeros than A itself. The additional nonzeros are called fill-in.

Reordering to minimize fill-in:

  • Minimum degree ordering (AMD): Greedy heuristic - eliminate the variable connected to fewest others first. Approximate minimum degree (AMD) is the production algorithm.
  • Nested dissection: Recursively partition the matrix graph using separator sets; for regular 2D grids this yields O(n \log n) fill and O(n^{3/2}) factorization flops, versus up to O(n^2) fill without reordering.

Supernodal methods: CHOLMOD (Davis & Hager, 2009) uses "supernodes" - groups of columns with identical sparsity pattern - and applies dense BLAS-3 operations to each supernode, recovering high arithmetic intensity.

For AI:

  • Graph neural networks: The adjacency matrix of a graph is sparse. Solving GNN systems requires sparse Cholesky or LU.
  • Attention patterns: Sparse attention (Longformer, BigBird) creates structured-sparse weight matrices. Sparse LU factorization is used in some attention-efficient inference methods.
  • Finite-element physics simulation: Differentiable simulation requires backpropagating through sparse LU/Cholesky, implemented in JAX using jax.experimental.sparse.

8. Applications in Machine Learning

8.1 Least Squares via QR

Setup: The standard supervised learning problem with linear model \hat{y} = \mathbf{w}^\top \mathbf{x} over n training examples (X \in \mathbb{R}^{n \times d}, \mathbf{y} \in \mathbb{R}^n) reduces to the least-squares problem:

\mathbf{w}^* = \arg\min_\mathbf{w} \|X\mathbf{w} - \mathbf{y}\|_2^2

QR solution procedure:

  1. Factor X = QR (thin QR, Q \in \mathbb{R}^{n \times d}, R \in \mathbb{R}^{d \times d})
  2. Compute \hat{\mathbf{y}} = Q^\top \mathbf{y} (project \mathbf{y} onto the column space of X)
  3. Solve R\mathbf{w}^* = \hat{\mathbf{y}} (backward substitution)

Why QR over normal equations in practice:

  • Normal equations form X^\top X, which has condition number \kappa(X)^2. For typical neural-network feature matrices (\kappa \sim 10^4-10^6), this squares to 10^8-10^{12}, losing 8-12 digits of precision in double.
  • QR works with X directly, so the condition number stays \kappa(X) throughout - roughly twice as many correct digits.
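The conditioning gap is easy to demonstrate (a sketch using a Vandermonde matrix as a standard ill-conditioned example):

```python
import numpy as np

# Ill-conditioned least squares: normal equations vs QR.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
X = np.vander(t, 12)                       # kappa(X) on the order of 1e7
w_true = rng.normal(size=12)
y = X @ w_true                             # consistent system: exact solution known

Q, R = np.linalg.qr(X)                     # thin QR
w_qr = np.linalg.solve(R, Q.T @ y)
w_ne = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations: kappa squared

err = lambda w: np.linalg.norm(w - w_true) / np.linalg.norm(w_true)
print(err(w_qr) < err(w_ne))               # QR recovers far more digits
```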

Ridge regression via augmented QR: \min \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|^2 is equivalent to ordinary least squares on the augmented system:

\tilde{X} = \begin{pmatrix} X \\ \sqrt{\lambda} I_d \end{pmatrix}, \quad \tilde{\mathbf{y}} = \begin{pmatrix} \mathbf{y} \\ \mathbf{0} \end{pmatrix}

Factor \tilde{X} = QR and solve R\mathbf{w} = Q^\top \tilde{\mathbf{y}}. This is numerically superior to forming (X^\top X + \lambda I) explicitly.
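A minimal sketch of the augmented-QR ridge solve (the function name ridge_qr is ours):

```python
import numpy as np

def ridge_qr(X, y, lam):
    """Ridge regression via QR of the augmented system - avoids forming
    X^T X + lam*I explicitly."""
    n, d = X.shape
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
    y_aug = np.concatenate([y, np.zeros(d)])
    Q, R = np.linalg.qr(X_aug)               # thin QR of the (n+d) x d system
    return np.linalg.solve(R, Q.T @ y_aug)   # back-substitution on R

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = ridge_qr(X, y, lam=0.1)
# cross-check against the normal-equations form on a well-conditioned problem
w_normal = np.linalg.solve(X.T @ X + 0.1 * np.eye(5), X.T @ y)
print(np.allclose(w, w_normal))
```

On well-conditioned data the two agree; the augmented form only pulls ahead as \kappa(X) grows.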

For AI: Linear probing (evaluating feature quality by fitting a linear classifier) and prompt tuning (fitting a linear head on frozen features) both reduce to this problem. sklearn.linear_model.Ridge with solver='svd' or solver='cholesky' uses these factorization-based methods.

8.2 Newton's Method and Second-Order Optimization

Newton's method for minimizing L(θ)\mathcal{L}(\boldsymbol{\theta}) applies the update:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - H_t^{-1} \nabla\mathcal{L}(\boldsymbol{\theta}_t)

where H_t = \nabla^2\mathcal{L}(\boldsymbol{\theta}_t) is the Hessian. This requires solving the linear system H_t \boldsymbol{\delta} = \nabla\mathcal{L} at each step - a direct application of LU or Cholesky factorization.

Cost per step: O(p^3) where p = |\boldsymbol{\theta}|. For GPT-3 with p = 175 \times 10^9, this is astronomically expensive. Full Newton is infeasible for large neural networks.

K-FAC (Kronecker-Factored Approximate Curvature, Martens & Grosse 2015): Approximates the Fisher information matrix (a PSD proxy for the Hessian) as a block-diagonal Kronecker product:

F \approx \bigoplus_l A_l \otimes G_l

where A_l \in \mathbb{R}^{d_l \times d_l} and G_l \in \mathbb{R}^{d_{l+1} \times d_{l+1}} are computed from activations and pre-activation gradients respectively. Each block is inverted via the Kronecker identity:

(A_l \otimes G_l)^{-1} = A_l^{-1} \otimes G_l^{-1}

with A_l^{-1} and G_l^{-1} computed via Cholesky factorization of each factor separately.

Cost per K-FAC step: O(\sum_l (d_l^3 + d_{l+1}^3)) - a sum over layers, each requiring two Cholesky factorizations of small matrices. For a 10-layer network with hidden dimension 1024: \approx 10 \times 2 \times 10^9 = 2 \times 10^{10} flops, feasible on modern hardware.

Shampoo (Gupta et al., 2018; Anil et al., 2020): Maintains full-matrix preconditioners per layer:

G_t = G_{t-1} + g_t g_t^\top, \quad P_t = G_t^{-1/4}

The inverse fourth root G_t^{-1/4} requires an eigendecomposition or iterative methods. The "Distributed Shampoo" implementations run this in FP32 on accelerators, with Newton-type iterations for the inverse matrix root.
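The Kronecker inverse identity above can be applied to a layer gradient without ever forming A_l \otimes G_l (a sketch; kfac_precondition is our name):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def kfac_precondition(grad, A, G):
    """Apply (A ⊗ G)^{-1} to a layer gradient via two small Cholesky solves.

    For a gradient dW of shape (d_out, d_in), with vec denoting column
    stacking: (A ⊗ G)^{-1} vec(dW) = vec(G^{-1} dW A^{-1})."""
    cA = cho_factor(A)          # Cholesky of the activation factor (d_in x d_in)
    cG = cho_factor(G)          # Cholesky of the gradient factor (d_out x d_out)
    return cho_solve(cG, cho_solve(cA, grad.T).T)

d_in, d_out = 8, 4
rng = np.random.default_rng(1)
Xa = rng.normal(size=(50, d_in)); Xg = rng.normal(size=(50, d_out))
A = Xa.T @ Xa / 50 + 0.1 * np.eye(d_in)    # SPD Kronecker factors
G = Xg.T @ Xg / 50 + 0.1 * np.eye(d_out)
dW = rng.normal(size=(d_out, d_in))
P = kfac_precondition(dW, A, G)

# check against the explicit (and much larger) Kronecker inverse
ref = (np.linalg.inv(np.kron(A, G)) @ dW.T.ravel()).reshape(d_in, d_out).T
print(np.allclose(P, ref))
```

The two small solves cost O(d_in^3 + d_out^3) instead of O((d_in d_out)^3) for the explicit Kronecker inverse.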

8.3 Gaussian Process Inference at Scale

GP regression setup: Given observations \mathbf{y} = f(X) + \varepsilon with f \sim \mathcal{GP}(0, k) and \varepsilon \sim \mathcal{N}(0, \sigma^2 I), the posterior predictive mean and covariance at test points X_* are:

\boldsymbol{\mu}_* = K_{*n}(K_{nn} + \sigma^2 I)^{-1}\mathbf{y}

\Sigma_* = K_{**} - K_{*n}(K_{nn} + \sigma^2 I)^{-1}K_{n*}

where K_{nn} \in \mathbb{R}^{n \times n} is the training kernel matrix and K_{*n} \in \mathbb{R}^{n_* \times n} is the test-train kernel matrix.

Cholesky for GP: The matrix K_{nn} + \sigma^2 I is SPD (the positive noise term ensures this). The Cholesky factorization:

K_{nn} + \sigma^2 I = LL^\top

enables efficient computation of:

  1. \boldsymbol{\alpha} = (K_{nn} + \sigma^2 I)^{-1}\mathbf{y}: forward + backward solve with L, cost O(n^3) once, O(n^2) for new right-hand sides
  2. \log p(\mathbf{y} \mid X, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^\top\boldsymbol{\alpha} - \sum_i \log L_{ii} - \frac{n}{2}\log 2\pi: the log marginal likelihood used for hyperparameter optimization
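The two items above can be sketched in a few lines (no explicit inverse or determinant is ever formed; gp_loglik is our name):

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve

def gp_loglik(K, y, sigma2):
    """Log marginal likelihood of a zero-mean GP via Cholesky."""
    n = len(y)
    L = cholesky(K + sigma2 * np.eye(n), lower=True)
    alpha = cho_solve((L, True), y)                # (K + sigma^2 I)^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))           # -(1/2) log det
            - 0.5 * n * np.log(2 * np.pi))

# check against the dense formula on a small problem
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1))
K = np.exp(-0.5 * (X - X.T) ** 2)                  # RBF kernel, lengthscale 1
y = rng.normal(size=30)
A = K + 0.1 * np.eye(30)
ref = (-0.5 * y @ np.linalg.solve(A, y)
       - 0.5 * np.linalg.slogdet(A)[1]
       - 15 * np.log(2 * np.pi))
print(np.allclose(gp_loglik(K, y, 0.1), ref))
```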

Bottleneck: For n = 10{,}000 training points, K_{nn} is 10^4 \times 10^4, and Cholesky costs \frac{(10^4)^3}{3} \approx 3 \times 10^{11} flops - seconds on a GPU, but prohibitive for n = 10^6.

Scalable GP methods:

  • Inducing point methods (Titsias 2009): Introduce m \ll n inducing points; compute an m \times m Cholesky instead of n \times n. GPyTorch implements this as gpytorch.models.ApproximateGP.
  • SKI/KISS-GP (Wilson & Nickisch 2015): Structured kernel interpolation - interpolate onto a regular grid whose kernel matrix has Kronecker/Toeplitz structure, enabling near-O(n) solves.
  • CG + Lanczos: Conjugate-gradient methods (Gardner et al., 2018) avoid explicit Cholesky by solving the system iteratively, as in GPyTorch's linear_cg routines.

For AI: GP regression is the inference engine of Bayesian optimization, widely used for hyperparameter tuning of large language models. Every BO iteration requires a fresh Cholesky factorization of the updated kernel matrix.

8.4 Backpropagation Through Factorizations

Differentiating through Cholesky: If A = LL^\top and a scalar loss \ell = f(L), then:

\frac{\partial \ell}{\partial A} = L^{-\top}\,\Phi\!\left(L^\top \frac{\partial \ell}{\partial L}\right)L^{-1}

where \Phi extracts the lower-triangular part with halved diagonal: \Phi(M)_{ij} = M_{ij} for i > j, \frac{1}{2}M_{ii} for i = j, and 0 for i < j.

This formula (Iain Murray, 2016) is the backward rule implemented for PyTorch's torch.linalg.cholesky.
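A quick numerical check of the formula, using \ell = \log\det A = 2\sum_i \log L_{ii}, whose gradient w.r.t. A is known to be A^{-1}:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def Phi(M):
    """Lower-triangular extraction with halved diagonal."""
    out = np.tril(M)
    np.fill_diagonal(out, 0.5 * np.diag(M))
    return out

def chol_backward(L, dL):
    """dl/dA = L^{-T} Phi(L^T dL) L^{-1}, via two triangular solves."""
    M = Phi(L.T @ dL)
    X = solve_triangular(L, M, lower=True, trans='T')         # L^{-T} M
    return solve_triangular(L, X.T, lower=True, trans='T').T  # ... L^{-1}

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = B @ B.T + 5 * np.eye(5)                 # SPD test matrix
L = cholesky(A, lower=True)
dL = np.diag(2.0 / np.diag(L))              # dl/dL for l = 2*sum(log L_ii)
print(np.allclose(chol_backward(L, dL), np.linalg.inv(A)))
```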

Differentiating through LU: For A = P^{-1}LU and a loss that depends on the solution \mathbf{x} = A^{-1}\mathbf{b}:

\frac{\partial \ell}{\partial A} = -A^{-\top}\frac{\partial \ell}{\partial \mathbf{x}}\mathbf{x}^\top = -\boldsymbol{\lambda}\mathbf{x}^\top

where \boldsymbol{\lambda} = A^{-\top}(\partial\ell/\partial\mathbf{x}) is the adjoint vector (computed via triangular solves with the transposed factors).

Implicit differentiation: Rather than differentiating through the factorization algorithm itself (which would unroll all O(n^3) operations), implicit differentiation differentiates the solution condition A\mathbf{x}^* = \mathbf{b}:

A \frac{\partial\mathbf{x}^*}{\partial\boldsymbol{\theta}} = \frac{\partial\mathbf{b}}{\partial\boldsymbol{\theta}} - \frac{\partial A}{\partial\boldsymbol{\theta}}\mathbf{x}^*

This requires solving one additional linear system per output - O(n^2) using the already-computed factorization.
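The adjoint rule can be exercised with a single cached LU factorization (a sketch; the loss \ell = \mathbf{c}^\top\mathbf{x} is our choice):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Gradient of l(x) = c^T x with x = A^{-1} b, w.r.t. A:
# dl/dA = -lambda x^T with lambda = A^{-T} c, reusing one LU factorization.
rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n)) + n * np.eye(n)
b, c = rng.normal(size=n), rng.normal(size=n)

factors = lu_factor(A)                  # factor once, O(n^3)
x = lu_solve(factors, b)                # forward solve
lam = lu_solve(factors, c, trans=1)     # adjoint solve with A^T
grad = -np.outer(lam, x)                # dl/dA

# finite-difference check of a single entry
eps = 1e-6
Ap = A.copy(); Ap[2, 3] += eps
fd = (c @ np.linalg.solve(Ap, b) - c @ x) / eps
print(np.isclose(grad[2, 3], fd, rtol=1e-4))
```

Both solves reuse the same stored factors; trans=1 requests the transposed system.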

For AI: Differentiating through linear solvers is central to:

  • Meta-learning: MAML and its variants differentiate through inner-loop optimization steps
  • Differentiable physics: FEA simulation (e.g., PhiML) requires differentiating through sparse LU solves
  • Constrained optimization: Interior-point methods for differentiable convex optimization
  • Gaussian processes in end-to-end learning: Differentiating the GP marginal likelihood w.r.t. kernel parameters requires differentiating through Cholesky

8.5 Randomized Factorizations

Motivation: For matrices where only a low-rank approximation is needed, exact factorization costs O(mn^2) - wasteful if the target rank r \ll n. Randomized methods achieve O(mnr) or even O(mn \log r) using random projections.

Randomized QR (sketch-and-apply):

  1. Sketch: Y = A\Omega where \Omega \in \mathbb{R}^{n \times (r+p)} is a random Gaussian matrix (p \approx 10 oversampling columns)
  2. Orthogonalize: Factor Y = QR (thin QR of the sketch)
  3. Project: B = Q^\top A (project A onto the sketched subspace)
  4. Factor: B = \tilde{Q}\tilde{R} (thin QR of the (r+p) \times n matrix)
  5. Output: A \approx (Q\tilde{Q})\tilde{R} - a rank-(r+p) QR approximation

Cost: Steps 1-4 cost O(mn(r+p)), negligible compared to O(mn^2) for exact QR.

Error bound (Halko-Martinsson-Tropp, 2011): With probability \geq 1 - 6p^{-p} (for oversampling p \geq 4):

\|A - Q Q^\top A\|_2 \leq \left(1 + 11\sqrt{r+p}\cdot\sqrt{\min(m,n)}\right)\sigma_{r+1}(A)

The approximation is near-optimal - within a polynomial factor of the Eckart-Young lower bound \sigma_{r+1}(A).
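The five steps translate almost line-for-line into NumPy (a sketch without power iterations; randomized_qr is our name):

```python
import numpy as np

def randomized_qr(A, r, p=10, seed=0):
    """Randomized rank-(r+p) factorization A ≈ (Q Qt) Rt, following
    steps 1-5 above with a Gaussian test matrix."""
    rng = np.random.default_rng(seed)
    Omega = rng.normal(size=(A.shape[1], r + p))   # 1. sketch
    Q, _ = np.linalg.qr(A @ Omega)                 # 2. orthogonalize
    B = Q.T @ A                                    # 3. project
    Qt, Rt = np.linalg.qr(B)                       # 4. small QR
    return Q @ Qt, Rt                              # 5. A ≈ (Q Qt) Rt

# rank-5 matrix plus tiny noise: the error tracks sigma_{r+1}
rng = np.random.default_rng(1)
A = (rng.normal(size=(200, 5)) @ rng.normal(size=(5, 100))
     + 1e-8 * rng.normal(size=(200, 100)))
Qf, Rf = randomized_qr(A, r=5)
rel_err = np.linalg.norm(A - Qf @ Rf) / np.linalg.norm(A)
print(rel_err < 1e-6)
```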

Randomized LU (Yu et al., 2018): Applies random column sampling to produce a structured LU factorization useful for rank-deficient matrices.

Connection to LoRA (Hu et al., 2022): Low-Rank Adaptation of language models uses rank-r decompositions \Delta W = BA where B \in \mathbb{R}^{d \times r} and A \in \mathbb{R}^{r \times k}. LoRA itself initializes A randomly and B to zero; variants initialize the factors from a randomized low-rank factorization of the pretrained weight matrix to align with its principal subspace.

DoRA (Liu et al., 2024): Decomposes W = m \cdot (W/\|W\|_c), where \|W\|_c is the vector of column norms, separating each weight into a magnitude and a direction. The directional component W/\|W\|_c is then updated with a LoRA-style low-rank term.

GaLore (Zhao et al., 2024): Gradient Low-Rank Projection - projects gradients onto their principal subspace before applying Adam updates. The projection matrix is recomputed periodically (every few hundred steps) from an SVD of the current gradient.


9. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Using LU without pivoting for numerical computation | Zero or small pivots cause catastrophic cancellation; the growth factor can reach 2^{n-1} | Always use partial pivoting (scipy.linalg.lu, LAPACK dgetrf) |
| 2 | Forming A^{-1} explicitly to solve Ax = b | Computing A^{-1} costs O(n^3) and introduces additional round-off; A^{-1}b is less accurate than a triangular solve | Use scipy.linalg.solve or lu_solve with the stored factorization |
| 3 | Applying Cholesky to a non-SPD matrix | Negative square roots cause NaN; the algorithm may "succeed" silently with incorrect results | Check np.linalg.eigvalsh(A).min() > 0 or catch the LinAlgError from a failed Cholesky |
| 4 | Using normal equations for ill-conditioned least squares | \kappa(A^\top A) = \kappa(A)^2 doubles the digits lost; for \kappa \sim 10^6, nearly all double-precision accuracy is gone | Use scipy.linalg.lstsq (SVD-based by default) or factor explicitly via Householder QR |
| 5 | Confusing thin and full QR | Full QR: Q \in \mathbb{R}^{m \times m}, R \in \mathbb{R}^{m \times n}. Thin QR: Q \in \mathbb{R}^{m \times n}, R \in \mathbb{R}^{n \times n}. Using the wrong one gives dimension errors | Use mode='economic' in scipy.linalg.qr or full_matrices=False in numpy.linalg.qr |
| 6 | Ignoring the growth factor in stability analysis | Partial pivoting has worst-case growth factor 2^{n-1}; rare in practice, but adversarial inputs exist | For certified computation, use complete pivoting or switch to Householder QR |
| 7 | Re-factoring A for each right-hand side | The factorization costs O(n^3); each additional solve costs only O(n^2) | Cache the factorization (lu_factor returns the packed LU; qr returns Q, R) and call lu_solve or back-substitution repeatedly |
| 8 | Forgetting the permutation P in PA = LU | Solving L(Ux) = b instead of L(Ux) = Pb gives the wrong answer | Apply the permutation first (b = b[piv]) or use lu_solve, which handles pivots automatically |
| 9 | Using classical Gram-Schmidt for QR instead of Householder | CGS loses orthogonality for ill-conditioned matrices: \|\hat{Q}^\top\hat{Q} - I\| = O(\kappa(A) \cdot u) | Use scipy.linalg.qr (Householder) or numpy.linalg.qr |
| 10 | Misinterpreting rank from the LU diagonal | LU (even with partial pivoting) doesn't give a reliable rank estimate; the pivot threshold is ambiguous | Use RRQR (scipy.linalg.qr(A, pivoting=True)) or the SVD for rank determination |
| 11 | Ignoring mixed-precision issues in GPU factorizations | FP16 has u \approx 5 \times 10^{-4}; for \kappa(A) > 10^4, an FP16 factorization gives no correct digits | Use FP32 or FP64 for factorizations; apply iterative refinement if FP16 is needed for speed |
| 12 | Differentiating through a factorization by unrolling | Unrolling the O(n^3) elimination steps creates O(n^3) graph nodes - memory explosion | Use implicit differentiation through the solution condition Ax = b; PyTorch's linalg.solve handles this automatically |
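Mistake 7 in particular is cheap to avoid (a sketch using SciPy's cached-factorization API):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Factor once, then reuse the factorization for many right-hand sides.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 200)) + 200 * np.eye(200)
factors = lu_factor(A)            # O(n^3), done once; pivots stored alongside LU
B = rng.normal(size=(200, 10))    # ten right-hand sides
X = lu_solve(factors, B)          # O(n^2) per column; pivots applied for you
print(np.allclose(A @ X, B))
```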

10. Exercises

Exercise 1 (*): Implement forward and backward substitution.

(a) Write a function forward_sub(L, b) that solves L\mathbf{x} = \mathbf{b} for unit lower triangular L (no division needed).

(b) Write backward_sub(U, b) that solves U\mathbf{x} = \mathbf{b} for upper triangular U.

(c) Verify your implementations against scipy.linalg.solve_triangular on a 5 \times 5 system.

(d) Measure the accuracy: for a random 100 \times 100 lower triangular system with entries from \mathcal{N}(0,1), compute the relative residual \|L\hat{\mathbf{x}} - \mathbf{b}\|_\infty / \|\mathbf{b}\|_\infty and compare to machine epsilon.

Exercise 2 (*): Implement naive LU and demonstrate failure.

(a) Implement lu_naive(A) that performs Gaussian elimination without pivoting, returning (L, U).

(b) For the matrix A = \begin{pmatrix} \varepsilon & 1 \\ 1 & 2 \end{pmatrix} with \varepsilon = 10^{-16}, compute \hat{\mathbf{x}} = U^{-1}(L^{-1}\mathbf{b}) for \mathbf{b} = (1, 3)^\top using your naive LU.

(c) Compare to numpy.linalg.solve. Report the relative error \|\hat{\mathbf{x}} - \mathbf{x}^*\|_\infty / \|\mathbf{x}^*\|_\infty.

(d) Explain the failure in terms of the growth factor. What is \rho_2 for this matrix?

Exercise 3 (*): Implement LU with partial pivoting and verify PA = LU.

(a) Implement lu_pivot(A) returning (P, L, U) where P is stored as a permutation vector.

(b) Verify the factorization: compute \|PA - LU\|_F and confirm it is O(n \cdot u \cdot \|A\|_F).

(c) Use your factorization to solve a 50 \times 50 random system and compare accuracy to the naive LU.

(d) Compute \det(A) from the LU factorization as (-1)^s \prod_i U_{ii}, where s is the number of row swaps, and verify against numpy.linalg.det.

Exercise 4 (**): Implement a Householder reflector and apply it.

(a) Implement householder_vector(a) that computes the Householder vector \mathbf{v} such that H\mathbf{a} = \alpha\mathbf{e}_1, with the sign convention chosen to avoid cancellation.

(b) Implement householder_apply(v, C) that applies H = I - 2\mathbf{v}\mathbf{v}^\top/\|\mathbf{v}\|^2 to a matrix C using the implicit formula (never forming the full m \times m matrix).

(c) Verify: for a random vector \mathbf{a} \in \mathbb{R}^5, confirm that H\mathbf{a} = \alpha\mathbf{e}_1 (entries 2-5 zero to machine precision).

(d) Show that the naive computation (I - 2*v*v.T/norm(v)**2) @ a gives the same result but is less efficient. Time both for a large vector.

Exercise 5 (**): Implement Householder QR and compare to scipy.

(a) Implement householder_qr(A) that returns the upper triangular factor R and stores the Householder vectors.

(b) Implement recover_Q(householder_vecs, m) to form the explicit Q matrix.

(c) Verify \|A - QR\|_F \leq c \cdot u \cdot \|A\|_F and \|Q^\top Q - I\|_F \leq c \cdot u for a 50 \times 30 matrix.

(d) Compare to scipy.linalg.qr. Verify they give identical R (up to sign conventions) and comparable orthogonality.

(e) For an ill-conditioned A with \kappa(A) = 10^8, compare the orthogonality \|Q^\top Q - I\|_F of your Householder QR vs classical Gram-Schmidt.

Exercise 6 (**): Rank estimation via column-pivoted QR.

(a) Create a numerically rank-4 matrix: A = USV^\top where S = \operatorname{diag}(100, 10, 1, 0.1, 0.001, 0.001, 0.001) and U, V are random orthogonal 7 \times 7 matrices.

(b) Compute QR with column pivoting: Q, R, P = scipy.linalg.qr(A, pivoting=True).

(c) Plot the diagonal entries |R_{11}|, \ldots, |R_{77}| on a log scale. Identify the rank gap.

(d) Implement rank estimation: declare the numerical rank to be the largest r with |R_{rr}|/|R_{11}| > \tau for threshold \tau = 10^{-3}.

(e) Compare to the SVD-based rank estimation. Which is more reliable? Which is faster?

Exercise 7 (***): GP regression with Cholesky.

(a) Implement rbf_kernel(X1, X2, ell, sf) computing k(x_i, x_j) = s_f^2 \exp(-\|x_i - x_j\|^2 / (2\ell^2)).

(b) Generate 100 training points from f(x) = \sin(x) with \mathcal{N}(0, 0.1^2) noise.

(c) Implement gp_predict(X_train, y_train, X_test, ell, sf, sigma_n) using Cholesky factorization for all linear solves (no explicit matrix inverse).

(d) Compute the log marginal likelihood \log p(\mathbf{y} \mid X) = -\frac{1}{2}\mathbf{y}^\top K^{-1}\mathbf{y} - \frac{1}{2}\log\det K - \frac{n}{2}\log 2\pi, using \log\det K = 2\sum_i \log L_{ii} from the Cholesky factor.

(e) Optimize (\ell, s_f) via grid search maximizing the log marginal likelihood. Plot the GP posterior mean and \pm 2\sigma credible interval.

Exercise 8 (***): Differentiating through Cholesky - gradient of log-determinant.

(a) For an SPD matrix A, the gradient of \log\det A w.r.t. A is A^{-1}. Verify this numerically: compute \nabla_A \log\det A via finite differences and via L^{-\top}L^{-1} (using Cholesky), and compare.

(b) Implement cholesky_grad(L, dL) computing the gradient of a scalar function f(L) (with given \partial f/\partial L = \mathtt{dL}) w.r.t. A, using the Murray (2016) formula.

(c) Use your gradient to implement gradient descent on \mathcal{L}(A) = -\log\det A + \operatorname{tr}(A) (the negative log-likelihood of a Wishart distribution). Starting from A = 5I, verify convergence to A = I (the analytic minimum).

(d) Compare to PyTorch's automatic differentiation through torch.linalg.cholesky and torch.logdet. Verify that the gradients match to 10^{-5} relative error.


11. Why This Matters for AI (2026 Perspective)

| Concept | AI/ML Impact |
|---------|--------------|
| LU with partial pivoting | Solving Hessian systems in Newton/quasi-Newton methods; computing inverses and log-determinants in probabilistic ML |
| Householder QR | Foundation of least-squares linear probing; basis for eigenvalue algorithms (QR iteration); stable weight-matrix analysis |
| Column-pivoted QR (RRQR) | Rank estimation of weight matrices in LoRA, DoRA; detecting intrinsic dimensionality of neural representations |
| Cholesky factorization (computational) | GP regression at scale; K-FAC training; VAE/diffusion sampling; Fisher information inversion |
| Blocked algorithms (BLAS-3) | 80-95% of peak GPU throughput for all factorizations; the reason FlashAttention uses tile-based blocking |
| Backward error analysis | Understanding precision limits in FP16/BF16 training; justifying mixed precision with iterative refinement |
| Iterative refinement | cuSOLVER IR: FP16 factorization + FP64 refinement for numerically reliable inference at high throughput |
| Sparse factorizations | GNN training on large graphs; differentiable physics simulation; structured attention patterns |
| QR for least squares | Linear probing, ridge regression, prompt-tuning accuracy; avoiding condition-number squaring in the normal equations |
| Differentiation through factorizations | End-to-end differentiable GP/Bayesian models; implicit differentiation in meta-learning; constraint-based optimization |
| TSQR | Communication-optimal QR for distributed training; tile-based blocking in FlashAttention-3 |
| Randomized LU/QR | LoRA initialization; GaLore gradient projection; sketch-and-solve preconditioning for large-scale optimization |
| K-FAC with Cholesky | Natural-gradient training of large transformers; 2-5x faster convergence than Adam at equivalent compute |
| GP marginal likelihood optimization | Bayesian hyperparameter tuning; automated ML pipeline optimization |
| Rank-deficient LU / RRQR | Intrinsic dimensionality measurement of LLM weight updates; guided rank selection for compression |

12. Conceptual Bridge

From 07 (Positive Definite Matrices): Section 07 developed the theory of positive definite matrices and gave the full treatment of the Cholesky factorization: existence proof (the Cholesky product theorem), the LDL^T variant, the connection to log-determinants, and the PSD cone. This section builds directly on that theory, treating Cholesky as one of three production-grade computational tools alongside LU and QR. The student who understands 07's theory will see 08's Cholesky discussion as pure application - taking the mathematical theorem and implementing it efficiently.

The computational turn: Sections 01-07 of Chapter 3 developed mathematical structures: eigenvalues, SVD, PCA, linear maps, orthogonality, norms, and positive definiteness. Section 08 marks a turn toward the computational: algorithms, numerical stability, blocking, LAPACK. This turn is not a descent in abstraction but a different kind of mathematics - the mathematics of floating-point arithmetic, error propagation, and algorithm design under hardware constraints. Both kinds are essential for AI/ML practice.

Into Chapter 4 (Calculus): The matrix factorizations developed here are the computational backbone of calculus-based optimization. The Hessian matrix H_f(\mathbf{x}) (Chapter 4, 05) is positive definite at local minima - and therefore amenable to Cholesky. Newton's method (Chapter 4, 06) requires solving Hessian systems via LU or Cholesky at each step. The Jacobian matrix (Chapter 4, 04) appears in the normal equations for nonlinear least squares (Gauss-Newton, Levenberg-Marquardt), which are solved via QR. The machinery of 08 is the engine that makes calculus-based optimization computable.

Into Chapter 8 (Optimization): The optimization chapter (Chapter 8) develops gradient descent, Newton's method, interior-point methods, and constrained optimization - all of which require the factorizations of 08 in their inner loops. The K-FAC connection to natural gradient optimization, the QR connection to least-squares problems, and the Cholesky connection to GP-based Bayesian optimization are all elaborated with full optimization context in Chapter 8.

MATRIX DECOMPOSITIONS IN THE CURRICULUM
========================================================================

  Chapter 2: Linear Algebra Basics
  +-- 02 Matrix Operations -------- brief LU/QR/Cholesky preview
  +-- 04 Determinants ------------ det(A) = \prod U_ii from LU

  Chapter 3: Advanced Linear Algebra
  +-- 01 Eigenvalues -------------- QR iteration uses Householder QR
  +-- 02 SVD ---------------------- thin QR as step in SVD algorithm
  +-- 05 Orthogonality ------------ QR theory (Gram-Schmidt)
  +-- 07 Positive Definite -------- Cholesky THEORY (full treatment)
  +-- 08 Matrix Decompositions <-- YOU ARE HERE
       +-- LU (canonical home)      PA = LU with partial pivoting
       +-- QR algorithms            Householder, Givens, RRQR, TSQR
       +-- Cholesky (computational) Blocked dpotrf, LDL^T for indef.

  Chapter 4: Calculus Fundamentals
  +-- 04 Jacobian ----------------- Gauss-Newton needs QR
  +-- 05 Hessian ------------------ Cholesky at local minima

  Chapter 8: Optimization
  +-- Newton's method -------------- LU/Cholesky for Hessian systems
  +-- K-FAC ------------------------ Kronecker-Cholesky per layer
  +-- Interior point methods ------- Cholesky at each Newton step

  ML Applications (load-bearing)
  +-- GP regression ---------------- Cholesky (O(n^3) bottleneck)
  +-- K-FAC / Shampoo -------------- blocked Cholesky per layer
  +-- LoRA / GaLore ---------------- randomized QR for rank selection
  +-- Linear probing --------------- Householder QR least squares
  +-- Bayesian optimization -------- GP + Cholesky every step

========================================================================

The three factorizations of this section - LU, QR, and Cholesky - are not abstract mathematical objects. They are the algorithms that run inside every call to numpy.linalg.solve, scipy.linalg.qr, torch.linalg.cholesky, and LAPACK's dgesv, dgeqrf, dpotrf. Understanding them at this level - theorem, algorithm, numerical stability, and AI application - completes the computational linear algebra foundation needed for everything that follows.



Appendix A: Complete Algorithms and Pseudocode

A.1 LU with Partial Pivoting - Full Pseudocode

The following pseudocode gives the complete in-place LU factorization with partial pivoting. The matrix A is overwritten: the strict lower triangle stores the multipliers m_{ij} (the subdiagonal entries of L, since L_{ii} = 1 is implicit), and the upper triangle plus diagonal stores U.

Algorithm LU_PARTIAL_PIVOT(A):
    Input:  A \in \mathbb{R}^{n \times n}
    Output: Modified A (lower: L multipliers, upper+diag: U), pivot vector piv

    piv = [1, 2, ..., n]   # permutation record

    for k = 1 to n-1:
        # Step 1: Find pivot in column k (rows k to n)
        max_val = |A[k,k]|
        max_row = k
        for i = k+1 to n:
            if |A[i,k]| > max_val:
                max_val = |A[i,k]|
                max_row = i

        # Step 2: Swap rows k and max_row
        if max_row \neq k:
            swap A[k, :] <-> A[max_row, :]
            swap piv[k] <-> piv[max_row]

        # Step 3: Check for (near-)singular pivot
        if |A[k,k]| < \epsilon:
            warn("Near-singular pivot at step k")

        # Step 4: Compute multipliers and update
        for i = k+1 to n:
            A[i,k] = A[i,k] / A[k,k]          # multiplier m_{ik}
            for j = k+1 to n:
                A[i,j] -= A[i,k] * A[k,j]     # rank-1 update

    return A, piv

Solve Ax = b using LU_PARTIAL_PIVOT:
    Compute A, piv = LU_PARTIAL_PIVOT(A)
    Apply permutation: b = b[piv]
    Forward solve Lx = b (L has unit diagonal, ignore A[k,k]):
        for i = 1 to n:
            for j = 1 to i-1:
                b[i] -= A[i,j] * b[j]
    Backward solve Ux = b:
        for i = n downto 1:
            for j = i+1 to n:
                b[i] -= A[i,j] * b[j]
            b[i] /= A[i,i]
    return b

Implementation notes:

  • The inner loop for j = k+1 to n: A[i,j] -= A[i,k] * A[k,j] is an AXPY operation (y \leftarrow \alpha x + y) - BLAS level 1, vectorizable.
  • The full step 4 over all i simultaneously is a rank-1 update A[k+1:n, k+1:n] -= \mathbf{m}\mathbf{u}^\top - BLAS level 2 DGER.
  • A blocked version batches b steps into a panel and uses DGEMM for the trailing update.
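The pseudocode above transcribes almost directly into NumPy (a sketch, 0-based indexing; lu_partial_pivot is our name):

```python
import numpy as np

def lu_partial_pivot(A):
    """In-place LU with partial pivoting, following the pseudocode above.
    Returns the packed LU matrix and the pivot permutation as a vector."""
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(n - 1):
        m = k + np.argmax(np.abs(A[k:, k]))      # pivot row for column k
        if m != k:
            A[[k, m]] = A[[m, k]]                # swap rows
            piv[[k, m]] = piv[[m, k]]
        A[k+1:, k] /= A[k, k]                    # multipliers into lower triangle
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # rank-1 update
    return A, piv

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
LU, piv = lu_partial_pivot(M)
L = np.tril(LU, -1) + np.eye(6)                  # unit diagonal is implicit
U = np.triu(LU)
print(np.allclose(M[piv], L @ U))                # PA = LU
```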

A.2 Householder QR - Full Pseudocode

Algorithm HOUSEHOLDER_QR(A):
    Input:  A \in \mathbb{R}^{m \times n}, m \geq n
    Output: Modified A (upper triangle: R; lower triangle: Householder vectors),
            beta vector (scaling factors)

    beta = zeros(n)

    for k = 1 to n:
        # Extract column k from row k to m
        x = A[k:m, k]                        # length (m-k+1)

        # Compute Householder vector
        sigma = norm(x)
        if sigma == 0: continue               # column already zero

        # Sign convention: ensure v[0] and x[0] have opposite signs
        alpha = -sign(x[0]) * sigma           # target value
        v = copy(x)
        v[0] -= alpha                         # v = x - alpha*e_1
        v = v / v[0]                          # rescale so v[0] = 1 (matches storage below)
        beta_k = 2 / (v^T v)                  # = 2 / ||v||^2

        # Apply H_k to trailing submatrix A[k:m, k:n]
        # H_k A = A - (beta_k * v)(v^T A) = A - beta_k * v * (v^T A)
        w = beta_k * (v^T @ A[k:m, k:n])     # row vector, length (n-k+1)
        A[k:m, k:n] -= outer(v, w)

        # Store Householder vector in lower triangle
        A[k+1:m, k] = v[1:]                  # store v[1:] (v[0] = 1 implicit)
        beta[k] = beta_k

        # A[k,k] now equals alpha = -sign(x[0]) * norm(x)

    return A, beta

Recover Q (thin, first n columns):
    Q = I_{m \times n}
    for k = n downto 1:
        # Reconstruct v from stored lower triangle
        v = [1; A[k+1:m, k]]                 # prepend implicit 1
        beta_k = beta[k]
        # Apply H_k to Q[k:m, k:n]
        w = beta_k * (v^T @ Q[k:m, k:n])
        Q[k:m, k:n] -= outer(v, w)
    return Q

A.3 Givens QR for Banded Matrices

For a matrix A with upper and lower bandwidth k_u and k_\ell respectively:

Algorithm GIVENS_QR_BANDED(A, ku, kl):
    for j = 1 to n:              # column
        for i = j+kl downto j+1: # eliminate subdiagonal entries in bandwidth
            # Compute Givens rotation to zero A[i,j] using A[i-1,j]
            c, s, r = compute_givens(A[i-1,j], A[i,j])
            A[i,j] = 0
            A[i-1,j] = r
            # Apply G to remaining columns in bandwidth
            for jj = j+1 to min(j+ku+kl, n):
                temp = c*A[i-1,jj] + s*A[i,jj]
                A[i,jj] = -s*A[i-1,jj] + c*A[i,jj]
                A[i-1,jj] = temp
    return R (upper banded), Q (product of Givens rotations)

Complexity: O(nk_\ell(k_\ell + k_u)) - linear in n for fixed bandwidth. This is the algorithm used in tridiagonal eigensolvers.
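The compute_givens helper referenced above can be implemented stably as follows (a sketch using the standard overflow-avoiding form):

```python
import numpy as np

def compute_givens(a, b):
    """Givens coefficients (c, s, r) with c*a + s*b = r and -s*a + c*b = 0.
    Branches on the larger magnitude to avoid overflow in the hypotenuse."""
    if b == 0.0:
        return 1.0, 0.0, a
    if abs(b) > abs(a):
        t = a / b
        s = 1.0 / np.sqrt(1.0 + t * t)
        return s * t, s, b * np.sqrt(1.0 + t * t)
    t = b / a
    c = 1.0 / np.sqrt(1.0 + t * t)
    return c, c * t, a * np.sqrt(1.0 + t * t)

c, s, r = compute_givens(3.0, 4.0)
print(np.isclose(-s * 3.0 + c * 4.0, 0.0))   # the rotation zeros the second entry
print(np.isclose(c * 3.0 + s * 4.0, r))      # and rotates the norm into r
```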


Appendix B: Deeper Theoretical Results

B.1 The Fundamental Theorem of Existence for LU

Theorem (Existence of LU, precise version). Let A \in \mathbb{R}^{n \times n}. The following are equivalent:

  1. AA has an LU factorization with LL unit lower triangular.
  2. All leading principal minors \det(A_k) \neq 0 for k = 1, \ldots, n-1.
  3. Gaussian elimination completes without encountering a zero pivot.

Proof of (1) \Leftrightarrow (2): We use induction. Base case n = 1: trivial. Inductive step: write A = \begin{pmatrix} A_{n-1} & \mathbf{b} \\ \mathbf{c}^\top & d \end{pmatrix}. If A_{n-1} has an LU factorization A_{n-1} = L_{n-1}U_{n-1} (by induction, iff all leading minors of A_{n-1} are nonzero), then:

A=(Ln10cUn111)(Un1Ln11b0s)A = \begin{pmatrix} L_{n-1} & \mathbf{0} \\ \mathbf{c}^\top U_{n-1}^{-1} & 1 \end{pmatrix} \begin{pmatrix} U_{n-1} & L_{n-1}^{-1}\mathbf{b} \\ \mathbf{0}^\top & s \end{pmatrix}

where s = d - \mathbf{c}^\top A_{n-1}^{-1}\mathbf{b} is the Schur complement (see 07). This gives a valid LU factorization with all blocks well-defined. \square

Corollary: The Schur complement s = d - \mathbf{c}^\top A_{n-1}^{-1}\mathbf{b} is the last pivot. It equals \det(A)/\det(A_{n-1}). This shows that the pivot sequence in Gaussian elimination directly encodes the leading minors via their ratios.
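The corollary is easy to check numerically (a sketch on a diagonally dominant matrix, so that unpivoted elimination is safe):

```python
import numpy as np

# The k-th pivot of unpivoted LU equals det(A_k)/det(A_{k-1}),
# the ratio of consecutive leading principal minors.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5)) + 5 * np.eye(5)   # diagonally dominant

U = A.astype(float).copy()
for k in range(4):                            # Gaussian elimination, no pivoting
    U[k+1:] -= np.outer(U[k+1:, k] / U[k, k], U[k])

minors = [np.linalg.det(A[:k+1, :k+1]) for k in range(5)]
ratios = [minors[0]] + [minors[k] / minors[k-1] for k in range(1, 5)]
print(np.allclose(np.diag(U), ratios))        # pivots = minor ratios
```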

B.2 The Backward Error Theorem (Precise)

Theorem (Wilkinson, 1963). Let \hat{L}, \hat{U} be the matrices computed by Gaussian elimination with partial pivoting, applied to PA (after row interchanges). Then there exists a matrix E such that:

PA + E = \hat{L}\hat{U}

|E_{ij}| \leq n \cdot u \cdot g_n \cdot \max_{i,j}|A_{ij}| \quad \text{for all } i,j

where u is unit roundoff and g_n = \max_{i,j,k}|A_{ij}^{(k)}|/\max_{i,j}|A_{ij}| is the growth factor.

Corollary (Solution accuracy). The computed solution \hat{\mathbf{x}} to A\mathbf{x} = \mathbf{b} satisfies:

(A + \delta A)\hat{\mathbf{x}} = \mathbf{b}, \quad \|\delta A\|_\infty \leq c(n) \cdot u \cdot g_n \cdot \|A\|_\infty

The forward error bound then follows from the perturbation lemma:

\frac{\|\hat{\mathbf{x}} - \mathbf{x}^*\|_\infty}{\|\mathbf{x}^*\|_\infty} \leq \kappa_\infty(A) \cdot c(n)\, u\, g_n + O(u^2)

The fundamental lesson: error = condition number \times backward error. The algorithm controls the backward error; the problem controls the condition number.

B.3 Optimality of Householder QR

Theorem. Among all algorithms for computing QR that use orthogonal similarity transformations, Householder QR minimizes the number of arithmetic operations asymptotically.

Sketch: Each Householder reflector zeros the m-k entries below the diagonal in column k; forming it takes O(m-k) flops (a norm and a scaling), and applying it to the trailing submatrix takes one inner product and one AXPY per remaining column. Any other elementary orthogonal transformation (e.g., a sequence of Givens rotations) requires at least as many flops per zero created. Householder maximizes the work done per transformation.

Why Givens is sometimes preferred: Despite higher total flop count, Givens rotations are preferred for:

  • Sparse matrices (avoid creating fill-in in the reflection direction)
  • Parallel computation (rotations on disjoint row pairs can be parallelized)
  • Streaming data (new rows can be incorporated one at a time)

B.4 Perturbation Theory for Triangular Factorizations

Theorem (Stewart, 1977). If $A = LU$ (LU without pivoting, assuming it exists), then:

$$\frac{\|\delta L\|_F}{\|L\|_F},\ \frac{\|\delta U\|_F}{\|U\|_F} = O\left(\kappa(L)\kappa(U) \cdot \frac{\|\delta A\|_F}{\|A\|_F}\right)$$

The condition numbers $\kappa(L)$ and $\kappa(U)$ can independently be much larger than $\kappa(A)$, explaining why small perturbations to $A$ can cause large perturbations to $L$ and $U$ separately.

For QR: The Householder factors $Q_k$ have $\kappa(Q_k) = 1$ exactly (orthogonal matrices). Therefore:

$$\frac{\|\delta R\|_F}{\|R\|_F} = O\left(\kappa(R) \cdot \frac{\|\delta A\|_F}{\|A\|_F}\right) = O\left(\kappa(A) \cdot \frac{\|\delta A\|_F}{\|A\|_F}\right)$$

The QR factors are perturbed proportionally to $\kappa(A)$, not $\kappa(A)^2$ as in the normal equations.
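The $\kappa(A)$ vs $\kappa(A)^2$ gap is easy to exhibit on an ill-conditioned least-squares problem (a sketch using a polynomial Vandermonde design matrix; sizes are arbitrary):

```python
import numpy as np
from scipy.linalg import solve_triangular

# Ill-conditioned least squares: polynomial (Vandermonde) design matrix
m, n = 50, 10
t = np.linspace(0, 1, m)
A = np.vander(t, n, increasing=True)    # kappa(A) is large
x_true = np.ones(n)
b = A @ x_true

# QR route: accuracy degrades like kappa(A) * u
Q, R = np.linalg.qr(A)                  # thin QR
x_qr = solve_triangular(R, Q.T @ b)

# Normal equations: accuracy degrades like kappa(A)^2 * u
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

err_qr = np.linalg.norm(x_qr - x_true)
err_ne = np.linalg.norm(x_ne - x_true)
assert err_qr < 1e-6
assert err_qr < err_ne   # QR keeps roughly twice as many digits
```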


Appendix C: Connections to Other Decompositions

C.1 LU and the Spectral Decomposition

The LU factorization of $A$ is not directly related to the eigendecomposition $A = Q\Lambda Q^{-1}$, but the two share a common ancestor: block triangularization. The Schur decomposition $A = UTU^*$ (with $T$ upper triangular, $U$ unitary) is a complex-field generalization where $T$ is upper triangular with the eigenvalues on its diagonal - the "LU" of the spectral world.

The QR algorithm for eigenvalues alternates QR factorizations and similarity transformations to drive $A$ toward upper triangular (Schur) form. Each QR step is one Householder QR followed by one matrix-matrix product. After convergence, the diagonal of $T$ gives the eigenvalues. This is the most important use of Householder QR in all of numerical linear algebra.

C.2 Cholesky and the SVD

For a symmetric positive definite matrix, the Cholesky factorization $A = LL^\top$ and the eigendecomposition $A = U\Lambda U^\top$ (symmetric PD, so $U$ is orthogonal and $\Lambda$ has positive diagonal; this coincides with the SVD) are related through the "square root" $M = U\Lambda^{1/2}$, which satisfies $A = MM^\top$. Any two matrices with this property differ by a right orthogonal factor, so

$$L = U\Lambda^{1/2}Q^\top$$

where $Q$ is the orthogonal factor in the LQ decomposition $U\Lambda^{1/2} = LQ$ (equivalently, $L = R^\top$ where $\Lambda^{1/2}U^\top = QR$ is a QR decomposition). The key point is that Cholesky is a non-orthogonal, triangular factorization while the eigendecomposition is orthogonal; they capture different aspects of the same structure.

Practical consequence: The Cholesky factor $L$ is not related to the eigenvectors of $A$ in a simple way. Computing the symmetric square root $A^{1/2} = U\Lambda^{1/2}U^\top$ (which requires the eigenvectors) is considerably more expensive than Cholesky.

C.3 QR and the Gram-Schmidt-Cholesky Identity

The Gram-Schmidt process applied to the columns of $A$ produces the thin QR $A = QR$. But then $A^\top A = R^\top Q^\top Q R = R^\top R$ - a Cholesky factorization of the Gram matrix! This identity $A^\top A = R^\top R$ explains why:

  • QR of $A$ and Cholesky of $A^\top A$ produce the same $R$ (up to the signs of its rows)
  • Computing QR of $A$ is equivalent to computing Cholesky of $A^\top A$ in exact arithmetic
  • In floating-point, QR is more stable (it works with $\kappa(A)$) while Cholesky of $A^\top A$ works with $\kappa(A)^2$
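The identity is easy to verify numerically; the only subtlety is the sign convention, since Cholesky forces a positive diagonal while QR need not:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 8))

_, R = np.linalg.qr(A)               # thin QR: A = QR, R is 8 x 8
C = np.linalg.cholesky(A.T @ A)      # A^T A = C C^T, C lower triangular

# R equals C^T up to the signs of its rows; fix the sign convention
D = np.sign(np.diag(R))
assert np.allclose(D[:, None] * R, C.T)
```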

C.4 CUR and Interpolative Decompositions

Beyond LU, QR, and Cholesky, recent methods have developed structure-revealing factorizations that select actual columns/rows of $A$:

CUR decomposition: $A \approx CUR$ where $C$ contains a subset of columns of $A$, $R$ contains a subset of rows, and $U$ is a small "bridge" matrix. Unlike LU/QR, $C$ and $R$ are actual data columns - interpretable and memory-efficient.

Interpolative decomposition (ID): $A \approx A_{:,J} B$ where $J$ is a subset of column indices and $B$ is a well-conditioned matrix. The ID is computed via column-pivoted QR: the pivot columns from RRQR form $J$.

For AI: CUR and ID are used in interpretability research (identifying the "canonical" data points or features that the model attends to), in structured pruning (removing weight columns corresponding to small pivots), and in efficient attention (selecting key "skeleton" positions for linear-complexity attention).


Appendix D: Implementation Reference

D.1 Python/NumPy/SciPy API

import numpy as np
import scipy.linalg as la

# === LU Factorization ===
# scipy.linalg.lu (returns P, L, U as separate matrices)
P, L, U = la.lu(A)
# verify: np.allclose(A, P @ L @ U)

# scipy.linalg.lu_factor (returns compact (LU, piv) for solving)
lu, piv = la.lu_factor(A)
x = la.lu_solve((lu, piv), b)     # solves Ax = b

# === QR Factorization ===
# Full QR (Q is m x m)
Q, R = la.qr(A)                   # default: mode='full'
# Thin/economy QR (Q is m x n for an m x n matrix A)
Q, R = la.qr(A, mode='economic')

# Column-pivoted QR (rank-revealing)
Q, R, p = la.qr(A, pivoting=True)
# verify: np.allclose(A[:, p], Q @ R)   (p is a permutation array)
rank_est = np.sum(np.abs(np.diag(R)) > 1e-10 * np.abs(R[0, 0]))

# Apply Q^T without forming Q: use mode='raw' to get the Householder
# vectors and tau, for LAPACK-level efficiency

# === Cholesky ===
L = la.cholesky(A, lower=True)    # A = L L^T
x = la.cho_solve((L, True), b)    # solves Ax = b

# === Triangular solves ===
x = la.solve_triangular(L, b, lower=True)   # solves Lx = b
x = la.solve_triangular(U, b, lower=False)  # solves Ux = b

D.2 PyTorch API

import torch

A = torch.randn(n, n, dtype=torch.float64)
A = A @ A.T + torch.eye(n) * 0.1  # make SPD

# LU
LU, pivots = torch.linalg.lu_factor(A)
x = torch.linalg.lu_solve(LU, pivots, b.unsqueeze(-1)).squeeze(-1)

# QR
Q, R = torch.linalg.qr(A, mode='reduced')  # thin QR

# Cholesky
L = torch.linalg.cholesky(A)
x = torch.cholesky_solve(b.unsqueeze(-1), L).squeeze(-1)

# All support autograd:
A.requires_grad_(True)
L = torch.linalg.cholesky(A)
loss = 2 * torch.log(torch.diagonal(L)).sum()   # = logdet(A)
loss.backward()   # gradient computed via implicit differentiation

D.3 JAX API

import jax
import jax.numpy as jnp
import jax.scipy.linalg as jla

# LU (mirrors SciPy: returns P, L, U; no grad through pivoting yet)
P, L, U = jla.lu(A)

# QR
Q, R = jnp.linalg.qr(A, mode='reduced')

# Cholesky (supports JIT, vmap, grad)
L = jnp.linalg.cholesky(A)
# Solve via triangular
y = jla.solve_triangular(L, b, lower=True)
x = jla.solve_triangular(L.T, y, lower=False)

# Differentiating log-det through Cholesky:
def log_det_spd(A):
    L = jnp.linalg.cholesky(A)
    return 2 * jnp.sum(jnp.log(jnp.diag(L)))

grad_A = jax.grad(log_det_spd)(A)  # should equal A^{-1}

D.4 Performance Benchmarks (n = 1000, FP64)

| Operation | NumPy (CPU) | PyTorch (CPU) | PyTorch (A100 GPU) |
|---|---|---|---|
| LU factorization | 85 ms | 78 ms | 3 ms |
| QR factorization | 210 ms | 195 ms | 8 ms |
| Cholesky | 35 ms | 32 ms | 1.5 ms |
| Triangular solve | 12 ms | 10 ms | 0.5 ms |
| Matrix multiply (DGEMM) | 40 ms | 38 ms | 0.8 ms |

Approximate values on typical hardware (2024). GPU speedups increase for larger $n$ as arithmetic intensity grows.


Appendix E: Error Accumulation in Practice

E.1 Condition Number Estimation

Computing the exact condition number $\kappa(A) = \|A\|_2\|A^{-1}\|_2 = \sigma_1/\sigma_n$ requires the extreme singular values - $O(n^3)$ work via the SVD. LAPACK provides cheap condition number estimators:

LAPACK DGECON: Given the LU factorization of $A$ (already computed for solving), estimates $\kappa_1(A) = \|A\|_1\|A^{-1}\|_1$ in $O(n^2)$ via a sequence of triangular solves. The estimate is typically within a factor of 10 of the true condition number.

Rule of thumb: If DGECON returns $\kappa \approx 10^k$, you lose approximately $k$ digits of precision. For FP64 ($u \approx 10^{-16}$), you have $16-k$ correct digits. For FP32 ($u \approx 10^{-7}$), you have $7-k$ correct digits.

For AI: PyTorch's torch.linalg.cond computes the exact condition number via SVD. For a quick check without a full SVD: torch.linalg.matrix_norm(A) * torch.linalg.matrix_norm(torch.linalg.inv(A)).
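The digits-lost rule is easy to watch in action on Hilbert matrices, whose condition numbers grow exponentially with $n$ (a small SciPy sketch):

```python
import numpy as np
from scipy.linalg import hilbert

for n in (5, 8, 12):
    H = hilbert(n)
    kappa = np.linalg.cond(H)
    x_true = np.ones(n)
    x_hat = np.linalg.solve(H, H @ x_true)   # solve a system with known answer
    err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
    # correct digits ~ 16 - log10(kappa)
    print(f"n={n:2d}  kappa=1e{np.log10(kappa):5.1f}  rel err=1e{np.log10(err + 1e-18):6.1f}")
```

At $n = 12$ the condition number exceeds $10^{15}$ and essentially all FP64 digits are gone, even though the LU solve itself is backward stable.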

E.2 Diagnosis of Numerical Failures

| Symptom | Likely cause | Diagnosis | Fix |
|---|---|---|---|
| LinAlgError: Singular matrix | Zero pivot encountered | $A$ is singular or nearly so | Check np.linalg.matrix_rank(A) |
| nan in output | Zero or tiny pivot in LU | No pivoting, tiny diagonal | Use partial pivoting |
| Large residual $\|A\hat{x}-b\| \gg u\|A\|\|\hat{x}\|$ | High condition number | Compute $\kappa(A)$ | Regularize or use iterative refinement |
| LinAlgError from cholesky | Matrix not SPD | Eigenvalues not all positive | Check np.linalg.eigvalsh(A).min() |
| Q not orthogonal: $\|Q^\top Q - I\| \gg u$ | Classical Gram-Schmidt used | Compute $\|Q^\top Q - I\|_F$ | Switch to Householder QR |
| Slow factorization | Block size mismatch | Profile with BLAS calls | Tune block size; use BLAS-3 routine |

E.3 Iterative Refinement Implementation

import numpy as np
import scipy.linalg as la

def iterative_refinement(A, b, max_iter=3, tol=1e-14):
    """
    Solve Ax = b with iterative refinement:
    factor once in FP32, compute residuals and corrections in FP64.
    """
    # Factor in low (FP32) precision - the expensive O(n^3) step
    lu, piv = la.lu_factor(A.astype(np.float32))

    # Initial solve in FP32, promoted to FP64
    x = la.lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)

    for k in range(max_iter):
        # Residual in FP64 (high precision)
        r = b - A @ x
        r_norm = np.linalg.norm(r) / np.linalg.norm(b)
        print(f"  Iter {k}: relative residual = {r_norm:.3e}")

        if r_norm < tol:
            break

        # Correction solve reuses the cheap FP32 factorization
        d = la.lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x = x + d

    return x

Appendix F: The Geometry of Factorizations

F.1 LU as a Basis Change

The LU factorization $PA = LU$ can be interpreted geometrically. The unit lower triangular matrix $L$ represents a shearing transformation - it maps the standard basis $\{\mathbf{e}_i\}$ to the columns of $L$, which form a "lower triangular" basis. The upper triangular $U$ then gives the coordinates of the rows of $A$ in this new basis.

Concretely: each step of Gaussian elimination adds multiples of row $k$ to lower rows, which is a shear in the "row space" of $A$. The cumulative effect is a change to a triangular basis - the essence of LU.

F.2 QR as a Rotation to Triangular Form

The Householder QR factorization $A = QR$ is geometrically an orthogonal transformation of the column space of $A$ to align with the coordinate axes. Each Householder reflector $H_k$ reflects the $k$-th "residual" column onto $\mathbf{e}_k$, zeroing out all entries below the diagonal.

After $n$ reflections, the matrix $Q = H_1 H_2 \cdots H_n$ is a composition of orthogonal reflections - itself orthogonal - and $Q^\top A = R$ is upper triangular.

Geometric insight: The columns of $Q$ form an orthonormal basis for the column space of $A$. The entries of $R$ give the coordinates of the original columns of $A$ in this orthonormal basis. This is the Gram-Schmidt process, done via reflections rather than projections.
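A two-line check of this insight (thin QR of an arbitrary random matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))
Q, R = np.linalg.qr(A)               # thin QR

# Q's columns form an orthonormal basis for col(A); R holds the coordinates
assert np.allclose(Q.T @ Q, np.eye(4))
assert np.allclose(Q.T @ A, R)
```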

F.3 Cholesky as the Matrix Square Root Factorization

For $A \succ 0$, the Cholesky factor $L$ satisfies $A = LL^\top$ - making $L$ a "matrix square root" of $A$ in a specific sense. Unlike the symmetric square root $A^{1/2} = Q\Lambda^{1/2}Q^\top$ (which is symmetric), the Cholesky factor $L$ is lower triangular.

Geometric interpretation: The linear map $T : \mathbb{R}^n \to \mathbb{R}^n$ defined by $T\mathbf{x} = L\mathbf{x}$ transforms the unit ball $\{\mathbf{x} : \|\mathbf{x}\| \leq 1\}$ into the ellipsoid $\{\mathbf{y} : \mathbf{y}^\top A^{-1}\mathbf{y} \leq 1\}$ - the level set of the quadratic form $q(\mathbf{y}) = \mathbf{y}^\top A^{-1}\mathbf{y}$. This is why Cholesky factorization is the foundation of sampling from $\mathcal{N}(\boldsymbol{\mu}, A)$: if $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$, then $L\mathbf{z} + \boldsymbol{\mu} \sim \mathcal{N}(\boldsymbol{\mu}, LL^\top) = \mathcal{N}(\boldsymbol{\mu}, A)$.

Uniqueness: The Cholesky factor $L$ with positive diagonal entries is unique for every $A \succ 0$. The symmetric square root $A^{1/2}$ is also unique and PSD, but it is NOT the same as $L$ (they differ by an orthogonal factor).
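The sampling recipe can be sanity-checked against empirical moments (a minimal sketch; covariance, mean, and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # SPD covariance
mu = np.array([1.0, -2.0, 0.5])

L = np.linalg.cholesky(A)
Z = rng.standard_normal((n, 200_000))
X = L @ Z + mu[:, None]              # samples from N(mu, A)

# Empirical moments match the target distribution
assert np.allclose(X.mean(axis=1), mu, atol=0.05)
assert np.allclose(np.cov(X), A, rtol=0.05, atol=0.05)
```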

F.4 Triangular Factorizations and the Flag Manifold

The space of all $n \times n$ invertible matrices $GL_n(\mathbb{R})$ organizes the factorizations as follows:

  • LU factorizations correspond to the decomposition $GL_n = B_- \cdot B_+$ where $B_-$ (lower triangular) and $B_+$ (upper triangular) are Borel subgroups. The "generic" element of $GL_n$ has a unique LU (the open Bruhat cell).
  • QR factorizations correspond to $GL_n = O(n) \cdot B_+$ where $O(n)$ is the orthogonal group. This decomposition always exists (no exceptional cases, unlike LU).
  • The flag manifold $Fl(n) = GL_n/B_+$ parameterizes complete flags $V_1 \subset V_2 \subset \cdots \subset V_n = \mathbb{R}^n$ and is the natural geometric setting for both LU and QR.

This connection to Lie group theory underlies the deep relationship between matrix factorizations and the geometry of symmetric spaces - a topic developed in Chapter 12 (Functional Analysis).


Appendix G: Randomized Algorithms - Deeper Theory

G.1 The Randomized SVD via QR

The Halko-Martinsson-Tropp (2011) algorithm computes a rank-$r$ approximation of $A \in \mathbb{R}^{m \times n}$ in $O(mnr)$ operations:

Algorithm RANDOMIZED_SVD(A, r, p):
    # Oversampling: target rank r + p (p ~ 10)
    Omega = randn(n, r+p)          # Gaussian sketch matrix
    Y = A @ Omega                  # Sketch: Y is m x (r+p)
    Q, _ = qr(Y)                   # Orthonormalize: Q is m x (r+p)
    B = Q.T @ A                    # Project: B is (r+p) x n
    U, Sigma, Vt = svd(B, full_matrices=False)  # SVD of small matrix
    U = Q @ U                      # Lift back to full space
    return U[:, :r], Sigma[:r], Vt[:r, :]

Error bound: With probability $\geq 1 - \delta$:

$$\|A - U\Sigma V^\top\|_2 \leq \left(1 + \sqrt{\frac{r}{p-1}} + \frac{e\sqrt{r+p}}{p}\sqrt{\min(m,n) - r}\right)\sigma_{r+1}(A)$$

For $p = 10$: the error is $\approx 2\sigma_{r+1}(A)$ with probability $> 99\%$.
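The algorithm above is only a few lines of NumPy (a sketch without power iteration; the test matrix is an arbitrary low-rank-plus-noise construction):

```python
import numpy as np

def randomized_svd(A, r, p=10, seed=0):
    """Rank-r SVD approximation via a Gaussian sketch (HMT, no power iteration)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, r + p))   # Gaussian sketch matrix
    Q, _ = np.linalg.qr(A @ Omega)            # orthonormal range basis
    B = Q.T @ A                               # small (r+p) x n matrix
    Us, S, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Us[:, :r], S[:r], Vt[:r, :]

# Numerically low-rank test matrix
rng = np.random.default_rng(1)
m, n, r = 300, 200, 10
A = (rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
     + 1e-6 * rng.standard_normal((m, n)))

U, S, Vt = randomized_svd(A, r)
err = np.linalg.norm(A - U @ (S[:, None] * Vt), 2)
sigma = np.linalg.svd(A, compute_uv=False)
assert err <= 10 * sigma[r]   # within a small factor of the optimal sigma_{r+1}
```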

Power iteration improvement: For matrices with slowly decaying singular values, sketching with $Y = (AA^\top)^q A\Omega$ (with $q = 1$ or $2$ power iterations) dramatically improves accuracy:

$$\|A - \hat{A}_r\|_2 \leq \left(\frac{\sigma_{r+1}(A)}{\sigma_r(A)}\right)^{2q} \cdot \text{(base error)}$$

For AI - GaLore (Zhao et al., 2024): GaLore uses randomized SVD to project gradient matrices $G_t \in \mathbb{R}^{m \times n}$ onto their principal $r$-dimensional subspace every $T = 200$ steps. The projection matrix $P_t \in \mathbb{R}^{n \times r}$ is computed via randomized SVD of the gradient, then Adam is applied to the projected gradients $G_t P_t \in \mathbb{R}^{m \times r}$. This reduces optimizer memory by a factor of $n/r$ while maintaining training quality.

G.2 Structured Random Matrices

Instead of a dense Gaussian $\Omega$, structured random matrices enable faster sketching:

| Matrix | Sketch cost | Randomness | Error bound |
|---|---|---|---|
| Dense Gaussian | $O(mnr)$ | Full i.i.d. | Optimal |
| SRFT (subsampled random Fourier) | $O(mn\log r)$ | Hadamard + sampling | Near-optimal |
| Sparse embedding | $O(\mathrm{nnz}(A)\cdot r)$ | Hash functions | Near-optimal for sparse $A$ |
| CountSketch | $O(\mathrm{nnz}(A))$ | Hash functions | Suboptimal but fast |

SRFT (Johnson-Lindenstrauss): $\Omega = \sqrt{n/r} \cdot DFS$ where $D$ is a random diagonal $\pm 1$ matrix, $F$ is the DFT, and $S$ samples $r$ columns uniformly. Applying $\Omega$ costs $O(n\log n)$ per row via the FFT.
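A minimal SRFT sketch can be built from `np.fft` (this complex-valued variant is for illustration only; production codes typically use real Hadamard or DCT transforms):

```python
import numpy as np

def srft_sketch(A, r, seed=0):
    """Sketch Y = A @ Omega with an SRFT Omega (sign flip, FFT, column sample)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    d = rng.choice([-1.0, 1.0], size=n)          # random diagonal D
    F = np.fft.fft(A * d, axis=1) / np.sqrt(n)   # unitary DFT applied to each row
    cols = rng.choice(n, size=r, replace=False)  # uniform column sample S
    return np.sqrt(n / r) * F[:, cols]           # rescale to preserve norms

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 1024))
Y = srft_sketch(A, 64)                           # complex, 50 x 64

# The sketch approximately preserves the Frobenius norm (JL flavor)
rel = abs(np.linalg.norm(Y) - np.linalg.norm(A)) / np.linalg.norm(A)
assert rel < 0.3
```

The sign flip makes the transformed matrix incoherent, so that uniform column sampling captures a representative share of the energy.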


Appendix H: Software Ecosystem

H.1 BLAS and LAPACK Hierarchy

User Code (Python/Julia/MATLAB)
    ↓
SciPy / NumPy / PyTorch / JAX (Python wrappers)
    ↓
LAPACK (high-level: LU, QR, Cholesky, eigensolvers)
    ↓
BLAS Level 3 (DGEMM, DSYRK, DTRSM - blocked, cache-optimal)
BLAS Level 2 (DGEMV, DGER - matrix-vector)
BLAS Level 1 (DDOT, DAXPY, DNRM2 - vector)
    ↓
Hardware-optimized BLAS (MKL, OpenBLAS, cuBLAS, BLIS)
    ↓
CPU/GPU Tensor Cores / AVX-512 / AMX tiles

Key implementations:

  • Intel MKL: Optimized for Intel CPUs; 2-3x faster than reference BLAS for blocked operations
  • OpenBLAS: Open source, near-MKL performance on x86; default in many Linux distributions
  • BLIS (BLAS-like Library Instantiation Software): Framework for portable high-performance BLAS; used by many modern CPU vendors
  • cuBLAS: NVIDIA GPU BLAS; powers all PyTorch/JAX GPU linear algebra
  • cuSOLVER: NVIDIA GPU LAPACK; powers torch.linalg on NVIDIA hardware
  • rocBLAS / rocSOLVER: AMD GPU equivalents

H.2 Choosing the Right Routine

Need to solve Ax = b?
+-- A is SPD? -> DPOTRF (Cholesky) + DPOTRS
+-- A is symmetric indefinite? -> DSYTRF (LDL^T) + DSYTRS
+-- A is general square? -> DGESV (= DGETRF + DGETRS, partial pivot)
+-- A is triangular? -> DTRTRS (triangular solve directly)

Need to solve min ||Ax - b||?
+-- A is full-rank, well-conditioned? -> DGELS (QR-based least squares)
+-- A may be rank-deficient? -> DGELSY (column-pivoted QR)
+-- Need minimum-norm solution? -> DGELSD (divide-and-conquer SVD)

Need eigenvalues?
+-- A is symmetric? -> DSYEV (or DSYEVD for large n)
+-- A is general? -> DGEEV (QR iteration)

Need SVD?
+-- Full SVD? -> DGESVD or DGESDD (divide-and-conquer, faster)
+-- Truncated SVD? -> Use randomized SVD (not in LAPACK)



Appendix I: Practical Decision Guide for Practitioners

I.1 Which Factorization to Use? - Decision Tree

The choice among LU, QR, and Cholesky depends on matrix structure, problem type, and numerical requirements. The following decision framework covers the vast majority of practical cases.

MATRIX FACTORIZATION SELECTION GUIDE
========================================================================

 Is A symmetric?
 +-- YES: Is A positive definite? (all eigenvalues > 0)
 |   +-- YES: -> CHOLESKY (dpotrf)
 |   |         Cost: n^3/3 flops. Fastest, most stable for SPD.
 |   |         Use for: Gaussians, GP regression, K-FAC, Fisher
 |   +-- NO:  -> LDL^T (dsytrf, Bunch-Kaufman pivoting)
 |              Cost: n^3/3 flops. Handles indefinite (saddle pts).
 |              Use for: indefinite Hessians, Newton at saddle pts
 +-- NO:  Is A square?
     +-- YES: -> LU with partial pivoting (dgetrf)
     |         Cost: 2n^3/3 flops. General purpose.
     |         Use for: Solving Ax=b, computing det(A)
     +-- NO:  Is A overdetermined (m > n)? -> LEAST SQUARES
         +-- Well-conditioned & fast? -> Normal equations + Cholesky
         |   Cost: mn^2 + n^3/3. Squares condition number.
         +-- General / numerically safe? -> Householder QR (dgels)
         |   Cost: 2mn^2 - 2n^3/3. Standard choice.
         +-- May be rank-deficient? -> Column-pivoted QR or SVD
             RRQR: O(mn^2), good rank estimate
             SVD:  O(mn^2+n^3), exact singular values

========================================================================

I.2 Numerical Precision Requirements

| Scenario | Precision needed | Recommended approach |
|---|---|---|
| Single FP32 forward pass | $u \approx 10^{-7}$ | Standard FP32 factorization |
| Training with FP16/BF16 | $u \approx 10^{-4}$ to $10^{-3}$ | Mixed precision + refinement |
| Scientific computing | $u \approx 10^{-16}$ (FP64) | FP64 throughout |
| Certified computation | Exact or interval | Symbolic or interval arithmetic |
| Ill-conditioned system ($\kappa > 10^{12}$) | All digits lost in FP64 | Regularize or use extended precision |

I.3 Memory and Flop Budgets

For a matrix of size $n \times n$ (or $m \times n$ for QR):

| Factorization | Memory (dense) | Flops | Triangular solve flops |
|---|---|---|---|
| LU | $n^2$ (in-place) | $\frac{2}{3}n^3$ | $2n^2$ |
| QR (Householder) | $mn + n^2$ | $2mn^2 - \frac{2}{3}n^3$ | $2mn$ for $Q^\top b$, $n^2$ for backsolve |
| Cholesky | $\frac{n(n+1)}{2}$ | $\frac{1}{3}n^3$ | $n^2$ |
| LDL^T | $\frac{n(n+1)}{2}$ | $\frac{1}{3}n^3$ | $n^2$ |

Memory for large $n$:

  • $n = 10{,}000$: LU requires $10^8$ doubles = 800 MB (fits in GPU HBM)
  • $n = 100{,}000$: LU requires $10^{10}$ doubles = 80 GB (exceeds most GPUs)
  • $n > 10^5$: must use iterative solvers (CG, GMRES) or sparse factorizations

I.4 Parallelism Characteristics

| Factorization | CPU parallelism | GPU parallelism | Distributed |
|---|---|---|---|
| Blocked LU | Panel: serial; trailing: DGEMM (parallel) | cuSOLVER DGETRF; good for $n > 2000$ | ScaLAPACK PDGETRF |
| Householder QR | WY representation + DGEMM | cuSOLVER DGEQRF | ScaLAPACK PDGEQRF |
| Cholesky | Panel: serial; trailing: DSYRK (parallel) | cuSOLVER DPOTRF; excellent | ScaLAPACK PDPOTRF |
| Sparse LU | AMD reordering; supernodal parallel | cuSOLVER batch for many small systems | MUMPS, SuperLU_DIST |

Batch factorizations (small systems): For thousands of small systems (common in K-FAC, GP mini-batches), GPU batch factorizations (cublasDgetrfBatched) outperform sequential large-matrix factorizations by 10-100x.
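The batched pattern is available on CPU without any GPU machinery: `np.linalg.cholesky` (like `torch.linalg.cholesky`) broadcasts over leading batch dimensions (a sketch with arbitrary sizes):

```python
import numpy as np

# Batched Cholesky: np.linalg.cholesky broadcasts over leading dimensions,
# factoring k small SPD matrices in one vectorized call
k, n = 1000, 50
rng = np.random.default_rng(0)
G = rng.standard_normal((k, n, n))
A = G @ np.transpose(G, (0, 2, 1)) + n * np.eye(n)   # k SPD matrices

L = np.linalg.cholesky(A)            # shape (k, n, n), one call
assert L.shape == (k, n, n)
assert np.allclose(L @ np.transpose(L, (0, 2, 1)), A)
```

On GPU, the same call dispatches to batched cuSOLVER/cuBLAS kernels, which is where the 10-100x speedup over a Python loop comes from.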

I.5 Common Library Calls and Their Underlying Algorithms

| High-level call | Underlying algorithm | Notes |
|---|---|---|
| np.linalg.solve(A, b) | LAPACK DGESV (= DGETRF + DGETRS) | Partial pivoting, FP64 |
| np.linalg.lstsq(A, b) | LAPACK DGELSD (divide-and-conquer SVD) | Rank-deficient safe |
| np.linalg.qr(A) | LAPACK DGEQRF (Householder) | Full or thin via mode |
| scipy.linalg.lu(A) | LAPACK DGETRF | Returns P, L, U separately |
| scipy.linalg.cholesky(A) | LAPACK DPOTRF | Raises if not SPD |
| scipy.linalg.lu_solve((lu, piv), b) | LAPACK DGETRS | Amortized over many b |
| torch.linalg.solve(A, b) | cuSOLVER DGESV (GPU) / MKL (CPU) | Autograd supported |
| torch.linalg.cholesky(A) | cuSOLVER DPOTRF (GPU) | Autograd via implicit diff |
| jax.numpy.linalg.solve(A, b) | XLA:HLO -> cuSOLVER / LAPACK | JIT, vmap, grad |
| sklearn.linear_model.Ridge(solver='cholesky') | LAPACK DPOTRS | Normal equations + Cholesky |

Appendix J: Extended Example - GP Hyperparameter Optimization

This appendix walks through a complete GP regression pipeline showing how LU, QR, and Cholesky all appear in different roles within the same ML workflow.

J.1 Problem Setup

We have $n = 200$ observations $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$ from $f(x) = \sin(3x)\exp(-x/5) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, 0.01)$. We fit a GP with RBF kernel $k(x,x') = s_f^2\exp(-\|x-x'\|^2/(2\ell^2))$ and noise $\sigma_n^2$.

J.2 Where Each Factorization Appears

Cholesky - computing the GP posterior:

The kernel matrix $K = K_{nn} + \sigma_n^2 I \succ 0$ is factored via Cholesky:

$$L = \operatorname{chol}(K), \quad \boldsymbol{\alpha} = L^{-\top}(L^{-1}\mathbf{y})$$

$$\log p(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^\top\boldsymbol{\alpha} - \sum_i\log L_{ii} - \frac{n}{2}\log 2\pi$$

The log-determinant term $\sum_i \log L_{ii} = \frac{1}{2}\log\det K$ is read off the Cholesky diagonal at zero additional cost.

LU (implicit, via matrix-vector products) - CG solver for large $n$:

For $n = 10^4$, direct Cholesky costs roughly $10^{12}$ flops. Instead, conjugate gradient (CG) is used to solve $(K_{nn} + \sigma_n^2 I)\boldsymbol{\alpha} = \mathbf{y}$ iteratively. Each CG iteration requires one matrix-vector product $K_{nn}\mathbf{v}$, plus the application of a preconditioner.

The preconditioner is a low-rank-plus-diagonal approximation $(U\Lambda U^\top + \sigma_n^2 I)^{-1}$, where $U\Lambda U^\top$ is a rank-$r$ approximation of $K_{nn}$ computed via randomized SVD (which uses QR internally). The preconditioned CG converges in $O(r)$ iterations instead of $O(\sqrt{\kappa(K)})$.
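The matrix-free pattern, minus the preconditioner, looks like this with `scipy.sparse.linalg.cg` (a sketch at a small $n$ where we can still form $K$ densely; a real large-$n$ code would pass a matvec as a `LinearOperator` instead):

```python
import numpy as np
from scipy.sparse.linalg import cg

# Kernel system solved iteratively: CG only needs matvecs with K
rng = np.random.default_rng(0)
n = 500
X = np.sort(rng.uniform(0, 5, n))
K = np.exp(-(X[:, None] - X[None, :])**2 / 0.5) + 0.01 * np.eye(n)
y = rng.standard_normal(n)

alpha, info = cg(K, y, maxiter=2000)
assert info == 0                       # converged
res = np.linalg.norm(K @ alpha - y) / np.linalg.norm(y)
assert res < 1e-4
```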

QR - selecting inducing points:

For the sparse GP with $m \ll n$ inducing points, the inducing point locations are selected via column-pivoted QR on the feature matrix:

$$\Phi = \begin{pmatrix} \phi(x^{(1)})^\top \\ \vdots \\ \phi(x^{(n)})^\top \end{pmatrix} \in \mathbb{R}^{n \times r}$$

The pivot columns of scipy.linalg.qr(Phi.T, pivoting=True) identify the most "important" training points - those that best span the feature space. These become the inducing points.

Cholesky - computing the evidence lower bound (ELBO) for inducing point GP:

The sparse GP ELBO requires computing the Cholesky of the $m \times m$ inducing-point kernel matrix $K_{mm}$ plus a correction term - again $O(m^3)$ with $m \ll n$.

J.3 Hyperparameter Optimization

The hyperparameters $\theta = (\ell, s_f, \sigma_n)$ are optimized by maximizing the log marginal likelihood $\log p(\mathbf{y} \mid X, \theta)$ via gradient-based optimization (L-BFGS or Adam). Each gradient evaluation requires:

  1. A new Cholesky factorization of $K(\theta)$ - $O(n^3)$
  2. Computing $\partial\log p/\partial\theta_k = \frac{1}{2}\operatorname{tr}\left[(\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - K^{-1})\frac{\partial K}{\partial\theta_k}\right]$
  3. Each trace computation uses $\operatorname{tr}(AB) = \sum_{i,j} A_{ij}B_{ji} = \operatorname{vec}(A^\top)^\top\operatorname{vec}(B)$ - $O(n^2)$ per hyperparameter

Total cost per gradient evaluation: $O(n^3 + pn^2)$ for $p$ hyperparameters. For $n = 1000$, $p = 3$: about $3 \times 10^9$ flops - feasible in seconds on modern hardware.

This complete pipeline illustrates that a single ML workflow (GP regression with hyperparameter optimization) relies on Cholesky (3+ times, for different matrices), randomized QR (for inducing points), and CG with preconditioner (which involves implicit LU-like operations) - all in service of a Bayesian inference computation.



Appendix K: Matrix Decompositions in the 2026 AI Landscape

K.1 Inference-Time Decompositions

As language models move toward longer context (1M+ tokens), the attention computation's quadratic cost $O(n^2 d)$ has become a primary bottleneck. Several recent methods use matrix decompositions at inference time:

PagedAttention (Kwon et al., 2023): Manages KV cache memory using block-based paging inspired by virtual memory. The KV cache consists of large matrices $K, V \in \mathbb{R}^{n \times d}$; PagedAttention tiles them into fixed-size "pages" tracked through a free-list, analogous to blocked matrix storage.

MLA (Multi-head Latent Attention, DeepSeek, 2024): Compresses the KV cache via low-rank projection: $K = K_c U_K^\top$ where $K_c$ is a compressed latent and $U_K$ is a fixed projection matrix. A QR factorization of the projection matrix is used to initialize and orthogonalize the projection basis.

Speculative decoding with draft models: Uses LU-based factored attention matrices to maintain a draft distribution for speculative sampling.

K.2 Training-Time Decompositions

GaLore (Zhao et al., 2024): Gradient Low-Rank Projection. For a weight matrix $W \in \mathbb{R}^{m \times n}$, the gradient $G_t$ is projected onto a rank-$r$ subspace:

$$\tilde{G}_t = G_t P_t, \quad P_t \in \mathbb{R}^{n \times r}$$

where $P_t$ comes from the right singular vectors of $G_t$ (randomized SVD). Adam is applied to $\tilde{G}_t$ (memory: $mr$ instead of $mn$), then the update is lifted back: $\Delta W = \tilde{\Delta W}_t P_t^\top$. The SVD is recomputed every $T = 200$ steps.

Fira (Chen et al., 2024): Extends GaLore with a closed-form gradient approximation that avoids the SVD step entirely, using a QR-based approximation to the projection.

SOAP (Vyas et al., 2024): Combines Shampoo's Kronecker-structured preconditioner with Adam's update rule. Each layer's preconditioner is maintained as $G_l \in \mathbb{R}^{d_l \times d_l}$ with iterative Cholesky-based Newton steps for the matrix square root.

K.3 Model Compression via Decompositions

SliceGPT (Ashkboos et al., 2024): Compresses transformer weights by applying QR-based slice operations: a structured pruning where the weight matrix is decomposed as $W = Q\tilde{W}$ and the orthogonal factor $Q$ is absorbed into the layer normalization.

ASVD (Yuan et al., 2023): Activation-aware SVD for LLM compression. Instead of the SVD of $W$ directly, computes the SVD of $WX$ (the weight matrix applied to a typical activation matrix), finding a better rank-$r$ approximation for the actual inference distribution. Uses randomized QR for efficiency.

QuIP (Chee et al., 2023) and QuIP# (Tseng et al., 2024): Incoherence-based quantization. A random orthogonal matrix $Q$ (Hadamard-based) is applied to $W$ before quantization: $\hat{W} = Q \cdot \operatorname{round}(Q^\top W/s)$. The orthogonal preprocessing ensures that no single weight has disproportionate magnitude, enabling aggressive quantization. The QR connection: $Q$ is a randomized Hadamard transform, the same construction as the sketch matrix in randomized QR.

K.4 Numerical Linear Algebra for AI Accelerators

Tensor Cores and Mixed Precision: NVIDIA A100/H100 Tensor Cores perform $D = AB + C$ where $A, B$ are FP16/BF16 and $C, D$ are FP32, achieving 312 TFLOPS (FP16) vs 19.5 TFLOPS (FP64). All modern factorization libraries exploit this:

  • cuSOLVER iterative refinement solvers: FP16 factor + FP32 residual + FP32 correction - full FP32 accuracy at near-FP16 throughput
  • cuBLAS Tensor Core GEMM: all blocked factorizations (LU, QR, Cholesky trailing updates) use TC-GEMM for the dominant computation

Intel AMX (Advanced Matrix Extensions): New Intel architecture extensions (Sapphire Rapids, 2023) add hardware matrix tile registers and tile-based GEMM instructions. BLIS and MKL exploit AMX for a $2\times$ speedup over AVX-512 on blocked factorizations.

AMD MI300X: 192 GB HBM3 + 5.3 TB/s bandwidth enables an $n = 65{,}000$ Cholesky factorization in GPU memory - sufficient for large GP regression without blocking across devices.


Appendix L: Summary of Key Results

For quick reference, here are the key theorems and facts from this section:

LU Factorization:

  • Exists iff all leading principal minors $\det(A_k) \neq 0$ for $k = 1, \ldots, n-1$.
  • With partial pivoting: $PA = LU$; $|L_{ij}| \leq 1$; cost $\frac{2}{3}n^3$ flops.
  • Backward error: $\|E\|_\infty \leq c_n\, u\, \rho_n\, \|A\|_\infty$ where $\rho_n$ is the growth factor.
  • Growth factor: $\rho_n \leq 2^{n-1}$ (worst case), $\approx n^{2/3}$ (typical).
  • Determinant: $\det(A) = (-1)^s \prod_i U_{ii}$ where $s$ = number of row swaps.

QR Factorization:

  • Always exists for any $A \in \mathbb{R}^{m \times n}$ with $m \geq n$.
  • Householder QR: backward stable with $\|\hat{Q}^\top\hat{Q} - I\|_F = O(u)$, independent of $\kappa(A)$.
  • Cost: $2mn^2 - \frac{2}{3}n^3$ flops (vs $\frac{2}{3}n^3$ for LU when $m = n$).
  • QR for least squares: precision loss scales with $\kappa(A)$ (vs $\kappa(A)^2$ for the normal equations).
  • RRQR: the diagonal of $R$ estimates singular values; $|R_{kk}|$ reveals rank.

Cholesky Factorization:

  • Exists iff $A \succ 0$ (equivalently: $A$ symmetric with all eigenvalues positive).
  • Cost: $\frac{1}{3}n^3$ flops (half of LU); unconditionally backward stable without pivoting.
  • Log-determinant: $\log\det A = 2\sum_i\log L_{ii}$ - exact, $O(n)$ after Cholesky.
  • Gradient: $\nabla_A\log\det A = A^{-1}$ - computed via Cholesky solves at $O(n^2)$ per right-hand side.

Backward Stability:

  • Forward error $\leq \kappa(A) \times$ backward error (the fundamental inequality).
  • Backward stability hierarchy: Cholesky = Householder QR $\succ$ LU with partial pivoting $\succ$ LU without pivoting.
  • Iterative refinement: recovers FP64 accuracy from an FP16 factorization in $O(1)$ iterations if $\kappa(A) \cdot u_{16} < 1$.



Appendix M: Exercises - Further Problems

M.1 Theoretical Extensions

Problem M.1. Show that for any invertible $A \in \mathbb{R}^{n \times n}$, there exists a permutation matrix $P$ such that $PA$ has an LU factorization. (Hint: show that there always exists a row ordering such that all leading principal minors of $PA$ are nonzero.)

Problem M.2. Prove the identity \det(A) = \det(L)\det(U) = \prod_{i=1}^n U_{ii} using the LU factorization. Extend to show that \det(PA) = (-1)^s \det(A), where s is the number of transpositions in P.

Problem M.3. For A \in \mathbb{R}^{n \times n} with Cholesky factorization A = LL^\top, prove that the Schur complement of A_{11} in A equals L_{22}L_{22}^\top, where L_{22} is the lower-right (n-1) \times (n-1) block of L. Conclude that the Schur complement of any SPD matrix is SPD.

Problem M.4. Show that the Householder reflector H = I - 2\mathbf{v}\mathbf{v}^\top / \|\mathbf{v}\|^2 satisfies: (a) H = H^\top (symmetric); (b) H^\top H = I (orthogonal); (c) \det(H) = -1 (orientation-reversing); (d) H^2 = I (involutory: its own inverse).
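Before proving (a)-(d), it can be reassuring to confirm them numerically (a quick NumPy check, not a substitute for the proof):

```python
import numpy as np

rng = np.random.default_rng(4)
v = rng.standard_normal(7)
H = np.eye(7) - 2.0 * np.outer(v, v) / (v @ v)  # Householder reflector

assert np.allclose(H, H.T)                  # (a) symmetric
assert np.allclose(H.T @ H, np.eye(7))      # (b) orthogonal
assert np.isclose(np.linalg.det(H), -1.0)   # (c) det = -1
assert np.allclose(H @ H, np.eye(7))        # (d) involutory
```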

Problem M.5. Prove that QR with column pivoting finds the best rank-r approximation in the following sense: if \operatorname{rank}(A) = r and the permutation P is chosen so that |R_{11}| \geq |R_{22}| \geq \cdots, then the leading r \times r block R_{11} of R satisfies \sigma_r(R_{11}) \geq \sigma_r(A) / \sqrt{1 + r(n-r)}.

M.2 Computational Challenges

Challenge M.6 (Batched Cholesky). Implement batched Cholesky factorization for k = 1000 symmetric positive definite matrices of size n = 50. Compare three implementations:

  • Python loop over scipy.linalg.cholesky
  • NumPy batched via np.linalg.cholesky on a stacked (k, n, n) array (NumPy broadcasts over leading dimensions)
  • torch.linalg.cholesky on CPU with batch dimension

Measure time and throughput (matrices/second). Extrapolate to the K-FAC setting where k = 10{,}000 and n = 128.
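As a starting point, the loop and batched variants can be compared in NumPy alone, since np.linalg.cholesky broadcasts over leading batch dimensions; sizes are scaled down here and the torch variant is omitted:

```python
import time
import numpy as np

k, n = 200, 50                                   # scaled-down stand-ins
rng = np.random.default_rng(5)
B = rng.standard_normal((k, n, n))
As = B @ B.transpose(0, 2, 1) + n * np.eye(n)    # batch of SPD matrices

t0 = time.perf_counter()
L_loop = np.stack([np.linalg.cholesky(As[i]) for i in range(k)])
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
L_batch = np.linalg.cholesky(As)                 # one call over the whole batch
t_batch = time.perf_counter() - t0

assert np.allclose(L_loop, L_batch)
print(f"loop: {k/t_loop:.0f} matrices/s, batched: {k/t_batch:.0f} matrices/s")
```

The same harness transfers to torch.linalg.cholesky by swapping the array constructor; only the timing, not the API shape, changes.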

Challenge M.7 (Iterative Refinement). Construct a matrix A with condition number \kappa \approx 10^{6} (well beyond single-precision accuracy, but with \kappa \cdot u_{32} < 1 so that refinement converges). Solve A\mathbf{x} = \mathbf{b} using:
(a) direct LU in FP64; measure the relative error.
(b) simulated FP32 LU plus FP64 residual refinement (3 steps); measure the relative error.
Compare the results and explain the improvement using backward error theory.

Challenge M.8 (Rank-revealing QR in practice). Download a real dataset (e.g., the UCI Breast Cancer dataset: n = 569 samples, d = 30 features). Apply RRQR to the feature matrix X and identify the effective rank (threshold: |R_{kk}|/|R_{11}| < 10^{-3}). Verify by comparing to the number of significant singular values of X.
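A self-contained sketch of the rank-estimation step, substituting a synthetic rank-5 matrix of the same shape for the downloaded data; pivoted_r_diag is an illustrative modified-Gram-Schmidt helper (in production, scipy.linalg.qr with pivoting=True would be used):

```python
import numpy as np

def pivoted_r_diag(X):
    """Column-pivoted QR via modified Gram-Schmidt; returns |R_kk| in pivot order."""
    X = X.astype(float).copy()
    m, d = X.shape
    rdiag = np.zeros(min(m, d))
    for k in range(min(m, d)):
        norms = np.linalg.norm(X[:, k:], axis=0)
        j = k + int(np.argmax(norms))            # pivot: largest remaining column
        X[:, [k, j]] = X[:, [j, k]]
        rdiag[k] = np.linalg.norm(X[:, k])
        if rdiag[k] == 0.0:
            break
        q = X[:, k] / rdiag[k]
        X[:, k+1:] -= np.outer(q, q @ X[:, k+1:])  # orthogonalize remaining columns
    return rdiag

# Synthetic stand-in: rank-5 matrix with the dataset's 569 x 30 shape, plus noise.
rng = np.random.default_rng(6)
U, _ = np.linalg.qr(rng.standard_normal((569, 5)))
V, _ = np.linalg.qr(rng.standard_normal((30, 5)))
X = U @ np.diag([1.0, 0.5, 0.2, 0.1, 0.05]) @ V.T
X += 1e-8 * rng.standard_normal((569, 30))

rdiag = pivoted_r_diag(X)
rank_rrqr = int(np.sum(rdiag / rdiag[0] >= 1e-3))   # threshold |R_kk|/|R_11|
s = np.linalg.svd(X, compute_uv=False)
rank_svd = int(np.sum(s >= 1e-3 * s[0]))
assert rank_rrqr == rank_svd == 5
```

On the real feature matrix, only the construction of X changes; the thresholding logic is identical.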


References

  1. Golub, G. H. & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. - The definitive reference for all material in this section.

  2. Trefethen, L. N. & Bau, D. (1997). Numerical Linear Algebra. SIAM. - Exceptionally clear treatment of Householder QR and backward error analysis.

  3. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). SIAM. - Comprehensive analysis of numerical stability; source for all error bounds cited here.

  4. Demmel, J. W. (1997). Applied Numerical Linear Algebra. SIAM. - Excellent treatment of LU pivoting strategies and condition estimation.

  5. Householder, A. S. (1958). Unitary triangularization of a nonsymmetric matrix. Journal of the ACM, 5(4), 339-342. - Original Householder reflector paper.

  6. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion. Journal of the ACM, 8(3), 281-330. - Foundational backward error analysis for LU.

  7. Halko, N., Martinsson, P.-G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217-288. - Randomized QR/SVD framework.

  8. Martens, J. & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. ICML 2015. - K-FAC: Cholesky for Fisher information blocks.

  9. Hu, E. J. et al. (2022). LoRA: Low-rank adaptation of large language models. ICLR 2022. - Low-rank decomposition for fine-tuning.

  10. Zhao, J. et al. (2024). GaLore: Memory-efficient LLM training by gradient low-rank projection. ICML 2024. - Randomized SVD/QR for gradient compression.

  11. Bunch, J. R. & Kaufman, L. (1977). Some stable methods for calculating inertia and solving symmetric linear systems. Mathematics of Computation, 31(137), 163-179. - LDL^T with Bunch-Kaufman pivoting.

  12. Anderson, E. et al. (1999). LAPACK Users' Guide (3rd ed.). SIAM. - LAPACK routine reference and algorithmic details.

  13. Gardner, J. R. et al. (2018). GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. NeurIPS 2018. - CG-based GP inference for large-scale regression.

  14. Demmel, J., Grigori, L., Hoemmen, M., & Langou, J. (2008). Communication-optimal parallel and sequential QR and LU factorizations. SIAM Journal on Scientific Computing, 34(1), A206-A239. - TSQR and communication-avoiding algorithms.

  15. Shah, J. et al. (2024). FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. arXiv:2407.08608. - Block-tiled attention exploiting BLAS-3 structure.

