Matrix Decompositions, Part 1: Intuition to Cholesky Factorization (Computational Recap)
1. Intuition
1.1 The Factorization Paradigm
Every problem in computational linear algebra is, at its heart, a search for structure. When you factor a matrix into simpler pieces - triangular, orthogonal, diagonal - you transform a hard problem into a sequence of easy ones. The word "easy" is precise: triangular systems can be solved in $O(n^2)$ operations by substitution; orthogonal transformations are numerically perfect (they preserve norms exactly); diagonal systems are trivially solvable. A decomposition is a way of peeling back complexity to reveal the structural skeleton underneath.
This perspective explains why factorizations are not merely computational conveniences. They are mathematical revelations. LU factorization reveals which rows are "heavy" (large pivots dominate) and which are "light" (small pivots signal near-singularity). QR factorization reveals the orthogonal complement of the column space via the zero rows of $R$ in the full QR. Cholesky factorization reveals the "matrix square root" that encodes the geometry of a positive definite quadratic form. Each decomposition is a lens tuned to a different structural feature.
The central insight: nearly every matrix algorithm of practical importance runs in $O(n^3)$ time, dominated by the factorization phase. After factorization, the actual problem-solving (applying $L$, $U$, or $Q$) costs only $O(n^2)$. This is the fundamental return on investment: spend $O(n^3)$ once to factor, then spend $O(n^2)$ as many times as needed for different right-hand sides.
For AI: this principle appears in natural gradient methods (factor the Fisher information matrix once per outer loop, solve multiple gradient updates cheaply), in Gaussian process regression (factor the covariance matrix once, evaluate posterior at many test points cheaply), and in iterative solvers (factor a preconditioner once, apply it repeatedly to accelerate convergence).
1.2 Triangular Systems: The Foundation
Before understanding factorizations, you must understand why triangular systems are considered "solved." A lower triangular system $Lx = b$ with $\ell_{ii} \neq 0$ can be solved row by row from top to bottom:

$$x_i = \frac{1}{\ell_{ii}}\left(b_i - \sum_{j=1}^{i-1} \ell_{ij} x_j\right), \qquad i = 1, \dots, n.$$
This is forward substitution: each unknown $x_i$ is determined uniquely by those already computed. The computation requires $i-1$ multiplications and $i-1$ additions for row $i$ - about $n^2$ operations in total. Similarly, an upper triangular system $Ux = b$ is solved by backward substitution from bottom to top.
The key properties of triangular systems:
- Existence and uniqueness: a triangular system has a unique solution if and only if all diagonal entries are nonzero.
- Numerical stability: forward/backward substitution are inherently stable when pivots are large; small pivots amplify errors.
- Parallelism: the sequential dependency (each unknown requires all previous) limits parallelism, but blocked implementations can exploit level-2 BLAS parallelism within panels.
- Structural exploitation: sparse triangular systems (band, arrowhead, bidiagonal) can be solved in $O(nb)$ for bandwidth $b$.
Non-example: a general dense system $Ax = b$ cannot be solved by substitution directly - it requires $O(n^3)$ work to eliminate first. The entire purpose of LU, QR, and Cholesky is to reduce the general case to triangular cases.
1.3 Three Canonical Decompositions
The three decompositions in this section partition the space of structured matrix problems:
| Decomposition | Matrix class | Factored form | Primary use |
|---|---|---|---|
| LU | General square, non-singular | $PA = LU$ | Solving $Ax = b$, computing $\det A$ |
| QR | Any rectangular $A \in \mathbb{R}^{m \times n}$, $m \ge n$ | $A = QR$ | Least squares, rank determination, eigenvalue algorithms |
| Cholesky | Symmetric positive definite | $A = LL^T$ | SPD systems, Gaussian sampling, log-det computation |
Why three? Each exploits a different structural property. LU uses no special structure beyond invertibility; it is the most general and therefore requires more care (pivoting for stability). QR uses orthogonality, which is the most numerically pristine structure - $Q$ has condition number 1 by definition. Cholesky uses both symmetry and positive definiteness, which allows working with only half the matrix and guarantees the factorization exists without pivoting.
The hierarchy of stability: Cholesky $\succ$ QR $\succ$ LU (partially pivoted) $\succ$ LU (no pivoting). As you sacrifice structure, you need more algorithmic sophistication to maintain numerical reliability.
1.4 Why This Matters for AI
Matrix factorizations are not background infrastructure - they are in the critical path of the most computationally expensive operations in modern AI.
Gaussian process regression (used in Bayesian optimization, uncertainty quantification, and hyperparameter tuning) requires computing $K^{-1}y$ and $\log\det K$ for an $n \times n$ kernel matrix $K$. Both require Cholesky factorization. A single GP training step on $n$ points costs $O(n^3)$ operations - factorization is the bottleneck.
Second-order optimization (Newton's method, natural gradient, K-FAC) requires solving Hessian or Fisher information systems $Fx = g$. For K-FAC (Kronecker-Factored Approximate Curvature, Martens & Grosse 2015), the Fisher approximation is block-diagonal with Kronecker structure, and each block is inverted via Cholesky - thousands of small Cholesky factorizations per training step.
Linear probing and fine-tuning - fitting a linear model on top of frozen features - is a least-squares problem. Numerically reliable implementations use QR or SVD (e.g., numpy.linalg.lstsq calls LAPACK's dgelsd, which uses divide-and-conquer SVD; column-pivoted QR is available through dgelsy).
Differentiating through factorizations is required whenever a factorization appears inside a differentiable computational graph. PyTorch's torch.linalg.cholesky_solve and JAX's jax.lax.linalg.cholesky support automatic differentiation through the factorization via the implicit function theorem.
Randomized methods (LoRA, sketching, randomized preconditioning) use approximate LU and QR factorizations on sketched matrices. Understanding the exact factorizations is prerequisite to understanding their randomized variants.
1.5 Historical Timeline
| Year | Contribution | Author(s) |
|---|---|---|
| 1809 | Gaussian elimination for geodesy (Gauss's account not published until 1810) | Carl Friedrich Gauss |
| 1938 | LU factorization formalized as matrix decomposition | Tadeusz Banachiewicz |
| 1948 | Systematic analysis of elimination and error growth | Alan Turing |
| 1954 | Givens rotations for QR | Wallace Givens |
| 1958 | Householder reflections; modern QR algorithm | Alston Householder |
| 1961 | QR algorithm for eigenvalues (combined QR iteration) | Francis; Kublanovskaya |
| 1971 | LAPACK precursor: ALGOL procedures for matrix problems (the Handbook) | Wilkinson, Reinsch |
| 1979 | LINPACK (first standardized library) | Dongarra et al. |
| 1987 | LAPACK project launched: BLAS-3 blocked algorithms | Anderson, Bai, Dongarra et al. |
| 1995 | Randomised SVD (sketch-then-factor) | Frieze, Kannan, Vempala |
| 2011 | Halko-Martinsson-Tropp randomized low-rank framework | Halko, Martinsson, Tropp |
| 2015 | K-FAC: Cholesky for Fisher information blocks | Martens & Grosse |
| 2022 | LoRA: low-rank adaptation of frozen weight matrices | Hu et al. |
| 2024 | FlashAttention-3: blocked QR-like tiling for attention | Shah et al. |
2. Background: Triangular Systems
2.1 Forward Substitution
Definition. Given a lower triangular matrix $L \in \mathbb{R}^{n \times n}$ with $\ell_{ii} \neq 0$ for all $i$, and $b \in \mathbb{R}^n$, forward substitution computes $x$ satisfying $Lx = b$ via:

$$x_i = \frac{1}{\ell_{ii}}\left(b_i - \sum_{j=1}^{i-1} \ell_{ij} x_j\right), \qquad i = 1, \dots, n.$$
Algorithm (row-oriented):
for i = 1 to n:
x[i] = b[i]
for j = 1 to i-1:
x[i] -= L[i,j] * x[j]
x[i] /= L[i,i]
Complexity: The inner loop for $x_i$ runs $i-1$ times, giving $\sum_{i=1}^{n}(i-1) = n(n-1)/2$ multiplications and the same number of additions. Total: $\approx n^2$ floating-point operations, or $O(n^2)$.
Numerical properties: Forward substitution is backward stable: the computed $\hat{x}$ solves a nearby triangular system $(L + \Delta L)\hat{x} = b$ with $|\Delta L|$ bounded componentwise by a small multiple of $u\,|L|$. The forward error, however, scales with $\kappa(L)$, the condition number of $L$. The key source of numerical difficulty is a small diagonal entry $\ell_{ii}$: dividing by a near-zero diagonal inflates $\kappa(L)$ and amplifies whatever error is already present in the computed partial sums.
Unit lower triangular: When $\ell_{ii} = 1$ for all $i$ (as in the LU factorization, where $L$ is unit lower triangular by convention), the divisions disappear and the algorithm reduces to:
for i = 1 to n:
x[i] = b[i] - sum(L[i,1:i-1] * x[1:i-1])
This eliminates divisions, slightly improving performance and stability.
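To make the pseudocode concrete, here is a minimal NumPy sketch of forward substitution (for illustration only; production code should call scipy.linalg.solve_triangular, which wraps the LAPACK routine):

```python
import numpy as np

def forward_substitution(L, b, unit_diagonal=False):
    """Solve L x = b for lower triangular L by forward substitution, O(n^2)."""
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        s = b[i] - L[i, :i] @ x[:i]            # subtract the already-known terms
        x[i] = s if unit_diagonal else s / L[i, i]
    return x

# quick check on a well-conditioned lower triangular system
rng = np.random.default_rng(0)
L = np.tril(rng.standard_normal((5, 5))) + 5 * np.eye(5)
b = rng.standard_normal(5)
x = forward_substitution(L, b)
assert np.allclose(L @ x, b)
```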
For AI: Forward substitution appears in the final step of any matrix factorization-based solver. In neural network training with second-order methods (Newton, K-FAC), solving $LL^T d = g$ by two triangular solves extracts the natural-gradient direction $d = F^{-1}g$ given the Cholesky factor $L$ of the Fisher matrix $F$.
2.2 Backward Substitution
Definition. Given an upper triangular $U$ with $u_{ii} \neq 0$, backward substitution computes $x$ satisfying $Ux = b$ via:

$$x_i = \frac{1}{u_{ii}}\left(b_i - \sum_{j=i+1}^{n} u_{ij} x_j\right), \qquad i = n, n-1, \dots, 1.$$
The algorithm proceeds from the last equation (which has the single unknown $x_n$) backward to the first. Complexity and numerical properties mirror forward substitution.
Column-oriented backward substitution: An equivalent formulation updates the remaining components of $b$ as each $x_i$ is resolved:
for i = n downto 1:
x[i] = b[i] / U[i,i]
b[1:i-1] -= x[i] * U[1:i-1, i]
This accesses $U$ by columns rather than rows, which can improve cache efficiency on column-major storage (Fortran, LAPACK convention).
Connection to matrix inverse: The columns of $U^{-1}$ can be computed by solving $n$ triangular systems $Ux_k = e_k$, one per standard basis vector. This is how scipy.linalg.solve_triangular computes $U^{-1}B$ when called with a matrix right-hand side $B$.
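A small illustration of the column-by-column view (a sketch; the exact LAPACK path taken by solve_triangular is an implementation detail):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
U = np.triu(rng.standard_normal((4, 4))) + 4 * np.eye(4)

# Solving U X = I column by column yields the columns of U^{-1}.
U_inv = solve_triangular(U, np.eye(4), lower=False)
assert np.allclose(U @ U_inv, np.eye(4))
```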
2.3 Gaussian Elimination Revisited
Gaussian elimination is the process of reducing a matrix to upper triangular form by applying elementary row operations (EROs). Each ERO of type "add $\mu$ times row $j$ to row $i$" (with $i > j$) corresponds to left-multiplication by an elementary lower triangular matrix (an elementary elimination matrix or Frobenius matrix):

$$M = I + \mu\, e_i e_j^T.$$
In elimination step $k$, the multiplier $\ell_{ik} = a_{ik}^{(k)} / a_{kk}^{(k)}$ is the ratio that zeros out entry $(i, k)$. Applying all elimination matrices to reduce $A$ to $U$ gives:

$$M_{n-1} \cdots M_2 M_1 A = U.$$
Crucially, the product of elementary lower triangular matrices is lower triangular, and their inverses are trivial: if $M_k = I - m_k e_k^T$, with $m_k$ holding the multipliers of step $k$, then

$$M_k^{-1} = I + m_k e_k^T.$$
So the product of inverses, $L = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1}$, is unit lower triangular with the multipliers sitting in their natural positions. This gives:

$$A = LU.$$
The multipliers go directly into $L$: one of the most elegant facts in numerical linear algebra is that the multipliers used during elimination appear, without any further computation, as the subdiagonal entries of the factor $L$. No separate computation of $L$ is required - it is built up in place during elimination.
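A tiny numerical check of this fact (an illustrative sketch on a 3x3 example): build the elementary elimination matrices and confirm that the product of their inverses is exactly the unit lower triangular factor holding the multipliers.

```python
import numpy as np

A = np.array([[2., 1., 1.],
              [4., 3., 3.],
              [8., 7., 9.]])

# Step 1: eliminate below the (1,1) pivot; step 2: below the (2,2) pivot.
m1 = np.array([0., A[1, 0] / A[0, 0], A[2, 0] / A[0, 0]])     # multipliers of step 1
M1 = np.eye(3) - np.outer(m1, np.eye(3)[0])
A1 = M1 @ A
m2 = np.array([0., 0., A1[2, 1] / A1[1, 1]])                   # multiplier of step 2
M2 = np.eye(3) - np.outer(m2, np.eye(3)[1])
U = M2 @ A1                                                    # upper triangular

L = np.linalg.inv(M1) @ np.linalg.inv(M2)   # unit lower triangular, multipliers below the diagonal
assert np.allclose(L, np.eye(3) + np.outer(m1, np.eye(3)[0]) + np.outer(m2, np.eye(3)[1]))
assert np.allclose(L @ U, A)
```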
3. LU Factorization
3.1 LU Decomposition Theorem
Theorem (LU factorization). Let $A \in \mathbb{R}^{n \times n}$ be non-singular. Then $A$ has an LU factorization $A = LU$ (with $L$ unit lower triangular and $U$ upper triangular) if and only if all leading principal submatrices $A_{1:k,\,1:k}$ for $k = 1, \dots, n-1$ are non-singular.
Proof sketch. The condition $A_{1:k,1:k}$ non-singular for all $k$ ensures that no zero pivot is encountered during Gaussian elimination. At step $k$, the pivot is $u_{kk} = \det(A_{1:k,1:k}) / \det(A_{1:k-1,1:k-1})$, so all pivots are nonzero iff all leading minors are nonzero. Conversely, if Gaussian elimination completes without zero pivots, the multipliers are finite and $L$, $U$ are well-defined.
Non-examples:
- $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ fails: $a_{11} = 0$ (a zero pivot at step 1), even though the matrix is non-singular.
- $\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$: $a_{11} \neq 0$, so the factorization runs, but $u_{22} = 0$ ($U$ is singular).
- Any singular matrix: $\det A = \det L \det U = \prod_i u_{ii}$, so $\det A = 0$ iff some $u_{ii} = 0$.
Uniqueness: The factorization with $\ell_{ii} = 1$ (unit lower triangular $L$) is unique when it exists. Without the normalization, one could rescale $L$ and $U$ by any non-singular diagonal matrix and its inverse: $LU = (LD)(D^{-1}U)$.
The factorization in-place: LU computation overwrites $A$: the upper triangle stores $U$, the strict lower triangle stores the multipliers (i.e., the strict lower triangle of $L$, since the unit diagonal of $L$ need not be stored). No storage beyond $A$ itself is required.
3.2 The LU Algorithm: Outer Product Form
The standard view of LU elimination is column-by-column. But the outer product form (also called rank-1 update form) exposes the structure more clearly and maps naturally to BLAS-3 blocked implementations.
At step $k$ of the factorization, we have already reduced columns $1, \dots, k-1$. The remaining $(n-k+1) \times (n-k+1)$ submatrix (the Schur complement after $k-1$ steps) has the form:

$$A^{(k)} = \begin{pmatrix} a_{kk} & u^T \\ c & B \end{pmatrix}, \qquad B \;\leftarrow\; B - \ell\, u^T,$$

where $\ell = c / a_{kk}$ (the multipliers column) and $u^T$ is the pivot row. This is a rank-1 update of the trailing submatrix.
Algorithm:
for k = 1 to n-1:
# Compute multipliers
L[k+1:n, k] = A[k+1:n, k] / A[k, k] # size (n-k)
# Rank-1 update of trailing submatrix
A[k+1:n, k+1:n] -= L[k+1:n, k] * A[k, k+1:n] # BLAS-2: ger
U[k, k:n] = A[k, k:n]
Complexity: Step $k$ costs $n-k$ divisions and $(n-k)^2$ multiply-adds. Total:

$$\sum_{k=1}^{n-1} \left[ (n-k) + 2(n-k)^2 \right] \approx \frac{2n^3}{3}.$$

So LU factorization costs $\approx \frac{2n^3}{3}$ floating-point operations (flops). For comparison: solving $Ax = b$ after factorization costs $\approx 2n^2$ (two triangular solves). The ratio of roughly $n/3$ means factorization dominates for large $n$.
For AI: the rank-1 update structure maps directly to GPU tensor cores. Blocked implementations reformulate the trailing submatrix update as a matrix-matrix multiply (BLAS-3: DGEMM), which achieves near-peak GPU throughput.
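A direct translation of the outer-product formulation into NumPy (no pivoting, so only a sketch for matrices known to have safe pivots; library code such as scipy.linalg.lu should be used in practice):

```python
import numpy as np

def lu_outer_product(A):
    """Unpivoted LU via rank-1 updates. Returns unit lower triangular L and upper triangular U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k+1:, k] = A[k+1:, k] / A[k, k]                  # multipliers
        A[k+1:, k+1:] -= np.outer(L[k+1:, k], A[k, k+1:])  # rank-1 (Schur complement) update
        A[k+1:, k] = 0.0
    return L, np.triu(A)

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6)) + 6 * np.eye(6)            # diagonally dominant: safe without pivoting
L, U = lu_outer_product(M)
assert np.allclose(L @ U, M)
```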
3.3 Failure Without Pivoting
Naive LU (no pivoting) fails catastrophically when:
- Zero pivots: If $a_{kk}^{(k)} = 0$ during step $k$, the division by the pivot is undefined. This happens even for non-singular matrices that happen to have zero leading principal minors.
- Small pivots - numerical catastrophe: Consider $A = \begin{pmatrix} \epsilon & 1 \\ 1 & 1 \end{pmatrix}$ with tiny $\epsilon$. Naive LU gives multiplier $\ell_{21} = 1/\epsilon$, and $u_{22} = 1 - 1/\epsilon$ (in floating-point, $1/\epsilon$ overwhelms the exact value $1$). The computed solution can have relative error of order 1 even when the exact problem is well-conditioned.
- Growth factor explosion: The growth factor is defined as:

$$\rho = \frac{\max_{i,j,k} |a_{ij}^{(k)}|}{\max_{i,j} |a_{ij}|}.$$

Without pivoting, $\rho$ can be arbitrarily large; even with partial pivoting the worst case (Wilkinson's matrix) reaches $\rho = 2^{n-1}$, making the computed $U$ useless.
Example of catastrophic failure:

$$\begin{pmatrix} 10^{-20} & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$

Exact solution: $x_1 \approx 1$, $x_2 \approx 1$. Naive LU in IEEE double precision gives multiplier $10^{20}$, yielding $\hat{x}_1 = 0$ - the exact $1$ is swamped by $10^{20}$ and all accuracy in $x_1$ is destroyed.
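The failure is easy to reproduce; a self-contained sketch of the 2x2 case, contrasted with SciPy's partially pivoted solver:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

eps = 1e-20
A = np.array([[eps, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0])

# Unpivoted LU by hand: the multiplier 1/eps swamps everything downstream.
l21 = A[1, 0] / A[0, 0]            # 1e20
u22 = A[1, 1] - l21 * A[0, 1]      # fl(1 - 1e20) = -1e20: the exact "1" is lost
y2 = b[1] - l21 * b[0]             # fl(2 - 1e20) = -1e20
x2 = y2 / u22                      # 1.0
x1 = (b[0] - 1.0 * x2) / A[0, 0]   # 0.0  (the exact answer is ~1)
print("unpivoted:", x1, x2)

# Partial pivoting (SciPy / LAPACK) recovers the correct answer.
x = lu_solve(lu_factor(A), b)
print("pivoted:  ", x)             # approximately [1, 1]
```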
For AI: PyTorch's torch.linalg.lu uses partial pivoting by default. numpy.linalg.solve calls LAPACK dgesv which uses partial pivoting. Never use unpivoted LU for numerical computation.
3.4 Partial Pivoting: PA = LU
Idea: Before processing column $k$, swap row $k$ with the row (among rows $k, \dots, n$) having the largest absolute value in column $k$. This ensures $|\ell_{ik}| \le 1$ for all multipliers.
Algorithm with partial pivoting:
for k = 1 to n-1:
# Find pivot row
p = argmax_{i >= k} |A[i, k]|
swap rows k and p in A (and record in permutation P)
# Now A[k,k] is the largest in column k below
L[k+1:n, k] = A[k+1:n, k] / A[k, k] # |L[i,k]| <= 1
A[k+1:n, k+1:n] -= L[k+1:n, k] * A[k, k+1:n]
Result: $PA = LU$, where $P$ is a permutation matrix encoding all row swaps. The factor $L$ satisfies $|\ell_{ij}| \le 1$ for all $i, j$.
Growth factor with partial pivoting: The theoretical worst-case bound is $\rho \le 2^{n-1}$, but in practice the growth factor is almost always small for random and real-world matrices. Inputs that achieve exponential growth are extraordinarily rare (Foster, 1997, analyzes when large growth can occur).
Stability theorem (Wilkinson 1961): Partial pivoting produces a computed factorization $\hat{L}\hat{U}$ such that:

$$\hat{L}\hat{U} = PA + E, \qquad \|E\|_\infty \le c(n)\, \rho\, u\, \|A\|_\infty,$$

where $u$ is the unit roundoff (e.g., $u \approx 1.1 \times 10^{-16}$ for IEEE double) and $c(n)$ is a low-degree polynomial in $n$. For typical $\rho$, this bound is well within practical tolerance.
Storage of the permutation: $P$ is stored as a pivot vector $p$, where $p_k$ is the row index selected at step $k$. Applying $P$ to $b$ before the triangular solve costs $O(n)$.
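A quick look at the factors produced by partial pivoting (a sketch; scipy.linalg.lu returns an explicit permutation matrix, while lu_factor returns the compact pivot-vector form used by LAPACK):

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))

P, L, U = lu(A)                        # A = P @ L @ U in SciPy's convention
assert np.allclose(P @ L @ U, A)
assert np.all(np.abs(np.tril(L, -1)) <= 1.0 + 1e-15)   # all multipliers bounded by 1
```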
Non-example: Partial pivoting does NOT guarantee that the factor $U$ is well-conditioned. For a matrix whose first column has entries of equal magnitude, either row can be chosen as the pivot row; whichever choice is made, badly scaled entries elsewhere in the matrix can still leave $U$ - and hence the solve - ill-conditioned.
3.5 Complete Pivoting: PAQ = LU
Complete pivoting selects both the row and column with the globally largest absolute entry at each step:
At step $k$: find $(p, q) = \arg\max_{i, j \ge k} |a_{ij}^{(k)}|$, then swap row $p$ with row $k$ and column $q$ with column $k$.
Result: $PAQ = LU$, where $P$ and $Q$ are permutation matrices (row and column permutations respectively).
Growth factor bound: Complete pivoting satisfies the tighter bound (Wilkinson):

$$\rho \le n^{1/2} \left( 2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)} \right)^{1/2} \approx c\, n^{1/2}\, n^{\frac{1}{4}\ln n},$$

which grows much more slowly than $2^{n-1}$ but is still superlinear. Numerically, complete pivoting is extremely stable - no practical example of large growth is known.
Cost: Finding the global maximum at each step requires examining the full trailing submatrix, adding $(n-k)^2$ comparisons per step and $\approx n^3/3$ in total - a large constant-factor overhead relative to the $O(n^2)$ comparisons of partial pivoting. This is the reason complete pivoting is rarely used in practice.
When to use complete pivoting:
- When stability is paramount (e.g., certified computation, interval arithmetic)
- When the matrix is suspected to be rank-deficient (complete pivoting exposes rank)
- When solving very ill-conditioned systems where partial pivoting is known to fail
For AI: Complete pivoting is rarely used in mainstream ML workflows. Partial pivoting suffices for almost all practical problems. The theoretical importance of complete pivoting is in establishing lower bounds and in the analysis of rank-revealing factorizations.
3.6 Rook Pivoting
Rook pivoting is a middle ground between partial and complete pivoting. At step $k$:
- Find the largest element in column $k$ (as in partial pivoting) - say it sits in row $p$.
- Check whether this entry is also the largest (in absolute value) in row $p$. If yes, use it as the pivot.
- If not, find the largest entry in row $p$, swap columns, and repeat until the search settles.
The name comes from the chess rook: the pivot search alternates between column and row moves until it finds a position that is simultaneously the largest in its row and column - a "rook position."
Properties:
- Growth factor: bounded by roughly $1.5\, n^{\frac{3}{4}\log n}$ (Foster 1997), and in practice much smaller than with partial pivoting.
- Cost: up to $O((n-k)^2)$ additional comparisons per step in the worst case, but typically $O(n-k)$ in practice (the search converges in 1-2 alternations).
- Rank-revealing: Rook pivoting reveals rank approximately as well as complete pivoting.
Theorem (Foster 1997): For a matrix of numerical rank $r$, rook pivoting keeps the first $r$ diagonal entries of $U$ large and drives $|u_{kk}|$ toward zero for $k > r$, making the drop in magnitude at position $r$ clearly visible.
3.7 LU Stability Analysis and Backward Error
The definitive framework for understanding LU stability is backward error analysis, developed by Wilkinson in the 1960s.
Definition (backward error). The backward error of a computed solution $\hat{x}$ to $Ax = b$ is the smallest perturbation $(\Delta A, \Delta b)$ such that:

$$(A + \Delta A)\hat{x} = b + \Delta b.$$
A method is backward stable if $\|\Delta A\| / \|A\|$ and $\|\Delta b\| / \|b\|$ are of order $u$ (machine epsilon) regardless of the input.
Backward error theorem for LU with partial pivoting:
Theorem (Wilkinson 1961). Let $\hat{x}$ be the solution computed by LU with partial pivoting in floating-point arithmetic. Then:

$$(A + \Delta A)\hat{x} = b, \qquad \frac{\|\Delta A\|_\infty}{\|A\|_\infty} \le c(n)\, \rho\, u,$$

where $c(n)$ is a modest polynomial in $n$ and $\rho$ is the growth factor.
Forward error bound: The forward error satisfies:

$$\frac{\|\hat{x} - x\|}{\|x\|} \lesssim c(n)\, \rho\, u\, \kappa(A),$$

where $\kappa(A) = \|A\|\,\|A^{-1}\|$ is the condition number. This is the key identity: numerical error $\approx$ roundoff $\times$ growth factor $\times$ condition number.
Implication for practice: If $\kappa(A) \approx 10^k$ in double precision, you lose about $k$ digits of accuracy. With $u \approx 10^{-16}$, you have roughly $16 - k$ correct digits in $\hat{x}$.
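The identity can be checked numerically; a sketch using the notoriously ill-conditioned Hilbert matrix (exact values vary across platforms):

```python
import numpy as np
from scipy.linalg import hilbert

n = 10
A = hilbert(n)                                # cond(A) ~ 1e13
x_true = np.ones(n)
b = A @ x_true

x_hat = np.linalg.solve(A, b)                 # LU with partial pivoting (LAPACK dgesv)
backward = np.linalg.norm(A @ x_hat - b) / (np.linalg.norm(A) * np.linalg.norm(x_hat))
forward = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"cond(A)  ~ {np.linalg.cond(A):.1e}")  # ~1e13
print(f"backward ~ {backward:.1e}")           # ~1e-16: backward stable
print(f"forward  ~ {forward:.1e}")            # ~ cond * u: digits lost to conditioning
```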
Backward stability of QR: QR via Householder reflections is backward stable with no growth-factor term - Householder reflectors are orthogonal and preserve norms exactly. This is why QR is preferred for ill-conditioned systems.
3.8 Blocked LU for Cache Efficiency
Modern hardware has a deep memory hierarchy (L1: ~32 KB, L2: ~256 KB, L3: ~8 MB, DRAM: effectively unbounded but slow). The key metric for algorithmic performance is arithmetic intensity: floating-point operations per byte of data moved. BLAS operations have very different intensities:
| Operation | BLAS level | Flops | Data moved | Intensity |
|---|---|---|---|---|
| $y \leftarrow Ax$ (DGEMV) | 2 | $2n^2$ | $O(n^2)$ | $O(1)$ |
| $C \leftarrow AB$ (DGEMM) | 3 | $2n^3$ | $O(n^2)$ | $O(n)$ |
Blocked LU reformulates the algorithm so that the dominant computation is DGEMM (matrix-matrix multiply), achieving $O(b)$ arithmetic intensity for block size $b$ and near-peak GPU/CPU throughput.
Block algorithm (block size $b$):
for k = 0 to n/b - 1:
    # Factor the diagonal block (unblocked, BLAS-2), O(b^3)
    P_k, L_kk, U_kk = LU_factor(A[kb:(k+1)b, kb:(k+1)b])
    # Apply the block's permutation to the remaining columns of those rows
    A[kb:(k+1)b, (k+1)b:n] = P_k @ A[kb:(k+1)b, (k+1)b:n]
    # Solve for the U panel to the right (triangular solve, BLAS-3: DTRSM)
    U_right = solve_lower_triangular(L_kk, A[kb:(k+1)b, (k+1)b:n])
    # Solve for the L panel below (triangular solve, BLAS-3: DTRSM): L_below @ U_kk = A21
    L_below = solve_upper_triangular_from_right(U_kk, A[(k+1)b:n, kb:(k+1)b])
    # Update trailing submatrix (BLAS-3: DGEMM)
    A[(k+1)b:n, (k+1)b:n] -= L_below @ U_right   # This is the bottleneck
The trailing submatrix update (A -= L_below @ U_right) is a pure DGEMM and dominates the total computation. On a modern CPU/GPU with peak DGEMM performance, this block formulation achieves 80-95% of theoretical peak.
LAPACK routine: DGETRF implements blocked LU with partial pivoting. The block size $b$ is chosen per platform via the ILAENV tuning routine.
For AI: Blocked LU is the algorithm behind PyTorch's torch.linalg.lu_factor, jax.scipy.linalg.lu, and scipy's lu_factor. For GPU, cuSOLVER implements batched blocked LU for thousands of small systems simultaneously - critical for K-FAC training.
3.9 Solving Ax = b via LU
Given the LU factorization $PA = LU$, solving $Ax = b$ proceeds in four steps:
- Apply permutation: $b' = Pb$ ($O(n)$, just index lookups)
- Forward solve: solve $Ly = b'$ ($O(n^2)$, forward substitution)
- Backward solve: solve $Ux = y$ ($O(n^2)$, backward substitution)
- (Optional) Iterative refinement: compute the residual $r = b - A\hat{x}$, solve $A\,\delta = r$ for the correction $\delta$ (reusing the factors), and update $\hat{x} \leftarrow \hat{x} + \delta$
The total cost after factorization is $\approx 2n^2$ flops per right-hand side - negligible compared to the $\frac{2n^3}{3}$ factorization cost for large $n$.
Multiple right-hand sides: For $k$ different vectors $b_1, \dots, b_k$, one factorization costing $\frac{2n^3}{3}$ flops is followed by $k$ pairs of triangular solves, each costing $\approx 2n^2$. This is the key economic argument for LU: amortize the factorization cost over many solves.
Computing the determinant: $\det A = (-1)^s \prod_{i=1}^{n} u_{ii}$, where $s$ is the number of row swaps. Since $L$ is unit lower triangular, $\det L = 1$.
Computing $A^{-1}$: Solve $Ax_j = e_j$ for $j = 1, \dots, n$ (one factorization, $n$ pairs of triangular solves). But explicitly forming $A^{-1}$ is almost always unnecessary and should be avoided - solve the system directly instead.
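A sketch of the factor-once, solve-many pattern with SciPy (lu_factor and lu_solve wrap LAPACK's dgetrf/dgetrs), including the determinant read off from the factors:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(4)
n = 50
A = rng.standard_normal((n, n))

lu_piv = lu_factor(A)                      # O(n^3), done once
for _ in range(3):                         # each additional solve is only O(n^2)
    b = rng.standard_normal(n)
    x = lu_solve(lu_piv, b)
    assert np.allclose(A @ x, b)

# Determinant from the factors: det(A) = (-1)^s * prod(diag(U)).
lu_mat, piv = lu_piv
s = np.sum(piv != np.arange(n))            # number of row swaps actually performed
det = (-1.0) ** s * np.prod(np.diag(lu_mat))
assert np.isclose(det, np.linalg.det(A))
```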
3.10 Rank-Deficient LU
When $A$ is singular or numerically rank-deficient, the LU factorization (even with pivoting) produces a $U$ with some diagonal entries near zero. The factorization still "works" mechanically but produces $u_{kk} \approx 0$ at positions corresponding to dependent columns.
Signature of rank deficiency: small diagonal entries $|u_{kk}|$ in the trailing part of $U$.
Problem: With partial pivoting, the threshold for "approximately zero" is ambiguous. If $|u_{kk}| < \tau$ for all $k > r$, we declare the matrix has numerical rank $r$. But the choice of the tolerance $\tau$ is problem-dependent.
Rank-revealing LU: Column-pivoted LU (complete or rook pivoting) provides better rank revelation: the diagonal entries of $U$ decay in a manner that reflects the true rank structure.
Better alternative for rank-deficient problems: Use rank-revealing QR (RRQR, 4.6) or the SVD. The SVD provides the definitive rank determination via singular value thresholding - but costs $O(n^3)$ just as LU does, with a larger constant. RRQR provides a cheaper, nearly-as-reliable alternative.
For AI: Neural network weight matrices often have approximate low rank (Hu et al., 2022 show that fine-tuning adapters are intrinsically low-dimensional). Rank-deficient LU can be used to detect this structure, but RRQR or randomized SVD are preferred in practice.
4. QR Factorization: Computational Algorithms
4.1 From Gram-Schmidt to Algorithms
Recall from 05: The QR factorization of an $m \times n$ matrix $A$ ($m \ge n$) decomposes $A$ into an orthonormal factor $Q \in \mathbb{R}^{m \times n}$ (or $Q \in \mathbb{R}^{m \times m}$ for the full QR) and an upper triangular factor $R \in \mathbb{R}^{n \times n}$. This was constructed via Gram-Schmidt orthogonalization in 05.
The problem with classical Gram-Schmidt: Classical Gram-Schmidt (CGS) is mathematically correct but numerically unstable for ill-conditioned matrices. The computed $Q$ can lose orthogonality catastrophically: for a matrix with condition number $\kappa(A)$, the computed $\hat{Q}$ from CGS may have $\|\hat{Q}^T\hat{Q} - I\| = O(\kappa^2(A)\, u)$, causing significant errors in subsequent computations.
Modified Gram-Schmidt (MGS): Reorders the orthogonalization to reduce error accumulation. MGS is numerically equivalent to QR of a slightly perturbed matrix, giving $\|\hat{Q}^T\hat{Q} - I\| = O(\kappa(A)\, u)$. Still insufficient for highly ill-conditioned problems.
Householder and Givens: Both achieve $\|\hat{Q}^T\hat{Q} - I\| = O(u)$ regardless of $\kappa(A)$, making them the preferred algorithms for production use. The key is that they build $Q$ as a product of exactly orthogonal elementary transformations (reflectors or rotations), never losing orthogonality during construction.
For AI: numpy.linalg.qr and scipy.linalg.qr use LAPACK DGEQRF (Householder QR). torch.linalg.qr similarly calls cuSOLVER which uses Householder on GPU. Understanding the algorithmic basis helps you interpret condition number warnings and numerical precision behavior.
4.2 Householder Reflections
Definition. A Householder reflector (or Householder transformation) is a matrix of the form:

$$H = I - 2\,\frac{vv^T}{v^Tv},$$

where $v \neq 0$ is the Householder vector. The matrix is symmetric ($H = H^T$) and orthogonal ($H^TH = I$), so $H^2 = I$ - it is its own inverse.
Geometric interpretation: $H$ reflects vectors across the hyperplane perpendicular to $v$. Every point $x$ in that hyperplane satisfies $Hx = x$ (invariant). Every vector along the direction $v$ satisfies $Hv = -v$ (negated).
Key property: Given any nonzero vector $x$, there exists a Householder reflector $H$ such that $Hx = \alpha e_1$ with $|\alpha| = \|x\|_2$. That is, $H$ maps $x$ to a multiple of the first standard basis vector.
Construction (numerical form): Given $x$ to be zeroed below its first entry:

$$v = x + \operatorname{sign}(x_1)\,\|x\|_2\, e_1, \qquad H = I - 2\,\frac{vv^T}{v^Tv}, \qquad Hx = -\operatorname{sign}(x_1)\,\|x\|_2\, e_1.$$

The sign convention (critical for stability): We choose $\operatorname{sign}(x_1)$ so that $v_1 = x_1 + \operatorname{sign}(x_1)\|x\|_2$ is a sum of same-sign terms, avoiding cancellation. If $x_1 > 0$, this gives $v_1 = x_1 + \|x\|_2$, avoiding the catastrophic subtraction that would occur with the opposite sign choice.
Cost of applying $H$: Computing $Hx$ via an explicit matrix-vector product costs $O(m^2)$, but using the formula $Hx = x - 2\,\frac{v^Tx}{v^Tv}\,v$ costs only $O(m)$ - first compute the scalar $\beta = 2\,v^Tx / v^Tv$, then update $x \leftarrow x - \beta v$. This is the implicit representation of $H$.
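A minimal construction of a single reflector using the stable sign choice (an illustrative sketch; the helper name householder_vector is not a library function):

```python
import numpy as np

def householder_vector(x):
    """Return normalized v and alpha such that (I - 2 v v^T) x = alpha * e1."""
    x = np.asarray(x, dtype=float)
    sign = 1.0 if x[0] >= 0 else -1.0
    alpha = -sign * np.linalg.norm(x)     # reflect onto -sign(x1)*||x||*e1: no cancellation
    v = x.copy()
    v[0] -= alpha                         # v = x - alpha*e1 = x + sign(x1)*||x||*e1
    v /= np.linalg.norm(v)
    return v, alpha

x = np.array([3.0, 1.0, 5.0, 1.0])
v, alpha = householder_vector(x)
Hx = x - 2.0 * v * (v @ x)                # apply H implicitly in O(m)
assert np.allclose(Hx, [alpha, 0.0, 0.0, 0.0])
assert np.isclose(abs(alpha), np.linalg.norm(x))
```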
4.3 Householder QR Algorithm
Algorithm: Apply Householder reflectors successively to zero out subdiagonal entries of each column.
for k = 1 to n: # n columns
# Construct H_k to zero out A[k+1:m, k]
a = A[k:m, k] # current column
alpha = -sign(a[0]) * norm(a)
v = a.copy(); v[0] -= alpha
v /= norm(v) # normalized Householder vector
# Apply H_k implicitly to A[k:m, k:n]
A[k:m, k:n] -= 2 * v * (v.T @ A[k:m, k:n])
# Store v in lower triangle of A (LAPACK convention)
A[k+1:m, k] = v[1:]
After $n$ steps, $A$ has been overwritten with $R$ in the upper triangle and the Householder vectors in the strict lower triangle (the implicit storage of $Q$).
Implicit Q representation: The full matrix $Q = H_1 H_2 \cdots H_n$ has size $m \times m$. Forming $Q$ explicitly costs additional $O(m^2 n)$ work and $O(m^2)$ storage. LAPACK's DORGQR computes the explicit $Q$ when needed (e.g., for the thin QR, only the first $n$ columns of $Q$ are formed).
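Putting the reflectors together gives a complete, if unoptimized, Householder QR; a sketch to compare against numpy.linalg.qr, which calls the LAPACK routine:

```python
import numpy as np

def householder_qr(A):
    """Thin QR via Householder reflectors. Returns Q (m x n) and R (n x n)."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.eye(m)
    for k in range(n):
        x = A[k:, k]
        alpha = -np.sign(x[0]) * np.linalg.norm(x) if x[0] != 0 else -np.linalg.norm(x)
        v = x.copy()
        v[0] -= alpha
        norm_v = np.linalg.norm(v)
        if norm_v == 0:
            continue
        v /= norm_v
        A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])   # apply H_k to the trailing block
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)     # accumulate Q = H_1 H_2 ... H_n
    return Q[:, :n], np.triu(A[:n, :])

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 4))
Q, R = householder_qr(A)
assert np.allclose(Q @ R, A)
assert np.allclose(Q.T @ Q, np.eye(4))
```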
Complexity:
- Applying $H_k$ to the trailing submatrix: $\approx 4(m-k)(n-k)$ flops per step
- Total: $\approx 2mn^2 - \frac{2n^3}{3}$ flops
- For square $m = n$: $\approx \frac{4n^3}{3}$ flops (vs $\frac{2n^3}{3}$ for LU)
Numerical stability: Householder QR is backward stable. The computed $\hat{Q}$, $\hat{R}$ satisfy:

$$\hat{Q}\hat{R} = A + \Delta A, \qquad \|\Delta A\| \le c(m, n)\, u\, \|A\|,$$

with $c(m, n)$ a modest polynomial constant and $u$ machine epsilon. The orthogonality error satisfies $\|\hat{Q}^T\hat{Q} - I\| = O(u)$, independent of $\kappa(A)$.
LAPACK: DGEQRF computes the Householder QR in blocked form (block Householder, using WY representation). DORMQR applies or to a matrix without forming explicitly.
For AI: The Householder algorithm is used in PyTorch's torch.linalg.qr, in scipy's linalg.qr, and in LAPACK DGEQRF. It is the foundation for numerically reliable QR-based least-squares solvers used in linear probing, weight matrix analysis, and second-order methods.
4.4 Givens Rotations
Definition. A Givens rotation $G(i, j, \theta)$ is the identity matrix with a $2 \times 2$ rotation embedded in rows and columns $i$ and $j$:

$$g_{ii} = c, \quad g_{jj} = c, \quad g_{ij} = s, \quad g_{ji} = -s, \qquad c = \cos\theta,\ s = \sin\theta,$$

with all other entries equal to those of the identity.
Key property: $G(i, j, \theta)$ is orthogonal. Left-multiplying by it rotates the $(i, j)$ coordinate plane by the angle $\theta$.
Zeroing a specific entry: Given a vector $x$ with entries $x_i$ and $x_j$, choose:

$$r = \sqrt{x_i^2 + x_j^2}, \qquad c = \frac{x_i}{r}, \qquad s = \frac{x_j}{r}.$$

Then $(Gx)_i = r$ and $(Gx)_j = 0$ - the rotation zeros out $x_j$ while preserving the norm of the pair as $r$.
Numerically stable computation (LAPACK DLARTG):
if x_j == 0:
c = 1, s = 0, r = x_i
elif |x_j| > |x_i|:
t = x_i / x_j; s = 1/sqrt(1+t^2); c = s*t; r = x_j/s
else:
t = x_j / x_i; c = 1/sqrt(1+t^2); s = c*t; r = x_i/c
This avoids overflow and underflow in computing $r = \sqrt{x_i^2 + x_j^2}$.
Cost: Applying $G(i, j, \theta)$ to an $m \times n$ matrix costs $\approx 6n$ flops (touching only rows $i$ and $j$). Constructing $c$ and $s$ costs $O(1)$.
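A sketch of the construction and its application to two rows of a matrix (np.hypot plays the role of the overflow-safe norm):

```python
import numpy as np

def givens(a, b):
    """Return c, s, r with [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    r = np.hypot(a, b)                    # sqrt(a^2 + b^2) without overflow/underflow
    if r == 0.0:
        return 1.0, 0.0, 0.0
    return a / r, b / r, r

A = np.array([[6.0, 5.0, 0.0],
              [5.0, 1.0, 4.0],
              [0.0, 4.0, 3.0]])

# Zero out A[1, 0] by rotating rows 0 and 1.
c, s, r = givens(A[0, 0], A[1, 0])
G = np.array([[c, s], [-s, c]])
A[[0, 1], :] = G @ A[[0, 1], :]
assert abs(A[1, 0]) < 1e-14 and np.isclose(A[0, 0], r)
```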
For AI: Givens rotations appear in updating QR factorizations when a new row is appended (streaming QR), which is used in online learning and in updating Cholesky factors after rank-1 additions (Cholesky rank-1 update, relevant to incremental GP regression).
4.5 Givens QR and Sparse Matrices
Algorithm: Apply Givens rotations sequentially to zero out the subdiagonal entries of $A$ column by column. To zero $a_{ij}$ (with $i > j$), apply a rotation acting on rows $i$ and $j$ (or on adjacent rows $i-1$ and $i$) on the left.
For a dense matrix, the total number of Givens rotations needed is $\approx mn - n(n+1)/2$ (one per subdiagonal entry), each costing $O(n)$ flops - total $\approx 3n^2(m - n/3)$ flops, roughly 50% more than Householder QR for dense matrices.
Advantage for sparse matrices: Each Givens rotation touches exactly two rows and two columns. If has a known sparsity pattern, Givens rotations can be sequenced to minimize fill-in (new nonzeros created by the transformation). In contrast, a single Householder reflector is a rank-2 update that touches an entire panel, potentially creating dense fill.
Banded matrices: For a banded matrix with bandwidth $b$, Givens QR costs $O(nb^2)$ and the resulting $R$ remains banded - far cheaper than dense Householder QR, whose reflectors destroy the band structure. This is critical for signal processing and finite-element computations.
Online/streaming QR: When a new row is appended to $A$, the existing QR factorization can be updated using a single sweep of $n$ Givens rotations, costing $O(n^2)$ rather than recomputing from scratch in $O(mn^2)$.
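A sketch of the row-append update: given the current R factor, fold a new row in with one sweep of Givens rotations (the name qr_append_row is illustrative, not a library API; only R is tracked here, which suffices for least-squares updates when $Q^Tb$ is maintained alongside):

```python
import numpy as np

def qr_append_row(R, w):
    """Given the n x n R of A = QR, return the R of [A; w] using one Givens sweep, O(n^2)."""
    n = R.shape[1]
    M = np.vstack([R, w.reshape(1, -1)])            # (n+1) x n working array
    for j in range(n):
        a, b = M[j, j], M[n, j]
        r = np.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r
        M[[j, n], j:] = np.array([[c, s], [-s, c]]) @ M[[j, n], j:]
    return M[:n, :]

rng = np.random.default_rng(6)
A = rng.standard_normal((50, 4))
w = rng.standard_normal(4)
R = np.linalg.qr(A, mode="r")
R_new = qr_append_row(R, w)
A_new = np.vstack([A, w])
# R is unique only up to row signs, so compare via the Gram matrix R^T R = A^T A.
assert np.allclose(R_new.T @ R_new, A_new.T @ A_new)
```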
For AI: Streaming QR via Givens is used in online learning settings where data arrives sequentially. It also underlies incremental SVD algorithms used for streaming PCA (e.g., in continual learning, where new tasks arrive without access to past data).
4.6 Column-Pivoted QR (RRQR)
Motivation: Standard QR doesn't reveal rank. The diagonal entries of $R$ decay, but not necessarily in a way that makes rank determination reliable. Column-pivoted QR (RRQR) reorders the columns of $A$ to expose rank structure in $R$.
Algorithm: At step $k$, before computing the $k$-th Householder reflector, swap column $k$ with the column having maximum 2-norm among the remaining columns. This produces:

$$AP = QR,$$

where $P$ is a permutation and $R$ has non-increasing diagonal magnitudes, $|r_{11}| \ge |r_{22}| \ge \cdots \ge |r_{nn}|$.
Rank estimation: If $A$ has numerical rank $r$, then the drop from $|r_{rr}|$ to $|r_{r+1,r+1}|$ is large. The threshold for declaring rank $r$ is of the form $|r_{r+1,r+1}| < \tau\, |r_{11}|$ for a user-chosen tolerance $\tau$.
RRQR theorem (Golub 1965, Chandrasekaran & Ipsen 1994): Strong RRQR algorithms guarantee bounds of the form

$$\sigma_{\min}(R_{11}) \ge \frac{\sigma_k(A)}{p(k, n)}, \qquad \|R_{22}\|_2 \le p(k, n)\, \sigma_{k+1}(A),$$

for a low-degree polynomial $p(k, n)$, ensuring that $R_{11}$ (the leading $k \times k$ block) captures essentially all the energy of the matrix.
Connection to SVD: RRQR is a cheap ($O(mn^2)$, with a small constant and the option to stop after $r$ steps) approximation to the SVD. It provides the same rank information and a good basis for the column space, but approximates the singular values only roughly. For exact singular values, use the SVD at higher cost.
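Column-pivoted QR is available through scipy.linalg.qr(..., pivoting=True); a sketch of rank estimation from the decay of $|r_{kk}|$ (the tolerance here is illustrative and, as noted above, problem-dependent):

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(7)
# A 100 x 40 matrix of exact rank 10.
A = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 40))

Q, R, piv = qr(A, mode="economic", pivoting=True)
diag = np.abs(np.diag(R))
tol = diag[0] * 1e-10                      # user-chosen tolerance
est_rank = int(np.sum(diag > tol))
print(est_rank)                            # expected: 10 (sharp drop after position 10)
assert np.allclose(A[:, piv], Q @ R)       # A with permuted columns equals Q R
```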
For AI:
- LoRA (Hu et al., 2022): Low-Rank Adaptation uses rank-$r$ factorizations of weight updates, $\Delta W = BA$. Choosing $r$ requires rank estimation - RRQR is one approach.
- DoRA (Liu et al., 2024): Decomposes weight updates into magnitude and direction components; rank-revealing factorizations can be used to identify the dominant directions.
- Intrinsic dimensionality: RRQR-based rank estimation reveals the intrinsic dimensionality of weight matrices, guiding compression decisions.
4.7 Tall-Skinny QR (TSQR)
Motivation: For matrices with $m \gg n$ (tall and skinny - e.g., $m = 10^6$, $n = 100$), standard Householder QR makes a pass over the tall panel at each of the $n$ steps, and the resulting data movement between levels of the memory hierarchy (or between devices) becomes the bottleneck for distributed or GPU computation.
TSQR algorithm (Demmel et al., 2008):
- Local factorization: Partition $A$ vertically into $p$ panels $A_1, \dots, A_p$ (one per processor/GPU block).
- Local QR: Factor each $A_i = Q_i R_i$ independently (no communication).
- Reduction tree: Stack pairs of $R$ factors, $\begin{pmatrix} R_i \\ R_j \end{pmatrix}$, and factor the stack's QR to get a combined $R$. Repeat up the tree.
- Result: the $R$ at the root of the tree is the final upper triangular factor. $Q$ can be recovered by traversing the reduction tree backward.
Communication cost: TSQR moves only the small $n \times n$ $R$ factors between processors - $O(n^2 \log p)$ words for $p$ processors, versus data movement proportional to the full panel for column-by-column Householder - achieving communication optimality up to logarithmic factors.
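A two-block sketch of the reduction idea (a real implementation would recurse over a tree and track $Q$; here only $R$ is formed and checked):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((10_000, 50))            # tall and skinny

# Step 1: local QR on each panel (these would run on different devices).
A1, A2 = np.array_split(A, 2, axis=0)
R1 = np.linalg.qr(A1, mode="r")
R2 = np.linalg.qr(A2, mode="r")

# Step 2: reduce - factor the stacked small R's (only 2n x n data is communicated).
R = np.linalg.qr(np.vstack([R1, R2]), mode="r")

# R agrees with the direct factorization up to row signs.
R_direct = np.linalg.qr(A, mode="r")
assert np.allclose(np.abs(R), np.abs(R_direct))
```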
For AI:
- Distributed training: TSQR is used in distributed least-squares and in computing the QR of gradient matrices across multiple GPUs.
- Randomized methods: TSQR underlies the "sketch-then-QR" approach for randomized low-rank factorization.
- Attention-efficient computation: The block-tile structure of FlashAttention-3 (Shah et al., 2024) is inspired by the communication-optimal blocking patterns of TSQR.
4.8 QR for Least Squares
Problem: Given $A \in \mathbb{R}^{m \times n}$ with $m > n$ (overdetermined) and $b \in \mathbb{R}^m$, find $x$ minimizing $\|Ax - b\|_2$.
Method 1 (Normal equations): Solve $A^TAx = A^Tb$. Form $A^TA$, factor via Cholesky, solve. Cost: $\approx mn^2 + \frac{n^3}{3}$. Problem: $\kappa(A^TA) = \kappa(A)^2$ - squaring the condition number doubles the digits lost.
Method 2 (QR): Factor $A = QR$ (thin QR), then:

$$\|Ax - b\|_2^2 = \|Rx - Q^Tb\|_2^2 + \|(I - QQ^T)b\|_2^2.$$

Minimizing over $x$: solve $Rx = Q^Tb$ (upper triangular, backward substitution). Cost: $\approx 2mn^2 - \frac{2n^3}{3}$ for QR, $O(mn)$ for $Q^Tb$, $O(n^2)$ for the backsolve. The condition number that governs the accuracy is $\kappa(A)$, not $\kappa(A)^2$.
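A sketch contrasting the two methods on an ill-conditioned design matrix (exact error values vary by platform):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, qr, solve_triangular

rng = np.random.default_rng(9)
m, n = 200, 10
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -6, n)                         # singular values: kappa(A) = 1e6
A = (U * s) @ V.T
x_true = np.ones(n)
b = A @ x_true

# Normal equations: the condition number is squared (kappa^2 = 1e12).
G = A.T @ A
x_ne = cho_solve(cho_factor(G), A.T @ b)

# QR: works with kappa(A) directly.
Q, R = qr(A, mode="economic")
x_qr = solve_triangular(R, Q.T @ b, lower=False)

print(np.linalg.norm(x_ne - x_true))   # typically around 1e-4 .. 1e-5
print(np.linalg.norm(x_qr - x_true))   # typically around 1e-10
```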
Stability comparison:
| Method | Condition amplification | Cost | When to use |
|---|---|---|---|
| Normal equations + Cholesky | $\kappa(A)^2$ | $mn^2 + n^3/3$ | Well-conditioned $A$, $m \gg n$ |
| QR (Householder) | $\kappa(A)$ | $2mn^2 - 2n^3/3$ | General purpose, numerically safe |
| RRQR | $\kappa(A)$, reveals rank | slightly more than QR | Rank-deficient or near-singular |
| SVD (truncated) | None (sets small SVs to zero) | $O(mn^2)$, larger constant | Ill-conditioned, rank-deficient |
For AI:
- Linear probing: Fitting a linear classifier on top of frozen features is an overdetermined least-squares problem. sklearn.linear_model.LinearRegression solves it via LAPACK least-squares drivers such as DGELSD (divide-and-conquer SVD) or DGELSY (column-pivoted QR).
- Ridge regression: $\min_x \|Ax - b\|_2^2 + \lambda \|x\|_2^2$ is equivalent to the augmented least-squares system $\begin{pmatrix} A \\ \sqrt{\lambda}\, I \end{pmatrix} x \approx \begin{pmatrix} b \\ 0 \end{pmatrix}$, which QR solves stably.
- Attention weight regression: Computing attention pattern regression (e.g., for mechanistic interpretability) uses QR-based least squares.
5. Cholesky Factorization (Computational Recap)
5.1 Brief Overview
Full treatment: The complete theory of Cholesky factorization - existence proofs, LDL^T, modified Cholesky, log-determinant, connection to PSD cone - is in 07: Positive Definite Matrices. This section covers only the computational aspects not covered in 07: the relationship to LU, blocked algorithms, and LAPACK routines.
For a symmetric positive definite matrix $A$, the Cholesky factorization is:

$$A = LL^T,$$

where $L$ is lower triangular with positive diagonal entries $\ell_{ii} > 0$. The factorization exists and is unique for every symmetric positive definite $A$ (Theorem from 07).
Why Cholesky for SPD systems:
- Efficiency: Cholesky costs $\approx \frac{n^3}{3}$ flops - exactly half of LU's $\frac{2n^3}{3}$ - because symmetry halves the work.
- Storage: Only the lower triangle needs to be stored ($n(n+1)/2$ entries vs $n^2$ for LU).
- No pivoting needed: Positive definiteness guarantees a strictly positive pivot at every step without any row interchanges.
- Stability: Cholesky is backward stable for SPD matrices without any pivoting.
5.2 Cholesky as Specialized LU
The connection between Cholesky and LU: for a symmetric positive definite $A$, the LU factorization (without pivoting, which exists since all leading principal minors of $A$ are positive) gives $A = LU$ with $L$ unit lower triangular. But $A = A^T$ implies $LU = U^TL^T$, so $U = DL^T$ where $D = \operatorname{diag}(u_{11}, \dots, u_{nn})$. Thus $A = LDL^T$ (the LDL^T form). Since $A \succ 0$, all diagonal entries $d_{ii} > 0$, and we can define $G = LD^{1/2}$ to get $A = GG^T$ - the Cholesky factorization.
Cholesky algorithm ($j$th column):

$$\ell_{jj} = \sqrt{a_{jj} - \sum_{k=1}^{j-1} \ell_{jk}^2}, \qquad \ell_{ij} = \frac{1}{\ell_{jj}}\left(a_{ij} - \sum_{k=1}^{j-1} \ell_{ik}\ell_{jk}\right) \quad \text{for } i > j.$$

The argument of the square root must be positive - this is guaranteed by positive definiteness. (A NumPy sketch of this column algorithm appears after the cost analysis below.)
Cost analysis:
- Column $j$ costs: 1 square root, $\approx 2j$ flops for $\ell_{jj}$, plus $\approx 2j(n-j)$ flops for the entries below the diagonal
- Total: $\approx \frac{n^3}{3}$ flops, exactly half of LU's $\frac{2n^3}{3}$
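A direct transcription of the column formulas (a sketch; numpy.linalg.cholesky or scipy.linalg.cholesky should be used in practice):

```python
import numpy as np

def cholesky_unblocked(A):
    """Lower Cholesky factor of a symmetric positive definite A, column by column."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        d = A[j, j] - L[j, :j] @ L[j, :j]
        if d <= 0:
            raise np.linalg.LinAlgError("matrix is not positive definite")
        L[j, j] = np.sqrt(d)
        L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]
    return L

rng = np.random.default_rng(10)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6 * np.eye(6)                 # SPD by construction
L = cholesky_unblocked(A)
assert np.allclose(L @ L.T, A)
assert np.allclose(L, np.linalg.cholesky(A))
```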
5.3 LDL^T for Indefinite Systems
For symmetric indefinite matrices (not necessarily PD), Cholesky cannot be applied directly (negative arguments under the square root). The LDL^T factorization computes:

$$PAP^T = LDL^T,$$

where $L$ is unit lower triangular, $D$ is block diagonal with $1 \times 1$ and $2 \times 2$ blocks (to handle negative eigenvalues), and $P$ is a permutation.
Bunch-Kaufman pivoting: The $2 \times 2$ blocks in $D$ capture pairs of eigenvalues of opposite sign, avoiding square roots of negative numbers altogether. The pivoting strategy (Bunch & Kaufman 1977) keeps the entries of $L$ bounded - a stability guarantee analogous to $|\ell_{ij}| \le 1$ for partial pivoting in LU.
Applications in ML:
- Indefinite Hessians: At saddle points (common early in neural network training), the Hessian is indefinite. LDL^T factors it without the positive-definiteness requirement, enabling second-order descent directions even at saddle points.
- Modified Cholesky: Adding a diagonal shift $\tau I$ so that $H + \tau I \succ 0$ before Cholesky factorization is the standard approach in L-BFGS-B and quasi-Newton methods. See 07 for the modified Cholesky algorithm.
LAPACK routine: DSYTRF implements LDL^T with Bunch-Kaufman pivoting.
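SciPy exposes the symmetric indefinite factorization as scipy.linalg.ldl (a wrapper over the LAPACK sytrf family); a sketch on an indefinite matrix:

```python
import numpy as np
from scipy.linalg import ldl

# A symmetric indefinite matrix (one negative eigenvalue): plain Cholesky would fail.
A = np.array([[ 2.0,  1.0,  0.0],
              [ 1.0, -3.0,  1.0],
              [ 0.0,  1.0,  2.0]])

lu, d, perm = ldl(A, lower=True)              # P A P^T = L D L^T with 1x1 / 2x2 blocks in D
assert np.allclose(lu @ d @ lu.T, A)          # SciPy returns the factor with the permutation folded in
assert np.any(np.linalg.eigvalsh(d) < 0)      # D carries the negative eigenvalue
```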
5.4 Blocked Cholesky and LAPACK dpotrf
Blocked Cholesky follows the same blocked structure as blocked LU. For block size $b$:
for k = 0 to n/b - 1:
    # Factor the diagonal block
    L_kk = cholesky(A[kb:(k+1)b, kb:(k+1)b])                        # O(b^3)
    # Solve for the panel below the diagonal block (DTRSM):
    # find L_below with L_below @ L_kk^T = A21, i.e. L_kk @ L_below^T = A21^T
    L_below = solve_lower_triangular(L_kk, A[(k+1)b:n, kb:(k+1)b].T).T
    # Update trailing submatrix (DSYRK)
    A[(k+1)b:n, (k+1)b:n] -= L_below @ L_below.T
The trailing submatrix update (A -= L_below @ L_below.T) is a symmetric rank-$b$ update, computed by LAPACK's DSYRK (symmetric rank-k update), which exploits symmetry to halve the work relative to DGEMM.
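A NumPy sketch of the blocked algorithm (illustrative only; the comments name the corresponding BLAS/LAPACK kernels, and real code should call DPOTRF via the libraries below):

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_cholesky(A, b=64):
    """Lower Cholesky factor computed block column by block column."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.zeros_like(A)
    for k in range(0, n, b):
        e = min(k + b, n)
        L[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])                  # diagonal block (POTRF)
        if e < n:
            # Panel solve (TRSM): L21 @ L11^T = A21, i.e. L11 @ L21^T = A21^T.
            L[e:, k:e] = solve_triangular(L[k:e, k:e], A[e:, k:e].T, lower=True).T
            # Symmetric rank-b update of the trailing block (SYRK).
            A[e:, e:] -= L[e:, k:e] @ L[e:, k:e].T
    return L

rng = np.random.default_rng(11)
B = rng.standard_normal((300, 300))
A = B @ B.T + 300 * np.eye(300)
assert np.allclose(blocked_cholesky(A), np.linalg.cholesky(A))
```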
LAPACK routine DPOTRF: The production Cholesky routine. On entry: the upper or lower triangle of $A$. On exit: $L$ (or $U = L^T$ for the upper variant). Returns an error flag INFO (INFO = 0 for success; INFO = $k > 0$ if the $k$-th leading minor is not positive definite).
For AI: DPOTRF is called by:
- scipy.linalg.cholesky (wraps LAPACK)
- numpy.linalg.cholesky (wraps LAPACK)
- torch.linalg.cholesky (calls cuSOLVER cusolverDnDpotrf on GPU)
- Every GP regression library (GPyTorch, GPflow, etc.) for computing $K^{-1}y$ and $\log\det K$