Matrix Decompositions, Part 1: Intuition to Cholesky Factorization (Computational Recap)

1. Intuition

1.1 The Factorization Paradigm

Every problem in computational linear algebra is, at its heart, a search for structure. When you factor a matrix into simpler pieces - triangular, orthogonal, diagonal - you transform a hard problem into a sequence of easy ones. The word "easy" is precise: triangular systems can be solved in O(n^2) operations by substitution; orthogonal transformations are numerically perfect (they preserve norms exactly); diagonal systems are trivially solvable. A decomposition is a way of peeling back complexity to reveal the structural skeleton underneath.

This perspective explains why factorizations are not merely computational conveniences. They are mathematical revelations. LU factorization reveals which rows are "heavy" (large pivots dominate) and which are "light" (small pivots signal near-singularity). QR factorization reveals the orthogonal complement of the column space via the trailing zero rows of R in the full factorization. Cholesky factorization reveals the "matrix square root" L that encodes the geometry of a positive definite quadratic form. Each decomposition is a lens tuned to a different structural feature.

The central insight: nearly every matrix algorithm of practical importance runs in O(n^3) time, dominated by the factorization phase. After factorization, the actual problem-solving (applying L^{-1}, U^{-1}, Q^\top) costs only O(n^2). This is the fundamental return on investment: spend O(n^3) once to factor, then spend O(n^2) as many times as needed for different right-hand sides.
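
A minimal SciPy sketch of this pay-once, solve-many pattern (the matrix size and loop count are arbitrary choices for illustration):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 1000))

lu, piv = lu_factor(A)              # O(n^3): factor once
for _ in range(50):                 # O(n^2) per right-hand side afterwards
    b = rng.standard_normal(1000)
    x = lu_solve((lu, piv), b)      # permutation + two triangular solves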

For AI: this principle appears in natural gradient methods (factor the Fisher information matrix once per outer loop, solve multiple gradient updates cheaply), in Gaussian process regression (factor the covariance matrix once, evaluate posterior at many test points cheaply), and in iterative solvers (factor a preconditioner once, apply it repeatedly to accelerate convergence).

1.2 Triangular Systems: The Foundation

Before understanding factorizations, you must understand why triangular systems are considered "solved." A lower triangular system L\mathbf{x} = \mathbf{b} with L_{ii} \neq 0 can be solved row by row from top to bottom:

x_1 = b_1 / L_{11}, \qquad x_2 = (b_2 - L_{21}x_1) / L_{22}, \qquad x_i = \left(b_i - \sum_{j=1}^{i-1} L_{ij} x_j\right) / L_{ii}

This is forward substitution: each unknown is determined uniquely by those already computed. The computation requires roughly n^2/2 multiplications and n^2/2 additions - O(n^2) total. Similarly, an upper triangular system U\mathbf{x} = \mathbf{b} is solved by backward substitution from bottom to top.

The key properties of triangular systems:

  1. Existence and uniqueness: a triangular system has a unique solution if and only if all diagonal entries are nonzero.
  2. Numerical stability: forward/backward substitution are backward stable - the computed solution satisfies a nearby triangular system. Accuracy degrades when the triangular factor is ill-conditioned, and a tiny diagonal entry is the usual culprit.
  3. Parallelism: the sequential dependency (each unknown requires all previous) limits parallelism, but blocked implementations can exploit level-2 BLAS parallelism within panels.
  4. Structural exploitation: sparse triangular systems (band, arrowhead, bidiagonal) can be solved in O(nk) for bandwidth k.

Non-example: a general dense n \times n system A\mathbf{x} = \mathbf{b} cannot be solved by substitution directly - it requires O(n^3) work. The entire purpose of LU, QR, and Cholesky is to reduce the general case to triangular cases.

1.3 Three Canonical Decompositions

The three decompositions in this section partition the space of structured matrix problems:

Decomposition | Matrix class | Factored form | Primary use
LU | General square, non-singular | PA = LU | Solving A\mathbf{x} = \mathbf{b}, computing \det(A)
QR | Any rectangular m \times n, m \geq n | A = QR | Least squares, rank determination, eigenvalue algorithms
Cholesky | Symmetric positive definite | A = LL^\top | SPD systems, Gaussian sampling, log-det computation

Why three? Each exploits a different structural property. LU uses no special structure beyond invertibility; it is the most general and therefore requires more care (pivoting for stability). QR uses orthogonality, which is the most numerically pristine structure - Q has condition number 1 by definition. Cholesky uses both symmetry and positive definiteness, which allows working with only half the matrix and guarantees the factorization exists without pivoting.

The hierarchy of stability: Cholesky > QR > LU (partially pivoted) > LU (no pivoting). As you sacrifice structure, you need more algorithmic sophistication to maintain numerical reliability.

1.4 Why This Matters for AI

Matrix factorizations are not background infrastructure - they are in the critical path of the most computationally expensive operations in modern AI.

Gaussian process regression (used in Bayesian optimization, uncertainty quantification, and hyperparameter tuning) requires computing K^{-1}\mathbf{y} and \log\det K for an n \times n kernel matrix K. Both require Cholesky factorization. A single GP training step on n = 10{,}000 points costs O(n^3) = 10^{12} operations - factorization is the bottleneck.
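
A hedged NumPy/SciPy sketch of that workload (the RBF kernel, jitter value, and problem size are illustrative assumptions, not any GP library's API):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
# Hypothetical RBF kernel matrix with a small diagonal jitter for numerical safety
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists) + 1e-6 * np.eye(len(X))
y = rng.standard_normal(len(X))

c, low = cho_factor(K, lower=True)            # O(n^3), done once
alpha = cho_solve((c, low), y)                # K^{-1} y in O(n^2)
logdet = 2.0 * np.log(np.diag(c)).sum()       # log det K from the Cholesky diagonal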

Second-order optimization (Newton's method, natural gradient, K-FAC) requires solving Hessian or Fisher information systems H\boldsymbol{\delta} = \mathbf{g}. For K-FAC (Kronecker-Factored Approximate Curvature, Martens & Grosse 2015), the Fisher approximation is block-diagonal with Kronecker structure, and each block is inverted via Cholesky - thousands of small Cholesky factorizations per training step.

Linear probing and fine-tuning - fitting a linear model on top of frozen features - is a least-squares problem. Numerically reliable implementations use QR (e.g., numpy.linalg.lstsq calls LAPACK's dgelsd which uses divide-and-conquer SVD, or dgelsy which uses column-pivoted QR).

Differentiating through factorizations is required whenever a factorization appears inside a differentiable computational graph. PyTorch's torch.linalg.cholesky_solve and JAX's jax.lax.linalg.cholesky support automatic differentiation through the factorization via the implicit function theorem.

Randomized methods (LoRA, sketching, randomized preconditioning) use approximate LU and QR factorizations on sketched matrices. Understanding the exact factorizations is prerequisite to understanding their randomized variants.

1.5 Historical Timeline

Year | Contribution | Author(s)
1809 | Gaussian elimination for geodesy (published 1810) | Carl Friedrich Gauss
1938 | LU factorization formalized as a matrix decomposition | Tadeusz Banachiewicz
1945 | Systematic analysis of elimination and error growth | Alan Turing
1954 | Givens rotations for QR | Wallace Givens
1958 | Householder reflections; modern QR algorithm | Alston Householder
1961 | QR algorithm for eigenvalues (QR iteration) | Francis; Kublanovskaya
1971 | ALGOL Handbook programs for matrix problems (LAPACK precursor) | Wilkinson, Reinsch
1979 | LINPACK (first standardized library) | Dongarra et al.
1987 | LAPACK + BLAS-3 blocked algorithms | Anderson, Bai, Dongarra et al.
1995 | Randomized SVD (sketch-then-factor) | Frieze, Kannan, Vempala
2011 | Randomized low-rank approximation framework | Halko, Martinsson, Tropp
2015 | K-FAC: Cholesky for Fisher information blocks | Martens & Grosse
2022 | LoRA: low-rank adaptation of weight updates | Hu et al.
2024 | FlashAttention-3: blocked QR-like tiling for attention | Shah et al.

2. Background: Triangular Systems

2.1 Forward Substitution

Definition. Given a lower triangular matrix L \in \mathbb{R}^{n \times n} with L_{ii} \neq 0 for all i, and \mathbf{b} \in \mathbb{R}^n, forward substitution computes \mathbf{x} satisfying L\mathbf{x} = \mathbf{b} via:

x_i = \frac{1}{L_{ii}}\left(b_i - \sum_{j=1}^{i-1} L_{ij} x_j\right), \quad i = 1, 2, \ldots, n

Algorithm (row-oriented):

for i = 1 to n:
    x[i] = b[i]
    for j = 1 to i-1:
        x[i] -= L[i,j] * x[j]
    x[i] /= L[i,i]

Complexity: The inner loop runs 0, 1, \ldots, n-1 times, giving \sum_{i=1}^{n}(i-1) = n(n-1)/2 multiplications and the same number of additions. Total: roughly n^2 floating-point operations, i.e. O(n^2).

Numerical properties: Forward substitution is backward stable: the computed solution satisfies a nearby triangular system. The forward error, however, scales with the condition number \kappa(L) of the triangular factor, not with the size of the intermediate values x_j. The main source of numerical difficulty is a small diagonal entry L_{ii}: dividing by a near-zero diagonal strongly amplifies any error already present in the right-hand side.

Unit lower triangular: When L_{ii} = 1 for all i (as in the LU factorization, where L is unit lower triangular by convention), the divisions are trivial and the algorithm reduces to:

for i = 1 to n:
    x[i] = b[i] - sum(L[i,1:i-1] * x[1:i-1])

This eliminates n divisions, slightly improving performance and stability.
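
A runnable version of the row-oriented algorithm above, checked against SciPy (a minimal sketch; the shifted diagonal just keeps the pivots well away from zero):

import numpy as np
from scipy.linalg import solve_triangular

def forward_substitution(L, b):
    """Row-oriented forward substitution for lower triangular L with nonzero diagonal."""
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

rng = np.random.default_rng(0)
L = np.tril(rng.standard_normal((5, 5))) + 5 * np.eye(5)
b = rng.standard_normal(5)
print(np.allclose(forward_substitution(L, b), solve_triangular(L, b, lower=True)))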

For AI: Forward substitution appears in the final step of any matrix factorization-based solver. In neural network training with second-order methods (Newton, K-FAC), the natural gradient direction \boldsymbol{\delta} = F^{-1}\nabla\mathcal{L} is extracted from the Cholesky factor L of the Fisher matrix F by two triangular solves: forward substitution with L followed by backward substitution with L^\top.

2.2 Backward Substitution

Definition. Given an upper triangular U \in \mathbb{R}^{n \times n} with U_{ii} \neq 0, backward substitution computes \mathbf{x} satisfying U\mathbf{x} = \mathbf{b} via:

x_i = \frac{1}{U_{ii}}\left(b_i - \sum_{j=i+1}^{n} U_{ij} x_j\right), \quad i = n, n-1, \ldots, 1

The algorithm proceeds from the last equation (which has a single unknown, x_n = b_n / U_{nn}) backward to the first. Complexity and numerical properties mirror forward substitution.

Column-oriented backward substitution: An equivalent formulation updates the remaining components of \mathbf{b} as each x_i is resolved:

for i = n downto 1:
    x[i] = b[i] / U[i,i]
    b[1:i-1] -= x[i] * U[1:i-1, i]

This accesses U by columns rather than rows, which can improve cache efficiency on column-major storage (Fortran, LAPACK convention).

Connection to matrix inverse: The columns of U^{-1} can be computed by solving n triangular systems U\mathbf{x}^{(k)} = \mathbf{e}_k, one per standard basis vector. This is how scipy.linalg.solve_triangular computes U^{-1} when called with a matrix right-hand side.
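
A small sketch of that column-by-column idea, passing the identity as a matrix right-hand side (sizes are arbitrary):

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
U = np.triu(rng.standard_normal((4, 4))) + 4 * np.eye(4)

# Solving U x^(k) = e_k for every basis vector at once: the RHS is the identity matrix
U_inv = solve_triangular(U, np.eye(4), lower=False)
print(np.allclose(U @ U_inv, np.eye(4)))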

2.3 Gaussian Elimination Revisited

Gaussian elimination is the process of reducing a matrix A to upper triangular form by applying elementary row operations (EROs). Each ERO of type "add c times row j to row i (with i > j)" corresponds to left-multiplication by an elementary lower triangular matrix (an elementary elimination matrix or Frobenius matrix):

E_{ij} = I - m_{ij}\mathbf{e}_i\mathbf{e}_j^\top, \quad m_{ij} = A_{ij}/A_{jj}

The multiplier m_{ij} is the ratio that zeros out entry (i,j). Applying all elimination matrices to reduce A to U gives:

E_{n,n-1} \cdots E_{31} E_{21} A = U

Crucially, the product of elementary lower triangular matrices is lower triangular, and their inverses are trivially:

E_{ij}^{-1} = I + m_{ij}\mathbf{e}_i\mathbf{e}_j^\top

So the product of inverses, L = E_{21}^{-1} E_{31}^{-1} \cdots E_{n,n-1}^{-1}, is unit lower triangular with the multipliers m_{ij} sitting in their natural positions. This gives:

A = LU

The multipliers go directly into L: one of the most elegant facts in numerical linear algebra is that the multipliers m_{ij} used during elimination appear, without any further computation, as the subdiagonal entries L_{ij} of the factor L. No separate computation of L is required - it is built up in place during elimination.
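
That fact is easy to verify numerically; below is a minimal unpivoted LU sketch (demo only - production code always pivots, as discussed in 3.3-3.4):

import numpy as np

def lu_nopivot(A):
    """Textbook unpivoted LU: the multipliers m_ik are stored straight into L."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k+1:, k] = A[k+1:, k] / A[k, k]             # multipliers for column k
        A[k+1:, :] -= np.outer(L[k+1:, k], A[k, :])   # eliminate below the pivot
    return L, np.triu(A)

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5)) + 5 * np.eye(5)       # shifted diagonal avoids zero pivots
L, U = lu_nopivot(M)
print(np.allclose(L @ U, M))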


3. LU Factorization

3.1 LU Decomposition Theorem

Theorem (LU factorization). Let A \in \mathbb{R}^{n \times n}. Then A has an LU factorization A = LU (with L unit lower triangular and U upper triangular) if and only if all leading principal submatrices A_k = A[1:k, 1:k] for k = 1, \ldots, n-1 are non-singular.

Proof sketch. The condition A_k non-singular for all k ensures that no zero pivot is encountered during Gaussian elimination. At step k, the pivot is U_{kk} = \det(A_k)/\det(A_{k-1}), so all pivots are nonzero iff all leading minors are nonzero. Conversely, if Gaussian elimination completes without zero pivots, the multipliers are finite and L, U are well-defined. \square

Non-examples:

  • A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} fails: A_{11} = 0 (a zero pivot at step 1).
  • A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}: A_{11} = 1 \neq 0 but \det(A) = 0, so the factorization runs but U_{22} = 0 (singular).
  • Any singular matrix: A = LU \Rightarrow \det(A) = \det(L)\det(U) = \prod_i U_{ii}, so \det(A) = 0 iff some U_{ii} = 0.

Uniqueness: The factorization with L_{ii} = 1 (unit lower triangular) is unique when it exists. Without the normalization, one could scale L and U by any diagonal matrix and its inverse.

The factorization in-place: LU computation overwrites A: the upper triangle stores U, the strict lower triangle stores the multipliers (i.e., the strict lower triangle of L, since L_{ii} = 1 need not be stored). This halves the memory requirement.

3.2 The LU Algorithm: Outer Product Form

The standard view of LU elimination is column-by-column. But the outer product form (also called rank-1 update form) exposes the structure more clearly and maps naturally to BLAS-3 blocked implementations.

At step k of the factorization, we have already reduced columns 1, \ldots, k-1. The remaining (n-k+1) \times (n-k+1) submatrix A^{(k)} (the Schur complement after k-1 steps) is updated by:

A^{(k+1)} = A^{(k)}[k+1:n,\, k+1:n] - \mathbf{l}^{(k)} \mathbf{u}^{(k)\top}

where \mathbf{l}^{(k)} = A^{(k)}[k+1:n,\, k] / A^{(k)}_{kk} (the multipliers column) and \mathbf{u}^{(k)\top} = A^{(k)}[k,\, k+1:n] (the pivot row). This is a rank-1 update of the trailing submatrix.

Algorithm:

for k = 1 to n-1:
    # Compute multipliers
    L[k+1:n, k] = A[k+1:n, k] / A[k, k]    # size (n-k)
    # Rank-1 update of trailing submatrix
    A[k+1:n, k+1:n] -= L[k+1:n, k] * A[k, k+1:n]   # BLAS-2: ger
    U[k, k:n] = A[k, k:n]

Complexity: Step k costs (n-k) divisions and (n-k)^2 multiply-adds. Total:

\sum_{k=1}^{n-1} \left[(n-k) + (n-k)^2\right] = \frac{n(n-1)}{2} + \frac{(n-1)n(2n-1)}{6} \approx \frac{n^3}{3} \text{ multiply-adds}

Counting each multiply-add as two floating-point operations, LU factorization costs about \frac{2n^3}{3} flops. For comparison: solving A\mathbf{x} = \mathbf{b} after factorization costs 2n^2 (two triangular solves). The ratio n/3 means factorization dominates for large n.

For AI: the rank-1 update structure maps directly to GPU tensor cores. Blocked implementations reformulate the trailing submatrix update as a matrix-matrix multiply (BLAS-3: DGEMM), which achieves near-peak GPU throughput.

3.3 Failure Without Pivoting

Naive LU (no pivoting) fails catastrophically when:

  1. Zero pivots: If A_{kk}^{(k)} = 0 during step k, the division by the pivot is undefined. This happens even for non-singular matrices that happen to have a zero leading principal minor.

  2. Small pivots - numerical catastrophe: Consider:

A = \begin{pmatrix} \varepsilon & 1 \\ 1 & 1 \end{pmatrix}, \quad \varepsilon = 10^{-16}

Naive LU gives multiplier m_{21} = 1/\varepsilon = 10^{16}, and U_{22} = 1 - 10^{16} \cdot 1 \approx -10^{16} (in floating-point, this overwhelms the exact value 1 - 1/\varepsilon = (\varepsilon - 1)/\varepsilon). The computed solution \hat{\mathbf{x}} can have relative error of order 1 even though the underlying problem is well-conditioned.

  3. Growth factor explosion: The growth factor \rho_n is defined as:

\rho_n = \frac{\max_{i,j,k} |A_{ij}^{(k)}|}{\max_{i,j} |A_{ij}|}

Without pivoting, \rho_n can grow as 2^{n-1} in the worst case (Wilkinson's matrix), making the computed U useless.

Example of catastrophic failure:

A = \begin{pmatrix} 10^{-20} & 1 \\ 1 & 2 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} 1 \\ 3 \end{pmatrix}

Exact solution: \mathbf{x} \approx (1, 1)^\top. Naive LU in IEEE double precision gives multiplier m_{21} = 10^{20}, yielding U_{22} = 2 - 10^{20} \approx -10^{20}: the original entry 2 is swamped entirely, and the computed solution is garbage.
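
A hedged NumPy sketch of this failure: naive elimination loses x_1 completely, while a pivoted solver recovers the answer (the numbers follow the example above):

import numpy as np

A = np.array([[1e-20, 1.0],
              [1.0,   2.0]])
b = np.array([1.0, 3.0])

# Naive (unpivoted) elimination: the multiplier 1e20 swamps everything it touches
m21 = A[1, 0] / A[0, 0]           # 1e20
U22 = A[1, 1] - m21 * A[0, 1]     # 2 - 1e20 rounds to -1e20: the "2" is gone
y2 = b[1] - m21 * b[0]
x2 = y2 / U22
x1 = (b[0] - A[0, 1] * x2) / A[0, 0]
print(x1, x2)                     # approximately (0.0, 1.0): x1 is completely wrong

print(np.linalg.solve(A, b))      # partial pivoting: approximately (1.0, 1.0)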

For AI: PyTorch's torch.linalg.lu uses partial pivoting by default. numpy.linalg.solve calls LAPACK dgesv which uses partial pivoting. Never use unpivoted LU for numerical computation.

3.4 Partial Pivoting: PA = LU

Idea: Before processing column k, swap row k with the row (among k, k+1, \ldots, n) having the largest absolute value in column k. This ensures |m_{ij}| \leq 1 for all multipliers.

Algorithm with partial pivoting:

for k = 1 to n-1:
    # Find pivot row
    p = argmax_{i >= k} |A[i, k]|
    swap rows k and p in A (and record in permutation P)
    # Now A[k,k] is the largest in column k below
    L[k+1:n, k] = A[k+1:n, k] / A[k, k]       # |L[i,k]| <= 1
    A[k+1:n, k+1:n] -= L[k+1:n, k] * A[k, k+1:n]

Result: PA = LU where P is a permutation matrix encoding all row swaps. The factor L satisfies |L_{ij}| \leq 1 for all i > j.

Growth factor with partial pivoting: The theoretical bound is \rho_n \leq 2^{n-1} (the same as with no pivoting!), but in practice the growth factor is almost always small (empirically \rho_n \approx n^{2/3} for random matrices), and matrices that approach the worst case are essentially never encountered outside contrived examples.

Stability theorem (Wilkinson 1961): Partial pivoting produces a computed factorization \hat{L}\hat{U} such that:

PA = \hat{L}\hat{U} + E, \quad \|E\|_\infty \leq 8n^3 u \cdot \rho_n \cdot \|A\|_\infty

where u is the unit roundoff (e.g., u = 2^{-53} for IEEE double). For typical \rho_n, this bound is well within practical tolerance.

Storage of the permutation: P is stored as a pivot vector \mathbf{p} \in \mathbb{Z}^{n-1} where p_k is the row index selected at step k. Applying P to \mathbf{b} before the triangular solve costs O(n).
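
SciPy exposes the explicit-matrix form of the pivoted factorization; a quick sketch (note that scipy.linalg.lu returns factors satisfying A = P L U, i.e. P^\top A = LU in the convention used above):

import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

P, L, U = lu(A)                                  # A = P @ L @ U
print(np.allclose(P @ L @ U, A))
print(np.abs(np.tril(L, -1)).max() <= 1.0)       # partial pivoting keeps |L_ij| <= 1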

Non-example: Partial pivoting does NOT guarantee that \rho_n is small. For

A = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}

both candidate pivots in column 1 have magnitude 1, so no swap is needed; after one elimination step U_{22} = 2 and \rho_2 = 2. The pattern generalizes: Wilkinson's matrix (ones on the diagonal and in the last column, -1 below the diagonal) satisfies the partial-pivoting rule at every step, yet its last column doubles at each step, giving \rho_n = 2^{n-1}.

3.5 Complete Pivoting: PAQ = LU

Complete pivoting selects both the row and column with the globally largest absolute entry at each step:

At step k: find (p, q) = \arg\max_{i \geq k,\, j \geq k} |A_{ij}|, then swap row k \leftrightarrow p and column k \leftrightarrow q.

Result: PAQ = LU where P and Q are permutation matrices (row and column respectively).

Growth factor bound: Complete pivoting satisfies the tighter bound:

\rho_n \leq \left(n \cdot 2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)}\right)^{1/2}

which grows much more slowly than 2^{n-1} but is still superlinear. Numerically, complete pivoting is extremely stable - no practical example of large growth is known.

Cost: Finding the global maximum at each step requires examining the full trailing submatrix, adding O(n^2) comparisons per step and O(n^3) comparisons in total - a substantial constant-factor increase in total work over partial pivoting. This is the reason complete pivoting is rarely used in practice.

When to use complete pivoting:

  • When stability is paramount (e.g., certified computation, interval arithmetic)
  • When the matrix is suspected to be rank-deficient (complete pivoting exposes rank)
  • When solving very ill-conditioned systems where partial pivoting is known to fail

For AI: Complete pivoting is rarely used in mainstream ML workflows. Partial pivoting suffices for almost all practical problems. The theoretical importance of complete pivoting is in establishing lower bounds and in the analysis of rank-revealing factorizations.

3.6 Rook Pivoting

Rook pivoting is a middle ground between partial and complete pivoting. At step k:

  1. Find the largest element in column k (as in partial pivoting) - say row p.
  2. Check if |A_{pk}| is also the largest in row p. If yes, use (p, k) as pivot.
  3. If not, find the largest in row p, swap columns, and repeat until convergence.

The name comes from the chess rook: the pivot search alternates between column and row moves until it finds a position that is simultaneously the largest in its row and column - a "rook position."

Properties:

  • Growth factor: \rho_n \leq 2^{n-1} in theory, but in practice the growth is much smaller than with partial pivoting.
  • Cost: O(n^2) additional work per step in the worst case but typically O(n) in practice (convergence in 1-2 alternations).
  • Rank-revealing: Rook pivoting reveals rank approximately as well as complete pivoting.

Rank detection (Foster 1997): Rook pivoting is rank-revealing in practice: for a matrix of numerical rank r, the pivot magnitudes drop sharply after position r, so the jump in the diagonal of U at position r+1 is clearly visible.

3.7 LU Stability Analysis and Backward Error

The definitive framework for understanding LU stability is backward error analysis, developed by Wilkinson in the 1960s.

Definition (backward error). The backward error of a computed solution \hat{\mathbf{x}} to A\mathbf{x} = \mathbf{b} is the smallest perturbation \delta A, \delta\mathbf{b} such that:

(A + \delta A)\hat{\mathbf{x}} = \mathbf{b} + \delta\mathbf{b}

A method is backward stable if \|\delta A\| / \|A\| and \|\delta\mathbf{b}\| / \|\mathbf{b}\| are of order u (machine epsilon) regardless of the input.

Backward error theorem for LU with partial pivoting:

Theorem (Wilkinson 1961). Let \hat{\mathbf{x}} be the solution computed by LU with partial pivoting in IEEE double precision. Then:

(A + \delta A)\hat{\mathbf{x}} = \mathbf{b}, \quad \|\delta A\|_\infty \leq c_n u \cdot \rho_n \cdot \|A\|_\infty

where c_n = O(n^2) is a modest polynomial constant and u = 2^{-53} \approx 1.1 \times 10^{-16}.

Forward error bound: The forward error satisfies:

\frac{\|\mathbf{x} - \hat{\mathbf{x}}\|_\infty}{\|\mathbf{x}\|_\infty} \leq c_n u \cdot \rho_n \cdot \kappa_\infty(A) + O(u^2)

where \kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty is the condition number. This is the key identity: numerical error = roundoff \times growth factor \times condition number.

Implication for practice: If \kappa(A) \approx 10^k in double precision, you lose about k digits of accuracy. With u \approx 10^{-16}, you have roughly 16 - k correct digits in \hat{\mathbf{x}}.
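
A small experiment that makes the digit-loss rule concrete, using the notoriously ill-conditioned Hilbert matrix (the specific size is an arbitrary choice):

import numpy as np
from scipy.linalg import hilbert

n = 10
A = hilbert(n)                 # condition number around 1e13
x_true = np.ones(n)
b = A @ x_true

x_hat = np.linalg.solve(A, b)  # LU with partial pivoting under the hood
kappa = np.linalg.cond(A)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"cond(A) ~ 1e{np.log10(kappa):.0f}, relative error ~ 1e{np.log10(rel_err):.0f}")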

Backward stability of QR: QR (Householder) is backward stable with growth factor \rho_n = 1 - Householder reflectors are orthogonal and preserve norms exactly. This is why QR is preferred for ill-conditioned systems.

3.8 Blocked LU for Cache Efficiency

Modern hardware has a deep memory hierarchy (L1: 32 KB, L2: 256 KB, L3: 8 MB, DRAM: effectively unbounded). The key metric for algorithmic performance is arithmetic intensity: floating-point operations per byte of data moved. BLAS operations have different intensities:

Operation | BLAS level | Flops | Data | Intensity
\mathbf{y} \leftarrow A\mathbf{x} (DGEMV) | 2 | 2n^2 | n^2 | O(1)
C \leftarrow AB + C (DGEMM) | 3 | 2n^3 | n^2 | O(n)

Blocked LU reformulates the algorithm so that the dominant computation is DGEMM (matrix-matrix multiply), achieving O(n) arithmetic intensity and near-peak GPU/CPU throughput.

Block algorithm (block size b):

for k = 0 to n/b - 1:
    # Factor the current panel A[kb:n, kb:(k+1)b] (unblocked, BLAS-2 inside the panel)
    P_k, L_panel, U_kk = LU_factor(A[kb:n, kb:(k+1)b])       # O(n b^2)
    # L_panel stacks L_kk (b x b, unit lower triangular) on top of L_below
    L_kk, L_below = L_panel[0:b, :], L_panel[b:, :]
    # Apply the panel's row permutation to the rest of the matrix
    A[kb:n, (k+1)b:n] = P_k @ A[kb:n, (k+1)b:n]
    # Solve for the U panel to the right (triangular solve, BLAS-3: DTRSM)
    U_right = solve_lower_triangular(L_kk, A[kb:(k+1)b, (k+1)b:n])
    # Update trailing submatrix (BLAS-3: DGEMM)
    A[(k+1)b:n, (k+1)b:n] -= L_below @ U_right               # this is the bottleneck

The trailing submatrix update (A -= L_below @ U_right) is a pure DGEMM and dominates the total computation. On a modern CPU/GPU with peak DGEMM performance, this block formulation achieves 80-95% of theoretical peak.

LAPACK routine: DGETRF implements blocked LU with partial pivoting. The block size b is tuned automatically per platform via the ilaenv routine.

For AI: Blocked LU is the algorithm behind PyTorch's torch.linalg.lu_factor, jax.scipy.linalg.lu, and scipy's lu_factor. For GPU, cuSOLVER implements batched blocked LU for thousands of small systems simultaneously - critical for K-FAC training.

3.9 Solving Ax = b via LU

Given the LU factorization PA = LU, solving A\mathbf{x} = \mathbf{b} proceeds in three steps, plus an optional refinement:

  1. Apply permutation: \mathbf{c} = P\mathbf{b} (O(n), just index lookups)
  2. Forward solve: solve L\mathbf{y} = \mathbf{c} (O(n^2), forward substitution)
  3. Backward solve: solve U\mathbf{x} = \mathbf{y} (O(n^2), backward substitution)
  4. (Optional) Iterative refinement: compute \mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}, solve for the correction \delta\mathbf{x}, update

The total cost after factorization is 2n^2 flops - negligible compared to the \frac{2n^3}{3} factorization cost for large n.

Multiple right-hand sides: For k different vectors \mathbf{b}_1, \ldots, \mathbf{b}_k, one factorization costing \frac{2n^3}{3} is followed by k triangular solves, each costing 2n^2. This is the key economic argument for LU: amortize the factorization cost over many solves.

Computing the determinant: \det(A) = \det(P^{-1})\det(L)\det(U) = (-1)^s \prod_{i=1}^n U_{ii}, where s is the number of row swaps. Since L is unit lower triangular, \det(L) = 1.

Computing A^{-1}: Solve A\mathbf{x}^{(k)} = \mathbf{e}_k for k = 1, \ldots, n (one factorization, n triangular solves). But explicitly forming A^{-1} is almost always unnecessary and should be avoided - solve the system directly instead.
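
One way to recover the determinant from SciPy's factored form, as a hedged sketch (the swap count is read off the pivot vector returned by lu_factor):

import numpy as np
from scipy.linalg import lu_factor

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))

lu, piv = lu_factor(A)                          # piv[i] = row exchanged with row i
swaps = int(np.sum(piv != np.arange(6)))        # number of actual interchanges
det_from_lu = (-1.0) ** swaps * np.prod(np.diag(lu))
print(np.isclose(det_from_lu, np.linalg.det(A)))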

3.10 Rank-Deficient LU

When A is singular or numerically rank-deficient, the LU factorization (even with pivoting) produces a U with some diagonal entries near zero. The factorization still "works" mechanically, but it produces U_{kk} \approx 0 at the position k corresponding to a dependent column.

Signature of rank deficiency:

\operatorname{rank}(A) = r \iff U_{11}, \ldots, U_{rr} \text{ are "large" and } U_{r+1,r+1}, \ldots, U_{nn} \approx 0

Problem: With partial pivoting, the threshold for "approximately zero" is ambiguous. If |U_{kk}| / |U_{11}| < \varepsilon_{\text{tol}}, we declare the matrix to have numerical rank < k. But the choice of \varepsilon_{\text{tol}} is problem-dependent.

Rank-revealing LU: Column-pivoted LU (complete or rook pivoting) provides better rank revelation: the diagonal entries of U decay in a manner that reflects the true rank structure.

Better alternative for rank-deficient problems: Use rank-revealing QR (RRQR, Section 4.6) or the SVD. The SVD provides the definitive rank determination via singular value thresholding - but costs O(n^3), just as LU does. RRQR provides a cheaper, nearly-as-reliable alternative.

For AI: Neural network weight matrices often have approximately low rank (Hu et al., 2022, argue that fine-tuning weight updates have low intrinsic rank). Rank-deficient LU can detect this structure, but RRQR or randomized SVD are preferred in practice.


4. QR Factorization: Computational Algorithms

4.1 From Gram-Schmidt to Algorithms

Recall from 05: The QR factorization A = QR of an m \times n matrix (m \geq n) decomposes A into an orthonormal factor Q \in \mathbb{R}^{m \times n} (or m \times m for the full QR) and an upper triangular factor R \in \mathbb{R}^{n \times n}. This was constructed via Gram-Schmidt orthogonalization in 05.

The problem with classical Gram-Schmidt: Classical Gram-Schmidt (CGS) is mathematically correct but numerically unstable for ill-conditioned matrices. The computed Q can lose orthogonality catastrophically: for a matrix with condition number 10^8, the computed Q from CGS may have \|Q^\top Q - I\|_F \approx 10^8 \cdot u \approx 10^{-8}, causing significant errors in subsequent computations.

Modified Gram-Schmidt (MGS): Reorders the orthogonalization to reduce error accumulation. MGS is numerically equivalent to classical QR of a slightly perturbed matrix, giving \|Q^\top Q - I\|_F = O(\kappa(A) \cdot u). Still insufficient for highly ill-conditioned problems.

Householder and Givens: Both achieve \|Q^\top Q - I\|_F = O(u) regardless of \kappa(A), making them the preferred algorithms for production use. The key is that they build Q as a product of exactly orthogonal elementary transformations (reflectors or rotations), never losing orthogonality during construction.

For AI: numpy.linalg.qr and scipy.linalg.qr use LAPACK DGEQRF (Householder QR). torch.linalg.qr similarly calls cuSOLVER which uses Householder on GPU. Understanding the algorithmic basis helps you interpret condition number warnings and numerical precision behavior.

4.2 Householder Reflections

Definition. A Householder reflector (or Householder transformation) is a matrix of the form:

H = I - 2\frac{\mathbf{v}\mathbf{v}^\top}{\mathbf{v}^\top\mathbf{v}} = I - \frac{2}{\|\mathbf{v}\|_2^2}\mathbf{v}\mathbf{v}^\top

where \mathbf{v} \in \mathbb{R}^m is a nonzero Householder vector. The matrix H is symmetric (H = H^\top) and orthogonal (H^\top H = H^2 = I), so H^{-1} = H - it is its own inverse.

Geometric interpretation: H reflects vectors across the hyperplane perpendicular to \mathbf{v}. Every point \mathbf{x} in the hyperplane satisfies H\mathbf{x} = \mathbf{x} (invariant). Every point c\mathbf{v} along the \mathbf{v} direction satisfies H(c\mathbf{v}) = -c\mathbf{v} (negated).

Key property: Given any vector \mathbf{a} \in \mathbb{R}^m, there exists a Householder reflector H such that H\mathbf{a} = -\operatorname{sign}(a_1)\|\mathbf{a}\|_2\, \mathbf{e}_1. That is, H maps \mathbf{a} to a multiple of the first standard basis vector.

Construction (numerical form): Given \mathbf{a} \in \mathbb{R}^m to be zeroed below its first entry:

\alpha = -\operatorname{sign}(a_1)\|\mathbf{a}\|_2, \qquad \mathbf{v} = \mathbf{a} - \alpha\mathbf{e}_1 = \begin{pmatrix} a_1 - \alpha \\ a_2 \\ \vdots \\ a_m \end{pmatrix}, \qquad H = I - \frac{2}{\|\mathbf{v}\|_2^2}\mathbf{v}\mathbf{v}^\top, \quad H\mathbf{a} = \alpha\mathbf{e}_1

The sign convention (critical for stability): We choose \alpha = -\operatorname{sign}(a_1)\|\mathbf{a}\|_2 to avoid cancellation in computing v_1 = a_1 - \alpha. If a_1 > 0, choosing \alpha = -\|\mathbf{a}\|_2 gives v_1 = a_1 + \|\mathbf{a}\|_2 > a_1, avoiding the catastrophic subtraction that would occur with \alpha = +\|\mathbf{a}\|_2.

Cost of applying H: Computing H\mathbf{x} directly costs O(m^2) (matrix-vector product), but using the formula H\mathbf{x} = \mathbf{x} - 2\mathbf{v}(\mathbf{v}^\top\mathbf{x})/(\mathbf{v}^\top\mathbf{v}) costs only O(m) - first compute the scalar s = \mathbf{v}^\top\mathbf{x}, then update \mathbf{x} \leftarrow \mathbf{x} - (2s/\|\mathbf{v}\|^2)\mathbf{v}. This is the implicit representation of H.
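
A short NumPy check of the construction above (the test vector is arbitrary and householder_vector is a hypothetical helper name):

import numpy as np

def householder_vector(a):
    """Return (v, alpha) with H = I - 2 v v^T / (v^T v) mapping a to alpha * e1."""
    norm_a = np.linalg.norm(a)
    alpha = -np.sign(a[0]) * norm_a if a[0] != 0 else -norm_a
    v = a.astype(float).copy()
    v[0] -= alpha
    return v, alpha

a = np.array([3.0, 1.0, 2.0])
v, alpha = householder_vector(a)
H = np.eye(3) - 2.0 * np.outer(v, v) / (v @ v)
print(H @ a)                           # -> [alpha, 0, 0]
print(np.allclose(H @ H, np.eye(3)))   # H is its own inverse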

4.3 Householder QR Algorithm

Algorithm: Apply Householder reflectors successively to zero out subdiagonal entries of each column.

for k = 1 to n:                                # n columns
    # Construct H_k to zero out A[k+1:m, k]
    a = A[k:m, k]                              # current column
    alpha = -sign(a[0]) * norm(a)
    v = a.copy(); v[0] -= alpha
    v /= norm(v)                               # normalized Householder vector
    # Apply H_k = I - 2 v v^T implicitly to A[k:m, k:n]
    A[k:m, k:n] -= 2 * outer(v, v.T @ A[k:m, k:n])
    # Store v below the diagonal (simplified form of the LAPACK convention,
    # which instead scales v so that v[0] = 1 and keeps a separate scalar tau)
    A[k+1:m, k] = v[1:]

After n steps, A has been overwritten with R in the upper triangle and the Householder vectors in the lower triangle (the implicit storage of Q).

Implicit Q representation: The full Q matrix is Q = H_1 H_2 \cdots H_n and has size m \times m. Forming Q explicitly costs O(mn^2) additional work and O(mn) storage. LAPACK's DORGQR computes the explicit Q when needed (e.g., for the thin QR, only the first n columns of Q are needed).

Complexity:

  • Applying H_k to the trailing submatrix: O((m-k)(n-k)) per step
  • Total: \sum_{k=1}^{n} O((m-k)(n-k)) \approx 2mn^2 - \frac{2n^3}{3} flops
  • For square m = n: \frac{4n^3}{3} flops (vs \frac{2n^3}{3} for LU)

Numerical stability: Householder QR is backward stable. The computed \hat{Q}, \hat{R} satisfy:

A + E = \hat{Q}\hat{R}, \quad \|E\|_F \leq c_{mn} u \|A\|_F

with c_{mn} a modest polynomial constant and u machine epsilon. The orthogonality error satisfies \|\hat{Q}^\top\hat{Q} - I\|_F = O(u), independent of \kappa(A).

LAPACK: DGEQRF computes the Householder QR in blocked form (block Householder, using the WY representation). DORMQR applies Q or Q^\top to a matrix without forming Q explicitly.

For AI: The Householder algorithm is used in PyTorch's torch.linalg.qr, in scipy's linalg.qr, and in LAPACK DGEQRF. It is the foundation for numerically reliable QR-based least-squares solvers used in linear probing, weight matrix analysis, and second-order methods.

4.4 Givens Rotations

Definition. A Givens rotation G(i,j,\theta) \in \mathbb{R}^{n \times n} is the identity matrix with a 2 \times 2 rotation embedded in rows and columns i and j:

G(i,j,\theta) = \begin{pmatrix} 1 & & & & \\ & c & \cdots & s & \\ & \vdots & \ddots & \vdots & \\ & -s & \cdots & c & \\ & & & & 1 \end{pmatrix}

with c = \cos\theta and s = \sin\theta placed so that c sits in positions (i,i) and (j,j), s in position (i,j), and -s in position (j,i).

Key property: G(i,j,\theta) is orthogonal. Left-multiplying \mathbf{x} by G rotates the (x_i, x_j) plane by angle \theta.

Zeroing a specific entry: Given \mathbf{x} with entries x_i and x_j, choose:

r = \sqrt{x_i^2 + x_j^2}, \quad c = x_i/r, \quad s = x_j/r

Then (G\mathbf{x})_i = r and (G\mathbf{x})_j = 0 - the rotation zeros out x_j while folding its magnitude into the new i-th entry r.

Numerically stable computation (LAPACK DLARTG):

if x_j == 0:
    c = 1, s = 0, r = x_i
elif |x_j| > |x_i|:
    t = x_i / x_j; s = 1/sqrt(1+t^2); c = s*t; r = x_j/s
else:
    t = x_j / x_i; c = 1/sqrt(1+t^2); s = c*t; r = x_i/c

This avoids overflow and underflow in computing \sqrt{x_i^2 + x_j^2}.

Cost: Applying G(i,j,\theta) on the left touches only rows i and j, costing about 6 flops per affected column (roughly 6n for an m \times n matrix). Constructing G costs O(1).
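
A tiny NumPy sketch of constructing and applying one rotation (givens is a hypothetical helper, following the sign convention above):

import numpy as np

def givens(xi, xj):
    """Return c, s such that [[c, s], [-s, c]] @ [xi, xj] = [r, 0]."""
    r = np.hypot(xi, xj)
    if r == 0.0:
        return 1.0, 0.0
    return xi / r, xj / r

x = np.array([3.0, 4.0])
c, s = givens(x[0], x[1])
G = np.array([[c, s], [-s, c]])
print(G @ x)                             # -> [5., 0.]
print(np.allclose(G @ G.T, np.eye(2)))   # rotations are orthogonal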

For AI: Givens rotations appear in updating QR factorizations when a new row is appended (streaming QR), which is used in online learning and in updating Cholesky factors after rank-1 additions (Cholesky rank-1 update, relevant to incremental GP regression).

4.5 Givens QR and Sparse Matrices

Algorithm: Apply Givens rotations sequentially to zero out the subdiagonal entries of A column by column. To zero A_{ij} (i > j), apply G(j, i, \theta) on the left.

For a dense m \times n matrix, the total number of Givens rotations needed is mn - n(n+1)/2 \approx mn, each costing O(n) flops (only two rows are touched) - total O(mn^2), roughly 50% more arithmetic than Householder QR for dense matrices.

Advantage for sparse matrices: Each Givens rotation touches exactly two rows and two columns. If A has a known sparsity pattern, Givens rotations can be sequenced to minimize fill-in (new nonzeros created by the transformation). In contrast, a single Householder reflector H_k is a rank-1 update that touches an entire panel, potentially creating dense fill.

Banded matrices: For a banded matrix with bandwidth b, Givens QR costs O(nb^2) instead of O(nb^2 + n^2 b) for Householder, and the resulting R remains banded. This is critical for signal processing and finite-element computations.

Online/streaming QR: When a new row \mathbf{a}^\top is appended to A, the existing QR factorization can be updated using a single sequence of n Givens rotations, costing O(n^2) rather than recomputing from scratch in O(mn^2).

For AI: Streaming QR via Givens is used in online learning settings where data arrives sequentially. It also underlies incremental SVD algorithms used for streaming PCA (e.g., in continual learning, where new tasks arrive without access to past data).

4.6 Column-Pivoted QR (RRQR)

Motivation: Standard QR doesn't reveal rank. The diagonal entries of R decay, but not necessarily in a way that makes rank determination reliable. Column-pivoted QR (RRQR) reorders the columns of A to expose rank structure in R.

Algorithm: At step k, before computing the k-th Householder reflector, swap column k with the column j \geq k having maximum 2-norm among the remaining columns. This produces:

AP = QR

where P is a permutation and R has the property that |R_{kk}| \geq |R_{k+1,k+1}| \geq \cdots \geq |R_{nn}|.

Rank estimation: If A has numerical rank r, then |R_{rr}| / |R_{r+1,r+1}| is large (theoretically \geq \sigma_r / \sigma_{r+1}, practically much larger). The threshold \tau for declaring rank is:

r = \max\{k : |R_{kk}| > \tau \cdot |R_{11}|\}

RRQR theorem (Golub 1965, Chandrasekaran & Ipsen 1994): The strong RRQR algorithm guarantees:

\sigma_k(A)^2 \leq \sigma_k(R_{11})^2 \cdot \left(1 + f(k, n-k, r)\right), \quad f = O(n)

ensuring that R_{11} (the leading r \times r block) captures essentially all of the energy of the matrix.

Connection to SVD: RRQR is a cheap (O(mn^2)) approximation to the SVD. It provides the same rank information and a good basis for the column space, but singular values only approximately. For exact singular values, use the SVD at O(mn^2 + n^3) cost.
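
A hedged sketch of rank estimation via SciPy's column-pivoted QR (the synthetic rank-10 matrix, the noise level, and the tolerance are all illustrative choices):

import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
# Synthetic 200 x 50 matrix of numerical rank 10, plus tiny noise
A = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 50))
A += 1e-10 * rng.standard_normal(A.shape)

R, piv = qr(A, mode='r', pivoting=True)       # column-pivoted QR, R factor only
d = np.abs(np.diag(R))
tol = 1e-8
rank_est = int(np.sum(d > tol * d[0]))
print(rank_est)                               # expect 10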

For AI:

  • LoRA (Hu et al., 2022): Low-Rank Adaptation uses rank-r decompositions of weight updates \Delta W = BA. Choosing r requires rank estimation - RRQR is one approach.
  • DoRA (Liu et al., 2024): Decomposes weight matrices into magnitude and direction; RRQR is used to identify the principal components.
  • Intrinsic dimensionality: RRQR-based rank estimation reveals the intrinsic dimensionality of weight matrices, guiding compression decisions.

4.7 Tall-Skinny QR (TSQR)

Motivation: For matrices A \in \mathbb{R}^{m \times n} with m \gg n (tall and skinny - e.g., m = 10^9, n = 100), standard Householder QR communicates O(n^2) data between levels of the memory hierarchy at each of n steps, totaling O(n^3) words of communication. For distributed or GPU computation, this is the bottleneck.

TSQR algorithm (Demmel et al., 2008):

  1. Local factorization: Partition A into P panels A_1, \ldots, A_P (one per processor/GPU block).
  2. Local QR: Factor each A_i = Q_i R_i independently (no communication).
  3. Reduction tree: Stack pairs of local factors, \begin{pmatrix} R_1 \\ R_2 \end{pmatrix}, and factor the stack by QR to get R_{12}. Repeat up the tree.
  4. Result: R is the final upper triangular factor. Q can be recovered by traversing the reduction tree backward.

Communication cost: TSQR communicates O(n^2 \log P) words (vs O(mn \cdot n/b) for blocked Householder), achieving communication optimality.

For AI:

  • Distributed training: TSQR is used in distributed least-squares and in computing the QR of gradient matrices across multiple GPUs.
  • Randomized methods: TSQR underlies the "sketch-then-QR" approach for randomized low-rank factorization.
  • Attention-efficient computation: The block-tile structure of FlashAttention-3 (Shah et al., 2024) is inspired by the communication-optimal blocking patterns of TSQR.

4.8 QR for Least Squares

Problem: Given A \in \mathbb{R}^{m \times n} with m > n (overdetermined) and \mathbf{b} \in \mathbb{R}^m, find \mathbf{x}^* minimizing \|A\mathbf{x} - \mathbf{b}\|_2^2.

Method 1 (Normal equations): \mathbf{x}^* = (A^\top A)^{-1} A^\top \mathbf{b}. Form A^\top A, factor via Cholesky, solve. Cost: O(mn^2 + n^3). Problem: \kappa(A^\top A) = \kappa(A)^2 - squaring the condition number doubles the digits lost.

Method 2 (QR): Factor A = QR (thin QR), then:

\|A\mathbf{x} - \mathbf{b}\|_2^2 = \|QR\mathbf{x} - \mathbf{b}\|_2^2 = \|R\mathbf{x} - Q^\top\mathbf{b}\|_2^2 + \|(I - QQ^\top)\mathbf{b}\|_2^2

Minimizing over \mathbf{x}: solve R\mathbf{x} = Q^\top\mathbf{b} (upper triangular, backward substitution). Cost: O(mn^2) for the QR, O(mn) for Q^\top\mathbf{b}, O(n^2) for the backsolve. The condition number of R is \kappa(A), not \kappa(A)^2.
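
A minimal NumPy/SciPy sketch of Method 2, checked against the library least-squares solver (sizes are arbitrary):

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

Q, R = np.linalg.qr(A)                            # thin QR: Q is 100 x 5, R is 5 x 5
x_qr = solve_triangular(R, Q.T @ b, lower=False)  # backward substitution
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_qr, x_ref))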

Stability comparison:

Method | Condition amplification | Cost | When to use
Normal equations + Cholesky | \kappa(A)^2 | O(mn^2 + n^3) | Well-conditioned A, m \gg n
QR (Householder) | \kappa(A) | O(mn^2) | General purpose, numerically safe
RRQR | \kappa(A), reveals rank | O(mn^2) | Rank-deficient or near-singular A
SVD (truncated) | None (sets small SVs to zero) | O(mn^2 + n^3) | Ill-conditioned, rank-deficient

For AI:

  • Linear probing: Fitting a linear classifier on top of frozen features is an overdetermined least-squares problem. sklearn.linear_model.LinearRegression uses LAPACK DGELSD (divide-and-conquer SVD) or DGELSY (column-pivoted QR).
  • Ridge regression: \min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 + \lambda\|\mathbf{x}\|_2^2 is equivalent to the augmented least-squares system \begin{pmatrix} A \\ \sqrt{\lambda} I \end{pmatrix} \mathbf{x} = \begin{pmatrix} \mathbf{b} \\ \mathbf{0} \end{pmatrix}, which QR solves stably.
  • Attention weight regression: Computing attention pattern regression (e.g., for mechanistic interpretability) uses QR-based least squares.

5. Cholesky Factorization (Computational Recap)

5.1 Brief Overview

Full treatment: The complete theory of Cholesky factorization - existence proofs, LDL^T, modified Cholesky, log-determinant, connection to PSD cone - is in 07: Positive Definite Matrices. This section covers only the computational aspects not covered in 07: the relationship to LU, blocked algorithms, and LAPACK routines.

For a symmetric positive definite matrix A \succ 0, the Cholesky factorization is:

A = LL^\top

where L is lower triangular with positive diagonal entries L_{ii} > 0. The factorization exists and is unique for every A \succ 0 (Theorem from 07).

Why Cholesky for SPD systems:

  1. Efficiency: Cholesky costs \frac{n^3}{3} flops - exactly half of LU's \frac{2n^3}{3} - because symmetry halves the work.
  2. Storage: Only the lower triangle needs to be stored (n(n+1)/2 entries vs n^2 for LU).
  3. No pivoting needed: Positive definiteness guarantees L_{ii} > 0 at every step without any row interchanges.
  4. Stability: Cholesky is backward stable for SPD matrices without any pivoting.

5.2 Cholesky as Specialized LU

The connection between Cholesky and LU: for a symmetric positive definite A, the LU factorization (without pivoting, which exists since all leading principal minors of A are positive) gives A = LU. But A = A^\top implies LU = U^\top L^\top, so U = DL^\top where D = \operatorname{diag}(U_{11}, \ldots, U_{nn}). Thus A = LDL^\top (the LDL^T form). Since A \succ 0, all diagonal entries D_{ii} > 0, and we can define \tilde{L} = LD^{1/2} to get A = \tilde{L}\tilde{L}^\top - the Cholesky factorization.

Cholesky algorithm (jth column):

L_{jj} = \sqrt{A_{jj} - \sum_{k=1}^{j-1} L_{jk}^2}, \qquad L_{ij} = \frac{1}{L_{jj}}\left(A_{ij} - \sum_{k=1}^{j-1} L_{ik}L_{jk}\right), \quad i = j+1, \ldots, n

The argument of the square root must be positive - this is guaranteed by positive definiteness.
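
The column formulas translate directly into code; a minimal unblocked sketch, verified against NumPy (the SPD test matrix is constructed ad hoc):

import numpy as np

def cholesky_unblocked(A):
    """Column-by-column Cholesky: returns lower triangular L with A = L @ L.T."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(n):
        s = A[j, j] - L[j, :j] @ L[j, :j]
        if s <= 0.0:
            raise np.linalg.LinAlgError("matrix is not positive definite")
        L[j, j] = np.sqrt(s)
        L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]
    return L

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)          # symmetric positive definite by construction
print(np.allclose(cholesky_unblocked(A), np.linalg.cholesky(A)))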

Cost analysis:

  • Column j costs: 1 square root + O(j) flops for L_{jj}, plus (n-j) \cdot O(j) flops for the L_{ij}
  • Total: \sum_{j=1}^{n} O(nj) = O(n^3/3), exactly half of LU

5.3 LDL^T for Indefinite Systems

For symmetric indefinite matrices (not necessarily PD), Cholesky cannot be applied directly (negative square roots). The LDL^T factorization computes:

PAP^\top = LDL^\top

where L is unit lower triangular, D is block diagonal with 1 \times 1 and 2 \times 2 blocks (to handle negative eigenvalues), and P is a permutation.

Bunch-Kaufman pivoting: The 2 \times 2 blocks in D capture pairs of eigenvalues of opposite sign, avoiding the square root altogether. The pivoting strategy (Bunch & Kaufman 1977) ensures |L_{ij}| \leq (1 + \sqrt{17})/8 \approx 0.64 - a stability bound analogous to |L_{ij}| \leq 1 for partial pivoting in LU.

Applications in ML:

  • Indefinite Hessians: At saddle points (common early in neural network training), the Hessian is indefinite. LDL^T factors it without the positive-definiteness requirement, enabling second-order descent directions even at saddle points.
  • Modified Cholesky: Adding a diagonal shift \delta I to make A + \delta I \succ 0 before Cholesky factorization is the standard approach in L-BFGS-B and quasi-Newton methods. See 07 for the modified Cholesky algorithm.

LAPACK routine: DSYTRF implements LDL^T with Bunch-Kaufman pivoting.
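SciPy exposes this route through scipy.linalg.ldl, which wraps the same LAPACK factorization; a quick sketch on a small indefinite matrix (note that SciPy folds the permutation into the returned factor, so L @ D @ L.T reproduces A directly):

import numpy as np
from scipy.linalg import ldl

# A symmetric indefinite matrix (it has eigenvalues of both signs)
A = np.array([[ 2.0,  1.0,  0.0],
              [ 1.0, -3.0,  2.0],
              [ 0.0,  2.0,  1.0]])

L, D, perm = ldl(A, lower=True)        # Bunch-Kaufman LDL^T
print(np.allclose(L @ D @ L.T, A))     # permutation is embedded in L
print(np.round(D, 3))                  # block diagonal: 1x1 and/or 2x2 blocks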

5.4 Blocked Cholesky and LAPACK dpotrf

Blocked Cholesky follows the same blocked structure as blocked LU. For block size b:

for k = 0 to n/b - 1:
    # Factor the diagonal block
    L_kk = cholesky(A[kb:(k+1)b, kb:(k+1)b])   # O(b^3)
    # Solve for the panel below: L_below @ L_kk^T = A[(k+1)b:n, kb:(k+1)b]   (DTRSM)
    L_below = solve_triangular(L_kk, A[(k+1)b:n, kb:(k+1)b].T, lower=True).T
    # Update trailing submatrix (symmetric rank-b update: DSYRK)
    A[(k+1)b:n, (k+1)b:n] -= L_below @ L_below.T

The trailing submatrix update (A -= L_below @ L_below.T) is a symmetric rank-b update, computed by the BLAS routine DSYRK (symmetric rank-k update), which exploits symmetry to halve the work relative to DGEMM.

LAPACK routine DPOTRF: The production Cholesky routine. On entry: the upper or lower triangle of A. On exit: L (or U for upper Cholesky). Returns an error flag INFO (0 for success; k > 0 if the k-th leading minor is not positive definite).
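
The INFO flag is visible from Python through SciPy's low-level LAPACK wrappers; a small sketch:

import numpy as np
from scipy.linalg.lapack import dpotrf

A_spd = np.array([[4.0, 2.0],
                  [2.0, 3.0]])
c, info = dpotrf(A_spd, lower=1)
print(info)        # 0: success, c holds the Cholesky factor in its lower triangle

A_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])
c, info = dpotrf(A_indef, lower=1)
print(info)        # 2: the 2nd leading minor is not positive definite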

For AI: DPOTRF is called by:

  • scipy.linalg.cholesky (wraps LAPACK)
  • numpy.linalg.cholesky (wraps LAPACK)
  • torch.linalg.cholesky (calls cuSOLVER cusolverDnDpotrf on GPU)
  • Every GP regression library (GPyTorch, GPflow, etc.) for computing K^{-1}\mathbf{y} and \log\det K
