Appendix K: Matrix Decompositions in the 2026 AI Landscape
K.1 Inference-Time Decompositions
As language models move toward longer context (1M+ tokens), the attention computation's quadratic cost has become a primary bottleneck. Several recent methods use matrix decompositions at inference time:
PagedAttention (Kwon et al., 2023): Manages KV cache memory using block-based paging inspired by virtual memory. The KV cache for a sequence of $t$ tokens is a large matrix (roughly $t \times n_{\text{heads}} d_{\text{head}}$ per layer, for keys and values); PagedAttention tiles this into fixed-size "pages" and maintains an LU-like decomposition of the free-list.
MLA (Multi-head Latent Attention, DeepSeek, 2024): Compresses the KV cache via low-rank projection: keys and values are reconstructed as $[k_t; v_t] \approx W_{\text{up}}\, c_t$, where $c_t \in \mathbb{R}^{d_c}$ (with $d_c \ll n_{\text{heads}} d_{\text{head}}$) is a compressed latent and $W_{\text{up}}$ is a fixed projection matrix. The decomposition $W_{\text{up}} = QR$ (QR of the projection matrix) is used to initialize and orthogonalize the projection basis; a minimal sketch of this pattern appears at the end of this subsection.
Speculative decoding with draft models: Uses LU-based factored attention matrices to maintain a draft distribution for speculative sampling.
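As a rough illustration of the low-rank KV idea above, the sketch below caches only a small latent per token and reconstructs keys and values on demand. It is a minimal sketch with made-up dimensions and weight names (`W_down`, `W_up_k`, `W_up_v`), not DeepSeek's actual implementation:

```python
import numpy as np

# Hypothetical sizes: hidden dim d, latent dim d_c << d, n_h heads of dim d_h each.
d, d_c, n_h, d_h = 4096, 512, 32, 128
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_c, d)) / np.sqrt(d)              # hidden state -> latent
W_up_k = rng.standard_normal((n_h * d_h, d_c)) / np.sqrt(d_c)    # latent -> keys
W_up_v = rng.standard_normal((n_h * d_h, d_c)) / np.sqrt(d_c)    # latent -> values

# Orthonormalize the up-projection basis via QR, as the text suggests for initialization.
W_up_k, _ = np.linalg.qr(W_up_k)        # same column span, orthonormal columns

def cache_token(h):
    """Store only the d_c-dimensional latent for this token."""
    return W_down @ h                    # shape (d_c,)

def reconstruct_kv(latent):
    """Rebuild full keys/values from the cached latent when attention needs them."""
    return W_up_k @ latent, W_up_v @ latent

h_t = rng.standard_normal(d)
c_t = cache_token(h_t)                   # memory per token: d_c instead of 2 * n_h * d_h
k_t, v_t = reconstruct_kv(c_t)
print(c_t.shape, k_t.shape, v_t.shape)   # (512,) (4096,) (4096,)
```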
K.2 Training-Time Decompositions
GaLore (Zhao et al., 2024): Gradient Low-Rank Projection. For a weight matrix $W \in \mathbb{R}^{m \times n}$, the gradient $G_t$ is projected onto a rank-$r$ subspace via:
$R_t = G_t Q_t$, where $Q_t \in \mathbb{R}^{n \times r}$ comes from the top-$r$ right singular vectors of $G_t$ (randomized SVD). Adam is applied to $R_t$ (memory: $O(mr)$ instead of $O(mn)$), then the update is lifted back: $\tilde{G}_t = \tilde{R}_t Q_t^{\top}$. The SVD is recomputed every $T$ steps; a minimal sketch appears at the end of this subsection.
Fira (Chen et al., 2024): Extends GaLore with a closed-form gradient approximation that avoids the SVD step entirely, using a QR-based approximation to the projection.
SOAP (Vyas et al., 2024): Combines Shampoo's Kronecker-structured preconditioner with Adam's update rule. Each layer's preconditioner is maintained in Kronecker-factored form $P \approx L \otimes R$ (one factor per weight-matrix dimension), with iterative Cholesky-based Newton steps for the matrix square root.
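A minimal sketch of the GaLore-style projection described above, assuming the right-singular-vector variant and plain SGD in place of Adam (the sizes `m`, `n`, `r` and the refresh interval `T` are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 256, 1024, 32                  # weight shape m x n, projection rank r
W = rng.standard_normal((m, n)) * 0.02
lr, T = 1e-3, 200                        # T = steps between SVD refreshes

Q = None
for step in range(600):
    G = rng.standard_normal((m, n))      # stand-in for the true gradient dL/dW
    if step % T == 0:
        # Refresh the rank-r basis from the top right singular vectors of G.
        _, _, Vt = np.linalg.svd(G, full_matrices=False)
        Q = Vt[:r].T                     # n x r, orthonormal columns
    R = G @ Q                            # project: m x r (optimizer state lives here)
    # A real implementation keeps Adam moments of R; we apply a plain SGD step.
    W -= lr * (R @ Q.T)                  # lift the low-rank update back to m x n
```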
K.3 Model Compression via Decompositions
SliceGPT (Ashkboos et al., 2024): Compresses transformer weights by applying QR-based slice operations: a structured pruning where the weight matrix is decomposed as $W = QR$ and the orthogonal factor $Q$ is absorbed into the layer normalization.
ASVD (Yuan et al., 2023): Activation-aware SVD for LLM compression. Instead of computing the SVD of $W$ directly, computes the SVD of $WS$ (the weight matrix scaled by a matrix $S$ summarizing typical activation magnitudes), finding a better rank-$k$ approximation for the actual inference distribution. Uses randomized QR for efficiency.
QuIP (Chee et al., 2023) and QuIP# (Tseng et al., 2024): Incoherence-based quantization. A random orthogonal matrix $U$ (Hadamard-based) is applied to $W$ before quantization: $\tilde{W} = UW$. The orthogonal preprocessing ensures that no single weight has disproportionate magnitude, enabling aggressive quantization. The QR connection: $U$ is chosen as a randomized Hadamard transform, equivalent to the sketch matrix in randomized QR.
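The incoherence-preprocessing step can be sketched with a randomized Hadamard transform (a toy, one-sided version with arbitrary sizes; QuIP# applies structured transforms on both sides and pairs them with a more elaborate quantizer):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(2)
n = 256
# A weight matrix with one disproportionately large entry (an "outlier").
W = rng.standard_normal((n, n)) * 0.02
W[3, 7] = 5.0

# Randomized Hadamard transform: U = H @ diag(signs) / sqrt(n) is orthogonal.
signs = rng.choice([-1.0, 1.0], size=n)
U = hadamard(n) @ np.diag(signs) / np.sqrt(n)

W_tilde = U @ W                     # incoherence preprocessing (one-sided here)
print("max |W|     :", np.abs(W).max())        # dominated by the outlier
print("max |U @ W| :", np.abs(W_tilde).max())  # outlier energy spread across rows
# One would quantize W_tilde aggressively, then undo the transform: W ~ U.T @ quantize(W_tilde).
```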
K.4 Numerical Linear Algebra for AI Accelerators
Tensor Cores and Mixed Precision: NVIDIA A100/H100 Tensor Cores perform $D = AB + C$ where $A, B$ are FP16/BF16 and $C, D$ are FP32 accumulators, achieving far higher throughput in FP16 than FP64 (e.g., roughly 312 vs 19.5 TFLOPS on the A100). All modern factorization libraries exploit this:
- cuSOLVER RF (Iterative Refinement): FP16 factor + FP32 residual + FP32 correction - full FP32 accuracy at near-FP16 throughput
- cuBLAS Tensor Core GEMM: All blocked factorizations (LU, QR, Cholesky trailing updates) use TC-GEMM for the dominant computation
Intel AMX (Advanced Matrix Extensions): New Intel architecture extensions (Sapphire Rapids, 2023) add hardware matrix tile registers and tile-based GEMM instructions. BLIS and MKL exploit AMX for substantial speedups over AVX-512 on blocked factorizations.
AMD MI300X: 192 GB HBM3 + 5.3 TB/s bandwidth enables Cholesky factorization of matrices with $n$ on the order of $10^5$ entirely in GPU memory - sufficient for large GP regression without blocking across devices.
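The low-precision-factor / high-precision-residual pattern these libraries implement can be simulated on the CPU (a minimal sketch, with FP32 standing in for FP16 tensor-core factors):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(3)
n = 500
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true

# Factor once in low precision (FP32 here stands in for FP16 tensor-core factors).
lu32, piv32 = lu_factor(A.astype(np.float32))

# Initial low-precision solve, then refine with high-precision residuals.
x = lu_solve((lu32, piv32), b.astype(np.float32)).astype(np.float64)
for it in range(3):
    r = b - A @ x                                              # residual in FP64
    d = lu_solve((lu32, piv32), r.astype(np.float32)).astype(np.float64)
    x = x + d                                                  # correction step
    print(it, np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```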
Appendix L: Summary of Key Results
For quick reference, here are the key theorems and facts from this section:
LU Factorization:
- Exists iff all leading principal minors $\det(A_{1:k,\,1:k}) \neq 0$ for $k = 1, \dots, n-1$.
- With partial pivoting: $PA = LU$; $|l_{ij}| \le 1$; cost $\tfrac{2}{3}n^3 + O(n^2)$ flops.
- Backward error: $(A + \Delta A)\hat{x} = b$ with $\|\Delta A\| \le c_n\, \rho_n\, u\, \|A\|$, where $\rho_n$ is the growth factor and $c_n$ a low-degree polynomial in $n$.
- Growth factor: $\rho_n \le 2^{n-1}$ (theoretical worst case), $\rho_n \approx n^{2/3}$ (typical for random matrices).
- Determinant: $\det(A) = (-1)^s \prod_i u_{ii}$, where $s$ = number of row swaps.
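The determinant identity above can be checked directly with SciPy's LU (a small sanity check, not part of the summary itself):

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 6))

P, L, U = lu(A)                          # SciPy convention: A = P @ L @ U
# det(P) is +1 or -1 depending on the parity of the row swaps.
det_from_lu = np.linalg.det(P) * np.prod(np.diag(U))
print(det_from_lu, np.linalg.det(A))     # the two values agree to rounding error
```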
QR Factorization:
- Always exists for any $A \in \mathbb{R}^{m \times n}$ with $m \ge n$.
- Householder QR: backward stable with $\|\Delta A\| \le c_{m,n}\, u\, \|A\|$ (with $c_{m,n}$ a modest polynomial in $m$ and $n$), independent of $\kappa(A)$.
- Cost: $2mn^2 - \tfrac{2}{3}n^3$ flops (vs $\tfrac{2}{3}n^3$ for LU when $m = n$).
- QR for least squares: accuracy loss proportional to $\kappa(A)$ (vs $\kappa(A)^2$ for the normal equations).
- RRQR: the diagonal of $R$ estimates singular values, $|r_{kk}| \approx \sigma_k$; reveals numerical rank.
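The rank-revealing behavior can be seen with SciPy's column-pivoted QR (a quick sketch; the matrix sizes and the 1e-6 threshold are arbitrary choices):

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(5)
# Build a 200 x 50 matrix of numerical rank 10, plus small noise.
A = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 50))
A += 1e-8 * rng.standard_normal((200, 50))

R = qr(A, mode='r', pivoting=True)[0]    # column-pivoted ("rank-revealing") QR
r_diag = np.abs(np.diag(R))
rank_est = int(np.sum(r_diag > 1e-6 * r_diag[0]))

s = np.linalg.svd(A, compute_uv=False)
print("diag(R):", r_diag[:12])           # drops sharply after the 10th entry
print("sigma  :", s[:12])                # pivoted-QR diagonal tracks singular values
print("estimated rank:", rank_est)       # 10
```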
Cholesky Factorization:
- Exists iff $A = A^{\top}$ and $x^{\top} A x > 0$ for all $x \neq 0$ (equivalently: all eigenvalues positive).
- Cost: $\tfrac{1}{3}n^3$ flops (half of LU); unconditionally backward stable without pivoting.
- Log-determinant: $\log\det(A) = 2\sum_i \log l_{ii}$ - exact, $O(n)$ after Cholesky.
- Gradient: $\partial \log\det(A)/\partial A = A^{-1}$ - computed in $O(n^3)$ via Cholesky solves (reusing the factor).
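Once the factor is available, the log-determinant is a one-liner, and linear solves reuse the same factor (a minimal sketch with an arbitrary SPD test matrix):

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve

rng = np.random.default_rng(6)
n = 300
X = rng.standard_normal((n, n))
A = X @ X.T + n * np.eye(n)                    # symmetric positive definite

L = cholesky(A, lower=True)                    # A = L @ L.T
logdet = 2.0 * np.sum(np.log(np.diag(L)))      # exact, O(n) given the factor
print(logdet, np.linalg.slogdet(A)[1])         # agree to rounding error

# Reusing the factor: solve A x = b without refactorizing.
b = rng.standard_normal(n)
x = cho_solve((L, True), b)
print(np.linalg.norm(A @ x - b))               # tiny residual
```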
Backward Stability:
- Forward error $\lesssim$ condition number $\times$ backward error (fundamental inequality; illustrated numerically at the end of this list).
- Backward stability hierarchy: Cholesky = Householder QR > LU with partial pivoting > LU without pivoting.
- Iterative refinement: recovers FP64 accuracy from an FP16 factorization in $O(1)$ iterations if $\kappa(A)\, u_{\text{FP16}} \ll 1$.
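The forward-error/backward-error relationship can be observed numerically (a small illustration, using the Hilbert matrix only as a convenient ill-conditioned example):

```python
import numpy as np
from scipy.linalg import hilbert, solve

n = 12
A = hilbert(n)                            # notoriously ill-conditioned
x_true = np.ones(n)
b = A @ x_true

x = solve(A, b)                           # LU with partial pivoting
backward = np.linalg.norm(A @ x - b) / (np.linalg.norm(A) * np.linalg.norm(x))
forward = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
kappa = np.linalg.cond(A)

print(f"kappa(A)       ~ {kappa:.1e}")
print(f"backward error ~ {backward:.1e}")   # near machine epsilon: LU is backward stable
print(f"forward error  ~ {forward:.1e}")    # roughly kappa * backward error
```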
Appendix M: Exercises - Further Problems
M.1 Theoretical Extensions
Problem M.1. Show that for any invertible $A \in \mathbb{R}^{n \times n}$, there exists a permutation matrix $P$ such that $PA$ has an LU factorization. (Hint: show that there always exists a row ordering such that all leading principal minors of $PA$ are nonzero.)
Problem M.2. Prove the identity $\det(A) = \prod_i u_{ii}$ using the LU factorization. Extend to show that $\det(PA) = (-1)^s \det(A)$ where $s$ is the number of transpositions in $P$.
Problem M.3. For $A$ with Cholesky factorization $A = LL^{\top}$, prove that the Schur complement of $A_{11}$ in $A$ equals $L_{22} L_{22}^{\top}$, where $L_{22}$ is the lower-right block of $L$. Conclude that the Schur complement of any SPD matrix is SPD.
Problem M.4. Show that the Householder reflector $H = I - 2\,vv^{\top}/\|v\|^2$ satisfies: (a) $H = H^{\top}$ (symmetric) (b) $H^{\top}H = I$ (orthogonal) (c) $\det(H) = -1$ (orientation-reversing) (d) $H^2 = I$ (involutory - its own inverse).
Problem M.5. Prove that QR with column pivoting finds the best rank-$k$ approximation in the following sense: if $A\Pi = QR$ with $R = \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}$ and $k$ is chosen so that $\|R_{22}\|_2 \le \epsilon$, then the approximation built from the leading block of $R$, $\hat{A} = Q_{:,1:k} \begin{pmatrix} R_{11} & R_{12} \end{pmatrix} \Pi^{\top}$, satisfies $\|A - \hat{A}\|_2 \le \epsilon$, and hence lies within $\epsilon$ of the optimal rank-$k$ error $\sigma_{k+1}(A)$.
M.2 Computational Challenges
Challenge M.6 (Batched Cholesky). Implement batched Cholesky factorization for a batch of $B$ matrices of size $n \times n$. Compare three implementations:
- Python loop over scipy.linalg.cholesky
- NumPy vectorized via np.vectorize
- torch.linalg.cholesky on CPU with a batch dimension
Measure time and throughput (matrices/second). Extrapolate to the K-FAC setting, where $B$ is the number of layers and $n$ the layer width.
Challenge M.7 (Iterative Refinement). Construct a matrix $A$ with condition number near the FP64 limit. Solve $Ax = b$ using: (a) direct LU in FP64: measure the relative error $\|\hat{x} - x\|/\|x\|$; (b) simulated FP32 LU + FP64 residual refinement (3 steps): measure the relative error again. Compare and explain the improvement using backward error theory.
Challenge M.8 (Rank-revealing QR in practice). Download a real dataset (e.g., the UCI Breast Cancer dataset, $n = 569$ samples, 30 features). Apply RRQR to the feature matrix $X$ and identify the effective rank (threshold on $|r_{kk}|/|r_{11}|$). Verify by comparing to the number of significant singular values of $X$.
References
- Golub, G. H. & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. - The definitive reference for all material in this section.
- Trefethen, L. N. & Bau, D. (1997). Numerical Linear Algebra. SIAM. - Exceptionally clear treatment of Householder QR and backward error analysis.
- Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). SIAM. - Comprehensive analysis of numerical stability; source for all error bounds cited here.
- Demmel, J. W. (1997). Applied Numerical Linear Algebra. SIAM. - Excellent treatment of LU pivoting strategies and condition estimation.
- Householder, A. S. (1958). Unitary triangularization of a nonsymmetric matrix. Journal of the ACM, 5(4), 339-342. - Original Householder reflector paper.
- Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion. Journal of the ACM, 8(3), 281-330. - Foundational backward error analysis for LU.
- Halko, N., Martinsson, P.-G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217-288. - Randomized QR/SVD framework.
- Martens, J. & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. ICML 2015. - K-FAC: Cholesky for Fisher information blocks.
- Hu, E. J. et al. (2022). LoRA: Low-rank adaptation of large language models. ICLR 2022. - Low-rank decomposition for fine-tuning.
- Zhao, J. et al. (2024). GaLore: Memory-efficient LLM training by gradient low-rank projection. ICML 2024. - Randomized SVD/QR for gradient compression.
- Bunch, J. R. & Kaufman, L. (1977). Some stable methods for calculating inertia and solving symmetric linear systems. Mathematics of Computation, 31(137), 163-179. - LDL^T with Bunch-Kaufman pivoting.
- Anderson, E. et al. (1999). LAPACK Users' Guide (3rd ed.). SIAM. - LAPACK routine reference and algorithmic details.
- Gardner, J. R. et al. (2018). GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. NeurIPS 2018. - CG-based GP inference for large-scale regression.
- Demmel, J., Grigori, L., Hoemmen, M., & Langou, J. (2008). Communication-optimal parallel and sequential QR and LU factorizations. SIAM Journal on Scientific Computing, 34(1), A206-A239. - TSQR and communication-avoiding algorithms.
- Shah, J. et al. (2024). FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. arXiv:2407.08608. - Block-tiled attention exploiting BLAS-3 structure.