Appendix K: Matrix Decompositions in the 2026 AI Landscape
K.1 Inference-Time Decompositions
As language models move toward longer context (1M+ tokens), the attention computation's quadratic cost has become a primary bottleneck. Several recent methods use matrix decompositions at inference time:
PagedAttention (Kwon et al., 2023): Manages KV cache memory using block-based paging inspired by virtual memory. The KV cache for a sequence of $t$ tokens is a large matrix (roughly $t \times n_{\text{heads}} d_{\text{head}}$ per layer, for keys and values); PagedAttention tiles this into fixed-size "pages" and maintains an LU-like decomposition of the free-list.
MLA (Multi-head Latent Attention, DeepSeek, 2024): Compresses the KV cache via low-rank projection: keys and values are reconstructed as $[k_t; v_t] \approx W_{\text{up}}\, c_t$, where $c_t \in \mathbb{R}^{d_c}$ (with $d_c \ll n_{\text{heads}} d_{\text{head}}$) is a compressed latent and $W_{\text{up}}$ is a fixed projection matrix. The decomposition $W_{\text{up}} = QR$ (QR of the projection matrix) is used to initialize and orthogonalize the projection basis; a minimal sketch of this pattern appears at the end of this subsection.
Speculative decoding with draft models: Uses LU-based factored attention matrices to maintain a draft distribution for speculative sampling.
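As a rough illustration of the low-rank KV idea above, the sketch below caches only a small latent per token and reconstructs keys and values on demand. It is a minimal sketch with made-up dimensions and weight names (`W_down`, `W_up_k`, `W_up_v`), not DeepSeek's actual implementation:

```python
import numpy as np

# Hypothetical sizes: hidden dim d, latent dim d_c << d, n_h heads of dim d_h each.
d, d_c, n_h, d_h = 4096, 512, 32, 128
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_c, d)) / np.sqrt(d)              # hidden state -> latent
W_up_k = rng.standard_normal((n_h * d_h, d_c)) / np.sqrt(d_c)    # latent -> keys
W_up_v = rng.standard_normal((n_h * d_h, d_c)) / np.sqrt(d_c)    # latent -> values

# Orthonormalize the up-projection basis via QR, as the text suggests for initialization.
W_up_k, _ = np.linalg.qr(W_up_k)        # same column span, orthonormal columns

def cache_token(h):
    """Store only the d_c-dimensional latent for this token."""
    return W_down @ h                    # shape (d_c,)

def reconstruct_kv(latent):
    """Rebuild full keys/values from the cached latent when attention needs them."""
    return W_up_k @ latent, W_up_v @ latent

h_t = rng.standard_normal(d)
c_t = cache_token(h_t)                   # memory per token: d_c instead of 2 * n_h * d_h
k_t, v_t = reconstruct_kv(c_t)
print(c_t.shape, k_t.shape, v_t.shape)   # (512,) (4096,) (4096,)
```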
K.2 Training-Time Decompositions
GaLore (Zhao et al., 2024): Gradient Low-Rank Projection. For a weight matrix $W \in \mathbb{R}^{m \times n}$, the gradient $G_t$ is projected onto a rank-$r$ subspace via:
$R_t = G_t Q_t$, where $Q_t \in \mathbb{R}^{n \times r}$ comes from the top-$r$ right singular vectors of $G_t$ (randomized SVD). Adam is applied to $R_t$ (memory: $O(mr)$ instead of $O(mn)$), then the update is lifted back: $\tilde{G}_t = \tilde{R}_t Q_t^{\top}$. The SVD is recomputed every $T$ steps; a minimal sketch appears at the end of this subsection.
Fira (Chen et al., 2024): Extends GaLore with a closed-form gradient approximation that avoids the SVD step entirely, using a QR-based approximation to the projection.
SOAP (Vyas et al., 2024): Combines Shampoo's Kronecker-structured preconditioner with Adam's update rule. Each layer's preconditioner is maintained in Kronecker-factored form $P \approx L \otimes R$ (one factor per weight-matrix dimension), with iterative Cholesky-based Newton steps for the matrix square root.
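A minimal sketch of the GaLore-style projection described above, assuming the right-singular-vector variant and plain SGD in place of Adam (the sizes `m`, `n`, `r` and the refresh interval `T` are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 256, 1024, 32                  # weight shape m x n, projection rank r
W = rng.standard_normal((m, n)) * 0.02
lr, T = 1e-3, 200                        # T = steps between SVD refreshes

Q = None
for step in range(600):
    G = rng.standard_normal((m, n))      # stand-in for the true gradient dL/dW
    if step % T == 0:
        # Refresh the rank-r basis from the top right singular vectors of G.
        _, _, Vt = np.linalg.svd(G, full_matrices=False)
        Q = Vt[:r].T                     # n x r, orthonormal columns
    R = G @ Q                            # project: m x r (optimizer state lives here)
    # A real implementation keeps Adam moments of R; we apply a plain SGD step.
    W -= lr * (R @ Q.T)                  # lift the low-rank update back to m x n
```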
K.3 Model Compression via Decompositions
SliceGPT (Ashkboos et al., 2024): Compresses transformer weights by applying QR-based slice operations: a structured pruning where the weight matrix is decomposed as $W = QR$ and the orthogonal factor $Q$ is absorbed into the layer normalization.
ASVD (Yuan et al., 2023): Activation-aware SVD for LLM compression. Instead of computing the SVD of $W$ directly, computes the SVD of $WS$ (the weight matrix scaled by a matrix $S$ summarizing typical activation magnitudes), finding a better rank-$k$ approximation for the actual inference distribution. Uses randomized QR for efficiency.
QuIP (Chee et al., 2023) and QuIP# (Tseng et al., 2024): Incoherence-based quantization. A random orthogonal matrix $U$ (Hadamard-based) is applied to $W$ before quantization: $\tilde{W} = UW$. The orthogonal preprocessing ensures that no single weight has disproportionate magnitude, enabling aggressive quantization. The QR connection: $U$ is chosen as a randomized Hadamard transform, equivalent to the sketch matrix in randomized QR.
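The incoherence-preprocessing step can be sketched with a randomized Hadamard transform (a toy, one-sided version with arbitrary sizes; QuIP# applies structured transforms on both sides and pairs them with a more elaborate quantizer):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(2)
n = 256
# A weight matrix with one disproportionately large entry (an "outlier").
W = rng.standard_normal((n, n)) * 0.02
W[3, 7] = 5.0

# Randomized Hadamard transform: U = H @ diag(signs) / sqrt(n) is orthogonal.
signs = rng.choice([-1.0, 1.0], size=n)
U = hadamard(n) @ np.diag(signs) / np.sqrt(n)

W_tilde = U @ W                     # incoherence preprocessing (one-sided here)
print("max |W|     :", np.abs(W).max())        # dominated by the outlier
print("max |U @ W| :", np.abs(W_tilde).max())  # outlier energy spread across rows
# One would quantize W_tilde aggressively, then undo the transform: W ~ U.T @ quantize(W_tilde).
```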
K.4 Numerical Linear Algebra for AI Accelerators
Tensor Cores and Mixed Precision: NVIDIA A100/H100 Tensor Cores perform $D = AB + C$ where $A, B$ are FP16/BF16 and $C, D$ are FP32 accumulators, achieving far higher throughput in FP16 than FP64 (e.g., roughly 312 vs 19.5 TFLOPS on the A100). All modern factorization libraries exploit this:
- cuSOLVER RF (Iterative Refinement): FP16 factor + FP32 residual + FP32 correction - full FP32 accuracy at near-FP16 throughput
- cuBLAS Tensor Core GEMM: All blocked factorizations (LU, QR, Cholesky trailing updates) use TC-GEMM for the dominant computation
Intel AMX (Advanced Matrix Extensions): New Intel architecture extensions (Sapphire Rapids, 2023) add hardware matrix tile registers and tile-based GEMM instructions. BLIS and MKL exploit AMX for substantial speedups over AVX-512 on blocked factorizations.
AMD MI300X: 192 GB HBM3 + 5.3 TB/s bandwidth enables Cholesky factorization of matrices with $n$ on the order of $10^5$ entirely in GPU memory - sufficient for large GP regression without blocking across devices.
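The low-precision-factor / high-precision-residual pattern these libraries implement can be simulated on the CPU (a minimal sketch, with FP32 standing in for FP16 tensor-core factors):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(3)
n = 500
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true

# Factor once in low precision (FP32 here stands in for FP16 tensor-core factors).
lu32, piv32 = lu_factor(A.astype(np.float32))

# Initial low-precision solve, then refine with high-precision residuals.
x = lu_solve((lu32, piv32), b.astype(np.float32)).astype(np.float64)
for it in range(3):
    r = b - A @ x                                              # residual in FP64
    d = lu_solve((lu32, piv32), r.astype(np.float32)).astype(np.float64)
    x = x + d                                                  # correction step
    print(it, np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```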
Appendix L: Summary of Key Results
For quick reference, here are the key theorems and facts from this section:
LU Factorization:
- Exists iff all leading principal minors $\det(A_{1:k,\,1:k}) \neq 0$ for $k = 1, \dots, n-1$.
- With partial pivoting: $PA = LU$; $|l_{ij}| \le 1$; cost $\tfrac{2}{3}n^3 + O(n^2)$ flops.
- Backward error: $(A + \Delta A)\hat{x} = b$ with $\|\Delta A\| \le c_n\, \rho_n\, u\, \|A\|$, where $\rho_n$ is the growth factor and $c_n$ a low-degree polynomial in $n$.
- Growth factor: $\rho_n \le 2^{n-1}$ (theoretical worst case), $\rho_n \approx n^{2/3}$ (typical for random matrices).
- Determinant: $\det(A) = (-1)^s \prod_i u_{ii}$, where $s$ = number of row swaps.
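The determinant identity above can be checked directly with SciPy's LU (a small sanity check, not part of the summary itself):

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 6))

P, L, U = lu(A)                          # SciPy convention: A = P @ L @ U
# det(P) is +1 or -1 depending on the parity of the row swaps.
det_from_lu = np.linalg.det(P) * np.prod(np.diag(U))
print(det_from_lu, np.linalg.det(A))     # the two values agree to rounding error
```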
QR Factorization:
- Always exists for any $A \in \mathbb{R}^{m \times n}$ with $m \ge n$.
- Householder QR: backward stable with $\|\Delta A\| \le c_{m,n}\, u\, \|A\|$ (with $c_{m,n}$ a modest polynomial in $m$ and $n$), independent of $\kappa(A)$.
- Cost: $2mn^2 - \tfrac{2}{3}n^3$ flops (vs $\tfrac{2}{3}n^3$ for LU when $m = n$).
- QR for least squares: accuracy loss proportional to $\kappa(A)$ (vs $\kappa(A)^2$ for the normal equations).
- RRQR: the diagonal of $R$ estimates singular values, $|r_{kk}| \approx \sigma_k$; reveals numerical rank.
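The rank-revealing behavior can be seen with SciPy's column-pivoted QR (a quick sketch; the matrix sizes and the 1e-6 threshold are arbitrary choices):

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(5)
# Build a 200 x 50 matrix of numerical rank 10, plus small noise.
A = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 50))
A += 1e-8 * rng.standard_normal((200, 50))

R = qr(A, mode='r', pivoting=True)[0]    # column-pivoted ("rank-revealing") QR
r_diag = np.abs(np.diag(R))
rank_est = int(np.sum(r_diag > 1e-6 * r_diag[0]))

s = np.linalg.svd(A, compute_uv=False)
print("diag(R):", r_diag[:12])           # drops sharply after the 10th entry
print("sigma  :", s[:12])                # pivoted-QR diagonal tracks singular values
print("estimated rank:", rank_est)       # 10
```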
Cholesky Factorization:
- Exists iff $A = A^{\top}$ and $x^{\top} A x > 0$ for all $x \neq 0$ (equivalently: all eigenvalues positive).
- Cost: $\tfrac{1}{3}n^3$ flops (half of LU); unconditionally backward stable without pivoting.
- Log-determinant: $\log\det(A) = 2\sum_i \log l_{ii}$ - exact, $O(n)$ after Cholesky.
- Gradient: $\partial \log\det(A)/\partial A = A^{-1}$ - computed in $O(n^3)$ via Cholesky solves (reusing the factor).
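Once the factor is available, the log-determinant is a one-liner, and linear solves reuse the same factor (a minimal sketch with an arbitrary SPD test matrix):

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve

rng = np.random.default_rng(6)
n = 300
X = rng.standard_normal((n, n))
A = X @ X.T + n * np.eye(n)                    # symmetric positive definite

L = cholesky(A, lower=True)                    # A = L @ L.T
logdet = 2.0 * np.sum(np.log(np.diag(L)))      # exact, O(n) given the factor
print(logdet, np.linalg.slogdet(A)[1])         # agree to rounding error

# Reusing the factor: solve A x = b without refactorizing.
b = rng.standard_normal(n)
x = cho_solve((L, True), b)
print(np.linalg.norm(A @ x - b))               # tiny residual
```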
Backward Stability:
- Forward error $\lesssim$ condition number $\times$ backward error (fundamental inequality; illustrated numerically at the end of this list).
- Backward stability hierarchy: Cholesky = Householder QR > LU with partial pivoting > LU without pivoting.
- Iterative refinement: recovers FP64 accuracy from an FP16 factorization in $O(1)$ iterations if $\kappa(A)\, u_{\text{FP16}} \ll 1$.
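The forward-error/backward-error relationship can be observed numerically (a small illustration, using the Hilbert matrix only as a convenient ill-conditioned example):

```python
import numpy as np
from scipy.linalg import hilbert, solve

n = 12
A = hilbert(n)                            # notoriously ill-conditioned
x_true = np.ones(n)
b = A @ x_true

x = solve(A, b)                           # LU with partial pivoting
backward = np.linalg.norm(A @ x - b) / (np.linalg.norm(A) * np.linalg.norm(x))
forward = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
kappa = np.linalg.cond(A)

print(f"kappa(A)       ~ {kappa:.1e}")
print(f"backward error ~ {backward:.1e}")   # near machine epsilon: LU is backward stable
print(f"forward error  ~ {forward:.1e}")    # roughly kappa * backward error
```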
Appendix M: Exercises - Further Problems
M.1 Theoretical Extensions
Problem M.1. Show that for any invertible $A \in \mathbb{R}^{n \times n}$, there exists a permutation matrix $P$ such that $PA$ has an LU factorization. (Hint: show that there always exists a row ordering such that all leading principal minors of $PA$ are nonzero.)
Problem M.2. Prove the identity $\det(A) = \prod_i u_{ii}$ using the LU factorization. Extend to show that $\det(PA) = (-1)^s \det(A)$ where $s$ is the number of transpositions in $P$.
Problem M.3. For $A$ with Cholesky factorization $A = LL^{\top}$, prove that the Schur complement of $A_{11}$ in $A$ equals $L_{22} L_{22}^{\top}$, where $L_{22}$ is the lower-right block of $L$. Conclude that the Schur complement of any SPD matrix is SPD.
Problem M.4. Show that the Householder reflector $H = I - 2\,vv^{\top}/\|v\|^2$ satisfies: (a) $H = H^{\top}$ (symmetric) (b) $H^{\top}H = I$ (orthogonal) (c) $\det(H) = -1$ (orientation-reversing) (d) $H^2 = I$ (involutory - its own inverse).
Problem M.5. Prove that QR with column pivoting finds the best rank-$k$ approximation in the following sense: if $A\Pi = QR$ with $R = \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}$ and $k$ is chosen so that $\|R_{22}\|_2 \le \epsilon$, then the approximation built from the leading block of $R$, $\hat{A} = Q_{:,1:k} \begin{pmatrix} R_{11} & R_{12} \end{pmatrix} \Pi^{\top}$, satisfies $\|A - \hat{A}\|_2 \le \epsilon$, and hence lies within $\epsilon$ of the optimal rank-$k$ error $\sigma_{k+1}(A)$.
M.2 Computational Challenges
Challenge M.6 (Batched Cholesky). Implement batched Cholesky factorization for a batch of $B$ matrices of size $n \times n$. Compare three implementations:
- Python loop over scipy.linalg.cholesky
- NumPy vectorized via np.vectorize
- torch.linalg.cholesky on CPU with a batch dimension
Measure time and throughput (matrices/second). Extrapolate to the K-FAC setting, where $B$ is the number of layers and $n$ the layer width.
Challenge M.7 (Iterative Refinement). Construct a matrix $A$ with condition number near the FP64 limit. Solve $Ax = b$ using: (a) direct LU in FP64: measure the relative error $\|\hat{x} - x\|/\|x\|$; (b) simulated FP32 LU + FP64 residual refinement (3 steps): measure the relative error again. Compare and explain the improvement using backward error theory.
Challenge M.8 (Rank-revealing QR in practice). Download a real dataset (e.g., the UCI Breast Cancer dataset, $n = 569$ samples, 30 features). Apply RRQR to the feature matrix $X$ and identify the effective rank (threshold on $|r_{kk}|/|r_{11}|$). Verify by comparing to the number of significant singular values of $X$.
References
- Golub, G. H. & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. - The definitive reference for all material in this section.
- Trefethen, L. N. & Bau, D. (1997). Numerical Linear Algebra. SIAM. - Exceptionally clear treatment of Householder QR and backward error analysis.
- Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). SIAM. - Comprehensive analysis of numerical stability; source for all error bounds cited here.
- Demmel, J. W. (1997). Applied Numerical Linear Algebra. SIAM. - Excellent treatment of LU pivoting strategies and condition estimation.
- Householder, A. S. (1958). Unitary triangularization of a nonsymmetric matrix. Journal of the ACM, 5(4), 339-342. - Original Householder reflector paper.
- Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion. Journal of the ACM, 8(3), 281-330. - Foundational backward error analysis for LU.
- Halko, N., Martinsson, P.-G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217-288. - Randomized QR/SVD framework.
- Martens, J. & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. ICML 2015. - K-FAC: Cholesky for Fisher information blocks.
- Hu, E. J. et al. (2022). LoRA: Low-rank adaptation of large language models. ICLR 2022. - Low-rank decomposition for fine-tuning.
- Zhao, J. et al. (2024). GaLore: Memory-efficient LLM training by gradient low-rank projection. ICML 2024. - Randomized SVD/QR for gradient compression.
- Bunch, J. R. & Kaufman, L. (1977). Some stable methods for calculating inertia and solving symmetric linear systems. Mathematics of Computation, 31(137), 163-179. - LDL^T with Bunch-Kaufman pivoting.
- Anderson, E. et al. (1999). LAPACK Users' Guide (3rd ed.). SIAM. - LAPACK routine reference and algorithmic details.
- Gardner, J. R. et al. (2018). GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. NeurIPS 2018. - CG-based GP inference for large-scale regression.
- Demmel, J., Grigori, L., Hoemmen, M., & Langou, J. (2008). Communication-optimal parallel and sequential QR and LU factorizations. SIAM Journal on Scientific Computing, 34(1), A206-A239. - TSQR and communication-avoiding algorithms.
- Shah, J. et al. (2024). FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. arXiv:2407.08608. - Block-tiled attention exploiting BLAS-3 structure.