Orthogonality and Orthonormality: Appendix A: Key Inequalities in Inner Product Spaces to Appendix K: Common Numerical Experiments

Appendix A: Key Inequalities in Inner Product Spaces

A.1 Cauchy-Schwarz in Different Forms

The Cauchy-Schwarz inequality |\langle\mathbf{u},\mathbf{v}\rangle| \leq \|\mathbf{u}\|\|\mathbf{v}\| has many equivalent forms:

Discrete form:

\left(\sum_{k=1}^n a_k b_k\right)^2 \leq \left(\sum_{k=1}^n a_k^2\right)\left(\sum_{k=1}^n b_k^2\right)

Integral form:

\left(\int_a^b f(x)g(x)\,dx\right)^2 \leq \left(\int_a^b f(x)^2\,dx\right)\left(\int_a^b g(x)^2\,dx\right)

Probability form: For random variables X, Y with \mathbb{E}[X^2], \mathbb{E}[Y^2] < \infty:

|\mathbb{E}[XY]|^2 \leq \mathbb{E}[X^2]\mathbb{E}[Y^2]

The probability form directly implies: the correlation coefficient satisfies |\text{Corr}(X,Y)| \leq 1.

Second proof (via AM-GM). For any t > 0:

0 \leq \left\|t\mathbf{u} - \frac{1}{t}\mathbf{v}\right\|^2 = t^2\|\mathbf{u}\|^2 - 2\langle\mathbf{u},\mathbf{v}\rangle + \frac{1}{t^2}\|\mathbf{v}\|^2

Minimizing over t > 0 gives t^2 = \|\mathbf{v}\|/\|\mathbf{u}\| and:

0 \leq 2\|\mathbf{u}\|\|\mathbf{v}\| - 2\langle\mathbf{u},\mathbf{v}\rangle \quad \Rightarrow \quad \langle\mathbf{u},\mathbf{v}\rangle \leq \|\mathbf{u}\|\|\mathbf{v}\|

Applying the same inequality to -\mathbf{v} gives the lower bound. \square
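A quick numerical sanity check of the discrete and probability forms (a sketch using NumPy; the data are arbitrary random draws):

import numpy as np

rng = np.random.default_rng(0)
u, v = rng.standard_normal(5), rng.standard_normal(5)

# Discrete form: |<u, v>| <= ||u|| ||v||
assert abs(u @ v) <= np.linalg.norm(u) * np.linalg.norm(v) + 1e-12

# Probability form implies |Corr(X, Y)| <= 1 for any sample
X, Y = rng.standard_normal(1000), rng.standard_normal(1000)
assert abs(np.corrcoef(X, Y)[0, 1]) <= 1.0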

A.2 Triangle Inequality and Its Uses

The triangle inequality for norms follows from Cauchy-Schwarz:

\|\mathbf{u}+\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + 2\langle\mathbf{u},\mathbf{v}\rangle + \|\mathbf{v}\|^2 \leq \|\mathbf{u}\|^2 + 2\|\mathbf{u}\|\|\mathbf{v}\| + \|\mathbf{v}\|^2 = (\|\mathbf{u}\|+\|\mathbf{v}\|)^2

Parallelogram law: For any inner product space:

\|\mathbf{u}+\mathbf{v}\|^2 + \|\mathbf{u}-\mathbf{v}\|^2 = 2(\|\mathbf{u}\|^2 + \|\mathbf{v}\|^2)

Proof: Expand both norms and add. The cross terms cancel. \square

This law characterizes inner product spaces: a norm satisfying the parallelogram law comes from an inner product via the polarization identity:

\langle\mathbf{u},\mathbf{v}\rangle = \frac{1}{4}\left(\|\mathbf{u}+\mathbf{v}\|^2 - \|\mathbf{u}-\mathbf{v}\|^2\right)

Practical importance: The \ell^1 and \ell^\infty norms do NOT satisfy the parallelogram law, confirming they do not arise from inner products. This is why optimization in \ell^1 (LASSO) has different geometry from \ell^2 (ridge regression) - there is no meaningful notion of "orthogonality" in \ell^1.
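A small check that the parallelogram law holds for the \ell^2 norm but fails for \ell^1 (a sketch; the two test vectors are arbitrary):

import numpy as np

u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])

def parallelogram_gap(p):
    lhs = np.linalg.norm(u + v, p)**2 + np.linalg.norm(u - v, p)**2
    rhs = 2 * (np.linalg.norm(u, p)**2 + np.linalg.norm(v, p)**2)
    return lhs - rhs

print(parallelogram_gap(2))   # 0.0 -> law holds; the norm comes from an inner product
print(parallelogram_gap(1))   # 4.0 -> law fails; no underlying inner product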

A.3 Weyl's Inequality for Eigenvalue Perturbation

When a symmetric matrix A is perturbed to A + E (with E symmetric), Weyl's inequality bounds the eigenvalue shifts:

|\lambda_k(A+E) - \lambda_k(A)| \leq \|E\|_2 = \sigma_1(E)

This means: the eigenvalues of a symmetric matrix are Lipschitz functions of the matrix entries, with Lipschitz constant 1 in the spectral norm.

Consequence for ML: If the Hessian \nabla^2\mathcal{L}(\theta) changes slowly along a training trajectory (small \|H(t+1) - H(t)\|_2), its eigenvalues also change slowly. This justifies quasi-Newton methods that assume the curvature is approximately constant over several steps.
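A quick numerical check of the bound (a sketch; the symmetric matrix and its small perturbation are random):

import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n)); A = (A + A.T) / 2          # symmetric matrix
E = 1e-3 * rng.standard_normal((n, n)); E = (E + E.T) / 2   # small symmetric perturbation

shifts = np.abs(np.linalg.eigvalsh(A + E) - np.linalg.eigvalsh(A))   # sorted eigenvalues
print(shifts.max() <= np.linalg.norm(E, 2) + 1e-12)                  # True: Weyl's bound holds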


Appendix B: Numerical Orthogonalization Recipes

B.1 When to Use Each Method

| Method | Complexity | Stability | When to Use |
| --- | --- | --- | --- |
| Classical Gram-Schmidt (CGS) | $O(mnk)$ | Poor: $O(\epsilon\kappa^2)$ | Never in production; educational use only |
| Modified Gram-Schmidt (MGS) | $O(mnk)$ | Good: $O(\epsilon\kappa)$ | When an explicit $Q$ is needed and $\kappa(A)$ is moderate |
| Householder QR | $O(mn^2 - n^3/3)$ | Excellent: $O(\epsilon)$ | General QR factorization; recommended default |
| Givens QR | $O(mn^2)$ | Excellent | Sparse matrices; parallel processing |
| Randomized QR | $O(mn\log k + nk^2)$ | Good | Large matrices; approximate low-rank QR |

B.2 Loss of Orthogonality: Quantifying the Damage

In IEEE double precision (\epsilon_\text{mach} \approx 2.2 \times 10^{-16}), the computed \hat{Q} from CGS on an n \times n matrix satisfies:

\|\hat{Q}^\top\hat{Q} - I\|_F \lesssim n \cdot \epsilon_\text{mach} \cdot \kappa(A)^2

For a well-conditioned matrix (\kappa = 10): error \approx n \cdot 10^{-14} - acceptable.

For an ill-conditioned matrix (\kappa = 10^8): error \approx n \cdot 1 - completely non-orthogonal!

MGS improves this to \lesssim n \cdot \epsilon_\text{mach} \cdot \kappa(A), and Householder to \lesssim n \cdot \epsilon_\text{mach} regardless of \kappa.

B.3 Reorthogonalization

When MGS produces inadequate orthogonality (common for nearly-dependent input vectors), apply reorthogonalization: after completing Gram-Schmidt once, apply it again to the already-computed Q. This reduces the error from O(\epsilon\kappa) to O(\epsilon^2\kappa^2), which is typically negligible.

import numpy as np

def mgs_with_reorth(A, num_reorth=1):
    """Modified Gram-Schmidt with optional reorthogonalization passes."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].astype(float).copy()
        for _ in range(1 + num_reorth):      # pass 0 is plain MGS; later passes reorthogonalize
            for i in range(j):
                c = Q[:, i] @ v              # projection coefficient for this pass
                R[i, j] += c                 # accumulate so that A = QR still holds
                v -= c * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R
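A quick usage check on an ill-conditioned input, run in the same session as the function above (a sketch; the Hilbert matrix is just a convenient nearly-dependent test case):

import numpy as np

n = 10
H = 1.0 / (np.arange(1, n + 1)[:, None] + np.arange(1, n + 1)[None, :] - 1)   # Hilbert matrix

Q0, _ = mgs_with_reorth(H, num_reorth=0)    # plain MGS
Q1, R1 = mgs_with_reorth(H, num_reorth=1)   # with one reorthogonalization pass
print(np.max(np.abs(Q0.T @ Q0 - np.eye(n))))   # noticeably away from zero at this conditioning
print(np.max(np.abs(Q1.T @ Q1 - np.eye(n))))   # several orders of magnitude smaller
print(np.max(np.abs(Q1 @ R1 - H)))             # factorization residual stays tiny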

B.4 Block Gram-Schmidt for Parallel Computation

For very large matrices on parallel hardware, block Gram-Schmidt processes columns in blocks of size b:

  1. Orthogonalize within each block using MGS or Householder
  2. Orthogonalize each new block against all previous blocks using a BLAS-3 matrix operation

The key advantage: step 2 uses matrix-matrix products (O(nb^2) flops, BLAS Level 3) rather than matrix-vector products (O(nb) flops, BLAS Level 1/2). On modern hardware with SIMD and caching, this gives a 10-100× speedup for large n.
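A minimal sketch of the two-step scheme (the function name and block width are illustrative; here the inter-block projection is applied before the within-block QR, one common ordering):

import numpy as np

def block_gram_schmidt(A, b=32):
    """Orthonormalize the columns of A, processing them in blocks of width b."""
    m, n = A.shape
    Q = np.empty((m, 0))
    for start in range(0, n, b):
        B = A[:, start:start + b].astype(float).copy()
        if Q.shape[1] > 0:
            B -= Q @ (Q.T @ B)      # project against all previous blocks (matrix-matrix, BLAS-3)
            B -= Q @ (Q.T @ B)      # second pass for numerical safety
        Qb, _ = np.linalg.qr(B)     # orthogonalize within the block
        Q = np.hstack([Q, Qb])
    return Q

rng = np.random.default_rng(5)
A = rng.standard_normal((500, 100))
Q = block_gram_schmidt(A)
print(np.max(np.abs(Q.T @ Q - np.eye(100))))   # close to machine precision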


Appendix C: Orthogonality in Machine Learning Libraries

C.1 PyTorch Orthogonal Utilities

import torch
import torch.nn as nn

# Orthogonal initialization
linear = nn.Linear(64, 64)
nn.init.orthogonal_(linear.weight)  # Q factor of a QR decomposition of a random matrix

# Spectral normalization
conv = nn.utils.spectral_norm(nn.Conv2d(64, 64, 3))
# Divides weight by its spectral norm at each forward pass

# Orthogonal regularization
def orthogonal_reg(model, lambda_=1e-4):
    loss = 0
    for name, param in model.named_parameters():
        if 'weight' in name and param.dim() == 2:
            W = param
            loss += lambda_ * torch.norm(W.T @ W - torch.eye(W.shape[1]))
    return loss

C.2 NumPy/SciPy QR and Orthogonal Matrix Tools

import numpy as np
from scipy.linalg import qr, qr_multiply, polar

# Full QR decomposition
Q, R = np.linalg.qr(A)              # 'reduced' (default) or 'complete'
Q, R, P = qr(A, pivoting=True)      # QR with column pivoting

# Random orthogonal matrix (Haar measure)
M = np.random.randn(n, n)
Q, R = np.linalg.qr(M)
Q *= np.sign(np.diag(R))            # Fix signs for uniform distribution

# Polar decomposition: A = Q @ S
Q, S = polar(A)                     # scipy.linalg.polar

# Check orthogonality
I_approx = Q.T @ Q
print(f"Max deviation from identity: {np.max(np.abs(I_approx - np.eye(n))):.2e}")

C.3 Efficient Householder Representation

For large n, storing the full Q matrix explicitly requires O(n^2) space. The compact WY representation stores Q = I - WY^\top where W, Y \in \mathbb{R}^{n \times k}, requiring only O(nk) space and allowing matrix-vector products in O(nk) time:

def apply_householder_sequence(v_list, tau_list, x):
    """Apply Q = H_1 H_2 ... H_k to x using the compact representation."""
    # Q x = H_1 (H_2 (... (H_k x))), so the reflectors are applied in reverse order.
    for v, tau in zip(reversed(v_list), reversed(tau_list)):
        x = x - tau * v * (v @ x)  # apply H = I - tau * v * v^T
    return x

This is how LAPACK and NumPy store Q internally - only the Householder vectors are saved, not the explicit matrix.
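A small consistency check of the helper above (a sketch; the reflectors are arbitrary and built directly from the formula H = I - tau v v^T with tau = 2/||v||^2):

import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(5)

v_list = [rng.standard_normal(5) for _ in range(3)]
tau_list = [2.0 / (v @ v) for v in v_list]      # this choice of tau makes each H_i an exact reflector

y = apply_householder_sequence(v_list, tau_list, x)
print(np.isclose(np.linalg.norm(y), np.linalg.norm(x)))   # True: a product of reflectors is orthogonal

# Compare against the explicitly formed Q = H_1 H_2 H_3
Q = np.eye(5)
for v, tau in zip(v_list, tau_list):
    Q = Q @ (np.eye(5) - tau * np.outer(v, v))
print(np.allclose(y, Q @ x))                               # True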


Appendix D: The Geometry of Projection - Visual Summary

ORTHOGONAL DECOMPOSITION THEOREM
========================================================================

  The Euclidean space \mathbb{R}^n splits into complementary orthogonal subspaces:

  For any subspace S \subseteq \mathbb{R}^n:

           \mathbb{R}^n = S \oplus S\perp

           +------------------------------------------+
           |                  \mathbb{R}^n                      |
           |         +----------------+               |
           |         |       S        |               |
           |         |    v_S = Pv   |               |
           |         |      -        |               |
           |         |    /         |               |
           |         |  /           |               |
           |  v-     |/  <- v_S      |               |
           |   \     +----------------+               |
           |    \          |  orthogonal               |
           |     \         |  complement               |
           |      \        |  S\perp                      |
           |       ------- + ------------------------- |
           |    v - v_S    |  (v - v_S \perp every s \in S) |
           +------------------------------------------+

  Properties of projection matrix P = QQ^T (Q has ONB for S):

    Symmetric:    P^T = P         (orthogonal, not oblique)
    Idempotent:   P^2 = P         (projecting twice = projecting once)
    Range:        col(P) = S     (maps into S)
    Null space:   null(P) = S\perp  (annihilates complement)

========================================================================
GRAM-SCHMIDT PROCESS - STEP BY STEP
========================================================================

  Input: linearly independent vectors a_1, a_2, a_3 (not orthogonal)
  Output: orthonormal basis q_1, q_2, q_3 for same subspace

  STEP 1:  q_1 = a_1 / ||a_1||
           ------------------------------------
           a_1  --------------------------> q_1 (unit vector)

  STEP 2:  u_2 = a_2 - <a_2,q_1>q_1   (subtract projection onto q_1)
           q_2 = u_2 / ||u_2||
           ------------------------------------
           a_2  --->  remove q_1 component  ---> u_2 ---> q_2

                     a_2
                    /
                   / up <a_2,q_1>q_1 (this part removed)
                  /
           q_1 ------------ q_1 direction
                  u_2 (perpendicular remainder) --> q_2

  STEP 3:  u_3 = a_3 - <a_3,q_1>q_1 - <a_3,q_2>q_2
           q_3 = u_3 / ||u_3||
           ------------------------------------
           Remove projections onto BOTH q_1 and q_2

  RESULT: {q_1, q_2, q_3} is an ONB with same span as {a_1, a_2, a_3}

  QR FACTORIZATION: A = [a_1|a_2|a_3] = [q_1|q_2|q_3] * R
  where R is upper triangular with positive diagonal.

========================================================================
SPECTRAL THEOREM: SYMMETRIC <-> ORTHOGONAL EIGENBASIS
========================================================================

  A = A^T  (symmetric)
       <->
  A has orthonormal eigenbasis {q_1, ..., q_n}
       <->
  A = Q \Lambda Q^T  where  Q = [q_1|...|q_n] \in O(n),  \Lambda = diag(\lambda_1,...,\lambda_n)
       <->
  Quadratic form:  x^T A x = \sum_i \lambda_i (x^T q_i)^2 = weighted sum of squared projections

  INTERPRETATION:
  +-----------------------------------------------------------------+
  |  Q^T: Express x in eigenbasis  ->  \Lambda: Scale each component  ->   |
  |  Q: Return to standard basis                                    |
  |                                                                  |
  |  \lambda_i > 0: stretch in q_i direction  (positive definite: all \lambda_i>0)|
  |  \lambda_i = 0: collapse in q_i direction  (singular matrix)           |
  |  \lambda_i < 0: reflection in q_i direction  (indefinite: mixed signs) |
  +-----------------------------------------------------------------+

  AI CONNECTION: Hessian \nabla^2\mathcal{L} is symmetric. Its eigenvectors are the
  "principal curvature directions" of the loss surface. Eigenvalues
  tell us how curved the loss is in each direction.

========================================================================
ORTHOGONAL MATRICES: STRUCTURE AND CLASSIFICATION
========================================================================

  Q^TQ = I  (definition)   =>   det(Q) = \pm1

  det(Q) = +1:  ROTATIONS (SO(n))      det(Q) = -1:  REFLECTIONS
  ----------------------------         --------------------------
  Preserve orientation                 Reverse orientation
  Examples in 2D: R_\theta for all \theta        Example: diag(1,...,1,-1)
  Compose to give rotations            Compose to give rotations

  O(n) = SO(n) \cup {reflections}   (two connected components)

  EIGENVALUES: always on unit circle \in \mathbb{C}
  +-----------------------------------------------------------------+
  |   Real eigenvalues: \pm1 only                                     |
  |   Complex eigenvalues: e^{i\theta}, e^{-i\theta} pairs -> 2D rotations    |
  |                                                                  |
  |   SO(2n) block structure:                                        |
  |   +----------+-------+                                          |
  |   | R_{\theta_1}  |       |  each block is a 2\times2 rotation           |
  |   +----------+ R_{\theta_2}|  with its own angle \theta_i                 |
  |   |          +-------+                                          |
  |   |          |  ...  |                                          |
  |   +----------+-------+                                          |
  +-----------------------------------------------------------------+

========================================================================

Appendix E: Summary of Key Formulas

Inner Products and Orthogonality

| Formula | Expression | Notes |
| --- | --- | --- |
| Inner product | $\langle\mathbf{u},\mathbf{v}\rangle = \mathbf{u}^\top\mathbf{v} = \sum_i u_i v_i$ | Euclidean; generalizes to $\langle\mathbf{u},\mathbf{v}\rangle_M = \mathbf{u}^\top M\mathbf{v}$ |
| Norm | $\lVert\mathbf{v}\rVert = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$ | Induced by inner product |
| Angle | $\cos\theta = \langle\mathbf{u},\mathbf{v}\rangle / (\lVert\mathbf{u}\rVert\lVert\mathbf{v}\rVert)$ | Requires both nonzero |
| Cauchy-Schwarz | $\lvert\langle\mathbf{u},\mathbf{v}\rangle\rvert \leq \lVert\mathbf{u}\rVert\lVert\mathbf{v}\rVert$ | Equality iff linearly dependent |
| Pythagorean theorem | $\mathbf{u} \perp \mathbf{v} \Rightarrow \lVert\mathbf{u}+\mathbf{v}\rVert^2 = \lVert\mathbf{u}\rVert^2 + \lVert\mathbf{v}\rVert^2$ | Converse also holds |
| Parseval's identity | $\lVert\mathbf{v}\rVert^2 = \sum_{i=1}^n\langle\mathbf{v},\mathbf{q}_i\rangle^2$ | For a full ONB $\{\mathbf{q}_i\}$ |
| Bessel's inequality | $\sum_{i=1}^k\langle\mathbf{v},\mathbf{q}_i\rangle^2 \leq \lVert\mathbf{v}\rVert^2$ | For any orthonormal set |

Projections

| Formula | Expression | Notes |
| --- | --- | --- |
| Projection onto unit vector | $\operatorname{proj}_{\hat{\mathbf{u}}}(\mathbf{v}) = \langle\mathbf{v},\hat{\mathbf{u}}\rangle\hat{\mathbf{u}}$ | 1D case |
| Projection onto subspace (ONB $Q$) | $P_S\mathbf{v} = QQ^\top\mathbf{v}$ | Requires $Q^\top Q = I$ |
| Projection onto $\operatorname{col}(A)$ (general) | $P_S\mathbf{v} = A(A^\top A)^{-1}A^\top\mathbf{v}$ | General basis |
| Projection matrix properties | $P^2 = P$, $P^\top = P$ | Idempotent + symmetric |
| Complementary projector | $P_{S^\perp} = I - P_S$ | Also a projector |
| Best approximation property | $\lVert\mathbf{v} - P_S\mathbf{v}\rVert \leq \lVert\mathbf{v} - \mathbf{s}\rVert$ for all $\mathbf{s} \in S$ | Fundamental theorem |

Gram-Schmidt and QR

| Formula | Expression | Notes |
| --- | --- | --- |
| GS step | $\mathbf{u}_j = \mathbf{a}_j - \sum_{i<j}\langle\mathbf{a}_j,\mathbf{q}_i\rangle\mathbf{q}_i$ | Remove prior directions |
| QR factorization | $A = QR$ | $Q$ orthonormal columns, $R$ upper triangular |
| Least squares via QR | $R\mathbf{x} = Q^\top\mathbf{b}$ | Solve upper triangular system |
| Householder | $H = I - 2\mathbf{n}\mathbf{n}^\top$ | Reflection across $\mathbf{n}^\perp$ |
| Cayley transform | $Q = (I-S)(I+S)^{-1}$ | From skew-symmetric $S$ |

Spectral Theorem

| Formula | Expression | Notes |
| --- | --- | --- |
| Spectral decomposition | $A = Q\Lambda Q^\top = \sum_i \lambda_i\mathbf{q}_i\mathbf{q}_i^\top$ | Symmetric $A$ only |
| Rayleigh quotient | $R_A(\mathbf{x}) = \mathbf{x}^\top A\mathbf{x}/\lVert\mathbf{x}\rVert^2$ | Bounded by $[\lambda_{\min}, \lambda_{\max}]$ |
| Positive definiteness | $A \succ 0 \Leftrightarrow$ all $\lambda_i > 0$ | Via spectral theorem |
| Polar decomposition | $A = QS$, $Q^\top Q = I$, $S \succeq 0$ | Extends to non-square matrices |

Appendix F: Connections to Other Chapters

F.1 From Eigenvalues (01) to Orthogonality

The spectral theorem shows that symmetric matrices are exactly the matrices that can be diagonalized by an orthogonal change of basis. Non-symmetric matrices may still be diagonalizable, but only with a non-orthogonal change of basis - making computations numerically less stable.

The QR algorithm for computing eigenvalues (01) is built entirely on orthogonal matrix iterations:

A_0 = A
for k = 0, 1, 2, ...:
    A_k = Q_k R_k    (QR factorization of current iterate)
    A_{k+1} = R_k Q_k  (swap factors)

Each iterate is orthogonally similar to the previous one: A_{k+1} = Q_k^\top A_k Q_k (since R_k = Q_k^\top A_k and A_{k+1} = R_k Q_k = Q_k^\top A_k Q_k). The iterates converge to the Schur form - upper triangular, with eigenvalues on the diagonal. The entire algorithm is numerically stable precisely because it uses only orthogonal transformations.
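A minimal runnable sketch of the unshifted iteration (no shifts or deflation, so it converges far more slowly than production implementations; the test matrix is arbitrary):

import numpy as np

def qr_algorithm(A, n_iter=200):
    """Unshifted QR iteration: each iterate R_k Q_k is orthogonally similar to A."""
    Ak = np.array(A, dtype=float)
    for _ in range(n_iter):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q
    return np.sort(np.diag(Ak))    # approximate eigenvalues (symmetric case)

A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
print(qr_algorithm(A))
print(np.sort(np.linalg.eigvalsh(A)))   # should agree to many digits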

F.2 From SVD (02) to Orthogonality

The SVD A = U\Sigma V^\top can be viewed as the "orthogonality structure" of A:

  • V = [\mathbf{v}_1|\cdots|\mathbf{v}_n]: ONB for the row space, mapping inputs to "pure directions"
  • U = [\mathbf{u}_1|\cdots|\mathbf{u}_m]: ONB for the column space, the natural output directions
  • \Sigma: the non-orthogonal part (pure scaling)

The polar decomposition A = (U V^\top)(V \Sigma V^\top) isolates the orthogonal factor Q = UV^\top from the symmetric factor S = V\Sigma V^\top. This factorization requires all the tools developed in this section.

F.3 Orthogonality Enables Low-Rank Approximation (02, 03)

Eckart-Young-Mirsky theorem: The best rank-k approximation to A in both the spectral and Frobenius norms is:

A_k = \sum_{i=1}^k \sigma_i\mathbf{u}_i\mathbf{v}_i^\top

This result requires:

  1. The SVD (which produces orthogonal U, V)
  2. The best approximation theorem (projection gives the nearest element in a subspace)
  3. Parseval's identity (to express \|A - A_k\|_F^2 = \sum_{i>k}\sigma_i^2)

All three are developed in this section. The Eckart-Young theorem is the mathematical foundation for PCA, LoRA, and any other low-rank approximation method.
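A quick numerical check of the Frobenius-norm statement (a sketch; the matrix and rank are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # best rank-k approximation
err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[k:]**2))))    # True: ||A - A_k||_F^2 = sum_{i>k} sigma_i^2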

F.4 Orthogonality and Matrix Norms (06)

Unitarily invariant norms: A matrix norm \|A\| is unitarily invariant if \|UAV\| = \|A\| for all unitary U, V. The three most important matrix norms are all unitarily invariant:

  • Spectral norm: \|A\|_2 = \sigma_1(A) (largest singular value)
  • Frobenius norm: \|A\|_F = \sqrt{\sum_{i,j}A_{ij}^2} = \sqrt{\sum_i\sigma_i^2} (square root of the sum of squared singular values)
  • Nuclear norm: \|A\|_* = \sum_i\sigma_i (sum of singular values)

These norms are "orthogonality-aware" - they don't change when the matrix is multiplied by orthogonal matrices. This property is what allows the Eckart-Young theorem to work: the error in low-rank approximation depends only on the discarded singular values, not on the choice of orthogonal bases.
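A quick check of this invariance for all three norms (a sketch; U and V are random orthogonal matrices obtained from QR):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 4))
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))

for ord_ in (2, 'fro', 'nuc'):                        # spectral, Frobenius, nuclear
    print(ord_, np.isclose(np.linalg.norm(A, ord_), np.linalg.norm(U @ A @ V.T, ord_)))   # all True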

F.5 Orthogonality in Optimization (Chapter 08)

Modern optimization methods for training large neural networks increasingly exploit orthogonal structure:

Sharpness-Aware Minimization (SAM): A standard measure of the sharpness of a minimum is \lambda_\text{max}(\nabla^2\mathcal{L}). SAM itself perturbs parameters along the normalized gradient (a first-order proxy for the worst-case direction), while the sharpness \lambda_\text{max} and its eigenvector are typically estimated with power iteration. Power iteration produces a sequence of normalized vectors converging to the top eigenvector, each step involving a Hessian-vector product and a re-normalization (orthogonalization against a zero-dimensional subspace = normalization).

SOAP optimizer (2024): Maintains a factored approximation of the Adam preconditioner in which the "rotation" and "scaling" factors are tracked separately. The rotation factor is updated via QR decomposition to maintain orthogonality, while the scaling factor tracks the second moments in the rotating basis.

K-FAC and its variants: The Kronecker-Factored Approximate Curvature approximates the Fisher information matrix as a Kronecker product F \approx A \otimes G. Computing the update requires inverting these factors, which can be done via their eigendecompositions - and the eigenvector matrices involved are orthogonal.


Appendix G: Historical Notes and References

Timeline: Development of Orthogonality Theory

| Year | Mathematician | Contribution |
| --- | --- | --- |
| 1801 | Gauss | Least squares method; orthogonal projection as normal equations |
| 1844 | Grassmann | Abstract vector spaces and subspaces; inner product generalization |
| 1850s | Cauchy, Schwarz | Cauchy-Schwarz inequality (Cauchy for sums 1821, Schwarz integral form 1888) |
| 1897 | Gram | Gram-Schmidt process (Gram's formulation for $L^2$ functions) |
| 1902 | Schmidt | Finite-dimensional Gram-Schmidt for $\ell^2$ |
| 1907 | Parseval | Parseval's theorem for Fourier series |
| 1908 | Hilbert | Abstract Hilbert spaces; orthogonal projection theorem |
| 1915-1927 | Weyl, von Neumann | Spectral theorem in infinite dimensions |
| 1954 | Householder | Householder reflectors for numerically stable QR |
| 1965 | Francis | QR algorithm for eigenvalues (with implicit shifts) |
| 1967 | Björck | Numerical analysis of Gram-Schmidt stability |
| 2013 | Saxe, McClelland, Ganguli | Orthogonal initialization for deep linear networks |
| 2015 | Mishkin, Matas | LSUV initialization using SVD/QR |
| 2018 | Miyato et al. | Spectral normalization for GANs |
| 2021 | Su et al. | RoPE: Rotary Position Embedding using orthogonal rotations |
| 2024 | Jordan et al. | Muon optimizer: gradient orthogonalization via Newton-Schulz |

Key References

  1. Golub & Van Loan, Matrix Computations (4th ed., 2013) - Definitive reference on numerical linear algebra including Gram-Schmidt, Householder, and QR algorithms
  2. Axler, Linear Algebra Done Right (3rd ed., 2015) - Elegant proof-based treatment of inner product spaces and spectral theorem
  3. Trefethen & Bau, Numerical Linear Algebra (1997) - Excellent coverage of QR, projections, and stability from computational perspective
  4. Saxe et al. (2013), "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks" - Theoretical foundation for orthogonal initialization
  5. Miyato et al. (2018), "Spectral Normalization for GANs" - Seminal paper on spectral normalization in deep learning
  6. Su et al. (2021), "RoFormer: Enhanced Transformer with Rotary Position Embedding" - RoPE using orthogonal rotation matrices
  7. Jordan et al. (2024), "Muon: An optimizer for hidden layers in neural networks" - Newton-Schulz orthogonalization for gradients

Appendix H: Proofs of Selected Theorems

H.1 Complete Proof: Cauchy-Schwarz Inequality

Theorem. For any inner product space and vectors \mathbf{u}, \mathbf{v}:

|\langle\mathbf{u},\mathbf{v}\rangle|^2 \leq \langle\mathbf{u},\mathbf{u}\rangle\langle\mathbf{v},\mathbf{v}\rangle

Proof. If \mathbf{v} = \mathbf{0}, both sides are 0 and the inequality holds trivially. Assume \mathbf{v} \neq \mathbf{0}.

For any scalar t \in \mathbb{R}:

0 \leq \langle\mathbf{u} - t\mathbf{v},\mathbf{u} - t\mathbf{v}\rangle = \langle\mathbf{u},\mathbf{u}\rangle - 2t\langle\mathbf{u},\mathbf{v}\rangle + t^2\langle\mathbf{v},\mathbf{v}\rangle

This is a quadratic in t that is always \geq 0. Its discriminant must therefore be \leq 0:

\Delta = 4\langle\mathbf{u},\mathbf{v}\rangle^2 - 4\langle\mathbf{u},\mathbf{u}\rangle\langle\mathbf{v},\mathbf{v}\rangle \leq 0
\quad\Rightarrow\quad \langle\mathbf{u},\mathbf{v}\rangle^2 \leq \langle\mathbf{u},\mathbf{u}\rangle\langle\mathbf{v},\mathbf{v}\rangle \quad \square

Equality condition. Equality holds iff \Delta = 0, iff the quadratic in t has a zero, iff there exists t_0 with \mathbf{u} - t_0\mathbf{v} = \mathbf{0}, iff \mathbf{u} and \mathbf{v} are linearly dependent (one is a scalar multiple of the other).

Remark. This proof works for any abstract inner product space, finite or infinite dimensional. The key was that \langle\cdot,\cdot\rangle is positive definite.

H.2 Complete Proof: Best Approximation Uniqueness

Theorem. Let S be a closed subspace of a Hilbert space \mathcal{H}. For any \mathbf{v} \in \mathcal{H}, the projection \mathbf{v}_S satisfying \mathbf{v} - \mathbf{v}_S \perp S is unique.

Proof of uniqueness. Suppose both \mathbf{w}_1, \mathbf{w}_2 \in S satisfy \mathbf{v} - \mathbf{w}_i \perp S. Then (\mathbf{v} - \mathbf{w}_1) - (\mathbf{v} - \mathbf{w}_2) = \mathbf{w}_2 - \mathbf{w}_1 \in S (since S is a subspace). But also \mathbf{w}_2 - \mathbf{w}_1 \perp S (as a difference of vectors each perpendicular to S). So \mathbf{w}_2 - \mathbf{w}_1 \in S \cap S^\perp = \{\mathbf{0}\}, giving \mathbf{w}_1 = \mathbf{w}_2. \square

H.3 Complete Proof: Gram-Schmidt Produces ONB

Theorem. If \mathbf{a}_1, \ldots, \mathbf{a}_k are linearly independent vectors in an inner product space V, the Gram-Schmidt process produces an orthonormal set \mathbf{q}_1, \ldots, \mathbf{q}_k with \operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_j) = \operatorname{span}(\mathbf{a}_1,\ldots,\mathbf{a}_j) for each j = 1,\ldots,k.

Proof by induction on j.

Base case (j=1): \mathbf{q}_1 = \mathbf{a}_1/\|\mathbf{a}_1\| is a unit vector (since \mathbf{a}_1 \neq \mathbf{0} by independence), and \operatorname{span}(\mathbf{q}_1) = \operatorname{span}(\mathbf{a}_1). ✓

Inductive step: Assume the claim holds for j-1. Define S_{j-1} = \operatorname{span}(\mathbf{a}_1,\ldots,\mathbf{a}_{j-1}) = \operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_{j-1}).

(i) \mathbf{u}_j \neq \mathbf{0}: Suppose for contradiction \mathbf{u}_j = \mathbf{0}. Then \mathbf{a}_j = \sum_{i<j}\langle\mathbf{a}_j,\mathbf{q}_i\rangle\mathbf{q}_i \in S_{j-1} = \operatorname{span}(\mathbf{a}_1,\ldots,\mathbf{a}_{j-1}), contradicting linear independence.

(ii) \mathbf{q}_j \perp \mathbf{q}_\ell for \ell < j: For any \ell < j:

\langle\mathbf{u}_j, \mathbf{q}_\ell\rangle = \langle\mathbf{a}_j, \mathbf{q}_\ell\rangle - \sum_{i<j}\langle\mathbf{a}_j,\mathbf{q}_i\rangle\langle\mathbf{q}_i,\mathbf{q}_\ell\rangle = \langle\mathbf{a}_j,\mathbf{q}_\ell\rangle - \langle\mathbf{a}_j,\mathbf{q}_\ell\rangle\cdot 1 = 0

(using the inductive hypothesis \langle\mathbf{q}_i,\mathbf{q}_\ell\rangle = \delta_{i\ell}). Since \mathbf{q}_j = \mathbf{u}_j/\|\mathbf{u}_j\|, we also have \langle\mathbf{q}_j,\mathbf{q}_\ell\rangle = 0. ✓

(iii) Same span: \mathbf{q}_j = \mathbf{u}_j/\|\mathbf{u}_j\| is in \operatorname{span}(\mathbf{a}_j, \mathbf{q}_1,\ldots,\mathbf{q}_{j-1}) = \operatorname{span}(\mathbf{a}_1,\ldots,\mathbf{a}_j). Conversely, \mathbf{a}_j = \mathbf{u}_j + \sum_{i<j}\langle\mathbf{a}_j,\mathbf{q}_i\rangle\mathbf{q}_i = \|\mathbf{u}_j\|\mathbf{q}_j + \sum_{i<j}\langle\mathbf{a}_j,\mathbf{q}_i\rangle\mathbf{q}_i \in \operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_j). ✓

By induction, the claim holds for all j \leq k. \square




Appendix I: The Stiefel Manifold and Optimization on Orthogonal Groups

I.1 The Stiefel Manifold \text{St}(n, k)

The set of n \times k matrices with orthonormal columns is the Stiefel manifold:

\text{St}(n, k) = \{Q \in \mathbb{R}^{n \times k} : Q^\top Q = I_k\}

Special cases:

  • \text{St}(n, n) = O(n) - the full orthogonal group
  • \text{St}(n, 1) = S^{n-1} - the unit sphere in \mathbb{R}^n
  • \text{St}(n, k) for 1 < k < n - Stiefel manifolds proper

Dimension: \dim(\text{St}(n,k)) = nk - k(k+1)/2

Intuition: We have nk free parameters (the entries of an n \times k matrix), subject to k(k+1)/2 constraints (k normalization constraints + k(k-1)/2 orthogonality constraints between distinct columns).

Tangent space at Q \in \text{St}(n,k): The tangent space consists of matrices \dot{Q} satisfying Q^\top\dot{Q} + \dot{Q}^\top Q = 0, i.e., Q^\top\dot{Q} is skew-symmetric:

T_Q\text{St}(n,k) = \{\dot{Q} \in \mathbb{R}^{n \times k} : Q^\top\dot{Q} + \dot{Q}^\top Q = 0\}

I.2 Riemannian Gradient Descent on the Stiefel Manifold

To minimize f(Q) subject to Q \in \text{St}(n,k):

Step 1: Euclidean gradient. Compute \nabla_Q f in \mathbb{R}^{n \times k}.

Step 2: Project to tangent space. The Riemannian gradient is the projection of \nabla_Q f onto T_Q\text{St}(n,k):

\operatorname{grad} f(Q) = \nabla_Q f - Q\,\operatorname{sym}(Q^\top\nabla_Q f)

where \operatorname{sym}(M) = (M + M^\top)/2 is the symmetric part.

Step 3: Update and retract. Move in the direction of the (negative) Riemannian gradient, then retract back to the manifold:

Retraction via QR: Q_{\text{new}} = \operatorname{qr}(Q - \eta\,\operatorname{grad} f(Q)) (take the Q factor from the QR decomposition)

Retraction via Cayley: Q_{\text{new}} = (I + \eta W/2)^{-1}(I - \eta W/2)Q for a suitable skew-symmetric W
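A minimal sketch of the three steps with the QR retraction (the objective, step size, and iteration count are illustrative; minimizing f(Q) = -tr(Q^T A Q) over St(n, k) recovers the top-k eigenspace of a symmetric A):

import numpy as np

def stiefel_gradient_descent(A, k, eta=0.05, n_iter=500, seed=0):
    """Riemannian gradient descent of f(Q) = -tr(Q^T A Q) over St(n, k)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))      # random starting point on St(n, k)
    for _ in range(n_iter):
        G = -2 * A @ Q                                    # step 1: Euclidean gradient
        rgrad = G - Q @ ((Q.T @ G + G.T @ Q) / 2)         # step 2: project onto the tangent space
        Q, _ = np.linalg.qr(Q - eta * rgrad)              # step 3: retract via the QR factor
    return Q

A = np.diag([5., 4., 1., 0.5])
Q = stiefel_gradient_descent(A, k=2)
print(np.round(np.linalg.eigvalsh(Q.T @ A @ Q), 3))       # expected: approximately [4., 5.]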

I.3 Applications in Modern AI

Orthogonal RNNs (Henaff et al. 2016, Arjovsky et al. 2016). Vanilla RNNs suffer from vanishing/exploding gradients because the recurrence matrix W can have singular values \neq 1. Unitary/Orthogonal RNNs constrain W \in O(n), ensuring stable gradient flow. Training optimizes over O(n) using Stiefel manifold optimization or parameterization via the Cayley transform.

Rotation-Equivariant Networks. Networks for 3D point cloud processing (SE(3)-equivariant networks, e3nn) explicitly parameterize their weight tensors using representations of SO(3). The convolution filters transform predictably under 3D rotations of the input, making the network geometry-aware.

LoRA with Orthogonal Factors. Standard LoRA approximates a weight update as \Delta W = BA for low-rank matrices B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times d}. GLoRA, DoRA, and OLoRA variants constrain B or A to have orthonormal columns (lying on the Stiefel manifold), improving gradient flow during fine-tuning and reducing the rank collapse problem.

I.4 The Polar Decomposition as Nearest Orthogonal Matrix

Problem: Given A \in \mathbb{R}^{n \times n} (not necessarily orthogonal), find the nearest orthogonal matrix:

Q^* = \arg\min_{Q \in O(n)} \|A - Q\|_F

Solution: The polar decomposition A = QS gives Q^* = UV^\top where A = U\Sigma V^\top (SVD).

Proof: For any P \in O(n):

\|A - P\|_F^2 = \|U\Sigma V^\top - P\|_F^2 = \|\Sigma - U^\top P V\|_F^2

(using unitary invariance of the Frobenius norm). Let R = U^\top P V \in O(n). Then:

\|\Sigma - R\|_F^2 = \|\Sigma\|_F^2 - 2\langle\Sigma, R\rangle_F + \|R\|_F^2 = \sum_i\sigma_i^2 - 2\,\text{tr}(\Sigma R) + n

This is minimized when \text{tr}(\Sigma R) is maximized. Since \Sigma is diagonal with nonnegative entries and every diagonal entry of an orthogonal matrix satisfies R_{ii} \leq 1, we have \text{tr}(\Sigma R) = \sum_i\sigma_i R_{ii} \leq \sum_i\sigma_i, with equality at R = I (i.e., P = UV^\top). \square

Muon optimizer interpretation: The Muon optimizer applies the update W \leftarrow W - \eta Q_G, where Q_G = UV^\top is the nearest orthogonal matrix to the gradient G. This can be computed efficiently via Newton-Schulz iterations:

X_0 = G/\|G\|, \quad X_{k+1} = \frac{3}{2}X_k - \frac{1}{2}X_k X_k^\top X_k

which converges quadratically to the orthogonal factor of the polar decomposition of G.
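A minimal sketch of this iteration with the cubic coefficients shown above (Muon in practice uses tuned higher-order coefficients and low-precision matmuls; the test matrix here is random and reasonably conditioned):

import numpy as np

def newton_schulz_orthogonalize(G, n_iter=30):
    """Approximate the orthogonal polar factor U V^T of G by Newton-Schulz iteration."""
    X = G / np.linalg.norm(G, 2)          # scale so all singular values lie in (0, 1]
    for _ in range(n_iter):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # pushes every singular value toward 1
    return X

rng = np.random.default_rng(3)
G = rng.standard_normal((6, 6))
U, _, Vt = np.linalg.svd(G)
print(np.allclose(newton_schulz_orthogonalize(G), U @ Vt, atol=1e-6))   # True for well-conditioned G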


Appendix J: Quick Reference - Orthogonality Vocabulary

| Term | Definition | Key Formula |
| --- | --- | --- |
| Orthogonal vectors | $\langle\mathbf{u},\mathbf{v}\rangle = 0$ | $\mathbf{u}^\top\mathbf{v} = 0$ |
| Orthonormal vectors | Orthogonal + unit length | $\mathbf{q}_i^\top\mathbf{q}_j = \delta_{ij}$ |
| Orthonormal basis (ONB) | Orthonormal set that spans the space | $\mathbf{v} = \sum_i\langle\mathbf{v},\mathbf{q}_i\rangle\mathbf{q}_i$ |
| Orthogonal matrix | Square, orthonormal columns | $Q^\top Q = QQ^\top = I$ |
| Orthogonal group | All $n \times n$ orthogonal matrices | $O(n)$, closed under product, transpose, inverse |
| Special orthogonal group | Rotations ($\det = +1$) | $SO(n) \subset O(n)$ |
| Orthogonal projection | $\mathbf{v}_S =$ closest point in $S$ | $P_S = QQ^\top$, $P^2 = P$, $P^\top = P$ |
| Orthogonal complement | Vectors $\perp$ all of $S$ | $S^\perp = \{\mathbf{v}: \langle\mathbf{v},\mathbf{s}\rangle = 0\ \forall\mathbf{s} \in S\}$ |
| Gram-Schmidt | Orthogonalization algorithm | Build an ONB from any basis |
| Householder reflector | Reflection across a hyperplane | $H = I - 2\mathbf{n}\mathbf{n}^\top$ |
| QR decomposition | $A = QR$ factorization | $Q$: orthonormal cols, $R$: upper triangular |
| Rayleigh quotient | Quotient for eigenvalue bounds | $R_A(\mathbf{x}) = \mathbf{x}^\top A\mathbf{x}/\lVert\mathbf{x}\rVert^2$ |
| Spectral theorem | Symmetric $A$ has an ONB of eigenvectors | $A = Q\Lambda Q^\top$ |
| Isometry | Norm-preserving linear map | $\lVert Q\mathbf{x}\rVert = \lVert\mathbf{x}\rVert$ |
| Stiefel manifold | Matrices with orthonormal columns | $\text{St}(n,k) = \{Q: Q^\top Q = I_k\}$ |
| Polar decomposition | Orthogonal $\times$ symmetric factorization | $A = QS$ |
| Unitary matrix | Complex analog of orthogonal | $U^*U = I$, entries $\in \mathbb{C}$ |

Appendix K: Common Numerical Experiments

K.1 Verifying Gram-Schmidt Stability

The following experiment quantifies the loss of orthogonality in CGS vs MGS:

import numpy as np

def cgs(A):
    m, n = A.shape
    Q = np.zeros_like(A)
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        R[:j, j] = Q[:, :j].T @ v
        v -= Q[:, :j] @ R[:j, j]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

def mgs(A):
    m, n = A.shape
    Q = A.astype(float).copy()
    R = np.zeros((n, n))
    for j in range(n):
        R[j, j] = np.linalg.norm(Q[:, j])
        Q[:, j] /= R[j, j]
        R[j, j+1:] = Q[:, j] @ Q[:, j+1:]
        Q[:, j+1:] -= np.outer(Q[:, j], R[j, j+1:])
    return Q, R

# Test on Hilbert matrix (ill-conditioned)
n = 12
H = 1.0 / (np.arange(1, n+1)[:, None] + np.arange(1, n+1)[None, :] - 1)

Q_cgs, _ = cgs(H)
Q_mgs, _ = mgs(H)

err_cgs = np.max(np.abs(Q_cgs.T @ Q_cgs - np.eye(n)))
err_mgs = np.max(np.abs(Q_mgs.T @ Q_mgs - np.eye(n)))

print(f"CGS orthogonality error: {err_cgs:.2e}")   # ~1e-4 (catastrophic)
print(f"MGS orthogonality error: {err_mgs:.2e}")   # ~1e-8 (acceptable)
print(f"kappa(H): {np.linalg.cond(H):.2e}")        # ~10^16 for n=12

K.2 Orthogonal Initialization Impact on Gradient Norms

import numpy as np

def grad_norm_experiment(n=64, L=20, trials=100):
    """Compare gradient norms: Gaussian init vs Orthogonal init."""
    results = {'gaussian': [], 'orthogonal': []}
    
    for _ in range(trials):
        # Gaussian initialization
        Ws_gauss = [np.random.randn(n, n) / np.sqrt(n) for _ in range(L)]
        prod_gauss = Ws_gauss[0]
        for W in Ws_gauss[1:]:
            prod_gauss = prod_gauss @ W
        results['gaussian'].append(np.linalg.norm(prod_gauss, 2))
        
        # Orthogonal initialization
        Ws_orth = []
        for _ in range(L):
            M = np.random.randn(n, n)
            Q, _ = np.linalg.qr(M)
            Ws_orth.append(Q)
        prod_orth = Ws_orth[0]
        for W in Ws_orth[1:]:
            prod_orth = prod_orth @ W
        results['orthogonal'].append(np.linalg.norm(prod_orth, 2))
    
    print(f"Gaussian: mean={np.mean(results['gaussian']):.3e}, "
          f"std={np.std(results['gaussian']):.3e}")
    print(f"Orthogonal: mean={np.mean(results['orthogonal']):.3e}, "
          f"std={np.std(results['orthogonal']):.3e}")
    # Expected: Gaussian shows extreme variance (very large or very small)
    #           Orthogonal shows norm \approx 1.0 with tiny variance

grad_norm_experiment()

K.3 Rayleigh Quotient Iteration Convergence

import numpy as np

def rayleigh_quotient_iteration(A, x0, n_iter=20):
    """Track RQI convergence: rho converges (cubically) to the eigenvalue nearest its starting value."""
    x = x0 / np.linalg.norm(x0)
    eigs = np.linalg.eigvalsh(A)   # all eigenvalues, used only for error tracking

    history = []
    for _ in range(n_iter):
        rho = x @ A @ x        # Rayleigh quotient
        history.append(np.min(np.abs(eigs - rho)))   # distance to the nearest eigenvalue
        try:
            y = np.linalg.solve(A - rho * np.eye(len(A)), x)
            x = y / np.linalg.norm(y)
        except np.linalg.LinAlgError:
            break  # Converged (A - rho*I is singular)
    return history

A = np.array([[3., 1., 0.], [1., 2., 1.], [0., 1., 1.]])
x0 = np.random.randn(3)
errors = rayleigh_quotient_iteration(A, x0)
print("Errors per iteration:")
for i, e in enumerate(errors[:10]):
    print(f"  iter {i}: {e:.2e}")
# Expected: errors should decrease cubically (each step ~= previous^3)
