Math for LLMs / Advanced Linear Algebra

Orthogonality and Orthonormality

"Orthogonality is the single most important idea in applied mathematics. When things are perpendicular, they don't interfere with each other - and that's the key to making computation tractable."

Overview

Orthogonality is the mathematical formalization of "not interfering." When two vectors are orthogonal, they carry completely independent information. When a set of basis vectors is orthonormal, coordinates can be computed by simple dot products. When a matrix is orthogonal, it preserves lengths and angles - it cannot distort or amplify errors.

This section develops the full theory of orthogonality from first principles. We begin with the geometric intuition - right angles, projections, and the remarkable simplification that occurs when bases are orthonormal. We then build the formal machinery: Gram-Schmidt orthogonalization (classical and numerically stable modified form), Householder and Givens methods for QR decomposition, orthogonal projections and least squares, and the spectral theorem for symmetric matrices.

Throughout, the AI connections are direct and structural. Orthogonal weight initialization preserves gradient norms through linear networks. RoPE encodes position via rotation matrices - orthogonal transformations that preserve the inner products needed for attention. Spectral normalization constrains weight matrices to have unit spectral norm. The QR algorithm for eigenvalues underlies every numerical eigensolver. Understanding orthogonality at depth is understanding the numerical backbone of machine learning.

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Interactive exploration: Gram-Schmidt visualization, projection geometry, QR factorization, spectral theorem, orthogonal init |
| exercises.ipynb | 8 graded exercises, from projection computation through RoPE and orthogonal initialization |

Learning Objectives

After completing this section, you will be able to:

  • State and prove the Cauchy-Schwarz inequality and derive the angle formula
  • Prove that any orthogonal set of nonzero vectors is linearly independent
  • Compute coordinates in an orthonormal basis using Fourier coefficients
  • Construct orthogonal complements and verify the orthogonal decomposition theorem
  • Characterize orthogonal matrices via $Q^\top Q = I$ and prove they are isometries
  • Implement Gram-Schmidt orthogonalization (classical and modified forms)
  • Explain why modified Gram-Schmidt is numerically superior to classical
  • Construct QR decompositions via Gram-Schmidt, Householder reflectors, and Givens rotations
  • Solve least squares problems via QR and explain why this is numerically superior to normal equations
  • State and prove the spectral theorem for real symmetric matrices
  • Apply orthogonal initialization, RoPE, and spectral normalization in AI contexts


1. Intuition

1.1 What Orthogonality "Buys" You

Orthogonality is the key that makes linear algebra computationally tractable. Without it, working in a basis requires solving systems of equations. With it, everything simplifies to dot products.

Five computational miracles of orthogonality:

Miracle 1: Coordinates become dot products. In a general basis $\{\mathbf{b}_1, \ldots, \mathbf{b}_n\}$, finding the coordinates of a vector $\mathbf{v} = \sum c_i \mathbf{b}_i$ requires solving an $n \times n$ linear system. In an orthonormal basis $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$, the coordinate is $c_i = \langle \mathbf{v}, \mathbf{q}_i \rangle$ - just a dot product. No system to solve.

Miracle 2: Length is preserved component-wise. In an ONB: $\|\mathbf{v}\|^2 = \sum_i c_i^2$ (Parseval's identity). The squared norm is simply the sum of squared coordinates - no cross terms.

Miracle 3: Orthogonal matrices have condition number 1. The condition number of $Q$ is $\|Q\| \cdot \|Q^{-1}\| = 1 \cdot 1 = 1$. This is the best possible - orthogonal systems are maximally well-conditioned. Solving $Q\mathbf{x} = \mathbf{b}$ is trivially $\mathbf{x} = Q^\top \mathbf{b}$.

Miracle 4: Projection is the best approximation. The orthogonal projection of $\mathbf{b}$ onto a subspace $S$ gives the closest point in $S$ to $\mathbf{b}$. This is the geometric foundation of least squares, PCA, and low-rank approximation.

Miracle 5: Composition of orthogonal maps is free. Two isometries composed give another isometry. Chaining rotations, reflections, and permutations never accumulates numerical error.

THE ORTHOGONALITY ADVANTAGE
========================================================================

  General basis {b_1, ..., b_n}         Orthonormal basis {q_1, ..., q_n}
  -----------------------------        ----------------------------------

  Coordinates:  solve Bc = v           Coordinates:  c_i = <v, q_i> (dot!)
  Norm:         v^T B^T B v            Norm:         sum c_i^2 (Parseval)
  Projection:   A(A^T A)^-1 A^T b      Projection:   QQ^T b (just Q^T then Q)
  Solve Ax=b:   Gaussian elim.         Solve Qx=b:   x = Q^T b (trivial)
  Condition#:   can be >> 1            Condition#:   exactly 1

========================================================================
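The left and right columns of this box can be checked directly. The following minimal numpy sketch (illustrative, not from the companion notebooks) contrasts the two: coordinates in a general basis require solving $B\mathbf{c} = \mathbf{v}$, while an orthonormal basis needs only dot products, and Parseval's identity holds for free.

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.normal(size=(5, 5))                   # general basis (columns)
    Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # orthonormal basis (columns)
    v = rng.normal(size=5)

    c_general = np.linalg.solve(B, v)    # must solve a linear system
    c_ortho = Q.T @ v                    # just dot products

    assert np.allclose(B @ c_general, v)
    assert np.allclose(Q @ c_ortho, v)                # reconstruct: v = Q c
    assert np.isclose(np.sum(c_ortho**2), v @ v)      # Parseval's identity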

1.2 Geometric Picture

In Euclidean space, two vectors are orthogonal if they meet at a right angle. The dot product measures "how much they point in the same direction":

$$\langle \mathbf{u}, \mathbf{v} \rangle = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta$$

  • $\theta = 0^\circ$: $\langle\mathbf{u},\mathbf{v}\rangle = \|\mathbf{u}\|\|\mathbf{v}\|$ (parallel, maximum alignment)
  • $\theta = 90^\circ$: $\langle\mathbf{u},\mathbf{v}\rangle = 0$ (orthogonal, zero alignment)
  • $\theta = 180^\circ$: $\langle\mathbf{u},\mathbf{v}\rangle = -\|\mathbf{u}\|\|\mathbf{v}\|$ (antiparallel, maximum anti-alignment)

The projection picture. The orthogonal projection of $\mathbf{v}$ onto $\mathbf{u}$ is the "shadow" of $\mathbf{v}$ cast perpendicularly onto the line through $\mathbf{u}$:

$$\operatorname{proj}_{\mathbf{u}}\mathbf{v} = \frac{\langle\mathbf{v},\mathbf{u}\rangle}{\|\mathbf{u}\|^2}\mathbf{u}$$

The error $\mathbf{v} - \operatorname{proj}_{\mathbf{u}}\mathbf{v}$ is orthogonal to $\mathbf{u}$. This Pythagorean decomposition is the geometric core of all least-squares theory.

Orthonormal bases as "rulers." A set of orthonormal vectors $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ forms a perfect coordinate system: each $\mathbf{q}_i$ is a unit "ruler" pointing in an independent direction. The coordinate of $\mathbf{v}$ along $\mathbf{q}_i$ is exactly $\langle\mathbf{v}, \mathbf{q}_i\rangle$ - how far $\mathbf{v}$ extends in the $i$-th direction.

1.3 Why Orthogonality Matters for AI

PCA finds orthogonal directions of variance. The principal components are the eigenvectors of the sample covariance matrix $C = X^\top X/(n-1)$. By the spectral theorem (Section 8), these eigenvectors are orthogonal. PCA is possible - and gives uncorrelated components - precisely because of orthogonality.

Attention subspaces. In multi-head attention, each head projects queries and keys into a $d_k$-dimensional subspace. The heads are designed to attend to different aspects of the input. The more orthogonal the subspaces, the more independent the information each head extracts.

Orthogonal weight initialization (Saxe et al., 2013) initializes weight matrices as random orthogonal matrices (via QR decomposition of a random Gaussian matrix). This preserves the norm of activations and gradients through linear layers, enabling training of very deep networks without vanishing/exploding gradients.

RoPE (Su et al., 2021) encodes position $m$ by rotating the query/key vectors by angle $m\theta_i$ in each 2D subspace. The key property: the dot product between a rotated query at position $m$ and a rotated key at position $n$ depends only on $m-n$ - position appears as a relative rotation. This works because rotation is orthogonal: it preserves inner products.

Numerical stability. QR decomposition is the backbone of numerical linear algebra. Householder QR is used to solve least-squares problems, compute eigenvalues (QR iteration), and factor matrices. Its superiority over LU (for rectangular systems) stems from orthogonality: the Householder reflectors are orthogonal transformations that never amplify errors.

1.4 Historical Timeline

| Year | Person | Contribution |
| --- | --- | --- |
| 1801 | Gauss | Least squares by orthogonal projection (unpublished until 1809) |
| 1805 | Legendre | Published least squares - minimizing sum of squared residuals |
| 1883 | Jørgen Gram | Gram-Schmidt process published in Danish |
| 1907 | Erhard Schmidt | Gram-Schmidt republished and popularized |
| 1958 | Alston Householder | Householder reflections for stable QR factorization |
| 1961 | J.G.F. Francis | Practical QR algorithm for eigenvalue computation |
| 1961 | V. Kublanovskaya | Independent discovery of QR algorithm (USSR) |
| 1965 | James Wilkinson | Error analysis of Householder QR; numerical stability theory |
| 2013 | Saxe, McClelland, Ganguli | Orthogonal initialization for deep linear networks |
| 2021 | Su et al. | RoPE: rotary positional encoding via orthogonal rotation matrices |
| 2021 | Hua et al. | Transformer Quality with orthogonal attention mechanisms |

2. Formal Definitions

2.1 Inner Products and the Angle Formula

Definition (Inner Product). A real inner product on a vector space $V$ is a function $\langle \cdot, \cdot \rangle: V \times V \to \mathbb{R}$ satisfying:

  1. Symmetry: $\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle$
  2. Linearity in first argument: $\langle a\mathbf{u} + b\mathbf{v}, \mathbf{w} \rangle = a\langle\mathbf{u},\mathbf{w}\rangle + b\langle\mathbf{v},\mathbf{w}\rangle$
  3. Positive definiteness: $\langle \mathbf{v}, \mathbf{v} \rangle \geq 0$ with equality iff $\mathbf{v} = \mathbf{0}$

The standard inner product (dot product) on $\mathbb{R}^n$ is $\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^\top \mathbf{v} = \sum_{i=1}^n u_i v_i$.

The induced norm $\|\mathbf{v}\| = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$ is the Euclidean length of $\mathbf{v}$.

Theorem 2.1.1 (Cauchy-Schwarz Inequality). For any $\mathbf{u}, \mathbf{v}$ in an inner product space:

$$|\langle\mathbf{u},\mathbf{v}\rangle| \leq \|\mathbf{u}\| \cdot \|\mathbf{v}\|$$

with equality iff $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent (one is a scalar multiple of the other).

Proof. Assume $\mathbf{v} \neq \mathbf{0}$ (the case $\mathbf{v} = \mathbf{0}$ is trivial). For any $t \in \mathbb{R}$, $\|\mathbf{u} + t\mathbf{v}\|^2 \geq 0$. Expanding:

$$\|\mathbf{u}\|^2 + 2t\langle\mathbf{u},\mathbf{v}\rangle + t^2\|\mathbf{v}\|^2 \geq 0$$

This is a quadratic in $t$ that is always non-negative, so its discriminant must be $\leq 0$:

$$4\langle\mathbf{u},\mathbf{v}\rangle^2 - 4\|\mathbf{u}\|^2\|\mathbf{v}\|^2 \leq 0$$

which gives $|\langle\mathbf{u},\mathbf{v}\rangle| \leq \|\mathbf{u}\|\|\mathbf{v}\|$. Equality holds iff $t^* = -\langle\mathbf{u},\mathbf{v}\rangle/\|\mathbf{v}\|^2$ gives $\|\mathbf{u} + t^*\mathbf{v}\| = 0$, i.e., $\mathbf{u} = -t^*\mathbf{v}$. $\square$

The angle formula. By Cauchy-Schwarz, $-1 \leq \langle\mathbf{u},\mathbf{v}\rangle/(\|\mathbf{u}\|\|\mathbf{v}\|) \leq 1$, so it is valid to define the angle $\theta$ between $\mathbf{u}$ and $\mathbf{v}$ by:

$$\cos\theta = \frac{\langle\mathbf{u},\mathbf{v}\rangle}{\|\mathbf{u}\|\|\mathbf{v}\|}, \quad \theta \in [0, \pi]$$

Corollary (Triangle Inequality). $\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|$.

Proof: $\|\mathbf{u}+\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + 2\langle\mathbf{u},\mathbf{v}\rangle + \|\mathbf{v}\|^2 \leq \|\mathbf{u}\|^2 + 2\|\mathbf{u}\|\|\mathbf{v}\| + \|\mathbf{v}\|^2 = (\|\mathbf{u}\|+\|\mathbf{v}\|)^2$. $\square$

2.2 Orthogonal and Orthonormal Vectors

Definition. Two vectors $\mathbf{u}, \mathbf{v}$ are orthogonal (written $\mathbf{u} \perp \mathbf{v}$) if $\langle\mathbf{u},\mathbf{v}\rangle = 0$.

A vector $\mathbf{v}$ is a unit vector if $\|\mathbf{v}\| = 1$. The normalization of nonzero $\mathbf{v}$ is $\hat{\mathbf{v}} = \mathbf{v}/\|\mathbf{v}\|$.

Two vectors are orthonormal if they are both orthogonal AND both unit vectors:

$$\langle\mathbf{q}_i, \mathbf{q}_j\rangle = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$

Key consequences of orthogonality:

1. The zero vector is orthogonal to everything: $\mathbf{0} \perp \mathbf{v}$ for all $\mathbf{v}$ (since $\langle\mathbf{0},\mathbf{v}\rangle = 0$).

2. Pythagorean theorem: If $\mathbf{u} \perp \mathbf{v}$, then $\|\mathbf{u} + \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$. Proof: $\|\mathbf{u}+\mathbf{v}\|^2 = \langle\mathbf{u}+\mathbf{v},\mathbf{u}+\mathbf{v}\rangle = \|\mathbf{u}\|^2 + 2\langle\mathbf{u},\mathbf{v}\rangle + \|\mathbf{v}\|^2$, and the middle term is zero. $\square$

3. Generalized Pythagorean theorem: For mutually orthogonal $\mathbf{v}_1, \ldots, \mathbf{v}_k$:

$$\left\|\sum_{i=1}^k \mathbf{v}_i\right\|^2 = \sum_{i=1}^k \|\mathbf{v}_i\|^2$$

4. Linearity of orthogonality: If $\mathbf{u} \perp \mathbf{v}$ and $\mathbf{u} \perp \mathbf{w}$, then $\mathbf{u} \perp (a\mathbf{v} + b\mathbf{w})$ for all scalars $a, b$.

2.3 Orthogonal Sets and Linear Independence

Definition. A set $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ is an orthogonal set if $\langle\mathbf{v}_i,\mathbf{v}_j\rangle = 0$ for all $i \neq j$. It is an orthonormal set if additionally each $\|\mathbf{v}_i\| = 1$.

Theorem 2.3.1. Any orthogonal set of nonzero vectors is linearly independent.

Proof. Suppose $\sum_{i=1}^k c_i \mathbf{v}_i = \mathbf{0}$. Take the inner product of both sides with $\mathbf{v}_j$:

$$\left\langle \sum_i c_i \mathbf{v}_i, \mathbf{v}_j \right\rangle = c_j \langle\mathbf{v}_j,\mathbf{v}_j\rangle = c_j \|\mathbf{v}_j\|^2 = 0$$

Since $\mathbf{v}_j \neq \mathbf{0}$, we have $\|\mathbf{v}_j\|^2 > 0$, so $c_j = 0$. This holds for all $j$. $\square$

Remark. This theorem gives an effortless proof of linear independence for orthogonal sets - no row reduction needed. The inner product "isolates" each coefficient.

Non-examples:

  • $\{(1,0), (1,1)\}$ is not orthogonal: $\langle(1,0),(1,1)\rangle = 1 \neq 0$.
  • $\{(1,1,0)/\sqrt{2},\ (0,0,1),\ (1,-1,0)/\sqrt{2},\ (1,0,0)\}$ - the last vector breaks orthogonality.

2.4 Orthonormal Bases

Definition. An orthonormal basis (ONB) for $V$ is a set $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ that is both orthonormal and a basis for $V$.

Theorem 2.4.1 (Fourier Coefficients). If $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ is an ONB for $V$, then for any $\mathbf{v} \in V$:

$$\mathbf{v} = \sum_{i=1}^n \langle\mathbf{v}, \mathbf{q}_i\rangle \mathbf{q}_i$$

The coefficients $\hat{v}_i = \langle\mathbf{v}, \mathbf{q}_i\rangle$ are called Fourier coefficients (or coordinates of $\mathbf{v}$ in the ONB).

Proof. Since $\{\mathbf{q}_i\}$ is a basis, $\mathbf{v} = \sum_j c_j \mathbf{q}_j$ for some unique $c_j$. Taking the inner product with $\mathbf{q}_i$:

$$\langle\mathbf{v}, \mathbf{q}_i\rangle = \sum_j c_j \langle\mathbf{q}_j, \mathbf{q}_i\rangle = \sum_j c_j \delta_{ji} = c_i \quad \square$$

Theorem 2.4.2 (Parseval's Identity). Under the same conditions:

$$\|\mathbf{v}\|^2 = \sum_{i=1}^n |\langle\mathbf{v}, \mathbf{q}_i\rangle|^2 = \sum_{i=1}^n |\hat{v}_i|^2$$

Proof. $\|\mathbf{v}\|^2 = \langle\sum_i \hat{v}_i\mathbf{q}_i, \sum_j \hat{v}_j\mathbf{q}_j\rangle = \sum_{i,j} \hat{v}_i\hat{v}_j \langle\mathbf{q}_i,\mathbf{q}_j\rangle = \sum_i \hat{v}_i^2$. $\square$

Remark. Parseval's identity says: in an ONB, the squared norm equals the sum of squared coordinates. There are no cross terms - the orthonormality kills them all. This is the high-dimensional Pythagorean theorem.

Matrix form. Assembling the ONB into a matrix $Q = [\mathbf{q}_1 | \cdots | \mathbf{q}_n]$, the Fourier decomposition is simply $\mathbf{v} = Q(Q^\top\mathbf{v})$: first compute coordinates $\hat{\mathbf{v}} = Q^\top\mathbf{v}$, then reconstruct $\mathbf{v} = Q\hat{\mathbf{v}}$.

2.5 Orthogonal Complements

Definition. Given a subspace $S \subseteq V$, the orthogonal complement is:

$$S^\perp = \{\mathbf{v} \in V : \langle\mathbf{v},\mathbf{s}\rangle = 0 \text{ for all } \mathbf{s} \in S\}$$

Theorem 2.5.1. $S^\perp$ is a subspace of $V$.

Proof: If $\mathbf{u}, \mathbf{v} \in S^\perp$ and $a, b \in \mathbb{R}$, then for any $\mathbf{s} \in S$: $\langle a\mathbf{u}+b\mathbf{v}, \mathbf{s}\rangle = a\langle\mathbf{u},\mathbf{s}\rangle + b\langle\mathbf{v},\mathbf{s}\rangle = 0$. $\square$

Theorem 2.5.2 (Orthogonal Decomposition). If $V$ is finite-dimensional and $S$ is a subspace, then:

$$V = S \oplus S^\perp \quad (\text{direct sum})$$

Every $\mathbf{v} \in V$ decomposes uniquely as $\mathbf{v} = \mathbf{v}_S + \mathbf{v}_{S^\perp}$ with $\mathbf{v}_S \in S$ and $\mathbf{v}_{S^\perp} \in S^\perp$.

Moreover: $\dim(S) + \dim(S^\perp) = \dim(V)$.

Proof. Let $\{\mathbf{q}_1, \ldots, \mathbf{q}_k\}$ be an ONB for $S$. Define $\mathbf{v}_S = \sum_{i=1}^k \langle\mathbf{v},\mathbf{q}_i\rangle\mathbf{q}_i$ (the projection onto $S$) and $\mathbf{v}_{S^\perp} = \mathbf{v} - \mathbf{v}_S$. Then $\mathbf{v} = \mathbf{v}_S + \mathbf{v}_{S^\perp}$ with $\mathbf{v}_S \in S$. For any $\mathbf{q}_j$:

$$\langle\mathbf{v}_{S^\perp}, \mathbf{q}_j\rangle = \langle\mathbf{v},\mathbf{q}_j\rangle - \langle\mathbf{v}_S,\mathbf{q}_j\rangle = \langle\mathbf{v},\mathbf{q}_j\rangle - \langle\mathbf{v},\mathbf{q}_j\rangle = 0$$

So $\mathbf{v}_{S^\perp} \perp \mathbf{q}_j$ for all $j$, hence $\mathbf{v}_{S^\perp} \in S^\perp$. Uniqueness: if $\mathbf{v} = \mathbf{w}_S + \mathbf{w}_{S^\perp}$ also, then $\mathbf{v}_S - \mathbf{w}_S = \mathbf{w}_{S^\perp} - \mathbf{v}_{S^\perp} \in S \cap S^\perp = \{\mathbf{0}\}$. $\square$

Connection to the four fundamental subspaces. For $A \in \mathbb{R}^{m \times n}$:

  • $\operatorname{null}(A) = \operatorname{row}(A)^\perp$ in $\mathbb{R}^n$
  • $\operatorname{null}(A^\top) = \operatorname{col}(A)^\perp$ in $\mathbb{R}^m$

These orthogonality relations are the deep structure behind the four fundamental subspaces (see 04: Linear Transformations, 4.5).
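A minimal numpy sketch of Theorem 2.5.2 (names and sizes here are illustrative): project a vector onto $S = \operatorname{col}(A)$, verify the two pieces are orthogonal, and note the dimension count.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(size=(5, 2))            # S = col(A), a 2D subspace of R^5
    Q, _ = np.linalg.qr(A)                 # ONB for S (thin QR)
    v = rng.normal(size=5)

    v_S = Q @ (Q.T @ v)                    # projection onto S
    v_perp = v - v_S                       # component in S^perp
    assert np.isclose(v_perp @ v_S, 0)     # the two pieces are orthogonal
    assert np.allclose(v_S + v_perp, v)    # v decomposes exactly
    # dim(S) + dim(S^perp) = 2 + 3 = 5 = dim(R^5)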

2.6 Orthogonality in Complex Inner Product Spaces

The theory extends to complex vector spaces with a Hermitian (sesquilinear) inner product:

$$\langle\mathbf{u},\mathbf{v}\rangle = \sum_{k=1}^n \overline{u_k}\,v_k$$

The complex analog of orthogonal matrices is unitary matrices $U \in \mathbb{C}^{n \times n}$ satisfying $U^*U = I$, where $U^* = \overline{U}^\top$ is the conjugate transpose.

Unitary group: $\mathcal{U}(n) = \{U \in \mathbb{C}^{n \times n} : U^*U = I\}$ contains all $n \times n$ unitary matrices. It reduces to $O(n)$ when restricted to real matrices.

Special unitary group: $\mathcal{SU}(n) = \{U \in \mathcal{U}(n) : \det(U) = 1\}$.

Key properties of unitary matrices:

  • $|\det(U)| = 1$ (a complex number of modulus 1, not just $\pm 1$)
  • Eigenvalues lie on the unit circle: $|\lambda_i| = 1$
  • They preserve complex inner products: $\langle U\mathbf{u}, U\mathbf{v}\rangle = \langle\mathbf{u},\mathbf{v}\rangle$
  • The DFT matrix $F_n/\sqrt{n}$ is unitary (not orthogonal, since its entries are complex)

Hermitian matrices ($A = A^*$) are the complex analogs of real symmetric matrices. The spectral theorem holds: every Hermitian matrix has real eigenvalues and a unitary eigenbasis.

For AI, complex-valued neural networks and quantum computing naturally use unitary matrices. The DFT, crucial for signal processing and some positional encoding schemes, is a unitary transform on $\mathbb{C}^n$.

2.7 Abstract Inner Product Spaces: Axioms

We have been working with the Euclidean inner product $\langle\mathbf{u},\mathbf{v}\rangle = \mathbf{u}^\top\mathbf{v}$. More generally, an inner product on a real vector space $V$ is a bilinear map $\langle\cdot,\cdot\rangle: V \times V \to \mathbb{R}$ satisfying:

| Axiom | Expression | Meaning |
| --- | --- | --- |
| Symmetry | $\langle\mathbf{u},\mathbf{v}\rangle = \langle\mathbf{v},\mathbf{u}\rangle$ | Order doesn't matter |
| Linearity | $\langle a\mathbf{u}+b\mathbf{v},\mathbf{w}\rangle = a\langle\mathbf{u},\mathbf{w}\rangle + b\langle\mathbf{v},\mathbf{w}\rangle$ | Linear in first argument |
| Positive definiteness | $\langle\mathbf{v},\mathbf{v}\rangle \geq 0$, with $= 0 \Rightarrow \mathbf{v} = \mathbf{0}$ | Self-inner product is positive |

Any positive definite symmetric matrix $M \succ 0$ defines an inner product:

$$\langle\mathbf{u},\mathbf{v}\rangle_M = \mathbf{u}^\top M\mathbf{v}$$

Example (precision-weighted inner product): In Bayesian statistics, the Mahalanobis distance between $\mathbf{x}$ and $\boldsymbol{\mu}$ is:

$$d_M(\mathbf{x},\boldsymbol{\mu}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})}$$

This is the distance induced by the inner product $\langle\mathbf{u},\mathbf{v}\rangle_{\Sigma^{-1}} = \mathbf{u}^\top\Sigma^{-1}\mathbf{v}$. In this metric, vectors along high-variance directions of $\Sigma$ are "shorter" (the precision $\Sigma^{-1}$ down-weights those directions).

Orthogonality is metric-dependent. Vectors that are orthogonal in the Euclidean metric may not be orthogonal in the Mahalanobis metric, and vice versa. The choice of inner product determines what "perpendicular" means.

For ML: The natural gradient method (Amari 1998) uses the Fisher information matrix $F$ to define the inner product, giving rise to a geometry where parameter updates are $\Delta\theta = F^{-1}\nabla\mathcal{L}$. This is gradient descent in the Riemannian metric $\langle\Delta\theta_1, \Delta\theta_2\rangle_F = \Delta\theta_1^\top F \Delta\theta_2$.
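A tiny sketch of metric-dependence (the covariance below is purely illustrative): the standard basis vectors are Euclidean-orthogonal but not orthogonal under $\langle\mathbf{u},\mathbf{v}\rangle_M = \mathbf{u}^\top M\mathbf{v}$ with $M = \Sigma^{-1}$.

    import numpy as np

    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])         # illustrative covariance
    M = np.linalg.inv(Sigma)               # precision defines <u,v>_M = u^T M v

    u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    print(u @ v)          # 0.0      -> orthogonal in the Euclidean metric
    print(u @ M @ v)      # nonzero  -> not orthogonal in the Mahalanobis metric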


3. Orthogonal Matrices

3.1 Definition and Characterizations

Definition. A square matrix $Q \in \mathbb{R}^{n \times n}$ is orthogonal if:

$$Q^\top Q = QQ^\top = I_n$$

Equivalently, $Q^{-1} = Q^\top$ - the transpose is the inverse.

Characterization via columns. $Q$ is orthogonal if and only if its columns $\mathbf{q}_1, \ldots, \mathbf{q}_n$ form an orthonormal basis for $\mathbb{R}^n$:

$$\mathbf{q}_i^\top \mathbf{q}_j = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$

The condition $Q^\top Q = I$ encodes exactly these $n^2$ inner products.

Characterization via rows. Since $QQ^\top = I$ as well, the rows of $Q$ also form an orthonormal basis. This is a non-trivial statement: a square matrix whose columns are orthonormal must also have orthonormal rows (and vice versa).

Characterization via norms (isometry property). $Q$ is orthogonal if and only if it preserves the Euclidean norm:

$$\|Q\mathbf{x}\|_2 = \|\mathbf{x}\|_2 \quad \text{for all } \mathbf{x} \in \mathbb{R}^n$$

Proof: $\|Q\mathbf{x}\|^2 = (Q\mathbf{x})^\top(Q\mathbf{x}) = \mathbf{x}^\top Q^\top Q \mathbf{x} = \mathbf{x}^\top I \mathbf{x} = \|\mathbf{x}\|^2$. $\square$

Characterization via inner products. $Q$ preserves inner products:

$$\langle Q\mathbf{x}, Q\mathbf{y}\rangle = \langle\mathbf{x},\mathbf{y}\rangle \quad \text{for all } \mathbf{x}, \mathbf{y}$$

Proof: $(Q\mathbf{x})^\top(Q\mathbf{y}) = \mathbf{x}^\top Q^\top Q \mathbf{y} = \mathbf{x}^\top \mathbf{y}$. $\square$

This means orthogonal matrices preserve angles and distances - they are isometries of Euclidean space.

3.2 The Orthogonal Group and Special Orthogonal Group

The set of all $n \times n$ orthogonal matrices forms the orthogonal group:

$$O(n) = \{Q \in \mathbb{R}^{n \times n} : Q^\top Q = I\}$$

Group axioms:

  • Closure: If $Q_1, Q_2 \in O(n)$, then $(Q_1 Q_2)^\top (Q_1 Q_2) = Q_2^\top Q_1^\top Q_1 Q_2 = I$, so $Q_1 Q_2 \in O(n)$.
  • Identity: $I \in O(n)$.
  • Inverses: $Q^{-1} = Q^\top \in O(n)$ (since $(Q^\top)^\top Q^\top = QQ^\top = I$).
  • Associativity: Inherited from matrix multiplication.

Determinant. Since $\det(Q^\top Q) = \det(Q)^2 = \det(I) = 1$, we have $\det(Q) = \pm 1$.

  • $\det(Q) = +1$: rotations - orientation-preserving isometries
  • $\det(Q) = -1$: improper rotations - reflections (or a rotation composed with a reflection)

The special orthogonal group $SO(n) = \{Q \in O(n) : \det(Q) = +1\}$ consists of pure rotations.

In 2D:

$$SO(2) = \left\{\begin{pmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{pmatrix} : \theta \in [0,2\pi)\right\}$$

This is the group of rotations of the plane.

In 3D: $SO(3)$ parameterizes 3D rotations. This group is the configuration space of a rigid body and appears throughout robotics, computer vision, and physics. Its double cover $SU(2)$ appears in quaternion-based rotation representations used in game engines and IMUs.

3.3 Condition Number and Numerical Properties

The condition number of a matrix measures how much it amplifies errors:

$$\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}}{\sigma_{\min}}$$

For orthogonal matrices $Q$: all singular values equal 1, so $\kappa(Q) = 1/1 = 1$. Orthogonal matrices have the best possible condition number.

This means:

  • Multiplying by $Q$ never amplifies rounding errors
  • Solving $Q\mathbf{x} = \mathbf{b}$ is trivially stable: $\mathbf{x} = Q^\top\mathbf{b}$
  • QR factorization (which uses orthogonal matrices) is numerically stable

For AI: Deep learning systems that maintain orthogonal weight matrices throughout training benefit from gradient signals that neither explode nor vanish. The isometry property means the network's "signal amplification" is exactly 1 at those layers.
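A quick numpy check (illustrative) that orthogonalizing a random matrix drops its condition number to exactly 1:

    import numpy as np

    rng = np.random.default_rng(3)
    M = rng.normal(size=(100, 100))
    Q, _ = np.linalg.qr(M)

    print(np.linalg.cond(M))   # typically well above 1 for a random matrix
    print(np.linalg.cond(Q))   # 1.0 up to rounding: all singular values are 1

    b = rng.normal(size=100)
    x = Q.T @ b                # solving Qx = b needs no elimination at all
    assert np.allclose(Q @ x, b)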

3.4 Eigenvalues of Orthogonal Matrices

Theorem. The eigenvalues of a real orthogonal matrix lie on the unit circle in $\mathbb{C}$: $|\lambda| = 1$.

Proof: If $Q\mathbf{v} = \lambda\mathbf{v}$ with $\mathbf{v} \neq \mathbf{0}$:

$$\|\mathbf{v}\| = \|Q\mathbf{v}\| = \|\lambda\mathbf{v}\| = |\lambda|\|\mathbf{v}\|$$

Dividing by $\|\mathbf{v}\| > 0$: $|\lambda| = 1$. $\square$

For real eigenvalues: $\lambda = +1$ (rotation-like) or $\lambda = -1$ (reflection-like). Complex eigenvalues come in conjugate pairs $e^{i\theta}, e^{-i\theta}$ corresponding to 2D rotations.

3.5 The Cayley Transform

The Cayley transform establishes a correspondence between orthogonal matrices and skew-symmetric matrices (matrices satisfying $S^\top = -S$).

Definition. For a skew-symmetric matrix $S$ (such that $I + S$ is invertible):

$$Q = (I - S)(I + S)^{-1}$$

Claim: $Q$ is orthogonal.

Proof: Using $(I + S)^\top = I - S$ (since $S^\top = -S$):

$$Q^\top Q = ((I+S)^{-1})^\top (I-S)^\top (I-S)(I+S)^{-1} = (I-S)^{-1}(I+S)(I-S)(I+S)^{-1}$$

Since $I+S$ and $I-S$ are both polynomials in $S$, they commute:

$$(I+S)(I-S) = I - S^2 = (I-S)(I+S)$$

Hence $Q^\top Q = (I-S)^{-1}(I-S)(I+S)(I+S)^{-1} = I$. $\square$

Inverse Cayley transform. Given orthogonal $Q$ without eigenvalue $-1$:

$$S = (I - Q)(I + Q)^{-1}$$

Why this matters:

  • Skew-symmetric matrices form a vector space (easy to parameterize)
  • The Cayley transform maps this space bijectively to "most" orthogonal matrices
  • This gives a smooth parameterization of $O(n)$ near the identity - useful for Riemannian optimization

Example in 2D: $S = \begin{pmatrix}0 & t \\ -t & 0\end{pmatrix}$ (skew-symmetric) gives:

$$Q = \frac{1}{1+t^2}\begin{pmatrix}1-t^2 & -2t \\ 2t & 1-t^2\end{pmatrix}$$

With $t = \tan(\theta/2)$ (the Weierstrass substitution), this is exactly the rotation matrix $R_\theta$. The Cayley transform thus rationalizes the trigonometric functions appearing in rotation matrices.

Applications in neural networks: Neural networks that maintain orthogonal weight matrices throughout training can parameterize them via their skew-symmetric "log" and map back via the matrix exponential or Cayley transform. This enables gradient descent on $O(n)$ using Lie algebra tools.
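A numpy sketch of the construction (sizes illustrative): build a random skew-symmetric $S$, apply the Cayley transform, and confirm the result is orthogonal with determinant $+1$.

    import numpy as np

    rng = np.random.default_rng(4)
    B = rng.normal(size=(4, 4))
    S = (B - B.T) / 2                      # skew-symmetric: S^T = -S
    I = np.eye(4)

    Q = (I - S) @ np.linalg.inv(I + S)     # Cayley transform
    assert np.allclose(Q.T @ Q, I)         # orthogonal
    print(np.linalg.det(Q))                # +1: a rotation (no eigenvalue -1)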


4. Orthogonal Projections

4.1 The Projection Formula

Definition. The orthogonal projection of $\mathbf{v}$ onto a subspace $S$ is the unique vector $\mathbf{v}_S \in S$ such that $\mathbf{v} - \mathbf{v}_S \perp S$.

Formula (1D case). Projection onto the line spanned by a unit vector $\hat{\mathbf{u}}$:

$$\operatorname{proj}_{\hat{\mathbf{u}}}(\mathbf{v}) = \langle\mathbf{v},\hat{\mathbf{u}}\rangle\,\hat{\mathbf{u}} = \hat{\mathbf{u}}\hat{\mathbf{u}}^\top\mathbf{v}$$

The matrix $P = \hat{\mathbf{u}}\hat{\mathbf{u}}^\top$ is the rank-1 projection matrix.

Formula (general case). If $S = \operatorname{col}(A)$ for a matrix $A$ with linearly independent columns:

$$\operatorname{proj}_S(\mathbf{v}) = A(A^\top A)^{-1}A^\top\mathbf{v}$$

The matrix $P = A(A^\top A)^{-1}A^\top$ is the projection matrix onto $\operatorname{col}(A)$.

When $A$ has orthonormal columns (i.e., $A = Q$ with $Q^\top Q = I$), this simplifies dramatically:

$$P = QQ^\top$$

This is why orthonormal bases make projections computationally trivial.
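Both formulas in numpy, checked against each other (a minimal sketch; the matrix sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(5)
    A = rng.normal(size=(6, 3))            # independent columns, S = col(A)
    v = rng.normal(size=6)

    P = A @ np.linalg.inv(A.T @ A) @ A.T   # general formula A(A^T A)^{-1} A^T
    Q, _ = np.linalg.qr(A)                 # orthonormal columns spanning S
    assert np.allclose(P, Q @ Q.T)         # same projector, cheaper form
    assert np.allclose(P @ (P @ v), P @ v) # idempotent: P^2 = P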

4.2 Projection Matrices: Properties

A matrix $P$ is a projection if and only if it is idempotent: $P^2 = P$.

A projection is an orthogonal projection if and only if it is also symmetric: $P^\top = P$.

Verification: For $P = QQ^\top$ with $Q^\top Q = I$:

  • Idempotent: $P^2 = QQ^\top QQ^\top = Q(Q^\top Q)Q^\top = QIQ^\top = QQ^\top = P$ ✓
  • Symmetric: $P^\top = (QQ^\top)^\top = QQ^\top = P$ ✓

Eigenvalues of projections. If $P$ is a projection, its eigenvalues are 0 and 1.

Proof: If $P\mathbf{v} = \lambda\mathbf{v}$, then $\lambda\mathbf{v} = P\mathbf{v} = P^2\mathbf{v} = \lambda P\mathbf{v} = \lambda^2\mathbf{v}$, so $\lambda^2 = \lambda$, giving $\lambda = 0$ or $\lambda = 1$. $\square$

Geometrically: eigenvectors with $\lambda = 1$ lie in the range of $P$ (they are unchanged), while eigenvectors with $\lambda = 0$ lie in the null space (they are annihilated).

4.3 Best Approximation Theorem

Theorem (Best Approximation). The projection $\mathbf{v}_S = \operatorname{proj}_S(\mathbf{v})$ is the closest point in $S$ to $\mathbf{v}$:

$$\|\mathbf{v} - \mathbf{v}_S\|^2 \leq \|\mathbf{v} - \mathbf{s}\|^2 \quad \text{for all } \mathbf{s} \in S$$

with equality only when $\mathbf{s} = \mathbf{v}_S$.

Proof. For any $\mathbf{s} \in S$:

$$\|\mathbf{v} - \mathbf{s}\|^2 = \|(\mathbf{v}-\mathbf{v}_S) + (\mathbf{v}_S - \mathbf{s})\|^2 = \|\mathbf{v}-\mathbf{v}_S\|^2 + 2\langle\mathbf{v}-\mathbf{v}_S, \mathbf{v}_S-\mathbf{s}\rangle + \|\mathbf{v}_S-\mathbf{s}\|^2$$

Since $\mathbf{v}-\mathbf{v}_S \perp S$ and $\mathbf{v}_S - \mathbf{s} \in S$, the cross term vanishes:

$$= \|\mathbf{v}-\mathbf{v}_S\|^2 + \|\mathbf{v}_S-\mathbf{s}\|^2 \geq \|\mathbf{v}-\mathbf{v}_S\|^2 \quad \square$$

This theorem is everywhere in machine learning:

  • Least squares: The normal equations give the projection of $\mathbf{b}$ onto $\operatorname{col}(A)$
  • PCA: Principal components span the projection subspace of maximum variance (-> 03: PCA)
  • Attention: Softmax attention can be viewed as computing a weighted projection of value vectors
  • Linear regression: The fitted values are $\hat{\mathbf{y}} = H\mathbf{y}$, where $H = X(X^\top X)^{-1}X^\top$ is the hat matrix

4.4 Projection in Coordinate Systems

Given an ONB $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ for $\mathbb{R}^n$, the projection onto the subspace $S = \operatorname{span}(\mathbf{q}_1, \ldots, \mathbf{q}_k)$ (for $k \leq n$) is:

$$P_S = \sum_{i=1}^k \mathbf{q}_i\mathbf{q}_i^\top$$

This is a sum of rank-1 projectors, one per basis direction. The decomposition:

$$I = P_S + P_{S^\perp} = \sum_{i=1}^k \mathbf{q}_i\mathbf{q}_i^\top + \sum_{i=k+1}^n \mathbf{q}_i\mathbf{q}_i^\top$$

is the resolution of the identity - every vector is split into its $S$ and $S^\perp$ components.


5. Gram-Schmidt Orthogonalization

5.1 The Algorithm: Classical Gram-Schmidt

Problem. Given linearly independent vectors $\mathbf{a}_1, \ldots, \mathbf{a}_k$, construct an orthonormal basis for $\operatorname{span}(\mathbf{a}_1, \ldots, \mathbf{a}_k)$.

Classical Gram-Schmidt (CGS):

Initialization: $\mathbf{q}_1 = \mathbf{a}_1 / \|\mathbf{a}_1\|$

Iteration: For $j = 2, \ldots, k$:

$$\mathbf{u}_j = \mathbf{a}_j - \sum_{i=1}^{j-1}\langle\mathbf{a}_j, \mathbf{q}_i\rangle\,\mathbf{q}_i \qquad \mathbf{q}_j = \frac{\mathbf{u}_j}{\|\mathbf{u}_j\|}$$

What it does geometrically: At step $j$, we subtract the projection of $\mathbf{a}_j$ onto the span of the already-processed vectors, leaving the component orthogonal to everything before it. Normalizing gives the $j$-th orthonormal vector.

Inductive proof of correctness. After step $j$, $\{\mathbf{q}_1, \ldots, \mathbf{q}_j\}$ is an ONB for $\operatorname{span}(\mathbf{a}_1, \ldots, \mathbf{a}_j)$.

Base ($j=1$): $\mathbf{q}_1 = \mathbf{a}_1/\|\mathbf{a}_1\|$ is a unit vector spanning the same subspace. ✓

Step: Assume the claim for $j-1$. Then $\mathbf{u}_j = \mathbf{a}_j - P_{j-1}\mathbf{a}_j$, where $P_{j-1}$ projects onto $\operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_{j-1})$.

For any $\ell < j$: $\langle\mathbf{u}_j, \mathbf{q}_\ell\rangle = \langle\mathbf{a}_j,\mathbf{q}_\ell\rangle - \langle\mathbf{a}_j,\mathbf{q}_\ell\rangle = 0$ ✓

Since $\mathbf{a}_1,\ldots,\mathbf{a}_j$ are independent, $\mathbf{u}_j \neq \mathbf{0}$ (otherwise $\mathbf{a}_j \in \operatorname{span}(\mathbf{a}_1,\ldots,\mathbf{a}_{j-1})$). So $\mathbf{q}_j = \mathbf{u}_j/\|\mathbf{u}_j\|$ is a valid unit vector. $\square$

5.2 Modified Gram-Schmidt

The numerical problem with CGS. In floating-point arithmetic, CGS can catastrophically lose orthogonality. This happens when $\mathbf{a}_j$ is nearly in the span of $\{\mathbf{a}_1,\ldots,\mathbf{a}_{j-1}\}$ - rounding errors in the projection subtraction can accumulate, producing a $\mathbf{u}_j$ that isn't truly orthogonal to the earlier vectors.

Modified Gram-Schmidt (MGS) reorders the computation to orthogonalize against the already-computed $\mathbf{q}_i$ sequentially rather than all at once (here as runnable numpy, assuming independent columns):

    import numpy as np

    def mgs(A):                              # A: (m, n) with independent columns
        Q = np.array(A, dtype=float)
        for j in range(Q.shape[1]):
            for i in range(j):               # orthogonalize against q_i using the CURRENT column
                Q[:, j] -= (Q[:, i] @ Q[:, j]) * Q[:, i]
            Q[:, j] /= np.linalg.norm(Q[:, j])
        return Q

Why MGS is more stable. In CGS, all projections use the original $\mathbf{a}_j$. In MGS, after each orthogonalization step the updated vector is used - so numerical errors from the first projection are themselves projected away in subsequent steps.

Theorem (Björck 1967). MGS is backward stable: the computed $\hat{\mathbf{q}}_i$ satisfy $\hat{\mathbf{q}}_i^\top \hat{\mathbf{q}}_j = \delta_{ij} + O(\epsilon_{\text{mach}}\,\kappa(A))$, where $\kappa(A)$ is the condition number of the input matrix. CGS satisfies only $O(\epsilon_{\text{mach}}\,\kappa(A)^2)$.

Practical advice: Use MGS for explicit basis construction; use Householder for QR factorization (even more stable).
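The stability gap is easy to provoke. The sketch below (using the mgs listing above and a nearly rank-deficient test matrix, a standard construction) compares the loss of orthogonality $\|Q^\top Q - I\|$ for the two variants:

    import numpy as np

    def cgs(A):
        """Classical Gram-Schmidt: all projections use the ORIGINAL column."""
        A = A.astype(float); Q = np.zeros_like(A)
        for j in range(A.shape[1]):
            u = A[:, j] - Q[:, :j] @ (Q[:, :j].T @ A[:, j])
            Q[:, j] = u / np.linalg.norm(u)
        return Q

    eps = 1e-8                               # columns nearly span a single line
    A = np.array([[1, 1, 1],
                  [eps, 0, 0],
                  [0, eps, 0],
                  [0, 0, eps]])

    I = np.eye(3)
    print(np.linalg.norm(cgs(A).T @ cgs(A) - I))   # large: orthogonality lost
    print(np.linalg.norm(mgs(A).T @ mgs(A) - I))   # orders of magnitude smaller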

5.3 Gram-Schmidt and QR Factorization

Gram-Schmidt applied to the columns of $A = [\mathbf{a}_1|\cdots|\mathbf{a}_n]$ produces a QR factorization:

$$A = QR$$

where $Q = [\mathbf{q}_1|\cdots|\mathbf{q}_n]$ has orthonormal columns and $R$ is upper triangular with:

$$R_{ij} = \begin{cases}\langle\mathbf{a}_j, \mathbf{q}_i\rangle & i < j \\ \|\mathbf{u}_j\| & i = j \\ 0 & i > j\end{cases}$$

The diagonal entries $R_{jj} = \|\mathbf{u}_j\|$ are positive (since $\mathbf{u}_j \neq \mathbf{0}$ for independent columns).

This gives a unique QR factorization of $A$ with positive diagonal entries of $R$.

For AI: Gram-Schmidt implicitly occurs in layer normalization variants, in computing orthogonal weight bases for LoRA, and in the derivation of stable training algorithms.


6. Householder Reflectors and Givens Rotations

6.1 Householder Reflectors

Motivation. While Gram-Schmidt produces QR by building $Q$ column by column, Householder applies successive orthogonal transformations to reduce $A$ to upper triangular form $R$ directly - achieving greater numerical stability.

Definition. The Householder reflector determined by a unit vector $\mathbf{n}$ (the normal to the mirror) is:

$$H = I - 2\mathbf{n}\mathbf{n}^\top$$

Properties:

  • Symmetric: $H^\top = H$ ✓
  • Orthogonal: $H^\top H = H^2 = (I - 2\mathbf{n}\mathbf{n}^\top)^2 = I - 4\mathbf{n}\mathbf{n}^\top + 4\mathbf{n}\mathbf{n}^\top\mathbf{n}\mathbf{n}^\top = I - 4\mathbf{n}\mathbf{n}^\top + 4\mathbf{n}\mathbf{n}^\top = I$ ✓
  • Involution: $H^2 = I$, so $H = H^{-1}$ ✓
  • Determinant: $\det(H) = -1$ (it's a reflection, not a rotation)

Action: $H$ reflects any vector across the hyperplane with normal $\mathbf{n}$:

  • Vectors in $\mathbf{n}^\perp$ are unchanged: $H\mathbf{v} = \mathbf{v}$ if $\langle\mathbf{v},\mathbf{n}\rangle = 0$
  • The vector $\mathbf{n}$ is negated: $H\mathbf{n} = \mathbf{n} - 2\mathbf{n} = -\mathbf{n}$

Zeroing a column. To zero out all components of $\mathbf{a}$ below the first entry, choose:

$$\mathbf{n} = \frac{\mathbf{a} + \|\mathbf{a}\|\mathbf{e}_1}{\|\mathbf{a} + \|\mathbf{a}\|\mathbf{e}_1\|} \quad \Rightarrow \quad H\mathbf{a} = -\|\mathbf{a}\|\mathbf{e}_1$$

(In general one takes $\mathbf{a} + \operatorname{sign}(a_1)\|\mathbf{a}\|\mathbf{e}_1$; matching the sign of $a_1$ avoids catastrophic cancellation.)
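A minimal numpy sketch of this step (the helper name householder_vector is illustrative):

    import numpy as np

    def householder_vector(a):
        """Unit normal n of H = I - 2 n n^T mapping a onto the e1 axis."""
        a = a.astype(float)
        s = np.sign(a[0]) if a[0] != 0 else 1.0   # sign choice avoids cancellation
        n = a.copy()
        n[0] += s * np.linalg.norm(a)
        return n / np.linalg.norm(n)

    a = np.array([3.0, 1.0, 2.0])
    n = householder_vector(a)
    H = np.eye(3) - 2.0 * np.outer(n, n)
    print(H @ a)      # approx [-3.7417, 0, 0], i.e. -||a|| e1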

6.2 QR via Householder

To factor $A \in \mathbb{R}^{m \times n}$:

  1. Find $H_1$ to zero the first column below row 1: $H_1 A$ has $\ast$ in position $(1,1)$ and zeros below
  2. Find $H_2$ (acting on the trailing $(m-1) \times (n-1)$ submatrix) to zero the second column below row 2
  3. Continue: $H_n \cdots H_2 H_1 A = R$ (upper triangular)

Then $Q = H_1 H_2 \cdots H_n$ (a product of reflectors, hence orthogonal), and $A = QR$.

Complexity: $\approx 2mn^2 - \tfrac{2}{3}n^3$ flops - the same asymptotic cost as Gram-Schmidt, with better numerical constants.

Stability: Householder QR is backward stable, meaning the computed factors satisfy $\hat{Q}\hat{R} = A + E$ where $\|E\| = O(\epsilon_{\text{mach}}\|A\|)$.

6.3 Givens Rotations

Definition. The Givens rotation $G(i,j,\theta) \in \mathbb{R}^{n \times n}$ rotates in the $(i,j)$-plane:

$$G(i,j,\theta)_{kl} = \begin{cases} \cos\theta & k=l=i \text{ or } k=l=j \\ -\sin\theta & k=i,\ l=j \\ \sin\theta & k=j,\ l=i \\ \delta_{kl} & \text{otherwise} \end{cases}$$

Purpose: Choose $\theta$ to zero a specific element $a_{ji}$ by rotating in the $(i,j)$-plane.

When to use: Givens rotations are preferred when $A$ is sparse (each rotation affects only two rows/columns), for parallel computation, and in eigenvalue algorithms like QR iteration.
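One Givens rotation zeroing a single entry, as a sketch (the plane indices and helper name are illustrative):

    import numpy as np

    def givens(a, b):
        """Return c, s with [[c, -s], [s, c]]^T @ [a, b] = [r, 0]."""
        r = np.hypot(a, b)
        return a / r, b / r

    x = np.array([3.0, 0.0, 4.0])
    c, s = givens(x[0], x[2])          # rotate in the (0, 2)-plane
    G = np.eye(3)
    G[0, 0] = c; G[2, 2] = c           # matches the definition above
    G[0, 2] = -s; G[2, 0] = s
    print(G.T @ x)                     # [5, 0, 0]: entry x[2] zeroed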


7. QR Decomposition: Theory and Applications

7.1 Existence and Uniqueness

Theorem (QR Decomposition). Every matrix $A \in \mathbb{R}^{m \times n}$ with $m \geq n$ admits a factorization:

$$A = QR$$

where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns ($Q^\top Q = I_n$) and $R \in \mathbb{R}^{n \times n}$ is upper triangular.

This is the thin (reduced) QR decomposition. There is also a full QR decomposition, where $Q \in \mathbb{R}^{m \times m}$ is a full orthogonal matrix and $R \in \mathbb{R}^{m \times n}$ is upper trapezoidal.

Uniqueness (when $A$ has full column rank). If we require $R_{ii} > 0$ for all $i$, the thin QR decomposition is unique.

Proof sketch: Two decompositions $Q_1 R_1 = Q_2 R_2$ give $Q_2^\top Q_1 = R_2 R_1^{-1}$. The left side is orthogonal; the right side is upper triangular. An upper triangular orthogonal matrix must be diagonal with $\pm 1$ entries. With the sign convention $R_{ii} > 0$, this diagonal must be $I$. $\square$

When $A$ has rank $r < n$: The decomposition still exists, but $R$ has $n - r$ zero diagonal entries. The columns of $Q$ corresponding to zero $R_{ii}$ can be chosen as any completion to an orthonormal set.

7.2 QR for Least Squares

Problem. Given $A \in \mathbb{R}^{m \times n}$ ($m > n$) and $\mathbf{b} \in \mathbb{R}^m$, solve:

$$\min_{\mathbf{x} \in \mathbb{R}^n} \|A\mathbf{x} - \mathbf{b}\|_2^2$$

Solution via QR. Let $A = QR$ (thin QR, $Q^\top Q = I$):

$$\|A\mathbf{x} - \mathbf{b}\|^2 = \|QR\mathbf{x} - \mathbf{b}\|^2 = \|R\mathbf{x} - Q^\top\mathbf{b}\|^2 + \|(I - QQ^\top)\mathbf{b}\|^2$$

The second term is independent of $\mathbf{x}$, so the minimum occurs at:

$$R\mathbf{x} = Q^\top\mathbf{b}$$

This is an upper triangular system, solved by back substitution in $O(n^2)$ operations.

Comparison with normal equations. The normal equations $(A^\top A)\mathbf{x} = A^\top\mathbf{b}$ are equivalent, but forming $A^\top A$ squares the condition number: $\kappa(A^\top A) = \kappa(A)^2$. QR avoids this squaring and is numerically preferred whenever $\kappa(A)$ is moderate.
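Both routes in numpy, as a sketch (a triangular solver such as scipy's solve_triangular would exploit the structure of $R$; the general solver suffices for a demo):

    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.normal(size=(50, 3))
    b = rng.normal(size=50)

    Q, R = np.linalg.qr(A)                     # thin QR
    x_qr = np.linalg.solve(R, Q.T @ b)         # R x = Q^T b

    x_ne = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations (squares kappa)
    assert np.allclose(x_qr, x_ne)             # agree here; QR is safer when A is ill-conditioned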

7.3 QR Algorithm for Eigenvalues

Preview. The QR algorithm - the standard method for computing eigenvalues of symmetric matrices - works by iterating:

$$A_0 = A, \quad A_{k+1} = R_k Q_k \quad \text{where } A_k = Q_k R_k$$

Under mild conditions, $A_k$ converges to upper triangular form (the Schur form), and the diagonal entries converge to eigenvalues. This connects orthogonal matrices directly to spectral theory.

Forward Reference: The full QR algorithm with shifts is covered in 08: Matrix Decompositions. The eigenvalue theory it computes is developed in 01: Eigenvalues and Eigenvectors.
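A bare-bones, unshifted iteration on a generic symmetric matrix, just to watch the diagonal converge (a sketch; practical solvers add shifts and deflation, per the forward reference):

    import numpy as np

    rng = np.random.default_rng(7)
    B = rng.normal(size=(4, 4))
    A = (B + B.T) / 2                    # symmetric test matrix

    Ak = A.copy()
    for _ in range(500):                 # A_{k+1} = R_k Q_k
        Qk, Rk = np.linalg.qr(Ak)
        Ak = Rk @ Qk

    print(np.sort(np.diag(Ak)))              # converged diagonal entries
    print(np.sort(np.linalg.eigvalsh(A)))    # matches the library eigenvalues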


8. The Spectral Theorem for Symmetric Matrices

8.1 Statement and Proof

Theorem (Spectral Theorem). Every real symmetric matrix $A = A^\top \in \mathbb{R}^{n \times n}$ satisfies:

  1. All eigenvalues are real: $\lambda_i \in \mathbb{R}$
  2. Eigenvectors for distinct eigenvalues are orthogonal
  3. There exists an orthonormal basis of eigenvectors: $A = Q\Lambda Q^\top$, where $Q \in O(n)$ and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$

Proof of (1): Suppose $A\mathbf{v} = \lambda\mathbf{v}$ with $\mathbf{v} \in \mathbb{C}^n$, $\mathbf{v} \neq \mathbf{0}$. Take the Hermitian inner product:

$$\bar{\mathbf{v}}^\top A\mathbf{v} = \lambda\bar{\mathbf{v}}^\top\mathbf{v} = \lambda\|\mathbf{v}\|^2$$

$$\overline{\bar{\mathbf{v}}^\top A\mathbf{v}} = \bar{\mathbf{v}}^\top A^\top\mathbf{v} = \bar{\mathbf{v}}^\top A\mathbf{v} \quad (\text{since } A = A^\top \text{ is real})$$

So $\lambda\|\mathbf{v}\|^2$ is real, hence $\lambda \in \mathbb{R}$. $\square$

Proof of (2): If $A\mathbf{u} = \lambda\mathbf{u}$ and $A\mathbf{v} = \mu\mathbf{v}$ with $\lambda \neq \mu$:

$$\lambda\langle\mathbf{u},\mathbf{v}\rangle = \langle A\mathbf{u},\mathbf{v}\rangle = \langle\mathbf{u},A^\top\mathbf{v}\rangle = \langle\mathbf{u},A\mathbf{v}\rangle = \mu\langle\mathbf{u},\mathbf{v}\rangle$$

So $(\lambda - \mu)\langle\mathbf{u},\mathbf{v}\rangle = 0$. Since $\lambda \neq \mu$: $\langle\mathbf{u},\mathbf{v}\rangle = 0$. $\square$

Proof of (3): By induction. The base case ($n=1$) is trivial. For the inductive step: let $\mathbf{q}_1$ be a unit eigenvector for eigenvalue $\lambda_1$. Let $W = \mathbf{q}_1^\perp$ (the orthogonal complement). Since $A$ is symmetric, $W$ is $A$-invariant (if $\langle\mathbf{w},\mathbf{q}_1\rangle = 0$, then $\langle A\mathbf{w},\mathbf{q}_1\rangle = \langle\mathbf{w},A\mathbf{q}_1\rangle = \lambda_1\langle\mathbf{w},\mathbf{q}_1\rangle = 0$). Apply the inductive hypothesis to the $(n-1)\times(n-1)$ symmetric restriction $A|_W$. $\square$
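numpy's eigh computes exactly this factorization for symmetric matrices; a quick check of $A = Q\Lambda Q^\top$:

    import numpy as np

    rng = np.random.default_rng(8)
    B = rng.normal(size=(5, 5))
    A = (B + B.T) / 2                        # real symmetric

    lam, Q = np.linalg.eigh(A)               # real eigenvalues, orthogonal Q
    assert np.allclose(Q.T @ Q, np.eye(5))   # orthonormal eigenbasis
    assert np.allclose(Q @ np.diag(lam) @ Q.T, A)   # A = Q Lambda Q^T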

8.2 The Rayleigh Quotient

Definition. The Rayleigh quotient of a symmetric matrix $A$ at a vector $\mathbf{x} \neq \mathbf{0}$ is:

$$R_A(\mathbf{x}) = \frac{\mathbf{x}^\top A\mathbf{x}}{\mathbf{x}^\top\mathbf{x}}$$

Key properties:

  • $R_A(\mathbf{q}_i) = \lambda_i$ for eigenvectors $\mathbf{q}_i$
  • $\lambda_{\min} \leq R_A(\mathbf{x}) \leq \lambda_{\max}$ for all $\mathbf{x}$
  • $\max_{\mathbf{x}} R_A(\mathbf{x}) = \lambda_{\max}$, achieved at the top eigenvector
  • $\min_{\mathbf{x}} R_A(\mathbf{x}) = \lambda_{\min}$, achieved at the bottom eigenvector

Proof of bounds: Expand $\mathbf{x} = \sum_i c_i \mathbf{q}_i$ in the eigenbasis:

$$R_A(\mathbf{x}) = \frac{\sum_i c_i^2 \lambda_i}{\sum_i c_i^2}$$

This is a weighted average of eigenvalues, hence bounded by $[\lambda_{\min}, \lambda_{\max}]$. $\square$

Courant-Fischer Min-Max Theorem. The $k$-th largest eigenvalue satisfies:

$$\lambda_k = \max_{\dim S = k}\, \min_{\mathbf{x} \in S \setminus \{0\}} R_A(\mathbf{x}) = \min_{\dim S = n-k+1}\, \max_{\mathbf{x} \in S \setminus \{0\}} R_A(\mathbf{x})$$

This variational characterization is fundamental to perturbation theory and numerical linear algebra.
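A sketch verifying the Rayleigh-quotient bounds empirically on random directions:

    import numpy as np

    rng = np.random.default_rng(9)
    B = rng.normal(size=(6, 6))
    A = (B + B.T) / 2

    lam = np.linalg.eigvalsh(A)              # sorted ascending
    def rayleigh(x): return (x @ A @ x) / (x @ x)

    samples = [rayleigh(rng.normal(size=6)) for _ in range(1000)]
    assert lam[0] <= min(samples) and max(samples) <= lam[-1]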

8.3 Quadratic Forms and Symmetric Matrices

A quadratic form is $f(\mathbf{x}) = \mathbf{x}^\top A\mathbf{x} = \sum_{i,j} A_{ij} x_i x_j$ for symmetric $A$.

Positive definiteness via the spectral theorem: $A \succ 0$ (positive definite) $\Leftrightarrow$ all eigenvalues $> 0$.

Proof: In the eigenbasis, $\mathbf{x}^\top A\mathbf{x} = \sum_i \lambda_i c_i^2$. This is positive for all $\mathbf{x} \neq \mathbf{0}$ iff all $\lambda_i > 0$. $\square$

For AI: The Hessian $\nabla^2\mathcal{L}$ of a loss function is symmetric. Its eigenvalues give the curvature in each direction - large eigenvalues are "sharp" directions (fast loss change), small eigenvalues are "flat" directions. The Rayleigh quotient appears in sharpness-aware minimization (SAM), gradient clipping analysis, and the study of training dynamics.


9. Orthogonality in Function Spaces

9.1 The L^2 Inner Product

Orthogonality extends naturally from finite-dimensional vector spaces to infinite-dimensional function spaces. The space $L^2([a,b])$ of square-integrable functions carries the inner product:

$$\langle f, g\rangle = \int_a^b f(x)g(x)\,dx$$

Orthogonality of functions: $f \perp g$ if $\int_a^b f(x)g(x)\,dx = 0$.

Example: On $[-\pi, \pi]$, the functions $\{\sin(nx), \cos(mx)\}$ form a mutually orthogonal family:

$$\int_{-\pi}^{\pi} \sin(nx)\cos(mx)\,dx = 0 \quad \text{for all } n, m$$

$$\int_{-\pi}^{\pi} \sin(nx)\sin(mx)\,dx = \pi\,\delta_{nm}$$

$$\int_{-\pi}^{\pi} \cos(nx)\cos(mx)\,dx = \pi\,\delta_{nm} \quad (n, m \geq 1)$$

This is the foundation of Fourier series - a basis expansion for $L^2$ functions.

9.2 Fourier Series as Orthonormal Expansion

Orthonormal basis for $L^2([-\pi,\pi])$:

$$\left\{\frac{1}{\sqrt{2\pi}},\, \frac{\cos(x)}{\sqrt{\pi}},\, \frac{\sin(x)}{\sqrt{\pi}},\, \frac{\cos(2x)}{\sqrt{\pi}},\, \frac{\sin(2x)}{\sqrt{\pi}},\, \ldots\right\}$$

The Fourier series of $f$ is its expansion in this ONB:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^\infty \left[a_n\cos(nx) + b_n\sin(nx)\right]$$

where the Fourier coefficients are orthogonal projections:

$$a_n = \frac{1}{\pi}\int_{-\pi}^{\pi}f(x)\cos(nx)\,dx, \qquad b_n = \frac{1}{\pi}\int_{-\pi}^{\pi}f(x)\sin(nx)\,dx$$

Parseval's theorem for Fourier series:

$$\frac{1}{\pi}\int_{-\pi}^{\pi}|f(x)|^2\,dx = \frac{a_0^2}{2} + \sum_{n=1}^\infty(a_n^2 + b_n^2)$$

This is the infinite-dimensional analog of $\|\mathbf{v}\|^2 = \sum_i \hat{v}_i^2$.

9.3 Discrete Fourier Transform as a Unitary Matrix

The Discrete Fourier Transform (DFT) maps $\mathbf{x} \in \mathbb{C}^n$ to $\hat{\mathbf{x}} \in \mathbb{C}^n$:

$$\hat{x}_k = \sum_{j=0}^{n-1} x_j\, \omega^{jk}, \qquad \omega = e^{-2\pi i/n}$$

In matrix form: $\hat{\mathbf{x}} = F_n\mathbf{x}$ where $(F_n)_{jk} = \omega^{jk}$.

Key fact: $F_n / \sqrt{n}$ is unitary (the complex analog of orthogonal):

$$\frac{1}{n}F_n^* F_n = I \quad \Leftrightarrow \quad \frac{F_n}{\sqrt{n}} \text{ is unitary}$$

This is because the DFT basis vectors $\mathbf{f}_k = (1, \omega^k, \omega^{2k}, \ldots, \omega^{(n-1)k})/\sqrt{n}$ are orthonormal in $\mathbb{C}^n$:

$$\langle\mathbf{f}_j, \mathbf{f}_k\rangle = \frac{1}{n}\sum_{\ell=0}^{n-1}\omega^{\ell(k-j)} = \delta_{jk}$$

(The last equality is the geometric series identity for roots of unity.)

Parseval's theorem for the DFT: $\|\hat{\mathbf{x}}\|^2 = n\|\mathbf{x}\|^2$.
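A numpy check of unitarity and the DFT Parseval identity (numpy's fft uses the same unnormalized convention as the formula above):

    import numpy as np

    n = 8
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    F = np.exp(-2j * np.pi * j * k / n)             # (F_n)_{jk} = omega^{jk}

    U = F / np.sqrt(n)
    assert np.allclose(U.conj().T @ U, np.eye(n))   # F_n / sqrt(n) is unitary

    x = np.random.default_rng(10).normal(size=n)
    x_hat = F @ x
    assert np.allclose(x_hat, np.fft.fft(x))        # same convention as np.fft
    assert np.isclose(np.linalg.norm(x_hat)**2, n * np.linalg.norm(x)**2)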

For AI: The DFT/FFT appears in:

  • Signal processing: Feature extraction from audio (MFCCs, spectrograms)
  • Efficient convolutions: $f * g = \mathcal{F}^{-1}(\mathcal{F}(f) \cdot \mathcal{F}(g))$ - used in efficient CNN implementations
  • Positional encodings: Sinusoidal positional embeddings in the original Transformer are Fourier features
  • SSMs: State space models (Mamba, S4) use structured orthogonal/unitary matrices

10. Applications in Machine Learning

10.1 Orthogonal Weight Initialization

The problem with random initialization. If weight matrices have singular values spread over a wide range, gradients either explode or vanish during backpropagation through many layers.

Saxe et al. (2013) showed that initializing with orthogonal matrices preserves gradient norms through linear networks:

Theorem. For a deep linear network with orthogonal weight matrices $W_1, \ldots, W_L$ (each $\in \mathbb{R}^{n \times n}$), the Jacobian of the full network is $W_L \cdots W_1$, which is also orthogonal (a product of orthogonal matrices). Hence $\|\partial\mathcal{L}/\partial\mathbf{x}\| = \|\partial\mathcal{L}/\partial\mathbf{y}\|$ - gradients neither explode nor vanish.

Practical implementation:

  1. Generate a random $n \times n$ matrix with i.i.d. standard normal entries: $M \sim \mathcal{N}(0, I)$
  2. Compute its QR decomposition: $M = QR$
  3. Use $Q$ (or $Q \cdot \operatorname{sign}(\operatorname{diag}(R))$ for a uniform distribution over $O(n)$) as the initial weight matrix

This is implemented as torch.nn.init.orthogonal_() in PyTorch.

Why it works beyond linear networks: Even in nonlinear networks, orthogonal initialization places the initial weights in a "good" part of weight space where gradients are well-conditioned at initialization. Empirically, networks with orthogonal initialization often converge faster on deep architectures.
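A numpy version of the three-step recipe, as a sketch (PyTorch's torch.nn.init.orthogonal_ packages the same idea; the function name here is illustrative):

    import numpy as np

    def orthogonal_init(n, rng):
        M = rng.normal(size=(n, n))          # step 1: iid Gaussian entries
        Q, R = np.linalg.qr(M)               # step 2: QR decomposition
        return Q * np.sign(np.diag(R))       # step 3: sign fix -> uniform over O(n)

    rng = np.random.default_rng(11)
    W = orthogonal_init(256, rng)
    s = np.linalg.svd(W, compute_uv=False)
    print(s.min(), s.max())                  # all singular values are exactly 1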

10.2 Rotary Position Embeddings (RoPE)

Standard self-attention lacks inherent position information. The original Transformer adds sinusoidal absolute position embeddings. RoPE (Su et al. 2021, used in LLaMA, GPT-NeoX, Falcon) takes a different approach: it encodes position by rotating query and key vectors.

Construction (2D case). For position $m$, the rotation matrix is:

$$R_m = \begin{pmatrix}\cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta\end{pmatrix}$$

For a $d$-dimensional embedding, pair up dimensions and apply $R_m$ to each pair, using $d/2$ different frequencies $\theta_i = 10000^{-2i/d}$.

Key property. The inner product of rotated queries and keys depends only on the relative position $m - n$:

$$\langle R_m \mathbf{q}, R_n \mathbf{k}\rangle = \mathbf{q}^\top R_m^\top R_n \mathbf{k} = \mathbf{q}^\top R_{n-m}\mathbf{k}$$

since $R_m^\top R_n = R_{n-m}$ (rotation matrices satisfy $R_a^\top R_b = R_{b-a}$, because the inverse of a rotation is the rotation in the opposite direction).

Orthogonality is central: The key identity $R_m^\top R_n = R_{n-m}$ uses the fact that $R_m$ is orthogonal ($R_m^\top = R_m^{-1}$), so $R_m^\top R_n = R_m^{-1}R_n = R_{n-m}$.
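A 2D numpy sketch of the relative-position property (the frequency and positions are illustrative):

    import numpy as np

    def rot(angle):
        c, s = np.cos(angle), np.sin(angle)
        return np.array([[c, -s], [s, c]])

    theta = 0.123                    # one RoPE frequency
    q = np.array([0.7, -1.2])
    k = np.array([0.3, 0.9])

    m, n = 10, 4                     # absolute positions
    lhs = (rot(m * theta) @ q) @ (rot(n * theta) @ k)
    rhs = q @ (rot((n - m) * theta) @ k)   # depends only on n - m
    assert np.isclose(lhs, rhs)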

10.3 Spectral Normalization

Spectral normalization (Miyato et al. 2018) stabilizes GAN training by constraining weight matrices to have spectral norm (= largest singular value) equal to 1.

Implementation: At each forward pass, divide the weight matrix by its spectral norm: $\hat{W} = W/\sigma_1(W)$.

Connection to orthogonality: A square matrix with $\sigma_1 = \sigma_2 = \cdots = 1$ is exactly an orthogonal matrix. Spectral normalization enforces $\sigma_1 = 1$ while letting the other singular values vary freely - a "one-sided" orthogonality constraint.

Why it helps GANs: The discriminator $D$ satisfies a Lipschitz condition $\|D(\mathbf{x}) - D(\mathbf{y})\| \leq \|\mathbf{x} - \mathbf{y}\|$ when all weight layers are spectrally normalized (and the activations are themselves 1-Lipschitz). This relates to the Wasserstein GAN objective, which requires a 1-Lipschitz discriminator.
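In practice $\sigma_1$ is estimated by a few steps of power iteration rather than a full SVD; a minimal numpy sketch (the function name is illustrative):

    import numpy as np

    def spectral_normalize(W, n_iters=50, rng=None):
        """Divide W by its largest singular value, estimated by power iteration."""
        rng = rng or np.random.default_rng(0)
        u = rng.normal(size=W.shape[0])
        for _ in range(n_iters):                 # power iteration on W W^T
            v = W.T @ u; v /= np.linalg.norm(v)
            u = W @ v;  u /= np.linalg.norm(u)
        sigma1 = u @ W @ v                       # estimate of the spectral norm
        return W / sigma1

    W = np.random.default_rng(12).normal(size=(64, 32))
    W_hat = spectral_normalize(W)
    print(np.linalg.svd(W_hat, compute_uv=False)[0])   # approx 1.0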

10.4 QR in Optimization: Orthogonal Gradients

Several modern optimization techniques use orthogonality:

Shampoo optimizer (Gupta et al. 2018): Maintains a Kronecker-factored preconditioner and periodically computes its orthogonal "square root" via eigendecomposition, orthogonalizing the gradient update directions.

Muon optimizer (2024): Orthogonalizes gradient matrices via Newton-Schulz iterations (a matrix-valued version of Newton's method for computing polar decomposition), then applies the orthogonalized gradient as an update. This ensures each weight matrix sees updates with balanced singular values, preventing the accumulation of rank deficiency.

Gradient orthogonality in continual learning: Methods like Gradient Episodic Memory (GEM) and Orthogonal Gradient Descent (OGD) project gradients for new tasks onto the orthogonal complement of the gradient space for previous tasks - preserving past performance while learning new tasks.



10b. Bessel's Inequality and Completeness

10b.1 Bessel's Inequality

When we expand a vector $\mathbf{v}$ in terms of an orthonormal set $\{\mathbf{q}_1, \ldots, \mathbf{q}_k\}$ that is not necessarily a complete basis for the whole space, we get Bessel's inequality:

$$\sum_{i=1}^k |\langle\mathbf{v}, \mathbf{q}_i\rangle|^2 \leq \|\mathbf{v}\|^2$$

Proof. Let $\mathbf{v}_k = \sum_{i=1}^k \langle\mathbf{v},\mathbf{q}_i\rangle\mathbf{q}_i$ be the projection of $\mathbf{v}$ onto $\operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_k)$. Then:

$$\|\mathbf{v} - \mathbf{v}_k\|^2 = \|\mathbf{v}\|^2 - \|\mathbf{v}_k\|^2 = \|\mathbf{v}\|^2 - \sum_{i=1}^k |\langle\mathbf{v},\mathbf{q}_i\rangle|^2 \geq 0$$

The inequality follows immediately. $\square$

Equality holds if and only if $\mathbf{v} \in \operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_k)$ - i.e., when $\mathbf{v}$ itself lies in the subspace spanned by the chosen orthonormal vectors.

Parseval's identity is the limiting case: when $\{\mathbf{q}_i\}$ is a complete ONB:

$$\sum_{i=1}^\infty |\langle\mathbf{v}, \mathbf{q}_i\rangle|^2 = \|\mathbf{v}\|^2$$

10b.2 Completeness of Orthonormal Sets

An orthonormal set $\mathcal{B} = \{\mathbf{q}_i\}$ in a Hilbert space $\mathcal{H}$ is complete (or maximal) if it satisfies any of the following equivalent conditions:

  1. $\mathcal{B}$ is an ONB: every $\mathbf{v} \in \mathcal{H}$ can be written $\mathbf{v} = \sum_i \langle\mathbf{v},\mathbf{q}_i\rangle\mathbf{q}_i$
  2. Parseval's identity holds: $\|\mathbf{v}\|^2 = \sum_i |\langle\mathbf{v},\mathbf{q}_i\rangle|^2$ for all $\mathbf{v}$
  3. $\langle\mathbf{v},\mathbf{q}_i\rangle = 0$ for all $i$ implies $\mathbf{v} = \mathbf{0}$ (no nonzero vector is orthogonal to all $\mathbf{q}_i$)

In finite dimensions, any ONB is automatically complete (by dimension counting). In infinite dimensions, completeness is a non-trivial requirement - this is why Hilbert space theory requires careful treatment.

For ML context: In kernel methods, the RKHS is an infinite-dimensional Hilbert space. A kernel function $k(\mathbf{x},\mathbf{y})$ defines an inner product, and whether the kernel features span the whole space (completeness/universality) determines the expressivity of the corresponding kernel machine.


10c. Detailed Worked Examples

10c.1 Projection in Least-Squares Regression

Setup. We observe m=4m = 4 data points (xi,yi)(x_i, y_i) and want to fit a line y=β0+β1xy = \beta_0 + \beta_1 x:

x=(1,2,3,4),y=(2.1,3.9,5.8,8.2)x = (1, 2, 3, 4)^\top, \quad y = (2.1, 3.9, 5.8, 8.2)^\top

Build the design matrix:

A=(11121314)A = \begin{pmatrix}1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4\end{pmatrix}

Gram-Schmidt on the columns of AA:

Column 1: a1=(1,1,1,1)\mathbf{a}_1 = (1,1,1,1)^\top

q1=a1a1=12(1,1,1,1)\mathbf{q}_1 = \frac{\mathbf{a}_1}{\|\mathbf{a}_1\|} = \frac{1}{2}(1,1,1,1)^\top

Column 2: a2=(1,2,3,4)\mathbf{a}_2 = (1,2,3,4)^\top. Project out q1\mathbf{q}_1:

$$\langle\mathbf{a}_2, \mathbf{q}_1\rangle = \frac{1}{2}(1+2+3+4) = 5$$

$$\mathbf{u}_2 = \mathbf{a}_2 - 5\mathbf{q}_1 = (1,2,3,4)^\top - \frac{5}{2}(1,1,1,1)^\top = \left(-\tfrac{3}{2}, -\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{3}{2}\right)^\top$$

$$\|\mathbf{u}_2\| = \sqrt{\tfrac{9}{4}+\tfrac{1}{4}+\tfrac{1}{4}+\tfrac{9}{4}} = \sqrt{5}, \quad \mathbf{q}_2 = \frac{1}{\sqrt{5}}\left(-\tfrac{3}{2}, -\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{3}{2}\right)^\top$$

The projection (fitted values):

y^=QQywhere Q=[q1q2]\hat{\mathbf{y}} = QQ^\top\mathbf{y} \quad \text{where } Q = [\mathbf{q}_1|\mathbf{q}_2]

This is the projection of y\mathbf{y} onto col(A)\operatorname{col}(A) = the space of all affine functions of xx. The residual yy^\mathbf{y} - \hat{\mathbf{y}} is orthogonal to every affine function.

Verification of the normal equations: The least-squares solution β^\hat{\boldsymbol{\beta}} satisfies AAβ^=AyA^\top A\hat{\boldsymbol{\beta}} = A^\top\mathbf{y}:

$$A^\top A = \begin{pmatrix}4 & 10 \\ 10 & 30\end{pmatrix}, \quad A^\top\mathbf{y} = \begin{pmatrix}20 \\ 60.1\end{pmatrix}$$

Solving: $\hat{\boldsymbol{\beta}} = (-0.05, 2.02)^\top$, so $\hat{y} = -0.05 + 2.02x$ - a good linear fit.
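The computation is easy to confirm numerically; a minimal NumPy sketch of the worked example:

import numpy as np

x = np.array([1., 2., 3., 4.])
y = np.array([2.1, 3.9, 5.8, 8.2])
A = np.column_stack([np.ones(4), x])   # design matrix [1 | x]

Q, R = np.linalg.qr(A)                 # thin QR
y_hat = Q @ (Q.T @ y)                  # projection of y onto col(A)
beta = np.linalg.solve(R, Q.T @ y)     # least-squares coefficients
print(beta)                            # [-0.05  2.02]
print(np.abs(A.T @ (y - y_hat)).max()) # ~0: residual is orthogonal to col(A)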

10c.2 Full Worked Example: Gram-Schmidt in R3\mathbb{R}^3

Input vectors:

a1=(110),a2=(101),a3=(011)\mathbf{a}_1 = \begin{pmatrix}1\\1\\0\end{pmatrix}, \quad \mathbf{a}_2 = \begin{pmatrix}1\\0\\1\end{pmatrix}, \quad \mathbf{a}_3 = \begin{pmatrix}0\\1\\1\end{pmatrix}

Step 1: q1=a1/a1=(1/2)(1,1,0)\mathbf{q}_1 = \mathbf{a}_1/\|\mathbf{a}_1\| = (1/\sqrt{2})(1,1,0)^\top

Step 2: Remove the component of a2\mathbf{a}_2 along q1\mathbf{q}_1:

$$\langle\mathbf{a}_2, \mathbf{q}_1\rangle = \frac{1}{\sqrt{2}}(1 \cdot 1 + 1 \cdot 0 + 0 \cdot 1) = \frac{1}{\sqrt{2}}$$

$$\mathbf{u}_2 = \mathbf{a}_2 - \frac{1}{\sqrt{2}}\mathbf{q}_1 = \begin{pmatrix}1\\0\\1\end{pmatrix} - \frac{1}{2}\begin{pmatrix}1\\1\\0\end{pmatrix} = \begin{pmatrix}1/2\\-1/2\\1\end{pmatrix}$$

$$\|\mathbf{u}_2\| = \sqrt{1/4 + 1/4 + 1} = \sqrt{3/2}, \quad \mathbf{q}_2 = \sqrt{\frac{2}{3}}\begin{pmatrix}1/2\\-1/2\\1\end{pmatrix} = \frac{1}{\sqrt{6}}\begin{pmatrix}1\\-1\\2\end{pmatrix}$$

Step 3: Remove components of a3\mathbf{a}_3 along q1\mathbf{q}_1 and q2\mathbf{q}_2:

$$\langle\mathbf{a}_3, \mathbf{q}_1\rangle = \frac{1}{\sqrt{2}}(0+1+0) = \frac{1}{\sqrt{2}}$$

$$\langle\mathbf{a}_3, \mathbf{q}_2\rangle = \frac{1}{\sqrt{6}}(0-1+2) = \frac{1}{\sqrt{6}}$$

$$\mathbf{u}_3 = \begin{pmatrix}0\\1\\1\end{pmatrix} - \frac{1}{\sqrt{2}}\cdot\frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\\0\end{pmatrix} - \frac{1}{\sqrt{6}}\cdot\frac{1}{\sqrt{6}}\begin{pmatrix}1\\-1\\2\end{pmatrix} = \begin{pmatrix}0\\1\\1\end{pmatrix} - \frac{1}{2}\begin{pmatrix}1\\1\\0\end{pmatrix} - \frac{1}{6}\begin{pmatrix}1\\-1\\2\end{pmatrix} = \begin{pmatrix}-2/3\\2/3\\2/3\end{pmatrix}$$

$$\|\mathbf{u}_3\| = \sqrt{4/9+4/9+4/9} = 2/\sqrt{3}, \quad \mathbf{q}_3 = \frac{1}{\sqrt{3}}\begin{pmatrix}-1\\1\\1\end{pmatrix}$$

Verification:

$$\langle\mathbf{q}_1,\mathbf{q}_2\rangle = (1/\sqrt{2})(1/\sqrt{6})(1-1+0) = 0 \quad\text{OK}$$
$$\langle\mathbf{q}_1,\mathbf{q}_3\rangle = (1/\sqrt{2})(1/\sqrt{3})(-1+1+0) = 0 \quad\text{OK}$$
$$\langle\mathbf{q}_2,\mathbf{q}_3\rangle = (1/\sqrt{6})(1/\sqrt{3})(-1-1+2) = 0 \quad\text{OK}$$

QR factorization: The matrix RR is upper triangular with:

R=(a1a2,q1a3,q10u2a3,q200u3)=(21/21/203/21/6002/3)R = \begin{pmatrix}\|\mathbf{a}_1\| & \langle\mathbf{a}_2,\mathbf{q}_1\rangle & \langle\mathbf{a}_3,\mathbf{q}_1\rangle \\ 0 & \|\mathbf{u}_2\| & \langle\mathbf{a}_3,\mathbf{q}_2\rangle \\ 0 & 0 & \|\mathbf{u}_3\|\end{pmatrix} = \begin{pmatrix}\sqrt{2} & 1/\sqrt{2} & 1/\sqrt{2} \\ 0 & \sqrt{3/2} & 1/\sqrt{6} \\ 0 & 0 & 2/\sqrt{3}\end{pmatrix}
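As a quick check, the same factorization can be reproduced numerically. A minimal NumPy sketch; note that np.linalg.qr may flip the signs of individual columns of Q (and the corresponding rows of R), so we compare magnitudes:

import numpy as np

A = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])
Q, R = np.linalg.qr(A)
print(np.allclose(Q @ R, A))            # reconstruction holds regardless of signs
print(np.allclose(Q.T @ Q, np.eye(3)))  # orthonormal columns
print(np.abs(R))                        # |R| matches the hand-computed entries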

10c.3 Spectral Decomposition Example

Let A=(3113)A = \begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix} (symmetric).

Eigenvalues: det(AλI)=(3λ)21=0λ=2\det(A - \lambda I) = (3-\lambda)^2 - 1 = 0 \Rightarrow \lambda = 2 or λ=4\lambda = 4.

Eigenvectors:

  • λ1=2\lambda_1 = 2: (A2I)v=0(1111)v=0v1=(1,1)/2(A - 2I)\mathbf{v} = 0 \Rightarrow \begin{pmatrix}1&1\\1&1\end{pmatrix}\mathbf{v} = 0 \Rightarrow \mathbf{v}_1 = (1,-1)^\top/\sqrt{2}
  • λ2=4\lambda_2 = 4: (A4I)v=0(1111)v=0v2=(1,1)/2(A - 4I)\mathbf{v} = 0 \Rightarrow \begin{pmatrix}-1&1\\1&-1\end{pmatrix}\mathbf{v} = 0 \Rightarrow \mathbf{v}_2 = (1,1)^\top/\sqrt{2}

Verify orthogonality: v1v2=(1)(1)+(1)(1)=0\mathbf{v}_1^\top\mathbf{v}_2 = (1)(1) + (-1)(1) = 0 OK

Spectral decomposition:

$$A = \lambda_1\mathbf{v}_1\mathbf{v}_1^\top + \lambda_2\mathbf{v}_2\mathbf{v}_2^\top = 2\begin{pmatrix}1/2 & -1/2 \\ -1/2 & 1/2\end{pmatrix} + 4\begin{pmatrix}1/2 & 1/2 \\ 1/2 & 1/2\end{pmatrix}$$

$$= \begin{pmatrix}1 & -1 \\ -1 & 1\end{pmatrix} + \begin{pmatrix}2 & 2 \\ 2 & 2\end{pmatrix} = \begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix} = A \checkmark$$

Rayleigh quotient at x=(1,0)\mathbf{x} = (1,0)^\top:

RA(x)=xAx=(1,0)(3113)(10)=3R_A(\mathbf{x}) = \mathbf{x}^\top A\mathbf{x} = (1,0)\begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix}\begin{pmatrix}1\\0\end{pmatrix} = 3

Indeed λmin=234=λmax\lambda_{\min} = 2 \leq 3 \leq 4 = \lambda_{\max}.
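The same example, verified in a few lines of NumPy (a minimal sketch):

import numpy as np

A = np.array([[3., 1.], [1., 3.]])
lam, Q = np.linalg.eigh(A)              # ascending eigenvalues, orthonormal eigenvectors
print(lam)                              # [2. 4.]
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))   # A = Q Lambda Q^T
x = np.array([1., 0.])
print(x @ A @ x / (x @ x))              # Rayleigh quotient = 3.0, inside [2, 4]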

10c.4 Householder Reflector: Concrete Computation

Problem: Find a Householder reflector $H$ that maps $\mathbf{a} = (4, 3)^\top$ onto the $\mathbf{e}_1$ axis, i.e. $H\mathbf{a} = \pm\|\mathbf{a}\|\mathbf{e}_1$ (the sign is fixed by the cancellation-avoiding convention below).

Step 1: a=16+9=5\|\mathbf{a}\| = \sqrt{16+9} = 5.

Step 2: Since a1=4>0a_1 = 4 > 0, use ++ sign to avoid cancellation:

nunnorm=a+ae1=(43)+(50)=(93)\mathbf{n}_{\text{unnorm}} = \mathbf{a} + \|\mathbf{a}\|\mathbf{e}_1 = \begin{pmatrix}4\\3\end{pmatrix} + \begin{pmatrix}5\\0\end{pmatrix} = \begin{pmatrix}9\\3\end{pmatrix}

Step 3: Normalize: nunnorm=81+9=90=310\|\mathbf{n}_{\text{unnorm}}\| = \sqrt{81+9} = \sqrt{90} = 3\sqrt{10}

n=1310(93)=110(31)\mathbf{n} = \frac{1}{3\sqrt{10}}\begin{pmatrix}9\\3\end{pmatrix} = \frac{1}{\sqrt{10}}\begin{pmatrix}3\\1\end{pmatrix}

Step 4: Build the reflector:

H=I2nn=(1001)210(9331)=(19/53/53/511/5)=(4/53/53/54/5)H = I - 2\mathbf{n}\mathbf{n}^\top = \begin{pmatrix}1 & 0 \\ 0 & 1\end{pmatrix} - \frac{2}{10}\begin{pmatrix}9 & 3 \\ 3 & 1\end{pmatrix} = \begin{pmatrix}1-9/5 & -3/5 \\ -3/5 & 1-1/5\end{pmatrix} = \begin{pmatrix}-4/5 & -3/5 \\ -3/5 & 4/5\end{pmatrix}

Verify: Ha=(4/53/53/54/5)(43)=(16/59/512/5+12/5)=(50)=5e1H\mathbf{a} = \begin{pmatrix}-4/5 & -3/5 \\ -3/5 & 4/5\end{pmatrix}\begin{pmatrix}4\\3\end{pmatrix} = \begin{pmatrix}-16/5-9/5\\-12/5+12/5\end{pmatrix} = \begin{pmatrix}-5\\0\end{pmatrix} = -5\mathbf{e}_1

This gives ae1-\|\mathbf{a}\|\mathbf{e}_1 (a valid Householder result; the sign depends on the convention). det(H)=1\det(H) = -1 as expected. OK
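The same construction in code - a minimal sketch; `householder_reflector` is an illustrative helper name, and the sign convention assumes the first entry is nonzero:

import numpy as np

def householder_reflector(a):
    """Reflector with H @ a = -sign(a[0]) * ||a|| * e1 (cancellation-safe sign)."""
    a = np.asarray(a, dtype=float)
    e1 = np.zeros_like(a); e1[0] = 1.0
    n = a + np.sign(a[0]) * np.linalg.norm(a) * e1   # assumes a[0] != 0
    n /= np.linalg.norm(n)
    return np.eye(len(a)) - 2.0 * np.outer(n, n)

H = householder_reflector([4., 3.])
print(H @ np.array([4., 3.]))   # [-5.  0.], matching the computation above
print(np.linalg.det(H))         # -1.0: a reflection reverses orientation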


10d. Orthogonality and the Geometry of Neural Networks

10d.1 The Geometry of Weight Matrices

A single linear layer y=Wx\mathbf{y} = W\mathbf{x} transforms the input geometry. Understanding this transformation requires decomposing WW via its singular value decomposition (preview):

W=UΣV(covered fully in 02: SVD)W = U\Sigma V^\top \qquad \text{(covered fully in 02: SVD)}
  • VV^\top: rotate the input space (first orthogonal transformation)
  • Σ\Sigma: scale along the new axes by singular values σ1σ20\sigma_1 \geq \sigma_2 \geq \cdots \geq 0
  • UU: rotate the output space (second orthogonal transformation)

For orthogonal WW: All singular values are 1, so Σ=I\Sigma = I. The layer is a pure rotation - it doesn't compress or expand the input distribution. Distances and angles are perfectly preserved.

Implication for gradient flow. The gradient of the loss with respect to the input is:

Lx=WLy\frac{\partial\mathcal{L}}{\partial\mathbf{x}} = W^\top\frac{\partial\mathcal{L}}{\partial\mathbf{y}}

For orthogonal WW: Wg=g\|W^\top\mathbf{g}\| = \|\mathbf{g}\| - gradients are neither amplified nor attenuated. This is the formal reason orthogonal initialization prevents gradient vanishing/explosion in deep linear networks.
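A quick numerical demonstration of this isometry, contrasting orthogonal and plain Gaussian chains (a minimal sketch; the 1/sqrt(d) scaling is the standard variance-preserving choice for the Gaussian baseline):

import numpy as np

rng = np.random.default_rng(2)
d, L = 64, 20

def random_orthogonal(d):
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))   # column sign fix -> Haar-distributed

g = rng.standard_normal(d)
g_orth, g_gauss = g.copy(), g.copy()
for _ in range(L):
    g_orth = random_orthogonal(d).T @ g_orth
    g_gauss = rng.standard_normal((d, d)).T @ g_gauss / np.sqrt(d)

print(np.linalg.norm(g), np.linalg.norm(g_orth))  # equal to machine precision
print(np.linalg.norm(g_gauss))                    # typically drifts away from ||g||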

10d.2 Attention Heads as Orthogonal Projections

In a Transformer's multi-head attention with HH heads:

headh=Attn(XWhQ,XWhK,XWhV)\text{head}_h = \text{Attn}(XW_h^Q, XW_h^K, XW_h^V)

where WhQ,WhK,WhVRd×dkW_h^Q, W_h^K, W_h^V \in \mathbb{R}^{d \times d_k} are the query/key/value projection matrices.

Mechanistic interpretability perspective (Elhage et al. 2021): Each attention head computes a projection of the residual stream onto a low-dimensional subspace, adds information from that subspace back, and (in some sense) different heads should attend to different "directions" of information.

Mathematical idealization: If we require WhV(WhV)=0W_h^V (W_{h'}^V)^\top = 0 for hhh \neq h' (the value matrices project to orthogonal subspaces), then different heads genuinely operate on independent information. In practice, heads aren't exactly orthogonal, but studying the overlap WhV(WhV)F\|W_h^V (W_{h'}^V)^\top\|_F quantifies how much heads interfere with each other.

QK circuit analysis. The effective attention pattern for head hh involves WhQ(WhK)Rd×dW_h^Q (W_h^K)^\top \in \mathbb{R}^{d \times d} (or rather its action on the residual stream). This matrix's singular values determine how "sharp" vs "diffuse" the attention pattern is.

10d.3 Layer Normalization and Orthogonal Complements

Layer normalization (Ba et al. 2016) operates on a vector xRd\mathbf{x} \in \mathbb{R}^d:

LayerNorm(x)=xμ1σγ+β\text{LayerNorm}(\mathbf{x}) = \frac{\mathbf{x} - \mu\mathbf{1}}{\sigma} \cdot \boldsymbol{\gamma} + \boldsymbol{\beta}

where μ=1dixi\mu = \frac{1}{d}\sum_i x_i and σ2=1di(xiμ)2\sigma^2 = \frac{1}{d}\sum_i (x_i - \mu)^2.

Orthogonal decomposition view. The centering step xxμ1\mathbf{x} \mapsto \mathbf{x} - \mu\mathbf{1} is an orthogonal projection onto the subspace 1={vRd:ivi=0}\mathbf{1}^\perp = \{\mathbf{v} \in \mathbb{R}^d : \sum_i v_i = 0\}:

P1=I11dP_{\mathbf{1}^\perp} = I - \frac{\mathbf{1}\mathbf{1}^\top}{d}

This is a rank-$(d-1)$ orthogonal projector. Layer normalization first projects out the "mean direction" $\mathbf{1}/\sqrt{d}$, then divides by $\sigma$; since $\|\mathbf{x} - \mu\mathbf{1}\| = \sigma\sqrt{d}$, the result always has norm $\sqrt{d}$ and lives on a sphere inside the $(d-1)$-dimensional subspace $\mathbf{1}^\perp$.

Consequence for representational geometry: After layer normalization, the model's internal representations h\mathbf{h} always satisfy h1\mathbf{h} \perp \mathbf{1}. The model can only represent information in the orthogonal complement of the constant vector - this is why all-zero inputs and constant inputs are indistinguishable after layer norm.
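The projector view is easy to verify directly (a minimal NumPy sketch):

import numpy as np

d = 6
P = np.eye(d) - np.ones((d, d)) / d   # centering projector onto 1-perp
x = np.array([3., 1., 4., 1., 5., 9.])

print(np.allclose(P @ x, x - x.mean()))             # centering = projection
print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
print(np.linalg.matrix_rank(P))                     # d - 1 = 5
print(round((P @ x).sum(), 12))                     # 0: result lies in 1-perp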

10d.4 Orthogonal Regularization in Weight Matrices

Several training techniques maintain (near-)orthogonality in weight matrices throughout training:

Orthogonal regularization: Add a penalty λWWIF2\lambda\|W^\top W - I\|_F^2 to the loss. This penalizes deviation from orthogonality without enforcing it exactly.

Spectral regularization: Penalize λi(σi1)2\lambda \sum_i (\sigma_i - 1)^2 - penalizes singular values deviating from 1. Stronger than spectral normalization, which only clips the largest singular value.

Riemannian optimization on the Stiefel manifold: The set of n×kn \times k matrices with orthonormal columns, St(n,k)={QRn×k:QQ=Ik}\text{St}(n,k) = \{Q \in \mathbb{R}^{n \times k} : Q^\top Q = I_k\}, is a smooth Riemannian manifold. Gradient descent on St(n,k)\text{St}(n,k) uses the Riemannian gradient (projection of Euclidean gradient onto the tangent space) and retracts to the manifold via QR decomposition or Cayley transform after each step.

Applications: Orthogonal RNNs, rotation-equivariant neural networks, invertible neural networks, and normalizing flows all benefit from exact or approximate orthogonality constraints.


11. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Confusing "orthogonal" and "orthonormal" | Orthogonal means $\langle\mathbf{u},\mathbf{v}\rangle = 0$; orthonormal additionally requires unit length. An orthogonal matrix $Q$ has orthonormal columns, not just orthogonal ones. | Check: orthogonal means zero inner products; orthonormal means zero inner products AND $\|\mathbf{q}_i\| = 1$ |
| 2 | Assuming $Q^\top Q = I$ implies $QQ^\top = I$ without checking squareness | For non-square $Q \in \mathbb{R}^{m \times n}$ ($m > n$), $Q^\top Q = I_n$ but $QQ^\top \neq I_m$ (it is a rank-$n$ projection). Only square orthogonal matrices satisfy both. | Specify dimensions: "thin $Q$" ($Q^\top Q = I_n$) vs "full $Q$" ($QQ^\top = I_m = Q^\top Q$) |
| 3 | Thinking Gram-Schmidt preserves span ordering under pivoting | CGS does preserve it: $\operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_k) = \operatorname{span}(\mathbf{a}_1,\ldots,\mathbf{a}_k)$. But with pivoting or reordering, this property is lost. | Track column ordering explicitly when pivoted QR is used |
| 4 | Using CGS when numerical stability matters | Classical Gram-Schmidt squares the error: $O(\epsilon_\text{mach}\kappa(A)^2)$ vs MGS's $O(\epsilon_\text{mach}\kappa(A))$. For ill-conditioned $A$, CGS can produce non-orthogonal "orthonormal" vectors. | Use Modified Gram-Schmidt (MGS) or Householder QR for production code |
| 5 | Projecting with the ONB formula onto a non-orthonormal basis | The formula $P = QQ^\top$ assumes $Q^\top Q = I$. If the columns are merely linearly independent, the correct formula is $P = Q(Q^\top Q)^{-1}Q^\top$, which involves the $(Q^\top Q)$ inverse. | Check that $Q^\top Q = I$ before using the simplified projection formula |
| 6 | Thinking $P^2 = P$ is sufficient for an orthogonal projection | Idempotence only guarantees $P$ is some projection. Orthogonal projections additionally require $P^\top = P$ (symmetry); oblique projections are idempotent but not symmetric. | An orthogonal projection must be both idempotent ($P^2 = P$) and symmetric ($P^\top = P$) |
| 7 | Claiming eigenvectors are always orthogonal | Eigenvectors of distinct eigenvalues are orthogonal only for symmetric matrices. For general matrices, eigenvectors can be arbitrary non-orthogonal vectors (or even non-real). | The spectral theorem applies to symmetric/Hermitian matrices; general matrices need the Schur decomposition |
| 8 | Forgetting the sign convention in Householder | When computing $\mathbf{n} = (\mathbf{a} \pm \|\mathbf{a}\|\mathbf{e}_1)/\|\cdots\|$, choosing the wrong sign causes catastrophic cancellation when $\mathbf{a}$ is nearly parallel to $\mathbf{e}_1$. | Choose the sign of $\pm\|\mathbf{a}\|\mathbf{e}_1$ to match the sign of $a_1$: $\mathbf{n} = (\mathbf{a} + \operatorname{sign}(a_1)\|\mathbf{a}\|\mathbf{e}_1)/\|\cdots\|$ |
| 9 | Misinterpreting Parseval's identity as an approximation | $\|\mathbf{v}\|^2 = \sum_i \hat{v}_i^2$ is an equality when $\{\mathbf{q}_i\}$ is a complete ONB for the whole space. If it spans only a subspace, you get Bessel's inequality: $\sum_i \hat{v}_i^2 \leq \|\mathbf{v}\|^2$. | Complete ONB: Parseval's equality; partial orthonormal set: Bessel's inequality |
| 10 | Assuming QR is always better than Cholesky for normal equations | QR avoids squaring the condition number but costs more flops ($O(mn^2)$ vs $O(mn^2/2 + n^3/6)$ for Cholesky). For well-conditioned problems with large $m$, Cholesky on $A^\top A$ may be faster. | Choose by condition number: if $\kappa(A) \gg 1$, use QR; if $\kappa(A) \approx 1$, Cholesky is acceptable and faster |
| 11 | Using $\det(Q) = 1$ as the definition of orthogonal | $\det(Q) = \pm 1$ is a consequence of orthogonality, not the definition. A matrix can have determinant 1 without being orthogonal (e.g., any upper triangular matrix whose diagonal entries multiply to 1). | The definition is $Q^\top Q = I$; determinant $= \pm 1$ follows from it |
| 12 | Expecting orthogonal initialization to prevent gradient problems throughout training | Orthogonal initialization helps at initialization; as training proceeds, weights drift away from orthogonality. | Use orthogonal init as a starting point; for persistent orthogonality, add explicit constraints (spectral normalization, orthogonal regularization) |

12. Exercises

Exercise 1 * - Projection and Decomposition

Given v=(3,1,2)\mathbf{v} = (3, 1, 2)^\top and the subspace S=span{(1,1,0),(0,1,1)}S = \operatorname{span}\{(1, 1, 0)^\top, (0, 1, 1)^\top\}:

(a) Apply Gram-Schmidt to the spanning vectors to obtain an ONB {q1,q2}\{\mathbf{q}_1, \mathbf{q}_2\} for SS.

(b) Compute the projection vS=projS(v)\mathbf{v}_S = \operatorname{proj}_S(\mathbf{v}) using the ONB.

(c) Verify: vvSq1\mathbf{v} - \mathbf{v}_S \perp \mathbf{q}_1 and vvSq2\mathbf{v} - \mathbf{v}_S \perp \mathbf{q}_2.

(d) Compute v2\|\mathbf{v}\|^2 and verify v2=vS2+vvS2\|\mathbf{v}\|^2 = \|\mathbf{v}_S\|^2 + \|\mathbf{v} - \mathbf{v}_S\|^2 (Pythagorean theorem).


Exercise 2 * - Gram-Schmidt and QR

Let A=(110101011)A = \begin{pmatrix}1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1\end{pmatrix}.

(a) Apply Gram-Schmidt to the columns of AA to produce QQ and RR such that A=QRA = QR.

(b) Verify that QQ=IQ^\top Q = I and QR=AQR = A numerically.

(c) Solve the least-squares problem minxAxb\min_\mathbf{x}\|A\mathbf{x} - \mathbf{b}\| for b=(1,2,3)\mathbf{b} = (1,2,3)^\top by solving Rx=QbR\mathbf{x} = Q^\top\mathbf{b} via back substitution.

(d) Verify the solution against np.linalg.lstsq.


Exercise 3 * - Orthogonal Matrices

(a) Verify that the 2×22\times 2 rotation matrix Rθ=(cosθsinθsinθcosθ)R_\theta = \begin{pmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{pmatrix} is orthogonal for θ=π/4\theta = \pi/4.

(b) Show that the composition of two rotations RθRϕ=Rθ+ϕR_\theta R_\phi = R_{\theta+\phi} by matrix multiplication.

(c) Construct a Householder reflector HH such that Ha=ae1H\mathbf{a} = \|\mathbf{a}\|\mathbf{e}_1 for a=(3,4)\mathbf{a} = (3, 4)^\top. Verify HH=IH^\top H = I and Ha=(5,0)H\mathbf{a} = (5, 0)^\top.

(d) What is det(H)\det(H)? Explain geometrically.


Exercise 4 ** - Modified Gram-Schmidt

(a) Implement both Classical and Modified Gram-Schmidt.

(b) Test on the Hilbert matrix Hij=1/(i+j1)H_{ij} = 1/(i+j-1) for n=10n = 10. Compute the maximum off-diagonal entry of QQIQ^\top Q - I for each method.

(c) Plot orthogonality error (QQIF\|Q^\top Q - I\|_F) vs matrix size nn for both methods. What do you observe?

(d) Explain why MGS is more stable using the error analysis from 5.2.


Exercise 5 ** - Least Squares Stability

Given the Vandermonde system fitting a degree-dd polynomial through mm points:

(a) Build the design matrix ARm×(d+1)A \in \mathbb{R}^{m \times (d+1)} with Aij=xij1A_{ij} = x_i^{j-1} for m=20m = 20, d=10d = 10.

(b) Solve the least-squares problem via: (i) normal equations (AA)c^=Ay(A^\top A)\hat{\mathbf{c}} = A^\top\mathbf{y}, (ii) QR factorization A=QRA = QR then Rc^=QyR\hat{\mathbf{c}} = Q^\top\mathbf{y}.

(c) Compute the condition number κ(A)\kappa(A) and κ(AA)\kappa(A^\top A). What is their ratio?

(d) Perturb y\mathbf{y} by small noise and compare the residuals from both methods. Which is more stable?


Exercise 6 ** - Spectral Theorem

Let A=(4221)A = \begin{pmatrix}4 & 2 \\ 2 & 1\end{pmatrix}.

(a) Find all eigenvalues and eigenvectors of AA analytically.

(b) Verify that eigenvectors are orthogonal. Normalize them to form an orthogonal matrix QQ.

(c) Verify the spectral decomposition: A=QΛQA = Q\Lambda Q^\top where Λ=diag(λ1,λ2)\Lambda = \operatorname{diag}(\lambda_1, \lambda_2).

(d) Compute the Rayleigh quotient RA(x)R_A(\mathbf{x}) for x=(1,0)\mathbf{x} = (1, 0)^\top, (0,1)(0, 1)^\top, (1,1)/2(1, 1)^\top/\sqrt{2}. Verify each is in [λmin,λmax][\lambda_{\min}, \lambda_{\max}].

(e) Is AA positive definite? Verify using eigenvalues and by direct computation of xAx\mathbf{x}^\top A\mathbf{x}.


Exercise 7 *** - Orthogonal Weight Initialization

(a) For each depth $L \in \{1, 5, 10, 20\}$, generate matrices $W_1, \ldots, W_L \in \mathbb{R}^{64 \times 64}$ with i.i.d. Gaussian entries (standard init), and compute the spectral norm of their product $\|W_L \cdots W_1\|_2$, averaging over about 100 random trials.

(b) Repeat with each WkW_k initialized as a random orthogonal matrix (via QR of a Gaussian matrix). Plot WLW12\|W_L \cdots W_1\|_2 vs LL for both initializations.

(c) Generate synthetic "gradient" vectors gR64\mathbf{g} \in \mathbb{R}^{64} and propagate backward through the chains. Compare gradient norms: (WLW1)g\|(W_L \cdots W_1)^\top\mathbf{g}\| vs g\|\mathbf{g}\|.

(d) Explain the result using the isometry property of orthogonal matrices.


Exercise 8 *** - Rayleigh Quotient Iteration

Rayleigh Quotient Iteration is a method for finding eigenvectors with cubic convergence:

x(k+1)=(AρkI)1x(k)(AρkI)1x(k),ρk=RA(x(k))\mathbf{x}^{(k+1)} = \frac{(A - \rho_k I)^{-1}\mathbf{x}^{(k)}}{\|(A - \rho_k I)^{-1}\mathbf{x}^{(k)}\|}, \qquad \rho_k = R_A(\mathbf{x}^{(k)})

(a) Implement Rayleigh Quotient Iteration for a symmetric matrix.

(b) Test on A=(310121011)A = \begin{pmatrix}3 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 1\end{pmatrix} with random initial x(0)\mathbf{x}^{(0)}. Track ρkλ|\rho_k - \lambda_\star| over iterations.

(c) Plot convergence on a log scale and estimate the convergence rate (should be cubic: ρk+1λCρkλ3|\rho_{k+1} - \lambda_\star| \approx C|\rho_k - \lambda_\star|^3).

(d) Compare convergence speed vs power iteration and inverse iteration with a fixed shift.

(e) Explain why Rayleigh quotient (not a fixed shift) is needed for cubic convergence.


13. Why This Matters for AI (2026 Perspective)

| Concept | AI Application | Impact |
|---------|----------------|--------|
| Orthogonal initialization | `torch.nn.init.orthogonal_()` | Prevents gradient explosion/vanishing at initialization; standard practice for deep linear networks (Saxe et al.) and RNNs |
| QR decomposition | Muon optimizer; orthogonal gradient updates | Orthogonalizing gradient matrices via QR/Newton-Schulz stabilizes LLM training; avoids rank collapse of weight matrices |
| RoPE embeddings | LLaMA, Mistral, GPT-NeoX, Falcon, Gemma | Rotary position encoding uses orthogonal rotation matrices to encode relative position in self-attention; rotation composition gives the relative-position property |
| Spectral normalization | GANs, discriminator regularization | Constrains $\sigma_1(W) = 1$ for a 1-Lipschitz discriminator; stabilizes Wasserstein GAN training |
| Gram-Schmidt / QR | Numerically stable least squares | Used in regression, feature selection, basis pursuit; avoiding the normal equations prevents condition-number squaring |
| Orthogonal complement | Continual learning (OGD, GEM) | New-task gradients are projected onto the orthogonal complement of old-task gradient subspaces; prevents catastrophic forgetting |
| Spectral theorem | Hessian analysis, SAM optimizer | Eigendecomposition of the loss Hessian reveals sharp/flat directions; Sharpness-Aware Minimization (SAM) seeks minima with small $\lambda_{\max}(\nabla^2\mathcal{L})$ |
| Orthonormal bases | Attention head analysis | Queries, keys, values in each attention head span subspaces; orthogonality between heads corresponds to specialization; mechanistic interpretability studies this |
| DFT (unitary matrix) | Audio models, efficient convolutions | Wav2Vec and Whisper process spectrograms (DFT magnitudes); convolution in the frequency domain uses the FFT, i.e. multiplication by the unitary matrix $F_n$ |
| Householder reflectors | QR in neural network training | Used in QR factorizations during optimizer steps (Shampoo); the Householder product form stores orthogonal matrices compactly for efficient multiplication |
| Rayleigh quotient | Eigenvalue estimation in training | Stochastic Lanczos quadrature (Ghorbani et al. 2019) approximates the Hessian spectrum via Rayleigh quotients on random probes |
| Gram matrix | Gram-matrix style losses, NTK | The Neural Tangent Kernel $\Theta = JJ^\top$ ($J$ the Jacobian) is a Gram matrix; its eigenstructure determines training dynamics |

14. Advanced Topics

14.1 Polar Decomposition

Every matrix ARm×nA \in \mathbb{R}^{m \times n} (mnm \geq n) has a polar decomposition:

A=QSA = QS

where QRm×nQ \in \mathbb{R}^{m \times n} has orthonormal columns (QQ=InQ^\top Q = I_n) and SRn×nS \in \mathbb{R}^{n \times n} is symmetric positive semidefinite.

Uniqueness: If AA has full column rank, QQ and SS are unique, with S=(AA)1/2S = (A^\top A)^{1/2} (the matrix square root) and Q=A(AA)1/2Q = A(A^\top A)^{-1/2}.

Geometric meaning: SS captures the "stretching" and QQ captures the "rotation" - this is the matrix analog of writing a complex number as reiθre^{i\theta}.

Relation to SVD: If A=UΣVA = U\Sigma V^\top (thin SVD), then:

A=(UV)(VΣV)=QSA = (UV^\top)(V\Sigma V^\top) = Q \cdot S

So Q=UVQ = UV^\top (the "nearest orthogonal matrix" to AA) and S=VΣVS = V\Sigma V^\top.

For AI: The Muon optimizer computes Q=UVQ = UV^\top from the SVD of the gradient matrix GG, then applies QQ as the weight update - this is exactly the orthogonal factor of the polar decomposition of GG.
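A minimal NumPy sketch of the polar factor via the thin SVD; `G` is a stand-in for a gradient matrix, illustrating the decomposition rather than any particular optimizer's implementation:

import numpy as np

rng = np.random.default_rng(3)
G = rng.standard_normal((8, 4))           # stand-in for a gradient matrix

U, s, Vt = np.linalg.svd(G, full_matrices=False)
Q = U @ Vt                                # orthogonal polar factor (nearest orthonormal-column matrix)
S = Vt.T @ np.diag(s) @ Vt                # symmetric PSD factor

print(np.allclose(Q @ S, G))              # G = Q S
print(np.allclose(Q.T @ Q, np.eye(4)))    # orthonormal columns
print(np.linalg.svd(Q, compute_uv=False)) # all singular values of Q equal 1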

14.2 Orthogonality in Hilbert Spaces

All finite-dimensional results extend to Hilbert spaces - complete inner product spaces, possibly infinite-dimensional.

Orthogonal Projection Theorem (Hilbert spaces). If H\mathcal{H} is a Hilbert space and MH\mathcal{M} \subseteq \mathcal{H} is a closed subspace, then every vH\mathbf{v} \in \mathcal{H} has a unique decomposition v=vM+vM\mathbf{v} = \mathbf{v}_\mathcal{M} + \mathbf{v}_{\mathcal{M}^\perp} with vMM\mathbf{v}_\mathcal{M} \in \mathcal{M} and vMM\mathbf{v}_{\mathcal{M}^\perp} \in \mathcal{M}^\perp.

Completeness is essential: Without it, the projection may not exist. (The "closed" hypothesis ensures that sequences of projections converge.)

Examples of Hilbert spaces in ML:

  • L2(R)L^2(\mathbb{R}): the space of square-integrable functions (kernel methods, RKHS)
  • 2\ell^2: square-summable sequences (infinite-dimensional limits of neural networks)
  • Reproducing Kernel Hilbert Spaces (RKHS): the function space for kernel machines, Gaussian processes

14.3 Gram Matrices and the Kernel Trick

Given vectors x1,,xmRd\mathbf{x}_1, \ldots, \mathbf{x}_m \in \mathbb{R}^d, the Gram matrix is:

Gij=xi,xj=xixjG_{ij} = \langle\mathbf{x}_i, \mathbf{x}_j\rangle = \mathbf{x}_i^\top\mathbf{x}_j

In matrix form: G=XXG = XX^\top where X=[x1xm]Rm×dX = [\mathbf{x}_1|\cdots|\mathbf{x}_m]^\top \in \mathbb{R}^{m \times d}.

Properties: GG is symmetric positive semidefinite. Its rank equals rank(X)\operatorname{rank}(X).

Kernel trick: Replace xi,xj\langle\mathbf{x}_i,\mathbf{x}_j\rangle with k(xi,xj)k(\mathbf{x}_i,\mathbf{x}_j) where kk is a positive definite kernel. The resulting kernel matrix Kij=k(xi,xj)K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j) is a Gram matrix in a (possibly infinite-dimensional) RKHS.

For attention: The attention matrix Attnijexp(qikj/dk)\text{Attn}_{ij} \propto \exp(\mathbf{q}_i^\top\mathbf{k}_j / \sqrt{d_k}) is a kernelized Gram matrix between queries and keys. Its spectral properties determine how information flows between tokens.


Conceptual Bridge

Looking Backward: What Made This Possible

The theory developed in this section builds directly on foundations from earlier chapters:

  • Vectors and inner products (01: Vectors and Spaces): The inner product u,v\langle\mathbf{u},\mathbf{v}\rangle is the primitive notion from which orthogonality, angles, and projections all derive.
  • Linear transformations (04: Linear Transformations): Orthogonal matrices are the isometric linear maps - those that preserve both angles and distances. The kernel and image of projection operators are orthogonal complements.
  • Matrix rank (02-LA: Matrix Rank): The rank-nullity theorem and the four fundamental subspaces are unified by orthogonality: null(A)=row(A)\operatorname{null}(A) = \operatorname{row}(A)^\perp.
  • Eigenvalues (01: Eigenvalues): The spectral theorem tells us that symmetric matrices are diagonalizable with an orthonormal eigenbasis - the connection between orthogonality and spectral theory.

Looking Forward: What This Enables

Orthogonality and orthonormality are not endpoints but foundations:

  • Singular Value Decomposition (02: SVD): The SVD A=UΣVA = U\Sigma V^\top is built from two orthogonal matrices. The right singular vectors VV form an ONB for the row space; the left singular vectors UU form an ONB for the column space. Without the theory developed here, SVD cannot be understood.
  • Matrix Decompositions (08: Matrix Decompositions): LU, QR, Cholesky, Schur - QR is the central algorithm among these, and the QR algorithm for eigenvalues iterates orthogonal similarity transformations.
  • Matrix Norms (06: Matrix Norms): The spectral norm A2=σ1(A)\|A\|_2 = \sigma_1(A) is defined via orthogonal invariance. The condition number κ(Q)=1\kappa(Q) = 1 for orthogonal QQ.
  • Optimization (Chapter 08): Gradient methods, second-order methods, and their variants all navigate spaces equipped with inner products. The geometry of the loss landscape is described using the spectral theory developed here.
CURRICULUM POSITION: ORTHOGONALITY AND ORTHONORMALITY
===========================================================================

  FOUNDATIONS              THIS SECTION                  FORWARD
  -----------              ------------                  -------
  +-----------------+   +-------------------------+   +------------------+
  | Inner Products  |-->| Orthogonality (05)      |-->| SVD (02)         |
  | Vectors/Spaces  |   | +- Gram-Schmidt         |   | low-rank approx  |
  +-----------------+   | +- QR Decomposition     |   +------------------+
                        | +- Orthogonal Matrices  |
  +-----------------+   | +- Projections          |   +------------------+
  | Eigenvalues     |-->| +- Spectral Theorem     |-->| Matrix Norms     |
  | (01)            |   | +- ML Applications      |   | (06)             |
  +-----------------+   +-------------------------+   +------------------+
                                    |
  +-----------------+               |                 +------------------+
  | Linear Transf.  |-->------------+              -->| Decompositions   |
  | (04)            |                                 | (08) QR alg.     |
  +-----------------+                                 +------------------+

  AI CONNECTIONS:
  +------------------------------------------------------------------+
  |  RoPE embeddings -> Gram-Schmidt  -> Muon optimizer  -> OGD/GEM   |
  |  Orthogonal init -> Spectral norm -> Rayleigh quotient -> SAM     |
  |  QR least squares -> DFT (unitary) -> Kernel Gram matrix -> NTK   |
  +------------------------------------------------------------------+

===========================================================================



Appendix A: Key Inequalities in Inner Product Spaces

A.1 Cauchy-Schwarz in Different Forms

The Cauchy-Schwarz inequality u,vuv|\langle\mathbf{u},\mathbf{v}\rangle| \leq \|\mathbf{u}\|\|\mathbf{v}\| has many equivalent forms:

Discrete form:

(k=1nakbk)2(k=1nak2)(k=1nbk2)\left(\sum_{k=1}^n a_k b_k\right)^2 \leq \left(\sum_{k=1}^n a_k^2\right)\left(\sum_{k=1}^n b_k^2\right)

Integral form:

(abf(x)g(x)dx)2(abf(x)2dx)(abg(x)2dx)\left(\int_a^b f(x)g(x)\,dx\right)^2 \leq \left(\int_a^b f(x)^2\,dx\right)\left(\int_a^b g(x)^2\,dx\right)

Probability form: For random variables X,YX, Y with E[X2],E[Y2]<\mathbb{E}[X^2], \mathbb{E}[Y^2] < \infty:

E[XY]2E[X2]E[Y2]|\mathbb{E}[XY]|^2 \leq \mathbb{E}[X^2]\mathbb{E}[Y^2]

The probability form directly implies: the correlation coefficient satisfies Corr(X,Y)1|\text{Corr}(X,Y)| \leq 1.

Second proof (via AM-GM). For any t>0t > 0:

0tu1tv2=t2u22u,v+1t2v20 \leq \|t\mathbf{u} - \frac{1}{t}\mathbf{v}\|^2 = t^2\|\mathbf{u}\|^2 - 2\langle\mathbf{u},\mathbf{v}\rangle + \frac{1}{t^2}\|\mathbf{v}\|^2

Minimizing over t>0t > 0 gives t2=v/ut^2 = \|\mathbf{v}\|/\|\mathbf{u}\| and:

02uv2u,vu,vuv0 \leq 2\|\mathbf{u}\|\|\mathbf{v}\| - 2\langle\mathbf{u},\mathbf{v}\rangle \quad \Rightarrow \quad \langle\mathbf{u},\mathbf{v}\rangle \leq \|\mathbf{u}\|\|\mathbf{v}\|

Applying to v-\mathbf{v} gives the lower bound. \square

A.2 Triangle Inequality and Its Uses

The triangle inequality for norms follows from Cauchy-Schwarz:

u+v2=u2+2u,v+v2u2+2uv+v2=(u+v)2\|\mathbf{u}+\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + 2\langle\mathbf{u},\mathbf{v}\rangle + \|\mathbf{v}\|^2 \leq \|\mathbf{u}\|^2 + 2\|\mathbf{u}\|\|\mathbf{v}\| + \|\mathbf{v}\|^2 = (\|\mathbf{u}\|+\|\mathbf{v}\|)^2

Parallelogram law: For any inner product space:

u+v2+uv2=2(u2+v2)\|\mathbf{u}+\mathbf{v}\|^2 + \|\mathbf{u}-\mathbf{v}\|^2 = 2(\|\mathbf{u}\|^2 + \|\mathbf{v}\|^2)

Proof: Expand both norms and add. The cross terms cancel. \square

This law characterizes inner product spaces: a norm satisfying the parallelogram law comes from an inner product via the polarization identity:

u,v=14(u+v2uv2)\langle\mathbf{u},\mathbf{v}\rangle = \frac{1}{4}\left(\|\mathbf{u}+\mathbf{v}\|^2 - \|\mathbf{u}-\mathbf{v}\|^2\right)

Practical importance: The 1\ell^1 and \ell^\infty norms do NOT satisfy the parallelogram law, confirming they do not arise from inner products. This is why optimization in 1\ell^1 (LASSO) has different geometry from 2\ell^2 (ridge regression) - there is no meaningful notion of "orthogonality" in 1\ell^1.

A.3 Weyl's Inequality for Eigenvalue Perturbation

When a symmetric matrix AA is perturbed to A+EA + E (with EE symmetric), Weyl's inequality bounds the eigenvalue shifts:

λk(A+E)λk(A)E2=σ1(E)|\lambda_k(A+E) - \lambda_k(A)| \leq \|E\|_2 = \sigma_1(E)

This means: the eigenvalues of a symmetric matrix are Lipschitz functions of the matrix entries, with Lipschitz constant 1 in the spectral norm.

Consequence for ML: If the Hessian 2L(θ)\nabla^2\mathcal{L}(\theta) changes slowly along a training trajectory (small H(t+1)H(t)2\|H(t+1) - H(t)\|_2), its eigenvalues also change slowly. This justifies quasi-Newton methods that assume the curvature is approximately constant over several steps.
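A quick numerical check of the bound (a minimal sketch):

import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n)); A = (A + A.T) / 2          # symmetric
E = rng.standard_normal((n, n)); E = (E + E.T) / 2 * 1e-3   # small symmetric perturbation

shift = np.abs(np.linalg.eigvalsh(A + E) - np.linalg.eigvalsh(A))
print(shift.max(), np.linalg.norm(E, 2))   # max eigenvalue shift never exceeds ||E||_2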


Appendix B: Numerical Orthogonalization Recipes

B.1 When to Use Each Method

| Method | Complexity | Stability | When to Use |
|--------|------------|-----------|-------------|
| Classical Gram-Schmidt (CGS) | $O(mnk)$ | Poor: $O(\epsilon\kappa^2)$ | Never in production; educational use only |
| Modified Gram-Schmidt (MGS) | $O(mnk)$ | Good: $O(\epsilon\kappa)$ | When an explicit $Q$ is needed and $\kappa(A)$ is moderate |
| Householder QR | $O(mn^2 - n^3/3)$ | Excellent: $O(\epsilon)$ | General QR factorization; recommended default |
| Givens QR | $O(mn^2)$ | Excellent | Sparse matrices; parallel processing |
| Randomized QR | $O(mn\log k + nk^2)$ | Good | Large matrices; approximate low-rank QR |

B.2 Loss of Orthogonality: Quantifying the Damage

In IEEE double precision (ϵmach2.2×1016\epsilon_\text{mach} \approx 2.2 \times 10^{-16}), the computed Q^\hat{Q} from CGS on an n×nn \times n matrix satisfies:

Q^Q^IFnϵmachκ(A)2\|\hat{Q}^\top\hat{Q} - I\|_F \lesssim n \cdot \epsilon_\text{mach} \cdot \kappa(A)^2

For a well-conditioned matrix (κ=10\kappa = 10): error n1014\approx n \cdot 10^{-14} - acceptable.

For an ill-conditioned matrix (κ=108\kappa = 10^8): error n1\approx n \cdot 1 - completely non-orthogonal!

MGS improves this to nϵmachκ(A)\lesssim n \cdot \epsilon_\text{mach} \cdot \kappa(A), and Householder to nϵmach\lesssim n \cdot \epsilon_\text{mach} regardless of κ\kappa.
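The effect is easy to reproduce. A minimal sketch follows; `cgs` and `mgs` are illustrative textbook implementations, and scipy.linalg.hilbert supplies the ill-conditioned test matrix:

import numpy as np
from scipy.linalg import hilbert

def cgs(A):
    """Classical Gram-Schmidt: all projections taken from the original column."""
    m, n = A.shape
    Q = np.zeros((m, n))
    for j in range(n):
        v = A[:, j] - Q[:, :j] @ (Q[:, :j].T @ A[:, j])
        Q[:, j] = v / np.linalg.norm(v)
    return Q

def mgs(A):
    """Modified Gram-Schmidt: project from the running residual instead."""
    m, n = A.shape
    Q = A.astype(float).copy()
    for j in range(n):
        Q[:, j] /= np.linalg.norm(Q[:, j])
        Q[:, j + 1:] -= np.outer(Q[:, j], Q[:, j] @ Q[:, j + 1:])
    return Q

H = hilbert(10)   # notoriously ill-conditioned
for name, Q in [("CGS", cgs(H)), ("MGS", mgs(H))]:
    print(name, np.linalg.norm(Q.T @ Q - np.eye(10)))  # MGS error is orders smaller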

B.3 Reorthogonalization

When MGS produces inadequate orthogonality (common for nearly-dependent input vectors), apply reorthogonalization: after completing Gram-Schmidt once, apply it again to the already-computed QQ. This reduces the error from O(ϵκ)O(\epsilon\kappa) to O(ϵ2κ2)O(\epsilon^2\kappa^2), which is typically negligible.

import numpy as np

def mgs_with_reorth(A, num_reorth=1):
    """Modified Gram-Schmidt with optional reorthogonalization passes."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for _ in range(1 + num_reorth):      # extra passes restore lost orthogonality
            for i in range(j):
                r = Q[:, i] @ v              # coefficient found on this pass
                R[i, j] += r                 # accumulate into R across passes
                v -= r * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

B.4 Block Gram-Schmidt for Parallel Computation

For very large matrices on parallel hardware, block Gram-Schmidt processes columns in blocks of size bb:

  1. Orthogonalize within each block using MGS or Householder
  2. Orthogonalize each new block against all previous blocks using a BLAS-3 matrix operation

The key advantage: step 2 uses matrix-matrix products ($O(nb^2)$ flops, BLAS Level 3) rather than matrix-vector products ($O(nb)$ flops, BLAS Level 1/2). On modern hardware with SIMD and caching, this gives a 10-100x speedup for large $n$.


Appendix C: Orthogonality in Machine Learning Libraries

C.1 PyTorch Orthogonal Utilities

import torch
import torch.nn as nn

# Orthogonal initialization
linear = nn.Linear(64, 64)
nn.init.orthogonal_(linear.weight)  # Q factor from the QR of a random Gaussian matrix

# Spectral normalization
conv = nn.utils.spectral_norm(nn.Conv2d(64, 64, 3))
# Divides the weight by its spectral norm at each forward pass

# Orthogonal regularization: penalty lambda * ||W^T W - I||_F^2
def orthogonal_reg(model, lambda_=1e-4):
    loss = 0.0
    for name, param in model.named_parameters():
        if 'weight' in name and param.dim() == 2:
            W = param
            I = torch.eye(W.shape[1], device=W.device)
            loss += lambda_ * torch.norm(W.T @ W - I) ** 2  # squared Frobenius norm
    return loss

C.2 NumPy/SciPy QR and Orthogonal Matrix Tools

import numpy as np
from scipy.linalg import qr, polar

n = 6
A = np.random.randn(n, n)           # example matrix

# QR decomposition
Q, R = np.linalg.qr(A)              # mode='reduced' (default) or 'complete'
Q, R, P = qr(A, pivoting=True)      # QR with column pivoting

# Random orthogonal matrix (Haar measure)
M = np.random.randn(n, n)
Q, R = np.linalg.qr(M)
Q *= np.sign(np.diag(R))            # fix column signs for the uniform (Haar) distribution

# Polar decomposition: A = Q @ S
Q, S = polar(A)                     # scipy.linalg.polar

# Check orthogonality
I_approx = Q.T @ Q
print(f"Max off-diagonal: {np.max(np.abs(I_approx - np.eye(n))):.2e}")

C.3 Efficient Householder Representation

For large nn, storing the full QQ matrix explicitly requires O(n2)O(n^2) space. The compact WY representation stores Q=IWYQ = I - WY^\top where W,YRn×kW, Y \in \mathbb{R}^{n \times k}, requiring only O(nk)O(nk) space and allowing matrix-vector products in O(nk)O(nk) time:

def apply_householder_sequence(v_list, tau_list, x):
    """Apply Q = H_1 H_2 ... H_k to x from the stored Householder vectors.

    Product form, O(nk) total; the blocked WY form groups these same
    reflections into matrix-matrix products. Since Q x = H_1(H_2(...(H_k x))),
    the reflectors are applied in reverse storage order.
    """
    for v, tau in reversed(list(zip(v_list, tau_list))):
        x = x - tau * v * (v @ x)  # apply H_i = I - tau_i * v_i v_i^T
    return x

This is how LAPACK and NumPy store QQ internally - only the Householder vectors are saved, not the explicit matrix.


Appendix D: The Geometry of Projection - Visual Summary

ORTHOGONAL DECOMPOSITION THEOREM
========================================================================

  The Euclidean space R^n splits into complementary orthogonal subspaces.

  For any subspace S of R^n:

           R^n = S (+) S^perp      (orthogonal direct sum)

                        v
                       /|
                      / |
                     /  |  v - v_S   (residual, perpendicular to S)
                    /   |
            -------+----+-------------->  S
                   0    v_S = Pv     (projection, lies in S)

           (v - v_S) _|_ s   for every s in S

  Properties of the projection matrix P = QQ^T (columns of Q form an ONB for S):

    Symmetric:    P^T = P           (orthogonal, not oblique)
    Idempotent:   P^2 = P           (projecting twice = projecting once)
    Range:        col(P) = S        (maps into S)
    Null space:   null(P) = S^perp  (annihilates the complement)

========================================================================
GRAM-SCHMIDT PROCESS - STEP BY STEP
========================================================================

  Input: linearly independent vectors a_1, a_2, a_3 (not orthogonal)
  Output: orthonormal basis q_1, q_2, q_3 for same subspace

  STEP 1:  q_1 = a_1 / ||a_1||
           ------------------------------------
           a_1  --------------------------> q_1 (unit vector)

  STEP 2:  u_2 = a_2 - <a_2,q_1>q_1   (subtract projection onto q_1)
           q_2 = u_2 / ||u_2||
           ------------------------------------
           a_2  --->  remove q_1 component  ---> u_2 ---> q_2

                     a_2
                    /|
                   / |
                  /  |  u_2 = a_2 - <a_2,q_1> q_1   (perpendicular remainder -> q_2)
                 /   |
           -----+----+------------->  q_1 direction
                0    <a_2,q_1> q_1   (this component removed)

  STEP 3:  u_3 = a_3 - <a_3,q_1>q_1 - <a_3,q_2>q_2
           q_3 = u_3 / ||u_3||
           ------------------------------------
           Remove projections onto BOTH q_1 and q_2

  RESULT: {q_1, q_2, q_3} is an ONB with same span as {a_1, a_2, a_3}

  QR FACTORIZATION: A = [a_1|a_2|a_3] = [q_1|q_2|q_3] * R
  where R is upper triangular with positive diagonal.

========================================================================
SPECTRAL THEOREM: SYMMETRIC <-> ORTHOGONAL EIGENBASIS
========================================================================

  A = A^T  (symmetric)
       <->
  A has an orthonormal eigenbasis {q_1, ..., q_n}
       <->
  A = Q Lambda Q^T  where  Q = [q_1|...|q_n] in O(n),  Lambda = diag(lambda_1,...,lambda_n)
       <->
  Quadratic form:  x^T A x = sum_i lambda_i (x^T q_i)^2
                 = weighted sum of squared projections

  INTERPRETATION:
  +------------------------------------------------------------------+
  |  Q^T: express x in the eigenbasis -> Lambda: scale each component |
  |  -> Q: return to the standard basis                               |
  |                                                                   |
  |  lambda_i > 0: stretch along q_i   (positive definite: all > 0)   |
  |  lambda_i = 0: collapse along q_i  (singular matrix)              |
  |  lambda_i < 0: flip along q_i      (indefinite: mixed signs)      |
  +------------------------------------------------------------------+

  AI CONNECTION: The Hessian of the loss is symmetric. Its eigenvectors
  are the "principal curvature directions" of the loss surface, and the
  eigenvalues tell us how curved the loss is in each direction.

========================================================================
ORTHOGONAL MATRICES: STRUCTURE AND CLASSIFICATION
========================================================================

  Q^T Q = I  (definition)   =>   det(Q) = +1 or -1

  det(Q) = +1:  ROTATIONS (SO(n))      det(Q) = -1:  REFLECTIONS
  ----------------------------         --------------------------
  Preserve orientation                 Reverse orientation
  Examples in 2D: R_theta, any theta   Example: diag(1,...,1,-1)
  Compose to give rotations            Compose to give rotations

  O(n) = SO(n) u {reflections}   (two connected components)

  EIGENVALUES: always on the unit circle in C
  +------------------------------------------------------------------+
  |   Real eigenvalues: +1 or -1 only                                 |
  |   Complex eigenvalues: e^{i theta}, e^{-i theta} pairs            |
  |                        -> 2x2 rotation blocks                     |
  |                                                                   |
  |   Real canonical block form of a rotation in SO(n):               |
  |   +----------+----------+                                         |
  |   | R_theta1 |          |   each block is a 2x2 rotation          |
  |   +----------+ R_theta2 |   with its own angle theta_i            |
  |   |          +----------+                                         |
  |   |          |   ...    |                                         |
  |   +----------+----------+                                         |
  +------------------------------------------------------------------+

========================================================================

Appendix E: Summary of Key Formulas

Inner Products and Orthogonality

| Formula | Expression | Notes |
|---------|------------|-------|
| Inner product | $\langle\mathbf{u},\mathbf{v}\rangle = \mathbf{u}^\top\mathbf{v} = \sum_i u_i v_i$ | Euclidean; generalizes to $\langle\mathbf{u},\mathbf{v}\rangle_M = \mathbf{u}^\top M\mathbf{v}$ |
| Norm | $\|\mathbf{v}\| = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$ | Induced by the inner product |
| Angle | $\cos\theta = \langle\mathbf{u},\mathbf{v}\rangle / (\|\mathbf{u}\|\|\mathbf{v}\|)$ | Requires both vectors nonzero |
| Cauchy-Schwarz | $\lvert\langle\mathbf{u},\mathbf{v}\rangle\rvert \leq \|\mathbf{u}\|\,\|\mathbf{v}\|$ | Equality iff $\mathbf{u},\mathbf{v}$ linearly dependent |
| Pythagorean theorem | $\mathbf{u} \perp \mathbf{v} \Rightarrow \|\mathbf{u}+\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$ | Converse also holds |
| Parseval's identity | $\|\mathbf{v}\|^2 = \sum_{i=1}^n \lvert\langle\mathbf{v},\mathbf{q}_i\rangle\rvert^2$ | Holds for a complete ONB |
| Bessel's inequality | $\sum_{i=1}^k \lvert\langle\mathbf{v},\mathbf{q}_i\rangle\rvert^2 \leq \|\mathbf{v}\|^2$ | Any orthonormal set; see 10b.1 |

Projections

| Formula | Expression | Notes |
|---------|------------|-------|
| Projection onto a unit vector | $\operatorname{proj}_{\hat{\mathbf{u}}}(\mathbf{v}) = \langle\mathbf{v},\hat{\mathbf{u}}\rangle\hat{\mathbf{u}}$ | 1D case |
| Projection onto a subspace (ONB $Q$) | $P_S\mathbf{v} = QQ^\top\mathbf{v}$ | Requires $Q^\top Q = I$ |
| Projection onto $\operatorname{col}(A)$ (general) | $P_S\mathbf{v} = A(A^\top A)^{-1}A^\top\mathbf{v}$ | Any linearly independent basis |
| Projection matrix properties | $P^2 = P$, $P^\top = P$ | Idempotent + symmetric |
| Complementary projector | $P_{S^\perp} = I - P_S$ | Also a projector |
| Best approximation property | $\|\mathbf{v} - P_S\mathbf{v}\| \leq \|\mathbf{v} - \mathbf{s}\|$ for all $\mathbf{s} \in S$ | Fundamental theorem |

Gram-Schmidt and QR

| Formula | Expression | Notes |
|---------|------------|-------|
| Gram-Schmidt step | $\mathbf{u}_j = \mathbf{a}_j - \sum_{i<j}\langle\mathbf{a}_j,\mathbf{q}_i\rangle\mathbf{q}_i$ | Remove prior directions |
| QR factorization | $A = QR$ | $Q$ orthonormal columns, $R$ upper triangular |
| Least squares via QR | $R\mathbf{x} = Q^\top\mathbf{b}$ | Solve the upper triangular system |
| Householder reflector | $H = I - 2\mathbf{n}\mathbf{n}^\top$ | Reflection across $\mathbf{n}^\perp$ |
| Cayley transform | $Q = (I-S)(I+S)^{-1}$ | Orthogonal $Q$ from skew-symmetric $S$ |

Spectral Theorem

| Formula | Expression | Notes |
|---------|------------|-------|
| Spectral decomposition | $A = Q\Lambda Q^\top = \sum_i \lambda_i\mathbf{q}_i\mathbf{q}_i^\top$ | Symmetric $A$ only |
| Rayleigh quotient | $R_A(\mathbf{x}) = \mathbf{x}^\top A\mathbf{x}/\|\mathbf{x}\|^2$ | Bounded by $[\lambda_{\min},\lambda_{\max}]$ |
| Positive definiteness | $A \succ 0 \Leftrightarrow$ all $\lambda_i > 0$ | Via the spectral theorem |
| Polar decomposition | $A = QS$, $Q^\top Q = I$, $S \succeq 0$ | Extends to non-square $A$ |

Appendix F: Connections to Other Chapters

F.1 From Eigenvalues (01) to Orthogonality

The spectral theorem shows that symmetric matrices are the matrices that are diagonalizable by orthogonal transformations. Non-symmetric matrices may still be diagonalizable, but with a non-orthogonal change of basis - making computations numerically less stable.

The QR algorithm for computing eigenvalues (01) is built entirely on orthogonal matrix iterations:

A_0 = A
for k = 0, 1, 2, ...:
    A_k = Q_k R_k    (QR factorization of current iterate)
    A_{k+1} = R_k Q_k  (swap factors)

Each iterate is orthogonally similar to the previous: Ak+1=QkAkQkA_{k+1} = Q_k^\top A_k Q_k (since Rk=QkAkR_k = Q_k^\top A_k and Ak+1=RkQk=QkAkQkA_{k+1} = R_k Q_k = Q_k^\top A_k Q_k). The iterates converge to the Schur form - upper triangular, with eigenvalues on the diagonal. The entire algorithm is numerically stable precisely because it uses only orthogonal transformations.
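A runnable version of the iteration - an unshifted sketch for illustration; practical implementations first reduce to Hessenberg form and add shifts:

import numpy as np

def qr_algorithm(A, iters=200):
    """Unshifted QR iteration: each step is an orthogonal similarity."""
    Ak = np.array(A, dtype=float)
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                  # A_{k+1} = Q^T A_k Q
    return Ak

A = np.array([[3., 1., 0.], [1., 2., 1.], [0., 1., 1.]])
print(np.sort(np.diag(qr_algorithm(A))))   # approximate eigenvalues
print(np.linalg.eigvalsh(A))               # reference (ascending)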

F.2 From SVD (02) to Orthogonality

The SVD A=UΣVA = U\Sigma V^\top can be viewed as the "orthogonality structure" of AA:

  • V=[v1vn]V = [\mathbf{v}_1|\cdots|\mathbf{v}_n]: ONB for the row space, mapping inputs to "pure directions"
  • U=[u1um]U = [\mathbf{u}_1|\cdots|\mathbf{u}_m]: ONB for the column space, the natural output directions
  • Σ\Sigma: the non-orthogonal part (pure scaling)

The polar decomposition A=(UV)(VΣV)A = (U V^\top)(V \Sigma V^\top) isolates the orthogonal factor Q=UVQ = UV^\top from the symmetric factor S=VΣVS = V\Sigma V^\top. This factorization requires all the tools developed in this section.

F.3 Orthogonality Enables Low-Rank Approximation (02, 03)

Eckart-Young-Mirsky theorem: The best rank-kk approximation to AA in both spectral and Frobenius norms is:

Ak=i=1kσiuiviA_k = \sum_{i=1}^k \sigma_i\mathbf{u}_i\mathbf{v}_i^\top

This result requires:

  1. The SVD (which produces orthogonal U,VU, V)
  2. The best approximation theorem (projection gives the nearest element in a subspace)
  3. Parseval's identity (to express AAkF2=i>kσi2\|A - A_k\|_F^2 = \sum_{i>k}\sigma_i^2)

All three are developed in this section. The Eckart-Young theorem is the mathematical foundation for PCA, LoRA, and any other low-rank approximation method.
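A one-screen numerical check of the theorem's error formula (a minimal sketch):

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((20, 12))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation
err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[k:]**2))))  # ||A - A_k||_F^2 = sum_{i>k} sigma_i^2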

F.4 Orthogonality and Matrix Norms (06)

Unitarily invariant norms: A matrix norm A\|A\| is unitarily invariant if UAV=A\|UAV\| = \|A\| for all unitary U,VU, V. The three most important matrix norms are all unitarily invariant:

  • Spectral norm: A2=σ1(A)\|A\|_2 = \sigma_1(A) (largest singular value)
  • Frobenius norm: AF=i,jAij2=iσi2\|A\|_F = \sqrt{\sum_{i,j}A_{ij}^2} = \sqrt{\sum_i\sigma_i^2} (sum of squared singular values)
  • Nuclear norm: A=iσi\|A\|_* = \sum_i\sigma_i (sum of singular values)

These norms are "orthogonality-aware" - they don't change when the matrix is multiplied by orthogonal matrices. This property is what allows the Eckart-Young theorem to work: the error in low-rank approximation depends only on the discarded singular values, not on the choice of orthogonal bases.

F.5 Orthogonality in Optimization (Chapter 08)

Modern optimization methods for training large neural networks increasingly exploit orthogonal structure:

Sharpness-Aware Minimization (SAM): The sharpness of a minimum is governed by $\lambda_{\max}(\nabla^2\mathcal{L})$. SAM approximates the worst-case loss in a small ball around the current parameters by perturbing along the normalized gradient direction; near a minimum, this worst-case direction aligns with the top eigenvector of the Hessian. That eigenvector can be estimated by power iteration - repeated Hessian-vector products, each followed by re-normalization, the simplest instance of orthogonal iteration.

SOAP optimizer (2024): Maintains a factored approximation of the Adam preconditioner in which the "rotation" and "scaling" factors are tracked separately. The rotation factor is updated via QR decomposition to maintain orthogonality, while the scaling factor tracks the second moments in the rotating basis.

K-FAC and its variants: The Kronecker-Factored Approximate Curvature approximates the Fisher information matrix as a Kronecker product FAGF \approx A \otimes G. Computing the update requires inverting these factors, which can be done via their eigendecompositions - the eigenvectors are orthonormal matrices.


Appendix G: Historical Notes and References

Timeline: Development of Orthogonality Theory

| Year | Mathematician | Contribution |
|------|---------------|--------------|
| 1799 | Parseval | Parseval's identity for Fourier-type expansions |
| 1801 | Gauss | Least squares method; orthogonal projection via the normal equations |
| 1821 / 1888 | Cauchy, Schwarz | Cauchy-Schwarz inequality (Cauchy for sums, Schwarz for integrals) |
| 1844 | Grassmann | Abstract vector spaces and subspaces; inner product generalization |
| 1883 | Gram | Orthogonalization for $L^2$ functions (Gram's formulation) |
| 1907 | Schmidt | Gram-Schmidt process in its modern form, for $\ell^2$ |
| 1908 | Hilbert | Abstract Hilbert spaces; orthogonal projection theorem |
| 1915-1932 | Weyl, von Neumann | Spectral theorem in infinite dimensions |
| 1958 | Householder | Householder reflectors for numerically stable QR |
| 1961 | Francis | QR algorithm for eigenvalues (with implicit shifts) |
| 1967 | Björck | Numerical analysis of Gram-Schmidt stability |
| 2013 | Saxe, McClelland, Ganguli | Orthogonal initialization for deep linear networks |
| 2015 | Mishkin, Matas | LSUV initialization using SVD/QR |
| 2018 | Miyato et al. | Spectral normalization for GANs |
| 2021 | Su et al. | RoPE: Rotary Position Embedding using orthogonal rotations |
| 2024 | Jordan et al. | Muon optimizer: gradient orthogonalization via Newton-Schulz |

Key References

  1. Golub & Van Loan, Matrix Computations (4th ed., 2013) - Definitive reference on numerical linear algebra including Gram-Schmidt, Householder, and QR algorithms
  2. Axler, Linear Algebra Done Right (3rd ed., 2015) - Elegant proof-based treatment of inner product spaces and spectral theorem
  3. Trefethen & Bau, Numerical Linear Algebra (1997) - Excellent coverage of QR, projections, and stability from computational perspective
  4. Saxe et al. (2013), "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks" - Theoretical foundation for orthogonal initialization
  5. Miyato et al. (2018), "Spectral Normalization for GANs" - Seminal paper on spectral normalization in deep learning
  6. Su et al. (2021), "RoFormer: Enhanced Transformer with Rotary Position Embedding" - RoPE using orthogonal rotation matrices
  7. Jordan et al. (2024), "Muon: An optimizer for hidden layers in neural networks" - Newton-Schulz orthogonalization for gradients

Appendix H: Proofs of Selected Theorems

H.1 Complete Proof: Cauchy-Schwarz Inequality

Theorem. For any inner product space and vectors u,v\mathbf{u}, \mathbf{v}:

u,v2u,uv,v|\langle\mathbf{u},\mathbf{v}\rangle|^2 \leq \langle\mathbf{u},\mathbf{u}\rangle\langle\mathbf{v},\mathbf{v}\rangle

Proof. If v=0\mathbf{v} = \mathbf{0}, both sides are 0 and the inequality holds trivially. Assume v0\mathbf{v} \neq \mathbf{0}.

For any scalar tRt \in \mathbb{R}:

0utv,utv=u,u2tu,v+t2v,v0 \leq \langle\mathbf{u} - t\mathbf{v},\mathbf{u} - t\mathbf{v}\rangle = \langle\mathbf{u},\mathbf{u}\rangle - 2t\langle\mathbf{u},\mathbf{v}\rangle + t^2\langle\mathbf{v},\mathbf{v}\rangle

This is a quadratic in tt that is always 0\geq 0. Its discriminant must therefore be 0\leq 0:

$$\Delta = 4\langle\mathbf{u},\mathbf{v}\rangle^2 - 4\langle\mathbf{u},\mathbf{u}\rangle\langle\mathbf{v},\mathbf{v}\rangle \leq 0$$

$$\langle\mathbf{u},\mathbf{v}\rangle^2 \leq \langle\mathbf{u},\mathbf{u}\rangle\langle\mathbf{v},\mathbf{v}\rangle \quad \square$$

Equality condition. Equality holds iff Δ=0\Delta = 0 iff the quadratic in tt has a zero iff there exists t0t_0 with ut0v=0\mathbf{u} - t_0\mathbf{v} = \mathbf{0} iff u\mathbf{u} and v\mathbf{v} are linearly dependent (one is a scalar multiple of the other).

Remark. This proof works for any abstract inner product space, finite or infinite dimensional. The key was that ,\langle\cdot,\cdot\rangle is positive definite.

H.2 Complete Proof: Best Approximation Uniqueness

Theorem. Let SS be a closed subspace of a Hilbert space H\mathcal{H}. For any vH\mathbf{v} \in \mathcal{H}, the projection vS\mathbf{v}_S satisfying vvSS\mathbf{v} - \mathbf{v}_S \perp S is unique.

Proof of uniqueness. Suppose both w1,w2S\mathbf{w}_1, \mathbf{w}_2 \in S satisfy vwiS\mathbf{v} - \mathbf{w}_i \perp S. Then (vw1)(vw2)=w2w1S(\mathbf{v} - \mathbf{w}_1) - (\mathbf{v} - \mathbf{w}_2) = \mathbf{w}_2 - \mathbf{w}_1 \in S (since SS is a subspace). But also w2w1S\mathbf{w}_2 - \mathbf{w}_1 \perp S (as a difference of vectors each perpendicular to SS). So w2w1SS={0}\mathbf{w}_2 - \mathbf{w}_1 \in S \cap S^\perp = \{\mathbf{0}\}, giving w1=w2\mathbf{w}_1 = \mathbf{w}_2. \square

H.3 Complete Proof: Gram-Schmidt Produces ONB

Theorem. If \mathbf{a}_1, \ldots, \mathbf{a}_k are linearly independent vectors in an inner product space V, the Gram-Schmidt process produces an orthonormal set \mathbf{q}_1, \ldots, \mathbf{q}_k with \operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_j) = \operatorname{span}(\mathbf{a}_1,\ldots,\mathbf{a}_j) for each j = 1,\ldots,k.

Proof by induction on j. Throughout, \mathbf{u}_j = \mathbf{a}_j - \sum_{i<j}\langle\mathbf{a}_j,\mathbf{q}_i\rangle\mathbf{q}_i denotes the unnormalized Gram-Schmidt vector and \mathbf{q}_j = \mathbf{u}_j/\|\mathbf{u}_j\|.

Base case (j=1): \mathbf{q}_1 = \mathbf{a}_1/\|\mathbf{a}_1\| is a unit vector (since \mathbf{a}_1 \neq \mathbf{0} by independence), and \operatorname{span}(\mathbf{q}_1) = \operatorname{span}(\mathbf{a}_1). OK

Inductive step: Assume the claim holds for j-1. Define S_{j-1} = \operatorname{span}(\mathbf{a}_1,\ldots,\mathbf{a}_{j-1}) = \operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_{j-1}).

(i) \mathbf{u}_j \neq \mathbf{0}: Suppose for contradiction \mathbf{u}_j = \mathbf{0}. Then \mathbf{a}_j = \sum_{i<j}\langle\mathbf{a}_j,\mathbf{q}_i\rangle\mathbf{q}_i \in S_{j-1} = \operatorname{span}(\mathbf{a}_1,\ldots,\mathbf{a}_{j-1}), contradicting linear independence.

(ii) \mathbf{q}_j \perp \mathbf{q}_\ell for \ell < j: For any \ell < j:

\langle\mathbf{u}_j, \mathbf{q}_\ell\rangle = \langle\mathbf{a}_j, \mathbf{q}_\ell\rangle - \sum_{i<j}\langle\mathbf{a}_j,\mathbf{q}_i\rangle\langle\mathbf{q}_i,\mathbf{q}_\ell\rangle = \langle\mathbf{a}_j,\mathbf{q}_\ell\rangle - \langle\mathbf{a}_j,\mathbf{q}_\ell\rangle\cdot 1 = 0

(using the inductive hypothesis \langle\mathbf{q}_i,\mathbf{q}_\ell\rangle = \delta_{i\ell}). Since \mathbf{q}_j = \mathbf{u}_j/\|\mathbf{u}_j\|, we also have \langle\mathbf{q}_j,\mathbf{q}_\ell\rangle = 0. OK

(iii) Same span: \mathbf{q}_j = \mathbf{u}_j/\|\mathbf{u}_j\| lies in \operatorname{span}(\mathbf{a}_j, \mathbf{q}_1,\ldots,\mathbf{q}_{j-1}) = \operatorname{span}(\mathbf{a}_1,\ldots,\mathbf{a}_j). Conversely, \mathbf{a}_j = \mathbf{u}_j + \sum_{i<j}\langle\mathbf{a}_j,\mathbf{q}_i\rangle\mathbf{q}_i = \|\mathbf{u}_j\|\mathbf{q}_j + \sum_{i<j}\langle\mathbf{a}_j,\mathbf{q}_i\rangle\mathbf{q}_i \in \operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_j). OK

By induction, the claim holds for all j \leq k. \square




Appendix I: The Stiefel Manifold and Optimization on Orthogonal Groups

I.1 The Stiefel Manifold \text{St}(n, k)

The set of n \times k matrices with orthonormal columns is the Stiefel manifold:

\text{St}(n, k) = \{Q \in \mathbb{R}^{n \times k} : Q^\top Q = I_k\}

Special cases:

  • \text{St}(n, n) = O(n) - the full orthogonal group
  • \text{St}(n, 1) = S^{n-1} - the unit sphere in \mathbb{R}^n
  • \text{St}(n, k) for 1 < k < n - Stiefel manifolds proper

Dimension: \dim(\text{St}(n,k)) = nk - k(k+1)/2

Intuition: We have nk free parameters (the entries of an n \times k matrix), subject to k(k+1)/2 constraints (k normalization constraints + k(k-1)/2 orthogonality constraints between distinct columns).

Tangent space at Q \in \text{St}(n,k): The tangent space consists of matrices \dot{Q} satisfying Q^\top\dot{Q} + \dot{Q}^\top Q = 0, i.e., Q^\top\dot{Q} is skew-symmetric:

T_Q\text{St}(n,k) = \{\dot{Q} \in \mathbb{R}^{n \times k} : Q^\top\dot{Q} + \dot{Q}^\top Q = 0\}

I.2 Riemannian Gradient Descent on the Stiefel Manifold

To minimize f(Q) subject to Q \in \text{St}(n,k):

Step 1: Euclidean gradient. Compute \nabla_Q f in \mathbb{R}^{n \times k}.

Step 2: Project to tangent space. The Riemannian gradient is the projection of \nabla_Q f onto T_Q\text{St}(n,k):

\text{grad}_f(Q) = \nabla_Q f - Q\,\text{sym}(Q^\top\nabla_Q f)

where \text{sym}(M) = (M + M^\top)/2 is the symmetric part.

Step 3: Update and retract. Move in the direction of the negative Riemannian gradient, then retract back to the manifold:

Retraction via QR: Q_{\text{new}} = \text{qr}(Q - \eta\,\text{grad}_f(Q)) (take the Q factor from the QR decomposition)

Retraction via Cayley: Q_{\text{new}} = (I + \eta W/2)^{-1}(I - \eta W/2)Q for a suitable skew-symmetric W
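
The three steps collapse into a few lines of NumPy. The sketch below assumes the toy objective f(Q) = -\operatorname{tr}(Q^\top A Q), whose minimizers span the top-k eigenvectors of a symmetric A; the function name, step size, and iteration count are illustrative choices, not a prescribed recipe.

import numpy as np

def stiefel_gd_step(Q, euclid_grad, lr=0.05):
    """One Riemannian gradient step on St(n, k) with QR retraction (sketch)."""
    sym = lambda M: (M + M.T) / 2
    # Step 2: project the Euclidean gradient onto the tangent space at Q
    riem_grad = euclid_grad - Q @ sym(Q.T @ euclid_grad)
    # Step 3: move downhill, then retract back onto the manifold via QR
    Q_new, R = np.linalg.qr(Q - lr * riem_grad)
    # Fix the column-sign ambiguity of QR so the retraction varies continuously
    return Q_new * np.sign(np.diag(R))

# Toy problem: minimize f(Q) = -tr(Q^T A Q) over St(n, k)
rng = np.random.default_rng(0)
n, k = 8, 3
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))

for _ in range(200):
    grad = -2 * A @ Q                      # Euclidean gradient of f
    Q = stiefel_gd_step(Q, grad)

print("orthonormality error:", np.max(np.abs(Q.T @ Q - np.eye(k))))
print("f(Q) =", -np.trace(Q.T @ A @ Q),
      " vs optimum:", -np.sum(np.sort(np.linalg.eigvalsh(A))[-k:]))

Note that the iterate stays exactly (to rounding) on the Stiefel manifold after every step, regardless of how crude the step size is; only the objective value depends on tuning.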

I.3 Applications in Modern AI

Orthogonal RNNs (Henaff et al. 2016, Arjovsky et al. 2016). Vanilla RNNs suffer from vanishing/exploding gradients because the recurrence matrix W can have singular values \neq 1. Unitary/orthogonal RNNs constrain W \in O(n), ensuring stable gradient flow. Training optimizes over O(n) using Stiefel manifold optimization or a parameterization via the Cayley transform.
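
A minimal sketch of the Cayley-transform parameterization (the skew-symmetric matrix here is random, purely for illustration): any skew-symmetric S maps to an orthogonal W, so unconstrained updates to the entries of S keep the recurrence matrix exactly orthogonal.

import numpy as np

rng = np.random.default_rng(4)
n = 6
M = rng.standard_normal((n, n))
S = M - M.T                                          # skew-symmetric: S^T = -S
W = np.linalg.solve(np.eye(n) + S, np.eye(n) - S)    # Cayley transform: W = (I + S)^{-1}(I - S)

print(np.max(np.abs(W.T @ W - np.eye(n))))           # ~1e-15: W is orthogonal
print(np.linalg.svd(W, compute_uv=False))            # all singular values equal 1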

Rotation-Equivariant Networks. Networks for 3D point cloud processing (SE(3)-equivariant networks, e3nn) explicitly parameterize their weight tensors using representations of SO(3). The convolution filters transform predictably under 3D rotations of the input, making the network geometry-aware.

LoRA with Orthogonal Factors. Standard LoRA approximates a weight update as \Delta W = BA for low-rank matrices B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times d}. GLoRA, DoRA, and OLoRA variants constrain B or A to have orthonormal columns (lying on the Stiefel manifold), improving gradient flow during fine-tuning and reducing the rank collapse problem.
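
As a purely hypothetical illustration of the constraint (not the exact recipe of any of the variants above), one way to keep the left factor on the Stiefel manifold is to re-orthonormalize it with QR after each update:

import numpy as np

rng = np.random.default_rng(3)
d, r = 16, 4
B = rng.standard_normal((d, r))             # left low-rank factor
A = 0.01 * rng.standard_normal((r, d))      # right low-rank factor (small init)

B, _ = np.linalg.qr(B)                      # retract B onto St(d, r): orthonormal columns
delta_W = B @ A                             # rank-r update with an orthonormal left factor

print(np.max(np.abs(B.T @ B - np.eye(r))))  # ~1e-16: columns of B are orthonormal
print(np.linalg.matrix_rank(delta_W))       # r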

I.4 The Polar Decomposition as Nearest Orthogonal Matrix

Problem: Given A \in \mathbb{R}^{n \times n} (not necessarily orthogonal), find the nearest orthogonal matrix:

Q^* = \arg\min_{Q \in O(n)} \|A - Q\|_F

Solution: The polar decomposition A = QS gives Q^* = UV^\top, where A = U\Sigma V^\top is the SVD.

Proof: For any P \in O(n):

\|A - P\|_F^2 = \|U\Sigma V^\top - P\|_F^2 = \|\Sigma - U^\top P V\|_F^2

(using unitary invariance of the Frobenius norm). Let R = U^\top P V \in O(n). Then:

\|\Sigma - R\|_F^2 = \|\Sigma\|_F^2 - 2\langle\Sigma, R\rangle_F + \|R\|_F^2 = \sum_i\sigma_i^2 - 2\text{tr}(\Sigma R) + n

This is minimized when \text{tr}(\Sigma R) is maximized. Since \Sigma is diagonal with nonnegative entries and every diagonal entry of an orthogonal matrix satisfies |R_{ii}| \leq 1, we have \text{tr}(\Sigma R) = \sum_i\sigma_i R_{ii} \leq \sum_i\sigma_i, with equality at R = I (i.e., P = UV^\top). \square
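
A quick numerical check of this characterization (the random test matrix and the QR-based comparison point are just for illustration):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))

U, s, Vt = np.linalg.svd(A)
Q_star = U @ Vt                            # polar factor: the claimed nearest orthogonal matrix
Q_qr, _ = np.linalg.qr(A)                  # another orthogonal matrix built from A, for comparison

print(np.max(np.abs(Q_star.T @ Q_star - np.eye(5))))   # ~1e-15: Q* is orthogonal
print(np.linalg.norm(A - Q_star, 'fro'))                # distance to the polar factor ...
print(np.linalg.norm(A - Q_qr, 'fro'))                  # ... is never beaten by the QR factor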

Muon optimizer interpretation: The Muon optimizer applies the update W \leftarrow W - \eta Q_G, where Q_G = UV^\top is the nearest orthogonal matrix to the gradient G. This can be computed efficiently via Newton-Schulz iterations:

X_0 = G/\|G\|, \quad X_{k+1} = \frac{3}{2}X_k - \frac{1}{2}X_k X_k^\top X_k

which converge quadratically to the orthogonal factor of the polar decomposition of G.
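
A minimal sketch of this iteration, checked against the SVD-based polar factor (matrix size, seed, and iteration count are illustrative choices):

import numpy as np

def newton_schulz_orth(G, n_iter=25):
    """Approximate the orthogonal polar factor UV^T of G with the Newton-Schulz iteration above."""
    X = G / np.linalg.norm(G)              # scale so every singular value lies in (0, 1]
    for _ in range(n_iter):
        X = 1.5 * X - 0.5 * X @ X.T @ X    # X_{k+1} = (3/2) X_k - (1/2) X_k X_k^T X_k
    return X

rng = np.random.default_rng(2)
G = rng.standard_normal((6, 6))

U, _, Vt = np.linalg.svd(G)
Q_polar = U @ Vt                           # exact orthogonal polar factor via SVD

Q_ns = newton_schulz_orth(G)
print(np.max(np.abs(Q_ns - Q_polar)))              # small: Newton-Schulz matches the SVD answer
print(np.max(np.abs(Q_ns.T @ Q_ns - np.eye(6))))   # small: the result is numerically orthogonal

The iteration only touches G through matrix products, which is what makes it attractive on accelerators compared with an explicit SVD.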


Appendix J: Quick Reference - Orthogonality Vocabulary

| Term | Definition | Key Formula |
|---|---|---|
| Orthogonal vectors | \langle\mathbf{u},\mathbf{v}\rangle = 0 | \mathbf{u}^\top\mathbf{v} = 0 |
| Orthonormal vectors | Orthogonal + unit length | \mathbf{q}_i^\top\mathbf{q}_j = \delta_{ij} |
| Orthonormal basis (ONB) | Orthonormal set that spans the space | \mathbf{v} = \sum_i\langle\mathbf{v},\mathbf{q}_i\rangle\mathbf{q}_i |
| Orthogonal matrix | Square, orthonormal columns | Q^\top Q = QQ^\top = I |
| Orthogonal group | All n \times n orthogonal matrices | O(n), closed under \cdot, \top, ^{-1} |
| Special orthogonal group | Rotations (\det = +1) | SO(n) \subset O(n) |
| Orthogonal projection | \mathbf{v}_S = closest point in S | P_S = QQ^\top, P^2 = P, P^\top = P |
| Orthogonal complement | Vectors \perp all of S | S^\perp = \{\mathbf{v}: \langle\mathbf{v},\mathbf{s}\rangle = 0\ \forall\,\mathbf{s}\in S\} |
| Gram-Schmidt | Orthogonalization algorithm | Build ONB from any basis |
| Householder reflector | Reflection across a hyperplane | H = I - 2\mathbf{n}\mathbf{n}^\top |
| QR decomposition | A = QR factorization | Q: orthonormal columns, R: upper triangular |
| Rayleigh quotient | Quotient for eigenvalue bounds | R_A(\mathbf{x}) = \mathbf{x}^\top A\mathbf{x}/\|\mathbf{x}\|^2 |
| Spectral theorem | Symmetric A has an ONB of eigenvectors | A = Q\Lambda Q^\top |
| Isometry | Norm-preserving linear map | \|Q\mathbf{x}\| = \|\mathbf{x}\| |
| Stiefel manifold | Matrices with orthonormal columns | \text{St}(n,k) = \{Q: Q^\top Q = I_k\} |
| Polar decomposition | Orthogonal \times symmetric factorization | A = QS |
| Unitary matrix | Complex analog of orthogonal | U^*U = I, entries \in \mathbb{C} |

Appendix K: Common Numerical Experiments

K.1 Verifying Gram-Schmidt Stability

The following experiment quantifies the loss of orthogonality in classical Gram-Schmidt (CGS) versus modified Gram-Schmidt (MGS):

import numpy as np

def cgs(A):
    """Classical Gram-Schmidt QR: every projection of column j uses the original column."""
    m, n = A.shape
    Q = np.zeros_like(A, dtype=float)
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].astype(float)
        R[:j, j] = Q[:, :j].T @ v          # all coefficients computed from the original column
        v -= Q[:, :j] @ R[:j, j]           # subtract all projections in one shot
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

def mgs(A):
    """Modified Gram-Schmidt QR: each new q_j is immediately removed from the remaining columns."""
    m, n = A.shape
    Q = A.astype(float).copy()
    R = np.zeros((n, n))
    for j in range(n):
        R[j, j] = np.linalg.norm(Q[:, j])
        Q[:, j] /= R[j, j]
        R[j, j+1:] = Q[:, j] @ Q[:, j+1:]              # coefficients against the *updated* columns
        Q[:, j+1:] -= np.outer(Q[:, j], R[j, j+1:])    # deflate q_j from the rest right away
    return Q, R

# Test on Hilbert matrix (ill-conditioned)
n = 12
H = 1.0 / (np.arange(1, n+1)[:, None] + np.arange(1, n+1)[None, :] - 1)

Q_cgs, _ = cgs(H)
Q_mgs, _ = mgs(H)

err_cgs = np.max(np.abs(Q_cgs.T @ Q_cgs - np.eye(n)))
err_mgs = np.max(np.abs(Q_mgs.T @ Q_mgs - np.eye(n)))

print(f"CGS orthogonality error: {err_cgs:.2e}")   # ~1e-4 (catastrophic)
print(f"MGS orthogonality error: {err_mgs:.2e}")   # ~1e-8 (acceptable)
print(f"kappa(H): {np.linalg.cond(H):.2e}")        # ~10^16 for n=12

K.2 Orthogonal Initialization Impact on Gradient Norms

import numpy as np

def grad_norm_experiment(n=64, L=20, trials=100):
    """Compare gradient norms: Gaussian init vs Orthogonal init."""
    results = {'gaussian': [], 'orthogonal': []}
    
    for _ in range(trials):
        # Gaussian initialization
        Ws_gauss = [np.random.randn(n, n) / np.sqrt(n) for _ in range(L)]
        prod_gauss = Ws_gauss[0]
        for W in Ws_gauss[1:]:
            prod_gauss = prod_gauss @ W
        results['gaussian'].append(np.linalg.norm(prod_gauss, 2))
        
        # Orthogonal initialization
        Ws_orth = []
        for _ in range(L):
            M = np.random.randn(n, n)
            Q, _ = np.linalg.qr(M)
            Ws_orth.append(Q)
        prod_orth = Ws_orth[0]
        for W in Ws_orth[1:]:
            prod_orth = prod_orth @ W
        results['orthogonal'].append(np.linalg.norm(prod_orth, 2))
    
    print(f"Gaussian: mean={np.mean(results['gaussian']):.3e}, "
          f"std={np.std(results['gaussian']):.3e}")
    print(f"Orthogonal: mean={np.mean(results['orthogonal']):.3e}, "
          f"std={np.std(results['orthogonal']):.3e}")
    # Expected: the Gaussian product's spectral norm varies wildly from trial to trial,
    #           while a product of orthogonal matrices is orthogonal, so its norm is exactly 1.

grad_norm_experiment()

K.3 Rayleigh Quotient Iteration Convergence

import numpy as np

def rayleigh_quotient_iteration(A, x0, n_iter=20):
    """Track convergence of Rayleigh quotient iteration on a symmetric matrix A.

    RQI converges to *some* eigenpair (which one depends on x0), so we measure
    the distance from the Rayleigh quotient to the nearest eigenvalue of A.
    """
    x = x0 / np.linalg.norm(x0)
    eigs = np.linalg.eigvalsh(A)            # reference eigenvalues

    history = []
    for _ in range(n_iter):
        rho = x @ A @ x                     # Rayleigh quotient
        history.append(np.min(np.abs(eigs - rho)))
        try:
            y = np.linalg.solve(A - rho * np.eye(len(A)), x)
            x = y / np.linalg.norm(y)
        except np.linalg.LinAlgError:
            break                           # A - rho*I numerically singular: converged
    return history

A = np.array([[3., 1., 0.], [1., 2., 1.], [0., 1., 1.]])
x0 = np.random.randn(3)
errors = rayleigh_quotient_iteration(A, x0)
print("Errors per iteration:")
for i, e in enumerate(errors[:10]):
    print(f"  iter {i}: {e:.2e}")
# Expected: errors should decrease cubically (each step ~= previous^3)