Orthogonality and Orthonormality: Part 1 - Intuition to the Spectral Theorem for Symmetric Matrices

1. Intuition

1.1 What Orthogonality "Buys" You

Orthogonality is the key that makes linear algebra computationally tractable. Without it, working in a basis requires solving systems of equations. With it, everything simplifies to dot products.

Five computational miracles of orthogonality:

Miracle 1: Coordinates become dot products. In a general basis $\{\mathbf{b}_1, \ldots, \mathbf{b}_n\}$, finding the coordinates of a vector $\mathbf{v} = \sum_i c_i \mathbf{b}_i$ requires solving an $n \times n$ linear system. In an orthonormal basis $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$, the coordinate is $c_i = \langle \mathbf{v}, \mathbf{q}_i \rangle$ - just a dot product. No system to solve.

Miracle 2: Length is preserved component-wise. In an ONB: $\|\mathbf{v}\|^2 = \sum_i c_i^2$ (Parseval's identity). The squared norm is simply the sum of squared coordinates - no cross terms.

Miracle 3: Orthogonal matrices have condition number 1. The condition number of an orthogonal matrix $Q$ is $\|Q\| \cdot \|Q^{-1}\| = 1 \cdot 1 = 1$. This is the best possible - orthogonal systems are maximally well-conditioned. Solving $Q\mathbf{x} = \mathbf{b}$ is trivially $\mathbf{x} = Q^\top \mathbf{b}$.

Miracle 4: Projection is the best approximation. The orthogonal projection of $\mathbf{b}$ onto a subspace $S$ gives the closest point in $S$ to $\mathbf{b}$. This is the geometric foundation of least squares, PCA, and low-rank approximation.

Miracle 5: Composition of orthogonal maps is free. The composition of two isometries is another isometry. Chaining rotations, reflections, and permutations never amplifies numerical error.

THE ORTHOGONALITY ADVANTAGE
========================================================================

  General basis {b_1, ..., b_n}         Orthonormal basis {q_1, ..., q_n}
  -----------------------------        ----------------------------------

  Coordinates:  solve Bc = v           Coordinates:  c_i = <v, q_i> (dot!)
  Norm:         v^T B^T B v            Norm:         sum_i c_i^2 (Parseval)
  Projection:   A(A^T A)^{-1} A^T b    Projection:   QQ^T b (just Q^T then Q)
  Solve Ax=b:   Gaussian elim.         Solve Qx=b:   x = Q^T b (trivial)
  Condition#:   can be >> 1            Condition#:   exactly 1

========================================================================
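
These claims are easy to verify numerically. The following NumPy sketch (an illustration added here, not part of the original lesson; QR of a Gaussian matrix is just a convenient way to get a random orthonormal basis) checks Miracles 1-3:

import numpy as np

rng = np.random.default_rng(0)
n = 5
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthonormal basis q_1..q_n
v = rng.standard_normal(n)

c = Q.T @ v                                   # Miracle 1: coordinates are dot products
print(np.allclose(Q @ c, v))                  # reconstruction: v = sum_i c_i q_i
print(np.isclose(np.sum(c**2), v @ v))        # Miracle 2: Parseval's identity
print(np.linalg.cond(Q))                      # Miracle 3: condition number ~ 1.0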

1.2 Geometric Picture

In Euclidean space, two vectors are orthogonal if they meet at a right angle. The dot product measures "how much they point in the same direction":

$$\langle \mathbf{u}, \mathbf{v} \rangle = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta$$

  • $\theta = 0°$: $\langle\mathbf{u},\mathbf{v}\rangle = \|\mathbf{u}\|\|\mathbf{v}\|$ (parallel, maximum alignment)
  • $\theta = 90°$: $\langle\mathbf{u},\mathbf{v}\rangle = 0$ (orthogonal, zero alignment)
  • $\theta = 180°$: $\langle\mathbf{u},\mathbf{v}\rangle = -\|\mathbf{u}\|\|\mathbf{v}\|$ (antiparallel, maximum anti-alignment)

The projection picture. The orthogonal projection of $\mathbf{v}$ onto $\mathbf{u}$ is the "shadow" of $\mathbf{v}$ cast perpendicularly onto the line through $\mathbf{u}$:

$$\operatorname{proj}_{\mathbf{u}}\mathbf{v} = \frac{\langle\mathbf{v},\mathbf{u}\rangle}{\|\mathbf{u}\|^2}\mathbf{u}$$

The error $\mathbf{v} - \operatorname{proj}_{\mathbf{u}}\mathbf{v}$ is orthogonal to $\mathbf{u}$. This Pythagorean decomposition is the geometric core of all least-squares theory.
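
A quick numerical check of this decomposition (an illustrative sketch with arbitrary vectors u and v):

import numpy as np

u = np.array([2.0, 0.0, 1.0])
v = np.array([1.0, 3.0, -2.0])

proj = (v @ u) / (u @ u) * u     # proj_u(v) = (<v,u> / ||u||^2) u
residual = v - proj
print(residual @ u)              # ~0: the error is orthogonal to u
print(np.isclose(v @ v, proj @ proj + residual @ residual))  # Pythagoras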

Orthonormal bases as "rulers." A set of orthonormal vectors $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ forms a perfect coordinate system: each $\mathbf{q}_i$ is a unit "ruler" pointing in an independent direction. The coordinate of $\mathbf{v}$ along $\mathbf{q}_i$ is exactly $\langle\mathbf{v}, \mathbf{q}_i\rangle$ - how far $\mathbf{v}$ extends in the $i$-th direction.

1.3 Why Orthogonality Matters for AI

PCA finds orthogonal directions of variance. The principal components are the eigenvectors of the sample covariance matrix $C = X^\top X/(n-1)$ (for centered data $X$). By the spectral theorem (Section 8), these eigenvectors are orthogonal. PCA is possible - and gives uncorrelated components - precisely because of orthogonality.

Attention subspaces. In multi-head attention, each head projects queries and keys into a $d_k$-dimensional subspace. The heads are designed to attend to different aspects of the input. The more orthogonal the subspaces, the more independent the information each head extracts.

Orthogonal weight initialization (Saxe et al., 2013) initializes weight matrices as random orthogonal matrices (via QR decomposition of a random Gaussian matrix). This preserves the norm of activations and gradients through linear layers, enabling training of very deep networks without vanishing/exploding gradients.

RoPE (Su et al., 2021) encodes position $m$ by rotating the query/key vectors by angle $m\theta_i$ in each 2D subspace. The key property: the dot product between a rotated query at position $m$ and a rotated key at position $n$ depends only on $(m-n)$ - position appears as a relative rotation. This works because rotation is orthogonal: it preserves inner products.
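
A minimal 2D sketch of that relative-rotation property (illustrative; theta is an arbitrary assumed frequency, and rot is a hypothetical helper, not the actual RoPE implementation):

import numpy as np

def rot(angle):
    # 2x2 rotation matrix, an element of SO(2)
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

theta = 0.3                       # assumed per-dimension frequency
q = np.array([1.0, 2.0])          # query in one 2D subspace
k = np.array([-0.5, 1.0])         # key in the same subspace

m, n = 7, 3
score = (rot(m * theta) @ q) @ (rot(n * theta) @ k)
shifted = (rot((m + 10) * theta) @ q) @ (rot((n + 10) * theta) @ k)
print(np.isclose(score, shifted))  # True: the score depends only on m - n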

Numerical stability. QR decomposition is the backbone of numerical linear algebra. Householder QR is used to solve least-squares problems, compute eigenvalues (QR iteration), and factor matrices. Its superiority over LU (for rectangular systems) stems from orthogonality: the Householder reflectors are orthogonal transformations that never amplify errors.

1.4 Historical Timeline

Year | Person | Contribution
1801 | Gauss | Least squares by orthogonal projection (unpublished until 1809)
1805 | Legendre | Published least squares - minimizing the sum of squared residuals
1883 | Jørgen Gram | Gram-Schmidt process published in Danish
1907 | Erhard Schmidt | Gram-Schmidt republished and popularized
1958 | Alston Householder | Householder reflections for stable QR factorization
1961 | J. G. F. Francis | Practical QR algorithm for eigenvalue computation
1961 | V. Kublanovskaya | Independent discovery of the QR algorithm (USSR)
1965 | James Wilkinson | Error analysis of Householder QR; numerical stability theory
2013 | Saxe, McClelland, Ganguli | Orthogonal initialization for deep linear networks
2021 | Su et al. | RoPE: rotary positional encoding via orthogonal rotation matrices
2021 | Hua et al. | Transformer quality with orthogonal attention mechanisms

2. Formal Definitions

2.1 Inner Products and the Angle Formula

Definition (Inner Product). A real inner product on a vector space $V$ is a function $\langle \cdot, \cdot \rangle: V \times V \to \mathbb{R}$ satisfying:

  1. Symmetry: $\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle$
  2. Linearity in the first argument: $\langle a\mathbf{u} + b\mathbf{v}, \mathbf{w} \rangle = a\langle\mathbf{u},\mathbf{w}\rangle + b\langle\mathbf{v},\mathbf{w}\rangle$
  3. Positive definiteness: $\langle \mathbf{v}, \mathbf{v} \rangle \geq 0$, with equality iff $\mathbf{v} = \mathbf{0}$

The standard inner product (dot product) on $\mathbb{R}^n$ is $\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^\top \mathbf{v} = \sum_{i=1}^n u_i v_i$.

The induced norm $\|\mathbf{v}\| = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$ is the Euclidean length of $\mathbf{v}$.

Theorem 2.1.1 (Cauchy-Schwarz Inequality). For any $\mathbf{u}, \mathbf{v}$ in an inner product space:

$$|\langle\mathbf{u},\mathbf{v}\rangle| \leq \|\mathbf{u}\| \cdot \|\mathbf{v}\|$$

with equality iff $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent (one is a scalar multiple of the other).

Proof. For any $t \in \mathbb{R}$, $\|\mathbf{u} + t\mathbf{v}\|^2 \geq 0$. Expanding:

$$\|\mathbf{u}\|^2 + 2t\langle\mathbf{u},\mathbf{v}\rangle + t^2\|\mathbf{v}\|^2 \geq 0$$

This is a quadratic in $t$ that is always non-negative, so its discriminant must be $\leq 0$:

$$4\langle\mathbf{u},\mathbf{v}\rangle^2 - 4\|\mathbf{u}\|^2\|\mathbf{v}\|^2 \leq 0$$

which gives $|\langle\mathbf{u},\mathbf{v}\rangle| \leq \|\mathbf{u}\|\|\mathbf{v}\|$. Equality holds iff the minimizer $t^* = -\langle\mathbf{u},\mathbf{v}\rangle/\|\mathbf{v}\|^2$ gives $\|\mathbf{u} + t^*\mathbf{v}\| = 0$, i.e., $\mathbf{u} = -t^*\mathbf{v}$. $\square$

The angle formula. By Cauchy-Schwarz, $-1 \leq \langle\mathbf{u},\mathbf{v}\rangle/(\|\mathbf{u}\|\|\mathbf{v}\|) \leq 1$, so it is valid to define the angle $\theta$ between $\mathbf{u}$ and $\mathbf{v}$ by:

$$\cos\theta = \frac{\langle\mathbf{u},\mathbf{v}\rangle}{\|\mathbf{u}\|\|\mathbf{v}\|}, \quad \theta \in [0, \pi]$$

Corollary (Triangle Inequality). $\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|$.

Proof: $\|\mathbf{u}+\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + 2\langle\mathbf{u},\mathbf{v}\rangle + \|\mathbf{v}\|^2 \leq \|\mathbf{u}\|^2 + 2\|\mathbf{u}\|\|\mathbf{v}\| + \|\mathbf{v}\|^2 = (\|\mathbf{u}\|+\|\mathbf{v}\|)^2$. $\square$
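
Both inequalities, and the angle formula, are easy to sanity-check numerically (an illustrative sketch):

import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 0.0, 1.0])

rhs = np.linalg.norm(u) * np.linalg.norm(v)
print(abs(u @ v) <= rhs)                                               # Cauchy-Schwarz
print(np.linalg.norm(u + v) <= np.linalg.norm(u) + np.linalg.norm(v)) # triangle inequality
theta = np.arccos(np.clip((u @ v) / rhs, -1.0, 1.0))                   # the angle formula
print(np.degrees(theta))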

2.2 Orthogonal and Orthonormal Vectors

Definition. Two vectors $\mathbf{u}, \mathbf{v}$ are orthogonal (written $\mathbf{u} \perp \mathbf{v}$) if $\langle\mathbf{u},\mathbf{v}\rangle = 0$.

A vector $\mathbf{v}$ is a unit vector if $\|\mathbf{v}\| = 1$. The normalization of nonzero $\mathbf{v}$ is $\hat{\mathbf{v}} = \mathbf{v}/\|\mathbf{v}\|$.

Two vectors are orthonormal if they are both orthogonal AND both unit vectors:

$$\langle\mathbf{q}_i, \mathbf{q}_j\rangle = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$

Key consequences of orthogonality:

1. The zero vector is orthogonal to everything: $\mathbf{0} \perp \mathbf{v}$ for all $\mathbf{v}$ (since $\langle\mathbf{0},\mathbf{v}\rangle = 0$).

2. Pythagorean theorem: If $\mathbf{u} \perp \mathbf{v}$, then $\|\mathbf{u} + \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$. Proof: $\|\mathbf{u}+\mathbf{v}\|^2 = \langle\mathbf{u}+\mathbf{v},\mathbf{u}+\mathbf{v}\rangle = \|\mathbf{u}\|^2 + 2\langle\mathbf{u},\mathbf{v}\rangle + \|\mathbf{v}\|^2$, and the middle term is zero. $\square$

3. Generalized Pythagorean theorem: For mutually orthogonal $\mathbf{v}_1, \ldots, \mathbf{v}_k$:

$$\left\|\sum_{i=1}^k \mathbf{v}_i\right\|^2 = \sum_{i=1}^k \|\mathbf{v}_i\|^2$$

4. Linearity of orthogonality: If $\mathbf{u} \perp \mathbf{v}$ and $\mathbf{u} \perp \mathbf{w}$, then $\mathbf{u} \perp (a\mathbf{v} + b\mathbf{w})$ for all scalars $a, b$.

2.3 Orthogonal Sets and Linear Independence

Definition. A set $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ is an orthogonal set if $\langle\mathbf{v}_i,\mathbf{v}_j\rangle = 0$ for all $i \neq j$. It is an orthonormal set if additionally each $\|\mathbf{v}_i\| = 1$.

Theorem 2.3.1. Any orthogonal set of nonzero vectors is linearly independent.

Proof. Suppose $\sum_{i=1}^k c_i \mathbf{v}_i = \mathbf{0}$. Take the inner product of both sides with $\mathbf{v}_j$:

$$\left\langle \sum_i c_i \mathbf{v}_i, \mathbf{v}_j \right\rangle = c_j \langle\mathbf{v}_j,\mathbf{v}_j\rangle = c_j \|\mathbf{v}_j\|^2 = 0$$

Since $\mathbf{v}_j \neq \mathbf{0}$, we have $\|\mathbf{v}_j\|^2 > 0$, so $c_j = 0$. This holds for all $j$. $\square$

Remark. This theorem gives an effortless proof of linear independence for orthogonal sets - no row reduction needed. The inner product "isolates" each coefficient.

Non-examples:

  • $\{(1,0), (1,1)\}$ is not orthogonal: $\langle(1,0),(1,1)\rangle = 1 \neq 0$.
  • $\{(1,1,0)/\sqrt{2},\ (0,0,1),\ (1,-1,0)/\sqrt{2},\ (1,0,0)\}$ - the last vector breaks orthogonality: its inner product with $(1,1,0)/\sqrt{2}$ is $1/\sqrt{2} \neq 0$.

2.4 Orthonormal Bases

Definition. An orthonormal basis (ONB) for $V$ is a set $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ that is both orthonormal and a basis for $V$.

Theorem 2.4.1 (Fourier Coefficients). If $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ is an ONB for $V$, then for any $\mathbf{v} \in V$:

$$\mathbf{v} = \sum_{i=1}^n \langle\mathbf{v}, \mathbf{q}_i\rangle \mathbf{q}_i$$

The coefficients $\hat{v}_i = \langle\mathbf{v}, \mathbf{q}_i\rangle$ are called Fourier coefficients (or coordinates of $\mathbf{v}$ in the ONB).

Proof. Since $\{\mathbf{q}_i\}$ is a basis, $\mathbf{v} = \sum_j c_j \mathbf{q}_j$ for some unique $c_j$. Taking the inner product with $\mathbf{q}_i$:

$$\langle\mathbf{v}, \mathbf{q}_i\rangle = \sum_j c_j \langle\mathbf{q}_j, \mathbf{q}_i\rangle = \sum_j c_j \delta_{ji} = c_i \quad \square$$

Theorem 2.4.2 (Parseval's Identity). Under the same conditions:

$$\|\mathbf{v}\|^2 = \sum_{i=1}^n |\langle\mathbf{v}, \mathbf{q}_i\rangle|^2 = \sum_{i=1}^n |\hat{v}_i|^2$$

Proof. $\|\mathbf{v}\|^2 = \langle\sum_i \hat{v}_i\mathbf{q}_i, \sum_j \hat{v}_j\mathbf{q}_j\rangle = \sum_{i,j} \hat{v}_i\hat{v}_j \langle\mathbf{q}_i,\mathbf{q}_j\rangle = \sum_i \hat{v}_i^2$. $\square$

Remark. Parseval's identity says: in an ONB, the squared norm equals the sum of squared coordinates. There are no cross terms - the orthonormality kills them all. This is the high-dimensional Pythagorean theorem.

Matrix form. Assembling the ONB into a matrix $Q = [\mathbf{q}_1 | \cdots | \mathbf{q}_n]$, the Fourier decomposition is simply $\mathbf{v} = Q(Q^\top\mathbf{v})$: first compute the coordinates $\hat{\mathbf{v}} = Q^\top\mathbf{v}$, then reconstruct $\mathbf{v} = Q\hat{\mathbf{v}}$.

2.5 Orthogonal Complements

Definition. Given a subspace $S \subseteq V$, the orthogonal complement is:

$$S^\perp = \{\mathbf{v} \in V : \langle\mathbf{v},\mathbf{s}\rangle = 0 \text{ for all } \mathbf{s} \in S\}$$

Theorem 2.5.1. $S^\perp$ is a subspace of $V$.

Proof: If $\mathbf{u}, \mathbf{v} \in S^\perp$ and $a, b \in \mathbb{R}$, then for any $\mathbf{s} \in S$: $\langle a\mathbf{u}+b\mathbf{v}, \mathbf{s}\rangle = a\langle\mathbf{u},\mathbf{s}\rangle + b\langle\mathbf{v},\mathbf{s}\rangle = 0$. $\square$

Theorem 2.5.2 (Orthogonal Decomposition). If $V$ is finite-dimensional and $S$ is a subspace, then:

$$V = S \oplus S^\perp \quad (\text{direct sum})$$

Every $\mathbf{v} \in V$ decomposes uniquely as $\mathbf{v} = \mathbf{v}_S + \mathbf{v}_{S^\perp}$ with $\mathbf{v}_S \in S$ and $\mathbf{v}_{S^\perp} \in S^\perp$.

Moreover: $\dim(S) + \dim(S^\perp) = \dim(V)$.

Proof. Let $\{\mathbf{q}_1, \ldots, \mathbf{q}_k\}$ be an ONB for $S$. Define $\mathbf{v}_S = \sum_{i=1}^k \langle\mathbf{v},\mathbf{q}_i\rangle\mathbf{q}_i$ (the projection onto $S$) and $\mathbf{v}_{S^\perp} = \mathbf{v} - \mathbf{v}_S$. Then $\mathbf{v} = \mathbf{v}_S + \mathbf{v}_{S^\perp}$ with $\mathbf{v}_S \in S$. For any $\mathbf{q}_j$:

$$\langle\mathbf{v}_{S^\perp}, \mathbf{q}_j\rangle = \langle\mathbf{v},\mathbf{q}_j\rangle - \langle\mathbf{v}_S,\mathbf{q}_j\rangle = \langle\mathbf{v},\mathbf{q}_j\rangle - \langle\mathbf{v},\mathbf{q}_j\rangle = 0$$

So $\mathbf{v}_{S^\perp} \perp \mathbf{q}_j$ for all $j$, hence $\mathbf{v}_{S^\perp} \in S^\perp$. Uniqueness: if $\mathbf{v} = \mathbf{w}_S + \mathbf{w}_{S^\perp}$ is another such decomposition, then $\mathbf{v}_S - \mathbf{w}_S = \mathbf{w}_{S^\perp} - \mathbf{v}_{S^\perp} \in S \cap S^\perp = \{\mathbf{0}\}$. $\square$
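
A numerical sketch of this decomposition (illustrative; $S$ is taken to be the column span of a random matrix):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2))
Q, _ = np.linalg.qr(A)           # ONB q_1, q_2 for S = col(A)
v = rng.standard_normal(5)

v_S = Q @ (Q.T @ v)              # component in S
v_perp = v - v_S                 # component in S-perp
print(np.allclose(Q.T @ v_perp, 0))                    # v_perp is orthogonal to S
print(np.isclose(v @ v, v_S @ v_S + v_perp @ v_perp))  # Pythagoras across the direct sum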

Connection to the four fundamental subspaces. For $A \in \mathbb{R}^{m \times n}$:

  • $\operatorname{null}(A) = \operatorname{row}(A)^\perp$ in $\mathbb{R}^n$
  • $\operatorname{null}(A^\top) = \operatorname{col}(A)^\perp$ in $\mathbb{R}^m$

These orthogonality relations are the deep structure behind the four fundamental subspaces (see 04: Linear Transformations 4.5).

2.6 Orthogonality in Complex Inner Product Spaces

The theory extends to complex vector spaces with a Hermitian (sesquilinear) inner product:

$$\langle\mathbf{u},\mathbf{v}\rangle = \sum_{k=1}^n \overline{u_k}\,v_k$$

The complex analog of orthogonal matrices is unitary matrices $U \in \mathbb{C}^{n \times n}$ satisfying $U^*U = I$, where $U^* = \overline{U}^\top$ is the conjugate transpose.

Unitary group: $\mathcal{U}(n) = \{U \in \mathbb{C}^{n \times n} : U^*U = I\}$ contains all $n \times n$ unitary matrices. It reduces to $O(n)$ when restricted to real matrices.

Special unitary group: $\mathcal{SU}(n) = \{U \in \mathcal{U}(n) : \det(U) = 1\}$.

Key properties of unitary matrices:

  • $|\det(U)| = 1$ (a complex number of modulus 1, not just $\pm 1$)
  • Eigenvalues lie on the unit circle: $|\lambda_i| = 1$
  • They preserve complex inner products: $\langle U\mathbf{u}, U\mathbf{v}\rangle = \langle\mathbf{u},\mathbf{v}\rangle$
  • The DFT matrix $F_n/\sqrt{n}$ is unitary (not orthogonal, since its entries are complex)

Hermitian matrices ($A = A^*$) are the complex analogs of real symmetric matrices. The spectral theorem holds: every Hermitian matrix has real eigenvalues and a unitary eigenbasis.

For AI, complex-valued neural networks and quantum computing naturally use unitary matrices. The DFT, crucial for signal processing and some positional encoding schemes, is a unitary transform on $\mathbb{C}^n$.

2.7 Abstract Inner Product Spaces: Axioms

We have been working with the Euclidean inner product $\langle\mathbf{u},\mathbf{v}\rangle = \mathbf{u}^\top\mathbf{v}$. More generally, an inner product on a real vector space $V$ is a bilinear map $\langle\cdot,\cdot\rangle: V \times V \to \mathbb{R}$ satisfying:

Axiom | Expression | Meaning
Symmetry | $\langle\mathbf{u},\mathbf{v}\rangle = \langle\mathbf{v},\mathbf{u}\rangle$ | Order doesn't matter
Linearity | $\langle a\mathbf{u}+b\mathbf{v},\mathbf{w}\rangle = a\langle\mathbf{u},\mathbf{w}\rangle + b\langle\mathbf{v},\mathbf{w}\rangle$ | Linear in the first argument
Positive definiteness | $\langle\mathbf{v},\mathbf{v}\rangle \geq 0$, and $= 0 \Rightarrow \mathbf{v} = \mathbf{0}$ | Self-inner product is positive

Any symmetric positive definite matrix $M \succ 0$ defines an inner product:

$$\langle\mathbf{u},\mathbf{v}\rangle_M = \mathbf{u}^\top M\mathbf{v}$$

Example (precision-weighted inner product): In Bayesian statistics, the Mahalanobis distance between $\mathbf{x}$ and $\boldsymbol{\mu}$ is:

$$d_M(\mathbf{x},\boldsymbol{\mu}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})}$$

This is the distance induced by the inner product $\langle\mathbf{u},\mathbf{v}\rangle_{\Sigma^{-1}} = \mathbf{u}^\top\Sigma^{-1}\mathbf{v}$. In this metric, vectors along high-variance directions of $\Sigma$ are "shorter" (the precision $\Sigma^{-1}$ down-weights those directions).

Orthogonality is metric-dependent. Vectors that are orthogonal in the Euclidean metric may not be orthogonal in the Mahalanobis metric, and vice versa. The choice of inner product determines what "perpendicular" means.
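
A two-line check of this metric dependence (illustrative; the covariance Sigma below is an arbitrary assumption):

import numpy as np

Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])   # an assumed covariance matrix (SPD)
P = np.linalg.inv(Sigma)         # precision defines <u, v>_P = u^T P v

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(u @ v)        # 0: orthogonal in the Euclidean metric
print(u @ P @ v)    # nonzero: not orthogonal in the Mahalanobis metric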

For ML: The natural gradient method (Amari, 1998) uses the Fisher information matrix $F$ to define the inner product, giving parameter updates $\Delta\theta = F^{-1}\nabla\mathcal{L}$. This is gradient descent in the Riemannian metric $\langle\Delta\theta_1, \Delta\theta_2\rangle_F = \Delta\theta_1^\top F \Delta\theta_2$.


3. Orthogonal Matrices

3.1 Definition and Characterizations

Definition. A square matrix $Q \in \mathbb{R}^{n \times n}$ is orthogonal if:

$$Q^\top Q = QQ^\top = I_n$$

Equivalently, $Q^{-1} = Q^\top$ - the transpose is the inverse.

Characterization via columns. $Q$ is orthogonal if and only if its columns $\mathbf{q}_1, \ldots, \mathbf{q}_n$ form an orthonormal basis for $\mathbb{R}^n$:

$$\mathbf{q}_i^\top \mathbf{q}_j = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$

The condition $Q^\top Q = I$ encodes exactly these $n^2$ inner products.

Characterization via rows. By the companion identity $QQ^\top = I$, the rows of $Q$ also form an orthonormal basis. This is a non-trivial statement: a square matrix whose columns are orthonormal must also have orthonormal rows (and vice versa).

Characterization via norms (isometry property). $Q$ is orthogonal if and only if it preserves the Euclidean norm:

$$\|Q\mathbf{x}\|_2 = \|\mathbf{x}\|_2 \quad \text{for all } \mathbf{x} \in \mathbb{R}^n$$

Proof ($\Rightarrow$): $\|Q\mathbf{x}\|^2 = (Q\mathbf{x})^\top(Q\mathbf{x}) = \mathbf{x}^\top Q^\top Q \mathbf{x} = \mathbf{x}^\top I \mathbf{x} = \|\mathbf{x}\|^2$. (The converse direction follows from the polarization identity.) $\square$

Characterization via inner products. $Q$ preserves inner products:

$$\langle Q\mathbf{x}, Q\mathbf{y}\rangle = \langle\mathbf{x},\mathbf{y}\rangle \quad \text{for all } \mathbf{x}, \mathbf{y}$$

Proof: $(Q\mathbf{x})^\top(Q\mathbf{y}) = \mathbf{x}^\top Q^\top Q \mathbf{y} = \mathbf{x}^\top \mathbf{y}$. $\square$

This means orthogonal matrices preserve angles and distances - they are isometries of Euclidean space.
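
These characterizations can be verified in a few lines (illustrative sketch; QR of a Gaussian matrix is the same construction used for orthogonal initialization in 1.3):

import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # a random orthogonal matrix
x = rng.standard_normal(4)
y = rng.standard_normal(4)

print(np.allclose(Q.T @ Q, np.eye(4)))   # orthonormal columns
print(np.allclose(Q @ Q.T, np.eye(4)))   # orthonormal rows, automatically
print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))   # isometry
print(np.isclose((Q @ x) @ (Q @ y), x @ y))                   # inner products preserved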

3.2 The Orthogonal Group and Special Orthogonal Group

The set of all $n \times n$ orthogonal matrices forms the orthogonal group:

$$O(n) = \{Q \in \mathbb{R}^{n \times n} : Q^\top Q = I\}$$

Group axioms:

  • Closure: If $Q_1, Q_2 \in O(n)$, then $(Q_1 Q_2)^\top (Q_1 Q_2) = Q_2^\top Q_1^\top Q_1 Q_2 = I$, so $Q_1 Q_2 \in O(n)$.
  • Identity: $I \in O(n)$.
  • Inverses: $Q^{-1} = Q^\top \in O(n)$ (since $(Q^\top)^\top Q^\top = QQ^\top = I$).
  • Associativity: Inherited from matrix multiplication.

Determinant. Since $\det(Q^\top Q) = \det(Q)^2 = \det(I) = 1$, we have $\det(Q) = \pm 1$.

  • $\det(Q) = +1$: rotations - orientation-preserving isometries
  • $\det(Q) = -1$: improper rotations - reflections (or a rotation composed with a reflection)

The special orthogonal group $SO(n) = \{Q \in O(n) : \det(Q) = +1\}$ consists of pure rotations.

In 2D:

$$SO(2) = \left\{\begin{pmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{pmatrix} : \theta \in [0,2\pi)\right\}$$

This is the group of rotations of the plane.

In 3D: $SO(3)$ parameterizes 3D rotations. This group is the configuration space of a rigid body and appears throughout robotics, computer vision, and physics. Its double cover $SU(2)$ appears in quaternion-based rotation representations used in game engines and IMUs.

3.3 Condition Number and Numerical Properties

The condition number of a matrix measures how much it amplifies errors:

$$\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}}{\sigma_{\min}}$$

For an orthogonal matrix $Q$, all singular values equal 1, so $\kappa(Q) = 1/1 = 1$. Orthogonal matrices have the best possible condition number.

This means:

  • Multiplying by $Q$ never amplifies rounding errors
  • Solving $Q\mathbf{x} = \mathbf{b}$ is trivially stable: $\mathbf{x} = Q^\top\mathbf{b}$
  • QR factorization (which uses orthogonal matrices) is numerically stable

For AI: Deep learning systems that maintain orthogonal weight matrices throughout training benefit from gradient signals that are neither exploding nor vanishing. The isometry property means the network's "signal amplification" is exactly 1 at those layers.

3.4 Eigenvalues of Orthogonal Matrices

Theorem. The eigenvalues of a real orthogonal matrix lie on the unit circle in $\mathbb{C}$: $|\lambda| = 1$.

Proof: If $Q\mathbf{v} = \lambda\mathbf{v}$ with $\mathbf{v} \neq \mathbf{0}$:

$$\|\mathbf{v}\| = \|Q\mathbf{v}\| = \|\lambda\mathbf{v}\| = |\lambda|\|\mathbf{v}\|$$

Dividing by $\|\mathbf{v}\| > 0$: $|\lambda| = 1$. $\square$

For real eigenvalues: $\lambda = +1$ (rotation-like) or $\lambda = -1$ (reflection-like). Complex eigenvalues come in conjugate pairs $e^{i\theta}, e^{-i\theta}$ corresponding to 2D rotations.

3.5 The Cayley Transform

The Cayley transform establishes a correspondence between orthogonal matrices and skew-symmetric matrices (matrices satisfying $S^\top = -S$).

Definition. For a skew-symmetric matrix $S$ (such that $I + S$ is invertible):

$$Q = (I - S)(I + S)^{-1}$$

Claim: $Q$ is orthogonal.

Proof: Using $(I + S)^\top = I - S$ (since $S^\top = -S$):

$$Q^\top Q = ((I+S)^{-1})^\top (I-S)^\top (I-S)(I+S)^{-1} = (I-S)^{-1}(I+S)(I-S)(I+S)^{-1}$$

Since $I-S$ and $I+S$ are both polynomials in the same matrix $S$, they commute:

$$(I+S)(I-S) = I - S^2 = (I-S)(I+S)$$

Hence $Q^\top Q = (I-S)^{-1}(I-S)(I+S)(I+S)^{-1} = I$. $\square$

Inverse Cayley transform. Given an orthogonal $Q$ without eigenvalue $-1$:

$$S = (I - Q)(I + Q)^{-1}$$

Why this matters:

  • Skew-symmetric matrices form a vector space (easy to parameterize)
  • The Cayley transform maps this space bijectively onto the rotations in $SO(n)$ that do not have $-1$ as an eigenvalue
  • This gives a smooth parameterization of $O(n)$ near the identity - useful for Riemannian optimization

Example in 2D: $S = \begin{pmatrix}0 & -t \\ t & 0\end{pmatrix}$ (skew-symmetric) gives:

$$Q = \frac{1}{1+t^2}\begin{pmatrix}1-t^2 & 2t \\ -2t & 1-t^2\end{pmatrix}$$

With $t = \tan(\theta/2)$ (the Weierstrass substitution), this is the rotation matrix $R_{-\theta}$; replacing $t$ by $-t$ gives $R_\theta$. The Cayley transform thus rationalizes the trigonometric functions appearing in rotation matrices.

Applications in neural networks: Neural networks that maintain orthogonal weight matrices throughout training can parameterize them via their skew-symmetric "log" and exponentiate via the matrix exponential or Cayley transform. This enables gradient descent on $O(n)$ using Lie algebra tools.
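
A minimal sketch of the Cayley parameterization (illustrative): any skew-symmetric $S$ yields an orthogonal $Q$ with determinant $+1$, and the inverse transform recovers $S$.

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
S = A - A.T                              # skew-symmetric: S^T = -S
I = np.eye(4)

Q = (I - S) @ np.linalg.inv(I + S)       # Cayley transform
print(np.allclose(Q.T @ Q, I))           # orthogonal
print(np.isclose(np.linalg.det(Q), 1.0)) # a rotation: det = +1

S_back = (I - Q) @ np.linalg.inv(I + Q)  # inverse Cayley transform
print(np.allclose(S_back, S))            # round trip recovers S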


4. Orthogonal Projections

4.1 The Projection Formula

Definition. The orthogonal projection of $\mathbf{v}$ onto a subspace $S$ is the unique vector $\mathbf{v}_S \in S$ such that $\mathbf{v} - \mathbf{v}_S \perp S$.

Formula (1D case). Projection onto the line spanned by a unit vector $\hat{\mathbf{u}}$:

$$\operatorname{proj}_{\hat{\mathbf{u}}}(\mathbf{v}) = \langle\mathbf{v},\hat{\mathbf{u}}\rangle\,\hat{\mathbf{u}} = \hat{\mathbf{u}}\hat{\mathbf{u}}^\top\mathbf{v}$$

The matrix $P = \hat{\mathbf{u}}\hat{\mathbf{u}}^\top$ is the rank-1 projection matrix.

Formula (general case). If $S = \operatorname{col}(A)$ for a matrix $A$ with linearly independent columns:

$$\operatorname{proj}_S(\mathbf{v}) = A(A^\top A)^{-1}A^\top\mathbf{v}$$

The matrix $P = A(A^\top A)^{-1}A^\top$ is the projection matrix onto $\operatorname{col}(A)$.

When $A$ has orthonormal columns (i.e., $A = Q$ with $Q^\top Q = I$), this simplifies dramatically:

$$P = QQ^\top$$

This is why orthonormal bases make projections computationally trivial.

4.2 Projection Matrices: Properties

A matrix $P$ is a projection if and only if it is idempotent: $P^2 = P$.

A projection is an orthogonal projection if and only if it is also symmetric: $P^\top = P$.

Verification: For $P = QQ^\top$ with $Q^\top Q = I$:

  • Idempotent: $P^2 = QQ^\top QQ^\top = Q(Q^\top Q)Q^\top = QIQ^\top = QQ^\top = P$
  • Symmetric: $P^\top = (QQ^\top)^\top = QQ^\top = P$

Eigenvalues of projections. Every eigenvalue of a projection $P$ is 0 or 1.

Proof: If $P\mathbf{v} = \lambda\mathbf{v}$ with $\mathbf{v} \neq \mathbf{0}$, then $\lambda\mathbf{v} = P\mathbf{v} = P^2\mathbf{v} = \lambda P\mathbf{v} = \lambda^2\mathbf{v}$, so $\lambda^2 = \lambda$, giving $\lambda = 0$ or $\lambda = 1$. $\square$

Geometrically: eigenvectors with $\lambda = 1$ lie in the range of $P$ (they are unchanged), while eigenvectors with $\lambda = 0$ lie in the null space (they are annihilated).
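
A short numerical confirmation (illustrative sketch):

import numpy as np

rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))  # orthonormal columns
P = Q @ Q.T                                       # orthogonal projector onto col(Q)

print(np.allclose(P @ P, P))                      # idempotent
print(np.allclose(P.T, P))                        # symmetric
print(np.round(np.linalg.eigvalsh(P), 10))        # eigenvalues: three 0s and two 1s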

4.3 Best Approximation Theorem

Theorem (Best Approximation). The projection $\mathbf{v}_S = \operatorname{proj}_S(\mathbf{v})$ is the closest point in $S$ to $\mathbf{v}$:

$$\|\mathbf{v} - \mathbf{v}_S\|^2 \leq \|\mathbf{v} - \mathbf{s}\|^2 \quad \text{for all } \mathbf{s} \in S$$

with equality only when $\mathbf{s} = \mathbf{v}_S$.

Proof. For any $\mathbf{s} \in S$:

$$\|\mathbf{v} - \mathbf{s}\|^2 = \|(\mathbf{v}-\mathbf{v}_S) + (\mathbf{v}_S - \mathbf{s})\|^2 = \|\mathbf{v}-\mathbf{v}_S\|^2 + 2\langle\mathbf{v}-\mathbf{v}_S, \mathbf{v}_S-\mathbf{s}\rangle + \|\mathbf{v}_S-\mathbf{s}\|^2$$

Since $\mathbf{v}-\mathbf{v}_S \perp S$ and $\mathbf{v}_S - \mathbf{s} \in S$, the cross term vanishes:

$$= \|\mathbf{v}-\mathbf{v}_S\|^2 + \|\mathbf{v}_S-\mathbf{s}\|^2 \geq \|\mathbf{v}-\mathbf{v}_S\|^2 \quad \square$$

This theorem is everywhere in machine learning:

  • Least squares: The normal equations give the projection of $\mathbf{b}$ onto $\operatorname{col}(A)$
  • PCA: Principal components give the projection onto the subspace of maximum variance (see 03: PCA)
  • Attention: Softmax attention can be viewed as computing a weighted projection of value vectors
  • Linear regression: The fitted values are $\hat{\mathbf{y}} = H\mathbf{y}$, where $H = X(X^\top X)^{-1}X^\top$ is the hat matrix

4.4 Projection in Coordinate Systems

Given an ONB $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ for $\mathbb{R}^n$, the projection onto the subspace $S = \operatorname{span}(\mathbf{q}_1, \ldots, \mathbf{q}_k)$ (for $k \leq n$) is:

$$P_S = \sum_{i=1}^k \mathbf{q}_i\mathbf{q}_i^\top$$

This is a sum of rank-1 projectors, one per basis direction. The decomposition:

$$I = P_S + P_{S^\perp} = \sum_{i=1}^k \mathbf{q}_i\mathbf{q}_i^\top + \sum_{i=k+1}^n \mathbf{q}_i\mathbf{q}_i^\top$$

is the resolution of the identity - every vector is split into its $S$ and $S^\perp$ components.


5. Gram-Schmidt Orthogonalization

5.1 The Algorithm: Classical Gram-Schmidt

Problem. Given linearly independent vectors $\mathbf{a}_1, \ldots, \mathbf{a}_k$, construct an orthonormal basis for $\operatorname{span}(\mathbf{a}_1, \ldots, \mathbf{a}_k)$.

Classical Gram-Schmidt (CGS):

Initialization: $\mathbf{q}_1 = \mathbf{a}_1 / \|\mathbf{a}_1\|$

Iteration: For $j = 2, \ldots, k$:

$$\mathbf{u}_j = \mathbf{a}_j - \sum_{i=1}^{j-1}\langle\mathbf{a}_j, \mathbf{q}_i\rangle\,\mathbf{q}_i \qquad \mathbf{q}_j = \frac{\mathbf{u}_j}{\|\mathbf{u}_j\|}$$

What it does geometrically: At step $j$, we subtract the projection of $\mathbf{a}_j$ onto the span of the already-processed vectors, leaving the component orthogonal to everything before it. Normalizing gives the $j$-th orthonormal vector.

Inductive proof of correctness. After step $j$, $\{\mathbf{q}_1, \ldots, \mathbf{q}_j\}$ is an ONB for $\operatorname{span}(\mathbf{a}_1, \ldots, \mathbf{a}_j)$.

Base: $j=1$: $\mathbf{q}_1 = \mathbf{a}_1/\|\mathbf{a}_1\|$ is a unit vector spanning the same subspace.

Step: Assume the claim for $j-1$. Then $\mathbf{u}_j = \mathbf{a}_j - P_{j-1}\mathbf{a}_j$, where $P_{j-1}$ projects onto $\operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_{j-1})$.

For any $\ell < j$: $\langle\mathbf{u}_j, \mathbf{q}_\ell\rangle = \langle\mathbf{a}_j,\mathbf{q}_\ell\rangle - \langle\mathbf{a}_j,\mathbf{q}_\ell\rangle = 0$.

Since $\mathbf{a}_1,\ldots,\mathbf{a}_j$ are independent, $\mathbf{u}_j \neq \mathbf{0}$ (otherwise $\mathbf{a}_j \in \operatorname{span}(\mathbf{a}_1,\ldots,\mathbf{a}_{j-1})$). So $\mathbf{q}_j = \mathbf{u}_j/\|\mathbf{u}_j\|$ is a valid unit vector. $\square$

5.2 Modified Gram-Schmidt

The numerical problem with CGS. In floating-point arithmetic, CGS can catastrophically lose orthogonality. This happens when $\mathbf{a}_j$ is nearly in the span of $\{\mathbf{a}_1,\ldots,\mathbf{a}_{j-1}\}$ - rounding errors in the projection subtraction can accumulate, producing a $\mathbf{u}_j$ that isn't truly orthogonal to earlier vectors.

Modified Gram-Schmidt (MGS) reorders the computation to orthogonalize against the already-computed $\mathbf{q}_i$ sequentially rather than all at once (here as runnable NumPy):

import numpy as np

def mgs(A):
    m, k = A.shape
    Q = np.zeros((m, k))
    for j in range(k):
        v = A[:, j].copy()
        for i in range(j):
            v -= (v @ Q[:, i]) * Q[:, i]   # orthogonalize against q_i using CURRENT v
        Q[:, j] = v / np.linalg.norm(v)
    return Q

Why MGS is more stable. In CGS, all projections use the original $\mathbf{a}_j$. In MGS, after each orthogonalization step, the updated $\mathbf{v}$ is used - so numerical errors from the first projection are themselves projected away in subsequent steps.

Theorem (Björck 1967). The computed $\hat{\mathbf{q}}_i$ from MGS satisfy $\hat{\mathbf{q}}_i^\top \hat{\mathbf{q}}_j = \delta_{ij} + O(\epsilon_{\text{mach}}\,\kappa(A))$, where $\kappa(A)$ is the condition number of the input matrix. CGS guarantees only $O(\epsilon_{\text{mach}}\,\kappa(A)^2)$.

Practical advice: Use MGS for explicit basis construction; use Householder for QR factorization (even more stable).
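
The gap between the two variants shows up clearly on an ill-conditioned input. In the sketch below (illustrative; cgs is a hypothetical direct translation of classical Gram-Schmidt, and mgs is the function from the listing above), MGS typically loses far less orthogonality on a Hilbert-type matrix:

import numpy as np

def cgs(A):
    m, k = A.shape
    Q = np.zeros((m, k))
    for j in range(k):
        v = A[:, j] - Q[:, :j] @ (Q[:, :j].T @ A[:, j])  # all projections use the original a_j
        Q[:, j] = v / np.linalg.norm(v)
    return Q

n = 10
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])  # ill-conditioned

for name, f in [("CGS", cgs), ("MGS", mgs)]:          # mgs as defined above
    Q = f(H)
    print(name, np.linalg.norm(Q.T @ Q - np.eye(n)))  # loss of orthogonality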

5.3 Gram-Schmidt and QR Factorization

Gram-Schmidt applied to the columns of $A = [\mathbf{a}_1|\cdots|\mathbf{a}_n]$ produces a QR factorization:

$$A = QR$$

where $Q = [\mathbf{q}_1|\cdots|\mathbf{q}_n]$ has orthonormal columns and $R$ is upper triangular with:

$$R_{ij} = \begin{cases}\langle\mathbf{a}_j, \mathbf{q}_i\rangle & i < j \\ \|\mathbf{u}_j\| & i = j \\ 0 & i > j\end{cases}$$

The diagonal entries $R_{jj} = \|\mathbf{u}_j\|$ are positive (since $\mathbf{u}_j \neq \mathbf{0}$ for independent columns).

This gives a unique QR factorization of $A$ with positive diagonal entries of $R$.

For AI: Gram-Schmidt implicitly occurs in layer normalization variants, in computing orthogonal weight bases for LoRA, and in the derivation of stable training algorithms.
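
Since $Q^\top Q = I$, the triangular factor can be recovered from the mgs listing in 5.2 as $R = Q^\top A$ (illustrative sketch):

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))
Q = mgs(A)                      # mgs as defined in 5.2
R = Q.T @ A                     # R = Q^T A, upper triangular up to rounding

print(np.allclose(A, Q @ R))                        # A = QR
print(np.allclose(np.tril(R, -1), 0, atol=1e-10))   # strictly lower part vanishes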


6. Householder Reflectors and Givens Rotations

6.1 Householder Reflectors

Motivation. While Gram-Schmidt produces QR by building $Q$ column by column, Householder applies successive orthogonal transformations to reduce $A$ to upper triangular form $R$ directly - achieving greater numerical stability.

Definition. The Householder reflector determined by a unit vector $\mathbf{n}$ (the normal to the mirror) is:

$$H = I - 2\mathbf{n}\mathbf{n}^\top$$

Properties:

  • Symmetric: $H^\top = H$
  • Orthogonal: $H^\top H = H^2 = (I - 2\mathbf{n}\mathbf{n}^\top)^2 = I - 4\mathbf{n}\mathbf{n}^\top + 4\mathbf{n}\mathbf{n}^\top\mathbf{n}\mathbf{n}^\top = I - 4\mathbf{n}\mathbf{n}^\top + 4\mathbf{n}\mathbf{n}^\top = I$ (using $\mathbf{n}^\top\mathbf{n} = 1$)
  • Involution: $H^2 = I$, so $H = H^{-1}$
  • Determinant: $\det(H) = -1$ (it's a reflection, not a rotation)

Action: $H$ reflects any vector across the hyperplane with normal $\mathbf{n}$:

  • Vectors in $\mathbf{n}^\perp$ are unchanged: $H\mathbf{v} = \mathbf{v}$ if $\langle\mathbf{v},\mathbf{n}\rangle = 0$
  • The vector $\mathbf{n}$ is negated: $H\mathbf{n} = \mathbf{n} - 2\mathbf{n} = -\mathbf{n}$

Zeroing a column. To zero out all components of $\mathbf{a}$ below the first entry, choose:

$$\mathbf{n} = \frac{\mathbf{a} + \|\mathbf{a}\|\mathbf{e}_1}{\|\mathbf{a} + \|\mathbf{a}\|\mathbf{e}_1\|} \quad \Rightarrow \quad H\mathbf{a} = -\|\mathbf{a}\|\mathbf{e}_1$$

(In practice one uses $\mathbf{n} \propto \mathbf{a} + \operatorname{sign}(a_1)\|\mathbf{a}\|\mathbf{e}_1$; matching the sign of $a_1$ avoids catastrophic cancellation when forming the numerator.)
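
A short sketch of a reflector built this way (illustrative, with an arbitrary vector a):

import numpy as np

a = np.array([3.0, 1.0, 4.0, 1.0])
e1 = np.zeros_like(a); e1[0] = 1.0

w = a + np.sign(a[0]) * np.linalg.norm(a) * e1     # sign choice avoids cancellation
n_vec = w / np.linalg.norm(w)                      # unit normal of the mirror
H = np.eye(len(a)) - 2.0 * np.outer(n_vec, n_vec)  # Householder reflector

print(np.round(H @ a, 12))                # (-||a||, 0, 0, 0): zeros below the first entry
print(np.allclose(H @ H, np.eye(len(a)))) # involution: H^2 = I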

6.2 QR via Householder

To factor $A \in \mathbb{R}^{m \times n}$:

  1. Find $H_1$ to zero the first column below row 1: $H_1 A$ has $\ast$ in position $(1,1)$ and zeros below
  2. Find $H_2$ (acting on the trailing $(m-1) \times (n-1)$ submatrix) to zero the second column below row 2
  3. Continue: $H_n \cdots H_2 H_1 A = R$ (upper triangular)

Then $Q = H_1 H_2 \cdots H_n$ (a product of reflectors, hence orthogonal), and $A = QR$.

Complexity: about $2n^2(m - n/3)$ flops - the same asymptotic complexity as Gram-Schmidt but with better numerical constants.

Stability: Householder QR is backward stable, meaning the computed factors satisfy $\hat{Q}\hat{R} = A + E$ with $\|E\| = O(\epsilon_{\text{mach}}\|A\|)$.

6.3 Givens Rotations

Definition. The Givens rotation $G(i,j,\theta) \in \mathbb{R}^{n \times n}$ rotates in the $(i,j)$-plane:

$$G(i,j,\theta)_{kl} = \begin{cases} \cos\theta & k=l=i \text{ or } k=l=j \\ -\sin\theta & k=i,\ l=j \\ \sin\theta & k=j,\ l=i \\ \delta_{kl} & \text{otherwise} \end{cases}$$

Purpose: Choose $\theta$ to zero a specific element $a_{ji}$ by rotating in the $(i,j)$-plane.

When to use: Givens rotations are preferred when $A$ is sparse (each rotation affects only two rows/columns), for parallel computation, and in eigenvalue algorithms like QR iteration.
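
A minimal sketch of one Givens rotation zeroing a single entry (illustrative; givens is a hypothetical helper):

import numpy as np

def givens(a, b):
    # return (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0]
    r = np.hypot(a, b)
    return a / r, b / r

A = np.array([[6.0, 5.0],
              [4.0, 3.0]])
c, s = givens(A[0, 0], A[1, 0])
G = np.array([[c, -s],
              [s,  c]])        # rotation in the (1,2)-plane
print(G.T @ A)                 # the (2,1) entry is now zero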


7. QR Decomposition: Theory and Applications

7.1 Existence and Uniqueness

Theorem (QR Decomposition). Every matrix $A \in \mathbb{R}^{m \times n}$ with $m \geq n$ admits a factorization:

$$A = QR$$

where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns ($Q^\top Q = I_n$) and $R \in \mathbb{R}^{n \times n}$ is upper triangular.

This is the thin (reduced) QR decomposition. There is also a full QR decomposition, where $Q \in \mathbb{R}^{m \times m}$ is a full orthogonal matrix and $R \in \mathbb{R}^{m \times n}$ is upper trapezoidal.

Uniqueness (when $A$ has full column rank). If we require $R_{ii} > 0$ for all $i$, the thin QR decomposition is unique.

Proof sketch: Two decompositions $Q_1 R_1 = Q_2 R_2$ give $Q_2^\top Q_1 = R_2 R_1^{-1}$. The left side is orthogonal; the right side is upper triangular. An upper triangular orthogonal matrix must be diagonal with $\pm 1$ entries. With the sign convention $R_{ii} > 0$, this diagonal must be $I$. $\square$

When $A$ has rank $r < n$: The decomposition still exists, but $R$ has $n - r$ zero diagonal entries. The columns of $Q$ corresponding to zero $R_{ii}$ can be chosen as any completion to an orthonormal set.

7.2 QR for Least Squares

Problem. Given $A \in \mathbb{R}^{m \times n}$ ($m > n$) and $\mathbf{b} \in \mathbb{R}^m$, solve:

$$\min_{\mathbf{x} \in \mathbb{R}^n} \|A\mathbf{x} - \mathbf{b}\|_2^2$$

Solution via QR. Let $A = QR$ (thin QR, $Q^\top Q = I$):

$$\|A\mathbf{x} - \mathbf{b}\|^2 = \|QR\mathbf{x} - \mathbf{b}\|^2 = \|R\mathbf{x} - Q^\top\mathbf{b}\|^2 + \|(I - QQ^\top)\mathbf{b}\|^2$$

The second term is independent of $\mathbf{x}$, so the minimum occurs at:

$$R\mathbf{x} = Q^\top\mathbf{b}$$

This is an upper triangular system, solved by back substitution in $O(n^2)$ operations.

Comparison with normal equations. The normal equations $(A^\top A)\mathbf{x} = A^\top\mathbf{b}$ are equivalent but form $A^\top A$, which squares the condition number: $\kappa(A^\top A) = \kappa(A)^2$. QR avoids this squaring and is numerically preferred whenever $\kappa(A)$ is moderate.
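
The QR route in code (illustrative sketch; np.linalg.solve is used for the triangular system, though a dedicated back-substitution routine would exploit the structure):

import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((50, 3))
b = rng.standard_normal(50)

Q, R = np.linalg.qr(A)             # thin QR
x = np.linalg.solve(R, Q.T @ b)    # solve Rx = Q^T b

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_ref))                 # matches the library least-squares solution
print(np.allclose(A.T @ (A @ x - b), 0))     # residual is orthogonal to col(A)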

7.3 QR Algorithm for Eigenvalues

Preview. The QR algorithm - the standard method for computing eigenvalues of symmetric matrices - works by iterating:

$$A_0 = A, \quad A_{k+1} = R_k Q_k \quad \text{where } A_k = Q_k R_k$$

Each step is an orthogonal similarity, since $A_{k+1} = R_k Q_k = Q_k^\top A_k Q_k$, so every $A_k$ has the same eigenvalues. Under mild conditions, $A_k$ converges to upper triangular form (the Schur form; diagonal in the symmetric case), and the diagonal entries converge to the eigenvalues. This connects orthogonal matrices directly to spectral theory.
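
A bare-bones unshifted iteration (illustrative sketch on a matrix with assumed, well-separated eigenvalues; production implementations add tridiagonal reduction and shifts):

import numpy as np

rng = np.random.default_rng(7)
Q0, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q0 @ np.diag([4.0, 2.0, 1.0, 0.5]) @ Q0.T   # symmetric, known spectrum

Ak = A.copy()
for _ in range(60):
    Qk, Rk = np.linalg.qr(Ak)
    Ak = Rk @ Qk                  # orthogonal similarity: A_{k+1} = Q_k^T A_k Q_k

print(np.round(np.diag(Ak), 6))   # approaches [4, 2, 1, 0.5]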

Forward Reference: The full QR algorithm with shifts is covered in 08: Matrix Decompositions. The eigenvalue theory it computes is developed in 01: Eigenvalues and Eigenvectors.


8. The Spectral Theorem for Symmetric Matrices

8.1 Statement and Proof

Theorem (Spectral Theorem). Every real symmetric matrix $A = A^\top \in \mathbb{R}^{n \times n}$ has:

  1. All eigenvalues real: $\lambda_i \in \mathbb{R}$
  2. Eigenvectors for distinct eigenvalues orthogonal
  3. An orthonormal basis of eigenvectors: $A = Q\Lambda Q^\top$, where $Q \in O(n)$ and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$

Proof of (1): Suppose $A\mathbf{v} = \lambda\mathbf{v}$ with $\mathbf{v} \in \mathbb{C}^n$, $\mathbf{v} \neq \mathbf{0}$. Take the Hermitian inner product:

$$\bar{\mathbf{v}}^\top A\mathbf{v} = \lambda\,\bar{\mathbf{v}}^\top\mathbf{v} = \lambda\|\mathbf{v}\|^2$$

Conjugating this scalar, and using that $A$ is real and $A = A^\top$:

$$\overline{\bar{\mathbf{v}}^\top A\mathbf{v}} = \mathbf{v}^\top A\bar{\mathbf{v}} = \bar{\mathbf{v}}^\top A^\top\mathbf{v} = \bar{\mathbf{v}}^\top A\mathbf{v}$$

So $\lambda\|\mathbf{v}\|^2$ is real, hence $\lambda \in \mathbb{R}$. $\square$

Proof of (2): If $A\mathbf{u} = \lambda\mathbf{u}$ and $A\mathbf{v} = \mu\mathbf{v}$ with $\lambda \neq \mu$:

$$\lambda\langle\mathbf{u},\mathbf{v}\rangle = \langle A\mathbf{u},\mathbf{v}\rangle = \langle\mathbf{u},A^\top\mathbf{v}\rangle = \langle\mathbf{u},A\mathbf{v}\rangle = \mu\langle\mathbf{u},\mathbf{v}\rangle$$

So $(\lambda - \mu)\langle\mathbf{u},\mathbf{v}\rangle = 0$. Since $\lambda \neq \mu$: $\langle\mathbf{u},\mathbf{v}\rangle = 0$. $\square$

Proof of (3): By induction on $n$. The base case ($n=1$) is trivial. For the inductive step: let $\mathbf{q}_1$ be a unit eigenvector for eigenvalue $\lambda_1$. Let $W = \mathbf{q}_1^\perp$ (the orthogonal complement). Since $A$ is symmetric, $W$ is $A$-invariant (if $\langle\mathbf{w},\mathbf{q}_1\rangle = 0$, then $\langle A\mathbf{w},\mathbf{q}_1\rangle = \langle\mathbf{w},A\mathbf{q}_1\rangle = \lambda_1\langle\mathbf{w},\mathbf{q}_1\rangle = 0$). Apply the inductive hypothesis to the $(n-1)\times(n-1)$ symmetric restriction $A|_W$. $\square$
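
Numerically, np.linalg.eigh computes exactly this factorization (illustrative check):

import numpy as np

rng = np.random.default_rng(8)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                              # symmetric test matrix

lam, Q = np.linalg.eigh(A)                     # real eigenvalues, orthonormal eigenvectors
print(np.allclose(Q.T @ Q, np.eye(5)))         # Q is orthogonal
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))  # A = Q Lambda Q^T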

8.2 The Rayleigh Quotient

Definition. The Rayleigh quotient of a symmetric matrix $A$ at a vector $\mathbf{x} \neq \mathbf{0}$ is:

$$R_A(\mathbf{x}) = \frac{\mathbf{x}^\top A\mathbf{x}}{\mathbf{x}^\top\mathbf{x}}$$

Key properties:

  • $R_A(\mathbf{q}_i) = \lambda_i$ for eigenvectors $\mathbf{q}_i$
  • $\lambda_{\min} \leq R_A(\mathbf{x}) \leq \lambda_{\max}$ for all $\mathbf{x}$
  • $\max_{\mathbf{x}} R_A(\mathbf{x}) = \lambda_{\max}$, achieved at the top eigenvector
  • $\min_{\mathbf{x}} R_A(\mathbf{x}) = \lambda_{\min}$, achieved at the bottom eigenvector

Proof of bounds: Expand $\mathbf{x} = \sum_i c_i \mathbf{q}_i$ in the eigenbasis:

$$R_A(\mathbf{x}) = \frac{\sum_i c_i^2 \lambda_i}{\sum_i c_i^2}$$

This is a weighted average of the eigenvalues, hence bounded by $[\lambda_{\min}, \lambda_{\max}]$. $\square$
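
A quick empirical look at the bounds (illustrative sketch; rayleigh is a hypothetical helper):

import numpy as np

rng = np.random.default_rng(9)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2
lam = np.linalg.eigvalsh(A)                  # sorted ascending

def rayleigh(x):
    return (x @ A @ x) / (x @ x)

vals = [rayleigh(rng.standard_normal(6)) for _ in range(1000)]
print(lam[0] <= min(vals) and max(vals) <= lam[-1])   # True: trapped between the extremes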

Courant-Fischer Min-Max Theorem. The $k$-th largest eigenvalue satisfies:

$$\lambda_k = \max_{\dim S = k}\ \min_{\mathbf{x} \in S \setminus \{\mathbf{0}\}} R_A(\mathbf{x}) = \min_{\dim S = n-k+1}\ \max_{\mathbf{x} \in S \setminus \{\mathbf{0}\}} R_A(\mathbf{x})$$

This variational characterization is fundamental to perturbation theory and numerical linear algebra.

8.3 Quadratic Forms and Symmetric Matrices

A quadratic form is $f(\mathbf{x}) = \mathbf{x}^\top A\mathbf{x} = \sum_{i,j} A_{ij} x_i x_j$ for symmetric $A$.

Positive definiteness via the spectral theorem: $A \succ 0$ (positive definite) $\Leftrightarrow$ all eigenvalues $> 0$.

Proof: In the eigenbasis, $\mathbf{x}^\top A\mathbf{x} = \sum_i \lambda_i c_i^2$. This is positive for all $\mathbf{x} \neq \mathbf{0}$ iff all $\lambda_i > 0$. $\square$

For AI: The Hessian $\nabla^2\mathcal{L}$ of a loss function is symmetric. Its eigenvalues give the curvature in each direction - large eigenvalues are "sharp" directions (fast loss change), small eigenvalues are "flat" directions. The Rayleigh quotient appears in sharpness-aware minimization (SAM), gradient clipping analysis, and the study of training dynamics.

