
Orthogonality and Orthonormality: Part 2 - Orthogonality in Function Spaces to Conceptual Bridge


9. Orthogonality in Function Spaces

9.1 The L^2 Inner Product

Orthogonality extends naturally from finite-dimensional vector spaces to infinite-dimensional function spaces. The space $L^2([a,b])$ of square-integrable functions carries the inner product:

$$\langle f, g\rangle = \int_a^b f(x)g(x)\,dx$$

Orthogonality of functions: $f \perp g$ if $\int_a^b f(x)g(x)\,dx = 0$.

Example: On $[-\pi, \pi]$, the functions $\{\sin(nx), \cos(mx)\}$ form a mutually orthogonal family:

$$\int_{-\pi}^{\pi} \sin(nx)\cos(mx)\,dx = 0 \quad \text{for all } n, m$$
$$\int_{-\pi}^{\pi} \sin(nx)\sin(mx)\,dx = \pi\,\delta_{nm}$$
$$\int_{-\pi}^{\pi} \cos(nx)\cos(mx)\,dx = \pi\,\delta_{nm} \quad (n, m \geq 1)$$

This is the foundation of Fourier series - a basis expansion for $L^2$ functions.
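
A quick numerical sanity check of these orthogonality relations (a minimal numpy sketch; the grid resolution is an arbitrary choice):

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 200001)
dx = x[1] - x[0]

def inner(f, g):
    """Riemann-sum approximation of the L^2 inner product on [-pi, pi]."""
    return np.sum(f * g) * dx

print(inner(np.sin(3 * x), np.cos(5 * x)))  # ~ 0 for any n, m
print(inner(np.sin(2 * x), np.sin(7 * x)))  # ~ 0 for n != m
print(inner(np.sin(4 * x), np.sin(4 * x)))  # ~ pi (the delta_{nm} = 1 case)
```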

9.2 Fourier Series as Orthonormal Expansion

Orthonormal basis for $L^2([-\pi,\pi])$:

$$\left\{\frac{1}{\sqrt{2\pi}},\ \frac{\cos(x)}{\sqrt{\pi}},\ \frac{\sin(x)}{\sqrt{\pi}},\ \frac{\cos(2x)}{\sqrt{\pi}},\ \frac{\sin(2x)}{\sqrt{\pi}},\ \ldots\right\}$$

The Fourier series of $f$ is its expansion in this ONB:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^\infty \left[a_n\cos(nx) + b_n\sin(nx)\right]$$

where the Fourier coefficients are orthogonal projections:

$$a_n = \frac{1}{\pi}\int_{-\pi}^{\pi}f(x)\cos(nx)\,dx, \qquad b_n = \frac{1}{\pi}\int_{-\pi}^{\pi}f(x)\sin(nx)\,dx$$

Parseval's theorem for Fourier series:

$$\frac{1}{\pi}\int_{-\pi}^{\pi}|f(x)|^2\,dx = \frac{a_0^2}{2} + \sum_{n=1}^\infty(a_n^2 + b_n^2)$$

This is the infinite-dimensional analog of $\|\mathbf{v}\|^2 = \sum_i \hat{v}_i^2$.

9.3 Discrete Fourier Transform as a Unitary Matrix

The Discrete Fourier Transform (DFT) maps $\mathbf{x} \in \mathbb{C}^n$ to $\hat{\mathbf{x}} \in \mathbb{C}^n$:

$$\hat{x}_k = \sum_{j=0}^{n-1} x_j\, \omega^{jk}, \qquad \omega = e^{-2\pi i/n}$$

In matrix form: $\hat{\mathbf{x}} = F_n\mathbf{x}$ where $(F_n)_{jk} = \omega^{jk}$.

Key fact: $F_n / \sqrt{n}$ is unitary (the complex analog of orthogonal):

$$\frac{1}{n}F_n^* F_n = I \quad \Leftrightarrow \quad \frac{F_n}{\sqrt{n}} \text{ is unitary}$$

This is because the DFT basis vectors $\mathbf{f}_k = (1, \omega^k, \omega^{2k}, \ldots, \omega^{(n-1)k})/\sqrt{n}$ are orthonormal in $\mathbb{C}^n$:

$$\langle\mathbf{f}_j, \mathbf{f}_k\rangle = \frac{1}{n}\sum_{\ell=0}^{n-1}\omega^{\ell(k-j)} = \delta_{jk}$$

(The last equality is the geometric series identity for roots of unity.)

Parseval's theorem for DFT: $\|\hat{\mathbf{x}}\|^2 = n\|\mathbf{x}\|^2$.
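
Both facts are easy to check numerically; a minimal numpy sketch (the size $n = 8$ is an arbitrary choice):

```python
import numpy as np

n = 8
jj, kk = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
omega = np.exp(-2j * np.pi / n)          # primitive n-th root of unity
F = omega ** (jj * kk)                   # (F_n)_{jk} = omega^{jk}

U = F / np.sqrt(n)                       # F_n / sqrt(n) should be unitary
print(np.allclose(U.conj().T @ U, np.eye(n)))    # True

x = np.random.randn(n) + 1j * np.random.randn(n)
x_hat = F @ x
print(np.allclose(np.linalg.norm(x_hat) ** 2,
                  n * np.linalg.norm(x) ** 2))   # Parseval: ||x_hat||^2 = n ||x||^2

# This F matches numpy's FFT sign convention
print(np.allclose(x_hat, np.fft.fft(x)))         # True
```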

For AI: The DFT/FFT appears in:

  • Signal processing: Feature extraction from audio (MFCCs, spectrograms)
  • Efficient convolutions: $f * g = \mathcal{F}^{-1}(\mathcal{F}(f) \cdot \mathcal{F}(g))$ - used in efficient CNN implementations
  • Positional encodings: Sinusoidal positional embeddings in the original Transformer are Fourier features
  • SSMs: State space models (Mamba, S4) use structured orthogonal/unitary matrices

10. Applications in Machine Learning

10.1 Orthogonal Weight Initialization

The problem with random initialization. If weight matrices have singular values spread over a wide range, gradients either explode or vanish during backpropagation through many layers.

Saxe et al. (2013) showed that initializing with orthogonal matrices preserves gradient norms through linear networks:

Theorem. For a deep linear network with orthogonal weight matrices $W_1, \ldots, W_L$ (each $\in \mathbb{R}^{n \times n}$), the Jacobian of the full network is $W_L \cdots W_1$, which is also orthogonal (a product of orthogonal matrices is orthogonal). Hence $\|\partial\mathcal{L}/\partial\mathbf{x}\| = \|\partial\mathcal{L}/\partial\mathbf{y}\|$ - gradients neither explode nor vanish.

Practical implementation:

  1. Generate a random $n \times n$ matrix with i.i.d. standard normal entries: $M \sim \mathcal{N}(0, I)$
  2. Compute its QR decomposition: $M = QR$
  3. Use $Q$ (or $Q \cdot \operatorname{sign}(\operatorname{diag}(R))$ for a uniform distribution over $O(n)$) as the initial weight matrix

This is implemented as torch.nn.init.orthogonal_() in PyTorch.
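
A minimal numpy sketch of this recipe (the sign correction in step 3 is what makes the draw uniform over the orthogonal group; torch.nn.init.orthogonal_() follows a similar QR-based procedure):

```python
import numpy as np

def orthogonal_init(n, rng=np.random.default_rng(0)):
    """Draw a random orthogonal matrix: QR of a Gaussian matrix plus a sign fix."""
    M = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(M)
    Q *= np.sign(np.diag(R))       # scale each column so the draw is Haar-uniform
    return Q

W = orthogonal_init(64)
print(np.allclose(W.T @ W, np.eye(64)))            # orthonormal columns

# Isometry: an orthogonal layer neither amplifies nor attenuates gradients
g = np.random.default_rng(1).standard_normal(64)
print(np.linalg.norm(W.T @ g), np.linalg.norm(g))  # equal norms
```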

Why it works beyond linear networks: Even in nonlinear networks, orthogonal initialization places the initial weights in a "good" part of weight space where gradients are well-conditioned at initialization. Empirically, networks with orthogonal initialization often converge faster on deep architectures.

10.2 Rotary Position Embeddings (RoPE)

Standard self-attention lacks inherent position information. The original Transformer adds sinusoidal absolute position embeddings. RoPE (Su et al. 2021, used in LLaMA, GPT-NeoX, Falcon) takes a different approach: it encodes position by rotating query and key vectors.

Construction (2D case). For position $m$, the rotation matrix is:

$$R_m = \begin{pmatrix}\cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta\end{pmatrix}$$

For a $d$-dimensional embedding, pair up dimensions and apply $R_m$ to each pair using $d/2$ different frequencies $\theta_i = 10000^{-2i/d}$.

Key property. The inner product of rotated queries and keys depends only on the relative position $m - n$:

$$\langle R_m \mathbf{q}, R_n \mathbf{k}\rangle = \mathbf{q}^\top R_m^\top R_n \mathbf{k} = \mathbf{q}^\top R_{n-m}\mathbf{k}$$

since $R_m^\top R_n = R_{n-m}$ (rotation matrices satisfy $R_a^\top R_b = R_{b-a}$, because the inverse of a rotation is the rotation in the opposite direction).

Orthogonality is central: The key identity $R_m^\top R_n = R_{n-m}$ uses the fact that $R_m$ is orthogonal ($R_m^\top = R_m^{-1}$), so $(R_m^\top)(R_n) = R_m^{-1}R_n = R_{-m}R_n = R_{n-m}$.
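
A minimal numpy check of this relative-position property in the 2D case (the angle and test vectors are arbitrary choices):

```python
import numpy as np

def rot(angle):
    """2x2 rotation matrix R_angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.3
m, n = 7, 3
q = np.array([0.5, -1.2])
k = np.array([2.0, 0.7])

lhs = (rot(m * theta) @ q) @ (rot(n * theta) @ k)   # <R_m q, R_n k>
rhs = q @ (rot((n - m) * theta) @ k)                # q^T R_{n-m} k
print(np.allclose(lhs, rhs))                        # True: depends only on n - m
```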

10.3 Spectral Normalization

Spectral normalization (Miyato et al. 2018) stabilizes GAN training by constraining weight matrices to have spectral norm (= largest singular value) equal to 1.

Implementation: At each forward pass, divide the weight matrix by its spectral norm: $\hat{W} = W/\sigma_1(W)$.

Connection to orthogonality: A square matrix with $\sigma_1 = \sigma_2 = \cdots = 1$ is exactly an orthogonal matrix. Spectral normalization enforces $\sigma_1 = 1$ while letting the other singular values vary freely - it's a "one-sided" orthogonality constraint.

Why it helps GANs: The discriminator $D$ satisfies the Lipschitz condition $\|D(\mathbf{x}) - D(\mathbf{y})\| \leq \|\mathbf{x} - \mathbf{y}\|$ when all weight layers are spectrally normalized and the activations are 1-Lipschitz. This relates to the Wasserstein GAN objective, which requires a 1-Lipschitz discriminator.
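
In practice $\sigma_1(W)$ is estimated with a few power-iteration steps rather than a full SVD; a minimal numpy sketch of that idea (the iteration count is an arbitrary choice, and the helper name spectral_normalize is ours):

```python
import numpy as np

def spectral_normalize(W, n_iter=50, rng=np.random.default_rng(0)):
    """Estimate sigma_1(W) by power iteration and return W / sigma_1."""
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v                 # Rayleigh-quotient estimate of sigma_1
    return W / sigma

W = np.random.default_rng(1).standard_normal((128, 64))
W_hat = spectral_normalize(W)
print(np.linalg.norm(W_hat, 2))       # ~ 1.0: largest singular value is normalized
```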

10.4 QR in Optimization: Orthogonal Gradients

Several modern optimization techniques use orthogonality:

Shampoo optimizer (Gupta et al. 2018): Maintains Kronecker-factored preconditioners and periodically computes their inverse matrix roots via eigendecomposition (which relies on an orthonormal eigenbasis), whitening the gradient update directions.

Muon optimizer (2024): Orthogonalizes gradient matrices via Newton-Schulz iterations (a matrix-valued version of Newton's method for computing polar decomposition), then applies the orthogonalized gradient as an update. This ensures each weight matrix sees updates with balanced singular values, preventing the accumulation of rank deficiency.
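
A minimal sketch of the orthogonalization idea using the classic cubic Newton-Schulz iteration $X \leftarrow \tfrac{3}{2}X - \tfrac{1}{2}XX^\top X$; Muon uses a tuned higher-order polynomial, so treat this as an illustration of the principle rather than its exact coefficients:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximate the orthogonal polar factor U V^T of G without an SVD.

    The cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X drives every
    singular value of X toward 1 while leaving the singular vectors fixed.
    It requires ||X_0||_2 < sqrt(3); dividing by the Frobenius norm (which
    upper-bounds the spectral norm) guarantees that.
    """
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

G = np.random.default_rng(0).standard_normal((64, 32))   # a "gradient" matrix
O = newton_schulz_orthogonalize(G)
print(np.allclose(O.T @ O, np.eye(32), atol=1e-5))        # orthonormal columns

U, _, Vt = np.linalg.svd(G, full_matrices=False)
print(np.allclose(O, U @ Vt, atol=1e-5))                  # matches the polar factor
```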

Gradient orthogonality in continual learning: Methods like Gradient Episodic Memory (GEM) and Orthogonal Gradient Descent (OGD) project gradients for new tasks onto the orthogonal complement of the gradient space for previous tasks - preserving past performance while learning new tasks.



10b. Bessel's Inequality and Completeness

10b.1 Bessel's Inequality

When we expand a vector $\mathbf{v}$ in terms of an orthonormal set $\{\mathbf{q}_1, \ldots, \mathbf{q}_k\}$ that is not necessarily a complete basis for the entire space, we get Bessel's inequality:

$$\sum_{i=1}^k |\langle\mathbf{v}, \mathbf{q}_i\rangle|^2 \leq \|\mathbf{v}\|^2$$

Proof. Let $\mathbf{v}_k = \sum_{i=1}^k \langle\mathbf{v},\mathbf{q}_i\rangle\mathbf{q}_i$ be the projection of $\mathbf{v}$ onto $\operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_k)$. Then:

$$\|\mathbf{v} - \mathbf{v}_k\|^2 = \|\mathbf{v}\|^2 - \|\mathbf{v}_k\|^2 = \|\mathbf{v}\|^2 - \sum_{i=1}^k |\langle\mathbf{v},\mathbf{q}_i\rangle|^2 \geq 0$$

The inequality follows immediately. $\square$

Equality holds if and only if $\mathbf{v} \in \operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_k)$ - i.e., when $\mathbf{v}$ itself lies in the subspace spanned by the chosen orthonormal vectors.

Parseval's identity is the limiting case, when $\{\mathbf{q}_i\}$ is a complete ONB:

$$\sum_{i=1}^\infty |\langle\mathbf{v}, \mathbf{q}_i\rangle|^2 = \|\mathbf{v}\|^2$$
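
A minimal numpy illustration of the two statements in $\mathbb{R}^5$ (the dimension and the partial-basis size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # columns form a full ONB of R^5
v = rng.standard_normal(5)

coeffs = Q.T @ v                                   # <v, q_i> for every basis vector
print(np.sum(coeffs[:3] ** 2) <= np.dot(v, v))     # Bessel: partial ONB (k = 3)
print(np.isclose(np.sum(coeffs ** 2), np.dot(v, v)))  # Parseval: complete ONB
```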

10b.2 Completeness of Orthonormal Sets

An orthonormal set $\mathcal{B} = \{\mathbf{q}_i\}$ in a Hilbert space $\mathcal{H}$ is complete (or maximal) if it satisfies any of the following equivalent conditions:

  1. $\mathcal{B}$ is an ONB: every $\mathbf{v} \in \mathcal{H}$ can be written $\mathbf{v} = \sum_i \langle\mathbf{v},\mathbf{q}_i\rangle\mathbf{q}_i$
  2. Parseval's identity holds: $\|\mathbf{v}\|^2 = \sum_i |\langle\mathbf{v},\mathbf{q}_i\rangle|^2$ for all $\mathbf{v}$
  3. $\langle\mathbf{v},\mathbf{q}_i\rangle = 0$ for all $i$ implies $\mathbf{v} = \mathbf{0}$ (no non-zero vector is orthogonal to all $\mathbf{q}_i$)

In finite dimensions, an orthonormal set with $\dim \mathcal{H}$ elements is automatically a complete ONB (by dimension counting). In infinite dimensions, completeness is a non-trivial requirement - this is why Hilbert space theory requires careful treatment.

For ML context: In kernel methods, the RKHS is an infinite-dimensional Hilbert space. A kernel function $k(\mathbf{x},\mathbf{y})$ defines an inner product, and the question of whether the kernel features span the whole space (completeness/universality) determines the expressivity of the corresponding kernel machine.


10c. Detailed Worked Examples

10c.1 Projection in Least-Squares Regression

Setup. We observe $m = 4$ data points $(x_i, y_i)$ and want to fit a line $y = \beta_0 + \beta_1 x$:

$$x = (1, 2, 3, 4)^\top, \qquad y = (2.1, 3.9, 5.8, 8.2)^\top$$

Build the design matrix:

$$A = \begin{pmatrix}1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4\end{pmatrix}$$

Gram-Schmidt on the columns of $A$:

Column 1: $\mathbf{a}_1 = (1,1,1,1)^\top$

$$\mathbf{q}_1 = \frac{\mathbf{a}_1}{\|\mathbf{a}_1\|} = \frac{1}{2}(1,1,1,1)^\top$$

Column 2: $\mathbf{a}_2 = (1,2,3,4)^\top$. Project out $\mathbf{q}_1$:

$$\langle\mathbf{a}_2, \mathbf{q}_1\rangle = \frac{1}{2}(1+2+3+4) = 5$$
$$\mathbf{u}_2 = \mathbf{a}_2 - 5\mathbf{q}_1 = (1,2,3,4)^\top - \frac{5}{2}(1,1,1,1)^\top = \left(-\tfrac{3}{2}, -\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{3}{2}\right)^\top$$
$$\|\mathbf{u}_2\| = \sqrt{\tfrac{9}{4}+\tfrac{1}{4}+\tfrac{1}{4}+\tfrac{9}{4}} = \sqrt{5}, \qquad \mathbf{q}_2 = \frac{1}{\sqrt{5}}\left(-\tfrac{3}{2}, -\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{3}{2}\right)^\top$$

The projection (fitted values):

$$\hat{\mathbf{y}} = QQ^\top\mathbf{y} \quad \text{where } Q = [\mathbf{q}_1\,|\,\mathbf{q}_2]$$

This is the projection of $\mathbf{y}$ onto $\operatorname{col}(A)$ = the space of all affine functions of $x$. The residual $\mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to every affine function.

Verification of the normal equations: The least-squares solution $\hat{\boldsymbol{\beta}}$ satisfies $A^\top A\hat{\boldsymbol{\beta}} = A^\top\mathbf{y}$:

$$A^\top A = \begin{pmatrix}4 & 10 \\ 10 & 30\end{pmatrix}, \qquad A^\top\mathbf{y} = \begin{pmatrix}20 \\ 60.1\end{pmatrix}$$

Solving: $\hat{\boldsymbol{\beta}} = (-0.05, 2.02)^\top$, so $\hat{y} = -0.05 + 2.02x$ - a good linear fit.
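
A quick numpy cross-check of this worked example, solving via the thin QR factorization and via np.linalg.lstsq:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 5.8, 8.2])
A = np.column_stack([np.ones_like(x), x])      # design matrix [1 | x]

Q, R = np.linalg.qr(A)                         # thin QR, Q is 4x2
beta_qr = np.linalg.solve(R, Q.T @ y)          # solve R beta = Q^T y
beta_ls = np.linalg.lstsq(A, y, rcond=None)[0]

print(beta_qr)                                  # [-0.05  2.02]
print(np.allclose(beta_qr, beta_ls))            # True
print(Q @ Q.T @ y)                              # fitted values y_hat
```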

10c.2 Full Worked Example: Gram-Schmidt in $\mathbb{R}^3$

Input vectors:

$$\mathbf{a}_1 = \begin{pmatrix}1\\1\\0\end{pmatrix}, \quad \mathbf{a}_2 = \begin{pmatrix}1\\0\\1\end{pmatrix}, \quad \mathbf{a}_3 = \begin{pmatrix}0\\1\\1\end{pmatrix}$$

Step 1: $\mathbf{q}_1 = \mathbf{a}_1/\|\mathbf{a}_1\| = (1/\sqrt{2})(1,1,0)^\top$

Step 2: Remove the component of $\mathbf{a}_2$ along $\mathbf{q}_1$:

$$\langle\mathbf{a}_2, \mathbf{q}_1\rangle = \frac{1}{\sqrt{2}}(1 \cdot 1 + 0 \cdot 1 + 1 \cdot 0) = \frac{1}{\sqrt{2}}$$
$$\mathbf{u}_2 = \mathbf{a}_2 - \frac{1}{\sqrt{2}}\mathbf{q}_1 = \begin{pmatrix}1\\0\\1\end{pmatrix} - \frac{1}{2}\begin{pmatrix}1\\1\\0\end{pmatrix} = \begin{pmatrix}1/2\\-1/2\\1\end{pmatrix}$$
$$\|\mathbf{u}_2\| = \sqrt{1/4 + 1/4 + 1} = \sqrt{3/2}, \qquad \mathbf{q}_2 = \sqrt{\frac{2}{3}}\begin{pmatrix}1/2\\-1/2\\1\end{pmatrix} = \frac{1}{\sqrt{6}}\begin{pmatrix}1\\-1\\2\end{pmatrix}$$

Step 3: Remove the components of $\mathbf{a}_3$ along $\mathbf{q}_1$ and $\mathbf{q}_2$:

$$\langle\mathbf{a}_3, \mathbf{q}_1\rangle = \frac{1}{\sqrt{2}}(0+1+0) = \frac{1}{\sqrt{2}}, \qquad \langle\mathbf{a}_3, \mathbf{q}_2\rangle = \frac{1}{\sqrt{6}}(0-1+2) = \frac{1}{\sqrt{6}}$$
$$\mathbf{u}_3 = \begin{pmatrix}0\\1\\1\end{pmatrix} - \frac{1}{\sqrt{2}}\cdot\frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\\0\end{pmatrix} - \frac{1}{\sqrt{6}}\cdot\frac{1}{\sqrt{6}}\begin{pmatrix}1\\-1\\2\end{pmatrix} = \begin{pmatrix}0\\1\\1\end{pmatrix} - \frac{1}{2}\begin{pmatrix}1\\1\\0\end{pmatrix} - \frac{1}{6}\begin{pmatrix}1\\-1\\2\end{pmatrix} = \begin{pmatrix}-2/3\\2/3\\2/3\end{pmatrix}$$
$$\|\mathbf{u}_3\| = \sqrt{4/9+4/9+4/9} = \frac{2}{\sqrt{3}}, \qquad \mathbf{q}_3 = \frac{1}{\sqrt{3}}\begin{pmatrix}-1\\1\\1\end{pmatrix}$$

Verification:
$\langle\mathbf{q}_1,\mathbf{q}_2\rangle = (1/\sqrt{2})(1/\sqrt{6})(1-1+0) = 0$ ✓
$\langle\mathbf{q}_1,\mathbf{q}_3\rangle = (1/\sqrt{2})(1/\sqrt{3})(-1+1+0) = 0$ ✓
$\langle\mathbf{q}_2,\mathbf{q}_3\rangle = (1/\sqrt{6})(1/\sqrt{3})(-1-1+2) = 0$ ✓

QR factorization: The matrix $R$ is upper triangular with:

$$R = \begin{pmatrix}\|\mathbf{a}_1\| & \langle\mathbf{a}_2,\mathbf{q}_1\rangle & \langle\mathbf{a}_3,\mathbf{q}_1\rangle \\ 0 & \|\mathbf{u}_2\| & \langle\mathbf{a}_3,\mathbf{q}_2\rangle \\ 0 & 0 & \|\mathbf{u}_3\|\end{pmatrix} = \begin{pmatrix}\sqrt{2} & 1/\sqrt{2} & 1/\sqrt{2} \\ 0 & \sqrt{3/2} & 1/\sqrt{6} \\ 0 & 0 & 2/\sqrt{3}\end{pmatrix}$$
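
A quick numpy cross-check that these hand-computed factors satisfy $Q^\top Q = I$ and $QR = A$:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

Q = np.column_stack([
    np.array([1, 1, 0]) / np.sqrt(2),
    np.array([1, -1, 2]) / np.sqrt(6),
    np.array([-1, 1, 1]) / np.sqrt(3),
])
R = np.array([[np.sqrt(2), 1 / np.sqrt(2), 1 / np.sqrt(2)],
              [0.0, np.sqrt(1.5), 1 / np.sqrt(6)],
              [0.0, 0.0, 2 / np.sqrt(3)]])

print(np.allclose(Q.T @ Q, np.eye(3)))   # orthonormal columns
print(np.allclose(Q @ R, A))             # QR reproduces A
```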

10c.3 Spectral Decomposition Example

Let $A = \begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix}$ (symmetric).

Eigenvalues: $\det(A - \lambda I) = (3-\lambda)^2 - 1 = 0 \Rightarrow \lambda = 2$ or $\lambda = 4$.

Eigenvectors:

  • $\lambda_1 = 2$: $(A - 2I)\mathbf{v} = 0 \Rightarrow \begin{pmatrix}1&1\\1&1\end{pmatrix}\mathbf{v} = 0 \Rightarrow \mathbf{v}_1 = (1,-1)^\top/\sqrt{2}$
  • $\lambda_2 = 4$: $(A - 4I)\mathbf{v} = 0 \Rightarrow \begin{pmatrix}-1&1\\1&-1\end{pmatrix}\mathbf{v} = 0 \Rightarrow \mathbf{v}_2 = (1,1)^\top/\sqrt{2}$

Verify orthogonality: $\mathbf{v}_1^\top\mathbf{v}_2 = (1)(1) + (-1)(1) = 0$ ✓

Spectral decomposition:

$$A = \lambda_1\mathbf{v}_1\mathbf{v}_1^\top + \lambda_2\mathbf{v}_2\mathbf{v}_2^\top = 2\begin{pmatrix}1/2 & -1/2 \\ -1/2 & 1/2\end{pmatrix} + 4\begin{pmatrix}1/2 & 1/2 \\ 1/2 & 1/2\end{pmatrix} = \begin{pmatrix}1 & -1 \\ -1 & 1\end{pmatrix} + \begin{pmatrix}2 & 2 \\ 2 & 2\end{pmatrix} = \begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix} = A \checkmark$$

Rayleigh quotient at $\mathbf{x} = (1,0)^\top$:

$$R_A(\mathbf{x}) = \mathbf{x}^\top A\mathbf{x} = (1,0)\begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix}\begin{pmatrix}1\\0\end{pmatrix} = 3$$

Indeed $\lambda_{\min} = 2 \leq 3 \leq 4 = \lambda_{\max}$.
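
The same computation with numpy's symmetric eigensolver (np.linalg.eigh returns eigenvalues in ascending order and orthonormal eigenvectors as columns):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

eigvals, V = np.linalg.eigh(A)               # eigh: for symmetric/Hermitian matrices
print(eigvals)                                # [2. 4.]
print(np.allclose(V.T @ V, np.eye(2)))        # eigenvectors are orthonormal
print(np.allclose(V @ np.diag(eigvals) @ V.T, A))   # A = Q Lambda Q^T

x = np.array([1.0, 0.0])
print(x @ A @ x / (x @ x))                    # Rayleigh quotient = 3.0, inside [2, 4]
```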

10c.4 Householder Reflector: Concrete Computation

Problem: Find a Householder reflector $H$ such that $H\mathbf{a} = \pm\|\mathbf{a}\|\mathbf{e}_1$ where $\mathbf{a} = (4, 3)^\top$.

Step 1: $\|\mathbf{a}\| = \sqrt{16+9} = 5$.

Step 2: Since $a_1 = 4 > 0$, use the $+$ sign to avoid cancellation:

$$\mathbf{n}_{\text{unnorm}} = \mathbf{a} + \|\mathbf{a}\|\mathbf{e}_1 = \begin{pmatrix}4\\3\end{pmatrix} + \begin{pmatrix}5\\0\end{pmatrix} = \begin{pmatrix}9\\3\end{pmatrix}$$

Step 3: Normalize: $\|\mathbf{n}_{\text{unnorm}}\| = \sqrt{81+9} = \sqrt{90} = 3\sqrt{10}$

$$\mathbf{n} = \frac{1}{3\sqrt{10}}\begin{pmatrix}9\\3\end{pmatrix} = \frac{1}{\sqrt{10}}\begin{pmatrix}3\\1\end{pmatrix}$$

Step 4: Build the reflector:

$$H = I - 2\mathbf{n}\mathbf{n}^\top = \begin{pmatrix}1 & 0 \\ 0 & 1\end{pmatrix} - \frac{2}{10}\begin{pmatrix}9 & 3 \\ 3 & 1\end{pmatrix} = \begin{pmatrix}1-9/5 & -3/5 \\ -3/5 & 1-1/5\end{pmatrix} = \begin{pmatrix}-4/5 & -3/5 \\ -3/5 & 4/5\end{pmatrix}$$

Verify:
$$H\mathbf{a} = \begin{pmatrix}-4/5 & -3/5 \\ -3/5 & 4/5\end{pmatrix}\begin{pmatrix}4\\3\end{pmatrix} = \begin{pmatrix}-16/5-9/5\\-12/5+12/5\end{pmatrix} = \begin{pmatrix}-5\\0\end{pmatrix} = -5\mathbf{e}_1$$

This gives $-\|\mathbf{a}\|\mathbf{e}_1$ (a valid Householder result; the sign depends on the convention). $\det(H) = -1$ as expected. ✓
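
The same construction in numpy, written for a general input vector (the helper name make_householder is ours; it follows the cancellation-avoiding sign convention above):

```python
import numpy as np

def make_householder(a):
    """Reflector H with H a = -sign(a[0]) * ||a|| * e1 (cancellation-free sign)."""
    n = a + np.sign(a[0]) * np.linalg.norm(a) * np.eye(len(a))[0]
    n /= np.linalg.norm(n)
    return np.eye(len(a)) - 2.0 * np.outer(n, n)

a = np.array([4.0, 3.0])
H = make_householder(a)
print(H)                                   # [[-0.8 -0.6], [-0.6  0.8]]
print(H @ a)                               # [-5.  0.]
print(np.allclose(H.T @ H, np.eye(2)))     # orthogonal
print(np.linalg.det(H))                    # -1.0 (a reflection)
```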


10d. Orthogonality and the Geometry of Neural Networks

10d.1 The Geometry of Weight Matrices

A single linear layer $\mathbf{y} = W\mathbf{x}$ transforms the input geometry. Understanding this transformation requires decomposing $W$ via its singular value decomposition (preview):

$$W = U\Sigma V^\top \qquad \text{(covered fully in 02: SVD)}$$
  • $V^\top$: rotate the input space (first orthogonal transformation)
  • $\Sigma$: scale along the new axes by singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$
  • $U$: rotate the output space (second orthogonal transformation)

For orthogonal $W$: All singular values are 1, so $\Sigma = I$. The layer is a pure rotation (or reflection) - it doesn't compress or expand the input distribution. Distances and angles are perfectly preserved.

Implication for gradient flow. The gradient of the loss with respect to the input is:

$$\frac{\partial\mathcal{L}}{\partial\mathbf{x}} = W^\top\frac{\partial\mathcal{L}}{\partial\mathbf{y}}$$

For orthogonal $W$: $\|W^\top\mathbf{g}\| = \|\mathbf{g}\|$ - gradients are neither amplified nor attenuated. This is the formal reason orthogonal initialization prevents gradient vanishing/explosion in deep linear networks.

10d.2 Attention Heads as Orthogonal Projections

In a Transformer's multi-head attention with $H$ heads:

$$\text{head}_h = \text{Attn}(XW_h^Q, XW_h^K, XW_h^V)$$

where $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{d \times d_k}$ are the query/key/value projection matrices.

Mechanistic interpretability perspective (Elhage et al. 2021): Each attention head computes a projection of the residual stream onto a low-dimensional subspace, adds information from that subspace back, and (in some sense) different heads should attend to different "directions" of information.

Mathematical idealization: If we require $W_h^V (W_{h'}^V)^\top = 0$ for $h \neq h'$ (the value matrices project to orthogonal subspaces), then different heads genuinely operate on independent information. In practice, heads aren't exactly orthogonal, but studying the overlap $\|W_h^V (W_{h'}^V)^\top\|_F$ quantifies how much heads interfere with each other.

QK circuit analysis. The effective attention pattern for head $h$ involves $W_h^Q (W_h^K)^\top \in \mathbb{R}^{d \times d}$ (or rather its action on the residual stream). This matrix's singular values determine how "sharp" vs "diffuse" the attention pattern is.

10d.3 Layer Normalization and Orthogonal Complements

Layer normalization (Ba et al. 2016) operates on a vector $\mathbf{x} \in \mathbb{R}^d$:

$$\text{LayerNorm}(\mathbf{x}) = \frac{\mathbf{x} - \mu\mathbf{1}}{\sigma} \cdot \boldsymbol{\gamma} + \boldsymbol{\beta}$$

where $\mu = \frac{1}{d}\sum_i x_i$ and $\sigma^2 = \frac{1}{d}\sum_i (x_i - \mu)^2$.

Orthogonal decomposition view. The centering step $\mathbf{x} \mapsto \mathbf{x} - \mu\mathbf{1}$ is an orthogonal projection onto the subspace $\mathbf{1}^\perp = \{\mathbf{v} \in \mathbb{R}^d : \sum_i v_i = 0\}$:

$$P_{\mathbf{1}^\perp} = I - \frac{\mathbf{1}\mathbf{1}^\top}{d}$$

This is a rank-$(d-1)$ orthogonal projector. Layer normalization first projects out the "mean direction" $\mathbf{1}/\sqrt{d}$, then rescales the result to a fixed norm on the remaining $(d-1)$-dimensional subspace.

Consequence for representational geometry: After the centering and rescaling (before the learned affine map $\boldsymbol{\gamma}, \boldsymbol{\beta}$), the internal representations $\mathbf{h}$ always satisfy $\mathbf{h} \perp \mathbf{1}$. The model can only represent information in the orthogonal complement of the constant vector - this is why all-zero inputs and constant inputs are indistinguishable after layer norm.
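
A minimal numpy sketch of the centering-as-projection view (the dimension is an arbitrary choice, and the learned $\boldsymbol{\gamma}, \boldsymbol{\beta}$ and the numerical epsilon are omitted):

```python
import numpy as np

d = 16
x = np.random.default_rng(0).standard_normal(d)

P = np.eye(d) - np.ones((d, d)) / d         # projector onto the subspace 1-perp
centered = x - x.mean()

print(np.allclose(P @ x, centered))          # centering == orthogonal projection
print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
print(np.linalg.matrix_rank(P))              # d - 1 = 15

h = centered / x.std()                       # the normalized (pre-affine) vector
print(np.isclose(h.sum(), 0.0))              # h is orthogonal to the all-ones vector
```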

10d.4 Orthogonal Regularization in Weight Matrices

Several training techniques maintain (near-)orthogonality in weight matrices throughout training:

Orthogonal regularization: Add a penalty $\lambda\|W^\top W - I\|_F^2$ to the loss. This penalizes deviation from orthogonality without enforcing it exactly.
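
A minimal numpy sketch of this soft penalty and its gradient (the step size, shapes, and iteration count are arbitrary choices; in a framework such as PyTorch the gradient would come from autograd):

```python
import numpy as np

def orth_penalty(W, lam=1.0):
    """Soft orthogonality penalty lam * ||W^T W - I||_F^2 and its gradient in W."""
    D = W.T @ W - np.eye(W.shape[1])
    loss = lam * np.sum(D ** 2)
    grad = 4.0 * lam * W @ D          # d/dW ||W^T W - I||_F^2 = 4 W (W^T W - I)
    return loss, grad

W = np.random.default_rng(0).standard_normal((32, 16)) / np.sqrt(32)
print(np.linalg.norm(W.T @ W - np.eye(16)))   # deviation from orthogonality at init

for _ in range(200):                           # gradient descent on the penalty alone
    _, grad = orth_penalty(W)
    W -= 0.05 * grad

print(np.linalg.norm(W.T @ W - np.eye(16)))   # near zero: columns almost orthonormal
```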

Spectral regularization: Penalize $\lambda \sum_i (\sigma_i - 1)^2$ - singular values are pushed toward 1. Stronger than spectral normalization, which only constrains the largest singular value.

Riemannian optimization on the Stiefel manifold: The set of $n \times k$ matrices with orthonormal columns, $\operatorname{St}(n,k) = \{Q \in \mathbb{R}^{n \times k} : Q^\top Q = I_k\}$, is a smooth Riemannian manifold. Gradient descent on $\operatorname{St}(n,k)$ uses the Riemannian gradient (the projection of the Euclidean gradient onto the tangent space) and retracts to the manifold via a QR decomposition or the Cayley transform after each step.

Applications: Orthogonal RNNs, rotation-equivariant neural networks, invertible neural networks, and normalizing flows all benefit from exact or approximate orthogonality constraints.


11. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Confusing "orthogonal" and "orthonormal" | Orthogonal means $\langle\mathbf{u},\mathbf{v}\rangle = 0$; orthonormal additionally requires unit length. An orthogonal matrix $Q$ has orthonormal columns, not just orthogonal ones. | Check: orthogonal <-> zero inner products; orthonormal <-> zero inner products AND $\|\mathbf{q}_i\| = 1$ |
| 2 | Assuming $QQ^\top = I$ implies $Q^\top Q = I$ without checking squareness | For non-square $Q \in \mathbb{R}^{m \times n}$ ($m > n$), $Q^\top Q = I_n$ but $QQ^\top \neq I_m$ (it's a rank-$n$ projection). Only square orthogonal matrices satisfy both. | Specify dimensions: "thin $Q$" ($Q^\top Q = I_n$) vs "full $Q$" ($QQ^\top = I_m = Q^\top Q$) |
| 3 | Thinking Gram-Schmidt preserves span ordering | CGS does: $\operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_k) = \operatorname{span}(\mathbf{a}_1,\ldots,\mathbf{a}_k)$. But with pivoting or reordering, this property is lost. | Track column ordering explicitly when pivoted QR is used |
| 4 | Using CGS when numerical stability matters | Classical Gram-Schmidt squares the error: $O(\epsilon_{\text{mach}}\kappa(A)^2)$ vs MGS's $O(\epsilon_{\text{mach}}\kappa(A))$. For ill-conditioned $A$, CGS can produce non-orthogonal "orthonormal" vectors. | Use Modified Gram-Schmidt (MGS) or Householder QR for production code |
| 5 | Projecting onto a non-orthogonal basis | The formula $P = QQ^\top$ assumes $Q^\top Q = I$. If the columns are merely linearly independent (not orthonormal), the correct formula is $P = Q(Q^\top Q)^{-1}Q^\top$, which involves the $(Q^\top Q)^{-1}$ factor. | Always check that $Q^\top Q = I$ before using the simplified projection formula |
| 6 | Thinking $P^2 = P$ is sufficient for orthogonal projection | Idempotence $P^2 = P$ only guarantees $P$ is some projection. Orthogonal projections additionally require $P^\top = P$ (symmetry). Oblique projections satisfy idempotence but not symmetry. | An orthogonal projection must be both idempotent ($P^2 = P$) and symmetric ($P^\top = P$) |
| 7 | Claiming eigenvectors are always orthogonal | Eigenvectors for different eigenvalues are orthogonal only for symmetric matrices. For general matrices, eigenvectors can be arbitrary non-orthogonal vectors (or even non-real). | The spectral theorem applies to symmetric/Hermitian matrices; general matrices need the Schur decomposition |
| 8 | Forgetting the sign convention in Householder | When computing $\mathbf{n} = (\mathbf{a} \pm \|\mathbf{a}\|\mathbf{e}_1)/\|\cdots\|$, choosing the wrong sign causes catastrophic cancellation when $\mathbf{a}$ is nearly parallel to $\mathbf{e}_1$. | Choose the sign of $\pm\|\mathbf{a}\|\mathbf{e}_1$ to match the sign of $a_1$: $\mathbf{n} = (\mathbf{a} + \operatorname{sign}(a_1)\|\mathbf{a}\|\mathbf{e}_1)/\|\cdots\|$ |
| 9 | Misinterpreting Parseval's identity as an approximation | $\|\mathbf{v}\|^2 = \sum_i \hat{v}_i^2$ is an equality when $\{\mathbf{q}_i\}$ is a complete orthonormal basis for the whole space. If it's only a basis for a subspace, you get Bessel's inequality: $\sum_i \hat{v}_i^2 \leq \|\mathbf{v}\|^2$. | Distinguish: complete ONB -> Parseval's equality; partial ONB -> Bessel's inequality |
| 10 | Assuming QR is always better than Cholesky for the normal equations | QR avoids squaring the condition number but costs more flops ($O(mn^2)$ vs $O(mn^2/2 + n^3/6)$ for Cholesky). For well-conditioned problems with large $m$, Cholesky on $A^\top A$ may be faster. | Choose based on the condition number: if $\kappa(A) \gg 1$, use QR; if $\kappa(A) \approx 1$, Cholesky is acceptable and faster |
| 11 | Using $\det(Q) = 1$ as the definition of orthogonal | $\det(Q) = \pm 1$ is a consequence of orthogonality, not the definition. A matrix can have determinant 1 without being orthogonal (e.g., any upper triangular matrix with diagonal product 1). | The definition is $Q^\top Q = I$; determinant $= \pm 1$ follows from it |
| 12 | Thinking orthogonal initialization prevents gradient problems throughout training | Orthogonal initialization helps at initialization. As training proceeds, weights deviate from orthogonality. Maintaining orthogonality throughout training requires explicit regularization (e.g., spectral normalization, an orthogonal regularization loss). | Use orthogonal init as a starting point; for persistent orthogonality through training, use explicit constraints |

12. Exercises

Exercise 1 * - Projection and Decomposition

Given $\mathbf{v} = (3, 1, 2)^\top$ and the subspace $S = \operatorname{span}\{(1, 1, 0)^\top, (0, 1, 1)^\top\}$:

(a) Apply Gram-Schmidt to the spanning vectors to obtain an ONB $\{\mathbf{q}_1, \mathbf{q}_2\}$ for $S$.

(b) Compute the projection $\mathbf{v}_S = \operatorname{proj}_S(\mathbf{v})$ using the ONB.

(c) Verify: $\mathbf{v} - \mathbf{v}_S \perp \mathbf{q}_1$ and $\mathbf{v} - \mathbf{v}_S \perp \mathbf{q}_2$.

(d) Compute $\|\mathbf{v}\|^2$ and verify $\|\mathbf{v}\|^2 = \|\mathbf{v}_S\|^2 + \|\mathbf{v} - \mathbf{v}_S\|^2$ (Pythagorean theorem).


Exercise 2 * - Gram-Schmidt and QR

Let $A = \begin{pmatrix}1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1\end{pmatrix}$.

(a) Apply Gram-Schmidt to the columns of $A$ to produce $Q$ and $R$ such that $A = QR$.

(b) Verify that $Q^\top Q = I$ and $QR = A$ numerically.

(c) Solve the least-squares problem $\min_\mathbf{x}\|A\mathbf{x} - \mathbf{b}\|$ for $\mathbf{b} = (1,2,3)^\top$ by solving $R\mathbf{x} = Q^\top\mathbf{b}$ via back substitution.

(d) Verify the solution against np.linalg.lstsq.


Exercise 3 * - Orthogonal Matrices

(a) Verify that the $2\times 2$ rotation matrix $R_\theta = \begin{pmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{pmatrix}$ is orthogonal for $\theta = \pi/4$.

(b) Show by matrix multiplication that the composition of two rotations satisfies $R_\theta R_\phi = R_{\theta+\phi}$.

(c) Construct a Householder reflector $H$ such that $H\mathbf{a} = \|\mathbf{a}\|\mathbf{e}_1$ for $\mathbf{a} = (3, 4)^\top$. Verify $H^\top H = I$ and $H\mathbf{a} = (5, 0)^\top$.

(d) What is $\det(H)$? Explain geometrically.


Exercise 4 ** - Modified Gram-Schmidt

(a) Implement both Classical and Modified Gram-Schmidt.

(b) Test on the Hilbert matrix $H_{ij} = 1/(i+j-1)$ for $n = 10$. Compute the maximum off-diagonal entry of $Q^\top Q - I$ for each method.

(c) Plot the orthogonality error $\|Q^\top Q - I\|_F$ vs matrix size $n$ for both methods. What do you observe?

(d) Explain why MGS is more stable using the error analysis from 5.2.


Exercise 5 ** - Least Squares Stability

Given the Vandermonde system fitting a degree-$d$ polynomial through $m$ points:

(a) Build the design matrix $A \in \mathbb{R}^{m \times (d+1)}$ with $A_{ij} = x_i^{j-1}$ for $m = 20$, $d = 10$.

(b) Solve the least-squares problem via: (i) the normal equations $(A^\top A)\hat{\mathbf{c}} = A^\top\mathbf{y}$, (ii) QR factorization $A = QR$ then $R\hat{\mathbf{c}} = Q^\top\mathbf{y}$.

(c) Compute the condition numbers $\kappa(A)$ and $\kappa(A^\top A)$. What is their ratio?

(d) Perturb $\mathbf{y}$ by small noise and compare the residuals from both methods. Which is more stable?


Exercise 6 ** - Spectral Theorem

Let $A = \begin{pmatrix}4 & 2 \\ 2 & 1\end{pmatrix}$.

(a) Find all eigenvalues and eigenvectors of $A$ analytically.

(b) Verify that the eigenvectors are orthogonal. Normalize them to form an orthogonal matrix $Q$.

(c) Verify the spectral decomposition: $A = Q\Lambda Q^\top$ where $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2)$.

(d) Compute the Rayleigh quotient $R_A(\mathbf{x})$ for $\mathbf{x} = (1, 0)^\top$, $(0, 1)^\top$, $(1, 1)^\top/\sqrt{2}$. Verify each is in $[\lambda_{\min}, \lambda_{\max}]$.

(e) Is $A$ positive definite? Verify using the eigenvalues and by direct computation of $\mathbf{x}^\top A\mathbf{x}$.


Exercise 7 *** - Orthogonal Weight Initialization

(a) Generate 100 random matrices $W_1, \ldots, W_L \in \mathbb{R}^{64 \times 64}$ with i.i.d. Gaussian entries (standard init). Compute the spectral norm of their product $\|W_L \cdots W_1\|_2$ for $L = 1, 5, 10, 20$.

(b) Repeat with each $W_k$ initialized as a random orthogonal matrix (via QR of a Gaussian matrix). Plot $\|W_L \cdots W_1\|_2$ vs $L$ for both initializations.

(c) Generate synthetic "gradient" vectors $\mathbf{g} \in \mathbb{R}^{64}$ and propagate them backward through the chains. Compare gradient norms: $\|(W_L \cdots W_1)^\top\mathbf{g}\|$ vs $\|\mathbf{g}\|$.

(d) Explain the result using the isometry property of orthogonal matrices.


Exercise 8 *** - Rayleigh Quotient Iteration

Rayleigh Quotient Iteration is a method for finding eigenvectors with cubic convergence:

$$\mathbf{x}^{(k+1)} = \frac{(A - \rho_k I)^{-1}\mathbf{x}^{(k)}}{\|(A - \rho_k I)^{-1}\mathbf{x}^{(k)}\|}, \qquad \rho_k = R_A(\mathbf{x}^{(k)})$$

(a) Implement Rayleigh Quotient Iteration for a symmetric matrix.

(b) Test on $A = \begin{pmatrix}3 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 1\end{pmatrix}$ with a random initial $\mathbf{x}^{(0)}$. Track $|\rho_k - \lambda_\star|$ over the iterations.

(c) Plot the convergence on a log scale and estimate the convergence rate (should be cubic: $|\rho_{k+1} - \lambda_\star| \approx C|\rho_k - \lambda_\star|^3$).

(d) Compare convergence speed vs power iteration and inverse iteration with a fixed shift.

(e) Explain why Rayleigh quotient (not a fixed shift) is needed for cubic convergence.


13. Why This Matters for AI (2026 Perspective)

| Concept | AI Application | Impact |
|---------|----------------|--------|
| Orthogonal initialization | torch.nn.init.orthogonal_() | Prevents gradient explosion/vanishing at initialization; standard practice for deep linear networks (Saxe et al.) and RNNs |
| QR decomposition | Muon optimizer; orthogonal gradient updates | Orthogonalizing gradient matrices via QR/Newton-Schulz stabilizes LLM training; avoids rank collapse of weight matrices |
| RoPE embeddings | LLaMA, Mistral, GPT-NeoX, Falcon, Gemma | Rotary position encoding uses orthogonal rotation matrices to encode relative position in self-attention; rotation composition gives the relative-position property |
| Spectral normalization | GANs, discriminator regularization | Constrains $\sigma_1(W) = 1$ for a 1-Lipschitz discriminator; stabilizes Wasserstein GAN training |
| Gram-Schmidt / QR | Numerically stable least squares | Used in regression, feature selection, basis pursuit; avoiding the normal equations prevents condition-number squaring |
| Orthogonal complement | Continual learning (OGD, GEM) | New-task gradients are projected onto the orthogonal complement of old-task gradient subspaces; prevents catastrophic forgetting |
| Spectral theorem | Hessian analysis, SAM optimizer | Eigendecomposition of the loss Hessian reveals sharp/flat directions; Sharpness-Aware Minimization (SAM) seeks minima with small $\lambda_{\max}(\nabla^2\mathcal{L})$ |
| Orthonormal bases | Attention head analysis | Queries, keys, and values in each attention head span subspaces; orthogonality between heads corresponds to specialization; mechanistic interpretability studies this |
| DFT (unitary matrix) | Audio models, efficient convolutions | Wav2Vec and Whisper process spectrograms (DFT magnitudes); convolution in the frequency domain uses the FFT, i.e. multiplication by the unitary $F_n$ |
| Householder reflectors | QR in neural network training | Used in computing QR factorizations during optimizer steps (Shampoo); the Householder product form compresses orthogonal matrices for efficient multiplication |
| Rayleigh quotient | Eigenvalue estimation in training | Stochastic Lanczos quadrature (Ghorbani et al. 2019) approximates the Hessian spectrum via Rayleigh quotients on random probes |
| Gram matrix | Gram-matrix style losses, NTK | The Neural Tangent Kernel $\Theta = J J^\top$ (where $J$ is the Jacobian) is a Gram matrix; its eigenstructure determines training dynamics |

14. Advanced Topics

14.1 Polar Decomposition

Every matrix $A \in \mathbb{R}^{m \times n}$ ($m \geq n$) has a polar decomposition:

$$A = QS$$

where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns ($Q^\top Q = I_n$) and $S \in \mathbb{R}^{n \times n}$ is symmetric positive semidefinite.

Uniqueness: If $A$ has full column rank, $Q$ and $S$ are unique, with $S = (A^\top A)^{1/2}$ (the matrix square root) and $Q = A(A^\top A)^{-1/2}$.

Geometric meaning: $S$ captures the "stretching" and $Q$ captures the "rotation" - this is the matrix analog of writing a complex number as $re^{i\theta}$.

Relation to SVD: If $A = U\Sigma V^\top$ (thin SVD), then:

$$A = (UV^\top)(V\Sigma V^\top) = Q \cdot S$$

So $Q = UV^\top$ (the "nearest orthogonal matrix" to $A$) and $S = V\Sigma V^\top$.

For AI: The Muon optimizer applies $Q = UV^\top$ of the gradient matrix $G$ as the weight update (approximated via Newton-Schulz iterations rather than an explicit SVD) - this is exactly the orthogonal factor of the polar decomposition of $G$.
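
A minimal numpy sketch computing both polar factors from the thin SVD and checking the "nearest orthogonal matrix" property against an arbitrary comparison matrix:

```python
import numpy as np

def polar(A):
    """Polar decomposition A = Q S via the thin SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Q = U @ Vt                       # orthogonal factor (orthonormal columns)
    S = Vt.T @ np.diag(s) @ Vt       # symmetric positive semidefinite factor
    return Q, S

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Q, S = polar(A)

print(np.allclose(Q @ S, A))                       # A = Q S
print(np.allclose(Q.T @ Q, np.eye(4)))             # Q^T Q = I
print(np.allclose(S, S.T), np.all(np.linalg.eigvalsh(S) >= -1e-12))  # S is SPSD

# Q is the nearest matrix with orthonormal columns in Frobenius norm
other, _ = np.linalg.qr(rng.standard_normal((6, 4)))
print(np.linalg.norm(A - Q), "<", np.linalg.norm(A - other))
```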

14.2 Orthogonality in Hilbert Spaces

All finite-dimensional results extend to Hilbert spaces - complete inner product spaces, possibly infinite-dimensional.

Orthogonal Projection Theorem (Hilbert spaces). If $\mathcal{H}$ is a Hilbert space and $\mathcal{M} \subseteq \mathcal{H}$ is a closed subspace, then every $\mathbf{v} \in \mathcal{H}$ has a unique decomposition $\mathbf{v} = \mathbf{v}_\mathcal{M} + \mathbf{v}_{\mathcal{M}^\perp}$ with $\mathbf{v}_\mathcal{M} \in \mathcal{M}$ and $\mathbf{v}_{\mathcal{M}^\perp} \in \mathcal{M}^\perp$.

Completeness is essential: Without it, the projection may not exist. (The "closed" hypothesis ensures that the minimizing sequence defining the projection converges to a point inside $\mathcal{M}$.)

Examples of Hilbert spaces in ML:

  • $L^2(\mathbb{R})$: the space of square-integrable functions (kernel methods, RKHS)
  • $\ell^2$: square-summable sequences (infinite-dimensional limits of neural networks)
  • Reproducing Kernel Hilbert Spaces (RKHS): the function space for kernel machines, Gaussian processes

14.3 Gram Matrices and the Kernel Trick

Given vectors $\mathbf{x}_1, \ldots, \mathbf{x}_m \in \mathbb{R}^d$, the Gram matrix is:

$$G_{ij} = \langle\mathbf{x}_i, \mathbf{x}_j\rangle = \mathbf{x}_i^\top\mathbf{x}_j$$

In matrix form: $G = XX^\top$ where $X = [\mathbf{x}_1|\cdots|\mathbf{x}_m]^\top \in \mathbb{R}^{m \times d}$.

Properties: $G$ is symmetric positive semidefinite. Its rank equals $\operatorname{rank}(X)$.

Kernel trick: Replace $\langle\mathbf{x}_i,\mathbf{x}_j\rangle$ with $k(\mathbf{x}_i,\mathbf{x}_j)$ where $k$ is a positive definite kernel. The resulting kernel matrix $K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)$ is a Gram matrix in a (possibly infinite-dimensional) RKHS.

For attention: The attention matrix $\text{Attn}_{ij} \propto \exp(\mathbf{q}_i^\top\mathbf{k}_j / \sqrt{d_k})$ is a kernelized Gram matrix between queries and keys. Its spectral properties determine how information flows between tokens.
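
A minimal numpy sketch: build an RBF kernel matrix on random points and confirm that it behaves like a Gram matrix, i.e. it is symmetric and positive semidefinite (the bandwidth and sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))                  # 50 points in R^3

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)                       # RBF kernel matrix, bandwidth 1

print(np.allclose(K, K.T))                        # symmetric
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)    # positive semidefinite

G = X @ X.T                                       # the plain (linear-kernel) Gram matrix
print(np.linalg.matrix_rank(G), np.linalg.matrix_rank(X))   # both equal 3
```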


Conceptual Bridge

Looking Backward: What Made This Possible

The theory developed in this section builds directly on foundations from earlier chapters:

  • Vectors and inner products (01: Vectors and Spaces): The inner product $\langle\mathbf{u},\mathbf{v}\rangle$ is the primitive notion from which orthogonality, angles, and projections all derive.
  • Linear transformations (04: Linear Transformations): Orthogonal matrices are the isometric linear maps - those that preserve both angles and distances. The kernel and image of projection operators are orthogonal complements.
  • Matrix rank (02-LA: Matrix Rank): The rank-nullity theorem and the four fundamental subspaces are unified by orthogonality: $\operatorname{null}(A) = \operatorname{row}(A)^\perp$.
  • Eigenvalues (01: Eigenvalues): The spectral theorem tells us that symmetric matrices are diagonalizable with an orthonormal eigenbasis - the connection between orthogonality and spectral theory.

Looking Forward: What This Enables

Orthogonality and orthonormality are not endpoints but foundations:

  • Singular Value Decomposition (02: SVD): The SVD $A = U\Sigma V^\top$ is built from two orthogonal matrices. The right singular vectors $V$ form an ONB for the row space; the left singular vectors $U$ form an ONB for the column space. Without the theory developed here, the SVD cannot be understood.
  • Matrix Decompositions (08: Matrix Decompositions): LU, QR, Cholesky, Schur - QR is the central algorithm among these, and the QR algorithm for eigenvalues iterates orthogonal similarity transformations.
  • Matrix Norms (06: Matrix Norms): The spectral norm $\|A\|_2 = \sigma_1(A)$ is defined via orthogonal invariance. The condition number $\kappa(Q) = 1$ for orthogonal $Q$.
  • Optimization (Chapter 08): Gradient methods, second-order methods, and their variants all navigate spaces equipped with inner products. The geometry of the loss landscape is described using the spectral theory developed here.
CURRICULUM POSITION: ORTHOGONALITY AND ORTHONORMALITY
===========================================================================

  FOUNDATIONS                    THIS SECTION                  FORWARD
  -----------                    ------------                  -------
  +-----------------+            +-------------------------+   +------------------+
  | Inner Products  |----------> | Orthogonality (05)     |-->| SVD (02)        |
  | Vectors/Spaces  |            | +- Gram-Schmidt          |   | low-rank approx  |
  +-----------------+            | +- QR Decomposition      |   +------------------+
                                 | +- Orthogonal Matrices   |
  +-----------------+            | +- Projections           |   +------------------+
  | Eigenvalues     |----------> | +- Spectral Theorem      |-->| Matrix Norms     |
  | (01)           |            | +- ML Applications       |   | (06)            |
  +-----------------+            +-------------------------+   +------------------+
                                          |
  +-----------------+                     |                     +------------------+
  | Linear Transf.  |---------->----------+                  -->| Decompositions   |
  | (04)           |                                            | (08) QR alg.    |
  +-----------------+                                            +------------------+

  AI CONNECTIONS:
  +------------------------------------------------------------------+
  |  RoPE embeddings -> Gram-Schmidt  -> Muon optimizer  -> OGD/GEM    |
  |  Orthogonal init -> Spectral norm -> Rayleigh quotient -> SAM      |
  |  QR least squares -> DFT (unitary) -> Kernel Gram matrix -> NTK    |
  +------------------------------------------------------------------+

===========================================================================
