
Orthogonality and Orthonormality: Part 2 - Orthogonality in Function Spaces to Conceptual Bridge


9. Orthogonality in Function Spaces

9.1 The L^2 Inner Product

Orthogonality extends naturally from finite-dimensional vector spaces to infinite-dimensional function spaces. The space $L^2([a,b])$ of square-integrable functions carries the inner product:

$$\langle f, g\rangle = \int_a^b f(x)g(x)\,dx$$

Orthogonality of functions: $f \perp g$ if $\int_a^b f(x)g(x)\,dx = 0$.

Example: On $[-\pi, \pi]$, the functions $\{\sin(nx), \cos(mx)\}$ form a mutually orthogonal family:

$$\int_{-\pi}^{\pi} \sin(nx)\cos(mx)\,dx = 0 \quad \text{for all } n, m$$
$$\int_{-\pi}^{\pi} \sin(nx)\sin(mx)\,dx = \pi\,\delta_{nm}$$
$$\int_{-\pi}^{\pi} \cos(nx)\cos(mx)\,dx = \pi\,\delta_{nm} \quad (n, m \geq 1)$$

This is the foundation of Fourier series - a basis expansion for $L^2$ functions.
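
A quick numerical sanity check of these orthogonality relations (a minimal numpy sketch; the grid resolution is an arbitrary choice):

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 200001)
dx = x[1] - x[0]

def inner(f, g):
    """Riemann-sum approximation of the L^2 inner product on [-pi, pi]."""
    return np.sum(f * g) * dx

print(inner(np.sin(3 * x), np.cos(5 * x)))  # ~ 0 for any n, m
print(inner(np.sin(2 * x), np.sin(7 * x)))  # ~ 0 for n != m
print(inner(np.sin(4 * x), np.sin(4 * x)))  # ~ pi (the delta_{nm} = 1 case)
```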

9.2 Fourier Series as Orthonormal Expansion

Orthonormal basis for $L^2([-\pi,\pi])$:

$$\left\{\frac{1}{\sqrt{2\pi}},\ \frac{\cos(x)}{\sqrt{\pi}},\ \frac{\sin(x)}{\sqrt{\pi}},\ \frac{\cos(2x)}{\sqrt{\pi}},\ \frac{\sin(2x)}{\sqrt{\pi}},\ \ldots\right\}$$

The Fourier series of $f$ is its expansion in this ONB:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^\infty \left[a_n\cos(nx) + b_n\sin(nx)\right]$$

where the Fourier coefficients are orthogonal projections:

$$a_n = \frac{1}{\pi}\int_{-\pi}^{\pi}f(x)\cos(nx)\,dx, \qquad b_n = \frac{1}{\pi}\int_{-\pi}^{\pi}f(x)\sin(nx)\,dx$$

Parseval's theorem for Fourier series:

$$\frac{1}{\pi}\int_{-\pi}^{\pi}|f(x)|^2\,dx = \frac{a_0^2}{2} + \sum_{n=1}^\infty(a_n^2 + b_n^2)$$

This is the infinite-dimensional analog of $\|\mathbf{v}\|^2 = \sum_i \hat{v}_i^2$.

9.3 Discrete Fourier Transform as a Unitary Matrix

The Discrete Fourier Transform (DFT) maps $\mathbf{x} \in \mathbb{C}^n$ to $\hat{\mathbf{x}} \in \mathbb{C}^n$:

$$\hat{x}_k = \sum_{j=0}^{n-1} x_j\, \omega^{jk}, \qquad \omega = e^{-2\pi i/n}$$

In matrix form: $\hat{\mathbf{x}} = F_n\mathbf{x}$ where $(F_n)_{jk} = \omega^{jk}$.

Key fact: $F_n / \sqrt{n}$ is unitary (the complex analog of orthogonal):

$$\frac{1}{n}F_n^* F_n = I \quad \Leftrightarrow \quad \frac{F_n}{\sqrt{n}} \text{ is unitary}$$

This is because the DFT basis vectors $\mathbf{f}_k = (1, \omega^k, \omega^{2k}, \ldots, \omega^{(n-1)k})/\sqrt{n}$ are orthonormal in $\mathbb{C}^n$:

$$\langle\mathbf{f}_j, \mathbf{f}_k\rangle = \frac{1}{n}\sum_{\ell=0}^{n-1}\omega^{\ell(k-j)} = \delta_{jk}$$

(The last equality is the geometric series identity for roots of unity.)

Parseval's theorem for DFT: $\|\hat{\mathbf{x}}\|^2 = n\|\mathbf{x}\|^2$.
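
Both facts are easy to check numerically; a minimal numpy sketch (the size $n = 8$ is an arbitrary choice):

```python
import numpy as np

n = 8
jj, kk = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
omega = np.exp(-2j * np.pi / n)          # primitive n-th root of unity
F = omega ** (jj * kk)                   # (F_n)_{jk} = omega^{jk}

U = F / np.sqrt(n)                       # F_n / sqrt(n) should be unitary
print(np.allclose(U.conj().T @ U, np.eye(n)))    # True

x = np.random.randn(n) + 1j * np.random.randn(n)
x_hat = F @ x
print(np.allclose(np.linalg.norm(x_hat) ** 2,
                  n * np.linalg.norm(x) ** 2))   # Parseval: ||x_hat||^2 = n ||x||^2

# This F matches numpy's FFT sign convention
print(np.allclose(x_hat, np.fft.fft(x)))         # True
```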

For AI: The DFT/FFT appears in:

  • Signal processing: Feature extraction from audio (MFCCs, spectrograms)
  • Efficient convolutions: $f * g = \mathcal{F}^{-1}(\mathcal{F}(f) \cdot \mathcal{F}(g))$ - used in efficient CNN implementations
  • Positional encodings: Sinusoidal positional embeddings in the original Transformer are Fourier features
  • SSMs: State space models (Mamba, S4) use structured orthogonal/unitary matrices

10. Applications in Machine Learning

10.1 Orthogonal Weight Initialization

The problem with random initialization. If weight matrices have singular values spread over a wide range, gradients either explode or vanish during backpropagation through many layers.

Saxe et al. (2013) showed that initializing with orthogonal matrices preserves gradient norms through linear networks:

Theorem. For a deep linear network with orthogonal weight matrices $W_1, \ldots, W_L$ (each $\in \mathbb{R}^{n \times n}$), the Jacobian of the full network is $W_L \cdots W_1$, which is also orthogonal (a product of orthogonal matrices is orthogonal). Hence $\|\partial\mathcal{L}/\partial\mathbf{x}\| = \|\partial\mathcal{L}/\partial\mathbf{y}\|$ - gradients neither explode nor vanish.

Practical implementation:

  1. Generate a random $n \times n$ matrix with i.i.d. standard normal entries: $M \sim \mathcal{N}(0, I)$
  2. Compute its QR decomposition: $M = QR$
  3. Use $Q$ (or $Q \cdot \operatorname{sign}(\operatorname{diag}(R))$ for a uniform distribution over $O(n)$) as the initial weight matrix

This is implemented as torch.nn.init.orthogonal_() in PyTorch.
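
A minimal numpy sketch of this recipe (the sign correction in step 3 is what makes the draw uniform over the orthogonal group; torch.nn.init.orthogonal_() follows a similar QR-based procedure):

```python
import numpy as np

def orthogonal_init(n, rng=np.random.default_rng(0)):
    """Draw a random orthogonal matrix: QR of a Gaussian matrix plus a sign fix."""
    M = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(M)
    Q *= np.sign(np.diag(R))       # scale each column so the draw is Haar-uniform
    return Q

W = orthogonal_init(64)
print(np.allclose(W.T @ W, np.eye(64)))            # orthonormal columns

# Isometry: an orthogonal layer neither amplifies nor attenuates gradients
g = np.random.default_rng(1).standard_normal(64)
print(np.linalg.norm(W.T @ g), np.linalg.norm(g))  # equal norms
```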

Why it works beyond linear networks: Even in nonlinear networks, orthogonal initialization places the initial weights in a "good" part of weight space where gradients are well-conditioned at initialization. Empirically, networks with orthogonal initialization often converge faster on deep architectures.

10.2 Rotary Position Embeddings (RoPE)

Standard self-attention lacks inherent position information. The original Transformer adds sinusoidal absolute position embeddings. RoPE (Su et al. 2021, used in LLaMA, GPT-NeoX, Falcon) takes a different approach: it encodes position by rotating query and key vectors.

Construction (2D case). For position $m$, the rotation matrix is:

$$R_m = \begin{pmatrix}\cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta\end{pmatrix}$$

For a $d$-dimensional embedding, pair up dimensions and apply $R_m$ to each pair using $d/2$ different frequencies $\theta_i = 10000^{-2i/d}$.

Key property. The inner product of rotated queries and keys depends only on the relative position $m - n$:

$$\langle R_m \mathbf{q}, R_n \mathbf{k}\rangle = \mathbf{q}^\top R_m^\top R_n \mathbf{k} = \mathbf{q}^\top R_{n-m}\mathbf{k}$$

since $R_m^\top R_n = R_{n-m}$ (rotation matrices satisfy $R_a^\top R_b = R_{b-a}$, because the inverse of a rotation is the rotation in the opposite direction).

Orthogonality is central: The key identity $R_m^\top R_n = R_{n-m}$ uses the fact that $R_m$ is orthogonal ($R_m^\top = R_m^{-1}$), so $(R_m^\top)(R_n) = R_m^{-1}R_n = R_{-m}R_n = R_{n-m}$.
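
A minimal numpy check of this relative-position property in the 2D case (the angle and test vectors are arbitrary choices):

```python
import numpy as np

def rot(angle):
    """2x2 rotation matrix R_angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.3
m, n = 7, 3
q = np.array([0.5, -1.2])
k = np.array([2.0, 0.7])

lhs = (rot(m * theta) @ q) @ (rot(n * theta) @ k)   # <R_m q, R_n k>
rhs = q @ (rot((n - m) * theta) @ k)                # q^T R_{n-m} k
print(np.allclose(lhs, rhs))                        # True: depends only on n - m
```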

10.3 Spectral Normalization

Spectral normalization (Miyato et al. 2018) stabilizes GAN training by constraining weight matrices to have spectral norm (= largest singular value) equal to 1.

Implementation: At each forward pass, divide the weight matrix by its spectral norm: $\hat{W} = W/\sigma_1(W)$.

Connection to orthogonality: A square matrix with $\sigma_1 = \sigma_2 = \cdots = 1$ is exactly an orthogonal matrix. Spectral normalization enforces $\sigma_1 = 1$ while letting the other singular values vary freely - it's a "one-sided" orthogonality constraint.

Why it helps GANs: The discriminator $D$ satisfies the Lipschitz condition $\|D(\mathbf{x}) - D(\mathbf{y})\| \leq \|\mathbf{x} - \mathbf{y}\|$ when all weight layers are spectrally normalized and the activations are 1-Lipschitz. This relates to the Wasserstein GAN objective, which requires a 1-Lipschitz discriminator.
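
In practice $\sigma_1(W)$ is estimated with a few power-iteration steps rather than a full SVD; a minimal numpy sketch of that idea (the iteration count is an arbitrary choice, and the helper name spectral_normalize is ours):

```python
import numpy as np

def spectral_normalize(W, n_iter=50, rng=np.random.default_rng(0)):
    """Estimate sigma_1(W) by power iteration and return W / sigma_1."""
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v                 # Rayleigh-quotient estimate of sigma_1
    return W / sigma

W = np.random.default_rng(1).standard_normal((128, 64))
W_hat = spectral_normalize(W)
print(np.linalg.norm(W_hat, 2))       # ~ 1.0: largest singular value is normalized
```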

10.4 QR in Optimization: Orthogonal Gradients

Several modern optimization techniques use orthogonality:

Shampoo optimizer (Gupta et al. 2018): Maintains Kronecker-factored preconditioners and periodically computes their inverse matrix roots via eigendecomposition (which relies on an orthonormal eigenbasis), whitening the gradient update directions.

Muon optimizer (2024): Orthogonalizes gradient matrices via Newton-Schulz iterations (a matrix-valued version of Newton's method for computing polar decomposition), then applies the orthogonalized gradient as an update. This ensures each weight matrix sees updates with balanced singular values, preventing the accumulation of rank deficiency.
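
A minimal sketch of the orthogonalization idea using the classic cubic Newton-Schulz iteration $X \leftarrow \tfrac{3}{2}X - \tfrac{1}{2}XX^\top X$; Muon uses a tuned higher-order polynomial, so treat this as an illustration of the principle rather than its exact coefficients:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximate the orthogonal polar factor U V^T of G without an SVD.

    The cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X drives every
    singular value of X toward 1 while leaving the singular vectors fixed.
    It requires ||X_0||_2 < sqrt(3); dividing by the Frobenius norm (which
    upper-bounds the spectral norm) guarantees that.
    """
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

G = np.random.default_rng(0).standard_normal((64, 32))   # a "gradient" matrix
O = newton_schulz_orthogonalize(G)
print(np.allclose(O.T @ O, np.eye(32), atol=1e-5))        # orthonormal columns

U, _, Vt = np.linalg.svd(G, full_matrices=False)
print(np.allclose(O, U @ Vt, atol=1e-5))                  # matches the polar factor
```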

Gradient orthogonality in continual learning: Methods like Gradient Episodic Memory (GEM) and Orthogonal Gradient Descent (OGD) project gradients for new tasks onto the orthogonal complement of the gradient space for previous tasks - preserving past performance while learning new tasks.



10b. Bessel's Inequality and Completeness

10b.1 Bessel's Inequality

When we expand a vector $\mathbf{v}$ in terms of an orthonormal set $\{\mathbf{q}_1, \ldots, \mathbf{q}_k\}$ that is not necessarily a complete basis for the entire space, we get Bessel's inequality:

$$\sum_{i=1}^k |\langle\mathbf{v}, \mathbf{q}_i\rangle|^2 \leq \|\mathbf{v}\|^2$$

Proof. Let $\mathbf{v}_k = \sum_{i=1}^k \langle\mathbf{v},\mathbf{q}_i\rangle\mathbf{q}_i$ be the projection of $\mathbf{v}$ onto $\operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_k)$. Then:

$$\|\mathbf{v} - \mathbf{v}_k\|^2 = \|\mathbf{v}\|^2 - \|\mathbf{v}_k\|^2 = \|\mathbf{v}\|^2 - \sum_{i=1}^k |\langle\mathbf{v},\mathbf{q}_i\rangle|^2 \geq 0$$

The inequality follows immediately. $\square$

Equality holds if and only if $\mathbf{v} \in \operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_k)$ - i.e., when $\mathbf{v}$ itself lies in the subspace spanned by the chosen orthonormal vectors.

Parseval's identity is the limiting case, when $\{\mathbf{q}_i\}$ is a complete ONB:

$$\sum_{i=1}^\infty |\langle\mathbf{v}, \mathbf{q}_i\rangle|^2 = \|\mathbf{v}\|^2$$
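
A minimal numpy illustration of the two statements in $\mathbb{R}^5$ (the dimension and the partial-basis size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # columns form a full ONB of R^5
v = rng.standard_normal(5)

coeffs = Q.T @ v                                   # <v, q_i> for every basis vector
print(np.sum(coeffs[:3] ** 2) <= np.dot(v, v))     # Bessel: partial ONB (k = 3)
print(np.isclose(np.sum(coeffs ** 2), np.dot(v, v)))  # Parseval: complete ONB
```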

10b.2 Completeness of Orthonormal Sets

An orthonormal set $\mathcal{B} = \{\mathbf{q}_i\}$ in a Hilbert space $\mathcal{H}$ is complete (or maximal) if it satisfies any of the following equivalent conditions:

  1. $\mathcal{B}$ is an ONB: every $\mathbf{v} \in \mathcal{H}$ can be written $\mathbf{v} = \sum_i \langle\mathbf{v},\mathbf{q}_i\rangle\mathbf{q}_i$
  2. Parseval's identity holds: $\|\mathbf{v}\|^2 = \sum_i |\langle\mathbf{v},\mathbf{q}_i\rangle|^2$ for all $\mathbf{v}$
  3. $\langle\mathbf{v},\mathbf{q}_i\rangle = 0$ for all $i$ implies $\mathbf{v} = \mathbf{0}$ (no non-zero vector is orthogonal to all $\mathbf{q}_i$)

In finite dimensions, an orthonormal set with $\dim \mathcal{H}$ elements is automatically a complete ONB (by dimension counting). In infinite dimensions, completeness is a non-trivial requirement - this is why Hilbert space theory requires careful treatment.

For ML context: In kernel methods, the RKHS is an infinite-dimensional Hilbert space. A kernel function $k(\mathbf{x},\mathbf{y})$ defines an inner product, and the question of whether the kernel features span the whole space (completeness/universality) determines the expressivity of the corresponding kernel machine.


10c. Detailed Worked Examples

10c.1 Projection in Least-Squares Regression

Setup. We observe $m = 4$ data points $(x_i, y_i)$ and want to fit a line $y = \beta_0 + \beta_1 x$:

$$x = (1, 2, 3, 4)^\top, \qquad y = (2.1, 3.9, 5.8, 8.2)^\top$$

Build the design matrix:

$$A = \begin{pmatrix}1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4\end{pmatrix}$$

Gram-Schmidt on the columns of $A$:

Column 1: $\mathbf{a}_1 = (1,1,1,1)^\top$

$$\mathbf{q}_1 = \frac{\mathbf{a}_1}{\|\mathbf{a}_1\|} = \frac{1}{2}(1,1,1,1)^\top$$

Column 2: $\mathbf{a}_2 = (1,2,3,4)^\top$. Project out $\mathbf{q}_1$:

$$\langle\mathbf{a}_2, \mathbf{q}_1\rangle = \frac{1}{2}(1+2+3+4) = 5$$
$$\mathbf{u}_2 = \mathbf{a}_2 - 5\mathbf{q}_1 = (1,2,3,4)^\top - \frac{5}{2}(1,1,1,1)^\top = \left(-\tfrac{3}{2}, -\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{3}{2}\right)^\top$$
$$\|\mathbf{u}_2\| = \sqrt{\tfrac{9}{4}+\tfrac{1}{4}+\tfrac{1}{4}+\tfrac{9}{4}} = \sqrt{5}, \qquad \mathbf{q}_2 = \frac{1}{\sqrt{5}}\left(-\tfrac{3}{2}, -\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{3}{2}\right)^\top$$

The projection (fitted values):

$$\hat{\mathbf{y}} = QQ^\top\mathbf{y} \quad \text{where } Q = [\mathbf{q}_1\,|\,\mathbf{q}_2]$$

This is the projection of $\mathbf{y}$ onto $\operatorname{col}(A)$ = the space of all affine functions of $x$. The residual $\mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to every affine function.

Verification of the normal equations: The least-squares solution $\hat{\boldsymbol{\beta}}$ satisfies $A^\top A\hat{\boldsymbol{\beta}} = A^\top\mathbf{y}$:

$$A^\top A = \begin{pmatrix}4 & 10 \\ 10 & 30\end{pmatrix}, \qquad A^\top\mathbf{y} = \begin{pmatrix}20 \\ 60.1\end{pmatrix}$$

Solving: $\hat{\boldsymbol{\beta}} = (-0.05, 2.02)^\top$, so $\hat{y} = -0.05 + 2.02x$ - a good linear fit.
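
A quick numpy cross-check of this worked example, solving via the thin QR factorization and via np.linalg.lstsq:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 5.8, 8.2])
A = np.column_stack([np.ones_like(x), x])      # design matrix [1 | x]

Q, R = np.linalg.qr(A)                         # thin QR, Q is 4x2
beta_qr = np.linalg.solve(R, Q.T @ y)          # solve R beta = Q^T y
beta_ls = np.linalg.lstsq(A, y, rcond=None)[0]

print(beta_qr)                                  # [-0.05  2.02]
print(np.allclose(beta_qr, beta_ls))            # True
print(Q @ Q.T @ y)                              # fitted values y_hat
```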

10c.2 Full Worked Example: Gram-Schmidt in $\mathbb{R}^3$

Input vectors:

$$\mathbf{a}_1 = \begin{pmatrix}1\\1\\0\end{pmatrix}, \quad \mathbf{a}_2 = \begin{pmatrix}1\\0\\1\end{pmatrix}, \quad \mathbf{a}_3 = \begin{pmatrix}0\\1\\1\end{pmatrix}$$

Step 1: $\mathbf{q}_1 = \mathbf{a}_1/\|\mathbf{a}_1\| = (1/\sqrt{2})(1,1,0)^\top$

Step 2: Remove the component of $\mathbf{a}_2$ along $\mathbf{q}_1$:

$$\langle\mathbf{a}_2, \mathbf{q}_1\rangle = \frac{1}{\sqrt{2}}(1 \cdot 1 + 0 \cdot 1 + 1 \cdot 0) = \frac{1}{\sqrt{2}}$$
$$\mathbf{u}_2 = \mathbf{a}_2 - \frac{1}{\sqrt{2}}\mathbf{q}_1 = \begin{pmatrix}1\\0\\1\end{pmatrix} - \frac{1}{2}\begin{pmatrix}1\\1\\0\end{pmatrix} = \begin{pmatrix}1/2\\-1/2\\1\end{pmatrix}$$
$$\|\mathbf{u}_2\| = \sqrt{1/4 + 1/4 + 1} = \sqrt{3/2}, \qquad \mathbf{q}_2 = \sqrt{\frac{2}{3}}\begin{pmatrix}1/2\\-1/2\\1\end{pmatrix} = \frac{1}{\sqrt{6}}\begin{pmatrix}1\\-1\\2\end{pmatrix}$$

Step 3: Remove the components of $\mathbf{a}_3$ along $\mathbf{q}_1$ and $\mathbf{q}_2$:

$$\langle\mathbf{a}_3, \mathbf{q}_1\rangle = \frac{1}{\sqrt{2}}(0+1+0) = \frac{1}{\sqrt{2}}, \qquad \langle\mathbf{a}_3, \mathbf{q}_2\rangle = \frac{1}{\sqrt{6}}(0-1+2) = \frac{1}{\sqrt{6}}$$
$$\mathbf{u}_3 = \begin{pmatrix}0\\1\\1\end{pmatrix} - \frac{1}{\sqrt{2}}\cdot\frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\\0\end{pmatrix} - \frac{1}{\sqrt{6}}\cdot\frac{1}{\sqrt{6}}\begin{pmatrix}1\\-1\\2\end{pmatrix} = \begin{pmatrix}0\\1\\1\end{pmatrix} - \frac{1}{2}\begin{pmatrix}1\\1\\0\end{pmatrix} - \frac{1}{6}\begin{pmatrix}1\\-1\\2\end{pmatrix} = \begin{pmatrix}-2/3\\2/3\\2/3\end{pmatrix}$$
$$\|\mathbf{u}_3\| = \sqrt{4/9+4/9+4/9} = \frac{2}{\sqrt{3}}, \qquad \mathbf{q}_3 = \frac{1}{\sqrt{3}}\begin{pmatrix}-1\\1\\1\end{pmatrix}$$

Verification:
$\langle\mathbf{q}_1,\mathbf{q}_2\rangle = (1/\sqrt{2})(1/\sqrt{6})(1-1+0) = 0$ ✓
$\langle\mathbf{q}_1,\mathbf{q}_3\rangle = (1/\sqrt{2})(1/\sqrt{3})(-1+1+0) = 0$ ✓
$\langle\mathbf{q}_2,\mathbf{q}_3\rangle = (1/\sqrt{6})(1/\sqrt{3})(-1-1+2) = 0$ ✓

QR factorization: The matrix $R$ is upper triangular with:

$$R = \begin{pmatrix}\|\mathbf{a}_1\| & \langle\mathbf{a}_2,\mathbf{q}_1\rangle & \langle\mathbf{a}_3,\mathbf{q}_1\rangle \\ 0 & \|\mathbf{u}_2\| & \langle\mathbf{a}_3,\mathbf{q}_2\rangle \\ 0 & 0 & \|\mathbf{u}_3\|\end{pmatrix} = \begin{pmatrix}\sqrt{2} & 1/\sqrt{2} & 1/\sqrt{2} \\ 0 & \sqrt{3/2} & 1/\sqrt{6} \\ 0 & 0 & 2/\sqrt{3}\end{pmatrix}$$
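
A quick numpy cross-check that these hand-computed factors satisfy $Q^\top Q = I$ and $QR = A$:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

Q = np.column_stack([
    np.array([1, 1, 0]) / np.sqrt(2),
    np.array([1, -1, 2]) / np.sqrt(6),
    np.array([-1, 1, 1]) / np.sqrt(3),
])
R = np.array([[np.sqrt(2), 1 / np.sqrt(2), 1 / np.sqrt(2)],
              [0.0, np.sqrt(1.5), 1 / np.sqrt(6)],
              [0.0, 0.0, 2 / np.sqrt(3)]])

print(np.allclose(Q.T @ Q, np.eye(3)))   # orthonormal columns
print(np.allclose(Q @ R, A))             # QR reproduces A
```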

10c.3 Spectral Decomposition Example

Let $A = \begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix}$ (symmetric).

Eigenvalues: $\det(A - \lambda I) = (3-\lambda)^2 - 1 = 0 \Rightarrow \lambda = 2$ or $\lambda = 4$.

Eigenvectors:

  • $\lambda_1 = 2$: $(A - 2I)\mathbf{v} = 0 \Rightarrow \begin{pmatrix}1&1\\1&1\end{pmatrix}\mathbf{v} = 0 \Rightarrow \mathbf{v}_1 = (1,-1)^\top/\sqrt{2}$
  • $\lambda_2 = 4$: $(A - 4I)\mathbf{v} = 0 \Rightarrow \begin{pmatrix}-1&1\\1&-1\end{pmatrix}\mathbf{v} = 0 \Rightarrow \mathbf{v}_2 = (1,1)^\top/\sqrt{2}$

Verify orthogonality: $\mathbf{v}_1^\top\mathbf{v}_2 = (1)(1) + (-1)(1) = 0$ ✓

Spectral decomposition:

$$A = \lambda_1\mathbf{v}_1\mathbf{v}_1^\top + \lambda_2\mathbf{v}_2\mathbf{v}_2^\top = 2\begin{pmatrix}1/2 & -1/2 \\ -1/2 & 1/2\end{pmatrix} + 4\begin{pmatrix}1/2 & 1/2 \\ 1/2 & 1/2\end{pmatrix} = \begin{pmatrix}1 & -1 \\ -1 & 1\end{pmatrix} + \begin{pmatrix}2 & 2 \\ 2 & 2\end{pmatrix} = \begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix} = A \checkmark$$

Rayleigh quotient at $\mathbf{x} = (1,0)^\top$:

$$R_A(\mathbf{x}) = \mathbf{x}^\top A\mathbf{x} = (1,0)\begin{pmatrix}3 & 1 \\ 1 & 3\end{pmatrix}\begin{pmatrix}1\\0\end{pmatrix} = 3$$

Indeed $\lambda_{\min} = 2 \leq 3 \leq 4 = \lambda_{\max}$.
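
The same computation with numpy's symmetric eigensolver (np.linalg.eigh returns eigenvalues in ascending order and orthonormal eigenvectors as columns):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

eigvals, V = np.linalg.eigh(A)               # eigh: for symmetric/Hermitian matrices
print(eigvals)                                # [2. 4.]
print(np.allclose(V.T @ V, np.eye(2)))        # eigenvectors are orthonormal
print(np.allclose(V @ np.diag(eigvals) @ V.T, A))   # A = Q Lambda Q^T

x = np.array([1.0, 0.0])
print(x @ A @ x / (x @ x))                    # Rayleigh quotient = 3.0, inside [2, 4]
```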

10c.4 Householder Reflector: Concrete Computation

Problem: Find a Householder reflector $H$ such that $H\mathbf{a} = \pm\|\mathbf{a}\|\mathbf{e}_1$ where $\mathbf{a} = (4, 3)^\top$.

Step 1: $\|\mathbf{a}\| = \sqrt{16+9} = 5$.

Step 2: Since $a_1 = 4 > 0$, use the $+$ sign to avoid cancellation:

$$\mathbf{n}_{\text{unnorm}} = \mathbf{a} + \|\mathbf{a}\|\mathbf{e}_1 = \begin{pmatrix}4\\3\end{pmatrix} + \begin{pmatrix}5\\0\end{pmatrix} = \begin{pmatrix}9\\3\end{pmatrix}$$

Step 3: Normalize: $\|\mathbf{n}_{\text{unnorm}}\| = \sqrt{81+9} = \sqrt{90} = 3\sqrt{10}$

$$\mathbf{n} = \frac{1}{3\sqrt{10}}\begin{pmatrix}9\\3\end{pmatrix} = \frac{1}{\sqrt{10}}\begin{pmatrix}3\\1\end{pmatrix}$$

Step 4: Build the reflector:

$$H = I - 2\mathbf{n}\mathbf{n}^\top = \begin{pmatrix}1 & 0 \\ 0 & 1\end{pmatrix} - \frac{2}{10}\begin{pmatrix}9 & 3 \\ 3 & 1\end{pmatrix} = \begin{pmatrix}1-9/5 & -3/5 \\ -3/5 & 1-1/5\end{pmatrix} = \begin{pmatrix}-4/5 & -3/5 \\ -3/5 & 4/5\end{pmatrix}$$

Verify:
$$H\mathbf{a} = \begin{pmatrix}-4/5 & -3/5 \\ -3/5 & 4/5\end{pmatrix}\begin{pmatrix}4\\3\end{pmatrix} = \begin{pmatrix}-16/5-9/5\\-12/5+12/5\end{pmatrix} = \begin{pmatrix}-5\\0\end{pmatrix} = -5\mathbf{e}_1$$

This gives $-\|\mathbf{a}\|\mathbf{e}_1$ (a valid Householder result; the sign depends on the convention). $\det(H) = -1$ as expected. ✓
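
The same construction in numpy, written for a general input vector (the helper name make_householder is ours; it follows the cancellation-avoiding sign convention above):

```python
import numpy as np

def make_householder(a):
    """Reflector H with H a = -sign(a[0]) * ||a|| * e1 (cancellation-free sign)."""
    n = a + np.sign(a[0]) * np.linalg.norm(a) * np.eye(len(a))[0]
    n /= np.linalg.norm(n)
    return np.eye(len(a)) - 2.0 * np.outer(n, n)

a = np.array([4.0, 3.0])
H = make_householder(a)
print(H)                                   # [[-0.8 -0.6], [-0.6  0.8]]
print(H @ a)                               # [-5.  0.]
print(np.allclose(H.T @ H, np.eye(2)))     # orthogonal
print(np.linalg.det(H))                    # -1.0 (a reflection)
```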


10d. Orthogonality and the Geometry of Neural Networks

10d.1 The Geometry of Weight Matrices

A single linear layer $\mathbf{y} = W\mathbf{x}$ transforms the input geometry. Understanding this transformation requires decomposing $W$ via its singular value decomposition (preview):

$$W = U\Sigma V^\top \qquad \text{(covered fully in 02: SVD)}$$
  • $V^\top$: rotate the input space (first orthogonal transformation)
  • $\Sigma$: scale along the new axes by singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$
  • $U$: rotate the output space (second orthogonal transformation)

For orthogonal $W$: All singular values are 1, so $\Sigma = I$. The layer is a pure rotation (or reflection) - it doesn't compress or expand the input distribution. Distances and angles are perfectly preserved.

Implication for gradient flow. The gradient of the loss with respect to the input is:

$$\frac{\partial\mathcal{L}}{\partial\mathbf{x}} = W^\top\frac{\partial\mathcal{L}}{\partial\mathbf{y}}$$

For orthogonal $W$: $\|W^\top\mathbf{g}\| = \|\mathbf{g}\|$ - gradients are neither amplified nor attenuated. This is the formal reason orthogonal initialization prevents gradient vanishing/explosion in deep linear networks.

10d.2 Attention Heads as Orthogonal Projections

In a Transformer's multi-head attention with $H$ heads:

$$\text{head}_h = \text{Attn}(XW_h^Q, XW_h^K, XW_h^V)$$

where $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{d \times d_k}$ are the query/key/value projection matrices.

Mechanistic interpretability perspective (Elhage et al. 2021): Each attention head computes a projection of the residual stream onto a low-dimensional subspace, adds information from that subspace back, and (in some sense) different heads should attend to different "directions" of information.

Mathematical idealization: If we require $W_h^V (W_{h'}^V)^\top = 0$ for $h \neq h'$ (the value matrices project to orthogonal subspaces), then different heads genuinely operate on independent information. In practice, heads aren't exactly orthogonal, but studying the overlap $\|W_h^V (W_{h'}^V)^\top\|_F$ quantifies how much heads interfere with each other.

QK circuit analysis. The effective attention pattern for head $h$ involves $W_h^Q (W_h^K)^\top \in \mathbb{R}^{d \times d}$ (or rather its action on the residual stream). This matrix's singular values determine how "sharp" vs "diffuse" the attention pattern is.

10d.3 Layer Normalization and Orthogonal Complements

Layer normalization (Ba et al. 2016) operates on a vector $\mathbf{x} \in \mathbb{R}^d$:

$$\text{LayerNorm}(\mathbf{x}) = \frac{\mathbf{x} - \mu\mathbf{1}}{\sigma} \cdot \boldsymbol{\gamma} + \boldsymbol{\beta}$$

where $\mu = \frac{1}{d}\sum_i x_i$ and $\sigma^2 = \frac{1}{d}\sum_i (x_i - \mu)^2$.

Orthogonal decomposition view. The centering step $\mathbf{x} \mapsto \mathbf{x} - \mu\mathbf{1}$ is an orthogonal projection onto the subspace $\mathbf{1}^\perp = \{\mathbf{v} \in \mathbb{R}^d : \sum_i v_i = 0\}$:

$$P_{\mathbf{1}^\perp} = I - \frac{\mathbf{1}\mathbf{1}^\top}{d}$$

This is a rank-$(d-1)$ orthogonal projector. Layer normalization first projects out the "mean direction" $\mathbf{1}/\sqrt{d}$, then rescales the result to a fixed norm on the remaining $(d-1)$-dimensional subspace.

Consequence for representational geometry: After the centering and rescaling (before the learned affine map $\boldsymbol{\gamma}, \boldsymbol{\beta}$), the internal representations $\mathbf{h}$ always satisfy $\mathbf{h} \perp \mathbf{1}$. The model can only represent information in the orthogonal complement of the constant vector - this is why all-zero inputs and constant inputs are indistinguishable after layer norm.
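
A minimal numpy sketch of the centering-as-projection view (the dimension is an arbitrary choice, and the learned $\boldsymbol{\gamma}, \boldsymbol{\beta}$ and the numerical epsilon are omitted):

```python
import numpy as np

d = 16
x = np.random.default_rng(0).standard_normal(d)

P = np.eye(d) - np.ones((d, d)) / d         # projector onto the subspace 1-perp
centered = x - x.mean()

print(np.allclose(P @ x, centered))          # centering == orthogonal projection
print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
print(np.linalg.matrix_rank(P))              # d - 1 = 15

h = centered / x.std()                       # the normalized (pre-affine) vector
print(np.isclose(h.sum(), 0.0))              # h is orthogonal to the all-ones vector
```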

10d.4 Orthogonal Regularization in Weight Matrices

Several training techniques maintain (near-)orthogonality in weight matrices throughout training:

Orthogonal regularization: Add a penalty $\lambda\|W^\top W - I\|_F^2$ to the loss. This penalizes deviation from orthogonality without enforcing it exactly.
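
A minimal numpy sketch of this soft penalty and its gradient (the step size, shapes, and iteration count are arbitrary choices; in a framework such as PyTorch the gradient would come from autograd):

```python
import numpy as np

def orth_penalty(W, lam=1.0):
    """Soft orthogonality penalty lam * ||W^T W - I||_F^2 and its gradient in W."""
    D = W.T @ W - np.eye(W.shape[1])
    loss = lam * np.sum(D ** 2)
    grad = 4.0 * lam * W @ D          # d/dW ||W^T W - I||_F^2 = 4 W (W^T W - I)
    return loss, grad

W = np.random.default_rng(0).standard_normal((32, 16)) / np.sqrt(32)
print(np.linalg.norm(W.T @ W - np.eye(16)))   # deviation from orthogonality at init

for _ in range(200):                           # gradient descent on the penalty alone
    _, grad = orth_penalty(W)
    W -= 0.05 * grad

print(np.linalg.norm(W.T @ W - np.eye(16)))   # near zero: columns almost orthonormal
```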

Spectral regularization: Penalize $\lambda \sum_i (\sigma_i - 1)^2$ - singular values are pushed toward 1. Stronger than spectral normalization, which only constrains the largest singular value.

Riemannian optimization on the Stiefel manifold: The set of $n \times k$ matrices with orthonormal columns, $\operatorname{St}(n,k) = \{Q \in \mathbb{R}^{n \times k} : Q^\top Q = I_k\}$, is a smooth Riemannian manifold. Gradient descent on $\operatorname{St}(n,k)$ uses the Riemannian gradient (the projection of the Euclidean gradient onto the tangent space) and retracts to the manifold via a QR decomposition or the Cayley transform after each step.

Applications: Orthogonal RNNs, rotation-equivariant neural networks, invertible neural networks, and normalizing flows all benefit from exact or approximate orthogonality constraints.


11. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Confusing "orthogonal" and "orthonormal" | Orthogonal means $\langle\mathbf{u},\mathbf{v}\rangle = 0$; orthonormal additionally requires unit length. An orthogonal matrix $Q$ has orthonormal columns, not just orthogonal ones. | Check: orthogonal <-> zero inner products; orthonormal <-> zero inner products AND $\|\mathbf{q}_i\| = 1$ |
| 2 | Assuming $QQ^\top = I$ implies $Q^\top Q = I$ without checking squareness | For non-square $Q \in \mathbb{R}^{m \times n}$ ($m > n$), $Q^\top Q = I_n$ but $QQ^\top \neq I_m$ (it's a rank-$n$ projection). Only square orthogonal matrices satisfy both. | Specify dimensions: "thin $Q$" ($Q^\top Q = I_n$) vs "full $Q$" ($QQ^\top = I_m = Q^\top Q$) |
| 3 | Thinking Gram-Schmidt preserves span ordering | CGS does: $\operatorname{span}(\mathbf{q}_1,\ldots,\mathbf{q}_k) = \operatorname{span}(\mathbf{a}_1,\ldots,\mathbf{a}_k)$. But with pivoting or reordering, this property is lost. | Track column ordering explicitly when pivoted QR is used |
| 4 | Using CGS when numerical stability matters | Classical Gram-Schmidt squares the error: $O(\epsilon_{\text{mach}}\kappa(A)^2)$ vs MGS's $O(\epsilon_{\text{mach}}\kappa(A))$. For ill-conditioned $A$, CGS can produce non-orthogonal "orthonormal" vectors. | Use Modified Gram-Schmidt (MGS) or Householder QR for production code |
| 5 | Projecting onto a non-orthogonal basis | The formula $P = QQ^\top$ assumes $Q^\top Q = I$. If the columns are merely linearly independent (not orthonormal), the correct formula is $P = Q(Q^\top Q)^{-1}Q^\top$, which involves the $(Q^\top Q)^{-1}$ factor. | Always check that $Q^\top Q = I$ before using the simplified projection formula |
| 6 | Thinking $P^2 = P$ is sufficient for orthogonal projection | Idempotence $P^2 = P$ only guarantees $P$ is some projection. Orthogonal projections additionally require $P^\top = P$ (symmetry). Oblique projections satisfy idempotence but not symmetry. | An orthogonal projection must be both idempotent ($P^2 = P$) and symmetric ($P^\top = P$) |
| 7 | Claiming eigenvectors are always orthogonal | Eigenvectors for different eigenvalues are orthogonal only for symmetric matrices. For general matrices, eigenvectors can be arbitrary non-orthogonal vectors (or even non-real). | The spectral theorem applies to symmetric/Hermitian matrices; general matrices need the Schur decomposition |
| 8 | Forgetting the sign convention in Householder | When computing $\mathbf{n} = (\mathbf{a} \pm \|\mathbf{a}\|\mathbf{e}_1)/\|\cdots\|$, choosing the wrong sign causes catastrophic cancellation when $\mathbf{a}$ is nearly parallel to $\mathbf{e}_1$. | Choose the sign of $\pm\|\mathbf{a}\|\mathbf{e}_1$ to match the sign of $a_1$: $\mathbf{n} = (\mathbf{a} + \operatorname{sign}(a_1)\|\mathbf{a}\|\mathbf{e}_1)/\|\cdots\|$ |
| 9 | Misinterpreting Parseval's identity as an approximation | $\|\mathbf{v}\|^2 = \sum_i \hat{v}_i^2$ is an equality when $\{\mathbf{q}_i\}$ is a complete orthonormal basis for the whole space. If it's only a basis for a subspace, you get Bessel's inequality: $\sum_i \hat{v}_i^2 \leq \|\mathbf{v}\|^2$. | Distinguish: complete ONB -> Parseval's equality; partial ONB -> Bessel's inequality |
| 10 | Assuming QR is always better than Cholesky for the normal equations | QR avoids squaring the condition number but costs more flops ($O(mn^2)$ vs $O(mn^2/2 + n^3/6)$ for Cholesky). For well-conditioned problems with large $m$, Cholesky on $A^\top A$ may be faster. | Choose based on the condition number: if $\kappa(A) \gg 1$, use QR; if $\kappa(A) \approx 1$, Cholesky is acceptable and faster |
| 11 | Using $\det(Q) = 1$ as the definition of orthogonal | $\det(Q) = \pm 1$ is a consequence of orthogonality, not the definition. A matrix can have determinant 1 without being orthogonal (e.g., any upper triangular matrix with diagonal product 1). | The definition is $Q^\top Q = I$; determinant $= \pm 1$ follows from it |
| 12 | Thinking orthogonal initialization prevents gradient problems throughout training | Orthogonal initialization helps at initialization. As training proceeds, weights deviate from orthogonality. Maintaining orthogonality throughout training requires explicit regularization (e.g., spectral normalization, an orthogonal regularization loss). | Use orthogonal init as a starting point; for persistent orthogonality through training, use explicit constraints |

12. Exercises

Exercise 1 * - Projection and Decomposition

Given $\mathbf{v} = (3, 1, 2)^\top$ and the subspace $S = \operatorname{span}\{(1, 1, 0)^\top, (0, 1, 1)^\top\}$:

(a) Apply Gram-Schmidt to the spanning vectors to obtain an ONB $\{\mathbf{q}_1, \mathbf{q}_2\}$ for $S$.

(b) Compute the projection $\mathbf{v}_S = \operatorname{proj}_S(\mathbf{v})$ using the ONB.

(c) Verify: $\mathbf{v} - \mathbf{v}_S \perp \mathbf{q}_1$ and $\mathbf{v} - \mathbf{v}_S \perp \mathbf{q}_2$.

(d) Compute $\|\mathbf{v}\|^2$ and verify $\|\mathbf{v}\|^2 = \|\mathbf{v}_S\|^2 + \|\mathbf{v} - \mathbf{v}_S\|^2$ (Pythagorean theorem).


Exercise 2 * - Gram-Schmidt and QR

Let $A = \begin{pmatrix}1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1\end{pmatrix}$.

(a) Apply Gram-Schmidt to the columns of $A$ to produce $Q$ and $R$ such that $A = QR$.

(b) Verify that $Q^\top Q = I$ and $QR = A$ numerically.

(c) Solve the least-squares problem $\min_\mathbf{x}\|A\mathbf{x} - \mathbf{b}\|$ for $\mathbf{b} = (1,2,3)^\top$ by solving $R\mathbf{x} = Q^\top\mathbf{b}$ via back substitution.

(d) Verify the solution against np.linalg.lstsq.


Exercise 3 * - Orthogonal Matrices

(a) Verify that the $2\times 2$ rotation matrix $R_\theta = \begin{pmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{pmatrix}$ is orthogonal for $\theta = \pi/4$.

(b) Show by matrix multiplication that the composition of two rotations satisfies $R_\theta R_\phi = R_{\theta+\phi}$.

(c) Construct a Householder reflector $H$ such that $H\mathbf{a} = \|\mathbf{a}\|\mathbf{e}_1$ for $\mathbf{a} = (3, 4)^\top$. Verify $H^\top H = I$ and $H\mathbf{a} = (5, 0)^\top$.

(d) What is $\det(H)$? Explain geometrically.


Exercise 4 ** - Modified Gram-Schmidt

(a) Implement both Classical and Modified Gram-Schmidt.

(b) Test on the Hilbert matrix $H_{ij} = 1/(i+j-1)$ for $n = 10$. Compute the maximum off-diagonal entry of $Q^\top Q - I$ for each method.

(c) Plot the orthogonality error $\|Q^\top Q - I\|_F$ vs matrix size $n$ for both methods. What do you observe?

(d) Explain why MGS is more stable using the error analysis from 5.2.


Exercise 5 ** - Least Squares Stability

Given the Vandermonde system fitting a degree-$d$ polynomial through $m$ points:

(a) Build the design matrix $A \in \mathbb{R}^{m \times (d+1)}$ with $A_{ij} = x_i^{j-1}$ for $m = 20$, $d = 10$.

(b) Solve the least-squares problem via: (i) the normal equations $(A^\top A)\hat{\mathbf{c}} = A^\top\mathbf{y}$, (ii) QR factorization $A = QR$ then $R\hat{\mathbf{c}} = Q^\top\mathbf{y}$.

(c) Compute the condition numbers $\kappa(A)$ and $\kappa(A^\top A)$. What is their ratio?

(d) Perturb $\mathbf{y}$ by small noise and compare the residuals from both methods. Which is more stable?


Exercise 6 ** - Spectral Theorem

Let $A = \begin{pmatrix}4 & 2 \\ 2 & 1\end{pmatrix}$.

(a) Find all eigenvalues and eigenvectors of $A$ analytically.

(b) Verify that the eigenvectors are orthogonal. Normalize them to form an orthogonal matrix $Q$.

(c) Verify the spectral decomposition: $A = Q\Lambda Q^\top$ where $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2)$.

(d) Compute the Rayleigh quotient $R_A(\mathbf{x})$ for $\mathbf{x} = (1, 0)^\top$, $(0, 1)^\top$, $(1, 1)^\top/\sqrt{2}$. Verify each is in $[\lambda_{\min}, \lambda_{\max}]$.

(e) Is $A$ positive definite? Verify using the eigenvalues and by direct computation of $\mathbf{x}^\top A\mathbf{x}$.


Exercise 7 *** - Orthogonal Weight Initialization

(a) Generate 100 random matrices $W_1, \ldots, W_L \in \mathbb{R}^{64 \times 64}$ with i.i.d. Gaussian entries (standard init). Compute the spectral norm of their product $\|W_L \cdots W_1\|_2$ for $L = 1, 5, 10, 20$.

(b) Repeat with each $W_k$ initialized as a random orthogonal matrix (via QR of a Gaussian matrix). Plot $\|W_L \cdots W_1\|_2$ vs $L$ for both initializations.

(c) Generate synthetic "gradient" vectors $\mathbf{g} \in \mathbb{R}^{64}$ and propagate them backward through the chains. Compare gradient norms: $\|(W_L \cdots W_1)^\top\mathbf{g}\|$ vs $\|\mathbf{g}\|$.

(d) Explain the result using the isometry property of orthogonal matrices.


Exercise 8 *** - Rayleigh Quotient Iteration

Rayleigh Quotient Iteration is a method for finding eigenvectors with cubic convergence:

$$\mathbf{x}^{(k+1)} = \frac{(A - \rho_k I)^{-1}\mathbf{x}^{(k)}}{\|(A - \rho_k I)^{-1}\mathbf{x}^{(k)}\|}, \qquad \rho_k = R_A(\mathbf{x}^{(k)})$$

(a) Implement Rayleigh Quotient Iteration for a symmetric matrix.

(b) Test on $A = \begin{pmatrix}3 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 1\end{pmatrix}$ with a random initial $\mathbf{x}^{(0)}$. Track $|\rho_k - \lambda_\star|$ over the iterations.

(c) Plot the convergence on a log scale and estimate the convergence rate (should be cubic: $|\rho_{k+1} - \lambda_\star| \approx C|\rho_k - \lambda_\star|^3$).

(d) Compare convergence speed vs power iteration and inverse iteration with a fixed shift.

(e) Explain why Rayleigh quotient (not a fixed shift) is needed for cubic convergence.


13. Why This Matters for AI (2026 Perspective)

| Concept | AI Application | Impact |
|---------|----------------|--------|
| Orthogonal initialization | torch.nn.init.orthogonal_() | Prevents gradient explosion/vanishing at initialization; standard practice for deep linear networks (Saxe et al.) and RNNs |
| QR decomposition | Muon optimizer; orthogonal gradient updates | Orthogonalizing gradient matrices via QR/Newton-Schulz stabilizes LLM training; avoids rank collapse of weight matrices |
| RoPE embeddings | LLaMA, Mistral, GPT-NeoX, Falcon, Gemma | Rotary position encoding uses orthogonal rotation matrices to encode relative position in self-attention; rotation composition gives the relative-position property |
| Spectral normalization | GANs, discriminator regularization | Constrains $\sigma_1(W) = 1$ for a 1-Lipschitz discriminator; stabilizes Wasserstein GAN training |
| Gram-Schmidt / QR | Numerically stable least squares | Used in regression, feature selection, basis pursuit; avoiding the normal equations prevents condition-number squaring |
| Orthogonal complement | Continual learning (OGD, GEM) | New-task gradients are projected onto the orthogonal complement of old-task gradient subspaces; prevents catastrophic forgetting |
| Spectral theorem | Hessian analysis, SAM optimizer | Eigendecomposition of the loss Hessian reveals sharp/flat directions; Sharpness-Aware Minimization (SAM) seeks minima with small $\lambda_{\max}(\nabla^2\mathcal{L})$ |
| Orthonormal bases | Attention head analysis | Queries, keys, and values in each attention head span subspaces; orthogonality between heads corresponds to specialization; mechanistic interpretability studies this |
| DFT (unitary matrix) | Audio models, efficient convolutions | Wav2Vec and Whisper process spectrograms (DFT magnitudes); convolution in the frequency domain uses the FFT, i.e. multiplication by the unitary $F_n$ |
| Householder reflectors | QR in neural network training | Used in computing QR factorizations during optimizer steps (Shampoo); the Householder product form compresses orthogonal matrices for efficient multiplication |
| Rayleigh quotient | Eigenvalue estimation in training | Stochastic Lanczos quadrature (Ghorbani et al. 2019) approximates the Hessian spectrum via Rayleigh quotients on random probes |
| Gram matrix | Gram-matrix style losses, NTK | The Neural Tangent Kernel $\Theta = J J^\top$ (where $J$ is the Jacobian) is a Gram matrix; its eigenstructure determines training dynamics |

14. Advanced Topics

14.1 Polar Decomposition

Every matrix $A \in \mathbb{R}^{m \times n}$ ($m \geq n$) has a polar decomposition:

$$A = QS$$

where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns ($Q^\top Q = I_n$) and $S \in \mathbb{R}^{n \times n}$ is symmetric positive semidefinite.

Uniqueness: If $A$ has full column rank, $Q$ and $S$ are unique, with $S = (A^\top A)^{1/2}$ (the matrix square root) and $Q = A(A^\top A)^{-1/2}$.

Geometric meaning: $S$ captures the "stretching" and $Q$ captures the "rotation" - this is the matrix analog of writing a complex number as $re^{i\theta}$.

Relation to SVD: If $A = U\Sigma V^\top$ (thin SVD), then:

$$A = (UV^\top)(V\Sigma V^\top) = Q \cdot S$$

So $Q = UV^\top$ (the "nearest orthogonal matrix" to $A$) and $S = V\Sigma V^\top$.

For AI: The Muon optimizer applies $Q = UV^\top$ of the gradient matrix $G$ as the weight update (approximated via Newton-Schulz iterations rather than an explicit SVD) - this is exactly the orthogonal factor of the polar decomposition of $G$.
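
A minimal numpy sketch computing both polar factors from the thin SVD and checking the "nearest orthogonal matrix" property against an arbitrary comparison matrix:

```python
import numpy as np

def polar(A):
    """Polar decomposition A = Q S via the thin SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Q = U @ Vt                       # orthogonal factor (orthonormal columns)
    S = Vt.T @ np.diag(s) @ Vt       # symmetric positive semidefinite factor
    return Q, S

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Q, S = polar(A)

print(np.allclose(Q @ S, A))                       # A = Q S
print(np.allclose(Q.T @ Q, np.eye(4)))             # Q^T Q = I
print(np.allclose(S, S.T), np.all(np.linalg.eigvalsh(S) >= -1e-12))  # S is SPSD

# Q is the nearest matrix with orthonormal columns in Frobenius norm
other, _ = np.linalg.qr(rng.standard_normal((6, 4)))
print(np.linalg.norm(A - Q), "<", np.linalg.norm(A - other))
```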

14.2 Orthogonality in Hilbert Spaces

All finite-dimensional results extend to Hilbert spaces - complete inner product spaces, possibly infinite-dimensional.

Orthogonal Projection Theorem (Hilbert spaces). If $\mathcal{H}$ is a Hilbert space and $\mathcal{M} \subseteq \mathcal{H}$ is a closed subspace, then every $\mathbf{v} \in \mathcal{H}$ has a unique decomposition $\mathbf{v} = \mathbf{v}_\mathcal{M} + \mathbf{v}_{\mathcal{M}^\perp}$ with $\mathbf{v}_\mathcal{M} \in \mathcal{M}$ and $\mathbf{v}_{\mathcal{M}^\perp} \in \mathcal{M}^\perp$.

Completeness is essential: Without it, the projection may not exist. (The "closed" hypothesis ensures that the minimizing sequence defining the projection converges to a point inside $\mathcal{M}$.)

Examples of Hilbert spaces in ML:

  • $L^2(\mathbb{R})$: the space of square-integrable functions (kernel methods, RKHS)
  • $\ell^2$: square-summable sequences (infinite-dimensional limits of neural networks)
  • Reproducing Kernel Hilbert Spaces (RKHS): the function space for kernel machines, Gaussian processes

14.3 Gram Matrices and the Kernel Trick

Given vectors $\mathbf{x}_1, \ldots, \mathbf{x}_m \in \mathbb{R}^d$, the Gram matrix is:

$$G_{ij} = \langle\mathbf{x}_i, \mathbf{x}_j\rangle = \mathbf{x}_i^\top\mathbf{x}_j$$

In matrix form: $G = XX^\top$ where $X = [\mathbf{x}_1|\cdots|\mathbf{x}_m]^\top \in \mathbb{R}^{m \times d}$.

Properties: $G$ is symmetric positive semidefinite. Its rank equals $\operatorname{rank}(X)$.

Kernel trick: Replace $\langle\mathbf{x}_i,\mathbf{x}_j\rangle$ with $k(\mathbf{x}_i,\mathbf{x}_j)$ where $k$ is a positive definite kernel. The resulting kernel matrix $K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)$ is a Gram matrix in a (possibly infinite-dimensional) RKHS.

For attention: The attention matrix $\text{Attn}_{ij} \propto \exp(\mathbf{q}_i^\top\mathbf{k}_j / \sqrt{d_k})$ is a kernelized Gram matrix between queries and keys. Its spectral properties determine how information flows between tokens.
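
A minimal numpy sketch: build an RBF kernel matrix on random points and confirm that it behaves like a Gram matrix, i.e. it is symmetric and positive semidefinite (the bandwidth and sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))                  # 50 points in R^3

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)                       # RBF kernel matrix, bandwidth 1

print(np.allclose(K, K.T))                        # symmetric
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)    # positive semidefinite

G = X @ X.T                                       # the plain (linear-kernel) Gram matrix
print(np.linalg.matrix_rank(G), np.linalg.matrix_rank(X))   # both equal 3
```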


Conceptual Bridge

Looking Backward: What Made This Possible

The theory developed in this section builds directly on foundations from earlier chapters:

  • Vectors and inner products (01: Vectors and Spaces): The inner product $\langle\mathbf{u},\mathbf{v}\rangle$ is the primitive notion from which orthogonality, angles, and projections all derive.
  • Linear transformations (04: Linear Transformations): Orthogonal matrices are the isometric linear maps - those that preserve both angles and distances. The kernel and image of projection operators are orthogonal complements.
  • Matrix rank (02-LA: Matrix Rank): The rank-nullity theorem and the four fundamental subspaces are unified by orthogonality: $\operatorname{null}(A) = \operatorname{row}(A)^\perp$.
  • Eigenvalues (01: Eigenvalues): The spectral theorem tells us that symmetric matrices are diagonalizable with an orthonormal eigenbasis - the connection between orthogonality and spectral theory.

Looking Forward: What This Enables

Orthogonality and orthonormality are not endpoints but foundations:

  • Singular Value Decomposition (02: SVD): The SVD $A = U\Sigma V^\top$ is built from two orthogonal matrices. The right singular vectors $V$ form an ONB for the row space; the left singular vectors $U$ form an ONB for the column space. Without the theory developed here, the SVD cannot be understood.
  • Matrix Decompositions (08: Matrix Decompositions): LU, QR, Cholesky, Schur - QR is the central algorithm among these, and the QR algorithm for eigenvalues iterates orthogonal similarity transformations.
  • Matrix Norms (06: Matrix Norms): The spectral norm $\|A\|_2 = \sigma_1(A)$ is defined via orthogonal invariance. The condition number $\kappa(Q) = 1$ for orthogonal $Q$.
  • Optimization (Chapter 08): Gradient methods, second-order methods, and their variants all navigate spaces equipped with inner products. The geometry of the loss landscape is described using the spectral theory developed here.
CURRICULUM POSITION: ORTHOGONALITY AND ORTHONORMALITY
===========================================================================

  FOUNDATIONS                    THIS SECTION                  FORWARD
  -----------                    ------------                  -------
  +-----------------+            +-------------------------+   +------------------+
  | Inner Products  |----------> | Orthogonality (05)     |-->| SVD (02)        |
  | Vectors/Spaces  |            | +- Gram-Schmidt          |   | low-rank approx  |
  +-----------------+            | +- QR Decomposition      |   +------------------+
                                 | +- Orthogonal Matrices   |
  +-----------------+            | +- Projections           |   +------------------+
  | Eigenvalues     |----------> | +- Spectral Theorem      |-->| Matrix Norms     |
  | (01)           |            | +- ML Applications       |   | (06)            |
  +-----------------+            +-------------------------+   +------------------+
                                          |
  +-----------------+                     |                     +------------------+
  | Linear Transf.  |---------->----------+                  -->| Decompositions   |
  | (04)           |                                            | (08) QR alg.    |
  +-----------------+                                            +------------------+

  AI CONNECTIONS:
  +------------------------------------------------------------------+
  |  RoPE embeddings -> Gram-Schmidt  -> Muon optimizer  -> OGD/GEM    |
  |  Orthogonal init -> Spectral norm -> Rayleigh quotient -> SAM      |
  |  QR least squares -> DFT (unitary) -> Kernel Gram matrix -> NTK    |
  +------------------------------------------------------------------+

===========================================================================
