
Hilbert Spaces



Inner products turn size into geometry. Completeness makes that geometry stable under limits.

Overview

Hilbert spaces are complete inner product spaces. They keep the geometry of Euclidean vectors while allowing infinite-dimensional objects such as functions, signals, square-summable sequences, random variables, and feature maps. Normed spaces let us measure size. Hilbert spaces add angle, orthogonality, projection, Fourier coordinates, and self-duality.

This section is the bridge between the normed-space foundations of the previous section and the kernel methods of the next section. The central message is practical: whenever a learning algorithm uses dot products, cosine similarity, least squares, PCA, attention scores, Fourier features, Gaussian processes, or kernelized optimization, it is leaning on Hilbert-space structure.

We focus on the core Hilbert toolkit:

  • inner products and induced norms
  • Cauchy-Schwarz, Pythagoras, and orthogonality
  • projection theorem and least squares
  • orthonormal systems, Bessel inequality, Parseval identity, and Fourier-Bessel expansion
  • Riesz representation and gradients as vectors
  • adjoints, self-adjoint operators, positive operators, compact operators, and spectral decomposition
  • careful bridges to RKHS, kernels, Fourier analysis, PCA, and neural tangent kernels

Kernel methods, positive definite kernels, support vector machines, Gaussian processes, Mercer expansions, and scalable kernel approximations are only previewed here. They are developed in detail in Kernel Methods.

Prerequisites

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Interactive Hilbert geometry, projections, Gram-Schmidt, Parseval checks, Riesz gradients, adjoints, PCA, and kernel previews |
| exercises.ipynb | 8 graded exercises covering inner products, projections, bases, Riesz representation, operators, PCA, and RKHS previews |

Learning Objectives

After completing this section, you will be able to:

  1. Define real and complex inner products and the norm they induce
  2. Distinguish pre-Hilbert spaces from Hilbert spaces
  3. Prove and apply Cauchy-Schwarz, Pythagoras, and the parallelogram law
  4. Use orthogonal complements to decompose Hilbert spaces
  5. Apply the projection theorem to closed subspaces and least-squares problems
  6. Build orthonormal systems with Gram-Schmidt
  7. Use Bessel inequality, Parseval identity, and Fourier-Bessel coordinates
  8. State and apply the Riesz representation theorem
  9. Interpret gradients as Riesz representatives of differentials
  10. Work with adjoints, self-adjoint operators, positive operators, and compact operators
  11. Explain how Hilbert geometry supports attention, PCA, ridge regression, Gaussian processes, RKHS theory, and infinite-width neural-network limits

1. Intuition

1.1 From Norms to Angles

A normed space tells us how large a vector is. A Hilbert space tells us not only how large vectors are, but also how they face each other.

The extra structure is the inner product:

$$\langle \mathbf{x}, \mathbf{y} \rangle.$$

In $\mathbb{R}^n$, the standard inner product is

$$\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^n x_i y_i.$$

This single operation gives length,

$$\lVert \mathbf{x} \rVert = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle},$$

angle,

$$\cos \theta = \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\lVert \mathbf{x} \rVert\,\lVert \mathbf{y} \rVert},$$

and orthogonality,

$$\mathbf{x} \perp \mathbf{y} \quad \Longleftrightarrow \quad \langle \mathbf{x}, \mathbf{y} \rangle = 0.$$

The philosophical shift is small but powerful:

normed space:
  vector + size + convergence

Hilbert space:
  vector + size + convergence + angle + projection + coordinates

In machine learning, this is why dot products can act as similarity scores, why least squares has a clean geometric solution, why PCA is an orthogonal coordinate system, and why Fourier analysis can decompose functions into energy-preserving frequency components.

1.2 Why Completeness Matters

A Hilbert space is not just an inner product space. It is a complete inner product space. Completeness means every Cauchy sequence converges to a point inside the same space.

This matters because optimization, approximation, and learning algorithms often produce limits. If the space is incomplete, an algorithm can converge toward an object that does not live in the space being studied.

Let $\mathcal{P}[0,1]$ be the polynomials on $[0,1]$ with inner product

$$\langle f,g \rangle = \int_0^1 f(t)\,g(t)\,dt.$$

This is an inner product space, but it is not complete. A sequence of polynomials can be Cauchy in the $L^2$ norm and converge to a square-integrable function that is not a polynomial. The completion is $L^2[0,1]$.

For AI, incompleteness shows up whenever we approximate functions by finite models but reason about limiting function classes. A Hilbert space is the stable mathematical container for those limits.

1.3 Why Hilbert Spaces Matter for AI

Hilbert-space ideas are everywhere in modern ML:

| ML concept | Hilbert-space idea |
| --- | --- |
| attention score $\mathbf{q}^\top \mathbf{k}$ | inner product as alignment |
| cosine similarity | normalized Hilbert angle |
| least squares | orthogonal projection |
| ridge regression | projected or regularized Hilbert problem |
| PCA | spectral theorem for self-adjoint covariance operators |
| Fourier features | coordinates in an orthonormal system |
| Gaussian processes | covariance kernels and function-space geometry |
| kernel methods | implicit inner products in feature Hilbert spaces |
| gradient descent in function space | Riesz representation of differentials |
| neural tangent kernel | kernel gradient flow in a Hilbert-like feature geometry |

The key operational pattern is:

$$\text{learning problem} \quad \leadsto \quad \text{approximation in a Hilbert space} \quad \leadsto \quad \text{projection, coordinates, or spectral analysis}.$$

1.4 Historical Timeline

  • 1900s: Hilbert and Schmidt formalize infinite systems of equations and spectral methods.
  • 1920s-1930s: Hilbert spaces become the language of quantum mechanics.
  • 1940s-1950s: Functional analysis develops projection, duality, and operator theory.
  • 1950s-1970s: RKHS theory connects kernels with Hilbert spaces of functions.
  • 1990s-2000s: Support vector machines and Gaussian processes bring kernels into mainstream ML.
  • 2010s-2020s: Neural tangent kernels and infinite-width limits reconnect deep learning with Hilbert-space and kernel viewpoints.

2. Formal Definitions

2.1 Real and Complex Inner Products

Let $\mathcal{H}$ be a vector space over $\mathbb{R}$ or $\mathbb{C}$. An inner product is a map

$$\langle \cdot,\cdot \rangle : \mathcal{H} \times \mathcal{H} \to \mathbb{F}, \qquad \mathbb{F} \in \{\mathbb{R}, \mathbb{C}\},$$

such that for all $\mathbf{x},\mathbf{y},\mathbf{z} \in \mathcal{H}$ and scalars $a,b \in \mathbb{F}$:

  1. Linearity in the first argument:
$$\langle a\mathbf{x} + b\mathbf{y}, \mathbf{z} \rangle = a\langle \mathbf{x},\mathbf{z} \rangle + b\langle \mathbf{y},\mathbf{z} \rangle.$$
  2. Conjugate symmetry:
$$\langle \mathbf{x},\mathbf{y} \rangle = \overline{\langle \mathbf{y},\mathbf{x} \rangle}.$$
  3. Positive definiteness:
$$\langle \mathbf{x},\mathbf{x} \rangle \geq 0, \qquad \langle \mathbf{x},\mathbf{x} \rangle = 0 \Longleftrightarrow \mathbf{x}=\mathbf{0}.$$

Some books choose linearity in the second argument for complex spaces. This repository uses linearity in the first argument in this section. The formulas involving adjoints and Riesz representatives should be read consistently with that convention.

2.2 Induced Norm and Metric

Every inner product induces a norm:

$$\lVert \mathbf{x} \rVert = \sqrt{\langle \mathbf{x},\mathbf{x} \rangle}.$$

The induced metric is

$$d(\mathbf{x},\mathbf{y}) = \lVert \mathbf{x}-\mathbf{y} \rVert.$$

So every inner product space is a normed space and every inner product space is a metric space. The reverse is false. Many norms do not come from any inner product.

2.3 Hilbert and Pre-Hilbert Spaces

An inner product space is also called a pre-Hilbert space when we want to emphasize that it may not be complete.

A Hilbert space is a complete inner product space. Equivalently, $\mathcal{H}$ is Hilbert if every sequence $(\mathbf{x}_n)$ satisfying

$$\forall \epsilon > 0,\ \exists N,\ \forall m,n \geq N: \lVert \mathbf{x}_n-\mathbf{x}_m \rVert < \epsilon$$

has a limit $\mathbf{x} \in \mathcal{H}$ with

$$\lVert \mathbf{x}_n-\mathbf{x} \rVert \to 0.$$

Completeness is a statement about the norm induced by the inner product.

2.4 Examples and Non-Examples

Finite-dimensional Euclidean spaces. $\mathbb{R}^n$ and $\mathbb{C}^n$ are Hilbert spaces with the usual inner product. Every finite-dimensional inner product space is complete.

Weighted Euclidean spaces. If $W \in \mathbb{S}^n_{++}$, then

$$\langle \mathbf{x},\mathbf{y} \rangle_W = \mathbf{x}^\top W\mathbf{y}$$

is an inner product on $\mathbb{R}^n$. The induced norm is a Mahalanobis-style norm:

$$\lVert \mathbf{x} \rVert_W = \sqrt{\mathbf{x}^\top W\mathbf{x}}.$$

Square-summable sequences. The space

$$\ell^2 = \left\{ \mathbf{x}=(x_1,x_2,\ldots): \sum_{i=1}^\infty \lvert x_i \rvert^2 < \infty \right\}$$

is Hilbert under

$$\langle \mathbf{x},\mathbf{y} \rangle = \sum_{i=1}^\infty x_i\overline{y_i}.$$

Square-integrable functions. The space $L^2(\Omega)$ is Hilbert under

$$\langle f,g \rangle = \int_{\Omega} f(t)\overline{g(t)}\,dt.$$

Technically, elements of $L^2$ are equivalence classes of functions equal almost everywhere. That detail prevents zero-norm nonzero representatives from breaking positive definiteness.

Polynomials are not complete. The polynomial space $\mathcal{P}[0,1]$ with the $L^2$ inner product is a pre-Hilbert space, not a Hilbert space.

$C[0,1]$ with the $L^2$ inner product is not complete. Continuous functions are dense in $L^2[0,1]$, but an $L^2$ limit of continuous functions need not be continuous.

$\ell^1$ is not a Hilbert space with its usual norm. The norm $\lVert \mathbf{x} \rVert_1$ does not satisfy the parallelogram law, so it cannot be induced by any inner product.

2.5 Parallelogram Law and Polarization Identity

Every inner-product norm satisfies the parallelogram law:

$$\lVert \mathbf{x}+\mathbf{y} \rVert^2 + \lVert \mathbf{x}-\mathbf{y} \rVert^2 = 2\lVert \mathbf{x} \rVert^2 + 2\lVert \mathbf{y} \rVert^2.$$

This law says the two diagonals of a parallelogram carry exactly the same total squared length as twice the sum of the squared side lengths.

For real inner product spaces, the inner product can be recovered from the norm:

$$\langle \mathbf{x},\mathbf{y} \rangle = \frac{1}{4}\left( \lVert \mathbf{x}+\mathbf{y} \rVert^2 - \lVert \mathbf{x}-\mathbf{y} \rVert^2 \right).$$

For complex spaces, the polarization identity is

$$\langle \mathbf{x},\mathbf{y} \rangle = \frac{1}{4} \sum_{k=0}^{3} i^k \lVert \mathbf{x}+i^k\mathbf{y} \rVert^2.$$

Thus a Hilbert norm is not just any norm. It is exactly a norm whose geometry secretly contains an inner product.
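
As a quick numerical illustration (a minimal sketch in NumPy; the random vectors and tolerances are arbitrary choices), the parallelogram law holds for the Euclidean norm but fails for the $\ell^1$ norm, which is one way to see that $\lVert\cdot\rVert_1$ is not induced by any inner product:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)

def parallelogram_gap(x, y, norm):
    # left side minus right side of the parallelogram law
    return norm(x + y) ** 2 + norm(x - y) ** 2 - 2 * norm(x) ** 2 - 2 * norm(y) ** 2

print(parallelogram_gap(x, y, lambda v: np.linalg.norm(v, 2)))  # ~0 for the l2 norm
print(parallelogram_gap(x, y, lambda v: np.linalg.norm(v, 1)))  # generically nonzero for l1

# Real polarization identity recovers the inner product from the l2 norm.
polarized = 0.25 * (np.linalg.norm(x + y) ** 2 - np.linalg.norm(x - y) ** 2)
print(np.isclose(polarized, x @ y))  # True
```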


3. Core Theory I: Inner Product Geometry

3.1 Cauchy-Schwarz Inequality

For any $\mathbf{x},\mathbf{y}$ in an inner product space,

$$\lvert \langle \mathbf{x},\mathbf{y} \rangle \rvert \leq \lVert \mathbf{x} \rVert\,\lVert \mathbf{y} \rVert.$$

Equality holds if and only if $\mathbf{x}$ and $\mathbf{y}$ are linearly dependent.

Proof sketch. If $\mathbf{y}=\mathbf{0}$, the claim is immediate. Otherwise, consider

$$\phi(a) = \lVert \mathbf{x}-a\mathbf{y} \rVert^2 \geq 0.$$

Choose

$$a = \frac{\langle \mathbf{x},\mathbf{y} \rangle}{\lVert \mathbf{y} \rVert^2}.$$

Expanding $\phi(a)$ gives

$$0 \leq \lVert \mathbf{x} \rVert^2 - \frac{\lvert \langle \mathbf{x},\mathbf{y} \rangle \rvert^2}{\lVert \mathbf{y} \rVert^2},$$

which rearranges to Cauchy-Schwarz.

In ML, Cauchy-Schwarz bounds the maximum possible alignment between embeddings, gradients, features, and residuals. It justifies cosine similarity and many norm-based generalization bounds.

3.2 Angles, Cosine Similarity, and Orthogonality

For nonzero $\mathbf{x}$ and $\mathbf{y}$ in a real Hilbert space, define the angle $\theta$ by

$$\cos \theta = \frac{\langle \mathbf{x},\mathbf{y} \rangle}{\lVert \mathbf{x} \rVert\,\lVert \mathbf{y} \rVert}.$$

Cauchy-Schwarz guarantees the right side lies in $[-1,1]$.

Cosine similarity is the same expression. It removes magnitude and keeps directional alignment:

$$\operatorname{cosim}(\mathbf{x},\mathbf{y}) = \left\langle \frac{\mathbf{x}}{\lVert \mathbf{x} \rVert}, \frac{\mathbf{y}}{\lVert \mathbf{y} \rVert} \right\rangle.$$

Two vectors are orthogonal when

$$\langle \mathbf{x},\mathbf{y} \rangle = 0.$$

Attention scores, embedding retrieval, contrastive learning, and spectral algorithms all use this notion of alignment.
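
A small numerical check of Cauchy-Schwarz and cosine similarity (a sketch with random vectors; the seed and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(8), rng.standard_normal(8)

lhs = abs(x @ y)
rhs = np.linalg.norm(x) * np.linalg.norm(y)
print(lhs <= rhs + 1e-12)          # Cauchy-Schwarz holds

cos_sim = (x @ y) / rhs            # cosine similarity = normalized inner product
print(-1.0 <= cos_sim <= 1.0)      # guaranteed by Cauchy-Schwarz

# Equality case: linearly dependent vectors saturate the bound.
print(np.isclose(abs(x @ (3.0 * x)), np.linalg.norm(x) * np.linalg.norm(3.0 * x)))
```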

3.3 Pythagorean Theorem

If $\mathbf{x}\perp\mathbf{y}$, then

$$\lVert \mathbf{x}+\mathbf{y} \rVert^2 = \lVert \mathbf{x} \rVert^2 + \lVert \mathbf{y} \rVert^2.$$

Proof:

$$\begin{aligned} \lVert \mathbf{x}+\mathbf{y} \rVert^2 &= \langle \mathbf{x}+\mathbf{y},\mathbf{x}+\mathbf{y} \rangle \\ &= \lVert \mathbf{x} \rVert^2 + \langle \mathbf{x},\mathbf{y} \rangle + \langle \mathbf{y},\mathbf{x} \rangle + \lVert \mathbf{y} \rVert^2 \\ &= \lVert \mathbf{x} \rVert^2 + \lVert \mathbf{y} \rVert^2. \end{aligned}$$

This identity is the energy bookkeeping behind least squares, Fourier analysis, PCA, and variance decomposition.

3.4 Orthogonal Complements and Closed Subspaces

For a subset $\mathcal{M}\subseteq\mathcal{H}$, its orthogonal complement is

$$\mathcal{M}^{\perp} = \{\mathbf{x}\in\mathcal{H}: \langle \mathbf{x},\mathbf{m} \rangle = 0 \text{ for all } \mathbf{m}\in\mathcal{M}\}.$$

$\mathcal{M}^{\perp}$ is always a closed subspace, even if $\mathcal{M}$ is not closed.

If $\mathcal{M}$ is a closed subspace of a Hilbert space, then

$$\mathcal{H} = \mathcal{M} \oplus \mathcal{M}^{\perp}.$$

Every vector $\mathbf{x}\in\mathcal{H}$ has a unique decomposition

$$\mathbf{x} = \mathbf{m} + \mathbf{r}, \qquad \mathbf{m}\in\mathcal{M}, \quad \mathbf{r}\in\mathcal{M}^{\perp}.$$

In least squares, $\mathbf{m}$ is the fitted value and $\mathbf{r}$ is the residual.

3.5 Best Approximation Intuition

Hilbert spaces make approximation geometric. Suppose $\mathcal{M}$ is a model class that is a closed subspace and $\mathbf{x}$ is a target. The best approximation problem is

$$\min_{\mathbf{m}\in\mathcal{M}} \lVert \mathbf{x}-\mathbf{m} \rVert.$$

The solution is characterized not by a complicated search condition but by orthogonality:

$$\mathbf{x}-\mathbf{m}^{\star} \perp \mathcal{M}.$$

That is, the error left after the best approximation has no component in the model subspace.

target x
   *
   |\
   | \
   |  \ residual x - P_M x
   |   \
   |    *
   |   P_M x
---+---------------- model subspace M

4. Core Theory II: Projection Theorem

4.1 Projection onto Closed Convex Sets

Let $\mathcal{C}$ be a nonempty closed convex subset of a Hilbert space $\mathcal{H}$. For every $\mathbf{x}\in\mathcal{H}$, there exists a unique point $P_{\mathcal{C}}\mathbf{x}\in\mathcal{C}$ such that

$$\lVert \mathbf{x}-P_{\mathcal{C}}\mathbf{x} \rVert = \inf_{\mathbf{c}\in\mathcal{C}} \lVert \mathbf{x}-\mathbf{c} \rVert.$$

The point $P_{\mathcal{C}}\mathbf{x}$ is the projection of $\mathbf{x}$ onto $\mathcal{C}$.

For convex sets, the variational characterization is

$$\operatorname{Re}\langle \mathbf{x}-P_{\mathcal{C}}\mathbf{x},\, \mathbf{c}-P_{\mathcal{C}}\mathbf{x} \rangle \leq 0 \quad \text{for all } \mathbf{c}\in\mathcal{C}.$$

This inequality says every feasible direction from the projection point makes an angle of at least 90 degrees with the residual.

4.2 Orthogonal Projection onto Closed Subspaces

When $\mathcal{M}$ is a closed subspace, the projection condition simplifies:

$$\mathbf{x}-P_{\mathcal{M}}\mathbf{x} \in \mathcal{M}^{\perp}.$$

Equivalently,

$$\langle \mathbf{x}-P_{\mathcal{M}}\mathbf{x},\,\mathbf{m} \rangle = 0 \quad \text{for all } \mathbf{m}\in\mathcal{M}.$$

If $\{\mathbf{e}_1,\ldots,\mathbf{e}_k\}$ is an orthonormal basis for a finite-dimensional subspace $\mathcal{M}$, then

$$P_{\mathcal{M}}\mathbf{x} = \sum_{j=1}^k \langle \mathbf{x},\mathbf{e}_j \rangle \mathbf{e}_j.$$

This formula is the finite-dimensional ancestor of Fourier expansion.

4.3 Least Squares as Projection

Given $A\in\mathbb{R}^{m\times n}$ and $\mathbf{b}\in\mathbb{R}^m$, least squares solves

$$\min_{\mathbf{x}\in\mathbb{R}^n} \lVert A\mathbf{x}-\mathbf{b} \rVert_2^2.$$

The fitted vector $A\mathbf{x}^{\star}$ is the projection of $\mathbf{b}$ onto the column space $\operatorname{col}(A)$. The residual is orthogonal to every column of $A$:

$$A^\top(A\mathbf{x}^{\star}-\mathbf{b}) = \mathbf{0}.$$

Thus the normal equations are not arbitrary algebra. They are the orthogonality condition for projection:

$$A^\top A\,\mathbf{x}^{\star} = A^\top\mathbf{b}.$$

If $A^\top A$ is invertible,

$$\mathbf{x}^{\star} = (A^\top A)^{-1}A^\top\mathbf{b}.$$

If not, the minimum-norm solution is

$$\mathbf{x}^{\star} = A^\dagger\mathbf{b}.$$
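
The projection picture can be checked directly (a minimal sketch; the random problem size is arbitrary): the normal equations, the pseudoinverse, and `numpy.linalg.lstsq` agree, and the residual is orthogonal to every column of $A$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 4))        # full column rank with probability 1
b = rng.standard_normal(20)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)     # normal equations
x_pinv = np.linalg.pinv(A) @ b                   # minimum-norm solution via pseudoinverse
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]   # library least-squares solver

print(np.allclose(x_normal, x_pinv), np.allclose(x_pinv, x_lstsq))

residual = A @ x_lstsq - b
print(np.allclose(A.T @ residual, 0.0, atol=1e-10))  # residual is orthogonal to col(A)
```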

4.4 Projection Operators

The orthogonal projection $P_{\mathcal{M}}:\mathcal{H}\to\mathcal{M}$ satisfies:

$$P_{\mathcal{M}}^2=P_{\mathcal{M}}, \qquad P_{\mathcal{M}}^*=P_{\mathcal{M}}, \qquad \lVert P_{\mathcal{M}} \rVert \leq 1.$$

If $\mathcal{M}\neq\{\mathbf{0}\}$, then $\lVert P_{\mathcal{M}} \rVert=1$.

For a matrix $A$ with full column rank, the projection matrix onto $\operatorname{col}(A)$ is

$$P_A = A(A^\top A)^{-1}A^\top.$$

It satisfies

$$P_A^2=P_A, \qquad P_A^\top=P_A.$$
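
Under the same full-column-rank assumption, the projection matrix can be formed and its defining properties verified numerically (a sketch; in practice a QR factorization is preferable to forming $(A^\top A)^{-1}$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 4))
P = A @ np.linalg.solve(A.T @ A, A.T)   # P_A = A (A^T A)^{-1} A^T, without an explicit inverse

print(np.allclose(P @ P, P))                   # idempotent: P^2 = P
print(np.allclose(P, P.T))                     # self-adjoint: P^T = P
print(np.isclose(np.linalg.norm(P, 2), 1.0))   # spectral norm 1 for a nonzero projection

x = rng.standard_normal(20)
print(np.linalg.norm(P @ x) <= np.linalg.norm(x) + 1e-12)  # projections never increase norm
```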

4.5 Projection Algorithms in ML

Projection appears in many algorithms:

  • projected gradient descent enforces constraints by applying $P_{\mathcal{C}}$
  • alternating projections solve feasibility problems
  • least-squares layers project labels onto feature spans
  • PCA projects data onto top eigenspaces
  • denoising methods often estimate a projection onto a data manifold or low-dimensional signal set
  • constrained decoding can be viewed as repeated projection onto feasible token or structure sets, though usually in non-Hilbert geometries

The exact Hilbert projection theorem applies to closed convex sets in Hilbert spaces. ML practice often borrows the intuition outside this perfect setting, so the theorem-level statement and engineering heuristic should not be confused.


5. Core Theory III: Orthonormal Systems and Bases

5.1 Orthonormal Sets and Gram-Schmidt

A set $\{\mathbf{e}_j\}_{j\in J}$ is orthonormal if

$$\langle \mathbf{e}_i,\mathbf{e}_j \rangle = \begin{cases} 1, & i=j, \\ 0, & i\neq j. \end{cases}$$

Given linearly independent vectors $\mathbf{v}_1,\ldots,\mathbf{v}_k$, Gram-Schmidt constructs an orthonormal set:

$$\mathbf{u}_j = \mathbf{v}_j - \sum_{i=1}^{j-1} \langle \mathbf{v}_j,\mathbf{e}_i \rangle \mathbf{e}_i, \qquad \mathbf{e}_j = \frac{\mathbf{u}_j}{\lVert \mathbf{u}_j \rVert}.$$

Numerically, classical Gram-Schmidt can be unstable. Modified Gram-Schmidt or QR factorization is preferred in computation.
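
A minimal modified Gram-Schmidt implementation (a sketch, not a production orthogonalization routine) can be compared against NumPy's QR factorization:

```python
import numpy as np

def modified_gram_schmidt(V):
    """Orthonormalize the columns of V (assumed linearly independent)."""
    V = np.array(V, dtype=float)
    n, k = V.shape
    Q = np.zeros((n, k))
    for j in range(k):
        u = V[:, j]
        for i in range(j):
            u = u - (Q[:, i] @ u) * Q[:, i]   # subtract the component along e_i
        Q[:, j] = u / np.linalg.norm(u)
    return Q

rng = np.random.default_rng(4)
V = rng.standard_normal((50, 6))
Q = modified_gram_schmidt(V)

print(np.allclose(Q.T @ Q, np.eye(6), atol=1e-10))        # orthonormal columns
Q_ref, _ = np.linalg.qr(V)
print(np.allclose(np.abs(Q.T @ Q_ref), np.eye(6), atol=1e-8))  # agrees with QR up to column signs
```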

5.2 Bessel Inequality

If $\{\mathbf{e}_j\}_{j\in J}$ is an orthonormal set, then for every $\mathbf{x}\in\mathcal{H}$,

$$\sum_{j\in J} \lvert \langle \mathbf{x},\mathbf{e}_j \rangle \rvert^2 \leq \lVert \mathbf{x} \rVert^2.$$

For finite $J$, define

$$\mathbf{p} = \sum_{j\in J} \langle \mathbf{x},\mathbf{e}_j \rangle \mathbf{e}_j.$$

Then $\mathbf{x}-\mathbf{p}$ is orthogonal to $\mathbf{p}$, so

$$\lVert \mathbf{x} \rVert^2 = \lVert \mathbf{p} \rVert^2 + \lVert \mathbf{x}-\mathbf{p} \rVert^2 \geq \lVert \mathbf{p} \rVert^2 = \sum_{j\in J} \lvert \langle \mathbf{x},\mathbf{e}_j \rangle \rvert^2.$$

Bessel inequality says orthonormal coordinates cannot contain more energy than the vector itself.

5.3 Complete Orthonormal Systems and Hilbert Bases

An orthonormal set $\{\mathbf{e}_j\}_{j\in J}$ is complete if the only vector orthogonal to every $\mathbf{e}_j$ is $\mathbf{0}$:

$$\left( \langle \mathbf{x},\mathbf{e}_j \rangle = 0 \text{ for all } j\in J \right) \Longrightarrow \mathbf{x}=\mathbf{0}.$$

For separable Hilbert spaces, a countable complete orthonormal system is often called a Hilbert basis. This is not a Hamel basis. Infinite Hilbert expansions converge in norm, not as finite algebraic sums.

5.4 Parseval Identity and Fourier-Bessel Expansion

If $\{\mathbf{e}_j\}_{j\in J}$ is a complete orthonormal system, then

$$\mathbf{x} = \sum_{j\in J} \langle \mathbf{x},\mathbf{e}_j \rangle \mathbf{e}_j$$

with convergence in the Hilbert norm, and

$$\lVert \mathbf{x} \rVert^2 = \sum_{j\in J} \lvert \langle \mathbf{x},\mathbf{e}_j \rangle \rvert^2.$$

The first identity is Fourier-Bessel expansion. The second is Parseval identity.

For $L^2[-\pi,\pi]$, the trigonometric functions form an orthonormal coordinate system after normalization. A square-integrable function can be represented by its Fourier coefficients in $L^2$ norm, even when pointwise convergence needs additional hypotheses.
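
Bessel and Parseval can be checked with any orthonormal system (a sketch: here an orthonormal basis of $\mathbb{R}^{10}$ obtained from a QR factorization stands in for a general Hilbert basis):

```python
import numpy as np

rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))  # columns form a complete orthonormal system
x = rng.standard_normal(10)

coeffs = Q.T @ x                       # Fourier-Bessel coordinates <x, e_j>
print(np.isclose(np.sum(coeffs**2), np.linalg.norm(x)**2))   # Parseval identity

partial = coeffs[:4]                   # coordinates of an incomplete orthonormal subset
print(np.sum(partial**2) <= np.linalg.norm(x)**2 + 1e-12)    # Bessel inequality

x_rebuilt = Q @ coeffs                 # Fourier-Bessel expansion reconstructs x
print(np.allclose(x_rebuilt, x))
```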

5.5 Separability and 2\ell^2 Coordinates

A Hilbert space is separable if it has a countable dense subset. Most Hilbert spaces used in ML and signal processing are separable.

If $\mathcal{H}$ is separable with a countable orthonormal system $(\mathbf{e}_j)_{j=1}^{\infty}$, the map

$$\mathbf{x} \mapsto \left( \langle \mathbf{x},\mathbf{e}_1 \rangle, \langle \mathbf{x},\mathbf{e}_2 \rangle, \ldots \right)$$

sends $\mathcal{H}$ into $\ell^2$, and Bessel inequality says it never increases norm. When the system is complete and $\mathcal{H}$ is infinite-dimensional, this map is an isometric isomorphism from $\mathcal{H}$ onto $\ell^2$.

This is the reason 2\ell^2 is the prototype of separable infinite-dimensional Hilbert spaces.


6. Core Theory IV: Riesz Representation and Duality

6.1 Bounded Linear Functionals

A linear functional is a linear map

$$L:\mathcal{H}\to\mathbb{F}.$$

It is bounded if there exists $C\geq 0$ such that

$$\lvert L(\mathbf{x}) \rvert \leq C\,\lVert \mathbf{x} \rVert \quad \text{for all } \mathbf{x}\in\mathcal{H}.$$

The operator norm is

$$\lVert L \rVert = \sup_{\lVert \mathbf{x} \rVert\leq 1} \lvert L(\mathbf{x}) \rvert.$$

Bounded linear functionals are exactly continuous linear functionals.

6.2 Riesz Representation Theorem

For every bounded linear functional $L$ on a Hilbert space $\mathcal{H}$, there exists a unique vector $\mathbf{h}_L\in\mathcal{H}$ such that

$$L(\mathbf{x}) = \langle \mathbf{x},\mathbf{h}_L \rangle \quad \text{for all } \mathbf{x}\in\mathcal{H}.$$

Moreover,

$$\lVert L \rVert = \lVert \mathbf{h}_L \rVert.$$

The vector $\mathbf{h}_L$ is the Riesz representative of $L$.

6.3 Hilbert Spaces Are Self-Dual

The Riesz theorem identifies $\mathcal{H}^*$ with $\mathcal{H}$. In a general Banach space, the dual can be quite different from the original space. In a Hilbert space, continuous linear measurements are inner products with vectors inside the same space.

This is the clean mathematical reason gradients can often be represented as vectors. A derivative is naturally a linear functional. To turn it into a gradient vector, we choose an inner product and apply Riesz representation.
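
As a concrete finite-dimensional example (a sketch matching Exercise 8 below), the functional $L(\mathbf{x}) = \mathbf{a}^\top\mathbf{x}$ has Riesz representative $\mathbf{a}$ under the Euclidean inner product, but $W^{-1}\mathbf{a}$ under the weighted inner product $\langle \mathbf{x},\mathbf{y}\rangle_W = \mathbf{x}^\top W\mathbf{y}$:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
a = rng.standard_normal(n)
B = rng.standard_normal((n, n))
W = B @ B.T + n * np.eye(n)            # symmetric positive definite weight matrix

h = np.linalg.solve(W, a)              # Riesz representative under <x, y>_W = x^T W y

x = rng.standard_normal(n)
print(np.isclose(a @ x, x @ W @ h))    # L(x) = <x, h>_W for this x (and in fact for all x)
```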

6.4 Gradients as Riesz Representatives

Let $F:\mathcal{H}\to\mathbb{R}$ be differentiable. Its differential at $\mathbf{x}$ is a bounded linear functional

$$DF(\mathbf{x})[\mathbf{v}].$$

The gradient $\nabla F(\mathbf{x})$ is the Riesz representative satisfying

$$DF(\mathbf{x})[\mathbf{v}] = \langle \mathbf{v},\nabla F(\mathbf{x}) \rangle \quad \text{for all } \mathbf{v}\in\mathcal{H}.$$

Changing the inner product changes the gradient vector while preserving the same differential. This is why natural gradient, preconditioning, and mirror-descent-like methods can be interpreted as changing the geometry in which steepest descent is measured.

6.5 Why This Matters for Optimization and Backpropagation

Backpropagation computes derivatives. Optimizers apply gradient vectors. The bridge from derivative-as-functional to gradient-as-vector is Riesz representation plus an inner product choice.

For the squared loss

$$F(\mathbf{w}) = \frac{1}{2}\lVert X\mathbf{w}-\mathbf{y} \rVert_2^2,$$

the differential is

$$DF(\mathbf{w})[\mathbf{v}] = \langle X\mathbf{w}-\mathbf{y},\,X\mathbf{v} \rangle = \langle X^\top(X\mathbf{w}-\mathbf{y}),\,\mathbf{v} \rangle.$$

Thus the Euclidean gradient is

$$\nabla F(\mathbf{w}) = X^\top(X\mathbf{w}-\mathbf{y}).$$

With a different inner product $\langle \mathbf{u},\mathbf{v} \rangle_G=\mathbf{u}^\top G\mathbf{v}$, the gradient vector becomes

$$\nabla_G F(\mathbf{w}) = G^{-1}\nabla F(\mathbf{w}).$$

This is preconditioning in Hilbert geometry.
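
A short numerical check (a sketch; the quadratic loss and matrices are random) that the Euclidean gradient represents the differential, and that changing the inner product to $\langle\mathbf{u},\mathbf{v}\rangle_G = \mathbf{u}^\top G\mathbf{v}$ rescales the gradient to $G^{-1}\nabla F$:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
w = rng.standard_normal(4)
v = rng.standard_normal(4)

F = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2
grad = X.T @ (X @ w - y)                       # Euclidean gradient

# Directional derivative DF(w)[v] matches <v, grad> (central finite-difference check).
eps = 1e-6
dF = (F(w + eps * v) - F(w - eps * v)) / (2 * eps)
print(np.isclose(dF, v @ grad, rtol=1e-4))

B = rng.standard_normal((4, 4))
G = B @ B.T + 4 * np.eye(4)                    # alternative inner product <u, v>_G = u^T G v
grad_G = np.linalg.solve(G, grad)              # Riesz representative under the G-geometry
print(np.isclose(v @ G @ grad_G, v @ grad))    # same differential, different gradient vector
```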


7. Core Theory V: Operators on Hilbert Spaces

7.1 Bounded Operators and Operator Norms

A linear operator $T:\mathcal{H}\to\mathcal{K}$ is bounded if there exists $C\geq 0$ such that

$$\lVert T\mathbf{x} \rVert_{\mathcal{K}} \leq C\,\lVert \mathbf{x} \rVert_{\mathcal{H}} \quad \text{for all } \mathbf{x}\in\mathcal{H}.$$

Its operator norm is

$$\lVert T \rVert = \sup_{\lVert \mathbf{x} \rVert_{\mathcal{H}}\leq 1} \lVert T\mathbf{x} \rVert_{\mathcal{K}}.$$

For matrices with Euclidean geometry, this is the spectral norm:

$$\lVert A \rVert_2 = \sigma_{\max}(A).$$

7.2 Adjoints and Self-Adjoint Operators

The adjoint $T^*:\mathcal{K}\to\mathcal{H}$ is defined by

$$\langle T\mathbf{x},\mathbf{y} \rangle_{\mathcal{K}} = \langle \mathbf{x},T^*\mathbf{y} \rangle_{\mathcal{H}}.$$

In finite-dimensional real Euclidean space, the adjoint is the transpose. In complex Euclidean space, it is the conjugate transpose.

An operator is self-adjoint if

$$T=T^*.$$

Self-adjoint operators are the Hilbert-space analogue of symmetric matrices. Covariance operators, kernel integral operators, Hessians of quadratic losses, and graph Laplacians are central ML examples.

7.3 Positive Operators and Quadratic Forms

A self-adjoint operator $T$ is positive if

$$\langle T\mathbf{x},\mathbf{x} \rangle \geq 0 \quad \text{for all } \mathbf{x}\in\mathcal{H}.$$

In finite dimensions, this is the positive semidefinite condition $A\succeq 0$.

Positive operators define energy functionals:

$$E(\mathbf{x}) = \langle T\mathbf{x},\mathbf{x} \rangle.$$

Examples:

  • covariance matrix: variance in direction $\mathbf{x}$
  • graph Laplacian: smoothness energy over graph nodes
  • kernel Gram matrix: squared norm of a finite feature combination
  • Hessian: local curvature of a twice-differentiable loss

7.4 Compact Operators

A bounded operator $T:\mathcal{H}\to\mathcal{K}$ is compact if it maps bounded sets to relatively compact sets. Equivalently, every bounded sequence $(\mathbf{x}_n)$ has a subsequence such that $(T\mathbf{x}_{n_k})$ converges.

In finite dimensions, every bounded operator is compact. In infinite dimensions, compactness is special. Compact operators often behave like infinite matrices whose singular values decay to zero.

Integral operators with smooth kernels are typical examples:

$$(Tf)(s) = \int_{\Omega} k(s,t)\,f(t)\,dt.$$

Kernel methods and Gaussian-process covariance operators often lead to compact positive self-adjoint operators under suitable domain and regularity assumptions.

7.5 Spectral Theorem for Compact Self-Adjoint Operators

If $T$ is compact and self-adjoint on a Hilbert space, then there is an orthonormal set of eigenvectors associated with its nonzero real eigenvalues, and the nonzero spectrum can accumulate only at $0$.

In favorable cases,

$$T\mathbf{x} = \sum_{j=1}^{\infty} \lambda_j \langle \mathbf{x},\mathbf{e}_j \rangle \mathbf{e}_j.$$

This is the infinite-dimensional analogue of diagonalizing a symmetric matrix:

$$A = Q\Lambda Q^\top.$$

PCA is the finite-sample version of this story. Kernel PCA and Gaussian-process covariance analysis use the same idea in feature or function spaces.
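
In finite dimensions the spectral theorem is exactly what `numpy.linalg.eigh` computes (a sketch with a random symmetric matrix): the eigenvectors are orthonormal and the operator is recovered from its spectral sum.

```python
import numpy as np

rng = np.random.default_rng(8)
B = rng.standard_normal((6, 6))
A = 0.5 * (B + B.T)                     # symmetric (self-adjoint) operator on R^6

lam, Q = np.linalg.eigh(A)              # real eigenvalues, orthonormal eigenvectors
print(np.allclose(Q.T @ Q, np.eye(6)))               # orthonormal eigenbasis
print(np.allclose(A, Q @ np.diag(lam) @ Q.T))        # A = Q Lambda Q^T

x = rng.standard_normal(6)
# Spectral form of the operator: A x = sum_j lambda_j <x, e_j> e_j
spectral_Ax = sum(l * (x @ q) * q for l, q in zip(lam, Q.T))
print(np.allclose(spectral_Ax, A @ x))
```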


8. Advanced Topics and Bridges

8.1 Weak Convergence

Norm convergence is strong:

$$\lVert \mathbf{x}_n-\mathbf{x} \rVert \to 0.$$

Weak convergence asks only that all bounded linear measurements converge:

$$\mathbf{x}_n \rightharpoonup \mathbf{x} \quad \Longleftrightarrow \quad \langle \mathbf{x}_n,\mathbf{h} \rangle \to \langle \mathbf{x},\mathbf{h} \rangle \quad \text{for all } \mathbf{h}\in\mathcal{H}.$$

Strong convergence implies weak convergence. Weak convergence does not generally imply strong convergence.

In infinite-dimensional optimization, weak compactness and lower semicontinuity are often enough to prove existence of minimizers even when norm-compactness fails.

8.2 Fourier Analysis as Hilbert-Space Coordinates

Fourier analysis is Hilbert-space coordinate expansion in $L^2$. The functions

$$\frac{1}{\sqrt{2\pi}}, \qquad \frac{\cos(nt)}{\sqrt{\pi}}, \qquad \frac{\sin(nt)}{\sqrt{\pi}}, \qquad n\geq 1,$$

form an orthonormal system in $L^2[-\pi,\pi]$.

Fourier coefficients are inner products:

$$a_n = \langle f,\mathbf{e}_n \rangle.$$

Parseval says signal energy equals coefficient energy. This is the basis of spectral signal processing, convolution analysis, and random Fourier feature intuition.

8.3 RKHS and Reproducing Kernels

A reproducing kernel Hilbert space is a Hilbert space $\mathcal{H}_K$ of functions on a set $\mathcal{X}$ such that point evaluation is continuous:

$$L_x(f)=f(x).$$

By Riesz representation, for each $x\in\mathcal{X}$ there exists $K_x\in\mathcal{H}_K$ such that

$$f(x) = \langle f,K_x \rangle_{\mathcal{H}_K}.$$

Define

$$K(x,x') = K_{x'}(x) = \langle K_{x'},K_x \rangle_{\mathcal{H}_K}.$$

This is the reproducing kernel. The next section develops positive definite kernels, the Moore-Aronszajn theorem, SVMs, kernel ridge regression, Gaussian processes, and kernel approximations in detail.

8.4 Kernel Gradient Flow and Neural Tangent Kernels

Many overparameterized models can be studied by tracking how predictions change under gradient descent. In certain infinite-width limits, the dynamics of the prediction function are governed by a kernel:

$$\frac{d f_t(\mathbf{x})}{dt} = -\sum_{i=1}^n K(\mathbf{x},\mathbf{x}^{(i)})\,\frac{\partial \mathcal{L}}{\partial f_t(\mathbf{x}^{(i)})}.$$

The neural tangent kernel perspective says that, under specific assumptions, very wide neural networks behave like kernel machines during training. This is a modern bridge between deep learning and Hilbert-space geometry, but the details depend on architecture, scaling, initialization, and limiting arguments.

8.5 Continuous vs Discrete Spectra

Compact self-adjoint operators resemble diagonal matrices with eigenvalues tending to zero. General self-adjoint operators can have continuous spectrum. This matters in quantum mechanics, PDEs, and some infinite-dimensional learning problems.

For this course, the main working intuition is:

  • finite-dimensional symmetric matrices have orthonormal eigenvectors
  • compact self-adjoint operators retain a similar discrete spectral structure
  • general self-adjoint operators require spectral measures, which are beyond this section

9. Applications in Machine Learning

9.1 Embedding Similarity and Attention Scores

The transformer attention score between query $\mathbf{q}$ and key $\mathbf{k}$ is

$$\frac{\langle \mathbf{q},\mathbf{k} \rangle}{\sqrt{d_k}}.$$

The inner product measures alignment. Scaling by $\sqrt{d_k}$ controls variance when coordinates have roughly unit scale. Cosine similarity normalizes away vector length; dot-product attention keeps both angle and magnitude.
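
A minimal scaled dot-product attention sketch in NumPy (the shapes and the inline softmax are illustrative assumptions, not any specific library's API):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise <q_i, k_j> / sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # convex combination of values

rng = np.random.default_rng(9)
Q, K, V = rng.standard_normal((3, 8)), rng.standard_normal((5, 8)), rng.standard_normal((5, 4))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4)
```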

9.2 Least Squares, Ridge Regression, and Orthogonal Projection

Least squares is projection onto the feature span. Ridge regression solves

$$\min_{\mathbf{w}}\ \lVert X\mathbf{w}-\mathbf{y} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_2^2.$$

The regularization term adds Hilbert norm control. In kernel ridge regression, the same idea is lifted to an RKHS norm:

$$\min_{f\in\mathcal{H}_K}\ \sum_{i=1}^n \left(f(\mathbf{x}^{(i)})-y^{(i)}\right)^2 + \lambda \lVert f \rVert_{\mathcal{H}_K}^2.$$

The representer theorem then says the minimizer lies in the finite span of kernel sections $K(\mathbf{x}^{(i)},\cdot)$.
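
A kernel ridge regression sketch with an RBF kernel (the kernel choice, bandwidth, and regularization strength are illustrative assumptions; the representer theorem is what lets the solution be a finite combination of kernel sections):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(10)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual coefficients

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_test = rbf_kernel(X_test, X) @ alpha                 # f(x) = sum_i alpha_i K(x, x^(i))
print(f_test)
```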

9.3 PCA and Spectral Decomposition

Given a centered data matrix $X$, the sample covariance is

$$C = \frac{1}{n}X^\top X.$$

It is positive self-adjoint. PCA finds orthonormal directions $\mathbf{v}_j$ that solve

$$C\mathbf{v}_j = \lambda_j\mathbf{v}_j.$$

Projecting onto the top $k$ eigenvectors gives the best rank-$k$ approximation in squared reconstruction error. This is Hilbert projection plus spectral theory.
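
PCA as projection plus spectral decomposition (a sketch; the synthetic data has a dominant 2-dimensional structure by construction):

```python
import numpy as np

rng = np.random.default_rng(11)
n, d, k = 200, 6, 2
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))  # low-rank signal
X += 0.05 * rng.standard_normal((n, d))                        # small noise
X -= X.mean(axis=0)                                            # center the data

C = X.T @ X / n                                 # sample covariance (positive self-adjoint)
lam, V = np.linalg.eigh(C)                      # eigenvalues in ascending order
V_k = V[:, -k:]                                 # top-k orthonormal directions

X_proj = X @ V_k @ V_k.T                        # orthogonal projection onto the top eigenspace
rel_err = np.linalg.norm(X - X_proj) / np.linalg.norm(X)
print(rel_err)                                  # small: most energy lives in the top-k subspace
```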

9.4 Gaussian Processes and RKHS Preview

A Gaussian process is determined by a mean function and covariance kernel. The covariance kernel is positive definite, so it defines a Hilbert geometry. The RKHS associated with the kernel is not the same as the random sample-path space in general, but it captures the deterministic smoothness geometry encoded by the kernel.

9.5 Infinite-Width Neural Networks and NTK Preview

In neural tangent kernel theory, an infinite-width network can converge to dynamics described by a kernel operator. The Hilbert-space lesson is that inner products between parameter gradients induce a geometry on functions:

$$K(\mathbf{x},\mathbf{x}') = \left\langle \nabla_{\boldsymbol{\theta}} f(\mathbf{x};\boldsymbol{\theta}),\ \nabla_{\boldsymbol{\theta}} f(\mathbf{x}';\boldsymbol{\theta}) \right\rangle.$$

This kernel measures how a parameter update that changes the output at $\mathbf{x}$ also changes the output at $\mathbf{x}'$.
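
An empirical tangent-kernel sketch for a tiny two-layer network (the architecture, width, and hand-written gradient formulas are illustrative assumptions; this only demonstrates the inner-product structure, not an infinite-width limit):

```python
import numpy as np

rng = np.random.default_rng(12)
m = 64                                   # hidden width
w = rng.standard_normal(m) / np.sqrt(m)  # input-to-hidden weights (scalar input)
a = rng.standard_normal(m) / np.sqrt(m)  # hidden-to-output weights

def param_gradient(x):
    """Gradient of f(x; theta) = a . tanh(w x) with respect to theta = (w, a)."""
    h = np.tanh(w * x)
    grad_w = a * (1.0 - h**2) * x        # d f / d w_j
    grad_a = h                           # d f / d a_j
    return np.concatenate([grad_w, grad_a])

def empirical_ntk(x1, x2):
    # K(x, x') = <grad_theta f(x), grad_theta f(x')>
    return param_gradient(x1) @ param_gradient(x2)

xs = np.linspace(-1.0, 1.0, 4)
K = np.array([[empirical_ntk(s, t) for t in xs] for s in xs])
print(K.shape, np.allclose(K, K.T))             # symmetric Gram matrix of tangent features
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # positive semidefinite
```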


10. Common Mistakes

  1. Confusing every normed space with a Hilbert space. A Hilbert space needs an inner product whose induced norm is complete. $\ell^1$ with its usual norm is Banach but not Hilbert.

  2. Forgetting completeness. Polynomials with the $L^2$ inner product have angles and lengths but are not complete.

  3. Treating a Hilbert basis like a Hamel basis. Hilbert expansions can be infinite and converge in norm.

  4. Assuming orthogonal projection exists onto every subspace. The subspace must be closed. Dense proper subspaces do not have nontrivial orthogonal complements.

  5. Using pointwise convergence when $L^2$ convergence is meant. Fourier series often converge in mean-square even when pointwise behavior is delicate.

  6. Assuming all ML feature spaces are finite-dimensional. Kernels often represent implicit Hilbert spaces that may be infinite-dimensional.

  7. Ignoring the chosen inner product when discussing gradients. Gradients depend on geometry. Differentials do not.

  8. Confusing PSD matrices with positive operators in all contexts. PSD matrices are finite-dimensional examples. Infinite-dimensional positive operators require domain and boundedness care.

  9. Assuming RKHS sample paths are typical Gaussian-process samples. The RKHS is the deterministic geometry of the kernel; GP sample paths may lie outside it with probability one in many common settings.

  10. Overstating NTK conclusions. NTK limits are mathematically specific and do not automatically explain all finite-width training behavior.


11. Exercises

  1. Prove Cauchy-Schwarz using minimization of $\lVert \mathbf{x}-a\mathbf{y} \rVert^2$.
  2. Check whether a weighted bilinear form $\mathbf{x}^\top W\mathbf{y}$ is an inner product for several matrices $W$.
  3. Show that $\ell^1$ does not satisfy the parallelogram law.
  4. Compute the orthogonal projection of a vector onto a line and onto a column space.
  5. Derive the normal equations for least squares from residual orthogonality.
  6. Implement Gram-Schmidt and compare it with QR factorization.
  7. Verify Bessel inequality and Parseval identity for a finite orthonormal basis.
  8. Find the Riesz representative of $L(\mathbf{x})=\mathbf{a}^\top\mathbf{x}$ under a weighted inner product.
  9. Show that a symmetric PSD matrix defines a positive self-adjoint operator.
  10. Explain why kernel evaluation is point evaluation represented as an inner product in an RKHS.

The companion exercise notebook contains scaffolded versions with computational checks and full solutions.


12. Why This Matters for AI

Hilbert spaces are one of the quiet foundations of AI mathematics. They explain why dot products are meaningful, why squared error is geometrically special, why projections solve approximation problems, why PCA diagonalizes variance, why Fourier features preserve energy, why kernels let nonlinear learning become linear in a feature space, and why gradients become vectors once an inner product is chosen.

Three ideas are especially important:

1. Similarity is geometry. Attention, retrieval, contrastive learning, and recommendation systems all compare representations. Inner products and normalized inner products are Hilbert-space measurements of alignment.

2. Approximation is projection. Least squares, PCA, denoising, and many function-approximation problems become clearer when the learned object is viewed as a projection onto a subspace or a regularized approximation inside a Hilbert space.

3. Learning dynamics depend on inner products. A gradient is not just a list of partial derivatives. It is a Riesz representative of a differential under a chosen geometry. Changing the geometry changes the optimization path.


13. Conceptual Bridge

The next section, Kernel Methods, starts from the Hilbert idea that inner products measure similarity. A kernel is a function

$$K(\mathbf{x},\mathbf{x}') = \langle \phi(\mathbf{x}),\phi(\mathbf{x}') \rangle_{\mathcal{H}}$$

that computes an inner product in a feature Hilbert space without explicitly constructing $\phi(\mathbf{x})$.

The key transition is:

Hilbert spaces:
  inner products, projections, bases, Riesz representation

Kernel methods:
  compute Hilbert-space inner products through K(x, x')
  learn nonlinear functions with linear Hilbert-space geometry

Once you understand Hilbert spaces, the kernel trick is no longer a trick. It is just inner-product geometry moved into a richer feature space.

