Spectral Graph Theory: Part 1 - Intuition (1) to Graph Fourier Transform and Signal Processing (6)

1. Intuition

1.1 Hearing the Shape of a Graph

In 1966, mathematician Mark Kac posed the question: "Can you hear the shape of a drum?" - meaning, can you reconstruct the geometry of a vibrating membrane from the frequencies it produces? The question turned out to have a negative answer in general, but it crystallized one of the deepest ideas in mathematics: the spectrum of a differential operator encodes geometric information.

Spectral graph theory asks the same question for discrete structures. A graph $G = (V, E)$ has an associated matrix - the Laplacian $L$ - whose eigenvalues form the graph spectrum. These numbers are not arbitrary: they encode whether the graph is connected, how tightly its communities are glued together, how quickly a random walk mixes across its edges, and how hard it is to cut the graph in two.

Think of a social network. Each person is a node; each friendship is an edge. The graph has "natural frequencies": a society with two isolated groups (a disconnected graph) has a different spectrum from one that is fully interconnected. The small eigenvalues of $L$ correspond to smooth, slowly-varying signals - the overall community membership function. The large eigenvalues correspond to rapidly-oscillating signals - the microscopic variation from person to person. This is the graph analogue of low and high frequencies in audio.

Three statements, each surprising when first encountered, that spectral graph theory makes precise:

  1. The number of connected components of $G$ equals the number of times $0$ appears as an eigenvalue of $L$.
  2. The second-smallest eigenvalue $\lambda_2(L)$ - the "Fiedler value" - tells you how hard it is to disconnect the graph. A graph is harder to cut when $\lambda_2$ is larger.
  3. The eigenvector corresponding to $\lambda_2$ - the "Fiedler vector" - assigns a real number to each vertex, and the sign of this number tells you which side of the best bisection each vertex belongs to.

These are not vague analogies. They are theorems with proofs, and they form the backbone of a theory that has become indispensable in machine learning.

1.2 The Three Graph Matrices

For a graph $G = (V, E)$ with $n = |V|$ vertices and $m = |E|$ edges, three matrices appear constantly:

Adjacency matrix $A \in \mathbb{R}^{n \times n}$:

$$A_{ij} = \begin{cases} 1 & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$

For undirected graphs, $A$ is symmetric. For weighted graphs, $A_{ij} = w_{ij}$, the edge weight. The adjacency matrix encodes the direct connections in the graph.

Degree matrix $D \in \mathbb{R}^{n \times n}$: a diagonal matrix with $D_{ii} = d_i = \sum_j A_{ij}$, the degree of vertex $i$. For weighted graphs, $d_i = \sum_j w_{ij}$ is the weighted degree (also called strength).

Graph Laplacian $L = D - A$: the central object of spectral graph theory. Explicitly:

$$L_{ij} = \begin{cases} d_i & \text{if } i = j \\ -1 & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$

The Laplacian is named after Pierre-Simon Laplace because it is the discrete analogue of the continuous Laplace operator $\Delta = \sum_i \partial^2/\partial x_i^2$. For a function $f: V \to \mathbb{R}$ defined on the vertices:

$$(Lf)_i = \sum_{j:(i,j)\in E} (f(i) - f(j)) = d_i f(i) - \sum_{j:(i,j)\in E} f(j)$$

This is the "discrete second derivative" - it measures how much the value at vertex ii differs from the average value among its neighbors.

For AI: In a Graph Neural Network, the operation $\tilde{A}\mathbf{H}$ (multiplying node features by the adjacency matrix with self-loops) is equivalent to computing $(\tilde{D} - \tilde{L})\mathbf{H}$. The Laplacian is implicitly present in every GNN layer.

1.3 Why Eigenvalues Reveal Structure

The Laplacian $L$ is a real symmetric positive semidefinite matrix. By the spectral theorem (03-Advanced-Linear-Algebra), it has a complete orthonormal basis of eigenvectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n$ with real non-negative eigenvalues:

$$0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$$

Why is $\lambda_1 = 0$ always? Because $L\mathbf{1} = \mathbf{0}$ - the all-ones vector is always in the null space of $L$ (every row of $L$ sums to zero). The constant function "assign the same value to every vertex" has zero variation across every edge, so it has zero energy.

The deeper result: $\lambda_1 = \lambda_2 = \cdots = \lambda_k = 0$ if and only if the graph has exactly $k$ connected components. On a disconnected graph with $k$ components, the eigenspace for eigenvalue $0$ is spanned by the indicator vectors of the components.

The second-smallest eigenvalue $\lambda_2$ - positive exactly when the graph is connected - is called the algebraic connectivity or Fiedler value (after Miroslav Fiedler, who proved its key properties in 1973). A larger $\lambda_2$ means the graph is harder to disconnect; a $\lambda_2$ close to zero means there is almost a disconnection - a bottleneck.

The largest eigenvalue $\lambda_n$ is the spectral radius of the Laplacian and satisfies $\lambda_n \leq 2d_{\max}$.

1.4 Historical Timeline

SPECTRAL GRAPH THEORY - HISTORICAL TIMELINE
========================================================================

  1847  Kirchhoff     - Matrix-Tree theorem; Laplacian for electrical circuits
  1931  Whitney       - Graph isomorphism; chromatic polynomials
  1954  Collatz &     - Systematic study of graph spectra begins
        Sinogowitz
  1970  Cheeger       - Cheeger inequality (originally for manifolds)
  1973  Fiedler       - Algebraic connectivity; Fiedler vector; graph bisection
  1985  Alon & Milman - Discrete Cheeger inequality for graphs
  2000  Shi & Malik   - Normalized Cuts and image segmentation
  2001  Belkin &      - Laplacian eigenmaps for manifold learning
        Niyogi
  2002  Ng, Jordan,   - Spectral clustering algorithm (the standard version)
        Weiss
  2004  Spielman &    - Spectral sparsification; fast Laplacian solvers
        Teng
  2011  Hammond et al - Wavelets on graphs
  2014  Bruna et al   - Spectral graph CNNs (first spectral GNN)
  2016  Defferrard    - ChebNet: Chebyshev polynomial filters on graphs
        et al
  2017  Kipf &        - GCN: first-order Chebyshev -> simple spatial rule
        Welling
  2022  Dwivedi et al - Laplacian positional encodings for graph Transformers
  2022  Rampasek et   - GPS: General, Powerful, Scalable graph Transformer
        al                with spectral PE

========================================================================

1.5 Roadmap of the Section

This section follows a deliberate progression from foundational algebra to modern AI applications:

SECTION ROADMAP
========================================================================

   2  Graph Matrices           Build the algebraic objects
   3  Quadratic Form / PSD     Prove fundamental spectral properties
   4  Fiedler Vector           Connect lambda_2 to graph connectivity
   5  Cheeger Inequality       Connect lambda_2 to cut structure and mixing
   6  Graph Fourier Transform  Signal processing on graphs
   7  Spectral Filtering       From Fourier to polynomial approximations
   8  Spectral Clustering      Partition graphs via eigenvectors
   9  Laplacian Eigenmaps      Embed graphs; PE for transformers
  10  Directed Graphs          PageRank; complex eigenvalues
  11  Advanced Topics          Sparsification; wavelets; random matrices
  12  ML Applications          KGs, molecules, LLM attention analysis

========================================================================

2. Graph Matrices and Their Spectra

2.1 Adjacency Matrix: Spectral View

Definition. For $G = (V, E)$ with $n$ vertices, the adjacency matrix $A \in \mathbb{R}^{n \times n}$ is defined by $A_{ij} = w_{ij}$ if $(i,j) \in E$ and $A_{ij} = 0$ otherwise (with $w_{ij} = 1$ for unweighted graphs).

Key spectral property: walk counting. The $(i,j)$ entry of $A^k$ counts the number of walks of length exactly $k$ from vertex $i$ to vertex $j$. This follows by induction: $(A^k)_{ij} = \sum_\ell (A^{k-1})_{i\ell} A_{\ell j}$ sums over all ways to reach $j$ in $k$ steps: first take $k-1$ steps to some vertex $\ell$, then one step from $\ell$ to $j$.

For AI: This walk-counting property is the spectral justification for why a $k$-layer GNN can "see" information from the $k$-hop neighborhood. The matrix $A^k$ is what a linear GNN with $k$ layers computes.
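
A brute-force sanity check of the walk-counting property, on a made-up 4-vertex graph (a triangle with a pendant vertex):

```python
import numpy as np
from itertools import product

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]   # triangle plus a tail
n, k = 4, 3
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1

Ak = np.linalg.matrix_power(A, k)          # (A^k)_{ij} = number of length-k walks

def count_walks(i, j):
    # enumerate every vertex sequence of length k+1 from i to j
    total = 0
    for mid in product(range(n), repeat=k - 1):
        seq = (i, *mid, j)
        if all(A[a, b] for a, b in zip(seq, seq[1:])):
            total += 1
    return total

assert all(Ak[i, j] == count_walks(i, j)
           for i in range(n) for j in range(n))
```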

Eigenvalues of $A$. For an undirected graph, $A$ is symmetric, so all eigenvalues are real. Let $\mu_1 \geq \mu_2 \geq \cdots \geq \mu_n$ denote the eigenvalues of $A$ in decreasing order. Key bounds:

  • For any graph: $|\mu_i| \leq d_{\max}$ (the maximum degree), since $\lVert A \rVert_2 \leq d_{\max}$.
  • For a $d$-regular graph: $\mu_1 = d$ with eigenvector $\mathbf{1}/\sqrt{n}$.
  • Bipartite graphs have symmetric spectra: $\mu_i = -\mu_{n+1-i}$.
  • For a connected graph, the number of distinct eigenvalues is at least $\operatorname{diam}(G) + 1$, where $\operatorname{diam}(G)$ is the graph diameter.

Non-example of symmetry: for a directed graph, $A$ is not symmetric and eigenvalues may be complex. This is why directed spectral theory (10) requires separate treatment.

2.2 Degree Matrix and Volume

The degree matrix $D = \operatorname{diag}(d_1, d_2, \ldots, d_n)$ is diagonal with $D_{ii} = d_i = \sum_j A_{ij}$.

Volume. For a subset $S \subseteq V$, the volume is $\operatorname{vol}(S) = \sum_{i \in S} d_i$. For the full graph, $\operatorname{vol}(V) = 2m$ (each edge contributes $2$ to the total degree sum). Volume plays the role of "mass" in normalized Laplacian theory.

For a $d$-regular graph, $D = d I_n$ and $\operatorname{vol}(S) = d|S|$, making the theory particularly clean. Most derivations proceed with general $D$ but reduce to simpler formulas in the regular case.

Random walk transition matrix. The matrix $P = D^{-1}A$ is row-stochastic: $\sum_j P_{ij} = 1$ for all $i$. It defines a random walk on the graph: from vertex $i$, move to neighbor $j$ with probability $A_{ij}/d_i$. The stationary distribution of this walk is $\boldsymbol{\pi}$ with $\pi_i = d_i/\operatorname{vol}(V)$ - proportional to degree. This connection between $D$, $A$, and random walks is central to normalized Laplacian theory.
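
A quick numerical confirmation that $\pi_i = d_i/\operatorname{vol}(V)$ is stationary for $P = D^{-1}A$; the weighted adjacency below is invented for the demo:

```python
import numpy as np

A = np.array([[0, 2, 1, 0],
              [2, 0, 1, 1],
              [1, 1, 0, 3],
              [0, 1, 3, 0]], dtype=float)  # symmetric weighted adjacency
d = A.sum(axis=1)                          # weighted degrees
P = A / d[:, None]                         # row-stochastic transition matrix
pi = d / d.sum()                           # candidate stationary distribution

assert np.allclose(P.sum(axis=1), 1.0)     # each row sums to one
assert np.allclose(pi @ P, pi)             # pi P = pi: stationary
```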

2.3 Unnormalized Laplacian L = D - A

Definition. $L = D - A$. Entry by entry:

$$L_{ij} = \begin{cases} d_i & i = j \\ -w_{ij} & (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$

Every row (and column) sums to zero: $\sum_j L_{ij} = d_i - \sum_{j:(i,j)\in E} w_{ij} = 0$. Equivalently, $L\mathbf{1} = \mathbf{0}$.

The fundamental quadratic form:

$$\mathbf{x}^\top L \mathbf{x} = \sum_{(i,j)\in E} w_{ij}(x_i - x_j)^2 \quad \text{for all } \mathbf{x} \in \mathbb{R}^n$$

Proof:

$$\mathbf{x}^\top L \mathbf{x} = \mathbf{x}^\top D \mathbf{x} - \mathbf{x}^\top A \mathbf{x} = \sum_i d_i x_i^2 - 2\sum_{(i,j)\in E} w_{ij} x_i x_j = \sum_{(i,j)\in E} w_{ij}(x_i^2 + x_j^2) - 2\sum_{(i,j)\in E} w_{ij} x_i x_j = \sum_{(i,j)\in E} w_{ij}(x_i - x_j)^2 \geq 0$$

(the middle step uses $d_i = \sum_j w_{ij}$ to regroup the diagonal term edge by edge). Since $w_{ij} \geq 0$, the sum is non-negative: $L \succeq 0$.

Geometric meaning: $\mathbf{x}^\top L \mathbf{x}$ measures the total variation of the signal $\mathbf{x}: V \to \mathbb{R}$ across all edges. It is zero if and only if $x_i = x_j$ for every edge $(i,j)$ - i.e., $\mathbf{x}$ is constant on each connected component.
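
The edge-sum identity is easy to test numerically; this sketch uses a random weighted graph (the seed, size, and sparsity threshold are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = np.triu(rng.random((n, n)), 1)
W = (W > 0.5) * W                    # keep a random subset of weighted edges
A = W + W.T
L = np.diag(A.sum(axis=1)) - A

x = rng.standard_normal(n)
quad = x @ L @ x
edge_sum = sum(A[i, j] * (x[i] - x[j]) ** 2
               for i in range(n) for j in range(i + 1, n))
assert np.isclose(quad, edge_sum)    # Dirichlet energy, summed edge by edge
assert quad >= 0                     # hence L is positive semidefinite
```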

For AI: Graph regularization in semi-supervised learning minimizes $\mathbf{f}^\top L \mathbf{f}$ subject to labeling constraints. This penalizes label functions that change rapidly across edges - a smoothness prior that says "connected nodes likely have the same label."

2.4 Normalized Laplacians

Two normalized variants of the Laplacian are used in practice:

Symmetric normalized Laplacian:

$$L_{\text{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}$$

with entries:

$$\left(L_{\text{sym}}\right)_{ij} = \begin{cases} 1 & i = j \\ -w_{ij}/\sqrt{d_i d_j} & (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$

Properties: symmetric; eigenvalues in $[0, 2]$; the spectrum is symmetric about $1$ iff the graph is bipartite, and $2$ is an eigenvalue iff some component is bipartite; the eigenvectors are $\tilde{\mathbf{u}}_k = D^{1/2}\mathbf{u}_k$, where the $\mathbf{u}_k$ are eigenvectors of $L_{\text{rw}}$.

Random-walk normalized Laplacian:

$$L_{\text{rw}} = D^{-1}L = I - D^{-1}A = I - P$$

with $P = D^{-1}A$ the random-walk transition matrix. Properties: not symmetric, but it has the same eigenvalues as $L_{\text{sym}}$ (the two matrices are similar). Eigenvalues in $[0, 2]$. The eigenvectors of $L_{\text{rw}}$ for eigenvalue $0$ are the vectors constant on each component.

When to use which:

| Laplacian | Use case | Why |
|---|---|---|
| $L = D - A$ | Graphs with uniform degree; theoretical proofs | Simplest form |
| $L_{\text{sym}}$ | Spectral clustering (Ng et al.); GCN normalization | Symmetric, hence orthogonal eigenvectors |
| $L_{\text{rw}}$ | Random-walk analysis; Shi-Malik NCut | Direct connection to $P$ |

For AI (GCN connection): The GCN propagation rule uses $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, the symmetric normalized adjacency of the graph with self-loops - equivalently, $I - L_{\text{sym}}$ of the augmented graph.
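
A minimal sketch of that normalization step (the 3-vertex adjacency is illustrative only):

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
A_tilde = A + np.eye(3)                    # add self-loops
d_tilde = A_tilde.sum(axis=1)
D_inv_sqrt = np.diag(d_tilde ** -0.5)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # D~^{-1/2} A~ D~^{-1/2}

# A_hat = I - L_sym of the augmented graph, so its spectrum lies in (-1, 1]
assert np.linalg.eigvalsh(A_hat).max() <= 1.0 + 1e-9
```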

2.5 Spectra of Special Graphs

Closed-form eigenvalues for key graph families provide calibration and test cases:

Complete graph $K_n$: $A = \mathbf{1}\mathbf{1}^\top - I$. Eigenvalues of $A$: $n-1$ (once) and $-1$ ($n-1$ times). Eigenvalues of $L$: $0$ (once) and $n$ ($n-1$ times). The graph is maximally connected: $\lambda_2(L) = n$.

Path graph $P_n$: vertices $\{1, \ldots, n\}$, edges $\{(i, i+1)\}$. Eigenvalues of $L$:

$$\lambda_k = 2 - 2\cos\left(\frac{(k-1)\pi}{n}\right), \quad k = 1, 2, \ldots, n$$

So $\lambda_1 = 0$ and $\lambda_2 = 2 - 2\cos(\pi/n) \approx \pi^2/n^2$ for large $n$ - very small. This reflects the intuition that a long path is easy to cut (just remove the middle edge).

Cycle graph $C_n$: eigenvalues of $L$:

$$\lambda_k = 2 - 2\cos\left(\frac{2\pi(k-1)}{n}\right), \quad k = 1, 2, \ldots, n$$

Each nonzero eigenvalue appears twice (indices $k$ and $n+2-k$ give the same value), and for even $n$ the spectrum is symmetric around $\lambda = 2$ (the even cycle is bipartite). $\lambda_2 = 2 - 2\cos(2\pi/n) \approx 4\pi^2/n^2$ for large $n$.

Star graph $S_n$: one hub connected to $n-1$ leaves. Eigenvalues of $L$: $0$ (once), $1$ ($n-2$ times), $n$ (once). $\lambda_2 = 1$ regardless of how many leaves there are - the star is easy to disconnect (remove the hub).

Complete bipartite graph $K_{n/2,n/2}$ (which is $d$-regular with $d = n/2$): eigenvalues of $L$: $0$ (once), $d$ ($n-2$ times), $2d$ (once); the eigenvalue $2d$ signals bipartiteness.
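
These closed forms make good unit tests for any Laplacian code; a sketch checking three of them against a dense eigensolver:

```python
import numpy as np

def laplacian(A):
    return np.diag(A.sum(axis=1)) - A

n = 8
# Complete graph K_n: spectrum {0} followed by n (n-1 times)
A_complete = np.ones((n, n)) - np.eye(n)
assert np.allclose(np.linalg.eigvalsh(laplacian(A_complete)),
                   [0] + [n] * (n - 1))

# Path P_n: lambda_k = 2 - 2 cos((k-1) pi / n)
A_path = np.diag(np.ones(n - 1), 1); A_path = A_path + A_path.T
k = np.arange(n)
assert np.allclose(np.linalg.eigvalsh(laplacian(A_path)),
                   2 - 2 * np.cos(k * np.pi / n))

# Star S_n: spectrum {0, 1 (n-2 times), n}
A_star = np.zeros((n, n)); A_star[0, 1:] = A_star[1:, 0] = 1
assert np.allclose(np.linalg.eigvalsh(laplacian(A_star)),
                   [0] + [1] * (n - 2) + [n])
```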

2.6 Characteristic Polynomial and Cospectral Graphs

The characteristic polynomial of a graph $G$ is $p_G(\lambda) = \det(\lambda I - A)$. Its roots are the eigenvalues of $A$. The coefficients of $p_G$ are spectral invariants: the sum of the eigenvalues equals $\operatorname{tr}(A) = 0$ (no self-loops); the sum of their squares equals $\operatorname{tr}(A^2) = 2m$.

Cospectral (isospectral) graphs are non-isomorphic graphs with identical characteristic polynomials. The smallest pair is the star $K_{1,4}$ and the disjoint union $C_4 \cup K_1$, both on five vertices; Schwenk (1973) went further and showed that almost every tree has a cospectral mate. Cospectrality shows that the spectrum does not uniquely determine a graph - a fundamental limitation of spectral methods. For graph isomorphism testing, the Weisfeiler-Lehman test (05) captures structure that the spectrum misses.

For AI: The WL-expressiveness hierarchy of GNNs (Xu et al., 2019) parallels this cospectrality result. GNNs based on spectral convolution can distinguish everything the Laplacian spectrum distinguishes - but no more. This is one motivation for higher-order GNNs and attention-based methods.


3. The Fundamental Quadratic Form and PSD Structure

3.1 Dirichlet Energy

The quadratic form $\mathbf{x}^\top L \mathbf{x} = \sum_{(i,j)\in E}(x_i - x_j)^2$ is called the Dirichlet energy (or graph Dirichlet form) of the signal $\mathbf{x}: V \to \mathbb{R}$.

This name comes from the continuous analogue: for a function $f: \Omega \to \mathbb{R}$ on a domain $\Omega \subset \mathbb{R}^d$, the Dirichlet energy is $\int_\Omega \lVert \nabla f \rVert^2\, d\mathbf{x}$, which measures the total variation (smoothness) of $f$. The graph Laplacian $L$ is the discrete analogue of $-\Delta$ (the negative Laplacian), and $\mathbf{x}^\top L \mathbf{x}$ is the discrete Dirichlet energy.

Interpretations by context:

| Context | What $\mathbf{x}^\top L \mathbf{x}$ measures |
|---|---|
| Social network | Total disagreement when $x_i \in \{-1, +1\}$ labels communities |
| Signal on graph | Total variation (roughness) of the signal across edges |
| Temperature field | Total heat flux across edges at steady state |
| Node embeddings | "Embedding strain" - how much nearby nodes differ |
| Semi-supervised labels | Penalty for assigning different labels to connected nodes |

Critical point of Dirichlet energy. The Rayleigh quotient $R(\mathbf{x}) = \mathbf{x}^\top L \mathbf{x} / \lVert \mathbf{x} \rVert^2$ is minimized by the eigenvector with the smallest eigenvalue. Constrained to $\mathbf{x} \perp \mathbf{1}$ (orthogonal to the trivial null vector), the minimum is achieved by $\mathbf{u}_2$, the Fiedler vector.

3.2 Proof That $L \succeq 0$

Theorem. For any undirected weighted graph with non-negative edge weights, $L \succeq 0$.

Proof. For any $\mathbf{x} \in \mathbb{R}^n$:

$$\mathbf{x}^\top L \mathbf{x} = \sum_{(i,j)\in E} w_{ij}(x_i - x_j)^2 \geq 0$$

since $w_{ij} \geq 0$ and $(x_i - x_j)^2 \geq 0$ for all real numbers. $\square$

Corollary. All eigenvalues of $L$ are non-negative: $0 \leq \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$.

Corollary. $\mathbf{1}$ is always an eigenvector with $\lambda_1 = 0$, since $L\mathbf{1} = (D - A)\mathbf{1} = \mathbf{d} - A\mathbf{1} = \mathbf{0}$ (where $\mathbf{d}$ is the degree vector, equal to $A\mathbf{1}$).

Strengthened result for normalized Laplacians. For $L_{\text{sym}}$: since $L_{\text{sym}} = D^{-1/2} L D^{-1/2}$ and $L \succeq 0$, we have $L_{\text{sym}} \succeq 0$. Moreover, $\mathbf{x}^\top L_{\text{sym}} \mathbf{x} \leq 2\lVert \mathbf{x} \rVert^2$ for all $\mathbf{x}$, so $\lambda_k(L_{\text{sym}}) \in [0, 2]$.

3.3 Connected Components via the Null Space

Theorem (Fiedler, 1973). The multiplicity of the eigenvalue $0$ of the graph Laplacian $L$ equals the number of connected components of $G$.

Proof.

$(\Rightarrow)$ Suppose $G$ has $k$ connected components $C_1, C_2, \ldots, C_k$. For each component $C_\ell$, define $\mathbf{v}^\ell \in \mathbb{R}^n$ as its indicator vector: $v^\ell_i = 1$ if $i \in C_\ell$, else $0$. Then $L\mathbf{v}^\ell = \mathbf{0}$, because for any vertex $i \in C_\ell$:

$$(L\mathbf{v}^\ell)_i = d_i v^\ell_i - \sum_{j:(i,j)\in E} v^\ell_j = d_i - d_i = 0$$

(all neighbors of $i$ are also in $C_\ell$, since components are isolated from one another). The $k$ vectors $\mathbf{v}^1, \ldots, \mathbf{v}^k$ are linearly independent, so $\dim(\ker L) \geq k$.

$(\Leftarrow)$ Suppose $L\mathbf{x} = \mathbf{0}$. Then $0 = \mathbf{x}^\top L \mathbf{x} = \sum_{(i,j)\in E}(x_i - x_j)^2$, which forces $x_i = x_j$ for every edge $(i,j)$. Thus $\mathbf{x}$ is constant on each connected component. The dimension of the space of such functions equals the number of components, so $\dim(\ker L) \leq k$.

Combining both directions, $\dim(\ker L) = k$. $\square$

Examples:

  • Connected graph ($k=1$): $\lambda_1 = 0$ is simple; $\lambda_2 > 0$.
  • Graph with $2$ isolated components: $\lambda_1 = \lambda_2 = 0$; $\lambda_3 > 0$.
  • Path $P_n$: always connected; $\lambda_2 = 2 - 2\cos(\pi/n) > 0$.

Non-example: for a disconnected graph with components of different sizes, the eigenvectors for $\lambda = 0$ are NOT all-ones vectors but rather indicator vectors of the components.
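
The theorem is easy to watch in action; this sketch assembles two disjoint triangles, counts zero eigenvalues, then adds a bridging edge:

```python
import numpy as np

def laplacian(A):
    return np.diag(A.sum(axis=1)) - A

tri = np.ones((3, 3)) - np.eye(3)
A = np.block([[tri, np.zeros((3, 3))],
              [np.zeros((3, 3)), tri]])    # two connected components
lam = np.linalg.eigvalsh(laplacian(A))
assert np.sum(np.isclose(lam, 0.0)) == 2   # multiplicity of 0 equals 2

A[0, 3] = A[3, 0] = 1                      # bridge the two triangles
lam = np.linalg.eigvalsh(laplacian(A))
assert np.sum(np.isclose(lam, 0.0)) == 1   # now connected: a single 0
```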

3.4 Eigenvalue Bounds and Interlacing

Upper bound. For any connected graph:

$$\lambda_n(L) \leq 2d_{\max}$$

where $d_{\max}$ is the maximum degree. For $d$-regular graphs, $\lambda_n(L) = 2d$ iff the graph is bipartite.

Lower bound on $\lambda_2$. From the Cheeger inequality (full treatment in 5):

$$\lambda_2(L_{\text{sym}}) \geq \frac{h(G)^2}{2}$$

where $h(G)$ is the Cheeger constant.

Interlacing theorem. Let $M$ be the principal submatrix of $L_G$ obtained by deleting the rows and columns of $n - m$ vertices. Cauchy interlacing gives:

$$\lambda_i(L_G) \leq \lambda_i(M) \leq \lambda_{n-m+i}(L_G) \quad \text{for } i = 1, \ldots, m$$

(Note that $M$ is not quite the Laplacian of the induced subgraph: it retains the boundary degrees.) Similarly, deleting a single edge $e$ interlaces the spectra: $\lambda_i(L_{G-e}) \leq \lambda_i(L_G) \leq \lambda_{i+1}(L_{G-e})$. Interlacing means that removing vertices or edges perturbs the spectrum in a controlled way - in particular, $\lambda_2$ cannot increase when an edge is deleted - which is used in structural arguments about graph connectivity after removals.

3.5 Courant-Fischer Minimax Theorem

The Courant-Fischer theorem provides a variational characterization of every eigenvalue of a symmetric matrix. For the graph Laplacian $L$ with eigenvalues $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$:

$$\lambda_k = \min_{\substack{S \subseteq \mathbb{R}^n \\ \dim(S) = k}} \; \max_{\substack{\mathbf{x} \in S \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top L \mathbf{x}}{\lVert \mathbf{x} \rVert^2}$$

In particular, the Fiedler value has the characterization:

$$\lambda_2 = \min_{\mathbf{x} \perp \mathbf{1},\ \mathbf{x} \neq \mathbf{0}} \frac{\mathbf{x}^\top L \mathbf{x}}{\lVert \mathbf{x} \rVert^2} = \min_{\mathbf{x} \perp \mathbf{1},\ \lVert \mathbf{x} \rVert = 1} \sum_{(i,j)\in E}(x_i - x_j)^2$$

Proof sketch. Write $\mathbf{x} = \sum_k c_k \mathbf{u}_k$ in the eigenbasis. Then $\mathbf{x}^\top L \mathbf{x} = \sum_k \lambda_k c_k^2$ and $\lVert \mathbf{x} \rVert^2 = \sum_k c_k^2$. The Rayleigh quotient $\sum_k \lambda_k c_k^2 / \sum_k c_k^2$ is a convex combination of eigenvalues. Requiring $\mathbf{x} \perp \mathbf{u}_1 = \mathbf{1}/\sqrt{n}$ forces $c_1 = 0$, making the minimum $\lambda_2$ (achieved when $c_2 = 1$ and all other coefficients are $0$).

Practical use. Courant-Fischer justifies using $\mathbf{u}_2$ as the graph bisection vector: it solves the continuous relaxation of the minimum bisection problem, as we prove in 4 and 8.


4. Algebraic Connectivity and the Fiedler Vector

4.1 Algebraic Connectivity $\lambda_2$

Definition. The algebraic connectivity of a graph $G$ is $a(G) = \lambda_2(L)$, the second-smallest eigenvalue of the graph Laplacian. It is also called the Fiedler value.

Theorem (Fiedler, 1973). $\lambda_2(L) > 0$ if and only if $G$ is connected.

This follows directly from 3.3: $\lambda_2 = 0$ iff there are at least $2$ connected components.

Why "algebraic" connectivity? The classical combinatorial connectivity κ(G)\kappa(G) (minimum number of vertices whose removal disconnects GG) is NP-hard to compute in general. The algebraic connectivity λ2\lambda_2 provides a computable lower bound:

λ2κ(G)δ(G)\lambda_2 \leq \kappa(G) \leq \delta(G)

where δ(G)\delta(G) is the minimum degree. This inequality chain says: algebraic connectivity \leq vertex connectivity \leq minimum degree.

Sensitivity. When a single edge $(u,v)$ is added to a graph, $\lambda_2$ can increase by at most $2$. When an edge is removed, $\lambda_2$ can decrease by at most $2$. This quantifies how much the connectivity changes with each graph edit - useful in robust network design.

Regular graphs. For a $d$-regular graph on $n$ vertices:

$$\lambda_2(L) = d - \mu_2(A)$$

where $\mu_2(A)$ is the second-largest adjacency eigenvalue (the largest one not equal to $d$). For regular graphs, the spectral gap $d - \mu_2$ of the adjacency matrix and the algebraic connectivity are the same quantity.

4.2 The Fiedler Vector

Definition. The Fiedler vector $\mathbf{u}_2$ is the eigenvector of $L$ corresponding to $\lambda_2$.

The Fiedler vector assigns a real number $(\mathbf{u}_2)_i$ to each vertex $i$. Vertices with positive values are assigned to one "side" of the graph; vertices with negative values to the other. This is the basis of spectral bisection.

Spectral bisection algorithm:

  1. Compute the Fiedler vector $\mathbf{u}_2$.
  2. Partition $V = S \cup \bar{S}$ by the sign of $(\mathbf{u}_2)_i$: let $S = \{i : (\mathbf{u}_2)_i \geq 0\}$.
  3. The edges between $S$ and $\bar{S}$ form the "spectral cut."

Why does this work? The Courant-Fischer theorem says $\mathbf{u}_2$ minimizes the Dirichlet energy $\sum_{(i,j)\in E}(x_i - x_j)^2$ subject to $\mathbf{x} \perp \mathbf{1}$ and $\lVert \mathbf{x} \rVert = 1$. If we further constrain $x_i \in \{-c, +c\}$ (a discrete two-way partition), we get the NP-hard graph bisection problem. The Fiedler vector is the continuous relaxation of this discrete problem - the best we can do efficiently.

The ordering property. Sorting vertices by their Fiedler value $(\mathbf{u}_2)_i$ reveals the community structure of the graph. Vertices in the same community tend to have similar values; the transition from negative to positive marks the community boundary.
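
A compact sketch of spectral bisection on a hand-planted two-community graph (the sizes, probabilities, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        same = (i < n // 2) == (j < n // 2)
        if rng.random() < (0.8 if same else 0.05):  # dense inside, sparse across
            A[i, j] = A[j, i] = 1

L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)          # eigenvalues in ascending order
fiedler = U[:, 1]                   # eigenvector for lambda_2
S = fiedler >= 0                    # sign-based partition
print("lambda_2 =", lam[1])
print("partition:", S.astype(int))  # recovers the two planted halves
```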

For AI: Spectral bisection is used in:

  • Circuit partitioning (VLSI design): split a circuit graph across two chips to minimize inter-chip connections
  • Domain decomposition (PDE solvers): partition a mesh graph for parallel computation
  • Community detection in knowledge graphs: find the two most separated communities in a KG

4.3 Bounding Graph Properties via $\lambda_2$

Diameter bound (Mohar, 1991):

$$\operatorname{diam}(G) \leq \left\lfloor \frac{2\ln(n-1)}{\ln\left(\lambda_n / (\lambda_n - \lambda_2)\right)} \right\rfloor$$

A simpler bound: $\operatorname{diam}(G) \leq 2n/\lambda_2$ (rough but useful).

Vertex connectivity. For any connected graph:

$$\lambda_2(L) \leq \kappa(G)$$

where $\kappa(G)$ is the vertex connectivity (the minimum number of vertices to remove to disconnect the graph). A large $\lambda_2$ means the graph is robustly connected.

Conductance. The conductance $\Phi(G) = \min_{S:\, \operatorname{vol}(S) \leq \operatorname{vol}(V)/2} \frac{|E(S,\bar{S})|}{\operatorname{vol}(S)}$ measures the minimum normalized cut. The Cheeger inequality (5) gives:

$$\frac{\lambda_2(L_{\text{sym}})}{2} \leq \Phi(G) \leq \sqrt{2\lambda_2(L_{\text{sym}})}$$

Isoperimetric number. The Cheeger constant $h(G)$ (using $|S|$ instead of $\operatorname{vol}(S)$) satisfies the same type of inequality with the unnormalized $\lambda_2$.

4.4 Computing the Fiedler Vector in Practice

For small graphs ($n$ up to a few thousand), compute all eigenvalues of $L$ directly with a dense symmetric eigensolver (scipy.linalg.eigh). The second column of the eigenvector matrix is $\mathbf{u}_2$.

For large sparse graphs, use iterative methods:

Lanczos algorithm: builds a tridiagonal matrix $T$ from the Krylov vectors $\{\mathbf{v}, L\mathbf{v}, L^2\mathbf{v}, \ldots\}$ and converges to extreme eigenvalues fastest. The Fiedler vector belongs to the smallest nonzero eigenvalue, which requires the shift-invert trick: compute the largest eigenvalues of $(L + \epsilon I)^{-1}$ for small $\epsilon > 0$, discarding the constant eigenvector.

Inverse power iteration with deflation: since $\lambda_1 = 0$ is known with eigenvector $\mathbf{1}$, it can be deflated away. Initialize with a random $\mathbf{x} \perp \mathbf{1}$, then repeatedly solve $L\mathbf{y} = \mathbf{x}$ (a sparse linear solve), normalize, and re-orthogonalize against $\mathbf{1}$. The error contracts by a factor of roughly $\lambda_2/\lambda_3$ per iteration.

Randomized Nystrom approximation: for graphs with $n > 10^6$, approximate the low-rank spectral structure by randomized sampling of the Laplacian; spectral sparsification (Spielman & Srivastava, 2011) plays a similar role at this scale.
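
A sketch of the sparse route with scipy's shift-invert Lanczos; the cycle graph stands in for a large sparse input, and the tiny negative shift plays the role of the $(L + \epsilon I)^{-1}$ trick above:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 5000
rows = np.arange(n)
A = sp.coo_matrix((np.ones(n), (rows, (rows + 1) % n)), shape=(n, n))
A = ((A + A.T) > 0).astype(float).tocsr()            # sparse cycle C_n
L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A

# two eigenpairs nearest zero via shift-invert (the shift keeps L - sigma*I nonsingular)
lam, U = eigsh(L, k=2, sigma=-1e-6, which="LM")
order = np.argsort(lam)
fiedler_value, fiedler_vector = lam[order[1]], U[:, order[1]]
print("lambda_2 =", fiedler_value,
      "vs closed form:", 2 - 2 * np.cos(2 * np.pi / n))
```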

4.5 AI Application: Community Detection

Community detection - finding groups of densely interconnected nodes - is one of the most practically important graph problems. Spectral methods are the gold standard for quality guarantees.

Planted partition model. Generate a graph with $k$ communities of size $n/k$ each, with intra-community edge probability $p_{\text{in}}$ and inter-community probability $p_{\text{out}} \ll p_{\text{in}}$. Spectral methods recover the communities exactly once the gap $p_{\text{in}} - p_{\text{out}}$ exceeds a threshold of order $\sqrt{\ln n / n}$; pinning down the exact constant is the content of the information-theoretic recovery threshold for the Stochastic Block Model.

Knowledge graph clustering. In a knowledge graph (KG) like Freebase or Wikidata, entities form communities by topic (sports, science, politics). The Fiedler vector of the KG adjacency graph separates these clusters. The resulting community structure can be used to create topic-specific sub-KGs for retrieval-augmented generation.


5. Cheeger's Inequality and Graph Expansion

5.1 The Cheeger Constant h(G)

Definition. For a graph $G = (V, E)$ and a subset $S \subseteq V$, the edge boundary $\partial S$ is the set of edges between $S$ and its complement $\bar{S} = V \setminus S$:

$$\partial S = \{(i,j) \in E : i \in S,\ j \in \bar{S}\}$$

The conductance (or isoperimetric ratio) of the cut $(S, \bar{S})$ is:

$$\Phi(S) = \frac{|\partial S|}{\min(|S|, |\bar{S}|)} \quad \text{(unnormalized)} \qquad h(S) = \frac{|\partial S|}{\min(\operatorname{vol}(S), \operatorname{vol}(\bar{S}))} \quad \text{(normalized)}$$

The Cheeger constant (or isoperimetric number) of $G$ is:

$$h(G) = \min_{\emptyset \neq S \subsetneq V} h(S)$$

This is the minimum conductance cut: the partition that minimizes the fraction of edges leaving the smaller side relative to its volume. A small $h(G)$ means the graph has a bottleneck - a small number of edges separating a large fraction of the volume.

Computing $h(G)$ is NP-hard. This is a major motivation for the Cheeger inequality, which gives a polynomial-time algorithm (via $\lambda_2$) to find a cut of conductance at most $2\sqrt{h(G)}$ - quadratically close to optimal.

Examples:

  • Path $P_n$: remove the middle edge; $h(P_n) \approx 2/n$. Very small: the path has a severe bottleneck.
  • Complete graph $K_n$: every cut has $|\partial S| = |S|\,|\bar{S}|$ and $\operatorname{vol}(S) = |S|(n-1)$, so $h(S) = |\bar{S}|/(n-1)$; minimizing over $|S| \leq n/2$ gives $h(K_n) = \frac{n}{2(n-1)} \approx 1/2$.
  • Expander graph (5.3): $h(G) = \Omega(1)$ - bounded below by a constant, independent of $n$.

5.2 Cheeger's Inequality

Theorem (Alon & Milman, 1985; Dodziuk, 1984). For any undirected graph GG:

λ2(Lsym)2h(G)2λ2(Lsym)\frac{\lambda_2(L_{\text{sym}})}{2} \leq h(G) \leq \sqrt{2\lambda_2(L_{\text{sym}})}

Proof of the left inequality (easy direction). We show $\lambda_2 \leq 2h(G)$ by exhibiting a test vector $\mathbf{x}$ with Rayleigh quotient at most $2h(G)$.

Let $S^*$ be the optimal Cheeger cut, with $h(S^*) = h(G)$. Define:

$$x_i = \begin{cases} 1/\operatorname{vol}(S^*) & i \in S^* \\ -1/\operatorname{vol}(\bar{S}^*) & i \in \bar{S}^* \end{cases}$$

This $\mathbf{x}$ is orthogonal (in the $D$-weighted inner product) to the constant vector, which plays the role of $\mathbf{1}$ for the normalized Laplacian. Only the edges of $\partial S^*$ contribute to the energy, and across each such edge $x_i - x_j = 1/\operatorname{vol}(S^*) + 1/\operatorname{vol}(\bar{S}^*)$, so:

$$\mathbf{x}^\top L \mathbf{x} = |\partial S^*|\left(\frac{1}{\operatorname{vol}(S^*)} + \frac{1}{\operatorname{vol}(\bar{S}^*)}\right)^2, \qquad \mathbf{x}^\top D \mathbf{x} = \frac{1}{\operatorname{vol}(S^*)} + \frac{1}{\operatorname{vol}(\bar{S}^*)}$$

The generalized Rayleigh quotient is therefore:

$$R(\mathbf{x}) = \frac{\mathbf{x}^\top L \mathbf{x}}{\mathbf{x}^\top D \mathbf{x}} = |\partial S^*|\left(\frac{1}{\operatorname{vol}(S^*)} + \frac{1}{\operatorname{vol}(\bar{S}^*)}\right) \leq \frac{2\,|\partial S^*|}{\min(\operatorname{vol}(S^*), \operatorname{vol}(\bar{S}^*))} = 2h(G)$$

Since $\lambda_2 = \min R(\mathbf{x})$ over this constraint set, $\lambda_2 \leq 2h(G)$.

Proof of the right inequality (hard direction). We show $h(G) \leq \sqrt{2\lambda_2}$.

Given the Fiedler vector $\mathbf{u}_2$, sort the vertices so that $u_{2,1} \leq u_{2,2} \leq \cdots \leq u_{2,n}$. For each threshold $t$, let $S_t = \{i : u_{2,i} \leq t\}$, and sweep over all $n-1$ distinct thresholds. By a discrete analogue of the co-area formula from differential geometry, a volume-weighted average of the conductances of these cuts is at most $\sqrt{2\lambda_2}$, so in particular:

$$\min_t h(S_t) \leq \sqrt{2\lambda_2}$$

The key step uses the Cauchy-Schwarz inequality together with the fact that $\lambda_2 = R(\mathbf{u}_2)$ is the Rayleigh quotient of the Fiedler vector. $\square$

Tightness. The left bound is tight for expanders (5.3). The right bound is tight for paths and other bottleneck graphs, where $\lambda_2 \approx h^2/2$.

Practical implication. Given $\lambda_2$, we know $h(G) \in [\lambda_2/2, \sqrt{2\lambda_2}]$. More importantly, the proof of the right inequality is constructive: the sweep over Fiedler-vector thresholds finds a cut with conductance at most $\sqrt{2\lambda_2}$.
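
A sketch of that constructive sweep (the helper name and the two-triangle test graph are mine):

```python
import numpy as np

def sweep_cut(A):
    """Return (best conductance, best set, lambda_2) from a Fiedler sweep."""
    d = A.sum(axis=1).astype(float)
    L = np.diag(d) - A
    Dm = np.diag(d ** -0.5)
    lam, U = np.linalg.eigh(Dm @ L @ Dm)      # normalized Laplacian L_sym
    order = np.argsort(Dm @ U[:, 1])          # sort vertices by Fiedler value
    vol_total = d.sum()
    best_h, best_S = np.inf, None
    for t in range(1, len(d)):                # scan all n-1 threshold cuts
        S = order[:t]
        comp = np.setdiff1d(order, S)
        cut = A[np.ix_(S, comp)].sum()        # edges crossing the cut
        h = cut / min(d[S].sum(), vol_total - d[S].sum())
        if h < best_h:
            best_h, best_S = h, set(S.tolist())
    return best_h, best_S, lam[1]

# two triangles joined by one edge: an obvious bottleneck
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
h, S, lam2 = sweep_cut(A)
assert h <= np.sqrt(2 * lam2)                 # the Cheeger guarantee
print(h, S)
```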

5.3 Expander Graphs

Definition. A family of graphs $\{G_n\}$ is an $(n, d, \lambda)$-expander family if:

  • Each $G_n$ has $n$ vertices and is $d$-regular
  • The nontrivial adjacency eigenvalues satisfy $\max(|\mu_2|, |\mu_n|) \leq \lambda < d$
  • The spectral gap $d - \lambda$ is bounded below by a positive constant

Equivalently (by Cheeger): $h(G_n) = \Omega(1)$, i.e., the Cheeger constant is bounded below uniformly in $n$.

Why expanders matter:

  1. Communication networks: In a $d$-regular expander on $n$ nodes, any message can be routed between any two nodes in $O(\log n)$ hops, using only $d$ connections per node. This is optimal for constant-degree networks.
  2. Error-correcting codes: Expander codes (Sipser & Spielman, 1996) achieve linear-time encoding/decoding of codes close to the Shannon capacity.
  3. Derandomization: Random walks on expanders mix in $O(\log n)$ steps, so short random walks serve as good randomness sources for pseudorandom generators.
  4. GNN depth: A GNN on an expander graph propagates information across the entire graph in $O(\log n)$ layers. This is why expanders are ideal benchmarks for deep GNNs.

Ramanujan graphs. The spectral gap of a $d$-regular graph cannot be made arbitrarily large: $\lambda \geq 2\sqrt{d-1} - o(1)$ as $n \to \infty$ (Alon-Boppana theorem). Graphs achieving $\lambda \leq 2\sqrt{d-1}$ are called Ramanujan graphs - the optimal expanders. Explicit Ramanujan constructions (Lubotzky, Phillips, Sarnak, 1988; Margulis, 1988) use deep number theory.

5.4 Random Walk Mixing Time

The random walk on $G$ defined by the transition matrix $P = D^{-1}A$ has stationary distribution $\boldsymbol{\pi}$ with $\pi_i = d_i/\operatorname{vol}(V)$. The mixing time is the number of steps needed for the walk to get close to the stationary distribution:

$$t_{\text{mix}}(\epsilon) = \min\left\{t : \max_i \lVert (P^t)_{i,:} - \boldsymbol{\pi} \rVert_1 \leq \epsilon\right\}$$

Spectral mixing bound. Let $\alpha = \max(|\mu_2(P)|, |\mu_n(P)|)$ be the second-largest eigenvalue of $P$ in absolute value. When negative eigenvalues are not the bottleneck (e.g., for the lazy walk below), $\alpha = \mu_2(P) = 1 - \lambda_2(L_{\text{rw}})$. Then:

$$t_{\text{mix}}(\epsilon) \leq \frac{\ln(n/\epsilon)}{1 - \alpha}$$

Interpretation: the spectral gap $1 - \alpha$ governs the mixing time. A large spectral gap means fast mixing. For expanders with constant spectral gap, $t_{\text{mix}} = O(\log n)$. For paths, $\lambda_2 = O(1/n^2)$, so $t_{\text{mix}} = O(n^2 \log n)$.

Proof sketch. Write the initial distribution as $\boldsymbol{\delta}_i = \boldsymbol{\pi} + \sum_{k \geq 2} c_k \boldsymbol{\phi}_k$, where the $\boldsymbol{\phi}_k$ are eigenvectors of $P$ (eigenvalues $1 = \mu_1 > \mu_2 \geq \cdots \geq \mu_n \geq -1$). After $t$ steps, $(P^t\boldsymbol{\delta}_i)_j = \pi_j + \sum_{k\geq 2} c_k \mu_k^t (\boldsymbol{\phi}_k)_j$. The deviation decays like $\alpha^t$, giving $t_{\text{mix}} \leq \ln\left(1/(\epsilon\,\pi_{\min})\right)/\ln(1/\alpha)$.

Lazy walk. To avoid oscillation when $\mu_n \approx -1$ (bipartite-like graphs), use the lazy random walk $P' = (I + P)/2$. Its eigenvalues are $(1 + \mu_k)/2 \in [0, 1]$, avoiding negative eigenvalues.
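
A small experiment comparing empirical mixing on a cycle with this picture (graph size and tolerance are arbitrary):

```python
import numpy as np

n = 30
A = np.zeros((n, n))
idx = np.arange(n)
A[idx, (idx + 1) % n] = A[(idx + 1) % n, idx] = 1  # cycle C_n
P = A / A.sum(axis=1, keepdims=True)
P_lazy = 0.5 * (np.eye(n) + P)                     # lazy walk: spectrum in [0, 1]

pi = np.full(n, 1.0 / n)                           # stationary (regular graph)
dist = np.zeros(n); dist[0] = 1.0                  # start at vertex 0
t = 0
while np.abs(dist - pi).sum() > 0.01:
    dist = dist @ P_lazy
    t += 1
print("empirical mixing steps:", t)                # scales like n^2 for the cycle
```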

5.5 AI Connection: Over-Smoothing as Diffusion

Over-smoothing is the well-documented phenomenon in deep GNNs where node representations become indistinguishable as the number of layers increases (Li et al., 2018; Oono & Suzuki, 2020). Spectral theory provides the exact mechanism:

A $k$-layer GCN computes (roughly) $\hat{A}^k X W$, where $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ is the normalized adjacency with self-loops. The eigenvalues of $\hat{A}$ satisfy $\hat{\mu}_i = 1 - \tilde{\lambda}_i \in (-1, 1]$, where the $\tilde{\lambda}_i$ are eigenvalues of $L_{\text{sym}}$ of the augmented graph. Writing $\hat{A} = U \operatorname{diag}(\hat{\mu}) U^\top$, after $k$ layers:

$$(\hat{A}^k)_{ij} = \sum_\ell \hat{\mu}_\ell^k\, U_{i\ell} U_{j\ell} \xrightarrow{k \to \infty} U_{i1} U_{j1} = \frac{\sqrt{\tilde{d}_i \tilde{d}_j}}{\operatorname{vol}(V)}$$

since $\hat{\mu}_1 = 1$ and all other $|\hat{\mu}_\ell| < 1$. All node features converge to a vector proportional to $\sqrt{\tilde{d}_i}$, determined only by degree - all other structural information is lost.

Rate of collapse. The convergence rate is governed by the spectral gap: $\hat{\mu}_2 = 1 - \lambda_2(L_{\text{sym}}) < 1$. Collapse is faster on expanders (large $\lambda_2$), slower on bottleneck graphs. This is counterintuitive: the "most connected" graphs (expanders) over-smooth fastest.
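
The collapse is visible in a few lines; this sketch runs 50 linear "layers" on a random graph (seed and density are arbitrary, and the sampled graph is assumed connected for the limit to apply globally):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
A = np.triu((rng.random((n, n)) < 0.4).astype(float), 1)
A = A + A.T
A_tilde = A + np.eye(n)                        # self-loops
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))      # symmetric normalization

X = rng.standard_normal((n, 3))                # random node features
for _ in range(50):                            # 50 linear GCN layers
    X = A_hat @ X

limit = np.sqrt(d)                             # predicted limiting direction
cos = (X[:, 0] @ limit) / (np.linalg.norm(X[:, 0]) * np.linalg.norm(limit))
print("cosine with sqrt-degree vector:", cos)  # approaches +1 or -1
```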

Mitigation strategies:

  • Residual connections (GCNII, Chen et al., 2020): $H^{(k+1)} = \sigma\left((1-\alpha)\hat{A}H^{(k)}W^{(k)} + \alpha H^{(0)}\right)$ - preserve initial features
  • DropEdge (Rong et al., 2020): randomly remove edges during training, reducing the effective $k$
  • PairNorm (Zhao & Akoglu, 2020): explicitly normalize pairwise distances to prevent collapse
  • Jumping knowledge (Xu et al., 2018): aggregate representations from all layers

Forward reference: The full architecture-level treatment of over-smoothing, including the WL expressiveness hierarchy and architectural mitigations, is in 11-05 Graph Neural Networks.


6. Graph Fourier Transform and Signal Processing

6.1 Classical Fourier Analogy

The classical Fourier transform on $\mathbb{R}^n$ decomposes a function $f$ into a linear combination of eigenfunctions of the Laplace operator $\Delta$:

$$\hat{f}(\boldsymbol{\omega}) = \int_{\mathbb{R}^n} f(\mathbf{x})\, e^{-i\boldsymbol{\omega}^\top \mathbf{x}}\, d\mathbf{x}$$

The functions $e^{i\boldsymbol{\omega}^\top \mathbf{x}}$ are eigenfunctions of the continuous Laplacian: $-\Delta e^{i\boldsymbol{\omega}^\top \mathbf{x}} = \lVert \boldsymbol{\omega} \rVert^2 e^{i\boldsymbol{\omega}^\top \mathbf{x}}$.

On a graph, the Laplacian $L$ plays the role of $-\Delta$, and its eigenvectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n$ (with eigenvalues $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$) play the role of the complex exponentials $e^{i\boldsymbol{\omega}^\top \mathbf{x}}$.

The analogy:

FOURIER TRANSFORM ANALOGY
========================================================================

  Classical Fourier                   Graph Fourier
  ---------------------------------   ---------------------------------
  Domain          R^n                  Vertex set V (finite)
  Operator        -Delta (Laplacian)   L = D - A (graph Laplacian)
  Eigenfunctions  exp(i w.x)           Eigenvectors u_1, u_2, ..., u_n
  Frequencies     ||w||^2 in [0, inf)  Eigenvalues lambda_1 <= ... <= lambda_n
  Low freq.       ||w|| small: smooth  lambda_k small: smooth on graph
  High freq.      ||w|| large: rapid   lambda_k large: rapid variation
  Transform       Continuous integral  Finite matrix multiply (U^T x)

========================================================================

This analogy is the conceptual foundation for defining convolution, filtering, and signal processing on irregular graph domains.

6.2 Graph Fourier Transform

Definition. Let $L = U \Lambda U^\top$ be the eigendecomposition of the graph Laplacian, with $U = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n]$ the matrix whose columns are the eigenvectors. For a signal $\mathbf{x} \in \mathbb{R}^n$ (assigning a value $x_i$ to each vertex $i$), the Graph Fourier Transform (GFT) is:

$$\hat{\mathbf{x}} = U^\top \mathbf{x} \in \mathbb{R}^n$$

The inverse GFT is:

$$\mathbf{x} = U\hat{\mathbf{x}} = \sum_{k=1}^n \hat{x}_k \mathbf{u}_k$$

Properties:

  1. Parseval's theorem: $\lVert \hat{\mathbf{x}} \rVert^2 = \lVert \mathbf{x} \rVert^2$ (since $U$ is orthogonal).
  2. Linearity: $\widehat{\mathbf{x} + \mathbf{y}} = \hat{\mathbf{x}} + \hat{\mathbf{y}}$.
  3. Energy decomposition: $\lVert \mathbf{x} \rVert^2 = \sum_k \hat{x}_k^2$ (energy in each frequency component).
  4. Shift property: there is no clean "shift theorem" for graphs as there is for the DFT, because graphs lack translation symmetry. This is a fundamental difference.

Graph convolution. The convolution of two signals $\mathbf{x}$ and $\mathbf{y}$ on a graph is defined spectrally:

$$\mathbf{x} \star_G \mathbf{y} = U(\hat{\mathbf{x}} \odot \hat{\mathbf{y}}) = U \operatorname{diag}(\hat{\mathbf{x}})\, \hat{\mathbf{y}}$$

where $\odot$ is element-wise multiplication. This is the analogue of the convolution theorem: convolution in the vertex domain equals pointwise multiplication in the spectral domain.

Limitation of the full GFT: computing $U$ requires $O(n^3)$ time (a full eigendecomposition). For large graphs ($n > 10^4$), this is infeasible, motivating the polynomial approximations of 7.
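
A sketch of the GFT as plain matrix algebra, with a Parseval check; the two-community test graph is an arbitrary construction:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 12
blocks = np.arange(n) < n // 2                 # two planted communities
A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        p = 0.9 if blocks[i] == blocks[j] else 0.05
        if rng.random() < p:
            A[i, j] = A[j, i] = 1
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)                     # L = U diag(lam) U^T

x = blocks.astype(float)                       # community indicator: a smooth signal
x_hat = U.T @ x                                # forward GFT
assert np.isclose((x_hat ** 2).sum(), (x ** 2).sum())  # Parseval
assert np.allclose(U @ x_hat, x)               # inverse GFT

print("energy in the 2 lowest frequencies:",
      (x_hat[:2] ** 2).sum() / (x_hat ** 2).sum())     # close to 1
```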

6.3 Frequency Interpretation

The $k$-th frequency component $\hat{x}_k = \langle \mathbf{u}_k, \mathbf{x} \rangle = \mathbf{u}_k^\top \mathbf{x}$ measures how much of the signal $\mathbf{x}$ "oscillates at frequency $\lambda_k$."

Low-frequency signals correspond to small $\lambda_k$: the eigenvectors $\mathbf{u}_k$ for small eigenvalues vary smoothly across edges (since $\mathbf{u}_k^\top L \mathbf{u}_k = \lambda_k$ is small). A signal concentrated in low frequencies is smooth: nearby vertices have similar values.

High-frequency signals correspond to large $\lambda_k$: the eigenvectors for large eigenvalues oscillate rapidly, with $(\mathbf{u}_k)_i$ and $(\mathbf{u}_k)_j$ having opposite signs across many edges $(i,j)$. A pure high-frequency signal looks like a checkerboard on the graph.

Example on a path graph. For $P_n$, the eigenvectors are $u_{k,i} = \sqrt{2/n}\,\cos\left((k-1)\pi(2i-1)/(2n)\right)$ - the discrete cosine transform (DCT) basis. The eigenvalues $\lambda_k = 2 - 2\cos((k-1)\pi/n)$ are a monotone function of the DCT frequencies ($2 - 2\cos\theta \approx \theta^2$ for small $\theta$). The GFT on a path is the DCT.

Example on a community graph. A graph with two tightly connected communities has:

  • $\mathbf{u}_1 = \mathbf{1}/\sqrt{n}$: constant (DC component)
  • $\mathbf{u}_2$: Fiedler vector, positive on community 1, negative on community 2 - the community membership function is a low-frequency signal
  • $\mathbf{u}_n$: highest frequency, alternates sign on bipartite-like structure

6.4 Dirichlet Energy Revisited

The Dirichlet energy decomposes cleanly in the spectral domain:

$$\mathbf{x}^\top L \mathbf{x} = \hat{\mathbf{x}}^\top \Lambda \hat{\mathbf{x}} = \sum_{k=1}^n \lambda_k \hat{x}_k^2$$

This is the "power spectrum" interpretation: the Dirichlet energy is the weighted sum of spectral components, weighted by frequency. A signal is smooth (low Dirichlet energy) iff its energy is concentrated in low-frequency components (λk\lambda_k small).

Spectral analysis of node features. Given a node feature matrix $X \in \mathbb{R}^{n \times d}$, we can compute the spectral content of each feature dimension:

$$\operatorname{Dirichlet}(X_{:,j}) = \sum_{k=1}^n \lambda_k \hat{X}_{kj}^2$$

Feature dimensions with low Dirichlet energy are "community-consistent" features (e.g., political affiliation in a social network). Feature dimensions with high Dirichlet energy are "noisy" local features.

For AI: Graph regularization in semi-supervised learning minimizes:

$$\mathcal{L} = \mathcal{L}_{\text{supervised}} + \gamma \sum_j \mathbf{f}_{:,j}^\top L\, \mathbf{f}_{:,j}$$

This penalizes high-frequency components in the predicted label function $\mathbf{f}$, implementing a "smoothness prior": connected nodes likely have the same label.

6.5 Uncertainty Principle on Graphs

In classical signal processing, the Heisenberg uncertainty principle states that a signal cannot be simultaneously concentrated in both time and frequency: the product of the time spread and the frequency spread is at least $1/(4\pi)$.

On graphs, an analogous uncertainty principle holds (Agaskar & Lu, 2013):

$$\Delta_G(\mathbf{x})^2 \cdot \Delta_S(\mathbf{x})^2 \geq C$$

where $\Delta_G$ measures how localized $\mathbf{x}$ is in the vertex domain (concentrated on a small set of vertices), $\Delta_S$ measures how localized $\hat{\mathbf{x}}$ is in the spectral domain (concentrated on a small band of frequencies), and $C$ is a constant depending on the graph structure.

Implications for graph signal processing:

  • A signal perfectly localized on a single vertex ($\Delta_G = 0$) is spread across all frequencies ($\Delta_S$ maximal)
  • Smooth signals (concentrated on low frequencies, small $\Delta_S$) are necessarily spread across many vertices (large $\Delta_G$)
  • This tradeoff motivates graph wavelets (11.3): basis functions that are approximately localized in both the vertex and spectral domains

6.6 AI Application: Node Feature Smoothing

Label propagation (Zhou et al., 2004) is a classic semi-supervised learning algorithm that directly implements low-pass graph filtering. Starting from a partially labeled graph, labels propagate according to:

$$\mathbf{F}^{(t+1)} = \alpha \hat{A}\mathbf{F}^{(t)} + (1-\alpha)Y$$

where $Y$ is the initial label matrix (zeros for unlabeled nodes), $\hat{A}$ is the normalized adjacency, and $\alpha \in (0,1)$ controls the smoothing strength. In the spectral domain, this converges to:

$$\mathbf{F}^* = (1-\alpha)(I - \alpha\hat{A})^{-1} Y = \sum_k \frac{1-\alpha}{1 - \alpha(1-\tilde{\lambda}_k)}\, \hat{Y}_k\, \mathbf{u}_k$$

The filter $g(\lambda) = \dfrac{1-\alpha}{1 - \alpha(1-\lambda)}$ is a low-pass filter: it attenuates high-frequency components (large $\lambda$) more than low-frequency ones.
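
A sketch comparing the iteration with its closed form (graph, seed, and $\alpha$ are arbitrary; self-loops keep every degree positive):

```python
import numpy as np

rng = np.random.default_rng(4)
n, alpha = 10, 0.9
A = np.triu((rng.random((n, n)) < 0.35).astype(float), 1)
A = A + A.T + np.eye(n)                         # self-loops: all degrees > 0
d = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(d, d))             # normalized adjacency

Y = np.zeros((n, 2)); Y[0, 0] = Y[-1, 1] = 1.0  # two labeled seed nodes
F = Y.copy()
for _ in range(500):                            # propagate and re-inject labels
    F = alpha * A_hat @ F + (1 - alpha) * Y

F_closed = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * A_hat, Y)
assert np.allclose(F, F_closed, atol=1e-6)      # iteration matches closed form
```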

For modern LLMs: When an LLM reasons over a knowledge graph, smooth graph signals correspond to consistent facts (nearby entities agree), while high-frequency signals correspond to noise or inconsistencies. Spectral filtering provides a principled way to denoise knowledge graphs before retrieval.

