Spectral Graph Theory

"To understand a graph, listen to its spectrum. The eigenvalues of the Laplacian are the resonant frequencies of the graph - they reveal clusters, bottlenecks, expansion, and the rate at which information diffuses across every edge."

Overview

Spectral graph theory is the study of graphs through the eigenvalues and eigenvectors of matrices naturally associated with them - principally the adjacency matrix $A$, the degree matrix $D$, and the graph Laplacian $L = D - A$. The central insight is that algebraic properties of these matrices correspond precisely to combinatorial and geometric properties of the graph: the number of connected components equals the multiplicity of eigenvalue zero; the second-smallest eigenvalue $\lambda_2$ quantifies how "hard" the graph is to disconnect; the eigenvectors of $L$ form a natural Fourier basis for signals defined on the graph.

This connection between spectral algebra and graph topology has made spectral graph theory one of the most productive areas of modern discrete mathematics - and, increasingly, one of the most important mathematical foundations for machine learning. Spectral clustering (Shi & Malik, 2000; Ng, Jordan & Weiss, 2002) remains a gold-standard unsupervised learning method for non-convex clusters. Graph Convolutional Networks (Kipf & Welling, 2017) are derived from first principles as first-order Chebyshev approximations to spectral filters. Laplacian positional encodings power modern graph Transformers (Dwivedi et al., 2022; GPS, 2022). Even language model attention matrices can be analyzed as weighted graphs whose spectral properties reveal information flow.

This section develops the full theory from scratch. We begin with the three fundamental graph matrices and their spectral properties, build up to the Cheeger inequality and expander graphs, construct the graph Fourier transform, derive spectral clustering rigorously, and connect everything to modern AI applications. Students who complete this section will have the mathematical fluency to read GNN papers, design graph-based ML systems, and understand why spectral methods work when they work - and why they fail when they do.

Prerequisites

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations: Laplacian spectra, Fiedler vector bisection, Cheeger inequality, Graph Fourier Transform, spectral clustering, Laplacian eigenmaps, PageRank |
| exercises.ipynb | 8 graded exercises from Laplacian PSD proofs through spectral clustering and Laplacian positional encodings |

Learning Objectives

After completing this section, you will:

  • Construct the adjacency matrix $A$, degree matrix $D$, and graph Laplacian $L = D - A$ for any graph, and derive the normalized variants $L_{\text{sym}}$ and $L_{\text{rw}}$
  • Prove that the graph Laplacian is positive semidefinite using the quadratic form $\mathbf{x}^\top L \mathbf{x} = \sum_{(i,j)\in E}(x_i - x_j)^2$
  • State and prove that the multiplicity of eigenvalue $0$ of $L$ equals the number of connected components
  • Define algebraic connectivity $\lambda_2(L)$ and interpret the Fiedler vector as a graph bisection tool
  • State Cheeger's inequality $h^2/2 \leq \lambda_2 \leq 2h$ and explain its implications for expander graphs and random walk mixing
  • Define the Graph Fourier Transform and interpret graph signals in the spectral domain
  • Derive spectral clustering (RatioCut and NCut) from graph partitioning relaxations
  • Implement the Laplacian eigenmaps algorithm and connect it to spectral positional encodings in graph Transformers
  • Derive the GCN layer as a first-order Chebyshev approximation to a spectral filter
  • Analyze PageRank as a spectral problem on directed graphs

1. Intuition

1.1 Hearing the Shape of a Graph

In 1966, mathematician Mark Kac posed the question: "Can you hear the shape of a drum?" - meaning, can you reconstruct the geometry of a vibrating membrane from the frequencies it produces? The question turned out to have a negative answer in general, but it crystallized one of the deepest ideas in mathematics: the spectrum of a differential operator encodes geometric information.

Spectral graph theory asks the same question for discrete structures. A graph $G = (V, E)$ has an associated matrix - the Laplacian $L$ - whose eigenvalues form the graph spectrum. These numbers are not arbitrary: they encode whether the graph is connected, how tightly its communities are glued together, how quickly a random walk mixes across its edges, how hard it is to cut the graph in two.

Think of a social network. Each person is a node; each friendship is an edge. The graph has "natural frequencies": a society with two isolated groups (a disconnected graph) has a different spectrum from one that is fully interconnected. The small eigenvalues of $L$ correspond to smooth, slowly-varying signals - the overall community membership function. The large eigenvalues correspond to rapidly-oscillating signals - the microscopic variation from person to person. This is the graph analogue of low and high frequencies in audio.

Three statements, each surprising when first encountered, that spectral graph theory makes precise:

  1. The number of connected components of $G$ equals the number of times $0$ appears as an eigenvalue of $L$.
  2. The second-smallest eigenvalue $\lambda_2(L)$ - the "Fiedler value" - tells you how hard it is to disconnect the graph. A graph is harder to cut when $\lambda_2$ is larger.
  3. The eigenvector corresponding to $\lambda_2$ - the "Fiedler vector" - assigns a real number to each vertex, and the sign of this number tells you which side of the best bisection each vertex belongs to.

These are not vague analogies. They are theorems with proofs, and they form the backbone of a theory that has become indispensable in machine learning.

1.2 The Three Graph Matrices

For a graph $G = (V, E)$ with $n = |V|$ vertices and $m = |E|$ edges, three matrices appear constantly:

Adjacency matrix $A \in \mathbb{R}^{n \times n}$:

$$A_{ij} = \begin{cases} 1 & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$

For undirected graphs, $A$ is symmetric. For weighted graphs, $A_{ij} = w_{ij}$, the edge weight. The adjacency matrix encodes the direct connections in the graph.

Degree matrix $D \in \mathbb{R}^{n \times n}$: a diagonal matrix with $D_{ii} = d_i = \sum_j A_{ij}$, the degree of vertex $i$. For weighted graphs, $d_i = \sum_j w_{ij}$ is the weighted degree (also called strength).

Graph Laplacian $L = D - A$: the central object of spectral graph theory. Explicitly:

$$L_{ij} = \begin{cases} d_i & \text{if } i = j \\ -1 & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$

The Laplacian is named after Pierre-Simon Laplace because it is the discrete analogue of the continuous Laplace operator $\Delta = \sum_i \partial^2/\partial x_i^2$. For a function $f: V \to \mathbb{R}$ defined on the vertices:

$$(Lf)_i = \sum_{j: (i,j) \in E} (f(i) - f(j)) = d_i f(i) - \sum_{j: (i,j) \in E} f(j)$$

This is the "discrete second derivative" - it measures how much the value at vertex $i$ differs from the average value among its neighbors.

For AI: In a Graph Neural Network, the operation $\tilde{A} \mathbf{H}$ (multiplying node features by the adjacency matrix with self-loops) is equivalent to computing $(\tilde{D} - \tilde{L})\mathbf{H}$. The Laplacian is implicitly present in every GNN layer.
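
A minimal numpy sketch of the three matrices (the four-vertex toy graph below is illustrative, not taken from the companion notebooks):

```python
import numpy as np

# Toy undirected graph: a triangle {0,1,2} with a pendant vertex 3 attached to 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n = 4

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0          # adjacency (symmetric, unweighted)

D = np.diag(A.sum(axis=1))            # degree matrix
L = D - A                             # unnormalized graph Laplacian

print(L)
# Every row sums to zero, so the all-ones vector is in the null space: L @ 1 = 0.
print(L @ np.ones(n))                 # -> [0. 0. 0. 0.]
```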

1.3 Why Eigenvalues Reveal Structure

The Laplacian $L$ is a real symmetric positive semidefinite matrix. By the spectral theorem (03-Advanced-Linear-Algebra), it has a complete orthonormal basis of eigenvectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n$ with real non-negative eigenvalues:

$$0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$$

Why is $\lambda_1 = 0$ always? Because $L \mathbf{1} = \mathbf{0}$ - the all-ones vector is always in the null space of $L$ (every row of $L$ sums to zero). The constant function "assign the same value to every vertex" has zero variation across every edge, so it has zero energy.

The deeper result: $\lambda_1 = \lambda_2 = \cdots = \lambda_k = 0$ if and only if the graph has exactly $k$ connected components. On a disconnected graph with $k$ components, the eigenvectors for eigenvalue $0$ are the indicator vectors of each component.

The first nonzero eigenvalue $\lambda_2$ - if it exists - is called the algebraic connectivity or Fiedler value (after Miroslav Fiedler, who proved its key properties in 1973). A larger $\lambda_2$ means the graph is harder to disconnect; a $\lambda_2$ close to zero means there is almost a disconnection - a bottleneck.

The largest eigenvalue $\lambda_n$ gives the spectral radius of the Laplacian and satisfies $\lambda_n \leq 2 d_{\max}$.

1.4 Historical Timeline

SPECTRAL GRAPH THEORY - HISTORICAL TIMELINE
========================================================================

  1847  Kirchhoff          - Matrix-Tree theorem; Laplacian for electrical circuits
  1931  Whitney            - Graph isomorphism; chromatic polynomials
  1954  Collatz &          - Systematic study of graph spectra begins
        Sinogowitz
  1970  Cheeger            - Cheeger inequality (originally for manifolds)
  1973  Fiedler            - Algebraic connectivity; Fiedler vector; graph bisection
  1985  Alon & Milman      - Discrete Cheeger inequality for graphs
  2000  Shi & Malik        - Normalized Cuts and image segmentation
  2001  Belkin & Niyogi    - Laplacian eigenmaps for manifold learning
  2002  Ng, Jordan, Weiss  - Spectral clustering algorithm (the standard version)
  2004  Spielman & Teng    - Spectral sparsification; fast Laplacian solvers
  2011  Hammond et al.     - Wavelets on graphs
  2014  Bruna et al.       - Spectral graph CNNs (first spectral GNN)
  2016  Defferrard et al.  - ChebNet: Chebyshev polynomial filters on graphs
  2017  Kipf & Welling     - GCN: first-order Chebyshev -> simple spatial rule
  2022  Dwivedi et al.     - Laplacian positional encodings for graph Transformers
  2022  Rampasek et al.    - GPS: General, Powerful, Scalable graph Transformer with spectral PE

========================================================================

1.5 Roadmap of the Section

This section follows a deliberate progression from foundational algebra to modern AI applications:

SECTION ROADMAP
========================================================================

  2 Graph Matrices          Build the algebraic objects
         down
  3 Quadratic Form / PSD    Prove fundamental spectral properties
         down
  4 Fiedler Vector          Connect \lambda_2 to graph connectivity
         down
  5 Cheeger Inequality      Connect \lambda_2 to cut structure and mixing
         down
  6 Graph Fourier Transform  Signal processing on graphs
         down
  7 Spectral Filtering      From Fourier to polynomial approximations
         down
  8 Spectral Clustering     Partition graphs via eigenvectors
         down
  9 Laplacian Eigenmaps     Embed graphs; PE for transformers
         down
  10 Directed Graphs        PageRank; complex eigenvalues
         down
  11 Advanced Topics        Sparsification; wavelets; random matrices
         down
  12 ML Applications        KGs, molecules, LLM attention analysis

========================================================================

2. Graph Matrices and Their Spectra

2.1 Adjacency Matrix: Spectral View

Definition. For $G = (V, E)$ with $n$ vertices, the adjacency matrix $A \in \mathbb{R}^{n \times n}$ is defined by $A_{ij} = w_{ij}$ if $(i,j) \in E$ and $A_{ij} = 0$ otherwise (with $w_{ij} = 1$ for unweighted graphs).

Key spectral property: Walk counting. The $(i,j)$ entry of $A^k$ counts the number of walks of length exactly $k$ from vertex $i$ to vertex $j$. This follows by induction: $(A^k)_{ij} = \sum_\ell (A^{k-1})_{i\ell} A_{\ell j}$ sums over all ways to reach $j$ in $k$ steps by first taking $k-1$ steps to reach $\ell$, then one step to $j$.

For AI: This walk-counting property is the spectral justification for why a $k$-layer GNN can "see" information from the $k$-hop neighborhood. The matrix $A^k$ is what a linear GNN with $k$ layers computes.

Eigenvalues of $A$. For an undirected graph, $A$ is symmetric, so all eigenvalues are real. Let $\mu_1 \geq \mu_2 \geq \cdots \geq \mu_n$ denote the eigenvalues of $A$ in decreasing order. Key bounds:

  • For any graph: $|\mu_i| \leq d_{\max}$ (the maximum degree), since $\lVert A \rVert_2 = \mu_1 \leq d_{\max}$.
  • For a $d$-regular graph: $\mu_1 = d$ with eigenvector $\mathbf{1}/\sqrt{n}$.
  • Bipartite graphs have symmetric spectra: $\mu_i = -\mu_{n+1-i}$.
  • The number of distinct eigenvalues is at least $\operatorname{diam}(G) + 1$ (where $\operatorname{diam}$ is graph diameter).

Non-examples of symmetry: For a directed graph, $A$ is not symmetric and eigenvalues may be complex. This is why directed spectral theory (10) requires separate treatment.

2.2 Degree Matrix and Volume

The degree matrix $D = \operatorname{diag}(d_1, d_2, \ldots, d_n)$ is diagonal with $D_{ii} = d_i = \sum_j A_{ij}$.

Volume. For a subset $S \subseteq V$, the volume is $\operatorname{vol}(S) = \sum_{i \in S} d_i$. For the full graph, $\operatorname{vol}(V) = 2m$ (each edge contributes 2 to the total degree sum). Volume plays the role of "mass" in the normalized Laplacian theory.

For a $d$-regular graph, $D = d \cdot I_n$ and $\operatorname{vol}(S) = d |S|$, making the theory particularly clean. Most derivations proceed with general $D$ but reduce to simpler formulas in the regular case.

Random walk transition matrix. The matrix $P = D^{-1} A$ is a row-stochastic matrix: $\sum_j P_{ij} = 1$ for all $i$. It defines a random walk on the graph: from vertex $i$, move to neighbor $j$ with probability $A_{ij}/d_i$. The stationary distribution of this walk is $\boldsymbol{\pi}$ with $\pi_i = d_i / \operatorname{vol}(V)$ - proportional to degree. This connection between $D$, $A$, and random walks is central to the normalized Laplacian theory.

2.3 Unnormalized Laplacian L = D - A

Definition. $L = D - A$. Entry-by-entry:

$$L_{ij} = \begin{cases} d_i & i = j \\ -w_{ij} & (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$

Every row (and column) sums to zero: $\sum_j L_{ij} = d_i - \sum_{j:(i,j)\in E} w_{ij} = 0$. Equivalently, $L \mathbf{1} = \mathbf{0}$.

The fundamental quadratic form:

$$\mathbf{x}^\top L \mathbf{x} = \sum_{(i,j) \in E} w_{ij}(x_i - x_j)^2 \quad \text{for all } \mathbf{x} \in \mathbb{R}^n$$

Proof:

$$\mathbf{x}^\top L \mathbf{x} = \mathbf{x}^\top D \mathbf{x} - \mathbf{x}^\top A \mathbf{x} = \sum_i d_i x_i^2 - \sum_{(i,j)\in E} 2 w_{ij} x_i x_j = \sum_{(i,j)\in E} w_{ij}(x_i^2 + x_j^2) - 2\sum_{(i,j)\in E} w_{ij} x_i x_j = \sum_{(i,j)\in E} w_{ij}(x_i - x_j)^2 \geq 0$$

Since $w_{ij} \geq 0$, this is always non-negative: $L \succeq 0$.

Geometric meaning: $\mathbf{x}^\top L \mathbf{x}$ measures the total variation of the signal $\mathbf{x}: V \to \mathbb{R}$ across all edges. It is zero if and only if $x_i = x_j$ for all edges $(i,j)$ - i.e., $\mathbf{x}$ is constant on each connected component.

For AI: Graph regularization in semi-supervised learning minimizes $\mathbf{f}^\top L \mathbf{f}$ subject to labeling constraints. This penalizes label functions that change rapidly across edges - a smoothness prior that says "connected nodes likely have the same label."
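
A short numerical check of the quadratic-form identity and positive semidefiniteness (the random weighted graph below is assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random weighted undirected graph on n vertices (weights in (0, 1]).
n = 20
W = np.triu(rng.random((n, n)) * (rng.random((n, n)) < 0.3), k=1)
A = W + W.T
L = np.diag(A.sum(axis=1)) - A

x = rng.standard_normal(n)

quad_form = x @ L @ x
edge_sum = sum(A[i, j] * (x[i] - x[j]) ** 2
               for i in range(n) for j in range(i + 1, n) if A[i, j] > 0)

assert np.isclose(quad_form, edge_sum)          # x^T L x = sum over edges of w_ij (x_i - x_j)^2
assert np.all(np.linalg.eigvalsh(L) >= -1e-10)  # L is positive semidefinite
```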

2.4 Normalized Laplacians

Two normalized variants of the Laplacian are used in practice:

Symmetric normalized Laplacian:

$$L_{\text{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}$$

with entries:

$$\left(L_{\text{sym}}\right)_{ij} = \begin{cases} 1 & i = j \\ -w_{ij}/\sqrt{d_i d_j} & (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$

Properties: Symmetric; eigenvalues in $[0, 2]$; the spectrum is symmetric around $1$ (i.e., $\lambda$ is an eigenvalue iff $2 - \lambda$ is) iff the graph is bipartite; the eigenvectors are $\tilde{\mathbf{u}}_k = D^{1/2} \mathbf{u}_k$ where $\mathbf{u}_k$ are eigenvectors of $L_{\text{rw}}$.

Random-walk normalized Laplacian:

$$L_{\text{rw}} = D^{-1} L = I - D^{-1} A = I - P$$

with $P = D^{-1}A$ the random walk transition matrix. Properties: Not symmetric, but has the same eigenvalues as $L_{\text{sym}}$ (they are similar matrices). Eigenvalues in $[0, 2]$. The eigenvectors of $L_{\text{rw}}$ for eigenvalue $0$ are the constant vectors on each component.

When to use which:

| Laplacian | Use case | Why |
|---|---|---|
| $L = D - A$ | Graphs with uniform degree; theoretical proofs | Simplest form |
| $L_{\text{sym}}$ | Spectral clustering (Ng et al.); GCN normalization | Symmetric -> orthogonal eigenvectors |
| $L_{\text{rw}}$ | Random walk analysis; Shi-Malik NCut | Direct connection to $P$ |

For AI (GCN connection): The GCN propagation rule $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ uses the symmetric normalized adjacency of the graph with self-loops - equivalently, $I - L_{\text{sym}}$ of the augmented graph.

2.5 Spectra of Special Graphs

Closed-form eigenvalues for key graph families provide calibration and test cases:

Complete graph $K_n$: $A = \mathbf{1}\mathbf{1}^\top - I$. Eigenvalues of $A$: $n-1$ (once) and $-1$ ($n-1$ times). Eigenvalues of $L$: $0$ (once) and $n$ ($n-1$ times). The graph is maximally connected: $\lambda_2(L) = n$.

Path graph $P_n$: Vertices $\{1, \ldots, n\}$, edges $\{(i, i+1)\}$. Eigenvalues of $L$:

$$\lambda_k = 2 - 2\cos\!\left(\frac{(k-1)\pi}{n}\right), \quad k = 1, 2, \ldots, n$$

So $\lambda_1 = 0$ and $\lambda_2 = 2 - 2\cos(\pi/n) \approx \pi^2/n^2$ for large $n$ - very small. This reflects the intuition that a long path is easy to cut (just remove the middle edge).

Cycle graph $C_n$: Eigenvalues of $L$:

$$\lambda_k = 2 - 2\cos\!\left(\frac{2\pi(k-1)}{n}\right), \quad k = 1, 2, \ldots, n$$

For even $n$ (the bipartite case) the spectrum is symmetric around $\lambda = 2$. Here $\lambda_2 = 2 - 2\cos(2\pi/n) \approx 4\pi^2/n^2$ for large $n$.

Star graph $S_n$: One hub connected to $n-1$ leaves. Eigenvalues of $L$: $0$ (once), $1$ ($n-2$ times), $n$ (once). $\lambda_2 = 1$ regardless of how many leaves there are - the star is easy to disconnect (remove the hub).

$d$-regular bipartite graph $K_{n/2, n/2}$: Eigenvalues of $L$: $0, d, d, \ldots, d, 2d$ with the pattern dictated by the bipartite structure; eigenvalue $2d$ indicates bipartiteness.
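
A sketch that checks the closed-form spectra above against numpy's eigensolver (the helper function and graph size are illustrative):

```python
import numpy as np

def laplacian(edges, n):
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

n = 8
path  = laplacian([(i, i + 1) for i in range(n - 1)], n)
cycle = laplacian([(i, (i + 1) % n) for i in range(n)], n)
star  = laplacian([(0, i) for i in range(1, n)], n)

# Closed-form spectra from the text (index k = 0, ..., n-1 here).
k = np.arange(n)
path_formula  = 2 - 2 * np.cos(k * np.pi / n)
cycle_formula = 2 - 2 * np.cos(2 * np.pi * k / n)
star_formula  = np.array([0.0] + [1.0] * (n - 2) + [float(n)])

assert np.allclose(np.sort(np.linalg.eigvalsh(path)),  np.sort(path_formula))
assert np.allclose(np.sort(np.linalg.eigvalsh(cycle)), np.sort(cycle_formula))
assert np.allclose(np.sort(np.linalg.eigvalsh(star)),  np.sort(star_formula))
```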

2.6 Characteristic Polynomial and Cospectral Graphs

The characteristic polynomial of a graph $G$ is $p_G(\lambda) = \det(\lambda I - A)$. The roots are the eigenvalues of $A$. The coefficients of $p_G$ are spectral invariants: the sum of eigenvalues equals $\operatorname{tr}(A) = 0$ (no self-loops); the sum of squares of eigenvalues equals $\operatorname{tr}(A^2) = 2m$.

Cospectral (isospectral) graphs are non-isomorphic graphs with identical characteristic polynomials. The smallest pair has 5 vertices ($K_{1,4}$ and $C_4 \cup K_1$); Schwenk (1973) showed that almost all trees have a cospectral mate. Cospectrality shows that the spectrum does not uniquely determine a graph - a fundamental limitation of spectral methods. For graph isomorphism testing, the Weisfeiler-Lehman test (05) captures structure that the spectrum misses.

For AI: The WL-expressiveness hierarchy of GNNs (Xu et al., 2019) parallels this cospectrality result. GNNs based on spectral convolution can distinguish everything the Laplacian spectrum distinguishes - but no more. This is one motivation for higher-order GNNs and attention-based methods.


3. The Fundamental Quadratic Form and PSD Structure

3.1 Dirichlet Energy

The quadratic form $\mathbf{x}^\top L \mathbf{x} = \sum_{(i,j)\in E}(x_i - x_j)^2$ is called the Dirichlet energy (or graph Dirichlet form) of the signal $\mathbf{x}: V \to \mathbb{R}$.

This name comes from the continuous analogue: for a function $f: \Omega \to \mathbb{R}$ on a domain $\Omega \subset \mathbb{R}^d$, the Dirichlet energy is $\int_\Omega \lVert \nabla f \rVert^2 \, d\mathbf{x}$, which measures the total variation (smoothness) of $f$. The graph Laplacian $L$ is the discrete analogue of $-\Delta$ (the negative Laplacian), and $\mathbf{x}^\top L \mathbf{x}$ is the discrete Dirichlet energy.

Interpretations by context:

| Context | What $\mathbf{x}^\top L \mathbf{x}$ measures |
|---|---|
| Social network | Total disagreement when $x_i \in \{-1, +1\}$ labels communities |
| Signal on graph | Total variation (roughness) of the signal across edges |
| Temperature field | Total heat flux across edges at steady state |
| Node embeddings | "Embedding strain" - how much nearby nodes differ |
| Semi-supervised labels | Penalty for assigning different labels to connected nodes |

Critical point of Dirichlet energy. The Rayleigh quotient $R(\mathbf{x}) = \mathbf{x}^\top L \mathbf{x} / \lVert \mathbf{x} \rVert^2$ is minimized by the eigenvector with smallest eigenvalue. Constrained to $\mathbf{x} \perp \mathbf{1}$ (orthogonal to the trivial null vector), the minimum is achieved by $\mathbf{u}_2$, the Fiedler vector.

3.2 Proof That L \succeq 0

Theorem. For any undirected weighted graph with non-negative edge weights, $L \succeq 0$.

Proof. For any $\mathbf{x} \in \mathbb{R}^n$:

$$\mathbf{x}^\top L \mathbf{x} = \sum_{(i,j) \in E} w_{ij}(x_i - x_j)^2 \geq 0$$

since $w_{ij} \geq 0$ and $(x_i - x_j)^2 \geq 0$ for all real numbers. $\square$

Corollary. All eigenvalues of $L$ are non-negative: $0 \leq \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$.

Corollary. $\mathbf{1}$ is always an eigenvector with $\lambda_1 = 0$, since $L\mathbf{1} = (D - A)\mathbf{1} = \mathbf{d} - A\mathbf{1} = \mathbf{0}$ (where $\mathbf{d}$ is the degree vector, equal to $A\mathbf{1}$).

Strengthened result for normalized Laplacians. For $L_{\text{sym}}$: since $L_{\text{sym}} = D^{-1/2}LD^{-1/2}$ and $L \succeq 0$, we have $L_{\text{sym}} \succeq 0$. Moreover, $\mathbf{x}^\top L_{\text{sym}} \mathbf{x} \leq 2\lVert \mathbf{x} \rVert^2$ for all $\mathbf{x}$, so $\lambda_k(L_{\text{sym}}) \in [0, 2]$.

3.3 Connected Components via the Null Space

Theorem (Fiedler, 1973). The multiplicity of eigenvalue $0$ of the graph Laplacian $L$ equals the number of connected components of $G$.

Proof.

$(\Rightarrow)$ Suppose $G$ has $k$ connected components $C_1, C_2, \ldots, C_k$. For each component $C_\ell$, define $\mathbf{v}^\ell \in \mathbb{R}^n$ as the indicator vector of $C_\ell$: $v^\ell_i = 1$ if $i \in C_\ell$, else $0$. Then $L\mathbf{v}^\ell = \mathbf{0}$ because for any vertex $i \in C_\ell$:

$$(L\mathbf{v}^\ell)_i = d_i v^\ell_i - \sum_{j:(i,j)\in E} v^\ell_j = d_i - d_i = 0$$

(all neighbors of $i$ are also in $C_\ell$ since components are isolated from each other). The $k$ vectors $\mathbf{v}^1, \ldots, \mathbf{v}^k$ are linearly independent, so $\dim(\ker L) \geq k$.

$(\Leftarrow)$ Suppose $L\mathbf{x} = \mathbf{0}$. Then $0 = \mathbf{x}^\top L \mathbf{x} = \sum_{(i,j)\in E}(x_i - x_j)^2$, which forces $x_i = x_j$ for every edge $(i,j)$. Thus $\mathbf{x}$ is constant on each connected component. The dimension of the space of such functions equals the number of components. So $\dim(\ker L) \leq k$.

Combining both directions, $\dim(\ker L) = k$. $\square$

Examples:

  • Connected graph ($k=1$): $\lambda_1 = 0$ is simple; $\lambda_2 > 0$.
  • Graph with 2 isolated components: $\lambda_1 = \lambda_2 = 0$; $\lambda_3 > 0$.
  • Path $P_n$: always connected; $\lambda_2 = 2 - 2\cos(\pi/n) > 0$.

Non-example: For a disconnected graph, the eigenvectors for $\lambda = 0$ are NOT all-ones vectors on all of $V$ but rather indicator vectors of the components (or linear combinations of them).
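
A quick numerical illustration of the theorem (two disjoint triangles are assumed as a toy example):

```python
import numpy as np

def laplacian(edges, n):
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

# Two disjoint triangles: {0,1,2} and {3,4,5}  ->  2 connected components.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
L = laplacian(edges, 6)

eigvals, eigvecs = np.linalg.eigh(L)
num_zero = np.sum(np.abs(eigvals) < 1e-10)
print(num_zero)                   # -> 2, the number of connected components

# The eigenvectors for eigenvalue 0 span the indicator vectors of the two components.
print(np.round(eigvecs[:, :2], 3))
```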

3.4 Eigenvalue Bounds and Interlacing

Upper bound. For any connected graph:

$$\lambda_n(L) \leq 2 d_{\max}$$

where $d_{\max}$ is the maximum degree. For $d$-regular graphs: $\lambda_n(L) = 2d$ iff the graph is bipartite.

Lower bound on $\lambda_2$. From the Cheeger inequality (full treatment in 5):

$$\lambda_2 \geq \frac{h(G)^2}{2}$$

where $h(G)$ is the Cheeger constant.

Interlacing theorem. Let $H$ be an induced subgraph of $G$ on $m < n$ vertices, with Laplacian eigenvalues $0 = \mu_1 \leq \mu_2 \leq \cdots \leq \mu_m$. Then:

$$\lambda_i(L_G) \leq \lambda_i(L_H) \leq \lambda_{n-m+i}(L_G) \quad \text{for } i = 1, \ldots, m$$

Interlacing constrains how much the Laplacian spectrum can shift when vertices are removed from a graph; it is used in structural arguments about graph connectivity after vertex removal.

3.5 Courant-Fischer Minimax Theorem

The Courant-Fischer theorem provides a variational characterization of every eigenvalue of a symmetric matrix. For the graph Laplacian $L$ with eigenvalues $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$:

$$\lambda_k = \min_{\substack{S \leq \mathbb{R}^n \\ \dim(S) = k}} \; \max_{\substack{\mathbf{x} \in S \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^\top L \mathbf{x}}{\lVert \mathbf{x} \rVert^2}$$

In particular, the Fiedler value has the characterization:

$$\lambda_2 = \min_{\mathbf{x} \perp \mathbf{1},\, \mathbf{x} \neq \mathbf{0}} \frac{\mathbf{x}^\top L \mathbf{x}}{\lVert \mathbf{x} \rVert^2} = \min_{\mathbf{x} \perp \mathbf{1},\, \lVert \mathbf{x} \rVert = 1} \sum_{(i,j)\in E}(x_i - x_j)^2$$

Proof sketch. Write $\mathbf{x} = \sum_k c_k \mathbf{u}_k$ in the eigenbasis. Then $\mathbf{x}^\top L \mathbf{x} = \sum_k \lambda_k c_k^2$ and $\lVert \mathbf{x} \rVert^2 = \sum_k c_k^2$. The Rayleigh quotient is $\sum_k \lambda_k c_k^2 / \sum_k c_k^2$, a convex combination of eigenvalues. Minimizing over $\mathbf{x} \perp \mathbf{u}_1 = \mathbf{1}/\sqrt{n}$ forces $c_1 = 0$, making the minimum $\lambda_2$ (achieved when $c_2 = 1$, all others $0$).

Practical use. Courant-Fischer justifies using $\mathbf{u}_2$ as the optimal graph bisection vector: it solves the continuous relaxation of the minimum bisection problem, as we prove in 4 and 8.


4. Algebraic Connectivity and the Fiedler Vector

4.1 Algebraic Connectivity \lambda_2

Definition. The algebraic connectivity of a graph $G$ is $a(G) = \lambda_2(L)$, the second-smallest eigenvalue of the graph Laplacian. It is also called the Fiedler value.

Theorem (Fiedler, 1973). $\lambda_2(L) > 0$ if and only if $G$ is connected.

This follows directly from 3.3: $\lambda_2 = 0$ iff there are at least 2 connected components.

Why "algebraic" connectivity? The classical combinatorial connectivity $\kappa(G)$ (the minimum number of vertices whose removal disconnects $G$) is a discrete, combinatorial quantity; the algebraic connectivity $\lambda_2$ is a continuous spectral quantity that lower-bounds it and is easy to compute and to manipulate analytically:

$$\lambda_2 \leq \kappa(G) \leq \delta(G)$$

where $\delta(G)$ is the minimum degree. This inequality chain says: algebraic connectivity $\leq$ vertex connectivity $\leq$ minimum degree.

Sensitivity. When a single edge $(u,v)$ is added to a graph, $\lambda_2$ can increase by at most $2$. When an edge is removed, $\lambda_2$ can decrease by at most $2$. This quantifies how much the connectivity changes with each graph edit - useful in robust network design.

Regular graphs. For a $d$-regular graph on $n$ vertices:

$$\lambda_2(L) = d - \mu_2(A)$$

where $\mu_2(A)$ is the second-largest eigenvalue of the adjacency matrix (the largest eigenvalue not equal to $d$ for a connected graph). The spectral gap $d - \mu_2$ of the adjacency matrix and the algebraic connectivity are directly related for regular graphs.

4.2 The Fiedler Vector

Definition. The Fiedler vector $\mathbf{u}_2$ is the eigenvector of $L$ corresponding to $\lambda_2$.

The Fiedler vector assigns a real number $(\mathbf{u}_2)_i$ to each vertex $i$. Vertices with positive values are assigned to one "side" of the graph; vertices with negative values to the other. This is the basis of spectral bisection.

Spectral bisection algorithm:

  1. Compute the Fiedler vector $\mathbf{u}_2$.
  2. Partition $V = S \cup \bar{S}$ by the sign of $(\mathbf{u}_2)_i$: let $S = \{i : (\mathbf{u}_2)_i \geq 0\}$.
  3. The edges between $S$ and $\bar{S}$ form the "spectral cut."

Why does this work? The Courant-Fischer theorem says $\mathbf{u}_2$ minimizes the Dirichlet energy $\sum_{(i,j)\in E}(x_i - x_j)^2$ subject to $\mathbf{x} \perp \mathbf{1}$ and $\lVert \mathbf{x} \rVert = 1$. If we further constrain $x_i \in \{-c, +c\}$ (a discrete two-way partition), we get the NP-hard graph bisection problem. The Fiedler vector is the continuous relaxation of this discrete problem - the best we can do efficiently.

The ordering property. Sorting vertices by their Fiedler vector value $(\mathbf{u}_2)_i$ reveals the community structure of the graph. Vertices in the same community tend to have similar $(\mathbf{u}_2)_i$ values; the transition from negative to positive marks the community boundary.
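
A minimal sketch of sign-based spectral bisection on a planted two-community graph (the graph parameters below are illustrative, not a prescribed benchmark):

```python
import numpy as np

rng = np.random.default_rng(1)

# Planted two-community graph: dense within blocks, sparse across.
n, p_in, p_out = 40, 0.5, 0.05
blocks = np.repeat([0, 1], n // 2)
P = np.where(blocks[:, None] == blocks[None, :], p_in, p_out)
A = np.triu((rng.random((n, n)) < P).astype(float), k=1)
A = A + A.T
L = np.diag(A.sum(axis=1)) - A

eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]                 # eigenvector for the second-smallest eigenvalue
labels = (fiedler >= 0).astype(int)     # sign-based spectral bisection

# Up to a global flip, the signs recover the planted communities.
accuracy = max(np.mean(labels == blocks), np.mean(labels != blocks))
print(f"lambda_2 = {eigvals[1]:.3f}, recovery accuracy = {accuracy:.2f}")
```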

For AI: Spectral bisection is used in:

  • Circuit partitioning (VLSI design): split a circuit graph across two chips to minimize inter-chip connections
  • Domain decomposition (PDE solvers): partition a mesh graph for parallel computation
  • Community detection in knowledge graphs: find the two most separated communities in a KG

4.3 Bounding Graph Properties via \lambda_2

Diameter bound (Mohar, 1991):

$$\operatorname{diam}(G) \leq \left\lfloor \frac{2\ln(n-1)}{\ln\!\left(\frac{\lambda_n}{\lambda_n - \lambda_2}\right)} \right\rfloor$$

A simpler bound: $\operatorname{diam}(G) \leq \frac{2n}{\lambda_2}$ (rough but useful).

Vertex connectivity. For any connected graph:

$$\lambda_2(L) \leq \kappa(G)$$

where $\kappa(G)$ is the vertex connectivity (the minimum number of vertices to remove to disconnect the graph). A large $\lambda_2$ means the graph is robustly connected.

Conductance. The conductance $\Phi(G) = \min_{S: \operatorname{vol}(S) \leq \operatorname{vol}(V)/2} \frac{|E(S, \bar{S})|}{\operatorname{vol}(S)}$ measures the minimum normalized cut. The Cheeger inequality (5) gives:

$$\frac{\lambda_2(L_{\text{sym}})}{2} \leq \Phi(G) \leq \sqrt{2\lambda_2(L_{\text{sym}})}$$

Isoperimetric number. The analogous quantity defined with $|S|$ in place of $\operatorname{vol}(S)$ (the isoperimetric number, or edge expansion) satisfies the same type of inequality with the unnormalized $\lambda_2$.

4.4 Computing the Fiedler Vector in Practice

For small graphs ($n$ up to a few thousand), compute all eigenvalues of $L$ directly via a dense symmetric eigensolver (scipy.linalg.eigh). The second column of the eigenvector matrix is $\mathbf{u}_2$.

For large sparse graphs, use iterative methods:

Lanczos algorithm: Builds a tridiagonal matrix $T$ from Krylov vectors $\{L\mathbf{v}, L^2\mathbf{v}, \ldots\}$. Converges to extreme eigenvalues fastest. For the Fiedler vector we need the smallest nonzero eigenvalue, which calls for the shift-invert trick: compute the largest eigenvalues of $(L - \sigma I)^{-1}$ for a shift $\sigma$ near $0$, while deflating the known eigenvector $\mathbf{1}$ for eigenvalue $0$.

Inverse power iteration with deflation: Since $\lambda_1 = 0$ is known, we can deflate it out. Initialize with a random $\mathbf{x} \perp \mathbf{1}$, repeatedly solve $L\mathbf{y} = \mathbf{x}$ on the subspace orthogonal to $\mathbf{1}$ (a sparse linear solve plays the role of applying $L^{-1}$), normalize, and re-orthogonalize against $\mathbf{1}$. The error contracts by a factor $\lambda_2/\lambda_3$ per iteration.

Randomized Nyström approximation: For graphs with $n > 10^6$, approximate the low-rank spectral structure using randomized sampling of the Laplacian (Spielman & Srivastava, 2011).
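
A sketch of the sparse route using scipy's ARPACK/Lanczos-based eigsh; the random sparse graph and the small negative shift are assumptions made purely for illustration:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(2)

# Random sparse graph (illustrative); in practice L comes from your data.
n = 2000
rows = rng.integers(0, n, size=6 * n)
cols = rng.integers(0, n, size=6 * n)
mask = rows != cols
A = sp.coo_matrix((np.ones(mask.sum()), (rows[mask], cols[mask])), shape=(n, n))
A = ((A + A.T) > 0).astype(float).tocsr()          # symmetrize and binarize

L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A

# Shift-invert around a small negative sigma: L - sigma*I is positive definite,
# and the eigenvalues closest to sigma are the smallest ones (0 and lambda_2).
vals, vecs = eigsh(L, k=2, sigma=-1e-6, which='LM')
order = np.argsort(vals)
lambda_2, fiedler = vals[order[1]], vecs[:, order[1]]
print(lambda_2)
```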

4.5 AI Application: Community Detection

Community detection - finding groups of densely interconnected nodes - is one of the most practically important graph problems. Spectral methods are the gold standard for quality guarantees.

Planted partition model. Generate a graph with $k$ communities of size $n/k$ each, intra-community edge probability $p_{\text{in}}$, inter-community probability $p_{\text{out}} \ll p_{\text{in}}$. Spectral bisection recovers the communities exactly when:

$$p_{\text{in}} - p_{\text{out}} > \sqrt{\frac{2\ln n}{n/k} \cdot \frac{1}{\text{SBM gap}}}$$

(This matches the scaling of the information-theoretic recovery threshold for the Stochastic Block Model.)

Knowledge graph clustering. In a knowledge graph (KG) like Freebase or Wikidata, entities form communities by topic (sports, science, politics). The Fiedler vector of the KG adjacency graph separates these clusters. The resulting community structure can be used to create topic-specific sub-KGs for retrieval-augmented generation.


5. Cheeger's Inequality and Graph Expansion

5.1 The Cheeger Constant h(G)

Definition. For a graph $G = (V, E)$ and a subset $S \subseteq V$, the edge boundary $\partial S$ is the set of edges between $S$ and its complement $\bar{S} = V \setminus S$:

$$\partial S = \{(i,j) \in E : i \in S, j \in \bar{S}\}$$

Two normalized versions of the cut size are used for the cut $(S, \bar{S})$:

$$\Phi(S) = \frac{|\partial S|}{\min(|S|, |\bar{S}|)} \quad \text{(vertex-normalized: edge expansion)} \qquad h(S) = \frac{|\partial S|}{\min(\operatorname{vol}(S), \operatorname{vol}(\bar{S}))} \quad \text{(volume-normalized: conductance)}$$

The Cheeger constant of $G$ is:

$$h(G) = \min_{S \subset V,\, S \neq \emptyset,\, S \neq V} h(S)$$

This is the minimum conductance cut: the partition that minimizes the fraction of edges leaving the smaller side relative to its volume. A small $h(G)$ means the graph has a bottleneck - a small number of edges separating a large fraction of the volume.

Computing $h(G)$ is NP-hard. This is a major motivation for the Cheeger inequality, which yields a polynomial-time algorithm (a sweep over the Fiedler vector, 5.2) producing a cut of conductance at most $\sqrt{2\lambda_2} \leq 2\sqrt{h(G)}$ - a quadratic approximation guarantee.

Examples:

  • Path $P_n$: Remove the middle edge; $h(P_n) = \Theta(1/n)$. Very small: the path is a severe bottleneck.
  • Complete graph $K_n$: Every cut has $h(S) = \frac{|S|\,|\bar{S}|}{\min(|S|, |\bar{S}|)\,(n-1)} = \frac{\max(|S|, |\bar{S}|)}{n-1}$, minimized by a balanced cut: $h(K_n) = \frac{n}{2(n-1)} \approx 1/2$.
  • Expander graph (5.3): $h(G) = \Omega(1)$ - bounded below by a constant, independent of $n$.

5.2 Cheeger's Inequality

Theorem (Alon & Milman, 1985; Dodziuk, 1984). For any undirected graph $G$:

$$\frac{\lambda_2(L_{\text{sym}})}{2} \leq h(G) \leq \sqrt{2\lambda_2(L_{\text{sym}})}$$

Proof of the left inequality (easy direction). We show $\lambda_2 \leq 2h(G)$ by exhibiting a test vector $\mathbf{x}$ with $R(\mathbf{x}) \leq 2h(G)$.

Let $S^*$ be the optimal Cheeger cut with $h(S^*) = h(G)$. Define:

$$x_i = \begin{cases} 1/\operatorname{vol}(S^*) & i \in S^* \\ -1/\operatorname{vol}(\bar{S}^*) & i \in \bar{S}^* \end{cases}$$

This $\mathbf{x}$ is orthogonal to the stationary distribution of the random walk (which plays the role of $\mathbf{1}$ for the normalized Laplacian). Then:

$$\mathbf{x}^\top L_{\text{sym}} \mathbf{x} = \sum_{(i,j)\in\partial S^*} \left(\frac{x_i - x_j}{\sqrt{d_i d_j}}\right)^2 \cdot d_i d_j$$

After algebraic simplification using the definition of $h$: $R(\mathbf{x}) \leq 2h(G)$. Since $\lambda_2 = \min R(\mathbf{x})$, we get $\lambda_2 \leq 2h(G)$.

Proof of the right inequality (hard direction). We show $h(G) \leq \sqrt{2\lambda_2}$.

Given the Fiedler vector $\mathbf{u}_2$, sort vertices so $u_{2,1} \leq u_{2,2} \leq \cdots \leq u_{2,n}$. For each threshold $t$, let $S_t = \{i : u_{2,i} \leq t\}$. Consider the sweep over all $n-1$ possible thresholds. By the co-area formula for graphs (a discrete version of the co-area formula in differential geometry), the average conductance of these cuts satisfies:

$$\min_t h(S_t) \leq \frac{\sum_t h(S_t) \cdot \Delta\operatorname{vol}(S_t)}{\sum_t \Delta\operatorname{vol}(S_t)} \leq \sqrt{2\lambda_2}$$

The last step uses the Cauchy-Schwarz inequality together with the fact that $\lambda_2 = R(\mathbf{u}_2)$ is the Rayleigh quotient of the Fiedler vector. $\square$

Tightness. The left bound is tight for expanders (5.3). The right bound is tight for paths and other bottleneck graphs where $\lambda_2 \approx h^2/2$.

Practical implication. Given $\lambda_2$, we know $h(G) \in [\lambda_2/2, \sqrt{2\lambda_2}]$. More importantly, the proof of the right inequality is constructive: the sweep over Fiedler vector thresholds finds a cut with conductance $\leq \sqrt{2\lambda_2}$.
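
A sketch of the constructive sweep cut from the proof above, for a small dense adjacency matrix (the cycle graph C_20 used as input is illustrative):

```python
import numpy as np

def sweep_cut(A):
    """Sweep over Fiedler-vector thresholds; return the best conductance found.

    A: dense symmetric adjacency matrix of a small graph. This sketches the
    constructive step in the proof of the right-hand Cheeger inequality.
    """
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(d)) - D_inv_sqrt @ A @ D_inv_sqrt

    vals, vecs = np.linalg.eigh(L_sym)
    # Map the L_sym eigenvector back to the random-walk eigenvector before sorting.
    order = np.argsort(D_inv_sqrt @ vecs[:, 1])

    vol_total = d.sum()
    best_h = np.inf
    for t in range(1, len(d)):                   # prefixes of the sorted vertices
        in_S = np.zeros(len(d), dtype=bool)
        in_S[order[:t]] = True
        cut = A[np.ix_(in_S, ~in_S)].sum()       # edges leaving S
        vol_S = d[in_S].sum()
        best_h = min(best_h, cut / min(vol_S, vol_total - vol_S))
    return best_h, vals[1]

# Example: cycle graph C_20.
n = 20
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

h_sweep, lam2 = sweep_cut(A)
print(h_sweep, lam2 / 2, np.sqrt(2 * lam2))      # lam2/2 <= h_sweep <= sqrt(2*lam2)
```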

5.3 Expander Graphs

Definition. A family of graphs $\{G_n\}$ is an $(n, d, \lambda)$-expander family if:

  • Each $G_n$ has $n$ vertices and is $d$-regular
  • Every nontrivial eigenvalue of $A$ satisfies $|\mu_i(A)| \leq \lambda < d$ for $i \geq 2$
  • The spectral gap $d - \lambda$ is bounded below by a positive constant

Equivalently (by Cheeger): $h(G_n) = \Omega(1)$, i.e., the Cheeger constant is bounded below uniformly in $n$.

Why expanders matter:

  1. Communication networks: In a $d$-regular expander on $n$ nodes, any message can be routed between any two nodes in $O(\log n)$ hops, using only $d$ connections per node. This is optimal for constant-degree networks.
  2. Error-correcting codes: Expander codes (Sipser & Spielman, 1996) achieve linear-time encoding/decoding of codes close to the Shannon capacity.
  3. Derandomization: Expanders provide pseudorandom number generators - random walks on expanders mix in $O(\log n)$ steps, so short random walks serve as good randomness sources.
  4. GNN depth: A GNN on an expander graph propagates information across the entire graph in $O(\log n)$ layers. This is why expanders are ideal benchmarks for deep GNNs.

Ramanujan graphs. The Alon-Boppana theorem bounds how good the spectral gap of a $d$-regular graph can be: $\lambda \geq 2\sqrt{d-1} - o(1)$ as $n \to \infty$. Graphs with $\lambda \leq 2\sqrt{d-1}$ are called Ramanujan graphs - they are the optimal expanders. Explicit Ramanujan graph constructions (Lubotzky, Phillips, Sarnak, 1988; Margulis, 1988) use deep number theory.

5.4 Random Walk Mixing Time

The random walk on $G$ defined by transition matrix $P = D^{-1}A$ has stationary distribution $\boldsymbol{\pi}$ with $\pi_i = d_i / \operatorname{vol}(V)$. The mixing time is the number of steps needed for the walk to get close to the stationary distribution:

$$t_{\text{mix}}(\epsilon) = \min\left\{t : \max_i \lVert (P^t)_{i,:} - \boldsymbol{\pi} \rVert_1 \leq \epsilon\right\}$$

Spectral mixing bound. Let $\alpha = \max(|\mu_2(P)|, |\mu_n(P)|)$ be the second-largest absolute eigenvalue of $P$; when the walk is not dominated by its most negative eigenvalue (e.g., for the lazy walk below), $\alpha = \mu_2(P) = 1 - \lambda_2(L_{\text{rw}})$. Then:

$$t_{\text{mix}}(\epsilon) \leq \frac{\ln(n/\epsilon)}{\lambda_2(L_{\text{rw}})} = \frac{\ln(n/\epsilon)}{1 - \alpha}$$

Interpretation: The spectral gap $\lambda_2 = 1 - \alpha$ governs the mixing time. Large spectral gap -> fast mixing. For expanders with constant spectral gap: $t_{\text{mix}} = O(\log n)$. For paths: $\lambda_2 = O(1/n^2)$, so $t_{\text{mix}} = O(n^2 \log n)$.

Proof sketch. Write the initial distribution as $\boldsymbol{\delta}_i = \boldsymbol{\pi} + \sum_k c_k \boldsymbol{\phi}_k$ where $\boldsymbol{\phi}_k$ are eigenvectors of $P$ (eigenvalues $1 = \mu_1 > \mu_2 \geq \cdots \geq \mu_n \geq -1$). After $t$ steps: $(P^t \boldsymbol{\delta}_i)_j = \pi_j + \sum_{k \geq 2} c_k \mu_k^t (\boldsymbol{\phi}_k)_j$. The deviation decays as $\alpha^t$, giving $t_{\text{mix}} \leq \ln(1/(\epsilon\pi_{\min}))/\ln(1/\alpha)$.

Lazy walk. To avoid oscillation when $\mu_n \approx -1$ (bipartite-like graphs), use the lazy random walk with $P' = (I + P)/2$. Eigenvalues of $P'$ are $(1 + \mu_k)/2 \in [0, 1]$, avoiding negative eigenvalues.

5.5 AI Connection: Over-Smoothing as Diffusion

Over-smoothing is the well-documented phenomenon in deep GNNs where node representations become indistinguishable as the number of layers increases (Li et al., 2018; Oono & Suzuki, 2020). Spectral theory provides the exact mechanism:

A $k$-layer GCN computes (roughly) $\hat{A}^k X W$ where $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ is the normalized adjacency. The eigenvalues of $\hat{A}$ satisfy $\hat{\mu}_i = 1 - \tilde{\lambda}_i \in [-1, 1]$ where $\tilde{\lambda}_i$ are eigenvalues of $L_{\text{sym}}$ of the augmented graph. After $k$ iterations:

$$(\hat{A}^k)_{ij} = \sum_\ell \hat{\mu}_\ell^k (U)_{i\ell}(U)_{j\ell} \xrightarrow{k \to \infty} \hat{\mu}_1^k (U)_{i1}(U)_{j1} = \frac{\sqrt{d_i d_j}}{\operatorname{vol}(V)}$$

All node features converge to a value proportional to $\sqrt{d_i}$, determined only by degree - all structural information is lost.

Rate of collapse. The convergence rate is governed by the spectral gap: $\hat{\mu}_2 = 1 - \lambda_2(L_{\text{sym}}) < 1$. Faster collapse on expanders (large $\lambda_2$), slower on bottleneck graphs. This is counterintuitive: the "most connected" graphs (expanders) over-smooth fastest.

Mitigation strategies:

  • Residual connections (GCNII, Chen et al., 2020): $H^{(k+1)} = \sigma\!\left((1-\alpha)\hat{A}H^{(k)}W^{(k)} + \alpha H^{(0)}\right)$ - preserve initial features
  • DropEdge (Rong et al., 2020): randomly remove edges during training, reducing effective $k$
  • PairNorm (Zhao & Akoglu, 2020): explicitly normalize pairwise distances to prevent collapse
  • Jumping knowledge (Xu et al., 2018): aggregate representations from all layers

Forward reference: The full architecture-level treatment of over-smoothing, including the WL expressiveness hierarchy and architectural mitigations, is in 11-05 Graph Neural Networks.


6. Graph Fourier Transform and Signal Processing

6.1 Classical Fourier Analogy

The classical Fourier transform on $\mathbb{R}^n$ decomposes a function $f$ into a linear combination of eigenfunctions of the Laplace operator $\Delta$:

$$\hat{f}(\boldsymbol{\omega}) = \int_{\mathbb{R}^n} f(\mathbf{x})\, e^{-i\boldsymbol{\omega}^\top \mathbf{x}} \, d\mathbf{x}$$

The functions $e^{i\boldsymbol{\omega}^\top \mathbf{x}}$ are eigenfunctions of the continuous Laplacian: $-\Delta e^{i\boldsymbol{\omega}^\top \mathbf{x}} = \lVert \boldsymbol{\omega} \rVert^2 e^{i\boldsymbol{\omega}^\top \mathbf{x}}$.

On a graph, the Laplacian $L$ plays the role of $-\Delta$, and its eigenvectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n$ (with eigenvalues $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$) play the role of the complex exponentials $e^{i\boldsymbol{\omega}^\top \mathbf{x}}$.

The analogy:

FOURIER TRANSFORM ANALOGY
========================================================================

  Classical Fourier                   Graph Fourier
  ---------------------------------   ---------------------------------
  Domain          \mathbb{R}^n                  Vertex set V (finite)
  Operator        -\Delta (Laplacian)      L = D - A (graph Laplacian)
  Eigenfunctions  exp(i\omega*x)           Eigenvectors u_1, u_2, ..., u_n
  Frequencies     ||\omega||^2 \in [0, \infty)      Eigenvalues \lambda_1 \leq \lambda_2 \leq ... \leq \lambda_n
  Low freq.       ||\omega|| small -> smooth  \lambda_k small -> smooth on graph
  High freq.      ||\omega|| large -> rapid   \lambda_k large -> rapid variation
  Transform       Continuous integral  Finite matrix multiply (U^Tx)

========================================================================

This analogy is the conceptual foundation for defining convolution, filtering, and signal processing on irregular graph domains.

6.2 Graph Fourier Transform

Definition. Let $L = U \Lambda U^\top$ be the eigendecomposition of the graph Laplacian, with $U = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n]$ the matrix of eigenvectors (columns). For a signal $\mathbf{x} \in \mathbb{R}^n$ (assigning a value $x_i$ to each vertex $i$), the Graph Fourier Transform (GFT) is:

$$\hat{\mathbf{x}} = U^\top \mathbf{x} \in \mathbb{R}^n$$

The inverse GFT is:

$$\mathbf{x} = U \hat{\mathbf{x}} = \sum_{k=1}^n \hat{x}_k \mathbf{u}_k$$

Properties:

  1. Parseval's theorem: $\lVert \hat{\mathbf{x}} \rVert^2 = \lVert \mathbf{x} \rVert^2$ (since $U$ is orthogonal).
  2. Linearity: $\widehat{\mathbf{x} + \mathbf{y}} = \hat{\mathbf{x}} + \hat{\mathbf{y}}$.
  3. Energy decomposition: $\lVert \mathbf{x} \rVert^2 = \sum_k \hat{x}_k^2$ (energy in each frequency component).
  4. Shift property: There is no clean "shift theorem" for graphs as there is for the DFT, because graphs lack translation symmetry. This is a fundamental difference.

Graph convolution. The convolution of two signals $\mathbf{x}$ and $\mathbf{y}$ on a graph is defined spectrally:

$$\mathbf{x} \star_G \mathbf{y} = U (\hat{\mathbf{x}} \odot \hat{\mathbf{y}}) = U \operatorname{diag}(\hat{\mathbf{x}})\, \hat{\mathbf{y}}$$

where $\odot$ is element-wise multiplication. This is the analogue of the convolution theorem: convolution in the vertex domain equals pointwise multiplication in the spectral domain.

Limitation of full GFT: Computing $U$ requires $O(n^3)$ time (eigendecomposition). For large graphs ($n > 10^4$), this is infeasible, motivating polynomial approximations (7).
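
A minimal GFT sketch on a path graph (the signal and graph are illustrative); it checks the inverse transform, Parseval's theorem, and the spectral form of the Dirichlet energy:

```python
import numpy as np

# Path graph P_30 and a smooth-plus-noise signal on it.
n = 30
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

eigvals, U = np.linalg.eigh(L)            # L = U diag(eigvals) U^T

rng = np.random.default_rng(3)
x = np.sin(np.linspace(0, np.pi, n)) + 0.1 * rng.standard_normal(n)

x_hat = U.T @ x                            # Graph Fourier Transform
assert np.allclose(U @ x_hat, x)           # inverse GFT recovers the signal
assert np.isclose(np.sum(x_hat**2), np.sum(x**2))           # Parseval
assert np.isclose(x @ L @ x, np.sum(eigvals * x_hat**2))    # Dirichlet energy in the spectrum

# A smooth signal concentrates its energy in the low-frequency components.
print(np.round(x_hat[:5], 2), np.round(x_hat[-5:], 2))
```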

6.3 Frequency Interpretation

The $k$-th frequency component $\hat{x}_k = \langle \mathbf{u}_k, \mathbf{x} \rangle = \mathbf{u}_k^\top \mathbf{x}$ measures how much of the signal $\mathbf{x}$ "oscillates at frequency $\lambda_k$."

Low-frequency signals correspond to small $\lambda_k$: the eigenvectors $\mathbf{u}_k$ for small eigenvalues vary smoothly across edges (since $\mathbf{u}_k^\top L \mathbf{u}_k = \lambda_k$ is small). A signal concentrated in low frequencies is smooth: nearby vertices have similar values.

High-frequency signals correspond to large $\lambda_k$: the eigenvectors for large eigenvalues oscillate rapidly, with $(\mathbf{u}_k)_i$ and $(\mathbf{u}_k)_j$ having opposite signs for many edges $(i,j)$. A pure high-frequency signal looks like a checkerboard on the graph.

Example on a path graph. For $P_n$, the eigenvectors are $u_{k,i} = \sqrt{2/n} \cos((k-1)\pi(2i-1)/(2n))$ - the discrete cosine transform (DCT). The eigenvalues $\lambda_k = 2 - 2\cos((k-1)\pi/n)$ are just the squared DCT frequencies. The GFT on a path is the DCT.

Example on a community graph. A graph with two tightly connected communities has:

  • $\mathbf{u}_1 = \mathbf{1}/\sqrt{n}$: constant (DC component)
  • $\mathbf{u}_2$: Fiedler vector, positive on community 1, negative on community 2 - the community membership function is a low-frequency signal
  • $\mathbf{u}_n$: highest frequency, alternates sign on bipartite-like structure

6.4 Dirichlet Energy Revisited

The Dirichlet energy decomposes cleanly in the spectral domain:

$$\mathbf{x}^\top L \mathbf{x} = \hat{\mathbf{x}}^\top \Lambda \hat{\mathbf{x}} = \sum_{k=1}^n \lambda_k \hat{x}_k^2$$

This is the "power spectrum" interpretation: the Dirichlet energy is the weighted sum of spectral components, weighted by frequency. A signal is smooth (low Dirichlet energy) iff its energy is concentrated in low-frequency components ($\lambda_k$ small).

Spectral analysis of node features. Given a node feature matrix $X \in \mathbb{R}^{n \times d}$, we can compute the spectral content of each feature dimension:

$$\operatorname{Dirichlet}(X_{:,j}) = \sum_{k=1}^n \lambda_k \hat{X}_{kj}^2$$

Feature dimensions with low Dirichlet energy are "community-consistent" features (e.g., political affiliation in a social network). Feature dimensions with high Dirichlet energy are "noisy" local features.

For AI: Graph regularization in semi-supervised learning minimizes:

$$\mathcal{L} = \mathcal{L}_{\text{supervised}} + \gamma \sum_j \mathbf{f}_{:,j}^\top L \mathbf{f}_{:,j}$$

This penalizes high-frequency components in the predicted label function $\mathbf{f}$, implementing a "smoothness prior": connected nodes likely have the same label.

6.5 Uncertainty Principle on Graphs

In classical signal processing, the Heisenberg uncertainty principle states that a signal cannot be simultaneously concentrated in both time and frequency: the product of time spread and frequency spread is at least $1/4\pi$.

On graphs, an analogous uncertainty principle holds (Agaskar & Lu, 2013):

$$\Delta_G(\mathbf{x})^2 \cdot \Delta_S(\mathbf{x})^2 \geq C$$

where $\Delta_G$ measures how localized $\mathbf{x}$ is in the vertex domain (concentrated on a small set of vertices), $\Delta_S$ measures how localized $\hat{\mathbf{x}}$ is in the spectral domain (concentrated on a small band of frequencies), and $C$ is a constant depending on the graph structure.

Implications for graph signal processing:

  • A signal perfectly localized on a single vertex ($\Delta_G = 0$) is spread across all frequencies ($\Delta_S$ maximal)
  • Smooth signals (concentrated on low frequencies, small $\Delta_S$) are necessarily spread across many vertices ($\Delta_G$ large)
  • This tradeoff motivates graph wavelets (11.3): basis functions that are approximately localized in both vertex and spectral domains

6.6 AI Application: Node Feature Smoothing

Label propagation (Zhou et al., 2004) is a classic semi-supervised learning algorithm that directly implements low-pass graph filtering. Starting from a partially labeled graph, labels propagate according to:

$$\mathbf{F}^{(t+1)} = \alpha \hat{A} \mathbf{F}^{(t)} + (1 - \alpha) Y$$

where $Y$ is the initial label matrix (zeros for unlabeled nodes), $\hat{A}$ is the normalized adjacency, and $\alpha \in (0,1)$ controls the smoothing strength. In the spectral domain, this converges to:

$$\mathbf{F}^* = (I - \alpha \hat{A})^{-1} (1-\alpha) Y = \sum_k \frac{1-\alpha}{1 - \alpha(1-\tilde{\lambda}_k)} \hat{Y}_k \mathbf{u}_k$$

The filter $g(\lambda) = (1-\alpha)/(1-\alpha(1-\lambda))$ is a low-pass filter: it attenuates high-frequency components ($\lambda$ large) more than low-frequency ones.
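
A sketch of label propagation as iterated low-pass filtering, checked against the closed-form solution (the planted two-community graph and the seed labels are illustrative; no isolated vertices are assumed):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two-community graph with one labeled seed node per community.
n = 40
blocks = np.repeat([0, 1], n // 2)
P = np.where(blocks[:, None] == blocks[None, :], 0.4, 0.03)
A = np.triu((rng.random((n, n)) < P).astype(float), k=1)
A = A + A.T
d = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(d, d))          # D^{-1/2} A D^{-1/2}

Y = np.zeros((n, 2))
Y[0, 0] = Y[n - 1, 1] = 1.0                  # seed labels

alpha = 0.9
F = Y.copy()
for _ in range(200):                          # F <- alpha * A_hat F + (1 - alpha) Y
    F = alpha * A_hat @ F + (1 - alpha) * Y

F_closed = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * A_hat, Y)
assert np.allclose(F, F_closed, atol=1e-6)    # iteration converges to the closed form

pred = F.argmax(axis=1)
print("accuracy:", np.mean(pred == blocks))
```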

For modern LLMs: When an LLM reasons over a knowledge graph, smooth graph signals correspond to consistent facts (nearby entities agree), while high-frequency signals correspond to noise or inconsistencies. Spectral filtering provides a principled way to denoise knowledge graphs before retrieval.


7. Spectral Filtering

7.1 Filtering in the Spectral Domain

A spectral filter on a graph is an operation that modifies the frequency content of a graph signal:

$$\mathbf{y} = g(L)\mathbf{x} = U g(\Lambda) U^\top \mathbf{x}$$

where $g: \mathbb{R} \to \mathbb{R}$ is a scalar function applied pointwise to the eigenvalues: $g(\Lambda) = \operatorname{diag}(g(\lambda_1), g(\lambda_2), \ldots, g(\lambda_n))$.

Common filters:

| Filter | $g(\lambda)$ | Effect | AI use case |
|---|---|---|---|
| Low-pass | $\mathbf{1}[\lambda \leq \lambda_c]$ | Keep low frequencies | Smooth node features |
| High-pass | $\mathbf{1}[\lambda > \lambda_c]$ | Keep high frequencies | Edge detection on graphs |
| Band-pass | $\mathbf{1}[\lambda_a \leq \lambda \leq \lambda_b]$ | Keep a frequency band | Community detection at scale $k$ |
| Heat kernel | $e^{-t\lambda}$ | Exponential damping | Graph diffusion, PPMI |
| Identity | $1$ | No change | Trivial |
| GCN | $1 - \lambda/2$ | Linear attenuation | First-order spectral convolution |

Implementation cost: Directly computing $U g(\Lambda) U^\top \mathbf{x}$ requires the full eigendecomposition - $O(n^3)$ preprocessing and $O(n^2)$ per signal. This is intractable for large graphs. Polynomial approximation (7.2) reduces cost to $O(K|E|)$ per signal.

7.2 Polynomial Filters and Localization

A $K$-th order polynomial filter has the form:

$$g(L) = \sum_{k=0}^K \theta_k L^k$$

Key property: $K$-localization. The filter $g(L) = \sum_{k=0}^K \theta_k L^k$ is exactly $K$-localized: $(g(L)\mathbf{x})_i$ depends only on the values of $\mathbf{x}$ at vertices within graph distance $K$ from $i$.

Proof. $(L^k)_{ij} = 0$ whenever $\operatorname{dist}(i,j) > k$ (by the walk-counting property of graph matrix powers). Therefore $(g(L))_{ij} = \sum_k \theta_k (L^k)_{ij} = 0$ whenever $\operatorname{dist}(i,j) > K$.

Complexity. Computing $\mathbf{y} = g(L)\mathbf{x}$ using the recurrence $L^k \mathbf{x} = L \cdot (L^{k-1}\mathbf{x})$ requires $K$ sparse matrix-vector multiplications, each $O(|E|)$. Total: $O(K|E|)$.

Spatial interpretation. A polynomial filter is exactly equivalent to a $K$-hop neighborhood aggregation, connecting spectral and spatial GNN views. This is the theoretical justification for why GNNs with $K$ layers aggregate information from $K$-hop neighborhoods.

Approximation theorem. By the Stone-Weierstrass theorem, any continuous function $g: [0, \lambda_{\max}] \to \mathbb{R}$ can be uniformly approximated by polynomials. So polynomial filters are universal approximators for spectral filters on any graph.

7.3 Chebyshev Polynomial Approximation

Why Chebyshev? Among all monic polynomials of degree $K$, the rescaled Chebyshev polynomial $2^{1-K} T_K$ has the smallest maximum deviation from zero on $[-1, 1]$ - Chebyshev polynomials are the optimal basis for minimax polynomial approximation.

Definition. The Chebyshev polynomials $T_k: [-1, 1] \to [-1, 1]$ satisfy:

$$T_0(x) = 1, \quad T_1(x) = x, \quad T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$$

They have the closed form $T_k(x) = \cos(k \arccos x)$.

Chebyshev graph filter (ChebNet, Defferrard et al., 2016). Scale the Laplacian to $\tilde{L} = 2L/\lambda_{\max} - I$, shifting the eigenvalues from $[0, \lambda_{\max}]$ to $[-1, 1]$. Define:

$$g_{\boldsymbol{\theta}}(L) = \sum_{k=0}^K \theta_k T_k(\tilde{L})$$

Computation via recurrence:

$$\bar{\mathbf{x}}_0 = \mathbf{x}, \quad \bar{\mathbf{x}}_1 = \tilde{L}\mathbf{x}, \quad \bar{\mathbf{x}}_k = 2\tilde{L}\bar{\mathbf{x}}_{k-1} - \bar{\mathbf{x}}_{k-2}$$

$$\mathbf{y} = \sum_{k=0}^K \theta_k \bar{\mathbf{x}}_k$$

Each step requires one sparse matrix-vector multiply $O(|E|)$; total cost $O(K|E|)$.

Advantages over truncated Taylor series:

  • The Chebyshev approximation error decays exponentially in $K$ (geometric convergence for smooth $g$)
  • No numerical instability from large powers of $\tilde{L}$
  • The learned parameters $\theta_k$ have clear frequency interpretation
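
A sketch of the Chebyshev recurrence used to apply an approximate heat-kernel filter $g(\lambda) = e^{-\lambda}$ without an eigendecomposition; fitting the coefficients by least squares on Chebyshev nodes is an assumption made for illustration (in ChebNet the $\theta_k$ are learned):

```python
import numpy as np

def cheb_filter(L, x, coeffs, lam_max):
    """Apply sum_k theta_k T_k(L_tilde) x via the Chebyshev recurrence."""
    n = L.shape[0]
    L_tilde = 2.0 * L / lam_max - np.eye(n)
    x_prev, x_curr = x, L_tilde @ x                  # T_0 x, T_1 x
    y = coeffs[0] * x_prev + coeffs[1] * x_curr
    for theta in coeffs[2:]:
        x_prev, x_curr = x_curr, 2.0 * L_tilde @ x_curr - x_prev
        y = y + theta * x_curr
    return y

# Ring graph C_50 and a random signal (illustrative).
n = 50
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

eigvals, U = np.linalg.eigh(L)
lam_max = eigvals[-1]
rng = np.random.default_rng(5)
x = rng.standard_normal(n)

# Fit theta_k so that sum_k theta_k T_k(s) ~ exp(-lambda) with s = 2*lambda/lam_max - 1.
K = 10
grid = np.cos((np.arange(200) + 0.5) * np.pi / 200)          # Chebyshev nodes in [-1, 1]
T = np.polynomial.chebyshev.chebvander(grid, K)               # T_0..T_K at the nodes
target = np.exp(-(grid + 1.0) * lam_max / 2.0)
coeffs = np.linalg.lstsq(T, target, rcond=None)[0]

y_cheb = cheb_filter(L, x, coeffs, lam_max)
y_exact = U @ (np.exp(-eigvals) * (U.T @ x))                  # exact spectral filtering
print(np.linalg.norm(y_cheb - y_exact) / np.linalg.norm(y_exact))   # small relative error
```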

7.4 Heat Kernel and Diffusion Filters

The graph heat equation generalizes diffusion to graphs:

$$\frac{\partial \mathbf{x}(t)}{\partial t} = -L\mathbf{x}(t), \quad \mathbf{x}(0) = \mathbf{x}_0$$

Solution: $\mathbf{x}(t) = e^{-tL}\mathbf{x}_0$. In the spectral domain: $\hat{x}_k(t) = e^{-t\lambda_k}\hat{x}_{k,0}$ - each frequency decays at rate $\lambda_k$.

The heat kernel $H_t = e^{-tL}$ is a positive semidefinite matrix representing the diffusion of heat on the graph over time $t$. Entries $(H_t)_{ij}$ give the heat at vertex $j$ after time $t$ when a unit heat source is placed at vertex $i$.

Properties:

  • For $t \to 0$: $H_t \to I$ (no diffusion)
  • For $t \to \infty$: $H_t \to \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ for a connected graph (heat equalizes to a constant temperature on each component)
  • Spectral expansion (connecting diffusion to the eigenbasis): $(H_t)_{ij} = \sum_k e^{-t\lambda_k}(U)_{ik}(U)_{jk}$

Diffusion distance. The distance between vertices $i$ and $j$ at time scale $t$ is:

$$D_t(i,j)^2 = \lVert (H_t)_{i,:} - (H_t)_{j,:} \rVert^2 = \sum_k e^{-2t\lambda_k}(u_{k,i} - u_{k,j})^2$$

This diffusion distance is more robust than shortest-path distance: it accounts for all paths between $i$ and $j$, not just the shortest one.

For AI: The PPMI (Positive Pointwise Mutual Information) matrix used in graph-based word embeddings is approximately a diffusion kernel. The node2vec random walk (Grover & Leskovec, 2016) approximates diffusion distance.

7.5 From Chebyshev to GCN

The GCN layer (Kipf & Welling, 2017) is derived from ChebNet applied to the normalized Laplacian L_{\text{sym}} = I - D^{-1/2}AD^{-1/2} by:

Step 1: Set K = 1 (first-order Chebyshev approximation): g_{\boldsymbol{\theta}}(L_{\text{sym}}) \approx \theta_0 T_0(\tilde{L}) + \theta_1 T_1(\tilde{L}) = \theta_0 I + \theta_1 \tilde{L}.

Step 2: Approximate \lambda_{\max} \approx 2 (the spectrum of L_{\text{sym}} lies in [0, 2], with \lambda_{\max} = 2 exactly for bipartite graphs), so \tilde{L} = 2L_{\text{sym}}/\lambda_{\max} - I \approx L_{\text{sym}} - I = -D^{-1/2}AD^{-1/2}.

g_{\boldsymbol{\theta}}(L_{\text{sym}}) \approx \theta_0 I + \theta_1(L_{\text{sym}} - I) = \theta_0 I - \theta_1 D^{-1/2}AD^{-1/2}

Step 3: Constrain \theta = \theta_0 = -\theta_1 (reduce parameters to prevent overfitting):

g_\theta(L_{\text{sym}}) \approx \theta\left(I + D^{-1/2}AD^{-1/2}\right)

Step 4: Add self-loops A~=A+I\tilde{A} = A + I, renormalize with D~ii=jA~ij\tilde{D}_{ii} = \sum_j \tilde{A}_{ij} to prevent numerical issues (the "renormalization trick"):

A^=D~1/2A~D~1/2\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}

Full GCN layer:

H(l+1)=σ ⁣(D~1/2A~D~1/2H(l)W(l))=σ ⁣(A^H(l)W(l))H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} H^{(l)} W^{(l)}\right) = \sigma\!\left(\hat{A} H^{(l)} W^{(l)}\right)

Spectral interpretation. The GCN filter g(λ)1λ/2g(\lambda) \approx 1 - \lambda/2 is a low-pass filter: it passes low frequencies (λ0\lambda \approx 0, smooth signals) and attenuates high frequencies (λ2\lambda \approx 2, rapidly varying signals). GCN is fundamentally a graph smoother.
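
A minimal NumPy sketch of one GCN propagation step with the renormalization trick; the toy graph, features, and weights are arbitrary illustrative placeholders.

```python
import numpy as np

def gcn_layer(A, H, W):
    """A: (n,n) adjacency, H: (n,d_in) node features, W: (d_in,d_out) weights."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                       # add self-loops
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))  # D_tilde^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # renormalized propagation matrix
    return np.maximum(A_hat @ H @ W, 0.0)         # ReLU nonlinearity

# Toy usage: 4-cycle, 3-dim random features, 2 output channels
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
print(gcn_layer(A, H, W))
```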

Full GNN treatment: For GraphSAGE, GAT, MPNN framework, over-smoothing fixes, and expressiveness theory, see 11-05 Graph Neural Networks.


8. Spectral Clustering

8.1 Graph Partitioning Objectives

Minimum cut. Given a graph and integer kk, partition V=A1AkV = A_1 \cup \cdots \cup A_k (disjoint, non-empty) to minimize:

Cut(A1,,Ak)=12=1kE(A,Aˉ)\text{Cut}(A_1, \ldots, A_k) = \frac{1}{2}\sum_{\ell=1}^k |E(A_\ell, \bar{A}_\ell)|

where E(S,Sˉ)E(S, \bar{S}) is the set of edges between SS and its complement.

Problem with minimum cut. Minimum cut tends to cut off isolated vertices or very small sets - the trivial solution A1={v}A_1 = \{v\} for a low-degree vertex vv has very few edges to cut. We need objectives that balance cluster sizes.

RatioCut (Hagen & Kahng, 1992):

RatioCut(A1,,Ak)==1kE(A,Aˉ)A\text{RatioCut}(A_1, \ldots, A_k) = \sum_{\ell=1}^k \frac{|E(A_\ell, \bar{A}_\ell)|}{|A_\ell|}

Normalizes by the number of vertices in each partition - prevents very small cuts.

Normalized Cut (NCut) (Shi & Malik, 2000):

NCut(A1,,Ak)==1kE(A,Aˉ)vol(A)\text{NCut}(A_1, \ldots, A_k) = \sum_{\ell=1}^k \frac{|E(A_\ell, \bar{A}_\ell)|}{\operatorname{vol}(A_\ell)}

Normalizes by the volume (total degree) - weighted version of RatioCut.

Both problems are NP-hard in general. Spectral clustering relaxes them to tractable eigenvalue problems.

8.2 RatioCut and Unnormalized Spectral Clustering

Two-cluster RatioCut. For k=2k = 2 with partition (S,Sˉ)(S, \bar{S}):

RatioCut(S,Sˉ)=E(S,Sˉ)(1S+1Sˉ)\text{RatioCut}(S, \bar{S}) = |E(S,\bar{S})| \cdot \left(\frac{1}{|S|} + \frac{1}{|\bar{S}|}\right)

Define the indicator vector hRn\mathbf{h} \in \mathbb{R}^n:

hi={Sˉ/SiSS/SˉiSˉh_i = \begin{cases} \sqrt{|\bar{S}|/|S|} & i \in S \\ -\sqrt{|S|/|\bar{S}|} & i \in \bar{S} \end{cases}

Claim. hLh=nRatioCut(S,Sˉ)\mathbf{h}^\top L \mathbf{h} = n \cdot \text{RatioCut}(S, \bar{S}). Also: h2=n\lVert \mathbf{h} \rVert^2 = n and h1=0\mathbf{h}^\top \mathbf{1} = 0.

Proof: hLh=(i,j)E(hihj)2\mathbf{h}^\top L \mathbf{h} = \sum_{(i,j)\in E}(h_i - h_j)^2. The only nonzero terms come from edges crossing the cut: for (i,j)E(S,Sˉ)(i,j) \in E(S, \bar{S}):

(hihj)2=(Sˉ/S+S/Sˉ)2=n2SSˉ(h_i - h_j)^2 = \left(\sqrt{|\bar{S}|/|S|} + \sqrt{|S|/|\bar{S}|}\right)^2 = \frac{n^2}{|S||\bar{S}|}

Summing over all E(S,Sˉ)|E(S,\bar{S})| cut edges and using 1/S+1/Sˉ=n/(SSˉ)1/|S| + 1/|\bar{S}| = n/(|S||\bar{S}|):

hLh=E(S,Sˉ)n2SSˉ=nRatioCut(S,Sˉ)\mathbf{h}^\top L \mathbf{h} = |E(S,\bar{S})| \cdot \frac{n^2}{|S||\bar{S}|} = n \cdot \text{RatioCut}(S, \bar{S}) \qquad \square

Relaxation. The discrete optimization minShLh\min_S \mathbf{h}^\top L \mathbf{h} subject to h1=0\mathbf{h}^\top \mathbf{1} = 0, h=n\lVert \mathbf{h} \rVert = \sqrt{n}, hi{c+,c}h_i \in \{c_+, c_-\} is NP-hard. Relax the integrality constraint: allow hiRh_i \in \mathbb{R}. By Courant-Fischer, the solution is the Fiedler vector u2\mathbf{u}_2.

Recovery. Given u2\mathbf{u}_2, assign vertex ii to SS if (u2)i0(\mathbf{u}_2)_i \geq 0, to Sˉ\bar{S} otherwise. In practice, use k-means with k=2k=2 on u2\mathbf{u}_2 for robustness.

8.3 Normalized Cut (Shi & Malik 2000)

NCut relaxation. Define the indicator h\mathbf{h} analogously to RatioCut but with volume weights: for partition (S,Sˉ)(S, \bar{S}):

hi={vol(Sˉ)/vol(S)iSvol(S)/vol(Sˉ)iSˉh_i = \begin{cases} \sqrt{\operatorname{vol}(\bar{S})/\operatorname{vol}(S)} & i \in S \\ -\sqrt{\operatorname{vol}(S)/\operatorname{vol}(\bar{S})} & i \in \bar{S} \end{cases}

Then hLh=vol(V)NCut(S,Sˉ)\mathbf{h}^\top L \mathbf{h} = \operatorname{vol}(V) \cdot \text{NCut}(S, \bar{S}), subject to hD1=0\mathbf{h}^\top D \mathbf{1} = 0 and hDh=vol(V)\mathbf{h}^\top D \mathbf{h} = \operatorname{vol}(V).

Generalized eigenvalue problem. The continuous relaxation is:

minhD1hLhhDh=minh~D1/21h~Lsymh~h~2\min_{\mathbf{h} \perp_D \mathbf{1}} \frac{\mathbf{h}^\top L \mathbf{h}}{\mathbf{h}^\top D \mathbf{h}} = \min_{\tilde{\mathbf{h}} \perp D^{1/2}\mathbf{1}} \frac{\tilde{\mathbf{h}}^\top L_{\text{sym}} \tilde{\mathbf{h}}}{\lVert \tilde{\mathbf{h}} \rVert^2}

via the substitution h~=D1/2h\tilde{\mathbf{h}} = D^{1/2}\mathbf{h}. This is the standard Rayleigh quotient for LsymL_{\text{sym}}, minimized by u~2\tilde{\mathbf{u}}_2. Thus:

h=D1/2u~2\mathbf{h}^* = D^{-1/2}\tilde{\mathbf{u}}_2

where u~2\tilde{\mathbf{u}}_2 is the Fiedler vector of LsymL_{\text{sym}}.

Shi-Malik algorithm (2-cluster):

  1. Build Lsym=D1/2(DA)D1/2L_{\text{sym}} = D^{-1/2}(D-A)D^{-1/2}
  2. Compute Fiedler vector u~2\tilde{\mathbf{u}}_2 of LsymL_{\text{sym}}
  3. Assign vertex ii to SS if (D1/2u~2)ithreshold(D^{-1/2}\tilde{\mathbf{u}}_2)_i \geq \text{threshold}
  4. Choose threshold: empirically (try all n1n-1 thresholds) or at 0

Multi-class NCut. For kk clusters, use the kk smallest eigenvectors of LsymL_{\text{sym}}, form the n×kn \times k matrix UkU_k, normalize each row to unit norm, then apply k-means to the rows.

8.4 Multi-Way Spectral Clustering

The Ng-Jordan-Weiss (NJW) algorithm (2002) is the standard multi-class spectral clustering:

  1. Build the normalized Laplacian LsymL_{\text{sym}}
  2. Compute the kk smallest eigenvectors u~1,,u~k\tilde{\mathbf{u}}_1, \ldots, \tilde{\mathbf{u}}_k (smallest eigenvalues of LsymL_{\text{sym}})
  3. Form UkRn×kU_k \in \mathbb{R}^{n \times k} with these eigenvectors as columns
  4. Normalize rows: let Yi,:=Uk,i,:/Uk,i,:2Y_{i,:} = U_{k,i,:} / \lVert U_{k,i,:} \rVert_2 (row normalization)
  5. Apply k-means to the rows of YY

Why row normalization? The perturbation argument below shows that in a perfect k-cluster graph, the rows of U_k lie exactly on k orthogonal vectors. Row normalization maps these to the same point on the unit sphere regardless of degree, making k-means converge cleanly.

Perturbation theory justification. Consider a "block graph" G_0 consisting of k disconnected cliques. The k smallest eigenvalues of L_{\text{sym}} are all 0, with eigenvectors spanned by the degree-scaled indicators D^{1/2}\mathbf{1}_{A_\ell} of the cliques. Any real graph with k communities can be seen as a perturbed block graph; if the perturbation (inter-community edges) is small, the leading eigenvectors stay close to the block indicators. Weyl's theorem bounds how much the eigenvalues move, and the Davis-Kahan theorem bounds how much the eigenvector subspace rotates.

8.5 Complete Algorithm and Implementation

SPECTRAL CLUSTERING ALGORITHM
========================================================================

  Input:  Adjacency matrix A in R^(n x n), number of clusters k
  Output: Cluster assignments c \in {1,...,k}^n

  1. Compute degree matrix D = diag(A*1)
  2. Compute normalized Laplacian L_sym = D^(-1/2) (D - A) D^(-1/2)
     (or use L_rw = I - D^(-1) A, but use L_sym for symmetric version)

  3. Compute k smallest eigenvalues and eigenvectors of L_sym
     -> eigenvectors form columns of U_k in R^(n x k)

  4. Normalize rows: Y_i = U_k[i,:] / ||U_k[i,:]||_2
     (skip for RatioCut; required for NCut)

  5. Apply k-means clustering to rows of Y
     -> cluster centers \mu_1,...,\mu_k; assignments c[i] \in {1,...,k}

  6. Return c

  Complexity: O(n^3) for full eigendecomposition;
              O(k*n*|E|) with Lanczos + k-means (large graphs)

========================================================================
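
A compact Python sketch of the algorithm above (the NJW variant), assuming SciPy and scikit-learn are available; for large sparse graphs the dense eigh call would be replaced by scipy.sparse.linalg.eigsh.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(A, k, seed=0):
    """A: (n,n) symmetric adjacency/affinity matrix, k: number of clusters."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(d)) - (d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :])
    lam, U = eigh(L_sym)                      # eigenvalues ascending
    U_k = U[:, :k]                            # k smallest eigenvectors
    Y = U_k / np.maximum(np.linalg.norm(U_k, axis=1, keepdims=True), 1e-12)  # row-normalize
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Y)
```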

Practical notes:

  • Use Lanczos algorithm or LOBPCG for computing the kk smallest eigenvectors of LsymL_{\text{sym}} on large sparse graphs (avoid full eigendecomposition)
  • The choice of kk can be guided by the eigengap heuristic: choose kk where the gap λk+1λk\lambda_{k+1} - \lambda_k is largest
  • K-means is run multiple times with random restarts; take the best result (lowest inertia)
  • For disconnected graphs, the kk zero eigenvalues directly give the cluster indicators

8.6 When Spectral Clustering Beats k-Means

K-means minimizes within-cluster variance assuming convex, isotropic, similarly-sized clusters. It fails on non-convex cluster shapes. Spectral clustering has no shape assumption - it works on any cluster structure that is well-separated in the graph.

When spectral clustering excels:

  • Concentric rings, spirals, moons - any shape detectable by graph connectivity
  • Clusters at multiple scales (nested communities)
  • Data with non-Euclidean structure (molecules, social networks)

When k-means excels:

  • Truly Gaussian clusters in Rd\mathbb{R}^d
  • Very large nn where eigenvector computation is too slow
  • Cluster structure is well-captured by Euclidean distance

A critical nuance: Spectral clustering requires building the adjacency/affinity graph first. The kk-NN graph or ϵ\epsilon-neighborhood graph choice matters enormously for quality. A common failure mode: if the graph is built with too small kk or ϵ\epsilon, communities may become disconnected even within a true cluster. If too large, community structure is washed out.
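
A small sketch of this failure mode, assuming scikit-learn is available: on the two-moons dataset, the number of connected components of the k-NN affinity graph changes sharply with k (the dataset, noise level, and k values are illustrative).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
for k in (2, 10, 50):
    W = kneighbors_graph(X, n_neighbors=k, mode='connectivity')
    W = W.maximum(W.T)                         # symmetrize the k-NN graph
    n_comp, _ = connected_components(W, directed=False)
    print(f"k={k:2d}: {n_comp} connected component(s)")
# Very small k often fragments a true cluster into several graph components;
# very large k merges the two moons into one densely connected blob.
```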


9. Laplacian Eigenmaps and Graph Embeddings

9.1 The Embedding Problem

Given a graph G=(V,E)G = (V, E), we want a mapping ϕ:VRd\phi: V \to \mathbb{R}^d (dnd \ll n) that preserves the graph structure: vertices that are nearby in the graph should be nearby in the embedding. Formally, we want:

ϕ=argminϕ:VRd(i,j)Ewijϕ(i)ϕ(j)2\phi = \arg\min_{\phi: V \to \mathbb{R}^d} \sum_{(i,j) \in E} w_{ij} \lVert \phi(i) - \phi(j) \rVert^2

subject to constraints that prevent the trivial solution ϕ(i)=0\phi(i) = \mathbf{0} for all ii.

Decomposing dimension by dimension, this is dd separate problems, each of the form:

minfRnfLfsubject to normalization and orthogonality constraints\min_{\mathbf{f} \in \mathbb{R}^n} \mathbf{f}^\top L \mathbf{f} \quad \text{subject to normalization and orthogonality constraints}

This is exactly minimizing the Dirichlet energy, solved by the eigenvectors of LL.

9.2 Laplacian Eigenmaps Algorithm

Belkin & Niyogi (2001/2003). Given nn data points x(1),,x(n)RD\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)} \in \mathbb{R}^D:

  1. Build the adjacency graph: Connect points ii and jj if they are among each other's kk nearest neighbors (or x(i)x(j)<ϵ\lVert \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \rVert < \epsilon).

  2. Set edge weights: Use the heat kernel wij=exp(x(i)x(j)2/t)w_{ij} = \exp(-\lVert \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \rVert^2 / t) for connected pairs (with t>0t > 0 a bandwidth parameter).

  3. Compute degree and Laplacian: Dii=jwijD_{ii} = \sum_j w_{ij}, L=DWL = D - W.

  4. Solve the generalized eigenvalue problem:

Lu=λDuL \mathbf{u} = \lambda D \mathbf{u}

Equivalently: find eigenvectors of Lrw=D1LL_{\text{rw}} = D^{-1}L (or LsymL_{\text{sym}}).

  5. Embed: Take the d eigenvectors \mathbf{u}_2, \mathbf{u}_3, \ldots, \mathbf{u}_{d+1} (skip \mathbf{u}_1 = \mathbf{1}) and set \phi(i) = (u_{2,i}, u_{3,i}, \ldots, u_{d+1,i}) \in \mathbb{R}^d.

Optimality theorem. The Laplacian eigenmap embedding is the solution to the optimization problem:

minf1,,fdkfkLfks.t. fkDfk=1,  fkDfj=0 for kj\min_{\mathbf{f}_1, \ldots, \mathbf{f}_d} \sum_k \mathbf{f}_k^\top L \mathbf{f}_k \quad \text{s.t. } \mathbf{f}_k^\top D \mathbf{f}_k = 1,\; \mathbf{f}_k^\top D \mathbf{f}_j = 0 \text{ for } k \neq j

The solution is \mathbf{f}_k = \mathbf{u}_{k+1}, the (k+1)-th generalized eigenvector of L\mathbf{u} = \lambda D\mathbf{u} (equivalently the (k+1)-th eigenvector of L_{\text{rw}}; the corresponding eigenvector of L_{\text{sym}} is D^{1/2}\mathbf{u}_{k+1}). This is optimal in the sense that no other d-dimensional embedding satisfying the constraints has smaller total Dirichlet energy.

Manifold learning interpretation. If the data points x(i)\mathbf{x}^{(i)} lie on a dd-dimensional manifold embedded in RD\mathbb{R}^D, the Laplacian eigenmap recovers the intrinsic coordinates of the manifold. As nn \to \infty and the bandwidth t0t \to 0 at an appropriate rate, the graph Laplacian converges to the Laplace-Beltrami operator on the manifold (Belkin & Niyogi, 2008).
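
A minimal sketch of the Belkin-Niyogi recipe on synthetic data, assuming scikit-learn and SciPy are available; the dataset, neighborhood size, and bandwidth choice are illustrative.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph

X, _ = make_swiss_roll(n_samples=500, random_state=0)
dist = kneighbors_graph(X, n_neighbors=10, mode='distance')
dist = dist.maximum(dist.T).toarray()            # symmetrized k-NN distances
t = np.median(dist[dist > 0]) ** 2               # heat-kernel bandwidth (illustrative choice)
W = np.where(dist > 0, np.exp(-dist**2 / t), 0.0)
D = np.diag(W.sum(axis=1))
L = D - W

# Generalized eigenproblem L u = lambda D u; skip the constant eigenvector u_1
lam, U = eigh(L, D)
embedding = U[:, 1:3]                            # 2-D Laplacian eigenmap coordinates
print(embedding.shape)                           # (500, 2)
```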

9.3 Diffusion Maps

Coifman & Lafon (2006) introduced diffusion maps as a multiscale version of Laplacian eigenmaps.

Define the diffusion operator M=D1WM = D^{-1}W (the random walk matrix) and its tt-step version MtM^t. The diffusion distance at scale tt:

Dt(i,j)2=kμk2t(ψk(i)ψk(j))2D_t(i,j)^2 = \sum_k \mu_k^{2t} (\psi_k(i) - \psi_k(j))^2

where μk,ψk\mu_k, \psi_k are eigenvalues/eigenvectors of MM. The diffusion map embedding:

Φt(i)=(μ2tψ2,i,μ3tψ3,i,,μdtψd,i)Rd1\Phi^t(i) = (\mu_2^t \psi_{2,i},\, \mu_3^t \psi_{3,i},\, \ldots,\, \mu_d^t \psi_{d,i}) \in \mathbb{R}^{d-1}

The Euclidean distance in the diffusion map equals the diffusion distance: Φt(i)Φt(j)=Dt(i,j)\lVert \Phi^t(i) - \Phi^t(j) \rVert = D_t(i,j).
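
A small sketch of the diffusion-map construction, computed through the symmetric conjugate D^{-1/2}WD^{-1/2} for numerical stability; the embedding dimension d and scale t are illustrative parameters.

```python
import numpy as np
import networkx as nx

def diffusion_map(W, d=2, t=3):
    """Diffusion-map coordinates from a symmetric weight matrix W."""
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    S = D_inv_sqrt @ W @ D_inv_sqrt              # symmetric; same spectrum as M = D^{-1} W
    mu, V = np.linalg.eigh(S)                    # ascending
    mu, V = mu[::-1], V[:, ::-1]                 # descending: mu_1 = 1 first
    psi = D_inv_sqrt @ V                         # right eigenvectors of M
    return (mu[1:d + 1] ** t) * psi[:, 1:d + 1]  # skip the trivial constant psi_1

W = nx.to_numpy_array(nx.karate_club_graph())
print(diffusion_map(W, d=2, t=3).shape)          # (34, 2)
```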

Multi-scale property. By varying tt, diffusion maps reveal structure at different scales:

  • Small tt: local neighborhood structure
  • Large t: global cluster structure (only eigenvectors with |\mu_k| close to 1 survive; the rest have \mu_k^t \approx 0)

9.4 Relationship to PCA and Kernel PCA

Kernel PCA (Scholkopf et al., 1998) computes the principal components of data in a feature space defined by a kernel k(x,y)k(\mathbf{x}, \mathbf{y}). For a kernel matrix KRn×nK \in \mathbb{R}^{n \times n} with Kij=k(x(i),x(j))K_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}), kernel PCA computes the eigenvectors of the centered kernel matrix.

Commute-time embedding. The commute time C(i,j)C(i,j) between vertices ii and jj is the expected number of steps for a random walk starting at ii to reach jj and return. It equals:

C(i,j)=vol(V)k=2n1λk((uk,i)di(uk,j)dj)2C(i,j) = \operatorname{vol}(V) \sum_{k=2}^n \frac{1}{\lambda_k}\left(\frac{(u_{k,i})}{\sqrt{d_i}} - \frac{(u_{k,j})}{\sqrt{d_j}}\right)^2

This is kernel PCA with the commute-time kernel K = \operatorname{vol}(V)\, L^\dagger, whose induced squared distances are exactly the commute times. In this sense Laplacian eigenmaps is a special case of kernel PCA with a graph-derived kernel.

9.5 Spectral Positional Encodings for Transformers

Standard Transformers process tokens with positional encodings to handle sequence order. Graph Transformers need analogous positional encodings for graph nodes - but graphs have no canonical ordering.

Laplacian Positional Encoding (LapPE). Use the eigenvectors of the graph Laplacian as node positional encodings:

PE(v)=[u2(v),u3(v),,uk+1(v)]Rk\text{PE}(v) = [\mathbf{u}_2(v), \mathbf{u}_3(v), \ldots, \mathbf{u}_{k+1}(v)] \in \mathbb{R}^k

where uj(v)\mathbf{u}_j(v) is the vv-th entry of the jj-th Laplacian eigenvector.

Challenge: Sign ambiguity. Each eigenvector uj\mathbf{u}_j is defined only up to sign: uj-\mathbf{u}_j is also a valid eigenvector. This creates non-uniqueness in the PE.

Solutions:

  • Random sign flips during training (Dwivedi et al., 2022): randomly flip signs in training; the Transformer learns sign-invariant functions
  • SignNet (Lim et al., 2022): use a Deep Sets architecture that is invariant to sign flips: ϕ(uj)+ϕ(uj)\phi(\mathbf{u}_j) + \phi(-\mathbf{u}_j)
  • BasisNet: extend to the case of repeated eigenvalues (multiplicity >1> 1), which introduce rotational ambiguity

RWPE (Random Walk Positional Encoding). Instead of Laplacian eigenvectors, use kk steps of a random walk:

RWPE(v)j=(Pj)vv=probability of returning to v after j steps\text{RWPE}(v)_j = (P^j)_{vv} = \text{probability of returning to } v \text{ after } j \text{ steps}

where P=D1AP = D^{-1}A. This avoids the sign ambiguity issue and is invariant to graph automorphisms. Used in GPS (Rampasek et al., 2022) - one of the best-performing graph Transformers.

Why LapPE/RWPE matter. Without positional encodings, graph Transformers cannot distinguish graph structure - all nodes with the same degree distribution look identical. Spectral PE gives each node a unique "spectral fingerprint" derived from its position in the graph's Fourier basis.
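
A minimal sketch computing both encodings for a small graph, assuming NetworkX is available; the karate-club graph and k = 4 PE dimensions are illustrative choices.

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
deg = A.sum(axis=1)
L = np.diag(deg) - A
k = 4

# LapPE: entries of eigenvectors u_2..u_{k+1} (sign is arbitrary per eigenvector)
lam, U = np.linalg.eigh(L)
lap_pe = U[:, 1:k + 1]                                  # shape (n, k)

# RWPE: return probabilities (P^j)_{vv} for j = 1..k, with P = D^{-1} A
P = A / deg[:, None]
rw_pe = np.stack([np.diag(np.linalg.matrix_power(P, j)) for j in range(1, k + 1)], axis=1)

print(lap_pe.shape, rw_pe.shape)    # (34, 4) (34, 4)
```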


10. Directed Graph Spectra

10.1 Directed Laplacians

For a directed graph (digraph) G=(V,E)G = (V, E) with EV×VE \subseteq V \times V (ordered pairs), the adjacency matrix AA is not symmetric: Aij=1A_{ij} = 1 if (ij)E(i \to j) \in E but AjiA_{ji} may be 00.

In-degree and out-degree: For each vertex ii, diin=jAjid_i^{\text{in}} = \sum_j A_{ji} (number of incoming edges) and diout=jAijd_i^{\text{out}} = \sum_j A_{ij} (outgoing edges).

Out-degree Laplacian: Lout=DoutAL^{\text{out}} = D^{\text{out}} - A where Dout=diag(d1out,,dnout)D^{\text{out}} = \operatorname{diag}(d_1^{\text{out}}, \ldots, d_n^{\text{out}}).

In-degree Laplacian: Lin=DinAL^{\text{in}} = D^{\text{in}} - A^\top (or equivalently, the out-degree Laplacian of the reversed graph).

Key difference from undirected case:

  • LoutL^{\text{out}} is NOT symmetric in general
  • Eigenvalues may be complex
  • The row-sum-zero property holds: Lout1=0L^{\text{out}}\mathbf{1} = \mathbf{0} (since each row sums to dioutdiout=0d_i^{\text{out}} - d_i^{\text{out}} = 0)
  • But column sums are d_j^{\text{out}} - d_j^{\text{in}}, not necessarily zero

Stationary distribution. The directed random walk P = (D^{\text{out}})^{-1}A is row-stochastic. For a strongly connected digraph, the unique stationary distribution \boldsymbol{\pi} satisfies \boldsymbol{\pi}^\top P = \boldsymbol{\pi}^\top. Unlike the undirected case, where \pi_i \propto d_i (uniform for d-regular graphs), there is no closed form in general: \boldsymbol{\pi} must be computed, e.g., by power iteration.

10.2 Kirchhoff's Matrix-Tree Theorem

Theorem (Kirchhoff, 1847). For a connected undirected graph G, the number of spanning trees \tau(G) equals any cofactor of L; in terms of eigenvalues,

τ(G)=1nk=2nλk(L)\tau(G) = \frac{1}{n}\prod_{k=2}^n \lambda_k(L)

where λ2,,λn\lambda_2, \ldots, \lambda_n are the non-zero eigenvalues of LL.

Proof sketch. The cofactor form of the theorem (proved via the incidence matrix and the Cauchy-Binet formula) says that \tau(G) equals every (n-1) \times (n-1) principal minor of L. These n principal minors are all equal and sum to the coefficient of \lambda in the characteristic polynomial, which is \lambda_2\lambda_3\cdots\lambda_n because \lambda_1 = 0; hence each equals \frac{1}{n}\lambda_2\cdots\lambda_n.

Examples:

  • KnK_n: λ2==λn=n\lambda_2 = \cdots = \lambda_n = n (all equal), so τ(Kn)=nn2\tau(K_n) = n^{n-2} (Cayley's formula).
  • PnP_n (path): τ(Pn)=1\tau(P_n) = 1 (only one spanning tree - the path itself).
  • CnC_n (cycle): τ(Cn)=n\tau(C_n) = n.

For AI: The number of spanning trees measures "graph robustness." Networks with many spanning trees (like expanders) remain connected even after many edge failures. This metric appears in network reliability analysis for distributed training clusters.
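
A quick numerical check of the two forms of the theorem on the cycle C_6 (which has exactly 6 spanning trees); the graph choice is illustrative.

```python
import numpy as np
import networkx as nx

G = nx.cycle_graph(6)
L = nx.laplacian_matrix(G).toarray().astype(float)
n = L.shape[0]

cofactor = np.linalg.det(L[1:, 1:])          # delete row/column 0
lam = np.sort(np.linalg.eigvalsh(L))
eig_product = np.prod(lam[1:]) / n           # (1/n) * lambda_2 ... lambda_n

print(round(cofactor), round(eig_product))   # both print 6
```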

10.3 PageRank as a Spectral Problem

PageRank (Page, Brin, Motwani, Winograd, 1998) - the algorithm behind Google Search - is fundamentally a spectral computation on a directed graph.

Setup. Model the Web as a directed graph: pages are vertices, hyperlinks are directed edges. Define the Google matrix:

G=αP+(1α)11nG = \alpha P + (1-\alpha) \frac{\mathbf{1}\mathbf{1}^\top}{n}

where P = (D^{\text{out}})^{-1}A is the row-stochastic random-walk matrix (dangling pages with no out-links are patched, e.g., by replacing their rows with the uniform distribution so that P is well defined), \alpha \in (0,1) is the damping factor (typically \alpha = 0.85), and (1-\alpha)\mathbf{1}\mathbf{1}^\top/n represents teleportation (random jumps to any page).

PageRank vector. The PageRank of each page is the stationary distribution π\boldsymbol{\pi} of the Markov chain defined by GG:

π=πG    π(IG)=0\boldsymbol{\pi}^\top = \boldsymbol{\pi}^\top G \implies \boldsymbol{\pi}^\top(I - G) = \mathbf{0}

Equivalently, π\boldsymbol{\pi} is the dominant left eigenvector of GG (eigenvalue 11).

Spectral computation. By the Perron-Frobenius theorem, GG is a positive stochastic matrix (all entries >0> 0 due to the teleportation term), so it has a unique dominant eigenvalue μ1=1\mu_1 = 1 with a unique positive eigenvector π\boldsymbol{\pi}.

Power iteration. PageRank is computed by:

\boldsymbol{\pi}^{(t+1)\top} = \boldsymbol{\pi}^{(t)\top} G = \alpha\, \boldsymbol{\pi}^{(t)\top} P + (1-\alpha)\frac{\mathbf{1}^\top}{n}

Convergence rate: geometric with ratio α\alpha - the second eigenvalue of GG is at most α\alpha. Each iteration is a sparse matrix-vector multiply O(E)O(|E|).
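
A minimal power-iteration sketch; the edge list is an arbitrary illustrative citation-style graph, and dangling nodes are patched with uniform rows (a standard fix, assumed here rather than taken from the text).

```python
import numpy as np

def pagerank(A, alpha=0.85, tol=1e-10, max_iter=1000):
    """A[i, j] = 1 if there is an edge i -> j."""
    n = A.shape[0]
    out_deg = A.sum(axis=1)
    # Row-stochastic P; dangling rows (no out-links) replaced by the uniform distribution
    P = np.where(out_deg[:, None] > 0, A / np.maximum(out_deg, 1)[:, None], 1.0 / n)
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        pi_next = alpha * (pi @ P) + (1 - alpha) / n   # one mat-vec per iteration
        if np.abs(pi_next - pi).sum() < tol:
            return pi_next
        pi = pi_next
    return pi

A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 0), (3, 2), (4, 2), (0, 2)]:
    A[i, j] = 1.0
print(pagerank(A))          # sums to 1; node 2 receives the largest score
```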

For AI (RLHF and LLM preference graphs): In reinforcement learning from human feedback (RLHF), preference data can be modeled as a directed graph over responses, with edge (ri,rj)(r_i, r_j) meaning "response rir_i is preferred over rjr_j." PageRank on this preference graph gives a global ranking consistent with pairwise preferences. This is closely related to Bradley-Terry models used in reward model training (Ouyang et al., 2022).

10.4 Directed Graph Spectra in AI

Attention as a directed graph. In a Transformer, the attention weights AijA_{ij} define a directed weighted graph over tokens. The spectral properties of this attention graph have interpretability implications:

  • The dominant eigenvector of the attention matrix identifies "hub" tokens - tokens that receive most attention
  • Spectral analysis of attention graphs has been used in mechanistic interpretability to identify "induction heads" and "name mover heads" (Olsson et al., 2022)
  • The spectral gap of the attention graph determines how quickly information mixes across token positions

Causal DAGs. In causal inference (Chapter 22), structural causal models are represented as DAGs. The adjacency matrix of a DAG is nilpotent (under a topological ordering it is strictly upper triangular, so all eigenvalues are zero); the smallest k with A^k = 0 is one plus the length of the longest directed path, so the powers of A, rather than a spectral radius, capture the depth of causal chains and the reach of long-range causal effects.


11. Advanced Topics

11.1 Spectral Sparsification

Problem. For a dense graph GG with nn vertices and Θ(n2)\Theta(n^2) edges, many spectral algorithms are too slow. Can we find a sparse graph G~\tilde{G} with O(nlogn)O(n \log n) edges that preserves the spectrum of LL?

Definition. G~\tilde{G} is an ϵ\epsilon-spectral sparsifier of GG if for all xRn\mathbf{x} \in \mathbb{R}^n:

(1ϵ)xLGxxLG~x(1+ϵ)xLGx(1-\epsilon)\mathbf{x}^\top L_G \mathbf{x} \leq \mathbf{x}^\top L_{\tilde{G}} \mathbf{x} \leq (1+\epsilon)\mathbf{x}^\top L_G \mathbf{x}

Equivalently, (1ϵ)LGLG~(1+ϵ)LG(1-\epsilon)L_G \preceq L_{\tilde{G}} \preceq (1+\epsilon)L_G in the PSD order.

Theorem (Spielman & Srivastava, 2011). Every graph GG has an ϵ\epsilon-spectral sparsifier with O(nlogn/ϵ2)O(n \log n / \epsilon^2) edges, computable in near-linear time using random sampling weighted by effective resistances.

Effective resistance. The effective resistance Reff(i,j)R_{\text{eff}}(i,j) between vertices ii and jj is the electrical resistance between them when unit resistors are placed on each edge. It equals:

Reff(i,j)=(eiej)L(eiej)R_{\text{eff}}(i,j) = (\mathbf{e}_i - \mathbf{e}_j)^\top L^\dagger (\mathbf{e}_i - \mathbf{e}_j)

where LL^\dagger is the pseudoinverse of LL.

Algorithm: Sample each edge (i,j)(i,j) independently with probability proportional to wijReff(i,j)w_{ij} R_{\text{eff}}(i,j), and rescale the weight. The resulting sparse graph preserves all spectral properties.
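
A toy sketch of resistance-weighted sampling using a dense pseudoinverse (only viable for small graphs; Spielman-Srivastava compute approximate resistances in near-linear time). The graph, sample count, and unit edge weights are illustrative.

```python
import numpy as np
import networkx as nx

G = nx.erdos_renyi_graph(60, 0.3, seed=0)
n = G.number_of_nodes()
L = nx.laplacian_matrix(G).toarray().astype(float)
L_pinv = np.linalg.pinv(L)

edges = list(G.edges())
# Effective resistance of edge (i, j): (e_i - e_j)^T L^+ (e_i - e_j)
r_eff = np.array([L_pinv[i, i] + L_pinv[j, j] - 2 * L_pinv[i, j] for i, j in edges])
p = r_eff / r_eff.sum()                 # sampling probabilities (unit edge weights)

q = 2 * int(n * np.log(n))              # number of samples (illustrative constant)
rng = np.random.default_rng(0)
counts = rng.multinomial(q, p)

L_sparse = np.zeros_like(L)
for (i, j), c, pe in zip(edges, counts, p):
    if c > 0:
        w = c / (q * pe)                # reweight so that E[L_sparse] = L
        L_sparse[i, i] += w; L_sparse[j, j] += w
        L_sparse[i, j] -= w; L_sparse[j, i] -= w

print(f"kept {int((counts > 0).sum())} of {len(edges)} edges")
```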

For AI: Spectral sparsification can reduce the computational cost of graph-based ML. A 10-million-edge social graph can be sparsified to 500K\sim 500K edges while preserving spectral clustering quality.

11.2 Random Matrix Theory and Graph Spectra

The spectrum of a random graph has universal limiting behavior described by random matrix theory.

Erdos-Renyi model G(n,p)G(n, p). For a random graph where each edge appears independently with probability pp, the empirical spectral distribution of A/np(1p)A/\sqrt{np(1-p)} converges to the semicircle law (Wigner, 1955):

ρ(λ)=12π4λ2for λ[2,2]\rho(\lambda) = \frac{1}{2\pi}\sqrt{4 - \lambda^2} \quad \text{for } \lambda \in [-2, 2]

The leading eigenvalue separates from the bulk at \mu_1 \approx np (the bulk edge sits near 2\sqrt{np(1-p)}) once np grows large; above the connectivity threshold p \gtrsim \ln n / n the graph is connected with high probability.
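
A quick simulation sketch comparing the scaled bulk spectrum of a G(n, p) sample with the semicircle density; n and p are illustrative.

```python
import numpy as np

n, p = 1000, 0.05
rng = np.random.default_rng(0)
A = np.triu(rng.random((n, n)) < p, 1).astype(float)
A = A + A.T                                     # symmetric G(n, p) adjacency matrix

eigs = np.linalg.eigvalsh(A) / np.sqrt(n * p * (1 - p))
bulk = eigs[:-1]                                # drop the Perron outlier mu_1 ~ np
hist, edges = np.histogram(bulk, bins=40, range=(-2.2, 2.2), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
semicircle = np.where(np.abs(centers) <= 2,
                      np.sqrt(np.maximum(4 - centers**2, 0)) / (2 * np.pi), 0.0)
print(round(float(np.abs(hist - semicircle).mean()), 3))   # small average deviation
```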

Implications for GNNs:

  • Random weight matrices in GNNs have spectra approximated by the semicircle law (for large enough hidden dimensions)
  • The alignment between the spectra of data graph AA and random weight matrices WW affects gradient flow in training
  • Spectral norm regularization of GNN weights controls training stability by constraining the spectral radius

11.3 Graph Wavelets

Motivation. Laplacian eigenvectors are global: uk\mathbf{u}_k is supported on all nn vertices. For signals with local structure (e.g., a signal that varies in one part of the graph but is constant elsewhere), global eigenvectors are inefficient. We need a local, multiscale basis - a graph wavelet transform.

Hammond, Vandergheynst & Gribonval (2011). For a vertex tt and scale ss, define the graph wavelet centered at tt at scale ss:

ψs,t=gs(L)δt\psi_{s,t} = g_s(L)\boldsymbol{\delta}_t

where δt\boldsymbol{\delta}_t is the indicator vector of vertex tt and gs(λ)=sg(sλ)g_s(\lambda) = s \cdot g(s\lambda) is a scaled spectral filter (bandpass at frequency 1/s1/s). In the spectral domain:

\hat{\psi}_{s,t}(\lambda_k) = g_s(\lambda_k)\,(U^\top \boldsymbol{\delta}_t)_k = g_s(\lambda_k)\, u_k(t)

Properties of graph wavelets:

  • Spatially localized: if gg has compact spectral support [ωmin,ωmax][\omega_{\min}, \omega_{\max}], then ψs,t\psi_{s,t} is KK-hop localized where KK depends on the bandwidth and ss
  • Frequency selective: wavelets at scale ss are sensitive to frequency 1/s\approx 1/s
  • Frame bounds: for appropriate gg, {ψs,t}\{\psi_{s,t}\} forms a frame (redundant but stable basis)

Scattering transform on graphs (Gama et al., 2019): Compose multiple wavelet transforms with pointwise nonlinearities to build invariant/equivariant features. Provides theoretical guarantees for GNN expressiveness.

11.4 Infinite Graphs and Spectral Measures

For infinite graphs (e.g., the integer lattice \mathbb{Z}^d, infinite trees), the Laplacian acts as a self-adjoint operator on \ell^2(V) (bounded when vertex degrees are uniformly bounded, unbounded otherwise), and the spectrum is no longer a finite list of eigenvalues but is described by a spectral measure \mu_L.

Example: Integer lattice Zd\mathbb{Z}^d. The Laplacian LL on Zd\mathbb{Z}^d has a continuous spectrum [0,4d][0, 4d] (the dd-dimensional discrete Laplacian spectrum). This connects to the theory of periodic operators in solid-state physics (Bloch's theorem).

Spectral measure. For a vertex vv, the spectral measure μv\mu_v is defined by:

δv,f(L)δv=f(λ)dμv(λ)\langle \boldsymbol{\delta}_v, f(L) \boldsymbol{\delta}_v \rangle = \int f(\lambda)\, d\mu_v(\lambda)

The spectral measure encodes everything about the local geometry of the graph as seen from vv.

Convergence of finite graphs. If a sequence of finite graphs GnG_n converges in the Benjamini-Schramm sense to an infinite graph GG_\infty, the empirical spectral distributions of LGnL_{G_n} converge weakly to the spectral measure of LGL_{G_\infty}.

Preview: The spectral theory of infinite-dimensional operators is the subject of Chapter 12: Functional Analysis, where Hilbert spaces, unbounded operators, and spectral measures are developed fully.


12. Applications in Machine Learning

12.1 Semi-Supervised Learning on Graphs

The problem. Given a graph GG with nn nodes, a few labeled nodes LV\mathcal{L} \subset V with labels yi{1,,k}y_i \in \{1, \ldots, k\}, and many unlabeled nodes U=VL\mathcal{U} = V \setminus \mathcal{L}, assign labels to all unlabeled nodes.

Graph-based regularization (Zhou et al., 2004; Zhu et al., 2003). Find a label function f:V[0,1]k\mathbf{f}: V \to [0,1]^k that:

  1. Agrees with the given labels on L\mathcal{L}
  2. Is smooth on the graph (nearby nodes have similar labels)

The objective:

minfiLfiyi2+γ(i,j)Ewijfifj2=minffLy2+γtr(fLf)\min_{\mathbf{f}} \sum_{i \in \mathcal{L}} \lVert \mathbf{f}_i - \mathbf{y}_i \rVert^2 + \gamma \sum_{(i,j) \in E} w_{ij} \lVert \mathbf{f}_i - \mathbf{f}_j \rVert^2 = \min_\mathbf{f} \lVert \mathbf{f}_\mathcal{L} - \mathbf{y} \rVert^2 + \gamma \operatorname{tr}(\mathbf{f}^\top L \mathbf{f})

The closed-form solution involves (I+γL)1(I + \gamma L)^{-1} - a smoothing operator. In the spectral domain:

f^k=y^k1+γλk\hat{f}_k = \frac{\hat{y}_k}{1 + \gamma\lambda_k}

High-frequency components (λk\lambda_k large) are strongly regularized toward zero; low-frequency components are preserved.

Connection to label propagation. The Gaussian Fields and Harmonic Functions algorithm (Zhu et al., 2003) sets labeled node values to the true labels and propagates via:

\mathbf{f}_\mathcal{U} = -L_{\mathcal{U}\mathcal{U}}^{-1} L_{\mathcal{U}\mathcal{L}}\, \mathbf{y}_\mathcal{L} = L_{\mathcal{U}\mathcal{U}}^{-1} W_{\mathcal{U}\mathcal{L}}\, \mathbf{y}_\mathcal{L}

(where LUUL_{\mathcal{U}\mathcal{U}} is the Laplacian restricted to unlabeled nodes). This harmonic interpolation assigns each unlabeled node the weighted average of its neighbors' labels, with weights determined by graph structure.
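
A minimal sketch of this harmonic interpolation on the karate-club graph, assuming NetworkX is available; labeling nodes 0 and 33 (+1/-1) is an illustrative choice.

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
L = np.diag(A.sum(axis=1)) - A
n = len(A)

labeled = np.array([0, 33])
y_labeled = np.array([+1.0, -1.0])
unlabeled = np.array([i for i in range(n) if i not in labeled])

# f_U = -L_UU^{-1} L_UL y_L  (equivalently L_UU^{-1} W_UL y_L)
L_UU = L[np.ix_(unlabeled, unlabeled)]
L_UL = L[np.ix_(unlabeled, labeled)]
f_U = -np.linalg.solve(L_UU, L_UL @ y_labeled)

pred = np.empty(n)
pred[labeled] = y_labeled
pred[unlabeled] = f_U
print((pred > 0).astype(int))     # predicted community membership per node
```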

Modern variant: GCN for semi-supervised learning. The two-layer GCN of Kipf & Welling (2017) was originally proposed for exactly this task: semi-supervised node classification. The Laplacian smoothing is built into the propagation rule, making it a parameterized (learnable) version of label propagation.

12.2 Knowledge Graph Analysis

A knowledge graph (KG) represents world knowledge as a graph: entities (nodes) connected by typed relations (edges). Examples: Freebase, Wikidata, ConceptNet, UMLS (medical).

Spectral properties of KGs:

  • KGs are heterogeneous (multiple edge types) and sparse (E=O(V)|E| = O(|V|))
  • The adjacency spectrum often follows a power law: many small eigenvalues, a few large ones
  • The spectral gap λ2\lambda_2 measures how well-integrated the KG is: a small gap indicates a KG with isolated sub-graphs (different domains not well-connected)

Spectral regularization. KG embedding models (TransE, RotatE, ComplEx) learn entity and relation embeddings. Adding a spectral regularization term:

Lsmooth=rtr(ErLrEr)\mathcal{L}_{\text{smooth}} = \sum_r \text{tr}(E_r^\top L_r E_r)

encourages entity embeddings to be smooth with respect to each relation type rr's graph - entities connected by relation rr should have similar embeddings. This improves link prediction accuracy, especially for rare relations.

12.3 Molecular Property Prediction

Molecules are naturally represented as graphs: atoms are nodes, chemical bonds are edges. Predicting molecular properties (solubility, toxicity, drug-likeness) from molecular graphs is a key application of GNNs.

Spectral molecular fingerprints. The eigenvalue spectrum of the molecular graph Laplacian provides rotation- and permutation-invariant descriptors. The "spectral profile" (λ1,λ2,,λn)(\lambda_1, \lambda_2, \ldots, \lambda_n) uniquely characterizes many molecular structures.

Graph edit distance and spectral distance. Two molecules have similar properties if they have similar spectral profiles. The distance:

dspec(G1,G2)=λ(G1)λ(G2)2d_{\text{spec}}(G_1, G_2) = \lVert \boldsymbol{\lambda}(G_1) - \boldsymbol{\lambda}(G_2) \rVert_2

(where eigenvalues are sorted and zero-padded to the same length) approximates graph edit distance and correlates with molecular similarity.

Equivariance and invariance. Spectral fingerprints are invariant to atom permutation (graph isomorphism), which is the correct invariance for molecular property prediction. However, they are blind to chirality (mirror image molecules) - a known limitation requiring higher-order structural features.

12.4 Attention Pattern Analysis in LLMs

An h-head attention layer in a Transformer computes h attention weight matrices \{A^{(1)}, \ldots, A^{(h)}\}, each A^{(k)} \in \mathbb{R}^{T \times T} for sequence length T. These define h weighted directed graphs over the token positions.

Spectral analysis of attention. The eigenvalues of A(k)A^{(k)} reveal the attention pattern structure:

  • If A(k)11/TA^{(k)} \approx \mathbf{1}\mathbf{1}^\top/T (uniform attention): μ1=1\mu_1 = 1, all others 0\approx 0
  • If A(k)IA^{(k)} \approx I (attend only to self): μ1==μT=1\mu_1 = \cdots = \mu_T = 1
  • Induction heads (Olsson et al., 2022) have A(k)A^{(k)} with large spectral gap between μ1\mu_1 and μ2\mu_2 - they attend sharply to a few positions

Attention graph Laplacian. Define the symmetrized attention Laplacian L(k)=D(k)(A(k)+A(k))/2L^{(k)} = D^{(k)} - (A^{(k)} + A^{(k)^\top})/2. The Fiedler vector u2(k)\mathbf{u}_2^{(k)} of L(k)L^{(k)} identifies the two groups of tokens most separated by head kk's attention.
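
A small sketch of this analysis for a single head, using a random row-softmax matrix as a stand-in for real attention weights; shapes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 8))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row-softmax "attention"

S = 0.5 * (attn + attn.T)                 # symmetrize the attention graph
L = np.diag(S.sum(axis=1)) - S            # attention-graph Laplacian
lam, U = np.linalg.eigh(L)

spectral_gap = lam[1]                     # lambda_2: near 0 => bottleneck between token groups
fiedler_split = U[:, 1] >= 0              # two groups of tokens separated by this head
print(round(spectral_gap, 4), fiedler_split.astype(int))
```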

Applications:

  • Attention head pruning: Heads whose attention graphs have very small spectral gap (uniform attention) contribute little and can be pruned (Michel et al., 2019; Voita et al., 2019)
  • Mechanistic interpretability: Spectral analysis of multi-head attention composition identifies information routing circuits (Elhage et al., 2021)
  • Context window analysis: The Laplacian spectrum of attention graphs grows as more tokens are added; sudden changes in λ2\lambda_2 indicate "phase transitions" in how the model processes context

13. Common Mistakes

  1. Confusing L = D - A with L = A - D.
     Why it's wrong: the sign convention L \succeq 0 requires L = D - A; with A - D, all eigenvalues are \leq 0.
     Fix: always check L_{ii} = d_i > 0 (positive diagonal) to verify the sign convention.

  2. Using unnormalized L for spectral clustering on graphs with varying degrees.
     Why it's wrong: RatioCut (unnormalized) penalizes unequally-sized partitions; for real data, NCut (normalized) gives much better clusters.
     Fix: use L_{\text{sym}} or L_{\text{rw}} for spectral clustering; reserve unnormalized L for (near-)regular graphs.

  3. Taking u_1 (index 1) instead of u_2 as the Fiedler vector.
     Why it's wrong: u_1 is the constant vector \mathbf{1}/\sqrt{n} (trivial null vector); it carries no discriminative information.
     Fix: in NumPy, eigenvectors are sorted by ascending eigenvalue; take column index 1 (0-indexed), not index 0.

  4. Ignoring the sign ambiguity of eigenvectors.
     Why it's wrong: for each eigenvector u_k, both u_k and -u_k are valid; different runs give different signs.
     Fix: use absolute values for visualizations, train with random sign flips, or use RWPE instead of LapPE to avoid sign issues.

  5. Confusing the direction of the Cheeger inequality: \lambda_2 \leq 2h vs. h \geq \lambda_2/2.
     Why it's wrong: these are the same inequality; the confusing part is that "large \lambda_2" implies "large h" (good expander), not "small h".
     Fix: remember: small \lambda_2 <-> bottleneck <-> small h <-> easy to cut; large \lambda_2 <-> expander <-> large h <-> hard to cut.

  6. Computing the Laplacian of a disconnected graph and expecting \lambda_2 > 0.
     Why it's wrong: for a disconnected graph, \lambda_2 = 0 always; the null space has dimension equal to the number of components.
     Fix: check connectivity before spectral clustering; handle each component separately or add a small connectivity term.

  7. Treating spectral clustering as scale-free (the same regardless of k).
     Why it's wrong: the cluster structure at scale k uses eigenvectors u_2, ..., u_{k+1}; the k-th eigenvector captures increasingly fine-grained structure.
     Fix: choose k with the eigengap heuristic: k^* = \arg\max_k (\lambda_{k+1} - \lambda_k).

  8. Applying GCN (a low-pass filter) to heterophilic graphs.
     Why it's wrong: GCN smooths features toward neighborhood averages; when connected nodes have different labels, this destroys discriminative information.
     Fix: use high-pass or band-pass graph filters (e.g., GPRGNN, FAGCN, BernNet) for heterophilic settings.

  9. Confusing L_{\text{sym}} and L_{\text{rw}}: using L_{\text{rw}} for Ng-Jordan-Weiss.
     Why it's wrong: NJW requires L_{\text{sym}} (symmetric, orthogonal eigenvectors) for row normalization to work; L_{\text{rw}} is not symmetric, so its eigenvectors are not orthogonal.
     Fix: use scipy.linalg.eigh(L_sym) for a symmetric eigendecomposition; its eigenvectors form orthonormal columns.

  10. Over-interpreting spectral methods on cospectral graphs.
      Why it's wrong: two different graphs can have identical Laplacian spectra; spectral features cannot distinguish them.
      Fix: augment spectral features with structural features (degree, triangle count, etc.) or use WL-based methods.

  11. Forgetting the renormalization trick in the GCN derivation.
      Why it's wrong: the first-order filter I + D^{-1/2}AD^{-1/2} has eigenvalues in [0, 2], so stacking layers can cause exploding or vanishing activations.
      Fix: always add self-loops \tilde{A} = A + I and renormalize with \tilde{D} in the GCN propagation rule, which shrinks the spectral radius.

  12. Using the full GFT (O(n^3)) on graphs with n > 1000.
      Why it's wrong: full eigendecomposition is O(n^3); for n = 10^6, this is 10^{18} operations - completely intractable.
      Fix: use polynomial filters (Chebyshev, K sparse matrix-vector products), Lanczos for the top-k eigenvectors, or RWPE.

14. Exercises

Exercise 1 * - Laplacian Construction and Properties

For the following graph GG: vertices {1,2,3,4}\{1, 2, 3, 4\}, edges {(1,2),(2,3),(3,4),(4,1),(1,3)}\{(1,2), (2,3), (3,4), (4,1), (1,3)\} (an unweighted undirected graph):

(a) Write out AA, DD, and L=DAL = D - A. (b) Verify xLx=(i,j)E(xixj)2\mathbf{x}^\top L \mathbf{x} = \sum_{(i,j)\in E}(x_i - x_j)^2 for x=[1,2,0,1]\mathbf{x} = [1, 2, 0, -1]^\top. (c) Compute all eigenvalues of LL. How many connected components does GG have? (d) Compute LsymL_{\text{sym}} and verify its eigenvalues lie in [0,2][0, 2]. (e) For AI: Which eigenvector of LL would be used for spectral bisection? What partition does it suggest?


Exercise 2 * - Spectrum of Special Graphs

(a) Derive the eigenvalues of LL for the cycle graph C5C_5 (5 vertices in a cycle). Show all work. (b) Compute λ2(C5)\lambda_2(C_5). Is the cycle more or less connected (in the algebraic sense) than the path P5P_5? Use the formula for λ2(Pn)\lambda_2(P_n). (c) For the complete graph KnK_n: prove that λ2(LKn)=n\lambda_2(L_{K_n}) = n and all λ2,,λn\lambda_2, \ldots, \lambda_n are equal. (d) For a star graph SnS_n (one hub, n1n-1 leaves): find all eigenvalues and explain geometrically why λ2=1\lambda_2 = 1 regardless of nn.


Exercise 3 * - Fiedler Vector and Graph Bisection

Given a barbell graph: two cliques K5K_5 connected by a single bridge edge (u,v)(u, v):

(a) Describe qualitatively what the Fiedler vector looks like without computing it. Which vertices get positive values? Negative? (b) Implement the graph in NumPy, compute LL, and find u2\mathbf{u}_2 using scipy.linalg.eigh. Plot the Fiedler vector values at each vertex. (c) Use the Fiedler vector to perform spectral bisection. What is E(S,Sˉ)|E(S, \bar{S})| for the resulting cut? (d) What is λ2\lambda_2 for this graph? Is it close to 0? What does this say about the graph's connectivity?


Exercise 4 ** - Cheeger Inequality Verification

For the path graph P10P_{10} (10 vertices in a line):

(a) Compute λ2(Lsym)\lambda_2(L_{\text{sym}}) analytically using the known formula. (b) Find the Cheeger constant h(P10)h(P_{10}) by enumerating the optimal cut. (Hint: by symmetry, the optimal cut is in the middle.) (c) Verify that Cheeger's inequality λ2/2h2λ2\lambda_2/2 \leq h \leq \sqrt{2\lambda_2} holds. How tight are the bounds? (d) Implement the Fiedler vector sweep algorithm to find a cut with conductance 2λ2\leq \sqrt{2\lambda_2}. (e) For AI: If P10P_{10} were the attention graph of a 10-token sequence, what does the Cheeger constant tell you about information flow?


Exercise 5 ** - Graph Fourier Transform

Define a "community signal" x\mathbf{x} on the karate club graph (Zachary 1977, available in NetworkX) where xi=+1x_i = +1 if node ii is in community 1 and xi=1x_i = -1 otherwise.

(a) Compute the GFT x^=Ux\hat{\mathbf{x}} = U^\top \mathbf{x} of the community signal. (b) Plot x^k2|\hat{x}_k|^2 vs. λk\lambda_k. Is the community signal concentrated in low or high frequencies? (c) Define a "noisy" signal x=x+0.5η\mathbf{x}' = \mathbf{x} + 0.5\boldsymbol{\eta} where η\boldsymbol{\eta} is Gaussian noise. Apply a low-pass filter g(λ)=1[λλ5]g(\lambda) = \mathbf{1}[\lambda \leq \lambda_5] to x\mathbf{x}' (keep only the first 5 frequency components). (d) Compare the filtered signal to the true community signal. What fraction of nodes are correctly assigned? (e) How does this connect to label propagation in semi-supervised learning?


Exercise 6 ** - Spectral Clustering

Generate a synthetic graph with 3 communities using the Stochastic Block Model:

  • 3 blocks of 50 nodes each
  • Intra-block edge probability pin=0.3p_{\text{in}} = 0.3, inter-block pout=0.02p_{\text{out}} = 0.02

(a) Compute the unnormalized Laplacian LL. Plot the first 6 eigenvalues. Where is the largest eigengap? (b) Implement the NJW spectral clustering algorithm (Ng-Jordan-Weiss) for k=3k=3. (c) Compute the accuracy of the spectral clustering (comparing to the known ground-truth communities, accounting for label permutations). (d) Repeat with pout=0.15p_{\text{out}} = 0.15 (near the phase transition). How does the clustering accuracy degrade? (e) For AI: How does the eigengap heuristic perform? Plot accuracy vs. kk to verify the correct number of clusters is detected.


Exercise 7 *** - Laplacian Positional Encodings

Implement Laplacian Positional Encodings (LapPE) and test them on a simple graph classification task:

(a) For each graph in a small graph dataset (or a synthetic set with 3 classes: path, cycle, star variants), compute the top-kk Laplacian eigenvectors as node features. (b) Handle sign ambiguity by randomly flipping signs of each eigenvector at each forward pass (as in Dwivedi et al., 2022). Show that a model trained this way is sign-invariant. (c) Implement RWPE as an alternative: RWPE(v)j=(Pj)vv\text{RWPE}(v)_j = (P^j)_{vv} for j=1,,kj = 1, \ldots, k. Compare LapPE and RWPE on the classification task. (d) Explain theoretically why RWPE avoids the sign ambiguity problem while LapPE does not. (e) For AI: In GPT-style attention, can you use RWPE to give the model a "graph-aware" positional encoding? What would this enable for graph reasoning tasks?


Exercise 8 *** - PageRank and Spectral Analysis

Construct a small directed graph representing a citation network (10 papers, edges from citing paper to cited paper):

(a) Implement power iteration to compute the PageRank vector π\boldsymbol{\pi} with damping factor α=0.85\alpha = 0.85. Verify convergence. (b) Compute the dominant eigenvalue and eigenvector of the Google matrix G=αP+(1α)11/nG = \alpha P + (1-\alpha)\mathbf{1}\mathbf{1}^\top/n directly via scipy.linalg.eig. Compare to the power iteration result. (c) Add a "dangling node" (a paper with no outgoing citations). How does this affect the Google matrix? How is it handled in practice? (d) Compute the mixing time: how many iterations of power iteration are needed to achieve π(t)π<106\lVert \boldsymbol{\pi}^{(t)} - \boldsymbol{\pi}^* \rVert < 10^{-6}? How does this relate to α\alpha? (e) For AI: In RLHF, model responses can be ranked using a directed preference graph. Implement PageRank-based ranking on a set of 5 responses with pairwise preference comparisons. How does it compare to simple win-count ranking?


15. Why This Matters for AI (2026 Perspective)

  • Graph Laplacian spectrum: Foundation of Graph Convolutional Networks (Kipf & Welling, 2017); the GCN layer is a first-order Chebyshev filter; used in every graph-based ML system
  • Fiedler vector / spectral bisection: Graph partitioning for distributed training (model parallel + pipeline parallel); partition the computation graph of a large model across devices
  • Cheeger inequality: Quantifies the over-smoothing rate in deep GNNs; expanders over-smooth fastest; guides choice of GNN depth and skip-connection design
  • Spectral clustering: Gold-standard for community detection in social networks, knowledge graphs, citation networks; used in data curation for LLM pretraining
  • Graph Fourier Transform: Spectral convolution -> polynomial approximation -> spatial GNN: the entire GNN derivation hierarchy is a spectral story; ChebNet (Defferrard et al., 2016)
  • Laplacian Positional Encodings: LapPE in GPS (Rampasek et al., 2022); RWPE in graph Transformers; enables graph Transformers to be position-aware without hardcoded sequence order
  • Random walk mixing: RWPE computation; node2vec walks; GraphSAGE neighborhood sampling; mixing time determines the required walk length for meaningful embeddings
  • Heat kernel / diffusion: Graph diffusion networks (Klicpera et al., 2019); APPNP; diffusion-based denoising on knowledge graphs; personalized PageRank for neighborhood aggregation
  • Spectral sparsification: Fast GNN training on large graphs: sparsify the graph while preserving spectral properties; used in GraphSAINT, ClusterGCN
  • PageRank: RLHF preference aggregation; importance weighting in retrieval-augmented generation; entity importance in knowledge graphs
  • Matrix-Tree theorem: Spanning tree sampling for graph augmentation in self-supervised GNN training; tree-structured attention in structured state space models
  • Directed graph spectra: Attention head analysis in mechanistic interpretability (Olsson et al., 2022); causal graph structure in causal LLMs; knowledge graph relation asymmetry

16. Conceptual Bridge

Where we came from. This section builds directly on three pillars:

  • 11-01 Graph Basics provided the combinatorial vocabulary: vertices, edges, paths, connectivity, bipartiteness. These definitions are the domain of spectral graph theory.
  • 11-02 Graph Representations introduced the adjacency matrix and Laplacian as data structures. We now treat them as linear operators with rich algebraic structure.
  • 03 Advanced Linear Algebra (eigenvalues, spectral theorem, PSD matrices) gave us the mathematical machinery. Spectral graph theory is linear algebra applied to graphs.

What this section proved. Starting from the simple definition L=DAL = D - A, we established:

  1. L0L \succeq 0 (proved via the quadratic form xLx=(i,j)(xixj)2\mathbf{x}^\top L \mathbf{x} = \sum_{(i,j)}(x_i - x_j)^2)
  2. The null space of LL encodes connected components
  3. λ2\lambda_2 quantifies connectivity (Fiedler, 1973)
  4. \lambda_2 is tightly related to the minimum normalized cut (Cheeger, 1970, for manifolds; Alon & Milman, 1985, for graphs)
  5. Eigenvectors of LL form a natural Fourier basis for graph signals
  6. Spectral clustering is the continuous relaxation of NP-hard graph partitioning
  7. The GCN layer is a first-order spectral filter (Kipf & Welling, 2017)

Where we are going. Two sections lie ahead:

11-05 Graph Neural Networks (immediate next): The spectral foundation developed here - GCN derivation, over-smoothing as diffusion, spectral filters - motivates the full GNN architecture zoo. The MPNN framework, GAT attention, GraphSAGE induction, and over-smoothing mitigations are all seen more clearly through the spectral lens.

12 Functional Analysis (next chapter): The spectral theory of the discrete graph Laplacian is a special case of the spectral theory of self-adjoint operators on Hilbert spaces. The Laplace-Beltrami operator on Riemannian manifolds is the continuous limit of the graph Laplacian. Kernel methods, Mercer's theorem, and Reproducing Kernel Hilbert Spaces (RKHS) generalize what we built here to infinite-dimensional settings.

SPECTRAL GRAPH THEORY IN THE CURRICULUM
============================================================================

  Chapter 02-03           Chapter 11                Chapter 12
  Linear Algebra          Graph Theory              Functional Analysis
  -------------           ------------              --------------------
  Eigenvalues     ------> 04 Spectral       ------> Laplace-Beltrami
  PSD matrices            Graph Theory               operator
  Spectral                     |                    Spectral measure
  theorem                      |                    RKHS
                               |
                    +----------+----------+
                    v                     v
               11-05 GNNs          22 Causal
               GCN, GAT             Inference
               GraphSAGE            Causal DAGs
               MPNN                 d-separation

  KEY RESULTS IN 04:
  ---------------------------------------------------------------------
  L = D - A >= 0 (PSD)        (proved via quadratic form)
  ker(L) = span of            (connected components theorem)
  component indicators
  Cheeger: lambda_2/2 <= h <= sqrt(2*lambda_2)   (connectivity <-> eigenvalue)
  GFT: x_hat = U^T x          (graph Fourier transform)
  GCN = 1st-order             (from ChebNet K=1, lambda_max ~ 2)
  Chebyshev filter

============================================================================

The unifying theme. Spectral graph theory teaches a single lesson: linear algebraic structure encodes combinatorial structure. The eigenvalues of a matrix you can compute in O(n3)O(n^3) reveal properties of the graph that are NP-hard to compute directly. This is the power of the spectral approach, and it is why spectral methods remain foundational even as spatial GNNs dominate in practice - the theory explains why the practice works.

