
Spectral Graph Theory, Part 2: Spectral Filtering to Common Mistakes (Sections 7-13)


7. Spectral Filtering

7.1 Filtering in the Spectral Domain

A spectral filter on a graph is an operation that modifies the frequency content of a graph signal:

\mathbf{y} = g(L)\mathbf{x} = U\, g(\Lambda)\, U^\top \mathbf{x}

where g: \mathbb{R} \to \mathbb{R} is a scalar function applied pointwise to the eigenvalues: g(\Lambda) = \operatorname{diag}(g(\lambda_1), g(\lambda_2), \ldots, g(\lambda_n)).

Common filters:

| Filter | g(\lambda) | Effect | AI use case |
|---|---|---|---|
| Low-pass | \mathbf{1}[\lambda \leq \lambda_c] | Keep low frequencies | Smooth node features |
| High-pass | \mathbf{1}[\lambda > \lambda_c] | Keep high frequencies | Edge detection on graphs |
| Band-pass | \mathbf{1}[\lambda_a \leq \lambda \leq \lambda_b] | Keep a frequency band | Community detection at scale k |
| Heat kernel | e^{-t\lambda} | Exponential damping | Graph diffusion, PPMI |
| Identity | 1 | No change | Trivial |
| GCN | 1 - \lambda/2 | Linear attenuation | First-order spectral convolution |

Implementation cost: Directly computing U g(\Lambda) U^\top \mathbf{x} requires the full eigendecomposition: O(n^3) preprocessing and O(n^2) per signal. This is intractable for large graphs. Polynomial approximation (Section 7.2) reduces the cost to O(K|E|) per signal.
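A minimal NumPy sketch of exact spectral filtering, assuming a small dense adjacency matrix and the combinatorial Laplacian L = D - A; the function name and the example cutoff are illustrative, not from the lesson. For large graphs the polynomial filters of Section 7.2 should be used instead.

```python
import numpy as np

def spectral_filter(A, x, g):
    """Apply y = U g(Lambda) U^T x for the combinatorial Laplacian L = D - A."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    lam, U = np.linalg.eigh(L)           # O(n^3) eigendecomposition
    return U @ (g(lam) * (U.T @ x))      # scale each frequency, transform back

# Example: ideal low-pass filter keeping frequencies below lambda_c = 1.0
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, -1.0, 2.0, 0.5])
y = spectral_filter(A, x, lambda lam: (lam <= 1.0).astype(float))
```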

7.2 Polynomial Filters and Localization

A K-th order polynomial filter has the form:

g(L) = \sum_{k=0}^K \theta_k L^k

Key property: K-localization. The filter g(L) = \sum_{k=0}^K \theta_k L^k is exactly K-localized: (g(L)\mathbf{x})_i depends only on the values of \mathbf{x} at vertices within graph distance K from i.

Proof. (L^k)_{ij} = 0 whenever \operatorname{dist}(i,j) > k (by the walk-counting property of graph matrix powers). Therefore (g(L))_{ij} = \sum_k \theta_k (L^k)_{ij} = 0 whenever \operatorname{dist}(i,j) > K.

Complexity. Computing \mathbf{y} = g(L)\mathbf{x} using the recurrence L^k \mathbf{x} = L \cdot (L^{k-1}\mathbf{x}) requires K sparse matrix-vector multiplications, each O(|E|). Total: O(K|E|).

Spatial interpretation. A polynomial filter is exactly equivalent to a K-hop neighborhood aggregation, connecting the spectral and spatial GNN views. This is the theoretical justification for why GNNs with K layers aggregate information from K-hop neighborhoods.

Approximation theorem. By the Stone-Weierstrass theorem, any continuous function g: [0, \lambda_{\max}] \to \mathbb{R} can be uniformly approximated by polynomials, so polynomial filters are universal approximators for spectral filters on any graph.

7.3 Chebyshev Polynomial Approximation

Why Chebyshev? Among monic polynomials of degree K, the scaled Chebyshev polynomial 2^{1-K}T_K has the smallest maximum deviation from zero on [-1, 1]; truncated Chebyshev expansions are therefore near-optimal uniform polynomial approximations, which makes the Chebyshev basis the standard choice for spectral filters.

Definition. The Chebyshev polynomials T_k: [-1, 1] \to [-1, 1] satisfy:

T_0(x) = 1, \quad T_1(x) = x, \quad T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)

They have the closed form T_k(x) = \cos(k \arccos x).

Chebyshev graph filter (ChebNet, Defferrard et al., 2016). Rescale the Laplacian to \tilde{L} = 2L/\lambda_{\max} - I, which maps the eigenvalues from [0, \lambda_{\max}] to [-1, 1]. Define:

g_{\boldsymbol{\theta}}(L) = \sum_{k=0}^K \theta_k T_k(\tilde{L})

Computation via the recurrence:

\bar{\mathbf{x}}_0 = \mathbf{x}, \quad \bar{\mathbf{x}}_1 = \tilde{L}\mathbf{x}, \quad \bar{\mathbf{x}}_k = 2\tilde{L}\bar{\mathbf{x}}_{k-1} - \bar{\mathbf{x}}_{k-2}, \qquad \mathbf{y} = \sum_{k=0}^K \theta_k \bar{\mathbf{x}}_k

Each step requires one sparse matrix-vector multiply, O(|E|); the total cost is O(K|E|).

Advantages over truncated Taylor series:

  • The Chebyshev approximation error decays exponentially in K (geometric convergence for smooth g)
  • No numerical instability from large powers of \tilde{L}
  • The learned parameters \theta_k have a clear frequency interpretation
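A hedged sketch of the ChebNet-style filtering recurrence described above, using SciPy sparse matrices; the function name, the coefficient vector theta, and lam_max are illustrative assumptions (theta is assumed to have at least two entries).

```python
import scipy.sparse as sp

def chebyshev_filter(L, x, theta, lam_max):
    """Compute sum_k theta[k] T_k(L_tilde) x with K + 1 = len(theta) terms."""
    n = L.shape[0]
    L_tilde = (2.0 / lam_max) * L - sp.identity(n, format="csr")
    x_prev, x_curr = x, L_tilde @ x                  # T_0 x and T_1 x
    y = theta[0] * x_prev + theta[1] * x_curr
    for k in range(2, len(theta)):
        x_next = 2.0 * (L_tilde @ x_curr) - x_prev   # Chebyshev recurrence
        y = y + theta[k] * x_next
        x_prev, x_curr = x_curr, x_next
    return y
```

Each loop iteration performs exactly one sparse matrix-vector product, matching the O(K|E|) cost stated above.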

7.4 Heat Kernel and Diffusion Filters

The graph heat equation generalizes diffusion to graphs:

\frac{\partial \mathbf{x}(t)}{\partial t} = -L\mathbf{x}(t), \quad \mathbf{x}(0) = \mathbf{x}_0

Solution: \mathbf{x}(t) = e^{-tL}\mathbf{x}_0. In the spectral domain: \hat{x}_k(t) = e^{-t\lambda_k}\hat{x}_{k,0} - each frequency component decays at rate \lambda_k.

The heat kernel H_t = e^{-tL} is a positive semidefinite matrix representing the diffusion of heat on the graph over time t. Entry (H_t)_{ij} gives the heat at vertex j after time t when a unit heat source is placed at vertex i.

Properties:

  • For t \to 0: H_t \to I (no diffusion)
  • For t \to \infty (connected graph): H_t \to \frac{1}{n}\mathbf{1}\mathbf{1}^\top (heat equalizes to a constant temperature; on a disconnected graph this happens separately on each component)
  • Spectral expansion: (H_t)_{ij} = \sum_k e^{-t\lambda_k}\, u_k(i)\, u_k(j)

Diffusion distance. The distance between vertices i and j at time scale t is:

D_t(i,j)^2 = \lVert (H_t)_{i,:} - (H_t)_{j,:} \rVert^2 = \sum_k e^{-2t\lambda_k}\,(u_{k,i} - u_{k,j})^2

This diffusion distance is more robust than shortest-path distance: it accounts for all paths between i and j, not just the shortest one.

For AI: The PPMI (Positive Pointwise Mutual Information) matrix used in graph-based word embeddings is approximately a diffusion kernel, and the node2vec random walk (Grover & Leskovec, 2016) approximates diffusion distance.
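An illustrative sketch of heat-kernel smoothing x(t) = e^{-tL} x0 that avoids forming the dense matrix exponential by using SciPy's expm_multiply; the function name and the sparse-adjacency assumption are mine, not the lesson's.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

def heat_diffuse(A, x0, t):
    """Diffuse the signal x0 for time t on the graph with sparse adjacency A."""
    d = np.asarray(A.sum(axis=1)).ravel()
    L = sp.diags(d) - A
    return expm_multiply(-t * L, x0)   # x(t) = e^{-tL} x0
```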

7.5 From Chebyshev to GCN

The GCN layer (Kipf & Welling, 2017) is derived from ChebNet by:

Step 1: Set K = 1 (first-order Chebyshev approximation), applied to the normalized Laplacian L_{\text{sym}} = I - D^{-1/2}AD^{-1/2}: g_{\boldsymbol{\theta}} \approx \theta_0 T_0(\tilde{L}) + \theta_1 T_1(\tilde{L}) = \theta_0 I + \theta_1 \tilde{L}.

Step 2: Approximate \lambda_{\max} \approx 2 (the eigenvalues of L_{\text{sym}} always lie in [0, 2], with the maximum attained on bipartite graphs), so \tilde{L} \approx L_{\text{sym}} - I = -D^{-1/2}AD^{-1/2}:

g_{\boldsymbol{\theta}} \approx \theta_0 I + \theta_1(L_{\text{sym}} - I) = \theta_0 I - \theta_1 D^{-1/2}AD^{-1/2}

Step 3: Constrain \theta = \theta_0 = -\theta_1 (reduce parameters to prevent overfitting):

g_\theta \approx \theta\,(I + D^{-1/2}AD^{-1/2})

Step 4: Add self-loops \tilde{A} = A + I and renormalize with \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} to keep the operator's eigenvalues well-behaved under repeated application (the "renormalization trick"):

\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}

Full GCN layer:

H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} H^{(l)} W^{(l)}\right) = \sigma\!\left(\hat{A} H^{(l)} W^{(l)}\right)

Spectral interpretation. The GCN filter g(\lambda) \approx 1 - \lambda/2 is a low-pass filter: it passes low frequencies (\lambda \approx 0, smooth signals) and attenuates high frequencies (\lambda \approx 2, rapidly varying signals). GCN is fundamentally a graph smoother.
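A minimal sketch of one GCN propagation step H' = sigma(Â H W) with the renormalization trick, assuming dense NumPy arrays and a ReLU nonlinearity; the function name and these simplifications are mine, not Kipf & Welling's reference implementation.

```python
import numpy as np

def gcn_layer(A, H, W):
    A_tilde = A + np.eye(A.shape[0])                 # add self-loops
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt        # renormalized adjacency
    return np.maximum(A_hat @ H @ W, 0.0)            # ReLU(A_hat H W)
```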

Full GNN treatment: For GraphSAGE, GAT, MPNN framework, over-smoothing fixes, and expressiveness theory, see 11-05 Graph Neural Networks.


8. Spectral Clustering

8.1 Graph Partitioning Objectives

Minimum cut. Given a graph and an integer k, partition V = A_1 \cup \cdots \cup A_k (disjoint, non-empty) to minimize:

\text{Cut}(A_1, \ldots, A_k) = \frac{1}{2}\sum_{\ell=1}^k |E(A_\ell, \bar{A}_\ell)|

where E(S, \bar{S}) is the set of edges between S and its complement.

Problem with minimum cut. Minimum cut tends to cut off isolated vertices or very small sets - the trivial solution A_1 = \{v\} for a low-degree vertex v has very few edges to cut. We need objectives that balance cluster sizes.

RatioCut (Hagen & Kahng, 1992):

\text{RatioCut}(A_1, \ldots, A_k) = \sum_{\ell=1}^k \frac{|E(A_\ell, \bar{A}_\ell)|}{|A_\ell|}

Normalizes by the number of vertices in each partition - prevents very small cuts.

Normalized Cut (NCut) (Shi & Malik, 2000):

\text{NCut}(A_1, \ldots, A_k) = \sum_{\ell=1}^k \frac{|E(A_\ell, \bar{A}_\ell)|}{\operatorname{vol}(A_\ell)}

Normalizes by the volume (total degree) - weighted version of RatioCut.

Both problems are NP-hard in general. Spectral clustering relaxes them to tractable eigenvalue problems.

8.2 RatioCut and Unnormalized Spectral Clustering

Two-cluster RatioCut. For k = 2 with partition (S, \bar{S}):

\text{RatioCut}(S, \bar{S}) = |E(S,\bar{S})| \cdot \left(\frac{1}{|S|} + \frac{1}{|\bar{S}|}\right)

Define the indicator vector \mathbf{h} \in \mathbb{R}^n:

h_i = \begin{cases} \sqrt{|\bar{S}|/|S|} & i \in S \\ -\sqrt{|S|/|\bar{S}|} & i \in \bar{S} \end{cases}

Claim. \mathbf{h}^\top L \mathbf{h} = n \cdot \text{RatioCut}(S, \bar{S}). Also: \lVert \mathbf{h} \rVert^2 = n and \mathbf{h}^\top \mathbf{1} = 0.

Proof: \mathbf{h}^\top L \mathbf{h} = \sum_{(i,j)\in E}(h_i - h_j)^2. The only nonzero terms come from edges crossing the cut: for (i,j) \in E(S, \bar{S}),

(h_i - h_j)^2 = \left(\sqrt{|\bar{S}|/|S|} + \sqrt{|S|/|\bar{S}|}\right)^2 = \frac{n^2}{|S||\bar{S}|}

Summing over all |E(S,\bar{S})| cut edges and using 1/|S| + 1/|\bar{S}| = n/(|S||\bar{S}|):

\mathbf{h}^\top L \mathbf{h} = |E(S,\bar{S})| \cdot \frac{n^2}{|S||\bar{S}|} = n \cdot \text{RatioCut}(S, \bar{S}) \qquad \square

Relaxation. The discrete optimization \min_S \mathbf{h}^\top L \mathbf{h} subject to \mathbf{h}^\top \mathbf{1} = 0, \lVert \mathbf{h} \rVert = \sqrt{n}, h_i \in \{c_+, c_-\} is NP-hard. Relax the integrality constraint and allow h_i \in \mathbb{R}. By Courant-Fischer, the solution is the Fiedler vector \mathbf{u}_2.

Recovery. Given \mathbf{u}_2, assign vertex i to S if (\mathbf{u}_2)_i \geq 0 and to \bar{S} otherwise. In practice, run k-means with k = 2 on the entries of \mathbf{u}_2 for robustness.

8.3 Normalized Cut (Shi & Malik 2000)

NCut relaxation. Define the indicator \mathbf{h} analogously to RatioCut but with volume weights: for partition (S, \bar{S}),

h_i = \begin{cases} \sqrt{\operatorname{vol}(\bar{S})/\operatorname{vol}(S)} & i \in S \\ -\sqrt{\operatorname{vol}(S)/\operatorname{vol}(\bar{S})} & i \in \bar{S} \end{cases}

Then \mathbf{h}^\top L \mathbf{h} = \operatorname{vol}(V) \cdot \text{NCut}(S, \bar{S}), subject to \mathbf{h}^\top D \mathbf{1} = 0 and \mathbf{h}^\top D \mathbf{h} = \operatorname{vol}(V).

Generalized eigenvalue problem. The continuous relaxation is:

\min_{\mathbf{h} \perp_D \mathbf{1}} \frac{\mathbf{h}^\top L \mathbf{h}}{\mathbf{h}^\top D \mathbf{h}} = \min_{\tilde{\mathbf{h}} \perp D^{1/2}\mathbf{1}} \frac{\tilde{\mathbf{h}}^\top L_{\text{sym}} \tilde{\mathbf{h}}}{\lVert \tilde{\mathbf{h}} \rVert^2}

via the substitution \tilde{\mathbf{h}} = D^{1/2}\mathbf{h}. This is the standard Rayleigh quotient for L_{\text{sym}}, minimized by \tilde{\mathbf{u}}_2. Thus:

\mathbf{h}^* = D^{-1/2}\tilde{\mathbf{u}}_2

where \tilde{\mathbf{u}}_2 is the Fiedler vector of L_{\text{sym}}.

Shi-Malik algorithm (2-cluster):

  1. Build L_{\text{sym}} = D^{-1/2}(D-A)D^{-1/2}
  2. Compute the Fiedler vector \tilde{\mathbf{u}}_2 of L_{\text{sym}}
  3. Assign vertex i to S if (D^{-1/2}\tilde{\mathbf{u}}_2)_i \geq \text{threshold}
  4. Choose the threshold empirically (try all n-1 candidate thresholds) or fix it at 0

Multi-class NCut. For k clusters, use the k smallest eigenvectors of L_{\text{sym}}, form the n \times k matrix U_k, normalize each row to unit norm, then apply k-means to the rows.

8.4 Multi-Way Spectral Clustering

The Ng-Jordan-Weiss (NJW) algorithm (2002) is the standard multi-class spectral clustering:

  1. Build the normalized Laplacian L_{\text{sym}}
  2. Compute the k eigenvectors \tilde{\mathbf{u}}_1, \ldots, \tilde{\mathbf{u}}_k belonging to the k smallest eigenvalues of L_{\text{sym}}
  3. Form U_k \in \mathbb{R}^{n \times k} with these eigenvectors as columns
  4. Normalize rows: Y_{i,:} = U_{k,i,:} / \lVert U_{k,i,:} \rVert_2
  5. Apply k-means to the rows of Y

Why row normalization? The perturbation argument below shows that in a perfect k-cluster graph, the rows of U_k lie exactly on k orthogonal vectors. Row normalization maps these to the same point on the unit sphere regardless of degree, so k-means converges cleanly.

Perturbation theory justification. Consider a "block graph" G_0 consisting of k disconnected cliques. The k smallest eigenvalues of L_{\text{sym}} are all 0, with eigenvectors the normalized indicators of each clique. Any real graph with k communities can be seen as a perturbed block graph; if the perturbation (the inter-community edges) is small, the eigenvectors stay close to the block indicators. Weyl's theorem bounds how much eigenvalues such as \lambda_2 can move, and the Davis-Kahan theorem bounds how much the eigenvectors can rotate.

8.5 Complete Algorithm and Implementation

SPECTRAL CLUSTERING ALGORITHM
========================================================================

  Input:  Adjacency matrix A in R^{n x n}, number of clusters k
  Output: Cluster assignments c in {1,...,k}^n

  1. Compute degree matrix D = diag(A 1)
  2. Compute normalized Laplacian L_sym = D^{-1/2} (D - A) D^{-1/2}
     (or L_rw = I - D^{-1} A; use L_sym when a symmetric matrix is needed)

  3. Compute the k smallest eigenvalues and eigenvectors of L_sym
     -> eigenvectors form the columns of U_k in R^{n x k}

  4. Normalize rows: Y_i = U_k[i,:] / ||U_k[i,:]||_2
     (skip for unnormalized/RatioCut clustering; required for NCut/NJW)

  5. Apply k-means clustering to the rows of Y
     -> cluster centers mu_1,...,mu_k; assignments c[i] in {1,...,k}

  6. Return c

  Complexity: O(n^3) with a full eigendecomposition;
              roughly O(k|E|) (up to convergence factors) with Lanczos + k-means on large sparse graphs

========================================================================

Practical notes:

  • Use the Lanczos algorithm or LOBPCG to compute the k smallest eigenvectors of L_{\text{sym}} on large sparse graphs (avoid the full eigendecomposition)
  • The choice of k can be guided by the eigengap heuristic: choose k where the gap \lambda_{k+1} - \lambda_k is largest
  • Run k-means multiple times with random restarts and take the best result (lowest inertia)
  • For disconnected graphs, the k zero eigenvalues directly give the cluster indicators
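A hedged end-to-end sketch of the NJW pipeline above using SciPy and scikit-learn; it assumes a connected scipy-sparse adjacency matrix with no isolated vertices and k well below n, and the function name is illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """NJW spectral clustering on a sparse adjacency matrix A (no isolated nodes)."""
    d = np.asarray(A.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
    L_sym = sp.identity(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt
    # k smallest-eigenvalue eigenvectors of L_sym (Lanczos, smallest algebraic)
    vals, U = eigsh(L_sym, k=k, which="SA")
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)   # row normalization
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```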

8.6 When Spectral Clustering Beats k-Means

K-means minimizes within-cluster variance assuming convex, isotropic, similarly-sized clusters. It fails on non-convex cluster shapes. Spectral clustering has no shape assumption - it works on any cluster structure that is well-separated in the graph.

When spectral clustering excels:

  • Concentric rings, spirals, moons - any shape detectable by graph connectivity
  • Clusters at multiple scales (nested communities)
  • Data with non-Euclidean structure (molecules, social networks)

When k-means excels:

  • Truly Gaussian clusters in \mathbb{R}^d
  • Very large n where eigenvector computation is too slow
  • Cluster structure is well-captured by Euclidean distance

A critical nuance: Spectral clustering requires building the adjacency/affinity graph first, and the choice of k-NN or \epsilon-neighborhood graph matters enormously for quality. A common failure mode: if the graph is built with too small a k or \epsilon, a true cluster may itself split into disconnected pieces; if too large, the community structure is washed out.


9. Laplacian Eigenmaps and Graph Embeddings

9.1 The Embedding Problem

Given a graph G = (V, E), we want a mapping \phi: V \to \mathbb{R}^d (d \ll n) that preserves the graph structure: vertices that are nearby in the graph should be nearby in the embedding. Formally, we want:

\phi = \arg\min_{\phi: V \to \mathbb{R}^d} \sum_{(i,j) \in E} w_{ij} \lVert \phi(i) - \phi(j) \rVert^2

subject to constraints that prevent the trivial solution \phi(i) = \mathbf{0} for all i.

Decomposing dimension by dimension, this is d separate problems, each of the form:

\min_{\mathbf{f} \in \mathbb{R}^n} \mathbf{f}^\top L \mathbf{f} \quad \text{subject to normalization and orthogonality constraints}

This is exactly the Dirichlet energy minimization solved by the eigenvectors of L.

9.2 Laplacian Eigenmaps Algorithm

Belkin & Niyogi (2001/2003). Given n data points \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)} \in \mathbb{R}^D:

  1. Build the adjacency graph: Connect points i and j if they are among each other's k nearest neighbors (or if \lVert \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \rVert < \epsilon).

  2. Set edge weights: Use the heat kernel w_{ij} = \exp(-\lVert \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \rVert^2 / t) for connected pairs (with t > 0 a bandwidth parameter).

  3. Compute degree and Laplacian: D_{ii} = \sum_j w_{ij}, L = D - W.

  4. Solve the generalized eigenvalue problem:

L \mathbf{u} = \lambda D \mathbf{u}

Equivalently: find eigenvectors of L_{\text{rw}} = D^{-1}L (or L_{\text{sym}}).

  5. Embed: Take the d eigenvectors \mathbf{u}_2, \mathbf{u}_3, \ldots, \mathbf{u}_{d+1} (skipping \mathbf{u}_1 = \mathbf{1}) and set \phi(i) = (u_{2,i}, u_{3,i}, \ldots, u_{d+1,i}) \in \mathbb{R}^d.

Optimality theorem. The Laplacian eigenmap embedding solves the optimization problem:

\min_{\mathbf{f}_1, \ldots, \mathbf{f}_d} \sum_k \mathbf{f}_k^\top L \mathbf{f}_k \quad \text{s.t. } \mathbf{f}_k^\top D \mathbf{f}_k = 1,\; \mathbf{f}_k^\top D \mathbf{f}_j = 0 \text{ for } k \neq j

The solution is \mathbf{f}_k = \mathbf{u}_{k+1}, the (k+1)-th generalized eigenvector of L\mathbf{u} = \lambda D\mathbf{u} (equivalently, an eigenvector of L_{\text{rw}}). This is optimal in the sense that no other d-dimensional embedding satisfying these constraints has smaller total Dirichlet energy.

Manifold learning interpretation. If the data points \mathbf{x}^{(i)} lie on a d-dimensional manifold embedded in \mathbb{R}^D, the Laplacian eigenmap recovers the intrinsic coordinates of the manifold. As n \to \infty and the bandwidth t \to 0 at an appropriate rate, the graph Laplacian converges to the Laplace-Beltrami operator on the manifold (Belkin & Niyogi, 2008).
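An illustrative Laplacian eigenmaps sketch for point-cloud data following the steps above; the dense k-NN construction, the heat-kernel bandwidth t, and the function name are assumptions for a small-data demo (it also assumes no duplicate points, so each row's nearest "neighbor" at distance 0 is the point itself).

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmap(X, d=2, n_neighbors=10, t=1.0):
    dist2 = cdist(X, X, "sqeuclidean")
    # k-NN graph with heat-kernel weights (column 0 of argsort is the point itself)
    W = np.zeros_like(dist2)
    idx = np.argsort(dist2, axis=1)[:, 1:n_neighbors + 1]
    for i, nbrs in enumerate(idx):
        W[i, nbrs] = np.exp(-dist2[i, nbrs] / t)
    W = np.maximum(W, W.T)                     # symmetrize
    D = np.diag(W.sum(axis=1))
    L = D - W
    # generalized eigenproblem L u = lambda D u; skip the trivial eigenvector u_1
    vals, vecs = eigh(L, D)
    return vecs[:, 1:d + 1]                    # embedding coordinates
```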

9.3 Diffusion Maps

Coifman & Lafon (2006) introduced diffusion maps as a multiscale version of Laplacian eigenmaps.

Define the diffusion operator M = D^{-1}W (the random walk matrix) and its t-step version M^t. The diffusion distance at scale t is:

D_t(i,j)^2 = \sum_k \mu_k^{2t} (\psi_k(i) - \psi_k(j))^2

where \mu_k, \psi_k are the eigenvalues/eigenvectors of M. The diffusion map embedding is:

\Phi^t(i) = (\mu_2^t \psi_{2,i},\, \mu_3^t \psi_{3,i},\, \ldots,\, \mu_d^t \psi_{d,i}) \in \mathbb{R}^{d-1}

The Euclidean distance in the diffusion map equals the diffusion distance: \lVert \Phi^t(i) - \Phi^t(j) \rVert = D_t(i,j).

Multi-scale property. By varying t, diffusion maps reveal structure at different scales:

  • Small t: local neighborhood structure
  • Large t: global cluster structure (only eigenvectors whose eigenvalues \mu_k are close to 1 still contribute, since \mu_k^t \to 0 for the rest)

9.4 Relationship to PCA and Kernel PCA

Kernel PCA (Scholkopf et al., 1998) computes the principal components of data in a feature space defined by a kernel k(\mathbf{x}, \mathbf{y}). For a kernel matrix K \in \mathbb{R}^{n \times n} with K_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}), kernel PCA computes the eigenvectors of the centered kernel matrix.

Commute-time embedding. The commute time C(i,j) between vertices i and j is the expected number of steps for a random walk starting at i to reach j and return to i. It equals:

C(i,j) = \operatorname{vol}(V) \sum_{k=2}^n \frac{1}{\lambda_k}\left(\frac{u_{k,i}}{\sqrt{d_i}} - \frac{u_{k,j}}{\sqrt{d_j}}\right)^2

This is (up to centering) kernel PCA with the commute-time kernel K = \operatorname{vol}(V)\, L^\dagger, the scaled Laplacian pseudoinverse, for which C(i,j) = K_{ii} + K_{jj} - 2K_{ij}. So Laplacian eigenmaps is essentially a special case of kernel PCA.

9.5 Spectral Positional Encodings for Transformers

Standard Transformers process tokens with positional encodings to handle sequence order. Graph Transformers need analogous positional encodings for graph nodes - but graphs have no canonical ordering.

Laplacian Positional Encoding (LapPE). Use the eigenvectors of the graph Laplacian as node positional encodings:

\text{PE}(v) = [\mathbf{u}_2(v), \mathbf{u}_3(v), \ldots, \mathbf{u}_{k+1}(v)] \in \mathbb{R}^k

where \mathbf{u}_j(v) is the v-th entry of the j-th Laplacian eigenvector.

Challenge: Sign ambiguity. Each eigenvector \mathbf{u}_j is defined only up to sign: -\mathbf{u}_j is also a valid eigenvector. This makes the PE non-unique.

Solutions:

  • Random sign flips during training (Dwivedi et al., 2022): randomly flip eigenvector signs during training so the Transformer learns sign-invariant functions
  • SignNet (Lim et al., 2022): use a Deep Sets-style architecture that is invariant to sign flips by construction: \phi(\mathbf{u}_j) + \phi(-\mathbf{u}_j)
  • BasisNet: extend to the case of repeated eigenvalues (multiplicity > 1), which introduce rotational ambiguity

RWPE (Random Walk Positional Encoding). Instead of Laplacian eigenvectors, use k steps of a random walk:

\text{RWPE}(v)_j = (P^j)_{vv} = \text{probability of returning to } v \text{ after } j \text{ steps}

where P = D^{-1}A. This avoids the sign ambiguity issue and is invariant to graph automorphisms. Used in GPS (Rampasek et al., 2022), one of the best-performing graph Transformers.

Why LapPE/RWPE matter. Without positional encodings, graph Transformers cannot distinguish graph structure: all nodes with the same degree distribution look identical. Spectral PE gives each node a unique "spectral fingerprint" derived from its position in the graph's Fourier basis.
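A hedged sketch of computing RWPE as defined above (return probabilities (P^j)_{vv} for j = 1..k, with P = D^{-1}A); it assumes a dense adjacency matrix with no isolated vertices, and the function name is illustrative.

```python
import numpy as np

def rwpe(A, k):
    d = A.sum(axis=1)
    P = A / d[:, None]                       # row-stochastic random-walk matrix
    pe = np.empty((A.shape[0], k))
    Pj = np.eye(A.shape[0])
    for j in range(k):
        Pj = Pj @ P                          # P^{j+1}
        pe[:, j] = np.diag(Pj)               # return probability after j+1 steps
    return pe
```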


10. Directed Graph Spectra

10.1 Directed Laplacians

For a directed graph (digraph) G = (V, E) with E \subseteq V \times V (ordered pairs), the adjacency matrix A is not symmetric: A_{ij} = 1 if (i \to j) \in E, but A_{ji} may be 0.

In-degree and out-degree: For each vertex i, d_i^{\text{in}} = \sum_j A_{ji} (number of incoming edges) and d_i^{\text{out}} = \sum_j A_{ij} (number of outgoing edges).

Out-degree Laplacian: L^{\text{out}} = D^{\text{out}} - A where D^{\text{out}} = \operatorname{diag}(d_1^{\text{out}}, \ldots, d_n^{\text{out}}).

In-degree Laplacian: L^{\text{in}} = D^{\text{in}} - A^\top (equivalently, the out-degree Laplacian of the reversed graph).

Key difference from undirected case:

  • L^{\text{out}} is NOT symmetric in general
  • Eigenvalues may be complex
  • The row-sum-zero property still holds: L^{\text{out}}\mathbf{1} = \mathbf{0} (each row sums to d_i^{\text{out}} - d_i^{\text{out}} = 0)
  • But the column sums are d_j^{\text{in}} - d_j^{\text{out}}, not necessarily zero

Stationary distribution. The directed random walk P = (D^{\text{out}})^{-1}A is row-stochastic. For a strongly connected digraph, the unique stationary distribution \boldsymbol{\pi} satisfies \boldsymbol{\pi}^\top P = \boldsymbol{\pi}^\top. The stationary distribution is NOT necessarily uniform (unlike for d-regular undirected graphs).

10.2 Kirchhoff's Matrix-Tree Theorem

Theorem (Kirchhoff, 1847). For a connected undirected graph G, the number of spanning trees \tau(G) equals any cofactor of L, and in eigenvalue form:

\tau(G) = \frac{1}{n}\prod_{k=2}^n \lambda_k(L)

where \lambda_2, \ldots, \lambda_n are the non-zero eigenvalues of L.

Why the eigenvalue form holds. The classical Matrix-Tree statement is that \tau(G) equals any (n-1) \times (n-1) principal minor of L. Combining this with the Cauchy-Binet formula, the spectral decomposition, and the fact that \lambda_1 = 0 with eigenvector \mathbf{1}/\sqrt{n} shows that this minor equals \frac{1}{n}\lambda_2\lambda_3\cdots\lambda_n.

Examples:

  • K_n: \lambda_2 = \cdots = \lambda_n = n (all equal), so \tau(K_n) = n^{n-2} (Cayley's formula).
  • P_n (path): \tau(P_n) = 1 (only one spanning tree - the path itself).
  • C_n (cycle): \tau(C_n) = n.

For AI: The number of spanning trees measures "graph robustness." Networks with many spanning trees (like expanders) remain connected even after many edge failures. This metric appears in network reliability analysis for distributed training clusters.
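A small sketch checking Kirchhoff's theorem two ways on a toy graph: via a principal minor of L and via the product of nonzero Laplacian eigenvalues divided by n. The function name and the C_4 example are illustrative.

```python
import numpy as np

def spanning_tree_count(A):
    L = np.diag(A.sum(axis=1)) - A
    minor = np.linalg.det(L[1:, 1:])                 # any cofactor of L
    lam = np.linalg.eigvalsh(L)
    via_spectrum = np.prod(lam[1:]) / len(A)         # (1/n) * product of nonzero eigenvalues
    return round(minor), via_spectrum

# Example: the 4-cycle C_4 has tau = 4 spanning trees
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
print(spanning_tree_count(A))   # (4, ~4.0 up to floating-point error)
```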

10.3 PageRank as a Spectral Problem

PageRank (Page, Brin, Motwani, Winograd, 1998) - the algorithm behind Google Search - is fundamentally a spectral computation on a directed graph.

Setup. Model the Web as a directed graph: pages are vertices, hyperlinks are directed edges. Define the Google matrix:

G = \alpha P + (1-\alpha) \frac{\mathbf{1}\mathbf{1}^\top}{n}

where P = (D^{\text{out}})^{-1}A is the row-stochastic random-walk matrix, \alpha \in (0,1) is the damping factor (typically \alpha = 0.85), and (1-\alpha)\mathbf{1}\mathbf{1}^\top/n represents teleportation (random jumps to any page).

PageRank vector. The PageRank of each page is the stationary distribution \boldsymbol{\pi} of the Markov chain defined by G:

\boldsymbol{\pi}^\top = \boldsymbol{\pi}^\top G \implies \boldsymbol{\pi}^\top(I - G) = \mathbf{0}

Equivalently, \boldsymbol{\pi} is the dominant left eigenvector of G (eigenvalue 1).

Spectral computation. By the Perron-Frobenius theorem, G is a positive stochastic matrix (all entries > 0 due to the teleportation term), so it has a unique dominant eigenvalue \mu_1 = 1 with a unique positive eigenvector \boldsymbol{\pi}.

Power iteration. PageRank is computed by:

\boldsymbol{\pi}^{(t+1)} = \boldsymbol{\pi}^{(t)} G = \alpha\, \boldsymbol{\pi}^{(t)} P + (1-\alpha)\frac{\mathbf{1}}{n}

Convergence is geometric with ratio \alpha - the second eigenvalue of G is at most \alpha in modulus. Each iteration is a sparse matrix-vector multiply, O(|E|).

For AI (RLHF and LLM preference graphs): In reinforcement learning from human feedback (RLHF), preference data can be modeled as a directed graph over responses, with edge (r_i, r_j) meaning "response r_i is preferred over r_j." PageRank on this preference graph gives a global ranking consistent with pairwise preferences. This is closely related to the Bradley-Terry models used in reward model training (Ouyang et al., 2022).
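A minimal PageRank sketch via power iteration; handling dangling nodes (zero out-degree) by spreading their mass uniformly is a common convention added here as an assumption, and the function name and tolerances are illustrative.

```python
import numpy as np

def pagerank(A, alpha=0.85, tol=1e-10, max_iter=200):
    n = A.shape[0]
    out_deg = A.sum(axis=1)
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # contribution pi^T P from nodes that have out-links
        with_links = np.where(out_deg > 0, pi / np.maximum(out_deg, 1), 0.0)
        dangling = pi[out_deg == 0].sum()            # mass on dangling nodes
        new = alpha * (with_links @ A) + (alpha * dangling + (1 - alpha)) / n
        if np.abs(new - pi).sum() < tol:             # L1 convergence check
            return new
        pi = new
    return pi
```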

10.4 Directed Graph Spectra in AI

Attention as a directed graph. In a Transformer, the attention weights A_{ij} define a directed weighted graph over tokens. The spectral properties of this attention graph have interpretability implications:

  • The dominant eigenvector of the attention matrix identifies "hub" tokens - tokens that receive most attention
  • Spectral analysis of attention graphs has been used in mechanistic interpretability to identify "induction heads" and "name mover heads" (Olsson et al., 2022)
  • The spectral gap of the attention graph determines how quickly information mixes across token positions

Causal DAGs. In causal inference (Chapter 22), structural causal models are represented as DAGs. Because a DAG's adjacency matrix is nilpotent (all of its eigenvalues are zero), the spectral radius itself carries no information; the relevant quantity is the nilpotency index - the smallest m with A^m = 0 - which equals one plus the length of the longest directed path and thus measures the "depth" of causal chains.


11. Advanced Topics

11.1 Spectral Sparsification

Problem. For a dense graph G with n vertices and \Theta(n^2) edges, many spectral algorithms are too slow. Can we find a sparse graph \tilde{G} with O(n \log n) edges that preserves the spectrum of L?

Definition. \tilde{G} is an \epsilon-spectral sparsifier of G if for all \mathbf{x} \in \mathbb{R}^n:

(1-\epsilon)\,\mathbf{x}^\top L_G \mathbf{x} \leq \mathbf{x}^\top L_{\tilde{G}} \mathbf{x} \leq (1+\epsilon)\,\mathbf{x}^\top L_G \mathbf{x}

Equivalently, (1-\epsilon)L_G \preceq L_{\tilde{G}} \preceq (1+\epsilon)L_G in the PSD order.

Theorem (Spielman & Srivastava, 2011). Every graph G has an \epsilon-spectral sparsifier with O(n \log n / \epsilon^2) edges, computable in near-linear time by random sampling weighted by effective resistances.

Effective resistance. The effective resistance R_{\text{eff}}(i,j) between vertices i and j is the electrical resistance between them when a unit resistor is placed on each edge. It equals:

R_{\text{eff}}(i,j) = (\mathbf{e}_i - \mathbf{e}_j)^\top L^\dagger (\mathbf{e}_i - \mathbf{e}_j)

where L^\dagger is the pseudoinverse of L.

Algorithm: Sample edges (i,j) with probability proportional to w_{ij} R_{\text{eff}}(i,j), and rescale the weights of the sampled edges. The resulting sparse graph preserves all spectral properties up to a (1 \pm \epsilon) factor.

For AI: Spectral sparsification can reduce the computational cost of graph-based ML. A 10-million-edge social graph can often be sparsified to \sim 500K edges while preserving spectral clustering quality.
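An exploratory sketch of resistance-weighted edge sampling in the spirit of Spielman-Srivastava, not their algorithm: effective resistances are computed exactly from the Laplacian pseudoinverse, which is only feasible for small dense graphs, and the function names and sampling count are assumptions.

```python
import numpy as np

def effective_resistances(A):
    n = len(A)
    L = np.diag(A.sum(axis=1)) - A
    Lp = np.linalg.pinv(L)                            # pseudoinverse L^+
    R = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            R[i, j] = Lp[i, i] + Lp[j, j] - 2 * Lp[i, j]
    return R

def sparsify(A, n_samples, seed=0):
    rng = np.random.default_rng(seed)
    R = effective_resistances(A)
    edges = np.argwhere(np.triu(A) > 0)               # undirected edges, i < j
    p = np.array([A[i, j] * R[i, j] for i, j in edges])
    p /= p.sum()
    B = np.zeros_like(A)
    for idx in rng.choice(len(edges), size=n_samples, p=p):
        i, j = edges[idx]
        B[i, j] += A[i, j] / (n_samples * p[idx])     # reweight sampled edge
        B[j, i] = B[i, j]
    return B
```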

11.2 Random Matrix Theory and Graph Spectra

The spectrum of a random graph has universal limiting behavior described by random matrix theory.

Erdos-Renyi model G(n, p). For a random graph in which each edge appears independently with probability p, the empirical spectral distribution of A/\sqrt{np(1-p)} converges to the semicircle law (Wigner, 1955):

\rho(\lambda) = \frac{1}{2\pi}\sqrt{4 - \lambda^2} \quad \text{for } \lambda \in [-2, 2]

The leading eigenvalue separates from the bulk at \mu_1 \approx np; for p \gg \ln n / n the graph is above the connectivity threshold and this outlier is well separated from the semicircle bulk.

Implications for GNNs:

  • Random weight matrices in GNNs have spectra approximated by the semicircle law (for large enough hidden dimensions)
  • The alignment between the spectrum of the data graph A and that of the random weight matrices W affects gradient flow during training
  • Spectral norm regularization of GNN weights controls training stability by constraining the spectral radius

11.3 Graph Wavelets

Motivation. Laplacian eigenvectors are global: \mathbf{u}_k is supported on all n vertices. For signals with local structure (e.g., a signal that varies in one part of the graph but is constant elsewhere), global eigenvectors are inefficient. We need a local, multiscale basis - a graph wavelet transform.

Hammond, Vandergheynst & Gribonval (2011). For a vertex t and scale s, define the graph wavelet centered at t at scale s:

\psi_{s,t} = g_s(L)\,\boldsymbol{\delta}_t

where \boldsymbol{\delta}_t is the indicator vector of vertex t and g_s(\lambda) = s \cdot g(s\lambda) is a scaled spectral filter (band-pass around frequency 1/s). In the spectral domain, the k-th Fourier coefficient is:

\hat{\psi}_{s,t}(\lambda_k) = g_s(\lambda_k)\, u_k(t)

Properties of graph wavelets:

  • Spatially localized: if g is well approximated by a degree-K polynomial, then \psi_{s,t} is approximately K-hop localized around t, with the effective K depending on the bandwidth and the scale s
  • Frequency selective: wavelets at scale s are sensitive to frequencies \approx 1/s
  • Frame bounds: for appropriate g, \{\psi_{s,t}\} forms a frame (a redundant but stable basis)

Scattering transform on graphs (Gama et al., 2019): Compose multiple wavelet transforms with pointwise nonlinearities to build invariant/equivariant features. Provides theoretical guarantees for GNN expressiveness.

11.4 Infinite Graphs and Spectral Measures

For infinite graphs (e.g., the integer lattice \mathbb{Z}^d, infinite trees), the Laplacian is a self-adjoint operator on \ell^2(V) (bounded when vertex degrees are bounded, unbounded otherwise), and the spectrum is no longer a finite set of eigenvalues but is described by a spectral measure \mu_L.

Example: Integer lattice \mathbb{Z}^d. The Laplacian L on \mathbb{Z}^d has purely continuous spectrum [0, 4d] (the d-dimensional discrete Laplacian spectrum). This connects to the theory of periodic operators in solid-state physics (Bloch's theorem).

Spectral measure. For a vertex v, the spectral measure \mu_v is defined by:

\langle \boldsymbol{\delta}_v, f(L) \boldsymbol{\delta}_v \rangle = \int f(\lambda)\, d\mu_v(\lambda)

The spectral measure encodes everything about the local geometry of the graph as seen from v.

Convergence of finite graphs. If a sequence of finite graphs G_n converges in the Benjamini-Schramm sense to an infinite graph G_\infty, the empirical spectral distributions of L_{G_n} converge weakly to the spectral measure of L_{G_\infty}.

Preview: The spectral theory of infinite-dimensional operators is the subject of Chapter 12: Functional Analysis, where Hilbert spaces, unbounded operators, and spectral measures are developed fully.


12. Applications in Machine Learning

12.1 Semi-Supervised Learning on Graphs

The problem. Given a graph G with n nodes, a few labeled nodes \mathcal{L} \subset V with labels y_i \in \{1, \ldots, k\}, and many unlabeled nodes \mathcal{U} = V \setminus \mathcal{L}, assign labels to all unlabeled nodes.

Graph-based regularization (Zhou et al., 2004; Zhu et al., 2003). Find a label function \mathbf{f}: V \to [0,1]^k that:

  1. Agrees with the given labels on \mathcal{L}
  2. Is smooth on the graph (nearby nodes have similar labels)

The objective:

\min_{\mathbf{f}} \sum_{i \in \mathcal{L}} \lVert \mathbf{f}_i - \mathbf{y}_i \rVert^2 + \gamma \sum_{(i,j) \in E} w_{ij} \lVert \mathbf{f}_i - \mathbf{f}_j \rVert^2 = \min_\mathbf{f} \lVert \mathbf{f}_\mathcal{L} - \mathbf{y} \rVert^2 + \gamma \operatorname{tr}(\mathbf{f}^\top L \mathbf{f})

The closed-form solution involves (I + \gamma L)^{-1}, a smoothing operator. In the spectral domain:

\hat{f}_k = \frac{\hat{y}_k}{1 + \gamma\lambda_k}

High-frequency components (\lambda_k large) are strongly regularized toward zero; low-frequency components are preserved.

Connection to label propagation. The Gaussian Fields and Harmonic Functions algorithm (Zhu et al., 2003) clamps labeled node values to the true labels and solves for the unlabeled values via:

\mathbf{f}_\mathcal{U} = -L_{\mathcal{U}\mathcal{U}}^{-1} L_{\mathcal{U}\mathcal{L}}\, \mathbf{y}_\mathcal{L}

(where L_{\mathcal{U}\mathcal{U}} is the Laplacian restricted to the unlabeled nodes, L_{\mathcal{U}\mathcal{L}} is its off-diagonal block, and the minus sign comes from requiring (L\mathbf{f})_\mathcal{U} = \mathbf{0}). This harmonic interpolation assigns each unlabeled node the weighted average of its neighbors' labels, with weights determined by the graph structure.

Modern variant: GCN for semi-supervised learning. The two-layer GCN of Kipf & Welling (2017) was originally proposed for exactly this task: semi-supervised node classification. The Laplacian smoothing is built into the propagation rule, making it a parameterized (learnable) version of label propagation.
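A hedged sketch of the harmonic-function solution above for a single real-valued label vector; labeled values are clamped and the unlabeled values solve the linear system, with the minus sign matching the formula above. The function name and the dense-weight-matrix assumption are mine.

```python
import numpy as np

def harmonic_label_propagation(W, labeled_idx, y_labeled):
    """Harmonic interpolation (Zhu et al., 2003) on a dense weight matrix W."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    U = np.setdiff1d(np.arange(n), labeled_idx)      # unlabeled indices
    # f_U = -L_UU^{-1} L_UL y_L
    f_U = np.linalg.solve(L[np.ix_(U, U)], -L[np.ix_(U, labeled_idx)] @ y_labeled)
    f = np.zeros(n)
    f[labeled_idx] = y_labeled
    f[U] = f_U
    return f
```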

12.2 Knowledge Graph Analysis

A knowledge graph (KG) represents world knowledge as a graph: entities (nodes) connected by typed relations (edges). Examples: Freebase, Wikidata, ConceptNet, UMLS (medical).

Spectral properties of KGs:

  • KGs are heterogeneous (multiple edge types) and sparse (|E| = O(|V|))
  • The adjacency spectrum often follows a power law: many small eigenvalues, a few large ones
  • The spectral gap \lambda_2 measures how well-integrated the KG is: a small gap indicates a KG with weakly connected sub-graphs (different domains not well linked)

Spectral regularization. KG embedding models (TransE, RotatE, ComplEx) learn entity and relation embeddings. Adding a spectral regularization term

\mathcal{L}_{\text{smooth}} = \sum_r \operatorname{tr}(E_r^\top L_r E_r)

encourages entity embeddings to be smooth with respect to each relation type r's graph: entities connected by relation r should have similar embeddings. This improves link prediction accuracy, especially for rare relations.

12.3 Molecular Property Prediction

Molecules are naturally represented as graphs: atoms are nodes, chemical bonds are edges. Predicting molecular properties (solubility, toxicity, drug-likeness) from molecular graphs is a key application of GNNs.

Spectral molecular fingerprints. The eigenvalue spectrum of the molecular graph Laplacian provides rotation- and permutation-invariant descriptors. The "spectral profile" (\lambda_1, \lambda_2, \ldots, \lambda_n) uniquely characterizes many molecular structures.

Graph edit distance and spectral distance. Two molecules tend to have similar properties if they have similar spectral profiles. The distance

d_{\text{spec}}(G_1, G_2) = \lVert \boldsymbol{\lambda}(G_1) - \boldsymbol{\lambda}(G_2) \rVert_2

(where the eigenvalues are sorted and zero-padded to the same length) approximates graph edit distance and correlates with molecular similarity.

Equivariance and invariance. Spectral fingerprints are invariant to atom permutation (graph isomorphism), which is the correct invariance for molecular property prediction. However, they are blind to chirality (mirror image molecules) - a known limitation requiring higher-order structural features.
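An illustrative sketch of the spectral distance just defined: compare two graphs by their sorted, zero-padded Laplacian spectra. The function name is an assumption; for molecules, A would be the (possibly bond-order-weighted) adjacency matrix.

```python
import numpy as np

def spectral_distance(A1, A2):
    def spectrum(A):
        L = np.diag(A.sum(axis=1)) - A
        return np.sort(np.linalg.eigvalsh(L))
    s1, s2 = spectrum(A1), spectrum(A2)
    m = max(len(s1), len(s2))
    s1 = np.pad(s1, (0, m - len(s1)))          # zero-pad to a common length
    s2 = np.pad(s2, (0, m - len(s2)))
    return np.linalg.norm(s1 - s2)
```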

12.4 Attention Pattern Analysis in LLMs

An h-head attention layer in a Transformer computes h attention weight matrices \{A^{(1)}, \ldots, A^{(h)}\}, each A^{(k)} \in \mathbb{R}^{T \times T} for sequence length T. These define h weighted directed graphs over the token positions.

Spectral analysis of attention. The eigenvalues of A^{(k)} reveal the structure of the attention pattern:

  • If A^{(k)} \approx \mathbf{1}\mathbf{1}^\top/T (uniform attention): \mu_1 = 1, all others \approx 0
  • If A^{(k)} \approx I (attend only to self): \mu_1 = \cdots = \mu_T = 1
  • Induction heads (Olsson et al., 2022) have A^{(k)} with a large spectral gap between \mu_1 and \mu_2 - they attend sharply to a few positions

Attention graph Laplacian. Define the symmetrized attention Laplacian L^{(k)} = D^{(k)} - (A^{(k)} + A^{(k)\top})/2, where D^{(k)} is the degree matrix of the symmetrized weights. The Fiedler vector \mathbf{u}_2^{(k)} of L^{(k)} identifies the two groups of tokens most separated by head k's attention, as sketched below.
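An exploratory sketch of that Fiedler split for one head's attention matrix; attn is assumed to be a T x T row-stochastic NumPy array, and the function name is illustrative.

```python
import numpy as np

def attention_fiedler_split(attn):
    S = 0.5 * (attn + attn.T)                  # symmetrize the attention graph
    L = np.diag(S.sum(axis=1)) - S             # symmetrized attention Laplacian
    lam, U = np.linalg.eigh(L)
    spectral_gap = lam[1]                      # lambda_2 of the attention graph
    groups = (U[:, 1] >= 0).astype(int)        # Fiedler-vector sign splits tokens
    return spectral_gap, groups
```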

Applications:

  • Attention head pruning: Heads whose attention graphs have very small spectral gap (uniform attention) contribute little and can be pruned (Michel et al., 2019; Voita et al., 2019)
  • Mechanistic interpretability: Spectral analysis of multi-head attention composition identifies information routing circuits (Elhage et al., 2021)
  • Context window analysis: The Laplacian spectrum of the attention graph changes as more tokens are added; sudden changes in \lambda_2 indicate "phase transitions" in how the model processes context

13. Common Mistakes

| # | Mistake | Why it's wrong | Fix |
|---|---|---|---|
| 1 | Confusing L = D - A with L = A - D | Sign convention: L \succeq 0 requires L = D - A; with A - D, all eigenvalues are \leq 0 | Always check L_{ii} = d_i > 0 (positive diagonal) to verify the sign convention |
| 2 | Using unnormalized L for spectral clustering on graphs with varying degrees | RatioCut (unnormalized) penalizes unequally sized partitions; on real data, NCut (normalized) gives much better clusters | Use L_{\text{sym}} or L_{\text{rw}} for spectral clustering; reserve unnormalized L for (near-)regular graphs |
| 3 | Taking \mathbf{u}_1 (index 1) instead of \mathbf{u}_2 as the Fiedler vector | \mathbf{u}_1 is the constant vector \mathbf{1}/\sqrt{n} (the trivial null vector); it carries no discriminative information | In NumPy, eigenvectors are sorted by ascending eigenvalue; take column index 1 (0-indexed), not index 0 |
| 4 | Ignoring the sign ambiguity of eigenvectors | For each eigenvector \mathbf{u}_k, both \mathbf{u}_k and -\mathbf{u}_k are valid; different runs give different signs | Use sign-invariant quantities (e.g., \lvert (\mathbf{u}_k)_i \rvert) for visualizations, or use RWPE instead of LapPE to avoid sign issues |
| 5 | Confusing the Cheeger inequality direction: \lambda_2 \leq 2h vs h \geq \lambda_2/2 | These are the same inequality; the confusing part is that large \lambda_2 implies large h (good expander), not small h | Remember: small \lambda_2 <-> bottleneck <-> small h <-> easy to cut; large \lambda_2 <-> expander <-> large h <-> hard to cut |
| 6 | Expecting \lambda_2 > 0 for a disconnected graph | For a disconnected graph, \lambda_2 = 0 always; the null space has dimension equal to the number of components | Check connectivity before spectral clustering; if the graph is disconnected, handle each component separately or add a small connectivity term |
| 7 | Treating spectral clustering as scale-free (the same regardless of k) | The cluster structure at scale k uses eigenvectors \mathbf{u}_2, \ldots, \mathbf{u}_{k+1}; higher eigenvectors capture increasingly fine-grained structure | Choose k with the eigengap heuristic: k^* = \arg\max_k (\lambda_{k+1} - \lambda_k) |
| 8 | Applying GCN (a low-pass filter) to heterophilic graphs | GCN smooths features toward neighborhood averages; for heterophilic graphs (connected nodes have different labels), this destroys discriminative information | Use high-pass or band-pass graph filters (e.g., GPRGNN, FAGCN, BernNet) in heterophilic settings |
| 9 | Confusing L_{\text{sym}} and L_{\text{rw}}: using L_{\text{rw}} for Ng-Jordan-Weiss | NJW requires L_{\text{sym}} (symmetric, orthogonal eigenvectors) for the row normalization to work; L_{\text{rw}} is not symmetric, so its eigenvectors are not orthogonal | Use scipy.linalg.eigh(L_sym) for a symmetric eigendecomposition; its eigenvectors form orthonormal columns |
| 10 | Over-interpreting spectral methods on cospectral graphs | Two different graphs can have identical Laplacian spectra; spectral features alone cannot distinguish them | Augment spectral features with structural features (degree, triangle count, etc.) or use WL-based methods |
| 11 | Forgetting the self-loop renormalization in the GCN derivation | The un-renormalized operator I + D^{-1/2}AD^{-1/2} has eigenvalues in [0, 2], so stacking layers can cause numerical instability and exploding/vanishing signals | Always add self-loops \tilde{A} = A + I and renormalize with \tilde{D} in the GCN propagation rule |
| 12 | Using the full GFT (O(n^3)) on graphs with n > 1000 | Full eigendecomposition is O(n^3); for n = 10^6, this is 10^{18} operations - completely intractable | Use polynomial filters (Chebyshev: K sparse matrix-vector products), Lanczos for the top-k eigenvectors, or RWPE |
