Graph Neural Networks, Part 2: Sections 6 (Graph Attention Networks) to 10 (Graph Transformers)

6. Graph Attention Networks (GAT)

6.1 The Attention Mechanism on Graphs

GCN's aggregation weights $\frac{1}{\sqrt{\tilde{d}_u \tilde{d}_v}}$ are fixed functions of node degrees - they depend only on the graph structure, not on node features. Two neighbors with identical degree contribute identically regardless of their relevance to the central node's representation.

Graph Attention Networks (Velickovic et al., 2018) replace fixed weights with learned, data-dependent attention coefficients. For each edge $(u, v)$, the attention coefficient $\alpha_{uv}$ is computed as a function of both nodes' features, allowing the network to focus on the most relevant neighbors.

This mirrors the transformer's self-attention, but with attention constrained to the graph's edge structure: node $v$ only attends to its actual neighbors $\mathcal{N}(v)$, not to all nodes (which would be $O(n^2)$). The graph acts as a structural prior on the attention pattern.

6.2 GAT Layer: Attention Coefficients

Step 1: Linear transformation. Apply a shared weight matrix $W \in \mathbb{R}^{d' \times d}$ to transform all node features:

$$\mathbf{z}_v = W\mathbf{h}_v^{[l]} \quad \forall v \in V$$

Step 2: Attention scores. For each edge $(u, v)$ (including self-loops), compute:

$$e_{uv} = \operatorname{LeakyReLU}\!\left(\mathbf{a}^\top \left[\mathbf{z}_v \,\|\, \mathbf{z}_u\right]\right)$$

where $\mathbf{a} \in \mathbb{R}^{2d'}$ is a learnable attention vector, and $[\cdot \| \cdot]$ denotes concatenation. The raw attention score $e_{uv}$ measures the "relevance" of node $u$ to node $v$. The LeakyReLU (with negative slope $0.2$) keeps a small gradient flowing for negative scores.

Step 3: Normalize. Apply softmax over the neighborhood to get normalized coefficients:

$$\alpha_{uv} = \frac{\exp(e_{uv})}{\sum_{k \in \mathcal{N}(v) \cup \{v\}} \exp(e_{kv})}$$

Note: $\alpha_{uv} \geq 0$ and $\sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{uv} = 1$.

Step 4: Aggregation. Compute the updated representation:

$$\mathbf{h}_v^{[l+1]} = \sigma\!\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{uv} \cdot \mathbf{z}_u\right)$$

MPNN view: the message is $M(\mathbf{h}_v, \mathbf{h}_u) = \alpha_{uv} \cdot W\mathbf{h}_u$ and the aggregation is sum.
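To make the four steps concrete, here is a minimal single-head GAT layer in NumPy. The dense adjacency loop, the ReLU output nonlinearity, and the random shapes are simplifying assumptions for illustration, not a reference implementation.

```python
# Minimal single-head GAT layer (NumPy), following Steps 1-4 above.
import numpy as np

def gat_layer(H, adj, W, a, slope=0.2):
    """H: (n, d) node features; adj: (n, n) 0/1 adjacency; W: (d, d_out); a: (2*d_out,)."""
    n = H.shape[0]
    Z = H @ W                                   # Step 1: shared linear transform
    A_loop = adj + np.eye(n)                    # include self-loops
    # Step 2: raw attention scores e[v, u] for every edge (u, v)
    e = np.full((n, n), -np.inf)
    for v in range(n):
        for u in range(n):
            if A_loop[v, u] > 0:
                s = a @ np.concatenate([Z[v], Z[u]])
                e[v, u] = np.where(s > 0, s, slope * s)   # LeakyReLU
    # Step 3: softmax over each node's neighborhood (rows)
    e = e - e.max(axis=1, keepdims=True)
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=1, keepdims=True)
    # Step 4: attention-weighted aggregation + output nonlinearity (ReLU here)
    return np.maximum(alpha @ Z, 0.0)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
adj = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], float)
out = gat_layer(H, adj, rng.normal(size=(3, 2)), rng.normal(size=(4,)))
```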

6.3 Multi-Head Attention for Graphs

Single-head attention can be unstable and may attend to a limited range of features. Multi-head attention stabilizes training and allows the model to jointly attend to different aspects of neighbors.

Multi-head GAT layer. Run $K$ independent attention mechanisms in parallel, each with its own parameters $(W^{(k)}, \mathbf{a}^{(k)})$:

$$\mathbf{h}_v^{[l+1]} = \big\Vert_{k=1}^K\, \sigma\!\left(\sum_{u \in \mathcal{N}(v)} \alpha_{uv}^{(k)} W^{(k)} \mathbf{h}_u^{[l]}\right)$$

where $\Vert$ denotes concatenation. The output dimension is $K \cdot d'$.

For the final layer (before a classification head), averaging is often preferred over concatenation:

$$\mathbf{h}_v^{[L]} = \sigma\!\left(\frac{1}{K} \sum_{k=1}^K \sum_{u \in \mathcal{N}(v)} \alpha_{uv}^{(k)} W^{(k)} \mathbf{h}_u^{[L-1]}\right)$$

Complexity. A $K$-head GAT layer costs $O(K \cdot m \cdot d')$ for attention computation ($m$ edges, $d'$ features per head) and $O(K \cdot n \cdot d \cdot d')$ for the linear transformations. For dense attention (a fully connected graph), this becomes $O(K \cdot n^2 \cdot d')$ - the same scaling as transformer attention.

6.4 GATv2: Fixing the Static Attention Problem

Brody, Alon & Yahav (2022) identified a fundamental limitation of the original GAT: its attention is static - the ranking of a neighbor's importance does not depend on the query node.

The problem. In GAT, the attention score (before the LeakyReLU, which is monotone and therefore does not change rankings) is:

$$e_{uv} = \mathbf{a}^\top [\mathbf{z}_v \| \mathbf{z}_u] = \mathbf{a}_{\text{left}}^\top \mathbf{z}_v + \mathbf{a}_{\text{right}}^\top \mathbf{z}_u$$

Since this is a sum of two terms - one depending only on $v$, one depending only on $u$ - the ranking of neighbors $u_1$ vs $u_2$ by their scores $e_{u_1 v}$ vs $e_{u_2 v}$ is independent of $\mathbf{z}_v$. The "most important neighbor" is the same regardless of which node $v$ is doing the asking. This is "static attention."

Dynamic attention (GATv2). The fix reorders the operations so that the nonlinearity sits between the weight matrix and the attention vector:

$$e_{uv} = \mathbf{a}^\top \operatorname{LeakyReLU}\!\left(W \cdot \left[\mathbf{h}_v \| \mathbf{h}_u\right]\right)$$

Now the LeakyReLU nonlinearity is applied before the dot product with $\mathbf{a}$, creating genuine interaction between $\mathbf{h}_v$ and $\mathbf{h}_u$. The ranking of $u_1$ vs $u_2$ can now depend on $v$ - the attention is dynamic.
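A small sketch of the two scoring functions makes the difference tangible. The shapes, random parameters, and helper names are assumptions for illustration; the point is that GAT's additive score (with a monotone LeakyReLU) picks the same preferred neighbor for every query node, while GATv2's score can reorder neighbors per query.

```python
# Contrast the GAT and GATv2 scoring functions (NumPy sketch).
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def score_gat(h_v, h_u, W, a):
    # GAT: LeakyReLU(a^T [W h_v || W h_u]) -- additive in v and u.
    return leaky_relu(a @ np.concatenate([W @ h_v, W @ h_u]))

def score_gatv2(h_v, h_u, W, a):
    # GATv2: a^T LeakyReLU(W [h_v || h_u]) -- nonlinearity before the dot product.
    return a @ leaky_relu(W @ np.concatenate([h_v, h_u]))

rng = np.random.default_rng(1)
d, dp = 3, 4
W1, a1 = rng.normal(size=(dp, d)), rng.normal(size=2 * dp)
W2, a2 = rng.normal(size=(dp, 2 * d)), rng.normal(size=dp)
h_v1, h_v2 = rng.normal(size=d), rng.normal(size=d)
u1, u2 = rng.normal(size=d), rng.normal(size=d)
# For GAT the two comparisons always agree (same preferred neighbor for any query);
# for GATv2 the preferred neighbor can flip with the query node.
for f, W, a in [(score_gat, W1, a1), (score_gatv2, W2, a2)]:
    print(f.__name__, f(h_v1, u1, W, a) > f(h_v1, u2, W, a),
          f(h_v2, u1, W, a) > f(h_v2, u2, W, a))
```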

Formal claim (Brody et al., 2022): GAT computes a strictly less expressive family of attention functions than GATv2. There exist graphs where GAT's attention collapses to uniform weighting while GATv2's attention correctly assigns non-uniform importance. Empirically, GATv2 matches or exceeds GAT on the standard benchmarks evaluated.

6.5 Attention Sparsity and Interpretability

A commonly cited advantage of GAT over GCN is interpretability: the attention coefficients $\alpha_{uv}$ can be read as edge importance scores. Edges with high $\alpha_{uv}$ are "important" for node $v$'s representation; edges with near-zero $\alpha_{uv}$ are "ignored."

Limitations of this interpretation. Jain & Wallace (2019) and Wiegreffe & Pinter (2019) showed for NLP models that attention weights do not reliably indicate feature importance. Similar caveats apply to GAT:

  1. Attention weights are post-softmax and sum to 1; a weight of $0.9$ out of 3 neighbors is very different from $0.9$ out of 300 neighbors.
  2. The softmax creates competition between neighbors - a "dominant" neighbor may capture high attention simply by having a large dot product with $\mathbf{a}$, not because it is semantically important.
  3. Different attention heads may attend to completely different structural patterns, and the meaning of each head is not determined a priori.

Despite these caveats, GAT attention patterns do in practice reveal meaningful structure for molecular and knowledge-graph tasks, especially when validated against domain knowledge.


7. Expressiveness and the Weisfeiler-Leman Test

7.1 The Graph Isomorphism Problem

Definition (Graph Isomorphism). Two graphs $G = (V_G, E_G)$ and $H = (V_H, E_H)$ are isomorphic (written $G \cong H$) if there exists a bijection $\phi: V_G \to V_H$ such that $(u, v) \in E_G \iff (\phi(u), \phi(v)) \in E_H$.

Two isomorphic graphs are structurally identical - they differ only in how the vertices are labeled. For any function of graph structure (predicting molecular properties, classifying graph type), the function must produce the same output on isomorphic graphs. This is exactly the permutation invariance requirement from 2.3.

The graph isomorphism (GI) problem - determining whether two given graphs are isomorphic - is one of the few natural problems not known to be in P or NP-complete. For practical purposes (GNN expressiveness), we ask a weaker question: can a GNN distinguish two non-isomorphic graphs by assigning them different representations?

Non-isomorphic graphs that look similar. Consider:

Graph 1: Two disjoint triangles (C_3 \cup C_3)
         o-o   o-o
          \ /   \ /
           o     o

Graph 2: One 6-cycle (C_6)
         o-o-o
         |   |
         o-o-o

Both graphs have 6 nodes, each of degree 2 - the same degree sequence. Are they isomorphic? No: $C_3 \cup C_3$ has two triangles; $C_6$ has no triangles. A GNN should assign them different representations. Can it?

7.2 1-WL and Color Refinement

The Weisfeiler-Leman (WL) graph isomorphism test (Weisfeiler & Leman, 1968) is an efficient heuristic for graph isomorphism. It is not complete - there exist non-isomorphic pairs it cannot distinguish - but it separates most non-isomorphic graph pairs that arise in practice.

1-WL Color Refinement Algorithm:

Initialize: assign all nodes the same color $c_v^{(0)} = 1$ (or, for node-attributed graphs, $c_v^{(0)} = \operatorname{hash}(\mathbf{x}_v)$).

Iterate for $t = 0, 1, 2, \ldots$ until convergence:

$$c_v^{(t+1)} = \operatorname{HASH}\!\left(c_v^{(t)},\, \{\{c_u^{(t)} : u \in \mathcal{N}(v)\}\}\right)$$

where $\{\{\cdot\}\}$ denotes a multiset (duplicate elements are preserved) and HASH is an injective hash function on (color, multiset of colors) pairs.

Decision: If at any iteration graphs $G$ and $H$ have different multisets of node colors, declare them non-isomorphic. If the algorithm stabilizes with the same color multisets, declare them (possibly) isomorphic.

Convergence: The algorithm stabilizes in at most $n$ iterations (when no color changes occur). This gives $O(n^2 \log n)$ time complexity.

Example. For $C_3 \cup C_3$ vs $C_6$ with no node attributes:

  • $t=0$: all nodes have color $c=1$ - same for both graphs
  • $t=1$: $c_v^{(1)} = \operatorname{HASH}(1, \{\{1, 1\}\})$ is the same for every node, since every node has degree 2 with identically colored neighbors - still the same!
  • The algorithm stabilizes: 1-WL cannot distinguish $C_3 \cup C_3$ from $C_6$ - from 1-WL's viewpoint, both are simply six degree-2 nodes with identically colored neighbors.

This is a fundamental limitation: 1-WL cannot detect triangles (3-cycles), so neither can any MPNN (Theorem below).
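The refinement loop is short enough to sketch directly. The adjacency-list encoding and the integer relabeling in place of an injective HASH are assumptions made for illustration.

```python
# Minimal 1-WL color refinement sketch (plain Python).
from collections import Counter

def wl_colors(adj):
    """adj: dict node -> list of neighbors. Returns the stable color histogram."""
    colors = {v: 0 for v in adj}                       # uniform initial colors
    for _ in range(len(adj)):                          # at most n iterations
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                      for v in adj}                    # (own color, neighbor multiset)
        relabel = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: relabel[signatures[v]] for v in adj}
        if new_colors == colors:                       # stable coloring reached
            break
        colors = new_colors
    return Counter(colors.values())

two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
# Identical color histograms: 1-WL cannot tell C3 ∪ C3 from C6.
print(wl_colors(two_triangles) == wl_colors(six_cycle))   # True
```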

7.3 The GNN Expressiveness Theorem

Theorem (Xu et al., 2019). Let $\mathcal{A}$ be any MPNN with a fixed number of layers $L$ and countably many colors (feature values). Then:

  1. If $\mathcal{A}$ assigns different representations to graphs $G \not\cong H$, then 1-WL also distinguishes $G$ and $H$ (in $\leq L$ iterations).
  2. For any pair $(G, H)$ that 1-WL distinguishes, there exists an MPNN $\mathcal{A}$ that also distinguishes them.

Corollary. The discriminative power of any MPNN is bounded above by 1-WL. Conversely, the most expressive MPNN achieves exactly the discriminative power of 1-WL.

Proof sketch (upper bound). At each MPNN layer, node $v$'s representation is a function of $(c_v^{[l]}, \{\{c_u^{[l]} : u \in \mathcal{N}(v)\}\})$. This is precisely what 1-WL computes. If the aggregation + update function is injective (different inputs -> different outputs), the MPNN exactly simulates 1-WL. If it is not injective (e.g., mean aggregation), it may collapse distinct multisets to the same representation, making it strictly weaker than 1-WL.

Implications:

  • MPNNs with mean or max aggregation are strictly weaker than 1-WL - they collapse some distinct multisets
  • MPNNs with sum aggregation and injective MLP update are exactly 1-WL - GIN achieves this
  • No MPNN (regardless of architecture) can distinguish graphs that 1-WL cannot distinguish (e.g., $C_3 \cup C_3$ vs $C_6$, or regular graphs of the same degree)
  • Expressiveness limitations are fundamental, not a matter of training or capacity

7.4 GIN: The Most Expressive 1-WL GNN

Graph Isomorphism Network (Xu et al., 2019). To achieve 1-WL expressiveness, the aggregation function must be injective on multisets. Xu et al. prove:

Lemma. Any injective function on multisets over a countable universe can be expressed as:

$$\mathbf{h} = \varphi\!\left(\sum_{u \in \mathcal{S}} f(\mathbf{h}_u)\right)$$

for some functions $\varphi, f$. In other words, sum aggregation with an MLP is a universal approximator for injective multiset functions.

GIN layer:

$$\mathbf{h}_v^{[l+1]} = \operatorname{MLP}^{[l]}\!\left((1 + \varepsilon^{[l]}) \cdot \mathbf{h}_v^{[l]} + \sum_{u \in \mathcal{N}(v)} \mathbf{h}_u^{[l]}\right)$$

where $\varepsilon^{[l]}$ is either learned or set to $0$. The $(1+\varepsilon)$ term lets the network distinguish the central node's own feature from the sum of its neighbors' features, so a node cannot be confused with a neighborhood whose features merely sum to the same value.
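A minimal sketch of one GIN layer, assuming a 2-layer ReLU MLP and a dense NumPy adjacency for brevity:

```python
# Minimal GIN layer sketch (NumPy); MLP width and depth are illustrative choices.
import numpy as np

def gin_layer(H, adj, W1, b1, W2, b2, eps=0.0):
    """H: (n, d) features; adj: (n, n) 0/1 adjacency without self-loops."""
    agg = (1.0 + eps) * H + adj @ H          # (1 + eps) * h_v + sum over neighbors
    hidden = np.maximum(agg @ W1 + b1, 0.0)  # MLP layer 1 (ReLU)
    return hidden @ W2 + b2                  # MLP layer 2

rng = np.random.default_rng(0)
n, d, dh = 5, 4, 8
H = rng.normal(size=(n, d))
adj = (rng.random((n, n)) < 0.4).astype(float)
adj = np.triu(adj, 1); adj = adj + adj.T     # symmetric, no self-loops
out = gin_layer(H, adj, rng.normal(size=(d, dh)), np.zeros(dh),
                rng.normal(size=(dh, d)), np.zeros(d))
```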

Why mean and max fail:

Aggregation | Counterexample multisets | Behavior
Mean | $\{1, 1\}$ vs $\{1, 1, 1\}$ | Both give mean = 1 - indistinguishable
Max | $\{1, 2\}$ vs $\{2\}$ | Both give max = 2 - indistinguishable
Sum | $\{1, 1\}$ vs $\{1, 1, 1\}$ | Sums 2 vs 3 - distinguishable

GIN for graph classification. Use sum readout at each layer and combine across layers:

$$\mathbf{h}_G = \operatorname{CONCAT}\!\left(\operatorname{READOUT}^{[l]}\!\left(\left\{\mathbf{h}_v^{[l]}\right\}\right) : l = 0, 1, \ldots, L\right)$$

This jumping-knowledge-style readout captures structural patterns at multiple scales.

7.5 Beyond 1-WL: Higher-Order GNNs

1-WL's limitations (e.g., failing to count triangles, failing on regular graphs) motivate higher-order extensions.

$k$-WL test. Instead of coloring individual nodes, color $k$-tuples of nodes $(v_1, \ldots, v_k)$. The refinement rule considers the colors of all $k$-tuples that differ in exactly one position. $k$-WL is strictly more powerful than $(k-1)$-WL for all $k \geq 2$.

$k$-GNN (Morris et al., 2019). Implement $k$-WL as a GNN by passing messages between $k$-tuples. The cost is $O(n^k)$ - prohibitive for $k \geq 3$ on large graphs.

NGNN and subgraph GNNs. A more practical approach: instead of node-level MPNNs, run MPNNs on subgraphs induced by $k$-hop neighborhoods, ego-graphs, or sampled subgraphs. These can detect cycles, cliques, and other motifs that 1-WL misses. Examples: NGNN (Zhang & Li, 2021), OSAN (Zhao et al., 2022).

Practical trade-off:

Method | Expressiveness | Complexity | Practical use
1-WL MPNN (GIN) | 1-WL | $O(m \cdot d)$ | Standard; most applications
Subgraph GNNs | > 1-WL | $O(n \cdot m \cdot d)$ | Medium graphs, chemistry
$k$-GNN ($k=3$) | 3-WL | $O(n^3 \cdot d)$ | Small graphs only
Random features | Universal | $O(m \cdot d)$ | Simple, effective in practice

7.6 Structural Features and Distance Encoding

A practical alternative to higher-order GNNs: augment node features with handcrafted structural descriptors that help the GNN detect patterns it otherwise cannot.

Random Walk Structural Encoding (RWSE). For each node $v$, compute the landing probability of a random walk of length $k$ starting and ending at $v$:

$$p_v^{(k)} = [A^k]_{vv} / d_v$$

The vector $[p_v^{(1)}, p_v^{(2)}, \ldots, p_v^{(K)}]$ encodes the local loop structure (triangles, 4-cycles, etc.). Used, e.g., in GPS (Rampasek et al., 2022).
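A sketch of one common way to compute such an encoding, using the row-stochastic random-walk matrix $P = D^{-1}A$; the exact normalization varies between papers, so treat this as illustrative rather than canonical.

```python
# Random-walk return probabilities per node (NumPy sketch).
import numpy as np

def rwse(adj, K=4):
    """adj: (n, n) 0/1 adjacency. Returns (n, K): k-step return probabilities, k = 1..K."""
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    P = adj / deg                            # row-stochastic random-walk matrix D^{-1} A
    feats, Pk = [], np.eye(adj.shape[0])
    for _ in range(K):
        Pk = Pk @ P
        feats.append(np.diag(Pk))            # probability of returning after k steps
    return np.stack(feats, axis=1)

# Triangle vs 3-node path: only the triangle's nodes have a nonzero 3-step return prob.
tri = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
print(rwse(tri, 3)[:, 2], rwse(path, 3)[:, 2])   # [0.25 0.25 0.25] vs [0. 0. 0.]
```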

Laplacian Positional Encoding (LapPE). Use the first $k$ eigenvectors of the normalized Laplacian as node features: $\mathbf{x}_v \leftarrow [\mathbf{x}_v \,\|\, \mathbf{u}_1(v), \ldots, \mathbf{u}_k(v)]$. (Reviewed in 11-04 9.5 - see there for the sign invariance issue and fix.)

Degree features. Simply appending the node's degree $d_v$ as a feature lets the GNN use degree information from the very first layer, before the 1-WL-style refinement would implicitly recover it from uniform initial colors. A more sophisticated variant, distance encoding (Li et al., 2020), appends the shortest-path distances from a node to a set of anchor nodes.


8. Over-Smoothing, Over-Squashing, and Depth

8.1 Over-Smoothing: Formal Analysis

A 2-layer GCN typically outperforms a 6-layer GCN on standard node classification benchmarks. This seems paradoxical: more layers should give access to more of the graph. The reason is over-smoothing: as depth increases, all node representations converge to the same vector, destroying the discriminative information needed for classification.

Formal statement (Li et al., 2018). Let $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ be the GCN propagation matrix (assume $\tilde{A} = A + I$ for simplicity). Then:

$$\hat{A}^k \to \boldsymbol{\pi} \mathbf{1}^\top \quad \text{as } k \to \infty$$

where $\boldsymbol{\pi} \in \mathbb{R}^n$ is the stationary distribution of the random walk on $\tilde{A}$: $\pi_v = \tilde{d}_v / (2m + n)$. That is, the $k$-step propagation from any starting point converges to the stationary distribution regardless of initial conditions.

Consequence. For a linear GCN (no nonlinearities), $H^{[L]} = \hat{A}^L X W^{(0)} W^{(1)} \cdots W^{(L-1)}$. As $L \to \infty$:

$$H^{[L]}_{v,:} \to \pi_v \cdot \mathbf{1}^\top X W \propto \text{const} \times (\text{column average of } XW)$$

All nodes collapse to a constant multiple of the graph-wide feature average. For connected graphs, $H^{[L]}$ converges to a rank-1 matrix - all node representations are proportional to $\boldsymbol{\pi}$.

With nonlinearities (ReLU), convergence is not exact, but the trend holds: after 6-8 layers, node representations on typical graphs (small-world, power-law degree distribution) are nearly identical.

MADGap metric (Chen et al., 2020). Mean Average Distance (MAD) measures pairwise distances between node representations. MADGap = MAD(between-class) - MAD(within-class). For a good classifier, MADGap should be large (between-class distances large, within-class distances small). Over-smoothing drives MAD -> 0, collapsing MADGap.

8.2 Information-Theoretic View via Dirichlet Energy

The Dirichlet energy of the node representation matrix $H \in \mathbb{R}^{n \times d}$ measures total variation across edges:

$$E(H) = \sum_{(i,j) \in E} \lVert\mathbf{h}_i - \mathbf{h}_j\rVert_2^2 = \operatorname{tr}(H^\top L H)$$

where $L = D - A$ is the unnormalized Laplacian.

Properties:

  • $E(H) = 0 \iff \mathbf{h}_i = \mathbf{h}_j$ for all connected $i, j$ (on a connected graph: all nodes identical)
  • $E(H)$ is large when adjacent nodes have different representations - high "frequency content"
  • The GCN smoothing step reduces $E(H)$: $E(SH) \leq E(H)$ where $S = \hat{A}$ (since $\hat{A}$ is a low-pass graph filter)

Over-smoothing = energy dissipation. Each GCN layer reduces $E(H)$. After $L$ layers:

$$E(H^{[L]}) \leq \lambda_{\max}(S)^{2L} \cdot E(H^{[0]})$$

Since $\lambda_{\max}(S) \leq 1$ for the symmetric normalized $\hat{A}$ with self-loops, the energy decays geometrically. The rate depends on $\lambda_2(L)$ (the Fiedler value): graphs with a large spectral gap (expanders) over-smooth faster than graphs with a small spectral gap (clusters).

Practical implication. For clustered graphs (molecular graphs, citation networks with strong community structure), over-smoothing is slow - one can use deeper GNNs. For expander-like graphs (dense social networks, random graphs), over-smoothing is fast - 2-4 layers is optimal.
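The energy dissipation is easy to observe numerically. The sketch below repeatedly applies the (linear, weight-free) smoothing operator $\hat{A}$ to random features on a random graph and prints the Dirichlet energy after each step; the graph size and density are arbitrary choices for illustration.

```python
# Watch the Dirichlet energy collapse under repeated GCN-style smoothing (NumPy).
import numpy as np

def dirichlet_energy(H, adj):
    L = np.diag(adj.sum(axis=1)) - adj               # unnormalized Laplacian D - A
    return np.trace(H.T @ L @ H)

rng = np.random.default_rng(0)
n = 20
adj = (rng.random((n, n)) < 0.2).astype(float)
adj = np.triu(adj, 1); adj = adj + adj.T
A_t = adj + np.eye(n)                                # add self-loops
d_t = A_t.sum(axis=1)
S = A_t / np.sqrt(np.outer(d_t, d_t))                # \hat{A} = D~^{-1/2} A~ D~^{-1/2}
H = rng.normal(size=(n, 4))
for layer in range(10):
    print(layer, round(float(dirichlet_energy(H, adj)), 4))
    H = S @ H                                        # energy falls rapidly as neighbors align
```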

8.3 Over-Squashing: Bottleneck Nodes

Over-squashing is a distinct (and more subtle) problem: information from distant nodes is exponentially compressed as it flows through bottleneck edges, preventing long-range interactions from influencing predictions.

Jacobian analysis (Alon & Yahav, 2021). Consider the Jacobian of node $v$'s representation at layer $k$ with respect to node $u$'s initial feature:

$$\frac{\partial \mathbf{h}_v^{[k]}}{\partial \mathbf{x}_u} = \prod_{l=0}^{k-1} (W^{[l]})^\top \cdot \frac{\partial \mathbf{h}_v^{[k]}}{\partial \mathbf{h}_v^{[k-1]}} \cdots \frac{\partial \mathbf{h}_{u'}^{[1]}}{\partial \mathbf{x}_u}$$

where the product runs along any path from $u$ to $v$ of length $k$. The norm of this Jacobian is bounded by:

$$\left\lVert\frac{\partial \mathbf{h}_v^{[k]}}{\partial \mathbf{x}_u}\right\rVert \leq C \cdot \left(\hat{A}^k\right)_{vu}$$

where $(\hat{A}^k)_{vu}$ is the $(v,u)$ entry of the $k$-th power of the propagation matrix. For nodes far apart in the graph, $(\hat{A}^k)_{vu}$ is exponentially small in the distance $d(u,v)$.

The bottleneck. If $u$ and $v$ are separated by a single edge $(s, t)$ of high betweenness centrality (a "bridge"), all information from the $u$-side to the $v$-side must pass through $(s,t)$. The aggregation at $s$ receives messages from all $|\mathcal{N}(s)|$ neighbors and compresses them into a single $d$-dimensional vector - losing information at a rate proportional to $|\mathcal{N}(s)| / d$.

Over-squashing vs over-smoothing. These are dual problems:

  • Over-smoothing: too much aggregation -> representations become too similar
  • Over-squashing: aggregation is a bottleneck -> distant information cannot influence predictions
  • Over-smoothing is a property of the graph and its spectrum; over-squashing is a property of specific bottleneck edges

8.4 Remedies for Over-Smoothing

DropEdge (Rong et al., 2020). Randomly remove edges during training: replace $A$ with a sparsified $\tilde{A}_{\text{drop}}$ that drops each edge independently with probability $p$. This is analogous to dropout for neurons but applied to edges. It slows the convergence of over-smoothing (less propagation per layer) and acts as a data augmentation method. Improves performance at depths 4-8 on standard benchmarks.

PairNorm (Zhao & Akoglu, 2020). After each GCN layer, re-normalize the node representations to ensure a fixed total pairwise distance:

$$\mathbf{h}_v \leftarrow \mathbf{h}_v - \bar{\mathbf{h}}, \qquad \bar{\mathbf{h}} = \frac{1}{n}\sum_v \mathbf{h}_v$$

$$\mathbf{h}_v \leftarrow s \cdot \frac{\mathbf{h}_v}{\left(\frac{1}{n}\sum_v \lVert\mathbf{h}_v\rVert^2\right)^{1/2}}$$

where $s$ is a scaling hyperparameter. PairNorm prevents the Dirichlet energy from decaying to zero by keeping pairwise distances bounded below. Empirically effective for deep GCNs (8-16 layers).
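A minimal sketch of the two PairNorm steps (centering, then rescaling so that the mean squared row norm equals $s^2$); applying it to the whole feature matrix at once is assumed for simplicity.

```python
# PairNorm sketch (NumPy): center node features, then restore a fixed overall scale.
import numpy as np

def pair_norm(H, s=1.0):
    H = H - H.mean(axis=0, keepdims=True)                  # subtract the mean node vector
    scale = np.sqrt((H ** 2).sum(axis=1).mean())           # root mean squared row norm
    return s * H / max(scale, 1e-12)

H = np.random.default_rng(0).normal(size=(6, 3)) * 0.01    # nearly collapsed features
print(round(float((pair_norm(H) ** 2).sum(axis=1).mean()), 4))  # mean squared norm back to ~1.0
```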

Initial residual connection (GCNII, Chen et al., 2020).

$$\mathbf{h}_v^{[l+1]} = \sigma\!\left(\left[(1-\alpha)\hat{A}\mathbf{h}_v^{[l]} + \alpha \mathbf{h}_v^{[0]}\right] \cdot \left[(1-\beta) I + \beta W^{[l]}\right]\right)$$

Two additions: (1) the $\alpha \mathbf{h}_v^{[0]}$ term - always mixing in the initial feature with weight $\alpha$ prevents full convergence; (2) the weight matrix mixes between the identity ($\beta=0$, pure smoothing) and a full linear transform ($\beta=1$). With $\alpha=0.1$ and $\beta_l = \log(\lambda/l + 1)$ (decreasing with depth), GCNII successfully trains 64-layer GCNs, achieving state-of-the-art on Cora at the time (85.5% accuracy vs 81.5% for a 2-layer GCN).

GroupNorm / NodeNorm. Normalize within groups of nodes to prevent scale collapse. Less commonly used than the above.

8.5 Graph Rewiring

A more aggressive approach: change the graph structure to improve information flow before running the GNN.

Diffusion-based rewiring (DIGL, Gasteiger et al., 2019). Replace $A$ with the (truncated) heat kernel $\Theta = \exp(-t L)$ or the personalized PageRank matrix $\Theta = \alpha(I - (1-\alpha)\hat{A})^{-1}$. The diffusion matrix connects distant nodes with non-zero weights, enabling long-range information flow in a single GNN layer. Edges with small weights are pruned, keeping the graph sparse.

Expander Graph Propagation (EGP, Deac et al., 2022). Augment the graph with the edges of a sparse expander graph (a $d$-regular graph with large spectral gap, e.g., a Ramanujan graph). Expander edges provide "shortcuts" that reduce the effective diameter of the graph, alleviating over-squashing without the $O(n^2)$ cost of full connectivity.

FoSR: First-Order Spectral Rewiring (Karhadkar et al., 2023). Add edges that most increase $\lambda_2(L)$ (the Fiedler value), directly attacking the spectral bottleneck. Greedy algorithm: at each step, add the edge $(u,v) \notin E$ that maximally increases $\lambda_2$ of the augmented graph. With $k$ added edges, the graph diameter decreases and over-squashing is reduced.

8.6 Depth vs Width Trade-off in GNNs

Why shallow GNNs dominate in practice. On most benchmark tasks (node classification on citation networks, graph classification on molecular benchmarks), 2-4 GNN layers achieve the best performance. The reasons:

  1. Homophily dominates: in social and citation networks, the most informative neighbors are at distance 1-2. Beyond distance 3, nodes are often from different classes, and aggregating them hurts classification.
  2. Over-smoothing: as analyzed above, deep GCNs lose discriminative information.
  3. Exponential neighborhood growth: a node's $k$-hop neighborhood in a power-law graph has $\sim \bar{d}^k$ nodes. For $\bar{d}=10$ and $k=5$, that's 100,000 nodes - mostly irrelevant.

When depth helps. Long-range tasks where predictions depend on distant nodes:

  • Predicting molecular properties that depend on the entire molecular graph (QM9 quantum chemistry dataset)
  • Reasoning over knowledge graphs with long inference chains
  • Planning and path-finding tasks on sparse graphs

For these tasks, graph Transformers (10) - which compute full pairwise attention in $O(n^2)$ - often outperform deep GNNs.

Jumping Knowledge Networks (Xu et al., 2018). A principled approach: each node selects the representation from the most appropriate depth:

$$\mathbf{h}_v^{\text{final}} = \operatorname{COMBINE}\!\left(\mathbf{h}_v^{[1]}, \mathbf{h}_v^{[2]}, \ldots, \mathbf{h}_v^{[L]}\right)$$

COMBINE can be concatenation, LSTM-attention, or max-pooling across layers. This allows different nodes to "listen" at different distances - nodes in dense clusters use shallow representations; nodes that are bridges benefit from deeper representations.
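A sketch of the max-pooling variant of COMBINE, assuming the per-layer outputs are already computed; the concatenation variant would simply replace the max with np.concatenate along the feature axis.

```python
# Jumping-knowledge combine sketch (NumPy): element-wise max over layer outputs.
import numpy as np

def jk_max(layer_outputs):
    """layer_outputs: list of (n, d) arrays, one per GNN layer."""
    return np.max(np.stack(layer_outputs, axis=0), axis=0)   # per-node, per-feature max

rng = np.random.default_rng(0)
outs = [rng.normal(size=(5, 8)) for _ in range(3)]            # stand-ins for h^[1..3]
h_final = jk_max(outs)                                        # shape (5, 8)
```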


9. Graph Pooling and Hierarchical GNNs

9.1 Why Pooling Is Non-Trivial on Graphs

In image CNNs, pooling is straightforward: reduce a $2 \times 2$ spatial block to a single pixel by averaging. The operation is well-defined because images have a fixed grid structure.

On graphs, pooling (coarsening) must answer: which nodes get merged? How are their features combined? How is the edge structure of the coarsened graph defined? All of these must be permutation invariant - the coarsened graph cannot depend on how vertices are labeled.

The challenges:

  1. No canonical merging: unlike pixels in a grid, there is no obvious spatial proximity to guide merging. Spectral methods (9.4) use the Fiedler vector; learned methods (9.3) learn soft assignments.
  2. Edge reconstruction: after merging nodes $\{v_1, v_2\}$ into a super-node $s$, which edges does $s$ inherit? Typically all edges incident to $v_1$ or $v_2$ - but this may create multi-edges and self-loops.
  3. Information loss: pooling irreversibly reduces the graph. Unlike deconvolution in image models, there is no standard graph "unpooling" that perfectly recovers the original structure.

9.2 Global Pooling Methods

For graph-level tasks, a single pooling step at the end suffices. The readout function $R(\{\mathbf{h}_v^{[L]}\})$ maps the set of node representations to a single graph embedding.

Sum pooling: $\mathbf{h}_G = \sum_{v \in V} \mathbf{h}_v^{[L]}$. Sensitive to graph size (larger graphs get larger embeddings). Most expressive for graph-level tasks (by the same argument as sum aggregation for node-level).

Mean pooling: $\mathbf{h}_G = \frac{1}{n}\sum_{v} \mathbf{h}_v^{[L]}$. Normalizes for graph size; best when comparing graphs of different sizes.

Max pooling: $(\mathbf{h}_G)_k = \max_v (\mathbf{h}_v^{[L]})_k$. Detects whether any node has feature $k$ above a threshold; good for detecting rare structural motifs.

Attention pooling (gated global pooling):

$$\mathbf{h}_G = \sum_{v \in V} \operatorname{softmax}\!\left(f_\text{gate}(\mathbf{h}_v^{[L]})\right)_v \cdot f_\text{feat}(\mathbf{h}_v^{[L]})$$

where $f_\text{gate}: \mathbb{R}^d \to \mathbb{R}$ scores each node, and $f_\text{feat}: \mathbb{R}^d \to \mathbb{R}^{d'}$ transforms the features. This allows the model to focus on the most informative nodes. Used in Graph U-Net, Graphormer.
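The readouts above fit in a few lines. In this sketch the gating and feature maps are stand-in linear layers, which is an assumption for illustration.

```python
# Global readout sketches (NumPy): sum, mean, max, and gated attention pooling.
import numpy as np

def readouts(H, w_gate, W_feat):
    """H: (n, d) final node embeddings; w_gate: (d,); W_feat: (d, d_out)."""
    pooled = {"sum": H.sum(axis=0), "mean": H.mean(axis=0), "max": H.max(axis=0)}
    scores = H @ w_gate                                       # one gate score per node
    alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()   # softmax over nodes
    pooled["attn"] = (alpha[:, None] * (H @ W_feat)).sum(axis=0)  # gated weighted sum
    return pooled

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 4))
out = readouts(H, rng.normal(size=4), rng.normal(size=(4, 5)))
```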

Set2Set (Vinyals et al., 2016). An LSTM-based readout that runs $T$ steps of attention over the node set, accumulating a context vector:

$$\mathbf{q}_t = \operatorname{LSTM}(\mathbf{q}_{t-1}), \qquad e_{tv} = \mathbf{q}_t^\top \mathbf{h}_v, \qquad \alpha_{tv} = \operatorname{softmax}_v(e_{tv})$$

$$\mathbf{c}_t = \sum_v \alpha_{tv} \mathbf{h}_v, \qquad \mathbf{h}_G = \mathbf{c}_T$$

More expressive than simple pooling but with $O(nT)$ sequential steps.

9.3 DiffPool: Differentiable Graph Pooling

DiffPool (Ying et al., 2018) learns soft cluster assignments to hierarchically coarsen the graph. At each pooling level, run two GNNs in parallel:

$$S^{(l)} = \operatorname{softmax}\!\left(\operatorname{GNN}_{\text{pool}}^{(l)}\!\left(A^{(l)}, H^{(l)}\right)\right) \in \mathbb{R}^{n_l \times n_{l+1}}$$

$$Z^{(l)} = \operatorname{GNN}_{\text{embed}}^{(l)}\!\left(A^{(l)}, H^{(l)}\right) \in \mathbb{R}^{n_l \times d}$$

where $n_l$ is the number of nodes at level $l$, $n_{l+1} < n_l$ is the target number of clusters, $S^{(l)}$ is the soft assignment matrix (each node assigns fractionally to each of the $n_{l+1}$ clusters), and $Z^{(l)}$ is the GNN embedding of the current-level nodes.

Coarsened graph:

$$H^{(l+1)} = S^{(l)\top} Z^{(l)} \in \mathbb{R}^{n_{l+1} \times d}$$

$$A^{(l+1)} = S^{(l)\top} A^{(l)} S^{(l)} \in \mathbb{R}^{n_{l+1} \times n_{l+1}}$$

The coarsened adjacency $A^{(l+1)}$ is dense - every pair of clusters has a (soft) connection weighted by the total edge weight between them.
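A sketch of the pooling algebra with the two GNNs replaced by random stand-ins, so that only the assignment, coarsening, and link-prediction terms are shown.

```python
# DiffPool coarsening sketch (NumPy): soft assignment S pools features and adjacency.
import numpy as np

def softmax_rows(X):
    e = np.exp(X - X.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, n_next = 8, 4, 3
A = (rng.random((n, n)) < 0.3).astype(float); A = np.triu(A, 1); A = A + A.T
S = softmax_rows(rng.normal(size=(n, n_next)))     # stand-in for GNN_pool output
Z = rng.normal(size=(n, d))                        # stand-in for GNN_embed output
H_next = S.T @ Z                                   # (n_next, d) coarse node features
A_next = S.T @ A @ S                               # (n_next, n_next), dense in general
link_loss = np.linalg.norm(A - S @ S.T)            # auxiliary link-prediction term (Frobenius)
```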

Auxiliary losses. DiffPool adds two regularization terms to the main task loss:

  • Link prediction loss: $\mathcal{L}_{\text{LP}} = \lVert A^{(l)} - S^{(l)} S^{(l)\top} \rVert_F$ - encourages clusters to correspond to connected subgraphs
  • Entropy loss: $\mathcal{L}_{\text{E}} = \frac{1}{n}\sum_v H(S^{(l)}_{v,:})$ - encourages crisp (non-uniform) cluster assignments

Limitation. DiffPool produces a dense $A^{(l+1)}$ at each level - $O(n_{l+1}^2)$ memory. For large graphs ($n > 10^4$), this is prohibitive.

9.4 MinCutPool and Spectral Pooling

MinCutPool (Bianchi et al., 2020) addresses DiffPool's density problem by formulating pooling as a spectral clustering problem with a differentiable objective.

The mincut objective. For a soft cluster assignment $S \in \mathbb{R}^{n \times k}$, the normalized minimum cut is:

$$\text{minCUT}(S) = \frac{\operatorname{tr}(S^\top L S)}{\operatorname{tr}(S^\top D S)}$$

Minimizing this over soft assignments is the relaxation solved by spectral clustering - the continuous optimum is spanned by the $k$ smallest eigenvectors of $L_{\text{rw}} = D^{-1}L$. MinCutPool uses a GNN to produce $S$ and trains end-to-end with the minCUT loss plus an orthogonality regularizer $\lVert S^\top S / \lVert S^\top S \rVert_F - I / \sqrt{k} \rVert_F$ to prevent degenerate cluster assignments.

Advantage over DiffPool: the objective is theoretically motivated by spectral graph theory, and the orthogonality regularization keeps the assignment matrix well-conditioned.

9.5 SAGPool: Self-Attention Graph Pooling

SAGPool (Lee et al., 2019) takes a different approach: instead of soft cluster assignments, select the top-$k$ nodes by a learned importance score and discard the rest.

Algorithm:

  1. Run one GNN layer to compute node scores: $\mathbf{z} = \operatorname{GNN}(A, H) \in \mathbb{R}^n$
  2. Select the top-$k$ nodes: $\text{idx} = \operatorname{topk}(\mathbf{z}, k)$
  3. Gate the selected features: $H' = H_{\text{idx}} \odot \operatorname{sigmoid}(\mathbf{z}_{\text{idx}})$
  4. Induce the subgraph: $A' = A_{\text{idx,idx}}$ (restrict $A$ to the selected nodes)

Advantages: simple, interpretable (which nodes are selected?), and the adjacency stays sparse. Disadvantage: hard top-$k$ selection is not differentiable (a straight-through estimator or continuous relaxation is used during training).
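A sketch of the four steps, with the scoring GNN replaced by a single propagation step and a dense adjacency; both are simplifying assumptions.

```python
# SAGPool sketch (NumPy): score nodes, keep the top-k, gate them, induce the subgraph.
import numpy as np

def sag_pool(H, A, w, k):
    z = (A + np.eye(len(A))) @ H @ w               # stand-in scoring "GNN layer" -> (n,)
    idx = np.argsort(-z)[:k]                       # indices of the top-k scores
    gate = 1.0 / (1.0 + np.exp(-z[idx]))           # sigmoid gating of kept nodes
    H_new = H[idx] * gate[:, None]
    A_new = A[np.ix_(idx, idx)]                    # induced subgraph on kept nodes
    return H_new, A_new, idx

rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float); A = np.triu(A, 1); A = A + A.T
H = rng.normal(size=(6, 3))
H2, A2, kept = sag_pool(H, A, rng.normal(size=3), k=3)
```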


10. Graph Transformers

10.1 Motivation: GNNs vs Transformers

The fundamental constraint of MPNNs is locality: a $k$-layer GNN can only aggregate information within a $k$-hop neighborhood. For long-range dependencies (e.g., two atoms on opposite ends of a large molecule that interact through space), a deep GNN would need $O(\text{diameter})$ layers - running into over-smoothing and over-squashing.

The transformer architecture - with full $O(n^2)$ pairwise attention - solves the locality problem: every node can directly attend to every other node in a single layer. But transformers treat inputs as sets of independent tokens; they have no built-in notion of graph structure.

Graph Transformers combine: (1) the global receptive field of transformers (full pairwise attention), with (2) graph-structural inductive biases (local message passing, positional encodings derived from the graph topology).

The trade-off: full attention costs $O(n^2 d)$ per layer - practical only for small-to-medium graphs ($n \leq 10^4$). For large graphs ($n > 10^6$), neighbor-sampled MPNNs remain the only scalable option.

10.2 Positional Encodings for Graphs

In transformers, positional encodings break the permutation symmetry of the attention mechanism: without them, the transformer cannot distinguish token position, and all orderings of the same tokens produce the same output. The same problem occurs in graph transformers: the attention mechanism is permutation invariant, so we need positional encodings that break symmetry in a structure-aware way.

Recall from 11-04 9.5: Laplacian eigenvectors provide a natural graph Fourier basis, and the first $k$ eigenvectors of $L_{\text{sym}}$ give a $k$-dimensional coordinate for each node that reflects the graph's spectral structure. The sign invariance problem (eigenvectors are defined up to sign) is addressed by random sign flipping during training or by using absolute values. Full treatment: 11-04 9.5.

Building on this, three main PE strategies for graph transformers:

Laplacian PE (LapPE). For each node $v$, extract the $v$-th row of the matrix $[\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_k]$, where $\mathbf{u}_i$ are the first $k$ eigenvectors of $L_{\text{sym}}$ (excluding the constant eigenvector). Append to the node features:

$$\mathbf{x}_v \leftarrow \left[\mathbf{x}_v \,\|\, \mathbf{u}_1(v), \ldots, \mathbf{u}_k(v)\right]$$

Sign ambiguity fix: during training, randomly flip the sign of each eigenvector independently (this has no effect on the Laplacian spectrum but makes the model invariant to sign choice). Used in Dwivedi et al. (2020), GPS (2022).
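A sketch of LapPE with training-time sign flips, assuming a dense adjacency and using numpy.linalg.eigh on $L_{\text{sym}}$; production code would typically use sparse eigensolvers.

```python
# Laplacian PE sketch (NumPy): first k non-trivial eigenvectors of L_sym, random sign flips.
import numpy as np

def lap_pe(adj, k, rng, train=True):
    deg = np.maximum(adj.sum(axis=1), 1.0)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    pe = eigvecs[:, 1:k + 1]                       # drop the (near-)constant eigenvector
    if train:
        pe = pe * rng.choice([-1.0, 1.0], size=pe.shape[1])   # per-eigenvector sign flip
    return pe

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
X = rng.normal(size=(4, 5))
X_aug = np.concatenate([X, lap_pe(adj, k=2, rng=rng)], axis=1)   # append PE to features
```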

Random Walk SE (RWSE). For each node $v$ and walk length $p$, compute $(A^p D^{-p})_{vv}$ - the probability of a length-$p$ random walk returning to $v$. Stack for $p = 1, \ldots, P$:

$$\text{RWSE}_v = \left[(AD^{-1})_{vv},\, (A^2 D^{-2})_{vv},\, \ldots,\, (A^P D^{-P})_{vv}\right]$$

RWSE is always non-negative and sign-free (no sign ambiguity). It encodes local loop structure: the length-1 term $(AD^{-1})_{vv}$ is zero unless $v$ has a self-loop, and the length-3 term is positive exactly when $v$ lies on a triangle. Used in GPS (2022) and GIN+RWSE.

Degree and centrality encoding. Simply append $d_v$ (node degree), normalized closeness centrality, or betweenness centrality as scalar node features. Used in Graphormer (2021), which also encodes shortest-path distances as attention biases (10.4).

10.3 GPS Framework

General, Powerful, Scalable (GPS) Graph Transformer (Rampasek et al., 2022) is a general recipe for building graph transformers; it reported strong results across the tasks of the 2022 Long Range Graph Benchmark (LRGB).

GPS Layer:

$$H^{[l+1]} = \operatorname{LayerNorm}\!\left(H^{[l]} + \operatorname{MPNN}^{[l]}\!\left(H^{[l]}, A\right) + \operatorname{Transformer}^{[l]}\!\left(H^{[l]}\right)\right)$$

Each GPS layer combines:

  1. Local MPNN: any standard GNN (GCN, GINE, GAT) operating on the sparse graph edges - captures local structure efficiently
  2. Global Transformer: full multi-head self-attention over all nodes - captures long-range dependencies
  3. LayerNorm + residual: stabilizes training

Modularity. GPS is a framework, not a fixed architecture: the MPNN and Transformer components are interchangeable. In practice, GINE (GIN with edge features) + Transformer works well; one can also use GAT + Performer (linear attention approximation) for scalability.

Positional encodings. GPS accepts arbitrary node-level PEs (LapPE, RWSE, degree) as extra input features, processed by a dedicated PE encoder:

$$\mathbf{x}_v^{\text{in}} = W_{\text{feat}}\, \mathbf{x}_v + W_{\text{PE}}\, \text{PE}_v$$
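A minimal sketch of one GPS-style layer with a mean-aggregation MPNN, single-head global attention, and a single shared LayerNorm; all three simplifications are assumptions relative to the full framework.

```python
# GPS-style layer sketch (NumPy): local MPNN term + global attention term + residual + norm.
import numpy as np

def layer_norm(H, eps=1e-5):
    mu = H.mean(axis=1, keepdims=True)
    var = H.var(axis=1, keepdims=True)
    return (H - mu) / np.sqrt(var + eps)

def gps_layer(H, adj, W_local, Wq, Wk, Wv):
    # Local MPNN: mean over neighbors (including self), then a linear map.
    A_loop = adj + np.eye(len(adj))
    local = (A_loop / A_loop.sum(axis=1, keepdims=True)) @ H @ W_local
    # Global attention: every node attends to every node, ignoring the edges.
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    global_term = attn @ V
    return layer_norm(H + local + global_term)       # residual + LayerNorm

rng = np.random.default_rng(0)
n, d = 6, 8
adj = (rng.random((n, n)) < 0.3).astype(float); adj = np.triu(adj, 1); adj = adj + adj.T
H = rng.normal(size=(n, d))
H_next = gps_layer(H, adj, *(rng.normal(size=(d, d)) for _ in range(4)))
```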

Performance. On the LRGB Peptides-func benchmark (a task whose predictions require integrating global molecular structure), GPS scores significantly higher than the pure-MPNN baselines, demonstrating the value of global attention for long-range tasks.

10.4 Graphormer

Graphormer (Ying et al., 2021) uses a standard transformer architecture (with full pairwise attention) augmented by three graph-structural encodings. It won the OGB-LSC 2021 quantum chemistry track.

Centrality encoding. Add a degree-dependent bias to the node features:

$$\mathbf{x}_v^{\text{in}} = \mathbf{x}_v + \mathbf{z}_{d_v^-} + \mathbf{z}_{d_v^+}$$

where $\mathbf{z}_{d^-}$ and $\mathbf{z}_{d^+}$ are learned embeddings for in- and out-degree. This allows the transformer to distinguish hub nodes from leaf nodes.

Spatial encoding. Add a bias to the attention score between nodes $u$ and $v$ based on their shortest-path distance $\phi(u,v)$:

$$e_{uv} = \frac{(\mathbf{h}_u W_Q)(\mathbf{h}_v W_K)^\top}{\sqrt{d_k}} + b_{\phi(u,v)}$$

where $b_\phi$ is a scalar learned per distance $\phi$. Nodes far apart get one learned attention bias; nearby nodes get a different one. This encodes graph topology directly into the attention pattern.

Edge encoding. For each pair $(u,v)$, average the edge features along the shortest path:

$$c_{uv} = \frac{1}{\phi(u,v)} \sum_{e \in \text{sp}(u,v)} \mathbf{w}_e^\top \mathbf{a}$$

where $\mathbf{w}_e \in \mathbb{R}^{d_e}$ is the feature of path edge $e$ and $\mathbf{a} \in \mathbb{R}^{d_e}$ is a learned vector. This term is added to the attention score as an additional bias.
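A sketch of how the spatial bias enters the attention logits, using Floyd-Warshall for the shortest-path distances and a random stand-in for the learned per-distance bias table.

```python
# Graphormer-style spatial bias sketch (NumPy): SPD-indexed biases added to attention logits.
import numpy as np

def shortest_path_lengths(adj):
    n = len(adj)
    dist = np.where(adj > 0, 1.0, np.inf)
    np.fill_diagonal(dist, 0.0)
    for k in range(n):                               # Floyd-Warshall
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)  # path graph
H, d_k = rng.normal(size=(4, 6)), 6
Wq, Wk = rng.normal(size=(6, d_k)), rng.normal(size=(6, d_k))
phi = shortest_path_lengths(adj).astype(int)         # distances 0..3 on this connected graph
b = rng.normal(size=phi.max() + 1)                   # stand-in for the learned bias per distance
scores = (H @ Wq) @ (H @ Wk).T / np.sqrt(d_k) + b[phi]   # biased attention logits
```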

10.5 Graph Mamba and Sequence-Based Methods

The quadratic $O(n^2)$ cost of full attention makes graph transformers impractical for large graphs. Recent work (2023-2024) explores linear-complexity alternatives.

Converting graphs to sequences. Several methods serialize the graph into a sequence of tokens, then apply a sequence model (LSTM, Mamba SSM):

  • BFS/DFS ordering: nodes in BFS/DFS traversal order; nearby nodes in the graph -> nearby in sequence. Breaks permutation invariance (different traversal orderings give different results).
  • Node-and-edge tokenization (TokenGT, Kim et al., 2022): represent each node and each edge as a separate token; full transformer attention over all $n + m$ tokens. Permutation invariant if node/edge PEs are orthogonalized.

Graph Mamba (Chen et al., 2023). Apply Mamba (a state-space model with selective scan) to graphs by: (1) ordering nodes by degree or BFS, (2) running the SSM along this ordering as a "linearized graph sequence." Achieves $O(n \log n)$ complexity while matching transformer performance on some benchmarks. The key challenge: the SSM must be made robust to different node orderings (since graphs have no canonical ordering).

Status (2026). Graph Mamba and similar linear-attention methods are an active research area. For small-to-medium graphs ($n \leq 10^4$), GPS-style full attention is standard. For large graphs, MPNN-based methods (GraphSAGE, Cluster-GCN) remain dominant in production.

