Graph Neural Networks, Part 2: Sections 6 (Graph Attention Networks) to 10 (Graph Transformers)

6. Graph Attention Networks (GAT)

6.1 The Attention Mechanism on Graphs

GCN's aggregation weights $\frac{1}{\sqrt{\tilde{d}_u \tilde{d}_v}}$ are fixed functions of node degrees - they depend only on the graph structure, not on node features. Two neighbors with identical degree contribute identically regardless of their relevance to the central node's representation.

Graph Attention Networks (Velickovic et al., 2018) replace fixed weights with learned, data-dependent attention coefficients. For each edge $(u, v)$, the attention coefficient $\alpha_{uv}$ is computed as a function of both nodes' features, allowing the network to focus on the most relevant neighbors.

This mirrors the transformer's self-attention, but with attention constrained to the graph's edge structure: node $v$ only attends to its actual neighbors $\mathcal{N}(v)$, not to all nodes (which would be $O(n^2)$). The graph acts as a structural prior on the attention pattern.

6.2 GAT Layer: Attention Coefficients

Step 1: Linear transformation. Apply a shared weight matrix $W \in \mathbb{R}^{d' \times d}$ to transform all node features:

$$\mathbf{z}_v = W\mathbf{h}_v^{[l]} \quad \forall v \in V$$

Step 2: Attention scores. For each edge $(u, v)$ (including self-loops), compute:

$$e_{uv} = \operatorname{LeakyReLU}\!\left(\mathbf{a}^\top \left[\mathbf{z}_v \,\|\, \mathbf{z}_u\right]\right)$$

where $\mathbf{a} \in \mathbb{R}^{2d'}$ is a learnable attention vector, and $[\cdot \| \cdot]$ denotes concatenation. The raw attention score $e_{uv}$ measures the "relevance" of node $u$ to node $v$. The LeakyReLU (with negative slope $0.2$) keeps a small gradient flowing for negative scores.

Step 3: Normalize. Apply softmax over the neighborhood to get normalized coefficients:

$$\alpha_{uv} = \frac{\exp(e_{uv})}{\sum_{k \in \mathcal{N}(v) \cup \{v\}} \exp(e_{kv})}$$

Note: $\alpha_{uv} \geq 0$ and $\sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{uv} = 1$.

Step 4: Aggregation. Compute the updated representation:

$$\mathbf{h}_v^{[l+1]} = \sigma\!\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{uv} \cdot \mathbf{z}_u\right)$$

MPNN view: the message is $M(\mathbf{h}_v, \mathbf{h}_u) = \alpha_{uv} \cdot W\mathbf{h}_u$ and the aggregation is sum.
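To make the four steps concrete, here is a minimal single-head GAT layer in NumPy. The dense adjacency loop, the ReLU output nonlinearity, and the random shapes are simplifying assumptions for illustration, not a reference implementation.

```python
# Minimal single-head GAT layer (NumPy), following Steps 1-4 above.
import numpy as np

def gat_layer(H, adj, W, a, slope=0.2):
    """H: (n, d) node features; adj: (n, n) 0/1 adjacency; W: (d, d_out); a: (2*d_out,)."""
    n = H.shape[0]
    Z = H @ W                                   # Step 1: shared linear transform
    A_loop = adj + np.eye(n)                    # include self-loops
    # Step 2: raw attention scores e[v, u] for every edge (u, v)
    e = np.full((n, n), -np.inf)
    for v in range(n):
        for u in range(n):
            if A_loop[v, u] > 0:
                s = a @ np.concatenate([Z[v], Z[u]])
                e[v, u] = np.where(s > 0, s, slope * s)   # LeakyReLU
    # Step 3: softmax over each node's neighborhood (rows)
    e = e - e.max(axis=1, keepdims=True)
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=1, keepdims=True)
    # Step 4: attention-weighted aggregation + output nonlinearity (ReLU here)
    return np.maximum(alpha @ Z, 0.0)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
adj = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], float)
out = gat_layer(H, adj, rng.normal(size=(3, 2)), rng.normal(size=(4,)))
```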

6.3 Multi-Head Attention for Graphs

Single-head attention can be unstable and may attend to a limited range of features. Multi-head attention stabilizes training and allows the model to jointly attend to different aspects of neighbors.

Multi-head GAT layer. Run $K$ independent attention mechanisms in parallel, each with its own parameters $(W^{(k)}, \mathbf{a}^{(k)})$:

$$\mathbf{h}_v^{[l+1]} = \big\Vert_{k=1}^K\, \sigma\!\left(\sum_{u \in \mathcal{N}(v)} \alpha_{uv}^{(k)} W^{(k)} \mathbf{h}_u^{[l]}\right)$$

where $\Vert$ denotes concatenation. The output dimension is $K \cdot d'$.

For the final layer (before a classification head), averaging is often preferred over concatenation:

$$\mathbf{h}_v^{[L]} = \sigma\!\left(\frac{1}{K} \sum_{k=1}^K \sum_{u \in \mathcal{N}(v)} \alpha_{uv}^{(k)} W^{(k)} \mathbf{h}_u^{[L-1]}\right)$$

Complexity. A $K$-head GAT layer costs $O(K \cdot m \cdot d')$ for attention computation ($m$ edges, $d'$ features per head) and $O(K \cdot n \cdot d \cdot d')$ for the linear transformations. For dense attention (a fully connected graph), this becomes $O(K \cdot n^2 \cdot d')$ - the same scaling as transformer attention.

6.4 GATv2: Fixing the Static Attention Problem

Brody, Alon & Yahav (2022) identified a fundamental limitation of the original GAT: its attention is static - the ranking of a neighbor's importance does not depend on the query node.

The problem. In GAT, the attention score (before the LeakyReLU, which is monotone and therefore does not change rankings) is:

$$e_{uv} = \mathbf{a}^\top [\mathbf{z}_v \| \mathbf{z}_u] = \mathbf{a}_{\text{left}}^\top \mathbf{z}_v + \mathbf{a}_{\text{right}}^\top \mathbf{z}_u$$

Since this is a sum of two terms - one depending only on $v$, one depending only on $u$ - the ranking of neighbors $u_1$ vs $u_2$ by their scores $e_{u_1 v}$ vs $e_{u_2 v}$ is independent of $\mathbf{z}_v$. The "most important neighbor" is the same regardless of which node $v$ is doing the asking. This is "static attention."

Dynamic attention (GATv2). The fix reorders the operations so that the nonlinearity sits between the weight matrix and the attention vector:

$$e_{uv} = \mathbf{a}^\top \operatorname{LeakyReLU}\!\left(W \cdot \left[\mathbf{h}_v \| \mathbf{h}_u\right]\right)$$

Now the LeakyReLU nonlinearity is applied before the dot product with $\mathbf{a}$, creating genuine interaction between $\mathbf{h}_v$ and $\mathbf{h}_u$. The ranking of $u_1$ vs $u_2$ can now depend on $v$ - the attention is dynamic.
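A small sketch of the two scoring functions makes the difference tangible. The shapes, random parameters, and helper names are assumptions for illustration; the point is that GAT's additive score (with a monotone LeakyReLU) picks the same preferred neighbor for every query node, while GATv2's score can reorder neighbors per query.

```python
# Contrast the GAT and GATv2 scoring functions (NumPy sketch).
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def score_gat(h_v, h_u, W, a):
    # GAT: LeakyReLU(a^T [W h_v || W h_u]) -- additive in v and u.
    return leaky_relu(a @ np.concatenate([W @ h_v, W @ h_u]))

def score_gatv2(h_v, h_u, W, a):
    # GATv2: a^T LeakyReLU(W [h_v || h_u]) -- nonlinearity before the dot product.
    return a @ leaky_relu(W @ np.concatenate([h_v, h_u]))

rng = np.random.default_rng(1)
d, dp = 3, 4
W1, a1 = rng.normal(size=(dp, d)), rng.normal(size=2 * dp)
W2, a2 = rng.normal(size=(dp, 2 * d)), rng.normal(size=dp)
h_v1, h_v2 = rng.normal(size=d), rng.normal(size=d)
u1, u2 = rng.normal(size=d), rng.normal(size=d)
# For GAT the two comparisons always agree (same preferred neighbor for any query);
# for GATv2 the preferred neighbor can flip with the query node.
for f, W, a in [(score_gat, W1, a1), (score_gatv2, W2, a2)]:
    print(f.__name__, f(h_v1, u1, W, a) > f(h_v1, u2, W, a),
          f(h_v2, u1, W, a) > f(h_v2, u2, W, a))
```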

Formal claim (Brody et al., 2022): GAT computes a strictly less expressive family of attention functions than GATv2. There exist graphs where GAT's attention collapses to uniform weighting while GATv2's attention correctly assigns non-uniform importance. Empirically, GATv2 matches or exceeds GAT on the standard benchmarks evaluated.

6.5 Attention Sparsity and Interpretability

A commonly cited advantage of GAT over GCN is interpretability: the attention coefficients $\alpha_{uv}$ can be read as edge importance scores. Edges with high $\alpha_{uv}$ are "important" for node $v$'s representation; edges with near-zero $\alpha_{uv}$ are "ignored."

Limitations of this interpretation. Jain & Wallace (2019) and Wiegreffe & Pinter (2019) showed for NLP models that attention weights do not reliably indicate feature importance. Similar caveats apply to GAT:

  1. Attention weights are post-softmax and sum to 1; a weight of $0.9$ out of 3 neighbors is very different from $0.9$ out of 300 neighbors.
  2. The softmax creates competition between neighbors - a "dominant" neighbor may capture high attention simply by having a large dot product with $\mathbf{a}$, not because it is semantically important.
  3. Different attention heads may attend to completely different structural patterns, and the meaning of each head is not determined a priori.

Despite these caveats, GAT attention patterns do in practice reveal meaningful structure for molecular and knowledge-graph tasks, especially when validated against domain knowledge.


7. Expressiveness and the Weisfeiler-Leman Test

7.1 The Graph Isomorphism Problem

Definition (Graph Isomorphism). Two graphs $G = (V_G, E_G)$ and $H = (V_H, E_H)$ are isomorphic (written $G \cong H$) if there exists a bijection $\phi: V_G \to V_H$ such that $(u, v) \in E_G \iff (\phi(u), \phi(v)) \in E_H$.

Two isomorphic graphs are structurally identical - they differ only in how the vertices are labeled. For any function of graph structure (predicting molecular properties, classifying graph type), the function must produce the same output on isomorphic graphs. This is exactly the permutation invariance requirement from 2.3.

The graph isomorphism (GI) problem - determining whether two given graphs are isomorphic - is one of the few natural problems not known to be in P or NP-complete. For practical purposes (GNN expressiveness), we ask a weaker question: can a GNN distinguish two non-isomorphic graphs by assigning them different representations?

Non-isomorphic graphs that look similar. Consider:

Graph 1: Two disjoint triangles (C_3 \cup C_3)
         o-o   o-o
          \ /   \ /
           o     o

Graph 2: One 6-cycle (C_6)
         o-o-o
         |   |
         o-o-o

Both graphs have 6 nodes, each of degree 2 - the same degree sequence. Are they isomorphic? No: $C_3 \cup C_3$ has two triangles; $C_6$ has no triangles. A GNN should assign them different representations. Can it?

7.2 1-WL and Color Refinement

The Weisfeiler-Leman (WL) graph isomorphism test (Weisfeiler & Leman, 1968) is an efficient heuristic for graph isomorphism. It is not complete - there exist non-isomorphic pairs it cannot distinguish - but it separates most non-isomorphic graph pairs that arise in practice.

1-WL Color Refinement Algorithm:

Initialize: assign all nodes the same color $c_v^{(0)} = 1$ (or, for node-attributed graphs, $c_v^{(0)} = \operatorname{hash}(\mathbf{x}_v)$).

Iterate for $t = 0, 1, 2, \ldots$ until convergence:

$$c_v^{(t+1)} = \operatorname{HASH}\!\left(c_v^{(t)},\, \{\{c_u^{(t)} : u \in \mathcal{N}(v)\}\}\right)$$

where $\{\{\cdot\}\}$ denotes a multiset (duplicate elements are preserved) and HASH is an injective hash function on (color, multiset of colors) pairs.

Decision: If at any iteration graphs $G$ and $H$ have different multisets of node colors, declare them non-isomorphic. If the algorithm stabilizes with the same color multisets, declare them (possibly) isomorphic.

Convergence: The algorithm stabilizes in at most $n$ iterations (when no color changes occur). This gives $O(n^2 \log n)$ time complexity.

Example. For $C_3 \cup C_3$ vs $C_6$ with no node attributes:

  • $t=0$: all nodes have color $c=1$ - same for both graphs
  • $t=1$: $c_v^{(1)} = \operatorname{HASH}(1, \{\{1, 1\}\})$ is the same for every node, since every node has degree 2 with identically colored neighbors - still the same!
  • The algorithm stabilizes: 1-WL cannot distinguish $C_3 \cup C_3$ from $C_6$ - from 1-WL's viewpoint, both are simply six degree-2 nodes with identically colored neighbors.

This is a fundamental limitation: 1-WL cannot detect triangles (3-cycles), so neither can any MPNN (Theorem below).
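The refinement loop is short enough to sketch directly. The adjacency-list encoding and the integer relabeling in place of an injective HASH are assumptions made for illustration.

```python
# Minimal 1-WL color refinement sketch (plain Python).
from collections import Counter

def wl_colors(adj):
    """adj: dict node -> list of neighbors. Returns the stable color histogram."""
    colors = {v: 0 for v in adj}                       # uniform initial colors
    for _ in range(len(adj)):                          # at most n iterations
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                      for v in adj}                    # (own color, neighbor multiset)
        relabel = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: relabel[signatures[v]] for v in adj}
        if new_colors == colors:                       # stable coloring reached
            break
        colors = new_colors
    return Counter(colors.values())

two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
# Identical color histograms: 1-WL cannot tell C3 ∪ C3 from C6.
print(wl_colors(two_triangles) == wl_colors(six_cycle))   # True
```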

7.3 The GNN Expressiveness Theorem

Theorem (Xu et al., 2019). Let $\mathcal{A}$ be any MPNN with a fixed number of layers $L$ and countably many colors (feature values). Then:

  1. If $\mathcal{A}$ assigns different representations to graphs $G \not\cong H$, then 1-WL also distinguishes $G$ and $H$ (in $\leq L$ iterations).
  2. For any pair $(G, H)$ that 1-WL distinguishes, there exists an MPNN $\mathcal{A}$ that also distinguishes them.

Corollary. The discriminative power of any MPNN is bounded above by 1-WL. Conversely, the most expressive MPNN achieves exactly the discriminative power of 1-WL.

Proof sketch (upper bound). At each MPNN layer, node $v$'s representation is a function of $(c_v^{[l]}, \{\{c_u^{[l]} : u \in \mathcal{N}(v)\}\})$. This is precisely what 1-WL computes. If the aggregation + update function is injective (different inputs -> different outputs), the MPNN exactly simulates 1-WL. If it is not injective (e.g., mean aggregation), it may collapse distinct multisets to the same representation, making it strictly weaker than 1-WL.

Implications:

  • MPNNs with mean or max aggregation are strictly weaker than 1-WL - they collapse some distinct multisets
  • MPNNs with sum aggregation and injective MLP update are exactly 1-WL - GIN achieves this
  • No MPNN (regardless of architecture) can distinguish graphs that 1-WL cannot distinguish (e.g., $C_3 \cup C_3$ vs $C_6$, or regular graphs of the same degree)
  • Expressiveness limitations are fundamental, not a matter of training or capacity

7.4 GIN: The Most Expressive 1-WL GNN

Graph Isomorphism Network (Xu et al., 2019). To achieve 1-WL expressiveness, the aggregation function must be injective on multisets. Xu et al. prove:

Lemma. Any injective function on multisets over a countable universe can be expressed as:

$$\mathbf{h} = \varphi\!\left(\sum_{u \in \mathcal{S}} f(\mathbf{h}_u)\right)$$

for some functions $\varphi, f$. In other words, sum aggregation with an MLP is a universal approximator for injective multiset functions.

GIN layer:

$$\mathbf{h}_v^{[l+1]} = \operatorname{MLP}^{[l]}\!\left((1 + \varepsilon^{[l]}) \cdot \mathbf{h}_v^{[l]} + \sum_{u \in \mathcal{N}(v)} \mathbf{h}_u^{[l]}\right)$$

where $\varepsilon^{[l]}$ is either learned or set to $0$. The $(1+\varepsilon)$ term lets the network distinguish the central node's own feature from the sum of its neighbors' features, so a node cannot be confused with a neighborhood whose features merely sum to the same value.
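A minimal sketch of one GIN layer, assuming a 2-layer ReLU MLP and a dense NumPy adjacency for brevity:

```python
# Minimal GIN layer sketch (NumPy); MLP width and depth are illustrative choices.
import numpy as np

def gin_layer(H, adj, W1, b1, W2, b2, eps=0.0):
    """H: (n, d) features; adj: (n, n) 0/1 adjacency without self-loops."""
    agg = (1.0 + eps) * H + adj @ H          # (1 + eps) * h_v + sum over neighbors
    hidden = np.maximum(agg @ W1 + b1, 0.0)  # MLP layer 1 (ReLU)
    return hidden @ W2 + b2                  # MLP layer 2

rng = np.random.default_rng(0)
n, d, dh = 5, 4, 8
H = rng.normal(size=(n, d))
adj = (rng.random((n, n)) < 0.4).astype(float)
adj = np.triu(adj, 1); adj = adj + adj.T     # symmetric, no self-loops
out = gin_layer(H, adj, rng.normal(size=(d, dh)), np.zeros(dh),
                rng.normal(size=(dh, d)), np.zeros(d))
```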

Why mean and max fail:

Aggregation | Counterexample multisets | Behavior
Mean | $\{1, 1\}$ vs $\{1, 1, 1\}$ | Both give mean = 1 - indistinguishable
Max | $\{1, 2\}$ vs $\{2\}$ | Both give max = 2 - indistinguishable
Sum | $\{1, 1\}$ vs $\{1, 1, 1\}$ | Sums 2 vs 3 - distinguishable

GIN for graph classification. Use sum readout at each layer and combine across layers:

$$\mathbf{h}_G = \operatorname{CONCAT}\!\left(\operatorname{READOUT}^{[l]}\!\left(\left\{\mathbf{h}_v^{[l]}\right\}\right) : l = 0, 1, \ldots, L\right)$$

This jumping-knowledge-style readout captures structural patterns at multiple scales.

7.5 Beyond 1-WL: Higher-Order GNNs

1-WL's limitations (e.g., failing to count triangles, failing on regular graphs) motivate higher-order extensions.

$k$-WL test. Instead of coloring individual nodes, color $k$-tuples of nodes $(v_1, \ldots, v_k)$. The refinement rule considers the colors of all $k$-tuples that differ in exactly one position. $k$-WL is strictly more powerful than $(k-1)$-WL for all $k \geq 2$.

$k$-GNN (Morris et al., 2019). Implement $k$-WL as a GNN by passing messages between $k$-tuples. The cost is $O(n^k)$ - prohibitive for $k \geq 3$ on large graphs.

NGNN and subgraph GNNs. A more practical approach: instead of node-level MPNNs, run MPNNs on subgraphs induced by $k$-hop neighborhoods, ego-graphs, or sampled subgraphs. These can detect cycles, cliques, and other motifs that 1-WL misses. Examples: NGNN (Zhang & Li, 2021), OSAN (Zhao et al., 2022).

Practical trade-off:

Method | Expressiveness | Complexity | Practical use
1-WL MPNN (GIN) | 1-WL | $O(m \cdot d)$ | Standard; most applications
Subgraph GNNs | > 1-WL | $O(n \cdot m \cdot d)$ | Medium graphs, chemistry
$k$-GNN ($k=3$) | 3-WL | $O(n^3 \cdot d)$ | Small graphs only
Random features | Universal | $O(m \cdot d)$ | Simple, effective in practice

7.6 Structural Features and Distance Encoding

A practical alternative to higher-order GNNs: augment node features with handcrafted structural descriptors that help the GNN detect patterns it otherwise cannot.

Random Walk Structural Encoding (RWSE). For each node $v$, compute the landing probability of a random walk of length $k$ starting and ending at $v$:

$$p_v^{(k)} = [A^k]_{vv} / d_v$$

The vector $[p_v^{(1)}, p_v^{(2)}, \ldots, p_v^{(K)}]$ encodes the local loop structure (triangles, 4-cycles, etc.). Used, e.g., in GPS (Rampasek et al., 2022).
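A sketch of one common way to compute such an encoding, using the row-stochastic random-walk matrix $P = D^{-1}A$; the exact normalization varies between papers, so treat this as illustrative rather than canonical.

```python
# Random-walk return probabilities per node (NumPy sketch).
import numpy as np

def rwse(adj, K=4):
    """adj: (n, n) 0/1 adjacency. Returns (n, K): k-step return probabilities, k = 1..K."""
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    P = adj / deg                            # row-stochastic random-walk matrix D^{-1} A
    feats, Pk = [], np.eye(adj.shape[0])
    for _ in range(K):
        Pk = Pk @ P
        feats.append(np.diag(Pk))            # probability of returning after k steps
    return np.stack(feats, axis=1)

# Triangle vs 3-node path: only the triangle's nodes have a nonzero 3-step return prob.
tri = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
print(rwse(tri, 3)[:, 2], rwse(path, 3)[:, 2])   # [0.25 0.25 0.25] vs [0. 0. 0.]
```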

Laplacian Positional Encoding (LapPE). Use the first $k$ eigenvectors of the normalized Laplacian as node features: $\mathbf{x}_v \leftarrow [\mathbf{x}_v \,\|\, \mathbf{u}_1(v), \ldots, \mathbf{u}_k(v)]$. (Reviewed in 11-04 9.5 - see there for the sign invariance issue and fix.)

Degree features. Simply appending the node's degree $d_v$ as a feature lets the GNN use degree information from the very first layer, before the 1-WL-style refinement would implicitly recover it from uniform initial colors. A more sophisticated variant, distance encoding (Li et al., 2020), appends the shortest-path distances from a node to a set of anchor nodes.


8. Over-Smoothing, Over-Squashing, and Depth

8.1 Over-Smoothing: Formal Analysis

A 2-layer GCN typically outperforms a 6-layer GCN on standard node classification benchmarks. This seems paradoxical: more layers should give access to more of the graph. The reason is over-smoothing: as depth increases, all node representations converge to the same vector, destroying the discriminative information needed for classification.

Formal statement (Li et al., 2018). Let $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ be the GCN propagation matrix (assume $\tilde{A} = A + I$ for simplicity). Then:

$$\hat{A}^k \to \boldsymbol{\pi} \mathbf{1}^\top \quad \text{as } k \to \infty$$

where $\boldsymbol{\pi} \in \mathbb{R}^n$ is the stationary distribution of the random walk on $\tilde{A}$: $\pi_v = \tilde{d}_v / (2m + n)$. That is, the $k$-step propagation from any starting point converges to the stationary distribution regardless of initial conditions.

Consequence. For a linear GCN (no nonlinearities), $H^{[L]} = \hat{A}^L X W^{(0)} W^{(1)} \cdots W^{(L-1)}$. As $L \to \infty$:

$$H^{[L]}_{v,:} \to \pi_v \cdot \mathbf{1}^\top X W \propto \text{const} \times (\text{column average of } XW)$$

All nodes collapse to a constant multiple of the graph-wide feature average. For connected graphs, $H^{[L]}$ converges to a rank-1 matrix - all node representations are proportional to $\boldsymbol{\pi}$.

With nonlinearities (ReLU), convergence is not exact, but the trend holds: after 6-8 layers, node representations on typical graphs (small-world, power-law degree distribution) are nearly identical.

MADGap metric (Chen et al., 2020). Mean Average Distance (MAD) measures pairwise distances between node representations. MADGap = MAD(between-class) - MAD(within-class). For a good classifier, MADGap should be large (between-class distances large, within-class distances small). Over-smoothing drives MAD -> 0, collapsing MADGap.

8.2 Information-Theoretic View via Dirichlet Energy

The Dirichlet energy of the node representation matrix $H \in \mathbb{R}^{n \times d}$ measures total variation across edges:

$$E(H) = \sum_{(i,j) \in E} \lVert\mathbf{h}_i - \mathbf{h}_j\rVert_2^2 = \operatorname{tr}(H^\top L H)$$

where $L = D - A$ is the unnormalized Laplacian.

Properties:

  • $E(H) = 0 \iff \mathbf{h}_i = \mathbf{h}_j$ for all connected $i, j$ (on a connected graph: all nodes identical)
  • $E(H)$ is large when adjacent nodes have different representations - high "frequency content"
  • The GCN smoothing step reduces $E(H)$: $E(SH) \leq E(H)$ where $S = \hat{A}$ (since $\hat{A}$ is a low-pass graph filter)

Over-smoothing = energy dissipation. Each GCN layer reduces $E(H)$. After $L$ layers:

$$E(H^{[L]}) \leq \lambda_{\max}(S)^{2L} \cdot E(H^{[0]})$$

Since $\lambda_{\max}(S) \leq 1$ for the symmetric normalized $\hat{A}$ with self-loops, the energy decays geometrically. The rate depends on $\lambda_2(L)$ (the Fiedler value): graphs with a large spectral gap (expanders) over-smooth faster than graphs with a small spectral gap (clusters).

Practical implication. For clustered graphs (molecular graphs, citation networks with strong community structure), over-smoothing is slow - one can use deeper GNNs. For expander-like graphs (dense social networks, random graphs), over-smoothing is fast - 2-4 layers is optimal.
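The energy dissipation is easy to observe numerically. The sketch below repeatedly applies the (linear, weight-free) smoothing operator $\hat{A}$ to random features on a random graph and prints the Dirichlet energy after each step; the graph size and density are arbitrary choices for illustration.

```python
# Watch the Dirichlet energy collapse under repeated GCN-style smoothing (NumPy).
import numpy as np

def dirichlet_energy(H, adj):
    L = np.diag(adj.sum(axis=1)) - adj               # unnormalized Laplacian D - A
    return np.trace(H.T @ L @ H)

rng = np.random.default_rng(0)
n = 20
adj = (rng.random((n, n)) < 0.2).astype(float)
adj = np.triu(adj, 1); adj = adj + adj.T
A_t = adj + np.eye(n)                                # add self-loops
d_t = A_t.sum(axis=1)
S = A_t / np.sqrt(np.outer(d_t, d_t))                # \hat{A} = D~^{-1/2} A~ D~^{-1/2}
H = rng.normal(size=(n, 4))
for layer in range(10):
    print(layer, round(float(dirichlet_energy(H, adj)), 4))
    H = S @ H                                        # energy falls rapidly as neighbors align
```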

8.3 Over-Squashing: Bottleneck Nodes

Over-squashing is a distinct (and more subtle) problem: information from distant nodes is exponentially compressed as it flows through bottleneck edges, preventing long-range interactions from influencing predictions.

Jacobian analysis (Alon & Yahav, 2021). Consider the Jacobian of node $v$'s representation at layer $k$ with respect to node $u$'s initial feature:

$$\frac{\partial \mathbf{h}_v^{[k]}}{\partial \mathbf{x}_u} = \prod_{l=0}^{k-1} (W^{[l]})^\top \cdot \frac{\partial \mathbf{h}_v^{[k]}}{\partial \mathbf{h}_v^{[k-1]}} \cdots \frac{\partial \mathbf{h}_{u'}^{[1]}}{\partial \mathbf{x}_u}$$

where the product runs along any path from $u$ to $v$ of length $k$. The norm of this Jacobian is bounded by:

$$\left\lVert\frac{\partial \mathbf{h}_v^{[k]}}{\partial \mathbf{x}_u}\right\rVert \leq C \cdot \left(\hat{A}^k\right)_{vu}$$

where $(\hat{A}^k)_{vu}$ is the $(v,u)$ entry of the $k$-th power of the propagation matrix. For nodes far apart in the graph, $(\hat{A}^k)_{vu}$ is exponentially small in the distance $d(u,v)$.

The bottleneck. If $u$ and $v$ are separated by a single edge $(s, t)$ of high betweenness centrality (a "bridge"), all information from the $u$-side to the $v$-side must pass through $(s,t)$. The aggregation at $s$ receives messages from all $|\mathcal{N}(s)|$ neighbors and compresses them into a single $d$-dimensional vector - losing information at a rate proportional to $|\mathcal{N}(s)| / d$.

Over-squashing vs over-smoothing. These are dual problems:

  • Over-smoothing: too much aggregation -> representations become too similar
  • Over-squashing: aggregation is a bottleneck -> distant information cannot influence predictions
  • Over-smoothing is a property of the graph and its spectrum; over-squashing is a property of specific bottleneck edges

8.4 Remedies for Over-Smoothing

DropEdge (Rong et al., 2020). Randomly remove edges during training: replace $A$ with a sparsified $\tilde{A}_{\text{drop}}$ that drops each edge independently with probability $p$. This is analogous to dropout for neurons but applied to edges. It slows the convergence of over-smoothing (less propagation per layer) and acts as a data augmentation method. Improves performance at depths 4-8 on standard benchmarks.

PairNorm (Zhao & Akoglu, 2020). After each GCN layer, re-normalize the node representations to ensure a fixed total pairwise distance:

$$\mathbf{h}_v \leftarrow \mathbf{h}_v - \bar{\mathbf{h}}, \qquad \bar{\mathbf{h}} = \frac{1}{n}\sum_v \mathbf{h}_v$$

$$\mathbf{h}_v \leftarrow s \cdot \frac{\mathbf{h}_v}{\left(\frac{1}{n}\sum_v \lVert\mathbf{h}_v\rVert^2\right)^{1/2}}$$

where $s$ is a scaling hyperparameter. PairNorm prevents the Dirichlet energy from decaying to zero by keeping pairwise distances bounded below. Empirically effective for deep GCNs (8-16 layers).
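A minimal sketch of the two PairNorm steps (centering, then rescaling so that the mean squared row norm equals $s^2$); applying it to the whole feature matrix at once is assumed for simplicity.

```python
# PairNorm sketch (NumPy): center node features, then restore a fixed overall scale.
import numpy as np

def pair_norm(H, s=1.0):
    H = H - H.mean(axis=0, keepdims=True)                  # subtract the mean node vector
    scale = np.sqrt((H ** 2).sum(axis=1).mean())           # root mean squared row norm
    return s * H / max(scale, 1e-12)

H = np.random.default_rng(0).normal(size=(6, 3)) * 0.01    # nearly collapsed features
print(round(float((pair_norm(H) ** 2).sum(axis=1).mean()), 4))  # mean squared norm back to ~1.0
```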

Initial residual connection (GCNII, Chen et al., 2020).

$$\mathbf{h}_v^{[l+1]} = \sigma\!\left(\left[(1-\alpha)\hat{A}\mathbf{h}_v^{[l]} + \alpha \mathbf{h}_v^{[0]}\right] \cdot \left[(1-\beta) I + \beta W^{[l]}\right]\right)$$

Two additions: (1) the $\alpha \mathbf{h}_v^{[0]}$ term - always mixing in the initial feature with weight $\alpha$ prevents full convergence; (2) the weight matrix mixes between the identity ($\beta=0$, pure smoothing) and a full linear transform ($\beta=1$). With $\alpha=0.1$ and $\beta_l = \log(\lambda/l + 1)$ (decreasing with depth), GCNII successfully trains 64-layer GCNs, achieving state-of-the-art on Cora at the time (85.5% accuracy vs 81.5% for a 2-layer GCN).

GroupNorm / NodeNorm. Normalize within groups of nodes to prevent scale collapse. Less commonly used than the above.

8.5 Graph Rewiring

A more aggressive approach: change the graph structure to improve information flow before running the GNN.

Diffusion-based rewiring (DIGL, Gasteiger et al., 2019). Replace $A$ with the (truncated) heat kernel $\Theta = \exp(-t L)$ or the personalized PageRank matrix $\Theta = \alpha(I - (1-\alpha)\hat{A})^{-1}$. The diffusion matrix connects distant nodes with non-zero weights, enabling long-range information flow in a single GNN layer. Edges with small weights are pruned, keeping the graph sparse.

Expander Graph Propagation (EGP, Deac et al., 2022). Augment the graph with the edges of a sparse expander graph (a $d$-regular graph with large spectral gap, e.g., a Ramanujan graph). Expander edges provide "shortcuts" that reduce the effective diameter of the graph, alleviating over-squashing without the $O(n^2)$ cost of full connectivity.

FoSR: First-Order Spectral Rewiring (Karhadkar et al., 2023). Add edges that most increase $\lambda_2(L)$ (the Fiedler value), directly attacking the spectral bottleneck. Greedy algorithm: at each step, add the edge $(u,v) \notin E$ that maximally increases $\lambda_2$ of the augmented graph. With $k$ added edges, the graph diameter decreases and over-squashing is reduced.

8.6 Depth vs Width Trade-off in GNNs

Why shallow GNNs dominate in practice. On most benchmark tasks (node classification on citation networks, graph classification on molecular benchmarks), 2-4 GNN layers achieve the best performance. The reasons:

  1. Homophily dominates: in social and citation networks, the most informative neighbors are at distance 1-2. Beyond distance 3, nodes are often from different classes, and aggregating them hurts classification.
  2. Over-smoothing: as analyzed above, deep GCNs lose discriminative information.
  3. Exponential neighborhood growth: a node's $k$-hop neighborhood in a power-law graph has $\sim \bar{d}^k$ nodes. For $\bar{d}=10$ and $k=5$, that's 100,000 nodes - mostly irrelevant.

When depth helps. Long-range tasks where predictions depend on distant nodes:

  • Predicting molecular properties that depend on the entire molecular graph (QM9 quantum chemistry dataset)
  • Reasoning over knowledge graphs with long inference chains
  • Planning and path-finding tasks on sparse graphs

For these tasks, graph Transformers (10) - which compute full pairwise attention in $O(n^2)$ - often outperform deep GNNs.

Jumping Knowledge Networks (Xu et al., 2018). A principled approach: each node selects the representation from the most appropriate depth:

$$\mathbf{h}_v^{\text{final}} = \operatorname{COMBINE}\!\left(\mathbf{h}_v^{[1]}, \mathbf{h}_v^{[2]}, \ldots, \mathbf{h}_v^{[L]}\right)$$

COMBINE can be concatenation, LSTM-attention, or max-pooling across layers. This allows different nodes to "listen" at different distances - nodes in dense clusters use shallow representations; nodes that are bridges benefit from deeper representations.
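A sketch of the max-pooling variant of COMBINE, assuming the per-layer outputs are already computed; the concatenation variant would simply replace the max with np.concatenate along the feature axis.

```python
# Jumping-knowledge combine sketch (NumPy): element-wise max over layer outputs.
import numpy as np

def jk_max(layer_outputs):
    """layer_outputs: list of (n, d) arrays, one per GNN layer."""
    return np.max(np.stack(layer_outputs, axis=0), axis=0)   # per-node, per-feature max

rng = np.random.default_rng(0)
outs = [rng.normal(size=(5, 8)) for _ in range(3)]            # stand-ins for h^[1..3]
h_final = jk_max(outs)                                        # shape (5, 8)
```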


9. Graph Pooling and Hierarchical GNNs

9.1 Why Pooling Is Non-Trivial on Graphs

In image CNNs, pooling is straightforward: reduce a $2 \times 2$ spatial block to a single pixel by averaging. The operation is well-defined because images have a fixed grid structure.

On graphs, pooling (coarsening) must answer: which nodes get merged? How are their features combined? How is the edge structure of the coarsened graph defined? All of these must be permutation invariant - the coarsened graph cannot depend on how vertices are labeled.

The challenges:

  1. No canonical merging: unlike pixels in a grid, there is no obvious spatial proximity to guide merging. Spectral methods (9.4) use the Fiedler vector; learned methods (9.3) learn soft assignments.
  2. Edge reconstruction: after merging nodes $\{v_1, v_2\}$ into a super-node $s$, which edges does $s$ inherit? Typically all edges incident to $v_1$ or $v_2$ - but this may create multi-edges and self-loops.
  3. Information loss: pooling irreversibly reduces the graph. Unlike deconvolution in image models, there is no standard graph "unpooling" that perfectly recovers the original structure.

9.2 Global Pooling Methods

For graph-level tasks, a single pooling step at the end suffices. The readout function $R(\{\mathbf{h}_v^{[L]}\})$ maps the set of node representations to a single graph embedding.

Sum pooling: $\mathbf{h}_G = \sum_{v \in V} \mathbf{h}_v^{[L]}$. Sensitive to graph size (larger graphs get larger embeddings). Most expressive for graph-level tasks (by the same argument as sum aggregation for node-level).

Mean pooling: $\mathbf{h}_G = \frac{1}{n}\sum_{v} \mathbf{h}_v^{[L]}$. Normalizes for graph size; best when comparing graphs of different sizes.

Max pooling: $(\mathbf{h}_G)_k = \max_v (\mathbf{h}_v^{[L]})_k$. Detects whether any node has feature $k$ above a threshold; good for detecting rare structural motifs.

Attention pooling (gated global pooling):

$$\mathbf{h}_G = \sum_{v \in V} \operatorname{softmax}\!\left(f_\text{gate}(\mathbf{h}_v^{[L]})\right)_v \cdot f_\text{feat}(\mathbf{h}_v^{[L]})$$

where $f_\text{gate}: \mathbb{R}^d \to \mathbb{R}$ scores each node, and $f_\text{feat}: \mathbb{R}^d \to \mathbb{R}^{d'}$ transforms the features. This allows the model to focus on the most informative nodes. Used in Graph U-Net, Graphormer.
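The readouts above fit in a few lines. In this sketch the gating and feature maps are stand-in linear layers, which is an assumption for illustration.

```python
# Global readout sketches (NumPy): sum, mean, max, and gated attention pooling.
import numpy as np

def readouts(H, w_gate, W_feat):
    """H: (n, d) final node embeddings; w_gate: (d,); W_feat: (d, d_out)."""
    pooled = {"sum": H.sum(axis=0), "mean": H.mean(axis=0), "max": H.max(axis=0)}
    scores = H @ w_gate                                       # one gate score per node
    alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()   # softmax over nodes
    pooled["attn"] = (alpha[:, None] * (H @ W_feat)).sum(axis=0)  # gated weighted sum
    return pooled

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 4))
out = readouts(H, rng.normal(size=4), rng.normal(size=(4, 5)))
```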

Set2Set (Vinyals et al., 2016). An LSTM-based readout that runs $T$ steps of attention over the node set, accumulating a context vector:

$$\mathbf{q}_t = \operatorname{LSTM}(\mathbf{q}_{t-1}), \qquad e_{tv} = \mathbf{q}_t^\top \mathbf{h}_v, \qquad \alpha_{tv} = \operatorname{softmax}_v(e_{tv})$$

$$\mathbf{c}_t = \sum_v \alpha_{tv} \mathbf{h}_v, \qquad \mathbf{h}_G = \mathbf{c}_T$$

More expressive than simple pooling but with $O(nT)$ sequential steps.

9.3 DiffPool: Differentiable Graph Pooling

DiffPool (Ying et al., 2018) learns soft cluster assignments to hierarchically coarsen the graph. At each pooling level, run two GNNs in parallel:

$$S^{(l)} = \operatorname{softmax}\!\left(\operatorname{GNN}_{\text{pool}}^{(l)}\!\left(A^{(l)}, H^{(l)}\right)\right) \in \mathbb{R}^{n_l \times n_{l+1}}$$

$$Z^{(l)} = \operatorname{GNN}_{\text{embed}}^{(l)}\!\left(A^{(l)}, H^{(l)}\right) \in \mathbb{R}^{n_l \times d}$$

where $n_l$ is the number of nodes at level $l$, $n_{l+1} < n_l$ is the target number of clusters, $S^{(l)}$ is the soft assignment matrix (each node assigns fractionally to each of the $n_{l+1}$ clusters), and $Z^{(l)}$ is the GNN embedding of the current-level nodes.

Coarsened graph:

$$H^{(l+1)} = S^{(l)\top} Z^{(l)} \in \mathbb{R}^{n_{l+1} \times d}$$

$$A^{(l+1)} = S^{(l)\top} A^{(l)} S^{(l)} \in \mathbb{R}^{n_{l+1} \times n_{l+1}}$$

The coarsened adjacency $A^{(l+1)}$ is dense - every pair of clusters has a (soft) connection weighted by the total edge weight between them.
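A sketch of the pooling algebra with the two GNNs replaced by random stand-ins, so that only the assignment, coarsening, and link-prediction terms are shown.

```python
# DiffPool coarsening sketch (NumPy): soft assignment S pools features and adjacency.
import numpy as np

def softmax_rows(X):
    e = np.exp(X - X.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, n_next = 8, 4, 3
A = (rng.random((n, n)) < 0.3).astype(float); A = np.triu(A, 1); A = A + A.T
S = softmax_rows(rng.normal(size=(n, n_next)))     # stand-in for GNN_pool output
Z = rng.normal(size=(n, d))                        # stand-in for GNN_embed output
H_next = S.T @ Z                                   # (n_next, d) coarse node features
A_next = S.T @ A @ S                               # (n_next, n_next), dense in general
link_loss = np.linalg.norm(A - S @ S.T)            # auxiliary link-prediction term (Frobenius)
```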

Auxiliary losses. DiffPool adds two regularization terms to the main task loss:

  • Link prediction loss: $\mathcal{L}_{\text{LP}} = \lVert A^{(l)} - S^{(l)} S^{(l)\top} \rVert_F$ - encourages clusters to correspond to connected subgraphs
  • Entropy loss: $\mathcal{L}_{\text{E}} = \frac{1}{n}\sum_v H(S^{(l)}_{v,:})$ - encourages crisp (non-uniform) cluster assignments

Limitation. DiffPool produces a dense $A^{(l+1)}$ at each level - $O(n_{l+1}^2)$ memory. For large graphs ($n > 10^4$), this is prohibitive.

9.4 MinCutPool and Spectral Pooling

MinCutPool (Bianchi et al., 2020) addresses DiffPool's density problem by formulating pooling as a spectral clustering problem with a differentiable objective.

The mincut objective. For a soft cluster assignment $S \in \mathbb{R}^{n \times k}$, the normalized minimum cut is:

$$\text{minCUT}(S) = \frac{\operatorname{tr}(S^\top L S)}{\operatorname{tr}(S^\top D S)}$$

Minimizing this over soft assignments is the relaxation solved by spectral clustering - the continuous optimum is spanned by the $k$ smallest eigenvectors of $L_{\text{rw}} = D^{-1}L$. MinCutPool uses a GNN to produce $S$ and trains end-to-end with the minCUT loss plus an orthogonality regularizer $\lVert S^\top S / \lVert S^\top S \rVert_F - I / \sqrt{k} \rVert_F$ to prevent degenerate cluster assignments.

Advantage over DiffPool: the objective is theoretically motivated by spectral graph theory, and the orthogonality regularization keeps the assignment matrix well-conditioned.

9.5 SAGPool: Self-Attention Graph Pooling

SAGPool (Lee et al., 2019) takes a different approach: instead of soft cluster assignments, select the top-$k$ nodes by a learned importance score and discard the rest.

Algorithm:

  1. Run one GNN layer to compute node scores: $\mathbf{z} = \operatorname{GNN}(A, H) \in \mathbb{R}^n$
  2. Select the top-$k$ nodes: $\text{idx} = \operatorname{topk}(\mathbf{z}, k)$
  3. Gate the selected features: $H' = H_{\text{idx}} \odot \operatorname{sigmoid}(\mathbf{z}_{\text{idx}})$
  4. Induce the subgraph: $A' = A_{\text{idx,idx}}$ (restrict $A$ to the selected nodes)

Advantages: simple, interpretable (which nodes are selected?), and the adjacency stays sparse. Disadvantage: hard top-$k$ selection is not differentiable (a straight-through estimator or continuous relaxation is used during training).
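A sketch of the four steps, with the scoring GNN replaced by a single propagation step and a dense adjacency; both are simplifying assumptions.

```python
# SAGPool sketch (NumPy): score nodes, keep the top-k, gate them, induce the subgraph.
import numpy as np

def sag_pool(H, A, w, k):
    z = (A + np.eye(len(A))) @ H @ w               # stand-in scoring "GNN layer" -> (n,)
    idx = np.argsort(-z)[:k]                       # indices of the top-k scores
    gate = 1.0 / (1.0 + np.exp(-z[idx]))           # sigmoid gating of kept nodes
    H_new = H[idx] * gate[:, None]
    A_new = A[np.ix_(idx, idx)]                    # induced subgraph on kept nodes
    return H_new, A_new, idx

rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float); A = np.triu(A, 1); A = A + A.T
H = rng.normal(size=(6, 3))
H2, A2, kept = sag_pool(H, A, rng.normal(size=3), k=3)
```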


10. Graph Transformers

10.1 Motivation: GNNs vs Transformers

The fundamental constraint of MPNNs is locality: a $k$-layer GNN can only aggregate information within a $k$-hop neighborhood. For long-range dependencies (e.g., two atoms on opposite ends of a large molecule that interact through space), a deep GNN would need $O(\text{diameter})$ layers - running into over-smoothing and over-squashing.

The transformer architecture - with full $O(n^2)$ pairwise attention - solves the locality problem: every node can directly attend to every other node in a single layer. But transformers treat inputs as sets of independent tokens; they have no built-in notion of graph structure.

Graph Transformers combine: (1) the global receptive field of transformers (full pairwise attention), with (2) graph-structural inductive biases (local message passing, positional encodings derived from the graph topology).

The trade-off: full attention costs $O(n^2 d)$ per layer - practical only for small-to-medium graphs ($n \leq 10^4$). For large graphs ($n > 10^6$), neighbor-sampled MPNNs remain the only scalable option.

10.2 Positional Encodings for Graphs

In transformers, positional encodings break the permutation symmetry of the attention mechanism: without them, the transformer cannot distinguish token position, and all orderings of the same tokens produce the same output. The same problem occurs in graph transformers: the attention mechanism is permutation invariant, so we need positional encodings that break symmetry in a structure-aware way.

Recall from 11-04 9.5: Laplacian eigenvectors provide a natural graph Fourier basis, and the first $k$ eigenvectors of $L_{\text{sym}}$ give a $k$-dimensional coordinate for each node that reflects the graph's spectral structure. The sign invariance problem (eigenvectors are defined up to sign) is addressed by random sign flipping during training or by using absolute values. Full treatment: 11-04 9.5.

Building on this, three main PE strategies for graph transformers:

Laplacian PE (LapPE). For each node $v$, extract the $v$-th row of the matrix $[\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_k]$, where $\mathbf{u}_i$ are the first $k$ eigenvectors of $L_{\text{sym}}$ (excluding the constant eigenvector). Append to the node features:

$$\mathbf{x}_v \leftarrow \left[\mathbf{x}_v \,\|\, \mathbf{u}_1(v), \ldots, \mathbf{u}_k(v)\right]$$

Sign ambiguity fix: during training, randomly flip the sign of each eigenvector independently (this has no effect on the Laplacian spectrum but makes the model invariant to sign choice). Used in Dwivedi et al. (2020), GPS (2022).
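A sketch of LapPE with training-time sign flips, assuming a dense adjacency and using numpy.linalg.eigh on $L_{\text{sym}}$; production code would typically use sparse eigensolvers.

```python
# Laplacian PE sketch (NumPy): first k non-trivial eigenvectors of L_sym, random sign flips.
import numpy as np

def lap_pe(adj, k, rng, train=True):
    deg = np.maximum(adj.sum(axis=1), 1.0)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    pe = eigvecs[:, 1:k + 1]                       # drop the (near-)constant eigenvector
    if train:
        pe = pe * rng.choice([-1.0, 1.0], size=pe.shape[1])   # per-eigenvector sign flip
    return pe

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
X = rng.normal(size=(4, 5))
X_aug = np.concatenate([X, lap_pe(adj, k=2, rng=rng)], axis=1)   # append PE to features
```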

Random Walk SE (RWSE). For each node $v$ and walk length $p$, compute $(A^p D^{-p})_{vv}$ - the probability of a length-$p$ random walk returning to $v$. Stack for $p = 1, \ldots, P$:

$$\text{RWSE}_v = \left[(AD^{-1})_{vv},\, (A^2 D^{-2})_{vv},\, \ldots,\, (A^P D^{-P})_{vv}\right]$$

RWSE is always non-negative and sign-free (no sign ambiguity). It encodes local loop structure: the length-1 term $(AD^{-1})_{vv}$ is zero unless $v$ has a self-loop, and the length-3 term is positive exactly when $v$ lies on a triangle. Used in GPS (2022) and GIN+RWSE.

Degree and centrality encoding. Simply append $d_v$ (node degree), normalized closeness centrality, or betweenness centrality as scalar node features. Used in Graphormer (2021), which also encodes shortest-path distances as attention biases (10.4).

10.3 GPS Framework

General, Powerful, Scalable (GPS) Graph Transformer (Rampasek et al., 2022) is a general recipe for building graph transformers; it reported strong results across the tasks of the 2022 Long Range Graph Benchmark (LRGB).

GPS Layer:

$$H^{[l+1]} = \operatorname{LayerNorm}\!\left(H^{[l]} + \operatorname{MPNN}^{[l]}\!\left(H^{[l]}, A\right) + \operatorname{Transformer}^{[l]}\!\left(H^{[l]}\right)\right)$$

Each GPS layer combines:

  1. Local MPNN: any standard GNN (GCN, GINE, GAT) operating on the sparse graph edges - captures local structure efficiently
  2. Global Transformer: full multi-head self-attention over all nodes - captures long-range dependencies
  3. LayerNorm + residual: stabilizes training

Modularity. GPS is a framework, not a fixed architecture: the MPNN and Transformer components are interchangeable. In practice, GINE (GIN with edge features) + Transformer works well; one can also use GAT + Performer (linear attention approximation) for scalability.

Positional encodings. GPS accepts arbitrary node-level PEs (LapPE, RWSE, degree) as extra input features, processed by a dedicated PE encoder:

$$\mathbf{x}_v^{\text{in}} = W_{\text{feat}}\, \mathbf{x}_v + W_{\text{PE}}\, \text{PE}_v$$
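A minimal sketch of one GPS-style layer with a mean-aggregation MPNN, single-head global attention, and a single shared LayerNorm; all three simplifications are assumptions relative to the full framework.

```python
# GPS-style layer sketch (NumPy): local MPNN term + global attention term + residual + norm.
import numpy as np

def layer_norm(H, eps=1e-5):
    mu = H.mean(axis=1, keepdims=True)
    var = H.var(axis=1, keepdims=True)
    return (H - mu) / np.sqrt(var + eps)

def gps_layer(H, adj, W_local, Wq, Wk, Wv):
    # Local MPNN: mean over neighbors (including self), then a linear map.
    A_loop = adj + np.eye(len(adj))
    local = (A_loop / A_loop.sum(axis=1, keepdims=True)) @ H @ W_local
    # Global attention: every node attends to every node, ignoring the edges.
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    global_term = attn @ V
    return layer_norm(H + local + global_term)       # residual + LayerNorm

rng = np.random.default_rng(0)
n, d = 6, 8
adj = (rng.random((n, n)) < 0.3).astype(float); adj = np.triu(adj, 1); adj = adj + adj.T
H = rng.normal(size=(n, d))
H_next = gps_layer(H, adj, *(rng.normal(size=(d, d)) for _ in range(4)))
```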

Performance. On the LRGB Peptides-func benchmark (a task whose predictions require integrating global molecular structure), GPS scores significantly higher than the pure-MPNN baselines, demonstrating the value of global attention for long-range tasks.

10.4 Graphormer

Graphormer (Ying et al., 2021) uses a standard transformer architecture (with full pairwise attention) augmented by three graph-structural encodings. It won the OGB-LSC 2021 quantum chemistry track.

Centrality encoding. Add a degree-dependent bias to the node features:

$$\mathbf{x}_v^{\text{in}} = \mathbf{x}_v + \mathbf{z}_{d_v^-} + \mathbf{z}_{d_v^+}$$

where $\mathbf{z}_{d^-}$ and $\mathbf{z}_{d^+}$ are learned embeddings for in- and out-degree. This allows the transformer to distinguish hub nodes from leaf nodes.

Spatial encoding. Add a bias to the attention score between nodes $u$ and $v$ based on their shortest-path distance $\phi(u,v)$:

$$e_{uv} = \frac{(\mathbf{h}_u W_Q)(\mathbf{h}_v W_K)^\top}{\sqrt{d_k}} + b_{\phi(u,v)}$$

where $b_\phi$ is a scalar learned per distance $\phi$. Nodes far apart get one learned attention bias; nearby nodes get a different one. This encodes graph topology directly into the attention pattern.

Edge encoding. For each pair $(u,v)$, average the edge features along the shortest path:

$$c_{uv} = \frac{1}{\phi(u,v)} \sum_{e \in \text{sp}(u,v)} \mathbf{w}_e^\top \mathbf{a}$$

where $\mathbf{w}_e \in \mathbb{R}^{d_e}$ is the feature of path edge $e$ and $\mathbf{a} \in \mathbb{R}^{d_e}$ is a learned vector. This term is added to the attention score as an additional bias.
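A sketch of how the spatial bias enters the attention logits, using Floyd-Warshall for the shortest-path distances and a random stand-in for the learned per-distance bias table.

```python
# Graphormer-style spatial bias sketch (NumPy): SPD-indexed biases added to attention logits.
import numpy as np

def shortest_path_lengths(adj):
    n = len(adj)
    dist = np.where(adj > 0, 1.0, np.inf)
    np.fill_diagonal(dist, 0.0)
    for k in range(n):                               # Floyd-Warshall
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)  # path graph
H, d_k = rng.normal(size=(4, 6)), 6
Wq, Wk = rng.normal(size=(6, d_k)), rng.normal(size=(6, d_k))
phi = shortest_path_lengths(adj).astype(int)         # distances 0..3 on this connected graph
b = rng.normal(size=phi.max() + 1)                   # stand-in for the learned bias per distance
scores = (H @ Wq) @ (H @ Wk).T / np.sqrt(d_k) + b[phi]   # biased attention logits
```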

10.5 Graph Mamba and Sequence-Based Methods

The quadratic $O(n^2)$ cost of full attention makes graph transformers impractical for large graphs. Recent work (2023-2024) explores linear-complexity alternatives.

Converting graphs to sequences. Several methods serialize the graph into a sequence of tokens, then apply a sequence model (LSTM, Mamba SSM):

  • BFS/DFS ordering: nodes in BFS/DFS traversal order; nearby nodes in the graph -> nearby in sequence. Breaks permutation invariance (different traversal orderings give different results).
  • Node-and-edge tokenization (TokenGT, Kim et al., 2022): represent each node and each edge as a separate token; full transformer attention over all $n + m$ tokens. Permutation invariant if node/edge PEs are orthogonalized.

Graph Mamba (Chen et al., 2023). Apply Mamba (a state-space model with selective scan) to graphs by: (1) ordering nodes by degree or BFS, (2) running the SSM along this ordering as a "linearized graph sequence." Achieves $O(n \log n)$ complexity while matching transformer performance on some benchmarks. The key challenge: the SSM must be made robust to different node orderings (since graphs have no canonical ordering).

Status (2026). Graph Mamba and similar linear-attention methods are an active research area. For small-to-medium graphs ($n \leq 10^4$), GPS-style full attention is standard. For large graphs, MPNN-based methods (GraphSAGE, Cluster-GCN) remain dominant in production.

