Graph Neural Networks, Part 6: Graph Attention Networks (GAT) to 10. Graph Transformers
6. Graph Attention Networks (GAT)
6.1 The Attention Mechanism on Graphs
GCN's aggregation weights are fixed functions of node degrees - they depend only on the graph structure, not on node features. Two neighbors with identical degree contribute identically regardless of their relevance to the central node's representation.
Graph Attention Networks (Velickovic et al., 2018) replace fixed weights with learned, data-dependent attention coefficients. For each edge $(i, j)$, the attention coefficient $\alpha_{ij}$ is computed as a function of both nodes' features, allowing the network to focus on the most relevant neighbors.
This mirrors the transformer's self-attention, but with attention constrained to the graph's edge structure: node $i$ only attends to its actual neighbors $\mathcal{N}(i)$, not to all $n$ nodes (which would cost $O(n^2)$). The graph acts as a structural prior on the attention pattern.
6.2 GAT Layer: Attention Coefficients
Step 1: Linear transformation. Apply a shared weight matrix $\mathbf{W} \in \mathbb{R}^{d' \times d}$ to transform all node features: $\mathbf{z}_i = \mathbf{W}\mathbf{h}_i$.
Step 2: Attention scores. For each edge $(i, j)$ (including self-loops), compute:

$$e_{ij} = \mathrm{LeakyReLU}\!\left(\mathbf{a}^\top \left[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_j\right]\right)$$

where $\mathbf{a} \in \mathbb{R}^{2d'}$ is a learnable attention vector, and $\Vert$ denotes concatenation. The raw attention score $e_{ij}$ measures the "relevance" of node $j$ to node $i$. LeakyReLU (with negative slope 0.2) allows gradient flow for negative scores.
Step 3: Normalize. Apply softmax over the neighborhood to get normalized coefficients:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i) \cup \{i\}} \exp(e_{ik})}$$

Note: $\alpha_{ij} \ge 0$ and $\sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha_{ij} = 1$.
Step 4: Aggregation. Compute the updated representation:

$$\mathbf{h}_i' = \sigma\!\left(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha_{ij}\, \mathbf{W}\mathbf{h}_j\right)$$

MPNN view: the message is $m_{j \to i} = \alpha_{ij}\,\mathbf{W}\mathbf{h}_j$ and the aggregation is sum.
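A minimal dense-adjacency sketch of Steps 1-4 (single attention head; `adj` is assumed to be a small 0/1 adjacency matrix that already contains self-loops, and `W`, `a` are the learnable parameters named above):

```python
import torch
import torch.nn.functional as F

def gat_layer(H, adj, W, a, negative_slope=0.2):
    """Single-head GAT layer on a dense adjacency (self-loops included in adj).

    H: (n, d) node features, W: (d, d_out) shared transform, a: (2*d_out,) attention vector.
    """
    Z = H @ W                                                        # Step 1: z_i = W h_i
    d_out = Z.shape[1]
    src, dst = Z @ a[:d_out], Z @ a[d_out:]                          # a^T [z_i || z_j] = a_1^T z_i + a_2^T z_j
    e = F.leaky_relu(src[:, None] + dst[None, :], negative_slope)    # Step 2: raw scores e_ij
    e = e.masked_fill(adj == 0, float("-inf"))                       # only real edges (and self-loops) compete
    alpha = torch.softmax(e, dim=1)                                  # Step 3: softmax over each neighborhood
    return F.elu(alpha @ Z)                                          # Step 4: weighted sum, then nonlinearity
```

Masking non-edges with $-\infty$ before the softmax is what restricts attention to $\mathcal{N}(i) \cup \{i\}$; a sparse implementation would compute scores only for existing edges.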
6.3 Multi-Head Attention for Graphs
Single-head attention can be unstable and may attend to a limited range of features. Multi-head attention stabilizes training and allows the model to jointly attend to different aspects of neighbors.
Multi-head GAT layer. Run $K$ independent attention mechanisms in parallel, each with its own parameters $\mathbf{W}^{(k)}, \mathbf{a}^{(k)}$:

$$\mathbf{h}_i' = \big\Vert_{k=1}^{K}\, \sigma\!\left(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha_{ij}^{(k)}\, \mathbf{W}^{(k)}\mathbf{h}_j\right)$$

where $\Vert$ denotes concatenation. The output dimension is $K d'$.
For the final layer (before a classification head), averaging is often preferred over concatenation:

$$\mathbf{h}_i' = \sigma\!\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha_{ij}^{(k)}\, \mathbf{W}^{(k)}\mathbf{h}_j\right)$$
Complexity. A $K$-head GAT layer costs $O(K|E|d')$ for the attention computation ($|E|$ edges, $d'$ output features per head) and $O(K|V|dd')$ for the linear transformations. For dense attention (a fully-connected graph), this becomes $O(K|V|^2 d')$ - the same scaling as transformer attention.
6.4 GATv2: Fixing the Static Attention Problem
Brody, Alon & Yahav (2022) identified a fundamental limitation of the original GAT: its attention is static - the ranking of a neighbor's importance does not depend on the query node.
The problem. In GAT, the attention score is:

$$e_{ij} = \mathrm{LeakyReLU}\!\left(\mathbf{a}_1^\top \mathbf{W}\mathbf{h}_i + \mathbf{a}_2^\top \mathbf{W}\mathbf{h}_j\right), \qquad \mathbf{a} = [\mathbf{a}_1 \,\Vert\, \mathbf{a}_2]$$

Since, inside the monotonic LeakyReLU, this is a sum of two terms - one depending only on $i$, one depending only on $j$ - the ranking of neighbors $j$ vs $j'$ by their scores $e_{ij}$ vs $e_{ij'}$ is independent of $i$. The "most important neighbor" is the same regardless of which node is doing the asking. This is "static attention."
Dynamic attention (GATv2). Fix by swapping the order of the operations, applying the attention vector only after the nonlinearity:

$$e_{ij} = \mathbf{a}^\top\, \mathrm{LeakyReLU}\!\left(\mathbf{W}\left[\mathbf{h}_i \,\Vert\, \mathbf{h}_j\right]\right)$$

Now the LeakyReLU nonlinearity is applied before the dot product with $\mathbf{a}$, creating genuine interaction between $\mathbf{h}_i$ and $\mathbf{h}_j$. The ranking of $j$ vs $j'$ can now depend on $i$ - the attention is dynamic.
Formal claim (Brody et al., 2022): GAT is a strictly less powerful attention function than GATv2. There exist graphs where GAT's attention collapses to uniform weighting while GATv2's attention correctly assigns non-uniform importance. GATv2 matches or exceeds GAT on all standard benchmarks.
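A toy numeric illustration of the static-attention argument (randomly initialized parameters; the names `W`, `a`, `W2`, `a2` are local to this sketch, not taken from the papers):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, d_out = 5, 8, 8
H = torch.randn(n, d)                                        # toy node features
W, a = torch.randn(d, d_out), torch.randn(2 * d_out)         # GAT parameters
W2, a2 = torch.randn(d_out, 2 * d), torch.randn(d_out)       # GATv2 parameters

def gat_score(hi, hj):     # GAT:   e_ij = LeakyReLU(a^T [W h_i || W h_j])
    return float(F.leaky_relu(torch.dot(a, torch.cat([hi @ W, hj @ W])), 0.2))

def gatv2_score(hi, hj):   # GATv2: e_ij = a^T LeakyReLU(W [h_i || h_j])
    return float(torch.dot(a2, F.leaky_relu(W2 @ torch.cat([hi, hj]), 0.2)))

# GAT is static: the best-scoring neighbor is the same for every query node i,
# because ranking e_ij over j reduces to ranking a_2^T (W h_j). GATv2 has no such constraint.
for name, score in [("GAT", gat_score), ("GATv2", gatv2_score)]:
    best = [max(range(n), key=lambda j: score(H[i], H[j])) for i in range(n)]
    print(name, "favorite neighbor per query node:", best)
```

Running this, the GAT list is constant across query nodes, while the GATv2 list generally varies - the ranking depends on the query.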
6.5 Attention Sparsity and Interpretability
A commonly cited advantage of GAT over GCN is interpretability: the attention coefficients $\alpha_{ij}$ can be read as edge importance scores. Edges with high $\alpha_{ij}$ are "important" for node $i$'s representation; edges with near-zero $\alpha_{ij}$ are "ignored."
Limitations of this interpretation. Jain & Wallace (2019) and Wiegreffe & Pinter (2019) showed for NLP models that attention weights do not reliably indicate feature importance. Similar caveats apply to GAT:
- Attention weights are post-softmax and sum to 1; the same numerical weight means something very different for a node with 3 neighbors than for a node with 300 neighbors.
- The softmax creates competition between neighbors - a "dominant" neighbor may capture high attention simply by having a large dot product with $\mathbf{a}$, not because it is semantically important.
- Different attention heads may attend to completely different structural patterns, and the meaning of each head is not determined a priori.
Despite these caveats, in practice GAT attention weights do often reveal meaningful structure on molecular and knowledge-graph tasks, especially when validated against domain knowledge.
7. Expressiveness and the Weisfeiler-Leman Test
7.1 The Graph Isomorphism Problem
Definition (Graph Isomorphism). Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic (written $G_1 \cong G_2$) if there exists a bijection $\pi: V_1 \to V_2$ such that $(u, v) \in E_1 \iff (\pi(u), \pi(v)) \in E_2$.
Two isomorphic graphs are structurally identical - they differ only in how the vertices are labeled. For any function of graph structure (predicting molecular properties, classifying graph type), the function must produce the same output on isomorphic graphs. This is exactly the permutation invariance requirement from 2.3.
The graph isomorphism (GI) problem - determining whether two given graphs are isomorphic - is one of the few natural problems not known to be in P or NP-complete. For practical purposes (GNN expressiveness), we ask a weaker question: can a GNN distinguish two non-isomorphic graphs by assigning them different representations?
Non-isomorphic graphs that look similar. Consider:
Graph 1: two disjoint triangles ($C_3 \cup C_3$)

```
   o---o     o---o
    \ /       \ /
     o         o
```

Graph 2: one 6-cycle ($C_6$)

```
   o---o---o
   |       |
   o---o---o
```

Both graphs have 6 nodes, each of degree 2 - the same degree sequence. Are they isomorphic? No: $C_3 \cup C_3$ contains two triangles; $C_6$ contains none. A GNN should assign them different representations. Can it?
7.2 1-WL and Color Refinement
The Weisfeiler-Leman (WL) graph isomorphism test (Weisfeiler & Leman, 1968) is an efficient heuristic for testing graph isomorphism. While not complete (there exist non-isomorphic pairs it cannot distinguish), it separates most non-isomorphic graphs encountered in practice.
1-WL Color Refinement Algorithm:
Initialize: assign all nodes the same color, $c_v^{(0)} = c_0$ (or, for node-attributed graphs, $c_v^{(0)} = \mathrm{HASH}(\mathbf{x}_v)$).
Iterate for $t = 1, 2, \dots$ until convergence:

$$c_v^{(t)} = \mathrm{HASH}\!\left(c_v^{(t-1)},\; \{\!\!\{\, c_u^{(t-1)} : u \in \mathcal{N}(v) \,\}\!\!\}\right)$$

where $\{\!\!\{\cdot\}\!\!\}$ denotes a multiset (duplicate elements are preserved) and HASH is an injective hash function on (color, multiset of colors) pairs.
Decision: If at any iteration, graphs $G_1$ and $G_2$ have different multisets of node colors, declare them non-isomorphic. If the algorithm stabilizes with the same color multisets, declare them (possibly) isomorphic.
Convergence: The algorithm stabilizes in at most $|V|$ iterations (when the color partition stops refining); each iteration costs roughly $O(|V| + |E|)$ hashing operations.
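A compact implementation of color refinement (plain Python; adjacency as a dict of neighbor lists). As a simplified check it compares the stable color histograms of the two graphs rather than interleaving the refinements:

```python
from collections import Counter

def wl_color_histogram(adj):
    """1-WL color refinement. adj: dict {node: list of neighbors}.
    Returns the multiset (Counter) of stable node colors."""
    colors = {v: 0 for v in adj}                       # uniform initial color
    while True:
        # signature = (own color, sorted multiset of neighbor colors)
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new_colors = {v: relabel[sig[v]] for v in adj}
        if len(set(new_colors.values())) == len(set(colors.values())):
            return Counter(new_colors.values())        # partition stopped refining
        colors = new_colors

# Two disjoint triangles vs one 6-cycle: identical histograms, so 1-WL cannot separate them.
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(wl_color_histogram(two_triangles) == wl_color_histogram(six_cycle))   # True
```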
Example. For $C_3 \cup C_3$ vs $C_6$ with no node attributes:
- $t = 0$: all nodes get color $c_0$ - the same for both graphs
- $t = 1$: every node has degree 2 with identically colored neighbors, so all nodes again receive the same color - still the same for both graphs!
- The algorithm stabilizes: 1-WL cannot distinguish $C_3 \cup C_3$ from $C_6$ - to 1-WL, both look like "6 nodes of degree 2 with identically colored neighbors."
This is a fundamental limitation: 1-WL cannot detect triangles (3-cycles), so neither can any MPNN (Theorem below).
7.3 The GNN Expressiveness Theorem
Theorem (Xu et al., 2019). Let $f$ be any MPNN with a fixed number of layers $T$ and features from a countable set. Then:
- If $f$ assigns different representations to graphs $G_1$ and $G_2$, then 1-WL also distinguishes $G_1$ and $G_2$ (within $T$ iterations).
- For any pair of graphs that 1-WL distinguishes, there exists an MPNN that also distinguishes them.
Corollary. The discriminative power of any MPNN is bounded above by 1-WL. Conversely, the most expressive MPNN achieves exactly the discriminative power of 1-WL.
Proof sketch (upper bound). At each MPNN layer, node $v$'s representation is a function of $\left(\mathbf{h}_v^{(t-1)},\; \{\!\!\{\, \mathbf{h}_u^{(t-1)} : u \in \mathcal{N}(v) \,\}\!\!\}\right)$ - exactly the information 1-WL hashes. If the aggregation + update function is injective (different inputs -> different outputs), the MPNN exactly simulates 1-WL. If it is not injective (e.g., mean aggregation), it may collapse distinct multisets to the same representation, making it strictly weaker than 1-WL.
Implications:
- MPNNs with mean or max aggregation are strictly weaker than 1-WL - they collapse some distinct multisets
- MPNNs with sum aggregation and injective MLP update are exactly 1-WL - GIN achieves this
- No MPNN (regardless of architecture) can distinguish graphs that 1-WL cannot distinguish (e.g., $C_3 \cup C_3$ vs $C_6$, or non-isomorphic regular graphs with the same degree and size)
- Expressiveness limitations are fundamental, not a matter of training or capacity
7.4 GIN: The Most Expressive 1-WL GNN
Graph Isomorphism Network (Xu et al., 2019). To achieve 1-WL expressiveness, the aggregation function must be injective on multisets. Xu et al. prove:
Lemma. Any injective function $f$ on multisets $X$ over a countable universe can be expressed as:

$$f(X) = \phi\!\left(\sum_{x \in X} \varphi(x)\right)$$

for some functions $\phi$ and $\varphi$. In other words, sum aggregation followed by an MLP can represent any injective multiset function.
GIN layer:

$$\mathbf{h}_v^{(t)} = \mathrm{MLP}^{(t)}\!\left(\left(1 + \epsilon^{(t)}\right)\mathbf{h}_v^{(t-1)} + \sum_{u \in \mathcal{N}(v)} \mathbf{h}_u^{(t-1)}\right)$$

where $\epsilon^{(t)}$ is either learned or set to $0$. The $(1 + \epsilon)$ term allows the network to distinguish the central node from its neighbors; without it, the central node is treated as just another summand, so a node with feature $\mathbf{x}$ and a neighbor with feature $\mathbf{y}$ would receive the same representation as a node with feature $\mathbf{y}$ and a neighbor with feature $\mathbf{x}$.
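A minimal dense-adjacency layer matching the update above (module and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """GIN update: h_v <- MLP((1 + eps) * h_v + sum of neighbor features)."""
    def __init__(self, dim, learn_eps=True):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1)) if learn_eps else torch.zeros(1)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, H, adj):
        # adj: (n, n) 0/1 adjacency without self-loops, so adj @ H is the neighbor sum
        return self.mlp((1 + self.eps) * H + adj @ H)
```

Sum aggregation (`adj @ H`) followed by an MLP is what makes the layer injective on multisets of neighbor features, per the lemma above.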
Why mean and max fail:
| Aggregation | Counterexample multisets | Behavior |
|---|---|---|
| Mean | $\{1, 1\}$ vs $\{1, 1, 1\}$ | Both have mean 1 - indistinguishable |
| Max | $\{2\}$ vs $\{2, 2\}$ | Both have max 2 - indistinguishable |
| Sum | $\{1, 1\}$ vs $\{1, 1, 1\}$ | Sums 2 vs 3 - distinguishable |
GIN for graph classification. Use a sum readout at each layer and combine across layers:

$$\mathbf{h}_G = \big\Vert_{t=0}^{T} \sum_{v \in V} \mathbf{h}_v^{(t)}$$
This jumping-knowledge-style readout captures structural patterns at multiple scales.
7.5 Beyond 1-WL: Higher-Order GNNs
1-WL's limitations (e.g., failing to count triangles, failing on regular graphs) motivate higher-order extensions.
$k$-WL test. Instead of coloring individual nodes, color $k$-tuples of nodes $(v_1, \dots, v_k) \in V^k$. The refinement rule considers the colors of all $k$-tuples that differ in exactly one position. For $k \ge 2$, $(k+1)$-WL is strictly more powerful than $k$-WL.
$k$-GNN (Morris et al., 2019). Implement $k$-WL as a GNN by passing messages between $k$-tuples. The cost is $O(n^k)$ tuples to store and update - prohibitive for $k \ge 3$ on large graphs.
NGNN and subgraph GNNs. A more practical approach: instead of a single node-level MPNN, run MPNNs on subgraphs induced by $k$-hop neighborhoods, ego-graphs, or sampled subgraphs. These can detect cycles, cliques, and other motifs that 1-WL misses. Examples: NGNN (Zhang & Li, 2021), OSAN (Zhao et al., 2022).
Practical trade-off:
| Method | Expressiveness | Complexity | Practical use |
|---|---|---|---|
| 1-WL MPNN (GIN) | 1-WL | $O(|E|)$ per layer | Standard; most applications |
| Subgraph GNNs | > 1-WL | ~$n \times$ the MPNN cost | Medium graphs, chemistry |
| $k$-GNN ($k = 3$) | 3-WL | $O(n^3)$ | Small graphs only |
| Random features | Universal (in expectation) | $O(|E|)$ per layer | Simple, effective in practice |
7.6 Structural Features and Distance Encoding
A practical alternative to higher-order GNNs: augment node features with handcrafted structural descriptors that help the GNN detect patterns it otherwise cannot.
Random Walk Structural Encoding (RWSE). For each node $v$, compute the probability that a random walk of length $k$ starting at $v$ lands back at $v$:

$$\mathrm{RWSE}_v = \left[(\mathbf{M})_{vv},\, (\mathbf{M}^2)_{vv},\, \dots,\, (\mathbf{M}^K)_{vv}\right], \qquad \mathbf{M} = \mathbf{D}^{-1}\mathbf{A}$$

The vector encodes the local loop structure (triangles, 4-cycles, etc.). Used in GPS (Rampasek et al., 2022) and Graphormer.
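A short NumPy sketch of the RWSE computation via powers of the transition matrix:

```python
import numpy as np

def rwse(A, K=8):
    """Return-probability features (M^k)_{vv} for k = 1..K, with M = D^{-1} A."""
    M = A / A.sum(axis=1, keepdims=True)     # row-normalized transition matrix
    Mk, feats = np.eye(A.shape[0]), []
    for _ in range(K):
        Mk = Mk @ M
        feats.append(np.diag(Mk))            # probability of returning after k steps
    return np.stack(feats, axis=1)           # (n, K) structural features

# Triangle vs 3-node path: only the triangle's nodes have a non-zero 3-step return probability.
A_triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
A_path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
print(rwse(A_triangle, 3)[:, 2], rwse(A_path, 3)[:, 2])   # [0.25 0.25 0.25] vs [0. 0. 0.]
```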
Laplacian Positional Encoding (LapPE). Use the first $k$ non-trivial eigenvectors of the normalized Laplacian as node features: $\mathbf{p}_v = [\mathbf{u}_1(v), \dots, \mathbf{u}_k(v)]$. (Reviewed in 11-04 9.5 - see there for the sign invariance issue and fix.)
Degree features. Simply appending the node's degree as a feature allows the GNN to distinguish nodes of different degree even when 1-WL would assign them the same color (since 1-WL with uniform initial colors cannot use degree beyond what the refinement computes). More sophisticated: distance encoding (Li et al., 2020) appends the shortest-path distances from a node to a set of anchor nodes.
8. Over-Smoothing, Over-Squashing, and Depth
8.1 Over-Smoothing: Formal Analysis
A 2-layer GCN typically outperforms a 6-layer GCN on standard node classification benchmarks. This seems paradoxical: more layers should give access to more of the graph. The reason is over-smoothing: as depth increases, all node representations converge to the same vector, destroying the discriminative information needed for classification.
Formal statement (Li et al., 2018). Let $\hat{\mathbf{A}}$ be the GCN propagation matrix with self-loops (assume the graph is connected; the self-loops rule out bipartite oscillation). For the random-walk normalization $\mathbf{P} = \tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}}$:

$$\lim_{k \to \infty} \left(\mathbf{P}^k\right)_{ij} = \pi_j, \qquad \pi_j = \frac{\tilde{d}_j}{\sum_u \tilde{d}_u}$$

where $\boldsymbol{\pi}$ is the stationary distribution of the random walk on $G$. That is, $k$-step propagation from any starting point converges to the stationary distribution regardless of initial conditions; the symmetric normalization $\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$ converges to the corresponding rank-1 matrix.
Consequence. For a linear GCN (no nonlinearities), $\mathbf{H}^{(K)} = \hat{\mathbf{A}}^K \mathbf{X}\mathbf{W}^{(1)}\cdots\mathbf{W}^{(K)}$. As $K \to \infty$:
All nodes collapse to a constant multiple of the graph-wide feature average. For connected graphs, $\hat{\mathbf{A}}^K$ converges to a rank-1 matrix - all node representations become scalar multiples of a single vector.
With nonlinearities (ReLU), convergence is not exact, but the trend holds: after 6-8 layers, node representations on typical graphs (small-world, power-law degree distribution) are nearly identical.
MADGap metric (Chen et al., 2020). Mean Average Distance (MAD) measures pairwise distances between node representations. MADGap = MAD(between-class) - MAD(within-class). For a good classifier, MADGap should be large (between-class distances large, within-class distances small). Over-smoothing drives MAD -> 0, collapsing MADGap.
8.2 Information-Theoretic View via Dirichlet Energy
The Dirichlet energy of the node representation matrix $\mathbf{H} \in \mathbb{R}^{n \times d}$ measures total variation across edges:

$$E(\mathbf{H}) = \frac{1}{2}\sum_{(i,j) \in E} \left\|\mathbf{h}_i - \mathbf{h}_j\right\|^2 = \mathrm{tr}\!\left(\mathbf{H}^\top \mathbf{L}\mathbf{H}\right)$$

where $\mathbf{L} = \mathbf{D} - \mathbf{A}$ is the unnormalized Laplacian.
Properties:
- $E(\mathbf{H}) \ge 0$ for all $\mathbf{H}$; on a connected graph $E(\mathbf{H}) = 0$ iff all node representations are identical
- $E(\mathbf{H})$ is large when adjacent nodes have different representations - high "frequency content"
- The GCN smoothing step reduces $E$: $E(\hat{\mathbf{A}}\mathbf{H}) \le s^2\, E(\mathbf{H})$ for a contraction factor $s < 1$ that depends on the spectrum of $\hat{\mathbf{A}}$ (since $\hat{\mathbf{A}}$ is a low-pass graph filter)
Over-smoothing = energy dissipation. Each GCN layer reduces $E$. After $K$ layers:

$$E\!\left(\mathbf{H}^{(K)}\right) \lesssim s^{2K}\, E\!\left(\mathbf{H}^{(0)}\right)$$

Since $s < 1$ for the symmetric normalized $\hat{\mathbf{A}}$ with self-loops, the energy decays geometrically. The rate depends on the spectral gap $\lambda_2$ (the Fiedler value): graphs with a large spectral gap (expanders) over-smooth faster than graphs with a small spectral gap (clustered graphs).
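A quick numerical check of the geometric decay (a random graph and random features; the propagation operator is the symmetric-normalized adjacency with self-loops):

```python
import numpy as np

def dirichlet_energy(H, A):
    """E(H) = 1/2 * sum over edges ||h_i - h_j||^2 = tr(H^T L H)."""
    L = np.diag(A.sum(axis=1)) - A
    return float(np.trace(H.T @ L @ H))

def gcn_propagation(A):
    """Symmetric-normalized propagation with self-loops (the GCN smoothing operator)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    return A_tilde / np.sqrt(np.outer(d, d))

rng = np.random.default_rng(0)
A = np.triu((rng.random((20, 20)) < 0.2).astype(float), 1)
A = A + A.T                                   # random undirected graph on 20 nodes
H, P = rng.standard_normal((20, 8)), gcn_propagation(A)
for k in range(0, 9, 2):
    print(k, round(dirichlet_energy(np.linalg.matrix_power(P, k) @ H, A), 3))
# The printed energies shrink rapidly with depth k: the representations smooth out.
```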
Practical implication. For clustered graphs (molecular graphs, citation networks with strong community structure), over-smoothing is slow - one can use deeper GNNs. For expander-like graphs (dense social networks, random graphs), over-smoothing is fast - 2-4 layers is optimal.
8.3 Over-Squashing: Bottleneck Nodes
Over-squashing is a distinct (and more subtle) problem: information from distant nodes is exponentially compressed as it flows through bottleneck edges, preventing long-range interactions from influencing predictions.
Jacobian analysis (Alon & Yahav, 2021). Consider the Jacobian of node $v$'s representation at layer $K$ with respect to node $u$'s initial feature, $\partial \mathbf{h}_v^{(K)} / \partial \mathbf{x}_u$. Expanding the chain rule, it is a sum of products of per-layer Jacobians along paths from $u$ to $v$ of length at most $K$. Its norm is bounded by:

$$\left\|\frac{\partial \mathbf{h}_v^{(K)}}{\partial \mathbf{x}_u}\right\| \le C^K \left(\hat{\mathbf{A}}^K\right)_{vu}$$

where $(\hat{\mathbf{A}}^K)_{vu}$ is the $(v, u)$ entry of the $K$-th power of the propagation matrix and $C$ bounds the per-layer Lipschitz constants. For nodes far apart in the graph, $(\hat{\mathbf{A}}^K)_{vu}$ is exponentially small in the distance between $u$ and $v$.
The bottleneck. If $u$ and $v$ are separated by a single bridge edge of high betweenness centrality, all information from $u$'s side must pass through it. The aggregation at the bridge endpoint receives messages from all of its neighbors and compresses them into a single $d$-dimensional vector - the amount of information to be squeezed through grows exponentially with the receptive field, while the capacity stays fixed at $d$.
Over-squashing vs over-smoothing. These are dual problems:
- Over-smoothing: too much aggregation -> representations become too similar
- Over-squashing: aggregation is a bottleneck -> distant information cannot influence predictions
- Over-smoothing is a property of the graph and its spectrum; over-squashing is a property of specific bottleneck edges
8.4 Remedies for Over-Smoothing
DropEdge (Rong et al., 2020). Randomly remove edges during training: replace $\mathbf{A}$ with a sparsified $\tilde{\mathbf{A}}_{\mathrm{drop}}$ that drops each edge independently with probability $p$. This is analogous to dropout for neurons but applied to edges. It slows the convergence of over-smoothing (less propagation per layer) and acts as a data augmentation method. Improves performance at depths 4-8 on standard benchmarks.
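A minimal sketch of the edge-dropping step on an edge-index representation (for an undirected graph stored with both directions, one would drop the two directions together; omitted here for brevity):

```python
import torch

def drop_edge(edge_index, p=0.2, training=True):
    """DropEdge: remove each edge independently with probability p during training.
    edge_index: (2, E) tensor of edges."""
    if not training or p == 0.0:
        return edge_index
    keep = torch.rand(edge_index.shape[1]) >= p      # Bernoulli keep-mask over edges
    return edge_index[:, keep]
```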
PairNorm (Zhao & Akoglu, 2020). After each GCN layer, re-center and re-scale the node representations so that the total pairwise distance stays constant:

$$\tilde{\mathbf{h}}_i = \mathbf{h}_i - \frac{1}{n}\sum_{j} \mathbf{h}_j, \qquad \mathbf{h}_i' = s \cdot \frac{\tilde{\mathbf{h}}_i}{\sqrt{\frac{1}{n}\sum_{j}\left\|\tilde{\mathbf{h}}_j\right\|^2}}$$

where $s$ is a scaling hyperparameter. PairNorm prevents the Dirichlet energy from decaying to zero by keeping pairwise distances bounded below. Empirically effective for deep GCNs (8-16 layers).
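The corresponding operation in code (a direct transcription of the two steps above):

```python
import torch

def pair_norm(H, s=1.0, eps=1e-6):
    """PairNorm: center the rows, then rescale so the mean squared row norm equals s^2."""
    H = H - H.mean(dim=0, keepdim=True)                 # remove the global mean
    scale = torch.sqrt(H.pow(2).sum(dim=1).mean() + eps)
    return s * H / scale
```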
Initial residual connection (GCNII, Chen et al., 2020).

$$\mathbf{H}^{(\ell+1)} = \sigma\!\left(\left((1 - \alpha)\,\hat{\mathbf{A}}\mathbf{H}^{(\ell)} + \alpha\,\mathbf{H}^{(0)}\right)\left((1 - \beta_\ell)\,\mathbf{I} + \beta_\ell\,\mathbf{W}^{(\ell)}\right)\right)$$

Two additions: (1) the initial residual $\alpha\,\mathbf{H}^{(0)}$ - always mixing in the initial features with weight $\alpha$ prevents full convergence to the smoothed limit; (2) the identity mapping: the weight matrix interpolates between the identity ($\beta_\ell = 0$, pure smoothing) and a full linear transform ($\beta_\ell = 1$). With $\beta_\ell$ decreasing with depth ($\beta_\ell = \log(\lambda/\ell + 1)$), GCNII successfully trains 64-layer GCNs, achieving state-of-the-art on Cora at the time (85.5% accuracy vs 81.5% for a 2-layer GCN).
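A compact layer implementing the same update (fixed $\alpha$ and $\beta$ per layer for simplicity; in GCNII $\beta_\ell$ decays with depth):

```python
import torch
import torch.nn as nn

class GCNIILayer(nn.Module):
    """GCNII layer: initial residual (alpha) + identity mapping (beta)."""
    def __init__(self, dim, alpha=0.1, beta=0.5):
        super().__init__()
        self.alpha, self.beta = alpha, beta
        self.W = nn.Linear(dim, dim, bias=False)

    def forward(self, H, H0, P):
        # P: normalized propagation matrix, H0: layer-0 representations (initial residual)
        smoothed = (1 - self.alpha) * (P @ H) + self.alpha * H0
        return torch.relu((1 - self.beta) * smoothed + self.beta * self.W(smoothed))
```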
GroupNorm / NodeNorm. Normalize within groups of nodes to prevent scale collapse. Less commonly used than the above.
8.5 Graph Rewiring
A more aggressive approach: change the graph structure to improve information flow before running the GNN.
Diffusion-based rewiring (DIGL, Gasteiger et al., 2019). Replace $\mathbf{A}$ with a diffusion matrix: a truncated heat kernel $\exp(-t\mathbf{L})$ or the personalized PageRank matrix $\alpha\left(\mathbf{I} - (1 - \alpha)\hat{\mathbf{A}}\right)^{-1}$. The diffusion matrix connects distant nodes with non-zero weights, enabling long-range information flow in a single GNN layer. Edges with small weights are pruned, keeping the graph sparse.
Expander Graph Propagation (EGP, Deac et al., 2022). Augment the graph with the edges of a sparse expander graph (a $d$-regular graph with a large spectral gap, e.g., a Ramanujan graph). Expander edges provide "shortcuts" that reduce the effective diameter of the graph, alleviating over-squashing without the cost of full connectivity.
FoSR: First-Order Spectral Rewiring (Karhadkar et al., 2023). Add edges that most increase $\lambda_2$ (the Fiedler value), directly attacking the spectral bottleneck. Greedy algorithm: at each step, add the edge that maximally increases $\lambda_2$ of the augmented graph. With a budget of added edges, the graph diameter decreases and over-squashing is reduced.
8.6 Depth vs Width Trade-off in GNNs
Why shallow GNNs dominate in practice. On most benchmark tasks (node classification on citation networks, graph classification on molecular benchmarks), 2-4 GNN layers achieve the best performance. The reasons:
- Homophily dominates: in social and citation networks, the most informative neighbors are at distance 1-2. Beyond distance 3, nodes are often from different classes, and aggregating them hurts classification.
- Over-smoothing: as analyzed above, deep GCNs lose discriminative information.
- Exponential neighborhood growth: a node's $k$-hop neighborhood in a power-law graph has on the order of $\bar{d}^{\,k}$ nodes. For average degree 10 and $k = 5$, that's 100,000 nodes - mostly irrelevant.
When depth helps. Long-range tasks where predictions depend on distant nodes:
- Predicting molecular properties that depend on the entire molecular graph (QM9 quantum chemistry dataset)
- Reasoning over knowledge graphs with long inference chains
- Planning and path-finding tasks on sparse graphs
For these tasks, graph Transformers (Section 10) - which connect every pair of nodes through full pairwise attention in a single layer - often outperform deep GNNs.
Jumping Knowledge Networks (Xu et al., 2018). A principled approach: each node selects the representation from the most appropriate depth:

$$\mathbf{h}_v^{\mathrm{final}} = \mathrm{COMBINE}\!\left(\mathbf{h}_v^{(1)}, \mathbf{h}_v^{(2)}, \dots, \mathbf{h}_v^{(K)}\right)$$
COMBINE can be concatenation, LSTM-attention, or max-pooling across layers. This allows different nodes to "listen" at different distances - nodes in dense clusters use shallow representations; nodes that are bridges benefit from deeper representations.
9. Graph Pooling and Hierarchical GNNs
9.1 Why Pooling Is Non-Trivial on Graphs
In image CNNs, pooling is straightforward: reduce a spatial block to a single pixel by averaging. The operation is well-defined because images have a fixed grid structure.
On graphs, pooling (coarsening) must answer: which nodes get merged? How are their features combined? How is the edge structure of the coarsened graph defined? All of these must be permutation invariant - the coarsened graph cannot depend on how vertices are labeled.
The challenges:
- No canonical merging: unlike pixels in a grid, there is no obvious spatial proximity to guide merging. Spectral methods (9.4) use the Fiedler vector; learned methods (9.3) learn soft assignments.
- Edge reconstruction: after merging nodes $u$ and $v$ into a super-node $w$, which edges does $w$ inherit? Typically all edges incident to $u$ or $v$ - but this may create multi-edges and self-loops.
- Information loss: pooling irreversibly reduces the graph. Unlike deconvolution in image models, there is no standard graph "unpooling" that perfectly recovers the original structure.
9.2 Global Pooling Methods
For graph-level tasks, a single pooling step at the end suffices. The readout function maps the set of node representations to a single graph embedding.
Sum pooling: $\mathbf{h}_G = \sum_{v \in V} \mathbf{h}_v$. Sensitive to graph size (larger graphs get larger embeddings). Most expressive for graph-level tasks (by the same argument as sum aggregation for node-level).
Mean pooling: $\mathbf{h}_G = \frac{1}{|V|}\sum_{v \in V} \mathbf{h}_v$. Normalizes for graph size; best when comparing graphs of different sizes.
Max pooling: $\mathbf{h}_G = \max_{v \in V} \mathbf{h}_v$ (element-wise). Detects whether any node has a feature above a threshold; good for detecting rare structural motifs.
Attention pooling (gated global pooling):

$$\mathbf{h}_G = \sum_{v \in V} \sigma\!\left(g(\mathbf{h}_v)\right) \odot f(\mathbf{h}_v)$$

where $g$ scores (gates) each node and $f$ transforms the features. Allows the model to focus on the most informative nodes. Used in Graph U-Net, Graphormer.
Set2Set (Vinyals et al., 2016). An LSTM-based readout that runs $T$ steps of attention over the node set, accumulating a context vector that becomes the graph embedding.
More expressive than simple pooling, but with $T$ sequential steps.
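The simple readouts in code (dense node-feature matrix `H` for a single graph; `gate` and `transform` are assumed to be small learnable modules, e.g. linear layers):

```python
import torch

def readout(H, mode="sum"):
    """Graph-level readout from node representations H of shape (n, d)."""
    if mode == "sum":
        return H.sum(dim=0)             # size-sensitive, most expressive
    if mode == "mean":
        return H.mean(dim=0)            # size-normalized
    if mode == "max":
        return H.max(dim=0).values      # picks out strong individual activations
    raise ValueError(mode)

def gated_readout(H, gate, transform):
    """Attention (gated) pooling: sum_v sigmoid(gate(h_v)) * transform(h_v)."""
    return (torch.sigmoid(gate(H)) * transform(H)).sum(dim=0)
```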
9.3 DiffPool: Differentiable Graph Pooling
DiffPool (Ying et al., 2018) learns soft cluster assignments to hierarchically coarsen the graph. At each pooling level $l$, run two GNNs in parallel:

$$\mathbf{Z}^{(l)} = \mathrm{GNN}_{\mathrm{embed}}\!\left(\mathbf{A}^{(l)}, \mathbf{X}^{(l)}\right), \qquad \mathbf{S}^{(l)} = \mathrm{softmax}\!\left(\mathrm{GNN}_{\mathrm{pool}}\!\left(\mathbf{A}^{(l)}, \mathbf{X}^{(l)}\right)\right)$$

where $n_l$ is the number of nodes at level $l$, $n_{l+1}$ is the target number of clusters, $\mathbf{S}^{(l)} \in \mathbb{R}^{n_l \times n_{l+1}}$ is the soft assignment matrix (each node assigns fractionally to each of the $n_{l+1}$ clusters; the softmax is row-wise), and $\mathbf{Z}^{(l)} \in \mathbb{R}^{n_l \times d}$ is the GNN embedding of the current-level nodes.
Coarsened graph:

$$\mathbf{X}^{(l+1)} = \mathbf{S}^{(l)\top}\mathbf{Z}^{(l)}, \qquad \mathbf{A}^{(l+1)} = \mathbf{S}^{(l)\top}\mathbf{A}^{(l)}\mathbf{S}^{(l)}$$

The coarsened adjacency is dense - every pair of clusters has a (soft) connection weighted by the total edge weight between them.
Auxiliary losses. DiffPool adds two regularization terms to the main task loss:
- Link prediction loss: $L_{\mathrm{LP}} = \left\|\mathbf{A}^{(l)} - \mathbf{S}^{(l)}\mathbf{S}^{(l)\top}\right\|_F$ - encourages clusters to correspond to connected subgraphs
- Entropy loss: $L_{\mathrm{E}} = \frac{1}{n_l}\sum_{i} H\!\left(\mathbf{S}^{(l)}_{i,:}\right)$ - encourages crisp (non-uniform) cluster assignments
Limitation. DiffPool produces a dense $\mathbf{A}^{(l+1)}$ and a dense $n_l \times n_{l+1}$ assignment at each level - $O(n^2)$ memory in the worst case. For large graphs, this is prohibitive.
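The coarsening step and auxiliary losses in a few lines (dense matrices, one level):

```python
import torch

def diffpool(A, Z, S_logits):
    """One DiffPool level. A: (n, n) adjacency, Z: (n, d) embeddings, S_logits: (n, c) assignment logits."""
    S = torch.softmax(S_logits, dim=1)            # soft assignment of each node to c clusters
    X_coarse = S.T @ Z                            # (c, d) cluster features
    A_coarse = S.T @ A @ S                        # (c, c) dense coarsened adjacency
    link_loss = torch.norm(A - S @ S.T, p="fro")  # clusters should respect the edge structure
    ent_loss = (-S * torch.log(S + 1e-9)).sum(dim=1).mean()   # push assignments toward one-hot
    return X_coarse, A_coarse, link_loss, ent_loss
```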
9.4 MinCutPool and Spectral Pooling
MinCutPool (Bianchi et al., 2020) addresses DiffPool's density problem by formulating pooling as a spectral clustering problem with a differentiable objective.
The mincut objective. For a soft cluster assignment $\mathbf{S} \in \mathbb{R}^{n \times K}$, the (relaxed) normalized minimum cut is:

$$L_{\mathrm{cut}} = -\frac{\mathrm{tr}\!\left(\mathbf{S}^\top \mathbf{A}\mathbf{S}\right)}{\mathrm{tr}\!\left(\mathbf{S}^\top \mathbf{D}\mathbf{S}\right)}$$

Minimizing this over soft assignments is the relaxation solved by spectral clustering - the optimal relaxed $\mathbf{S}$ is spanned by the bottom non-trivial eigenvectors of the normalized Laplacian (of which the Fiedler vector is the first). MinCutPool uses a GNN to produce $\mathbf{S}$ and trains end-to-end with the minCUT loss plus an orthogonality regularizer to prevent degenerate cluster assignments.
Advantage over DiffPool: the objective is theoretically motivated by spectral graph theory, and the orthogonality regularization keeps the assignment matrix well-conditioned.
9.5 SAGPool: Self-Attention Graph Pooling
SAGPool (Lee et al., 2019) takes a different approach: instead of soft cluster assignments, select the top- nodes by a learned importance score and discard the rest.
Algorithm:
- Run one GNN layer to compute scalar node scores: $\mathbf{s} = \mathrm{GNN}_{\mathrm{score}}(\mathbf{A}, \mathbf{X}) \in \mathbb{R}^{n}$
- Select the top-$k$ nodes: $\mathrm{idx} = \mathrm{top}\text{-}k(\mathbf{s}, \lceil rn \rceil)$ for a pooling ratio $r$
- Gate the selected features: $\mathbf{X}' = \mathbf{X}_{\mathrm{idx}} \odot \tanh(\mathbf{s}_{\mathrm{idx}})$
- Induce the subgraph: $\mathbf{A}' = \mathbf{A}_{\mathrm{idx}, \mathrm{idx}}$ (restrict to the selected nodes)
Advantages: simple, interpretable (which nodes are selected?), maintains the sparsity of the adjacency. Disadvantage: hard top-$k$ selection is not differentiable with respect to which nodes are kept (the tanh gating provides a gradient path to the scorer; straight-through estimators or continuous relaxations are also used during training).
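The selection step in code (dense matrices, single graph; `scores` is the output of the scoring GNN):

```python
import math
import torch

def sag_pool(A, X, scores, ratio=0.5):
    """SAGPool selection: keep the top ceil(ratio * n) nodes by score, gate their features,
    and induce the subgraph on the kept nodes."""
    n = X.shape[0]
    k = max(1, math.ceil(ratio * n))
    idx = torch.topk(scores, k).indices
    X_new = X[idx] * torch.tanh(scores[idx]).unsqueeze(1)   # gating lets gradients reach the scorer
    A_new = A[idx][:, idx]                                   # adjacency restricted to selected nodes
    return A_new, X_new, idx
```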
10. Graph Transformers
10.1 Motivation: GNNs vs Transformers
The fundamental constraint of MPNNs is locality: an $L$-layer GNN can only aggregate information within an $L$-hop neighborhood. For long-range dependencies (e.g., two atoms on opposite ends of a large molecule that interact through space), a GNN would need depth on the order of the graph diameter - running into over-smoothing and over-squashing.
The transformer architecture - with full pairwise attention - solves the locality problem: every node can directly attend to every other node in a single layer. But transformers treat inputs as sets of independent tokens; they have no built-in notion of graph structure.
Graph Transformers combine: (1) the global receptive field of transformers (full pairwise attention), with (2) graph-structural inductive biases (local message passing, positional encodings derived from the graph topology).
The trade-off: full attention costs $O(n^2)$ per layer - practical only for small-to-medium graphs. For large graphs, neighbor-sampled MPNNs remain the only scalable option.
10.2 Positional Encodings for Graphs
In transformers, positional encodings break the permutation symmetry of the attention mechanism: without them, the transformer cannot distinguish token position, and all orderings of the same tokens produce the same output. The same problem occurs in graph transformers: the attention mechanism is permutation invariant, so we need positional encodings that break symmetry in a structure-aware way.
Recall from 11-04 9.5: Laplacian eigenvectors provide a natural graph Fourier basis, and the first $k$ eigenvectors of $\mathbf{L}$ give a $k$-dimensional coordinate for each node that reflects the graph's spectral structure. The sign invariance problem (eigenvectors are defined only up to sign) is addressed by random sign flipping during training or by using absolute values. Full treatment: 11-04 9.5.
Building on this, three main PE strategies for graph transformers:
Laplacian PE (LapPE). For each node $v$, extract the $v$-th row of $\mathbf{U}_k = [\mathbf{u}_1, \dots, \mathbf{u}_k]$, where $\mathbf{u}_1, \dots, \mathbf{u}_k$ are the first $k$ eigenvectors of $\mathbf{L}$ (excluding the constant eigenvector). Append to the node features: $\mathbf{x}_v \leftarrow [\mathbf{x}_v \,\Vert\, (\mathbf{U}_k)_{v,:}]$.
Sign ambiguity fix: during training, randomly flip the sign of each eigenvector independently (this has no effect on the Laplacian spectrum but makes the model robust to the sign choice). Used in Dwivedi et al. (2020), GPS (2022).
Random Walk SE (RWSE). For each node $v$ and walk length $k$, compute $(\mathbf{M}^k)_{vv}$ with $\mathbf{M} = \mathbf{D}^{-1}\mathbf{A}$ - the probability of a length-$k$ random walk returning to $v$. Stack for $k = 1, \dots, K$:

$$\mathbf{p}_v^{\mathrm{RWSE}} = \left[(\mathbf{M})_{vv}, (\mathbf{M}^2)_{vv}, \dots, (\mathbf{M}^K)_{vv}\right]$$

RWSE is always non-negative and sign-free (no sign ambiguity). It encodes local loop structure: $(\mathbf{M})_{vv} = 0$ iff $v$ has no self-loop; $(\mathbf{M}^3)_{vv} > 0$ iff $v$ lies on a triangle. Used in GPS (2022), GIN+RWSE.
Degree and centrality encoding. Simply append $d_v$ (the node degree), normalized closeness centrality, or betweenness centrality as scalar node features. Used in Graphormer (2021), which also encodes shortest-path distances as edge features.
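A sketch of the LapPE computation (NumPy; the random sign flip would be re-sampled at every training step):

```python
import numpy as np

def lap_pe(A, k=4, seed=None):
    """First k non-trivial eigenvectors of the symmetric normalized Laplacian,
    with a per-eigenvector random sign flip (the sign-ambiguity fix)."""
    rng = np.random.default_rng(seed)
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(A.shape[0]) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    pe = eigvecs[:, 1:k + 1]                    # skip the trivial first eigenvector
    return pe * rng.choice([-1.0, 1.0], size=k)
```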
10.3 GPS Framework
General, Powerful, Scalable (GPS) Graph Transformer (Rampasek et al., 2022) is a general graph transformer framework that set strong results across the datasets of the 2022 Long Range Graph Benchmark (LRGB).
GPS Layer:

$$\mathbf{X}_{\mathrm{local}} = \mathrm{MPNN}\!\left(\mathbf{X}^{(\ell)}, \mathbf{A}\right), \qquad \mathbf{X}_{\mathrm{global}} = \mathrm{SelfAttn}\!\left(\mathbf{X}^{(\ell)}\right), \qquad \mathbf{X}^{(\ell+1)} = \mathrm{MLP}\!\left(\mathbf{X}_{\mathrm{local}} + \mathbf{X}_{\mathrm{global}}\right)$$
Each GPS layer combines:
- Local MPNN: any standard GNN (GCN, GINE, GAT) operating on the sparse graph edges - captures local structure efficiently
- Global Transformer: full multi-head self-attention over all nodes - captures long-range dependencies
- LayerNorm + residual: stabilizes training
Modularity. GPS is a framework, not a fixed architecture: the MPNN and Transformer components are interchangeable. In practice, GINE (GIN with edge features) + Transformer works well; one can also use GAT + Performer (linear attention approximation) for scalability.
Positional encodings. GPS accepts arbitrary node-level PEs (LapPE, RWSE, degree) as extra input features, processed by a dedicated PE encoder before the first layer, e.g. $\mathbf{x}_v^{(0)} = \left[\mathbf{x}_v \,\Vert\, \mathrm{MLP}_{\mathrm{PE}}(\mathbf{p}_v)\right]$.
Performance. On the LRGB Peptides-func benchmark (a graph classification task where predictions require integrating global molecular structure), GPS achieves significantly higher accuracy than pure MPNN baselines, demonstrating the value of global attention for long-range tasks.
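A simplified GPS-style layer combining the two branches (plain GCN-style propagation stands in for the local MPNN; `dim` must be divisible by `heads` for the attention module; this is a sketch, not the reference implementation):

```python
import torch
import torch.nn as nn

class GPSLayer(nn.Module):
    """GPS-style hybrid layer: local message passing + global self-attention, then an MLP."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.local_lin = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))

    def forward(self, X, P):
        # X: (n, dim) node features, P: (n, n) normalized propagation matrix (sparse graph)
        local = torch.relu(self.local_lin(P @ X))              # local MPNN branch
        global_out, _ = self.attn(X[None], X[None], X[None])   # full pairwise attention branch
        H = self.norm1(X + local + global_out[0])              # residual + norm
        return self.norm2(H + self.mlp(H))                     # feed-forward + residual
```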
10.4 Graphormer
Graphormer (Ying et al., 2021) uses a standard transformer architecture (with full pairwise attention) augmented by three graph-structural encodings. It won the OGB-LSC 2021 quantum chemistry track.
Centrality encoding. Add a degree-dependent bias to the node features:

$$\mathbf{h}_v^{(0)} = \mathbf{x}_v + \mathbf{z}^{-}_{\deg^{-}(v)} + \mathbf{z}^{+}_{\deg^{+}(v)}$$

where $\mathbf{z}^{-}$ and $\mathbf{z}^{+}$ are learned embeddings indexed by in- and out-degree. This allows the transformer to distinguish hub nodes from leaf nodes.
Spatial encoding. Add a bias to the attention score between nodes $i$ and $j$ based on their shortest-path distance $\phi(i, j)$:

$$A_{ij} = \frac{(\mathbf{h}_i\mathbf{W}_Q)(\mathbf{h}_j\mathbf{W}_K)^\top}{\sqrt{d}} + b_{\phi(i,j)}$$

where $b_{\phi(i,j)}$ is a scalar learned per distance value $\phi(i, j)$. Each shortest-path distance thus gets its own learned attention bias, which encodes graph topology directly into the attention pattern.
Edge encoding. For each pair $(i, j)$, average the edge features along the shortest path:

$$c_{ij} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{e}_n^\top \mathbf{w}_n$$

where $\mathbf{e}_n$ is the feature of the $n$-th edge on the path, $\mathbf{w}_n$ is a learned vector, and $N$ is the path length. $c_{ij}$ is added to the attention score as an additional bias.
10.5 Graph Mamba and Sequence-Based Methods
The quadratic cost of full attention makes graph transformers impractical for large graphs. Recent work (2023-2024) explores linear-complexity alternatives.
Converting graphs to sequences. Several methods serialize the graph into a sequence of tokens, then apply a sequence model (LSTM, Mamba SSM):
- BFS/DFS ordering: nodes in BFS/DFS traversal order; nearby nodes in the graph -> nearby in sequence. Breaks permutation invariance (different traversal orderings give different results).
- Node-and-edge tokenization (TokenGT, Kim et al., 2022): represent each node and each edge as a separate token; full transformer attention over all tokens. Permutation invariant if node/edge PEs are orthogonalized.
Graph Mamba (Chen et al., 2023). Apply Mamba (a state-space model with selective scan) to graphs by: (1) ordering nodes by degree or BFS, (2) running the SSM along this ordering as a "linearized graph sequence." This achieves complexity linear in the number of nodes while matching transformer performance on some benchmarks. The key challenge: the SSM must be made robust to different node orderings (since graphs have no canonical ordering).
Status (2026). Graph Mamba and similar linear-attention methods are active research areas. For small-to-medium graphs, GPS-style full attention is standard. For large graphs, MPNN-based methods (GraphSAGE, Cluster-GCN) remain dominant in production.