Appendix C: Connections to Classical Machine Learning
C.1 GNNs and Kernel Methods
The GNN's computation can be viewed through the lens of kernel methods. Define the graph neural tangent kernel (GNTK) as the limiting kernel of an infinitely wide GNN at initialization:

$$\Theta(G, G') = \mathbb{E}_{\theta}\left[\left\langle \nabla_\theta f_\theta(G),\ \nabla_\theta f_\theta(G') \right\rangle\right]$$

where $f_\theta$ is the GNN's graph-level output and the expectation is over the random initialization of the parameters $\theta$.
This connects GNNs to the NTK theory of Jacot et al. (2018) and provides a framework for analyzing GNN training dynamics. For graph classification:
- Wide GNN kernel SVM: at infinite width, a GNN with random feature initialization and gradient descent is equivalent to a kernel machine with the GNTK
- GNTK expressiveness: the GNTK of a GIN-like architecture is exactly as expressive as 1-WL (matching the finite-depth GNN result)
Graph Weisfeiler-Leman kernel (Shervashidze et al., 2011). Before GNNs, graph kernels were the primary ML method for graphs. The WL kernel computes a feature vector from the color histograms at each WL refinement step:

$$k_{\mathrm{WL}}(G, G') = \sum_{t=0}^{T} \left\langle \phi_t(G),\ \phi_t(G') \right\rangle$$

where $\phi_t(G)$ is the histogram of colors at WL iteration $t$. The WL kernel is equivalent to a GIN with fixed (non-learned) MLPs - it is the non-parametric version of GIN.
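A minimal Python sketch of this kernel under the definitions above (the function names are illustrative; a production implementation such as GraKeL would use perfect hashing and sparse histograms):

```python
import numpy as np
from collections import Counter

def wl_kernel(adj_a, adj_b, labels_a, labels_b, num_iters=3):
    """WL subtree kernel: sum over iterations of histogram inner products.

    adj_*:    adjacency lists (list of neighbor lists)
    labels_*: initial node labels (e.g., atom types), hashable values
    """
    def refine(adj, labels):
        # New color = hash of (own color, sorted multiset of neighbor colors)
        return [hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
                for v in range(len(adj))]

    k = 0.0
    for _ in range(num_iters + 1):      # includes the t = 0 histogram
        ha, hb = Counter(labels_a), Counter(labels_b)
        k += sum(ha[c] * hb[c] for c in ha)   # <phi_t(G), phi_t(G')>
        labels_a = refine(adj_a, labels_a)
        labels_b = refine(adj_b, labels_b)
    return k
```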
For AI: Understanding GNNs as kernel machines allows the use of kernel theory (generalization bounds, bandwidth selection, representer theorem) to analyze GNN behavior - important for the theoretical foundations in Chapter 12 (Functional Analysis) and Chapter 21 (Statistical Learning Theory).
C.2 GNNs and Belief Propagation
Loopy belief propagation (LBP) on probabilistic graphical models (Markov random fields, factor graphs) has exactly the same computational structure as MPNN message passing.
Belief propagation update:

$$m_{i \to j}(x_j) = \sum_{x_i} \psi_{ij}(x_i, x_j)\, \phi_i(x_i) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i)$$

MPNN message update:

$$m_v^{(\ell)} = \bigoplus_{u \in N(v)} \psi^{(\ell)}\!\left(h_v^{(\ell-1)}, h_u^{(\ell-1)}\right)$$

The analogy:
- BP messages $m_{i \to j}$ <-> GNN messages $m_v^{(\ell)}$
- BP pairwise potential $\psi_{ij}$ <-> GNN message function $\psi^{(\ell)}$
- BP product of incoming messages <-> GNN sum/product aggregation
Implication: GNNs can be viewed as learned approximations to belief propagation. Where BP computes exact marginals on trees (and approximate marginals on loopy graphs), GNNs learn to compute useful representations that may or may not correspond to probabilistic marginals. For tasks that arise from probabilistic inference on graphs (decoding, error correction, satisfiability), architectures inspired by BP (like LDPC decoders implemented as GNNs) achieve state-of-the-art results.
C.3 Label Propagation vs GCN
Label propagation (Zhu et al., 2003) is a classical semi-supervised learning method that also propagates information across graph edges. The update rule:

$$Y^{(t+1)} = \alpha S Y^{(t)} + (1 - \alpha) Y^{(0)}$$

where $S = D^{-1/2} A D^{-1/2}$, $Y^{(t)}$ is the label matrix (soft predictions), $Y^{(0)}$ is the observed labels (zero for unlabeled nodes), and $\alpha \in (0, 1)$ controls the balance between propagation and fitting observed labels.

At convergence: $Y^{*} = (1 - \alpha)(I - \alpha S)^{-1} Y^{(0)}$ - a closed-form solution.
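A minimal NumPy sketch of this iteration (dense matrices, so small graphs only; the function name is illustrative):

```python
import numpy as np

def label_propagation(A, Y0, alpha=0.9, num_iters=50):
    """Iterate Y <- alpha * S @ Y + (1 - alpha) * Y0.

    A:  (n, n) dense adjacency matrix
    Y0: (n, C) one-hot observed labels; zero rows for unlabeled nodes
    """
    d = A.sum(axis=1).astype(float)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
    Y = Y0.astype(float).copy()
    for _ in range(num_iters):
        Y = alpha * (S @ Y) + (1 - alpha) * Y0
    return Y
    # Closed form (small n only): (1 - alpha) * inv(I - alpha * S) @ Y0
```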
GCN vs label propagation:
- Label propagation: linear, closed-form, uses only graph structure (no node features)
- GCN: nonlinear, gradient-based, uses both node features and graph structure
- For feature-rich graphs: GCN consistently outperforms LP
- For feature-poor graphs (only node IDs): LP is competitive and much cheaper
- APPNP (Gasteiger et al., 2019): decouples the GNN feature transformation from the propagation step, applying propagation as a post-processing operation similar to LP (see the sketch below). Achieves better performance than GCN on many node classification benchmarks by using more propagation steps without over-smoothing, thanks to the teleport coefficient $\alpha$ that keeps predictions anchored to the transformed features.
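A sketch of the APPNP-style propagation step (illustrative names; assumes the GCN propagation matrix $\hat{A}$ and the MLP outputs $H$ are already computed):

```python
import numpy as np

def appnp_propagate(A_hat, H, alpha=0.1, num_steps=10):
    """Personalized-PageRank-style propagation applied to MLP outputs.

    A_hat: (n, n) GCN propagation matrix D~^{-1/2} A~ D~^{-1/2}
    H:     (n, C) predictions from a graph-agnostic feature MLP
    alpha: teleport probability; anchors Z to H, preventing over-smoothing
    """
    Z = H.copy()
    for _ in range(num_steps):
        Z = (1.0 - alpha) * (A_hat @ Z) + alpha * H
    return Z
```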
C.4 Random Walk Methods and GNNs
Before GNNs, random walk methods (DeepWalk, Node2Vec) were the standard for learning node embeddings.
DeepWalk (Perozzi et al., 2014). Sample random walks from each node; treat the walk as a "sentence" and apply Skip-Gram (Word2Vec) to learn node embeddings that predict context nodes in the walk.
Node2Vec (Grover & Leskovec, 2016). Biased random walks with parameters $p$ (return parameter) and $q$ (in-out parameter) to balance BFS-like (structural equivalence) and DFS-like (homophily) exploration.
Connection to GNNs. The Skip-Gram objective for DeepWalk is implicitly equivalent to factorizing the matrix:

$$M = \log\!\left(\frac{\operatorname{vol}(G)}{T}\left(\sum_{r=1}^{T} P^{r}\right) D^{-1}\right) - \log b$$

where $P = D^{-1}A$ is the transition matrix and $P^{r}$ is the $r$-step transition matrix, $T$ is the context window size, and $b$ is the number of negative samples. This is related to the graph's random walk kernel.
GNNs can incorporate random walk information through RWSE (7.6) or through the unsupervised pretraining objective of GraphSAGE (5.4). Modern practice often combines: pretrain with random walk objectives (fast, scalable), then fine-tune with GNN layers (expressive, task-specific).
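A minimal DeepWalk-style walk sampler (illustrative; the resulting walks would then be fed as "sentences" to a Skip-Gram trainer such as gensim's Word2Vec):

```python
import numpy as np

def sample_walks(adj, walk_length=40, walks_per_node=10, rng=None):
    """Sample uniform random walks from every node.

    adj: adjacency list (list of neighbor lists).
    Returns a list of walks, each a list of node ids.
    """
    rng = rng or np.random.default_rng(0)
    walks = []
    for start in range(len(adj)):
        for _ in range(walks_per_node):
            walk = [start]
            for _ in range(walk_length - 1):
                nbrs = adj[walk[-1]]
                if not nbrs:          # dangling node: end the walk early
                    break
                walk.append(int(rng.choice(nbrs)))
            walks.append(walk)
    return walks
```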
C.5 Over-smoothing and Spectral Perspective
From the spectral perspective of 11-04, over-smoothing has a clean frequency interpretation. A graph signal $x$ can be decomposed into its graph Fourier components $\hat{x}_i = u_i^\top x$ (where $u_i$ is the $i$-th Laplacian eigenvector). The GCN propagation acts as a low-pass filter:

$$\hat{A} = I - \tilde{L}_{\mathrm{sym}}, \qquad g(\lambda_i) = 1 - \lambda_i$$

where $\lambda_i$ are the eigenvalues of $\tilde{L}_{\mathrm{sym}}$, the normalized Laplacian of the self-loop graph. After $K$ layers the response is $(1 - \lambda_i)^K$:
- Lowest frequency ($\lambda_1 = 0$, $u_1 \propto \tilde{D}^{1/2}\mathbf{1}$): constant component, amplified by factor $1$ - perfectly preserved
- Low frequencies ($\lambda_i \approx 0$): slowly varying signals, amplified by $(1 - \lambda_i)^K \approx 1$ - nearly preserved
- High frequencies ($\lambda_i$ near $\lambda_{\max}$): rapidly oscillating signals, suppressed by $(1 - \lambda_i)^K \to 0$ - exponentially suppressed
The class label signal is typically a low-frequency signal (connected nodes tend to have the same class in homophilic graphs). But even the label signal is eventually suppressed by many propagation steps - after enough layers, only the zero-frequency constant component survives and all node representations collapse.
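A small NumPy sketch that computes this per-frequency gain for a dense adjacency matrix (the helper name is illustrative):

```python
import numpy as np

def gcn_filter_response(A, num_layers):
    """Per-frequency gain of K GCN propagation steps: (1 - lambda_i)^K,
    where lambda_i are eigenvalues of the self-loop normalized Laplacian."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)
    d = A_tilde.sum(axis=1)
    A_hat = A_tilde / np.sqrt(np.outer(d, d))      # D~^{-1/2} A~ D~^{-1/2}
    lam = np.linalg.eigvalsh(np.eye(n) - A_hat)    # eigenvalues of L~_sym
    return lam, (1.0 - lam) ** num_layers          # gain per frequency

# The gain at lambda = 0 stays exactly 1, while gains at high
# frequencies shrink exponentially as num_layers grows.
```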
Optimal depth. The optimal number of layers balances two effects:
- More layers: larger receptive field, more node context, better long-range information
- More layers: stronger smoothing, loss of high-frequency discriminative features
For a graph with Fiedler value $\lambda_2$ (the smallest non-zero Laplacian eigenvalue), the optimal depth is approximately $L \approx 1/\lambda_2$ (the mixing time of the random walk). For clustered graphs with small $\lambda_2$, many propagation layers can be used before collapse; for expander-like graphs with large $\lambda_2$, only a few layers are optimal.
Appendix D: GNN Training Recipes and Engineering Tricks
D.1 Data Preprocessing for Graph ML
Node feature normalization. Unlike image pixels (naturally in $[0, 1]$), node features in graphs span wildly different scales. Atom features in molecular graphs: atomic number (1-118), formal charge (roughly $-2$ to $+2$), number of hydrogens (0-4). Normalizing to zero mean, unit variance per feature dimension is essential.
Edge feature preprocessing. Bond lengths in molecules range from roughly 1.2 Å (triple bonds) to 1.5 Å (single bonds). If used as raw edge features, their small absolute range means they have little influence relative to bond-type one-hot features. Apply normalization or use learned embeddings.
Graph-level normalization. For graph regression tasks (predicting molecular energy, which scales with the number of atoms), normalize the target by dividing by $n$ (the atom count). Otherwise, larger molecules trivially have larger predicted values.
Degree-based normalization. When using sum aggregation, node representation magnitudes can grow with degree. Apply degree normalization as part of preprocessing: scale node features by $d_v^{-1/2}$ before input, matching the GCN normalization.
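A minimal preprocessing sketch combining the z-score and degree normalizations above (illustrative; assumes dense features and precomputed degrees):

```python
import numpy as np

def preprocess_node_features(X, degrees):
    """Z-score node features per dimension, then apply d_v^{-1/2} scaling.

    X:       (n, d) raw node features
    degrees: (n,) node degrees
    """
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8  # avoid divide-by-zero
    X = (X - mu) / sigma                              # zero mean, unit variance
    scale = np.where(degrees > 0, degrees, 1.0) ** -0.5
    return X * scale[:, None]                         # GCN-style degree scaling
```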
D.2 Positional Encodings: Practical Considerations
LapPE sign ambiguity. Eigenvectors are defined only up to sign: if $u_i$ is an eigenvector of $L$, so is $-u_i$. Naive use of eigenvectors as PEs means two identical graphs with differently-signed PEs produce different GNN outputs - the model is not invariant to this arbitrary sign choice.
Fix (Lim et al., 2022): during training, randomly flip the sign of each eigenvector independently at each mini-batch. This forces the GNN to be invariant to sign choice. At inference, use a consistent sign (e.g., always choose the sign such that the first non-zero element is positive).
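A sketch of both sides of this fix (illustrative helper names):

```python
import numpy as np

def random_sign_flip(lap_pe, rng=None):
    """Training-time augmentation: flip each eigenvector column's sign
    independently so the model becomes invariant to the sign choice."""
    rng = rng or np.random.default_rng()
    signs = rng.choice([-1.0, 1.0], size=lap_pe.shape[1])
    return lap_pe * signs[None, :]

def canonical_signs(lap_pe, eps=1e-8):
    """Inference-time convention: make the first element with magnitude
    above eps positive in every eigenvector column."""
    pe = lap_pe.copy()
    for j in range(pe.shape[1]):
        nz = np.flatnonzero(np.abs(pe[:, j]) > eps)
        if nz.size and pe[nz[0], j] < 0:
            pe[:, j] *= -1
    return pe
```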
RWSE efficiency. Computing $(P^k)_{vv}$ for $k = 1, \dots, K$ requires matrix powers - expensive for large graphs. Efficient computation: start random walks from node $v$; estimate the return probability by averaging over random walkers:

$$\widehat{(P^k)_{vv}} = \frac{1}{m} \sum_{j=1}^{m} \mathbb{1}\!\left[\text{walker } j \text{ is back at } v \text{ after } k \text{ steps}\right]$$

With $m$ walkers per node, this gives RWSE estimates with error $O(1/\sqrt{m})$ in $O(nmK)$ time.
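A Monte Carlo RWSE sketch along these lines (illustrative; uses adjacency lists and slow pure-Python loops that a real implementation would vectorize or batch):

```python
import numpy as np

def rwse_monte_carlo(adj, num_steps, num_walkers=100, rng=None):
    """Estimate (P^k)_{vv} for k = 1..num_steps by simulating random
    walkers and recording how often each walker returns to its start."""
    rng = rng or np.random.default_rng(0)
    n = len(adj)
    est = np.zeros((n, num_steps))
    for v in range(n):
        for _ in range(num_walkers):
            cur = v
            for step in range(1, num_steps + 1):
                nbrs = adj[cur]
                if not nbrs:
                    break
                cur = int(rng.choice(nbrs))
                if cur == v:
                    est[v, step - 1] += 1.0
    return est / num_walkers   # est[v, k-1] estimates (P^k)_{vv}
```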
Eigenvalue degeneracy. If two eigenvalues of $L$ are equal ($\lambda_i = \lambda_{i+1}$), the corresponding eigenspace is (at least) two-dimensional and any orthonormal basis for it is valid. This means LapPE is non-unique for graphs with symmetry (e.g., complete graphs, regular graphs). Degenerate eigenvectors are not canonically defined - they should either be excluded from PEs or handled with random rotation augmentation.
D.3 Loss Functions for Graph Tasks
Node classification. Cross-entropy over labeled nodes:

$$\mathcal{L} = -\sum_{v \in \mathcal{V}_{\text{train}}} \sum_{c=1}^{C} y_{vc} \log \hat{y}_{vc}$$
For class-imbalanced graphs (e.g., rare fraud nodes), use focal loss or weighted cross-entropy.
Link prediction. Binary cross-entropy with negative sampling:

$$\mathcal{L} = -\sum_{(u,v) \in E^{+}} \log \sigma\!\left(z_u^\top z_v\right) - \sum_{(u,v) \in E^{-}} \log\!\left(1 - \sigma\!\left(z_u^\top z_v\right)\right)$$

where $E^{+}$ are observed edges and $E^{-}$ are sampled non-edges.
Negative sampling ratio (how many non-edges per positive edge): 1:1 to 1:5 is standard. Hard negative mining (sampling non-edges that are structurally close to positive edges) improves embedding quality at the cost of more complex sampling.
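A minimal PyTorch sketch of this loss (illustrative; assumes GNN embeddings `z` and pre-sampled negative edges, here at a 1:1 ratio):

```python
import torch
import torch.nn.functional as F

def link_prediction_loss(z, pos_edges, neg_edges):
    """BCE with negative sampling on dot-product edge scores.

    z:         (n, d) node embeddings from the GNN
    pos_edges: (2, P) indices of observed edges
    neg_edges: (2, Q) indices of sampled non-edges (e.g., Q = P for 1:1)
    """
    pos_score = (z[pos_edges[0]] * z[pos_edges[1]]).sum(dim=-1)
    neg_score = (z[neg_edges[0]] * z[neg_edges[1]]).sum(dim=-1)
    return (F.binary_cross_entropy_with_logits(pos_score,
                                               torch.ones_like(pos_score))
          + F.binary_cross_entropy_with_logits(neg_score,
                                               torch.zeros_like(neg_score)))
```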
Graph regression. Mean absolute error (MAE) for quantum chemistry (physically meaningful, outlier-robust) or mean squared error (MSE) for property prediction where large errors are particularly bad. Normalize targets per-dataset: $y' = (y - \mu)/\sigma$ for training; inverse-transform predictions at evaluation.
Contrastive learning on graphs. Self-supervised GNNs (GraphCL, SimGRACE) maximize agreement between augmented views of the same graph:

$$\mathcal{L}_i = -\log \frac{\exp\!\left(\operatorname{sim}(z_i, z_i') / \tau\right)}{\sum_{j} \exp\!\left(\operatorname{sim}(z_i, z_j') / \tau\right)}$$

where $\operatorname{sim}$ is cosine similarity, $\tau$ is a temperature, and $z_i'$ is the embedding of an augmented view (e.g., edge dropout, node feature masking). Graph augmentations must preserve semantic meaning - dropping a bond in a molecule changes its identity, so augmentation must be domain-aware.
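A sketch of the NT-Xent form of this loss in PyTorch (illustrative; treats matching rows of the two view batches as positives):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    """Contrastive loss between two augmented views of a batch of graphs.

    z1, z2: (B, d) graph embeddings of view 1 and view 2 (row i: same graph)
    """
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / tau                             # (B, B) cosine sims
    targets = torch.arange(z1.size(0), device=sim.device)  # diagonal positives
    return F.cross_entropy(sim, targets)
```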
D.4 Debugging GNNs in Practice
Symptom: GNN performs worse than MLP on node features alone. Cause: the graph structure is not informative for the task (e.g., low homophily), or the GNN is over-smoothing and destroying the discriminative node features. Fix: verify the edge homophily ratio $h$ (the fraction of edges connecting same-label nodes); if $h < 0.5$, graph structure likely hurts. Use heterophilic GNNs (FAGCN, H2GCN) that can handle graphs where connected nodes have different labels.
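A one-liner to check the homophily ratio before training (illustrative helper):

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges whose endpoints share a label.

    edges:  (2, m) array of edge indices
    labels: (n,) integer node labels
    """
    return float(np.mean(labels[edges[0]] == labels[edges[1]]))
```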
Symptom: Training loss decreases but validation loss explodes. Cause: overfitting, likely due to GNN memorizing the training graph topology rather than learning transferable patterns. Fix: add dropout on node features (not just edges), reduce model capacity, add weight decay. For transductive semi-supervised learning, ensure the train/val split is performed correctly.
Symptom: All node representations collapse to the same vector. Cause: over-smoothing (too many layers) or numerical instability (exploding/vanishing gradients through repeated multiplication by $\hat{A}$). Fix: check the Dirichlet energy after each layer; use PairNorm or GCNII; add gradient clipping.
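A quick Dirichlet-energy probe to run after each layer (illustrative helper):

```python
import numpy as np

def dirichlet_energy(H, edges):
    """E(H) = sum over edges of ||h_u - h_v||^2; near-zero energy after a
    layer signals over-smoothing (representations collapsing together)."""
    diffs = H[edges[0]] - H[edges[1]]
    return float((diffs ** 2).sum())
```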
Symptom: GAT attention weights are all uniform ($\alpha_{vu} \approx 1/|N(v)|$). Cause: the attention vector $a$ has collapsed - the model learned that uniform attention is safest. Often occurs with high learning rates or without proper initialization. Fix: initialize $a$ with Xavier initialization; use a smaller learning rate for attention parameters; check for gradient vanishing in LeakyReLU (use a larger negative slope, e.g., $0.2$).
Symptom: Training is extremely slow. Cause: running GNN on the full large graph every step (full-batch training). Fix: switch to mini-batch training with neighbor sampling (GraphSAGE), graph partitioning (Cluster-GCN), or subgraph sampling (GraphSAINT). Profile GPU utilization - if below 70%, the bottleneck is data loading, not computation.
Appendix E: Benchmark Datasets and Evaluation
E.1 Standard Node Classification Benchmarks
| Dataset | Nodes | Edges | Features | Classes | Homophily | Task |
|---|---|---|---|---|---|---|
| Cora | 2,708 | 5,429 | 1,433 | 7 | 0.81 | Transductive node clf |
| CiteSeer | 3,327 | 4,732 | 3,703 | 6 | 0.74 | Transductive node clf |
| PubMed | 19,717 | 44,338 | 500 | 3 | 0.80 | Transductive node clf |
| Amazon-Photo | 7,650 | 119,081 | 745 | 8 | 0.83 | Transductive node clf |
| Chameleon | 2,277 | 36,101 | 2,325 | 5 | 0.23 | Heterophilic node clf |
| Squirrel | 5,201 | 217,073 | 2,089 | 5 | 0.22 | Heterophilic node clf |
| OGB-arxiv | 169,343 | 1,166,243 | 128 | 40 | 0.65 | Large-scale node clf |
| OGB-products | 2,449,029 | 61,859,140 | 100 | 47 | 0.81 | Large-scale node clf |
Evaluation protocol. Standard split for Cora/CiteSeer/PubMed: 20 nodes per class for training, 500 for validation, 1000 for test (following Kipf & Welling, 2017). OGB uses official dataset-specific splits with multiple random seeds for reliable comparison.
E.2 Standard Graph Classification Benchmarks
| Dataset | Graphs | Avg. nodes | Avg. edges | Classes | Domain |
|---|---|---|---|---|---|
| MUTAG | 188 | 17.9 | 19.8 | 2 | Molecular (mutagenicity) |
| PROTEINS | 1,113 | 39.1 | 72.8 | 2 | Protein function |
| IMDB-B | 1,000 | 19.8 | 96.5 | 2 | Social (movie collaboration) |
| COLLAB | 5,000 | 74.5 | 2,457.8 | 3 | Social (research collaboration) |
| OGB-molhiv | 41,127 | 25.5 | 27.5 | 2 | Molecular (HIV inhibition) |
| OGB-molpcba | 437,929 | 26.0 | 28.1 | 128 | Molecular (bioassays) |
| QM9 | 130,831 | 18.0 | 18.6 | - | Molecular (12 quantum properties) |
Evaluation. Graph classification: 10-fold cross-validation accuracy for TU datasets; ROC-AUC for OGB-molhiv and Average Precision for OGB-molpcba. Molecular regression (QM9): mean absolute error (MAE) per property; target units are property-dependent (e.g., eV, Debye, Å).
E.3 Long-Range Graph Benchmarks (LRGB)
Introduced by Dwivedi et al. (2022) to specifically test long-range dependency learning - tasks where local GNNs fail but graph transformers succeed.
| Dataset | Size | Task | Long-range property |
|---|---|---|---|
| Peptides-func | 15.5K graphs, avg 150 nodes | Graph multi-label clf | Functional class depends on full peptide structure |
| Peptides-struct | 15.5K graphs, avg 150 nodes | Graph regression | 3D structural properties (end-to-end distances) |
| PCQM-Contact | 529K graphs | Link prediction | Molecular contact maps (spatial proximity) |
| COCO-SP | 123K graphs, avg 476 nodes | Node segmentation | Pixel graphs; semantic context is long-range |
GNN vs Graph Transformer gap. On Peptides-func (AP metric):
- GCN: 0.5930
- GIN: 0.6621
- GPS (GINE + Transformer): 0.7470
- GPS + RWSE: 0.7538
The gap between GIN and GPS demonstrates the clear advantage of global attention for long-range tasks.
E.4 Choosing the Right Architecture
ARCHITECTURE SELECTION GUIDE
========================================================================
Task requires long-range dependencies?
+-- YES -> Graph Transformer (GPS, Graphormer)
|          or graph rewiring + MPNN
+-- NO  -> continue below
Graph has millions of nodes?
+-- YES -> GraphSAGE (inductive) or Cluster-GCN/GraphSAINT (large)
+-- NO  -> continue below
Need to classify entire graphs?
+-- YES -> GIN + sum pooling (expressiveness matters)
|          or DiffPool/SAGPool for hierarchical structure
+-- NO  -> continue below
Need interpretable edge importance?
+-- YES -> GAT or GATv2
+-- NO  -> continue below
Simple strong baseline needed first?
+-- YES -> 2-layer GCN (fast to implement, often competitive)
Rich edge features (bond types, relation types)?
+-- YES -> MPNN/GINE with edge message functions
========================================================================
Appendix F: Notation Summary for This Section
| Symbol | Definition | First Used |
|---|---|---|
| $G = (V, E)$ | Undirected graph with vertex set $V$, edge set $E$ | 2.1 |
| $n$, $m$ | Number of nodes and edges | 2.1 |
| $X \in \mathbb{R}^{n \times d}$ | Node feature matrix; row $v$ is $x_v$ | 2.1 |
| $N(v)$ | Neighborhood of node $v$: $\{u : (u, v) \in E\}$ | 1.2 |
| $d_v$ | Degree of node $v$ | 2.1 |
| $h_v^{(\ell)}$ | Representation of node $v$ at layer $\ell$ | 3.1 |
| $H^{(\ell)}$ | All node representations at layer $\ell$ | 4.2 |
| $A$ | Adjacency matrix | 2.1 |
| $\tilde{A} = A + I$ | Adjacency with self-loops | 4.2 |
| $D$ | Degree matrix: $D_{vv} = d_v$ | 4.2 |
| $\tilde{D}$ | Degree matrix of $\tilde{A}$ | 4.2 |
| $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ | GCN propagation matrix | 4.2 |
| $L = D - A$ | Unnormalized Laplacian | 8.2 |
| $L_{\mathrm{sym}} = I - D^{-1/2} A D^{-1/2}$ | Normalized (symmetric) Laplacian | 10.2 |
| $E(H)$ | Dirichlet energy of $H$ | 8.2 |
| $m_v^{(\ell)}$ | Aggregated message at node $v$, layer $\ell$ | 3.1 |
| $\psi^{(\ell)}$ | Message function at layer $\ell$ | 3.1 |
| $\phi^{(\ell)}$ | Update function at layer $\ell$ | 3.1 |
| $R$ | Readout function for graph-level tasks | 3.1 |
| $W^{(\ell)}$ | Weight matrix at layer $\ell$ | 4.2 |
| $\alpha_{vu}$ | GAT attention coefficient from $u$ to $v$ | 6.2 |
| $c_v^{(t)}$ | WL color of node $v$ at iteration $t$ | 7.2 |
| $\epsilon$ | GIN learnable mixing parameter | 7.4 |
| $\mathrm{MAD}$ | Mean Average Distance of node representations | 8.1 |
| $S$ | DiffPool soft cluster assignment matrix | 9.3 |
| $p_v$ | Positional encoding of node $v$ | 10.2 |
| $U_k$ | First $k$ eigenvectors of $L$ | 10.2 |
| $d_{\mathrm{SP}}(u, v)$ | Shortest-path distance between $u$ and $v$ | 10.4 |
| $b_d$ | Graphormer spatial encoding bias at distance $d$ | 10.4 |
| $P$ | $n \times n$ permutation matrix | 2.3 |
| $\pi$ | Stationary distribution of random walk on $G$ | 8.1 |
| $\Theta$ | Graph neural tangent kernel | C.1 |
| $k_{\mathrm{WL}}$ | Weisfeiler-Leman graph kernel | C.1 |
All vectors are column vectors by default ($x \in \mathbb{R}^d$ means $x$ is $d \times 1$). Matrix norms: $\|\cdot\|_F$ (Frobenius), $\|\cdot\|_2$ (spectral). Notation follows docs/NOTATION_GUIDE.md throughout.
Appendix G: Further Reading
Foundational Papers (Must-Read)
- Scarselli et al. (2009) - "The Graph Neural Network Model." IEEE Transactions on Neural Networks. Original GNN formulation with fixed-point iteration. Historical foundation.
- Defferrard, Bresson & Vandergheynst (2016) - "Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering." NeurIPS 2016. ChebNet: polynomial spectral filters, making spectral GNNs computationally feasible.
- Kipf & Welling (2017) - "Semi-Supervised Classification with Graph Convolutional Networks." ICLR 2017. GCN: the simplification that started the field. Short, clear, highly cited.
- Hamilton, Ying & Leskovec (2017) - "Inductive Representation Learning on Large Graphs." NeurIPS 2017. GraphSAGE: inductive learning, neighbor sampling, unsupervised objectives.
- Gilmer et al. (2017) - "Neural Message Passing for Quantum Chemistry." ICML 2017. MPNN framework: the unified abstraction. The formalism most GNN papers use today.
- Velickovic et al. (2018) - "Graph Attention Networks." ICLR 2018. GAT: learned attention weights for graphs. Widely deployed in production.
- Xu et al. (2019) - "How Powerful are Graph Neural Networks?" ICLR 2019. GIN + WL expressiveness theorem: the theoretical foundation of GNN expressiveness.
Depth Pathologies
- Li et al. (2018) - "Deeper Insights into Graph Convolutional Networks for Semi-Supervised Classification." AAAI 2018. First formal analysis of over-smoothing.
- Alon & Yahav (2021) - "On the Bottleneck of Graph Neural Networks and its Practical Implications." ICLR 2021. Over-squashing: Jacobian analysis and graph rewiring motivation.
- Chen et al. (2020) - "Simple and Deep Graph Convolutional Networks." ICML 2020. GCNII: initial residual + identity mapping for training 64-layer GCNs.
Graph Transformers
- Ying et al. (2021) - "Do Transformers Really Perform Bad for Graph Representation?" NeurIPS 2021. Graphormer: spatial, centrality, and edge encodings for graph transformers. OGB-LSC winner.
- Rampasek et al. (2022) - "Recipe for a General, Powerful, Scalable Graph Transformer." NeurIPS 2022. GPS framework: MPNN + Transformer combination with arbitrary PEs.
- Dwivedi et al. (2022) - "Long Range Graph Benchmark." NeurIPS 2022 Datasets. LRGB: benchmarks specifically designed to test long-range dependency learning.
Expressiveness
- Morris et al. (2019) - "Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks." AAAI 2019. $k$-GNN: implementing $k$-WL as a neural network.
- Brody, Alon & Yahav (2022) - "How Attentive are Graph Attention Networks?" ICLR 2022. GATv2: fixing the static attention problem in GAT.
Applications
- Jumper et al. (2021) - "Highly Accurate Protein Structure Prediction with AlphaFold." Nature. AlphaFold 2: the Evoformer and structure module. GNNs at scale for biology.
- Ying et al. (2018) - "Graph Convolutional Neural Networks for Web-Scale Recommender Systems." KDD 2018. PinSage: GraphSAGE at 3 billion nodes. Engineering lessons for production GNNs.
- Edge et al. (2024) - "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." Microsoft Research. Graph RAG: GNNs + LLMs for document understanding.
Textbooks and Surveys
- Hamilton (2020) - Graph Representation Learning. Synthesis Lectures on AI. The definitive textbook on GNNs; covers theory and practice comprehensively.
- Wu et al. (2022) - "A Comprehensive Study on Large-Scale Graph Training." NeurIPS 2022. Empirical comparison of full-batch, neighbor sampling, and subgraph sampling methods.