Graph Neural Networks, Part 1: Intuition through GraphSAGE (Sections 1-5)
1. Intuition
1.1 The Core Challenge: Learning on Non-Euclidean Data
Standard deep learning architectures - convolutional networks, recurrent networks, transformers - were designed for data that lives on regular, well-ordered domains. An image is a grid: every pixel has exactly the same number of neighbors, arranged in the same spatial pattern. A sentence is a sequence: every token has a left neighbor and a right neighbor, and positions are canonically ordered. These regular structures make convolution and positional encodings natural.
Real-world data is rarely so obliging. A molecule has atoms connected by bonds in a pattern determined by chemistry, not a grid. A social network has users connected by friendships in patterns determined by behavior and geography. A knowledge base has concepts connected by relations in patterns determined by the world's semantic structure. A protein's function is determined by the 3D arrangement of amino acid residues, which a graph can represent but a grid cannot. These are non-Euclidean data structures: they have no natural notion of translation, no canonical coordinate system, no uniform neighborhood size.
The challenge GNNs solve is: how do we define a learnable function on a graph that generalizes across different graphs, respects the graph's structure, and can be trained end-to-end by gradient descent? This is harder than it sounds. The same function applied to two isomorphic graphs (graphs with the same structure but differently labeled vertices) should produce the same output. The function must handle graphs of arbitrary size and arbitrary vertex degrees. And it must be expressive enough to detect structural patterns (triangles, paths of length $k$, specific subgraph motifs) that determine the graph's properties.
For AI: In 2026, GNNs are production components in AlphaFold 2's structure module (residues as nodes, spatially defined edges), PinSage (users and items as nodes, interactions as edges, over a graph of roughly 3 billion nodes), and Microsoft's Graph RAG system (entities and relations from documents as knowledge graphs for retrieval-augmented generation). Understanding GNNs is not optional for anyone working with structured, relational, or molecular data.
1.2 The Message-Passing Insight
The insight that makes GNNs possible is conceptually simple: a node's representation should depend on the representations of its neighbors, and this dependency should be computed iteratively.
Consider a social network node - a person. Their social identity depends on whom they know. But whom they know also depends on whom those people know, and so on. The first round of "listening to neighbors" gives each person a sense of their immediate social circle. The second round gives a sense of friends-of-friends. After $k$ rounds, each node's representation encodes the structure of its $k$-hop neighborhood.
This is message passing: in each layer, every node sends a message to its neighbors, every node aggregates the messages it receives, and every node updates its representation based on the aggregated messages. Formally, for node $v$ at layer $l$:

$$h_v^{(l+1)} = \mathrm{UPDATE}^{(l)}\!\left(h_v^{(l)},\ \mathrm{AGGREGATE}^{(l)}\!\left(\{\, h_u^{(l)} : u \in \mathcal{N}(v) \,\}\right)\right)$$

where $\mathcal{N}(v)$ is the set of neighbors of $v$, $h_v^{(l)}$ is node $v$'s representation at layer $l$, and $h_v^{(0)} = x_v$ is the initial node feature.
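To make this template concrete, here is a minimal sketch of one message-passing layer in plain NumPy, assuming mean aggregation and separate self/neighbor weight matrices (illustrative choices on our part, not a prescribed design):

```python
import numpy as np

def message_passing_layer(H, neighbors, W_self, W_neigh):
    """One layer: AGGREGATE neighbor features (mean), then UPDATE with ReLU.

    H: (n, d) node representations; neighbors[v]: list of neighbor indices.
    """
    n, d = H.shape
    H_new = np.zeros((n, W_self.shape[1]))
    for v in range(n):
        if neighbors[v]:
            m_v = H[neighbors[v]].mean(axis=0)   # AGGREGATE over the neighbor set
        else:
            m_v = np.zeros(d)                    # isolated node: no incoming messages
        H_new[v] = np.maximum(0.0, H[v] @ W_self + m_v @ W_neigh)  # UPDATE
    return H_new

# A 4-node path graph 0-1-2-3 with random 8-dim features
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))
neighbors = [[1], [0, 2], [1, 3], [2]]
W_self, W_neigh = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
H1 = message_passing_layer(H, neighbors, W_self, W_neigh)   # 1-hop information
H2 = message_passing_layer(H1, neighbors, W_self, W_neigh)  # 2-hop information
```

Stacking the layer twice, as in the last two lines, is exactly the "rounds of listening" described above: after two applications each row of `H2` depends on the node's 2-hop neighborhood.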
The connection to BFS: Breadth-first search from node $v$ explores at step $k$ exactly the nodes reachable in $k$ steps. A $k$-layer GNN computes a representation that depends on exactly the nodes reachable in $k$ steps. BFS is the non-learned, binary-valued version of message passing: it answers "which nodes are in the $k$-hop neighborhood?" whereas a GNN answers "what is the learned summary of the $k$-hop neighborhood?"
For AI: This message-passing structure is the graph analogue of the convolutional receptive field. A pixel in a CNN "sees" a $(2k+1) \times (2k+1)$ grid after $k$ convolutional layers (with $3 \times 3$ kernels). A node in a GNN "sees" its $k$-hop neighborhood after $k$ message-passing layers. The critical difference: the CNN's receptive field grows quadratically in $k$ (for a 2D grid), while the GNN's receptive field can grow exponentially in $k$ (for a well-connected graph), which creates both expressiveness and scalability challenges we will study in Section 8.
1.3 Two Perspectives: Spectral vs Spatial
Graph convolution was first defined in the spectral domain. The key idea from 11-04 is that the graph Laplacian admits an eigendecomposition $L = U \Lambda U^\top$, and the columns of $U$ form a graph Fourier basis. A graph signal $x \in \mathbb{R}^n$ (one scalar per node) is filtered by applying a function to its Fourier coefficients:

$$g_\theta * x = U\, g_\theta(\Lambda)\, U^\top x$$

where $g_\theta(\Lambda)$ is a diagonal matrix of filter coefficients. Full spectral convolution requires computing the full eigendecomposition - $O(n^3)$ cost, infeasible for large graphs.
Recall from 11-04 Section 7.5: Kipf & Welling (2017) derived the GCN layer as a first-order Chebyshev polynomial approximation to spectral filtering, further simplified by setting $\theta = \theta_0 = -\theta_1$ and adding self-loops. This gives the propagation rule

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

- see 11-04 Section 7.5 for the full derivation.
Once we arrive at the GCN layer, the spectral scaffolding can be discarded and the layer reinterpreted spatially: it is simply normalized aggregation of neighbor features followed by a linear transformation and nonlinearity. This spatial reinterpretation is what makes GNNs extensible: one can design new aggregation functions, learn attention weights, incorporate edge features, and handle directed graphs - none of which fit neatly into the spectral framework.
The two perspectives are complementary:
- Spectral: principled derivation, frequency interpretation, connection to graph signal processing; limited to fixed graphs, requires eigendecomposition or polynomial approximation
- Spatial: flexible, scalable, inductive, handles heterogeneous graphs; less theoretically constrained, harder to analyze convergence
Modern GNN research lives primarily in the spatial perspective, with spectral theory providing theoretical grounding and positional encodings (Section 10).
1.4 What Makes Graphs Different from Other Data
Four properties of graph-structured data require fundamentally different architectural choices:
1. Variable and irregular neighborhoods. A node in a molecule may have 1 bond (hydrogen) or 4 bonds (carbon). A node in a social network may have 3 friends or 3,000. Standard convolution assumes a fixed-size, fixed-pattern receptive field. GNNs must aggregate over neighborhoods of arbitrary size - which forces a choice of aggregation function (sum, mean, max, attention) that is insensitive to neighborhood size.
2. No canonical node ordering. An image has a canonical pixel ordering (row-major). A sentence has a canonical word ordering (left-to-right). A graph has no such ordering. If we relabel the nodes of a graph, we get the same graph - just with different indices. This means a GNN must be permutation equivariant (the output node representations permute consistently with any relabeling of input nodes) and any graph-level prediction must be permutation invariant (the same under any relabeling). This rules out any architecture that depends on node indices.
3. Structural information is in the topology. Two molecules with identical atom composition but different bond structures (structural isomers) have very different chemical properties. The GNN must detect this topological difference from the adjacency structure alone. This is the expressiveness question of Section 7: which topological patterns can a GNN distinguish?
4. Heterogeneous graphs. Real graphs often have multiple types of nodes and edges (knowledge graphs: Person --worksAt--> Company, Person --bornIn--> City). Relational GNNs (R-GCN) and heterogeneous GNNs handle this by type-specific transformation matrices. This section focuses on homogeneous graphs; heterogeneous extensions are discussed in Section 12.2.
1.5 Historical Timeline (2009-2026)
GRAPH NEURAL NETWORK HISTORICAL TIMELINE
========================================================================
2009 Scarselli et al. - Original GNN: fixed-point iteration on graphs
2013 Bruna et al. - Spectral graph convolution (full eigendecomp.)
2016 Defferrard et al. - ChebNet: Chebyshev polynomial spectral filters
2016 Li et al. - Gated GNN: GRU-based update functions
2017 Kipf & Welling - GCN: simplified spectral GNN for node classification
2017 Hamilton et al. - GraphSAGE: inductive learning with neighbor sampling
2017 Gilmer et al. - MPNN: unified message-passing framework
2018 Velickovic et al. - GAT: attention-weighted neighborhood aggregation
2019 Xu et al. - GIN + WL expressiveness theorem
2018 Ying et al. - DiffPool: differentiable hierarchical pooling
2019 Chiang et al. - Cluster-GCN: scalable training by graph partitioning
2020 Rong et al. - DropEdge: data augmentation against over-smoothing
2021 Ying et al. - Graphormer: transformer for graphs (OGB-LSC winner)
2021 Brody et al. - GATv2: fixed dynamic attention for GAT
2022 Rampasek et al. - GPS: general, powerful, scalable graph transformer
2023 Chen et al. - Graph Mamba: state-space models on graphs
2024 Microsoft - Graph RAG: GNNs for retrieval-augmented generation
2024-26 - LLM+GNN: language models with graph-structured memory
========================================================================
2. Formal Setup: Graphs as Data
2.1 Graph with Node and Edge Features
A featured graph is a tuple $G = (V, E, X, X^e)$ where:
- $V = \{1, \dots, n\}$ - the vertex set, $n$ nodes
- $E \subseteq V \times V$ - the edge set, $m$ edges
- $X \in \mathbb{R}^{n \times d}$ - node feature matrix: row $x_v$ is the feature vector of node $v$
- $X^e \in \mathbb{R}^{m \times d_e}$ - edge feature matrix: row $e_{uv}$ is the feature of edge $(u, v)$

The adjacency structure is encoded either as a dense matrix $A \in \{0,1\}^{n \times n}$ or (for large sparse graphs) as an edge list - a pair of index tensors $(s, t) \in \mathbb{N}^m \times \mathbb{N}^m$ where $s_k$ is the source and $t_k$ is the target of edge $k$.
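As a concrete example, here is a tiny featured graph in both representations, using plain NumPy (the variable names are ours, not a library convention):

```python
import numpy as np

# A tiny featured graph: water-like, oxygen bonded to two hydrogens.
# Node features: one-hot atom type [O, H].
X = np.array([[1.0, 0.0],    # node 0: oxygen
              [0.0, 1.0],    # node 1: hydrogen
              [0.0, 1.0]])   # node 2: hydrogen

# Edge-list form: undirected bonds stored in both directions
src = np.array([0, 1, 0, 2])   # s_k: source of edge k
dst = np.array([1, 0, 2, 0])   # t_k: target of edge k

# Equivalent dense adjacency (feasible only for small graphs)
A = np.zeros((3, 3))
A[src, dst] = 1.0
```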
Examples:
| Domain | Node features | Edge features | Task |
|---|---|---|---|
| Molecular graph | Atom type (one-hot), charge, hybridization | Bond type (single/double/aromatic), distance | Predict solubility |
| Social network | User demographics, activity history | Interaction frequency, timestamp | Link prediction |
| Knowledge graph | Entity type embedding | Relation type embedding | Triple completion |
| Citation network | Bag-of-words of paper abstract | None (unweighted) | Classify research area |
| Code AST | Token type, identifier embedding | Syntactic relation type | Bug detection |
Non-examples of graph data:
- An image stored as a grid: regular structure makes CNN superior; GNNs are overkill and less efficient
- A flat table of independent samples: no edges, no relational structure; standard ML applies
- A time series: sequential structure with canonical ordering; RNN/Transformer preferred unless interactions are truly non-sequential
2.2 Learning Tasks on Graphs
GNNs support three levels of prediction:
Node-level tasks: predict a label for each node (or a subset). The GNN produces a node embedding $h_v$ for every node, and a final classifier maps $h_v \mapsto \hat{y}_v$.
- Semi-supervised node classification: most nodes are unlabeled; a small fraction have known labels. The GNN propagates information from labeled to unlabeled nodes through the graph structure. Classic example: Cora/CiteSeer citation networks (Kipf & Welling, 2017).
- Node regression: predict a continuous value per node. Example: predicting traffic speed at road intersections.
Edge-level tasks: predict a property of a pair of nodes $(u, v)$, whether or not the edge exists.
- Link prediction: does edge $(u, v)$ exist? Used in recommendation (predict user-item interaction) and knowledge graph completion (predict missing relation). The GNN produces $h_u$ and $h_v$; the prediction is $\hat{y}_{uv} = f(h_u, h_v)$ (e.g., inner product or MLP).
- Edge classification: predict the type of an existing edge. Example: classify protein-protein interaction type (physical, genetic, etc.).
Graph-level tasks: predict a single label for an entire graph $G$.
- Graph classification: classify molecular graphs as toxic/non-toxic; classify citation subgraphs by topic. Requires a readout function $R$ that aggregates all node embeddings into a single graph embedding $h_G = R(\{\, h_v : v \in V \,\})$.
- Graph regression: predict a scalar property of the graph (e.g., binding affinity, quantum energy). Most molecular property prediction benchmarks (OGB-molhiv, QM9) are graph regression tasks.
2.3 Permutation Invariance and Equivariance
Definition (Permutation Equivariance). Let $\Pi_n$ be the group of $n \times n$ permutation matrices. A function $f: \mathbb{R}^{n \times d} \times \{0,1\}^{n \times n} \to \mathbb{R}^{n \times d'}$ is permutation equivariant if for every permutation matrix $P \in \Pi_n$:

$$f(PX,\, PAP^\top) = P\, f(X, A)$$

That is, permuting the node ordering in the input permutes the output representations in the same way. Node-level GNNs must be permutation equivariant: if we relabel the nodes, the node embeddings should permute consistently.

Definition (Permutation Invariance). A function $f: \mathbb{R}^{n \times d} \times \{0,1\}^{n \times n} \to \mathbb{R}^{d'}$ is permutation invariant if for every permutation matrix $P \in \Pi_n$:

$$f(PX,\, PAP^\top) = f(X, A)$$

Graph-level GNNs must be permutation invariant: the prediction for a graph is the same regardless of how we number its vertices.
Why this constrains architecture. Consider a simple approach: concatenate all node features into a vector and pass it through an MLP. This is not permutation equivariant - shuffling the nodes changes the input vector and changes the output. Similarly, computing any function of the adjacency matrix $A$ that uses specific row/column indices (like the entry $A_{12}$) is not permutation invariant.
The only permutation-equivariant functions that aggregate neighborhoods are those that apply a fixed function to the set of neighbor features - which is precisely what GNN aggregation functions do. Functions of sets must be insensitive to the order of elements; sum, mean, and max all satisfy this; concatenation does not.
Formal example. Let $f(X, A)_v = \sigma\!\left(\sum_{u \in \mathcal{N}(v)} W x_u\right)$ (GCN-style). This is equivariant because the sum ranges over a set - reordering the neighbors does not change the sum. But $f(X, A)_v = \mathrm{MLP}\!\left([\, x_{u_1} \,\|\, x_{u_2} \,\|\, \cdots \,]\right)$ for an ordered list of neighbors $u_1, u_2, \dots$ is not equivariant.
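The equivariance claim is easy to verify numerically. The sketch below builds the GCN-style sum-aggregation layer above and checks $f(PX, PAP^\top) = P\,f(X, A)$ for a random permutation matrix (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)            # symmetrize: undirected graph
np.fill_diagonal(A, 0.0)

W = rng.normal(size=(d, d))

def f(X, A):
    # Row v of A @ X is the sum of neighbor features of node v
    return np.tanh(A @ X @ W)

P = np.eye(n)[rng.permutation(n)]          # random permutation matrix
lhs = f(P @ X, P @ A @ P.T)                # relabel nodes, then apply f
rhs = P @ f(X, A)                          # apply f, then relabel outputs
print(np.allclose(lhs, rhs))               # True: f is permutation equivariant
```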
2.4 Representations: From Single Graph to Batched Graphs
Training GNNs on datasets of multiple small graphs (graph classification, molecular property prediction) requires efficient batching. Unlike image batches where tensors have uniform shape, graphs have variable numbers of nodes and edges.
The block-diagonal batch trick. Given graphs $G_1, \dots, G_B$ with $n_1, \dots, n_B$ nodes, form a single disconnected "super-graph" with $N = \sum_i n_i$ nodes:

$$A_{\mathrm{batch}} = \begin{pmatrix} A_1 & & \\ & \ddots & \\ & & A_B \end{pmatrix}, \qquad X_{\mathrm{batch}} = \begin{pmatrix} X_1 \\ \vdots \\ X_B \end{pmatrix}$$

The adjacency matrix is block-diagonal (no edges between different graphs). Since graphs in the batch are disconnected, message passing does not cross graph boundaries. A batch index vector $b \in \{1, \dots, B\}^N$ tracks which graph each node belongs to, enabling graph-level readout by aggregating within each block.
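A minimal sketch of the batching trick with dense adjacencies (illustrative only; libraries implement this with sparse edge lists and scatter operations):

```python
import numpy as np

def batch_graphs(adjs, feats):
    """adjs: list of (n_i, n_i) arrays; feats: list of (n_i, d) arrays."""
    sizes = [a.shape[0] for a in adjs]
    N = sum(sizes)
    A = np.zeros((N, N))
    offset, batch = 0, []
    for g, (a, s) in enumerate(zip(adjs, sizes)):
        A[offset:offset + s, offset:offset + s] = a   # block-diagonal placement
        batch += [g] * s                              # node -> graph index
        offset += s
    return A, np.vstack(feats), np.array(batch)

def readout_mean(H, batch, num_graphs):
    # Graph-level embeddings: average node embeddings within each block
    return np.stack([H[batch == g].mean(axis=0) for g in range(num_graphs)])

A1 = np.array([[0., 1.], [1., 0.]])                         # graph 1: one edge
A2 = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])   # graph 2: a path
X1, X2 = np.ones((2, 4)), np.ones((3, 4))
A, Xb, batch = batch_graphs([A1, A2], [X1, X2])   # A is 5x5 block-diagonal
hG = readout_mean(Xb, batch, num_graphs=2)        # (2, 4) graph embeddings
```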
Virtual node trick. Adding a "super node" connected to all real nodes in a graph is a common trick to enable long-range information propagation without increasing depth. The virtual node aggregates the entire graph's information in one hop, then broadcasts back. Used in MPNN (Gilmer et al., 2017) and Graphormer (Ying et al., 2021). It also serves as a form of global pooling: the virtual node's final representation can be read out directly as the graph embedding, $h_G = h_{\mathrm{virtual}}^{(L)}$.
3. The MPNN Framework
3.1 The Gilmer et al. (2017) Formulation
The Message Passing Neural Network (MPNN) framework, introduced by Gilmer et al. (2017) for quantum chemistry property prediction, provides the canonical mathematical abstraction for GNNs. It unifies GCN, GraphSAGE, GAT, GIN, and most other spatial GNN architectures as special cases.
Algorithm: Message Passing Phase

For $l = 0, 1, \dots, L-1$ (layers):

$$m_v^{(l+1)} = \sum_{u \in \mathcal{N}(v)} M^{(l)}\!\left(h_v^{(l)},\, h_u^{(l)},\, e_{uv}\right)$$
$$h_v^{(l+1)} = U^{(l)}\!\left(h_v^{(l)},\, m_v^{(l+1)}\right)$$

Readout Phase (for graph-level tasks):

$$\hat{y} = R\!\left(\{\, h_v^{(L)} : v \in V \,\}\right)$$

Here:
- $M^{(l)}$ - message function at layer $l$; takes the central node's features, a neighbor's features, and the edge features, returns a message vector
- $U^{(l)}$ - update function at layer $l$; combines the node's current representation with the aggregated message
- $R$ - readout function; maps a set of node representations to a graph-level prediction

The aggregation in the message phase is written as a sum, but can be replaced by any permutation-invariant function (mean, max, attention). The sum formulation is canonical because it is the most expressive (see Section 7).

Initialization: $h_v^{(0)} = x_v$ - the initial representation is the input node feature.
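The framework's value is that all three functions are pluggable. The sketch below implements the generic layer over an edge list and instantiates GIN-style choices as one example (the $\epsilon$ value, the single-linear-layer "MLP", and all names are our illustrative simplifications):

```python
import numpy as np

def mpnn_layer(H, edges, message_fn, update_fn):
    """Generic MPNN layer. edges: list of (u, v, e_uv) triples, message u -> v."""
    n = H.shape[0]
    msgs = [np.zeros(H.shape[1]) for _ in range(n)]
    for u, v, e_uv in edges:                              # message phase
        msgs[v] = msgs[v] + message_fn(H[v], H[u], e_uv)  # sum aggregation
    return np.stack([update_fn(H[v], msgs[v]) for v in range(n)])  # update phase

# GIN as a special case: identity message, (1 + eps)-weighted self term,
# and a (crude, one-layer) MLP update.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
eps = 0.1
gin_message = lambda h_v, h_u, e_uv: h_u
gin_update = lambda h_v, m_v: np.maximum(0.0, ((1 + eps) * h_v + m_v) @ W)

H = rng.normal(size=(3, 4))
edges = [(0, 1, None), (1, 0, None), (1, 2, None), (2, 1, None)]
H1 = mpnn_layer(H, edges, gin_message, gin_update)
```

Swapping in a different `message_fn`/`update_fn` pair recovers the other architectures in the unification table of Section 3.6.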
For AI: The MPNN framework is so general that it also subsumes certain attention mechanisms. The transformer's self-attention layer can be viewed as a fully-connected MPNN (every token attends to every other token) with the attention-weighted value vector $\alpha_{vu} W_V h_u$ as the message function.
3.2 Message Functions
The message function determines what information flows along each edge. Several common designs:
Edge-independent messages. Many simple GNNs (GCN, GIN) ignore edge features and use:

$$M(h_v, h_u, e_{uv}) = h_u \qquad \text{or} \qquad M(h_v, h_u, e_{uv}) = W h_u$$

Edge-conditioned messages. When edge features are available:

$$M(h_v, h_u, e_{uv}) = \phi\!\left([\, h_u \,\|\, e_{uv} \,]\right)$$

where $\phi$ is an MLP. Used in MPNN for molecular graphs (Gilmer et al., 2017): edge features encode bond type, bond length, and stereochemistry.

Pair messages. In models like NNConv, the edge feature defines the transformation matrix:

$$M(h_v, h_u, e_{uv}) = \phi_e(e_{uv})\, h_u$$

where $\phi_e$ is an MLP that produces a $d \times d$ weight matrix from the edge features. This is extremely expressive but quadratically expensive in the feature dimension $d$.
3.3 Aggregation: Permutation-Invariant Pooling
The aggregation function must be permutation invariant over the multiset of neighbor features $\{\, h_u^{(l)} : u \in \mathcal{N}(v) \,\}$. Three canonical choices and their properties:
Sum aggregation:

$$m_v = \sum_{u \in \mathcal{N}(v)} M(h_v, h_u, e_{uv})$$

Sensitive to neighborhood size - a node with 10 neighbors gets a larger aggregate than a node with 2, even if their neighborhood compositions are identical. This turns out to be a feature, not a bug: sum aggregation is the most expressive (Section 7.4). Used in GIN (Xu et al., 2019).

Mean aggregation:

$$m_v = \frac{1}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} M(h_v, h_u, e_{uv})$$

Normalizes by neighborhood size. Good when neighborhood size is not informative and you want to compare average neighbor characteristics across nodes of different degrees. Used in GCN (Kipf & Welling, 2017) and GraphSAGE (mean variant).

Max aggregation:

$$m_v = \max_{u \in \mathcal{N}(v)} M(h_v, h_u, e_{uv}) \quad \text{(element-wise)}$$

Detects whether any neighbor has a feature value above a threshold. Good for detecting the presence of particular node types in the neighborhood. Used in GraphSAGE (max-pool variant).
Expressiveness hierarchy. Xu et al. (2019) proved that for distinguishing multisets of neighbor features, sum is strictly more expressive than mean and max. Mean cannot distinguish a neighborhood of two identical feature vectors from a neighborhood of four copies of the same vector; max cannot count copies of identical features. Sum can distinguish both, as the quick check below shows.
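A numerical illustration of the hierarchy, comparing the three aggregators on two neighborhoods that differ only in multiplicity:

```python
import numpy as np

# Two neighborhoods with identical feature *values* but different multiplicities:
# mean and max cannot tell them apart; sum can (Xu et al., 2019).
nbhd_a = np.array([[1.0], [1.0]])                # two neighbors, feature 1
nbhd_b = np.array([[1.0], [1.0], [1.0], [1.0]])  # four neighbors, feature 1

for name, agg in [("sum", np.sum), ("mean", np.mean), ("max", np.max)]:
    print(name, agg(nbhd_a, axis=0), agg(nbhd_b, axis=0))
# sum  [2.] [4.]   <- distinguishes the two neighborhoods
# mean [1.] [1.]   <- identical
# max  [1.] [1.]   <- identical
```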
Attention aggregation. Instead of fixed weights, learn attention coefficients:

$$m_v = \sum_{u \in \mathcal{N}(v)} \alpha_{vu}\, M(h_v, h_u, e_{uv})$$

where $\alpha_{vu}$ is a learned attention score. This is the GAT model (Section 6).
3.4 Update Functions
The update function combines the node's current representation with the aggregated neighborhood message.
Linear update (GCN):

$$h_v^{(l+1)} = \sigma\!\left(W^{(l)} m_v^{(l+1)}\right)$$

The current node representation is absorbed into the message via the self-loop in $\tilde{A}$.

Concatenation update (GraphSAGE):

$$h_v^{(l+1)} = \sigma\!\left(W^{(l)} \left[\, h_v^{(l)} \,\|\, m_v^{(l+1)} \,\right]\right)$$

Concatenating the node's current representation with the aggregated neighbor message before transforming preserves the node's own information explicitly.

GRU update (Gated GNN, Li et al. 2016):

$$h_v^{(l+1)} = \mathrm{GRU}\!\left(h_v^{(l)},\, m_v^{(l+1)}\right)$$

The GRU's gating mechanism adaptively controls how much of the current representation to retain vs. replace with the new message. More expressive than a linear update, but with more parameters.

Residual update:

$$h_v^{(l+1)} = h_v^{(l)} + \sigma\!\left(W^{(l)} m_v^{(l+1)}\right)$$

Skip connections (as in ResNet) preserve gradient flow in deep GNNs and mitigate over-smoothing. Used in GCNII (Chen et al., 2020) to train 64-layer GCNs successfully.
3.5 Readout for Graph-Level Tasks
For graph classification and regression, the node representations must be aggregated into a single graph embedding $h_G$.

Global pooling:

$$h_G = \mathrm{READOUT}\!\left(\{\, h_v^{(L)} : v \in V \,\}\right)$$

Simple choices: $\sum_v h_v^{(L)}$, $\frac{1}{|V|} \sum_v h_v^{(L)}$, $\max_v h_v^{(L)}$ (element-wise). Sum is most expressive; mean normalizes for graph size.
Hierarchical pooling. Rather than a single pooling step at the end, hierarchical GNNs (DiffPool, SAGPool) alternate GNN layers with coarsening steps that reduce the number of nodes. This mirrors how image CNNs alternate convolution with spatial downsampling.
Jumping Knowledge (Xu et al., 2018). Rather than using only the final layer's representations, JK-networks concatenate representations from all layers:

$$h_v^{\mathrm{JK}} = \left[\, h_v^{(1)} \,\|\, h_v^{(2)} \,\|\, \cdots \,\|\, h_v^{(L)} \,\right]$$
This allows the readout to leverage both local (shallow layers) and global (deep layers) structural information, and empirically mitigates over-smoothing.
Set2Set (Vinyals et al., 2016). A more powerful readout using LSTM-based attention over the node set, producing an order-invariant graph embedding that depends on all node representations through multiple rounds of attention. More expressive than simple global pooling, at the cost of several sequential attention steps.
3.6 MPNN Instances: Unifying GCN, GraphSAGE, GAT, GIN
The following table shows how each major GNN architecture is a special case of the MPNN framework:
MPNN UNIFICATION TABLE
========================================================================
Model Message M(h_v, h_u, e_uv) Aggregation Update U(h_v, m_v)
---------------------------------------------------------------------
GCN W*h_u Mean (norm.) \sigma(W*m_v)
GraphSAGE W_1*h_u Mean/Max/Sum \sigma(W*[h_v || m_v])
GAT \alpha_uv * W*h_u Attention \sigma(W*m_v)
(\alpha_uv learned from h_v, h_u)
GIN h_u Sum MLP((1+\epsilon)*h_v + m_v)
MPNN-Gilmer f_e(e_uv)*h_u Sum GRU(h_v, m_v)
---------------------------------------------------------------------
Key insight: choice of M, aggregation, U determines expressiveness
and scalability. Sum agg. + injective MLP = most expressive (Section 7)
========================================================================
4. Graph Convolutional Networks (GCN)
4.1 Spectral Foundation Recap
Recall from 11-04 Section 7.5: The GCN layer is derived as a first-order Chebyshev polynomial approximation to spectral graph convolution. Starting from the spectral filter $g_\theta(\Lambda)$ applied to the normalized Laplacian $L_{\mathrm{sym}} = I - D^{-1/2} A D^{-1/2}$ (with eigenvalues in $[0, 2]$), constraining $\theta = \theta_0 = -\theta_1$ to a single parameter per layer, and applying the renormalization trick $\tilde{A} = A + I$, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, yields the GCN propagation rule. Full derivation: 11-04 Spectral Graph Theory, Section 7.5.

The key takeaway is that the GCN layer is a principled spectral filter, not an ad hoc heuristic. The choices of $\tilde{A} = A + I$ (adding self-loops) and $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ (symmetric normalization) both arise from the spectral derivation and can now be given spatial interpretations.
4.2 The GCN Layer
The GCN propagation rule for all nodes simultaneously (matrix form):

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

where:
- $\tilde{A} = A + I_n$ - adjacency matrix with added self-loops
- $\tilde{D}$ with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ - degree matrix of $\tilde{A}$
- $H^{(l)} \in \mathbb{R}^{n \times d_l}$ - node feature matrix at layer $l$; $H^{(0)} = X$
- $W^{(l)} \in \mathbb{R}^{d_l \times d_{l+1}}$ - trainable weight matrix at layer $l$
- $\sigma$ - nonlinear activation (ReLU for hidden layers, softmax for output in classification)
Node-level form. For a single node $v$:

$$h_v^{(l+1)} = \sigma\!\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{\sqrt{\tilde{d}_v \tilde{d}_u}}\, W^{(l)} h_u^{(l)}\right)$$

where $\tilde{d}_v = 1 + \deg(v)$ (degree including self-loop). Each neighbor $u$'s contribution is weighted by $1 / \sqrt{\tilde{d}_v \tilde{d}_u}$ - the reciprocal of the geometric mean of the (augmented) degrees. This normalization prevents high-degree nodes from dominating and ensures the propagation matrix has spectral radius $\leq 1$.

The propagation matrix. Define $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$. This is the normalized adjacency of $\tilde{A}$. Its eigenvalues lie in $(-1, 1]$ (since $\hat{A}$ is symmetric with non-negative entries). A key property: $(\hat{A}^k)_{vu} \neq 0$ exactly when $u$ lies within $k$ hops of $v$ (the self-loops make shorter walks count as well).

After $k$ GCN layers, node $v$'s representation is a learned function of the (weighted) $k$-hop neighborhood - exactly the receptive field intuition from Section 1.2.
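Expressed in code, the propagation rule is only a few lines. A dense NumPy sketch (a real implementation would use sparse matrix products):

```python
import numpy as np

def gcn_layer(H, A, W, activation=lambda z: np.maximum(0.0, z)):
    """One GCN layer: H' = sigma(D~^{-1/2} A~ D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])             # add self-loops
    d_tilde = A_tilde.sum(axis=1)                # augmented degrees
    D_inv_sqrt = np.diag(d_tilde ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # symmetric normalization
    return activation(A_hat @ H @ W)             # aggregate, transform, activate
```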
4.3 Matrix View: Propagation Rule
The GCN can be understood as a learned diffusion process. Define $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ as the diffusion operator (also called the normalized propagation matrix). Then one layer $\sigma(\hat{A} H W)$ decomposes as:
- $\hat{A} H$ - neighborhood aggregation: each node's features become a weighted average of its neighbors' features. This is a graph smoothing step - it reduces variation across connected nodes.
- $(\cdot)\, W$ - feature transformation: a learnable linear map on the feature space.
- $\sigma(\cdot)$ - nonlinearity: introduces expressive power.

Repeating this $L$ times gives $H^{(L)} = \sigma\!\left(\hat{A}\, \sigma\!\left(\cdots\, \sigma\!\left(\hat{A} X W^{(0)}\right) \cdots\right) W^{(L-1)}\right)$.
Connection to heat diffusion. The continuous heat equation on a graph is $\frac{\partial H}{\partial t} = -L H$, with solution $H(t) = e^{-tL} H(0)$. The GCN step approximates one discrete step of this diffusion. Deep GCNs approximate long-time diffusion, which - as we will see in Section 8 - causes all node representations to converge to the same value (over-smoothing).
4.4 Semi-Supervised Node Classification
The original application of GCN (Kipf & Welling, 2017) is semi-supervised node classification: given a graph with $n$ nodes, a small number labeled and the rest unlabeled, train the GCN to predict class labels for all nodes.
Setup:
- Two-layer GCN: $Z = \mathrm{softmax}\!\left(\hat{A}\, \mathrm{ReLU}\!\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$
- Loss: cross-entropy over labeled nodes only: $\mathcal{L} = -\sum_{i \in \mathcal{Y}_L} \sum_c Y_{ic} \ln Z_{ic}$, where $\mathcal{Y}_L$ is the set of labeled nodes
- The GCN propagates label information from labeled to unlabeled nodes through the graph structure (similar to label propagation, but learned); see the sketch after this list
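A compact PyTorch sketch of this setup, assuming a precomputed dense $\hat{A}$ tensor and a boolean mask over the labeled nodes (names are illustrative, and sparse ops would be used in practice):

```python
import torch
import torch.nn.functional as F

class GCN(torch.nn.Module):
    """Two-layer GCN for semi-supervised node classification (sketch)."""
    def __init__(self, d_in, d_hidden, n_classes):
        super().__init__()
        self.W0 = torch.nn.Linear(d_in, d_hidden, bias=False)
        self.W1 = torch.nn.Linear(d_hidden, n_classes, bias=False)

    def forward(self, X, A_hat):
        H = F.relu(A_hat @ self.W0(X))      # first propagation + ReLU
        return A_hat @ self.W1(H)           # logits; softmax folded into the loss

def train_step(model, opt, X, A_hat, y, labeled_mask):
    opt.zero_grad()
    logits = model(X, A_hat)
    # Cross-entropy over the labeled nodes only
    loss = F.cross_entropy(logits[labeled_mask], y[labeled_mask])
    loss.backward()
    opt.step()
    return loss.item()
```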
Benchmark results (Cora citation network, 140 labeled / 2708 total):
| Method | Accuracy |
|---|---|
| DeepWalk + SVM | 67.2% |
| Label Propagation | 68.0% |
| Planetoid (Yang et al. 2016) | 75.7% |
| GCN (Kipf & Welling, 2017) | 81.5% |
| GAT (Velickovic et al., 2018) | 83.0% |
| GCNII (Chen et al., 2020, 64 layers) | 85.5% |
The GCN's success on this benchmark established GNNs as the go-to method for graph-structured learning and spawned the entire field.
Why it works. The graph encodes a "homophily" assumption: connected nodes tend to have the same class. In citation networks, papers citing each other tend to be in the same research area. The GCN leverages this assumption by smoothing features across edges, effectively propagating class information from labeled nodes to their unlabeled neighbors.
4.5 Strengths and Weaknesses of GCN
Strengths:
- Simple to implement (two matrix multiplications per layer)
- Efficient: $O(|E|\, d)$ per layer for sparse graphs
- Well-understood theoretically: spectral derivation, convergence analysis
- Strong baselines for semi-supervised node classification
- Easily extended with residual connections, BatchNorm, dropout
Weaknesses:
- Transductive: the normalization $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is computed for the entire graph at training time. Adding a new node requires recomputing $\hat{A}$ and retraining - the GCN cannot generalize to unseen nodes
- Fixed symmetric aggregation: every neighbor is weighted by the same geometric mean of degrees. High-degree hub nodes get down-weighted regardless of their relevance
- No edge features: the standard GCN layer has no mechanism to incorporate edge attributes
- Expressiveness limited to 1-WL: cannot distinguish non-isomorphic graphs that the WL test cannot distinguish (Section 7)
- Over-smoothing at depth: after many layers, all node representations converge (Section 8)
5. GraphSAGE: Inductive Learning on Large Graphs
5.1 The Inductive Learning Problem
GCN is transductive: it operates on a fixed graph seen during training. The normalized adjacency $\hat{A}$ encodes the training graph's structure. When new nodes arrive (a new user on a social platform, a new molecule to screen), the GCN cannot produce representations for them without recomputing $\hat{A}$ over the full graph and rerunning the entire network.
This is impractical for large, dynamic graphs. Pinterest's graph has billions of nodes; new pins are added every second. Training a fresh GCN daily is infeasible.
Inductive learning solves this: learn an aggregation function that can be applied to the neighborhood of any node, including nodes unseen during training. If the function is parameterized correctly, it generalizes: apply it to the neighborhood of a new node and get a useful embedding without retraining.
GraphSAGE (Hamilton, Ying & Leskovec, 2017) - SAmple and aggreGatE - is the canonical inductive GNN framework.
5.2 GraphSAGE Framework
Key idea: instead of using the entire neighborhood of a node (which can be millions of neighbors for hub nodes), sample a fixed-size subset at each layer. Then aggregate only over the sampled set. The parameters of the aggregation function are shared across all nodes and all graphs.
Algorithm (2-layer GraphSAGE):

For each node $v$ in the target set (e.g., a mini-batch of nodes):
- Sample neighbors: $S_1(v) \subseteq \mathcal{N}(v)$ - sample $s_1$ neighbors uniformly at random
- For each sampled neighbor $u \in S_1(v)$, sample their neighbors: $S_2(u) \subseteq \mathcal{N}(u)$ with $|S_2(u)| = s_2$
- Compute depth-2 embeddings bottom-up:
  - $h_u^{(1)} = \sigma\!\left(W^{(1)} \left[\, x_u \,\|\, \mathrm{AGG}\!\left(\{\, x_{u'} : u' \in S_2(u) \,\}\right) \right]\right)$ for all $u \in \{v\} \cup S_1(v)$
  - $h_v^{(2)} = \sigma\!\left(W^{(2)} \left[\, h_v^{(1)} \,\|\, \mathrm{AGG}\!\left(\{\, h_u^{(1)} : u \in S_1(v) \,\}\right) \right]\right)$
- Normalize: $h_v^{(2)} \leftarrow h_v^{(2)} / \|h_v^{(2)}\|_2$

Inductive inference: for a new node $v^{\ast}$ with known features and edges (to training-time nodes), apply the same aggregation functions with the learned weights $W^{(1)}, W^{(2)}$ - no retraining needed.
Sampling depth. A $K$-layer GraphSAGE with per-layer sample sizes $s_1, \dots, s_K$ touches at most $\prod_{i=1}^{K} s_i$ nodes per target node. With $K = 2$, $s_1 = 25$, $s_2 = 10$, at most 250 nodes are accessed per target node, regardless of graph size. This makes training on massive graphs feasible.
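A minimal sketch of the sampling step in plain Python/NumPy (function names are ours; production systems batch this and run it in a producer pipeline, as in Section 5.5):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_neighbors(neighbors, v, s):
    """Uniformly sample at most s neighbors of v without replacement."""
    nbrs = neighbors[v]
    if len(nbrs) <= s:
        return list(nbrs)                  # keep all if fewer than s
    return list(rng.choice(nbrs, size=s, replace=False))

def sampled_frontier(neighbors, v, s1, s2):
    """Depth-2 sampled computation graph for target node v.

    Touches at most 1 + s1 + s1*s2 nodes, independent of the full graph's size.
    """
    hop1 = sample_neighbors(neighbors, v, s1)
    hop2 = {u: sample_neighbors(neighbors, u, s2) for u in hop1}
    return hop1, hop2
```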
5.3 Aggregation Strategies
GraphSAGE defines three aggregation variants with different expressiveness-efficiency tradeoffs:
Mean aggregator:

$$\mathrm{AGG}_{\mathrm{mean}} = \frac{1}{|S(v)|} \sum_{u \in S(v)} h_u^{(l)}$$

where $S(v)$ is the sampled neighbor set.
Similar to GCN but with uniform weights (no degree normalization). Concatenation rather than addition preserves the distinction between self and neighborhood.
LSTM aggregator:

$$\mathrm{AGG}_{\mathrm{LSTM}} = \mathrm{LSTM}\!\left(h_{u_{\pi(1)}}^{(l)}, \dots, h_{u_{\pi(k)}}^{(l)}\right), \quad \pi \text{ a random permutation of } S(v)$$
Apply an LSTM over a random permutation of the sampled neighbors. The LSTM can model complex interactions between neighbors (e.g., in a sequence-like order), but the random permutation breaks the theoretical permutation invariance. In practice, the LSTM often performs best on benchmark tasks due to its representational power.
Max-pool aggregator:

$$\mathrm{AGG}_{\mathrm{pool}} = \max_{u \in S(v)} \sigma\!\left(W_{\mathrm{pool}}\, h_u^{(l)} + b\right) \quad \text{(element-wise max)}$$
Apply a learnable transformation to each neighbor's representation, then take the element-wise max. Detects the presence of specific feature patterns in the neighborhood. Often the best choice for tasks where the presence of a rare feature type is informative.
Choosing an aggregator. Empirically: LSTM aggregator performs best on benchmark accuracy but is not permutation invariant. Mean aggregator is permutation invariant and performs well on inductive tasks. Max-pool aggregator is best at detecting rare structural patterns. Practitioners typically tune on validation data.
5.4 Unsupervised Training and Link Prediction
GraphSAGE can be trained without labels using a link prediction objective:

$$J(z_u) = -\log \sigma\!\left(z_u^\top z_v\right) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)}\!\left[\log \sigma\!\left(-z_u^\top z_{v_n}\right)\right]$$

where $(u, v)$ is a positive pair (connected in the graph), $v_n$ is a negative sample (random non-neighbor), $Q$ is the number of negative samples, $P_n$ is the negative sampling distribution (typically uniform), and $\sigma$ is the sigmoid function.
Intuition: nodes that co-occur in short random walks should have similar embeddings; nodes that are far apart should have dissimilar embeddings. This is the DeepWalk objective extended to learned GNN embeddings.
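A sketch of this objective in PyTorch, assuming precomputed embeddings for a batch of positive pairs and $Q$ sampled negatives per pair (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def unsupervised_loss(z_u, z_pos, z_neg):
    """Negative-sampling link prediction loss (sketch).

    z_u, z_pos: (B, d) embeddings of positive pairs;
    z_neg: (B, Q, d) embeddings of Q negatives per anchor.
    """
    pos_score = (z_u * z_pos).sum(dim=-1)                 # (B,)  z_u . z_v
    neg_score = torch.einsum('bd,bqd->bq', z_u, z_neg)    # (B, Q) z_u . z_vn
    pos_term = F.logsigmoid(pos_score)                    # log sigma(z_u . z_v)
    neg_term = F.logsigmoid(-neg_score).sum(dim=-1)       # sum of log sigma(-z_u . z_vn)
    return -(pos_term + neg_term).mean()                  # minimize negative log-likelihood
```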
For AI: unsupervised GraphSAGE embeddings are used in PinSage (Pinterest's recommendation system) to generate embeddings for pins (images with descriptions) in a bipartite pin-board graph. These embeddings power Pinterest's "more like this" recommendations over a graph of 3+ billion pins and boards.
5.5 Scaling to Billion-Node Graphs
PinSage (Ying et al., 2018) adapted GraphSAGE to Pinterest's graph with 3 billion nodes and 18 billion edges - the largest graph neural network deployment at the time.
Key engineering innovations:
- Importance-based sampling: instead of uniform random sampling, sample neighbors proportional to their random walk visit frequency from the target node. This gives more weight to structurally important neighbors and reduces variance.
- Producer-consumer minibatch pipeline: precompute the node neighborhoods offline in Spark, stream them to GPU workers as training data. Decouples graph traversal (CPU-bound) from gradient computation (GPU-bound).
- Hard negative mining: the standard negative sampling produces easy negatives (random pins). Hard negatives are pins that are visually similar but semantically different - they force the model to learn finer-grained distinctions.
- Curriculum learning: train first with easy negatives, then add hard negatives progressively.
Deployment results: 150% improvement in head-to-head pin recommendation quality over previous collaborative filtering system, with recommendations updated daily across 3+ billion items.