"An algorithm must be seen to be believed." - Donald Knuth
Overview
Graph algorithms are the computational engine of graph theory. A graph is an abstract object; a graph algorithm is a procedure that extracts structure, distances, connectivity, or optimal subgraphs from that object in time proportional to the graph's size. The classical algorithms developed between the 1950s and 1980s - BFS, DFS, Dijkstra, Bellman-Ford, Kruskal, Prim, Tarjan, Ford-Fulkerson - remain the workhorses of modern systems, from web crawlers to GPS navigation to protein interaction databases.
For AI and machine learning, graph algorithms are more than historical curiosities. Breadth-first search defines the $k$-hop neighbourhood that a $k$-layer Graph Neural Network can see - the BFS frontier is the GNN receptive field. Topological sort on directed acyclic graphs is exactly the forward-pass ordering algorithm in PyTorch and JAX autograd engines. Dijkstra's algorithm reappears as approximate inference in probabilistic graphical models. Minimum spanning trees underpin graph-based clustering and hierarchical agglomeration. Max-flow / min-cut formalises graph partitioning and connects - via the Cheeger inequality - to the spectral clustering methods in 04.
This section covers the full classical canon: traversal (BFS, DFS), shortest paths (Dijkstra, Bellman-Ford, Floyd-Warshall, A*), minimum spanning trees (Kruskal, Prim), DAG algorithms (topological sort, critical path), strongly connected components (Tarjan, Kosaraju), and maximum flow. Every algorithm is stated with precise pseudocode, complexity analysis, a correctness argument, and an explicit connection to AI practice.
Prerequisites
- Graph definitions ($G = (V, E)$, adjacency, degree, paths, cycles, DAGs) - 01 Graph Basics
- Graph representations (adjacency list, CSR, edge list) - 02 Graph Representations
- Basic asymptotic notation - Chapter 1
- Priority queue / heap data structure (assumed from CS background)
- Union-Find (disjoint-set union) data structure (introduced in 4.2 below)
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | BFS/DFS with layer visualisation, Dijkstra with priority queue trace, Bellman-Ford DP table, Kruskal with Union-Find, Tarjan SCC, max-flow residual graph animation |
| exercises.ipynb | 8 graded exercises: BFS shortest paths through max-flow and GPU graph analytics |
Learning Objectives
After completing this section, you will:
- Implement BFS and DFS from scratch and use them to compute shortest-hop distances, detect cycles, and classify edges
- Prove correctness of Dijkstra's algorithm via the greedy-choice invariant
- Apply Bellman-Ford for graphs with negative weights and detect negative cycles
- Implement Floyd-Warshall for all-pairs shortest paths and recognise its DP structure
- Describe A* and construct admissible heuristics for graph search problems
- Implement Kruskal's algorithm with Union-Find and Prim's with a priority queue
- Perform topological sort using both Kahn's and DFS-based algorithms
- Find strongly connected components using Tarjan's single-pass DFS algorithm
- State and apply the Max-Flow Min-Cut theorem
- Map each classical algorithm to its role in a modern AI system (GNN, autograd, clustering, planning)
- Select the right algorithm given graph properties, edge-weight constraints, and hardware
Table of Contents
- 1. Intuition
- 2. Graph Traversal
- 3. Shortest Paths
- 4. Minimum Spanning Trees
- 5. Topological Sort and DAG Algorithms
- 6. Strongly Connected Components
- 7. Maximum Flow and Minimum Cut
- 8. Advanced Topics
- 9. Algorithm Selection Framework
- 10. Why This Matters for AI (2026 Perspective)
- 11. Common Mistakes
- 12. Exercises
- 13. Conceptual Bridge
1. Intuition
1.1 What Is a Graph Algorithm?
A graph algorithm is a procedure that takes a graph (and possibly edge weights or a source node) as input and produces a structured answer about that graph: the shortest path between two nodes, the set of connected components, the minimum total weight spanning tree, the maximum flow that can be routed from source to sink.
Graph algorithms fall into three broad families:
Traversal algorithms visit every vertex reachable from a starting point, in a principled order. Breadth-First Search (BFS) visits nodes layer by layer in increasing hop distance; Depth-First Search (DFS) dives as deep as possible before backtracking. Both run in $O(n+m)$ time for a graph with $n$ nodes and $m$ edges - this is optimal, since every vertex and edge must be examined at least once.
Optimisation algorithms find a subgraph or path that minimises or maximises some quantity. Dijkstra finds shortest paths minimising total edge weight. Kruskal and Prim find spanning trees minimising total weight. Ford-Fulkerson maximises the flow from source to sink. These algorithms use problem-specific data structures (priority queues, union-find) to achieve better-than-brute-force complexity.
Structural algorithms reveal the global organisation of a graph - which vertices belong to the same strongly connected component, whether the graph is a DAG, what its chromatic number is. Tarjan's SCC algorithm and topological sort are the core structural algorithms in this section.
The choice of algorithm depends critically on:
- Graph structure: directed or undirected, weighted or unweighted, dense or sparse
- Edge weights: all positive (Dijkstra), possibly negative (Bellman-Ford), integer capacities (max-flow)
- Query type: single-source vs all-pairs, paths vs trees vs components
- Scale: small graphs (any algorithm works), million-edge graphs (need $O(m \log n)$ or better), billion-edge graphs (need near-linear or parallel algorithms)
1.2 The Graph Algorithm Zoo
The table below gives the at-a-glance reference for every algorithm in this section.
| Algorithm | Problem | Time Complexity | Space | Edge Weights |
|---|---|---|---|---|
| BFS | SSSP (unweighted), reachability | $O(n+m)$ | $O(n)$ | Unweighted |
| DFS | Cycle detection, topological sort, SCC | $O(n+m)$ | $O(n)$ | Any |
| Dijkstra | SSSP | $O((n+m)\log n)$ | $O(n)$ | Non-negative |
| Bellman-Ford | SSSP, negative-cycle detection | $O(nm)$ | $O(n)$ | Any |
| Floyd-Warshall | APSP | $O(n^3)$ | $O(n^2)$ | Any (no neg cycle) |
| A* | Single-pair SP with heuristic | $O((n+m)\log n)$ worst, far fewer expansions avg | $O(n)$ | Non-negative |
| Kruskal | MST | $O(m \log m)$ | $O(n)$ | Any |
| Prim | MST | $O((n+m)\log n)$ | $O(n)$ | Any |
| Topo Sort (Kahn) | DAG ordering | $O(n+m)$ | $O(n)$ | DAG only |
| Tarjan SCC | Strongly connected components | $O(n+m)$ | $O(n)$ | Directed |
| Kosaraju SCC | Strongly connected components | $O(n+m)$ | $O(n)$ | Directed |
| Ford-Fulkerson | Max flow | $O(m f^*)$ for max-flow value $f^*$ | $O(n)$ | Non-neg integer cap |
| Edmonds-Karp | Max flow | $O(nm^2)$ | $O(n)$ | Non-neg cap |
1.3 Why Graph Algorithms Matter for AI
Graph algorithms are not separate from deep learning - they are embedded in it.
GNN receptive fields. A $k$-layer Graph Neural Network aggregates information from the $k$-hop neighbourhood of each node. The set of nodes at hop distance $\leq k$ is exactly the union of BFS frontiers after $k$ iterations from that node. The GNN forward pass is a parameterised, learnable BFS. See 2.6 for details.
Autograd engines. PyTorch's backward() call and JAX's grad() both operate on a Directed Acyclic Graph (DAG) of operations. Forward pass = topological sort traversal. Backward pass = reverse topological order, accumulating gradients via the chain rule. See 5.4.
Knowledge graph reasoning. Multi-hop reasoning over knowledge graphs (e.g., finding paths between entities in Wikidata) uses SSSP and BFS. KGQA (knowledge-graph question answering) systems routinely call shortest-path routines on subgraphs.
Graph-based clustering. Spectral clustering (04) finds the optimal graph cut by computing the Fiedler eigenvector. The Cheeger inequality bounds the spectral gap by the minimum normalised cut - which is a min-cut problem. Graph-cut-based image segmentation (Boykov & Kolmogorov, 2004) uses max-flow to segment images.
Reinforcement learning. Bellman-Ford's update rule is the same recurrence as the Bellman equation in dynamic programming. Value iteration in RL is Bellman-Ford on the state-action graph.
Embodied AI and planning. Robot navigation, game playing (MCTS), and task planning all use shortest-path algorithms. A* and its variants (Theta*, D* Lite) are standard in robotics path planning.
1.4 Historical Timeline
GRAPH ALGORITHM HISTORY
========================================================================
1736 Euler - Konigsberg Bridge Problem: first graph theory argument
1847 Kirchhoff - Spanning tree theorem for electrical networks
1930 Jarnik - MST algorithm (rediscovered as Prim 1957)
1956 Ford & Fulkerson - Max-flow / augmenting path method
1956 Bellman - Shortest paths via dynamic programming
1957 Kruskal - MST via sorted edge greedy
1959 Dijkstra - Shortest path with priority queue
1962 Floyd / Warshall - All-pairs shortest paths via DP
1968 Hart, Nilsson, Raphael - A* heuristic search
1972 Tarjan - Linear-time DFS-based SCC algorithm
1972 Hopcroft & Karp - Bipartite matching, max-flow
1975 Kosaraju - Two-pass DFS SCC algorithm
1987 Goldberg & Tarjan - Push-relabel max-flow O(n^2 sqrt(m))
2003 Boykov & Kolmogorov - Graph-cuts for image segmentation
2017 Kipf & Welling - GCN: BFS-structured neural message passing
2021 GraphBLAS - Sparse linear algebra standard for graph analytics
2023+ GPU-accelerated BFS for trillion-edge graphs (NVIDIA cuGraph)
========================================================================
2. Graph Traversal
2.1 Breadth-First Search (BFS)
Breadth-First Search answers the question: "Starting from a source node $s$, in what order do we first encounter each reachable vertex, visiting closer vertices before farther ones?"
Algorithm. BFS maintains a FIFO queue $Q$ and a distance array $d[\cdot]$ initialised to $\infty$.
BFS(G, s):
for each v in V: d[v] <- \infty, parent[v] <- NIL
d[s] <- 0
Q <- empty queue
enqueue(Q, s)
while Q not empty:
u <- dequeue(Q)
for each neighbour v of u:
if d[v] = \infty: <- v not yet visited
d[v] <- d[u] + 1
parent[v] <- u
enqueue(Q, v)
Output: After BFS, $d[v]$ is the exact shortest number of hops from $s$ to $v$ (or $\infty$ if $v$ is not reachable). The parent[] array encodes the BFS tree - a spanning tree of the component containing $s$ in which every root-to-leaf path is a shortest path.
Complexity. Each vertex is enqueued at most once ($O(n)$ enqueue/dequeue operations). Each edge $(u,v)$ is examined at most twice - once when $u$ is dequeued, once when $v$ is dequeued. Total: $O(n+m)$ time, $O(n)$ space for $d$ and parent.
Layer structure. BFS naturally partitions the reachable vertices into layers $L_0 = \{s\}$, $L_1$ = neighbours of $s$, $L_2$ = neighbours of $L_1$ not already seen, etc. All vertices in $L_i$ are at hop distance exactly $i$ from $s$.
Example. For the undirected graph $V = \{s, a, b, c, d\}$, $E = \{(s,a), (s,b), (a,c), (b,c), (c,d)\}$, starting at $s$:
- $L_0 = \{s\}$, $d[s] = 0$
- $L_1 = \{a, b\}$, $d[a] = d[b] = 1$
- $L_2 = \{c\}$, $d[c] = 2$
- $L_3 = \{d\}$, $d[d] = 3$
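The pseudocode above maps almost line-for-line onto Python. A minimal sketch using an adjacency-list dict (the example graph is illustrative):

```python
from collections import deque

def bfs(adj, s):
    """BFS from source s over adjacency-list dict adj.
    Returns (d, parent): hop distances and BFS-tree parents."""
    d = {v: float("inf") for v in adj}
    parent = {v: None for v in adj}
    d[s] = 0
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if d[v] == float("inf"):   # v not yet visited
                d[v] = d[u] + 1
                parent[v] = u
                q.append(v)
    return d, parent

# Undirected example: edges s-a, s-b, a-c, b-c, c-d
adj = {"s": ["a", "b"], "a": ["s", "c"], "b": ["s", "c"],
       "c": ["a", "b", "d"], "d": ["c"]}
d, parent = bfs(adj, "s")
```

Following `parent` from any node back to `s` reconstructs a shortest-hop path, exactly as described below in 2.2.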
2.2 BFS Correctness and Shortest-Hop Paths
Theorem (BFS Shortest-Hop). For an unweighted graph $G$ and source $s$, BFS sets $d[v] = \delta(s,v)$ for all $v$, where $\delta(s,v)$ is the minimum hop distance from $s$ to $v$.
Proof sketch. We prove by induction on the hop distance $k$ that all vertices at distance $k$ from $s$ are assigned $d[v] = k$. Base: $k = 0$, $d[s] = 0$ by initialisation. Inductive step: suppose all vertices at distance $k-1$ are correctly assigned. A vertex $v$ at distance $k$ has at least one neighbour $u$ at distance $k-1$. By induction $d[u] = k-1$, so when $u$ is dequeued, $v$ is discovered with $d[v] = k$. No earlier dequeue can discover $v$ with a smaller label, since any neighbour $x$ of $v$ satisfies $\delta(s,x) \geq k-1$, so $v$'s label via $x$ would be $d[x] + 1 \geq k$.
Corollary. BFS solves the unweighted SSSP problem in $O(n+m)$ - optimal, since merely reading the graph takes $\Omega(n+m)$.
Path reconstruction. Following parent[] pointers from $v$ back to $s$ yields a shortest path in reverse. The path has exactly $d[v]$ edges.
2.3 Depth-First Search (DFS)
DFS explores as far as possible along each branch before backtracking. It assigns two timestamps to each vertex: disc[v] (when $v$ is first discovered) and fin[v] (when DFS finishes exploring all descendants of $v$).
DFS(G):
for each v in V: colour[v] <- WHITE
time <- 0
for each v in V:
if colour[v] = WHITE:
DFS-VISIT(G, v)
DFS-VISIT(G, u):
time <- time + 1; disc[u] <- time
colour[u] <- GREY <- discovered, not finished
for each neighbour v of u:
if colour[v] = WHITE:
parent[v] <- u
DFS-VISIT(G, v)
colour[u] <- BLACK <- finished
time <- time + 1; fin[u] <- time
Three colours - WHITE (undiscovered), GREY (discovered, still on the recursion stack), BLACK (finished) - form the classic DFS state machine. A vertex is GREY exactly between its disc and fin timestamps.
The parenthesis theorem. For any two vertices $u, v$: the intervals $[disc[u], fin[u]]$ and $[disc[v], fin[v]]$ are either disjoint or one contains the other. They cannot partially overlap. This gives DFS a nested interval structure - hence the "parenthesis" name.
Complexity. $O(n+m)$ - each vertex transitions WHITE -> GREY -> BLACK exactly once; each edge is examined twice in an undirected graph (once per endpoint) and once in a directed graph.
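A compact Python version of the timestamped DFS, with a shared counter for the global clock (the example graph is illustrative):

```python
def dfs(adj):
    """Full DFS over adjacency-list dict adj; returns (disc, fin, parent)."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in adj}
    disc, fin, parent = {}, {}, {v: None for v in adj}
    time = [0]  # mutable counter shared with the nested function

    def visit(u):
        time[0] += 1; disc[u] = time[0]
        colour[u] = GREY              # discovered, not finished
        for v in adj[u]:
            if colour[v] == WHITE:
                parent[v] = u
                visit(v)
        colour[u] = BLACK             # finished
        time[0] += 1; fin[u] = time[0]

    for v in adj:
        if colour[v] == WHITE:
            visit(v)
    return disc, fin, parent

adj = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
disc, fin, parent = dfs(adj)
```

Checking the intervals confirms the parenthesis theorem: the interval of `d` nests inside that of `b`, which nests inside that of `a`. For graphs deeper than Python's recursion limit, the recursion would be replaced by an explicit stack.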
2.4 DFS: Edge Classification
In a DFS on a directed graph, every edge falls into exactly one class:
| Edge Type | Condition | Meaning |
|---|---|---|
| Tree edge | $v$ is WHITE when $(u,v)$ is explored | Edge in the DFS forest |
| Back edge | $v$ is GREY when $(u,v)$ is explored | $v$ is an ancestor of $u$; forms a cycle |
| Forward edge | $v$ is BLACK, $disc[u] < disc[v]$ | $v$ is a descendant of $u$ |
| Cross edge | $v$ is BLACK, $disc[u] > disc[v]$ | No ancestor/descendant relation |
Key results from edge classification:
- Cycle detection: A directed graph has a cycle if and only if DFS discovers at least one back edge. (An undirected graph has a cycle iff DFS finds a non-tree edge.)
- DAG test: A directed graph is a DAG <=> DFS finds no back edges.
- Topological order: In a DAG, vertices listed in decreasing fin[] order form a topological ordering (5.2).
For undirected graphs there are only tree edges and back edges - forward and cross edges cannot occur, because an edge $(u,v)$ explored when $v$ is BLACK would have already been explored in the other direction.
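The back-edge criterion gives a direct cycle test. A minimal sketch using the three-colour scheme from 2.3 (example graphs are illustrative):

```python
def has_cycle(adj):
    """Directed cycle test: a back edge is an edge into a GREY vertex."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in adj}

    def visit(u):
        colour[u] = GREY
        for v in adj[u]:
            if colour[v] == GREY:                 # back edge -> cycle
                return True
            if colour[v] == WHITE and visit(v):
                return True
        colour[u] = BLACK
        return False

    return any(visit(v) for v in adj if colour[v] == WHITE)

print(has_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))    # 3-cycle: True
print(has_cycle({"a": ["b", "c"], "b": ["c"], "c": []}))  # DAG: False
```

Note that testing `colour[v] == BLACK` would be wrong here: a BLACK neighbour is a forward or cross edge, which does not indicate a cycle.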
2.5 BFS vs DFS - When to Use Which
BFS vs DFS: DECISION GUIDE
========================================================================
Use BFS when: Use DFS when:
------------- ------------
OK Shortest paths (unweighted) OK Cycle detection
OK Level-by-level exploration OK Topological sort
OK GNN k-hop neighbourhood OK Strongly connected components
OK Web crawling (breadth-first) OK Detecting back/forward edges
OK Connected components OK Maze solving (backtracking)
OK Bipartiteness check OK DFS tree for Tarjan SCC
OK Social network proximity OK Memory-efficient deep search
Memory: O(n) queue (wide graphs Memory: O(depth) stack (narrow
can be large!) graphs can be efficient)
BFS queue = FIFO (first in, first out)
DFS stack = LIFO (last in, first out) - implicit via recursion
========================================================================
Memory consideration. For a wide, shallow graph (e.g., a star $K_{1,n-1}$), BFS's queue can hold $\Theta(n)$ nodes simultaneously. For a deep, narrow graph (e.g., a path $P_n$), DFS recursion uses $\Theta(n)$ stack frames. In practice, BFS is preferred when the shortest path or level structure matters; DFS is preferred when structural properties (SCC, cycles) are the goal.
2.6 AI Connection: GNN Receptive Fields
Preview (full treatment in 05 Graph Neural Networks)
A Graph Neural Network layer aggregates messages from immediate neighbours. After $k$ layers, each node has aggregated information from all nodes within hop distance $k$ - its $k$-hop neighbourhood, or receptive field.
This is exactly the BFS $k$-hop set $N_k(v) = \{u : \delta(v,u) \leq k\}$.
GCN layer (Kipf & Welling 2017):
$H^{(l+1)} = \sigma\left( \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)} \right)$
where $\hat{A} = A + I$ (adjacency with self-loops) and $\hat{D}$ is its diagonal degree matrix. The matrix product $\hat{A} H^{(l)}$ is a one-step BFS aggregation: row $v$ of the result is the sum of the feature vectors of all neighbours of $v$ (including $v$ itself). Stacking $k$ such layers = $k$ BFS hops, with the degree normalisation playing the role of averaging rather than summing.
Implication. The expressiveness of such a GNN is bounded by the Weisfeiler-Leman (1-WL) test. Two nodes that $k$ rounds of WL colour refinement cannot distinguish will be assigned identical embeddings by any $k$-layer GNN with sum/mean aggregation - regardless of the weights $W^{(l)}$.
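The "$k$ layers = $k$ BFS hops" claim can be checked numerically with plain NumPy. The path graph and one-hot feature below are illustrative; the weight matrix $W^{(l)}$ and nonlinearity are omitted since they do not change which entries are nonzero:

```python
import numpy as np

# Path graph 0-1-2-3: node 0 is 3 hops from node 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                              # self-loops
d = A_hat.sum(axis=1)
norm = np.diag(d**-0.5) @ A_hat @ np.diag(d**-0.5)  # D^-1/2 (A+I) D^-1/2

# One-hot feature on node 3; after k aggregations, node v "sees" node 3
# iff dist(v, 3) <= k -- the BFS k-hop set.
h = np.array([0.0, 0.0, 0.0, 1.0])
h1 = norm @ h      # 1 hop: only nodes 2 and 3 are nonzero
h2 = norm @ h1     # 2 hops: nodes 1, 2, 3 nonzero; node 0 still exactly 0
```

After two aggregation steps, node 0 still has a zero feature: information from node 3 cannot reach it, exactly as BFS predicts.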
3. Shortest Paths
3.1 The SSSP Problem
Definition. Given a directed or undirected graph $G = (V, E)$ with edge weight function $w : E \to \mathbb{R}$ and a source vertex $s$, the Single-Source Shortest Path (SSSP) problem asks: for every $v \in V$, find the minimum-weight path from $s$ to $v$.
The distance from $s$ to $v$ is:
$\delta(s, v) = \min_{p} \sum_{e \in p} w(e)$
where the minimum is over all paths $p$ from $s$ to $v$. If no path exists, $\delta(s,v) = \infty$. If a negative-weight cycle is reachable on some path, $\delta(s,v) = -\infty$.
Triangle inequality. For every edge $(u,v) \in E$: $\delta(s,v) \leq \delta(s,u) + w(u,v)$. This is the key relaxation invariant used by all SSSP algorithms.
Relaxation. The operation of updating a distance estimate:
RELAX(u, v, w):
if d[v] > d[u] + w(u,v):
d[v] <- d[u] + w(u,v)
parent[v] <- u
All SSSP algorithms consist of choosing which edges to relax, and in what order. BFS is the special case where all weights equal 1 and a FIFO queue replaces the priority queue.
3.2 Dijkstra's Algorithm
Dijkstra's algorithm solves SSSP for graphs with non-negative edge weights. It is a greedy algorithm: at each step it permanently fixes the shortest distance to the closest unvisited vertex.
Algorithm.
DIJKSTRA(G, s):
for each v in V: d[v] <- \infty, parent[v] <- NIL
d[s] <- 0
Q <- min-priority queue with all vertices keyed by d[*]
while Q not empty:
u <- EXTRACT-MIN(Q) <- vertex with smallest tentative distance
for each neighbour v of u:
if d[u] + w(u,v) < d[v]:
d[v] <- d[u] + w(u,v)
parent[v] <- u
DECREASE-KEY(Q, v, d[v]) <- update priority queue
Correctness invariant. When vertex $u$ is extracted from $Q$, $d[u] = \delta(s,u)$ - it is the exact shortest distance. This holds because all weights are non-negative: any alternative path from $s$ to $u$ passing through a still-unvisited vertex $x$ already has prefix cost $d[x] \geq d[u]$ (since $u$ was extracted first), so extending that path cannot improve upon $d[u]$.
Proof sketch. By induction. Base: $s$ is extracted first with $d[s] = 0 = \delta(s,s)$. Inductive step: assume all previously extracted vertices have correct distances. Let $u$ be the next extracted vertex. Suppose for contradiction there exists a shorter path to $u$ crossing the extracted/unextracted frontier at edge $(x, y)$ with $x$ extracted, $y$ not extracted. Then $d[y] \geq d[u]$ (otherwise $y$ would have been extracted before $u$), and the suffix from $y$ to $u$ has non-negative weight, so this path is no shorter than $d[u]$. Contradiction.
Complexity. With a binary heap: $O((n+m)\log n)$ - $n$ EXTRACT-MIN and up to $m$ DECREASE-KEY operations, each $O(\log n)$. With a Fibonacci heap: $O(m + n\log n)$ - theoretically better for dense graphs; rarely used in practice due to large constants.
Failure case. Dijkstra gives incorrect results when any edge has negative weight. Example: edges $s \to b$ with weight 10, $s \to a$ with weight 2, and $b \to a$ with weight $-9$. Dijkstra fixes $d[a] = 2$ when $a$ is extracted, but the true distance is $\delta(s,a) = 1$ via $b$.
For AI. Dijkstra is used in knowledge graph embedding to find the shortest semantic path between entities. In route planning for embodied agents, Dijkstra on a road graph with positive travel times is the standard baseline. Its priority queue structure mirrors the attention mechanism's softmax weighting of neighbour importance.
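Python's heapq has no DECREASE-KEY, so practical implementations use the standard lazy-deletion variant: push a fresh entry on every improvement and skip stale entries on pop. A minimal sketch (the example graph is illustrative):

```python
import heapq

def dijkstra(adj, s):
    """Dijkstra over adj[u] = [(v, w), ...] with non-negative weights w.
    Uses lazy deletion instead of DECREASE-KEY."""
    d = {v: float("inf") for v in adj}
    d[s] = 0
    pq = [(0, s)]
    while pq:
        du, u = heapq.heappop(pq)
        if du > d[u]:              # stale entry: u was already finalised
            continue
        for v, w in adj[u]:
            if du + w < d[v]:      # relaxation
                d[v] = du + w
                heapq.heappush(pq, (d[v], v))
    return d

adj = {"s": [("a", 4), ("b", 1)], "a": [("c", 1)],
       "b": [("a", 2), ("c", 5)], "c": []}
d = dijkstra(adj, "s")
```

The lazy variant keeps the same $O((n+m)\log n)$ bound up to the size of the heap growing to $O(m)$ entries, which is usually an acceptable trade for the much simpler code.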
3.3 Bellman-Ford Algorithm
Bellman-Ford solves SSSP for graphs with arbitrary edge weights (including negative), and detects negative-weight cycles.
Algorithm.
BELLMAN-FORD(G, s):
for each v in V: d[v] <- \infty, parent[v] <- NIL
d[s] <- 0
for i = 1 to n-1: <- n-1 relaxation rounds
for each edge (u,v) in E:
RELAX(u, v, w(u,v))
for each edge (u,v) in E: <- negative cycle check
if d[v] > d[u] + w(u,v):
return "negative cycle exists"
return d[]
Why $n-1$ rounds? Any shortest path that does not repeat a vertex has at most $n-1$ edges. After round $i$, $d[v]$ correctly reflects the shortest path using at most $i$ edges. After $n-1$ rounds, all shortest paths are found (assuming no negative cycle).
Negative cycle detection. After $n-1$ rounds, if any edge still allows relaxation, there is a negative-weight cycle reachable from $s$ - an $n$-th round would continue improving distances, and so would every round after it.
Complexity. $O(nm)$ - $n-1$ rounds, each examining all $m$ edges. Slower than Dijkstra but correct for negative weights.
Dynamic programming view. Let $d^{(i)}[v]$ = length of the shortest path from $s$ to $v$ using at most $i$ edges. Recurrence:
$d^{(i)}[v] = \min\left( d^{(i-1)}[v],\ \min_{(u,v) \in E} \left\{ d^{(i-1)}[u] + w(u,v) \right\} \right)$
This is exactly the Bellman equation in RL (deterministic transitions, no discount):
$V_i(x) = \max_{a} \left[ r(x,a) + V_{i-1}(x'_a) \right]$
with $d^{(i)}$ in the role of $V_i$ (value after $i$ iterations), $w$ in the role of $-r$ (negative reward = positive cost), min in place of max, and $x'_a$ the successor state under action $a$. Value iteration in RL is Bellman-Ford on the state-action graph.
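The edge-list formulation translates directly to Python; the example below includes a negative edge that Dijkstra would mishandle, plus a negative-cycle case (both graphs are illustrative):

```python
def bellman_ford(vertices, edges, s):
    """edges = [(u, v, w), ...]. Returns distance dict, or None if a
    negative-weight cycle is reachable from s."""
    d = {v: float("inf") for v in vertices}
    d[s] = 0
    for _ in range(len(vertices) - 1):      # n-1 relaxation rounds
        for u, v, w in edges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w
    for u, v, w in edges:                   # extra round: negative-cycle check
        if d[u] + w < d[v]:
            return None
    return d

# Negative edge, no negative cycle: true dist to a is 10 - 9 = 1
edges = [("s", "b", 10), ("s", "a", 2), ("b", "a", -9)]
d = bellman_ford(["s", "a", "b"], edges, "s")

# Negative cycle x -> y -> x of total weight -1: detected
cyc = bellman_ford(["x", "y"], [("x", "y", 1), ("y", "x", -2)], "x")
```

A common optimisation (early exit when a full round performs no relaxation) is omitted here for clarity but cuts the average-case cost substantially.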
3.4 Floyd-Warshall: All-Pairs Shortest Paths
Floyd-Warshall solves the All-Pairs Shortest Path (APSP) problem: find $\delta(i,j)$ for all pairs $(i,j)$.
Algorithm. Maintain a matrix $D$ where $D[i][j]$ is the shortest known distance from $i$ to $j$.
FLOYD-WARSHALL(G):
D <- n \times n matrix with D[i][j] = w(i,j) if (i,j) \in E, 0 if i = j, \infty otherwise
for k = 1 to n:
for i = 1 to n:
for j = 1 to n:
D[i][j] <- min(D[i][j], D[i][k] + D[k][j])
return D
Correctness. After the outer-loop iteration for $k$, $D[i][j]$ holds the weight of the shortest path from $i$ to $j$ using only vertices $\{1, \dots, k\}$ as intermediates. After all $n$ iterations, $D$ holds all-pairs shortest distances. The recurrence:
$D^{(k)}[i][j] = \min\left( D^{(k-1)}[i][j],\ D^{(k-1)}[i][k] + D^{(k-1)}[k][j] \right)$
is a DP over the choice of whether vertex $k$ is used as an intermediate.
Transitive closure. Replace min/+ with Boolean OR/AND to compute reachability: $R[i][j]$ is true iff a path from $i$ to $j$ exists. Same structure.
Negative cycles. After Floyd-Warshall, if $D[i][i] < 0$ for some $i$, vertex $i$ lies on a negative cycle.
Complexity. $O(n^3)$ time, $O(n^2)$ space. Practical for $n$ up to a few thousand; infeasible for large graphs.
Matrix interpretation. The Floyd-Warshall update is a tropical matrix product (min-plus algebra): $(A \otimes B)[i][j] = \min_k \left( A[i][k] + B[k][j] \right)$. So $W^{\otimes (n-1)}$ - the $(n-1)$-th tropical power of the weight matrix - gives all-pairs distances. This connects APSP to algebraic path problems over semirings.
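The triple loop is short enough to write out in full. A minimal sketch on a 4-vertex directed graph (the weight matrix is illustrative):

```python
INF = float("inf")

def floyd_warshall(W):
    """W: n x n list-of-lists, W[i][j] = edge weight, INF if absent, 0 on
    the diagonal. Returns the all-pairs distance matrix."""
    n = len(W)
    D = [row[:] for row in W]               # copy; do not mutate the input
    for k in range(n):                      # intermediate vertex
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D

W = [[0,   3,   INF, 7],
     [8,   0,   2,   INF],
     [5,   INF, 0,   1],
     [2,   INF, INF, 0]]
D = floyd_warshall(W)
```

The loop order matters: `k` must be outermost, since iteration `k` assumes all entries of $D^{(k-1)}$ are complete. Swapping the loops is a classic bug.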
3.5 A* Search
A* (Hart, Nilsson & Raphael, 1968) is a best-first search algorithm that guides Dijkstra with a heuristic function $h(v)$ estimating the remaining cost from $v$ to the goal $t$.
Algorithm. Identical to Dijkstra, but the priority-queue key for vertex $v$ is $f(v) = d[v] + h(v)$ instead of just $d[v]$.
A-STAR(G, s, t, h):
for each v: d[v] <- \infty, f[v] <- \infty
d[s] <- 0; f[s] <- h(s)
Q <- min-priority queue keyed by f[*]
enqueue(Q, s)
while Q not empty:
u <- EXTRACT-MIN(Q)
if u = t: return d[t] <- goal found
for each neighbour v of u:
new_d <- d[u] + w(u,v)
if new_d < d[v]:
d[v] <- new_d
f[v] <- new_d + h(v)
update Q with v
Admissibility. A heuristic $h$ is admissible if $h(v) \leq \delta(v, t)$ for all $v$ - it never overestimates. An admissible heuristic guarantees A* finds the optimal path.
Consistency (monotonicity). $h$ is consistent if $h(u) \leq w(u,v) + h(v)$ for all edges $(u,v)$ and $h(t) = 0$. Consistency implies admissibility and guarantees each vertex is processed at most once (like Dijkstra).
Examples of admissible heuristics:
- Grid navigation: Euclidean distance to goal (straight-line, never overestimates)
- Grid with 4-directional movement: Manhattan distance to goal
AI relevance. A* is the standard algorithm in robotics path planning (ROS navigation stack), game AI (Pathfinder, StarCraft), and any domain where a domain-specific heuristic is available. The trade-off: a tighter heuristic (closer to the true distance) leads to fewer node expansions. With $h \equiv 0$, A* degenerates to Dijkstra. With $h = \delta(\cdot, t)$ (perfect heuristic), A* expands only nodes on the optimal path.
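A minimal A* sketch on a 4-connected grid with the Manhattan heuristic from the list above (the grid layout is illustrative; unit-cost moves, so Manhattan distance is admissible and consistent):

```python
import heapq

def astar_grid(grid, start, goal):
    """A* on a 4-connected grid; grid[r][c] == 1 means blocked.
    Returns the optimal move count from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan
    d = {start: 0}
    pq = [(h(start), start)]                # key = f = d + h
    while pq:
        f, u = heapq.heappop(pq)
        if u == goal:
            return d[u]
        r, c = u
        for v in [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]:
            if (0 <= v[0] < rows and 0 <= v[1] < cols
                    and grid[v[0]][v[1]] == 0):
                nd = d[u] + 1
                if nd < d.get(v, float("inf")):
                    d[v] = nd
                    heapq.heappush(pq, (nd + h(v), v))
    return None                              # goal unreachable

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
cost = astar_grid(grid, (0, 0), (2, 0))      # must detour around the wall
```

Because the heuristic is consistent, it is safe to return as soon as the goal is popped: its distance is final at that point.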
3.6 Shortest-Path Complexity Comparison
| Algorithm | Time | Space | Weights | Use Case |
|---|---|---|---|---|
| BFS | $O(n+m)$ | $O(n)$ | Unweighted only | Hop distance, BFS tree |
| Dijkstra (binary heap) | $O((n+m)\log n)$ | $O(n)$ | Non-negative | Standard SP |
| Dijkstra (Fibonacci heap) | $O(m + n\log n)$ | $O(n)$ | Non-negative | Dense graphs, theory |
| Bellman-Ford | $O(nm)$ | $O(n)$ | Any (detects neg cycle) | Negative weights, DP |
| SPFA (Bellman-Ford queue opt.) | $O(nm)$ worst, much faster avg | $O(n)$ | Any | Practical Bellman-Ford |
| Floyd-Warshall | $O(n^3)$ | $O(n^2)$ | Any (no neg cycle) | All-pairs, small $n$ |
| Johnson's | $O(nm + n^2 \log n)$ | $O(n^2)$ | Any | APSP, sparse graphs |
| A* | $O((n+m)\log n)$ worst, far less avg | $O(n)$ | Non-negative | Single-pair with heuristic |
4. Minimum Spanning Trees
4.1 The MST Problem and Key Properties
Definition. Given a connected, undirected, weighted graph $G = (V, E, w)$, a spanning tree is a subgraph that connects all $n$ vertices with exactly $n-1$ edges (i.e., a connected acyclic spanning subgraph). A Minimum Spanning Tree (MST) is a spanning tree $T$ minimising total edge weight: $w(T) = \sum_{e \in T} w(e)$.
Existence. Every connected graph has at least one MST. If all edge weights are distinct, the MST is unique.
Two key structural theorems underlie all MST algorithms:
Cut property. Let $S \subset V$ be any nonempty proper subset of vertices, and let $e = (u,v)$ be the minimum-weight edge crossing the cut (i.e., $u \in S$, $v \in V \setminus S$). Then $e$ belongs to every MST of $G$ (assuming $e$ is the unique minimum; with ties, to some MST).
Intuition: If the minimum cut-crossing edge were not in the MST, we could swap it with a heavier cut-crossing edge in the MST and reduce total weight - contradiction.
Cycle property. Let $C$ be any cycle in $G$, and let $e$ be the unique maximum-weight edge of $C$. Then $e$ does NOT belong to any MST of $G$.
Intuition: If were in the MST, removing it would disconnect the tree into two components; the other edges of provide a cheaper reconnection, contradicting the MST assumption.
Both Kruskal and Prim are instantiations of a generic greedy MST algorithm that repeatedly adds the minimum-weight safe edge (one that doesn't create a cycle and respects the cut property).
Total weight lower bound. The MST weight gives the minimum cost to connect all vertices - any other spanning tree costs at least as much. MST weight is a lower bound for the Travelling Salesman Problem (TSP) on the same metric graph.
4.2 Kruskal's Algorithm
Kruskal's algorithm builds the MST by greedily adding edges in increasing weight order, skipping edges that would form a cycle.
KRUSKAL(G):
Sort edges E by weight: e_1 \leq e_2 \leq ... \leq e_m
T <- \emptyset
for each vertex v: MAKE-SET(v) <- Union-Find initialisation
for each edge (u,v) in sorted order:
if FIND(u) \neq FIND(v): <- u and v in different components?
T <- T \cup {(u,v)}
UNION(u, v) <- merge components
return T
Union-Find (Disjoint Set Union). This data structure maintains a collection of disjoint sets and supports:
- MAKE-SET(x): create a singleton set $\{x\}$
- FIND(x): return the representative ("root") of $x$'s set
- UNION(x, y): merge the sets containing $x$ and $y$
With union by rank and path compression, all operations run in $O(\alpha(n))$ amortised time, where $\alpha$ is the inverse Ackermann function - effectively constant for all practical $n$.
Correctness. Each added edge is the minimum-weight edge crossing some cut $(S, V \setminus S)$ where $S$ is one of the current components. By the cut property, this edge belongs to the MST.
Complexity. Sorting edges: $O(m \log m)$. Union-Find: $O(m\,\alpha(n))$. Total: $O(m \log m) = O(m \log n)$, dominated by the sort.
Example. Graph on $\{a, b, c, d\}$ with edges, in sorted order, $(a,b,1), (b,c,2), (c,d,3), (a,c,4), (b,d,5)$:
- Add $(a,b)$: components $\{a,b\}, \{c\}, \{d\}$
- Add $(b,c)$: components $\{a,b,c\}, \{d\}$
- Add $(c,d)$: all connected - done. MST weight $1 + 2 + 3 = 6$
- Skip $(a,c)$: would form cycle
- Skip $(b,d)$: would form cycle
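A compact Python sketch of Kruskal with a rank-based Union-Find (path halving stands in for full path compression; the example graph is illustrative):

```python
class UnionFind:
    """Disjoint-set union with union by rank and path halving."""
    def __init__(self, elems):
        self.parent = {x: x for x in elems}
        self.rank = {x: 0 for x in elems}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False                  # already in the same component
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx              # attach smaller-rank root
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

def kruskal(vertices, edges):
    """edges = [(w, u, v), ...]; returns (mst_edges, total_weight)."""
    uf = UnionFind(vertices)
    mst, total = [], 0
    for w, u, v in sorted(edges):         # increasing weight order
        if uf.union(u, v):                # u, v in different components
            mst.append((u, v))
            total += w
    return mst, total

edges = [(1, "a", "b"), (2, "b", "c"), (3, "c", "d"),
         (4, "a", "c"), (5, "b", "d")]
mst, total = kruskal("abcd", edges)
```

The `union` return value doubles as the cycle test: `False` means both endpoints already share a root, so the edge would close a cycle.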
4.3 Prim's Algorithm
Prim's algorithm grows the MST one vertex at a time, always attaching the cheapest edge connecting the current tree to an unvisited vertex.
PRIM(G, r):
for each v: key[v] <- \infty, parent[v] <- NIL, inMST[v] <- FALSE
key[r] <- 0 <- start from root r
Q <- min-priority queue of all vertices keyed by key[*]
while Q not empty:
u <- EXTRACT-MIN(Q)
inMST[u] <- TRUE
for each neighbour v of u:
if NOT inMST[v] AND w(u,v) < key[v]:
key[v] <- w(u,v)
parent[v] <- u
DECREASE-KEY(Q, v, key[v])
return {(v, parent[v]) : v \neq r}
Analogy with Dijkstra. Prim and Dijkstra have identical structure - the only difference is the priority key: Dijkstra uses $d[u] + w(u,v)$ (cumulative path cost); Prim uses $w(u,v)$ (edge cost alone). Prim greedily picks the cheapest edge into the tree; Dijkstra greedily picks the cheapest path to any vertex.
Complexity. $O((n+m)\log n)$ with a binary heap - same as Dijkstra.
When to prefer Prim vs Kruskal:
- Kruskal is simpler to implement and better for sparse graphs ($m$ close to $n$)
- Prim is better for dense graphs ($m$ close to $n^2$), especially with an adjacency matrix
- Both produce an MST with the same total weight (unique if weights are distinct)
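As with Dijkstra, a practical Python Prim uses the lazy heap variant (push candidate edges, discard stale ones) instead of DECREASE-KEY. A minimal sketch (the example graph is illustrative):

```python
import heapq

def prim(adj, r):
    """Prim's MST from root r over undirected adj[u] = [(v, w), ...].
    Returns (tree_edges, total_weight)."""
    in_mst = {r}
    tree, total = [], 0
    pq = [(w, r, v) for v, w in adj[r]]    # candidate edges out of the tree
    heapq.heapify(pq)
    while pq and len(in_mst) < len(adj):
        w, u, v = heapq.heappop(pq)
        if v in in_mst:                    # stale edge: both ends in tree
            continue
        in_mst.add(v)
        tree.append((u, v))
        total += w
        for x, wx in adj[v]:
            if x not in in_mst:
                heapq.heappush(pq, (wx, v, x))
    return tree, total

adj = {"a": [("b", 1), ("c", 4)],
       "b": [("a", 1), ("c", 2), ("d", 5)],
       "c": [("a", 4), ("b", 2), ("d", 3)],
       "d": [("b", 5), ("c", 3)]}
tree, total = prim(adj, "a")
```

On this graph Prim and Kruskal agree, as they must: the MST weight is unique even when the tree itself is not.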
4.4 AI Applications of MSTs
Graph-based clustering. Remove the $k-1$ heaviest edges from the MST to obtain $k$ clusters. This is single-linkage hierarchical clustering - equivalent to building the MST and pruning. Each resulting subtree is a cluster. The method is sensitive to outliers (a long edge can connect two distant clusters) but runs in $O(m \log n)$ and is otherwise parameter-free.
Feature selection. Build a minimum spanning tree on a graph where nodes are features and edge weights are mutual information between features. The MST structure reveals the most informative non-redundant features.
Network design. Connect nodes (data centres, sensors, brain regions) with minimum total cable cost while ensuring full connectivity.
Approximate TSP. The MST provides a 2-approximation for TSP on metric graphs: walk the MST in DFS order and shortcut repeated vertices. This gives a tour of cost at most $2 \cdot w(\mathrm{MST}) \leq 2 \cdot \mathrm{OPT}$. Christofides' algorithm (1976) uses MST + minimum-weight perfect matching for a 1.5-approximation.
Neuroscience. Brain connectivity graphs are often analysed via MST to find the backbone connectivity structure with minimal total connection weight.
5. Topological Sort and DAG Algorithms
5.1 Directed Acyclic Graphs (DAGs)
A Directed Acyclic Graph (DAG) is a directed graph with no directed cycles. Formally, there is no cycle $v_1 \to v_2 \to \cdots \to v_k \to v_1$ of distinct vertices.
Recognition. Run DFS. If DFS finds a back edge (an edge from to an ancestor in the DFS tree), the graph has a cycle and is not a DAG. If DFS completes without finding any back edge, the graph is a DAG.
Examples of DAGs in AI:
- Computation graphs: every PyTorch/JAX forward pass. Each op node has edges to the ops whose outputs it consumes.
- Causal graphs: nodes are random variables, edges are causal influences. A structural causal model (SCM) is a DAG.
- Dependency graphs: package managers (pip, conda, apt) resolve dependencies on a DAG.
- Scheduling: project tasks with precedence constraints.
- Probabilistic graphical models: Bayesian networks are DAGs where edges represent conditional dependencies.
In-degree and out-degree in DAGs. Every DAG has at least one source (in-degree 0) and at least one sink (out-degree 0). This follows because if every vertex had a predecessor, following predecessors in a finite graph would eventually revisit a vertex, producing a directed cycle - contradicting acyclicity.
5.2 Topological Sort
A topological ordering of a DAG is a linear ordering $v_1, v_2, \dots, v_n$ of its vertices such that for every directed edge $(v_i, v_j)$, we have $i < j$ (the source comes before the target).
Existence. Every DAG has at least one topological ordering. (Proof: repeatedly remove a source vertex and its outgoing edges; by the DAG property, at least one source always exists.)
Non-uniqueness. If there are multiple vertices with in-degree 0, different choices yield different topological orderings.
Algorithm 1: Kahn's Algorithm (BFS-based)
KAHN(G):
Compute in-degree indeg[v] for each v
Q <- queue of all vertices with indeg[v] = 0 <- sources
order <- empty list
while Q not empty:
u <- dequeue(Q)
append u to order
for each edge (u,v):
indeg[v] <- indeg[v] - 1
if indeg[v] = 0: enqueue(Q, v)
if |order| < n: return "graph has a cycle"
return order
Kahn's algorithm processes vertices in topological order, removing each vertex and its outgoing edges. If it processes fewer than $n$ vertices, the remaining vertices contain a cycle.
Algorithm 2: DFS-based
Run DFS. As each vertex $v$ is finished (colour turns BLACK), prepend $v$ to the result list. The final list - vertices in decreasing finish-time order - is a valid topological ordering.
Why? For any edge $(u,v)$ in a DAG, $fin[v] < fin[u]$: $v$ must finish before $u$, because there is no path from $v$ back to $u$ (that would be a cycle). So decreasing finish time gives a correct topological order.
Complexity. Both algorithms run in $O(n+m)$.
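Kahn's algorithm in Python; the example DAG mimics a small computation graph (names like "loss" are illustrative):

```python
from collections import deque

def kahn(adj):
    """Topological sort of adjacency-list dict adj; None if a cycle exists."""
    indeg = {v: 0 for v in adj}
    for u in adj:
        for v in adj[u]:
            indeg[v] += 1
    q = deque(v for v in adj if indeg[v] == 0)   # all sources
    order = []
    while q:
        u = q.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1                        # "remove" edge (u, v)
            if indeg[v] == 0:
                q.append(v)
    return order if len(order) == len(adj) else None

# Tiny autograd-style DAG: x and w both feed z
dag = {"x": ["z"], "w": ["z"], "z": ["a"], "a": ["loss"], "loss": []}
order = kahn(dag)
```

Swapping the deque for a min-heap yields the lexicographically smallest topological order, a common interview variant.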
5.3 Longest and Shortest Paths in DAGs
In general graphs, finding longest paths is NP-hard (it generalises TSP). But in DAGs, both longest and shortest paths can be found in $O(n+m)$ using DP over the topological ordering.
Algorithm.
DAG-SSSP(G, s):
d[s] <- 0; d[v] <- \infty for all v \neq s
for each u in topological order of G:
for each edge (u,v):
RELAX(u, v, w(u,v))
return d[]
For longest paths, initialise $d[s] = 0$ and $d[v] = -\infty$ for $v \neq s$, and replace the minimisation in RELAX with a maximisation.
Critical Path Method. In project scheduling, nodes are tasks and edges represent precedence. Edge weights are durations. The critical path is the longest path in the DAG - it determines the minimum project duration. Earliest/latest start times for each task are computed via two DAG DP passes (forward and backward).
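A sketch of the maximising variant, applied to a small critical-path example (task names and durations are illustrative; the topological order is supplied explicitly, e.g. from Kahn's algorithm):

```python
def dag_longest_path(adj, topo_order, s):
    """Longest-path distances from s on a DAG. adj[u] = [(v, w), ...];
    topo_order must be a valid topological ordering of the DAG."""
    d = {v: float("-inf") for v in adj}
    d[s] = 0
    for u in topo_order:
        if d[u] == float("-inf"):
            continue                     # u not reachable from s
        for v, w in adj[u]:
            if d[u] + w > d[v]:          # maximising RELAX
                d[v] = d[u] + w
    return d

# Task graph with durations on edges; critical path: start -> b -> c -> end
adj = {"start": [("a", 2), ("b", 4)], "a": [("c", 1)],
       "b": [("c", 3)], "c": [("end", 2)], "end": []}
topo = ["start", "a", "b", "c", "end"]
d = dag_longest_path(adj, topo, "start")
```

Here `d["end"]` is the minimum project duration: no schedule can finish before the longest chain of precedence-constrained tasks completes.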
5.4 AI Connection: Autograd Computation Graphs
This is one of the most important applications of graph algorithms in all of modern AI.
PyTorch's autograd engine builds a DAG when executing a forward pass. Every tensor operation (matmul, relu, softmax, ...) creates a node; edges connect inputs to outputs.
PYTORCH AUTOGRAD DAG (simplified):
========================================================================
Forward pass (topological order, sources to sinks):
x ---matmul---> z ---relu---> a ---softmax---> p ---cross_entropy---> L
      ^
      |
      W   (weight tensor feeding matmul; relu has no weight)
Backward pass (reverse topological order, sinks to sources):
∂L/∂p ---> ∂L/∂a ---> ∂L/∂z ---> ∂L/∂x, ∂L/∂W (computed in this order)
At each node: multiply incoming gradient by local Jacobian (chain rule)
Accumulate gradients at leaf nodes (parameters)
========================================================================
Forward pass = topological traversal from input tensors (sources) to the loss (sink). Each node applies its operation.
Backward pass = reverse topological traversal. Each node receives the upstream gradient and propagates to its predecessors. The gradient for shared tensors is accumulated (summed) over all incoming edges.
Why DAG and not a tree? When a tensor is used multiple times (e.g., weight sharing, skip connections in ResNet), it has multiple out-edges. The backward pass sums gradients from all out-edges - this is the graph-theoretic explanation for gradient accumulation.
No-cycle guarantee. The autograd engine assumes the computation graph is a DAG. Recurrent networks (RNNs) are unrolled through time into DAGs before differentiation - this is Backpropagation Through Time (BPTT), which is simply differentiation on an unrolled DAG.
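The whole mechanism - build the DAG during the forward pass, topologically sort it, sweep in reverse, and accumulate gradients over shared out-edges - fits in a toy scalar autograd class. This is a hypothetical illustration of the DAG mechanics, not PyTorch's actual engine:

```python
class Var:
    """Minimal scalar autograd node (toy illustration)."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # in-edges of the computation DAG
        self.local_grads = local_grads  # d(self)/d(parent) for each parent
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Var(self.value * other.value, (self, other),
                   (other.value, self.value))

    def backward(self):
        # DFS finish order gives a topological order of the DAG
        topo, seen = [], set()
        def visit(v):
            if id(v) in seen:
                return
            seen.add(id(v))
            for p in v.parents:
                visit(p)
            topo.append(v)
        visit(self)
        self.grad = 1.0                   # dL/dL = 1
        for v in reversed(topo):          # sinks -> sources
            for p, g in zip(v.parents, v.local_grads):
                p.grad += v.grad * g      # chain rule + accumulation
```

Weight sharing shows the accumulation: for `y = w*w + w`, the node `w` has three out-edges, and `w.grad` sums contributions from all of them (2w + 1 = 7 at w = 3).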
6. Strongly Connected Components
6.1 Connectivity in Directed Graphs
In undirected graphs, connectivity is binary: two vertices are either connected or not. In directed graphs, connectivity is richer.
Definitions.
- Vertices u and v are mutually reachable if there exists a directed path u -> v AND a directed path v -> u.
- A Strongly Connected Component (SCC) is a maximal subset S \subseteq V such that every pair u, v \in S is mutually reachable.
- Every directed graph has a unique partition of its vertex set into SCCs.
- A vertex v that lies on no directed cycle forms its own trivial SCC {v}.
Condensation graph. Contract each SCC to a single node. The result is a DAG - the condensation (also called the SCC DAG or meta-graph). This DAG reveals the high-level structure of the directed graph: SCCs with no incoming edges are "sources" (starting points for information flow); SCCs with no outgoing edges are "sinks".
Example. For the directed graph with a cycle 1 -> 2 -> 3 -> 1, a bridge edge 3 -> 4, and another cycle 4 -> 5 -> 6 -> 4:
- SCCs: C1 = {1, 2, 3} (cycle) and C2 = {4, 5, 6} (cycle), each a single strongly connected component
- Condensation: C1 -> C2 - a DAG with two nodes and one edge
6.2 Kosaraju's Algorithm
Kosaraju's algorithm finds all SCCs with two DFS passes.
KOSARAJU(G):
Phase 1: Run DFS on G; push vertices onto stack S in finish order
Phase 2: Compute G^T (reverse all edges)
Run DFS on G^T, exploring in decreasing finish order (pop from S)
Each DFS tree in phase 2 is one SCC
Why it works. In G, a vertex with the highest finish time lies in a source SCC of the condensation. In G^T, the condensation edges are reversed. A DFS from such a vertex v in G^T can only reach vertices in v's own SCC (it cannot cross condensation edges into other SCCs, because those edges have been reversed). So each DFS tree in G^T exactly recovers one SCC.
Complexity. Two DFS passes: each. Graph reversal: . Total: .
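The two phases can be sketched compactly in Python (iterative DFS in phase 1 to avoid recursion limits; vertices labelled 0..n-1 are an assumption of this sketch):

```python
from collections import defaultdict

def kosaraju(n, edges):
    """Returns comp[v] = SCC id of vertex v."""
    adj, radj = defaultdict(list), defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)       # G^T built alongside G

    # Phase 1: DFS on G, recording finish order (iterative, with explicit stack)
    visited, order = [False] * n, []
    for s in range(n):
        if visited[s]:
            continue
        visited[s] = True
        stack = [(s, iter(adj[s]))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if not visited[v]:
                    visited[v] = True
                    stack.append((v, iter(adj[v])))
                    break
            else:                # all neighbours done: u finishes
                order.append(u)
                stack.pop()

    # Phase 2: DFS on G^T in decreasing finish order; each tree is one SCC
    comp, c = [-1] * n, 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = c
        stack = [s]
        while stack:
            u = stack.pop()
            for v in radj[u]:
                if comp[v] == -1:
                    comp[v] = c
                    stack.append(v)
        c += 1
    return comp
```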
6.3 Tarjan's Algorithm
Tarjan's algorithm finds all SCCs in a single DFS pass using low-link values.
Low-link value. low[v] = the minimum discovery time disc[w] over all vertices w reachable from the subtree rooted at v via tree edges plus at most one back edge.
TARJAN(G):
disc[] <- all -1, low[] <- 0
onStack[] <- all FALSE
stack S <- empty, time <- 0, sccs <- []
DFS-VISIT(v):
disc[v] <- low[v] <- time++
push v onto S; onStack[v] <- TRUE
for each edge (v,w):
if disc[w] = -1: <- tree edge
DFS-VISIT(w)
low[v] <- min(low[v], low[w])
elif onStack[w]: <- back edge to stack vertex
low[v] <- min(low[v], disc[w])
if low[v] = disc[v]: <- v is root of an SCC
pop vertices from S until v, forming one SCC
for each v with disc[v] = -1: DFS-VISIT(v)
The key insight. A vertex v is the root of an SCC if and only if no back edge from v's subtree escapes to a vertex discovered before v - i.e., low[v] = disc[v]. When this condition holds, all vertices on the stack from v upward (including v) form one SCC.
Complexity. Single DFS pass: .
Comparison with Kosaraju:
- Both are ; Tarjan is faster in practice (one pass vs two, no graph reversal)
- Kosaraju is easier to reason about (correctness follows from DFS finish order properties)
- Tarjan is more memory-efficient (no need to store the reversed graph)
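The pseudocode above translates directly to Python (recursive sketch; for very deep graphs use an iterative variant or raise the recursion limit, cf. the pitfalls in 9.3):

```python
from collections import defaultdict

def tarjan_scc(n, edges):
    """Returns a list of SCCs, each a list of vertices (0..n-1 assumed)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)

    disc = [-1] * n
    low = [0] * n
    on_stack = [False] * n
    stack, sccs = [], []
    time = [0]                          # mutable counter shared by closures

    def visit(v):
        disc[v] = low[v] = time[0]
        time[0] += 1
        stack.append(v)
        on_stack[v] = True
        for w in adj[v]:
            if disc[w] == -1:           # tree edge
                visit(w)
                low[v] = min(low[v], low[w])
            elif on_stack[w]:           # back edge to a stack vertex
                low[v] = min(low[v], disc[w])
        if low[v] == disc[v]:           # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack[w] = False     # crucial: reset the flag
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in range(n):
        if disc[v] == -1:
            visit(v)
    return sccs
```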
6.4 AI Applications of SCCs
Knowledge graph analysis. In a directed knowledge graph (entity -> relation -> entity), SCCs correspond to groups of entities that are mutually related via directed paths. Trivial SCCs (singletons) are "leaf" entities; large SCCs indicate dense relationship clusters.
Circular dependency detection. Package dependency graphs should be DAGs. If the package manager finds an SCC with more than one node, there is a circular dependency - a build-system error.
PageRank and the condensation. The SCC structure of the web graph explains PageRank's failure modes: sink SCCs act as "spider traps" that absorb a random walker, and dangling nodes leak probability mass. SCC analysis identifies exactly these structures, motivating the damping (teleportation) term, and the condensation DAG permits block-wise PageRank computation (solve within each SCC, then propagate along condensation edges).
Constraint propagation. In SAT solving (Boolean satisfiability), 2-SAT problems reduce to finding SCCs in an implication graph. A formula is satisfiable iff no variable and its negation appear in the same SCC.
7. Maximum Flow and Minimum Cut
7.1 Flow Networks
A flow network is a directed graph G = (V, E) with:
- A source s (no incoming edges in the abstract model)
- A sink t (no outgoing edges)
- A capacity function c : E -> R_{\geq 0} giving the maximum flow on each edge
A flow is a function f : E -> R_{\geq 0} satisfying:
- Capacity constraint: 0 \leq f(e) \leq c(e) for all e \in E
- Flow conservation: For each v \in V \setminus {s, t}: \sum_{(u,v) \in E} f(u,v) = \sum_{(v,w) \in E} f(v,w) (flow in = flow out)
The value of flow f is |f| = \sum_{(s,v) \in E} f(s,v) - total flow leaving the source.
Goal: Find a flow f* maximising |f|.
7.2 Ford-Fulkerson and Augmenting Paths
Residual graph. Given flow , the residual graph captures where flow can be added or cancelled:
- For edge (u,v) \in E: forward residual capacity c_f(u,v) = c(u,v) - f(u,v)
- For edge (u,v) \in E: backward residual capacity c_f(v,u) = f(u,v) (flow can be cancelled)
An augmenting path is any path from s to t in G_f with positive residual capacity on every edge.
FORD-FULKERSON(G, s, t):
f(u,v) <- 0 for all edges
while there exists an augmenting path P in G_f:
\delta <- min capacity on P (bottleneck)
augment flow along P by \delta
return f
Correctness. The algorithm terminates (for integer capacities; may not terminate for irrational capacities with bad path choices) and returns a maximum flow, by the Max-Flow Min-Cut theorem (7.4).
Complexity. O(m |f*|) for integer capacities, where |f*| is the maximum flow value: each augmentation takes O(m) and increases the flow by at least 1. This can be exponential in the input size if edge capacities are large.
7.3 Edmonds-Karp Algorithm
Edmonds-Karp (1972) is Ford-Fulkerson with BFS used to find augmenting paths (shortest path in in terms of number of edges).
Complexity. O(nm^2) - polynomial regardless of edge capacities. The key insight: augmenting along a shortest path guarantees that the number of augmentation steps is at most O(nm), each costing one O(m) BFS.
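A minimal dense-matrix sketch of Edmonds-Karp (fine for small n; production code uses adjacency lists with paired edge indices):

```python
from collections import deque

def edmonds_karp(n, cap_edges, s, t):
    """Max-flow value; cap_edges: list of (u, v, capacity), vertices 0..n-1."""
    cap = [[0] * n for _ in range(n)]      # residual capacity matrix
    for u, v, c in cap_edges:
        cap[u][v] += c
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:
            return flow                    # no augmenting path: flow is maximum
        # bottleneck capacity along the path
        delta, v = float('inf'), t
        while v != s:
            u = parent[v]
            delta = min(delta, cap[u][v])
            v = u
        # augment: decrease forward, increase backward residual capacity
        v = t
        while v != s:
            u = parent[v]
            cap[u][v] -= delta
            cap[v][u] += delta             # the crucial backward edge
            v = u
        flow += delta
```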
7.4 The Max-Flow Min-Cut Theorem
Cut definition. An s-t cut partitions V into S (containing s) and T = V \setminus S (containing t). The capacity of the cut is c(S,T) = \sum_{u \in S, v \in T} c(u,v).
Weak duality. For any flow f and any s-t cut (S,T): |f| \leq c(S,T). (Proof: all flow must cross the cut, and the crossing flow is bounded by the cut capacity.)
Theorem (Max-Flow Min-Cut). The maximum value of any s-t flow equals the minimum capacity of any s-t cut: max_f |f| = min_{(S,T)} c(S,T).
Proof sketch. When Ford-Fulkerson terminates, there is no augmenting path in G_f. Let S = {v : v is reachable from s in G_f}. Then (S, V \setminus S) is an s-t cut with no residual capacity - meaning every edge from S to T is saturated (f(u,v) = c(u,v)) and every edge from T to S carries zero flow. Thus |f| = c(S,T). Since |f| \leq c(S',T') for every cut, f is maximum and (S,T) is minimum.
Integrality theorem. If all capacities are integers, there exists a maximum flow that is integer-valued on every edge.
7.5 AI Connections: Graph Cuts and Partitioning
Image segmentation. Model pixels as nodes; adjacent pixels connected by edges weighted by colour similarity (high weight = similar). Source = "foreground" terminal, sink = "background" terminal. Add edges from to pixels likely to be foreground, from pixels likely to be background to . The min-cut partitions pixels into foreground/background - this is the Boykov-Kolmogorov graph cut algorithm used throughout computer vision (2001-2010).
Graph partitioning. Dividing a graph into balanced parts with minimal cut capacity - used for distributed computing (partition a computation graph across GPUs), parallel simulation, and community detection.
Spectral connection (forward reference). The normalised cut (Shi & Malik 2000) minimises Ncut(S) = cut(S, \bar{S})/vol(S) + cut(S, \bar{S})/vol(\bar{S}), which approximates h_G - the Cheeger constant of the graph. The Cheeger inequality (04) bounds h_G in terms of the Fiedler eigenvalue \lambda_2 of the normalised Laplacian: \lambda_2 / 2 \leq h_G \leq \sqrt{2 \lambda_2}. This is how max-flow/min-cut and spectral graph theory connect.
-> Full treatment of spectral clustering: 04 Spectral Graph Theory
8. Advanced Topics
8.1 Bidirectional Search
Motivation. Standard Dijkstra from source s to target t explores a "ball" of radius d(s,t) around s. For large graphs, this ball can be enormous.
Bidirectional Dijkstra runs two simultaneous Dijkstra searches: one forward from and one backward from (on the reversed graph). The searches alternate, and terminate when a vertex is settled by both.
Speedup. In the best case, each search explores a ball of radius d(s,t)/2. On road-like (roughly planar) graphs, where the number of explored vertices scales with the ball's area, this replaces one ball of area ~ \pi d^2 with two of total area ~ \pi d^2 / 2 - a factor of 2 in practice, but the saving can be much larger for graphs with clustered structure.
ALT (A* with Landmarks and Triangle inequality). A practical enhancement: precompute shortest distances from a small set of landmark vertices. Use these to compute tight admissible heuristics for A*. Used in production road routing (OpenStreetMap, Google Maps).
Contraction Hierarchies. The leading practical algorithm for road network routing: precompute a hierarchy of "shortcut" edges that compress long paths. Query time is milliseconds for continent-scale road graphs. Used in OSRM and many navigation systems.
8.2 Johnson's Algorithm
Johnson's algorithm solves APSP for sparse graphs with negative edge weights (but no negative cycles), more efficiently than running Floyd-Warshall.
Key idea. Reweight edges to make all weights non-negative, then run Dijkstra from every source.
JOHNSON(G):
1. Add new vertex q with zero-weight edges to all vertices
2. Run Bellman-Ford from q to compute h[v] = \delta(q,v) for all v
3. Reweight: w'(u,v) = w(u,v) + h[u] - h[v] \geq 0
4. For each source s: run Dijkstra with weights w'
5. Adjust distances: \delta(u,v) = \delta'(u,v) - h[u] + h[v]
Complexity. O(nm + n^2 log n) - one Bellman-Ford (O(nm)) plus n Dijkstra runs (O(n^2 log n + nm) total with a Fibonacci heap; O(nm log n) with a binary heap). Better than Floyd-Warshall's O(n^3) for sparse graphs where m = o(n^2).
Correctness of reweighting. The reweighted edge weights are non-negative, by the triangle inequality on Bellman-Ford distances: h[v] \leq h[u] + w(u,v), hence w'(u,v) = w(u,v) + h[u] - h[v] \geq 0. The reweighting preserves shortest paths: along any u-v path the h-terms telescope, so every path's length shifts by the same constant h[u] - h[v], leaving the ranking of paths unchanged.
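The five steps fit in one routine (a sketch with binary-heap Dijkstra; the virtual vertex q is simulated by initialising h[] to zero, which is equivalent to relaxing q's zero-weight edges once):

```python
import heapq

def johnson(n, wedges):
    """All-pairs shortest paths for sparse graphs with negative edges
    (no negative cycles). wedges: list of (u, v, w); vertices 0..n-1.
    Returns an n x n distance matrix (float('inf') = unreachable)."""
    INF = float('inf')

    # Steps 1-2: Bellman-Ford from a virtual source q (h[] starts at 0)
    h = [0] * n
    changed = True
    for _ in range(n):
        changed = False
        for u, v, w in wedges:
            if h[u] + w < h[v]:
                h[v] = h[u] + w
                changed = True
        if not changed:
            break
    if changed:                       # still relaxing after n rounds
        raise ValueError("negative cycle detected")

    # Step 3: reweight so that every edge weight is non-negative
    adj = [[] for _ in range(n)]
    for u, v, w in wedges:
        adj[u].append((v, w + h[u] - h[v]))

    # Steps 4-5: Dijkstra from each source, then undo the reweighting
    D = [[INF] * n for _ in range(n)]
    for s in range(n):
        dist = [INF] * n
        dist[s] = 0
        pq = [(0, s)]
        while pq:
            du, u = heapq.heappop(pq)
            if du > dist[u]:
                continue              # stale heap entry
            for v, w in adj[u]:
                if du + w < dist[v]:
                    dist[v] = du + w
                    heapq.heappush(pq, (dist[v], v))
        for v in range(n):
            if dist[v] < INF:
                D[s][v] = dist[v] - h[s] + h[v]
    return D
```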
8.3 Parallel and GPU Graph Algorithms
Frontier-based parallel BFS. Standard BFS is sequential (queue pops one vertex at a time). Parallel BFS processes the entire current frontier simultaneously:
PARALLEL-BFS(G, s):
frontier <- {s}
d[s] <- 0
while frontier not empty:
next_frontier <- \emptyset
for each u in frontier (in parallel):
for each neighbour v of u:
if d[v] = \infty (using atomic compare-and-swap):
d[v] <- d[u] + 1
add v to next_frontier
frontier <- next_frontier
This is called level-synchronous BFS or frontier BFS. On a GPU with thousands of cores, all vertices in the frontier are processed simultaneously. For billion-node social network graphs, frontier BFS with GPU parallelism achieves throughputs of billions of traversed edges per second (NVIDIA cuGraph, 2023).
GraphBLAS. The GraphBLAS standard (2017) expresses graph algorithms as sparse linear algebra over user-defined semirings. BFS becomes matrix-vector multiplication over the (OR, AND) semiring (Boolean frontier propagation). Dijkstra becomes SpMV over the (min, +) tropical semiring. This unification enables highly optimised GPU implementations.
GNN aggregation as SpMM. The GNN message-passing step is a Sparse Matrix times Dense Matrix (SpMM) operation. Modern frameworks (PyG, DGL) implement this as a single GPU kernel. The time complexity is O(md), where d is the feature dimension - identical to running BFS with a d-dimensional state vector per node.
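The semiring view is easy to demonstrate: BFS as repeated matrix-vector products over the (OR, AND) semiring. This is a dense, pure-Python illustration only; real GraphBLAS systems use sparse kernels:

```python
def bfs_semiring(A, s):
    """Hop distances from s via Boolean frontier propagation.

    A: dense adjacency matrix (list of lists, truthy entries = edges).
    Each step computes frontier^T A over the (OR, AND) semiring,
    masked to unvisited vertices.
    """
    n = len(A)
    INF = float('inf')
    d = [INF] * n
    d[s] = 0
    frontier = [j == s for j in range(n)]
    level = 0
    while any(frontier):
        level += 1
        # nxt[j] = OR_i (frontier[i] AND A[i][j]), masked by "not yet visited"
        nxt = [any(frontier[i] and A[i][j] for i in range(n)) and d[j] == INF
               for j in range(n)]
        for j in range(n):
            if nxt[j]:
                d[j] = level
        frontier = nxt
    return d
```

Swapping the semiring to (min, +) on a weighted matrix turns the same loop into a Bellman-Ford-style relaxation, which is the unification GraphBLAS exploits.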
8.4 Approximation Algorithms
Metric TSP (2-approximation via MST). Given cities with metric distances (satisfying triangle inequality), find the shortest tour visiting all cities.
- Compute MST T (cost(T) \leq OPT, since deleting any edge of an optimal tour gives a spanning tree)
- Perform an Euler tour (DFS) of T with doubled edges, shortcutting repeated vertices
- Result: tour of cost \leq 2 cost(T) \leq 2 OPT
Greedy max-cut (0.5-approximation). Partition V into two sets A and B to maximise the number of edges between A and B.
- Greedy: for each vertex, assign it to the side that maximises its cut contribution
- Result: at least m/2 edges in the cut (a uniformly random partition cuts each edge with probability 1/2 in expectation)
- Goemans-Williamson SDP relaxation achieves a 0.878-approximation - a landmark result in approximation algorithms
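The greedy max-cut rule can be sketched as follows (the vertex scan order is an arbitrary choice of this sketch; each vertex joins the side that cuts more edges to already-placed neighbours, so every edge is cut with "probability" at least 1/2 by the time its later endpoint is placed):

```python
def greedy_max_cut(n, edges):
    """Greedy 0.5-approximation for max-cut on vertices 0..n-1.

    Returns (side, cut_size) where side[v] is the set assignment of v
    and cut_size >= m/2 is guaranteed.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    side = [False] * n                   # False = set A, True = set B
    for v in range(n):
        placed = [u for u in adj[v] if u < v]         # already-placed neighbours
        cut_if_A = sum(1 for u in placed if side[u])      # edges cut if v joins A
        cut_if_B = sum(1 for u in placed if not side[u])  # edges cut if v joins B
        side[v] = cut_if_B >= cut_if_A   # pick the better side
    cut = sum(1 for u, v in edges if side[u] != side[v])
    return side, cut
```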
9. Algorithm Selection Framework
9.1 Decision Flowchart
GRAPH ALGORITHM SELECTION
========================================================================
What do you need?
|
+- TRAVERSAL / REACHABILITY
| +- Shortest hop-distance? ----------------------> BFS O(n+m)
| +- Detect cycles? -----------------------------> DFS O(n+m)
| +- Connected components? ----------------------> BFS or DFS
| +- SCCs (directed)? ---------------------------> Tarjan O(n+m)
|
+- SHORTEST PATHS
| +- Unweighted graph? --------------------------> BFS O(n+m)
| +- Non-negative weights?
| | +- Single source ------------------------> Dijkstra O((n+m)logn)
| | +- Single pair, heuristic known ---------> A* O(m logn)
| | +- All pairs, dense graph ---------------> Floyd-Warshall O(n^3)
| +- Negative weights (no neg cycle)?
| | +- Single source ------------------------> Bellman-Ford O(nm)
| | +- All pairs, sparse graph --------------> Johnson's O(nm + n^2logn)
| +- Need negative cycle detection? ------------> Bellman-Ford
|
+- SPANNING TREES
| +- Sparse graph (m close to n) ----------------> Kruskal O(m logm)
| +- Dense graph (m close to n^2) ---------------> Prim O((n+m)logn)
|
+- DAG PROBLEMS
| +- Ordering with dependencies ----------------> Topological Sort O(n+m)
| +- Shortest path in DAG ----------------------> DAG-DP O(n+m)
| +- Critical path (project scheduling) --------> DAG longest path O(n+m)
|
+- FLOW / CUT
+- Max flow (small graph) ---------------------> Ford-Fulkerson
+- Max flow (large graph) ---------------------> Edmonds-Karp O(nm^2)
========================================================================
9.2 Complexity Summary Table
| Problem | Best Algorithm | Time | Notes |
|---|---|---|---|
| Reachability | BFS/DFS | O(n+m) | Linear optimal |
| Connected components | BFS/DFS | O(n+m) | All components in one pass |
| Bipartiteness | BFS | O(n+m) | 2-colour during BFS |
| SSSP unweighted | BFS | O(n+m) | Exact hop distances |
| SSSP non-neg weights | Dijkstra | O((n+m) log n) | Binary heap |
| SSSP arbitrary weights | Bellman-Ford | O(nm) | Detects neg cycles |
| SSSP negative cycle check | Bellman-Ford | O(nm) | n-th round test |
| APSP dense | Floyd-Warshall | O(n^3) | Simple DP |
| APSP sparse neg weights | Johnson's | O(nm + n^2 log n) | Reweight + Dijkstra |
| MST | Kruskal / Prim | O(m log n) | Both optimal |
| DAG topological sort | Kahn / DFS | O(n+m) | Linear optimal |
| SCC | Tarjan / Kosaraju | O(n+m) | Linear optimal |
| Max flow | Edmonds-Karp | O(nm^2) | Polynomial guarantee |
| Single-pair SP w/ heuristic | A* | O(m log n) avg | Admissible heuristic |
9.3 Implementation Pitfalls
Ten common bugs in graph algorithm implementations:
1. Integer overflow in weights. Initialising distances to `INT_MAX` and adding a positive weight overflows to a negative number, breaking Dijkstra's invariant. Fix: use `float('inf')` (Python) or a large sentinel such as `1e18` (C++/Java).
2. Relaxing undirected edges only once. In Bellman-Ford on undirected graphs, each undirected edge must be treated as two directed edges. Relaxing only one direction misses valid shorter paths.
3. Using Dijkstra on negative-weight edges. Dijkstra is incorrect with negative weights. Symptom: incorrect distances for some vertices. Fix: use Bellman-Ford, or reweight with Johnson's method.
4. Missing the negative-cycle check. Implementing Bellman-Ford without the final check round misses the case where distances are undefined (-\infty along paths through a negative cycle). Always run the n-th relaxation round and check for further improvement.
5. Off-by-one in Kahn's algorithm. If the output list has fewer than n vertices, a cycle exists - this is the cycle-detection mechanism. Not checking this means silently returning a partial topological order.
6. Not resetting the `onStack` flag in Tarjan. When an SCC is popped, its vertices must be marked as no longer on the stack. Otherwise, later DFS edges to those vertices incorrectly update `low` values.
7. Forgetting the residual backward edge in max-flow. Ford-Fulkerson requires adding both forward and backward residual edges. Omitting the backward edge prevents flow cancellation and gives suboptimal results.
8. BFS on multigraphs with parallel edges. If multiple weighted edges exist between u and v, BFS may mark v as visited via one edge and miss a shorter-weight path via another. For weighted graphs, use Dijkstra.
9. DFS stack overflow on large graphs. Recursive DFS overflows the call stack for deep graphs (the default recursion limit is 1000 in Python). Fix: implement iterative DFS with an explicit stack.
10. Ignoring disconnected graphs. BFS/DFS from a single source only visits one component. For algorithms requiring all vertices (e.g., Kahn's topological sort, Kosaraju's SCC), always loop over all unvisited sources.
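The stack-overflow pitfall is worth a concrete sketch: an iterative DFS with an explicit stack (plain Python, preorder output) handles graphs far deeper than the default recursion limit:

```python
def iterative_dfs(adj, s):
    """DFS with an explicit stack instead of recursion.

    adj: dict mapping vertex -> list of neighbours; s: start vertex.
    Returns vertices in preorder.
    """
    visited, order, stack = set(), [], [s]
    while stack:
        u = stack.pop()
        if u in visited:
            continue                 # may have been pushed twice
        visited.add(u)
        order.append(u)
        # push neighbours in reverse so the first neighbour is explored first
        for v in reversed(adj.get(u, [])):
            if v not in visited:
                stack.append(v)
    return order
```

On a path graph 10,000 vertices deep, the recursive version would exceed Python's call-stack limit; this version does not.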
10. Why This Matters for AI (2026 Perspective)
| Algorithm | Concrete AI/ML Application |
|---|---|
| BFS | GNN k-hop receptive field; knowledge graph multi-hop QA; web crawling for LLM pre-training data |
| DFS | Cycle detection in autograd graphs; dependency resolution for model serving; parse tree traversal in NLP |
| Dijkstra | Shortest semantic path in knowledge graphs; road routing for embodied agents; optimal plan length in MCTS |
| Bellman-Ford | Value iteration in RL (Bellman equation); credit assignment in temporal-difference learning; negative-reward environments |
| Floyd-Warshall | All-pairs semantic distance in knowledge bases; all-pairs token co-occurrence distances; small graph APSP in graph transformers |
| A* | Robot path planning (ROS navigation); game AI search (Alpha*, MCTS + heuristic); structured output generation with heuristic guidance |
| Kruskal / Prim (MST) | Single-linkage hierarchical clustering; minimum-cost network design; graph-based feature selection; phylogenetic tree construction |
| Topological Sort | PyTorch/JAX autograd forward pass; package dependency resolution (pip, conda); build system (Make, Bazel) ordering |
| Tarjan SCC | Circular dependency detection in ML pipelines; knowledge graph cycle analysis; 2-SAT constraint solving in neural architecture search |
| Ford-Fulkerson / Edmonds-Karp | Image segmentation (Boykov-Kolmogorov); network reliability; matching in bipartite graphs (Hungarian algorithm) |
| Max-Flow / Min-Cut | Graph partitioning for distributed GNN training; community detection; min-cut spectral connection (Cheeger inequality -> 04) |
| Graph cuts (Boykov 2001) | Semantic segmentation before deep learning (still used as post-processing); MRF energy minimisation |
| Parallel BFS (GPU) | Trillion-edge social network analysis; billion-node knowledge graph traversal (NVIDIA cuGraph, GraphBLAS) |
| SpMM (GNN aggregation) | Every GNN forward pass: the aggregation A H (sparse adjacency times dense feature matrix) is a sparse-dense matrix multiply |
The GNN as a Parameterised Graph Algorithm
The deepest connection between classical graph algorithms and modern AI is this:
GRAPH ALGORITHM -> GNN GENERALISATION
========================================================================
BFS (hop-distance labels) -> GCN (learned node embeddings)
BFS frontier = layer k nodes -> GNN layer k receptive field
"distance from source" -> "representation of k-hop context"
Fixed aggregation (count) -> Learned aggregation (attention)
Deterministic (one answer) -> Parameterised (trained on data)
Dijkstra (shortest path) -> Neural shortest path (learned w)
Static weights w(u,v) -> Edge features in GNN
Priority queue extraction -> Attention-weighted aggregation
Exact optimal solution -> Approximate learned solution
MST (global structure) -> Graph pooling (global summary)
Kruskal cut/cycle decisions -> Differentiable graph pooling
Minimum spanning structure -> Compressed graph representation
========================================================================
Every GNN is a learned, parameterised, differentiable version of a classical graph algorithm. Understanding the classical algorithms provides:
- Interpretability: what the GNN is computing approximations of
- Complexity awareness: how many layers = how many hops the GNN can "see"
- Architecture design: which aggregation functions (sum, mean, max) correspond to which classical problems
- Failure case prediction: when the GNN cannot solve a problem (e.g., WL-indistinguishable graphs)
11. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using d[v] = 0 to mark "unvisited" in BFS | Conflicts with actual distance 0 for source; BFS fails on any source other than node 0 | Use d[v] = \infty (infinity) for unvisited; only d[s] = 0 is set at initialisation |
| 2 | Applying Dijkstra to a graph with negative edge weights | Greedy extraction invariant breaks; can return incorrect distances | Use Bellman-Ford for negative weights; or reweight with Johnson's method |
| 3 | Treating DFS finish-order as topological order (not reversing) | DFS finish order is the reverse topological order | Reverse the finish order list, or use Kahn's algorithm directly |
| 4 | Confusing BFS shortest paths on weighted graphs | BFS counts hops, not weighted distances; gives wrong answers for weighted SSSP | Use Dijkstra (non-neg weights) or Bellman-Ford (arbitrary weights) |
| 5 | Running Kruskal on a multigraph without handling parallel edges | Parallel edges can all be added if they connect different components; deduplication may be needed | Pre-process: keep only minimum-weight edge between each node pair if MST is the goal |
| 6 | Declaring a graph a DAG after just one DFS component | DFS from one source only checks reachable vertices; cycles elsewhere go undetected | Run DFS from all unvisited vertices (outer loop) and check for back edges globally |
| 7 | Floyd-Warshall with D[i][i] = \infty initial value | Diagonal should be 0 (distance from a node to itself); incorrect initialisation propagates errors throughout all iterations | Initialise D[i][i] = 0 for all i |
| 8 | Max-flow without bidirectional residual edges | Cancellation of flow is impossible; algorithm gets stuck at suboptimal solutions | For every edge (u,v) with capacity c(u,v), add a reverse edge (v,u) with initial residual capacity 0 |
| 9 | Prim's algorithm applied to a disconnected graph | Prim always starts from one vertex and grows; disconnected vertices are never added to the MST | Verify graph is connected before running Prim, or run Kruskal (which works on disconnected graphs, yielding a minimum spanning forest) |
| 10 | Bellman-Ford stopping early when no relaxation occurs in a round | In the presence of negative edges, relaxation may restart later; early stopping can miss shorter paths | Only apply early stopping if a full round passes with zero relaxations - this is an optimisation, but must complete all rounds first |
| 11 | A* with an inadmissible heuristic | An overestimating heuristic causes A* to expand a non-optimal path first, missing the true shortest path | Always verify h(v) \leq \delta(v,t) (admissibility); Euclidean distance is safe for Euclidean graphs |
| 12 | SCCs on an undirected graph using SCC algorithm | Every edge in an undirected graph is a bidirectional path; every connected component is a trivial SCC | For undirected graphs, simply find connected components with BFS/DFS |
12. Exercises
Exercise 1 * - BFS Shortest Paths Consider the undirected graph with adjacency list: , , , , , .
(a) Run BFS from source . Show the queue state at each step and the resulting distance array . (b) Find all shortest paths from to . (c) How many distinct shortest paths are there from to ? Enumerate them. (d) Implement BFS in Python and verify your answer.
Exercise 2 * - DFS Edge Classification Consider the directed graph with edges: , , , , , .
(a) Run DFS starting from vertex 1. Record disc[] and fin[] timestamps. (b) Classify each edge as tree, back, forward, or cross edge. (c) Does this graph contain a cycle? Identify it. (d) Is a topological ordering possible? Explain why or why not.
Exercise 3 * - Dijkstra's Algorithm Run Dijkstra's algorithm on the weighted directed graph with edges: .
(a) Show the priority queue state after each EXTRACT-MIN. (b) Give the shortest path tree (parent[] array). (c) What is the shortest path from to ? (d) Why would Dijkstra fail if edge had weight ? What is the correct answer, and which algorithm finds it?
Exercise 4 ** - Bellman-Ford and DP Given the directed graph with edges :
(a) Show the Bellman-Ford DP table for (rows = iterations, columns = vertices). (b) Verify the final distances match the true shortest paths. (c) Add edge . What happens to Bellman-Ford? Show the behaviour in the -th round. (d) Explain the connection between Bellman-Ford's update rule and the Bellman equation for value iteration in RL.
Exercise 5 ** - Kruskal's MST with Union-Find Implement Kruskal's algorithm with Union-Find (union by rank + path compression) for the graph with edges: .
(a) Sort edges by weight and show each Union-Find decision. (b) Give the final MST and its total weight. (c) Verify the MST satisfies the cut property for the cut . (d) How many distinct MSTs does this graph have? (Hint: are all edge weights distinct?)
Exercise 6 ** - Topological Sort and Autograd Consider a computation graph (DAG) with operations: , , , .
(a) Draw the computation DAG (nodes = tensors and operations). (b) Run Kahn's topological sort. What is the in-degree of each node at the start? (c) Give the topological order of operations for the forward pass. (d) Give the reverse topological order for the backward pass. (e) Which gradients are accumulated (summed) rather than simply propagated? Why?
Exercise 7 *** - Tarjan's SCC Run Tarjan's algorithm on the directed graph with edges: , , , , , , , , .
(a) Trace disc[] and low[] values throughout the DFS. (b) Identify when each SCC is detected (when low[v] = disc[v]). (c) List all SCCs and draw the condensation DAG. (d) Which SCC is a "source" (in-degree 0 in condensation) and which is a "sink"?
Exercise 8 *** - Max-Flow and the Min-Cut Theorem Consider the flow network with source , sink , and edges: .
(a) Find a maximum - flow using Ford-Fulkerson. Show the residual graph after each augmenting path. (b) What is the maximum flow value? (c) Find a minimum - cut and verify its capacity equals the max flow. (d) Describe how this max-flow instance models a data routing problem: = data centre, = relay nodes, = users. What does the min-cut represent physically?
13. Conceptual Bridge
Backward: What This Section Builds On
Graph algorithms are the operational layer on top of the structures defined in 01 and 02. Every algorithm in this section assumes familiarity with:
- 01 Graph Basics: the definitions of paths, cycles, connectivity, and DAGs that make algorithm statements precise
- 02 Graph Representations: the choice of adjacency list vs adjacency matrix vs CSR directly determines algorithm complexity - BFS on an adjacency list is O(n+m); on an adjacency matrix it would be O(n^2) even for sparse graphs
The algorithms also draw on core data structures: priority queues (Dijkstra, Prim, A*), Union-Find (Kruskal), stacks (DFS, Tarjan), and queues (BFS, Kahn). These are assumed from a standard algorithms course.
Forward: What This Section Enables
04 Spectral Graph Theory takes the graph objects studied here and analyses them through eigenvalues. The Cheeger inequality connects the min-cut (7.4 above) to the Fiedler eigenvalue \lambda_2 of the Laplacian. Spectral clustering finds an approximate min-cut by computing the Laplacian's second eigenvector (the Fiedler vector) - a bridge from combinatorial algorithms (03) to algebraic methods (04).
05 Graph Neural Networks makes the algorithmic connections explicit: GNN layers are parameterised, differentiable generalisations of BFS aggregation. Every classical algorithm in this section has a GNN counterpart: BFS -> GCN, Dijkstra -> neural shortest paths, MST -> differentiable graph pooling. The expressiveness limitations of GNNs (WL test) are precisely characterised in terms of how many BFS hops the network can simulate.
22 Causal Inference uses DAG algorithms - topological sort, d-separation, Markov blanket computation - to reason about causal structure. The do-calculus operates on causal DAGs; interventions modify the DAG structure; counterfactuals involve path analysis.
CURRICULUM POSITION: 03 GRAPH ALGORITHMS
========================================================================
+-----------------------------------------------------------------+
| CH. 11 GRAPH THEORY |
| |
| 01 Graph Basics 02 Graph Representations |
| (vocabulary) (data structures) |
| down down |
| +==================================+ |
| | 03 GRAPH ALGORITHMS <- YOU ARE HERE |
| | BFS * DFS * Dijkstra * Bellman-Ford |
| | Kruskal * Prim * Topo Sort * SCC |
| | Ford-Fulkerson * Max-Flow / Min-Cut |
| +==================================+ |
| down |
| 04 Spectral Graph Theory |
| (Laplacian eigenvalues, Fiedler, spectral clustering) |
| down |
| 05 Graph Neural Networks |
| (MPNN, GCN, GAT - learned graph algorithms) |
| down |
| 06 Random Graphs |
| (probabilistic graph models, phase transitions) |
+-----------------------------------------------------------------+
Prerequisite chains into this section:
01 --- definitions, paths, DAGs
02 --- adjacency list (O(n+m) complexity)
Ch.02 -- matrix multiply (Floyd-Warshall, adjacency matrix BFS)
Where this section feeds:
04 --- min-cut <-> Cheeger constant; BFS layers <-> graph spectrum
05 --- BFS = GNN receptive field; topo sort = autograd ordering
22 --- DAG algorithms for causal inference
========================================================================
Further Reading
- Cormen, Leiserson, Rivest, Stein - Introduction to Algorithms (CLRS), Ch. 22-26: the definitive reference for BFS, DFS, SSSP, MST, max-flow with complete proofs
- Kleinberg & Tardos - Algorithm Design, Ch. 3-4, 7: excellent motivating examples for graph algorithms and flows
- Sedgewick & Wayne - Algorithms (4th ed.), Part II: practical Java implementations with visualisations
- Kepner & Gilbert - Graph Algorithms in the Language of Linear Algebra (2011): GraphBLAS and semiring formulations
- Boykov & Kolmogorov - "An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision" (PAMI 2004): graph cuts for CV
- Leskovec, Rajaraman & Ullman - Mining of Massive Datasets, Ch. 10: graph algorithms at web scale
<- Back to Graph Theory | Previous: Graph Representations <- | Next: Spectral Graph Theory ->
Appendix A: Worked Traces
A.1 Complete Dijkstra Trace
Graph. Directed, nodes {s, a, b, c, d}, edges: s -> a (4), s -> b (2), b -> a (1), b -> c (5), a -> d (3), c -> d (1).
Initialisation: d[s] = 0, all other distances \infty; priority queue keyed by d.
| Step | Extract | d[s] | d[a] | d[b] | d[c] | d[d] | Relaxed |
|---|---|---|---|---|---|---|---|
| 0 | - | 0 | \infty | \infty | \infty | \infty | init |
| 1 | s (0) | 0 | 4 | 2 | \infty | \infty | s->a=4, s->b=2 |
| 2 | b (2) | 0 | 3 | 2 | 7 | \infty | b->a: 2+1=3<4 OK, b->c: 2+5=7 |
| 3 | a (3) | 0 | 3 | 2 | 7 | 6 | a->d: 3+3=6 |
| 4 | d (6) | 0 | 3 | 2 | 7 | 6 | no outgoing edges |
| 5 | c (7) | 0 | 3 | 2 | 7 | 6 | c->d: 7+1=8>6, no update |
Result: d = (s: 0, a: 3, b: 2, c: 7, d: 6).
- Shortest path to d: s -> b -> a -> d (cost 2 + 1 + 3 = 6). OK
- Key observation: b was extracted before a (key 2 < 4), and relaxing edge b -> a improved d[a] from 4 to 3 before a was extracted. Dijkstra's greedy extraction is safe precisely because a vertex's distance is only finalised at the moment it is extracted.
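The trace can be reproduced mechanically with a standard heap-based implementation (the graph edges are restated in the code for self-containment):

```python
import heapq

def dijkstra(adj, s):
    """Heap-based Dijkstra; adj maps a vertex to a list of (neighbour, weight)."""
    d = {s: 0}
    pq = [(0, s)]                          # (distance estimate, vertex)
    while pq:
        du, u = heapq.heappop(pq)
        if du > d.get(u, float('inf')):
            continue                       # stale entry: u already settled
        for v, w in adj.get(u, []):
            if du + w < d.get(v, float('inf')):
                d[v] = du + w              # relax edge (u, v)
                heapq.heappush(pq, (d[v], v))
    return d

# A.1 graph: s->a (4), s->b (2), b->a (1), b->c (5), a->d (3), c->d (1)
adj = {'s': [('a', 4), ('b', 2)],
       'b': [('a', 1), ('c', 5)],
       'a': [('d', 3)],
       'c': [('d', 1)]}
```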
A.2 Complete Bellman-Ford Trace
Graph. Nodes {s, a, b, t}, edges s -> a (weight 3), s -> b (6), b -> a (-2), a -> t (7), b -> t (4). The weight -2 is negative, but there is no negative cycle.
| Round | d[s] | d[a] | d[b] | d[t] |
|---|---|---|---|---|
| 0 (init) | 0 | \infty | \infty | \infty |
| 1 | 0 | 3 | 6 | min(\infty,3+7,6+4)=10 |
| 2 | 0 | min(3,6-2)=3 | 6 | min(10,3+7,6+4)=10 |
| 3 | 0 | 3 | 6 | 10 |
Result: d[t] = 10, via s -> a -> t (cost 3 + 7) or s -> b -> t (cost 6 + 4) - both give 10. No negative cycle (round 3 makes no changes). OK
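The same table falls out of a direct implementation with early stopping and the n-th-round negative-cycle check (the A.2 graph is restated in the code):

```python
def bellman_ford(nodes, wedges, s):
    """Bellman-Ford with early stopping and negative-cycle detection.

    nodes: iterable of vertices; wedges: list of (u, v, w); s: source.
    """
    INF = float('inf')
    d = {v: INF for v in nodes}
    d[s] = 0
    for _ in range(len(nodes)):            # rounds 1..n
        changed = False
        for u, v, w in wedges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w
                changed = True
        if not changed:
            return d                       # converged: no negative cycle
    # still relaxing in the n-th round => negative cycle
    raise ValueError("negative cycle reachable from source")

# A.2 graph: s->a (3), s->b (6), b->a (-2), a->t (7), b->t (4)
edges = [('s', 'a', 3), ('s', 'b', 6), ('b', 'a', -2),
         ('a', 't', 7), ('b', 't', 4)]
```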
A.3 Tarjan SCC: Detailed Trace
Graph. Nodes {0, 1, 2, 3, 4}, edges: 0 -> 1, 1 -> 2, 2 -> 0 (cycle), 1 -> 3, 3 -> 4, 4 -> 1 (cycle).
DFS from node 0:
Visit 0: disc[0]=0, low[0]=0, stack=[0]
Visit 1: disc[1]=1, low[1]=1, stack=[0,1]
Visit 2: disc[2]=2, low[2]=2, stack=[0,1,2]
Edge 2->0: 0 is on stack -> low[2] = min(2,disc[0]) = 0
Finish 2: low[2]=0 \neq disc[2]=2 -> not SCC root
Back-propagate: low[1] = min(1,low[2]) = 0
Visit 3: disc[3]=3, low[3]=3, stack=[0,1,2,3]
Visit 4: disc[4]=4, low[4]=4, stack=[0,1,2,3,4]
Edge 4->1: 1 is on stack -> low[4] = min(4,disc[1]) = 1
Finish 4: low[4]=1 \neq disc[4]=4 -> not SCC root
Back-propagate: low[3] = min(3,low[4]) = 1
Finish 3: low[3]=1 \neq disc[3]=3 -> not SCC root
Back-propagate: low[1] = min(0,low[3]) = 0
Finish 1: low[1]=0 \neq disc[1]=1 -> not SCC root
Back-propagate: low[0] = min(0,low[1]) = 0
Finish 0: low[0]=0 = disc[0]=0 -> SCC ROOT!
Pop stack down to 0: SCC = {0, 1, 2, 3, 4}.
Verification: every node can reach every other - e.g. 3 reaches 0 via 3->4->1->2->0, and 0 reaches 3 via 0->1->3. So the whole graph is one SCC.
For a graph with several distinct SCCs, see Exercise 7.
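The trace can be checked with a compact recursive Tarjan (a sketch; node numbering as in the trace above):

```python
def tarjan_scc(adj):
    n = len(adj)
    disc = [-1] * n
    low = [0] * n
    on_stack = [False] * n
    stack, sccs, time = [], [], [0]

    def visit(v):
        disc[v] = low[v] = time[0]
        time[0] += 1
        stack.append(v)
        on_stack[v] = True
        for w in adj[v]:
            if disc[w] == -1:                       # tree edge
                visit(w)
                low[v] = min(low[v], low[w])
            elif on_stack[w]:                       # back edge to the stack
                low[v] = min(low[v], disc[w])
        if low[v] == disc[v]:                       # v is an SCC root
            comp = []
            while True:
                w = stack.pop()
                on_stack[w] = False
                comp.append(w)
                if w == v:
                    break
            sccs.append(sorted(comp))

    for v in range(n):
        if disc[v] == -1:
            visit(v)
    return sccs

adj = [[1], [2, 3], [0], [4], [1]]   # the trace graph: 0->1, 1->2, 2->0, 1->3, 3->4, 4->1
print(tarjan_scc(adj))                # [[0, 1, 2, 3, 4]]
```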
Appendix B: Complexity Deep Dives
B.1 Why Dijkstra is O((n+m) log n)
The binary-heap-based priority queue supports:
- INSERT in O(log n) - adding the n vertices initially: O(n log n)
- EXTRACT-MIN in O(log n) - called n times: O(n log n)
- DECREASE-KEY in O(log n) - called at most once per edge: O(m log n)
Total: O((n + m) log n).
With a Fibonacci heap, DECREASE-KEY runs in O(1) amortised: total O(m + n log n). Theoretically superior for dense graphs (m = \Theta(n^2)), but the large constants make Fibonacci heaps impractical in most implementations.
Practical consideration. For road networks (small average degree), m = O(n), so Dijkstra is O(n log n). With contraction hierarchies, query time drops to milliseconds for continent-scale road graphs.
B.2 Why Floyd-Warshall is Correct: DP Subproblem
Let D^{(k)}[i][j] = length of the shortest path from i to j using only vertices {1, ..., k} as intermediates.
Base: D^{(0)}[i][j] = w(i, j) if (i, j) \in E, 0 if i = j, \infty otherwise.
Transition: D^{(k)}[i][j] = min(D^{(k-1)}[i][j], D^{(k-1)}[i][k] + D^{(k-1)}[k][j]) - either the optimal path from i to j with intermediates in {1, ..., k} uses vertex k, or it doesn't.
After n iterations (all vertices considered as potential intermediates), D^{(n)}[i][j] = \delta(i, j).
In-place update. The transition can be done in-place (the same matrix D) because D^{(k)}[i][k] = D^{(k-1)}[i][k] and D^{(k)}[k][j] = D^{(k-1)}[k][j] - the k-th row and column do not change in iteration k, since a shortest path to or from k gains nothing by passing through k.
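The DP with its in-place update is only a few lines (a sketch on a small illustrative graph):

```python
INF = float('inf')

def floyd_warshall(W):
    """In-place min-plus DP on a copy of the n x n weight matrix."""
    n = len(W)
    D = [row[:] for row in W]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D

# 4-node example: 0->1 (4), 0->2 (2), 2->1 (1), 1->3 (5), 2->3 (8)
W = [[0,   4,   2,   INF],
     [INF, 0,   INF, 5],
     [INF, 1,   0,   8],
     [INF, INF, INF, 0]]
print(floyd_warshall(W)[0])   # [0, 3, 2, 8]
```

The shortest 0 -> 3 path routes 0 -> 2 -> 1 -> 3 (2 + 1 + 5 = 8), found once k sweeps over the intermediates 1 and 2.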
B.3 Union-Find Complexity: Inverse Ackermann
The inverse Ackermann function \alpha(n) grows so slowly that \alpha(n) \leq 4 for any n realistically encountered in practice (\alpha(n) = 5 only for n exceeding a tower of exponentials far beyond physical scales). Thus Union-Find with path compression + union by rank runs in O(\alpha(n)) amortised per operation - effectively constant.
Path compression. During FIND(x), after finding the root , set the parent of every node on the path directly to . This flattens the tree for future queries.
Union by rank. When merging two trees, attach the shallower tree's root as a child of the taller tree's root. With rank as an upper bound on height, no tree grows taller than O(log n).
Together, these two optimisations achieve O(m \alpha(n)) total time for any sequence of m operations - proven by Tarjan (1975), completing the complexity analysis of Kruskal's algorithm.
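A minimal DSU sketch with both optimisations (path halving is used in place of full path compression; it achieves the same amortised bound):

```python
class DSU:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False                   # already in the same component
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx               # attach shallower under taller
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

d = DSU(5)
d.union(0, 1)
d.union(2, 3)
print(d.find(1) == d.find(0), d.find(0) == d.find(2))   # True False
```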
B.4 Max-Flow Integrality and Network Applications
Integrality theorem. If all edge capacities in a flow network are integers, then there exists a maximum flow that is integer-valued on every edge. (Proof: Ford-Fulkerson augments by integer amounts when all capacities and current flows are integers; by induction, the final flow is integer-valued.)
Bipartite matching via max-flow. Given a bipartite graph G = (L \cup R, E), add:
- Source s with edge (s, u) of capacity 1 for all u \in L
- Sink t with edge (v, t) of capacity 1 for all v \in R
- All original bipartite edges, directed L -> R, with capacity 1
Maximum flow = maximum matching (by integrality). This reduces bipartite matching to max-flow and solves it in O(nm) via Ford-Fulkerson (the flow value is at most n, each augmentation costs O(m)), or O(m \sqrt{n}) via the specialised Hopcroft-Karp algorithm.
Hall's theorem connection. By the max-flow min-cut theorem, a bipartite graph has a perfect matching iff every subset S \subseteq L has |N(S)| \geq |S| (Hall's condition). The min-cut in the flow network corresponds exactly to a violating set S.
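The reduction can be sketched end-to-end with a small Edmonds-Karp on unit capacities (a toy instance; `max_flow` is an illustrative helper, not a library function):

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp on a dict-of-dicts of capacities; residuals stored in place."""
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        # Bottleneck along the path, then augment forward and residual edges
        b, v = float('inf'), t
        while parent[v] is not None:
            b = min(b, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= b
            cap[v][u] = cap[v].get(u, 0) + b
            v = u
        flow += b

# Bipartite matching: L = {0, 1}, R = {2, 3}, edges 0-2, 0-3, 1-2, all capacity 1
cap = {'s': {0: 1, 1: 1}, 0: {2: 1, 3: 1}, 1: {2: 1},
       2: {'t': 1}, 3: {'t': 1}, 't': {}}
f = max_flow(cap, 's', 't')
print(f)   # 2: the matching {0-3, 1-2}
```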
Appendix C: Connections to Linear Algebra
C.1 BFS as Matrix-Vector Multiplication
BFS can be reformulated as repeated matrix-vector multiplication over the Boolean semiring (OR, AND).
Let f be the current frontier vector (1 = in frontier, 0 = not). Let A be the Boolean adjacency matrix. Then:
f' = (A^T f) \land \lnot visited
where A^T f is a Boolean matrix-vector multiply (OR of ANDs) and the mask removes already-visited nodes.
This formulation maps directly to the GraphBLAS interface: vxm (vector-times-matrix) over a user-defined semiring. For BFS, the semiring is (OR, AND). For shortest paths, the semiring is (min, +) - the tropical semiring.
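A sketch of level-synchronous BFS as repeated matrix-vector products (dense NumPy stands in for a GraphBLAS Boolean semiring; `bfs_matvec` is an illustrative helper):

```python
import numpy as np

def bfs_matvec(A, s):
    """Level-synchronous BFS as Boolean matrix-vector products.
    A is a dense 0/1 adjacency matrix; returns hop distances (-1 = unreachable)."""
    n = A.shape[0]
    dist = np.full(n, -1)
    frontier = np.zeros(n, dtype=bool)
    frontier[s] = True
    visited = frontier.copy()
    level = 0
    while frontier.any():
        dist[frontier] = level
        # next frontier: any out-neighbour of the frontier, minus visited nodes
        frontier = (A.T @ frontier.astype(int)) > 0
        frontier &= ~visited
        visited |= frontier
        level += 1
    return dist

# Small DAG: 0->1, 0->2, 2->1, 1->3
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 0, 0]])
print(bfs_matvec(A, 0))   # [0 1 1 2]
```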
C.2 Shortest Paths via Tropical Algebra
Define the tropical semiring (\mathbb{R} \cup \{\infty\}, \oplus, \otimes) where:
- a \oplus b = min(a, b) (tropical addition)
- a \otimes b = a + b (tropical multiplication)
- Identity for \oplus: \infty; identity for \otimes: 0
The tropical matrix product of weight matrices:
(W \otimes W)[i][j] = \min_k (W[i][k] + W[k][j])
Floyd-Warshall computes the (n-1)-th power of the weight matrix in the tropical semiring:
D = W^{\otimes (n-1)}
i.e., the tropical matrix power gives all-pairs shortest paths.
Bellman-Ford computes tropical matrix-vector products: d^{(k+1)} = W^T \otimes d^{(k)}.
This algebraic view unifies shortest-path algorithms under the theory of algebraic path problems. The same framework extends to other graph algorithms by changing the semiring: (max, min) for widest (bottleneck) paths, (max, \times) for maximum-reliability paths.
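A sketch of APSP as a tropical matrix power (the `min_plus` helper is illustrative; the result matches what Floyd-Warshall computes):

```python
import numpy as np

INF = np.inf

def min_plus(A, B):
    """Tropical (min, +) matrix product: C[i][j] = min_k (A[i][k] + B[k][j])."""
    n = A.shape[0]
    C = np.full((n, n), INF)
    for k in range(n):
        C = np.minimum(C, A[:, [k]] + B[[k], :])   # broadcast (n,1) + (1,n)
    return C

# 4-node example: 0->1 (4), 0->2 (2), 2->1 (1), 1->3 (5), 2->3 (8)
W = np.array([[0,   4,   2,   INF],
              [INF, 0,   INF, 5],
              [INF, 1,   0,   8],
              [INF, INF, INF, 0]])
D = W.copy()
for _ in range(len(W) - 2):        # D = W^{(n-1)} in the tropical semiring
    D = min_plus(D, W)
print(D[0])   # [0. 3. 2. 8.]
```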
C.3 The Adjacency Matrix Power and Walk Counting
From 02, recall that (A^k)_{ij} counts the number of walks of length exactly k from i to j.
BFS connection. The first k for which (A^k)_{ij} > 0 is exactly d(i, j) - the BFS distance from i to j.
Reachability. Over the Boolean semiring: (A \lor A^2 \lor \cdots \lor A^n)_{ij} = 1 iff j is reachable from i in \leq n hops. Computing this is the transitive closure problem, equivalent to Floyd-Warshall over Booleans.
GNN aggregation. A K-layer GNN with linear activations computes:
H^{(K)} = \hat{A}^K X W
for a combined weight matrix W, where \hat{A} is the (normalised) adjacency. The entry (\hat{A}^K)_{ij} measures how much node j contributes to node i's representation - proportional to the number of weighted walks of length K from j to i. This is the algebraic content of "receptive field = BFS K-hop neighbourhood."
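Both facts are easy to check numerically (a sketch on a small DAG; `bfs_distance` is an illustrative helper):

```python
import numpy as np

# Small DAG: 0->1, 0->2, 2->1, 1->3
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 0, 0]])

# (A^k)[i, j] counts walks of length exactly k from i to j
A2 = A @ A
print(A2[0, 3])   # 1  (the walk 0 -> 1 -> 3)

def bfs_distance(A, i, j):
    """d(i, j) = smallest k with (A^k)[i, j] > 0."""
    P = np.eye(len(A), dtype=int)
    for k in range(1, len(A) + 1):
        P = P @ A
        if P[i, j] > 0:
            return k
    return float('inf')

print(bfs_distance(A, 0, 3))   # 2
```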
Appendix D: Graph Algorithms in Practice
D.1 Python Ecosystem
| Library | Algorithms | Strength |
|---|---|---|
networkx | All classical algorithms | Easy to use, educational; slow for large graphs |
igraph | BFS, DFS, SP, MST, community | C-backed; 10-100\times faster than NetworkX |
scipy.sparse | SpMV, matrix-vector BFS | Efficient sparse matrix operations |
cuGraph (NVIDIA) | BFS, SSSP, PageRank, MST | GPU-accelerated; for billion-node graphs |
PyG / DGL | GNN message passing (SpMM) | Deep learning on graphs; CUDA support |
rustworkx (IBM) | All classical algorithms | Rust-backed; used in Qiskit circuits |
D.2 Choosing Data Structure for Each Algorithm
| Algorithm | Best Graph Representation | Why |
|---|---|---|
| BFS | Adjacency list | O(deg(v)) neighbour enumeration |
| DFS | Adjacency list | O(deg(v)) neighbour enumeration |
| Dijkstra | Adjacency list + min-heap | Neighbour access + priority queue |
| Bellman-Ford | Edge list | Iterate over all edges directly |
| Floyd-Warshall | Adjacency matrix | Direct index |
| Kruskal | Edge list (sorted) | Sort edges, then Union-Find |
| Prim | Adjacency list + min-heap | Same structure as Dijkstra |
| Topological Sort | Adjacency list + in-degree array | In-degree update per edge |
| Tarjan SCC | Adjacency list | DFS neighbour enumeration |
| Ford-Fulkerson | Residual adjacency list | Forward and backward edge access |
D.3 Scale Reference
| Scale | Recommended Approach |
|---|---|
| Tiny | Any algorithm, NetworkX |
| Small | Python + NumPy/SciPy |
| Medium | C++/Rust, igraph, optimised adjacency list |
| Large | Distributed (GraphX, Pregel), or GPU (cuGraph) |
| Massive | Streaming algorithms, sketches, sampling |
The Facebook social graph (2012): hundreds of millions of nodes and tens of billions of edges. BFS on this graph requires distributed computing - each frontier step involves terabytes of edge data.
D.4 NetworkX Quick Reference
```python
import networkx as nx

# Create graph
G = nx.DiGraph()
G.add_weighted_edges_from([(0, 1, 4), (0, 2, 2), (1, 3, 5), (2, 1, 1), (2, 3, 8)])

# BFS
bfs_layers = list(nx.bfs_layers(G, 0))   # [[0], [1, 2], [3]]: nodes per layer
bfs_tree = nx.bfs_tree(G, 0)

# DFS
dfs_edges = list(nx.dfs_edges(G, 0))
has_cycle = not nx.is_directed_acyclic_graph(G)

# Shortest paths
dist = nx.single_source_dijkstra_path_length(G, 0, weight='weight')
path = nx.dijkstra_path(G, 0, 3, weight='weight')

# MST (undirected only)
Gu = G.to_undirected()
mst = nx.minimum_spanning_tree(Gu, weight='weight')

# Topological sort (DAG only)
topo = list(nx.topological_sort(G))

# SCC
sccs = list(nx.strongly_connected_components(G))

# Max flow: returns (flow_value, flow_dict)
flow_value, flow_dict = nx.maximum_flow(G, 0, 3, capacity='weight')
```
Appendix E: Proofs and Mathematical Details
E.1 Correctness of Kruskal's Algorithm
Theorem. Kruskal's algorithm produces a minimum spanning tree of G.
Proof. We use the cut property: if e is the minimum-weight edge crossing some cut (S, V \setminus S), then e belongs to some MST.
Let T be the spanning tree produced by Kruskal. We show T is an MST by showing it has the same weight as some MST T*.
Consider the first edge e = (u, v) in Kruskal's sorted order that is in T but not T*. Adding e to T* creates a cycle C. Since T* is a spanning tree, C must contain another edge f that crosses the cut (S, V \setminus S), where S is the component containing u when e was processed. Since e was chosen by Kruskal (minimum-weight crossing edge), w(e) \leq w(f). Replace f with e in T*: the new tree has weight \leq that of T*. Since T* is an MST, w(e) = w(f), and the new tree is also an MST.
By repeating this argument for each differing edge, we transform T* into T without increasing weight. Hence T is an MST.
E.2 Correctness of Bellman-Ford
Theorem. After n - 1 iterations of Bellman-Ford from source s in a graph with no negative-weight cycles, d[v] = \delta(s, v) for all v.
Proof. Let v be any vertex with \delta(s, v) < \infty, and let P = (s = v_0, v_1, ..., v_k = v) be a shortest path (which exists and has k \leq n - 1 edges, since shortest paths are simple when there are no negative cycles).
By induction on i: after iteration i, d[v_i] = \delta(s, v_i).
- i = 0: d[s] = 0 = \delta(s, s).
- Inductive step: after iteration i - 1, d[v_{i-1}] = \delta(s, v_{i-1}). In iteration i, edge (v_{i-1}, v_i) is relaxed: d[v_i] \leq d[v_{i-1}] + w(v_{i-1}, v_i) = \delta(s, v_i). Since d[v] \geq \delta(s, v) always, we get equality.
After n - 1 iterations, d[v] = \delta(s, v) for every v.
E.3 Max-Flow Min-Cut: Complete Proof
Theorem. In a flow network G = (V, E, c), \max_f |f| = \min_{(S,T)} c(S, T).
Proof. We prove three statements equivalent:
1. f is a maximum flow.
2. The residual graph G_f has no augmenting path from s to t.
3. |f| = c(S, T) for some s-t cut (S, T).
(1 => 2): If there were an augmenting path, we could increase |f|, contradicting maximality.
(2 => 3): If no s-t path exists in G_f, let S = {v : v reachable from s in G_f} and T = V \setminus S. Then s \in S, t \in T, so (S, T) is an s-t cut. For every edge (u, v) \in E with u \in S, v \in T: the residual capacity c(u, v) - f(u, v) must be 0, i.e. f(u, v) = c(u, v) (saturated). For every edge (u, v) \in E with u \in T, v \in S: f(u, v) = 0 (otherwise the residual backward edge would make u reachable). Therefore |f| = f(S, T) = c(S, T).
(3 => 1): By weak duality, |f'| \leq c(S, T) for any flow f' and any cut. If |f| = c(S, T), then f achieves this upper bound - no flow can exceed the cut capacity, so f is maximum.
E.4 DFS Parenthesis Theorem (Full Proof)
Theorem. For any DFS of a graph G, and any two vertices u, v: the intervals [disc[u], fin[u]] and [disc[v], fin[v]] are either disjoint or one properly contains the other.
Proof. WLOG assume disc[u] < disc[v] (u is discovered before v).
Case 1: disc[v] > fin[u] - v is discovered after u finishes. The intervals are disjoint.
Case 2: disc[v] < fin[u] - v is discovered while u is still on the stack (GREY). Then v must be a descendant of u in the DFS tree (because u is still open when v is opened, meaning v is inside u's subtree). Consequently, fin[v] < fin[u] - v finishes before u finishes. So [disc[v], fin[v]] \subset [disc[u], fin[u]].
Corollary. v is a descendant of u in the DFS tree iff disc[u] < disc[v] < fin[v] < fin[u].
Appendix F: Algorithm Summaries (Cheat Sheet)
BFS: Queue + visited set. O(n+m). SSSP (unweighted).
DFS: Stack/recursion + timestamps. O(n+m). Cycle detection, SCC.
Dijkstra: Min-heap + relax. O((n+m) log n). SSSP, non-negative weights.
Bellman-Ford: n-1 rounds, relax all edges. O(nm). SSSP, any weights.
Floyd-Warshall: DP D[i][k]+D[k][j]. O(n^3). APSP.
A*: Dijkstra + heuristic h(v). O(m log n) worst case. Single-pair SP.
Kruskal: Sort edges + Union-Find. O(m log m). MST.
Prim: Min-heap from any root. O((n+m) log n). MST.
Kahn: In-degree queue. O(n+m). Topological sort.
Tarjan SCC: DFS + low-link + stack. O(n+m). SCC of a directed graph.
Ford-Fulkerson: Residual graph + augmenting paths. O(m |f*|). Max flow.
Edmonds-Karp: BFS augmenting paths. O(nm^2). Max flow, polynomial guarantee.
KEY INVARIANTS:
Dijkstra: d[u] = \delta(s,u) when u is extracted.
Bellman-Ford: after k rounds, d[v] = shortest distance using \leq k edges.
Kruskal: each added edge is the min-weight edge crossing some cut.
Tarjan: v is an SCC root iff low[v] = disc[v].
MaxFlow: f is optimal iff no augmenting path exists in the residual graph.
Appendix G: Extended AI Connections
G.1 Reinforcement Learning as Graph Search
The core of reinforcement learning can be understood as graph search on the state-action graph (Markov Decision Process):
- Nodes: states s \in S
- Edges: actions a \in A, weighted by -r(s, a) (negative reward = cost)
- Goal: find a policy minimising expected total cost (equivalently, maximising expected reward)
| RL Concept | Graph Algorithm Analogue |
|---|---|
| State | Node |
| Action | Edge |
| Reward | Negative edge weight |
| Value function | Shortest path distance |
| Bellman equation (DP) | Bellman-Ford relaxation |
| Value iteration | Bellman-Ford until convergence |
| Policy improvement | Dijkstra (greedy shortest path) |
| Q-function | Edge-based distance labelling |
| Model-based RL | Explicit graph construction + SSSP |
| Model-free RL | Online Bellman-Ford with sampled edges |
Value iteration is Bellman-Ford on the MDP graph:
V_{k+1}(s) = \max_a [ r(s, a) + \gamma \sum_{s'} P(s' | s, a) V_k(s') ]
The max over actions is the Bellman-Ford relaxation; the discount \gamma < 1 ensures convergence even without a "goal" node.
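The correspondence can be seen on a toy deterministic MDP (a sketch; the states, actions, and rewards are invented for illustration):

```python
# Hypothetical 3-state chain MDP: s0 -> s1 -> s2 (terminal).
# Value iteration V_{k+1}(s) = max_a [ r(s, a) + gamma * V_k(next(s, a)) ]
# is repeated Bellman-Ford-style relaxation on the state graph.
gamma = 0.9
trans = {'s0': [('s1', 1.0)],     # state -> list of (next_state, reward) pairs
         's1': [('s2', 10.0)],
         's2': []}                # terminal: no actions
V = {s: 0.0 for s in trans}
for _ in range(50):               # iterate to (here, exact) convergence
    V = {s: max((r + gamma * V[ns] for ns, r in acts), default=0.0)
         for s, acts in trans.items()}
print(V['s0'])   # 10.0  (= 1.0 + 0.9 * 10.0)
```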
G.2 Transformers and Graph Attention
The Transformer attention mechanism (Vaswani et al., 2017) can be interpreted as a graph algorithm on the full-attention graph (complete bipartite graph over tokens).
In self-attention:
- Each token is a node
- Edge weights are learned attention scores
- Aggregation: output_i = \sum_j \alpha_{ij} V_j - a weighted sum over all tokens
This is equivalent to one step of graph attention on the complete graph. Sparse transformers (Longformer, BigBird) restrict attention to a sparse graph, reducing complexity from O(n^2) to roughly O(n) in sequence length - precisely the graph-algorithmic idea of working with sparse adjacency structures.
Graph Attention Networks (GAT, Velickovic et al. 2018) apply the same attention mechanism to graph neighbourhoods:
h_i' = \sigma( \sum_{j \in N(i)} \alpha_{ij} W h_j )
The support of the attention is restricted to the graph's actual edge set - sparse attention on the graph's adjacency structure. The algorithm is BFS aggregation with learned, node-pair-specific weights.
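A sketch of edge-masked attention in NumPy (dot-product scores stand in for GAT's learned attention; the mask is the adjacency matrix):

```python
import numpy as np

def masked_attention(X, A):
    """One attention step restricted to the graph's edges: the softmax runs
    only over each node's neighbours (non-edges are masked to -inf)."""
    S = X @ X.T                                    # raw pairwise scores
    S = np.where(A > 0, S, -np.inf)                # mask out non-edges
    S = S - S.max(axis=1, keepdims=True)           # numerically stable softmax
    W = np.exp(S)
    W = W / W.sum(axis=1, keepdims=True)
    return W @ X                                   # aggregate neighbour features

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])                          # path graph 1 - 0 - 2
X = np.random.randn(3, 4)
H = masked_attention(X, A)
print(H.shape)   # (3, 4)
```

Node 1's only neighbour is node 0, so its output row is exactly X[0] - the mask, not the scores, decides the receptive field.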
G.3 Knowledge Graph Reasoning as Shortest Path
Knowledge graphs encode entities and relations as directed, labelled edges: (entity1, relation, entity2). Multi-hop reasoning queries like "Who is the spouse of the president of France?" require traversing multiple edges.
Rule-based reasoning finds shortest paths or bounded-length paths in the KG. This is multi-hop BFS: find all paths from "president of France" of length \leq k that end at a person via a "spouse" edge.
Embedding-based reasoning (TransE, RotatE, ComplEx) learns entity embeddings e and relation embeddings r such that (in TransE) e_h + r \approx e_t for true triples (h, r, t). Query answering becomes nearest-neighbour search in embedding space - the graph structure is compiled into the embeddings.
Neural graph reasoning (NeuralLP, DRUM, NBFNet) explicitly runs differentiable versions of Bellman-Ford or BFS on the KG, learning edge weights that capture relation semantics. NBFNet (Neural Bellman-Ford Networks, Zhu et al. 2021) initialises boundary conditions at the query entity and propagates generalised Bellman-Ford messages, achieving state-of-the-art on KG link prediction.
G.4 Causal Graphs and d-Separation
In a Bayesian network or structural causal model, the graph is a DAG where edges X_i -> X_j represent direct causal influences.
d-separation (Pearl 1988) is a graph-theoretic criterion for conditional independence: variables X and Y are conditionally independent given Z iff every path between X and Y is "blocked" by Z in the DAG.
A path is blocked if it contains:
- A chain a -> b -> c with b \in Z (conditioning on the mediator blocks the path)
- A fork a <- b -> c with b \in Z (conditioning on the common cause blocks)
- A collider a -> b <- c with b \notin Z and no descendant of b in Z (a collider blocks the path unless conditioned on)
Algorithmic test. Given a DAG and sets X, Y, Z, determine d-separation by BFS/DFS on the moral graph (marry co-parents, drop edge directions) restricted by the Bayes Ball algorithm. This is a graph reachability computation - plain BFS/DFS on a transformed graph.
Interventional queries (do-calculus) modify the DAG by removing incoming edges to the intervened variable and running independence tests on the modified graph. Every do-calculus rule reduces to graph reachability computations.
G.5 Neural Architecture Search as Graph Problem
Neural Architecture Search (NAS) frames architecture design as optimisation over a DAG of operations. A neural architecture is a DAG where:
- Nodes are feature maps (tensors)
- Edges are operations (conv, relu, attention, etc.)
- Multiple paths from input to output = skip connections
DARTS (Differentiable Architecture Search, Liu et al. 2019) relaxes the discrete DAG into a continuous optimisation: each edge carries a mixture of all candidate operations, and the mixture weights are jointly optimised with network weights by gradient descent. The final architecture is recovered by picking the argmax operation per edge - a topological sort on the final discrete DAG.
Cell-based NAS restricts the search to small DAG templates (cells) that are stacked. The cell DAG is searched over, then replicated. This reduces the search space from all possible architectures to all DAGs over a handful of nodes with a fixed menu of candidate operations.
Appendix H: Problem Sets and Discussion Questions
H.1 Conceptual Questions
Q1. BFS guarantees shortest hop-distances. Dijkstra guarantees shortest weighted distances. Under what exact condition do these two algorithms give the same answer?
Answer: When all edge weights are equal (say w(e) = c > 0 for every edge e). In this case Dijkstra's priority queue always extracts vertices in BFS order, and the two algorithms are equivalent. BFS is preferred since it avoids the overhead of the priority queue.
Q2. Why can Bellman-Ford handle negative edge weights but Dijkstra cannot?
Answer: Dijkstra's correctness relies on the greedy invariant: once a vertex is extracted from the priority queue, its distance estimate is final and correct. This holds because non-negative edge weights guarantee that any later path through unvisited vertices cannot improve the distance. With negative edge weights, a vertex extracted "early" might later be improved via a path that includes a negative edge - breaking the invariant. Bellman-Ford avoids this by not committing to any distance estimate prematurely: it runs full rounds of relaxation, allowing distances to be improved in any order.
Q3. Give an example where DFS-based topological sort and Kahn's algorithm produce different orderings. Which is "correct"?
Answer: Any DAG with multiple valid topological orderings will do. For , (two sources ): DFS might produce or depending on which source is processed first. Kahn's might produce or depending on queue order. Both are correct - any topological ordering is valid, and neither algorithm has priority over the other.
Q4. In what sense is A* with h \equiv 0 equivalent to Dijkstra? With h = true remaining distance?
Answer: With h \equiv 0: f(v) = g(v), so A* extracts vertices in exactly the same order as Dijkstra - they are identical. With h(v) = \delta(v, t) (perfect heuristic): f(v) = g(v) + \delta(v, t), which equals \delta(s, t) for all vertices on optimal paths. A* with the perfect heuristic only expands vertices on optimal paths - it finds the answer by exploring the minimum possible number of nodes.
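Both claims can be checked on an empty grid (a sketch; ties on f are broken toward larger g so that the perfect heuristic marches straight to the goal):

```python
import heapq

def astar(n, start, goal, h):
    """A* on an n x n 4-connected grid. Heap entries are (f, -g, node), so
    ties on f prefer larger g. Returns (distance, nodes expanded)."""
    dist = {start: 0}
    pq = [(h(start), 0, start)]
    expanded = 0
    while pq:
        f, ng, u = heapq.heappop(pq)
        if -ng > dist[u]:                  # stale entry
            continue
        expanded += 1
        if u == goal:
            return dist[u], expanded
        x, y = u
        for v in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= v[0] < n and 0 <= v[1] < n and dist[u] + 1 < dist.get(v, float('inf')):
                dist[v] = dist[u] + 1
                heapq.heappush(pq, (dist[v] + h(v), -dist[v], v))
    return float('inf'), expanded

goal = (9, 9)
manhattan = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
d0, e0 = astar(10, (0, 0), goal, lambda p: 0)   # h = 0: identical to Dijkstra
d1, e1 = astar(10, (0, 0), goal, manhattan)     # perfect heuristic on an empty grid
print(d0, d1, e0, e1)   # 18 18 100 19
```

With h = 0 every cell is expanded before the goal; with the perfect heuristic only the 19 cells of one optimal path are touched.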
Q5. The Ford-Fulkerson algorithm may not terminate if capacities are irrational. Explain the pathological example.
Answer: There is a classical small network (due to Ford and Fulkerson) with one irrational capacity r = (\sqrt{5} - 1)/2, which satisfies r^2 = 1 - r. A badly chosen sequence of augmenting paths produces flow increments r, r^2, r^3, ..., whose sum converges to a value strictly less than the maximum flow - the algorithm never terminates. Edmonds-Karp avoids this by always using BFS (shortest augmenting path), which guarantees polynomial termination regardless of capacities.
H.2 Proof Exercises
P1. Prove that any simple graph with n vertices and more than (n-1)(n-2)/2 edges must be connected. (Hint: show a disconnected graph has at most \binom{n-1}{2} edges, by placing all edges inside one component of n - 1 vertices.)
P2. Prove the cycle property: the unique maximum-weight edge in any cycle is not in any MST. (Hint: suppose it is in some MST T; removing it splits T into two components; the other cycle edges provide a cheaper reconnection.)
P3. Prove that for any DFS tree, a vertex u is an ancestor of v iff disc[u] < disc[v] < fin[v] < fin[u].
P4. Show that the number of SCCs in a directed graph equals the number of source nodes (in-degree 0) in the condensation DAG if and only if the condensation is a path (linear DAG). Give an example where this fails.
H.3 Implementation Challenges
I1. Implement Dijkstra on a large random sparse graph. Compare running time with and without lazy deletion (keep stale entries in the heap and skip them when extracted).
I2. Implement Tarjan's SCC iteratively (without recursion, using an explicit stack). Verify against the recursive version on a random directed graph.
I3. Implement bidirectional Dijkstra and compare its running time to standard Dijkstra on a grid graph. Report the ratio of nodes explored.
I4. Implement Ford-Fulkerson with BFS (Edmonds-Karp). Test on a random bipartite worker-job assignment instance. Compare the max-flow solution to scipy.optimize.linear_sum_assignment.
Appendix I: Connections to Other Chapters
I.1 From Linear Algebra to Graph Algorithms
The adjacency matrix introduced in 02 appears in graph algorithms in several ways:
- Walk counting: (A^k)_{ij} = number of walks of length k from i to j (C.3 above)
- Reachability: transitive closure A \lor A^2 \lor \cdots \lor A^n - Boolean APSP
- Tropical matrix power: shortest paths via W^{\otimes (n-1)} in the min-plus semiring (C.2)
- Gaussian elimination on graphs: LU decomposition of the Laplacian relates to vertex elimination on the graph (Cholesky = symbolic factorisation of the elimination tree)
The matrix-tree theorem (Kirchhoff, 1847) connects spanning tree counting to linear algebra: the number of spanning trees of a connected graph equals any cofactor of its Laplacian L, equivalently (1/n) \lambda_2 \lambda_3 \cdots \lambda_n in terms of the nonzero Laplacian eigenvalues.
This is proved via the Cauchy-Binet formula and connects the combinatorial (MST, spanning trees) and algebraic (eigenvalues) views of graphs - a bridge to 04.
I.2 From Probability to Graph Algorithms
Random walks on graphs - the Markov chain perspective - connect graph algorithms to probability theory (Chapter 6):
- Stationary distribution of the random walk on a d-regular graph = uniform over vertices
- Mixing time = how long until the walk is close to stationary - governed by 1/\lambda_2, where \lambda_2 is the Fiedler value (connection to 04)
- Cover time = expected steps for the walk to visit all vertices; for any connected graph, cover time = O(n^3), with Matthews' bound relating it to the maximum hitting time
- PageRank = stationary distribution of a damped random walk: \pi = \alpha P^T \pi + (1 - \alpha) \mathbf{1}/n - a fixed-point equation solved by power iteration (an iterative BFS-like process)
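The PageRank fixed point can be sketched as power iteration (a toy graph; damping alpha = 0.85, and the `pagerank` helper is illustrative):

```python
import numpy as np

def pagerank(A, alpha=0.85, iters=100):
    """Power iteration for pi = alpha * P^T pi + (1 - alpha) / n."""
    n = A.shape[0]
    deg = A.sum(axis=1, keepdims=True)
    P = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)  # row-stochastic
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = alpha * (P.T @ pi) + (1 - alpha) / n
    return pi / pi.sum()

# Toy web graph: node 2 receives links from all other nodes and links back to 0
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
pr = pagerank(A)
print(pr.argmax())   # 2
```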
I.3 From Optimisation to Graph Algorithms
Graph algorithms are the foundation of many combinatorial optimisation methods (Chapter 8):
- Linear programming duality <-> max-flow min-cut duality: the LP dual of max-flow is min-cut (strong duality)
- Integer programming on graphs: MST, matching, flow are all polynomial due to total unimodularity of their constraint matrices
- Quadratic forms on the graph Laplacian: minimising x^T L x subject to norm constraints yields eigenvalue problems (04)
- Submodular optimisation: cut functions are submodular; greedy algorithms on submodular functions generalise the greedy MST argument
I.4 Looking Ahead to Chapter 12 (Functional Analysis)
The graph Laplacian can be extended to continuous domains: replace the finite vertex set with a manifold, replace the adjacency matrix with a kernel function, and take the limit n -> \infty. The result is the Laplace-Beltrami operator on manifolds - the spectral graph theory of 04 generalises to functional analysis on infinite-dimensional spaces.
Spectral graph algorithms (04) such as spectral clustering use eigenvectors of - these correspond to the eigenfunctions of the Laplace-Beltrami operator in the continuous limit. This connection motivates the study of Hilbert spaces and operators in Chapter 12.
Appendix J: Graph Algorithms in Large Language Models
J.1 Retrieval-Augmented Generation and Knowledge Graphs
Large Language Models (LLMs) augmented with external knowledge graphs use graph traversal at inference time:
- Entity recognition: identify entities in the user query (NER)
- Graph lookup: retrieve the entity's node in the KG
- Multi-hop traversal: BFS/DFS to find related facts within k hops
- Context injection: retrieved subgraph facts are appended to the LLM prompt
The KG traversal step is BFS with budget constraints (a bounded number of hops and a bounded number of retrieved nodes). Systems like KAPING (Baek et al. 2023) and Think-on-Graph (Sun et al. 2023) implement this pattern, achieving better factual accuracy than pure parametric LLMs on multi-hop QA.
J.2 Chain-of-Thought as Topological Sort
When an LLM generates a chain-of-thought (CoT) reasoning trace, it implicitly performs topological ordering over a DAG of reasoning steps:
- Each intermediate conclusion is a node
- Dependencies between conclusions are edges (conclusion A must precede conclusion B)
- The order of generation = a topological sort of the reasoning DAG
Tree-of-Thought (Yao et al. 2023) makes this explicit: a search tree of partial solutions is explored using BFS or DFS, with a value function scoring each node. This is A* or BFS on the reasoning tree - graph algorithms for LLM reasoning.
Scratchpad / Program-of-Thought (Chen et al. 2022) generates executable programs (Python, JavaScript) that are then run. A Python program is itself a DAG - statements form a control flow graph. The LLM is generating a topological order of operations, and Python's interpreter executes them in that order.
J.3 Attention Patterns as Adjacency Matrices
The attention weight matrix in a Transformer layer (for n tokens, an n \times n matrix) is a dense, weighted adjacency matrix of a complete directed graph over the tokens. Each head computes a different "view" of this graph.
Mechanistic interpretability research (Elhage et al. 2021, Olsson et al. 2022) uses graph-theoretic analysis to understand information flow through attention:
- Induction heads implement a 2-hop path: attend to a previous occurrence of the current token, then "copy" the next token - a BFS-2 pattern
- Circuit analysis identifies subgraphs of attention heads (circuits) responsible for specific capabilities - analogous to finding subgraphs with specific connectivity properties
- Attention sink phenomenon (Xiao et al. 2023): some tokens (e.g., [BOS]) receive very high attention from all other tokens - analogous to high-degree "hub" nodes in a graph, or sink nodes that absorb most of the max-flow
J.4 Structured State Space Models and Path Counting
Structured State Space Models (S4, Mamba) process sequences using linear recurrences:
h_t = A h_{t-1} + B x_t, \quad y_t = C h_t
The matrix A defines a recurrence over a latent state - equivalently, a weighted adjacency matrix of a directed "state transition graph". The output depends on all past inputs via the series y_t = \sum_{k \geq 0} C A^k B x_{t-k} - a sum over all path lengths in the state graph.
This is the tropical-algebra perspective again: SSMs compute convolutions in the usual linear semiring (+, \times), while graph shortest paths use the tropical semiring (min, +). The mathematical structure is identical; only the semiring changes.
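The equivalence of the recurrent and convolutional views can be checked numerically (a sketch with random matrices; dimensions are invented for illustration):

```python
import numpy as np

# Linear state-space recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t
# unrolls to y_t = sum_k (C A^k B) x_{t-k}: a convolution whose kernel
# is a "path sum" through the state-transition graph defined by A.
rng = np.random.default_rng(0)
d = 3                                    # toy state dimension
A = 0.5 * rng.standard_normal((d, d))
B = rng.standard_normal((d, 1))
C = rng.standard_normal((1, d))
x = rng.standard_normal(8)               # input sequence of length 8

# Recurrent evaluation
h = np.zeros((d, 1))
y_rec = []
for xt in x:
    h = A @ h + B * xt
    y_rec.append(float(C @ h))

# Convolutional evaluation with kernel K[k] = C A^k B
K = [float(C @ np.linalg.matrix_power(A, k) @ B) for k in range(len(x))]
y_conv = [sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(len(x))]
print(np.allclose(y_rec, y_conv))   # True
```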
Appendix K: Pseudocode Summary (All Algorithms)
======================================================================
BFS(G, s)
d[s]=0; d[v]=\infty for all v \neq s; Q=[s]
while Q: u=dequeue(Q); for v \in N(u): if d[v]=\infty: d[v]=d[u]+1; enqueue(Q,v)
----------------------------------------------------------------------
DFS(G)
for v: DFS-VISIT(v) if v unvisited
DFS-VISIT(u): disc[u]=time++; GREY; for v \in N(u): if v WHITE: DFS-VISIT(v)
fin[u]=time++; BLACK
----------------------------------------------------------------------
DIJKSTRA(G, s)
d[s]=0; d[v]=\infty for all v \neq s; Q=min-heap(V keyed by d)
while Q: u=EXTRACT-MIN(Q); for v \in N(u): if d[u]+w(u,v)<d[v]: DECREASE-KEY
----------------------------------------------------------------------
BELLMAN-FORD(G, s)
d[s]=0; d[v]=\infty for all v \neq s
for i=1..n-1: for (u,v) \in E: RELAX(u,v)
for (u,v) \in E: if d[u]+w(u,v)<d[v]: return "neg cycle"
----------------------------------------------------------------------
FLOYD-WARSHALL(G)
D=weight matrix; D[i][i]=0
for k=1..n: for i: for j: D[i][j]=min(D[i][j], D[i][k]+D[k][j])
----------------------------------------------------------------------
KRUSKAL(G)
Sort E by weight; MAKE-SET(v) for all v
for (u,v) in sorted E: if FIND(u) \neq FIND(v): add edge; UNION(u,v)
----------------------------------------------------------------------
PRIM(G, r)
key[r]=0; key[v]=\infty for all v \neq r; Q=min-heap(V keyed by key)
while Q: u=EXTRACT-MIN; for v \in N(u): if v \in Q and w(u,v)<key[v]: DECREASE-KEY
----------------------------------------------------------------------
KAHN(G)
indeg[v]=in-degree for all v; Q=[v: indeg[v]=0]
while Q: u=dequeue; order.append(u); for (u,v): indeg[v]--; if 0: enqueue
----------------------------------------------------------------------
TARJAN-SCC(G)
DFS-VISIT(v): disc[v]=low[v]=time++; stack.push(v); onStack[v]=T
for (v,w): if disc[w]=-1: visit(w); low[v]=min(low[v],low[w])
elif onStack[w]: low[v]=min(low[v],disc[w])
if low[v]=disc[v]: pop SCC from stack until v
----------------------------------------------------------------------
FORD-FULKERSON(G, s, t)
f=0; while augmenting path P in G_f: \delta=bottleneck(P); augment by \delta
======================================================================
Appendix L: Worked Example - Full Graph Problem
L.1 End-to-End: Building and Querying a Road Network
Problem. You are given a city road network with 7 intersections and 10 roads (edges). You need to: (a) Find shortest driving routes (weighted by travel time) (b) Find which intersections can reach city hall (reachability) (c) Plan a minimum-cost infrastructure to connect all intersections (MST) (d) Find the maximum flow of buses from depot to stadium
Graph. Intersections . Roads (undirected, travel time in minutes):
Part (a): Dijkstra from
Sort by extraction order:
- Extract : relax ,
- Extract : relax (tie),
- Extract : relax OK,
- Extract : relax ,
- Extract : relax OK,
- Extract : no improvement
- Extract : done
Result:
Part (b): BFS reachability (unweighted)
From : Layer 0: , Layer 1: , Layer 2: , Layer 3: . All 7 intersections reachable.
Part (c): Kruskal MST
Sorted edges:
Re-sorted:
- Add : , others singletons.
- Add : .
- Add : .
- Add : (merges).
- Add : (merges).
- Add : FIND() = FIND() - same component! Skip.
- Add : - all connected!
MST edges: . Total weight: minutes.
Part (d): Max flow (bus routes with capacities)
Convert to directed flow network: depot = , stadium = , capacities = edge weights. Augmenting paths found by BFS:
- Path : bottleneck = min(5,3,2,4,1) = 1
- After augmenting: saturated. Next path must avoid edge at capacity.
- Path : bottleneck = min(5,3,2,6) = 2; update residual
- Continue until no augmenting path...
Max flow = 5 (limited by the cut with capacity ...
[Full trace left as Exercise 8]
This single graph problem illustrates how all four algorithm families (BFS, SSSP, MST, max-flow) answer different questions about the same underlying graph - each appropriate for a different operational need.
Appendix M: Implementation Notes for ML Engineers
M.1 Graph Algorithms in PyTorch Geometric
PyTorch Geometric (PyG) is the standard library for graph deep learning, but it also exposes classical graph operations. Understanding how classical algorithms map to PyG's data model is essential for efficient GNN implementation.
Graph data model:
```python
from torch_geometric.data import Data
import torch

# A graph with 4 nodes and 4 undirected edges (stored as 8 directed entries)
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3, 0, 3],   # source nodes
                           [1, 0, 2, 1, 3, 2, 3, 0]],  # target nodes
                          dtype=torch.long)
x = torch.randn(4, 8)   # 4 nodes, 8 features each
data = Data(x=x, edge_index=edge_index)
```
BFS in PyG: not directly built-in, but torch_geometric.utils.k_hop_subgraph implements BFS-k neighbourhood extraction:
```python
from torch_geometric.utils import k_hop_subgraph

# Get the 2-hop neighbourhood of node 0
subset, edge_index_sub, mapping, mask = k_hop_subgraph(0, num_hops=2,
                                                       edge_index=edge_index)
```
Shortest paths via scipy:
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path, connected_components
# Convert PyG edge_index to scipy sparse matrix
A = csr_matrix((torch.ones(edge_index.size(1)), edge_index.numpy()), shape=(n,n))
dist = shortest_path(A, method='D') # Dijkstra (non-neg weights)
n_comp, labels = connected_components(A)
M.2 Efficient Implementations: Common Patterns
Pattern 1: Avoid Python loops - use vectorised operations
```python
# BAD: Python-loop BFS (slow due to interpreter overhead on every edge)
for u in queue:
    for v in adj[u]:
        if d[v] == inf:
            ...

# GOOD: scipy's C-backed unweighted shortest paths (BFS distances)
from scipy.sparse.csgraph import shortest_path
dist = shortest_path(A, unweighted=True, indices=0)
```
Pattern 2: Use indexed edge lists for Bellman-Ford
```python
import numpy as np

# Represent edges as three arrays for vectorised relaxation
src = np.array([0, 0, 1, 2])
dst = np.array([1, 2, 3, 3])
w = np.array([4., 2., 1., 3.])
n, s = 4, 0
d = np.full(n, np.inf)
d[s] = 0.
for _ in range(n - 1):
    # np.minimum.at handles repeated destinations correctly within a round
    np.minimum.at(d, dst, d[src] + w)
```
Pattern 3: Union-Find with path compression (NumPy)
```python
import numpy as np

n = 10                 # number of elements
parent = np.arange(n)

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression (halving)
        x = parent[x]
    return x

def union(x, y):
    px, py = find(x), find(y)
    if px != py:
        parent[px] = py
```
M.3 GPU Graph Algorithms with cuGraph
NVIDIA cuGraph provides GPU-accelerated graph algorithms with a NetworkX-compatible API:
```python
import cudf
import cugraph

# Load edge list
edges = cudf.DataFrame({'src': [0, 0, 1, 2], 'dst': [1, 2, 3, 3],
                        'wt': [4., 2., 1., 3.]})
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source='src', destination='dst', edge_attr='wt')

# GPU BFS
bfs_result = cugraph.bfs(G, start=0)

# GPU Dijkstra (SSSP)
sssp_result = cugraph.sssp(G, source=0)
```
For billion-node graphs, cuGraph reports throughputs on the order of billions of traversed edges per second on modern GPUs - orders of magnitude faster than serial Python implementations.
M.4 Memory-Efficient Graph Storage
For large graphs that don't fit in GPU memory, consider:
- Graph partitioning: split the graph into parts using METIS or spectral methods, process each partition on a separate GPU, and exchange boundary information
- Graph sampling: for GNNs, sample subgraphs (GraphSAGE-style) rather than loading the full adjacency
- Streaming algorithms: process edges in order without materialising the full graph; union-find for online connectivity, sliding-window BFS for temporal graphs
- External memory algorithms: if the graph doesn't fit in RAM, use disk-based BFS (level-by-level, one level per disk pass) - used for web-crawl graphs in the early 2000s