Supernodal Floyd-Warshall

Motivation

When I was in graduate school, I studied the paper A Supernodal All-Pairs Shortest Path Algorithm (PPoPP '20).
This project is a revisit of that paper and a re-implementation of its core ideas.
The key contribution of the paper is exploiting graph sparsity in the Floyd-Warshall algorithm by importing techniques from sparse Cholesky factorization: fill-in reducing ordering, symbolic analysis, supernodal traversal, and elimination tree parallelism.
All source code can be found at GitHub.

Background

Single source shortest path (SSSP)

Single source shortest path (SSSP) is a problem that finds the shortest paths from a single source vertex to all other vertices in a weighted graph.

Bellman-Ford algorithm

Bellman-Ford relaxes every edge repeatedly.
After $|V| - 1$ iterations, it guarantees that all shortest paths are found, since any shortest path has at most $|V| - 1$ edges.
If a shorter path can still be found after $|V| - 1$ iterations, a negative cycle exists.
It supports graphs with negative edge weights and can detect negative cycles.
It takes $O(|E||V|)$ to compute.
Source code is here.

Dijkstra

Dijkstra's algorithm uses a priority queue (min-heap) to always relax the vertex with the currently known smallest distance.
Because it greedily picks the minimum, it cannot handle negative edge weights.
It takes $O(|E| \log |V|)$ to compute.
Source code is here.

All-pairs shortest path (APSP)

All-pairs shortest path (APSP) is a problem that finds the shortest paths between every pair of vertices in a weighted graph.

Delegation

The simplest approach: run an SSSP solver from every vertex.
It takes $O(|V| \times \text{SSSP complexity})$, i.e., $O(|E||V|^2)$ with Bellman-Ford or $O(|E||V|\log|V|)$ with Dijkstra.
Source code is here.

Johnson's algorithm

Johnson's algorithm removes negative edge weights with Bellman-Ford so that Dijkstra can be applied safely.
Let $n$ = number of vertices, $m$ = number of edges, and $d(i,j)$ the weight from vertex $i$ to vertex $j$.
The steps are:

Add a new vertex $S$ (index $n+1$) and connect it to all other vertices with weight $0$.
Run Bellman-Ford from $S$ to compute $h(v) = d(n+1, v)$ for every vertex $v$.
Reweight each edge: $\hat{d}(i,j) = d(i,j) + h(i) - h(j)$.
Remove $S$. All reweighted edges are now non-negative.
Run Dijkstra from every vertex using the reweighted graph.
Recover original distances: $\text{dist}(i,j) = \hat{\text{dist}}(i,j) - h(i) + h(j)$.

The non-negativity of $\hat{d}$ follows because $h(i) \le h(j) + d(i,j)$ (shortest-path property), so $\hat{d}(i,j) \ge 0$.
It takes $O(|E||V|\log|V|)$ to compute.
Source code is here.

Floyd-Warshall

Floyd-Warshall uses dynamic programming: at the $k$-th iteration it checks, for every pair $(i,j)$, whether using vertex $v_k$ as an intermediate gives a shorter path.
$$\text{Dist}^k[i,j] \leftarrow \min\!\left(\text{Dist}^{k-1}[i,j],\; \text{Dist}^{k-1}[i,k] + \text{Dist}^{k-1}[k,j]\right)$$ After $n$ iterations, $\text{Dist}[i,j]$ is the true shortest-path length between $v_i$ and $v_j$.

Function FloydWarshall($G = (V, E)$)

Initialize $\text{Dist}[i,j] \leftarrow w_{i,j}$ if $(i,j) \in E$, else $\infty$
For $k \leftarrow 1, \cdots, n$

For $i \leftarrow 1, \cdots, n$

For $j \leftarrow 1, \cdots, n$

$\text{Dist}[i,j] \leftarrow \min(\text{Dist}[i,j],\; \text{Dist}[i,k] + \text{Dist}[k,j])$

Return Dist

Algorithm 1. Floyd-Warshall

It takes $O(|V|^3)$ in time and $O(|V|^2)$ in space.
Source code is here.

Floyd-Warshall with double buffering

The standard Floyd-Warshall reads and writes Dist in place, which creates a data race when parallelized.
Double buffering maintains two copies of the distance matrix — one for reading and one for writing — so the $k$-th outer iteration can be parallelized over all $(i, j)$ pairs without conflict.
Source code is here.

Floyd-Warshall with compressive indexing

For sparse graphs, many entries of $\text{Dist}$ are $\infty$.
Compressive indexing maintains, for each vertex $k$, a list of indices $(i, j)$ where $\text{Dist}[i,k]$ and $\text{Dist}[k,j]$ are both finite, so the inner loops iterate only over actual edges rather than all $n^2$ pairs. Source code is here.

Floyd-Warshall with double buffering + compressive outgoing edge indexing

Combining both optimizations: double buffering removes the in-place write race, while compressive indexing skips $\infty$ pairs.
Only one axis (say, the $i$ dimension) is parallelized to avoid race conditions when updating the sparse index structure.
It also skips updating entries that have not changed.
Source code is here.

Floyd-Warshall with double buffering + compressive (incoming / outgoing) edge indexing

A further variant uses locks to allow full parallelization over both dimensions while still maintaining the sparse index structure.
This increases synchronization overhead but reduces the total number of operations.
Source code is here.

Blocked Floyd-Warshall

Blocked Floyd-Warshall improves cache locality by reordering the computation into three phases per block column $k$.
Let $N$ = number of vertices, $B$ = number of blocks, $S$ = block size ($N = SB$).

Diagonal update: Run classical Floyd-Warshall on the $k$-th diagonal block $A(k,k)$.
Panel update: Update the $k$-th block row and column using the newly updated diagonal block via Min-Plus multiply.
MinPlus outer product: Update all remaining blocks using the $k$-th block row and column.

Function BlockedFloydWarshall($A$)

For $bk \leftarrow 1, \cdots, B$

// Diagonal update
$A(bk,bk) \leftarrow \text{FloydWarshall}(A(bk,bk))$
// Panel update
For $bi \leftarrow 1, \cdots, B$, $bi \neq bk$

$A(bi, bk) \leftarrow A(bi, bk) \oplus A(bi, bk) \otimes A(bk, bk)$

For $bj \leftarrow 1, \cdots, B$, $bj \neq bk$

$A(bk, bj) \leftarrow A(bk, bj) \oplus A(bk, bk) \otimes A(bk, bj)$

// MinPlus outer product
For $bi \leftarrow 1, \cdots, B$, $bi \neq bk$

For $bj \leftarrow 1, \cdots, B$, $bj \neq bk$

$A(bi,bj) \leftarrow A(bi,bj) \oplus A(bi,bk) \otimes A(bk,bj)$

Return $A$

Algorithm 2. Blocked Floyd-Warshall

Here $\oplus$ denotes element-wise min, and $\otimes$ denotes the Min-Plus (tropical) matrix product.
The outer product step is analogous to the Schur-complement update in LU or Cholesky factorization.
Source code is here.

Min-Plus semiring

APSP can be understood algebraically over the tropical semiring.
Define two binary scalar operators: $$x \oplus y := \min(x, y) \qquad x \otimes y := x + y$$ The Min-Plus matrix product of $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$ is: $$C_{ij} \leftarrow \bigoplus_{k} A_{ik} \otimes B_{kj} = \min_k (A_{ik} + B_{kj})$$ This is the graph-path analog of ordinary matrix multiplication: $C_{ij}$ is the length of the shortest two-hop path from source $i$ to destination $j$ via any bridge vertex $k$.
Floyd-Warshall computes the closure of the weight matrix under this semiring.

Supernodal Floyd-Warshall

The naive and blocked Floyd-Warshall algorithms perform $O(n^3)$ work regardless of sparsity.
The key insight of the paper is that the algebraic equivalence between Floyd-Warshall and Gaussian elimination means that all techniques from sparse direct solvers can be applied.
The result is the SuperFw algorithm, which combines three ideas: nested-dissection reordering, symbolic analysis with supernodal structure, and elimination tree parallelism.

Nested dissection ordering

When running Blocked Floyd-Warshall, an empty block (all-$\infty$) remains empty as long as the pivot column and row are also empty.
This is the graph-path analog of zero fill-in.
By reordering vertices, we can defer the creation of finite entries (fill-in) as long as possible, thereby skipping many operations.

Nested dissection (ND) finds a vertex separator $S \subset V$ that partitions $V$ into three disjoint sets $V = C_1 \cup S \cup C_2$ such that:

There are no edges between $C_1$ and $C_2$ (they are disconnected without $S$).
$|C_1|$ and $|C_2|$ are roughly equal.
$|S|$ is as small as possible.

After reordering, the adjacency matrix takes a block-arrow shape: the two diagonal blocks correspond to $C_1$ and $C_2$, and the separator $S$ forms the final rows and columns.
This reordering is applied recursively within $C_1$ and $C_2$ to produce a multi-level separator tree.
Tools such as METIS or Scotch compute near-optimal separators.
For a planar graph (e.g., a road network), the separator size is $|S| = O(\sqrt{n})$, yielding $O(n^{2.5})$ total work instead of $O(n^3)$.

Symbolic analysis and fill-in structure

Before running the numerical phase, symbolic analysis computes the exact sparsity pattern of the final $\text{Dist}$ matrix — i.e., which entries will eventually become finite — without performing any arithmetic.
This mirrors the symbolic phase of sparse Cholesky factorization and enables:

Pre-allocation of memory for all fill-in positions.
Construction of the elimination tree and supernodal structure.

What is fill-in?

Initially, $\text{Dist}[i,j] = \infty$ whenever there is no direct edge between $v_i$ and $v_j$.
During Floyd-Warshall, at iteration $k$: $$\text{Dist}[i,j] \leftarrow \min(\text{Dist}[i,j],\; \text{Dist}[i,k] + \text{Dist}[k,j])$$ If $\text{Dist}[i,k] < \infty$ and $\text{Dist}[k,j] < \infty$, then $\text{Dist}[i,j]$ will become finite even if it was $\infty$ before.
This transition from $\infty$ to a finite value is called fill-in — the graph-path analog of a zero entry becoming non-zero in Gaussian elimination.

At termination, $\text{Dist}[i,j]$ is finite if and only if $v_j$ is reachable from $v_i$.
For a connected graph, the final $\text{Dist}$ matrix is therefore completely dense.

However, when a specific entry $(i,j)$ fills in depends on vertex ordering.
Under a fill-in reducing ordering (ND), many entries stay $\infty$ for as long as possible, allowing large portions of the computation to be skipped.

Column non-zero pattern

For supernode (column) $k$, define $\text{struct}(k)$ as the set of row indices $i > k$ for which $\text{Dist}[i,k]$ will eventually become finite.
This set can be computed purely symbolically using the following recurrence:
$$\text{struct}(k) = \bigl(\text{Adj}(k) \cap \{k+1,\ldots,n\}\bigr) \;\cup\; \bigcup_{\substack{j < k \\ j \in \text{Adj}(k)}} \text{struct}(j)$$ In words: column $k$'s non-zero rows are the direct neighbors of $k$ above it in the ordering, plus the non-zero rows inherited from every already-eliminated neighbor $j < k$.
This inheritance corresponds to paths that pass through $j$ on their way to $k$: if $(j, k)$ is an edge and $(i, j)$ is a path, then there is a path $(i, k)$ via $j$.

Computing all $\text{struct}(k)$ for $k = 1, \ldots, n$ gives the complete fill-in pattern without evaluating any actual distances.

Elimination tree construction

While computing column non-zero patterns, the parent} of supernode $k$ in the elimination tree is defined as: $$\text{parent}(k) = \min\{i \in \text{struct}(k)\}$$ i.e., the smallest-indexed row that fills in column $k$.
This can be computed efficiently with a union-find (disjoint-set) data structure during a single left-to-right pass over the columns:
Function BuildETree($G = (V, E)$ after ND ordering)
Initialize union-find: each vertex is its own root
For $k \leftarrow 1, \cdots, n$
For each neighbor $j \in \text{Adj}(k)$ with $j < k$
$r \leftarrow \text{Find}(j)$
If $r \neq k$
$\text{parent}(r) \leftarrow k$
$\text{Union}(r, k)$

Return parent array

Algorithm. Elimination tree construction via union-find
The total cost is $O(n\,\alpha(n))$ where $\alpha$ is the inverse Ackermann function — effectively linear.
Supernode detection
A fundamental supernode is a maximal set of consecutive columns $\{k, k+1, \ldots, k+s-1\}$ that all share the same non-zero row pattern below the diagonal (except for the diagonal entry itself).
Formally, columns $k$ and $k+1$ belong to the same supernode if:

$\text{parent}(k) = k+1$ in the eTree (they are adjacent in the elimination chain), and

$\text{struct}(k+1) = \text{struct}(k) \setminus \{k+1\}$ (column $k+1$ has exactly the same non-zero rows as column $k$, minus its own diagonal).

When this holds, the entire group can be stored as a single dense block and processed with a dense Min-Plus kernel — this is the source of computational efficiency.

For example, under ND ordering on a 2D grid, the fine-level separator vertices within the same sub-region form supernodes because they all connect to the same higher-level separator vertices.
What symbolic analysis produces
After running symbolic analysis, the following structures are available before any arithmetic:

Column non-zero lists $\text{struct}(k)$ for every supernode $k$ — tells exactly which row blocks are non-zero in each column block.

Elimination tree (eTree) — encodes data dependencies and guides the parallel schedule.

Supernodal partition — groups columns into supernodes so that the distance matrix can be stored as a block-sparse matrix with dense blocks.

Pre-allocated supernodal block-sparse matrix — memory is allocated exactly for the non-zero blocks; no dynamic allocation is needed during the numerical phase.

This separation of structural computation (symbolic) from numerical computation is what allows SuperFw to avoid any work on all-$\infty$ blocks entirely.

Elimination tree
The elimination tree (eTree) is a tree data structure that captures data dependencies between the block eliminations in Blocked Floyd-Warshall.

In the 3-block example from Algorithm 2, eliminating block 1 and block 2 updates the trailing block 3.
Neither block-1 nor block-2 elimination depends on the other, so they can proceed in parallel, but both must complete before block-3 is eliminated.
This dependency is exactly described by an eTree where blocks 1 and 2 are siblings whose parent is block 3.

Under ND ordering, the eTree becomes multi-level: leaf nodes correspond to the finest-grained sub-separators, and the root corresponds to the top-level separator $S$.
Two supernodes $a$ and $b$ are called cousins if $\mathcal{D}(a) \cap \mathcal{D}(b) = \emptyset$ (their descendant sets are disjoint).
The elimination of cousin supernodes can proceed in parallel.

Supernodal structure
A supernode is a group of vertices that share the same fill-in pattern in the post-ND symbolic analysis.
Grouping such vertices together turns the sparse matrix into a block-sparse matrix whose blocks can be processed with dense Min-Plus kernels, exploiting SIMD and cache efficiently.

The supernodal partition is obtained by vertex contraction on the eTree, yielding a supernodal eTree.

For a supernode $v$, $\mathcal{A}(v)$ denotes its ancestors (parent, grandparent, …) and $\mathcal{D}(v)$ its descendants in the eTree.

SuperFw algorithm (sequential)

Function SuperFw($G = (V, E)$)
$n_s \leftarrow$ number of supernodes
For $k \leftarrow 1, \cdots, n_s$
// Diagonal update
$A(k,k) \leftarrow \text{FloydWarshall}(A(k,k))$
// Panel update
For $i \in \mathcal{A}(k) \cup \mathcal{D}(k)$
$A(i,k) \leftarrow A(i,k) \oplus A(i,k) \otimes A(k,k)$
$A(k,i) \leftarrow A(k,i) \oplus A(k,k) \otimes A(k,i)$

// MinPlus outer product
For $(i,j) \in \{\mathcal{A}(k) \cup \mathcal{D}(k)\} \times \{\mathcal{A}(k) \cup \mathcal{D}(k)\}$
$A(i,j) \leftarrow A(i,j) \oplus A(i,k) \otimes A(k,j)$

Return $A$

Algorithm 3. SuperFw (sequential)
The Panel and OuterUpdate steps touch only blocks corresponding to ancestors and descendants of $k$, not all $n_s^2$ block pairs.
This is exactly what sparsity buys: blocks that are structurally zero (all-$\infty$) are never accessed.

When eliminating supernode $v$, the OuterUpdate touches four regions:

$\mathcal{D}(v) \times \mathcal{D}(v)$ — updates the dense distance matrix directly (descendant–descendant).

$\mathcal{D}(v) \times \mathcal{A}(v)$ — updates the dense distance matrix directly (descendant–ancestor).

$\mathcal{A}(v) \times \mathcal{D}(v)$ — updates the dense distance matrix directly (ancestor–descendant).

$\mathcal{A}(v) \times \mathcal{A}(v)$ — updates the supernodal block-sparse matrix (trailing matrix, analogous to the Schur complement in Cholesky).

At the end of the computation, the supernodal matrix contains the semiring equivalent of the Cholesky factors, and the dense distance matrix contains all pairwise shortest-path lengths.

Parallel SuperFw with eTree parallelism
Two cousin supernodes $a$ and $b$ ($\mathcal{D}(a) \cap \mathcal{D}(b) = \emptyset$) operate on disjoint regions of the matrix in their DiagUpdate and PanelUpdate steps, and also in the first three OuterUpdate subsets.
Only the $\mathcal{A}(v) \times \mathcal{A}(v)$ blocks of two cousins can overlap, but since Min-Plus addition is associative, those updates can be performed in any order (they commute).

The parallel schedule is a bottom-up topological sort of the eTree, partitioning it into levels.
All supernodes at the same level are cousins and can be eliminated in parallel.
This is called eTree parallelism.
Function ParallelSuperFw($G = (V, E)$)
Compute eTree levels by bottom-up topological sort
For each level $\ell$ from leaves to root
For each supernode $k$ at level $\ell$ in parallel
// Diagonal update
$A(k,k) \leftarrow \text{FloydWarshall}(A(k,k))$
// Panel update
For $i \in \mathcal{A}(k) \cup \mathcal{D}(k)$
$A(i,k) \leftarrow A(i,k) \oplus A(i,k) \otimes A(k,k)$
$A(k,i) \leftarrow A(k,i) \oplus A(k,k) \otimes A(k,i)$

// MinPlus outer product (first three subsets — no conflict between cousins)
For $(i,j) \in \{\mathcal{D}(k) \times (\mathcal{A}(k) \cup \mathcal{D}(k))\} \cup \{(\mathcal{A}(k) \cup \mathcal{D}(k)) \times \mathcal{D}(k)\}$
$A(i,j) \leftarrow A(i,j) \oplus A(i,k) \otimes A(k,j)$
// $\mathcal{A}(k) \times \mathcal{A}(k)$ blocks updated with atomic reduce (tree-reduction)
For $(i,j) \in \mathcal{A}(k) \times \mathcal{A}(k)$
$A(i,j) \leftarrow A(i,j) \oplus A(i,k) \otimes A(k,j)$ (atomic)

Return $A$

Algorithm 4. Parallel SuperFw with eTree parallelism
eTree parallelism contributes up to $2\times$ better scalability for small graphs, where per-iteration work is low.
For large graphs with dense iterations, ordinary within-iteration parallelism is already sufficient.

Asymptotic analysis
The following table summarizes work $W(n)$, depth $D(n)$, and concurrency $C(n) = W(n)/D(n)$ in the work-depth model (PRAM/CREW).

Algorithm $W(n)$ $D(n)$ $C(n)$

BlockedFW $O(n^3)$ $O(n)$ $O(n^2)$

SuperFw $O(n^2 |S|)$ $O(|S| \log^2 n)$ $O\!\left(\dfrac{n^2}{\log^2 n}\right)$

Dijkstra (Johnson) $O(n^2 \log n + nm)$ $O(n \log n + m)$ $O(n)$

Path Doubling $O(n^3)$ $O(\log n)$ $O\!\left(\dfrac{n^3}{\log n}\right)$

Here $|S|$ is the size of the top-level separator of the graph.

Work: $W(n) = n^2|S|$
The dominant cost in eliminating the top-level separator $S$ is the OuterUpdate, which computes the outer product of two panels of size $(n - |S|) \times |S|$: $$W^0(n) \approx n^2 S(n)$$ Summing over all $\log n$ levels of the separator tree (each level contributes $\frac{n^2}{2^i} S\!\left(\frac{n}{2^i}\right)$ work) and using the monotonicity of $S(n)$ yields: $$W(n) = n^2 |S|$$ For a planar graph, $|S| = O(\sqrt{n})$, so $W(n) = O(n^{2.5})$. For a 3D mesh, $|S| = O(n^{2/3})$, so $W(n) = O(n^{8/3})$. For expander graphs, $|S| = O(n)$ and SuperFw provides no asymptotic improvement.

Depth: $D(n) = O(|S| \log^2 n)$
The top-level separator is eliminated sequentially using blocked Floyd-Warshall with depth $S(n)$.
At level $i$ of the separator tree, $2^i$ separators are eliminated in parallel, but blocks in the $\mathcal{A} \times \mathcal{A}$ region are shared: a block may be updated by up to $2^i$ processes simultaneously, adding a tree-reduction depth of $\log(2^i) = i$.
Summing over all $\log n$ levels: $$D(n) = \sum_{i=0}^{\log n} i \cdot S\!\left(\frac{n}{2^i}\right) \le S(n) \log^2 n$$
Concurrency: $C(n) = O(n^2 / \log^2 n)$
$$C(n) = \frac{W(n)}{D(n)} = O\!\left(\frac{n^2}{\log^2 n}\right)$$ SuperFw lowers the work of BlockedFW by $O(n/|S|)$ while barely reducing concurrency.
In contrast, Dijkstra has lower work but only $O(n)$ concurrency — it cannot exploit large SIMD units or deep memory hierarchies as effectively.

Graph classes and when SuperFw helps
SuperFw is most beneficial for graphs with small separators, broadly the class of geometric graphs:

Planar graphs (road networks, 2D FEM meshes): $|S| = O(\sqrt{n})$, work $O(n^{2.5})$.

3D FEM meshes: $|S| = O(n^{2/3})$, work $O(n^{8/3})$.

Power grids, social networks with community structure: separators are small in practice even when not worst-case optimal.

SuperFw provides no asymptotic improvement over BlockedFW for expander graphs (e.g., Barabási–Albert, Erdős–Rényi), where $|S| = O(n)$.

Implementations
This project re-implements the algorithms described above as a study of the paper.
The following algorithms are implemented and available at GitHub.
SSSP

Bellman-Ford

Dijkstra

APSP

Delegation

Johnson

Floyd-Warshall

Floyd-Warshall with double buffering

Floyd-Warshall with compressive indexing

Floyd-Warshall with double buffering + compressive outgoing edge indexing

Floyd-Warshall with double buffering + compressive (incoming / outgoing) edge indexing

Blocked Floyd-Warshall

Blocked Floyd-Warshall with compressive indexing

Supernodal Floyd-Warshall

Supernodal Floyd-Warshall with compressive indexing