Indiana UniversityLouisiana State UniversityMar 4, 2026arXiv:2603.04583

Overcoming Latency-bound Limitations of Distributed Graph Algorithms using the HPX Runtime System

Karame Mohammadiporshokooh, Panagiotis Syskakis, Andrew Lumsdaine

AI Summary

This paper introduces a distributed graph processing library prototype built on the HPX runtime system, aiming to overcome latency limitations in distributed graph algorithms. The prototype implements Breadth-First Search (BFS), PageRank, and Triangle Counting, leveraging HPX's distributed containers and asynchronous constructs. Evaluation against GraphX and PBGL demonstrates that HPX enables significant performance improvements by exploiting asynchronous execution, latency hiding, and fine-grained parallelism.

Key Contribution

HPX runtime lets distributed graph algorithms leave Spark GraphX and PBGL in the dust, thanks to asynchronous execution and latency hiding.

Abstract

Graph processing at scale presents many challenges, including the irregular structure of graphs, the latency-bound nature of graph algorithms, and the overhead associated with distributed execution. While existing frameworks such as Spark GraphX and the Parallel Boost Graph Library (PBGL) have introduced abstractions for distributed graph processing, they continue to struggle with inherent issues like load imbalance and synchronization overhead. In this work, we present a distributed library prototype and a distributed implementation of three key graph algorithms - Breadth-First Search (BFS), PageRank, and Triangle Counting, using C++ mechanisms from the NWgraph library and leveraging HPX's distributed containers and asynchronous constructs. These algorithms span the categories of Traversal, centrality, and Pattern matching, and are selected to represent diverse computational characteristics. We evaluate our HPX-based implementations against GraphX, and PBGL, showing that a high-performance runtime such as HPX enables the construction of algorithms that significantly outperform conventional frameworks by exploiting asynchronous execution, latency hiding, and fine-grained parallelism in shared memory. All algorithms in our prototype follow a unified execution model in which local and remote computations are expressed using the same programming abstractions, with asynchrony managed transparently by the runtime. This design explicitly leverages shared-memory parallelism within each locality while overlapping communication and computation across localities, providing a practical foundation for extending this approach to a broader class of distributed graph algorithms.

Architecture Design (Transformers, SSMs, MoE)Distributed Systems & Hardware

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...