HPE, ORNL, Sandia, Tennessee Technological University, University of Alabama, University of New Mexico

Feb 17, 2026

arXiv:2602.15356

Co-Design and Evaluation of a CPU-Free MPI GPU Communication Abstraction and Implementation

Patrick G. Bridges, Derek Schafer, Jack Lange, James B. White, Anthony Skjellum, Evan Suggs, Thomas Hines, Purushotham Bangalore, Matthew G. F. Dosanjh, Whit Schonbein

AI Summary

This paper introduces an MPI-based GPU communication API designed for CPU-free operation, aiming to improve performance in GPU-accelerated ML and HPC applications. The API leverages HPE Slingshot 11 network card capabilities and builds on previously proposed MPI extensions to remove the CPU from the communication fast path. Evaluation against Cray MPICH on Frontier and Tuolumne shows up to a 50% reduction in medium message latency in GPU ping-pong exchanges and a 28% speedup when strong scaling a halo-exchange benchmark to 8,192 GPUs on Frontier.
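The two headline figures are stated in different units (a latency reduction and a speedup), so it can help to convert them to a common footing. The sketch below is only arithmetic, under the assumption that "28% speedup" means baseline time divided by new time equals 1.28; the source does not spell out its definition, and the function names are illustrative.

```python
def speedup_to_time_reduction(speedup_pct):
    """Fraction of runtime saved, assuming speedup = baseline_time / new_time."""
    return 1.0 - 1.0 / (1.0 + speedup_pct / 100.0)


def latency_reduction_to_speedup(reduction_pct):
    """Speedup factor implied by cutting latency by `reduction_pct` percent."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Assumed reading: a 28% speedup saves roughly 22% of the runtime.
halo_time_saved = speedup_to_time_reduction(28)

# A 50% latency reduction corresponds to a 2x speedup on that exchange.
pingpong_speedup = latency_reduction_to_speedup(50)
```

Under that reading, the halo-exchange result corresponds to roughly a 22% shorter runtime, while the ping-pong result corresponds to a 2x improvement for the affected message sizes.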

Key Contribution

A new MPI-based API that removes the CPU from the GPU communication fast path, yielding up to a 50% reduction in medium message latency and a 28% speedup on a strong-scaled halo-exchange benchmark.

Abstract

Removing the CPU from the communication fast path is essential to the performance of GPU-based ML and HPC applications. However, existing GPU communication APIs either continue to rely on the CPU for communication or rely on APIs that place significant synchronization burdens on programmers. In this paper, we describe the design, implementation, and evaluation of an MPI-based GPU communication API enabling easy-to-use, high-performance, CPU-free communication. This API builds on previously proposed MPI extensions and leverages HPE Slingshot 11 network card capabilities. We demonstrate the utility and performance of the API by showing how it naturally enables CPU-free gather/scatter halo exchange communication primitives in the Cabana/Kokkos performance portability framework, and through a performance comparison with Cray MPICH on the Frontier and Tuolumne supercomputers. Results from this evaluation show up to a 50% reduction in medium message latency in simple GPU ping-pong exchanges and a 28% speedup when strong scaling a halo-exchange benchmark to 8,192 GPUs on the Frontier supercomputer.
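For readers unfamiliar with the gather/scatter halo exchange pattern mentioned in the abstract, here is a minimal sketch of the idea. It is not the paper's API, not Cabana's or Kokkos's interface, and involves no MPI or GPU: ranks are simulated as Python lists in one process, and every name (`gather`, `scatter`, `halo_exchange`, `make_fields`) is hypothetical. It only shows the three steps: pack boundary cells into a send buffer, deliver the buffer to a neighbor, and unpack it into that neighbor's ghost cells.

```python
# Illustrative only: ranks are simulated in one process; no MPI or GPU is involved.

def gather(field, indices):
    """Pack the cells at `indices` into a contiguous send buffer."""
    return [field[i] for i in indices]


def scatter(field, indices, buffer):
    """Unpack a received buffer into the cells at `indices`."""
    for i, value in zip(indices, buffer):
        field[i] = value


def halo_exchange(fields, halo=1):
    """Exchange `halo` boundary cells between neighbors on a periodic 1D ring.

    Each field is laid out as [left ghosts | owned cells | right ghosts].
    """
    nranks = len(fields)
    n = len(fields[0])
    left_owned = list(range(halo, 2 * halo))           # first owned cells
    right_owned = list(range(n - 2 * halo, n - halo))  # last owned cells
    left_ghost = list(range(0, halo))
    right_ghost = list(range(n - halo, n))

    # Gather phase: every rank packs its boundary cells before any ghost is overwritten.
    to_left = [gather(f, left_owned) for f in fields]
    to_right = [gather(f, right_owned) for f in fields]

    # Simulated communication plus scatter phase: each buffer lands in a neighbor's ghosts.
    for rank, f in enumerate(fields):
        left = (rank - 1) % nranks
        right = (rank + 1) % nranks
        scatter(f, left_ghost, to_right[left])   # left neighbor's right edge
        scatter(f, right_ghost, to_left[right])  # right neighbor's left edge
    return fields


def make_fields(nranks=3, owned=4, halo=1):
    """Build one field per rank; ghost cells start as -1 placeholders."""
    fields = []
    for rank in range(nranks):
        cells = [rank * 10 + i for i in range(owned)]
        fields.append([-1] * halo + cells + [-1] * halo)
    return fields


fields = halo_exchange(make_fields())
```

In a real GPU code the gather and scatter steps would be device kernels and the delivery step a network transfer; the abstract's point is that all three can proceed without the CPU on the fast path.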

Architecture Design (Transformers, SSMs, MoE); Distributed Systems & Hardware; Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
