This paper introduces an MPI-based GPU communication API designed for CPU-free operation, aiming to improve the performance of GPU-accelerated ML and HPC applications. The API leverages HPE Slingshot 11 network card capabilities and builds on previously proposed MPI extensions to remove the CPU from the communication fast path. Evaluation against Cray MPICH on Frontier and Tuolumne shows up to a 50% reduction in medium-message latency in simple GPU ping-pong exchanges and a 28% speedup when strong scaling a halo-exchange benchmark to 8,192 GPUs on Frontier.
Cut the CPU out of the GPU communication loop with a new MPI-based API: up to 50% lower medium-message latency and a 28% speedup on a halo-exchange benchmark at 8,192 GPUs.
Removing the CPU from the communication fast path is essential to efficient GPU-based ML and HPC application performance. However, existing GPU communication APIs either continue to rely on the CPU for communication or place significant synchronization burdens on programmers. In this paper, we describe the design, implementation, and evaluation of an MPI-based GPU communication API that enables easy-to-use, high-performance, CPU-free communication. This API builds on previously proposed MPI extensions and leverages HPE Slingshot 11 network card capabilities. We demonstrate the utility and performance of the API by showing how it naturally enables CPU-free gather/scatter halo-exchange communication primitives in the Cabana/Kokkos performance portability framework, and through a performance comparison with Cray MPICH on the Frontier and Tuolumne supercomputers. Results from this evaluation show up to a 50% reduction in medium-message latency in simple GPU ping-pong exchanges and a 28% speedup when strong scaling a halo-exchange benchmark to 8,192 GPUs of the Frontier supercomputer.
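For readers unfamiliar with the gather/scatter halo-exchange pattern named in the abstract, the following is a minimal sketch of the data movement it implies on a periodic one-dimensional decomposition. It is illustrative only: it simulates all ranks inside a single Python process, performs no MPI or GPU communication, and does not reproduce the paper's API or Cabana's. The names `gather`, `scatter`, `halo_exchange`, and `make_subdomains` are hypothetical.

```python
# Illustrative sketch only: ranks are simulated in one process.
# This is NOT the paper's API and NOT Cabana's; all names are hypothetical.


def gather(local, indices):
    """Pack the owned cells at `indices` into a contiguous send buffer."""
    return [local[i] for i in indices]


def scatter(local, indices, buffer):
    """Unpack a received buffer into the ghost cells at `indices`."""
    for i, value in zip(indices, buffer):
        local[i] = value


def make_subdomains(num_ranks, cells_per_rank, halo=1):
    """Build one list per rank, laid out as [left ghosts | owned | right ghosts].

    Owned cells hold their global index; ghost cells start out as None.
    """
    subdomains = []
    for rank in range(num_ranks):
        owned = [rank * cells_per_rank + i for i in range(cells_per_rank)]
        subdomains.append([None] * halo + owned + [None] * halo)
    return subdomains


def halo_exchange(subdomains, halo=1):
    """Fill every rank's ghost cells from its neighbours on a periodic ring.

    Assumes each rank owns at least `halo` cells.
    """
    n = len(subdomains)

    # Phase 1: each rank gathers its boundary cells into send buffers.
    send_left, send_right = [], []
    for local in subdomains:
        size = len(local)
        send_left.append(gather(local, range(halo, 2 * halo)))
        send_right.append(gather(local, range(size - 2 * halo, size - halo)))

    # Phase 2: stand-in for communication, then scatter into ghost cells.
    for rank, local in enumerate(subdomains):
        size = len(local)
        left, right = (rank - 1) % n, (rank + 1) % n
        # Left ghosts receive the left neighbour's rightmost owned cells.
        scatter(local, range(0, halo), send_right[left])
        # Right ghosts receive the right neighbour's leftmost owned cells.
        scatter(local, range(size - halo, size), send_left[right])
    return subdomains
```

With three simulated ranks of four cells each, `halo_exchange(make_subdomains(3, 4))` leaves rank 0 holding `[11, 0, 1, 2, 3, 4]`: its left ghost comes from rank 2 and its right ghost from rank 1. In the setting the abstract describes, the pack, transfer, and unpack steps would run on GPUs without CPU involvement in the fast path; this sketch shows only the indexing pattern.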