The paper introduces HetCCL, a collective communication library designed to enable efficient large language model training across heterogeneous GPU clusters by unifying vendor-specific backends like NVIDIA NCCL and AMD RCCL. HetCCL facilitates RDMA-based communication between GPUs from different vendors without requiring driver modifications, addressing a critical gap in current deep learning frameworks. Experiments on a multi-vendor cluster demonstrate that HetCCL achieves performance comparable to NCCL and RCCL in homogeneous settings and uniquely scales in heterogeneous environments.
Unlock the full potential of your mixed NVIDIA/AMD GPU clusters: HetCCL enables seamless, high-performance LLM training across heterogeneous hardware without code modifications.
The rapid growth of large language models is driving organizations to expand their GPU clusters, often with GPUs from multiple vendors. However, current deep learning frameworks lack support for collective communication across heterogeneous GPUs, leading to inefficiency and higher costs. We present HetCCL, a collective communication library that unifies vendor-specific backends and enables RDMA-based communication across GPUs without requiring driver modifications. HetCCL introduces two novel mechanisms that enable cross-vendor communication while leveraging optimized vendor libraries, NVIDIA NCCL and AMD RCCL. Evaluations on a multi-vendor GPU cluster show that HetCCL matches NCCL and RCCL performance in homogeneous setups while uniquely scaling in heterogeneous environments, enabling practical, high-performance training with both NVIDIA and AMD GPUs without changes to existing deep learning applications.
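The core idea of unifying vendor-specific backends can be illustrated with a small dispatch sketch. All names here are hypothetical and do not reflect the actual HetCCL API or its two mechanisms; the sketch only shows the general pattern of routing each rank's collective to the vendor library that matches its GPU (NCCL for NVIDIA, RCCL for AMD) and then combining partial results across vendors over a shared transport.

```python
# Hypothetical illustration of vendor-dispatched collectives (not the HetCCL API).
# Intra-vendor reduction is delegated to a per-vendor backend; the cross-vendor
# combine stands in for the RDMA transport described in the paper.

from typing import Callable, Dict, List

# Stand-ins for vendor libraries; a real system would invoke NCCL or RCCL here.
def nccl_local_reduce(values: List[float]) -> float:
    return sum(values)  # assumed: NVIDIA ranks reduce among themselves via NCCL

def rccl_local_reduce(values: List[float]) -> float:
    return sum(values)  # assumed: AMD ranks reduce among themselves via RCCL

BACKENDS: Dict[str, Callable[[List[float]], float]] = {
    "nvidia": nccl_local_reduce,
    "amd": rccl_local_reduce,
}

def hetero_allreduce(contributions: Dict[str, List[float]]) -> float:
    """Two-level all-reduce sketch: reduce within each vendor group using its
    own backend, then combine the per-vendor partial sums across vendors."""
    partials = [BACKENDS[vendor](vals) for vendor, vals in contributions.items()]
    return sum(partials)  # cross-vendor combine over the shared transport

# Example: two NVIDIA ranks and one AMD rank contribute gradients.
total = hetero_allreduce({"nvidia": [1.0, 2.0], "amd": [3.0]})
print(total)  # 6.0
```

The two-level structure (intra-vendor, then cross-vendor) is one common way such systems amortize slower cross-vendor links, though the abstract does not specify that HetCCL works this way.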