The paper introduces HetCCL, a collective communication library designed to enable efficient large language model training across heterogeneous GPU clusters by unifying vendor-specific backends like NVIDIA NCCL and AMD RCCL. HetCCL facilitates RDMA-based communication between GPUs from different vendors without requiring driver modifications, addressing a critical gap in current deep learning frameworks. Experiments on a multi-vendor cluster demonstrate that HetCCL achieves performance comparable to NCCL and RCCL in homogeneous settings and uniquely scales in heterogeneous environments.
Unlock the full potential of your mixed NVIDIA/AMD GPU clusters: HetCCL enables seamless, high-performance LLM training across heterogeneous hardware without code modifications.
The rapid growth of large language models is driving organizations to expand their GPU clusters, often with GPUs from multiple vendors. However, current deep learning frameworks lack support for collective communication across heterogeneous GPUs, leading to inefficiency and higher costs. We present HetCCL, a collective communication library that unifies vendor-specific backends and enables RDMA-based communication across GPUs without requiring driver modifications. HetCCL introduces two novel mechanisms that enable cross-vendor communication while leveraging optimized vendor libraries, NVIDIA NCCL and AMD RCCL. Evaluations on a multi-vendor GPU cluster show that HetCCL matches NCCL and RCCL performance in homogeneous setups while uniquely scaling in heterogeneous environments, enabling practical, high-performance training with both NVIDIA and AMD GPUs without changes to existing deep learning applications.
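The core idea of unifying vendor-specific backends can be illustrated with a small dispatch sketch. All names here are hypothetical and do not reflect the actual HetCCL API or its two mechanisms; the sketch only shows the general pattern of routing each rank's collective to the vendor library that matches its GPU (NCCL for NVIDIA, RCCL for AMD) and then combining partial results across vendors over a shared transport.

```python
# Hypothetical illustration of vendor-dispatched collectives (not the HetCCL API).
# Intra-vendor reduction is delegated to a per-vendor backend; the cross-vendor
# combine stands in for the RDMA transport described in the paper.

from typing import Callable, Dict, List

# Stand-ins for vendor libraries; a real system would invoke NCCL or RCCL here.
def nccl_local_reduce(values: List[float]) -> float:
    return sum(values)  # assumed: NVIDIA ranks reduce among themselves via NCCL

def rccl_local_reduce(values: List[float]) -> float:
    return sum(values)  # assumed: AMD ranks reduce among themselves via RCCL

BACKENDS: Dict[str, Callable[[List[float]], float]] = {
    "nvidia": nccl_local_reduce,
    "amd": rccl_local_reduce,
}

def hetero_allreduce(contributions: Dict[str, List[float]]) -> float:
    """Two-level all-reduce sketch: reduce within each vendor group using its
    own backend, then combine the per-vendor partial sums across vendors."""
    partials = [BACKENDS[vendor](vals) for vendor, vals in contributions.items()]
    return sum(partials)  # cross-vendor combine over the shared transport

# Example: two NVIDIA ranks and one AMD rank contribute gradients.
total = hetero_allreduce({"nvidia": [1.0, 2.0], "amd": [3.0]})
print(total)  # 6.0
```

The two-level structure (intra-vendor, then cross-vendor) is one common way such systems amortize slower cross-vendor links, though the abstract does not specify that HetCCL works this way.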