HIT, HKUST | Apr 30, 2026 | arXiv:2604.27844

ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training

Wenxiang Lin, Xinglin Pan, Ruibo Fan, Shaohuai Shi, Xiaowen Chu

AI Summary

ZipCCL introduces a lossless compression library for communication collectives in LLM training, targeting the communication bottleneck in distributed training. It leverages exponent coding optimized for the near-Gaussian distribution of LLM tensors, GPU-optimized compression/decompression kernels, and adaptive communication strategies. Experiments on a 64-GPU cluster demonstrate up to 1.35x reduction in communication time and 1.18x end-to-end training speedup for both MoE and dense transformer models, without impacting model quality.

Key Contribution

LLM training bottlenecks? ZipCCL cuts communication time by up to 1.35x with lossless compression, showing that well-optimized compression can actually *speed up* distributed training.

Abstract

Communication has emerged as a critical bottleneck in the distributed training of large language models (LLMs). While numerous approaches have been proposed to reduce communication overhead, the potential of lossless compression has remained largely underexplored, since compression and decompression typically cost more than the reduced communication traffic saves. We observe that the communication data during training, including activations, gradients, and parameters, often follows a near-Gaussian distribution, which is a key feature for data compression. Thus, we introduce ZipCCL, a lossless compressed collective communication library for LLM training. ZipCCL is equipped with three novel techniques: (1) theoretically grounded exponent coding that exploits the Gaussian distribution of LLM tensors to accelerate compression without expensive online statistics, (2) GPU-optimized compression and decompression kernels with carefully designed memory access patterns and pipelining based on a communication-aware data layout, and (3) adaptive communication strategies that dynamically switch collective operations based on workload patterns and system characteristics. Evaluated on a 64-GPU cluster using both mixture-of-experts and dense transformer models, ZipCCL reduces communication time by up to 1.35$\times$ and achieves end-to-end training speedups of up to 1.18$\times$ without any impact on model quality.
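To make the near-Gaussian observation concrete, the following minimal sketch measures how compressible the exponent bits of a Gaussian tensor are. It is not ZipCCL's coding scheme or kernel code. It assumes a bfloat16 layout (1 sign bit, 8 exponent bits, 7 mantissa bits), obtained here by truncating float32, and an arbitrary standard deviation of 0.02. The function name `bf16_exponent_entropy` is hypothetical.

```python
import numpy as np


def bf16_exponent_entropy(x: np.ndarray) -> float:
    """Shannon entropy, in bits, of the 8-bit exponent field of x viewed as bfloat16."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    bf16 = bits >> 16                      # keep the top 16 bits (truncation, no rounding)
    exponent = ((bf16 >> 7) & 0xFF).astype(np.int64)
    counts = np.bincount(exponent, minlength=256)
    p = counts[counts > 0] / exponent.size
    return float(-(p * np.log2(p)).sum())


rng = np.random.default_rng(0)
# Assumption: synthetic values standing in for a near-Gaussian training tensor.
tensor = rng.normal(loc=0.0, scale=0.02, size=1_000_000)

h = bf16_exponent_entropy(tensor)
# Idealized bound: sign and mantissa stored raw, exponent coded at its entropy.
ideal_ratio = 16.0 / (1 + 7 + h)
print(f"exponent entropy: {h:.2f} of 8 bits, ideal ratio: {ideal_ratio:.2f}x")
```

Because a Gaussian concentrates its magnitudes in a few binades, the exponent field uses far fewer than its 8 bits of capacity, while the sign and mantissa bits are close to incompressible. The ratio printed here is an entropy bound for this synthetic input only and ignores all coding and kernel overhead.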

Distributed Systems & Hardware | Inference & Quantization | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
