The paper introduces CCL-D, a system for diagnosing slow/hang anomalies in large-scale distributed training using collective communication libraries (CCL). It employs a rank-level real-time probe to measure cross-layer anomaly metrics via lightweight distributed tracing, coupled with an intelligent decision analyzer for automated anomaly detection and root-cause localization. Deployed on a 4,000-GPU cluster, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, significantly improving upon existing methods.
Cut your debugging time: CCL-D slashes the diagnosis time for slow/hang anomalies in large-scale distributed training from days to just 6 minutes.
As training scales grow, collective communication libraries (CCL) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause location, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.
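To make the idea of rank-level localization concrete, here is a minimal sketch of one common heuristic. It is not CCL-D's algorithm, which the abstract does not specify. It assumes each rank records the time at which it entered a given collective call: a rank that never entered is treated as hung, and a rank that entered much later than the median is treated as slow. The function name `find_suspect_ranks`, the `enter_times` input layout, and the `slow_threshold_s` parameter are all hypothetical.

```python
from statistics import median
from typing import Dict, List, Optional


def find_suspect_ranks(
    enter_times: Dict[int, Optional[float]],
    slow_threshold_s: float = 1.0,
) -> Dict[str, List[int]]:
    """Classify ranks for a single collective call (illustrative heuristic only).

    enter_times maps rank -> timestamp in seconds at which the rank entered
    the collective, or None if the rank never entered it.
    """
    # Ranks with no recorded entry are hang suspects.
    hung = sorted(r for r, t in enter_times.items() if t is None)

    # Among ranks that did arrive, flag those far behind the median arrival.
    arrived = {r: t for r, t in enter_times.items() if t is not None}
    slow: List[int] = []
    if arrived:
        baseline = median(arrived.values())
        slow = sorted(
            r for r, t in arrived.items() if t - baseline > slow_threshold_s
        )

    return {"hung": hung, "slow": slow}


if __name__ == "__main__":
    # Hypothetical trace: rank 2 arrives late, rank 3 never arrives.
    trace = {0: 10.0, 1: 10.1, 2: 14.5, 3: None}
    print(find_suspect_ranks(trace))
```

In a collective, every rank waits for the last participant, so looking at when each rank entered the call (rather than how long it spent inside) is what distinguishes the straggler from the ranks it is blocking. A production system would need to combine this with signals from other layers, as the abstract's "cross-layer anomaly metrics" suggests.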