The paper addresses the all-to-all GPU communication bottleneck in large-scale training clusters, which is caused by traffic skew and the two-tier structure of GPU systems. The authors propose a dynamic hierarchical Birkhoff-von Neumann (BvN) decomposition framework that first balances traffic within each server, then applies a BvN decomposition at the server level and refines it into GPU-level matchings. Combined with dynamic frame sizing (DFS), the resulting scheduler is provably stable under admissible Poisson arrivals, and in simulations it substantially reduces mean frame length, especially under server-localized hotspot traffic.
Hierarchical scheduling slashes communication overhead in multi-GPU servers by intelligently reshaping traffic and exploiting locality.
All-to-all GPU communication is a critical bottleneck in large-scale training clusters, where completion time is constrained by per-port bandwidth and can be severely impacted by traffic skew across GPUs and network interface cards (NICs). This issue is amplified by the two-tier structure of modern GPU systems, which combine fast intra-server links with much slower inter-server networks. Motivated by recent system observations that highlight the importance of traffic reshaping and hierarchy awareness, we study all-to-all scheduling from an online switching and queueing-theoretic perspective. We propose a dynamic hierarchical Birkhoff-von Neumann (BvN) decomposition framework tailored to two-tier GPU fabrics. At each frame boundary, traffic is first balanced within each server using simple local operations to mitigate micro-level GPU/NIC skew while preserving aggregate server-to-server demand. A hierarchical BvN decomposition is then applied at the server level and refined into GPU-level matchings, significantly reducing decomposition complexity relative to a flat GPU-level approach. By integrating this construction with the dynamic frame sizing (DFS) principle, we obtain an online scheduler with provable stability under admissible Poisson arrivals. Simulations demonstrate substantial reductions in mean frame length, particularly under server-localized hotspot traffic.
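The two-stage idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes the server-level demand matrix already has equal row and column sums (so that Birkhoff's theorem applies directly), it assumes "balancing" means spreading each server-to-server block evenly over its GPU pairs, and it uses cyclic shifts as one possible way to refine a server-level matching into GPU-level matchings. All function names (`server_demand`, `bvn_decompose`, `refine_to_gpu_matchings`) and the toy traffic values are hypothetical, and the DFS component is not modeled.

```python
from fractions import Fraction


def server_demand(gpu_demand, g):
    """Aggregate a GPU-level demand matrix into server-level totals (g GPUs per server)."""
    m = len(gpu_demand) // g
    return [[sum(gpu_demand[s * g + i][t * g + j] for i in range(g) for j in range(g))
             for t in range(m)] for s in range(m)]


def perfect_matching(support):
    """Augmenting-path bipartite matching; support[r] lists the columns row r may use.
    Returns a list mapping each row to its column, or None if no perfect matching exists."""
    n = len(support)
    row_of_col = [-1] * n

    def augment(r, seen):
        for c in support[r]:
            if c in seen:
                continue
            seen.add(c)
            if row_of_col[c] == -1 or augment(row_of_col[c], seen):
                row_of_col[c] = r
                return True
        return False

    for r in range(n):
        if not augment(r, set()):
            return None
    col_of_row = [0] * n
    for c, r in enumerate(row_of_col):
        col_of_row[r] = c
    return col_of_row


def bvn_decompose(matrix):
    """Write a nonnegative integer matrix with equal line sums as a list of
    (weight, permutation) terms, where permutation[r] is the column matched to row r."""
    n = len(matrix)
    rem = [list(row) for row in matrix]
    line = sum(rem[0])
    assert all(sum(row) == line for row in rem), "row sums must be equal"
    assert all(sum(rem[r][c] for r in range(n)) == line for c in range(n)), "column sums must be equal"
    terms = []
    while any(v > 0 for row in rem for v in row):
        support = [[c for c in range(n) if rem[r][c] > 0] for r in range(n)]
        perm = perfect_matching(support)  # exists by Birkhoff's theorem
        w = min(rem[r][perm[r]] for r in range(n))
        for r in range(n):
            rem[r][perm[r]] -= w  # zeroes at least one entry, so the loop terminates
        terms.append((w, perm))
    return terms


def refine_to_gpu_matchings(terms, g):
    """Assumed refinement: each server-level term (w, perm) becomes g GPU-level
    matchings (cyclic shifts), each of duration w / g**2, which serves the
    evenly balanced block of every matched server pair."""
    out = []
    for w, perm in terms:
        m = len(perm)
        for k in range(g):
            gpu_perm = [perm[u // g] * g + (u % g + k) % g for u in range(m * g)]
            out.append((Fraction(w, g * g), gpu_perm))
    return out


# Toy input: 3 servers, 2 GPUs each; all inter-server traffic sits on GPU 0 of each server.
g, m = 2, 3
blocks = [[0, 4, 8], [8, 0, 4], [4, 8, 0]]
gpu_demand = [[0] * (g * m) for _ in range(g * m)]
for s in range(m):
    for t in range(m):
        gpu_demand[s * g][t * g] = blocks[s][t]

S = server_demand(gpu_demand, g)
terms = bvn_decompose(S)
gpu_matchings = refine_to_gpu_matchings(terms, g)
frame_length = sum(d for d, _ in gpu_matchings)
print("server-level terms:", terms)
print("frame length after balancing:", frame_length)
```

On this toy input the busiest GPU in the unbalanced matrix has a row sum of 12, so no flat GPU-level schedule can finish in less than 12 time units, while the balanced and refined schedule has a total length of 6. The decomposition also runs on a 3 x 3 server matrix rather than a 6 x 6 GPU matrix, which is the complexity reduction the abstract refers to. The size of the gain here is a property of the chosen example, not a result from the paper.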