This paper introduces HATP, a Heterogeneous-Aware Time Predictor, to accurately estimate the training time of Megatron-LM on heterogeneous GPU clusters by modeling communication and computation complexities. The authors experimentally analyze and quantify the communication frequency patterns within Megatron-LM's parallel strategies to account for communication differences among heterogeneous GPUs. HATP achieves an average prediction accuracy of 97.41% in homogeneous environments and 96.04% in heterogeneous parallel configurations, outperforming existing methods.
Training trillion-parameter models on heterogeneous GPU clusters just got easier: HATP accurately predicts Megatron-LM training times, enabling faster optimization of parallel strategies.
As model parameters increase exponentially, distributed training has become essential for advancing modern deep neural networks. Megatron‐LM, an efficient distributed training framework developed by NVIDIA, enables the training of trillion‐parameter models on thousands of GPUs by integrating tensor, pipeline, and data parallelism. Its computational efficiency has established it as a foundational tool for training large‐scale models. Rapid identification of optimal parallel configurations for a specific GPU cluster is critical for maximizing computational resource utilization, with training time prediction serving as a key evaluation metric. The high cost and limited availability of high‐performance GPUs, particularly those based on NVIDIA architectures, have made the construction of large‐scale heterogeneous clusters a practical response to resource and cost constraints. However, existing prediction methods do not reliably or efficiently account for the computational and communication complexities inherent in heterogeneous GPU clusters. To address this gap, HATP (Heterogeneous‐Aware Time Predictor) is introduced as a novel performance prediction method specifically designed for heterogeneous GPU clusters. For any given parallel configuration, HATP rapidly and accurately simulates execution times to inform the optimization of parallel strategies. To capture communication differences among heterogeneous GPUs, comprehensive experimental analyses are conducted and analytical expressions are derived to characterize the communication frequency patterns in Megatron‐LM's parallel strategies. This work presents the first systematic quantification of communication operations within the Megatron‐LM framework, ensuring that performance predictions remain highly accurate even in complex, heterogeneous environments.
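The general idea of combining per-device compute estimates with analytical communication terms can be sketched as follows. This is a minimal illustration of the approach, not HATP's actual model: the `GPU` class, the cost formulas, and all numeric values are assumptions chosen for the example.

```python
# Minimal sketch of an analytical iteration-time model for a
# heterogeneous data-parallel group. All names and formulas here are
# illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class GPU:
    name: str
    tflops: float      # effective compute throughput (TFLOP/s)
    link_gbps: float   # interconnect bandwidth (GB/s)

def ring_allreduce_time(volume_gb: float, gpus: list) -> float:
    """Classic ring all-reduce cost: 2*(p-1)/p * V / B.

    In a heterogeneous ring, the slowest link bounds throughput,
    so we use the minimum bandwidth across participants.
    """
    p = len(gpus)
    b_min = min(g.link_gbps for g in gpus)
    return 2 * (p - 1) / p * volume_gb / b_min

def iteration_time(flops_per_gpu: float, grad_volume_gb: float,
                   gpus: list) -> float:
    """Data-parallel iteration: the straggler GPU bounds compute,
    then gradients are synchronized with one all-reduce."""
    compute = max(flops_per_gpu / (g.tflops * 1e12) for g in gpus)
    return compute + ring_allreduce_time(grad_volume_gb, gpus)

# Hypothetical two-GPU heterogeneous cluster.
cluster = [GPU("A100", 150.0, 300.0), GPU("V100", 60.0, 150.0)]
t = iteration_time(flops_per_gpu=6e13, grad_volume_gb=2.8, gpus=cluster)
```

Even this toy model shows why heterogeneity matters: the slower GPU dominates the compute term, and the slowest link dominates the communication term, so a predictor calibrated on homogeneous hardware will systematically underestimate both.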
Furthermore, to account for computational differences among heterogeneous GPUs, a layer‐level computational performance acquisition scheme is proposed to reduce the impact of fine‐grained operator overlap and additional memory operations. Experimental results demonstrate that HATP achieves an average prediction accuracy of 97.41% in homogeneous environments, surpassing the current state‐of‐the‐art method, ACEso. HATP also attains an average accuracy of 96.04% in heterogeneous data‐parallel and pipeline‐parallel configurations, representing the first extension of training time prediction capabilities to heterogeneous environments.
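The intuition behind layer-level performance acquisition can be illustrated with a simple timing harness: measure each layer in isolation with warmup and repetition so that cold-start effects and scheduling jitter do not contaminate the estimate. This is a hypothetical stand-in for the paper's acquisition scheme, and `time_layer` and `toy_layer` are names invented for this sketch.

```python
# Hypothetical layer-level timing harness: time one layer's call in
# isolation, which sidesteps the fine-grained operator overlap that
# makes whole-model traces hard to attribute. Not the paper's code.
import time
from statistics import median

def time_layer(layer_fn, x, warmup: int = 3, reps: int = 10) -> float:
    """Median wall-clock time (seconds) of a single call to layer_fn(x)."""
    for _ in range(warmup):          # discard cold-start iterations
        layer_fn(x)
    samples = []
    for _ in range(reps):
        t0 = time.perf_counter()
        layer_fn(x)
        samples.append(time.perf_counter() - t0)
    return median(samples)           # median is robust to outlier runs

# Toy "layer": a fixed amount of scalar arithmetic standing in for a
# transformer layer's forward pass.
def toy_layer(x):
    return sum(i * x for i in range(10_000))

per_layer = time_layer(toy_layer, 2.0)
# With identical layers, a per-layer measurement extrapolates to the
# full model (e.g. a hypothetical 24-layer stack) by multiplication.
total_estimate = per_layer * 24
```

Measuring at layer granularity and extrapolating is only valid when layers are near-identical, which holds for the repeated transformer blocks that dominate Megatron-LM-style models; embedding and output layers would need separate measurements.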