This paper introduces FoMoE, a system that partitions expert layers across workers so that distributed training of Large Language Models (LLMs) no longer requires a full model replica at every site. Partial expert replication reduces communication costs by up to 1.42x over efficient baselines, and a skip-token mechanism improves empirical throughput by up to 1.4x. Together, these make it more practical to train large Mixture-of-Experts models across geographically distributed, weakly connected data centers.
FoMoE removes the full-replica requirement, enabling LLM training across weakly connected data centers with lower communication and memory costs.
Pre-training Large Language Models (LLMs) typically demands large-scale infrastructure with tightly coupled hardware accelerators. While increasing model and dataset scale remains the dominant driver of performance, Mixture-of-Experts (MoE) architectures have recently achieved state-of-the-art results by decoupling parameter count from computational cost. This efficiency enables training massive models on constrained compute budgets, yet it typically requires the high-speed interconnects of a single data center. To overcome these physical limits, recent approaches such as DiLoCo and Photon use low-communication data-parallel methods to enable scaling across geographically distributed, weakly connected data centers. However, these methods suffer from a fundamental inefficiency: they require full model replicas at every site, which imposes prohibitive memory requirements and communication overheads. In this work, we introduce FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers. We demonstrate that FoMoE: (I) reduces communication costs by up to 1.42x over efficient baselines and 45.44x over DDP via partial expert replication in the studied regimes; (II) achieves empirical throughput speedups of up to 1.4x through a novel skip-token mechanism; and (III) shows stable routing in the trained proxy regimes and projects the communication and memory benefits to 100B-scale configurations through system modelling.
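To make the full-replica argument concrete, the following is a minimal back-of-envelope sketch of how many parameters each worker would hold (and therefore synchronise) when experts are fully replicated versus only partially replicated. It is not FoMoE's cost model: the function name, the uniform-placement assumption, and all parameter counts are hypothetical and chosen only for illustration.

```python
def params_per_worker(shared_params: int, params_per_expert: int,
                      num_experts: int, num_workers: int,
                      replication: int) -> int:
    """Count parameters held by one worker (illustrative model only).

    Assumes every expert is stored on exactly `replication` workers and
    that experts are spread evenly. `replication == num_workers`
    corresponds to a full model replica on every worker.
    """
    if not 1 <= replication <= num_workers:
        raise ValueError("replication must be between 1 and num_workers")
    total_copies = num_experts * replication
    if total_copies % num_workers != 0:
        raise ValueError("expert copies must divide evenly across workers")
    experts_on_worker = total_copies // num_workers
    return shared_params + experts_on_worker * params_per_expert


# Hypothetical configuration (not taken from the paper).
SHARED = 1_000_000_000      # non-expert parameters, replicated everywhere
PER_EXPERT = 100_000_000    # parameters in one expert
EXPERTS, WORKERS = 64, 8

full = params_per_worker(SHARED, PER_EXPERT, EXPERTS, WORKERS, replication=WORKERS)
partial = params_per_worker(SHARED, PER_EXPERT, EXPERTS, WORKERS, replication=2)
print(f"full replica: {full:,} params per worker")
print(f"partial (r=2): {partial:,} params per worker")
```

Under these assumed numbers the expert parameters dominate, so lowering the replication factor shrinks the per-worker footprint substantially. The ratios reported in the abstract depend on the actual configurations and baselines studied and cannot be derived from this sketch.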