Jun 10, 2026

arXiv:2606.11867

Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

Yuming Zhou, Haoyang Li, Sheng Lin, Yanfeng Zhao, Xupeng Miao, Jie Jiang, Fangcheng Fu, Bin Cui

AI Summary

This paper introduces ForeMoE, a micro-step-level load balancing system that addresses expert load imbalance in reinforcement learning (RL) post-training of Mixture-of-Experts (MoE) large language models. ForeMoE takes the routing information that is already foreseeable from the rollout stage of the RL pipeline and uses it to balance load proactively in the subsequent recompute and policy-update stages, where small micro-step batch sizes cause high-frequency load fluctuations. In evaluations on 64 GPUs, ForeMoE achieves up to a 1.45× speedup over existing state-of-the-art RL post-training systems.

Key Contribution

ForeMoE achieves up to a 1.45× speedup in RL post-training by using routing information from the rollout stage to rebalance expert load at every micro-step of the later stages, rather than reacting to historical step-level statistics.

Abstract

Mixture-of-Experts (MoE) and reinforcement learning (RL) post-training now dominate large language model (LLM) development, yet expert load imbalance remains a critical challenge. Existing load-balancing systems target pre-training by relying on historical step-level statistics. However, these methods fail under the unique workload dynamics of RL post-training: the step-level load is stable, but the tiny batch sizes processed during micro-steps cause severe, high-frequency load fluctuations. We introduce ForeMoE, a micro-step-level load balancing system for MoE RL post-training. Instead of relying on historical statistics, ForeMoE exploits the multi-stage RL pipeline (rollout, recompute, policy update) by using foreseeable routing information from the rollout stage to proactively guide load balancing in the remaining stages. To support frequent per-micro-step reconfiguration, ForeMoE employs a hierarchical planner that decomposes the NP-hard load balancing problem into tractable sub-components, alongside a transfer engine that leverages complementary hardware paths (CPU-assisted and GPU-direct) for overlapped expert transfer. Evaluations on 64 GPUs demonstrate that ForeMoE achieves up to a 1.45× speedup over state-of-the-art RL post-training systems.
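
The contrast between stable step-level load and fluctuating micro-step load can be illustrated with a toy sketch. This is not ForeMoE's planner or API: the function names, the two hand-written routing traces, the fixed number of expert slots per GPU, and the greedy heaviest-first heuristic are all assumptions made for illustration. It only shows why a placement that looks balanced over a whole step can be skewed within each micro-step, and how routing recorded ahead of time lets a placement be chosen per micro-step.

```python
from collections import Counter


def expert_loads(routing, num_experts):
    """Count how many tokens are routed to each expert."""
    counts = Counter(routing)
    return [counts.get(e, 0) for e in range(num_experts)]


def gpu_loads(loads, placement, num_gpus):
    """Sum expert loads per GPU for a given expert -> GPU placement."""
    totals = [0] * num_gpus
    for expert, gpu in placement.items():
        totals[gpu] += loads[expert]
    return totals


def imbalance(per_gpu):
    """Max load over mean load; 1.0 means perfectly balanced."""
    mean = sum(per_gpu) / len(per_gpu)
    return max(per_gpu) / mean if mean > 0 else 1.0


def greedy_placement(loads, num_gpus):
    """Heaviest-first greedy heuristic (an illustrative stand-in, not the
    paper's hierarchical planner). Assumes every GPU holds the same number
    of experts and that num_experts is divisible by num_gpus."""
    slots = len(loads) // num_gpus
    totals = [0] * num_gpus
    used = [0] * num_gpus
    placement = {}
    for expert in sorted(range(len(loads)), key=lambda e: -loads[e]):
        free = [g for g in range(num_gpus) if used[g] < slots]
        gpu = min(free, key=lambda g: totals[g])  # least-loaded GPU with a free slot
        placement[expert] = gpu
        totals[gpu] += loads[expert]
        used[gpu] += 1
    return placement


NUM_EXPERTS, NUM_GPUS = 4, 2

# Hypothetical routing traces, as if recorded during rollout:
# one expert id per token, one list per micro-step.
micro_steps = [
    [0, 0, 0, 0, 1, 1, 1, 1, 2, 3],  # experts 0 and 1 are hot
    [0, 1, 2, 2, 2, 2, 3, 3, 3, 3],  # experts 2 and 3 are hot
]

static = {0: 0, 1: 0, 2: 1, 3: 1}  # fixed placement: experts 0,1 | 2,3

# Aggregated over the whole step, the static placement looks balanced.
step_loads = expert_loads([e for ms in micro_steps for e in ms], NUM_EXPERTS)
step_imbalance = imbalance(gpu_loads(step_loads, static, NUM_GPUS))

# Per micro-step, the same placement is skewed, while a placement chosen
# from the foreseen routing of that micro-step is not.
static_imbalance = []
foresight_imbalance = []
for routing in micro_steps:
    loads = expert_loads(routing, NUM_EXPERTS)
    static_imbalance.append(imbalance(gpu_loads(loads, static, NUM_GPUS)))
    planned = greedy_placement(loads, NUM_GPUS)
    foresight_imbalance.append(imbalance(gpu_loads(loads, planned, NUM_GPUS)))

print("step-level imbalance (static):", step_imbalance)
print("micro-step imbalance (static):", static_imbalance)
print("micro-step imbalance (foresight):", foresight_imbalance)
```

The sketch ignores the cost of moving expert weights between GPUs at every micro-step, which is the part the abstract assigns to the transfer engine and its overlapped CPU-assisted and GPU-direct paths.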

Architecture Design (Transformers, SSMs, MoE)

Distributed Systems & Hardware

Training Efficiency & Optimization

Year: 2026

Venue: N/A
