BUPT | ByteDance | Feb 24, 2025 | arXiv:2502.16982

Muon is Scalable for LLM Training

Jingyuan Liu, Jianling Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Meng Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, Zhilin Yang

AI Summary

This paper addresses the scalability limitations of the Muon optimizer for large language model (LLM) training by introducing weight decay and carefully adjusting the per-parameter update scale. The authors show that, with these two changes, Muon achieves approximately 2x the computational efficiency of AdamW in compute-optimal training. They validate the improved optimizer by training Moonlight, a 3B/16B-parameter Mixture-of-Expert (MoE) model trained on 5.7T tokens, which improves the Pareto frontier of performance versus training FLOPs. They also release their distributed Muon implementation and the model checkpoints.

Key Contribution

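With added weight decay and an adjusted per-parameter update scale, the Muon optimizer reaches roughly 2x the computational efficiency of AdamW in compute-optimal LLM training, as validated by Moonlight, a new 3B/16B-parameter MoE model.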

Abstract

Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but its scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need for hyper-parameter tuning. Scaling law experiments indicate that Muon achieves $\sim\!2\times$ computational efficiency compared to AdamW with compute-optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Expert (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with far fewer training FLOPs compared to prior models. We open-source our distributed Muon implementation, which is memory-optimal and communication-efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
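
The abstract names the two techniques but does not spell out the update rule, so the following is a minimal single-device NumPy sketch of one way they can fit together, not the authors' released distributed implementation. Several details are assumptions not stated on this page: the quintic Newton-Schulz coefficients (taken from the publicly available Muon reference code), the `0.2 * sqrt(max(A, B))` update scale (intended to bring the update RMS close to a typical AdamW update), plain rather than Nesterov momentum, and all default hyper-parameter values. The function names `newton_schulz_orthogonalize` and `muon_step` are hypothetical.

```python
import numpy as np


def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately map g to a semi-orthogonal matrix (roughly U V^T of its SVD)."""
    # Assumed coefficients: quintic iteration from the public Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    # Dividing by the Frobenius norm bounds every singular value by 1.
    x = g / (np.linalg.norm(g) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so x @ x.T is the small Gram matrix
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x


def muon_step(w, grad, momentum_buf, lr=0.02, mu=0.95,
              weight_decay=0.1, rms_scale=0.2):
    """One Muon-style update for a 2-D weight matrix w (all defaults are illustrative)."""
    # Heavy-ball momentum accumulation (assumption: no Nesterov correction).
    momentum_buf = mu * momentum_buf + grad
    direction = newton_schulz_orthogonalize(momentum_buf)
    # Technique (2): per-parameter update scale that depends on the matrix shape.
    scale = rms_scale * np.sqrt(max(w.shape))
    # Technique (1): decoupled weight decay, applied in the same way as in AdamW.
    w = w - lr * (scale * direction + weight_decay * w)
    return w, momentum_buf


# Usage on a toy 8x16 weight matrix with a random stand-in gradient.
rng = np.random.default_rng(0)
w0 = rng.standard_normal((8, 16))
g0 = rng.standard_normal((8, 16))
w1, buf1 = muon_step(w0, g0, np.zeros_like(w0))
```

Because a semi-orthogonal A x B matrix has RMS entry size `1 / sqrt(max(A, B))`, multiplying by `sqrt(max(A, B))` makes the update magnitude independent of the matrix shape, which is the point of the per-parameter scale adjustment. The Newton-Schulz iteration is only approximate, so singular values land near 1 rather than exactly on it.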

Distributed Systems & Hardware | Scaling Laws & Emergent Abilities | Training Efficiency & Optimization

Citation Metrics

Citations: 128

Influential citations: 24

References: 46

Year: 2025

Venue: arXiv.org
