May 26, 2026 | arXiv:2605.26842

MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

Jiacheng Li, Jianchao Tan, Hongtao Xu, Jiaqi Zhang, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai

AI Summary

This paper introduces MONA, a novel optimizer that integrates Nesterov-style acceleration into the Muon optimizer's orthogonalization framework to improve convergence in large language model training. By incorporating an acceleration term derived from the exponential moving average of gradient differences, MONA escapes sharp local minima while maintaining Muon's spectral-norm regularization. Experiments across Mixture-of-Experts models (1B-68B parameters) demonstrate that MONA outperforms AdamW and Muon in pretraining convergence and downstream task performance, achieving state-of-the-art results on MOE-68B-A3B after supervised fine-tuning.

Key Contribution

MONA adds Nesterov-style acceleration to the Muon optimizer, improving LLM pretraining convergence and downstream task performance over both Muon and AdamW.

Abstract

The Muon optimizer has recently offered a promising alternative to AdamW for large language model training, leveraging matrix orthogonalization to produce geometry-aware updates. However, like all first-order methods, Muon can become trapped in sharp local minima. In this work, we present MONA, an optimizer that bridges Muon's orthogonalization framework with curvature-aware acceleration. MONA adds an acceleration term directly into Muon's gradient processing pipeline. This term is calculated from the exponential moving average of gradient differences. We provide a detailed convergence analysis for MONA, showing that the acceleration term enables escape from sharp minima while preserving Muon's spectral-norm regularization. Empirically, MONA achieves better convergence and downstream task performance compared to both Muon and AdamW across three scales of Mixture-of-Experts pretraining, spanning from 1B to 68B parameters, with the largest model trained on 1 trillion tokens. Furthermore, we conduct supervised fine-tuning on the MOE-68B-A3B model and evaluate it on general capability, mathematical reasoning, and code generation benchmarks, where MONA achieves SOTA performance.
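The abstract describes the mechanism only at a high level, so the following is a minimal NumPy sketch of one plausible reading, not the authors' algorithm. It assumes the acceleration term is an exponential moving average of consecutive gradient differences that is added to the momentum buffer before orthogonalization. The function names, the state layout, and the hyperparameters `beta1`, `beta2`, and `alpha` are hypothetical. The Newton-Schulz coefficients are the ones commonly used in public Muon implementations, not values taken from this paper.

```python
import numpy as np


def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration.

    The coefficients follow common public Muon implementations (an assumption
    here); they push singular values toward a band around 1 rather than
    exactly to 1.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X


def mona_like_step(W, grad, state, lr=0.02, beta1=0.95, beta2=0.9, alpha=1.0):
    """One hypothetical MONA-style update for a 2D weight matrix W.

    Assumed form (not confirmed by the source):
      m <- beta1 * m + (1 - beta1) * grad                # momentum
      d <- beta2 * d + (1 - beta2) * (grad - prev_grad)  # EMA of gradient differences
      W <- W - lr * NewtonSchulz(m + alpha * d)
    """
    prev_grad = state.get("prev_grad", grad)  # first step: difference is zero
    m = beta1 * state.get("m", np.zeros_like(grad)) + (1.0 - beta1) * grad
    d = beta2 * state.get("d", np.zeros_like(grad)) + (1.0 - beta2) * (grad - prev_grad)
    state["m"], state["d"], state["prev_grad"] = m, d, grad.copy()
    update = newton_schulz(m + alpha * d)
    return W - lr * update


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((6, 4))
    state = {}
    for _ in range(3):
        fake_grad = rng.standard_normal(W.shape)  # stand-in for a real gradient
        W = mona_like_step(W, fake_grad, state)
    print(W.shape)
```

In this sketch the difference term is added before orthogonalization, so the applied update still has singular values near 1. That is one way to read the abstract's claim that the acceleration preserves Muon's spectral-norm regularization.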

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
