Feb 25, 2026 · arXiv:2602.21545

Muon+: Towards Better Muon via One Additional Normalization Step

Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Zheng Zhang

AI Summary

The paper introduces Muon+, an enhanced version of the Muon optimizer for pre-training large language models, which adds a normalization step after gradient orthogonalization. This modification consistently improves training and validation perplexity compared to the original Muon optimizer across various GPT-style and LLaMA-style models. The effectiveness of Muon+ is demonstrated through pre-training experiments with model sizes ranging from 60M to 1B parameters and token-to-parameter ratios up to approximately 200.
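To make the token-to-parameter (T2P) ratio concrete, the short sketch below converts a ratio into a training-token budget. The helper name `token_budget` is hypothetical, and the ratio of 20 used for the compute-optimal regime is the commonly cited rule of thumb, assumed here rather than taken from the paper. The 1B-parameter pairing with a ratio of 200 is illustrative and not necessarily a configuration the paper ran.

```python
def token_budget(num_params: int, t2p_ratio: float) -> int:
    """Training tokens implied by a token-to-parameter (T2P) ratio."""
    return round(num_params * t2p_ratio)


# 1B parameters is the largest model size mentioned in the summary.
params = 1_000_000_000

# Assumed compute-optimal rule of thumb: about 20 tokens per parameter.
print(token_budget(params, 20))   # 20000000000

# The "industrial" ratio of about 200 tokens per parameter.
print(token_budget(params, 200))  # 200000000000
```

Under these assumptions, a T2P ratio of 200 corresponds to roughly ten times the token budget of the compute-optimal setting for the same model size.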

Key Contribution

A single normalization step turns Muon into Muon+, delivering consistent perplexity improvements in LLM pre-training.

Abstract

The Muon optimizer has demonstrated promising performance in pre-training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures. Our evaluation includes GPT-style models ranging from 130M to 774M parameters and LLaMA-style models ranging from 60M to 1B parameters. We comprehensively evaluate the effectiveness of Muon+ in the compute-optimal training regime and further extend the token-to-parameter (T2P) ratio to an industrial level of $\approx 200$. Experimental results show that Muon+ provides a consistent boost on training and validation perplexity over Muon. We provide our code here: https://github.com/K1seki221/MuonPlus.
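The idea of orthogonalizing and then normalizing can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation (see their repository for that). The abstract does not say which normalization Muon+ applies, so row-wise L2 normalization is assumed purely for illustration. The Newton-Schulz coefficients are those commonly used in public Muon implementations and should also be treated as an assumption here. The function names `newton_schulz_orthogonalize`, `row_normalize`, and `muon_plus_update` are hypothetical.

```python
import numpy as np


def newton_schulz_orthogonalize(m, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix with a quintic Newton-Schulz iteration."""
    # Coefficients assumed from commonly used public Muon code.
    a, b, c = 3.4445, -4.7750, 2.0315
    # Scale by the Frobenius norm so that all singular values are at most 1.
    x = m / (np.linalg.norm(m) + eps)
    # Work on the wide orientation so the Gram matrix x @ x.T is the smaller one.
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x


def row_normalize(o, eps=1e-7):
    """ASSUMPTION: the extra normalization is modeled as row-wise L2 normalization."""
    return o / (np.linalg.norm(o, axis=1, keepdims=True) + eps)


def muon_plus_update(momentum, steps=5):
    """Orthogonalize the momentum (or gradient), then apply one extra normalization."""
    return row_normalize(newton_schulz_orthogonalize(momentum, steps=steps))


rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 8))  # stand-in for a weight matrix's momentum
update = muon_plus_update(grad)
print(np.linalg.norm(update, axis=1))  # each row norm is close to 1 by construction
```

In a full optimizer, the resulting matrix would be multiplied by the learning rate and subtracted from the weights. Swapping `row_normalize` for a different normalization changes only one line, which makes it easy to compare variants.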

Architecture Design (Transformers, SSMs, MoE) · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
