Mar 16, 2026 | arXiv:2603.15059

Muon Converges under Heavy-Tailed Noise: Nonconvex Hölder-Smooth Empirical Risk Minimization

AI Summary

This paper analyzes the convergence of the Muon optimizer, which enforces orthogonality in parameter updates, under heavy-tailed stochastic noise in nonconvex Hölder-smooth empirical risk minimization. The authors prove that Muon converges to a stationary point even with heavy-tailed noise, addressing a limitation of prior analyses that assumed bounded variance. Furthermore, they demonstrate that Muon converges faster than mini-batch SGD in this setting.
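For orientation, the two conditions named in the summary are often formalized as below. This page does not reproduce the paper's exact assumptions, so the symbols \nu, L, p, \sigma, and G_\xi and the choice of norm are assumptions made for illustration, and the paper's own conditions may differ.

```latex
% Assumed standard definitions; the paper's exact conditions may differ.
% Hölder smoothness of the empirical risk f, with exponent \nu \in (0, 1] and constant L > 0:
\[
  \|\nabla f(X) - \nabla f(Y)\| \le L \, \|X - Y\|^{\nu}
  \quad \text{for all } X, Y .
\]
% Heavy-tailed noise: the stochastic gradient G_\xi has a bounded
% p-th central moment for some p \in (1, 2]:
\[
  \mathbb{E}_{\xi}\!\left[ \|G_\xi(X) - \nabla f(X)\|^{p} \right] \le \sigma^{p} .
\]
```

With \nu = 1 the first condition reduces to ordinary Lipschitz smoothness of the gradient, and with p = 2 the second reduces to the usual bounded-variance assumption.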

Key Contribution

Muon, an optimizer that orthogonalizes parameter updates, provably converges to a stationary point under heavy-tailed stochastic gradient noise, and does so faster than mini-batch SGD.

Abstract

Muon is a recently proposed optimizer that enforces orthogonality in parameter updates by projecting gradients onto the Stiefel manifold, leading to stable and efficient training in large-scale deep neural networks. Meanwhile, previously reported results indicate that stochastic noise in practical machine learning may exhibit heavy-tailed behavior, violating the bounded-variance assumption. In this paper, we consider the problem of minimizing a nonconvex Hölder-smooth empirical risk, a setting well suited to heavy-tailed stochastic noise. We then show that Muon converges to a stationary point of the empirical risk under a boundedness condition that accounts for heavy-tailed stochastic noise. In addition, we show that Muon converges faster than mini-batch SGD.
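The orthogonalized update described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm statement: the function names (`orthogonalize`, `muon_step`), the momentum form, and the default `lr` and `beta` values are hypothetical, and the plain cubic Newton-Schulz iteration is used here only as a simple stand-in for computing the orthogonal polar factor of a matrix. Practical Muon implementations may use different iteration coefficients and scaling.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g.

    Uses the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X
    after scaling g by its Frobenius norm, so every singular value
    starts in (0, 1] and is driven toward 1.
    """
    fro = math.sqrt(sum(v * v for row in g for v in row))
    if fro == 0.0:
        return [row[:] for row in g]
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxtx)]
    return x


def muon_step(w, grad, momentum, lr=0.02, beta=0.9):
    """One hypothetical Muon-style step on a weight matrix w.

    Accumulates momentum, orthogonalizes it, and moves w along the
    orthogonalized direction. Returns the new weights and momentum.
    """
    momentum = [[beta * m + g for m, g in zip(mr, gr)]
                for mr, gr in zip(momentum, grad)]
    direction = orthogonalize(momentum)
    w = [[wv - lr * dv for wv, dv in zip(wr, dr)]
         for wr, dr in zip(w, direction)]
    return w, momentum


# Usage: the update direction depends on the gradient's "shape", not its size.
grad = [[2.0, 1.0], [0.0, 1.0]]
huge = [[2000.0, 1000.0], [0.0, 1000.0]]  # same gradient, 1000x larger
q_small = orthogonalize(grad)
q_huge = orthogonalize(huge)
```

In this sketch `q_small` and `q_huge` agree up to floating-point error, because the orthogonalized direction is invariant to rescaling the gradient. That bounded step size is one intuition for why an orthogonalized update could tolerate occasional very large stochastic gradients, although the paper's actual argument is not reproduced here.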

Architecture Design (Transformers, SSMs, MoE) | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
