This paper analyzes the Muown optimizer and shows that its directional update is equivalent to a Riemannian step on normalized directions, while the magnitude of the un-normalized parameterization only modulates the angular step size. The authors introduce AngularMuown, which optimizes directly over normalized directions and uses a schedulable angular multiplier that is decoupled from the radial magnitude update, improving over Muown. At the time of writing, a preliminary version of AngularMuown leads the per-optimizer category of the modded nanoGPT speedrunning competition, and further experiments on Qwen2-0.5B and 1.1B-parameter mixture-of-experts models indicate that the method scales beyond small models.
AngularMuown makes the angular step size an explicit, schedulable quantity and improves over Muown in the reported experiments.
Matrix-aware optimizers such as Muon and Muown have recently shown strong empirical performance for pre-training Transformers. In particular, Muown separates each weight matrix into row magnitudes and an un-normalized direction variable, updating the former with Adam and the latter with Muon. We show that the directional update of Muown is equivalent to a Riemannian step on the normalized directions, while the magnitude of the un-normalized parameterization only modulates the angular step size. This explains the step-size stability of Muown and suggests making the angular step size explicit. The resulting method, AngularMuown, optimizes directly over the normalized directions and uses a schedulable angular multiplier that is decoupled from the radial magnitude update. AngularMuown improves over Muown and, at the time of writing, a preliminary version leads the per-optimizer category of the modded nanoGPT speedrunning competition. Further experiments on Qwen2-0.5B and 1.1B-parameter mixture-of-experts models confirm that the algorithm scales beyond small models. An implementation of the algorithm is available at https://github.com/fhueb/angular-muown
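To make the geometry described above concrete, the following is a minimal NumPy sketch. It is not the authors' implementation. The function names `split_rows`, `angular_step`, and `implied_angle` are hypothetical, the update direction `U` is an arbitrary stand-in for whatever matrix the optimizer produces, and the Adam update of the magnitudes is omitted. The sketch only illustrates two geometric facts: a row-wise split into magnitude and unit direction, and a step on the unit sphere whose angle is set explicitly. The last helper shows, for the simplified case of a unit-norm step orthogonal to the row, why the norm of an un-normalized row rescales the angle it rotates by.

```python
import numpy as np


def split_rows(W, eps=1e-12):
    """Split W into row magnitudes g and unit-norm row directions D,
    so that W == g[:, None] * D (up to rounding)."""
    g = np.linalg.norm(W, axis=1)
    D = W / np.maximum(g, eps)[:, None]
    return g, D


def angular_step(D, U, angle, eps=1e-12):
    """Rotate every unit-norm row of D by `angle` radians, moving against U.

    U is any update direction with the same shape as D (an assumption of
    this sketch, e.g. an orthogonalized momentum matrix).
    """
    # Remove the radial component so the step is tangent to each row's sphere.
    T = U - np.sum(U * D, axis=1, keepdims=True) * D
    T_hat = T / np.maximum(np.linalg.norm(T, axis=1, keepdims=True), eps)
    # Geodesic step on the sphere: the angle is explicit and independent of g.
    D_new = np.cos(angle) * D - np.sin(angle) * T_hat
    # Renormalize to absorb floating-point drift.
    return D_new / np.maximum(np.linalg.norm(D_new, axis=1, keepdims=True), eps)


def implied_angle(row_norm, step_size):
    """Angle by which an un-normalized row v rotates under v - step_size * t,
    where t is a unit vector orthogonal to v and row_norm = ||v||."""
    return np.arctan(step_size / row_norm)
```

In this sketch, passing a scheduled value for `angle` plays the role of the schedulable angular multiplier, while `implied_angle` shows that the same `step_size` yields a smaller rotation as `row_norm` grows.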