Mousse is an optimizer that improves on Muon by adding curvature-aware preconditioning, addressing Muon's implicit assumption of an isotropic optimization landscape. Mousse operates in a whitened coordinate system derived from Shampoo's Kronecker-factored statistics, which yields an anisotropic trust region and updates that adapt to curvature. Experiments on language models with 160M to 800M parameters show that Mousse needs about 12% fewer training steps than Muon, with negligible computational overhead.
Muon's "one-size-fits-all" spectral update is holding back your models: Mousse adapts to curvature and cuts training steps by about 12%.
Recent advances in spectral optimization, notably Muon, have demonstrated that constraining update steps to the Stiefel manifold can significantly accelerate training and improve generalization. However, Muon implicitly assumes an isotropic optimization landscape, enforcing a uniform spectral update norm across all eigen-directions. We argue that this "egalitarian" constraint is suboptimal for deep neural networks, where the curvature spectrum is known to be heavy-tailed and ill-conditioned. In such landscapes, Muon risks amplifying instabilities in high-curvature directions while limiting necessary progress in flat directions. In this work, we propose Mousse (Muon Optimization Utilizing Shampoo's Structural Estimation), a novel optimizer that reconciles the structural stability of spectral methods with the geometric adaptivity of second-order preconditioning. Instead of applying Newton-Schulz orthogonalization directly to the momentum matrix, Mousse operates in a whitened coordinate system induced by Kronecker-factored statistics (derived from Shampoo). Mathematically, we formulate Mousse as the solution to a spectral steepest descent problem constrained by an anisotropic trust region, where the optimal update is derived via the polar decomposition of the whitened gradient. Empirical results across language models ranging from 160M to 800M parameters demonstrate that Mousse consistently outperforms Muon, reducing the number of training steps by about 12% with negligible computational overhead.
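To make the "whiten, orthogonalize, map back" idea concrete, here is a minimal NumPy sketch of one such update for a single weight matrix. It is an illustration under stated assumptions, not the paper's implementation. The function names, the `power=4` exponent (borrowed from Shampoo's usual inverse fourth root), the `eps` damping, and the use of an exact SVD in place of Newton-Schulz iterations are all assumptions made here. The abstract does not specify these details, nor how the statistics are accumulated or how the step is scaled.

```python
import numpy as np


def inv_root_psd(mat, power, eps=1e-6):
    """Return (mat + eps*I)^(-1/power) for a symmetric PSD matrix."""
    # eps is a hypothetical damping term that keeps eigenvalues positive.
    w, v = np.linalg.eigh(mat + eps * np.eye(mat.shape[0]))
    return (v * w ** (-1.0 / power)) @ v.T


def polar_factor(m):
    """Orthogonal polar factor U V^T of m, computed exactly via SVD."""
    # Muon approximates this factor with Newton-Schulz iterations;
    # an SVD is used here only to keep the sketch exact and short.
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def whitened_spectral_update(grad, left_stat, right_stat, power=4):
    """Sketch of a spectral update taken in whitened coordinates.

    left_stat and right_stat play the role of Shampoo-style Kronecker
    factors (for example, running sums of G G^T and G^T G).
    """
    p_left = inv_root_psd(left_stat, power)
    p_right = inv_root_psd(right_stat, power)
    whitened = p_left @ grad @ p_right   # gradient in whitened coordinates
    ortho = polar_factor(whitened)       # uniform spectrum in that space
    return p_left @ ortho @ p_right      # map back to parameter space


# Usage: with identity statistics the update reduces to the plain
# (Muon-style) polar factor of the gradient, up to the eps damping.
rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 3))
update = whitened_spectral_update(grad, np.eye(4), np.eye(3))
```

Under these assumptions the returned matrix maximizes the inner product with the gradient over all updates whose whitened form has spectral norm at most one, which is one way to read the abstract's "anisotropic trust region". Directions with large accumulated statistics get smaller steps and flat directions get larger ones.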