Google Research · Northwestern · Feb 17, 2026 · arXiv:2602.15322

On Surprising Effectiveness of Masking Updates in Adaptive Optimizers

Taejong Joo, Wenhan Xia, Cheolmin Kim, Ming Zhang, Eugene Ie

AI Summary

The paper investigates randomly masking parameter updates in adaptive optimizers for large language model (LLM) training and finds that a masked variant of RMSProp outperforms recent state-of-the-art optimizers. The authors attribute this to a curvature-dependent geometric regularization induced by the masking. Building on this, they propose Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment and, at the 1B model size, reduces perplexity by over 19% and 9% relative to Adam and Muon, respectively, in LLM pre-training.

Key Contribution

Randomly masking parameter updates in RMSProp delivers state-of-the-art LLM training performance, revealing a surprisingly effective form of geometric regularization.

Abstract

Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
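
To make the idea of "randomly masking parameter updates" in RMSProp concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the mask granularity, the keep probability, or whether the second-moment statistics are updated on masked steps. The sketch assumes an independent Bernoulli mask per coordinate, a dense second-moment update, and no rescaling of the surviving updates. The function name `masked_rmsprop_step` and the parameter `keep_prob` are hypothetical. Magma's momentum-gradient alignment is not reproduced here because the abstract gives no formula for it.

```python
import math
import random


def masked_rmsprop_step(params, grads, sq_avg, lr=1e-3, beta=0.99,
                        eps=1e-8, keep_prob=0.5, rng=random):
    """One RMSProp step whose per-coordinate updates are randomly masked.

    Assumptions (not taken from the paper): the mask is an independent
    Bernoulli(keep_prob) draw per coordinate, the running average of
    squared gradients is updated for every coordinate regardless of the
    mask, and kept updates are not rescaled.
    """
    new_params, new_sq_avg = [], []
    for p, g, v in zip(params, grads, sq_avg):
        # Standard RMSProp second-moment estimate (updated densely).
        v = beta * v + (1.0 - beta) * g * g
        # Standard RMSProp preconditioned update.
        update = lr * g / (math.sqrt(v) + eps)
        # Apply the update only if this coordinate survives the mask.
        keep = rng.random() < keep_prob
        new_params.append(p - update if keep else p)
        new_sq_avg.append(v)
    return new_params, new_sq_avg


if __name__ == "__main__":
    # Toy usage: minimize f(x) = 0.5 * sum(x_i^2), whose gradient is x.
    rng = random.Random(0)
    x = [1.0, -2.0, 3.0]
    v = [0.0, 0.0, 0.0]
    for _ in range(200):
        x, v = masked_rmsprop_step(x, list(x), v, lr=0.05, rng=rng)
    print(x)
```

With `keep_prob=1.0` the step reduces to plain RMSProp, and with `keep_prob=0.0` the parameters stay fixed while the second-moment estimate keeps updating, which makes the two extremes easy to check.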

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
