Apr 15, 2026

arXiv:2604.14108

Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

Arseniy Andreyev, Advikar Ananthkumar, Marc Walden, Tomaso Poggio, Pierfrancesco Beneventano

AI Summary

This paper investigates the impact of momentum on the Edge of Stochastic Stability (EoSS) in deep learning optimization. It reveals that SGD with momentum exhibits batch-size-dependent behavior, with Batch Sharpness stabilizing at different plateaus depending on whether momentum amplifies stochastic fluctuations (small batch sizes) or provides classical stabilization (large batch sizes). The findings connect these regimes to linear stability thresholds, offering insights into hyperparameter tuning.

Key Contribution

Momentum's effect on sharpness depends on batch size: at small batch sizes it amplifies stochastic fluctuations and favors flatter regions, while at large batch sizes it recovers its classical stabilizing effect and favors sharper regions.
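
To make the two regimes concrete, the plateau expressions stated in the abstract, $2(1-β)/η$ for small batches and $2(1+β)/η$ for large batches, can be evaluated for a given learning rate and momentum coefficient. The following is a minimal sketch: the function name and the example values η = 0.01, β = 0.9 are illustrative assumptions, not settings taken from the paper. The no-momentum entry is simply what both expressions reduce to at β = 0.

```python
def batch_sharpness_plateaus(lr, beta):
    """Evaluate the plateau formulas quoted in the abstract.

    lr   -- learning rate (eta), assumed > 0
    beta -- momentum coefficient, assumed in [0, 1)
    The function name and dictionary keys are hypothetical, chosen for illustration.
    """
    if lr <= 0 or not (0.0 <= beta < 1.0):
        raise ValueError("expected lr > 0 and 0 <= beta < 1")
    return {
        "small_batch": 2.0 * (1.0 - beta) / lr,  # lower plateau, 2(1 - beta)/eta
        "no_momentum": 2.0 / lr,                 # both formulas at beta = 0
        "large_batch": 2.0 * (1.0 + beta) / lr,  # higher plateau, 2(1 + beta)/eta
    }


if __name__ == "__main__":
    # Illustrative hyperparameters only.
    for name, value in batch_sharpness_plateaus(lr=0.01, beta=0.9).items():
        print(f"{name:>12}: {value:.1f}")
```

The ratio between the two plateaus is $(1+β)/(1-β)$, so under these formulas the gap between the small-batch and large-batch regimes widens quickly as β approaches 1.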

Abstract

Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-β)/η$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+β)/η$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.
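
The higher plateau $2(1+β)/η$ coincides with the classical linear stability threshold of full-batch heavy-ball momentum on a quadratic, and this can be checked numerically. The sketch below assumes the heavy-ball update $x_{t+1} = x_t - η λ x_t + β(x_t - x_{t-1})$ on the one-dimensional quadratic $f(x) = λx^2/2$; that parameterization, the function name, and the example values are assumptions made for illustration. It covers only the deterministic case, so it says nothing about the lower, small-batch plateau, which depends on mini-batch noise.

```python
import cmath


def heavy_ball_spectral_radius(curvature, lr, beta):
    """Spectral radius of heavy-ball momentum on f(x) = curvature * x**2 / 2.

    The iteration x_{t+1} = (1 + beta - lr*curvature) * x_t - beta * x_{t-1}
    has characteristic polynomial z**2 - a*z + beta with
    a = 1 + beta - lr*curvature. The iteration is linearly stable
    exactly when both roots lie strictly inside the unit circle.
    """
    a = 1.0 + beta - lr * curvature
    disc = cmath.sqrt(a * a - 4.0 * beta)  # complex sqrt handles oscillatory roots
    return max(abs((a + disc) / 2.0), abs((a - disc) / 2.0))


if __name__ == "__main__":
    lr, beta = 0.01, 0.9                  # illustrative values only
    threshold = 2.0 * (1.0 + beta) / lr   # classical full-batch threshold
    for scale in (0.5, 0.99, 1.01):
        rho = heavy_ball_spectral_radius(scale * threshold, lr, beta)
        status = "stable" if rho < 1.0 else "unstable"
        print(f"curvature = {scale:.2f} x threshold: radius = {rho:.4f} ({status})")
```

Running it shows the spectral radius crossing 1 as the curvature passes $2(1+β)/η$, which is the sense in which sharper regions remain stable under full-batch momentum.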

Architecture Design (Transformers, SSMs, MoE)

Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
