Mar 10, 2026arXiv:2603.09923

OptEMA: Adaptive Exponential Moving Average for Stochastic Optimization with Zero-Noise Optimality

AI Summary

The paper introduces OptEMA, a novel adaptive Exponential Moving Average method for stochastic optimization, and analyzes two variants: OptEMA-M and OptEMA-V. OptEMA employs trajectory-dependent stepsizes without requiring Lipschitz constant knowledge, addressing limitations of existing Adam-style analyses. Under standard SGD assumptions, OptEMA achieves a noise-adaptive convergence rate of $\widetilde{\mathcal{O}}(T^{-1/2}+σ^{1/2} T^{-1/4})$, reaching a nearly optimal deterministic rate of $\widetilde{\mathcal{O}}(T^{-1/2})$ in the zero-noise regime without hyperparameter tuning.

Key Contribution

Forget manual hyperparameter tuning: OptEMA achieves near-optimal deterministic convergence in zero-noise stochastic optimization, adapting automatically.

Abstract

The Exponential Moving Average (EMA) is a cornerstone of widely used optimizers such as Adam. However, existing theoretical analyses of Adam-style methods have notable limitations: their guarantees can remain suboptimal in the zero-noise regime, rely on restrictive boundedness conditions (e.g., bounded gradients or objective gaps), use constant or open-loop stepsizes, or require prior knowledge of Lipschitz constants. To overcome these bottlenecks, we introduce OptEMA and analyze two novel variants: OptEMA-M, which applies an adaptive, decreasing EMA coefficient to the first-order moment with a fixed second-order decay, and OptEMA-V, which swaps these roles. Crucially, OptEMA is closed-loop and Lipschitz-free in the sense that its effective stepsizes are trajectory-dependent and do not require the Lipschitz constant for parameterization. Under standard stochastic gradient descent (SGD) assumptions, namely smoothness, a lower-bounded objective, and unbiased gradients with bounded variance, we establish rigorous convergence guarantees. Both variants achieve a noise-adaptive convergence rate of $\widetilde{\mathcal{O}}(T^{-1/2}+σ^{1/2} T^{-1/4})$ for the average gradient norm, where $σ$ is the noise level. In particular, in the zero-noise regime where $σ=0$, our bounds reduce to the nearly optimal deterministic rate $\widetilde{\mathcal{O}}(T^{-1/2})$ without manual hyperparameter retuning.

Architecture Design (Transformers, SSMs, MoE)Training Efficiency & Optimization

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

OptEMA: Adaptive Exponential Moving Average for Stochastic Optimization with Zero-Noise Optimality

Related Papers