This paper introduces a single-pass stochastic gradient descent (SGD) algorithm with momentum for generalized linear prediction in the streaming setting, where each iteration uses one fresh data point. The algorithm uses a data-dependent proximal method to achieve dual-momentum acceleration, answering an open question about whether momentum can accelerate non-quadratic stochastic optimization. The derived excess risk bound combines an improved optimization error with a minimax optimal statistical error and a higher-order model-misspecification error. On this basis the authors conclude that momentum acceleration is more effective than variance reduction in this setting.
Momentum *can* accelerate single-pass stochastic gradient descent for generalized linear prediction. This resolves the open problem posed by Jain et al. [2018a], and the authors find momentum more effective than variance reduction in the streaming setting.
We study generalized linear prediction in a streaming setting, where each iteration uses only one fresh data point for a gradient-level update. While momentum is well-established in deterministic optimization, a fundamental open question is whether it can accelerate such single-pass non-quadratic stochastic optimization. We propose the first algorithm that successfully incorporates momentum via a novel data-dependent proximal method, achieving dual-momentum acceleration. Our derived excess risk bound decomposes into three components: an improved optimization error, a minimax optimal statistical error, and a higher-order model-misspecification error. The proof handles model misspecification via a fine-grained stationary analysis of inner updates, while localizing statistical error through a two-phase outer-loop analysis. As a result, we resolve the open problem posed by Jain et al. [2018a] and demonstrate that momentum acceleration is more effective than variance reduction for generalized linear prediction in the streaming setting.
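To make the streaming setting concrete, the sketch below runs a single pass over freshly sampled data for logistic regression, one instance of generalized linear prediction. It is not the paper's algorithm: the data-dependent proximal method and the dual-momentum scheme are not specified in the abstract, so plain heavy-ball momentum stands in as an assumption. The names `sample_stream`, `single_pass_sgd`, and `distance`, as well as the step sizes and the synthetic model `w_star`, are hypothetical choices made for illustration only.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_stream(w_star, n, rng):
    """Yield n fresh (x, y) pairs from a logistic model; each pair is seen once."""
    d = len(w_star)
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        p = sigmoid(sum(wi * xi for wi, xi in zip(w_star, x)))
        y = 1.0 if rng.random() < p else 0.0
        yield x, y


def single_pass_sgd(stream, d, lr=0.05, beta=0.0):
    """One pass over the stream, one sample per update.

    beta > 0 adds heavy-ball momentum. This is an illustrative stand-in,
    NOT the proximal dual-momentum method proposed in the paper.
    """
    w = [0.0] * d
    v = [0.0] * d
    for x, y in stream:
        # Gradient of the logistic loss at the single fresh sample is residual * x.
        residual = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j in range(d):
            v[j] = beta * v[j] - lr * residual * x[j]
            w[j] += v[j]
    return w


def distance(a, b):
    # Euclidean distance between two parameter vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


w_star = [1.0, -2.0, 0.5]  # hypothetical ground-truth parameters
w_plain = single_pass_sgd(sample_stream(w_star, 20000, random.Random(0)), d=3)
w_mom = single_pass_sgd(
    sample_stream(w_star, 20000, random.Random(0)), d=3, lr=0.01, beta=0.8
)
print(distance(w_plain, w_star), distance(w_mom, w_star))
```

Both runs consume each sample exactly once, which is what distinguishes the streaming setting from multi-pass training on a fixed dataset. The sketch makes no claim about which variant is faster. The abstract's point is that obtaining a provable acceleration from momentum in this noisy, non-quadratic regime requires more than the naive update shown here.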