This paper analyzes the convergence of stochastic gradient descent (SGD) for composite optimization problems involving $N$ sequential operators, considering perturbations in both forward and backward passes. It addresses the limitation of classical analyses that treat gradient noise as additive by characterizing how forward and backward perturbations propagate and amplify geometrically through the computational graph. The paper provides convergence guarantees for non-convex and Polyak--Łojasiewicz objectives, identifies conditions for maintaining asymptotic convergence order despite perturbations, and offers a theoretical explanation for gradient spiking in deep learning.
Perturbations in forward and backward passes of SGD can cascade geometrically through deep networks, but this paper identifies conditions under which asymptotic convergence order is preserved.
We study stochastic gradient descent (SGD) for composite optimization problems with $N$ sequential operators subject to perturbations in both the forward and backward passes. Classical analyses treat gradient noise as additive and localized; in this setting, by contrast, perturbations to intermediate outputs and gradients cascade through the computational graph, compounding geometrically with the number of operators. We present the first comprehensive theoretical analysis of this setting. Specifically, we characterize how forward and backward perturbations propagate and amplify within a single gradient step, derive convergence guarantees for both general non-convex objectives and functions satisfying the Polyak--Łojasiewicz condition, and identify conditions under which perturbations do not deteriorate the asymptotic convergence order. As a byproduct, our analysis furnishes a theoretical explanation for the gradient spiking phenomenon widely observed in deep learning, precisely characterizing the conditions under which training recovers from spikes or diverges. Experiments on logistic regression with convex and non-convex regularization validate our theory, illustrating the predicted spike behavior and the asymmetric sensitivity to forward versus backward perturbations.
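The geometric compounding described in the abstract can be illustrated with a toy forward pass. The sketch below is not the paper's model: it assumes $N$ identical scalar operators $x \mapsto a\,x$ (Lipschitz constant $a$) and a fixed additive perturbation $\delta$ injected after each operator. The names `compose`, `forward_error`, `gain`, and `delta` are hypothetical and chosen only for this illustration. Under these assumptions the output error is $\delta \sum_{i=0}^{N-1} a^{i}$, which grows geometrically in $N$ when $a > 1$ and only linearly when $a = 1$.

```python
def compose(x, gain, n_ops, delta=0.0):
    """Apply n_ops scalar operators x -> gain * x in sequence.

    delta is a hypothetical additive perturbation injected into the
    output of every operator (delta=0.0 gives the exact forward pass).
    """
    for _ in range(n_ops):
        x = gain * x + delta
    return x


def forward_error(gain, n_ops, delta, x0=1.0):
    """Absolute gap between the perturbed and the exact forward pass."""
    perturbed = compose(x0, gain, n_ops, delta)
    exact = compose(x0, gain, n_ops)
    return abs(perturbed - exact)


if __name__ == "__main__":
    # Same per-operator perturbation, increasing depth.
    for n in (1, 5, 10, 20):
        print(n, forward_error(gain=1.5, n_ops=n, delta=1e-3))
```

With `gain` at or below 1 the same per-operator perturbation stays bounded or grows linearly with depth, which gives a rough intuition for why conditions on the operators matter for whether perturbations can be controlled.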