This paper analyzes the population risk of linear models trained with one-pass signSGD under a power-law random features (PLRF) model, accounting for feature and target decay. By comparing against SGD, the authors identify drift-normalization and noise-reshaping effects unique to signSGD, and derive compute-optimal scaling laws. The analysis shows that noise-reshaping can make the compute-optimal slope of signSGD steeper than that of SGD in noise-dominant regimes, and that a warmup-stable-decay (WSD) schedule strengthens this effect when feature decay is fast but target decay is slow.
SignSGD can outperform SGD in linear regression when noise dominates, thanks to a unique "noise-reshaping" effect that steepens its compute-optimal scaling law.
We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features. We express the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters. Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD. We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant. Finally, we observe that the widely used warmup-stable-decay (WSD) schedule further reduces the noise term and sharpens the compute-optimal slope when feature decay is fast but target decay is slow.
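To make the setup concrete, the following is a minimal sketch of one-pass signSGD on a PLRF-style problem; it is not the authors' code. Everything specific here is an assumption chosen for illustration: the function name `plrf_signsgd`, feature variances `j^(-2*alpha)`, target coefficients `j^(-beta)`, a Gaussian sketch with entries of variance `1/d`, a constant learning rate, and the default parameter values. The abstract does not state these details.

```python
import numpy as np

def plrf_signsgd(d=20, v=200, alpha=1.0, beta=0.5, lr=0.005, steps=2000, seed=0):
    """One-pass signSGD on a toy PLRF-style linear regression (illustrative only)."""
    rng = np.random.default_rng(seed)
    j = np.arange(1, v + 1, dtype=float)
    lam = j ** (-2.0 * alpha)   # assumed feature variances (feature decay)
    b = j ** (-beta)            # assumed target coefficients (target decay)
    # Assumed Gaussian sketch mapping v raw features to d model features.
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(v, d))

    def risk(theta):
        # Population risk 0.5 * E[(<x, W theta> - <x, b>)^2] for x ~ N(0, diag(lam)).
        err = W @ theta - b
        return 0.5 * np.sum(lam * err ** 2)

    theta = np.zeros(d)
    risks = [risk(theta)]
    for _ in range(steps):
        # Draw one fresh sample per step (one-pass training).
        x = rng.normal(size=v) * np.sqrt(lam)
        residual = x @ (W @ theta - b)   # prediction minus target
        grad = residual * (W.T @ x)      # per-sample squared-loss gradient
        theta -= lr * np.sign(grad)      # signSGD: step along the coordinatewise sign
        risks.append(risk(theta))
    return np.array(risks)

risks = plrf_signsgd()
print(f"initial risk {risks[0]:.4f}, final risk {risks[-1]:.4f}")
```

Because each update moves every coordinate by exactly `lr`, the risk under a constant learning rate settles at a floor that depends on `lr`. This is the kind of noise term that a decaying schedule such as WSD would be expected to shrink.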