Mar 2, 2026

arXiv:2603.02069

Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?

AI Summary

This paper analyzes the population risk of linear models trained with one-pass signSGD under a power-law random features (PLRF) model that accounts for feature and target decay. By comparing against SGD, the authors identify drift-normalization and noise-reshaping effects unique to signSGD, and derive compute-optimal scaling laws. The analysis shows that noise-reshaping can make the compute-optimal slope of signSGD steeper than that of SGD in noise-dominant regimes, and that a warmup-stable-decay (WSD) schedule sharpens the slope further when feature decay is fast but target decay is slow.
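For readers unfamiliar with the term, a warmup-stable-decay schedule can be sketched as follows. This is a minimal illustration, not the paper's schedule: the linear warmup, the linear decay, and the names `wsd_lr`, `warmup_steps`, `decay_steps`, and `final_lr` are all assumptions made here.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, final_lr=0.0):
    """Learning rate at a 0-indexed step under an assumed WSD shape.

    Phases: linear warmup to peak_lr, constant plateau at peak_lr,
    then linear decay from peak_lr toward final_lr over the last decay_steps.
    """
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp up so that the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: interpolate linearly from peak_lr toward final_lr.
    progress = (step - decay_start) / max(decay_steps, 1)
    return peak_lr + (final_lr - peak_lr) * progress


if __name__ == "__main__":
    schedule = [wsd_lr(t, 100, 1.0, warmup_steps=5, decay_steps=20) for t in range(100)]
    print(schedule[:6], schedule[78:83])
```

The decay phase is the part that lowers the step size late in training; other decay shapes (cosine, exponential) would fit the same three-phase pattern.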

Key Contribution

SignSGD can outperform SGD in linear regression when noise dominates, thanks to a unique "noise-reshaping" effect that steepens its compute-optimal scaling law.

Abstract

We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features. We express the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters. Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD. We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant. Finally, we observe that the widely used warmup-stable-decay (WSD) schedule further reduces the noise term and sharpens the compute-optimal slope, when feature decay is fast but target decay is slow.
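The setup described in the abstract can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not the authors' code: it assumes features x_j ~ N(0, j^(-2*alpha)), target coefficients b_j = j^(-beta), noiseless labels y = <x, b>, and a Gaussian sketch W with N(0, 1/d) entries. The function names, default sizes, and learning rates are all hypothetical choices made for illustration.

```python
import numpy as np


def make_plrf(v, d, alpha, beta, rng):
    """Build an assumed PLRF instance: feature scales, target vector, sketch."""
    j = np.arange(1, v + 1, dtype=float)
    feat_scale = j ** (-alpha)  # assumed: std of feature j decays as j^-alpha
    b = j ** (-beta)            # assumed: target coefficient j decays as j^-beta
    W = rng.standard_normal((v, d)) / np.sqrt(d)  # Gaussian sketch from v to d dims
    return feat_scale, b, W


def population_risk(theta, feat_scale, b, W):
    """E[(<W^T x, theta> - <x, b>)^2] for independent x_j ~ N(0, feat_scale_j^2)."""
    resid = W @ theta - b
    return float(np.sum(feat_scale ** 2 * resid ** 2))


def step(theta, x, y, W, lr, use_sign):
    """One stochastic update; use_sign=True applies the coordinate-wise sign."""
    z = W.T @ x                 # sketched features, shape (d,)
    grad = (z @ theta - y) * z  # gradient of 0.5 * (prediction - y)^2
    direction = np.sign(grad) if use_sign else grad
    return theta - lr * direction


def run_one_pass(steps, lr, use_sign, v=200, d=20, alpha=1.0, beta=0.5, seed=0):
    """One-pass training: every step draws a fresh sample; returns final risk."""
    rng = np.random.default_rng(seed)
    feat_scale, b, W = make_plrf(v, d, alpha, beta, rng)
    theta = np.zeros(d)
    for _ in range(steps):
        x = feat_scale * rng.standard_normal(v)
        theta = step(theta, x, x @ b, W, lr, use_sign)
    return population_risk(theta, feat_scale, b, W)


if __name__ == "__main__":
    print("SGD risk:    ", run_one_pass(2000, 0.05, use_sign=False))
    print("signSGD risk:", run_one_pass(2000, 0.001, use_sign=True))
```

Note that a signSGD step moves every coordinate by exactly the learning rate regardless of gradient magnitude, which is why the two methods need differently scaled learning rates in a comparison like this.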

Scaling Laws & Emergent Abilities

Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
