This paper analyzes the stability of zeroth-order (ZO) optimization methods, which are crucial for black-box learning and memory-efficient fine-tuning of large models. The authors derive an explicit step size condition for mean-square stability of ZO methods based on the two-point estimator, revealing that stability depends on the entire Hessian spectrum, unlike first-order methods. Empirically, they show that full-batch ZO methods operate at the edge of stability, with large step sizes primarily regularizing the Hessian trace.
Zeroth-order optimization stability depends on the *entire* Hessian spectrum, not just the largest eigenvalue like first-order methods, offering a new perspective on implicit regularization.
Zeroth-order (ZO) methods are widely used when gradients are unavailable or prohibitively expensive, including black-box learning and memory-efficient fine-tuning of large models, yet their optimization dynamics in deep learning remain underexplored. In this work, we provide an explicit step size condition that exactly captures the (mean-square) linear stability of a family of ZO methods based on the standard two-point estimator. Our characterization reveals a sharp contrast with first-order (FO) methods: whereas FO stability is governed solely by the largest Hessian eigenvalue, mean-square stability of ZO methods depends on the entire Hessian spectrum. Since computing the full Hessian spectrum is infeasible in practical neural network training, we further derive tractable stability bounds that depend only on the largest eigenvalue and the Hessian trace. Empirically, we find that full-batch ZO methods operate at the edge of stability: ZO-GD, ZO-GDM, and ZO-Adam consistently stabilize near the predicted stability boundary across a range of deep learning training problems. Our results highlight an implicit regularization effect specific to ZO methods, where large step sizes primarily regularize the Hessian trace, whereas in FO methods they regularize the top eigenvalue.
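The two-point estimator and the edge-of-stability behavior described above can be sketched numerically. The snippet below is a minimal illustration, not the paper's code: it implements the standard symmetric two-point gradient estimator with Gaussian directions and runs plain ZO-GD on a toy quadratic whose Hessian spectrum is known. The function names (`two_point_grad`, `zo_gd`), the smoothing radius `mu`, and the step size values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_point_grad(f, x, mu=1e-5):
    # Standard symmetric two-point estimator:
    #   ghat = (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u,  u ~ N(0, I).
    # For standard Gaussian u, E[ghat] = grad f(x) + O(mu^2).
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u

def zo_gd(f, x0, step, iters=1000):
    # Plain zeroth-order gradient descent using the estimator above.
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        x = x - step * two_point_grad(f, x)
    return x

H = np.diag([1.0, 10.0])          # toy Hessian with spectrum {1, 10}
f = lambda x: 0.5 * x @ H @ x     # quadratic objective, grad f(x) = H @ x

# Averaged over many random directions, the estimator recovers H @ x.
x0 = np.array([1.0, 1.0])
est = np.mean([two_point_grad(f, x0) for _ in range(20000)], axis=0)
print(est)                        # close to H @ x0 = [1., 10.]

# With a small enough step size the iteration is mean-square stable;
# pushing the step toward the ZO stability boundary (which here depends
# on the whole spectrum, not just the top eigenvalue 10) breaks this.
x_final = zo_gd(f, x0, step=0.02)
print(f(x_final))                 # near zero
```

On this quadratic, the single-direction estimate is very noisy (its variance scales with the full gradient norm), which is exactly why the mean-square stability threshold involves the entire spectrum rather than only the largest eigenvalue, as the abstract states.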