The paper introduces Leap+Verify, a speculative execution framework that accelerates neural network training by predicting future model weights and validating those predictions before accepting them. It detects three training regimes (chaotic, transition, stable) on the fly using activation-space cosine similarity, and within each regime applies analytic weight predictors (momentum, linear, and quadratic extrapolation) to forecast model parameters. Experiments on GPT-2 124M and Qwen 2.5-1.5B show that finite-difference predictors (linear, quadratic) outperform momentum-based prediction, reaching up to 37% strict acceptance, and that the bottleneck shifts with scale from predictor accuracy to regime availability.
Forget momentum extrapolation for speculative training: finite-difference methods can achieve up to 37% weight prediction acceptance in large language models, but only if you can find the fleeting "transition" regime.
We introduce Leap+Verify, a framework that applies speculative execution -- predicting future model weights and validating predictions before acceptance -- to accelerate neural network training. Inspired by speculative decoding in language model inference and by the Automatically Scalable Computation (ASC) architecture for program execution, Leap+Verify decomposes training into three dynamically detected regimes (chaotic, transition, stable) using activation-space cosine similarity as a real-time Lyapunov proxy signal. Within each regime, analytic weight predictors (momentum, linear, quadratic extrapolation) attempt to forecast model parameters K training steps ahead; predictions are accepted only when validated against a held-out loss criterion. We evaluate Leap+Verify on GPT-2 124M and Qwen 2.5-1.5B trained on WikiText-103 across five random seeds, sweeping prediction depth K in {5, 10, 25, 50, 75, 100}. Momentum-based prediction (Adam moment extrapolation) fails catastrophically at both scales, with predicted losses exceeding actuals by 100-10,000x -- a universal norm explosion in optimizer-state extrapolation. Finite-difference predictors (linear, quadratic) succeed where momentum fails: at 124M, they achieve 24% strict acceptance at K=5 in stable regimes; at 1.5B, they achieve 37% strict acceptance in transition regimes. The scale-dependent finding is in regime distribution: GPT-2 124M spends 34% of training in stable regime, while Qwen 1.5B spends 64% in chaotic regime and reaches stable in only 0-2 of 40 checkpoints. Larger models are more predictable when predictable, but less often predictable -- the practical bottleneck shifts from predictor accuracy to regime availability. Cross-seed results are highly consistent (less than 1% validation loss variance), and the three-regime framework produces identical phase boundaries (plus or minus 50 steps) across seeds.
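The leap and verify steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes weights are flattened into plain Python lists sampled at evenly spaced steps, and it uses the standard Newton backward-difference form for the linear and quadratic predictors. The function names, the regime thresholds (0.90 and 0.99), and the tolerance-based acceptance rule are all hypothetical, since the abstract gives neither the thresholds nor the exact held-out loss criterion.

```python
import math


def linear_extrapolate(w_prev, w_curr, k):
    """Predict weights k steps ahead from the last two snapshots (first difference)."""
    return [c + k * (c - p) for p, c in zip(w_prev, w_curr)]


def quadratic_extrapolate(w_prev2, w_prev, w_curr, k):
    """Predict weights k steps ahead from the last three snapshots.

    Newton backward-difference form: w + k*d1 + k*(k+1)/2 * d2, where d1 and d2
    are the first and second backward differences.
    """
    coeff = k * (k + 1) / 2
    return [
        c + k * (c - p) + coeff * (c - 2 * p + pp)
        for pp, p, c in zip(w_prev2, w_prev, w_curr)
    ]


def cosine_similarity(a, b):
    """Cosine similarity between two activation vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_regime(similarity, chaotic_below=0.90, stable_above=0.99):
    """Map a similarity score to a regime. Thresholds are illustrative assumptions."""
    if similarity < chaotic_below:
        return "chaotic"
    if similarity >= stable_above:
        return "stable"
    return "transition"


def accept_prediction(predicted_loss, reference_loss, tolerance=0.0):
    """Hypothetical verify step: accept the leap only if the held-out loss at the
    predicted weights is no worse than the reference loss, within a tolerance."""
    return predicted_loss <= reference_loss * (1.0 + tolerance)


if __name__ == "__main__":
    # Toy one-parameter trajectory following w(n) = n^2 at steps 0, 1, 2.
    history = [[0.0], [1.0], [4.0]]
    print(linear_extrapolate(history[1], history[2], k=2))
    print(quadratic_extrapolate(history[0], history[1], history[2], k=2))
    print(classify_regime(cosine_similarity([1.0, 0.2], [0.9, 0.4])))
```

In a real training loop the snapshots would be full parameter tensors and the verify step would run a forward pass on held-out data at the predicted weights. The sketch only shows how the pieces fit together: classify the regime, extrapolate, then gate the leap on a loss check.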