This paper introduces Residual Koopman Spectral Profiling (RKSP), a method to predict transformer training divergence from a single forward pass at initialization by extracting Koopman spectral features from layer-wise residual snapshots. The key diagnostic, near-unit spectral mass, quantifies the fraction of modes concentrated near the unit circle and correlates strongly with instability risk, achieving an AUROC of 0.995 in predicting divergence. The authors also introduce Koopman Spectral Shaping (KSS) to reshape spectra during training, demonstrating its effectiveness in preventing divergence and enabling higher learning rates across various architectures and datasets.
Predict transformer training failures *before* you even start training, with an AUROC of 0.995, using just a single forward pass.
Training divergence in transformers wastes compute, yet practitioners discover instability only after expensive runs begin. They therefore need an expected probability of failure for a transformer before training starts. Our study of Residual Koopman Spectral Profiling (RKSP) provides such an estimate. From a single forward pass at initialization, RKSP extracts Koopman spectral features by applying whitened dynamic mode decomposition to layer-wise residual snapshots. Our central diagnostic, the near-unit spectral mass, quantifies the fraction of modes concentrated near the unit circle, which captures instability risk. For predicting divergence across extensive configurations, this estimator achieves an AUROC of 0.995, outperforming the best gradient baseline. We further make this diagnostic actionable through Koopman Spectral Shaping (KSS), which reshapes spectra during training. We empirically validate that our method works in practice: RKSP predicts divergence at initialization, and when RKSP flags high risk, turning on KSS successfully prevents divergence. In the challenging high learning rate regime without normalization layers, KSS reduces the divergence rate from 66.7% to 12.5% and enables learning rates that are 50% to 150% higher. These findings generalize to WikiText-103 language modeling, vision transformers on CIFAR-10, and pretrained language models, including GPT-2 and LLaMA-2 up to 7B, as well as emerging architectures such as MoE, Mamba-style SSMs, and KAN.
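To make the pipeline in the abstract concrete, the following is a minimal sketch of a near-unit spectral mass computation, not the authors' implementation. It assumes the layer-wise residual snapshots are available as an array of shape `(L + 1, n_tokens, d)`, fits a single linear operator from each layer's input to its output by dynamic mode decomposition in whitened coordinates, and reports the fraction of eigenvalues whose modulus lies within a tolerance of 1. The function names, the `rank` truncation, the `tol` threshold, and the way tokens and layers are stacked are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def near_unit_spectral_mass(snapshots, rank=8, tol=0.05):
    """Fraction of DMD eigenvalues whose modulus lies within `tol` of 1.

    snapshots: array of shape (L + 1, n_tokens, d) holding the residual
    stream after each of L layers (index 0 is the embedding output).
    `rank` and `tol` are illustrative defaults, not values from the paper.
    """
    H = np.asarray(snapshots, dtype=np.float64)
    d = H.shape[-1]
    X = H[:-1].reshape(-1, d).T  # (d, L * n_tokens): inputs to each layer
    Y = H[1:].reshape(-1, d).T   # (d, L * n_tokens): outputs of each layer

    # Truncated SVD of X defines the whitening map W = diag(1/s) @ U.T.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = min(rank, int(np.sum(s > 1e-10 * s[0])))
    U, s, Vt = U[:, :r], s[:r], Vt[:r]

    # In whitened coordinates W @ X equals Vt (orthonormal rows), so the
    # least-squares operator mapping W @ X to W @ Y is (W @ Y) @ Vt.T.
    Yw = (U.T @ Y) / s[:, None]
    A = Yw @ Vt.T  # (r, r) reduced Koopman approximation

    eigvals = np.linalg.eigvals(A)
    mass = float(np.mean(np.abs(np.abs(eigvals) - 1.0) < tol))
    return mass, eigvals


def synthetic_snapshots(n_layers=6, n_tokens=16, seed=0):
    """Toy linear 'network' (hypothetical): two unit-circle modes, two decaying."""
    c, s = np.cos(0.3), np.sin(0.3)
    A = np.array([[c, -s, 0.0, 0.0],
                  [s,  c, 0.0, 0.0],
                  [0.0, 0.0, 0.5, 0.0],
                  [0.0, 0.0, 0.0, 0.3]])
    rng = np.random.default_rng(seed)
    H = [rng.standard_normal((n_tokens, 4))]
    for _ in range(n_layers):
        H.append(H[-1] @ A.T)  # each "layer" applies the same linear map
    return np.stack(H)


mass, eigvals = near_unit_spectral_mass(synthetic_snapshots(), rank=4)
print(mass)  # 0.5: two of the four recovered modes sit on the unit circle
```

On a real transformer the snapshots would presumably be collected with forward hooks on the residual stream during one forward pass at initialization. Because actual layers are nonlinear, the fitted operator is only a linear approximation, and how the resulting mass maps to a divergence probability is something this sketch does not attempt to reproduce.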