This paper demonstrates that performing singular value decomposition (SVD) on loss gradients, rather than on AdamW updates, raises the measured perturbative coupling between Spectral Element Decomposition (SED) directions and Linear Centroid Hypothesis (LCH) features by one to two orders of magnitude on modular arithmetic tasks. The authors find that aggregating gradients across competing tasks obscures the SED-LCH coupling, and that performing SVD on per-task gradients recovers it. They further show that constraining attention updates to any rank-3 subspace, whether SED-derived or random, accelerates grokking. This suggests that the SED-LCH coupling identifies where feature formation concentrates but is not a unique causal pathway.
Analyzing loss gradients rather than parameter updates reveals a coupling between learned features and parameter-space directions that is one to two orders of magnitude stronger than update-based measurements suggest.
We show that replacing the rolling SVD of AdamW updates with a rolling SVD of loss gradients changes the diagnostic by 1-2 orders of magnitude. Performing SVD on the loss gradient instead of the AdamW update increases the measured perturbative coupling between SED directions and Linear Centroid Hypothesis (LCH) features from $\bar{R}_k \approx 3$--$9\times$ to $100$--$330\times$ across four single-task modular arithmetic operations, eliminating the apparent operation dependence in the original measurement. On a multitask transformer with a shared encoder, update-based SED gives $\bar{R}_k \leq 1$ (an apparent failure of the diagnostic), while per-operation gradient-based SED recovers $\bar{R}_k = 20$--$45\times$ across all four operations. Gradient aggregation across competing tasks is the main obstruction; performing SVD on per-task gradients resolves it. A causal intervention shows that constraining attention updates to any rank-3 subspace (whether SED-derived or random) accelerates grokking by approximately $2.3\times$ across random seeds and operations, while removing the rank-3 component has negligible effect under proper gradient-projection methodology. The SED-LCH coupling is therefore a strong diagnostic of where feature formation concentrates in parameter space, but it is not a unique causal pathway: the natural full-rank AdamW attention update is highly rank-redundant under our hyperparameters.
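To make the operations named in the abstract concrete, the following is a minimal NumPy sketch of (a) taking an SVD over a window of flattened per-task gradients versus a window of gradients summed across tasks, and (b) constraining a gradient to a rank-3 subspace or removing that component. It is not the authors' implementation: the window length, parameter count, task labels (`op0` to `op3`), synthetic Gaussian gradients, and all function names are assumptions chosen for illustration, and the sketch does not compute $\bar{R}_k$ or model how AdamW consumes the projected gradient.

```python
import numpy as np


def top_k_directions(window, k=3):
    """Top-k right singular vectors of a (steps x params) gradient window.

    Rows of the result are orthonormal directions in parameter space.
    """
    _, _, vt = np.linalg.svd(window, full_matrices=False)
    return vt[:k]


def project_onto(grad, basis):
    """Component of grad inside the subspace spanned by the rows of basis."""
    return basis.T @ (basis @ grad)


def remove_from(grad, basis):
    """Component of grad orthogonal to the subspace spanned by basis."""
    return grad - project_onto(grad, basis)


# Hypothetical sizes and synthetic data, for illustration only.
rng = np.random.default_rng(0)
n_steps, n_params, k = 8, 50, 3
windows = {
    f"op{i}": rng.normal(size=(n_steps, n_params)) for i in range(4)
}

# Per-task SVD: one rank-k basis per operation.
per_task_basis = {op: top_k_directions(w, k) for op, w in windows.items()}

# Aggregated SVD: gradients are summed across tasks before decomposition,
# which is the setting the abstract identifies as the main obstruction.
aggregated = sum(windows.values())
agg_basis = top_k_directions(aggregated, k)

# A random rank-k subspace, obtained by orthonormalizing Gaussian columns.
q, _ = np.linalg.qr(rng.normal(size=(n_params, k)))
random_basis = q.T

grad = rng.normal(size=n_params)
constrained = project_onto(grad, random_basis)  # keep only the rank-k part
ablated = remove_from(grad, random_basis)       # drop the rank-k part
```

Either `per_task_basis[op]` or `random_basis` can be passed to `project_onto`, which mirrors the abstract's comparison of SED-derived and random rank-3 subspaces. Projecting the gradient itself, rather than the optimizer update, is one plausible reading of "gradient-projection methodology" and is an assumption here.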