This paper investigates how on-policy distillation (OPD) changes model parameters. Across several language and vision-language model pairs, OPD updates are small and coordinate-sparse, distributed across layers and usually concentrated in feedforward networks (FFNs). Training only the identified subnetwork recovers nearly the same performance as full OPD. In the optimizer ablation, however, the sparsity-inducing SGD optimizer underperforms AdamW, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales for which AdamW's adaptive scaling remains useful. The updates are numerically full-rank but spectrally concentrated: they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. OPD therefore retains geometric signatures of on-policy post-training despite dense teacher supervision.
Training only the sparse subnetwork that on-policy distillation updates recovers nearly the performance of full OPD, suggesting that dense teacher supervision does not imply dense parameter rewriting.
On-policy distillation (OPD) has recently become a prominent post-training recipe because it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision. Yet how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small and coordinate-sparse. They are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales for which AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.
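The quantities named in the abstract can be made concrete with a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's measurement code: the function names, the tolerance `tol`, the subspace size `k`, and the Frobenius-energy definition of "lying in the principal singular subspaces" are all hypothetical choices made here. The masked step uses a plain gradient update only to show how training can be restricted to a discovered subnetwork; the abstract reports that AdamW performed better than SGD in the optimizer ablation.

```python
import numpy as np


def update_sparsity(w_src, w_new, tol=1e-5):
    """Fraction of coordinates whose absolute change is at most tol.

    tol is an assumed threshold for calling a coordinate 'unchanged'.
    """
    delta = w_new - w_src
    return float(np.mean(np.abs(delta) <= tol))


def principal_energy_fraction(w_src, delta, k):
    """Share of ||delta||_F^2 lying in the top-k singular subspaces of w_src.

    Assumed metric: project delta onto the span of the top-k left and
    right singular vectors of the source weights and compare energies.
    """
    u, _, vt = np.linalg.svd(w_src, full_matrices=False)
    u_k, v_k = u[:, :k], vt[:k, :].T
    core = u_k.T @ delta @ v_k  # coordinates of delta inside the subspace
    return float(np.sum(core ** 2) / np.sum(delta ** 2))


def masked_gradient_step(w, grad, mask, lr):
    """Gradient step applied only where mask is 1 (the chosen subnetwork)."""
    return w - lr * grad * mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_src = rng.normal(size=(64, 32))
    # Synthetic small, sparse update touching roughly 5% of coordinates.
    mask = (rng.random(w_src.shape) < 0.05).astype(w_src.dtype)
    delta = mask * rng.normal(scale=1e-3, size=w_src.shape)
    print("sparsity:", update_sparsity(w_src, w_src + delta))
    print("top-4 energy share:", principal_energy_fraction(w_src, delta, k=4))
```

A low energy share for small `k` would correspond to an update that sits mostly away from the principal singular subspaces of the source weights, while a high sparsity value corresponds to a coordinate-sparse update.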