Search papers, labs, and topics across Lattice.
3
0
6
4
OPD's "free lunch" of dense token-level reward may be an illusion, as teacher novelty, not just higher scores, drives successful distillation.
Tool-using agents may seem capable, but they struggle to distinguish neutral actions from errors, highlighting a critical need for better step-level process understanding.
Students can surpass their teachers in on-policy distillation by extrapolating rewards and merging knowledge from domain experts, challenging the conventional wisdom that students are inherently limited by their teachers' capabilities.