Multi-iteration experience learning in LLMs can lead to capability collapse, but strategic adjustments in experience granularity and injection patterns can stabilize and enhance performance.
The "free lunch" of dense token-level reward in on-policy distillation (OPD) may be an illusion: teacher novelty, not just higher scores, drives successful distillation.
Tool-using agents may seem capable, but they struggle to distinguish neutral actions from errors, highlighting a critical need for better step-level process understanding.
Students can surpass their teachers in on-policy distillation by extrapolating rewards and merging knowledge from domain experts, challenging the conventional wisdom that students are inherently limited by their teachers' capabilities.