OPD's "free lunch" of dense token-level reward may be an illusion: teacher novelty, not just higher scores, drives successful distillation.
LLMs can achieve massive performance gains on reasoning and knowledge-intensive tasks simply by iteratively refining their answers using pseudo-labels derived from unlabeled data.
Intrinsic reward signals in unsupervised RL for LLMs inevitably collapse due to sharpening of the model's prior, but external rewards grounded in computational asymmetries offer a path to sustained scaling.