On-policy distillation's "free lunch" of dense token-level reward may be an illusion: teacher novelty, not merely higher scores, appears to drive successful distillation.
Intrinsic reward signals in unsupervised RL for LLMs inevitably collapse as the model's prior sharpens, but external rewards grounded in computational asymmetries offer a path to sustained scaling.