Search papers, labs, and topics across Lattice.
Intrinsic reward signals in unsupervised RL for LLMs inevitably collapse due to sharpening of the model's prior, but external rewards grounded in computational asymmetries offer a path to sustained scaling.
LLM agents can learn to explore novel states and generalize to new tasks using a hybrid on- and off-policy RL framework that leverages memory.