Ditch reward maximization: a new RL objective learns the *distribution* of reasoning advantages, boosting LLM accuracy and diversity with no extra training cost.