LLMs can reason better and generate more diverse outputs by projecting negative samples onto a positive subspace during reinforcement learning.
Self-play can be dramatically improved by exploiting the "question construction path" it generates as privileged information for self-distillation, leading to 2-3x faster learning.