Cornell University
Attention from your LLM can significantly improve preference optimization, outperforming existing methods without a separate reward model or heuristic token weighting.
Forget expensive human annotation: this self-play method lets LLMs bootstrap their own training signals for open-ended tasks by generating rubrics to evaluate their own outputs.