DPO's reliance on a reference policy can backfire, prematurely halting learning when the reference is pessimistically wrong, but a simple one-line fix can significantly improve performance.
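For context, the reference policy enters the standard DPO objective (Rafailov et al., 2023) only through per-response log-ratios. A minimal statement of that loss, with $\sigma$ the logistic function and $\beta$ the KL-penalty strength:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

One way to read "prematurely halting learning": when $\pi_{\mathrm{ref}}$ assigns the preferred response $y_w$ an unduly low probability, the winning log-ratio looks large even while $\pi_\theta(y_w \mid x)$ is still small, so the sigmoid saturates and the gradient vanishes before the policy has actually learned the behavior.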
LLMs learn better from AI *reward* than AI *preference*, leading to higher human-AI agreement and improved performance compared to standard online AI feedback and RLHF.
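A minimal sketch of the distinction, assuming a generic judge-LLM API; `query_judge`, the prompt templates, and the 1-to-10 scale are illustrative stand-ins, not the paper's protocol:

```python
def query_judge(prompt: str) -> str:
    """Hypothetical call to a judge LLM; swap in a real API client."""
    raise NotImplementedError

def ai_preference(prompt: str, resp_a: str, resp_b: str) -> str:
    """AI *preference*: the judge labels which of two responses is better.

    Yields binary pairwise labels, typically used to fit a separate
    preference/reward model before RL.
    """
    return query_judge(
        f"Prompt: {prompt}\n"
        f"Response A: {resp_a}\n"
        f"Response B: {resp_b}\n"
        "Which response is better? Answer with exactly 'A' or 'B'."
    ).strip()

def ai_reward(prompt: str, resp: str) -> float:
    """AI *reward*: the judge scores a single response directly.

    Yields a scalar usable as the reward signal for online RL, with no
    intermediate preference-model fitting.
    """
    return float(query_judge(
        f"Prompt: {prompt}\n"
        f"Response: {resp}\n"
        "Rate the response's quality from 1 to 10. Answer with a number only."
    ).strip())
```

The operational difference is that the reward variant hands the RL step a per-sample scalar, while the preference variant only constrains pairwise orderings.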