DPO's reliance on a reference policy can backfire, prematurely halting learning when the reference is pessimistically wrong, but a simple one-line fix can significantly improve performance.
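
The premature-halting effect can be illustrated with the standard DPO loss, in which the reference policy enters only through its log-probability margin between the chosen and rejected responses. The following is a minimal sketch, not code from the paper: the function name `dpo_loss_and_grad_weight` and the margin values are hypothetical and chosen only for illustration. It does not implement the one-line fix, which the summary does not specify.

```python
import math

def dpo_loss_and_grad_weight(policy_margin, ref_margin, beta=1.0):
    """Standard DPO loss for a single preference pair.

    policy_margin: log pi(y_w|x) - log pi(y_l|x) under the policy being trained
    ref_margin:    the same quantity under the frozen reference policy
    (hypothetical helper; both margins are plain floats here)
    """
    logit = beta * (policy_margin - ref_margin)
    sigma = 1.0 / (1.0 + math.exp(-logit))
    loss = -math.log(sigma)
    grad_weight = 1.0 - sigma  # magnitude of d(loss)/d(logit)
    return loss, grad_weight

# Illustrative values: the policy still prefers the rejected response.
policy_margin = -2.0

# Pessimistic reference: it prefers the rejected response even more strongly.
loss_pess, grad_pess = dpo_loss_and_grad_weight(policy_margin, ref_margin=-10.0)

# Neutral reference: no preference between the two responses.
loss_neutral, grad_neutral = dpo_loss_and_grad_weight(policy_margin, ref_margin=0.0)

print(f"pessimistic reference: loss={loss_pess:.6f}, grad weight={grad_pess:.6f}")
print(f"neutral reference:     loss={loss_neutral:.6f}, grad weight={grad_neutral:.6f}")
```

With the pessimistic reference, the loss and its gradient weight are already close to zero even though the policy still ranks the rejected response above the chosen one, so the pair contributes almost no further learning signal. With the neutral reference, the same policy receives a large gradient.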