Direct Preference Optimization (DPO) can be rescued from performance collapse with a simple importance-sampling fix, especially when regularization is weak.
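For context, a minimal sketch of the standard per-pair DPO loss, with an optional importance weight attached to illustrate where such a correction could enter. The `is_weight` parameter and its placement are assumptions for illustration only; the source does not specify the exact form of the fix.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, is_weight=1.0):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l:       policy log-probs of the chosen / rejected response
    ref_logp_w / ref_logp_l: reference-model log-probs of the same responses
    beta:                  KL-regularization strength
    is_weight:             hypothetical importance weight (illustrative only;
                           not the paper's exact correction)
    """
    # Implicit reward margin between chosen and rejected responses
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin, scaled by the importance weight
    return -is_weight * math.log(1.0 / (1.0 + math.exp(-margin)))
```

With identical policy and reference log-probs the margin is zero and the loss is `log 2`; a policy that already prefers the chosen response yields a smaller loss.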