Initializing the DPO reference model *before* training, rather than as an identical copy of the policy, unlocks better preference optimization and outperforms standard DPO.
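The standard DPO objective only consumes per-example log-probabilities from a frozen reference model, so nothing in the loss itself requires that reference to be a copy of the initial policy. A minimal sketch of the loss makes this explicit (the function name and scalar formulation here are illustrative, not from the paper):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin).

    The reference log-probs may come from ANY frozen model --
    swapping in a differently initialized reference changes only
    these two inputs, not the objective.
    """
    # Implicit reward margin of the policy relative to the reference.
    margin = (policy_chosen_logp - policy_rejected_logp) \
             - (ref_chosen_logp - ref_rejected_logp)
    # -log sigmoid(x) == log(1 + exp(-x)), written stably.
    return math.log1p(math.exp(-beta * margin))
```

For example, a policy that already prefers the chosen response (`dpo_loss(-1.0, -2.0, -1.5, -1.5)`) yields a lower loss than one with no preference margin, and the gradient pushes the policy's margin above the reference's regardless of how the reference was initialized.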