Search papers, labs, and topics across Lattice.
1
0
3
DPO's instability problem is solved with a new loss function that directly targets a finite logits difference, leading to better alignment and preventing reward hacking.