Finally, a single algorithm, DPO-COV, tackles three problems that plague RLHF and DPO at once: corrupted preferences, reward overoptimization, and verbosity. It also comes with theoretical guarantees.