RLHF's two-stage pipeline (reward modeling followed by policy optimization) can statistically outperform DPO when learning from implicitly sparse rewards, challenging the narrative that end-to-end preference optimization is always superior.
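For context, the contrast is between DPO's single end-to-end objective and RLHF's separate reward-modeling stage. Below is a minimal sketch of the standard DPO loss (negative log-sigmoid of the implicit reward margin under a Bradley-Terry model); it is illustrative only and not code from the work summarized above, and the toy log-probability values are hypothetical.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) preference pair.

    logp_* are the policy's log-probs of the chosen (w) and rejected (l)
    responses; ref_logp_* are the frozen reference model's log-probs.
    """
    # Implicit reward margin: beta times the difference of log-ratios
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin (Bradley-Terry preference likelihood)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy log-probabilities (hypothetical, for illustration)
loss_no_margin = dpo_loss(-2.0, -2.0, -2.0, -2.0)  # zero margin -> loss = log 2
loss_separated = dpo_loss(-1.0, -3.0, -2.0, -2.0)  # policy prefers winner -> lower loss
```

By contrast, two-stage RLHF first fits an explicit reward model on the same preference pairs and then optimizes the policy against it (e.g. with PPO), which is where the sparsity structure of the rewards can matter.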