Tsinghua University
RLHF's two-stage approach can statistically outperform DPO when learning from implicitly sparse rewards, challenging the narrative that end-to-end preference optimization is always superior.
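For context, the two approaches being compared differ in where the preference signal enters. A minimal sketch (function names and the `beta=0.1` default are illustrative, not from the source): DPO trains the policy end-to-end with a logistic loss on the implicit reward margin, while the first stage of RLHF fits an explicit reward model with the Bradley-Terry objective, and a second stage then optimizes that reward under a KL penalty.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """End-to-end DPO: -log sigmoid of the implicit reward margin
    between the preferred (w) and dispreferred (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def reward_model_loss(r_w: float, r_l: float) -> float:
    """RLHF stage one: Bradley-Terry fit of an explicit reward model;
    stage two (not shown) maximizes this reward under a KL constraint."""
    return -math.log(sigmoid(r_w - r_l))

# With no preference signal, both losses reduce to log 2.
print(round(dpo_loss(0.0, 0.0, 0.0, 0.0), 4))   # 0.6931
print(round(reward_model_loss(0.0, 0.0), 4))    # 0.6931
```

The structural difference the claim turns on is that the reward model in the two-stage pipeline is fit separately, so it can smooth or generalize a sparse preference signal before the policy ever sees it.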