Correcting for suboptimal behavior during preference learning unlocks substantial gains in offline RLHF and improves online performance in continuous control tasks.