Shanghai Jiao Tong University
DPO's "implicit reward" reparameterization leads to suboptimal regularization; EXPO fixes this with explicit, intuitive regularization factors that provably perform better.