Search papers, labs, and topics across Lattice.
1
2
DPO's "implicit reward" reparameterization leads to suboptimal regularization, but EXPO offers a fix with explicit, intuitive regularization factors that provably work better.