Optimistic Multi-step Preference Optimization (OMPO) is built upon the optimistic online mirror descent algorithm. A rigorous convergence analysis shows that OMPO requires O(ϵ⁻¹) policy updates to converge to an ϵ-approximate Nash equilibrium.
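
To give a feel for the building block named above, here is a minimal sketch of optimistic online mirror descent on a toy two-player zero-sum matrix game. This is not OMPO itself: the rock-paper-scissors matrix, the entropic mirror map (which turns each step into a multiplicative-weights update), the step size, the iteration count, and all function names are assumptions chosen purely for illustration. The duality gap printed at the end measures how far the pair of strategies is from a Nash equilibrium, so a gap of at most ϵ corresponds to an ϵ-approximate equilibrium.

```python
import math

# Illustrative loss matrix for the row player (rock-paper-scissors).
# The row player minimizes x^T A y, the column player maximizes it.
A = [[0.0, -1.0, 1.0],
     [1.0, 0.0, -1.0],
     [-1.0, 1.0, 0.0]]


def row_losses(A, y):
    """Loss of each pure row action against the mixed column strategy y."""
    return [sum(a * yj for a, yj in zip(row, y)) for row in A]


def col_payoffs(A, x):
    """Payoff of each pure column action against the mixed row strategy x."""
    return [sum(A[i][j] * x[i] for i in range(len(x))) for j in range(len(A[0]))]


def mirror_step(p, grad, eta):
    """Entropic mirror descent step: reweight by exp(-eta * grad), renormalize."""
    w = [pi * math.exp(-eta * g) for pi, g in zip(p, grad)]
    total = sum(w)
    return [wi / total for wi in w]


def duality_gap(A, x, y):
    """Best-response value against x minus best-response value against y."""
    return max(col_payoffs(A, x)) - min(row_losses(A, y))


def optimistic_omd(A, x, y, eta=0.1, steps=5000):
    """Simultaneous optimistic updates using the prediction 2*g_t - g_{t-1}."""
    gx_prev = row_losses(A, y)
    gy_prev = col_payoffs(A, x)
    for _ in range(steps):
        gx = row_losses(A, y)
        gy = col_payoffs(A, x)
        # Row player descends on its loss; column player ascends on its payoff,
        # which is written as descent on the negated payoff.
        x = mirror_step(x, [2 * g - gp for g, gp in zip(gx, gx_prev)], eta)
        y = mirror_step(y, [-(2 * g - gp) for g, gp in zip(gy, gy_prev)], eta)
        gx_prev, gy_prev = gx, gy
    return x, y


if __name__ == "__main__":
    x0, y0 = [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]
    x, y = optimistic_omd(A, x0, y0)
    print("initial gap:", duality_gap(A, x0, y0))
    print("final gap:  ", duality_gap(A, x, y))
```

The only difference from plain mirror descent is the gradient fed to each step: the current gradient is extrapolated with the previous one (2·g_t − g_{t−1}), which in this toy game lets the iterates themselves approach the equilibrium instead of cycling around it.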