ENS de Lyon - École normale supérieure de Lyon (15 parvis René Descartes - BP 7000 - 69342 Lyon Cedex 07 - France)
Ditch reward models: Nash Mirror Prox achieves fast, stable convergence to a Nash equilibrium directly from human preferences, sidestepping the limitations of traditional RLHF.