LMO - Laboratoire de Mathématiques d'Orsay (Bâtiment 307, 91405, Orsay cedex - France), CMAP - Centre de Mathématiques Appliquées de l'Ecole polytechnique (Route de Saclay, 91128 Palaiseau Cedex - France)
Ditch reward models: Nash Mirror Prox achieves fast, stable convergence to a Nash equilibrium directly from human preferences, sidestepping the limitations of traditional RLHF.