CMAP - Centre de Mathématiques Appliquées de l'Ecole polytechnique (Route de Saclay, 91128 Palaiseau Cedex - France), MBZUAI - Mohamed bin Zayed University of Artificial Intelligence (United Arab Emirates)
Ditch reward models: Nash Mirror Prox achieves fast, stable convergence to a Nash equilibrium directly from human preferences, sidestepping the limitations of traditional RLHF.