Hugging Face (United States)
Ditch reward models: Nash Mirror Prox achieves fast, stable convergence to a Nash equilibrium directly from human preferences, sidestepping the limitations of traditional RLHF.