Inria, ENS MVA
Re-training LLMs on their own generated content can fundamentally limit what they can learn, but only under specific, theoretically defined conditions related to generation quality.
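As a toy illustration of why generation quality matters (this is a deliberately degenerate sketch, not the paper's theoretical construction), consider a "model" that simply memorises and resamples its training set. Pure self-training then becomes a Wright-Fisher-style resampling chain whose diversity can only shrink, while mixing in a fraction of fresh real data each round keeps it from collapsing; the `real_fraction` knob below is an illustrative stand-in for generation quality.

```python
# Toy sketch, not the paper's construction: a model that memorises and
# resamples its training data. Retraining purely on its own output loses
# diversity every round and eventually collapses to a single point;
# injecting a small fraction of fresh real data keeps diversity alive.
import numpy as np

rng = np.random.default_rng(0)
N = 200                                       # training-set size per round

def diversity_after(rounds, real_fraction):
    data = rng.normal(size=N)                 # initial real data
    for _ in range(rounds):
        k = int(real_fraction * N)
        fresh = rng.normal(size=k)            # new real samples, if any
        synth = rng.choice(data, size=N - k)  # the model's own generations
        data = np.concatenate([fresh, synth])
    return len(np.unique(data))               # distinct values remaining

print("pure self-training:", diversity_after(2000, 0.0))   # collapses toward 1
print("5% fresh real data:", diversity_after(2000, 0.05))  # stays well above 1
```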
Ditch reward models: Nash Mirror Prox achieves fast, stable convergence to a Nash equilibrium directly from human preferences, sidestepping the limitations of traditional RLHF.
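As a rough sketch of the underlying mechanics (illustrative only, not the paper's algorithm), mirror prox, Nemirovski's extragradient method with an entropic mirror map, can be run directly on the two-player zero-sum game induced by a pairwise preference matrix, with no reward model in between. The matrix `P` and all hyperparameters below are toy assumptions.

```python
# Illustrative sketch: mirror prox (extragradient + multiplicative weights)
# on the symmetric game defined by pairwise preferences. P[i, j] is the
# probability that response i is preferred to response j, so P + P.T == 1.
import numpy as np

rng = np.random.default_rng(0)
n = 5
raw = rng.random((n, n))
P = raw / (raw + raw.T)          # toy preference matrix: P[i,j] + P[j,i] = 1
A = P - 0.5                      # antisymmetric payoff of the zero-sum game

def md_step(p, grad, eta):
    """Entropic mirror-descent step on the simplex (multiplicative weights)."""
    q = p * np.exp(eta * grad)
    return q / q.sum()

eta, T = 0.5, 500
pi = np.full(n, 1.0 / n)         # max player: current policy
mu = np.full(n, 1.0 / n)         # min player: opponent policy
pi_avg = np.zeros(n)
for _ in range(T):
    # Extrapolation: look-ahead step using gradients at the current point.
    pi_half = md_step(pi, A @ mu, eta)
    mu_half = md_step(mu, -(A.T @ pi), eta)
    # Update: step again from the current point with look-ahead gradients.
    pi = md_step(pi, A @ mu_half, eta)
    mu = md_step(mu, -(A.T @ pi_half), eta)
    pi_avg += pi_half            # averaged look-ahead iterates carry the
pi_avg /= T                      # classical O(1/T) mirror prox guarantee

# At the symmetric Nash equilibrium no single response beats the policy
# more than half the time; exploitability measures the gap to that point.
print("pi*:", np.round(pi_avg, 3))
print("exploitability:", (P @ pi_avg).max() - 0.5)
```

The look-ahead step is what buys the stability: plain multiplicative weights tends to cycle around the equilibrium of such games, while the extragradient correction damps the rotation.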