Forget reward model fitting: these primal-dual policy gradient methods offer provably safe and convergent RLHF in infinite-horizon settings.