Stop overfitting your reward model: R2M leverages real-time policy feedback to dynamically align the reward model with the evolving policy distribution, reducing reward overoptimization in RLHF.