Large reasoning models (LRMs) already know when to stop reasoning, but current sampling methods are holding them back.
Stop overfitting your reward model: R2M leverages real-time policy feedback to dynamically align the reward model with the evolving policy distribution, reducing reward overoptimization in RLHF.