Heterogeneous agents can boost each other's performance in RL without coordinated deployment, achieving better results with less data than traditional methods.
Large reasoning models (LRMs) already know when to stop reasoning, but current sampling methods are holding them back.
Stop overfitting your reward model: R2M leverages real-time policy feedback to dynamically align the reward model with the evolving policy distribution, reducing reward overoptimization in RLHF.
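The teaser doesn't spell out R2M's actual procedure, so the following is only a minimal, hypothetical sketch of the general idea it gestures at: periodically fine-tuning the reward model on fresh pairs drawn from the current policy, so the reward model's training distribution tracks the policy's evolving output distribution instead of a frozen preference dataset. All class and function names here are illustrative assumptions, not R2M's API.

```python
# Hypothetical sketch of refreshing a reward model on on-policy samples.
# This is NOT the R2M algorithm; it only illustrates the general idea of
# keeping the reward model aligned with the evolving policy distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Toy scalar reward head over fixed-size response features."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)


def refresh_reward_model(rm, chosen, rejected, lr=1e-3, steps=10):
    """Fine-tune the reward model on fresh on-policy preference pairs.

    `chosen` / `rejected` are feature tensors for preferred and
    dispreferred responses sampled from the *current* policy (assumed
    to be labeled by some oracle or annotator between RLHF updates).
    """
    opt = torch.optim.Adam(rm.parameters(), lr=lr)
    for _ in range(steps):
        margin = rm(chosen) - rm(rejected)
        # Bradley-Terry pairwise loss, the standard RLHF reward objective
        loss = -F.logsigmoid(margin).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return rm


if __name__ == "__main__":
    rm = RewardModel()
    # Stand-ins for features of fresh generations from the current policy
    chosen, rejected = torch.randn(32, 16), torch.randn(32, 16)
    refresh_reward_model(rm, chosen, rejected)
```

Interleaving such refreshes with policy updates, rather than training the reward model once up front, is one way to limit the distribution shift that drives reward overoptimization; how R2M derives its "real-time policy feedback" is not specified in the blurb above.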