Meituan
Test-time RL's vulnerability to noisy pseudo-labels is amplified by group-relative advantage estimation, but can be mitigated with a surprisingly simple debiasing and denoising approach.
Forget slow and steady: "Fast Thinking" prompts, combined with carefully tuned reward functions and REINFORCE, can dramatically boost the performance of RL-trained research agents.