Forget complex RLHF pipelines: simple PPO with rule-based rewards can outperform state-of-the-art reasoning models while slashing training costs by 90%.