Even reward models that reach the right answer can be dangerously wrong in their reasoning, leading to worse RLHF outcomes; R-Align fixes this by explicitly aligning rationales with gold-standard judgments.
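The summary doesn't spell out R-Align's training objective, so the following is only a rough sketch of what "aligning rationales with gold-standard judgments" could look like: a standard verdict loss plus a term pulling the model's rationale representation toward a gold rationale's embedding. The function name, the cosine-similarity pairing, and the `alpha` weight are all illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn.functional as F

def rationale_alignment_loss(verdict_logits: torch.Tensor,
                             gold_verdicts: torch.Tensor,
                             rationale_emb: torch.Tensor,
                             gold_rationale_emb: torch.Tensor,
                             alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch: combine judgment accuracy with rationale alignment."""
    # Standard judgment loss: did the reward model pick the correct verdict?
    judgment_loss = F.cross_entropy(verdict_logits, gold_verdicts)
    # Alignment term: cosine distance between the model's rationale embedding
    # and the embedding of a gold-standard rationale for the same example.
    align_loss = 1.0 - F.cosine_similarity(rationale_emb, gold_rationale_emb, dim=-1).mean()
    return judgment_loss + alpha * align_loss
```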
Forget complex RLHF pipelines: simple PPO with rule-based rewards can outperform state-of-the-art reasoning models while slashing training costs by 90%.
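The blurb doesn't list the rules themselves, but the general pattern behind rule-based rewards is simple: the reward is computed from the output string alone, with no learned reward model. Below is a minimal sketch of such a reward function; the `<think>` tag convention, the `\boxed{}` extraction regex, and the reward values are assumptions chosen for illustration.

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Illustrative rule-based reward: a format bonus plus an exact-match answer check."""
    reward = 0.0
    # Rule 1: small bonus for a well-formed reasoning trace (hypothetical tag convention).
    if "<think>" in completion and "</think>" in completion:
        reward += 0.1
    # Rule 2: extract the final boxed answer and compare it to the gold answer.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match and match.group(1).strip() == gold_answer.strip():
        reward += 1.0
    return reward
```

In a PPO loop, this scalar would stand in for the learned reward model's score on each sampled completion, which is what removes the reward-model training and inference cost from the pipeline.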