Current verifiers often reward correct answers derived from flawed reasoning, but PRIME offers a benchmark to identify and select verifiers that actually penalize incorrect derivations.
Reward models that reach the right verdict can still reason incorrectly, which leads to worse RLHF outcomes; R-Align addresses this by explicitly aligning their rationales with gold-standard judgments.