Current mobile GUI agents still struggle with everyday smartphone tasks, achieving only 62% success on a new benchmark of real-world Android apps.
Current verifiers often reward correct answers derived from flawed reasoning, but PRIME offers a benchmark to identify and select verifiers that actually penalize incorrect derivations.
Reward models that reach the right answer can still be wrong in their reasoning, which leads to worse RLHF outcomes; R-Align addresses this by explicitly aligning rationales with gold-standard judgments.