Search papers, labs, and topics across Lattice.
Peking University
2
0
5
Most RLVR datasets are just remixes of a few originals, and this paper shows how to trace them back to their source, revealing widespread data contamination.
Fine-grained control over reward signals unlocks significant gains in multi-trait essay scoring, outperforming standard policy optimization techniques.