Chinese Academy of Sciences
LLMs trained with ScaleBox, a new high-fidelity code verification system, substantially outperform those trained with heuristic matching, suggesting current RLHF methods are bottlenecked by verification quality.
LLMs trained with reinforcement learning from verifiable rewards (RLVR) become overconfident in incorrect answers, but a simple fix, decoupling the reasoning and calibration objectives, can restore proper calibration without sacrificing accuracy.