State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Current autonomous agent benchmarks check only final outputs, and as a result they miss nearly half of safety violations and over 10% of robustness failures. Claw-Eval directly addresses this problem.