Peking University
LLM agents still fail to reliably automate real-world workflows, with even the best models succeeding on only two-thirds of tasks in a new live benchmark.
Current autonomous agent benchmarks miss nearly half of safety violations and over 10% of robustness failures because they only check final outputs, a problem Claw-Eval directly addresses.