University College London, Nanjing University
Today's agents are surprisingly bad at real-world terminal tasks, with even frontier models failing nearly 40% of the time on everyday workflows.