LLM agents that ace benchmarks often fall apart in real-world software engineering workflows, with completion rates plummeting from 100% to just 20% as tasks become more complex.
Current mobile GUI agents are surprisingly inept at everyday smartphone tasks, achieving only 62% success on a new benchmark of real-world Android apps.