The Hong Kong University of Science and Technology
LLM agents still fail to reliably automate real-world workflows, with even the best models succeeding on only two-thirds of tasks in a new live benchmark.
Agent-World shows that self-evolving environments with dynamic task synthesis can dramatically boost agent performance, with the resulting agents outperforming established models.
Even the best multimodal agents struggle with realistic visual scenarios, achieving only 27% accuracy on the new AgentVista benchmark that demands long-horizon tool use across web search, image search, and code.