Peking University
Stop wrestling with evaluation codebases: One-Eval automates LLM evaluation from natural language requests, handling benchmark selection, dataset normalization, and metric reporting with minimal user effort.
Current multimodal agents are surprisingly bad at web browsing, achieving only 36% accuracy on a new benchmark designed to test deep multimodal reasoning across web pages.