Columbia University ♠, New York University ♢, Barnard College
Even state-of-the-art LLMs such as GPT-5.2 falter on LakeQA, a benchmark that demands both search and multi-hop reasoning, scoring just 18.37%.
VISTA shows that combining UI and API interactions can make agent evaluations markedly more realistic and comprehensive than existing benchmarks.