The Hebrew University of Jerusalem
Systematic gaps in AI evaluation reporting are exposed, revealing inconsistencies that hinder reliable comparisons across thousands of models and benchmarks.
Agents that excel on traditional benchmarks may crumble under the pressure of newly synthesized tasks, revealing the limitations of current evaluation methods.