IBM Research, Massachusetts Institute of Technology
MIT CSAIL
Exposes systematic gaps in AI evaluation reporting: inconsistencies that hinder reliable comparisons across thousands of models and benchmarks.
Stop re-running full benchmarks: Calibrate new LLM datasets against existing suites with just 100 "anchor" questions and still get highly accurate performance predictions.