Faculty of Data and Decision Science, Technion, IBM Research
Agents that excel on traditional benchmarks may crumble under the pressure of newly synthesized tasks, revealing the limitations of current evaluation methods.
Stop re-running full benchmarks: Calibrate new LLM datasets against existing suites with just 100 "anchor" questions and still get highly accurate performance predictions.
AI agents are far better at automating data engineering tasks than previously thought, but flawed benchmarks are obscuring their true potential.