ETH, Faculty of Data and Decision Science, IBM Research, Technion, UIUC
Mar 31, 2026
arXiv:2603.29399

ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

Andrea Giovannini, Tengjun Jin, Yotam Perlitz

AI Summary

The paper re-evaluates ELT-Bench, a benchmark for AI agents constructing ELT pipelines, and finds that the initially low success rates significantly underestimated agent capabilities because of benchmark quality issues. The authors develop an Auditor-Corrector methodology that uses LLMs and human validation to identify and correct errors in the benchmark's evaluation scripts, specifications, and ground truth. With ELT-Bench-Verified, a revised benchmark, they demonstrate that correcting these errors leads to significant performance improvements for AI agents, highlighting the importance of benchmark quality in evaluating complex agentic tasks.

Key Contribution

AI agents are far better at automating data engineering tasks than previously thought, but flawed benchmarks are obscuring their true potential.

Abstract

Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality. Applying this to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors (including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth) that penalize correct agent outputs. Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re-evaluating on this version yields significant improvement attributable entirely to benchmark correction. Our results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. We release ELT-Bench-Verified to provide a more reliable foundation for progress in AI-driven data engineering automation.
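
For readers unfamiliar with the agreement statistic quoted in the abstract, the following is a minimal sketch of how Fleiss' kappa is computed from a table of rating counts. It is not the authors' validation code: the function name `fleiss_kappa`, the two error categories, and the toy ratings are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j.
    Every item must be rated by the same number of annotators (>= 2)."""
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one rated item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two annotators per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Observed agreement: per-item share of annotator pairs that agree,
    # averaged over items.
    pairs = n_raters * (n_raters - 1)
    p_items = [(sum(c * c for c in row) - n_raters) / pairs for row in counts]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(n_categories)
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: 4 failed tasks, 3 annotators, 2 categories
# (column 0 = "benchmark error", column 1 = "agent error").
toy_ratings = [[3, 0], [2, 1], [0, 3], [3, 0]]
print(round(fleiss_kappa(toy_ratings), 3))  # 0.625 for this toy table
```

Kappa rescales observed agreement against the agreement expected by chance, so a value of 1 means perfect agreement and a value near 0 means agreement no better than chance.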

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Tool Use & Agents

Year: 2026

Venue: N/A
