RWTH | Universitat Rovira i Virgili | Feb 25, 2026 | arXiv:2602.21480

Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?

Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas

AI Summary

This paper addresses the gap in evaluating Text-to-SQL systems within Big Data workflows, highlighting the inadequacy of existing Text-to-SQL benchmarks for assessing performance at scale. The authors introduce novel metrics tailored for "Text-to-Big SQL" that account for execution efficiency, cost, and data scale impact. Through an extensive evaluation of production-level LLM agents, the study demonstrates that traditional text-to-SQL metrics fail to capture the complexities of Big Data environments, while the proposed metrics provide a more accurate reflection of real-world performance.

Key Contribution

Text-to-SQL benchmarks fail to capture the real-world cost and performance bottlenecks that emerge when LLMs are used to generate SQL queries for Big Data.

Abstract

Text-to-SQL and Big Data are both extensively benchmarked fields, yet little research evaluates them jointly. In the real world, Text-to-SQL systems are often embedded in Big Data workflows, such as large-scale data processing or interactive data analytics. We refer to this as "Text-to-Big SQL". However, existing Text-to-SQL benchmarks remain narrowly scoped and overlook the cost and performance implications that arise at scale. For instance, translation errors that are minor on small datasets lead to substantial cost and latency overheads as data scales, an issue that Text-to-SQL metrics ignore entirely. In this paper, we address this overlooked challenge by introducing novel and representative metrics for evaluating Text-to-Big SQL. Our study focuses on production-level LLM agents, database-agnostic systems adaptable to diverse user needs. Via an extensive evaluation of frontier models, we show that Text-to-SQL metrics are insufficient for Big Data. In contrast, our proposed Text-to-Big SQL metrics accurately reflect execution efficiency, cost, and the impact of data scale. Furthermore, we provide LLM-specific insights, including fine-grained, cross-model comparisons of latency and cost.
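The abstract's claim that a small translation error becomes expensive as data grows can be illustrated with a toy calculation. The sketch below is not one of the paper's metrics: it assumes a simple pay-per-byte-scanned cost model, and the price, the scanned fractions, and all names (`QueryProfile`, `scan_cost_usd`, `overhead_usd`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical price per TiB scanned; an assumption, not a figure from the paper.
PRICE_PER_TIB_USD = 5.0
TIB = 1024 ** 4


@dataclass(frozen=True)
class QueryProfile:
    """Fraction of a table that a query scans (illustrative values only)."""
    name: str
    scanned_fraction: float


def scan_cost_usd(table_bytes: int, query: QueryProfile) -> float:
    """Cost of one execution under an assumed pay-per-byte-scanned model."""
    return table_bytes * query.scanned_fraction / TIB * PRICE_PER_TIB_USD


def overhead_usd(table_bytes: int, reference: QueryProfile, generated: QueryProfile) -> float:
    """Extra cost of the generated query relative to the reference query."""
    return scan_cost_usd(table_bytes, generated) - scan_cost_usd(table_bytes, reference)


# A reference query that prunes well versus a generated query that returns
# the same rows but scans four times as much data (both fractions are made up).
reference = QueryProfile("reference", 0.10)
generated = QueryProfile("generated", 0.40)

if __name__ == "__main__":
    for label, size in (("1 GiB", 1024 ** 3), ("1 TiB", TIB), ("100 TiB", 100 * TIB)):
        extra = overhead_usd(size, reference, generated)
        print(f"{label:>8}: extra cost per run = ${extra:.4f}")
```

Under these assumptions the two queries would look identical to a correctness-only metric, yet the extra cost per run grows linearly with table size, which is the kind of effect the abstract argues such metrics miss.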

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
