This paper introduces a new NL2SQL robustness benchmark with approximately ten types of perturbations to evaluate LLMs in both traditional and agentic settings. The benchmark assesses LLM performance against surface-level noise (e.g., character corruption) and linguistic variations that preserve semantics. Experiments with Grok-4.1, Gemini-3-Pro, Claude-Opus-4.6, and GPT-5.2 reveal that while LLMs are generally robust, they exhibit significant performance degradation under surface noise in traditional pipelines and under linguistic variations in agentic settings.
LLMs can ace NL2SQL benchmarks, but throw in some typos or rephrase the question, and their performance tanks, especially in agentic settings.
Robustness evaluation for Natural Language to SQL (NL2SQL) systems is essential because real-world database environments are dynamic, noisy, and continuously evolving, whereas conventional benchmark evaluations typically assume static schemas and well-formed user inputs. In this work, we introduce a robustness evaluation benchmark containing approximately ten types of perturbations and conduct evaluations under both traditional and agentic settings. We assess multiple state-of-the-art large language models (LLMs), including Grok-4.1, Gemini-3-Pro, Claude-Opus-4.6, and GPT-5.2. Our results show that these models generally maintain strong performance under several perturbations; however, notable performance degradation is observed for surface-level noise (e.g., character-level corruption) and linguistic variation that preserves semantics while altering lexical or syntactic forms. Furthermore, we observe that surface-level noise causes larger performance drops in traditional pipelines, whereas linguistic variation presents greater challenges in agentic settings. These findings highlight the remaining challenges in achieving robust NL2SQL systems, particularly in handling linguistic variability.
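To make the surface-level noise category concrete, here is a minimal sketch of a character-level corruption perturbation of the kind the abstract describes (random deletion, adjacent swap, or substitution of letters in the user's question). The function name, corruption rate, and the three edit operations are illustrative assumptions, not the paper's actual implementation:

```python
import random

def corrupt_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Apply character-level corruption (delete, swap, or substitute)
    to roughly `rate` of the alphabetic characters, simulating the
    surface-level noise perturbation. Digits and punctuation (e.g.,
    literal values like years) are left untouched."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if chars[i].isalpha() and rng.random() < rate:
            op = rng.choice(["delete", "swap", "substitute"])
            if op == "delete":
                i += 1          # drop this character
                continue
            if op == "swap" and i + 1 < len(chars):
                out.append(chars[i + 1])  # transpose with the next char
                out.append(chars[i])
                i += 2
                continue
            # substitute with a random lowercase letter
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
            i += 1
            continue
        out.append(chars[i])
        i += 1
    return "".join(out)

# Example: perturb a natural-language question before NL2SQL translation
question = "show all employees hired after 2020"
noisy = corrupt_chars(question, rate=0.3, seed=1)
```

A seeded RNG keeps the perturbed benchmark reproducible across evaluation runs; the semantics-preserving linguistic variations (paraphrases, syntactic rewrites) would instead require a rewriting model rather than a character-level procedure like this.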