JD.com · May 28, 2026 · arXiv:2605.30434

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Kewei Xu, Xiaobe Lu, Shuofei Qiao, Haoming Xu, Lei Liang, Ningyu Zhang

AI Summary

The paper introduces LongDS-Bench, a new benchmark designed to evaluate the ability of agents to perform long-horizon, multi-turn data analysis by maintaining and updating analytical states. This benchmark, derived from real-world Kaggle notebooks, comprises 68 tasks with an average dependency span of 11.3 turns across six domains. Experiments with state-of-the-art models reveal significant performance degradation over longer horizons, with the best model achieving only 48.45% average accuracy and a 47-point drop from early to late turns, highlighting the challenge of maintaining correct analytical states.

Key Contribution

Today's best data analysis agents lose track of the analytical state as conversations lengthen, dropping nearly 47 points in accuracy from early to late turns.

Abstract

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52% to 69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.
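The state-evolution patterns named in the abstract can be made concrete with a toy example. The sketch below is illustrative only and is not taken from the LongDS code or data: the `AnalysisState` class, its method names, the `price` column, and the numbers are all hypothetical. It shows what an agent would need to keep straight across turns: apply a counterfactual perturbation, roll back to an earlier state, and then compose results from two saved states.

```python
from copy import deepcopy


class AnalysisState:
    """Hypothetical container for an evolving analysis with named snapshots."""

    def __init__(self, rows):
        self.rows = list(rows)
        self._snapshots = {}

    def checkpoint(self, name):
        # Save an independent copy of the current rows under a name.
        self._snapshots[name] = deepcopy(self.rows)

    def update(self, fn):
        # Apply a per-row transformation to the current state.
        self.rows = [fn(dict(row)) for row in self.rows]

    def rollback(self, name):
        # Restore the current state from a named snapshot.
        self.rows = deepcopy(self._snapshots[name])

    def snapshot(self, name):
        # Read a saved state without changing the current one.
        return deepcopy(self._snapshots[name])


def mean_price(rows):
    return sum(row["price"] for row in rows) / len(rows)


def add_five(row):
    row["price"] = row["price"] + 5
    return row


# Early turn: load the data and record the baseline state.
state = AnalysisState([{"price": 10}, {"price": 20}, {"price": 30}])
state.checkpoint("baseline")

# Counterfactual perturbation: "what if every price were 5 higher?"
state.update(add_five)
state.checkpoint("plus_five")
counterfactual_mean = mean_price(state.rows)

# Rollback: a later turn asks about the original data again.
state.rollback("baseline")
baseline_mean = mean_price(state.rows)

# Multi-state composition: a later turn compares two earlier states.
delta = mean_price(state.snapshot("plus_five")) - mean_price(state.snapshot("baseline"))

print(baseline_mean, counterfactual_mean, delta)
```

An agent that keeps using the perturbed rows after the rollback request would report 25.0 rather than 20.0 for the baseline mean, which is the kind of stale-state error that could propagate to every later turn depending on it.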

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 158

Year: 2026

Venue: N/A
