Shanghai AI Lab, Tencent AI | Feb 15, 2026 | arXiv:2602.14337

LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces

Yukang Feng, Jianwen Sun, Jian Sun, Ze Yang, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, K. He, Rui Ma, Jifan Lin, Jie Sun, Yang Xiao, Sizhuo Zhou, Wenxiao Wu, Yiming Liu, Pengfei Liu, Yu Qiao, Shenglin Zhang, Kaipeng Zhang

AI Summary

The authors introduce LongCLI-Bench, a new benchmark for evaluating long-horizon agentic programming in command-line interfaces, addressing limitations of existing benchmarks such as short task horizons and data contamination. They curated 20 long-horizon tasks from computer science assignments and real-world workflows, categorized into from-scratch development, feature addition, bug fixing, and refactoring. Experiments with state-of-the-art agents on LongCLI-Bench revealed low pass rates (below 20%) and early-stage failures, suggesting the need for improved planning, execution, and human-agent collaboration.

Key Contribution

Today's best AI agents fail at realistic software engineering tasks, stalling before even reaching 30% completion, revealing the urgent need for better long-horizon planning and human-AI collaboration.

Abstract

Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and therefore fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench, which measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly higher improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents' planning and execution capabilities to overcome key challenges in long-horizon task performance.
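The dual-set protocol and step-level scoring described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `TaskResult` record, the mapping from tests to steps, the rule that a task counts as resolved only when both test sets pass in full, and the rule that completion stops at the first failing step are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one agent run on one task (not from the paper)."""
    fail_to_pass: dict[str, bool]  # tests that failed before the run and should now pass
    pass_to_pass: dict[str, bool]  # tests that passed before the run and must keep passing
    step_of_test: dict[str, int]   # assumed mapping: fail-to-pass test name -> step index


def pass_fraction(results: dict[str, bool]) -> float:
    """Fraction of tests that passed; an empty set counts as fully passed."""
    return sum(results.values()) / len(results) if results else 1.0


def task_resolved(task: TaskResult) -> bool:
    """Assumed criterion: every test in both sets must pass."""
    return all(task.fail_to_pass.values()) and all(task.pass_to_pass.values())


def step_completion(task: TaskResult) -> float:
    """Fraction of steps completed in order, stopping at the first failing step."""
    steps = sorted(set(task.step_of_test.values()))
    if not steps:
        return 0.0
    completed = 0
    for step in steps:
        tests = [name for name, s in task.step_of_test.items() if s == step]
        if all(task.fail_to_pass[name] for name in tests):
            completed += 1
        else:
            break  # later steps are not credited once an earlier step fails
    return completed / len(steps)


# Made-up example: the agent finishes step 1, breaks on step 2, keeps legacy tests green.
example = TaskResult(
    fail_to_pass={"t_parse": True, "t_eval": False, "t_cli": True},
    pass_to_pass={"t_legacy": True},
    step_of_test={"t_parse": 1, "t_eval": 2, "t_cli": 3},
)
requirement_score = pass_fraction(example.fail_to_pass)
regression_score = pass_fraction(example.pass_to_pass)
resolved = task_resolved(example)
completion = step_completion(example)
```

In this made-up run the regression set stays fully green while requirement fulfillment is partial, and the step-level score shows that the failure happened at the second of three steps, which is the kind of early-stage stall the abstract reports.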

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
