The authors introduce LongCLI-Bench, a new benchmark for evaluating long-horizon agentic programming in command-line interfaces, addressing limitations of existing benchmarks such as short task horizons and data contamination. They curated 20 long-horizon tasks from computer science assignments and real-world workflows, categorized into from-scratch development, feature addition, bug fixing, and refactoring. Experiments with state-of-the-art agents on LongCLI-Bench revealed low pass rates (below 20%) and early-stage failures, suggesting the need for improved planning, execution, and human-agent collaboration.
Today's best AI agents pass fewer than 20% of realistic long-horizon software engineering tasks, and most attempts stall before reaching 30% completion, pointing to the need for better long-horizon planning and human-AI collaboration.
Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and therefore fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench, which measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly higher improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents' planning and execution capabilities to overcome key challenges in long-horizon task performance.
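The dual-set protocol and step-level scoring described in the abstract can be illustrated with a minimal Python sketch. This is not the benchmark's actual harness: the `TaskResult` structure, the test names, the rule that a task passes only when every test in both sets passes, and the rule that step-level completion counts steps in order up to the first failing one are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one agent run on one task."""
    fail_to_pass: dict[str, bool]   # tests that should start passing (requirement fulfillment)
    pass_to_pass: dict[str, bool]   # tests that must keep passing (regression avoidance)
    steps: list[list[str]]          # ordered steps, each listing the fail-to-pass tests it needs


def task_passed(result: TaskResult) -> bool:
    # Assumption: a task counts as passed only if both test sets are fully green.
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def step_completion(result: TaskResult) -> float:
    # Assumption: progress is the fraction of steps completed in order
    # before the first step with a failing test (the "stall point").
    if not result.steps:
        return 0.0
    completed = 0
    for step_tests in result.steps:
        if all(result.fail_to_pass.get(name, False) for name in step_tests):
            completed += 1
        else:
            break
    return completed / len(result.steps)


if __name__ == "__main__":
    run = TaskResult(
        fail_to_pass={"t_parse": True, "t_eval": False, "t_cli": False},
        pass_to_pass={"t_legacy": True},
        steps=[["t_parse"], ["t_eval"], ["t_cli"]],
    )
    print("passed:", task_passed(run))
    print("step completion:", step_completion(run))
```

Under these assumptions, a run that breaks no existing behavior but completes only the first of three steps is recorded as a failed task with one third step-level completion, which is the kind of early stall the step-level analysis is meant to expose.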