Mar 4, 2026 · arXiv:2603.03823

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Jialong Chen, Xander Xu, Hu Wei, Chuan Chen, Bing Zhao

AI Summary

The paper introduces SWE-CI, a new benchmark for evaluating LLM-powered agents in maintaining codebases through continuous integration, addressing the limitations of static, one-shot code repair evaluations. SWE-CI comprises 100 tasks derived from real-world code repositories, each representing a long-term evolution history with multiple commits and requirement changes. The benchmark assesses an agent's ability to maintain code quality dynamically over multiple iterations of analysis and coding within a CI loop.

Key Contribution

LLMs that ace static code-fixing benchmarks may still struggle to maintain code quality over the long, iterative haul of real-world software development.

Abstract

Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. However, in the real world, the development of mature software is typically predicated on complex requirement changes and long-term feature iterations, a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration loop, aiming to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository. SWE-CI requires agents to systematically resolve these tasks through dozens of rounds of analysis and coding iterations. SWE-CI provides valuable insights into how well agents can sustain code quality throughout long-term evolution.

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 13

Year: 2026

Venue: N/A
