The paper introduces SWE-CI, a benchmark for evaluating how well LLM-powered agents maintain codebases through a continuous integration (CI) loop, addressing the limitations of static, one-shot code-repair evaluations. SWE-CI comprises 100 tasks derived from real-world code repositories, each representing a long-term evolution history with multiple commits and requirement changes. The benchmark assesses an agent's ability to sustain code quality over multiple iterations of analysis and coding within the CI loop.
LLMs that ace static code-fixing benchmarks may still struggle to maintain code quality over the long, iterative haul of real-world software development.
Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. However, in the real world, the development of mature software is typically predicated on complex requirement changes and long-term feature iterations, a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration loop, aiming to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository. SWE-CI requires agents to systematically resolve these tasks through dozens of rounds of analysis and coding iterations. SWE-CI provides valuable insights into how well agents can sustain code quality throughout long-term evolution.