Jialong Chen

Sun Yat-sen University

$$e=\frac{\sum_{i=1}^{N}\gamma^{i}\,a(c_{i})}{\sum_{i=1}^{N}\gamma^{i}}\tag{3}$$

In EvoScore, we set $\gamma\geq 1$ so that later iterations receive greater weight. The rationale directly mirrors the ISO definition: a truly maintainable codebase is one that remains easy to modify as evolution progresses. An agent that sacrifices short-term speed for a cleaner, more extensible design will be rewarded over one that rushes to pass early tests but accumulates technical debt that cripples subsequent evolution. When $\gamma=1$, EvoScore reduces to the average normalized change; as $\gamma$ increases, the metric progressively favors long-term stability over immediate gains.

3 SWE-CI

Figure 2: Data curation process of SWE-CI.

3.1 Data curation

As shown in Figure 1, our goal is to obtain a number of base-codebase/oracle-codebase pairs and to let the agent iteratively evolve the former toward the latter, measuring its ability to maintain code throughout this process. Each such pair can be viewed as two chronologically ordered commits within the same repository. Concretely, the construction of SWE-CI is carefully orchestrated as follows:

Step 1: Repository Collection. Unlike SWE-Bench and similar benchmarks that draw exclusively from a handful of well-known open-source projects, we cast a wider net by searching across all Python repositories on GitHub. We then apply the following filtering criteria: (1) the repository has been actively maintained for at least three years; (2) it has accumulated more than 500 stars; (3) it contains configuration and dependency files (e.g., pyproject.toml and lockfiles) as well as a suite of unit tests; and (4) it is released under a permissive license such as MIT or Apache-2.0. After applying these filters, 4,923 repositories remain.

Step 2: Commit Span Extraction. For each surviving repository, we retain only its main branch, reducing the history to a linear sequence of commits.
We then compare the dependencies of consecutive commits along this sequence and identify all maximal subsequences within which the dependencies remain unchanged. The two endpoints of every such subsequence naturally form a candidate base/oracle pair. We further discard pairs whose total number of modified lines of code is below 1,000, as such pairs represent insufficient evolutionary distance. This process yields 8,311 candidate pairs.

Step 3: Environment Construction. For each candidate pair, we automatically generate a Dockerfile based on the configuration and dependencies of the oracle codebase and snapshot the resulting runtime environment. We then execute the oracle codebase's unit test suite within this environment to verify its correctness. To improve data retention, we introduce a self-repair mechanism: whenever the test suite fails to launch due to a missing dependency, we detect the failure and dynamically inject the required dependency into the Dockerfile to build a new environment. This mechanism substantially increases the number of viable candidate pairs. Pairs whose failures stem from other reasons are discarded. After this step, 1,458 candidate pairs and their runtime environment snapshots remain.

Step 4: Case Filtering. Finally, we apply three further rounds of filtering to ensure the quality of the final dataset. First, within the runtime environment snapshot constructed in Step 3, we run the oracle codebase's test suite against the base codebase. Any candidate whose tests fail to launch is discarded. Second, we compare the test reports produced by the base and oracle codebases on the same test suite; candidates for which the difference in the number of passing tests is fewer than five are removed. After these two automated filters, 137 candidates remain. In the last round, we rank the surviving candidates by their time span and number of intervening commits, and select the top 100 to form the final SWE-CI benchmark.
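The commit span extraction in Step 2 can be illustrated with a minimal Python sketch. It assumes that each commit on the main branch has already been reduced to a hashable dependency fingerprint and that the number of modified lines between two commits is available through a callable; the names `Commit`, `candidate_pairs`, and `diff_size` are hypothetical and introduced only for illustration, not taken from the SWE-CI tooling.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Commit:
    sha: str
    deps: FrozenSet[str]  # hypothetical fingerprint of the declared dependencies


def candidate_pairs(
    commits: List[Commit],
    diff_size: Callable[[Commit, Commit], int],
    min_lines: int = 1000,
) -> List[Tuple[str, str]]:
    """Return (base, oracle) SHA pairs from a linear, chronologically ordered history.

    A pair is formed by the two endpoints of each maximal run of commits
    whose dependencies stay unchanged; runs that are a single commit or
    that modify fewer than `min_lines` lines are dropped.
    """
    pairs: List[Tuple[str, str]] = []
    start = 0
    for i in range(1, len(commits) + 1):
        # A run ends at the end of history or when the dependencies change.
        if i == len(commits) or commits[i].deps != commits[start].deps:
            end = i - 1
            if end > start:
                base, oracle = commits[start], commits[end]
                if diff_size(base, oracle) >= min_lines:
                    pairs.append((base.sha, oracle.sha))
            start = i
    return pairs


if __name__ == "__main__":
    old, new = frozenset({"x==1"}), frozenset({"x==2"})
    history = [Commit("a", old), Commit("b", old), Commit("c", old),
               Commit("d", new), Commit("e", new)]
    # Toy line counts standing in for a real diff between two commits.
    sizes = {("a", "c"): 1500, ("d", "e"): 200}
    print(candidate_pairs(history, lambda b, o: sizes[(b.sha, o.sha)]))
```

In this toy history the run a..c is kept because it modifies 1,500 lines, while d..e is discarded for falling below the 1,000-line threshold.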
The final SWE-CI benchmark comprises 100 samples drawn from 68 distinct repositories. On average, each base/oracle pair spans 233 days and 71 consecutive commits of real-world development history. In every pair, the transition from the base to the oracle codebase involves at least 500 lines of modified source code, excluding changes to test files. Each sample is shipped with the complete source code and a pre-built Docker environment to ensure reproducibility. These statistics confirm that SWE-CI captures substantial, long-term evolutionary episodes rather than trivial incremental changes.

3.2 Dual-agent evaluation protocol

Figure 3: SWE-CI uses an Architect-Programmer dual-agent workflow to model the continuous integration cycle of professional software teams in the real world.

As described in Figure 1, SWE-CI adopts evolution-based evaluation. To support this setting, we introduce an Architect-Programmer dual-agent protocol. The Architect identifies functional gaps and issues requirements; the Programmer implements them. Their collaboration reproduces the CI loop in real-world development, enabling fine-grained observation of how well agents maintain code.

Architect agent. Based on the test gap between the current code and the oracle code, the Architect is tasked with producing a high-level requirements document in natural language. We prompt the Architect to organize its behavior into three steps:
❶ Summarize. The Architect reviews all failing tests, identifies root causes, and flags the source code files that need further inspection.
❷ Locate. The Architect examines the source code and attributes failures to concrete deficiencies in the current implementation.
❸ Design. Based on these deficiencies, the Architect devises an improvement plan and produces the final requirements document.

Two writing conventions are further imposed on the requirements document.
➀ Incremental.
The document should contain no more than five of the most urgent requirements, avoiding the pitfall of over-designing in a single iteration.
➁ High-level. The requirements should describe the expected behavior of the code in natural language, leaving concrete implementation choices to the Programmer.
The core purpose of these conventions is to ensure that requirements documents meet the needs of real-world continuous integration processes.

Programmer agent. The Programmer's responsibility is to maintain the code according to the requirements document. The Programmer's behavior is also standardized into three steps:
❶ Comprehend. The Programmer interprets the high-level natural-language requirements and translates them into explicit code specifications.
❷ Plan. The Programmer plans the programming effort required to implement these specifications.
❸ Code. The Programmer puts these plans into practice and tries to fulfill the requirements.

In this protocol, the Programmer is driven by the requirements document rather than directly by the test gap, a deliberate design choice that aligns with the rapid iteration philosophy of continuous integration. To this end, the Architect is required to distill the most pressing requirements from the full set of failures, allowing the Programmer to focus on fast, targeted development without being overwhelmed by the full scope of changes.

4 Experiments

4.1 Experiment setting

We use pytest and pytest-json-report as the testing framework, with a timeout of 3600 seconds per test run. iFlow CLI [9] serves as the default agent framework, and the maximum number of iterations in the dual-agent evaluation protocol is set to 20. Unless otherwise specified, the Architect agent and the Programmer agent share the same underlying base model.

4.2 Results

Observation 1: The code maintenance capabilities of LLMs are advancing at an accelerating pace (Figure 4).
Our extensive evaluation of 18 models from 8 different providers reveals a consistent pattern: within the same provider family, newer models always achieve higher scores, with models released after 2026 showing markedly larger gains than their predecessors. This suggests that the code capabilities of current LLMs are rapidly evolving beyond static bug-fixing toward sustained, long-term code maintenance. Among all evaluated models, the Claude Opus series demonstrates a commanding lead throughout the entire observation period, with GLM-5 also standing out as a strong performer.

Figure 4: The EvoScore variation of state-of-the-art models from 8 providers on SWE-CI.

Observation 2: Different providers place varying degrees of emphasis on code maintainability (Figure 5).
We vary the value of $\gamma$ to examine how model rankings shift accordingly. When $\gamma<1$, EvoScore assigns higher weights to earlier iterations, favoring models that prioritize immediate gains from code modification. Conversely, when $\gamma>1$, later iterations are rewarded, giving an advantage to models that optimize for long-term improvement (i.e., prioritize code maintainability). We find that preferences vary considerably across providers, while models within the same provider tend to exhibit consistent tendencies. Specifically, MiniMax, DeepSeek, and GPT all show a preference for long-term gains, whereas Kimi and GLM lean toward short-term returns. Qwen, Doubao, and Claude, by contrast, remain relatively stable across different settings. We conjecture that this reflects differences in the training strategies adopted by different providers, while the relative consistency within each provider suggests that their internal training pipelines remain largely stable.

Figure 5: Changes in the models' EvoScore rankings as $\gamma$ increases. When $\gamma>1$, a higher rank indicates better codebase maintenance.
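To make the effect of $\gamma$ concrete, the following minimal Python sketch computes the weighted average of Eq. (3), in which iteration $i$ receives weight $\gamma^{i}$. The per-iteration normalized changes $a(c_i)$ are taken as given inputs. The function name `evo_score` and the two trajectories are hypothetical illustrations, not data from our experiments.

```python
from typing import Sequence


def evo_score(changes: Sequence[float], gamma: float = 1.0) -> float:
    """Weighted average of per-iteration normalized changes a(c_i).

    Iteration i (1-indexed) is weighted by gamma**i, so gamma > 1
    emphasizes later iterations and gamma < 1 emphasizes earlier ones.
    """
    if not changes:
        raise ValueError("at least one iteration is required")
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    weights = [gamma ** i for i in range(1, len(changes) + 1)]
    return sum(w * a for w, a in zip(weights, changes)) / sum(weights)


if __name__ == "__main__":
    # Hypothetical trajectories: one agent gains early and then degrades,
    # the other starts slowly but keeps improving.
    fast = [0.8, 0.5, 0.3]
    steady = [0.3, 0.5, 0.8]
    for gamma in (0.5, 1.0, 2.0):
        print(f"gamma={gamma}: fast={evo_score(fast, gamma):.3f}, "
              f"steady={evo_score(steady, gamma):.3f}")
```

With $\gamma=1$ the two trajectories tie, since the score is then a plain average. With $\gamma=0.5$ the early-gaining trajectory scores higher, and with $\gamma=2$ the steadily improving one does, which is the kind of ranking shift examined in Figure 5.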
Observation 3: Current LLMs still fall short in controlling regressions during long-term code maintenance (Figure 6).
Regression is a core metric for measuring software quality stability: if a unit test passes before a code change but fails afterward, the change is considered to have introduced a regression. In software maintenance, regressions must be strictly monitored. Once a regression occurs, it not only directly impacts user experience, but can also lead to systematic quality degradation as the number of changes accumulates over long-term maintenance. To this end, we measure in SWE-CI the proportion of samples in which no regression occurs throughout the entire code maintenance process, referred to as the zero-regression rate, to evaluate the stability of models in continuous maintenance scenarios. Experimental results show that most models achieve a zero-regression rate below 0.25, with only two models in the Claude Opus series exceeding 0.5, indicating that current LLMs still struggle to reliably avoid regressions in long-term code maintenance. This suggests that, although LLMs have shown significant improvements in snapshot-based code modification tasks, they still face substantial challenges in fully automated, long-term, and multi-round software development and maintenance scenarios.

Figure 6: All models sorted from smallest to largest by zero-regression rate.

References

[1] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. (2021) Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
[2] F. P. Brooks Jr (1995) The mythical man-month: essays on software engineering. Pearson Education.
[3] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
[4] International Organization for Standardization (2011) Systems and software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - Product quality model. Standard Technical Report 25010:2011, ISO/IEC. Note: Revised by ISO/IEC 25010:2023.
[5] N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2024) LiveCodeBench: holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
[6] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan (2023) SWE-bench: can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.
[7] M. M. Lehman (1980) Programs, life cycles, and laws of software evolution. Proceedings of the IEEE 68 (9), pp. 1060–1076.
[8] M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, et al. (2026) Terminal-bench: benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868.
[9] W. Wang, X. Xu, W. An, F. Dai, W. Gao, Y. He, J. Huang, Q. Ji, H. Jin, X. Li, et al. (2025) Let it flow: agentic crafting on rock and roll, building the rome model within an open agentic learning ecosystem. arXiv preprint arXiv:2512.24873.
[10] S. Yao, N. Shinn, P. Razavi, and K. Narasimhan (2024) τ-Bench: a benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045.

Appendix A Prompts

System Prompt for Architect Agent

MIT CSAIL

Research focus

Code Generation & Program Synthesis (1), Eval Frameworks & Benchmarks (1), Tool Use & Agents (1)

Frequent co-authors

Xander Xu (1), Hu Wei (1), Chuan Chen (1), Bing Zhao (1)

Papers (1)

Mar 4, 2026

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

LLMs that ace static code-fixing benchmarks may still struggle to maintain code quality over the long, iterative haul of real-world software development.

Jialong Chen, Xander Xu, Hu Wei +2
