The paper introduces BeyondSWE, a benchmark for evaluating code agents on more realistic software engineering tasks that require cross-repository reasoning, domain-specific knowledge, dependency management, and full-repository generation. Experiments on BeyondSWE reveal that state-of-the-art models struggle, achieving less than 45% success and performing inconsistently across task types. The authors also introduce SearchSWE, a framework that integrates web search into coding tasks, but find that search augmentation does not consistently improve performance, underscoring how hard it is to replicate developer workflows that interleave search and reasoning.
Today's code-generating AI falls apart when faced with real-world software engineering tasks that demand cross-repository reasoning and external knowledge, achieving less than 45% success on the new BeyondSWE benchmark.
Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes (resolution scope and knowledge scope), using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.