Jun 9, 2026 · arXiv:2606.10728

DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch

Jiale Zhao, Guoxin Chen, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Ji-Rong Wen, Kai Jia

AI Summary

This paper introduces DeNovoSWE, a large-scale dataset for generating entire software repositories from documentation, which addresses the scarcity of training data for LLM-based code agents on long-horizon software engineering tasks. The dataset consists of 4,818 high-quality instances and is built through an automated, sandboxed agentic workflow that scales curation without human annotation. Fine-tuning the Qwen3-30B-A3B model on DeNovoSWE raises its score on the BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.

Key Contribution

Fine-tuning Qwen3-30B-A3B on the DeNovoSWE dataset raises its BeyondSWE-Doc2Repo score by 41.4 percentage points (from 5.8% to 47.2%), indicating the potential of LLMs for complete repository generation.

Abstract

As the capabilities of LLM-based code agents continue to advance, their expected role is expanding beyond localized bug fixing in existing codebases toward architecting and implementing complete software repositories from high-level specifications. However, training agents for such long-horizon software engineering tasks remains difficult due to the scarcity of large-scale, verifiable whole-repository generation data. In this paper, we introduce DeNovoSWE, a large-scale dataset for whole-repository generation. DeNovoSWE comprises 4,818 high-quality instances, each of which requires generating a complete repository from documentation. The dataset is constructed automatically through a carefully designed sandboxed agentic workflow, enabling scalable curation without human annotation. Its construction follows a "divide and conquer" and critic-repair philosophy. To balance data quality and diversity, we further introduce a difficulty-aware trajectory filtering strategy. Fine-tuning Qwen3-30B-A3B on DeNovoSWE substantially improves long-horizon SWE performance, raising its score on the challenging BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.

Code Generation & Program Synthesis, Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
