NJU, RMIT, School of Computer Science, School of Mathematical and Computational Sciences, WHU
Apr 7, 2026
arXiv:2604.06373

Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale Projects

S. Kashif, Ruiyin Li, Peng Liang, Amjed Tahir, Qiong Feng, Zengyang Li, Mojtaba Shahin

AI Summary

This paper investigates whether AI-powered IDEs such as Cursor can generate large-scale software projects and evaluates the design quality of the generated code. Using a Feature-Driven Human-In-The-Loop (FD-HITL) framework, the authors generated 10 projects across three application domains and multiple technologies. Manual evaluation gave an average functional correctness score of 91%, yet the resulting projects exhibit significant design flaws. Static analysis with CodeScene and SonarQube reported 1,305 and 3,193 design issues respectively, related to code duplication, complexity, and violations of design principles, which points to potential maintainability and evolvability risks.

Key Contribution

AI-generated projects may be functionally correct, yet they still contain design flaws (duplication, high complexity, design-principle violations) that can make them costly to maintain and evolve.

Abstract

A new generation of AI coding tools, including AI-powered IDEs equipped with agentic capabilities, can generate code within the context of a project. These AI IDEs are increasingly perceived as capable of producing project-level code at scale. However, there is limited empirical evidence on the extent to which they can generate large-scale software systems and on what design issues such systems may exhibit. To address this gap, we conducted a study to explore the capability of Cursor in generating large-scale projects and to evaluate the design quality of the projects it generates. First, we propose a Feature-Driven Human-In-The-Loop (FD-HITL) framework that systematically guides project generation from curated project descriptions. We generated 10 projects using Cursor with the FD-HITL framework across three application domains and multiple technologies. We assessed the functional correctness of these projects through manual evaluation, obtaining an average functional correctness score of 91%. Next, we analyzed the generated projects using two static analysis tools, CodeScene and SonarQube, to detect design issues. CodeScene identified 1,305 design issues in 9 categories, and SonarQube identified 3,193 issues in 11 categories. Our findings show that (1) when used with the FD-HITL framework, Cursor can generate functional large-scale projects averaging 16,965 LoC and 114 files; (2) the generated projects nevertheless contain design issues that may pose long-term maintainability and evolvability risks, requiring careful review by experienced developers; (3) the most prevalent issues include Code Duplication, high Code Complexity, Large Methods, Framework Best-Practice Violations, Exception-Handling Issues, and Accessibility Issues; (4) these design issues violate design principles such as the Single Responsibility Principle (SRP), Separation of Concerns (SoC), and Don't Repeat Yourself (DRY). The replication package is available at https://github.com/Kashifraz/DIinAGP
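To make the Code Duplication and DRY findings concrete, the following is a minimal illustrative sketch in Python. It is not taken from the generated projects or the replication package; all function names and the email-validation rule are hypothetical. It shows the kind of copy-pasted logic that duplication detectors typically flag, followed by one possible refactoring into a shared helper.

```python
# Hypothetical example: names and logic are illustrative only and do not
# come from the study's generated projects.

# Before: the same normalization/validation is copy-pasted into two
# functions, so any change to the rule must be made twice (DRY violation).
def create_user_duplicated(email: str) -> str:
    cleaned = email.strip().lower()
    if "@" not in cleaned:
        raise ValueError(f"invalid email: {email!r}")
    return f"created {cleaned}"


def invite_user_duplicated(email: str) -> str:
    cleaned = email.strip().lower()
    if "@" not in cleaned:
        raise ValueError(f"invalid email: {email!r}")
    return f"invited {cleaned}"


# After: a single helper owns the rule; callers only add their own behavior.
def normalize_email(email: str) -> str:
    cleaned = email.strip().lower()
    if "@" not in cleaned:
        raise ValueError(f"invalid email: {email!r}")
    return cleaned


def create_user(email: str) -> str:
    return f"created {normalize_email(email)}"


def invite_user(email: str) -> str:
    return f"invited {normalize_email(email)}"
```

Both versions behave identically for callers; the refactored one simply keeps the validation rule in one place, which is the property DRY is meant to protect.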

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
