The paper introduces Needle in the Repo (NITR), a benchmark to evaluate the maintainability of AI-generated code edits by testing for preservation of modularity and testability. NITR uses controlled probes embedded in multi-file codebases, paired with structural oracles, to diagnose maintainability failures beyond functional correctness. Experiments across GPT, Claude, Gemini, and Qwen models reveal that current AI coding systems struggle with maintainability, particularly in architectural aspects like dependency control and responsibility decomposition, even when passing functional tests.
AI coding agents often produce functionally correct code that is still a maintainability problem: 13.3% of outcomes (64/483) pass all functional tests yet fail the structural oracle.
AI coding agents can now complete complex programming tasks, but existing evaluations largely emphasize behavioral correctness and often overlook maintainability risks such as weak modularity or testability. We present Needle in the Repo (NITR), a diagnostic probe-and-oracle framework for evaluating whether behaviorally correct repository edits preserve maintainable structure. NITR distills recurring software engineering wisdom into controlled probes embedded in small, realistic multi-file codebases, each designed so that success depends primarily on one targeted maintainability dimension. Each probe is paired with a hidden evaluation harness that combines functional tests for required behavior with structural oracles that encode the targeted maintainability constraint and return interpretable diagnoses. Using NITR, we evaluate 23 coding configurations across GPT, Claude, Gemini, and Qwen families in both direct-inference and agent-based settings. Current AI coding systems remain far from robust: on average, configurations solve only 36.2% of cases, the best reaches 57.1%, and performance drops from 53.5% on micro cases to 20.6% on multi-step cases. The hardest pressures are architectural rather than local edits, especially dependency control (4.3%) and responsibility decomposition (15.2%). Moreover, 64/483 outcomes (13.3%) pass all functional tests yet fail the structural oracle. Under our harness, agent-mode configurations improve average performance from 28.2% to 45.0%, but do not eliminate these architectural failures. These results show that progress in code generation is not yet progress in maintainable code evolution, and that NITR exposes a critical failure surface missed by conventional evaluation.
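To make the probe-and-oracle idea concrete, here is a minimal sketch of a harness that pairs a functional test with a structural oracle for one maintainability dimension, dependency control. It is not NITR's actual harness: the names `Verdict`, `dependency_oracle`, and `evaluate`, and the rule that the edited module must not import `json`, are all assumptions made for illustration.

```python
import ast
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    functional_pass: bool
    structural_pass: bool
    diagnosis: str

    @property
    def solved(self) -> bool:
        # A case counts as solved only if both checks pass.
        return self.functional_pass and self.structural_pass


def imported_modules(source: str) -> set[str]:
    """Return the top-level names of modules imported by absolute imports."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def dependency_oracle(source: str, forbidden: set[str]) -> tuple[bool, str]:
    """Hypothetical structural oracle: fail if a forbidden module is imported."""
    violations = sorted(imported_modules(source) & forbidden)
    if violations:
        return False, "forbidden dependency: " + ", ".join(violations)
    return True, "ok"


def evaluate(
    source: str,
    functional_test: Callable[[dict], bool],
    forbidden: set[str],
) -> Verdict:
    namespace: dict = {}
    exec(source, namespace)  # acceptable for a toy; unsafe for untrusted code
    functional_pass = functional_test(namespace)
    structural_pass, diagnosis = dependency_oracle(source, forbidden)
    return Verdict(functional_pass, structural_pass, diagnosis)


# A candidate edit that behaves correctly but pulls in a forbidden dependency.
EDIT = """
import json

def total(prices):
    return sum(prices)
"""

verdict = evaluate(
    EDIT,
    functional_test=lambda ns: ns["total"]([1, 2, 3]) == 6,
    forbidden={"json"},  # assumed layering rule for this illustration
)
print(verdict.solved, verdict.diagnosis)  # False forbidden dependency: json
```

The example edit lands in the category the abstract highlights: it passes the functional test, fails the structural oracle, and comes back with a readable diagnosis instead of a bare failure.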