Baidu · Ohio State · May 28, 2026 · arXiv:2605.29277

Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

Jun Zhang, Jianying Qu, Hanwen Du, Zhongkai Sun, Ye Yang, Yehua Yang, Qiao Zhao

AI Summary

Code-QA-Bench is introduced as a framework for automatically generating repository-level code QA benchmarks that isolates code reasoning from documentation recall. The framework uses an answer-first generation pipeline where a tool-equipped agent explores code to produce verified answers before question generation, and employs a three-condition experimental design (closed-book, code-only, documented) to quantify documentation utility and memorization. Experiments on 528 code-derivable and 100 doc-dependent tasks across 10 Python repositories show that code access is the dominant factor, while documentation provides only modest additional benefit, validating the benchmark's design.

Key Contribution

LLMs rely far more on raw code access than on documentation when answering repository-level questions, which challenges the assumption that documentation is the primary driver of code understanding.

Abstract

We present Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks that separates genuine code comprehension from documentation recall and pretraining memorization. The framework makes two methodological contributions: (1) an answer-first generation pipeline where a tool-equipped agent explores source code to produce verified gold answers before deriving questions, ensuring every task is grounded in real code structure; and (2) a three-condition experimental design evaluating agents under closed-book (no repository), code-only (documentation removed), and documented (full repository) conditions, with deltas directly quantifying documentation utility and memorization. We generate 528 code-derivable and 100 doc-dependent tasks across 10 Python repositories from SWE-Bench, scored by an LLM judge on accuracy, completeness, and specificity. Experiments on four frontier models reveal that code access is the dominant factor (+0.23 mean gain over closed-book), documentation provides modest additional benefit (+0.071 on doc-dependent tasks), and code-only $\approx$ documented on code-derivable tasks, validating the design. The framework is open-source and applicable to any well-documented Python repository.
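The three-condition design reduces to simple differences between per-condition mean scores. The sketch below illustrates that arithmetic only. The scores are made-up placeholders, not results from the paper, and the names `condition_deltas`, `code_access_gain`, and `documentation_utility` are hypothetical labels. The mapping of each delta to "code access" or "documentation utility" is an assumption based on the abstract's wording, not the authors' exact definitions.

```python
from statistics import mean

# Hypothetical per-task judge scores in [0, 1] for one model.
# These values are placeholders for illustration, NOT data from the paper.
scores = {
    "closed_book": [0.25, 0.5, 0.25],   # no repository access
    "code_only": [0.5, 0.75, 0.5],      # repository with documentation removed
    "documented": [0.5, 1.0, 0.5],      # full repository
}


def condition_deltas(per_condition_scores):
    """Return mean-score differences between the three conditions.

    Assumed interpretation (not confirmed by the source):
      - closed_book_baseline: what the model answers without the repository,
        a rough proxy for pretraining memorization
      - code_access_gain: code_only minus closed_book
      - documentation_utility: documented minus code_only
    """
    means = {name: mean(values) for name, values in per_condition_scores.items()}
    return {
        "closed_book_baseline": means["closed_book"],
        "code_access_gain": means["code_only"] - means["closed_book"],
        "documentation_utility": means["documented"] - means["code_only"],
    }


deltas = condition_deltas(scores)
for name, value in deltas.items():
    print(f"{name}: {value:+.3f}")
```

Under this reading, a documentation utility near zero on code-derivable tasks corresponds to the abstract's observation that code-only $\approx$ documented for that task set.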

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 39

Year: 2026

Venue: N/A
