Code-QA-Bench is a framework for automatically generating repository-level code QA benchmarks that isolate code reasoning from documentation recall. It uses an answer-first generation pipeline, in which a tool-equipped agent explores the code to produce verified answers before any question is written, and a three-condition experimental design (closed-book, code-only, documented) to quantify documentation utility and memorization. Experiments on 528 code-derivable and 100 doc-dependent tasks across 10 Python repositories show that code access is the dominant factor and that documentation provides only a modest additional benefit, which validates the benchmark's design.
LLMs rely far more on raw code access than on documentation when answering repository-level questions, which challenges the assumption that documentation is the primary driver of code understanding.
We present Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks that separates genuine code comprehension from documentation recall and pretraining memorization. The framework makes two methodological contributions: (1) an answer-first generation pipeline where a tool-equipped agent explores source code to produce verified gold answers before deriving questions, ensuring every task is grounded in real code structure; and (2) a three-condition experimental design evaluating agents under closed-book (no repository), code-only (documentation removed), and documented (full repository) conditions, with deltas directly quantifying documentation utility and memorization. We generate 528 code-derivable and 100 doc-dependent tasks across 10 Python repositories from SWE-Bench, scored by an LLM judge on accuracy, completeness, and specificity. Experiments on four frontier models reveal that code access is the dominant factor (+0.23 mean gain over closed-book), documentation provides modest additional benefit (+0.071 on doc-dependent tasks), and code-only $\approx$ documented on code-derivable tasks, validating the design. The framework is open-source and applicable to any well-documented Python repository.
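The deltas between the three conditions can be illustrated with a minimal sketch. It assumes each task carries one judge score per condition on a common scale. The `TaskScores` and `condition_deltas` names, the dictionary keys, and the example scores are hypothetical and are not taken from the Code-QA-Bench code or results. Treating the closed-book mean as a rough proxy for memorization is likewise an assumption of this sketch.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TaskScores:
    """Judge scores for one task under each condition (hypothetical layout)."""
    closed_book: float  # no repository access
    code_only: float    # repository with documentation removed
    documented: float   # full repository


def condition_deltas(tasks):
    """Return mean per-condition deltas over a non-empty list of TaskScores."""
    if not tasks:
        raise ValueError("at least one task is required")
    closed = mean(t.closed_book for t in tasks)
    code = mean(t.code_only for t in tasks)
    documented = mean(t.documented for t in tasks)
    return {
        # Assumption: what the model answers with no repository at all.
        "closed_book_baseline": closed,
        # Gain from reading the source code.
        "code_access_gain": code - closed,
        # Additional gain from restoring the documentation.
        "documentation_gain": documented - code,
    }


# Made-up scores, for illustration only.
example = [
    TaskScores(closed_book=0.40, code_only=0.65, documented=0.70),
    TaskScores(closed_book=0.50, code_only=0.70, documented=0.80),
]
deltas = condition_deltas(example)
```

Under this reading, a `documentation_gain` near zero on code-derivable tasks corresponds to the reported code-only $\approx$ documented result, while a larger value on doc-dependent tasks would indicate that the documentation carries information the code alone does not.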