DebugRepair improves LLM-based automated program repair by incorporating intermediate runtime states collected through simulated debugging, addressing the limitation of existing methods that rely solely on outcome-level failure symptoms. It introduces three components: test semantic purification, simulated instrumentation with rule-based fallback, and debugging-driven conversational repair, which collectively reduce noise, collect runtime traces, and refine patches. Experiments on Defects4J and other benchmarks demonstrate that DebugRepair achieves state-of-the-art performance, fixing 224 bugs with GPT-3.5 and 295 with DeepSeek-V3, significantly outperforming existing LLM-based methods.
With GPT-3.5, DebugRepair fixes 26.2% more Defects4J bugs than prior state-of-the-art LLM-based methods by giving the model intermediate runtime states during repair, which suggests that LLMs struggle to infer root causes from failure symptoms alone.
Automated Program Repair (APR) has benefited from the code understanding and generation capabilities of Large Language Models (LLMs). Existing feedback-based APR methods iteratively refine candidate patches using test execution feedback and have shown promising results. However, most rely on outcome-level failure symptoms, such as stack traces, which show how failures are observed but fail to expose the intermediate runtime states critical for root-cause analysis. As a result, LLMs often infer bug causes without sufficient runtime evidence, leading to incorrect patches. To address this limitation, we propose DebugRepair, a self-directed debugging framework for LLM-based APR. DebugRepair enhances patch refinement with intermediate runtime evidence collected through simulated debugging. It consists of three components: test semantic purification, simulated instrumentation, and debugging-driven conversational repair. Together, they reduce noisy test context, collect runtime traces through targeted debugging statements with rule-based fallback, and progressively refine candidate patches using prior attempts and newly observed runtime states. We evaluate DebugRepair on three benchmarks across Java and Python. Experiments show that DebugRepair achieves state-of-the-art performance against 15 approaches. With GPT-3.5, it correctly fixes 224 bugs on Defects4J, outperforming prior SOTA LLM-based methods by 26.2%. With DeepSeek-V3, it correctly fixes 295 Defects4J bugs, surpassing the second-best baseline by 59 bugs. Across five additional backbone LLMs, DebugRepair improves repair performance by 51.3% over vanilla settings. Ablation studies further confirm the effectiveness of all components.
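The feedback loop described in the abstract can be illustrated with a small toy. The sketch below is not DebugRepair's implementation. It only shows the general idea of rule-based instrumentation (logging each simple assignment) feeding runtime states into an iterative repair loop. Every name in it (`run_with_trace`, `repair`, `toy_model`, the `__trace__` list, and the `mean` example) is hypothetical. The LLM is replaced by a hard-coded stub, and test semantic purification is not modelled.

```python
import ast
from typing import Callable, List, Optional, Tuple

Trace = List[Tuple[int, str, str]]  # (line number, variable name, repr of value)


class _AssignLogger(ast.NodeTransformer):
    """Hypothetical rule-based instrumentation: log every `name = expr`."""

    def visit_Assign(self, node: ast.Assign):
        self.generic_visit(node)
        logs = []
        for target in node.targets:
            if isinstance(target, ast.Name):
                # Builds the statement: __trace__.append((lineno, "name", repr(name)))
                stmt = f"__trace__.append(({node.lineno}, {target.id!r}, repr({target.id})))"
                logs.append(ast.parse(stmt).body[0])
        return [node] + logs  # keep the assignment, then record its value


def run_with_trace(source: str, func_name: str, args: tuple):
    """Run an instrumented copy of `source`; return (result or exception, trace)."""
    tree = _AssignLogger().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    trace: Trace = []
    namespace = {"__trace__": trace}
    exec(compile(tree, "<candidate>", "exec"), namespace)
    try:
        result = namespace[func_name](*args)
    except Exception as exc:  # a crash is also an observable outcome
        result = exc
    return result, trace


def repair(source: str, func_name: str, args: tuple, expected,
           propose_patch: Callable[[list], str], max_rounds: int = 3) -> Optional[str]:
    """Test a candidate; on failure, hand all prior attempts and traces to the model."""
    history = []
    candidate = source
    for attempt in range(max_rounds + 1):  # the original plus up to max_rounds patches
        result, trace = run_with_trace(candidate, func_name, args)
        if result == expected:
            return candidate
        if attempt == max_rounds:
            break
        history.append((candidate, trace))
        candidate = propose_patch(history)
    return None


BUGGY = """
def mean(xs):
    total = 0
    for x in xs:
        total = total + x
    count = len(xs) - 1
    return total / count
"""


def toy_model(history: list) -> str:
    """Stand-in for the LLM: assumes the trace exposed the off-by-one in `count`."""
    candidate, _trace = history[-1]
    return candidate.replace("len(xs) - 1", "len(xs)")


fixed = repair(BUGGY, "mean", ([2, 4, 6],), 4, toy_model)
```

For the failing input `[2, 4, 6]`, the collected trace ends with `count` equal to 2. A stack trace or a bare assertion failure would not surface that intermediate value, but it points directly at the faulty line. In a real system the stub would be a model call that receives the failing test, the earlier patches, and these traces as conversational context.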