Microsoft Research, UofT, USC, Apr 19, 2026, arXiv:2604.17338

Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

Wang Bill Zhu, Miaosen Chai, Shangshang Wang, Yejia Liu, Song Bian, Honghua Dong, Willie Neiswanger

AI Summary

The paper introduces the Precise Debugging Benchmark (PDB) to evaluate whether LLMs debug with targeted code edits rather than simply regenerating code. PDB automatically generates buggy programs with verified atomic bugs and evaluates debugging performance using edit-level precision and bug-level recall. Experiments on PDB-Single-Hard and PDB-Multi show that frontier models achieve high unit-test pass rates while their precision remains low, indicating a tendency to over-edit. Iterative and agentic debugging strategies do not substantially improve precision or recall.

Key Contribution

Despite impressive unit test pass rates, today's best LLMs rewrite code instead of precisely debugging it, achieving less than 45% edit precision even when explicitly instructed to minimize changes.

Abstract

Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation. PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which measure, respectively, how many of the edits made are necessary and how many bugs are resolved. We release two evaluation benchmarks: PDB-Single-Hard on single-line bugs, and PDB-Multi on multi-line bugs. Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above 76% but exhibit precision below 45%, even when explicitly instructed to perform minimal debugging. Finally, we show that iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.
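To make the two metrics concrete, here is a minimal sketch of how edit-level precision and bug-level recall could be computed at line granularity. It is not the paper's implementation: the function names (`edited_lines`, `edit_precision`, `bug_recall`), the line-level diff, and the rule that a bug counts as resolved when the model touches at least one of its lines are all assumptions made for illustration. The abstract does not specify these details, and a real evaluation would presumably verify resolution against unit tests.

```python
import difflib


def edited_lines(buggy: str, candidate: str) -> set[int]:
    """Return 0-based indices of lines in `buggy` that the candidate changed or deleted.

    Assumption: pure insertions are attributed to the position in `buggy`
    where the new lines were inserted.
    """
    a = buggy.splitlines()
    b = candidate.splitlines()
    changed: set[int] = set()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "insert":
            changed.add(i1)
        else:  # "replace" or "delete"
            changed.update(range(i1, i2))
    return changed


def edit_precision(edited: set[int], bug_lines: set[int]) -> float:
    """Fraction of edited lines that coincide with a known bug line (assumed definition)."""
    if not edited:
        return 0.0
    return len(edited & bug_lines) / len(edited)


def bug_recall(edited: set[int], bugs: list[set[int]]) -> float:
    """Fraction of bugs whose lines were touched at all (a simplifying assumption)."""
    if not bugs:
        return 0.0
    resolved = sum(1 for bug in bugs if bug & edited)
    return resolved / len(bugs)


# Hypothetical example: two injected single-line bugs, on lines 3 and 4 (0-based).
buggy_src = """def mean(xs):
    total = 0
    for x in xs:
        total -= x
    return total / len(xs) + 1
"""

# The candidate fixes the first bug, misses the second, and makes one unnecessary edit.
candidate_src = """def mean(xs):
    total = 0.0
    for x in xs:
        total += x
    return total / len(xs) + 1
"""

bugs = [{3}, {4}]
bug_lines = set().union(*bugs)

edited = edited_lines(buggy_src, candidate_src)
precision = edit_precision(edited, bug_lines)
recall = bug_recall(edited, bugs)
print(sorted(edited), precision, recall)
```

In this toy case the candidate edits two lines, only one of which was necessary, and resolves one of two bugs, so both metrics come out at 0.5. A model that regenerates the whole function could still pass unit tests while scoring far lower on precision, which is the gap the benchmark is designed to expose.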

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
