Kyushu University, NAIST, Osaka, Ritsumeikan University, Waseda | May 21, 2026 | arXiv:2605.22534

Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study

Sien Reeve O. Peralta, Fumika Hoshi, Hironori Washizaki, Naoyasu Ubayashi, Inase Kondo, Yoshiki Higo, Hiroki Mukai, Norihiro Yoshida, Kazuki Kusama, Hidetake Tanaka, Youmei Fan

AI Summary

This paper investigates whether pull request (PR) merge/rejection outcomes are reliable as the sole metric for evaluating AI coding agents. Analyzing 9,799 human-reviewed Agentic-PRs and manually inspecting 717 cases, the authors found that rejection outcomes substantially overstate agent error: only 35.7% of rejections were attributed to clear agent failures. The study argues that reviewer interactions and workflow constraints must be considered for an accurate assessment of agent capabilities.

Key Contribution

Judging AI coding agents solely by merged/rejected pull requests is misleading: only 35.7% of rejections reflect clear agent failures (31.2% stem from workflow constraints and 33.1% show no observable decision rationale), and 15.4% of merges required explicit reviewer feedback or direct commits.

Abstract

AI coding agents increasingly submit pull requests (Agentic-PRs) to open-source repositories, yet their performance is commonly assessed using merge and rejection outcomes alone. We hypothesized that these outcome labels do not reliably reflect agent capability without considering review interactions. To test this, we conducted a decision-oriented analysis of 11,048 closed Agentic Pull Requests, refined to 9,799 human-reviewed PRs, and manually inspected 717 representative cases to recover decision rationale from interaction artifacts. We found that rejection outcomes substantially overstate agent error: only 35.7% of rejected PRs reflected clear agentic failures, while 31.2% were driven by workflow constraints and 33.1% lacked observable decision rationale. Among merged PRs, 15.4% required explicit reviewer involvement through feedback or direct commits, and 5.5% showed no visible interaction trace. We further observed systematic differences across agents, with Copilot and Devin more often embedded in reviewer-mediated workflows, while Codex and Cursor PRs were typically merged with minimal interaction. These results reject the assumption that PR outcomes alone capture agent performance and demonstrate the need for interaction-aware evaluation grounded in review behavior.
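The percentages in the abstract can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' analysis code: the numbers are copied from the abstract, while the variable and function names are hypothetical and chosen only for illustration.

```python
# Figures copied from the abstract; all names below are illustrative assumptions.
closed_prs = 11_048          # closed Agentic-PRs before filtering
human_reviewed_prs = 9_799   # PRs kept after restricting to human-reviewed ones

# Reported breakdown of rejected PRs, in percent.
rejected_share = {
    "clear agentic failure": 35.7,
    "workflow constraints": 31.2,
    "no observable rationale": 33.1,
}


def retained_fraction(kept: int, total: int) -> float:
    """Fraction of the original sample that survives a filtering step."""
    return kept / total


def share_not_clear_failure(shares: dict[str, float]) -> float:
    """Sum of every rejection category other than clear agentic failure."""
    return sum(v for k, v in shares.items() if k != "clear agentic failure")


# The three rejection categories should partition all rejected PRs.
total_share = round(sum(rejected_share.values()), 1)
# Percentage of closed PRs that remained after the human-review filter.
retained_pct = round(100 * retained_fraction(human_reviewed_prs, closed_prs), 1)
# Percentage of rejections that are not clear agentic failures.
other_pct = round(share_not_clear_failure(rejected_share), 1)

print(total_share, retained_pct, other_pct)
```

Note that the share of rejections outside the clear-failure category combines two different cases: rejections driven by workflow constraints and rejections with no observable rationale. The second group is undetermined rather than exonerating, so the combined figure should not be read as the share of rejections that were not the agent's fault.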

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
