This paper empirically analyzes 278,790 code review conversations on GitHub to compare the effectiveness of human reviewers and AI agents in collaborative workflows. It finds that human reviewers provide more diverse feedback (understanding, testing, knowledge transfer) and engage in more rounds of review, especially for AI-generated code. Furthermore, AI-generated suggestions are adopted less frequently and, when adopted, lead to larger increases in code complexity and size compared to human suggestions, highlighting the need for human oversight.
AI code review agents may scale defect screening, but their suggestions are adopted less often and, when adopted, can actually *worsen* code quality, underscoring the critical need for human oversight.
Code review is a critical software engineering practice in which developers review code changes before integration to ensure code quality, detect defects, and improve maintainability. In recent years, AI agents that can understand code context, plan review actions, and interact with development environments have been increasingly integrated into the code review process. However, there is limited empirical evidence comparing the effectiveness of AI agents and human reviewers in collaborative workflows. To address this gap, we conduct a large-scale empirical analysis of 278,790 code review conversations across 300 open-source GitHub projects. In our study, we compare the feedback provided by human reviewers and AI agents, investigate human-AI collaboration patterns in review conversations to understand how interaction shapes review outcomes, and analyze how often code suggestions from human reviewers and AI agents are adopted into the codebase and how adopted suggestions change code quality. We find that human reviewers provide types of feedback that AI agents do not, including understanding, testing, and knowledge transfer. Human reviewers exchange 11.8% more review rounds when reviewing AI-generated code than when reviewing human-written code. Moreover, code suggestions made by AI agents are adopted into the codebase at a significantly lower rate than suggestions proposed by human reviewers; over half of the unadopted suggestions from AI agents are either incorrect or addressed by developers through alternative fixes. When adopted, suggestions provided by AI agents produce significantly larger increases in code complexity and code size than suggestions provided by human reviewers. Our findings suggest that while AI agents can scale defect screening, human oversight remains critical for ensuring suggestion quality and providing the contextual feedback that AI agents lack.