This paper investigates the reliability of automated evaluation methods for code review bots in an industrial setting by comparing G-Eval and LLM-as-a-Judge pipelines against developer-provided labels on 2,604 bot-generated pull request comments. The study finds only moderate alignment between automated evaluations (using Gemini-2.5-pro, GPT-4.1-mini, and GPT-5.2) and human labels, with agreement ratios ranging from approximately 0.44 to 0.62, and argues that developer actions are influenced by contextual factors beyond comment quality. These findings underscore the limitations of relying solely on automated metrics for assessing the usefulness of code review bots in real-world software development environments.
Automated evaluations of code review bots disagree with developer labels roughly 38% to 56% of the time (agreement ratios of approximately 0.44 to 0.62), suggesting that developer actions are driven by workflow pressures, not just comment quality.
Automated code review (ACR) bots are increasingly used in industrial software development to assist developers during pull request (PR) review. As adoption grows, a key challenge is how to evaluate the usefulness of bot-generated comments reliably and at scale. In practice, such evaluation often relies on developer actions and annotations that are shaped by contextual and organizational factors, complicating their use as objective ground truth. We examine the feasibility and limitations of automating the evaluation of LLM-powered ACR bots in an industrial setting. We analyze an industrial dataset from Beko comprising 2,604 bot-generated PR comments, each labeled by software engineers as fixed/wontFix. Two automated evaluation approaches, G-Eval and an LLM-as-a-Judge pipeline, are applied using both binary decisions and a 0-4 Likert-scale formulation, enabling a controlled comparison against developer-provided labels. Across Gemini-2.5-pro, GPT-4.1-mini, and GPT-5.2, both evaluation strategies achieve only moderate alignment with human labels. Agreement ratios range from approximately 0.44 to 0.62, with noticeable variation across models and between binary and Likert-scale formulations, indicating sensitivity to both model choice and evaluation design. Our findings highlight practical limitations in fully automating the evaluation of ACR bot comments in industrial contexts. Developer actions such as resolving or ignoring comments reflect not only comment quality, but also contextual constraints, prioritization decisions, and workflow dynamics that are difficult to capture through static artifacts. Insights from a follow-up interview with a software engineering director further corroborate that developer labeling behavior is strongly influenced by workflow pressures and organizational constraints, reinforcing the challenges of treating such signals as objective ground truth.
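The comparison described in the abstract can be illustrated with a minimal sketch. It assumes that the agreement ratio is the fraction of comments on which the automated verdict matches the developer label, and that a 0-4 Likert score is mapped to fixed/wontFix with a cut-off. Neither the exact metric definition nor the cut-off is given in the text above, so the threshold of 2, the function names, and the toy data are all hypothetical and do not reproduce the study's pipeline or results.

```python
from typing import Sequence

FIXED = "fixed"
WONT_FIX = "wontFix"


def likert_to_label(score: int, threshold: int = 2) -> str:
    """Map a 0-4 Likert usefulness score to a binary label.

    The threshold is an assumption for illustration only; the source
    does not state how Likert scores are compared to developer labels.
    """
    if not 0 <= score <= 4:
        raise ValueError(f"Likert score out of range: {score}")
    return FIXED if score >= threshold else WONT_FIX


def agreement_ratio(human: Sequence[str], automated: Sequence[str]) -> float:
    """Fraction of comments where the automated label equals the human label."""
    if len(human) != len(automated):
        raise ValueError("label sequences must have the same length")
    if not human:
        raise ValueError("cannot compute agreement on an empty dataset")
    matches = sum(h == a for h, a in zip(human, automated))
    return matches / len(human)


# Invented toy data: five bot comments with developer labels.
human_labels = [FIXED, WONT_FIX, FIXED, WONT_FIX, FIXED]

# Hypothetical judge output in the binary formulation.
binary_judge = [FIXED, WONT_FIX, WONT_FIX, WONT_FIX, FIXED]

# Hypothetical judge output in the 0-4 Likert formulation.
likert_judge = [4, 3, 1, 0, 2]
likert_labels = [likert_to_label(s) for s in likert_judge]

binary_agreement = agreement_ratio(human_labels, binary_judge)
likert_agreement = agreement_ratio(human_labels, likert_labels)

print(f"binary agreement: {binary_agreement:.2f}")
print(f"likert agreement: {likert_agreement:.2f}")
```

On this toy data the two formulations give different agreement ratios for the same developer labels, which is the kind of sensitivity to evaluation design the abstract reports. Raw agreement also ignores class imbalance between fixed and wontFix, so a chance-corrected statistic may be worth computing alongside it.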