The paper introduces REVIEWBENCH, a benchmark that evaluates the quality and substantiveness of LLM-generated peer reviews against paper-specific rubrics. To improve review quality, the authors propose REVIEWGROUNDER, a multi-agent framework that separates review drafting from evidence grounding with external tools. Experiments on REVIEWBENCH demonstrate that REVIEWGROUNDER, pairing a Phi-4-14B drafter with a GPT-OSS-120B grounder, surpasses significantly larger models such as GPT-4.1 and DeepSeek-R1-670B in both review quality and alignment with human judgments.
A clever two-stage agent using smaller models can produce better, more substantive peer reviews than brute-force application of the largest LLMs.
The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM-based reviewers often generate superficial, formulaic comments that lack substantive, evidence-grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark that evaluates review text against paper-specific rubrics derived from official guidelines, the paper's content, and human-written reviews. We further propose REVIEWGROUNDER, a rubric-guided, tool-integrated multi-agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi-4-14B-based drafter and a GPT-OSS-120B-based grounding stage, consistently outperforms baselines built on substantially larger backbones (e.g., GPT-4.1 and DeepSeek-R1-670B) in both alignment with human judgments and rubric-based review quality across 8 dimensions. The code is available at https://github.com/EigenTom/ReviewGrounder.