This paper introduces ProReviewer, a proactive scientific peer review agent that uses a structured review log to let a large language model (LLM) investigate suspicious parts of a research paper, as human reviewers do. The review process is formulated as a Markov Decision Process (MDP). With an 8B backbone, ProReviewer achieves the highest average score across five quality dimensions, ahead of prompt-based methods built on much larger frontier LLMs and of fine-tuned baselines. It also attains the highest win rates against baselines in human evaluation.
ProReviewer outperforms larger models by up to 39% in peer review quality by enabling proactive investigation of research papers.
Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods that use much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by a relative 16%. It also attains the highest win rates against baselines in human evaluation.
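To make the MDP framing concrete, the following is a minimal sketch of a review loop in which the state is the paper plus a structured review log, and each action either inspects a section, records a finding, or ends the review. It is an illustration only and not the authors' implementation: the log schema (`ReviewLog`), the action set (`inspect`, `note`, `finish`), and the keyword-based `toy_policy` that stands in for the trained LLM are all assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ReviewLog:
    """Workspace for evidence and intermediate findings (hypothetical schema)."""
    evidence: List[Tuple[str, str]] = field(default_factory=list)  # (section, text)
    findings: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Action:
    kind: str         # "inspect", "note", or "finish" (hypothetical action set)
    target: str = ""  # section name for "inspect", finding text for "note"


Policy = Callable[[Dict[str, str], ReviewLog], Action]


def step(paper: Dict[str, str], log: ReviewLog, action: Action) -> bool:
    """Apply one action to the log; return True when the episode ends."""
    if action.kind == "inspect":
        log.evidence.append((action.target, paper.get(action.target, "")))
    elif action.kind == "note":
        log.findings.append(action.target)
    elif action.kind == "finish":
        return True
    else:
        raise ValueError(f"unknown action kind: {action.kind}")
    return False


def run_review(paper: Dict[str, str], policy: Policy, max_steps: int = 20) -> ReviewLog:
    """Roll out the policy: each action is chosen from the paper and the current log."""
    log = ReviewLog()
    for _ in range(max_steps):
        if step(paper, log, policy(paper, log)):
            break
    return log


def toy_policy(paper: Dict[str, str], log: ReviewLog) -> Action:
    """Keyword-based stand-in for the LLM policy (assumption, for illustration)."""
    seen = {section for section, _ in log.evidence}
    if "abstract" not in seen:
        return Action("inspect", "abstract")
    # Proactive step: follow up on sections that collected evidence points to.
    for _, text in log.evidence:
        for section in paper:
            if section in text and section not in seen:
                return Action("inspect", section)
    # Record a finding for any inspected text that admits an unsupported claim.
    for section, text in log.evidence:
        finding = f"{section}: unsupported claim"
        if "unsupported" in text and finding not in log.findings:
            return Action("note", finding)
    return Action("finish")


if __name__ == "__main__":
    paper = {
        "abstract": "We report gains; see experiments.",
        "experiments": "The gain is unsupported by ablations.",
        "appendix": "Extra details.",
    }
    log = run_review(paper, toy_policy)
    print([section for section, _ in log.evidence])
    print(log.findings)
```

In this toy run the policy reads the abstract, follows its pointer to the experiments section, records one finding, and stops without ever opening the appendix. The point of the sketch is that the next action depends on what the log has accumulated so far, rather than on a fixed reading order.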