George Mason University; Missouri University of Science and Technology; UCF; University of Alabama; University of Cincinnati; University of South Florida

May 25, 2026

arXiv:2605.25415

LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers

Lingyao Li, Junjie Xiong, Changjia Zhu, Runlong Yu, Chen Chen, Junyu Wang, Renkai Ma, Zhicong Lu

AI Summary

This paper benchmarks 12 LLMs as paper reviewers on a dataset of 898 NeurIPS/ICLR papers, evaluating rating calibration, divergence from human reviewers, and prompt injection resistance. LLMs systematically overrate weaker submissions, diverge from humans in topical emphasis (under-flagging Clarity, over-flagging Reproducibility), and generate longer, less diverse reviews. A simple invisible font-mapping attack can successfully promote low-scoring papers to acceptance-level ratings, highlighting significant prompt injection vulnerabilities.

Key Contribution

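The paper shows that LLM reviewers are susceptible to prompt injections hidden via an invisible font-mapping attack. Such injections can raise the ratings of low-scoring papers to acceptance level, exposing a critical vulnerability in the use of LLMs for academic peer review.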

Abstract

Large language models (LLMs) are increasingly used in academic peer review, yet their reliability, alignment with human judgment, and robustness to adversarial attacks remain poorly understood. We present a systematic benchmark of LLM-as-a-Reviewer on 898 papers stratified from NeurIPS and ICLR, evaluating 12 LLMs along three axes: rating calibration, divergence from human reviewers, and resistance to prompt injection embedded via an invisible font-mapping attack. We find that LLMs systematically overrate weaker submissions and diverge from humans in topical emphasis, under-flagging Clarity and over-flagging Reproducibility, while producing reviews two to three times longer with lower lexical diversity and a more standardized vocabulary. Prompt injection remains highly effective. Simple hidden instructions can promote low-scoring papers to acceptance-level ratings in a substantial fraction of cases, with effectiveness varying sharply across model families. While LLMs offer utility in structuring evaluations, their integration into peer review requires safeguards against both intrinsic biases and adversarial risks.
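The three evaluation axes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`mean_rating_gap`, `type_token_ratio`, `attack_success_rate`), the whitespace tokenization, and the use of a single acceptance `threshold` are assumptions made for illustration only, and the paper may define its metrics differently.

```python
from statistics import mean


def mean_rating_gap(llm_ratings, human_ratings):
    """Mean signed difference (LLM minus human) over paired ratings.

    A positive value means the LLM rates papers higher than humans do.
    Hypothetical metric; the paper's calibration measure may differ.
    """
    return mean(l - h for l, h in zip(llm_ratings, human_ratings))


def type_token_ratio(review):
    """Unique words divided by total words, a crude lexical-diversity proxy.

    Assumes simple lowercase whitespace tokenization.
    """
    tokens = review.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def attack_success_rate(clean, injected, threshold):
    """Share of papers rated below `threshold` without injection
    that reach or exceed it once the hidden instruction is present.

    `threshold` stands in for an acceptance-level rating (assumption).
    """
    pairs = [(c, i) for c, i in zip(clean, injected) if c < threshold]
    if not pairs:
        return 0.0
    return sum(i >= threshold for _, i in pairs) / len(pairs)
```

Under these assumptions, a positive `mean_rating_gap` on the weaker submissions would correspond to the overrating the abstract reports, a lower `type_token_ratio` for LLM reviews to the reduced lexical diversity, and `attack_success_rate` to the fraction of low-scoring papers promoted to acceptance-level ratings.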

Eval Frameworks & Benchmarks; Natural Language Processing; Red-Teaming & Adversarial Robustness

Year: 2026

Venue: N/A
