CISPAHKUSTSUSTechWestlakeApr 29, 2026arXiv:2604.26506

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

Yuan Xin, Yixuan Weng, Minjun Zhu, Ying Ling, Chengwei Qin, Michael Hahn, Michael Backes, Yue Zhang, Linyi Yang

AI Summary

This paper introduces SafeReview, a GAN-based framework to defend LLM-based academic peer review systems against adversarial hidden prompts. A Generator model crafts adversarial prompts, while a Defender model detects them, with joint optimization inspired by Information Retrieval GANs. SafeReview demonstrates enhanced resilience to novel attacks compared to static defenses, bolstering the integrity of LLM-augmented peer review.

Key Contribution

LLM-based peer review systems can be made significantly more robust against adversarial manipulation via a co-evolutionary GAN approach that anticipates novel attacks.

Abstract

As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instructions embedded in submissions to manipulate outcomes -- emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework where a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with their detection. This system is trained using a loss function inspired by Information Retrieval Generative Adversarial Networks, which fosters a dynamic co-evolution between the two models, forcing the Defender to develop robust capabilities against continuously improving attack strategies. The resulting framework demonstrates significantly enhanced resilience to novel and evolving threats compared to static defenses, thereby establishing a critical foundation for securing the integrity of peer review.

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

Related Papers