This paper introduces SafeReview, a GAN-based framework to defend LLM-based academic peer review systems against adversarial hidden prompts. A Generator model crafts adversarial prompts, while a Defender model detects them, with joint optimization inspired by Information Retrieval GANs. SafeReview demonstrates enhanced resilience to novel attacks compared to static defenses, bolstering the integrity of LLM-augmented peer review.
LLM-based peer review systems can be made significantly more robust against adversarial manipulation via a co-evolutionary GAN approach that anticipates novel attacks.
As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts (instructions embedded in submissions to manipulate review outcomes) emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework in which a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with detecting them. The system is trained with a loss function inspired by Information Retrieval Generative Adversarial Networks, which fosters a dynamic co-evolution between the two models and forces the Defender to develop robust capabilities against continuously improving attack strategies. The resulting framework demonstrates significantly enhanced resilience to novel and evolving threats compared to static defenses, thereby establishing a critical foundation for securing the integrity of peer review.
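The abstract does not spell out the training objective. As a rough sketch only, a minimax objective in the spirit of IRGAN could take the form below. Every symbol is an assumption introduced for illustration and is not taken from the paper: $D_\phi(x)$ is the Defender's estimated probability that text $x$ contains an injected prompt, $G_\theta(a \mid p)$ is the Generator's distribution over attack prompts $a$ for a submission $p$, $p \oplus a$ denotes the submission with the attack embedded, and $\mathcal{P}$ is the set of clean submissions.

```latex
% Hypothetical objective; notation is illustrative, not the authors'.
% The Defender (phi) maximizes J: low D on clean text, high D on attacked text.
% The Generator (theta) minimizes J: it seeks attacks that the Defender misses.
\[
  J(\theta, \phi) =
  \mathbb{E}_{p \sim \mathcal{P}}
  \Big[
    \log\big(1 - D_\phi(p)\big)
    + \mathbb{E}_{a \sim G_\theta(\cdot \mid p)}
      \big[ \log D_\phi(p \oplus a) \big]
  \Big],
  \qquad
  \min_{\theta} \max_{\phi} \; J(\theta, \phi)
\]

% Attack prompts are discrete text, so the Generator cannot be updated by
% backpropagating through samples. IRGAN handles this with a policy-gradient
% (REINFORCE) estimate, which here would read:
\[
  \nabla_\theta J =
  \mathbb{E}_{p \sim \mathcal{P}} \,
  \mathbb{E}_{a \sim G_\theta(\cdot \mid p)}
  \big[
    \nabla_\theta \log G_\theta(a \mid p) \,
    \log D_\phi(p \oplus a)
  \big]
\]
```

Under this reading, alternating gradient ascent on $\phi$ with gradient descent on $\theta$ would produce the co-evolution the abstract describes: each Defender improvement changes the reward signal that the Generator's next attacks are trained against.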