The paper introduces AEGIS, a new benchmark for evaluating the forensic analysis of AI-generated academic images. It addresses limitations in existing benchmarks through domain-specific complexity, diverse forgery simulations, and multi-dimensional forensic evaluation. Results show that even advanced models such as GPT-5.1 struggle with the benchmark's complexity (48.80% overall performance), and that forensic techniques lag behind generative advances: 11 of the 25 generative models studied yield average forensic accuracy below 50%. The benchmark also reveals complementary strengths between model families, with multimodal large language models (MLLMs) excelling in textual artifact recognition and expert detectors in binary authenticity detection.
GPT-5.1 reaches only 48.80% overall performance on forensic analysis of AI-generated academic images, falling short of 50% and highlighting a stark gap between generative capabilities and forensic detection.
We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.
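For readers unfamiliar with the localization metric quoted above, the following is a minimal sketch of the standard intersection-over-union (IoU) computation between a predicted forgery mask and a ground-truth mask. The abstract does not say how AEGIS computes IoU, so the nested-list mask representation, the function name `mask_iou`, and the handling of two empty masks are all assumptions made for illustration.

```python
def mask_iou(pred, truth):
    """Intersection over union of two same-shaped binary masks.

    Masks are nested lists of 0/1 values (an assumed representation,
    not taken from the paper).
    """
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1  # pixel flagged in both masks
            if p or t:
                union += 1  # pixel flagged in at least one mask

    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if union == 0 else intersection / union


# Hypothetical 3x4 masks: each flags 4 pixels, and they overlap on 2.
pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
truth = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
example_iou = mask_iou(pred, truth)  # 2 shared pixels / 6 flagged pixels
```

In this toy case the two masks share 2 of the 6 pixels flagged by either, giving an IoU of about 33%. That is close to the 30.09% reported for expert models, which suggests how loosely a predicted region at that level overlaps the true forged region.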