BUPTApr 30, 2026arXiv:2604.28177

AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

Bo Zhang, Tzu-Yen Ma, Zichen Tang, Junpeng Ding, Zirui Wang, Yizhuo Zhao, Peilin Gao, Zijie Xi, Zixin Ding, Haiyang Sun, Haocheng Gao, Yuan Liu, Liangjia Wang, Yiling Huang, Yujie Wang, Yuyue Zhang, Ronghui Xi, Yuanze Li, Jiacheng Liu, Zhongjun Yang, Haihong E

AI Summary

The paper introduces AEGIS, a new benchmark for evaluating the forensic analysis of AI-generated academic images, addressing limitations in existing benchmarks through domain-specific complexity, diverse forgery simulations, and multi-dimensional forensic evaluation. Results show that even advanced models like GPT-5.1 struggle with the benchmark's complexity (48.80% overall performance), and that forensic techniques lag behind generative advances, with 11 generative models yielding average forensic accuracy below 50%. The benchmark reveals complementary strengths between model families, with MLLMs excelling in textual artifact recognition and expert detectors in binary authenticity detection.

Key Contribution

GPT-5.1 can barely crack 50% accuracy when distinguishing real from AI-generated academic images, highlighting a stark gap between generative capabilities and forensic detection.

Abstract

We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.

Computer Vision Data Curation & Synthetic Data Eval Frameworks & Benchmarks

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

Related Papers