Reality Inc. | Feb 24, 2026 | arXiv:2602.20569

AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents

Jiaqi Wu, Yuchen Zhou, Muduo Xu, Zisheng Liang, Simiao Ren, Jiayu Xue, Meige Yang, Siying Chen, Jingheng Huan

AI Summary

The paper introduces AIForge-Doc, a benchmark dataset for evaluating the detection of AI-forged tampering in financial and form documents, focusing specifically on diffusion-model-based inpainting. The dataset comprises 4,061 forged images generated with Gemini 2.5 Flash Image and Ideogram v2 Edit from four public document datasets, with pixel-level annotations of the tampered regions. Benchmarking existing detectors (TruFor, DocTamper, GPT-4o) shows that all three perform substantially worse than they do on conventionally edited forgeries, highlighting the challenge AI-forged documents pose to current forensic methods.

Key Contribution

Existing document forgery detectors degrade sharply on AI-inpainted documents: on the new benchmark, TruFor reaches AUC=0.751, DocTamper reaches AUC=0.563 with pixel-level IoU=0.020, and a zero-shot GPT-4o judge scores 0.509, essentially at chance.

Abstract

We present AIForge-Doc, the first dedicated benchmark targeting exclusively diffusion-model-based inpainting in financial and form documents with pixel-level annotation. Existing document forgery datasets rely on traditional digital editing tools (e.g., Adobe Photoshop, GIMP), creating a critical gap: state-of-the-art detectors are blind to the rapidly growing threat of AI-forged document fraud. AIForge-Doc addresses this gap by systematically forging numeric fields in real-world receipt and form images using two AI inpainting APIs -- Gemini 2.5 Flash Image and Ideogram v2 Edit -- yielding 4,061 forged images from four public document datasets (CORD, WildReceipt, SROIE, XFUND) across nine languages, annotated with pixel-precise tampered-region masks in DocTamper-compatible format. We benchmark three representative detectors -- TruFor, DocTamper, and a zero-shot GPT-4o judge -- and find that all existing methods degrade substantially: TruFor achieves AUC=0.751 (zero-shot, out-of-distribution) vs. AUC=0.96 on NIST16; DocTamper achieves AUC=0.563 vs. AUC=0.98 in-distribution, with pixel-level IoU=0.020; GPT-4o achieves only 0.509 -- essentially at chance -- confirming that AI-forged values are indistinguishable to automated detectors and VLMs. These results demonstrate that AIForge-Doc represents a qualitatively new and unsolved challenge for document forensics.
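The abstract reports two kinds of scores: image-level AUC and pixel-level IoU. As a rough illustration of what those numbers measure, here is a minimal sketch in plain Python. It is not the authors' evaluation code; the function names `pixel_iou` and `image_auc`, the toy masks, and the toy scores are all assumptions made for this example, and the benchmark's actual thresholds and mask handling are not specified here.

```python
from typing import Sequence


def pixel_iou(pred_mask: Sequence[Sequence[int]],
              true_mask: Sequence[Sequence[int]]) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def image_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a forged image (label 1) receives a
    higher score than an authentic one (label 0); ties count as 0.5."""
    forged = [s for s, y in zip(scores, labels) if y == 1]
    authentic = [s for s, y in zip(scores, labels) if y == 0]
    if not forged or not authentic:
        raise ValueError("need at least one forged and one authentic image")
    wins = 0.0
    for f in forged:
        for a in authentic:
            if f > a:
                wins += 1.0
            elif f == a:
                wins += 0.5
    return wins / (len(forged) * len(authentic))


# Toy data (hypothetical): 1 marks a tampered pixel.
predicted = [[0, 1, 1],
             [0, 0, 0]]
ground_truth = [[0, 0, 1],
                [0, 0, 1]]
iou = pixel_iou(predicted, ground_truth)  # 1 shared pixel out of 3 marked

# Toy detector scores for two forged and two authentic images.
auc = image_auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
```

Under this reading, an AUC near 0.5 means the detector ranks forged and authentic images no better than a coin flip, and an IoU near 0 means the predicted tampered region barely overlaps the annotated mask.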

Computer Vision | Data Curation & Synthetic Data | Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
