Mar 12, 2026 · arXiv:2603.12208

ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models

Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao

AI Summary

ForensicZip is a training-free framework for compressing visual tokens in forensic vision-language models. It targets a weakness of existing pruning methods, which discard background regions where manipulation traces often reside. It models temporal token evolution as a Birth-Death Optimal Transport problem, quantifying physical discontinuities that indicate generative artifacts. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves a 2.97× speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.

Key Contribution

At 10% token retention, forgery-driven token compression cuts over 90% of FLOPs in forensic vision-language models while maintaining state-of-the-art detection performance.

Abstract

Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves 2.97× speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
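To make the transport idea in the abstract concrete, here is a minimal sketch of one way a dummy-node formulation could be set up. It is not the authors' implementation. The cosine cost, the entropic (Sinkhorn) solver, the slack cost `tau`, the regularization `eps`, the mixing weight `alpha`, and the function names `birth_mass` and `select_tokens` are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math


def cosine_distance(x, y):
    """1 - cosine similarity. Assumed cost; the abstract does not name one."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny + 1e-12)


def birth_mass(prev_tokens, cur_tokens, tau=0.3, eps=0.05, n_iters=200):
    """Mass each current-frame token receives from the dummy node.

    Previous tokens are sources, current tokens are targets, and one
    dummy row and column absorb unmatched mass at slack cost tau.
    A value near 1 means the token has no cheap match in the previous
    frame (a "birth"); a value near 0 means it was matched.
    """
    n_prev, n_cur = len(prev_tokens), len(cur_tokens)
    # Augmented cost: real-to-real costs, plus tau to or from the dummy.
    cost = [[cosine_distance(p, c) for c in cur_tokens] + [tau]
            for p in prev_tokens]
    cost.append([tau] * n_cur + [0.0])  # dummy row; dummy-to-dummy is free
    # Marginals: unit mass per token; the dummy balances the totals.
    a = [1.0] * n_prev + [float(n_cur)]
    b = [1.0] * n_cur + [float(n_prev)]
    # Gibbs kernel for entropic regularization (hypothetical eps value).
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * (n_prev + 1)
    v = [1.0] * (n_cur + 1)
    for _ in range(n_iters):  # standard Sinkhorn scaling updates
        u = [a[i] / sum(K[i][j] * v[j] for j in range(n_cur + 1))
             for i in range(n_prev + 1)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n_prev + 1))
             for j in range(n_cur + 1)]
    # Transport plan entries on the dummy row, one per current token.
    return [u[n_prev] * K[n_prev][j] * v[j] for j in range(n_cur)]


def select_tokens(birth, high_freq, keep_ratio=0.1, alpha=0.5):
    """Keep the top-scoring tokens; the weighted sum is an assumption."""
    scores = [alpha * n + (1 - alpha) * h for n, h in zip(birth, high_freq)]
    k = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Two tokens persist across frames; the third appears only in the new frame.
prev = [[1.0, 0.0], [0.0, 1.0]]
cur = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
birth = birth_mass(prev, cur)
kept = select_tokens(birth, high_freq=[0.0, 0.0, 0.0], keep_ratio=0.3)
```

In this toy case the third current token has no close counterpart in the previous frame, so it draws its mass from the dummy node and ranks first. The high-frequency prior is passed in as a precomputed per-token score because the abstract does not say how it is derived.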

Computer Vision · Inference & Quantization · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 47

Year: 2026

Venue: N/A
