PostsTelecommunications Institute of TechnologyMay 4, 2026arXiv:2605.02223

Toward Fine-Grained Speech Inpainting Forensics:A Dataset, Method, and Metric for Multi-Region Tampering Localization

Tung Vu, Yen Nguyen, Hai Nguyen, Cuong Pham, Cong Tran

AI Summary

The paper introduces MIST, a multilingual dataset for fine-grained speech inpainting forensics, featuring utterances with 1-3 independently inpainted word-level segments. To address the challenge of localizing these multiple tampered regions, they propose ISA, an iterative segment analysis framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement. They also introduce SF1@tau, a segment-level F1 metric based on temporal IoU, demonstrating that ISA outperforms existing methods in detecting and localizing these subtle speech manipulations.

Key Contribution

Existing deepfake detectors crumble when faced with realistic, multi-region speech inpainting, leaving a gaping vulnerability that this work begins to address.

Abstract

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity - an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near zero fake probability to MIST utterances where only 2-7% of content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Speech & Audio

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Toward Fine-Grained Speech Inpainting Forensics:A Dataset, Method, and Metric for Multi-Region Tampering Localization

Related Papers