This paper investigates the impact of labeling codec-resynthesized audio as either bonafide or spoof in audio deepfake detection, given the dual nature of neural audio codecs for both compression and speech synthesis. The authors construct a challenging extension of the ASVspoof 5 dataset to analyze the effects of different labeling strategies on detection performance. The study provides insights into how labeling choices influence the effectiveness of audio deepfake detection systems.
Conflicting labels on codec-resynthesized audio can significantly impact audio deepfake detection, highlighting a critical challenge in dataset creation and model training.
Since Text-to-Speech systems typically don't produce waveforms directly, recent spoof detection studies use resynthesized waveforms from vocoders and neural audio codecs to simulate an attacker. Unlike vocoders, which are specifically designed for speech synthesis, neural audio codecs were originally developed for compressing audio for storage and transmission. However, their ability to discretize speech also sparked interest in language-modeling-based speech synthesis. Owing to this dual functionality, codec-resynthesized data may be labeled as either bonafide or spoof. So far, very little research has addressed this issue. In this study, we present a challenging extension of the ASVspoof 5 dataset constructed for this purpose. We examine how different labeling choices affect detection performance and provide insights into labeling strategies.
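To make the labeling question concrete, the following is a minimal sketch of how the two choices for codec-resynthesized audio could be expressed when building training labels. It is an illustration under stated assumptions, not the authors' pipeline: the names `Source`, `CodecPolicy`, `Utterance`, and `assign_label`, the utterance IDs, and the three-way source split are all hypothetical and do not come from the paper or from the ASVspoof 5 tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    """Hypothetical provenance tags for an utterance (not from the paper)."""
    ORIGINAL = "original"        # untouched human recording
    CODEC_RESYNTH = "codec"      # audio encoded and decoded by a neural audio codec
    SYNTHESIZED = "synthesized"  # output of a speech synthesis attack


class CodecPolicy(Enum):
    """The two labeling choices for codec-resynthesized audio."""
    AS_BONAFIDE = "bonafide"  # view the codec as a compression channel
    AS_SPOOF = "spoof"        # view the codec as a synthesis component


@dataclass(frozen=True)
class Utterance:
    utt_id: str
    source: Source


def assign_label(utt: Utterance, policy: CodecPolicy) -> str:
    """Return 'bonafide' or 'spoof'; only codec audio depends on the policy."""
    if utt.source is Source.ORIGINAL:
        return "bonafide"
    if utt.source is Source.SYNTHESIZED:
        return "spoof"
    return policy.value


# Hypothetical toy manifest with one utterance of each kind.
utterances = [
    Utterance("u1", Source.ORIGINAL),
    Utterance("u2", Source.CODEC_RESYNTH),
    Utterance("u3", Source.SYNTHESIZED),
]

labels_as_bonafide = {
    u.utt_id: assign_label(u, CodecPolicy.AS_BONAFIDE) for u in utterances
}
labels_as_spoof = {
    u.utt_id: assign_label(u, CodecPolicy.AS_SPOOF) for u in utterances
}
```

Under this sketch, only the codec-resynthesized utterance changes label between the two policies, which is the source of the conflicting labels the study examines.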