JHU · Apr 14, 2026 · arXiv:2604.13229

ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

Aurosweta Mahapatra, Ismail Rasim Ulgen, Kong Aik Lee, Nicholas Andrews, Berrak Sisman

AI Summary

This paper introduces ProSDD, a two-stage framework for speech deepfake detection that leverages supervised masked prediction of speaker-conditioned prosodic variation (pitch, voice activity, energy) learned from real speech. By explicitly modeling prosodic variability, ProSDD enhances model embeddings to better distinguish between natural and spoofed speech, especially under expressive and emotional attacks. Experiments demonstrate that ProSDD significantly outperforms baselines on ASVspoof 2019/2024 and EmoFake/EmoSpoof-TTS datasets, achieving up to 50% relative error reduction on emotional spoofing attacks.
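The masked-prediction idea can be illustrated with a minimal sketch. It is not the authors' implementation: the loss form (mean squared error), the three-value frame layout (pitch, voice activity, energy), the toy values, and the function name `masked_prediction_loss` are all assumptions made for illustration. The only point it shows is that the error is scored on the hidden frames alone.

```python
def masked_prediction_loss(targets, predictions, mask):
    """Mean squared error over masked frames only (illustrative, not the paper's loss).

    targets, predictions: sequences of (pitch, voice_activity, energy) tuples.
    mask: sequence of booleans, True where the frame was hidden from the model.
    """
    total, count = 0.0, 0
    for target_frame, predicted_frame, is_masked in zip(targets, predictions, mask):
        if not is_masked:
            continue  # unmasked frames do not contribute to the loss
        for target_value, predicted_value in zip(target_frame, predicted_frame):
            total += (predicted_value - target_value) ** 2
            count += 1
    return total / count if count else 0.0


# Hypothetical toy data: four frames of (pitch in Hz, voice activity, energy).
targets = [(120.0, 1.0, 0.8), (0.0, 0.0, 0.1), (135.0, 1.0, 0.9), (140.0, 1.0, 0.7)]
mask = [False, True, True, False]
# Predictions on unmasked frames are deliberately wrong to show they are ignored.
predictions = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1), (133.0, 1.0, 0.9), (0.0, 0.0, 0.0)]

loss = masked_prediction_loss(targets, predictions, mask)
print(f"masked loss: {loss:.4f}")
```

In a two-stage setup like the one described, a term of this kind would be minimized on real speech first and then combined with a spoof-classification loss; how the two are weighted is not stated in this summary.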

Key Contribution

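Learning prosodic variability from real speech, then jointly optimizing it with spoof classification, cuts speech deepfake error rates by about 50% (relative) on the emotional attack sets EmoFake and EmoSpoof-TTS, a blind spot for current detectors.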

Abstract

Speech deepfake detection (SDD) systems perform well on standard benchmark datasets but often fail to generalize to expressive and emotional spoofing attacks. Many methods rely on spoof-heavy training data, learning dataset-specific artifacts rather than transferable cues of natural speech. In contrast, humans internalize variability in real speech and detect fakes as deviations from it. We introduce ProSDD, a two-stage framework that enriches model embeddings through supervised masked prediction of speaker-conditioned prosodic variation based on pitch, voice activity, and energy. Stage I learns prosodic variability from real speech, and Stage II jointly optimizes this objective with spoof classification. ProSDD consistently outperforms baselines under both ASVspoof 2019 and 2024 training, reducing ASVspoof 2024 EER from 25.43% to 16.14% (2019-trained) and from 39.62% to 7.38% (2024-trained), while achieving 50% relative reductions on EmoFake and EmoSpoof-TTS.
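The abstract reports absolute EERs for ASVspoof 2024 and relative reductions for the emotional datasets, so it can help to put both on the same scale. The short calculation below converts the two reported ASVspoof 2024 EER pairs into relative reductions; the helper name `relative_reduction` is introduced here for illustration, and the inputs are the figures quoted in the abstract.

```python
def relative_reduction(baseline_eer, new_eer):
    """Relative error reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# EER values (in percent) on ASVspoof 2024, as quoted in the abstract.
reductions = {
    "2019-trained": relative_reduction(25.43, 16.14),
    "2024-trained": relative_reduction(39.62, 7.38),
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative EER reduction")
```

This works out to roughly 36.5% (2019-trained) and 81.4% (2024-trained), which are separate results from the 50% relative reductions reported on EmoFake and EmoSpoof-TTS.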

Red-Teaming & Adversarial Robustness · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
