The paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model for deepfake speech detection that incorporates acoustic chain-of-thought reasoning. This is achieved by injecting structured textual representations of low-level acoustic features into the model prompt, grounding the model's reasoning in interpretable evidence. Experiments on a novel dataset demonstrate that CoLMbo-DF outperforms existing audio language model baselines in detection accuracy and explainability, despite being a smaller model.
Grounding audio language models with acoustic feature representations unlocks more accurate and explainable deepfake detection, even with smaller models.
Deepfake speech detection systems are often limited to binary classification and struggle to generate interpretable reasoning or context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection and make little meaningful use of structured acoustic evidence such as prosodic, spectral, and physiological attributes. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio pairs accompanied by chain-of-thought annotations. Experiments show that our method, trained on a lightweight open-source language model, significantly outperforms existing audio language model baselines despite its smaller scale, marking an advance in explainable deepfake speech detection.
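The idea of injecting structured textual representations of acoustic features into the prompt can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the two features (RMS energy and zero-crossing rate), the `name: value` text layout, the instruction wording, and the helper names `features_to_text` and `build_prompt` are all hypothetical. The actual CoLMbo-DF feature set and prompt template may differ substantially.

```python
import math

def rms_energy(samples):
    """Root-mean-square amplitude of the waveform."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)

def features_to_text(features):
    """Render features as one 'name: value' line each, sorted by name.

    Hypothetical layout; the paper does not specify this format here.
    """
    return "\n".join(
        f"{name}: {value:.3f}" for name, value in sorted(features.items())
    )

def build_prompt(features):
    """Prepend the textual feature block to a detection instruction.

    The instruction wording is an assumption, not the authors' template.
    """
    return (
        "Acoustic features:\n"
        f"{features_to_text(features)}\n"
        "Reason step by step over these features, then answer 'real' or 'fake'."
    )

# Toy waveform: a 440 Hz sine sampled at 16 kHz for 10 ms (illustrative only).
sample_rate = 16000
samples = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(160)]

features = {
    "rms_energy": rms_energy(samples),
    "zero_crossing_rate": zero_crossing_rate(samples),
}
prompt = build_prompt(features)
print(prompt)
```

In a real pipeline the resulting string would accompany the audio input to the language model, so that the generated chain of thought can cite named, human-readable measurements rather than only latent embeddings.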