Mar 30, 2026 | arXiv:2603.28021

Audio Language Model for Deepfake Detection Grounded in Acoustic Chain-of-Thought

Runkun Chen, Run Chen, Yixiong Fang, Peng Chang, Pengyu Chang, Yuante Li, Massa Baali, Bhiksha Ramakrishnan, B. Ramakrishnan

AI Summary

The paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model for deepfake speech detection that incorporates acoustic chain-of-thought reasoning. This is achieved by injecting structured textual representations of low-level acoustic features into the model prompt, grounding the model's reasoning in interpretable evidence. Experiments on a novel dataset demonstrate that CoLMbo-DF outperforms existing audio language model baselines in detection accuracy and explainability, despite being a smaller model.

Key Contribution

Grounding audio language models with acoustic feature representations unlocks more accurate and explainable deepfake detection, even with smaller models.

Abstract

Deepfake speech detection systems are often limited to binary classification and struggle to generate interpretable reasoning or context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to make meaningful use of structured acoustic evidence such as prosodic, spectral, and physiological attributes. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio pairs with chain-of-thought annotations. Experiments show that our method, trained on a lightweight open-source language model, significantly outperforms existing audio language model baselines despite its smaller scale, marking an advance in explainable deepfake speech detection.
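The idea of injecting a textual rendering of acoustic features into the prompt can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the feature names and units, the helper names `format_acoustic_features` and `build_prompt`, and the wording of the prompt itself.

```python
from typing import Mapping


def format_acoustic_features(features: Mapping[str, float]) -> str:
    """Render numeric features as one '- name: value' line each, sorted by name."""
    return "\n".join(
        f"- {name}: {value:.2f}" for name, value in sorted(features.items())
    )


def build_prompt(features: Mapping[str, float]) -> str:
    """Place the feature block ahead of the question so the answer can cite it."""
    return (
        "Acoustic features (hypothetical names and units):\n"
        f"{format_acoustic_features(features)}\n\n"
        "Reason step by step from the features above, then answer: "
        "is this speech real or fake?"
    )


# Hypothetical feature values; a real system would compute these from the audio.
example = {
    "f0_mean_hz": 182.4,
    "jitter_percent": 0.31,
    "spectral_centroid_hz": 2150.0,
}
print(build_prompt(example))
```

The sketch covers only the text side of the prompt. How the actual model combines this text with audio embeddings is not described in the abstract and is left out here.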

Natural Language Processing | Reasoning & Chain-of-Thought | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
