This paper investigates reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), as a way to fine-tune speech deepfake detection models for better generalization. The authors find that pure GRPO-based fine-tuning outperforms both supervised fine-tuning (SFT) and hybrid setups on out-of-domain test sets while maintaining performance on target-domain data. Ablation studies suggest that the negative reward in GRPO may be a key factor in this improvement.
RL fine-tuning with GRPO improves the generalization of speech deepfake detectors to unseen attacks, outperforming SFT-only and hybrid setups on out-of-domain test sets.
Building speech deepfake detection models that are generalizable to unseen attacks remains a challenging problem. Although the field has shifted toward a pre-training and fine-tuning paradigm using speech foundation models, most approaches rely solely on supervised fine-tuning (SFT). Inspired by the field of large language models, wherein reinforcement learning (RL) is used for model fine-tuning, we investigate the impact of RL, specifically Group Relative Policy Optimization (GRPO). The results from experiments using multiple detectors and test sets indicate that pure GRPO-based fine-tuning improves performance on out-of-domain test sets while maintaining performance on target-domain test data. This approach outperforms both SFT-only and hybrid setups. Our ablation studies further suggest that the negative reward in GRPO may be a key factor in this improvement.
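For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage it is built on: several outputs are sampled for the same input, each is scored with a reward, and the rewards are standardized within the group. This is a minimal illustration and not the authors' implementation. The abstract does not specify the reward design, so the +1/-1 values, the label strings, and the function names `binary_reward` and `group_relative_advantages` are assumptions made here for illustration, as is the use of the population standard deviation.

```python
from statistics import mean, pstdev


def binary_reward(predicted_label, true_label):
    """Hypothetical reward: +1 for a correct decision, -1 for a wrong one.

    The paper's actual reward design is not given in the abstract;
    these values are placeholders.
    """
    return 1.0 if predicted_label == true_label else -1.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled outputs.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population standard deviation is an assumption;
    implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One utterance whose true label is "spoof", with four sampled decisions
# (hypothetical data, for illustration only).
true_label = "spoof"
sampled_decisions = ["spoof", "bonafide", "spoof", "spoof"]

rewards = [binary_reward(d, true_label) for d in sampled_decisions]
advantages = group_relative_advantages(rewards)

for decision, reward, adv in zip(sampled_decisions, rewards, advantages):
    print(f"{decision:9s} reward={reward:+.1f} advantage={adv:+.3f}")
```

In this toy group, the one incorrect decision receives a negative advantage and the three correct ones receive positive advantages, so a policy-gradient update would push probability away from the former and toward the latter. If every sampled decision in a group earns the same reward, all advantages are zero and that group contributes no learning signal.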