Hangzhou Dianzi University | Jun 9, 2026 | arXiv:2606.10365

KFC-KWS: Keyframe Fusion with CTC for User-Defined Keyword Spotting

AI Summary

This paper introduces KFC-KWS, a multimodal framework for user-defined keyword spotting (KWS) that targets the problem of distinguishing target keywords from phonetically similar alternatives. The method uses the peaky posteriors of connectionist temporal classification (CTC) to select high-confidence phoneme frames (keyframes) and fuses them with full-utterance representations through cross-attention. On the LibriPhrase dataset, KFC-KWS reports the best-balanced performance at 98.73% AUC, and on the hard subset it reaches 97.65% AUC and 7.75% EER, outperforming the baselines it is compared against on confusable keywords.

Key Contribution

KFC-KWS reaches 98.73% AUC on LibriPhrase for user-defined keyword spotting, and 97.65% AUC with 7.75% EER on the hard subset of confusable keywords.

Abstract

User-defined keyword spotting (KWS) enables personalized voice interaction by detecting user-specified keywords. A key challenge in this task is distinguishing target keywords from phonetically confusable alternatives. To address this challenge, we propose KFC-KWS, a multimodal framework that leverages connectionist temporal classification (CTC)-guided keyframe selection. Specifically, we exploit the peaky posterior distributions of CTC to identify high-confidence phoneme frames, enabling precise alignment across audio, phoneme, and text modalities. These keyframes are then fused with full-utterance representations through cross-attention to capture both local discriminative cues and global contextual information. On LibriPhrase, KFC-KWS achieves the best-balanced performance (98.73% AUC) and substantially outperforms advanced baselines on the challenging hard subset (97.65% AUC and 7.75% EER), demonstrating its effectiveness in discriminating between highly confusable keywords.
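
The two steps named in the abstract, CTC-guided keyframe selection and cross-attention fusion, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the selection rule or the attention layout, so everything below is an assumption made for illustration. In particular, the blank index `BLANK = 0`, the confidence `threshold`, the rule of keeping one peak frame per run of identical labels, the use of utterance frames as queries and keyframes as keys/values, and the function names `select_keyframes` and `cross_attention` are all hypothetical.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank symbol


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def select_keyframes(posteriors, threshold=0.5, blank=BLANK):
    """Pick one high-confidence frame per run of identical non-blank labels.

    posteriors: (T, V) array of per-frame CTC posteriors.
    Returns the indices of the selected frames (an illustrative rule,
    not necessarily the rule used in the paper).
    """
    labels = posteriors.argmax(axis=1)
    conf = posteriors.max(axis=1)
    keyframes = []
    t, num_frames = 0, len(labels)
    while t < num_frames:
        # Extend the run while the most likely label stays the same.
        end = t
        while end + 1 < num_frames and labels[end + 1] == labels[t]:
            end += 1
        if labels[t] != blank:
            best = t + int(conf[t:end + 1].argmax())
            if conf[best] >= threshold:
                keyframes.append(best)
        t = end + 1
    return np.array(keyframes, dtype=int)


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


# Toy posteriors over {blank, phoneme 1, phoneme 2} for six frames.
posteriors = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.95, 0.03, 0.02],
    [0.10, 0.10, 0.80],
    [0.60, 0.30, 0.10],
])
keyframes = select_keyframes(posteriors)

# Random stand-in for frame-level acoustic features of dimension 4.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))

# Assumed layout: every utterance frame attends over the keyframes.
fused, weights = cross_attention(features, features[keyframes], features[keyframes])
```

On the toy input, frames 1 and 4 are selected: frame 2 repeats the label of frame 1 with lower confidence, and the remaining frames are blank. A real system would take the posteriors from a trained CTC phoneme recognizer and would use learned query, key, and value projections, typically with multiple heads.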

Multimodal Models Speech & Audio
