This paper introduces KFC-KWS, a multimodal framework for user-defined keyword spotting (KWS) that targets the problem of distinguishing target keywords from phonetically similar alternatives. The method uses connectionist temporal classification (CTC) to guide keyframe selection: it identifies high-confidence phoneme frames and fuses them with full-utterance representations through cross-attention. On the LibriPhrase dataset, KFC-KWS reports the best-balanced performance (98.73% AUC) and, on the hard subset, 97.65% AUC and 7.75% EER, substantially outperforming the baselines it is compared against on confusable keywords.
KFC-KWS achieves 98.73% AUC on LibriPhrase for user-defined keyword spotting, and 97.65% AUC on the hard subset of confusable keywords.
User-defined keyword spotting (KWS) enables personalized voice interaction by detecting user-specified keywords. A key challenge in this task is distinguishing target keywords from phonetically confusable alternatives. To address this challenge, we propose KFC-KWS, a multimodal framework that leverages connectionist temporal classification (CTC)-guided keyframe selection. Specifically, we exploit the peaky posterior distributions of CTC to identify high-confidence phoneme frames, enabling precise alignment across audio, phoneme, and text modalities. These keyframes are then fused with full-utterance representations through cross-attention to capture both local discriminative cues and global contextual information. On LibriPhrase, KFC-KWS achieves the best-balanced performance (98.73% AUC) and substantially outperforms advanced baselines on the challenging hard subset (97.65% AUC and 7.75% EER), demonstrating its effectiveness in discriminating between highly confusable keywords.
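To make the two steps in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a per-frame CTC posterior matrix with the blank symbol at index 0, treats a keyframe as the most confident frame within each run of identical non-blank predictions, and keeps it only if it clears a confidence threshold. The function names `select_keyframes` and `cross_attention`, the threshold value, the single-head unprojected attention, and the choice of keyframes as queries over the full utterance are all illustrative assumptions; the abstract does not specify any of them.

```python
import numpy as np


def select_keyframes(posteriors, blank=0, threshold=0.5):
    """Pick one high-confidence frame per run of identical non-blank labels.

    posteriors: (T, V) array of per-frame CTC probabilities.
    Returns (frame_indices, labels). Hypothetical rule, assumed for illustration.
    """
    labels = posteriors.argmax(axis=1)   # greedy per-frame label
    conf = posteriors.max(axis=1)        # confidence of that label
    frames, picked = [], []
    t, num_frames = 0, len(labels)
    while t < num_frames:
        # Extend `end` to cover the run of frames sharing the same label.
        end = t
        while end + 1 < num_frames and labels[end + 1] == labels[t]:
            end += 1
        if labels[t] != blank:
            # The peak of the run is the candidate keyframe.
            best = t + int(conf[t:end + 1].argmax())
            if conf[best] >= threshold:
                frames.append(best)
                picked.append(int(labels[t]))
        t = end + 1
    return np.array(frames, dtype=int), np.array(picked, dtype=int)


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights


# Toy posteriors: 6 frames, vocabulary {0: blank, 1, 2}.
posteriors = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.95, 0.03, 0.02],
    [0.30, 0.30, 0.40],
    [0.05, 0.05, 0.90],
])
frames, labels = select_keyframes(posteriors)

# Toy frame-level features for the full utterance (6 frames, 4 dims).
rng = np.random.default_rng(0)
utterance = rng.standard_normal((6, 4))

# Assumed direction: keyframe features query the full-utterance features.
fused, weights = cross_attention(utterance[frames], utterance, utterance)
```

In this toy input, frames 1 and 5 are selected: frame 1 is the peak of the run predicting label 1, and frame 5 is the peak of the run predicting label 2. Each row of `weights` shows how one keyframe distributes attention over the whole utterance, which is one way to combine local cues with global context.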