This paper introduces an end-to-end multi-channel keyword spotting (KWS) framework that leverages spatial cues and directional priors to improve robustness in noisy environments. The system uses a spatial encoder to learn inter-channel features and a spatial embedding to incorporate directional priors, which are then processed by a streaming backbone. Experiments demonstrate that both spatial modeling and directional priors improve performance over baselines in simulated noisy conditions, with their combination yielding the best results.
Spatial audio cues and directional priors can be learned jointly, end to end, to improve keyword spotting accuracy in noisy environments, avoiding the lack of joint optimization that limits conventional cascaded pipelines.
Keyword spotting (KWS) is crucial for many speech-driven applications, but robust KWS in noisy environments remains challenging. Conventional systems often rely on single-channel inputs and a cascaded pipeline that separates front-end enhancement from KWS. This separation precludes joint optimization and inherently limits performance. We present an end-to-end multi-channel KWS framework that exploits spatial cues to improve noise robustness. A spatial encoder learns inter-channel features, while a spatial embedding injects directional priors; the fused representation is processed by a streaming backbone. Experiments in simulated noisy conditions across multiple signal-to-noise ratios (SNRs) show that spatial modeling and directional priors each yield clear gains over baselines, with their combination achieving the best results. These findings validate end-to-end multi-channel spatial modeling and indicate strong potential for target-speaker-aware detection in complex acoustic scenarios.
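The abstract does not say how its noisy conditions are simulated. As a rough illustration only, the sketch below shows one common way to mix a clean signal with noise at a chosen SNR, by scaling the noise so that the power ratio matches the target. The function names, the synthetic sine "speech", the Gaussian noise, and the SNR values are all assumptions made for this example, not details taken from the paper.

```python
import math
import random


def power(signal):
    """Mean power (mean of squared samples) of a sequence of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so speech power / noise power equals `snr_db`, then add.

    Hypothetical helper for illustration. For a multi-channel recording one
    would presumably apply the same gain to every channel so that
    inter-channel cues are left intact (an assumption, not from the paper).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = power(speech)
    p_noise = power(noise)
    if p_noise == 0.0:
        raise ValueError("noise is silent; SNR is undefined")
    # SNR_dB = 10 * log10(P_speech / (g^2 * P_noise)), solved for the gain g.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measured_snr_db(speech, scaled_noise):
    """SNR in dB actually realised between the clean signal and scaled noise."""
    return 10.0 * math.log10(power(speech) / power(scaled_noise))


# Synthetic stand-ins: a 440 Hz tone as "speech" and white Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# Illustrative SNR grid; the paper's actual SNR values are not given here.
for snr_db in (-5, 0, 5, 10):
    mixture, scaled = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(measured_snr_db(speech, scaled), 6))
```

Scaling the noise rather than the speech keeps the keyword at a fixed level across conditions, which makes results at different SNRs easier to compare. Either choice gives the same power ratio.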