This paper introduces Gaze-Guided Audio-Visual Speech Enhancement (GG-AVSE), a framework that uses the listener's gaze direction to select the target speaker in multi-talker audio-visual speech enhancement. A Gaze-Guided Visual Module (GG-VM) combines gaze signals with a YOLO5Face detector to extract the target speaker's facial features, which are integrated into the pretrained AVSEMamba model via zero-shot merging or partial visual fine-tuning. Experiments on the new AVSEC2-Gaze dataset show that GG-AVSE outperforms gaze-free baselines, with a 23.69% relative improvement in SI-SDR (9.16 to 11.33).
Gaze is an effective cue for resolving the cocktail party problem: using it to select the target speaker improves SI-SDR in audio-visual speech enhancement by over 23% relative to gaze-free baselines.
This paper presents a Gaze-Guided Audio-Visual Speech Enhancement (GG-AVSE) framework to address the cocktail party problem. A major challenge in conventional AVSE is identifying the listener's intended speaker in multi-talker environments. GG-AVSE addresses this issue by exploiting gaze direction as a supervisory cue for target-speaker selection. Specifically, we propose the GG-VM module, which combines gaze signals with a YOLO5Face detector to extract the target speaker's facial features and integrates them with the pretrained AVSEMamba model through two strategies: zero-shot merging and partial visual fine-tuning. For evaluation, we introduce the AVSEC2-Gaze dataset. Experimental results show that GG-AVSE achieves substantial relative gains over gaze-free baselines: a 10.08% improvement in PESQ (2.370 to 2.609), a 5.18% improvement in STOI (0.8802 to 0.9258), and a 23.69% improvement in SI-SDR (9.16 to 11.33). These results confirm that gaze provides an effective cue for resolving target-speaker ambiguity and highlight the scalability of GG-AVSE for real-world applications.
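The percentage gains quoted above appear to be relative improvements over the baseline score, that is, (enhanced - baseline) / baseline. The short sketch below recomputes them from the score pairs given in the abstract. The function name `relative_improvement` and the dictionary layout are illustrative choices, not taken from the paper.

```python
def relative_improvement(baseline: float, enhanced: float) -> float:
    """Percentage change of `enhanced` relative to `baseline`."""
    return (enhanced - baseline) / baseline * 100.0


# (gaze-free baseline, GG-AVSE) score pairs as quoted in the abstract.
reported = {
    "PESQ": (2.370, 2.609),
    "STOI": (0.8802, 0.9258),
    "SI-SDR": (9.16, 11.33),
}

# Relative improvement per metric, rounded to two decimals
# to match the precision used in the abstract.
improvements = {
    name: round(relative_improvement(base, enh), 2)
    for name, (base, enh) in reported.items()
}

for name, pct in improvements.items():
    print(f"{name}: {pct:.2f}%")
```

Under this reading, the recomputed values agree with the 10.08%, 5.18%, and 23.69% figures stated in the abstract.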