Mar 8, 2026 | arXiv:2603.07696

Multi-View Based Audio Visual Target Speaker Extraction

AI Summary

This paper introduces Multi-View Tensor Fusion (MVTF), a novel framework for Audio-Visual Target Speaker Extraction (AVTSE) that leverages multi-view lip videos to improve performance, especially in non-frontal view scenarios. MVTF learns cross-view correlations by modeling multiplicative interactions between different views of input lip embeddings using pairwise outer products during training. Experimental results demonstrate that MVTF achieves significant performance gains in both single-view and multi-view input scenarios, enhancing overall performance and robustness.

Key Contribution

By training on synchronized multi-view lip videos, the framework can extract a target speaker's voice even from challenging, non-frontal video angles.

Abstract

Audio-Visual Target Speaker Extraction (AVTSE) aims to separate a target speaker's voice from a mixed audio signal using the corresponding visual cues. Most existing AVTSE methods rely exclusively on frontal-view videos, which restricts their robustness in real-world scenarios where non-frontal views are prevalent. Such non-frontal perspectives often contain complementary articulatory information that could enhance speech extraction. In this work, we propose Multi-View Tensor Fusion (MVTF), a novel framework that transforms multi-view learning into single-view performance gains. During the training stage, we leverage synchronized multi-perspective lip videos to learn cross-view correlations through MVTF, where pairwise outer products explicitly model multiplicative interactions between different views of input lip embeddings. At the inference stage, the system supports both single-view and multi-view inputs. Experimental results show that with single-view inputs, our framework leverages multi-view knowledge to achieve significant performance gains, while with multi-view inputs, it further improves overall performance and robustness. Our demo, code and data are available at https://anonymous.4open.science/w/MVTF-Gridnet-209C/
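
The pairwise outer product mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name pairwise_outer_fusion, the embedding shapes (T frames, D dimensions per view), and the choice to flatten and concatenate the per-pair results are all assumptions made for illustration. The abstract does not say how the fused tensor is projected afterwards or how single-view inference is handled, so neither is shown here.

```python
from itertools import combinations

import numpy as np


def pairwise_outer_fusion(views):
    """Fuse per-view lip embeddings with pairwise outer products.

    views: list of arrays, each of shape (T, D), one per camera view
           (hypothetical layout; the paper's actual shapes are not given).
    Returns an array of shape (T, P * D * D), where P is the number of
    unordered view pairs.
    """
    fused = []
    for a, b in combinations(views, 2):
        # Per-frame outer product: every dimension of view a multiplies
        # every dimension of view b, giving a (T, D, D) tensor.
        outer = np.einsum("td,te->tde", a, b)
        # Flatten the D x D interaction matrix of each frame.
        fused.append(outer.reshape(a.shape[0], -1))
    # Concatenate the interaction features of all view pairs.
    return np.concatenate(fused, axis=-1)


# Illustrative usage with random stand-ins for three views' embeddings.
rng = np.random.default_rng(0)
demo_views = [rng.standard_normal((4, 8)) for _ in range(3)]
demo_fused = pairwise_outer_fusion(demo_views)
print(demo_fused.shape)  # (4, 192): 3 pairs x 8 x 8 features per frame
```

Because the fused size grows with D squared per pair, a practical system would likely use small embeddings or a learned projection after fusion; that design choice is an assumption here, not something the abstract states.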

Computer Vision | Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
