Kuaishou · Mar 3, 2026 · arXiv:2603.02877

DBMIF: a deep balanced multimodal iterative fusion framework for air- and bone-conduction speech enhancement

Yilei Wu, Changyan Zheng, Xingyu Zhang, Yakun Zhang, Chengshi Zheng, Shuang Yang, Ye Yan, Erwei Yin

AI Summary

The paper introduces DBMIF, a deep learning framework for speech enhancement that fuses air-conduction (AC) and bone-conduction (BC) audio signals, targeting extremely low-SNR environments where AC microphones are overwhelmed by ambient noise. DBMIF uses a three-branch architecture in which an iterative attention module and a cross-branch gated module adaptively weight the AC and BC modalities and exchange information between them in both directions. Experiments show that DBMIF is competitive with recent unimodal and multimodal baselines in speech quality and intelligibility across diverse noise types, and that in downstream ASR tasks it reduces the character error rate by at least 2.5 percent compared with competing approaches.
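For readers unfamiliar with the metric, character error rate (CER) is conventionally the character-level edit distance between a reference transcript and the ASR output, divided by the reference length. The sketch below illustrates that standard definition only; the function names `edit_distance` and `cer` are illustrative and are not taken from the paper or its repository. The listing does not state whether the reported 2.5 percent is an absolute or a relative reduction, so the sketch makes no assumption about that.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

A lower CER on transcripts of enhanced audio indicates that the enhancement preserved more of the content that the recognizer needs.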

Key Contribution

By fusing air- and bone-conducted audio through adaptive, bidirectional cross-modal interaction, DBMIF delivers speech enhancement that is competitive with recent baselines, even when conventional microphones are drowned out by noise.

Abstract

The performance of conventional speech enhancement systems degrades sharply in extremely low signal-to-noise ratio (SNR) environments where air-conduction (AC) microphones are overwhelmed by ambient noise. Although bone-conduction (BC) sensors offer complementary, noise-tolerant information, existing fusion approaches struggle to maintain consistent performance across a wide range of SNR conditions. To address this limitation, we propose the Deep Balanced Multimodal Iterative Fusion Framework (DBMIF), a three-branch architecture designed to reconstruct high-fidelity speech through rigorous cross-modal interaction. Specifically, grounded in a multi-scale interactive encoder-decoder backbone, the framework orchestrates an iterative attention module and a cross-branch gated module to facilitate adaptive weighting and bidirectional exchange. To complement this dynamic interaction, a balanced-interaction bottleneck is further integrated to learn a compact, stable fused representation. Extensive experiments demonstrate that DBMIF achieves competitive performance compared with recent unimodal and multimodal baselines in both speech quality and intelligibility across diverse noise types. In downstream ASR tasks, the proposed method reduces the character error rate by at least 2.5 percent compared to competing approaches. These results confirm that DBMIF effectively harnesses the robustness of BC speech while preserving the naturalness of AC speech, ensuring reliability in real-world scenarios. The source code is publicly available at github.com/wyl516w/dbmif.
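The adaptive weighting that the abstract attributes to the cross-branch gated module can be illustrated with a generic gated-fusion sketch. This is not the DBMIF implementation (see the authors' repository for that): the function `gated_fusion` and the parameters `w_ac`, `w_bc`, and `bias` are hypothetical names chosen for illustration, and the gate here is a simple per-dimension sigmoid rather than the learned module described in the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(ac, bc, w_ac, w_bc, bias):
    """Blend AC and BC feature vectors with a per-dimension gate.

    gate  = sigmoid(w_ac * ac + w_bc * bc + bias)
    fused = gate * ac + (1 - gate) * bc

    A gate near 1 trusts the AC feature; a gate near 0 falls back to BC.
    All arguments are equal-length sequences of floats (hypothetical layout).
    """
    if not (len(ac) == len(bc) == len(w_ac) == len(w_bc) == len(bias)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for a, b, wa, wb, c in zip(ac, bc, w_ac, w_bc, bias):
        gate = sigmoid(wa * a + wb * b + c)
        fused.append(gate * a + (1.0 - gate) * b)
    return fused
```

With zero weights and bias, the gate is 0.5 and the output is the plain average of the two modalities. In a trained model the gate would be expected to shift toward the BC branch as the AC signal becomes noisier, which is the intuition behind adaptive weighting across SNR conditions.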

Multimodal Models · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 53

Year: 2026

Venue: N/A
