PolyU · Apr 28, 2026 · arXiv:2604.25624

UNet-Based Fusion and Exponential Moving Average Adaptation for Noise-Robust Speaker Recognition

Chong-Xin Gan, Peter Bell, Man-Wai Mak, Zhe Li, Zezhong Jin, Zilong Huang, Kong Aik Lee, K. Lee

AI Summary

This paper introduces a scalable U-Net-based Fusion framework (UF-EMA) that integrates noisy and enhanced speech as multi-channel inputs to improve speaker recognition in challenging acoustic environments. By incorporating an Exponential Moving Average strategy into the speaker encoder pre-trained on clean speech, the method effectively reduces overfitting and enhances robustness during the transition from clean to noisy conditions. Experimental results demonstrate that UF-EMA significantly outperforms existing approaches on various noise-contaminated test sets, highlighting its effectiveness in maintaining speaker information during speech enhancement.

Key Contribution

Speaker recognition accuracy in noisy conditions improves when a U-Net-based fusion of noisy and enhanced speech is combined with an Exponential Moving Average adaptation of a speaker encoder pre-trained on clean speech.

Abstract

Joint training of speech enhancement and speaker embedding networks is widely adopted for speaker recognition in noisy acoustic environments. While effective, this paradigm often fails to leverage the generalization and robustness benefits inherent in large-scale speech enhancement pre-training. Moreover, preserving speaker information in the denoised speech is not an explicit objective of the speech enhancement process. To address these limitations, we propose a scalable UNet-based Fusion framework (UF-EMA) that treats the noisy and enhanced speech as a multi-channel input, thereby enabling the speaker encoder to exploit speaker information effectively. In addition, an Exponential Moving Average strategy is applied to a speaker encoder pre-trained on clean speech to mitigate overfitting and facilitate a smooth transition from clean to noisy conditions. Experimental results on multiple noise-contaminated test sets demonstrate the superiority of the proposed approach.
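
The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which parameters are averaged, what decay value is used, or how the channels are laid out, so the function names `stack_channels` and `ema_update`, the default `decay` of 0.999, and the list-based data layout below are all assumptions chosen for illustration. The sketch only shows the generic form of an exponential moving average over parameters and the stacking of a noisy signal with its enhanced counterpart as a two-channel input.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def stack_channels(noisy: List[float], enhanced: List[float]) -> List[List[float]]:
    """Stack a noisy signal and its enhanced version as a 2-channel input.

    Hypothetical helper: the result has shape (2, T), with channel 0 holding
    the noisy signal and channel 1 holding the enhanced signal.
    """
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced signals must have the same length")
    return [list(noisy), list(enhanced)]


def ema_update(ema_params: Params, new_params: Params, decay: float = 0.999) -> Params:
    """Return the generic EMA of parameters: decay * ema + (1 - decay) * new.

    Assumption: ema_params starts as the weights pre-trained on clean speech
    and new_params holds the weights after a training step on noisy data.
    A new dictionary is returned and the inputs are left unchanged.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    updated: Params = {}
    for name, new_values in new_params.items():
        old_values = ema_params[name]
        updated[name] = [
            decay * old + (1.0 - decay) * new
            for old, new in zip(old_values, new_values)
        ]
    return updated


# Toy usage with made-up numbers (not data from the paper).
two_channel = stack_channels([0.1, 0.2, 0.3], [0.0, 0.2, 0.25])
clean_weights = {"w": [1.0, 2.0]}
noisy_step_weights = {"w": [0.0, 4.0]}
blended = ema_update(clean_weights, noisy_step_weights, decay=0.5)
```

With a decay close to 1, the averaged weights move only slowly away from the clean-speech starting point, which is one way to read the abstract's "smooth transition from clean to noisy conditions".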

Architecture Design (Transformers, SSMs, MoE) · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 36

Year: 2026

Venue: N/A
