Mar 31, 2026 | arXiv:2603.29097

Asymmetric Encoder-Decoder Based on Time-Frequency Correlation for Speech Separation

AI Summary

The paper introduces SR-CorrNet, an asymmetric encoder-decoder framework for speech separation that uses a separation-reconstruction (SepRe) strategy within a TF dual-path backbone to address the information bottleneck of late-split architectures. SR-CorrNet formulates speech separation as a structured correlation-to-filter problem, using spatio-spectro-temporal correlations to estimate deep filters for target signal recovery. Experiments on WSJ0-2/3/4/5Mix, WHAMR!, and LibriCSS show SR-CorrNet achieves consistent improvements in anechoic, noisy-reverberant, and real-recorded conditions, demonstrating the effectiveness of TF-domain SepRe with correlation-based filter estimation.

Key Contribution

By beginning speaker disentanglement in the encoder rather than deferring it to the final stage, SR-CorrNet avoids the information bottleneck of late-split speech separation models, leading to improved performance in challenging acoustic environments.

Abstract

Speech separation in realistic acoustic environments remains challenging because overlapping speakers, background noise, and reverberation must be resolved simultaneously. Although recent time-frequency (TF) domain models have shown strong performance, most still rely on late-split architectures, where speaker disentanglement is deferred to the final stage, creating an information bottleneck and weakening discriminability under adverse conditions. To address this issue, we propose SR-CorrNet, an asymmetric encoder-decoder framework that introduces the separation-reconstruction (SepRe) strategy into a TF dual-path backbone. The encoder performs coarse separation from mixture observations, while the weight-shared decoder progressively reconstructs speaker-discriminative features with cross-speaker interaction, enabling stage-wise refinement. To complement this architecture, we formulate speech separation as a structured correlation-to-filter problem: spatio-spectro-temporal correlations computed from the observations are used as input features, and the corresponding deep filters are estimated to recover target signals. We further incorporate an attractor-based dynamic split module to adapt the number of output streams to the actual speaker configuration. Experimental results on WSJ0-2/3/4/5Mix, WHAMR!, and LibriCSS demonstrate consistent improvements across anechoic, noisy-reverberant, and real-recorded conditions in both single- and multi-channel settings, highlighting the effectiveness of TF-domain SepRe with correlation-based filter estimation for speech separation.
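The correlation-to-filter idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not give the exact correlation definition, the filter support, or the network that maps one to the other, so the function names (`stack_neighbors`, `tf_correlation`, `apply_deep_filter`), the tap layout, and the choice of reference channel are all assumptions made for illustration. The sketch shows only the two fixed ends of the pipeline: computing correlations between each TF bin and its spatial, spectral, and temporal neighbours, and applying a multi-tap complex filter over the same neighbourhood.

```python
import numpy as np

def stack_neighbors(X, n_past=2, n_freq=1):
    """Gather neighbouring TF bins for every bin of a multi-channel STFT.

    X: complex array of shape (C, F, T) (channels, frequencies, frames).
    Returns an array of shape (C, K, F, T), K = (n_past + 1) * (2 * n_freq + 1),
    whose entry [c, k, f, t] is X[c, f + df, t - dt] (zero outside the STFT).
    The tap ordering is an illustrative assumption.
    """
    C, F, T = X.shape
    padded = np.pad(X, ((0, 0), (n_freq, n_freq), (n_past, 0)))
    taps = []
    for dt in range(n_past + 1):              # current and past frames
        for df in range(-n_freq, n_freq + 1):  # adjacent frequency bins
            f0, t0 = n_freq + df, n_past - dt
            taps.append(padded[:, f0:f0 + F, t0:t0 + T])
    return np.stack(taps, axis=1)

def tf_correlation(X, ref=0, n_past=2, n_freq=1):
    """Instantaneous correlation of each neighbour with the reference-channel bin.

    Output shape (C, K, F, T). In a real model these would be the input
    features to the separation network (assumed, not taken from the paper).
    """
    N = stack_neighbors(X, n_past, n_freq)
    return N * np.conj(X[ref])[None, None]

def apply_deep_filter(W, X, n_past=2, n_freq=1):
    """Apply a multi-tap complex filter W of shape (C, K, F, T) to X.

    Each output bin is a weighted sum over channels and neighbouring bins:
    Y[f, t] = sum_{c, k} conj(W[c, k, f, t]) * X[c, f + df_k, t - dt_k].
    """
    N = stack_neighbors(X, n_past, n_freq)
    return np.sum(np.conj(W) * N, axis=(0, 1))

# Toy data standing in for a 2-channel mixture STFT.
rng = np.random.default_rng(0)
C, F, T = 2, 9, 12
X = rng.standard_normal((C, F, T)) + 1j * rng.standard_normal((C, F, T))

n_past, n_freq = 2, 1
R = tf_correlation(X, ref=0, n_past=n_past, n_freq=n_freq)

# A filter that passes only the reference channel's current bin.
K = (n_past + 1) * (2 * n_freq + 1)
centre = n_freq  # tap index for dt = 0, df = 0 under the ordering above
W = np.zeros((C, K, F, T), dtype=complex)
W[0, centre] = 1.0
Y = apply_deep_filter(W, X, n_past=n_past, n_freq=n_freq)
```

In a separation model, a network would predict one filter `W` per speaker from the correlation features `R`; the hand-built identity filter above only confirms that the filtering step reduces to the reference channel when a single tap is active.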

Architecture Design (Transformers, SSMs, MoE) | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
