Tsinghua AI | CUHK | Apr 21, 2026 | arXiv:2604.19635

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

Shuhai Peng, Hui Lu, Jinjiang Liu, Liyang Chen, Guiping Zhong, Jiakui Li, Huimeng Wang, Haiyun Li, Liang Cao, Shiyin Kang, Zhiyong Wu

AI Summary

This paper introduces a novel Chunk-wise Interleaved Splicing Paradigm to adapt autoregressive (AR) generative models for streaming target speaker extraction (TSE), addressing the performance degradation typically seen when applying these models to real-time scenarios. The method incorporates a historical context refinement mechanism to maintain coherence between extracted speech segments, mitigating boundary discontinuities. Experiments on Libri2Mix demonstrate that the proposed approach maintains stability and intelligibility at low latencies, achieving results comparable to or surpassing offline baselines with a Real-Time-Factor of 0.248 on consumer GPUs.
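The summary does not spell out the exact token layout, so the following is only a minimal sketch of what chunk-wise interleaving could look like, assuming the mixture and the extracted speech are both represented as discrete token sequences and are spliced chunk by chunk into one autoregressive sequence. The function names (`split_into_chunks`, `interleave_chunks`), the chunk size, and the ordering (mixture chunk first, then its target chunk) are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def split_into_chunks(tokens: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Cut a token sequence into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def interleave_chunks(
    mixture_chunks: Sequence[Sequence[int]],
    target_chunks: Sequence[Sequence[int]],
) -> List[int]:
    """Splice chunks into one flat sequence: mix_1, tgt_1, mix_2, tgt_2, ...

    ASSUMPTION: this ordering is illustrative only; the paper's actual
    layout (special tokens, enrollment prompt, etc.) is not given here.
    """
    if len(mixture_chunks) != len(target_chunks):
        raise ValueError("mixture and target must have the same number of chunks")
    spliced: List[int] = []
    for mix, tgt in zip(mixture_chunks, target_chunks):
        spliced.extend(mix)  # the model first sees the newly arrived mixture chunk
        spliced.extend(tgt)  # then the extracted tokens for that same chunk
    return spliced


if __name__ == "__main__":
    mixture = [1, 2, 3, 4]      # hypothetical mixture tokens
    target = [10, 20, 30, 40]   # hypothetical target-speaker tokens
    sequence = interleave_chunks(
        split_into_chunks(mixture, 2),
        split_into_chunks(target, 2),
    )
    print(sequence)
```

Under this assumed layout, the model can begin emitting target tokens for a chunk as soon as that chunk of the mixture has arrived, rather than waiting for the whole utterance.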

Key Contribution

Autoregressive generative models, previously unsuitable for real-time target speaker extraction, can now achieve offline-level performance in streaming scenarios thanks to a novel chunk-wise splicing technique.

Abstract

While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic performance degradation at inference time due to the severe mismatch between training and streaming inference. To bridge this gap, we present the first autoregressive (AR) models tailored for streaming TSE. Our approach introduces a Chunk-wise Interleaved Splicing Paradigm that ensures highly efficient and stable streaming inference. To ensure coherence between the extracted speech segments, we design a historical context refinement mechanism that mitigates boundary discontinuities by leveraging historical information. Experiments on Libri2Mix show that while the AR generative baseline exhibits performance degradation at low latencies, our approach maintains 100% stability and superior intelligibility. Furthermore, our streaming results are comparable to or even surpass offline baselines. Additionally, our model achieves a Real-Time-Factor (RTF) of 0.248 on consumer-level GPUs. This work provides empirical evidence that AR generative backbones are viable for latency-sensitive applications through the Chunk-wise Interleaved Splicing Paradigm.
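To make the reported RTF concrete, the sketch below uses the standard definition of Real-Time-Factor (processing time divided by the duration of the audio processed); the abstract does not state how the measurement was taken. The helper names are hypothetical, and the 10-second example is illustrative arithmetic rather than a result from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time(audio_seconds: float, rtf: float) -> float:
    """Compute time implied by a given RTF for a given amount of audio."""
    return audio_seconds * rtf


def keeps_up_with_real_time(rtf: float) -> bool:
    """An RTF below 1 means audio is processed faster than it arrives."""
    return rtf < 1.0


if __name__ == "__main__":
    reported_rtf = 0.248  # value stated in the abstract
    # Illustrative only: compute time implied for 10 s of audio at this RTF.
    print(processing_time(10.0, reported_rtf))
    print(keeps_up_with_real_time(reported_rtf))
```

At an RTF of 0.248, each second of input audio needs roughly a quarter of a second of compute, which leaves headroom for streaming on the hardware the authors describe.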

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 22

Year: 2026

Venue: N/A
