The paper introduces VorTEX, a text-prompted target speech extraction (TSE) architecture featuring a Decoupled Adaptive Multi-branch (DAM) Fusion block to improve performance across varying speech overlap ratios. To facilitate evaluation, the authors created PORTE, a two-speaker dataset with controlled overlap from 0% to 100%, and SuRE, a metric to detect suppression artifacts. Experiments demonstrate that VorTEX outperforms existing models in separation fidelity across overlap ratios from 20% to 100% while avoiding suppression artifacts.
Existing target speech extraction models falter under realistic speech overlap conditions, but VorTEX maintains high separation fidelity without suppression artifacts.
Target speech extraction (TSE) aims to recover a target speaker's voice from a mixture. While recent text-prompted approaches have shown promise, most assume fully overlapped mixtures, which limits insight into how models behave at realistic overlap ratios. We introduce VorTEX (Various overlap ratio for Target speech EXtraction), a text-prompted TSE architecture with a Decoupled Adaptive Multi-branch (DAM) Fusion block that separates primary extraction from auxiliary regularization pathways. To enable controlled analysis, we construct PORTE, a two-speaker dataset spanning overlap ratios from 0% to 100%. We further propose Suppression Ratio on Energy (SuRE), a diagnostic metric that detects suppression behavior not captured by conventional measures. Experiments show that existing models exhibit suppression or residual interference under overlap, whereas VorTEX achieves the highest separation fidelity at overlap ratios from 20% to 100% (e.g., 5.50 dB at 20% and 2.04 dB at 100%) while maintaining zero SuRE, indicating robust extraction without suppression-driven artifacts.
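To make the idea of a controlled overlap ratio concrete, the following is a minimal sketch of how two utterances might be mixed at a chosen ratio. It is not the PORTE construction procedure, which the abstract does not describe. The function name `mix_with_overlap`, the plain-list signal representation, and the definition of overlap ratio as the overlapped duration divided by the length of the shorter utterance are all assumptions made for illustration.

```python
def mix_with_overlap(target, interferer, overlap_ratio):
    """Mix two mono signals (lists of floats) at a controlled overlap ratio.

    Assumption (not from the paper): overlap_ratio is the number of
    overlapped samples divided by the length of the shorter signal.
    The target starts at sample 0 and the interferer is delayed so that
    it overlaps the tail of the target.
    """
    if not 0.0 <= overlap_ratio <= 1.0:
        raise ValueError("overlap_ratio must be in [0, 1]")

    shorter = min(len(target), len(interferer))
    overlap = round(overlap_ratio * shorter)  # overlapped samples
    offset = len(target) - overlap            # interferer start index

    total = max(len(target), offset + len(interferer))
    mixture = [0.0] * total

    # Sum both signals into the mixture buffer.
    for i, sample in enumerate(target):
        mixture[i] += sample
    for i, sample in enumerate(interferer):
        mixture[offset + i] += sample

    return mixture, offset


if __name__ == "__main__":
    # Toy constant signals make the overlapped region easy to see.
    target = [1.0] * 10
    interferer = [1.0] * 4
    for ratio in (0.0, 0.5, 1.0):
        mixture, offset = mix_with_overlap(target, interferer, ratio)
        overlapped = sum(1 for x in mixture if x == 2.0)
        print(ratio, offset, len(mixture), overlapped)
```

Under this assumed definition, a ratio of 0 places the two utterances back to back and a ratio of 1 places the shorter utterance entirely inside the longer one, which corresponds to the fully overlapped setting that the abstract says most prior work assumes.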