Georgia Tech; Norwegian University of Science and Technology; University of Palermo

Apr 30, 2026

arXiv:2604.27403

A Knowledge-Driven Approach to Target Speech Extraction in the Presence of Background Sound Effects for Cinematic Audio Source Separation (CASS)

Chun-wei Ho, Sabato Marco Siniscalchi, Kai Li, Chin-Hui Lee

AI Summary

This paper introduces a knowledge-driven approach to target speech extraction in cinematic audio. Manners of articulation detected in speech frames are assembled into a knowledge vector that is used as part of the features for speech separation. Experiments on the Sound Demixing Challenge data for cinematic audio source separation (CASS) show that using this articulator-aware knowledge yields better separation than using no such knowledge, especially for speech segments buried in background sound events.

Key Contribution

Unbury speech from cinematic sound effects by teaching the model to "listen" for how words are formed.

Abstract

We propose a knowledge-driven approach to target speech extraction in the presence of background sound effects already recorded in cinematic audio. The knowledge sources studied are manners of articulation, which are detected in speech frames and assembled into a knowledge vector that serves as part of the features for speech separation and target speech extraction. The motivation is that some short speech segments are often difficult to separate from mixed background sounds. Tests on the recent Sound Demixing Challenge data for cinematic audio source separation (CASS) show that using articulator-aware knowledge sources produces better separation results than using no knowledge, especially for speech segments buried in unspecified background sound events.
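The abstract describes the knowledge vector only at a high level, so the following is a minimal sketch of one way per-frame manner-of-articulation scores could be appended to spectral features. It is not the authors' implementation. The manner inventory, the softmax normalization, the feature dimensions, and the function names `build_knowledge_vector` and `augment_features` are all assumptions introduced for illustration, and random arrays stand in for a real mixture spectrogram and a real manner detector.

```python
import numpy as np

# Hypothetical manner-of-articulation inventory; the source does not list its classes.
MANNER_CLASSES = ["vowel", "stop", "fricative", "nasal", "approximant", "silence"]


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def build_knowledge_vector(manner_logits):
    """Map per-frame detector scores (T, M) to per-frame posteriors (T, M).

    Assumption: the knowledge vector is a posterior over manner classes.
    """
    return softmax(manner_logits, axis=-1)


def augment_features(spectral_frames, knowledge_vector):
    """Concatenate the knowledge vector onto each frame's spectral features."""
    if spectral_frames.shape[0] != knowledge_vector.shape[0]:
        raise ValueError("spectral frames and knowledge vector must share a frame count")
    return np.concatenate([spectral_frames, knowledge_vector], axis=-1)


rng = np.random.default_rng(0)
num_frames, num_bins = 100, 257  # assumed sizes, not taken from the paper

# Stand-ins for a mixture magnitude spectrogram and manner detector outputs.
spectral_frames = np.abs(rng.standard_normal((num_frames, num_bins)))
manner_logits = rng.standard_normal((num_frames, len(MANNER_CLASSES)))

knowledge = build_knowledge_vector(manner_logits)
augmented = augment_features(spectral_frames, knowledge)
print(augmented.shape)  # (100, 263)
```

In this sketch the separation model would consume `augmented` in place of `spectral_frames`. Whether the paper concatenates the vector at the input or injects it elsewhere in the network is not stated in the abstract.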

Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
