The authors introduce DRES, a 1.5-hour Dutch speech dataset recorded from 80 speakers in noisy, public indoor environments using a four-channel linear microphone array. The dataset is designed as a test set for evaluating speech enhancement (SE) and automatic speech recognition (ASR) models in realistic conditions. In experiments on DRES, five of the eight ASR models tested achieved word error rates (WERs) below 22%, but the five single-channel SE algorithms evaluated did not improve ASR performance, highlighting the need for realistic evaluation scenarios.
Modern single-channel speech enhancement algorithms may not improve ASR performance in realistic noisy environments, challenging assumptions about their effectiveness in real-world applications.
We present DRES: a 1.5-hour Dutch realistic elicited (semi-spontaneous) speech dataset from 80 speakers recorded in noisy, public indoor environments. DRES was designed as a test set for the evaluation of state-of-the-art (SOTA) automatic speech recognition (ASR) and speech enhancement (SE) models in a real-world scenario: a person speaking in a public indoor space with background talkers and noise. The speech was recorded with a four-channel linear microphone array. In this work we evaluate the speech quality of five well-known single-channel SE algorithms and the recognition performance of eight SOTA off-the-shelf ASR models before and after applying SE to the DRES speech. We found that five of the eight ASR models have word error rates (WERs) lower than 22% on DRES, despite the challenging conditions. In contrast to recent work, we did not find a positive effect of modern single-channel SE on ASR performance, emphasizing the importance of evaluating in realistic conditions.
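The before-and-after comparison described above depends on how WER is scored. The following is a minimal sketch, assuming the standard definition of WER (word-level edit distance divided by the number of reference words) and plain whitespace tokenization. The function names and the toy Dutch transcripts are hypothetical and are not taken from the DRES paper, whose text normalization and scoring tools may differ.

```python
from typing import Iterable, List, Tuple


def word_errors(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    total_errors = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_tokens = reference.split()
        total_errors += word_errors(ref_tokens, hypothesis.split())
        total_ref_words += len(ref_tokens)
    if total_ref_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_ref_words


# Hypothetical toy data: (reference, ASR output) on the raw and on the enhanced audio.
raw_pairs = [("de trein vertrekt om negen uur", "de trein vertrekt om negen uur")]
enhanced_pairs = [("de trein vertrekt om negen uur", "de trein vertrek om uur")]

wer_raw = corpus_wer(raw_pairs)
wer_enhanced = corpus_wer(enhanced_pairs)
print(f"WER raw: {wer_raw:.3f}, WER after SE: {wer_enhanced:.3f}")
```

In this toy case the enhanced audio yields a higher WER than the raw audio, which is the kind of outcome the comparison is meant to expose. Pooling errors over the whole corpus, rather than averaging per-utterance WERs, keeps short utterances from dominating the score.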