May 5, 2026 | arXiv:2605.03934

Towards Open World Sound Event Detection

Pham Hoang Hai, Le Trong Minh, Le Hoang Son, Lê Hoàng

AI Summary

This paper introduces Open-World Sound Event Detection (OW-SED), a paradigm for sound event detection that drops the closed-world assumption: models must detect known events, identify unseen ones, and incrementally learn from them. To address the challenges of OW-SED, the authors propose the Open-World Deformable Sound Event Detection Transformer (WOOT), a framework that combines 1D deformable attention, feature disentanglement, a one-to-many matching strategy, and a diversity loss. According to the abstract, WOOT performs marginally better than leading existing methods in closed-world settings and significantly outperforms baselines in open-world scenarios.

Key Contribution

Defines the OW-SED paradigm and proposes the WOOT framework, which detects known sound events, identifies unseen ones, and learns from them incrementally rather than assuming a fixed, curated set of classes.

Abstract

Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework incorporating feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss to enhance representation diversity. Experimental results demonstrate that our method achieves marginally superior performance compared to existing leading techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.
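To make the idea of 1D deformable attention concrete, the following is a minimal illustrative sketch in plain Python, not the authors' implementation. It assumes the general deformable-attention recipe: each query has a reference position on the time axis, attends to a small number of sampling points at fractional offsets from that position, reads features there by linear interpolation, and combines them with softmax-normalised weights. In a real model the offsets and scores would be predicted from the query by learned layers; here they are passed in directly, and all function and parameter names (`linear_sample`, `deformable_attention_1d`, `ref_point`, `offsets`, `scores`) are hypothetical.

```python
import math


def linear_sample(seq, pos):
    """Linearly interpolate a sequence of feature vectors at a fractional time index."""
    # Clamp the position so samples outside the clip reuse the boundary frames.
    pos = min(max(pos, 0.0), len(seq) - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(seq) - 1)
    frac = pos - lo
    return [(1.0 - frac) * a + frac * b for a, b in zip(seq[lo], seq[hi])]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attention_1d(seq, ref_point, offsets, scores):
    """Weighted sum of features sampled at ref_point + offset for each offset.

    seq       : list of T feature vectors (each a list of floats)
    ref_point : fractional time index the query is centred on
    offsets   : sampling offsets (assumed to be predicted elsewhere)
    scores    : unnormalised attention scores, one per offset
    """
    weights = softmax(scores)
    out = [0.0] * len(seq[0])
    for off, w in zip(offsets, weights):
        sample = linear_sample(seq, ref_point + off)
        for d, value in enumerate(sample):
            out[d] += w * value
    return out


if __name__ == "__main__":
    frames = [[0.0], [1.0], [2.0], [3.0]]  # toy 1-D features over 4 time steps
    print(deformable_attention_1d(frames, 1.0, [-0.5, 0.5], [0.0, 0.0]))
```

Because only a handful of positions are sampled per query, the cost of this operation depends on the number of sampling points rather than on the full sequence length, which is the usual motivation for deformable attention over dense attention.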

Computer Vision, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2026

Venue: N/A
