Stanford HAI | Mar 10, 2026 | arXiv:2603.09853

SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases

AI Summary

SCENEBench, a new benchmark suite, is introduced to evaluate Large Audio Language Models (LALMs) on tasks beyond speech recognition, focusing on spatial audio understanding, cross-lingual speech, environmental sounds, and vocal characteristics. The benchmark uses both synthetically constructed and real-world audio samples to assess model performance and latency across these tasks. Evaluation of five state-of-the-art LALMs reveals significant performance variations and critical gaps in their ability to understand complex audio scenes.

Key Contribution

Current Large Audio Language Models (LALMs) excel at speech recognition but struggle with basic audio understanding tasks such as noise localization and cross-lingual speech, with performance on some tasks falling below random chance.

Abstract

Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. These four categories are selected based on understudied needs from accessibility technology and industrial noise monitoring. In addition to task performance, we measure model latency. The purpose of this benchmark suite is to assess audio understanding beyond what words are said: how they are said, and the non-speech components of the audio. Because our audio samples are synthetically constructed (e.g., by overlaying two natural audio samples), we further validate our benchmark against 20 natural audio items per task, sub-sampled from existing datasets to match our task criteria, to assess ecological validity. We assess five state-of-the-art LALMs and find critical gaps: performance varies across tasks, with models scoring below random chance on some tasks and achieving high accuracy on others. These results provide direction for targeted improvements in model capabilities.
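
The abstract mentions constructing samples by overlaying two natural audio samples but does not give the mixing procedure. The following is only a minimal sketch of one common way to do such an overlay, assuming mono float samples in [-1, 1] at a shared sample rate. The function names `rms` and `overlay` and the `snr_db` parameter are hypothetical and are not taken from the paper.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def overlay(foreground, background, snr_db):
    """Mix background under foreground at a target signal-to-noise ratio (dB).

    Assumption: both inputs are lists of float samples at the same sample
    rate. The background is looped or truncated to the foreground length.
    """
    if not foreground or not background:
        raise ValueError("both signals must be non-empty")
    n = len(foreground)
    # Loop the background so it covers the whole foreground clip.
    bg = [background[i % len(background)] for i in range(n)]
    bg_rms = rms(bg)
    if bg_rms == 0.0:
        return list(foreground)  # silent background: nothing to add
    # Gain that makes rms(foreground) / rms(gain * bg) equal 10^(snr_db / 20).
    gain = rms(foreground) / (bg_rms * 10 ** (snr_db / 20))
    mixed = [f + gain * b for f, b in zip(foreground, bg)]
    # Rescale only if the mix would clip outside [-1, 1].
    peak = max(abs(m) for m in mixed)
    return [m / peak for m in mixed] if peak > 1.0 else mixed
```

Under these assumptions, a call such as `overlay(speech, machine_noise, snr_db=10.0)` would place the noise 10 dB below the speech. Lower `snr_db` values make the background more prominent.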

Eval Frameworks & Benchmarks, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
