The paper introduces VoiceGiraffe, a benchmark for evaluating large audio language models (LALMs) on hour-level audio understanding across diverse real-world scenarios. The benchmark consists of 1500 curated triplets designed to test both single-hop perception and multi-hop reasoning. Evaluation of open-source and proprietary LALMs shows that VoiceGiraffe remains challenging, that no single inference paradigm is universally superior, and that long-range memory persistence is a key bottleneck: models struggle to track sparse events across long audio.
LALMs answer questions that connect salient causal cues better than questions that require tracking sparse events across hours of audio, the reverse of the human pattern, which points to a key memory persistence bottleneck.
While large audio language models (LALMs) have achieved remarkable progress in audio processing at the second- or minute-level scale, understanding hour-level audio remains a fundamental bottleneck. Existing benchmarks predominantly rely on short clips or artificially concatenated segments, failing to faithfully assess LALM capacity for long-range information comprehension in real-world scenarios such as podcasts and lengthy speeches. To address this gap, we introduce VoiceGiraffe, a novel benchmark designed to rigorously evaluate LALMs across diverse real-world scenarios, modalities, and languages under long-context settings. It comprises 1500 curated triplets structured into a dual-level taxonomy of single-hop perception and multi-hop reasoning. We evaluate a broad suite of open-source and proprietary LALMs against human performance. Results underscore three fundamental findings. First, VoiceGiraffe remains highly challenging and far from saturation. Second, no single inference paradigm universally dominates: end-to-end (E2E) inference benefits models with native long-context audio understanding, cascaded caption aggregation stabilizes small models overwhelmed by hour-scale audio, and reasoning-enhanced cascading with an external LLM helps weaker models but can bottleneck stronger proprietary systems. Third, long-range memory persistence emerges as a key bottleneck: LALMs are better at answering questions that require connecting salient causal cues than those requiring sustained tracking of sparse events across long audio, whereas humans show the opposite pattern. These findings position VoiceGiraffe as a challenging and diagnostic testbed for long-form audio understanding, highlighting the need for LALMs with persistent memory and robust long-range aggregation.