This paper introduces RAIL, a novel evaluation paradigm for large audio-language models (LALMs) that draws on cognitive principles from the Cattell-Horn-Carroll (CHC) framework to assess auditory intelligence. The authors formalize auditory cognition into five core capabilities and turn them into structured evaluation tasks. Their evaluation of 26 state-of-the-art LALMs reveals significant gaps across these cognitive abilities: existing models are uneven in how they process, retain, and integrate auditory information. This points to the need for cognitively grounded assessment rather than task-centric benchmarking alone.
Current LALMs exhibit significant performance disparities across auditory cognitive capabilities, which existing task-centric evaluation methods overlook.
Humans process rich auditory environments through tightly integrated cognitive capabilities such as audio perception, audio reasoning, and memory. Despite recent progress in large audio-language models (LALMs) across speech understanding and multimodal audio reasoning, current evaluation paradigms remain largely task- or modality-centric: they focus on end performance while overlooking the underlying auditory cognitive behaviours. This exposes a fundamental gap between how auditory cognition is understood in humans and how it is evaluated in LALMs, most notably the lack of frameworks that operationalise cognitive principles beyond task-level metrics to systematically capture model behaviour. In this work, we introduce RAIL, a human-centric evaluation paradigm grounded in the Cattell-Horn-Carroll (CHC) cognitive framework. RAIL formalises auditory cognition into five core capabilities and develops them into structured evaluation tasks that probe how models process, retain, and integrate auditory information. We further construct a cognitively grounded benchmark with principled data curation and human-aligned evaluation protocols. Evaluating 26 state-of-the-art LALMs, we find that current models exhibit highly uneven performance across cognitive abilities. RAIL establishes a new evaluation paradigm that moves beyond task-centric benchmarking toward cognitively grounded assessment of auditory intelligence.