This paper investigates the extent to which Large Audio-Language Models (LALMs) rely on audio input versus text priors when answering audio-language benchmark questions. The authors introduce a diagnostic framework that quantifies "text prior" (answerability from the question text alone) and "audio reliance" (dependency on the acoustic signal). Results show that LALMs retain 60-72% of their performance even without audio, and only a small fraction of questions truly require the full audio clip, indicating significant over-reliance on text priors.
Audio-language models are cheating on benchmarks, acing tests even when they barely listen.
Large Audio-Language Models (LALMs) show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory understanding. We present a diagnostic framework built on two axes: text prior, which measures answerability from text and general knowledge alone, and audio reliance, which assesses actual dependency on the acoustic signal. Evaluating eight LALMs across three benchmarks, we find that models retain 60-72% of their full-audio scores even without any audio input. Moreover, among items that do require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved from localized fragments. These findings challenge the assumption that benchmark performance equals robust audio understanding, and we conclude with practical guidelines for improving evaluation reliability and benchmark design.
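To make the two axes concrete, the following is a minimal sketch of how a no-audio ablation like this could be scored. It is not the paper's actual implementation: the `model.answer(question, audio=...)` interface and the item fields (`question`, `audio`, `label`) are hypothetical stand-ins for whatever evaluation harness a given LALM uses.

```python
"""Illustrative sketch of the text-prior / audio-reliance diagnostic.

Assumes a hypothetical model interface:
    model.answer(question: str, audio: bytes | None) -> str
and benchmark items shaped as dicts with "question", "audio", "label" keys.
"""


def text_prior_score(model, items):
    """Fraction of items answered correctly from the question text alone
    (audio withheld). High values suggest the benchmark is answerable
    from text priors rather than listening."""
    correct = sum(
        model.answer(item["question"], audio=None) == item["label"]
        for item in items
    )
    return correct / len(items)


def audio_reliance(model, items):
    """Fraction of items the model gets right with audio but wrong
    without it -- i.e., items that genuinely require the acoustic signal."""
    reliant = 0
    for item in items:
        with_audio = model.answer(item["question"], audio=item["audio"])
        without_audio = model.answer(item["question"], audio=None)
        if with_audio == item["label"] and without_audio != item["label"]:
            reliant += 1
    return reliant / len(items)


def retention_ratio(model, items):
    """No-audio score as a fraction of the full-audio score; the paper's
    60-72% figure corresponds to this kind of ratio."""
    full = sum(
        model.answer(i["question"], audio=i["audio"]) == i["label"] for i in items
    )
    return (text_prior_score(model, items) * len(items)) / max(full, 1)
```

The paper's finer-grained finding (that most audio-reliant items are resolvable from localized fragments) would extend this by re-running the with-audio condition on truncated or masked clips, but the comparison logic stays the same.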