Apr 28, 2026 | arXiv:2604.25591

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee

AI Summary

This study systematically investigates uncertainty estimation for audio-aware large language models (ALLMs), where audio-conditioned generation adds challenges such as perceptual ambiguity and cross-modal grounding. The authors benchmark five methods (predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True)) across multiple models and across settings covering general audio understanding, reasoning, hallucination detection, and unanswerable question answering. On general audio reasoning benchmarks, semantic-level and verification-based approaches consistently outperform token-level baselines. On trustworthiness-oriented benchmarks, the relative effectiveness of the methods depends much more on the model and the benchmark, so conclusions from general reasoning settings do not transfer directly.

Key Contribution

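The first systematic empirical study of uncertainty estimation in ALLMs, showing that semantic-level and verification-based methods consistently outperform token-level baselines on general audio reasoning benchmarks, while their relative effectiveness on trustworthiness-oriented benchmarks is model- and benchmark-dependent.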

Abstract

Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still frequently produce hallucinated or overly confident outputs. While uncertainty estimation has been extensively studied in text-only LLMs, it remains largely unexplored for ALLMs, where audio-conditioned generation introduces additional challenges such as perceptual ambiguity and cross-modal grounding. In this work, we present the first systematic empirical study of uncertainty estimation in ALLMs. We benchmark five representative methods, including predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True), across multiple models and diverse evaluation settings spanning general audio understanding, reasoning, hallucination detection, and unanswerable question answering. Our results reveal two key findings. First, semantic-level and verification-based methods consistently outperform token-level baselines on general audio reasoning benchmarks. Second, on trustworthiness-oriented benchmarks, the relative effectiveness of uncertainty methods becomes notably more model- and benchmark-dependent, indicating that conclusions drawn from general reasoning settings do not straightforwardly transfer to hallucination and unanswerable-question scenarios. We further explore uncertainty-based adaptive inference as a potential downstream application. We hope this study provides a foundation for future research on reliable, uncertainty-aware audio-language systems.
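As a rough illustration of how the sampling-based measures named in the abstract differ, the sketch below computes predictive entropy, length-normalized entropy, semantic entropy, and discrete semantic entropy from a set of sampled answers. It follows formulations commonly used in the text-only LLM literature; the paper's exact estimators may differ. The function names, the `token_logprobs` layout (one list of per-token log-probabilities per sampled answer), and the `cluster_ids` input (semantic-equivalence labels, assumed to come from an external check such as an entailment model) are assumptions made for this example. P(True) is omitted because it requires a further query to the model itself.

```python
import math
from collections import Counter


def predictive_entropy(token_logprobs):
    """Negative mean sequence log-probability over the sampled answers."""
    seq_logprobs = [sum(lps) for lps in token_logprobs]
    return -sum(seq_logprobs) / len(seq_logprobs)


def length_normalized_entropy(token_logprobs):
    """Like predictive entropy, but each sequence log-probability is
    divided by its token count so long answers are not penalized."""
    per_token = [sum(lps) / len(lps) for lps in token_logprobs]
    return -sum(per_token) / len(per_token)


def semantic_entropy(token_logprobs, cluster_ids):
    """Entropy over meaning clusters, where each cluster's weight is the
    normalized sum of the sequence probabilities of its members."""
    mass = {}
    for lps, cid in zip(token_logprobs, cluster_ids):
        mass[cid] = mass.get(cid, 0.0) + math.exp(sum(lps))
    total = sum(mass.values())
    return -sum((m / total) * math.log(m / total) for m in mass.values())


def discrete_semantic_entropy(cluster_ids):
    """Entropy over meaning clusters, where each cluster's weight is the
    fraction of samples it contains (no token probabilities needed)."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


if __name__ == "__main__":
    # Toy data: four sampled answers to one audio question (hypothetical).
    logprobs = [[-0.1, -0.2], [-0.3], [-0.2, -0.2, -0.1], [-1.5, -0.9]]
    clusters = ["dog", "dog", "dog", "cat"]  # assumed equivalence labels
    print("PE  :", predictive_entropy(logprobs))
    print("LN  :", length_normalized_entropy(logprobs))
    print("SE  :", semantic_entropy(logprobs, clusters))
    print("DSE :", discrete_semantic_entropy(clusters))
```

The token-level measures treat differently worded answers as distinct outcomes, whereas the semantic variants first pool answers that mean the same thing, so paraphrases of one correct answer do not inflate the uncertainty score.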

Eval Frameworks & Benchmarks | Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 99

Year: 2026

Venue: N/A
