Mar 10, 2026 · arXiv:2603.09714

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo, Ping-Le Tsai, Yen-Ting Piao, Hung-Wei Chen, Ting-Lin Hsiao, Yun-Man Hsu, Ke-Han Lu

AI Summary

The paper introduces MUGEN, a new benchmark designed to evaluate multi-audio understanding capabilities in Large Audio-Language Models (LALMs) across speech, general audio, and music domains. Experiments using MUGEN reveal that LALMs exhibit significant performance degradation as the number of concurrent audio inputs increases, highlighting input scaling as a key limitation. The authors demonstrate that Audio-Permutational Self-Consistency, a training-free strategy that diversifies the order of audio inputs, can improve model robustness and achieve accuracy gains of up to 6.74% when combined with Chain-of-Thought prompting.

Key Contribution

LALMs struggle to handle multiple concurrent audio inputs, but a simple input permutation strategy can significantly boost their multi-audio understanding without retraining.

Abstract

While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings: performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding accuracy gains of up to 6.28%. Combining this permutation strategy with Chain-of-Thought prompting raises the gain further, to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
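The abstract describes Audio-Permutational Self-Consistency only at a high level: the order of the audio candidates is varied and the resulting predictions are aggregated. The sketch below shows one way such a procedure could look. It is a minimal illustration, not the authors' implementation. The function name `aps_vote`, the `ask_model` callback (which takes clips and a question and returns the position of the chosen clip), the plurality-vote aggregation, and the toy position-biased model are all assumptions made for this example.

```python
import random
from collections import Counter
from itertools import permutations
from typing import Callable, Optional, Sequence


def aps_vote(
    audios: Sequence[str],
    question: str,
    ask_model: Callable[[Sequence[str], str], int],
    num_orders: Optional[int] = None,
    seed: int = 0,
) -> int:
    """Query the model under several audio orderings and return the
    plurality-voted answer as an index into the ORIGINAL `audios`.

    Hypothetical interface: `ask_model(clips, question)` returns the
    position (0-based) of the clip it selects in the order it was given.
    """
    # Enumerating all orderings is only practical for a handful of clips.
    orders = list(permutations(range(len(audios))))
    if num_orders is not None and num_orders < len(orders):
        orders = random.Random(seed).sample(orders, num_orders)

    votes: Counter = Counter()
    for order in orders:
        shuffled = [audios[i] for i in order]
        picked_position = ask_model(shuffled, question)
        # Map the chosen position back to the clip's original index
        # so votes from different orderings are comparable.
        votes[order[picked_position]] += 1
    return votes.most_common(1)[0][0]


def toy_model(clips: Sequence[str], question: str) -> int:
    """Stand-in for an LALM with a simulated positional weakness:
    it finds "dog.wav" unless that clip comes last, in which case
    it falls back to the first clip."""
    target = list(clips).index("dog.wav")
    return target if target != len(clips) - 1 else 0


audios = ["cat.wav", "bird.wav", "dog.wav"]
question = "Which clip contains a dog barking?"

single_pass = toy_model(audios, question)      # original order only
voted = aps_vote(audios, question, toy_model)  # all 6 orderings
print(single_pass, voted)
```

In this toy setting the single pass fails because the target clip happens to sit in the position the stub handles badly, while voting across orderings recovers it. A real LALM's positional behavior is of course less tidy, and each extra ordering costs one more model call.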

Eval Frameworks & Benchmarks · Multimodal Models · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
