The paper introduces MUGEN, a new benchmark designed to evaluate multi-audio understanding capabilities in Large Audio-Language Models (LALMs) across speech, general audio, and music domains. Experiments using MUGEN reveal that LALMs exhibit significant performance degradation as the number of concurrent audio inputs increases, highlighting input scaling as a key limitation. The authors demonstrate that Audio-Permutational Self-Consistency, a training-free strategy that diversifies the order of audio inputs, can improve model robustness and achieve accuracy gains of up to 6.74% when combined with Chain-of-Thought prompting.
LALMs struggle to handle multiple concurrent audio inputs, but a simple input permutation strategy can significantly boost their multi-audio understanding without retraining.
While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought raises the gain further, to as much as 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
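The permutation idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only says that the order of audio candidates is diversified and the predictions are aggregated, so the majority vote, the random sampling of orders, and every name below (`permutational_self_consistency`, `predict`, `toy_predict`, the file names) are assumptions made for illustration. The hypothetical `predict` callable stands in for one LALM query and returns the position of the chosen candidate in the order it was presented.

```python
import random
from collections import Counter
from typing import Callable, Optional, Sequence


def permutational_self_consistency(
    candidates: Sequence[str],
    predict: Callable[[Sequence[str]], int],
    num_permutations: int = 8,
    seed: Optional[int] = None,
) -> int:
    """Return the original index of the candidate chosen most often.

    Assumption: answers are aggregated by simple majority vote over
    randomly sampled presentation orders.
    """
    if not candidates or num_permutations < 1:
        raise ValueError("need at least one candidate and one permutation")

    rng = random.Random(seed)
    votes = Counter()
    for _ in range(num_permutations):
        # order[k] is the original index of the candidate shown in slot k.
        order = list(range(len(candidates)))
        rng.shuffle(order)
        presented = [candidates[i] for i in order]

        # The model answers with a slot in the shuffled order ...
        picked_slot = predict(presented)
        # ... which is mapped back to the original index before voting.
        votes[order[picked_slot]] += 1

    return votes.most_common(1)[0][0]


def toy_predict(presented: Sequence[str]) -> int:
    """Hypothetical stand-in for a model with a position bias.

    It finds the target unless the target is presented last, in which
    case it falls back to the first slot.
    """
    slot = presented.index("target.wav")
    return slot if slot < len(presented) - 1 else 0


if __name__ == "__main__":
    clips = ["a.wav", "target.wav", "c.wav"]
    winner = permutational_self_consistency(clips, toy_predict, 30, seed=0)
    print("selected clip:", clips[winner])
```

Because each vote is recorded against the candidate's original index rather than its slot, an error tied to a particular position is spread across different candidates, while correct answers accumulate on the same one. The cost is one model query per sampled order.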