The paper introduces a new Perception-Aware Question Answering (PAQA) dataset designed to improve audio understanding by explicitly separating speech, environmental sounds, and multiple speakers. Building on PAQA, the authors propose HyPeR, a two-stage Hybrid Perception-Reasoning framework that first finetunes a model on PAQA to improve acoustic perception and then uses GRPO to refine reasoning. HyPeR incorporates PAUSE tokens and a perceptual consistency reward, achieving significant gains over baseline models and approaching the performance of larger models.
Grounding audio understanding in structured auditory scenes with a hybrid perception-reasoning framework dramatically improves performance, rivaling that of much larger models.
Recent Large Audio Language Models have demonstrated impressive capabilities in audio understanding. However, they often suffer from perceptual errors, and reliable audio reasoning is unattainable without first grounding the model's perception in structured auditory scenes. Inspired by Auditory Scene Analysis, we first introduce a Perception-Aware Question Answering (PAQA) dataset. PAQA implements a hierarchical decoupling strategy that separates speech from environmental sound and distinguishes multiple speakers, providing explicit perceptual reasoning traces for training. Building on this, we propose HyPeR, a two-stage Hybrid Perception-Reasoning framework. In Stage I, we finetune the model on PAQA to perceive acoustic attributes in complex audio. In Stage II, we leverage GRPO to refine the model's internal deliberation. We also introduce PAUSE tokens to facilitate latent computation during acoustically ambiguous phases and design a perceptual consistency reward to align reasoning rationales with the raw audio. Experiments across benchmarks demonstrate that HyPeR achieves absolute improvements over the base model, with performance comparable to large-scale models, underscoring the effectiveness of hybrid perception-grounded reasoning for robust and multi-speaker audio understanding.
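To make the Stage II objective concrete, here is a minimal sketch of how a reward combining answer correctness with perceptual consistency might look. The paper does not specify its reward formula, so the function names (`perceptual_consistency`, `hybrid_reward`), the tag-overlap scoring, and the mixing weight `alpha` are all illustrative assumptions, not the authors' actual implementation.

```python
def perceptual_consistency(rationale_tags, audio_tags):
    """Hypothetical consistency score: the fraction of acoustic attributes
    claimed in the model's rationale that are actually present in the
    audio's ground-truth annotations. Not the paper's exact metric."""
    claimed = set(rationale_tags)
    if not claimed:
        return 0.0
    return len(claimed & set(audio_tags)) / len(claimed)


def hybrid_reward(answer, gold_answer, rationale_tags, audio_tags, alpha=0.5):
    """Illustrative composite reward for GRPO-style training:
    a weighted mix of exact-match answer correctness and perceptual
    consistency. The weight alpha=0.5 is an arbitrary assumption."""
    correctness = 1.0 if answer == gold_answer else 0.0
    consistency = perceptual_consistency(rationale_tags, audio_tags)
    return (1.0 - alpha) * correctness + alpha * consistency


# Correct answer, but the rationale claims "male speaker" which the
# annotations do not support, so the consistency term is penalized.
reward = hybrid_reward(
    answer="dog barking",
    gold_answer="dog barking",
    rationale_tags=["bark", "male speaker"],
    audio_tags=["bark", "rain"],
)
print(reward)  # → 0.75
```

In a GRPO setup, rewards like this would be computed for each sampled rollout in a group and normalized into advantages; tying part of the reward to perception rather than the final answer alone is what keeps the reasoning grounded in the audio.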