The paper introduces a new Perception-Aware Question Answering (PAQA) dataset designed to improve audio understanding by explicitly separating speech, environmental sounds, and multiple speakers. Building on PAQA, the authors propose HyPeR, a two-stage Hybrid Perception-Reasoning framework that first finetunes a model on PAQA to improve acoustic perception and then uses GRPO to refine reasoning. HyPeR incorporates PAUSE tokens and a perceptual consistency reward, achieving significant gains over baseline models and approaching the performance of larger models.
Grounding audio understanding in structured auditory scenes with a hybrid perception-reasoning framework dramatically improves performance, rivaling that of much larger models.
Recent Large Audio Language Models have demonstrated impressive capabilities in audio understanding. However, they often suffer from perceptual errors, and reliable audio reasoning is unattainable without first grounding the model's perception in structured auditory scenes. Inspired by Auditory Scene Analysis, we first introduce a Perception-Aware Question Answering (PAQA) dataset. PAQA implements a hierarchical decoupling strategy that separates speech from environmental sound and distinguishes multiple speakers, providing explicit perceptual reasoning traces for training. Building on this, we propose HyPeR, a two-stage Hybrid Perception-Reasoning framework. In Stage I, we finetune the model on PAQA to perceive acoustic attributes in complex audio. In Stage II, we leverage GRPO to refine the model's internal deliberation. We also introduce PAUSE tokens to facilitate latent computation during acoustically ambiguous phases and design a perceptual consistency reward to align reasoning rationales with the raw audio. Experiments across benchmarks demonstrate that HyPeR achieves absolute improvements over the base model, with performance comparable to large-scale models, underscoring the effectiveness of hybrid perception-grounded reasoning for robust and multi-speaker audio understanding.
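To make the Stage II objective concrete, here is a minimal sketch of how a reward combining answer correctness with perceptual consistency might look. The paper does not specify its reward formula, so the function names (`perceptual_consistency`, `hybrid_reward`), the tag-overlap scoring, and the mixing weight `alpha` are all illustrative assumptions, not the authors' actual implementation.

```python
def perceptual_consistency(rationale_tags, audio_tags):
    """Hypothetical consistency score: the fraction of acoustic attributes
    claimed in the model's rationale that are actually present in the
    audio's ground-truth annotations. Not the paper's exact metric."""
    claimed = set(rationale_tags)
    if not claimed:
        return 0.0
    return len(claimed & set(audio_tags)) / len(claimed)


def hybrid_reward(answer, gold_answer, rationale_tags, audio_tags, alpha=0.5):
    """Illustrative composite reward for GRPO-style training:
    a weighted mix of exact-match answer correctness and perceptual
    consistency. The weight alpha=0.5 is an arbitrary assumption."""
    correctness = 1.0 if answer == gold_answer else 0.0
    consistency = perceptual_consistency(rationale_tags, audio_tags)
    return (1.0 - alpha) * correctness + alpha * consistency


# Correct answer, but the rationale claims "male speaker" which the
# annotations do not support, so the consistency term is penalized.
reward = hybrid_reward(
    answer="dog barking",
    gold_answer="dog barking",
    rationale_tags=["bark", "male speaker"],
    audio_tags=["bark", "rain"],
)
print(reward)  # → 0.75
```

In a GRPO setup, rewards like this would be computed for each sampled rollout in a group and normalized into advantages; tying part of the reward to perception rather than the final answer alone is what keeps the reasoning grounded in the audio.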