Feb 18, 2026 · arXiv:2602.16334

Spatial Audio Question Answering and Reasoning on Dynamic Source Movements

Arvind Krishna Sridhar, Yinyi Guo, Erik Visser

AI Summary

This paper studies Spatial Audio Question Answering (Spatial AQA) with a focus on reasoning about dynamic source movements in stereo audio. The authors propose a movement-centric spatial audio augmentation framework for generating training data and an end-to-end multimodal finetuning approach with a "thinking mode" that has the model produce explicit intermediate reasoning before it answers. They also compare three inference regimes for query-conditioned source separation: no masking, an audio grounding model (AGM), and ground-truth masks. The reported results indicate that reasoning amplifies the benefits of source separation, with thinking mode yielding a +5.1% improvement when a single event is present in the question.

Key Contribution

Explicit reasoning steps ("thinking mode") amplify the benefits of query-conditioned source separation in spatial audio question answering, with a reported +5.1% improvement when a single event is present in the question.

Abstract

Spatial audio understanding aims to enable machines to interpret complex auditory scenes, particularly when sound sources move over time. In this work, we study Spatial Audio Question Answering (Spatial AQA) with a focus on movement reasoning, where a model must infer object motion, position, and directional changes directly from stereo audio. First, we introduce a movement-centric spatial audio augmentation framework that synthesizes diverse motion patterns from isolated mono audio events, enabling controlled and scalable training data generation. Second, we propose an end-to-end multimodal finetuning approach with a thinking mode, which allows audio-language models to produce explicit intermediate reasoning steps before predicting an answer. Third, we investigate the impact of query-conditioned source separation as a preprocessing stage and compare three inference regimes: no masking, an audio grounding model (AGM), and ground-truth masks. Our results show that reasoning amplifies the benefits of source separation, with thinking mode showing significant improvement of +5.1% when a single event is present in the question. These findings highlight the interplay between movement modeling, reasoning, and separation quality, offering new insights for advancing spatial audio understanding.
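The abstract does not say how the augmentation framework renders motion, so the following is only a minimal sketch of the general idea of turning an isolated mono event into a stereo clip with a known movement label. It assumes simple constant-power amplitude panning with a linearly changing azimuth, which models level differences only (no interaural time differences, distance cues, or room acoustics). The function names `pan_moving_source` and `make_example`, the azimuth convention, and the label string are all hypothetical and are not taken from the paper.

```python
import math

def pan_moving_source(mono, start_deg, end_deg):
    """Render a mono signal to stereo while the source sweeps linearly
    from start_deg to end_deg (assumed convention: -90 = hard left,
    +90 = hard right)."""
    n = len(mono)
    left, right = [], []
    for i, sample in enumerate(mono):
        # Linearly interpolate the azimuth over the clip duration.
        frac = i / (n - 1) if n > 1 else 0.0
        azimuth = start_deg + frac * (end_deg - start_deg)
        # Map azimuth in [-90, 90] to a pan angle in [0, pi/2].
        theta = (azimuth + 90.0) / 180.0 * (math.pi / 2.0)
        # Constant-power gains: cos^2 + sin^2 == 1 at every sample.
        left.append(sample * math.cos(theta))
        right.append(sample * math.sin(theta))
    return left, right

def make_example(sample_rate=8000, seconds=1.0, freq_hz=440.0):
    """Build one synthetic training example: a test tone that moves
    from left to right, plus a movement label usable as a QA answer."""
    n = int(sample_rate * seconds)
    mono = [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]
    left, right = pan_moving_source(mono, -90.0, 90.0)
    label = "left to right"  # hypothetical label format
    return left, right, label
```

Because the start and end azimuths are chosen by the generator, the movement label is known exactly, which is what makes this style of synthesis controllable and scalable as a source of question-answer pairs.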

Multimodal Models · Reasoning & Chain-of-Thought · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 25

Year: 2026

Venue: N/A
