The paper introduces JAEGER, a framework that extends audio-visual large language models (AV-LLMs) to 3D space by integrating RGB-D observations and multi-channel first-order ambisonics for joint spatial grounding and reasoning. JAEGER addresses the limitations of existing 2D-centric AV-LLMs that struggle with source localization and spatial reasoning in complex 3D environments. The key result is that JAEGER, using a novel neural intensity vector (Neural IV) representation for robust directional audio cues, outperforms 2D baselines on a new SpatialSceneQA benchmark of 61k instruction-tuning samples.
By explicitly modeling 3D space with learned spatial audio representations, JAEGER enables AV-LLMs to perform joint spatial grounding and reasoning far beyond the capabilities of 2D-centric models.
Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation with JAEGER, a framework that extends AV-LLMs to 3D space, enabling joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modeling for advancing AI in physical environments. Our source code, pre-trained model checkpoints, and datasets will be released upon acceptance.
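For context on the directional cue that Neural IV builds upon: the sketch below computes the classical (non-learned) acoustic intensity vector from first-order ambisonics channels and converts it into per-bin direction-of-arrival estimates. This is only an illustrative baseline, not the paper's method; the function name, channel ordering (W, X, Y, Z), and STFT parameters are assumptions, and the learned Neural IV architecture is not described in the abstract.

```python
# Minimal sketch (assumption): classical active intensity vector from
# first-order ambisonics (B-format) audio, the kind of directional cue
# that a learned Neural IV representation would refine.
import numpy as np
from scipy.signal import stft


def intensity_vector_doa(foa, fs=48000, n_fft=1024, hop=512):
    """foa: array of shape (4, n_samples), channels assumed ordered (W, X, Y, Z).

    Returns unit direction-of-arrival vectors per time-frequency bin,
    shape (3, n_freq, n_frames).
    """
    # Complex STFT of each channel; stft operates along the last axis,
    # so the result has shape (4, n_freq, n_frames).
    _, _, spec = stft(foa, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    W, X, Y, Z = spec

    # Active intensity: real part of conj(W) times each dipole channel.
    intensity = np.stack([
        np.real(np.conj(W) * X),
        np.real(np.conj(W) * Y),
        np.real(np.conj(W) * Z),
    ])  # (3, n_freq, n_frames)

    # Normalize to unit vectors -> per-bin DOA estimates.
    norm = np.linalg.norm(intensity, axis=0, keepdims=True) + 1e-8
    return intensity / norm
```

In practice, such per-bin estimates degrade with reverberation and overlapping sources, which is the failure mode a learned representation like Neural IV is meant to address.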