MoXaRt is a real-time XR system that disentangles complex acoustic environments by combining coarse audio-only source separation with visual object detection that guides per-source refinement networks. The system separates up to 5 concurrent sound sources with approximately 2 seconds of latency. Experiments on a new dataset and a 22-participant user study demonstrate a 36.2% improvement in listening comprehension and reduced cognitive load in adversarial acoustic environments.
Imagine an XR experience where you can selectively isolate and enhance individual sound sources in real-time, making chaotic audio environments crystal clear.
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixtures of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2-second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and through a 22-participant user study. Results show that MoXaRt significantly enhances speech intelligibility, yielding a 36.2% increase in listening comprehension in adversarial acoustic environments (p<0.01), while substantially reducing cognitive load (p<0.001). Together, these findings pave the way for more perceptive and socially adept XR experiences.
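The paper does not release code, but the cascade described above is simple enough to sketch. Below is a minimal, hypothetical PyTorch rendering: a coarse audio-only separator emits candidate stems, and a refinement network conditions each stem on the embedding of its visual anchor (a detected face or instrument). All class names, dimensions, and the FiLM-style conditioning are illustrative assumptions, not MoXaRt's published architecture; the real system also runs separation and detection in parallel, whereas this sketch is sequential for brevity.

```python
# Minimal sketch of a MoXaRt-style cascade: coarse audio-only separation,
# then per-source refinement conditioned on visual anchor embeddings.
# Every class name, dimension, and the FiLM-style conditioning here is an
# illustrative assumption; the paper publishes no implementation.

import torch
import torch.nn as nn


class CoarseSeparator(nn.Module):
    """Audio-only stage: splits a mixture into up to `max_sources` stems."""

    def __init__(self, max_sources: int = 5, feat_dim: int = 256):
        super().__init__()
        self.max_sources = max_sources
        self.encoder = nn.Conv1d(1, feat_dim, kernel_size=16, stride=8)
        self.masker = nn.Conv1d(feat_dim, feat_dim * max_sources, kernel_size=1)
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=16, stride=8)

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, 1, samples) -> stems: (batch, max_sources, samples)
        feats = torch.relu(self.encoder(mixture))
        masks = torch.sigmoid(self.masker(feats))
        b, _, t = feats.shape
        masks = masks.view(b, self.max_sources, -1, t)
        stems = [self.decoder(feats * masks[:, i]) for i in range(self.max_sources)]
        return torch.cat(stems, dim=1)


class AnchorGuidedRefiner(nn.Module):
    """Refines one coarse stem, conditioned on a visual anchor embedding
    (e.g., a detected face or instrument crop encoded by a vision backbone)."""

    def __init__(self, anchor_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.audio_net = nn.Conv1d(1, hidden, kernel_size=9, padding=4)
        self.film = nn.Linear(anchor_dim, 2 * hidden)  # per-channel scale/shift
        self.out = nn.Conv1d(hidden, 1, kernel_size=9, padding=4)

    def forward(self, stem: torch.Tensor, anchor: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.audio_net(stem))               # (b, hidden, T)
        scale, shift = self.film(anchor).chunk(2, dim=-1)  # (b, hidden) each
        h = h * scale.unsqueeze(-1) + shift.unsqueeze(-1)  # FiLM conditioning
        return stem + self.out(h)                          # residual refinement


def separate(mixture, anchors, coarse, refiner):
    """Cascade: coarse audio-only split, then refine each stem with the
    visual anchor of the matching detected source."""
    stems = coarse(mixture)                                # (b, K, T)
    refined = [refiner(stems[:, i : i + 1], anchors[:, i])
               for i in range(anchors.shape[1])]
    return torch.cat(refined, dim=1)                       # (b, K, T)


if __name__ == "__main__":
    coarse, refiner = CoarseSeparator(), AnchorGuidedRefiner()
    mix = torch.randn(1, 1, 16000)    # 1 s of mono audio at 16 kHz
    anchors = torch.randn(1, 5, 128)  # one embedding per detected source
    print(separate(mix, anchors, coarse, refiner).shape)  # torch.Size([1, 5, 16000])
```

The design choice worth noting is the residual refinement: the refiner only has to correct the coarse stem rather than re-separate from scratch, which is one plausible way a system like this could keep per-source refinement cheap enough for ~2-second end-to-end latency.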