MoXaRt is a real-time XR system that disentangles complex acoustic environments by combining coarse audio-only source separation with visual object detection that guides per-source refinement networks. The system separates up to 5 concurrent sound sources with approximately 2 seconds of latency. Experiments on a new dataset and a 22-participant user study demonstrate a 36.2% improvement in listening comprehension and reduced cognitive load in adversarial acoustic environments.
Imagine an XR experience where you can selectively isolate and enhance individual sound sources in real-time, making chaotic audio environments crystal clear.
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixtures of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2-second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and through a 22-participant user study. Results show that MoXaRt significantly enhances speech intelligibility, yielding a 36.2% increase in listening comprehension in adversarial acoustic environments (p<0.01), while substantially reducing cognitive load (p<0.001). Together, these findings pave the way for more perceptive and socially adept XR experiences.
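The paper does not release code, but the cascade described above is simple enough to sketch. Below is a minimal, hypothetical PyTorch rendering: a coarse audio-only separator emits candidate stems, and a refinement network conditions each stem on the embedding of its visual anchor (a detected face or instrument). All class names, dimensions, and the FiLM-style conditioning are illustrative assumptions, not MoXaRt's published architecture; the real system also runs separation and detection in parallel, whereas this sketch is sequential for brevity.

```python
# Minimal sketch of a MoXaRt-style cascade: coarse audio-only separation,
# then per-source refinement conditioned on visual anchor embeddings.
# Every class name, dimension, and the FiLM-style conditioning here is an
# illustrative assumption; the paper publishes no implementation.

import torch
import torch.nn as nn


class CoarseSeparator(nn.Module):
    """Audio-only stage: splits a mixture into up to `max_sources` stems."""

    def __init__(self, max_sources: int = 5, feat_dim: int = 256):
        super().__init__()
        self.max_sources = max_sources
        self.encoder = nn.Conv1d(1, feat_dim, kernel_size=16, stride=8)
        self.masker = nn.Conv1d(feat_dim, feat_dim * max_sources, kernel_size=1)
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=16, stride=8)

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, 1, samples) -> stems: (batch, max_sources, samples)
        feats = torch.relu(self.encoder(mixture))
        masks = torch.sigmoid(self.masker(feats))
        b, _, t = feats.shape
        masks = masks.view(b, self.max_sources, -1, t)
        stems = [self.decoder(feats * masks[:, i]) for i in range(self.max_sources)]
        return torch.cat(stems, dim=1)


class AnchorGuidedRefiner(nn.Module):
    """Refines one coarse stem, conditioned on a visual anchor embedding
    (e.g., a detected face or instrument crop encoded by a vision backbone)."""

    def __init__(self, anchor_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.audio_net = nn.Conv1d(1, hidden, kernel_size=9, padding=4)
        self.film = nn.Linear(anchor_dim, 2 * hidden)  # per-channel scale/shift
        self.out = nn.Conv1d(hidden, 1, kernel_size=9, padding=4)

    def forward(self, stem: torch.Tensor, anchor: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.audio_net(stem))               # (b, hidden, T)
        scale, shift = self.film(anchor).chunk(2, dim=-1)  # (b, hidden) each
        h = h * scale.unsqueeze(-1) + shift.unsqueeze(-1)  # FiLM conditioning
        return stem + self.out(h)                          # residual refinement


def separate(mixture, anchors, coarse, refiner):
    """Cascade: coarse audio-only split, then refine each stem with the
    visual anchor of the matching detected source."""
    stems = coarse(mixture)                                # (b, K, T)
    refined = [refiner(stems[:, i : i + 1], anchors[:, i])
               for i in range(anchors.shape[1])]
    return torch.cat(refined, dim=1)                       # (b, K, T)


if __name__ == "__main__":
    coarse, refiner = CoarseSeparator(), AnchorGuidedRefiner()
    mix = torch.randn(1, 1, 16000)    # 1 s of mono audio at 16 kHz
    anchors = torch.randn(1, 5, 128)  # one embedding per detected source
    print(separate(mix, anchors, coarse, refiner).shape)  # torch.Size([1, 5, 16000])
```

The design choice worth noting is the residual refinement: the refiner only has to correct the coarse stem rather than re-separate from scratch, which is one plausible way a system like this could keep per-source refinement cheap enough for ~2-second end-to-end latency.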