Mar 31, 2026 · arXiv:2603.29820

SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision

AI Summary

SIREN is a framework that converts the monaural audio of a video into binaural audio, using visual information to predict separate left and right channels. A ViT-based encoder with dual-head self-attention produces a shared scene map and left/right (L/R) attention weights, removing the need for hand-crafted masks. A soft, annealed spatial prior improves early L/R grounding, and confidence-weighted waveform-domain fusion reduces crosstalk. On FAIR-Play and MUSIC-Stereo, SIREN shows consistent gains on time-frequency and phase-sensitive metrics with competitive SNR.
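To make the idea of a soft, annealed spatial prior concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the linear horizontal ramp, the linear decay schedule, and the names `prior_strength`, `lr_prior_bias`, and `biased_attention` are all assumptions chosen for illustration. The sketch adds a position-dependent bias to attention logits over image patches and shrinks that bias to zero as training proceeds, so the prior only nudges early left/right grounding.

```python
import numpy as np

def prior_strength(step, anneal_steps, start=1.0):
    """Hypothetical schedule: decay linearly from `start` to 0 over `anneal_steps`."""
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return start * (1.0 - frac)

def lr_prior_bias(grid_h, grid_w):
    """Return (left_bias, right_bias), each of shape (grid_h * grid_w,).

    Assumed form of the prior: a linear ramp over the horizontal patch
    position, so left-side patches get a positive left bias and a
    negative right bias, and vice versa.
    """
    x = np.linspace(-1.0, 1.0, grid_w)          # -1 = left edge, +1 = right edge
    ramp = np.tile(x, (grid_h, 1)).reshape(-1)  # row-major patch order
    return -ramp, ramp

def softmax(z):
    z = z - z.max()                             # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def biased_attention(logits, bias, step, anneal_steps):
    """Add the annealed bias to the attention logits, then normalise."""
    return softmax(logits + prior_strength(step, anneal_steps) * bias)
```

With uniform logits, the left head initially places more than half of its attention mass on the left half of the frame; once `step` reaches `anneal_steps` the bias vanishes and the attention depends only on the learned logits.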

Key Contribution

SIREN is a visually guided framework that converts the monaural audio of a video into immersive binaural audio, learning spatial audio cues without task-specific annotations.

Abstract

Binaural audio delivers spatial cues essential for immersion, yet most consumer videos are monaural due to capture constraints. We introduce SIREN, a visually guided mono-to-binaural framework that explicitly predicts left and right channels. A ViT-based encoder learns dual-head self-attention to produce a shared scene map and end-to-end L/R attention, replacing hand-crafted masks. A soft, annealed spatial prior gently biases early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion (guided by mono reconstruction and interaural phase consistency) suppresses crosstalk when aggregating multi-crop and overlapping windows. Evaluated on FAIR-Play and MUSIC-Stereo, SIREN yields consistent gains on time-frequency and phase-sensitive metrics with competitive SNR. The design is modular and generic, requires no task-specific annotations, and integrates with standard audio-visual pipelines.
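The fusion step can be illustrated with a simplified, single-stage sketch of confidence-weighted overlap-add in the waveform domain. This is an assumption-laden illustration, not the two-stage procedure the abstract describes: the confidence here uses only a mono-reconstruction error (assuming the convention mono = left + right), the interaural phase consistency term is omitted, and the names `mono_confidence`, `fuse_windows`, and the `sharpness` parameter are hypothetical.

```python
import numpy as np

def mono_confidence(left, right, mono, sharpness=10.0):
    """Score in (0, 1]; equals 1 when left + right reproduces the mono input.

    Assumes the mono signal is the sum of the two channels.
    """
    err = np.mean((left + right - mono) ** 2)
    return float(np.exp(-sharpness * err))

def fuse_windows(windows, starts, confidences, total_len):
    """Confidence-weighted overlap-add of stereo windows.

    windows:     list of arrays shaped (2, win_len); rows are left/right
    starts:      sample offset of each window in the output
    confidences: one non-negative scalar per window
    """
    out = np.zeros((2, total_len))
    weight = np.zeros(total_len)
    for win, start, conf in zip(windows, starts, confidences):
        end = start + win.shape[1]
        out[:, start:end] += conf * win   # accumulate weighted samples
        weight[start:end] += conf         # accumulate total confidence per sample
    # Guard against division by zero where no confident window landed.
    return out / np.maximum(weight, 1e-8)
```

Where windows overlap, each output sample is a weighted average, so predictions from low-confidence windows or crops contribute less to the final left and right channels.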

Computer Vision · Multimodal Models · Speech & Audio


Year: 2026

Venue: N/A
