SIREN is a framework that converts the monaural audio of a video into binaural audio, using visual information to predict separate left and right channels. A ViT encoder with dual-head self-attention produces a shared scene map and left/right (L/R) attention weights, removing the need for hand-crafted masks. A soft spatial prior improves early spatial grounding, and confidence-weighted waveform fusion reduces crosstalk, yielding consistent gains on time-frequency and phase-sensitive metrics on FAIR-Play and MUSIC-Stereo.
Turn monaural video into immersive binaural audio with SIREN, a visually-guided framework that learns spatial audio cues without task-specific annotations.
Binaural audio delivers spatial cues essential for immersion, yet most consumer videos are monaural due to capture constraints. We introduce SIREN, a visually guided mono-to-binaural framework that explicitly predicts left and right channels. A ViT-based encoder learns dual-head self-attention to produce a shared scene map and end-to-end L/R attention, replacing hand-crafted masks. A soft, annealed spatial prior gently biases early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion (guided by mono reconstruction and interaural phase consistency) suppresses crosstalk when aggregating multi-crop and overlapping windows. Evaluated on FAIR-Play and MUSIC-Stereo, SIREN yields consistent gains on time-frequency and phase-sensitive metrics with competitive SNR. The design is modular and generic, requires no task-specific annotations, and integrates with standard audio-visual pipelines.
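
The confidence-weighted aggregation of overlapping windows described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fuse_windows`, the use of one scalar confidence per window, and the plain weighted average are all assumptions made for illustration. The abstract says the confidences are guided by mono reconstruction and interaural phase consistency; how those scores are computed is not specified here, so the sketch takes them as given inputs.

```python
def fuse_windows(windows, starts, confidences, total_len, eps=1e-8):
    """Confidence-weighted overlap-add of per-window waveform predictions.

    windows:     list of predicted waveforms (lists of floats) for one channel
    starts:      start sample of each window in the full signal
    confidences: hypothetical non-negative confidence score per window
    total_len:   length of the fused output in samples
    """
    weighted_sum = [0.0] * total_len  # sum of confidence * sample
    weight_total = [0.0] * total_len  # sum of confidences covering each sample

    for window, start, conf in zip(windows, starts, confidences):
        for offset, sample in enumerate(window):
            t = start + offset
            if 0 <= t < total_len:
                weighted_sum[t] += conf * sample
                weight_total[t] += conf

    # Normalise by accumulated confidence; samples no window covers stay at 0.
    return [
        s / w if w > eps else 0.0
        for s, w in zip(weighted_sum, weight_total)
    ]


# Two overlapping windows predicted for the same channel.
left_windows = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]
fused_left = fuse_windows(left_windows, starts=[0, 2],
                          confidences=[1.0, 3.0], total_len=6)
```

In the overlap region the higher-confidence window dominates the average, which is the mechanism by which a low-confidence (for example, crosstalk-prone) prediction would be down-weighted. The same function would be applied independently to the right channel.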