The paper introduces BiFormer3D, a Transformer-based architecture that reconstructs Head-Related Impulse Responses (HRIRs) at arbitrary spatial locations from sparse measurements, operating directly in the time domain. By using sinusoidal spatial encodings and auxiliary interaural time difference (ITD) and interaural level difference (ILD) prediction heads, BiFormer3D avoids the limitations of frequency-domain methods and fixed direction grids. Experiments on the SONICOM dataset show lower NMSE, cosine distance, and ITD/ILD error than existing techniques, and ablations indicate that minimum-phase preprocessing is unnecessary.
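As a rough illustration of what a sinusoidal spatial encoding of a source direction can look like, the NumPy sketch below maps azimuth and elevation to a feature vector that is defined for any direction, not only for points on a fixed grid. The function name, the unit-vector parameterization, the power-of-two frequency schedule, and `num_bands` are all assumptions made for this example; the summary does not specify BiFormer3D's actual encoding.

```python
import numpy as np

def sinusoidal_direction_encoding(azimuth_deg, elevation_deg, num_bands=4):
    """Hypothetical sinusoidal encoding of a direction (not the paper's exact scheme).

    Returns an array of shape (..., 3 * 2 * num_bands).
    """
    az = np.deg2rad(np.asarray(azimuth_deg, dtype=float))
    el = np.deg2rad(np.asarray(elevation_deg, dtype=float))

    # Convert to a unit vector so that azimuth 0 and 360 degrees coincide.
    xyz = np.stack(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)],
        axis=-1,
    )

    # Assumed frequency schedule: pi * 1, 2, 4, ... (one octave per band).
    freqs = np.pi * 2.0 ** np.arange(num_bands)
    scaled = xyz[..., None] * freqs  # shape (..., 3, num_bands)

    # Concatenate sine and cosine features, then flatten per direction.
    feats = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return feats.reshape(*xyz.shape[:-1], -1)

# Two directions that describe the same point on the sphere.
codes = sinusoidal_direction_encoding([0.0, 360.0], [10.0, 10.0])
print(codes.shape)
```

Because the features are a continuous function of direction, a model conditioned on them can be queried at unmeasured directions, which is the property a grid-free approach relies on.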
Ditch the grid: BiFormer3D uses a spatial-encoding Transformer to reconstruct personalized 3D audio from sparse measurements, outperforming prior art without relying on frequency-domain hacks or minimum-phase assumptions.
Individualized head-related impulse responses (HRIRs) enable binaural rendering, but dense per-listener measurements are costly. We address HRIR spatial up-sampling from sparse per-listener measurements: given a few measured HRIRs for a listener, predict HRIRs at unmeasured target directions. Prior learning methods often work in the frequency domain, rely on minimum-phase assumptions or separate timing models, and use a fixed direction grid, which can degrade temporal fidelity and spatial continuity. We propose BiFormer3D, a time-domain, grid-free binaural Transformer for reconstructing HRIRs at arbitrary directions from sparse inputs. It uses sinusoidal spatial features, a Conv1D refinement module, and auxiliary interaural time difference (ITD) and interaural level difference (ILD) heads. On SONICOM, it improves normalized mean squared error (NMSE), cosine distance, and ITD/ILD errors over prior methods; ablations validate each module and show that minimum-phase pre-processing is unnecessary.
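To make the evaluation quantities concrete, here is a minimal NumPy sketch of NMSE, cosine distance, and simple ITD/ILD estimates for a pair of time-domain HRIRs. These are common textbook definitions chosen for illustration: the function names, the cross-correlation ITD estimator, and the broadband energy-ratio ILD are assumptions, and the paper's exact formulas may differ.

```python
import numpy as np

def nmse(pred, target):
    """Normalized MSE as a linear ratio: ||pred - target||^2 / ||target||^2."""
    return np.sum((pred - target) ** 2) / np.sum(target ** 2)

def cosine_distance(pred, target):
    """One minus the cosine similarity of the two flattened signals."""
    p, t = pred.ravel(), target.ravel()
    return 1.0 - np.dot(p, t) / (np.linalg.norm(p) * np.linalg.norm(t))

def estimate_itd_seconds(left, right, sample_rate):
    """ITD from the peak of the cross-correlation (assumed estimator).

    A positive value means the left-ear response arrives later.
    """
    xcorr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(xcorr)) - (len(right) - 1)
    return lag / sample_rate

def ild_db(left, right):
    """Broadband ILD as the left/right energy ratio in decibels."""
    return 10.0 * np.log10(np.sum(left ** 2) / np.sum(right ** 2))

# Toy HRIR pair: the left ear is 3 samples later and twice the amplitude.
fs = 48_000
right = np.zeros(64)
right[10] = 1.0
left = np.zeros(64)
left[13] = 2.0

print(estimate_itd_seconds(left, right, fs))  # 3 samples at 48 kHz, in seconds
print(ild_db(left, right))                    # about 6.02 dB
```

ITD/ILD errors would then be the differences between these quantities computed on predicted and on measured HRIRs, under the same assumed definitions.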