Mar 3, 2026 | arXiv:2603.02508

Decomposing the Influence of Physical Acoustic Modeling on Neural Personal Sound Zone Rendering: An Ablation Study

AI Summary

This paper investigates how individual physical acoustic modeling components affect the performance of deep learning-based personal sound zone (PSZ) rendering. The authors perform a controlled ablation study on a head-pose-conditioned binaural PSZ renderer (BSANN), progressively adding anechoically measured loudspeaker frequency responses, analytic circular-piston directivity, and rigid-sphere head-related transfer functions (HRTFs) to the simulated acoustic transfer functions (ATFs) used for training. Results show that loudspeaker frequency responses provide spectral calibration, directivity delivers the most consistent sound-zone separation gains, and rigid-sphere HRTFs dominate binaural separation by boosting crosstalk cancellation, primarily above 2 kHz.

Key Contribution

Adding anechoically measured loudspeaker frequency responses to the simulated training ATFs reduces inter-listener IPI imbalance and yields modest XTC improvements in neural personal sound zone rendering.

Abstract

Deep learning-based Personal Sound Zones (PSZs) rely on simulated acoustic transfer functions (ATFs) for training, yet idealized point-source models exhibit large sim-to-real gaps. While physically informed components improve generalization, their individual contributions remain unclear. This paper presents a controlled ablation study on a head-pose-conditioned binaural PSZ renderer using the Binaural Spatial Audio Neural Network (BSANN). We progressively enrich simulated ATFs with three components: (i) anechoically measured frequency responses of the particular loudspeakers (FR), (ii) analytic circular-piston directivity (DIR), and (iii) rigid-sphere head-related transfer functions (RS-HRTF). Four configurations are evaluated via in-situ measurements with two dummy heads. Performance metrics include inter-zone isolation (IZI), inter-program interference (IPI), and crosstalk cancellation (XTC) over 100-20000 Hz. Results show that FR provides spectral calibration, yielding modest XTC improvements and reduced inter-listener IPI imbalance. DIR delivers the most consistent sound-zone separation gains (10.05 dB average IZI/IPI). RS-HRTF dominates binaural separation, boosting XTC by +2.38/+2.89 dB (average 4.51 to 7.91 dB), primarily above 2 kHz, while introducing mild listener-dependent IZI/IPI shifts. These findings guide prioritization of measurements and models when constructing training ATFs under limited budgets.
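To make the FR and DIR enrichment steps concrete, the following is a minimal sketch of how a point-source ATF could be scaled by a loudspeaker frequency response and a circular-piston directivity term. It uses the textbook baffled-piston formula D(θ) = 2·J1(ka·sin θ)/(ka·sin θ); whether the paper uses exactly this form is an assumption, as are the function names, the speed of sound, and the piston radius in the example. The rigid-sphere HRTF stage is omitted.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not taken from the paper


def bessel_j1(x: float, terms: int = 40) -> float:
    """First-order Bessel function J1 via its power series (fine for |x| up to ~20)."""
    total = 0.0
    for m in range(terms):
        coeff = (-1) ** m / (math.factorial(m) * math.factorial(m + 1))
        total += coeff * (x / 2.0) ** (2 * m + 1)
    return total


def piston_directivity(freq_hz: float, angle_rad: float, radius_m: float) -> float:
    """Far-field directivity of a baffled circular piston, normalized to 1 on axis."""
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND  # wavenumber
    x = k * radius_m * math.sin(angle_rad)
    if abs(x) < 1e-9:
        return 1.0  # limit of 2*J1(x)/x as x -> 0
    return 2.0 * bessel_j1(x) / x


def simulated_atf(freq_hz: float, distance_m: float, angle_rad: float,
                  radius_m: float, speaker_fr: complex = 1.0 + 0j) -> complex:
    """Hypothetical ATF: free-field point source times FR times DIR."""
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    point_source = cmath.exp(-1j * k * distance_m) / (4.0 * math.pi * distance_m)
    return speaker_fr * piston_directivity(freq_hz, angle_rad, radius_m) * point_source


if __name__ == "__main__":
    # Illustrative values only: a 5 cm piston, listener 2 m away, 30 degrees off axis.
    for f in (500.0, 2000.0, 8000.0):
        h = simulated_atf(f, 2.0, math.radians(30.0), 0.05)
        print(f"{f:7.0f} Hz  |H| = {abs(h):.5f}")
```

With `speaker_fr` left at 1 and the angle set to 0, the function reduces to the idealized point-source model that the abstract describes as the baseline, which makes it easy to toggle each component on and off in the spirit of the ablation.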

Topics: Data Curation & Synthetic Data; Robotics & Embodied AI; Speech & Audio

Citation Metrics

Citations: 0
Influential citations: 0
References: 25
Year: 2026
Venue: N/A
