May 3, 2026 | arXiv:2605.01673

Delayed Commitment for Representation Readiness in Stage-wise Audio-Visual Learning

Xinmeng Xu, Haoran Xie, S. Joe Qin, Lin Li, Xiaohui Tao, Fu Lee Wang

AI Summary

The paper identifies a "premature perceptual commitment" problem in stage-wise audio-visual encoders: early fused states that lack sufficient cross-layer and cross-modal support are propagated anyway, which hinders the formation of later representations. To address this, the authors propose the Delayed Perceptual Commitment Network (DPC-Net), which estimates a readiness-deficiency surrogate and applies support-aware correction at the intervention-sensitive bottleneck. Experiments on audio-visual speech separation, event localization, and speech recognition show consistent improvements, supporting the effectiveness of readiness-guided bottleneck correction.

Key Contribution

Audio-visual models can be consistently improved by delaying perceptual commitment: intermediate fusion states that lack sufficient cross-layer and cross-modal support are corrected with that evidence before they guide later fusion.

Abstract

Stage-wise audio-visual encoders propagate fused intermediate states across layers, making the formation of later representations depend on the readiness of earlier fusion states. Strong local audio-visual agreement provides useful correspondence evidence, yet a fused state also needs sufficient cross-layer and cross-modal support before it can reliably guide later fusion. This paper studies this issue through propagation-aware representation readiness and formulates premature perceptual commitment as a readiness-deficiency problem, where local plausibility, propagation influence, and support insufficiency jointly appear at an intermediate stage. We propose the Delayed Perceptual Commitment Network (DPC-Net), an encoder-level framework that estimates an observable readiness-deficiency surrogate, localizes the intervention-sensitive bottleneck, and applies support-aware correction with cross-layer and cross-modal evidence. DPC-Net preserves task-specific heads, losses, decoding modules, and evaluation protocols, making it applicable to different audio-visual tasks through encoder-side intervention. Experiments on audio-visual speech separation, audio-visual event localization, and audio-visual speech recognition show consistent improvements across reconstruction, localization, and recognition regimes. Further analyses on component contribution, selection criteria, counterfactual intervention, and readiness trajectories support the effectiveness of readiness-guided bottleneck correction.
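The estimate, localize, and correct steps described in the abstract can be illustrated with a toy sketch. The abstract gives no formulas, so everything here is an assumption made for illustration only: the `StageScores` fields, the product form of the surrogate, the argmax selection rule, and the gated blend are hypothetical and are not the authors' DPC-Net implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class StageScores:
    """Hypothetical per-stage scores, each assumed to lie in [0, 1]."""
    plausibility: float  # local audio-visual agreement of the fused state
    influence: float     # how strongly the state shapes later stages
    support: float       # available cross-layer / cross-modal support


def deficiency(s: StageScores) -> float:
    # Assumed surrogate: large only when the state looks plausible,
    # is influential, AND lacks support (all three co-occur).
    return s.plausibility * s.influence * (1.0 - s.support)


def find_bottleneck(stages: Sequence[StageScores]) -> int:
    # Assumed selection rule: the stage with the largest deficiency.
    return max(range(len(stages)), key=lambda i: deficiency(stages[i]))


def correct(state: List[float], evidence: List[float], gate: float) -> List[float]:
    # Assumed correction: gated blend of the fused state with support evidence.
    return [(1.0 - gate) * x + gate * e for x, e in zip(state, evidence)]


# Made-up scores for a three-stage encoder.
stages = [
    StageScores(plausibility=0.4, influence=0.3, support=0.8),
    StageScores(plausibility=0.9, influence=0.8, support=0.2),
    StageScores(plausibility=0.7, influence=0.5, support=0.9),
]
k = find_bottleneck(stages)        # stage chosen for intervention
gate = deficiency(stages[k])       # correction strength at that stage
corrected = correct([1.0, 0.0], [0.0, 1.0], gate)
```

In this toy setting the second stage is selected because it is both plausible and influential while having little support, which is the pattern the abstract calls premature perceptual commitment. Only the encoder-side state is modified, mirroring the abstract's statement that task heads, losses, and decoding are left unchanged.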

Architecture Design (Transformers, SSMs, MoE) | Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
