Dr. SHAP-AV is a framework that uses Shapley values to quantify how much the audio and visual modalities each contribute in Audio-Visual Speech Recognition (AVSR) models. It comprises three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Experiments on six models across two benchmarks show that models keep a high audio contribution even under severe noise, which points to a persistent audio bias.
Despite the intuition that noisy environments should push AVSR models to rely more on visual input, a persistent audio bias remains, even under severe degradation.
Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. We introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. We apply them to six models across two benchmarks and varying signal-to-noise ratio (SNR) levels. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
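To make the idea of a Shapley-based modality contribution concrete, the sketch below computes exact Shapley values for a toy game with two players, "audio" and "video". It is an illustration only, not the Dr. SHAP-AV implementation: the function names `shapley_values` and `modality_shares`, the coalition scores in `TOY_SCORES`, and the normalization by absolute value are all assumptions introduced here. In practice the value function would come from running an AVSR model with one or both input streams masked, and with many input features the exact enumeration would typically be replaced by a sampling approximation.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values: average each player's marginal gain over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal gain of adding player p to the players that came before it.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def modality_shares(phi: Dict[str, float]) -> Dict[str, float]:
    """Normalize absolute Shapley values so the shares sum to 1 (an assumed convention)."""
    total = sum(abs(v) for v in phi.values())
    if total == 0.0:
        return {p: 0.0 for p in phi}
    return {p: abs(v) / total for p, v in phi.items()}


# Hypothetical model scores for one utterance, indexed by which modalities are kept.
# These numbers are made up for illustration; they are not results from the paper.
TOY_SCORES = {
    frozenset(): 0.0,
    frozenset({"audio"}): 0.5,
    frozenset({"video"}): 0.25,
    frozenset({"audio", "video"}): 1.0,
}

phi = shapley_values(["audio", "video"], TOY_SCORES.__getitem__)
shares = modality_shares(phi)
print(phi)     # Shapley value per modality
print(shares)  # relative contribution per modality
```

With two players, each Shapley value is the mean of two marginal gains: the gain from adding the modality to an empty input and the gain from adding it when the other modality is already present. The values always sum to the full-input score minus the empty-input score, which is what makes them usable as a decomposition of the model's output into per-modality contributions.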