Mar 12, 2026 | arXiv:2603.12046

Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

Umberto Cappellazzo, Stavros Petridis, Maja Pantic

AI Summary

Dr. SHAP-AV is a framework that uses Shapley values to quantify the relative contributions of the audio and visual modalities in Audio-Visual Speech Recognition (AVSR) models. It introduces three analyses (Global SHAP, Generative SHAP, and Temporal Alignment SHAP) that together characterize modality balance. Experiments across six models, two benchmarks, and varying SNR levels reveal a persistent audio bias, even under severe noise, and show that modality balance shifts during generation.

Key Contribution

Despite the intuition that noisy environments should make models rely more on visual cues, AVSR models stubbornly cling to audio, even when it's heavily degraded.

Abstract

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
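
As a rough illustration of the idea behind Shapley-based modality attribution, the sketch below computes exact Shapley values for a two-player game in which the players are the audio and video streams. It is not the paper's implementation: the function names, the toy_scores table, and the choice to normalize absolute values into shares are all assumptions made for this example, and the scores are invented numbers rather than results from any model.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small cooperative game.

    `value` maps a frozenset of players to a scalar score.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Standard Shapley weight for a coalition of size k.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal gain from adding player p to coalition s.
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def relative_contributions(phi):
    """Normalize absolute Shapley values into shares that sum to 1."""
    total = sum(abs(v) for v in phi.values())
    if total == 0:
        return {p: 1.0 / len(phi) for p in phi}
    return {p: abs(v) / total for p, v in phi.items()}


# Hypothetical scores for each modality subset (illustrative only).
toy_scores = {
    frozenset(): 0.0,
    frozenset({"audio"}): 0.6,
    frozenset({"video"}): 0.3,
    frozenset({"audio", "video"}): 0.9,
}

phi = shapley_values(["audio", "video"], toy_scores.__getitem__)
shares = relative_contributions(phi)
print(phi)
print(shares)
```

With these made-up scores, audio receives roughly two thirds of the total attribution and video roughly one third. In a real AVSR setting the value function would come from model outputs with one or both modalities masked, and the number of players would typically be far larger than two, so exact enumeration would give way to sampling-based approximation.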

Interpretability & Mechanistic Interp | Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 59

Year: 2026

Venue: N/A
