Mar 30, 2026 | arXiv:2603.28717

Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?

Ashwini Dasare, Nirmesh Shah, Ashish Gudmalwar, Ashishkumar Gudmalwar, P. Wasnik, Pankaj Wasnik

AI Summary

This paper introduces a hierarchical multimodal architecture that fuses audio, video, and text features to predict human perception of AI-dubbed content quality. The model uses LoRA adapters for parameter-efficient fine-tuning and is trained on a dataset of 12k Hindi-English dubbed clips, using proxy MOS derived from objective metrics to augment limited human labels. The system achieves strong perceptual alignment (PCC > 0.75) with human ratings, demonstrating a scalable approach for automatic dubbing evaluation.

Key Contribution

Reduce reliance on expensive human ratings: this hierarchical multimodal model predicts human perception of AI-dubbed content quality from audio, video, and text inputs, reaching PCC > 0.75 against human MOS.

Abstract

Evaluating AI-generated dubbed content is inherently multi-dimensional, shaped by synchronization, intelligibility, speaker consistency, emotional alignment, and semantic context. Human Mean Opinion Scores (MOS) remain the gold standard but are costly and impractical at scale. We present a hierarchical multimodal architecture for perceptually meaningful dubbing evaluation, integrating complementary cues from audio, video, and text. The model captures fine-grained features such as speaker identity, prosody, and content from audio; facial expressions and scene-level cues from video; and semantic context from text. These features are progressively fused through intra- and inter-modal layers. Lightweight LoRA adapters enable parameter-efficient fine-tuning across modalities. To overcome limited subjective labels, we derive proxy MOS by aggregating objective metrics with weights optimized via active learning. The proposed architecture was trained on 12k Hindi-English bidirectional dubbed clips, followed by fine-tuning with human MOS. Our approach achieves strong perceptual alignment (Pearson correlation coefficient, PCC > 0.75), providing a scalable solution for automatic evaluation of AI-dubbed content.
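As a rough illustration of two ideas in the abstract, the sketch below aggregates objective metrics into a proxy MOS and measures alignment with human ratings using the Pearson correlation coefficient (PCC). It is a minimal sketch under stated assumptions, not the authors' implementation: the metric names, the fixed weights, the [0, 1] normalization, and the 1 to 5 MOS scale are all hypothetical, and the active-learning step that the paper uses to optimize the weights is not shown.

```python
from math import sqrt


def proxy_mos(metrics, weights, lo=1.0, hi=5.0):
    """Weighted average of objective metrics, mapped onto a MOS-like scale.

    Assumes every metric is already normalized to [0, 1] (higher is better).
    The weights here are fixed by hand; the paper optimizes them via
    active learning, which this sketch does not reproduce.
    """
    total = sum(weights.values())
    score = sum(weights[name] * metrics[name] for name in weights) / total
    return lo + (hi - lo) * score


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical metric names and weights, for illustration only.
weights = {"lip_sync": 0.4, "intelligibility": 0.4, "speaker_similarity": 0.2}
clip = {"lip_sync": 0.8, "intelligibility": 0.9, "speaker_similarity": 0.6}
score = proxy_mos(clip, weights)
```

Given a list of predicted scores and the matching human MOS for the same clips, `pearson(predicted, human)` yields the kind of PCC figure the abstract reports.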

Eval Frameworks & Benchmarks | Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 24

Year: 2026

Venue: N/A
