This paper introduces a hierarchical multimodal architecture that fuses audio, video, and text features to predict human perception of AI-dubbed content quality. The model uses LoRA adapters for parameter-efficient fine-tuning and is trained on a dataset of 12k Hindi-English dubbed clips, using proxy MOS derived from objective metrics to augment limited human labels. The system achieves strong perceptual alignment (PCC > 0.75) with human ratings, demonstrating a scalable approach for automatic dubbing evaluation.
Skip expensive human ratings: this hierarchical multimodal model accurately predicts human perception of AI-dubbed content quality using only audio, video, and text inputs.
Evaluating AI-generated dubbed content is inherently multi-dimensional, shaped by synchronization, intelligibility, speaker consistency, emotional alignment, and semantic context. Human Mean Opinion Scores (MOS) remain the gold standard but are costly and impractical at scale. We present a hierarchical multimodal architecture for perceptually meaningful dubbing evaluation that integrates complementary cues from audio, video, and text. The model captures fine-grained features such as speaker identity, prosody, and content from audio; facial expressions and scene-level cues from video; and semantic context from text. These features are progressively fused through intra- and inter-modal layers. Lightweight LoRA adapters enable parameter-efficient fine-tuning across modalities. To overcome limited subjective labels, we derive proxy MOS by aggregating objective metrics with weights optimized via active learning. The proposed architecture was trained on 12k Hindi-English bidirectional dubbed clips with proxy MOS, followed by fine-tuning with human MOS. Our approach achieves strong perceptual alignment with human ratings (Pearson correlation coefficient, PCC > 0.75), providing a scalable solution for automatic evaluation of AI-dubbed content.
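As a rough illustration of the two quantities the abstract relies on, the sketch below computes a proxy MOS as a weighted aggregate of objective metrics and a Pearson correlation coefficient (PCC) between predicted and human scores. It is a minimal sketch under stated assumptions, not the authors' implementation: the metric names, the weight values, the pre-normalization of each metric to [0, 1], and the linear mapping onto a 1-5 scale are all hypothetical, and the weights are fixed here rather than optimized via active learning.

```python
from math import sqrt


def proxy_mos(metrics, weights):
    """Aggregate objective metrics into a proxy MOS.

    Assumption: every metric is already normalized to [0, 1], higher is better.
    The weighted mean is mapped linearly onto a 1-5 MOS-style scale.
    """
    total_weight = sum(weights.values())
    weighted_mean = sum(weights[name] * metrics[name] for name in weights) / total_weight
    return 1.0 + 4.0 * weighted_mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical weights and per-clip metrics, for illustration only.
weights = {"lip_sync": 0.4, "intelligibility": 0.35, "speaker_similarity": 0.25}
clips = [
    {"lip_sync": 0.9, "intelligibility": 0.8, "speaker_similarity": 0.7},
    {"lip_sync": 0.4, "intelligibility": 0.6, "speaker_similarity": 0.5},
    {"lip_sync": 0.2, "intelligibility": 0.3, "speaker_similarity": 0.4},
]
proxy_scores = [proxy_mos(clip, weights) for clip in clips]
human_scores = [4.5, 3.0, 2.0]  # made-up ratings, not data from the paper
print(pearson(proxy_scores, human_scores))
```

In this framing, a PCC close to 1 would mean the proxy scores rank and scale clips much as human raters do, which is the sense in which the abstract reports PCC > 0.75.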