HIT | Apr 13, 2026 | arXiv:2604.11283

Empowering Video Translation using Multimodal Large Language Models

AI Summary

This paper presents a comprehensive review of Multimodal Large Language Models (MLLMs) in video translation, categorizing their roles into Semantic Reasoner, Expressive Performer, and Visual Synthesizer. It analyzes how MLLMs overcome limitations of traditional pipelines by jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. The review identifies open challenges in video understanding, temporal modeling, and multimodal alignment, suggesting future research directions for MLLM-powered video translation.

Key Contribution

MLLMs aren't just improving video translation quality; they're fundamentally changing how we approach it by jointly optimizing for semantic accuracy, timing, speaker identity, and emotional nuance.

Abstract

Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding, reasoning, and generation capabilities, MLLM-based video translation systems are overcoming the limitations of traditional cascaded pipelines that separately handle automatic speech recognition, machine translation, text-to-speech, and lip synchronization. These MLLM-powered approaches not only achieve competitive or superior translation quality, but also demonstrate stronger robustness in zero-shot settings and multi-speaker scenarios, while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. However, despite the rapid progress of MLLMs and extensive surveys on general video-language understanding, a focused and systematic review of how MLLMs empower video translation tasks is still lacking. To fill this gap, we provide the first comprehensive overview of MLLM-based video translation, organized around a three-role taxonomy: 1) Semantic Reasoner, which characterizes how MLLMs perform video understanding, temporal reasoning, and multimodal fusion; 2) Expressive Performer, which analyzes LLM-driven and LLM-augmented techniques for expressive, controllable speech generation; and 3) Visual Synthesizer, which examines different types of video generators for high-fidelity lip-sync and visual alignment. Finally, we discuss open challenges in video understanding, temporal modeling, and multimodal alignment, and outline promising future research directions for MLLM-powered video translation.

Multimodal Models, Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
