This paper presents a comprehensive review of Multimodal Large Language Models (MLLMs) in video translation, categorizing their roles into Semantic Reasoner, Expressive Performer, and Visual Synthesizer. It analyzes how MLLMs overcome limitations of traditional pipelines by jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. The review identifies open challenges in video understanding, temporal modeling, and multimodal alignment, suggesting future research directions for MLLM-powered video translation.
MLLMs aren't just improving video translation quality; they're fundamentally changing how we approach it by jointly optimizing for semantic accuracy, timing, speaker identity, and emotional nuance.
Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding, reasoning, and generation capabilities, MLLM-based video translation systems are overcoming the limitations of traditional cascaded pipelines that separately handle automatic speech recognition, machine translation, text-to-speech, and lip synchronization. These MLLM-powered approaches not only achieve competitive or superior translation quality, but also demonstrate stronger robustness in zero-shot settings and multi-speaker scenarios, while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. However, despite the rapid progress of MLLMs and extensive surveys on general video-language understanding, a focused and systematic review of how MLLMs empower video translation tasks is still lacking. To fill this gap, we provide the first comprehensive overview of MLLM-based video translation, organized around a three-role taxonomy: 1) Semantic Reasoner, which characterizes how MLLMs perform video understanding, temporal reasoning, and multimodal fusion; 2) Expressive Performer, which analyzes LLM-driven and LLM-augmented techniques for expressive, controllable speech generation; and 3) Visual Synthesizer, which examines different types of video generators for high-fidelity lip-sync and visual alignment. Finally, we discuss open challenges in video understanding, temporal modeling, and multimodal alignment, and outline promising future research directions for MLLM-powered video translation.