This paper introduces a globally video-guided multimodal translation framework that addresses the limitations of existing VMT methods in capturing long-range video context. The approach uses a pretrained semantic encoder and a vector database to retrieve video segments semantically related to the target subtitle, from which a context set is constructed. A novel region-aware cross-modal attention mechanism further enhances semantic alignment, leading to improved translation performance, especially in long-video scenarios.
Forget one-to-one video segment alignments: this new framework leverages global video context to significantly improve multimodal translation, especially for long videos.
Video-guided Multimodal Translation (VMT) has advanced significantly in recent years. However, most existing methods rely on video segments locally aligned one-to-one with subtitles, limiting their ability to capture global narrative context across multiple segments in long videos. To overcome this limitation, we propose a globally video-guided multimodal translation framework that uses a pretrained semantic encoder and subtitle retrieval over a vector database to construct a context set of video segments closely related to the semantics of the target subtitle. An attention mechanism focuses on the most relevant visual content, while the remaining video features are preserved to retain broader contextual information. Furthermore, we design a region-aware cross-modal attention mechanism to strengthen semantic alignment during translation. Experiments on a large-scale documentary translation dataset demonstrate that our method significantly outperforms baseline models, highlighting its effectiveness in long-video scenarios.
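The abstract includes no code, so the following is a minimal sketch of the retrieval step, assuming sentence-transformers as the pretrained semantic encoder and FAISS as the vector database; the paper does not name its actual components, and every identifier below is illustrative.

```python
import faiss
from sentence_transformers import SentenceTransformer

# Hypothetical stand-ins: the paper names neither its semantic encoder
# nor its vector database, so a SentenceTransformer model and a FAISS
# index are used here purely for illustration.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(subtitles):
    """Embed every subtitle and index it for cosine (inner-product) search."""
    emb = encoder.encode(subtitles, convert_to_numpy=True,
                         normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])  # exact inner-product search
    index.add(emb)
    return index

def retrieve_context_set(index, target_subtitle, k=5):
    """Return ids and scores of the k segments whose subtitles are most
    semantically similar to the target subtitle; those segments' video
    features would form the context set."""
    q = encoder.encode([target_subtitle], convert_to_numpy=True,
                       normalize_embeddings=True)
    scores, ids = index.search(q, k)
    return ids[0], scores[0]

subs = ["A lion stalks the herd at dawn.",
        "The pride rests in the shade at midday.",
        "Rain finally returns to the savanna."]
index = build_index(subs)
ids, scores = retrieve_context_set(index, "The lions hunt at sunrise.", k=2)
```

Because subtitle i indexes the video features of segment i, the returned ids select which segments enter the context set, which is what lets the model draw on narrative context far from the segment currently being translated.

The region-aware cross-modal attention itself is not detailed in the abstract; the PyTorch sketch below shows only the generic shape such a module could take, with subtitle token states attending over region-level visual features. Every dimension and name is an assumption, not the paper's design.

```python
import torch
import torch.nn as nn

class RegionCrossModalAttention(nn.Module):
    """Illustrative sketch only: subtitle tokens (queries) attend over
    region-level visual features (keys/values). The paper's specific
    region-aware mechanism may differ from this generic formulation."""

    def __init__(self, d_text=512, d_region=1024, d_model=512, n_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_model)
        self.region_proj = nn.Linear(d_region, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_states, region_feats, region_pad_mask=None):
        # text_states:     (B, T, d_text)    subtitle token representations
        # region_feats:    (B, R, d_region)  visual region features
        # region_pad_mask: (B, R) bool, True where a region is padding
        q = self.text_proj(text_states)
        kv = self.region_proj(region_feats)
        attended, _ = self.attn(q, kv, kv, key_padding_mask=region_pad_mask)
        return self.norm(q + attended)  # residual + norm, transformer-style
```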