Tsinghua AI, Fudan, HKUST, TJU, Tongji, Xiaohongshu, University of California
May 23, 2026
arXiv:2605.24675

VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

Bo Li, Ronghao Chen, Ningyuan Deng, Huacan Wang, Shaolin Zhu, Lijie Wen

AI Summary

VaaWIT is introduced as an end-to-end framework for adapting LLMs to multilingual web image translation, addressing the visual representation gap where standard encoders fail to capture fine-grained visual details. The framework incorporates a Dual-Stream Attention Module (DSAM) for bidirectional interaction between multilingual semantic features and detailed visual representations, along with a Visual-Aware Adapter (VAA) for parameter-efficient fine-tuning. Experiments across eight tasks on three benchmarks demonstrate that VaaWIT outperforms open-source baselines and achieves competitive performance against proprietary models.

Key Contribution

LLMs can now translate text in web images with significantly improved accuracy and efficiency thanks to a novel visual-aware adaptation framework that bridges the gap between high-level semantics and fine-grained visual details.

Abstract

Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly within social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging due to the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models for multilingual Web image translation. The framework introduces two key technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations, thereby synthesizing unified features robust to textual variations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy that dynamically injects these fused visual cues into the frozen LLM backbone. This design enables the model to effectively align visual context with linguistic reasoning while minimizing computational costs. Extensive experiments on eight tasks across three public benchmarks demonstrate that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.
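The abstract gives no equations for DSAM or VAA, so the following is only a rough sketch of the two ideas it names: bidirectional attention between a semantic stream and a visual stream, and a small adapter that injects the fused features into frozen hidden states. Everything here is an assumption made for illustration, not the authors' implementation: the function and class names (`dual_stream_fuse`, `GatedVisualAdapter`), the single-head attention without learned projections, the bottleneck shape, and the zero-initialized gate.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention with no learned projections
    # (a simplification; a real module would project Q, K, and V).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def dual_stream_fuse(semantic, visual):
    # Hypothetical bidirectional interaction: each stream attends to the
    # other, with a residual connection, then the streams are concatenated.
    sem_updated = semantic + cross_attention(semantic, visual)
    vis_updated = visual + cross_attention(visual, semantic)
    return np.concatenate([sem_updated, vis_updated], axis=0)

class GatedVisualAdapter:
    # Hypothetical adapter: only `down`, `up`, and `gate` would be trained;
    # the hidden states come from a frozen backbone.
    def __init__(self, d_model, bottleneck, rng):
        self.down = rng.normal(0.0, 0.02, (d_model, bottleneck))
        self.up = rng.normal(0.0, 0.02, (bottleneck, d_model))
        self.gate = 0.0  # assumed zero init: the adapter starts as a no-op

    def __call__(self, hidden, fused):
        # Text tokens query the fused visual features.
        context = cross_attention(hidden, fused)
        # Bottleneck MLP with ReLU keeps the trainable parameter count small.
        delta = np.maximum(context @ self.down, 0.0) @ self.up
        return hidden + np.tanh(self.gate) * delta

rng = np.random.default_rng(0)
semantic = rng.normal(size=(4, 16))  # 4 semantic tokens, width 16
visual = rng.normal(size=(9, 16))    # 9 visual patch tokens, width 16
hidden = rng.normal(size=(5, 16))    # 5 frozen-LLM hidden states
fused = dual_stream_fuse(semantic, visual)
adapter = GatedVisualAdapter(d_model=16, bottleneck=4, rng=rng)
out = adapter(hidden, fused)
```

With the gate at zero the adapter returns the frozen hidden states unchanged, which is one common way to keep early fine-tuning stable; whether VaaWIT uses such a gate is not stated in the abstract.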

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
