InternVL3 is a new multimodal model built with a native multimodal pre-training paradigm: it jointly learns from multimodal and pure-text data in a single pre-training stage, avoiding the alignment issues that arise when adapting text-only LLMs after the fact. The model incorporates variable visual position encoding (V2PE) for longer multimodal contexts and uses post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), along with test-time scaling. InternVL3-78B achieves state-of-the-art performance among open-source MLLMs, scoring 72.2 on the MMMU benchmark and rivaling proprietary models while maintaining strong pure-language proficiency.
Open-source multimodal models just leveled up: InternVL3 rivals closed-source titans like GPT-4o by pre-training vision and language together from the start.
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multimodal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
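The abstract names variable visual position encoding (V2PE) but does not spell out its mechanism. As a rough illustration only, the sketch below assumes the core rule described in the V2PE formulation: text tokens advance the positional index by 1, while visual tokens advance it by a smaller stride, so long image sequences consume fewer positions in the context window. The function name `v2pe_positions` and the default stride value are hypothetical and not taken from the paper.

```python
# Minimal sketch of the V2PE idea (illustrative, not the paper's implementation):
# text tokens step the position index by 1, visual tokens by a fractional stride.
from typing import List


def v2pe_positions(token_types: List[str], visual_stride: float = 0.25) -> List[float]:
    """Assign fractional position indices to a mixed text/visual token sequence.

    token_types: list of "text" or "visual" markers, one per token.
    visual_stride: increment used for visual tokens (assumed default for the sketch).
    """
    positions = []
    pos = 0.0
    for t in token_types:
        positions.append(pos)
        pos += 1.0 if t == "text" else visual_stride
    return positions


# Example: 3 text tokens, 8 visual tokens, 2 text tokens.
tokens = ["text"] * 3 + ["visual"] * 8 + ["text"] * 2
print(v2pe_positions(tokens))
# The 8 visual tokens span only 2 position units instead of 8,
# leaving more of the positional range for additional multimodal input.
```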