M-MiniGPT4, a multilingual vision-language model, was created by training a MiniGPT4 architecture on a mixture of native multilingual and translated data. A novel multilingual alignment training stage using parallel text corpora further enhances the model's cross-lingual capabilities. The resulting model achieves 36% accuracy on the multilingual MMMU benchmark, surpassing other models in its weight class.
Multilingual vision-language models can reach competitive performance for their weight class (36% accuracy on multilingual MMMU) by training on a mixture of native and translated data and adding an alignment stage on parallel text corpora.
This paper presents M-MiniGPT4, a multilingual vision large language model with strong vision-language understanding (VLU) capabilities across 11 languages. We train the MiniGPT4 architecture on a mixture of native multilingual and translated data to improve its multilingual VLU performance. In addition, we propose a multilingual alignment training stage that uses parallel text corpora to further enhance the multilingual capabilities of our model. M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed. We open-source our models, code, and translated datasets to facilitate future research in low-resource and multilingual settings.
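The abstract does not specify the alignment objective, so the following is only a minimal sketch of what a parallel-corpus alignment stage might look like, assuming a standard next-token language-modeling loss on concatenated source/target sentence pairs. The backbone model, prompt format, dataset, and hyperparameters below are illustrative assumptions, not the paper's actual recipe.

```python
# Hypothetical sketch of a multilingual alignment stage (NOT the paper's method):
# fine-tune a causal LM on parallel sentence pairs so that a sentence in one
# language is continued by its translation, encouraging shared cross-lingual
# representations in the LLM backbone.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer


class ParallelTextDataset(Dataset):
    """Pairs of (source-language sentence, target-language sentence)."""

    def __init__(self, pairs, tokenizer, max_len=128):
        self.pairs, self.tok, self.max_len = pairs, tokenizer, max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        src, tgt = self.pairs[i]
        # Concatenate the pair; the "Translation:" prompt format is an assumption.
        text = f"{src}\nTranslation: {tgt}{self.tok.eos_token}"
        enc = self.tok(text, truncation=True, max_length=self.max_len,
                       padding="max_length", return_tensors="pt")
        ids = enc.input_ids.squeeze(0)
        mask = enc.attention_mask.squeeze(0)
        labels = ids.clone()
        labels[mask == 0] = -100  # ignore padding positions in the LM loss
        return {"input_ids": ids, "attention_mask": mask, "labels": labels}


def alignment_stage(model, tokenizer, pairs, epochs=1, lr=2e-5, batch_size=8):
    """One pass of LM fine-tuning on parallel text (the hypothetical alignment stage)."""
    loader = DataLoader(ParallelTextDataset(pairs, tokenizer),
                        batch_size=batch_size, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss  # standard next-token cross-entropy
            loss.backward()
            optim.step()
            optim.zero_grad()
    return model
```

A toy usage example, with a small placeholder backbone standing in for whatever LLM the paper actually uses:

```python
tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder backbone, not the paper's
tok.pad_token = tok.eos_token                # GPT-2 defines no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")
pairs = [("The cat sleeps.", "Le chat dort."),
         ("Good morning.", "Bonjour.")]      # toy parallel corpus
alignment_stage(model, tok, pairs, batch_size=2)
```

The design intuition, under these assumptions, is that forcing the model to predict translations ties the representations of the 11 languages together in the shared backbone, which the vision-language stages can then exploit.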