Nov 6, 2025 | arXiv:2511.04727

IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs

Ali Faraz, Akash, Shaharukh Khan, Raja Kolla, Akshat Patidar, Suranjan Goswami, Abhinav Ravi, Chandra Khatri, Shubham Agarwal

AI Summary

The authors introduce IndicVisionBench, a new large-scale benchmark designed to evaluate vision-language models (VLMs) on cultural and multilingual understanding, specifically focusing on the Indian subcontinent. The benchmark includes 3 multimodal tasks (OCR, MMT, VQA) across 10 Indian languages and English, comprising ~5K images and 37K+ QA pairs covering 13 culturally grounded topics. Evaluation of 8 VLMs reveals significant performance gaps, highlighting the need for more inclusive multimodal research.

Key Contribution

Current VLMs stumble significantly when faced with the cultural and linguistic nuances of the Indian subcontinent, as revealed by the new IndicVisionBench benchmark.

Abstract

Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks: Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), and covers 6 question types. The final benchmark comprises ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate 8 models, ranging from proprietary closed-source systems to open-weights medium and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 77

Year: 2025

Venue: arXiv.org
