The authors introduce IndicVisionBench, a new large-scale benchmark designed to evaluate vision-language models (VLMs) on cultural and multilingual understanding, specifically focusing on the Indian subcontinent. The benchmark includes 3 multimodal tasks (OCR, MMT, VQA) across 10 Indian languages and English, comprising ~5K images and 37K+ QA pairs covering 13 culturally grounded topics. Evaluation of 8 VLMs reveals significant performance gaps, highlighting the need for more inclusive multimodal research.
Current VLMs stumble significantly when faced with the cultural and linguistic nuances of the Indian subcontinent, as revealed by the new IndicVisionBench benchmark.
Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks: Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), with 6 question types. The final benchmark comprises ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate 8 models, ranging from proprietary closed-source systems to open-weights medium and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.