KU, LMU, Munich Center for Machine Learning, SYSU, Tübingen
Jun 8, 2026
arXiv:2606.08959

ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

Yi Zhang, Yong Cao, Chengyan Wu, Daniel Hershcovich, Anna-Carolina Haensch

AI Summary

This paper introduces ChinaHeritaQA, a multimodal benchmark dataset designed to assess the cultural reasoning capabilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset contains 2,279 images and 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions. Evaluations show that while top state-of-the-art VLMs exceed human performance on average, they exhibit significant weaknesses in culturally grounded reasoning. The findings highlight a disconnect between visual recognition abilities and the understanding of cultural and historical contexts, with performance also varying by dynasty and region.

Key Contribution

Despite outperforming humans on average, top VLMs falter in culturally grounded reasoning, revealing a critical gap in their understanding of heritage.

Abstract

We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models excel at visual recognition but struggle with culturally grounded reasoning. Performance also varies by dynasty and region. ChinaHeritaQA reveals that strong visual retrieval does not extend to cultural and historical understanding. We release the dataset to support future research on culturally aware multimodal learning.

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models


Year: 2026

Venue: N/A
