Getty Conservation Institute | Manchester | WHU | Apr 8, 2026 | arXiv:2604.07338

Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis, Sophia Ananiadou

AI Summary

The paper introduces Appear2Meaning, a benchmark for evaluating Vision-Language Models (VLMs) on their ability to infer structured cultural metadata from images across diverse cultural contexts. The authors use an LLM-as-Judge framework to assess the semantic alignment of VLM predictions with reference annotations, reporting exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that current VLMs produce inconsistent and weakly grounded predictions, with substantial performance variation across cultures and metadata types.

Key Contribution

VLMs still struggle to consistently extract structured cultural metadata from images, revealing a critical gap in their ability to reason beyond visual perception across diverse cultural contexts.

Abstract

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
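The three reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the LLM judge is taken to return one boolean verdict per attribute, "exact match" is read as all attributes judged correct, and "partial match" as some but not all judged correct. The attribute names come from the abstract's examples; the function name `summarize`, the record layout, and the region labels in the usage note are hypothetical.

```python
from collections import defaultdict

# Attribute names follow the examples in the abstract; the real schema may differ.
ATTRIBUTES = ("creator", "origin", "period")


def summarize(records):
    """Aggregate per-attribute judge verdicts into per-region metrics.

    Each record is assumed to look like:
        {"region": str, "verdicts": {attribute: bool, ...}}
    where a verdict of True means the judge accepted the prediction.
    """
    by_region = defaultdict(list)
    for rec in records:
        by_region[rec["region"]].append(rec["verdicts"])

    report = {}
    for region, verdict_list in by_region.items():
        n = len(verdict_list)
        # Assumed definition: exact match = every attribute judged correct.
        exact = sum(all(v[a] for a in ATTRIBUTES) for v in verdict_list)
        # Assumed definition: partial match = at least one, but not all, correct.
        partial = sum(
            any(v[a] for a in ATTRIBUTES) and not all(v[a] for a in ATTRIBUTES)
            for v in verdict_list
        )
        # Attribute-level accuracy = fraction of items correct for each attribute.
        per_attribute = {
            a: sum(v[a] for v in verdict_list) / n for a in ATTRIBUTES
        }
        report[region] = {
            "exact_match": exact / n,
            "partial_match": partial / n,
            "attribute_accuracy": per_attribute,
        }
    return report
```

Called on a list of judged items, for example two records labelled "Region A" and one labelled "Region B", `summarize` returns one entry per region, which makes the cross-cultural variation described above directly comparable.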

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Citation Metrics

Citations: 0
Influential citations: 0
References: 50
Year: 2026
Venue: N/A
