The paper introduces Appear2Meaning, a new benchmark for evaluating Vision-Language Models (VLMs) on their ability to infer structured cultural metadata from images across diverse cultural contexts. The authors use an LLM-as-Judge framework to assess the semantic alignment of VLM predictions with reference annotations, reporting exact-match, partial-match, and attribute-level accuracy across cultural regions. Results reveal that current VLMs struggle to produce consistent, well-grounded predictions, with performance varying substantially across cultures and metadata types.
VLMs still struggle to consistently extract structured cultural metadata from images, revealing a critical gap in their ability to reason beyond visual perception across diverse cultural contexts.
Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
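The three reported metrics can be illustrated with a minimal sketch. It assumes an external LLM judge has already produced one boolean verdict per attribute per image. The abstract does not give the metric definitions, so the ones used here are assumptions: exact match means every attribute was judged correct, partial match means some but not all were, and attribute-level accuracy is the fraction judged correct per attribute. The attribute names come from the abstract's examples; the function name `summarize`, the record layout, and the region labels are hypothetical.

```python
from collections import defaultdict

# Attribute names taken from the abstract's examples (creator, origin, period).
ATTRIBUTES = ("creator", "origin", "period")


def summarize(records):
    """Aggregate per-attribute judge verdicts into per-region metrics.

    Each record is a dict {"region": str, "verdicts": {attribute: bool}}.
    The verdicts are assumed to come from an external LLM judge; this
    function only aggregates them and is not the paper's implementation.
    """
    by_region = defaultdict(list)
    for rec in records:
        by_region[rec["region"]].append(rec["verdicts"])

    summary = {}
    for region, verdict_list in by_region.items():
        n = len(verdict_list)
        # Assumed definition: exact match = all attributes judged correct.
        exact = sum(all(v[a] for a in ATTRIBUTES) for v in verdict_list)
        # Assumed definition: partial match = some, but not all, correct.
        partial = sum(
            any(v[a] for a in ATTRIBUTES) and not all(v[a] for a in ATTRIBUTES)
            for v in verdict_list
        )
        # Attribute-level accuracy: fraction judged correct, per attribute.
        per_attr = {a: sum(v[a] for v in verdict_list) / n for a in ATTRIBUTES}
        summary[region] = {
            "exact_match": exact / n,
            "partial_match": partial / n,
            "attribute_accuracy": per_attr,
        }
    return summary


# Made-up verdicts for illustration only; not data from the paper.
example_records = [
    {"region": "A", "verdicts": {"creator": True, "origin": True, "period": True}},
    {"region": "A", "verdicts": {"creator": False, "origin": True, "period": False}},
    {"region": "B", "verdicts": {"creator": False, "origin": False, "period": False}},
]
example_summary = summarize(example_records)
```

Grouping by region before aggregating keeps the cross-cultural comparison explicit: a model can score well on one attribute in one region while failing the same attribute elsewhere, which is the kind of variation the abstract describes.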