D ChartQA DocVQA InfoVQA RealWorldQAMar 4, 2026arXiv:2603.03815

Structure-aware Prompt Adaptation from Seen to Unseen for Open-Vocabulary Compositional Zero-Shot Learning

Yihang Duan, Jiong Wang, Pengpeng Zeng, Ji Zhang, Lei Zhao, Chong Wang, Jingkuan Song, Lianli Gao

AI Summary

This paper tackles Open-Vocabulary Compositional Zero-Shot Learning (OV-CZSL) by leveraging the structural consistency of semantically related attributes and objects in the embedding space. They introduce Structure-aware Prompt Adaptation (SPA), comprising a Structure-aware Consistency Loss (SCL) during training to enforce local structure consistency for seen concepts, and a Structure-guided Adaptation Strategy (SAS) at inference to align unseen concepts with similar seen ones. Experiments on OV-CZSL benchmarks show SPA improves open-vocabulary generalization while maintaining closed-set performance, and is compatible with existing CZSL prompt tuning methods.

Key Contribution

Generalizing to unseen compositions? This plug-and-play method leverages structure in the embedding space to adapt prompts, significantly boosting open-vocabulary zero-shot learning.

Abstract

The goal of Open-Vocabulary Compositional Zero-Shot Learning (OV-CZSL) is to recognize attribute-object compositions in the open-vocabulary setting, where compositions of both seen and unseen attributes and objects are evaluated. Recently, prompt tuning methods have demonstrated strong generalization capabilities in the closed setting, where only compositions of seen attributes and objects are evaluated, i.e., Compositional Zero-Shot Learning (CZSL). However, directly applying these methods to OV-CZSL may not be sufficient to generalize to unseen attributes, objects and their compositions, as it is limited to seen attributes and objects. Normally, when faced with unseen concepts, humans adopt analogies with seen concepts that have the similar semantics thereby inferring their meaning (e.g., "wet" and "damp", "shirt" and "jacket"). In this paper, we experimentally show that the distribution of semantically related attributes or objects tends to form consistent local structures in the embedding space. Based on the above structures, we propose Structure-aware Prompt Adaptation (SPA) method, which enables models to generalize from seen to unseen attributes and objects. Specifically, in the training stage, we design a Structure-aware Consistency Loss (SCL) that encourages the local structure's consistency of seen attributes and objects in each iteration. In the inference stage, we devise a Structure-guided Adaptation Strategy (SAS) that adaptively aligns the structures of unseen attributes and objects with those of trained seen attributes and objects with similar semantics. Notably, SPA is a plug-and-play method that can be seamlessly integrated into existing CZSL prompt tuning methods. Extensive experiments on OV-CZSL benchmarks demonstrate that SPA achieves competitive closed-set performance while significantly improving open-vocabulary results.

Computer Vision Multimodal Models Natural Language Processing

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Structure-aware Prompt Adaptation from Seen to Unseen for Open-Vocabulary Compositional Zero-Shot Learning

Related Papers