SJTU, UIUC · Feb 26, 2026 · arXiv:2602.23029

WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

Tianyu Wang, Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang

AI Summary

The paper introduces WISER, a training-free framework for Zero-Shot Composed Image Retrieval (ZS-CIR) that unifies Text-to-Image (T2I) and Image-to-Image (I2I) retrieval paradigms via a "retrieve-verify-refine" pipeline. WISER addresses the limitations of individual T2I and I2I approaches by performing wider search using both edited captions and images, adaptively fusing results based on a verification step, and refining uncertain retrievals through structured self-reflection. Experiments demonstrate WISER significantly outperforms existing training-free and even some training-dependent methods on CIRCO and CIRR benchmarks, achieving substantial improvements in mAP@5 and Recall@1, respectively.

Key Contribution

WISER's training-free "retrieve-verify-refine" pipeline adaptively fuses text-to-image and image-to-image retrieval, and reports state-of-the-art results among training-free zero-shot composed image retrieval methods.

Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.
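
The control flow the abstract describes (parallel T2I and I2I retrieval, a verifier that scores confidence, fusion when the retrieval is reliable, and self-reflection otherwise) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the WISER codebase: the function names, the cosine-similarity retrieval over toy vectors, the confidence-weighted score fusion, the `threshold` rule, and the `max_rounds` cap are all hypothetical stand-ins for whatever the paper actually uses.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Ranked = List[Tuple[str, float]]  # (image id, score), best first


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: Vector, gallery: Dict[str, Vector], k: int) -> Ranked:
    """Rank gallery items by similarity to the query and keep the best k."""
    scored = [(name, cosine(query, vec)) for name, vec in gallery.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]


def fuse(t2i: Ranked, i2i: Ranked, w_t2i: float) -> Ranked:
    """Assumed fusion rule: weighted sum of the two paths' scores."""
    scores: Dict[str, float] = {}
    for name, s in t2i:
        scores[name] = scores.get(name, 0.0) + w_t2i * s
    for name, s in i2i:
        scores[name] = scores.get(name, 0.0) + (1.0 - w_t2i) * s
    return sorted(scores.items(), key=lambda p: p[1], reverse=True)


def retrieve_verify_refine(
    make_text_query: Callable[[str], Vector],   # hypothetical: embeds the edited caption
    make_image_query: Callable[[str], Vector],  # hypothetical: embeds the edited image
    verify: Callable[[Ranked], float],          # hypothetical verifier, confidence in [0, 1]
    reflect: Callable[[str, Ranked, Ranked], str],  # hypothetical self-reflection step
    gallery: Dict[str, Vector],
    k: int = 3,
    threshold: float = 0.5,  # assumed cutoff between "reliable" and "uncertain"
    max_rounds: int = 3,     # assumed cap on refinement rounds
) -> Ranked:
    hint = ""  # refinement suggestion carried into the next round
    fused: Ranked = []
    for _ in range(max_rounds):
        # Wider search: query the gallery along both paths.
        t2i = top_k(make_text_query(hint), gallery, k)
        i2i = top_k(make_image_query(hint), gallery, k)
        # Verify: score each path's retrieval confidence.
        conf_t, conf_i = verify(t2i), verify(i2i)
        total = conf_t + conf_i
        weight = conf_t / total if total > 0 else 0.5
        fused = fuse(t2i, i2i, weight)
        # Reliable retrieval: return the fused ranking.
        if max(conf_t, conf_i) >= threshold:
            return fused
        # Uncertain retrieval: reflect and try again with a refined query.
        hint = reflect(hint, t2i, i2i)
    return fused  # fall back to the last fused ranking
```

In this sketch a path with higher verifier confidence contributes more to the fused score, and a refinement round is triggered only when neither path clears the threshold; both choices are one plausible reading of "adaptive fusion", not a statement of the paper's exact rule.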

Computer Vision · Multimodal Models · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 56

Year: 2026

Venue: N/A
