UPenn · Mar 17, 2026 · arXiv:2603.16737

Retrieving Counterfactuals Improves Visual In-Context Learning

Guangzhi Xiong, Sanchit Sinha, Zhenghao He, Aidong Zhang

AI Summary

The paper introduces CIRCLES, a framework for improving visual in-context learning (ICL) in VLMs by retrieving counterfactual-style examples. CIRCLES uses attribute-guided composed image retrieval to construct demonstration sets that enable VLMs to reason about causal relationships rather than relying on spurious correlations. Experiments on four datasets show that CIRCLES outperforms existing similarity-based retrieval methods, especially for smaller models and in low-data regimes, by retrieving more diverse and causally informative examples.

Key Contribution

Counterfactual-style examples improve visual in-context learning, with the largest gains on small-scale vision-language models, by steering models toward causal relationships rather than superficial correlations.

Abstract

Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning. Our code is available at https://github.com/gzxiong/CIRCLES.
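The abstract describes the selection idea only at a high level, so the following is a minimal sketch of how a demonstration set might mix similarity-based neighbors with counterfactual-style examples, not the authors' implementation. Everything here is an assumption made for illustration: the function names (`compose`, `build_demonstrations`), the additive fusion of an image embedding with an attribute-edit vector (a simple stand-in for a real composed image retrieval model), and the toy embedding format.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def top_k(query, pool, k, exclude=()):
    """Return ids of the k pool items most similar to the query embedding."""
    scored = [
        (cosine(query, item["emb"]), item["id"])
        for item in pool
        if item["id"] not in exclude
    ]
    scored.sort(key=lambda pair: -pair[0])
    return [item_id for _, item_id in scored[:k]]


def compose(image_emb, edit_emb, weight=1.0):
    """Hypothetical composed query: shift the image embedding along an
    attribute-edit direction. A real system would use a composed image
    retrieval model here; plain addition is an illustrative assumption."""
    return normalize([x + weight * e for x, e in zip(image_emb, edit_emb)])


def build_demonstrations(query_emb, attribute_edits, pool, k_sim=2, k_cf=1):
    """Combine similarity-based neighbors with counterfactual-style examples.

    attribute_edits: one edit vector per attribute to vary (assumed input).
    pool: list of {"id": ..., "emb": [...]} candidate demonstrations.
    """
    chosen = top_k(query_emb, pool, k_sim)  # standard similarity retrieval
    for edit in attribute_edits:
        composed = compose(query_emb, edit)
        # Retrieve examples that match the query with one attribute changed,
        # skipping anything already selected.
        chosen.extend(top_k(composed, pool, k_cf, exclude=set(chosen)))
    return chosen
```

In this sketch, each attribute edit contributes examples that resemble the query except along one attribute, so the resulting demonstration set contrasts near-duplicates with targeted variations. How CIRCLES chooses attributes, composes queries, and orders demonstrations is specified in the paper and the linked repository, not here.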

Computer Vision · Multimodal Models · Reasoning & Chain-of-Thought · Recommendation & Information Retrieval

Citation Metrics

Citations: 0
Influential citations: 0
References: 47
Year: 2026
Venue: N/A
