The paper introduces Visual Contrastive Self-Taught Reasoner (VC-STaR), a self-improving framework for VLMs that leverages visual contrast within contrastive VQA pairs to reduce hallucinations in generated reasoning rationales. The authors collect diverse VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales with VC-STaR, which yields a new visual reasoning dataset, VisCoR-55K. Finetuning VLMs on VisCoR-55K improves their visual reasoning capabilities, outperforming both existing self-improving methods and models finetuned on state-of-the-art visual reasoning datasets.
VLMs identify relevant visual cues more precisely when shown two visually similar images with synonymous questions, enabling a new self-training approach that outperforms models finetuned on state-of-the-art visual reasoning datasets.
Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on state-of-the-art visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC-STaR.
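As a rough illustration of what curating contrastive pairs "according to multi-modal similarity" could look like, the sketch below pairs each VQA sample with the other sample whose image and question are both most similar to it. It is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not specify the encoders, the similarity measure, or any thresholds, so the precomputed embeddings, the use of cosine similarity, the function name `curate_contrastive_pairs`, the field names, and the two threshold values are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def curate_contrastive_pairs(
    samples: List[Dict],
    img_thresh: float = 0.8,  # assumed value, not taken from the paper
    txt_thresh: float = 0.8,  # assumed value, not taken from the paper
) -> List[Tuple[str, str]]:
    """Pair each sample with its most similar partner in both modalities.

    Each sample is a dict with hypothetical keys "id", "image_emb", and
    "question_emb", where the embeddings are assumed to be precomputed.
    A candidate qualifies only if the images are similar AND the questions
    are near-synonymous, mirroring the abstract's definition of a
    contrastive VQA pair.
    """
    pairs = []
    for i, anchor in enumerate(samples):
        best_j, best_score = None, -1.0
        for j, cand in enumerate(samples):
            if i == j:
                continue
            img_sim = cosine(anchor["image_emb"], cand["image_emb"])
            txt_sim = cosine(anchor["question_emb"], cand["question_emb"])
            if img_sim < img_thresh or txt_sim < txt_thresh:
                continue
            # Rank qualifying candidates by the sum of both similarities
            # (an assumed scoring rule chosen for simplicity).
            score = img_sim + txt_sim
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            pairs.append((anchor["id"], samples[best_j]["id"]))
    return pairs
```

In this sketch a sample with no sufficiently similar partner is simply left unpaired. The nested loop is quadratic in the number of samples, so a collection of realistic size would likely need an approximate nearest-neighbour index instead.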