This paper addresses the limitations of existing multimodal chain-of-thought (CoT) methods, which often hallucinate because they over-rely on language priors and learn too little about visual differences. The authors introduce visual self-contrastive distillation (VSCD), a method that freezes both the language and vision encoders to equalize the status of the two modalities and employs contrastive decoding so the model learns image differences during distillation. Experiments on the ScienceQA dataset demonstrate that VSCD outperforms existing methods in multimodal CoT reasoning.
Freezing both language and vision encoders, combined with contrastive decoding, unlocks more accurate multimodal chain-of-thought reasoning by forcing models to truly "see" the difference between images.
Chain-of-thought (CoT) reasoning research has predominantly focused on the language modality, neglecting the intricate interaction of multiple modalities crucial for real-world reasoning, such as visual question answering. Current methods primarily concentrate on modal conversion and feature fusion to enable language models to utilize CoT in a multimodal environment. However, these methods are inherently prone to hallucination: they rely excessively on prior knowledge from the language modality alone and lack guidance in learning visual differences, often producing content incongruent with the images. This study introduces a novel approach, visual self-contrastive distillation (VSCD). The proposed method equalizes the status of the language and vision modalities by freezing both encoders, enabling more balanced learning. Furthermore, we use contrastive decoding to enable the model to learn image differences during distillation, enhancing its understanding of visual nuances. Comprehensive experiments on the ScienceQA dataset demonstrate the superiority of the proposed VSCD method across various categories of multimodal CoT. Code and data are released at https://github.com/zgMin/VSCD.
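The abstract does not spell out the contrastive-decoding step. As a rough illustration only, the generic visual contrastive-decoding idea compares the model's logits conditioned on the real image against logits conditioned on a degraded image, amplifying tokens the actual image supports. The function below is a minimal sketch of that assumed formulation (the `alpha` weight and the use of a distorted image as the contrast signal are assumptions, not details from the paper):

```python
import torch

def contrastive_logits(logits_clean: torch.Tensor,
                       logits_distorted: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Sketch of visual contrastive decoding (assumed form, not VSCD's exact method).

    logits_clean:     next-token logits given the original image
    logits_distorted: next-token logits given a distorted/blank image,
                      approximating what the language prior alone predicts
    alpha:            contrast strength; alpha = 0 recovers standard decoding
    """
    # Boost tokens whose evidence comes from the image itself and
    # suppress tokens driven mainly by the language prior.
    return (1 + alpha) * logits_clean - alpha * logits_distorted

# Toy example: token 0 is supported by the clean image, token 1 is not.
clean = torch.tensor([2.0, 1.0])
distorted = torch.tensor([1.0, 1.0])
adjusted = contrastive_logits(clean, distorted, alpha=1.0)  # tensor([3., 1.])
```

In this toy case the logit gap between the image-supported token and the prior-driven token widens from 1.0 to 2.0, which is the intended effect of the contrast.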