This paper addresses the limitations of existing multimodal chain-of-thought (CoT) methods, which often hallucinate because they over-rely on language priors and learn too little about visual differences. The authors introduce visual self-contrastive distillation (VSCD), a method that freezes both the language and vision encoders to equalize the status of the two modalities and employs contrastive decoding so the model learns image differences during distillation. Experiments on the ScienceQA dataset demonstrate that VSCD outperforms existing methods in multimodal CoT reasoning.
Freezing both language and vision encoders, combined with contrastive decoding, unlocks more accurate multimodal chain-of-thought reasoning by forcing models to truly "see" the difference between images.
Chain-of-thought (CoT) reasoning research has predominantly focused on the language modality, neglecting the intricate interaction of multiple modalities crucial for real-world reasoning, such as visual question answering. Current methods primarily concentrate on modal conversion and feature fusion to enable language models to utilize CoT in a multimodal environment. However, these methods are inherently prone to hallucination: they rely excessively on prior knowledge from the language modality alone and lack guidance in learning visual differences, often producing content incongruent with the images. This study introduces a novel approach, visual self-contrastive distillation (VSCD). The proposed method equalizes the status of the language and vision modalities by freezing both encoders, enabling more balanced learning. Furthermore, we use contrastive decoding to enable the model to learn image differences during distillation, enhancing its understanding of visual nuances. Comprehensive experiments on the ScienceQA dataset demonstrate the superiority of the proposed VSCD method across various categories of multimodal CoT. Code and data are released at https://github.com/zgMin/VSCD.
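The abstract does not spell out the contrastive-decoding step. As a rough illustration only, the generic visual contrastive-decoding idea compares the model's logits conditioned on the real image against logits conditioned on a degraded image, amplifying tokens the actual image supports. The function below is a minimal sketch of that assumed formulation (the `alpha` weight and the use of a distorted image as the contrast signal are assumptions, not details from the paper):

```python
import torch

def contrastive_logits(logits_clean: torch.Tensor,
                       logits_distorted: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Sketch of visual contrastive decoding (assumed form, not VSCD's exact method).

    logits_clean:     next-token logits given the original image
    logits_distorted: next-token logits given a distorted/blank image,
                      approximating what the language prior alone predicts
    alpha:            contrast strength; alpha = 0 recovers standard decoding
    """
    # Boost tokens whose evidence comes from the image itself and
    # suppress tokens driven mainly by the language prior.
    return (1 + alpha) * logits_clean - alpha * logits_distorted

# Toy example: token 0 is supported by the clean image, token 1 is not.
clean = torch.tensor([2.0, 1.0])
distorted = torch.tensor([1.0, 1.0])
adjusted = contrastive_logits(clean, distorted, alpha=1.0)  # tensor([3., 1.])
```

In this toy case the logit gap between the image-supported token and the prior-driven token widens from 1.0 to 2.0, which is the intended effect of the contrast.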