The paper introduces Cross-Modal Alignment with Visual Reasoning Prompting (CMA-VRP) to improve multimodal fake news detection by addressing cross-modal alignment challenges and robustness to noisy data. CMA-VRP constructs entity graphs for text and images, using graph contrastive learning to enhance cross-modal consistency and LLMs/LVLMs to extract deep visual-semantic attributes. The model then performs graph-based cross-modal semantic fusion and cycle alignment to obtain semantically consistent and modality-invariant features, leading to superior performance and robustness.
LLMs and LVLMs can be prompted to extract deep visual-semantic reasoning attributes for multimodal fake news detection, significantly improving performance and robustness.
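The paper does not publish its prompts, so as a hedged illustration only, a prompt template of the kind that might elicit reasoning-level visual attributes (e.g., actions, scenes) from an LVLM could look like the sketch below; the attribute axes and function name are hypothetical, not taken from the paper.

```python
# Hypothetical prompt template for eliciting reasoning-level visual
# attributes from an LVLM. The attribute axes below (action, scene,
# emotion/intent) are illustrative assumptions, not the paper's prompts.
ATTRIBUTE_AXES = ["action", "scene", "emotion/intent"]

def build_visual_reasoning_prompt(caption: str) -> str:
    """Compose a single prompt string pairing the news text with a
    request for reasoning-level attributes and a consistency judgment."""
    axes = ", ".join(ATTRIBUTE_AXES)
    return (
        "You are analysing a news image for fake-news detection.\n"
        f'Accompanying text: "{caption}"\n'
        f"Describe the image along these axes: {axes}.\n"
        "Then state whether the visual content is consistent with the text."
    )

prompt = build_visual_reasoning_prompt("Floodwaters submerge downtown streets")
print(prompt)
```

The LVLM's free-text answer would then be encoded (e.g., by a text encoder) into the visual reasoning features used downstream.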
The rise of multimodal fake news threatens reliable information dissemination by exploiting multiple modalities to create deceptive, engaging content, posing significant risks to public safety. Existing methods still face challenges in cross-modal alignment (e.g., semantic inconsistencies, complex visual-semantic relations) and are vulnerable to low-quality or noisy samples. To address these challenges, we propose Cross-Modal Alignment with Visual Reasoning Prompting (CMA-VRP) for multimodal fake news detection. Specifically, we model text and image entities with graphs to capture fine-grained semantic interactions and enhance cross-modal consistency through graph contrastive learning. Unlike methods that rely on shallow image features (e.g., edges, textures), we leverage large language models (LLMs) and large vision-language models (LVLMs) to capture deep visual-semantic attributes related to reasoning (e.g., actions, scenes). Building on the graph modeling and visual reasoning features, we perform graph-based cross-modal semantic fusion to unify textual and visual representations, and cross-modal cycle alignment to align modality distributions by reducing semantic discrepancies, filtering modality-specific noise, and extracting invariant representations across domains. Together, these steps enable the model to obtain semantically consistent and modality-invariant features. Extensive experiments demonstrate that our model outperforms existing methods in multimodal fake news detection and shows strong robustness against noisy samples.
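The abstract names two alignment objectives: contrastive alignment of text-graph and image-graph representations, and a cycle alignment that maps features across modalities and back. A minimal sketch of both, assuming pooled graph embeddings and a standard symmetric InfoNCE contrastive loss plus a reconstruction-style cycle loss (the exact losses and mapping functions in the paper may differ):

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Normalize rows to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE between pooled text-graph and image-graph
    embeddings. Matched (text, image) pairs share a row index; all
    other rows in the batch act as negatives."""
    t = l2_normalize(text_emb)
    v = l2_normalize(image_emb)
    logits = t @ v.T / temperature            # (B, B) similarity matrix
    labels = np.arange(len(logits))

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)        # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the text->image and image->text directions.
    return 0.5 * (xent(logits) + xent(logits.T))

def cycle_alignment_loss(text_emb, W_tv, W_vt):
    """Cycle loss sketch: project text features into the image space and
    back, then penalize reconstruction error so the mappings preserve
    modality-invariant content. W_tv / W_vt are learnable in practice."""
    reconstructed = (text_emb @ W_tv) @ W_vt
    return np.mean((reconstructed - text_emb) ** 2)

rng = np.random.default_rng(0)
B, D = 4, 16
text = rng.normal(size=(B, D))
image = text + 0.1 * rng.normal(size=(B, D))   # loosely aligned pairs
contrastive = info_nce(text, image)
W = 0.1 * rng.normal(size=(D, D))
cycle = cycle_alignment_loss(text, W, W.T)
print(float(contrastive), float(cycle))
```

In the paper's pipeline these objectives would be trained jointly with the detection loss; aligned pairs drive the contrastive term down while the cycle term discourages mappings that discard shared semantic content.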