The paper introduces MV-CoRe, a model for Complex VQA that addresses the limitations of existing LVLMs by deeply fusing global embeddings from VLMs and LLMs with fine-grained, semantic-aware visual features such as object-detection outputs and scene-graph representations. A Multimodal Fusion Transformer integrates these feature sets, enabling rich cross-modal attention and complex reasoning. Experiments on GQA, A-OKVQA, and OKVQA show that MV-CoRe outperforms established LVLM baselines, reaching 77.5% accuracy on GQA, and ablation studies confirm the importance of both the object and scene-graph features.
By deeply fusing fine-grained visual semantics with global embeddings from VLMs and LLMs, MV-CoRe achieves state-of-the-art results on Complex VQA tasks that demand sophisticated reasoning.
Complex Visual Question Answering (Complex VQA) tasks, which demand sophisticated multimodal reasoning and external-knowledge integration, pose significant challenges for existing large vision-language models (LVLMs), which are often limited by their reliance on high-level global features. To address this, we propose MV-CoRe (Multimodal Visual-Conceptual Reasoning), a novel model designed to improve Complex VQA performance through the deep fusion of diverse visual and linguistic information. MV-CoRe integrates global embeddings from pre-trained Vision Large Models (VLMs) and Large Language Models (LLMs) with fine-grained, semantic-aware visual features, including object-detection characteristics and scene-graph representations. A Multimodal Fusion Transformer then processes and deeply integrates these feature sets, enabling rich cross-modal attention and facilitating complex reasoning. We train on VQAv2 and evaluate MV-CoRe on challenging Complex VQA benchmarks, including GQA, A-OKVQA, and OKVQA. Our results show that MV-CoRe consistently outperforms established LVLM baselines, achieving an overall accuracy of 77.5% on GQA. Ablation studies confirm the critical contribution of both object and scene-graph features, and human evaluations further validate MV-CoRe's superior factual correctness and reasoning depth, underscoring its robust capabilities for deep visual and conceptual understanding.
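The fusion idea described above can be sketched in miniature: project each feature stream (global VLM embedding, LLM question tokens, object features, scene-graph features) into a shared width, concatenate them into one joint token sequence, and run attention over the whole sequence so every modality can attend to every other. The snippet below is a hypothetical pure-Python illustration of that pattern, not the paper's implementation; all dimensions, the single-head attention, and the random projections are illustrative assumptions (the actual Multimodal Fusion Transformer would use learned multi-layer, multi-head attention over pretrained features).

```python
import math
import random

random.seed(0)

def linear(x, W):
    """Project a list of token vectors (each of length d_in) by a d_in x d_out matrix W."""
    return [[sum(t[i] * W[i][j] for i in range(len(t)))
             for j in range(len(W[0]))] for t in x]

def attention(tokens):
    """Single-head self-attention over the joint token sequence (Q = K = V = tokens)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)                       # stabilize the softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [v / z for v in w]
        out.append([sum(w[t] * tokens[t][j] for t in range(len(tokens)))
                    for j in range(d)])
    return out

# Toy feature streams; sequence lengths and dims are illustrative, not from the paper.
d_model = 4
vlm = [[random.gauss(0, 1) for _ in range(6)]]                      # 1 global VLM token
llm = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]    # 3 question tokens
obj = [[random.gauss(0, 1) for _ in range(5)] for _ in range(2)]    # 2 object features
sg  = [[random.gauss(0, 1) for _ in range(5)] for _ in range(2)]    # 2 scene-graph features

# One random projection per modality into the shared width, then a fusion pass.
proj = lambda d_in: [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_in)]
tokens = (linear(vlm, proj(6)) + linear(llm, proj(8))
          + linear(obj, proj(5)) + linear(sg, proj(5)))
fused = attention(tokens)                     # cross-modal attention over all 8 tokens
print(len(fused), len(fused[0]))              # → 8 4
```

In a full model, the fused token sequence would be pooled and passed to an answer classifier; the key point the sketch captures is that concatenating all modalities before attention lets, say, a scene-graph token directly attend to a question token.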