PKU · May 18, 2025 · arXiv:2505.12363

Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts

AI Summary

The paper introduces ViCA2, a novel Multimodal Large Language Model (MLLM) designed to improve visuospatial reasoning by integrating SigLIP for semantic understanding and Hiera for spatial structure within a dual vision encoder architecture. To facilitate targeted instruction tuning, the authors created ViCA-322K, a large-scale dataset containing over 322,000 spatially grounded question-answer pairs. Experimental results on VSI-Bench demonstrate that ViCA2-7B achieves state-of-the-art performance, outperforming larger open-source and proprietary models, thus validating the effectiveness of the proposed architecture and training data.

Key Contribution

A 7B model can beat much larger models at visuospatial reasoning by using a specialized architecture and training dataset.

Abstract

While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, visuospatial cognition (reasoning about spatial layouts, relations, and dynamics) remains a significant challenge. Existing models often lack the necessary architectural components and specialized training data for fine-grained spatial understanding. We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, our ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the effectiveness of our approach in achieving strong visuospatial intelligence with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset to facilitate further research.
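The abstract does not spell out how the token ratio control mechanism works, so the following is only an illustrative sketch of the general idea: two token streams (one standing in for a semantic encoder such as SigLIP, one for a spatial encoder such as Hiera) are combined under a fixed token budget, with a ratio deciding how much of that budget each stream receives. The names `subsample`, `fuse_tokens`, `budget`, and `spatial_ratio` are hypothetical, uniform striding is an assumed selection rule, and both streams are assumed to be already projected to the same embedding width. This is not the authors' implementation.

```python
from typing import List, Sequence

Token = List[float]  # one embedding vector; real models would use tensors


def subsample(tokens: Sequence[Token], keep: int) -> List[Token]:
    """Keep `keep` tokens at evenly spaced positions (uniform striding)."""
    if keep <= 0 or not tokens:
        return []
    keep = min(keep, len(tokens))
    step = len(tokens) / keep
    return [list(tokens[int(i * step)]) for i in range(keep)]


def fuse_tokens(
    semantic: Sequence[Token],
    spatial: Sequence[Token],
    budget: int,
    spatial_ratio: float,
) -> List[Token]:
    """Concatenate two token streams under a shared token budget.

    `spatial_ratio` is the (hypothetical) fraction of the budget given to
    the spatial stream; the remainder goes to the semantic stream.
    """
    if not 0.0 <= spatial_ratio <= 1.0:
        raise ValueError("spatial_ratio must lie in [0, 1]")
    if budget < 0:
        raise ValueError("budget must be non-negative")
    n_spatial = min(round(budget * spatial_ratio), len(spatial))
    n_semantic = min(budget - n_spatial, len(semantic))
    # Semantic tokens first, spatial tokens second (an arbitrary ordering choice).
    return subsample(semantic, n_semantic) + subsample(spatial, n_spatial)


if __name__ == "__main__":
    # Toy one-dimensional "embeddings" so the selection is easy to follow.
    semantic = [[float(i)] for i in range(8)]
    spatial = [[100.0 + i] for i in range(4)]
    fused = fuse_tokens(semantic, spatial, budget=6, spatial_ratio=0.5)
    print(len(fused), fused)
```

Changing `spatial_ratio` trades semantic tokens against spatial ones while the total sequence length passed to the language model stays bounded by `budget`, which is the efficiency argument a ratio control of this kind rests on.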

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Multimodal Models · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 7

Influential citations: 1

References: 47

Year: 2025

Venue: arXiv.org
