The paper introduces ViCA2, a novel Multimodal Large Language Model (MLLM) designed to improve visuospatial reasoning by integrating SigLIP for semantic understanding and Hiera for spatial structure within a dual vision encoder architecture. To facilitate targeted instruction tuning, the authors created ViCA-322K, a large-scale dataset containing over 322,000 spatially grounded question-answer pairs. Experimental results on VSI-Bench demonstrate that ViCA2-7B achieves state-of-the-art performance, outperforming larger open-source and proprietary models, thus validating the effectiveness of the proposed architecture and training data.
A 7B model can beat much larger models at visuospatial reasoning by using a specialized architecture and training dataset.
While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, visuospatial cognition (reasoning about spatial layouts, relations, and dynamics) remains a significant challenge. Existing models often lack the necessary architectural components and specialized training data for fine-grained spatial understanding. We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, our ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the effectiveness of our approach in achieving strong visuospatial intelligence with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset to facilitate further research.
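The abstract does not say how the token ratio control mechanism works, so the following is only a hypothetical sketch of the general idea: split a fixed visual-token budget between a semantic stream and a spatial stream, then pool each stream down to its share. The names `average_pool`, `fuse_tokens`, `budget`, and `spatial_ratio` are illustrative assumptions, not ViCA2's actual API, and average pooling stands in for whatever reduction the authors use. Both streams are assumed to be already projected to the same embedding dimension.

```python
from typing import List

Vector = List[float]


def average_pool(tokens: List[Vector], target: int) -> List[Vector]:
    """Reduce a token sequence to at most `target` tokens by averaging contiguous groups."""
    if target <= 0:
        return []
    n = len(tokens)
    if target >= n:
        # Nothing to reduce: return a copy of the sequence unchanged.
        return [list(t) for t in tokens]
    pooled = []
    for i in range(target):
        # Integer boundaries give groups whose sizes differ by at most one.
        start = i * n // target
        end = (i + 1) * n // target
        group = tokens[start:end]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def fuse_tokens(
    semantic: List[Vector],
    spatial: List[Vector],
    budget: int,
    spatial_ratio: float,
) -> List[Vector]:
    """Hypothetical ratio control: give `spatial_ratio` of the budget to spatial tokens."""
    if not 0.0 <= spatial_ratio <= 1.0:
        raise ValueError("spatial_ratio must be between 0 and 1")
    n_spatial = round(budget * spatial_ratio)
    n_semantic = budget - n_spatial
    # Semantic tokens first, spatial tokens second; the ordering is an assumption.
    return average_pool(semantic, n_semantic) + average_pool(spatial, n_spatial)


# Toy inputs: 8 "semantic" tokens and 4 "spatial" tokens, each of dimension 2.
semantic = [[float(i), 0.0] for i in range(8)]
spatial = [[0.0, float(i)] for i in range(4)]
fused = fuse_tokens(semantic, spatial, budget=6, spatial_ratio=0.5)
```

In this sketch, raising `spatial_ratio` trades semantic detail for spatial detail while the sequence length passed to the language model stays at or below `budget`. If a stream has fewer tokens than its share, the result is simply shorter than the budget.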