Cornell | SJTU | Jun 9, 2026 | arXiv:2606.10401

CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

Yiming Zhang, Ruoxuan Cao, Zhihang Zhong

AI Summary

This paper introduces CoCoSI, a lightweight, model-agnostic multi-agent framework that enhances spatial intelligence in multimodal large language models (MLLMs) by collaboratively constructing cognitive maps from multi-frame visual inputs. Using local-global agent coordination, atomic commits, and cross-agent verification, the framework preserves spatial information beyond the native context window without architectural changes or additional training. The reported experiments show superior performance on spatial understanding tasks while the method remains fully training-free.

Key Contribution

Spatial intelligence in off-the-shelf pretrained MLLMs can be enhanced without architectural modification or retraining by having multiple agents collaboratively construct cognitive maps as structured spatial memory.

Abstract

Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs to maintain coherent spatial representations over time. However, limited context lengths still challenge spatial understanding, while existing methods, such as long-context modeling and external memory, often require architectural changes, memory modules, or finetuning, limiting their applicability to off-the-shelf pretrained MLLMs. This motivates a lightweight, model-agnostic method for preserving spatial information beyond the native context window. To this end, we propose a plug-and-play multi-agent framework that collaboratively constructs cognitive maps as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. Our framework features local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification. Extensive experiments demonstrate that our method achieves superior performance on spatial understanding tasks while remaining fully training-free. Code will be released.
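The abstract names "cognitive map construction with atomic commits" and "cross-agent verification" without giving details, so the following is only a minimal sketch of what those ideas could look like, not the authors' implementation. Everything in it is an assumption made for illustration: the `Placement` and `CognitiveMap` names, the one-label-per-cell grid, the conflict rule, and the `verify` callback that stands in for a second agent checking a proposal.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Placement:
    """One proposed map entry: an object label at a grid cell (hypothetical schema)."""
    label: str
    row: int
    col: int


@dataclass
class CognitiveMap:
    """Grid-based map shared by all agents; each cell holds at most one label here."""
    size: int = 10
    cells: dict = field(default_factory=dict)  # (row, col) -> label

    def commit(self, batch, verify=lambda placement: True):
        """Apply every placement in `batch`, or none of them.

        The batch is staged on a copy; the shared map is replaced only if all
        placements are in bounds, conflict-free, and accepted by `verify`
        (a stand-in for a check performed by another agent).
        """
        staged = dict(self.cells)
        for p in batch:
            in_bounds = 0 <= p.row < self.size and 0 <= p.col < self.size
            occupant = staged.get((p.row, p.col))
            conflict = occupant is not None and occupant != p.label
            if not in_bounds or conflict or not verify(p):
                return False  # reject the whole batch; the shared map is unchanged
            staged[(p.row, p.col)] = p.label
        self.cells = staged
        return True


shared_map = CognitiveMap(size=10)

# Each batch plays the role of one local agent's summary of a chunk of frames.
chunk_a = [Placement("sofa", 2, 3), Placement("table", 4, 5)]
chunk_b = [Placement("lamp", 1, 1), Placement("chair", 2, 3)]  # clashes with "sofa"

accepted_a = shared_map.commit(chunk_a)
accepted_b = shared_map.commit(chunk_b)
```

In this sketch the second batch is rejected as a unit, so its non-conflicting "lamp" entry never reaches the shared map. That all-or-nothing behavior is the property the term "atomic commit" usually denotes; how CoCoSI resolves conflicts or retries rejected batches is not stated in the abstract.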

Multimodal Models | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
