BUPT, CAS, CUHK, Zhongguancun Academy | Jun 4, 2026 | arXiv:2606.05677

LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

Shiqiang Lang, Jing Liu, Haoyang He, Yuanteng Chen, Tao Liu, Lan Yang, Longteng Guo, Honggang Zhang

AI Summary

This paper introduces LongSpace, a memory framework for long-horizon spatial reasoning in Multimodal Large Language Models (MLLMs). It models long videos as sequential chunks, integrates 3D structural cues into early decoder layers, and builds layer-aware memory for question-guided retrieval. The authors also introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory that covers scene perception, spatial relations, and spatial memory. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, supporting explicit spatial memory as a key capability for tasks such as autonomous driving and robotic navigation.

Key Contribution

LongSpace shows that adding explicit spatial memory to MLLMs improves long-video spatial understanding, a capability relevant to long-horizon applications such as autonomous driving and robotic navigation.

Abstract

Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.
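As a rough illustration of the chunk-then-retrieve idea in the abstract, the sketch below splits a video into sequential chunks and selects the memory entries most relevant to a question. It is not the authors' implementation. The function names, the fixed chunk size, the use of one plain vector per chunk, and cosine-similarity top-k retrieval are all assumptions made for this example; the abstract does not specify how the layer-aware memory is built or queried, and the 3D structural cues are not modeled here.

```python
import math
from typing import List, Sequence, Tuple


def split_into_chunks(num_frames: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges covering the video in temporal order.

    Hypothetical helper: the fixed chunk size is an assumption of this sketch.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(s, min(s + chunk_size, num_frames))
            for s in range(0, num_frames, chunk_size)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question_vec: Sequence[float],
             memory: Sequence[Sequence[float]],
             top_k: int = 2) -> List[int]:
    """Pick the top_k chunk indices most similar to the question.

    Assumption: each chunk is summarized by a single vector. The selected
    indices are returned in temporal order so downstream reasoning sees
    the chunks in the order they were observed.
    """
    ranked = sorted(range(len(memory)),
                    key=lambda i: cosine(question_vec, memory[i]),
                    reverse=True)
    return sorted(ranked[:top_k])


# Toy usage with made-up 2-D vectors standing in for chunk memories.
chunks = split_into_chunks(num_frames=10, chunk_size=4)
memory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
selected = retrieve([1.0, 0.1], memory, top_k=2)
print(chunks)    # [(0, 4), (4, 8), (8, 10)]
print(selected)  # [0, 2]
```

Returning the selected chunks in temporal order rather than similarity order is a design choice of this sketch: spatial questions about routes and viewpoint changes often depend on the order in which places were seen.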

Multimodal Models | Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
