This paper introduces LongSpace, a memory framework for long-horizon spatial reasoning in Multimodal Large Language Models (MLLMs). It models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and builds layer-aware memory for question-guided retrieval. The authors also introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory that covers scene perception, spatial relations, and spatial memory. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, supporting explicit spatial memory as a key capability for tasks such as autonomous driving and robotic navigation.
LongSpace suggests that integrating explicit spatial memory into MLLMs improves their performance on long-horizon video tasks, a capability relevant to applications such as autonomous driving and robotic navigation.
Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.
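To make the chunk-then-retrieve idea in the abstract concrete, here is a minimal sketch under loud assumptions. It is not the authors' implementation: the names `chunk_frames`, `ChunkMemory`, `write`, and `retrieve` are hypothetical, frame features are toy vectors, each chunk is summarized by a simple mean, and retrieval uses cosine similarity against a question vector. The 3D structural cues and the layer-aware aspect of the memory described above are not modeled here.

```python
import math
from dataclasses import dataclass, field


def chunk_frames(frames, chunk_size):
    """Split a long frame sequence into consecutive, non-overlapping chunks."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def mean_vector(vectors):
    """Average a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class ChunkMemory:
    """Hypothetical memory: one summary vector per video chunk."""
    entries: list = field(default_factory=list)  # (chunk_index, summary_vector)

    def write(self, chunk_index, frame_features):
        # Assumption: a chunk is summarized by the mean of its frame features.
        self.entries.append((chunk_index, mean_vector(frame_features)))

    def retrieve(self, question_vector, top_k=1):
        # Question-guided retrieval: rank chunks by similarity to the question.
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine(question_vector, entry[1]),
            reverse=True,
        )
        return [index for index, _ in ranked[:top_k]]


# Toy 2-D features standing in for per-frame visual embeddings.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]

memory = ChunkMemory()
for index, chunk in enumerate(chunk_frames(frames, chunk_size=2)):
    memory.write(index, chunk)

# A question vector pointing along the second feature dimension.
best = memory.retrieve([0.0, 1.0], top_k=1)
```

In this toy setup the question vector is closest to the second chunk, so `best` holds that chunk's index. A real system would replace the mean with learned memory tokens and the toy vectors with model activations; those choices are outside what the abstract specifies.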