The paper introduces SOMA, a spatial memory framework that enhances Vision-Language-Action (VLA) models by enabling them to perform manipulation tasks even when target objects are initially outside the camera's field of view. SOMA constructs a persistent spatial memory from multi-view observations using a movable head camera, incorporating spatial memory construction, dynamic memory refinement, and contextual memory retrieval. Experiments on real-world out-of-vision manipulation tasks demonstrate that SOMA improves task success rates, accelerates target localization, and facilitates near one-shot grasping under partial observability.
Robots can now "remember" and manipulate objects they can't currently see, thanks to a spatial memory system that lets them reason beyond their immediate field of view.
We introduce SOMA, the Spatial Memory framework for Out-of-Vision Manipulation in Vision-Language-Action (VLA) models. Most existing VLAs implicitly assume that task-relevant objects are always visible, leading to brittle and reactive behaviors when targets fall outside the camera's field of view. SOMA addresses this limitation by equipping VLAs with a persistent spatial memory constructed from multi-view observations acquired via a movable head camera, enabling reasoning beyond the current visual frustum. The framework consists of three components: Spatial Memory Construction, which scans the scene and aggregates per-angle observations into a unified spatial-semantic representation; Dynamic Memory Refinement, which maintains global consistency over time; and Contextual Memory Retrieval, which activates instruction-relevant spatial cues during manipulation. We evaluate SOMA on five challenging real-world out-of-vision manipulation tasks, including multi-step and dual-arm scenarios where target objects are initially invisible. Experimental results show that SOMA not only improves task success rates, but also induces qualitatively different manipulation behaviors, with faster target localization, reduced viewpoint search, and near one-shot grasping under partial observability. Additional experiments on RoboCasa GR1 and SimplerEnv further validate the effectiveness of SOMA's memory design under conventional fully observable settings. Code will be released soon.
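The three components named in the abstract can be illustrated with a toy sketch. It is not SOMA's implementation: the abstract does not specify data structures, and the actual system presumably operates on learned visual features rather than text labels. Every name here (`MemoryEntry`, `SpatialMemory`, `construct`, `refine`, `retrieve`), the use of a pan angle as the spatial key, and the keyword-matching retrieval are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    label: str      # semantic label of an observed object (hypothetical)
    pan_deg: float  # head-camera pan angle at which it was last seen
    step: int       # time step of the most recent sighting


@dataclass
class SpatialMemory:
    # Maps an object label to its latest entry (assumed layout, not SOMA's).
    entries: dict = field(default_factory=dict)

    def construct(self, scan, step=0):
        """Construction: merge per-angle observations from one scan."""
        for pan_deg, labels in scan:
            for label in labels:
                self.refine(label, pan_deg, step)

    def refine(self, label, pan_deg, step):
        """Refinement: keep only the most recent sighting of each object."""
        old = self.entries.get(label)
        if old is None or step >= old.step:
            self.entries[label] = MemoryEntry(label, pan_deg, step)

    def retrieve(self, instruction):
        """Retrieval: return entries whose label occurs in the instruction."""
        text = instruction.lower()
        return [e for e in self.entries.values() if e.label in text]


memory = SpatialMemory()
# Initial scan: objects seen at three head-camera pan angles.
memory.construct([(-60.0, ["mug"]), (0.0, ["plate"]), (60.0, ["sponge"])])
# Later, the mug is seen at a new angle, so its entry is overwritten.
memory.refine("mug", 45.0, step=3)
hits = memory.retrieve("Pick up the mug and place it on the plate")
```

In this toy version, `hits` holds the mug and plate entries even if neither is in the current camera view, which is the kind of instruction-relevant cue a policy could use to orient toward an out-of-view target instead of searching reactively.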