This paper introduces the SERF (spatiotemporal environment and robot feature) map, which represents the environment and the articulated robot body as neural points in a shared latent space and serves as a state input to a vision-language-action (VLA) policy for long-horizon mobile manipulation. The map is updated online from egocentric observations and proprioceptive state: environment points through object-level rigid tracking, robot points through forward kinematics. On the BEHAVIOR-1K benchmark, the SERF VLA policy outperforms image-only baselines, reaches subgoals faster by following more direct trajectories, is more robust to scene-configuration shifts, and recovers from object-drop failures.
Conditioning a mobile manipulation policy on a spatiotemporal feature map yields faster and more robust long-horizon behavior than image-only baselines.
Long-horizon robot mobile manipulation requires continual reasoning about localization, environment changes, and task progress, all of which are challenging to infer from image observations alone. In this paper, we show that conditioning a mobile manipulation policy on a spatiotemporal feature map improves reasoning over long horizons. The map represents the environment and the articulated robot body as neural points in a shared latent space and is updated online from egocentric observations and proprioceptive state. We update the environment neural points using object-level rigid tracking and the robot neural points using forward kinematics. We use our spatiotemporal environment and robot feature (SERF) map as a state input to a vision-language-action (VLA) model by extracting map tokens from multiple reference frames and spatial scales, providing the policy with both local and global context. We demonstrate SERF on BEHAVIOR-1K, a benchmark for long-horizon mobile manipulation in household environments. Experiments show that the SERF VLA policy outperforms image-only baselines, reaches subgoals faster by following more direct trajectories, improves robustness to scene-configuration shifts, and recovers from object-drop failures.
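The two map-update rules in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each neural point stores a 3D position, a latent feature, and the id of the object or robot link it is attached to, and it uses a toy planar arm for the kinematics. The names `NeuralPoint`, `owner`, `update_environment_points`, `planar_arm_link_poses`, and `update_robot_points` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class NeuralPoint:
    position: tuple  # (x, y, z); world frame for environment points
    feature: list    # latent feature vector, left unchanged by these updates
    owner: str       # hypothetical id of the owning object or robot link


def rot_z(theta):
    """3x3 rotation about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


def apply_rigid(rotation, translation, p):
    """Return rotation @ p + translation for a 3D point p."""
    return tuple(
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def update_environment_points(points, object_motions):
    """Move each point by the rigid motion tracked for its owning object.

    object_motions maps object id -> (rotation, translation); objects
    without an entry are assumed static (an assumption of this sketch).
    """
    updated = []
    for pt in points:
        if pt.owner in object_motions:
            rotation, translation = object_motions[pt.owner]
            new_pos = apply_rigid(rotation, translation, pt.position)
            updated.append(NeuralPoint(new_pos, pt.feature, pt.owner))
        else:
            updated.append(pt)
    return updated


def planar_arm_link_poses(joint_angles, link_lengths):
    """Forward kinematics for a toy planar chain of revolute joints.

    Returns link id -> (rotation, origin) in the base frame.
    """
    poses = {}
    angle = 0.0
    origin = (0.0, 0.0, 0.0)
    for i, (q, length) in enumerate(zip(joint_angles, link_lengths)):
        angle += q
        rotation = rot_z(angle)
        poses[f"link{i}"] = (rotation, origin)
        # The next link starts at the far end of the current one.
        origin = apply_rigid(rotation, origin, (length, 0.0, 0.0))
    return poses


def update_robot_points(local_points, link_poses):
    """Place link-local points in the base frame using the link poses."""
    return [
        NeuralPoint(apply_rigid(*link_poses[pt.owner], pt.position),
                    pt.feature, pt.owner)
        for pt in local_points
    ]
```

In this sketch the features ride along with the geometry: a point attached to a moved object or a rotated link keeps its latent feature and only changes position, so the map can stay consistent between observations without re-encoding the scene. How the actual system refreshes features from new observations is not specified in the abstract and is not modeled here.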