This paper introduces HiMem-WAM, a Hierarchical Memory-Gated World Action Model for long-horizon robotic manipulation that combines motion-centric latent actions, high-level skill latents, and boundary-triggered memory updates. Its hierarchical latent action framework jointly learns low-level motion latents and high-level skill latents, giving the model structured temporal abstraction. A boundary-aware memory gate writes compact task states at predicted skill transitions, so inference stays causal and requires neither future video generation nor optical flow estimation at test time. On LIBERO, LIBERO-PLUS, RMBench, and real-world tasks, the hierarchical latents improve robustness under deployment perturbations, and the memory module substantially helps memory-dependent long-horizon tasks, a known weakness of existing WAMs.
Hierarchical latent actions and a boundary-triggered memory gate let HiMem-WAM handle long-horizon robotic manipulation, improving robustness under deployment perturbations and performance on memory-dependent tasks.
World Action Models (WAMs) have emerged as a powerful new paradigm for embodied intelligence, learning action-relevant visual dynamics that significantly enhance generalization and robustness. However, existing WAMs still struggle with task-relevant memory in long-horizon robotic manipulation. To address this, we present HiMem-WAM, a Hierarchical Memory-Gated WAM that integrates motion-centric latent actions, high-level skill latents, and boundary-triggered memory updates. Specifically, we develop a hierarchical latent action framework that jointly learns low-level motion latents and high-level skill latents, providing structured temporal abstraction. In addition, a boundary-aware memory gate writes compact task states at predicted skill transitions, enabling causal inference without test-time future video generation or optical flow estimation. Evaluated on LIBERO, LIBERO-PLUS, RMBench, and real-world tasks, HiMem-WAM shows that hierarchical latents improve robustness under deployment perturbations and that the memory module substantially benefits memory-dependent long-horizon manipulation.
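The boundary-triggered memory update described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `BoundaryGatedMemory`, the fixed `threshold`, the bounded `capacity`, and the oldest-first eviction rule are all assumptions made for illustration, and the paper's gate is presumably learned rather than a hard threshold. The sketch only shows the general idea of writing a compact task state when a skill boundary is predicted, using information available up to the current step.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class BoundaryGatedMemory:
    """Hypothetical memory that is written only at predicted skill boundaries."""

    threshold: float = 0.5  # assumed hard cutoff on the boundary probability
    capacity: int = 8       # assumed maximum number of stored task states
    slots: List[List[float]] = field(default_factory=list)

    def step(self, task_state: Sequence[float], boundary_prob: float) -> bool:
        """Write task_state if a boundary is predicted; return True on a write."""
        if boundary_prob < self.threshold:
            return False  # no predicted skill transition, memory unchanged
        self.slots.append(list(task_state))
        if len(self.slots) > self.capacity:
            self.slots.pop(0)  # assumed policy: evict the oldest entry
        return True


def run_episode(
    states: Sequence[Sequence[float]],
    boundary_probs: Sequence[float],
    memory: BoundaryGatedMemory,
) -> List[int]:
    """Process steps in temporal order and return the indices that were written.

    Each decision depends only on the current step, so the loop is causal:
    no future frames or future boundary predictions are consulted.
    """
    written = []
    for t, (state, prob) in enumerate(zip(states, boundary_probs)):
        if memory.step(state, prob):
            written.append(t)
    return written


if __name__ == "__main__":
    memory = BoundaryGatedMemory(threshold=0.5, capacity=2)
    states = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    boundary_probs = [0.1, 0.9, 0.2, 0.7, 0.8]
    print(run_episode(states, boundary_probs, memory))
    print(memory.slots)
```

Writing only at predicted boundaries keeps the memory small relative to the episode length, which is one plausible reading of "compact task states" in the abstract; the actual state representation and gating function are not specified there.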