Academy, BIT, HIT, Shanghai AI Lab, University of Science and Technology, Zhongguancun Institute of Artificial
Apr 29, 2026
arXiv:2604.26848

STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

Yuxuan Tian, Yurun Jin, Bin Yu, Yukun Shi, Hao Wu, Chi Harold Liu, Kai Chen, Cong Huang

AI Summary

STARRY is a world-model-enhanced policy that improves robotic manipulation by explicitly modeling action-relevant spatial-temporal interactions. It jointly denoises future spatial-temporal latents and action sequences, and uses Geometry-Aware Selective Attention Modulation to turn predicted depth and end-effector geometry into token-aligned weights that modulate action attention. On RoboTwin 2.0, STARRY reaches 93.82% / 93.30% average success under Clean and Randomized settings; in real-world experiments it raises average success from 42.5% to 70.8% over $π_{0.5}$.

Key Contribution

Robots get a spatial-temporal reasoning boost with STARRY, a world model that aligns future predictions with action generation, leading to a significant jump in manipulation success.

Abstract

Robotic manipulation critically requires reasoning about future spatial-temporal interactions, yet existing VLA policies and world-model-enhanced policies do not fully model action-relevant spatial-temporal interaction structure. We propose STARRY, a world-model-enhanced action-generation policy that aligns spatial-temporal prediction with action generation. STARRY jointly denoises future spatial-temporal latents and action sequences, and introduces Geometry-Aware Selective Attention Modulation to convert predicted depth and end-effector geometry into token-aligned weights for selective action-attention modulation. On RoboTwin 2.0, STARRY achieves 93.82% / 93.30% average success under Clean and Randomized settings. Real-world experiments further improve average success from 42.5% to 70.8% over $π_{0.5}$, demonstrating the effectiveness of action-centric spatial-temporal world modeling for spatial-temporally demanding robotic action generation.
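As a rough illustration of what "token-aligned weights for selective action-attention modulation" could look like, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that each visual token has a 3D point recovered from predicted depth, that the weight is a Gaussian falloff in distance from the end-effector, and that the weight is applied as a log-bias on attention logits. The names `geometry_weights`, `modulated_attention`, and `sigma` are hypothetical and do not come from the paper.

```python
import math

def geometry_weights(token_points, ee_position, sigma=0.1):
    """Hypothetical weighting: Gaussian falloff with each token's 3D distance
    to the end-effector (an assumption, not the paper's formula)."""
    weights = []
    for point in token_points:
        d2 = sum((a - b) ** 2 for a, b in zip(point, ee_position))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    return weights

def modulated_attention(logits, weights, eps=1e-6):
    """Softmax over logits + log(weight). This rescales each attention weight
    by its geometric weight and renormalizes so the result sums to 1."""
    biased = [l + math.log(w + eps) for l, w in zip(logits, weights)]
    m = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: three tokens with identical content logits.
# Token 0 sits at the end-effector, token 1 is 5 cm away, token 2 is far away.
token_points = [(0.0, 0.0, 0.5), (0.05, 0.0, 0.5), (0.4, 0.3, 0.9)]
ee_position = (0.0, 0.0, 0.5)
logits = [0.0, 0.0, 0.0]

attn = modulated_attention(logits, geometry_weights(token_points, ee_position))
print(attn)
```

With equal logits, the tokens nearest the end-effector receive most of the attention mass and the distant token is almost entirely suppressed, which is the selective behavior the abstract describes at a high level.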

Computer Vision | Robotics & Embodied AI | World Models & Planning

