Mar 16, 2026 · arXiv:2603.15620

Towards Generalizable Robotic Manipulation in Dynamic Environments

Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang, Xiang Bai

AI Summary

The authors introduce DOMINO, a large-scale dataset and benchmark for vision-language-action (VLA) models in dynamic robotic manipulation, addressing the difficulty existing models have with moving targets. They systematically evaluate existing VLAs and explore training strategies for dynamic awareness. To improve performance, they propose PUMA, a dynamics-aware VLA architecture that integrates historical optical flow and world queries for implicit future state forecasting, achieving a 6.3% absolute improvement in success rate over baselines.

Key Contribution

Current vision-language-action models struggle with dynamic robotic manipulation because dynamic manipulation data is scarce and single-frame observations limit their spatiotemporal reasoning. The DOMINO dataset and the PUMA architecture are introduced to narrow that gap.

Abstract

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.
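The abstract reports its gain as an absolute improvement in success rate, which means a difference in percentage points rather than a ratio. The sketch below illustrates that distinction. The helper names and the episode outcomes are hypothetical, chosen only for illustration; they are not taken from the paper or the DOMINO codebase.

```python
def success_rate(outcomes):
    """Return the percentage of successful episodes in a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one episode outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


def absolute_gain(new_rate, base_rate):
    """Difference in percentage points between two success rates."""
    return new_rate - base_rate


def relative_gain(new_rate, base_rate):
    """Gain expressed as a percentage of the baseline success rate."""
    if base_rate == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * (new_rate - base_rate) / base_rate


# Hypothetical outcomes (NOT results from the paper): 100 episodes per model.
baseline_outcomes = [True] * 40 + [False] * 60
improved_outcomes = [True] * 46 + [False] * 54

base = success_rate(baseline_outcomes)
new = success_rate(improved_outcomes)

print(f"baseline: {base:.1f}%, improved: {new:.1f}%")
print(f"absolute gain: {absolute_gain(new, base):.1f} percentage points")
print(f"relative gain: {relative_gain(new, base):.1f}% of baseline")
```

With these made-up numbers, the same change reads as 6 percentage points in absolute terms but 15% in relative terms, which is why the abstract's "absolute" qualifier matters when comparing against other reported results.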

Eval Frameworks & Benchmarks · Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
