Mar 15, 2026 · arXiv:2603.14498

R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation

Yuhao Zhang, Wanxi Dong, Yue Shi, Yi Liang, Jingnan Gao, Qiaochu Yang, Yaxing Lyu, Zhixuan Liang, Yibin Liu, Congsheng Xu, Xianda Guo, Wei Sui, Yaohui Jin, Xiaokang Yang, Yanyan Xu, Yao Mu

AI Summary

R3DP addresses the challenge of integrating computationally expensive 3D vision models into real-time embodied manipulation policies. It uses an asynchronous fast-slow collaboration module that queries a pre-trained 3D vision model (VGGT) only on sparse keyframes and predicts features for the intermediate frames with a lightweight Temporal Feature Prediction Network (TFPNet). R3DP also incorporates a Multi-View Feature Fuser (MVFF) for multi-view fusion. It improves average success rate by 32.9% over the single-view baseline and by 51.4% over the multi-view baseline, and it reduces inference time by 44.8% compared to a naive integration.
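To make the latency argument concrete, the sketch below computes the average per-frame cost when an expensive model runs only on every K-th frame and a cheap predictor covers the rest. It assumes a simple blocking schedule, which is a simplification of the asynchronous design described here. The function names and the timings in the usage note are hypothetical and are not taken from the paper.

```python
def amortized_latency_ms(slow_ms: float, fast_ms: float, keyframe_interval: int) -> float:
    """Average per-frame latency when the slow model runs once every
    `keyframe_interval` frames and the fast predictor runs on all others.

    Assumption: a blocking schedule, so the slow call's cost is spread
    evenly over the frames of one interval.
    """
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be >= 1")
    fast_frames = keyframe_interval - 1
    return (slow_ms + fast_frames * fast_ms) / keyframe_interval


def relative_reduction(baseline_ms: float, new_ms: float) -> float:
    """Fractional latency reduction of `new_ms` relative to `baseline_ms`."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return 1.0 - new_ms / baseline_ms
```

With made-up timings of 100 ms for the slow model and 20 ms for the fast predictor, `amortized_latency_ms(100.0, 20.0, 4)` gives 40 ms per frame, a 60% reduction relative to running the slow model on every frame. In a truly asynchronous setup the slow call would overlap with policy execution, so the control loop's latency could be closer to the fast path alone.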

Key Contribution

R3DP achieves real-time embodied manipulation with large 3D vision models through an asynchronous architecture that raises average success rate by up to 51.4% while reducing inference time by 44.8% relative to a naive DP+VGGT integration.

Abstract

Embodied manipulation requires accurate 3D understanding of objects and their spatial relations to plan and execute contact-rich actions. While large-scale 3D vision models provide strong priors, their computational cost incurs prohibitive latency for real-time control. We propose Real-time 3D-aware Policy (R3DP), which integrates powerful 3D priors into manipulation policies without sacrificing real-time performance. The core innovation of R3DP is an asynchronous fast-slow collaboration module. The system maintains real-time efficiency by querying the pre-trained slow system (VGGT) only on sparse keyframes, while a lightweight Temporal Feature Prediction Network (TFPNet) predicts features for all intermediate frames. By using historical data to exploit temporal correlations, TFPNet produces consistent feature estimates, which improves task success rates. Additionally, to enable more effective multi-view fusion, we introduce a Multi-View Feature Fuser (MVFF) that aggregates features across views by explicitly incorporating camera intrinsics and extrinsics. R3DP offers a plug-and-play solution for integrating large models into real-time inference systems. We evaluate R3DP against multiple baselines across different visual configurations. R3DP effectively harnesses large-scale 3D priors to achieve superior results, outperforming single-view and multi-view DP by 32.9% and 51.4% in average success rate, respectively. Furthermore, by decoupling heavy 3D reasoning from policy execution, R3DP achieves a 44.8% reduction in inference time compared to a naive DP+VGGT integration.
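The keyframe scheme described in the abstract can be sketched as a per-frame feature loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `slow_encoder` and `fast_predictor` are hypothetical stand-ins for VGGT and TFPNet, the fixed `keyframe_interval` is an assumed scheduling rule, and the loop runs sequentially rather than asynchronously.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple

Feature = List[float]


def slow_encoder(frame: Sequence[float]) -> Feature:
    """Hypothetical stand-in for the expensive 3D model (VGGT in the paper).
    Here it simply copies the frame values."""
    return [float(x) for x in frame]


def fast_predictor(history: Sequence[Feature], frame: Sequence[float]) -> Feature:
    """Hypothetical stand-in for TFPNet: blends the most recent keyframe
    feature with the current frame. A real predictor would be a learned network."""
    last = history[-1]
    return [0.5 * a + 0.5 * float(b) for a, b in zip(last, frame)]


def run_fast_slow(
    frames: Sequence[Sequence[float]],
    keyframe_interval: int = 4,
    history_len: int = 2,
) -> Tuple[List[Feature], List[str]]:
    """Return one feature per frame plus a label saying which path produced it."""
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be >= 1")
    history: Deque[Feature] = deque(maxlen=history_len)  # recent keyframe features
    features: List[Feature] = []
    sources: List[str] = []
    for t, frame in enumerate(frames):
        if t % keyframe_interval == 0:
            # Sparse keyframe: pay for the slow model and remember its output.
            feat = slow_encoder(frame)
            history.append(feat)
            sources.append("slow")
        else:
            # Intermediate frame: predict features from keyframe history.
            feat = fast_predictor(list(history), frame)
            sources.append("fast")
        features.append(feat)
    return features, sources
```

For six one-dimensional frames and `keyframe_interval=4`, the slow path runs on frames 0 and 4 and the fast path handles the other four. In the design described above the slow system runs asynchronously, so its result would arrive in the background instead of blocking the loop on keyframes as this sketch does.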

Computer Vision, Robotics & Embodied AI, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
