NVIDIA, HUST | Mar 18, 2026 | arXiv:2603.17720

VolumeDP: Modeling Volumetric Representation for Manipulation Policy Learning

Tianxing Zhou, Fei Xue, Feiyang Xue, Zhangchen Ye, Tianyuan Yuan, Hang Zhao, Tao Jiang

AI Summary

This paper introduces VolumeDP, an imitation learning policy architecture for robotic manipulation that addresses the 2D-3D mismatch by explicitly reasoning in 3D. VolumeDP lifts image features into a volumetric representation using cross-attention, selects task-relevant voxels, and converts them into spatial tokens for action prediction. On the LIBERO simulation benchmark it reports an average success rate of 88.8%, a 14.8% improvement over the strongest baseline. It also reports gains on ManiSkill and LIBERO-Plus, and real-world experiments show generalization to novel spatial layouts, camera viewpoints, and backgrounds.

Key Contribution

By explicitly reasoning in 3D rather than mapping 2D image observations directly to actions, VolumeDP reports a 14.8% improvement over the strongest baseline on the LIBERO benchmark, along with robust generalization in real-world experiments.

Abstract

Imitation learning is a prominent paradigm for robotic manipulation. However, existing visual imitation methods map 2D image observations directly to 3D action outputs, imposing a 2D-3D mismatch that hinders spatial reasoning and degrades robustness. We present VolumeDP, a policy architecture that restores spatial alignment by explicitly reasoning in 3D. VolumeDP first lifts image features into a Volumetric Representation via cross-attention. It then selects task-relevant voxels with a learnable module and converts them into a compact set of spatial tokens, markedly reducing computation while preserving action-critical geometry. Finally, a multi-token decoder conditions on the entire token set to predict actions, thereby avoiding lossy aggregation that collapses multiple spatial tokens into a single descriptor. VolumeDP achieves a state-of-the-art average success rate of 88.8% on the LIBERO simulation benchmark, outperforming the strongest baseline by a substantial 14.8% improvement. It also delivers large performance gains over prior methods on the ManiSkill and LIBERO-Plus benchmarks. Real-world experiments further demonstrate higher success rates and robust generalization to novel spatial layouts, camera viewpoints, and environment backgrounds. Code will be released.
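
The first two stages described in the abstract (lifting image features into a volume via cross-attention, then keeping only task-relevant voxels as spatial tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, tensor sizes, random features, and the single linear scoring head standing in for the learnable selection module are all assumptions made for illustration, and no learned projections or training are included.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lift_to_volume(voxel_queries, image_feats):
    """Cross-attention: each voxel query attends over all image tokens.

    voxel_queries: (V, D) one query per voxel (hypothetical layout)
    image_feats:   (N, D) flattened 2D image feature tokens
    returns:       (V, D) per-voxel features
    """
    d = voxel_queries.shape[-1]
    attn = softmax(voxel_queries @ image_feats.T / np.sqrt(d), axis=-1)  # (V, N)
    return attn @ image_feats

def select_voxels(volume_feats, score_weights, k):
    """Score voxels with a linear head (an assumed stand-in for the
    learnable selection module) and keep the k highest-scoring ones."""
    scores = volume_feats @ score_weights      # (V,)
    keep = np.argsort(scores)[-k:][::-1]       # indices, best first
    return volume_feats[keep], keep

# Illustrative sizes only: an 8x8x8 voxel grid and two 14x14 feature maps.
rng = np.random.default_rng(0)
V, N, D, K = 8 * 8 * 8, 2 * 14 * 14, 32, 16
voxel_queries = rng.normal(size=(V, D))
image_feats = rng.normal(size=(N, D))
score_weights = rng.normal(size=(D,))

volume = lift_to_volume(voxel_queries, image_feats)
tokens, idx = select_voxels(volume, score_weights, K)

# A multi-token decoder would condition on all K tokens, shape (K, D),
# rather than on a single pooled descriptor such as tokens.mean(axis=0).
print(volume.shape, tokens.shape)  # (512, 32) (16, 32)
```

In this sketch the decoder input shrinks from 512 voxel features to 16 spatial tokens, which mirrors the abstract's point about reducing computation while keeping a set of tokens instead of one aggregated vector.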

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
