The paper introduces VERM (Virtual Eye for Robotic Manipulation), a method that leverages foundation models to synthesize a virtual, task-adaptive viewpoint from a 3D point cloud constructed from multi-camera inputs, aiming to reduce redundancy and improve efficiency in 3D robotic manipulation. VERM incorporates a depth-aware module and a dynamic coarse-to-fine procedure to enhance 3D action planning and fine-grained manipulation. Experiments on RLBench and real-world settings demonstrate that VERM achieves state-of-the-art performance with significant speedups in both training and inference.
Ditch the multi-camera setup: VERM leverages foundation models to synthesize a single, task-optimized "virtual eye" view for robots, slashing training time by 1.89x.
When performing 3D manipulation tasks, robots must plan actions based on observations from multiple fixed cameras. This multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose VERM (Virtual Eye for Robotic Manipulation), which leverages the knowledge in foundation models to imagine a virtual, task-adaptive view from a constructed 3D point cloud, efficiently capturing necessary information and mitigating occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experimental results on both the RLBench simulation benchmark and real-world evaluations demonstrate the effectiveness of our method, which surpasses previous state-of-the-art methods while achieving a 1.89× speedup in training and a 1.54× speedup in inference.
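The core idea of rendering a single virtual view from a fused point cloud can be sketched as pinhole reprojection with a z-buffer. The function below is a minimal illustrative sketch, not the paper's implementation: the intrinsics (`fx`, `fy`, `cx`, `cy`) and the assumption that points are already expressed in the virtual camera's frame are ours, and VERM's task-adaptive viewpoint selection via foundation models is not modeled here.

```python
def project_points(points, colors, fx, fy, cx, cy, width, height):
    """Project 3D points (virtual-camera frame, z > 0) into a 2D image.

    A per-pixel z-buffer keeps only the nearest point, which is what
    lets a rendered virtual view sidestep occlusion from any one
    physical camera. Camera parameters here are illustrative.
    """
    image = [[None] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for (x, y, z), color in zip(points, colors):
        if z <= 0:
            continue  # point is behind the virtual camera
        # Pinhole projection to pixel coordinates.
        u = int(fx * x / z + cx)
        v = int(fy * y / z + cy)
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z      # nearest point wins (occlusion handling)
            image[v][u] = color
    return image, depth
```

For example, two points along the same ray land on the same pixel, and the z-buffer keeps the nearer one's color while the farther one is discarded.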