The paper introduces VERM (Virtual Eye for Robotic Manipulation), a method that leverages foundation models to synthesize a virtual, task-adaptive viewpoint from a 3D point cloud constructed from multi-camera inputs, aiming to reduce redundancy and improve efficiency in 3D robotic manipulation. VERM incorporates a depth-aware module and a dynamic coarse-to-fine procedure to enhance 3D action planning and fine-grained manipulation. Experiments on RLBench and real-world settings demonstrate that VERM achieves state-of-the-art performance with significant speedups in both training and inference.
Ditch the multi-camera setup: VERM leverages foundation models to synthesize a single, task-optimized "virtual eye" view for robots, slashing training time by 1.89x.
When performing 3D manipulation tasks, robots must plan actions based on observations from multiple fixed cameras. This multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose VERM (Virtual Eye for Robotic Manipulation), which leverages the knowledge in foundation models to imagine a virtual, task-adaptive view from a constructed 3D point cloud, efficiently capturing necessary information and mitigating occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experimental results on both the RLBench simulation benchmark and real-world evaluations demonstrate the effectiveness of our method, which surpasses previous state-of-the-art methods while achieving a 1.89× speedup in training and a 1.54× speedup in inference.
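The core idea of rendering a single virtual view from a fused point cloud can be sketched as pinhole reprojection with a z-buffer. The function below is a minimal illustrative sketch, not the paper's implementation: the intrinsics (`fx`, `fy`, `cx`, `cy`) and the assumption that points are already expressed in the virtual camera's frame are ours, and VERM's task-adaptive viewpoint selection via foundation models is not modeled here.

```python
def project_points(points, colors, fx, fy, cx, cy, width, height):
    """Project 3D points (virtual-camera frame, z > 0) into a 2D image.

    A per-pixel z-buffer keeps only the nearest point, which is what
    lets a rendered virtual view sidestep occlusion from any one
    physical camera. Camera parameters here are illustrative.
    """
    image = [[None] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for (x, y, z), color in zip(points, colors):
        if z <= 0:
            continue  # point is behind the virtual camera
        # Pinhole projection to pixel coordinates.
        u = int(fx * x / z + cx)
        v = int(fy * y / z + cy)
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z      # nearest point wins (occlusion handling)
            image[v][u] = color
    return image, depth
```

For example, two points along the same ray land on the same pixel, and the z-buffer keeps the nearer one's color while the farther one is discarded.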