BAIR · SJTU · Mar 15, 2026 · arXiv:2603.14401

OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer

Kuanning Wang, Ke Fan, Yuqian Fu, Siyu Lin, Hu Luo, Daniel Seita, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue

AI Summary

OCRA is an object-centric framework for transferring actions from human demonstration videos to robots. It reconstructs object-centric 3D point clouds from multi-view RGB videos using the VGGT 3D foundation model together with detection and segmentation models, and adds tactile priors drawn from a dataset of over one million tactile images. A ResFiLM module fuses the 3D and tactile priors and feeds them into a Diffusion Policy that generates manipulation actions. In experiments on vision-only and visuo-tactile tasks, OCRA outperforms existing baselines and ablations.
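To make the object-centric reconstruction step concrete, here is a minimal sketch of how a per-pixel 3D point map and a segmentation mask could be combined into an object-only point cloud. It assumes the point map (one 3D point per pixel, as a 3D foundation model might produce) and the mask are already available and that all views share one world frame. The function names `object_centric_points` and `merge_views` are hypothetical and are not taken from the paper.

```python
def object_centric_points(point_map, mask):
    """Keep only the 3D points whose pixel lies inside the object mask.

    point_map: H x W nested list of (x, y, z) tuples, one 3D point per pixel
               (assumed to come from an upstream reconstruction model).
    mask:      H x W nested list of booleans, True where the pixel belongs
               to a task-relevant object (assumed to come from a segmenter).
    """
    if len(point_map) != len(mask):
        raise ValueError("point_map and mask must have the same height")
    points = []
    for row_points, row_mask in zip(point_map, mask):
        if len(row_points) != len(row_mask):
            raise ValueError("point_map and mask must have the same width")
        for point, keep in zip(row_points, row_mask):
            if keep:  # background pixels are dropped here
                points.append(tuple(point))
    return points


def merge_views(views):
    """Concatenate object points from several (point_map, mask) views.

    Assumes every view's points are already expressed in a shared world frame.
    """
    cloud = []
    for point_map, mask in views:
        cloud.extend(object_centric_points(point_map, mask))
    return cloud
```

Calling `merge_views` on a list of `(point_map, mask)` pairs returns a single flat list of object points with the background filtered out, which is the kind of input an object-centric policy would consume.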

Key Contribution

Teaching robots to manipulate objects just got easier: OCRA learns directly from human demonstration videos by focusing on object interactions and incorporating tactile feedback.

Abstract

We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.
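The abstract does not spell out how ResFiLM works internally. As a rough illustration only, the sketch below assumes it is a residual variant of FiLM (feature-wise linear modulation), in which a tactile feature vector predicts a per-channel scale and shift that are applied to the visual feature and added back to it. The function `res_film`, its parameters, and the choice of which modality conditions which are assumptions, not the authors' implementation.

```python
def linear(x, weight, bias):
    """Plain affine map: weight is a list of rows, one row per output channel."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def res_film(visual, tactile, w_gamma, b_gamma, w_beta, b_beta):
    """Residual FiLM fusion (assumed form): out = visual + gamma * visual + beta.

    visual:  feature vector from the 3D branch (list of floats).
    tactile: conditioning vector from the tactile branch (list of floats).
    gamma and beta are predicted from the tactile vector by two linear maps.
    """
    gamma = linear(tactile, w_gamma, b_gamma)  # per-channel scale
    beta = linear(tactile, w_beta, b_beta)     # per-channel shift
    if not (len(gamma) == len(beta) == len(visual)):
        raise ValueError("gamma, beta and visual must have the same length")
    # The residual connection keeps the visual feature intact when the
    # predicted modulation is zero.
    return [v + g * v + b for v, g, b in zip(visual, gamma, beta)]


fused = res_film(
    visual=[1.0, 2.0],
    tactile=[1.0],
    w_gamma=[[0.5], [0.0]], b_gamma=[0.0, 0.0],
    w_beta=[[0.0], [1.0]], b_beta=[0.0, 0.0],
)
```

One property of this assumed form is that with all-zero modulation weights the module reduces to the identity on the visual feature, so a vision-only input passes through unchanged. In a full system the fused vector would then serve as conditioning for the action-generating policy.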

Computer Vision · Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
