NVIDIA, CAS, PKU, Zhongguancun Academy | Mar 19, 2026 | arXiv:2603.19201

OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation

Yuhang Zheng, Songen Gu, Weize Li, Yupeng Zheng, Yujie Zang, Shuai Tian, Xiang Li, Ruihai Wu, Ce Hao, Chen Gao, Si Liu, Haoran Li, Yilun Chen, Shuicheng Yan, Wenchao Ding, Wen-Juan Ding

AI Summary

The authors introduce OmniViTac, a large-scale visuo-tactile-action dataset with 21,000+ trajectories across 86 tasks, and OmniVTA, a world-model-based framework for contact-rich manipulation. OmniVTA integrates a self-supervised tactile encoder, a two-stream visuo-tactile world model, a contact-aware fusion policy, and a 60Hz reflexive controller. Real-robot experiments demonstrate that OmniVTA outperforms existing methods and generalizes to unseen objects, highlighting the benefits of predictive contact modeling with high-frequency tactile feedback.

Key Contribution

OmniVTA pairs a visuo-tactile world model with high-frequency (60Hz) tactile feedback, so that a robot can predict contact dynamics and correct deviations during contact-rich manipulation. It is trained on OmniViTac, a new large-scale visuo-tactile-action dataset.

Abstract

Contact-rich manipulation tasks, such as wiping and assembly, require accurate perception of contact forces, friction changes, and state transitions that cannot be reliably inferred from vision alone. Despite growing interest in visuo-tactile manipulation, progress is constrained by two persistent limitations: existing datasets are small in scale and narrow in task coverage, and current methods treat tactile signals as passive observations rather than using them to model contact dynamics or enable closed-loop control explicitly. In this paper, we present OmniViTac, a large-scale visuo-tactile-action dataset comprising 21,000+ trajectories across 86 tasks and 100+ objects, organized into six physics-grounded interaction patterns. Building on this dataset, we propose OmniVTA, a world-model-based visuo-tactile manipulation framework that integrates four tightly coupled modules: a self-supervised tactile encoder, a two-stream visuo-tactile world model for predicting short-horizon contact evolution, a contact-aware fusion policy for action generation, and a 60Hz reflexive controller that corrects deviations between predicted and observed tactile signals in a closed loop. Real-robot experiments across all six interaction categories show that OmniVTA outperforms existing methods and generalizes well to unseen objects and geometric configurations, confirming the value of combining predictive contact modeling with high-frequency tactile feedback for contact-rich manipulation. All data, models, and code will be made publicly available on the project website at https://mrsecant.github.io/OmniVTA.
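
The closed-loop idea behind the reflexive controller can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that the controller runs at 60Hz and corrects deviations between predicted and observed tactile signals. The proportional correction law, the names `reflex_correction` and `run_reflex_loop`, the `gain` and `max_step` parameters, and the simplification that tactile features and actions share the same dimension are all assumptions made for illustration.

```python
import numpy as np

CONTROL_HZ = 60  # controller rate stated in the abstract


def reflex_correction(predicted_tactile, observed_tactile, gain=0.5, max_step=0.01):
    """Map a tactile prediction error to a bounded action correction.

    Hypothetical proportional law: the correction opposes the residual
    (observed - predicted) and is clipped to +/- max_step per tick.
    """
    predicted = np.asarray(predicted_tactile, dtype=float)
    observed = np.asarray(observed_tactile, dtype=float)
    residual = observed - predicted
    return np.clip(-gain * residual, -max_step, max_step)


def run_reflex_loop(nominal_actions, predicted, observed, gain=0.5, max_step=0.01):
    """Apply the correction at every control tick.

    All inputs are arrays of shape (T, D): one row per tick, with the
    (assumed) shared dimension D for tactile features and actions.
    """
    corrected = []
    for action, pred, obs in zip(nominal_actions, predicted, observed):
        delta = reflex_correction(pred, obs, gain, max_step)
        corrected.append(np.asarray(action, dtype=float) + delta)
    return np.stack(corrected)


if __name__ == "__main__":
    ticks = CONTROL_HZ  # one second of control at 60 Hz
    nominal = np.zeros((ticks, 2))
    predicted = np.zeros((ticks, 2))
    observed = np.full((ticks, 2), 0.004)  # small constant deviation
    print(run_reflex_loop(nominal, predicted, observed)[0])
```

When the observed signal matches the prediction, the nominal action passes through unchanged; larger deviations are bounded by `max_step`, which is one simple way to keep a fast inner loop from overriding the slower policy.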

Multimodal Models | Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 61

Year: 2026

Venue: N/A
