Tsinghua AI, BAAI, Guilin University of Electronic, HUST | Jun 9, 2026 | arXiv:2606.10899

MV-Actor: Aligning Multi-View Semantics and Spatial Awareness for Bimanual Manipulation

Yinchen Tian, Huan Li, Muyao Peng, Xi Wang, Yan Wang, You Yang

AI Summary

This paper introduces MV-Actor, a multi-view perception framework designed to enhance bimanual robotic manipulation by creating a unified semantic-spatial representation. By implementing Multi-view Semantic Interaction and Semantic-Spatial Token Interaction, MV-Actor effectively shares semantic perception across different camera views and grounds visual semantics with spatial features, addressing the limitations of existing methods. The framework achieves a state-of-the-art success rate of 87.8% in simulations and outperforms RGB and RGB-D baselines in real-world scenarios, highlighting its robustness against consumer-grade depth noise and viewpoint variability.

Key Contribution

MV-Actor achieves a state-of-the-art average success rate of 87.8% on the PerAct2 bimanual benchmark by sharing semantic perception across camera views and grounding it in spatial features, and it outperforms RGB and RGB-D baselines in real-world evaluations.

Abstract

Robotic manipulation has been widely applied in industrial scenarios. Compared with single-arm setups, bimanual manipulation setups are equipped with multiple cameras that capture information from different viewpoints. However, existing multi-view policies encode each view independently or fuse view features only shallowly, resulting in limited sharing of semantic perception and unreliable spatial awareness. In this paper, we propose MV-Actor, a multi-view perception framework that builds a unified semantic-spatial representation for bimanual manipulation. First, MV-Actor performs Multi-view Semantic Interaction to share semantic perception across views. Then it uses Semantic-Spatial Token Interaction to ground visual semantics with features from a feed-forward reconstruction model and acquire reliable spatial awareness. Finally, a Guided Metric Depth Repair module refines degraded sensor depth to provide more reliable metric anchors under consumer-grade depth noise. In simulation experiments conducted on the PerAct2 bimanual benchmark, MV-Actor achieves a state-of-the-art average success rate of 87.8%. In real-world evaluations with more frequent viewpoint changes and unstable consumer-grade depth, MV-Actor outperforms both RGB and RGB-D baselines, further demonstrating the benefit of shared semantic perception and reliable spatial awareness for bimanual manipulation.
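The abstract does not say how Multi-view Semantic Interaction is implemented. As a rough illustration of what "sharing semantic perception across views" can mean, the sketch below lets every token attend to the tokens of all views through plain scaled dot-product attention. This is an assumption made for illustration, not the authors' method: the function name `joint_view_attention`, the camera names in the comments, and the absence of learned projections are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_view_attention(view_tokens):
    """Hypothetical cross-view interaction (single head, no learned weights).

    view_tokens: list of views, each a list of tokens, each a list of floats.
    Returns the same nested shape with updated tokens.
    """
    # Flatten all views into one sequence so attention spans every view.
    flat = [tok for view in view_tokens for tok in view]
    dim = len(flat[0])
    scale = 1.0 / math.sqrt(dim)

    updated = []
    for q in flat:
        # Scaled dot-product score of this query against every token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in flat]
        weights = softmax(scores)
        # Weighted average of all tokens (values are the tokens themselves).
        mixed = [sum(w * v[d] for w, v in zip(weights, flat)) for d in range(dim)]
        updated.append(mixed)

    # Split the flat sequence back into the original per-view grouping.
    out, start = [], 0
    for view in view_tokens:
        out.append(updated[start:start + len(view)])
        start += len(view)
    return out


views = [
    [[1.0, 0.0], [0.0, 1.0]],  # tokens from a hypothetical wrist camera
    [[1.0, 1.0]],              # tokens from a hypothetical front camera
]
fused = joint_view_attention(views)
```

Each output token is a convex combination of tokens from all views, so information from one camera can influence the representation of another. Encoding each view independently, the limitation the abstract points to, would correspond to running the same attention separately inside each view.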

Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
