This paper introduces FingerViP, a dexterous manipulation system that embeds miniature cameras in robot fingertips to provide multi-view visual feedback. The authors train a diffusion-based visuomotor policy conditioned on a third-view camera and multi-view fingertip vision, augmenting each fingertip visual feature with camera-pose and joint-current encodings to improve view-proprioception alignment and contact awareness. Across real-world tasks, FingerViP achieves an overall success rate of 80.8%, demonstrating the benefits of fingertip visual perception for complex manipulation.
Robot hands get a serious upgrade: embedding cameras in fingertips unlocks robust manipulation in cluttered environments where traditional wrist-mounted cameras fail.
The current practice of dexterous manipulation generally relies on a single wrist-mounted view, which is often occluded and limits performance on tasks requiring multi-view perception. In this work, we present FingerViP, a learning system that utilizes a visuomotor policy with fingertip visual perception for dexterous manipulation. Specifically, we design a vision-enhanced fingertip module with an embedded miniature camera and install the modules on each finger of a multi-fingered hand. The fingertip cameras substantially improve visual perception by providing comprehensive, multi-view feedback of both the hand and its surrounding environment. Building on the integrated fingertip modules, we develop a diffusion-based whole-body visuomotor policy conditioned on a third-view camera and multi-view fingertip vision, which effectively learns complex manipulation skills directly from human demonstrations. To improve view-proprioception alignment and contact awareness, each fingertip visual feature is augmented with its corresponding camera pose encoding and per-finger joint-current encoding. We validate the effectiveness of the multi-view fingertip vision and demonstrate the robustness and adaptability of FingerViP on various challenging real-world tasks, including pressing buttons inside a confined box, retrieving sticks from an unstable support, retrieving objects behind an occluding curtain, and performing long-horizon cabinet opening and object retrieval, achieving an overall success rate of 80.8%. All hardware designs and code will be fully open-sourced.
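The abstract describes augmenting each fingertip's visual feature with its camera-pose encoding and per-finger joint-current encoding before conditioning the policy. The paper's actual architecture is not given here, but the fusion step might look like the following sketch. Everything in it is an illustrative assumption: the shapes, the sinusoidal encoding, and the simple concatenation are stand-ins, not the authors' implementation.

```python
import numpy as np

def sinusoidal_encoding(x, num_freqs=4):
    """Encode each scalar with sin/cos at geometrically spaced frequencies
    (a common positional-encoding choice; assumed here, not from the paper)."""
    freqs = 2.0 ** np.arange(num_freqs)           # (F,)
    angles = x[..., None] * freqs                 # (..., D, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)         # (..., D * 2F)

def augment_fingertip_features(visual_feats, camera_poses, joint_currents):
    """Build one conditioning token per fingertip.

    visual_feats:   (num_fingers, feat_dim)  per-fingertip image embeddings
    camera_poses:   (num_fingers, 6)         e.g. xyz + axis-angle rotation
    joint_currents: (num_fingers, joints)    per-finger motor currents
    """
    pose_enc = sinusoidal_encoding(camera_poses)
    current_enc = sinusoidal_encoding(joint_currents)
    # Concatenate vision with pose and current encodings so each token
    # carries both what the fingertip sees and where/how hard it presses.
    return np.concatenate([visual_feats, pose_enc, current_enc], axis=-1)

# Hypothetical example: 4 fingertips, 256-d visual features, 3 joints each.
tokens = augment_fingertip_features(
    np.random.randn(4, 256), np.random.randn(4, 6), np.random.randn(4, 3))
print(tokens.shape)  # (4, 328): 256 + 6*8 pose dims + 3*8 current dims
```

The resulting per-fingertip tokens would then be consumed, alongside the third-view camera features, as conditioning inputs to the diffusion policy's denoising network.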