Jiajun Xu

WorldEngine AI
∗Equal Contribution

Abstract

This paper investigates humanoid whole-body dexterous manipulation, where the efficient collection of high-quality demonstration data remains a central bottleneck. Existing teleoperation systems often suffer from limited portability, occlusion, or insufficient precision, which hinders their applicability to complex whole-body tasks. To address these challenges, we introduce HumDex, a portable teleoperation system designed for humanoid whole-body dexterous manipulation. Our system leverages IMU-based motion tracking to address the portability-precision trade-off, enabling accurate full-body tracking while remaining easy to deploy. For dexterous hand control, we further introduce a learning-based retargeting method that generates smooth and natural hand motions without manual parameter tuning. Beyond teleoperation, HumDex enables efficient collection of human motion data. Building on this capability, we propose a two-stage imitation learning framework that first pre-trains on diverse human motion data to learn generalizable priors, and then fine-tunes on robot data to bridge the embodiment gap for precise execution. We demonstrate that this approach significantly improves generalization to new configurations, objects, and backgrounds with minimal data acquisition costs. The entire system is fully reproducible and open-sourced at https://github.com/physical-superintelligence-lab/HumDex.

Figure 1: The HumDex System. Our portable teleoperation system enables efficient collection of high-quality dexterous manipulation data. Left: We demonstrate data collection and autonomous policy execution on challenging tasks featuring dexterous manipulation, bimanual coordination, long-horizon planning, deformable and articulated object manipulation, and whole-body movement. Middle: We use a Unitree-G1 humanoid and two 20-DoF dexterous hands.
Right: By pretraining the robot policy on diverse human data, our policy generalizes to new positions, objects, and backgrounds unseen in robot data.

I Introduction

Humanoid dexterous manipulation holds great promise for enabling robots to perform complex, long-horizon loco-manipulation tasks in the real world. Current robotic systems often resort to imitation learning [6, 28, 10, 11, 8], which has shown great success in acquiring complex manipulation skills. These methods rely heavily on high-quality task demonstration data collected through costly robot teleoperation. However, acquiring such data for humanoid robots with dexterous hands remains a critical bottleneck due to their complex morphology.

While huge progress has been made on table-top robot data collection [29, 21, 5], teleoperation systems for humanoid robots and dexterous hands are far less mature. Previous efforts with different hardware solutions each exhibit their own limitations and trade-offs. Motion-capture-based [26] (e.g., optical tracking) or exoskeleton-based [2] systems can achieve high accuracy but require fixed infrastructure, which severely limits the environments in which data can be collected. In contrast, VR-based alternatives [27, 9, 12, 14] offer greater portability but suffer from reduced accuracy and occlusion issues. For instance, operators’ hands must remain within the sensors’ field of view to maintain tracking stability, constraining the range of feasible motions and, consequently, the set of tasks that can be demonstrated. Furthermore, despite recent advances in humanoid motion retargeting and low-level locomotion policies [26, 2, 27, 9, 12, 14], dexterous hand control still largely relies on optimization-based retargeting, leading to reduced accuracy and limited generalization.

In this work, we introduce HumDex (Fig. 1), a portable motion-tracker-based teleoperation system for whole-body dexterous manipulation.
Our system addresses the portability-precision trade-off by leveraging IMU-based tracking, enabling high-precision tracking while maintaining portability. For dexterous hand control, we propose a learning-based retargeting system trained on collected teleoperation data, which produces smooth and natural hand motions without manual parameter tuning. Compared to previous optimization-based alternatives, this hand retargeting method achieves significantly better performance in real-world deployment. We demonstrate the effectiveness of HumDex on a suite of challenging tasks involving whole-body motion, bimanual coordination, and fine-grained dexterous manipulation. Overall, our system enables faster demonstration collection, higher success rates, and improved data quality relative to existing approaches.

Beyond teleoperation, our tracking system also enables efficient collection of human data for the same tasks. Human data collection is more efficient than teleoperation and thus serves as an additional data source for pre-training or co-training. However, due to embodiment gaps, directly retargeting human motion to the humanoid produces inaccurate movements, which often cause manipulation failures. Consequently, prior work on tabletop dexterous manipulation applies alignment [7, 17] or correction strategies [20] to mitigate this gap, while prior work on humanoid manipulation relies solely on teleoperation data [9, 27, 4].

To effectively leverage the diversity and motion priors in human data without explicit alignment, we propose a two-stage imitation learning framework. First, we train the policy on human demonstrations collected in diverse settings, where we use the retargeting results as joint targets and approximate the proprioceptive states with the previous action. Then, we fine-tune on robot teleoperation data only, refining movements towards the robot embodiment.
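
The two-stage recipe can be illustrated with a minimal sketch. This is not the HumDex implementation: the linear policy, the synthetic data, the names `build_samples` and `fit`, and the choice to let frame 0 reuse its own action are all assumptions made for illustration, and the visual observations used by the actual policy are omitted. The sketch shows only the data handling described above: human trajectories lack robot proprioception, so the previous-frame action stands in for the state during pre-training, while fine-tuning uses measured robot states.

```python
import numpy as np

def build_samples(actions, proprio=None):
    """Return (states, targets) arrays for behavior cloning.

    actions: (T, D) joint targets (retargeted human motion or robot commands).
    proprio: (T, D) measured joint states, or None for human data.
    """
    actions = np.asarray(actions, dtype=float)
    if proprio is None:
        # Human data has no robot proprioception: use the previous-frame
        # action as a stand-in. Frame 0 reuses its own action (assumption).
        states = np.vstack([actions[:1], actions[:-1]])
    else:
        states = np.asarray(proprio, dtype=float)
    return states, actions

def fit(weights, states, targets, lr=0.05, steps=200):
    """Gradient descent on mean squared error for a toy linear policy."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2.0 * states.T @ (states @ w - targets) / len(states)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
human_actions = rng.normal(size=(50, 3))  # stand-in for retargeted human motion
robot_actions = rng.normal(size=(10, 3))  # stand-in for teleoperation commands
robot_proprio = robot_actions + 0.01 * rng.normal(size=(10, 3))

w_init = np.zeros((3, 3))
# Stage 1: pre-train on human data with approximated proprioception.
w_pre = fit(w_init, *build_samples(human_actions))
# Stage 2: fine-tune on robot data only, starting from the pre-trained weights.
w_fin = fit(w_pre, *build_samples(robot_actions, robot_proprio))
```

Starting stage 2 from `w_pre` rather than from scratch is the point of the sketch: the fine-tuning step only has to adjust a policy that already encodes the human motion prior.
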
As shown in Table III, our approach achieves successful task execution while retaining generalization to new object positions, categories, and backgrounds without requiring robot data under those settings.

In summary, our contributions are: (1) a portable and efficient teleoperation system for humanoid dexterous manipulation, (2) a learning-based hand retargeting method, and (3) a two-stage training pipeline that leverages human data to improve generalization while reducing the need for teleoperation data.

Figure 2: System Overview. (A) Our teleoperation pipeline and hand retargeting policy training. (B) Our imitation learning policy architecture. We approximate proprioceptive states missing in human data with previous-frame actions.

II Related Works

II-A Humanoid Whole-Body Dexterous Teleoperation

Existing humanoid whole-body teleoperation systems can be categorized by their tracking hardware. Motion-capture-based [26] and exoskeleton-based [2] systems achieve high tracking accuracy but suffer from portability issues: mocap requires a dedicated room setup, while exoskeletons are heavy and typically require seated operation. Vision-based and VR-based systems [27, 9, 14, 12] offer better portability but suffer from occlusion: operators must keep their hands visible at all times, restricting feasible motions. In this work, we adopt IMU-based motion tracking, which consists of only 15 lightweight trackers worn on the body, providing unconstrained motion capture with high tracking quality.

Beyond hardware, we also investigate a more challenging robot configuration. Prior whole-body teleoperation works employ simplified end-effectors such as parallel grippers or three-fingered hands, often controlled via binary open/close signals (e.g., VR controller triggers). Consequently, demonstrated tasks are limited to simple object interactions such as pick-and-place.
In contrast, our system supports full dexterous control of a 20-DoF hand, enabling fine-grained manipulation such as grasping a handheld barcode scanner and pulling its trigger, while simultaneously supporting whole-body movement.

II-B Dexterous Hand Retargeting

Dexterous hand retargeting is a key component of teleoperation-driven demonstration collection systems. It maps human hand features to a robot hand under substantial embodiment gaps. A common approach is optimization-based retargeting, which formulates the mapping as a constrained inverse kinematics (IK) or nonlinear least squares problem. In these formulations, objective terms typically preserve task-relevant geometric relations, while constraints enforce robot executability. To improve stability in teleoperation and contact-rich manipulation, many methods additionally incorporate temporal-consistency regularizers and contact-consistency/interpenetration penalties to discourage implausible hand-object penetrations and stabilize interaction [15, 13, 23].

In contrast, learning-based approaches predict robot hand configurations directly from human observations, reducing reliance on hand-crafted objectives and enabling constant-time inference. GeoRT [24] proposes an ultrafast neural retargeting approach guided by principled geometric criteria, achieving real-time performance without test-time optimization and supporting scalable teleoperation pipelines [24, 25]. Our approach follows this learning-based direction with a lightweight supervised formulation. Given the

Research focus

Robotics & Embodied AI (1)

Frequent co-authors

Liang Heng (1), Yihe Tang (1), Henghui Bao (1), Di Huang (1)

Papers (1)

Mar 12, 2026 · also Beihang

HumDex: Humanoid Dexterous Manipulation Made Easy

A portable IMU-based teleoperation system slashes the data requirements for humanoid robot manipulation by pre-training on human motion and then fine-tuning on robot data.

Liang Heng, Yihe Tang, Jiajun Xu +2

Robotics & Embodied AI
