The paper introduces HiSync, a framework for Command Source Identification (CSI) in long-range Human-Robot Interaction (HRI) that fuses robot-mounted camera optical flow with hand-worn IMU signals. HiSync extracts frequency-domain hand motion features from both modalities, uses a learned CSINet to denoise IMU readings and temporally align modalities, and performs distance-aware multi-window fusion to compute cross-modal similarity. Evaluated in three-person scenes up to 34m, HiSync achieves 92.32% CSI accuracy, significantly outperforming the state-of-the-art.
Identifies who is commanding a robot from 34 meters away with 92% accuracy by fusing IMU and camera data, a 48% leap over prior art.
Long-range Human-Robot Interaction (HRI) remains underexplored. Within it, Command Source Identification (CSI), i.e., determining who issued a command, is especially challenging due to multi-user and distance-induced sensor ambiguity. We introduce HiSync, an optical-inertial fusion framework that treats hand motion as a binding cue by aligning robot-mounted camera optical flow with hand-worn IMU signals. We first elicit a user-defined (N=12) gesture set and collect a multimodal command gesture dataset (N=38) in long-range multi-user HRI scenarios. HiSync then extracts frequency-domain hand motion features from both camera and IMU data, and a learned CSINet denoises IMU readings, temporally aligns the modalities, and performs distance-aware multi-window fusion to compute cross-modal similarity of subtle, natural gestures, enabling robust CSI. In three-person scenes up to 34m, HiSync achieves 92.32% CSI accuracy, outperforming the prior SOTA by 48.44%. HiSync is also validated in a real-robot deployment. By making CSI reliable and natural, HiSync provides a practical primitive and design guidance for public-space HRI.
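The core idea, matching the IMU trace of each candidate's hand against the optical-flow trace observed by the robot, can be sketched in a few lines. The abstract does not specify the exact features or fusion weights, so the snippet below is a minimal illustration under stated assumptions: frequency features are approximated as normalized band energies of the FFT magnitude spectrum, CSINet's learned denoising and alignment are omitted, and the "distance-aware multi-window fusion" is stood in for by a weighted average of per-window cosine similarities. All names (`band_energy_features`, `multiwindow_similarity`) and parameters are hypothetical, not from the paper.

```python
import numpy as np

def band_energy_features(signal, fs, n_bands=8):
    # Illustrative frequency-domain features: total energy in log-spaced
    # bands of the FFT magnitude spectrum (not the paper's exact features).
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    edges = np.logspace(np.log10(0.5), np.log10(fs / 2), n_bands + 1)
    feats = np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                      for lo, hi in zip(edges[:-1], edges[1:])])
    # Normalize so camera-flow and IMU scale differences cancel out.
    return feats / (feats.sum() + 1e-8)

def multiwindow_similarity(flow_sig, imu_sig, fs,
                           windows=(64, 128, 256), weights=None):
    # Stand-in for distance-aware multi-window fusion: average cosine
    # similarity of per-window frequency features at several window sizes,
    # combined with (here uniform, in the paper presumably distance-dependent)
    # weights.
    weights = weights if weights is not None else [1.0 / len(windows)] * len(windows)
    sims = []
    for w in windows:
        n = min(len(flow_sig), len(imu_sig)) // w
        if n == 0:
            sims.append(0.0)
            continue
        per_win = []
        for i in range(n):
            a = band_energy_features(flow_sig[i * w:(i + 1) * w], fs)
            b = band_energy_features(imu_sig[i * w:(i + 1) * w], fs)
            per_win.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        sims.append(float(np.mean(per_win)))
    return float(np.dot(weights, sims))
```

In use, the robot would score the observed optical-flow trace against each candidate's IMU stream and attribute the command to the highest-scoring user; a hand whose worn IMU oscillates at the same frequencies as the observed flow scores higher than one moving differently.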