Fudan | Independent Researcher | Shanghai Innovation | Feb 25, 2026 | arXiv:2602.21631

UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling

Zhihao Sun, Tong Wu, Ruirui Tu, Daoguo Dong

AI Summary

The paper introduces UniHand, a unified diffusion-based framework for both estimating and generating 4D hand motion, addressing a limitation of existing approaches, which treat these as separate tasks. UniHand uses a joint variational autoencoder to embed structured inputs (MANO parameters, 2D skeletons) into a shared latent space, while visual observations are encoded by a frozen vision backbone with a dedicated hand perceptron that extracts hand-specific cues. A latent diffusion model then synthesizes motion from these conditions. Experiments on multiple benchmarks show that UniHand remains robust and accurate under severe occlusions and temporally incomplete inputs.

Key Contribution

UniHand unifies hand motion estimation and generation in a single diffusion framework that accepts heterogeneous inputs and stays robust under challenging conditions such as occlusions and temporally incomplete sequences.

Abstract

Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons. Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines. A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions. Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.
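
As a rough illustration of the "conditional motion synthesis" formulation described in the abstract, the sketch below shows two generic ingredients of a conditional latent diffusion setup: noising a clean motion latent, and packaging a per-frame condition sequence in which some frames are unobserved. It is not the authors' implementation. The abstract does not specify the noise schedule, latent dimensions, or how missing conditions are represented, so the linear beta schedule, the zero-fill-plus-validity-flag masking, and the helper names `linear_alpha_bar`, `add_noise`, and `mask_condition` are all assumptions made for illustration.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    The schedule and its endpoints are assumptions (common DDPM defaults),
    not values taken from the paper.
    """
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(z0, t, alpha_bar, rng):
    """Sample z_t = sqrt(ab_t) * z_0 + sqrt(1 - ab_t) * eps for a latent sequence.

    z0 is a list of frames, each a list of floats (a hypothetical motion latent).
    Returns the noised latent and the noise a denoiser would be trained to predict.
    """
    a = math.sqrt(alpha_bar[t])
    s = math.sqrt(1.0 - alpha_bar[t])
    eps = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in z0]
    zt = [[a * z + s * e for z, e in zip(fz, fe)] for fz, fe in zip(z0, eps)]
    return zt, eps


def mask_condition(cond, observed):
    """Zero-fill unobserved condition frames and append a validity flag.

    This is one simple, assumed way to represent temporally incomplete
    conditions (e.g. frames where the hand is occluded or absent).
    """
    return [
        list(c) + [1.0] if ok else [0.0] * len(c) + [0.0]
        for c, ok in zip(cond, observed)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bar = linear_alpha_bar(1000)
    z0 = [[0.1, -0.3, 0.7] for _ in range(4)]      # 4 frames, 3-dim toy latent
    zt, eps = add_noise(z0, 500, alpha_bar, rng)
    cond = mask_condition([[1.0, 2.0]] * 4, [True, True, False, True])
    print(len(zt), len(zt[0]), cond[2])
```

In such a setup, a denoising network would receive the noised latent, the timestep, and the masked condition, and would be trained to predict the noise; estimation and generation then differ only in which conditions are marked as observed.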

Computer Vision, Multimodal Models, Robotics & Embodied AI
