The paper introduces UniHand, a unified diffusion-based framework for both estimating and generating 4D hand motion, addressing a limitation of existing approaches, which treat these as separate tasks. UniHand uses a joint variational autoencoder to embed structured inputs (MANO parameters, 2D skeletons) into a shared latent space and encodes visual observations with a frozen vision backbone, so that diverse condition signals can be used together. Experiments on multiple benchmarks show that UniHand remains robust and accurate under occlusion and temporally incomplete inputs, for both estimation and generation.
By unifying hand motion estimation and generation into a single diffusion framework, UniHand handles heterogeneous inputs and challenging conditions like occlusions better than task-specific models.
Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons. Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines. A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions. Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.
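To make the idea of "estimation and generation as one conditional synthesis problem" concrete, here is a minimal sketch of how heterogeneous, partially missing conditions might be packed into one per-frame conditioning vector. It is not the paper's implementation: the random linear maps stand in for the learned joint VAE encoders and the frozen vision backbone, the averaging and mask layout are assumptions, and `ConditionEmbedder`, `INPUT_DIMS`, and `LATENT_DIM` are hypothetical names with illustrative sizes.

```python
import random
from typing import Dict, List, Optional

Vector = List[float]

# Illustrative sizes only (assumptions, not taken from the paper).
INPUT_DIMS = {"mano": 58, "skeleton_2d": 42, "image_feat": 16}
LATENT_DIM = 8


def make_projection(in_dim: int, out_dim: int, seed: int) -> List[Vector]:
    """Random linear map standing in for a learned per-modality encoder."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights: List[Vector], x: Vector) -> Vector:
    """Apply a linear map (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ConditionEmbedder:
    """Hypothetical helper: maps whichever conditions are present into one space."""

    def __init__(self, input_dims: Dict[str, int], latent_dim: int = LATENT_DIM):
        self.input_dims = dict(input_dims)
        self.latent_dim = latent_dim
        self.names = sorted(input_dims)  # fixed order for the presence mask
        self.weights = {
            name: make_projection(input_dims[name], latent_dim, seed=i)
            for i, name in enumerate(self.names)
        }

    def embed_frame(self, frame: Dict[str, Optional[Vector]]) -> Vector:
        """Average the embeddings of the present modalities, then append a mask."""
        pooled = [0.0] * self.latent_dim
        mask: Vector = []
        for name in self.names:
            x = frame.get(name)
            if x is None:
                mask.append(0.0)  # modality missing for this frame
                continue
            if len(x) != self.input_dims[name]:
                raise ValueError(f"{name}: expected {self.input_dims[name]} values")
            mask.append(1.0)
            z = project(self.weights[name], x)
            pooled = [p + zi for p, zi in zip(pooled, z)]
        count = sum(mask)
        if count:
            pooled = [p / count for p in pooled]
        return pooled + mask

    def embed_sequence(self, frames: List[Dict[str, Optional[Vector]]]) -> List[Vector]:
        return [self.embed_frame(f) for f in frames]


embedder = ConditionEmbedder(INPUT_DIMS)

# Estimation-style input: image features on every frame.
estimation = [{"image_feat": [0.1] * 16} for _ in range(4)]
# Generation-style input (infilling): MANO keyframes at both ends, nothing between.
keyframe = {"mano": [0.0] * 58}
infilling = [keyframe, {}, {}, keyframe]

est_cond = embedder.embed_sequence(estimation)
gen_cond = embedder.embed_sequence(infilling)
```

Both sequences come out with the same shape (one vector of `LATENT_DIM + 3` values per frame), so a single conditional denoiser could in principle consume either one; the trailing mask entries tell it which modalities were actually observed in each frame.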