This paper introduces a two-stage framework for 3D hand pose estimation from monocular RGB images that uses gesture semantics as an inductive bias. The first stage is gesture-aware pretraining, which uses coarse and fine gesture labels from InterHand2.6M to learn an informative embedding space. The second stage is a per-joint token Transformer, guided by the gesture embeddings, that regresses MANO hand parameters. The framework improves single-hand accuracy over the EANet baseline on InterHand2.6M.
Gesture-aware pretraining consistently improves single-hand 3D pose accuracy over the EANet baseline on InterHand2.6M, suggesting that semantic gesture information acts as a powerful inductive bias.
Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human-computer interaction, and sign language understanding. In this work we focus on a scenario where a discrete set of gesture labels is available and show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: gesture-aware pretraining that learns an informative embedding space using coarse and fine gesture labels from InterHand2.6M, followed by a per-joint token Transformer guided by gesture embeddings as intermediate representations for final regression of MANO hand parameters. Training is driven by a layered objective over parameters, joints, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet baseline, and that the benefit transfers across architectures without any modification.
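The abstract does not spell out the layered objective, so the following is only a minimal sketch of how losses over parameters, joints, and structural constraints might be combined. The specific terms (an L1 parameter loss, a mean per-joint distance, and a bone-length consistency penalty), the 21-joint skeleton ordering in `PARENTS`, and the default weights are all assumptions made for illustration, not the paper's actual formulation.

```python
import math

# Hypothetical 21-joint hand skeleton (assumption): joint 0 is the wrist and
# each of the five fingers is a chain of four joints attached to it.
PARENTS = [-1] + [0 if k == 0 else 4 * f + k for f in range(5) for k in range(4)]


def param_loss(pred, target):
    """Mean absolute error over flat parameter vectors (e.g. pose and shape)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def joint_loss(pred, target):
    """Mean Euclidean distance between predicted and target 3D joints."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)


def bone_lengths(joints):
    """Length of every bone, i.e. the distance from each joint to its parent."""
    return [math.dist(joints[i], joints[p]) for i, p in enumerate(PARENTS) if p >= 0]


def structure_loss(pred, target):
    """Mean absolute bone-length difference, one possible structural constraint."""
    pred_bones, target_bones = bone_lengths(pred), bone_lengths(target)
    return sum(abs(a - b) for a, b in zip(pred_bones, target_bones)) / len(target_bones)


def layered_loss(pred_params, gt_params, pred_joints, gt_joints,
                 weights=(1.0, 1.0, 0.1)):
    """Weighted sum of the three terms; the weights are illustrative only."""
    w_param, w_joint, w_struct = weights
    terms = {
        "params": param_loss(pred_params, gt_params),
        "joints": joint_loss(pred_joints, gt_joints),
        "structure": structure_loss(pred_joints, gt_joints),
    }
    total = (w_param * terms["params"]
             + w_joint * terms["joints"]
             + w_struct * terms["structure"])
    return total, terms
```

In this sketch, a prediction that is a rigid translation of the ground-truth joints incurs a joint penalty but no structural penalty, because bone lengths are unchanged. This illustrates why a structural term can complement a purely positional one.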