Google Research · Max Planck · Apr 14, 2026 · arXiv:2604.12929

Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions

Ayce Idil Aytekin, Xu Chen, Zhengyang Shen, Thabo Beeler, Helge Rhodin, Rishabh Dabral, Christian Theobalt

AI Summary

Grasp in Gaussians (GraG) reconstructs dynamic 3D hand-object interactions from monocular video by efficiently tracking hand and object motion after initialization from pretrained models. The method uses a compact Sum-of-Gaussians (SoG) representation for objects, initialized from a video-adapted SAM3D pipeline, and refines hand motion through 2D joint and depth alignment. GraG achieves a 6.4x speedup over prior work while improving object reconstruction by 13.4% and reducing hand joint error by over 65%.
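The hand refinement described above can be illustrated with a minimal sketch of a combined 2D joint and depth alignment objective. This is not the authors' implementation: the function names, the per-joint depth target, the squared-error form, and the weight `w_depth` are all assumptions made for illustration, since the summary does not specify how the losses are defined.

```python
import numpy as np

def project(points_cam, K):
    """Pinhole projection of (J, 3) camera-space points to (J, 2) pixels."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def hand_alignment_loss(joints_cam, joints_2d, depth_target, K, w_depth=1.0):
    """Hypothetical objective: 2D reprojection error plus depth error.

    joints_cam:   (J, 3) current 3D hand joints in camera coordinates
    joints_2d:    (J, 2) detected 2D joint locations in pixels
    depth_target: (J,)   assumed per-joint depth observations (e.g. sampled
                         from an estimated depth map; an assumption here)
    K:            (3, 3) camera intrinsics
    """
    # Mean squared pixel distance between projected and detected joints.
    reproj = np.mean(np.sum((project(joints_cam, K) - joints_2d) ** 2, axis=1))
    # Mean squared difference between joint depth (z) and its target.
    depth = np.mean((joints_cam[:, 2] - depth_target) ** 2)
    return reproj + w_depth * depth
```

In a tracker, a loss of this shape would be minimized over the hand pose parameters per frame; the sketch only evaluates it for given joint positions.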

Key Contribution

Reconstructing dynamic hand-object interactions from monocular video can be made 6.4x faster and more accurate (13.4% better object reconstruction, over 65% lower hand joint error) by replacing heavy neural representations with a revived Sum-of-Gaussians representation.

Abstract

We present Grasp in Gaussians (GraG), a fast and robust method for reconstructing dynamic 3D hand-object interactions from a single monocular video. Unlike recent approaches that optimize heavy neural representations, our method focuses on tracking the hand and the object efficiently, once initialized from pretrained large models. Our key insight is that accurate and temporally stable hand-object motion can be recovered using a compact Sum-of-Gaussians (SoG) representation, revived from classical tracking literature and integrated with generative Gaussian-based initializations. We initialize object pose and geometry using a video-adapted SAM3D pipeline, then convert the resulting dense Gaussian representation into a lightweight SoG via subsampling. This compact representation enables efficient tracking while preserving geometric fidelity. For the hand, we adopt a complementary strategy: starting from off-the-shelf monocular hand pose initialization, we refine hand motion using simple yet effective 2D joint and depth alignment losses, avoiding per-frame refinement of a detailed 3D hand appearance model while maintaining stable articulation. Extensive experiments on public benchmarks demonstrate that GraG reconstructs temporally coherent hand-object interactions on long sequences 6.4x faster than prior work while improving object reconstruction by 13.4% and reducing the hand's per-joint position error by over 65%.
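The dense-to-SoG conversion mentioned in the abstract can be pictured with a small sketch. The abstract says only that the conversion is done "via subsampling"; everything else here is an assumption for illustration: the opacity-weighted random sampling, the isotropic Gaussians, the nearest-neighbor rule for choosing each standard deviation, and the name `subsample_to_sog`.

```python
import numpy as np

def subsample_to_sog(means, opacities, k, seed=0):
    """Pick k of N dense Gaussians and return a compact isotropic SoG.

    means:     (N, 3) centers of the dense Gaussians
    opacities: (N,)   non-negative weights, used as sampling probabilities
                      (an assumed heuristic, not taken from the paper)
    Returns (centers (k, 3), sigmas (k,)).
    """
    rng = np.random.default_rng(seed)
    probs = opacities / opacities.sum()
    # Sample k distinct Gaussians, favoring the more opaque ones.
    idx = rng.choice(len(means), size=k, replace=False, p=probs)
    centers = means[idx]
    # Assumed rule: each sigma is the distance to the nearest other kept
    # center, so neighboring blobs roughly overlap.
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    sigmas = dists.min(axis=1)
    return centers, sigmas
```

Tracking then operates on `k` Gaussians instead of the full dense set, which is where a representation of this kind would save per-frame cost.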

Computer Vision · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
