Grasp in Gaussians (GraG) reconstructs dynamic 3D hand-object interactions from monocular video by efficiently tracking hand and object motion after initialization from pretrained models. The method uses a compact Sum-of-Gaussians (SoG) representation for objects, initialized from a video-adapted SAM3D pipeline, and refines hand motion through 2D joint and depth alignment. GraG achieves a 6.4x speedup over prior work while improving object reconstruction by 13.4% and reducing hand joint error by over 65%.
Reconstructing dynamic hand-object interactions from monocular video can be 6.4x faster and significantly more accurate by ditching heavy neural representations for a revived Sum-of-Gaussians approach.
We present Grasp in Gaussians (GraG), a fast and robust method for reconstructing dynamic 3D hand-object interactions from a single monocular video. Unlike recent approaches that optimize heavy neural representations, our method focuses on efficiently tracking the hand and the object once they are initialized from large pretrained models. Our key insight is that accurate and temporally stable hand-object motion can be recovered using a compact Sum-of-Gaussians (SoG) representation, revived from classical tracking literature and integrated with generative Gaussian-based initializations. We initialize object pose and geometry using a video-adapted SAM3D pipeline, then convert the resulting dense Gaussian representation into a lightweight SoG via subsampling. This compact representation enables efficient tracking while preserving geometric fidelity. For the hand, we adopt a complementary strategy: starting from an off-the-shelf monocular hand pose initialization, we refine hand motion using simple yet effective 2D joint and depth alignment losses, avoiding per-frame refinement of a detailed 3D hand appearance model while maintaining stable articulation. Extensive experiments on public benchmarks demonstrate that GraG reconstructs temporally coherent hand-object interactions on long sequences 6.4x faster than prior work while improving object reconstruction by 13.4% and reducing the hand's per-joint position error by over 65%.
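To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract states only that the dense Gaussians are subsampled and that the hand is refined with 2D joint and depth alignment losses. The rest is assumed for illustration: farthest-point sampling as the subsampling rule, isotropic widths set from nearest-neighbor spacing, squared-error loss terms, a pinhole intrinsics matrix `K`, and the hypothetical function names `farthest_point_subsample`, `dense_to_sog`, and `hand_alignment_loss`.

```python
import numpy as np


def farthest_point_subsample(means, k, seed=0):
    """Return indices of k well-spread points from an (N, 3) array of centers.

    Assumption: farthest-point sampling stands in for the unspecified
    subsampling step; the abstract does not name a scheme.
    """
    rng = np.random.default_rng(seed)
    n = means.shape[0]
    k = min(k, n)
    chosen = np.empty(k, dtype=np.int64)
    chosen[0] = rng.integers(n)
    # Distance from every point to its nearest already-chosen center.
    dist = np.linalg.norm(means - means[chosen[0]], axis=1)
    for i in range(1, k):
        chosen[i] = int(np.argmax(dist))
        new_dist = np.linalg.norm(means - means[chosen[i]], axis=1)
        dist = np.minimum(dist, new_dist)
    return chosen


def dense_to_sog(means, k):
    """Reduce dense Gaussian centers to k isotropic Gaussians (requires k >= 2).

    Assumption: each kept Gaussian gets sigma equal to half the distance to
    its nearest kept neighbor, a simple heuristic chosen for this sketch.
    """
    idx = farthest_point_subsample(means, k)
    centers = means[idx]
    pairwise = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(pairwise, np.inf)
    sigmas = 0.5 * pairwise.min(axis=1)
    return centers, sigmas


def hand_alignment_loss(joints_cam, joints_2d, joint_depth, K, w_depth=1.0):
    """Squared-error 2D reprojection term plus a depth alignment term.

    joints_cam:  (J, 3) hand joints in camera coordinates (being refined).
    joints_2d:   (J, 2) detected 2D joint locations in pixels.
    joint_depth: (J,)   target depth per joint, e.g. sampled from a depth map.
    K:           (3, 3) pinhole intrinsics (assumed camera model).
    """
    uvw = joints_cam @ K.T
    projected = uvw[:, :2] / uvw[:, 2:3]
    loss_2d = np.mean(np.sum((projected - joints_2d) ** 2, axis=1))
    loss_depth = np.mean((joints_cam[:, 2] - joint_depth) ** 2)
    return loss_2d + w_depth * loss_depth
```

In a tracker built along these lines, `dense_to_sog` would run once after initialization, while `hand_alignment_loss` would be minimized per frame over the hand pose parameters. The weight `w_depth` is a free parameter of this sketch, not a value from the paper.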