This paper introduces GraspFoM, a unified framework that uses 3D foundation priors (SAM3D) to build a shared 3D object latent for both reconstruction and grasp pose prediction. An anchor-initialized truncated pose-reasoning diffuser predicts continuous, multimodal grasp poses from this latent, while the same framework reconstructs high-fidelity 3D assets. The authors report state-of-the-art results on both tasks with only a small number of additional trainable parameters, which suggests that reconstruction and grasping reinforce each other in robotic manipulation.
GraspFoM shows that a shared latent built on 3D foundation priors can improve both grasp pose prediction and 3D reconstruction fidelity while adding only a small number of trainable parameters.
Robotic grasping is a fundamental capability in robotic manipulation. Yet grasping remains challenging under partial observations. Reliable grasping depends on both local contact cues and object-level 3D structure. Existing geometry-aware grasping methods recognize the value of reconstruction, but they typically treat geometry as an intermediate prediction rather than a reusable object prior for grasping. In this paper, we present GraspFoM, a unified framework that leverages 3D foundation priors (SAM3D) to build a shared 3D object latent for both reconstruction and grasp pose prediction. Built on this shared object latent, we introduce an anchor-initialized truncated pose-reasoning diffuser that predicts continuous and multimodal grasp poses without directly relying on discrete grasp candidates. We further investigate the interaction between reconstruction and grasping through a reconstruction-aware scorer and a residual latent updater. Reconstruction provides grounded geometric cues, while grasp supervision refines the shared object latent toward grasp-relevant affordances. GraspFoM jointly predicts grasp poses and reconstructs high-fidelity 3D assets in mesh and 3DGS forms. Comprehensive experiments demonstrate that GraspFoM achieves state-of-the-art results on both reconstruction and grasping. Notably, these improvements require only a small number of additional trainable parameters. Component-wise ablation studies also demonstrate the contribution of each component.
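To make the idea of an anchor-initialized truncated diffuser more concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes that "truncated" means running only the last few steps of a reverse process, and that "anchor-initialized" means starting from lightly noised anchor poses rather than pure noise. Poses are reduced to 3D translations for brevity (real grasp poses also carry rotation), and `truncated_anchor_denoise`, `toy_denoiser`, `MODES`, and all parameter values are hypothetical names and choices made up for illustration. The hand-written `toy_denoiser` stands in for a learned network conditioned on the shared object latent.

```python
import math
import random

# Hypothetical "true" grasp locations; a learned model would infer these
# from the shared object latent instead of reading them from a constant.
MODES = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]


def toy_denoiser(pose, t):
    """Stand-in for a learned denoiser: move halfway to the nearest mode.

    `t` is the normalized timestep; this toy ignores it, whereas a real
    network would typically condition on it.
    """
    nearest = min(MODES, key=lambda m: math.dist(m, pose))
    return tuple(p + 0.5 * (m - p) for p, m in zip(pose, nearest))


def truncated_anchor_denoise(anchors, denoise_fn, start_step=4,
                             total_steps=20, noise_scale=0.05, seed=0):
    """Refine anchors using only the last `start_step` of `total_steps`."""
    rng = random.Random(seed)
    # Assumed noise level at the truncation point: a fraction of the full
    # schedule, so anchors are perturbed only slightly.
    sigma_start = noise_scale * start_step / total_steps
    poses = [tuple(x + rng.gauss(0.0, sigma_start) for x in a)
             for a in anchors]
    # Run the shortened reverse process from start_step down to 1.
    for step in range(start_step, 0, -1):
        t = step / total_steps
        poses = [denoise_fn(p, t) for p in poses]
    return poses


if __name__ == "__main__":
    anchors = [(0.1, 0.0, 0.0), (0.9, 0.1, 0.0)]
    for pose in truncated_anchor_denoise(anchors, toy_denoiser):
        print([round(x, 3) for x in pose])
```

In this sketch each anchor is refined independently, so anchors near different modes stay on different modes, which is one simple way a set of outputs can remain multimodal. The outputs are also continuous values rather than picks from a fixed candidate list. How GraspFoM actually chooses anchors, schedules noise, and parameterizes rotations is not specified in the abstract.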