The paper introduces Structured Spatial Reasoning 3D-LLM (SSR3D-LLM), a novel structured grounding interface for unified 3D-LLMs that addresses the limitations of single-pointer grounding decisions in fine-grained 3D object localization. SSR3D-LLM uses an LLM to generate a sequence of latent spatial reasoning steps and memory tokens, which are then used by a geometry-aware scorer to iteratively refine candidate object rankings. Experiments on ReferIt3D, ScanRefer, and Multi3DRef demonstrate that SSR3D-LLM achieves state-of-the-art results among unified 3D-LLM baselines, particularly on fine-grained grounding tasks.
Fine-grained 3D object grounding gets a boost: SSR3D-LLM uses latent spatial reasoning steps to iteratively refine candidate rankings, outperforming single-pointer methods and setting a new standard for unified 3D-LLMs.
3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision that compresses a relational instruction into one selection. This is brittle for fine-grained queries where multiple same-class candidates must be ruled out by context objects and spatial relations. We propose Structured Spatial Reasoning 3D-LLM (SSR3D-LLM), a structured grounding interface for unified 3D-LLMs. Given fixed Mask3D object proposals, the LLM writes a sequence of latent spatial reasoning steps and memory tokens from the query, and a geometry-aware scorer reads these latent steps sequentially, refining the candidate rankings step by step with step-length masking. The latent steps are learned from standard benchmark target supervision with auxiliary referential-cue supervision during training, while inference uses only the input query and Mask3D proposals. Across ReferIt3D, ScanRefer, and Multi3DRef, SSR3D-LLM achieves the strongest results among unified 3D-LLM baselines, with substantial gains over the single-pointer QPG baseline on fine-grained grounding and consistent improvements over prior unified 3D-LLMs, while preserving the default language-task route.
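The step-by-step refinement with step-length masking can be illustrated with a toy sketch. This is not the paper's scorer: it assumes each latent step is a plain vector, that a step scores a candidate by a dot product with hand-made candidate features, and that scores accumulate additively. The names `refine_rankings`, `candidate_feats`, `latent_steps`, and `step_len` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def refine_rankings(
    candidate_feats: List[List[float]],
    latent_steps: List[List[float]],
    step_len: int,
) -> Tuple[List[float], List[List[int]]]:
    """Accumulate candidate scores over latent steps, read in order.

    Steps at index >= step_len are treated as padding and masked out,
    so they leave the scores unchanged (assumed form of step-length masking).
    Returns the final scores and the candidate ranking after every step.
    """
    scores = [0.0] * len(candidate_feats)
    history = []
    for k, step in enumerate(latent_steps):
        mask = 1.0 if k < step_len else 0.0
        scores = [s + mask * dot(step, f) for s, f in zip(scores, candidate_feats)]
        # Rank candidates by descending score; ties keep proposal order.
        history.append(sorted(range(len(scores)), key=lambda i: -scores[i]))
    return scores, history


# Toy scene: feature 0 = "is a chair", feature 1 = "is near the window".
candidate_feats = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
# Two real steps (class cue, then relation cue) plus one padded step.
latent_steps = [[1.0, 0.0], [0.0, 1.0], [-5.0, -5.0]]

scores, history = refine_rankings(candidate_feats, latent_steps, step_len=2)
print(scores)       # final accumulated scores per candidate
print(history[-1])  # final ranking of candidate indices
```

In this toy scene the first step cannot separate the two chairs, and the second step's relation cue moves the chair near the window to the top. The padded third step is masked, so it has no effect on the final ranking. A single pointer-style decision would have to make the same selection in one shot.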