HKUST | Xiaohongshu | May 27, 2026 | arXiv:2605.28490

SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

Jiawei Li, Ziyi Liu, Weijie Shi, Long Chen, Jiajie Xu, Xiaofang Zhou

AI Summary

The paper introduces Structured Spatial Reasoning 3D-LLM (SSR3D-LLM), a novel structured grounding interface for unified 3D-LLMs that addresses the limitations of single-pointer grounding decisions in fine-grained 3D object localization. SSR3D-LLM uses an LLM to generate a sequence of latent spatial reasoning steps and memory tokens, which are then used by a geometry-aware scorer to iteratively refine candidate object rankings. Experiments on ReferIt3D, ScanRefer, and Multi3DRef demonstrate that SSR3D-LLM achieves state-of-the-art results among unified 3D-LLM baselines, particularly on fine-grained grounding tasks.

Key Contribution

SSR3D-LLM replaces the single pointer-style grounding decision with a sequence of latent spatial reasoning steps that iteratively refine candidate rankings. It outperforms the single-pointer QPG baseline on fine-grained grounding and achieves the strongest results among unified 3D-LLM baselines.

Abstract

3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision that compresses a relational instruction into one selection. This is brittle for fine-grained queries where multiple same-class candidates must be ruled out by context objects and spatial relations. We propose Structured Spatial Reasoning 3D-LLM (SSR3D-LLM), a structured grounding interface for unified 3D-LLMs. Given fixed Mask3D object proposals, the LLM writes a sequence of latent spatial reasoning steps and memory tokens from the query, and a geometry-aware scorer reads these latent steps in order to refine candidate rankings step by step with step-length masking. The latent steps are learned from standard benchmark target supervision with auxiliary referential-cue supervision during training, while inference uses only the input query and Mask3D proposals. Across ReferIt3D, ScanRefer, and Multi3DRef, SSR3D-LLM achieves the strongest results among unified 3D-LLM baselines, with substantial gains over the single-pointer QPG baseline on fine-grained grounding and consistent improvements over prior unified 3D-LLMs, while preserving the default language-task route.
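The step-by-step refinement with step-length masking described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the form of the geometry-aware scorer, so the dot-product scoring, the additive score update, the two-dimensional features, and the names `refine_rankings`, `candidate_feats`, `latent_steps`, and `step_mask` are all assumptions made for illustration.

```python
def dot(a, b):
    """Plain dot product, standing in for a learned scorer (assumption)."""
    return sum(x * y for x, y in zip(a, b))


def refine_rankings(candidate_feats, latent_steps, step_mask):
    """Accumulate per-candidate scores over the valid latent steps.

    candidate_feats: one feature vector per object proposal (hypothetical).
    latent_steps:    one vector per reasoning step, padded to a fixed length.
    step_mask:       1 for real steps, 0 for padding (step-length masking).
    Returns the final scores and the ranking after each valid step.
    """
    scores = [0.0] * len(candidate_feats)
    rankings = []
    for step_vec, valid in zip(latent_steps, step_mask):
        if not valid:
            continue  # padded steps leave the scores untouched
        scores = [s + dot(f, step_vec) for s, f in zip(scores, candidate_feats)]
        # Candidate indices sorted from best to worst after this step.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        rankings.append(order)
    return scores, rankings


# Toy query: "the chair near the table". Features are [is_chair, near_table].
candidate_feats = [[1, 0], [1, 1], [0, 1]]  # far chair, near chair, lamp by table
latent_steps = [[1, 0], [0, 1], [5, 5]]     # class cue, relation cue, padding
step_mask = [1, 1, 0]                       # only the first two steps are real

scores, rankings = refine_rankings(candidate_feats, latent_steps, step_mask)
best = rankings[-1][0]  # index of the top-ranked candidate after the last step
```

In this toy setup the first step cannot separate the two chairs, and only the second, relational step picks out the chair near the table. That is the kind of same-class ambiguity the abstract says a single pointer-style decision handles poorly.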

Computer Vision | Multimodal Models | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
