The paper addresses the "2D semantic bias" in Vision-Language Models (VLMs) for 3D grounding, where models over-rely on 2D image features, hindering effective 3D geometric reasoning. To mitigate this, the authors introduce the What-Where Representation Re-Forming (W2R2) framework, which disentangles the roles of 2D and 3D features during training, assigning 2D features to semantic identification ("what") and 3D features to spatial localization ("where"). Experiments on ScanRefer and ScanQA show that W2R2 significantly improves localization accuracy and robustness, especially in complex scenes, without altering the inference architecture.
By forcing VLMs to treat 2D features as "what" and 3D features as "where," this method significantly boosts 3D grounding accuracy without modifying the model architecture.
Multimodal 3D grounding has garnered considerable interest in Vision-Language Models (VLMs) \cite{yin2025spatial} for advancing spatial reasoning in complex environments. However, these models suffer from a severe "2D semantic bias" that arises from over-reliance on 2D image features for coarse localization, largely disregarding 3D geometric inputs and resulting in suboptimal fusion performance. In this paper, we propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue via disentangled representation learning and targeted shortcut suppression. Our approach fundamentally reshapes the model's internal representation space by designating 2D features as semantic beacons for "what" identification and 3D features as spatial anchors for "where" localization, enabling precise 3D grounding without modifying the inference architecture. Key components include a dual-objective loss function with an Alignment Loss that supervises fused predictions using adapted cross-entropy for multimodal synergy, and a Pseudo-Label Loss that penalizes overly effective 2D-dominant pseudo-outputs via a margin-based mechanism. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness, particularly in cluttered indoor scenes.
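The dual-objective loss described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the exact form of the Pseudo-Label Loss, the `margin` and `alpha` hyperparameters, and the use of the target-class probability of the 2D-only branch as the "pseudo-output" confidence are all assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over logits.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def w2r2_loss(fused_logits, pseudo_2d_logits, targets, margin=0.2, alpha=1.0):
    """Hypothetical sketch of the W2R2 dual-objective loss.

    fused_logits     : (N, C) predictions from the fused 2D+3D representation
    pseudo_2d_logits : (N, C) predictions from a 2D-dominant shortcut branch
    targets          : (N,)   ground-truth class indices
    """
    n = np.arange(len(targets))

    # Alignment Loss: cross-entropy supervising the fused multimodal prediction.
    p_fused = softmax(fused_logits)
    align = -np.log(p_fused[n, targets]).mean()

    # Pseudo-Label Loss (assumed margin form): penalize cases where the
    # 2D-only branch is already confident beyond `margin` on the target,
    # discouraging the 2D shortcut and pushing reliance onto 3D geometry.
    p_2d = softmax(pseudo_2d_logits)
    pseudo = np.maximum(p_2d[n, targets] - margin, 0.0).mean()

    return align + alpha * pseudo
```

With uniform 2D-branch logits, each target probability is 1/C, so any `margin` at or above 1/C zeroes out the penalty and the loss reduces to the alignment term alone; as the 2D branch grows more confident, the penalty grows linearly past the margin.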