The paper addresses geometric hallucination in Point-Vision-Language Models (Point-VLMs) by identifying and mitigating a structural misalignment in reinforcement learning. The authors introduce Geometric Reward Credit Assignment, which disentangles holistic supervision into field-specific signals routed to the responsible token spans, and a Reprojection-Consistency term that enforces physical constraints. Experiments on a ShapeNetCore-derived benchmark demonstrate significant improvements in 3D Keypoint Accuracy (KPA), 3D bounding box IoU, and reprojection consistency, while maintaining 2D localization performance.
Point-VLMs can learn to see the world as it really is: targeted reward assignment and cross-modal verification nearly close the reality gap in 3D reasoning.
Point-Vision-Language Models promise to equip embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination, where predicted 3D structures contradict the observed 2D reality. We identify the key cause of this failure not as a representation bottleneck but as a structural misalignment in reinforcement learning: sparse geometric tokens are drowned out by noisy, broadcast sequence-level rewards. To resolve this causal dilution, we propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. This mechanism transforms vague feedback into precise gradient updates, turning generic policy optimization into targeted structural alignment. We further internalize physical constraints via a Reprojection-Consistency term, which serves as a cross-modal verifier that penalizes physically impossible geometries. Validated on a calibrated benchmark derived from ShapeNetCore, our approach bridges the reliability gap by boosting 3D Keypoint Accuracy (KPA) from 0.64 to 0.93, increasing 3D bounding-box IoU to 0.686, and raising the reprojection-consistency score to 0.852. Crucially, these gains are achieved while maintaining robust 2D localization performance, marking a meaningful step from merely plausible textual outputs toward physically verifiable spatial predictions.
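The credit-assignment idea lends itself to a compact illustration. Below is a minimal PyTorch sketch of routing field-specific rewards onto their responsible token spans, assuming the decoder emits structured fields whose spans are known from the output template. The function name `route_field_rewards`, the span indices, and the reward values are illustrative assumptions, not the paper's actual interface.

```python
import torch

def route_field_rewards(seq_len, field_spans, field_rewards):
    """Turn field-specific reward signals into per-token credit.

    Rather than broadcasting one sequence-level reward to every token,
    each field's reward is written only onto the token span that
    generated that field, so gradients target the responsible tokens.
    """
    per_token = torch.zeros(seq_len)
    for name, (start, end) in field_spans.items():
        per_token[start:end] = field_rewards[name]
    return per_token

# Hypothetical spans for a structured 3D answer: keypoint tokens receive
# the keypoint reward, box tokens the IoU reward, all others no signal.
per_token_reward = route_field_rewards(
    seq_len=32,
    field_spans={"keypoints_3d": (5, 17), "bbox_3d": (20, 29)},
    field_rewards={"keypoints_3d": 0.8, "bbox_3d": 0.5},
)

# In a policy-gradient step, these rewards would weight the token
# log-probabilities, e.g. loss = -(per_token_reward * token_logps).sum().
```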
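The reprojection-consistency check can likewise be sketched under a standard pinhole-camera assumption: project the predicted 3D keypoints through the intrinsics and penalize disagreement with the 2D predictions. The exponential penalty shape and the pixel scale `tau` below are assumptions for illustration, not values from the paper.

```python
import torch

def reprojection_consistency(points_3d, points_2d, K, tau=5.0):
    """Cross-modal verifier: do the 3D predictions agree with 2D evidence?

    points_3d: (N, 3) predicted keypoints in camera coordinates
    points_2d: (N, 2) corresponding pixel locations
    K:         (3, 3) camera intrinsics
    tau:       pixel-error scale for the penalty (assumed value)
    """
    proj = (K @ points_3d.T).T            # to homogeneous image coordinates
    z = proj[:, 2:3].clamp(min=1e-6)      # guard against points behind the camera
    pix = proj[:, :2] / z                 # perspective divide -> pixels
    err = (pix - points_2d).norm(dim=-1)  # per-keypoint reprojection error
    return torch.exp(-err / tau).mean()   # score in (0, 1]; higher = consistent
```

A geometry whose reprojection lands far from the 2D evidence, or behind the camera, scores near zero, which is what lets such a term act as a penalty on physically impossible structures.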