The paper introduces PanoEnv, a large-scale VQA benchmark for evaluating 3D spatial reasoning in VLMs on panoramic images. It targets two weaknesses of current VLMs: difficulty with geometric distortion and a lack of 3D supervision. The authors benchmark 14 VLMs, which reveals poor 3D understanding, and then propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with geometry-aware rewards and a two-stage curriculum. The resulting 7B model achieves state-of-the-art performance on PanoEnv, which the authors present as evidence that the benchmark and the RL framework together improve 3D spatial intelligence in VLMs.
VLMs are surprisingly bad at 3D spatial reasoning in panoramic images, but a new RL-based training method yields measurable gains.
360° panoramic images are increasingly used in virtual reality, autonomous driving, and robotics for holistic scene understanding. However, current Vision-Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations including depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, with only 49.34% overall accuracy and 8.36% accuracy on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward that incorporates five geometry-aware strategies such as distance tolerance and spatial consistency. A two-stage curriculum further mitigates catastrophic forgetting: Stage 1 trains on structured tasks (true/false and multiple choice), and Stage 2 fine-tunes on mixed open-ended data to improve generalization. Our 7B model achieves new state-of-the-art performance, improving overall accuracy to 52.93% (+3.59 percentage points) and open-ended accuracy to 14.83% while maintaining structured-task performance. It also achieves top semantic evaluation scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.
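To make the reward idea concrete, the following is a minimal sketch of how a distance-tolerance reward could feed GRPO's group-relative advantage. It is an illustration under stated assumptions, not the paper's implementation: the function names, the 10% tolerance, and the linear decay are hypothetical, since the abstract does not specify the reward's exact form. The only part taken from GRPO as commonly described is the normalization of each reward against the mean and standard deviation of its sampled group.

```python
from statistics import mean, pstdev


def distance_tolerance_reward(pred_m, true_m, tol=0.10):
    """Hypothetical reward: 1.0 when the predicted distance is within a
    relative tolerance of the ground truth, decaying linearly to 0.0 as the
    relative error approaches 100%. The shape and tol value are assumptions."""
    if true_m <= 0:
        raise ValueError("ground-truth distance must be positive")
    rel_err = abs(pred_m - true_m) / true_m
    if rel_err <= tol:
        return 1.0
    return max(0.0, 1.0 - (rel_err - tol) / (1.0 - tol))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampled group:
    (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question whose ground-truth distance is 2.0 m.
predictions = [2.1, 1.0, 3.5, 2.0]
rewards = [distance_tolerance_reward(p, 2.0) for p in predictions]
advantages = group_relative_advantages(rewards)

for p, r, a in zip(predictions, rewards, advantages):
    print(f"pred={p:.1f} m  reward={r:.3f}  advantage={a:+.3f}")
```

In this sketch, answers close to the ground truth receive positive advantages and distant ones receive negative advantages, so a near-miss is still rewarded rather than scored as a flat failure. That graded signal is the kind of behavior a tolerance-based reward is meant to provide for open-ended numeric answers.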