Stanford HAI | Northwestern | Oxford | Stanford University | May 28, 2026 | arXiv:2605.29563

Planning with the Views via Scene Self-Exploration

Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen, Zihan Wang, Li Fei-Fei, Jiajun Wu, Leonidas J. Guibas, Lijuan Wang, Manling Li

AI Summary

The paper introduces "view planning," a task that requires VLMs to predict how each camera movement changes the view and to compose such movements to reach a target view in a 3D environment. The authors find that existing VLMs struggle to compose single-action view transformations across multi-turn plans, and that the gap widens as viewpoint distance grows. To address this, they propose an iterative framework that alternates self-exploration with view graph distillation, using all exploration trajectories, regardless of outcome, to build a view graph that compactly captures how viewpoints connect. This raises Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%).

Key Contribution

VLMs can learn to actively reason and plan in 3D environments by distilling view graphs from self-exploration trajectories, enabling Qwen2.5-VL-7B to surpass frontier models such as GPT-5.4 Pro and Gemini 3.1 Pro on interactive view planning.

Abstract

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, requiring (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space.
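The view graph idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes viewpoints have already been reduced to discrete IDs (real camera poses are continuous and would need some merging rule), and the action names, the `build_view_graph` and `shortest_plan` helpers, and the toy trajectories are all hypothetical. The sketch merges steps from every trajectory, successful or not, into one graph and then reads a multi-step plan off it with breadth-first search.

```python
from collections import defaultdict, deque


def build_view_graph(trajectories):
    """Merge all (view, action, next_view) steps into one adjacency map."""
    graph = defaultdict(dict)
    for trajectory in trajectories:
        for view, action, next_view in trajectory:
            graph[view][action] = next_view
    return graph


def shortest_plan(graph, start, target):
    """Breadth-first search for a shortest action sequence, or None."""
    if start == target:
        return []
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        view, plan = queue.popleft()
        for action, next_view in graph.get(view, {}).items():
            if next_view in visited:
                continue
            if next_view == target:
                return plan + [action]
            visited.add(next_view)
            queue.append((next_view, plan + [action]))
    return None


# Hypothetical exploration logs; neither episode alone goes from v0 to v4
# in three steps, but their union does.
trajectories = [
    [("v0", "turn_left", "v1"), ("v1", "move_forward", "v2")],
    [("v0", "move_forward", "v3"), ("v3", "turn_left", "v2"),
     ("v2", "turn_right", "v4")],
]

graph = build_view_graph(trajectories)
plan = shortest_plan(graph, "v0", "v4")
print(plan)
```

In this toy setup, each edge could serve as a single-action example (given a view and an action, name the resulting view), while each reachable pair of views could serve as a planning example (given start and target, produce the action sequence). How the paper actually turns the graph into supervised tasks is not specified in this section.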

Multimodal Models | Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
