UESTC, ZJU | May 31, 2026 | arXiv:2606.01247

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

Liyang Li, Muzhi Zhu, Zhiyue Zhao, Hengyu Zhao, Ke Liu, Linhao Zhong, Chunhua Shen

AI Summary

This paper introduces Target Viewpoint Reproduction (TVR), an active exploration task in which an agent adjusts its viewpoint in a 3D environment until its observation matches a target image, in contrast to the passive spatial understanding usually studied in foundation models. The authors evaluate off-the-shelf models on the new TVRBench benchmark and find that the strongest open-source and closed-source models reach only 7.8% and 12.0% success, respectively. Two bottlenecks recur: difficulty using multi-turn visual history, and a sharp drop when the target requires body translation rather than in-place rotation. A unified post-training framework then raises a 9B open-source model to 50.8% success with visual-action SFT and to 51.4% with Multi-turn GRPO, whereas CoT supervision and Single-turn GRPO degrade closed-loop performance.

Key Contribution

Off-the-shelf foundation models reach at most 12.0% success at reproducing target viewpoints on TVRBench, while a unified post-training framework raises a 9B open-source model to 51.4%.

Abstract

Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR) -- an active task where an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image -- and TVRBench, an indoor-simulation benchmark spanning scene scale and target-view visual richness. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success, respectively. Fine-grained analysis identifies two consistent bottlenecks: off-the-shelf models struggle with multi-turn visual history, and performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To study how this gap can be reduced, we build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success; Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at https://github.com/aim-uofa/TVRBench.
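
The closed-loop structure of the task described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not TVRBench's actual interface: the planar `Pose`, the three-action space, the step sizes, the success tolerances, and the `oracle_policy`, which reads the target pose directly, whereas a TVR agent would see only images. The sketch is meant only to show how an episode alternates between acting and checking the viewpoint, and how a success rate is aggregated over episodes.

```python
import math
from dataclasses import dataclass

# All constants are illustrative assumptions, not values from TVRBench.
STEP_M = 0.25        # forward translation per action
TURN_DEG = 15.0      # in-place rotation per action
POS_TOL_M = 0.5      # position tolerance for success
YAW_TOL_DEG = 15.0   # heading tolerance for success


@dataclass
class Pose:
    x: float
    y: float
    yaw_deg: float  # heading, counter-clockwise from the +x axis


def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def step(pose, action):
    """Apply one discrete action and return the new pose."""
    if action == "turn_left":
        return Pose(pose.x, pose.y, wrap_deg(pose.yaw_deg + TURN_DEG))
    if action == "turn_right":
        return Pose(pose.x, pose.y, wrap_deg(pose.yaw_deg - TURN_DEG))
    if action == "forward":
        r = math.radians(pose.yaw_deg)
        return Pose(pose.x + STEP_M * math.cos(r),
                    pose.y + STEP_M * math.sin(r),
                    pose.yaw_deg)
    raise ValueError(f"unknown action: {action}")


def is_success(pose, target):
    """Success means both position and heading are within tolerance."""
    dist = math.hypot(pose.x - target.x, pose.y - target.y)
    yaw_err = abs(wrap_deg(pose.yaw_deg - target.yaw_deg))
    return dist <= POS_TOL_M and yaw_err <= YAW_TOL_DEG


def oracle_policy(pose, target):
    """Hypothetical privileged policy: walk toward the target, then align."""
    dx, dy = target.x - pose.x, target.y - pose.y
    if math.hypot(dx, dy) > POS_TOL_M:
        # Translation phase: face the target position, then move forward.
        err = wrap_deg(math.degrees(math.atan2(dy, dx)) - pose.yaw_deg)
        if abs(err) > TURN_DEG / 2:
            return "turn_left" if err > 0 else "turn_right"
        return "forward"
    # Rotation phase: match the target heading in place.
    err = wrap_deg(target.yaw_deg - pose.yaw_deg)
    if abs(err) > TURN_DEG / 2:
        return "turn_left" if err > 0 else "turn_right"
    return "stop"


def run_episode(start, target, policy, max_steps=100):
    """Roll out a policy until it stops or the step budget runs out."""
    pose = start
    for _ in range(max_steps):
        action = policy(pose, target)
        if action == "stop":
            break
        pose = step(pose, action)
    return is_success(pose, target)


def success_rate(outcomes):
    """Percentage of successful episodes."""
    return 100.0 * sum(outcomes) / len(outcomes)
```

In this toy setting, a target that differs from the start only in heading can be reached with turns alone, while a target at a different position needs the translation phase first; a policy that never moves fails both. This mirrors, in a highly simplified way, the rotation-versus-translation distinction drawn in the abstract.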

Eval Frameworks & Benchmarks, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
