This paper introduces Target Viewpoint Reproduction (TVR), an active exploration task in which an agent adjusts its viewpoint in a 3D environment until its observation matches a target image, in contrast to the passive spatial understanding usually studied in foundation models. The authors evaluate off-the-shelf models on the new TVRBench benchmark and find that the strongest open-source and closed-source models reach only 7.8% and 12.0% success, with two consistent bottlenecks: handling multi-turn visual history, and targets that require body translation rather than in-place rotation. A unified post-training framework raises a 9B open-source model to 50.8% success with visual-action SFT and to 51.4% overall with Multi-turn GRPO.
Foundation models struggle with spatial tasks, achieving only 12% success in reproducing target viewpoints, but a novel post-training framework boosts performance to over 51%.
Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR) -- an active task where an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image -- and TVRBench, an indoor-simulation benchmark spanning scene scale and target-view visual richness. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success, respectively. Fine-grained analysis identifies two consistent bottlenecks: off-the-shelf models struggle with multi-turn visual history, and performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To study how this gap can be reduced, we build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success; Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at https://github.com/aim-uofa/TVRBench.
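For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage that the method is generally built on: several rollouts for the same task are scored, and each reward is normalized against its own group. It is a generic illustration, not the authors' implementation. The binary success reward, the group of four rollouts, the use of the population standard deviation, and the name `group_relative_advantages` are all assumptions made for this example; the abstract does not specify the reward or normalization used for TVR.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against the mean and spread of its group.

    Assumption: population standard deviation; `eps` guards against a
    zero spread when every rollout in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four rollouts toward the same target image,
# with reward 1.0 if the rollout succeeded and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Under these assumptions the single successful rollout receives a positive advantage and the failed ones a negative advantage, while a group whose rollouts all receive the same reward yields zero advantage for every rollout.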