This paper introduces E3VS-Bench, a benchmark for evaluating viewpoint-dependent active perception in 3D environments, addressing the limitations of existing benchmarks that rely on static observations. Using 3D Gaussian Splatting for photorealistic rendering, the benchmark provides 99 high-fidelity scenes and 2,014 question-driven episodes that require fine-grained viewpoint control. The evaluation shows that although state-of-the-art vision-language models (VLMs) demonstrate strong 2D reasoning, they fall well short of human performance on active perception tasks that require coherent viewpoint planning across 5-DoF changes.
Despite advances in VLMs, agents struggle with active perception in 3D environments, revealing a significant gap in performance compared to humans.
Visual search in 3D environments requires embodied agents to actively explore their surroundings and acquire task-relevant evidence. However, existing visual search and embodied AI benchmarks, including EQA, typically rely on static observations or constrained egocentric motion. As a result, they do not explicitly evaluate fine-grained viewpoint-dependent phenomena that arise under unrestricted 5-DoF viewpoint control in real-world 3D environments, such as visibility changes caused by vertical viewpoint shifts, contents hidden inside containers, and object attributes that are observable only from specific angles. To address this limitation, we introduce E3VS-Bench, a benchmark for embodied 3D visual search in which agents must control their viewpoints in 5-DoF to gather viewpoint-dependent evidence for question answering. E3VS-Bench consists of 99 high-fidelity 3D scenes reconstructed using 3D Gaussian Splatting and 2,014 question-driven episodes. 3D Gaussian Splatting enables photorealistic free-viewpoint rendering that preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators. This makes it possible to construct questions that cannot be answered from a single view and instead require active inspection across viewpoints in 5-DoF. We evaluate multiple state-of-the-art VLMs and compare their performance with that of humans. Despite strong 2D reasoning ability, all models exhibit a substantial gap from humans, highlighting limitations in active perception and coherent viewpoint planning, specifically under full 5-DoF viewpoint changes.
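
To make the notion of 5-DoF viewpoint control concrete, the following is a minimal sketch of a camera pose and a relative viewpoint change. The abstract does not state which five degrees of freedom are used; this sketch assumes three translational axes plus yaw and pitch, with no roll. The `Viewpoint` class, its field names, the axis convention, and the `step` method are all hypothetical illustrations and are not taken from E3VS-Bench.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Viewpoint:
    """Hypothetical 5-DoF camera pose: position plus yaw and pitch (no roll)."""
    x: float
    y: float
    z: float      # height; z is assumed to be the vertical axis
    yaw: float    # radians, rotation about the vertical axis
    pitch: float  # radians, positive values look upward

    def forward(self):
        """Unit view direction implied by yaw and pitch."""
        cp = math.cos(self.pitch)
        return (cp * math.cos(self.yaw),
                cp * math.sin(self.yaw),
                math.sin(self.pitch))

    def step(self, dx=0.0, dy=0.0, dz=0.0, dyaw=0.0, dpitch=0.0):
        """Return a new pose after a relative move in all five DoF."""
        # Clamp pitch to [-pi/2, pi/2] so the camera never flips over.
        pitch = max(-math.pi / 2, min(math.pi / 2, self.pitch + dpitch))
        # Wrap yaw into [0, 2*pi).
        yaw = (self.yaw + dyaw) % (2 * math.pi)
        return Viewpoint(self.x + dx, self.y + dy, self.z + dz, yaw, pitch)


# Example: lower the camera and tilt it down, as one might to look under a table.
start = Viewpoint(x=0.0, y=0.0, z=1.5, yaw=0.0, pitch=0.0)
lowered = start.step(dz=-1.0, dpitch=-math.pi / 6)
```

Under these assumptions, a vertical shift (`dz`) combined with a pitch change is the kind of move that egocentric benchmarks with fixed camera height cannot express, which is the class of viewpoint-dependent evidence the abstract describes.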