Mar 12, 2026 | arXiv:2603.12238

SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

Junfeng Luo, Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng

AI Summary

SceneAssistant is a visual-feedback-driven agent that leverages Vision-Language Models (VLMs) for open-vocabulary 3D scene generation. The agent iteratively refines a scene by receiving rendered visual feedback and acting through a set of atomic operations, which yields more coherent spatial arrangements and closer alignment with the input text. Experiments show that it generates diverse, high-quality 3D scenes and outperforms existing methods in both qualitative analysis and quantitative human evaluations.

Key Contribution

Forget predefined relationships: SceneAssistant uses visual feedback to let VLMs generate diverse and high-quality 3D scenes from open-vocabulary text prompts.

Abstract

Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant
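The render-then-act loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Scene` and `SceneObject` classes, the `render` stand-in (which returns object parameters rather than an image), the `run_agent` loop, and the rule-based `toy_policy` that takes the place of the VLM are all hypothetical names invented for illustration. Only the operation names Scale, Rotate, and FocusOn come from the abstract, and their semantics here are assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SceneObject:
    name: str
    scale: float = 1.0
    yaw_deg: float = 0.0


@dataclass
class Scene:
    objects: Dict[str, SceneObject] = field(default_factory=dict)
    camera_target: str = ""  # object the (hypothetical) camera looks at


# An action is (operation name, target object, numeric argument).
Action = Tuple[str, str, float]
# Stand-in for a rendered image: per-object (scale, yaw) readings.
Observation = Dict[str, Tuple[float, float]]


def apply_action(scene: Scene, action: Action) -> None:
    """Apply one atomic operation; semantics are assumed, not from the paper."""
    op, target, value = action
    obj = scene.objects[target]
    if op == "Scale":
        obj.scale *= value
    elif op == "Rotate":
        obj.yaw_deg = (obj.yaw_deg + value) % 360.0
    elif op == "FocusOn":
        scene.camera_target = target
    else:
        raise ValueError(f"unknown operation: {op}")


def render(scene: Scene) -> Observation:
    """Placeholder for rendering; a real system would return images."""
    return {o.name: (o.scale, o.yaw_deg) for o in scene.objects.values()}


def run_agent(
    scene: Scene,
    policy: Callable[[Observation], List[Action]],
    max_steps: int = 10,
) -> List[Observation]:
    """Alternate feedback and actions until the policy proposes nothing."""
    history: List[Observation] = []
    for _ in range(max_steps):
        feedback = render(scene)
        history.append(feedback)
        actions = policy(feedback)
        if not actions:
            break
        for action in actions:
            apply_action(scene, action)
    return history


def toy_policy(obs: Observation) -> List[Action]:
    """Rule-based stand-in for the VLM: halve any object larger than scale 1."""
    return [("Scale", name, 0.5) for name, (scale, _) in obs.items() if scale > 1.0]


if __name__ == "__main__":
    scene = Scene(objects={"chair": SceneObject("chair", scale=4.0)})
    steps = run_agent(scene, toy_policy)
    print(len(steps), scene.objects["chair"].scale)
```

In this sketch the policy sees only the observation, never the scene object itself, which mirrors the idea that the agent acts on rendered feedback. Swapping `toy_policy` for a call to an actual VLM, and `render` for a real renderer, would be the natural extension.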

Computer Vision, Multimodal Models, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 50

Year: 2026

Venue: N/A
