Cornell | Jun 1, 2026 | arXiv:2606.02580

Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Guangzhao He, Rundong Luo, Wei-Chiu Ma, Hadar Averbuch-Elor

AI Summary

This paper introduces Staged Executable Inverse Graphics (SEIG), an agentic framework that uses pretrained vision-language models to reconstruct a 3D scene from a single image as an editable Blender program. SEIG refines geometry, materials, composition, and lighting in stages, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. The experiments show that staged reconstruction substantially improves reconstruction fidelity, which points to task decomposition as an important factor for executable inverse graphics with general-purpose VLMs. The reconstructed Blender scenes also support a range of downstream editing applications.

Key Contribution

A staged, agentic framework in which general-purpose vision-language models reconstruct a single image as an editable Blender program that can be rendered, relit, and manipulated.

Abstract

Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors including geometry, materials, composition, and lighting directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.
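The staged refinement described in the abstract can be pictured with a minimal sketch. Everything here beyond the four scene factors is an assumption made for illustration: the function names `staged_refine`, `toy_propose`, and `toy_score`, the number of rounds per stage, and the accept-only-if-better rule are hypothetical and are not taken from the paper. In a real setting, `propose` would stand in for a VLM call that edits the Blender script, and `score` for a comparison between a render and the input image; the toy versions below only manipulate strings so the loop can run without Blender.

```python
from typing import Callable, List, Tuple

# Scene factors named in the abstract; processing them in this order is an assumption.
STAGES = ["geometry", "materials", "composition", "lighting"]


def staged_refine(
    program: str,
    propose: Callable[[str, str], str],
    score: Callable[[str], float],
    rounds_per_stage: int = 2,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Refine a scene program one factor at a time.

    A candidate replaces the current program only if it scores strictly
    higher (a hypothetical acceptance rule, not the paper's).
    """
    history: List[Tuple[str, float]] = []
    best = score(program)
    for stage in STAGES:
        for _ in range(rounds_per_stage):
            candidate = propose(stage, program)
            candidate_score = score(candidate)
            if candidate_score > best:
                program, best = candidate, candidate_score
        history.append((stage, best))
    return program, history


def toy_propose(stage: str, program: str) -> str:
    # Stand-in for a VLM edit: append a marker line for the current stage.
    return program + f"# {stage} pass\n"


def toy_score(program: str) -> float:
    # Stand-in for a render-vs-image metric: count the stages addressed so far.
    return float(sum(1 for s in STAGES if f"# {s} pass" in program))


# The program text is treated as a plain string here and is never executed.
final_program, history = staged_refine("import bpy\n", toy_propose, toy_score)
```

With these toy stand-ins, each stage contributes exactly one accepted edit, and the second proposal in every stage is rejected because it does not raise the score.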

Code Generation & Program Synthesis | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
