This paper introduces the Hierarchical Descriptive Scene Language (HDSL), a domain-specific language for structured 3D indoor scene generation and localized editing with LLM agents. HDSL represents a scene as a tree with local coordinates, which makes complex scenes easier to plan and edit than the scene graphs or global constraint lists used by existing methods. In the authors' reproduced benchmark, HDSL improves average object coverage, text-scene alignment, and generation time over full text-to-scene baselines, and its retrieval-based editing method, Hierarchical Retrieval-Augmented Generation (HRAG), reduces token use by $5.22\times$ and runtime by $6.19\times$.
HDSL reduces editing token usage by $5.22\times$ while better preserving unrelated scene objects, and it also improves generation time.
Text-driven indoor scene generation and editing require an intermediate representation that language models can both produce and revise. Existing LLM-based systems often rely on scene graphs or global constraint lists, which are compact but underspecify local geometry and make instruction-based edits difficult to localize. We frame this problem as structured program generation and local program repair, and propose Hierarchical Descriptive Scene Language (HDSL), an XML/CSS-style domain-specific language for structured 3D indoor scenes. HDSL represents rooms, regions, objects, and support surfaces as a tree with local coordinates, making complex scenes easier to plan recursively and easier to retrieve for editing. Our pipeline uses LLM agents to generate HDSL subtrees with bounded verification, grounds non-virtual nodes through multimodal asset retrieval, and applies force-directed layout optimization to repair boundary and collision errors. For editing, Hierarchical Retrieval-Augmented Generation (HRAG) retrieves the relevant subtree, asks the LLM to rewrite only that local context, and merges the result back through a deterministic three-way merge. In our reproduced benchmark, HDSL improves average object coverage, text-scene alignment, and generation time over full text-to-scene baselines while remaining competitive with recent layout-only reproductions on geometry metrics; for editing, HRAG reduces token use by $5.22\times$ and runtime by $6.19\times$, produces valid DSL for all eight paired edits, and better preserves unrelated scene objects.
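To make the idea of a scene tree with local coordinates concrete, the following is a minimal sketch rather than the paper's implementation. It assumes translation-only offsets (no rotation or scale), and the names `Node`, `world_positions`, and `find` are hypothetical helpers introduced here for illustration. It shows why an edit to one region stays local: moving a region node shifts everything beneath it, while siblings elsewhere in the tree are untouched.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical scene node: a name, an offset relative to its parent, and children."""
    name: str
    offset: tuple = (0.0, 0.0, 0.0)  # local (x, y, z) relative to the parent node
    children: list = field(default_factory=list)


def world_positions(node, origin=(0.0, 0.0, 0.0)):
    """Resolve local offsets into world positions by accumulating down the tree."""
    pos = tuple(o + d for o, d in zip(origin, node.offset))
    out = {node.name: pos}
    for child in node.children:
        out.update(world_positions(child, pos))
    return out


def find(node, name):
    """Depth-first lookup of a node by name; returns None if it is absent."""
    if node.name == name:
        return node
    for child in node.children:
        hit = find(child, name)
        if hit is not None:
            return hit
    return None


# A toy room: a desk region (desk with a lamp on it, plus a chair) and a bed.
room = Node("room", children=[
    Node("desk_area", (2.0, 0.0, 1.0), [
        Node("desk", (0.0, 0.0, 0.0), [Node("lamp", (0.5, 0.75, 0.25))]),
        Node("chair", (0.0, 0.0, 1.0)),
    ]),
    Node("bed", (4.0, 0.0, 3.0)),
])

before = world_positions(room)

# A localized edit: shift only the desk region. Its descendants follow it;
# the bed, which lives in a different subtree, keeps its position.
find(room, "desk_area").offset = (1.0, 0.0, 1.0)
after = world_positions(room)
```

Because only the `desk_area` offset is rewritten, an editor that retrieves that subtree needs no knowledge of the rest of the room, which is the intuition behind rewriting only local context.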