MANSION is a language-driven framework for generating realistic, multi-floor 3D environments, addressing the failure of current embodied benchmarks to reflect real-world task complexity. Respecting vertical structural constraints, the framework produces navigable building structures with diverse scenes. The authors release MansionWorld, a dataset of over 1,000 buildings, together with a Task-Semantic Scene Editing Agent. Benchmarking shows sharp performance degradation for existing agents, positioning MANSION as a testbed for spatial reasoning.
Current embodied AI agents falter when faced with the multi-floor complexity of MANSION, a new language-driven framework for generating realistic, building-scale 3D environments.
Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor indoor environments and fail to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. By modeling vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments via open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning agents.