Cornell

May 21, 2026

arXiv:2605.22581

SceneAligner: 3D-Grounded Floorplan Localization in the Wild

Junhyeong Cho, Ruojin Cai, Hadar Averbuch-Elor

AI Summary

SceneAligner addresses floorplan localization in real-world settings by reconstructing a gravity-aligned 3D scene from an unconstrained image collection and projecting it into a 2D density map, which serves as a floorplan proxy. The method then aligns this proxy with the input floorplan using a 2D similarity transform, relying on a fine-tuned 2D foundation model to bridge the appearance gap between density maps and architectural floorplans. The authors report substantial improvements over prior methods, including when only a single input image is available.
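To make the "density map as floorplan proxy" idea concrete, here is a minimal sketch of a top-down projection, not the authors' implementation. It assumes the points are already gravity-aligned with z as the up axis, and it uses a hypothetical function name (`density_map`) and a hypothetical `cell_size` parameter. Vertical structures such as walls stack many points over the same ground cell, so they tend to show up as high-density cells.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def density_map(points: Sequence[Point3D], cell_size: float) -> List[List[float]]:
    """Count 3D points per ground-plane cell and normalize to [0, 1].

    Assumption: the scene is gravity-aligned with z as the up axis,
    so projecting to the floor simply drops z.
    """
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    if not points:
        return []

    # Bounding box of the projected points defines the grid extent.
    min_x = min(p[0] for p in points)
    max_x = max(p[0] for p in points)
    min_y = min(p[1] for p in points)
    max_y = max(p[1] for p in points)
    cols = int((max_x - min_x) // cell_size) + 1
    rows = int((max_y - min_y) // cell_size) + 1

    grid = [[0.0] * cols for _ in range(rows)]
    for x, y, _z in points:
        col = int((x - min_x) // cell_size)
        row = int((y - min_y) // cell_size)
        grid[row][col] += 1.0

    # Normalize by the busiest cell so values are comparable across scenes.
    peak = max(max(row) for row in grid)
    return [[value / peak for value in row] for row in grid]


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (0.1, 0.1, 2.0), (1.0, 0.0, 1.0), (0.0, 1.0, 0.5)]
    for row in density_map(cloud, cell_size=0.5):
        print(row)
```

A real pipeline would likely filter floor and ceiling points and pick the cell size from the scene scale; those choices are left out here.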

Key Contribution

Previously limited to controlled, small-scale environments, floorplan localization is extended to in-the-wild settings through a 3D-grounded approach that remains effective even with a single input image.

Abstract

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.
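The alignment step described above amounts to fitting a 2D similarity transform (scale, rotation, translation) between matched points in the density map and the floorplan. The sketch below illustrates a closed-form least-squares fit under stated assumptions: the correspondences are given (in the paper they come from the fine-tuned foundation model), they contain no outliers, and the names `fit_similarity` and `apply_similarity` are hypothetical. It is not the authors' estimator. Points are treated as complex numbers, so the transform is q = a * p + b with complex a and b.

```python
import cmath
from typing import Sequence, Tuple

Point2D = Tuple[float, float]


def fit_similarity(
    src: Sequence[Point2D], dst: Sequence[Point2D]
) -> Tuple[float, float, complex]:
    """Least-squares similarity mapping src onto dst.

    Returns (scale, rotation in radians, translation as a complex number).
    Reflections are not modeled.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two paired points")

    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)

    # Closed-form solution on centered points: a = <p_c, q_c> / ||p_c||^2.
    denom = sum(abs(pi - p_mean) ** 2 for pi in p)
    if denom == 0:
        raise ValueError("source points are all identical")
    a = sum((pi - p_mean).conjugate() * (qi - q_mean) for pi, qi in zip(p, q)) / denom
    b = q_mean - a * p_mean
    return abs(a), cmath.phase(a), b


def apply_similarity(
    scale: float, angle: float, shift: complex, point: Point2D
) -> Point2D:
    """Map one point through the fitted transform."""
    z = cmath.rect(scale, angle) * complex(*point) + shift
    return (z.real, z.imag)


if __name__ == "__main__":
    proxy_pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    plan_pts = [(3.0, 4.0), (3.0, 6.0), (1.0, 4.0)]
    scale, angle, shift = fit_similarity(proxy_pts, plan_pts)
    print(scale, angle, shift)
    print(apply_similarity(scale, angle, shift, (1.0, 1.0)))
```

With noisy or partially wrong matches, a robust wrapper such as RANSAC around this fit would be a natural addition; whether the paper uses one is not stated in the abstract.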

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
