The paper introduces EarthSpatialBench, a new benchmark designed to evaluate spatial reasoning capabilities of Multimodal Large Language Models (MLLMs) on Earth imagery, addressing the limitations of existing benchmarks in quantitative distance/direction reasoning, topological relations, and complex object geometries. EarthSpatialBench comprises over 325K question-answer pairs that cover qualitative/quantitative spatial reasoning, topological relations, and various object query types, using textual descriptions, visual overlays, and explicit geometry coordinates for object references. Experiments using EarthSpatialBench revealed limitations in current open-source and proprietary MLLMs' spatial reasoning abilities when applied to Earth imagery.
MLLMs struggle to quantitatively reason about spatial relationships in Earth imagery, despite advances in other spatial reasoning tasks, highlighting a critical gap for applications requiring precise georeferencing.
Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding objects in georeferenced images and quantitatively reasoning about distances, directions, and topological relations using both visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To fill this gap, we propose EarthSpatialBench, a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery. The benchmark contains over 325K question-answer pairs spanning: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references expressed via textual descriptions, visual overlays, and explicit geometry coordinates, including 2D bounding boxes, polylines, and polygons. We conducted extensive experiments on both open-source and proprietary models to identify limitations in the spatial reasoning of MLLMs.
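To make the three kinds of reasoning named in the abstract concrete (quantitative distance, direction, and topological relations), the following is a minimal sketch for the simplest geometry type, 2D bounding boxes. It is not the benchmark's own code or scoring procedure. The names `BBox`, `distance_m`, `bearing_deg`, `cardinal`, and `topology` are hypothetical, as is the `gsd_m` parameter (ground sampling distance in meters per pixel). The sketch also assumes a north-up image and non-degenerate boxes, and it measures distance and direction between box centers, which is only one of several reasonable conventions.

```python
import math
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned box in pixel coordinates (x grows east, y grows south)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def center(self):
        return ((self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2)


def distance_m(a, b, gsd_m=1.0):
    """Center-to-center distance in meters; gsd_m is an assumed meters-per-pixel scale."""
    (ax, ay), (bx, by) = a.center(), b.center()
    return math.hypot(bx - ax, by - ay) * gsd_m


def bearing_deg(a, b):
    """Compass bearing from a to b (0 = north, 90 = east), assuming a north-up image."""
    (ax, ay), (bx, by) = a.center(), b.center()
    east = bx - ax
    north = ay - by  # image y axis points south, so flip the sign
    return math.degrees(math.atan2(east, north)) % 360


def cardinal(bearing):
    """Map a bearing in degrees to one of eight compass sectors of 45 degrees each."""
    names = ["north", "northeast", "east", "southeast",
             "south", "southwest", "west", "northwest"]
    return names[int((bearing + 22.5) // 45) % 8]


def topology(a, b):
    """Coarse topological relation of box a relative to box b."""
    if a == b:
        return "equals"
    # Signed extent of the intersection along each axis.
    ix = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    iy = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    if ix < 0 or iy < 0:
        return "disjoint"
    if ix == 0 or iy == 0:
        return "touches"  # boundaries meet along an edge or at a corner
    if a.xmin <= b.xmin and a.ymin <= b.ymin and a.xmax >= b.xmax and a.ymax >= b.ymax:
        return "contains"
    if b.xmin <= a.xmin and b.ymin <= a.ymin and b.xmax >= a.xmax and b.ymax >= a.ymax:
        return "within"
    return "overlaps"


# Illustrative values only; these are not drawn from the benchmark.
depot = BBox(0, 40, 10, 50)
tower = BBox(30, 0, 40, 10)
print(distance_m(depot, tower, gsd_m=0.5))   # 25.0
print(cardinal(bearing_deg(depot, tower)))   # northeast
print(topology(depot, tower))                # disjoint
```

Polylines and polygons, which the benchmark also covers, would need proper geometric predicates (for example from a geometry library) instead of the box comparisons used here.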