B-Thinking). In addition, proprietary models typically outperform open-source models. These results confirm the validity of the benchmark and the soundness of the evaluation metrics. (2) Humans achieve excellent performance on the benchmark, with the lowest metric exceeding 90%. Even the best-performing VLMs, however, fall far short of humans, particularly in precondition prediction, indicating a substantial gap between VLMs and humans in spatial logical reasoning. (3) Precondition prediction generally performs worse than content prediction, revealing a clear deficiency in causal reasoning: even when VLMs predict the steps roughly, they do not understand the logical relationships between them. (4) Recall for both content and preconditions is generally lower than precision, suggesting that VLMs favor confident answers and avoid false positives. Concretely, a model outputs step contents or preconditions it is certain about, but may skip uncertain steps entirely, omitting steps that should have been identified. Consistent with this, the best-performing GPT-5 produces answers with an average of 3.1 steps, whereas the annotated answers average 4.2 steps.

3.5 Analysis and Discussions

In this section, we examine two key questions: (1) how well our metric aligns with human judgment, and (2) the underlying causes of VLMs' poor performance.

Human Alignment of VLM-based Evaluation. We employ GPT-4o to match predicted steps with annotated ones. Here, we investigate the consistency between the scoring VLM (used to generate the matching matrix) and human evaluators. To this end, we selected eight representative VLMs as evaluated models and four VLMs as scoring VLMs, and conducted all evaluations on 300 instances randomly sampled from SpatiaLQA. Fig. 6 presents the evaluation results obtained from human evaluators and the different scoring VLMs, while Tab. 3 compares the outcomes of the scoring VLMs against the human evaluators. The results show that evaluation scores vary significantly depending on which VLM serves as the scoring model. Notably, the proprietary models (Qwen-VL-Max and GPT-4o) yield results more consistent with human evaluations, likely because they learn more stable semantic-similarity judgment patterns from larger, higher-quality data, bringing their judgments closer to human intuition. Specifically, the proprietary models exhibit higher correlation coefficients and lower mean absolute errors (around 3 percentage points), whereas the open-source models show mean absolute errors exceeding 10 percentage points. GPT-4o achieves the highest correlation and the lowest mean absolute error, so we adopt it as the scoring VLM to stay consistent with human judgment.

Figure 6: Evaluation results of human evaluators and different scoring VLMs. The x-axis lists the eight representative VLMs being evaluated. Points with the same marker shape denote F1 scores from the same scoring VLM or from human evaluators. Solid lines indicate content F1 scores; dashed lines indicate precondition F1 scores.
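Both the benchmark metrics and the alignment study rest on step-level precision, recall, and F1 computed from the matching matrix produced by the scoring VLM. The snippet below is a minimal sketch of that computation; the greedy one-to-one pairing and the function name are illustrative assumptions, not the paper's exact procedure. It also reproduces observation (4): a model that outputs fewer, more confident steps attains high precision but low recall.

```python
import numpy as np

def step_prf(match: np.ndarray) -> tuple[float, float, float]:
    """Precision/recall/F1 from a binary matching matrix, where
    match[i, j] == 1 iff the scoring VLM judges predicted step i
    semantically equivalent to annotated step j. The greedy
    one-to-one pairing below is an illustrative assumption."""
    n_pred, n_gold = match.shape
    match = match.copy()
    hits = 0
    for i in range(n_pred):
        js = np.flatnonzero(match[i])
        if js.size:              # pair predicted step i with one annotated step
            hits += 1
            match[:, js[0]] = 0  # each annotated step can be matched only once
    p = hits / n_pred if n_pred else 0.0
    r = hits / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# A model that predicts 3 of 4 annotated steps, all correctly:
# precision 1.00, recall 0.75 -- mirroring the precision > recall pattern.
print(step_prf(np.eye(3, 4, dtype=int)))
```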
The Underlying Causes of Poor Performance. To answer this question, we selected four representative VLMs and analyzed their performance along three dimensions: the number of annotated answer steps, the annotation source, and the scene category. As shown in Fig. 7, model performance generally decreases as the number of annotated answer steps increases. Performance also shows clear patterns across annotation sources: VLMs perform best on data generated by subgraph-extraction augmentation, next best on manual annotations, and worst on data generated by graph-expansion augmentation. This trend arises because samples generated through subgraph extraction have fewer answer steps (simpler problems), whereas samples generated through graph expansion contain more steps (more complex problems). In contrast, VLMs perform relatively consistently across scene categories. These observations suggest that VLMs fare worse on tasks requiring more steps, since such tasks demand longer and more stable reasoning processes; failures on these tasks drag down overall performance.

Table 3: Consistency between the scoring VLMs and human evaluators, reported as correlation coefficients ($\rho_c$ for content, $\rho_p$ for preconditions) and mean absolute errors.
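The statistics in Tab. 3 can be reproduced from the per-model F1 scores shown in Fig. 6. Below is a minimal sketch, assuming Pearson correlation (the paper only says "correlation coefficient" without naming one) and using illustrative placeholder scores rather than the actual values from Fig. 6.

```python
import numpy as np
from scipy.stats import pearsonr

def alignment(human_f1: np.ndarray, scorer_f1: np.ndarray) -> tuple[float, float]:
    """Correlation and mean absolute error between a scoring VLM's
    per-model F1 scores and the human evaluators' F1 scores.
    Pearson correlation is an assumption here."""
    rho, _ = pearsonr(human_f1, scorer_f1)
    mae = float(np.abs(human_f1 - scorer_f1).mean())
    return float(rho), mae

# Usage with illustrative (not real) scores for the eight evaluated VLMs:
human = np.array([92.0, 85.0, 78.0, 70.0, 66.0, 60.0, 55.0, 48.0])
gpt4o = np.array([90.0, 86.0, 75.0, 72.0, 63.0, 58.0, 57.0, 45.0])
print(alignment(human, gpt4o))
```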