Tsinghua AI | HKUST | Jun 4, 2026 | arXiv:2606.06076

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

Haocheng Luo, Jiahui Liu, Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang, Zunnan Xu, Quan Shi, Jun Zhou, Xiu Li

AI Summary

This paper introduces MGSD, a two-stage modality-gap-aware self-distillation framework for improving visual spatial planning in vision-language models. A cold-start grounding stage first gives the visual student reliable state representations to reduce perception noise. A privileged teacher with access to explicit symbolic states then supervises the student's own visual rollout prefixes through on-policy distillation. Symbolic states are used only during training, so inference stays purely visual. Experiments show that MGSD raises the macro average by 19.3% and 18.4% for the 4B and 8B backbones, respectively, and narrows the gap to symbolic-input upper bounds.

Key Contribution

MGSD narrows the perception-reasoning gap in visual planning, raising the macro average by 19.3% (4B backbone) and 18.4% (8B backbone) while using only visual input at inference.

Abstract

While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student's own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at https://github.com/Oranger-l/MGSD.
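The second stage described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the grid world, the four-action vocabulary, the hand-written teacher, the prefix-independent student, and the choice of a per-step reverse KL as the distillation loss (the abstract does not state which loss MGSD uses). The one property it tries to show is the on-policy part: the rollout prefix is sampled from the student, and the teacher, which sees the exact symbolic state, scores each step of that prefix.

```python
import math
import random

# Hypothetical action vocabulary and grid moves (not from the paper).
ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_prefix(start, prefix):
    """Position reached after executing the actions in prefix from start."""
    x, y = start
    for action in prefix:
        dx, dy = MOVES[action]
        x, y = x + dx, y + dy
    return x, y


def teacher_probs(symbolic_state, prefix):
    """Privileged teacher: reads exact coordinates and prefers moves
    that reduce the Manhattan distance to the goal."""
    start, goal = symbolic_state
    pos = apply_prefix(start, prefix)
    logits = []
    for action in ACTIONS:
        dx, dy = MOVES[action]
        nxt = (pos[0] + dx, pos[1] + dy)
        dist = abs(goal[0] - nxt[0]) + abs(goal[1] - nxt[1])
        logits.append(-2.0 * dist)
    return softmax(logits)


def student_probs(student_logits, prefix):
    """Toy student: one fixed logit vector. A real visual student would
    condition on the image and on the prefix generated so far."""
    return softmax(student_logits)


def reverse_kl(p_student, p_teacher, eps=1e-12):
    """KL(student || teacher) for a single decision step."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(p_student, p_teacher)
    )


def on_policy_distill_loss(student_logits, symbolic_state, horizon, rng):
    """Average per-step reverse KL along a rollout sampled from the student."""
    prefix, total = [], 0.0
    for _ in range(horizon):
        p = student_probs(student_logits, prefix)
        q = teacher_probs(symbolic_state, prefix)  # teacher scores the student's prefix
        total += reverse_kl(p, q)
        prefix.append(rng.choices(ACTIONS, weights=p, k=1)[0])
    return total / horizon, prefix


if __name__ == "__main__":
    rng = random.Random(0)
    state = ((0, 0), (10, 0))  # (agent start, goal), both hypothetical
    for name, logits in [("uniform", [0.0] * 4), ("right-biased", [0.0, 0.0, 0.0, 4.0])]:
        loss, rollout = on_policy_distill_loss(logits, state, horizon=3, rng=rng)
        print(name, round(loss, 3), rollout)
```

In a real training loop the loss would be backpropagated through the student only, with the teacher's distribution treated as a fixed target. The teacher is needed only to compute that target, which is consistent with the abstract's point that symbolic states are dropped at inference.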

Multimodal Models, World Models & Planning


Year: 2026

Venue: N/A
