Om AI Research | ZJU | May 27, 2026 | arXiv:2605.28132

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Haozhan Shen, Tiancheng Zhao, Kangjia Zhao

AI Summary

This study systematically compares Vision-Language Models (VLMs) and Video Generation Models (VGMs) to evaluate their effectiveness in representing spatial intelligence across semantic tagging, instance grouping, and 3D geometry prediction. The findings indicate that while VLMs excel in semantic tasks, VGMs provide superior performance in capturing geometric information and camera motion. Notably, a simple fusion of features from both models results in a representation that significantly enhances performance in both geometry and semantics, highlighting the potential for improved spatial intelligence architectures.

Key Contribution

VLMs and VGMs reveal a surprising complementarity in spatial intelligence tasks, with a simple fusion of their features outperforming either model alone.

Abstract

Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely used as foundation backbones: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds. However, it remains unclear which pre-training scheme provides a better representation substrate for spatial intelligence. In this paper, we present the first systematic frozen-feature probing study of VLMs and VGMs across three representative axes of spatial intelligence: semantic tagging, instance grouping, and 3D geometry prediction. Using lightweight probes, our framework enables a controlled comparison of what information is already encoded in the frozen representations of the two model families. Experimental results reveal a clear complementarity: VLMs are stronger at semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Moreover, a naive fusion of the two already yields a representation that excels at both geometry and semantics, suggesting a promising direction for building stronger spatial-intelligence backbones by effectively integrating features from both model families. Our code is available at https://github.com/om-ai-lab/Probing-VLM-VGM.
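
The general idea of frozen-feature probing with a naive fusion can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the features are synthetic stand-ins for frozen backbone outputs, the probe is a closed-form ridge regression, fusion is plain channel concatenation, and the names `fit_linear_probe`, `probe_predict`, and `fuse` are hypothetical.

```python
import numpy as np

def _with_bias(features):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_linear_probe(features, targets, l2=1e-3):
    """Fit a ridge-regression probe in closed form.

    Only the probe weights are estimated; the features are treated as
    fixed inputs, mirroring a frozen backbone.
    """
    x = _with_bias(features)
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)

def probe_predict(weights, features):
    """Apply a fitted probe to (frozen) features."""
    return _with_bias(features) @ weights

def fuse(feats_a, feats_b):
    """Naive fusion (assumed here): concatenate along the channel axis."""
    return np.concatenate([feats_a, feats_b], axis=1)

def probe_error(train_x, train_y, test_x, test_y):
    """Mean squared error of a probe fit on train and scored on test."""
    weights = fit_linear_probe(train_x, train_y)
    return float(np.mean((probe_predict(weights, test_x) - test_y) ** 2))

# Synthetic stand-ins (NOT real model features): one feature set encodes a
# "semantic" latent, the other a "geometric" latent.
rng = np.random.default_rng(0)
n_train, n_test = 512, 256
n = n_train + n_test
semantic = rng.normal(size=(n, 4))
geometric = rng.normal(size=(n, 4))
vlm_like = semantic @ rng.normal(size=(4, 8)) + 0.01 * rng.normal(size=(n, 8))
vgm_like = geometric @ rng.normal(size=(4, 8)) + 0.01 * rng.normal(size=(n, 8))

# A toy target that needs both kinds of information.
target = semantic @ rng.normal(size=(4, 1)) + geometric @ rng.normal(size=(4, 1))

tr, te = slice(0, n_train), slice(n_train, n)
err_vlm = probe_error(vlm_like[tr], target[tr], vlm_like[te], target[te])
err_vgm = probe_error(vgm_like[tr], target[tr], vgm_like[te], target[te])
fused = fuse(vlm_like, vgm_like)
err_fused = probe_error(fused[tr], target[tr], fused[te], target[te])

print(f"VLM-like only: {err_vlm:.4f}")
print(f"VGM-like only: {err_vgm:.4f}")
print(f"Fused:         {err_fused:.4f}")
```

In this toy setup the fused probe has the lowest error by construction, since the target depends on both latents. It says nothing about the magnitude of any effect on real VLM or VGM features; for that, see the paper's experiments.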

Computer Vision | Multimodal Models | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
