Aug 25, 2025arXiv:2508.17595

TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints

Vinh-Thuan Ly, Hoang M. Truong, Xuan-Huong Nguyen

AI Summary

TinyGiantVLM, a lightweight two-stage VLM, is introduced to address the challenge of fine-grained spatial reasoning in warehouse-scale environments by encoding global and region-level features from RGB and depth modalities using pretrained visual backbones. A Mixture-of-Experts (MoE) fusion module dynamically combines spatial representations to handle high-modality inputs and diverse question types, improving convergence. Evaluated on the AI City Challenge 2025, the 64M-parameter base model achieved 5th place, demonstrating strong performance in bridging visual perception and spatial understanding.

Key Contribution

A 64M-parameter VLM can achieve surprisingly strong spatial reasoning performance in complex industrial environments, rivaling much larger models.

Abstract

Reasoning about fine-grained spatial relationships in warehouse-scale environments poses a significant challenge for existing vision-language models (VLMs), which often struggle to comprehend 3D layouts, object arrangements, and multimodal cues in real-world industrial settings. In this paper, we present TinyGiantVLM, a lightweight and modular two-stage framework designed for physical spatial reasoning, distinguishing itself from traditional geographic reasoning in complex logistics scenes. Our approach encodes both global and region-level features from RGB and depth modalities using pretrained visual backbones. To effectively handle the complexity of high-modality inputs and diverse question types, we incorporate a Mixture-of-Experts (MoE) fusion module, which dynamically combines spatial representations to support downstream reasoning tasks and improve convergence. Training is conducted in a two-phase strategy: the first phase focuses on generating free-form answers to enhance spatial reasoning ability, while the second phase uses normalized answers for evaluation. Evaluated on Track 3 of the AI City Challenge 2025, our 64M-parameter base model achieved 5th place on the leaderboard with a score of 66.8861, demonstrating strong performance in bridging visual perception and spatial understanding in industrial environments. We further present an 80M-parameter variant with expanded MoE capacity, which demonstrates improved performance on spatial reasoning tasks.

Architecture Design (Transformers, SSMs, MoE)Multimodal Models Reasoning & Chain-of-Thought

Citation Metrics

Citations0

Influential citations0

References28

Year2025

VenuearXiv.org

Related Papers

Finding related papers...

Search

TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints

Related Papers