SJTU | Mar 10, 2026 | arXiv:2603.09896

Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong

AI Summary

The authors introduce CourtSI, a large-scale dataset of over 1M QA pairs for benchmarking spatial intelligence in VLMs within sports scenarios, covering spatial counting, distance measurement, localization, and relational reasoning. They also present CourtSI-Bench, a high-quality evaluation benchmark with 3,686 QA pairs, and evaluate 25 VLMs, revealing a performance gap compared to humans and limited generalization from existing benchmarks. Fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points and enhances generalization to unseen sports and spatial-aware commentary generation.

Key Contribution

VLMs still struggle to grasp spatial relationships in dynamic sports scenes, as evidenced by a new benchmark revealing a significant human-AI performance gap.

Abstract

Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.
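To make the idea of court geometry as a metric anchor concrete, here is a minimal sketch of one common technique: fitting a planar homography from four court corners to known court dimensions, then measuring ground-plane distances in meters. This is an illustration under stated assumptions, not the authors' data engine, whose details are not given in the abstract. The pixel coordinates and the helper names `fit_homography` and `to_court` are hypothetical; the only real-world constants are the standard badminton doubles court dimensions (6.1 m by 13.4 m).

```python
import numpy as np

# Standard badminton doubles court size in meters (width, length).
COURT_W, COURT_L = 6.1, 13.4


def fit_homography(src, dst):
    """Estimate the 3x3 homography mapping src points to dst points (DLT)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of the stacked constraint matrix.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]


def to_court(h, pt):
    """Map an image pixel lying on the court plane to court coordinates (m)."""
    x, y, w = h @ np.array([pt[0], pt[1], 1.0])
    return np.array([x / w, y / w])


# Hypothetical pixel positions of the four court corners in one frame.
pixel_corners = [(400, 300), (880, 300), (1180, 700), (100, 700)]
court_corners = [(0, 0), (COURT_W, 0), (COURT_W, COURT_L), (0, COURT_L)]

H = fit_homography(pixel_corners, court_corners)

# Hypothetical foot-contact pixels of two players (assumed on the court plane).
player_a = to_court(H, (500, 400))
player_b = to_court(H, (800, 600))
distance_m = float(np.linalg.norm(player_a - player_b))
print(f"Estimated ground distance: {distance_m:.2f} m")
```

A homography of this kind is only valid for points on the court plane, such as foot contact points. Measuring the height of a shuttlecock or a raised racket would need a fuller 3D reconstruction, which is presumably why the paper describes reconstructing whole scenes rather than a single planar mapping.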

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
