HKU | Shenzhen University | Apr 21, 2026 | arXiv:2604.19318

Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes

Qi Zhang, Jixuan Chen, Kaiyi Zhang, Xinquan Yu, Antoni B. Chan, Hui Huang

AI Summary

This paper introduces MVTrackTrans, a Transformer-based multi-view crowd tracking model that leverages interactions between camera views and the ground plane to improve tracking performance. To facilitate evaluation in more realistic scenarios, the authors also present two new large-scale multi-view tracking datasets, MVCrowdTrack and CityTrack, which feature larger scene sizes and longer time periods than existing datasets. Experiments on these new datasets demonstrate that MVTrackTrans outperforms existing CNN-based methods, highlighting its effectiveness in handling complex, large-scale scenes.

Key Contribution

Transformer-based architectures can now outperform CNNs in multi-view crowd tracking, especially in large, complex real-world scenes, thanks to a novel view-ground interaction mechanism.

Abstract

Multi-view crowd tracking estimates each person's trajectory on the ground plane of the scene. Recent work mainly relies on CNN-based multi-view crowd tracking architectures, and most methods are evaluated and compared on relatively small datasets such as Wildtrack and MultiviewX. Because these two datasets were collected in small scenes and contain only tens of frames in the evaluation stage, it is difficult to apply current methods to real-world applications, where scenes are larger and occlusion is more complicated. In this paper, we propose a Transformer-based multi-view crowd tracking model, MVTrackTrans, which adopts interactions between camera views and the ground plane for enhanced multi-view tracking performance. In addition, for better evaluation, we collect and label two large real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, which cover much larger scenes over longer time periods. Compared with existing methods on these two new, large datasets, the proposed MVTrackTrans model achieves better performance, demonstrating the advantages of the model design in dealing with large scenes. We believe the proposed datasets and model will push the frontiers of the task toward more practical scenarios. The datasets and code are available at: https://github.com/zqyq/MVTrackTrans.

Architecture Design (Transformers, SSMs, MoE) | Computer Vision Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
