UTokyo | Apr 9, 2026 | arXiv:2604.08337

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

Ashutosh Kumar, Rajat Saini, Jingjing Pan, Mustafa Erdogan, Mingfang Zhang, Betty Le Dem, Norimasa Kobori, Quan Kong

AI Summary

The paper introduces InstAP, a novel vision-language pre-training framework that enhances instance-level reasoning by incorporating fine-grained, instance-level contrastive alignment between textual mentions and spatial-temporal regions. To facilitate this, the authors created InstVL, a large-scale dataset with both holistic scene captions and dense, grounded instance descriptions. Experiments show that InstAP significantly outperforms existing VLP models on instance-level retrieval and achieves competitive zero-shot performance on video benchmarks, demonstrating the benefits of instance-aware pre-training for both local and global understanding.

Key Contribution

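Current vision-language models struggle with instance-level reasoning because they are supervised only at the global, scene level. InstAP adds an instance-level objective that grounds textual mentions to specific spatial-temporal regions, which improves instance-level retrieval while keeping zero-shot performance on global video benchmarks competitive.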

Abstract

Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.
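The abstract describes a joint objective: one contrastive term aligning whole images or videos with scene captions, and a second aligning grounded regions with their textual mentions. The listing does not give the loss itself, so the following is only a minimal sketch that assumes a symmetric InfoNCE loss for both terms and a simple weighted sum. The function names (`info_nce`, `joint_loss`), the `temperature` default, and the `instance_weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings.

    Row i of `visual` is assumed to match row i of `text`; every other
    row in the batch acts as a negative. All vectors must be non-zero.
    """
    v = [normalize(x) for x in visual]
    t = [normalize(x) for x in text]
    # Cosine similarity between every visual/text pair, scaled by temperature.
    logits = [[dot(a, b) / temperature for b in t] for a in v]

    def cross_entropy(rows):
        # Mean negative log-probability of the diagonal (matching) entry.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    text_to_visual = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(text_to_visual))


def joint_loss(global_v, global_t, inst_v, inst_t, instance_weight=1.0):
    """Global scene-caption term plus a weighted instance-mention term (assumed form)."""
    return info_nce(global_v, global_t) + instance_weight * info_nce(inst_v, inst_t)


# Toy embeddings: two scenes with captions, two regions with mentions.
scenes = [[1.0, 0.0], [0.0, 1.0]]
captions = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0]]
mentions_swapped = [[0.0, 1.0], [1.0, 0.0]]  # mentions attached to the wrong regions

aligned = joint_loss(scenes, captions, regions, captions)
misgrounded = joint_loss(scenes, captions, regions, mentions_swapped)
print(aligned < misgrounded)
```

In this toy case the global term is identical in both calls, so the difference in loss comes entirely from the instance term. That mirrors the abstract's point that global-only supervision cannot penalize a mention grounded to the wrong instance.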

Computer Vision, Data Curation & Synthetic Data, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 72

Year: 2026

Venue: N/A
