Chongqing | HKU | Jun 10, 2026 | arXiv:2606.11546

VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection

AI Summary

This paper introduces VL-DINO, an open-vocabulary object detector that extends the DINO framework with CLIP's vision-language knowledge. A Query-guided Positive Sample Construction (QPSC) module constructs additional high-quality positive samples so that DINO can train on mixed, heterogeneous data sources; a Visual Semantic Encoder (VSE) distills CLIP's visual knowledge into the backbone features; and an Object-Region Semantic Alignment (ORSA) module aligns object-centric region features with their text embeddings. In zero-shot evaluation on the LVIS benchmark, VL-DINO-T and VL-DINO-L reach 36.3 and 38.1 AP, respectively, outperforming the prior approaches they are compared against.

Key Contribution

VL-DINO integrates CLIP's textual and visual knowledge into DINO through the QPSC, VSE, and ORSA modules. In the zero-shot setting, VL-DINO-L reaches 38.1 AP on the LVIS benchmark, outperforming prior approaches.

Abstract

Vision-language models like CLIP can provide rich semantic priors for open-vocabulary object detection. However, jointly integrating both textual and visual knowledge into detection architectures remains challenging. In this paper, we propose VL-DINO, an open-vocabulary detector that enhances DINO through more effective exploitation of CLIP's vision-language knowledge. Specifically, a Query-guided Positive Sample Construction (QPSC) module is first developed to construct additional high-quality positive samples, enabling the vanilla DINO framework to better accommodate mixed training across heterogeneous data sources while providing more vision-language alignment signals, thereby incorporating richer textual knowledge during training. A Visual Semantic Encoder (VSE) module is then introduced to distill CLIP visual knowledge into backbone-extracted features, producing fused features for subsequent encoder refinement. Based on the fused features, an Object-Region Semantic Alignment (ORSA) module extracts object-centric region features and aligns them with the corresponding textual embeddings, further incorporating textual cues. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively, consistently outperforming prior advanced approaches. Extensive experiments demonstrate the effectiveness and competitive performance of the proposed design.
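The abstract does not give the exact form of the region-text alignment used by ORSA. As a rough illustration only, the sketch below shows one common way to align region features with text embeddings: cosine similarity followed by a softmax cross-entropy over categories. The function names, the temperature value, and the loss form are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def region_text_alignment_loss(region_feats, text_embeds, labels, temperature=0.07):
    """Hypothetical alignment loss (an assumption, not the paper's ORSA module).

    region_feats: (num_regions, dim) object-centric region features
    text_embeds:  (num_classes, dim) text embeddings, one per category name
    labels:       (num_regions,) index of the matching category for each region
    Returns the mean cross-entropy loss and the (num_regions, num_classes) logits.
    """
    regions = l2_normalize(np.asarray(region_feats, dtype=np.float64))
    texts = l2_normalize(np.asarray(text_embeds, dtype=np.float64))
    labels = np.asarray(labels)

    # Temperature-scaled cosine similarity between every region and every category.
    logits = regions @ texts.T / temperature

    # Numerically stable log-softmax over the category axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Cross-entropy: negative log-probability of the correct category per region.
    loss = -log_probs[np.arange(labels.shape[0]), labels].mean()
    return loss, logits
```

Under these assumptions, open-vocabulary inference reduces to swapping in text embeddings for new category names and taking the highest-scoring category for each region.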

Computer Vision, Multimodal Models


Year: 2026

Venue: N/A
