This paper introduces VL-DINO, an open-vocabulary object detector that extends the DINO framework with CLIP's vision-language knowledge. A Query-guided Positive Sample Construction (QPSC) module builds additional high-quality positive samples, which lets DINO train on mixed, heterogeneous data sources and supplies more vision-language alignment signals. A Visual Semantic Encoder (VSE) distills CLIP's visual knowledge into the backbone features, and an Object-Region Semantic Alignment (ORSA) module aligns object-centric region features with their text embeddings. In zero-shot evaluation on the LVIS benchmark, VL-DINO-T and VL-DINO-L reach 36.3 and 38.1 AP, respectively, outperforming prior approaches.
By integrating CLIP's textual and visual knowledge into DINO, VL-DINO outperforms prior approaches in zero-shot object detection, with VL-DINO-L reaching 38.1 AP on the LVIS benchmark.
Vision-language models like CLIP can provide rich semantic priors for open-vocabulary object detection. However, jointly integrating both textual and visual knowledge into detection architectures remains challenging. In this paper, we propose VL-DINO, an open-vocabulary detector that enhances DINO through more effective exploitation of CLIP's vision-language knowledge. Specifically, a Query-guided Positive Sample Construction (QPSC) module is first developed to construct additional high-quality positive samples, enabling the vanilla DINO framework to better accommodate mixed training across heterogeneous data sources while providing more vision-language alignment signals, thereby incorporating richer textual knowledge during training. A Visual Semantic Encoder (VSE) module is then introduced to distill CLIP visual knowledge into backbone-extracted features, producing fused features for subsequent encoder refinement. Based on the fused features, an Object-Region Semantic Alignment (ORSA) module extracts object-centric region features and aligns them with the corresponding textual embeddings, further incorporating textual cues. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively, consistently outperforming prior advanced approaches. Extensive experiments demonstrate the effectiveness and competitive performance of the proposed design.
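The abstract describes aligning object-centric region features with textual embeddings but gives no implementation details. As a rough illustration only, the sketch below shows a generic CLIP-style region-to-text alignment: cosine similarity between L2-normalized region features and class text embeddings, scaled by a temperature and trained with cross-entropy. This is not the paper's ORSA module. The function names, the temperature value of 0.07, the feature dimension, and the toy data are all assumptions made for the example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def region_text_logits(region_feats, text_embeds, temperature=0.07):
    """Cosine similarity between every region and every class prompt.

    region_feats: (num_regions, dim) pooled region features (hypothetical input)
    text_embeds:  (num_classes, dim) text embeddings, one per class name
    temperature:  assumed value, not taken from the paper
    """
    r = l2_normalize(region_feats)
    t = l2_normalize(text_embeds)
    return (r @ t.T) / temperature


def alignment_loss(logits, labels):
    """Mean cross-entropy over regions, using a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


# Toy data: 3 class embeddings and 4 regions that are noisy copies of their class.
rng = np.random.default_rng(0)
text_embeds = np.eye(3, 8)
labels = np.array([2, 0, 1, 2])
region_feats = text_embeds[labels] + 0.05 * rng.normal(size=(4, 8))

logits = region_text_logits(region_feats, text_embeds)
pred = logits.argmax(axis=1)  # open-vocabulary class prediction per region
loss = alignment_loss(logits, labels)
```

Because the class set enters only through `text_embeds`, swapping in embeddings for new category names changes the label space without retraining the scoring function, which is the usual motivation for this kind of alignment in open-vocabulary detection.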