Mar 3, 2026 · arXiv:2603.02924

HDINO: A Concise and Efficient Open-Vocabulary Detector

Hao Zhang, Yiqun Wang, Qi Lin, Qinran Lin, Run-Ze Fan, Yong Li

AI Summary

HDINO is an open-vocabulary object detector that removes the need for manually curated datasets and resource-intensive layer-wise cross-modal feature extraction by using a two-stage training strategy built upon the DINO model. The first stage treats noisy samples as additional positives to build a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, and adds a Difficulty Weighted Classification Loss (DWCL) to mine hard examples. The second stage applies a lightweight feature fusion module to the aligned representations. With a Swin Transformer-T backbone, HDINO-T reaches 49.2 mAP on COCO using 2.2M training images and no grounding data; after fine-tuning on COCO, HDINO-T and HDINO-L reach 56.4 mAP and 59.2 mAP, respectively.

Key Contribution

HDINO-T outperforms the open-vocabulary object detectors Grounding DINO-T and T-Rex2 on COCO (by 0.8 mAP and 2.8 mAP, respectively) while training on 2.2M images instead of 5.4M and 6.5M, and without manual data curation.
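The comparison can be made concrete with a little arithmetic on the figures reported in the abstract. The snippet below is only a worked example: the baseline scores it prints are implied by the stated margins rather than quoted directly, and the variable names are illustrative.

```python
# Figures as reported in the abstract (mAP on COCO, training images in millions).
hdino_t_map = 49.2
margin_map = {"Grounding DINO-T": 0.8, "T-Rex2": 2.8}
train_images_m = {"HDINO-T": 2.2, "Grounding DINO-T": 5.4, "T-Rex2": 6.5}

# Baseline mAP implied by "HDINO-T surpasses X by N mAP".
implied_map = {name: round(hdino_t_map - m, 1) for name, m in margin_map.items()}

# Fraction of each baseline's training set that HDINO-T uses.
data_ratio = {
    name: round(train_images_m["HDINO-T"] / train_images_m[name], 2)
    for name in margin_map
}

print(implied_map)  # {'Grounding DINO-T': 48.4, 'T-Rex2': 46.4}
print(data_ratio)   # {'Grounding DINO-T': 0.41, 'T-Rex2': 0.34}
```

In other words, HDINO-T uses roughly 41% and 34% of the training images used by the two baselines.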

Abstract

Despite the growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or grounding data, surpassing Grounding DINO-T and T-Rex2, which are trained on 5.4M and 6.5M images, by 0.8 mAP and 2.8 mAP, respectively. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, respectively, highlighting the effectiveness and scalability of our approach. Code and models are available at https://github.com/HaoZ416/HDINO.
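The two first-stage ideas can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the HDINO code: the abstract does not say how noisy samples are generated or how difficulty is measured, so the sketch jitters a ground-truth box to obtain several positives that share one text label (one-to-many), and uses 1 - IoU with the ground truth as a stand-in for "initial detection difficulty". The function names `noisy_positives` and `difficulty_weighted_loss` are hypothetical.

```python
import math
import random


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def noisy_positives(gt_box, label, k, scale, rng):
    """Assumed O2M-style sampling: k jittered copies of one box, all given the
    same text label, so one label is aligned with many visual samples.
    Keep scale below 0.5 so the jittered boxes stay well-formed."""
    x1, y1, x2, y2 = gt_box
    w, h = x2 - x1, y2 - y1
    samples = []
    for _ in range(k):
        box = (
            x1 + rng.uniform(-scale, scale) * w,
            y1 + rng.uniform(-scale, scale) * h,
            x2 + rng.uniform(-scale, scale) * w,
            y2 + rng.uniform(-scale, scale) * h,
        )
        samples.append((box, label))
    return samples


def difficulty_weighted_loss(prob, difficulty):
    """Assumed weighting: cross-entropy for a positive sample, scaled by
    (1 + difficulty) so harder samples contribute more. prob must be in (0, 1]."""
    return -(1.0 + difficulty) * math.log(prob)


rng = random.Random(0)
gt = (10.0, 10.0, 50.0, 90.0)
positives = noisy_positives(gt, "dog", k=4, scale=0.2, rng=rng)

# Same predicted probability for every sample; only the difficulty differs.
losses = [difficulty_weighted_loss(0.6, 1.0 - iou(box, gt)) for box, _ in positives]
for (box, label), loss in zip(positives, losses):
    print(label, [round(v, 1) for v in box], round(loss, 3))
```

With the predicted probability held fixed, samples that overlap the ground truth less receive a larger loss, which is the general effect a difficulty-weighted loss aims for. The actual formulation used by HDINO may differ.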

Computer Vision · Multimodal Models · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 44

Year: 2026

Venue: N/A
