Shenzhen University of Advanced Technology | TJU | May 27, 2026 | arXiv:2605.28271

LV-OSD: Language-Vision-Complementary Open-Set Object Detection

Yupeng Zhang, Ruize Han, Wei Feng, Song Wang, Liang Wan

AI Summary

The paper introduces a new open-set object detection problem, Language-Vision-Complementary Open-Set Detection (LV-OSD), in which text prompts, image prompts, or both specify the object categories. To address it, the authors propose a dual-branch detection framework, LVDor, which incorporates a Target-guided Prompt Dynamic Weighting (TPDW) module to bridge the semantic gap among the input image, text prompts, and image prompts. Experiments support the effectiveness of LVDor and the validity of the LV-OSD problem formulation.

Key Contribution

Object detection gets a flexible upgrade: now you can specify objects with text *and* images, opening the door to more intuitive and practical real-world applications.

Abstract

Object detection is an important task in computer vision that aims to detect objects of interest specified by a given category list or by query images. In this work, we propose a new problem, language-vision-complementary open-set object detection (LV-OSD), i.e., using flexible text-based and/or image-based prompts to specify the desired object categories. This setting is more common and practical in real-world applications. For this purpose, we design a dual-branch detection framework, LVDor, which can accept text and image prompts simultaneously. Specifically, we first build the Multi-modal Prompts (MPr), which contain various text descriptions and image samples for each category. Then, to bridge the semantic gap among the input image, text prompts, and image prompts, we design a Target-guided Prompt Dynamic Weighting (TPDW) module. Guided by prior information from the target image, this module dynamically produces the text and image prompts that best align with the target semantics, which reduces the discrepancy between the two modalities and accommodates the LV-OSD setting. We also propose a simple Prompt Random Masking (PRM) mechanism during training to simulate the arbitrary combinations of text and/or image prompts encountered at test time. Extensive experimental results support the soundness of our problem formulation and the effectiveness of our method. Prompts and code will be released publicly.
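The abstract names two mechanisms, PRM and TPDW, without giving implementation details. The following is a minimal illustrative sketch, not the authors' implementation. It assumes PRM can be approximated by randomly dropping one prompt modality per training step, and that target-guided weighting can be approximated by a softmax over dot-product similarities between a target-image feature and each prompt embedding. The function names, the `p_drop` parameter, and the plain-list feature representation are all hypothetical.

```python
import math
import random


def random_mask_prompts(text_prompts, image_prompts, p_drop=0.3, rng=random):
    """Hypothetical PRM stand-in: randomly drop one modality during training.

    Returns (text_or_None, image_or_None). At least one modality is always kept,
    so the detector sees text-only, image-only, and combined prompt settings.
    """
    if not 0.0 <= p_drop <= 0.5:
        raise ValueError("p_drop must be in [0, 0.5]")
    r = rng.random()
    if r < p_drop:
        return None, image_prompts       # image-only step
    if r < 2 * p_drop:
        return text_prompts, None        # text-only step
    return text_prompts, image_prompts   # both modalities


def weight_prompts(target_feat, prompt_feats):
    """Hypothetical target-guided weighting (an assumption, not the paper's TPDW).

    Scores each prompt embedding by its dot product with the target-image
    feature, applies a softmax, and returns (weights, weighted-sum embedding).
    """
    scores = [sum(t * p for t, p in zip(target_feat, feat)) for feat in prompt_feats]
    top = max(scores)                                # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(target_feat)
    fused = [sum(w * feat[i] for w, feat in zip(weights, prompt_feats))
             for i in range(dim)]
    return weights, fused


if __name__ == "__main__":
    rng = random.Random(0)
    text, image = ["a photo of a dog"], ["dog_crop.png"]  # placeholder prompts
    print(random_mask_prompts(text, image, rng=rng))
    print(weight_prompts([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

In this sketch, prompts more similar to the target feature receive larger weights, and the masking step exposes the model to every prompt combination it could encounter at test time. A real system would operate on learned tensor embeddings rather than Python lists.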

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
