The paper introduces a new open-set object detection problem, Language-Vision-Complementary Open-Set Detection (LV-OSD), in which object categories are specified with text prompts, image prompts, or both. To address it, the authors propose a dual-branch detection framework, LVDor, which incorporates a Target-guided Prompt Dynamic Weighting (TPDW) module to bridge the semantic gap among the input image, the text prompts, and the image prompts. The reported experiments support both the effectiveness of LVDor and the validity of the LV-OSD problem formulation.
Object detection gets a flexible upgrade: now you can specify objects with text *and* images, opening the door to more intuitive and practical real-world applications.
Object detection is an important task in computer vision, which aims to detect the objects of interest through a given category list or query images. In this work, we propose a new problem of language-visual-complementary open-set object detection (LV-OSD), i.e., using flexible text-based and/or image-based prompts to specify the desired object categories. This setting is more common and practical in real-world applications. For this purpose, we design a dual-branch detection framework, LVDor, which can simultaneously accept both text and image prompts. Specifically, we first build the Multi-modal Prompts (MPr) containing various text descriptions and image samples for each category. Subsequently, to bridge the semantic gap among the input image, text prompts, and image prompts, we design a Target-guided Prompt Dynamic Weighting (TPDW) module. Guided by the prior information of the target image, this module dynamically produces the text and image prompts that best align with the target semantics, achieving precise alignment and effectively reducing the discrepancy between the two modalities, thereby accommodating the LV-OSD setting. We also propose a simple Prompt Random Masking (PRM) mechanism during training to simulate the arbitrary combination of text and/or image prompts at test time. Extensive experimental results support the soundness of our problem formulation and the effectiveness of our method. Prompts and code will be released publicly.
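The abstract does not give implementation details for TPDW or PRM, so the following is only a minimal sketch of the two ideas under stated assumptions, not the authors' method. It assumes prompts are already encoded as plain embedding vectors, that "dynamic weighting" can be approximated by a softmax over cosine similarities to a target-image feature, and that masking drops at most one modality per training step. The function names (`random_mask_prompts`, `weight_prompts`) and the parameters `p_drop` and `temperature` are hypothetical.

```python
import math
import random


def random_mask_prompts(text_prompts, image_prompts, p_drop=0.3, rng=None):
    """Hypothetical PRM-style masking: randomly drop one modality, never both.

    Assumption: with probability p_drop keep only image prompts, with
    probability p_drop keep only text prompts, otherwise keep both.
    """
    if rng is None:
        rng = random
    # If a modality is already missing, leave the inputs untouched so that
    # at least one prompt type always survives.
    if not text_prompts or not image_prompts:
        return list(text_prompts), list(image_prompts)
    r = rng.random()
    if r < p_drop:
        return [], list(image_prompts)       # image-only
    if r < 2 * p_drop:
        return list(text_prompts), []        # text-only
    return list(text_prompts), list(image_prompts)  # both modalities


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if degenerate)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weight_prompts(target_feature, prompt_embeddings, temperature=0.1):
    """Hypothetical target-guided weighting: softmax over similarity to the target.

    Returns the weighted average of the prompt embeddings, or None if the
    modality was masked out (empty list).
    """
    if not prompt_embeddings:
        return None
    scores = [cosine(target_feature, p) / temperature for p in prompt_embeddings]
    top = max(scores)                         # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(prompt_embeddings[0])
    return [
        sum(w * p[i] for w, p in zip(weights, prompt_embeddings))
        for i in range(dim)
    ]
```

In this sketch, a training step would first call `random_mask_prompts` on a category's text and image prompt embeddings, then call `weight_prompts` once per surviving modality with the target-image feature. A real detector would replace the softmax heuristic with learned layers operating on encoder features.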