This paper investigates the use of text-to-image diffusion models for object detection, leveraging the semantic and positional information encoded in their intermediate layers. The authors introduce scene text prompts to inject additional semantic information and concatenate cross-attention maps with diffusion features to enhance representation learning. Experiments show that this approach, termed Diff-OD, outperforms object detectors built on ResNet101 and CLIP backbones, and matches the performance of ConvNeXt.
Text-to-image diffusion models, typically used for generation, can surprisingly serve as strong feature extractors for object detection, rivaling dedicated visual backbones like ConvNeXt.
Diffusion models have recently made revolutionary breakthroughs in generating images from text, which has prompted many works on transferring the implicit knowledge in diffusion models to visual tasks. Although these studies have found that the visual representations in the intermediate layers of diffusion models are discriminative, they mainly focus on tasks such as semantic segmentation and depth estimation, and have not yet explored the fundamental visual task of object detection. Motivated by these findings, we set out to explore the potential of diffusion model features for object detection. We find that the visual features of diffusion models are both semantic and positional, achieving excellent recognition results on object detection tasks. Building on this, we design scene text prompts that introduce additional semantic information to the diffusion model to further enhance its detection ability. Since there is a strong positional connection between the conditional text prompt and the feature map in a pre-trained text-to-image diffusion model, we concatenate the cross-attention maps with the diffusion features to further strengthen the representation. Our method surpasses methods that use ResNet101 and CLIP as feature extractors, and achieves performance comparable to ConvNeXt, currently the most advanced visual backbone.
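The core operation described in the abstract, deriving cross-attention maps between text tokens and spatial features and concatenating them channel-wise with the diffusion features, can be sketched with toy tensors. This is a minimal illustration only: the shapes, projection, and random inputs are assumptions, and the paper's actual pipeline operates on intermediate layers of a pre-trained text-to-image diffusion UNet rather than random arrays.

```python
import numpy as np

# Illustrative shapes (not from the paper): C feature channels over an
# H x W spatial grid, and n_tok text-prompt tokens projected to dim C.
rng = np.random.default_rng(0)
C, H, W, n_tok = 8, 4, 4, 3

feats = rng.standard_normal((C, H, W))       # intermediate diffusion features
queries = feats.reshape(C, H * W).T          # (H*W, C) spatial queries
keys = rng.standard_normal((n_tok, C))       # (n_tok, C) projected token keys

# Cross-attention: for each spatial location, a softmax over text tokens.
logits = queries @ keys.T / np.sqrt(C)       # (H*W, n_tok)
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

# Reshape the per-token attention maps to (n_tok, H, W) and concatenate
# them with the feature map along the channel axis.
attn_maps = attn.T.reshape(n_tok, H, W)
enhanced = np.concatenate([feats, attn_maps], axis=0)
print(enhanced.shape)  # (11, 4, 4) -> C + n_tok channels
```

The concatenated tensor then serves as the input representation for the detection head; each extra channel is a spatial heat map indicating where one prompt token attends.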