This paper addresses two limitations of prototype-based similarity learning in few-shot object detection, class confusion and imprecise localization, with two components: Text-Anchored Semantic Mask (TSMa) and Stage-Aligned Hierarchical Autoregressive Regression (SHARe). TSMa uses class-level text features as semantic anchors to select the visual channels that carry class-intrinsic signal, which enlarges inter-class similarity margins and reduces class confusion. SHARe reformulates bounding box localization as a hierarchical autoregressive process that aligns feature abstraction levels with regression stages. Together, the two components set a new state of the art on the COCO dataset, outperforming the previous best by +10.1 nAP.
Reducing class confusion and refining localization stage by stage yields a +10.1 nAP improvement over the previous best on COCO.
Few-shot object detection aims to detect novel object categories from only a few labeled examples, avoiding costly large-scale annotation. Recent prototype-based similarity learning approaches enable training-free adaptation by matching query features with class prototypes. However, they suffer from two fundamental limitations: (i) class confusion arising from inter-class similarity margin collapse, and (ii) insufficient visual cues for precise localization, as similarity scores capture only class-level semantic affinity while providing limited spatial information. To address these issues, we introduce two complementary components. Text-Anchored Semantic Mask (TSMa) leverages class-level text features as semantic anchors to identify semantically aligned channels through channel-wise interaction between visual and text features. By suppressing style-induced spurious responses and emphasizing class-intrinsic signals, TSMa enlarges inter-class similarity margins and mitigates class confusion. We further propose Stage-Aligned Hierarchical Autoregressive Regression (SHARe), which reformulates localization as a hierarchical autoregressive process that progressively refines bounding boxes across multiple stages. SHARe leverages the layer-wise characteristics of ViT representations by aligning feature abstraction levels with regression stages: deeper layers guide early coarse localization, while shallower layers rich in edge and texture cues refine spatial details in later stages. Experiments on COCO demonstrate a new state of the art, outperforming the previous best by +10.1 nAP, with extensive analysis validating each component. The code is available at https://github.com/VisualScienceLab-KHU/ReSet.
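The two ideas in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names, the top-k channel selection, the `keep_ratio` parameter, and the assumption that visual and text features share one embedding dimension are all hypothetical choices made for illustration. The first part keeps only the channels where a visual prototype agrees with its class text anchor before computing cosine similarity. The second part refines a box over several stages, feeding the deepest layer's features to the first stage and shallower features to later stages, with each stage conditioned on the previous box.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale a 1-D feature vector to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def text_anchored_mask(prototype, text_feat, keep_ratio=0.5):
    """Hypothetical channel mask: keep the channels where the visual
    prototype and the class text anchor agree most (element-wise product)."""
    agreement = l2_normalize(prototype) * l2_normalize(text_feat)
    k = max(1, int(round(keep_ratio * agreement.size)))
    keep = np.argsort(agreement)[-k:]          # indices of the top-k channels
    mask = np.zeros_like(agreement)
    mask[keep] = 1.0
    return mask

def masked_similarity(query, prototype, mask):
    """Cosine similarity computed only over the channels kept by the mask."""
    return float(l2_normalize(query * mask) @ l2_normalize(prototype * mask))

def stagewise_refine(init_box, layer_feats, heads):
    """Hypothetical coarse-to-fine loop: layer_feats is ordered shallow to
    deep, so it is reversed to give the deepest features to the first stage.
    Each head sees the current box, which makes the process autoregressive."""
    box = np.asarray(init_box, dtype=float)
    for feat, head in zip(layer_feats[::-1], heads):
        box = box + head(feat, box)            # head returns a box offset
    return box

# Toy data: channels 0-1 are class-specific, channels 2-3 are shared "style".
proto_cat = np.array([1.0, 0.2, 2.0, 2.0])
proto_dog = np.array([0.2, 1.0, 2.0, 2.0])
text_cat = np.array([1.0, 0.3, -0.1, -0.2])
text_dog = np.array([0.3, 1.0, -0.1, -0.2])
query = np.array([0.9, 0.1, 2.1, 1.9])         # a cat-like query feature

ones = np.ones(4)
mask_cat = text_anchored_mask(proto_cat, text_cat)
mask_dog = text_anchored_mask(proto_dog, text_dog)

# Margin between the correct and the confusable class, without and with masking.
margin_raw = (masked_similarity(query, proto_cat, ones)
              - masked_similarity(query, proto_dog, ones))
margin_masked = (masked_similarity(query, proto_cat, mask_cat)
                 - masked_similarity(query, proto_dog, mask_dog))
print(f"raw margin: {margin_raw:.3f}, masked margin: {margin_masked:.3f}")
```

In this toy setup the shared style channels dominate the raw cosine similarity, so both classes score high and the margin is small. Masking them out leaves only the class-specific channels and widens the margin, which is the effect the abstract attributes to TSMa.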