Data Space Research Institute; Fudan; Huazhong Agricultural University; Rice; SEU; Zhejiang Normal University

Apr 16, 2026

arXiv:2604.15065

Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection

Yan Zeng, Yangchen Zeng, Zhenyu Yu, Dongming Jiang, Wenbo Zhang, Yifan Hong, Zhanhua Hu, Jiao Luo, Kangning Cui

AI Summary

This paper introduces Heatmap-guided Embedding Learning Paradigm (HELP), a novel positional-semantic fusion framework for small object detection that selectively preserves positional encodings in foreground regions while suppressing background clutter. HELP uses Heatmap-guided Positional Embedding (HPE) to guide noise-suppressed feature encoding and enable high-quality query retrieval by filtering background-dominant embeddings. By integrating Linear-Snake Convolution to address feature sparsity, the method achieves comparable or better accuracy with significantly reduced parameters and decoder layers.

Key Contribution

Suppressing background noise with heatmap-guided positional embeddings slashes transformer detector parameters by 59% without sacrificing accuracy in small object detection.

Abstract

Transformer-based detectors have advanced small-object detection, but they often remain inefficient and vulnerable to background-induced query noise, which motivates deep decoders to refine low-quality queries. We present HELP (Heatmap-guided Embedding Learning Paradigm), a noise-aware positional-semantic fusion framework that studies where to embed positional information by selectively preserving positional encodings in foreground-salient regions while suppressing background clutter. Within HELP, we introduce Heatmap-guided Positional Embedding (HPE) as the core embedding mechanism and visualize it with a heatbar for interpretable diagnosis and fine-tuning. HPE is integrated into both the encoder and decoder: it guides noise-suppressed feature encoding by injecting heatmap-aware positional encoding, and it enables high-quality query retrieval by filtering background-dominant embeddings via a gradient-based mask filter before decoding. To address feature sparsity in complex small targets, we integrate Linear-Snake Convolution to enrich retrieval-relevant representations. The gradient-based heatmap supervision is used during training only, incurring no additional gradient computation at inference. As a result, our design reduces decoder layers from eight to three and achieves a 59.4% parameter reduction (66.3M vs. 163M) while maintaining consistent accuracy gains under a reduced compute budget across benchmarks. Code Repository: https://github.com/yidimopozhibai/Noise-Suppressed-Query-Retrieval
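The two roles the abstract assigns to HPE (scaling positional encoding by foreground saliency, then dropping background-dominant embeddings before decoding) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the heatmap is taken as already given with values in [0, 1], the positional encoding is the standard 1-D sinusoidal form, and the function names, the multiplicative gating rule, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math

def sinusoidal_position(pos, dim):
    """Standard 1-D sinusoidal positional encoding for a single position."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc[:dim]

def heatmap_gated_embedding(features, heatmap):
    """Add positional encoding scaled by saliency (assumed gating rule).

    Positions with saliency near 0 receive almost no positional signal,
    so background locations contribute little to later query matching.
    """
    dim = len(features[0])
    out = []
    for pos, (feat, h) in enumerate(zip(features, heatmap)):
        pe = sinusoidal_position(pos, dim)
        out.append([f + h * p for f, p in zip(feat, pe)])
    return out

def filter_queries(embeddings, heatmap, threshold=0.5):
    """Keep (index, embedding) pairs whose saliency meets the threshold.

    The threshold value is a hypothetical stand-in for the paper's mask filter.
    """
    return [(i, e) for i, (e, h) in enumerate(zip(embeddings, heatmap))
            if h >= threshold]

# Toy input: five positions, four channels, all-zero features.
features = [[0.0] * 4 for _ in range(5)]
heatmap = [0.05, 0.9, 0.1, 0.7, 0.0]  # assumed foreground saliency per position

embedded = heatmap_gated_embedding(features, heatmap)
kept_positions = [i for i, _ in filter_queries(embedded, heatmap)]
print(kept_positions)  # [1, 3]
```

Only the two salient positions survive the filter, so a decoder in this toy setting would receive two candidate queries instead of five. That is the intuition behind the claim that cleaner queries allow a shallower decoder; the sketch does not model the gradient-based heatmap supervision or Linear-Snake Convolution.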

Architecture Design (Transformers, SSMs, MoE); Computer Vision; Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 73

Year: 2026

Venue: N/A
