This paper introduces MVMamba, a novel remote sensing object detection architecture addressing challenges in detecting small, similar targets with imbalanced distributions. MVMamba incorporates a Feature Split-Attention Module (FSAM) to decouple foreground/background features, a Global Convolutional Mamba Module (GCMM) based on MambaV2 with state-space duality for long-range dependencies, and a BiFPN for multiscale feature extraction. Experiments on VisDrone and DIOR datasets demonstrate that MVMamba achieves state-of-the-art performance, with 34.6% and 50.7% mAP@0.5:0.95, respectively.
Mamba's efficient sequence modeling proves a strong fit for remote sensing object detection, outperforming CNN- and Transformer-based detectors on small, cluttered targets.
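To make the "efficient sequence modeling" claim concrete, here is a hedged toy sketch of the linear state-space recurrence that underlies Mamba-style models. It uses a single scalar state with hypothetical constant parameters `a`, `b`, `c`; the paper's GCMM uses MambaV2's state-space duality with input-dependent parameters, which this sketch does not reproduce.

```python
def ssm_scan(x, a=0.9, b=1.0, c=1.0):
    """Toy linear state-space recurrence (scalar state, constant params):

        h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t

    Because h_t accumulates every past input (weighted by a**k), each
    output carries long-range context in O(length) time, in contrast to
    attention's quadratic cost in sequence length.
    """
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt   # recurrent state update
        ys.append(c * h)     # readout
    return ys

# An input impulse at t=0 still influences every later output,
# decaying geometrically as a**t.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0])
```

The recurrence is the key design point: the whole history is compressed into a fixed-size state, which is what lets Mamba-family backbones model long-range dependencies without a Transformer's computational overhead.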
Remote sensing object detection is a crucial task in ground analysis. Object detectors based on convolutional and transformer frameworks have achieved strong performance, but three pressing issues remain: 1) detecting diminutive remote sensing targets with high interclass similarity and an imbalanced foreground–background distribution is difficult; 2) conventional CNN architectures have limited capability to capture long-range dependencies, while transformer frameworks incur substantial computational overhead; and 3) a simple feature pyramid network (FPN) fails to fuse the fine-grained characteristics of small targets. As a result, this work first introduces the feature split-attention module (FSAM), which incorporates max-pooling and spatial-attention mechanisms to decouple foreground and background features while preserving critical edge information. Moreover, we propose the global convolutional Mamba module (GCMM), which leverages the MambaV2 architecture with state-space duality (SSD) mechanisms for global feature extraction, thereby enhancing long-range semantic modeling. Furthermore, a bidirectional FPN (BiFPN) is adopted to strengthen multiscale feature extraction for diminutive targets. These plug-and-play modules can be easily integrated into various object detection architectures. Experimental results demonstrate that our MVMamba model achieves 34.6% and 50.7% mAP@0.5:0.95 on the VisDrone and DIOR remote sensing datasets, respectively, outperforming all other state-of-the-art approaches.
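The FSAM description above (max-pooling plus spatial attention to decouple foreground from background) can be sketched in a minimal, framework-free form. Everything here is an assumption for illustration: the function name, the list-based tensor layout, and the exact gating rule (sigmoid of a channel-wise max map) are hypothetical stand-ins for the paper's module, not its implementation.

```python
import math

def fsam_sketch(feats):
    """Hypothetical FSAM-style foreground emphasis.

    feats: list of C channel maps, each an H x W list of lists of floats.
    Channel-wise max pooling yields one H x W saliency map; a sigmoid of
    that map gates every channel, boosting strongly activated locations
    (likely foreground) and attenuating the background.
    """
    C, H, W = len(feats), len(feats[0]), len(feats[0][0])
    # Channel-wise max pooling: one spatial saliency map.
    max_map = [[max(feats[c][i][j] for c in range(C)) for j in range(W)]
               for i in range(H)]
    # Sigmoid turns the saliency map into a gate in (0, 1).
    gate = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in max_map]
    # Reweight every channel by the shared spatial gate.
    return [[[feats[c][i][j] * gate[i][j] for j in range(W)]
             for i in range(H)] for c in range(C)]

# Tiny 2-channel, 2x2 example: the strongly activated location keeps most
# of its value, while weakly activated background responses are suppressed.
x = [[[2.0, -2.0], [0.0, 0.5]],
     [[1.0, -1.0], [0.0, 0.5]]]
y = fsam_sketch(x)
```

In a real detector this would be a learnable module (e.g. a convolution after the pooled map) inserted into the backbone; the sketch only shows why max-pooled spatial attention separates foreground responses from background clutter.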