This paper addresses single-domain generalized object detection (SDGOD), where detectors trained on one domain fail in unseen environments due to domain shifts. The authors trace the performance degradation to unstable object-background and inter-instance relations in the encoding stage and to weakened semantic-spatial alignment in the decoding stage. To mitigate this, they propose VFM$^{4}$SDG, a dual-prior learning framework that leverages a frozen vision foundation model (VFM) to inject transferable cross-domain stability priors into both the encoding and decoding stages, improving generalization performance.
Frozen vision foundation models can be surprisingly effective at improving out-of-domain object detection by stabilizing relational modeling and semantic-spatial alignment in the detector.
In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, so detectors trained on a single source domain degrade severely in unseen environments. Existing single-domain generalized object detection (SDGOD) methods rely mainly on data augmentation or domain-invariant representation learning, but pay limited attention to the detector's own mechanisms and thus remain limited under complex domain shifts. Through analytical experiments, we find that performance degradation is dominated by increasing missed detections, a failure that fundamentally arises from reduced cross-domain stability of the detector: object-background and inter-instance relations become less stable in the encoding stage, while semantic-spatial alignment of query representations becomes harder to maintain in the decoding stage. To address this, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD that introduces a frozen vision foundation model (VFM) as a transferable cross-domain stability prior for detector representation learning and query modeling. In the encoding stage, we propose Cross-domain Stable Relational Prior Distillation to enhance the robustness of object-background and inter-instance relational modeling. In the decoding stage, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category-level semantic prototypes and global visual context into queries to improve their semantic recognition and spatial localization stability in unseen domains. Extensive experiments show that the proposed method consistently outperforms existing state-of-the-art methods on standard SDGOD benchmarks with two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.
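To make the two priors concrete, below is a minimal PyTorch sketch of the general pattern the abstract describes: a relation-matrix distillation loss that pulls the detector encoder's token-similarity structure toward that of a frozen VFM, and a query-enhancement step that mixes category prototypes and a global context vector into decoder queries. All function names, tensor shapes, and the specific similarity and loss choices here are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def relational_prior_distillation(det_feats, vfm_feats):
    # Encoding-stage prior (illustrative): match the pairwise
    # token-similarity (relation) matrix of the detector encoder to
    # that of the frozen VFM, so object-background and inter-instance
    # relations inherit the VFM's cross-domain stability.
    # det_feats, vfm_feats: (B, N, C); VFM features assumed already
    # projected to the detector's channel dim and detached (frozen).
    det = F.normalize(det_feats, dim=-1)
    vfm = F.normalize(vfm_feats.detach(), dim=-1)
    rel_det = det @ det.transpose(1, 2)  # (B, N, N) relation matrices
    rel_vfm = vfm @ vfm.transpose(1, 2)
    return F.mse_loss(rel_det, rel_vfm)

def semantic_contextual_query_enhancement(queries, prototypes, global_ctx):
    # Decoding-stage prior (illustrative): attend decoder queries over
    # category-level semantic prototypes, then add the attended
    # semantics and a global visual context vector to each query.
    # queries: (B, Q, C); prototypes: (K, C); global_ctx: (B, C).
    scale = queries.shape[-1] ** 0.5
    attn = torch.softmax(queries @ prototypes.T / scale, dim=-1)  # (B, Q, K)
    semantics = attn @ prototypes                                 # (B, Q, C)
    return queries + semantics + global_ctx.unsqueeze(1)

# Toy shapes to show the pieces fit together.
B, N, Q, K, C = 2, 100, 30, 8, 256
loss = relational_prior_distillation(torch.randn(B, N, C), torch.randn(B, N, C))
q = semantic_contextual_query_enhancement(
    torch.randn(B, Q, C), torch.randn(K, C), torch.randn(B, C))
```

In training, a loss like this would presumably be weighted against the standard detection objective; how the prototypes and global context are actually extracted from the VFM is specified in the paper and goes beyond this sketch.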