The paper introduces Reward-Guided Semantic Evolution (RGSE), a training-free test-time adaptation method for open-vocabulary object detection that addresses semantic misalignment in VLMs under distribution shift. RGSE refines text embeddings by perturbing them, evaluating the perturbations using cosine similarity with high-confidence visual proposals, and fusing them via reward-weighted averaging. Experiments show RGSE achieves state-of-the-art performance on multiple detection benchmarks with minimal overhead, without requiring backpropagation.
Forget training, just nudge your text embeddings: RGSE closes the open-vocabulary object detection gap under distribution shift by directly and efficiently adapting text embeddings at test time.
Open-vocabulary object detection with vision-language models (VLMs) such as Grounding DINO suffers from performance degradation under test-time distribution shifts, primarily due to semantic misalignment between text embeddings and the shifted visual embeddings of region proposals. Recent test-time adaptive object detection methods for VLM-based detectors either rely on costly backpropagation or bypass semantic misalignment via external memory; none directly and efficiently aligns text and vision in a training-free manner. To address this, we propose Reward-Guided Semantic Evolution (RGSE), a training-free framework that directly refines the text embeddings at test time. Inspired by evolutionary search, RGSE treats text embedding adaptation as a semantic search process: it perturbs text embeddings to generate candidate variants, scores each candidate by its cosine similarity with current and historical high-confidence visual proposals as a reward signal, and fuses the candidates into a refined embedding through reward-weighted averaging. Without any backpropagation, RGSE achieves state-of-the-art performance across multiple detection benchmarks while adding minimal computational overhead. Our code will be open-sourced upon publication.
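The perturb, score, and fuse loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `refine_text_embedding`, the Gaussian perturbation, the softmax reward weighting, and the parameters `num_candidates`, `sigma`, and `temperature` are all assumptions made for illustration. The abstract specifies only perturbation, a cosine-similarity reward, and reward-weighted averaging. The sketch also assumes that the high-confidence visual proposals (current and historical) have already been selected and are passed in as `visual_bank`.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def refine_text_embedding(text_emb, visual_bank, num_candidates=16,
                          sigma=0.05, temperature=0.05, rng=None):
    """Hypothetical sketch of one gradient-free refinement step.

    text_emb:    text embedding for one category (list of floats).
    visual_bank: embeddings of high-confidence visual proposals
                 (assumed to be selected elsewhere).
    """
    rng = rng or random.Random(0)
    base = normalize(text_emb)
    if not visual_bank:
        return base  # nothing to align against

    # 1. Perturb: the original embedding plus noisy variants (assumed Gaussian).
    candidates = [base]
    for _ in range(num_candidates):
        noisy = [x + rng.gauss(0.0, sigma) for x in base]
        candidates.append(normalize(noisy))

    # 2. Score: reward is the mean cosine similarity to the visual proposals.
    rewards = [
        sum(cosine(c, v) for v in visual_bank) / len(visual_bank)
        for c in candidates
    ]

    # 3. Fuse: softmax over rewards (an assumed weighting), then weighted average.
    top = max(rewards)
    exps = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [
        sum(w * c[i] for w, c in zip(weights, candidates))
        for i in range(len(base))
    ]
    return normalize(fused)


# Usage: nudge a toy 3-d text embedding toward one visual proposal.
refined = refine_text_embedding([1.0, 0.0, 0.0], [[0.8, 0.6, 0.0]])
```

In this sketch, a lower `temperature` concentrates the weights on the highest-reward candidates, while a higher value approaches a plain average of all candidates. No gradients are computed at any point, which matches the training-free setting described above.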