Soochow, WHU | Jun 16, 2026 | arXiv:2606.17601

Test-Time Training for Robust Text-Guided Open-Vocabulary Object Counting

Hao-Yuan Ma, Yuda Zou, Li Zhang, Yongchao Xu

AI Summary

This paper introduces Robust-TOOC, a benchmark for evaluating Text-guided Open-vocabulary Object Counting (TOOC) under real-world degradation conditions, including rain, fog, darkness, and sensor noise. To improve robustness without altering the original counting architecture, the authors propose Dual-TTT, a test-time training framework that updates only a lightweight text-guided denoising module while keeping the counting network frozen. Experiments on multiple recent TOOC baselines indicate that Dual-TTT improves counting under these adverse conditions without requiring annotations at test time.
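The sensor-noise degradations can be illustrated with a minimal sketch. It applies Gaussian noise and salt-and-pepper noise to a small grayscale image stored as nested lists. The function names, the `sigma` and `amount` parameters, and their values are assumptions chosen for illustration; they are not the settings used to build Robust-TOOC.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to each pixel, then clip to [0, 255]."""
    return [
        [min(255, max(0, round(px + rng.gauss(0.0, sigma)))) for px in row]
        for row in image
    ]


def add_salt_and_pepper(image, amount, rng):
    """Replace roughly a fraction `amount` of pixels with 0 (pepper) or 255 (salt)."""
    noisy = []
    for row in image:
        new_row = []
        for px in row:
            r = rng.random()
            if r < amount / 2:
                new_row.append(0)      # pepper
            elif r < amount:
                new_row.append(255)    # salt
            else:
                new_row.append(px)     # pixel left untouched
        noisy.append(new_row)
    return noisy


rng = random.Random(0)                       # seeded for repeatability
clean = [[128] * 8 for _ in range(8)]        # hypothetical flat gray 8x8 image
gaussian = add_gaussian_noise(clean, sigma=20.0, rng=rng)
salt_pepper = add_salt_and_pepper(clean, amount=0.2, rng=rng)
```

Rain, fog, and darkness need image-formation models that go beyond per-pixel noise, so they are left out of this sketch.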

Key Contribution

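Adverse conditions such as rain, fog, darkness, and sensor noise can severely impair text-guided open-vocabulary object counting. Dual-TTT addresses this by training only a lightweight denoising module at test time, so the original counting architecture stays unchanged.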

Abstract

Text-guided Open-vocabulary Object Counting (TOOC) enables counting arbitrary object categories specified by text prompts, offering substantially greater flexibility than conventional closed-set counting. However, existing TOOC methods are developed and evaluated primarily on ideal images, while real-world scenes often suffer from adverse conditions such as rain, fog, darkness, and sensor noise, which severely degrade visual quality and impair vision-language alignment. To bridge this gap, we introduce Robust-TOOC, the first benchmark for evaluating TOOC under diverse corruption conditions, which covers six representative degradation types: rain, fog, darkness, Gaussian noise, salt-and-pepper noise, and mixed corruption. To improve robustness while preserving the original counting architecture, we propose Dual-TTT, a dual-architecture test-time training framework for TOOC. Specifically, during test-time training, Dual-TTT updates only the Text-guided Lightweight Denoising module (TL-Denoiser), while keeping the original counting network frozen. Inspired by diffusion models, the TL-Denoiser is optimized to remove corruption-aware noise from image representations under degraded conditions. Since only the TL-Denoiser is trained at test time, Dual-TTT is annotation-free and can be seamlessly integrated into existing TOOC models without modifying their original architecture. Extensive experiments on multiple recent TOOC baselines demonstrate the effectiveness of our method.

Computer Vision | Multimodal Models | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
