Haomiao Ni

Department of Computer Science, University of Memphis

Abstract

Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects because such instances are scarce in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods still require computationally intensive VLM finetuning and do not fully exploit the original training data. In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs’ reasoning over rare objects by refining visual tokens and enriching input text prompts, without finetuning the VLM. Specifically, we propose to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples. These embeddings refine the visual tokens in VLMs through a lightweight attention-based enhancement module that sharpens fine-grained object details. In addition, we use the learned embeddings as object-aware detectors to generate informative hints, which are injected into the text prompts to help guide the VLM’s attention toward relevant image regions. Experiments on two benchmarks show consistent and substantial gains for pretrained VLMs in rare object recognition and reasoning. Further analysis reveals how our method strengthens the VLM’s ability to focus on and reason about rare objects.

1 Introduction

Vision language models (VLMs) have made remarkable advances in recent years, with both open-source [3, 45, 20] and closed-source [1, 33] systems demonstrating strong performance across a wide range of multi-modal tasks. A key driver of this progress has been visual instruction tuning [20], which bridges a pretrained vision encoder (e.g., CLIP [28]) and a large language model via a lightweight projection layer.
This design lets the language model interpret and reason over visual inputs, enabling effective vision-language alignment and fusion. Despite these successes, numerous studies [34, 27, 7] report persistent limitations of VLMs in vision-centric tasks such as referred object recognition and spatial reasoning. In particular, VLMs perform much worse on rare or uncommon objects than on common ones [25, 29, 2]. For example, Figure 1(a) shows that LLaVA fails to recognize or reason correctly about the “bollard,” even when it is clearly visible in the input image. In contrast, our refinement on LLaVA resolves this issue, as illustrated in Figure 1(b).

Figure 1: Comparison on rare object recognition: (a) shows that LLaVA tends to predict the “bollard” as a common object “traffic light”, while (b) demonstrates that our method corrects LLaVA by predicting “bollard” and providing reasoning through visual enhancement and text prompt enrichment with object hints, both based on the learned multi-modal class embeddings.

Existing approaches largely attribute these shortcomings to the visual encoder or the projector. In response, subsequent works have introduced stronger vision encoders [24, 16] and more expressive projectors [19, 26], aiming to provide the language model with richer, more comprehensive visual representations. Recent studies [9, 40] leverage vision foundation models as alignment targets for the visual tokens in VLMs, so that these tokens preserve more spatial details during finetuning. While delivering measurable improvements, these methods are not specifically optimized for rare objects, making them inefficient for such scenarios. The work in [21] attempts to mitigate the imbalanced distribution of rare objects through retrieval-augmented learning (RAL) over large-scale public data, building a class-balanced training dataset. However, it still requires computationally expensive finetuning of the VLM and may lose original information.
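
For intuition only, the attention-based refinement of visual tokens mentioned in the abstract can be pictured as a single cross-attention step in which each visual token attends over a set of class embeddings and receives a residual update. The sketch below is a simplified illustration under our own assumptions, not the module proposed in this paper: the function names, the residual weight `alpha`, the absence of learned projections, and the toy inputs are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_visual_tokens(visual_tokens, class_embeddings, alpha=0.5):
    """Hypothetical refinement: each token attends over the class embeddings.

    visual_tokens:    list of N vectors of length d (queries)
    class_embeddings: list of C vectors of length d (keys and values)
    alpha:            assumed residual weight; alpha=0 leaves tokens unchanged
    Returns (refined_tokens, attention_weights).
    """
    dim = len(class_embeddings[0])
    scale = math.sqrt(dim)
    refined, attention = [], []
    for token in visual_tokens:
        # Scaled dot-product score between this token and every class embedding.
        scores = [
            sum(t * c for t, c in zip(token, emb)) / scale
            for emb in class_embeddings
        ]
        weights = softmax(scores)
        # Attention-weighted mixture of the class embeddings.
        context = [
            sum(w * emb[j] for w, emb in zip(weights, class_embeddings))
            for j in range(dim)
        ]
        # Residual update keeps the original token content.
        refined.append([t + alpha * c for t, c in zip(token, context)])
        attention.append(weights)
    return refined, attention


# Toy data: three 2-d "visual tokens" and two 2-d "class embeddings".
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
class_embeddings = [[2.0, 0.0], [0.0, 2.0]]
refined, attention = refine_visual_tokens(visual_tokens, class_embeddings)
```

In this toy setup, a token aligned with one class embedding places most of its attention on that class, while a token equidistant from both splits its attention evenly. A practical module would add learned query, key, and value projections and operate on batched tensors.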

The limitations of these prior approaches naturally raise the question: How can we efficiently improve VLMs’ capability in recognizing and reasoning about rare object-centric scenes?

Figure 2: Visual attention on the object “bollard” from the CODA-LM dataset. The attention weights across layers show that LLaVA-1.5-
