[...] B LLM consistently underperforming compared to the [...] M entities in Wikipedia is infeasible.

5 Limitations

While WikiCLIP provides a simple and efficient baseline for open-domain Visual Entity Recognition (VER), it still underutilizes the knowledge encoded in large language models (LLMs). Our analysis shows that performance quickly saturates with longer text inputs and that scaling to larger LLMs yields only marginal gains. These findings suggest that expanding model size or context length alone cannot fully exploit LLM-guided contrastive learning, and they motivate future work on better leveraging LLM knowledge and refining entity representations.

6 Conclusion

In this work, we present WikiCLIP, a simple yet efficient framework for open-domain visual entity recognition. WikiCLIP employs a Vision-Guided Knowledge Adaptor (VGKA) to extract discriminative entity representations and a hard negative synthesis strategy to generate challenging negatives for the contrastive training of the VGKA. Extensive experiments on standard open-domain VER benchmarks show that WikiCLIP substantially outperforms strong baselines while maintaining fast inference speed.

7 Acknowledgements

This work was supported by NSFC 62350610269, the Shanghai Frontiers Science Center of Human-centered Artificial Intelligence, and the MoE Key Lab of Intelligent Perception and Human-Machine Collaboration (ShanghaiTech University). This work was also supported by the HPC platform of ShanghaiTech University.
LLMs can be prompted to generate part-aware instructions that substantially improve open-vocabulary 3D affordance grounding by linking semantically similar affordances and refining geometric differentiation.
Forget slow generative models: WikiCLIP delivers a 16% accuracy boost in visual entity recognition with 100x faster inference by cleverly combining CLIP-style contrastive learning with vision-guided knowledge adaptation.
Diffusion models can now generate more realistic and semantically appropriate hand grasps by explicitly modeling affordances and interaction semantics, outperforming prior methods on grasp quality, semantic accuracy, and diversity.