Sep 29, 2025 · arXiv:2509.25033

VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning

Wenhao Li, Qiangchang Wang, Xianjing Meng, Zhibin Wu, Yilong Yin

AI Summary

The paper introduces VT-FSL, a novel few-shot learning framework that leverages Large Language Models (LLMs) to generate precise cross-modal prompts conditioned on both class names and support images, addressing the issue of hallucinated semantics in existing methods. VT-FSL employs Cross-modal Iterative Prompting (CIP) to generate class descriptions and synthetic images, which are then integrated with support images using Cross-modal Geometric Alignment (CGA) to ensure structured multimodal integration. The proposed VT-FSL achieves state-of-the-art performance on ten diverse few-shot learning benchmarks.

Key Contribution

Class descriptions generated by an LLM conditioned on class names and support images, together with images synthesized from those descriptions, improve few-shot learning performance when geometrically aligned with the support features.

Abstract

Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
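The volume that CGA minimizes can be illustrated with a standard identity: the volume of the parallelotope spanned by three vectors equals the square root of the determinant of their 3x3 Gram matrix, and a kernelized version replaces each inner product with a kernel value. The sketch below is only an illustration under assumptions. The abstract does not specify the kernel, the feature extractors, or the exact loss, so the function names (`rbf_kernel`, `parallelotope_volume`), the `gamma` parameter, and the toy vectors are all hypothetical and not taken from the VT-FSL code.

```python
import math


def linear_kernel(x, y):
    # Plain inner product; yields the ordinary Euclidean parallelotope volume.
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    # Gaussian (RBF) kernel; an assumed choice, the abstract names no kernel.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def det3(m):
    # Determinant of a 3x3 matrix by cofactor expansion along the first row.
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def parallelotope_volume(text_feat, support_feat, synth_feat, kernel=linear_kernel):
    # Build the 3x3 Gram matrix of the three representations under `kernel`,
    # then return sqrt(det(Gram)). Clamp at zero to absorb rounding error.
    feats = (text_feat, support_feat, synth_feat)
    gram = [[kernel(p, q) for q in feats] for p in feats]
    return math.sqrt(max(det3(gram), 0.0))


# Toy features (hypothetical values) for one class.
text_feat = (1.0, 0.0, 0.0)
support_feat = (0.0, 2.0, 0.0)
synth_feat = (0.0, 0.0, 3.0)

vol_linear = parallelotope_volume(text_feat, support_feat, synth_feat)
vol_rbf = parallelotope_volume(text_feat, support_feat, synth_feat, kernel=rbf_kernel)
vol_aligned = parallelotope_volume(text_feat, text_feat, text_feat, kernel=rbf_kernel)
```

In this toy setting the volume shrinks toward zero as the three representations become linearly dependent (or, under the RBF kernel, coincide), which is the sense in which minimizing it pulls the textual, support, and synthetic features into alignment.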

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 4

Influential citations: 0

References: 97

Year: 2025

Venue: arXiv.org
