Given $\mathcal{D}_{\text{noisy}}=\{(x_{i},y_{i})\}_{i=1}^{D}$, we extract global image features $\bm{F}\in\mathbb{R}^{D\times d}$ and prompt features $\bm{G}\in\mathbb{R}^{C\times d}$ using a pre-trained vision-language model. Next, we compute the similarity matrix between the image and prompt features, $\bm{F}\cdot\bm{G}^{\top}$, and use its negative logarithm as the cost matrix. To ensure proper alignment, we enforce uniform marginal distributions for both the samples and the classes. The OT problem is then formulated as:

$$d_{\text{OT}}(\bm{\mu},\bm{\nu})=\min_{\bm{T}\in\Pi(\bm{\mu},\bm{\nu})}\langle-\log(\bm{F}\cdot\bm{G}^{\top}),\bm{T}\rangle \tag{14}$$

$$\Pi(\bm{\mu},\bm{\nu})=\left\{\bm{T}\in\mathbb{R}_{+}^{C\times D}\;\middle|\;\bm{T}\mathds{1}_{D}=\bm{\mu},\;\bm{T}^{\top}\mathds{1}_{C}=\bm{\nu}\right\} \tag{15}$$

where $\mathds{1}_{C}$ is the vector of ones with length $C$, representing the total probability mass of the noisy label distribution, and $\mathds{1}_{D}$ is the vector of ones with length $D$, representing the total probability mass of the sample distribution. These constraints ensure that the total probability mass is conserved across both the samples and the labels. Once the optimal transport plan $\bm{T}^{*}$ is computed, the pseudo-label for each image $x_{i}$ is obtained by selecting the class with the highest transport mass:

$$\tilde{y}_{i}=\arg\max_{j}T^{*}_{ij} \tag{16}$$

This process generates refined labels by using the transport plan $\bm{T}^{*}$ to assign the most probable class for each image. To further improve reliability, we integrate the adaptive threshold $\phi_{i,k}$ defined earlier to identify potentially mislabeled samples.
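To make the pseudo-labeling step of Eqs. (14)–(16) concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It rests on several assumptions that the text does not state: the exact OT problem is approximated with entropic regularization (Sinkhorn iterations), the features are L2-normalized, similarities are clipped to a small positive floor so that the logarithm is defined, and the plan is stored with rows indexing samples so that the row-wise argmax matches Eq. (16). The names `sinkhorn_plan`, `ot_pseudo_labels`, `eps`, and `n_iters` are hypothetical.

```python
import numpy as np

def sinkhorn_plan(cost, mu, nu, eps=0.05, n_iters=200):
    """Entropic approximation of min_T <cost, T> with row sums mu and column sums nu.

    Assumption: Sinkhorn is used here as a stand-in solver; an exact LP solver
    could be substituted for the classical OT problem.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.T @ u)           # rescale to match the class marginal
        u = mu / (K @ v)             # rescale to match the sample marginal
    return u[:, None] * K * v[None, :]

def ot_pseudo_labels(F, G, eps=0.05, n_iters=200):
    """F: (D, d) global image features, G: (C, d) prompt features."""
    # Assumption: clip similarities so that -log is finite.
    sim = np.clip(F @ G.T, 1e-6, None)
    cost = -np.log(sim)                        # cost matrix, shape (D, C)
    n_samples, n_classes = cost.shape
    mu = np.full(n_samples, 1.0 / n_samples)   # uniform marginal over samples
    nu = np.full(n_classes, 1.0 / n_classes)   # uniform marginal over classes
    T = sinkhorn_plan(cost, mu, nu, eps, n_iters)
    return T.argmax(axis=1), T                 # class with highest transport mass
```

Because the class marginal is uniform, the plan is pushed to spread samples evenly over classes, which is the global structure that per-sample argmax over raw similarities would ignore.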
Only those samples whose similarity to clean prompts falls below the threshold are considered for refinement, ensuring that clean samples remain unaltered while noisy instances are corrected:

$$\mathcal{D}_{\text{refinement}}=\left\{(x_{i},\tilde{y}_{i})\mid p_{ik}^{c}<\phi_{i,k},\;k=y_{i}\right\}. \tag{17}$$

The selective mechanism, built on the bi-directional multi-view prompt learning framework, enables the model to isolate and correct corrupted labels effectively. This not only improves the label quality but also enhances the robustness of the model under noisy supervision. Ultimately, the denoised training set is constructed by combining reliable clean samples with the refined noisy ones:

$$\mathcal{D}_{\text{denoised}}=\mathcal{D}_{\text{clean}}\cup\mathcal{D}_{\text{refinement}}. \tag{18}$$

By training on $\mathcal{D}_{\text{denoised}}$, the model benefits from both trustworthy clean supervision and corrected noisy labels, leading to more stable convergence and improved generalization performance.

3.3 Training Details

Training schedule. To improve robustness, we delay the label refinement process and only start modifying labels after $T_{\text{sup}}$ epochs. Details of the full training procedure are provided in the Appendix. In the early phase, the model is trained on the noisy dataset with the Generalized Cross-Entropy (GCE) loss [65] combined with the ITBP loss:

$$\mathcal{L}_{\text{sup}}=\mathcal{L}_{\text{gce}}+\lambda_{\text{i}}\cdot\mathcal{L}_{\text{itbp}}, \tag{19}$$

where $\lambda_{\text{i}}$ controls the strength of auxiliary supervision. Once the refinement process is activated, noisy samples identified by our prompt-guided mechanism are selectively corrected, and training continues on the updated dataset with GCE loss. This delayed refinement allows the model to first acquire stable representations before adapting to cleaner supervision.
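The selective rule of Eq. (17) and the delayed schedule can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the paper's code: the GCE loss uses its standard form $(1-p_{y}^{q})/q$ with an assumed exponent `q=0.7`, samples that are not flagged simply keep their given label (standing in for $\mathcal{D}_{\text{clean}}$), the ITBP term is omitted, and the names `gce_loss`, `selective_refine`, `labels_for_epoch`, and `t_sup` are hypothetical.

```python
import numpy as np

def gce_loss(probs, labels, q=0.7):
    """Generalized Cross-Entropy: batch mean of (1 - p_y^q) / q.

    Assumption: q = 0.7 is an illustrative value, not one stated in the text.
    """
    p_y = probs[np.arange(len(labels)), labels]   # probability of the given label
    return float(np.mean((1.0 - p_y ** q) / q))

def selective_refine(noisy_labels, pseudo_labels, p_clean, phi):
    """Eq. (17): relabel sample i only if p^c_{ik} < phi_{i,k} for k = y_i.

    p_clean and phi are (D, C) arrays of clean-prompt confidences and thresholds.
    """
    idx = np.arange(len(noisy_labels))
    flagged = p_clean[idx, noisy_labels] < phi[idx, noisy_labels]
    refined = np.where(flagged, pseudo_labels, noisy_labels)
    return refined, flagged

def labels_for_epoch(epoch, t_sup, noisy_labels, pseudo_labels, p_clean, phi):
    """Delayed refinement: train on the given labels until t_sup epochs have passed."""
    if epoch < t_sup:
        return noisy_labels
    return selective_refine(noisy_labels, pseudo_labels, p_clean, phi)[0]
```

Keeping the flag mask separate from the relabeling makes it easy to inspect how many samples are altered per epoch, which is the quantity the selective design is meant to keep small.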
A detailed sensitivity study on the effect of $\lambda_{\text{i}}$ is reported in the Appendix.

Inference. During inference, both clean and noise-aware prompt alignments are incorporated into the prediction. We first compute the noise-aware confidence for class $k$ as:

$$p_{ik}^{n}=\frac{\exp(s_{i,k}^{n}/\tau)}{\exp(s_{i,k}^{c}/\tau)+\exp(s_{i,k}^{n}/\tau)} \tag{20}$$

Using both the clean-prompt confidence $p_{ik}^{c}$ and the noise-aware confidence $p_{ik}^{n}$, the final probability of assigning label $k$ to image $x_{i}$ is defined as:

$$p(y=k\mid x_{i})=(1-p_{ik}^{n})\cdot p_{ik}^{c} \tag{21}$$

Table 1: Comparison of methods under symmetric and asymmetric noise on five datasets. (%)

AIIA, Ministry of Education, China
{lniu,cxue}@seu.edu.cn
Corresponding author.

Abstract

Vision-language models offer strong few-shot capability through prompt tuning but remain vulnerable to noisy labels, which can corrupt prompts and degrade cross-modal alignment. Existing approaches struggle because they often lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals. To address these challenges, we propose NA-MVP, a framework for Noise-Aware few-shot learning through bi-directional Multi-View Prompt alignment. NA-MVP is built upon a key conceptual shift: robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones.
To realize this, NA-MVP employs (1) multi-view prompts combined with unbalanced optimal transport to achieve fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, enabling the model to focus on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks demonstrate that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness in enabling robust few-shot learning under noisy supervision.

1 Introduction

Vision–language models (VLMs), such as CLIP [42], have advanced multimodal understanding by embedding images and text into a shared semantic space. Building on this, prompt learning adapts VLMs to downstream tasks by optimizing a small set of learnable textual embeddings while keeping the backbone frozen [68, 67]. This paradigm is especially appealing in few-shot and resource-limited settings due to its parameter efficiency, modularity, and fast adaptation. However, real-world deployments frequently face noisy supervision, and the few-shot regime exacerbates this vulnerability: with only a handful of examples per class, even a small number of corrupted labels can disproportionately bias gradient updates and induce spurious correlations.

Recent studies suggest that prompt learning can be made robust to label noise [55], inspiring combinations with noisy-label learning such as negative learning [45, 54] and noisy-label selection [20, 38]. Yet, as summarized in Figure 1, they still face key limitations.
First, prompt expressiveness is constrained, since most methods employ only one or two learnable prompts (a positive and a negative pair) [55, 54], enforcing a single-view alignment that cannot capture the diverse, fine-grained cues that are essential for reducing the influence of noisy labels in few-shot settings. Second, assigning an explicit negative label to each image imposes a rigid supervision signal tied to a fixed counter class, and such hard negatives are often inaccurate or uninformative, making the optimization process less reliable in noisy settings. Third, denoising is typically coarse, relying either on fixed confidence thresholds or pseudo-labeling without selective correction, leading to error propagation. These limitations highlight a missing perspective in prior work: robust noisy few-shot learning requires adaptively decomposing and aligning clean and noisy semantics at a fine-grained, region-aware level.

Figure 1: Limitations of existing prompt learning approaches under noisy labels. Single-view reliance: Limited prompts miss diverse visual patterns. Explicit negatives: Fixed negatives impose rigid supervision. Fixed threshold: Coarse denoising lets noise propagate.

To address these challenges, we propose NA-MVP, a framework for noisy few-shot learning. NA-MVP combines multi-view, fine-grained patch-to-prompt alignment with Unbalanced Optimal Transport (UOT), allowing local features to be partially matched with multiple prompt views and mitigating the limitations of single-view prompting. To avoid rigid negative supervision, we introduce a bi-directional prompt design that jointly learns clean-oriented and noise-aware prompts, where the noise-aware view serves as an implicit negative and provides more flexible guidance under noisy labels. Finally, a prompt-guided selective refinement module uses alignment signals to identify unreliable samples and correct them via classical OT, offering a more targeted alternative to confidence-based relabeling.
Our main contributions are summarized as follows:

• A new conceptual perspective for few-shot learning with noisy labels. We introduce a new formulation of robustness in prompt learning: robust noisy few-shot learning requires decomposing and aligning clean and noisy semantics in a region-aware, class-conditional manner, moving beyond the global image–prompt matching adopted in prior work.
• Bi-directional multi-view prompts for noise-aware alignment. We design clean-oriented and noise-aware prompt views to capture reliable and corrupted cues, respectively. Coupled with unbalanced patch-to-prompt alignment, this enables the model to downweight noisy regions and enhance consistent semantic signals.
• Selective label refinement guided by alignment signals. We develop a prompt-guided selective refinement mechanism that uses bi-directional alignment cues to identify mislabeled samples and correct them via classical OT, avoiding the over-correction issues of global pseudo-labeling approaches.
• We validate NA-MVP on multiple benchmarks and noise settings, showing consistent gains and robustness under noisy supervision.

2 Related Work

2.1 Learning with noisy labels.

Learning with noisy labels (LNL) presents a significant challenge in training models that generalize well without overfitting to noisy labels. Existing approaches include robust loss functions [65, 49, 36, 53], loss correction [5, 1, 57], robust noise regularization [51, 52, 21, 27], and sample selection [60, 61, 26, 23, 62, 17, 31, 58, 54, 59]. Sample selection methods often rely on the small-loss criterion, which may discard clean hard samples and retain noisy ones. Label correction strategies, like MLC [66] and SELC [34], aim to correct noisy annotations by generating pseudo-labels from model predictions. However, these methods typically process each sample independently, overlooking the relationships between data points, which can lead to suboptimal corrections.
To better exploit global structure, recent OT-based works [56, 17, 7] align feature distributions to improve pseudo-labeling. Recently, prompt learning has shown promise in noisy settings [55, 54, 38]. However, existing prompt-based LNL approaches still inherit core limitations of traditional methods: they typically operate at the global image–prompt level and rely on single-view or explicit negative cues, making them insensitive to fine-grained inconsistencies between clean and corrupted signals. As a result, their robustness can degrade significantly in few-shot regimes where noisy labels disproportionately influence prompt semantics.

2.2 Prompt Learning in Vision-Language Models.

Prompt learning, initially developed in natural language processing, has become a central technique for adapting vision–language models (VLMs) [22, 42, 64]. While early models such as CLIP relied on manually crafted prompts, recent work focuses on learning continuous prompt embeddings. CoOp [68] introduces learnable prompts in the continuous space, and CoCoOp [67] further adapts them at the image level to improve generalization to unseen classes. This paradigm has inspired a broad line of extensions [44, 14, 24, 25, 32, 43, 69]. However, using a single prompt [68] limits the ability to capture diverse visual cues, motivating multi-prompt designs [35, 45]. CLIPN [48] employs a positive/negative prompt pair for OOD detection, where the negative prompts serve as class-agnostic cues to identify distribution shift. PLOT [8] aligns multiple prompts with local image features through optimal transport to enhance image–text correspondence. While effective, these methods are developed for clean data or OOD generalization and do not address the challenges posed by noisy labels and extremely limited supervision in the noisy few-shot setting.
Inspired by these works, we propose a framework that combines bi-directional and multi-view prompt learning to better distinguish clean and noisy semantics.

2.3 Optimal Transport.

OT provides a principled way to compare probability distributions by finding the most efficient mapping between them at a given cost. It defines the Wasserstein distance [41] and has been widely adopted in machine learning and computer vision. However, the high computational complexity of OT was a bottleneck until Cuturi introduced entropic regularization, enabling efficient computation via the Sinkhorn algorithm [15]. To improve flexibility, UOT [33, 9] was introduced, replacing the strict mass conservation constraint in classical OT with soft penalization terms [18, 28]. These advances have enabled OT to support a wide range of applications, including semi-supervised learning [46, 29, 47], object detection [19, 63, 13], generative models [2, 12, 10], domain adaptation [6, 50], learning with noisy labels [17, 7], and others. Building on these advances, our method adopts UOT with relaxed mass constraints to align local image features with multi-view prompts, allowing the model to focus on reliable features while suppressing noise. Meanwhile, classical OT with strict mass preservation is used to refine noisy labels by aligning global image features with class-level prompts, ensuring reliable label correction. This design leverages the complementary strengths of both OT variants for robust learning under label noise.

Figure 2: Overview of the NA-MVP framework. Our framework consists of two key modules: (1) Noise-aware alignment (blue arrows): Multiple clean and noise-aware prompts per class are encoded and aligned with local image patches via UOT to generate clean/noisy probabilities.
(2) Selective label refinement (green arrows): An adaptive threshold $\phi$ derived from these probabilities identifies mislabeled samples, which are refined via classical OT by aligning global image features with clean text features. The two modules work together to iteratively update the training set while optimizing the prompts, producing a denoised dataset for robust prediction under noisy supervision.

3 Methodology

Conceptual motivation. Existing prompt-based LNL methods largely rely on global image–prompt coupling or explicit negative labels, and they lack sample-dependent mechanisms capable of separating clean signals from corrupted ones within an image. In contrast, NA-MVP is built on the idea that robustness should arise from fine-grained, region-aware alignment and sample-dependent label correction. As shown in Figure 2, our method consists of two key components: (1) bi-directional multi-view prompts for noise-aware alignment, where objects are observed from multiple perspectives; and (2) selective noisy label refinement with OT, where the refinement is guided by prompts. We present multi-view and bi-directional as a unified prompt design, since the two are tightly coupled in how they provide complementary semantics and generate alignment signals for refinement.

Problem definition Let 𝒟noisy={(xi,yi)}i=
Even with noisy labels, NA-MVP achieves robust few-shot learning by adaptively separating clean from noisy signals using bi-directional multi-view prompt alignment.