Zhixiong Nan

School of Computer Science & Technology, Chongqing University
{xujiahao,zhangxin,dongjiajun}@stu.cqu.edu.cn, {huangsheng,nanxz,nankun.mu}@cqu.edu.cn
Corresponding author

Abstract

In computational pathology, few-shot whole slide image classification is primarily driven by the extreme scarcity of expert-labeled slides. Recent vision-language methods incorporate textual semantics generated by large language models, but treat these descriptions as static class-level priors that are shared across all samples and lack sample-wise refinement. This limits both the diversity and precision of visual-semantic alignment, hindering generalization under limited supervision. To overcome this, we propose the stochastic MUlti-view Semantic Enhancement (MUSE), a framework that first refines semantic precision via sample-wise adaptation and then enhances semantic richness through retrieval-augmented multi-view generation. Specifically, MUSE introduces Sample-wise Fine-grained Semantic Enhancement (SFSE), which yields a fine-grained semantic prior for each sample through MoE-based adaptive visual-semantic interaction. Guided by this prior, Stochastic Multi-view Model Optimization (SMMO) constructs an LLM-generated knowledge base of diverse pathological descriptions per class, then retrieves and stochastically integrates multiple matched textual views during training. These dynamically selected texts serve as enriched semantic supervisions to stochastically optimize the vision-language model, promoting robustness and mitigating overfitting. Experiments on three benchmark WSI datasets show that MUSE consistently outperforms existing vision-language baselines in few-shot settings, demonstrating that effective few-shot pathology learning requires not only richer semantic sources but also their active and sample-aware semantic optimization. Our code is available at: https://github.com/JiahaoXu-god/CVPR2026_MUSE.
1 Introduction

Computational pathology (CPath) [42, 9, 8], a key branch of digital pathology [33], leverages advanced machine learning to enable objective and quantitative analysis of whole slide images (WSIs). These methods support automated diagnosis by interpreting complex histopathological patterns. However, the extreme scale and structural heterogeneity of WSIs present major obstacles to end-to-end learning. To circumvent the need for exhaustive pixel-level annotations, weakly supervised learning has become the de facto paradigm, most notably through the multiple instance learning (MIL) framework [44, 38, 31, 25].

Figure 1: Comparison of MUSE with existing VLM-based MIL methods. (a) VLM-based MIL methods incorporate pathological text and enable cross-modal interaction between text and image modalities. (b) Our method performs fine-grained modeling of semantics and enables interaction between text and image modalities, while enhancing text diversity through knowledge base retrieval and stochastic optimization.

Conventional MIL approaches for WSI analysis [43, 10, 22, 28, 53, 48, 52] typically follow a three-stage pipeline: (1) patch extraction, (2) feature encoding via a pretrained backbone, and (3) bag-level aggregation under slide-level supervision. These methods have demonstrated strong performance in tasks such as cancer diagnosis [22, 43], subtyping [28, 53, 10], and survival prediction [48, 52], but only when sufficient labeled WSIs are available. In practice, however, acquiring slide-level annotations requires expert pathologists and is further constrained by stringent data privacy regulations. Consequently, labeled WSIs are not only scarce but often limited to just a few examples per diagnostic category. This reality necessitates a few-shot learning paradigm, where models must generalize from minimal labeled slides.
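The third stage of the pipeline above, bag-level aggregation, can be illustrated with attention pooling in the style of ABMIL [22]. The following is a minimal NumPy sketch under stated assumptions: the patch features are random stand-ins for the output of a frozen encoder, and the names `attention_mil_pool`, `V`, `w`, and `W_cls` are hypothetical, randomly initialized parameters rather than anything taken from the MUSE code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_mil_pool(patch_feats, V, w):
    """Attention pooling over one bag of patches.

    patch_feats: (N, d) patch features, V: (d, h), w: (h,).
    Returns the bag feature (d,) and the attention weights (N,).
    """
    scores = np.tanh(patch_feats @ V) @ w  # one scalar score per patch
    attn = softmax(scores)                 # importance weights, sum to 1
    bag = attn @ patch_feats               # weighted sum of patch features
    return bag, attn

rng = np.random.default_rng(0)
N, d, h, num_classes = 50, 16, 8, 2

# Stages 1 and 2 (patch extraction, frozen encoder) are replaced by random features.
patch_feats = rng.normal(size=(N, d))

# Hypothetical parameters; in practice these are learned under slide-level labels.
V = rng.normal(size=(d, h)) * 0.1
w = rng.normal(size=h) * 0.1
W_cls = rng.normal(size=(d, num_classes)) * 0.1

bag, attn = attention_mil_pool(patch_feats, V, w)
probs = softmax(bag @ W_cls)  # slide-level class probabilities
```

Only the aggregator and classifier carry trainable parameters here, which is why the number of labeled slides, not the number of patches, bounds what such a model can learn.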
Under such extreme data scarcity, visual features alone are insufficient to capture the nuanced diagnostic criteria that distinguish pathological subtypes. Instead, high-level semantic knowledge such as disease descriptions, histological terminology, and clinical narratives becomes critical. Such semantics encode invariant diagnostic principles shared across patients and institutions, offering a powerful inductive bias for generalization. Motivated by this, recent data-efficient methods [35, 40, 16, 36, 11] leverage vision-language models (VLMs) [37, 27] pretrained on large-scale image-text corpora to align histopathological patterns with their corresponding textual semantics. By fusing or interacting visual and semantic representations, these approaches aim to solve the few-shot WSI classification (FSWC) problem, effectively using language as a proxy for expert knowledge when labeled data is unavailable.

Recent vision-language methods for computational pathology have moved beyond generic models trained on natural images by adopting pathology-specific foundation models such as PLIP [20], CONCH [29], and MUSK [49]. These models are pretrained on large-scale histopathology image-text pairs via contrastive learning [21, 20, 29, 49] and provide domain-aligned encoders for both modalities. Building on this progress, several approaches [14, 41, 40, 32] leverage large language models to generate additional textual descriptions and design cross-modal interactions between text and patch features, as shown in Figure 1(a). However, large language models are typically used only as description generators rather than semantic optimizers, resulting in static and unrefined textual prompts. This superficial use of semantics leads to two key limitations. First, complex pathological concepts are often collapsed into a single global query, preventing disentanglement of fine-grained diagnostic attributes such as tumor grade or immune infiltration.
As a result, visual-semantic alignment remains coarse and fails to attend to diagnostically relevant regions with concept-level precision. Second, the reliance on unoptimized prompts ignores the structural diversity of clinical language, including variations in abstraction level, contextual nuance, and syntactic formulation. Under few-shot settings, this not only underutilizes the expressive capacity of the text encoder but also encourages overfitting to specific phrasings, degrading generalization across clinical contexts.

To address the aforementioned limitations, we propose a stochastic MUlti-view Semantic Enhancement framework, abbreviated as MUSE. As shown in Figure 1(b), the core idea of MUSE is to jointly enhance the model's generalization capability through precise semantic perception and enriched semantic diversity. Our framework consists of two core components: Sample-wise Fine-grained Semantic Enhancement (SFSE) and Stochastic Multi-view Model Optimization (SMMO). SFSE enhances semantic precision through decompositional semantic refinement and fine-grained, query-driven sample-wise cross-modal interaction. SMMO promotes semantic diversity by retrieving multi-view textual descriptions from a contextual semantic knowledge base generated by an LLM with semantics refined by SFSE, followed by a fast stochastic optimization process for final WSI classification. These components enable MUSE to achieve strong generalization in few-shot settings. Extensive experiments on multiple widely used WSI datasets show that our method achieves superior performance compared to existing FSWC baselines.

The main contributions of our paper can be summarized as follows:

• We propose the MUSE framework, which improves semantic understanding in multiple instance learning through fine-grained semantic modeling and effective exploitation of semantic diversity, significantly boosting generalization in few-shot scenarios. To the best of our knowledge, this work is the first attempt to improve few-shot WSI classification performance from the perspective of semantic optimization.
• We propose an MoE-based mechanism that decompositionally refines category-level semantics and adapts them to individual samples through interaction with visual features. This enables the learning of sample-wise semantic priors that capture fine-grained semantic cues, thereby enhancing semantic precision beyond conventional class-level representations.
• We build an LLM-generated knowledge base of multi-view and class-specific pathological descriptions, whose semantic diversity offers complementary signals for few-shot learning. Guided by SFSE-refined sample-wise priors and integrated stochastically, these multi-view semantics enhance generalization under limited labels.

2 Related Works

2.1 Multiple Instance Learning in CPath

Due to the ultra-high resolution of whole-slide images (WSIs), multiple instance learning (MIL) has become the standard framework in computational pathology. MIL-based methods have demonstrated strong performance on WSI diagnosis, subtyping, and prognosis tasks [22, 28, 53, 48, 52]. A typical pipeline first tiles the WSI into patches, extracts features using a pretrained encoder, and then aggregates these features to predict the slide-level label. Early aggregators relied on parameter-free pooling operations [7]. Subsequent works introduced learnable mechanisms to identify diagnostically relevant patches. ABMIL [22] uses attention to assign importance scores to individual patches. CLAM [30] enhances this with clustering constraints to localize critical regions. TransMIL [39] models global inter-patch dependencies via self-attention, while graph-based methods [26, 18, 54] incorporate spatial structure to improve contextual reasoning.
However, under sparse annotations, models relying solely on visual features often underperform, highlighting the need to integrate domain knowledge for effective weakly supervised learning.

2.2 Vision-Language Models in CPath

General-purpose vision-language models such as CLIP [37] and BLIP [27] have demonstrated strong performance across a wide range of visual tasks [19, 15, 51, 23]. In computational pathology (CPath), domain-adapted foundation models including PLIP [20], CONCH [29], and MUSK [49] leverage large-scale pathological image-text data to improve diagnosis and downstream analysis. To address sparse annotations, recent works integrate these models into the MIL framework by exploiting their few-shot and zero-shot transfer capabilities. Top [35] introduces the FSWC paradigm using text-guided patch aggregation for WSI classification under data scarcity. ViLa-MIL [40] proposes a dual-scale MIL framework that fuses textual descriptions with image features at multiple resolutions. FOCUS [14] enhances representation quality through a knowledge-guided adaptive visual compression mechanism. However, these methods treat semantic prompts as static category-level descriptors, ignoring both sample-wise fine-grained semantics and the structural and perspectival diversity of clinical language, thereby limiting the expressiveness of vision-language learning.

3 Methodology

Figure 2: Overview of the proposed MUSE framework (DSR: Decompositional Semantic Refinement; SVTI: Sample-wise Vision-Text Interaction). (a) The input semantic information is decomposed and modeled in a fine-grained manner. We then leverage the refined semantic representations to extract sample-relevant visual-semantic information and facilitate cross-modal interaction. (b) The semantic-enhanced features are used to retrieve relevant texts from the multi-view text knowledge base, and these retrieved texts are subsequently leveraged through stochastic optimization to enrich semantic diversity.
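To make the retrieve-then-sample idea of Figure 2(b) concrete, the sketch below retrieves the knowledge-base texts closest to a query feature and then draws a random subset of them at each training step. It is only an illustration under loud assumptions: this section does not specify the retrieval metric, the number of retrieved texts, or how views are combined, so cosine similarity, top-k selection, uniform sampling without replacement, and mean fusion are all placeholder choices, and every name (`retrieve_top_k`, `sample_views`, `kb_embeds`) is hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_top_k(query, kb_embeds, k):
    """Return indices and cosine similarities of the k closest texts, best first."""
    sims = l2_normalize(kb_embeds) @ l2_normalize(query)
    order = np.argsort(-sims)[:k]
    return order, sims[order]

def sample_views(candidates, num_views, rng):
    """Uniformly keep num_views of the retrieved candidates (assumed sampling rule)."""
    return rng.choice(candidates, size=num_views, replace=False)

rng = np.random.default_rng(0)
d, kb_size, k, num_views = 32, 20, 5, 2

# Random stand-ins for text embeddings of LLM-generated descriptions of one class.
kb_embeds = rng.normal(size=(kb_size, d))
# Stand-in for a semantic-enhanced sample feature, placed close to entry 3.
query = kb_embeds[3] + 0.01 * rng.normal(size=d)

top_idx, top_sims = retrieve_top_k(query, kb_embeds, k)

for step in range(3):
    # A different random subset of the retrieved texts is used at each step.
    views = sample_views(top_idx, num_views, rng)
    # Assumed fusion: average the selected text embeddings into one target.
    text_target = l2_normalize(kb_embeds[views].mean(axis=0))
```

Because the selected subset changes from step to step, the model never sees one fixed phrasing per class, which is the intuition behind using stochastic integration to curb overfitting to specific wordings.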
3.1 Overview

We propose a stochastic multi-view semantic enhancement framework for few-shot whole slide image classification, termed MUSE, as shown in Figure 2. The framework consists of two core components: sample-wise fine-grained semantic enhancement (SFSE) and stochastic multi-view model optimization (SMMO). Through these components, we achieve fine-grained semantic modeling and effectively harness semantic diversity, thereby enhancing the model's understanding of the semantic modality and improving its generalization capability under few-shot settings.

3.2 Fine-grained Semantic Enhancement

As shown in Figure 2(a), the sample-wise fine-grained semantic enhancement (SFSE) comprises two components: decompositional semantic refinement (DSR) and sample-wise vision–text interaction (SVTI). DSR decomposes the input textual semantics and isolates task-relevant core semantic segments, which serve as semantic cues for cross-modal interaction. SVTI employs cross-attention to dynamically attend to visual patches using these cues as queries, selectively aggregating semantically relevant features at the sample level. This attention-driven fusion enriches the representation with fine-grained, sample-wise, and context-aware semantic information.

3.2.1 Decompositional Semantic Refinement

To adapt the frozen pathology foundation model to the downstream FSWC task, we augment each category name $c\in C$ with $M$ learnable prompt vectors. For example, in CAMELYON [5, 4], categories include “normal lymph node” and “metastatic lymph node”. The resulting text prompt for category $c$ is formulated as:

$$T_{c}=[V]_{1}[V]_{2}\dots[V]_{M}[c], \quad (1)$$

where $[V]_{i}\in\mathbb{R}^{d}$ ($i=1,\dots,M$) are trainable embeddings.
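The prompt of Eq. (1) can be sketched as a concatenation of the $M$ context vectors with the token embeddings of the class name. This is a minimal illustration, not the authors' implementation: it assumes the context vectors are shared across classes (as the class-independent indexing of $[V]_{i}$ suggests), it replaces the real tokenizer and embedding table with random stand-ins, and the names `build_prompt`, `context`, and `class_tokens` are hypothetical.

```python
import numpy as np

def build_prompt(context, class_tokens):
    """Assemble T_c = [V]_1 ... [V]_M [c] as a sequence of embeddings.

    context: (M, d) learnable prompt vectors, class_tokens: (L_c, d).
    Returns an (M + L_c, d) sequence to be passed to the text encoder.
    """
    return np.concatenate([context, class_tokens], axis=0)

rng = np.random.default_rng(0)
M, d = 4, 8

# [V]_1 ... [V]_M: trainable in practice, randomly initialized here.
context = rng.normal(scale=0.02, size=(M, d))

# Random stand-ins for the token embeddings of each class name
# (three tokens each); a real model would look these up in its embedding table.
class_tokens = {
    "normal lymph node": rng.normal(size=(3, d)),
    "metastatic lymph node": rng.normal(size=(3, d)),
}

prompts = {c: build_prompt(context, tok) for c, tok in class_tokens.items()}
```

During few-shot training only `context` would receive gradients, while the class-name embeddings and the text encoder stay frozen, which keeps the number of trainable parameters small.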
This prompt is encoded by the text encoder $E_{T}(\cdot)$ of the foundation model to produce a textual feature representation $D\in\mathbb{R}^{|C|\times d}$, where each row $D_{i}$ corresponds to the encoded semantics of the $i$-th category. Although prompt tuning enhances task adaptability, the resulting representation remains holistic and lacks explicit fine-grained structure. To address this, the Decompositional Semantic Refinement (DSR) module refines $D$ into task-relevant semantic cues. Specifically, inspired by the Mixture-of-Experts (MoE) paradigm, we construct $R$ expert query matrices {WiQ}i=

Feb 24, 2026

MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification

LLMs can boost few-shot learning for pathology images, but only if you dynamically adapt the language priors to each image and stochastically integrate multiple "expert" descriptions.

Jiahao Xu, Sheng Huang, Zhixiong Nan +2
