Jiajun Dong

$\{b_{n}\}_{n=1}^{N}$ and form a bag representation $B\in\mathbb{R}^{N\times d}$. Given the $i$-th category semantics $D_{i}$, we leverage the fine-grained semantic cues $\{Q_{ij}\}_{j=1}^{k}$ produced by DSR, which encode discriminative sub-concepts relevant to pathological diagnosis. The goal of SVTI is to dynamically identify and fuse the subset of cues most aligned with the visual content of the current sample through vision–text interaction, thereby constructing a sample-level semantic prior.

Specifically, we compute multi-head cross-attention between each cue $Q_{ij}$ and the visual bag $B$. For attention head $h$, the alignment scores are:

$$A^{h}_{ij}=\frac{(Q_{ij}W^{Q,h})(BW^{K,h})^{\top}}{\sqrt{d_{\text{head}}}}, \tag{4}$$

where $W^{Q,h},W^{K,h}\in\mathbb{R}^{d\times d_{\text{head}}}$ are learnable projection matrices. To focus on regions highly correlated with the semantic cues, we retain only the top-$r\%$ patches with the highest attention scores for each head. Let $\mathcal{I}_{ij}^{h}$ denote the corresponding index set. The filtered value projection is then:

$$V^{h,\text{filtered}}_{ij}=B_{\mathcal{I}_{ij}^{h}}W^{V,h}, \tag{5}$$

with $W^{V,h}\in\mathbb{R}^{d\times d_{\text{head}}}$ as the value projection matrix. We aggregate across heads to obtain a vision–text fused representation for cue $j$:

$$\text{head}_{h}=\text{Softmax}(A^{h}_{ij})\,V^{h,\text{filtered}}_{ij}, \tag{6}$$

$$f_{ij}=\text{Concat}(\text{head}_{1},\dots,\text{head}_{H})\,W^{O}, \tag{7}$$

where $W^{O}\in\mathbb{R}^{Hd_{\text{head}}\times d}$ is the output projection matrix.
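The per-cue fusion in Eqs. (4)–(7) can be illustrated with a minimal NumPy sketch for a single cue. It is not the authors' implementation: the function name `svti_cue_fusion`, the stacked weight layout, and expressing the top-$r\%$ threshold as a fraction `r` are illustrative assumptions. In particular, the softmax is assumed to be taken over the retained scores only, since that is what makes the shapes in Eq. (6) agree with the filtered values of Eq. (5).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def svti_cue_fusion(q, B, Wq, Wk, Wv, Wo, r=0.5):
    """Fuse one semantic cue with a bag of patch features.

    q  : (d,)             one cue Q_ij
    B  : (N, d)           visual bag
    Wq, Wk, Wv : (H, d, d_head)  per-head projections (hypothetical layout)
    Wo : (H * d_head, d)  output projection
    r  : fraction of patches kept per head (assumed stand-in for top-r%)
    """
    H, _, d_head = Wq.shape
    N = B.shape[0]
    keep = max(1, int(np.ceil(r * N)))
    heads = []
    for h in range(H):
        # Eq. (4): scaled alignment scores between the cue and every patch.
        scores = (q @ Wq[h]) @ (B @ Wk[h]).T / np.sqrt(d_head)  # (N,)
        # Index set I_ij^h: the `keep` highest-scoring patches for this head.
        idx = np.argsort(scores)[-keep:]
        # Eq. (5): value projection restricted to the retained patches.
        V = B[idx] @ Wv[h]                                      # (keep, d_head)
        # Eq. (6), assuming the softmax runs over retained scores only.
        heads.append(softmax(scores[idx]) @ V)                  # (d_head,)
    # Eq. (7): concatenate heads and apply the output projection.
    return np.concatenate(heads) @ Wo                           # (d,)
```

With `r` set so that a single patch is kept, each head reduces to the value projection of its best-matching patch, which is a convenient sanity check on the filtering step.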
Finally, we combine the $k$ fused representations using the expert scores $\{S_{ij}\}_{j=1}^{k}$ from the DSR router, and obtain the final representation $f$ through average pooling over categories:

$$\{S_{ij}^{\text{norm}}\}_{j=1}^{k}=\text{Softmax}(\{S_{ij}\}_{j=1}^{k}), \tag{8}$$

$$f_{i}=\sum_{j=1}^{k}S_{ij}^{\text{norm}}f_{ij},\quad f=\frac{1}{\left|C\right|}\sum_{i=1}^{\left|C\right|}f_{i}. \tag{9}$$

The resulting vector $f$ serves as a fine-grained semantic prior for the input WSI with respect to the corresponding text feature $D$. This prior reflects the semantic sub-concepts that the current sample most strongly supports visually, and it is designed to serve as a query for retrieving complementary textual knowledge from an external semantic knowledge base in the subsequent stage.

3.3 Stochastic Multi-view Model Optimization

As illustrated in Figure 2, the sample-specific semantic prior $f$ generated by SVTI is used to query a pathology-oriented knowledge base, which is constructed offline via LLM-guided generation using chain-of-thought (CoT) [47] and in-context learning (ICL) [6] to produce diverse, multi-view textual descriptions. During training, the stochastic multi-view model optimization retrieves a set of semantically complementary texts using $f$ and randomly samples one of them at each iteration to update the model, thereby improving generalization through exposure to diverse semantic views.

3.3.1 LLM-based Semantic Knowledge Base Generation

Figure 3: Pipeline for generating the category-related text knowledge base. (a) We employ ChatGPT to analyze and decompose the concepts associated with class names. (b) We leverage ChatGPT to construct concrete examples for each aspect. (c) We randomly sample some examples and combine them with prompts to guide the locally deployed LLM in generating the category knowledge base.
As shown in Figure 3, we construct a multi-view textual knowledge base for each pathological category $c$ by leveraging category-level diagnostic concepts to guide LLM generation. The process involves three steps: concept decomposition, exemplar generation, and knowledge base assembly. First, we prompt GPT-4 with the category name to decompose its visual diagnostic criteria into four clinically meaningful aspects: cellular morphology, tissue architecture, color-staining characteristics, and spatial-texture patterns (Figure 3(a)). Next, for each aspect and category, GPT-4 generates 10 concrete exemplar descriptions (Figure 3(b)). To promote semantic diversity, we form synthesis prompts by randomly selecting one exemplar from each aspect and concatenating them. Due to API cost and time constraints, we use a lightweight open-source LLM deployed locally to generate 300 multi-view descriptions per category based on these prompts (Figure 3(c)). Finally, all generated texts are encoded by the text encoder $E_{T}(\cdot)$ into feature vectors, forming the category-specific knowledge base $\mathcal{B}_{c}$. Additional details are provided in the Supplementary Material.

3.3.2 LLM-based Multi-view Semantic Retrieval

Given the sample-wise semantic prior $f$ generated by SVTI and the ground-truth category label $c$, we retrieve semantically relevant text features from the pre-constructed knowledge base $\mathcal{B}_{c}$. The retrieval is performed by computing cosine similarity between an adapted representation of $f$ and all entries in $\mathcal{B}_{c}$:

$$T_{m}=\operatorname{Top\text{-}m}_{\,t\in\mathcal{B}_{c}}\big(\text{Sim}(A(f),t)\big), \tag{10}$$

where $A(\cdot)$ is a lightweight adapter that aligns the feature space of $f$ with that of the text encoder, and $\text{Sim}(\cdot,\cdot)$ denotes cosine similarity.
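The retrieval step in Eq. (10) amounts to a top-$m$ nearest-neighbour lookup under cosine similarity. The following NumPy sketch shows one way to compute it; the function name `top_m_retrieve` is hypothetical, and the adapter $A(\cdot)$ is assumed to have been applied already, since its exact form is not specified here.

```python
import numpy as np

def top_m_retrieve(f_adapted, kb, m):
    """Return indices and similarities of the m knowledge-base entries
    most cosine-similar to the adapted prior.

    f_adapted : (d_t,)    A(f), the prior already mapped into text space
    kb        : (M, d_t)  encoded descriptions for one category (B_c)
    m         : number of entries to retrieve
    """
    eps = 1e-12  # guards against division by zero for all-zero vectors
    q = f_adapted / (np.linalg.norm(f_adapted) + eps)
    K = kb / (np.linalg.norm(kb, axis=1, keepdims=True) + eps)
    sims = K @ q                   # cosine similarity to every entry
    order = np.argsort(-sims)[:m]  # indices sorted by descending similarity
    return order, sims[order]

# Toy example with a four-entry, two-dimensional knowledge base.
kb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idx, sims = top_m_retrieve(np.array([1.0, 0.1]), kb, m=2)
```

In the toy example the query points almost along the first axis, so the first entry ranks highest and the diagonal entry second. For a knowledge base of a few hundred entries per category, a full sort like this is inexpensive.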
To facilitate subsequent stochastic utilization, the retrieved set $T_{m}$ is randomly shuffled and stored in a queue structure, enabling sequential popping of text features in a randomized order during optimization.

3.3.3 Stochastic Optimization with Multiple-Semantics

Building upon the retrieved text features $T_{m}$, we stochastically incorporate diverse textual views to enrich the semantic representation. At each iteration, a single text feature $t$ is dequeued from the shuffled queue holding $T_{m}$. We apply Decompositional Semantic Refinement (DSR) to decompose $t$ into fine-grained semantic queries $\{Q^{t}_{j}\}_{j=1}^{k}$, which are then processed by the Sample-wise Vision-Text Interaction (SVTI) module together with the input bag $B$ to produce an auxiliary sample-wise prior $f^{t}_{\text{aux}}$. This prior introduces semantic information from additional textual views, thereby complementing the semantic representation of the primary prior $f$. Both $f$ and $f^{t}_{\text{aux}}$ are projected into logits via a shared MLP:

$$z=\text{MLP}(f),\quad z^{t}_{\text{aux}}=\text{MLP}(f^{t}_{\text{aux}}), \tag{11}$$

and fused by averaging:

$$z^{t}_{\text{final}}=\frac{z+z^{t}_{\text{aux}}}{2}. \tag{12}$$

The model is trained with cross-entropy loss:

$$\mathcal{L}^{t}=\text{CE}(z^{t}_{\text{final}},GT), \tag{13}$$

where $GT$ denotes the sample-wise (slide-level) ground-truth label. By stochastically exposing the model to multiple textual views from $T_{m}$, this procedure realizes multi-view semantic compensation, thereby improving generalization in few-shot scenarios. The full training algorithm is provided in the Supplementary Material.

4 Experiments

Table 1: Few-shot weakly supervised learning results (in %) on CAMELYON, TCGA-NSCLC, and TCGA-BRCA under 4-shot, 8-shot, and 16-shot settings.
The best performance is highlighted in bold, and the second-best is underlined.

Dataset | Model | 4-shot (ACC, AUC, F1 Score) | 8-shot (ACC, AUC, F1 Score) | 16-shot (ACC, AUC, F1 Score)
CAMELYON

Research focus

Computer Vision (1), Multimodal Models (1), Scientific Discovery & Drug Design (1)

Frequent co-authors

Jiahao Xu (1), Sheng Huang (1), Zhixiong Nan (1), Nankun Mu (1)

Papers (1)

Feb 24, 2026

MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification

LLMs can boost few-shot learning for pathology images, but only if you dynamically adapt the language priors to each image and stochastically integrate multiple "expert" descriptions.

Jiahao Xu, Sheng Huang, Zhixiong Nan +2
