$$z_{\mathrm{WSI}}=\sum_{i=1}^{M}\big(\lambda_{\mathrm{SPF}}\,\alpha_{i}+(1-\lambda_{\mathrm{SPF}})\,\beta_{i}\big)\,g_{i}^{\mathrm{AR}}, \qquad (7)$$

where $\lambda_{\mathrm{SPF}}\in[0,1]$ controls the trade-off between the coverage prior and the semantic attention weights.

ROI feature. For each WSI, we select the adaptive region with the largest SPF weight $\lambda_{\mathrm{SPF}}\alpha_{i}+(1-\lambda_{\mathrm{SPF}})\beta_{i}$ as the ROI and use its descriptor $g_{i}^{\mathrm{AR}}$ as the ROI feature. We then evaluate these ROI features on downstream tasks (see the first sketch following Table 1).

3.3 Model Pretraining Pipeline

We adopt a two-stage training strategy, as illustrated in Fig. 2. First, we perform self-supervised pretraining on unimodal data by adapting the iBOT algorithm to the CARE architecture. Second, we conduct cross-modal contrastive training under the CLIP framework using WSIs paired with RNA/protein profiles.

Figure 3: (a) Adaptive Region Generator. Based on soft inclusion, each patch retains only its top-3 candidate subregions and masks out the rest. Cosine similarity is then computed against the unmasked candidates, and the patch is assigned to the highest-scoring subregion, yielding an adaptive repartition of patches. (b) Semantic and Prior Fusion. A lightweight module that aggregates adaptive-region features into a slide-level embedding.

We curate a public multimodal cohort from TCGA [39] and GTEx [6] for training. The dataset includes 11,463 H&E-stained, formalin-fixed, paraffin-embedded (FFPE) whole-slide images from TCGA and 22,814 normal-tissue slides from GTEx. Within this cohort, we identify 13,289 WSI–RNA pairs and 8,225 WSI–protein pairs.

3.3.1 Stage I: Unimodal Self-Supervised Pretraining

We perform unimodal self-supervised pretraining with iBOT [46], a ViT-based teacher–student framework that predicts masked patch targets with an online prototype vocabulary while enforcing multi-view consistency. We adapt iBOT in a backbone-agnostic manner, adding mechanisms that scale pretraining to gigapixel WSIs.

Balancing batch size and gigapixel WSIs. Gigapixel WSIs yield tens of thousands of patches, which constrains training to small batches and weakens self-supervised pretraining. To trade off patch count against batch size, we cluster patch coordinates with DBSCAN [26] to partition each slide into sub-WSIs ($\leq 360$ patches each), converting 34,277 WSIs into 285,710 sub-WSIs and enabling larger effective batch sizes for iBOT pretraining (see the second sketch following Table 1).

Table 1: Average ACC (or C-index) by task category across the 33-task benchmark. The best score is in bold and the second-best is underlined. "Morph. Class." denotes the average performance on the morphological classification tasks. "Molecular Class." reports the mean result for molecular tasks. "Molecular Class.$_V$" denotes the average result on the no-validation subset of molecular tasks. "Sur. Anal." denotes the average performance on the survival prediction tasks. Detailed results for all 33 benchmarks are provided in the appendix.

Task | Head | Mean-pool | CHIEF | PRISM | GigaPath | TANGLE | FEATHER | TITAN | CARE
Morph. Class.
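To make the SPF aggregation of Eq. (7) and the ROI selection concrete, here is a minimal sketch. The function name `spf_aggregate` and the tensor layout (per-slide weight vectors `alpha`, `beta` and a descriptor matrix `g_ar`) are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of SPF aggregation (Eq. 7) and ROI selection.
# Names and shapes are assumptions for illustration.
import torch

def spf_aggregate(g_ar, alpha, beta, lambda_spf=0.5):
    """Fuse coverage priors and semantic attention into slide-level weights.

    g_ar:  (M, D) adaptive-region descriptors g_i^AR
    alpha: (M,)   coverage-prior weights
    beta:  (M,)   semantic attention weights
    """
    # SPF weight per adaptive region: lambda * alpha_i + (1 - lambda) * beta_i
    w = lambda_spf * alpha + (1.0 - lambda_spf) * beta      # (M,)
    # Slide-level embedding z_WSI: weighted sum of region descriptors
    z_wsi = (w.unsqueeze(-1) * g_ar).sum(dim=0)             # (D,)
    # ROI = adaptive region with the largest SPF weight
    roi_idx = torch.argmax(w)
    roi_feat = g_ar[roi_idx]                                # (D,) ROI feature
    return z_wsi, roi_feat
```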
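The sub-WSI partitioning described in Stage I can be sketched as below. The DBSCAN hyperparameters (`eps`, `min_samples`) and the rule for splitting oversized clusters are assumptions; the text only specifies DBSCAN clustering of patch coordinates with a cap of 360 patches per sub-WSI.

```python
# Minimal sketch of partitioning a gigapixel WSI into sub-WSIs by clustering
# patch coordinates with DBSCAN and capping each cluster at <= 360 patches.
# eps, min_samples, and the chunking rule are assumptions for illustration.
import numpy as np
from sklearn.cluster import DBSCAN

def split_into_sub_wsis(coords, eps=2.0, min_samples=5, max_patches=360):
    """coords: (N, 2) array of patch grid coordinates for one slide.
    Returns a list of index arrays, one per sub-WSI."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    sub_wsis = []
    for lab in np.unique(labels):
        # Noise points (label -1) are grouped into their own sub-WSIs here.
        idx = np.where(labels == lab)[0]
        # Split oversized clusters into chunks of at most max_patches.
        for start in range(0, len(idx), max_patches):
            sub_wsis.append(idx[start:start + max_patches])
    return sub_wsis
```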