This paper introduces a scalable Graph Transformer approach for classifying healthy versus tumor epithelial cells in whole-slide images (WSIs) of cutaneous squamous cell carcinoma (cSCC). It addresses a limitation of patch-based CNNs and Vision Transformers, which lose tissue-level context. The method constructs a full-WSI cell graph and applies Graph Transformer models (SGFormer and DIFFormer) for classification, which outperform image-based methods on both single and multiple WSIs. The most informative node representation combines morphological and texture features with the cell classes of non-epithelial cells, underscoring the importance of the surrounding cellular context.
Graph Transformers can beat state-of-the-art image-based models at classifying cancer cells in whole-slide images by explicitly modeling tissue-level context.
Whole-slide images (WSIs) from cancer patients contain rich information that can be used for medical diagnosis or to follow treatment progress. To automate their analysis, numerous deep learning methods based on convolutional neural networks and Vision Transformers have been developed and have achieved strong performance in segmentation and classification tasks. However, due to the large size and complex cellular organization of WSIs, these models rely on patch-based representations, losing vital tissue-level context. We propose using scalable Graph Transformers on a full-WSI cell graph for classification. We evaluate this methodology on a challenging task: the classification of healthy versus tumor epithelial cells in cutaneous squamous cell carcinoma (cSCC), where both cell types exhibit very similar morphologies and are therefore difficult to differentiate for image-based approaches. We first compared image-based and graph-based methods on a single WSI. Graph Transformer models SGFormer and DIFFormer achieved balanced accuracies of $85.2 \pm 1.5$ ($\pm$ standard error) and $85.1 \pm 2.5$ in 3-fold cross-validation, respectively, whereas the best image-based method reached $81.2 \pm 3.0$. By evaluating several node feature configurations, we found that the most informative representation combined morphological and texture features as well as the cell classes of non-epithelial cells, highlighting the importance of the surrounding cellular context. We then extended our work to train on several WSIs from several patients. To address the computational constraints of image-based models, we extracted four $2560 \times 2560$ pixel patches from each image and converted them into graphs. In this setting, DIFFormer achieved a balanced accuracy of $83.6 \pm 1.9$ (3-fold cross-validation), while the state-of-the-art image-based model CellViT256 reached $78.1 \pm 0.5$.
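The reported figures are balanced accuracies averaged over 3 cross-validation folds, with a standard error. As a minimal sketch of how such a "mean $\pm$ standard error" value can be computed, the Python below uses toy labels, not data from the paper. The function names are hypothetical, and the use of the sample standard deviation (divisor $n - 1$) for the standard error is an assumption, since the abstract does not state which estimator was used.

```python
from math import sqrt
from statistics import mean, stdev


def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: the unweighted mean of per-class recall."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of all samples whose true label is this class.
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return mean(recalls)


def mean_and_standard_error(scores):
    """Mean and standard error of per-fold scores.

    Assumption: standard error = sample standard deviation / sqrt(n).
    """
    return mean(scores), stdev(scores) / sqrt(len(scores))


# Toy (y_true, y_pred) pairs for three folds; 0 = healthy, 1 = tumor.
folds = [
    ([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 1]),
    ([0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0]),
    ([0, 0, 0, 1, 1, 1], [0, 1, 0, 1, 1, 1]),
]

fold_scores = [balanced_accuracy(t, p) for t, p in folds]
avg, se = mean_and_standard_error(fold_scores)
print(f"balanced accuracy: {100 * avg:.1f} +/- {100 * se:.1f}")
```

Balanced accuracy weights each class equally regardless of how many cells it contains, which makes it a natural choice when healthy and tumor cells are unevenly represented in a slide.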