Department of Computer Science and Engineering | IIT Delhi | PGIMER Chandigarh | Apr 21, 2026 | arXiv:2604.19350

Attend what matters: Leveraging vision foundational models for breast cancer classification using mammograms

Samyak Sanghvi, Piyush Miglani, Sarvesh Shashikumar, Kaustubh R Borgavi, Veenu Singla, Chetan Arora

AI Summary

This paper addresses the limited performance of Vision Transformers (ViTs) in breast cancer detection from mammograms, attributing it to two challenges: high-resolution images with small abnormalities, and the fine-grained nature of the classification task. The authors propose a framework that combines RoI-based token reduction guided by an object detection model, contrastive learning between selected RoIs, and a DINOv2-pretrained ViT. Experiments on public mammography datasets show better performance than existing baselines, which the authors present as evidence of the method's potential for breast cancer screening.

Key Contribution

Combining a DINOv2-pretrained ViT with RoI-guided token reduction and contrastive learning between RoIs improves breast cancer detection from mammograms over existing baselines.

Abstract

Vision Transformers $(\texttt{ViT})$ have become the architecture of choice for many computer vision tasks, yet their performance in computer-aided diagnostics remains limited. Focusing on breast cancer detection from mammograms, we identify two main causes for this shortfall. First, medical images are high-resolution with small abnormalities, leading to an excessive number of tokens and making it difficult for softmax-based attention to localize and attend to relevant regions. Second, medical image classification is inherently fine-grained, with low inter-class and high intra-class variability, where standard cross-entropy training is insufficient. To overcome these challenges, we propose a framework with three key components: (1) region of interest $(\texttt{RoI})$ based token reduction using an object detection model to guide attention; (2) contrastive learning between selected $\texttt{RoI}$s to enhance fine-grained discrimination through hard-negative-based training; and (3) a $\texttt{DINOv2}$-pretrained $\texttt{ViT}$ that captures localization-aware, fine-grained features instead of global $\texttt{CLIP}$ representations. Experiments on public mammography datasets demonstrate that our method achieves superior performance over existing baselines, establishing its effectiveness and potential clinical utility for large-scale breast cancer screening. Our code is available for reproducibility here: https://aih-iitd.github.io/publications/attend-what-matters
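To make the first two components more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a patch size of 14 pixels (as in the public DINOv2 ViT checkpoints), a hypothetical 2240 x 1792 mammogram, and a hypothetical detector box. The function names `roi_token_indices` and `info_nce` are illustrative, and the loss shown is a generic InfoNCE objective used as a stand-in, since the abstract does not specify the exact contrastive formulation.

```python
import math


def patch_grid(height, width, patch):
    """Rows and columns of whole patches produced by a ViT-style tokenizer."""
    return height // patch, width // patch


def roi_token_indices(height, width, patch, boxes):
    """Return sorted indices of patches that overlap any RoI box.

    boxes are (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    Assumption: the boxes come from some object detector (hypothetical here).
    """
    rows, cols = patch_grid(height, width, patch)
    keep = set()
    for x0, y0, x1, y1 in boxes:
        # Clamp the box to the area covered by whole patches.
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(cols * patch, x1), min(rows * patch, y1)
        if x0 >= x1 or y0 >= y1:
            continue  # box lies entirely outside the tokenized area
        for r in range(y0 // patch, (y1 - 1) // patch + 1):
            for c in range(x0 // patch, (x1 - 1) // patch + 1):
                keep.add(r * cols + c)  # row-major token index
    return sorted(keep)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (a stand-in, not the paper's loss)."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


if __name__ == "__main__":
    H, W, P = 2240, 1792, 14          # hypothetical image size, patch size 14
    rows, cols = patch_grid(H, W, P)
    all_tokens = rows * cols
    kept = roi_token_indices(H, W, P, [(140, 280, 364, 504)])  # hypothetical box
    print("tokens for full image:", all_tokens)
    print("tokens kept for RoI:", len(kept))
    print("attention pairs, full vs RoI:", all_tokens ** 2, len(kept) ** 2)

    anchor, positive = [1.0, 0.0], [0.9, 0.1]
    hard, easy = [0.8, 0.6], [-1.0, 0.0]  # hard negative is close to the anchor
    print("loss with hard negative:", info_nce(anchor, positive, [hard]))
    print("loss with easy negative:", info_nce(anchor, positive, [easy]))
```

Under these assumptions the full image yields 160 x 128 = 20480 tokens, while a single 224 x 224 box keeps 256 of them, so the number of pairwise attention scores shrinks by a factor of 6400. The loss comparison illustrates why hard negatives matter: a negative that lies close to the anchor contributes a much larger loss, and hence a stronger training signal, than one that is already far away.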

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
