Mar 4, 2026 | arXiv:2603.04081

Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers

Hiroki Kagiyama, Toru Nagasaka, Yukari Adachi, Takaaki Tachibana, Ryota Ito, Mitsugu Fujita, Kimihiro Yamashita, Yoshihiro Kakeji

AI Summary

This paper compares foundation models with task-specific architectures for cell-level histopathological image analysis on small image patches (40x40 pixels). The authors trained eight task-specific CNN and Vision Transformer architectures from scratch and evaluated three foundation models via linear probing and fine-tuning, varying the training data scale. Task-specific models, especially a Vision Transformer optimized for small patches (CustomViT), outperformed the foundation models at lower inference cost once sufficient training data were available, and blur robustness was comparable across architectures.

Key Contribution

In cell-level image analysis, foundation models saturate at moderate sample sizes, whereas task-specific architectures keep improving with more data and end up both more accurate and cheaper at inference. This challenges the assumption that pre-trained models always offer an advantage in the small-patch regime.

Abstract

Background and objective: Cell-level pathological image analysis requires working with extremely small image patches (40x40 pixels), far below standard ImageNet resolutions. It remains unclear whether modern deep learning architectures and foundation models can learn robust and scalable representations under this constraint. We systematically evaluated architectural suitability and data-scale effects for small-patch cell classification.

Methods: We analyzed 303 colorectal cancer specimens with CD103/CD8 immunostaining, generating 185,432 annotated cell images. Eight task-specific architectures were trained from scratch at multiple data scales (FlagLimit: 256 to 16,384 samples per class), and three foundation models were evaluated via linear probing and fine-tuning after resizing inputs to 224x224 pixels. Robustness to blur was assessed using pre- and post-resize Gaussian perturbations.

Results: Task-specific models improved consistently with increasing data scale, whereas foundation models saturated at moderate sample sizes. A Vision Transformer optimized for small patches (CustomViT) achieved the highest accuracy, outperforming all foundation models with substantially lower inference cost. Blur robustness was comparable across architectures, with no qualitative advantage observed for foundation models.

Conclusion: For cell-level classification under extreme spatial constraints, task-specific architectures are more effective and efficient than foundation models once sufficient training data are available. Higher clean accuracy does not imply superior robustness, and large pre-trained models offer limited benefit in the small-patch regime.
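The two data-handling steps named in the Methods, a per-class sample cap and Gaussian blur applied either before or after resizing, can be illustrated with a minimal NumPy sketch. This is not the authors' code. The function names, the reading of FlagLimit as a simple random per-class cap, the nearest-neighbor upscaling, and the sigma values are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def cap_per_class(labels, limit, seed=0):
    """Indices keeping at most `limit` samples per class.
    Assumed reading of the abstract's FlagLimit; hypothetical helper."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        if len(idx) > limit:
            idx = rng.choice(idx, size=limit, replace=False)
        keep.append(np.sort(idx))
    return np.concatenate(keep)

def gaussian_blur(img, sigma):
    """Separable Gaussian blur on an (H, W) or (H, W, C) array, reflect padding."""
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()  # normalize so constant images are unchanged
    out = img.astype(np.float64)
    for axis in (0, 1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out

def resize_nearest(img, size=224):
    """Nearest-neighbor resize to (size, size); interpolation choice is an assumption."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def blur_then_resize(img, sigma, size=224):
    """Pre-resize perturbation: blur at the native 40x40 scale, then upscale."""
    return resize_nearest(gaussian_blur(img, sigma), size)

def resize_then_blur(img, sigma, size=224):
    """Post-resize perturbation: upscale first, then blur at 224x224."""
    return gaussian_blur(resize_nearest(img, size), sigma)

if __name__ == "__main__":
    patch = np.random.default_rng(0).random((40, 40, 3))  # stand-in for a cell image
    print(blur_then_resize(patch, 1.0).shape, resize_then_blur(patch, 1.0).shape)
    print(len(cap_per_class([0] * 10 + [1] * 3, limit=4)))
```

One point the sketch makes concrete: because 224/40 = 5.6, a sigma given in pixels spans a 5.6 times smaller fraction of the cell when applied after resizing than before, so the two perturbations are not interchangeable at equal sigma.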

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Data Curation & Synthetic Data

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
