MIT CSAIL, Harvard. Jun 10, 2026. arXiv:2606.12407

How Seemingly Inconsequential Design Choices Dictate Performance of LLMs in Pathology

Kian R. Weihrauch, Thomas A. Buckley, William Lotter, Arjun K. Manrai

AI Summary

This study systematically evaluates how input design choices affect the performance of large language models (LLMs) on pathology tasks involving whole-slide images (WSIs). The authors vary four factors (inference mode, patch size, magnification, and patch count) and find that a single balanced configuration of large, lower-magnification patches processed jointly raises GPT-5 from 15.1% to 39.5% on cancer-type classification (TCGA) and from 38.1% to 62.9% on organ classification (GTEx). The results indicate that previous evaluations understated the capabilities of generalist LLMs and that the gap to specialized models is smaller than previously reported.

Key Contribution

Optimizing input configurations substantially boosts LLM performance on WSI pathology tasks, narrowing the gap with specialized models and challenging the assumption that domain-specific training is necessary.

Abstract

General-purpose large language models (LLMs) are routinely used as baselines when evaluating specialized pathology models on whole-slide images (WSIs). Because WSIs exceed contemporary model context limits, LLM baselines routinely use small, high-magnification patches processed independently via majority voting, without systematic evaluation of seemingly inconsequential design choices such as patch size, patch count, and magnification. Generalist LLMs have consistently underperformed specialized systems, reinforcing the perception that domain-specific training or architectural adaptation is necessary for pathology tasks involving WSIs. Here, we conduct a systematic factorial analysis of four input design factors: inference mode, patch size, magnification, and patch count. We demonstrate that prior studies have overstated the gap between specialized models and general-purpose LLMs by choosing non-optimized input configurations. On the MultiPathQA benchmark, switching to a single balanced configuration (large patches at lower magnification, processed jointly) raises GPT-5 from 15.1% to 39.5% on cancer-type classification (TCGA) and from 38.1% to 62.9% on organ classification (GTEx). Per-task optimization yields further gains up to 43.9% (TCGA) and 71.6% (GTEx). The same configuration generalizes to two other models and to a fully held-out CPTAC cohort, where it improves Gemini 3 Flash by 23.4 percentage points without any task-specific tuning.
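The experimental design in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the factor levels in `FACTORS` are placeholder values (the abstract names the four factors but not their levels), and the function names `factorial_grid`, `majority_vote`, and `gain_pp` are hypothetical. The sketch enumerates a full factorial grid over the four factors, shows the per-patch majority-vote aggregation used by the conventional baseline, and computes accuracy gains in percentage points.

```python
from collections import Counter
from itertools import product

# ASSUMPTION: placeholder levels for illustration only. The abstract lists
# the four factors but does not state which values were tested.
FACTORS = {
    "inference_mode": ["independent_vote", "joint"],
    "patch_size": [224, 1024],
    "magnification": ["5x", "20x"],
    "patch_count": [4, 16],
}


def factorial_grid(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    levels = (factors[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*levels)]


def majority_vote(patch_labels):
    """Aggregate independent per-patch predictions into one slide label.

    Ties are resolved in favour of the label that appears first.
    """
    if not patch_labels:
        raise ValueError("need at least one patch prediction")
    return Counter(patch_labels).most_common(1)[0][0]


def gain_pp(baseline_acc, improved_acc):
    """Accuracy gain in percentage points, rounded to one decimal."""
    return round(improved_acc - baseline_acc, 1)


GRID = factorial_grid(FACTORS)  # 2 * 2 * 2 * 2 = 16 configurations

if __name__ == "__main__":
    print(len(GRID))
    print(majority_vote(["lung", "breast", "lung"]))
    print(gain_pp(15.1, 39.5), gain_pp(38.1, 62.9))
```

Applied to the GPT-5 figures reported above, `gain_pp` gives 24.4 percentage points on TCGA and 24.8 on GTEx for the single balanced configuration.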

Computer Vision, Eval Frameworks & Benchmarks, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 36

Year: 2026

Venue: N/A
