This study systematically evaluates the impact of design choices on the performance of large language models (LLMs) in pathology tasks involving whole-slide images (WSIs). By optimizing input configurations (specifically patch size, magnification, and processing mode), the authors significantly improve the classification accuracy of GPT-5 and other models, demonstrating that previous evaluations may have misrepresented the capabilities of generalist LLMs. The findings reveal that a single optimized configuration can enhance performance on cancer-type and organ classification tasks, suggesting that LLMs can be more competitive with specialized models than previously thought.
Optimizing input configurations can boost LLM performance in pathology tasks, closing the gap with specialized models and challenging assumptions about domain-specific training.
General-purpose large language models (LLMs) are routinely used as baselines when evaluating specialized pathology models on whole-slide images (WSIs). Because WSIs exceed contemporary model context limits, LLM baselines typically use small, high-magnification patches that are processed independently and aggregated by majority voting, without systematic evaluation of seemingly inconsequential design choices such as patch size, patch count, and magnification. Generalist LLMs have consistently underperformed specialized systems, reinforcing the perception that domain-specific training or architectural adaptation is necessary for pathology tasks involving WSIs. Here, we conduct a systematic factorial analysis of four input design factors: inference mode, patch size, magnification, and patch count. We demonstrate that prior studies have overstated the gap between specialized models and general-purpose LLMs by choosing non-optimized input configurations. On the MultiPathQA benchmark, switching to a single balanced configuration (large patches at lower magnification, processed jointly) raises GPT-5 from 15.1% to 39.5% on cancer-type classification (TCGA) and from 38.1% to 62.9% on organ classification (GTEx). Per-task optimization yields further gains, reaching up to 43.9% (TCGA) and 71.6% (GTEx). The same configuration generalizes to two other models and to a fully held-out CPTAC cohort, where it improves Gemini 3 Flash by 23.4 percentage points without any task-specific tuning.
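As a rough illustration of the setup described in the abstract, the sketch below enumerates a full factorial grid over the four named factors, shows the majority-vote aggregation used when patches are processed independently, and computes the percentage-point gains implied by the reported accuracies. The factor levels and the function names (`factorial_grid`, `majority_vote`, `gain_pp`) are assumptions made for illustration; the abstract names the factors but not the values tested, and this is not the authors' code.

```python
from collections import Counter
from itertools import product

# Hypothetical factor levels: the abstract names these four factors
# but does not list the values that were actually evaluated.
FACTORS = {
    "inference_mode": ["independent", "joint"],
    "patch_size": [224, 1024],
    "magnification": [5, 20],
    "patch_count": [4, 16],
}


def factorial_grid(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def majority_vote(patch_labels):
    """Collapse per-patch predictions into a single slide-level label.

    Ties go to the label that appears first in the input.
    """
    return Counter(patch_labels).most_common(1)[0][0]


def gain_pp(before, after):
    """Accuracy change in percentage points, rounded to one decimal."""
    return round(after - before, 1)


grid = factorial_grid(FACTORS)
print(len(grid), "configurations in the hypothetical grid")

# Independent mode: each patch is classified alone, then votes are pooled.
print(majority_vote(["lung", "breast", "lung"]))

# Gains for GPT-5 computed from the accuracies reported in the abstract.
print("TCGA:", gain_pp(15.1, 39.5), "pp")
print("GTEx:", gain_pp(38.1, 62.9), "pp")
```

In joint mode there is no voting step, since all patches are passed to the model in one request and it returns a single slide-level answer; that path is omitted here because it depends on the specific model API.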