This study benchmarked the accuracy of nine multimodal LLMs, VLMs, and agent-based frameworks for pulmonary nodule detection on chest radiographs (CXRs) using a dataset of 100 CXRs. MedRAX and BiomedCLIP achieved the highest accuracy (0.711), while Claude 3.7 Sonnet performed best among proprietary LLMs (0.651). The findings suggest potential for improving open-source LLMs, but the overall accuracy of evaluated models was insufficient for clinical implementation, warranting further research with larger datasets and standardized protocols.
In this small benchmark, off-the-shelf multimodal LLMs trailed specialized models such as MedRAX and BiomedCLIP in accuracy for pulmonary nodule detection, suggesting that domain-specific adaptation remains important for clinical applications.
Artificial intelligence technologies are being actively introduced into clinical practice. The most promising solutions are AI assistants based on large language models (LLMs). Determining the feasibility of integrating such applications into clinical practice requires independent performance assessment. This study assessed the accuracy of several multimodal LLMs in detecting pulmonary nodules on chest radiographs (CXRs). Nine models were included: Llama 3.2 Vision 90B, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Gemini 2.0 Pro Experimental, Perplexity, CXR-LLaVA, XrayGPT, BiomedCLIP, and MedRAX. Each model determined the presence or absence of pulmonary nodules in a dataset of 100 CXRs, 50 of which contained pulmonary nodules. ROC curves were constructed and diagnostic accuracy metrics were calculated. McNemar's test was used for pairwise accuracy comparisons. The best results were achieved by the MedRAX framework and the BiomedCLIP vision-language model, each with an accuracy of 0.711 (95% CI 0.613–0.808). Among proprietary single-model LLMs, Claude 3.7 Sonnet performed best, with an accuracy of 0.651 (95% CI 0.548–0.753). Llama 3.2 Vision 90B, Claude 3.5 Sonnet, and Gemini 2.0 Pro Experimental had identical accuracy: 0.602 (95% CI 0.497–0.708). No statistically significant difference was observed between proprietary and open-source models, which may indicate potential for improving accuracy through refinement of open-source LLM-based models. Overall, the accuracy of the evaluated models was insufficient for implementation in current clinical practice. These results should be seen as exploratory given the small dataset, the single-centre design, the different prompting strategies used for foundation and domain-adapted models, and the use of PNG images instead of DICOM.
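The two statistics named in the abstract, accuracy with a 95% confidence interval and McNemar's test for paired comparisons, can be illustrated with a minimal Python sketch. The abstract does not say which interval method or which McNemar variant the authors used, so the normal-approximation (Wald) interval and the exact binomial form below are assumptions, the function names are hypothetical, and the counts are invented for illustration rather than taken from the study.

```python
import math


def accuracy_with_wald_ci(n_correct, n_total, z=1.96):
    """Accuracy with a normal-approximation (Wald) confidence interval.

    Assumption: the Wald interval is used here for simplicity; the source
    does not state which interval method was applied.
    """
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1 - p) / n_total)
    # Clamp to [0, 1], since the Wald interval can overshoot near the bounds.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from discordant pair counts.

    b: cases model A classified correctly and model B incorrectly
    c: cases model A classified incorrectly and model B correctly
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, not from the study: 70 of 100 images classified
# correctly, and 10 vs 3 discordant cases between two models.
acc, lo, hi = accuracy_with_wald_ci(70, 100)
p_value = mcnemar_exact(10, 3)
print(f"accuracy = {acc:.3f} (95% CI {lo:.3f} to {hi:.3f})")
print(f"McNemar exact p = {p_value:.3f}")
```

McNemar's test uses only the cases on which the two models disagree, which is why it suits comparisons where every model reads the same set of images. With roughly 100 images the number of discordant pairs is small, so even sizeable gaps in accuracy can fail to reach significance, in line with the exploratory framing of the results.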