This study compared the performance of three large language models (LLMs) – ChatGPT-4, Claude 2, and Google Med-PaLM – against 17 physicians across different specialties on a 25-question multiple-choice exam focused on pulmonary thromboembolism (PTE). The goal was to assess the AI systems' clinical reasoning capabilities in a complex medical domain. Claude 2 matched the performance of internal medicine and pulmonary specialists (80% accuracy) and significantly outperformed emergency medicine physicians, while ChatGPT-4 and Med-PaLM demonstrated non-inferiority to the specialists.
Claude 2 can match the performance of top medical specialists on pulmonary thromboembolism knowledge assessments, suggesting AI's potential for clinical decision support.
Background: AI systems are increasingly being evaluated for their potential role in medical decision-making. Pulmonary thromboembolism (PTE) is an ideal test domain for evaluating AI clinical reasoning: it is highly prevalent, carries significant mortality risk, and is clinically complex, requiring integration of validated risk stratification tools, multiple imaging modalities, and nuanced treatment algorithms across diverse patient populations, including pregnancy, malignancy, and renal impairment. We compared the performance of large language models (LLMs) with that of specialist physicians on a PTE knowledge assessment.

Methods: We administered 25 multiple-choice questions covering the diagnosis, treatment, complications, and management of PTE to 17 physicians (seven emergency medicine, five internal medicine, and five pulmonary specialists) and three AI systems: ChatGPT-4 (OpenAI, San Francisco, CA, USA), Claude 2 (Anthropic, San Francisco, CA, USA), and Google Med-PaLM (Google Research, Mountain View, CA, USA). Questions were categorized into four domains: diagnosis, treatment, complications, and management/ICU. We calculated overall accuracy and domain-specific performance. We applied a pre-specified non-inferiority margin of 10 percentage points, a threshold consistent with FDA guidance for medical device comparison studies and with prior AI-physician trials, representing the largest performance gap considered clinically acceptable for adjunctive clinical decision support while maintaining appropriate safety standards.

Results: Internal medicine and pulmonary specialists achieved the highest scores (80% each), matched by Claude 2 (80%). ChatGPT-4 and Med-PaLM scored 72% each, while emergency medicine physicians averaged 64.6%. Claude 2 significantly outperformed emergency medicine physicians (+15.4 percentage points, p<0.05). ChatGPT-4 and Med-PaLM were non-inferior to internal medicine and pulmonary specialists (-8 percentage points, within the 10-percentage-point margin). All groups performed well on diagnostic questions but struggled with nuanced treatment and management scenarios. The AI systems had particular difficulty with guideline-based edge cases and cancer-associated thromboembolism management.

Conclusions: Advanced AI systems can achieve specialist-level performance on structured medical knowledge assessments. Claude 2 matched the top specialists and exceeded emergency medicine performance, while the other AI systems were non-inferior to domain experts. These findings support the potential utility of AI in medical education and clinical decision support while highlighting areas that require further development.
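To make the margin described in the Methods concrete, the short sketch below applies the 10-percentage-point non-inferiority criterion to the overall accuracies reported in the Results. The abstract does not describe the authors' analysis code or the statistical test behind the reported p-value, so this is only an illustration of the margin arithmetic under those assumptions, not the study's implementation.

```python
# Illustrative sketch (not the authors' code): apply the pre-specified
# 10-percentage-point non-inferiority margin to the accuracies reported
# in the Results section of the abstract.

NON_INFERIORITY_MARGIN = 10.0  # percentage points, per the Methods

# Overall accuracy (%) on the 25-question exam, taken from the Results.
scores = {
    "Internal medicine": 80.0,
    "Pulmonary": 80.0,
    "Claude 2": 80.0,
    "ChatGPT-4": 72.0,
    "Med-PaLM": 72.0,
    "Emergency medicine": 64.6,
}


def non_inferior(candidate: float, reference: float,
                 margin: float = NON_INFERIORITY_MARGIN) -> bool:
    """Candidate is non-inferior if it scores no more than `margin` points below the reference."""
    return (reference - candidate) <= margin


reference = scores["Internal medicine"]  # top specialist score (80%)
for model in ("Claude 2", "ChatGPT-4", "Med-PaLM"):
    diff = scores[model] - reference
    verdict = "non-inferior" if non_inferior(scores[model], reference) else "inferior"
    print(f"{model}: {diff:+.1f} points vs specialists -> {verdict}")
    # Claude 2: +0.0 points; ChatGPT-4 and Med-PaLM: -8.0 points,
    # both within the 10-point margin, matching the abstract.
```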