MIT CSAIL, KAUST | Jun 10, 2026 | arXiv:2606.11702

MedCTA: A Benchmark for Clinical Tool Agents

Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker, Bernard Ghanem

AI Summary

This paper introduces MedCTA, a benchmark designed to evaluate clinical tool agents on complex, step-implicit tasks using realistic multimodal clinical inputs. By assessing 18 multimodal models against 107 clinician-validated tasks, the study reveals significant shortcomings in current systems, including high rates of protocol failures and incorrect tool recruitment during multi-step processes. The findings underscore that even advanced perception capabilities do not guarantee reliable performance in clinical decision-making contexts.

Key Contribution

Even state-of-the-art multimodal models struggle with reliability in clinical tool use, revealing critical gaps in AI agent performance.

Abstract

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at https://ivul-kaust.github.io/MedCTA/
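To make the idea of process-aware evaluation concrete, here is a minimal sketch of how a predicted tool trajectory could be scored against a clinician-verified one along four of the axes named above (tool selection, argument validity, execution stability, trajectory fidelity). This is not MedCTA's actual scoring code. The tool names (`segment`, `measure`, `report`), the `ToolCall` structure, the required-argument schemas, and the specific metric definitions are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One step of a trajectory (hypothetical structure, not MedCTA's format)."""
    tool: str
    args: dict = field(default_factory=dict)
    ok: bool = True  # True if the call executed without a runtime error


def tool_selection_accuracy(pred, gold):
    """Fraction of gold steps whose tool is matched at the same position."""
    if not gold:
        return 0.0
    hits = sum(1 for p, g in zip(pred, gold) if p.tool == g.tool)
    return hits / len(gold)


def argument_validity(pred, schemas):
    """Fraction of predicted calls that supply every required argument name."""
    if not pred:
        return 0.0
    valid = sum(1 for p in pred if schemas.get(p.tool, set()) <= set(p.args))
    return valid / len(pred)


def execution_stability(pred):
    """Fraction of predicted calls that executed without error."""
    if not pred:
        return 0.0
    return sum(1 for p in pred if p.ok) / len(pred)


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def trajectory_fidelity(pred, gold):
    """Order-aware overlap: LCS of tool names divided by gold length."""
    if not gold:
        return 0.0
    return lcs_length([p.tool for p in pred], [g.tool for g in gold]) / len(gold)


# Hypothetical required arguments per tool (assumed for this sketch).
schemas = {"segment": {"image"}, "measure": {"mask"}, "report": {"findings"}}

# Gold trajectory: segment, then measure, then report.
gold = [
    ToolCall("segment", {"image": "ct.png"}),
    ToolCall("measure", {"mask": "m1"}),
    ToolCall("report", {"findings": "nodule"}),
]

# Predicted rollout that skips the measurement step and omits report arguments.
pred = [
    ToolCall("segment", {"image": "ct.png"}),
    ToolCall("report", {}),
]

print("tool selection:", tool_selection_accuracy(pred, gold))
print("argument validity:", argument_validity(pred, schemas))
print("execution stability:", execution_stability(pred))
print("trajectory fidelity:", trajectory_fidelity(pred, gold))
```

In this toy rollout, every call executes cleanly, yet the skipped `measure` step and the empty `report` arguments lower the other three scores. That is the kind of failure a process-aware evaluation can surface and an outcome-only score could miss.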

Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
