The paper introduces MedRCube, a multidimensional evaluation framework for MLLMs in medical imaging, designed to provide fine-grained insight into model performance and reasoning. The authors benchmark 33 MLLMs, revealing limitations of existing evaluation metrics and highlighting the top-tier performance of Lingshu-32B. A credibility evaluation subset exposes a highly significant positive association between shortcut behavior and diagnostic performance, raising concerns about clinical trustworthiness.
MLLMs that excel at medical image diagnosis may be relying on shortcuts, undermining their trustworthiness for clinical deployment.
The potential of Multimodal Large Language Models (MLLMs) in the domain of medical imaging raises the demand for systematic and rigorous evaluation frameworks that are aligned with real-world medical imaging practice. Existing practices that report single or coarse-grained metrics lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine-grained, and in-depth evaluation. We instantiate this paradigm as MedRCube, built with a two-stage systematic construction pipeline designed for it. We benchmark 33 MLLMs, among which Lingshu-32B achieves top-tier performance. Crucially, MedRCube exposes a series of pronounced insights inaccessible under prior evaluation settings. Furthermore, we introduce a credibility evaluation subset to quantify reasoning credibility and uncover a highly significant positive association between shortcut behavior and diagnostic task performance, raising concerns for clinically trustworthy deployment. The resources of this work can be found at https://github.com/F1mc/MedRCube.