The paper introduces SurgCoT, a new benchmark to evaluate chain-of-thought (CoT) reasoning in multi-modal large language models (MLLMs) applied to surgical videos across diverse specialties and procedures. SurgCoT uses a structured CoT framework with detailed annotations to assess five core reasoning dimensions, including causal action ordering and anomaly detection. Experiments on 10 leading MLLMs reveal significant gaps in surgical CoT reasoning, even in commercial models, highlighting the need for further research in this area.
MLLMs still struggle with the spatiotemporal reasoning needed to understand surgical videos, even with chain-of-thought prompting.
Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions (Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking) through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), in which the Knowledge field supplies essential background context and the Clue field provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows that: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps remain in surgical CoT reasoning; and 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT thus provides a reproducible testbed for narrowing the gap between MLLM capabilities and clinical reasoning demands. Code: https://github.com/CVI-SZU/SurgCoT.
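To make the annotation protocol concrete, the following is a minimal sketch of what one Question-Option-Knowledge-Clue-Answer record might look like. The five field names come from the abstract; the concrete types, the validation helper, and the example content (drawn loosely from the Anomaly Onset Tracking dimension) are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass


@dataclass
class QOKCAAnnotation:
    """One record in the Question-Option-Knowledge-Clue-Answer protocol.

    Field names follow the abstract; types and example content below are
    assumptions for illustration, not the benchmark's real schema.
    """
    question: str        # reasoning question about the video clip
    options: list        # multiple-choice candidate answers
    knowledge: str       # essential background context for reasoning
    clue: str            # definitive spatiotemporal evidence in the clip
    answer: str          # ground-truth option

    def is_valid(self) -> bool:
        # The ground-truth answer must be one of the listed options.
        return self.answer in self.options


# Hypothetical example record (content invented for illustration).
sample = QOKCAAnnotation(
    question="When does bleeding first become visible in the clip?",
    options=["00:05-00:08", "00:12-00:15", "00:21-00:24", "00:30-00:33"],
    knowledge="Active bleeding appears as pooling bright-red fluid "
              "near the dissection plane.",
    clue="Bright-red pooling first appears in the lower-left "
         "quadrant at 00:12.",
    answer="00:12-00:15",
)
```

A record like this lets an evaluator score both the final answer and the intermediate reasoning: a model can be checked against the Clue's spatiotemporal evidence, not just the chosen option.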