The paper introduces SurgCoT, a new benchmark to evaluate chain-of-thought (CoT) reasoning in multi-modal large language models (MLLMs) applied to surgical videos across diverse specialties and procedures. SurgCoT uses a structured CoT framework with detailed annotations to assess five core reasoning dimensions, including causal action ordering and anomaly detection. Experiments on 10 leading MLLMs reveal significant gaps in surgical CoT reasoning, even in commercial models, highlighting the need for further research in this area.
MLLMs still struggle with the spatiotemporal reasoning needed to understand surgical videos, even with chain-of-thought prompting.
Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions (Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking) through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), in which the Knowledge field supplies essential background context and the Clue field provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows that: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps remain in surgical CoT reasoning; and 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT thus provides a reproducible testbed for narrowing the gap between MLLM capabilities and clinical reasoning demands. Code: https://github.com/CVI-SZU/SurgCoT.
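To make the annotation protocol concrete, the following is a minimal sketch of what one Question-Option-Knowledge-Clue-Answer record might look like. The five field names come from the abstract; the concrete types, the validation helper, and the example content (drawn loosely from the Anomaly Onset Tracking dimension) are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass


@dataclass
class QOKCAAnnotation:
    """One record in the Question-Option-Knowledge-Clue-Answer protocol.

    Field names follow the abstract; types and example content below are
    assumptions for illustration, not the benchmark's real schema.
    """
    question: str        # reasoning question about the video clip
    options: list        # multiple-choice candidate answers
    knowledge: str       # essential background context for reasoning
    clue: str            # definitive spatiotemporal evidence in the clip
    answer: str          # ground-truth option

    def is_valid(self) -> bool:
        # The ground-truth answer must be one of the listed options.
        return self.answer in self.options


# Hypothetical example record (content invented for illustration).
sample = QOKCAAnnotation(
    question="When does bleeding first become visible in the clip?",
    options=["00:05-00:08", "00:12-00:15", "00:21-00:24", "00:30-00:33"],
    knowledge="Active bleeding appears as pooling bright-red fluid "
              "near the dissection plane.",
    clue="Bright-red pooling first appears in the lower-left "
         "quadrant at 00:12.",
    answer="00:12-00:15",
)
```

A record like this lets an evaluator score both the final answer and the intermediate reasoning: a model can be checked against the Clue's spatiotemporal evidence, not just the chosen option.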