The study benchmarked the performance of DeepSeek-R1, GPT-4o, Qwen3, and Grok-3 in extracting T/N staging from esophageal cancer endoscopic ultrasound reports, evaluating the impact of language (Chinese/English) and prompting strategies. Using a 2x2 factorial design under a zero-shot setting on 625 EUS reports for T-staging and 579 for N-staging, DeepSeek-R1 demonstrated superior overall accuracy in both T and N-staging tasks compared to the other models. The findings highlight the potential of LLMs for automated cancer staging and suggest DeepSeek-R1's robust reasoning capabilities can improve treatment planning.
DeepSeek-R1 outperforms GPT-4o, Qwen3, and Grok-3 in extracting cancer stages from medical reports, especially when prompts are absent and the language is Chinese.
Objectives: To benchmark the performance of DeepSeek-R1 against three other advanced AI reasoning models (GPT-4o, Qwen3, Grok-3) in automatically extracting T/N staging from complex endoscopic ultrasound (EUS) reports of esophageal cancer, and to evaluate the impact of language (Chinese/English) and prompting strategy (with/without a designed prompt) on model accuracy and robustness. Methods: We retrospectively analyzed 625 EUS reports for T-staging and 579 for N-staging, collected from 663 patients at the Sun Yat-sen University Cancer Center between 2018 and 2020. A 2 × 2 factorial design (Language × Prompt) was employed under a zero-shot setting. Model performance was evaluated using accuracy, and odds ratios (ORs) were calculated to quantify the comparative performance advantage between models across scenarios. Results: Performance was evaluated across four scenarios: (1) Chinese with-prompt, (2) Chinese without-prompt, (3) English with-prompt, and (4) English without-prompt. In both the T- and N-staging tasks, DeepSeek-R1 demonstrated superior overall performance. For T-staging, average accuracies were 91.4% (DeepSeek-R1), 84.2% (GPT-4o), 89.5% (Qwen3), and 81.3% (Grok-3); for N-staging, the respective averages were 84.2%, 65.0%, 68.4%, and 51.9%. Notably, N-staging proved more challenging than T-staging for all models, as indicated by uniformly lower accuracy. DeepSeek-R1's superiority was most pronounced in the Chinese without-prompt T-staging scenario, where it achieved significantly higher accuracy than GPT-4o (OR = 7.84, 95% CI [4.62–13.30], p < 0.001), Qwen3 (OR = 5.00, 95% CI [2.85–8.79], p < 0.001), and Grok-3 (OR = 6.47, 95% CI [4.30–9.74], p < 0.001). Conclusions: This study validates the feasibility and effectiveness of large language models (LLMs) for automated T/N staging from EUS reports.
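The abstract does not give the per-scenario correct/incorrect counts behind the reported ORs, but the comparison it describes is the standard odds ratio between two models' accuracies with a Wald-type 95% confidence interval on the log-odds scale. Below is a minimal sketch of that calculation; the counts used in the example are hypothetical, not taken from the paper.

```python
import math

def odds_ratio(correct_a, wrong_a, correct_b, wrong_b, z=1.96):
    """Odds ratio of model A answering correctly vs. model B,
    with a Wald 95% CI computed on the log-odds scale."""
    or_ = (correct_a * wrong_b) / (wrong_a * correct_b)
    # Standard error of log(OR) from the 2x2 contingency table
    se = math.sqrt(1 / correct_a + 1 / wrong_a + 1 / correct_b + 1 / wrong_b)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts out of 625 T-staging reports (illustration only):
# model A correct 590 / wrong 35, model B correct 510 / wrong 115
or_, lo, hi = odds_ratio(590, 35, 510, 115)
print(f"OR = {or_:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

An OR above 1 with a CI excluding 1 indicates model A is significantly more likely to stage a report correctly than model B, which is how the abstract's comparisons (e.g., OR = 7.84) should be read.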
Our findings confirm that DeepSeek-R1 possesses strong intrinsic reasoning capabilities, achieving the most robust performance across diverse conditions, with a particularly pronounced advantage in the challenging English without-prompt N-staging task. By providing a standardized, objective extraction method, DeepSeek-R1 mitigates inter-observer variability, and its deployment offers a reliable foundation for guiding precise, individualized treatment planning for esophageal cancer patients.