The study benchmarked the performance of DeepSeek-R1, GPT-4o, Qwen3, and Grok-3 in extracting T/N staging from esophageal cancer endoscopic ultrasound reports, evaluating the impact of language (Chinese/English) and prompting strategies. Using a 2x2 factorial design under a zero-shot setting on 625 EUS reports for T-staging and 579 for N-staging, DeepSeek-R1 demonstrated superior overall accuracy in both T and N-staging tasks compared to the other models. The findings highlight the potential of LLMs for automated cancer staging and suggest DeepSeek-R1's robust reasoning capabilities can improve treatment planning.
DeepSeek-R1 outperforms GPT-4o, Qwen3, and Grok-3 in extracting cancer stages from medical reports, especially when prompts are absent and the language is Chinese.
Objectives: To benchmark the performance of DeepSeek-R1 against three other advanced AI reasoning models (GPT-4o, Qwen3, Grok-3) in automatically extracting T/N staging from complex endoscopic ultrasound (EUS) reports of esophageal cancer, and to evaluate the impact of language (Chinese/English) and prompting strategy (with/without a designed prompt) on model accuracy and robustness. Methods: We retrospectively analyzed 625 EUS reports for T-staging and 579 for N-staging, collected from 663 patients at the Sun Yat-sen University Cancer Center between 2018 and 2020. A 2 × 2 factorial design (Language × Prompt) was employed under a zero-shot setting. Model performance was evaluated using accuracy, and odds ratios (ORs) were calculated to quantify the comparative performance advantage between models across scenarios. Results: Performance was evaluated across four scenarios: (1) Chinese with-prompt, (2) Chinese without-prompt, (3) English with-prompt, and (4) English without-prompt. In both the T- and N-staging tasks, DeepSeek-R1 demonstrated superior overall performance. For T-staging, average accuracies were 91.4% (DeepSeek-R1), 84.2% (GPT-4o), 89.5% (Qwen3), and 81.3% (Grok-3); for N-staging, the respective averages were 84.2%, 65.0%, 68.4%, and 51.9%. Notably, N-staging proved more challenging than T-staging for all models, as indicated by uniformly lower accuracy. DeepSeek-R1's superiority was most pronounced in the Chinese without-prompt T-staging scenario, where it achieved significantly higher accuracy than GPT-4o (OR = 7.84, 95% CI [4.62–13.30], p < 0.001), Qwen3 (OR = 5.00, 95% CI [2.85–8.79], p < 0.001), and Grok-3 (OR = 6.47, 95% CI [4.30–9.74], p < 0.001). Conclusions: This study validates the feasibility and effectiveness of large language models (LLMs) for automated T/N staging from EUS reports.
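The abstract does not give the per-scenario correct/incorrect counts behind the reported ORs, but the comparison it describes is the standard odds ratio between two models' accuracies with a Wald-type 95% confidence interval on the log-odds scale. Below is a minimal sketch of that calculation; the counts used in the example are hypothetical, not taken from the paper.

```python
import math

def odds_ratio(correct_a, wrong_a, correct_b, wrong_b, z=1.96):
    """Odds ratio of model A answering correctly vs. model B,
    with a Wald 95% CI computed on the log-odds scale."""
    or_ = (correct_a * wrong_b) / (wrong_a * correct_b)
    # Standard error of log(OR) from the 2x2 contingency table
    se = math.sqrt(1 / correct_a + 1 / wrong_a + 1 / correct_b + 1 / wrong_b)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts out of 625 T-staging reports (illustration only):
# model A correct 590 / wrong 35, model B correct 510 / wrong 115
or_, lo, hi = odds_ratio(590, 35, 510, 115)
print(f"OR = {or_:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

An OR above 1 with a CI excluding 1 indicates model A is significantly more likely to stage a report correctly than model B, which is how the abstract's comparisons (e.g., OR = 7.84) should be read.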
Our findings confirm that DeepSeek-R1 possesses strong intrinsic reasoning capabilities, achieving the most robust performance across diverse conditions, with a particularly pronounced advantage in the challenging English without-prompt N-staging task. By providing a standardized, objective extraction method, DeepSeek-R1 mitigates inter-observer variability, and its deployment offers a reliable foundation for guiding precise, individualized treatment planning for esophageal cancer patients.