The authors introduce SpineMed, an ecosystem for AI-assisted diagnosis of spine disorders, comprising SpineMed-450k, a large-scale, vertebral-level instruction dataset spanning multiple imaging modalities, and SpineBench, a clinically grounded evaluation framework. They curated SpineMed-450k from diverse sources using a clinician-in-the-loop pipeline with a two-stage LLM generation method to produce high-quality, traceable data. Evaluating several advanced LVLMs on SpineBench revealed weaknesses in fine-grained, level-specific reasoning, while a model fine-tuned on SpineMed-450k showed significant improvements, validated by clinician assessments.
Current vision-language models stumble on subtle spine diagnoses, but a new dataset and benchmark expose these weaknesses and pave the way for clinically useful AI.
Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning over X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities, with over 450,000 instruction instances, and SpineBench, a clinically grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question answering, multi-turn consultations, and report generation. SpineBench evaluates models along clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.