Aarhus University · Equal supervision · Freiburg · Jun 18, 2026 · arXiv:2606.20477

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter, Behzad Bozorgtabar, Thomas Brox

AI Summary

This paper introduces RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs for training vision-language models in radiology without manual spatial annotations. The authors present RadGrounder, a model that jointly performs report generation, visual question answering, and spatial grounding. It achieves competitive results on external VQA benchmarks, and adding the clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone. Notably, adding grounding supervision does not degrade language quality, so outputs can be spatially verified at no cost to VQA performance.

Key Contribution

RadGrounder achieves competitive performance in radiology VQA while enabling spatial grounding without compromising language quality.

Abstract

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves results competitive with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.
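The abstract mentions grounding targets in two forms, bounding boxes and segmentation masks, with the grounding subsets produced by automated segmentation. As a purely illustrative sketch, and not the authors' pipeline, the snippet below shows one common way a box label can be derived from a binary mask. The function name `mask_to_bbox`, the nested-list mask layout, and the inclusive `(x_min, y_min, x_max, y_max)` pixel convention are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical helper, not from the paper: tight axis-aligned box around a mask.
def mask_to_bbox(mask: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max) in inclusive pixel indices.

    `mask` is a 2D list indexed as mask[y][x]; nonzero marks the structure.
    Returns None when the mask contains no foreground pixel.
    """
    # Row indices that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Column indices of every foreground pixel.
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return min(cols), rows[0], max(cols), rows[-1]


# Toy 4x4 mask with a small L-shaped foreground region.
example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
example_box = mask_to_bbox(example_mask)
```

A real pipeline would likely operate on array data and may normalize coordinates to the image size; those details are not given in the abstract and are left out here.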

Computer Vision · Data Curation & Synthetic Data · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
