Birmingham City University, MBZUAI, NHS England, University Hospitals Birmingham NHS
May 25, 2026
arXiv:2605.25956

RAPTOR+: A Visually Grounded Vision-Language Framework to Improve Clinical Trust and Auditability in Automated Cancer Referral Processing

Sofiat Abioye, Ufaq Khan, Shazad Ashraf, Anusha Jose, Benjamin Wallace, William Poulett, Adam Byfield, Lukman Akanbi, Muhammad Bilal

AI Summary

RAPTOR+ is a multimodal extension of the RAPTOR system that uses Vision-Language Models (VLMs) for end-to-end understanding of clinical cancer referral documents, removing the need for a separate OCR stage. The study evaluates fine-tuned VLMs (including Qwen3-VL-8B), commercial and open-source zero-shot VLMs (including Gemini 2.5 Flash), and the original OCR-based pipeline on 223 urgent colorectal cancer (CRC) referral forms, using a grounding-aware evaluation framework. Fine-tuned Qwen3-VL-8B reached 96.1% Reading Accuracy and 60.6% Strict Safety, against 92.6% and 1.2% for zero-shot Gemini 2.5 Flash, which highlights the importance of task-specific fine-tuning for reliable clinical document processing.

Key Contribution

Zero-shot VLMs can read referral forms accurately, but they rarely *ground* what they extract in visual evidence: Gemini 2.5 Flash reached 92.6% Reading Accuracy with only 1.2% Strict Safety. Task-specific fine-tuning narrows that gap, lifting Qwen3-VL-8B to 60.6% Strict Safety.

Abstract

Urgent suspected colorectal cancer (CRC) referrals create operational bottlenecks because semi-structured clinical documents often require manual review and transcription. The original RAPTOR system used Large Language Models for structured extraction but relied on a separate OCR stage, making it vulnerable to handwriting, layout variation, and loss of visual evidence linkage. We present RAPTOR+, a multimodal extension that uses Vision-Language Models (VLMs) for end-to-end referral understanding. We evaluate fine-tuned VLMs, commercial and open-source zero-shot VLMs, and the original OCR-based pipeline on 223 clinically curated CRC urgent referral forms. We also introduce a grounding-aware evaluation framework that measures both extraction accuracy and evidence localisation. Results show a clear grounding gap in zero-shot models. Gemini 2.5 Flash achieved 92.6% Reading Accuracy but only 1.2% Strict Safety. In contrast, fine-tuned Qwen3-VL-8B achieved 96.1% Reading Accuracy and 60.6% Strict Safety, substantially improving verifiable evidence grounding. These findings show that task-specific fine-tuning is essential for reliable, auditable clinical document understanding. RAPTOR+ enables extracted referral decisions to be linked to visual evidence, supporting safer and more efficient cancer referral triage.
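The abstract reports Reading Accuracy and Strict Safety but does not define either metric here. As a rough illustration of what a grounding-aware evaluation could look like, the Python sketch below scores each extracted field twice: once on the value alone, and once requiring that the model's evidence box also overlaps the annotated region. Everything in it is an assumption made for illustration, not the paper's method: the `FieldResult` record, the case-insensitive exact-match rule, the intersection-over-union (IoU) criterion, the 0.5 threshold, and the name `grounded_accuracy`, which stands in for a strict, evidence-aware score.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in page pixels.
Box = Tuple[float, float, float, float]


@dataclass
class FieldResult:
    """One extracted field with its annotation (hypothetical schema)."""
    predicted_value: str
    true_value: str
    predicted_box: Optional[Box]  # None when the model cites no evidence region
    true_box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def value_matches(r: FieldResult) -> bool:
    """Assumed matching rule: case-insensitive, whitespace-trimmed equality."""
    return r.predicted_value.strip().lower() == r.true_value.strip().lower()


def reading_accuracy(results: List[FieldResult]) -> float:
    """Fraction of fields whose extracted value matches the annotation."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(value_matches(r) for r in results) / len(results)


def grounded_accuracy(results: List[FieldResult], iou_threshold: float = 0.5) -> float:
    """Fraction of fields that are both correct and localised.

    Assumed criterion: the value matches AND the predicted box overlaps the
    annotated box with IoU >= iou_threshold. A missing box counts as a failure.
    """
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(
        value_matches(r)
        and r.predicted_box is not None
        and iou(r.predicted_box, r.true_box) >= iou_threshold
        for r in results
    )
    return hits / len(results)
```

Under these assumptions the second score can never exceed the first, so a model that reads well but returns missing or misplaced evidence boxes shows the kind of gap between the two numbers that the abstract describes for zero-shot models.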

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
