RAPTOR+ is a multimodal extension of the RAPTOR system that uses Vision-Language Models (VLMs) to read clinical cancer referral documents end to end, eliminating the separate OCR stage. The study compares fine-tuned VLMs (Qwen3-VL-8B), commercial and open-source zero-shot VLMs (including Gemini 2.5 Flash), and the original OCR-based pipeline on 223 urgent colorectal cancer (CRC) referral forms, using a grounding-aware evaluation framework. Fine-tuned Qwen3-VL-8B reached 96.1% Reading Accuracy and 60.6% Strict Safety, against 92.6% and 1.2% for zero-shot Gemini 2.5 Flash, which points to task-specific fine-tuning as essential for reliable clinical document processing.
Zero-shot VLMs might ace the reading test, but when it comes to actually *grounding* their understanding in visual evidence for critical clinical decisions, fine-tuning is the only way to fly.
Urgent suspected colorectal cancer (CRC) referrals create operational bottlenecks because semi-structured clinical documents often require manual review and transcription. The original RAPTOR system used Large Language Models for structured extraction but relied on a separate OCR stage, making it vulnerable to handwriting, layout variation, and loss of visual evidence linkage. We present RAPTOR+, a multimodal extension that uses Vision-Language Models (VLMs) for end-to-end referral understanding. We evaluate fine-tuned VLMs, commercial and open-source zero-shot VLMs, and the original OCR-based pipeline on 223 clinically curated CRC urgent referral forms. We also introduce a grounding-aware evaluation framework that measures both extraction accuracy and evidence localisation. Results show a clear grounding gap in zero-shot models. Gemini 2.5 Flash achieved 92.6% Reading Accuracy but only 1.2% Strict Safety. In contrast, fine-tuned Qwen3-VL-8B achieved 96.1% Reading Accuracy and 60.6% Strict Safety, substantially improving verifiable evidence grounding. These findings show that task-specific fine-tuning is essential for reliable, auditable clinical document understanding. RAPTOR+ enables extracted referral decisions to be linked to visual evidence, supporting safer and more efficient cancer referral triage.
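The abstract names Reading Accuracy and Strict Safety but does not define them here. As a minimal sketch of how a grounding-aware evaluation could separate the two ideas, the Python below assumes that a field counts as "read" when the extracted value matches the reference, and as "strictly grounded" only when the value matches and the predicted evidence box overlaps the annotated box above an IoU threshold. The names `FieldResult`, `reading_accuracy`, `strict_grounded_rate`, the box format, and the 0.5 threshold are all illustrative assumptions, not the paper's definitions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in page coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class FieldResult:
    """One extracted field with its (assumed) evidence annotation."""
    predicted_value: str
    gold_value: str
    predicted_box: Optional[Box]  # None when the model gives no evidence region
    gold_box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def value_correct(r: FieldResult) -> bool:
    """Assumed matching rule: case- and whitespace-insensitive equality."""
    return r.predicted_value.strip().lower() == r.gold_value.strip().lower()


def reading_accuracy(results: List[FieldResult]) -> float:
    """Fraction of fields whose extracted value matches the reference."""
    if not results:
        return 0.0
    return sum(value_correct(r) for r in results) / len(results)


def strict_grounded_rate(results: List[FieldResult],
                         iou_threshold: float = 0.5) -> float:
    """Fraction of fields that are both correct and localised (assumed rule)."""
    if not results:
        return 0.0
    hits = sum(
        value_correct(r)
        and r.predicted_box is not None
        and iou(r.predicted_box, r.gold_box) >= iou_threshold
        for r in results
    )
    return hits / len(results)


# Toy data (invented for illustration, not from the study).
example = [
    FieldResult("Yes", "yes", (0, 0, 10, 10), (0, 0, 10, 10)),    # read and grounded
    FieldResult("6 weeks", "6 weeks", None, (0, 20, 10, 30)),     # read, no evidence
    FieldResult("No", "no", (50, 50, 60, 60), (0, 40, 10, 50)),   # read, wrong region
    FieldResult("Male", "female", (0, 60, 10, 70), (0, 60, 10, 70)),  # misread
]
```

Under these assumptions, a model can score well on `reading_accuracy` while scoring poorly on `strict_grounded_rate`, which is the shape of the grounding gap the abstract reports for zero-shot models.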