Penn State | Mar 5, 2026 | arXiv:2603.04950

Location-Aware Pretraining for Medical Difference Visual Question Answering

Denis Musinguzi, Caren Han, Prasenjit Mitra

AI Summary

This paper introduces a location-aware pretraining framework for medical difference VQA, designed to improve the ability of vision encoders to capture subtle visual variations in medical images. The framework incorporates three location-aware tasks: automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). By pretraining vision encoders with these tasks, the authors achieve state-of-the-art performance in detecting and reasoning about changes in chest X-ray images.

Key Contribution

Location-aware pretraining unlocks significant gains in medical difference VQA by teaching models to spot clinically relevant changes that standard vision encoders miss.

Abstract

Unlike conventional single-image models, differential medical VQA frameworks process multiple images to identify differences, mirroring the comparative diagnostic workflow of radiologists. However, standard vision encoders trained on contrastive or classification objectives often fail to capture the subtle visual variations necessary for distinguishing disease progression from acquisition differences. To address this limitation, we introduce a pretraining framework that incorporates location-aware tasks, including automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These tasks enable the vision encoder to learn fine-grained, spatially grounded visual representations that are often overlooked by traditional pretraining methods. We subsequently integrate this enhanced vision encoder with a language model to perform medical difference VQA. Experimental results demonstrate that our approach achieves state-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 41

Year: 2026

Venue: N/A
