DAMO, SCU, Shanghai AI Lab, XJU | Apr 13, 2026 | arXiv:2604.11197

MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration

Jiahui Peng, He Yao, Jingwen Li, Yanzhou Su, Sibo Ju, Yujie Lu, Jin Ye, Hongchun Lu, Xue Li, Lincheng Jiang, Min Zhu, Junlong Cheng

AI Summary

MedP-CLIP, a region-aware medical vision-language model, is introduced to improve fine-grained understanding of anatomical structures and lesions in medical images. It integrates medical prior knowledge with a feature-level region prompt integration mechanism, accommodating various prompt forms while preserving global context. Pre-trained on a large-scale dataset of medical images with region annotations, MedP-CLIP achieves superior performance in zero-shot recognition, interactive segmentation, and multimodal large language model enhancement compared to existing methods.

Key Contribution

MedP-CLIP combines whole-image understanding with fine-grained analysis of prompted regions, and outperforms baseline methods on tasks ranging from zero-shot recognition to interactive segmentation.

Abstract

Contrastive Language-Image Pre-training (CLIP) has demonstrated outstanding performance in global image understanding and zero-shot transfer through large-scale text-image alignment. However, the core of medical image analysis often lies in the fine-grained understanding of specific anatomical structures or lesion regions. Therefore, precisely comprehending region-of-interest (RoI) information provided by medical professionals or perception models becomes crucial. To address this need, we propose MedP-CLIP, a region-aware medical vision-language model (VLM). MedP-CLIP innovatively integrates medical prior knowledge and designs a feature-level region prompt integration mechanism, enabling it to flexibly respond to various prompt forms (e.g., points, bounding boxes, masks) while maintaining global contextual awareness when focusing on local regions. We pre-train the model on a meticulously constructed large-scale dataset (containing over 6.4 million medical images and 97.3 million region-level annotations), equipping it with cross-disease and cross-modality fine-grained spatial semantic understanding capabilities. Experiments demonstrate that MedP-CLIP significantly outperforms baseline methods in various medical tasks, including zero-shot recognition, interactive segmentation, and empowering multimodal large language models. This model provides a scalable, plug-and-play visual backbone for medical AI, combining holistic image understanding with precise regional analysis.
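The abstract does not specify how region prompts are fused with image features, so the following is only a minimal sketch of one plausible reading, not the authors' mechanism. It assumes that point and box prompts are rasterized into a mask over a grid of patch features (masks can be passed directly), and that a region-pooled feature is blended with the global mean so that local focus keeps global context. The function names, the `(row, col)` and `(top, left, bottom, right)` conventions, and the blending weight `alpha` are all hypothetical.

```python
from typing import List, Tuple

Mask = List[List[float]]          # h x w grid of weights in [0, 1]
Feats = List[List[List[float]]]   # h x w x d grid of patch features


def point_to_mask(h: int, w: int, point: Tuple[int, int], radius: int = 1) -> Mask:
    """Mark grid cells within `radius` (Chebyshev distance) of a (row, col) point."""
    r0, c0 = point
    return [[1.0 if max(abs(r - r0), abs(c - c0)) <= radius else 0.0
             for c in range(w)] for r in range(h)]


def box_to_mask(h: int, w: int, box: Tuple[int, int, int, int]) -> Mask:
    """Mark grid cells inside an inclusive (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return [[1.0 if top <= r <= bottom and left <= c <= right else 0.0
             for c in range(w)] for r in range(h)]


def region_feature(patch_feats: Feats, mask: Mask, alpha: float = 0.5) -> List[float]:
    """Blend a mask-weighted pooled feature with the global mean feature.

    alpha = 1.0 uses only the prompted region; alpha = 0.0 uses only the
    global context. This blending rule is an illustrative assumption.
    """
    h, w = len(mask), len(mask[0])
    d = len(patch_feats[0][0])
    total = sum(sum(row) for row in mask)
    if total == 0:
        raise ValueError("region prompt selects no patches")
    region = [0.0] * d
    global_mean = [0.0] * d
    for r in range(h):
        for c in range(w):
            for k in range(d):
                value = patch_feats[r][c][k]
                global_mean[k] += value / (h * w)
                region[k] += mask[r][c] * value / total
    return [alpha * region[k] + (1.0 - alpha) * global_mean[k] for k in range(d)]
```

In this sketch all three prompt forms reduce to the same mask representation, so the pooling step does not need to know which kind of prompt the user supplied. A learned integration module, as the abstract suggests, would replace the fixed `alpha` blend.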

Computer Vision, Multimodal Models, Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
