Feb 25, 2026

arXiv:2602.22143

MedTri: A Platform for Structured Medical Report Normalization to Enhance Vision-Language Pretraining

Yuetan Chu, Xinhua Ma, Xinran Jin, Gongning Luo

AI Summary

The paper introduces MedTri, a normalization framework that transforms free-text medical reports into structured [Anatomical Entity: Radiologic Description + Diagnosis Category] triplets for improved vision-language pretraining. This structured normalization aims to reduce stylistic heterogeneity and image-irrelevant content in medical reports, providing more consistent and anatomy-grounded supervision. Experiments on X-ray and CT datasets demonstrate that MedTri consistently improves vision-language pretraining performance compared to raw reports and existing normalization methods, and facilitates modular text augmentation strategies.

Key Contribution

Standard text normalization for medical reports leaves performance unrealized: structuring reports into anatomy-grounded triplets yields consistent improvements in medical vision-language pretraining over raw reports and existing normalization baselines.

Abstract

Medical vision-language pretraining increasingly relies on medical reports as large-scale supervisory signals; however, raw reports often exhibit substantial stylistic heterogeneity, variable length, and a considerable amount of image-irrelevant content. Although text normalization is frequently adopted as a preprocessing step in prior work, its design principles and empirical impact on vision-language pretraining have not been sufficiently or systematically examined. In this study, we present MedTri, a deployable normalization framework for medical vision-language pretraining that converts free-text reports into a unified [Anatomical Entity: Radiologic Description + Diagnosis Category] triplet. This structured, anatomy-grounded normalization preserves essential morphological and spatial information while removing stylistic noise and image-irrelevant content, providing consistent and image-grounded textual supervision at scale. Across multiple datasets spanning both X-ray and computed tomography (CT) modalities, we demonstrate that structured, anatomy-grounded text normalization is an important factor in medical vision-language pretraining quality, yielding consistent improvements over raw reports and existing normalization baselines. In addition, we illustrate how this normalization readily supports modular text-level augmentation strategies, including knowledge enrichment and anatomy-grounded counterfactual supervision, which provide complementary gains in robustness and generalization without altering the core normalization process. Together, our results position structured text normalization as a critical and generalizable preprocessing component for medical vision-language learning, with MedTri serving as a platform for it. Code and data will be released at https://github.com/Arturia-Pendragon-Iris/MedTri.
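The triplet format described in the abstract can be pictured with a minimal sketch. This is not the MedTri implementation: the class name `ReportTriplet`, its field names, the `render_report` helper, and the example findings are all hypothetical, chosen only to show how a report might be held as [Anatomical Entity: Radiologic Description + Diagnosis Category] triplets and serialized into a single text string for pretraining.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReportTriplet:
    """One normalized finding. Field names are hypothetical, not from MedTri."""
    anatomical_entity: str
    radiologic_description: str
    diagnosis_category: str

    def render(self) -> str:
        # Serialize in the bracketed layout quoted in the abstract.
        return (
            f"[{self.anatomical_entity}: "
            f"{self.radiologic_description} + {self.diagnosis_category}]"
        )


def render_report(triplets: Iterable[ReportTriplet]) -> str:
    """Join all triplets of one report into a single supervision string."""
    return " ".join(t.render() for t in triplets)


# Illustrative findings only; these are made up, not taken from the paper's data.
example = [
    ReportTriplet("right lower lobe", "patchy consolidation", "pneumonia"),
    ReportTriplet("left pleural space", "small effusion", "pleural effusion"),
]
normalized = render_report(example)
print(normalized)
```

Because each finding is a separate record in this sketch, a text-level augmentation step could in principle operate on one field at a time without touching the others, which is one way to read the abstract's point about modular augmentation.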

Data Curation & Synthetic Data, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
