The paper introduces ProtoSR, a method to improve structured radiology reporting by injecting knowledge extracted from free-text reports using an instruction-tuned LLM. ProtoSR constructs a multimodal knowledge base from MIMIC-CXR, aligning free-text descriptions with structured reporting templates and representing answer options with visual prototypes. By retrieving relevant prototypes and using them to condition a residual prediction, ProtoSR achieves state-of-the-art results on the Rad-ReStruct benchmark, particularly on detailed attribute questions.
Instruction-tuned LLMs can mine free-text radiology reports to create a knowledge base that significantly improves the accuracy of structured report generation, especially for rare and detailed findings.
Structured radiology reporting promises faster, more consistent communication than free text, but automation remains difficult because models must make many fine-grained, discrete decisions about rare findings and attributes from limited structured supervision. In contrast, free-text reports are produced at scale in routine care and implicitly encode fine-grained, image-linked information through detailed descriptions. To leverage this unstructured knowledge, we propose ProtoSR, an approach for injecting free-text information into structured report population. First, we introduce an automatic extraction pipeline that uses an instruction-tuned LLM to mine 80k+ MIMIC-CXR studies and build a multimodal knowledge base aligned with a structured reporting template, representing each answer option with a visual prototype. Using this knowledge base, ProtoSR is trained to retrieve the prototypes relevant to the current image-question pair and to augment the model's predictions through a prototype-conditioned residual, providing a data-driven second opinion that selectively corrects predictions. On the Rad-ReStruct benchmark, ProtoSR achieves state-of-the-art results, with the largest improvements on detailed attribute questions, demonstrating the value of integrating free-text-derived signal for fine-grained image understanding.
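The idea of visual prototypes and a prototype-conditioned residual can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a prototype is the mean image embedding of the studies assigned to an answer option, that retrieval is a cosine-similarity lookup, and that the residual is added to the base logits with a fixed weight. The abstract describes retrieval and correction as trained, so the fixed `gate` and `temperature` here are stand-ins, and the function names `build_prototypes` and `prototype_residual_logits` are hypothetical.

```python
import numpy as np


def build_prototypes(embeddings, labels, num_options):
    """Average the image embeddings of each answer option.

    Assumption: mean pooling defines a prototype; the source does not
    specify how prototypes are computed.
    """
    dim = embeddings.shape[1]
    prototypes = np.zeros((num_options, dim))
    for option in range(num_options):
        mask = labels == option
        if mask.any():  # options with no examples keep a zero prototype
            prototypes[option] = embeddings[mask].mean(axis=0)
    return prototypes


def l2_normalize(x, eps=1e-8):
    """Scale vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def prototype_residual_logits(query, prototypes, base_logits,
                              gate=0.5, temperature=0.1):
    """Add a similarity-based residual to the base model's logits.

    Assumption: `gate` and `temperature` are fixed illustrative values,
    whereas a trained model would learn how strongly to correct.
    """
    # Cosine similarity between the query and every option prototype.
    similarities = l2_normalize(prototypes) @ l2_normalize(query)
    residual = similarities / temperature
    return base_logits + gate * residual
```

In this sketch the base logits are left untouched and the prototype term is purely additive, so setting `gate` to zero recovers the base model. That mirrors the "second opinion" framing in the abstract, although the actual conditioning mechanism may differ.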