Search papers, labs, and topics across Lattice.
This paper introduces a hierarchical vision-language framework for automated histopathology report generation from whole slide images (WSIs). The approach uses multi-resolution pyramidal patch selection and a frozen UNI Vision Transformer to extract features, which are then fed into a Transformer decoder to generate diagnostic text tokenized with BioGPT. A retrieval-based verification step using Sentence BERT embeddings is implemented to improve report reliability by comparing generated reports with a reference corpus and substituting them with ground truth references when a high similarity match is found.
Achieve more reliable histopathology reports by verifying AI-generated text against a reference corpus, swapping in ground truth when a close match is found.
Generating diagnostic text from histopathology whole slide images (WSIs) is challenging due to the gigapixel scale of the input and the requirement for precise, domain specific language. We propose a hierarchical vision language framework that combines a frozen pathology foundation model with a Transformer decoder for report generation. To make WSI processing tractable, we perform multi resolution pyramidal patch selection (downsampling factors 2^3 to 2^6) and remove background and artifacts using Laplacian variance and HSV based criteria. Patch features are extracted with the UNI Vision Transformer and projected to a 6 layer Transformer decoder that generates diagnostic text via cross attention. To better represent biomedical terminology, we tokenize the output using BioGPT. Finally, we add a retrieval based verification step that compares generated reports with a reference corpus using Sentence BERT embeddings; if a high similarity match is found, the generated report is replaced with the retrieved ground truth reference to improve reliability.