The authors introduce EPITOME, a multimodal document ingestion and question-answering pipeline designed to accelerate biocuration of immunological data by integrating OCR, text matching, and vision-language models (VLMs). EPITOME runs in three stages: regex-based identification of epitopes and MHC molecules, visual element extraction from PDFs, and contextual indexing that links peptide sequences, MHC molecules, and assays across text, tables, and figures. Preliminary zero-shot results with open-source VLMs inside EPITOME suggest that a curator-in-the-loop process could accelerate biocuration.
Biocuration bottlenecks, begone: a new vision-language pipeline lets human curators leverage AI to extract immunological data from papers faster than ever.
The Immune Epitope Database (IEDB, iedb.org) has manually curated epitope data from over 26,000 publications across two decades. With PubMed adding ∼5,000 articles daily, traditional manual curation cannot keep pace. Because the data in scientific papers is multimodal, spread across text, tables, and figures, we set out to build an open-source tool based on vision-language models (VLMs) that human curators can use to speed up and partially automate biological data curation. Here we present a multimodal document ingestion and question-answering (QnA) pipeline that combines traditional Optical Character Recognition (OCR) and text matching with VLM capabilities. The system, which we call EPITOME, processes documents in three stages: regex-based epitope and MHC molecule identification, visual element extraction from PDFs, and contextual indexing that links peptide sequences, MHC molecules, and assays to their locations across text, tables, and figures. This index supplies the context for subsequent VLM QnA. Our preliminary results show promising zero-shot performance from open-source VLMs, suggesting that a curator-in-the-loop process could accelerate biocuration. Our evaluation also identifies strategic points where curator intervention can improve overall system accuracy.
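To make the first and third stages concrete, here is a minimal Python sketch of regex-based identification feeding a location index. It is not EPITOME's implementation: the two patterns, the 8 to 15 residue length range, the `build_index` function, and the `(page, kind, text)` input shape are all assumptions made for illustration, and the PDF and OCR extraction that would produce those elements is taken as already done.

```python
import re
from collections import defaultdict

# Assumed pattern: a run of 8 to 15 standard amino-acid letters.
# The length range is a hypothetical choice, not taken from the paper.
PEPTIDE_RE = re.compile(r"\b[ACDEFGHIKLMNPQRSTVWY]{8,15}\b")

# Assumed pattern: HLA allele names such as HLA-A*02:01 or HLA-DRB1*15:01.
MHC_RE = re.compile(r"\bHLA-[A-Z]+[0-9]*\*\d{2,3}(?::\d{2,3})*")

PATTERNS = (("peptide", PEPTIDE_RE), ("mhc", MHC_RE))


def build_index(elements):
    """Map each matched entity to the places where it appears.

    `elements` is an iterable of (page, kind, text) tuples, where `kind`
    is a label such as "text", "table", or "figure" (hypothetical schema).
    Returns a dict keyed by (label, matched_string) whose values are lists
    of (page, kind) locations in the order they were encountered.
    """
    index = defaultdict(list)
    for page, kind, text in elements:
        for label, pattern in PATTERNS:
            for match in pattern.finditer(text):
                index[(label, match.group())].append((page, kind))
    return dict(index)


# Example input: text already extracted from a PDF, one tuple per element.
elements = [
    (1, "text", "The peptide GILGFVFTL is presented by HLA-A*02:01."),
    (3, "table", "GILGFVFTL  HLA-A*02:01  ELISpot"),
    (4, "figure", "Response to SIINFEKL"),
]
index = build_index(elements)
```

A pattern this loose also matches ordinary uppercase words made only of amino-acid letters, such as MATERIALS, which is one reason candidate hits in a sketch like this would still need curator review before being passed on as context for VLM question answering.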