Independent Researcher, NII, RIKEN, Tohoku, UTokyo
May 27, 2026
arXiv:2605.28375

PrionNER: A Named Entity Recognition Dataset for Prion Disease Biomedical Literature

An Dao, Nhan Ly, Thao Tran, Yuji Matsumoto, Akiko Aizawa

AI Summary

The authors introduce PrionNER, a new manually annotated named entity recognition (NER) dataset for prion disease clinical information in PubMed abstracts. The dataset contains 317 abstracts, 2,943 sentences, and 6,955 entity annotations across 15 coarse-grained and 31 fine-grained entity types relevant to prion disease. Benchmarking experiments with BERT, W2NER, and Gemma-4-31B show that W2NER performs best in the supervised setting and Gemma-4-31B performs best in the zero-shot setting, yet the dataset remains challenging, particularly for structurally complex mentions and fine-grained label distinctions.

Key Contribution

PrionNER enables clinically focused NER for prion diseases and exposes the limitations of existing models in extracting fine-grained, complex information from biomedical literature.

Abstract

Prion diseases are rare, rapidly progressive, and fatal neurodegenerative disorders that remain difficult to diagnose, particularly in their early stages because of nonspecific clinical presentations. However, to our knowledge, there is no publicly available prion-disease-focused dataset designed to capture a broad range of clinically relevant entities from the biomedical literature. We introduce PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information in PubMed abstracts. The current release comprises 317 abstracts, 2,943 sentences, and 6,955 text-bound entity annotations spanning 15 coarse-grained and 31 fine-grained clinically oriented entity types covering diseases, symptoms, diagnostics, findings, anatomy, treatments, and temporal and statistical evidence. Inter-annotator agreement reaches 81.78 exact-match F1, indicating strong annotation consistency. We benchmark supervised BERT baselines, W2NER, and zero-shot extractors on PrionNER. W2NER is the strongest supervised model, and Gemma-4-31B is the strongest zero-shot model, but the benchmark remains challenging, especially for structurally complex mentions and fine-grained clinically adjacent label distinctions. PrionNER provides a clinically grounded benchmark for prion-disease information extraction and supports research on rare-disease biomedical NLP under low-resource, fine-grained, and non-flat extraction conditions. The dataset, annotation guidelines, and evaluation scripts are available at https://github.com/daotuanan/PrionNER/.
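The exact-match F1 reported above for inter-annotator agreement can be illustrated with a minimal sketch. It assumes each annotation is a (start, end, label) triple and treats one annotator as the reference; the span representation, the function name `exact_match_f1`, and the label strings are hypothetical and are not taken from the PrionNER release or its evaluation scripts.

```python
from typing import Iterable, Tuple

# Hypothetical span representation: (start offset, end offset, label).
# This is an assumption for illustration, not the PrionNER file format.
Span = Tuple[int, int, str]


def exact_match_f1(reference: Iterable[Span], candidate: Iterable[Span]) -> float:
    """F1 where a span counts as correct only if its boundaries and label both match."""
    ref_set, cand_set = set(reference), set(candidate)
    if not ref_set and not cand_set:
        return 1.0  # convention chosen here: two empty annotation sets agree fully
    true_positives = len(ref_set & cand_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cand_set)
    recall = true_positives / len(ref_set)
    return 2 * precision * recall / (precision + recall)


# Two annotators agree on two of three spans; the second span differs by one character.
annotator_a = [(0, 3, "Disease"), (10, 18, "Symptom"), (25, 28, "Anatomy")]
annotator_b = [(0, 3, "Disease"), (10, 17, "Symptom"), (25, 28, "Anatomy")]
score = exact_match_f1(annotator_a, annotator_b)
```

Under exact matching, the one-character boundary disagreement counts as a full miss for both annotators, which is why this criterion is stricter than overlap-based agreement.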

Data Curation & Synthetic Data, Natural Language Processing, Scientific Discovery & Drug Design

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
