Apr 28, 2026 | arXiv:2604.25374

Language corpora for the Dutch medical domain

AI Summary

This paper addresses the scarcity of Dutch medical corpora by assembling a dataset of roughly 35 billion tokens from three sources: translated English datasets, medical text identified in generic corpora, and open Dutch medical resources. The resulting corpus, comprising approximately 100 million documents, is intended to support NLP development in the Dutch medical domain. The authors release it on Hugging Face for pre-training and downstream tasks.

Key Contribution

A freely available Dutch medical corpus of roughly 35 billion tokens, released on Hugging Face for pre-training and downstream NLP tasks.

Abstract

Background: Dutch medical corpora are scarce, limiting NLP development.

Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources.

Results: The resulting corpus comprises approximately 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face.

Conclusion: This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.

Data Curation & Synthetic Data, Natural Language Processing, Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 25

Year: 2026

Venue: N/A
