This paper addresses the scarcity of Dutch medical corpora by building a dataset of approximately 35 billion tokens from three sources: translated English datasets, medical text extracted from generic corpora, and open Dutch medical resources. The resulting corpus, comprising approximately 100 million documents, aims to facilitate NLP development in the Dutch medical domain. The authors release the corpus on Hugging Face for use in pre-training and downstream tasks.
Dutch NLP researchers, rejoice: a massive, freely available 35B token medical corpus has arrived to jumpstart your models.
\textbf{Background:} Dutch medical corpora are scarce, limiting NLP development. \\ \textbf{Methods:} We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. \\ \textbf{Results:} The resulting corpus comprises approximately 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face. \\ \textbf{Conclusion:} This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.