Mar 10, 2026 · arXiv:2603.09470

The Patrologia Graeca Corpus: OCR, Annotation, and Open Release of Noisy Nineteenth-Century Polytonic Greek Editions

AI Summary

The authors introduce the Patrologia Graeca Corpus, a large-scale OCR and linguistic resource derived from 19th-century editions of Ancient Greek texts. They employ a pipeline combining YOLO-based layout detection and CRNN-based text recognition to process complex bilingual (Greek-Latin) layouts with degraded polytonic Greek typography. The resulting corpus achieves a character error rate of 1.05% and a word error rate of 4.69%, and contains approximately six million lemmatized and part-of-speech tagged tokens.

Key Contribution

A new OCR pipeline reaches a 1.05% character error rate and a 4.69% word error rate on noisy polytonic Greek, opening up a historical corpus of about six million tokens for NLP research and LLM training.

Abstract

We present the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek. The collection covers the remaining undigitized volumes of the Patrologia Graeca (PG), printed in complex bilingual (Greek-Latin) layouts and characterized by highly degraded polytonic Greek typography. Through a dedicated pipeline combining YOLO-based layout detection and CRNN-based text recognition, we achieve a character error rate (CER) of 1.05% and a word error rate (WER) of 4.69%, largely outperforming existing OCR systems for polytonic Greek. The resulting corpus contains around six million lemmatized and part-of-speech tagged tokens, aligned with full OCR and layout annotations. Beyond its philological value, this corpus establishes a new benchmark for OCR on noisy polytonic Greek and provides training material for future models, including LLMs.
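The abstract reports CER and WER without stating how they were computed. As a minimal sketch, assuming the standard definitions (Levenshtein edit distance divided by reference length, counted over characters for CER and over whitespace-separated words for WER) and assuming NFC Unicode normalization so that precomposed and decomposed polytonic forms compare equal, the two metrics could be computed as follows. The function names and the sample strings are illustrative and do not come from the paper.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so each accented Greek letter is one code point where possible."""
    return unicodedata.normalize("NFC", text)


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character-level edits divided by reference length."""
    ref, hyp = nfc(ref), nfc(hyp)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word-level edits divided by number of reference words."""
    ref_words, hyp_words = nfc(ref).split(), nfc(hyp).split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Illustrative ground truth and OCR output: one diacritic differs in the second word.
ref = "ἐν ἀρχῇ ἦν ὁ λόγος"
hyp = "ἐν ἀρχῆ ἦν ὁ λόγος"

print(f"CER = {cer(ref, hyp):.4f}")
print(f"WER = {wer(ref, hyp):.4f}")
```

In this example a single wrong diacritic counts as one character error but makes a whole word wrong, which illustrates why a WER several times higher than the CER, as in the reported 4.69% versus 1.05%, is expected for polytonic text.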

Computer Vision, Data Curation & Synthetic Data, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
