Mar 3, 2026

arXiv:2603.02803

Structure-Aware Text Recognition for Ancient Greek Critical Editions

Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot, Thibault Clérice

AI Summary

This paper addresses the challenge of structure-aware text recognition in complex historical documents, specifically Ancient Greek critical editions, which present difficulties due to dense reference hierarchies and marginal annotations. The authors introduce a large-scale synthetic dataset of 185,000 page images and a curated benchmark of real scanned editions to evaluate the performance of state-of-the-art Visual Language Models (VLMs). Experiments reveal limitations in existing VLMs, but fine-tuning Qwen3VL-8B achieves a state-of-the-art 1.0% median Character Error Rate on real scans, demonstrating the potential for VLMs in this domain.

Key Contribution

VLMs still struggle to interpret the intricate layouts of historical scholarly texts, but Qwen3VL-8B shows promise, reaching a median Character Error Rate of 1.0% on real scans of Ancient Greek critical editions.

Abstract

Recent advances in visual language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which have dense reference hierarchies and extensive marginal annotations. We introduce two novel resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practices. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when confronted with highly structured historical documents. In zero-shot settings, most models significantly underperform compared to established off-the-shelf software. Nevertheless, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.
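The abstract reports a median Character Error Rate (CER) but does not spell out how it is computed. The sketch below assumes the common definition (character-level Levenshtein distance divided by the reference length, then the median over pages) and is not taken from the paper. The function names `levenshtein`, `cer`, and `median_cer` are hypothetical, and the NFC Unicode normalization step is an added assumption that seems reasonable for polytonic Greek, where the same accented letter can be encoded in more than one way.

```python
import unicodedata
from statistics import median


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn hyp into ref (standard dynamic programme, two rows)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate of one page: edit distance / reference length.
    Assumption: both strings are NFC-normalized before comparison."""
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def median_cer(pairs) -> float:
    """Median of per-page CER over (reference, hypothesis) pairs."""
    return median(cer(ref, hyp) for ref, hyp in pairs)
```

Under this definition, a median CER of 1.0% means that on at least half of the evaluated pages the prediction differs from the reference by at most about one character in a hundred. A median is less sensitive than a mean to a few badly recognized pages, so the two figures can differ noticeably on the same benchmark.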

Computer Vision, Data Curation & Synthetic Data, Multimodal Models

