This paper addresses the challenge of structure-aware text recognition in complex historical documents, specifically Ancient Greek critical editions, which present difficulties due to dense reference hierarchies and marginal annotations. The authors introduce a large-scale synthetic dataset of 185,000 page images and a curated benchmark of real scanned editions to evaluate the performance of state-of-the-art Visual Language Models (VLMs). Experiments reveal limitations in existing VLMs, but fine-tuning Qwen3VL-8B achieves a state-of-the-art 1.0% median Character Error Rate on real scans, demonstrating the potential for VLMs in this domain.
VLMs still struggle to decipher the intricate layouts of historical scholarly texts, but Qwen3VL-8B shows promise with a 1.0% median Character Error Rate on real scans of Ancient Greek critical editions.
Recent advances in visual language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which have dense reference hierarchies and extensive marginal annotations. We introduce two novel resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practices. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when confronted with highly structured historical documents. In zero-shot settings, most models significantly underperform compared to established off-the-shelf software. Nevertheless, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.
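For readers unfamiliar with the headline metric, the following is a minimal sketch of how a median Character Error Rate could be computed over a set of pages. It assumes the common definition (Levenshtein edit distance divided by reference length) and NFC Unicode normalization before comparison, which matters for polytonic Greek; the abstract does not state the exact normalization or scoring procedure used, so these choices and the function names `levenshtein`, `cer`, and `median_cer` are illustrative assumptions only.

```python
import unicodedata
from statistics import median


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    # Keep the shorter string in the inner loop to reduce memory use.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance / reference length.

    Assumption: both strings are NFC-normalized first, so that precomposed
    and decomposed accented Greek characters compare as equal.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def median_cer(pairs) -> float:
    """Median of per-page CER values over (reference, hypothesis) pairs."""
    return median(cer(ref, hyp) for ref, hyp in pairs)
```

Reporting the median rather than the mean makes the score robust to a handful of pages on which recognition fails badly, which is one plausible reason to prefer it for heterogeneous scans.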