This paper introduces a decoupled OCR framework that pairs a domain-agnostic character detector with a domain-specific language model corrector, enabling efficient domain adaptation. The correctors, based on T5, ByT5, and BART, are trained solely on synthetic noise, so no labeled target images are needed. Experiments on modern clean handwriting, cursive script, and historical documents show that the approach reaches near-SOTA accuracy with approximately 95% less compute than end-to-end transformers. T5-Base performs best on modern text, while ByT5-Base performs best on historical documents because it reconstructs archaic spellings at the byte level.
Achieve near state-of-the-art OCR accuracy with approximately 95% less compute by decoupling character detection from language correction and training the language model corrector on synthetic noise alone.
Optical character recognition remains critical infrastructure for document digitization, yet state-of-the-art performance is often restricted to well-resourced institutions by prohibitive computational barriers. End-to-end transformer architectures achieve strong accuracy but demand hundreds of GPU hours for domain adaptation, limiting accessibility for practitioners and digital humanities scholars. We present a modular detection-and-correction framework that achieves near-SOTA accuracy with single-GPU training. Our approach decouples lightweight, domain-agnostic visual character detection from domain-specific linguistic correction using pretrained sequence models including T5, ByT5, and BART. By training the correctors entirely on synthetic noise, we enable annotation-free domain adaptation without requiring labeled target images. Evaluating across modern clean handwriting, cursive script, and historical documents, we identify a critical "Pareto frontier" in architecture selection: T5-Base excels on modern text with standard vocabulary, whereas ByT5-Base dominates on historical documents by reconstructing archaic spellings at the byte level. Our results demonstrate that this decoupled paradigm achieves accuracy close to that of end-to-end transformers while reducing compute by approximately 95%, establishing a viable, resource-efficient alternative to monolithic OCR architectures.
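The abstract does not specify how the synthetic noise is generated, so the following is only a minimal sketch of one plausible way to build (noisy, clean) training pairs for a sequence-to-sequence corrector from unlabeled clean text. The confusion table, the three edit operations, the `rate` parameter, and the names `add_noise` and `make_pairs` are all illustrative assumptions, not the authors' method.

```python
import random

# Hypothetical table of visually confusable characters (an assumption for
# illustration; the paper's actual noise model is not given in the abstract).
CONFUSIONS = {
    "l": "1I", "1": "lI", "I": "l1", "o": "0", "0": "oO", "O": "0",
    "e": "c", "c": "e", "u": "v", "v": "u", "s": "5", "5": "s",
    "n": "h", "h": "n",
}
LOWERCASE = "abcdefghijklmnopqrstuvwxyz"


def add_noise(text, rate=0.1, rng=None):
    """Corrupt `text` with character-level substitutions, deletions, insertions.

    Each character is edited with probability `rate`; otherwise it is copied.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for ch in text:
        if rng.random() >= rate:
            out.append(ch)  # keep the character unchanged
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            # Characters without a table entry are left as they are.
            out.append(rng.choice(CONFUSIONS.get(ch, ch)))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(LOWERCASE))  # spurious extra character
        # "delete": emit nothing for this character
    return "".join(out)


def make_pairs(clean_lines, rate=0.1, seed=0):
    """Return (noisy_input, clean_target) pairs for corrector training."""
    rng = random.Random(seed)
    return [(add_noise(line, rate, rng), line) for line in clean_lines]


if __name__ == "__main__":
    for noisy, clean in make_pairs(["ye olde booke of hours"], rate=0.2, seed=7):
        print(repr(noisy), "->", repr(clean))
```

Under these assumptions, adapting to a new domain would only require clean in-domain text: the noisy side of each pair is generated, and the clean side serves as the target, which is the sense in which no labeled target images are needed.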