This survey paper reviews the evolution of Optical Character Recognition (OCR), focusing on the shift from handcrafted feature-based methods to deep learning approaches, particularly those leveraging Transformer architectures. It highlights the advantages of Transformer-based models like TrOCR in achieving state-of-the-art performance in multilingual and domain-specific OCR tasks due to their self-attention mechanisms and pre-training capabilities. The survey also identifies remaining challenges such as adversarial robustness, complex layout understanding, and fairness, and discusses emerging research directions including few-shot learning and privacy-preserving inference.
Optical Character Recognition (OCR) transforms visual text into machine-readable form, supporting the large-scale digitization of printed, handwritten, and scene-based documents. Early approaches, such as template matching and moment-based feature analysis, relied on handcrafted patterns and were constrained to limited fonts and simple layouts. The introduction of statistical models, including Hidden Markov Models and Conditional Random Fields, expanded OCR capabilities through probabilistic sequence modeling. With the rise of deep learning, Convolutional and Recurrent Neural Networks enabled end-to-end recognition, reducing dependence on manual feature engineering and improving performance on noisy or cursive text. More recently, transformer-based models such as TrOCR have redefined OCR by leveraging self-attention and large-scale pretraining, achieving state-of-the-art results across multilingual and domain-specific applications. These models excel at cross-lingual transfer, low-resource adaptation, and specialized domains such as biomedical and historical text recognition, while integrating pretrained vision–language components for greater robustness to degraded inputs. Despite these advances, challenges persist in adversarial robustness, complex document layout understanding, and fairness across underrepresented languages and scripts. Emerging research directions include zero-shot and few-shot learning, modular adapters for scalable multilingual OCR, post-OCR correction pipelines, efficiency improvements, and privacy-preserving inference. This survey outlines OCR’s historical progression, highlights deep learning and transformer-based breakthroughs, and points to future work needed to address enduring challenges in this critical field of document analysis.
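To make the self-attention mechanism concrete, the following is a minimal NumPy sketch of single-head scaled dot-product attention over a sequence of image-patch embeddings, the operation at the core of transformer OCR encoders such as TrOCR's. All shapes, weights, and the random inputs are illustrative assumptions, not values from any actual model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) -- e.g. patch embeddings from an image encoder.
    Wq, Wk, Wv: (d_model, d_k) projection matrices (illustrative, untrained).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise patch affinities
    weights = softmax(scores, axis=-1)   # each row is a distribution over patches
    return weights @ V                   # context-mixed representations

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 image patches, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one updated vector per patch
```

Because every patch attends to every other patch, the encoder can relate distant strokes or characters in one step, which is what gives transformer OCR models their robustness on cursive and degraded text relative to purely local convolutional features.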