Korea U | Apr 9, 2026 | arXiv:2604.08115

REVISE: A Framework for Revising OCRed text in Practical Information Systems with Data Contamination Strategy

Gyuho Shim, Seongtae Hong, Heu-Jeoung Lim, Heuiseok Lim

AI Summary

The paper introduces REVISE, a framework for correcting OCR errors in documents using LLMs, addressing the limitation of existing Document AI systems in structurally organizing document information. REVISE employs a hierarchical taxonomy of common OCR errors and a synthetic data generation strategy to train a correction model. Experiments show that REVISE effectively corrects OCR outputs, leading to improved performance in downstream tasks like document retrieval and question answering.

Key Contribution

Synthetically corrupting clean text according to a taxonomy of OCR errors yields training data for an LLM that corrects real-world OCR mistakes, which in turn improves downstream document retrieval and question answering.

Abstract

Recent advances in Large Language Models (LLMs) have significantly improved the field of Document AI, demonstrating remarkable performance on document understanding tasks such as question answering. However, existing approaches primarily focus on solving specific tasks, lacking the capability to structurally organize and manage document information. To address this limitation, we propose Revise, a framework that systematically corrects errors introduced by OCR at the character, word, and structural levels. Specifically, Revise employs a comprehensive hierarchical taxonomy of common OCR errors and a synthetic data generation strategy that realistically simulates such errors to train an effective correction model. Experimental results demonstrate that Revise effectively corrects OCR outputs, enabling more structured representation and systematic management of document contents. Consequently, our method significantly enhances downstream performance in document retrieval and question answering tasks, highlighting the potential to overcome the structural management limitations of existing Document AI frameworks.
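To make the data contamination idea in the abstract concrete, here is a minimal Python sketch that corrupts clean text at the character, word, and structural levels and pairs the noisy result with the original as a training example. It is an illustration under stated assumptions, not the authors' implementation: the confusion table, the specific corruption rules, the `rate` parameter, and every function name below are hypothetical, since the abstract does not list the taxonomy's actual error types.

```python
import random

# Hypothetical visually-similar substitutions; the paper's real taxonomy
# is not given in the abstract, so these pairs are assumptions.
CHAR_CONFUSIONS = {"l": "1", "O": "0", "m": "rn", "S": "5"}


def corrupt_chars(text, rate, rng):
    """Character level: replace characters with look-alike strings."""
    out = []
    for ch in text:
        if ch in CHAR_CONFUSIONS and rng.random() < rate:
            out.append(CHAR_CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def corrupt_words(text, rate, rng):
    """Word level: split a word in two or merge it with the previous word."""
    out = []
    for word in text.split(" "):
        if len(word) > 3 and rng.random() < rate:
            cut = rng.randrange(1, len(word))
            out.append(word[:cut] + " " + word[cut:])  # spurious split
        elif out and rng.random() < rate:
            out[-1] = out[-1] + word  # lost space, words merged
        else:
            out.append(word)
    return " ".join(out)


def corrupt_structure(lines, rate, rng):
    """Structural level: swap adjacent lines to mimic reading-order errors."""
    lines = list(lines)
    i = 0
    while i < len(lines) - 1:
        if rng.random() < rate:
            lines[i], lines[i + 1] = lines[i + 1], lines[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return lines


def make_pair(clean, rate=0.1, seed=0):
    """Build one (noisy input, clean target) example for a correction model."""
    rng = random.Random(seed)
    lines = corrupt_structure(clean.split("\n"), rate, rng)
    lines = [corrupt_words(corrupt_chars(line, rate, rng), rate, rng) for line in lines]
    return {"input": "\n".join(lines), "target": clean}
```

Calling `make_pair(document_text, rate=0.1)` on each clean document would produce input/target pairs on which a correction model could be fine-tuned. Fixing the seed keeps the corruption reproducible, and setting `rate=0` returns the text unchanged, which is a convenient sanity check.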

Data Curation & Synthetic Data | Natural Language Processing

Citation Metrics

Citations: 3

Influential citations: 0

References: 57

Year: 2026

Venue: Annual Meeting of the Association for Computational Linguistics
