Apr 24, 2026

arXiv:2604.22880

TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

Chengye Wang, Ling Fu, Zexi Kuang, Yilun Zhao

AI Summary

The paper introduces TexOCR, a 2B-parameter model for page-level reconstruction of scientific PDFs into compilable LaTeX, trained with supervised fine-tuning followed by reinforcement learning with verifiable rewards. To support this task, the authors built TexOCR-Bench, a benchmark that evaluates transcription fidelity, structural faithfulness, and compilability, and TexOCR-Train, a large-scale training corpus. Experiments indicate that TexOCR outperforms existing systems by better preserving document invariants and compiling more reliably, with the largest gains coming from RL with verifiable rewards.

Key Contribution

Existing document OCR models often fail to preserve the structural and executable properties of LaTeX. TexOCR, trained with verifiable rewards, targets compilable page-to-LaTeX reconstruction and compiles more reliably than those systems.

Abstract

Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including consistent section structure, correct float placement, and valid label-reference links, which undermines compilation reliability and downstream usability. Our analysis further reveals that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.
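To make the idea of a verifiable reward concrete, the following is a minimal sketch of two static checks in the spirit of the "LaTeX unit tests" mentioned in the abstract: one for label-reference integrity and one for balanced environments. It is an illustration under stated assumptions, not the authors' reward. The function names (`dangling_refs`, `environments_balanced`, `toy_reward`), the set of reference commands, and the equal weighting are all hypothetical, and a real compilability check would need to run a TeX engine, which this sketch does not do. It also treats the page as self-contained, so a reference to a label defined on another page would be flagged.

```python
import re

# Assumed, non-exhaustive set of commands; the paper's actual tests are not specified here.
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
REF_RE = re.compile(r"\\(?:ref|eqref|autoref|cref|Cref)\{([^}]*)\}")
ENV_RE = re.compile(r"\\(begin|end)\{([^}]*)\}")


def dangling_refs(tex: str) -> set[str]:
    """Return reference keys that have no matching \\label in the same string."""
    labels = set(LABEL_RE.findall(tex))
    refs: set[str] = set()
    for group in REF_RE.findall(tex):
        # cleveref-style commands accept comma-separated keys.
        refs.update(key.strip() for key in group.split(",") if key.strip())
    return refs - labels


def environments_balanced(tex: str) -> bool:
    """Check that every \\begin{env} is closed by a matching \\end{env}, in order."""
    stack: list[str] = []
    for kind, name in ENV_RE.findall(tex):
        if kind == "begin":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


def toy_reward(tex: str) -> float:
    """Hypothetical reward: the fraction of static checks that pass (equal weights)."""
    checks = [not dangling_refs(tex), environments_balanced(tex)]
    return sum(checks) / len(checks)


if __name__ == "__main__":
    page = r"\begin{equation}\label{eq:a} x = 1 \end{equation} See \eqref{eq:a}."
    print(toy_reward(page))
```

Because both checks are deterministic functions of the model output, a reward of this kind can be computed automatically for every sampled page, which is what makes it "verifiable" in the RL sense.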

Code Generation & Program Synthesis, Computer Vision, Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 34

Year: 2026

Venue: N/A
