Feb 18, 2026
arXiv:2602.16872

DODO: Discrete OCR Diffusion Models

Sean Man, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor, Niv Nayman

AI Summary

The paper introduces DODO, a Vision-Language Model (VLM) for OCR that uses block discrete diffusion to speed up inference. It addresses a limitation of existing masked diffusion models, which suffer from structural instabilities when applied to the deterministic task of OCR. DODO decomposes the generation process into blocks, which mitigates synchronization errors and enables parallel decoding. The result is up to 3x faster inference than autoregressive models while maintaining near state-of-the-art accuracy.

Key Contribution

Ditch slow, token-by-token OCR: DODO unlocks 3x faster inference by reframing OCR as a parallelizable diffusion process.

Abstract

Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents because it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.
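
To make the contrast between token-by-token decoding and block-wise parallel decoding concrete, here is a minimal toy sketch. It is not the authors' implementation: `toy_denoiser` is a hypothetical oracle standing in for a model forward pass, and `block_size`, `tokens_per_step`, and the confidence rule are illustrative assumptions. The sketch only counts forward passes; it says nothing about accuracy or wall-clock speed.

```python
MASK = None  # placeholder for a position that has not been decoded yet


def toy_denoiser(seq, target):
    """Hypothetical stand-in for one model forward pass (an assumption).

    Predicts every masked position at once. As an oracle, it returns the
    correct token, with a made-up confidence that favors earlier positions.
    """
    return {i: (target[i], 1.0 / (1 + i))
            for i, tok in enumerate(seq) if tok is MASK}


def autoregressive_decode(target):
    """One forward pass per generated token, strictly left to right."""
    seq, passes = [], 0
    for tok in target:
        passes += 1          # each new token costs one sequential pass
        seq.append(tok)
    return seq, passes


def block_diffusion_decode(target, block_size, tokens_per_step):
    """Decode blocks left to right; unmask several tokens per pass in a block."""
    seq = [MASK] * len(target)
    passes = 0
    for start in range(0, len(target), block_size):
        block = range(start, min(start + block_size, len(target)))
        while any(seq[i] is MASK for i in block):
            preds = toy_denoiser(seq, target)  # one parallel forward pass
            passes += 1
            masked = [i for i in block if seq[i] is MASK]
            # Commit the most confident predictions inside the current block.
            masked.sort(key=lambda i: preds[i][1], reverse=True)
            for i in masked[:tokens_per_step]:
                seq[i] = preds[i][0]
    return seq, passes


text = list("DODO: Discrete OCR Diffusion Models")
ar_out, ar_passes = autoregressive_decode(text)
bd_out, bd_passes = block_diffusion_decode(text, block_size=8, tokens_per_step=4)
print("".join(bd_out), ar_passes, bd_passes)
```

In this toy setting the block decoder needs fewer forward passes because each pass commits several tokens. A real model's predictions can be wrong, and each pass may cost more than an autoregressive step, so the ratio of pass counts here should not be read as the paper's reported speedup.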

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
