This paper introduces Parallel-Token Prediction (PTP), a novel plug-in method for vision-language models (VLMs) that enables parallel decoding of multiple tokens for document parsing. PTP inserts learnable tokens into the input sequence and trains the model with specific objectives to achieve parallel generation. Experiments on OmniDocBench and olmOCR-bench show that PTP improves decoding speed by 1.6x-2.2x, reduces hallucinations, and generalizes well.
Document parsing just got a whole lot faster: a simple plug-in method boosts VLM decoding speed by up to 2.2x while also reducing hallucinations.
Document parsing, a fundamental vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a pluggable, model-agnostic, and simple yet effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.
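The decoding idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the placeholder name `<ptp>`, the functions `build_input`, `toy_model`, and `decode`, and the assumption that every placeholder slot is predicted correctly are all hypothetical, chosen only to show how appending learnable tokens lets one forward pass emit several tokens. The "model" here simply reads from a known target sequence.

```python
# Toy illustration only: all names below are hypothetical, not from the paper.
MASK = "<ptp>"  # stand-in for one learnable placeholder token


def build_input(prefix, num_slots):
    """Append num_slots placeholder tokens to the already-decoded prefix."""
    return list(prefix) + [MASK] * num_slots


def toy_model(inputs, target):
    """Stand-in for one forward pass.

    Returns the next token (as ordinary AR decoding would) plus one extra
    token per placeholder slot. This toy assumes every slot prediction is
    correct by reading directly from `target`.
    """
    num_slots = inputs.count(MASK)
    start = len(inputs) - num_slots  # number of tokens decoded so far
    return target[start:start + 1 + num_slots]


def decode(target, num_slots):
    """Decode `target` and count how many forward passes were needed."""
    out, passes = [], 0
    while len(out) < len(target):
        out.extend(toy_model(build_input(out, num_slots), target))
        passes += 1
    return out, passes


tokens = list("abcdefghij")          # 10 tokens to produce
_, ar_passes = decode(tokens, 0)     # plain autoregressive: 1 token per pass
_, ptp_passes = decode(tokens, 2)    # 2 placeholder slots: up to 3 per pass
print(ar_passes, ptp_passes)
```

Because the toy accepts every slot prediction, its pass count is a best-case illustration of the mechanism and should not be read as reproducing the measured 1.6x-2.2x speedup.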