GLM-OCR is a 0.9B-parameter multimodal model for document understanding that combines a CogViT visual encoder with a GLM language decoder. To make decoding more efficient for OCR, it uses a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step. Evaluations on public benchmarks and industrial scenarios show competitive or state-of-the-art performance in document parsing, transcription, table structure recovery, and key information extraction, making the model suitable for both edge and large-scale deployment.
A compact 0.9B multimodal model, GLM-OCR, delivers competitive to state-of-the-art document understanding and speeds up decoding by predicting multiple tokens at once, without a large memory cost.
GLM-OCR is a compact, efficient 0.9B-parameter multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
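To make the throughput claim concrete, here is a minimal sketch of MTP-style decoding, assuming K lightweight output heads over a single shared decoder trunk and greedy acceptance of all K drafted tokens per step. The toy GRU trunk, module names, and sizes are illustrative stand-ins, not GLM-OCR's actual implementation, and production MTP decoders typically verify drafted tokens before accepting them rather than accepting all K unconditionally.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, K = 1000, 64, 4  # K tokens emitted per decoding step (assumed)

class TinyTrunk(nn.Module):
    # Toy stand-in for the shared language-decoder trunk.
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)

    def forward(self, ids):
        h, _ = self.rnn(self.embed(ids))
        return h[:, -1]  # hidden state at the last position

class MTPHeads(nn.Module):
    # K light output heads over one shared hidden state: head i predicts
    # the token i+1 positions ahead. Because the trunk is shared, the
    # extra memory cost is only K linear layers.
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(HIDDEN, VOCAB) for _ in range(K))

    def forward(self, h):
        return torch.stack([head(h) for head in self.heads], dim=1)  # (B, K, V)

@torch.no_grad()
def mtp_greedy_decode(trunk, heads, ids, max_new=32, eos=2):
    """Each trunk pass emits up to K tokens, so a deterministic OCR
    transcript needs roughly 1/K as many forward passes as
    one-token-per-step autoregressive decoding."""
    generated = 0
    while generated < max_new:
        logits = heads(trunk(ids))       # (B, K, VOCAB)
        new = logits.argmax(dim=-1)      # greedy pick from each head
        ids = torch.cat([ids, new], dim=1)
        generated += K
        if (new == eos).any():           # stop once any head emits EOS
            break
    return ids

prompt = torch.randint(0, VOCAB, (1, 8))   # stand-in for the visual prefix
out = mtp_greedy_decode(TinyTrunk(), MTPHeads(), prompt)
print(out.shape)                           # prompt plus up to 32 new tokens
```

The appeal for OCR specifically is that transcription is largely deterministic, so the drafted tokens are accepted at a high rate and the K-fold reduction in trunk passes translates almost directly into decoding throughput.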
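The two-stage system pipeline can likewise be pictured as one layout pass followed by independently recognizable regions. The sketch below uses hypothetical `detect_layout` and `recognize_region` wrappers as stand-ins for PP-DocLayout-V3 and GLM-OCR calls; neither name nor the fixed dummy regions come from the paper, which only states that layout analysis precedes parallel region-level recognition.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Region:
    kind: str                        # e.g. "text", "table", "formula"
    box: tuple[int, int, int, int]   # (x0, y0, x1, y1) crop in page pixels

def detect_layout(page_image):
    # Stage 1, hypothetical wrapper: layout analysis (PP-DocLayout-V3 in
    # the paper). Returns fixed dummy regions here so the sketch runs.
    return [Region("text", (0, 0, 100, 40)), Region("table", (0, 50, 100, 90))]

def recognize_region(page_image, region):
    # Stage 2, hypothetical wrapper: one recognition call (GLM-OCR in the
    # paper) on a single cropped region.
    return f"<{region.kind} transcript for box {region.box}>"

def parse_document(page_image, workers=8):
    regions = detect_layout(page_image)
    # After layout analysis the regions are independent, so recognition
    # parallelizes freely; reading order is fixed by stage 1.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: recognize_region(page_image, r), regions))

texts = parse_document(page_image=None)  # a real call would pass an image array
print(texts)
```

Splitting the page this way is what lets a small region-level model serve long documents efficiently: each recognition call sees a short crop rather than the full page, and the calls batch or parallelize across regions.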