GLM-OCR is a 0.9B-parameter multimodal model for document understanding that combines a CogViT visual encoder with a GLM language decoder. To make decoding more efficient for OCR, it uses a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step. Evaluations on public benchmarks and industrial scenarios show competitive or state-of-the-art performance in document parsing, transcription, table structure recovery, and key information extraction, making the model suitable for both edge and large-scale deployment.
A compact 0.9B multimodal model, GLM-OCR, delivers competitive to state-of-the-art document understanding and speeds up decoding by predicting multiple tokens at once, without a large memory cost.
GLM-OCR is a compact, efficient 0.9B-parameter multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
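To make the throughput claim concrete, here is a minimal sketch of MTP-style decoding, assuming K lightweight output heads over a single shared decoder trunk and greedy acceptance of all K drafted tokens per step. The toy GRU trunk, module names, and sizes are illustrative stand-ins, not GLM-OCR's actual implementation, and production MTP decoders typically verify drafted tokens before accepting them rather than accepting all K unconditionally.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, K = 1000, 64, 4  # K tokens emitted per decoding step (assumed)

class TinyTrunk(nn.Module):
    # Toy stand-in for the shared language-decoder trunk.
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)

    def forward(self, ids):
        h, _ = self.rnn(self.embed(ids))
        return h[:, -1]  # hidden state at the last position

class MTPHeads(nn.Module):
    # K light output heads over one shared hidden state: head i predicts
    # the token i+1 positions ahead. Because the trunk is shared, the
    # extra memory cost is only K linear layers.
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(HIDDEN, VOCAB) for _ in range(K))

    def forward(self, h):
        return torch.stack([head(h) for head in self.heads], dim=1)  # (B, K, V)

@torch.no_grad()
def mtp_greedy_decode(trunk, heads, ids, max_new=32, eos=2):
    """Each trunk pass emits up to K tokens, so a deterministic OCR
    transcript needs roughly 1/K as many forward passes as
    one-token-per-step autoregressive decoding."""
    generated = 0
    while generated < max_new:
        logits = heads(trunk(ids))       # (B, K, VOCAB)
        new = logits.argmax(dim=-1)      # greedy pick from each head
        ids = torch.cat([ids, new], dim=1)
        generated += K
        if (new == eos).any():           # stop once any head emits EOS
            break
    return ids

prompt = torch.randint(0, VOCAB, (1, 8))   # stand-in for the visual prefix
out = mtp_greedy_decode(TinyTrunk(), MTPHeads(), prompt)
print(out.shape)                           # prompt plus up to 32 new tokens
```

The appeal for OCR specifically is that transcription is largely deterministic, so the drafted tokens are accepted at a high rate and the K-fold reduction in trunk passes translates almost directly into decoding throughput.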
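The two-stage system pipeline can likewise be pictured as one layout pass followed by independently recognizable regions. The sketch below uses hypothetical `detect_layout` and `recognize_region` wrappers as stand-ins for PP-DocLayout-V3 and GLM-OCR calls; neither name nor the fixed dummy regions come from the paper, which only states that layout analysis precedes parallel region-level recognition.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Region:
    kind: str                        # e.g. "text", "table", "formula"
    box: tuple[int, int, int, int]   # (x0, y0, x1, y1) crop in page pixels

def detect_layout(page_image):
    # Stage 1, hypothetical wrapper: layout analysis (PP-DocLayout-V3 in
    # the paper). Returns fixed dummy regions here so the sketch runs.
    return [Region("text", (0, 0, 100, 40)), Region("table", (0, 50, 100, 90))]

def recognize_region(page_image, region):
    # Stage 2, hypothetical wrapper: one recognition call (GLM-OCR in the
    # paper) on a single cropped region.
    return f"<{region.kind} transcript for box {region.box}>"

def parse_document(page_image, workers=8):
    regions = detect_layout(page_image)
    # After layout analysis the regions are independent, so recognition
    # parallelizes freely; reading order is fixed by stage 1.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: recognize_region(page_image, r), regions))

texts = parse_document(page_image=None)  # a real call would pass an image array
print(texts)
```

Splitting the page this way is what lets a small region-level model serve long documents efficiently: each recognition call sees a short crop rather than the full page, and the calls batch or parallelize across regions.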