Mar 3, 2026 | arXiv:2603.02805

ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink

AI Summary

The paper introduces ScribeTokens, a tokenization scheme for digital ink that decomposes pen movement into unit pixel steps. Together with two pen-state tokens, these steps form a fixed 10-token base vocabulary, which avoids the long sequences and training instability of continuous vector representations as well as the large vocabularies and out-of-vocabulary issues of existing token representations. In the reported experiments, ScribeTokens outperforms vector representations on handwritten text generation (17.33% vs. 70.29% CER) and on recognition. Combined with next-ink-token prediction pretraining, it achieves the best recognition results among the compared representations (8.27% CER on IAM, 9.83% on DeepWriting).
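To make the idea of unit pixel steps concrete, here is a minimal sketch of one way such a tokenizer could work. It is not the authors' implementation. The token names, the choice of the eight neighbouring-pixel directions as the movement tokens, the Bresenham-style line walk, the screen-style y axis (y grows downward), and the handling of pen-up travel between strokes are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[int, int]

# Assumed vocabulary: eight unit steps (y grows downward) plus two pen states.
DIRECTIONS: Dict[Point, str] = {
    (1, 0): "E", (1, 1): "SE", (0, 1): "S", (-1, 1): "SW",
    (-1, 0): "W", (-1, -1): "NW", (0, -1): "N", (1, -1): "NE",
}
PEN_DOWN, PEN_UP = "DOWN", "UP"
VOCAB = list(DIRECTIONS.values()) + [PEN_DOWN, PEN_UP]
STEP_OF = {name: delta for delta, name in DIRECTIONS.items()}


def unit_steps(start: Point, end: Point) -> List[str]:
    """Walk from start to end one pixel at a time (Bresenham-style)."""
    x, y = start
    x1, y1 = end
    dx, dy = abs(x1 - x), abs(y1 - y)
    sx = 1 if x1 > x else -1
    sy = 1 if y1 > y else -1
    err = dx - dy
    tokens: List[str] = []
    while (x, y) != (x1, y1):
        step_x = step_y = 0
        e2 = 2 * err
        if e2 > -dy:
            err -= dy
            step_x = sx
        if e2 < dx:
            err += dx
            step_y = sy
        x, y = x + step_x, y + step_y
        tokens.append(DIRECTIONS[(step_x, step_y)])
    return tokens


def tokenize(strokes: Sequence[Sequence[Point]]) -> List[str]:
    """Encode integer-pixel strokes relative to the first point of the ink."""
    strokes = [s for s in strokes if s]  # ignore empty strokes
    if not strokes:
        return []
    tokens: List[str] = []
    pos = strokes[0][0]
    for stroke in strokes:
        tokens += unit_steps(pos, stroke[0])  # pen-up travel to the stroke
        tokens.append(PEN_DOWN)
        pos = stroke[0]
        for point in stroke[1:]:
            tokens += unit_steps(pos, point)
            pos = point
        tokens.append(PEN_UP)
    return tokens


def detokenize(tokens: Sequence[str], origin: Point = (0, 0)) -> List[List[Point]]:
    """Rebuild pixel-level strokes from a token sequence."""
    x, y = origin
    strokes: List[List[Point]] = []
    current = None
    for tok in tokens:
        if tok == PEN_DOWN:
            current = [(x, y)]
        elif tok == PEN_UP:
            if current is not None:
                strokes.append(current)
            current = None
        else:
            dx, dy = STEP_OF[tok]
            x, y = x + dx, y + dy
            if current is not None:
                current.append((x, y))
    return strokes
```

Under these assumptions every ink sample maps onto the same ten symbols regardless of its size or position, so there is no out-of-vocabulary case at the base level. Repeated runs such as several consecutive `E` tokens are the kind of pattern a BPE merge step could then compress.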

Key Contribution

According to the paper, a fixed 10-token base vocabulary of unit pixel movements and pen states represents and generates digital ink more effectively than continuous vectors and, with next-ink-token pretraining, gives the best handwritten text recognition results among the representations compared.

Abstract

Digital ink -- the coordinate stream captured from stylus or touch input -- lacks a unified representation. Continuous vector representations produce long sequences and suffer from training instability, while existing token representations require large vocabularies, face out-of-vocabulary issues, and underperform vectors on recognition. We propose ScribeTokens, a tokenization that decomposes pen movement into unit pixel steps. Together with two pen-state tokens, this fixed 10-token base vocabulary suffices to represent any digital ink and enables aggressive BPE compression. On handwritten text generation, ScribeTokens dramatically outperforms vectors (17.33% vs. 70.29% CER), showing tokens are far more effective for generation. On recognition, ScribeTokens is the only token representation to outperform vectors without pretraining. We further introduce next-ink-token prediction as a self-supervised pretraining strategy, which consistently improves recognition across all token-based models and accelerates convergence by up to 83x. With pretraining, ScribeTokens achieves the best recognition results across all representations on both datasets (8.27% CER on IAM, 9.83% on DeepWriting).
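The results above are reported as character error rate (CER). As a reference point, the sketch below computes CER in its standard form: the Levenshtein edit distance between the predicted and reference text, divided by the reference length. The function names are illustrative, and the paper's exact evaluation protocol (text normalization, averaging across samples) is not stated here, so this should be read as the common definition rather than the authors' evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp against a non-empty reference."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("hello", "hallo")` is 0.2: one substitution over five reference characters. Because insertions count as errors, CER can exceed 100% when the prediction is much longer than the reference.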

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
