ETHMay 21, 2026arXiv:2605.22821

Tokenisation via Convex Relaxations

Jan Tempus, Philip Whittington, Craig W. Schmidt, Dennis Komm, Tiago Pimentel

AI Summary

The paper introduces ConvexTok, a novel tokenization algorithm that formulates tokeniser construction as a linear program solved via convex optimization, moving beyond greedy approaches like BPE and Unigram. ConvexTok consistently improves intrinsic tokenization metrics and bits-per-byte compression in language models. Importantly, the method allows for certification of sub-optimality, empirically achieving tokenizers within 1% of optimal for common vocabulary sizes.

Key Contribution

Escape the greedy trap: Convex optimization yields tokenizers that compress better and come with optimality guarantees.

Abstract

Tokenisation is an integral part of the current NLP pipeline. Current tokenisation algorithms such as BPE and Unigram are greedy algorithms -- they make locally optimal decisions without considering the resulting vocabulary as a whole. We instead formulate tokeniser construction as a linear program and solve it using convex optimisation tools, yielding a new algorithm we call ConvexTok. We find ConvexTok consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models; it also improves downstream task performance, but less consistently. Furthermore, ConvexTok allows the user to certify how far their tokeniser is from optimal, with respect to a certain objective, via a lower bound, and we empirically find it to be within 1\% of optimal at common vocabulary sizes.

Natural Language Processing Training Efficiency & Optimization

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Tokenisation via Convex Relaxations

Related Papers