Search papers, labs, and topics across Lattice.
The paper introduces frequency-ordered tokenization, a preprocessing technique that reorders a BPE-derived vocabulary such that frequent tokens are assigned smaller integer identifiers, which are then encoded with variable-length integers before compression. This exploits Zipf's law to improve lossless compression ratios by making the input more amenable to compressors. Experiments on enwik8 and enwik9 show improvements of up to 7.08 percentage points for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd, and also demonstrates faster overall compression times for computationally expensive algorithms.
Reordering your BPE vocabulary by token frequency can dramatically boost lossless compression ratios and speed, beating out classical methods like Word Replacing Transform.
We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with Byte Pair Encoding (BPE), reorders the vocabulary so that frequent tokens receive small integer identifiers, and encodes the result with variable-length integers before passing it to any standard compressor. On enwik8 (100 MB Wikipedia), this yields improvements of 7.08 percentage points (pp) for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd (all including vocabulary overhead), outperforming the classical Word Replacing Transform. Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text. We further show that preprocessing accelerates compression for computationally expensive algorithms: the total wall-clock time including preprocessing is 3.1x faster than raw zstd-22 and 2.4x faster than raw LZMA, because the preprocessed input is substantially smaller. The method can be implemented in under 50 lines of code.