BytePiece: A Purer, Higher Compression Ratio Tokenizer

By 苏剑林 | September 07, 2023

Currently, the most popular Tokenizer in LLMs is likely Google's SentencePiece. It aligns with several of the ideal properties of a tokenizer, such as being language-agnostic and data-driven. Because it is implemented in C++, its tokenization speed is very fast, making it highly suitable for efficiency-critical scenarios. However, it also has some obvious drawbacks, such as slow training (for the BPE algorithm) and high memory consumption. Moreover, because it is written in C++, it is effectively a "black box" for most users, which makes it hard to study or to extend. In fact, training a tokenizer is equivalent to the traditional task of "new word discovery," and given my earlier work on Chinese word segmentation and the minimum entropy series, I have accumulated considerable experience with new word discovery. Consequently, I have long wanted to write my own tokenizer, and these past few days I finally found the time to complete an initial version. Following SentencePiece, I have named it "BytePiece."

Ideal Characteristics

Since we are rewriting a tokenizer, we must consider what an ideal tokenizer should look like to determine if the result meets expectations. In my view, a tokenizer should possess at least the following basic characteristics:

  1. Lossless Reconstruction: The tokenization result should be capable of being restored to the input without any loss.
  2. High Compression Ratio: For a given vocabulary size, the number of tokens for the same dataset should be as small as possible.
  3. Language Agnostic: Based on statistics, neither the training nor the tokenization process should introduce language-specific features.
  4. Data Driven: It should be possible to perform unsupervised training directly on raw corpora.
  5. Training Friendly: The training process should be completed within a reasonable timeframe and on reasonable hardware configurations.

Finally, there are some "bonus points," such as fast tokenization speed, readable code, and convenience for secondary extension. While these are desirable, I believe they are not strictly part of the "basic characteristics."

For me, the biggest complaints about SentencePiece are "lossless reconstruction" and "training friendliness." First, SentencePiece defaults to NFKC normalization, which leads to irreversible changes like converting full-width commas to half-width commas. Thus, by default, it doesn't even satisfy "lossless reconstruction." For a long time, it wasn't on my candidate list until I discovered that adding the parameter --normalization_rule_name=identity during training prevents any conversion. So SentencePiece can support lossless reconstruction, but it requires specific settings.
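To see the problem concretely, the normalization can be reproduced with Python's standard library alone (this illustrates NFKC itself and is independent of SentencePiece):

```python
import unicodedata

text = "你好，世界！"  # full-width comma and exclamation mark
nfkc = unicodedata.normalize("NFKC", text)
print(nfkc)          # 你好,世界! -- the punctuation silently becomes half-width
print(nfkc == text)  # False: the original string can no longer be recovered
```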

As for training, it can be quite frustrating. SentencePiece supports two main algorithms: BPE and Unigram. The Unigram training speed is acceptable, but its compression ratio is slightly lower. BPE has a higher compression ratio, but its training speed is an order of magnitude slower than Unigram! Moreover, regardless of whether BPE or Unigram is used, the training process is extremely memory-intensive. In short, training a SentencePiece model on a large corpus is not a pleasant experience.

Model Concept

The construction of a new tokenizer can be decomposed into three parts: 1) basic unit; 2) tokenization algorithm; 3) training algorithm. Once these three parts are determined, the rest is just a matter of programming technique. Below is an introduction to the thinking behind BytePiece regarding these issues.

Basic Unit

We know that Python 3's default string type is Unicode. If we use Unicode as the basic unit, we call it Char-based. Char-based is intuitive and convenient; a Chinese character is represented as a single character of length 1. However, the number of Chars in different languages is far too large. Even just covering individual characters requires a massive vocab_size, not to mention introducing multi-character words. Therefore, like mainstream tokenizers, BytePiece uses the Byte as the basic unit.

Returning to bytes makes many problems "clear as day." Since there are only 256 distinct single bytes, as long as the vocabulary contains these 256 single bytes, OOV (Out of Vocabulary) issues are eliminated. This is an obvious advantage. Furthermore, we know the average information entropy of a Chinese character is greater than that of an English letter. If we choose Char-based, although every Char appears to have length 1, their "internal" granularity differs, which biases the statistics. In contrast, the information entropy of each byte is much more uniform (for example, most Chinese characters correspond to 3 bytes in UTF-8 encoding, and the average information entropy of a Chinese character happens to be roughly 2 to 3 times that of an English letter, which corresponds to one byte). Statistics gathered over bytes are therefore less biased, making the model more "language agnostic."

In terms of being Byte-based, BytePiece is more thorough than SentencePiece. SentencePiece first processes with Char-based logic and only handles OOV using Byte-based logic. BytePiece, however, converts text to bytes via text.encode() at the very beginning before any subsequent operations. This makes it much purer.
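A quick illustration of the two granularities, using nothing but plain Python:

```python
text = "中文tokenizer"
print(len(text))                  # 11: the Char-based view sees 11 units of "length 1"
print(len(text.encode("utf-8")))  # 15: each Chinese character occupies 3 bytes in UTF-8
print(text.encode("utf-8"))       # b'\xe4\xb8\xad\xe6\x96\x87tokenizer'
```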

Tokenization Algorithm

There are only a few dictionary-based tokenization algorithms, such as maximum matching, shortest path, and maximum probability path. Readers interested in the history can refer to "A Chat on Chinese Automatic Segmentation and Semantic Recognition" by Matrix67. Like jieba and other Chinese segmentation tools, BytePiece chooses maximum probability path segmentation, also known as the "Unigram language model." This choice rests on three considerations: first, Unigram's maximum probability is just maximum likelihood, which is consistent with the training objective of LLMs; second, from a compression perspective, maximum probability corresponds to the shortest encoding length (i.e., minimum description length), which embodies maximizing the compression ratio and aligns with the belief that "compression is intelligence"; third, the optimal Unigram segmentation can be found with the Viterbi algorithm in linear time, which is the theoretically optimal complexity.

Of course, while there is a "one-gram (Unigram) model," there are naturally more complex "bigram models" and "trigram models." However, the increase in their complexity far outweighs the benefits they bring, so we generally do not consider these higher-order models.
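Returning to the Unigram case: to make the maximum probability path segmentation concrete, here is a minimal sketch of Viterbi decoding over bytes. The vocab dictionary of log-probabilities and the piece-length cap are illustrative stand-ins, not BytePiece's actual data structures:

```python
import math

def viterbi_tokenize(text: bytes, vocab: dict) -> list:
    """Maximum-probability-path (Unigram) segmentation over bytes.

    vocab maps byte strings to log-probabilities; it should contain all
    256 single bytes so that no input is ever out of vocabulary.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i] = best log-probability of text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # back[i] = start position of the last piece
    for i in range(1, n + 1):
        for j in range(max(0, i - 20), i):  # cap candidate piece length at 20 bytes
            piece = text[j:i]
            if piece in vocab and best[j] + vocab[piece] > best[i]:
                best[i] = best[j] + vocab[piece]
                back[i] = j
    pieces, i = [], n             # walk the back-pointers to recover the pieces
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Toy vocabulary: every single byte plus one multi-byte piece.
vocab = {bytes([b]): math.log(1e-4) for b in range(256)}
vocab[b"hello"] = math.log(1e-2)
print(viterbi_tokenize(b"hello!", vocab))  # [b'hello', b'!']
```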

Training Algorithm

The reason for discussing the tokenization algorithm before the training algorithm is that the training objective can only be fixed once the tokenization algorithm is fixed. As mentioned at the beginning, tokenizer training is essentially the old problem of "new word discovery." I have previously proposed several new word discovery algorithms, such as "Segmentation-based New Word Discovery", "Unsupervised Segmentation Based on Language Models", and "Better New Word Discovery Algorithms." In hindsight, the algorithm that fits Unigram tokenization best and has the most potential is "Unsupervised Segmentation Based on Language Models." BytePiece's training is implemented on top of it, referred to here as the Byte-based N-gram Language Model (BNLM).

Specifically, for Unigram tokenization, if a byte string $c_1, c_2, \dots, c_l$ of length $l$ has the optimal tokenization result $w_1, w_2, \dots, w_m$, then the product of probabilities $p(w_1)p(w_2)\dots p(w_m)$ should be the maximum over all possible segmentations. Let the lengths of $w_1, w_2, \dots, w_m$ be $l_1, l_2, \dots, l_m$. Then, by the chain rule of conditional probability:

\begin{equation}\prod_{i=1}^m p(w_i) = \prod_{i=1}^m \prod_{j=L_{i-1} + 1}^{L_{i-1} + l_i} p(c_j|c_{L_{i-1} + 1},\dots,c_{j-1})\end{equation}

Here $L_0 = 0$ and $L_i=l_1+l_2+\dots+l_i$. Considering only an $n$-gram model, for $j > L_{i-1} + n$ we approximate $p(c_j|c_{L_{i-1} + 1},\dots,c_{j-1})$ by $p(c_j|c_{j - n + 1},\dots,c_{j-1})$. In this way, Unigram tokenization is turned into a character (byte) labeling problem, and tokenizer training is turned into training an $n$-gram language model ($n=6$ is recommended), which can be done directly by unsupervised learning. For a more detailed introduction, please refer to the original article "Unsupervised Segmentation Based on Language Models."

(Note: $n=6$ only means that BytePiece's statistical information goes up to 6-gram, but it does not mean it can only generate pieces of length at most 6. Since we approximate conditional probabilities for $n$-grams where $n > 6$ using the 6-gram model, it can theoretically be of any order and generate pieces of arbitrary length.)
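Read this way, the (log-)probability of any candidate piece can be scored directly from the $n$-gram statistics, so no separate table of whole-piece probabilities is needed. Below is a minimal sketch of that reading, where cond_logp is a hypothetical lookup of conditional log-probabilities rather than BytePiece's internal storage:

```python
def piece_logp(piece: bytes, cond_logp, n: int = 6) -> float:
    """log p(w) for a piece under the byte n-gram model.

    The first byte is scored unconditionally; each later byte c_j is
    conditioned on the preceding bytes of the same piece, truncated to
    at most n-1 of them, i.e. p(c_j | c_{j-n+1}, ..., c_{j-1}).
    cond_logp(context, byte) is assumed to return that conditional
    log-probability from the trained n-gram statistics.
    """
    total = 0.0
    for j in range(len(piece)):
        context = piece[max(0, j - n + 1):j]  # at most n-1 preceding bytes
        total += cond_logp(context, piece[j:j + 1])
    return total
```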

Code Implementation

Once the principles were determined, the rest was tedious development work. Fortunately, I managed to write a set of usable code: https://github.com/bojone/bytepiece.
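For orientation, here is a rough usage sketch built around the two classes the repo exposes (Trainer and Tokenizer, introduced below); the constructor arguments and method names shown are my assumptions for illustration and should be checked against the repo's README rather than taken as the definitive API:

```python
from bytepiece import Trainer, Tokenizer  # class names from the repo; details below are assumed

def corpus():
    # Hypothetical training corpus: one document per line in a local text file.
    with open("wudao.txt", encoding="utf-8") as f:
        for line in f:
            yield line

# Assumed training call: a 6-gram model with a 50k vocabulary, multi-process.
trainer = Trainer(order=6, max_vocab_size=50000)
trainer.train(corpus(), workers=64)
trainer.save("bytepiece.model")

# Assumed tokenization call: pieces come back as bytes, not str.
tokenizer = Tokenizer("bytepiece.model")
print(tokenizer.tokenize("今天天气不错"))
```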

The code is simple: a single file with two main classes, Trainer and Tokenizer, corresponding to training and tokenization respectively. Tokenization uses pyahocorasick to build an Aho-Corasick (AC) automaton to speed things up a bit; it is usable, but still much slower than SentencePiece, since pure Python cannot compete with C++ on speed. Training is divided into four main steps: 1) $n$-gram counting; 2) $n$-gram pruning; 3) pre-segmentation; 4) pruning of the pre-segmentation results. Steps 1, 3, and 4 are computationally intensive and parallelizable, so corresponding multi-process implementations were written. With enough processes (I used 64, each basically at full utilization), training speed can rival that of SentencePiece's Unigram training.
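The AC automaton's only job during tokenization is to enumerate, in a single pass, every vocabulary piece occurring in the input, which the Viterbi step then scores. A minimal sketch with pyahocorasick (str keys and a toy vocabulary for simplicity; this is an illustration, not BytePiece's actual code):

```python
import ahocorasick

vocab = {"to": -5.0, "ken": -6.0, "token": -4.0, "izer": -5.5}  # piece -> log-prob

automaton = ahocorasick.Automaton()
for piece, logp in vocab.items():
    automaton.add_word(piece, (piece, logp))
automaton.make_automaton()

# One linear pass yields every matching piece together with its end index.
for end, (piece, logp) in automaton.iter("tokenizer"):
    start = end - len(piece) + 1
    print(start, end, piece, logp)
```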

I should specifically mention the result pruning. The most basic criteria for pruning are frequency and vocab_size, but this is not enough. Sometimes $p(w_1)p(w_2) > p(w_1 \circ w_2)$ (where $w_1 \circ w_2$ denotes the concatenation of two words) occurs while $w_1, w_2, w_1 \circ w_2$ all exist in the vocabulary. In such cases, the word $w_1 \circ w_2$ will never be segmented out, so keeping it in the vocabulary is a pure waste of space. Therefore, the pruning process also includes the exclusion of such results.
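A sketch of this extra pruning rule, assuming the vocabulary is simply a dict from byte pieces to log-probabilities (an illustrative data structure, not the actual one):

```python
def prune_redundant(logp: dict) -> dict:
    """Drop pieces w for which some split w = w1 + w2 into two existing
    pieces satisfies log p(w1) + log p(w2) > log p(w): the Viterbi path
    would always prefer the split, so w can never be segmented out."""
    kept = {}
    for w, lp in logp.items():
        redundant = any(
            w[:k] in logp and w[k:] in logp and logp[w[:k]] + logp[w[k:]] > lp
            for k in range(1, len(w))
        )
        if not redundant:
            kept[w] = lp
    return kept
```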

Performance Testing

Now for the part everyone loves—testing. First, let's do a small-scale test. We randomly sampled 100,000 entries from the previously open-sourced WuDao dataset as the training set (the exported file is about 330MB), and another 1,000 entries as the test set. We trained a vocabulary with vocab_size=50k. The comparison results are as follows:

| Method | Training Time $\downarrow$ | Max Memory Usage $\downarrow$ | Compression Ratio $\uparrow$ |
|---|---|---|---|
| SP-BPE | 55.3 min | 5.2 GB | 4.80 |
| SP-Unigram | 1.6 min | 2.5 GB | 4.73 |
| BytePiece | 6.5 min | 4.3 GB | 5.05 |

To clarify, SP-BPE and SP-Unigram refer to SentencePiece with model_type set to BPE and Unigram, respectively. The SentencePiece training commands were:

import sentencepiece as spm

spm.SentencePieceTrainer.train('--input=wudao.txt --model_prefix=wudao_m --vocab_size=50000 --model_type=bpe --train_extremely_large_corpus=true --normalization_rule_name=identity')
spm.SentencePieceTrainer.train('--input=wudao.txt --model_prefix=wudao_m2 --vocab_size=50000 --model_type=unigram --train_extremely_large_corpus=true --normalization_rule_name=identity')

The unit for the compression ratio is "bytes/token," i.e., the average number of bytes corresponding to each token. It is evident that BytePiece achieves the highest compression ratio with a relatively balanced training time and memory usage.
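For completeness, the metric itself is easy to reproduce: divide the UTF-8 byte length of the test texts by the number of tokens produced. A small sketch, where tokenize stands for whichever tokenizer's encoding function is being measured:

```python
def compression_ratio(texts, tokenize) -> float:
    """Average bytes per token over a collection of texts.

    tokenize: any callable mapping a str to a list of tokens, e.g. the
    encode method of a SentencePiece or BytePiece tokenizer.
    """
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_bytes / total_tokens
```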

Next, we conducted a larger-scale test. From a mixed Chinese-English corpus with a ratio of roughly 3:5, we extracted 100,000 samples to train a Tokenizer with vocab_size=100k. The texts in this corpus are quite long, so the exported file for these 100,000 entries was already 13GB. The test set consisted of two parts: one part was 1,000 entries sampled from the same corpus (homogeneous), and the other was the 1,000 entries from the WuDao dataset sampled earlier (heterogeneous). The results are as follows:

| Method | Training Time $\downarrow$ | Max Memory Usage $\downarrow$ | Comp. Ratio (Homog.) $\uparrow$ | Comp. Ratio (Hetero.) $\uparrow$ |
|---|---|---|---|---|
| SP-BPE | 19.21 hours | 97 GB | 4.52 | 4.46 |
| SP-Unigram | 2.02 hours | 384 GB | 4.51 | 4.48 |
| BytePiece | 2.24 hours | 51 GB | 5.39 | 4.51 |

Whether in terms of training time, memory, or compression ratio, it seems that the larger the training data, the more advantageous BytePiece becomes!

To Be Continued

Based on the current results, BytePiece has certain advantages in training, and its tokenization quality is also decent. However, because it is written in pure Python, its tokenization speed is only about 1/10 that of SentencePiece. This is one of the future directions for optimization, and I hope C/C++ experts will get involved to help improve BytePiece's tokenization speed.

(Note: Starting from version 0.2.0, the tokenization function has been accelerated with Cython. BytePiece's tokenization speed is now close to that of SentencePiece's BPE, and it can outperform it when the text is sufficiently long.)

In fact, if techniques such as random sampling and dynamic pruning were adopted, BytePiece's training speed and memory usage could be further optimized. Currently, to guarantee deterministic results, BytePiece does not prune until all statistics have been gathered, which ensures consistent results whether training runs as a single process or as multiple processes. If the input were randomly shuffled and pruning performed at intervals during counting, memory usage could be reduced further and counting sped up, with minimal expected impact on the final result. This work will be presented later, depending on user feedback.

Beyond these points, there are many details in BytePiece that need refinement, and there may be undiscovered bugs. I ask for your understanding and feedback.

Summary

This article introduced BytePiece, a tokenizer I developed myself. It is a Byte-based Unigram tokenizer implemented in pure Python, which makes it more readable and easier to extend. Because it adopts a new training algorithm, its compression ratio is typically higher than that of existing tokenizers, and it supports multi-process accelerated training. Furthermore, it operates directly on the UTF-8 bytes of the text and performs almost no preprocessing, making it purer and more language-agnostic.