A Brief Exploration of Stochastic Tokenization: From Viterbi Decoding to Viterbi Sampling

By 苏剑林 | September 16, 2023

After the previous article "A Problem and Countermeasure for Large Vocabulary Language Models in Continuation Tasks" was published, a reader quickly pointed out that introducing stochastic tokenization during training can solve the same problem, and that papers and implementations for it already exist. After some further reading, I found that this technique is called Subword Regularization; it was first applied in NMT (Neural Machine Translation), and SentencePiece already has a corresponding implementation. It seems this technique can indeed alleviate the aforementioned problem and may even help improve the fault tolerance of language models, so I had the idea of adding it to BytePiece.

So the question arises: how do we turn deterministic tokenization into stochastic tokenization? BytePiece is based on the Unigram model, which uses the Viterbi algorithm to find the maximum-probability tokenization. Since probabilities are already involved, can a stochastic sampling method be derived from them naturally? This article discusses this question and shares my own solution.

Key Analysis

At present, Unigram tokenization directly outputs the maximum-probability segmentation, so its output is deterministic. Specifically, suppose $\boldsymbol{w}=(w_1,w_2,\cdots,w_k)$ denotes a segmentation scheme, with score $P(\boldsymbol{w})=p(w_1)p(w_2)\cdots p(w_k)$. If $\Omega(S)$ denotes the set of all possible segmentation schemes for sentence $S$, then the tokenization algorithm can be described as:

\begin{equation}\boldsymbol{w}^* = \mathop{\text{argmax}}_{\boldsymbol{w}\in \Omega(S)}P(\boldsymbol{w})\end{equation}

This can be computed in linear time with the Viterbi algorithm, so the process is also referred to as "Viterbi Decoding." Since the Unigram model inherently carries probabilities, turning it into a probabilistic sampling procedure might seem easy. On closer reflection, however, this is not a trivial problem, and there are many technical details to get right.
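A toy example of the argmax above (the vocabulary and probabilities are made up): suppose the vocabulary is $\{a, b, ab\}$ with $p(a)=0.3$, $p(b)=0.2$ and $p(ab)=0.1$. For the sentence $S=ab$, we have $\Omega(S)=\{(ab), (a,b)\}$ with $P((ab))=0.1$ and $P((a,b))=0.3\times 0.2=0.06$, so Viterbi Decoding outputs $\boldsymbol{w}^*=(ab)$.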

I initially envisioned designing a recursive sampling process mimicking autoregressive language models. The hardest part is preserving, as much as possible, the original ranking of the candidate segmentation schemes, or at least ensuring that the maximum-probability scheme stays the same; that is, the maximum-probability path $\boldsymbol{w}^*$ from Viterbi Decoding should also be the most probable output of the designed sampling algorithm. Since all segmentation schemes $\Omega(S)$ form a Directed Acyclic Graph (DAG), I first thought a random walk on the DAG might be a viable solution. On further thought, however, I realized it is difficult to design transition probabilities that keep the maximum-probability path unchanged (different edges out of the same node carry different weights, so they cannot simply be sampled by frequency).

Existing Solutions

Since I didn't have any new ideas for a while, I decided to check the "reference answer"—the original Subword Regularization paper: "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates".

However, this "standard answer" left me not knowing whether to laugh or cry. It turns out the idea of Subword Regularization is very simple and direct: first, search for the $n$ segmentation schemes with the highest $P(\boldsymbol{w})$, denoted $\boldsymbol{w}^*_1,\boldsymbol{w}^*_2,\cdots,\boldsymbol{w}^*_n$ (the $n$-best segmentations), and then construct the following distribution:

\begin{equation}p_i = \frac{P(\boldsymbol{w}^*_i)^{\alpha}}{\sum\limits_{j=1}^n P(\boldsymbol{w}^*_j)^{\alpha}}\end{equation}

Then one of these $n$ schemes is sampled according to the probabilities $p_i$, where $\alpha > 0$ is a hyperparameter. This algorithm is already integrated into SentencePiece, and readers can test it themselves (see the SentencePiece documentation for usage).
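To make this sampling distribution concrete, here is a brute-force sketch in Python. The vocab dict (tokens to log-probabilities) and the exponential enumeration of $\Omega(S)$ are illustrative assumptions; real implementations obtain the $n$-best list with an n-best Viterbi search instead:

import math, random

def segmentations(text, vocab):
    # enumerate all segmentations of `text` into vocabulary tokens
    # (exponential brute force, for illustration only)
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        if text[:i] in vocab:
            for rest in segmentations(text[i:], vocab):
                yield [text[:i]] + rest

def subword_regularization(text, vocab, n=10, alpha=0.1):
    # score each scheme by its total log-probability and keep the n best
    nbest = sorted(((sum(vocab[w] for w in ws), ws)
                    for ws in segmentations(text, vocab)), reverse=True)[:n]
    # p_i is proportional to P(w_i)^alpha, i.e. a softmax of the log-scores
    # scaled by alpha (shifted by the best score for numerical stability)
    weights = [math.exp(alpha * (s - nbest[0][0])) for s, _ in nbest]
    return random.choices([ws for _, ws in nbest], weights=weights)[0]

With the toy vocabulary above, subword_regularization('ab', {'a': math.log(0.3), 'b': math.log(0.2), 'ab': math.log(0.1)}) samples between ['ab'] and ['a', 'b'] with exactly the probabilities given by the formula.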

The problem is that "simple and direct" does not necessarily mean "efficient." Although searching for the top-$n$ segmentation schemes also takes linear time (interested readers can look up N-best Viterbi), it is significantly more expensive than the top-1 search of Viterbi Decoding (theoretically about $n$ times the cost). Consequently, once stochastic sampling is enabled, tokenization becomes much slower than in the deterministic case, so this was not my ideal sampling method.
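For reference, this is how sampling is enabled through SentencePiece's Python API (the model file name below is a placeholder; $\alpha=0.1$ matches the speed comparison later in this article):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='unigram.model')  # placeholder model file
print(sp.encode('New York', out_type=str))  # deterministic Viterbi tokenization
for _ in range(3):
    # Subword Regularization: nbest_size=-1 samples from the whole lattice
    # rather than a fixed n-best list
    print(sp.encode('New York', out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))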

My Approach

My thoughts hit a dead end again. In my frustration, I re-examined the goal: we want a stochastic sampling algorithm with complexity comparable to Viterbi Decoding, so Viterbi Decoding itself should be a good starting point. I therefore looked at the tokenization code again. The tokenization function at the time looked like this:

def _tokenize(self, bytes text):
    cdef int e, k, s
    cdef double v, score
    # routes[i] = (best score of a segmentation of text[:i], backpointer
    # to the start of its last token); index 0 is the empty prefix
    cdef list routes = [(0, None)] + [(-INFINITY, None) for _ in text]
    cdef list tokens = []
    # the automaton yields each vocabulary token matched in text, as its
    # inclusive end index e plus the token's length k and log-probability v
    for e, (k, v) in self._automaton.iter(text):
        s, e = e - k + 1, e + 1  # convert to the half-open span [s, e)
        score = routes[s][0] + v
        if score > routes[e][0]:  # keep the higher-scoring path to e
            routes[e] = score, s
    # backtrack along the backpointers to recover the optimal tokens
    while text:
        s = routes[e][1]
        tokens.append(text[s:e])
        text, e = text[:s], s
    return tokens[::-1]

After reading it a few times, I finally had an inspiration: the key to Viterbi Decoding is the line if score > routes[e][0]:. It keeps the optimal segmentation scheme up to the current position, where score is the score (log-probability) of the new scheme and routes[e][0] is the best historical score; if the new scheme is better, it overwrites the old one. This reminded me of the acceptance-rate design in MCMC algorithms: if we introduce random sampling right here, wouldn't the tokenization results be randomized?

Let $r \in \{1, 0\}$ indicate acceptance ($r=1$) or rejection ($r=0$) of the new scheme. Since this step is just a binary choice, making it probabilistic is very simple:

\begin{equation} r_i = \left\{\begin{aligned}&\,1\,, \,\, s_i > s_{i-1} \\ &\,0\,, \,\, \text{else}\end{aligned}\right.\qquad\longrightarrow\qquad r_i = \left\{\begin{aligned}&\,1\,, \,\, \varepsilon < \sigma(\alpha(s_i - s_{i-1})) \\ &\,0\,, \,\, \text{else}\end{aligned}\right. \end{equation}

Here $\varepsilon \sim U[0,1]$ is a uniform random number, $\alpha > 0$ is a hyperparameter, $\sigma(t)=1/(1+e^{-t})$ is the Sigmoid function, and $s_i, s_{i-1}$ are the scores (log-probabilities) of the new and old schemes, respectively. It is not difficult to see that the deterministic rule on the left is recovered from the stochastic rule on the right in the limit $\alpha \to \infty$.

In this way, based on Viterbi Decoding, we obtain a very natural and lightweight stochastic sampling algorithm, which I call "Viterbi Sampling." Implementing it only requires replacing the if score > routes[e][0]: criterion with a version containing a random number. By the monotonicity of the Sigmoid function, when $s_i > s_{i-1}$ the new scheme is accepted with probability greater than $1/2$, so at every comparison the better path is the likelier one to survive; thus the maximum-probability segmentation from Viterbi Decoding is also the most probable outcome of Viterbi Sampling. Furthermore, the larger $s_i - s_{i-1}$ is, the larger $\sigma(\alpha(s_i - s_{i-1}))$ becomes, so schemes with higher original scores are sampled with higher probability. This preserves the ranking of segmentation schemes to some extent (strict order preservation has not been proven, but for practical purposes approximate order preservation is sufficient).
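To make the change concrete, here is a minimal pure-Python sketch of Viterbi Sampling. It is not BytePiece's actual implementation (that is the Cython function shown earlier); the vocab dict (tokens to log-probabilities), the sigmoid helper, and the explicit loop over start positions are illustrative assumptions:

import math, random

def sigmoid(t):
    # numerically stable logistic function 1 / (1 + exp(-t))
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def viterbi_sample(text, vocab, alpha=0.1):
    # assumes every single character is in vocab, so a segmentation always exists
    max_len = max(map(len, vocab))
    # routes[i] = (score of the currently kept segmentation of text[:i],
    #              backpointer to the start of its last token)
    routes = [(0.0, None)] + [(-math.inf, None)] * len(text)
    for e in range(1, len(text) + 1):
        for s in range(max(0, e - max_len), e):
            token = text[s:e]
            if token not in vocab or routes[s][0] == -math.inf:
                continue
            score = routes[s][0] + vocab[token]
            # deterministic Viterbi accepts iff score > routes[e][0];
            # Viterbi Sampling accepts with probability sigmoid(alpha * diff)
            if random.random() < sigmoid(alpha * (score - routes[e][0])):
                routes[e] = (score, s)
    # backtrack along the backpointers to recover the sampled tokens
    tokens, e = [], len(text)
    while e > 0:
        s = routes[e][1]
        tokens.append(text[s:e])
        e = s
    return tokens[::-1]

With the toy vocabulary from before, viterbi_sample('ab', {'a': math.log(0.3), 'b': math.log(0.2), 'ab': math.log(0.1)}, alpha=1.0) returns ['ab'] more often than ['a', 'b'], and the larger alpha is, the more the output concentrates on the deterministic Viterbi result.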

Simple Test

Starting from version 0.4.0, Viterbi Sampling is built into BytePiece's tokenization function. You only need to add an alpha parameter greater than 0 when using tokenizer.tokenize or tokenizer.encode to get randomized results:

import bytepiece
assert bytepiece.__version__ >= '0.4.0'

tokenizer = bytepiece.Tokenizer('bytepiece_160k.model')
text = '今天天气不错'
print(tokenizer.tokenize(text)) # alpha defaults to -1; alpha ≤ 0 represents deterministic tokenization
for i in range(5):
    print(tokenizer.tokenize(text, alpha=0.1))

# [b'\xe4\xbb\x8a\xe5\xa4\xa9', b'\xe5\xa4\xa9\xe6\xb0\x94', b'\xe4\xb8\x8d\xe9\x94\x99']
# [b'\xe4\xbb\x8a\xe5\xa4\xa9', b'\xe5\xa4\xa9\xe6', b'\xb0\x94', b'\xe4\xb8\x8d\xe9\x94\x99']
# [b'\xe4\xbb\x8a\xe5\xa4\xa9', b'\xe5\xa4\xa9\xe6\xb0\x94', b'\xe4\xb8\x8d\xe9\x94\x99']
# [b'\xe4\xbb\x8a\xe5\xa4\xa9', b'\xe5\xa4\xa9\xe6\xb0\x94', b'\xe4\xb8', b'\x8d', b'\xe9\x94', b'\x99']
# [b'\xe4\xbb\x8a\xe5\xa4\xa9', b'\xe5\xa4\xa9', b'\xe6\xb0\x94', b'\xe4\xb8\x8d\xe9\x94\x99']
# [b'\xe4\xbb', b'\x8a\xe5\xa4\xa9', b'\xe5\xa4\xa9', b'\xe6\xb0\x94\xe4\xb8\x8d', b'\xe9\x94', b'\x99']

Next, let's compare the speed of SentencePiece's Subword Regularization and BytePiece's Viterbi Sampling (both set to $\alpha=0.1$ for stochastic tokenization):

\begin{array}{c|cc} \hline & \text{Deterministic Tokenization} & \text{Stochastic Tokenization} \\ \hline \text{SP-BPE} & \text{1.36M bytes/sec} & \text{1.25M bytes/sec} \\ \text{SP-Unigram} & \text{5.65M bytes/sec} & \text{1.28M bytes/sec} \\ \text{BytePiece} & \text{1.95M bytes/sec} & \text{1.36M bytes/sec} \\ \hline \end{array}

As can be seen, with Subword Regularization enabled (the "SP-Unigram" row), tokenization speed falls to less than 1/4 of the original, which shows that its sampling algorithm is quite inefficient. In contrast, the Viterbi Sampling proposed in this article drops by only about 30%, a clearly better result; the drop comes mainly from random-number generation and the Sigmoid computation, and optimizing these two parts could speed it up further. As for the BPE model, its stochastic tokenization is called BPE Dropout, a method specific to BPE; interested readers can look it up on their own, as it will not be introduced here.

Summary

This article primarily explores strategies for converting deterministic tokenization in Unigram models into stochastic tokenization. Although an existing method called "Subword Regularization" achieves this goal, its efficiency is relatively low. To this end, I have proposed a more efficient sampling algorithm, Viterbi Sampling, which only requires simple modifications to deterministic Viterbi Decoding, thereby largely maintaining the original efficiency. Experiments show that the new algorithm's sampling speed significantly exceeds that of Subword Regularization. The corresponding implementation is already built into the latest version of BytePiece.