A Problem and Countermeasure for Large-Vocabulary Language Models in Text Continuation Tasks

By 苏剑林 | September 13, 2023

For LLMs, enlarging the Tokenizer's vocabulary to improve the compression rate, thereby shortening sequence lengths and reducing decoding costs, is a direction everyone welcomes. After all, a larger vocabulary only requires expanding the Embedding layer and the output Dense layer, where the added computational overhead is nearly imperceptible, whereas the decoding speedup brought by shorter sequences is very tangible. Of course, enlarging the vocabulary may also hurt model performance in some respects, so it cannot be increased without restraint. This article analyzes a problem that arises in language models on text continuation tasks after the vocabulary is enlarged and proposes a reference solution.

Pros and Cons Analysis

The benefits of increasing the vocabulary size are obvious. On one hand, since LLMs decode autoregressively, token by token, the total decoding cost grows with the number of tokens. "Increasing vocabulary → improving compression rate → shortening sequence length" means the same text corresponds to fewer tokens, i.e., fewer decoding steps, and hence faster decoding. On the other hand, since language models are trained with Teacher Forcing, shortening the sequence length can alleviate the Exposure Bias problem that Teacher Forcing causes, potentially improving model performance.

However, the disadvantages of increasing the vocabulary are also clear: the most direct is that it severs the connections between tokens at the character level, which may hurt generalization or even cause the model to lose the ability to perform certain tasks. For example, if "Solar Energy" (太阳能) and "Solar" (太阳) are both individual tokens in the vocabulary, the model does not inherently know that "Solar Energy" is composed of "Solar" and "Energy," nor that "Solar" consists of specific characters. This makes character-level tasks difficult, such as the classic question "How do you read 'Solar Energy' backwards?" The expected answer is "Energy Solar," but since the model does not perceive the word as being composed of individual components, it struggles to answer correctly.

The Continuation Problem

Recently, @Armen Aghajanyan shared another issue. They used an ultra-large vocabulary when training a code model, resulting in common commands like import numpy as np becoming a single token. They then discovered that when a user inputs import numpy, the model is unable to continue the sequence with as np. The reason is simple: import numpy as np was treated as a single token. When import numpy appears alone, the model finds that it is never followed by as np (since cases followed by as np were merged into the single import numpy as np token), and thus it cannot complete the continuation naturally.

This phenomenon is indeed classic, and it is not limited to code models; it also appears in ordinary natural language models. For instance, when "Solar Energy" (太阳能) and "Solar" (太阳) are both independent tokens, if a user inputs "Solar," the next token generated will almost never be "Energy" (能), which may not match the user's expectations. Similarly, if "White Cloud" (白云), "White Cloud Mountain" (白云山), and "White Cloud Airport" (白云机场) are all independent tokens, then after a user inputs "Guangzhou's White Cloud" (广州的白云), the model will almost never continue with "Guangzhou's White Cloud Airport" or "Guangzhou's White Cloud Mountain," and so on.
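To see the mechanism concretely, here is a minimal toy sketch in Python. The vocabulary and the greedy longest-match tokenizer below are hypothetical stand-ins (real BPE or Unigram tokenizers differ in detail), but they reproduce the effect: once "White Cloud Airport" (白云机场) is a single token, the token pair ("White Cloud", "Airport") never occurs in tokenized training text, so the model gathers no evidence for continuing "White Cloud" with "Airport".

```python
# Toy illustration (not a real tokenizer): greedy longest-match segmentation
# over a hypothetical vocabulary, showing why "White Cloud" is never followed
# by "Airport" at the token level once "White Cloud Airport" is its own token.

VOCAB = ["广州", "的", "白云", "白云山", "白云机场", "机场", "山"]

def greedy_tokenize(text, vocab=VOCAB):
    """Segment text by always taking the longest vocabulary entry that matches
    at the current position (falling back to single characters)."""
    tokens, i = [], 0
    while i < len(text):
        match = max(
            (w for w in vocab if text.startswith(w, i)),
            key=len,
            default=text[i],  # unknown character falls back to itself
        )
        tokens.append(match)
        i += len(match)
    return tokens

print(greedy_tokenize("广州的白云机场"))  # ['广州', '的', '白云机场']
print(greedy_tokenize("广州的白云"))      # ['广州', '的', '白云']
# Training texts containing "白云机场" never yield the pair ('白云', '机场'),
# so an LLM trained on these token sequences assigns it near-zero probability.
```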

Reference Strategy

However, I believe that the phenomenon Armen Aghajanyan describes is not necessarily a disadvantage of large vocabularies; with a little extra processing, it can even be turned into an advantage. The problem is actually simple to address. Before LLMs existed, we could already perform certain completion tasks with "vocabulary + prefix search" alone. Now that we have LLMs, why confine ourselves to the LLM instead of combining LLM-based continuation with vocabulary-based continuation?

Returning to the previous example, suppose the user inputs "Guangzhou's White Cloud" (广州的白云). The Tokenizer splits it into "Guangzhou / of / White Cloud." Now, if these three tokens are converted directly into IDs and input into the model, it will be unable to continue with results like "Guangzhou / of / White Cloud Airport." This is essentially because the Tokenizer cannot predict future text in advance, leading to an incorrect tokenization result. (Of course, one could also consider using a stochastic tokenization algorithm during the training phase. In such a case, "White Cloud Airport" might appear as a single word or as "White Cloud / Airport." This prevents the tokenization result from severely impacting subsequent performance and can even enhance generalization; see "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates").
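The parenthetical above can be illustrated with a similarly hypothetical sketch: instead of committing to a single greedy segmentation during training, enumerate the segmentations a string admits over the vocabulary and sample one of them. Real subword regularization, as in the cited paper, samples segmentations according to a unigram language model; the uniform sampling here is only a simplification for illustration, reusing the toy VOCAB from the earlier sketch.

```python
import random

# Toy sketch of stochastic tokenization: enumerate all segmentations of a
# string over the (hypothetical) vocabulary and sample one.  Exponential in
# general, but fine for a toy example.
VOCAB = ["广州", "的", "白云", "白云山", "白云机场", "机场", "山"]

def all_segmentations(text, vocab=VOCAB):
    if not text:
        return [[]]
    segs = []
    for w in vocab:
        if text.startswith(w):
            segs += [[w] + rest for rest in all_segmentations(text[len(w):], vocab)]
    # If no vocabulary entry matches, fall back to a single character.
    return segs or [[text[0]] + rest for rest in all_segmentations(text[1:], vocab)]

print(all_segmentations("白云机场"))
# [['白云', '机场'], ['白云机场']] -- both forms would appear during training
print(random.choice(all_segmentations("广州的白云机场")))
```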

So, can we estimate the future text? Suppose that after tokenizing into "Guangzhou / of / White Cloud," we take one step back and use "White Cloud" to perform a prefix search in the vocabulary. Let's further assume the search results are "White Cloud," "White Cloud Airport," "White Cloud Mountain," and "White Cloud Road." This search step is performed purely based on the vocabulary, and its computational cost is negligible compared to the LLM. Once we have the search results, we use the LLM to calculate:

\begin{equation}\begin{aligned}
&p(\text{White Cloud}|\text{Guangzhou},\text{of}) \\
&p(\text{White Cloud Airport}|\text{Guangzhou},\text{of}) \\
&p(\text{White Cloud Mountain}|\text{Guangzhou},\text{of}) \\
&p(\text{White Cloud Road}|\text{Guangzhou},\text{of})
\end{aligned}\end{equation}

Since the conditioning inputs are identical, computing these four conditional probabilities requires only one run of the LLM: they are simply four entries of the same output distribution. We then re-normalize them over the candidate set and sample. If the sampled result is "White Cloud," we continue generating from "Guangzhou / of / White Cloud"; if we sample "White Cloud Airport," we output "Airport" and continue from "Guangzhou / of / White Cloud Airport," and so on. This easily solves the problem Armen Aghajanyan mentioned and turns the disadvantage into an advantage: when the compression rate is high, even after stepping back only one token, the entry found by prefix search may be very long, so more text can be produced in a single step. Note that this backtrack operation is needed only at the first sampling step; its sole purpose is to avoid the tokenization error caused by the input being incomplete. From the second step onward no backtracking is needed, so the extra computational overhead is minimal.
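Putting the pieces together, below is a minimal, self-contained sketch of this first decoding step. The vocabulary, the token-to-id table, and the next_token_probs function are hypothetical stand-ins for a real Tokenizer and a real LLM forward pass; with a real model, the candidates' probabilities would simply be read off the softmax produced by one forward pass over the context.

```python
import random

# Hypothetical vocabulary and id table; in practice these come from the Tokenizer.
VOCAB = ["广州", "的", "白云", "白云山", "白云机场", "白云路", "机场", "山", "路"]
TOKEN_TO_ID = {w: i for i, w in enumerate(VOCAB)}

def next_token_probs(context_ids):
    """Stand-in for one LLM forward pass: returns a next-token probability
    distribution over the whole vocabulary.  Here it is a fixed fake
    distribution so the sketch runs end to end."""
    return [0.05, 0.05, 0.20, 0.15, 0.30, 0.10, 0.05, 0.05, 0.05]

def first_step_sample(context_tokens):
    """First decoding step with a one-token backtrack:
    1. pop the last token of the user input,
    2. prefix-search the vocabulary with its surface string,
    3. renormalize the LLM's probabilities over the candidates and sample."""
    *prefix, last = context_tokens
    candidates = [w for w in VOCAB if w.startswith(last)]        # vocabulary-only step
    probs = next_token_probs([TOKEN_TO_ID[w] for w in prefix])   # one forward pass
    weights = [probs[TOKEN_TO_ID[w]] for w in candidates]
    total = sum(weights)
    sampled = random.choices(candidates, weights=[x / total for x in weights])[0]
    new_text = sampled[len(last):]   # emit only the part the user has not typed
    return prefix + [sampled], new_text

context, emitted = first_step_sample(["广州", "的", "白云"])
print(context, repr(emitted))
# e.g. ['广州', '的', '白云机场'] '机场' -- continue normal decoding from here,
# with no further backtracking needed.
```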

It is worth mentioning that Microsoft's "guidance" library proposes essentially the same trick. Furthermore, in more general scenarios, backtracking one step is sometimes not enough. For example, in the import numpy as np case, the input import numpy might be split into import / numpy, and one would need to backtrack at least two tokens before the prefix search can rejoin them. There is no fundamental difference, however; the details are just slightly more involved. I will not expand on them here; readers can work them out themselves when deploying their inference systems.
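For what it is worth, the vocabulary-only part of the multi-token case might look like the rough sketch below. This is my own assumption about one reasonable way to choose the backtrack depth, not a prescribed algorithm: keep backing off trailing tokens while their concatenated surface string is still a proper prefix of some longer vocabulary entry, so that an input split as import / numpy can be rejoined before the prefix search; the probability renormalization then proceeds as before, with a little more bookkeeping.

```python
def backtrack_depth(context_tokens, vocab, max_back=3):
    """Return how many trailing tokens to pop before prefix search: the largest
    k (up to max_back) whose concatenated surface string is a proper prefix of
    at least one longer vocabulary entry, or 0 if no backtracking helps."""
    best = 0
    for k in range(1, min(max_back, len(context_tokens)) + 1):
        suffix = "".join(context_tokens[-k:])
        if any(w.startswith(suffix) and len(w) > len(suffix) for w in vocab):
            best = k
    return best

# Hypothetical code-model vocabulary containing the merged command as one token.
CODE_VOCAB = ["import", " numpy", " as", " np", "import numpy as np"]
print(backtrack_depth(["import", " numpy"], CODE_VOCAB))
# 2: "import numpy" is a proper prefix of the merged token "import numpy as np",
# so both trailing tokens are popped before the prefix search.
```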

Summary

This article introduced a problem that can occur with ultra-large vocabulary LLMs during text continuation tasks and shared a reference solution.