By 苏剑林 | September 18, 2020
Currently, most Chinese pre-trained models use the character as the basic unit, meaning that Chinese sentences are split into individual characters. There are also some multi-granularity Chinese language models, such as Innovation Works' ZEN and ByteDance's AMBERT, but the basic unit of these models is still the character, with mechanisms added on top to fuse word information. Purely word-based Chinese pre-trained models, however, remain rare; as far as the author knows, only Tencent UER has open-sourced a word-granularity BERT model, and its practical performance was not ideal. So, how effective can a purely word-based Chinese pre-trained model be? Is it worth having? Recently, we pre-trained and open-sourced a word-based Chinese BERT model, which we call WoBERT (Word-based BERT, which also sounds like "my BERT" in Chinese). Experiments show that word-based WoBERT has unique advantages on many tasks, such as a significant speed-up, while performance stays roughly the same or even improves. Here is a summary of our work.
Is "character" or "word" better? This is a frustrating question in Chinese NLP, and there have been several works systematically studying it. A relatively recent one is "Is Word Segmentation Necessary for Deep Learning of Chinese Representations?" published by ShannonAI at ACL 2019, which concluded that characters are almost always superior to words. As mentioned earlier, current Chinese pre-trained models are indeed basically character-based. So, does it seem like the problem is solved? Are characters just better?
Things are far from that simple. Taking the ShannonAI paper as an example, its experimental results are not wrong, but they are not representative. Why? Because it compared models whose embedding layers were randomly initialized. In that setting, for the same task, a word-based model has more parameters in the embedding layer, making it naturally more prone to overfitting and thus to poorer performance, which one could guess without even running the experiments. The issue is that when we use word-based models, we usually don't initialize them randomly; we typically use pre-trained word vectors (deciding whether to fine-tune them based on the downstream task). This is the typical setup for word-based NLP models, yet the paper did not include it in its comparisons, so its results are not very convincing.
In fact, "overfitting" is a double-edged sword. We want to prevent it, but overfitting also demonstrates that a model has strong fitting capabilities. If we find ways to suppress overfitting, we can obtain a stronger model under the same complexity, or a lower-complexity model for the same performance. One of the most important means to alleviate overfitting is more sufficient pre-training. Therefore, comparisons that do not introduce pre-training are unfair to word-based models, and our WoBERT confirms the feasibility of word-based pre-trained models.
The general consensus is that character-based models offer the following benefits:

1. Fewer parameters, so they are less prone to overfitting;
2. No dependence on a word segmentation algorithm, which avoids boundary segmentation errors;
3. Less severe sparsity, so out-of-vocabulary tokens essentially never occur.
The reasons for choosing word-based models are:

1. Shorter sequences, so processing is faster;
2. On text generation tasks, the Exposure Bias problem is alleviated;
3. Word meanings are less ambiguous than character meanings, which lowers the modeling difficulty.
Some might have doubts about these benefits of words. Take the second point, Exposure Bias: theoretically, shorter sequences make the Exposure Bias problem less pronounced. A word-based model predicts an $n$-character word in a single step, whereas a character-based model needs $n$ steps, each recursively dependent on the previous ones, so the character model's Exposure Bias is more severe. As for the third point, although polysemy exists, the meanings of most words are relatively fixed, or at least clearer than the meanings of individual characters. Consequently, an embedding layer alone may be enough to model word meanings, unlike character models, which need several layers to compose characters into words.
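To spell out the second point, consider generating an $n$-character word $w = c_1 c_2 \cdots c_n$ autoregressively; the two granularities factor the prediction as

\[ \underbrace{p(w \mid \text{context})}_{\text{word model: } 1 \text{ step}} \qquad \text{vs.} \qquad \underbrace{\prod_{k=1}^{n} p(c_k \mid \text{context}, c_1, \dots, c_{k-1})}_{\text{character model: } n \text{ steps}} \]

Each of the $n$ character-level steps conditions on the characters generated before it, so errors can compound within the word during decoding, while the word model exposes only a single prediction step to its own output.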
While the two sides seem evenly matched, the benefits of characters are not exclusive to characters; word-based models can obtain them too, provided a few techniques are used. For example:

1. A word-based model does have more parameters, but sufficient pre-training alleviates the resulting overfitting;
2. Boundary segmentation errors are hard to avoid entirely, but only sequence labeling tasks truly need precise boundaries, while classification and generation tasks are largely unaffected;
3. Keeping only the most frequent words in the vocabulary and falling back to character-level tokenization for everything else means out-of-vocabulary tokens essentially never occur.
Therefore, word-based models actually offer many benefits. Except for sequence labeling tasks that require extremely precise boundaries, most NLP tasks won't have issues with word-based units. Thus, we proceeded to build a word-based BERT model.
To include Chinese words in BERT, the first step is to enable the Tokenizer to recognize them. Is it enough to just add words to vocab.txt? Not really. BERT's built-in Tokenizer forcibly separates Chinese characters with spaces, so even if you add the words to the dictionary, it won't segment them as Chinese words. Furthermore, when BERT performs English WordPiece tokenization, it uses a maximum matching method, which is not accurate enough for Chinese word segmentation.
To segment words, we modified BERT's Tokenizer slightly by adding a "pre_tokenize" operation. This allows us to segment Chinese words as follows:
1. Add the desired words to vocab.txt;
2. Use pre_tokenize to segment the input text first, obtaining $[w_1, w_2, \dots, w_l]$;
3. Traverse each $w_i$: if $w_i$ is in the vocabulary, keep it as is; otherwise, split it again with BERT's built-in tokenize function;
4. Concatenate the tokenize results of each $w_i$ in order as the final tokenization result.

In bert4keras>=0.8.8, implementing this change only requires passing a single parameter when constructing the Tokenizer, for example:
import jieba
from bert4keras.tokenizers import Tokenizer

tokenizer = Tokenizer(
    dict_path,  # path to WoBERT's vocab.txt
    do_lower_case=True,
    pre_tokenize=lambda s: jieba.cut(s, HMM=False)  # segment with jieba before WordPiece
)
Here, pre_tokenize is an externally passed segmentation function; if not passed, it defaults to None. For simplicity, WoBERT uses Jieba segmentation. We removed redundant parts of BERT's original vocabulary (such as Chinese words with ##) and added 20,000 additional Chinese words (the top 20,000 highest frequency words from Jieba's built-in dictionary). The final vocab.txt size for WoBERT is 33,586.
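For illustration only, here is a minimal sketch of the four-step procedure above written as plain Python around jieba; word_vocab and char_tokenize are hypothetical placeholders standing in for vocab.txt and BERT's original tokenization, not the actual bert4keras internals.

import jieba

def word_tokenize(text, word_vocab, char_tokenize):
    """Sketch of word-level tokenization with character-level fallback.

    word_vocab:    set of tokens contained in vocab.txt (assumed loaded elsewhere)
    char_tokenize: BERT's original character/WordPiece tokenization (placeholder)
    """
    tokens = []
    for w in jieba.cut(text, HMM=False):     # step 2: pre-segment with jieba
        if w in word_vocab:
            tokens.append(w)                 # step 3: keep words that are in the vocabulary
        else:
            tokens.extend(char_tokenize(w))  # step 3: otherwise fall back to BERT's own tokenize
    return tokens                            # step 4: the concatenated results are the final tokens

This mirrors what the Tokenizer above does once pre_tokenize is supplied.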
The currently open-sourced WoBERT is the Base version, built by continuing the pre-training of the RoBERTa-wwm-ext model open-sourced by the Harbin Institute of Technology; the pre-training task is MLM. When initializing the new word embeddings, each newly added word is segmented into characters by BERT's built-in Tokenizer, and its embedding is initialized as the average of those characters' embeddings.
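A rough sketch of that initialization step (the function and variable names here are illustrative, not the actual training code):

import numpy as np

def init_word_embedding(word, char_tokenize, char_embeddings, token_to_id):
    """Sketch: initialize a new word's embedding as the mean of its characters' embeddings.

    char_tokenize:   BERT's original character-level tokenization (placeholder name)
    char_embeddings: pretrained token-embedding matrix from RoBERTa-wwm-ext, shape (vocab_size, dim)
    token_to_id:     mapping from token string to row index in char_embeddings
    """
    ids = [token_to_id[t] for t in char_tokenize(word)]
    return char_embeddings[ids].mean(axis=0)  # average over the word's characters

These vectors give the newly added words a sensible starting point before the MLM pre-training continues.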
At this point, the technical highlights of WoBERT have basically been covered; the rest was the training itself. We trained on a single 24GB RTX card for 1,000,000 steps (roughly 10 days), with a sequence length of 512, a learning rate of $5\times 10^{-6}$, a batch_size of 16, and gradient accumulation over 16 steps, which is equivalent to training for about 60,000 steps with batch_size=256. The training corpus consists of about 30GB of general-purpose text. The training code has been open-sourced at the link provided at the beginning of the article.
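In other words (reading the 1,000,000 steps as counting the individual batches of 16):

\[ 16 \times 16 = 256 \ \text{(effective batch size)}, \qquad \frac{1{,}000{,}000}{16} = 62{,}500 \approx 60{,}000 \ \text{updates at batch\_size}=256. \]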
In addition, we have provided WoNEZHA, which is based on the NEZHA open-sourced by Huawei. The training details are basically the same as WoBERT. NEZHA's model structure is similar to BERT, but it uses relative position encoding instead of absolute position encoding, meaning that theoretically, the text length NEZHA can handle is unlimited. Providing the word-based WoNEZHA here gives everyone another choice.
Finally, let's talk about the results of WoBERT. Simply put, in our evaluations, compared to BERT, WoBERT did not perform worse on NLP tasks that do not require precise boundaries, and in some cases even showed improvements. Meanwhile, there was a significant increase in speed, so in one sentence: "speed up without dropping points."
For example, here is a comparison on two Chinese text classification benchmarks, IFLYTEK and TNEWS:
\[ \begin{array}{c} \text{Text Classification Performance Comparison}\\ {\begin{array}{c|cc} \hline & \text{IFLYTEK} & \text{TNEWS} \\ \hline \text{BERT} & 60.31\% & 56.94\% \\ \text{WoBERT} & \textbf{61.15\%} & \textbf{57.05\%} \\ \hline \end{array}} \end{array} \]

We also tested many internal tasks, and the results were similar, indicating that WoBERT and BERT are basically comparable on these NLU tasks. However, in terms of speed, WoBERT has a clear advantage. The table below compares the speeds of the two models when processing texts of different lengths:
\[ \begin{array}{c} \text{Speed Comparison}\\ {\begin{array}{c|ccc} \hline \text{Text length} & \text{128} & \text{256} & \text{512} \\ \hline \text{BERT} & \text{1.0x} & \text{1.0x} & \text{1.0x} \\ \text{WoBERT} & \textbf{1.16x} & \textbf{1.22x} & \textbf{1.28x} \\ \hline \end{array}} \end{array} \]

We also tested Seq2Seq tasks (CSL/LCSTS title generation) using WoBERT + UniLM, and the results showed a marked improvement over character-based models:
\[ \begin{array}{c} \text{CSL Abstract Generation Experimental Results}\\ {\begin{array}{c|c|cccc} \hline & \text{beam size} & \text{Rouge-L} & \text{Rouge-1} & \text{Rouge-2} & \text{BLEU} \\ \hline \text{BERT} & 1 & 63.81 & 65.45 & 54.91 & 45.52 \\ \text{WoBERT} & 1 & \textbf{66.38} & \textbf{68.22} & \textbf{57.83} & \textbf{47.76} \\ \hline \text{BERT} & 2 & 64.44 & 66.09 & 55.75 & 46.39 \\ \text{WoBERT} & 2 & \textbf{66.65} & \textbf{68.68} & \textbf{58.5} & \textbf{48.4} \\ \hline \text{BERT} & 3 & 64.75 & 66.34 & 56.06 & 46.7 \\ \text{WoBERT} & 3 & \textbf{66.83} & \textbf{68.81} & \textbf{58.67} & \textbf{48.6} \\ \hline \end{array}}\\ \\ \text{LCSTS Abstract Generation Experimental Results}\\ {\begin{array}{c|c|cccc} \hline & \text{beam size} & \text{Rouge-L} & \text{Rouge-1} & \text{Rouge-2} & \text{BLEU} \\ \hline \text{BERT} & 1 & 27.99 & 29.57 & 18.04 & 11.72 \\ \text{WoBERT} & 1 & \textbf{31.51} & \textbf{32.9} & \textbf{21.13} & \textbf{13.74} \\ \hline \text{BERT} & 2 & 29.2 & 30.7 & 19.17 & 12.64 \\ \text{WoBERT} & 2 & \textbf{31.91} & \textbf{33.35} & \textbf{21.55} & \textbf{14.13} \\ \hline \text{BERT} & 3 & 29.45 & 30.95 & 19.5 & 12.93 \\ \text{WoBERT} & 3 & \textbf{32.19} & \textbf{33.72} & \textbf{21.81} & \textbf{14.29} \\ \hline \end{array}} \end{array} \]

This shows that using words as units is actually more advantageous for text generation. If even longer texts were generated, this advantage would be further amplified. Of course, we do not deny that when using WoBERT for sequence labeling tasks like NER, there may be a noticeable drop in performance; for example, on the People's Daily NER dataset, there was a drop of about 3%. Perhaps surprisingly, through error analysis, we found the cause was not segmentation errors, but rather sparsity (on average, there are fewer samples for each word, so the training is not as sufficient).
Regardless, we are open-sourcing our work to provide everyone with an additional choice when using pre-trained models.
In this article, we open-sourced a word-based Chinese BERT model (WoBERT) and discussed the advantages and disadvantages of using words as units. Finally, through experiments, we showed that word-based pre-trained models have unique value in many NLP tasks (especially text generation). They offer an advantage in speed while matching the performance of character-based BERT. We welcome everyone to test it.
Please include the original address when reprinting: https://kexue.fm/archives/7758