By 苏剑林 | November 06, 2020
I wonder if everyone still has an impression of Google's chart-topping work T5 from last year? That's the model that, under the banner of "Everything is Seq2Seq," scaled up to 11 billion parameters and swept multiple NLP leaderboards like GLUE and SuperGLUE. Even after a year, T5 still holds the top spot on the SuperGLUE leaderboard, currently maintaining a steady 2% lead over second place. However, for friends in the Chinese NLP community, T5 might not have had much presence for a simple reason: there was no Chinese version of T5 available. But this situation is about to change, because Google recently released the multilingual version of T5 (mT5), which naturally includes Chinese. Although it’s not a "pure" Chinese version, it’s good enough to make do with.

"Everything is Seq2Seq" T5
This article will provide a brief review and introduction to the T5 model, and then explain how to call the mT5 model in bert4keras for Chinese tasks. As a native Seq2Seq pre-trained model, mT5 performs quite well on text generation tasks and is well worth a try.
Like BERT, T5 is also produced by Google, coming from the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", with the GitHub repository at text-to-text-transfer-transformer. The core philosophy of T5 is "Everything is Seq2Seq." It uses a standard Encoder-Decoder model and constructs unsupervised/supervised text generation pre-training tasks, ultimately pushing performance to a new height.
T5's pre-training includes both unsupervised and supervised parts. The unsupervised part uses nearly 800G of corpus constructed by Google (referred to in the paper as C4), and the training objective is similar to BERT, but modified into a Seq2Seq version. We can think of it as an advanced version of fill-in-the-blanks:
Input: When will the bright moon be here? [M0] ask the blue sky, not knowing [M1], what year is it tonight? I desire [M2] return, yet fear the jade towers and palace of pearl, where the high altitude [M3]; dancing to [M4] clear shadow, what is it like in the human world.
Output: [M0] holding a wine cup [M1] the palace in the heavens [M2] riding the wind [M3] is unbearably cold [M4] playing with
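The mapping from raw text to such an (input, output) pair is easy to sketch in code. Below is a toy span-corruption routine in Python; it is only illustrative (T5's real preprocessing works on SentencePiece tokens and samples span positions and lengths randomly), using hand-picked character spans and [Mi] sentinels:

def span_corruption(text, spans):
    """Toy span corruption: replace each (start, end) character span in `text`
    with a sentinel [Mi] and collect the removed pieces as the target."""
    source, target, last = [], [], 0
    for i, (start, end) in enumerate(spans):
        source.append(text[last:start] + '[M%d]' % i)
        target.append('[M%d]' % i + text[start:end])
        last = end
    source.append(text[last:])
    return ''.join(source), ''.join(target)

text = 'Thank you for inviting me to your party last week.'
inp, out = span_corruption(text, [(6, 10), (29, 39)])
print(inp)  # Thank [M0]for inviting me to [M1] last week.
print(out)  # [M0]you [M1]your party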
The supervised part involves collecting common NLP supervised task data and uniformly transforming them into Seq2Seq tasks for training. For example, sentiment classification can be transformed like this:
Input: Identify the sentiment tendency of this sentence: I felt very good about this trip to Beijing.
Output: Positive
Topic classification can be transformed like this:
Input: What kind of news is the following? After eight months, we can finally see the volleyball girls on the court again.
Output: Sports
Reading comprehension can be transformed like this:
Input: Reading Comprehension: Trump and Biden are running together for the next US president. Answer the question based on the above information: What nationality is Trump?
Output: American
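In code, this amounts to nothing more than formatting each supervised example as a (source text, target text) pair before handing it to the same Seq2Seq model. Below is a small hypothetical sketch; the prompt wording and label strings are illustrative, not T5's actual templates:

def to_seq2seq(task, example):
    """Cast a supervised example into a (source, target) text pair."""
    if task == 'sentiment':
        return ('Identify the sentiment tendency of this sentence: ' + example['text'],
                example['label'])  # e.g. 'Positive'
    if task == 'topic':
        return ('What kind of news is the following? ' + example['text'],
                example['label'])  # e.g. 'Sports'
    if task == 'reading_comprehension':
        return ('Reading Comprehension: %s Answer the question based on the above information: %s'
                % (example['passage'], example['question']),
                example['answer'])
    raise ValueError('unknown task: %s' % task)

src, tgt = to_seq2seq('sentiment',
                      {'text': 'I felt very good about this trip to Beijing.',
                       'label': 'Positive'})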
As can be seen, this transformation is consistent with the ideas of GPT2, GPT3, and PET—all aiming to use text to express the tasks we want to perform and then transforming them into text prediction. Readers can also refer to the previous post "Do We Really Need GPT3? No, BERT’s MLM Model Can Also Do Few-shot Learning" to understand related content. In general, based on our internal experiments, the large model size, massive data, and supervised pre-training are all key factors in T5's success. "Everything is Seq2Seq" provides an effective scheme to integrate these key factors.
T5's main achievements are summarized in the table below:

T5 Score Summary
Beyond dominating multiple leaderboards, T5 also systematically ablated many of the tunable choices across the whole training pipeline: for instance, whether a standard Encoder-Decoder or a UniLM-style architecture works better, whether BERT-style masking or some other corruption scheme makes the better unsupervised pre-training objective, whether a random mask ratio of 15% is optimal, and so on. They summarized the results in the table below, and even added the somewhat regretful remark that "in fact, we feel that T5's experiments are not thorough enough," which gives the impression of walking every path so that others have none left to walk. In any case, these "alchemy" results are worth reading for anyone who wants to train language models; they may save us quite a few detours.

T5's exhaustive "Alchemy Bible"
As for mT5, which is Multilingual T5, the multilingual version of T5, it comes from the recent paper "mT5: A massively multilingual pre-trained text-to-text transformer", with the GitHub repository at multilingual-t5. This has once again pushed the leaderboard scores for multilingual NLP tasks to a new height. Of course, for us, the most important thing is that mT5 includes Chinese, so we finally have the opportunity to try T5 on Chinese tasks.
Overall, mT5 follows the same lineage as T5 and is basically the same, but in terms of model structure, mT5 uses the T5.1.1 scheme. Here is a basic introduction to it.
What many people don't know is that, at some point after its release last October, T5 quietly received a minor upgrade earlier this year; the details can be found at the GitHub link. The pre-upgrade version is officially called T5.1.0 and the upgraded one T5.1.1. The main changes come from the paper "GLU Variants Improve Transformer", which borrows the GLU (Gated Linear Unit) of "Language Modeling with Gated Convolutional Networks" to strengthen the FFN (Feed-Forward Network) part. Specifically, the original T5 FFN was (T5 has no bias terms):
\begin{equation}\text{FFN}(x)=\text{relu}(xW_1)W_2\end{equation}
and it has now been changed to:
\begin{equation}\text{FFN}_{\text{GEGLU}}(x)=\big(\text{gelu}(xW_1)\otimes xW_2\big)W_3\end{equation}
That is, the first transformation layer, previously ReLU-activated, becomes a GeLU-activated gated linear unit. This increases the FFN's parameter count by 50%, but according to the paper the gain in quality is significant.
In addition, T5.1.1 changes the Embedding layers. In T5.1.0, the Encoder Embedding, the Decoder Embedding, and the Softmax layer that produces the Decoder's final output distribution all shared the same Embedding matrix. T5.1.1 shares the Embedding layer only between the Encoder and Decoder, while the output Softmax uses an independent Embedding matrix. This naturally adds a considerable number of parameters, but Google's conclusion is that it performs better; the same conclusion is also summarized in the recent paper "Rethinking embedding coupling in pre-trained language models". One last change: T5.1.1 removes Dropout during pre-training and only uses it during downstream fine-tuning.
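To make the FFN change above concrete, here is a minimal stand-alone tf.keras sketch of the two variants (independent of bert4keras; the layer sizes are hypothetical and chosen purely for illustration):

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + tf.math.tanh(
        np.sqrt(2.0 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))

d_model, d_ff = 512, 1024  # hypothetical sizes

# T5.1.0-style FFN: relu(x W1) W2  (no bias terms, as in T5)
w1 = layers.Dense(d_ff, activation='relu', use_bias=False)
w2 = layers.Dense(d_model, use_bias=False)
def ffn(x):
    return w2(w1(x))

# T5.1.1-style GEGLU FFN: (gelu(x W1) * x W2) W3
w1_gate = layers.Dense(d_ff, use_bias=False)  # gate branch (gelu-activated)
w2_lin = layers.Dense(d_ff, use_bias=False)   # linear branch
w3_out = layers.Dense(d_model, use_bias=False)
def ffn_geglu(x):
    return w3_out(gelu(w1_gate(x)) * w2_lin(x))

x = tf.random.normal((2, 8, d_model))    # (batch, seq_len, d_model)
print(ffn(x).shape, ffn_geglu(x).shape)  # both (2, 8, 512)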
After these adjustments, Google re-trained and opened up the full series of T5.1.1 models. Download links can be found at the GitHub link mentioned above. Note that T5.1.1 only underwent unsupervised pre-training, but its performance is still outstanding. Since T5.1.1 shows significant improvements, mT5 continued to use the T5.1.1 structure.
mT5 essentially amounts to rebuilding a multilingual dataset, mC4, and then running another round of training with the T5.1.1 recipe; in terms of approach there is no major technical innovation. For training details, readers can refer to the original paper, which is not long, since T5 has already paved the way.
As for mT5’s performance, it is mainly concentrated in the following table:

mT5 "Record"
Readers might wonder how such a multilingual version is evaluated. Simply put, we could directly fine-tune a cross-lingual machine translation task on this basis to see the improvement. However, for multilingual models, researchers are more concerned with their Zero-Shot performance in cross-lingual tasks. To put it bluntly, for the same task, if we fine-tune on one language, can the model be used directly on other languages? This is the meaning of "Cross-lingual zero-shot transfer (models fine-tuned on English data only)" in the table above. It can be seen that mT5's performance is quite impressive.
Finally, we come to the part everyone loves: practice. Here we will briefly introduce the process and techniques of using the mT5 model for Chinese text generation tasks in bert4keras. bert4keras started supporting mT5 from version 0.9.1. Readers who wish to conduct the following experiments should first upgrade bert4keras to version 0.9.1 or higher.
GitHub Link: https://github.com/bojone/t5_in_bert4keras
The basic code to load the mT5 model into Keras using bert4keras is:
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import SpTokenizer

# Model paths
config_path = '/root/kg/bert/mt5/mt5_small/t5_config.json'
checkpoint_path = '/root/kg/bert/mt5/mt5_small/model.ckpt-1000000'
spm_path = '/root/kg/bert/mt5/sentencepiece.model'

# Load tokenizer (no start token; '</s>' marks the end of a sequence)
tokenizer = SpTokenizer(spm_path, token_start=None, token_end='</s>')

# Load model
t5 = build_transformer_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    model='t5.1.1',
    return_keras_model=False,
    name='T5',
)

encoder = t5.encoder
decoder = t5.decoder
model = t5.model
It can be seen that there isn't much difference from loading BERT in bert4keras. The construction of t5_config.json and the download of model.ckpt-1000000 are detailed on GitHub, so please head there to take a look. Complete code (including training and decoding details) can also be found on GitHub and won't be expanded upon here.
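To give a flavour of the decoding side all the same, here is a condensed sketch modeled on the example scripts in that repo (names such as max_c_len are placeholders, and the exact AutoRegressiveDecoder interface may differ slightly across bert4keras versions):

import numpy as np
from bert4keras.snippets import AutoRegressiveDecoder

max_c_len = 256  # placeholder maximum input length

class AutoTitle(AutoRegressiveDecoder):
    """Beam-search decoding on top of the mT5 encoder/decoder."""
    @AutoRegressiveDecoder.wraps(default_rtype='probas')
    def predict(self, inputs, output_ids, states):
        c_encoded = inputs[0]
        # feed the encoder output plus the tokens decoded so far,
        # and keep only the distribution at the last position
        return decoder.predict([c_encoded, output_ids])[:, -1]

    def generate(self, text, topk=1):
        c_token_ids, _ = tokenizer.encode(text, maxlen=max_c_len)
        c_encoded = encoder.predict(np.array([c_token_ids]))[0]
        output_ids = self.beam_search([c_encoded], topk=topk)
        return tokenizer.decode([int(i) for i in output_ids])

# T5's decoder starts from id 0 (the pad token) and stops at '</s>'
autotitle = AutoTitle(start_id=0, end_id=tokenizer._token_end_id, maxlen=32)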
It is worth mentioning that, for Chinese, the tokenizer's output contains multi-character words; in other words, mT5 treats Chinese at the word level, albeit with fairly fine-grained words. This further confirms that our earlier work "Speed Up Without Performance Loss: Chinese WoBERT Based on Word Granularity" was heading in the right direction.
I believe most readers of this blog mainly care about Chinese tasks, some may also care about English, and very few care about languages other than Chinese and English. However, mT5 covers 101 languages with a vocabulary of about 250,000 tokens, and since the output Softmax in the T5.1.1 structure no longer shares parameters with the input Embedding, the Embedding-related matrices account for a large share of the parameters. For example, the mT5 small model has about 300 million parameters, of which roughly 250 million sit in the Embedding matrices. The key point is that most of these parameters are useless to us; they are pure waste. So for those of us who mainly care about Chinese tasks, it is worth trimming this Embedding layer down.
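As a rough sanity check on that figure (assuming the small model's hidden size of 512): $250{,}000 \times 512 \approx 1.3\times 10^{8}$ parameters per Embedding matrix, and since T5.1.1 keeps the input Embedding and the output Softmax matrices separate, that comes to roughly $2.6\times 10^{8}$ in total, consistent with the 250 million quoted above.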
Trimming the model itself is easy: just delete the unneeded rows from the two Embedding matrices. The crux is deciding which tokens to keep and then producing a matching, slimmed-down SentencePiece model. For the token selection, a simple idea is to keep the Chinese tokens, though not only Chinese, since part of the English vocabulary has to stay as well. It looks like a job for a regular expression, but it is not that simple: tokens written in Latin letters are not necessarily English, and tokens containing Chinese characters are not necessarily Chinese, so it is hard to draw a clean line. So I took another route: run this 250,000-token tokenizer over the tens of gigabytes of Chinese corpus I had collected, count the tokenization results, and keep the most frequent tokens (in the end a little over 30,000 were kept). This takes more time, but it is more reliable and guarantees that the tokens we need most are retained. Once the vocabulary is fixed, a new SentencePiece model still has to be produced to match it, which is also a bit fiddly, but after some searching I eventually solved that too; the processing scripts are shared on GitHub.
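Here is a minimal sketch of the frequency-counting step (the corpus file name and the cut-off are hypothetical; the actual scripts, including how the SentencePiece model itself is trimmed, live in the GitHub repo above):

import json
from collections import Counter
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('/root/kg/bert/mt5/sentencepiece.model')

counter = Counter()
with open('corpus.txt', encoding='utf-8') as f:  # hypothetical Chinese corpus, one text per line
    for line in f:
        counter.update(sp.EncodeAsIds(line.strip()))

# keep the most frequent ids, plus the special tokens <pad>, </s>, <unk> (ids 0-2 in T5's vocab)
keep_ids = sorted(set(range(3)) | {i for i, _ in counter.most_common(30000)})
json.dump(keep_ids, open('sentencepiece_cn_keep_tokens.json', 'w'))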
After this processing, to build a new model, you only need to add three lines of code related to keep_tokens. The required video memory is greatly reduced, and the performance of Chinese generation remains basically unchanged:
import json

from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import SpTokenizer

# Model paths
config_path = '/root/kg/bert/mt5/mt5_base/t5_config.json'
checkpoint_path = '/root/kg/bert/mt5/mt5_base/model.ckpt-1000000'
spm_path = '/root/kg/bert/mt5/sentencepiece_cn.model'
keep_tokens_path = '/root/kg/bert/mt5/sentencepiece_cn_keep_tokens.json'

# Load tokenizer (trimmed Chinese SentencePiece model)
tokenizer = SpTokenizer(spm_path, token_start=None, token_end='</s>')
keep_tokens = json.load(open(keep_tokens_path))

# Load model, keeping only the Embedding rows listed in keep_tokens
t5 = build_transformer_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    keep_tokens=keep_tokens,
    model='t5.1.1',
    return_keras_model=False,
    name='T5',
)

encoder = t5.encoder
decoder = t5.decoder
model = t5.model
Finally, everyone is likely concerned with whether, after all this effort, the generation effect has improved and if it's worth using. Let's put it this way: a CSL title generation model fine-tuned using the mT5 small version can match the BLEU score of a UniLM model based on WoBERT, while being 130% faster in decoding. A CSL title generation model fine-tuned with the mT5 base version can exceed the WoBERT-based UniLM model by more than 1% in metrics, and the decoding speed is also 60% faster.
Simply put, it is indeed both faster and better. As for hardware requirements, students who have successfully run BERT base before should basically be able to run mT5 small/base, and can even try the large version. As for XL and XXL, those are difficult to handle, and I suggest giving up on them. More surprises are waiting for everyone to discover for themselves... Oh, by the way, I need to remind you that when fine-tuning the T5 model, the learning rate should be more than 10 times larger than when fine-tuning BERT (i.e., at the $10^{-4}$ level, while BERT is generally at $10^{-5}$). This is determined by the architectural differences between the two models.
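For concreteness, a minimal sketch of that compile step (assuming, as in bert4keras's example scripts, that the loss is attached inside the model via a custom layer; otherwise pass your loss to compile as usual):

from keras.optimizers import Adam  # or tensorflow.keras.optimizers, depending on your setup

# T5/mT5 fine-tuning: learning rate around 1e-4, roughly 10x the usual 1e-5 for BERT
model.compile(optimizer=Adam(1e-4))
model.summary()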
This article reviewed the T5 model released by Google last year, then introduced the recently released multilingual mT5, and finally explained how to fine-tune mT5 for Chinese tasks in bert4keras. The results show that mT5 performs very well in Chinese generation and is worth a try for anyone working on text generation tasks.