By 苏剑林 | November 20, 2020
I believe many readers have seen the "Qingyuan Plan" launched a few days ago by Tsinghua University and the Beijing Academy of Artificial Intelligence (BAAI) (related link: "Chinese version of GPT-3? BAAI Releases Qingyuan CPM — A Large-scale Pre-trained Model Centered on Chinese"). Under this plan, CPM-LM (2.6 billion parameters), currently the largest Chinese GPT-2 model, has been open-sourced, and reportedly models with 20 billion or even 100 billion parameters will follow, aiming to create a "Chinese version of GPT-3."

(Figure: official Few Shot demonstration of CPM-LM)
We know that GPT-3 can do Few Shot learning without any fine-tuning, and the current demonstration examples of CPM-LM show quite impressive Few Shot performance as well, which makes one eager to try it out. Naturally, I wanted to adapt it to my bert4keras library so that it would be easier to use, and so the adaptation work began. I originally thought it would be a simple task, but I ended up falling into one pitfall after another for nearly three days before getting it right. This post records that process of troubleshooting and testing.
Model Introduction
The first model released under this plan is called CPM-LM, with approximately 2.6 billion parameters and pre-trained on 100GB of Chinese text data. It is a unidirectional language model. For other details, you can read more at the links below. With such a massive parameter count, we generally use it directly rather than considering fine-tuning. Its primary capability is unconditional random text generation. Of course, we can also provide it with some prompts and use it for text continuation; applications like Few Shot are essentially variants of text continuation.
Homepage: https://cpm.baai.ac.cn/
GitHub: https://github.com/TsinghuaAI/CPM-Generate
Official Account: https://mp.weixin.qq.com/s/oI2Ak-M57MSuycLVpVEiHw
Regarding the model structure, which was the first pitfall I encountered during adaptation: the CPM-LM architecture is identical to OpenAI's GPT-2, so in plain terms this is a 2.6-billion-parameter Chinese GPT-2 model. Initially I didn't look closely enough and was misled by the CPM-LM-TF2 project into believing its structure was like GPT2_ML (GPT2_ML is neither GPT nor GPT-2; it sits somewhere in between), and as a result I spent a long time unable to get reasonable outputs. Once I realized this, rebuilding the GPT-2 model and adapting the weights was not difficult, including converting the weights to TF format, for which the CPM-LM-TF2 project served as a handy reference.
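For concreteness, here is a minimal loading sketch, assuming the weights have already been converted to a TensorFlow checkpoint; the paths below are placeholders, and 'gpt2' is the identifier bert4keras uses for the standard GPT-2 architecture (as opposed to 'gpt2_ml'):
# Minimal loading sketch; config/checkpoint paths are placeholders for the
# converted TF weights.
from bert4keras.models import build_transformer_model

config_path = 'CPM-LM/config.json'
checkpoint_path = 'CPM-LM/model.ckpt'

model = build_transformer_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    model='gpt2',  # standard GPT-2, not 'gpt2_ml'
)
model.summary()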
Tokenizer
The second pitfall in the adaptation process concerned the tokenizer. I must say, the tokenizer written for CPM-LM is, in my view, quite unrefined, and it still bothers me.
The tokenizer is essentially a wrapper around Google's sentencepiece, but it is wrapped in a way that is particularly inelegant—a nightmare for anyone with OCD. Specifically, while tokenizers like BERT's or sentencepiece usually remove delimiters like spaces and newlines by default, CPM-LM wants to preserve them. So, before sending text to the tokenizer, it replaces them with other symbols (currently spaces are replaced with "▂" and newlines with "▃"), and then replaces them back before output. This is a common approach and is understandable. However, what I cannot understand is that the replacement symbol for the newline, "▃", is actually not in its sentencepiece model's vocabulary! To prevent "▃" from becoming an Unknown token (<unk>), CPM-LM performs a second replacement to turn it into a specific string, only then obtaining the newline ID...
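As an illustration of the scheme just described, here is a rough sketch over raw sentencepiece; the model path and the in-vocab placeholder string for the newline are my own stand-ins, not the exact choices made by the official tokenizer:
# Rough sketch of the delimiter-preserving trick; the newline placeholder
# string is hypothetical (the official tokenizer uses its own choice).
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('chinese_vocab.model')  # hypothetical path to the CPM sentencepiece model

SPACE_STAND_IN = '\u2582'     # "▂", present in the vocab
NEWLINE_STAND_IN = '\u2583'   # "▃", NOT in the vocab, hence the extra step
NEWLINE_PLACEHOLDER = '<n>'   # hypothetical in-vocab stand-in for "▃"

def cpm_style_encode(text):
    # keep spaces/newlines by mapping them to visible stand-ins first
    text = text.replace(' ', SPACE_STAND_IN).replace('\n', NEWLINE_STAND_IN)
    # "▃" would become <unk>, so it is replaced a second time with a string the
    # sentencepiece model does know; in the official tokenizer that string maps
    # to the id then treated as the "newline" id
    text = text.replace(NEWLINE_STAND_IN, NEWLINE_PLACEHOLDER)
    return sp.EncodeAsIds(text)

def cpm_style_decode(ids):
    text = sp.DecodeIds(ids)
    # undo the replacements on the way out
    return text.replace(NEWLINE_PLACEHOLDER, '\n').replace(SPACE_STAND_IN, ' ')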
When I first saw this design, my mind nearly collapsed: Is it that difficult to add one more character to the sentencepiece model? Why write it like this... Regardless, since the open-source model providers are the "bosses," I had to find a way to adapt to it. After much thought, I patched bert4keras's original SpTokenizer and managed to get it working.
Usage Test
Enough complaining. Anyway, after more than two days of effort, bert4keras can now load the CPM-LM model starting from version 0.9.3. Running inference alone likely requires over 16GB of VRAM (I personally used a 24GB RTX card). The weight conversion process and basic loading scheme can be found here:
GitHub: https://github.com/bojone/CPM_LM_bert4keras
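The snippets below rely on a decoder object called text_expansion, which is defined in the repo above. As a reference, here is a rough reconstruction of what such a decoder looks like, built on the model from the earlier loading sketch; the SpTokenizer here would still need the patching described in the tokenizer section, and the end-token id, maximum generation length and sampling parameters are placeholders rather than the repo's exact values:
# Rough sketch of a continuation decoder along the lines of the repo's
# text_expansion; `model` is the GPT-2 model built in the earlier sketch.
import numpy as np
from bert4keras.tokenizers import SpTokenizer
from bert4keras.snippets import AutoRegressiveDecoder

# the real tokenizer is a patched SpTokenizer that preserves spaces/newlines
tokenizer = SpTokenizer('chinese_vocab.model', token_start=None, token_end=None)

class TextExpansion(AutoRegressiveDecoder):
    """Continue a given prefix with the language model."""
    @AutoRegressiveDecoder.wraps(default_rtype='probas')
    def predict(self, inputs, output_ids, states):
        # feed prefix + generated tokens, return the next-token distribution
        token_ids = np.concatenate([inputs[0], output_ids], 1)
        return model.predict(token_ids)[:, -1]

    def generate(self, text, n=1, topp=0.95):
        token_ids, _ = tokenizer.encode(text)
        # nucleus sampling; swap in self.beam_search([token_ids], topk=...)
        # if you only want the single most likely continuation
        results = self.random_sample([token_ids], n, topp=topp)
        return [text + tokenizer.decode([int(i) for i in ids]) for ids in results]

text_expansion = TextExpansion(
    start_id=None,
    end_id=3,    # placeholder: use the real end/newline id from the CPM vocab
    maxlen=32,   # placeholder cap on the number of generated tokens
)
With these pieces in place, a call like text_expansion.generate(query, 1)[0] returns the prompt with its generated continuation appended, which is how the examples below are produced.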
Here are some Few Shot results (the outputs are sampled, so they have some randomness; if you only care about the Few Shot answer itself, consider switching the decoding method to beam search):
# Common Sense Reasoning
# Result: Beijing
query = u"""
The capital of the USA is Washington
The capital of France is Paris
The capital of Japan is Tokyo
The capital of China is
"""
print(text_expansion.generate(query[1:-1], 1)[0])
# Word Translation
# Result: bird
query = u"""
狗 dog
猫 cat
猪 pig
鸟
"""
print(text_expansion.generate(query[1:-1], 1)[0])
# Subject Extraction
# Result: 杨振宁 (Chen-Ning Yang)
query = u"""
从1931年起,华罗庚在清华大学边学习边工作 华罗庚
在一间简陋的房间里,陈景润攻克了“哥德巴赫猜想” 陈景润
在这里,丘成桐得到IBM奖学金 丘成桐
杨振宁在粒子物理学、统计力学和凝聚态物理等领域作出里程碑性贡献
"""
print(text_expansion.generate(query[1:-1], 1)[0])
# Triplet Extraction
# Result: 张红,体重,140斤 (Zhang Hong, weight, 140 jin)
query = u"""
姚明的身高是211cm,是很多人心目中的偶像。 ->姚明,身高,211cm
虽然周杰伦在欧洲办的婚礼,但是他是土生土长的中国人->周杰伦,国籍,中国
小明出生于武汉,但是却不喜欢在武汉生成,长大后去了北京。->小明,出生地,武汉
吴亦凡是很多人的偶像,但是他却是加拿大人,另很多人失望->吴亦凡,国籍,加拿大
武耀的生日在5月8号,这一天,大家都为他庆祝了生日->武耀,生日,5月8号
《青花瓷》是周杰伦最得意的一首歌。->周杰伦,作品,《青花瓷》
北京是中国的首都。->中国,首都,北京
蒋碧的家乡在盘龙城,毕业后去了深圳工作。->蒋碧,籍贯,盘龙城
上周我们和王立一起去了他的家乡云南玩昨天才回到了武汉。->王立,籍贯,云南
昨天11月17号,我和朋友一起去了海底捞,期间服务员为我的朋友刘章庆祝了生日。->刘章,生日,11月17号
张红的体重达到了140斤,她很苦恼。->
"""
print(text_expansion.generate(query[1:-1], 1)[0])
Summary
This article briefly introduced CPM-LM, the new 2.6-billion-parameter Chinese GPT-2 model recently open-sourced by Tsinghua University and BAAI, and described adapting it to the bert4keras framework. Along the way I vented a little about the pitfalls encountered during the conversion, and finally demonstrated CPM-LM's impressive Few Shot performance.