By 苏剑林 | September 07, 2020
While browsing arXiv a while ago, I noticed that Tsinghua University had open-sourced a large-scale cleaned Chinese conversation corpus called LCCC (paper link, project address). Judging from the released files, it may well be the largest and highest-quality open-source chitchat corpus currently available, and it even includes some multi-turn conversations, so there is plenty to play with. Intrigued, I tried using it to train a chitchat dialogue model. The results look pretty good, so I'm sharing my experience here.
Corpus Introduction
Here’s a brief introduction to the LCCC dataset (Large-scale Cleaned Chinese Conversation). For specific details, you can visit the GitHub page; the download links are also there. LCCC is divided into "base" and "large" versions. The "base" version mainly originates from Weibo conversations, while "large" integrates other open-source dialogue corpora on top of "base." According to the authors, LCCC underwent a rigorous cleaning process, so the overall quality appears to be very high.
\[\begin{array}{c|c|c}
\hline
\text{LCCC-base} & \text{Single-turn Dialogue} & \text{Multi-turn Dialogue} \\
\hline
\text{Total Dialogues (Sessions)} & 3,354,382 & 3,466,607 \\
\hline
\text{Total Dialogue Utterances} & 6,708,554 & 13,365,268 \\
\hline
\end{array}\]
\[\begin{array}{c|c|c}
\hline
\text{LCCC-large} & \text{Single-turn Dialogue} & \text{Multi-turn Dialogue} \\
\hline
\text{Total Dialogues (Sessions)} & 7,273,804 & 4,733,955 \\
\hline
\text{Total Dialogue Utterances} & 14,547,608 & 18,341,167 \\
\hline
\end{array}\]
To simplify the task, all samples were processed into two-person dialogues. Here are a few examples:
A: When it's Chinese New Year, let's go back and buy some rabbit heads for a good hotpot meal.
B: I haven't seen any good rabbit heads in Taiyuan.
A: I'll bring one back for you from Hongqiao; I saw an authentic one there the other day.
B: Love you the most!
A: That's for sure.

A: Hmm, I'll wait a bit longer! You're in Shanghai now, right? The wind in Shanghai seems even stronger than in Nanjing. Try to stay indoors.
B: Yeah, I'm at home, I'm fine. You be careful too!

A: I also went for a walk around there last year and even bumped into my old PE teacher; we took a photo together.
B: Haha, I went looking for my 10th-grade English teacher but couldn't find her; she happened to have something to do and wasn't at school~
A: You're really out hunting for old memories.
B: Haha, I haven't been back since graduation and wanted to take a look.
Model Design
Now that we know what the data looks like, we need to design the model. Obviously, we want to train a model that, given the context, predicts the appropriate response. Since the corpus contains multi-turn dialogues, the model also needs to support multi-turn dialogue. The simplest way to take the dialogue history into account is to concatenate all utterances up to the current one into a single piece of text and use that as the model's input.
Given an input and asked to predict an output, the natural choice is a Seq2Seq model. Using Seq2Seq directly isn't a big problem, but standard Seq2Seq generally assumes inputs and outputs with relatively fixed forms: for example, input lengths concentrated within a certain range rather than varying wildly. With multi-turn dialogue, however, we don't know in advance how many turns of history precede the response, so in principle the input length is unbounded. Seq2Seq is also inefficient to train here: each sample teaches the model only one response, so a multi-turn dialogue with $n$ responses has to be split into $n$ training samples.
Therefore, we need a model whose input length can vary quite freely and that can predict an entire multi-turn dialogue in one pass. A suitable choice is a unidirectional language model (LM, i.e. GPT-style). The approach is as follows:
Schematic of multi-turn dialogue using a unidirectional language model
As shown in the figure, we choose the now-mainstream Transformer and follow the conventional BERT input format: the dialogue utterances are concatenated with [SEP], and a left-to-right unidirectional language model is trained over the whole sequence. To distinguish the speaking roles, different speakers get different Segment IDs. Additionally, since both BERT and GPT use absolute position embeddings, which impose an upper limit on text length, while the number of dialogue turns is theoretically unbounded, we adopted the NEZHA architecture with its relative position embeddings and used NEZHA's pre-trained weights to initialize the model.
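To make the input format concrete, here is a minimal sketch (not the released preprocessing code; the helper name, the vocab path, and the exact segment assignment are my own assumptions) of how a multi-turn dialogue could be packed into token IDs and segment IDs with bert4keras's Tokenizer:

from bert4keras.tokenizers import Tokenizer

tokenizer = Tokenizer('nezha_base/vocab.txt', do_lower_case=True)  # placeholder path to NEZHA's vocab

def encode_dialogue(utterances):
    # Pack a dialogue into one sequence: [CLS] u1 [SEP] u2 [SEP] u3 [SEP] ...
    # Segment IDs alternate 0/1 to mark the two speakers.
    token_ids, segment_ids = [tokenizer.token_to_id('[CLS]')], [0]
    for i, utt in enumerate(utterances):
        ids = tokenizer.encode(utt)[0][1:]  # drop the leading [CLS], keep the trailing [SEP]
        token_ids += ids
        segment_ids += [i % 2] * len(ids)
    return token_ids, segment_ids

Since every utterance ends with [SEP], every utterance in the sequence is a prediction target for the language model, which is what allows a whole multi-turn dialogue to be trained in a single pass.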
On the model side, the change is simple: we add a lower-triangular Attention Mask to NEZHA, turning it into a language model. For background, please refer to "From Language Models to Seq2Seq: Transformer is All About the Mask".
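In bert4keras, this combination (NEZHA weights plus a lower-triangular attention mask) can be obtained roughly as follows; the config and checkpoint paths are placeholders:

from bert4keras.models import build_transformer_model

config_path = 'nezha_base/bert_config.json'  # placeholder paths for NEZHA Base
checkpoint_path = 'nezha_base/model.ckpt'

# model='nezha' selects the relative-position-embedding architecture;
# application='lm' adds the lower-triangular attention mask, so the network
# becomes a left-to-right language model initialized from NEZHA's weights.
model = build_transformer_model(
    config_path,
    checkpoint_path,
    model='nezha',
    application='lm',
)
model.summary()

The resulting Keras model takes token IDs and segment IDs as input and outputs, at each position, a probability distribution over the vocabulary for the next token.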
Training Details
First, here is my reference implementation and the pre-trained model:
https://github.com/bojone/nezha_gpt_dialog
The code is implemented with bert4keras, and the base architecture is NEZHA Base. To keep training affordable, I used only a single card (a 22G RTX) together with gradient accumulation; readers with less VRAM can reduce the batch_size. The open-sourced model was trained with the Adam optimizer, a learning rate of $2 \times 10^{-5}$ and a weight decay rate of $0.01$, for 10 million steps (a bit over 20 days). The per-step batch_size was 16 with 16 gradient-accumulation steps, which is equivalent to training for 625,000 steps with a batch_size of 256. The final loss was approximately 2.83 (average cross-entropy per token).
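For reference, the optimizer setup described above (Adam with weight decay and gradient accumulation) can be assembled from bert4keras's optimizer extensions roughly like this; the list of parameters excluded from weight decay is the customary choice and an assumption on my part:

from bert4keras.optimizers import Adam, extend_with_weight_decay, extend_with_gradient_accumulation

AdamW = extend_with_weight_decay(Adam, 'AdamW')              # Adam + weight decay
AdamWG = extend_with_gradient_accumulation(AdamW, 'AdamWG')  # ... + gradient accumulation

optimizer = AdamWG(
    learning_rate=2e-5,
    weight_decay_rate=0.01,
    exclude_from_weight_decay=['Norm', 'bias'],  # assumed: LayerNorm and biases skip weight decay
    grad_accum_steps=16,  # 16 accumulated steps x batch_size 16 = effective batch_size 256
)

Compiling the language model with this optimizer and a per-token cross-entropy loss reproduces the setup in spirit; the released repository is the authoritative version.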
Let's test our training results:
>>> print(chatbot.response([u'我最近天天吃火锅'])) (I've been eating hotpot every day lately)
火锅好吃啊 (Hotpot is delicious)
>>> print(chatbot.response([u'我最近天天吃火锅']))
我也是 (Me too)
>>> print(chatbot.response([u'我最近天天吃火锅']))
哈哈哈哈哈哈我最近一周吃了两顿火锅 (Hahahaha I've had hotpot twice this week)
>>> print(chatbot.response([u'我最近天天吃火锅']))
我也是,每次都吃的撑 (Me too, I get stuffed every time)
>>> print(chatbot.response([u'我最近天天吃火锅']))
吃火锅不长胖 (Eating hotpot won't make you fat)
>>> print(chatbot.response([u'我最近天天吃火锅']))
火锅好啊 (Hotpot is good)
>>> print(chatbot.response([u'我最近天天吃火锅']))
你们那边有这种火锅? (They have that kind of hotpot where you are?)
>>> print(chatbot.response([u'我最近天天吃火锅']))
火锅好吃吗?我也想吃 (Is the hotpot good? I want some too)
>>> print(chatbot.response([u'我最近天天吃火锅']))
吃吧,吃了就开心了 (Go eat it, you'll be happy once you eat it)
>>> print(chatbot.response([u'我最近天天吃火锅']))
我也想天天吃火锅! (I want to eat hotpot every day too!)
>>> print(chatbot.response([u'我最近天天吃火锅']))
火锅吃不了多久的。(You can't eat hotpot for very long.)
>>> print(chatbot.response([u'你那天气好吗'])) (How's the weather where you are?)
还好,不热,有点冷 (It's okay, not hot, a little cold)
>>> print(chatbot.response([u'你那天气好吗', u'还好,不热,有点冷', u'那你得穿多厚的衣服'])) (Then how thick are the clothes you have to wear?)
我穿了两件 (I'm wearing two layers)
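The varied replies above come from random sampling during decoding: each call draws the next token from the top-k candidates instead of always taking the argmax. The following self-contained sketch (not the repo's chatbot implementation; it reuses the hypothetical encode_dialogue helper from earlier) shows the idea:

import numpy as np

def sample_response(model, tokenizer, history, topk=5, max_len=32):
    # Feed the dialogue history, then sample the reply token by token until [SEP].
    token_ids, segment_ids = encode_dialogue(history)
    reply_segment = len(history) % 2  # the reply belongs to the next speaker
    sep_id = tokenizer.token_to_id('[SEP]')
    reply_ids = []
    for _ in range(max_len):
        probas = model.predict([np.array([token_ids]), np.array([segment_ids])])[0, -1]
        top = probas.argsort()[-topk:]  # indices of the top-k tokens
        next_id = int(np.random.choice(top, p=probas[top] / probas[top].sum()))
        if next_id == sep_id:  # [SEP] closes the reply
            break
        reply_ids.append(next_id)
        token_ids.append(next_id)
        segment_ids.append(reply_segment)
    return tokenizer.decode(reply_ids)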
Comparative Analysis
CDial-GPT also open-sourced their own pre-trained models, and I have converted them into a format that bert4keras can load (CDial-GPT-tf), so readers can test and compare. In terms of training, CDial-GPT uses a PyTorch implementation with a GPT Base architecture. They used four 2080Ti cards with a total batch size of 32 and 64 gradient-accumulation steps. The paper states they trained for 30 epochs, totaling about 21 million steps (roughly twice as many as mine), which is equivalent to about 330,000 steps at an effective batch size of 2048 (32 × 64).
In terms of input design, CDial-GPT is also different, as shown below:
CDial-GPT model schematic
As shown in the figure, the main difference between CDial-GPT and the design described above is how the turns of a multi-turn dialogue are concatenated. We connect them directly with [SEP], whereas they connect them with role tags such as [speaker1] and [speaker2] (abbreviated S1, S2 in the figure) and use a single [SEP] only at the end to mark the end of the response. Because the predicted part is therefore formatted differently from the history, only one response can be trained per sample: a multi-turn dialogue must be split into multiple samples, which in theory makes training more expensive (several steps are needed to cover a single multi-turn dialogue).
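Schematically, the two layouts for a three-utterance dialogue look roughly like this (the exact placement of CDial-GPT's special tokens is an assumption based on the description above; their repository is the authoritative reference):

# Our format: every utterance ends with [SEP], so every utterance is a training target.
ours = ['[CLS]', 'u1', '[SEP]', 'u2', '[SEP]', 'u3', '[SEP]']

# CDial-GPT-style format: role tags join the utterances and a single [SEP]
# marks the end of the response, so only u3 is a training target here.
cdial_gpt = ['[CLS]', '[speaker1]', 'u1', '[speaker2]', 'u2', '[speaker1]', 'u3', '[SEP]']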
As for actual performance, in my own testing I couldn't feel any significant difference between the two. Interested readers can compare and test for themselves.
Summary
This article shared a hands-on exercise in dialogue modeling: based on the LCCC chitchat corpus open-sourced by the CDial-GPT team, we used a (GPT-style) language model to generatively model multi-turn dialogues and obtained a reasonably general chitchat model. Finally, the approach was compared with the models originally open-sourced by CDial-GPT.