[Searching for Text] · (III) Text Sampling Based on BERT

By 苏剑林 | Jan 22, 2021

Starting from this article, we will apply the sampling algorithms introduced earlier to concrete text generation examples. As the first example, we introduce how to use BERT for random text sampling, i.e. randomly generating natural-language sentences from a model. The common view is that such random sampling is an ability unique to unidirectional autoregressive language models like GPT-2 and GPT-3, and that bidirectional Masked Language Models (MLM) like BERT cannot do it. Is that really the case? Of course not. BERT's MLM model can also be used for text sampling, and the procedure is in fact exactly the Gibbs sampling introduced in the previous article. This was first clearly pointed out in the paper "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model", whose title is quite amusing: BERT has a mouth, so it must say something. Now let's see what BERT can actually say~

Sampling Process

First, let's review the Gibbs sampling process introduced in the previous article:

Gibbs Sampling
The initial state is $\boldsymbol{x}_0=(x_{0,1},x_{0,2},\cdots,x_{0,l})$, and the state at time $t$ is $\boldsymbol{x}_t=(x_{t,1},x_{t,2},\cdots,x_{t,l})$.
Sample $\boldsymbol{x}_{t+1}$ through the following flow:
1. Uniformly sample an index $i$ from $1,2,\cdots,l$;
2. Calculate $$p(y|\boldsymbol{x}_{t,-i})=\frac{p(x_{t,1},\cdots,x_{t,i-1},y,x_{t,i+1},\cdots,x_{t,l})}{\sum\limits_y p(x_{t,1},\cdots,x_{t,i-1},y,x_{t,i+1},\cdots,x_{t,l})}$$
3. Sample $y\sim p(y|\boldsymbol{x}_{t,-i})$;
4. $\boldsymbol{x}_{t+1} = (x_{t,1},\cdots,x_{t,i-1},y,x_{t,i+1},\cdots,x_{t,l})$ (i.e., replace the $i$-th position of $\boldsymbol{x}_t$ with $y$ to serve as $\boldsymbol{x}_{t+1}$).
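For concreteness, here is a minimal, framework-agnostic sketch of a single Gibbs update in Python. The function `sample_conditional` is a hypothetical stand-in for whatever model supplies $p(y|\boldsymbol{x}_{-i})$, not something from the original article:

```python
import random

def gibbs_step(x, sample_conditional):
    """One Gibbs update: resample one uniformly chosen position.

    `x` is the current state (a list of tokens); `sample_conditional(x, i)`
    is a hypothetical callable that draws y ~ p(y | x_{-i}), i.e. the
    distribution of position i given all the other positions.
    """
    i = random.randrange(len(x))      # step 1: pick a position uniformly
    y = sample_conditional(x, i)      # steps 2-3: sample y ~ p(y | x_{-i})
    x_next = list(x)
    x_next[i] = y                     # step 4: replace position i with y
    return x_next
```

Iterating `gibbs_step` produces the chain $\boldsymbol{x}_0,\boldsymbol{x}_1,\boldsymbol{x}_2,\cdots$.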

The most critical step is the computation of $p(y|\boldsymbol{x}_{-i})$, which means: remove the $i$-th element and predict its distribution from the remaining $l-1$ elements. Readers familiar with BERT should immediately recognize that this is exactly what BERT's MLM model does. Therefore, combining the MLM model with Gibbs sampling does indeed give us random text sampling.

Thus, instantiating the above Gibbs sampling process with the MLM model yields the following flow for text sampling:

MLM Model Random Sampling
The initial sentence is $\boldsymbol{x}_0=(x_{0,1},x_{0,2},\cdots,x_{0,l})$, and the sentence at time $t$ is $\boldsymbol{x}_t=(x_{t,1},x_{t,2},\cdots,x_{t,l})$.
Sample a new sentence $\boldsymbol{x}_{t+1}$ through the following flow:
1. Uniformly sample an index $i$ from $1,2,\cdots,l$, and replace the token at the $i$-th position with [MASK] to get the sequence $\boldsymbol{x}_{t,-i}=(x_{t,1},\cdots,x_{t,i-1},\text{[MASK]},x_{t,i+1},\cdots,x_{t,l})$;
2. Feed $\boldsymbol{x}_{t,-i}$ into the MLM model and calculate the probability distribution at the $i$-th position, denoted as $p_{t+1}$;
3. Sample a token from $p_{t+1}$, denoted as $y$;
4. Replace the $i$-th token of $\boldsymbol{x}_t$ with $y$ to serve as $\boldsymbol{x}_{t+1}$.

Readers might notice that this sampling process can only produce sentences of a fixed length; the sentence length never changes. This is indeed the case, because Gibbs sampling can only sample from a single given distribution, and sentences of different lengths in fact belong to different distributions, which in theory do not overlap at all. It is just that when we build language models we usually use autoregressive models that model sentences of all lengths in a unified way, so we rarely notice the fact that "sentences of different lengths belong to different probability distributions."

Of course, this is not unsolvable. The original paper "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model" points out that the initial sentence can be a sequence consisting entirely of [MASK] tokens. In this way, we can first randomly sample a length $l$, then use $l$ [MASK] tokens as the initial sentence to start the Gibbs sampling process, and thereby obtain sentences of different lengths.

Reference Code

With an MLM model in hand, implementing the above Gibbs sampling is a very simple task. A reference implementation based on bert4keras is sketched below:
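The following is a minimal sketch rather than a verbatim script: the three file paths are placeholders for a local copy of Google's Chinese BERT base, and the helper `gibbs_sample` together with its arguments (`init_text`, `length`, `steps`, `print_every`) are names chosen here purely for illustration.

```python
# MLM-based Gibbs sampling with bert4keras: a minimal sketch.
import numpy as np
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer

# Placeholder paths: point these to your copy of Google's Chinese BERT base.
config_path = 'chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = 'chinese_L-12_H-768_A-12/vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)
model = build_transformer_model(config_path, checkpoint_path, with_mlm=True)

mask_id = tokenizer.token_to_id('[MASK]')


def gibbs_sample(init_text=None, length=9, steps=1000, print_every=100):
    """Run Gibbs sampling with the MLM model and print intermediate samples.

    If init_text is given it is used as the initial sentence (the length is
    then fixed); otherwise the chain starts from `length` [MASK] tokens.
    """
    if init_text is not None:
        token_ids, segment_ids = tokenizer.encode(init_text)
    else:
        cls_id = tokenizer.token_to_id('[CLS]')
        sep_id = tokenizer.token_to_id('[SEP]')
        token_ids = [cls_id] + [mask_id] * length + [sep_id]
        segment_ids = [0] * len(token_ids)

    token_ids = np.array([token_ids])
    segment_ids = np.array([segment_ids])
    n = token_ids.shape[1] - 2  # editable positions, excluding [CLS]/[SEP]

    for step in range(steps):
        i = np.random.randint(1, n + 1)               # 1. pick a position uniformly
        masked = token_ids.copy()
        masked[0, i] = mask_id                         # mask out position i
        probas = model.predict([masked, segment_ids])[0, i]   # 2. p(y | x_{-i})
        probas = probas.astype('float64')
        probas /= probas.sum()                         # renormalize for safety
        y = np.random.choice(len(probas), p=probas)    # 3. sample a token y
        token_ids[0, i] = y                            # 4. replace position i with y
        if (step + 1) % print_every == 0:
            print(tokenizer.decode(token_ids[0]))


# e.g. start from 9 consecutive [MASK] tokens, or from a concrete sentence:
gibbs_sample(length=9)
gibbs_sample(init_text=u'科学技术是第一生产力。')
```

Each printout is one state of the Markov chain; consecutive states are highly correlated, so spacing the printouts out (via `print_every`) makes the collected samples look closer to independent draws.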

Here are some examples:

Initial Sentence:
科学技术是第一生产力。 (Science and technology are the primary productive forces.)
Sampling Results:
荣耀笔记本开箱怎么样? (How is the Honor laptop unboxing?)
微信记录没用被怎么办? (What to do if WeChat records are useless/deleted?)
无法安装浏览器怎么办? (What to do if the browser cannot be installed?)
epf转换器a7l怎么用? (How to use epf converter a7l?)
没有安装浏览器怎么办? (What to do if no browser is installed?)
荣耀笔记本充电怎么用? (How to use Honor laptop charging?)
无法打开asp. net怎么办? (What to do if asp.net cannot be opened?)
没有安装浏览器怎么办? (What to do if no browser is installed?)
无法重启浏览器怎么办? (What to do if the browser cannot be restarted?)
ro汉巴换mac tv版怎么用? (How to use ro Hanba exchange mac tv version?)
Initial Sentence:
北京新增3例本地确诊病例和1例无症状感染者 (Beijing adds 3 new local confirmed cases and 1 asymptomatic infected person)
Sampling Results:
澳门录得233宗h1n1感染案例和13宗放射性感染。 (Macau recorded 233 H1N1 infection cases and 13 radioactive infections.)
庆祝仪式是学院绘画、钢铁工参与创作的盛会。 (The celebration ceremony is a grand event where academy painters and steelworkers participate in creation.)
庆祝仪式后吉卜力平台其他游戏也参加了庆祝。 (After the celebration, other games on the Ghibli platform also joined the celebration.)
临床试验发现中的g染色体多来自胃肠道感染。 (G-chromosomes in clinical trials mostly come from gastrointestinal infections.)
临床试验发现,人们通常真正享受阴蒂的快感。 (Clinical trials found that people usually truly enjoy the pleasure of the clitoris.)
庆祝模式在吉卜力平台其他游戏中更加与庆祝。 (The celebration mode is more [aligned] with celebration in other Ghibli platform games.)
庆祝模式在吉卜力,或其他游戏上更新和庆祝。 (Celebration mode updated and celebrated on Ghibli or other games.)
澳门录得20宗h1n1感染病例,2宗放射性感染。 (Macau recorded 20 H1N1 infection cases and 2 radioactive infections.)
临床试验发现女性的染色体常来自胃肠道感染。 (Clinical trials found that female chromosomes often come from gastrointestinal infections.)
临床试验发现90% 感染病例为m型胃肠道感染。 (Clinical trials found that 90% of infection cases are type-M gastrointestinal infections.)
Initial Sentence:
9 consecutive [MASK] tokens
Sampling Results:
你每天学你妈妈啊! (You copy your mother every day!)
那晚,眼前白茫茫。 (That night, everything was white before my eyes.)
層層青翠綠意盎然。 (Layers of verdant green, full of life.)
幼儿园想做生意了。 (The kindergarten wants to do business.)
究竟如何才能入官? (How on earth can one enter officialdom?)
老师、同学,您好! (Hello, teacher and classmates!)
云山重重,两茫茫。 (The cloud-covered mountains are heavy, both vast and hazy.)
梅雨,窗外霧茫茫。 (Plum rains, the fog is thick outside the window.)
那时,眼前白茫茫。 (At that time, everything was white before my eyes.)
還是很棒的切蛋糕! (Still a great cake cutting!)

In the experiments, I used Google's open-source Chinese BERT base. As you can see, the sampled sentences are quite diverse and have a certain degree of readability, which is decent for a base-sized model.

When consecutive [MASK] tokens are used as the initial value, repeated runs can yield very different results:

Initial Sentence:
17 consecutive [MASK] tokens
Sampling Results:
其他面瘫吃什么?其他面瘫吃什么好? (What else to eat for facial paralysis? What is good to eat for other facial paralysis?)
小儿面瘫怎么样治疗?面瘫吃什么药? (How to treat infantile facial paralysis? What medicine to take for facial paralysis?)
幼儿面瘫怎么样治疗?面瘫吃什么好? (How to treat toddler facial paralysis? What is good to eat for facial paralysis?)
儿童头痛是什么原因荨麻疹是什么病? (What causes headaches in children? What kind of disease is urticaria?)
其他面瘫吃什么・ 其他面瘫吃什么好? (What else to eat for facial paralysis · What is good for other facial paralysis?)
竟然洁具要怎么装进去水龙头怎么接? (How to surprisingly install sanitary ware? How to connect the faucet?)
其他面瘫吃什么好其他面瘫吃什么好? (What is good for other facial paralysis? What is good for other facial paralysis?)
孩子头疼是什么原因荨麻疹是什么病? (What causes a child's headache? What kind of disease is urticaria?)
竟然厨房柜子挑不进去水龙头怎么插? (How is it that the kitchen cabinet can't fit in? How to plug in the faucet?)
不然厨房壁橱找不到热水龙头怎么办? (Otherwise, what if I can't find the hot water faucet in the kitchen cupboard?)
Initial Sentence:
17 consecutive [MASK] tokens
Sampling Results:
フロクのツイートは下記リンクからこ覧下さい。
天方町に運行したいシステムをこ利用くたさい。
エリアあります2つクロカー専門店からこ案内まて!
当サイトては割引に合うシステムを採用しています!
同時作品ては表面のシステムを使用する。
メーカーの品は真面まてシステムを使用しています。
掲示板こ利用いたたシステムこ利用くたさい。
住中方は、生产のシステムを使用しています。
エアウェアの住所レヘルをこ一覧下さい。
フロクのサホートは下記リンクてこ覧下さい。

Amazingly, even Japanese was sampled, and after checking with Baidu Translate, these Japanese sentences turn out to be reasonably readable. On one hand, this reflects the diversity of the random sampling results; on the other hand, it shows that the training data of Google's Chinese BERT was not thoroughly cleaned, as it evidently contained a significant amount of text that is neither Chinese nor English.

Insights and Reflections

A while ago, Google, Stanford, OpenAI and others jointly published the article "Extracting Training Data from Large Language Models", which pointed out that language models like GPT-2 are fully capable of reproducing (and thereby exposing) their training data. This is not hard to understand, since the essence of a language model is memorizing sentences. MLM-based Gibbs sampling shows that the issue exists not only in explicit autoregressive language models like GPT-2 but also in bidirectional models such as BERT's MLM. We can already see hints of this in the sampling examples above: the appearance of Japanese indicates that the original corpus was not thoroughly cleaned, and starting from "Beijing adds 3 new local confirmed cases..." we sampled results about H1N1, which hints at the era from which the training corpus was collected. The implication is that if you do not want your open-source model to leak private information, you must clean the pre-training data carefully.

In addition, there is another "tidbit" regarding the paper "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model". The paper originally claimed that the MLM model is a Markov Random Field, but this is in fact incorrect; the authors later posted a clarification on their homepage, and interested readers can refer to "BERT has a Mouth and must Speak, but it is not an MRF". In short, using the MLM model for random sampling is perfectly fine, but it does not correspond directly to a Markov Random Field.

Summary

This article introduced random text sampling based on BERT's MLM, which is essentially a natural application of Gibbs sampling. Overall, this article is just a fairly simple example. For readers who already understand Gibbs sampling, there is almost no technical difficulty here; for readers who are not yet familiar with Gibbs sampling, this specific example serves as a good way to further understand the Gibbs sampling process.


To repost, please include the original address of this article: https://kexue.fm/archives/8119
