By 苏剑林 | September 27, 2020
Everyone knows that GPT3 is currently in the spotlight. However, with GPT3 being promoted everywhere, do readers remember the actual title of the GPT3 paper? In fact, the GPT3 paper is titled "Language Models are Few-Shot Learners." The title no longer contains the letters G, P, or T, but because it is a direct descendant of the original GPT, it is still referred to as GPT3. As the name suggests, GPT3 focuses on Few-Shot Learning. Additionally, another characteristic of GPT3 is its size—the largest version has 175 billion parameters, which is more than a thousand times the size of BERT Base.
Because of this, a paper titled "It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners" recently appeared on arXiv and caught my attention. Roughly paraphrased, it says "Who says it has to be big? Small models can also do few-shot learning." Obviously, this title is aimed directly at GPT3. I was curious to see who had the courage to challenge GPT3 and what kind of small model could do it. It turns out the authors propose that, with a suitable construction, BERT's MLM model can also perform few-shot learning. Reading it gave me a sudden "so that's how it can be done" moment of realization, which I would like to share here.
The Rise of MLM
MLM stands for "Masked Language Model." It is essentially a "fill-in-the-blanks" task where certain words in a text are randomly masked, and the model is asked to predict the masked words. The diagram is as follows:

The masked positions can be individual, randomly selected tokens, or consecutive tokens that together form a whole word; the latter is known as WWM (Whole Word Masking).
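As a quick illustration of MLM as fill-in-the-blank (my own sketch, not part of the original experiments), the snippet below asks a ready-made Chinese MLM to fill a masked character using the HuggingFace transformers fill-mask pipeline; this is only for illustration and differs from the code used in the experiments later in this article.

```python
# Query a ready-made Chinese MLM for the masked position of a short sentence.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-chinese")

# Print the top candidate characters for the single [MASK] position.
for pred in fill_mask("今天天气很[MASK]。"):
    print(pred["token_str"], round(pred["score"], 4))
```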
Initially, MLM was viewed only as a pre-training task for BERT—something to be discarded once training was finished. Because of this, some open-source models didn't even keep the weights for the MLM part, such as the brightmart version and the CLUE version of RoBERTa. Meanwhile, the RoBERTa-wwm-ext-large released by HIT (Harbin Institute of Technology) randomly initialized the MLM weights for some unknown reason. Therefore, to reproduce the results discussed later in this article, these versions are not suitable.
However, as research progressed, researchers found that not only is BERT's Encoder useful, but the MLM used for pre-training is also very useful. For example, the paper "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model" pointed out that MLM can be used as a general generative model. The paper "Spelling Error Correction with Soft-Masked BERT" applied MLM to text error correction. My previous experiments in "From Language Models to Seq2Seq: Transformer is like a play, it all depends on the Mask" also showed that MLM pre-training weights can be used as UniLM for Seq2Seq tasks. Furthermore, "Unsupervised Word Segmentation and Syntactic Analysis! BERT can also be used this way" applied MLM ideas to unsupervised word segmentation and syntactic analysis. It's fair to say that MLM has already proven its worth.
Converting Tasks into Cloze Tasks
In this article, we will learn about another brilliant application of MLM: using it for few-shot learning or semi-supervised learning, and in some scenarios it can even achieve zero-shot learning.
How can we combine the task we want to perform with MLM? It's simple: give the task a textual description and then convert it into a fill-in-the-blanks (cloze) problem. For example, given the sentence "I felt this trip to Beijing was quite good," we can add a description to build the following cloze task:
______ satisfied. I felt this trip to Beijing was quite good.
Furthermore, we restrict the blank to be filled only with words like "Very" or "Not." The problem then becomes very clear: we need to judge, based on contextual consistency, whether the user is satisfied. If the probability of "Very" is higher than that of "Not," the sentiment is positive; otherwise, it is negative. We have thus converted a sentiment classification problem into a cloze problem that an MLM model can predict directly. Since MLM training does not necessarily require supervised data, this can in principle achieve zero-shot learning.
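To make this concrete, here is a minimal zero-shot sketch of the idea (my own illustration, using HuggingFace transformers and bert-base-chinese for convenience; the actual experiments below use the code in my Github repo): prepend the pattern "[MASK]满意。" ("____ satisfied.") to the review and compare the MLM scores of the two candidate tokens "很" (Very) and "不" (Not) at the masked position.

```python
# Zero-shot sentiment: compare the MLM scores of "很" (Very) and "不" (Not)
# at the pattern's [MASK] position.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese").eval()

def zero_shot_sentiment(text):
    prompt = f"{tokenizer.mask_token}满意。{text}"          # "[MASK]满意。" + review
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    very_id, not_id = tokenizer.convert_tokens_to_ids(["很", "不"])
    return "positive" if logits[very_id] > logits[not_id] else "negative"

print(zero_shot_sentiment("这趟北京之旅我感觉很不错。"))   # expected: positive
```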
Multi-class problems can be converted similarly. For example, in news topic classification, for the input sentence "After eight months, we finally see the girls of the women's volleyball team on the court again," we can construct:
The following is a report on ______ news. After eight months, we finally see the girls of the women's volleyball team on the court again.
In this way, we convert news topic classification into a cloze problem. A good MLM model should be able to predict the word "Sports."
Some simple inference tasks can also be converted this way. A common one is: given two sentences, determine whether they are consistent with each other. For example, "I went to Beijing" and "I went to Shanghai" are contradictory, while "I went to Beijing" and "I am in Tiananmen Square" are consistent. The usual approach is to concatenate the two sentences and feed them into a model for binary classification. How should we construct this as a cloze task? A natural way is:
I went to Beijing? ______, I went to Shanghai.
I went to Beijing? ______, I am in Tiananmen Square.
Where the candidate words for the blank are $\{\text{Yes}, \text{No}\}$.
Pattern-Exploiting
At this point, readers should easily see the pattern: add a prefix or suffix description to the input text and mask certain tokens to convert it into a cloze problem. This conversion is called a Pattern in the original paper. This transformation should make the sentence as natural as possible because the pre-trained MLM model was trained on natural language. Obviously, the same problem can have many different patterns. For the sentiment classification example, the description could be at the end: "I felt this trip to Beijing was quite good. ____ satisfied."; or you could add more words: "How do you feel? ____ satisfied. I felt this trip to Beijing was quite good."
Next, we need to construct a candidate space of tokens to predict and establish a mapping from tokens to actual categories. This is called a Verbalizer in the original paper. For the sentiment classification example, our candidate space is $\{\text{Very}, \text{Not}\}$, and the mapping is $\text{Very} \to \text{Positive}, \text{Not} \to \text{Negative}$. The mapping does not have to be one-to-one: we could also include "Quite," "Too," or "Hardly," and map $\{\text{Very}, \text{Quite}, \text{Too}\} \to \text{Positive}$ and $\{\text{Not}, \text{Hardly}\} \to \text{Negative}$, and so on. It is easy to see that many NLP tasks can undergo this conversion, but it is generally only suitable for tasks with a finite candidate space, essentially multiple-choice questions, most commonly text classification.
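As a small illustration of a many-to-one Verbalizer (my own sketch; the token sets and the sum-of-probabilities aggregation are illustrative choices, not prescribed by the paper):

```python
# A Verbalizer maps candidate tokens at the [MASK] position to labels; here a
# label's score is the total MLM probability mass of its candidate tokens.
VERBALIZER = {
    "positive": ["很", "挺", "太"],   # Very / Quite / Too  (illustrative tokens)
    "negative": ["不", "没"],         # Not / Hardly        (illustrative tokens)
}

def verbalize(token_probs):
    """token_probs: {candidate token: MLM probability at the masked position}."""
    scores = {label: sum(token_probs.get(t, 0.0) for t in toks)
              for label, toks in VERBALIZER.items()}
    return max(scores, key=scores.get)

# e.g. verbalize({"很": 0.40, "挺": 0.10, "不": 0.30}) -> "positive"
```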
As mentioned, one task can have many different Patterns. The original paper handles this as follows:
1. For each Pattern, finetune an individual MLM model using the training set;
2. Integrate the models corresponding to different Patterns to obtain a fused model;
3. Use the fused model to predict pseudo-labels for unlabelled data;
4. Use the pseudo-labeled data to finetune a conventional (non-MLM) model.
The specific integration method can be found in the paper; it is not the main focus here. This training mode is called Pattern-Exploiting Training (PET). It first appeared in the paper "Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference." The paper we are discussing today further confirms the value and results of Pattern-Exploiting Training and integrates multi-task learning, allowing its few-shot learning performance on the SuperGLUE leaderboard to surpass GPT3. The authors of the two papers are the same, representing a continuation of their work.
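In pseudocode, the four steps look roughly like this (every helper name is a placeholder standing in for the corresponding step; see the original paper for the exact ensembling and distillation details):

```python
# High-level sketch of Pattern-Exploiting Training (PET).
def pet(patterns, labeled_data, unlabeled_data):
    # 1. Finetune one MLM per Pattern on the small labeled set.
    mlm_models = [finetune_mlm(p, labeled_data) for p in patterns]

    # 2+3. Ensemble the Pattern-specific models and use their averaged label
    #      distributions as (soft) pseudo-labels for the unlabelled data.
    pseudo_labeled = [(x, average(predict_label_probs(m, x) for m in mlm_models))
                      for x in unlabeled_data]

    # 4. Train a conventional (non-MLM) classifier on the pseudo-labelled data.
    return train_classifier(pseudo_labeled)
```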

One thing to complain about, though, is that in the figure above the 223M-parameter PET model is ALBERT-xxlarge-v2. Calling ALBERT a "small model" is somewhat disingenuous, because parameter sharing does not reduce its forward computation: ALBERT-xxlarge has 12 layers whose weights are shared, so in terms of forward computation it is roughly equivalent to an unshared GPT-style model with about 2700M (12 times as many) parameters.
Chinese Practice, Verifying Effectiveness
To truly confirm the value of a method or model, looking at the experimental tables in a paper is not enough. No one can say for sure if the experimental results in a paper can be reproduced, and even if they can be reproduced in English, it doesn't mean they are valuable in Chinese. Therefore, the most practical approach is to conduct experiments personally. Below is my experimental code for reference:
Github Address: https://github.com/bojone/Pattern-Exploiting-Training
We will explore the feasibility of PET from the following perspectives:
1. How effective is it to use a ready-made MLM model directly? (Zero-shot learning 1)
2. How effective is it to finetune ready-made MLM models with "large amounts of unlabelled data"? (Zero-shot learning 2)
3. How effective is it to finetune ready-made MLM models with "small amounts of labeled data"? (Few-shot learning)
4. How effective is it to finetune ready-made MLM models with "small amounts of labeled data + large amounts of unlabelled data"? (Semi-supervised learning)
Below, I primarily present experimental results for binary sentiment classification. I also conducted experiments on short news multi-class classification, and the code is on Github. The results are similar, so I won't repeat them here.
Zero-shot Learning 1
Here we explore adding the corresponding Pattern to the input text and predicting directly with a ready-made MLM model, then evaluating the prediction accuracy. Since building the model involves no supervised training on labeled data, this qualifies as "Zero-shot learning." We compare different Patterns and different MLM models:
Below are the Patterns used in the experiment; in every case the candidate words for the blank are "Very" (很) and "Not" (不):
P1: ____ satisfied. I felt this trip to Beijing was quite good.
P2: I felt this trip to Beijing was quite good. ____ satisfied.
P3: ____ good. I felt this trip to Beijing was quite good.
P4: ____ ideal. I felt this trip to Beijing was quite good.
P5: How do you feel? ____ satisfied. I felt this trip to Beijing was quite good.
As for the MLM models, they are as follows:
M1: Google's open-source Chinese BERT Base;
M2: HIT's open-source RoBERTa-wwm-ext Base;
M3: Tencent UER's open-source BERT Base;
M4: Tencent UER's open-source BERT Large.
Experimental results are shown in the table below (Validation Set / Test Set):
\[\begin{array}{c}
\text{Zero-Shot Learning Results for Different Models and Patterns} \\
\begin{array}{c|ccccc}
\hline
& \text{P1} & \text{P2} & \text{P3} & \text{P4} & \text{P5} \\
\hline
\text{M1} & 66.94\,/\,67.60 & 57.56\,/\,56.13 & 58.83\,/\,59.69 & 83.70\,/\,83.33 & 75.98\,/\,76.13 \\
\text{M2} & 85.17\,/\,84.27 & 70.63\,/\,68.69 & 58.55\,/\,59.12 & 81.81\,/\,82.28 & 80.25\,/\,81.62 \\
\text{M3} & 66.75\,/\,68.64 & 50.45\,/\,50.97 & 68.97\,/\,70.11 & 81.95\,/\,81.48 & 61.49\,/\,62.58 \\
\text{M4} & 83.56\,/\,85.08 & 72.52\,/\,72.10 & 76.46\,/\,77.03 & 88.25\,/\,87.45 & 82.43\,/\,83.56 \\
\hline
\end{array}
\end{array}\]
The best result actually reaches 88%! In other words, by loading a ready-made MLM and pairing it with a proper Pattern, we can correctly identify the sentiment orientation of most samples without any labeled data. This makes us see the potential of MLM models in a new light.
There are clear differences between Patterns and between pre-trained models. Overall, the Large version performs significantly better than the Base models, indicating that, just as in the progression from GPT to GPT2 and GPT3, making the model larger still helps. It may also suggest that MLM models have not yet been trained to their full potential, perhaps because BERT's training scheme of masking only a portion of the tokens is too inefficient; the improved MLM variant described in "Modifying Transformer Structure to Design a Faster and Better MLM Model" might do better.
Zero-shot Learning 2
Seeing the above results, readers might wonder: if I continue pre-training the MLM model on in-domain data, can the results be improved? The answer is: yes! Below are our experimental results. Due to limited computing power, we only ran the comparison on RoBERTa-wwm-ext (M2 above; the model after continued pre-training is denoted $\text{M2}^{+\text{unsupervised}}$):
\[\begin{array}{c}
\text{Zero-Shot Results with Continued MLM Pre-training} \\
\begin{array}{c|ccccc}
\hline
& \text{P1} & \text{P2} & \text{P3} & \text{P4} & \text{P5} \\
\hline
\text{M2} & 85.17\,/\,84.27 & 70.63\,/\,68.69 & 58.55\,/\,59.12 & 81.81\,/\,82.28 & 80.25\,/\,81.62 \\
\text{M2}^{+\text{unsupervised}} & 88.05\,/\,87.53 & 71.01\,/\,68.78 & 81.05\,/\,81.24 & 86.40\,/\,85.65 & 87.26\,/\,87.40 \\
\hline
\end{array}
\end{array}\]
It should be noted that here we only continued MLM training with in-domain data. This process is unsupervised and does not require label signals, so it is still considered "Zero-shot learning." At the same time, from the results so far, we can see that adding a "prefix" to the input text is more advantageous than a "suffix."
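For readers who want to try this step, below is a rough sketch of continued MLM pre-training on unlabelled in-domain text using HuggingFace transformers; the checkpoint name, data path, and hyperparameters are placeholders, and my actual experiments use the code in the Github repo above.

```python
# Continued MLM pre-training on unlabelled in-domain text (illustrative only).
from datasets import load_dataset
from transformers import (BertTokenizer, BertForMaskedLM, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

name = "hfl/chinese-roberta-wwm-ext"               # RoBERTa-wwm-ext Base (M2)
tokenizer = BertTokenizer.from_pretrained(name)
model = BertForMaskedLM.from_pretrained(name)

# One unlabelled in-domain sentence per line (placeholder path).
data = load_dataset("text", data_files={"train": "in_domain_reviews.txt"})
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_continued", num_train_epochs=3,
                           per_device_train_batch_size=32, learning_rate=2e-5),
    train_dataset=data["train"],
    # Standard MLM objective: mask 15% of tokens, originals as targets.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()   # the resulting checkpoint is then used exactly as in zero-shot learning 1
```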
Few-Shot Learning
We just discussed the improvement from pre-training the MLM with unlabelled data. If we return to the target scenario of PET, directly training the MLM on a small amount of labeled data with specific Patterns, how does that perform? This is true "Few-shot learning." Here we kept only about 200 labeled samples. When constructing samples, we first add a Pattern to each sentence; besides the Mask position provided by the Pattern, we also randomly mask some of the remaining tokens as additional regularization (a sketch of this sample construction is given at the end of this section). The final experimental results are as follows:
\[\begin{array}{c}
\text{Few-Shot Learning Results} \\
\begin{array}{c|ccccc}
\hline
& \text{P1} & \text{P2} & \text{P3} & \text{P4} & \text{P5} \\
\hline
\text{M2} & 85.17\,/\,84.27 & 70.63\,/\,68.69 & 58.55\,/\,59.12 & 81.81\,/\,82.28 & 80.25\,/\,81.62 \\
\text{M2}^{+\text{few-shot}} & 89.29\,/\,89.18 & 84.71\,/\,82.76 & 88.91\,/\,89.05 & 89.31\,/\,89.13 & 89.07\,/\,88.75 \\
\hline
\end{array}
\end{array}\]
The conclusion is that except for the "suffix-style" P2, the other Patterns give similar results, which further indicates that "prefix-style" Patterns are more competitive than "suffix-style" ones. For reference, directly finetuning a BERT classifier on the same data in the conventional way yields about 88.93, so the "MLM + Pattern" few-shot learning approach brings a slight performance boost.
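As promised above, here is a hedged sketch of the few-shot sample construction: the pattern "[MASK]满意。" is prepended, its mask is trained to predict the Verbalizer token of the gold label, and a few extra tokens are randomly masked as ordinary MLM targets. The token-level details here are illustrative; see the Github repo for the actual implementation.

```python
# Build one few-shot training sample. During training, the MLM loss is
# computed only at the masked positions (the pattern's mask plus the
# randomly masked ones); `targets` holds the tokens to be predicted there.
import random

LABEL_TOKEN = {1: "很", 0: "不"}                 # positive -> Very, negative -> Not

def build_few_shot_sample(chars, label, mask="[MASK]", mask_rate=0.15):
    """chars: list of characters of the sentence; label: 1 (pos) or 0 (neg)."""
    pattern = [mask, "满", "意", "。"]            # "[MASK]满意。"
    inputs  = pattern + list(chars)
    targets = [LABEL_TOKEN[label], "满", "意", "。"] + list(chars)

    # Randomly mask part of the remaining text as extra regularization.
    for i in range(len(pattern), len(inputs)):
        if random.random() < mask_rate:
            inputs[i] = mask
    return inputs, targets
```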
Semi-supervised Learning
Now that unsupervised zero-shot learning and supervised few-shot learning have been covered, it is naturally time for "Semi-supervised learning," which combines labeled and unlabelled data. On the same task, the ratio of labeled to unlabelled data is approximately 1:99. The labeled data carry Patterns, whereas the unlabelled data do not; both are fed to MLM training with part of their tokens randomly masked (a short sketch of this setup is given at the end of this section). The measured results are as follows:
\[\begin{array}{c}
\text{Semi-Supervised Learning Results} \\
\begin{array}{c|ccccc}
\hline
& \text{P1} & \text{P2} & \text{P3} & \text{P4} & \text{P5} \\
\hline
\text{M2} & 85.17\,/\,84.27 & 70.63\,/\,68.69 & 58.55\,/\,59.12 & 81.81\,/\,82.28 & 80.25\,/\,81.62 \\
\text{M2}^{+\text{semi-supervised}} & 90.09\,/\,89.76 & 79.58\,/\,79.35 & 90.19\,/\,88.96 & 90.05\,/\,89.54 & 89.88\,/\,89.23 \\
\hline
\end{array}
\end{array}\]
Again, "suffix" is significantly worse than "prefix," and the "prefix" Patterns give similar results to one another; comparing against the few-shot results also confirms that the extra unlabelled data helps. Intuitively, "prefix" may beat "suffix" because the Mask position in a prefix is fixed, allowing the weak supervisory signal to accumulate and reinforce there. But that does not explain why "prefix" also wins in zero-shot learning; it is probably related to learning difficulty. Perhaps patterns at the beginning of a sentence are more salient and therefore easier to learn? All of this is still speculation.
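As mentioned above, the semi-supervised training set simply mixes the two kinds of samples. A short sketch (reusing the hypothetical build_few_shot_sample from the few-shot section; all names are illustrative):

```python
import random

def random_mask_sample(chars, mask="[MASK]", mask_rate=0.15):
    """Plain MLM sample for an unlabelled sentence: no Pattern, random masks only."""
    inputs, targets = list(chars), list(chars)
    for i in range(len(inputs)):
        if random.random() < mask_rate:
            inputs[i] = mask
    return inputs, targets

def build_semi_supervised_set(labeled, unlabeled):
    # Pattern-bearing labeled samples plus plain masked unlabelled samples,
    # all trained with the same MLM loss at the masked positions.
    samples = [build_few_shot_sample(chars, y) for chars, y in labeled]
    samples += [random_mask_sample(chars) for chars in unlabeled]
    return samples
```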
Summary and Conclusion
The above results are summarized as follows:
\[\begin{array}{c}
\text{Summary Comparison of Results} \\
\begin{array}{c|ccccc}
\hline
& \text{P1} & \text{P2} & \text{P3} & \text{P4} & \text{P5} \\
\hline
\text{M2} & 85.17\,/\,84.27 & 70.63\,/\,68.69 & 58.55\,/\,59.12 & 81.81\,/\,82.28 & 80.25\,/\,81.62 \\
\text{M2}^{+\text{unsupervised}} & 88.05\,/\,87.53 & 71.01\,/\,68.78 & 81.05\,/\,81.24 & 86.40\,/\,85.65 & 87.26\,/\,87.40 \\
\text{M2}^{+\text{few-shot}} & 89.29\,/\,89.18 & 84.71\,/\,82.76 & 88.91\,/\,89.05 & 89.31\,/\,89.13 & 89.07\,/\,88.75 \\
\text{M2}^{+\text{semi-supervised}} & 90.09\,/\,89.76 & 79.58\,/\,79.35 & 90.19\,/\,88.96 & 90.05\,/\,89.54 & 89.88\,/\,89.23 \\
\hline
\end{array}
\end{array}\]
Readers can also compare this to the results of our previous article "On Generalization: From Random Noise, Gradient Penalty to Virtual Adversarial Training" where we used Virtual Adversarial Training (VAT) for semi-supervised learning. It is clear that whether in zero-shot, few-shot, or semi-supervised learning, the MLM-based method can match the results of VAT-based semi-supervised learning. Our results in the short news multi-class classification experiment were similar. Therefore, this shows that the MLM model can indeed be used as an excellent zero-shot/few-shot/semi-supervised learner.
Of course, the MLM-based approach still has drawbacks. For example, the independence assumption in MLM limits its ability to predict longer spans (put plainly, the text filling the blank should not be too long), and its inability to predict answers of indefinite length also restricts its application scenarios (so for now it only suits multiple-choice-style tasks, not generation). We look forward to more powerful MLM models; at that point, it might be possible to compete with GPT3 on all tasks.
Summary
This article introduced a novel application of BERT's MLM model: converting tasks into cloze tests by pairing them with specific descriptions and using the MLM model for zero-shot, few-shot, and semi-supervised learning. In the original paper's SuperGLUE experiments, it achieved results comparable to GPT3, and I have conducted some experiments on Chinese tasks that further confirm the effectiveness of this idea. The entire approach is quite unique and gives a sense of "so that's how it can be done." I recommend everyone learn from it.