By 苏剑林 | December 26, 2019
Some time ago, I wrote "The Universal seq2seq: Reading Comprehension QA Based on seq2seq", exploring how to perform reading comprehension question answering using the most general seq2seq approach, achieving quite good results (single model 0.77, exceeding the best fine-tuned model used during the competition). In this article, we continue with this task but change the approach to be directly based on the Masked Language Model (MLM). The final performance is basically consistent, but the prediction speed is significantly improved.
Two Types of Generation
Broadly speaking, the MLM generation method can also be considered a seq2seq model, but it belongs to "non-autoregressive" generation, whereas what we usually call (narrowly defined) seq2seq refers to autoregressive generation. This section provides a brief introduction to these two concepts.
Autoregressive Generation
As the name suggests, autoregressive generation means that decoding proceeds recursively, generating the output character by character. It models the following probability distribution:
\begin{equation}p(y_1,y_2,\dots,y_n|x)=p(y_1|x)p(y_2|x,y_1)\dots p(y_n|x,y_1,\dots,y_{n-1})\label{eq:at}\end{equation}
For more detailed introductions, you can refer to "Playing with Keras: Automatic Title Generation via seq2seq" and "From Language Models to Seq2Seq: Transformer is All About the Mask"; I will not go into further detail about autoregressive generation here.
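Still, to make the recursive nature of $\eqref{eq:at}$ concrete, here is a minimal greedy-decoding sketch. It is illustrative only; `next_token_probs` is a hypothetical stand-in for any conditional model that returns $p(y_t|x, y_{<t})$:

```python
import numpy as np

def greedy_decode(x, next_token_probs, max_len, eos_id):
    """Greedy autoregressive decoding: one token per step.

    next_token_probs(x, y_prefix) is a hypothetical callable that returns
    p(y_t | x, y_1, ..., y_{t-1}) as a 1-D numpy array over the vocabulary.
    """
    y = []
    for _ in range(max_len):
        probs = next_token_probs(x, y)   # depends on all tokens generated so far
        token = int(np.argmax(probs))    # greedily pick the most probable token
        if token == eos_id:              # stop at the end-of-sequence token
            break
        y.append(token)
    return y
```

Because each step conditions on the previous outputs, the steps cannot be run in parallel.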
Non-Autoregressive Generation
Since autoregressive generation requires recursive decoding, it cannot be parallelized, leading to slower decoding speeds. Therefore, in recent years, many works have been researching non-autoregressive generation and have achieved significant results. Simply put, non-autoregressive generation aims to find ways to make the decoding of each character parallelizable. The simplest non-autoregressive model directly assumes that each character is independent:
\begin{equation}p(y_1,y_2,\dots,y_n|x)=p(y_1|x)p(y_2|x)\dots p(y_n|x)\label{eq:nat}\end{equation}
This is a very strong assumption and is only applicable in specific cases. If used directly for general text generation tasks like automatic summarization, the results would be quite poor. For more complex work related to non-autoregressive generation, one can find plenty of resources by searching for "non-autoregressive text generation" on Arxiv or Google.
As the title mentions, the reading comprehension method in this article is "based on MLM." Readers familiar with the BERT model know that MLM (Masked Language Model) is actually a special case of $\eqref{eq:nat}$, so generation models based on MLM fall under the category of non-autoregressive generation.
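By contrast, under the independence assumption of $\eqref{eq:nat}$ every position can be decoded in a single parallel step. The sketch below assumes a hypothetical `all_position_probs` that returns $p(y_i|x)$ for all positions at once, which is essentially what an MLM head does for the [MASK] positions:

```python
import numpy as np

def parallel_decode(x, all_position_probs, n):
    """Non-autoregressive decoding: all n tokens in one shot.

    all_position_probs(x, n) is a hypothetical callable returning an
    (n, vocab_size) array whose i-th row is p(y_i | x). Since positions
    are assumed independent, each row is argmax-ed separately.
    """
    probs = all_position_probs(x, n)     # one forward pass for all positions
    return np.argmax(probs, axis=-1).tolist()
```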
Model Introduction
The "introduction" is indeed brief because using MLM for reading comprehension is very simple.
Model Diagram
First, define a maximum length $l_{\max}$. Then concatenate the question and the passage, and insert $l_{\max}$ [MASK] tokens into the concatenated sequence. This sequence is fed into BERT, and the predictions at the [MASK] positions are used as the answer (the same setup applies in both the training and prediction stages), as shown in the following diagram:

Model diagram for reading comprehension using MLM (where [M] represents the [MASK] token)
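For concreteness, here is a rough sketch of the input construction and answer readout described above. It is not the author's exact preprocessing (that lives in task_reading_comprehension_by_mlm.py); the token layout, helper names, and the placement of [SEP] markers are illustrative assumptions, and the token ids can come from any BERT-style tokenizer:

```python
import numpy as np

def build_mlm_input(question_ids, passage_ids, l_max, cls_id, sep_id, mask_id):
    """Assumed layout: [CLS] question [SEP] [MASK]*l_max passage [SEP].

    question_ids / passage_ids are token-id lists from a BERT tokenizer;
    cls_id, sep_id, mask_id are the ids of [CLS], [SEP], [MASK].
    The l_max [MASK] slots are the positions whose MLM predictions
    will be read off as the answer.
    """
    token_ids = ([cls_id] + question_ids + [sep_id]
                 + [mask_id] * l_max
                 + passage_ids + [sep_id])
    answer_start = len(question_ids) + 2          # index of the first [MASK]
    answer_pos = list(range(answer_start, answer_start + l_max))
    return token_ids, answer_pos

def read_answer(mlm_probs, answer_pos, id_to_token, stop_tokens=('[SEP]', '[PAD]')):
    """Greedy readout of the answer from the MLM output probabilities.

    mlm_probs is the (seq_len, vocab_size) output of the MLM head;
    decoding stops at the first token treated as "answer ends here".
    """
    tokens = []
    for pos in answer_pos:
        token = id_to_token[int(np.argmax(mlm_probs[pos]))]
        if token in stop_tokens:
            break
        tokens.append(token)
    return ''.join(tokens)
```

The decoding in the actual script is more elaborate; the greedy readout above is only the simplest possible rule, meant to show where the answer comes from.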
Code and Results
Code link: task_reading_comprehension_by_mlm.py
Ultimately, using SogouQA's own evaluation script, the valid set score is approximately 0.77 (Accuracy=0.7282149325820084, F1=0.8207266829447049, Final=0.7744708077633566). This is on par with the previous "Universal seq2seq" model, but prediction is significantly faster: the seq2seq solution could predict only about 2 samples per second, while the current method reaches about 12 samples per second, roughly a 6x speedup with no loss in performance.
Which One to Choose?
In principle, seq2seq is universal, and the factorization $\eqref{eq:at}$ that it models is more reasonable than the $\eqref{eq:nat}$ used by MLM. So why can the MLM solution match seq2seq in performance? When should one use MLM, and when should one use seq2seq?
Training and Prediction
First, the biggest problem with seq2seq is its slowness, especially for long text generation. Therefore, if high efficiency is required, one naturally has to abandon the seq2seq approach.
If efficiency isn't a concern, is seq2seq always the best? Not necessarily. Although $\eqref{eq:at}$ is more accurate from a modeling perspective, seq2seq training is done via teacher forcing, which leads to the "exposure bias" problem: during training, the input at each step comes from the ground truth text; during generation, the input at each step comes from the output of the previous step. Thus, if one character is poorly generated, the error may propagate, making subsequent generation worse and worse.
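The mismatch is easy to see by writing the two loops side by side; `step` and `pick` below are hypothetical stand-ins for the model's next-token distribution and the decoding rule (e.g. argmax):

```python
def run_teacher_forcing(x, y_true, step):
    """Training-time behaviour: every step is conditioned on the ground truth."""
    return [step(x, y_true[:t]) for t in range(len(y_true))]

def run_free_running(x, n_steps, step, pick):
    """Prediction-time behaviour: every step is conditioned on earlier predictions."""
    y = []
    for _ in range(n_steps):
        y.append(pick(step(x, y)))   # errors here feed into all later steps
    return y
```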
In short, there is an inconsistency between training and prediction, which can lead to error accumulation. In contrast, the MLM-based approach behaves identically during training and prediction, because it does not require ground-truth labels as input (the answer positions are simply filled with [MASK] at prediction time), so there is no error accumulation. Moreover, for the same reason, decoding no longer requires recursion and can be fully parallelized, which increases decoding speed.
Unique Correct Answer
Additionally, non-autoregressive generation like MLM is relatively more suitable for short text generation because shorter texts better fit the independence assumption. At the same time, non-autoregressive generation is suitable for scenarios where "there is only one correct answer." The reading comprehension task in this article is primarily extractive, which corresponds exactly to this scenario, hence MLM performs well.
In fact, sequence labeling models like frame-by-frame Softmax or CRF can also be seen as non-autoregressive generation models. I believe the fundamental reason they are effective is that the "correct answer sequence is unique," rather than the intuitive belief that "input and output are aligned." That is to say, if the condition "there is only one correct answer" is met, then non-autoregressive generation can be considered.
Note that "unique answer" here does not mean each sample has only one human-annotated answer, but rather that the task is designed such that the answer is unique. For example, in word segmentation, once the labeling method is designed, each sentence corresponds to exactly one correct segmentation scheme. In contrast, for title generation, a single article can clearly have different titles; therefore, the answer to title generation is not unique (even if only one title is provided for each article in the training data).
Summary
This article experimented with non-autoregressive generation via MLM for reading comprehension QA and found that the final results were not bad, while the speed increased several times. Additionally, the article briefly compared autoregressive and non-autoregressive generation, analyzing when the non-autoregressive solution is applicable and the reasons why.