By 苏剑林 | July 25, 2020
Machine Reading Comprehension (MRC) tasks are likely familiar to many readers. Simply put, the task is to find the answer to a given question in a given passage, i.e. a "Passage + Question → Answer" process. I have written about reading comprehension before, for example in "CNN-based Machine Reading Comprehension Question Answering Model: DGCNN". Constructing Q&A pairs is essentially the inverse task: a "Passage → Answer + Question" process. In academia this is generally called "Question Generation (QG)." Since in most cases answers can be selected by fairly simple rule-based random sampling, many papers focus only on the "Passage + Answer → Question" step.
This article presents a fully end-to-end practice of "Passage → Answer + Question," including the model introduction and implementation code based on bert4keras. Readers are welcome to try it out.
First, See the Results
Input Passage: Mount K2 is the second highest peak in the world, located in China.
Generated Q&A: What is the name of the second highest mountain in the world? Mount K2
Where does Mount K2 rank among the world's highest peaks? Second
Input Passage: On July 28, Thailand will celebrate the 68th birthday of King Maha Vajiralongkorn, Rama X.
Generated Q&A: Who is Rama X of Thailand? King Maha Vajiralongkorn
How old is King Maha Vajiralongkorn, Rama X of Thailand? 68 years old
What is the birth date of King Maha Vajiralongkorn of Thailand? July 28
Which country does King Maha Vajiralongkorn belong to? Thailand
Input Passage: The Water Splashing Festival, also known as the Songkran Festival, has a history of 700 years and is the grandest traditional festival for the Dai (China) and De'ang ethnic groups.
Generated Q&A: What is the Water Splashing Festival also known as? Songkran Festival
Which ethnic group in China observes the Water Splashing Festival as their grandest festival? Dai
How many years of history does the Water Splashing Festival have? 700 years
Thought Analysis
The goal of this article is to implement "Passage → Answer + Question" fully end-to-end, with answer selection also handled automatically by the model rather than by manual rules. In short, it uses the "BERT + UniLM" recipe to build a Seq2Seq model (UniLM's attention mask combined with BERT's pre-trained weights). Readers unfamiliar with UniLM are encouraged to first read "From Language Models to Seq2Seq: Transformer as a Play, All About the Mask".
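To make the mask idea concrete, here is a small standalone sketch (plain NumPy, independent of any library) of the UniLM Seq2Seq attention mask: tokens in the input part (segment 0) attend to each other bidirectionally, while tokens in the output part (segment 1) attend only to the input and to earlier output tokens.

```python
import numpy as np

def unilm_attention_mask(segment_ids):
    """UniLM Seq2Seq mask: position i may attend to position j iff
    j belongs to the input part (segment 0) or j <= i (causal)."""
    s = np.asarray(segment_ids)
    i = np.arange(len(s))[:, None]  # query positions
    j = np.arange(len(s))[None, :]  # key positions
    return ((s[None, :] == 0) | (j <= i)).astype(int)

# Passage tokens take segment 0; answer/question tokens take segment 1.
print(unilm_attention_mask([0, 0, 0, 1, 1]))
# [[1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```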
In a previous article, "Universal seq2seq: Reading Comprehension Q&A Based on seq2seq", I gave a Seq2Seq implementation of reading comprehension, which directly models $p\big(\text{Answer}\big|\text{Passage}, \text{Question}\big)$, illustrated as follows:
[Figure: Seq2Seq reading comprehension model for $p(\text{Answer}\,|\,\text{Passage}, \text{Question})$]
In fact, by slightly modifying the above model to include the question in the generation target, Q&A pair generation can be achieved. That is, the model becomes $p\big(\text{Question}, \text{Answer}\big|\text{Passage}\big)$, as shown below:
[Figure: Q&A pair generation as $p(\text{Question}, \text{Answer}\,|\,\text{Passage})$, question generated before answer]
Intuitively, however, both "Passage → Answer" and "Passage + Answer → Question" should be easier than "Passage + Question → Answer." We therefore swap the generation order of the question and the answer, modeling $p\big(\text{Answer}, \text{Question}\big|\text{Passage}\big)$ instead, which indeed yields better final results:
[Figure: the final model, $p(\text{Answer}, \text{Question}\,|\,\text{Passage})$, generating the answer first and then the question]
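Concretely, with the answer placed before the question, each training sample can be packed as "[CLS] passage [SEP] answer [SEP] question [SEP]", with segment 0 marking the given passage and segment 1 marking everything the model must generate. A minimal sketch of this packing, assuming a bert4keras-style tokenizer whose encode returns (token_ids, segment_ids) with [CLS]/[SEP] added:

```python
def encode_example(passage, answer, question, tokenizer,
                   max_p_len=256, max_a_len=32, max_q_len=64):
    """Pack one sample as [CLS] passage [SEP] answer [SEP] question [SEP].

    Segment 0 marks the given passage; segment 1 marks the part the
    model must generate (answer first, then question).
    """
    p_ids, _ = tokenizer.encode(passage, maxlen=max_p_len)
    a_ids, _ = tokenizer.encode(answer, maxlen=max_a_len)
    q_ids, _ = tokenizer.encode(question, maxlen=max_q_len)
    # Drop the extra [CLS] of answer/question, keep their trailing [SEP]s.
    token_ids = p_ids + a_ids[1:] + q_ids[1:]
    segment_ids = [0] * len(p_ids) + [1] * (len(token_ids) - len(p_ids))
    return token_ids, segment_ids
```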
Implementation Analysis
The model introduction ends here. There isn't much more to say—simply determine what is input and what is output, and apply the "BERT + UniLM" approach. Below is my reference implementation:
task_question_answer_generation_by_seq2seq.py
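The core of the script is just the standard bert4keras UniLM setup; a minimal sketch (the checkpoint paths are placeholders, and the training loss, token-level cross entropy over the segment-1 part, is omitted for brevity):

```python
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer

# Placeholder paths to a pre-trained Chinese BERT checkpoint.
config_path = 'bert_config.json'
checkpoint_path = 'bert_model.ckpt'
dict_path = 'vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)

# application='unilm' adds the UniLM attention mask on top of BERT,
# turning the bidirectional encoder into a Seq2Seq model.
model = build_transformer_model(
    config_path,
    checkpoint_path,
    application='unilm',
)
model.summary()
```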
The decoding strategy is worth discussing here. In a typical Seq2Seq model, decoding ends upon reaching a [SEP]. However, the model in this article needs to decode until two [SEP] tokens are reached. The text up to the first [SEP] is the answer, while the text between the two [SEP] tokens is the question. Theoretically, many Q&A pairs can be constructed from a given passage; in other words, the target is not unique. Therefore, we cannot use deterministic decoding algorithms like Beam Search. Instead, we should use random decoding algorithms (concepts related to this can be found in the "Decoding Algorithms" section of "How to Deal with the 'Can't Stop' Problem in Seq2Seq?").
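Assuming decoding has produced the token ids that follow the passage (stopping once the second [SEP] appears), splitting them back into answer and question is straightforward; a small illustrative helper:

```python
def split_answer_question(output_ids, sep_id):
    """Split generated ids at the [SEP] tokens: the answer precedes the
    first [SEP]; the question sits between the first and second [SEP]."""
    seps = [k for k, t in enumerate(output_ids) if t == sep_id]
    answer_ids = output_ids[:seps[0]]
    question_ids = output_ids[seps[0] + 1:seps[1]]
    return answer_ids, question_ids

# e.g. sep_id = tokenizer.token_to_id('[SEP]') with a bert4keras Tokenizer
```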
However, the problem is that if a purely random decoding algorithm is used, the generated questions might be too "out of this world"—they might include content irrelevant to the passage. For example, if the passage is "China's Mars probe Tianwen-1 was successfully launched," the generated question might be "What was China's first artificial satellite?" Although related, it is too divergent. Therefore, I suggest a compromise strategy: use random decoding to generate the answer, and then use deterministic decoding to generate the question. This helps ensure the reliability of the question as much as possible. Of course, if readers care more about the diversity of the generated questions, using random decoding for everything is also fine; it's a matter of personal tuning.
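The reference script realizes this compromise as a two-stage decode. A condensed sketch, assuming `model` and `tokenizer` are defined as above and using bert4keras's AutoRegressiveDecoder (random sampling for the answer, beam search for the question):

```python
import numpy as np
from bert4keras.snippets import AutoRegressiveDecoder

class QuestionAnswerGeneration(AutoRegressiveDecoder):
    """Sample the answer randomly, then beam-search the question."""

    @AutoRegressiveDecoder.wraps(default_rtype='probas')
    def predict(self, inputs, output_ids, states):
        token_ids, segment_ids = inputs
        token_ids = np.concatenate([token_ids, output_ids], 1)
        segment_ids = np.concatenate([segment_ids, np.ones_like(output_ids)], 1)
        # Return only the distribution of the next token.
        return model.predict([token_ids, segment_ids])[:, -1]

    def generate(self, passage, topk=5, topp=0.95):
        token_ids, segment_ids = tokenizer.encode(passage, maxlen=256)
        # Stage 1: random (nucleus) sampling for the answer.
        a_ids = self.random_sample([token_ids, segment_ids], n=1, topp=topp)[0]
        token_ids += list(a_ids)
        segment_ids += [1] * len(a_ids)
        # Stage 2: deterministic beam search for the question.
        q_ids = self.beam_search([token_ids, segment_ids], topk=topk)
        return tokenizer.decode(a_ids), tokenizer.decode(q_ids)

qag = QuestionAnswerGeneration(
    start_id=None, end_id=tokenizer._token_end_id, maxlen=64
)
# answer, question = qag.generate('Some input passage')
```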
Readers should also note that the aforementioned reference script does not impose constraints on the answer; thus, the generated answer may not necessarily be a snippet from the passage. After all, this is just a reference implementation and is still some distance from practical application. Interested readers are encouraged to understand and modify the code according to their own needs. Furthermore, since Q&A pair construction has become a pure Seq2Seq problem, any techniques used to improve Seq2Seq performance can be applied to improve the quality of Q&A generation, such as the previously discussed "A Brief Analysis and Countermeasures for the Exposure Bias Phenomenon in Seq2Seq". These are left for the readers to explore.
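For instance, if strictly extractive answers are required, one crude but effective post-filter (a hypothetical helper, not part of the reference script) is to keep only pairs whose answer is a literal substring of the passage:

```python
def filter_extractive(passage, qa_pairs):
    """Keep only (question, answer) pairs whose answer literally
    appears in the passage; discard the rest."""
    return [(q, a) for q, a in qa_pairs if a in passage]
```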
Article Summary
This article is an end-to-end practice of Q&A pair generation: it uses a "BERT + UniLM" Seq2Seq model to generate answers and questions directly from passages, and discusses suitable decoding strategies. Generally speaking, there is nothing unique about the model itself, but because it leverages BERT's pre-trained weights, the quality of the generated Q&A pairs is quite respectable.