By 苏剑林 | December 05, 2019
Today, I added a new example to bert4keras: reading comprehension question answering (task_reading_comprehension_by_seq2seq.py). The corpora are the same as before, WebQA and SogouQA, and the final score is around 0.77 (single model, without careful tuning).
Since the primary purpose this time was to add a demo to bert4keras, efficiency was not the main concern; the goals were generality and ease of use. So I adopted the most "universal" solution: using seq2seq to implement reading comprehension.
When using seq2seq, you basically don't need to worry about model design; you just concatenate the passage and the question and then predict the answer. Furthermore, the seq2seq approach naturally includes a method for determining whether a passage contains an answer and naturally leads to a multi-passage voting strategy. In short, if efficiency is not considered, seq2seq is a very elegant solution for reading comprehension.
This implementation of seq2seq still uses the UNILM scheme. Readers who are not familiar with it can read "From Language Models to Seq2Seq: Transformer as a Play, All in the Mask" to understand the corresponding content.
Building a seq2seq model using the UNILM scheme in bert4keras is basically a one-line task. Therefore, the main work in this example is not on building the model itself, but on handling the input and output.
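For reference, here is a minimal sketch of that "one line" (module and function names follow recent bert4keras versions; `config_path` and `checkpoint_path` are placeholders pointing to a pretrained BERT):

```python
# Minimal sketch: turn BERT into a UNILM-style seq2seq model with bert4keras.
from bert4keras.models import build_transformer_model

model = build_transformer_model(
    config_path,          # path to the pretrained BERT config (placeholder)
    checkpoint_path,      # path to the pretrained BERT checkpoint (placeholder)
    application='unilm',  # apply the UNILM attention mask for seq2seq
)
```

On top of this, the full example also attaches a cross-entropy loss computed only over the answer (segment 1) positions, so that training penalizes only the generated part.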
First is the input. The input format is simple and can be clearly explained with a diagram:
Illustration of Reading Comprehension Model using seq2seq
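In code, the layout in the diagram amounts to roughly the following (a hypothetical sketch using bert4keras's `Tokenizer`; `dict_path`, `passage`, `question` and `answer` are placeholders):

```python
# Hypothetical sketch of the input layout: passage and question share segment 0,
# while the answer occupies segment 1, which the UNILM mask treats as "to generate".
from bert4keras.tokenizers import Tokenizer

tokenizer = Tokenizer(dict_path, do_lower_case=True)  # dict_path: BERT vocab file

pq_ids, _ = tokenizer.encode(passage, question)  # [CLS] passage [SEP] question [SEP]
a_ids, _ = tokenizer.encode(answer)              # [CLS] answer [SEP]

token_ids = pq_ids + a_ids[1:]                   # drop the answer's leading [CLS]
segment_ids = [0] * len(pq_ids) + [1] * (len(a_ids) - 1)
```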
If the input is a single passage with a single question to answer, decoding can simply follow the standard seq2seq scheme: beam search.
However, WebQA and SogouQA are designed for the search scenario, where multiple passages are available to answer the same question, so a voting strategy must be chosen. A naive idea is to combine each passage with the question, decode each separately with beam search to obtain an answer and a confidence score, and finally vote following the method in "DGCNN: A Reading Comprehension-based Question Answering Model". The difficulty with this approach is assigning a reasonable confidence to each answer; it is less natural than the scheme presented below, and also slightly less efficient.
Here, we provide a scheme that is more "integrated" with beam search:
First exclude the passages that contain no answer; then, when decoding each character of the answer, directly average (in a particular way) the probability distributions predicted from all remaining passages.
Specifically, each passage is concatenated with the question, and each produces its own probability distribution for the first character of the answer. Passages whose predicted first character is [SEP] are judged to have no answer and are excluded. The first-character distributions of the remaining passages are then averaged, and the top-k candidates are retained (the standard beam search step). To predict the second character, each remaining passage is combined with each of the top-k candidates to produce its own second-character distribution; these are again averaged over passages to yield the next top-k. This repeats until [SEP] is generated. (Essentially, it is standard beam search plus averaging by passage. If you still can't quite grasp it, you'll have to check the source code~)
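As a concrete reference, here is a minimal NumPy sketch of this averaged beam search. It is not the exact implementation from the linked script; `predict_fn(context, partial_ids)` is an assumed helper returning the model's next-token distribution for one passage-question pair given the answer decoded so far, and `end_id` is the token id of [SEP]:

```python
import numpy as np

def averaged_beam_search(predict_fn, contexts, end_id, topk=3, max_len=32):
    """Beam search over one question with several candidate passages."""
    # 1. First-character distributions; a passage whose most likely first
    #    character is [SEP] is treated as having no answer and dropped.
    first = [predict_fn(c, []) for c in contexts]
    keep = [i for i, p in enumerate(first) if np.argmax(p) != end_id]
    if not keep:
        return []  # every passage says "no answer"
    contexts = [contexts[i] for i in keep]

    # 2. Average the surviving first-character distributions, keep top-k.
    probs = np.mean([first[i] for i in keep], axis=0)
    top = probs.argsort()[-topk:]
    candidates = [[int(t)] for t in top]
    scores = np.log(probs[top] + 1e-12)

    # 3. Standard beam search, except that each step's distribution is the
    #    average over all surviving passages.
    for _ in range(max_len - 1):
        step_log = []
        for cand in candidates:
            avg = np.mean([predict_fn(c, cand) for c in contexts], axis=0)
            step_log.append(np.log(avg + 1e-12))
        total = scores[:, None] + np.array(step_log)  # (topk, vocab_size)
        flat = total.reshape(-1)
        best = flat.argsort()[-topk:]
        vocab = total.shape[1]
        candidates = [candidates[b // vocab] + [int(b % vocab)] for b in best]
        scores = flat[best]
        if candidates[int(np.argmax(scores))][-1] == end_id:
            break  # simplification: stop once the best beam emits [SEP]

    return candidates[int(np.argmax(scores))]
```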
In addition, there should be two ways to produce answers: one is extractive, where the answer must be a fragment of the passage; the other is generative, where the answer is decoded freely without requiring it to be a passage fragment. Both modes have corresponding logic in this example's decoding process.
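One possible way to enforce the extractive constraint during decoding is to index the passage's n-grams and zero out any next token that would not extend one of them, so the decoded answer can only be a passage fragment. The sketch below is a simplified illustration, not the script's exact code; `get_ngram_set` and `mask_to_passage` are illustrative names:

```python
import numpy as np

def get_ngram_set(token_ids, n):
    """Map each (n-1)-gram of the passage to the set of tokens following it."""
    result = {}
    for i in range(len(token_ids) - n + 1):
        key = tuple(token_ids[i:i + n - 1])
        result.setdefault(key, set()).add(token_ids[i + n - 1])
    return result

def mask_to_passage(probs, passage_ids, prev, n=3):
    """Zero out any next token that would not extend a passage n-gram.

    probs: next-token distribution; prev: tokens generated so far.
    With n=1, any token occurring in the passage is allowed. (A fuller
    version would fall back to smaller n while prev is shorter than n-1.)
    """
    key = tuple(prev[-(n - 1):]) if n > 1 else ()
    allowed = get_ngram_set(passage_ids, n).get(key, set())
    mask = np.zeros_like(probs)
    mask[list(allowed)] = 1.0
    return probs * mask
```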
Code link: task_reading_comprehension_by_seq2seq.py
Ultimately, on the evaluation script provided by SogouQA, the validation set score is approximately 0.77 (Accuracy=0.7259005836184343, F1=0.813860036706151, Final=0.7698803101622926). This single-model result far exceeds the previous "Open Source Edition of DGCNN Reading Comprehension QA Model (Keras Version)". Of course, the improvement comes at a cost: inference is much slower, at only about 2 samples per second.
(The model has not been carefully tuned, so there is likely still room for improvement; the current focus is mainly the demo itself.)
This article mainly provides an example of reading comprehension based on BERT and seq2seq ideas, and presents a multi-passage voting beam search strategy for readers' reference and testing~