[Corpus] Baidu's Chinese Question Answering Dataset WebQA

By 苏剑林 | April 12, 2017

Information Extraction

As is well known, huge numbers of questions are posted on Baidu Zhidao and receive an equally huge number of replies. However, many respondents on Baidu Zhidao seem rather lazy: they often simply copy and paste large chunks of text from the internet as their answers, and that content may or may not be relevant to the question. For example:

https://zhidao.baidu.com/question/557785746.html

Question: How high is the altitude of Guangzhou Baiyun Mountain?

Answer: Guangzhou Baiyun Mountain is the head of the new "Eight Sights of Guangzhou," a national 4A-level scenic spot, and a national key scenic resort. It is located in the northeast of Guangzhou and is one of the famous mountains in South Guangdong, known since ancient times as the "First Show of Guangzhou." The mountain body is quite broad, consisting of more than 30 peaks, and is a branch of the Jiulian Mountains, the highest peak in Guangdong. It covers an area of 20.98 square kilometers, and the main peak, Moxing Ridge, is 382 meters high (Note: the latest surveyed height is 372.6 meters — National Administration of Surveying, Mapping and Geoinformation, 2008). The peaks overlap, and streams crisscross; climbing high allows a bird's-eye view of the entire city and a distant view of the Pearl River. Whenever it clears up after rain or in late spring, white clouds linger among the mountains, creating a magnificent spectacle, hence the name Baiyun (White Cloud) Mountain.

In fact, for this question, only the sentence "the main peak, Moxing Ridge, is 382 meters high" is meaningful. If one were to be even more concise, only "382 meters" is meaningful; the rest is essentially fluff. Indeed, how to extract correct and concise answers for a given question from a large volume of relevant text is a difficult problem for both humans and machines. This requires not only good algorithms but also high-quality datasets for training.

WebQA

To this end, Baidu utilized Baidu Zhidao and other resources to construct such a dataset, called WebQA, currently in version v1.0:

http://idl.baidu.com/WebQA.html

The paper published by Baidu using this dataset: Peng Li, Wei Li, Zhengyan He, Xuguang Wang, Ying Cao, Jie Zhou, and Wei Xu. 2016. Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering. arXiv:1607.06275.

Thanks to Baidu! Open-source datasets are rare and valuable!

Data Overview

The content in this section is reprinted from http://idl.baidu.com/WebQA.html

\[\begin{array}{c|c|cc|c} \hline & & \multicolumn{2}{c|}{\text{Annotated Evidence}} & \\ & \text{Question} & \text{Positive} & \text{Negative} & \text{Retrieved Evidence}\\ \hline \text{Training} & 36{,}181 & 140{,}897 & 125{,}886 & 181{,}661\\ \text{Validation} & 3{,}018 & 5{,}305 & / & 60{,}351\\ \text{Test} & 3{,}024 & 5{,}315 & / & 60{,}465\\ \hline \end{array}\]

[Figures: distributions of question length, evidence length, and answer length]

The published file is 267MB, which is more than we actually need: it also contains word segmentation results, sequence labeling results, and word vectors, presumably what the research group fed directly into their own experiments. For our purposes, only the plain Q&A corpus matters, so I have streamlined it, retaining just the most basic corpus content:

Clean Version

Link: https://pan.baidu.com/s/1pLXEYtd Password: 6fbf

File List:
WebQA.v1.0/readme.txt
WebQA.v1.0/me_test.ann.json (One question paired with one piece of evidence; evidence contains the answer)
WebQA.v1.0/me_test.ir.json (One question paired with multiple pieces of evidence; evidence may or may not contain the answer)
WebQA.v1.0/me_train.json (Mixed training corpus)
WebQA.v1.0/me_validation.ann.json (One question paired with one piece of evidence; evidence contains the answer)
WebQA.v1.0/me_validation.ir.json (One question paired with multiple pieces of evidence; evidence may or may not contain the answer)

The difference between test and validation is that, in principle, the distribution of validation is closer to that of train: validation is generally used to check model accuracy, while test is used to check how well the model transfers. The difference between ann and ir is that ir provides multiple pieces of evidence for each question, so a more reliable answer can be obtained by voting across the different pieces of evidence, whereas ann pairs each question with a single piece of evidence and is therefore the set that truly tests reading comprehension.
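To make the voting idea concrete, here is a minimal sketch. The predict_answer function is a hypothetical stand-in for whatever QA model you use; it is not part of the dataset or of Baidu's code:

```python
from collections import Counter

def answer_by_voting(question, evidences, predict_answer):
    """Predict an answer from each piece of evidence, then take a majority vote.

    `predict_answer(question, evidence_text)` is a hypothetical model call that
    returns an answer string, or 'no_answer' when it finds nothing.
    """
    votes = Counter()
    for ev in evidences.values():
        ans = predict_answer(question, ev['evidence'])
        if ans != 'no_answer':
            votes[ans] += 1
    return votes.most_common(1)[0][0] if votes else 'no_answer'
```

Applied to one entry of me_test.ir.json, this aggregates the per-evidence predictions into a single answer; on the ann files, where each question has only one piece of evidence, the vote degenerates to a single prediction.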

The organized data format is as follows, taking me_train.json as an example:

1. If read using Python's json library, you get a dictionary me_train where the dictionary keys are question IDs like Q_TRN_010878;

2. A single record is obtained via me_train['Q_TRN_010878'], which is also a dictionary containing two keys: question and evidences;

3. me_train['Q_TRN_010878']['question'] retrieves the text content of the question, e.g., "Who played the father of Huo Xiaolin in Brave Heart?";

4. evidences contains the material and corresponding answers for the question, also formatted as a dictionary, with keys being IDs like Q_TRN_010878#06;

5. me_train['Q_TRN_010878']['evidences']['Q_TRN_010878#05'] retrieves a single record, which is another dictionary with two keys: evidence and answer;

6. evidence is the corresponding material, e.g., "A: 'Brave Heart' Huo Shaochang and Madam Hua's son Yang Zhigang as Huo Xiaolin [Abstract] Male lead, Young Master Huo; Starring Kou Zhenhai as Huo Shaochang [Abstract] Huo Xiaolin's father 'Juren'; Starring Shi Ke as Madam Hua [Abstract] Mother of Huo Xiaolin and Zhao Shucheng, Starring". answer is a list of answers (since there may be multiple), e.g., [u'寇振海']. If the evidence does not contain the answer, the answer is [u'no_answer'].

These all correspond one-to-one with the original dataset.
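Putting the steps above together, a minimal access sketch might look as follows (the question ID is the one from the example; any other ID in the file works the same way):

```python
import json

# Step 1: load the streamlined training file; it is a dict keyed by question IDs.
with open('WebQA.v1.0/me_train.json', encoding='utf-8') as f:
    me_train = json.load(f)

# Step 2: one record per question, a dict with 'question' and 'evidences'.
record = me_train['Q_TRN_010878']

# Step 3: the question text itself.
print(record['question'])

# Steps 4-6: each evidence entry has 'evidence' text and an 'answer' list,
# which is ['no_answer'] when the evidence does not contain the answer.
for ev_id, ev in record['evidences'].items():
    print(ev_id, ev['answer'])
```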