By 苏剑林 | April 12, 2017
As is well known, people post huge numbers of questions on Baidu Zhidao and receive correspondingly huge numbers of replies. However, many respondents on Baidu Zhidao are rather lazy: they often prefer to copy and paste large chunks of text from the internet as their answers, and that text may or may not actually be relevant to the question. For example:
https://zhidao.baidu.com/question/557785746.html
Question: How high is the altitude of Guangzhou Baiyun Mountain?
Answer: Guangzhou Baiyun Mountain is the head of the new "Eight Sights of Guangzhou," a national 4A-level scenic spot, and a national key scenic resort. It is located in the northeast of Guangzhou and is one of the famous mountains in South Guangdong, known since ancient times as the "First Show of Guangzhou." The mountain body is quite broad, consisting of more than 30 peaks, and is a branch of the Jiulian Mountains, the highest peak in Guangdong. It covers an area of 20.98 square kilometers, and the main peak, Moxing Ridge, is 382 meters high (Note: the latest surveyed height is 372.6 meters — National Administration of Surveying, Mapping and Geoinformation, 2008). The peaks overlap, and streams crisscross; climbing high allows a bird's-eye view of the entire city and a distant view of the Pearl River. Whenever it clears up after rain or in late spring, white clouds linger among the mountains, creating a magnificent spectacle, hence the name Baiyun (White Cloud) Mountain.
In fact, for this question, only the sentence "the main peak, Moxing Ridge, is 382 meters high" is meaningful. If one were to be even more concise, only "382 meters" is meaningful; the rest is essentially fluff. Indeed, how to extract correct and concise answers for a given question from a large volume of relevant text is a difficult problem for both humans and machines. This requires not only good algorithms but also high-quality datasets for training.
To this end, Baidu utilized Baidu Zhidao and other resources to construct such a dataset, called WebQA, currently in version v1.0:
http://idl.baidu.com/WebQA.html
The paper published by Baidu using this dataset: Peng Li, Wei Li, Zhengyan He, Xuguang Wang, Ying Cao, Jie Zhou, and Wei Xu. 2016. Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering. arXiv:1607.06275.
Thanks to Baidu! Open-source datasets are rare and valuable!
The content in this section is reprinted from http://idl.baidu.com/WebQA.html
\[\begin{array}{c|c|cc|c} \hline & \text{Question} & \text{Annotated Evidence (Positive)} & \text{Annotated Evidence (Negative)} & \text{Retrieved Evidence}\\ \hline \text{Training} & 36,181 & 140,897 & 125,886 & 181,661\\ \text{Validation} & 3,018 & 5,305 & / & 60,351\\ \text{Test} & 3,024 & 5,315 & / & 60,465\\ \hline \end{array}\]
(Figures on the original page show the distributions of question length, evidence length, and answer length; they are not reproduced here.)
The published archive is 267 MB, but its contents are more than we need: besides the raw Q&A it also includes word segmentation results, sequence labeling results, and word vectors, presumably what the research group used directly in their own experiments. For our purposes only the plain Q&A corpus matters, so I have streamlined it and kept just the most basic corpus content:
Link: https://pan.baidu.com/s/1pLXEYtd Password: 6fbf
File List:
WebQA.v1.0/readme.txt
WebQA.v1.0/me_test.ann.json (One question paired with one piece of evidence; evidence contains the answer)
WebQA.v1.0/me_test.ir.json (One question paired with multiple pieces of evidence; evidence may or may not contain the answer)
WebQA.v1.0/me_train.json (Mixed training corpus)
WebQA.v1.0/me_validation.ann.json (One question paired with one piece of evidence; evidence contains the answer)
WebQA.v1.0/me_validation.ir.json (One question paired with multiple pieces of evidence; evidence may or may not contain the answer)

The difference between test and validation is that, in principle, the distribution of validation is closer to that of train; validation is generally used to check a model's accuracy, while test is used to check how well the model transfers. The difference between ann and ir is that ir provides multiple pieces of evidence for each question, so a more reliable answer can be obtained by voting across the different evidence segments (see the sketch below), whereas ann pairs each question with a single piece of evidence and is the test set that truly probes reading comprehension ability.
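To make the voting idea concrete, here is a minimal Python sketch. The predict_answer function is a hypothetical stand-in for whatever extraction model you use (it is not part of the dataset); given a question and one piece of evidence, it is assumed to return an answer string, or 'no_answer' when the evidence does not contain the answer.

from collections import Counter

def vote_answer(question, evidences, predict_answer):
    # evidences: the 'evidences' dict of one ir-style record,
    # i.e. {evidence_id: {'evidence': ..., 'answer': ...}, ...}
    votes = Counter()
    for ev in evidences.values():
        ans = predict_answer(question, ev['evidence'])
        if ans != 'no_answer':   # ignore evidences judged not to contain the answer
            votes[ans] += 1
    # fall back to 'no_answer' when no evidence produced an answer
    return votes.most_common(1)[0][0] if votes else 'no_answer'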
The organized data format is as follows, taking me_train.json as an example:
1. If read with Python's json library, you get a dictionary me_train whose keys are question IDs such as Q_TRN_010878;
2. a single record is obtained via me_train['Q_TRN_010878'], which is also a dictionary with two keys: question and evidences;
3. me_train['Q_TRN_010878']['question'] gives the text of the question, e.g. "Who played the father of Huo Xiaolin in Brave Heart?";
4. evidences contains the material and the corresponding answers for the question; it is again a dictionary, with keys that are IDs such as Q_TRN_010878#06;
5. me_train['Q_TRN_010878']['evidences']['Q_TRN_010878#05'] gives a single entry, which is yet another dictionary with two keys: evidence and answer;
6. evidence is the corresponding material, e.g. "A: 'Brave Heart' Huo Shaochang and Madam Hua's son Yang Zhigang as Huo Xiaolin [Abstract] Male lead, Young Master Huo; Starring Kou Zhenhai as Huo Shaochang [Abstract] Huo Xiaolin's father 'Juren'; Starring Shi Ke as Madam Hua [Abstract] Mother of Huo Xiaolin and Zhao Shucheng, Starring", and answer is a list of answers (there may be more than one), e.g. [u'寇振海']. If the evidence does not contain the answer, the answer is [u'no_answer'].
These all correspond one-to-one with the original dataset.
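The steps above translate directly into code. A minimal Python 3 sketch of the traversal, using only the standard json library (the file name and record IDs are the examples from the list above; the file is assumed to be UTF-8 encoded):

import json

# step 1: the file is one big JSON object keyed by question IDs such as 'Q_TRN_010878'
with open('WebQA.v1.0/me_train.json', encoding='utf-8') as f:
    me_train = json.load(f)

record = me_train['Q_TRN_010878']   # step 2: a dict with keys 'question' and 'evidences'
print(record['question'])           # step 3: the question text

# steps 4-6: walk the evidences, each keyed by an ID such as 'Q_TRN_010878#05'
for eid, ev in record['evidences'].items():
    print(eid, ev['answer'])        # 'answer' is a list; ['no_answer'] if the evidence lacks the answer
    print(ev['evidence'])           # the supporting text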