By 苏剑林 | April 24, 2017
Chit-chat
Over the past two years, knowledge graphs, question-answering systems, and chatbots have become increasingly popular. "Knowledge graph" is a very broad concept; in my view, anything that involves building, searching, or making use of a knowledge base in a machine learning context can be counted as part of it. This is not a formal definition, of course, just a personal intuition. Readers who work on knowledge graphs will know that triples are one way of structuring knowledge and an important component of knowledge-based question-answering systems. For English there are already several large open-source triple corpora, but evidently no comparable corpus has been shared for Chinese (anyone who has crawled one tends to keep it private).
Some time ago I wrote a crawler for Baidu Baike and ran it for a while, collecting several million entries. Many of these entries contain structured information that can be extracted directly as "triples" usable in a knowledge graph. The triple corpus shared in this article comes from that crawl and contains about 25 million triples.
Baidu Baike Triples
Preview
Each triple has the structure (Entity, Attribute, Value). A short sample is shown below:
Science, encompasses, fields such as nature, society, and thought
Science, foreign name, science
Science, pinyin, kē xué
Science, Chinese name, 科学
Science, explanation, the application and practice of discovered and accumulated truths
Grammar, foreign name, syntactics
Grammar, Chinese name, 语法学
Physical cosmology, object, large-scale structure and cosmic formation
Physical cosmology, time, twentieth century
Physical cosmology, affiliation, astrophysics
Physical cosmology, Chinese name, 物理宇宙学
Cao Yu, birthplace, Tianjin
Cao Yu, alma mater, Tsinghua University
Cao Yu, date of death, December 13, 1996 (Year of Bingzi)
Cao Yu, Chinese name, Wan Jiabao
The file is in CSV format with UTF-8 encoding. The total number of triples is 25,454,710, and the total number of entities is 4,695,579.
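As a rough illustration of how such a file could be read, here is a minimal Python sketch that parses triples and recomputes the two totals. It assumes one triple per row in the column order (Entity, Attribute, Value) with no header row; the actual layout of the download is not documented here, so check it before relying on this. The function names and the sample rows (adapted from the preview above) are purely illustrative.

```python
import csv
import io


def read_triples(handle):
    """Yield (entity, attribute, value) tuples from an open CSV handle.

    Assumption: three columns per row and no header row.
    """
    for row in csv.reader(handle):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        yield tuple(row)


def count_triples_and_entities(triples):
    """Return (number of triples, number of distinct entities)."""
    entities = set()
    total = 0
    for entity, _attribute, _value in triples:
        entities.add(entity)
        total += 1
    return total, len(entities)


# Illustrative rows adapted from the preview, not the real file contents.
SAMPLE = (
    "Science,foreign name,science\n"
    "Science,Chinese name,科学\n"
    "Grammar,foreign name,syntactics\n"
    "Cao Yu,birthplace,Tianjin\n"
)

if __name__ == "__main__":
    print(count_triples_and_entities(read_triples(io.StringIO(SAMPLE))))
```

For the real corpus, the in-memory sample would be replaced by a file opened with `open(path, encoding="utf-8", newline="")`, where `path` points to wherever the download was saved. Because the triples are streamed through a generator, the full file never has to fit in memory at once; only the set of entity names does.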
Because the triples were extracted directly from entries that were edited by hand, the attribute names read like natural language rather than a fixed schema. As a result, descriptions are inconsistent: the same attribute may appear under several different names. For example, "birthplace" and "born in" both denote the place of birth, and "foreign name" and "English name" both denote the English name.
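One simple way to deal with this inconsistency is to map known synonyms onto a single canonical attribute name before using the triples. The sketch below is only an illustration: the alias table contains just the two pairs mentioned above, the choice of which name counts as canonical is arbitrary, and a real table would have to be built by inspecting the attribute names in the corpus.

```python
# Hypothetical alias table: variant attribute name -> canonical name.
# Only the two pairs mentioned in the text are listed; the canonical
# choices here are an assumption, not part of the corpus.
ATTRIBUTE_ALIASES = {
    "born in": "birthplace",
    "English name": "foreign name",
}


def normalize_attribute(attribute, aliases=ATTRIBUTE_ALIASES):
    """Return the canonical name for an attribute, or the attribute itself."""
    cleaned = attribute.strip()
    return aliases.get(cleaned, cleaned)


def normalize_triples(triples, aliases=ATTRIBUTE_ALIASES):
    """Yield triples with their attribute names mapped to canonical form."""
    for entity, attribute, value in triples:
        yield entity, normalize_attribute(attribute, aliases), value


if __name__ == "__main__":
    sample = [("Cao Yu", "born in", "Tianjin")]
    print(list(normalize_triples(sample)))
```

Attributes that are not in the table pass through unchanged, so the table can be extended gradually as more synonyms are found.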
Download
In the spirit of open source, this corpus is shared freely. However, given the effort that went into crawling it, please cite the original article URL, http://kexue.fm/archives/4359/, when reposting or referencing it. Thank you.
Download Address:
Link: https://pan.baidu.com/s/1mkcKP2C
Password: uajy
Category: Resource Sharing | Tags: QA, Corpus, Dataset