By 苏剑林 | January 06, 2017
Among Chinese corpora, the highest in quality and easiest to obtain is probably the Chinese Wikipedia corpus. Wikipedia is also quite generous: every month it packages up all of its entries (download link: https://dumps.wikimedia.org/zhwiki/) for anyone in the world to use. This is truly "taken from the people, given back to the people." Regrettably, thanks to the unreasonable blockade of the Great Firewall, Chinese Wikipedia currently has only a little over 910,000 entries, while Baidu Baike and Hudong Baike each have tens of millions (the English Wikipedia also has over ten million). Even so, Chinese Wikipedia is arguably still the highest-quality Chinese corpus available. (Baidu Baike and Hudong Baike can only be obtained by crawling, and many of their entries are of rather poor quality, often fragments copied back and forth or outright plagiarized.)
Threshold
While downloading is easy, there is a certain threshold to actually using the Wikipedia corpus. What you download from Wikipedia is a compressed dump full of HTML tags and wiki markup, which is essentially unusable as-is. Fortunately, helpful people have already written processing tools, chiefly two: 1. Wikipedia Extractor; 2. gensim's wikicorpus module. Both are Python-based.
However, neither of these two mainstream approaches satisfies me. First, the output of Wikipedia Extractor drops everything inside {{}} markers, which leads to sentences like the following:
In Western languages, the word "mathematics" (; ) originates from Ancient Greek ()
This happens because the words inside those parentheses were wrapped in {{}} markers and got stripped out. Following the common online tutorials and using gensim.corpora.wikicorpus.WikiCorpus directly is even worse, since it removes all punctuation. For someone with a "quality obsession" chasing a high-quality corpus, that is unacceptable. So I wrote my own processing script on top of gensim.
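For context, the usual tutorial recipe looks roughly like the sketch below (the dump file name is a placeholder, not part of the original post). The point is that get_texts() already yields each article as a bare list of tokens, so the punctuation is gone before you ever see the text:

```python
# A sketch of the common WikiCorpus recipe; the dump file name is a placeholder.
# get_texts() yields each article as a bare list of tokens, so punctuation and
# sentence structure are already gone by the time you see the text.
from gensim.corpora.wikicorpus import WikiCorpus

wiki = WikiCorpus('zhwiki-latest-pages-articles.xml.bz2', dictionary={})
for tokens in wiki.get_texts():
    print(tokens[:20])  # a punctuation-free stream of tokens
    break
```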
Code
# (The original post contains a Python script here)
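Since the original script is not reproduced here, the following is a minimal reconstruction of the pipeline described in the Notes below. It assumes the dump file name zhwiki-latest-pages-articles.xml.bz2, an output file wiki.txt, and an OpenCC Python binding exposing an OpenCC('t2s') converter; the regular expressions are illustrative rather than the author's originals:

```python
# Minimal reconstruction of the preprocessing pipeline described in the Notes.
# Assumptions: dump and output file names, and the OpenCC('t2s') interface of
# the current opencc Python package; the regexes are illustrative.
import re
import bz2file
from tqdm import tqdm
from opencc import OpenCC
from gensim.corpora.wikicorpus import extract_pages, filter_wiki

cc = OpenCC('t2s')  # Traditional -> Simplified

def clean_page(title, text):
    # Drop non-text wiki structures such as tables and galleries.
    text = re.sub(r':*\{\|[\s\S]*?\|\}', '', text)
    text = re.sub(r'<gallery>[\s\S]*?</gallery>', '', text)
    # Rewrite "useful" {{...|...}} templates as [[...]] links so that
    # filter_wiki keeps their display text instead of deleting them.
    text = re.sub(r'(.)\{\{([^{}\n]*?\|[^{}\n]*?)\}\}', r'\1[[\2]]', text)
    text = filter_wiki(text)
    # Tidy up the line breaks left behind after markup removal.
    text = re.sub(r'\n+', '\n', text)
    return cc.convert(title) + '\n' + cc.convert(text).strip()

pages = extract_pages(bz2file.open('zhwiki-latest-pages-articles.xml.bz2'))
with open('wiki.txt', 'w', encoding='utf-8') as out:
    for page in tqdm(pages):                    # tqdm shows live progress
        title, text = page[0], page[1]
        if re.findall('^[a-zA-Z]+:', title):    # skip help/namespace pages
            continue
        if not text or re.findall('^#', text):  # skip redirect pages
            continue
        out.write(clean_page(title, text) + '\n\n')
```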
Notes
As you can see, the bulk of the code is regular expressions. First, bz2file is used to read the downloaded dump without decompressing it, and gensim's extract_pages pulls out each page. For every page, we first deal with some special non-text markup, then rewrite the useful {{}} templates as [[]] links, because [[]] markup is not cleared away entirely (a quick way to verify this is shown below). After that, gensim's filter_wiki does the actual cleaning, we tidy up the line breaks, and finally opencc converts Traditional Chinese to Simplified Chinese.
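To check that behaviour yourself, you can run filter_wiki on the two kinds of markup directly; the exact output may vary slightly between gensim versions:

```python
# Quick check of how filter_wiki treats the two kinds of markup
# (outputs may differ slightly across gensim versions).
from gensim.corpora.wikicorpus import filter_wiki

print(filter_wiki('古希腊语[[数学|mathematics]]一词'))        # link display text is kept
print(filter_wiki('古希腊语{{lang|el|μαθηματικά}}一词'))      # template content is dropped
```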
In the loop that follows, the condition re.findall('^[a-zA-Z]+:', d[0]) filters out help pages (and other pages whose titles carry a namespace prefix), and re.findall(u'^#', d[1]) filters out redirect pages. In the end, roughly 919,000 pages remain. tqdm is used to display progress, which I consider a must-have. The script took about 40 minutes on my machine and produced a plain-text corpus of about 1.5 GB. The running time hardly matters, since preprocessing is a one-off task.
Note that opencc should not be installed with sudo apt-get install opencc, because the packaged version is too old. Instead, build and install it from source, and then run pip install opencc to get the Python interface. If calling opencc from Python produces a "Segmentation fault", run:
# (Original post provides a bash fix here)
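Once the shared library and the Python binding are in place, a quick sanity check looks roughly like this. The snippet assumes the current OpenCC('t2s') interface; the older wrapper used at the time of the post exposed a slightly different API:

```python
# Sanity check for the Traditional -> Simplified conversion.
# Assumes the current OpenCC('t2s') interface; older opencc wrappers differ.
from opencc import OpenCC

cc = OpenCC('t2s')
print(cc.convert('數學是研究數量、結構與變化的學科'))
# Expected: the same sentence printed in Simplified characters.
```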
Byproduct
A redirect, as mentioned above, means that two titles refer to the same thing. I extracted all of the redirects in Chinese Wikipedia and built a mapping table: the two terms on each line of the list are synonyms. Consider it a byproduct of the preprocessing.
Synonym table based on Chinese Wikipedia redirections: wiki_cn_mapping.7z
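For readers who want to regenerate such a table themselves, a rough sketch of the idea (not the script that produced the file above; file names are placeholders) is to keep exactly the redirect pages that the main loop throws away and pull the target title out of the #REDIRECT / #重定向 markup:

```python
# Rough sketch: extract "redirect title -> target title" pairs from the dump.
# Not the script that produced wiki_cn_mapping.7z; file names are placeholders.
import re
import bz2file
from gensim.corpora.wikicorpus import extract_pages

# Matches both the English and Chinese redirect keywords, e.g.
# "#REDIRECT [[目标条目]]" or "#重定向 [[目标条目]]".
redirect_re = re.compile(r'^#(?:REDIRECT|重定向)\s*\[\[([^\]|#]+)', re.IGNORECASE)

pages = extract_pages(bz2file.open('zhwiki-latest-pages-articles.xml.bz2'))
with open('redirects.txt', 'w', encoding='utf-8') as out:
    for page in pages:
        title, text = page[0], page[1]
        if not text:
            continue
        m = redirect_re.match(text.strip())
        if m:
            out.write('%s\t%s\n' % (title, m.group(1)))
```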