By 苏剑林 | September 09, 2019
New word discovery is one of the foundational tasks in NLP. It aims to identify character sequences in a corpus that could potentially be "new words", by discovering linguistic features (mainly statistical features) in an unsupervised way. This site has published several articles on the topic of "new word discovery". Among them, the one whose theory I consider the most elegant is "Unsupervised Word Segmentation based on Language Models", while the one with better overall performance as a new word discovery algorithm is "A Better New Word Discovery Algorithm". This article is a reimplementation of the algorithm from that post.
When I wrote "[Chinese Word Segmentation Series] 8. A Better New Word Discovery Algorithm", I already provided a basic implementation and verified it. However, the original version was written in pure Python and was only intended for quick verification of the effect, so it was written quite casually and suffered from serious efficiency issues. I recently recalled this and felt it would be a waste to neglect the algorithm, so I rewrote it. I utilized some tools and techniques to significantly increase the speed.
By the way, "new word discovery" is a colloquial term; a more accurate name would be "unsupervised lexicon construction", because in principle it can build an entire dictionary from scratch rather than just "new words". Of course, you can compare the result with a common dictionary and remove the existing words to obtain the new ones.
The main change is this: I now use the count_ngrams program from the language-modeling toolkit kenlm to count the ngrams. Since kenlm is written in C++, its speed is guaranteed, and it has been optimized to be very memory-friendly. The open-source code for this version is available at: https://github.com/bojone/word-discovery. Note that this script can only be used on Linux systems; if you want to use it on Windows, you will likely need to make some modifications. As for what specific modifications are needed, I don't know, so please resolve that yourself.
Note that while the algorithm itself can theoretically apply to any language, the implementation in this article is, in principle, only suitable for languages where the "character" (字) is the basic unit.
Before deciding to use this library, it is recommended that the reader spend some time reading "[Chinese Word Segmentation Series] 8. A Better New Word Discovery Algorithm" to gain a basic understanding of the algorithm steps, in order to better understand the usage steps below.
The core script on GitHub is word_discovery.py, which contains the complete implementation and usage examples. Let's briefly walk through the example.
First, write a generator for the corpus that returns the corpus line by line:
import re
import glob
# Corpus generator, and preliminary preprocessing of the corpus
# The specific meaning of this generator example is not important;
# you just need to know that it yields the text line by line.
def text_generator():
    txts = glob.glob('/root/thuctc/THUCNews/*/*.txt')
    for txt in txts:
        d = open(txt).read()
        d = d.decode('utf-8').replace(u'\u3000', ' ').strip()
        yield re.sub(u'[^\u4e00-\u9fa50-9a-zA-Z ]+', '\n', d)
You do not need to understand exactly what this generator is doing; just know that it yields the raw corpus text line by line. If you don't know how to write a generator, please learn that on your own. And please do not discuss questions like "what should the corpus format be" or "how do I change this to fit my corpus" under this article, thank you.
Incidentally, because it is unsupervised training, the corpus is generally "the bigger, the better"—several hundred MB to several GB is fine. But actually, if you only have a few MB of corpus (such as a novel), you can also test it directly and see basic results (though you may need to modify the parameters below).
Once you have the generator, configure some parameters, and then you can execute each step sequentially:
min_count = 32
order = 4
corpus_file = 'thucnews.corpus' # Filename for saved corpus
vocab_file = 'thucnews.chars' # Filename for saved character set
ngram_file = 'thucnews.ngrams' # Filename for saved ngrams
output_file = 'thucnews.vocab' # Filename for final exported dictionary
write_corpus(text_generator(), corpus_file) # Save corpus as text
count_ngrams(corpus_file, order, vocab_file, ngram_file) # Use Kenlm to count ngrams
ngrams = KenlmNgrams(vocab_file, ngram_file, order, min_count) # Load ngrams
ngrams = filter_ngrams(ngrams.ngrams, ngrams.total, [0, 2, 4, 6]) # Filter ngrams
Note that kenlm requires a space-separated, plain-text corpus as input, and the write_corpus function produces exactly that. count_ngrams then calls kenlm's count_ngrams program to count the ngrams, so you need to compile kenlm yourself and place its count_ngrams executable in the same directory as word_discovery.py. If you have a Linux environment, compiling kenlm is quite simple; I have discussed kenlm here before, and you can refer to that.
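For reference, here is a rough sketch of what these two helpers do; this is only an illustration of the idea (the authoritative implementations are in word_discovery.py, and I am assuming the -o and --write_vocab_list options of the count_ngrams binary behave as shown):

import os
import io

# Sketch only: see word_discovery.py for the real implementations.
def write_corpus(texts, filename):
    # Write the corpus as plain text with a space between every character,
    # which is the format kenlm's count_ngrams reads from standard input.
    with io.open(filename, 'w', encoding='utf-8') as f:
        for s in texts:
            f.write(u' '.join(s) + u'\n')

def count_ngrams(corpus_file, order, vocab_file, ngram_file):
    # Call kenlm's count_ngrams executable (assumed to sit next to this
    # script) to count all ngrams up to the given order.
    return os.system(
        './count_ngrams -o %s --write_vocab_list %s <%s >%s'
        % (order, vocab_file, corpus_file, ngram_file)
    )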
After count_ngrams has run, the results are saved in a binary file, which KenlmNgrams then reads. If your input corpus is large, this step requires a significant amount of memory. Finally, filter_ngrams filters the ngrams: [0, 2, 4, 6] are the mutual information thresholds, where the leading 0 is a meaningless placeholder, and 2, 4, and 6 are the thresholds for 2-grams, 3-grams, and 4-grams respectively; generally they should increase monotonically.
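To make the role of these thresholds concrete, here is a hedged sketch of the kind of mutual-information filter involved; the function name is hypothetical, and the actual filter_ngrams in word_discovery.py may differ in details:

import math

# Illustrative sketch; assumes every sub-ngram of a counted ngram is itself
# present in the `ngrams` count dictionary.
def filter_ngrams_sketch(ngrams, total, min_pmi=(0, 2, 4, 6)):
    total = float(total)
    output = set()
    for w, freq in ngrams.items():
        n = len(w)
        if n < 2 or n > len(min_pmi):
            continue
        # minimum pointwise mutual information over all binary splits of w
        pmi = min(
            math.log(total * freq / (ngrams[w[:i]] * ngrams[w[i:]]))
            for i in range(1, n)
        )
        if pmi >= min_pmi[n - 1]:
            output.add(w)
    return output

With [0, 2, 4, 6], a 2-gram therefore has to clear a PMI of 2 (it has only one split), a 3-gram has to clear 4 at every split, and a 4-gram has to clear 6.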
At this point, we have completed all the "preparatory work." Now we can proceed to build the lexicon. First, build an ngram Trie tree, and then use this Trie tree to perform a basic "pre-segmentation":
ngtrie = SimpleTrie() # Build the ngram Trie tree
for w in Progress(ngrams, 100000, desc=u'build ngram trie'):
    _ = ngtrie.add_word(w)

candidates = {} # Get candidate words
for t in Progress(text_generator(), 1000, desc='discovering words'):
    for w in ngtrie.tokenize(t): # Pre-segmentation
        candidates[w] = candidates.get(w, 0) + 1
This pre-segmentation process was introduced in "[Chinese Word Segmentation Series] 8. A Better New Word Discovery Algorithm". In short, it is somewhat like maximum matching, where ngram fragments are connected into the longest possible candidate words.
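To give a feel for what this "longest connection" segmentation does, here is an illustrative sketch of a trie-based tokenizer in that spirit; the SimpleTrie in word_discovery.py is the authoritative implementation and may differ in details:

class SimpleTrieSketch:
    # Store the retained ngrams in a trie, then greedily chain overlapping
    # ngram fragments into the longest possible candidate segments.
    def __init__(self):
        self.dic = {}

    def add_word(self, word):
        node = self.dic
        for c in word:
            node = node.setdefault(c, {})
        node[True] = word  # the key True marks "an ngram ends here"

    def tokenize(self, sent):
        result = []
        start, end = 0, 1
        for i in range(len(sent)):
            if i == end:  # no ngram bridges past `end`, so cut a segment here
                result.append(sent[start:end])
                start, end = i, i + 1
            node = self.dic
            for j in range(i, len(sent)):
                if sent[j] not in node:
                    break
                node = node[sent[j]]
                if True in node and j + 1 > end:
                    end = j + 1  # an ngram ending at j+1 extends the segment
        result.append(sent[start:end])
        return result

For example, if the trie contains the 2-grams "ab" and "bc", the string "abcd" is cut into "abc" and "d": the two overlapping fragments chain together, while "d" is not covered by any ngram.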
Finally, filter the candidate words again, and you can save the dictionary:
# Frequency filtering
candidates = {i: j for i, j in candidates.items() if j >= min_count}
# Mutual information filtering (backtracking)
candidates = filter_vocab(candidates, ngrams, order)
# Output result file
with open(output_file, 'w') as f:
    for i, j in sorted(candidates.items(), key=lambda s: -s[1]):
        s = '%s %s\n' % (i.encode('utf-8'), j)
        f.write(s)
Readers have previously complained that the algorithms I write lack standard evaluations, so I performed a simple evaluation this time with the script evaluate.py.
Specifically, using THUCNews as the base corpus, I built a dictionary with the script above (it took roughly 40 minutes). Keeping only the top 50,000 words, I loaded this 50,000-word dictionary into Jieba (without its built-in dictionary and with its own new word discovery disabled), which gives a segmentation tool based purely on an unsupervised dictionary. I then used this tool to segment the test set provided by Bakeoff 2005 and scored the output with its evaluation script. The final score on the PKU test set was:
$$\begin{array}{c|c|c} \hline \text{RECALL} & \text{PRECISION} & \text{F1}\\ \hline 0.777 & 0.711 & 0.742\\ \hline\end{array}$$

In other words, it can achieve an F1 score of 0.742. What level is this? An article from ICLR 2019 called "Unsupervised Word Discovery with Segmental Neural Language Models" mentioned that its result on the same test set was $F1=0.731$. Looking at it this way, the result of this algorithm is not worse than the state-of-the-art results from top conferences. Readers can download the THUCNews corpus themselves to completely replicate the above results.
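For readers who want to reproduce this evaluation setup, the wiring can be sketched roughly as follows; evaluate.py in the repository is the authoritative version, and thucnews.vocab below is the dictionary exported earlier (trimmed to its top 50,000 lines beforehand):

import jieba

# Replace jieba's built-in dictionary with the unsupervised one exported above,
# then segment with HMM=False so that jieba's own new word discovery stays off.
jieba.set_dictionary('thucnews.vocab')
jieba.initialize()

def tokenize(text):
    return jieba.lcut(text, HMM=False)

print(' '.join(tokenize(u'新词发现是NLP中的基础任务之一')))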
Additionally, more data can achieve better results. This is a dictionary I extracted from 5 million WeChat public account articles (totaling more than 20GB after saving as text): wx.vocab.zip, for readers to use if needed. Retaining the top 50,000 words of this dictionary and performing the same evaluation, the F1 significantly exceeded the results of the top conference paper:
$$\begin{array}{c|c|c} \hline \text{RECALL} & \text{PRECISION} & \text{F1}\\ \hline 0.799 & 0.734 & 0.765\\ \hline\end{array}$$

(Note: This is to provide an intuitive perception of the effect. The comparison may be unfair because I am not sure which corpora were used for the training set in that paper. But I feel that within the same amount of time, the algorithm in this article would be superior to the algorithm in the paper, because intuitively the paper's algorithm would be very slow to train. The authors also haven't open-sourced it, so there are many uncertainties. If there are any errors, please point them out.)
This article reimplements the new word discovery (lexicon construction) algorithm I previously proposed, primarily making speed optimizations and performing simple performance evaluations. However, the specific effect will still need to be fine-tuned by the reader during actual use.
I hope everyone enjoys using it!
Article Address: https://kexue.fm/archives/6920