By 苏剑林 | August 17, 2016
Foreword: This summer, I spent a lot of time on Chinese word segmentation and language models. I hit walls countless times and gained a few scattered insights. I plan to write a series to share these experiences. Although it is a "series," it is more a collection of notes than a systematic tutorial. I hope readers will understand.
Chinese Word Segmentation
I won't say much about the introduction and importance of Chinese word segmentation. matrix67 has a very clear introduction here regarding segmentation and segmentation algorithms, which is well worth reading. In text mining, although many articles have explored processing methods without segmentation, such as the blog post "Text Sentiment Classification (3): To Segment or Not to Segment", segmentation is generally the first step for text mining. Therefore, an effective segmentation algorithm is crucial. Of course, as the first step, Chinese word segmentation has been explored for a long time. Much of the current work is summary-based or involves minor improvements rather than major shifts.
Currently, there are two main approaches to Chinese word segmentation: dictionary lookup and character tagging. First, dictionary-based methods include: mechanical maximum matching, minimum word count methods, maximum probability combinations based on Directed Acyclic Graphs (DAG), and maximum probability combinations based on language models, etc. Dictionary-based methods are simple and efficient (benefiting from the idea of dynamic programming), especially the maximum probability method combined with language models, which solves ambiguity well. However, they cannot solve a major difficulty in Chinese segmentation—Out-of-Vocabulary (OOV) words (the two main difficulties are ambiguity and OOV words). To address this, character tagging was proposed. Character tagging involves using several markers (e.g., the 4-tag set: single, a single character as a word; begin, the start of a multi-character word; middle, the middle part of words with three or more characters; and end, the end of a multi-character word) to represent the correct segmentation of a sentence. This is a sequence-to-sequence process (input sentence to tag sequence) that solves OOV issues relatively well but is slower. Furthermore, in scenarios where a complete dictionary already exists, character tagging might not perform as well as dictionary methods. In short, both have pros and cons. In practice, they are often combined; for example, "Jieba" segmentation uses maximum probability combinations on a DAG and uses an HMM model based on character tagging to identify sequences of single characters.
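To make the 4-tag set concrete, here is a toy illustration of my own (the sentence and code are purely illustrative, not taken from any particular tagger): each character of a correctly segmented sentence receives exactly one of the four tags.

# -*- coding: utf-8 -*-
# Toy illustration of the 4-tag (single/begin/middle/end) scheme under Python 2
words = ['今天', '天气', '很', '不错']        # "today / weather / very / not bad"
tags = []
for w in words:
    w = w.decode('utf-8')                    # count characters, not utf-8 bytes
    if len(w) == 1:
        tags.append('s')                     # single-character word
    else:
        tags.append('b' + 'm' * (len(w) - 2) + 'e')   # begin ... middle ... end
print ' '.join(tags)                         # -> be be s be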
AC Automaton
The first thing this article implements is a dictionary-based segmentation method. The process of dictionary lookup involves: 1. Given a set of words, check if those words exist in a given sentence; 2. If they do, resolve ambiguity. Step 1 is known in computer science as "multi-pattern matching." This step looks simple, but implementing it efficiently is not easy. A complete dictionary has at least hundreds of thousands of words; enumerating each one to search would be computationally prohibitive. In fact, humans don't do this either. When looking up a word in a dictionary, we look at the first letter, search only the block with that starting letter, then compare the next letter, and so on. This requires two conditions: 1. A dictionary with a special sorting structure; 2. Efficient search techniques. For the first condition, we have the "Trie" (prefix tree). For the second, we have classic algorithms like the AC Automaton (Aho and Corasick).
I won't elaborate too much on these two conditions, not because I don't want to, but because my understanding ends there. My understanding of the AC automaton is that it is an efficient multi-pattern matching algorithm built on the Trie data structure. I don't need to implement it myself because Python already has a corresponding library: pyahocorasick. Therefore, we only need to care about how to use it. The official tutorial already covers the basics of pyahocorasick, so I won't repeat them here. (Regrettably, while pyahocorasick supports both Python 2 and 3, under Python 2 it only supports byte strings, not unicode strings, whereas under Python 3 it defaults to unicode. This brings some inconvenience when coding, though nothing fundamental. This article uses Python 2.7.)
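For orientation, here is a minimal sketch of my own (with Python 2 byte strings) of the two pyahocorasick facilities used throughout this post: storing a value alongside each key via add_word, and iter(), which yields (end_index, value) pairs where end_index is the position of the last byte of each match.

import ahocorasick

A = ahocorasick.Automaton()
for w in ['he', 'she', 'her']:
    A.add_word(w, w)            # store the word itself as the value
A.make_automaton()

for end_index, value in A.iter('ushers'):
    print end_index, value      # finds 'she' and 'he' ending at 3, 'her' ending at 4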
To build an AC-automaton-based segmentation system, you first need a text dictionary. Suppose the dictionary has two columns: the word and its corresponding frequency, separated by a space. You can build an AC automaton using the following code:
import ahocorasick

def load_dic(dicfile):
    from math import log
    dic = ahocorasick.Automaton()
    total = 0.0
    with open(dicfile) as f:
        words = []
        for line in f:
            line = line.split(' ')
            words.append((line[0], int(line[1])))
            total += int(line[1])
    for i, j in words:
        # store log probabilities to avoid numerical underflow
        # when many small probabilities are combined later
        dic.add_word(i, (i, log(j/total)))
    dic.make_automaton()
    return dic

dic = load_dic('me.dic')
When building an AC automaton with pyahocorasick, words are added as "key-value" pairs (note the line dic.add_word(i, (i, log(j/total)))). This lets us store whatever information we want (such as frequency or part of speech) in the value, and it is returned during matching. With this AC automaton, we can easily build a full-mode segmentation, that is, scanning out every word present in the dictionary (which is actually the primary job of an AC automaton).
def all_cut(sentence):
    words = []
    for i, j in dic.iter(sentence):
        words.append(j[0])
    return words
For a long sentence, this might return many words; please use with caution.
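A hypothetical call of my own (the actual output depends entirely on me.dic): with a dictionary that happens to contain 研究, 研究生, 生命 and 命, full-mode segmentation returns every one of them, overlaps included.

# -*- coding: utf-8 -*-
print '/'.join(all_cut('研究生命'))   # e.g. 研究/研究生/生命/命 (order may vary)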
Maximum Matching Method
The "full-mode segmentation" mentioned above isn't really segmentation; it's just a simple search. Next, let's implement a classic segmentation algorithm: the Maximum Matching method.
The Maximum Matching method refers to matching words from the dictionary from left to right, matching the longest possible word at each step. This is a relatively crude method. As mentioned in matrix67's article, it is easy to construct counterexamples. If the dictionary contains "不" (not), "不可" (not possible/cannot), "能" (ability), "可能" (possible), but does not contain "不可能" (impossible), then "不可能" would be split as "不可/能". Despite this, in cases where accuracy requirements are not high, this algorithm is acceptable because it is very fast. Below is an implementation of Maximum Matching based on the AC automaton:
def max_match_cut(sentence):
    sentence = sentence.decode('utf-8')
    words = ['']
    for i in sentence:
        i = i.encode('utf-8')
        # dic.match(s) is True when s is a prefix of (or equal to) some
        # dictionary word, so keep greedily extending the current word
        if dic.match(words[-1] + i):
            words[-1] += i
        else:
            words.append(i)
    return words
The code is short and clear, relying mainly on pyahocorasick's match function. In tests on my machine, this algorithm runs at about 4MB/s. According to the author of hanlp, doing something similar in Java can reach 20MB/s! Using Python imposes two limits: Python's own speed, and the fact that pyahocorasick does not support unicode under Python 2. As a result, the implementation above is not the most efficient, because it has to keep converting encodings to handle the lengths of Chinese characters.
The method described above is technically "Forward Maximum Matching." Similarly, there is "Reverse Maximum Matching," which scans the sentence from right to left. Its results are generally better than Forward Maximum Matching. To implement this with an AC automaton, the only way is to store all words in the dictionary in reverse order, reverse the input sentence, and then perform Forward Maximum Matching.
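Below is a sketch of that idea, assuming the same dictionary file; reverse_word, load_rdic and reverse_max_match_cut are names of my own, and the reversal must happen at the character level (hence the unicode round trips under Python 2), not the byte level.

from math import log
import ahocorasick

def reverse_word(w):
    # reverse a utf-8 byte string character by character
    return w.decode('utf-8')[::-1].encode('utf-8')

def load_rdic(dicfile):
    # same as load_dic, but every dictionary word is stored reversed
    rdic = ahocorasick.Automaton()
    total, words = 0.0, []
    with open(dicfile) as f:
        for line in f:
            line = line.split(' ')
            words.append((reverse_word(line[0]), int(line[1])))
            total += int(line[1])
    for i, j in words:
        rdic.add_word(i, (i, log(j/total)))
    rdic.make_automaton()
    return rdic

rdic = load_rdic('me.dic')

def reverse_max_match_cut(sentence):
    sentence = reverse_word(sentence)        # reversed sentence, still utf-8 bytes
    words = ['']
    for i in sentence.decode('utf-8'):
        i = i.encode('utf-8')
        if rdic.match(words[-1] + i):
            words[-1] += i
        else:
            words.append(i)
    # reverse each word and the word order to recover the original direction
    return [reverse_word(w) for w in words[::-1]]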
Maximum Probability Combination
The method based on Maximum Probability Combination is currently a superior approach that balances speed and accuracy. The idea is: for a sentence, if splitting it into the words $w_1, w_2, \dots, w_n$ is the optimal scheme, then that split should maximize the following probability:
$$P(w_1, w_2, \dots, w_n)$$
Directly estimating this probability is difficult. Usually, approximation schemes are used, such as:
$$P(w_1, w_2, \dots, w_n) \approx P(w_1)P(w_2|w_1)P(w_3|w_2)\dots P(w_n|w_{n-1})$$
Here, $P(w_k|w_{k-1})$ is called a language model, which begins to consider semantics. However, general segmentation tools find it hard to estimate $P(w_k|w_{k-1})$, so an even simpler approximation is often adopted:
$$P(w_1, w_2, \dots, w_n) \approx P(w_1)P(w_2)P(w_3)\dots P(w_n)$$
From a graph theory perspective, this is finding the maximum probability path in a DAG.
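Since the dictionary automaton built earlier stores log frequencies, the implementation works in log space, where maximizing this product is equivalent to maximizing a sum of log probabilities:
$$\arg\max_{w_1, w_2, \dots, w_n} P(w_1)P(w_2)\dots P(w_n) = \arg\max_{w_1, w_2, \dots, w_n} \left[\log P(w_1) + \log P(w_2) + \dots + \log P(w_n)\right]$$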
Below is the implementation of the latter scheme using an AC automaton combined with dynamic programming.
def max_proba_cut(sentence):
    # paths[k] = (best segmentation of sentence[:k], its log-probability score)
    paths = {0: ([], 0)}
    end = 0
    for i, j in dic.iter(sentence):
        start, end = 1 + i - len(j[0]), i + 1
        if start not in paths:
            # bridge the unmatched span as a single "word" with log-probability -10
            last = max([i for i in paths if i < start])
            paths[start] = (paths[last][0] + [sentence[last:start]], paths[last][1] - 10)
        proba = paths[start][1] + j[1]
        if end not in paths or proba > paths[end][1]:
            paths[end] = (paths[start][0] + [j[0]], proba)
    if end < len(sentence):
        return paths[end][0] + [sentence[end:]]
    else:
        return paths[end][0]
The code is concise. Here, I assumed the frequency of unmatched parts is $e^{-10}$, which can be adjusted. Note that this dynamic programming scheme is set up a little differently from the usual DAG-based dynamic programming, though the logic is natural. Be aware that if you feed this function a sentence tens of thousands of characters long, it will be slow and memory-hungry, because all the intermediate schemes produced during the dynamic programming are kept in a dictionary. Fortunately, Chinese sentences contain many natural break markers, such as punctuation and line breaks. We can use these markers to split the sentence into parts and segment each part separately, as shown below.
# an auxiliary automaton that locates natural break markers (punctuation etc.)
to_break = ahocorasick.Automaton()
for i in [',', '。', '!', '、', '?', ' ', '\n']:
    to_break.add_word(i, i)
to_break.make_automaton()

def map_cut(sentence):
    start = 0
    words = []
    for i in to_break.iter(sentence):
        # segment each chunk up to (and including) the break marker separately
        words.extend(max_proba_cut(sentence[start:i[0]+1]))
        start = i[0] + 1
    words.extend(max_proba_cut(sentence[start:]))
    return words
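A hypothetical usage example of my own (the actual output depends entirely on me.dic):

# -*- coding: utf-8 -*-
text = '今天天气不错。我们出去玩吧。'
print '/'.join(map_cut(text))   # the break marks such as '。' are kept in the output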
On a server, I extracted 100,000 articles (over 100 million characters) and compared the speed with Jieba segmentation. I found that using the same dictionary and disabling Jieba's OOV discovery, the map_cut implemented with the AC automaton was about 2 to 3 times faster than Jieba, reaching approximately 1MB/s.
Finally, it is worth mentioning that the logic in max_proba_cut can be applied to other segmentation methods involving dynamic programming, such as the Minimum Word Count segmentation:
def min_words_cut(sentence):
    # paths[k] = (best segmentation of sentence[:k], its total penalty)
    paths = {0: ([], 0)}
    end = 0
    for i, j in dic.iter(sentence):
        start, end = 1 + i - len(j[0]), i + 1
        if start not in paths:
            # bridge the unmatched span, paying one penalty point for it
            last = max([i for i in paths if i < start])
            paths[start] = (paths[last][0] + [sentence[last:start]], paths[last][1] + 1)
        num = paths[start][1] + 1
        if end not in paths or num < paths[end][1]:
            paths[end] = (paths[start][0] + [j[0]], num)
    if end < len(sentence):
        return paths[end][0] + [sentence[end:]]
    else:
        return paths[end][0]
The penalty rule here is: each word costs one point, and OOV words cost one more point. The scheme with the lowest total penalty wins.
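As a quick, hypothetical comparison of my own (actual results depend on the dictionary and its frequencies), the two objectives can be run side by side on the same input:

# -*- coding: utf-8 -*-
s = '研究生命起源'                    # "research the origin of life", an often-cited ambiguous string
print '/'.join(max_proba_cut(s))      # path with the highest total log probability
print '/'.join(min_words_cut(s))      # path with the lowest total penalty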
Summary
In fact, wherever dictionary lookups are involved, the AC automaton has a role to play. Using an AC automaton for word segmentation is a very natural application. We look forward to more data structures and algorithms with better support for Chinese, which might allow us to design even more efficient algorithms.