[Chinese Segmentation Series] 2. New Word Discovery Based on Segmentation

By 苏剑林 | August 18, 2016

The previous article discussed fast segmentation based on dictionaries and AC automata. Dictionary-based segmentation has a distinct advantage: it is easy to maintain and adapts well to specific domains. When migrating to a new domain, one only needs to add that domain's vocabulary to the dictionary to get reasonably good results. Of course, whether a high-quality, domain-adapted dictionary is easy to obtain depends on the situation. This article focuses on the discovery of new words.

This topic was previously discussed in last year's article "Information Entropy Methods and Implementation of New Word Discovery". The algorithm was derived from matrix67's article "Sociolinguistics in the Internet Era: SNS-Based Text Data Mining". In that article, three main indicators were used—frequency, cohesion (which, after taking the logarithm, is what we call Pointwise Mutual Information), and freedom (boundary entropy)—to judge whether a snippet constitutes a word. If you have actually tried to implement this algorithm, you will find several difficulties. First, to find $n$-character words, you need to extract slices of $1 \sim n$ characters and perform calculations on each, which is time-consuming when $n$ is large. Second, and most painfully, calculating boundary entropy requires grouping and statistics for every snippet, which involves a massive workload. This article provides a solution that significantly reduces the computational cost of new word discovery.
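For reference, here is a rough sketch of the two non-trivial indicators as I understand them (the notation is mine). For a candidate snippet $w$, the cohesion/PMI is taken over the worst way of splitting $w$ into a left part $w_L$ and a right part $w_R$, and the freedom is the entropy of the characters seen next to $w$:

$$\text{PMI}(w) = \min_{w = w_L w_R} \log\frac{P(w)}{P(w_L)\,P(w_R)}, \qquad H_{\text{side}}(w) = -\sum_{c} P(c\mid w)\log P(c\mid w),$$

where, for the left (right) entropy, $c$ runs over the characters immediately preceding (following) occurrences of $w$ in the corpus. A snippet counts as a word only if its frequency, its PMI, and the smaller of its two boundary entropies all exceed their thresholds.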

Algorithm

Reviewing the process of new word discovery in matrix67's algorithm, we can recognize that what new word discovery does is judge whether a given snippet is truly a word based on the corpus. A so-called "word" is something that is relatively independent and indivisible. So, why not do the opposite? Why don't we try to find which snippets cannot form a word? According to the previous logic, we say a snippet might be a word when its cohesion is above a certain threshold (then we consider its boundary entropy). Doesn't this imply that if the cohesion of a snippet is below a certain threshold, it is impossible for it to be a word? Therefore, we can simply break it apart in the original corpus.

We can make an appropriate simplification. If $a, b$ are two adjacent characters in the corpus, we can count the number of times the pair $(a,b)$ appears as $\#(a,b)$, and then estimate its frequency $P(a,b)$. We then separately count the occurrences of $a$ and $b$ as $\#a, \#b$ and estimate their frequencies $P(a), P(b)$. If

$$\frac{P(a,b)}{P(a)P(b)} < \alpha \quad (\alpha \text{ is a given threshold greater than 1})$$

then we split between these two characters in the original corpus. This operation essentially performs a preliminary segmentation of the original corpus based on this single indicator! After this preliminary segmentation, we can count the frequency of each resulting fragment and keep only those above a minimum frequency.
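As a purely illustrative example (all counts below are made up): suppose the corpus contains $N = 10^8$ adjacent character pairs, with $\#a = 10^5$, $\#b = 2\times 10^5$ and $\#(a,b) = 10^3$. Estimating the frequencies as counts over $N$,

$$\frac{P(a,b)}{P(a)P(b)} \approx \frac{\#(a,b)\cdot N}{\#a \cdot \#b} = \frac{10^3 \times 10^8}{10^5 \times 2\times 10^5} = 5,$$

so with $\alpha = 10$ we would cut between $a$ and $b$, while with $\alpha = 1$ (indeed, with any $\alpha$ below 5) we would leave them joined.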

Compared with the three indicators in matrix67's article, we are now using only two: frequency and cohesion. We have dropped the most computationally expensive one, boundary entropy. Furthermore, cohesion only needs to be computed for two-character snippets; longer snippets are never scored directly. Yet because the approach is segmentation-based, the workload drops dramatically while, in theory, words of any length can still be obtained!

Implementation

It sounds perfect: less computation and more power. How does it actually perform, and how much do the results differ from matrix67's algorithm? One has to try it to find out. I experimented with 300,000 WeChat public account articles (about 1GB), and the results were satisfactory, taking about 10 minutes. The implementation code is below. It is very short, written in pure Python (the core algorithm itself needs no third-party libraries; pymongo, tqdm and pandas appear only for loading data, showing progress and sorting results), and it is very memory-friendly. Here, texts can be a list or an iterator (returning one article at a time); combined with the tqdm library, progress can easily be displayed. Finally, during the statistics a $\gamma$ smoothing method is used to mitigate the appearance of unreasonable words. Previously I would reach for Pandas for this kind of counting without a second thought, but lately I have tried some of Python's built-in modules and found them quite useful too~

import pymongo

db = pymongo.MongoClient().baike.items
def texts():
    for a in db.find(no_cursor_timeout=True).limit(1000000):
        yield a['content']

from collections import defaultdict # defaultdict is a wrapped dict allowing default values
from tqdm import tqdm # tqdm is an easy-to-use library for progress bars
from math import log
import re

class Find_Words:
    def __init__(self, min_count=10, min_pmi=0):
        self.min_count = min_count
        self.min_pmi = min_pmi
        self.chars, self.pairs = defaultdict(int), defaultdict(int) 
        # If key doesn't exist, use int() which defaults to 0
        self.total = 0.
    def text_filter(self, texts): 
        # Pre-split sentences to avoid too many meaningless (non-Chinese/English/digit) strings
        for a in tqdm(texts):
            for t in re.split(u'[^\u4e00-\u9fa50-9a-zA-Z]+', a): 
                # This regex matches any non-Chinese, non-English, 
                # non-digit character, splitting the sentence by them.
                if t:
                    yield t
    def count(self, texts): # Counting function: single characters and adjacent character pairs
        for text in self.text_filter(texts):
            self.chars[text[0]] += 1
            for i in range(len(text)-1):
                self.chars[text[i+1]] += 1
                self.pairs[text[i:i+2]] += 1
                self.total += 1 # Total number of adjacent pairs, used to turn counts into frequencies
        self.chars = {i:j for i,j in self.chars.items() if j >= self.min_count} # Filter by min frequency
        self.pairs = {i:j for i,j in self.pairs.items() if j >= self.min_count} # Filter by min frequency
        self.strong_segments = set()
        for i,j in self.pairs.items(): # Find "closely related" neighbors based on mutual information
            pmi = log(self.total*j/(self.chars[i[0]]*self.chars[i[1]])) # log[P(a,b)/(P(a)P(b))]
            if pmi >= self.min_pmi:
                self.strong_segments.add(i)
    def find_words(self, texts): # Discover words based on the results above
        self.words = defaultdict(int)
        for text in self.text_filter(texts):
            s = text[0]
            for i in range(len(text)-1):
                if text[i:i+2] in self.strong_segments: # If "close", don't split
                    s += text[i+1]
                else:
                    self.words[s] += 1 # Otherwise split, count the preceding snippet as a word
                    s = text[i+1]
            self.words[s] += 1 # Count the last "word" of the sentence
        self.words = {i:j for i,j in self.words.items() if j >= self.min_count} # Final frequency filter

fw = Find_Words(16, 1)
fw.count(texts())
fw.find_words(texts())

import pandas as pd
words = pd.Series(fw.words).sort_values(ascending=False)
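If you want to try the class on an in-memory corpus instead of MongoDB, a minimal usage sketch is below; the file name corpus.txt and the thresholds are only placeholders for illustration.

# Hypothetical alternative to the MongoDB texts() generator above:
# read one article per line from a plain-text file (Python 3).
with open('corpus.txt', encoding='utf-8') as f:
    my_texts = [line.strip() for line in f if line.strip()]

fw = Find_Words(min_count=16, min_pmi=1)
fw.count(my_texts)      # first pass: character and pair statistics
fw.find_words(my_texts) # second pass: segment the corpus and count candidate words

for w, c in sorted(fw.words.items(), key=lambda x: -x[1])[:20]:
    print(w, c)  # top 20 discovered words by frequency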

Reference code for streaming SQL data in Python:

from sqlalchemy import *

def sql_data_generator():
    db = create_engine('mysql+pymysql://user:password@123.456.789.123/yourdatabase?charset=utf8')
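    # stream_results=True asks the driver to fetch rows lazily rather than loading the whole result set into memory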
    result = db.execution_options(stream_results=True).execute(text('select content from articles'))
    for t in result:
        yield t[0]
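Note that count and find_words each make a full pass over the corpus, and a Python generator can only be consumed once, so each pass needs a fresh generator, along the lines of:

fw = Find_Words(16, 1)
fw.count(sql_data_generator())      # first pass
fw.find_words(sql_data_generator()) # second pass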

Analysis

Of course, this algorithm is not without flaws; there are still issues worth discussing. Generally, to obtain finer-grained words (and to avoid extracting too many invalid long words), we can choose a larger $\alpha$, such as $\alpha=10$. However, this introduces a problem: the cohesion between adjacent characters inside a genuine word is not necessarily high. A typical example is "共和国" (Republic). Both "和" and "国" are very frequent characters, and the cohesion of "和国" is not high (around 3 in WeChat text), so if $\alpha$ is too large the word may be incorrectly split (in fact, the cohesion between "共和" and "国" is high). There are many such examples, such as "林心如" (Ruby Lin), where the cohesion of "心如" is not very high (unless, of course, the corpus comes from the entertainment industry). Conversely, if $\alpha$ is set to 1, a larger corpus is required to make the resulting dictionary comprehensive. This trade-off needs careful consideration when using this algorithm.
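To see why, plug some made-up but plausible counts into the criterion (chosen so that the result matches the value of about 3 quoted above): suppose the corpus has $N = 10^8$ adjacent pairs, "和" occurs $10^6$ times, "国" occurs $8\times 10^5$ times, and the pair "和国" occurs $2.4\times 10^4$ times, mostly inside "共和国". Then

$$\frac{P(\text{和},\text{国})}{P(\text{和})P(\text{国})} \approx \frac{2.4\times 10^4 \times 10^8}{10^6 \times 8\times 10^5} = 3,$$

which falls below $\alpha = 10$, so the corpus would be cut between "和" and "国" and "共和国" would never survive as a single word, even though the split "共和" + "国" is itself highly cohesive.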

WeChat Dictionary

Finally, I am sharing a word list extracted from approximately 300,000 recent WeChat public account articles (about 1GB, over 300 million characters), with the minimum cohesion set to 1 and the minimum frequency set to 100. From the extracted list, you can see that words clearly related to WeChat have been picked up. Furthermore, since these are recent articles, hot topics, such as the Olympics and Wang Baoqiang, have also been captured.

WeChat Dictionary: dict.txt

Reference Links

"Non-mainstream Natural Language Processing—Forgetting Algorithm Series (2): Large-scale Corpus Dictionary Generation": http://www.52nlp.cn/forgetnlp2