By 苏剑林 | March 06, 2017
As this series slowly reaches its 7th post, we have basically covered the main models for word segmentation; apart from some minor variations (such as swapping the final classifier for a CRF), the rest is a matter of how you put them to use. Generally speaking, if you want speed, you use dictionary-based segmentation; if you want better resolution of combination ambiguities and better out-of-vocabulary (OOV) word recognition, you use complex models such as the previously introduced LSTM or FCN. The problem, however, is that training a segmenter with deep learning requires annotated corpora, which are time-consuming and labor-intensive to build. The few publicly available annotated corpora cannot keep up with the times: for example, almost no public segmentation system can correctly segment "扫描二维码,关注微信号" (Scan the QR code, follow the WeChat account).
This article describes an experiment: using only a dictionary, I trained a deep learning segmenter, and the results were surprisingly good! This approach can be called semi-supervised or even unsupervised.
Random Combinations Are Enough
The method is very simple: since deep learning needs a corpus, I will generate one myself. How? I simply combine words from the dictionary at random. Wait—won't random combinations result in something that isn't natural language? I suspected this at first, but after experimenting, I found that the effect was exceptionally good, sometimes even surpassing results from annotated corpora.
Without further ado, let's get to work. First, prepare a word list with word frequencies; frequencies are essential, or the results will be significantly diminished. Then, write a function to randomly pick words from the word list with a probability proportional to their frequency to form "sentences."
import numpy as np
import pandas as pd

class Random_Choice:
    """Weighted sampler: words with the same frequency are grouped together;
    a group is drawn in proportion to its total frequency, then a word is
    drawn uniformly from within that group."""
    def __init__(self, elements, weights):
        d = pd.DataFrame(zip(elements, weights))
        self.elements, self.weights = [], []
        for i, j in d.groupby(1):  # group words that share the frequency i
            self.weights.append(len(j) * i)  # total frequency mass of the group
            self.elements.append(tuple(j[0]))
        # cumulative distribution over the groups, normalized so it ends at 1
        self.weights = np.cumsum(self.weights).astype(np.float64) / sum(self.weights)
    def choice(self):
        r = np.random.random()
        # find the first group whose cumulative weight reaches r,
        # then pick a word uniformly inside that group
        w = self.elements[np.where(self.weights >= r)[0][0]]
        return w[np.random.randint(0, len(w))]
Note that random sampling is implemented here by grouping words with identical frequencies; its speed depends on the number of groups, so it is best to preprocess the word frequencies so that they "cluster." For example, words appearing 10001, 10002, and 10003 times can all be rounded to 10000, and words appearing 12001, 12002, 12003, and 12004 times can be rounded to 12000. This step is extremely important: if you are training on a GPU, this sampling step usually becomes the bottleneck in training speed.
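As a rough illustration of this bucketing, here is a sketch of my own (the file name dict.txt and the tab-separated word/frequency layout follow the loading code below; rounding to three significant figures is an arbitrary choice):

# Illustrative frequency bucketing: round each frequency to 3 significant figures,
# e.g. 10001/10002/10003 -> 10000 and 12001...12004 -> 12000,
# so that many near-identical frequencies collapse into a single group
d = pd.read_csv('dict.txt', delimiter='\t', header=None, encoding='utf-8')
rounded = d[1].apply(lambda f: int(float('%.3g' % f)))
wc = Random_Choice(d[0], rounded)
print(wc.choice())  # one word, drawn roughly in proportion to its frequency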
Next, build the character table and write a data generator. These are the standard steps of the 4-tag (s/b/m/e) character labeling scheme, with an extra 'x' tag for padding. Readers who are unclear about this can refer back to the LSTM segmentation article:
import pickle

# dict.txt: one entry per line, word<TAB>frequency
words = pd.read_csv('dict.txt', delimiter='\t', header=None, encoding='utf-8')
words[0] = words[0].apply(unicode)
words = words.set_index(0)[1]  # Series mapping word -> frequency

try:
    char2id = pickle.load(open('char2id.dic'))
except:
    from collections import defaultdict
    print u'fail to load old char2id.'
    # build the character table from all characters in the dictionary;
    # id 0 is reserved for padding, and unknown characters also map to 0
    char2id = pd.Series(list(''.join(words.index))).value_counts()
    char2id[:] = range(1, len(char2id) + 1)
    char2id = defaultdict(int, char2id.to_dict())
    pickle.dump(char2id, open('char2id.dic', 'w'))
word_size = 128   # character embedding dimension
maxlen = 48       # maximum sequence length in characters
batch_size = 1024

def word2tag(s):
    # map a word to its character tags: one char -> 's', otherwise 'b' + 'm'*k + 'e'
    if len(s) == 1:
        return 's'
    elif len(s) >= 2:
        return 'b' + 'm' * (len(s) - 2) + 'e'

# one-hot encoding of the tags; the fifth position is the padding tag 'x'
tag2id = {'s': [1,0,0,0,0], 'b': [0,1,0,0,0], 'm': [0,0,1,0,0], 'e': [0,0,0,1,0]}
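# For example (illustrative): word2tag(u'自然语言') -> 'bmme', word2tag(u'的') -> 's'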
def data_generator():
    wc = Random_Choice(words.index, words)
    x, y = [], []
    while True:
        n = np.random.randint(1, 17)  # a "sentence" of 1 to 16 random words
        seq = [wc.choice() for i in range(n)]
        tag = ''.join([word2tag(i) for i in seq])
        seq = [char2id[i] for i in ''.join(seq)]
        if len(seq) > maxlen:
            continue
        else:
            seq = seq + [0] * (maxlen - len(seq))  # pad character ids with 0
            tag = [tag2id[i] for i in tag]
            tag = tag + [[0,0,0,0,1]] * (maxlen - len(tag))  # pad tags with 'x'
            x.append(seq)
            y.append(tag)
            if len(x) == batch_size:
                yield np.array(x), np.array(y)
                x, y = [], []
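To get a feel for the synthetic corpus, you can draw one batch and map the first sample back to characters (a quick illustrative check of my own, not part of the original code):

# Draw a batch and inspect it (id 0 is padding)
id2char = {j: i for i, j in char2id.items()}
xb, yb = next(data_generator())
print(xb.shape)  # (1024, 48): batch_size x maxlen character ids
print(yb.shape)  # (1024, 48, 5): one-hot tags over s/b/m/e/x
print(''.join(id2char.get(i, '') for i in xb[0] if i > 0))  # one randomly combined "sentence"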
The Same Old Model
For the model, you can use the LSTM or CNN I wrote previously. Here, I used an LSTM, and the results show that LSTM has very strong memory capabilities.
# Running on Keras 2.0 + TensorFlow 1.0
from keras.layers import Dense, Embedding, LSTM, TimeDistributed, Input, Bidirectional
from keras.models import Model

sequence = Input(shape=(maxlen,), dtype='int32')
# mask_zero=True makes positions with character id 0 (padding) be ignored by the LSTM
embedded = Embedding(len(char2id) + 1, word_size, input_length=maxlen, mask_zero=True)(sequence)
blstm = Bidirectional(LSTM(64, return_sequences=True))(embedded)
output = TimeDistributed(Dense(5, activation='softmax'))(blstm)
model = Model(inputs=sequence, outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')

try:
    model.load_weights('model.weights')
except:
    print u'fail to load old weights.'

for i in range(100):
    print i
    model.fit_generator(data_generator(), steps_per_epoch=100, epochs=10)
    model.save_weights('model.weights')  # one fit_generator call is 10 epochs, then save
Using my GTX 1060 and my dictionary (500,000 distinct words), each round takes about 70s. The model is saved every 10 rounds. The 100 in range(100) is arbitrary; since it saves every 10 rounds, you can stop the program whenever you're satisfied with the results.
Regarding accuracy: note that because mask_zero=True is used, the 'x' tag is ignored during training. However, the final training accuracy displayed by Keras includes the 'x' tags, so the displayed accuracy won't exceed 0.3 (around 0.28). This isn't important; we can test the model after training is complete.
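If you want a more meaningful number, one rough check (my own addition, not in the original code) is to compute per-character tag accuracy on a generated batch while skipping the padding positions:

# Tag accuracy on one synthetic batch, ignoring the padding tag 'x' (index 4)
xb, yb = next(data_generator())
pred = model.predict(xb).argmax(axis=-1)  # predicted tag index per character
true = yb.argmax(axis=-1)
mask = true != 4
print(float((pred[mask] == true[mask]).sum()) / mask.sum())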
One final tip: larger character vector dimensions generally lead to better recognition of long words.
Output via Dynamic Programming
Now we combine the model's output with the Viterbi algorithm, i.e. dynamic programming, to produce the final segmentation. Dynamic programming guarantees the optimal tag sequence but costs some efficiency; simply taking the classifier's maximum at each position yields similar results (though in theory it can produce invalid tag sequences such as "bbbb"). It depends on the situation: since this is an experiment, I used Viterbi; in a production environment you might skip it and trade a little precision for a large gain in speed.
# transition weights between adjacent tags; only legal transitions are listed,
# so Viterbi can never output an invalid sequence such as "bbbb"
zy = {'be': 0.5, 'bm': 0.5, 'eb': 0.5, 'es': 0.5,
      'me': 0.5, 'mm': 0.5, 'sb': 0.5, 'ss': 0.5}
zy = {i: np.log(zy[i]) for i in zy.keys()}
def viterbi(nodes):
    # nodes: one {tag: log-probability} dict per character; returns the best tag string
    paths = {'b': nodes[0]['b'], 's': nodes[0]['s']}  # a sentence can only start with 'b' or 's'
    for l in range(1, len(nodes)):
        paths_ = paths.copy()
        paths = {}
        for i in nodes[l].keys():
            nows = {}
            for j in paths_.keys():
                if j[-1] + i in zy.keys():  # keep only legal tag transitions
                    nows[j + i] = paths_[j] + nodes[l][i] + zy[j[-1] + i]
            k = np.argmax(nows.values())
            paths[nows.keys()[k]] = nows.values()[k]
    return paths.keys()[np.argmax(paths.values())]
def simple_cut(s):
    if s:
        s = s[:maxlen]
        # pad the character ids to maxlen, predict, then keep only the real positions
        r = model.predict(np.array([[char2id[i] for i in s] + [0] * (maxlen - len(s))]), verbose=False)[0][:len(s)]
        r = np.log(r)
        nodes = [dict(zip(['s','b','m','e'], i[:4])) for i in r]  # per-character log-probabilities
        t = viterbi(nodes)
        words = []
        for i in range(len(s)):
            if t[i] in ['s', 'b']:  # 's' or 'b' starts a new word
                words.append(s[i])
            else:                   # 'm' or 'e' continues the current word
                words[-1] += s[i]
        return words
    else:
        return []
import re
# digits, Latin letters, and punctuation are not fed to the model; they are kept as standalone tokens
not_cuts = re.compile(u'([\da-zA-Z ]+)|[。,、?!\.\?,!“”]')

def cut_word(s):
    result = []
    j = 0
    for i in not_cuts.finditer(s):
        result.extend(simple_cut(s[j:i.start()]))
        result.append(s[i.start():i.end()])
        j = i.end()
    result.extend(simple_cut(s[j:]))
    return result
The code here is basically the same as before. Its efficiency is not great, but as mentioned, this is an experiment; if you want to use it in production, you will need to optimize it yourself.
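For instance, as mentioned above, you could drop Viterbi entirely and take the argmax tag at each position. A minimal sketch of that greedy decoder (my own addition, named greedy_cut here; in rare cases it can emit an invalid tag sequence):

# Greedy decoder: argmax tag per character, no transition constraints
def greedy_cut(s):
    if not s:
        return []
    s = s[:maxlen]
    r = model.predict(np.array([[char2id[i] for i in s] + [0] * (maxlen - len(s))]))[0][:len(s)]
    t = [('s', 'b', 'm', 'e')[i] for i in r[:, :4].argmax(axis=1)]
    words = []
    for i in range(len(s)):
        if t[i] in ('s', 'b'):
            words.append(s[i])
        elif words:
            words[-1] += s[i]
        else:  # guard against an invalid leading 'm'/'e'
            words.append(s[i])
    return words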
Let's Test It
Combined with the dictionary I organized myself, the final model reaches an accuracy of about 85% on the backoff2005 evaluation set (calculated by the score script provided by backoff2005). This accuracy depends on your dictionary.
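For reference, here is a sketch of how such a result file could be produced for the score script (my own helper, not from the original post; the test file name and the "one sentence per line, words separated by spaces" output format are assumptions about the evaluation setup):

# Segment every test sentence and write the words separated by spaces, one sentence per line
import codecs
with codecs.open('msr_test.utf8', encoding='utf-8') as fin, \
     codecs.open('msr_test_result.utf8', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(' '.join(cut_word(line.strip())) + '\n')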
Does it look bad? Not really. First, we did not use its training set; this was purely unsupervised training using only a dictionary. This accuracy is already quite satisfying. Second, the accuracy appears low, but the actual situation is better because much of the error comes from differences in corpus segmentation standards. For example:
1. The evaluation set's standard answer splits "古老的中华文化" (ancient Chinese culture) into "古老/的/中华/文化" (ancient / of / China / culture), whereas the model splits it as "古老/的/中华文化" (ancient / of / Chinese culture).
2. The evaluation set's standard answer splits "在录入上走向中西文求同的道路" (roughly, "in text input, take the road of seeking common ground between Chinese and Western scripts") as "在/录入/上/走/向/中/西文/求/同/的/道路", whereas the model splits it as "在/录入/上/走向/中西文/求同/的/道路".
3. The evaluation set's standard answer splits "更是教育学家关心的问题" (roughly, "it is all the more a question that education scholars care about") as "更/是/教育学/家/关心/的/问题", whereas the model splits it as "更是/教育学家/关心/的/问题".
Scanning through like this, many more such examples turn up, which indicates that the backoff2005 annotation itself is not a perfect gold standard. We should not read too much into this accuracy figure. What matters more is that the model recognizes new words and long words better, and achieving this requires nothing more than a dictionary, which is far easier to obtain than annotated data. This is very useful for building custom segmenters for specific domains (such as medicine or tourism).
Below are some test examples; these results are generally even better than those from supervised training:
罗斯福 是 第二次世界大战 期间 同盟国 阵营 的 重要 领导人 之一 。 1941 年 珍珠港 事件 发生 后 , 罗斯福 力主 对 日本 宣战 , 并 引进 了 价格 管制 和 配给 。 罗斯福 以 租 借 法案 使 美国 转变 为 “ 民主 国家 的 兵工厂 ” , 使 美国 成为 同盟国 主要 的 军火 供应商 和 融资者 , 也 使得 美国 国内 产业 大幅 扩张 , 实现 充分 就业 。 二战 后期 同盟国 逐渐 扭转 形势 后 , 罗斯福 对 塑造 战后 世界 秩序 发挥 了 关键 作用 , 其 影响力 在 雅尔塔 会议 及 联合国 的 成立 中 尤其 明显 。 后来 , 在 美国 协助 下 , 盟军 击败 德国 、 意大利 和 日本 。
苏剑林 是 科学 空间 的 博主
结婚 的 和 尚未 结婚 的
大肠杆菌 是 人 和 许多 动物 肠道 中 最 主要 且 数量 最 多 的 一种 细菌
九寨沟 国家级 自然保护区 位于 四川省 阿坝藏族羌族自治州 南坪县 境内 , 距离 成都市 400 多 公里 , 是 一条 纵深 40 余公里 的 山沟 谷地
现代 内地 人 , 因 莫高窟 而 知 敦煌 。 敦煌 因 莫高窟 也 在 近代 蜚声 海外 。 但 莫高窟 始 凿 于 四 世纪 , 到 1900 年 才 为 世人 瞩目 , 而 敦煌 早 从 汉武帝 时 , 即 公元前 一百多年 起 , 就 已 是 西北 名城 了
从前 对 巴特农 神庙 怎么 干 , 现在 对 圆明园 也 怎么 干 , 只是 更 彻底 , 更 漂亮 , 以至于 荡然无存 。 我们 所有 大 教堂 的 财宝 加 在 一起 , 也许 还 抵不上 东方 这座 了不起 的 富丽堂皇 的 博物馆 。 那儿 不仅仅 有 艺术 珍品 , 还有 大堆 的 金银 制品 。 丰功伟绩 ! 收获 巨大 ! 两个 胜利者 , 一个 塞满 了 腰包 , 这 是 看得见 的 , 另 一个 装满 了 箱子 。
Don't forget, this was done using only a dictionary. Many segments, especially names, are not even in the dictionary. This result should be satisfying enough!
Why Does It Work?
Returning to our initial doubt: why can randomly combined text train a good segmenter? The reason is that traditional dictionary-based segmentation itself rests on an assumption: a sentence is a random combination of words. Under this assumption, to segment a sentence we should cut the string so as to maximize the probability
$$p(w_1)p(w_2)\dots p(w_n)$$
where $p(w_i)$ is the word's probability (its normalized frequency), and the maximization itself is solved by dynamic programming.
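To make the connection explicit, here is a minimal sketch of that classical unigram segmenter (my own illustration, not from the original post; it reuses the words frequency Series loaded earlier, and the maximum word length and out-of-vocabulary penalty are arbitrary choices):

# Unigram DP segmenter: best[i] holds (log-probability, segmentation) of the best split of s[:i]
def unigram_cut(s, max_word_len=7):
    total = float(words.sum())
    best = [(0.0, [])]
    for i in range(1, len(s) + 1):
        cands = []
        for k in range(max(0, i - max_word_len), i):
            w = s[k:i]
            if w in words.index:
                cands.append((best[k][0] + np.log(words[w] / total), best[k][1] + [w]))
        if not cands:  # character not covered by the dictionary: emit it alone, heavily penalized
            cands.append((best[i - 1][0] - 100.0, best[i - 1][1] + [s[i - 1]]))
        best.append(max(cands))
    return best[-1][1]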
Here, we follow the same assumption—that text is a random combination—which is not strictly true but generally suffices. The results show its effectiveness. What might be surprising is that this segmenter can even resolve combination ambiguity sentences like "结婚的和尚未结婚的" (the married ones and those not yet married). It's not hard to understand: when we combine randomly, we pick words based on their frequencies. This results in high-frequency words appearing more and low-frequency words less. After a large number of repeated operations, we are essentially training the LSTM to learn the process of dynamic programming!
This is startling: it means an LSTM can learn a traditional optimization algorithm! Going further, RNNs have been adapted to classic computer-science problems such as convex hulls, Delaunay triangulation, and even the Traveling Salesman Problem (TSP), and they work surprisingly well, sometimes better than standard approximation algorithms (https://www.zhihu.com/question/47563637). Note that TSP is NP-hard, so no polynomial-time exact algorithm is known (and none exists unless P = NP); yet with an LSTM we can potentially obtain a good approximate solution in roughly linear time. That is a major impact on the field of traditional algorithms. Using neural networks to design optimization algorithms, and then using optimization algorithms to optimize the neural networks themselves so that they self-optimize: now that's intelligence!
Well, I've strayed far enough. In any case, results are the only yardstick that matters.
Pre-trained Model
Finally, I am sharing a pre-trained model; readers with Keras installed can download it and test it:
seg.zip (These weights no longer work with the latest versions of TensorFlow + Keras; please train your own using the code above~)
Originally published at: https://kexue.fm/archives/4245