[Chinese Word Segmentation Series] 3. Character Tagging and the HMM Model

By 苏剑林 | August 19, 2016

In this article, we pause our introduction of dictionary-based methods and turn instead to character tagging methods. As previously mentioned, character tagging approaches word segmentation by assigning a label to each character in a sentence. For instance, using a 4-tag set (single, a character that forms a word on its own; begin, the start of a multi-character word; middle, the interior of a word with three or more characters; end, the last character of a multi-character word; each tag abbreviated by its first letter), the phrase "为人民服务" (serve the people) would be tagged as "sbebe".
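For illustration, here is a minimal sketch (the helper name to_tags is just for this example, not part of any library) that derives the 4-tag labels from an already-segmented sentence:

# A minimal sketch: derive 4-tag labels from an already-segmented sentence.
def to_tags(words):
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append('s')                  # single-character word
        else:
            tags.append('b')                  # first character of the word
            tags.extend('m' * (len(w) - 2))   # interior characters, if any
            tags.append('e')                  # last character of the word
    return ''.join(tags)

print(to_tags(['为', '人民', '服务']))  # -> sbebe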

The 4-tag set is not the only labeling scheme; there is also, for example, a 6-tag set. In theory, more tags allow finer distinctions and better results, but too many tags can lead to insufficient samples per tag. In practice, the 4-tag and 6-tag sets are the most commonly used. It is worth pointing out that transforming a problem into sequence-to-sequence learning by tagging each character is not just a segmentation method, but a way of thinking that applies to a wide range of natural language problems, such as Named Entity Recognition (NER). Returning to segmentation, models built on character tagging include the Hidden Markov Model (HMM), the Maximum Entropy Model (ME), and the Conditional Random Field (CRF). Their accuracy generally increases in that order; it is said that the best performance in open evaluations is currently achieved by 4-tag CRFs. In this article, however, we explain the least accurate of the three: the HMM. In my view, it is not just a specific model, but a general philosophy for solving a large class of problems, a discipline of turning the complex into the simple. All of this begins with probabilistic models.

HMM Model

A "model" refers to something that can process our input data and provide the optimal output. For the character tagging method of segmentation, the input is $n$ characters, and the output is $n$ labels. We use $\lambda=\lambda_1 \lambda_2 \dots \lambda_n$ to represent the input sentence, and $o=o_1 o_2 \dots o_n$ to represent the output. So, what is the optimal output? From a probabilistic perspective, we naturally want to maximize the following conditional probability:

$$\max P(o|\lambda) =\max P(o_1 o_2 \dots o_n|\lambda_1 \lambda_2 \dots \lambda_n)$$

In other words, there are many possibilities for $o$, and the optimal $o$ should be the one with the highest probability. Note that $P(o|\lambda)$ is a conditional probability involving $2n$ variables, and $n$ is variable. In this case, it is nearly impossible to model $P(o|\lambda)$ precisely. Nevertheless, we can make some simplifications. For example, if we assume that the output of each character depends only on the current character itself, then we have:

$$P(o_1 o_2 \dots o_n|\lambda_1 \lambda_2 \dots \lambda_n) = P(o_1|\lambda_1)P(o_2|\lambda_2)\dots P(o_n|\lambda_n)$$

Estimating $P(o_k|\lambda_k)$ is much easier. The problem is also greatly simplified because to maximize $P(o|\lambda)$, one only needs to maximize each $P(o_k|\lambda_k)$. This assumption is called the independence assumption.
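As a quick illustration, tagging under this assumption reduces to picking, for each character separately, the tag with the highest probability. A minimal sketch, assuming a hypothetical table emit with emit[char][tag] $= P(o_k|\lambda_k)$:

# Tagging under the independence assumption: each character's tag is
# chosen on its own, ignoring all neighbours.
# `emit` is a hypothetical table: emit[char][tag] = P(tag | char).
def naive_tag(sentence, emit):
    return ''.join(max(emit[ch], key=emit[ch].get) for ch in sentence)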

The above simplification is one approach, but it completely ignores the context and can lead to illogical situations (for example, in our 4-tag system, 'b' can only be followed by 'm' or 'e', but using this maximization method, we might get an output like "bbb", which is invalid). Thus, we turn the problem around and propose a hidden model (Hidden Markov Model), much like the relationship between a function and its inverse in mathematics. By Bayes' theorem, we get:

$$P(o|\lambda)=\frac{P(o,\lambda)}{P(\lambda)}=\frac{P(\lambda|o)P(o)}{P(\lambda)}$$

Since $\lambda$ is the given input, $P(\lambda)$ is a constant and can be ignored. Therefore, maximizing $P(o|\lambda)$ is equivalent to maximizing:

$$P(\lambda|o)P(o)$$

Now, we can apply the independence assumption to $P(\lambda|o)$, obtaining:

$$P(\lambda|o)=P(\lambda_1|o_1)P(\lambda_2|o_2)\dots P(\lambda_n|o_n)$$

Meanwhile, for $P(o)$, we have:

$$P(o)=P(o_1)P(o_2|o_1)P(o_3|o_1,o_2)\dots P(o_n|o_1,o_2,\dots,o_{n-1})$$

At this point, we can make a Markov assumption: each output depends only on the previous output. Thus:

$$P(o)=P(o_1)P(o_2|o_1)P(o_3|o_2)\dots P(o_n|o_{n-1})\sim P(o_2|o_1)P(o_3|o_2)\dots P(o_n|o_{n-1})$$

Consequently:

$$P(\lambda|o)P(o)\sim P(\lambda_1|o_1) P(o_2|o_1) P(\lambda_2|o_2) P(o_3|o_2) \dots P(o_n|o_{n-1}) P(\lambda_n|o_n)$$

We call $P(\lambda_k|o_k)$ the emission probability and $P(o_k|o_{k-1})$ the transition probability. By setting certain $P(o_k|o_{k-1})=0$, we can exclude invalid combinations such as "bb" or "bs".
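To make the factorization concrete, take the earlier example "为人民服务" with the candidate tag sequence "sbebe"; the quantity to be maximized is

$$P(\lambda|o)P(o)\sim P(为|s)\,P(b|s)\,P(人|b)\,P(e|b)\,P(民|e)\,P(b|e)\,P(服|b)\,P(e|b)\,P(务|e)$$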

Python Implementation

The above is a basic introduction to HMM. If the reader has some foundation in probability theory, it should not be difficult to understand. As we can see, HMM makes a vast number of simplifications—so many that it cannot be perfectly accurate. Therefore, HMM models are generally used to solve "parts that cannot be resolved during the dictionary-lookup process" (much like what Jieba segmentation does). Of course, you could strengthen the Markov assumption—for example, by assuming each state depends on the previous two states—which would certainly result in a more accurate model, but the model's parameters would be much harder to estimate.

How do you train an HMM segmentation model? It mainly involves estimating the two sets of probabilities: $P(\lambda_k|o_k)$ and $P(o_k|o_{k-1})$. If you have a batch of tagged corpora, estimating these two probabilities is not hard. But what if you don't? Even just having a dictionary will do. We can convert a dictionary with frequencies into an HMM model. The Python implementation is as follows:

from collections import Counter
from math import log

# Build emission counts: for each tag, count how often each character carries it.
hmm_model = {i: Counter() for i in 'sbme'}

with open('dict.txt', encoding='utf-8') as f:
    for line in f:
        lines = line.split(' ')
        if len(lines[0]) == 1:
            # single-character word: the character is tagged 's'
            hmm_model['s'][lines[0]] += int(lines[1])
        else:
            # multi-character word: first character 'b', last 'e', the rest 'm'
            hmm_model['b'][lines[0][0]] += int(lines[1])
            hmm_model['e'][lines[0][-1]] += int(lines[1])
            for m in lines[0][1:-1]:
                hmm_model['m'][m] += int(lines[1])

# Log of the total count per tag, used to normalize emission probabilities.
log_total = {i: log(sum(hmm_model[i].values())) for i in 'sbme'}

# Transition probabilities, set by intuition; invalid transitions such as
# 'bb' or 'bs' are simply absent, i.e. have probability zero.
trans = {'ss': 0.3,
         'sb': 0.7,
         'bm': 0.3,
         'be': 0.7,
         'mm': 0.3,
         'me': 0.7,
         'es': 0.3,
         'eb': 0.7}
trans = {i: log(j) for i, j in trans.items()}

def viterbi(nodes):
    # Dynamic programming: at each position keep, for every tag,
    # only the best-scoring path that ends in that tag.
    paths = nodes[0]
    for l in range(1, len(nodes)):
        paths_ = paths
        paths = {}
        for i in nodes[l]:
            nows = {}
            for j in paths_:
                if j[-1] + i in trans:  # extend only along valid transitions
                    nows[j + i] = paths_[j] + nodes[l][i] + trans[j[-1] + i]
            k = max(nows, key=nows.get)
            paths[k] = nows[k]
    return max(paths, key=paths.get)

def hmm_cut(s):
    # Emission log-probabilities with Plus-1 smoothing for unseen characters.
    nodes = [{i: log(j[t] + 1) - log_total[i] for i, j in hmm_model.items()} for t in s]
    tags = viterbi(nodes)
    words = [s[0]]
    for i in range(1, len(s)):
        if tags[i] in ['b', 's']:   # 'b' or 's' starts a new word
            words.append(s[i])
        else:                       # 'm' or 'e' continues the current word
            words[-1] += s[i]
    return words

The first part of the code estimates the emission probabilities $P(\lambda_k|o_k)$ from the frequency dictionary: every single-character word is counted under the 's' tag, the first character of each multi-character word under the 'b' tag, and so on. Logarithms are used throughout to prevent underflow. The second part defines the transition probabilities, which are set directly by intuition. The third part uses the Viterbi algorithm (dynamic programming) to find the tag sequence with the highest probability. For the emission estimates, simple Plus-1 smoothing is adopted, so unseen characters are counted as 1.
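As mentioned earlier, if a tagged corpus is available, both sets of probabilities can instead be obtained simply by counting. Here is a rough sketch of that route; the function name and the assumed corpus format (an iterable of (characters, tags) pairs) are hypothetical and independent of the code above:

from collections import Counter

def estimate_from_corpus(corpus):
    # `corpus` is assumed to be an iterable of (chars, tags) pairs,
    # e.g. ('为人民服务', 'sbebe').  Counts give the emission table
    # P(char | tag) and the transition table P(tag_k | tag_{k-1}).
    emit = {t: Counter() for t in 'sbme'}
    trans_counts = Counter()
    for chars, tags in corpus:
        for ch, t in zip(chars, tags):
            emit[t][ch] += 1
        for a, b in zip(tags, tags[1:]):
            trans_counts[a + b] += 1
    # normalize transition counts into probabilities P(o_k | o_{k-1})
    prev_totals = Counter()
    for ab, c in trans_counts.items():
        prev_totals[ab[0]] += c
    trans_probs = {ab: c / prev_totals[ab[0]] for ab, c in trans_counts.items()}
    return emit, trans_probs

The tests below continue to use the dictionary-based estimates.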

The entire code is quite simple and implemented in pure Python. Of course, the efficiency is not necessarily high and is for reference only. Here are some tests:

>>> print(' '.join(hmm_cut('今天天气不错')))
今天 天气 不错
>>> print(' '.join(hmm_cut('李想是一个好孩子')))
李想 是 一个 好 孩子
>>> print(' '.join(hmm_cut('小明硕士毕业于中国科学院计算所')))
小明 硕士 毕业 于 中 国科 学院 计算 所

As can be seen, the HMM tends to bind characters together in pairs, so the results are not perfect. However, as a supplement to dictionary lookup, handling the fragments that the dictionary cannot resolve, it works quite well. For example, in "李想是一个好孩子" it automatically discovered the name "李想" (Li Xiang), something that is hard to achieve with a dictionary alone.