[Chinese Word Segmentation Series] 5. Unsupervised Word Segmentation Based on Language Models

By 苏剑林 | September 12, 2016

So far, the first four articles have introduced several ideas for word segmentation, including dictionary-based methods using maximum probability and character-tagging methods based on HMM or LSTM. These are established research methods; what I have done is simply summarize them. Dictionary-based methods and character-tagging each have their own pros and cons. I have been wondering: can we create an unsupervised word segmentation model that only needs a large-scale corpus for training? That is to say, how to segment should be determined by the corpus itself, independent of the language. Simply put, given enough data, the corpus should tell us how to segment words.

This sounds perfect, but how is it achieved? "2. New Word Discovery Based on Segmentation" provided an initial thought, but it wasn't thorough enough. The new word discovery method there can indeed be seen as a type of unsupervised segmentation approach, as it uses simple "aggregation" (cohesion) to judge whether a cut should be made. However, from the perspective of a full segmentation system, it is far too crude. Therefore, I have been thinking about how to improve this accuracy. I obtained some meaningful results earlier, but hadn't formed a complete theory until recently. I have now completed that line of thought. Since I haven't found similar work elsewhere, this can be considered an original piece of work by me in the field of word segmentation.

Language Models

First, let's briefly discuss language models.

Many readers working in data mining have heard of Word2Vec and know that it is a tool for generating word vectors; many also know to use word vectors as input features for models. However, I suspect many readers don't know why word vectors exist or why Word2Vec can generate them. The brilliance of Word2Vec itself (produced by Google, fast, effective, available in Python, etc.) has overshadowed similar products and their underlying principles. In fact, word vectors were originally introduced in order to build better language models. The classic paper is probably "A Neural Probabilistic Language Model" by Bengio, one of the pioneers of deep learning. The point here is the language model, not the word vectors. Interested readers can refer to the following articles:

Deep Learning in NLP (I) Word Vectors and Language Models:
http://licstar.net/archives/328

"How We Understand Language" series from Flickering:
http://www.flickering.cn/?s=我们是这样理解语言的

A language model is a model that calculates the conditional probability: $$p(w_n|w_1,w_2,\dots,w_{n-1})$$ where $w_1,w_2,\dots,w_{n-1}$ are the first $n-1$ words (or characters) in a sentence, and $w_n$ is the $n$-th word (or character). Language models are applied in many areas, such as word segmentation, speech recognition, and machine translation. There are many methods to obtain a language model; for example, the simplest is the "statistics + smoothing" method. There are also maximum entropy language models, CRF language models, etc. Currently, in the framework of deep learning, "neural language models" are heavily researched. The general idea is: $p(w_n|w_1,w_2,\dots,w_{n-1})$ is a function of $w_1,w_2,\dots,w_n$. Since the specific form of this function is unknown, a neural network is used to fit it. To fit it better and reduce model parameters, words are "embedded" into a real number space, represented by short vectors, and trained alongside the language model. From this perspective, word vectors are merely a byproduct of language models.
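To make the "statistics + smoothing" idea concrete, here is a minimal sketch of a character bigram model with add-one (Laplace) smoothing. The function names and the toy corpus are assumptions made for illustration only; the KenLM tool used later in this article relies on modified Kneser-Ney smoothing, not add-one.

```python
from collections import Counter

def train_bigram(corpus):
    """Count character contexts and bigrams in an iterable of strings."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        vocab.update(sentence)
        for prev, cur in zip(sentence, sentence[1:]):
            context_counts[prev] += 1
            bigram_counts[prev, cur] += 1
    return context_counts, bigram_counts, vocab

def cond_prob(cur, prev, context_counts, bigram_counts, vocab):
    """Add-one (Laplace) estimate of p(cur | prev)."""
    return (bigram_counts[prev, cur] + 1) / (context_counts[prev] + len(vocab))

toy_corpus = ["abab", "abba"]  # toy data, purely illustrative
counts = train_bigram(toy_corpus)
p_b_given_a = cond_prob("b", "a", *counts)
p_a_given_a = cond_prob("a", "a", *counts)  # unseen bigram still gets non-zero mass
```

The smoothing term is what keeps unseen combinations such as "aa" from receiving probability zero, which matters later because a single zero would wipe out an entire segmentation path.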

The fact that word vectors generated by language models can represent semantics is very interesting, yet it stands to reason. What is semantics? For humans, semantics is a process of reasoning and understanding. Our language model, which predicts the next character from the previous $n-1$ characters, is also a reasoning process. Since it contains a reasoning component, it has the potential to capture semantics.

Unsupervised Word Segmentation

I have perhaps spoken too much about language models, but the segmentation method introduced in this article is based on a "character-based language model."

We start from the maximum probability method. If a string $s_1, s_2, \dots, s_l$ of length $l$ has an optimal segmentation result $w_1, w_2, \dots, w_m$, it should be the one among all possible partitions that maximizes the product of probabilities: $$p(w_1)p(w_2)\dots p(w_m)$$

If there is no dictionary, then the words $w_1, w_2, \dots, w_m$ do not exist. However, we can use the chain rule of conditional probability (the Bayesian formula) to convert the probability of a word into the combined probability of its characters: $$p(w)=p(c_1)p(c_2|c_1)p(c_3|c_1 c_2)\dots p(c_k|c_1 c_2 \dots c_{k-1})$$ where $w$ is a $k$-character word, and $c_1,c_2,\dots,c_k$ are the $1,2,\dots,k$-th characters of $w$ respectively. We can see that $p(c_k|c_1 c_2 \dots c_{k-1})$ is exactly the character-based language model mentioned earlier.

Of course, for a very large $k$, $p(c_k|c_1 c_2 \dots c_{k-1})$ is not easy to estimate. Fortunately, based on our experience, the average length of words is not very large. Therefore, an n-gram language model is sufficient, where $n=4$ usually yields good results.
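As a rough sketch of the decomposition above, the function below sums the log conditional probabilities of a word's characters while truncating each context to at most $n-1$ characters. The callable `cond_logprob` and the stand-in `uniform_model` are hypothetical placeholders for a real character language model.

```python
import math

def word_logprob(word, cond_logprob, n=4):
    """log10 p(word) as a sum of log10 p(c_k | up to n-1 preceding characters)."""
    total = 0.0
    for k, ch in enumerate(word):
        context = word[max(0, k - (n - 1)):k]  # keep at most n-1 characters
        total += cond_logprob(ch, context)
    return total

# Hypothetical stand-in model: every character has probability 0.1,
# whatever the context is.
def uniform_model(ch, context):
    return math.log10(0.1)

score = word_logprob("abcde", uniform_model)
```

With $n=4$, the fifth character "e" is conditioned on "bcd" only, which is exactly the truncation that makes the estimate tractable.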

How does the segmentation actually work? Suppose we have a string $s_1, s_2, s_3 \dots, s_l$. If it is not segmented at all, its path probability would be: $$p(s_1)p(s_2)p(s_3)\dots p(s_l)$$ If $s_1, s_2$ should be combined into one word, its path probability is: $$p(s_1 s_2)p(s_3)\dots p(s_l)=p(s_1)p(s_2|s_1)p(s_3)\dots p(s_l)$$ If $s_2, s_3$ should be combined into one word, its path probability is: $$p(s_1)p(s_2 s_3)\dots p(s_l)=p(s_1)p(s_2)p(s_3|s_2)\dots p(s_l)$$ If $s_1, s_2, s_3$ should be combined into one word, its path probability is: $$p(s_1 s_2 s_3)\dots p(s_l)=p(s_1)p(s_2|s_1)p(s_3|s_1 s_2)\dots p(s_l)$$ Do you see the pattern? Every segmentation method essentially corresponds to the multiplication of $l$ conditional probabilities. We find the multiplication pattern that yields the maximum result. Similarly, if we know the optimal multiplication pattern, we can write out the corresponding segmentation result.
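The pattern can be checked by brute force on a short string. The sketch below (helper names are assumptions, and it expects a non-empty string) enumerates every segmentation and lists the (character, context) pairs whose conditional probabilities would be multiplied together.

```python
from itertools import product

def segmentations(s):
    """Yield every way to split a non-empty string into contiguous pieces."""
    for cuts in product([False, True], repeat=len(s) - 1):
        words, start = [], 0
        for pos, cut in enumerate(cuts, start=1):
            if cut:
                words.append(s[start:pos])
                start = pos
        words.append(s[start:])
        yield words

def factors(words):
    """(character, context) pairs multiplied together for this segmentation."""
    return [(w[k], w[:k]) for w in words for k in range(len(w))]

all_splits = list(segmentations("abc"))
example = factors(["ab", "c"])  # p(a) * p(b|a) * p(c)
```

A string of length $l$ has $2^{l-1}$ segmentations, and each one always contributes exactly $l$ factors, so the paths are directly comparable. The exponential count is also why a dynamic programming search is needed in practice.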

Looking at it more systematically, this actually converts word segmentation into a tagging problem. If the character language model goes up to 4-gram, it is equivalent to the following character tagging:

b: single-character word or the first character of a multi-character word
c: second character of a multi-character word
d: third character of a multi-character word
e: remaining part of a multi-character word

For a character $s_k$ in a sentence, we have: $$\begin{aligned}&p(b)=p(s_k)\\ &p(c)=p(s_k|s_{k-1})\\ &p(d)=p(s_k|s_{k-2} s_{k-1})\\ &p(e)=p(s_k|s_{k-3} s_{k-2} s_{k-1}) \end{aligned}$$

This transforms word segmentation into a character tagging problem, where the probability of each tag is given by the language model. Moreover, 'b' can obviously only be followed by 'b' or 'c'; by the same reasoning, the only non-zero transition probabilities are: $$p(b|b),\,p(c|b),\,p(b|c),\,p(d|c),\,p(b|d),\,p(e|d),\,p(b|e),\,p(e|e)$$ The values of these transition probabilities determine whether the segmentation tends toward long words or short words. Finally, the optimal path is again found with the Viterbi algorithm.
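The tagging view can be tried out without any language model. The following is a minimal sketch of the Viterbi search over the four tags, using made-up log10 tag scores; it assumes all allowed transitions have weight 1 and that the first character must carry tag 'b'. The names `best_tags`, `tags_to_words` and `ALLOWED` are hypothetical.

```python
# Allowed tag transitions; everything else has probability zero.
ALLOWED = {'bb', 'bc', 'cb', 'cd', 'db', 'de', 'eb', 'ee'}

def best_tags(nodes, trans_log=None):
    """nodes[k][tag] is the log10 score of giving character k that tag."""
    trans_log = trans_log or {t: 0.0 for t in ALLOWED}
    # Assumption: a sentence starts a new word, so position 0 must be 'b'.
    best = {'b': (nodes[0]['b'], 'b')}
    for node in nodes[1:]:
        new_best = {}
        for tag, emit in node.items():
            candidates = [
                (score + trans_log[prev + tag] + emit, path + tag)
                for prev, (score, path) in best.items()
                if prev + tag in trans_log
            ]
            if candidates:  # keep only the best path ending in this tag
                new_best[tag] = max(candidates)
        best = new_best
    return max(best.values())[1]

def tags_to_words(s, tags):
    """Start a new word at every 'b'; append the character otherwise."""
    words = []
    for ch, tag in zip(s, tags):
        if tag == 'b':
            words.append(ch)
        else:
            words[-1] += ch
    return words

# Made-up scores for the string "abc"; not taken from any real model.
toy_nodes = [
    {'b': -1.0, 'c': -100.0, 'd': -100.0, 'e': -100.0},
    {'b': -3.0, 'c': -0.5, 'd': -100.0, 'e': -100.0},
    {'b': -1.0, 'c': -2.0, 'd': -3.0, 'e': -100.0},
]
toy_tags = best_tags(toy_nodes)
toy_words = tags_to_words("abc", toy_tags)
```

Because only one best path per final tag is kept at each step, the cost grows linearly with sentence length rather than exponentially.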

At this point, the problem becomes training the language model, which is unsupervised. We only need to focus on optimizing the language model, and both the theory and practice in this area are very mature, with many existing tools available. In simple terms, one could use a traditional "statistics + smoothing" model; if one wants to incorporate semantics, the latest neural language models can be used. In short, the segmentation effect depends on the quality of the language model.

Practice: Training

First, let's train the language model. The text data used here consists of 500,000 WeChat public account articles, about 2GB in size. The language model is trained using the traditional "statistics + smoothing" method, utilizing the kenlm tool.

Kenlm is a language model tool written in C++, characterized by its speed and small memory footprint; it also provides a Python interface. First, download and compile it:

wget -O - http://kheafield.com/code/kenlm.tar.gz |tar xz
cd kenlm
./bjam -j4
python setup.py install

Next, train the language model. Kenlm's input is very flexible; you don't need to pre-generate a text corpus, as it can be passed via pipes. For example, first write a p.py:

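# Python 3
import sys
import pymongo

db = pymongo.MongoClient().weixin.text_articles

# Print each article with its characters separated by spaces, encoded as UTF-8.
for text in db.find(no_cursor_timeout=True).limit(500000):
    sys.stdout.buffer.write((' '.join(text['text']) + '\n').encode('utf-8'))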

My articles are stored in MongoDB, so that is the format used above. If your data is elsewhere, please modify accordingly. It is simple: just separate the text you want to train (for a character-based model, separate every character with a space) and print them out one by one.
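If the texts are not in MongoDB, the same preparation step might look like the sketch below. The helpers `to_char_line` and `emit` are hypothetical names, and dropping whitespace so that each text stays on a single line is an assumption of this sketch rather than something the script above does.

```python
import sys

def to_char_line(text):
    """Return text as one line with a single space between characters."""
    # Whitespace (including newlines) is dropped so each text stays on one line.
    return ' '.join(ch for ch in text if not ch.isspace())

def emit(texts, out=sys.stdout):
    """Write one character-separated line per text, ready to pipe into a trainer."""
    for text in texts:
        out.write(to_char_line(text) + '\n')

line = to_char_line("微信 公众号\n文章")
```

Any iterable of strings can be passed to `emit`, for example the lines of a plain text file.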

Then you can train the language model. Here, a 4-gram model is trained:

python p.py|./kenlm/bin/lmplz -o 4 > weixin.arpa
./kenlm/bin/build_binary weixin.arpa weixin.klm

The .arpa file is a common language model format, while .klm is a binary format defined by kenlm, which takes up less space. Finally, we can load it in Python:

import kenlm
model = kenlm.Model('weixin.klm')
model.score('微 信', bos=False, eos=False)
'''
The score function outputs log probability, i.e., log10(p('微 信')).
The string can be gbk or utf-8.
bos=False, eos=False means not to automatically add start-of-sentence and end-of-sentence markers.
'''

Practice: Segmentation

With the foundations above, we can now create a segmentation system.

# Python 3
import kenlm
from math import log10

model = kenlm.Model('weixin.klm')

# These transition probabilities were set by hand.
# In general, they are chosen to make long words less likely.
trans = {'bb': 1, 'bc': 0.15, 'cb': 1, 'cd': 0.01, 'db': 1, 'de': 0.01, 'eb': 1, 'ee': 0.001}
trans = {i: log10(j) for i, j in trans.items()}

def viterbi(nodes):
    # paths maps each surviving tag sequence to its log10 score
    paths = nodes[0]
    for l in range(1, len(nodes)):
        paths_ = paths
        paths = {}
        for i in nodes[l]:
            nows = {}
            for j in paths_:
                if j[-1] + i in trans:
                    nows[j + i] = paths_[j] + nodes[l][i] + trans[j[-1] + i]
            # keep only the best path that ends in tag i
            best = max(nows, key=nows.get)
            paths[best] = nows[best]
    return max(paths, key=paths.get)

def cp(s):
    # log10 p(last character | preceding characters), as a difference of two
    # scores; falls back to -100.0 when the difference is 0 (e.g. empty input)
    return (model.score(' '.join(s), bos=False, eos=False) - model.score(' '.join(s[:-1]), bos=False, eos=False)) or -100.0

def mycut(s):
    if not s:
        return []
    # Tag number k of 'bcde' needs k preceding characters; when they are
    # missing (i < k) it gets -100.0 directly, which avoids negative slices.
    nodes = [{t: (cp(s[i-k:i+1]) if i >= k else -100.0) for k, t in enumerate('bcde')}
             for i in range(len(s))]
    tags = viterbi(nodes)
    words = [s[0]]
    for i in range(1, len(s)):
        if tags[i] == 'b':
            words.append(s[i])
        else:
            words[-1] += s[i]
    return words
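
The `cp` function above rests on one identity, written out here as a reading aid. It assumes that `model.score` with `bos=False, eos=False` returns the sum of the log10 conditional probabilities of the space-separated characters, and that the string is no longer than the model order (4 here). Under those assumptions, the difference of two scores isolates the conditional probability of the last character:

$$\log_{10} p(c_k|c_1 c_2 \dots c_{k-1}) = \log_{10} p(c_1 c_2 \dots c_k) - \log_{10} p(c_1 c_2 \dots c_{k-1})$$

For an empty string both scores are 0, so the difference is 0, which the `or -100.0` fallback turns into a very low log probability.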

Practice: Results

The language model file is nearly 3GB, so I won't upload it. Readers who need it can contact me. Below are some examples.

水是生命的源泉，是人类赖以生存且无可替代的营养物质。为使队员们更加了解水对生命的至关重要性，提高队员们对水更科学的认识与理解，倡导节水爱水的环保意识，青少年环境知识科普课堂走进大金小学，为五、六年级近 300 余名队员开展了一场《水与生命》为主题的科普知识讲座。此次活动共分为三场进行，宣讲人祝老师结合 PPT ，图文并茂、生动地从水的特性、水与生命、水与生活以及节水技巧四个方面与队员们进行了交流。祝老师告诉队员们水对人体的重要性，详细说明了水的营养组成，同时提醒队员们要学会健康科学的饮水方法，并分享了节水小窍门，希望队员们都能以自己为榜样，努力承担 “ 小小节水宣传员 ” 的职责，积极带动身边的人一起参与节约用水。 PH 试纸检测水的酸碱度，队员们都表现了浓厚的兴趣，纷纷取了试纸回家测试水质。讲座结束后，队员们都领到了 “ 小小节水宣传员 ” 培训课程的结业证书。从队员们兴奋的表情中能够感受到队员们节水爱水的决心。保护水环境，珍惜水资源，从点滴做起，从自己做起，只要每个人都做到了保护生态、爱护环境，那么碧水蓝天就会离我们越来越近！打赏小编的最好方式就是 —— 点赞 ↓↓ 长按二维码，关注我们吧！ ↓↓

As you can see, the results are quite good; the recognition of long words is effective. However, some cases might not align with our usual habits. For example, "队员们" (team members/the team) is treated as a single word, and "且无可替代" (and irreplaceable) was incorrectly segmented as "且无 可替代" because "且无" (and no) is too frequent.

区志愿者协会在前几日得知芦林街道三官殿居有一居民家庭特别困难的情况， 12月 12 日下午，招募了 7 名志愿者来到芦林三官殿周全禄老人家，送去了一袋大米和一床棉被。此次助养慰问品是由广丰区志愿者协会公益基金提供， “ 暖冬行动 ” 作为志愿者协会帮困项目的其中重要一项，由参与暖冬行动的志愿者们负责执行发放到走访核实的困境家庭手中。志愿者现场和周全禄老人交谈，从他本人和周边群众了解到他的基本家庭状况，他本人今年 62 岁，娶了一个患有精神疾病的妻子，生了 2个儿子，小孩大的 14 岁，小的 12 岁，妻子在十年前也离家出走，至今未回，留下他和 2个儿子共同生活，由于儿子遗传了母亲的精神疾病，大儿子的种种不正常表现，不能在学校正常上学，只能整天跟着小儿子两个人无所事事，游手好闲，什么事也做不了。周老本身就是一个老实巴交的农民，今年不慎干农活时摔了一跤，医药费 2万多元，都是村里和亲戚邻居帮忙筹集的。他住的房子也是亲戚筹集盖的一层瓦房。凌乱的客厅，衣服基本上就是没有什么换洗，湿了就随意搭着晾干，然后接着穿我们在他家看到做的饭菜，这就是一家人赖以生存的厨房。这就是卧室，床铺被褥都是破旧不堪，我们带去的一床新棉被他的外甥女偶尔帮他整理下卫生，做些家务赠人玫瑰，手有余香；扶困助弱，千古美德；能力不分大小，善举不分先后，真情重在付出。众人拾柴火焰高，我们将把所有爱心力量汇集在一起，传递社会大家庭的温暖，传递社会正能量，放飞困境儿童的未来梦想！伸出您的双手，奉献您的爱心，让我们行动起来，共同关爱困境家庭，让所他们同在蓝天下健康快乐成长！如果您或您身边的人有 12 - 15 岁男孩子的衣物，棉被等暖冬物质可以捐赠，请伸出您充满爱心的双手，给这个特殊家庭一个暖暖的冬日！！！暖冬物质接收地址：广丰区志愿者协会暖冬物质接收联系人： 18 6 07 03 48 18 （段先生） 13 8 70 32 70 03 （陈女士）供稿：段建波图片：段建波编辑：周小飞

It can be seen that even for long idioms like "拾柴火焰高" (Many hands make light work), there is good recognition. Of course, there are many incorrect examples, such as "把所有" (take all), "让我们" (let us), and "请伸出您" (please reach out your...) becoming single "words."

根据业务发展需要，现将我公司 20 16 年招聘应届高校毕业生公告如下 : 一、招聘岗位 20 16 年我公司拟招聘应届高校毕业生 20 名。招聘岗位和学历、专业要求见下表。二、报名条件 1. 列入国家招生计划、具备派遣资格、处于毕业学年的全日制普通高等院校在校生，以及经教育部留学服务中心认证并具备派遣资格的归国留学生 ; 2. 遵守国家法律法规和学校规章制度，具有良好的思想品质和道德素质，无刑事犯罪和严重违反校纪校规记录 ; 3. 专业对口，符合工作岗位要求，热爱铁路集装箱事业 ; 4. 学习成绩优良，取得相应的大学本科及以上学历和学位证书 ; 应聘在京单位岗位毕业生需取得国家大学外语四级考试合格证书 ( 主修其他语种除外 ); 5. 身心健康，近期医院健康体检合格，能够适应应聘岗位工作要求。三、报名方法应聘者需登录 " 中国铁路人才招聘网 — 个人中心 " 栏目按照流程进行报名应聘 ( 首次登录须进行网上注册 )。报名截止日期为 20 16 年 1月 10 日。每人限报一个岗位。四、招聘流程 1. 资格确认。根据资格审查和初步筛选情况，于201 6年 2月 28 日前，择优以邮件、短信或电话方式通知毕业生参加招聘考试。 2. 招聘考试。参加招聘考试的毕业生应携带在中国铁路人才招聘网打印的毕业生应聘登记表，本人身份证、学生证、所在学校盖章的就业推荐表、成绩单、外语证书等材料的原件及复印件。招聘考试在 20 16 年 4月 15 日前完成，具体时间、地点另行通知。 3. 人员公示。拟录用人选将统一在中国铁路人才招聘网和公司官网进行公示。招聘过程中，对未进入下一环节的毕业生不再另行通知。五、其他事项 1. 公司不委托第三方招聘，也不在招聘过程中向应聘者收任何费用。 2. 应聘者的报名材料概不退回，在招聘过程中公司对应聘者的相关信息予以保密。毕业生应对招聘各环节所提供的材料的真实性负责，凡弄虚作假的，一经发现，取消聘用资格。 3. 单位地址：北京市西城区鸭子桥路 24 号中铁商务大厦邮政编码： 10 00 55 联系电话：0 10 - 51 89 27 23

To Summarize

Overall, this unsupervised word segmentation method essentially summarizes our character usage habits and extracts common character usage patterns. Therefore, it has very good recognition effects for many long words, especially fixed idioms. At the same time, some frequent character combinations, such as the aforementioned "让我们" (let us), are also treated as single words. We might think this is unreasonable, but thinking about it another way: since we say "让我们" so frequently, why not treat "让我们" as a single "word"?

In other words, word segmentation is essentially about pre-extracting fixed linguistic patterns. These patterns are not necessarily "words" as we traditionally define them; they could also be habitual expressions. Of course, there is a trade-off: if the granularity of segmentation is too fine, the number of words in the vocabulary won't be too large, but the length of a single sentence will increase; if the granularity is too coarse, the number of words in the vocabulary might explode, but the benefit is that sentence length decreases. The segmentation method provided in this article allows for the adjustment of granularity by modifying transition probabilities to adapt to different tasks.
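A toy sketch of how a transition probability steers granularity: the same made-up tag scores are decoded twice, once with the hand-set transitions from the segmentation code and once with a hypothetical setting that no longer penalizes two-character words. The exhaustive search in `best_tagging` is an assumption chosen for brevity and is only suitable for tiny inputs.

```python
from itertools import product
from math import log10

def best_tagging(nodes, trans):
    """Exhaustively score every tag sequence (tiny toy inputs only)."""
    log_trans = {k: log10(v) for k, v in trans.items()}
    best_score, best_tags = float('-inf'), None
    for tags in product('bcde', repeat=len(nodes)):
        if tags[0] != 'b':  # assumption: a sentence starts a new word
            continue
        pairs = [a + b for a, b in zip(tags, tags[1:])]
        if any(p not in log_trans for p in pairs):
            continue
        score = sum(node[t] for node, t in zip(nodes, tags))
        score += sum(log_trans[p] for p in pairs)
        if score > best_score:
            best_score, best_tags = score, ''.join(tags)
    return best_tags

# Made-up log10 scores for a two-character string; not from any real model.
nodes = [{'b': -2.0, 'c': -9.0, 'd': -9.0, 'e': -9.0},
         {'b': -2.0, 'c': -1.5, 'd': -9.0, 'e': -9.0}]
base = {'bb': 1, 'bc': 0.15, 'cb': 1, 'cd': 0.01, 'db': 1, 'de': 0.01, 'eb': 1, 'ee': 0.001}
coarse = dict(base, bc=1.0)  # hypothetical: stop penalizing two-character words

fine_tags = best_tagging(nodes, base)      # two single-character words
coarse_tags = best_tagging(nodes, coarse)  # one two-character word
```

With the hand-set value 0.15 the penalty outweighs the gain from joining the characters, so they stay separate; raising it to 1 lets the same pair merge into one word.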

Also, as mentioned before, the effectiveness of the segmentation depends on the quality of the language model. This means we only need to focus on optimizing the language model, which can be trained unsupervised—a clear advantage. For example, if we hope to achieve a segmentation model with semantic understanding capabilities, we can train the language model using methods like neural networks. If we prioritize speed, traditional statistical methods work well (using kenlm to obtain a language model from 500,000 texts took less than 10 minutes). In short, it provides the maximum degree of freedom.