Information Entropy Method and Implementation for New Word Discovery

By 苏剑林 | October 26, 2015

In previous posts on this blog, I have briefly touched on Chinese text processing and mining. The most significant difference between mining Chinese data and doing the same tasks in English is that written Chinese has no spaces between words, so word segmentation has to be performed before most linguistic tasks can proceed. Popular segmentation methods today are dictionary-based, which raises an obvious question: where does the dictionary come from? Common words can be collected by hand, but a manually maintained dictionary cannot keep up with the constant stream of new words, especially internet slang, which is often precisely what a linguistic task hinges on. A core task in Chinese language processing is therefore to refine new word discovery algorithms.

New word discovery means automatically extracting fragments that behave like words from a massive corpus, without any prior lexicon or other material. A few days ago I visited Xiaoxia's company and tried joining one of their development projects, which mainly involves processing web articles. That prompted me to brush up on new word discovery algorithms, in particular the article on Matrix67.com, "Sociolinguistics in the Internet Age: Text Data Mining Based on SNS", and the information entropy ideas it contains. Following his logic, I wrote a simple Python script.

The algorithm is taken entirely from Matrix67's article; interested readers should head over to his blog for a detailed read, which I believe will be well worth the time. Here I will only sketch how the code implements it; the full program is at the end of the post. To cope with larger texts, I avoided native Python loops as far as possible and relied on functions from third-party libraries such as NumPy and Pandas. Since a huge number of candidate words are involved, indexing matters: for example, sorting the word entries beforehand greatly speeds up lookups.
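
In brief, Matrix67's criteria judge a candidate fragment by two statistics: its cohesion, i.e. how much more often the fragment occurs than its prefix and suffix would suggest if they combined at random, and the information entropy of the characters immediately to its left and right, which measures how freely the fragment combines with its context. A fragment is kept only if both quantities exceed their thresholds (min_support and min_s in the code below). The following minimal sketch of the two quantities is purely illustrative; the function names and arguments are my own and are not part of the script at the end of the post:

import numpy as np
import pandas as pd

def cohesion(word, freq, total):
    # freq maps a fragment to its number of occurrences; total is the corpus size in characters.
    # For each split point, compare P(word) with P(prefix) * P(suffix);
    # a word is kept only if even the smallest ratio exceeds min_support.
    return min(total * freq[word] / (freq[word[:i]] * freq[word[i:]])
               for i in range(1, len(word)))

def boundary_entropy(neighbors):
    # neighbors: the characters observed on one side of the word (a list or pandas Series).
    counts = pd.Series(list(neighbors)).value_counts()
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()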

Below are partial results of running this script on the Century Revised Edition of Jin Yong's novel Demi-Gods and Semi-Devils (a 2.5 MB txt file). It took about 20 seconds, and the results look quite good: the names of the main characters were all discovered automatically. Note that the output is concatenated by word length (two-character words first, then three- and four-character ones), so the counts are only sorted within each group. Of course, since the code is short and does no specialized processing, there is still plenty of room for improvement.

Duan Yu, 3535
Shenme (What), 2537
Xiao Feng, 1897
Ziji (Self), 1730
Xu Zhu, 1671
Qiao Feng, 1243
A Zi, 1157
Wugong (Martial Arts), 1109
A Zhu, 1106
Guniang (Miss/Girl), 1047
Xiaodao (Said with a smile), 992
Zanmen (We), 832
Shifu (Master), 805
Ruhe (How), 771
Ruci (So/Thus), 682
Dali, 665
Gaibang (Beggars' Gang), 645
Turan (Suddenly), 640
Wang Yuyan, 920
Murong Fu, 900
Duan Zhengchun, 780
Mu Wanqing, 751
Jiumozhi, 600
You Tanzhi, 515
Ding Chunqiu, 463
Youshenme (Have what), 460
Bao Butong, 447
Shaolin Temple, 379
Emperor Baoding, 344
Mrs. Ma, 324
Duan Yanqing, 302
Elder Wu, 294
Buyoude (Can't help but), 275
Mrs. Wang, 265
Weishenme (Why), 258
Zhitingde (Only heard), 255
Shishenme (What is it), 237
Yun Zhonghe, 236
That girl, 234
Ba Tianshi, 230
Miss Wang, 227
Hutingde (Suddenly heard), 221
Zhong Wanchou, 218
Shaolin Sect, 216
Ye Erniang, 216
Zhu Danchen, 213
Feng Bo'e, 209
Khitan people, 208
Crocodile Deity of the South Sea, 485
Young Master Murong, 230
Yelu Hongji, 189
Six Meridian Divine Sword, 168
Stood up, 116
Leading Big Brother, 103
These few words, 100
Nodded, 96
Old Monster of Xingxiu, 92
Fairy Sister, 90
Startled, 87
Greatly startled, 86
Mr. Murong, 86
Youshenme (What else), 86

Complete code (written for Python 3.x; it can be used under 2.x with minor modifications, mainly to the output functions):

import numpy as np
import pandas as pd
import re
from numpy import log

f = open('data.txt', 'r', encoding='utf-8') # Open the article (assumed to be UTF-8 encoded)
s = f.read() # Read it in as a single string

# Define punctuation to be removed
drop_dict = [u'，', u'\n', u'。', u'、', u'：', u'(', u')', u'[', u']', u'.', u',', u' ', u'\u3000', u'”', u'“', u'？', u'?', u'！', u'‘', u'’', u'…']
for i in drop_dict: # Remove punctuation
 s = s.replace(i, '')

# For convenience, define a dictionary for regex patterns
myre = {2:'(..)', 3:'(...)', 4:'(....)', 5:'(.....)', 6:'(......)', 7:'(.......)'}

min_count = 10 # Minimum number of occurrences for a candidate word
min_support = 30 # Minimum cohesion (support); a value of 1 corresponds to a purely random combination of characters
min_s = 3 # Minimum left/right neighbor entropy; the higher it is, the more likely the fragment is an independent word
max_sep = 4 # Maximum character length for candidate words
t = [] # Frequency tables for fragments of each length

t.append(pd.Series(list(s)).value_counts()) # Count character by character
tsum = t[0].sum() # Total character count
rt = [] # Filtered candidate words for each length

for m in range(2, max_sep+1):
 print(u'Generating %s-character words...'%m)
 t.append([])
 for i in range(m): # re.findall returns non-overlapping matches, so shifting the start by 0..m-1 covers every m-character fragment
  t[m-1] = t[m-1] + re.findall(myre[m], s[i:])

 t[m-1] = pd.Series(t[m-1]).value_counts() # Count word frequencies
 t[m-1] = t[m-1][t[m-1] > min_count] # Filter by minimum count
 tt = t[m-1][:]
 for k in range(m-1): # Check cohesion at every split point of the word
  qq = np.array(list(map(lambda ms: tsum*t[m-1][ms]/t[m-2-k][ms[:m-1-k]]/t[k][ms[m-1-k:]], tt.index))) > min_support # Cohesion N*P(word)/(P(prefix)*P(suffix)) must exceed min_support
  tt = tt[qq]
 rt.append(tt.index)

def cal_S(sl): # Shannon entropy of a frequency Series: H = -sum(p * log p)
 p = sl/sl.sum()
 return -(p.apply(log)*p).sum()

for i in range(2, max_sep+1):
 print(u'Filtering %s-character words by neighbor entropy (%s candidates)...'%(i, len(rt[i-2])))
 pp = [] # Collect every (left neighbor, word, right neighbor) triple
 for j in range(i+2): # Shift the start by 0..i+1 so that every occurrence, together with its neighbors, is matched
  pp = pp + re.findall('(.)%s(.)'%myre[i], s[j:])
 pp = pd.DataFrame(pp).set_index(1).sort_index() # Sort first, which is important for speeding up retrieval
 index = np.sort(np.intersect1d(rt[i-2], pp.index)) # Intersection
 # Left and right neighbor information entropy screening
 index = index[np.array(list(map(lambda w: cal_S(pd.Series(pp[0][w]).value_counts()), index))) > min_s] # Left-neighbor entropy
 rt[i-2] = index[np.array(list(map(lambda w: cal_S(pd.Series(pp[2][w]).value_counts()), index))) > min_s] # Right-neighbor entropy

# Pre-processing for output
for i in range(len(rt)):
 t[i+1] = t[i+1][rt[i]]
 t[i+1] = t[i+1].sort_values(ascending = False) # sort_values returns a new Series, so assign it back

# Save result and output
pd.DataFrame(pd.concat(t[1:])).to_csv('result.txt', header = False)
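
The output file is a plain CSV with one word,count pair per line, concatenated group by group according to word length. If you want a single list sorted globally by frequency, it can be read back with pandas; a minimal sketch (the file name is the one written above, and the column label 'count' is just my own choice):

import pandas as pd

# Read back the word,count pairs written by the script above (assumed to be UTF-8)
result = pd.read_csv('result.txt', header=None, index_col=0, names=['word', 'count'], encoding='utf-8')

# Merge all word lengths into one ranking and show the most frequent entries
print(result.sort_values('count', ascending=False).head(30))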