Lightweight Deep Learning Word Segmentation System: NNCWS v0.1

By 苏剑林 | November 29, 2016

Alright, I'll admit to playing the clickbait game once... In reality, the word segmentation system in this article is a three-layer neural network, so it is really "shallow learning"; writing "deep learning" just makes the title more attractive. NNCWS stands for Neural Network based Chinese Word Segmentation System. It is written in Python and is fully open-source for readers to try out.

Small Talk

What are the special features of this program? Almost none! It simply uses a neural network combined with character embeddings to implement an n-gram-style segmenter (the program uses 7-grams). It doesn't use high-end models like "Chinese Word Segmentation Series 4: seq2seq Character Labeling Based on Bi-LSTM", nor does it do unsupervised training like "Chinese Word Segmentation Series 5: Unsupervised Segmentation Based on Language Models". It is a purely supervised, simple model, trained on the 2014 People's Daily annotated corpus.
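For concreteness, here is a minimal Keras sketch of what such a window classifier looks like. All the sizes here are illustrative assumptions (the real hyperparameters live in nncws_train.py), and a 4-tag s/b/m/e character-labeling scheme is assumed, as in the earlier posts of this series:

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

vocab_size = 5000  # assumption: number of distinct characters kept
embed_dim = 32     # assumption: kept small for lightness
window = 7         # the 7-gram context: center character plus 3 on each side
num_tags = 4       # assumption: s/b/m/e labels

model = Sequential([
    Embedding(vocab_size, embed_dim, input_length=window),
    Flatten(),                               # concatenate the 7 embeddings
    Dense(64, activation='relu'),            # assumption: hidden size and activation
    Dense(num_tags, activation='softmax'),   # per-character tag probabilities
])
model.compile(loss='categorical_crossentropy', optimizer='adam')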

So what is the significance of this program? In a word: lightweight! Today's deep learning programs tend to be massive and require various dependency libraries, which may be hard to install on some platforms (Windows, for instance); moreover, their parameter counts can be so enormous that speed becomes a problem. As a result, many deep learning projects are only played with in the lab or used to publish papers, and remain far from actual production use. This program tries to be as streamlined as possible: the character-embedding dimension and the model scale are kept to a minimum, and in the end the program was reproduced in pure NumPy. In other words, I trained the model with Keras, extracted the model parameters, and then wrote the computation in NumPy against those parameters. As a result, the final program needs only NumPy to run. Portability, then, is its main feature!
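The NumPy "reproduction" is just the network's forward pass written out by hand against the exported weights. A minimal sketch with hypothetical variable names, assuming the matrices keep Keras's input-by-output orientation:

import numpy as np

def forward(char_ids, W_embed, W1, b1, W2, b2):
    # char_ids: the 7 character indices of the current window
    x = W_embed[char_ids].reshape(-1)       # look up and concatenate embeddings
    h = np.maximum(np.dot(x, W1) + b1, 0)   # hidden layer (ReLU assumed)
    z = np.dot(h, W2) + b2                  # tag scores
    e = np.exp(z - z.max())                 # numerically stable softmax
    return e / e.sum()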

Of course, a neural-network-based segmenter like this already shows preliminary semantic understanding: it handles ambiguous character combinations reasonably well and does a decent job of recognizing entities such as personal and place names, so it is worth using in general scenarios. Since it was trained on the People's Daily corpus, it performs best on news text.

Download and Use

GitHub Address: https://github.com/bojone/NNCWS

First, install the dependency NumPy and fetch the code, for example:

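# install the only runtime dependency, then fetch the code
pip install numpy
git clone https://github.com/bojone/NNCWS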

Then you can use it directly. A sketch of the idea (the function name here is assumed for illustration; see the repo for the actual interface):

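import nncws
# hypothetical entry point; pass any Chinese text, e.g. the news
# passages whose segmented output is shown below
print(nncws.seg(u'今天天气不错'))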

Output:

As the leader of a small country, Castro naturally cannot be viewed in the same light as Mao Zedong. Yet in Castro, the brilliance of that great era is nonetheless reflected. Today, as we bid farewell to Castro, we are bidding farewell to that great era; what we must keep, however, is the immortal spiritual core of that era: —— the steadfast pursuit of ideals; —— the firm belief in national independence; —— the "tough bones" spirit that fears no threat!

In January 2000, Robin Li founded Baidu. After more than ten years of development, Baidu has grown into the world's second-largest independent search engine and the largest Chinese search engine. Baidu's success has also made China, after the United States, Russia, and South Korea, one of only four countries in the world to possess core search-engine technology. In 2005, Baidu successfully listed on the NASDAQ in the United States and became the first Chinese company to enter the NASDAQ-100 index. Baidu has become one of China's most valuable brands.

Training Process

If you are interested in the model structure and the training process, see nncws_train.py: it contains the preprocessing of the 2014 People's Daily corpus, the model definition, and the training itself. The file is short and clear, so I won't say much more about it. Note that training additionally requires Keras.
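For a flavor of the preprocessing, here is a sketch of the standard way to turn a pre-segmented corpus line into per-character labels under the assumed s/b/m/e scheme (the actual code in nncws_train.py may differ):

def tag_words(words):
    # one tag per character: s = single, b = begin, m = middle, e = end
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append('s')
        else:
            tags.extend(['b'] + ['m'] * (len(w) - 2) + ['e'])
    return tags

print(tag_words([u'人民', u'日报']))  # -> ['b', 'e', 'b', 'e']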

If you don't want to look at the code, just put the script in the directory of the People's Daily corpus and run python nncws_train.py directly. The 2014 People's Daily corpus can be found and downloaded by searching online.

Final Words

In fact, this is just a toy, hence version 0.1; I reckon it is still some way from being truly practical. But it counts as a new attempt: lowering the barrier to applying technologies like deep learning (a high training barrier doesn't matter; what the public cares about is the barrier to application), keeping the effectiveness of new models while retaining portability.

An obvious drawback of such a model is that a user who finds it unsatisfactory cannot easily adjust it, whereas with traditional dictionary-based methods one only needs to add entries to the dictionary, which is more flexible. So the next goal is to combine supervised training, unsupervised new-word discovery, and user dictionaries. I already have some ideas for this, so stay tuned.