By 苏剑林 | January 13, 2017
I have previously written about LSTM-based schemes for word segmentation. Today, I'm presenting another one, this time using a CNN, specifically an FCN (Fully Convolutional Network). The primary goal of this model isn't actually to research Chinese word segmentation itself, but to practice TensorFlow. I've been using Keras for two years now and am quite familiar with it, but I've gradually run into some of its limitations, such as the inconvenience of handling variable-length inputs and the difficulty of adding custom constraints. So I decided to try native TensorFlow. Having tried it, I found it isn't all that hard; after all, it's all Python, and how hard can Python be? This article serves as an exercise in using TensorFlow to handle variable-length sequences, with Chinese word segmentation as the example. At the end, I incorporate Hard Decoding, which combines the deep learning model with dictionary-based segmentation.
CNN
Regarding FCNs: interpreted in terms of language tasks, a (one-dimensional) convolution is essentially an n-gram model. From this perspective, CNNs are actually far more natural than RNNs: RNNs feel like something meticulously designed for sequence tasks, whereas CNNs are a direct extension of the traditional n-gram model. Furthermore, both CNNs and RNNs use weight sharing, which might look like a compromise made just to cut computation, but there is deeper reasoning behind it: in a CNN, weight sharing is an inevitable consequence of translation invariance, not merely a way to save computation. Imagine shifting an image slightly, or inserting a meaningless space at the beginning of a sentence (so that every subsequent character shifts back by one position). The result should be similar or even identical, and that requires the convolution weights to be shared, i.e. not tied to any specific position.
RNN-type models, especially LSTMs, have long been the kings of language tasks. However, recent GCNNs (Gated Convolutional Neural Networks) are said to have slightly surpassed LSTMs in language modeling, which shows that CNNs have significant potential even in language tasks. The advantage of LSTMs is their ability to capture long-distance dependencies, but in fact there aren't many truly long-distance tasks in language. Even in language modeling, the probability of the next character depends mostly on the preceding few characters rather than on the entire preceding text, and as long as a CNN has enough layers and large enough kernels, it can achieve the same effect. CNNs also have another distinct advantage: they are much faster than RNNs. With GPU acceleration the gap widens further, since convolutions are what GPUs do best (they were originally designed for image processing), so CNNs benefit from GPUs far more than RNNs do...
These points make me prefer CNNs, much like the Facebook team (who developed GCNN). A Fully Convolutional Network uses convolutions from start to finish and can handle variable-length inputs. It is particularly suitable for tasks where the input length is variable but the input and output lengths are equal.
Corpus
The task in this article is to build a Chinese word segmentation system using an FCN. The approach follows the "sbme" character tagging method (if you're unsure, refer to previous articles). Since it's supervised training, a corpus is required. Two good corpora are: first, the 2014 People's Daily corpus; second, the corpus from the backoff2005 competition, which also comes with an evaluation system. I have experimented with both.
If using the 2014 People's Daily corpus, the preprocessing code is:
# Preprocessing code for People's Daily
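Since the original code isn't reproduced above, here is a minimal sketch of the idea. It assumes the corpus has already been cleaned into lines of space-separated word/pos tokens (the real 2014 People's Daily files also contain nested bracket annotations that would need stripping first); each word is converted to per-character tags in the s/b/m/e scheme:

import re

def make_tags(words):
    # single-character word -> 's'; otherwise 'b' + 'm'*(n-2) + 'e'
    return ''.join('s' if len(w) == 1 else 'b' + 'm' * (len(w) - 2) + 'e'
                   for w in words)

def load_renmin(path):
    for line in open(path, encoding='utf-8'):
        # drop the part-of-speech tag after the slash in each word/pos token
        words = [re.sub('/.*', '', token) for token in line.split()]
        words = [w for w in words if w]
        yield ''.join(words), make_tags(words)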
If using the backoff2005 corpus, the preprocessing code is:
# Preprocessing code for backoff2005
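The backoff2005 training file is simply segmented text with words separated by spaces, so the conversion is even simpler (a sketch reusing make_tags from above):

def load_backoff2005(path):
    for line in open(path, encoding='utf-8'):
        words = line.split()
        if words:
            yield ''.join(words), make_tags(words)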
Then, the corpus is sorted by string length. This is because although TensorFlow supports variable-length inputs, during training, the lengths within each batch must be equal. Therefore, a simple clustering (clustering by length) is needed. Next, a mapping table is created, which is quite standard:
# Mapping table creation
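As a sketch of what such a mapping table might look like: collect all characters, reserve id 0 for padding/unknown characters (which is why the embedding matrix below has len(chars)+1 rows), and fix the tag order to match num_tags = 4 in the model:

from collections import Counter

# e.g. corpus = list(load_backoff2005(...)): a list of (sentence, tags) pairs
chars = Counter()
for sentence, tags in corpus:
    chars.update(sentence)

char2id = {c: i + 1 for i, c in enumerate(chars)}  # id 0 reserved for padding/unknown
tag2id = {'s': 0, 'b': 1, 'm': 2, 'e': 3}  # matches num_tags = 4 below

def to_ids(sentence, tags):
    return [char2id[c] for c in sentence], [tag2id[t] for t in tags]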
Create a generator to produce the training samples for each batch. Note that batch_size here is only an upper limit: since all sentences within a batch must have the same length, not every batch will actually contain 1024 sentences.
# Batch generator code
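A sketch of such a generator: sort the corpus by sentence length (the simple "clustering" mentioned above), then emit a batch whenever it is full or the sentence length changes, so every batch is a rectangular array:

import numpy as np

batch_size = 1024  # upper limit only

def gen_batches(corpus):
    corpus = sorted(corpus, key=lambda pair: len(pair[0]))
    batch_x, batch_y = [], []
    for sentence, tags in corpus:
        ids, tag_ids = to_ids(sentence, tags)
        # flush the batch when it is full or the sentence length changes
        if batch_x and (len(batch_x) == batch_size or len(ids) != len(batch_x[-1])):
            yield np.array(batch_x), np.array(batch_y)
            batch_x, batch_y = [], []
        batch_x.append(ids)
        batch_y.append(tag_ids)
    if batch_x:
        yield np.array(batch_x), np.array(batch_y)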
Model
Now for building the model. It's actually quite simple: three stacked convolutional layers. The input length is left unspecified (None), and padding='SAME' ensures the output length matches the input length (for this reason, no pooling is used). ReLU is the intermediate activation, Softmax is applied at the end, and cross-entropy is the loss function. With TensorFlow you have to write each step out yourself, but it's really not that complicated.
import tensorflow as tf
# Parameters
embedding_size = 128
keep_prob = tf.placeholder_with_default(0.5, shape=[])  # dropout keep rate 0.5 for training; feed 1.0 when predicting
num_tags = 4 # s, b, m, e
# Placeholders
x = tf.placeholder(tf.int32, [None, None]) # Batch size, sequence length
y = tf.placeholder(tf.int32, [None, None]) # Batch size, sequence length
# Embedding Layer
embeddings = tf.Variable(tf.random_uniform([len(chars)+1, embedding_size], -1.0, 1.0))
embedded = tf.nn.embedding_lookup(embeddings, x)
embedded_dropout = tf.nn.dropout(embedded, keep_prob)
# Convolutional Layers
# Filter size 3, 5, 3
W_conv1 = tf.Variable(tf.truncated_normal([3, embedding_size, 128], stddev=0.1))
b_conv1 = tf.Variable(tf.constant(0.1, shape=[128]))
conv1 = tf.nn.relu(tf.nn.conv1d(embedded_dropout, W_conv1, stride=1, padding='SAME') + b_conv1)
W_conv2 = tf.Variable(tf.truncated_normal([5, 128, 128], stddev=0.1))
b_conv2 = tf.Variable(tf.constant(0.1, shape=[128]))
conv2 = tf.nn.relu(tf.nn.conv1d(conv1, W_conv2, stride=1, padding='SAME') + b_conv2)
W_conv3 = tf.Variable(tf.truncated_normal([3, 128, num_tags], stddev=0.1))
b_conv3 = tf.Variable(tf.constant(0.1, shape=[num_tags]))
y_conv = tf.nn.conv1d(conv2, W_conv3, stride=1, padding='SAME') + b_conv3
# Loss and Optimizer
# Flatten to 2D so that each character position becomes one classification
# example for sparse_softmax_cross_entropy_with_logits
reshape_y_conv = tf.reshape(y_conv, [-1, num_tags])
reshape_y = tf.reshape(y, [-1])
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=reshape_y, logits=reshape_y_conv))
optimizer = tf.train.AdamOptimizer(1e-4).minimize(loss)
# Accuracy
correct_prediction = tf.equal(tf.cast(tf.argmax(reshape_y_conv, 1), tf.int32), reshape_y)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
That is the entire model. Then we train. I highly recommend using tqdm for progress display (it shows progress, speed, and accuracy in real time); it's a perfect match for this kind of training loop.
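The full training loop isn't shown here; below is a minimal sketch of how it might look with tqdm, using the batch generator from above (the log format mirrors the output that follows):

from tqdm import tqdm

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for epoch in range(10):
    accs = []
    pbar = tqdm(gen_batches(corpus))
    for batch_x, batch_y in pbar:
        _, acc = sess.run([optimizer, accuracy],
                          feed_dict={x: batch_x, y: batch_y})
        accs.append(acc)
        pbar.set_description('Epoch %s, Accuracy: %s' % (epoch + 1, acc))
    print('Epoch %s Mean Accuracy: %s' % (epoch + 1, np.mean(accs)))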
Training process output (trained on a MacBook CPU; using a GTX 1060 only takes 3s per epoch):
Epoch 1, Accuracy: 0.717359: 347it [01:06, 5.21it/s]
Epoch 1 Mean Accuracy: 0.56555
Epoch 2, Accuracy: 0.759943: 347it [01:08, 8.62it/s]
Epoch 2 Mean Accuracy: 0.74762
Epoch 3, Accuracy: 0.598692: 347it [01:08, 5.08it/s]
Epoch 3 Mean Accuracy: 0.693505
Epoch 4, Accuracy: 0.634529: 347it [01:07, 5.14it/s]
Epoch 4 Mean Accuracy: 0.613064
Epoch 5, Accuracy: 0.659949: 347it [01:07, 5.16it/s]
Epoch 5 Mean Accuracy: 0.643388
Epoch 6, Accuracy: 0.709635: 347it [01:07, 5.14it/s]
Epoch 6 Mean Accuracy: 0.679544
Epoch 7, Accuracy: 0.742839: 271it [00:42, 2.45it/s]
...
Hard Decoding
After training, what remains is prediction, tagging, and segmentation; these steps are routine and don't require much explanation (a sketch follows below). Finally, the model reaches 93% accuracy on the backoff2005 evaluation set (calculated with the score script provided by backoff2005). This isn't the absolute best, but it's sufficient, especially considering the adjustments below.
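For completeness, here is a minimal sketch of the decoding step, not the original code: run the network on one sentence (with dropout disabled), prune with Viterbi under the s/b/m/e transition constraints (for instance, 'b' can only be followed by 'm' or 'e', and a sentence must start with 's' or 'b' and end with 's' or 'e'), and finally cut the sentence before every 's' and 'b':

y_prob = tf.nn.softmax(y_conv)  # per-character tag probabilities

def predict_probs(sentence):
    ids = [[char2id.get(c, 0) for c in sentence]]  # unknown characters -> id 0
    return sess.run(y_prob, feed_dict={x: ids, keep_prob: 1.0})[0]

id2tag = ['s', 'b', 'm', 'e']
trans = {'ss', 'sb', 'bm', 'be', 'mm', 'me', 'es', 'eb'}  # allowed tag bigrams

def viterbi(probs):
    # keep, for each possible current tag, the best path ending in that tag
    # (for very long sentences one would work in log space to avoid underflow)
    paths = {'s': probs[0, 0], 'b': probs[0, 1]}
    for t in range(1, len(probs)):
        new_paths = {}
        for i, tag in enumerate(id2tag):
            cands = {p: score * probs[t, i]
                     for p, score in paths.items() if p[-1] + tag in trans}
            if cands:
                best = max(cands, key=cands.get)
                new_paths[best + tag] = cands[best]
        paths = new_paths
    ends = {p: score for p, score in paths.items() if p[-1] in 'se'}
    return max(ends, key=ends.get)

def segment(sentence, probs):
    tags = viterbi(probs)
    words = []
    for char, tag in zip(sentence, tags):
        if tag in 'sb':
            words.append(char)  # 's' or 'b' starts a new word
        else:
            words[-1] += char   # 'm' or 'e' continues the current word
    return words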
As is well known, segmentation by character tagging requires a tagged corpus for training, and once trained the model is tied to that corpus: it is hard to extend to new domains, and when errors are discovered they cannot be corrected quickly. Dictionary-based methods, in contrast, are easy to adjust: just add or remove words, or tweak word frequencies. So it is worth considering how to combine deep learning with a dictionary. Here, I simply add Hard Decoding (manual-intervention decoding) at the final decoding stage.
The model predicts the probability of each tag for every character, and then the Viterbi algorithm finds the optimal tag path. Before running Viterbi, however, we can use a dictionary to adjust the tag probabilities. The approach: add an add_dict.txt file, where each line contains a word and a multiplier, the factor by which the corresponding tag probabilities are scaled. For example, suppose the dictionary contains "科学空间,10" and the sentence "科学空间挺好" (Scientific Spaces is quite good) is being segmented: first, get the tag probabilities for these six characters from the model. Then, finding that the word "科学空间" occurs in this sentence, multiply the probability of the first character being 'b' by 10, the probabilities of the second and third characters being 'm' by 10, and the probability of the fourth character being 'e' by 10 (no normalization is needed, since we only care about relative values). Similarly, if a boundary keeps being missed (a place that should have been cut wasn't), you can add the offending string to the dictionary with a multiplier less than 1.
Effect:
Before adding to dictionary: 扫描 二维码 , 关注 微 信号 。 (Scan the QR code, follow the micro-signal.)
(After adding "微信号,10" to dictionary): 扫描 二维码 , 关注 微信号 。 (Scan the QR code, follow the official account [WeChat ID].)
Of course, this is just an empirical method. Part of the code follows. Since it is only for demonstration, I use regular expressions to iterate over the matches; for efficiency, a multi-pattern matching tool such as an Aho-Corasick (AC) automaton should be used.
# Example code for probability adjustment using a dictionary
# (demonstration only: regular expressions instead of an AC automaton)
import re

def adjust_probs(sentence, probs, dictionary):
    # probs: numpy array of shape (len(sentence), 4), columns (s, b, m, e)
    # dictionary: {word: multiplier}; assumes words of at least two characters
    for word, factor in dictionary.items():
        for match in re.finditer(re.escape(word), sentence):
            i, j = match.start(), match.end()  # word spans characters i..j-1
            probs[i, 1] *= factor              # first character: 'b'
            probs[i + 1:j - 1, 2] *= factor    # middle characters: 'm'
            probs[j - 1, 3] *= factor          # last character: 'e'
    return probs
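Finally, a hypothetical loader for the add_dict.txt file described above (assuming one comma-separated "word,multiplier" pair per line, as in the "科学空间,10" example), together with how it plugs into the decoding sketch:

def load_dict(path):
    dictionary = {}
    for line in open(path, encoding='utf-8'):
        word, factor = line.strip().split(',')
        dictionary[word] = float(factor)
    return dictionary

# hard-decoded segmentation of a sentence s:
# segment(s, adjust_probs(s, predict_probs(s), load_dict('add_dict.txt')))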