By 苏剑林 | August 04, 2015
In the article "Text Sentiment Classification (I): Traditional Models," I briefly introduced the traditional approach to text sentiment classification. Traditional methods are easy to understand and relatively stable. However, they face two difficult-to-overcome limitations: 1. Accuracy issues—traditional methods are generally satisfactory for common applications, but there is a lack of effective ways to further improve precision; 2. Background knowledge issues—traditional methods require the prior extraction of an emotional dictionary. This step often requires manual intervention to ensure accuracy. In other words, the person doing this work must not only be a data mining expert but also a linguist. This dependency on background knowledge hinders progress in natural language processing (NLP).
Fortunately, deep learning has solved this problem (at least to a large extent). It allows us to build models for practical problems in specific fields with almost "zero background." This article continues the discussion on text sentiment classification by explaining deep learning models. Parts that were already discussed in detail in the previous article will not be repeated here.
In recent years, deep learning algorithms have been applied to natural language processing and have achieved results superior to traditional models. Bengio and other scholars built neural probabilistic language models based on deep learning ideas and went on to train language models on large-scale English corpora with various deep neural networks, obtaining better semantic representations. These models have been used for common NLP tasks such as syntactic analysis and sentiment classification, providing new ideas for NLP in the era of big data.
According to my tests, sentiment analysis models based on deep neural networks often achieve an accuracy of over 95%. The charm and power of deep learning algorithms are evident!
For further information on deep learning, please refer to the following literature:
[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 2003
[2] A New Language Model: http://blog.sciencenet.cn/blog-795431-647334.html
[3] Organized Deep Learning Notes: http://blog.csdn.net/zouxy09/article/details/8775360
[4] Deep Learning: http://deeplearning.net
[5] Talk on Automatic Chinese Word Segmentation and Semantic Recognition: http://www.matrix67.com/blog/archives/4212
[6] Application of Deep Learning in Chinese Word Segmentation and Part-of-Speech Tagging: http://blog.csdn.net/itplus/article/details/13616045
In the article "Chitchat: Neural Networks and Deep Learning," I mentioned that the most important step in modeling is feature extraction, and NLP is no exception. In NLP, the core question is: how can a sentence be effectively represented in numerical form? If this step can be completed, sentence classification becomes straightforward. Obviously, an elementary idea is to assign a unique ID (1, 2, 3, 4...) to each word and then treat a sentence as a set of IDs. For example, if 1, 2, 3, 4 represent "I", "you", "love", and "hate" respectively, then "I love you" is [1, 3, 2] and "I hate you" is [1, 4, 2]. This approach seems effective but is actually very problematic. For instance, a stable model would assume that 3 and 4 are very close, and thus [1, 3, 2] and [1, 4, 2] should yield similar classification results. However, according to our numbering, the meanings of the words represented by 3 and 4 are completely opposite, so the classification results cannot be the same. Therefore, this encoding method cannot provide good results.
Readers might think: what if I cluster the IDs of words with similar meanings together (giving them similar IDs)? Well, indeed, if there is a way to place IDs of similar words close to each other, it would greatly improve the model's accuracy. But the problem arises: if each word is given a unique ID and similar words are given similar IDs, we are essentially assuming semantic singularity—that is, assuming semantics are only one-dimensional. However, this is not the case; semantics should be multi-dimensional.
For example, when we talk about "Home" (家园), some might think of the synonym "Family" (家庭), and from "Family," one might think of "Relatives" (亲人). These are all related meanings. On the other hand, from "Home," some might think of "Earth" (地球), and from "Earth," one might think of "Mars" (火星). In other words, both "Relatives" and "Mars" can be seen as secondary approximations of "Home," but there is no obvious connection between "Relatives" and "Mars" themselves. Furthermore, semantically speaking, "University" or "Comfortable" can also be considered secondary approximations of "Home." Clearly, with just a single unique ID, it is difficult to place these words in appropriate positions.
The divergence of words
From the above discussion, we know that the meanings of many words diverge in several directions rather than just one; a single ID is therefore not ideal. So how about multiple IDs? In other words, what about mapping a word to a multi-dimensional vector? Exactly: this is the right line of thought.
Why are multi-dimensional vectors feasible? First, they resolve the multi-directional divergence of word meanings: even a 2D vector can point in any direction over a full 360 degrees, let alone vectors with hundreds of dimensions as typically used in practice. Second, there is a practical benefit: multi-dimensional vectors let us represent words with numbers that vary only within a small range. What does this mean? In Chinese, the number of words runs into the hundreds of thousands. If each word is given a unique ID, the IDs range from 1 to several hundred thousand; with such a large range of variation, model stability is difficult to guarantee. With a higher-dimensional vector, say 20 dimensions, components taking only the values 0 and 1 are already enough to distinguish $2^{20} = 1,048,576$ (about a million) words. Smaller variations help ensure model stability.
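A quick sanity check of that counting argument (illustrative only: real word vectors are dense, real-valued and learned, not hand-assigned binary codes):

# Illustrative only: 20 components restricted to {0, 1} already index 2**20 distinct words.
def binary_code(word_id, dims=20):
    """Map an integer word ID to a list of 0/1 components."""
    return [(word_id >> k) & 1 for k in range(dims)]

print(2 ** 20)         # 1048576 possible codes
print(binary_code(5))  # [1, 0, 1, 0, 0, ...] -- every component stays within a tiny range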
Having said all that, we haven't yet reached the core point. Now that we have the idea, the question is: how do we place these words into the correct high-dimensional vectors? And importantly, how do we do this without a linguistic background? (In other words, if I want to process English tasks, I shouldn't need to learn English first; I only need to collect a large number of English articles. How convenient!) We won't expand more on the theoretical principles here; instead, we introduce a famous open-source tool from Google based on this idea—Word2Vec.
Simply put, Word2Vec does exactly what we want—it represents words using high-dimensional vectors (Word Embeddings) and puts words with similar meanings in similar positions, using real-valued vectors (not limited to integers). We only need a large corpus of a certain language to train the model and obtain word vectors. The benefits of word vectors have been mentioned; essentially, they were created to solve the problems discussed earlier. Other benefits include: word vectors can easily be used for clustering, and Euclidean distance or cosine similarity can be used to find words with similar meanings. This essentially solves the "synonym" problem (unfortunately, there doesn't seem to be a good way yet to solve the "polysemy" problem).
Regarding the mathematical principles of Word2Vec, readers can refer to this series of articles. For the implementation, Google provides the official C source code, which readers can compile themselves. The Python Gensim library also provides Word2Vec as a sub-library (in fact, this version seems more powerful than the official one).
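As a minimal sketch of the Gensim route (the toy corpus and parameter values here are purely illustrative, and parameter names vary across Gensim versions; older releases use size instead of vector_size):

from gensim.models import Word2Vec

# In practice this would be a large corpus, each sentence already segmented into words.
sentences = [['我', '爱', '你'], ['我', '恨', '你'], ['他', '爱', '她']]

w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

vec = w2v.wv['爱']                   # the 100-dimensional vector for a word
similar = w2v.wv.most_similar('爱')  # nearest words by cosine similarity
print(vec.shape, similar[:3])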
The next problem to solve: we have segmented the text into words and converted the words into high-dimensional vectors, so a sentence corresponds to a set of word vectors, i.e. a matrix, much as a digitized image in image processing corresponds to a pixel matrix. However, model inputs generally only accept one-dimensional features. What should we do? One simple idea is to flatten the matrix, i.e. concatenate the word vectors one after another into a single longer vector. This idea works in principle, but it produces inputs of several thousand or even tens of thousands of dimensions, which is hard to handle in practice. (If tens of thousands of dimensions seem manageable for today's computers, consider that for a $1000 \times 1000$ image the flattened input would reach 1 million dimensions!)
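To see the numbers, here is an illustrative sketch using a 50-word sentence with 256-dimensional word vectors, the sizes used later in this article:

import numpy as np

sentence_matrix = np.zeros((50, 256))    # 50 word vectors, each 256-dimensional
flattened = sentence_matrix.reshape(-1)  # concatenate them into one long vector
print(flattened.shape)                   # (12800,) -- already a very wide input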
In fact, image processing already has a mature tool for this: Convolutional Neural Networks (CNNs), a type of neural network designed specifically for matrix inputs, able to encode a matrix into a lower-dimensional one-dimensional vector while retaining most of the useful information. The CNN approach can be applied directly to NLP, and in particular to text sentiment classification, with good results; see, for example, "Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts". However, sentences do not work on the same principles as images. Transplanting the image approach to language, although somewhat successful, still feels slightly out of place, and it is therefore not the mainstream method in NLP.
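For reference, a minimal Keras sketch of this convolutional route (not the model from the paper above; layer names follow recent Keras releases and the sizes are illustrative):

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size, embed_dim = 50000, 256  # illustrative sizes

cnn = Sequential()
cnn.add(Embedding(vocab_size, embed_dim))                      # word IDs -> word vectors
cnn.add(Conv1D(filters=64, kernel_size=3, activation='relu'))  # local, n-gram-like features
cnn.add(GlobalMaxPooling1D())                                  # collapse to a fixed-length vector
cnn.add(Dense(1, activation='sigmoid'))                        # positive / negative
cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])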
In NLP, the usual choice is instead a Recursive Neural Network or a Recurrent Neural Network (both abbreviated RNN). Their role is the same as that of CNNs: encode a matrix input into a lower-dimensional one-dimensional vector while retaining most of the useful information. The difference is that CNNs emphasize global, fuzzy perception (as when looking at a photo: we do not see every pixel clearly, but we grasp the overall content), whereas RNNs emphasize integrating information from neighboring positions. It is clear that for language tasks the RNN view is the more convincing one: language is built up from adjacent characters forming words, adjacent words forming phrases, and adjacent phrases forming sentences, so information from neighboring positions needs to be combined or reconstructed effectively.
As for model variants, they are truly endless. Within the RNN family alone there are many, such as plain RNNs as well as GRU, LSTM, and so on. Readers can refer to the Keras official documentation: http://keras.io/models/. Keras is a deep learning library for Python that provides a large number of deep learning models; its official documentation serves both as a help tutorial and as a list of models, and it implements essentially all of the currently popular deep learning models.
After so much talk, it’s time to do some real work. Now we build a deep learning model for text sentiment classification based on LSTM (Long Short-Term Memory), with the structure shown below:
LSTM for Sentiment Classification
The model structure is simple, nothing complicated, and implementation is easy using Keras, which has already implemented the algorithms for us.
Now let's talk about two interesting steps.
The first step is collecting a labeled corpus. Note that our model is supervised (or at least semi-supervised), so we need sentences that have already been classified; as for quantity, the more the better. For Chinese text sentiment classification this step is not easy, as Chinese material is often scarce. While building the model, I pieced together more than 20,000 labeled Chinese sentences (covering six domains) from various channels (searching and downloading online, purchasing from Datatang, and so on) to train the model; they are shared at the end of this article.
Training Corpora
The second step is choosing the model's threshold. The model's prediction is in fact a continuous real number in the $[0, 1]$ interval. By default, 0.5 is used as the threshold: results greater than 0.5 are judged positive and results less than 0.5 negative. This default is not always the best choice. As shown below, while studying how different thresholds affect the true positive rate and the true negative rate, I found an abrupt change in the curve within the interval $(0.391, 0.394)$.
Threshold Selection
Although in absolute terms the value only drops from 0.99 to 0.97, a small change, the rate of change is very large; normally such curves vary smoothly. An abrupt change implies that something unusual is going on, and the cause of the anomaly is hard for us to pin down. In other words, this is an unstable region, and predictions falling inside it are not actually reliable. To be safe, we therefore discard this interval: only results greater than 0.394 are treated as positive, only results less than 0.391 as negative, and results between 0.391 and 0.394 are marked as "undetermined." Experiments show that this practice helps improve the model's accuracy in application.
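A small sketch of this decision rule (the function name is mine; it simply applies the two thresholds found above to the model's continuous outputs):

import numpy as np

def label_with_thresholds(scores, low=0.391, high=0.394):
    """Map continuous scores in [0, 1] to labels, leaving the unstable band undetermined."""
    scores = np.asarray(scores)
    labels = np.full(scores.shape, 'undetermined', dtype=object)
    labels[scores > high] = 'positive'
    labels[scores < low] = 'negative'
    return labels

print(label_with_thresholds([0.97, 0.392, 0.12]))
# ['positive' 'undetermined' 'negative']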
This article is long and provides a rough introduction to the ideas and practical applications of deep learning in text sentiment classification. Many things were discussed broadly. I am not intending to write a deep learning tutorial; I only want to point out the key points, at least those I consider critical. There are many good tutorials on deep learning. It's best to read English papers. A relatively good Chinese source is the blog http://blog.csdn.net/itplus. I won't make a fool of myself in that regard.
Below are my corpora and code. Readers might wonder why I share these "private collections." It's simple: because I don't work in this industry. Data mining is just a hobby for me—a hobby combining math and Python. Therefore, I don't have to worry about anyone getting ahead of me in this area.
Corpus Download: sentiment.zip
Collected review data: sum.zip
Code for building LSTM for text sentiment classification:
from __future__ import absolute_import  # __future__ imports must come before any other code
from __future__ import print_function
import pandas as pd  # Import Pandas
import numpy as np  # Import Numpy
import jieba  # Import Jieba segmentation
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, LSTM, GRU
neg = pd.read_excel('neg.xls', header=None)
pos = pd.read_excel('pos.xls', header=None)  # Finished reading the training corpora
pos['mark']=1
neg['mark']=0 # Labelling training corpora
pn=pd.concat([pos,neg],ignore_index=True) # Merge corpora
neglen=len(neg)
poslen=len(pos) # Count corpora size
cw = lambda x: list(jieba.cut(x)) # Define segmentation function
pn['words'] = pn[0].apply(cw)
comment = pd.read_excel('sum.xls') # Read review content
#comment = pd.read_csv('a.csv', encoding='utf-8')
comment = comment[comment['rateContent'].notnull()] # Only read non-empty reviews
comment['words'] = comment['rateContent'].apply(cw) # Review segmentation
d2v_train = pd.concat([pn['words'], comment['words']], ignore_index = True)
w = [] # Integrate all words together
for i in d2v_train:
w.extend(i)
word_dict = pd.DataFrame(pd.Series(w).value_counts())  # Count word frequencies
del w, d2v_train
word_dict['id'] = list(range(1, len(word_dict) + 1))  # Assign IDs 1..N by frequency; 0 is reserved for padding
get_sent = lambda x: list(word_dict['id'][x])  # Map a word list to its ID sequence
pn['sent'] = pn['words'].apply(get_sent)  # Speed is quite slow
maxlen = 50  # Pad/truncate every sentence to 50 words
print("Pad sequences (samples x time)")
pn['sent'] = list(sequence.pad_sequences(pn['sent'], maxlen=maxlen))
x = np.array(list(pn['sent']))[::2] # Training set
y = np.array(list(pn['mark']))[::2]
xt = np.array(list(pn['sent']))[1::2] # Test set
yt = np.array(list(pn['mark']))[1::2]
xa = np.array(list(pn['sent'])) # Full set
ya = np.array(list(pn['mark']))
print('Build model...')
model = Sequential()
model.add(Embedding(len(word_dict) + 1, 256))  # +1 because ID 0 is used for padding
model.add(LSTM(128)) # try using a GRU instead, for fun
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x, y, batch_size=16, epochs=10)  # Training takes several hours
proba = model.predict(xt).ravel()  # Continuous scores in [0, 1]
classes = (proba > 0.5).astype('int32')  # Default 0.5 threshold; see the discussion above
acc = np.mean(classes == yt)
print('Test accuracy:', acc)
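As a usage sketch (not part of the original script), assuming the variables above (model, word_dict, maxlen) are still in scope, a new review could be scored like this:

def predict_sentiment(text):
    words = list(jieba.cut(text))  # Segment the new review
    ids = [word_dict['id'].get(wd, 0) for wd in words]  # Unknown words fall back to the padding ID 0
    seq = sequence.pad_sequences([ids], maxlen=maxlen)  # Pad/truncate to the training length
    return float(model.predict(seq)[0, 0])  # Continuous score in [0, 1]

print(predict_sentiment('东西很好，非常满意'))  # A hypothetical positive review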