Notes on an Experiment in Semi-supervised Sentiment Analysis

By 苏剑林 | May 04, 2017

This article is a not-so-successful attempt at semi-supervised learning: on the IMDB dataset, a text sentiment classification model was trained using 1,000 randomly selected labeled samples, and achieved a test accuracy of 73.48% on the remaining 49,000 test samples.

Idea

The idea in this article originates from this OpenAI post: "OpenAI Research Finds Unsupervised Sentiment Neuron: Can Directly Control Sentiment of Generated Text".

That article introduced a method for training unsupervised (actually semi-supervised) sentiment classification models with excellent experimental results. However, the experiments in that article were massive and nearly impossible for an individual to replicate (training for one month on 4 Pascal GPUs). Nevertheless, the underlying idea is simple, so we can create a "budget-friendly version." The logic is as follows:

When we use deep learning for sentiment classification, a conventional approach is an Embedding layer + LSTM layer + Dense layer (Sigmoid activation). What we usually call word vectors are essentially a pre-trained Embedding layer (this layer has the most parameters and is most prone to overfitting). OpenAI's idea is: why not pre-train the LSTM layer as well? The pre-training method also uses a language model. Of course, to ensure the pre-training results don't lose sentiment information, the number of hidden nodes in the LSTM needs to be larger.
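
For reference, here is a minimal sketch of that conventional fully supervised baseline (the layer sizes are placeholders for illustration; the model actually built below replaces this randomly initialized encoder with a pre-trained one):

# A minimal fully supervised baseline: Embedding + LSTM + Dense (sigmoid).
# Hyperparameters here are illustrative only.
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense

inp = Input(shape=(100,), dtype='int32')       # word-id sequences of length 100
h = Embedding(10000, 100)(inp)                 # word vectors: the most parameter-heavy part
h = LSTM(128)(h)                               # sentence encoding
out = Dense(1, activation='sigmoid')(h)        # binary sentiment output
baseline = Model(inputs=inp, outputs=out)
baseline.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])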

If even the LSTM layer is pre-trained, the remaining parameters in the Dense layer are few, so the model can be fully trained using a small number of labeled samples. This is the entire strategy for semi-supervised learning. As for the "sentiment neuron" mentioned in the OpenAI article, that is merely a figurative description.

Admittedly, from the perspective of sentiment analysis tasks, the 73.48% accuracy in this article is hardly impressive; a standard "dictionary + rules" solution can easily achieve over 80% accuracy. I am merely verifying the feasibility of this experimental scheme. I believe that if the scale could match OpenAI's, the results would be much better. Furthermore, what this article aims to describe is a modeling strategy, not limited to sentiment analysis; the same idea can be applied to any binary or even multi-classification problem.

Process

First, load the dataset and re-partition the training and test sets:

from keras.preprocessing import sequence
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, Dropout
from keras.datasets import imdb
from keras import backend as K
import numpy as np

max_features = 10000 # Keep top max_features words
maxlen = 100 # Pad/truncate to 100 words
batch_size = 1000
nb_grams = 10 # Train a 10-gram language model
nb_train = 1000 # Number of training samples

# Load the built-in IMDB dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_lm_ = np.append(x_train, x_test)

# Construct data for training the language model
# Only existing data is used here; in a real environment, 
# additional data can be supplemented for more thorough training.
x_lm = []
y_lm = []
for x in x_lm_:
    for i in range(len(x)):
        x_lm.append([0]*(nb_grams - i + max(0,i-nb_grams))+x[max(0,i-nb_grams):i])
        y_lm.append([x[i]])

x_lm = np.array(x_lm)
y_lm = np.array(y_lm)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
x = np.vstack([x_train, x_test])
y = np.hstack([y_train, y_test])

# Re-partition training and test sets
# Merge original train/test sets, randomly pick 1000 samples as 
# the new training set, and the rest as the test set.
idx = np.arange(len(x))
np.random.shuffle(idx)
x_train = x[idx[:nb_train]]
y_train = y[idx[:nb_train]]
x_test = x[idx[nb_train:]]
y_test = y[idx[nb_train:]]
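
As a quick illustration of the window construction above (a toy sketch using nb_grams = 3 instead of 10 for brevity), each target word is predicted from its zero-padded previous nb_grams words:

# Toy check of the n-gram window construction (nb_grams = 3 here for brevity).
nb_grams_demo = 3
x_demo = [5, 8, 2, 7]
for i in range(len(x_demo)):
    context = [0]*(nb_grams_demo - i + max(0, i - nb_grams_demo)) + x_demo[max(0, i - nb_grams_demo):i]
    print(context, '->', x_demo[i])
# prints:
# [0, 0, 0] -> 5
# [0, 0, 5] -> 8
# [0, 5, 8] -> 2
# [5, 8, 2] -> 7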

Then building the model:

embedded_size = 100 # Word vector dimension
hidden_size = 1000 # LSTM dimension, can be viewed as the encoded sentence vector dimension.

# Encoder part
inputs = Input(shape=(None,), dtype='int32')
embedded = Embedding(max_features, embedded_size)(inputs)
lstm = LSTM(hidden_size)(embedded)
encoder = Model(inputs=inputs, outputs=lstm)

# Train the encoder part entirely using the n-gram model
input_grams = Input(shape=(nb_grams,), dtype='int32')
encoded_grams = encoder(input_grams)
softmax = Dense(max_features, activation='softmax')(encoded_grams)
lm = Model(inputs=input_grams, outputs=softmax)
lm.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# Using sparse cross-entropy avoids pre-converting categories to one-hot form.

# Sentiment analysis part
# Freeze the encoder and attach a small classifier head: a 10-unit ReLU layer
# with Dropout, followed by a sigmoid output.
# Only about 10,000 parameters are trained (1000*10 + 10 + 10 + 1 = 10,021),
# so theoretically a small number of labeled samples should suffice.
for layer in encoder.layers:
    layer.trainable=False

sentence = Input(shape=(maxlen,), dtype='int32')
encoded_sentence = encoder(sentence)
hidden = Dense(10, activation='relu')(encoded_sentence)
hidden = Dropout(0.5)(hidden)
output = Dense(1, activation='sigmoid')(hidden)
model = Model(inputs=sentence, outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
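
As a quick sanity check (a sketch, not part of the original script), we can confirm that only the small classifier head remains trainable once the encoder is frozen:

# Count the trainable parameters; with hidden_size = 1000 this should be
# 1000*10 + 10 (Dense(10)) + 10 + 1 (Dense(1)) = 10021.
model.summary()
n_trainable = int(np.sum([K.count_params(w) for w in model.trainable_weights]))
print('trainable parameters:', n_trainable)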

Next, training the language model; this part is quite time-consuming:

# Training the language model is time-consuming; usually 2 or 3 epochs are enough
lm.fit(x_lm, y_lm,
       batch_size=batch_size,
       epochs=3)

The training results for the language model were:

Epoch 1/3
11737946/11737946 [==============================] - 2400s - loss: 5.0376
Epoch 2/3
11737946/11737946 [==============================] - 2404s - loss: 4.5587
Epoch 3/3
11737946/11737946 [==============================] - 2404s - loss: 4.3968

Then, we began training the sentiment analysis model with 1,000 samples. Since the pre-training was already done, there weren't many parameters to train. Combined with Dropout, 1,000 samples did not lead to severe overfitting.

# Training the sentiment analysis model
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=200)

The training results were:

Epoch 198/200
1000/1000 [==============================] - 0s - loss: 0.2481 - acc: 0.9250
Epoch 199/200
1000/1000 [==============================] - 0s - loss: 0.2376 - acc: 0.9330
Epoch 200/200
1000/1000 [==============================] - 0s - loss: 0.2386 - acc: 0.9350

Now, let's evaluate the model:

# Evaluate the model's performance
model.evaluate(x_test, y_test, verbose=True, batch_size=batch_size)

The accuracy was 73.04%, which is mediocre. Let's try some transfer learning, which here amounts to self-training on pseudo-labels, by retraining the model on the training set combined with its own predictions on the test set:

# Retrain the model using the training set plus the test set's 
# predicted results (which may contain errors).
y_pred = model.predict(x_test, verbose=True, batch_size=batch_size)
y_pred = (y_pred.reshape(-1) > 0.5).astype(int)
xt = np.vstack([x_train, x_test])
yt = np.hstack([y_train, y_pred])

model.fit(xt, yt,
          batch_size=batch_size,
          epochs=10)

# Evaluate the model's performance again
model.evaluate(x_test, y_test, verbose=True, batch_size=batch_size)

The training results were:

Epoch 8/10
50000/50000 [==============================] - 27s - loss: 0.1455 - acc: 0.9561
Epoch 9/10
50000/50000 [==============================] - 27s - loss: 0.1390 - acc: 0.9590
Epoch 10/10
50000/50000 [==============================] - 27s - loss: 0.1349 - acc: 0.9600

This time we obtained 73.33% accuracy. It's not hard to see that this process can be iterated. Repeating it once more yielded 73.33% accuracy, the second time 73.47%... As expected, it converges to a stable value. I repeated it 5 more times, and it stabilized at 73.48%.
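
The iteration just described can be written as a short loop (a sketch of the procedure; the number of rounds here is arbitrary):

# Iterated self-training: relabel the test set with the current model,
# retrain on the combined data, then repeat.
for it in range(5):
    y_pred = (model.predict(x_test, batch_size=batch_size).reshape(-1) > 0.5).astype(int)
    xt = np.vstack([x_train, x_test])
    yt = np.hstack([y_train, y_pred])
    model.fit(xt, yt, batch_size=batch_size, epochs=10)
    print(model.evaluate(x_test, y_test, batch_size=batch_size))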

From the initial 73.04% to 73.48% after transfer learning, there is an improvement of about 0.44%. It doesn't look like much, but for students participating in competitions or writing papers, a 0.44% improvement is something worth writing about.

Review

As mentioned at the beginning of the article, this was a somewhat unsuccessful attempt (it is the budget-friendly version, after all), so please don't get too hung up on the low accuracy. Still, the experimental results suggest the scheme is viable: training a language model on a large amount of unlabeled sentiment text does extract useful text features, much as autoencoder pre-training does for images.

This implementation was kept simple, with no hyperparameter tuning. Potential improvements include increasing the scale of the language model, adding more sentiment corpora (only the reviews themselves are needed, no labels), and refining the training details. I will leave those for another time.