By 苏剑林 | March 30, 2017
Text sentiment classification is essentially a binary classification problem. In fact, classification models often suffer from a common issue: the optimization objective is inconsistent with the evaluation metrics. Generally, for classification (including multi-class), we use cross-entropy as the loss function, which originates from Maximum Likelihood Estimation (refer to "Gradient Descent and EM Algorithm: From the Same Origin"). However, our final evaluation goal is not to see how small the cross-entropy is, but to look at the model's accuracy. Usually, a small cross-entropy leads to high accuracy, but this relationship is not absolute.
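To see the gap concretely, here is a toy NumPy sketch (my own illustration, not part of the experiments below): model A attains a lower cross-entropy than model B, yet also a lower accuracy.

import numpy as np

# Four positive samples
y_true = np.ones(4)

# Model A: very confident on three samples, narrowly wrong on the fourth
p_a = np.array([0.99, 0.99, 0.99, 0.45])
# Model B: barely right on all four samples
p_b = np.array([0.55, 0.55, 0.55, 0.55])

ce = lambda p: -np.mean(np.log(p))  # cross-entropy (all labels are 1)
acc = lambda p: (p > 0.5).mean()    # accuracy at threshold 0.5

print(ce(p_a), acc(p_a))  # ~0.21, 0.75 -> lower loss, lower accuracy
print(ce(p_b), acc(p_b))  # ~0.60, 1.00 -> higher loss, perfect accuracy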
Aiming for Average Doesn't Necessarily Mean Being Top-Tier
A more common example is this: A math teacher is working hard to improve the students' average score, but the final assessment metric is the passing rate (passing at 60 points). If the average score is 100 points (implying everyone scored 100), then naturally the passing rate is 100%, which is ideal. But reality isn't always so perfect. As long as the average score hasn't reached 100, a higher average doesn't necessarily mean a higher passing rate. For example, if two students score 40 and 90 respectively, the average is 65, but the passing rate is only 50%. If both students score 60, the average is 60, but the passing rate is 100%. This means that while the average can serve as a target, it does not directly align with the final assessment goal.
So, to improve the final assessment metric, what should the teacher do? Obviously, they should first identify which students have already passed; they don't need to worry about them for now. Instead, they should focus on providing extra tutoring for the students who failed. In principle, this allows many failing students to reach 60 points. While some students who previously passed might slip below 60, this process can be iterated until everyone is above 60. Of course, the final average score might not be very high, but there's no choice—that's how the assessment works.
A Better Update Scheme
For binary classification models, we hope the model outputs 1 for positive samples and 0 for negative samples, but due to limitations in model fitting capacity, this is generally impossible. In practice, during prediction, we consider outputs greater than 0.5 as positive and less than 0.5 as negative. This implies we can update the model "selectively." For example, we could set a threshold of 0.6. If the model's output for a positive sample is already greater than 0.6, I won't update the model based on that sample. If the output for a negative sample is less than 0.4, I won't update the model for that either. Only samples falling within the 0.4 to 0.6 range trigger updates. This way, the model "concentrates its energy" on ambiguous samples, which leads to better classification results—consistent with the core idea of SVMs.
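To make the selection rule concrete, here is a small NumPy sketch (my own illustration, separate from the training code below) that marks which samples would still trigger updates with thresholds 0.6 and 0.4:

import numpy as np

y_true = np.array([1, 1, 0, 0])
y_pred = np.array([0.70, 0.55, 0.35, 0.45])

update_pos = (y_true == 1) & (y_pred < 0.6)  # positives not yet above 0.6
update_neg = (y_true == 0) & (y_pred > 0.4)  # negatives not yet below 0.4
print(update_pos | update_neg)  # [False  True False  True]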
Furthermore, this approach theoretically helps prevent overfitting. It prevents the model from "obsessively" fitting easy samples just to lower the loss function. It’s like a teacher who only cares about top students, hoping they improve from 80 to 90 points, without finding ways to improve the grades of struggling students. That clearly isn't the mark of a good teacher.
Modified Cross-Entropy Loss
How can we achieve the goal described above? It's simple: adjust the loss function. This primarily draws inspiration from hinge loss and triplet loss. The standard cross-entropy loss function is:
\[L_{old} = -\sum_y y_{true} \log y_{pred}\]
Here the sum runs over the two classes: for binary classification it expands to the familiar $-[y_{true}\log y_{pred} + (1-y_{true})\log(1-y_{pred})]$, which is exactly the form implemented in the code below.
Choose a threshold $m=0.6$ (in principle, any value greater than 0.5 works). Introduce the unit step function $\theta(x)$:
\[\theta(x) = \left\{\begin{aligned}&1, x > 0\\ &\frac{1}{2}, x = 0\\ &0, x < 0\end{aligned}\right.\]
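This step function can be written with the sign function, which is how the Keras code below defines it; a quick NumPy check:

import numpy as np

# theta(x) = (sign(x) + 1) / 2 reproduces the three cases above
theta = lambda x: (np.sign(x) + 1.) / 2.
print(theta(np.array([-0.3, 0.0, 0.3])))  # [0.  0.5 1. ]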
Now, consider a new loss function:
\[L_{new} = -\sum_y \lambda(y_{true}, y_{pred}) y_{true}\log y_{pred}\]
where
\[\lambda(y_{true}, y_{pred}) = 1-\theta(y_{true}-m)\theta(y_{pred}-m)-\theta(1-m-y_{true})\theta(1-m-y_{pred})\]
$L_{new}$ adds a correction factor $\lambda(y_{true}, y_{pred})$ to the cross-entropy. What does this term represent? When a positive sample is processed, $y_{true}=1$, and clearly:
\[\lambda(1, y_{pred})=1-\theta(y_{pred}-m)\]
In this case, if $y_{pred} > m$, then $\lambda(1, y_{pred})=0$, and the cross-entropy automatically becomes 0 (reaching its minimum). Conversely, if $y_{pred} < m$, then $\lambda(1, y_{pred})=1$, and the cross-entropy is maintained. In other words, if a positive sample's output is already greater than $m$, it stops updating (as it has reached the minimum and the gradient can be considered 0); it only continues updating if it's less than $m$. A similar analysis applies to negative samples: if the output is already less than $1-m$, the update stops; it only continues if it's greater than $1-m$.
Thus, simply by replacing the original cross-entropy loss with the modified cross-entropy $L_{new}$, we can achieve our design goal.
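Before moving on, here is a quick numeric check of the correction factor (plain NumPy, with $m=0.6$):

import numpy as np

theta = lambda x: (np.sign(x) + 1.) / 2.
m = 0.6

def lam(y_true, y_pred):
    return (1 - theta(y_true - m) * theta(y_pred - m)
              - theta(1 - m - y_true) * theta(1 - m - y_pred))

# Positive samples: the factor vanishes once y_pred > m
print(lam(1, 0.7), lam(1, 0.5))  # 0.0 1.0
# Negative samples: the factor vanishes once y_pred < 1 - m
print(lam(0, 0.3), lam(0, 0.5))  # 0.0 1.0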
Experimental Testing Based on IMDB
The theory sounds great, but does it work as well as imagined in practice? Let's experiment immediately.
To make the results more comparable, I chose a standard task in text sentiment classification: IMDB movie review classification. The tool used is the latest version of Keras (2.0). Most of the code can be found in the Keras examples, including LSTM and CNN versions.
First, the LSTM version:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.datasets import imdb
from keras import backend as K

margin = 0.6
# Unit step function theta(t) = (sign(t) + 1) / 2, with theta(0) = 0.5
theta = lambda t: (K.sign(t) + 1.) / 2.

max_features = 20000
maxlen = 80
batch_size = 32

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

def loss(y_true, y_pred):
    # Binary cross-entropy multiplied by lambda(y_true, y_pred): the factor
    # is 0 for positive samples already above margin and for negative samples
    # already below 1 - margin, so those samples stop driving updates
    return - (1 - theta(y_true - margin) * theta(y_pred - margin)
                - theta(1 - margin - y_true) * theta(1 - margin - y_pred)
             ) * (y_true * K.log(y_pred + 1e-8)
                  + (1 - y_true) * K.log(1 - y_pred + 1e-8))

model.compile(loss=loss,
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
The code is basically copied from the official examples. After execution, the model achieved a training accuracy of 99.01% and a test accuracy of 82.26%. If the loss is changed directly to binary_crossentropy (leaving everything else the same), the results are 99.56% training accuracy and 81.02% test accuracy. This shows that the new loss function indeed helps prevent overfitting and improves accuracy. While there might be some random error, the average results across multiple runs show that the new loss function brings about a 0.5% to 1% improvement in accuracy (naturally, you cannot expect a revolutionary leap just by slightly modifying the loss function).
Now, let's look at the CNN version:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, Dense, Dropout, Activation
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.datasets import imdb
from keras import backend as K

margin = 0.6
# Unit step function, as in the LSTM version
theta = lambda t: (K.sign(t) + 1.) / 2.

max_features = 5000
maxlen = 400
batch_size = 32
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 10

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))
model.add(Dropout(0.2))
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))

def loss(y_true, y_pred):
    # Same modified cross-entropy as in the LSTM version
    return - (1 - theta(y_true - margin) * theta(y_pred - margin)
                - theta(1 - margin - y_true) * theta(1 - margin - y_pred)
             ) * (y_true * K.log(y_pred + 1e-8)
                  + (1 - y_true) * K.log(1 - y_pred + 1e-8))

model.compile(loss=loss,
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))
After execution, the model achieved 98.66% training accuracy and 88.24% test accuracy. The results for pure binary_crossentropy were 98.90% training accuracy and 88.14% test accuracy, which are basically consistent within the range of fluctuation. However, during the training process, the test results using the new loss function remained stable around 88.2%, whereas with cross-entropy, they would jump to 89%, down to 87%, and back to 88%. This means that although the final accuracies were similar, the fluctuations were much larger with cross-entropy. We have reason to believe that models trained with the new loss function have better generalization capabilities.
In Short
This article primarily draws on the ideas of hinge loss and triplet loss to adjust the cross-entropy loss used in binary classification, making it more effective at fitting samples that are incorrectly predicted. Experiments also show that, in a certain sense, the new loss function can indeed bring a small improvement.
Furthermore, this logic can actually be applied to multi-class classification or even regression problems. I won't go into detail here, but I will share further analysis as I encounter specific cases.