By 苏剑林 | April 25, 2020
(Note: The relevant content of this article has been organized into a paper "ZLPR: A Novel Loss for Multi-label Classification". If you need to cite this, you can cite the English paper directly. Thank you.)
Generally speaking, when dealing with conventional multi-class classification problems, we use a fully connected layer at the end of the model to output scores for each class, then use softmax activation and cross-entropy as the loss function. In this article, we attempt to extend the "Softmax + Cross Entropy" scheme to multi-label classification scenarios, hoping to obtain a loss function for multi-label classification tasks that does not require special adjustment of class weights or thresholds.
Category Imbalance: From Single-label to Multi-label
Generally, multi-class classification refers to single-label classification, i.e., selecting $1$ target category from $n$ candidate categories. Assuming the scores for each class are $s_1, s_2, \dots, s_n$ and the target class is $t \in \{1, 2, \dots, n\}$, the loss used is:
\begin{equation}-\log \frac{e^{s_t}}{\sum\limits_{i=1}^n e^{s_i}}= - s_t + \log \sum\limits_{i=1}^n e^{s_i}\label{eq:log-softmax}\end{equation}
The optimization direction of this loss is to make the score of the target class $s_t$ the maximum among $s_1, s_2, \dots, s_n$. Regarding softmax-related content, you can also refer to articles like "Seeking a Smooth Maximum Function" and "Chat on Function Smoothing: Differentiable Approximation of Non-differentiable Functions".
Now we move to the multi-label classification problem, i.e., selecting $k$ target categories from $n$ candidate categories. In this case, a naive approach is to use sigmoid activation, transforming it into $n$ binary classification problems, and using the sum of binary cross-entropy as the loss. Obviously, when $n \gg k$, this approach faces serious category imbalance problems, requiring balancing strategies such as manually adjusting the weights of positive and negative samples, Focal Loss, etc. After training is completed, it is also necessary to further determine the optimal threshold based on a validation set.
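To make this baseline concrete, here is a minimal Keras sketch of the "Sigmoid + Cross Entropy" approach being described (the function name is mine, for illustration only):

from keras import backend as K

# Naive baseline: treat each of the n classes as an independent sigmoid binary
# classification and sum the binary cross-entropies. When n >> k, the negative
# classes dominate this sum, which is exactly the imbalance problem described above.
def naive_sigmoid_bce(y_true, y_pred):
    # y_true: (batch, n) multi-hot labels; y_pred: raw scores (logits) of shape (batch, n)
    return K.sum(K.binary_crossentropy(y_true, y_pred, from_logits=True), axis=-1)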
At this point, a natural question arises: why does "choose $k$ from $n$" require so much more work than "choose $1$ from $n$"?
I find this rather unscientific. Intuitively, "choose $k$ from $n$" should simply be a natural extension of "choose $1$ from $n$" and should not require dramatically more effort; even if it is genuinely harder, the difficulty should increase gradually with $k$. Yet if we reduce it to $n$ binary classifications, "choose $1$ from $n$" would actually be the hardest case, since its category imbalance is the most extreme. The real reason single-label classification appears easier is that it has "Softmax + Cross Entropy" available, which does not suffer from category imbalance, whereas the "Sigmoid + Cross Entropy" used for multi-label classification does.
Therefore, the ideal solution should be to extend "Softmax + Cross Entropy" to multi-label classification.
A Long Search Finally Yields Results
To work out this extension, I went through several attempts, discarding one candidate after another, before settling on a relatively elegant scheme: constructing a combinatorial form of softmax as the extension of single-label softmax. In this part, we first assume $k$ is a fixed constant, then discuss how to determine $k$ automatically in the general case, eventually arriving at a workable extension.
Combinatorial Softmax
First, we consider a scenario where $k$ is a fixed constant, which means that during prediction, we directly output the $k$ classes with the highest scores. What about training? As a natural extension of softmax, we can consider the following as the loss:
\begin{equation}-\log \frac{e^{s_{t_1}+s_{t_2}+\dots+s_{t_k}}}{\sum\limits_{1\leq i_1 < i_2 < \cdots < i_k\leq n}e^{s_{i_1}+s_{i_2}+\dots+s_{i_k}}}=\log Z_k - (s_{t_1}+s_{t_2}+\dots+s_{t_k})\end{equation}
Where $t_1, t_2, \dots, t_k$ are the $k$ target labels, and $Z_k = \sum\limits_{1\leq i_1 < i_2 < \cdots < i_k\leq n}e^{s_{i_1}+s_{i_2}+\dots+s_{i_k}}$ is the partition function. Clearly, the above formula is a softmax constructed using the total score of any $k$ categories $s_{i_1}+s_{i_2}+\dots+s_{i_k}$ as the basic unit, so it is a reasonable extension of single-label softmax. Or it can be understood as still being a single-label classification problem, but it is a "choose $1$ from $C_n^k$" problem.
The difficult part of this scheme is the calculation of $Z_k$, which is a sum of $C_n^k$ exponential terms. However, we can use Newton's identities to compute it recursively. Let $S_k = \sum_{i=1}^n e^{k s_i}$; then:
\begin{equation}\begin{aligned}
Z_1 =&\, S_1\\
2Z_2 =&\, Z_1 S_1 - S_2\\
3Z_3 = &\, Z_2 S_1 - Z_1 S_2 + S_3\\
\vdots\\
k Z_k = &\, Z_{k-1} S_1 - Z_{k-2} S_2 + \dots + (-1)^{k-2} Z_1 S_{k-1} + (-1)^{k-1} S_k
\end{aligned}\end{equation}
Therefore, to calculate $Z_k$, we only need to calculate $k$ recursive steps, which can be computed within a reasonable amount of time. In the prediction stage, the $k$ classes with the highest scores are output directly.
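The recursion is easy to check numerically. Below is a minimal NumPy sketch (the function names are mine) that computes $Z_k$ via the identities above and compares it against the brute-force sum over all $C_n^k$ combinations; it works with raw exponentials for clarity, so it is not numerically stable for large scores or large $k$:

import numpy as np
from itertools import combinations

def Z_newton(s, k):
    # Z_0..Z_k via Newton's identities: j*Z_j = sum_{m=1..j} (-1)^(m-1) Z_{j-m} S_m, with Z_0 = 1
    S = [None] + [np.sum(np.exp(m * s)) for m in range(1, k + 1)]  # S_m = sum_i e^{m s_i}
    Z = [1.0]
    for j in range(1, k + 1):
        Z.append(sum((-1) ** (m - 1) * Z[j - m] * S[m] for m in range(1, j + 1)) / j)
    return Z[k]

def Z_brute(s, k):
    # Direct sum over all C(n, k) combinations, for checking only
    return sum(np.exp(s[list(c)].sum()) for c in combinations(range(len(s)), k))

s = np.random.randn(8)                  # scores for n = 8 classes
print(Z_newton(s, 3), Z_brute(s, 3))    # the two values should agree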
Automatically Determining the Threshold
What was discussed above is a multi-label classification problem with a fixed number of outputs, but the target label count in general multi-label classification is uncertain. To this end, we define a maximum target label count $K \geq k$ and add a class $0$ as a padding label. At this time, the loss becomes:
\begin{equation}\log \overline{Z}_K - (s_{t_1}+s_{t_2}+\dots+s_{t_k}+\underbrace{s_0+\dots+s_0}_{K-k \text{ terms}})\end{equation}
where
\begin{equation}\begin{aligned}
\overline{Z}_K =&\, \sum\limits_{1\leq i_1 < i_2 < \cdots < i_K\leq n}e^{s_{i_1}+s_{i_2}+\dots+s_{i_K}} + \sum\limits_{j=1}^{K}\,\sum\limits_{0 = i_1 = \cdots = i_j < i_{j+1} < \cdots < i_K\leq n}e^{s_{i_1}+s_{i_2}+\dots+s_{i_K}}\\
=&\, Z_K + e^{s_0} \overline{Z}_{K-1}
\end{aligned}\end{equation}
It looks complicated but is actually simple: the basic unit is still the total score of $K$ categories, except that class $0$, and only class $0$, is allowed to appear repeatedly. During prediction we still output the $K$ highest-scoring categories, allowing class $0$ to repeat. The equivalent effect is: using $s_0$ as a threshold, output only the categories whose scores exceed $s_0$. The last line shows that $\overline{Z}_K$ can also be computed recursively, so implementation is not difficult.
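Continuing the sketch above (reusing Z_newton and assuming $K \le n$), the recursion for $\overline{Z}_K$ adds only a few lines:

def Zbar(s, s0, K):
    # Partition function with the padding class 0 allowed to repeat:
    # Zbar_K = Z_K + e^{s0} * Zbar_{K-1}, with Zbar_0 = 1
    zbar = 1.0
    for j in range(1, K + 1):
        zbar = Z_newton(s, j) + np.exp(s0) * zbar
    return zbar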
Suddenly, Looking Back...
It seemed that the "long search" finally bore fruit: the theory was there, and the implementation was not difficult. The next step seemed to be conducting experiments to see the results. If the results were good, I could even consider publishing a paper! It looked like a bright future. However...
Fortunately or unfortunately, while verifying its effectiveness, I consulted some senior colleagues. At their prompting, I revisited the Circle Loss paper, which I had not read closely before, saw its unified loss form (formula (1) in that paper), and realized that it already contained a much simpler extension scheme.
The unfortunate part, then, is that a simpler extension already existed, so my long search lost much of its significance. The fortunate part is that I did find the better scheme in the end; otherwise, had I rushed to publish the previous scheme, only to discover it was neither as simple nor as effective as an existing one, that would have been quite embarrassing.
A Unified Loss Form
Let's look at the single-label classification cross-entropy $\eqref{eq:log-softmax}$ in another form:
\begin{equation}-\log \frac{e^{s_t}}{\sum\limits_{i=1}^n e^{s_i}}=-\log \frac{1}{\sum\limits_{i=1}^n e^{s_i-s_t}}=\log \sum\limits_{i=1}^n e^{s_i-s_t}=\log \left(1 + \sum\limits_{i=1,i\neq t}^n e^{s_i-s_t}\right)\end{equation}
Why is this loss effective? From the articles "Seeking a Smooth Maximum Function" and "Chat on Function Smoothing", we know that $\text{logsumexp}$ is actually a smooth approximation of $\max$. So we have:
\begin{equation}\log \left(1 + \sum\limits_{i=1,i\neq t}^n e^{s_i-s_t}\right)\approx \max\begin{pmatrix}0 \\ s_1 - s_t \\ \vdots \\ s_{t-1} - s_t \\ s_{t+1} - s_t \\ \vdots \\ s_n - s_t\end{pmatrix}\end{equation}
The feature of this loss is that every non-target score in $\{s_1, \dots, s_{t-1}, s_{t+1}, \dots, s_n\}$ is compared pairwise with the target score $s_t$, and the loss pushes the maximum of the differences $s_i - s_t$ as far below zero as possible, thereby achieving the effect that "the target category score is greater than every non-target category score."
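As a quick numeric illustration (the scores below are arbitrary), the loss is a smooth upper bound on $\max(0, \max_i (s_i - s_t))$ and shrinks as all the differences move below zero:

import numpy as np

s = np.array([1.2, 3.5, 0.7, 2.9, -0.4])   # arbitrary scores, n = 5
t = 1                                       # target class index
diffs = np.delete(s, t) - s[t]              # s_i - s_t for all i != t
loss = np.log1p(np.exp(diffs).sum())        # log(1 + sum_i e^{s_i - s_t})
print(loss, max(0.0, diffs.max()))          # smooth bound vs. the hard max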
So, if there is a multi-label classification scenario with multiple target categories, we also hope that "each target category score is not less than each non-target category score." Thus, the following form of loss becomes obvious:
\begin{equation}\log \left(1 + \sum\limits_{i\in\Omega_{neg},j\in\Omega_{pos}} e^{s_i-s_j}\right)=\log \left(1 + \sum\limits_{i\in\Omega_{neg}} e^{s_i}\sum\limits_{j\in\Omega_{pos}} e^{-s_j}\right)\label{eq:unified}\end{equation}
Where $\Omega_{pos}, \Omega_{neg}$ are the sets of positive and negative category indices for the sample, respectively. The form of this loss is easy to understand: if we want $s_i < s_j$, we add $e^{s_i - s_j}$ to the $\log$. If we add a scaling factor $\gamma$ and a margin $m$, we get the unified form from the Circle Loss paper:
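The pairwise form and the factored form on the right are the same quantity, which is easy to verify numerically (the index sets below are arbitrary):

import numpy as np

s = np.random.randn(6)                      # scores for n = 6 classes
pos, neg = [0, 3], [1, 2, 4, 5]             # Omega_pos and Omega_neg
pairwise = np.log1p(sum(np.exp(s[i] - s[j]) for i in neg for j in pos))
factored = np.log1p(np.exp(s[neg]).sum() * np.exp(-s[pos]).sum())
print(pairwise, factored)                   # equal up to floating-point error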
\begin{equation}\log \left(1 + \sum\limits_{i\in\Omega_{neg},j\in\Omega_{pos}} e^{\gamma(s_i-s_j + m)}\right)=\log \left(1 + \sum\limits_{i\in\Omega_{neg}} e^{\gamma (s_i + m)}\sum\limits_{j\in\Omega_{pos}} e^{-\gamma s_j}\right)\end{equation}
As a side note, the above is formula (1) in the Circle Loss paper, but formula (1) is not actually called Circle Loss; formula (4) is. However, I believe formula (1) is the most interesting part of the entire paper.
Application to Multi-label Classification
The parameters $\gamma$ and $m$ are mainly of interest in metric learning, so here we only care about formula $\eqref{eq:unified}$. If $k$ is fixed in "choose $k$ from $n$" multi-label classification, formula $\eqref{eq:unified}$ can be used directly as the loss, and the $k$ highest-scoring categories are output during prediction.
For multi-label classification where $k$ is not fixed, we need a threshold to determine which categories to output. To this end, we similarly introduce an additional class $0$. We hope target category scores are all greater than $s_0$ and non-target category scores are all less than $s_0$. As mentioned before, "if we hope $s_i < s_j$, we add $e^{s_i - s_j}$ to the $\log$," so formula $\eqref{eq:unified}$ now becomes:
\begin{equation}\begin{aligned}
&\log \left(1 + \sum\limits_{i\in\Omega_{neg},j\in\Omega_{pos}} e^{s_i-s_j}+\sum\limits_{i\in\Omega_{neg}} e^{s_i-s_0}+\sum\limits_{j\in\Omega_{pos}} e^{s_0-s_j}\right)\\
=&\log \left(e^{s_0} + \sum\limits_{i\in\Omega_{neg}} e^{s_i}\right) + \log \left(e^{-s_0} + \sum\limits_{j\in\Omega_{pos}} e^{-s_j}\right)\\
\end{aligned}\end{equation}
If we fix the threshold at $s_0 = 0$, this simplifies to:
\begin{equation}\log \left(1 + \sum\limits_{i\in\Omega_{neg}} e^{s_i}\right) + \log \left(1 + \sum\limits_{j\in\Omega_{pos}} e^{-s_j}\right)\label{eq:final}\end{equation}
This is the final loss form we obtain—a natural and concise extension of "softmax + cross entropy" for multi-label classification tasks. It does not suffer from category imbalance because it does not transform multi-label classification into multiple binary classification problems. Instead, it turns it into pairwise comparisons between target category scores and non-target category scores, and automatically balances the weight of each term thanks to the properties of $\text{logsumexp}$.
Here is a reference implementation in Keras:
from keras import backend as K

def multilabel_categorical_crossentropy(y_true, y_pred):
    """Multi-label cross-entropy.
    Note: y_true and y_pred have the same shape. y_true contains 0 or 1,
    where 1 indicates the class is a target class and 0 otherwise.
    Warning: make sure y_pred ranges over all real numbers. In general,
    y_pred should not be passed through an activation function, and
    especially not sigmoid or softmax! In the prediction stage, output
    the classes where y_pred > 0.
    """
    # Flip the sign of the target-class scores, so that positives contribute -s_j
    # and negatives contribute s_i, matching the two logsumexp terms of the loss.
    y_pred = (1 - 2 * y_true) * y_pred
    # Mask positives out of the "negative" branch and negatives out of the "positive" branch.
    y_pred_neg = y_pred - y_true * 1e12
    y_pred_pos = y_pred - (1 - y_true) * 1e12
    # Append a zero score, which plays the role of the "1 +" term (i.e. the threshold s_0 = 0).
    zeros = K.zeros_like(y_pred[..., :1])
    y_pred_neg = K.concatenate([y_pred_neg, zeros], axis=-1)
    y_pred_pos = K.concatenate([y_pred_pos, zeros], axis=-1)
    neg_loss = K.logsumexp(y_pred_neg, axis=-1)
    pos_loss = K.logsumexp(y_pred_pos, axis=-1)
    return neg_loss + pos_loss
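For completeness, here is a hypothetical usage sketch; the architecture and the random stand-in data are placeholders, not from the original post. The only real requirements are that the output layer emits raw scores (no activation) and that prediction thresholds those scores at 0:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

num_features, num_classes = 64, 10                                    # placeholder sizes
x_train = np.random.randn(256, num_features)                          # random stand-in inputs
y_train = (np.random.rand(256, num_classes) < 0.2).astype('float32')  # random multi-hot labels

model = Sequential([
    Dense(128, activation='relu', input_shape=(num_features,)),
    Dense(num_classes),                     # raw scores: no sigmoid/softmax here
])
model.compile(loss=multilabel_categorical_crossentropy, optimizer='adam')
model.fit(x_train, y_train, epochs=3, batch_size=32)

# Prediction: output every class whose raw score exceeds the built-in threshold 0.
scores = model.predict(x_train[:5])
predicted = [np.where(row > 0)[0].tolist() for row in scores]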
Conclusion
The final conclusion is formula $\eqref{eq:final}$, the unified loss for multi-label classification sought in this article. Everyone is welcome to test it and report the results. I have experimented with it on several multi-label classification tasks, and in all of them it matched the performance of the sigmoid binary-classification scheme with carefully tuned weights.
It should be noted that in addition to standard multi-label classification problems, some common task formats can also be viewed as multi-label classification, such as sequence labeling based on 0/1 tagging. A typical example is my "Semi-Pointer Semi-Tagging" design. From this perspective, there are many tasks on which formula $\eqref{eq:final}$ can be tested. I did try $\eqref{eq:final}$ in previous triplet-extraction examples such as task_relation_extraction.py and obtained results consistent with the earlier ones.
Of course, it must finally be said that although formula $\eqref{eq:final}$ can, in principle, handle many of these issues automatically as a multi-label classification loss, no solution is absolutely perfect or guaranteed to bring improvement. So when you use it to replace your existing multi-label scheme, a boost is not guaranteed, especially if you have already handled category imbalance by carefully tuning weights; in that case the gain from formula $\eqref{eq:final}$ may be very limited. After all, the original intention of formula $\eqref{eq:final}$ is to let us achieve most of the effect without heavy parameter tuning.
Original link: https://kexue.fm/archives/7359