ZLPR: A Novel Loss for Multi-label Classification

By 苏剑林 | May 07, 2022

(Note: The relevant content of this article has been compiled into the paper "ZLPR: A Novel Loss for Multi-label Classification". If you need to cite it, you can cite the English paper directly, thank you.)

In "Promoting 'Softmax + Cross-Entropy' to Multi-Label Classification", we proposed a loss function for multi-label classification:

\begin{equation}\log \left(1 + \sum\limits_{i\in\Omega_{neg}} e^{s_i}\right) + \log \left(1 + \sum\limits_{j\in\Omega_{pos}} e^{-s_j}\right)\label{eq:original}\end{equation}

This loss function possesses the advantages of "Softmax + Cross-Entropy" in single-label classification and works effectively even when the positive and negative classes are imbalanced. However, as can be seen from its form, it is only applicable to "hard labels," which means techniques like label smoothing and mixup cannot be used. This article attempts to solve this problem by proposing a soft label version of the aforementioned loss function.
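For concreteness, here is a minimal NumPy sketch of formula $\eqref{eq:original}$ (an illustration of ours, not the bert4keras implementation; the names `zlpr_hard`, `scores`, and `labels` are made up for this example):

```python
import numpy as np
from scipy.special import logsumexp

def zlpr_hard(scores, labels):
    """The hard-label multi-label loss above:
    log(1 + sum_{neg} e^{s_i}) + log(1 + sum_{pos} e^{-s_j}).

    scores: shape (num_classes,) array of scores s_i
    labels: 0/1 array of the same shape
    """
    pos, neg = scores[labels == 1], scores[labels == 0]
    # logsumexp over [0, x_1, ..., x_k] computes log(1 + sum_i e^{x_i})
    return (logsumexp(np.concatenate(([0.0], -pos)))
            + logsumexp(np.concatenate(([0.0], neg))))
```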

A Clever Connection

The classic approach to multi-label classification is to transform it into multiple binary classification problems, where each category is activated by the sigmoid function $\sigma(x)=1/(1+e^{-x})$, and then each uses the binary cross-entropy loss. When the positive and negative categories are extremely imbalanced, this approach usually performs poorly, whereas loss $\eqref{eq:original}$ is typically a superior choice.

In the comments section of a previous article, reader @wu.yan revealed a clever connection between multiple "sigmoid + binary cross-entropy" instances and formula $\eqref{eq:original}$: multiple "sigmoid + binary cross-entropy" losses can be appropriately rewritten as:

\begin{equation}\begin{aligned} &\,-\sum_{j\in\Omega_{pos}}\log\sigma(s_j)-\sum_{i\in\Omega_{neg}}\log(1-\sigma(s_i))\\ =&\,\log\prod_{j\in\Omega_{pos}}(1+e^{-s_j})+\log\prod_{i\in\Omega_{neg}}(1+e^{s_i})\\ =&\,\log\left(1+\sum_{j\in\Omega_{pos}}e^{-s_j}+\cdots\right)+\log\left(1+\sum_{i\in\Omega_{neg}}e^{s_i}+\cdots\right) \end{aligned}\label{eq:link}\end{equation}

Comparing this with formula $\eqref{eq:original}$, we discover that formula $\eqref{eq:original}$ is exactly the above combined "sigmoid + binary cross-entropy" loss with the higher-order terms represented by $\cdots$ removed! When the positive and negative categories are imbalanced, these higher-order terms carry too much weight, exacerbating the imbalance problem and leading to poor performance. Conversely, after removing them, the purpose of the loss is unchanged (it still wants positive-class scores to be greater than 0 and negative-class scores to be less than 0), and because the sums inside the brackets grow only linearly with the number of classes, the gap between the positive-class and negative-class parts of the loss does not become too large.
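To make this connection concrete, here is a small numerical check of ours (with randomly chosen scores): the summed binary cross-entropy equals the log-product form in $\eqref{eq:link}$ exactly, while dropping the higher-order terms yields the smaller value of formula $\eqref{eq:original}$:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
s = rng.normal(size=6)                    # scores
y = np.array([1, 0, 0, 1, 0, 0])          # hard labels

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

# Summed "sigmoid + binary cross-entropy"
bce = -(y * np.log(sigma(s)) + (1 - y) * np.log(1 - sigma(s))).sum()

# The rewritten log-product form: identical to bce
logprod = np.log1p(np.exp(-s[y == 1])).sum() + np.log1p(np.exp(s[y == 0])).sum()

# The loss with the higher-order (cross) terms dropped
zlpr = logsumexp(np.r_[0.0, -s[y == 1]]) + logsumexp(np.r_[0.0, s[y == 0]])

print(bce, logprod, zlpr)   # bce == logprod (up to rounding); zlpr is smaller
```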

Formal Conjecture

This clever connection tells us that, to find the soft label version of formula $\eqref{eq:original}$, we can start from the soft label version of multiple "sigmoid + binary cross-entropy" losses and again remove the higher-order terms. So-called soft labels mean that labels are no longer strictly 0 or 1, but can be any real number between 0 and 1, representing the probability of belonging to that class. For binary cross-entropy, the soft label version is simple:

\begin{equation}-t\log\sigma(s)-(1-t)\log(1-\sigma(s))\end{equation}

Here $t$ is the soft label, and $s$ is the corresponding score. Following the process in $\eqref{eq:link}$, we can get:

\begin{equation}\begin{aligned} &\,-\sum_i t_i\log\sigma(s_i)-\sum_i (1-t_i)\log(1-\sigma(s_i))\\ =&\,\log\prod_i(1+e^{-s_i})^{t_i}+\log\prod_i (1+e^{s_i})^{1-t_i}\\ =&\,\log\prod_i(1+t_i e^{-s_i} + \cdots)+\log\prod_i (1+(1-t_i)e^{s_i}+\cdots)\\ =&\,\log\left(1+\sum_i t_i e^{-s_i}+\cdots\right)+\log\left(1+\sum_i(1-t_i)e^{s_i}+\cdots\right) \end{aligned}\end{equation}

If we remove the higher-order terms, we obtain:

\begin{equation}\log\left(1+\sum_i t_i e^{-s_i}\right)+\log\left(1+\sum_i(1-t_i)e^{s_i}\right)\label{eq:soft}\end{equation}

This is the candidate form for the soft label version of formula $\eqref{eq:original}$. It can be seen that when $t_i\in\{0,1\}$, it degenerates exactly into formula $\eqref{eq:original}$.
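A direct (numerically naive) NumPy sketch of formula $\eqref{eq:soft}$, with the hypothetical names `zlpr_soft`, `scores`, and `soft_labels`:

```python
import numpy as np

def zlpr_soft(scores, soft_labels):
    """Candidate soft-label loss:
    log(1 + sum_i t_i e^{-s_i}) + log(1 + sum_i (1 - t_i) e^{s_i})."""
    t, s = soft_labels, scores
    return np.log1p(np.sum(t * np.exp(-s))) + np.log1p(np.sum((1.0 - t) * np.exp(s)))
```

For large $|s_i|$ the exponentials overflow, so in practice one would evaluate this in the log domain via logsumexp, as discussed in the implementation section below.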

Proof of Result

For now, formula $\eqref{eq:soft}$ is at most a "candidate" form. To "verify" it, we need to prove that when $t_i$ is a floating-point number between 0 and 1, formula $\eqref{eq:soft}$ can learn meaningful results. By "meaningful," we mean that theoretically, the information of $t_i$ can be reconstructed through $s_i$ ($s_i$ is the model's prediction result, $t_i$ is the given label, so reconstructing $t_i$ from $s_i$ is the goal of machine learning).

To this end, we denote formula $\eqref{eq:soft}$ as $l$ and calculate the partial derivative with respect to $s_i$:

\begin{equation}\frac{\partial l}{\partial s_i} = \frac{-t_i e^{-s_i}}{1+\sum\limits_j t_j e^{-s_j}}+\frac{(1-t_i)e^{s_i}}{1+\sum\limits_j(1-t_j)e^{s_j}}\end{equation}

We know that the minimum of $l$ occurs when all $\frac{\partial l}{\partial s_i}$ are equal to 0. It is not easy to solve the system of equations $\frac{\partial l}{\partial s_i}=0$ directly, but the author noticed a magical "coincidence": if $t_i e^{-s_i}=(1-t_i)e^{s_i}$ holds for every $i$, then the two denominators become equal and the two numerators cancel, so each $\frac{\partial l}{\partial s_i}$ automatically equals 0! Therefore, $t_i e^{-s_i}=(1-t_i)e^{s_i}$ should be the optimal solution for $l$; rearranging it as $t_i/(1-t_i)=e^{2s_i}$ and solving for $t_i$ yields:

\begin{equation}t_i = \frac{1}{1+e^{-2s_i}}=\sigma(2s_i)\end{equation}

This is a very beautiful result, which tells us several things:

  1. Formula $\eqref{eq:soft}$ is indeed a reasonable soft-label generalization of formula $\eqref{eq:original}$: it can fully reconstruct $t_i$ from $s_i$ (see the numerical check after this list), and its form is again exactly related to the sigmoid.
  2. If we want to output the result as a probability value between 0 and 1, the correct approach should be $\sigma(2s_i)$ instead of the intuitive $\sigma(s_i)$.
  3. Since the final probability formula also has a sigmoid form, thinking in reverse, it can also be understood that we are still learning multiple sigmoid-activated binary classification problems, just with the loss function replaced by formula $\eqref{eq:soft}$.
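As a sanity check (ours, not part of the original derivation), plain gradient descent on formula $\eqref{eq:soft}$, using the partial derivatives above, does recover $t_i=\sigma(2s_i)$:

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(0.05, 0.95, size=5)   # soft labels strictly inside (0, 1)
s = np.zeros(5)                       # scores to be optimized

for _ in range(20000):
    a = t * np.exp(-s)                # t_i e^{-s_i}
    b = (1 - t) * np.exp(s)           # (1 - t_i) e^{s_i}
    grad = -a / (1 + a.sum()) + b / (1 + b.sum())   # the partial derivatives above
    s -= 0.1 * grad                   # plain gradient descent

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
print(np.max(np.abs(sigma(2 * s) - t)))   # ~0 (up to float precision): t_i = sigma(2 s_i)
```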

Implementation Techniques

For an implementation of formula $\eqref{eq:soft}$, one can refer to the multilabel_categorical_crossentropy code in bert4keras; there is a small detail there worth discussing.

First, formula $\eqref{eq:soft}$ can be equivalently rewritten as:

\begin{equation}\log\left(1+\sum_i e^{-s_i + \log t_i}\right)+\log\left(1+\sum_i e^{s_i + \log (1-t_i)}\right)\label{eq:soft-log}\end{equation}

So it seems that we only need to add $\log t_i$ to $-s_i$, add $\log(1-t_i)$ to $s_i$, prepend a zero to each group (to account for the leading 1 inside each logarithm), and then perform a standard logsumexp. However, in reality $t_i$ can take the value $0$ or $1$, in which case $\log t_i$ or $\log(1-t_i)$ is negative infinity. Since frameworks cannot handle negative infinity directly, we usually need to clip $t_i$ before taking the $\log$. That is, after choosing an $\epsilon > 0$, we define:

\begin{equation}\text{clip}(t)=\left\{\begin{aligned}&\epsilon, &t < \epsilon \\ &t, & \epsilon\leq t\leq 1-\epsilon\\ &1-\epsilon, &t > 1-\epsilon\end{aligned}\right.\end{equation}

But this clipping introduces a problem: $\epsilon$ is not truly infinitesimal (for example, with $\epsilon=10^{-7}$, $\log\epsilon$ is only about $-16$). In scenarios like GlobalPointer, we mask out illegitimate positions beforehand by setting the corresponding $s_i$ to a negative number with a very large absolute value, such as $-10^7$. Looking at formula $\eqref{eq:soft-log}$, the summands of the first term are $e^{-s_i + \log t_i}$, so that $-10^7$ becomes $10^7$. If $t_i$ were not clipped, $\log t_i$ would be $\log 0 = -\infty$, which would pull $-s_i + \log t_i$ back down to negative infinity; but as we just saw, the clipped $\log t_i$ is only around $-16$, far too small in magnitude to offset the $10^7$ contributed by $-s_i$. Thus $-s_i + \log t_i$ remains a huge positive number, and a position that was supposed to be masked out ends up dominating the loss.

To solve this problem, we not only clip $t_i$, but we also need to find the $t_i$ that were originally smaller than $\epsilon$ and manually set the corresponding $-s_i$ to a negative number with a very large absolute value. Similarly, find $t_i$ greater than $1-\epsilon$ and set the corresponding $s_i$ to a negative number with a very large absolute value. Doing this treats values less than $\epsilon$ exactly as 0 and values greater than $1-\epsilon$ exactly as 1.
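Putting these details together, a minimal TensorFlow sketch of the above recipe (our own illustration, not the actual bert4keras code; the name `soft_multilabel_ce` and the constants are ours) might look like:

```python
import tensorflow as tf

def soft_multilabel_ce(t, s, epsilon=1e-7, big=1e12):
    """Soft-label loss log(1 + sum_i e^{-s_i + log t_i}) + log(1 + sum_i e^{s_i + log(1 - t_i)}).

    t: soft labels in [0, 1], shape (..., num_classes); s: scores of the same shape.
    Sketch only; not the bert4keras implementation.
    """
    mask_zero = t < epsilon            # labels that are effectively 0
    mask_one = t > 1.0 - epsilon       # labels that are effectively 1
    t = tf.clip_by_value(t, epsilon, 1.0 - epsilon)
    pos = -s + tf.math.log(t)          # arguments of the first logsumexp
    neg = s + tf.math.log(1.0 - t)     # arguments of the second logsumexp
    # Treat t_i < epsilon exactly as 0 and t_i > 1 - epsilon exactly as 1:
    # force the corresponding term to a huge negative value so it drops out of the
    # logsumexp no matter how large |s_i| is (e.g. masked positions with s_i = -1e7).
    pos = tf.where(mask_zero, -big * tf.ones_like(pos), pos)
    neg = tf.where(mask_one, -big * tf.ones_like(neg), neg)
    # Prepend a zero so that logsumexp computes log(1 + sum_i e^{...})
    zeros = tf.zeros_like(pos[..., :1])
    pos = tf.concat([zeros, pos], axis=-1)
    neg = tf.concat([zeros, neg], axis=-1)
    return tf.reduce_logsumexp(pos, axis=-1) + tf.reduce_logsumexp(neg, axis=-1)
```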

Summary

This article primarily generalizes the multi-label "Softmax + Cross-Entropy" previously proposed by the author to soft label scenarios. With the corresponding soft label version, we can combine it with techniques such as label smoothing and mixup, providing another optimization direction for models like GlobalPointer.