Revisiting Class Imbalance: Comparison and Connection between Weight Adjustment and Custom Loss Functions

By 苏剑林 | August 31, 2020

The class imbalance problem, also known as the long-tail distribution problem, has been discussed several times on this blog, in posts such as "From Hard Truncation and Softening of the Loss to Focal Loss", "Generalizing 'Softmax + Cross Entropy' to Multi-label Classification", and "Mitigating Class Imbalance through the Mutual Information Idea". The most basic remedy for class imbalance is adjusting sample weights, while more "high-end" approaches modify the loss function itself (Focal Loss, Dice Loss, Logit Adjustment, etc.). This article aims to understand the connections between them systematically.

Long-tail distribution: A few categories have a very large number of samples, while most categories have a very small number of samples.

From Smooth Accuracy to Cross-Entropy

The analysis here is based primarily on binary classification with sigmoid, but most conclusions generalize to multi-class classification with softmax. Let $x$ be the input, $y \in \{0,1\}$ the target, and $p_{\theta}(x) \in [0, 1]$ the model's predicted probability of the positive class. Ideally, we would optimize the metric we use for evaluation. For classification, the most straightforward metric is accuracy, but accuracy provides no useful gradients and thus cannot be used directly for training.

To address this, we use a smoothed metric. From the previous article "Random Talk on Function Smoothing: Differentiable Approximations of Non-differentiable Functions", the smoothed approximation of accuracy is:

\begin{equation}\text{ACC}_{\text{smooth}}=\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[y p_{\theta}(x) + (1 - y)(1 - p_{\theta}(x))\big]\end{equation}

where $\mathcal{D}$ is the training dataset. In principle, we should use $-\text{ACC}_{\text{smooth}}$ as the minimization objective. However, in practice, directly optimizing this objective does not yield good results; it is better to optimize the cross-entropy:

\begin{equation}\text{cross\_entropy}=\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[-y \log p_{\theta}(x) - (1 - y)\log(1 - p_{\theta}(x))\big]\end{equation}
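To make the two objectives concrete, here is a minimal numpy sketch (the batch values are purely illustrative) that computes both $-\text{ACC}_{\text{smooth}}$ and the cross-entropy for the same predictions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative batch (values are made up): logits and binary targets.
z = np.array([-4.0, -1.0, 0.5, 3.0])
y = np.array([1.0, 0.0, 1.0, 0.0])
p = sigmoid(z)

# Negative smoothed accuracy: -E[y*p + (1-y)*(1-p)]
neg_acc_smooth = -np.mean(y * p + (1 - y) * (1 - p))

# Cross-entropy: -E[y*log(p) + (1-y)*log(1-p)]
cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(neg_acc_smooth, cross_entropy)
```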

This is intriguing: if $\text{ACC}_{\text{smooth}}$ is closer to our evaluation metric, why does using cross-entropy yield better results for that metric?

This can be explained through gradients. $p_{\theta}(x)$ is typically activated via sigmoid, i.e., $p_{\theta}(x)=\sigma(z_{\theta}(x))$, where $\sigma(t)=\frac{1}{1+e^{-t}}$. Its derivative is $\sigma'(t)=\sigma(t)(1 - \sigma(t))$, and $z_{\theta}(x)$ is what we usually call the "logits."

Consider a sample with $y=1$. Its contribution to $-\text{ACC}_{\text{smooth}}$ is $-p_{\theta}(x)=-\sigma(z_{\theta}(x))$, whose gradient is:

\begin{equation}-\nabla_{\theta} p_{\theta}(x) = - p_{\theta}(x) (1 - p_{\theta}(x))\nabla_{\theta}z_{\theta}(x)\end{equation}

Since $y$ is 1, the training goal is $p_{\theta}(x) \to 1$. We would therefore expect a large gradient when $p_{\theta}(x)$ is close to 0 (a large error) and a small gradient when $p_{\theta}(x)$ is close to 1 (a small error). The gradient $-\nabla_{\theta} p_{\theta}(x)$ above does not behave this way: its modulating factor $p_{\theta}(x)(1 - p_{\theta}(x))$ peaks at $p_{\theta}(x)=0.5$ and vanishes as $p_{\theta}(x)$ approaches 0 or 1. So when the error is extremely large, the gradient is actually tiny, leading to inefficient optimization and poor overall performance. In contrast, for cross-entropy, we have:

\begin{equation}-\nabla_{\theta} \log p_{\theta}(x) = - (1 - p_{\theta}(x))\nabla_{\theta}z_{\theta}(x)\end{equation}

This perfectly removes the $p_{\theta}(x)$ factor that was detrimental to the gradient. Consequently, optimization is more efficient, resulting in better final performance. The same conclusion holds for $y=0$.
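The difference is easy to verify numerically. The sketch below (with assumed, illustrative logits) compares the scalar factor multiplying $\nabla_{\theta}z_{\theta}(x)$ in the two objectives for a positive sample:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# For a positive sample (y = 1), compare the scalar factor that multiplies
# grad(z) in the two objectives, across a range of (illustrative) logits.
z = np.array([-8.0, -4.0, 0.0, 4.0, 8.0])
p = sigmoid(z)

factor_smooth_acc = p * (1 - p)  # from -grad(p) = -p(1-p) * grad(z)
factor_cross_ent = 1 - p         # from -grad(log p) = -(1-p) * grad(z)

# At z = -8 (a badly wrong prediction) the smooth-accuracy factor is ~3e-4,
# while the cross-entropy factor is ~1: cross-entropy keeps the gradient
# large exactly where the error is large.
for zi, fs, fc in zip(z, factor_smooth_acc, factor_cross_ent):
    print(f"z={zi:+.1f}  smooth_acc={fs:.4f}  cross_entropy={fc:.4f}")
```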

From Smooth F1 to Weighted Cross-Entropy

From this process, we can sense that various modifications to the loss function essentially adjust the gradients. By obtaining more reasonable gradients, we can achieve more effective optimization and yield better models. Furthermore, reflecting on the transformation process above: the gradient of the approximate objective was originally $-\nabla_{\theta}p_{\theta}(x)$, but $-\nabla_{\theta}\log p_{\theta}(x)$ worked better. If we skip the detailed analysis and treat $p \to \log p$ as an "axiom," what interesting results might follow?

For example, when negative samples far outnumber positive samples, our evaluation metric is usually no longer accuracy (since outputting all 0s would result in very high accuracy). Instead, we care about the F1 score for the positive class. Directly optimizing F1 is difficult, so we need a smoothed version. The article "Random Talk on Function Smoothing" also provides this result:

\begin{equation}\text{F1}_{\text{smooth}}=\frac{2 \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[y p_{\theta}(x)\big]}{\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[y + p_{\theta}(x)\big]}\end{equation}
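As a quick illustration, here is a minimal numpy version of $\text{F1}_{\text{smooth}}$ on a toy imbalanced batch (all values are made up for illustration):

```python
import numpy as np

def f1_smooth(y, p):
    # Smoothed F1 over a batch: 2*E[y*p] / E[y + p]
    return 2 * np.mean(y * p) / np.mean(y + p)

# Toy imbalanced batch: 2 positives, 6 negatives (values are illustrative).
y = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=float)
p = np.array([0.7, 0.4, 0.2, 0.1, 0.3, 0.05, 0.15, 0.1])

# Unlike the hard F1 (which needs thresholding), this is differentiable in p.
print(f1_smooth(y, p))
```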

So our original minimization goal is $-\text{F1}_{\text{smooth}}$. Following the $p \to \log p$ "axiom," we first calculate the gradient of $-\text{F1}_{\text{smooth}}$:

\begin{equation}\begin{aligned}&-\nabla_{\theta}\frac{2 \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[y p_{\theta}(x)\big]}{\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[y + p_{\theta}(x)\big]}\\ =&-2\frac{\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[y \nabla_{\theta}p_{\theta}(x)\big]}{\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[y + p_{\theta}(x)\big]} + 2\frac{\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[y p_{\theta}(x)\big]\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\nabla_{\theta}p_{\theta}(x)\big]}{\left(\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[y + p_{\theta}(x)\big]\right)^2}\\ =&-\frac{2\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\big(y-\text{F1}_{\text{smooth}}/2\big)\nabla_{\theta}p_{\theta}(x)\big]}{\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[y + p_{\theta}(x)\big]} \end{aligned}\end{equation}

where $\frac{2}{\mathbb{E}_{(x,y)\sim\mathcal{D}}[y + p_{\theta}(x)]}$ is a global scaling factor that affects all samples equally. Dropping it, the relative gradient we actually care about is:

\begin{equation}-\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\big(y-\text{F1}_{\text{smooth}}/2\big)\nabla_{\theta}p_{\theta}(x)\big]\end{equation}

Applying the $p \to \log p$ "axiom" (and $-p \to \log(1-p)$ for negative samples), we obtain the final gradient as:

\begin{equation}-\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[y\cdot\big(1-\text{F1}_{\text{smooth}}/2\big)\cdot\nabla_{\theta}\log p_{\theta}(x) + (1 - y)\cdot\text{F1}_{\text{smooth}}/2\cdot\nabla_{\theta}\log (1-p_{\theta}(x))\big]\end{equation}

This is equivalent to the gradient of the following optimization objective (where $\text{F1}_{\text{smooth}}$ is treated as a constant with no gradient):

\begin{equation}-\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[y\cdot\big(1-\text{F1}_{\text{smooth}}/2\big)\cdot\log p_{\theta}(x) + (1 - y)\cdot\text{F1}_{\text{smooth}}/2\cdot\log (1-p_{\theta}(x))\big]\end{equation}

This is simply a weighted cross-entropy in which positive samples are weighted by $1-\text{F1}_{\text{smooth}}/2$ and negative samples by $\text{F1}_{\text{smooth}}/2$. Since $\text{F1}_{\text{smooth}} \in [0,1]$, the positive weight is always at least $1/2$ and never smaller than the negative weight, so the minority positive class is automatically upweighted.
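Here is one way this weighted cross-entropy might look in numpy. Note that $\text{F1}_{\text{smooth}}$ is estimated from the current batch and treated as a constant (in an autodiff framework it would sit behind a stop-gradient/detach); this is a sketch under those assumptions, not a canonical implementation:

```python
import numpy as np

def f1_smooth(y, p):
    # Smoothed F1 over the batch: 2*E[y*p] / E[y + p]
    return 2 * np.mean(y * p) / np.mean(y + p)

def f1_weighted_cross_entropy(y, p, eps=1e-7):
    # F1_smooth is estimated on the batch and treated as a constant:
    # in an autodiff framework this would be wrapped in stop_gradient/detach.
    w = f1_smooth(y, p)
    p = np.clip(p, eps, 1 - eps)
    pos = (1 - w / 2) * y * np.log(p)        # positives weighted by 1 - F1/2
    neg = (w / 2) * (1 - y) * np.log(1 - p)  # negatives weighted by F1/2
    return -np.mean(pos + neg)

y = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=float)
p = np.array([0.7, 0.4, 0.2, 0.1, 0.3, 0.05, 0.15, 0.1])
print(f1_weighted_cross_entropy(y, p))
```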

From Margin Expansion to Logit Adjustment

Ultimately, regardless of the evaluation metric, we hope to predict every sample correctly. The issue is that categories with fewer samples do not generalize as well because they are not learned sufficiently.

Let's think about this geometrically. Ideally, in the embedding space, each class occupies its own "territory," and these territories should not overlap. Categories with fewer samples generalize poorly because their territories are small and easily "squeezed" by categories with more samples; they can barely hold the ground for their existing samples, let alone leave room for new samples unseen during training.

How do we solve this? Quite intuitively: if samples from minority classes are "heavyweights" (each capable of taking on ten opponents), they can hold their ground in "territorial disputes" despite their small numbers.

Let's make this concrete for an $n$-class classification problem. For a sample with encoding vector $f_{\theta}(x)$ and class vector $u_y$, similarity is typically measured by the inner product $\langle f_{\theta}(x), u_y\rangle$. Suppose each sample can occupy a territory of radius $r_y$: any $z$ satisfying $\| z - f_{\theta}(x)\| \leq r_y$ should be treated as an encoding vector of that sample, so $z$'s similarity to $u_y$ should exceed its similarity to every other class vector.

Now consider:

\begin{equation}\langle z, u_y\rangle = \langle f_{\theta}(x), u_y\rangle + \langle z - f_{\theta}(x), u_y\rangle\end{equation}

Since $\| z - f_{\theta}(x)\| \leq r_y$, the Cauchy–Schwarz inequality gives $|\langle z - f_{\theta}(x), u_y\rangle| \leq r_y\| u_y\|$, and therefore:

\begin{equation}\langle f_{\theta}(x), u_y\rangle - r_y\| u_y\|\leq\langle z, u_y\rangle \leq \langle f_{\theta}(x), u_y\rangle + r_y\| u_y\|\end{equation}

To ensure that "the similarity of any $z$ to $u_y$ is always greater than its similarity to other classes," we only need the "minimum similarity to $u_y$ to be greater than the maximum similarity to other classes." Thus, our optimization objective becomes:

\begin{equation}-\log\frac{e^{\langle f_{\theta}(x), u_y\rangle - r_y\| u_y\|}}{e^{\langle f_{\theta}(x), u_y\rangle - r_y\| u_y\|}+\sum\limits_{i\neq y} e^{\langle f_{\theta}(x), u_i\rangle + r_y\| u_i\|}}\end{equation}

This essentially matches margin-based softmax variants such as AM-Softmax or Circle Loss. The specific form matters less than the principle: set a larger margin for classes with fewer samples (making minority-class samples more "capable"). How should the per-class margin be chosen? The previous article "Mitigating Class Imbalance through the Mutual Information Idea" suggested $m_y=-\tau\log p(y)$, where $p(y)$ is the prior probability of class $y$. This gives:

\begin{equation}-\log\frac{e^{\langle f_{\theta}(x), u_y\rangle + \tau \log p(y)}}{\sum\limits_{i} e^{\langle f_{\theta}(x), u_i\rangle + \tau \log p(i)}}\end{equation}

Thus we arrive at the Logit Adjustment loss, or rather we have given it a geometric interpretation. In essence, Logit Adjustment is also a form of reweighting: ordinary weight adjustment multiplies a weight outside the $\log$ in the loss, whereas Logit Adjustment adjusts the logits inside the $\log$ (i.e., before the softmax).
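As a sketch of the idea (a minimal numpy version, with an assumed toy prior), the Logit Adjustment loss simply adds $\tau\log p(i)$ to each class's logit before an ordinary softmax cross-entropy:

```python
import numpy as np

def logit_adjusted_cross_entropy(logits, y, prior, tau=1.0):
    # Add tau * log p(i) to every class logit before the softmax, then
    # apply ordinary softmax cross-entropy on the adjusted logits.
    adjusted = logits + tau * np.log(prior)          # shape: (batch, n_classes)
    adjusted -= adjusted.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(y)), y])

# Toy 3-class long-tail prior and a tiny batch (all values are illustrative).
prior = np.array([0.90, 0.08, 0.02])
logits = np.array([[2.0, 1.0, 0.5],
                   [0.3, 1.2, 0.8]])
y = np.array([0, 2])
print(logit_adjusted_cross_entropy(logits, y, prior, tau=1.0))
```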

Conclusion

This article has shared some thoughts on the class imbalance problem and its countermeasures, aiming mainly to reveal, through relatively intuitive reasoning, the logic behind these loss-function modifications. From this perspective, the various solutions essentially boil down to adjusting sample weights or class weights. The reasoning here is admittedly loose and reflects my own brainstorming; corrections are welcome.