By 苏剑林 | March 14, 2023
As is well known, the standard evaluation metric for classification problems is accuracy, while the standard loss function is cross-entropy. Cross-entropy has the advantage of fast convergence, but it is not a smooth approximation of accuracy, which leads to an inconsistency between training and prediction. On the other hand, when the predicted probability of a training sample is very low, cross-entropy yields a very large loss (tending towards $-\log 0^{+}=\infty$). This means that cross-entropy pays excessive attention to samples with low predicted probabilities—even if those samples might be "noisy data." Consequently, models trained with cross-entropy often exhibit an overconfidence phenomenon, where the model assigns high predicted probabilities to every sample. This brings two side effects: first, a decrease in performance due to overfitting on noisy data; and second, the inability to use predicted probabilities as reliable indicators of uncertainty.
Research on improving cross-entropy has appeared in a steady stream, yet the field remains one of "many contending methods" with no definitive standard answer. In this article, we will look at another concise candidate solution, provided by the paper "Tailoring Language Generation Models under Total Variation Distance".
Brief Introduction to the Results
As the name suggests, the modification in the original paper is aimed at text generation tasks, based on the theory of Total Variation distance (refer to "Designing GANs: Another GAN Production Workshop"). However, after a series of relaxations and simplifications in the original paper, the final result no longer has an obvious connection to Total Variation distance, and theoretically, it is not limited to text generation tasks. Therefore, this article treats it as a loss function for general classification tasks.
For a data pair $(x, y)$, the loss function given by cross-entropy is:
\begin{equation}-\log p_{\theta}(y|x)\end{equation}
The modification in the original paper is very simple, changing it to:
\begin{equation}-\frac{\log \big[\gamma + (1 - \gamma)p_{\theta}(y|x)\big]}{1-\gamma}\label{eq:gamma-ce}\end{equation}
where $\gamma \in [0, 1]$. When $\gamma=0$, it is ordinary cross-entropy; when $\gamma \to 1$, taking the limit gives $1 - p_{\theta}(y|x)$, which as a loss is equivalent to $-p_{\theta}(y|x)$ (the two differ only by an additive constant).
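To verify the $\gamma \to 1$ limit, write $\epsilon = 1 - \gamma$ and use $\log(1-t) = -t + O(t^2)$:
\begin{equation}-\frac{\log \big[\gamma + (1 - \gamma)p_{\theta}(y|x)\big]}{1-\gamma} = -\frac{\log \big[1 - \epsilon\big(1 - p_{\theta}(y|x)\big)\big]}{\epsilon} \xrightarrow{\epsilon \to 0} 1 - p_{\theta}(y|x)\end{equation}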
In the experiments of the original paper, the choice of $\gamma$ varied significantly across tasks: for example, $\gamma=10^{-7}$ for language modeling, $\gamma=0.1$ for machine translation, and $\gamma=0.8$ for text summarization. A rough rule of thumb: when training from scratch, choose a $\gamma$ close to 0; when fine-tuning, consider a relatively larger $\gamma$. A more intuitive alternative is to treat $\gamma$ as a dynamic parameter, starting from $\gamma=0$ and slowly transitioning to $\gamma=1$ as training progresses, though this adds another schedule to tune.
In terms of effectiveness, since there is an adjustable $\gamma$ parameter and the original cross-entropy is included as a special case, as long as one is willing to tune it, there is generally a good chance of achieving better results than standard cross-entropy.
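For concreteness, here is a minimal sketch of Equation $\eqref{eq:gamma-ce}$ in PyTorch (the framework choice and all names are mine, not the paper's); it recovers ordinary cross-entropy at $\gamma=0$ and the $1 - p_{\theta}(y|x)$ limit at $\gamma=1$:

```python
import torch
import torch.nn.functional as F

def gamma_cross_entropy(logits, targets, gamma=0.1):
    """-log[gamma + (1 - gamma) * p] / (1 - gamma), averaged over the batch.

    gamma=0 recovers ordinary cross-entropy; gamma=1 gives the limit 1 - p.
    """
    if gamma == 0.0:
        return F.cross_entropy(logits, targets)  # numerically stabler special case
    # Predicted probability of the correct class for each sample.
    p = F.softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    if gamma == 1.0:
        return (1.0 - p).mean()  # the gamma -> 1 limit
    return (-torch.log(gamma + (1.0 - gamma) * p) / (1.0 - gamma)).mean()

# The dynamic schedule mentioned above could be as simple as:
# gamma = min(1.0, step / total_steps)
```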
Personal Derivation
How should we understand Equation $\eqref{eq:gamma-ce}$? In the "Accuracy" section of "Random Talk on Function Smoothing: Differentiable Approximations of Non-differentiable Functions", we derived that a smooth approximation of accuracy is:
\begin{equation}\mathbb{E}_{(x,y)\sim \mathcal{D}}[p_{\theta}(y|x)]\end{equation}
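Briefly, the idea there is that exact accuracy is the expectation of a hard indicator, and the smooth version replaces that indicator with the predicted probability of the correct class:
\begin{equation}\mathbb{E}_{(x,y)\sim \mathcal{D}}\Big[\mathbb{1}\big[y = \mathop{\arg\max}_{y'} p_{\theta}(y'|x)\big]\Big] \approx \mathbb{E}_{(x,y)\sim \mathcal{D}}[p_{\theta}(y|x)]\end{equation}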
So, if our evaluation metric is accuracy, intuition says we should use $-p_{\theta}(y|x)$ as the loss function, since the loss would then move in step with accuracy. In practice, however, cross-entropy often performs better. Yet cross-entropy's starting point is merely "training well," so it sometimes "over-trains" and overfits. A natural thought, then, is to "interpolate" between the two results to balance their advantages.
To this end, we consider the gradients of both (where "Accuracy" refers to its negative smooth approximation $-p_{\theta}(y|x)$):
\begin{equation}\begin{aligned}
\text{Accuracy:}&\,\quad-\nabla_{\theta} p_{\theta}(y|x) \\
\text{Cross-Entropy:}&\,\quad-\frac{1}{p_{\theta}(y|x)}\nabla_{\theta} p_{\theta}(y|x)
\end{aligned}\end{equation}
The difference between the two is just the factor $\frac{1}{p_{\theta}(y|x)}$. How can we transition $\frac{1}{p_{\theta}(y|x)}$ to 1? The original paper's solution is:
\begin{equation}\frac{1}{\gamma + (1 - \gamma)p_{\theta}(y|x)}\end{equation}
Of course, this construction is not unique. The method chosen by the original paper preserves the gradient characteristics of cross-entropy as much as possible, thereby maintaining its fast convergence property. Based on this construction, we want the gradient of the new loss function to be:
\begin{equation}-\frac{\nabla_{\theta}p_{\theta}(y|x)}{\gamma + (1 - \gamma)p_{\theta}(y|x)} = \nabla_{\theta}\left(-\frac{\log \big[\gamma + (1 - \gamma)p_{\theta}(y|x)\big]}{1-\gamma}\right)\label{eq:gamma-ce-g}\end{equation}
This is exactly the loss function in Equation $\eqref{eq:gamma-ce}$: we first designed the new gradient, then recovered the corresponding loss function by integration.
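The integration step is easy to sanity-check numerically. Here is a toy PyTorch sketch (my own, not from the paper) comparing autograd's gradient of Equation $\eqref{eq:gamma-ce}$ against the designed gradient on the left of Equation $\eqref{eq:gamma-ce-g}$:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)
target, gamma = 2, 0.3

p = F.softmax(logits, dim=-1)[target]  # p_theta(y|x)
loss = -torch.log(gamma + (1 - gamma) * p) / (1 - gamma)
(g_loss,) = torch.autograd.grad(loss, logits, retain_graph=True)

# The designed gradient: -grad(p) / (gamma + (1 - gamma) * p)
(g_p,) = torch.autograd.grad(p, logits)
g_designed = -g_p / (gamma + (1 - gamma) * p.detach())

print(torch.allclose(g_loss, g_designed, atol=1e-6))  # expected: True
```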
Additional Comments
Why design the loss function from a gradient perspective? There are roughly two reasons.
First, many loss functions simplify significantly after taking gradients. Designing in gradient space often provides more inspiration and flexibility. For instance, in this example, designing the transition function $\frac{1}{\gamma + (1 - \gamma)p_{\theta}(y|x)}$ between $\frac{1}{p_{\theta}(y|x)}$ and $1$ in gradient space is not too complex, but designing a transition function between $p_{\theta}(y|x)$ and $\log p_{\theta}(y|x)$ directly in the loss function space like $\frac{\log [\gamma + (1 - \gamma)p_{\theta}(y|x)]}{1-\gamma}$ would be much more complicated.
Second, current optimizers are all gradient-based, so in many cases, we only need to design the gradient; we don't even necessarily need to find the antiderivative. The original result in the paper actually only provided the gradient:
\begin{equation}-\max\left(b, \frac{p_{\theta}(y|x)}{\gamma + (1 - \gamma)p_{\theta}(y|x)}\right)\nabla_{\theta}\log p_{\theta}(y|x)\end{equation}
When $b=0$, this is equivalent to the gradient in Equation $\eqref{eq:gamma-ce-g}$, i.e., to the loss in Equation $\eqref{eq:gamma-ce}$. In other words, the original paper additionally imposed a threshold when designing the gradient, at which point a simple antiderivative becomes hard to write down. Implementing it is nevertheless easy; one just needs the loss function:
\begin{equation}-\max\left(b, \frac{p_{\theta}(y|x)}{\gamma + (1 - \gamma)p_{\theta}(y|x)}\right)_{\text{stop\_grad}}\log p_{\theta}(y|x)\end{equation}
The $\text{stop\_grad}$ here simply cuts off the gradient of that factor, treating it as a constant during backpropagation; it corresponds to the tf.stop_gradient operator in TensorFlow.
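As an illustration, here is a minimal PyTorch sketch of this trick (my own rendering; `.detach()` plays the role of tf.stop_gradient):

```python
import torch
import torch.nn.functional as F

def thresholded_gamma_ce(logits, targets, gamma=0.1, b=0.0):
    """Loss whose gradient is -max(b, p / (gamma + (1 - gamma) p)) * grad(log p).

    The weight is detached, so autograd only differentiates through log p.
    """
    log_p = F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    p = log_p.exp()
    weight = torch.clamp(p / (gamma + (1 - gamma) * p), min=b).detach()  # max(b, .)
    return -(weight * log_p).mean()
```

With $b=0$, this has the same gradient as Equation $\eqref{eq:gamma-ce}$, even though the loss values themselves differ.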
Summary
This article introduced a concise scheme for mitigating the overconfidence of cross-entropy: at the gradient level, it interpolates between cross-entropy and a smooth approximation of accuracy via a parameter $\gamma$.