By 苏剑林 | June 01, 2022
Recently, a paper on arXiv titled "EXACT: How to Train Your Accuracy" caught my interest. As the name suggests, it describes how to use accuracy directly as the training objective. Since I have previously done some analysis on this topic, for example in "Random Talk on Function Smoothing: Differentiable Approximations of Non-differentiable Functions" and "Revisiting the Class Imbalance Problem: Comparison and Connection Between Adjusting Weights and Modifying Loss", that prior work let me get through the paper quickly, and I wrote this summary together with some recent new thoughts on the subject.
An Unrealistic Example
The paper points out at the beginning that the classification loss functions we usually use, such as Cross-Entropy or the Hinge Loss of SVMs, are not well aligned with the final evaluation metric, accuracy. To illustrate this, the paper gives a very simple example: suppose the data consists of only three points $\{(-0.25,-1),(0,-1),(0.25,1)\}$, where $-1$ and $1$ denote the negative and positive classes respectively. The model to be fitted is $f(x)=x-b$, where $b$ is a parameter, and the predicted class is $\text{sign}(f(x))$. With "sigmoid + Cross-Entropy" the loss is $-\log \frac{1}{1+e^{-l \cdot f(x)}}$, where $(x,l)$ denotes a labeled sample; with Hinge Loss it is $\max(0, 1 - l\cdot f(x))$.
Since the model is one-dimensional, we can simply grid-search for the optimal solution (a small sketch of this search is given below). It turns out that with "sigmoid + Cross-Entropy" the loss is minimized at about $b=0.7$, while Hinge Loss is minimized for any $b\in[0.75,1]$. However, for $\text{sign}(f(x))$ to classify all three points correctly, $b$ must lie in $(0, 0.25)$. This illustrates the inconsistency between Cross-Entropy or Hinge Loss and the final evaluation metric, accuracy.
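For readers who want to check the numbers, here is a minimal NumPy sketch of that grid search. The data, model, and losses are exactly those of the example above, while the function names and the search grid are my own choices.

```python
import numpy as np

# The three labeled points from the example: (x, l) with l in {-1, +1}
data = [(-0.25, -1), (0.0, -1), (0.25, 1)]

def cross_entropy(b):   # "sigmoid + Cross-Entropy": sum of -log sigmoid(l * f(x)), with f(x) = x - b
    return sum(np.log1p(np.exp(-l * (x - b))) for x, l in data)

def hinge(b):           # Hinge Loss: sum of max(0, 1 - l * f(x))
    return sum(max(0.0, 1.0 - l * (x - b)) for x, l in data)

def accuracy(b):        # fraction of points with sign(f(x)) == l
    return np.mean([np.sign(x - b) == l for x, l in data])

bs = np.linspace(-2.0, 2.0, 4001)                    # grid over b with step 0.001
ce = np.array([cross_entropy(b) for b in bs])
hg = np.array([hinge(b) for b in bs])
acc = np.array([accuracy(b) for b in bs])

print("CE minimizer:     b ~ %.2f" % bs[ce.argmin()])                      # about 0.70
flat = bs[hg <= hg.min() + 1e-9]
print("Hinge minimizers: b in [%.2f, %.2f]" % (flat.min(), flat.max()))    # about [0.75, 1.00]
good = bs[acc == 1.0]
print("100%% accuracy:    b in (%.2f, %.2f)" % (good.min(), good.max()))   # about (0, 0.25)
```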
This seems like a concise and elegant example, but I find it unrealistic. The biggest problem is the absence of a scale (temperature) parameter: in practice a model would typically take the form $f(x)=k(x-b)$ rather than $f(x)=x-b$. Deliberately removing the temperature parameter in order to construct a counterexample is unconvincing; once a tunable temperature parameter is added, both loss functions learn the correct answer. Even more unfairly, the solution the authors later propose, EXACT, comes with its own temperature parameter as a key component. In other words, in this example EXACT beats the other two losses purely because EXACT has a temperature parameter and they do not.
Old Wine in New Bottles
Now let's look at the proposed scheme, EXACT (EXpected ACcuracy opTimization). Looking back at it, EXACT feels somewhat mystifying, because the authors simply, and without much explanation, redefine the conditional probability distribution $p(y|x)$ from a reparameterization perspective:
\begin{equation}p(y|x) = P\left(y = \mathop{\text{argmax}}_i \left[\frac{\mu(x)}{\sigma(x)}+\varepsilon\right]_i\right)\end{equation}
where $\mu(x)$ is a vector-output network, $\sigma(x)$ is a scalar-output network, and $\varepsilon$ has the same dimension as $\mu(x)$, with each component sampled i.i.d. from $\mathcal{N}(0,1)$. The practice of defining probability distributions from a reparameterization perspective was already discussed in the previous article "Constructing Discrete Probability Distributions from a Reparameterization Perspective", so I won't repeat it here.
Immediately after, with this new $p(y|x)$, the authors directly use
\begin{equation}-\mathbb{E}_{(x,y)\sim\mathcal{D}}[p(y|x)]\label{eq:soft-acc}\end{equation}
as the loss function. The entire theoretical framework basically ends here.
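To make the definition and the loss concrete, here is a small PyTorch sketch based on my own reading of the setup; it is not the authors' code. It uses the standard identity $P\big(y=\mathop{\text{argmax}}_i(s_i+\varepsilon_i)\big)=\mathbb{E}_{t\sim\mathcal{N}(0,1)}\big[\prod_{i\neq y}\Phi(s_y+t-s_i)\big]$ with $s=\mu(x)/\sigma(x)$, estimated by Monte Carlo; the function names, the `log_sigma` parameterization, and the sample count are my own choices.

```python
import torch

def exact_prob(mu, log_sigma, y, n_samples=256):
    """Monte Carlo estimate of p(y|x) = P(y = argmax_i [mu_i(x)/sigma(x) + eps_i]).

    Uses P = E_{t~N(0,1)}[ prod_{i != y} Phi(s_y + t - s_i) ] with s = mu / sigma.
    mu: (batch, num_classes), log_sigma: (batch, 1), y: (batch,) int64 class indices.
    """
    s = mu / log_sigma.exp()                    # scaled logits s_i = mu_i / sigma
    s_y = s.gather(1, y[:, None])               # (batch, 1), score of the target class
    t = torch.randn(n_samples, 1, 1)            # noise samples for the target-class coordinate
    normal = torch.distributions.Normal(0.0, 1.0)
    cdf = normal.cdf(s_y + t - s[None])         # (n_samples, batch, num_classes)
    mask = torch.nn.functional.one_hot(y, s.shape[1]).bool()
    cdf = cdf.masked_fill(mask, 1.0)            # exclude the target class from the product
    return cdf.prod(dim=-1).mean(dim=0)         # (batch,) estimated p(y|x)

def exact_loss(mu, log_sigma, y):
    # Equation (2): negative expected accuracy, averaged over the batch
    return -exact_prob(mu, log_sigma, y).mean()
```

In this sketch, `mu` and `log_sigma` would come from two heads of the network, and training minimizes `exact_loss`; because the noise enters only through the Gaussian CDF, the estimate is differentiable in $\mu$ and $\sigma$, so Equation $\eqref{eq:soft-acc}$ can be minimized by gradient descent.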
Given all this, we can pinpoint what is mysterious about EXACT. From "Constructing Discrete Probability Distributions from a Reparameterization Perspective", we know that, viewed through reparameterization, the noise distribution corresponding to Softmax is the Gumbel distribution, whereas EXACT replaces it with a Normal distribution. What is the benefit of doing so? Why should it work better? The paper does not explain this at all.
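That Softmax corresponds to Gumbel noise is just the classical Gumbel-max trick, which is easy to verify numerically; the logits below are arbitrary values of my own, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.array([1.0, 0.5, -0.3, 2.0])          # arbitrary logits
softmax = np.exp(z) / np.exp(z).sum()

# Gumbel-max trick: P(argmax_i (z_i + g_i) = y) = softmax(z)_y for i.i.d. standard Gumbel noise g
g = rng.gumbel(size=(500_000, len(z)))
freq = np.bincount((z + g).argmax(axis=1), minlength=len(z)) / len(g)

print(np.round(softmax, 3))   # softmax probabilities
print(np.round(freq, 3))      # empirical argmax frequencies; they should be nearly identical
```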
Furthermore, it is already "widely known" that the negative of Equation $\eqref{eq:soft-acc}$ is a smooth approximation of accuracy. But it is equally well known that, in the Softmax case, directly optimizing Equation $\eqref{eq:soft-acc}$ is usually inferior to optimizing Cross-Entropy. EXACT is thus just "old wine" (the same smooth approximation of accuracy) in a "new bottle" (a new way of constructing the probability distribution). Can it really provide an improvement?
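Concretely, "directly optimizing Equation $\eqref{eq:soft-acc}$ in the Softmax case" simply means replacing the usual Cross-Entropy by the negative mean target-class probability; a minimal PyTorch sketch, with hypothetical function names:

```python
import torch
import torch.nn.functional as F

def soft_accuracy_loss(logits, y):
    # Equation (2) up to an additive constant: one minus the mean target-class probability
    return 1.0 - F.softmax(logits, dim=-1).gather(1, y[:, None]).mean()

def ce_loss(logits, y):
    # the usual baseline: mean of -log p(y|x)
    return F.cross_entropy(logits, y)
```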
Experiments Hard to Reproduce
The original paper presents very impressive experimental results, showing that EXACT is almost always SOTA:

[Table: benchmark results from the original paper, omitted here]
However, I implemented EXACT based on my own understanding and tested it on NLP tasks, and it completely failed to reach the level of "Softmax + Cross-Entropy." In addition, the original paper mentions that optimizing $-\log\mathbb{E}_{(x,y)\sim\mathcal{D}}[p(y|x)]$ works better than Equation $\eqref{eq:soft-acc}$, but in my experiments this variant could not even match Equation $\eqref{eq:soft-acc}$. Overall, my conclusions differ considerably from those of the original paper.
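For reference, the two objectives compared in my tests were simply the following (taking the per-sample probabilities from the sketch above; again my own implementation choices):

```python
def exact_objectives(p):
    """p: tensor of per-sample probabilities p(y|x), e.g. the output of exact_prob above."""
    loss_mean = -p.mean()               # Equation (2): -E[p(y|x)]
    loss_log_mean = -(p.mean().log())   # the -log E[p(y|x)] variant mentioned in the paper
    return loss_mean, loss_log_mean
```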
Since the original paper has not released code, I cannot judge the reliability of its experiments any further. However, from my theoretical understanding and these preliminary experiments, it seems very unlikely that directly optimizing Equation $\eqref{eq:soft-acc}$ can match optimizing Cross-Entropy, and merely changing how the probability distribution is constructed is unlikely to yield a substantial improvement. If readers obtain new experimental results, they are welcome to share and discuss them.
A New Perspective
Numerically, Equation $\eqref{eq:soft-acc}$ is indeed closer to accuracy than the Cross-Entropy $\mathbb{E}_{(x,y)\sim\mathcal{D}}[-\log p(y|x)]$. So why does optimizing Cross-Entropy often yield better accuracy? This puzzled me for a long time; in "Revisiting the Class Imbalance Problem: Comparison and Connection Between Adjusting Weights and Modifying Loss", I could only fall back on treating it as an "axiom", which was something of a compromise.
Then one day I suddenly noticed a simple relationship: as training progresses, most of the $p(y|x)$ gradually approach 1, so we can apply the approximation $\log x \approx x - 1$ (accurate near $x=1$) to obtain:
\begin{equation}\mathbb{E}_{(x,y)\sim\mathcal{D}}[-\log p(y|x)]\approx \mathbb{E}_{(x,y)\sim\mathcal{D}}[1 - p(y|x)] = 1 - \mathbb{E}_{(x,y)\sim\mathcal{D}}[p(y|x)]\end{equation}
This explains why optimizing Cross-Entropy also yields good accuracy: the relation above shows that, in the middle and late stages of training, Cross-Entropy is essentially equivalent to Equation $\eqref{eq:soft-acc}$, so it too is optimizing a smooth approximation of accuracy!
What, then, does Cross-Entropy offer over Equation $\eqref{eq:soft-acc}$? The difference lies in how $-\log p(y|x)$ and $1 - p(y|x)$ behave when $p(y|x) \ll 1$, i.e., when the probability assigned to the target class is tiny and the sample is most likely misclassified. In that regime, $-\log p(y|x)$ tends to infinity, while $1 - p(y|x)$ is at most $1$. In other words, Cross-Entropy penalizes misclassified samples far more heavily and is therefore more strongly driven to correct them, while in the later stages of training it still closely tracks the direct smooth approximation of accuracy.
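A tiny numerical illustration of this gap:

```python
import numpy as np

p = np.array([0.999, 0.99, 0.9, 0.5, 0.1, 0.01, 0.001])
print(np.round(-np.log(p), 3))  # Cross-Entropy penalty: nearly equals 1 - p when p is close to 1,
print(np.round(1 - p, 3))       # but grows without bound as p -> 0, whereas 1 - p saturates at 1
```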
From this comparison, we can distill a new perspective on what makes a good loss function:
First, find a smooth approximation of the evaluation metric, preferably expressed as a per-sample expectation. Then stretch the penalty in the incorrect direction towards infinity (so that the model pays more attention to misclassified samples), while keeping a first-order match with the original form in the correct direction.
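Cross-Entropy fits this recipe exactly: expanding $-\log p$ around $p=1$ gives
\begin{equation}-\log p = (1-p) + \frac{(1-p)^2}{2} + \frac{(1-p)^3}{3} + \cdots\end{equation}
so it agrees with the smooth-accuracy penalty $1-p$ to first order near $p=1$, while growing without bound as $p\to 0$.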
Final Summary
This article has mainly discussed how to optimize accuracy directly. It first gave a brief introduction to and review of the recent paper "EXACT: How to Train Your Accuracy", and then offered a personal analysis of why optimizing Cross-Entropy can yield better accuracy.