By 苏剑林 | October 13, 2023
As is well-known, the standard loss for classification tasks is Cross Entropy (equivalent to Maximum Likelihood Estimation, or MLE). It is characterized by its simplicity and efficiency, but in certain scenarios, it reveals issues such as deviation from evaluation metrics and overconfidence. Correspondingly, there have been many improvement efforts. We have previously introduced some, such as "Revisiting the Class Imbalance Problem: Comparison and Connection between Weight Adjustment and Modified Loss", "How to Train Your Accuracy?", and "A Simple Solution to Mitigate Overconfidence in Cross Entropy". Since the training of Large Language Models (LLMs) can also be understood as a token-by-token classification task with cross-entropy as the default loss, these improvement works remain valuable in today's LLM-dominated era.
In this article, we introduce a work titled "EMO: Earth Mover Distance Optimization for Auto-Regressive Language Modeling". It proposes a new improved loss function, EMO, based on the principle of Optimal Transport, claiming to significantly improve the fine-tuning effects of LLMs. Let’s explore the details.
Suppose $p_i$ is the probability of the $i$-th category predicted by the model, $i=1,2,\dots,n$, and $t$ is the target category. The cross-entropy loss is:
\begin{equation}\mathcal{L} = - \log p_t\end{equation}If the label $t$ is represented as a distribution $\tau$ in one-hot form (i.e., $\tau_t=1, \tau_i=0$ for $i \neq t$ and $i \in [1,n]$), it can be rewritten as:
\begin{equation}\mathcal{L} = - \sum_i \tau_i \log p_i\end{equation}This form also applies to non-one-hot labels $\tau$ (i.e., soft labels), and it is equivalent to optimizing the KL divergence between $\tau$ and $p$:
\begin{equation}KL(\tau\Vert p) = \sum_i \tau_i\log \frac{\tau_i}{p_i} = \color{skyblue}{\sum_i \tau_i\log \tau_i} - \sum_i \tau_i\log p_i\end{equation}When $\tau$ is given, the first term on the far right is a constant, so the objective is equivalent to cross-entropy.
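As a quick sanity check, here is a small NumPy snippet (my own illustration, not from the paper) verifying that with a soft label $\tau$ the cross-entropy and the KL divergence differ only by the constant $\sum_i \tau_i\log \tau_i$, which does not depend on the model output $p$:
```python
import numpy as np

rng = np.random.default_rng(0)
tau = rng.dirichlet(np.ones(5))   # a soft label distribution
p = rng.dirichlet(np.ones(5))     # a model-predicted distribution

cross_entropy = -np.sum(tau * np.log(p))
kl = np.sum(tau * np.log(tau / p))
const = np.sum(tau * np.log(tau))  # the tau-only term, constant w.r.t. p

print(np.isclose(kl, const + cross_entropy))  # True
```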
This result indicates that when we perform MLE, or use cross-entropy as a loss, we are essentially minimizing the KL divergence between the target distribution and the predicted distribution. Since the natural generalization of the KL divergence is the family of f-divergences (see "Introduction to f-GAN: The Production Workshop of GAN Models"), it is natural to wonder whether switching to another f-divergence might bring improvements. Indeed, many works have followed this path. For example, the method introduced in "A Simple Solution to Mitigate Overconfidence in Cross Entropy" takes the "Total Variation distance" (also a type of f-divergence) as its starting point.
However, every f-divergence has its shortcomings to some degree. When it comes to an ideal metric between probability distributions, the "Earth Mover's Distance (EMD)", built on the ideas of Optimal Transport, stands out. Readers unfamiliar with it can refer to the author's earlier post "From Wasserstein Distance and Duality Theory to WGAN".
Simply put, the Earth Mover's Distance is defined as the optimal transport cost between two distributions:
\begin{equation}\mathcal{C}[p,\tau]=\inf_{\gamma\in \Pi[p,\tau]} \sum_{i,j} \gamma_{i,j} c_{i,j} \end{equation}Here $\gamma\in \Pi[p,\tau]$ means that $\gamma$ is any joint distribution with $p$ and $\tau$ as its marginal distributions, $c_{i,j}$ is a pre-specified cost function representing the "cost of transporting from $i$ to $j$," and $\inf$ denotes the infimum, i.e., the lowest possible transportation cost is taken as the measure of the difference between $p$ and $\tau$. Just as switching from vanilla GANs based on f-divergence to Wasserstein GANs based on optimal transport yields better convergence properties, we hope that replacing the classification loss with the W-distance between the two distributions will also lead to better results.
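To make the definition concrete, here is a small sketch (NumPy + SciPy, my own illustration rather than anything from the paper) that computes the Earth Mover's Distance between two discrete distributions by solving the underlying linear program directly; the cost matrix here is chosen arbitrarily for demonstration:
```python
import numpy as np
from scipy.optimize import linprog

def emd(p, tau, c):
    """EMD between discrete distributions p and tau under cost matrix c,
    obtained by solving the transport linear program directly."""
    n, m = len(p), len(tau)
    # Decision variable: gamma flattened row-major to shape (n*m,), gamma >= 0.
    # Row-sum constraints: sum_j gamma[i, j] = p[i]
    A_rows = np.zeros((n, n * m))
    for i in range(n):
        A_rows[i, i * m:(i + 1) * m] = 1.0
    # Column-sum constraints: sum_i gamma[i, j] = tau[j]
    A_cols = np.zeros((m, n * m))
    for j in range(m):
        A_cols[j, j::m] = 1.0
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([p, tau])
    res = linprog(c.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun  # the optimal (lowest) transport cost

rng = np.random.default_rng(0)
p, tau = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
c = 1.0 - np.eye(4)  # unit cost for moving mass to a different index
print(emd(p, tau, c))
```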
When $\tau$ is a one-hot distribution, the target distribution is a single point $t$. In this case, "optimality" is irrelevant because there is only one transport scheme: moving everything from $p$ to the same point $t$. Thus:
\begin{equation}\mathcal{C}[p,\tau]= \sum_i p_i c_{i,t} \label{eq:emo}\end{equation}If $\tau$ is a general soft label distribution, calculating $\mathcal{C}[p,\tau]$ is a linear programming problem, which is complex to solve. Since the distribution defined by $p_i \tau_j$ belongs to $\Pi[p,\tau]$, we have:
\begin{equation}\mathcal{C}[p,\tau]=\inf_{\gamma\in \Pi[p,\tau]} \sum_{i,j} \gamma_{i,j} c_{i,j} \leq \sum_{i,j} p_i \tau_j c_{i,j} \end{equation}This is an upper bound that is easy to compute and can also serve as an optimization target. Equation \eqref{eq:emo} corresponds to $\tau_j = \delta_{j,t}$, where $\delta$ is the Kronecker delta function.
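Continuing the sketch above, we can check numerically that the independent coupling $\gamma_{i,j}=p_i\tau_j$ is a feasible transport plan whose cost upper-bounds the true EMD, and that for a one-hot $\tau$ the bound collapses exactly to Equation \eqref{eq:emo}:
```python
# The independent coupling gamma[i, j] = p[i] * tau[j] is feasible,
# so its cost is an upper bound on the optimal transport cost.
upper_bound = np.sum(np.outer(p, tau) * c)
print(emd(p, tau, c) <= upper_bound + 1e-9)   # True

# With a one-hot tau, the only feasible plan moves everything to t,
# so the EMD equals sum_i p[i] * c[i, t].
t = 2
tau_onehot = np.eye(4)[t]
print(np.isclose(emd(p, tau_onehot, c), np.dot(p, c[:, t])))  # True
```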
Now back to the scenario the original paper focuses on—fine-tuning of LLMs, including continued pre-training and fine-tuning for downstream tasks. As mentioned at the beginning of this article, LLM training can be understood as a token-by-token classification task (where the categories are all tokens). Each label is one-hot, making it suitable for Equation \eqref{eq:emo}.
Equation \eqref{eq:emo} still requires defining the cost function $c_{i,t}$. If we simply assume that as long as $i \neq t$, the cost is 1 (i.e., $c_{i,t} = 1 - \delta_{i,t}$), then:
\begin{equation}\mathcal{C}[p,\tau]= \sum_i p_i c_{i,t} = \sum_i (p_i - p_i \delta_{i, t}) = 1 - p_t\end{equation}This is essentially a smooth approximation of maximizing accuracy (refer to "Discussions on Function Smoothing: Differentiable Approximations of Non-differentiable Functions"). However, intuitively, penalizing all $i \neq t$ equally seems too simplistic. Ideally, different costs should be designed for different $i$ based on similarity—the greater the similarity, the lower the transport cost. Thus, we can design the transport cost as:
\begin{equation}c_{i,t} = 1 - \cos(\boldsymbol{e}_i,\boldsymbol{e}_t) = 1 - \left\langle\frac{\boldsymbol{e}_i}{\|\boldsymbol{e}_i\|}, \frac{\boldsymbol{e}_t}{\|\boldsymbol{e}_t\|}\right\rangle\end{equation}Here $\boldsymbol{e}_i, \boldsymbol{e}_t$ are pre-obtained Token Embeddings. The original paper uses the LM Head of the pre-trained model as the Token Embeddings. According to the definition of optimal transport, the cost function must be fixed beforehand; therefore, the Token Embeddings used to calculate similarity must remain constant during training.
With the cost function, we can calculate:
\begin{equation}\mathcal{C}[p,\tau]= \sum_i p_i c_{i,t} = \sum_i \left(p_i - p_i \left\langle\frac{\boldsymbol{e}_i}{\|\boldsymbol{e}_i\|}, \frac{\boldsymbol{e}_t}{\|\boldsymbol{e}_t\|}\right\rangle\right) = 1 - \left\langle \sum_i p_i \frac{\boldsymbol{e}_i}{\|\boldsymbol{e}_i\|}, \frac{\boldsymbol{e}_t}{\|\boldsymbol{e}_t\|}\right\rangle\end{equation}This is the final training loss for EMO (Earth Mover Distance Optimization). Since the embedding_size is typically much smaller than the vocab_size, first calculating $\sum_i p_i \frac{\boldsymbol{e}_i}{\|\boldsymbol{e}_i\|}$ can significantly reduce the computational load.
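For concreteness, below is a minimal PyTorch-style sketch of this loss (my own reading of the formula above, not the authors' code), assuming `logits` of shape (batch, vocab_size), target token ids `targets`, and a frozen embedding matrix `emb` of shape (vocab_size, embedding_size) taken from the LM Head:
```python
import torch
import torch.nn.functional as F

def emo_loss(logits, targets, emb):
    """A minimal sketch of the EMO loss described above.

    logits:  (batch, vocab_size) raw scores from the LM head
    targets: (batch,) target token ids
    emb:     (vocab_size, embedding_size) frozen token embeddings
    """
    # The cost function must stay fixed, so the embeddings take no gradient.
    e = F.normalize(emb.detach(), dim=-1)           # e_i / ||e_i||
    p = logits.softmax(dim=-1)                      # (batch, vocab_size)
    # Compute sum_i p_i * e_i / ||e_i|| first: a single (batch, emb_size) matmul,
    # which is what makes the loss cheap relative to the vocab_size.
    weighted = p @ e                                # (batch, embedding_size)
    target_e = e[targets]                           # (batch, embedding_size)
    loss = 1.0 - (weighted * target_e).sum(dim=-1)  # 1 - <sum_i p_i e_i/||e_i||, e_t/||e_t||>
    return loss.mean()
```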
Since the author's research on LLM is currently at the pre-training stage and has not yet touched upon fine-tuning, there are no personal experimental results yet. Let's look at the original paper's experiments for now. It must be said that the experimental results in the original paper are quite striking.
First are the continued pre-training experiments on small models. Compared to Cross-Entropy (MLE), the improvements are as high as 10 points in some cases, and it achieved SOTA across the board:

It is worth mentioning that the evaluation metric used here is MAUVE (the larger, the better), proposed in "MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers". It is one of the automatic evaluation metrics most highly correlated with human evaluation. Additionally, the TaiLr comparison method was briefly introduced in "A Simple Solution to Mitigate Overconfidence in Cross Entropy".
Some readers might wonder if EMO is better simply because the evaluation metric was chosen favorably. Not so. Surprisingly, models trained with EMO even show better PPL (perplexity), despite PPL being more directly related to MLE:

Then comes the performance of fine-tuning LLAMA-7B/13B for few-shot tasks on downstream datasets, which is also excellent:

Finally, the paper compared effects across different model scales and data sizes, showing that EMO performs well across various configurations:

Overall, the original paper's "report card" is very impressive, and EMO is worth trying. The only concern might be that the amount of data used in the paper's experiments is not actually that large; it remains unclear whether the gap between EMO and MLE will narrow once the data volume increases significantly.
In the author's view, the reason EMO achieves better results is that it uses embedding similarity to assign more reasonable losses to "synonyms," thereby making the model's learning more sensible. Although LLM training takes the form of a classification task, it is not a simple right-or-wrong problem: just because the predicted next token is not identical to the target token does not mean the resulting sentence is unreasonable. Therefore, introducing semantic similarity into the design of the loss is helpful for LLM training. One could further speculate that for a larger vocab_size and coarser token granularity, EMO's effect might be even more pronounced, because a larger vocab_size potentially means more "synonyms."
Of course, introducing semantic similarity also means EMO is not suitable for training from scratch, because it requires the LM Head of a pre-trained model to serve as the Token Embeddings. One possible workaround is to obtain the Token Embeddings in some other way, for example by pre-training them with classic Word2Vec, but this carries a risk: Token Embeddings trained with such classic methods may lower the ceiling of the LLM's capability (due to the inconsistency between the two).
Furthermore, even if the Token Embeddings are not an issue, training from scratch with pure EMO might suffer from slow convergence. This is based on the perspective on loss functions proposed at the end of the author's post "How to Train Your Accuracy?":
First, find a smooth approximation of the evaluation metric, preferably expressed as an expectation for each sample. Then, drive the error in the wrong direction towards infinity (to ensure the model focuses more on incorrect samples), while ensuring it remains a first-order approximation of the original form in the correct direction.
That is to say, to ensure convergence speed (for training from scratch), it is best to push errors in the wrong direction toward infinity, whereas EMO clearly does not satisfy this. Therefore, when using EMO for training from scratch, it would likely require a weighted combination of EMO and MLE to balance convergence speed with final performance.
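One straightforward way to realize such a combination, reusing the `emo_loss` sketch from earlier (the weighting coefficient here is purely illustrative and not taken from the paper):
```python
def mixed_loss(logits, targets, emb, lam=0.5):
    # Weighted combination of the optimal-transport term and the MLE term;
    # lam is a hypothetical hyperparameter balancing final quality vs. convergence speed.
    mle = F.cross_entropy(logits, targets)
    emo = emo_loss(logits, targets, emb)
    return lam * emo + (1.0 - lam) * mle
```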
This article introduced a new "replacement" for the cross-entropy loss—EMO, based on Optimal Transport. Unlike the minor improvements seen in the past, EMO achieved noticeably significant gains in LLM fine-tuning experiments.