By 苏剑林 | November 09, 2022
In "CoSENT (1): A More Effective Sentence Vector Scheme than Sentence-BERT", the author proposed a supervised sentence vector scheme named "CoSENT." Since it directly trains the cosine similarity, which is more relevant to the evaluation target, it usually achieves better performance and faster convergence than Sentence-BERT. In "CoSENT (2): How Big is the Gap Between Feature-based Matching and Interactive Matching?", we compared the differences between it and interactive similarity models, showing that its performance on certain tasks can even approach that of interactive similarity models.
At the time, however, I was focused on finding a replacement for Sentence-BERT that was closer to the evaluation target, so the experiments were oriented towards supervised sentence vectors, i.e., feature-based similarity models. It recently occurred to me that CoSENT can also serve as a loss function for interactive similarity models. So how does it compare to the standard choice, cross-entropy? This article supplements that part of the experiments.
When CoSENT was first proposed, it was used as a loss function for supervised sentence vectors:
\[ \log\left(1+\sum_{sim(i,j)>sim(k,l)}e^{\lambda(\cos(u_k,u_l)-\cos(u_i,u_j))}\right) \]where $i,j,k,l$ are four training samples (e.g., four sentences), $u_i,u_j,u_k,u_l$ are the sentence vectors to be learned (e.g., their [CLS] vectors after BERT), $\cos(\cdot,\cdot)$ represents the cosine similarity between two vectors, and $sim(\cdot,\cdot)$ represents their similarity labels. Thus, the definition of this loss function is clear: if you believe the similarity of $(i,j)$ should be greater than the similarity of $(k,l)$, then an $e^{\lambda(\cos(u_k,u_l)-\cos(u_i,u_j))}$ term is added into the log sum.
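For concreteness, here is a minimal PyTorch sketch of this loss (the author's own implementation is in the repository linked further below; the batch layout, the function name, and the default scale $\lambda=20$ here are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def cosent_loss(scores, labels, lam=20.0):
    """CoSENT loss: log(1 + sum of exp(lam * (score(k,l) - score(i,j))))
    over all pairs with sim(i,j) > sim(k,l).

    scores: (B,) predicted similarities for B sentence pairs
            (here, the cosine similarities of the two sentence vectors)
    labels: (B,) similarity labels; only their relative order is used
    lam:    the scale factor lambda
    """
    # diff[a, b] = lam * (scores[b] - scores[a]): the exponent for
    # "pair a should be ranked above pair b"
    diff = lam * (scores[None, :] - scores[:, None])
    # keep only the positions where labels[a] > labels[b]
    mask = labels[:, None] > labels[None, :]
    diff = diff[mask]
    # prepend a single 0 so that logsumexp yields log(1 + sum(exp(...)))
    zero = torch.zeros(1, dtype=scores.dtype, device=scores.device)
    return torch.logsumexp(torch.cat([zero, diff]), dim=0)

# example: 4 pairs of sentence vectors with graded labels
u = torch.randn(4, 768)
v = torch.randn(4, 768)
labels = torch.tensor([1.0, 1.0, 0.5, 0.0])
loss = cosent_loss(F.cosine_similarity(u, v, dim=-1), labels)
```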
From the form of this loss, it is evident that CoSENT was originally designed for supervised training of cosine similarity in feature-based models; the name "CoSENT" even stands for "Cosine Sentence". Setting the cosine similarity aside, however, CoSENT is at its core a loss function that relies only on the relative order of the labels; it has no intrinsic connection to cosine similarity. We can therefore generalize it as:
\[ \log\left(1+\sum_{sim(i,j)>sim(k,l)}e^{\lambda(f(k,l)-f(i,j))}\right) \]where $f(\cdot,\cdot)$ is any function that outputs a scalar (generally no activation function is needed), representing the similarity model to be learned. This includes "interactive similarity" models where two inputs are concatenated into a single text input for BERT!
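To make the interactive case concrete, here is a sketch using HuggingFace transformers (not the author's code linked below; the model name, example sentences, and labels are placeholders): each sentence pair is fed to BERT as one concatenated input, a one-node head produces the raw scalar $f(\cdot,\cdot)$, and the `cosent_loss` sketched above is applied to it.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
# num_labels=1 gives a single raw logit per input, playing the role of f(.,.)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=1
)

sent_a = ["怎么开通花呗", "今天天气怎么样"]
sent_b = ["花呗如何开通", "明天会下雨吗"]
labels = torch.tensor([1.0, 0.0])  # only the relative order of the labels is used

# interactive matching: the two sentences of each pair are concatenated
# into one input sequence ([CLS] a [SEP] b [SEP])
batch = tokenizer(sent_a, sent_b, padding=True, truncation=True, return_tensors="pt")
logits = model(**batch).logits.squeeze(-1)  # shape (B,), no activation applied

loss = cosent_loss(logits, labels)  # the same loss as in the sketch above
loss.backward()
```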
The conventional way to train an interactive similarity model is to attach a two-node output head followed by softmax and to use cross-entropy (abbreviated as CE in the tables below) as the loss function; this is equivalent to adding a sigmoid activation on top of $f(\cdot,\cdot)$ and using binary cross-entropy. However, this approach only suits binary labels. If the labels are continuous scores (e.g., STS-B scores range from 0 to 5), it is a poor fit, and the task usually has to be recast as regression instead. CoSENT has no such limitation, because it only needs the order information of the labels, which is also consistent with the commonly used evaluation metric, the Spearman correlation coefficient.
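To spell out that equivalence: writing the two output nodes as $z_0$ and $z_1$, the softmax probability of the "similar" class is
\[ p_1 = \frac{e^{z_1}}{e^{z_0}+e^{z_1}} = \frac{1}{1+e^{-(z_1-z_0)}} = \sigma(z_1-z_0), \]so taking $f(\cdot,\cdot) = z_1 - z_0$, a two-node softmax trained with cross-entropy is exactly a single scalar trained with sigmoid plus binary cross-entropy.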
For the comparative experiments between the two, refer to the following code:
https://github.com/bojone/CoSENT/blob/main/accuracy/interact_cosent.py
The experimental results are as follows. First, the Spearman correlation coefficients on the test sets:
| Model | ATEC | BQ | LCQMC | PAWSX | avg |
|---|---|---|---|---|---|
| BERT + CE | 48.01 | 71.96 | 78.53 | 68.59 | 66.77 |
| BERT + CoSENT | 48.09 | 72.25 | 78.70 | 69.34 | 67.10 |
| RoBERTa + CE | 49.70 | 73.20 | 79.13 | 70.52 | 68.14 |
| RoBERTa + CoSENT | 49.82 | 73.09 | 78.78 | 70.54 | 68.06 |

Next, the accuracy on the test sets:

| Model | ATEC | BQ | LCQMC | PAWSX | avg |
|---|---|---|---|---|---|
| BERT + CE | 85.38 | 83.57 | 88.10 | 81.45 | 84.63 |
| BERT + CoSENT | 85.55 | 83.73 | 87.92 | 81.85 | 84.76 |
| RoBERTa + CE | 85.97 | 84.67 | 88.14 | 82.85 | 85.41 |
| RoBERTa + CoSENT | 86.06 | 84.23 | 88.14 | 83.03 | 85.37 |
As we can see, there are no surprises: CE and CoSENT perform essentially the same. If we have to dig for subtle differences, CoSENT does relatively better with BERT, while with RoBERTa there is essentially no gap; likewise, the gain from CoSENT is more noticeable on PAWSX, whereas on the other tasks the two are largely on par. We can therefore "weakly" draw a conclusion:
When the model is weaker (BERT is weaker than RoBERTa) or the task is harder (PAWSX is relatively harder than the other three tasks), CoSENT might achieve better results than CE.
Note the word "might"; I cannot guarantee it. To be pragmatically honest, I don't believe the two constitute any significant difference. However, one can guess that because the forms of the two loss functions are distinctly different, even if the final metrics are similar, there should be some differences within the models. In that case, perhaps model ensemble could be considered?
This article mainly reflects on and experiments with the feasibility of CoSENT in interactive similarity models. The final conclusion is that it is "feasible, but provides no significant improvement in performance."
Original link: https://kexue.fm/archives/9341