By 苏剑林 | April 26, 2021
At the beginning of this year, inspired by BERT-flow, I conceived the "BERT-whitening" method, which briefly became the new SOTA for semantic similarity (see "You Might Not Need BERT-flow: A Simple Linear Transformation Competes with BERT-flow"; the paper is "Whitening Sentence Representations for Better Semantics and Faster Retrieval"). However, the good times did not last long: shortly after BERT-whitening was submitted to arXiv, at least two new papers appeared there with results significantly better than BERT-whitening.
The first is "Generating Datasets with Pretrained Language Models", which uses templates to construct data pairs from GPT2_XL in an unsupervised way and then trains a similarity model on them. I find the idea enlightening and the method effective, but its reproduction cost and variance are both rather large. The other is the protagonist of this article, "SimCSE: Simple Contrastive Learning of Sentence Embeddings". The proposed SimCSE significantly outperforms BERT-flow and BERT-whitening on English data, and the method is remarkably simple.
So, is SimCSE equally effective in Chinese? Can it substantially improve Chinese semantic similarity results? This article aims to provide some supplementary experiments.
Open Source Address: https://github.com/bojone/SimCSE
SimCSE
First, let's briefly introduce SimCSE. In fact, SimCSE can be seen as a simplified version of SimBERT (for SimBERT, please read "Having Your Cake and Eating It Too: SimBERT Model Integrating Retrieval and Generation"). Its simplifications are as follows:
1. SimCSE removes the generative part of SimBERT, retaining only the retrieval model;
2. Since SimCSE has no labeled data, each sentence is paired with itself as its own "similar sentence" for input.
In short, it essentially uses (self, self) as a positive example and (self, others) as negative examples to train a contrastive learning model. Of course, it is not quite that simple. If exactly identical samples were used as positive pairs, the generalization ability would be greatly compromised. Generally, we use data augmentation techniques to introduce differences between the two positive samples. However, data augmentation in NLP is a difficult problem in itself. SimCSE proposes an extremely simple solution: use Dropout directly as data augmentation!
Specifically, $N$ sentences pass through an Encoder with Dropout to obtain vectors $\boldsymbol{h}^{(0)}_1, \boldsymbol{h}^{(0)}_2, \dots, \boldsymbol{h}^{(0)}_N$. Then, this batch of sentences passes through the Encoder again (this time with a different random Dropout) to obtain vectors $\boldsymbol{h}^{(1)}_1, \boldsymbol{h}^{(1)}_2, \dots, \boldsymbol{h}^{(1)}_N$. We can treat $(\boldsymbol{h}^{(0)}_i, \boldsymbol{h}^{(1)}_i)$ as a pair of (slightly different) positive examples. The training objective is then:
\begin{equation}-\sum_{i=1}^N\sum_{\alpha=0,1}\log \frac{e^{\cos(\boldsymbol{h}^{(\alpha)}_i, \boldsymbol{h}^{(1-\alpha)}_i)/\tau}}{\sum\limits_{j=1,j\neq i}^N e^{\cos(\boldsymbol{h}^{(\alpha)}_i, \boldsymbol{h}^{(\alpha)}_j)/\tau} + \sum\limits_{j=1}^N e^{\cos(\boldsymbol{h}^{(\alpha)}_i, \boldsymbol{h}^{(1-\alpha)}_j)/\tau}}\end{equation}
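To make the objective concrete, here is a minimal PyTorch sketch of it (an illustration, not the bert4keras code in the repository above; `tau = 0.05` follows the original paper, and `encoder` is a placeholder for any sentence encoder with Dropout active):

```python
# A minimal sketch of the SimCSE objective above (illustrative, not the
# repository's bert4keras implementation). Each sentence is encoded twice
# with different Dropout masks; the loss is cross-entropy over the cosine
# similarity matrix of the 2N vectors, with self-similarities masked out,
# which matches the double sum over i and alpha in the formula.
import torch
import torch.nn.functional as F

def simcse_loss(h0: torch.Tensor, h1: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """h0, h1: [N, d] vectors of the same N sentences from two Dropout-perturbed passes."""
    h = F.normalize(torch.cat([h0, h1], dim=0), dim=-1)   # [2N, d], unit-normalized
    sim = h @ h.t() / tau                                  # cos(h_i, h_j) / tau
    sim.fill_diagonal_(float("-inf"))                      # drop the trivial cos(h, h) = 1 terms
    n = h0.size(0)
    # the positive for row i is the other Dropout view of the same sentence
    labels = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, labels)

# usage (encoder is a placeholder): h0 = encoder(batch); h1 = encoder(batch)
# loss = simcse_loss(h0, h1)   # Dropout makes the two passes differ
```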
English Effect
The original paper contains fairly extensive (English) experiments; interested readers can consult it directly. Note, however, that the evaluation metrics in the main-text tables of the original paper are not consistent with those of BERT-flow and BERT-whitening; the tables with consistent metrics are in the appendix:
(Figure: SimCSE compared with BERT-flow and BERT-whitening, from the paper's appendix)
Whichever comparison you look at, SimCSE is clearly superior to BERT-flow and BERT-whitening. So, is this advantage universal? Does it hold in Chinese as well? Let's run our own experiments.
Experimental Configuration
Our Chinese experiments are basically aligned with "Which Unsupervised Semantic Similarity is Strongest? We Conducted a Comprehensive Evaluation", including the five tasks previously tested, four types of Pooling, and all base, small, and tiny versions of the models. The large version was not run because it triggered OOM errors under the same configuration.
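For reference, the scoring step works on pairs of sentence vectors; a minimal sketch is below, assuming (as in the BERT-whitening experiments) that the metric is the Spearman correlation between the cosine similarity of each sentence pair and its gold label:

```python
# Sketch of the scoring step, assuming the metric is Spearman correlation
# between cosine similarity and the gold label (as in the BERT-whitening setup).
import numpy as np
from scipy.stats import spearmanr

def evaluate(vecs_a: np.ndarray, vecs_b: np.ndarray, labels: np.ndarray) -> float:
    """vecs_a, vecs_b: [num_pairs, d] sentence vectors; labels: gold similarity scores."""
    a = vecs_a / np.linalg.norm(vecs_a, axis=1, keepdims=True)
    b = vecs_b / np.linalg.norm(vecs_b, axis=1, keepdims=True)
    cos_sim = (a * b).sum(axis=1)       # cosine similarity for each sentence pair
    return spearmanr(cos_sim, labels).correlation
```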
After parameter tuning, I found that the optimal parameters for Chinese tasks are not entirely consistent with those in the original paper. The specific differences are as follows:
1. The original paper used batch_size=512; here, it is batch_size=64 (I really can't afford such an expensive batch_size);
2. The learning rate in the original paper was 5e-5; here, it is 1e-5;
3. The optimal dropout ratio in the original paper was 0.1; here, it is 0.3;
4. The unsupervised SimCSE in the original paper was trained on additional data; here, 10,000 task data items were randomly selected for training;
5. The original unsupervised training included an MLM task; here, only SimCSE training is performed.
To expand on the fourth point: the unsupervised SimCSE in the original paper was trained on 1 million sentences sampled from Wikipedia. For the Chinese experiments, for convenience and fairness of comparison, I trained directly on the task data (using only the sentences, not the labels, so it remains unsupervised). However, except for PAWSX, the other four tasks do not need all of their data for training: testing showed that randomly selecting 10,000 training sentences and training for one epoch is enough to reach the best results (performance degrades with either more or fewer samples).
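Putting these settings together, a training loop might look as follows (a HuggingFace-style illustration under the stated hyperparameters, not the repository's bert4keras code; `bert-base-chinese`, `all_task_sentences`, and the `simcse_loss` from the earlier sketch are placeholders):

```python
# Training sketch under the settings above: batch_size 64, learning rate 1e-5,
# dropout 0.3, 10,000 randomly sampled (unlabeled) task sentences, one epoch.
# Illustrative only; model name and data loading are placeholders.
import random
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-chinese"        # placeholder: any of the tested encoders
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME,
                                  hidden_dropout_prob=0.3,          # dropout raised to 0.3
                                  attention_probs_dropout_prob=0.3)
model.train()                            # keep Dropout active during both passes
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

all_task_sentences = [...]               # placeholder: the task's sentences, no labels
sentences = random.sample(all_task_sentences, 10000)

for start in range(0, len(sentences), 64):            # one epoch, batch_size = 64
    batch = tokenizer(sentences[start:start + 64], padding=True,
                      truncation=True, return_tensors="pt")
    h0 = model(**batch).last_hidden_state[:, 0]        # [CLS] vector, first pass
    h1 = model(**batch).last_hidden_state[:, 0]        # second pass, new Dropout mask
    loss = simcse_loss(h0, h1)                          # from the earlier sketch
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```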
Chinese Effect
All experimental results for SimCSE in Chinese are as follows:
$$\small{\begin{array}{l|ccccc}
\hline
& \text{ATEC} & \text{BQ} & \text{LCQMC} & \text{PAWSX} & \text{STS-B} \\
\hline
\text{BERT}\text{-P1} & 16.59 / 20.61 / \color{green}{33.14} & 29.35 / 25.76 / \color{green}{50.67} & 41.71 / 48.92 / \color{green}{69.99} & 15.15 / 17.03 / \color{red}{12.95} & 34.65 / 61.19 / \color{green}{69.04} \\
\text{BERT}\text{-P2} & 9.46 / 22.16 / \color{green}{25.18} & 16.97 / 18.97 / \color{green}{41.19} & 28.42 / 49.61 / \color{green}{56.45} & 13.93 / 16.08 / \color{red}{12.46} & 21.66 / 60.75 / \color{red}{57.63} \\
\text{BERT}\text{-P3} & 20.79 / 18.27 / \color{green}{32.89} & 33.08 / 22.58 / \color{green}{49.58} & 59.22 / 60.12 / \color{green}{71.83} & 16.68 / 18.37 / \color{red}{14.47} & 57.48 / 63.97 / \color{green}{70.08} \\
\text{BERT}\text{-P4} & 24.51 / 27.00 / \color{green}{31.96} & 38.81 / 32.29 / \color{green}{48.40} & 64.75 / 64.75 / \color{green}{71.49} & 15.12 / 17.80 / \color{red}{16.01} & 61.66 / 69.45 / \color{green}{70.03} \\
\hline
\text{RoBERTa}\text{-P1} & 24.61 / 29.59 / \color{green}{32.23} & 40.54 / 28.95 / \color{green}{50.61} & 70.55 / 70.82 / \color{green}{74.22} & 16.23 / 17.99 / \color{red}{12.25} & 66.91 / 69.19 / \color{green}{71.13} \\
\text{RoBERTa}\text{-P2} & 20.61 / 28.91 / \color{red}{20.07} & 31.14 / 27.48 / \color{green}{39.92} & 65.43 / 70.62 / \color{red}{62.65} & 15.71 / 17.30 / \color{red}{12.00} & 59.50 / 70.77 / \color{red}{61.49} \\
\text{RoBERTa}\text{-P3} & 26.94 / 29.94 / \color{green}{32.66} & 40.71 / 30.95 / \color{green}{51.03} & 66.80 / 68.00 / \color{green}{73.15} & 16.08 / 19.01 / \color{red}{16.47} & 61.67 / 66.19 / \color{green}{70.14} \\
\text{RoBERTa}\text{-P4} & 27.94 / 28.33 / \color{green}{32.40} & 43.09 / 33.49 / \color{green}{49.78} & 68.43 / 67.86 / \color{green}{72.74} & 15.02 / 17.91 / \color{red}{16.39} & 64.09 / 69.74 / \color{green}{70.11} \\
\hline
\text{NEZHA}\text{-P1} & 17.39 / 18.83 / \color{green}{32.14} & 29.63 / 21.94 / \color{green}{46.08} & 40.60 / 50.52 / \color{green}{60.38} & 14.90 / 18.15 / \color{red}{16.60} & 35.84 / 60.84 / \color{green}{68.50} \\
\text{NEZHA}\text{-P2} & 10.96 / 23.08 / \color{red}{15.70} & 17.38 / 28.81 / \color{green}{32.20} & 22.66 / 49.12 / \color{red}{21.07} & 13.45 / 18.05 / \color{red}{12.68} & 21.16 / 60.11 / \color{red}{43.35} \\
\text{NEZHA}\text{-P3} & 23.70 / 21.93 / \color{green}{31.47} & 35.44 / 22.44 / \color{green}{46.69} & 60.94 / 62.10 / \color{green}{69.65} & 18.35 / 21.72 / \color{red}{18.17} & 60.35 / 68.57 / \color{green}{70.68} \\
\text{NEZHA}\text{-P4} & 27.72 / 25.31 / \color{green}{30.26} & 44.18 / 31.47 / \color{green}{46.57} & 65.16 / 66.68 / \color{green}{67.21} & 13.98 / 16.66 / \color{red}{14.41} & 61.94 / 69.55 / \color{red}{68.18} \\
\hline
\text{WoBERT}\text{-P1} & 23.88 / 22.45 / \color{green}{32.66} & 43.08 / 32.52 / \color{green}{49.13} & 68.56 / 67.89 / \color{green}{72.99} & 18.15 / 19.92 / \color{red}{12.36} & 64.12 / 66.53 / \color{green}{70.00} \\
\text{WoBERT}\text{-P2} & \text{-} & \text{-} & \text{-} & \text{-} & \text{-} \\
\text{WoBERT}\text{-P3} & 24.62 / 22.74 / \color{green}{34.03} & 40.64 / 28.12 / \color{green}{49.77} & 64.89 / 65.22 / \color{green}{72.44} & 16.83 / 20.56 / \color{red}{14.55} & 59.43 / 66.57 / \color{green}{70.96} \\
\text{WoBERT}\text{-P4} & 25.97 / 27.24 / \color{green}{33.67} & 42.37 / 32.34 / \color{green}{49.09} & 66.53 / 65.62 / \color{green}{71.74} & 15.54 / 18.85 / \color{red}{14.00} & 61.37 / 68.11 / \color{green}{70.00} \\
\hline
\text{RoFormer}\text{-P1} & 24.29 / 26.04 / \color{green}{32.33} & 41.91 / 28.13 / \color{green}{49.13} & 64.87 / 60.92 / \color{green}{71.61} & 20.15 / 23.08 / \color{red}{15.25} & 59.91 / 66.96 / \color{green}{69.45} \\
\text{RoFormer}\text{-P2} & \text{-} & \text{-} & \text{-} & \text{-} & \text{-} \\
\text{RoFormer}\text{-P3} & 24.09 / 28.51 / \color{green}{34.23} & 39.09 / 34.92 / \color{green}{50.01} & 63.55 / 63.85 / \color{green}{72.01} & 16.53 / 18.43 / \color{red}{15.25} & 58.98 / 55.30 / \color{green}{71.44} \\
\text{RoFormer}\text{-P4} & 25.92 / 27.38 / \color{green}{34.10} & 41.75 / 32.36 / \color{green}{49.58} & 66.18 / 65.45 / \color{green}{71.84} & 15.30 / 18.36 / \color{red}{15.17} & 61.40 / 68.02 / \color{green}{71.40} \\
\hline
\text{SimBERT}\text{-P1} & 38.50 / 23.64 / \color{green}{36.98} & 48.54 / 31.78 / \color{green}{51.47} & 76.23 / 75.05 / \color{red}{74.87} & 15.10 / 18.49 / \color{red}{12.66} & 74.14 / 73.37 / \color{green}{75.12} \\
\text{SimBERT}\text{-P2} & 38.93 / 27.06 / \color{green}{37.00} & 49.93 / 35.38 / \color{green}{50.33} & 75.56 / 73.45 / \color{red}{72.61} & 14.52 / 18.51 / \color{green}{19.72} & 73.18 / 73.43 / \color{green}{75.13} \\
\text{SimBERT}\text{-P3} & 36.50 / 31.32 / \color{green}{37.81} & 45.78 / 29.17 / \color{green}{51.24} & 74.42 / 73.79 / \color{green}{73.85} & 15.33 / 18.39 / \color{red}{12.48} & 67.31 / 70.70 / \color{green}{73.18} \\
\text{SimBERT}\text{-P4} & 33.53 / 29.04 / \color{green}{36.93} & 45.28 / 34.70 / \color{green}{50.09} & 73.20 / 71.22 / \color{green}{73.42} & 14.16 / 17.32 / \color{red}{16.59} & 66.98 / 70.55 / \color{green}{72.64} \\
\hline
\text{SimBERT}_{\text{small}}\text{-P1} & 30.68 / 27.56 / \color{green}{31.16} & 43.41 / 30.89 / \color{green}{44.80} & 74.73 / 73.21 / \color{green}{74.32} & 15.89 / 17.96 / \color{red}{14.69} & 70.54 / 71.39 / \color{red}{69.85} \\
\text{SimBERT}_{\text{small}}\text{-P2} & 31.00 / 29.14 / \color{green}{30.76} & 43.76 / 36.86 / \color{green}{45.50} & 74.21 / 73.14 / \color{green}{74.55} & 16.17 / 18.12 / \color{red}{15.18} & 70.10 / 71.40 / \color{red}{69.18} \\
\text{SimBERT}_{\text{small}}\text{-P3} & 30.03 / 21.24 / \color{green}{30.07} & 43.72 / 31.69 / \color{green}{44.27} & 72.12 / 70.27 / \color{green}{71.21} & 16.93 / 21.68 / \color{red}{12.10} & 66.55 / 66.11 / \color{red}{64.95} \\
\text{SimBERT}_{\text{small}}\text{-P4} & 29.52 / 28.41 / \color{green}{28.56} & 43.52 / 36.56 / \color{green}{43.38} & 70.33 / 68.75 / \color{red}{68.35} & 15.39 / 21.57 / \color{red}{14.47} & 64.73 / 68.12 / \color{red}{63.23} \\
\hline
\text{SimBERT}_{\text{tiny}}\text{-P1} & 30.51 / 24.67 / \color{green}{30.04} & 44.25 / 31.75 / \color{green}{43.89} & 74.27 / 72.25 / \color{green}{73.47} & 16.01 / 18.07 / \color{red}{12.51} & 70.11 / 66.39 / \color{green}{70.11} \\
\text{SimBERT}_{\text{tiny}}\text{-P2} & 30.01 / 27.66 / \color{green}{29.37} & 44.47 / 37.33 / \color{green}{44.04} & 73.98 / 72.31 / \color{green}{72.93} & 16.55 / 18.15 / \color{red}{13.73} & 70.35 / 70.88 / \color{red}{69.63} \\
\text{SimBERT}_{\text{tiny}}\text{-P3} & 28.47 / 19.68 / \color{green}{28.08} & 42.04 / 29.49 / \color{green}{41.21} & 69.16 / 66.99 / \color{green}{69.85} & 16.18 / 20.11 / \color{red}{12.21} & 64.41 / 66.72 / \color{red}{64.62} \\
\text{SimBERT}_{\text{tiny}}\text{-P4} & 27.77 / 27.67 / \color{red}{26.25} & 41.76 / 37.02 / \color{green}{41.62} & 67.55 / 65.66 / \color{green}{67.34} & 15.06 / 20.49 / \color{red}{13.87} & 62.92 / 66.77 / \color{red}{60.80} \\
\hline
\end{array}}$$
Each cell is of the form "a/b/c", where a is the raw result without any post-processing, b is the result of BERT-whitening (without dimensionality reduction), and c is the result of SimCSE. If c > b, c is shown in green, otherwise in red; in other words, the more green, the more SimCSE outperforms BERT-whitening. For other experimental details, please refer to the open-source code and "Which Unsupervised Semantic Similarity is Strongest? We Conducted a Comprehensive Evaluation".
Note that because of Dropout and the random sampling of 10,000 training sentences, the results carry some randomness; rerunning the code will inevitably produce some fluctuation in the metrics.
Some Conclusions
From the experimental results, except for the "outlier" PAWSX, SimCSE holds an overwhelming advantage over BERT-whitening, sometimes by more than 10 percentage points. On BQ, SimCSE even outperforms the supervised SimBERT. Moreover, even a model that has already been trained with supervision, such as SimBERT, can be further improved by SimCSE, all of which demonstrates its power. (As for why PAWSX is "odd", a brief analysis was already given in "Which Unsupervised Semantic Similarity is Strongest? We Conducted a Comprehensive Evaluation".)
At the same time, we can see that under SimCSE, the first-last-avg pooling that worked better for BERT-flow and BERT-whitening no longer has any advantage; taking the [CLS] vector directly works better. Surprisingly, however, the Pooler (the [CLS] vector passed through an extra Dense layer) performs relatively poorly, which is genuinely puzzling.
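To pin down the distinction, in HuggingFace terms the two poolings differ only in whether the [CLS] vector is used as-is or passed through BERT's trained Dense + tanh head (the experiments themselves use bert4keras; this is just an illustration reusing the names from the sketch above):

```python
# Illustration of the two poolings discussed above (names follow the earlier sketch).
outputs = model(**batch)
cls_vec = outputs.last_hidden_state[:, 0]   # "[CLS] vector directly"
pooler_vec = outputs.pooler_output          # the Pooler: [CLS] through a Dense + tanh layer
```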
Since BERT-whitening is only a linear transformation, I also tested whether SimCSE alone could reproduce the effect of such a transformation. Specifically, I froze the Encoder weights, attached a Dense layer with no activation function, and trained only that final Dense layer with the SimCSE objective. In this setting SimCSE was not as good as BERT-whitening, which implies that SimCSE must fine-tune the Encoder to be effective. It also suggests that BERT-whitening may capture something that SimCSE lacks; perhaps combining the two in some way would give even better results (an idea I am still mulling over...).
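A sketch of that frozen-Encoder experiment, again illustrative rather than the repository code and reusing the placeholder names from the sketches above:

```python
# Frozen-Encoder variant: only a Dense layer (no activation) on top of the
# sentence vectors is trained with the SimCSE objective; the Encoder is fixed,
# and its Dropout (still active in train mode) is assumed to provide the
# difference between the two passes.
for p in model.parameters():
    p.requires_grad = False                        # freeze the Encoder weights

dense = torch.nn.Linear(model.config.hidden_size, model.config.hidden_size)
optimizer = torch.optim.AdamW(dense.parameters(), lr=1e-5)

for start in range(0, len(sentences), 64):
    batch = tokenizer(sentences[start:start + 64], padding=True,
                      truncation=True, return_tensors="pt")
    with torch.no_grad():                          # no gradients through the Encoder
        h0 = model(**batch).last_hidden_state[:, 0]
        h1 = model(**batch).last_hidden_state[:, 0]
    loss = simcse_loss(dense(h0), dense(h1))       # only the linear map is trained
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```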
Related Work
A quick survey shows that several recent papers use the idea of "treating a sentence and itself as a positive pair". Besides SimCSE, there are "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks" and "Semantic Re-tuning with Contrastive Tension", which are very similar in spirit. In fact, I had had similar ideas myself but did not believe they would actually work (so I never ran the experiments), and I had not realized that the key ingredient was Dropout. It seems one really has to experiment more!
Summary
This article shares my Chinese experiments on SimCSE. The results show that on many tasks, SimCSE is indeed quite excellent and significantly outperforms BERT-whitening.