By 苏剑林 | April 11, 2021
In January, I wrote "You Might Not Need BERT-flow: A Simple Linear Transformation Comparable to BERT-flow", pointing out that the state-of-the-art (SOTA) unsupervised semantic similarity model, BERT-flow, can actually be matched by a simple linear transformation (whitening operation, or BERT-whitening). Subsequently, we further refined our experimental results and wrote a paper titled "Whitening Sentence Representations for Better Semantics and Faster Retrieval". This blog post will summarize the content of that paper and provide supplementary evaluations on five Chinese semantic similarity tasks, involving over 600 experimental results.
Github Link: https://github.com/bojone/BERT-whitening
The logic behind BERT-whitening is simple: after obtaining the sentence vectors $\{x_i\}_{i=1}^N$ for each sentence, a whitening operation (essentially PCA) is applied to these vectors so that the mean of each dimension is 0 and the covariance matrix is the identity matrix. Finally, the top $k$ principal components are retained. The process is illustrated below:

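For concreteness, here is a minimal NumPy sketch of the whitening-plus-truncation step (an illustrative sketch only, not necessarily identical to the code in the repository above; the function name and default dimension are arbitrary):

```python
import numpy as np

def whitening(vecs, n_components=256):
    """Whiten sentence vectors (zero mean, identity covariance) and keep the top components."""
    mu = vecs.mean(axis=0, keepdims=True)        # per-dimension mean, shape (1, d)
    cov = np.cov((vecs - mu).T)                  # covariance matrix, shape (d, d)
    u, s, _ = np.linalg.svd(cov)                 # SVD of the (symmetric) covariance
    W = u @ np.diag(1.0 / np.sqrt(s))            # whitening kernel
    return (vecs - mu) @ W[:, :n_components]     # transform, then keep the top-k directions
```

In practice, the mean $\mu$ and kernel $W$ are estimated once on the whole collection of sentence vectors and then reused to transform any new vector; the transformed vectors are compared with cosine similarity as before.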
Theoretically, BERT-whitening can be viewed as the simplest possible implementation of BERT-flow. As noted in the earlier post, this minimal implementation is enough to match, and sometimes outperform, standard BERT-flow models. In addition, the whitening transform orders the output dimensions by importance (variance), so we can truncate the sentence vectors to a lower dimension and thereby speed up retrieval. The experiments show that on most tasks this dimensionality reduction not only avoids a performance drop but often brings a performance gain.
First, we present the results of BERT-whitening on the English tasks, in two tables and one figure, as a direct comparison with BERT-flow.
The first table shows results for sentence vectors extracted directly from pre-trained BERT, i.e., the fully unsupervised setting. The BERT-flow paper verified that, without any post-processing, the best pooling method for BERT sentence vectors is to average all token vectors of the first and last layers, denoted first-last-avg (the BERT-flow paper mislabeled this as last2avg, the average of the last two layers, but the computation actually uses the first and last layers). Therefore, all subsequent results take first-last-avg as the base on top of which flow or whitening is applied.

The second table shows results for sentence vectors extracted from Sentence-BERT (SBERT), which is BERT fine-tuned on NLI data. Here, first-last-avg again gives the best base vectors, so it is used as the starting point for flow and whitening. NLI is a natural language inference task, which is related to but not the same as semantic similarity; fine-tuning on it can be seen as supervised pre-training for similarity, but since no semantic similarity data is used directly, the setup remains unsupervised with respect to the target tasks.

In these two tables, bold values indicate the best results. A green arrow $\color{green}{\uparrow}$ indicates that BERT-whitening outperformed BERT-flow under the same conditions, while a red arrow $\color{red}{\downarrow}$ indicates the opposite. More green arrows mean BERT-whitening is generally more effective. The numbers 256 and 384 following "whitening" refer to the retained dimensions after reduction. From these tables, it is clear that BERT-whitening generally outperforms BERT-flow, achieving SOTA on most tasks. In many cases, dimensionality reduction further improves performance.
To further examine the impact of dimensionality reduction, we plotted the relationship between retained dimensions and performance:

The figures above show how performance on each task changes with the number of retained dimensions after whitening, for the different models; the markers indicate where each curve peaks. We can see that for every task the best performance is reached below the full dimensionality, meaning that dimensionality reduction itself can bring gains. Notably, on many tasks performance is maintained or even improved when the vectors are reduced to 1/8 of their original dimension, which highlights the engineering value of BERT-whitening, since lower-dimensional vectors make retrieval significantly faster.
Adhering to the philosophy that "a model not tested on Chinese has no soul," I organized several Chinese semantic similarity datasets to evaluate various Chinese pre-trained models, pooling methods, and whitening configurations. The results are summarized here for comparison.
This evaluation involves 11 models, 5 datasets, and 4 pooling methods. For each combination we test three post-processing options: no whitening, whitening, and whitening with dimensionality reduction, giving roughly $11 \times 5 \times 4 \times 3 = 660$ experimental results. (The actual number is slightly lower, since some pooling methods do not apply to certain models.) Because BERT-flow is much more expensive to run than BERT-whitening, we did not reproduce BERT-flow on the Chinese tasks; based on the English results, the two are usually close, with BERT-whitening often having the edge.
The evaluation metric used is the Spearman correlation coefficient, a ranking metric similar to AUC that relies only on the order of predicted scores and is threshold-independent. It is particularly suitable for STS-B, where labels are continuous values (1–5) rather than 0/1. For those accustomed to accuracy, a rough heuristic is "accuracy ≈ 0.5 + spearman / 2".
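As a reference point, here is a minimal sketch of the evaluation protocol (score each pair by the cosine similarity of its sentence vectors, then compute the Spearman correlation against the labels; the function name is illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(vecs_a, vecs_b, labels):
    """Cosine-similarity scores for sentence pairs, evaluated by Spearman correlation."""
    a = vecs_a / np.linalg.norm(vecs_a, axis=1, keepdims=True)
    b = vecs_b / np.linalg.norm(vecs_b, axis=1, keepdims=True)
    scores = (a * b).sum(axis=1)                  # cosine similarity per pair
    return spearmanr(scores, labels).correlation  # depends only on the ranking of the scores
```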
The 11 models are as follows:
BERT: Google's official Chinese BERT base. Link;
RoBERTa: HFL's roberta_wwm_ext base. Link;
NEZHA: Huawei's BERT base with relative position encoding (wwm). Link;
WoBERT: Word-based BERT (Plus version used here). Link;
RoFormer: BERT with Rotary Position Embedding (RoPE). Link;
BERTlarge: Tencent UER's BERT large. Link;
RoBERTalarge: HFL's roberta_wwm_ext large. Link;
NEZHA-large: Huawei's NEZHA large (wwm). Link;
SimBERT: BERT base trained on similar sentence pairs. Link;
SimBERTsmall: Smaller version of SimBERT. Link;
SimBERTtiny: Tiny version of SimBERT. Link.
The 5 tasks are:
ATEC: ATEC Semantic Similarity Competition dataset (Financial/Customer Service). Link;
BQ: HIT BQ Corpus (Banking/Finance questions). Link;
LCQMC: HIT LCQMC (General domain question matching). Link;
PAWSX: Google's PAWS-X dataset (Paraphrase identification with high lexical overlap, very difficult for unsupervised methods). Link;
STS-B: Similarity calculation between two sentences (translated from English with manual correction). Link.
The 4 pooling methods are listed below (a code sketch follows the list):
P1: Use the [CLS] vector from the encoder's last layer;
P2: Use the vector corresponding to the Pooler (used for the NSP task), which includes a linear transformation after [CLS];
P3: Average all vectors from the encoder's last layer;
P4: Average all vectors from both the first and last layers of the encoder.
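Roughly, the four pooling strategies can be sketched as follows (using the HuggingFace transformers API; `bert-base-chinese` is a stand-in checkpoint, and "first layer" is taken here as the output of the first Transformer block, whereas some implementations use the embedding-layer output instead):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # stand-in checkpoint
model = AutoModel.from_pretrained("bert-base-chinese")

def encode(sentences, pooling="P4"):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch, output_hidden_states=True)
    mask = batch["attention_mask"].unsqueeze(-1).float()

    def masked_mean(h):                       # average over real (non-padding) tokens
        return (h * mask).sum(dim=1) / mask.sum(dim=1)

    if pooling == "P1":                       # [CLS] vector of the last layer
        return out.last_hidden_state[:, 0]
    if pooling == "P2":                       # Pooler output (linear + tanh on [CLS])
        return out.pooler_output
    if pooling == "P3":                       # mean of the last layer
        return masked_mean(out.last_hidden_state)
    if pooling == "P4":                       # first-last-avg
        return (masked_mean(out.hidden_states[1]) + masked_mean(out.hidden_states[-1])) / 2
    raise ValueError(f"unknown pooling: {pooling}")
```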
All results are summarized in the following three tables. Each entry has the form $a / b / c$, where $a$ is the score with no whitening, $b$ the score with whitening, and $c$ the score with whitening plus dimensionality reduction. If $b \geq a$, then $b$ is shown in green, otherwise in red; likewise, $c$ is green if $c \geq a$ and red otherwise. For the dimensionality reduction, we keep 256 dimensions for base models, 384 for large models, and 128 for small/tiny models.
The first table compares 6 base models (WoBERT and RoFormer lack NSP weights, so P2 is omitted):
\[\small{\begin{array}{l|ccccc} \hline & \text{ATEC} & \text{BQ} & \text{LCQMC} & \text{PAWSX} & \text{STS-B} \\ \hline \text{BERT-P1} & 16.59 / \color{green}{20.61} / \color{green}{25.58} & 29.35 / \color{red}{25.76} / \color{green}{34.66} & 41.71 / \color{green}{48.92} / \color{green}{49.18} & 15.15 / \color{green}{17.03} / \color{green}{15.98} & 34.65 / \color{green}{61.19} / \color{green}{60.07} \\ \text{BERT-P2} & 9.46 / \color{green}{22.16} / \color{green}{25.13} & 16.97 / \color{green}{18.97} / \color{green}{33.99} & 28.42 / \color{green}{49.61} / \color{green}{49.59} & 13.93 / \color{green}{16.08} / \color{green}{16.19} & 21.66 / \color{green}{60.75} / \color{green}{60.13} \\ \text{BERT-P3} & 20.79 / \color{red}{18.27} / \color{green}{28.98} & 33.08 / \color{red}{22.58} / \color{green}{38.62} & 59.22 / \color{green}{60.12} / \color{green}{62.00} & 16.68 / \color{green}{18.37} / \color{green}{17.38} & 57.48 / \color{green}{63.97} / \color{green}{68.27} \\ \text{BERT-P4} & 24.51 / \color{green}{27.00} / \color{green}{27.91} & 38.81 / \color{red}{32.29} / \color{red}{37.67} & 64.75 / \color{green}{64.75} / \color{green}{65.65} & 15.12 / \color{green}{17.80} / \color{green}{15.34} & 61.66 / \color{green}{69.45} / \color{green}{69.37} \\ \hline \text{RoBERTa-P1} & 24.61 / \color{green}{29.59} / \color{green}{29.49} & 40.54 / \color{red}{28.95} / \color{red}{38.35} & 70.55 / \color{green}{70.82} / \color{red}{68.84} & 16.23 / \color{green}{17.99} / \color{green}{16.87} & 66.91 / \color{green}{69.19} / \color{green}{71.16} \\ \text{RoBERTa-P2} & 20.61 / \color{green}{28.91} / \color{green}{29.49} & 31.14 / \color{red}{27.48} / \color{green}{38.46} & 65.43 / \color{green}{70.62} / \color{green}{68.76} & 15.71 / \color{green}{17.30} / \color{green}{17.01} & 59.50 / \color{green}{70.77} / \color{green}{71.16} \\ \text{RoBERTa-P3} & 26.94 / \color{green}{29.94} / \color{green}{30.57} & 40.71 / \color{red}{30.95} / \color{red}{39.89} & 66.80 / \color{green}{68.00} / \color{green}{67.30} & 16.08 / \color{green}{19.01} / \color{green}{16.79} & 61.67 / \color{green}{66.19} / \color{green}{69.36} \\ \text{RoBERTa-P4} & 27.94 / \color{green}{28.33} / \color{green}{29.06} & 43.09 / \color{red}{33.49} / \color{red}{38.83} & 68.43 / \color{red}{67.86} / \color{red}{68.36} & 15.02 / \color{green}{17.91} / \color{green}{15.26} & 64.09 / \color{green}{69.74} / \color{green}{70.09} \\ \hline \text{NEZHA-P1} & 17.39 / \color{green}{18.83} / \color{green}{24.97} & 29.63 / \color{red}{21.94} / \color{green}{33.65} & 40.60 / \color{green}{50.52} / \color{green}{46.57} & 14.90 / \color{green}{18.15} / \color{green}{16.69} & 35.84 / \color{green}{60.84} / \color{green}{58.98} \\ \text{NEZHA-P2} & 10.96 / \color{green}{23.08} / \color{green}{24.21} & 17.38 / \color{green}{28.81} / \color{green}{32.21} & 22.66 / \color{green}{49.12} / \color{green}{47.03} & 13.45 / \color{green}{18.05} / \color{green}{17.15} & 21.16 / \color{green}{60.11} / \color{green}{58.68} \\ \text{NEZHA-P3} & 23.70 / \color{red}{21.93} / \color{green}{28.65} & 35.44 / \color{red}{22.44} / \color{green}{37.95} & 60.94 / \color{green}{62.10} / \color{green}{62.50} & 18.35 / \color{green}{21.72} / \color{green}{18.78} & 60.35 / \color{green}{68.57} / \color{green}{68.97} \\ \text{NEZHA-P4} & 27.72 / \color{red}{25.31} / \color{red}{26.18} & 44.18 / \color{red}{31.47} / \color{red}{36.02} & 65.16 / \color{green}{66.68} / \color{green}{66.54} & 13.98 / \color{green}{16.66} / \color{green}{14.02} & 61.94 / \color{green}{69.55} / 
\color{green}{69.14} \\ \hline \text{WoBERT-P1} & 23.88 / \color{red}{22.45} / \color{green}{27.88} & 43.08 / \color{red}{32.52} / \color{red}{37.54} & 68.56 / \color{red}{67.89} / \color{red}{65.80} & 18.15 / \color{green}{19.92} / \color{green}{18.73} & 64.12 / \color{green}{66.53} / \color{green}{69.03} \\ \text{WoBERT-P2} & \text{-} & \text{-} & \text{-} & \text{-} & \text{-} \\ \text{WoBERT-P3} & 24.62 / \color{red}{22.74} / \color{green}{29.01} & 40.64 / \color{red}{28.12} / \color{red}{38.82} & 64.89 / \color{green}{65.22} / \color{green}{65.14} & 16.83 / \color{green}{20.56} / \color{green}{17.87} & 59.43 / \color{green}{66.57} / \color{green}{67.76} \\ \text{WoBERT-P4} & 25.97 / \color{green}{27.24} / \color{green}{28.38} & 42.37 / \color{red}{32.34} / \color{red}{38.06} & 66.53 / \color{red}{65.62} / \color{red}{66.36} & 15.54 / \color{green}{18.85} / \color{green}{15.98} & 61.37 / \color{green}{68.11} / \color{green}{68.42} \\ \hline \text{RoFormer-P1} & 24.29 / \color{green}{26.04} / \color{green}{28.20} & 41.91 / \color{red}{28.13} / \color{red}{38.21} & 64.87 / \color{red}{60.92} / \color{red}{60.83} & 20.15 / \color{green}{23.08} / \color{green}{21.30} & 59.91 / \color{green}{66.96} / \color{green}{66.86} \\ \text{RoFormer-P2} & \text{-} & \text{-} & \text{-} & \text{-} & \text{-} \\ \text{RoFormer-P3} & 24.09 / \color{green}{28.51} / \color{green}{29.37} & 39.09 / \color{red}{34.92} / \color{red}{39.05} & 63.55 / \color{green}{63.85} / \color{green}{63.58} & 16.53 / \color{green}{18.43} / \color{green}{17.52} & 58.98 / \color{red}{55.30} / \color{green}{67.32} \\ \text{RoFormer-P4} & 25.92 / \color{green}{27.38} / \color{green}{28.37} & 41.75 / \color{red}{32.36} / \color{red}{38.05} & 66.18 / \color{red}{65.45} / \color{red}{65.63} & 15.30 / \color{green}{18.36} / \color{green}{15.69} & 61.40 / \color{green}{68.02} / \color{green}{68.27} \\ \hline \text{SimBERT-P1} & 38.50 / \color{red}{23.64} / \color{red}{30.79} & 48.54 / \color{red}{31.78} / \color{red}{40.01} & 76.23 / \color{red}{75.05} / \color{red}{74.50} & 15.10 / \color{green}{18.49} / \color{green}{15.64} & 74.14 / \color{red}{73.37} / \color{green}{75.29} \\ \text{SimBERT-P2} & 38.93 / \color{red}{27.06} / \color{red}{30.79} & 49.93 / \color{red}{35.38} / \color{red}{40.14} & 75.56 / \color{red}{73.45} / \color{red}{74.39} & 14.52 / \color{green}{18.51} / \color{green}{15.74} & 73.18 / \color{green}{73.43} / \color{green}{75.12} \\ \text{SimBERT-P3} & 36.50 / \color{red}{31.32} / \color{red}{31.24} & 45.78 / \color{red}{29.17} / \color{red}{40.98} & 74.42 / \color{red}{73.79} / \color{red}{73.43} & 15.33 / \color{green}{18.39} / \color{green}{15.87} & 67.31 / \color{green}{70.70} / \color{green}{72.00} \\ \text{SimBERT-P4} & 33.53 / \color{red}{29.04} / \color{red}{28.78} & 45.28 / \color{red}{34.70} / \color{red}{39.00} & 73.20 / \color{red}{71.22} / \color{red}{72.09} & 14.16 / \color{green}{17.32} / \color{green}{14.39} & 66.98 / \color{green}{70.55} / \color{green}{71.43} \\ \hline \end{array}}\]The second table compares 3 large models:
\[\small{\begin{array}{l|ccccc} \hline & \text{ATEC} & \text{BQ} & \text{LCQMC} & \text{PAWSX} & \text{STS-B} \\ \hline \text{BERT}_{\text{large}}\text{-P1} & 13.15 / \color{green}{22.42} / \color{green}{24.32} & 19.81 / \color{red}{17.61} / \color{green}{31.09} & 23.45 / \color{green}{44.31} / \color{green}{41.32} & 16.88 / \color{green}{19.37} / \color{green}{19.87} & 25.93 / \color{green}{52.70} / \color{green}{56.74} \\ \text{BERT}_{\text{large}}\text{-P2} & 8.16 / \color{green}{16.57} / \color{green}{24.34} & 9.43 / \color{green}{18.23} / \color{green}{30.91} & 16.66 / \color{green}{39.50} / \color{green}{41.40} & 14.72 / \color{green}{20.00} / \color{green}{19.92} & 15.82 / \color{green}{56.79} / \color{green}{56.73} \\ \text{BERT}_{\text{large}}\text{-P3} & 24.31 / \color{red}{18.25} / \color{green}{30.24} & 35.87 / \color{red}{32.56} / \color{green}{37.51} & 59.29 / \color{green}{65.06} / \color{green}{63.78} & 16.94 / \color{green}{20.01} / \color{green}{18.62} & 60.22 / \color{green}{68.07} / \color{green}{68.87} \\ \text{BERT}_{\text{large}}\text{-P4} & 25.62 / \color{green}{27.64} / \color{green}{28.15} & 38.45 / \color{red}{31.30} / \color{red}{36.47} & 65.43 / \color{green}{66.54} / \color{green}{67.02} & 15.33 / \color{green}{19.06} / \color{green}{15.95} & 62.02 / \color{green}{69.74} / \color{green}{69.99} \\ \hline \text{RoBERTa}_{\text{large}}\text{-P1} & 19.32 / \color{red}{15.90} / \color{green}{29.32} & 34.21 / \color{red}{23.16} / \color{green}{37.11} & 64.89 / \color{green}{67.05} / \color{green}{66.49} & 17.78 / \color{green}{20.66} / \color{green}{19.73} & 60.16 / \color{green}{69.46} / \color{green}{70.44} \\ \text{RoBERTa}_{\text{large}}\text{-P2} & 19.32 / \color{green}{22.16} / \color{green}{29.23} & 34.33 / \color{red}{33.22} / \color{green}{37.10} & 65.00 / \color{green}{67.12} / \color{green}{66.50} & 17.77 / \color{green}{18.90} / \color{green}{19.79} & 60.09 / \color{green}{61.35} / \color{green}{70.32} \\ \text{RoBERTa}_{\text{large}}\text{-P3} & 24.83 / \color{red}{21.05} / \color{green}{30.85} & 39.23 / \color{red}{26.85} / \color{red}{38.39} & 66.86 / \color{green}{68.62} / \color{green}{67.25} & 17.67 / \color{green}{20.06} / \color{green}{19.09} & 62.98 / \color{red}{55.75} / \color{green}{69.72} \\ \text{RoBERTa}_{\text{large}}\text{-P4} & 25.69 / \color{green}{28.19} / \color{green}{28.39} & 40.18 / \color{red}{32.06} / \color{red}{36.91} & 68.58 / \color{green}{68.74} / \color{green}{68.71} & 16.01 / \color{green}{19.87} / \color{green}{16.50} & 63.75 / \color{green}{70.08} / \color{green}{70.39} \\ \hline \text{NEZHA}_{\text{large}}\text{-P1} & 18.91 / \color{green}{24.98} / \color{green}{25.68} & 30.39 / \color{red}{29.30} / \color{green}{33.29} & 41.68 / \color{green}{52.42} / \color{green}{49.80} & 18.89 / \color{green}{23.31} / \color{green}{21.74} & 39.04 / \color{green}{60.36} / \color{green}{61.13} \\ \text{NEZHA}_{\text{large}}\text{-P2} & 7.92 / \color{green}{21.60} / \color{green}{25.33} & 12.03 / \color{green}{24.63} / \color{green}{33.22} & 12.33 / \color{green}{52.40} / \color{green}{49.68} & 16.26 / \color{green}{23.11} / \color{green}{21.95} & 16.59 / \color{green}{57.70} / \color{green}{60.82} \\ \text{NEZHA}_{\text{large}}\text{-P3} & 22.74 / \color{green}{25.63} / \color{green}{27.48} & 36.48 / \color{red}{22.33} / \color{red}{35.47} & 59.65 / \color{green}{59.90} / \color{green}{59.94} & 18.09 / \color{green}{23.12} / \color{green}{19.71} & 59.66 / \color{green}{67.80} / \color{green}{68.55} \\ 
\text{NEZHA}_{\text{large}}\text{-P4} & 27.45 / \color{red}{24.83} / \color{red}{24.90} & 44.33 / \color{red}{29.73} / \color{red}{34.05} & 66.19 / \color{green}{66.89} / \color{green}{67.88} & 13.74 / \color{green}{16.66} / \color{green}{13.95} & 62.91 / \color{green}{69.87} / \color{green}{69.71} \\ \hline \end{array}}\]The third table compares different sizes of SimBERT:
\[\small{\begin{array}{l|ccccc} \hline & \text{ATEC} & \text{BQ} & \text{LCQMC} & \text{PAWSX} & \text{STS-B} \\ \hline \text{SimBERT}\text{-P1} & 38.50 / \color{red}{23.64} / \color{red}{30.79} & 48.54 / \color{red}{31.78} / \color{red}{40.01} & 76.23 / \color{red}{75.05} / \color{red}{74.50} & 15.10 / \color{green}{18.49} / \color{green}{15.64} & 74.14 / \color{red}{73.37} / \color{green}{75.29} \\ \text{SimBERT}\text{-P2} & 38.93 / \color{red}{27.06} / \color{red}{30.79} & 49.93 / \color{red}{35.38} / \color{red}{40.14} & 75.56 / \color{red}{73.45} / \color{red}{74.39} & 14.52 / \color{green}{18.51} / \color{green}{15.74} & 73.18 / \color{green}{73.43} / \color{green}{75.12} \\ \text{SimBERT}\text{-P3} & 36.50 / \color{red}{31.32} / \color{red}{31.24} & 45.78 / \color{red}{29.17} / \color{red}{40.98} & 74.42 / \color{red}{73.79} / \color{red}{73.43} & 15.33 / \color{green}{18.39} / \color{green}{15.87} & 67.31 / \color{green}{70.70} / \color{green}{72.00} \\ \text{SimBERT}\text{-P4} & 33.53 / \color{red}{29.04} / \color{red}{28.78} & 45.28 / \color{red}{34.70} / \color{red}{39.00} & 73.20 / \color{red}{71.22} / \color{red}{72.09} & 14.16 / \color{green}{17.32} / \color{green}{14.39} & 66.98 / \color{green}{70.55} / \color{green}{71.43} \\ \hline \text{SimBERT}_{\text{small}}\text{-P1} & 30.68 / \color{red}{27.56} / \color{red}{29.07} & 43.41 / \color{red}{30.89} / \color{red}{39.78} & 74.73 / \color{red}{73.21} / \color{red}{73.50} & 15.89 / \color{green}{17.96} / \color{green}{16.75} & 70.54 / \color{green}{71.39} / \color{green}{72.14} \\ \text{SimBERT}_{\text{small}}\text{-P2} & 31.00 / \color{red}{29.14} / \color{red}{29.11} & 43.76 / \color{red}{36.86} / \color{red}{39.84} & 74.21 / \color{red}{73.14} / \color{red}{73.67} & 16.17 / \color{green}{18.12} / \color{green}{16.81} & 70.10 / \color{green}{71.40} / \color{green}{72.28} \\ \text{SimBERT}_{\text{small}}\text{-P3} & 30.03 / \color{red}{21.24} / \color{red}{29.30} & 43.72 / \color{red}{31.69} / \color{red}{40.81} & 72.12 / \color{red}{70.27} / \color{red}{70.52} & 16.93 / \color{green}{21.68} / \color{green}{18.75} & 66.55 / \color{red}{66.11} / \color{green}{69.19} \\ \text{SimBERT}_{\text{small}}\text{-P4} & 29.52 / \color{red}{28.41} / \color{red}{28.57} & 43.52 / \color{red}{36.56} / \color{red}{40.49} & 70.33 / \color{red}{68.75} / \color{red}{69.01} & 15.39 / \color{green}{21.57} / \color{green}{16.34} & 64.73 / \color{green}{68.12} / \color{green}{68.24} \\ \hline \text{SimBERT}_{\text{tiny}}\text{-P1} & 30.51 / \color{red}{24.67} / \color{red}{27.98} & 44.25 / \color{red}{31.75} / \color{red}{39.42} & 74.27 / \color{red}{72.25} / \color{red}{73.24} & 16.01 / \color{green}{18.07} / \color{green}{17.07} & 70.11 / \color{red}{66.39} / \color{green}{71.92} \\ \text{SimBERT}_{\text{tiny}}\text{-P2} & 30.01 / \color{red}{27.66} / \color{red}{27.92} & 44.47 / \color{red}{37.33} / \color{red}{39.39} & 73.98 / \color{red}{72.31} / \color{red}{73.31} & 16.55 / \color{green}{18.15} / \color{green}{17.14} & 70.35 / \color{green}{70.88} / \color{green}{72.04} \\ \text{SimBERT}_{\text{tiny}}\text{-P3} & 28.47 / \color{red}{19.68} / \color{green}{28.60} & 42.04 / \color{red}{29.49} / \color{red}{40.59} & 69.16 / \color{red}{66.99} / \color{red}{67.74} & 16.18 / \color{green}{20.11} / \color{green}{17.87} & 64.41 / \color{green}{66.72} / \color{green}{67.57} \\ \text{SimBERT}_{\text{tiny}}\text{-P4} & 27.77 / \color{red}{27.67} / \color{green}{28.02} & 41.76 / \color{red}{37.02} / \color{red}{40.19} & 67.55 / \color{red}{65.66} / 
\color{red}{66.60} & 15.06 / \color{green}{20.49} / \color{green}{16.26} & 62.92 / \color{green}{66.77} / \color{green}{67.01} \\ \hline \end{array}}\]Similar to the English tables, green indicates that the whitening operation improved sentence vector quality, while red indicates a decrease. More green suggests the method is generally effective. From these charts, we can draw several conclusions:
1. Results on the Chinese tasks are far more complex and irregular than on the English tasks. For instance, in English, P4 is almost always the best pooling method and large models generally beat base models, but neither pattern clearly holds on the Chinese tasks.
2. Except for SimBERT, there is generally more green than red, indicating that whitening usually has a positive effect on sentence vector quality. Specifically, the third value $c$ (with reduction) is green significantly more often than the second value $b$, showing that dimensionality reduction further improves performance. Whitening is thus a truly "faster and better" algorithm.
3. In the BQ task, whitening almost always led to a decrease. This is similar to the SICK-R task in English, suggesting there is "no free lunch": on some tasks, the "isotropy" assumption may fail, and neither BERT-whitening nor BERT-flow will help.
4. SimBERT is the SOTA on every task except PAWS-X. Of course, SimBERT was trained with supervised similarity data (although, in principle, its training data does not overlap with these evaluation sets), so the comparison is not entirely fair. Still, since SimBERT is fully open-source, it serves as a strong baseline.
5. Applying whitening to SimBERT either degrades performance or provides negligible improvement. This suggests that for sentence vectors trained with supervised methods, further whitening is unnecessary and unlikely to yield gains.
6. PAWS-X is indeed very difficult, and unsupervised semantic similarity still has a long way to go...
This post presented a comprehensive evaluation of unsupervised semantic similarity methods on both English and Chinese tasks. For English, we reiterated results from our BERT-whitening paper, including a direct comparison with BERT-flow. For Chinese, we collected 5 tasks and 11 pre-trained models, testing over 600 combinations of pooling and post-processing methods to provide a clear reference for comparison.
In short: BERT-whitening matches the current unsupervised SOTA, while SimBERT stands as a high-performance open-source baseline for Chinese semantic similarity.