By 苏剑林 | May 18, 2022
In "You Might Not Need BERT-flow: A Linear Transformation Comparable to BERT-flow", I proposed BERT-whitening, verifying that a simple linear transformation could rival the SOTA method at the time, BERT-flow. Additionally, BERT-whitening can reduce the dimensionality of sentence vectors, resulting in lower memory usage and faster retrieval speeds. However, in "Which Unsupervised Semantic Similarity Method is Stronger? A Comprehensive Evaluation", we also found that the whitening operation does not always bring improvements. Some models are inherently well-suited to the task (such as the supervised SimBERT), and in these cases, additional whitening often degrades performance.
To address this deficiency, this article proposes introducing two hyperparameters $\beta$ and $\gamma$ into BERT-whitening. By adjusting these two hyperparameters, we can almost always obtain results that achieve "dimensionality reduction without performance loss." In other words, even for tasks where adding whitening originally caused a drop in performance, there is now a chance to maintain similar or even better results while benefiting from dimensionality reduction.
The current BERT-whitening workflow is as follows:
\begin{equation} \begin{aligned} \tilde{\boldsymbol{x}}_i =&\, (\boldsymbol{x}_i - \boldsymbol{\mu})\boldsymbol{U}\boldsymbol{\Lambda}^{-1/2} \\ \boldsymbol{\mu} =&\, \frac{1}{N}\sum\limits_{i=1}^N \boldsymbol{x}_i \\ \boldsymbol{\Sigma} =&\, \frac{1}{N}\sum\limits_{i=1}^N (\boldsymbol{x}_i - \boldsymbol{\mu})^{\top}(\boldsymbol{x}_i - \boldsymbol{\mu}) = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^{\top} \,\,(\text{SVD decomposition}) \end{aligned} \end{equation}where $\boldsymbol{x}_i$ is a given sentence vector (vectors are row vectors by default unless otherwise stated) and $\tilde{\boldsymbol{x}}_i$ is the transformed vector. In the SVD decomposition, $\boldsymbol{U}$ is an orthogonal matrix and $\boldsymbol{\Lambda}$ is a diagonal matrix whose non-negative diagonal elements are sorted from largest to smallest. As we can see, the current process is completely fixed, with no adjustable hyperparameters.
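For concreteness, here is a minimal NumPy sketch of this workflow (my own illustration, not the reference implementation); `embeddings` stands for a hypothetical $(N, d)$ array of sentence vectors, one row per sentence:

```python
import numpy as np

def whitening_stats(embeddings):
    """Compute mu, U, Lam so that x_tilde = (x - mu) U Lam^{-1/2}."""
    mu = embeddings.mean(axis=0, keepdims=True)         # (1, d) mean vector
    centered = embeddings - mu
    sigma = centered.T @ centered / len(embeddings)     # covariance: Sigma = U diag(Lam) U^T
    U, Lam, _ = np.linalg.svd(sigma)                    # Lam is sorted from largest to smallest
    return mu, U, Lam

def whiten(x, mu, U, Lam):
    """Apply the whitening transform to a batch of row vectors x."""
    return (x - mu) @ U * Lam ** -0.5                   # broadcasting applies diag(Lam^{-1/2})
```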
To increase the tuning space, we can introduce two hyperparameters $\beta, \gamma$ (scalars), changing the process to:
\begin{equation} \begin{aligned} \tilde{\boldsymbol{x}}_i =&\, (\boldsymbol{x}_i - {\color{red}\beta}\boldsymbol{\mu})\boldsymbol{U}\boldsymbol{\Lambda}^{-{\color{red}\gamma}/2} \\ \boldsymbol{\mu} =&\, \frac{1}{N}\sum\limits_{i=1}^N \boldsymbol{x}_i \\ \boldsymbol{\Sigma} =&\, \frac{1}{N}\sum\limits_{i=1}^N (\boldsymbol{x}_i - {\color{red}\beta}\boldsymbol{\mu})^{\top}(\boldsymbol{x}_i - {\color{red}\beta}\boldsymbol{\mu}) = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^{\top} \,\,(\text{SVD decomposition}) \end{aligned} \end{equation}As we can see, when $\beta=\gamma=1$, it is the original BERT-whitening. When $\beta=\gamma=0$, the net transformation is:
\begin{equation}\tilde{\boldsymbol{x}}_i = \boldsymbol{x}_i \boldsymbol{U}\end{equation}Since $\boldsymbol{U}$ is an orthogonal matrix, it preserves inner products, i.e., $\tilde{\boldsymbol{x}}_i\tilde{\boldsymbol{x}}_j^{\top} = \boldsymbol{x}_i \boldsymbol{U} (\boldsymbol{x}_j \boldsymbol{U})^{\top} = \boldsymbol{x}_i \boldsymbol{U}\boldsymbol{U}^{\top}\boldsymbol{x}_j^{\top} = \boldsymbol{x}_i\boldsymbol{x}_j^{\top}$. Therefore, when cosine similarity is used as the metric, the transformation does not change the original results. In other words, introducing these hyperparameters guarantees the possibility of being "no worse than before the transformation," and fine-tuning them may yield results that are even better. This is the design philosophy behind the two hyperparameters.
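As a sketch of the generalized version (reusing the NumPy setup above; the function names are my own, not from any official BERT-whitening release), only the mean subtraction and the exponent on $\boldsymbol{\Lambda}$ change:

```python
def whitening_stats_bg(embeddings, beta=1.0, gamma=1.0):
    """Return (shift, W) so that x_tilde = (x - shift) W, with shift = beta * mu
    and W = U Lam^{-gamma/2}. beta = gamma = 1 recovers plain whitening;
    beta = gamma = 0 reduces to the orthogonal map x -> x U."""
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - beta * mu
    sigma = centered.T @ centered / len(embeddings)     # Sigma built from x - beta * mu
    U, Lam, _ = np.linalg.svd(sigma)
    W = U * Lam ** (-gamma / 2)                         # scale columns of U by Lam^{-gamma/2}
    return beta * mu, W

def transform(x, shift, W):
    return (x - shift) @ W

# Sanity check: with beta = gamma = 0 the map is an orthogonal rotation,
# so all pairwise inner products (and hence cosine similarities) are preserved.
x = np.random.randn(1000, 768)
shift, W = whitening_stats_bg(x, beta=0.0, gamma=0.0)
z = transform(x, shift, W)
assert np.allclose(x @ x.T, z @ z.T)
```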
Furthermore, with this modification, the original ability for dimensionality reduction is still preserved. We can view the transformation in two parts:
\begin{equation}\tilde{\boldsymbol{x}}_i = \color{red}{\underbrace{(\boldsymbol{x}_i - \beta\boldsymbol{\mu})\boldsymbol{U}}_{\text{part 1}}}\color{skyblue}{\underbrace{\boldsymbol{\Lambda}^{-\gamma/2}}_{\text{part 2}}}\end{equation}The first part is essentially the orthogonal transformation $\boldsymbol{U}$, obtained from the SVD decomposition of the matrix $\boldsymbol{\Sigma}$; it transforms the vector $\boldsymbol{x}_i - \beta\boldsymbol{\mu}$ into a new vector whose components are as independent as possible. The average fluctuation of each component of the new vector around 0 is measured by the corresponding diagonal element of $\boldsymbol{\Lambda}^{1/2}$. If that fluctuation is very close to 0, we can treat the component as practically 0, and discarding it will not significantly affect the calculation of cosine similarity; this is the principle behind dimensionality reduction. Since the SVD already sorts $\boldsymbol{\Lambda}$ from largest to smallest, we can reduce to $k$ dimensions by simply keeping the first $k$ components: $\tilde{\boldsymbol{x}}_i[:k]$.
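Continuing the sketch above, dimensionality reduction is then just a slice over the transformed vectors (or, equivalently, over the columns of the kernel):

```python
# Keep only the first k components: since Lam is sorted in descending order,
# these are the components with the largest fluctuation.
k = 256
shift, W = whitening_stats_bg(x, beta=0.0, gamma=0.0)
z = transform(x, shift, W)[:, :k]                       # same result as using W[:, :k]
z /= np.linalg.norm(z, axis=1, keepdims=True)           # unit-normalize for cosine retrieval
```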
As for the second part, $\boldsymbol{\Lambda}^{-\gamma/2}$, it can be understood as controlling how strongly the current task relies on isotropy. Setting $\gamma=1$ gives every component equal weight, which is a reasonable unsupervised prior, but it is not necessarily optimal for every task, so we can adjust $\gamma$ to better suit the task at hand.
The article "Which Unsupervised Semantic Similarity Method is Stronger? A Comprehensive Evaluation" has already shown that on the ATEC, BQ, and LCQMC tasks, SimBERT combined with the default whitening operation (i.e., $\beta=\gamma=1$) leads to a performance drop. However, if we take $\beta=\gamma=0$, the results are different (two combinations are demonstrated below; other combinations yield similar results):
BERT-P4 Performance Table
| | ATEC | BQ | LCQMC | PAWSX | STS-B |
|---|---|---|---|---|---|
| $\beta=\gamma=1$ | 24.51 / 27.00 / 27.91 | 38.81 / 32.29 / 37.67 | 64.75 / 64.75 / 65.65 | 15.12 / 17.80 / 15.34 | 61.66 / 69.45 / 69.37 |
| $\beta=\gamma=0$ | 24.51 / 24.51 / 24.59 | 38.81 / 38.81 / 38.99 | 64.75 / 64.75 / 63.45 | 15.12 / 15.12 / 14.59 | 61.66 / 61.66 / 62.30 |
SimBERT-P1 Performance Table
| | ATEC | BQ | LCQMC | PAWSX | STS-B |
|---|---|---|---|---|---|
| $\beta=\gamma=1$ | 38.50 / 23.64 / 30.79 | 48.54 / 31.78 / 40.01 | 76.23 / 75.05 / 74.50 | 15.10 / 18.49 / 15.64 | 74.14 / 73.37 / 75.29 |
| $\beta=\gamma=0$ | 38.50 / 38.50 / 38.81 | 48.54 / 48.54 / 48.66 | 76.23 / 76.23 / 76.22 | 15.10 / 15.10 / 14.88 | 74.14 / 74.14 / 74.46 |
As in the previous article, each element in the table is in the form $a / b / c$, representing the score for that task under that model "without whitening" ($a$), "with whitening" ($b$), and "with whitening reduced to 256 dimensions" ($c$). If $b > a$, then $b$ is displayed in green, otherwise red; if $c > a$, then $c$ is displayed in green, otherwise red. As mentioned, if dimensionality reduction is not applied, the net transformation for $\beta=\gamma=0$ is just $\boldsymbol{U}$, which does not change the cosine similarity; thus $a$ and $b$ are equal when $\beta=\gamma=0$.
In this table, we primarily look at the third result $c$ in $a/b/c$, which is the result of reducing the vector from 768 to 256 dimensions. It can be observed that when $\beta=\gamma=0$, whether it is the unsupervised BERT or the supervised SimBERT, this result is generally very close to the original vector result (i.e., $a$), and some results even show improvements. This implies that the combination $\beta=\gamma=0, k=256$ can almost be considered a "free lunch"—it essentially achieves dimensionality reduction without loss of performance.
I also tried fine-tuning $\beta, \gamma$. On some tasks, it indeed yielded better results than the two combinations mentioned above. However, fine-tuning requires labeled data, which might be controversial in an unsupervised context, so I will not demonstrate that here. If the original sentence vector model was already obtained through supervised training and BERT-whitening is used solely for dimensionality reduction, then it is perfectly appropriate to fine-tune $\beta, \gamma,$ and $k$ using a validation set.
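As a hedged sketch of that supervised tuning step (reusing `whitening_stats_bg` and `transform` from above; the grid values and the validation arrays `val_a`, `val_b`, `val_labels` are placeholders of my own, not taken from the original experiments), one could grid-search $\beta$, $\gamma$, and $k$ by the Spearman correlation on a labeled validation set:

```python
from itertools import product
from scipy.stats import spearmanr

def batch_cosine(a, b):
    """Cosine similarity between corresponding rows of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

def search_beta_gamma_k(val_a, val_b, val_labels,
                        betas=(0.0, 0.5, 1.0), gammas=(0.0, 0.5, 1.0),
                        ks=(128, 256, 768)):
    """Grid-search (beta, gamma, k) by Spearman correlation on a validation set.
    val_a, val_b: (M, d) vectors of the two sentences in each pair;
    val_labels: (M,) gold similarity scores."""
    best = (-np.inf, None)
    for beta, gamma, k in product(betas, gammas, ks):
        # Fit the transform on all validation vectors, then apply it to both sides.
        shift, W = whitening_stats_bg(np.vstack([val_a, val_b]), beta, gamma)
        za = transform(val_a, shift, W)[:, :k]
        zb = transform(val_b, shift, W)[:, :k]
        score = spearmanr(batch_cosine(za, zb), val_labels).correlation
        if score > best[0]:
            best = (score, (beta, gamma, k))
    return best
```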
This article introduces two hyperparameters to give BERT-whitening a degree of tuning space, making it possible to achieve results that are "no worse than before the transformation" while retaining the ability for dimensionality reduction. In other words, even for pre-trained sentence vector models, we can use the new BERT-whitening to reduce their dimensions while keeping the performance essentially unchanged, and sometimes even better!