By 苏剑林 | January 11, 2021
BERT-flow originates from the paper "On the Sentence Embeddings from Pre-trained Language Models", which was accepted to EMNLP 2020. It primarily uses a flow model to calibrate the distribution of sentence embeddings produced by BERT, thereby making the calculated cosine similarity more reasonable. Due to my habit of regularly browsing arXiv, I saw the paper when it was first uploaded, but I didn't find it particularly interesting at the time. Unexpectedly, it gained significant popularity recently, with numerous interpretations appearing on WeChat public accounts and Zhihu within a short period. I believe many readers have probably seen it in their feeds.
From the experimental results, BERT-flow indeed achieves a new SOTA. However, regarding this result, my first impression was: something is not quite right! Of course, I'm not saying there's an issue with the results, but based on my understanding, it is unlikely that the flow model itself is playing the critical role. Following this intuition, I performed some analysis, and as expected, I found that while the logic behind BERT-flow is sound, a simple linear transformation can achieve similar effects. The flow model is not essential.
Assumptions of Cosine Similarity
Generally, for semantic similarity comparison or retrieval, we calculate a sentence vector for each sentence and then calculate the cosine of the angle between them for comparison or ranking. Have we ever considered this question: What assumptions does cosine similarity make about the input vectors? Or, under what conditions do vectors perform better when compared using cosine similarity?
We know that the geometric meaning of the inner product of two vectors $\boldsymbol{x}, \boldsymbol{y}$ is "the product of their respective magnitudes and the cosine of the angle between them." Therefore, cosine similarity is the inner product of two vectors divided by their respective magnitudes. The corresponding coordinate calculation formula is:
\begin{equation}\cos(\boldsymbol{x},\boldsymbol{y}) = \frac{\sum\limits_{i=1}^d x_i y_i}{\sqrt{\sum\limits_{i=1}^d x_i^2} \sqrt{\sum\limits_{i=1}^d y_i^2}}\label{eq:cos}\end{equation}
However, do not forget one thing: the above equality holds only under an "orthonormal basis." In other words, the "cosine of the angle" of vectors itself has a distinct geometric meaning, but the right side of the above equation is merely a coordinate operation. Coordinates depend on the chosen coordinate basis. Different bases lead to different coordinate formulas for the inner product, and thus different coordinate formulas for the cosine value.
Therefore, assuming that BERT sentence embeddings already contain sufficient semantics (e.g., the original sentence can be reconstructed from them), if they perform poorly when formula \eqref{eq:cos} is used to calculate cosine similarity, the reason might be that the basis in which the sentence vectors are expressed is not an orthonormal basis. So how do we know which basis was actually used? In principle, there is no way to know, but we can guess. The heuristic is that when choosing a basis for a set of vectors, we tend to use each basis vector as evenly as possible; from a statistical perspective, this manifests as each component being used independently and uniformly. If such a basis is orthonormal, the corresponding vector set should exhibit "isotropy."
Of course, this is not a rigorous derivation, just a heuristic guide. It tells us that if a set of vectors satisfies isotropy, we can consider it to originate from an orthonormal basis, in which case using equation \eqref{eq:cos} for similarity is appropriate. Conversely, if it does not satisfy isotropy, we can find a way to make it more isotropic and then use equation \eqref{eq:cos} for similarity. BERT-flow happened to think of the "flow model" as the solution.
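As a rough sanity check of this heuristic, one can estimate how far a set of embeddings is from isotropy by computing the average cosine similarity between randomly paired vectors; for an isotropic set this average should be close to 0. A minimal Numpy sketch (the function name and the assumption that `embs` holds pre-computed sentence vectors are mine, for illustration only):

import numpy as np

def mean_pairwise_cosine(embs, n_pairs=10000, seed=0):
    """embs: [num_samples, embedding_size] array of sentence vectors.
    Estimate anisotropy as the average cosine similarity of random pairs;
    a value far from 0 suggests the embedding set is not isotropic.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(embs), n_pairs)
    j = rng.integers(0, len(embs), n_pairs)
    a, b = embs[i], embs[j]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return cos.mean()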
Ramblings on Flow Models
In my view, the flow model is truly a type of model that makes one feel "it's complicated." Ramblings about it could fill several pages, so I will try to be brief here. In mid-2018, OpenAI released the Glow model, which looked very effective. This attracted me to learn more about flow models, and I even implemented Glow once. Related work is recorded in "Flow as Water: NICE - Basic Concepts and Implementation of Flow Models" and "Flow as Water: RealNVP and Glow - Inheritance and Sublimation of Flow Models". If you are not familiar with flow models, feel free to check those posts. Simply put, a flow model is a vector transformation model that can transform the distribution of input data into a standard normal distribution. Obviously, a standard normal distribution is isotropic, which is why BERT-flow chose the flow model.
So, is there any problem with flow models? Actually, I already complained about this in the article "Flow as Water: Invertible ResNet - The Ultimate Brutal Beauty". Let me repeat it here:
(Flow models) utilize clever designs to ensure that the inverse transformation of each layer is simple and the Jacobian matrix is triangular, making the Jacobian determinant easy to calculate. Such models are theoretically elegant and beautiful, but there is a serious problem: to ensure simple inverse transformations and easy Jacobian determinant calculations, the non-linear transformation capability of each layer is very weak. In fact, in models like Glow, only half of the variables are transformed in each layer. Therefore, to ensure sufficient fitting capability, the model must be stacked very deep (e.g., for 256x256 face generation, the Glow model stacked about 600 convolutional layers with 200 million parameters), resulting in massive computational costs.
Coming back to the topic, readers can now understand why I said at the beginning that BERT-flow felt "off." The above complaint tells us that flow models are actually quite weak. Then what is the size of the flow model used in BERT-flow? It is a Glow model with level=2 and depth=3. You may not have a feel for these two parameters, but suffice it to say the model is so small that it adds almost no computational cost. Thus, my "off" intuition was:
The flow model itself is very weak, and the flow model used in BERT-flow is even weaker. Therefore, it is unlikely that the flow model plays a vital role in BERT-flow. Conversely, perhaps we can find a simpler and more direct method to achieve the effect of BERT-flow.
Standardizing the Covariance Matrix
After exploration, I did find such a method. As the title of this post suggests, it is just a linear transformation.
The idea is simple. We know that the mean of a standard normal distribution is 0 and its covariance matrix is the identity matrix. So, why don't we try transforming the mean of sentence vectors to 0 and the covariance matrix to the identity matrix? Suppose the set of (row) vectors is $\{\boldsymbol{x}_i\}_{i=1}^N$. We perform the transformation:
\begin{equation}\tilde{\boldsymbol{x}}_i = (\boldsymbol{x}_i - \boldsymbol{\mu})\boldsymbol{W}\end{equation}
such that the mean of $\{\tilde{\boldsymbol{x}}_i\}_{i=1}^N$ is 0 and the covariance matrix is the identity matrix. Readers with a background in traditional data mining may recognize this as the classic "whitening" operation, which is why I call this method BERT-whitening.
Setting the mean to 0 is simple: let $\boldsymbol{\mu}=\frac{1}{N}\sum\limits_{i=1}^N \boldsymbol{x}_i$. The difficulty lies in solving for the matrix $\boldsymbol{W}$. We denote the covariance matrix of the original data as:
\begin{equation}\boldsymbol{\Sigma}=\frac{1}{N}\sum\limits_{i=1}^N (\boldsymbol{x}_i - \boldsymbol{\mu})^{\top}(\boldsymbol{x}_i - \boldsymbol{\mu})=\left(\frac{1}{N}\sum\limits_{i=1}^N \boldsymbol{x}_i^{\top}\boldsymbol{x}_i\right) - \boldsymbol{\mu}^{\top}\boldsymbol{\mu}\end{equation}
Then it is not hard to obtain the covariance matrix of the transformed data as $\tilde{\boldsymbol{\Sigma}}=\boldsymbol{W}^{\top}\boldsymbol{\Sigma}\boldsymbol{W}$. Thus, we essentially need to solve the equation:
\begin{equation}\boldsymbol{W}^{\top}\boldsymbol{\Sigma}\boldsymbol{W}=\boldsymbol{I}\quad\Rightarrow \quad \boldsymbol{\Sigma} = \left(\boldsymbol{W}^{\top}\right)^{-1}\boldsymbol{W}^{-1} = \left(\boldsymbol{W}^{-1}\right)^{\top}\boldsymbol{W}^{-1}\end{equation}
We know that the covariance matrix $\boldsymbol{\Sigma}$ is a positive semi-definite symmetric matrix; when there is enough data it is usually positive definite, and it admits an SVD (which here coincides with its eigendecomposition) of the form:
\begin{equation}\boldsymbol{\Sigma} = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^{\top}\end{equation}
where $\boldsymbol{U}$ is an orthogonal matrix and $\boldsymbol{\Lambda}$ is a diagonal matrix with positive diagonal elements. Therefore, letting $\boldsymbol{W}^{-1}=\sqrt{\boldsymbol{\Lambda}}\boldsymbol{U}^{\top}$ completes the solution:
\begin{equation}\boldsymbol{W} = \boldsymbol{U}\sqrt{\boldsymbol{\Lambda}^{-1}}\end{equation}
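Indeed, substituting this back confirms that the constraint is satisfied:
\begin{equation}\left(\boldsymbol{W}^{-1}\right)^{\top}\boldsymbol{W}^{-1} = \boldsymbol{U}\sqrt{\boldsymbol{\Lambda}}\sqrt{\boldsymbol{\Lambda}}\boldsymbol{U}^{\top} = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^{\top} = \boldsymbol{\Sigma}\end{equation}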
Reference Numpy code:

import numpy as np

def compute_kernel_bias(vecs):
    """vecs: [num_samples, embedding_size]
    Returns the kernel W and bias -mu; the whitening transform is y = (x + bias) @ kernel.
    """
    mu = vecs.mean(axis=0, keepdims=True)   # mean vector
    cov = np.cov(vecs.T)                    # covariance matrix
    u, s, vh = np.linalg.svd(cov)           # SVD: cov = u @ diag(s) @ vh
    W = np.dot(u, np.diag(1 / np.sqrt(s)))  # W = U * Lambda^{-1/2}
    return W, -mu
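Applying the transformation is equally simple: add the bias, multiply by the kernel, and then L2-normalize so that cosine similarity reduces to a dot product. A minimal usage sketch (the helper name `transform_and_normalize` and the variable `vecs` are placeholders of mine):

def transform_and_normalize(vecs, kernel, bias):
    """Apply y = (x + bias) @ kernel, then L2-normalize the rows."""
    vecs = np.dot(vecs + bias, kernel)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Example usage, assuming vecs holds the raw sentence vectors:
# kernel, bias = compute_kernel_bias(vecs)
# vecs_white = transform_and_normalize(vecs, kernel, bias)
# sims = vecs_white @ vecs_white.T  # pairwise cosine similarities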
Some might ask how to handle very large corpora. Note that the algorithm only needs the mean vector $\boldsymbol{\mu}\in\mathbb{R}^{d}$ and the covariance matrix $\boldsymbol{\Sigma}\in\mathbb{R}^{d\times d}$ of all sentence vectors ($d$ is the embedding dimension). The mean $\boldsymbol{\mu}$ of all sentence vectors $\boldsymbol{x}_i$ can be computed recursively:
\begin{equation}\boldsymbol{\mu}_{n+1} = \frac{n}{n+1}\boldsymbol{\mu}_{n} + \frac{1}{n+1}\boldsymbol{x}_{n+1}\end{equation}
Similarly, the covariance matrix $\boldsymbol{\Sigma}$ is just the mean of all $\boldsymbol{x}_i^{\top}\boldsymbol{x}_i$ minus $\boldsymbol{\mu}^{\top}\boldsymbol{\mu}$, which naturally can also be calculated recursively:
\begin{equation}\boldsymbol{\Sigma}_{n+1} = \frac{n}{n+1}\left(\boldsymbol{\Sigma}_{n}+\boldsymbol{\mu}_{n}^{\top}\boldsymbol{\mu}_{n}\right) + \frac{1}{n+1}\boldsymbol{x}_{n+1}^{\top}\boldsymbol{x}_{n+1}-\boldsymbol{\mu}_{n+1}^{\top}\boldsymbol{\mu}_{n+1}\end{equation}
Since both quantities can be computed recursively, $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ can be obtained within limited memory, so BERT-whitening poses no problem even for large corpora.
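For reference, here is a sketch of a streaming implementation of the two recursions above (in practice one would accumulate whole batches at a time, but the single-vector version keeps the correspondence with the formulas obvious):

def incremental_mu_sigma(vec_stream, d):
    """vec_stream: an iterable yielding sentence vectors of dimension d.
    Returns mu and Sigma computed via the recursions above, using O(d^2) memory.
    """
    mu = np.zeros(d)
    sigma = np.zeros((d, d))
    n = 0
    for x in vec_stream:
        x = np.asarray(x, dtype=np.float64)
        second_moment = sigma + np.outer(mu, mu)  # running mean of the d x d outer products
        mu = n / (n + 1) * mu + x / (n + 1)       # updated mean
        second_moment = n / (n + 1) * second_moment + np.outer(x, x) / (n + 1)
        sigma = second_moment - np.outer(mu, mu)  # updated covariance
        n += 1
    return mu, sigma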
Comparison with BERT-flow
Now, we can test the effect of BERT-whitening. To compare it with BERT-flow, I used bert4keras to conduct tests on the STS-B task. The reference script is at:
Github Link: https://github.com/bojone/BERT-whitening
The comparison of results is as follows:
\[
\begin{array}{l|c}
\hline
& \text{STS-B} \\
\hline
\text{BERT}_{\text{base}}\text{-last2avg (Paper result)} & 59.04 \\
\text{BERT}_{\text{base}}\text{-flow (target, Paper result)} & 70.72 \\
\text{BERT}_{\text{base}}\text{-last2avg (My reproduction)} & 59.04 \\
\text{BERT}_{\text{base}}\text{-whitening (target, My implementation)} & 71.20 \\
\hline
\text{BERT}_{\text{large}}\text{-last2avg (Paper result)} & 59.56 \\
\text{BERT}_{\text{large}}\text{-flow (target, Paper result)} & 72.26 \\
\text{BERT}_{\text{large}}\text{-last2avg (My reproduction)} & 59.59 \\
\text{BERT}_{\text{large}}\text{-whitening (target, My implementation)} & 71.98 \\
\hline
\end{array}
\]
As can be seen, simple BERT-whitening indeed achieves results comparable to BERT-flow. Besides STS-B, my colleagues have conducted similar comparisons on in-house Chinese business data, and the results all indicate that the improvement brought by BERT-flow is similar to that of BERT-whitening. This suggests that introducing the flow model might not be that necessary: flow layers are not standard layers, require specialized implementation, and take a certain amount of work to train. In contrast, BERT-whitening is easy to implement (just a linear transformation) and can be readily applied to any sentence vector model. (Of course, if one wants to argue, one could say that whitening is itself a flow model realized via a linear transformation...)
Note: the "last2avg" setting in the BERT-flow paper originally meant averaging the outputs of the last two layers, but its code actually averages the "first layer + last layer" outputs. Related discussion can be found in this issue.
Dimensionality Reduction Can Be Even Better
Now we know that the transformation matrix for BERT-whitening, $\boldsymbol{W} = \boldsymbol{U}\sqrt{\boldsymbol{\Lambda}^{-1}}$, can transform the covariance matrix of data into an identity matrix. What if we don't consider $\sqrt{\boldsymbol{\Lambda}^{-1}}$ and simply use $\boldsymbol{U}$ for the transformation? It's not hard to see that if we only use $\boldsymbol{U}$, the data covariance matrix becomes $\boldsymbol{\Lambda}$, which is diagonal.
As mentioned before, $\boldsymbol{U}$ is an orthogonal matrix. It merely rotates the entire data without changing the relative positions between samples; in other words, it is a completely "faithful" transformation. The diagonal elements of $\boldsymbol{\Lambda}$ measure the variance of the data in that specific dimension. If a value is very small, it means the variation in that dimension is negligible, nearly constant. This implies the original sentence vectors might actually reside in a lower-dimensional space. We can remove these dimensions, achieving dimensionality reduction while making the cosine similarity results more reasonable.
In fact, the diagonal matrix $\boldsymbol{\Lambda}$ from SVD is already sorted from largest to smallest. So we only need to keep the first few dimensions to achieve this dimensionality reduction effect. Readers familiar with linear algebra should realize that this operation is exactly PCA! The code only needs one line of modification:
W = np.dot(u[:, :k], np.diag(1 / np.sqrt(s[:k])))
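Equivalently, one can compute the full kernel as before and keep only its first $k$ columns, since the $j$-th column of $\boldsymbol{W}$ is just $\boldsymbol{u}_j/\sqrt{s_j}$. A usage sketch, with `k` a hypothetical target dimension and the helpers from the earlier snippets:

k = 256                                    # hypothetical target dimension
kernel, bias = compute_kernel_bias(vecs)   # full kernel, as before
kernel = kernel[:, :k]                     # keep the k highest-variance directions
vecs_reduced = transform_and_normalize(vecs, kernel, bias)  # [num_samples, k]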
The results are as follows:
\[
\begin{array}{l|c}
\hline
& \text{STS-B} \\
\hline
\text{BERT}_{\text{base}}\text{-last2avg (Paper result)} & 59.04 \\
\text{BERT}_{\text{base}}\text{-flow (target, Paper result)} & 70.72 \\
\text{BERT}_{\text{base}}\text{-last2avg (My reproduction)} & 59.04 \\
\text{BERT}_{\text{base}}\text{-whitening (target, My implementation)} & 71.20 \\
\text{BERT}_{\text{base}}\text{-whitening-256 (target, My implementation)} & 71.42 \\
\hline
\text{BERT}_{\text{large}}\text{-last2avg (Paper result)} & 59.56 \\
\text{BERT}_{\text{large}}\text{-flow (target, Paper result)} & 72.26 \\
\text{BERT}_{\text{large}}\text{-last2avg (My reproduction)} & 59.59 \\
\text{BERT}_{\text{large}}\text{-whitening (target, My implementation)} & 71.98 \\
\text{BERT}_{\text{large}}\text{-whitening-384 (target, My implementation)} & 72.66 \\
\hline
\end{array}
\]
From the table above, it can be seen that keeping only the first 256 dimensions of the 768-dimensional base version actually improves performance, and the dimensionality reduction will certainly speed up vector retrieval considerably. Similarly, keeping only the first 384 dimensions of the 1024-dimensional large version improves performance while reducing dimensionality. This result suggests that unsupervised sentence vectors are trained to be "general-purpose": for applications within a specific domain, they contain many redundant features, and eliminating these redundant features often improves both speed and effectiveness.
By contrast, flow models are reversible and do not reduce dimensionality. While this is an advantage in some scenarios, it is a disadvantage in many others because it cannot eliminate redundant dimensions, limiting performance. For example, research on GANs shows that a 1024x1024 face image can be randomly generated from a 256-dimensional Gaussian vector, indicating that these face images actually constitute a fairly low-dimensional manifold. However, if a flow model is used, because reversibility must be guaranteed, one would be forced to use a 1024x1024x3 dimensional Gaussian vector for random generation, which greatly increases computational costs and hampers performance.
(Note: For subsequent experimental results, please see "Which Unsupervised Semantic Similarity is Strongest? We Conducted a Comprehensive Evaluation".)
So the Final Conclusion is
The experiments so far show that a simple linear transformation (BERT-whitening) can basically match the performance of BERT-flow. This indicates that introducing a flow model into a sentence vector model might not be critical: its correction of the distribution may only be superficial, and directly correcting the covariance matrix of the sentence vectors with a linear transformation achieves a similar effect. Meanwhile, BERT-whitening also supports dimensionality reduction, bringing gains in both speed and effectiveness.