By 苏剑林 | November 02, 2022
In the field of text similarity, there are two common approaches: "interaction-based" and "representation-based." Many readers are likely familiar with these. I previously wrote an article, "CoSENT (Part 2): How Large Is the Gap Between Representation-based and Interaction-based Matching?", comparing their performance. Generally speaking, interaction-based similarity models deliver better results, but using them directly for large-scale retrieval is impractical; representation-based similarity models, on the other hand, retrieve much faster but perform slightly worse.
Therefore, how to improve the retrieval speed of interaction-based similarity while maintaining its performance is a subject that the academic community has been studying for a long time. Recently, the paper "Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization" proposed a new solution: CUR decomposition.
In a retrieval scenario, we generally have a massive candidate set $\mathcal{K}$. Without loss of generality, we can assume $\mathcal{K}$ is fixed. The task of retrieval is to find the handful of results $k \in \mathcal{K}$ most relevant to an arbitrary query $q \in \mathcal{Q}$. Interaction-based similarity models directly train a relevance scoring function $s(q,k)$. In theory, we could calculate $s(q,k)$ for every $k \in \mathcal{K}$ and then sort in descending order, but this means the computational cost of each retrieval is $\mathcal{O}(\|\mathcal{K}\|)$, and since none of the intermediate computation can be cached, the cost is unacceptable.
What keeps the computational cost acceptable is similarity in a matrix-factorization form; for a single sample, this means inner-product similarity and its variants. A classic implementation encodes $q$ and $k$ via an encoder $\boldsymbol{E}$ into two vectors $\boldsymbol{E}(q)$ and $\boldsymbol{E}(k)$, and then computes the inner product $\boldsymbol{E}(q)^{\top}\boldsymbol{E}(k)$; this is the representation-based similarity approach. This scheme has several notable properties: 1. all $\boldsymbol{E}(k)$ can be pre-computed and cached; 2. the inner products of $\boldsymbol{E}(q)$ with all $\boldsymbol{E}(k)$ can be written as a single matrix multiplication, which parallelizes well; 3. tools like Faiss can further speed up retrieval with approximate algorithms.
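To make these properties concrete, here is a minimal NumPy sketch of representation-based retrieval (the `encode` function is a made-up stand-in for a real sentence encoder, not anything from the paper): all candidate vectors are cached offline, and each query costs one encoding plus one matrix multiplication.

```python
import numpy as np

# Hypothetical encoder E(.) standing in for a trained sentence encoder;
# here it just hashes the text into a deterministic random vector.
def encode(text, dim=8):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

candidates = ["doc A", "doc B", "doc C"]

# Offline: pre-compute and cache E(k) for every k in the candidate set K.
E_K = np.stack([encode(k) for k in candidates])            # shape (|K|, d)

# Online: encode the query once, then score all candidates with one matmul.
q_vec = encode("some query")                               # shape (d,)
scores = E_K @ q_vec                                       # inner products, shape (|K|,)
print([candidates[i] for i in np.argsort(-scores)[:2]])    # top-2 candidates
```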
Thus, the idea for accelerating interaction-based similarity retrieval is to transform it into a matrix factorization form. A classic implementation is using a representation-based similarity model to distill the effects of an interaction-based similarity model. The ingenuity of this Google paper lies in the fact that it does not introduce any new models, but rather utilizes CUR decomposition directly on the original interaction-based similarity model to achieve acceleration. This scheme is named ANNCUR.
CUR decomposition is a type of matrix factorization. When matrix factorization is mentioned, many readers' first thought is probably SVD. In fact, the reason SVD comes to mind so readily is not that it is particularly intuitive or easy to understand, but simply that it gets introduced so often. In terms of being intuitive and easy to understand, CUR decomposition is clearly superior.
Actually, we can understand SVD and CUR decomposition from a unified perspective: for a scoring function $s(q,k)$, we hope to construct the following approximation: \begin{equation}s(q,k) \approx \sum_{u\in\mathcal{U},v\in\mathcal{V}} f(q, u) g(u, v) h(v, k)\label{eq:decom}\end{equation} Generally, there are constraints $\|\mathcal{U}\|,\|\mathcal{V}\|\ll \|\mathcal{Q}\|,\|\mathcal{K}\|$, making it a compressed decomposition. We can think of $\mathcal{U}$ as a "representative set" (or "clustering centers") of $\mathcal{K}$, and similarly $\mathcal{V}$ as a "representative set" of $\mathcal{Q}$. Then the above decomposition becomes very vivid:
The score $s(q,k)$ between $q$ and $k$ is approximated by first calculating the score $f(q, u)$ between $q$ and the "representatives" $u\in \mathcal{U}$ of $\mathcal{K}$, then calculating the score $h(v, k)$ between $k$ and the "representatives" $v\in \mathcal{V}$ of $\mathcal{Q}$, and finally performing a weighted summation through the weights $g(u, v)$.
In other words, the direct interaction between $q$ and $k$ is transformed into their respective interactions with "representatives," followed by a weighting of the results. The advantage is obvious: once $f, g, h$ are determined, all $g(u,v)$ and $h(v,k)$ can be pre-calculated and cached as a matrix. Then, for each retrieval, we only need to calculate $\|\mathcal{U}\|$ instances of $f(q,u)$ and then perform one matrix multiplication (inner product-based retrieval). Thus, the retrieval computation is converted from $\mathcal{O}(\|\mathcal{K}\|)$ to $\mathcal{O}(\|\mathcal{U}\|)$ (with tools like Faiss, inner product-based retrieval can be optimized to approximately $\mathcal{O}(1)$, and thus can be ignored).
Assuming the query set $\mathcal{Q}$ is also finite, all $s(q,k)$ together constitute a $\|\mathcal{Q}\|\times \|\mathcal{K}\|$ matrix $\boldsymbol{S}$. Correspondingly, $f(q, u), g(u, v), h(v, k)$ correspond to a $\|\mathcal{Q}\|\times \|\mathcal{U}\|$ matrix $\boldsymbol{F}$, a $\|\mathcal{U}\|\times \|\mathcal{V}\|$ matrix $\boldsymbol{G}$, and a $\|\mathcal{V}\|\times \|\mathcal{K}\|$ matrix $\boldsymbol{H}$. Equation \eqref{eq:decom} then becomes a matrix factorization: \begin{equation}\begin{array}{ccccc} \boldsymbol{S} & \approx & \boldsymbol{F} & \boldsymbol{G} & \boldsymbol{H} \\ \in\mathbb{R}^{\|\mathcal{Q}\|\times \|\mathcal{K}\|} & & \in\mathbb{R}^{\|\mathcal{Q}\|\times \|\mathcal{U}\|} & \in\mathbb{R}^{\|\mathcal{U}\|\times \|\mathcal{V}\|} & \in\mathbb{R}^{\|\mathcal{V}\|\times \|\mathcal{K}\|} \end{array}\label{eq:m-decom}\end{equation}
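Spelled out for a single query $q$ (this is just one row of the factorization above, restated for clarity), the retrieval-time computation is a short row vector of scores against the representatives multiplied by a cached matrix: \begin{equation}s(q,\cdot) \approx \boldsymbol{f}_q\, \boldsymbol{G}\boldsymbol{H},\qquad \boldsymbol{f}_q := \big[f(q,u)\big]_{u\in\mathcal{U}}\in\mathbb{R}^{1\times \|\mathcal{U}\|},\qquad \boldsymbol{G}\boldsymbol{H}\in\mathbb{R}^{\|\mathcal{U}\|\times \|\mathcal{K}\|}\ \text{(pre-computed and cached)}.\end{equation}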
If $\boldsymbol{G}$ is restricted to being a diagonal matrix and no special restrictions are placed on $\boldsymbol{F}$ and $\boldsymbol{H}$, the corresponding decomposition is SVD. SVD effectively conjures up a set of virtual "representatives" so that the final fit is as good as possible. However, since these "representatives" are constructed by the algorithm itself, it is hard to say what they specifically mean; in other words, the interpretability is somewhat lacking.
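As a quick numerical illustration (a toy NumPy check on a random matrix, not tied to the paper), truncated SVD produces exactly such a three-factor form with a diagonal middle factor:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((50, 80))   # a toy score matrix of shape |Q| x |K|

r = 10                              # number of virtual "representatives"
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
F, G, H = U[:, :r], np.diag(sigma[:r]), Vt[:r, :]    # middle factor G is diagonal

print(np.linalg.norm(S - F @ G @ H) / np.linalg.norm(S))   # relative error of the rank-r fit
```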
CUR decomposition is more direct: it posits that the "representatives" should be members of the original population. That is, the "representatives" of $\mathcal{Q}$ and $\mathcal{K}$ should be subsets chosen from the sets themselves, i.e., $\mathcal{U}\subset \mathcal{K}, \mathcal{V}\subset\mathcal{Q}$. In this way, $(q,u)$ and $(v,k)$ are among the original $(q,k)$ pairs, and therefore we can continue to use the original scoring function $s$, namely: \begin{equation}s(q,k) \approx \sum_{u\in\mathcal{U},v\in\mathcal{V}} s(q, u) g(u, v) s(v, k)\end{equation} Consequently, only the function $g(u,v)$ remains to be determined. From the perspective of matrix factorization, $\boldsymbol{F}$ in Equation \eqref{eq:m-decom} is now a submatrix composed of several columns of $\boldsymbol{S}$, and $\boldsymbol{H}$ is a submatrix composed of several rows of $\boldsymbol{S}$. The only thing left to calculate is the matrix $\boldsymbol{G}$.

The calculation of $\boldsymbol{G}$ is also straightforward. Let us first consider a very special case where $\mathcal{U}=\mathcal{K}$, $\mathcal{V}=\mathcal{Q}$ and $\|\mathcal{Q}\|=\|\mathcal{K}\|$. At this point, the CUR decomposition is $\boldsymbol{S}\approx \boldsymbol{S}\boldsymbol{G}\boldsymbol{S}$, where $\boldsymbol{S}$ and $\boldsymbol{G}$ are square matrices. Since we have taken the entirety of $\mathcal{Q}$ and $\mathcal{K}$ as representatives, we naturally hope for $=$ instead of $\approx$. If $=$ holds, we can directly solve for $\boldsymbol{G} = \boldsymbol{S}^{-1}$.
However, this implies that $\boldsymbol{S}$ must be invertible, which is not always the case. In such cases, the matrix inverse must be generalized; the generalization is called the "pseudo-inverse," and we write $\boldsymbol{G}=\boldsymbol{S}^{\dagger}$. The pseudo-inverse is also defined for non-square matrices, so when $\|\mathcal{Q}\|\neq\|\mathcal{K}\|$ we can likewise solve for $\boldsymbol{G}=\boldsymbol{S}^{\dagger}$. Finally, when $\mathcal{U}\neq\mathcal{K}$ or $\mathcal{V}\neq\mathcal{Q}$, the result is similar, except that the matrix whose pseudo-inverse is taken becomes the intersection of $\boldsymbol{F}$ and $\boldsymbol{H}$ (i.e., the $\|\mathcal{V}\|\times \|\mathcal{U}\|$ matrix formed by the entries of $\boldsymbol{S}$ where the selected rows and columns intersect): \begin{equation} \boldsymbol{S} \approx \boldsymbol{F} (\boldsymbol{F}\cap \boldsymbol{H})^{\dagger}\boldsymbol{H}\end{equation}
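Here is a minimal NumPy sketch of this construction (a toy illustration under the assumption that the score matrix is approximately low rank; it is not the paper's code): select column representatives $\mathcal{U}$ and row representatives $\mathcal{V}$, take the pseudo-inverse of their intersection, and check the quality of the approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
# A toy (exactly) low-rank score matrix S of shape |Q| x |K|.
S = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 80))

U_idx = rng.choice(80, size=12, replace=False)   # representatives of K (columns of S)
V_idx = rng.choice(50, size=12, replace=False)   # representatives of Q (rows of S)

F = S[:, U_idx]               # |Q| x |U|: every q scored against the representatives of K
H = S[V_idx, :]               # |V| x |K|: the representatives of Q scored against every k
W = S[np.ix_(V_idx, U_idx)]   # |V| x |U|: the "intersection" of F and H
G = np.linalg.pinv(W)         # |U| x |V|: the middle factor of the CUR decomposition

print(np.linalg.norm(S - F @ G @ H) / np.linalg.norm(S))   # ~0 here, since S is exactly low rank
```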
The entire process is illustrated below:
CUR Decomposition Diagram
In fact, this is not the first time this blog has covered CUR decomposition. Last year's article "Nyströmformer: A Linearized Attention Scheme Based on Matrix Decomposition" introduced Nyströmformer, which was also designed around the idea of CUR decomposition, and whose original paper devoted quite a bit of space to introducing it. Now ANNCUR uses CUR decomposition for retrieval acceleration, which shows that CUR decomposition has no shortage of applications.
The principle of the acceleration was sketched above; let us summarize it. First, we select a number of representatives $q\in \mathcal{V}\subset \mathcal{Q}$ and $k\in \mathcal{U}\subset \mathcal{K}$, compute their pairwise scores to form the intersection matrix $\boldsymbol{F}\cap\boldsymbol{H}$, and obtain the matrix $\boldsymbol{G}$ by taking its pseudo-inverse. Next, we compute in advance the scoring matrix $\boldsymbol{H}$ between $q\in \mathcal{V}$ and $k\in \mathcal{K}$, and cache the product $\boldsymbol{G}\boldsymbol{H}$. Finally, for each query $q$, we compute its scores against every $k\in \mathcal{U}$ to obtain a $\|\mathcal{U}\|$-dimensional vector, and multiply this vector by the cached matrix $\boldsymbol{G}\boldsymbol{H}$ to obtain the (approximate) scores of $q$ against every $k\in \mathcal{K}$.
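For concreteness, here is a schematic NumPy version of this offline/online split (the scorer `s` below is a made-up stand-in for a real cross-encoder, and the representatives are chosen at random; treat it as a sketch rather than the paper's implementation):

```python
import numpy as np

def s(q, k):
    """Stand-in for the interaction-based (cross-encoder) scorer s(q, k)."""
    return float(q @ k / (1.0 + np.abs(q - k).sum()))

rng = np.random.default_rng(0)
K = [rng.standard_normal(8) for _ in range(1000)]             # candidate set
U = [K[i] for i in rng.choice(len(K), 32, replace=False)]     # representatives of K
V = [rng.standard_normal(8) for _ in range(32)]               # representative queries

# ---- Offline: build and cache G @ H ----
W  = np.array([[s(v, u) for u in U] for v in V])   # |V| x |U| intersection matrix
H  = np.array([[s(v, k) for k in K] for v in V])   # |V| x |K| scores of representatives vs. all k
GH = np.linalg.pinv(W) @ H                         # |U| x |K|, computed once and cached

# ---- Online: each query needs only |U| calls to the scorer ----
def retrieve(q, topn=10):
    f_q = np.array([s(q, u) for u in U])           # scores against the representatives of K
    approx = f_q @ GH                              # approximate s(q, k) for every k
    return np.argsort(-approx)[:topn]              # indices of the top-n candidates

print(retrieve(rng.standard_normal(8)))
```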
The general principle of ANNCUR is as described above. For specific details, readers can consult the original paper, for example why replacing the interaction-based similarity model with the "[EMB]-CE" version described in the paper yields better results. Some readers might ask: how do we select the representative $q$ and $k$? In most cases they are simply chosen at random. This leaves room for improvement; for example, one could perform clustering and select the points closest to the cluster centers. Such refinements are left to one's own ingenuity. Additionally, it should be pointed out that CUR decomposition is itself an approximation and will inevitably introduce errors, so this acceleration scheme is primarily intended for retrieval scenarios. A characteristic of retrieval is that we care more about top-$k$ recall than top-1 precision: we can use the CUR decomposition to recall a batch of candidates quickly and then use the exact $s(q,k)$ to re-rank them and improve accuracy.
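Continuing the sketch after the previous paragraph (so `retrieve`, `s`, and `K` are as defined there), this recall-then-rerank pattern might look like the following, with the exact scorer applied only to the shortlist:

```python
def retrieve_and_rerank(q, recall=100, topn=10):
    shortlist = retrieve(q, topn=recall)              # cheap approximate recall via the CUR factors
    exact = {int(i): s(q, K[i]) for i in shortlist}   # exact scores on the shortlist only
    return sorted(exact, key=exact.get, reverse=True)[:topn]
```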
Some experimental results diagrams from ANNCUR
This article reviewed CUR decomposition, a form of matrix factorization, and introduced its application to accelerating retrieval with interaction-based similarity models.