Performer: Linearizing Attention Complexity with Random Projections

By 苏剑林 | December 01, 2020

The $\mathcal{O}(n^2)$ complexity of the Attention mechanism has been a long-standing issue. There are two main approaches to changing this complexity: one is the sparsification route, such as the Sparse Attention we introduced previously and Big Bird released by Google a few months ago; the other is the linearization route, parts of which we summarized in "Exploring Linear Attention: Does Attention Need a Softmax?". This article introduces a new piece of improvement work, Performer, from Google's paper "Rethinking Attention with Performers." Its goal is quite ambitious: to linearize the complexity of Attention through random projections without losing accuracy.

To put it bluntly, in an ideal scenario, we wouldn't need to retrain the model, and the output results wouldn't change significantly, but the complexity would drop to $\mathcal{O}(n)$! It sounds like a "pie in the sky" improvement. Is it really that beautiful?

Attention

We know that the general definition of Attention is: \begin{equation}Attention(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})_i = \frac{\sum\limits_{j=1}^n \text{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j)\boldsymbol{v}_j}{\sum\limits_{j=1}^n \text{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j)}\label{eq:gen-att}\end{equation}

For standard Scaled-Dot Attention, $\text{sim}(\boldsymbol{q}, \boldsymbol{k})=e^{\boldsymbol{q}\cdot \boldsymbol{k}}$ (sometimes there is a scaling factor in the exponent, which we won't explicitly write out here). Writing the operation for the entire sequence in matrix form gives: \begin{equation}Attention(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}) = softmax\left(\boldsymbol{Q}\boldsymbol{K}^{\top}\right)\boldsymbol{V}\end{equation} We are mainly interested in the Self Attention scenario, so we generally have $\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}\in\mathbb{R}^{n\times d}$. In the above formula, the $\boldsymbol{Q}\boldsymbol{K}^{\top}$ step is equivalent to performing dot products for $n^2$ pairs of vectors, resulting in $n^2$ real numbers. Therefore, both time and space complexity are $\mathcal{O}(n^2)$.
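To make the quadratic cost concrete, here is a minimal NumPy sketch of the matrix form above (single head, scaling factor omitted; the shapes and names are my own). The $n\times n$ similarity matrix is materialized explicitly, which is exactly where the $\mathcal{O}(n^2)$ time and memory go:

```python
import numpy as np

def standard_attention(Q, K, V):
    # sim(q, k) = exp(q . k), normalized row-wise -- i.e. softmax(Q K^T) V
    A = np.exp(Q @ K.T)                    # (n, n): n^2 pairwise similarities
    A /= A.sum(axis=1, keepdims=True)      # row-wise normalization (softmax)
    return A @ V                           # (n, d)

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) / d**0.5 for _ in range(3))
out = standard_attention(Q, K, V)          # O(n^2) time and memory
```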

For linear Attention, $\text{sim}(\boldsymbol{q}, \boldsymbol{k})=\phi(\boldsymbol{q})\cdot \varphi(\boldsymbol{k})$, where $\phi,\varphi$ are activation functions with non-negative ranges. In this way, the core computation of Attention (the numerator of Equation $\eqref{eq:gen-att}$) becomes: \begin{equation}\left(\phi(\boldsymbol{Q})\varphi(\boldsymbol{K})^{\top}\right)\boldsymbol{V}=\phi(\boldsymbol{Q})\left(\varphi(\boldsymbol{K})^{\top}\boldsymbol{V}\right)\end{equation} The complexity of the left side of the equation is still $\mathcal{O}(n^2)$. However, since matrix multiplication satisfies the associative law, we can calculate the multiplication of the last two matrices first. This way, the complexity can be reduced to $\mathcal{O}(n)$. For a detailed introduction, readers are encouraged to look at the previous article "Exploring Linear Attention: Does Attention Need a Softmax?"; we won't expand too much here.
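As a sketch of this reassociation trick (not yet the Performer feature map), the snippet below uses $\phi(x)=\varphi(x)=\text{elu}(x)+1$, one common choice of non-negative activation from the linear Attention literature, and checks that the two orderings give the same numerator while only the second avoids the $n\times n$ matrix:

```python
import numpy as np

def phi(X):
    # elu(x) + 1: one common non-negative feature map for linear Attention
    return np.where(X > 0, X + 1.0, np.exp(X))

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))

left  = (phi(Q) @ phi(K).T) @ V        # O(n^2 d): builds an n x n matrix first
right = phi(Q) @ (phi(K).T @ V)        # O(n d^2): builds only a d x d matrix
assert np.allclose(left, right)        # same numerator, by associativity

# Full linear Attention also needs the denominator of the general formula above:
Z = phi(Q) @ phi(K).sum(axis=0)        # (n,): sum_j phi(q_i) . phi(k_j)
attn = right / Z[:, None]
```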

Performer

Now we can enter the introduction of Performer. As stated at the beginning, the starting point of Performer is still standard Attention, so it still uses $\text{sim}(\boldsymbol{q}, \boldsymbol{k})=e^{\boldsymbol{q}\cdot \boldsymbol{k}}$. Then, it hopes to linearize the complexity, which means finding new $\tilde{\boldsymbol{q}}, \tilde{\boldsymbol{k}}$ such that: \begin{equation}\text{sim}(\boldsymbol{q}, \boldsymbol{k}) \approx \tilde{\boldsymbol{q}}\cdot\tilde{\boldsymbol{k}}\end{equation} Finding a reasonable mapping scheme from $\boldsymbol{q},\boldsymbol{k}$ to $\tilde{\boldsymbol{q}},\tilde{\boldsymbol{k}}$ is the greatest difficulty of this approach.

Beautiful Random Mapping

Performer's greatest contribution lies in finding a very beautiful mapping scheme: \begin{equation}\begin{aligned} e^{\boldsymbol{q}\cdot \boldsymbol{k}}&=\mathbb{E}_{\boldsymbol{\omega}\sim \mathcal{N}(\boldsymbol{\omega};0,\boldsymbol{1}_d)}\left[e^{\boldsymbol{\omega}\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \times e^{\boldsymbol{\omega}\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2}\right]\\[6pt] &\approx\underbrace{\frac{1}{\sqrt{m}}\begin{pmatrix}e^{\boldsymbol{\omega}_1\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \\ e^{\boldsymbol{\omega}_2\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2}\\ \vdots\\ e^{\boldsymbol{\omega}_m\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \end{pmatrix}}_{\tilde{\boldsymbol{q}}} \cdot \underbrace{\frac{1}{\sqrt{m}}\begin{pmatrix}e^{\boldsymbol{\omega}_1\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2} \\ e^{\boldsymbol{\omega}_2\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2}\\ \vdots\\ e^{\boldsymbol{\omega}_m\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2} \end{pmatrix}}_{\tilde{\boldsymbol{k}}} \end{aligned}\label{eq:core}\end{equation} Let's analyze what the above equation says. The first equal sign means the two sides are identical. This implies that as long as we sample an infinite number of $\boldsymbol{\omega}$ from the standard normal distribution $\mathcal{N}(\boldsymbol{\omega};0,\boldsymbol{1}_d)$ and then calculate the average of $e^{\boldsymbol{\omega}\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \times e^{\boldsymbol{\omega}\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2}$, the result will equal $e^{\boldsymbol{q}\cdot \boldsymbol{k}}$. Written in integral form, it is: \begin{equation}\begin{aligned} &\frac{1}{(2\pi)^{d/2}}\int e^{-\Vert\boldsymbol{\omega}\Vert^2 / 2}\times e^{\boldsymbol{\omega}\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \times e^{\boldsymbol{\omega}\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2}d\boldsymbol{\omega} \\ =&\frac{1}{(2\pi)^{d/2}}\int e^{-\Vert\boldsymbol{\omega}-\boldsymbol{q}-\boldsymbol{k}\Vert^2 / 2 + \boldsymbol{q}\cdot \boldsymbol{k}}d\boldsymbol{\omega}\\ =&\ e^{\boldsymbol{q}\cdot \boldsymbol{k}} \end{aligned}\end{equation} Of course, in practice, we can only sample a finite number $m$, thus obtaining the second approximation sign. It happens to be represented as the dot product of two $m$-dimensional vectors, which is exactly the $e^{\boldsymbol{q}\cdot \boldsymbol{k}}\approx \tilde{\boldsymbol{q}}\cdot\tilde{\boldsymbol{k}}$ we need! So, with the help of this approximation, we can transform the exponent of the dot product of two $d$-dimensional vectors into the dot product of two $m$-dimensional vectors. Theoretically, we can transform standard Attention with a head_size of $d$ into linear Attention with a head_size of $m$. This is the core idea of the paper.
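A numerical sanity check of Equation $\eqref{eq:core}$ is easy to do: sample a Gaussian matrix, build the two $m$-dimensional vectors, and compare their dot product with $e^{\boldsymbol{q}\cdot \boldsymbol{k}}$. The NumPy sketch below does exactly that; the shapes and constants are my own, and the norms of $\boldsymbol{q},\boldsymbol{k}$ are kept small because the estimator's variance grows quickly with them:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 100_000                   # many samples, so the Monte Carlo estimate is tight

def feature_map(x, W):
    # x -> exp(W x - ||x||^2 / 2) / sqrt(m), rows of W drawn i.i.d. from N(0, I_d)
    return np.exp(W @ x - x @ x / 2.0) / np.sqrt(W.shape[0])

q = 0.3 * rng.normal(size=d)         # modest norms: the estimator's variance
k = 0.3 * rng.normal(size=d)         # grows quickly with ||q|| and ||k||
W = rng.normal(size=(m, d))          # omega_1, ..., omega_m stacked as rows

exact  = np.exp(q @ k)
approx = feature_map(q, W) @ feature_map(k, W)
print(exact, approx)                 # the two numbers should be close for large m
```

In a real model, $m$ then plays the role of the head_size of the resulting linear Attention.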

Discussion on the Derivation Process

Some readers might be interested in the origin of Equation $\eqref{eq:core}$. Let's discuss it further, though readers who are not interested can skip this section. Although Equation $\eqref{eq:core}$ can be verified by directly calculating the integral, can a similar linear approximation be found for any arbitrarily defined $\text{sim}(\boldsymbol{q}, \boldsymbol{k})$? We will show below that a similar linearization scheme can be found through a generalized process, though the results may be far less elegant or effective than Equation $\eqref{eq:core}$.

Specifically, for any $\text{sim}(\boldsymbol{q}, \boldsymbol{k})$, we can rewrite it as: \begin{equation}\text{sim}(\boldsymbol{q}, \boldsymbol{k}) = \frac{\beta(\boldsymbol{q})\gamma(\boldsymbol{k})\text{sim}(\boldsymbol{q}, \boldsymbol{k})}{\beta(\boldsymbol{q})\gamma(\boldsymbol{k})}\end{equation} Then we can perform a Fourier transform on $\beta(\boldsymbol{q})\gamma(\boldsymbol{k})\text{sim}(\boldsymbol{q}, \boldsymbol{k})$: \begin{equation}\mathcal{F}(\boldsymbol{\omega}_q, \boldsymbol{\omega}_k)=\frac{1}{(2\pi)^{d/2}}\int \beta(\boldsymbol{q})\gamma(\boldsymbol{k})\text{sim}(\boldsymbol{q}, \boldsymbol{k})e^{-i\boldsymbol{\omega}_q\cdot \boldsymbol{q}-i\boldsymbol{\omega}_k\cdot \boldsymbol{k}}d\boldsymbol{q}d\boldsymbol{k}\end{equation} The reason for multiplying by $\beta(\boldsymbol{q})\gamma(\boldsymbol{k})$ first is that the Fourier transform of $\text{sim}(\boldsymbol{q}, \boldsymbol{k})$ itself might not have a nice form or might not even exist; multiplying by a suitable function can make the transform possible. For example, we could let $\beta(\boldsymbol{x})=\gamma(\boldsymbol{x})=e^{-\lambda\Vert \boldsymbol{x}\Vert^2}$. As long as $\lambda$ is large enough, many $\text{sim}(\boldsymbol{q}, \boldsymbol{k})$ functions admit a Fourier transform.

Next, we perform the inverse transform and substitute it back into the original equation to get: \begin{equation}\text{sim}(\boldsymbol{q}, \boldsymbol{k})=\frac{1}{(2\pi)^{d/2}}\int \mathcal{F}(\boldsymbol{\omega}_q, \boldsymbol{\omega}_k)\frac{e^{i\boldsymbol{\omega}_q\cdot \boldsymbol{q}}}{\beta(\boldsymbol{q})} \frac{e^{i\boldsymbol{\omega}_k\cdot \boldsymbol{k}}}{\gamma(\boldsymbol{k})}d\boldsymbol{\omega}_q d\boldsymbol{\omega}_k\end{equation} If we can calculate $\mathcal{F}(\boldsymbol{\omega}_q, \boldsymbol{\omega}_k)$ and perform normalization, it can become a distribution from which we can sample. From this, we can sample random vectors $\boldsymbol{\omega}_q, \boldsymbol{\omega}_k$ and transform them into a dot product of vector sequences composed of $\frac{e^{i\boldsymbol{\omega}_q\cdot \boldsymbol{q}}}{\beta(\boldsymbol{q})}$ and $\frac{e^{i\boldsymbol{\omega}_k\cdot \boldsymbol{k}}}{\gamma(\boldsymbol{k})}$. Of course, the calculation here may involve imaginary numbers, while we usually only process real numbers. However, this is not a big problem; we can expand using Euler's formula $e^{i \theta}=\cos\theta + i\sin\theta$ and keep only the real part during the entire calculation process—the form will not change much. Theoretically, the whole process can be carried out without difficulty. However, compared to Equation $\eqref{eq:core}$, the problems are: 1. It now requires sampling two sets of random variables $\boldsymbol{\omega}_q, \boldsymbol{\omega}_k$, which increases the variance of the estimation; 2. After finally keeping the real part, the result will be a combination of $\sin$ and $\cos$, which cannot guarantee non-negativity and requires extra clipping.

The specialty of Equation $\eqref{eq:core}$ is that $e^{\boldsymbol{q}\cdot \boldsymbol{k}}$ can be rewritten as: \begin{equation}e^{\boldsymbol{q}\cdot \boldsymbol{k}} = e^{\Vert \boldsymbol{q}\Vert^2 / 2 + \Vert \boldsymbol{k}\Vert^2 / 2 - \Vert\boldsymbol{q}-\boldsymbol{k}\Vert^2 / 2}\end{equation} So it only needs to be transformed into a problem for a single variable $\boldsymbol{q}-\boldsymbol{k}$. The Fourier transform of $e^{-\Vert\boldsymbol{q}-\boldsymbol{k}\Vert^2 / 2}$ is exactly $e^{-\Vert\boldsymbol{\omega}\Vert^2 / 2}$, so doing the inverse transform we have: \begin{equation}e^{\boldsymbol{q}\cdot \boldsymbol{k}}=\frac{e^{\Vert \boldsymbol{q}\Vert^2 / 2 + \Vert \boldsymbol{k}\Vert^2 / 2}}{(2\pi)^{d/2}}\int e^{-\Vert\boldsymbol{\omega}\Vert^2 / 2 + i \boldsymbol{\omega}\cdot (\boldsymbol{q} - \boldsymbol{k})} d\boldsymbol{\omega}\end{equation} At this point, if we directly take the real part to expand, we also get a combination of $\sin$ and $\cos$, which is the $\text{trig}$ projection scheme mentioned in the original paper. However, there is a more ingenious property that changes everything! Notice that the above equation is an identity, so we can replace $\boldsymbol{q}\to -i\boldsymbol{q},\boldsymbol{k}\to i\boldsymbol{k}$ on both sides. The result is: \begin{equation}e^{\boldsymbol{q}\cdot \boldsymbol{k}}=\frac{e^{-\Vert \boldsymbol{q}\Vert^2 / 2 - \Vert \boldsymbol{k}\Vert^2 / 2}}{(2\pi)^{d/2}}\int e^{-\Vert\boldsymbol{\omega}\Vert^2 / 2 + \boldsymbol{\omega}\cdot (\boldsymbol{q} + \boldsymbol{k})} d\boldsymbol{\omega}\end{equation} This is Equation $\eqref{eq:core}$. The substitution $\boldsymbol{q}\to -i\boldsymbol{q},\boldsymbol{k}\to i\boldsymbol{k}$ keeps the left side unchanged while making the right side completely free of imaginary numbers and maintaining non-negativity. It really feels like a collection of many coincidences—a "unique" solution!
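To see why the non-negativity matters in practice, the hypothetical comparison below (my own construction; the trig map in the code is my reading of the $\sin/\cos$ expansion described above) estimates a very small $e^{\boldsymbol{q}\cdot \boldsymbol{k}}$, i.e. $\boldsymbol{q}\cdot \boldsymbol{k}$ strongly negative, which is where most attention weights live. The trig estimate routinely goes negative and would need clipping, while the positive-feature estimate of Equation $\eqref{eq:core}$ cannot:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 16, 64

def phi_pos(x, W):   # positive random features, as in the mapping above
    return np.exp(W @ x - x @ x / 2.0) / np.sqrt(W.shape[0])

def phi_trig(x, W):  # sin/cos ("trig") random features
    h = np.exp(x @ x / 2.0) / np.sqrt(W.shape[0])
    return h * np.concatenate([np.cos(W @ x), np.sin(W @ x)])

q = rng.normal(size=d)
k = -q + 0.1 * rng.normal(size=d)        # q . k strongly negative => e^{q.k} is tiny
exact = np.exp(q @ k)

trig_est, pos_est = [], []
for _ in range(1000):
    W = rng.normal(size=(m, d))
    trig_est.append(phi_trig(q, W) @ phi_trig(k, W))
    pos_est.append(phi_pos(q, W) @ phi_pos(k, W))

print(exact)                # a very small positive number
print(min(trig_est))        # frequently large and negative: needs clipping
print(min(pos_est))         # always strictly positive, by construction
```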

Orthogonalization to Reduce Variance

In addition to proposing Equation $\eqref{eq:core}$ to linearize standard Attention, the original paper made further enhancements. In Equation $\eqref{eq:core}$, $\boldsymbol{\omega}_1,\boldsymbol{\omega}_2,\cdots,\boldsymbol{\omega}_m$ are independently and identically sampled from $\mathcal{N}(\boldsymbol{\omega};0,\boldsymbol{1}_d)$. The original paper pointed out that if each $\boldsymbol{\omega}_i$ is orthogonalized, the estimation variance can be effectively reduced, improving the average accuracy of a single estimation.

Note that orthogonalization here refers to keeping the norm of each $\boldsymbol{\omega}_i$ unchanged and only performing Gram-Schmidt orthogonalization on the directions. This operation was first proposed in another Google paper, "The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings." Performer spent a full 6 pages of its appendix proving the variance reduction. Reproducing a 6-page proof here is obviously inappropriate, so how should we understand this strategy?

Actually, the most fundamental reason this strategy works is the isotropy of the sampling distribution $\mathcal{N}(\boldsymbol{\omega};0,\boldsymbol{1}_d)$. Its probability density function $(2\pi)^{-d/2}e^{-\Vert\boldsymbol{\omega}\Vert^2 / 2}$ only depends on the norm $\Vert\boldsymbol{\omega}\Vert$, so it is uniform in direction. If we want to reduce the variance of the estimation, we should reduce the randomness of sampling to make the sampling results more uniform. Orthogonalizing the vectors is a way to achieve directional uniformity. In other words, orthogonalizing the $\boldsymbol{\omega}_i$ promotes the uniformity of sampling results, thereby reducing the estimation variance. Furthermore, orthogonalization is generally only effective when $m\leq d$. If $m > d$, the original paper's approach is to group every $d$ vectors and orthogonalize them separately.
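Here is a minimal sketch of how this orthogonalization might be realized (my own, following the description above: orthogonalize the directions within each block of $d$ vectors via QR, while keeping norms distributed like those of ordinary Gaussian samples):

```python
import numpy as np

def orthogonal_gaussian(m, d, rng):
    # Orthogonalize directions within each block of d vectors (QR is Gram-Schmidt
    # in effect), while keeping norms distributed like Gaussian samples.
    blocks = []
    for start in range(0, m, d):
        size = min(d, m - start)
        Q, _ = np.linalg.qr(rng.normal(size=(d, d)))                  # orthonormal rows
        norms = np.linalg.norm(rng.normal(size=(size, d)), axis=1)    # chi_d-distributed norms
        blocks.append(Q[:size] * norms[:, None])
    return np.vstack(blocks)

rng = np.random.default_rng(0)
W = orthogonal_gaussian(m=128, d=64, rng=rng)    # m > d: two blocks, orthogonal within each
G = W[:64] @ W[:64].T
print(np.abs(G - np.diag(np.diag(G))).max())     # ~0: rows of the first block are mutually orthogonal
```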

We can think of the orthogonalization operation as merely making the sampling directions more uniform. If we want to be more thorough, we could also make the sampled norms uniform. Specifically, rewriting the standard normal distribution in hyperspherical coordinates gives the probability element: \begin{equation}\frac{1}{(2\pi)^{d/2}} r^{d-1} e^{-r^2 /2} dr dS\end{equation} where $dS = \sin^{d-2}\varphi_{1} \sin^{d-3}\varphi_{2} \cdots \sin\varphi_{d-2}\,d\varphi_{1}\,d\varphi_{2}\cdots d\varphi_{d-1}$ is the integral element on the unit sphere in $d$ dimensions. The equation shows that the standard normal distribution is uniform in direction, and the probability density function of the norm is proportional to $r^{d-1} e^{-r^2 /2}$. We define its cumulative probability function: \begin{equation}P_d(r\leq R) = \frac{\int_0^R r^{d-1} e^{-r^2 /2} dr}{\int_0^{\infty} r^{d-1} e^{-r^2 /2} dr}\end{equation} If we want to sample $m$ samples, we can let $P_d(r\leq R_i) = \frac{i}{m+1}, \,i=1,2,\cdots,m$, and solve for the $m$ values $R_i$ to use as the norms.
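The density $r^{d-1}e^{-r^2/2}$ is just that of the chi distribution with $d$ degrees of freedom, so these deterministic norms are simply its quantiles. A minimal sketch of the idea, assuming SciPy is available:

```python
import numpy as np
from scipy.stats import chi

d, m = 64, 128
# P_d(r <= R) above is the CDF of the chi distribution with d degrees of freedom,
# so solving P_d(r <= R_i) = i / (m + 1) just means taking its quantiles.
R = chi(df=d).ppf(np.arange(1, m + 1) / (m + 1))
print(R.min(), R.max())      # deterministic norms, clustered around sqrt(d) = 8
```

These $R_i$ could then be paired with orthogonalized directions so that the sampling is made more uniform in both norm and direction.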

Performance and Results

The theoretical introduction ends here. It has already been quite extensive. Generally, as long as one has an understanding of linear Attention, seeing Equation $\eqref{eq:core}$ is enough to grasp the key points of Performer. The rest is supplementary content that doesn't change the main arc. Next, we look at the evaluation of Performer.

Review of the Original Paper

Let's first look at the evaluation in the original paper. The evaluation is rich but a bit messy, potentially leaving NLP readers somewhat confused.

The first is the speed evaluation. Not surprisingly, as the sequence length increases, Performer has a clear advantage over standard Transformers:

Speed comparison between Performer and the standard Transformer (solid lines: Performer; dashed lines: standard Transformer).

Next is a comparison of approximation accuracy, demonstrating the effectiveness of orthogonal sampling and the precision of the proposed Equation $\eqref{eq:core}$ compared to the older $\sin, \cos$ based forms:

Left: effectiveness of orthogonalized sampling vectors; right: approximation precision of Performer's positive features compared with the older trigonometric scheme.

Can we achieve the initial expectation of not having to retrain pre-trained models? Unfortunately, no. The original paper ran two experiments showing that directly loading Transformer weights into Performer does not reproduce the original results, though performance recovers quickly after fine-tuning. The paper did not analyze in depth why the direct transfer fails at first.

Experiment on loading pre-trained Transformer weights into Performer and fine-tuning.

Finally, the paper conducted experiments on protein sequences and images, proving Performer is effective for long sequences; notably, it is more effective than Reformer. The experiments are extensive but slightly disjointed.

Reviews from Other Papers

Perhaps their colleagues couldn't stand the lack of clarity, as Google later released two more papers, "Efficient Transformers: A Survey" and "Long Range Arena: A Benchmark for Efficient Transformers." These systematically evaluated and compared existing methods for improving Transformer efficiency, including Performer. In contrast, the results given by these two papers were much more intuitive, using simple charts to clearly represent the positioning of each model.

Taxonomy of various efficient Transformers.

Comparison table of various improved Transformers. The "decode" column indicates whether future information can be masked, as required for language modeling.

Long Range Arena results: "performance-speed-memory" chart of various Transformer models. The vertical axis is performance, the horizontal axis is speed, and circle size indicates the memory required; theoretically, models closer to the upper right are better, and smaller circles are better.

For more detailed evaluation information, you can check those two papers directly.

Questions and Reflections

Performer seems quite good, but does that mean we can use it to replace Transformers everywhere? Not necessarily. Despite its many merits, Performer still has some unsolvable problems.

First, to better approximate standard Transformers, Performer's $m$ must be relatively large, at least $m > d$, and generally several times $d$. This means that Performer's head_size is significantly larger than a standard Transformer's. Although theoretically, as long as $m$ is fixed, the complexity of Performer with respect to sequence length is linear, the increase in $m$ means the computational cost for short sequences increases significantly. In other words, when using short sequences, Performer's performance might drop. According to my estimation, only when the sequence length exceeds 5000 does Performer show a significant advantage.

Second, currently Performer (and other linear Attention models) is incompatible with relative position encoding, because relative position encoding is applied directly to the Attention matrix. Since Performer doesn't have an Attention matrix, it cannot be added. Furthermore, special Seq2Seq approaches like UniLM cannot be implemented, although ordinary unidirectional language models can. In short, the $\mathcal{O}(n^2)$ Attention matrix actually brings great flexibility. By abandoning it, linear Attention abandons this flexibility as well.

Finally, and what I consider the biggest problem: the idea of Performer is to linearize standard Attention, so why not simply train a linear Attention model directly instead of trying to align with standard Attention? Judging from the last chart in the previous section, Performer shows no massive advantage over Linear Transformer (in fact, I suspect there is an issue with that chart's comparison: Performer may beat Linear Transformer in accuracy, but how could it be faster? Performer is essentially transformed into a Linear Transformer, and with the extra transformation step it should not be faster). The value of Performer is therefore somewhat questionable: linear Attention is much easier to implement and works for both long and short sequences, while Performer is considerably more troublesome to implement and only suits long sequences.

Conclusion

This article introduced Google's new model, Performer, a work that linearizes the complexity of standard Attention through random projections; there is much to learn from it. I also summarized the evaluation results of various improved Transformers and shared my own reflections on Performer.