Transformer Upgrade Journey: 5. Standard Attention as Infinite-Dimensional Linear Attention

By 苏剑林 | August 06, 2021

In "Performer: Linearizing Attention Complexity with Random Projections", we learned about the Performer model proposed by Google. It introduces a random projection scheme that can transform standard Attention into Linear Attention while maintaining a certain level of approximation. Theoretically, as long as the projection dimension is large enough, it can sufficiently approximate standard Attention. In other words, standard Attention can be viewed as an infinite-dimensional Linear Attention.

This article will introduce two other ideas I came up with for converting standard Attention into infinite-dimensional Linear Attention. Unlike Performer's random projection, both schemes are deterministic and give a more intuitive sense of how good the approximation is.

Brief Introduction

Regarding standard Attention and Linear Attention, I won't go into too much detail here. Readers who are not yet familiar with them can refer to my previous articles "Exploring Linear Attention: Must Attention Have a Softmax?" and "Transformer Upgrade Journey: 3. From Performer to Linear Attention". Briefly, standard Attention is computed as:

\begin{equation}a_{i,j}=\frac{e^{\boldsymbol{q}_i\cdot \boldsymbol{k}_j}}{\sum\limits_j e^{\boldsymbol{q}_i\cdot \boldsymbol{k}_j}}\end{equation}

While the calculation method for Linear Attention is:

\begin{equation}a_{i,j}=\frac{\phi(\boldsymbol{q}_i)\cdot \varphi(\boldsymbol{k}_j)}{\sum\limits_j \phi(\boldsymbol{q}_i)\cdot \varphi(\boldsymbol{k}_j)}\end{equation}

Therefore, to (approximately) convert standard Attention into Linear Attention, one generally needs to find transformations $\phi, \varphi$ such that the following approximation holds:

\begin{equation}\phi(\boldsymbol{q})\cdot \varphi(\boldsymbol{k})\approx e^{\boldsymbol{q}\cdot \boldsymbol{k}}\end{equation}

In the language of kernel methods, $e^{\boldsymbol{q}\cdot \boldsymbol{k}}$ plays the role of the kernel function.
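As a quick reminder of why this factorization matters, here is a minimal numpy sketch (the shapes, helper names, and the toy elementwise-exp feature map are my own illustrations, not from the referenced articles): once the score factorizes as $\phi(\boldsymbol{q}_i)\cdot \varphi(\boldsymbol{k}_j)$, the sums can be reassociated so that no $n \times n$ matrix is ever formed.

```python
import numpy as np

# Minimal sketch of the two forms; shapes and names are illustrative.
# Q, K: (n, d) queries/keys; V: (n, d_v) values.

def standard_attention(Q, K, V):
    A = np.exp(Q @ K.T)                   # a_{i,j} proportional to exp(q_i . k_j): needs the full n x n matrix
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

def linear_attention(Q, K, V, phi, varphi):
    # a_{i,j} proportional to phi(q_i) . varphi(k_j); reassociating the sums avoids the n x n matrix.
    Qp, Kp = phi(Q), varphi(K)            # (n, m), feature map applied row-wise
    KV = Kp.T @ V                         # (m, d_v), cost linear in n
    Z = Qp @ Kp.sum(axis=0)               # per-query normalizer, shape (n,)
    return (Qp @ KV) / Z[:, None]

# illustrative usage: elementwise exp as a toy non-negative feature map
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 128, 16))
out = linear_attention(Q, K, V, np.exp, np.exp)   # shape (128, 16)
```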

Random Projection

Performer was the first to find a practical random-projection transformation. In essence, it is based on the following integral identity:

\begin{equation}\begin{aligned} e^{\boldsymbol{q}\cdot \boldsymbol{k}} =&\, \frac{1}{(2\pi)^{d/2}}\int e^{-\Vert\boldsymbol{\omega}-\boldsymbol{q}-\boldsymbol{k}\Vert^2 / 2 + \boldsymbol{q}\cdot \boldsymbol{k}}d\boldsymbol{\omega}\\ =&\, \frac{1}{(2\pi)^{d/2}}\int e^{-\Vert\boldsymbol{\omega}\Vert^2 / 2}\times e^{\boldsymbol{\omega}\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \times e^{\boldsymbol{\omega}\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2}d\boldsymbol{\omega} \\ \end{aligned}\end{equation}

leading to

\begin{equation}\begin{aligned} e^{\boldsymbol{q}\cdot \boldsymbol{k}}&=\mathbb{E}_{\boldsymbol{\omega}\sim \mathcal{N}(\boldsymbol{\omega};0,\boldsymbol{1}_d)}\left[e^{\boldsymbol{\omega}\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \times e^{\boldsymbol{\omega}\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2}\right]\\[6pt] &\approx\underbrace{\frac{1}{\sqrt{m}}\begin{pmatrix}e^{\boldsymbol{\omega}_1\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \\ e^{\boldsymbol{\omega}_2\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2}\\ \vdots\\ e^{\boldsymbol{\omega}_m\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \end{pmatrix}}_{\phi(\boldsymbol{q})} \cdot \underbrace{\frac{1}{\sqrt{m}}\begin{pmatrix}e^{\boldsymbol{\omega}_1\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2} \\ e^{\boldsymbol{\omega}_2\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2}\\ \vdots\\ e^{\boldsymbol{\omega}_m\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2} \end{pmatrix}}_{\varphi(\boldsymbol{k})} \end{aligned}\end{equation}

where $\boldsymbol{\omega}_1,\boldsymbol{\omega}_2,\cdots,\boldsymbol{\omega}_m\sim \mathcal{N}(\boldsymbol{\omega};0,\boldsymbol{1}_d)$. In this way, through the idea of random projection, we approximately transform the exponential of the inner product of two $d$-dimensional vectors into the inner product of two $m$-dimensional vectors, and when $m\to\infty$, the two are theoretically equal.
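As a sanity check on this construction, here is a small numpy sketch of the random-feature estimate (the dimensions are my own choices, and it omits the orthogonal-feature and numerical-stability refinements of the actual Performer):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 100_000                       # feature count m kept large so the estimate is tight
q = rng.normal(size=d) * 0.3
k = rng.normal(size=d) * 0.3

W = rng.normal(size=(m, d))             # omega_1, ..., omega_m ~ N(0, I_d)

def feat(x):
    # i-th component: exp(omega_i . x - ||x||^2 / 2) / sqrt(m)
    return np.exp(W @ x - x @ x / 2) / np.sqrt(m)

print(np.exp(q @ k))                    # exact kernel value
print(feat(q) @ feat(k))                # Monte Carlo estimate, close for large m
```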

The random projection scheme above is quite ingenious and not something one would easily come up with. Below I introduce the two schemes I devised, which are comparatively easier to understand; readers familiar with kernel functions may well grasp them at a glance.

Taylor Expansion

My first idea is based on the Taylor expansion:

\begin{equation}e^{\boldsymbol{q}\cdot \boldsymbol{k}} = \sum_{m=0}^{\infty} \frac{(\boldsymbol{q}\cdot \boldsymbol{k})^m}{m!}\end{equation}

Truncating it to the first $n+1$ terms, we get an $n$-th degree polynomial in $\boldsymbol{q}\cdot \boldsymbol{k}$:

\begin{equation}e^{\boldsymbol{q}\cdot \boldsymbol{k}} \approx 1 + \boldsymbol{q}\cdot \boldsymbol{k} + \frac{1}{2}(\boldsymbol{q}\cdot \boldsymbol{k})^2 + \cdots + \frac{1}{n!}(\boldsymbol{q}\cdot \boldsymbol{k})^n\end{equation}

This is actually a "polynomial kernel function." Notice that we have:

\begin{equation}\begin{aligned} (\boldsymbol{q}\cdot \boldsymbol{k})^m =&\, \left(\sum_i q_i k_i\right)^m = \left(\sum_{i_1} q_{i_1} k_{i_1}\right)\cdots\left(\sum_{i_m} q_{i_m} k_{i_m}\right) \\ =&\, \sum_{i_1,\cdots,i_m} (q_{i_1}\cdots q_{i_m}) (k_{i_1}\cdots k_{i_m}) \end{aligned}\end{equation}

If we collect all the products $q_{i_1}\cdots q_{i_m}$, and likewise all the $k_{i_1}\cdots k_{i_m}$, into two large $d^m$-dimensional vectors, then $(\boldsymbol{q}\cdot \boldsymbol{k})^m$ is exactly the inner product of these two large vectors. The operation that builds such a large vector out of several vectors is the "outer product" of vectors, also known as the "tensor product," generally denoted $\otimes$. At this point:

\begin{equation} \frac{1}{m!}(\boldsymbol{q}\cdot \boldsymbol{k})^m = \frac{1}{m!}\underbrace{(\boldsymbol{q}\otimes\cdots\otimes\boldsymbol{q})}_{m \text{ times } \boldsymbol{q}}\cdot\underbrace{(\boldsymbol{k}\otimes\cdots\otimes\boldsymbol{k})}_{m \text{ times } \boldsymbol{k}} = \left(\frac{\otimes^m\boldsymbol{q}}{\sqrt{m!}}\right)\cdot\left(\frac{\otimes^m\boldsymbol{k}}{\sqrt{m!}}\right) \end{equation}

Here $\otimes^m\boldsymbol{q}, \otimes^m\boldsymbol{k}$ are shorthand for the repeated outer product of $m$ copies of $\boldsymbol{q}, \boldsymbol{k}$ (the $m$-th outer-product power). Using this result, we have:

\begin{equation} e^{\boldsymbol{q}\cdot \boldsymbol{k}}\approx \sum_{m=0}^n \left(\frac{\otimes^m\boldsymbol{q}}{\sqrt{m!}}\right)\cdot\left(\frac{\otimes^m\boldsymbol{k}}{\sqrt{m!}}\right) =\underbrace{\begin{pmatrix} 1 \\ \boldsymbol{q}\\ \frac{\otimes^2\boldsymbol{q}}{\sqrt{2}} \\ \vdots\\ \frac{\otimes^n\boldsymbol{q}}{\sqrt{n!}}\end{pmatrix}}_{\phi(\boldsymbol{q})} \cdot \underbrace{\begin{pmatrix} 1 \\ \boldsymbol{k}\\ \frac{\otimes^2\boldsymbol{k}}{\sqrt{2}} \\ \vdots\\ \frac{\otimes^n\boldsymbol{k}}{\sqrt{n!}}\end{pmatrix}}_{\varphi(\boldsymbol{k})} \end{equation}

This completes the transformation from standard Attention to Linear Attention.
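A direct, if memory-hungry, numpy sketch of this truncated feature map (the tiny dimensions, truncation order, and helper name are my own choices), built from repeated Kronecker products:

```python
import numpy as np
from functools import reduce
from math import factorial

def taylor_features(x, order):
    # [1, x, (x (x) x)/sqrt(2!), ..., ((x)^order x)/sqrt(order!)] flattened into one long vector
    blocks = [np.ones(1)]
    for m in range(1, order + 1):
        power = reduce(np.kron, [x] * m)            # m-fold Kronecker power of x, dimension d**m
        blocks.append(power / np.sqrt(factorial(m)))
    return np.concatenate(blocks)

rng = np.random.default_rng(0)
d, order = 4, 6                                     # total feature dimension 1 + 4 + ... + 4**6 = 5461
q = rng.normal(size=d) * 0.5
k = rng.normal(size=d) * 0.5

print(np.exp(q @ k))                                # exact value
print(taylor_features(q, order) @ taylor_features(k, order))   # truncated approximation
```

With $\boldsymbol{q}\cdot\boldsymbol{k}$ kept moderate as above, even this low truncation order should match $e^{\boldsymbol{q}\cdot\boldsymbol{k}}$ to a few digits.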

Exponential Definition

Compared to Performer's random projection, the Taylor expansion approach mentioned above should be easier to understand. However, there is an even simpler and more direct approach, which is to use the limit definition of the exponential function:

\begin{equation}e^x = \lim_{n\to\infty} \left(1+\frac{x}{n}\right)^n\end{equation}

Therefore, as long as $n$ is chosen sufficiently large, we have the approximation:

\begin{equation}e^{\boldsymbol{q}\cdot \boldsymbol{k}} \approx \left(1+\frac{{\boldsymbol{q}\cdot \boldsymbol{k}}}{n}\right)^n = \left(\begin{pmatrix} 1 \\ \frac{\boldsymbol{q}}{\sqrt{n}}\end{pmatrix} \cdot \begin{pmatrix}1 \\ \frac{\boldsymbol{k}}{\sqrt{n}}\end{pmatrix}\right)^n \end{equation}

Combining this with the transformation results of the polynomial kernel function from the previous section, we have:

\begin{equation}e^{\boldsymbol{q}\cdot \boldsymbol{k}} \approx \underbrace{\left(\otimes^n\begin{pmatrix} 1 \\ \frac{\boldsymbol{q}}{\sqrt{n}}\end{pmatrix}\right)}_{\phi(\boldsymbol{q})} \cdot \underbrace{\left(\otimes^n\begin{pmatrix}1 \\ \frac{\boldsymbol{k}}{\sqrt{n}}\end{pmatrix}\right)}_{\varphi(\boldsymbol{k})}\end{equation}

This might be the simplest and most direct scheme for transforming standard Attention into Linear Attention.
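A corresponding numpy sketch (again with tiny, illustrative dimensions and a helper name of my own): the Kronecker power of $(1, \boldsymbol{q}/\sqrt{n})$ reproduces $(1+\boldsymbol{q}\cdot\boldsymbol{k}/n)^n$ exactly, which in turn tends to $e^{\boldsymbol{q}\cdot\boldsymbol{k}}$ as $n$ grows, although the feature dimension $(d+1)^n$ explodes just as fast:

```python
import numpy as np
from functools import reduce

def exp_features(x, n):
    # n-fold Kronecker power of (1, x/sqrt(n)), dimension (d+1)**n
    base = np.concatenate(([1.0], x / np.sqrt(n)))
    return reduce(np.kron, [base] * n)

rng = np.random.default_rng(0)
d, n = 4, 8                                         # (d+1)**n = 390625 already
q = rng.normal(size=d) * 0.5
k = rng.normal(size=d) * 0.5

print(np.exp(q @ k))                                # exact kernel value
print((1 + q @ k / n) ** n)                         # (1 + x/n)^n, still rough for n = 8
print(exp_features(q, n) @ exp_features(k, n))      # matches the previous line exactly, by construction
```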

Results Analysis

In terms of practical value, the latter two deterministic schemes are far inferior to Performer's random projection scheme. This is because the output dimension of random projection can be flexibly controlled, whereas the output dimension of the two deterministic schemes is on the order of $d^n$, which is usually much larger than the sequence length itself. Thus, using them to implement Linear Attention would be less efficient than standard Attention.
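To put rough numbers on this (an illustrative count of my own, assuming a typical head_size of $d=64$ and truncation order $n=3$):

\begin{equation}\underbrace{1+d+d^2+d^3}_{\text{Taylor scheme}} = 266305, \qquad \underbrace{(d+1)^3}_{\text{exponential scheme}} = 274625,\end{equation}

either of which already far exceeds typical sequence lengths of a few thousand tokens.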

However, from a theoretical standpoint, the latter two schemes provide a simpler and more convenient way to equate standard Attention with infinite-dimensional Linear Attention. This equivalence often helps us better understand the Attention mechanism; the most immediate example is understanding the rank of the Attention matrix.

Readers who have studied Linear Attention should know that if Linear Attention is used for bidirectional tasks (such as MLM), the performance drop is very significant. This is because in Linear Attention, $\phi(\boldsymbol{Q}), \varphi(\boldsymbol{K}) \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ is the head_size of each head; generally $n \gg d$, so the rank of the $n \times n$ Attention matrix $\phi(\boldsymbol{Q})\varphi(\boldsymbol{K})^{\top}$ is at most $d$. This is the low-rank problem of Linear Attention, and it limits its expressive power.
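This rank cap is easy to verify numerically; below is a minimal sketch with random matrices standing in for $\phi(\boldsymbol{Q})$ and $\varphi(\boldsymbol{K})$ (the specific sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 512, 64                       # n >> d, as in the bidirectional setting above

Qp = rng.normal(size=(seq_len, d))         # stands in for phi(Q)
Kp = rng.normal(size=(seq_len, d))         # stands in for varphi(K)

A = Qp @ Kp.T                              # 512 x 512 matrix of attention scores
print(np.linalg.matrix_rank(A))            # prints 64: the rank cannot exceed d
```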

Conversely, the three transformations introduced earlier tell us that standard Attention can be viewed as infinite-dimensional Linear Attention. Therefore, the rank of standard Attention is theoretically not limited by $d$, which is why standard Attention with the same number of parameters often performs better than Linear Attention. In "Transformer Upgrade Journey: 3. From Performer to Linear Attention", we also mentioned that if standard Attention is to be switched to Linear Attention, $d$ must be scaled up accordingly to maintain a certain degree of approximation in performance.

Article Summary

This article introduced three ways of viewing standard Attention as infinite-dimensional Linear Attention. These different perspectives link standard Attention with Linear Attention and allow us to understand the Attention mechanism more comprehensively, from multiple angles.