By 苏剑林 | December 28, 2022
In last year's article "Transformer Upgrade Road: 2. Rotary Positional Embeddings (RoPE) that Incorporate Various Strengths," I proposed Rotary Positional Embedding (RoPE). At the time, my starting point was simply that implementing relative positions through absolute positions seemed like "a very fun thing to do"; I did not expect it to actually perform quite well and become widely adopted, which was a pleasant surprise. Later, in "Transformer Upgrade Road: 4. Rotary Positional Embeddings for 2D Positions," I discussed the 2D form of RoPE and studied the general solution of RoPE expressed via the matrix exponential.
Since we have a general solution, a natural question arises: the RoPE we commonly use is just a block-diagonal matrix built from 2D rotation matrices. If we replaced it with the general solution, could the performance theoretically be better? This article aims to answer this question.
Exponential General Solution
In "Transformer Upgrade Road: 4. Rotary Positional Embeddings for 2D Positions," we abstractly defined RoPE as any square matrix satisfying the following equation:
\begin{equation}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n=\boldsymbol{\mathcal{R}}_{n-m}\label{eq:re}\end{equation}
Then, we explored solutions in the form of the following matrix exponential:
\begin{equation}\boldsymbol{\mathcal{R}}_n=\exp n\boldsymbol{B}\end{equation}
The exponential here is not applied element-wise the way an activation function is; rather, it is the "matrix exponential" defined by the Taylor series. According to the Baker–Campbell–Hausdorff formula, we have:
\begin{equation}\begin{aligned}
\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n=&\,(\exp m\boldsymbol{B})^{\top}(\exp n\boldsymbol{B}) = (\exp m\boldsymbol{B}^{\top})(\exp n\boldsymbol{B}) \\
=&\,\exp\left(m\boldsymbol{B}^{\top} + n\boldsymbol{B} + \frac{1}{2}mn\left[\boldsymbol{B}^{\top}, \boldsymbol{B}\right]+\cdots\right)
\end{aligned}\end{equation}
Here $[\boldsymbol{A}, \boldsymbol{B}]=\boldsymbol{A}\boldsymbol{B}-\boldsymbol{B}\boldsymbol{A}$, and the $\cdots$ omits terms of third order or higher in $m$ and $n$. According to Equation \eqref{eq:re}, the exponent part of the above expression should equal $(n-m)\boldsymbol{B}$, which leads to:
\begin{equation}\boldsymbol{B}^{\top} = - \boldsymbol{B}\end{equation}
That is, $\boldsymbol{B}$ must be a skew-symmetric matrix. (Once $\boldsymbol{B}^{\top}=-\boldsymbol{B}$, the commutator $[\boldsymbol{B}^{\top},\boldsymbol{B}]$ and all higher-order terms vanish, so the exponent is exactly $(n-m)\boldsymbol{B}$.)
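To make this concrete, here is a small numerical check (my own sketch, not part of the original derivation; it assumes numpy and scipy are available) that $\exp n\boldsymbol{B}$ with a skew-symmetric $\boldsymbol{B}$ indeed satisfies Equation \eqref{eq:re}:

```python
# Sketch: numerically verify that R_n = exp(nB) satisfies R_m^T R_n = R_{n-m}
# whenever B is skew-symmetric (numpy/scipy assumed; not from the original post).
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 6
A = rng.normal(size=(d, d))
B = A - A.T                      # B^T = -B: a random skew-symmetric matrix

R = lambda n: expm(n * B)        # R_n = exp(nB)

m, n = 3, 7
print(np.allclose(R(m).T @ R(n), R(n - m)))   # True
```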
Orthogonal General Solution
Furthermore, we have $(\exp \boldsymbol{B})^{\top}(\exp \boldsymbol{B})=\exp(\boldsymbol{B}-\boldsymbol{B}) = \boldsymbol{I}$ and $\exp n\boldsymbol{B} = (\exp\boldsymbol{B})^n$. The former shows that $\exp\boldsymbol{B}$ is an orthogonal matrix, while the latter raises the question of whether the solution can be generalized to an arbitrary orthogonal matrix. It is not hard to verify that the answer is yes, and we have the following conclusion:
For any orthogonal matrix $\boldsymbol{O}$, $\boldsymbol{\mathcal{R}}_n=\boldsymbol{O}^n$ is a solution satisfying Equation \eqref{eq:re}.
It is worth pointing out that over the reals, not all orthogonal matrices can be written in the form $\exp\boldsymbol{B}$, so $\boldsymbol{O}^n$ is in fact more general than the matrix exponential form. From "Appreciating the Identity det(exp(A)) = exp(Tr(A))" we know $\det(\exp(\boldsymbol{A})) = \exp(\text{Tr}(\boldsymbol{A})) > 0$, so an orthogonal matrix that can be written in matrix exponential form must have a positive determinant (i.e., equal to 1). In fact, the converse also holds: any orthogonal matrix with determinant equal to 1 can be written as $\exp\boldsymbol{B}$ with $\boldsymbol{B}$ skew-symmetric (refer to "Why can any orthogonal matrix be written as O=e^A").
For an orthogonal matrix with $\det(\boldsymbol{O}) = -1$, the real canonical form lets us write $\boldsymbol{O}=\boldsymbol{O}_+ \boldsymbol{I}_-$, where $\boldsymbol{I}_-$ is (up to an orthogonal change of basis) a diagonal matrix with a single $-1$ on the diagonal and $1$s elsewhere, and $\boldsymbol{O}_+$ is an orthogonal matrix with $\det(\boldsymbol{O}_+) = 1$ that commutes with $\boldsymbol{I}_-$ and can therefore be written as $\exp\boldsymbol{B}$. In this case, $\boldsymbol{O}^n = (\boldsymbol{O}_+ \boldsymbol{I}_-)^n = \boldsymbol{I}_-^n\exp n\boldsymbol{B}$, so even when $\det(\boldsymbol{O}) = -1$, $\boldsymbol{O}^n$ is just a simple modification of $\exp n\boldsymbol{B}$. Therefore, we will primarily study solutions of the form $\exp n\boldsymbol{B}$.
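Before moving on, here is a quick numerical illustration (again my own sketch, not from the original) of the boxed conclusion above: for any orthogonal $\boldsymbol{O}$, $\boldsymbol{\mathcal{R}}_n = \boldsymbol{O}^n$ satisfies Equation \eqref{eq:re}, simply because $\boldsymbol{O}^{\top} = \boldsymbol{O}^{-1}$:

```python
# Sketch: for a random orthogonal O, R_n = O^n satisfies R_m^T R_n = R_{n-m}
# because O^T O = I (my own check; numpy assumed).
import numpy as np

rng = np.random.default_rng(1)
d = 6
O, _ = np.linalg.qr(rng.normal(size=(d, d)))         # random orthogonal matrix

R = lambda n: np.linalg.matrix_power(O, n)
m, n = 2, 5
print(np.allclose(R(m).T @ R(n), R(n - m)))          # True
print(np.isclose(abs(np.linalg.det(O)), 1.0))        # det(O) = ±1
```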
Completeness Analysis
As is well known, the RoPE positional encoding we usually use is a block-diagonal matrix of the following form:
\begin{equation}\scriptsize{\left(\begin{array}{cc:cc:cc:cc}
\cos n\theta_0 & -\sin n\theta_0 & 0 & 0 & \cdots & \cdots & 0 & 0 \\
\sin n\theta_0 & \cos n\theta_0 & 0 & 0 & \cdots & \cdots & 0 & 0 \\
\hdashline
0 & 0 & \cos n\theta_1 & -\sin n\theta_1 & \cdots & \cdots & 0 & 0 \\
0 & 0 & \sin n\theta_1 & \cos n\theta_1 & \cdots & \cdots & 0 & 0 \\
\hdashline
\vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\
\vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\
\hdashline
0 & 0 & 0 & 0 & \cdots & \cdots & \cos n\theta_{d/2-1} & -\sin n\theta_{d/2-1} \\
0 & 0 & 0 & 0 & \cdots & \cdots & \sin n\theta_{d/2-1} & \cos n\theta_{d/2-1} \\
\end{array}\right)}\end{equation}
This can be written more compactly as:
\begin{equation}\begin{pmatrix}
\boldsymbol{R}_{n\theta_0} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\
\boldsymbol{0} & \boldsymbol{R}_{n\theta_1} & \cdots & \boldsymbol{0} \\
\vdots & \vdots & \ddots & \vdots \\
\boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{R}_{n\theta_{d/2-1}} \\
\end{pmatrix} = \exp n\begin{pmatrix}
\boldsymbol{J}_{\theta_0} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\
\boldsymbol{0} & \boldsymbol{J}_{\theta_1} & \cdots & \boldsymbol{0} \\
\vdots & \vdots & \ddots & \vdots \\
\boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{J}_{\theta_{d/2-1}} \\
\end{pmatrix}\end{equation}
where
\begin{equation}\boldsymbol{R}_{\theta} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},\quad \boldsymbol{J}_{\theta} = \begin{pmatrix} 0 & -\theta \\ \theta & 0\end{pmatrix}\end{equation}
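As a sanity check (a sketch of mine, not from the original; it assumes the usual frequencies $\theta_i = 10000^{-2i/d}$, though any $\theta_i$ would do), the block-diagonal rotation matrix above is exactly the matrix exponential of $n$ times the block-diagonal matrix of $\boldsymbol{J}_{\theta_i}$ blocks:

```python
# Sketch: the standard block-diagonal RoPE matrix equals exp(n * blockdiag(J_theta)).
# theta_i = 10000^(-2i/d) is assumed here as the usual choice (numpy/scipy assumed).
import numpy as np
from scipy.linalg import expm, block_diag

d, n = 8, 5
thetas = 10000.0 ** (-2.0 * np.arange(d // 2) / d)

J = block_diag(*[np.array([[0.0, -t], [t, 0.0]]) for t in thetas])
Rot = block_diag(*[np.array([[np.cos(n * t), -np.sin(n * t)],
                             [np.sin(n * t),  np.cos(n * t)]]) for t in thetas])

print(np.allclose(expm(n * J), Rot))   # True
```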
This is arguably the simplest form one could choose, and the choice was made mainly to keep the computational cost low. The completeness question, then, is this: compared with the fully parameterized $\exp n\boldsymbol{B}$, does the block-diagonal special case above lose any expressive power? In other words, if computational cost were not a concern, could replacing $\boldsymbol{B}$ with a general skew-symmetric matrix improve performance?
Answering this question is not difficult. In fact, any even-order real skew-symmetric matrix can be block-diagonalized into the form:
\begin{equation}\boldsymbol{\Lambda} = \begin{pmatrix}
\boldsymbol{J}_{\theta_0} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\
\boldsymbol{0} & \boldsymbol{J}_{\theta_1} & \cdots & \boldsymbol{0} \\
\vdots & \vdots & \ddots & \vdots \\
\boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{J}_{\theta_{d/2-1}} \\
\end{pmatrix}\end{equation}
This conclusion can be found in the reference on the "Skew-symmetric matrix." That is to say, there exists an invertible matrix $\boldsymbol{P}$ (which can in fact be taken orthogonal) such that $\boldsymbol{B}=\boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}$, so:
\begin{equation}\exp n\boldsymbol{B} = \exp \left(n\boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}\right) = \boldsymbol{P}(\exp n\boldsymbol{\Lambda})\boldsymbol{P}^{-1}\end{equation}
This means that any $\exp n\boldsymbol{B}$ and the block-diagonal $\exp n\boldsymbol{\Lambda}$ differ only by a similarity transformation. When we apply RoPE in Self-Attention, it looks like this:
\begin{equation}\boldsymbol{q}^{\top}\big(\exp (n-m)\boldsymbol{B}\big)\boldsymbol{k} = \big(\boldsymbol{P}^{\top}\boldsymbol{q}\big)^{\top}\big(\exp (n-m)\boldsymbol{\Lambda}\big)\big(\boldsymbol{P}^{-1}\boldsymbol{k}\big)\end{equation}
Since $\boldsymbol{q},\boldsymbol{k}$ are generally obtained from the input $\boldsymbol{x}$ through learnable linear transformations, $\boldsymbol{P}^{\top}$ and $\boldsymbol{P}^{-1}$ can theoretically be absorbed into the parameters of those linear transformations. Therefore, working directly with $\boldsymbol{q}^{\top}\big(\exp (n-m)\boldsymbol{\Lambda}\big)\boldsymbol{k}$ involves, in theory, no loss of generality.
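Here is a small numerical illustration of this absorption argument (my own sketch; it uses the real Schur decomposition, which for a skew-symmetric $\boldsymbol{B}$ yields exactly the block-diagonal $\boldsymbol{\Lambda}$ above together with an orthogonal $\boldsymbol{P}$):

```python
# Sketch: block-diagonalize a random skew-symmetric B via the real Schur form
# and check q^T exp((n-m)B) k == (P^T q)^T exp((n-m)Lam) (P^{-1} k).
import numpy as np
from scipy.linalg import expm, schur

rng = np.random.default_rng(2)
d, m, n = 8, 3, 11
A = rng.normal(size=(d, d))
B = A - A.T                                   # random skew-symmetric matrix

Lam, P = schur(B, output='real')              # B = P @ Lam @ P.T, P orthogonal;
                                              # for skew-symmetric B, Lam is block-
                                              # diagonal with 2x2 J-type blocks
q, k = rng.normal(size=d), rng.normal(size=d)

lhs = q @ expm((n - m) * B) @ k
rhs = (P.T @ q) @ expm((n - m) * Lam) @ (np.linalg.inv(P) @ k)
print(np.allclose(lhs, rhs))                  # True: P folds into the q, k projections
```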
So for Self-Attention the answer is negative: the block-diagonal form loses nothing, and a general $\boldsymbol{B}$ brings no theoretical gain. For Linear Attention, however, the answer is slightly different, because $\boldsymbol{q},\boldsymbol{k}$ there pass through additional activation functions:
\begin{equation}\phi(\boldsymbol{q})^{\top}\big(\exp (n-m)\boldsymbol{B}\big)\varphi(\boldsymbol{k}) = \big(\boldsymbol{P}^{\top}\phi(\boldsymbol{q})\big)^{\top}\big(\exp (n-m)\boldsymbol{\Lambda}\big)\big(\boldsymbol{P}^{-1}\varphi(\boldsymbol{k})\big)\end{equation}
As a result, $\boldsymbol{P}^{\top}$ and $\boldsymbol{P}^{-1}$ cannot necessarily be absorbed into the parameters of the linear transformations, so adding these two parameter matrices to Linear Attention could in principle bring an improvement.
Summary
This article briefly analyzed the completeness of RoPE, showing that for Self-Attention the current block-diagonal form of RoPE loses no generality, while for Linear Attention there may still be room for improvement.