By 苏剑林 | April 18, 2025
Readers who have been following the "Road to Transformer Upgrades" series up to this point are likely already familiar with Rotary Positional Encoding (RoPE). Simply put, RoPE is a rotary transformation applied to the Query ($\boldsymbol{Q}$) and Key ($\boldsymbol{K}$) of the Attention mechanism. Formally, it belongs to the category of absolute positional encodings, but when combined with the dot-product property of Attention, it automatically achieves a relative position effect.
So, can RoPE be added to the Value ($\boldsymbol{V}$)? At first glance, it seems not, because rotating $\boldsymbol{V}$ doesn't result in relative positional encoding. However, things are not so absolute. This article discusses applying RoPE to $\boldsymbol{V}$, which we can call the "Second Type of Rotary Positional Encoding."
Basic Review
We decompose Dot-Product Attention as follows:
\begin{equation}\boldsymbol{o}_i = \sum_j a_{i,j}\boldsymbol{v}_j,\qquad a_{i,j} = \frac{e^{s_{i,j}}}{\sum\limits_j e^{s_{i,j}}},\qquad s_{i,j} = \boldsymbol{q}_i^{\top}\boldsymbol{k}_j\end{equation}
For simplicity, the scaling factor for $s_{i,j}$ is omitted here. RoPE is applied to $\boldsymbol{q}_i$ and $\boldsymbol{k}_j$:
\begin{equation}\boldsymbol{q}_i \to \boldsymbol{\mathcal{R}}_i\boldsymbol{q}_i,\qquad \boldsymbol{k}_j \to \boldsymbol{\mathcal{R}}_j\boldsymbol{k}_j\end{equation}
This causes the Attention Logits, $s_{i,j}$, to become:
\begin{equation}s_{i,j} = (\boldsymbol{\mathcal{R}}_i\boldsymbol{q}_i)^{\top} (\boldsymbol{\mathcal{R}}_j\boldsymbol{k}_j) = \boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j\boldsymbol{k}_j=\boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}_{j-i}\boldsymbol{k}_j\end{equation}
That is, $s_{i,j}$ only depends on the relative position $j-i$, thereby achieving a relative position effect through an absolute position form. This transformation process utilizes the property of rotation matrices: $\boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j=\boldsymbol{\mathcal{R}}_{j-i}$.
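To make this concrete, here is a minimal NumPy sketch (an illustration written for this translation, not code from the original post): it builds the block-diagonal RoPE rotation matrix for a given position, using the standard RoPE frequency schedule as an assumption, and checks numerically that the rotated dot product depends only on $j-i$.
```python
# A minimal sketch: build the block-diagonal RoPE rotation matrix R_m for
# position m and check that (R_i q)^T (R_j k) depends only on j - i.
import numpy as np

def rope_matrix(m, dim, base=10000.0):
    """Block-diagonal rotation matrix R_m for position m and head dimension `dim`."""
    R = np.zeros((dim, dim))
    for t in range(dim // 2):
        theta = m * base ** (-2 * t / dim)   # standard RoPE frequencies (assumed)
        c, s = np.cos(theta), np.sin(theta)
        R[2 * t:2 * t + 2, 2 * t:2 * t + 2] = [[c, -s], [s, c]]
    return R

dim = 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=dim), rng.normal(size=dim)

# Two (i, j) pairs with the same relative position j - i = 3 give the same logit.
s_a = (rope_matrix(2, dim) @ q) @ (rope_matrix(5, dim) @ k)
s_b = (rope_matrix(7, dim) @ q) @ (rope_matrix(10, dim) @ k)
print(np.allclose(s_a, s_b))  # True
```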
In addition to rotation matrices, we proved in "The Road to Transformer Upgrades: 4. Rotary Positional Encoding for 2D Positions" that the general solution is $\boldsymbol{\mathcal{R}}_i = \boldsymbol{O}^i$, where $\boldsymbol{O}$ is any orthogonal matrix and the superscript represents matrix exponentiation. However, in "The Road to Transformer Upgrades: 6. Completeness Analysis of Rotary Positional Encoding," we also proved that general orthogonal matrix solutions are essentially isomorphic to rotary matrix solutions.
New Usage
What happens if we apply RoPE to $\boldsymbol{v}_j$, i.e., $\boldsymbol{v}_j\to\boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j$? Clearly, the result of the Attention is
\begin{equation}\boldsymbol{o}_i = \sum_j a_{i,j} \boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j\label{eq:v-rope-abs}\end{equation}
This makes the Attention output depend explicitly on the absolute position $j$. If all we want is some form of positional encoding, that may be acceptable; but if our goal is relative positional encoding, it clearly falls short.
However, there is a simple trick that fixes this flaw: apply an inverse RoPE to $\boldsymbol{o}_i$:
\begin{equation}\boldsymbol{o}_i = \boldsymbol{\mathcal{R}}_i^{\top}\left(\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j\right)=\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j=\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_{j-i}\boldsymbol{v}_j\label{eq:vo-rope}\end{equation}
In this way, it once again becomes a relative positional encoding! Formally, it still consists of two absolute positional encodings, echoing the logic of existing RoPE. Therefore, we call it the "Second Type of Rotary Positional Encoding," or more intuitively, "VO-RoPE," because it adds RoPE to both the Value and the Output. Correspondingly, standard RoPE can be called "QK-RoPE."
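To make equation $\eqref{eq:vo-rope}$ concrete, here is a hypothetical single-head sketch of VO-RoPE, reusing the `rope_matrix` helper from the sketch above; the names and shapes are purely illustrative and not taken from the experiments below.
```python
# VO-RoPE on a single attention head: RoPE is applied to V, plain dot-product
# attention is computed, and the inverse rotation R_i^T is applied to the output.
# Reuses rope_matrix from the previous sketch.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vo_rope_attention(Q, K, V):
    """Q, K, V: (seq_len, dim). Q and K are left untouched (pure VO-RoPE)."""
    n, dim = V.shape
    R = np.stack([rope_matrix(m, dim) for m in range(n)])  # (n, dim, dim)
    A = softmax(Q @ K.T, axis=-1)                          # a_{i,j}
    V_rot = np.einsum('jde,je->jd', R, V)                  # R_j v_j
    O = A @ V_rot                                          # sum_j a_{i,j} R_j v_j
    return np.einsum('ied,ie->id', R, O)                   # R_i^T (...)

n, dim = 6, 8
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(n, dim)) for _ in range(3))
print(vo_rope_attention(Q, K, V).shape)  # (6, 8)
```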
Simple Experiment
A quick set of experiments was run on a LLaMA-like model of roughly 1B parameters, comparing several configurations:
1. NoPE: No positional encoding added at all;
2. QK-RoPE: Standard rotary positional encoding;
3. VO-RoPE: The second type of rotary positional encoding proposed in this article;
4. Q/K/V/O-RoPE: Adding rotary positional encoding individually to only one of Q, K, V, or O;
5. QKV-RoPE: Adding rotary positional encoding to Q, K, and V;
6. QKVO-RoPE: Adding rotary positional encoding to Q, K, V, and O.
Note that points 4 and 5 are considered absolute positional encodings. The general conclusion is:
$$\text{QK-RoPE} \approx \text{QKVO-RoPE} > \text{K-RoPE} \approx \text{VO-RoPE} > \text{QKV-RoPE} > \text{NoPE} > \text{Q/V/O-RoPE}$$
The specific loss values are as follows:
$$\begin{array}{c|c}
\hline
\text{Configuration} & \text{Loss} \\
\hline
\text{QK-RoPE} & 2.712 \\
\text{QKVO-RoPE} & 2.719 \\
\text{K-RoPE} & 2.769 \\
\text{VO-RoPE} & 2.770 \\
\text{QKV-RoPE} & 2.783 \\
\text{NoPE} & 2.795 \\
\text{O-RoPE} & 2.841 \\
\text{Q-RoPE} & 2.851 \\
\text{V-RoPE} & 2.856 \\
\hline
\end{array}$$
Some Reflections
As the results above show, VO-RoPE is better than NoPE but worse than QK-RoPE, and stacking VO-RoPE on top of QK-RoPE brings no gain. Seen in this light, was VO-RoPE even worth proposing?
In the author's view, completing the picture of how RoPE can be used, answering the question "Can RoPE be added to Value?", and establishing experimentally that it "brings no significant gain" is itself quite valuable. Moreover, in the long run it may not be without benefit; it simply may not show its value under today's mainstream language-model settings. When the author originally proposed RoPE, the motivation was nothing more than that it was "interesting," with no expectation that it would become a competitive positional encoding (what came afterwards was good fortune).
Currently, VO-RoPE has a potential application scenario related to MLA, introduced in "The Ultimate Tug-of-War Between Cache and Performance: From MHA, MQA, GQA to MLA." We know that during inference, MLA is roughly equivalent to an MQA where K and V are shared:
\begin{equation}\boldsymbol{o}_i = \sum_{j=1}^i a_{i,j}\boldsymbol{c}_j,\qquad a_{i,j} = \frac{e^{s_{i,j}}}{\sum_{j=1}^i e^{s_{i,j}}},\qquad s_{i,j} = \boldsymbol{q}_i^{\top}\boldsymbol{c}_j\end{equation}
This property allows it to keep a KV Cache consisting of only a single $\boldsymbol{c}$. However, this important property is incompatible with QK-RoPE, because once RoPE is added to $\boldsymbol{c}_j$ inside the Attention matrix, one of two outcomes follows:
1. If RoPE is not added to $\boldsymbol{c}_j$ on the Value side, then K and V are no longer fully shared, which means either doubling the KV Cache (caching both the pre-RoPE and post-RoPE versions) or applying RoPE to K on the fly (introducing extra latency);
2. If RoPE is added to $\boldsymbol{c}_j$ on the Value side, K and V remain shared, but the result is no longer a relative positional encoding.
To get around this, MLA adopts a concatenation scheme of "mostly NoPE + a small RoPE part." But from the second type of rotary positional encoding discussed in this article, we know that it suffices to additionally apply O-RoPE to the Output:
\begin{equation}\boldsymbol{o}_i = \boldsymbol{\mathcal{R}}_i^{\top}\sum_{j=1}^i a_{i,j}(\boldsymbol{\mathcal{R}}_j\boldsymbol{c}_j),\qquad a_{i,j} = \frac{e^{s_{i,j}}}{\sum_{j=1}^i e^{s_{i,j}}},\qquad s_{i,j} = (\boldsymbol{\mathcal{R}}_i\boldsymbol{q}_i)^{\top} (\boldsymbol{\mathcal{R}}_j\boldsymbol{c}_j)\end{equation}
However, this line of reasoning has not been fully worked out and cannot be used directly in MLA's training form; it is recorded here for reference.
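Purely as an illustration of the formula above (and not of MLA's actual training or inference code), here is a paper-napkin sketch of one decoding step under these assumptions, again reusing `rope_matrix`; the cache stores the already-rotated vectors $\boldsymbol{\mathcal{R}}_j\boldsymbol{c}_j$, and that single vector serves as both key and value.
```python
# Sketch of the O-RoPE idea for MQA-style decoding with a shared c_j.
# Reuses rope_matrix; the cache stores R_j c_j once, acting as both key and value.
import numpy as np

def write_to_cache(cache, c_j, j):
    """Cache the already-rotated vector R_j c_j."""
    cache.append(rope_matrix(j, c_j.shape[0]) @ c_j)

def decode_step(q_i, cache, i):
    """q_i: (dim,) query at position i; cache holds R_j c_j for j = 0..i."""
    dim = q_i.shape[0]
    R_i = rope_matrix(i, dim)
    C_rot = np.stack(cache)                  # rows are R_j c_j
    s = C_rot @ (R_i @ q_i)                  # s_{i,j} = (R_i q_i)^T (R_j c_j)
    a = np.exp(s - s.max()); a /= a.sum()    # softmax over j <= i
    return R_i.T @ (a @ C_rot)               # R_i^T sum_j a_{i,j} R_j c_j

dim = 8
rng = np.random.default_rng(2)
cache = []
for j in range(4):                           # the only per-token cache is one vector
    write_to_cache(cache, rng.normal(size=dim), j)
print(decode_step(rng.normal(size=dim), cache, i=3).shape)  # (8,)
```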
Related Work
In fact, VO-RoPE also ingeniously provides an intermediate form between Attention and complex linear RNNs (such as LRU and RetNet). Starting from equation $\eqref{eq:vo-rope}$, consider a Causal scenario and take a special case where $a_{i,j}=\gamma^{i-j}$ with $0 < \gamma < 1$, then:
\begin{equation}\boldsymbol{o}_i = \sum_{j=1}^i \gamma^{i-j} \boldsymbol{\mathcal{R}}_{j-i}\boldsymbol{v}_j\end{equation}
We know that the rotation matrix $\boldsymbol{\mathcal{R}}_{j-i}$, written in complex form, is simply a diagonal matrix with entries of the form $e^{\mathbb{I}\theta (j - i)}$, where $\mathbb{I}$ is the imaginary unit ($\mathbb{I}^2=-1$; it is written as $\mathbb{I}$ to distinguish it from the index $i$). The equation above is therefore equivalent to:
\begin{equation}\boldsymbol{o}_i = \sum_{j=1}^i \gamma^{i-j} e^{\mathbb{I}\theta (j - i)} \boldsymbol{v}_j = \sum_{j=1}^i (\gamma e^{-\mathbb{I}\theta})^{i-j} \boldsymbol{v}_j\end{equation}
This is precisely the simplest linear RNN with complex decay: it satisfies the recurrence $\boldsymbol{o}_i = \gamma e^{-\mathbb{I}\theta}\,\boldsymbol{o}_{i-1} + \boldsymbol{v}_i$. According to the derivation in "Google's New Work Attempts to 'Revive' RNNs: Can RNNs Shine Again?", such RNNs are theoretically more complete than purely real-decay RNNs.
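A quick numerical check of this equivalence, using a single 2D rotation block with an arbitrarily chosen angle $\theta$ and decay $\gamma$ (both values below are assumptions for the demo):
```python
# Check: causal attention with a_{i,j} = gamma^{i-j} plus VO-RoPE on one 2D
# rotation block equals the complex linear RNN h_i = (gamma * e^{-I*theta}) h_{i-1} + v_i.
import numpy as np

gamma, theta, n = 0.9, 0.3, 10               # arbitrary decay, angle, sequence length
rng = np.random.default_rng(3)
v = rng.normal(size=(n, 2))                  # 2D values = one rotation block
v_c = v[:, 0] + 1j * v[:, 1]                 # the same values as complex numbers

def rot(m):                                  # rotation matrix for angle m * theta
    c, s = np.cos(m * theta), np.sin(m * theta)
    return np.array([[c, -s], [s, c]])

# Attention form: o_i = sum_{j<=i} gamma^{i-j} R_{j-i} v_j
o_attn = np.array([sum(gamma ** (i - j) * rot(j - i) @ v[j] for j in range(i + 1))
                   for i in range(n)])

# RNN form: h_i = (gamma * e^{-I*theta}) h_{i-1} + v_i
h, o_rnn = 0.0, []
for i in range(n):
    h = gamma * np.exp(-1j * theta) * h + v_c[i]
    o_rnn.append(h)

print(np.allclose(o_attn[:, 0] + 1j * o_attn[:, 1], np.array(o_rnn)))  # True
```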
Therefore, supplementing RoPE with the VO-RoPE form amounts to generalizing from real-valued linear RNNs to complex-valued linear RNNs. In theory this makes the model's capabilities more complete, although such completeness does not necessarily help in language modeling tasks (just as the complex-valued LRU showed no advantage over the purely real-valued RWKV). Still, theoretical completeness may hold special value in certain scenarios; who knows?
Epilogue: after this article was shared on Twitter, some readers reported that they had already experimented with VO-RoPE, including:
1. @gharik stated that he had previously tried QKVO-RoPE and obtained some positive results, naming it "RoPER" at the time. More details can be found here and here;
2. @vinam_arora pointed out that he had tried VO-RoPE in a "brain decoding task" and the results were also positive. The paper is "A Unified, Scalable Framework for Neural Population Decoding."
Summary
This article centered on the question "Can RoPE be added to V?" and discussed a second way of using RoPE.