Transformer Path to Upgrade: 10. RoPE is a $\beta$-base Encoding

By 苏剑林 | July 6, 2023

For readers concerned with how to extend the Context length of LLMs (Large Language Models), last week was undoubtedly an exciting week, with the open-source community continuously producing heartening results. First, the user @kaiokendev experimented with a "Positional Linear Interpolation" scheme in his project SuperHOT, showing that with very little fine-tuning on long texts, existing LLMs can handle a Long Context. Almost simultaneously, Meta proposed the same idea and published it with rich experimental results in the paper "Extending Context Window of Large Language Models via Positional Interpolation". The surprises did not stop there—subsequently, the user @bloc97 proposed NTK-aware Scaled RoPE, achieving the effect of extending the Context length without any fine-tuning!

All these developments, especially NTK-aware Scaled RoPE, have forced me to re-think the meaning of RoPE. After analysis, I discovered that the construction of RoPE can be viewed as a $\beta$-base encoding. From this perspective, these recent advances in the open-source community can be understood as different ways of expanding base encoding.

Base Representation

Suppose we have an integer $n$ less than 1000 (excluding 1000) that we want to input into a model as a condition. Which method would be best?

The simplest idea is to input it directly as a one-dimensional floating-point vector. However, the range from 0 to 999 spans nearly a thousand, which is not easy for gradient-based optimizers to handle. What about scaling it to between 0 and 1? That is not great either, because the gap between adjacent numbers shrinks from 1 to 0.001, making it hard for the model and the optimizer to distinguish adjacent numbers. Generally speaking, gradient-based optimizers are somewhat "fickle": they only handle inputs that are neither too large nor too small well, and problems arise at either extreme.

Therefore, to avoid this problem, we need to devise a new input method. When we don't know how to let a machine handle it, we might as well think about how humans handle it. For an integer, like 759, this is a three-digit number in base 10, where each digit is 0–9. Since we ourselves use base 10 to represent numbers, why not input the base-10 representation directly into the model? That is, we input the integer $n$ as a three-dimensional vector $[a, b, c]$, where $a, b, c$ are the hundreds, tens, and units digits of $n$, respectively. In this way, we have reduced the span of the numbers without shrinking the gap between adjacent numbers, at the cost of increasing the input dimension—fortunately, neural networks are good at handling high-dimensional data.

If we want to further reduce the span of the numbers, we can further reduce the base of the notation, such as using base 8, base 6, or even base 2, at the cost of further increasing the input dimension.
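For concreteness, here is a minimal sketch of this digit-vector input; the helper name `to_digits` and the use of NumPy are purely illustrative choices, not part of any particular model:

```python
import numpy as np

def to_digits(n, base=10, width=3):
    """Fixed-width digit vector of n in the given base, most significant digit first.
    E.g. 759 -> [7, 5, 9] in base 10."""
    return np.array([(n // base**i) % base for i in reversed(range(width))],
                    dtype=np.float32)

print(to_digits(759))                    # [7. 5. 9.]
print(to_digits(759, base=8, width=4))   # [1. 3. 6. 7.]  (759 = 0o1367)
```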

Direct Extrapolation

Suppose we trained a model using a three-dimensional base-10 representation, and the model performance is quite good. Then a new requirement suddenly arises: increase the upper limit of $n$ to 2000. How should we handle this?

If we still use a base-10 representation for the vector input, then the input would now be a four-dimensional vector. However, the original model was designed and trained for three-dimensional vectors, so the model cannot handle the new dimension. Some readers might ask: "Why not pre-allocate enough dimensions in advance?" Yes, we could pre-allocate a few extra dimensions, set them to 0 during the training phase, and change them to other numbers during the inference phase. This is Extrapolation.

However, the dimensions pre-allocated during the training phase are always 0. If they are changed to other numbers during inference, the effect may not be good because the model does not necessarily have the ability to adapt to situations it has never been trained on. In other words, because the training data for certain dimensions is insufficient, direct extrapolation usually leads to a serious degradation in model performance.
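In digit terms, direct extrapolation just means training with an extra leading digit that is always 0 and letting it take non-zero values at inference time. A tiny illustrative sketch (the helper name is mine):

```python
def to_digits4(n, base=10):
    """Four base-10 digits of n, most significant digit first."""
    return [(n // base**i) % base for i in reversed(range(4))]

print(to_digits4(759))    # [0, 7, 5, 9]  -- leading digit is always 0 for n < 1000 (training)
print(to_digits4(1749))   # [1, 7, 4, 9]  -- leading digit becomes 1 at inference, never seen in training
```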

Linear Interpolation

Thus, someone thought of changing extrapolation to Interpolation. Simply put, this compresses the range of 2000 into the range of 1000: for example, dividing by 2 maps 1749 to 874.5, which is then fed to the original model as the three-dimensional vector $[8, 7, 4.5]$. From an absolute-value perspective, the representation $[7, 4, 9]$ no longer stands for 749 but for 1498, twice what it used to denote, so the mapping is no longer consistent with the original one. From a relative-value perspective, the gap between adjacent numbers shrinks from 1 to 0.5, so the last dimension becomes more "crowded". Therefore, after switching to interpolation, fine-tuning is usually required so that the model can re-adapt to this more crowded mapping.
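As a sketch of the interpolated encoding described above (again, the helper name is mine and purely illustrative):

```python
import math

def interpolated_digits(n, k=2, base=10, width=3):
    """Compress the enlarged range by dividing n by k, then read off `width` base-10
    digits; only the last digit is allowed to become fractional."""
    x = n / k
    digits = [math.floor(x / base**i) % base for i in range(width - 1, 0, -1)]
    digits.append(x % base)
    return digits

print(interpolated_digits(1749))   # [8, 7, 4.5]
```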

Of course, some readers might say that the extrapolation scheme can also be fine-tuned. Yes, but the number of steps required for fine-tuning in the interpolation scheme is much smaller. This is because in many scenarios (such as position encoding), relative magnitude (or ordinal information) is more important. In other words, the model only needs to know that 874.5 is larger than 874; it doesn't need to know exactly how large the number it actually represents is. Since the original model has already learned that 875 is larger than 874, and the model itself has a certain generalization ability, it is not too difficult to learn that 874.5 is larger than 874.

However, the interpolation scheme is not perfect. When the processing range increases further, the adjacent difference becomes even smaller, and this shrinkage is concentrated in the units place, while the remaining hundreds and tens places still maintain an adjacent difference of 1. In other words, the interpolation method makes the distribution across different dimensions inconsistent, making each dimension unequal, which increases the difficulty for the model to learn further.

Base Conversion

Is there a scheme that does not require adding new dimensions and can maintain the adjacent gap? Yes, one we might be very familiar with: base conversion! A three-digit base-10 encoding can represent 0–999. What if it's base 16? It can represent a maximum of $16^3 - 1 = 4095 > 1999$. Therefore, we only need to convert to base 16, so 1749 becomes $[6, 13, 5]$, and a three-dimensional vector can cover the target range. The cost is that the numbers in each dimension change from 0–9 to 0–15.

If you think about it carefully, this is truly a brilliant idea. As just mentioned, the scenarios we are concerned with mainly utilize ordinal information. The originally trained model has already learned that $875 > 874$, and in base 16, $875 > 874$ still holds—the comparison rules are exactly the same (the model has no idea what base you are using for the input). The only concern is whether the model can still compare normally once each dimension exceeds 9 (i.e., 10–15), but in fact, models generally have a certain generalization ability, so extrapolating each dimension slightly is fine. Therefore, this idea of base conversion might even be effective without fine-tuning the original model! Furthermore, to narrow the range of extrapolation even more, we can use base $\lceil \sqrt[3]{2000} \rceil = 13$ instead of base 16.
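A quick sketch of the base-conversion idea, using the same kind of digit helper as before (illustrative only):

```python
import math

def to_digits(n, base, width=3):
    """Fixed-width digit vector of n in the given base, most significant digit first."""
    return [(n // base**i) % base for i in reversed(range(width))]

new_base = math.ceil(2000 ** (1 / 3))   # = 13, since 13**3 = 2197 > 2000
print(to_digits(1749, 16))              # [6, 13, 5]
print(to_digits(1749, new_base))        # [10, 4, 7]
```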

Next, we will see that this idea of base conversion actually corresponds to the NTK-aware scaled RoPE mentioned at the beginning of the article!

Position Encoding

To draw the connection between them, we first need to establish the following result:

The Rotary Positional Encoding (RoPE) of position $n$ is, in essence, a $\beta$-base encoding of the number $n$!

This may seem surprising because the two appear quite different, but in fact their operations share the same key properties. To see this, first recall that for a number $n$ (written, as usual, in base 10), the $m$-th digit of its $\beta$-base representation (counting from right to left) is
\begin{equation}\left\lfloor\frac{n}{\beta^{m-1}}\right\rfloor\bmod\beta\label{eq:mod}\end{equation}
That is, first divide by the $(m-1)$-th power of $\beta$, and then take the remainder (modulo). Now recall RoPE, whose construction is based on the Sinusoidal position encoding, which can be rewritten as
\begin{equation}\left[\cos\left(\frac{n}{\beta^0}\right),\sin\left(\frac{n}{\beta^0}\right),\cos\left(\frac{n}{\beta^1}\right),\sin\left(\frac{n}{\beta^1}\right),\cdots,\cos\left(\frac{n}{\beta^{d/2-1}}\right),\sin\left(\frac{n}{\beta^{d/2-1}}\right)\right]\label{eq:sinu}\end{equation}
where $\beta=10000^{2/d}$. Comparing with equation $\eqref{eq:mod}$, doesn't equation $\eqref{eq:sinu}$ contain exactly the same term $\frac{n}{\beta^{m-1}}$? As for the modulo operation, its most important characteristic is periodicity, and aren't the $\cos$ and $\sin$ in equation $\eqref{eq:sinu}$ also periodic functions? Therefore, aside from the insignificant difference of the floor function, RoPE (or the Sinusoidal position encoding) is essentially a $\beta$-base encoding of the number $n$!
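To make the analogy concrete, here is a small sketch comparing the two constructions; the function names and the use of NumPy are my own illustrative choices:

```python
import numpy as np

def beta_digit(n, m, beta):
    """The m-th digit (from the right, 1-indexed) of n in base beta: the digit formula above."""
    return np.floor(n / beta**(m - 1)) % beta

def sinusoidal(n, d, base=10000):
    """The Sinusoidal encoding above: cos/sin of the phases n / beta**(m-1), beta = base**(2/d)."""
    beta = base ** (2 / d)
    phases = n / beta ** np.arange(d // 2)          # n / beta**(m-1), m = 1..d/2
    return np.stack([np.cos(phases), np.sin(phases)], axis=-1).reshape(-1)

# Both are driven by the same quantity n / beta**(m-1): the digit applies "mod beta",
# while the encoding feeds it through the periodic functions cos and sin.
```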

After establishing this connection, the expansion schemes for the integer $n$ discussed in the previous sections can correspond to the various developments mentioned at the beginning of the article. Among them, the direct extrapolation scheme is to change nothing; the interpolation scheme is to replace $n$ with $n/k$, where $k$ is the factor by which the range is expanded. This is the Positional Interpolation experimented with in Meta's paper, where experimental results also proved that extrapolation indeed requires more fine-tuning steps than interpolation.
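In code, the interpolation scheme leaves the encoding untouched and simply rescales the position, along the lines of the following sketch (names mine, illustrative only):

```python
import numpy as np

def rope_phases(n, d, base=10000):
    """Phases n / beta**(m-1) with beta = base**(2/d), m = 1..d/2."""
    beta = base ** (2 / d)
    return n / beta ** np.arange(d // 2)

# Positional Interpolation: encode position n as if it were n/k, with the encoding unchanged.
k = 2
pi_phases = rope_phases(100 / k, d=128)   # position 100 is fed in as 50
```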

As for base conversion, if we want to expand the representation range by $k$ times, then the original base $\beta$ must be expanded to at least base $\beta \cdot k^{2/d}$ (although equation $\eqref{eq:sinu}$ is a $d$-dimensional vector, $\cos$ and $\sin$ appear in pairs, so it is equivalent to a $d/2$-digit $\beta$-base representation; hence we take the $(d/2)$-th root instead of the $d$-th root), or equivalently, the original base $10000$ is replaced with $10000k$. This is basically NTK-aware Scaled RoPE. As discussed earlier, because position encoding relies more on ordinal information, and base conversion basically does not change the rules for comparing order, NTK-aware Scaled RoPE can achieve good results on a longer Context even without fine-tuning.
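Viewed this way, NTK-aware Scaled RoPE is just a change of base. A minimal sketch, assuming the same phase convention as above (names mine):

```python
import numpy as np

def ntk_rope_phases(n, d, k, base=10000):
    """NTK-aware Scaled RoPE read as a base change: replace the base 10000 by 10000*k,
    i.e. enlarge beta = 10000**(2/d) to beta * k**(2/d)."""
    beta = (base * k) ** (2 / d)
    return n / beta ** np.arange(d // 2)

# The highest-frequency phase (m = 1) stays exactly n, just like extrapolation, while the
# lowest-frequency phase (m = d/2) shrinks by roughly k**(1 - 2/d) ~ k, just like interpolation.
```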

Tracing to the Source

Some readers might be curious: what does this have to do with NTK? NTK stands for "Neural Tangent Kernel," which we briefly covered in "Optimization Algorithms from a Dynamical Perspective (7): SGD ≈ SVM?". The relationship between the above results and NTK stems more from the proposer's academic background. The proposer is familiar with results such as "Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains," which uses NTK-related results to prove that neural networks cannot directly learn high-frequency signals. The solution is to transform them into Fourier features, whose form is similar to the Sinusoidal position encoding in equation $\eqref{eq:sinu}$.

Therefore, based on the intuition of NTK-related results, the proposer derived NTK-aware Scaled RoPE. I consulted the proposer about his derivation; it is actually quite simple—it combines extrapolation and interpolation: high-frequency extrapolation and low-frequency interpolation. Specifically, the lowest frequency term in equation $\eqref{eq:sinu}$ is $\frac{n}{\beta^{d/2-1}}$. Introducing a parameter $\lambda$, it becomes $\frac{n}{(\beta\lambda)^{d/2-1}}$, and we set it to be consistent with interpolation, i.e.: \begin{equation}\frac{n}{(\beta\lambda)^{d/2-1}} = \frac{n/k}{\beta^{d/2-1}}\end{equation} Then we solve for $\lambda=k^{2/(d-2)}$. As for the highest frequency term $\frac{n}{\beta}$, after introducing $\lambda$ it becomes $\frac{n}{\beta\lambda}$. Since $d$ is usually very large, $\lambda$ is very close to 1, so it is still close to $\frac{n}{\beta}$, which is equivalent to extrapolation.

Thus, this scheme simply and ingeniously combines extrapolation and interpolation. Additionally, since $d$ is relatively large (64 for BERT, 128 for LLaMA), $k^{2/(d-2)}$ is not very different from $k^{2/d}$, so it is basically consistent with the $k^{2/d}$ solution I proposed based on the base-encoding idea. Furthermore, from the proposer's idea, any scheme that can achieve "high-frequency extrapolation, low-frequency interpolation" is acceptable, not just the scheme that introduces $\lambda$ mentioned above; readers can try this out for themselves.
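A quick numeric check of the claim that $k^{2/(d-2)}$ and $k^{2/d}$ barely differ for realistic head dimensions (the numbers below are just this arithmetic, not new experiments):

```python
k = 8
for d in (64, 128):   # roughly BERT-sized and LLaMA-sized head dimensions
    print(d, k ** (2 / (d - 2)), k ** (2 / d))
# d = 64 : about 1.069 vs 1.067
# d = 128: about 1.034 vs 1.033
```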

Personal Testing

As a scheme that claims to increase the Context length of LLMs without fine-tuning, when I first saw NTK-aware Scaled RoPE, I was also shocked and couldn't wait to test it. After all, based on the experience from "Transformer Path to Upgrade: 9. A New Idea for Global Length Extrapolation," many mainstream schemes failed on the "GAU + Post Norm" combination I prefer. So how does this method fare?

When $k=8$, the comparison results are as follows (for the difference between "repeated" and "non-repeated," please refer to here):

| Test Length | 512 (Train) | 4096 (Repeated) | 4096 (Non-repeated) |
|:---|:---:|:---:|:---:|
| Baseline | 49.41% | 24.17% | 23.16% |
| Baseline-$\log n$ | 49.40% | 24.60% | 24.02% |
| PI-RoPE | 49.41% | 15.04% | 13.54% |
| PI-RoPE-$\log n$ | 49.40% | 14.99% | 16.51% |
| NTK-RoPE | 49.41% | 51.28% | 39.27% |
| NTK-RoPE-$\log n$ | 49.40% | 61.71% | 43.75% |

All results reported above are without long-text fine-tuning. Baseline refers to extrapolation; PI (Positional Interpolation) means modifying the Baseline to use interpolation; NTK-RoPE refers to modifying the Baseline to use NTK-aware Scaled RoPE. The options with $\log n$ refer to including the scale from "From Entropy Invariance to the Scale Operation of Attention" during pre-training. I included this variant because I felt that although NTK-RoPE solves the length generalization problem of RoPE, it does not solve the problem of attention dispersion.

The experimental results in the table fully meet expectations:

1. Direct extrapolation does not work well;
2. Interpolation performs poorly without fine-tuning;
3. NTK-RoPE achieves non-trivial (though somewhat degraded) extrapolation results without fine-tuning;
4. Adding the $\log n$ factor to concentrate attention indeed helps.

Therefore, NTK-RoPE has successfully become the second scheme I have tested that can extend the Context length of LLMs without fine-tuning (the first is naturally NBCE). Once again, a salute to the proposer's exceptional insight! Even better, NTK-RoPE performs significantly better on "repeated" extrapolation than on "non-repeated" extrapolation, indicating that the global dependence is preserved after this modification, rather than simply localizing attention.

Final Remarks

This article interprets RoPE from the perspective of $\beta$-base encoding and uses this to introduce some current developments in the open-source community regarding Long Context, including a modification scheme that increases Context length without fine-tuning.

In just one week, the progress on Long Context in the open-source community has been overwhelming and gratifying, so much so that user @ironborn123 commented:

Last week looked like the revenge of the interpolators :) ~~Open~~ ClosedAI better watch out.