By 苏剑林 | July 31, 2023
In the article "Transformer Upgrade Path: 10. RoPE is a β-base Encoding", we gave a $\beta$-base interpretation of RoPE and, based on the idea of base conversion, derived NTK-aware Scaled RoPE, which extends the Context length without fine-tuning. It must be said that understanding position encoding through the analogy of $\beta$-base representation is a very beautiful and inspiring perspective; every time I think deeply about it, I seem to gain new insights.
This article will revisit the $\beta$-base interpretation of RoPE and attempt to generalize the existing NTK-aware Scaled RoPE, with the aim of finding an optimal strategy for extending the Context length of LLMs without fine-tuning.
Base Analogy
We know that the parameterization of RoPE follows the form of Sinusoidal position encoding. Whether by coincidence or design, the Sinusoidal position encoding of an integer $n$ has many similarities with its $\beta$-base encoding.
Specifically, the $m$-th digit (counting from right to left) of the $\beta$-base representation of an integer $n$ is:
\begin{equation}
\lfloor n / \beta^{m-1} \rfloor \bmod \beta \label{eq:1}
\end{equation}
And its Sinusoidal position encoding is:
\begin{equation}
p_n = [\cos \theta_1, \sin \theta_1, \cos \theta_2, \sin \theta_2, \dots, \cos \theta_{d/2}, \sin \theta_{d/2}], \quad \theta_m = n / \beta^{m-1}, \beta = 10000^{2/d} \label{eq:2}
\end{equation}
As we can see, both share the same $n / \beta^{m-1}$, and $\bmod$ is a periodic function just like $\cos$ and $\sin$. Therefore, the only difference between the two is the largely insignificant floor function $\lfloor \cdot \rfloor$. Thus, analogizing RoPE/Sinusoidal position encoding to a $\beta$-base representation is a very intuitive and reasonable result.
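To make the parallel concrete, here is a minimal numpy sketch (function and variable names are my own) that computes both the $\beta$-base digits of Equation \eqref{eq:1} and the Sinusoidal encoding of Equation \eqref{eq:2} for the same $n$, so the shared $n/\beta^{m-1}$ term is explicit:

```python
import numpy as np

def beta_base_digits(n, beta, num_digits):
    """m-th digit (from the right) of n in base beta: floor(n / beta^(m-1)) mod beta."""
    m = np.arange(1, num_digits + 1)
    return (n // beta ** (m - 1)) % beta

def sinusoidal(n, d):
    """Sinusoidal position encoding of n: theta_m = n / beta^(m-1), beta = 10000^(2/d)."""
    beta = 10000 ** (2 / d)
    m = np.arange(1, d // 2 + 1)
    theta = n / beta ** (m - 1)
    # interleave cos and sin, one (cos, sin) pair per "digit"
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1).reshape(-1)

print(beta_base_digits(2023, 10, 4))   # [3 2 0 2] -> 2023 in base 10, lowest digit first
print(sinusoidal(2023, 8))             # 4 (cos, sin) pairs
```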
Corrected NTK
Following the logic in "Transformer Upgrade Path: 10. RoPE is a β-base Encoding", direct extrapolation concentrates the extrapolation pressure on the "high-order bits (large $m$)", while position interpolation makes the representation of the "low-order bits (small $m$)" denser, which is detrimental to distinguishing relative distances. NTK-aware Scaled RoPE is essentially a base conversion that spreads the extrapolation pressure across every bit while keeping the spacing between adjacent positions unchanged. These properties are friendly, even crucial, to LLMs, which clearly tend to rely heavily on relative positions, and they allow NTK-RoPE to achieve a reasonable degree of length extension even without fine-tuning.
Looking closely at Equation \eqref{eq:2}, $\cos$ and $\sin$ actually form a single unit, so it effectively has $d/2$ bits. This means it corresponds to a $d/2$-digit $\beta$-base encoding of $n$. If we want to extend the Context length by $k$ times by converting the $\beta$-base to a $\beta\lambda$-base, then at least we should have:
\begin{equation}
\lambda^{d/2} = k \Rightarrow \lambda = k^{2/d}
\end{equation}
Then the new RoPE becomes:
\begin{equation}
p_n = [\cos \theta_1, \sin \theta_1, \cos \theta_2, \sin \theta_2, \dots, \cos \theta_{d/2}, \sin \theta_{d/2}], \quad \theta_m = n / (\beta\lambda)^{m-1}, \beta = 10000^{2/d}, \lambda = k^{2/d} \label{eq:4}
\end{equation}
This is the NTK-RoPE we proposed in the previous article.
However, after deeper reflection, I realized this is not quite reasonable. Returning to Equation \eqref{eq:1}, if we were to calculate the $m$-th digit of a $\beta\lambda$-base, it should be:
\begin{equation}
\lfloor n / (\beta\lambda)^{m-1} \rfloor \bmod (\beta\lambda) \label{eq:5}
\end{equation}
In other words, besides changing $n / \beta^{m-1}$ to $n / (\beta\lambda)^{m-1}$, the period of the $\bmod$ operation also needs to be expanded by $\lambda$ times. This is equivalent to dividing by an additional $\lambda$ before applying $\cos$ and $\sin$:
\begin{equation}
p_n = [\cos \theta_1, \sin \theta_1, \cos \theta_2, \sin \theta_2, \dots, \cos \theta_{d/2}, \sin \theta_{d/2}], \quad \theta_m = \frac{n}{\lambda (\beta\lambda)^{m-1}}, \beta = 10000^{2/d}, \lambda = k^{2/d} \label{eq:6}
\end{equation}
In subsequent experiments, we refer to Equation \eqref{eq:4} proposed in the previous article as "NTK-RoPE-old" and Equation \eqref{eq:6} as "NTK-RoPE-fixed".
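The only difference between the two variants is the extra factor of $\lambda$ in the denominator. A minimal sketch (names are my own) comparing the frequency schedules of vanilla RoPE, Equation \eqref{eq:4}, and Equation \eqref{eq:6}:

```python
import numpy as np

def rope_theta(n, d, k=1, variant="base"):
    """theta_m for m = 1..d/2, for vanilla RoPE and the two NTK variants.
    k is the desired context-extension factor."""
    beta = 10000 ** (2 / d)
    lam = k ** (2 / d)
    m = np.arange(1, d // 2 + 1)
    if variant == "base":        # original RoPE
        return n / beta ** (m - 1)
    if variant == "ntk-old":     # Equation (4): only the base is enlarged
        return n / (beta * lam) ** (m - 1)
    if variant == "ntk-fixed":   # Equation (6): additionally divide by lambda
        return n / (lam * (beta * lam) ** (m - 1))
    raise ValueError(variant)

# extend a d=128 head by k=8 times
print(rope_theta(1000, 128, k=8, variant="ntk-old")[:3])
print(rope_theta(1000, 128, k=8, variant="ntk-fixed")[:3])
```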
Mixed Base
Now let us give our imagination a bit more rein: if we can use a $\beta$-base to represent positions, why not use a more general "mixed base"? A mixed base is one in which the base used for each digit is not necessarily the same. This is not unfamiliar to us: 60 seconds make 1 minute, 60 minutes make 1 hour, but 24 hours make 1 day, and 7 days make 1 week. Here 60, 60, 24, and 7 are different bases; in other words, seconds, minutes, hours, days, and weeks together form a mixed base.
Assuming that, from right to left, the 1st digit uses base $\beta_1$, the 2nd digit uses base $\beta_2$, the 3rd digit uses base $\beta_3$, and so on, the $m$-th digit of $n$ is:
\begin{equation}
\lfloor \frac{n}{\beta_1 \beta_2 \dots \beta_{m-1}} \rfloor \bmod \beta_m \label{eq:7}
\end{equation}
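As a sanity check of Equation \eqref{eq:7}, a small sketch (my own illustration) that writes a number of seconds in the seconds/minutes/hours/days mixed base:

```python
def mixed_base_digits(n, bases):
    """m-th digit of n in a mixed base: floor(n / (b_1*...*b_{m-1})) mod b_m."""
    digits, prod = [], 1
    for b in bases:
        digits.append((n // prod) % b)
        prod *= b
    return digits

# 1,000,000 seconds in the seconds/minutes/hours/days "base"
# bases from right to left: 60 (s), 60 (min), 24 (h), 7 (d)
print(mixed_base_digits(1_000_000, [60, 60, 24, 7]))  # [40, 46, 13, 4]
# i.e. 1,000,000 s = 1 week + 4 days + 13 h + 46 min + 40 s
# (the leading 1 week is the remaining quotient n // (60*60*24*7))
```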
Why consider a mixed base? Because one day I discovered an interesting fact: RoPE is essentially a relative position encoding, and relative position is a special case of a Toeplitz matrix, which looks like this (since this article focuses on language models, only the lower triangular part is shown):
\begin{equation}
\begin{pmatrix}
0 & & & & & & \\
1 & 0 & & & & & \\
2 & 1 & 0 & & & & \\
3 & 2 & 1 & 0 & & & \\
4 & 3 & 2 & 1 & 0 & & \\
5 & 4 & 3 & 2 & 1 & 0 & \\
6 & 5 & 4 & 3 & 2 & 1 & 0
\end{pmatrix}
\end{equation}
From the above matrix, we can observe that the distribution of relative positions is unbalanced! 0 appears most frequently, followed by 1, then 2, and so on: in a sequence of length $L$, the relative position $j$ appears exactly $L-j$ times, so the larger the relative distance, the fewer times it occurs during training. This implies that, viewed as a $\beta$-base encoding, the "high-order bits" of RoPE are likely to be insufficiently trained; in other words, the generalization ability of the high-order bits is likely inferior to that of the low-order bits. As mentioned, NTK-RoPE spreads the extrapolation pressure evenly across all bits. If my suspicion is correct, then "even spreading" is not optimal; instead, the low-order bits should take on more of the pressure and the high-order bits less, which leads naturally to a mixed base.
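A quick numerical check of this imbalance (a small numpy sketch with my own names):

```python
import numpy as np

L = 7  # sequence length, matching the 7x7 matrix above
rel = np.arange(L)[:, None] - np.arange(L)[None, :]   # relative positions (query - key)
counts = np.bincount(rel[rel >= 0])                   # keep the lower-triangular part only
print(counts)  # [7 6 5 4 3 2 1]: relative position j appears L - j times
```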
Allocation Optimization
Specifically, we extend the context to $k$ times by converting the $\beta$-base into a mixed $\beta_1, \beta_2, \dots, \beta_{d/2}$ base, where $\beta_m = \beta\lambda_m$. At this point, Equation \eqref{eq:7} becomes:
\begin{equation}
\lfloor \frac{n}{\beta^{m-1}(\lambda_1 \lambda_2 \dots \lambda_{m-1})} \rfloor \bmod (\beta\lambda_m)
\end{equation}
Equation \eqref{eq:6} correspondingly becomes:
\begin{equation}
p_n = [\cos \theta_1, \sin \theta_1, \cos \theta_2, \sin \theta_2, \dots, \cos \theta_{d/2}, \sin \theta_{d/2}], \quad \theta_m = \frac{n}{\beta^{m-1}(\lambda_1 \lambda_2 \dots \lambda_m)}, \beta = 10000^{2/d} \label{eq:9}
\end{equation}
According to the principles of "extending $k$ times" and "low-order bits sharing more pressure," the constraints are:
\begin{equation}
\lambda_1 \lambda_2 \dots \lambda_{d/2} = k, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{d/2} \ge 1 \label{eq:10}
\end{equation}
We discuss a solution of the following form (interested readers can try other forms, as there is a lot of freedom here):
\begin{equation}
\lambda_1 \lambda_2 \dots \lambda_m = \exp(a m^b) \label{eq:12}
\end{equation}
When $a > 0$ and $0 \le b \le 1$, the condition $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{d/2} \ge 1$ is satisfied. When $b=1$, this is exactly the previously mentioned "NTK-RoPE-fixed"; when $b=0$, it reduces to Positional Interpolation (PI). The constraint $\lambda_1 \lambda_2 \dots \lambda_{d/2} = k$ gives:
\begin{equation}
a(d/2)^b = \log k \label{eq:13}
\end{equation}
Thus, there is only one degree of freedom left to tune. Through a simple binary search, I found that in my experiments $b = 0.625$ generally gives better extension results (different models may have different optima, so please tune accordingly). This version is called "NTK-RoPE-mixed".
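Putting Equations \eqref{eq:9}, \eqref{eq:12} and \eqref{eq:13} together, here is a minimal sketch of the NTK-RoPE-mixed frequency schedule (names and structure are my own; only the $\theta_m$ schedule is shown, not the full rotary transformation):

```python
import numpy as np

def ntk_rope_mixed_theta(n, d, k, b=0.625):
    """theta_m = n / (beta^(m-1) * lambda_1*...*lambda_m), Equation (9),
    with lambda_1*...*lambda_m = exp(a * m^b) and a fixed by a*(d/2)^b = log k, Equation (13)."""
    beta = 10000 ** (2 / d)
    m = np.arange(1, d // 2 + 1)
    a = np.log(k) / (d / 2) ** b          # Equation (13)
    cum_lambda = np.exp(a * m ** b)       # lambda_1 * ... * lambda_m, Equation (12)
    return n / (beta ** (m - 1) * cum_lambda)

# b=1 recovers NTK-RoPE-fixed, b=0 recovers PI;
# b=0.625 is the value found in this article's experiments
theta = ntk_rope_mixed_theta(n=1000, d=128, k=8)
```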
Experimental Results
Based on the experiments in "Transformer Upgrade Path: 10. RoPE is a β-base Encoding", I added experiments for "NTK-RoPE-fixed" and "NTK-RoPE-mixed". The comparison is as follows:
| Test Length | 512 (Train) | 4096 (Repeated) | 4096 (Non-repeated) |
| --- | --- | --- | --- |
| Baseline | 49.41% | 24.17% | 23.16% |
| Baseline-logn | 49.40% | 24.60% | 24.02% |
| PI-RoPE | 49.41% | 15.04% | 13.54% |
| PI-RoPE-logn | 49.40% | 14.99% | 16.51% |
| NTK-RoPE-old | 49.41% | 51.28% | 39.27% |
| NTK-RoPE-logn-old | 49.40% | 61.71% | 43.75% |
| NTK-RoPE-fixed | 49.41% | 51.86% | 39.61% |
| NTK-RoPE-logn-fixed | 49.40% | 62.85% | 44.14% |
| NTK-RoPE-mixed | 49.41% | 53.09% | 40.12% |
| NTK-RoPE-logn-mixed | 49.40% | 68.91% | 45.41% |
As can be seen, compared to the constant-base "NTK-RoPE-old" and "NTK-RoPE-fixed", the "NTK-RoPE-mixed" derived from the mixed base brings a significant improvement. Moreover, it requires no fine-tuning, making it a "free lunch." Additionally, it is evident that the logn version has better extrapolation performance, but the logn trick needs to be added during the pre-training phase. Some readers have previously asked if models like LLAMA, which did not include the logn trick during pre-training, can still enjoy the "dividends" of logn. After testing, I found that the effect can be improved by adding the following scale factor:
\begin{equation}
\max(1, \log_{\text{maxlen}} n) \label{eq:14}
\end{equation}
Here, $\text{maxlen}$ is the maximum length during pre-training: 512 in this article's experiments, 2048 for LLAMA, and 4096 for LLAMA2. This can be implemented by multiplying each query $q_n$ by the corresponding factor. In this way, the part within $\text{maxlen}$ is unaffected, while the part beyond it is scaled by $\log n$, giving a smooth transition. The results are as follows (using $\dagger$ to distinguish from the original logn):
| Test Length | 512 (Train) | 4096 (Repeated) | 4096 (Non-repeated) |
| --- | --- | --- | --- |
| NTK-RoPE-fixed | 49.41% | 51.86% | 39.61% |
| NTK-RoPE-logn†-fixed | 49.41% | 55.94% | 41.11% |
| NTK-RoPE-mixed | 49.41% | 53.09% | 40.12% |
| NTK-RoPE-logn†-mixed | 49.41% | 59.11% | 42.38% |
It can be seen that this $\text{logn}^{\dagger}$ can also be considered a free lunch. In short, if you plan to pre-train from scratch, you might as well add the logn trick in advance; if training is already complete, you can use Equation \eqref{eq:14} instead and combine it with NTK-RoPE-mixed to obtain better Context extension.
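For concreteness, a minimal sketch of how this $\text{logn}^{\dagger}$ factor might be applied to the queries (my own illustration, assuming 1-based positions $n$; the query tensor shape is only an example):

```python
import numpy as np

def logn_dagger_scale(maxlen, seqlen):
    """Per-position factor max(1, log_maxlen(n)) from Equation (14), for n = 1..seqlen."""
    n = np.arange(1, seqlen + 1)
    return np.maximum(1.0, np.log(n) / np.log(maxlen))

# multiply each query q_n by its factor before the attention dot product;
# positions n <= maxlen get factor 1, so behaviour within the training length is unchanged
scale = logn_dagger_scale(maxlen=512, seqlen=4096)   # shape: (4096,)
# q = q * scale[:, None]   # e.g. for q of shape (seqlen, head_dim)
```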
Summary
In this article, we revisited the $\beta$-base perspective of RoPE and attempted to generalize NTK-aware Scaled RoPE. Inspired by the mixed base concept, we obtained a superior strategy for extending Context length without fine-tuning. Finally, experimental results demonstrated its effectiveness.