"Behind Closed Doors" Brief Discussion on Multimodal Ideas (III): Position Encoding

By 苏剑林 | September 6, 2024

In previous articles, we expressed the view that the main difference between multimodal LLMs and pure text LLMs is that the former has not yet formed a universally recognized standard methodology. This methodology includes not only the generation and training strategies discussed earlier but also the design of basic architectures, such as the "multimodal position encoding" we will discuss in this article.

Regarding this topic, we already discussed it once in "Path to Transformer Upgrade: 17. Simple Reflections on Multimodal Position Encoding" and proposed a scheme (RoPE-Tie). However, at that time the author's thinking on the problem was still at an early stage, with imprecise treatment of details and an incomplete understanding. Looking back from today's perspective, the scheme proposed then is still clearly some distance from an ideal answer.

Therefore, in this article we will work through the issue once more from the top down and present what we consider to be a more satisfactory result.

Multimodal Position

It might surprise many readers that multimodal models have not even reached a consensus on position encoding, but this is indeed the case. For text LLMs, the current mainstream position encoding is RoPE (RoPE will not be introduced in detail here; the reader is assumed to be familiar with it), or more accurately RoPE-1D, because the original design only applies to 1D sequences. Later we derived RoPE-2D, which can be used for 2D inputs like images, and following the same logic we can extend it analogously to RoPE-3D for 3D inputs like video.

However, what we just described is for single-modality input. When multiple modalities are mixed as input, difficulties arise: text is a 1D sequence, so its position is just a scalar $n$; images are 2D ("width" and "height"), so representing their positions requires a two-dimensional vector $(x, y)$; video adds a new time dimension (or "frames") to images, so its position is a three-dimensional vector $(x, y, z)$. When we hope to use the same model to process data from three modalities, we must find a way to blend these three different forms of position information.

As is well known, RoPE is applied as an absolute position encoding, but when combined with the inner product in Attention, the positions are automatically subtracted, producing the effect of a relative position encoding. Position vectors of the same size can be subtracted, but how do you subtract vectors of different sizes? This is the core difficulty of multimodal position encoding.
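To make the "subtract after the inner product" point concrete, here is a minimal numpy sketch (an illustration, not code from the original post; the helper name `rope_1d` is ad hoc): rotating queries and keys by their absolute positions and then taking the inner product yields a score that depends only on the position difference.

```python
# Rotating q and k by their absolute positions, then taking the inner product,
# gives a score that depends only on the position difference m - n.
import numpy as np

def rope_1d(vec, pos, base=10000.0):
    """Apply RoPE-1D: rotate consecutive pairs of `vec` by angles pos * theta_i."""
    d = vec.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)   # theta_i = b^{-2i/d}
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    a, b = vec[0::2], vec[1::2]                    # the (2i, 2i+1) pairs
    out = np.empty_like(vec)
    out[0::2] = a * cos - b * sin
    out[1::2] = a * sin + b * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
s1 = rope_1d(q, 10) @ rope_1d(k, 3)        # positions 10 and 3
s2 = rope_1d(q, 110) @ rope_1d(k, 103)     # both shifted by 100
print(np.allclose(s1, s2))                 # True: only 10 - 3 = 7 matters
```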

Many works choose to "evade" this difficulty by directly flattening all modalities and then using RoPE-1D. This is a viable solution but ultimately feels less than elegant. Furthermore, forced flattening may lower the performance ceiling of the model, as works like "VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks" have shown that the introduction of RoPE-2D helps improve model performance, especially for variable-resolution inputs.

Backward Compatibility

Therefore, we hope to design a multimodal position encoding that can be used for mixed modalities and can degrade to the corresponding RoPE-1D/2D/3D in a single modality to fully unlock the capabilities of each modality.

We just said that the main difficulty of multimodal position encoding is that position vectors of different sizes cannot be subtracted. To retain complete position information while still allowing subtraction, we have little choice but to unify everything at the highest dimensionality by raising the lower-dimensional positions. Let us take the mixed image-text modality as an example: since images are 2D, we raise the position encoding of text to two dimensions and then unify both with RoPE-2D. Can the dimensions be raised in an arbitrary way? Not quite. We want backward compatibility, meaning that when the input is pure text the scheme is completely equivalent to RoPE-1D.

To this end, let's compare RoPE-1D and RoPE-2D:

$$ \scriptsize{\begin{array}{c}\begin{array}{c}\text{RoPE-1D}\\ (\boldsymbol{\mathcal{R}}_n)\end{array}= \begin{pmatrix} \cos \bbox[yellow]{n}\theta_0 & -\sin \bbox[yellow]{n}\theta_0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 \\ \sin \bbox[yellow]{n}\theta_0 & \cos \bbox[yellow]{n}\theta_0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos \bbox[yellow]{n}\theta_1 & -\sin \bbox[yellow]{n}\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sin \bbox[yellow]{n}\theta_1 & \cos \bbox[yellow]{n}\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos \bbox[yellow]{n}\theta_{d/2-2} & -\sin \bbox[yellow]{n}\theta_{d/2-2} & 0 & 0 \\ 0 & 0 & 0 & 0 & \cdots & \sin \bbox[yellow]{n}\theta_{d/2-2} & \cos \bbox[yellow]{n}\theta_{d/2-2} & 0 & 0 \\ 0 & 0 & 0 & 0 & \cdots & 0 & 0 & \cos \bbox[yellow]{n}\theta_{d/2-1} & -\sin \bbox[yellow]{n}\theta_{d/2-1} \\ 0 & 0 & 0 & 0 & \cdots & 0 & 0 & \sin \bbox[yellow]{n}\theta_{d/2-1} & \cos \bbox[yellow]{n}\theta_{d/2-1} \end{pmatrix} \\[16pt] \begin{array}{c}\text{RoPE-2D}\\ (\boldsymbol{\mathcal{R}}_{x,y})\end{array}= \begin{pmatrix} \cos \bbox[yellow]{x}\theta_0 & -\sin \bbox[yellow]{x}\theta_0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 \\ \sin \bbox[yellow]{x}\theta_0 & \cos \bbox[yellow]{x}\theta_0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos \bbox[yellow]{y}\theta_1 & -\sin \bbox[yellow]{y}\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sin \bbox[yellow]{y}\theta_1 & \cos \bbox[yellow]{y}\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos \bbox[yellow]{x}\theta_{d/2-2} & -\sin \bbox[yellow]{x}\theta_{d/2-2} & 0 & 0 \\ 0 & 0 & 0 & 0 & \cdots & \sin \bbox[yellow]{x}\theta_{d/2-2} & \cos \bbox[yellow]{x}\theta_{d/2-2} & 0 & 0 \\ 0 & 0 & 0 & 0 & \cdots & 0 & 0 & \cos \bbox[yellow]{y}\theta_{d/2-1} & -\sin \bbox[yellow]{y}\theta_{d/2-1} \\ 0 & 0 & 0 & 0 & \cdots & 0 & 0 & \sin \bbox[yellow]{y}\theta_{d/2-1} & \cos \bbox[yellow]{y}\theta_{d/2-1} \end{pmatrix}\end{array}} $$

Did you notice the common structure? Looking purely at this form, one finds that $\boldsymbol{\mathcal{R}}_n = \boldsymbol{\mathcal{R}}_{n,n}$; that is, RoPE-1D at position $n$ is exactly RoPE-2D at position $(n, n)$. Therefore, to unify mixed image-text inputs with RoPE-2D while degrading to RoPE-1D for pure text, the position coordinates of the text portion must take the form $(n, n)$.

Of course, in reality they differ slightly. We know that for RoPE-1D, $\theta_i = b^{-2i/d}$, which means $\theta_{2j}$ and $\theta_{2j+1}$ are different. But for RoPE-2D, to keep $x$ and $y$ symmetric, the usual choice is to set $\theta_{2j}=\theta_{2j+1}$. This creates a contradiction, and we have two options: first, abandon the $x, y$ symmetry of RoPE-2D and keep $\theta_i = b^{-2i/d}$; second, use $\theta_{2j}=\theta_{2j+1}=b^{-4j/d}$, in which case the position encoding of the pure-text portion differs slightly from the existing RoPE-1D. Since $\theta_i = b^{-2i/d}$ changes little between adjacent $i$, the two options are in practice quite close; which one to choose is a matter of personal taste, and the author leans toward the first.
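The following small sketch, assuming the interleaved pair layout shown in the matrices above (helper names such as `thetas_1d` and `rope_2d` are made up for illustration), checks the claim $\boldsymbol{\mathcal{R}}_n = \boldsymbol{\mathcal{R}}_{n,n}$ under the first option and prints the two frequency choices side by side.

```python
import numpy as np

def thetas_1d(d, base=10000.0):
    # Option 1: keep the RoPE-1D frequencies, theta_i = b^{-2i/d}
    return base ** (-2 * np.arange(d // 2) / d)

def thetas_2d_symmetric(d, base=10000.0):
    # Option 2: theta_{2j} = theta_{2j+1} = b^{-4j/d}, symmetric in x and y
    return np.repeat(base ** (-4 * np.arange(d // 4) / d), 2)

def apply_rope(vec, angles):
    """Rotate consecutive pairs of `vec` by the given per-pair angles."""
    cos, sin = np.cos(angles), np.sin(angles)
    a, b = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = a * cos - b * sin
    out[1::2] = a * sin + b * cos
    return out

def rope_2d(vec, x, y, theta):
    # Even-indexed pairs rotate with x, odd-indexed pairs with y,
    # matching the alternating x/y pattern in the RoPE-2D matrix above.
    pos = np.where(np.arange(theta.size) % 2 == 0, x, y)
    return apply_rope(vec, pos * theta)

d, n = 64, 7
v = np.random.default_rng(0).normal(size=d)
theta = thetas_1d(d)
# RoPE-1D at position n == RoPE-2D at (n, n) when the same frequencies are used.
print(np.allclose(apply_rope(v, n * theta), rope_2d(v, n, n, theta)))  # True
print(thetas_1d(8))             # option 1: four distinct frequencies
print(thetas_2d_symmetric(8))   # option 2: frequencies repeated in pairs
```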

Equivalence and Symmetry

Through the above analysis, we have settled on unifying mixed image-text modalities with RoPE-2D, and backward compatibility dictates that the two-dimensional position of a text token at position $n$ should be $(n, n)$, which completes the design of the position encoding for the text part. Next, we need to design the position encoding for the image part.

If the input is just a single image with $w \times h$ patches, its position coordinates are naturally the coordinates of the patches themselves, i.e.,

\begin{equation}\left[\begin{matrix} (1,1) & (1,2) & \cdots & (1, w) \\ (2,1) & (2,2) & \cdots & (2, w) \\ \vdots & \vdots & \ddots & \vdots \\ (h,1) & (h,2) & \cdots & (h, w) \\ \end{matrix}\right]\label{eq:rope2d}\end{equation}

What we show here are absolute positions, but the actual effect is relative. A characteristic of relative positioning is that it is invariant to a global offset, so we can add $(\beta_1, \beta_2)$ to every coordinate without changing the effect. Secondly, we can scale every coordinate componentwise by $(\gamma_1, \gamma_2)$, which lets us adjust the interval between adjacent positions as needed. Combining these two points, the generalized two-dimensional positions of an image are:

\begin{equation}\left[\begin{matrix} (\beta_1 + \gamma_1, \beta_2 + \gamma_2) & (\beta_1 + \gamma_1, \beta_2 + 2\gamma_2) & \cdots & (\beta_1 + \gamma_1, \beta_2 + w\gamma_2) \\[8pt] (\beta_1 + 2\gamma_1, \beta_2 + \gamma_2) & (\beta_1 + 2\gamma_1, \beta_2 + 2\gamma_2) & \cdots & (\beta_1 + 2\gamma_1, \beta_2 + w\gamma_2) \\[8pt] \vdots & \vdots & \ddots & \vdots \\[8pt] (\beta_1 + h\gamma_1, \beta_2 + \gamma_2) & (\beta_1 + h\gamma_1, \beta_2 + 2\gamma_2) & \cdots & (\beta_1 + h\gamma_1, \beta_2 + w\gamma_2) \end{matrix}\right]\end{equation}

Now consider when an image is sandwiched between two segments of text; how should $\beta_1, \beta_2, \gamma_1, \gamma_2$ be chosen?

First, we assume a certain equivalence between text tokens and patches: after reasonable patchification, the status of each patch is equivalent to a token (An Image is Worth xxx Tokens). This means that for the two segments of text, it is as if they are sandwiching a sentence of $wh$ tokens. So, if the position of the last token of the left text segment is $(L, L)$, then the position of the first token of the right text segment is $(L + wh + 1, L + wh + 1)$.

Next, we need to introduce symmetry. Specifically, if the position of the first patch of the image is $(\beta_1 + \gamma_1, \beta_2 + \gamma_2)$ and the position of the last patch is $(\beta_1 + h\gamma_1, \beta_2 + w\gamma_2)$, we believe that the position difference between [the first patch of the image] and [the last token of the left text segment] should equal the position difference between [the first token of the right text segment] and [the last patch of the image], i.e.,

\begin{equation}\begin{pmatrix}\beta_1 + \gamma_1 \\ \beta_2 + \gamma_2\end{pmatrix} - \begin{pmatrix}L \\ L\end{pmatrix} = \begin{pmatrix}L+wh+1 \\ L+wh+1\end{pmatrix} - \begin{pmatrix}\beta_1 + h\gamma_1 \\ \beta_2 + w\gamma_2\end{pmatrix}\label{eq:beta-gamma}\end{equation}

There are four unknowns $\beta_1, \beta_2, \gamma_1, \gamma_2$ here but only two equations, so there are infinitely many solutions. We can simply take $\gamma_1 = \gamma_2 = 1$ and solve for:

\begin{equation}\beta_1 = L + \frac{1}{2}(wh - h),\quad \beta_2 = L + \frac{1}{2}(wh - w)\end{equation}

We can temporarily call this scheme RoPE-Tie-v2 or RoPE-TV (RoPE for Text and Vision).
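As a concrete illustration of this derivation, here is a short sketch (the helper name `rope_tv_positions` is made up for illustration) that assigns RoPE-TV coordinates to a "text + image + text" sequence with $\gamma_1=\gamma_2=1$ and the $\beta_1, \beta_2$ above, then verifies the symmetry condition $\eqref{eq:beta-gamma}$ numerically. Fractional coordinates are fine, since RoPE only needs real-valued positions.

```python
import numpy as np

def rope_tv_positions(n_text_left, w, h, n_text_right):
    """2-D positions for "text + (w x h image) + text" under the new scheme."""
    pos = [(n, n) for n in range(1, n_text_left + 1)]        # text tokens: (n, n)
    L = n_text_left
    beta1 = L + (w * h - h) / 2
    beta2 = L + (w * h - w) / 2
    for i in range(1, h + 1):                                # image patches, gamma_1 = gamma_2 = 1
        for j in range(1, w + 1):
            pos.append((beta1 + i, beta2 + j))
    start = L + w * h + 1                                    # right text resumes as if wh tokens had passed
    pos += [(n, n) for n in range(start, start + n_text_right)]
    return pos

pos = rope_tv_positions(n_text_left=5, w=4, h=3, n_text_right=2)
L, wh = 5, 4 * 3
left_text, first_patch = pos[L - 1], pos[L]
last_patch, right_text = pos[L + wh - 1], pos[L + wh]
# Symmetry: first_patch - left_text == right_text - last_patch, per coordinate.
print(np.allclose(np.subtract(first_patch, left_text),
                  np.subtract(right_text, last_patch)))      # True
```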

Analysis of Pros and Cons

According to this result, when an image of $w \times h$ patches follows a sentence, we just need to compute $(\beta_1, \beta_2)$ as described above and add them to the conventional two-dimensional positions $\eqref{eq:rope2d}$ to obtain the position coordinates of the image part, as shown in the figure below:

Schematic of the new RoPE-TV (RoPE-Tie-v2)

As a comparison, the old RoPE-Tie proposed in "Path to Transformer Upgrade: 17. Simple Reflections on Multimodal Position Encoding" has position coordinates as shown in the following figure:

Schematic of the old RoPE-Tie

In fact, RoPE-Tie also started from compatibility and symmetry, but it did not strictly enforce equivalence: it defaulted to $\beta_1 = \beta_2 = L$ and did not require that $w \times h$ patches be equivalent to $wh$ tokens, which eventually led to a set of integer solutions (if integer solutions are not insisted upon, equivalence could also have been satisfied):

\begin{equation}\gamma_1 = w+1,\quad \gamma_2 = h+1\end{equation}

From today's perspective, the default settings of RoPE-Tie are actually not very ideal. Therefore, this article re-chooses $\gamma_1 = \gamma_2 = 1$, ensures equivalence, and then derives $\beta_1, \beta_2$.

What are the benefits of the new scheme? First, in RoPE-Tie the relative positions within an image depend on its size, whereas in the new scheme the patch intervals are fixed at $(0,1)$ and $(1,0)$, which keeps the positional scale of patches consistent. For example, take a $128 \times 128$ image and its upper half (a $128 \times 64$ sub-image): because the two heights differ, the horizontal position intervals under RoPE-Tie also differ, so two patches that occupy the same location and carry the same content end up at different relative distances (scales) once RoPE-Tie is applied, which seems unreasonable. The new scheme does not have this problem.
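A quick numeric check of this scale argument, using the formulas above (the snippet is illustrative only):

```python
# Horizontal patch step under each scheme, for a full image and its top half
# (width x height, in patches). Under RoPE-Tie the step gamma_2 = h + 1
# depends on the image height; under the new scheme it is always 1.
for (w, h) in [(128, 128), (128, 64)]:
    rope_tie_step = h + 1      # 129 for the full image, 65 for the top half
    rope_tv_step = 1           # fixed in the new scheme
    print(f"{w}x{h}: RoPE-Tie horizontal step = {rope_tie_step}, RoPE-TV = {rope_tv_step}")
```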

Secondly, in RoPE-Tie, the interval between the image and the surrounding text is the same as the interval between patches within the image $(\gamma_1, \gamma_2)$. In the new scheme, a relatively large gap of $\frac{1}{2}(wh - h, wh - w)$ appears between text and image, while the inner parts of the text and image maintain fixed uniform intervals. Intuitively, this relatively large positional jump between different modalities can better achieve "modality isolation," allowing a single model to better process single-modality content while retaining multimodal interaction. This bears a striking resemblance to our usual practice of adding [IMG] and [/IMG] special tokens to mark an image.

The 3D Dilemma

In the RoPE-Tie article, "text-video" mixed modality position encoding was not discussed. In this section, we will supplement that discussion.

Intuitively, we can process video input in two ways. The first is to simply treat the video as multiple images (adding [VIDEO] and [/VIDEO] markers if necessary). This way we do not need a new position encoding for video; we just reuse the "text-image" mixed position encoding. However, this loses the alignment relationship between frames of the same video, so it may not be ideal. For example, "the 1st patch of the 1st frame" should be about as close to "the 1st patch of the 2nd frame" as it is to "the 2nd patch of the 1st frame," but flattening the video into multiple images fails to reflect this.

The second way is to extend the "text-image" results in parallel to "text-video." For a video of $w \times h \times t$ (frames of $w \times h$, totaling $t$ frames), its position coordinates are three-dimensional $(x, y, z)$. Based on the same compatibility, equivalence, and symmetry, we can generalize equation $\eqref{eq:beta-gamma}$ to:

\begin{equation}\begin{pmatrix}\beta_1 + \gamma_1 \\ \beta_2 + \gamma_2 \\ \beta_3 + \gamma_3\end{pmatrix} - \begin{pmatrix}L \\ L \\ L\end{pmatrix} = \begin{pmatrix}L+wht+1 \\ L+wht+1 \\ L+wht+1\end{pmatrix} - \begin{pmatrix}\beta_1 + h\gamma_1 \\ \beta_2 + w\gamma_2 \\ \beta_3 + t\gamma_3\end{pmatrix}\end{equation}

If we still set $\gamma_1 = \gamma_2 = \gamma_3 = 1$, we get:

\begin{equation}\beta_1 = L + \frac{1}{2}(wht - h),\quad \beta_2 = L + \frac{1}{2}(wht - w),\quad \beta_3 = L + \frac{1}{2}(wht - t)\end{equation}
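A small sketch of these formulas (the function name is made up for illustration); note that all three offsets contain $wht$, so the frame count $t$ must be known before any patch position can be assigned, which is exactly the issue discussed next:

```python
# Offsets for a w x h x t video following text that ends at position L,
# with gamma_1 = gamma_2 = gamma_3 = 1 (function name is illustrative).
def video_betas(L, w, h, t):
    wht = w * h * t
    return (L + (wht - h) / 2,   # beta_1 (height axis)
            L + (wht - w) / 2,   # beta_2 (width axis)
            L + (wht - t) / 2)   # beta_3 (time axis)

print(video_betas(L=10, w=4, h=3, t=8))   # (56.5, 56.0, 54.0): every offset needs t
```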

Doing this fully preserves the dimensionality of the video positions, which looks more elegant, but the author believes it still has a shortcoming. The shortcoming stems from the author's view of the time dimension of video: the 3D of a video is really "2 spatial dimensions + 1 time dimension," which is different from the "3 spatial dimensions" of the real 3D world. In the author's view, the time dimension and the two spatial dimensions of a video are not on an equal footing; the time dimension is more like the left-to-right writing direction of text. Therefore, the perfect multimodal LLM in the author's imagination should be able to continue a video just as a text LLM continues text, in principle generating video autoregressively without limit until an [EOS] marker appears.

We just mentioned two "text-video" mixed encoding schemes. The first, treating the video directly as multiple images, can generate video autoregressively without limit. The second, seemingly more elegant scheme cannot, because its $\beta_1, \beta_2, \beta_3$ depend on $t$, so the number of frames must be known in advance. Strictly speaking, the second scheme can still be used to generate video autoregressively, but only with a frame count fixed beforehand, which in the author's view does not fit the ideal character of the time dimension (time should be able to advance forward without constraint).

Some readers might wonder: why is it acceptable for $w, h$ to appear in $\beta_1, \beta_2$ for images? That is, why doesn't image generation mind knowing the image size in advance? The reason is that an image has two spatial directions: even when generating an image autoregressively, we must know the extent of at least one direction so the model knows when to "wrap" and produce a complete 2D image. Since the two spatial dimensions of an image are on an equal footing, knowing both is no worse than knowing just one, so pre-determining the image size is acceptable.

Furthermore, we can use the "AR+Diffusion" approach introduced in ""Behind Closed Doors" Brief Discussion on Multimodal Ideas (I): Lossless Input" to build a "text-image" model; in that case the image-generation part is Diffusion, which must know the target image size in advance anyway.

Related Work

A while ago, Alibaba open-sourced a multimodal model named "Qwen2-VL". Its introduction mentions a proposed multimodal rotary position embedding (M-RoPE), which piqued the author's interest. After reading the source code (link), the author found that M-RoPE follows the compatibility idea of RoPE-Tie but preserves neither symmetry nor equivalence.

Source code comments of M-RoPE

Using the notation of this article, M-RoPE actually chooses $\beta_1 = \beta_2 = \beta_3 = L$ and $\gamma_1 = \gamma_2 = \gamma_3 = 1$ (for the "text-video" mixed modality), and then the position of the first token of the text after the video is simply taken as the maximum video position coordinate plus 1. This way, if generating video autoregressively, one indeed doesn't need to determine the frame count in advance, but symmetry and equivalence are sacrificed.
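For concreteness, here is a sketch of the assignment just described, reconstructed from the description above rather than taken from the actual Qwen2-VL source (the axis ordering and function name are illustrative assumptions):

```python
# M-RoPE-style positions for "text + (w x h x t) video + text", following the
# description in this article (not the actual Qwen2-VL implementation):
# beta_1 = beta_2 = beta_3 = L, gamma_1 = gamma_2 = gamma_3 = 1, and the text
# after the video resumes at the maximum video coordinate plus 1.
def m_rope_positions(n_text_left, w, h, t, n_text_right):
    pos = [(n, n, n) for n in range(1, n_text_left + 1)]   # text tokens: (n, n, n)
    L = n_text_left
    for k in range(1, t + 1):                              # time index
        for i in range(1, h + 1):                          # height index
            for j in range(1, w + 1):                      # width index
                pos.append((L + i, L + j, L + k))
    start = L + max(w, h, t) + 1                           # jump past the largest coordinate
    pos += [(n, n, n) for n in range(start, start + n_text_right)]
    return pos

# With w = h = t = 3 the video contributes 27 patches, yet the text after it
# sits only 4 positions beyond L -- the "n^3 tokens in a span of n" issue below.
print(m_rope_positions(2, 3, 3, 3, 1)[-1])   # (6, 6, 6)
```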

How important are symmetry and equivalence? The author does not know the answer; it requires sufficient experimentation to verify. But if just brainstorming, the author guesses it might affect performance in extreme cases. For example, in M-RoPE, if there is a video with a very small frame size but a very long duration, its spatial position coordinates are continuous relative to the text on the left but jump abruptly relative to the text on the right. Intuitively, this could make text and visual interaction less friendly.

Another example: for a video where $w=h=t=n$, it intuitively is equivalent to $n^3$ tokens. But according to M-RoPE's rules, if two text segments sandwich such a video, it is only equivalent to sandwiching a text segment of $n$ tokens. In other words, $n^3$ tokens are placed within a relative distance of size $n$. Could this lead to excessive information density and increase the model's difficulty in understanding?

Of course, for a Decoder-only LLM where even NoPE (No Position Encoding) might work, these problems might just be overthinking on the author's part.

Summary

This article shared the author's further thoughts on multimodal position encoding. It proposed three principles for constructing multimodal position encodings: compatibility, equivalence, and symmetry; used them to improve the previously proposed RoPE-Tie; and finally discussed the design and difficulties of position encoding for the "text-video" mixed modality, as well as the connection between Qwen2-VL's M-RoPE and RoPE-Tie.