Reflections on "Building Behind Closed Doors": Shallow Thoughts on Multimodal Approaches (1): Lossless Input

By 苏剑林 | February 21, 2024

In this article, I want to share some of my "behind closed doors" thoughts (or rather, conjectures) regarding multimodal model architectures. The recent releases of Google's Gemini 1.5 and OpenAI's Sora have once again ignited many people's passion for multimodality, and the sparse technical reports have led to intense speculation about the architectures behind them. However, this article is not being published just to join the hype; in fact, some of these ideas have been brewing for a long time and were only recently organized into a coherent form. Their publication simply happens to coincide with these releases.

A disclaimer beforehand: the term "building behind closed doors" is not false modesty. My experience with large models is "unremarkable," and my practice in multimodality is almost a "total blank." This article is indeed just a "subjective conjecture" based on previous experiences in text and image generation.

Problem Background

First, let's simplify the problem. The "multimodality" discussed in this article primarily refers to the dual-modality mixing of text and images: both the input and the output can contain text and images. Many readers might initially feel that multimodal models are simply a matter of burning money on GPUs, using Transformers for everything, and trusting that "brute force works miracles" (i.e., scaling laws).

Actually, it's not that simple. Looking at text generation, there has essentially been only one mainstream path from the beginning: the Language Model, which models the conditional probability $p(x_t|x_1,\cdots,x_{t-1})$. Whether it was the early n-gram models or the later Seq2Seq and GPT, all are approximations of this conditional probability. In other words, people have always been clear about "which direction to go" to achieve text generation; the only differences were in the models used, such as LSTM, CNN, Attention, or the recently revived Linear RNNs. This is why text generation really can go "all in" on Transformers and work miracles: the direction itself is standard and clear.

However, for image generation, there is no such "standard direction." Among the image generation models discussed on this site, we have VAE, GAN, Flow, Diffusion, and niche ones like EBM, PixelRNN/PixelCNN, etc. The distinctions between these methods are not because they use RNN, CNN, or Attention leading to different effects, but because their underlying modeling theories are fundamentally different. The root cause of this diversity in image generation methods is the difficulty of probability modeling for continuous variables.

For a sentence $(x_1,x_2,\cdots,x_l)$ of length $l$, each $x_t$ comes from a finite vocabulary. Therefore, $p(x_t|x_1,\cdots,x_{t-1})$ is essentially a classification task. Given the combination of "the universal approximation capability of neural networks + Softmax," any classification task can theoretically be modeled precisely. This is the theoretical guarantee behind text generation. However, we usually treat images as continuous vectors, where $x_t$ is a real number. Even if we perform the same conditional decomposition, how do we model $p(x_t|x_1,\cdots,x_{t-1})$? Note that at this point, $p(x_t|x_1,\cdots,x_{t-1})$ is a probability density, and a necessary condition for a probability density is that it is non-negative and integrates to 1:

\begin{equation}\int p(x_t|x_1,\cdots,x_{t-1}) dx_t = 1\end{equation}

Aside from the normal distribution, how many functions can we write whose integral is always 1? And the functions we can write, like the normal distribution, are not expressive enough to fit arbitrarily complex distributions. To put it bluntly: neural networks are universal function approximators, but they are not universal probability density approximators. This is the inherent difficulty of generative modeling for continuous variables. The various image generation schemes are essentially different clever ways of sidestepping direct modeling of the probability density (with the exception of Flow models). Discrete variables do not have this difficulty, because the constraint for a discrete distribution is that the probabilities sum to 1, which is easily achieved through Softmax.
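To make the contrast concrete, here is a minimal numpy sketch (my own illustration, not from any particular model): a softmax head turns arbitrary network outputs into a valid discrete distribution for free, whereas for a continuous $x_t$ the only easy way to satisfy the normalization constraint is to fall back on a hand-picked parametric family such as a Gaussian, which is exactly what is not expressive enough.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete case: any real-valued logits become a valid distribution via softmax.
logits = rng.normal(size=256)                 # e.g. 256 pixel values as the "vocab"
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.sum())                            # exactly 1 by construction

# Continuous case: a network output f(x) is NOT automatically a density.
# The easy way out is to predict parameters of a fixed family, e.g. a Gaussian:
mu, log_sigma = 0.3, -1.2                     # pretend these came from a network
def gaussian_logpdf(x, mu, log_sigma):
    sigma = np.exp(log_sigma)
    return -0.5 * ((x - mu) / sigma) ** 2 - log_sigma - 0.5 * np.log(2 * np.pi)

# This integrates to 1, but a single Gaussian cannot fit an arbitrarily complex
# p(x_t | x_1, ..., x_{t-1}); that is the crux of the difficulty.
xs = np.linspace(-10, 10, 100001)
dx = xs[1] - xs[0]
print((np.exp(gaussian_logpdf(xs, mu, log_sigma)) * dx).sum())   # ~1.0
```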

The Path of Discreteness

Some readers might wonder: can we discretize images and then apply the text generation framework? Yes, this is one of the mainstream ideas (perhaps the only one).

In fact, images are naturally discrete. An RGB image of size $n\times n$ is essentially $3n^2$ integers ranging from 0 to 255. That is, it's equivalent to a sentence with a length of $3n^2$ and a vocab_size of 256. Broadly speaking, computers are inherently discrete; everything they represent—text, images, audio, video—is discrete. So, directly using their original discrete representations in a text generation framework is theoretically sound. Early works like PixelRNN and PixelCNN performed autoregressive generation directly in the pixel space. One of the main experiments of OpenAI's Sparse Transformer, which we introduced in "Born for Thrift: From Standard Attention to Sparse Attention," was also pixel-level image autoregression.
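As a quick sanity check of this "image as a sentence" view, here is a tiny numpy sketch (my own, using an arbitrary random image) that flattens an RGB image into a token sequence over a vocabulary of 256:

```python
import numpy as np

# A random "RGB image" of size n x n with 8-bit channels.
n = 256
image = np.random.randint(0, 256, size=(n, n, 3), dtype=np.uint8)

# Flatten it into a "sentence": 3 * n^2 tokens drawn from a vocabulary of 256.
tokens = image.reshape(-1)           # shape (3 * n * n,) = (196608,)
print(tokens.shape, tokens.min(), tokens.max())

# In principle this sequence can be fed to exactly the same autoregressive
# p(x_t | x_1, ..., x_{t-1}) machinery used for text; the catch is the length.
```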

However, the biggest problem with operating directly in pixel space is the sequence length: it's too long, and generation is too slow. In most application scenarios, the image resolution needs to be at least $256\times 256$ to be practical (unless it's just for generating small emoticons). Even with $n=256$, there are $3n^2 \approx 200,000$ elements, so generating a single $256\times 256$ image would require about 200,000 autoregressive decoding steps! Although Long Context technology has made great strides recently, this cost is still very extravagant, and the generation time is hard to accept.

To this end, an obvious idea is "compress first, generate later"—that is, use another model to compress the sequence length, generate in the compressed space, and then restore the image through a model. Compression naturally relies on an AE (AutoEncoder). But since we want to apply the modeling method of text generation, we must ensure discreteness after compression. This requires VQ-VAE and later VQ-GAN, where VQ can also be replaced by the recent FSQ. Similar to a text Tokenizer, VQ-VAE/GAN acts as an "Image Tokenizer." It maintains the discreteness of the encoding result but significantly reduces the sequence length (e.g., if resolution is reduced to $1/4$, then $3n^2 \to (n/4)^2$, a 48-fold reduction), and can restore the original image through a corresponding Decoder (DeTokenize). Many multimodal works based on the "Image Tokenizer" idea exist, such as the recent LWM and AnyGPT.
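For concreteness, the following is a minimal sketch of the vector-quantization step at the heart of VQ-VAE/VQ-GAN; the encoder and decoder networks are omitted, and all sizes (a 32×32 grid, a codebook of 8192 entries) are chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend an encoder has already downsampled a 256x256 image to a 32x32 grid of
# d-dimensional feature vectors (the CNN encoder itself is omitted here).
h, w, d, codebook_size = 32, 32, 64, 8192
features = rng.normal(size=(h * w, d))
codebook = rng.normal(size=(codebook_size, d))

# Vector quantization: replace each feature vector by its nearest codebook entry.
dists = ((features ** 2).sum(-1, keepdims=True)
         - 2 * features @ codebook.T
         + (codebook ** 2).sum(-1))
indices = dists.argmin(axis=-1)       # shape (1024,): the discrete "image tokens"
print(indices.shape)

# The language model only ever sees these 1024 tokens; a decoder (DeTokenizer)
# maps codebook[indices] back to pixels, and that is where information is lost.
```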

Whether in the original pixel space or the compressed encoding space, they share a common trait: they are 2D features. Unlike text, which has only one left-right dimension, images have both left-right and up-down dimensions. During autoregressive generation, a generation order must be manually designed—such as left-to-right then top-to-bottom, top-to-bottom then left-to-right, counter-clockwise from the center, or sorted by distance from the top-left corner. Different generation directions can significantly impact results, introducing extra hyperparameters and making it feel less "end-to-end" and elegant. To address this, we can use Cross Attention to combine 2D features and output a single-dimensional encoding result. Related work can be found in "Planting a SEED of Vision in Large Language Model".

Compression Loss

It seems that through "Image Tokenizers," multimodal generation has been "solved"? Not quite; the problems are just beginning.

The biggest problem with Image Tokenizers like VQ-VAE and VQ-GAN is that to significantly improve generation speed and shorten sequence length, the encoding resolution is highly compressed (the mainstream is $256\times 256 \to 32\times 32$ or even $256\times 256 \to 16\times 16$). This leads to severe loss of image information. To perceive this intuitively, we can refer to the reconstruction effect in the SEED paper:

[Figure: Reconstruction effect of SEED]

As can be seen, although the reconstructed images are clear and maintain the overall semantics of the input image, the local details are completely different. This means it is impossible to complete arbitrary mixed text-image tasks (such as OCR) based on this Image Tokenizer.

Furthermore, we can perform a simple "information audit" to see how serious the information loss is. First, referring to the experimental results in "Generating Long Sequences with Sparse Transformers", the average entropy of ImageNet-64 is 3.44 bits/byte. Models at that time were not big enough; theoretically, increasing the model could lower this further. Let's assume it is 3 bits/byte. Thus, a $64 \times 64$ ImageNet image has an average total entropy of $64\times 64\times 3\times 3$ bits. Next, we know that for a vocabulary with $V$ tokens, the average entropy per token is $\log_2 V$ bits. If we want to compress the encoding length to $L$ and achieve lossless compression, we must have:

\begin{equation}L\times \log_2 V \geq 64\times 64\times 3\times 3\end{equation}

If $L=1024=32\times 32$, then at least $V\geq 2^{36}\approx 7\times 10^{10}$. If $L=256=16\times 16$, then it requires at least $V\geq 2^{144}\approx 2\times 10^{43}$! Clearly, the codebook sizes of current Image Tokenizers have not reached such astronomical magnitudes. Consequently, the result is inevitably severe information loss!
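The arithmetic above can be reproduced in a couple of lines (using the same assumed 3 bits/byte figure):

```python
# Information audit: assume ~3 bits per byte (per colour channel) for a
# 64x64 ImageNet image, following the Sparse Transformer measurements.
total_bits = 64 * 64 * 3 * 3          # = 36864 bits per image

for L in (32 * 32, 16 * 16):          # target encoded sequence lengths
    bits_per_token = total_bits / L
    min_vocab = 2 ** bits_per_token   # lossless compression needs V >= this
    print(f"L={L}: need log2(V) >= {bits_per_token:.0f}, i.e. V >= {min_vocab:.1e}")

# L=1024: need log2(V) >= 36,  i.e. V >= 6.9e+10
# L=256:  need log2(V) >= 144, i.e. V >= 2.2e+43
```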

A natural question is: why must it be lossless? Indeed, humans cannot achieve perfect losslessness either. In fact, human understanding of images might involve more severe information loss than an Image Tokenizer. However, the requirement for a model is to align with human cognition. In other words, lossy compression is fine, but it should be "lossless to humans," much like how discarding infrared and ultraviolet light is lossless to the human eye. However, "lossless to humans" is a very broad concept without calculable metrics. VQ-VAE uses L2 distance to reconstruct images; due to information loss, blurring is inevitable. VQ-GAN adds a GAN loss to improve clarity, but it can only maintain global semantics and cannot perfectly align with human standards. Moreover, no one knows when humans will propose new image tasks that rely more on detail. Therefore, from the perspective of general intelligence, lossless compression is the inevitable final choice.

Thus, in a truly universal multimodal model, the image part is bound to be much more difficult than the text part because the information content of images is far greater than that of text. But actually, images created by humans (like painting) are not much more complex than text (like writing). The truly complex images are those captured directly from nature. So ultimately, text is a product of humanity, while images are a product of nature. Humans are not as "smart" as nature, so text is not as difficult as images. A truly universal artificial intelligence must aim to surpass humans in every aspect.

Diffusion Models

Back to the point. Given the current state of image generation technology, if we insist on being lossless, we either go back to autoregression in pixel space, which, as analyzed above, is unacceptably slow, or we return to continuous space and treat images as continuous vectors. Under this lossless requirement, the only two choices left are Flow models and Diffusion models.

Flow is designed to be reversible, and Diffusion models can also derive reversible ODE equations. Both map a standard Gaussian distribution to the target distribution, meaning they have sufficient entropy sources.

Discrete and continuous generation differ in their entropy sources: for discrete autoregressive generation, the entropy source is the product of seqlen and $\log(\text{vocab\_size})$. Since the vocab_size contribution grows only logarithmically, the budget mainly relies on seqlen; but seqlen equals decoding cost, so discrete entropy is expensive. For transformation-based continuous generation, the entropy source is Gaussian noise, whose dimensionality can in principle be arbitrarily large; it is cheap and can be sampled in parallel.
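As a back-of-the-envelope illustration (numbers chosen only for this example), note how the discrete entropy budget scales linearly with seqlen but only logarithmically with vocab_size:

```python
import math

# Entropy budget of discrete autoregressive generation: seqlen * log2(vocab).
def budget_bits(seqlen, vocab_size):
    return seqlen * math.log2(vocab_size)

# Doubling the sequence length doubles the budget (and the decoding cost) ...
print(budget_bits(1024, 8192), budget_bits(2048, 8192))   # 13312.0  26624.0
# ... while squaring the vocabulary only doubles the per-token contribution.
print(budget_bits(1024, 8192 ** 2))                       # 26624.0
# So buying entropy with seqlen is expensive; buying it with vocab_size saturates.
# A diffusion/flow model instead draws its entropy from Gaussian noise of whatever
# dimension we like, sampled in parallel at essentially no sequential cost.
```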

However, to ensure reversibility in every layer, Flow models significantly modify the architecture, which likely limits their performance upper bound (no direct evidence, but Flow models haven't produced stunning results yet). Therefore, the only remaining choice is Diffusion models.

Note that Diffusion is just the choice for the image generation scheme. For image understanding, from a lossless perspective, any encoding method carries the risk of distortion, so the most reliable input is definitely the original image. Therefore, the safest way should be to input the original image directly in the form of Patches, similar to the approach used by Fuyu-8b:

[Figure: Model diagram of Fuyu-8b]
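To make "raw patches in, one linear projection, no pretrained encoder" concrete, here is a minimal numpy sketch with assumed sizes; it is not Fuyu-8b's actual implementation, just the general shape of the idea:

```python
import numpy as np

rng = np.random.default_rng(0)

# Cut the raw image into patches and project each patch straight into the
# model's embedding space with a single linear layer -- no pretrained encoder.
H = W = 256
patch, d_model = 16, 1024
image = rng.random((H, W, 3)).astype(np.float32)

# (H, W, 3) -> (H/p, p, W/p, p, 3) -> (num_patches, p*p*3)
patches = image.reshape(H // patch, patch, W // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

W_proj = rng.normal(scale=0.02, size=(patch * patch * 3, d_model)).astype(np.float32)
image_tokens = patches @ W_proj      # (256, d_model): ready to interleave with text
print(image_tokens.shape)
```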

However, Fuyu-8b only handles multimodal input; the output is still single-modality text. How can we add image generation capability to it? Considering that the Diffusion model in the training phase is a denoising task, a possible approach would be:

[Figure: Personal imagination of a multimodal generation method]

During training, the input is text plus a noised image. The training objective for the text part is to predict the next token, while for the image part it is to predict the original image (or the noise). At inference time, the text part is still generated token by token until `[IMG]` is predicted; then several noise vectors are fed in parallel, and an image is sampled according to the Diffusion model. Note that the image generation part is parallel, so in principle it is better if it is not Decoder-Only: a Decoder-Only (causal) image part would require a manually specified generation order, and different orders might significantly affect performance. With current Diffusion acceleration techniques, generation can be completed in roughly 10 steps, so the generation speed is acceptable.
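Below is a toy PyTorch sketch of the mixed objective just described: cross-entropy next-token prediction on the text positions plus denoising regression on the noised image-patch positions. It is purely my own illustration under assumed sizes and a simple linear noise schedule, not a known implementation, and the proper attention masking (causal on text, full on the image block) is omitted for brevity:

```python
import torch
import torch.nn.functional as F

# Toy mixed objective: next-token cross-entropy on text, denoising MSE on image
# patches. Sizes and the linear noise schedule are assumptions for illustration.
d_model, vocab, n_text, n_patch, patch_dim = 512, 32000, 32, 64, 768
backbone = torch.nn.TransformerEncoder(          # stand-in for the multimodal model
    torch.nn.TransformerEncoderLayer(d_model, 8, batch_first=True), num_layers=2)
text_emb = torch.nn.Embedding(vocab, d_model)
patch_in = torch.nn.Linear(patch_dim, d_model)   # raw 16x16x3 patches -> d_model
lm_head = torch.nn.Linear(d_model, vocab)
noise_head = torch.nn.Linear(d_model, patch_dim)

text = torch.randint(0, vocab, (1, n_text))      # toy text token ids
patches = torch.rand(1, n_patch, patch_dim)      # toy raw image patches

# Forward "diffusion" on the image part only (simple linear interpolation here).
t = torch.rand(1, 1, 1)
noise = torch.randn_like(patches)
noisy = (1 - t) * patches + t * noise

# Proper masking (causal on text, full attention on the image block) is omitted.
h = backbone(torch.cat([text_emb(text), patch_in(noisy)], dim=1))
h_text, h_img = h[:, :n_text], h[:, n_text:]

loss_text = F.cross_entropy(lm_head(h_text[:, :-1]).transpose(1, 2), text[:, 1:])
loss_img = F.mse_loss(noise_head(h_img), noise)  # predict the added noise
loss = loss_text + loss_img
print(float(loss))
```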

(Update 2024.08.26: Meta's new Transfusion uses a scheme basically identical to the one above, just adding a Latent Encoder step for the image, with excellent results; furthermore, the slightly later Show-o is also very similar, with the difference being the discretization of Diffusion.)

Patch Input

A key aspect of the above approach is the use of a Patch-based Diffusion model, so the most fundamental thing is to verify whether such a Diffusion model design is feasible (since it has long been held that Diffusion models depend heavily on the existing U-Net architecture). To this end, I conducted some experiments and surveyed some literature; below is a summary of my preliminary conclusions.

According to my research, the earliest attempts to combine "Patch input + Transformer" for Diffusion models were likely "All are Worth Words: A ViT Backbone for Diffusion Models" and "Scalable Diffusion Models with Transformers". These two papers were released roughly at the same time and used similar methods. The former (U-ViT) emphasized the role of the "Long skip connection" found in U-Net, while the latter (DiT) emphasized the necessity of incorporating the diffusion time $t$ and condition label $y$ into the model via adaLN. However, only U-ViT tried the approach of using original image patches directly as input, and even then, only for $64\times 64$ resolution. For $256\times 256$ and $512\times 512$ resolutions, both U-ViT and DiT performed diffusion in the feature space reduced by an LDM autoencoder. This is indeed the current mainstream, but as stated earlier, this level of compression comes with severe information loss and can't be called a truly universal feature.

Directly using raw image patches instead of pre-trained encoder features also has the advantage of avoiding isolation between features. For example, when we need to input two images $I_1$ and $I_2$ simultaneously, the usual encoder-feature approach is to feed in $encoder(I_1)$ and $encoder(I_2)$. The problem is that the $encoder$ itself already performs a layer of semantic interaction within each image, but when $encoder(I_1)$ and $encoder(I_2)$ are fed in separately, that layer of interaction between $I_1$ and $I_2$ is missing. This is the problem of feature isolation between images; more details can be found in "Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion". It is therefore better to input both text and images in their original forms and let the multimodal model handle all interactions itself, eliminating this barrier.

Of course, there is a reason why directly inputting raw image patches hasn't become mainstream. I experimented with it myself. The task was CelebA-HQ diffusion generation at $64\times 64$ and $128\times 128$ resolutions, reshaped respectively into $16\times 16\times 48$ and $16\times 16\times 192$ and projected into the model. The model was a standard Transformer with Pre-Norm, no Long skip connections, using GAU instead of MHA for the backbone. Position encoding was 2D-RoPE, and the Time Embedding was added directly to the Patch input. Code reference: GitHub - bojone/patch-diffusion.
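For reference, here is a small numpy re-sketch of that input pipeline (not the code in the linked repo): reshape a 64×64×3 image into a 16×16 grid of 48-dimensional patches, project to the model dimension, and add the time embedding directly to the patch input:

```python
import numpy as np

rng = np.random.default_rng(0)

# Re-sketch of the patch setup described above (not the actual repo code):
# a 64x64x3 image becomes a 16x16 grid of 48-dim patches, which is projected
# to the model dimension; the diffusion time embedding is simply added on.
img = rng.random((64, 64, 3)).astype(np.float32)
p = 4                                               # 64/16 = 4, so 4*4*3 = 48 dims
grid = img.reshape(16, p, 16, p, 3).transpose(0, 2, 1, 3, 4).reshape(16 * 16, p * p * 3)
print(grid.shape)                                   # (256, 48)

d_model = 512
W_in = rng.normal(scale=0.02, size=(48, d_model)).astype(np.float32)

def time_embedding(t, dim=d_model):
    # standard sinusoidal embedding of the diffusion step t (assumed form)
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

t = 137                                             # some diffusion timestep
x = grid @ W_in + time_embedding(t)                 # Time Embedding added to the input
print(x.shape)                                      # (256, 512): Transformer input
```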

My experimental results show that both $64\times 64$ and $128\times 128$ resolutions can converge normally and eventually generate effects comparable to a standard U-Net (visually, without calculating FID). However, they require many more training steps to converge. For instance, on a single A800, a standard U-Net yields passable results in about 1-2 days at $128\times 128$ resolution, whereas the Transformer-based architecture required more than 10 days to even look decent. The reason is likely the lack of the inductive bias of CNNs; the model needs more training steps to learn the image priors. However, for large multimodal models, this is probably not an issue, as the number of training steps required for LLMs is already sufficiently high.

Article Summary

This article introduced my conception of multimodal model design—using original image patches directly as image input, while the text part follows the conventional next-token prediction, and the image part reconstructs the original image from added noise. Theoretically, this combination achieves multimodal generation in the most high-fidelity way. Preliminarily, it seems that a Transformer taking raw image patches as input can indeed train a successful image diffusion model. Thus, such a design mixing diffusion and text has the possibility of success. Of course, these are just some rough ideas about the multimodal path, most of which have not been verified through extensive practice. Please read with discretion~