By 苏剑林 | October 31, 2023
Just as "XXX is all you need" has become a cliché, many papers are titled "An Embarrassingly Simple XXX." In my view, most of these papers are more gimmick than substance. However, I recently read a paper that truly makes one exclaim, "That is embarrassingly simple!" The paper is titled "Finite Scalar Quantization: VQ-VAE Made Simple." As the name suggests, this work aims to simplify VQ-VAE using FSQ (Finite Scalar Quantization). With the growing popularity of generative models and multimodal LLMs, VQ-VAE and its successors have risen in prominence as "Tokenizers for images." However, the training of VQ-VAE itself presents several challenges. The FSQ paper claims that through a simpler "rounding" operation, one can achieve the same goals with better results, faster convergence, and more stable training. Is FSQ really that magical? Let's dive in and learn about it.
First, let's understand "VQ." VQ stands for "Vector Quantize," the technique of mapping continuous encoding vectors from an infinite space onto a finite, discrete set of integers. If we apply VQ to the bottleneck layer of an autoencoder, we can compress the input size while turning the encoding result into a discrete sequence of integers.
Assuming the reconstruction loss of the autoencoder is satisfactory, this integer sequence becomes an equivalent representation of the original image. All operations on the original image can then be transformed into operations on the integer sequence. For example, if we want to train an image generation model, we only need to train a model for generating integer sequences. Since this is equivalent to text generation, we can use it to train a GPT—the model and the workflow are identical to text. Once training is complete, we can sample integer sequences from the GPT model and feed them into the decoder to obtain images, thus completing the construction of the image generation model. Simply put, "VQ + Autoencoder" transforms any input into an integer sequence consistent with text, unifying the input forms of different data modalities as well as their processing and generation models.
An autoencoder with such a VQ capability is known as a "VQ-VAE."
As early as four years ago, in the article "A Brief Introduction to VQ-VAE: Quantized Autoencoders," we introduced VQ-VAE. Despite being named a "VAE (Variational AutoEncoder)," it actually has little to do with VAEs; as mentioned in the previous section, it is merely an AE (AutoEncoder) with VQ functionality.
Since it is an AE, it has an encoder and a decoder. A standard AE looks like this:
\begin{equation}z = encoder(x),\quad \hat{x}=decoder(z),\quad \mathcal{L}=\Vert x - \hat{x}\Vert^2 \end{equation}VQ-VAE is slightly more complex:
\begin{equation}\begin{aligned} z =&\, encoder(x)\\[5pt] z_q =&\, z + \text{sg}[e_k - z],\quad k = \mathop{\text{argmin}}_{i\in\{1,2,\cdots,K\}} \Vert z - e_i\Vert\\ \hat{x} =&\, decoder(z_q)\\[5pt] \mathcal{L} =&\, \Vert x - \hat{x}\Vert^2 + \beta\Vert e_k - \text{sg}[z]\Vert^2 + \gamma\Vert z - \text{sg}[e_k]\Vert^2 \end{aligned}\label{eq:vqvae}\end{equation}Let's explain this step by step. First, the input $x$ is passed into the encoder to output an encoding vector $z$. However, instead of passing $z$ directly to the decoder, we maintain a codebook of encoding vectors $\{e_1, e_2, \cdots, e_K\}$ and select the $e_k$ that is closest to $z$ to be sent into the decoder for reconstruction of $x$. Since the codebook is finite, we can understand the actual encoding result as an integer (specifically, the index $k$ of the $e_k$ closest to $z$). This is the meaning of "VQ" in VQ-VAE.
In practice, to ensure reconstruction clarity, the encoder's output may consist of multiple vectors. Each vector undergoes the same quantization step to become an integer. Consequently, an image originally in a continuous real-valued space is encoded by the VQ-VAE into a sequence of integers. This is similar to the role of a text tokenizer, hence the term "Image Tokenizer."
However, because the $\mathop{\text{argmin}}$ operation appears in the forward computation flow, the gradient cannot be backpropagated to the encoder, meaning we cannot optimize the encoder. A common approach at this point is Gumbel Softmax, but its results are often sub-optimal. Therefore, the authors cleverly designed a better gradient for VQ-VAE using a "Straight-Through" estimator. One could say this is the most brilliant part of VQ-VAE; it tells us that while "Attention is all you need" is popular, "Gradient" is what we truly need!
Specifically, VQ-VAE utilizes the stop_gradient (represented as $\text{sg}$ in the formulas) function, which is built into most deep learning frameworks, to define custom gradients. Any input passing through $\text{sg}$ maintains the same output value, but its gradient is forced to zero. Thus, for $z_q$ in Equation $\eqref{eq:vqvae}$, we have:
\begin{equation}z_q = z + \text{sg}[e_k - z] = \begin{cases} e_k & (\text{forward pass: the value used})\\ z & (\text{backward pass: the quantity differentiated}) \end{cases}\label{eq:sg}\end{equation}In this way, the quantized $e_k$ is fed into the decoder, but when the optimizer calculates the gradient, it uses $z$. Since $z$ comes from the encoder, the encoder can now be optimized. This operation is called the "Straight-Through Estimator (STE)," a common technique for designing gradients for non-differentiable modules in neural networks.
As gradient-based optimizers remain the mainstream, designing the gradient directly is often more fundamental—and usually more difficult and impressive—than designing a loss function.
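To make the straight-through step concrete, here is a minimal PyTorch-style sketch of the quantization in Equation $\eqref{eq:vqvae}$ (the function and variable names are mine, not from any official implementation):

```python
import torch

def vq_straight_through(z, codebook):
    """Nearest-code quantization with a straight-through gradient.

    z:        (batch, d) encoder outputs
    codebook: (K, d) code vectors e_1, ..., e_K
    """
    dists = torch.cdist(z, codebook)   # (batch, K) pairwise distances
    k = dists.argmin(dim=1)            # index of the nearest code vector
    e_k = codebook[k]                  # quantized vectors, (batch, d)
    # z_q = z + sg[e_k - z]: the forward value is e_k, the gradient flows to z
    z_q = z + (e_k - z).detach()
    return z_q, k, e_k
```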
However, the story isn't over. There are two issues: 1. While the encoder now has a gradient, the codebook $e_1, e_2, \cdots, e_K$ has lost its gradient; 2. Although $\text{sg}$ lets us define gradients arbitrarily, not just any gradient will successfully optimize the model. From Equation $\eqref{eq:sg}$, we can see that for it to be strictly mathematically valid, the only solution is $e_k=z$. This tells us that if STE is reasonable, $e_k$ and $z$ must be at least close to each other. Therefore, to ensure the validity of the gradient and to optimize the codebook, we can add an auxiliary loss:
\begin{equation}\Vert e_k - z\Vert^2\label{eq:ez}\end{equation}This forces $e_k$ to be close to $z$ and allows $e_k$ to have a gradient—killing two birds with one stone! But on reflection, there is still a slight flaw: theoretically, the reconstruction loss from the encoder and decoder should be enough to optimize $z$, so the auxiliary term should primarily be used to optimize $e_k$ and should not significantly affect $z$. To address this, we use the $\text{sg}$ trick again. It is easy to prove that the gradient of Equation $\eqref{eq:ez}$ is equivalent to:
\begin{equation}\Vert e_k - \text{sg}[z]\Vert^2 + \Vert z - \text{sg}[e_k]\Vert^2\end{equation}The first term stops the gradient for $z$, leaving only the gradient for $e_k$. The second term does the opposite. Currently, the two terms are summed with a 1:1 weight, meaning they influence each other equally. As we just noted, this auxiliary loss should primarily optimize $e_k$, not $z$. Thus, we introduce $\beta > \gamma > 0$ and change the auxiliary loss to:
\begin{equation}\beta\Vert e_k - \text{sg}[z]\Vert^2 + \gamma\Vert z - \text{sg}[e_k]\Vert^2\label{eq:ez2}\end{equation}Adding this to the reconstruction loss gives us the total loss for VQ-VAE.
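Combining the reconstruction term with the two auxiliary terms, a hedged sketch of the total loss in Equation $\eqref{eq:vqvae}$ might look like this (the weights `beta` and `gamma` are illustrative placeholders, chosen only to satisfy $\beta > \gamma > 0$):

```python
import torch.nn.functional as F

def vqvae_loss(x, x_hat, z, e_k, beta=1.0, gamma=0.25):
    """Reconstruction + codebook + commitment terms of the VQ-VAE loss."""
    recon = F.mse_loss(x_hat, x)                 # reconstruction term
    codebook_term = F.mse_loss(e_k, z.detach())  # ||e_k - sg[z]||^2: moves e_k only
    commit_term = F.mse_loss(z, e_k.detach())    # ||z - sg[e_k]||^2: moves z only
    return recon + beta * codebook_term + gamma * commit_term
```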
Additionally, there is another scheme for optimizing $e_k$: first, set $\beta$ in Equation $\eqref{eq:ez2}$ to zero, so $e_k$ again has no gradient. Then, we observe that the VQ operation in VQ-VAE is quite similar to K-Means clustering, where $e_1, e_2, \cdots, e_K$ correspond to $K$ cluster centers. Based on our understanding of K-Means, the cluster center is the average of all vectors in that class. Thus, an alternative optimization for $e_k$ is the moving average of $z$:
\begin{equation}e_k^{(t)} = \alpha e_k^{(t-1)} + (1-\alpha) z \end{equation}This is equivalent to specifying the use of SGD to optimize the $\Vert e_k - \text{sg}[z]\Vert^2$ loss term (while other terms can use Adam, etc.). This approach was used in VQ-VAE-2.
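A minimal sketch of this moving-average update, assuming the codebook is modified in place under `torch.no_grad()` and omitting the per-code usage bookkeeping that practical implementations usually add:

```python
import torch

@torch.no_grad()
def ema_update_codebook(codebook, k, z, alpha=0.99):
    """Moving-average update of the code vectors selected in this batch."""
    for idx in k.unique():
        z_mean = z[k == idx].mean(dim=0)  # mean of the z assigned to code idx
        codebook[idx] = alpha * codebook[idx] + (1 - alpha) * z_mean
```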
Some readers might wonder: isn't the theme of this article FSQ? Haven't we spent a bit too much time on VQ-VAE? In fact, because FSQ truly deserves the description "embarrassingly simple," introducing it takes only a few lines compared to VQ-VAE. Writing VQ-VAE in detail allows everyone to better appreciate the simplicity of FSQ.
Strictly speaking, FSQ is simply a replacement for "VQ" in VQ-VAE. Its discretization logic is very simple: it's "rounding." First, assume we have a scalar $t \in \mathbb{R}$. We define:
\begin{equation}\text{FSQ}(t)\triangleq \text{Round}[(L-1)\sigma(t)] \end{equation}Here $L \in \mathbb{N}$ is a hyperparameter, and $\sigma(x)=1/(1+e^{-x})$ is the sigmoid function (the original paper used $\tanh$, but I believe sigmoid is more appropriate). $\text{Round}$ refers to rounding to the nearest integer. It is easy to see that $\text{FSQ}(t) \in \{0, 1, \cdots, L-1\}$, meaning the FSQ operation restricts the output to $L$ integers, achieving discretization. Of course, in most cases, one scalar is not enough. For $z \in \mathbb{R}^d$, FSQ can be applied to each dimension:
\begin{equation}\text{FSQ}(z) = \text{Round}[(L-1)\sigma(z)]\in\{0,1,\cdots,L-1\}^d \end{equation}This means a $d$-dimensional vector $z$ is discretized into one of $L^d$ possible integer vectors. However, note that the $\text{Round}$ operation also has no gradient (or rather, the gradient is zero). After the setup with VQ-VAE, some readers may have guessed what comes next: using the STE trick again:
\begin{equation}\text{FSQ}(z) = (L-1)\sigma(z) + \text{sg}\big[\text{Round}[(L-1)\sigma(z)] - (L-1)\sigma(z)\big] \end{equation}That is, backpropagation uses $(L-1)\sigma(z)$ (the value before rounding) to calculate gradients. Since the values before and after $\text{Round}$ are numerically close, FSQ does not require an auxiliary loss to force approximation, nor is there an additional codebook to update. The conciseness of FSQ is evident!
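The whole quantizer fits in a few lines. Here is a sketch of the sigmoid variant described above (the function name and the default $L=8$ are my own choices; the paper itself uses a tanh-based bounding function):

```python
import torch

def fsq(z, L=8):
    """Finite Scalar Quantization with a straight-through Round."""
    y = (L - 1) * torch.sigmoid(z)            # each dimension squashed into (0, L-1)
    return y + (torch.round(y) - y).detach()  # value: Round[y], gradient: y
```

The integer codes themselves are simply `torch.round(y).long()`, which is what a downstream sequence model such as a GPT would consume; allowing a different number of levels per dimension only requires replacing the scalar `L` with a vector $(L_1, \cdots, L_d)$.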
[Comparison between VQ and FSQ (from the original paper)]
If VQ is understood as clustering encoding vectors into $K$ distinct categories, then FSQ can be seen as inducing $d$ attributes from the encoding vectors, with each attribute divided into $L$ levels, directly expressing $L^d$ different integers. Generally speaking, the number of levels for each attribute could also be different ($L_1, L_2, \cdots, L_d$), resulting in $L_1 L_2 \cdots L_d$ possible combinations.
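As a purely illustrative example (not a configuration taken from the paper), a 4-dimensional code with levels $(L_1, L_2, L_3, L_4) = (8, 5, 5, 5)$ yields
\begin{equation}L_1 L_2 L_3 L_4 = 8 \times 5 \times 5 \times 5 = 1000\end{equation}
distinct integer vectors, i.e. an effective "codebook" of size 1000 from only four scalar dimensions.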
According to the original paper's recommendation, $L \geq 5$ (the earlier LFQ, Lookup-Free Quantization, effectively corresponds to $L=2$). To match the codebook size $K$ of VQ-VAE, FSQ must have $d = \log_L K$. This means FSQ constrains the dimension $d$ of the encoding vector (usually a single-digit number), which is typically much smaller than the encoding dimension of VQ-VAE (usually a three-digit number). A direct consequence is that when the total number of codes $K$ is small (and thus $d$ is very small), FSQ often performs worse than VQ:
[Performance difference between VQ and FSQ under different codebook sizes]
From the chart, we can see that when the codebook size is around 1000, the performance of FSQ and VQ is similar. When the codebook size significantly exceeds 1000, FSQ has the advantage; conversely, when the codebook size is significantly smaller than 1000, VQ performs better. This is consistent with my own experimental results. My reference code can be found at: https://github.com/bojone/fsq-pytorch.
The other experiments in the paper are standard demonstrations of FSQ's superiority over VQ across various tasks; readers can refer to the original paper for those details.
Formally, assuming $K=L^d$, VQ is like a "classifier with $L^d$ classes," while FSQ is like "$d$ scoring mechanisms with $L$ levels." Whether in terms of parameter count, geometric intuition, or representational capacity, FSQ is actually inferior to VQ. So why does FSQ have the chance to achieve better results than VQ? I believe there are two reasons.
The first reason is that the encoder and decoder are very powerful. Although FSQ itself is weaker, the encoder and decoder are strong enough that, thanks to the universal approximation capability of neural networks, FSQ's weakness relative to VQ can be fully compensated for within the encoder and decoder. Under the setting $K=L^d$, the degree of discretization is the same for both, meaning the "information bottleneck" between the encoder and decoder is the same. Therefore, the inherent shortcomings of FSQ become negligible.
The second reason is that VQ's "teammate" (the gradient) is too weak. A classic problem with VQ is codebook collapse: as the codebook size increases, the codebook is not fully utilized. Instead, due to unhealthy competition, the codebook vectors cluster together. A classic manifestation is that a codebook of 5000 entries performs worse than one with 500 entries. Ultimately, this is due to unreasonable gradients. Although VQ cleverly designs its gradients, for "hard assignment" operations like $\mathop{\text{argmin}}$, gradient-based optimization suffers from a "winner-take-all" problem, which is the root cause of collapse. FSQ's $\text{Round}$ operation does not involve assignment; it directly takes an approximate value. Broadly speaking, K-Means, which is similar to VQ, often suffers from cluster center collapse as well, showing that $\mathop{\text{argmin}}$ is a notoriously difficult problem to optimize. Therefore, rather than saying FSQ is strong, it is more that VQ's "teammate" is too weak.
From the two points above, it is clear that for FSQ to surpass VQ, not only must the codebook be large enough, but the encoder and decoder must also be complex enough. This is not always satisfied. For example, in some scenarios where we want every layer's output to be quantized, the encoder and decoder allotted to each quantized layer may not be complex enough, and the limitations of FSQ itself then become a bottleneck. Furthermore, while the dimension of the vector after VQ does not change and can be arbitrarily high, the vector before FSQ must be projected down to $d = \log_L K$ dimensions, a severe dimensionality reduction. When we need a good approximation of the high-dimensional vector before this projection, it is difficult to recover one from the low-dimensional FSQ output alone.
Therefore, while FSQ might effectively replace VQ as an "Image Tokenizer," it does not mean FSQ can replace VQ in every scenario.
This article introduced an extraordinarily simple replacement for the "VQ" in VQ-VAE—FSQ (Finite Scalar Quantization). It discretizes continuous vectors directly through rounding and requires no additional auxiliary loss. Experimental results show that when the codebook is large enough, FSQ has an advantage over VQ.