Small Flow Series: TARFLOW: Flow Models Returning at Full Power?

By 苏剑林 | January 17, 2025

I wonder if any readers still remember this series? This series, titled "Small Flow" (细水长flow), primarily introduces work related to flow models. It began back in 2018 when OpenAI released a new flow model, Glow, which was quite stunning at a time when GANs were the mainstream. But as impressive as it was, for a long time, Glow and its subsequent improvements couldn't match the generation quality of GANs, let alone the currently dominant diffusion models.

However, the tide may be turning. Last month's paper "Normalizing Flows are Capable Generative Models" introduced a new flow model called TARFLOW. It approaches current SOTA results across almost all generative tasks, representing a "full power" return for flow models.

A Few Words Up Front

The flow models discussed here specifically refer to Normalizing Flows, which are characterized by reversible architectures, training via maximum likelihood, and the ability to achieve one-step generation. The Flow Matching branch of diffusion models is not included in this category.

Since the debut of Glow, progress in flow models has been somewhat "unremarkable." Simply put, it was difficult to generate CelebA faces without obvious artifacts, let alone more complex datasets like ImageNet. Consequently, the "Small Flow" series stopped in 2019 with "Small Flow: Reversible ResNet: The Brutal Beauty of the Extreme." However, the emergence of TARFLOW proves that flow models "still have fight in them." This time, its generation style looks like this:

[Figure: TARFLOW generation results]

By contrast, Glow's generation style looked like this:

[Figure: Glow generation results]

Glow only demonstrated relatively simple face generation, yet the flaws were already apparent, not to mention more complex natural image generation. From this, it's clear that TARFLOW's progress is more than just a small step. Quantitatively, its performance approaches the best of various models, even surpassing the SOTA GAN representative, BigGAN:

[Figure: Quantitative comparison of TARFLOW with other models]

Keep in mind that flow models are inherently one-step generators and, unlike GANs, require no adversarial training; they are trained with a single loss function from start to finish. In some ways, their training is even simpler than that of diffusion models. So the fact that TARFLOW has lifted flow models to this level means it combines the advantages of GANs and diffusion models while retaining the unique strengths of flows (invertibility, log-likelihood evaluation, etc.).

Model Review

Returning to the main topic, let's examine what "magic pill" TARFLOW used to revitalize flow models. Before that, we'll briefly review the theoretical foundations of flow models. For a more detailed historical trace, you can refer to "Small Flow: NICE: Basic Concepts and Implementation of Flow Models" and "Small Flow: RealNVP and Glow: Inheritance and Sublimation of Flow Models."

Ultimately, both flow models and GANs aim to obtain a deterministic function $\boldsymbol{x}=\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z})$ that maps random noise $\boldsymbol{z}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$ to an image $\boldsymbol{x}$ from the target distribution. In probabilistic language, this means modeling the target distribution in the following form:

\begin{equation}q_{\boldsymbol{\theta}}(\boldsymbol{x}) = \int \delta(\boldsymbol{x} - \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}))q(\boldsymbol{z})d\boldsymbol{z}\label{eq:q-int}\end{equation}

where $q(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{0},\boldsymbol{I})$ and $\delta(\cdot)$ is the Dirac delta function. The ideal way to train a probabilistic model is maximum likelihood, i.e. using $-\log q_{\boldsymbol{\theta}}(\boldsymbol{x})$ as the loss function. However, as written, $q_{\boldsymbol{\theta}}(\boldsymbol{x})$ contains an integral that is purely formal and cannot be evaluated in general, so it cannot be used directly for training.

This is where flow models and GANs "part ways": GANs essentially use another model (the discriminator) to approximate $-\log q_{\boldsymbol{\theta}}(\boldsymbol{x})$, leading to alternating training. Flow models, however, design an appropriate $\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z})$ such that the integral $\eqref{eq:q-int}$ can be calculated directly. What conditions are required to calculate the integral $\eqref{eq:q-int}$? Let $\boldsymbol{y} = \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z})$, with its inverse function being $\boldsymbol{z} = \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{y})$. Then:

\begin{equation}d\boldsymbol{z} = \left\|\det \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{y}}\right\|d\boldsymbol{y} = \left\|\det \frac{\partial \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{y})}{\partial \boldsymbol{y}}\right\|d\boldsymbol{y}\end{equation}

and

\begin{equation}\begin{aligned} q_{\boldsymbol{\theta}}(\boldsymbol{x}) =&\, \int \delta(\boldsymbol{x} - \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}))q(\boldsymbol{z})d\boldsymbol{z} \\ =&\, \int \delta(\boldsymbol{x} - \boldsymbol{y})q(\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{y}))\left\|\det \frac{\partial \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{y})}{\partial \boldsymbol{y}}\right\|d\boldsymbol{y} \\ =&\, q(\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}))\left\|\det \frac{\partial \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})}{\partial \boldsymbol{x}}\right\| \end{aligned}\end{equation}

Therefore:

\begin{equation}-\log q_{\boldsymbol{\theta}}(\boldsymbol{x}) = - \log q(\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})) - \log \left\|\det \frac{\partial \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})}{\partial \boldsymbol{x}}\right\|\end{equation}

This indicates that calculating the integral $\eqref{eq:q-int}$ requires two conditions: first, knowing the inverse function $\boldsymbol{z} = \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})$ of $\boldsymbol{x} = \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z})$; second, needing to calculate the determinant of the Jacobian matrix $\frac{\partial \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})}{\partial \boldsymbol{x}}$.
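
To make this concrete, here is a minimal sketch, in PyTorch (an assumption on my part; any autodiff framework works), of how the maximum-likelihood loss is assembled once an invertible map and its log-determinant are available. The `flow` object and its return signature are hypothetical placeholders, not the interface of any particular library.

```python
import math
import torch

def flow_nll(flow, x):
    """Negative log-likelihood via the change-of-variables formula.

    `flow(x)` is assumed to return z = f_theta(x) together with
    log|det(df_theta/dx)| summed over all components, one value per sample.
    """
    z, logdet = flow(x)                          # z has the same shape as x; logdet: (batch,)
    d = z[0].numel()                             # dimensionality of one sample
    # log q(z) for the standard normal prior, summed over components
    log_prior = -0.5 * (z.pow(2).flatten(1).sum(dim=1) + d * math.log(2 * math.pi))
    # -log q_theta(x) = -log q(f_theta(x)) - log|det df_theta/dx|
    return -(log_prior + logdet).mean()
```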

Affine Coupling

To this end, flow models introduced a crucial design—the "Affine Coupling Layer":

\begin{equation}\begin{aligned}&\boldsymbol{h}_1 = \boldsymbol{x}_1\\ &\boldsymbol{h}_2 = \exp(\boldsymbol{\gamma}(\boldsymbol{x}_1))\otimes\boldsymbol{x}_2 + \boldsymbol{\beta}(\boldsymbol{x}_1)\end{aligned}\label{eq:couple}\end{equation}

where $\boldsymbol{x} = [\boldsymbol{x}_1, \boldsymbol{x}_2]$, and $\boldsymbol{\gamma}(\boldsymbol{x}_1), \boldsymbol{\beta}(\boldsymbol{x}_1)$ are models with $\boldsymbol{x}_1$ as input and an output shape identical to $\boldsymbol{x}_2$, and $\otimes$ is the Hadamard product. The equation says that $\boldsymbol{x}$ is split into two parts (the split is arbitrary and doesn't have to be equal); one part is output unchanged, while the other is transformed according to a specific rule. Note that the affine coupling layer is reversible, and its inverse is:

\begin{equation}\begin{aligned}&\boldsymbol{x}_1 = \boldsymbol{h}_1\\ &\boldsymbol{x}_2 = \exp(-\boldsymbol{\gamma}(\boldsymbol{h}_1))\otimes(\boldsymbol{h}_2 - \boldsymbol{\beta}(\boldsymbol{h}_1))\end{aligned}\end{equation}

This satisfies the first condition of reversibility. On the other hand, the Jacobian matrix of the affine coupling layer is lower triangular:

\begin{equation}\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{x}} = \begin{pmatrix}\frac{\partial \boldsymbol{h}_1}{\partial \boldsymbol{x}_1} & \frac{\partial \boldsymbol{h}_1}{\partial \boldsymbol{x}_2} \\ \frac{\partial \boldsymbol{h}_2}{\partial \boldsymbol{x}_1} & \frac{\partial \boldsymbol{h}_2}{\partial \boldsymbol{x}_2}\end{pmatrix}=\begin{pmatrix}\boldsymbol{I} & \boldsymbol{O} \\ \frac{\partial (\exp(\boldsymbol{\gamma}(\boldsymbol{x}_1))\otimes\boldsymbol{x}_2 + \boldsymbol{\beta}(\boldsymbol{x}_1))}{\partial \boldsymbol{x}_1} & \text{diag}(\exp(\boldsymbol{\gamma}(\boldsymbol{x}_1)))\end{pmatrix}\end{equation}

The determinant of a triangular matrix is the product of its diagonal elements, so:

\begin{equation}\log\left\|\det\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{x}}\right\| = \sum_i \boldsymbol{\gamma}_i(\boldsymbol{x}_1)\end{equation}

That is, the log absolute determinant of the Jacobian is equal to the sum of the components of $\boldsymbol{\gamma}(\boldsymbol{x}_1)$. This satisfies the second condition of Jacobian determinant computability.

The affine coupling layer was first proposed in RealNVP. The NVP stands for "Non-Volume Preserving," a name contrasting the special case where $\boldsymbol{\gamma}(\boldsymbol{x}_1)$ is identically equal to zero. That special case is called the "Additive Coupling Layer," proposed in NICE, which is "Volume Preserving" because its Jacobian determinant is always 1 (the geometric meaning of the determinant is volume).

Note that if one simply stacks multiple affine coupling layers, $\boldsymbol{x}_1$ remains unchanged throughout, which is not what we want. We want to map the entire $\boldsymbol{x}$ to a standard normal distribution. To solve this, before applying each affine coupling layer, we must "shuffle" the input components in some way so that no component remains consistently unchanged. "Shuffling" operations correspond to permutation matrix transformations, for which the absolute determinant is always 1.
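
As a concrete illustration, below is a minimal sketch, assuming PyTorch, of a single affine coupling layer in the form of Equation $\eqref{eq:couple}$, preceded by a fixed random permutation playing the role of the "shuffle." The small MLP that produces $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ is a placeholder of my own choosing, not the architecture of any particular paper.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Shuffle + affine coupling: h1 = x1, h2 = exp(gamma(x1)) * x2 + beta(x1)."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.d1 = dim // 2                        # size of the untouched part x1
        self.net = nn.Sequential(                 # predicts [gamma, beta] from x1
            nn.Linear(self.d1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d1)),
        )
        perm = torch.randperm(dim)                # fixed random shuffle, |det| = 1
        self.register_buffer("perm", perm)
        self.register_buffer("inv_perm", torch.argsort(perm))

    def forward(self, x):                         # x -> (h, log|det dh/dx|)
        x = x[:, self.perm]
        x1, x2 = x[:, :self.d1], x[:, self.d1:]
        gamma, beta = self.net(x1).chunk(2, dim=1)
        h2 = torch.exp(gamma) * x2 + beta
        return torch.cat([x1, h2], dim=1), gamma.sum(dim=1)

    def inverse(self, h):                         # h -> x
        h1, h2 = h[:, :self.d1], h[:, self.d1:]
        gamma, beta = self.net(h1).chunk(2, dim=1)
        x = torch.cat([h1, torch.exp(-gamma) * (h2 - beta)], dim=1)
        return x[:, self.inv_perm]
```

Stacking several such layers and summing their log-determinants gives exactly the loss in the previous sketch.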

Core Improvements

Up to this point, we have covered only the basic content of flow models. Now we formally enter TARFLOW's contributions.

First, TARFLOW noticed that the affine coupling layer $\eqref{eq:couple}$ can be generalized to multi-block partitions, splitting $\boldsymbol{x}$ into $n$ parts $[\boldsymbol{x}_1,\boldsymbol{x}_2,\cdots,\boldsymbol{x}_n]$, and then following similar operational rules:

\begin{equation}\begin{aligned}&\boldsymbol{h}_1 = \boldsymbol{x}_1\\ &\boldsymbol{h}_k = \exp(\boldsymbol{\gamma}_k(\boldsymbol{x}_{< k}))\otimes\boldsymbol{x}_k + \boldsymbol{\beta}_k(\boldsymbol{x}_{< k})\end{aligned}\label{eq:couple-2}\end{equation}

where $k > 1$ and $\boldsymbol{x}_{< k}=[\boldsymbol{x}_1,\boldsymbol{x}_2,\cdots,\boldsymbol{x}_{k-1}]$. Its inverse operation is:

\begin{equation}\begin{aligned}&\boldsymbol{x}_1 = \boldsymbol{h}_1\\ &\boldsymbol{x}_k = \exp(-\boldsymbol{\gamma}_k(\boldsymbol{x}_{< k}))\otimes(\boldsymbol{h}_k - \boldsymbol{\beta}_k(\boldsymbol{x}_{< k}))\end{aligned}\label{eq:couple-2-inv}\end{equation}

Similarly, the log absolute determinant of the Jacobian for this generalized affine coupling layer is the sum of all components of $\boldsymbol{\gamma}_2(\boldsymbol{x}_{< 2}),\cdots,\boldsymbol{\gamma}_n(\boldsymbol{x}_{< n})$. Thus, both conditions required by flow models are met. An early form of this generalization was proposed in IAF in 2016, even earlier than Glow.
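
To show the structure of Equation $\eqref{eq:couple-2}$ in code, here is a toy sketch (again assuming PyTorch) with one small conditioner per block; in TARFLOW itself the $\boldsymbol{\gamma}_k, \boldsymbol{\beta}_k$ come from a single causal Transformer, so all blocks are produced in one parallel pass, but the explicit loop below makes the dependence on $\boldsymbol{x}_{< k}$ easy to see. All module names are illustrative.

```python
import torch
import torch.nn as nn

class BlockAutoregressiveCoupling(nn.Module):
    """Toy multi-block coupling: h_k = exp(gamma_k(x_<k)) * x_k + beta_k(x_<k)."""

    def __init__(self, num_blocks, block_dim, hidden=256):
        super().__init__()
        self.num_blocks = num_blocks
        # one conditioner per block k > 1, mapping x_<k to (gamma_k, beta_k)
        self.conditioners = nn.ModuleList([
            nn.Sequential(
                nn.Linear(k * block_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 2 * block_dim),
            )
            for k in range(1, num_blocks)
        ])

    def forward(self, x):                          # x: (batch, num_blocks, block_dim)
        h, logdet = [x[:, 0]], torch.zeros(x.shape[0], device=x.device)
        for k in range(1, self.num_blocks):
            context = x[:, :k].flatten(1)          # x_<k is fully known during training
            gamma, beta = self.conditioners[k - 1](context).chunk(2, dim=1)
            h.append(torch.exp(gamma) * x[:, k] + beta)
            logdet = logdet + gamma.sum(dim=1)     # accumulate log|det|
        return torch.stack(h, dim=1), logdet
```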

But why did later work rarely push further in this direction? The reasons are largely historical. In the early days, the dominant architecture in CV was the CNN, whose premise is that features exhibit local correlation. This led to partitions of $\boldsymbol{x}$ being considered mainly along the channel dimension: since every layer needs a shuffling step, partitioning along the spatial dimensions (height, width) would destroy local correlation after a random shuffle and make CNNs unusable. Splitting into many parts along the channel dimension, on the other hand, makes it hard for the channel features to interact efficiently.

However, in the Transformer era, the situation is completely different. The input to a Transformer is essentially an unordered set of vectors; in other words, it does not rely on local correlation. Therefore, with the Transformer as the primary architecture, we can choose to partition in the spatial dimension—this is Patchify. Furthermore, the form $\boldsymbol{h}_{k} = \cdots(\boldsymbol{x}_{< k})$ in Equation $\eqref{eq:couple-2}$ means this is a Causal model, which is also perfectly suited for efficient implementation with Transformers.
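
For readers unfamiliar with the term, Patchify simply means cutting the image into non-overlapping patches and flattening each one into a token, which then serves as one block $\boldsymbol{x}_k$ of the coupling layer. A minimal sketch, with an arbitrary patch size chosen purely for illustration:

```python
import torch

def patchify(images, patch=8):
    """Split (batch, C, H, W) images into a sequence of flattened patches.

    Each patch becomes one block x_k of the multi-block coupling layer.
    """
    b, c, h, w = images.shape
    assert h % patch == 0 and w % patch == 0
    x = images.reshape(b, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 1, 3, 5)               # (b, H/p, W/p, c, p, p)
    return x.reshape(b, -1, c * patch * patch)    # (b, num_patches, patch_dim)
```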

Beyond the formal fit, what is the intrinsic benefit of partitioning in the spatial dimension? This brings us back to the goal of flow models: $\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z})$ transforms noise into an image, and the inverse model $\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})$ transforms an image into noise. Noise is characterized by randomness—essentially being messy—while the prominent feature of images is local correlation. Thus, one of the keys to transforming images into noise is to break this local correlation. Directly applying Patchify in the spatial dimension, combined with the shuffling operation inherent to coupling layers, is undoubtedly the most efficient choice.

So, Equation $\eqref{eq:couple-2}$ and the Transformer are a "perfect match." This is the meaning of the "TAR" in TARFLOW (Transformer AutoRegressive Flow), and it is its core improvement.

Adding and Removing Noise

A commonly used training trick for flow models is adding noise—that is, adding a small amount of noise to images before feeding them into the model for training. Although we treat images as continuous vectors, they are actually stored in discrete formats. Adding noise (dequantization) can further smooth this discontinuity, making images closer to continuous vectors. The addition of noise also prevents the model from relying too heavily on specific details in the training data, thereby reducing the risk of overfitting.
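
As a minimal sketch (the noise level below is purely illustrative, not the paper's setting), noising a batch of 8-bit images before feeding them to the model might look like this:

```python
import torch

def add_training_noise(x_uint8, sigma=0.05):
    """Map 8-bit images to [0, 1] and add a small amount of Gaussian noise.

    The noise smooths the discrete pixel grid into a (more) continuous
    distribution before maximum-likelihood training; sigma is illustrative.
    """
    x = x_uint8.float() / 255.0
    return x + sigma * torch.randn_like(x)
```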

Adding noise is a basic operation for flow models and was not first proposed by TARFLOW. What TARFLOW adds is denoising. In theory, if a flow model is trained on images with added noise, its generated results will also contain noise. Previously, since flow-model generation wasn't great anyway, this bit of noise hardly mattered. But as TARFLOW pushed the capability of flow models upward, denoising became imperative; otherwise the residual noise itself would become a major factor limiting quality.

How do we denoise? Training another denoising model? There's no need. We already proved in "From Denoising Autoencoders to Generative Models" that if $q_{\boldsymbol{\theta}}(\boldsymbol{x})$ is the probability density function after training with noise $\mathcal{N}(\boldsymbol{0},\sigma^2 \boldsymbol{I})$, then:

\begin{equation}\boldsymbol{r}(\boldsymbol{x}) = \boldsymbol{x} + \sigma^2 \nabla_{\boldsymbol{x}} \log q_{\boldsymbol{\theta}}(\boldsymbol{x})\end{equation}

is the theoretically optimal solution for a denoising model. Thus, once we have $q_{\boldsymbol{\theta}}(\boldsymbol{x})$, we don't need to train a separate denoising model; we can compute the denoised result directly from the equation above. This is another advantage of flow models. Because of this denoising step, TARFLOW also switched the noise added to training images to Gaussian noise and appropriately increased its variance, which is also one reason for its better performance.

In summary, the complete sampling process for TARFLOW is:

\begin{equation}\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I}) ,\quad \boldsymbol{y} =\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}),\quad\boldsymbol{x} = \boldsymbol{y} + \sigma^2 \nabla_{\boldsymbol{y}} \log q_{\boldsymbol{\theta}}(\boldsymbol{y}) \end{equation}
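
Since the flow gives us $\log q_{\boldsymbol{\theta}}$ exactly, the score $\nabla_{\boldsymbol{y}} \log q_{\boldsymbol{\theta}}(\boldsymbol{y})$ needed for this denoising step can be obtained by automatic differentiation. A minimal sketch, assuming PyTorch and a hypothetical callable `flow_log_prob` that returns the per-sample log-likelihood (the negative of the loss sketched earlier, before averaging):

```python
import torch

def denoise(flow_log_prob, y, sigma):
    """One-step denoising: x = y + sigma^2 * grad_y log q_theta(y).

    `flow_log_prob(y)` is a hypothetical callable returning log q_theta(y)
    for each sample, so the score is available exactly through autograd.
    """
    y = y.detach().requires_grad_(True)
    log_prob = flow_log_prob(y).sum()             # samples are independent, so the
    (score,) = torch.autograd.grad(log_prob, y)   # gradient is per-sample anyway
    return (y + sigma ** 2 * score).detach()
```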

Extended Thoughts

At this point, the key changes of TARFLOW compared to previous flow models have been introduced. For other model details, readers should read the original paper themselves; if there's anything you don't understand, you can also refer to the official open-source code.

GitHub: https://github.com/apple/ml-tarflow

Next, I'll mainly discuss some of my thoughts on TARFLOW.

First, it must be pointed out that although TARFLOW reaches SOTA in quality, its sampling speed is actually not as fast as we might hope. The appendix of the original paper mentions that sampling 32 ImageNet64 images on an A100 takes about 2 minutes. Why is it so slow? If we look closely at the inverse of the coupling layer $\eqref{eq:couple-2-inv}$, we find that it is actually a nonlinear RNN! Nonlinear RNNs can only be computed serially, which is the root cause of the slowness.
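
To see the RNN-like structure explicitly, here is Equation $\eqref{eq:couple-2-inv}$ written out as the generation loop it really is, reusing the hypothetical conditioners from the earlier sketch: block $k$ cannot be recovered before blocks $1,\dots,k-1$, so the loop cannot be parallelized.

```python
import torch

def sequential_inverse(conditioners, h):
    """Recover x from h block by block: x_k = exp(-gamma_k(x_<k)) * (h_k - beta_k(x_<k)).

    `conditioners[k-1]` maps x_<k to (gamma_k, beta_k), as in the earlier sketch.
    The loop is inherently serial: block k needs the already recovered x_<k,
    exactly like a nonlinear RNN, which is why sampling is slow.
    """
    x = [h[:, 0]]                                    # x_1 = h_1
    for k in range(1, h.shape[1]):
        context = torch.stack(x, dim=1).flatten(1)   # x_<k
        gamma, beta = conditioners[k - 1](context).chunk(2, dim=1)
        x.append(torch.exp(-gamma) * (h[:, k] - beta))
    return torch.stack(x, dim=1)
```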

In other words, TARFLOW is essentially a "train fast, sample slow" model. Of course, if we were willing, we could also make it "train slow, sample fast"; in short, one side of the forward and inverse processes is inevitably slow. This is a weakness of multi-block affine coupling layers and the primary direction for improvement if TARFLOW is to be further generalized.

Secondly, the "AR" in TARFLOW might remind one of current mainstream autoregressive LLMs. Can they be integrated for multimodal generation? Honestly, it's difficult. Because the AR in TARFLOW is purely a requirement of the affine coupling layers, and there's a shuffle before the coupling layer, it's not truly a Causal model; rather, it is a thoroughly Bi-Directional model. Therefore, it's not well-suited for forced integration with textual AR.

Overall, if TARFLOW can further increase its sampling speed, it will be a very competitive pure vision generation model. This is because, beyond simple training and excellent results, the reversibility of flow models has another advantage: as mentioned in "The Reversible Residual Network: Backpropagation Without Storing Activations," backpropagation can be done without storing any activation values, and the cost of recomputation is much lower than in conventional models.

As for whether it has the potential to become a unified architecture for multimodal LLMs, all that can be said is that it's currently unclear.

Renaissance

Finally, let's talk about the "renaissance" of deep learning models.

In recent years, there have been many efforts to revisit and improve models that seemed outdated, combining them with the latest insights to produce new results. Besides TARFLOW's attempt to revitalize flow models, there is also "The GAN is dead; long live the GAN! A Modern GAN Baseline," which re-refines the various components of GANs and achieves similarly competitive results.

Earlier, there were works like "Improved Residual Networks for Image and Video Recognition" and "Revisiting ResNets: Improved Training and Scaling Strategies" that took ResNet to the next level, and even "RepVGG: Making VGG-style ConvNets Great Again" which brought back the VGG classic. Of course, work on SSMs and linear Attention cannot go unmentioned, representing the "renaissance" of RNNs.

I hope this flourishing "renaissance" tide becomes even more intense, as it allows us to gain a more comprehensive and accurate understanding of models.