Flow Models Series: RealNVP and Glow—The Inheritance and Refinement of Flow Models

By 苏剑林 | August 26, 2018

Opening Remarks

In the previous article, "Flow Models Series: NICE—Basic Concepts and Implementation", we introduced the pioneering work of flow models: the NICE model. From NICE, we learned the basic concepts and ideas of flow models, and finally, I provided a Keras implementation of NICE.

In this article, we will focus on the upgrades to NICE: RealNVP and Glow.


Sampling demonstration of the Glow model (captured from the Glow official blog)

The Ingenious Flow

It must be said that the flow model is a very ingeniously designed model. Overall, flow aims to find an encoder that encodes the input $\boldsymbol{x}$ into a latent variable $\boldsymbol{z}$ such that $\boldsymbol{z}$ follows a standard normal distribution. Thanks to the clever design of flow models, this encoder is invertible, allowing us to immediately write down the corresponding decoder (generator). Thus, once the encoder is trained, we simultaneously obtain the decoder, completing the construction of the generative model.

To achieve this, the model must not only be invertible but also have a Jacobian determinant that is easy to calculate. To this end, NICE proposed additive coupling layers. By stacking multiple additive coupling layers, the model possesses both powerful fitting capabilities and a unit Jacobian determinant. Thus, flow models emerged as a type of generative model distinct from VAEs and GANs, allowing us to directly fit the probability distribution itself through clever construction.

Space Left to Explore

NICE pointed flow models in a promising direction and completed some simple experiments, but it also left much unexplored. While the concept of flow models is ingenious, NICE's experiments were somewhat crude: it simply stacked fully connected layers and did not show how convolutional layers could be used. Although the paper reported several experiments, the only truly successful one was on MNIST, which is not very convincing.

Therefore, flow models needed further development to stand out in the field of generative models. That work was completed by its "successors," RealNVP and Glow, which made flow models shine and turned them into leaders in the generative modeling field.

RealNVP

In this section, we introduce the RealNVP model, an improvement upon NICE from the paper "Density estimation using Real NVP". It generalized the coupling layer and successfully introduced convolutional layers into flow models, making them better suited for image problems. Furthermore, it proposed a multi-scale design, which reduces computational cost and provides a strong regularization effect, improving generation quality. At this point, the general framework of flow models began to take shape.

The subsequent Glow model basically followed the RealNVP framework, merely modifying certain parts (such as introducing invertible 1x1 convolutions to replace permutation layers). However, it is worth mentioning that Glow simplified the structure of RealNVP, showing that some of RealNVP's more complex designs were unnecessary. Therefore, in this introduction, I will not strictly distinguish between them but rather highlight their main contributions.

Affine Coupling Layer

In fact, the first author of both NICE and RealNVP is Laurent Dinh, a PhD student of Bengio's. I greatly admire his pursuit and refinement of flow models. In the NICE paper he proposed additive coupling layers and mentioned multiplicative coupling layers without using them; in RealNVP, the additive and multiplicative coupling layers are combined into a general "Affine Coupling Layer":

$$\begin{aligned}&\boldsymbol{h}_{1} = \boldsymbol{x}_{1}\\ &\boldsymbol{h}_{2} = \boldsymbol{s}(\boldsymbol{x}_{1})\otimes\boldsymbol{x}_{2} + \boldsymbol{t}(\boldsymbol{x}_{1})\end{aligned}\tag{1}$$

Here $\boldsymbol{s}$ and $\boldsymbol{t}$ are vector functions of $\boldsymbol{x}_1$, and $\otimes$ denotes element-wise multiplication. Formally, the second equation is an affine transformation of $\boldsymbol{x}_2$, hence the name "Affine Coupling Layer."

The Jacobian matrix of the affine coupling layer remains a triangular matrix, but the diagonal is not all 1s. Represented as a block matrix:

$$\left[\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{x}}\right]=\begin{pmatrix}\mathbb{I}_d & \mathbb{O} \\ \frac{\partial \boldsymbol{s}}{\partial \boldsymbol{x}_1}\otimes \boldsymbol{x}_2+\frac{\partial \boldsymbol{t}}{\partial \boldsymbol{x}_1} & \text{diag}(\boldsymbol{s})\end{pmatrix}\tag{2}$$

Its determinant is therefore simply the product of the elements of $\boldsymbol{s}$. To ensure invertibility, we generally constrain the elements of $\boldsymbol{s}$ to be greater than zero, so in practice we let a neural network output $\log \boldsymbol{s}$ and take $\boldsymbol{s} = e^{\log \boldsymbol{s}}$, which is automatically positive.

Note: The name RealNVP comes from the affine layer. Its full name is "real-valued non-volume preserving." Compared to the additive coupling layer where the determinant is 1, RealNVP's Jacobian determinant is no longer identical to 1. We know the geometric meaning of a determinant is volume (refer to "New Understanding of Matrices 5: Volume = Determinant"). A determinant equal to 1 means no change in volume, while an affine coupling layer with a determinant not equal to 1 means the volume changes, hence "non-volume preserving."
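To make the bookkeeping concrete, here is a minimal NumPy sketch of a single affine coupling step and its inverse. The helpers `s_net` and `t_net` are stand-ins for the neural networks producing the log-scale and shift; they are assumptions for illustration, not code from the papers.

```python
import numpy as np

def affine_coupling_forward(x1, x2, s_net, t_net):
    """Eq. (1): h1 = x1, h2 = s(x1) * x2 + t(x1), with s = exp(log s) > 0."""
    log_s, t = s_net(x1), t_net(x1)
    h1, h2 = x1, np.exp(log_s) * x2 + t
    log_det = np.sum(log_s)          # log|det J| = sum of log s, cf. Eq. (2)
    return h1, h2, log_det

def affine_coupling_inverse(h1, h2, s_net, t_net):
    """Invert the coupling: x1 = h1, x2 = (h2 - t(h1)) / s(h1)."""
    log_s, t = s_net(h1), t_net(h1)
    return h1, (h2 - t) * np.exp(-log_s)

# quick sanity check with toy "networks"
rng = np.random.default_rng(0)
W_s, W_t = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
s_net, t_net = (lambda v: np.tanh(v @ W_s)), (lambda v: v @ W_t)
x1, x2 = rng.normal(size=3), rng.normal(size=3)
h1, h2, _ = affine_coupling_forward(x1, x2, s_net, t_net)
assert np.allclose(affine_coupling_inverse(h1, h2, s_net, t_net)[1], x2)
```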

Randomly Shuffling Dimensions

In NICE, the author mixed information between the two halves by simply alternating (interleaving) them, which is theoretically equivalent to reversing the original vector, as shown below (the diagram has been redrawn using the affine coupling layer of this article):

NICE information mixing

RealNVP found that shuffling the vector randomly makes the information mix more thoroughly, ultimately resulting in a lower loss, as shown here:

RealNVP random shuffle

This random shuffling refers to concatenating the two vectors $\boldsymbol{h}_1, \boldsymbol{h}_2$ output by each flow step into one vector $\boldsymbol{h}$, and then randomly reordering this vector.
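A fixed random permutation of the concatenated vector, together with its inverse, takes only a couple of lines; the sketch below assumes the feature axis is the last one.

```python
import numpy as np

def make_shuffle(dim, seed=0):
    """A fixed random permutation of the feature axis and its inverse.
    As a pure permutation it has |det| = 1, so it adds nothing to the loss."""
    perm = np.random.default_rng(seed).permutation(dim)
    inv_perm = np.argsort(perm)
    return (lambda h: h[..., perm]), (lambda z: z[..., inv_perm])

shuffle, unshuffle = make_shuffle(6)
h = np.arange(6.0)
assert np.allclose(unshuffle(shuffle(h)), h)
```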

Introducing Convolutional Layers

RealNVP provided a scheme for reasonably using convolutional neural networks in flow models, allowing better handling of image problems, reducing parameter counts, and fully utilizing parallel performance.

Note that applying convolutions is not always reasonable. The prerequisite for using convolutions is that the input has local correlation (in spatial dimensions). Images themselves have local correlation because adjacent pixels are related. However, note the two operations in flow: 1. Splitting the input into two parts $\boldsymbol{x}_1, \boldsymbol{x}_2$, then feeding them into the coupling layer, where models $\boldsymbol{s}, \boldsymbol{t}$ essentially only process $\boldsymbol{x}_1$; 2. Randomly shuffling the dimensions of the features before they enter the coupling layer. Both operations can destroy local correlation.

To continue using convolutions, we must find a way to preserve spatial local correlation. We know an image has three axes: height, width, and channel. The first two are spatial dimensions and clearly have local correlation. Therefore, the "channel" axis is the only one we can manipulate. RealNVP stipulated that splitting and shuffling operations are performed only along the "channel" axis. That is, after splitting the input along channels into $\boldsymbol{x}_1, \boldsymbol{x}_2$, $\boldsymbol{x}_1$ still maintains local correlation. Likewise, shuffling the channels keeps the spatial correlation intact, allowing the use of convolutions in $\boldsymbol{s}, \boldsymbol{t}$.

Channel split

Checkerboard split

Note: In RealNVP, the operation of splitting the input into two is called a "mask" because it is equivalent to using 0/1 to distinguish original inputs. Besides the channel-axis split mask mentioned above, RealNVP also introduced an interleaving spatial mask, as shown on the right side of the image above, called a "checkerboard mask." This specific split also preserves local correlation. The original paper alternated between these two masks, but since the checkerboard mask is complex and offers no significant improvement, it was discarded in Glow.

But wait, images usually have only three channels, and grayscale images like MNIST have only one. How do you split them in half along the channel axis, let alone shuffle them? To solve this, RealNVP introduced an operation called "squeeze" to increase the number of channels. The idea is simple: reshape the image, but do it locally. Specifically, if the original image is $h \times w \times c$, divide it along the spatial dimensions into $2 \times 2 \times c$ blocks and reshape each block into $1 \times 1 \times 4c$, giving a tensor of size $h/2 \times w/2 \times 4c$.

Squeeze operation

With squeeze, we can increase channel dimensions while preserving local correlation, making everything we discussed earlier feasible. Squeeze thus became a staple operation for flow models in image applications.
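Here is what squeeze looks like as a NumPy sketch for a single $h \times w \times c$ image. The exact ordering of the $4c$ output channels in the official implementations may differ, but the round trip below is exact.

```python
import numpy as np

def squeeze(x, factor=2):
    """Turn an (h, w, c) image into (h/factor, w/factor, c*factor**2) by
    moving each factor x factor spatial block into the channel axis."""
    h, w, c = x.shape
    x = x.reshape(h // factor, factor, w // factor, factor, c)
    x = x.transpose(0, 2, 1, 3, 4)                  # (h/2, w/2, 2, 2, c)
    return x.reshape(h // factor, w // factor, factor * factor * c)

def unsqueeze(x, factor=2):
    """Inverse of squeeze."""
    h, w, c4 = x.shape
    c = c4 // (factor * factor)
    x = x.reshape(h, w, factor, factor, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h * factor, w * factor, c)

img = np.random.default_rng(0).normal(size=(32, 32, 3))
assert np.allclose(unsqueeze(squeeze(img)), img)
```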

Multi-scale Structure

Multi-scale structure in RealNVP

Besides convolutions, another important advancement in RealNVP was the multi-scale structure. Like convolutions, this is a strategy that reduces model complexity while improving results.

The multi-scale structure is not complicated. As shown in the diagram, the original input goes through a first "flow operation" (a composite of multiple affine coupling layers). The output size matches the input. At this stage, the output is split into two halves along the channel axis, $\boldsymbol{z}_1$ and $\boldsymbol{z}_2$. $\boldsymbol{z}_1$ is output directly, while only $\boldsymbol{z}_2$ is passed to the next flow operation, and so on. For instance, the final output might consist of $[\boldsymbol{z}_1, \boldsymbol{z}_3, \boldsymbol{z}_5]$, where the total size is the same as the input.

This structure has a "fractal" feel, and the paper notes it was inspired by VGG. Each multi-scale step halves the amount of data passed on to the next flow operation, which is a significant saving. However, there is a crucial detail not mentioned in either the RealNVP or Glow papers, which I understood only after reading the source code: how should the prior distribution of the final output $[\boldsymbol{z}_1, \boldsymbol{z}_3, \boldsymbol{z}_5]$ be chosen? Do we simply assume a standard normal distribution?

In fact, as outputs at different levels, $\boldsymbol{z}_1, \boldsymbol{z}_3, \boldsymbol{z}_5$ are not on equal footing. Directly assuming a shared standard normal distribution would forcibly equate them, which is unreasonable. A better approach is to use the conditional probability formula:

$$p(\boldsymbol{z}_1, \boldsymbol{z}_3, \boldsymbol{z}_5)=p(\boldsymbol{z}_1|\boldsymbol{z}_3, \boldsymbol{z}_5)p(\boldsymbol{z}_3|\boldsymbol{z}_5)p(\boldsymbol{z}_5)\tag{3}$$

Since $\boldsymbol{z}_3$ and $\boldsymbol{z}_5$ are entirely determined by $\boldsymbol{z}_2$, and $\boldsymbol{z}_5$ is entirely determined by $\boldsymbol{z}_4$, the conditions can be modified to:

$$p(\boldsymbol{z}_1, \boldsymbol{z}_3, \boldsymbol{z}_5)=p(\boldsymbol{z}_1|\boldsymbol{z}_2)p(\boldsymbol{z}_3|\boldsymbol{z}_4)p(\boldsymbol{z}_5)\tag{4}$$

RealNVP and Glow assume the distributions on the right are all normal distributions, where the mean and variance of $p(\boldsymbol{z}_1|\boldsymbol{z}_2)$ are calculated from $\boldsymbol{z}_2$ (e.g., via convolution, similar to VAEs), the mean and variance of $p(\boldsymbol{z}_3|\boldsymbol{z}_4)$ from $\boldsymbol{z}_4$, and those of $p(\boldsymbol{z}_5)$ are learned directly.

This assumption is far more effective than simply assuming they are all standard normal. Equivalently, this prior assumption amounts to the following change of variables:

$$\hat{\boldsymbol{z}}_1=\frac{\boldsymbol{z}_1 - \boldsymbol{\mu}(\boldsymbol{z}_2)}{\boldsymbol{\sigma}(\boldsymbol{z}_2)},\quad \hat{\boldsymbol{z}}_3=\frac{\boldsymbol{z}_3 - \boldsymbol{\mu}(\boldsymbol{z}_4)}{\boldsymbol{\sigma}(\boldsymbol{z}_4)},\quad \hat{\boldsymbol{z}}_5=\frac{\boldsymbol{z}_5 - \boldsymbol{\mu}}{\boldsymbol{\sigma}}\tag{5}$$

and then assuming $[\hat{\boldsymbol{z}}_1, \hat{\boldsymbol{z}}_3, \hat{\boldsymbol{z}}_5]$ follows a standard normal distribution. Like the scale transformation layer in NICE, these three transformations lead to a non-unit Jacobian determinant, requiring the addition of terms like $\sum_{i=1}^D \log \boldsymbol{\sigma}_i$ to the loss.
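As a sketch of how Eq. (5) enters the likelihood: scoring $\boldsymbol{z}_1$ under $\mathcal{N}(\boldsymbol{\mu}(\boldsymbol{z}_2), \boldsymbol{\sigma}(\boldsymbol{z}_2)^2)$ is the same as normalizing it and adding a $-\sum\log\boldsymbol{\sigma}$ term to the log-density. Here `prior_net` is an assumed helper (e.g. a small convolution) returning the mean and log-scale with the same shape as $\boldsymbol{z}_1$.

```python
import numpy as np

def split_prior_logp(z1, z2, prior_net):
    """Conditional prior at a multi-scale split, cf. Eqs. (4)-(5)."""
    mu, log_sigma = prior_net(z2)
    z1_hat = (z1 - mu) * np.exp(-log_sigma)          # Eq. (5)
    # log N(z1; mu, sigma^2) = log N(z1_hat; 0, I) - sum(log sigma)
    logp = -0.5 * np.sum(z1_hat ** 2 + np.log(2 * np.pi)) - np.sum(log_sigma)
    return z1_hat, logp
```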

At first glance, the multi-scale structure seems to be just for reducing computation, but it's more than that. Due to the invertibility of flow models, the input and output dimensions are the same, leading to a severe dimension-wasting problem. This often requires complex networks. The multi-scale structure abandons the direct assumption that $p(\boldsymbol{z})$ is a standard normal distribution in favor of a composite conditional distribution. While total dimensions remain the same, levels are no longer equal. The model can suppress dimension wasting by controlling variances (in extreme cases, if variance is zero, the Gaussian collapses to a Dirac delta, reducing dimension by 1). Conditional distributions offer more flexibility than independent ones. From a loss perspective, the multi-scale structure provides a powerful regularization term (akin to skip connections in deep classifiers).

Glow

Overall, the Glow model introduces invertible 1x1 convolutions to replace the channel shuffling in RealNVP, while simplifying and standardizing the architecture, making it easier to understand and use.

Glow paper: https://papers.cool/arxiv/1807.03039
Glow blog: https://blog.openai.com/glow/
Glow code: https://github.com/openai/glow

Invertible 1x1 Convolution

This section introduces the core improvement of Glow: the invertible 1x1 convolution.

Permutation Matrix

Invertible 1x1 convolutions stem from generalizing permutation operations. In flow models, reordering dimensions is a key step—NICE used simple reversal, and RealNVP used random shuffling. Both correspond to vector permutations.

Permutation operations can be described by matrix multiplication. For example, if we swap elements 1-2 and 3-4 of the vector $[1, 2, 3, 4]$ to get $[2, 1, 4, 3]$, it can be written as:

$$\begin{pmatrix}2 \\ 1 \\ 4 \\ 3\end{pmatrix} = \begin{pmatrix}0 & 1 & 0 & 0\\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0\end{pmatrix} \begin{pmatrix}1 \\ 2 \\ 3 \\ 4\end{pmatrix}\tag{6}$$

where the first term on the right is a "permutation matrix" obtained by swapping rows or columns of the identity matrix.

Generalized Permutation

Naturally, one might ask: why not replace the permutation matrix with a general trainable parameter matrix? The result is the invertible 1x1 convolution.

Recall that flow transformations must be invertible and have an easily calculated Jacobian determinant. If we directly write:

$$\boldsymbol{h}=\boldsymbol{x}\boldsymbol{W}\tag{7}$$

it is just a standard fully connected layer without bias, which does not guarantee these conditions. First, we set $\boldsymbol{h}$ and $\boldsymbol{x}$ to have the same dimensions, making $\boldsymbol{W}$ a square matrix. Second, since it is a linear transformation, its Jacobian is $\left[\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{x}} \right]=\boldsymbol{W}$, and its determinant is $\det \boldsymbol{W}$. Thus, $-\log |\det \boldsymbol{W}|$ must be added to the loss. Finally, we initialize $\boldsymbol{W}$ as a random orthogonal matrix to ensure invertibility.
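A minimal sketch of this "generalized permutation" as a plain linear flow, under the assumptions just listed: $\boldsymbol{W}$ is square, initialized as a random orthogonal matrix, and its log-determinant (computed here with `slogdet`) is the term that enters the loss.

```python
import numpy as np

def linear_flow_forward(x, W):
    """h = x W (Eq. (7)). -log|det W| is added to the loss (once per spatial
    position when used as a 1x1 convolution). slogdet costs O(n^3) per call,
    which is the cost problem the LU trick below addresses."""
    sign, log_abs_det = np.linalg.slogdet(W)
    return x @ W, log_abs_det

def linear_flow_inverse(h, W):
    return h @ np.linalg.inv(W)

# random orthogonal initialization keeps W invertible at the start
n = 4
W = np.linalg.qr(np.random.default_rng(0).normal(size=(n, n)))[0]
```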

Utilizing LU Decomposition

The above is a basic solution, but computing the determinant directly is expensive ($O(n^3)$) and numerically fragile. Glow offers a clever solution: apply LU decomposition in reverse. Specifically, any matrix can be decomposed as:

$$\boldsymbol{W}=\boldsymbol{P}\boldsymbol{L}\boldsymbol{U}\tag{8}$$

where $\boldsymbol{P}$ is a permutation matrix; $\boldsymbol{L}$ is a lower triangular matrix with 1s on the diagonal; and $\boldsymbol{U}$ is an upper triangular matrix. Calculating the determinant of such a matrix is easy:

$$\log |\det \boldsymbol{W}| = \sum \log|\text{diag}(\boldsymbol{U})|\tag{9}$$

the sum of the logs of the absolute values of the diagonal elements of $\boldsymbol{U}$. Instead of calculating the decomposition, why not parameterize $\boldsymbol{W}$ directly in the form of equation (8)? This keeps multiplication costs the same but drastically simplifies the determinant calculation. Glow initializes a random orthogonal matrix, performs LU decomposition to get $\boldsymbol{P}, \boldsymbol{L}, \boldsymbol{U}$, fixes $\boldsymbol{P}$ and the signs of the diagonal of $\boldsymbol{U}$, ensures $\boldsymbol{L}$ is lower triangular with 1s on the diagonal, and $\boldsymbol{U}$ is upper triangular, then optimizes the remaining parameters in $\boldsymbol{L}$ and $\boldsymbol{U}$.
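Under those assumptions, a sketch of the LU parameterization: initialize with a random orthogonal matrix, decompose it once with SciPy's `lu`, then keep $\boldsymbol{P}$ and the signs of $\text{diag}(\boldsymbol{U})$ fixed while the remaining entries of $\boldsymbol{L}$ and $\boldsymbol{U}$ are trained. This is an illustration of the parameterization only, not the official implementation.

```python
import numpy as np
from scipy.linalg import lu

def init_lu(n, seed=0):
    """Initialize W as a random orthogonal matrix and store it as P, L, U
    (Eq. (8)); P and the signs of diag(U) stay fixed, the rest is trainable."""
    W0 = np.linalg.qr(np.random.default_rng(seed).normal(size=(n, n)))[0]
    P, L, U = lu(W0)                        # SciPy's LU: W0 = P @ L @ U
    sign_u = np.sign(np.diag(U))            # fixed
    log_u = np.log(np.abs(np.diag(U)))      # trainable
    return P, L, U, sign_u, log_u

def build_w(P, L, U, sign_u, log_u):
    """Rebuild W = P L U and its cheap log-determinant (Eq. (9))."""
    n = len(log_u)
    L = np.tril(L, -1) + np.eye(n)                       # unit lower-triangular
    U = np.triu(U, 1) + np.diag(sign_u * np.exp(log_u))  # upper-triangular
    W = P @ L @ U
    log_det = np.sum(log_u)                 # = sum(log|diag(U)|), O(n)
    return W, log_det
```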

Results Analysis

The description above applies to fully connected layers. For images, the same transformation is applied to the channel vector at every spatial position, which makes it exactly a 1x1 convolution; this is where the name comes from. I personally feel the name "invertible 1x1 convolution" is slightly misleading: it is essentially a weight-shared, invertible fully connected layer, and is not limited to images.

Glow performance comparison

Glow's paper shows that compared to reversing, shuffling reached a lower loss, and invertible 1x1 convolutions reached even lower. My own experiments confirmed this.

However, a lower loss does not immediately translate into better samples. For example, if Model A uses shuffling and reaches a loss of -50,000 after 200 epochs, while Model B uses invertible 1x1 convolutions and reaches -55,000 after 150 epochs, Model B's samples at that point may still look worse than Model A's. The invertible 1x1 convolution only guarantees that, once both models are trained to their optimum, Model B will be the better one. Moreover, in my simple experiments, the number of epochs required for the 1x1 convolution to saturate seemed much higher than for simple shuffling.

Actnorm

RealNVP used Batch Normalization (BN), while Glow introduced "Actnorm" to replace it. However, Actnorm is essentially a generalization of the scale transformation in NICE, the affine transformation in Eq (5):

$$\hat{\boldsymbol{z}}=\frac{\boldsymbol{z} - \boldsymbol{\mu}}{\boldsymbol{\sigma}}\tag{10}$$

where $\boldsymbol{\mu}, \boldsymbol{\sigma}$ are trainable parameters. Glow proposed initializing $\boldsymbol{\mu}, \boldsymbol{\sigma}$ using the mean and variance of the first batch, but the provided source code actually used zero initialization (for the log-scale).
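A minimal NumPy sketch of Actnorm with the data-dependent initialization described in the paper (per-channel mean and standard deviation of the first batch); as noted above, the released code may initialize differently.

```python
import numpy as np

class ActNorm:
    """Per-channel affine transform z_hat = (z - mu) / sigma, cf. Eq. (10)."""
    def __init__(self, num_channels):
        self.mu = np.zeros(num_channels)
        self.log_sigma = np.zeros(num_channels)
        self.initialized = False

    def forward(self, z):                      # z: (batch, h, w, c)
        if not self.initialized:               # data-dependent initialization
            self.mu = z.mean(axis=(0, 1, 2))
            self.log_sigma = np.log(z.std(axis=(0, 1, 2)) + 1e-6)
            self.initialized = True
        z_hat = (z - self.mu) * np.exp(-self.log_sigma)
        # log|det J| per sample: -sum(log sigma) over channels, times h*w
        log_det = -np.sum(self.log_sigma) * z.shape[1] * z.shape[2]
        return z_hat, log_det
```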

In this regard, OpenAI deserves some criticism for rebranding old concepts. However, the layer itself is effective. With Actnorm, the scaling in the affine coupling layer becomes less critical. Adding scaling to additive coupling layers makes them "affine," doubling computation. Since the performance gain of affine over additive is small (especially with Actnorm), additive coupling is often used for large models to save resources. For example, Glow's high-def face generator (256x256) used only additive coupling.

Source Code Analysis

Glow's structure is standard. Let's break it down to provide a reference for building similar models, based on my reading of the source code.

Overall Diagram

Overall, Glow isn't complex. Noise is added to the input, fed to an encoder, and the "mean sum of squares" of the output is used as the loss (with Jacobian terms as regularization). Note: the loss is not Mean Squared Error (MSE), but simply the sum of squares of $\boldsymbol{z}$, without subtracting the input.

Glow overall diagram
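In code, this "sum of squares" loss is just the negative log-likelihood of $\boldsymbol{z}$ under a standard normal prior, minus the accumulated log-determinant terms. A sketch, with the constants contributed by the dequantization noise omitted:

```python
import numpy as np

def flow_nll(z, total_log_det):
    """Negative log-likelihood per sample: 0.5*||z||^2 plus a constant, minus
    the log-determinants accumulated from Actnorm, the permutations/1x1
    convolutions and the coupling layers."""
    z = z.reshape(len(z), -1)
    D = z.shape[1]
    log_prior = -0.5 * np.sum(z ** 2, axis=1) - 0.5 * D * np.log(2 * np.pi)
    return np.mean(-(log_prior + total_log_det))
```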

encoder

The encoder is composed of $L$ modules, named `revnet` in the code. Each module processes the input, splits the output into two halves, passes one half to the next module, and outputs the other—this is the multi-scale structure. Glow defaults to $L=3$, while 256x256 faces use $L=6$.

Encoder flowchart
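Structurally, the encoder loop looks roughly like the following sketch. It operates on a single $h \times w \times c$ array for simplicity, reuses the `squeeze` sketch from earlier, and leaves out the conditional-prior scoring performed at each split; `revnets` is assumed to be a list of $L$ callables that each return (output, log_det).

```python
import numpy as np

def encode(x, revnets):
    """Multi-scale encoder skeleton: squeeze, run one revnet module, split
    along channels, emit one half and pass the other half on."""
    zs, total_log_det = [], 0.0
    h = x
    for i, revnet in enumerate(revnets):          # L modules
        h = squeeze(h)                            # defined in the sketch above
        h, log_det = revnet(h)
        total_log_det += log_det
        if i < len(revnets) - 1:                  # no split after the last module
            z_out, h = np.split(h, 2, axis=-1)
            zs.append(z_out)
        else:
            zs.append(h)
    return zs, total_log_det
```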

revnet

Each `revnet` module is a stack of basic flow steps (Actnorm, axis shuffling or an invertible 1x1 convolution, splitting, and a coupling layer), repeated $K$ times (the "depth"). In Glow, the default is $K=32$. Actnorm and the affine coupling layer contribute non-unit Jacobian determinants to the loss.

Revnet structure
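A revnet module is therefore just $K$ repetitions of the same three invertible layers with their log-determinants accumulated; a structural sketch, where every layer is assumed to return (output, log_det):

```python
def revnet(h, steps):
    """One revnet module: K flow steps, each Actnorm -> permutation
    (shuffle or invertible 1x1 convolution) -> affine coupling."""
    total_log_det = 0.0
    for actnorm, permute, coupling in steps:      # len(steps) == K
        for layer in (actnorm, permute, coupling):
            h, log_det = layer(h)
            total_log_det += log_det
    return h, total_log_det
```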

split2d

In Glow, `split2d` isn't just a split; it includes a transformation of the split part, which is the conditional prior choice mentioned earlier.

Split2d logic

Comparing Eq (5) and Eq (10), the only difference between conditional prior and Actnorm is the source of the scaling/shifting parameters. Actnorm optimizes them directly, while the prior calculates them from another part via a model. This is essentially "Conditional Actnorm."

f

Finally, the coupling layer model (the $\boldsymbol{s}, \boldsymbol{t}$ functions), named `f` in the code, uses three ReLU convolutional layers:

Coupling layer model f

The last layer is zero-initialized, so the initial state is an identity transformation, facilitating the training of deep networks.
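As a Keras sketch of this structure: two ReLU convolutions followed by an output convolution whose weights start at zero, so the coupling layer begins as an identity map. The kernel sizes and width here are assumptions for illustration, not necessarily the official defaults.

```python
from keras import layers

def build_f(num_out_channels, hidden=512):
    """The coupling-layer network f (computes log s and t from one half)."""
    def f(x):
        h = layers.Conv2D(hidden, 3, padding='same', activation='relu')(x)
        h = layers.Conv2D(hidden, 1, padding='same', activation='relu')(h)
        # zero-initialized output so the coupling starts as the identity
        return layers.Conv2D(num_out_channels, 3, padding='same',
                             kernel_initializer='zeros',
                             bias_initializer='zeros')(h)
    return f
```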

Reproduction

RealNVP laid nearly all the groundwork, while Glow refined it and added small modifications like 1x1 convolutions. Regardless, it is a model worth studying.

Keras Version

The official Glow is in TensorFlow. I've implemented a Keras version here:
https://github.com/bojone/flow/blob/master/glow.py

Currently only supports the TensorFlow backend (tested with Keras 2.1.5 + TF 1.2 and Keras 2.2.0 + TF 1.8 on Python 2.7).

Effect Tests

When I first read about Glow, I was excited. After actually working with it, I found that Glow does open up a new world, but not one that ordinary people can easily conquer.

Consider these two issues from Glow's GitHub:

"How many epochs will be take when training celeba?"
The samples we show in the paper are after about 4,000 training epochs...

"anyone reproduced the celeba-HQ results in the paper"
Yes we trained with 40 GPU's for about a week, but samples did start to look good after a couple of days...

Generating high-def 256x256 faces requires 4,000 epochs with 40 GPUs for a week. That implies one year on a single GPU... (Dead).

Alright, I'll give up on that. Let's try 32x32 faces for a demo.

Glow 32x32 face generation

Glow CIFAR10 generation

Not bad. I used $L=3, K=6$, taking ~70s per epoch on a GTX 1070 for 150 epochs. My "epoch" is 32,000 random samples. I also tried CIFAR-10 for 700 epochs; it looks okay from afar, but not up close.

Generating CIFAR-10 is harder than faces regardless of the model. I tried 64x64 faces with $L=3, K=10$ for 200 epochs (each taking 6 minutes). The result...

Glow 64x64 face generation

They are faces, but they look like demons (the model is not deep enough and was not trained for enough epochs). Lowering the "temperature" parameter to 0.8 yielded better results:

Glow 64x64 face temp 0.8

Still a bit distorted, but much better.

Tough Ending

The introduction to RealNVP and Glow concludes here. I've walked through three flow models across two articles. I hope this helps readers.

Overall, flow models like Glow are elegant, but they are computationally expensive and slow to train compared to GANs. Flow models have a long road ahead before they can fully challenge GANs in the generative modeling domain.