NICE in "Steady" Flows: Basic Concepts and Implementation of Flow Models

By 苏剑林 | August 11, 2018

Preface: Ever since I saw the Glow model on Machine Heart (see "The Next GAN? OpenAI Proposes Invertible Generative Model Glow"), I have been thinking about it constantly. Machine learning models emerge one after another nowadays; I often follow new model trends, but few have moved me as much as the Glow model, giving me that "this is it" feeling. What's even more surprising is that this model, which produces such good results, was something I had never heard of before. I've re-read it several times over the past few days, and the more I read, the more interesting it becomes, feeling like it connects many of my previous ideas. Here is a summary of this stage.

Background

This article mainly introduces and implements "NICE: Non-linear Independent Components Estimation". This paper is one of the foundational works for the Glow model—it could be called the cornerstone of Glow.

Difficult Distributions

As is well known, the mainstream generative models today are VAEs and GANs. In fact, besides these two, there is another class: flow-based models (the concept of "flow" will be introduced later). Flow models have a history as long as VAEs and GANs, but they are far less well known. In my opinion, the reason is likely that flow models lack an intuitive explanation like the "counterfeiter vs. discriminator" analogy of GANs: flow is formulated entirely in mathematical terms, and since early results were not particularly good while the computational cost was high, it was hard for them to attract interest. However, OpenAI's impressive flow-based Glow model will likely encourage more people to invest in flow models.

High-definition faces generated by the Glow model

The essence of a generative model is the hope of using a probability model we know to fit given data samples. That is, we must write a distribution $q_{\boldsymbol{\theta}}(\boldsymbol{x})$ with parameters $\boldsymbol{\theta}$. However, our neural networks are "universal function approximators," not "universal distribution approximators." While they can theoretically fit any function, they cannot arbitrarily fit a probability distribution because a distribution must satisfy "non-negativity" and "normalization" requirements. Consequently, the only distributions we can directly write down are discrete distributions or continuous Gaussian distributions.

Of course, from a strict perspective, an image should be a discrete distribution because it consists of a finite number of pixels, and each pixel value is also discrete and finite. Therefore, it can be described by a discrete distribution. This line of thought resulted in models like PixelRNN, which we call "autoregressive flows." Their characteristic is that they cannot be parallelized, so the computational cost is extremely high. Thus, we prefer to use continuous distributions to describe images. Of course, images are just one scenario; in other contexts, we also have a lot of continuous data, making research into continuous distributions very necessary.

Each Showing Their Prowess

So the problem arises: for continuous models, we can really only write down Gaussian distributions. Furthermore, for ease of processing, we often only write down Gaussian distributions where each component is independent. This is clearly only a tiny fraction of the vast array of continuous distributions and is obviously insufficient. To solve this dilemma, we create more distributions through integration:

$$q(\boldsymbol{x})=\int q(\boldsymbol{z})q_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z}) d\boldsymbol{z} \tag{1}$$

Here $q(\boldsymbol{z})$ is generally a standard Gaussian distribution, and $q_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})$ can be any conditional Gaussian distribution or a Dirac distribution. Such an integral form can produce many complex distributions; theoretically, it can fit essentially any distribution.
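
As a minimal numpy illustration (the map `g` below is a toy choice of my own, not anything from the paper): when $q_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})$ is a Dirac distribution $\delta(\boldsymbol{x}-\boldsymbol{g}(\boldsymbol{z}))$, sampling from $q(\boldsymbol{x})$ reduces to pushing Gaussian noise through $\boldsymbol{g}$, and the resulting marginal is generally far from Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(z):
    # toy nonlinear map, chosen only for illustration
    return np.tanh(z) + 0.1 * z ** 3

z = rng.standard_normal(100_000)   # z ~ q(z), a standard Gaussian
x = g(z)                           # x ~ q(x): the pushforward of q(z) through g

# summary statistics of the induced (non-Gaussian) marginal q(x)
print(x.mean(), x.var())
```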

Now that the distribution form is established, we need to find the parameters $\boldsymbol{\theta}$. Usually, this involves maximum likelihood. Assuming the true data distribution is $\tilde{p}(\boldsymbol{x})$, we need to maximize the objective:

$$\mathbb{E}_{\boldsymbol{x}\sim \tilde{p}(\boldsymbol{x})} \big[\log q(\boldsymbol{x})\big] \tag{2}$$

However, since $q_{\boldsymbol{\theta}}(\boldsymbol{x})$ is given in integral form, whether this integral can even be computed explicitly is hard to say.

Various experts have "crossed the sea like the Eight Immortals, each showing their own prowess." Among them, VAE and GAN bypass this difficulty in different ways. VAE does not directly optimize objective (2); instead, it optimizes a lower bound of it, which makes it only an approximate model that cannot achieve truly perfect generation. GAN sidesteps the difficulty through alternating training, preserving the model's precision, which is why it achieves such good generation results. Even so, GANs are not flawless in every respect, so exploring other solutions is still worthwhile.

Directly Facing Probability Integrals

Flow models choose a "hard road": computing the integral directly.

Specifically, flow models choose $q(\boldsymbol{x}|\boldsymbol{z})$ to be a Dirac distribution $\delta(\boldsymbol{x}-\boldsymbol{g}(\boldsymbol{z}))$, and $\boldsymbol{g}(\boldsymbol{z})$ must be invertible. In other words:

$$\boldsymbol{x}=\boldsymbol{g}(\boldsymbol{z}) \Leftrightarrow \boldsymbol{z} = \boldsymbol{f}(\boldsymbol{x}) \tag{3}$$

To realize invertibility theoretically (mathematically), $\boldsymbol{z}$ and $\boldsymbol{x}$ must have the same dimension. Assuming the forms of $\boldsymbol{f}$ and $\boldsymbol{g}$ are known, calculating $q(\boldsymbol{x})$ via (1) is equivalent to performing an integral transformation $\boldsymbol{z}=\boldsymbol{f}(\boldsymbol{x})$ on $q(\boldsymbol{z})$. That is, originally:

$$q(\boldsymbol{z}) = \frac{1}{(2\pi)^{D/2}}\exp\left(-\frac{1}{2} \Vert \boldsymbol{z}\Vert^2\right) \tag{4}$$

is a standard Gaussian distribution ($D$ is the dimension of $\boldsymbol{z}$). Now we perform the change of variables $\boldsymbol{z}=\boldsymbol{f}(\boldsymbol{x})$. Note that the change of variables for a probability density function is not as simple as replacing $\boldsymbol{z}$ with $\boldsymbol{f}(\boldsymbol{x})$; it also involves the absolute value of the "Jacobian determinant":

$$q(\boldsymbol{x}) = \frac{1}{(2\pi)^{D/2}}\exp\left(-\frac{1}{2}\big\Vert \boldsymbol{f}(\boldsymbol{x})\big\Vert^2\right)\left|\det\left[\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}\right]\right| \tag{5}$$

Thus, we have two requirements for $\boldsymbol{f}$:

1. It must be invertible, and its inverse function must be easy to find (its inverse $\boldsymbol{g}$ is our desired generative model);

2. The corresponding Jacobian determinant must be easy to calculate.

In this way:

$$\log q(\boldsymbol{x}) = -\frac{D}{2}\log (2\pi) -\frac{1}{2}\big\Vert \boldsymbol{f}(\boldsymbol{x})\big\Vert^2 + \log \left|\det\left[\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}\right]\right| \tag{6}$$

This optimization target is solvable. Furthermore, because $\boldsymbol{f}$ is easy to invert, once training is complete, we can randomly sample a $\boldsymbol{z}$ and then generate a sample via the inverse of $\boldsymbol{f}$: $\boldsymbol{f}^{-1}(\boldsymbol{z})=\boldsymbol{g}(\boldsymbol{z})$. This gives us the generative model.
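
As a small numpy sketch of (6) (using a toy elementwise map, arcsinh, purely for illustration and not the transform used in NICE): because the map acts elementwise, its Jacobian is diagonal, and the log-determinant reduces to a sum of log-derivatives.

```python
import numpy as np

def f(x):
    return np.arcsinh(x)                       # invertible elementwise map

def f_prime(x):
    return 1.0 / np.sqrt(1.0 + x ** 2)         # derivative of arcsinh

def g(z):
    return np.sinh(z)                          # the inverse, used for generation

def log_q_x(x):
    # log-likelihood of the flow model, eq. (6), for an elementwise map
    D = x.shape[-1]
    z = f(x)
    log_det = np.sum(np.log(np.abs(f_prime(x))), axis=-1)
    return -0.5 * D * np.log(2 * np.pi) - 0.5 * np.sum(z ** 2, axis=-1) + log_det

x = np.random.default_rng(0).standard_normal((4, 3))
print(log_q_x(x))                              # per-sample log-density under the flow
# Generation: sample z ~ N(0, I), then x = g(z).
```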

Flow

We have previously introduced the characteristics and difficulties of flow models. Below, we will detail how the flow model addresses these difficulties. Since this article mainly introduces the work of the first paper, "NICE: Non-linear Independent Components Estimation", we will specifically refer to this model as NICE.

Coupling Layer

Relatively speaking, calculating the determinant is more difficult than inverting a function, so we start by thinking from "Requirement 2." Friends familiar with linear algebra will know that the determinant of a triangular matrix is the easiest to calculate: it equals the product of the diagonal elements. Therefore, we should try to make the Jacobian matrix of transformation $\boldsymbol{f}$ a triangular matrix. NICE's approach is very clever: it divides the $D$-dimensional $\boldsymbol{x}$ into two parts, $\boldsymbol{x}_1, \boldsymbol{x}_2$, and then takes the following transformation:

$$\begin{aligned}&\boldsymbol{h}_{1} = \boldsymbol{x}_{1}\\ &\boldsymbol{h}_{2} = \boldsymbol{x}_{2} + \boldsymbol{m}(\boldsymbol{x}_{1})\end{aligned} \tag{7}$$

Where $\boldsymbol{x}_1, \boldsymbol{x}_2$ is some partition of $\boldsymbol{x}$, and $\boldsymbol{m}$ is any function of $\boldsymbol{x}_1$. That is, $\boldsymbol{x}$ is split into two parts, and transformed according to the above formula to obtain new variables $\boldsymbol{h}$. We call this an "Additive Coupling Layer." Without loss of generality, the dimensions of $\boldsymbol{x}$ can be rearranged so that $\boldsymbol{x}_1 = \boldsymbol{x}_{1:d}$ represents the first $d$ elements, and $\boldsymbol{x}_2=\boldsymbol{x}_{d+1:D}$ represents elements $d+1 \sim D$.

It is not hard to see that the Jacobian matrix of this transformation $\left[\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{x}}\right]$ is a triangular matrix, and the diagonal elements are all 1. Represented as a block matrix:

$$\left[\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{x}}\right]=\begin{pmatrix}\mathbb{I}_{1:d} & \mathbb{O} \\ \left[\frac{\partial \boldsymbol{m}}{\partial \boldsymbol{x}_1}\right] & \mathbb{I}_{d+1:D}\end{pmatrix} \tag{8}$$

As a result, the Jacobian determinant of this transformation is 1 and its logarithm is 0, which solves the determinant-calculation problem.

At the same time, the transformation in (7) is invertible, and its inverse is:

$$\begin{aligned}&\boldsymbol{x}_{1} = \boldsymbol{h}_{1}\\ &\boldsymbol{x}_{2} = \boldsymbol{h}_{2} - \boldsymbol{m}(\boldsymbol{h}_{1})\end{aligned} \tag{9}$$
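
Here is a minimal numpy sketch of the additive coupling layer (7) and its inverse (9); the coupling function `m` is a hypothetical fixed random network standing in for the neural network NICE actually learns.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 3, 6
W, V = rng.standard_normal((d, 8)), rng.standard_normal((8, D - d))

def m(x1):
    # stand-in for an arbitrary coupling network acting on x1
    return np.tanh(x1 @ W) @ V

def coupling_forward(x):
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1, x2 + m(x1)], axis=-1)     # eq. (7)

def coupling_inverse(h):
    h1, h2 = h[..., :d], h[..., d:]
    return np.concatenate([h1, h2 - m(h1)], axis=-1)     # eq. (9)

x = rng.standard_normal((2, D))
assert np.allclose(coupling_inverse(coupling_forward(x)), x)   # exact inverse; log|det| = 0
```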

Steady Streams of Flow

The above transformation is quite striking: it is invertible, and the inverse is very simple, adding no extra computational cost. However, note that the first part of transformation (7) is trivial (an identity map), so a single such transformation cannot achieve very strong non-linearity. We therefore need to compose multiple simple transformations to obtain strong non-linearity and enhance the fitting capability.

$$\boldsymbol{x} = \boldsymbol{h}^{(0)} \leftrightarrow \boldsymbol{h}^{(1)} \leftrightarrow \boldsymbol{h}^{(2)} \leftrightarrow \dots \leftrightarrow \boldsymbol{h}^{(n-1)} \leftrightarrow \boldsymbol{h}^{(n)} = \boldsymbol{z} \tag{10}$$

where each transformation is an additive coupling layer. This is like water flowing: accumulating a lot from a little, a steady stream flowing far. Such a process is therefore called a "flow." In other words, a flow is the composition of multiple additive coupling layers.

By the chain rule:

$$\left[\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}}\right]=\left[\frac{\partial \boldsymbol{h}^{(n)}}{\partial \boldsymbol{h}^{(0)}}\right]=\left[\frac{\partial \boldsymbol{h}^{(n)}}{\partial \boldsymbol{h}^{(n-1)}}\right]\left[\frac{\partial \boldsymbol{h}^{(n-1)}}{\partial \boldsymbol{h}^{(n-2)}}\right]\dots \left[\frac{\partial \boldsymbol{h}^{(1)}}{\partial \boldsymbol{h}^{(0)}}\right] \tag{11}$$

Because "the determinant of a product of matrices equals the product of the matrices' determinants," and each layer is an additive coupling layer, the determinant of each layer is 1. Therefore, the result is:

$$\det \left[\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}}\right]=\det\left[\frac{\partial \boldsymbol{h}^{(n)}}{\partial \boldsymbol{h}^{(n-1)}}\right]\det\left[\frac{\partial \boldsymbol{h}^{(n-1)}}{\partial \boldsymbol{h}^{(n-2)}}\right]\dots \det\left[\frac{\partial \boldsymbol{h}^{(1)}}{\partial \boldsymbol{h}^{(0)}}\right]=1$$

(Taking into account the interleaving described below, the determinant may become -1, but its absolute value is still 1), so we still don't need to worry about the determinant.

Advancing Through Interleaving

Note that if the coupling always leaves the same part unchanged, i.e.,

$$\begin{array}{ll}\begin{aligned}&\boldsymbol{h}^{(1)}_{1} = \boldsymbol{x}_{1}\\ &\boldsymbol{h}^{(1)}_{2} = \boldsymbol{x}_{2} + \boldsymbol{m}_1(\boldsymbol{x}_{1})\end{aligned} & \begin{aligned}&\boldsymbol{h}^{(2)}_{1} = \boldsymbol{h}^{(1)}_{1}\\ &\boldsymbol{h}^{(2)}_{2} = \boldsymbol{h}^{(1)}_{2} + \boldsymbol{m}_2\big(\boldsymbol{h}^{(1)}_{1}\big)\end{aligned} & \\ & \\ \begin{aligned}&\boldsymbol{h}^{(3)}_{1} = \boldsymbol{h}^{(2)}_{1}\\ &\boldsymbol{h}^{(3)}_{2} = \boldsymbol{h}^{(2)}_{2} + \boldsymbol{m}_3\big(\boldsymbol{h}^{(2)}_{1}\big)\end{aligned} & \begin{aligned}&\boldsymbol{h}^{(4)}_{1} = \boldsymbol{h}^{(3)}_{1}\\ &\boldsymbol{h}^{(4)}_{2} = \boldsymbol{h}^{(3)}_{2} + \boldsymbol{m}_4\big(\boldsymbol{h}^{(3)}_{1}\big)\end{aligned} & \quad\dots \end{array} \tag{12}$$

Then ultimately $\boldsymbol{z}_1 = \boldsymbol{x}_1$, and the first part remains trivial, as shown below:

Simple coupling keeps part of the input identical; information is not fully mixed

To obtain non-trivial transformations, we can consider shuffling or reversing the order of input dimensions before each additive coupling, or simply swapping the positions of these two parts so that information can mix thoroughly, for example:

$$\begin{array}{ll}\begin{aligned}&\boldsymbol{h}^{(1)}_{1} = \boldsymbol{x}_{1}\\ &\boldsymbol{h}^{(1)}_{2} = \boldsymbol{x}_{2} + \boldsymbol{m}_1(\boldsymbol{x}_{1})\end{aligned} & \begin{aligned}&\boldsymbol{h}^{(2)}_{1} = \boldsymbol{h}^{(1)}_{1} + \boldsymbol{m}_2\big(\boldsymbol{h}^{(1)}_{2}\big)\\ &\boldsymbol{h}^{(2)}_{2} = \boldsymbol{h}^{(1)}_{2}\end{aligned} & \\ & \\ \begin{aligned}&\boldsymbol{h}^{(3)}_{1} = \boldsymbol{h}^{(2)}_{1}\\ &\boldsymbol{h}^{(3)}_{2} = \boldsymbol{h}^{(2)}_{2} + \boldsymbol{m}_3\big(\boldsymbol{h}^{(2)}_{1}\big)\end{aligned} & \begin{aligned}&\boldsymbol{h}^{(4)}_{1} = \boldsymbol{h}^{(3)}_{1} + \boldsymbol{m}_4\big(\boldsymbol{h}^{(3)}_{2}\big)\\ &\boldsymbol{h}^{(4)}_{2} = \boldsymbol{h}^{(3)}_{2} \end{aligned} & \quad\dots \end{array} \tag{13}$$

As shown below:

Mixing information through cross-coupling to achieve stronger nonlinearity
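
The pattern in (13) can be sketched in numpy as follows (the $\boldsymbol{m}_k$ are hypothetical stand-in networks, and the split uses $d = D - d$ so the same shapes serve both halves): each layer leaves one half unchanged and shifts the other, with the roles swapping from layer to layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, n_layers = 3, 6, 4
params = [(rng.standard_normal((d, 8)), rng.standard_normal((8, D - d)))
          for _ in range(n_layers)]

def m(x1, k):
    W, V = params[k]
    return np.tanh(x1 @ W) @ V

def forward(x):
    h = x
    for k in range(n_layers):
        h1, h2 = h[..., :d], h[..., d:]
        if k % 2 == 0:
            h = np.concatenate([h1, h2 + m(h1, k)], axis=-1)   # shift the second half
        else:
            h = np.concatenate([h1 + m(h2, k), h2], axis=-1)   # shift the first half
    return h

def inverse(z):
    h = z
    for k in reversed(range(n_layers)):
        h1, h2 = h[..., :d], h[..., d:]
        if k % 2 == 0:
            h = np.concatenate([h1, h2 - m(h1, k)], axis=-1)
        else:
            h = np.concatenate([h1 - m(h2, k), h2], axis=-1)
    return h

x = rng.standard_normal((2, D))
assert np.allclose(inverse(forward(x)), x)    # the whole flow is still exactly invertible
```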

Scaling Transformation Layer

In the first half of the article, we pointed out that flow is based on invertible transformations. Therefore, when the model is trained, we simultaneously obtain a generative model and an encoding model. However, because of the invertible transformation, the random variable $\boldsymbol{z}$ and the input sample $\boldsymbol{x}$ have the same size. When we specify $\boldsymbol{z}$ as a Gaussian distribution, it is spread across the entire $D$-dimensional space, where $D$ is the size of the input $\boldsymbol{x}$. But although $\boldsymbol{x}$ has $D$ dimensions, it might not truly fill the entire $D$-dimensional space. For example, an MNIST image has 784 pixels, but some pixels stay at 0 across both training and test sets, indicating the data doesn't truly occupy all 784 dimensions.

In other words, flow, a model based on invertible transformations, inherently faces a serious problem of dimensional wastage: input data clearly doesn't reside on a $D$-dimensional manifold, yet it must be encoded as a $D$-dimensional manifold. Is this feasible?

To address this, NICE introduces a scaling transformation layer. It applies a scale transformation to each dimension of the encoded features, namely $\boldsymbol{z} = \boldsymbol{s}\otimes \boldsymbol{h}^{(n)}$, where $\boldsymbol{s} = (\boldsymbol{s}_1,\boldsymbol{s}_2,\dots,\boldsymbol{s}_D)$ is a vector of parameters to be optimized (each element is non-negative). This vector $\boldsymbol{s}$ can identify the importance of each dimension (the smaller it is, the more important; the larger it is, the less important the dimension, becoming negligible as it increases), serving to compress the manifold. Note that the Jacobian determinant of this scaling layer is no longer 1. The Jacobian matrix is diagonal:

$$\left[\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{h}^{(n)}}\right] = \text{diag}\, (\boldsymbol{s}) \tag{14}$$

So its determinant is $\prod_i \boldsymbol{s}_i$. Thus, according to equation (6), we have the log-likelihood:

$$\log q(\boldsymbol{x}) \sim -\frac{1}{2}\big\Vert \boldsymbol{s}\otimes \boldsymbol{f} (\boldsymbol{x})\big\Vert^2 + \sum_i \log \boldsymbol{s}_i \tag{15}$$

Why can this scaling transformation identify the importance of features? In fact, this scaling layer can be described in a clearer way: initially, we set the prior distribution of $\boldsymbol{z}$ to be a standard normal distribution, meaning all variances are 1. Actually, we can also treat the variance of the prior distribution as a training parameter. This way, after training, variances will vary. A smaller variance indicates smaller "dispersion" for that feature. If the variance is 0, the feature is always 0 (equal to the mean 0), and the distribution for that dimension collapses to a point. This means the manifold's dimensionality is reduced by one.

Unlike equation (4), we now write the normal distribution with its variances made explicit:

$$q(\boldsymbol{z}) = \frac{1}{(2\pi)^{D/2}\prod\limits_{i=1}^D \boldsymbol{\sigma}_i}\exp\left(-\frac{1}{2}\sum_{i=1}^D \frac{\boldsymbol{z}_i^2}{\boldsymbol{\sigma}_i^2}\right) \tag{16}$$

Substituting the flow model $\boldsymbol{z}=\boldsymbol{f}(\boldsymbol{x})$ into the equation above and taking the logarithm, similar to equation (6), we get:

$$\log q(\boldsymbol{x}) \sim -\frac{1}{2}\sum_{i=1}^D \frac{\boldsymbol{f}_i^2(\boldsymbol{x})}{\boldsymbol{\sigma}_i^2} - \sum_{i=1}^D \log \boldsymbol{\sigma}_i \tag{17}$$

Comparing with equation (15), we have $\boldsymbol{s}_i=1/\boldsymbol{\sigma}_i$. Thus, the scaling transformation layer is equivalent to making the prior distribution's variance (standard deviation) a training parameter. If the variance is small enough, we can consider the manifold represented by that dimension as collapsed to a point, reducing the overall dimensionality of the manifold and implying the possibility of dimensionality reduction.
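
A quick numerical check of this equivalence (a numpy sketch with arbitrary made-up values for $\boldsymbol{f}(\boldsymbol{x})$ and $\boldsymbol{\sigma}$): substituting $\boldsymbol{s}_i = 1/\boldsymbol{\sigma}_i$ makes (15) and (17) agree exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
f_x = rng.standard_normal(D)             # stand-in for the encoder output f(x)
sigma = rng.uniform(0.5, 2.0, D)         # prior standard deviations
s = 1.0 / sigma                          # the corresponding scale vector

ll_scale = -0.5 * np.sum((s * f_x) ** 2) + np.sum(np.log(s))              # eq. (15)
ll_prior = -0.5 * np.sum(f_x ** 2 / sigma ** 2) - np.sum(np.log(sigma))   # eq. (17)
assert np.allclose(ll_scale, ll_prior)
```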

Feature Decoupling

When we choose a prior distribution with independent Gaussian components, besides the convenience of sampling, what other benefits are there?

In a flow model, $\boldsymbol{f}^{-1}$ is the generative model for sampling, while $\boldsymbol{f}$ is the encoder. But unlike autoencoders in traditional neural networks that "force low-dimensional reconstruction of high-dimensional data to extract effective information," flow models are completely invertible, so there's no information loss. Then what is the value of this encoder?

This relates to the question: "What are good features?" In real life, we often abstract dimensions to describe things, such as "height," "weight," "beauty," "wealth," etc. The characteristic of these dimensions is: "When we say someone is tall, they aren't necessarily fat or thin, nor are they necessarily rich or poor." In other words, there are few necessary connections between these features. Otherwise, these features would be redundant. Thus, for good features, ideally, each dimension should be independent of the others, achieving feature decoupling so that each dimension has its own independent meaning.

Thus, we can understand the advantage of "the prior distribution being a Gaussian distribution with independent components." Due to the independence of the components, we have reason to say that when we use $\boldsymbol{f}$ to encode original features, the various dimensions of the output encoded features $\boldsymbol{z}=\boldsymbol{f}(\boldsymbol{x})$ are decoupled. The full name of NICE is Non-linear Independent Components Estimation, which carries this meaning. Conversely, due to the independence of each dimension of $\boldsymbol{z}$, theoretically, when we change a single dimension, we can see how the generated image changes with that dimension, thereby discovering the meaning of that dimension.

Similarly, we can perform interpolation (weighted average) on the encodings of two images to obtain a naturally transitioning generated sample, which is fully embodied in the later-developed Glow model. However, since we only performed MNIST experiments, we won't specifically demonstrate this point here.

Experiments

Here we reproduce the MNIST experiment from the NICE paper using Keras.

Model Details

Let's summarize the various parts of the NICE model. NICE is a flow model composed of multiple additive coupling layers, each of the form (7) with inverse (9). Before each coupling, the order of the input dimensions is reversed so that the information mixes thoroughly. The final layer is a scaling transformation layer, and the loss is the negative of equation (15).

The additive coupling layer needs to split the input into two parts. NICE uses an interleaved partition: the even-indexed dimensions form the first part and the odd-indexed dimensions the second. Each $\boldsymbol{m}(\boldsymbol{x})$ is implemented simply as a multi-layer fully connected network (5 hidden layers, each with 1000 nodes, ReLU activation). In NICE, 4 additive coupling layers are composed in total.
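
To make the architecture concrete, here is a hedged Keras sketch of one such coupling network $\boldsymbol{m}$: an MLP with 5 hidden layers of 1000 ReLU units. The 392-dimensional input/output split (half of 784) and the ReLU output layer are assumptions consistent with the description here and with the discussion of the reference code below; see the linked nice.py for the actual implementation.

```python
from keras.layers import Input, Dense
from keras.models import Model

def build_m(in_dim=392, out_dim=392, hidden_dim=1000, n_hidden=5):
    x_in = Input(shape=(in_dim,))
    h = x_in
    for _ in range(n_hidden):
        h = Dense(hidden_dim, activation='relu')(h)
    # ReLU output, so the shift added to the other half is non-negative
    h = Dense(out_dim, activation='relu')(h)
    return Model(x_in, h)

# four such networks, one per additive coupling layer
m_networks = [build_m() for _ in range(4)]
```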

For the input, we compress the original 0-255 pixel values to the range 0-1 (dividing by 255) and then add uniform noise drawn from $[-0.01, 0]$. Adding noise effectively prevents overfitting and improves the quality of the generated images. It can also be seen as a way to mitigate the dimensional-wastage problem: the MNIST images themselves cannot fill all 784 dimensions, whereas adding noise spreads the data over more dimensions.

Readers might wonder: why is the noise interval $[-0.01, 0]$ rather than $[0, 0.01]$ or $[-0.005, 0.005]$? In fact, from the perspective of the loss, these choices are all much the same (including replacing the uniform distribution with a Gaussian). However, once noise is added, the generated images will in theory also carry noise, which is not what we want. Using negative noise biases the final generated pixels slightly towards negative values, so a clip operation can remove part of the noise. This is just a small (and not particularly critical) trick specific to MNIST.
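
As a minimal sketch (function names are my own, not from the reference code), the preprocessing and the corresponding clip at generation time look like this:

```python
import numpy as np

def preprocess(imgs, rng=np.random.default_rng(0)):
    # scale pixels to [0, 1], then add uniform noise from [-0.01, 0]
    x = imgs.astype('float32') / 255.0
    return x + rng.uniform(-0.01, 0.0, size=x.shape)

def postprocess(generated):
    # clipping back to [0, 1] strips most of the injected negative offset
    return np.clip(generated, 0.0, 1.0)
```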

Reference Code

Here is my reference implementation using Keras:
https://github.com/bojone/flow/blob/master/nice.py
In my experiments, it reaches optimal performance within 20 epochs, with one epoch taking 11s (GTX1070 environment). The final loss is approximately -2200.

Compared to the original paper's implementation, some changes were made here. For the additive coupling layer, I used equation (9) for the forward pass and (7) for its inverse. Since $\boldsymbol{m}(\boldsymbol{x})$ uses ReLU activation, and we know ReLU is non-negative, there's a difference between these two choices. Because the forward pass is the encoder and the inverse pass is the generator, choosing (7) as the inverse makes the generative model more likely to produce positive numbers, which aligns with the image we want to generate, as we need pixels in the 0-1 range.

Digit samples generated by the NICE model (trained without noise)
Digit samples generated by the NICE model (trained with negative noise)

Annealing Parameters

Although we ultimately hope to sample random numbers from a standard normal distribution to generate samples, in reality, for a trained model, the ideal sampling variance isn't necessarily 1, but fluctuates around 1—generally a bit smaller than 1. The standard deviation of the final sampled normal distribution is called the annealing parameter. For example, in the reference implementation above, we chose 0.75 as the annealing parameter, which visually yields the best generation quality.
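
As a sketch, sampling with the annealing parameter simply means scaling standard-normal noise before decoding; the `decoder` handle standing in for the inverse flow $\boldsymbol{g}=\boldsymbol{f}^{-1}$ is assumed to have been built elsewhere.

```python
import numpy as np

T = 0.75                                   # annealing parameter: sampling standard deviation
rng = np.random.default_rng(0)
z = T * rng.standard_normal((16, 784))     # 16 latent vectors for MNIST-sized outputs
# imgs = decoder.predict(z)                # decoder: the inverse flow g = f^{-1}, built elsewhere
```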

Summary

The NICE model is quite huge. According to the above model, the number of parameters is approximately $4 \times 5 \times 1000^2 = 2 \times 10^7$. Twenty million parameters just to train an MNIST generative model is quite exaggerated~

NICE as a whole is quite simple and brute-force. First, additive coupling itself is relatively simple. Second, the $\boldsymbol{m}$ part of the model simply uses huge fully connected layers without integrating tricks like convolution. Therefore, there is still much room for exploration. Real NVP and Glow are two improved versions, and we will talk about their stories later.