BiGAN-QP: A Simple and Clear Encoding & Generative Model

By 苏剑林 | December 10, 2018

Not long ago, by analyzing directly in the dual space, I proposed an adversarial model framework called GAN-QP. Its distinguishing feature is that it can be proven theoretically to neither suffer from vanishing gradients nor require a Lipschitz (L) constraint, which simplifies the construction and training of generative models.

GAN-QP is an adversarial framework, so in theory, all original GAN tasks can be attempted within it. In the previous article "A GAN that requires no L-constraint and does not suffer from vanishing gradients, want to know more?", we only tried the standard random generation task. In this article, we will experiment with a case that includes both a generator and an encoder: BiGAN-QP.

BiGAN and BiGAN-QP

Note that this is BiGAN, not the recently popular BigGAN. BiGAN stands for Bidirectional GAN, proposed in the paper "Adversarial feature learning". Around the same time, a very similar paper, "Adversarially Learned Inference", proposed a model called ALI that is essentially the same as BiGAN. Broadly speaking, both add an encoder to an ordinary GAN, so the model retains the random generation ability of a normal GAN while also gaining an encoder that can extract useful features. Applying the GAN-QP adversarial scheme to BiGAN gives us BiGAN-QP.

Without further ado, let's look at the effect (left is the original image, right is the reconstruction):

[Figure: BiGAN-QP reconstruction results (left: original, right: reconstruction)]

This is the result of compressing a 256x256x3 image down to 256 dimensions and then reconstructing it. As you can see, the overall reconstruction quality is good, without the blurriness of ordinary autoencoders. There is some loss of detail, and the result is somewhat worse than IntroVAE's, but that comes down to model architecture and hyperparameter tuning, which are not my strong suit. In any case, this result shows that BiGAN-QP is workable and performs decently.

The content of this article has been merged into the original GAN-QP paper (https://papers.cool/arxiv/1811.07296); readers can download the latest version from arXiv.

Simplified Derivation of BiGAN-QP

In fact, compared to GAN, the derivation of BiGAN is very simple: you only need to replace the original single input $x$ with the pair $(x, z)$. Likewise, if you already understand GAN-QP, what we call BiGAN-QP follows just as easily. Concretely, the original GAN-QP looked like this:

\begin{equation}\begin{aligned}&T= \mathop{\text{argmax}}_T\, \mathbb{E}_{(x_r,x_f)\sim p(x_r)q(x_f)}\left[T(x_r,x_f)-T(x_f,x_r) - \frac{(T(x_r,x_f)-T(x_f,x_r))^2}{2\lambda d(x_r,x_f)}\right] \\ &G = \mathop{\text{argmin}}_G\,\mathbb{E}_{(x_r,x_f)\sim p(x_r)q(x_f)}\left[T(x_r,x_f)-T(x_f,x_r)\right] \end{aligned}\end{equation}
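As a concrete reference, here is a minimal sketch of these two objectives (a sketch only, assuming TensorFlow/Keras, a two-input critic model `T` that outputs one score per pair, and the mean absolute pixel difference as one possible choice of $d$; `lam` stands for $\lambda$ and its value is illustrative):

```python
import tensorflow as tf

def gan_qp_losses(T, x_real, x_fake, lam=10.0):
    """Sketch of the GAN-QP losses for one batch; T scores ordered pairs."""
    t_rf = T([x_real, x_fake])          # T(x_r, x_f)
    t_fr = T([x_fake, x_real])          # T(x_f, x_r)
    delta = t_rf - t_fr                 # the score difference
    # d(x_r, x_f): mean absolute pixel difference, one per sample
    d = tf.reduce_mean(tf.abs(x_real - x_fake), axis=[1, 2, 3])
    d = tf.reshape(d, (-1, 1))          # match T's (batch, 1) output shape
    # T maximizes delta - delta^2 / (2*lam*d); return the negative to minimize
    t_loss = -tf.reduce_mean(delta - delta**2 / (2.0 * lam * d))
    # G simply minimizes delta
    g_loss = tf.reduce_mean(delta)
    return t_loss, g_loss
```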

Now it becomes:

\begin{equation}\begin{aligned}T&= \mathop{\text{argmax}}_T\, \mathbb{E}_{x\sim p(x), z\sim q(z)}\left[\Delta T - \frac{\Delta T^2}{2\lambda d\big(x,E(x);G(z),z\big)}\right] \\ G,E &= \mathop{\text{argmin}}_{G,E}\,\mathbb{E}_{x\sim p(x), z\sim q(z)}[\Delta T]\\ \Delta T &= T(x,E(x);G(z),z)-T(G(z),z;x,E(x)) \end{aligned}\end{equation}

Or a simplified version directly taking $\Delta T = T(x,E(x))-T(G(z),z)$. Theoretically, this works; this is BiGAN-QP.
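In code, the simplified $\Delta T$ is just a critic that scores (image, code) pairs (a hypothetical sketch; `T`, `E`, `G` are assumed to be Keras-style models):

```python
def bigan_qp_delta_t(T, E, G, x, z):
    """Sketch of the simplified ΔT = T(x, E(x)) - T(G(z), z)."""
    t_real = T([x, E(x)])       # real image paired with its inferred code
    t_fake = T([G(z), z])       # generated image paired with its input code
    return t_real - t_fake
```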

But in practice, it is very hard to learn a good bidirectional mapping this way, because it amounts to automatically searching for a bidirectional mapping among infinitely many candidates, which is quite difficult. Therefore, we also need some "guiding terms." We use two MSE errors as guiding terms:

\begin{equation}\begin{aligned}T&= \mathop{\text{argmax}}_T\, \mathbb{E}_{x\sim p(x), z\sim q(z)}\left[\Delta T - \frac{\Delta T^2}{2\lambda d\big(x,E(x);G(z),z\big)}\right] \\ G,E &= \mathop{\text{argmin}}_{G,E}\,\mathbb{E}_{x\sim p(x), z\sim q(z)}\Big[\Delta T + \beta_1 \Vert z - E(G(z))\Vert^2 + \beta_2 \Vert x - G(E(x))\Vert^2\Big]\\ \Delta T &= T(x,E(x))-T(G(z),z) \end{aligned}\end{equation}

In fact, the three loss terms of the generator are very intuitive: $\Delta T$ makes the generated images more realistic, $\Vert z - E(G(z))\Vert^2$ aims to reconstruct the latent space, and $\Vert x - G(E(x))\Vert^2$ aims to reconstruct the observable space. The weights of the last two terms should not be too large, especially the last one; too large a weight leads to blurry images.

These two regularization terms can be seen as upper bounds on the conditional entropies $H(z \mid G(z))$ and $H(x \mid E(x))$; minimizing them therefore maximizes variational lower bounds on the mutual information between $G(z)$ and $z$ and between $x$ and $E(x)$. From an information perspective, the two terms thus push the mutual information between $x$ and $z$ to be as large as possible. Related discussion can be found in the InfoGAN paper; in this sense the model also belongs to the InfoGAN family, so strictly speaking it should be called Bi-Info-GAN-QP.
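To spell out the information-theoretic reading (a standard variational argument along the lines of InfoGAN, added here for completeness and not part of the original derivation; $q_{\phi}$ is an auxiliary decoder distribution):

\begin{equation}\begin{aligned}I(z; G(z)) &= H(z) - H(z\mid G(z)) \geq H(z) + \mathbb{E}_{z\sim q(z)}\big[\log q_{\phi}(z\mid G(z))\big] \\ &= H(z) - \frac{1}{2\sigma^2}\mathbb{E}_{z\sim q(z)}\Vert z - E(G(z))\Vert^2 + \text{const}\end{aligned}\end{equation}

where the last line takes $q_{\phi}(z\mid G(z)) = \mathcal{N}\big(z; E(G(z)), \sigma^2 I\big)$. So minimizing $\Vert z - E(G(z))\Vert^2$ pushes up this lower bound on the mutual information, and the same argument applies to $x$ and $E(x)$.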

Mutual information terms can stabilize GAN training to a certain extent and reduce the risk of mode collapse, because once the model collapses, the mutual information cannot be large. In other words, a collapsed model can hardly reconstruct its inputs, so the reconstruction loss becomes very large.

Experiments show that a small adjustment improves the results further. The adjustment stems from the fact that the pull of the two MSE terms is still too strong (the loss value itself is not necessarily large, but its gradient is), so the model still tends toward blurry images. We therefore stop half of the gradient, turning the objective into:

\begin{equation}\begin{aligned}T&= \mathop{\text{argmax}}_T\, \mathbb{E}_{x\sim p(x), z\sim q(z)}\left[\Delta T - \frac{\Delta T^2}{2\lambda d\big(x,E(x);G(z),z\big)}\right] \\ G,E &= \mathop{\text{argmin}}_{G,E}\,\mathbb{E}_{x\sim p(x), z\sim q(z)}\Big[\Delta T + \beta_1 \Vert z - E(G_{ng}(z))\Vert^2 + \beta_2 \Vert x - G(E_{ng}(x))\Vert^2\Big]\\ \Delta T &= T(x,E(x))-T(G(z),z) \end{aligned}\end{equation}

$G_{ng}$ and $E_{ng}$ mean that the gradients through those parts are forced to zero, i.e. a stop-gradient. Most frameworks provide this operator; you can just call it. This is the final BiGAN-QP model of this article.
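As a sketch of this final objective (assuming TensorFlow's `tf.stop_gradient` for $G_{ng}$/$E_{ng}$; the weights `beta1`, `beta2` are illustrative and not taken from the paper, and `T`'s weights are held fixed during this phase):

```python
import tensorflow as tf

def ge_loss(T, E, G, x, z, beta1=5.0, beta2=10.0):
    """Sketch of the generator/encoder loss with half-stopped gradients."""
    x_fake = G(z)                                # G(z)
    z_enc = E(x)                                 # E(x)
    delta = T([x, z_enc]) - T([x_fake, z])       # ΔT
    # Stop-gradient versions: the z-reconstruction term trains only E,
    # the x-reconstruction term trains only G
    mse_z = tf.reduce_mean(tf.square(z - E(tf.stop_gradient(x_fake))))
    mse_x = tf.reduce_mean(tf.square(x - G(tf.stop_gradient(z_enc))))
    return tf.reduce_mean(delta) + beta1 * mse_z + beta2 * mse_x
```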

Code and Result Images

The code has also been added to Github: https://github.com/bojone/gan-qp/tree/master/bigan-qp

Here are more result images, randomly generated:

[Figure: BiGAN-QP randomly generated images]

Reconstruction results (left is original, right is reconstruction):

[Figure: more BiGAN-QP reconstruction results (left: original, right: reconstruction)]

As can be seen, both the random generations and the reconstructions are satisfying, with no blurriness. This shows that we have indeed trained a single GAN model with both encoding and generation capabilities.

An important feature: because the reconstruction goes through dimensionality reduction, the model did not (and cannot) learn an exact pixel-wise mapping; what it learns is a sharp reconstruction that looks roughly the same overall. Take, for example, the first image in the first row and the second image in the last row. The model essentially reconstructs the person, but the glasses are the interesting part: the model does reconstruct the glasses, yet swaps them for another "style" of glasses. One could even say the model has learned the concept of "glasses". Because dimensionality reduction limits the representation capacity of the latent variable, the model knows those are glasses but cannot reproduce the exact same pair, so it simply substitutes a common one.

This is something the "point-by-point, one-to-one reconstruction" required by ordinary VAEs cannot achieve, and such pointwise reconstruction is also the main cause of VAE blurriness. If you want fully invertible reconstruction, only invertible models like Glow are likely to deliver it.

Additionally, having both an encoder and a generator allows us to play with latent variable interpolation for real images:

[Figure: BiGAN-QP real-image interpolation (far left and far right: real images; second from left and second from right: reconstructions; the rest: interpolations)]
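In code, the interpolation is straightforward (a sketch assuming Keras-style `E`/`G` models with `predict`; `steps` is illustrative):

```python
import numpy as np

def interpolate(E, G, x1, x2, steps=8):
    """Decode a linear path between the codes of two real images."""
    z1 = E.predict(x1[None])                 # encode the two real images
    z2 = E.predict(x2[None])
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    zs = (1 - alphas) * z1 + alphas * z2     # linear interpolation in z-space
    return G.predict(zs)                     # decode back to images
```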

We can also look at similar images "in the eyes of" BiGAN-QP: compute the latent codes of all real images, then measure similarity by Euclidean distance or cosine similarity to find the most similar ones (the figure below uses Euclidean distance):

[Figure: similarity in the eyes of BiGAN-QP (first row: inputs; next two rows: the most similar images)]
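Such a similarity search might look like this (a sketch; `E.predict`, the batch size, and `k` are assumptions, with Euclidean distance as in the figure):

```python
import numpy as np

def most_similar(E, x_query, x_all, k=2):
    """Return indices of the k images whose latent codes are closest."""
    z_all = E.predict(x_all, batch_size=64)      # codes for all real images
    z_q = E.predict(x_query[None])[0]            # code of the query image
    dists = np.linalg.norm(z_all - z_q, axis=1)  # Euclidean distance in z-space
    return np.argsort(dists)[:k]                 # indices of the k nearest
```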

Welcome to Use and Share

As mentioned earlier, GAN-QP is a theoretically complete adversarial framework in which, in principle, any GAN task can be attempted. So if you have a GAN task at hand, why not give it a try? You can drop the L-constraint, spectral normalization, and even many regularization terms, without worrying about vanishing gradients. GAN-QP is the result of my effort to strip various hyperparameters out of GANs.

If you have new application results based on GAN-QP, you are welcome to share them here.