A Brief Introduction to the Non-Adversarial Generative Model GLANN

By 苏剑林 | February 26, 2019

A while ago, I saw that Facebook published a non-adversarial generative model called GLANN (posted on arXiv last December), claiming that a non-adversarial approach could also generate high-definition 1024-resolution faces. I read it with great interest and indeed gained some insights, but I was also somewhat disappointed. As for why I was disappointed, you will understand as you read on.

Original paper: "Non-Adversarial Image Synthesis with Generative Latent Nearest Neighbors"
Introduction by Machine Heart (Ji Qi Zhi Xin): "Why let GAN dominate? Facebook proposes GLANN, a non-adversarial generation method"

Sample results:
GLANN Results
Below is a simple breakdown of the content related to the GLANN model.

Implicit Maximum Likelihood

The foundation of the entire GLANN method is "Implicit Maximum Likelihood Estimation", from the article of the same name, referred to as "IMLE". This article was only posted on arXiv last September, which surprised me quite a bit: the algorithm is so simple that I had already used it two years ago, and I always felt it was an obviously valid method, yet it was published only this late... (I feel like I missed out on millions.)

Directly Estimating the Probability Distribution

The appendix of the IMLE paper provides a lot of complex mathematical derivations, but I feel they are not really necessary. IMLE essentially falls out of approximating the Dirac delta in an integral by a narrow Gaussian and then taking the limit.

In general, consider the following sampling process: \begin{equation}z\sim q(z),\quad x = G(z)\end{equation} This is actually assuming the distribution of $x$ is: \begin{equation}q(x)=\int \delta\big(x-G(z)\big)q(z)dz\end{equation} where $q(z)$ is generally a normal or uniform distribution, and $\delta(\cdot)$ represents the (multivariate) Dirac delta function.

Note that $q(x)$ can also be written as: \begin{equation}q(x)=\mathbb{E}_{z\sim q(z)}\big[\delta\big(x-G(z)\big)\big]\end{equation} And $\delta(\cdot)$ is actually a Gaussian distribution as the variance tends to 0: \begin{equation}\delta(x)=\lim_{\sigma\to 0}\frac{1}{(2\pi\sigma^2)^{d/2}}\exp\left(-\frac{\Vert x\Vert^2}{2\sigma^2}\right)\end{equation} With this in mind, we might as well let $\sigma$ take a finite value, perform calculations, and then let $\sigma\to 0$, i.e., \begin{equation}q(x)=\lim_{\sigma\to 0}\mathbb{E}_{z\sim q(z)}\left[\frac{1}{(2\pi\sigma^2)^{d/2}}\exp\left(-\frac{\Vert x - G(z)\Vert^2}{2\sigma^2}\right)\right]\end{equation} Then, we perform maximum likelihood estimation, using $-\int p(x)\log q(x)dx$ as the loss, where $p(x)$ is the distribution of real samples: \begin{equation}\begin{aligned}loss=&-\int p(x) \log \left\{\mathbb{E}_{z\sim q(z)}\left[\frac{1}{(2\pi\sigma^2)^{d/2}}\exp\left(-\frac{\Vert x - G(z)\Vert^2}{2\sigma^2}\right)\right]\right\}dx\\ =&\mathbb{E}_{x\sim p(x)}\left[-\log \left\{\mathbb{E}_{z\sim q(z)}\left[\frac{1}{(2\pi\sigma^2)^{d/2}}\exp\left(-\frac{\Vert x - G(z)\Vert^2}{2\sigma^2}\right)\right]\right\}\right]\\ \sim &\mathbb{E}_{x\sim p(x)}\left[-\log \left\{\mathbb{E}_{z\sim q(z)}\left[\exp\left(-\frac{\Vert x - G(z)\Vert^2}{2\sigma^2}\right)\right]\right\}\right]\end{aligned}\end{equation} In the final expression, we have omitted constants irrelevant to the optimization.

Now we convert the expectation $\mathbb{E}$ into sampling, by substituting samples $x_1,x_2,\dots,x_M\sim p(x)$ and $z_1,z_2,\dots,z_N\sim q(z)$ into the loss: \begin{equation}\begin{aligned}loss\sim& -\frac{1}{M}\sum_{i=1}^M \log \left\{\frac{1}{N}\sum_{j=1}^N\exp\left(-\frac{\Vert x_i - G(z_j)\Vert^2}{2\sigma^2}\right)\right\}\\ \sim& -\frac{1}{M}\sum_{i=1}^M \log \left\{\sum_{j=1}^N\exp\left(-\frac{\Vert x_i - G(z_j)\Vert^2}{2\sigma^2}\right)\right\}\end{aligned}\end{equation} From the article "Seeking a Smooth Maximum Function", we know that $\text{logsumexp}$ (exponentiation, summation, then taking the logarithm) is actually a smooth approximation of the $\max$. As $\sigma\to 0$, it becomes the $\max$; adding a negative sign makes it the $\min$. Thus, the simplest form as $\sigma\to 0$ is: \begin{equation}loss\sim \frac{1}{M}\sum_{i=1}^M \left(\min_{j=1}^N \Vert x_i - G(z_j)\Vert^2\right)\end{equation} This is the loss used by IMLE. (The derivation is a bit long only because it's written in detail; it's actually not hard!)
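To get a feel for this smooth-minimum behavior, here is a quick numerical check (a NumPy sketch of my own, not from the paper): $-2\sigma^2\log\sum_j\exp\big(-d_j/2\sigma^2\big)$ approaches $\min_j d_j$ as $\sigma\to 0$.

```python
import numpy as np

# Squared distances from one real sample x_i to a batch of fake samples G(z_j).
d = np.array([4.0, 1.5, 9.0, 0.7, 3.2])

for sigma in [2.0, 1.0, 0.5, 0.1]:
    # Smooth minimum via logsumexp; the prefactor 2*sigma^2 undoes the scaling.
    smooth_min = -2 * sigma**2 * np.log(np.sum(np.exp(-d / (2 * sigma**2))))
    print(f"sigma={sigma:.1f}  smooth min ~ {smooth_min:.4f}")

print("hard min =", d.min())  # the sigma -> 0 limit used in the IMLE loss
```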

Therefore, the specific process of IMLE is:
1. Sample a batch of real samples $x_1,x_2,\dots,x_M$;
2. Sample a batch of noise $z_1,z_2,\dots,z_N$, to obtain a batch of fake samples $\hat{x}_1,\hat{x}_2,\dots,\hat{x}_N$;
3. For each real sample $x_i$, find its nearest fake sample $\hat{x}_{\rho(i)}$;
4. Minimize the average distance $\frac{1}{M}\sum\limits_{i=1}^M \Vert x_i - \hat{x}_{\rho(i)}\Vert^2$.
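For concreteness, here is a minimal PyTorch sketch of one IMLE training step (my own illustration under toy assumptions, not the author's linked Keras code; the generator architecture and all sizes are placeholders):

```python
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 28 * 28  # illustrative sizes

# A toy generator; any network mapping z -> x would do here.
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim))
opt = torch.optim.Adam(G.parameters(), lr=1e-3)

def imle_step(x_real, n_fake=128):
    """One IMLE step: pull each real sample toward its nearest fake sample."""
    z = torch.randn(n_fake, latent_dim)      # step 2: sample noise
    x_fake = G(z)                            # step 2: generate fake samples
    d = torch.cdist(x_real, x_fake) ** 2     # step 3: pairwise squared L2 distances
    loss = d.min(dim=1).values.mean()        # steps 3-4: nearest fake per real, averaged
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# usage: x_batch is an (M, img_dim) tensor of flattened real images
x_batch = torch.rand(32, img_dim)
print(imle_step(x_batch))
```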

Analysis and Discussion of Effects

Putting the derivation aside, this algorithm is quite simple: if every real sample can find a sufficiently close fake sample among the generated ones, doesn't that imply the generated samples are also very good?

That is why I said it was hard to believe that this algorithm was published only so late. Now let's look at the results. The principle of the algorithm is sound, but the problem is that "nearest" in the algorithm above is measured with the $L_2$ distance. For images, $L_2$ is not a good metric, so it is foreseeable that this method suffers from blurriness, much like VAEs. In fact, looking at the results on CelebA, it is not even as good as a VAE:

IMLE results on CelebA

Code: https://github.com/bojone/gan/blob/master/imle.py

Actually, this idea can be generalized to general divergence optimization. For example, we could use \begin{equation}KL(q(x)\Vert p(x))=\int q(x)\log \frac{q(x)}{p(x)}dx=\mathbb{E}_{x\sim q(x)}\big[\log q(x)-\log p(x)\big]\end{equation} as the optimization objective. Approximating $\log q(x)$ and $\log p(x)$ with the same method (a Gaussian-smoothed Dirac delta followed by the $\sigma\to 0$ limit), the result is: \begin{equation}loss\sim \frac{1}{N}\sum_{i=1}^N \left(\min_{j=1}^M \Vert G(z_i) - x_j\Vert^2 - \min_{j=1}^K \Vert G(z_i) - G(z_{N+j})\Vert^2\right)\end{equation} Alternatively, setting a margin $m$ might make the effect better: \begin{equation}loss\sim \frac{1}{N}\sum_{i=1}^N \left(\min_{j=1}^M \Vert G(z_i) - x_j\Vert^2 + \text{relu}\left(m - \min_{j=1}^K \Vert G(z_i) - G(z_{N+j})\Vert^2\right)\right)\end{equation} Note that here two batches of fake samples must be sampled, otherwise the second term becomes meaningless (if the minimum ran over the same batch, each fake sample would be its own nearest neighbor and the term would always be 0). The second term is used to prevent mode collapse. The process for this new algorithm is (a code sketch follows the list):
1. Sample a batch of real samples $x_1,x_2,\dots,x_M$;
2. Sample a batch of noise $z_1,z_2,\dots,z_N$, to obtain a batch of fake samples $\hat{x}_1,\hat{x}_2,\dots,\hat{x}_N$;
3. Sample another batch of noise $z_{N+1},z_{N+2},\dots,z_{N+K}$, to obtain another batch of fake samples $\hat{x}_{N+1},\hat{x}_{N+2},\dots,\hat{x}_{N+K}$;
4. For each fake sample $\hat{x}_i$ ($1\leq i\leq N$), find its nearest real sample $x_{\rho_1(i)}$;
5. For each fake sample $\hat{x}_i$ ($1\leq i\leq N$), find its nearest fake sample $\hat{x}_{N+\rho_2(i)}$ among $\hat{x}_{N+1},\hat{x}_{N+2},\dots,\hat{x}_{N+K}$;
6. Minimize the "Real-Fake" distance while maximizing the "Fake-Fake" distance, which is the loss above.

Sample results:
Another IMLE result on CelebA
They look about the same...

From IMLE to GLANN

Returning to the discussion of IMLE itself, it can be imagined that the main reason for its poor performance is the use of the $L_2$ distance. So, what if we replace it with another distance? For image realism, is there an existing loss function we can use?

Perceptual Loss

There actually is one: perceptual loss! This perceptual loss originated from style transfer, specifically from the paper "Perceptual Losses for Real-Time Style Transfer and Super-Resolution". Calculating it is somewhat involved: it requires a model pre-trained on ImageNet (generally VGG for simplicity); we take the outputs of its last few hidden layers, compute the $L_2$ (or $L_1$) distances between these hidden-layer features as well as between their Gram matrices, and finally sum everything up.
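As a rough sketch of what such a perceptual loss can look like in code (my own illustration using torchvision's pretrained VGG16; the particular layer indices and the use of $L_1$ are arbitrary choices, not necessarily those of the papers):

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

vgg = vgg16(pretrained=True).features.eval()   # frozen ImageNet feature extractor
for p in vgg.parameters():
    p.requires_grad_(False)

LAYER_IDS = {3, 8, 15, 22}  # which feature maps to compare (illustrative choice)

def vgg_feats(x):
    """Collect intermediate VGG feature maps for an image batch of shape (B, 3, H, W)."""
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in LAYER_IDS:
            feats.append(x)
    return feats

def gram(f):
    """Normalized Gram matrix of a (B, C, H, W) feature map."""
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def perceptual_loss(x, y):
    """Distance between hidden-layer features plus distance between their Gram matrices."""
    loss = 0.0
    for fx, fy in zip(vgg_feats(x), vgg_feats(y)):
        loss = loss + F.l1_loss(fx, fy) + F.l1_loss(gram(fx), gram(fy))
    return loss
```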

This distance works quite well in style transfer tasks, but it is complex to calculate and feels more like an engineering product than a theoretical derivation. Therefore, I'm not very fond of it and have no interest in writing about it in detail. In short, one can combine perceptual loss with IMLE: \begin{equation}loss\sim \frac{1}{M}\sum_{i=1}^M \left(\min_{j=1}^N d_{perceptual}\big(x_i,G(z_j)\big)\right)\label{eq:perceptual-1}\end{equation}

Intermediate Product: GLO

Directly optimizing the objective $\eqref{eq:perceptual-1}$ is theoretically fine, but the computational cost is huge. As mentioned, calculating perceptual loss is complex and requires a pre-trained ImageNet model. If the batch size is 64, we would need to calculate $64^2=4096$ perceptual losses per batch and take the minimum, which would be unacceptably slow or even impossible to run.

Therefore, GLANN also utilizes a technique called GLO, which comes from the paper "Optimizing the latent space of generative networks". GLO is also an extremely simple concept that resulted in yet another full paper... GLO doesn't even set out to be a generative model; it just wants to obtain a low-dimensional embedding. Suppose the real sample set is $x_1,x_2,\dots,x_M$; then GLO's optimization objective is: \begin{equation}\mathop{\text{argmin}}_{G,\hat{z}_1,\dots,\hat{z}_M}\frac{1}{M}\sum_{i=1}^M d\big(x_i, G(\hat{z}_i)\big)\quad \text{s.t.}\quad \Vert \hat{z}_i\Vert=1\end{equation} GLO optimizes $\hat{z}_1,\dots,\hat{z}_M$ directly, which is equivalent to an Embedding layer: it trains an embedding for each image. The parts that vary from model to model are the constraint on the embeddings and the metric $d$. For GLANN, perceptual loss is used: \begin{equation}\mathop{\text{argmin}}_{G,\hat{z}_1,\dots,\hat{z}_M}\frac{1}{M}\sum_{i=1}^M d_{perceptual}\big(x_i, G(\hat{z}_i)\big)\quad \text{s.t.}\quad \Vert \hat{z}_i\Vert=1\end{equation} This way, even with a batch size of 64, we only need to calculate 64 perceptual losses per batch, since there are no pairwise comparisons.
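A minimal sketch of GLO's joint optimization, with one learnable code per training image (my own illustration; `MyGenerator` is a placeholder network and `perceptual_loss` is the sketch from above):

```python
n_images, latent_dim = 10000, 128   # illustrative sizes

codes = nn.Embedding(n_images, latent_dim)       # one learnable latent code per image
G = MyGenerator(latent_dim)                      # placeholder generator network
opt = torch.optim.Adam(list(G.parameters()) + list(codes.parameters()), lr=1e-3)

def glo_step(x_real, idx):
    """x_real: a batch of images; idx: their integer indices in the training set."""
    z_hat = codes(idx)
    z_hat = z_hat / z_hat.norm(dim=1, keepdim=True)   # enforce the unit-norm constraint
    loss = perceptual_loss(G(z_hat), x_real)          # only one loss per image in the batch
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```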

Final Result: GLANN

Now that we have $\hat{z}_1,\dots,\hat{z}_M$, $G(\hat{z}_i)$ can reproduce the corresponding images. Now we just need to treat $\hat{z}_1,\dots,\hat{z}_M$ as raw data and apply IMLE to them (here we can use $L_2$, because we are now working in the latent space): \begin{equation}\mathop{\text{argmin}}_{T}\frac{1}{M}\sum_{i=1}^M \left(\min_{j=1}^N \Vert \hat{z}_i - T(z_j)\Vert^2\right)\end{equation} This yields the generative model. The complete generation process is: \begin{equation}z\sim q(z)\quad\xrightarrow{\quad T\quad }\quad \hat{z} \quad \xrightarrow{\quad G\quad }\quad x\end{equation}
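Finally, a sketch of this last stage: fitting a small mapping network $T$ with IMLE in the latent space, using the GLO codes as the "real" data (again my own illustration with placeholder names):

```python
T = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
opt_T = torch.optim.Adam(T.parameters(), lr=1e-3)

def glann_imle_step(z_hat_batch, n_noise=256):
    """Pull each learned code z_hat toward its nearest mapped noise vector T(z)."""
    z = torch.randn(n_noise, latent_dim)
    d = torch.cdist(z_hat_batch, T(z)) ** 2
    loss = d.min(dim=1).values.mean()
    opt_T.zero_grad()
    loss.backward()
    opt_T.step()
    return loss.item()

# Generation after training: z ~ q(z) --T--> z_hat --G--> image
with torch.no_grad():
    samples = G(T(torch.randn(16, latent_dim)))
```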

Personal Evaluation

That concludes the explanation of the GLANN model. Overall, it is a model that integrates several techniques. As for the results, look at the images at the beginning; personally, I don't think the results are particularly great—the backgrounds are always a bit messy. Of course, it is undoubtedly better than pure IMLE or GLO, and it should be much easier to train than a GAN.

The main improvement of GLANN is replacing the $L_2$ distance with perceptual loss. In fact, this substitution should be applicable to many models; I suspect it could even be used in VAEs. On the other hand, the perceptual loss approach is too engineered for my taste, so I didn't feel like investigating it further. Also, the GLANN paper reported FID advantages on certain datasets. This looks good, but it is actually unfair. Since the calculation of FID itself relies on an ImageNet model, and the GLANN loss also uses an ImageNet model, GLANN's generation naturally has an advantage regarding FID.

In fact, one could use FID directly as the loss, train a GLO, and then train an IMLE to get a generative model. The FID of a generative model built this way would definitely be very low, but it wouldn't mean much, because the realism of the images would not be guaranteed.