Powerful NVAE: You Can No Longer Say VAE Generated Images are Blurry

By 苏剑林 | July 10, 2020

Yesterday morning, while doing my daily scroll through arXiv, I was shocked by a newly released paper! The paper is titled "NVAE: A Deep Hierarchical Variational Autoencoder". As the name suggests, it is an improvement on VAE, proposing a new model called NVAE. To be honest, when I clicked on it, I didn't have high hopes because I consider myself to have some understanding of VAEs and felt that their generative capabilities were ultimately limited. However, when the paper opened, the visuals looked like this:

NVAE Face Generation Results

NVAE face generation results

And my first reaction was:

W!T!F! Is this really the result of a VAE? Is this still the VAE I know? It seems my understanding of VAE was too shallow; I can never say VAE-generated images are blurry again...

But after looking at the author's affiliation—NVIDIA—it became somewhat understandable. In recent years, you might have noticed that NVIDIA usually releases a breakthrough in generative models towards the end of the year: 2017 was PGGAN, 2018 was StyleGAN, and 2019 was StyleGAN2. This year seems a bit earlier and more prolific, as they just released a method called ADA last month, which pushed CIFAR-10 generation to a new height, and now comes NVAE.

So, what exactly is special about NVAE that allows it to achieve such a leap in VAE generation quality?

VAE Review

A careful reader might observe:

It still looks a bit fake; the faces are too smooth, as if they've been airbrushed, and it doesn't quite match StyleGAN...

Yes, that's a fair assessment. The generative artifacts are still quite noticeable. But if you aren't shocked, it's likely because you haven't seen previous VAE generation results. The style of a typical VAE looks like this:

Typical VAE Random Generation

Typical VAE random generation results

So, do you still think this isn't a breakthrough?

What limited the expressive power of (previous) VAEs? And where exactly did this breakthrough improve? Let's keep reading.

Basic Introduction

VAE, or Variational Auto-Encoder, has been introduced in many articles on this site. You can search for "Variational Autoencoder" in the search bar on the right to find many related blog posts. Here is a brief review and analysis.

In my derivation of the VAE, we start with a set of samples representing a real (but unknown) distribution $\tilde{p}(x)$. Then we construct a parameterized posterior distribution $p(z|x)$. Together, they form a joint distribution $p(x,z)=\tilde{p}(x)p(z|x)$. Next, we define a prior distribution $q(z)$ and a generative distribution $q(x|z)$, forming another joint distribution $q(x,z)=q(z)q(x|z)$. Finally, our goal is to make $p(x,z)$ and $q(x,z)$ close to each other, so we optimize the KL divergence between them:

\begin{equation} \begin{aligned} KL\big(p(x,z)\big\Vert q(x,z)\big)=&\iint p(x,z)\log \frac{p(x,z)}{q(x,z)} dzdx\\ =&\mathbb{E}_{x\sim \tilde{p}(x)} \Big[\mathbb{E}_{z\sim p(z|x)}\big[-\log q(x|z)\big]+KL\big(p(z|x)\big\Vert q(z)\big)\Big] + \text{constant} \end{aligned} \end{equation}

This is the optimization objective of the VAE.
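To make the two terms concrete, here is a minimal numpy sketch of this objective for diagonal Gaussians (the function names are mine, not from any paper): reconstruction error plus the closed-form KL to the standard normal.

```python
import numpy as np

def gaussian_kl_to_standard(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    # summed over the components of z.
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def vae_loss(x, mu, log_var, x_recon):
    # One-sample Monte-Carlo estimate of the objective: with a unit-variance
    # Gaussian q(x|z), -log q(x|z) reduces to squared error (plus a constant).
    recon_nll = 0.5 * np.sum((x - x_recon) ** 2)
    return recon_nll + gaussian_kl_to_standard(mu, log_var)
```

The two terms pull against each other: the KL wants the posterior to sit on the prior, while the reconstruction term wants the code to carry information about $x$.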

Difficulty Analysis

The requirements for $p(z|x), q(z), q(x|z)$ are: 1. They must have an analytical expression; 2. They must be easy to sample from. However, there aren't many such distributions in the continuous world. The most commonly used is the Gaussian distribution, particularly the "decoupled Gaussian" (independent components). Therefore, in standard VAEs, $p(z|x), q(z), q(x|z)$ are all set as independent Gaussian distributions: $p(z|x)=\mathcal{N}(z;\mu_1(x),\sigma_1^2(x))$, $q(z)=\mathcal{N}(z;0,1)$, and $q(x|z)=\mathcal{N}(x;\mu_2(z),\sigma_2^2(z))$.
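Sampling from such a diagonal-Gaussian posterior is what the reparameterization trick handles; a minimal sketch, assuming the log-variance parameterization (names mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_posterior(mu, log_var):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so the sample stays differentiable w.r.t. mu and log_var.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```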

The problem is that "independent Gaussian distributions" cannot fit arbitrarily complex distributions. When we choose the form of $p(z|x)$, it's possible that no matter how we tune its parameters, $\int \tilde{p}(x)p(z|x)dx$ and $\frac{\tilde{p}(x)p(z|x)}{\int \tilde{p}(x)p(z|x)dx}$ can never become Gaussian. This means that $KL\big(p(x,z)\big\Vert q(x,z)\big)$ theoretically cannot reach zero. Thus, forcing $p(x,z)$ and $q(x,z)$ to approximate each other only yields a rough, averaged result, which is why conventional VAE images are blurry.

Related Improvements

One classic direction for improving VAE is to combine it with GANs, such as CVAE-GAN and AGE. The state-of-the-art result in this direction is probably IntroVAE. Theoretically, this type of work implicitly abandons the assumption that $q(x|z)$ is a Gaussian distribution, replacing it with a more general distribution, thereby improving generation. However, I feel that introducing GANs into VAE is a bit like "playing with fire": you gain performance but also inherit GAN's weaknesses (training instability, etc.), and even then VAE generation remains inferior to pure GANs. Another direction is to combine VAE with flow models, such as IAF-VAE or my own earlier f-VAE; these works enhance the expressive power of $p(z|x)$ or $q(x|z)$ via flow models.

Another direction is introducing discrete latent variables, typically represented by VQ-VAE (see my "Brief Introduction to VQ-VAE: Quantized Autoencoder"). VQ-VAE encodes images into a discrete sequence and then uses PixelCNN to model the corresponding prior distribution $q(z)$. As mentioned, when $z$ is continuous, the choices for $p(z|x)$ and $q(z)$ are limited, leading to restricted approximation accuracy. But if $z$ is a discrete sequence, $p(z|x)$ and $q(z)$ correspond to discrete distributions. Using autoregressive models (language models in NLP, PixelRNN/PixelCNN in CV), we can approximate any discrete distribution, thus allowing for a more accurate approximation and improved generation. The subsequent VQ-VAE-2 further confirmed the effectiveness of this path, though VQ-VAE's workflow differs significantly from conventional VAE, making it sometimes hard to view as a pure VAE variant.

NVAE Breakdown

After all that buildup, we finally get to NVAE itself. NVAE stands for Nouveau VAE (not, as one might guess, Nvidia VAE). It incorporates many recent advances from the CV field, including multi-scale architectures, depthwise separable convolutions, the Swish activation function, and flow models. Integrating the best of multiple worlds, it has become the current strongest VAE.

(Note: The notation in this article differs from the original paper and common VAE introductions but is consistent with other related articles on this blog. Readers are encouraged to understand the concepts rather than memorize symbols.)

Autoregressive Distribution

As previously analyzed, the difficulty of VAE stems from $p(z|x), q(z), q(x|z)$ not being strong enough. Therefore, the improvement strategy is to enhance them. First, NVAE does not change $q(x|z)$, mainly to maintain parallel generation. Instead, it enhances the prior distribution $q(z)$ and posterior distribution $p(z|x)$ through an autoregressive model. Specifically, it partitions the latent variables into groups $z=\{z_1,z_2,\dots,z_L\}$, where each $z_l$ is still a vector, and sets:

\begin{equation}q(z)=\prod_{l=1}^L q(z_l|z_{< l}),\quad p(z|x)=\prod_{l=1}^L p(z_l|z_{< l},x)\label{eq:arpq}\end{equation}

The distributions $q(z_l|z_{< l})$ and $p(z_l|z_{< l},x)$ for each group are still modeled as Gaussian distributions. Thus, in total, $q(z)$ and $p(z|x)$ are established as autoregressive Gaussian models. The KL divergence term for the posterior becomes:

\begin{equation}KL\big(p(z|x)\big\Vert q(z)\big)=KL\big(p(z_1|x)\big\Vert q(z_1)\big)+\sum_{l=2}^L \mathbb{E}_{p(z_{< l}|x)}\Big[KL\big(p(z_l|z_{< l}, x)\big\Vert q(z_l|z_{< l})\big)\Big]\end{equation}
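The nested expectation above can be sketched as a group-by-group loop: sample $z_l$ from the posterior, accumulate the per-group Gaussian KL, and feed the sample into the next group's conditioning. A toy numpy version, where `posterior` and `prior` are hypothetical callables returning per-group means and variances:

```python
import numpy as np

def diag_gauss_kl(mu_p, var_p, mu_q, var_q):
    # Closed-form KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ).
    return 0.5 * np.sum(var_p / var_q + (mu_p - mu_q) ** 2 / var_q
                        - 1.0 + np.log(var_q) - np.log(var_p))

def hierarchical_kl(posterior, prior, num_groups, rng):
    # posterior(z_prev, l) -> (mu, var); prior(z_prev, l) -> (mu, var).
    # Sample z_l group by group from the posterior and accumulate the KL,
    # matching the nested-expectation form of the objective.
    total, z_prev = 0.0, []
    for l in range(num_groups):
        mu_p, var_p = posterior(z_prev, l)
        mu_q, var_q = prior(z_prev, l)
        total += diag_gauss_kl(mu_p, var_p, mu_q, var_q)
        z_prev.append(mu_p + np.sqrt(var_p) * rng.standard_normal(mu_p.shape))
    return total
```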

Of course, this approach is a simple generalization and not original to NVAE; it can be traced back to 2015 models like DRAW and HVM. NVAE's contribution is proposing a "relative" design for Eq $\eqref{eq:arpq}$:

\begin{equation}\begin{aligned}&q(z_l|z_{< l})=\mathcal{N}\left(z_l;\mu(z_{< l}),\sigma^2(z_{< l})\right)\\ &p(z_l|z_{< l},x)=\mathcal{N}\left(z_l;\mu(z_{< l})+\Delta\mu(z_{< l},x),\sigma^2(z_{< l})\otimes \Delta\sigma^2(z_{< l}, x)\right) \end{aligned}\end{equation}

In other words, instead of directly modeling the mean and variance of the posterior $p(z_l|z_{< l},x)$, it models their values relative to the prior's mean and variance. In this case, we have (omitting variables for simplicity):

\begin{equation}KL\big(p(z_l|z_{< l}, x)\big\Vert q(z_l|z_{< l})\big)=\frac{1}{2} \sum_{i=1}^{|z_l|} \left(\frac{\Delta\mu_{(i)}^2}{\sigma_{(i)}^2} + \Delta\sigma_{(i)}^2 - \log \Delta\sigma_{(i)}^2 - 1\right)\end{equation}

The original paper notes that this makes training more stable.
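Under this relative parameterization, the per-group KL above depends only on the offsets $\Delta\mu, \Delta\sigma^2$ and the prior variance; a direct numpy transcription of the formula (names mine):

```python
import numpy as np

def relative_kl(delta_mu, delta_var, prior_var):
    # KL between posterior N(mu + delta_mu, prior_var * delta_var) and
    # prior N(mu, prior_var): only the *relative* shift and scale matter.
    return 0.5 * np.sum(delta_mu ** 2 / prior_var
                        + delta_var - np.log(delta_var) - 1.0)
```

Note that when $\Delta\mu=0$ and $\Delta\sigma^2=1$ the KL vanishes, so an "untouched" posterior costs nothing; one intuition for the reported stability gain.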

Multi-scale Design

Now that the latent variables are split into $L$ groups $z=\{z_1,z_2,\dots,z_L\}$, two questions arise: 1. How does the encoder generate $z_1,z_2,\dots,z_L$ one by one? 2. How does the decoder consume $z_1,z_2,\dots,z_L$ one by one? In other words, how should the encoder and decoder be designed?

NVAE Encoder and Decoder Architecture

Encoder and decoder architecture in NVAE. Here 'r' represents residual modules, 'h' represents trainable parameters, and blue parts share parameters.

NVAE cleverly designs a multi-scale encoder and decoder, as shown above. First, the encoder goes through layers to produce the top-level encoding vector $z_1$, and then gradually moves down to obtain lower-level features $z_2,\dots,z_L$. The decoder naturally utilizes $z_1,z_2,\dots,z_L$ in a top-down manner. Since this part of the decoder shares commonalities with the process of generating $z_1, \dots, z_L$, NVAE directly shares parameters between corresponding parts. This saves parameter count and improves generalization through mutual constraints.

This multi-scale design is present in many state-of-the-art generative models like StyleGAN, BigGAN, and VQ-VAE-2, suggesting its effectiveness is well-validated. Additionally, to ensure performance, NVAE carefully screened residual module designs, eventually settling on the one below—demonstrating exhaustive model tuning:

Residual Module in NVAE

Residual module in NVAE

Other Enhancement Tricks

In addition to the two major features above, NVAE includes several other performance-enhancing tricks. Here are a few:

Batch Normalization (BN) Improvements: Many current generative models have abandoned BN in favor of Instance Normalization (IN) or Weight Normalization (WN) because they found BN degrades performance. NVAE discovered through experiments that BN is helpful for training but harmful for prediction because the moving averages used during inference aren't good enough. Therefore, after training, NVAE re-estimates the mean and variance by sampling batches of the same size multiple times, ensuring BN's prediction performance. Furthermore, to stabilize training, NVAE adds a regularization term to the magnitude of BN's $\gamma$.
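The post-training statistic fix can be illustrated with a toy sketch: instead of trusting the moving averages accumulated during training, re-estimate the mean and variance from several fresh batches of the training batch size (a simplification of the paper's procedure; names mine):

```python
import numpy as np

def reestimate_bn_stats(batches):
    # After training, replace BN's moving averages with statistics
    # re-estimated over several fresh batches, then freeze them for inference.
    means = [b.mean(axis=0) for b in batches]
    vars_ = [b.var(axis=0) for b in batches]
    return np.mean(means, axis=0), np.mean(vars_, axis=0)
```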

Spectral Normalization: We know that the KL divergence between any two distributions is unbounded, so the KL term in VAE is also unbounded. Optimizing such an unbounded objective is "dangerous" as it could diverge at any time. To stabilize training, NVAE applies spectral normalization to every convolutional layer (refer to my "Lipschitz Constraint in Deep Learning: Generalization and Generative Models"). Adding spectral normalization ensures the model's Lipschitz constant is small, making the landscape smoother and training more stable.
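Spectral normalization divides each weight matrix by an estimate of its largest singular value, usually obtained by power iteration. A self-contained numpy sketch (frameworks such as PyTorch provide this built in; this version re-runs the iteration from scratch rather than caching the iterate across training steps):

```python
import numpy as np

def spectral_normalize(w, num_iters=30):
    # Estimate the largest singular value of weight matrix w by power
    # iteration, then rescale so the layer's spectral norm is ~1.
    rng = np.random.default_rng(0)
    u = rng.standard_normal(w.shape[0])
    for _ in range(num_iters):
        v = w.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = w @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ w @ v  # estimated top singular value
    return w / sigma
```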

Distribution Enhancement with Flow Models: Through the autoregressive model, NVAE enhances the model's ability to fit distributions. However, this autoregression is only across groups. Within each group, $p(z_l|z_{< l}, x)$ and $q(z_l|z_{< l})$ are still assumed to be independent Gaussians, meaning there is room for improvement. A more thorough solution would be making each component within a group autoregressive, but that would make sampling extremely slow. NVAE offers an alternative: modeling the intra-group distribution as a flow model to enhance expression while maintaining parallel sampling within groups. Experiments show improvement, though I believe adding flow models significantly increases complexity for relatively minor gains—best avoided if possible.

Memory Saving Tricks: Although NVIDIA has no shortage of GPUs, NVAE makes efforts to save memory. It uses mixed-precision training and promotes NVIDIA's own APEX library. Furthermore, it employs gradient checkpointing (recomputation) in the BN layers, reportedly saving 18% memory with almost no impact on speed. In short, teams with more GPUs also know how to save memory better than you.

More Results

At this point, the primary technical points of NVAE have been introduced. If you're still craving more, here are a few more result images to appreciate the stunning quality of NVAE.

NVAE on CelebA HQ and FFHQ

Generation results of NVAE on CelebA HQ and FFHQ. Notably, NVAE is the first VAE-type model to be tested on the FFHQ dataset, and its very first attempt is already this stunning.

Image Retrieval Experiment

Image retrieval experiment based on NVAE. On the left are randomly generated samples; on the right are the most similar samples from the training set, primarily to verify if the model is just memorizing the training data.

More CelebA HQ Results

More CelebA HQ generation results.

Personal Takeaways

From the training table below, we can see that the training cost is quite high, higher even than a StyleGAN at the same resolution. The paper mentions numerous small training tricks throughout (and likely omits many more). StyleGAN and BigGAN also relied on many such tricks, so this is not a drawback unique to NVAE. But for a mere generative-model hobbyist like myself, what can we take away from NVAE?

NVAE Training Parameters and Costs

NVAE training parameters and costs

For me, NVAE brought two major conceptual impacts:

First, autoregressive Gaussian models can be very powerful at fitting complex continuous distributions. I previously thought only discrete distributions could use autoregressive models effectively, hence my focus on VQ-VAE's discrete latent space. NVAE proves that even with continuous latent variables, autoregressive Gaussians can fit well. Thus, we don't necessarily have to follow VQ-VAE's discretization path; continuous latent variables are much easier to train.

Second, VAE latent variables can be more than one and can be hierarchical. Looking at the table again, for the FFHQ column, the latent variable $z$ has $4+4+4+8+16=36$ groups in total. Each group has different dimensions: $\{8^2, 16^2, 32^2, 64^2, 128^2\} \times 20$. Calculating this, creating a $256 \times 256$ FFHQ image requires a random vector of total size:

\begin{equation}\left(4\times 8^2 + 4\times 16^2 + 4\times 32^2 + 8\times 64^2 + 16\times 128^2 \right)\times 20=6{,}005{,}760\end{equation}

That is to say, one samples a vector of over six million dimensions to generate an image of only $256 \times 256 \times 3 = 196,608$ (fewer than 200,000) dimensions. This is very different from traditional VAEs, which usually encode an image into a single vector of a few hundred dimensions. Here, the encoding is far larger than the image itself, more like a fully convolutional autoencoder, so the improvement in sharpness is within reason.
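The arithmetic can be double-checked in a couple of lines:

```python
# Latent size of the 256x256 FFHQ model:
# (spatial size -> number of groups), each group with 20 channels.
groups = {8: 4, 16: 4, 32: 4, 64: 8, 128: 16}
channels = 20
total = sum(n * s * s for s, n in groups.items()) * channels
print(total)          # 6005760 latent dimensions
print(256 * 256 * 3)  # 196608 pixel dimensions
```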

What is "Nouveau"?

Finally, I curiously searched for the meaning of "Nouveau." Here is the explanation from Wikipedia:

nouveau (/nuːˈvoʊ/) is a free and open-source graphics device driver for Nvidia video cards and the Tegra family of systems-on-chips, written by independent software engineers, with some help from Nvidia employees. The project's goal is to create an open-source driver by reverse-engineering Nvidia's proprietary Linux drivers... The name of the project comes from "nouveau", the French word for "new"... When the original author's IRC client suggested replacing "nv" with "nouveau", the name was adopted.

Does this mean Nouveau VAE and Nvidia VAE are essentially synonyms? It turns out our initial hunch wasn't wrong!

Summary

This article introduced NVAE, an upgraded VAE recently published by NVIDIA, which pushes VAE generation quality to a new height. As seen, NVAE raises the theoretical upper bound through autoregressive latent distributions, designs a clever encoder-decoder structure, and integrates almost all the latest techniques from current generative models to create the strongest VAE to date.