By 苏剑林 | December 26, 2018
This article briefly lists what I consider to be important recent GAN progress papers; it essentially doubles as the main reading list from my own study of GANs.
The Taste of Generative Models
GAN is a deep rabbit hole, especially for an amateur player like me: you can sink a long time into it without producing any results, particularly now that the big companies throw massive computational power at one giant model after another, making it nearly impossible for individuals to compete. Still, I have always felt that only by working on generative models does one feel they have touched real machine learning, whether in images or in text. So I am still willing to pay attention to generative models.
Of course, GAN is not the only choice of generative model, but it is a very interesting one. For images there are at least GANs, flow-based models, and PixelRNN/PixelCNN to choose from, but in terms of potential I still find GANs the most promising, not only because of the results but mainly because of the adversarial philosophy behind them. For text, the seq2seq mechanism is itself already a probabilistic generative model, and models like PixelRNN essentially imitate seq2seq; there is also research on generating text with GANs (though most of it involves reinforcement learning). In other words, generative models have plenty of achievements in NLP as well; even if you mainly study NLP, you will run into generative models sooner or later.
Anyway, without further ado, let me quickly list the papers for everyone's reference and as a reminder for myself.
Let the Results Speak
A Word Up Front
Loosely speaking, in the GAN field right now, results do the talking. No matter how perfect your theory is, if your experiments cannot produce high-definition images, it is hard to get accepted; conversely, no matter how "ugly" your mathematical derivation looks, as long as your experimental results are good enough to generate HD images, everyone will flock to you.
A landmark event for GAN models was NVIDIA's "Progressive Growing GANs" released last year, which first achieved $1024 \times 1024$ high-definition face generation. You should know that typical GANs have difficulty generating even $128 \times 128$ faces, so $1024$ resolution generation can be called a breakthrough. The papers listed below have all achieved $1024$ face generation in their own experiments. This experimental result alone makes these papers worth our attention.
Of course, generating $1024$ images requires not only modeling progress but also considerable computational power, which is hard for ordinary individuals or labs to come by. The point of following these papers is not to replicate such large-scale image generation ourselves; rather, the very fact that these models can generate such large images means they must contain something worth learning from. We might even come to understand where the bottlenecks of GANs lie, and so avoid detours in our own research.
Paper List
《Progressive Growing of GANs for Improved Quality, Stability, and Variation》
Paper Address: https://papers.cool/arxiv/1710.10196
Reference Implementation: https://github.com/tkarras/progressive_growing_of_gans
Simple Introduction: This is the Progressive Growing GANs (PGGAN) from NVIDIA mentioned earlier, which first achieved $1024$ face generation. As the name suggests, PGGAN achieves a transition from low resolution to high resolution through a progressive structure, thus enabling smooth training of high-definition models. The paper also proposes its own understanding and techniques for regularization and normalization, which are worth reflecting on. Of course, because it is progressive, it is equivalent to training many models in series, so PGGAN is very slow...
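To give a feel for what "progressive" means, here is a minimal sketch (in PyTorch, not the official TensorFlow implementation) of the fade-in idea as I understand it: when a new, higher-resolution block is added, its output is blended with an upsampled copy of the previous stage's output, using a coefficient $\alpha$ that ramps from 0 to 1 during training.

```python
import torch
import torch.nn.functional as F

def grow_step(prev_rgb, new_rgb, alpha):
    """Blend the previous stage's output with the new higher-resolution block.

    prev_rgb: image from the previous (lower-resolution) stage, e.g. 64x64
    new_rgb:  image from the newly added block, e.g. 128x128
    alpha:    fade-in coefficient, ramped from 0 to 1 during training
    """
    upsampled = F.interpolate(prev_rgb, scale_factor=2, mode="nearest")
    return (1.0 - alpha) * upsampled + alpha * new_rgb

# toy usage: random tensors standing in for the two branches
prev_rgb = torch.randn(4, 3, 64, 64)
new_rgb = torch.randn(4, 3, 128, 128)
print(grow_step(prev_rgb, new_rgb, alpha=0.3).shape)  # torch.Size([4, 3, 128, 128])
```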
《Which Training Methods for GANs do actually Converge?》
Paper Address: https://papers.cool/arxiv/1801.04406
Reference Implementation: https://github.com/LMescheder/GAN_stability
Simple Introduction: This paper contains a lot of mathematical derivation on the stability of GAN training, eventually arriving at a gradient penalty term simpler than WGAN-GP's. Readers concerned with GAN training stability can refer to it. Besides $1024$ faces, the paper also runs experiments on many other datasets with good results, all trained directly end-to-end without a progressive structure. My only confusion is: isn't this penalty term just a special case of WGAN-div? Why doesn't the paper mention this?
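For reference, the simple penalty I have in mind here is, as I read the paper, a gradient penalty computed on real samples alone. A rough PyTorch sketch (the function name and the $\gamma=10$ default are my own choices):

```python
import torch

def real_gradient_penalty(discriminator, real_images, gamma=10.0):
    """Gradient penalty on real samples only: 0.5 * gamma * E[ ||grad_x D(x)||^2 ]."""
    x = real_images.clone().requires_grad_(True)
    scores = discriminator(x)
    grads, = torch.autograd.grad(scores.sum(), x, create_graph=True)
    return 0.5 * gamma * grads.pow(2).flatten(1).sum(1).mean()
```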
《IntroVAE: Introspective Variational Autoencoders for Photographic Image Synthesis》
Paper Address: https://papers.cool/arxiv/1807.06358
Reference Implementation: (No high-quality open source/reproduction seen yet)
Simple Introduction: This is a VAE that "introspects." It improves the VAE through adversarial training, enabling it to generate high-definition images while obtaining an encoder and a generator at the same time. Beyond generating $1024$ HD images, it is worth noting that the conception of this paper is very ingenious. Models that yield both an encoder and a generator are not unique (BiGAN can do it, for example), but IntroVAE's distinctive feature is that the encoder itself is used directly as the discriminator, with no separate discriminator needed, effectively saving $1/3$ of the parameter count. The deeper reasons behind this are worth our careful analysis and savoring.
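As a very rough sketch of how the encoder can double as a discriminator: in my simplified reading, the encoder's KL term to the prior plays the role of an energy, pushed down on real images and pushed up (with a hinge margin) on generated ones, while the generator tries to push it back down. The actual paper combines this with reconstruction terms and applies it to both reconstructions and samples from the prior; the code below only illustrates the adversarial part.

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu, logvar):
    """Per-sample KL( N(mu, exp(logvar)) || N(0, I) )."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1)

def introspective_losses(kl_real, kl_fake, margin=10.0):
    """Hinge-style adversarial terms built from the encoder's own KL.

    The encoder pushes kl_real down and kl_fake above `margin`;
    the generator pushes kl_fake back down, like an energy-based GAN.
    """
    encoder_adv = kl_real.mean() + F.relu(margin - kl_fake).mean()
    generator_adv = kl_fake.mean()
    return encoder_adv, generator_adv
```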
《Large Scale GAN Training for High Fidelity Natural Image Synthesis》
Paper Address: https://papers.cool/arxiv/1809.11096
Reference Implementation: https://github.com/AaronLeong/BigGAN-pytorch
Simple Introduction: This is the famous BigGAN. The paper does not provide $1024$ face generation results, but it does provide natural-scene image generation at $128$, $256$, and $512$ resolutions. Bear in mind that generating natural-scene images is many times harder than generating CelebA faces; since it can generate $512$ natural scenes, we have little reason to doubt it could easily generate $1024$ faces. There are already many popular introductions to BigGAN online, so I won't repeat them. The paper also proposes some regularization techniques of its own and shares a large amount of hyperparameter tuning experience (which changes help and which hurt), which is well worth consulting.
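One of the regularizers I recall from the paper is a relaxed orthogonal penalty that only punishes the off-diagonal entries of the weight Gram matrix; roughly as follows (treat this as my paraphrase, not the official code):

```python
import torch

def orthogonal_penalty(weight, beta=1e-4):
    """Off-diagonal orthogonality penalty on a (flattened) weight matrix."""
    w = weight.view(weight.size(0), -1)                  # (out, in)
    gram = w @ w.t()
    off_diag = gram * (1.0 - torch.eye(gram.size(0), device=w.device))
    return beta * off_diag.pow(2).sum()

# toy usage on a conv kernel
print(orthogonal_penalty(torch.randn(16, 3, 3, 3)))
```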
《Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow》
Paper Address: https://papers.cool/arxiv/1810.00821
Reference Implementation: https://github.com/akanimax/Variational_Discriminator_Bottleneck
Simple Introduction: This paper controls the fitting ability of the discriminator through an information bottleneck, thereby acting as regularization and stabilizing GAN training. For an introduction to the information bottleneck, you can refer to my article. Generally speaking, any means used to prevent overfitting in ordinary supervised training can theoretically be used in the discriminator, and the information bottleneck is one such means. Of course, as can be seen from the title, the paper is not satisfied with only applying it to GANs; besides $1024$ face generation experiments, the paper also conducted imitation learning and reinforcement learning experiments.
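A rough sketch of the bottleneck term as I understand it: the discriminator first encodes the input into a stochastic code $z \sim N(\mu(x), \sigma(x))$, classifies based on $z$, and the average KL to a standard-normal prior is kept below a budget $I_c$ via a dual variable $\beta$ (variable names below are mine):

```python
import torch

def bottleneck_kl(mu, logvar):
    """Average KL( N(mu, exp(logvar)) || N(0, I) ) of the stochastic code z."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()

# toy update of the dual variable beta that keeps the KL below the budget i_c
mu, logvar = torch.randn(32, 16), torch.randn(32, 16)    # stand-ins for encoder outputs
i_c, beta, dual_lr = 0.5, 0.0, 1e-4
kl = bottleneck_kl(mu, logvar)
# d_loss = gan_loss + beta * (kl - i_c)                  # added to the usual GAN loss
beta = max(0.0, beta + dual_lr * (kl - i_c).item())      # dual gradient ascent on beta
```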
《A Style-Based Generator Architecture for Generative Adversarial Networks》
Paper Address: https://papers.cool/arxiv/1812.04948
Reference Implementation: https://github.com/NVlabs/stylegan
Simple Introduction: This is the new GAN generator architecture released a few days ago, dubbed "GAN 2.0" by many articles. Still NVIDIA, still the PGGAN authors, still the PGGAN training scheme, but with a different generator architecture. They were already generating $1024$ images a year ago, and this time is no exception. The new generator architecture is said to borrow from style-transfer models, hence the name Style-Based Generator. Having read it, I find it is essentially the architecture of a Conditional GAN (CGAN) with the condition and the noise swapped: the noise is treated as the condition and the condition is treated as the noise, then plugged into the CGAN. Judging from the results in the paper, this conceptual shift works very well. I tried implementing it myself and it does work, though with some mode collapse, so it is probably best to wait for the official release. Incidentally, it was also the PGGAN authors who brought us the CelebA-HQ dataset a year ago; now they bring us the new FFHQ dataset. The dataset and code are said to be open-sourced in January next year; let's wait and see.
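To make the "noise as condition" reading concrete, here is a simplified sketch of the style modulation (AdaIN) that, as I understand it, injects the mapped noise vector $w$ into every layer of the generator as a per-channel scale and shift (class and variable names are mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaIN(nn.Module):
    """The mapped noise vector w sets a per-channel scale and shift at each layer."""
    def __init__(self, w_dim, channels):
        super().__init__()
        self.affine = nn.Linear(w_dim, 2 * channels)    # learned affine from w

    def forward(self, features, w):
        scale, shift = self.affine(w).chunk(2, dim=1)
        normalized = F.instance_norm(features)          # per-sample, per-channel norm
        return scale[..., None, None] * normalized + shift[..., None, None]

# toy usage: an 8-dim "style" modulating a 64-channel feature map
ada = AdaIN(w_dim=8, channels=64)
print(ada(torch.randn(2, 64, 16, 16), torch.randn(2, 8)).shape)  # (2, 64, 16, 16)
```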
Stabilize the Training First
A Word Up Front
Unlike supervised learning, where you usually only need to design a model and have enough data and computing power to get a good result, GAN has never been just about model design. It is a matter of theory, model, and optimization all at once. From the framework perspective, after the development of WGAN the theoretical framework for GANs is basically complete, and subsequent work (including my own GAN-QP) amounts to patching here and there. From the architecture perspective, DCGAN laid the foundation, and ResNet + upsampling later became one of the standard setups. As for the just-released Style-Based Generator, I won't say more; the model architecture, too, is basically mature. So what's left?
What remains is optimization, that is, the training process itself. I feel that to truly master GANs one must study the optimization process carefully, perhaps analyzing the training trajectory from a dynamical-systems perspective. This may involve the existence, uniqueness, and stability of solutions of differential equations, as well as some knowledge of stochastic optimization. In short, only by bringing the optimization process into the analysis can the theory of GANs truly become complete.
The following papers analyze GAN training problems from different angles and provide their own solutions, making them well worth reading.
Paper List
《Stabilizing Training of Generative Adversarial Networks through Regularization》
Paper Address: https://papers.cool/arxiv/1705.09367
Simple Introduction: Derives GAN regularization terms by adding noise; the derivation process theoretically applies to all $f$-GANs. Judging from the paper's results, the outcome is quite good.
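From my recollection, for the standard (JS) GAN the derived penalty weights the squared gradient norm of the discriminator logits by how confident the discriminator already is on each sample; please consult the paper for the exact form across the $f$-GAN family. A PyTorch sketch of that reading:

```python
import torch

def js_noise_regularizer(discriminator, real_x, fake_x, gamma=2.0):
    """Gradient penalty weighted by the discriminator's own confidence."""
    def weighted_term(x, weight_fn):
        x = x.clone().requires_grad_(True)
        f = discriminator(x)                                   # raw logits
        grads, = torch.autograd.grad(f.sum(), x, create_graph=True)
        grad_norm_sq = grads.pow(2).flatten(1).sum(1)
        return (weight_fn(f).squeeze(-1).pow(2) * grad_norm_sq).mean()

    penalty = weighted_term(real_x, lambda f: 1.0 - torch.sigmoid(f)) + \
              weighted_term(fake_x, lambda f: torch.sigmoid(f))
    return 0.5 * gamma * penalty
```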
《GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium》
Paper Address: https://papers.cool/arxiv/1706.08500
Simple Introduction: Proposes the TTUR training strategy. The basic idea: previously we used the same learning rate and alternated between training the discriminator and the generator a different number of times per iteration; instead, we can use two different learning rates and train each of them just once per iteration, which is clearly more time-efficient. However, while the paper contains plenty of theory, its theoretical backbone is an earlier paper, "Stochastic approximation with two time scales"; one could say it mostly re-applies that existing result, which feels a little monotonous.
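In its simplest practical form, TTUR is just two learning rates and one update of each network per iteration. A toy sketch (the models, losses, and the common $4\times 10^{-4}$ / $1\times 10^{-4}$ pairing are stand-ins, not prescriptions from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dummy 2-D models just to make the loop runnable.
generator = nn.Linear(8, 2)
discriminator = nn.Linear(2, 1)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.9))
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))

for step in range(100):
    real = torch.randn(32, 2)                     # stand-in for a real data batch
    fake = generator(torch.randn(32, 8))

    # one discriminator step (logistic GAN loss as a stand-in)
    d_loss = F.softplus(-discriminator(real)).mean() + \
             F.softplus(discriminator(fake.detach())).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # one generator step, immediately afterwards
    g_loss = F.softplus(-discriminator(fake)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```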
《Which Training Methods for GANs do actually Converge?》
Paper Address: https://papers.cool/arxiv/1801.04406
Simple Introduction: This paper was already introduced above, but I list it here again because it is a classic; it feels like required reading for anyone studying GAN training stability, approaching GAN training problems from a differential-equation perspective. In its stability analysis it mainly builds on two other papers: one is its "prequel" by the same authors, 《The Numerics of GANs》, and the other is 《Gradient descent GAN optimization is locally stable》. Both are classics as well.
《Spectral Normalization for Generative Adversarial Networks》
Paper Address: https://papers.cool/arxiv/1802.05957
Simple Introduction: Implements the $L$-constraint (Lipschitz constraint) for the discriminator via spectral normalization. This should be considered the most elegant method for implementing the $L$-constraint currently. Spectral normalization is widely used now, so it's worth mentioning. Related introductions can be found in my previous article.
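Spectral normalization divides each weight by an estimate of its largest singular value, obtained by power iteration, so that every layer is roughly 1-Lipschitz. PyTorch now ships a ready-made wrapper; the hand-rolled version below is only a toy illustration (the real implementation keeps the power-iteration vector $u$ across training steps rather than re-sampling it every call):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Ready-made wrapper: divides the layer's weight by its top singular value.
layer = nn.utils.spectral_norm(nn.Conv2d(64, 128, kernel_size=3, padding=1))

def spectral_normalize(w, n_iters=1):
    """Toy power-iteration estimate of the top singular value of w."""
    w_mat = w.view(w.size(0), -1)
    u = torch.randn(w_mat.size(0))
    for _ in range(n_iters):
        v = F.normalize(w_mat.t() @ u, dim=0)
        u = F.normalize(w_mat @ v, dim=0)
    sigma = u @ w_mat @ v
    return w / sigma

print(spectral_normalize(torch.randn(128, 64)).shape)  # torch.Size([128, 64])
```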
《Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect》
Paper Address: https://papers.cool/arxiv/1803.01541
Simple Introduction: Adds a new regularization term to WGAN-GP. The idea for this regularization term is very simple: it directly uses the $L$-constraint (in difference form) as a regularization term, similar to the extra quadratic term in the GAN-QP discriminator. Looking at the curves in the paper, training is more stable than pure WGAN-GP.
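As an illustration of what "the $L$-constraint in difference form" looks like as a penalty, here is a generic sketch: penalize any pair of inputs for which $|D(x_1)-D(x_2)|$ exceeds $L\,\|x_1-x_2\|$. This is not the paper's exact consistency term (which is built from two perturbed copies of each real sample), only the general idea:

```python
import torch

def difference_lipschitz_penalty(discriminator, x1, x2, lip_const=1.0, weight=2.0):
    """Penalize violations of |D(x1) - D(x2)| <= L * ||x1 - x2||."""
    d1, d2 = discriminator(x1), discriminator(x2)
    dist = (x1 - x2).flatten(1).norm(dim=1)
    excess = (d1 - d2).squeeze(-1).abs() - lip_const * dist
    return weight * torch.relu(excess).pow(2).mean()
```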
Welcome to Continue Adding
That's all for this paper list—exactly ten papers. Due to my limited reading volume, there might be omissions. If you have other recommendations, feel free to suggest them in the comments.
PS: Readers who only care about NLP don't need to feel frustrated; some NLP blog posts will be coming out very soon (^_^)