By 苏剑林 | April 19, 2019
In fact, the discovery of O-GAN has reached my ideal pursuit for GANs, allowing me to comfortably leap out of the deep pit of GAN research. Therefore, I will now attempt to explore broader and more diverse research directions, such as tasks in NLP that haven't been done yet, Graph Neural Networks, or other interesting things. However, before that, I want to record my previous learning results regarding GANs. In this article, let's comb through the development of GAN architectures—primarily the development of generators, as discriminators haven't changed much over time. Also, this article introduces the architectural development of GANs in the field of images and has nothing to do with SeqGAN in NLP. Furthermore, this article will not repeat basic GAN introductions.
A Word Up Front
Of course, in a broad sense, any progress in classification models in the image domain can be considered progress for the discriminator (since they are both classifiers, related technologies can be applied to the discriminator). Since image classification models essentially haven't undergone qualitative changes since ResNet, this also suggests that the ResNet structure is basically the optimal choice for the discriminator.
However, generators are different. Although relatively standard architectural designs have formed for GAN generators since DCGAN, they are far from being "finalized" or "optimal." Until recently, many works have been proposing new designs for generators. For example, SAGAN introduced Self-Attention into the generator (and discriminator), and the famous StyleGAN introduced a generator in the form of style transfer based on PGGAN. Therefore, many works indicate that there is still room for exploration in GAN generator architectures. A good generator architecture can accelerate GAN convergence or improve GAN performance.
DCGAN
When talking about the history of GAN architecture development, one must mention DCGAN; it qualifies as a landmark event in GAN history.
Background
As is well known, GANs originated in Ian Goodfellow's paper "Generative Adversarial Networks", but early GANs were limited to simple datasets like MNIST. This was because GANs had just emerged; although they attracted a wave of interest, they were still in the trial-and-error stage, with issues like model architecture, stability, and convergence still being explored. The emergence of DCGAN laid a solid foundation for solving these problems.
DCGAN comes from the article "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks". If we were to say what it did, it's actually simple: it proposed an architecture for generators and discriminators that significantly stabilized GAN training, to the point that it became the standard GAN architecture for a long time. It sounds simple, but in reality, achieving this was not easy, because there are many intuitively "reasonable" architectures. Selecting the near-optimal one from various combinations clearly required a significant amount of experimentation. Because DCGAN established the standard GAN architecture, researchers could focus more on diverse tasks thereafter, no longer agonizing over model architecture and stability, which led to the flourishing development of GANs.
Architecture Description
Having said all that, let's return to the discussion of the architecture itself. The model architecture proposed by DCGAN is roughly as follows:
- Neither the generator nor the discriminator uses pooling layers; instead, they use convolutional layers with strides. The discriminator uses regular convolution (Conv2D), while the generator uses transposed convolution (DeConv2D);
- Batch Normalization is used in both the generator and the discriminator;
- ReLU activation is used in all layers of the generator except the output layer, which uses the Tanh activation function;
- LeakyReLU activation is used in all layers of the discriminator;
- No fully connected layers are used after the convolutional layers;
- After the last convolutional layer of the discriminator, Global Pooling is not used; instead, it is directly flattened.
Actually, looking back now, this is still a relatively simple structure, embodying the idea that "simplicity is the ultimate sophistication" and serving as another reminder that good designs are usually simple ones.
The DCGAN structure diagram is as follows:
DCGAN Discriminator Architecture (Left) and Generator Architecture (Right)
Personal Summary
Several key points:
- The kernel size for convolution and transposed convolution is 4x4 or 5x5;
- The stride for convolution and transposed convolution is usually 2;
- For the discriminator, the first convolution layer generally does not use BN; after that, the pattern is "Conv2D + BN + LeakyReLU", repeated until the feature map size reaches 4x4;
- For the generator, the first layer is fully connected and then reshaped to 4x4, followed by the "DeConv2D + BN + ReLU" pattern; the last layer drops BN and uses Tanh activation, and correspondingly the input images are scaled to the range -1 to 1 (divide by 255, multiply by 2, subtract 1). A minimal sketch of this pattern follows this list.
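For concreteness, here is a minimal Keras sketch of a generator following the pattern above, targeting 64x64x3 output. The latent dimension, the channel widths, and the 5x5 kernels are illustrative choices of mine rather than anything prescribed by DCGAN:

from keras.layers import Input, Dense, Reshape, Conv2DTranspose, BatchNormalization, Activation
from keras.models import Model

z_dim = 128  # illustrative latent dimension

# Dense -> reshape to 4x4, then "DeConv2D + BN + ReLU" blocks,
# finishing with Tanh and no BN on the output layer.
z_in = Input(shape=(z_dim,))
x = Dense(4 * 4 * 512)(z_in)
x = Reshape((4, 4, 512))(x)
for ch in [256, 128, 64]:
    x = Conv2DTranspose(ch, 5, strides=2, padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
x = Conv2DTranspose(3, 5, strides=2, padding='same')(x)  # 64x64x3
x = Activation('tanh')(x)

generator = Model(z_in, x)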
Although it might look large in terms of parameter count, DCGAN is actually fast and doesn't consume much VRAM, so it is very popular. Thus, despite being old, it is still used in many tasks today. At least for rapid experimentation, it is an excellent architecture.
ResNet
As GAN research deepened, people gradually discovered some shortcomings of the DCGAN architecture.
Problems with DCGAN
The common consensus is that because DCGAN's generator uses transposed convolution, and transposed convolution inherently has "Checkerboard Artifacts," these artifacts limit the upper bound of DCGAN’s generation capability. For more details on checkerboard artifacts, check "Deconvolution and Checkerboard Artifacts" (Highly recommended, with many visual illustrations).
Illustration of checkerboard artifacts, manifested as an interlaced effect like a chessboard when zoomed in. Image from "Deconvolution and Checkerboard Artifacts"
To be precise, checkerboard artifacts are not a problem of "transposed convolution" per se, but an inherent issue with $stride > 1$, which prevents the convolution from covering the entire image evenly and produces an interlaced effect. Since transposed convolutions are usually paired with $stride > 1$, they typically take the blame. In fact, parallel to transposed convolution, dilated convolution also exhibits checkerboard artifacts, because it can be shown that under certain transformations dilated convolution is equivalent to regular convolution with $stride > 1$.
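To see the uneven coverage concretely, here is a small Keras demonstration of my own (not from any of the cited papers): a 3x3 transposed convolution with stride 2 and all-ones weights is applied to an all-ones input, so each output value simply counts how often the kernel covers that position, and the alternating counts form the checkerboard pattern.

import numpy as np
from keras.layers import Input, Conv2DTranspose
from keras.models import Model

# All-ones 4x4 input through a stride-2 transposed conv with all-ones weights:
# the output values alternate (e.g. 1/2/4), i.e. a checkerboard of overlap counts.
inp = Input(shape=(4, 4, 1))
out = Conv2DTranspose(1, 3, strides=2, padding='same',
                      kernel_initializer='ones', use_bias=False)(inp)
demo = Model(inp, out)
print(demo.predict(np.ones((1, 4, 4, 1)))[..., 0])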
On the other hand, I suspect there is another reason: DCGAN's non-linear capability might be insufficient. Readers who have examined the DCGAN architecture will notice that once the input image size is fixed, the entire DCGAN architecture is essentially fixed, including the number of layers. The only thing that seems changeable is the kernel size (channel numbers can be adjusted slightly, but the adjustment space isn't large). Changing the kernel size does change the model's non-linear capability to an extent, but it only changes the width of the model, and for deep learning, depth is often more important than width. The problem is that for DCGAN, there is no natural and direct way to increase depth.
The ResNet Model
Due to these reasons, and with the deepening of ResNet in classification problems, the application of ResNet structures in GANs was naturally considered. In fact, the mainstream generator and discriminator architectures for GANs have indeed become ResNet-based. The basic structure is illustrated below:
ResNet-based Discriminator Architecture (Left) and Generator Architecture (Right), with a single ResBlock structure in the middle
As we can see, ResNet-based GANs don't differ significantly from DCGAN in overall structure (further affirming DCGAN's foundational role). The main features are:
- Transposed convolution is removed from the generator and strided convolution from the discriminator, leaving only ordinary convolution layers;
- The convolutional kernel size is usually unified to 3x3, and convolutions form residual blocks;
- Downsampling and upsampling are achieved through AvgPooling2D and UpSampling2D respectively, whereas DCGAN used convolution/transposed convolution with $stride > 1$; UpSampling2D simply enlarges the feature map by repeating pixels;
- Since there are already residuals, ReLU can be used as the unified activation function; of course, some models still use LeakyReLU, though the difference is minimal;
- By increasing the number of convolution layers in the ResBlocks, both the non-linear capability and depth of the network can be increased, which is the flexibility of ResNet;
- The residual is generally of the form $x + f(x)$, where $f$ denotes the combination of convolution layers; however, in GANs the model initialization is typically smaller than in standard classification models, so for stability some models change it to $x + \alpha \times f(x)$, where $\alpha$ is a small constant such as 0.1 (see the sketch after this list);
- Some authors believe BN is not suitable for GANs and sometimes remove it entirely or replace it with LayerNorm, etc.
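Here is the minimal sketch referred to above: one Keras upsampling residual block in the spirit of these points. The channel handling (a 1x1 convolution on the shortcut), the placement of BN/ReLU, and the default alpha of 0.1 are my own illustrative assumptions, not a canonical recipe:

from keras.layers import Conv2D, UpSampling2D, BatchNormalization, Activation, Lambda, add

def up_resblock(x, channels, alpha=0.1):
    # Shortcut branch: upsample by repeating pixels, then a 1x1 conv to match channels.
    shortcut = UpSampling2D()(x)
    shortcut = Conv2D(channels, 1, padding='same')(shortcut)
    # Residual branch f(x): upsample, then two 3x3 convolutions.
    y = UpSampling2D()(x)
    y = Conv2D(channels, 3, padding='same')(y)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(channels, 3, padding='same')(y)
    y = BatchNormalization()(y)
    # Combine as x + alpha * f(x), with a small alpha for stability.
    y = Lambda(lambda t: alpha * t)(y)
    return Activation('relu')(add([shortcut, y]))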
Personal Summary
I haven't carefully researched which paper first applied ResNet to GANs, but prominent GANs such as PGGAN, SNGAN, and SAGAN all use ResNet. Since every convolution in these ResNet architectures has a stride of 1, the image is covered uniformly and checkerboard artifacts are avoided.
However, ResNet is not without drawbacks. Although ResNet doesn't increase the number of parameters compared to DCGAN (in some cases it even has fewer), ResNet is much slower and requires significantly more VRAM. This is because ResNet has more layers and more connections between layers, leading to more complex gradients and weaker parallelism (parallelism is possible within a layer, but consecutive layers are serial and cannot be directly parallelized). The result is that it's slower and consumes more VRAM.
Furthermore, checkerboard artifacts are actually very subtle effects; perhaps they only become noticeable during high-definition image generation. In my experiments generating 128x128 or even 256x256 faces or LSUN, I didn't visually perceive a significant difference in results between DCGAN and ResNet, but DCGAN's speed was more than 50% faster than ResNet. In terms of VRAM, DCGAN can directly run 512x512 generation (on a single 1080ti), while for ResNet, running 256x256 is already a bit strained. Therefore, unless I am trying to beat the current state-of-the-art FID scores, I wouldn't choose the ResNet architecture.
SELF-MOD
Normally, after introducing ResNet, I should introduce models like PGGAN and SAGAN, as they are landmark events in terms of resolution or IS/FID metrics. However, I don't plan to introduce them because, strictly speaking, PGGAN is not a new model architecture; it just provides a progressive training strategy that can be applied to DCGAN or ResNet architectures. And SAGAN's changes are not major; standard SAGAN just inserts a Self-Attention layer into common DCGAN or ResNet architectures, which doesn't count as a major change in generator architecture.
Next, I will introduce a relatively new improvement: the Self-Modulated Generator, from the paper "On Self Modulation for Generative Adversarial Networks", which I will simply refer to as "SELF-MOD" here.
Conditional BN
Before introducing SELF-MOD, I need to introduce something else: Conditional Batch Normalization (Conditional BN).
As is well known, BN is a common operation in deep learning, especially in the image field. To be honest, I don't like BN very much, but I must admit it plays an important role in many GAN models. Standard BN is unconditional: for an input tensor $\boldsymbol{x}_{i,j,k,l}$, where $i,j,k,l$ represent the batch, height, width, and channel dimensions of the image, the training phase is as follows:
\begin{equation}\boldsymbol{x}_{i,j,k,l}^{(out)}=\boldsymbol{\gamma}_l \times \frac{\boldsymbol{x}_{i,j,k,l}^{(in)} - \boldsymbol{\mu}_l}{\boldsymbol{\sigma}_l+\epsilon} + \boldsymbol{\beta}_l\end{equation}
where
\begin{equation}\boldsymbol{\mu}_l = \frac{1}{N}\sum_{i,j,k} \boldsymbol{x}_{i,j,k,l}^{(in)},\quad \boldsymbol{\sigma}^2_l = \frac{1}{N}\sum_{i,j,k} \left(\boldsymbol{x}_{i,j,k,l}^{(in)}-\boldsymbol{\mu}_l\right)^2\end{equation}
are the mean and variance of the input batch data, where $N = \text{batch\_size} \times \text{height} \times \text{width}$, while $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ are trainable parameters, and $\epsilon$ is a small positive constant to prevent division by zero. In addition, a set of moving average variables $\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\sigma}}^2$ is maintained for use during the testing phase.
The reason this BN is called unconditional is that parameters $\boldsymbol{\beta}, \boldsymbol{\gamma}$ are obtained purely through gradient descent and do not depend on the input. Correspondingly, if $\boldsymbol{\beta}, \boldsymbol{\gamma}$ depend on some input $\boldsymbol{y}$, it is called Conditional BN:
\begin{equation}\boldsymbol{x}_{i,j,k,l}^{(out)}=\boldsymbol{\gamma}_l(\boldsymbol{y}) \times \frac{\boldsymbol{x}_{i,j,k,l}^{(in)} - \boldsymbol{\mu}_l}{\boldsymbol{\sigma}_l+\epsilon} + \boldsymbol{\beta}_l(\boldsymbol{y})\end{equation}
Here $\boldsymbol{\beta}_l(\boldsymbol{y})$ and $\boldsymbol{\gamma}_l(\boldsymbol{y})$ are outputs of some model.
Let's first talk about how to implement it. It is actually very easy to implement Conditional BN in Keras. Reference code is as follows:
from keras.layers import BatchNormalization, Lambda
from keras import backend as K

def ConditionalBatchNormalization(x, beta, gamma):
    """To implement Conditional BN, we just need to remove the native
    beta and gamma from Keras's BatchNormalization and pass in
    external beta and gamma instead. For training stability,
    it's best to initialize beta with zeros and gamma with ones.
    """
    x = BatchNormalization(center=False, scale=False)(x)
    def cbn(x):
        x, beta, gamma = x
        for i in range(K.ndim(x) - 2):
            # Adjust the ndim of beta and gamma (broadcast over the
            # spatial axes); modify this based on the specific case
            beta = K.expand_dims(beta, 1)
            gamma = K.expand_dims(gamma, 1)
        return x * gamma + beta
    return Lambda(cbn)([x, beta, gamma])
SELF-MOD GAN
Conditional BN first appeared in the article "Modulating early visual processing by language", and was later used in "cGANs With Projection Discriminator". Currently, it has become the standard solution for Conditional GANs (cGAN), including SAGAN and BigGAN. To put it simply, cGAN uses the label $\boldsymbol{c}$ as a condition for $\boldsymbol{\beta}, \boldsymbol{\gamma}$ to form conditional BN, replacing the unconditional BN in the generator. In other words, the primary input to the generator is still the random noise $\boldsymbol{z}$, while the condition $\boldsymbol{c}$ is passed into every BN layer of the generator.
Why talk so much about conditional BN? What does it have to do with SELF-MOD?
The situation is this: SELF-MOD considers that the stability of cGAN training is better, but usually, GANs don't have labels $\boldsymbol{c}$ available. What to do? Just use the noise $\boldsymbol{z}$ itself as the label! This is the meaning of Self-Modulated—modulating oneself without relying on external labels, but achieving similar effects. Described with a formula:
\begin{equation}\boldsymbol{x}_{i,j,k,l}^{(out)}=\boldsymbol{\gamma}_l(\boldsymbol{z}) \times \frac{\boldsymbol{x}_{i,j,k,l}^{(in)} - \boldsymbol{\mu}_l}{\boldsymbol{\sigma}_l+\epsilon} + \boldsymbol{\beta}_l(\boldsymbol{z})\end{equation}
In the original paper, $\boldsymbol{\beta}(\boldsymbol{z})$ is a two-layer fully connected network:
\begin{equation}\boldsymbol{\beta}(\boldsymbol{z})=\boldsymbol{W}^{(2)}\max\left(0, \boldsymbol{W}^{(1)}\boldsymbol{z}+\boldsymbol{b}^{(1)}\right)\end{equation}
$\boldsymbol{\gamma}(\boldsymbol{z})$ is the same. Looking at the official source code, I found that the dimension of the middle layer can be set smaller, such as 32, so it won't significantly increase the parameter count. This is the generator for an unconditional GAN with a SELF-MOD structure.
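Combined with the ConditionalBatchNormalization function given earlier, a SELF-MOD layer can be sketched as follows. The hidden width of 32 follows the observation about the official code above; the function name self_mod_bn and the zero/one initializations (so that beta starts at 0 and gamma at 1) are my own choices:

from keras.layers import Dense

def self_mod_bn(x, z, hidden_dim=32):
    # beta(z) and gamma(z) are two-layer dense networks conditioned on the noise z itself.
    channels = K.int_shape(x)[-1]
    h_beta = Dense(hidden_dim, activation='relu')(z)
    beta = Dense(channels, kernel_initializer='zeros')(h_beta)
    h_gamma = Dense(hidden_dim, activation='relu')(z)
    gamma = Dense(channels, kernel_initializer='zeros',
                  bias_initializer='ones')(h_gamma)
    return ConditionalBatchNormalization(x, beta, gamma)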
Personal Summary
DCGAN generator in SELF-MOD form. ResNet-based versions are similar, just replacing BN with the SELF-MOD version.
I combined the SELF-MOD structure with my O-GAN experiments and found that the convergence speed increased by nearly 50%, and the final FID and reconstruction results were better. The excellence of SELF-MOD is clear. I even have a faint feeling that O-GAN and SELF-MOD are a great match (haha, maybe just a narcissistic illusion).
Keras reference code is here:
https://github.com/bojone/o-gan/blob/master/o_gan_celeba_sm_4x4.py
Additionally, even in cGAN, the SELF-MOD structure can be used. Standard cGAN uses the condition $\boldsymbol{c}$ as the BN input condition; SELF-MOD uses both $\boldsymbol{z}$ and $\boldsymbol{c}$ as BN input conditions. Reference usage:
\begin{equation}\begin{aligned}\boldsymbol{\beta}(\boldsymbol{z},\boldsymbol{c}) =& \boldsymbol{W}^{(2)}\max\left(0, \boldsymbol{W}^{(1)}\boldsymbol{z}'+\boldsymbol{b}^{(1)}\right)\\ \boldsymbol{z}' =& \boldsymbol{z}+\text{E}(\boldsymbol{c})+\text{E}'(\boldsymbol{c})\otimes \boldsymbol{z}\end{aligned}\end{equation}
where $\text{E}, \text{E}'$ are two Embedding layers. In cases where the number of categories is small, they can be directly understood as fully connected layers. $\boldsymbol{\gamma}$ is analogous.
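A minimal Keras sketch of this $\boldsymbol{z}'$ construction, assuming integer class labels and using Embedding layers for $\text{E}$ and $\text{E}'$ (the sizes here are purely illustrative):

from keras.layers import Input, Embedding, Flatten, Lambda

z_dim, num_classes = 128, 10  # illustrative sizes

z_in = Input(shape=(z_dim,))
c_in = Input(shape=(1,), dtype='int32')  # integer class label

e1 = Flatten()(Embedding(num_classes, z_dim)(c_in))  # E(c)
e2 = Flatten()(Embedding(num_classes, z_dim)(c_in))  # E'(c)
# z' = z + E(c) + E'(c) * z, which then conditions beta and gamma as above
z_prime = Lambda(lambda t: t[0] + t[1] + t[2] * t[0])([z_in, e1, e2])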
Other Architectures
Readers might find it strange: why haven't the famous BigGAN and StyleGAN been mentioned?
In fact, BigGAN didn't make particularly unique improvements to the model architecture, and even the authors themselves admit it's just about "brute force achieving miracles." As for StyleGAN, it indeed improved the model architecture. However, once you understand the previously mentioned SELF-MOD, StyleGAN is not hard to understand; one can even view StyleGAN as a variant of SELF-MOD.
AdaIN
The core of StyleGAN is something called AdaIN (Adaptive Instance Normalization), which originates from the style transfer paper "Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization". It is actually very similar to Conditional BN, perhaps even simpler:
\begin{equation}\boldsymbol{x}_{i,j,k,l}^{(out)}=\boldsymbol{\gamma}_l(\boldsymbol{y}) \times \frac{\boldsymbol{x}_{i,j,k,l}^{(in)} - \boldsymbol{\mu}_{i,l}}{\boldsymbol{\sigma}_{i,l}+\epsilon} + \boldsymbol{\beta}_l(\boldsymbol{y})\end{equation}
The difference from Conditional BN is: Conditional BN uses $\boldsymbol{\mu}_{l}$ and $\boldsymbol{\sigma}_{l}$, whereas AdaIN uses $\boldsymbol{\mu}_{i,l}$ and $\boldsymbol{\sigma}_{i,l}$. In other words, AdaIN calculates statistical features within a single sample rather than across a batch. Therefore, AdaIN does not need to maintain moving averages of mean and variance, making it simpler than conditional BN.
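For comparison with the Conditional BN code above, AdaIN can be sketched in Keras as below; since the statistics are computed per sample, no BatchNormalization layer or moving averages are needed. The function name adain and the epsilon value are my own choices:

def adain(x, beta, gamma, epsilon=1e-6):
    # Normalize each sample with its own per-channel mean/std over the spatial
    # axes, then modulate with the externally supplied beta and gamma.
    def _adain(inputs):
        x, beta, gamma = inputs
        mean = K.mean(x, axis=[1, 2], keepdims=True)
        std = K.std(x, axis=[1, 2], keepdims=True)
        x = (x - mean) / (std + epsilon)
        beta = K.expand_dims(K.expand_dims(beta, 1), 1)
        gamma = K.expand_dims(K.expand_dims(gamma, 1), 1)
        return x * gamma + beta
    return Lambda(_adain)([x, beta, gamma])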
StyleGAN
DCGAN generator in StyleGAN form. ResNet-based versions are similar; the main change is replacing Conditional BN with AdaIN.
With SELF-MOD and AdaIN, we can clarify StyleGAN. The main change in StyleGAN is also the generator. Compared to SELF-MOD, its differences are:
- Cancel the noise input at the top and replace it with a trainable constant vector;
- Replace all conditional BN with AdaIN;
- The input condition for AdaIN is produced by transforming the noise $\boldsymbol{z}$ with a multi-layer MLP, whose output is then projected into $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ for different AdaIN layers using different transformation matrices.
It's that simple~
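As a rough sketch of these three points (my own simplification, not the official StyleGAN code), the mapping network and one generator block might look like this in Keras, reusing the adain function above; the trainable constant input is only indicated in a comment, since it requires a small custom layer:

from keras.layers import Input, Dense, Conv2D, UpSampling2D, Activation

z_dim, mapping_dim = 128, 128  # illustrative sizes

# Mapping network: a multi-layer MLP that turns the noise z into a style vector w.
z_in = Input(shape=(z_dim,))
w = z_in
for _ in range(4):
    w = Dense(mapping_dim, activation='relu')(w)

def style_block(x, w, channels):
    # One StyleGAN-style block: upsample + conv, then AdaIN whose beta and gamma
    # are separate linear projections of the shared style vector w.
    # (In StyleGAN proper, x starts from a trainable 4x4 constant rather than from z.)
    x = UpSampling2D()(x)
    x = Conv2D(channels, 3, padding='same')(x)
    beta = Dense(channels, kernel_initializer='zeros')(w)
    gamma = Dense(channels, kernel_initializer='zeros',
                  bias_initializer='ones')(w)
    x = adain(x, beta, gamma)
    return Activation('relu')(x)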
Personal Summary
I personally experimented with a simplified StyleGAN-style DCGAN and found that it could converge and the results were decent, though with slight mode collapse. Since the official StyleGAN uses the PGGAN training scheme and I did not, I wonder whether StyleGAN needs to be paired with PGGAN to train well; I don't have an answer yet. However, in my experiments, SELF-MOD was much easier to train and gave better results than StyleGAN.
Article Summary
This article briefly combed through the changes in GAN model architectures, mainly from DCGAN and ResNet to SELF-MOD, etc. These are all quite distinct changes; some subtle improvements might have been ignored.
For a long time, there have been few works that drastically changed GAN model architectures, and SELF-MOD and StyleGAN have once again sparked interest in architectural changes. The paper "Deep Image Prior" also showed that the prior knowledge contained within the model architecture itself is a significant reason image generation models can succeed. Proposing better model architectures means proposing better prior knowledge, which naturally benefits image generation.
The architectures mentioned in this article are based on my own experiments, and the evaluations made are based on my personal experience and aesthetic views. If there are any inaccuracies, please feel free to correct them~
Original source: https://kexue.fm/archives/6549