By 苏剑林 | November 15, 2021
Currently, the mainstream methods for implementing WGAN include Weight Clipping, Spectral Normalization, and Gradient Penalty. This article introduces a new implementation scheme: Gradient Normalization (GN). This scheme originates from two interesting papers, namely "Gradient Normalization for Generative Adversarial Networks" and "GraN-GAN: Piecewise Gradient Normalization for Generative Adversarial Networks".
What makes them interesting? As you can see from the titles, the two papers overlap heavily; one might even guess they share authors. In fact, they come from two different teams working at roughly the same time, one published at ICCV and the other at WACV, and starting from the same assumptions they derived nearly identical solutions. The overlap in content is so high that I kept thinking they were the same paper. Truly, coincidences are everywhere~
Basic Review
We have introduced WGAN many times before, such as in "The Art of Mutual Confrontation: From Zero to WGAN-GP" and "From Wasserstein Distance and Duality Theory to WGAN", so we will not repeat the details here. Briefly, the iterative form of WGAN is:
\begin{equation}\min_G \max_{\Vert D\Vert_{L}\leq 1} \mathbb{E}_{x\sim p(x)}\left[D(x)\right] - \mathbb{E}_{z\sim q(z)}\left[D(G(z))\right]\end{equation}
The key here is that optimizing the discriminator $D$ is a constrained optimization problem: $D$ must satisfy the Lipschitz constraint $\Vert D\Vert_{L}\leq 1$ throughout the optimization process. The difficulty of implementing WGAN therefore lies in how to impose this constraint on $D$.
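As a fixed point of reference for what follows, here is a minimal PyTorch sketch of the two training losses implied by the objective above. How the Lipschitz constraint on $D$ is enforced is deliberately left out, since that is the whole topic of this article; the networks `D` and `G` are assumed to be ordinary modules defined elsewhere:

```python
import torch

def d_loss(D, G, x_real, z):
    # The discriminator ascends E[D(x)] - E[D(G(z))],
    # i.e. descends the negative of that quantity.
    return -(D(x_real).mean() - D(G(z).detach()).mean())

def g_loss(D, G, z):
    # The generator ascends E[D(G(z))], i.e. descends -E[D(G(z))].
    return -D(G(z)).mean()
```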
To recap: if there exists some constant $C$ such that $\Vert f(x)-f(y)\Vert \leq C\Vert x - y\Vert$ for any $x, y$ in the domain, we say $f(x)$ satisfies a Lipschitz constraint (L-constraint), and the minimum such $C$ is called the Lipschitz constant (L-constant), denoted $\Vert f\Vert_{L}$. The WGAN discriminator thus needs two things: 1. $D$ must satisfy the L-constraint; 2. its L-constant must not exceed 1.
In fact, current mainstream neural network models take the form of "linear combinations + nonlinear activation functions," and mainstream activation functions are "near-linear" ones such as ReLU, LeakyReLU, and Softplus, whose derivatives have absolute value at most 1. So mainstream models already satisfy the L-constraint; the key is how to ensure the L-constant does not exceed 1 (in fact, it need not be exactly 1; it just needs to be bounded by some fixed constant).
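To make the composition argument concrete, here is a one-line sketch of the standard bound, assuming every activation $\phi$ satisfies $|\phi'|\leq 1$ (as ReLU and LeakyReLU do):
\begin{equation}D = f_n\circ \dots \circ f_1,\quad f_i(x) = \phi(W_i x + b_i)\quad\Rightarrow\quad \Vert D\Vert_{L} \leq \prod_{i=1}^n \Vert f_i\Vert_{L} \leq \prod_{i=1}^n \Vert W_i\Vert_2\end{equation}
where $\Vert W_i\Vert_2$ is the spectral norm of $W_i$; controlling each factor layer by layer is exactly what the parameter-based schemes described next do.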
Scheme Introduction
The ideas behind Weight Clipping and Spectral Normalization are similar; they both constrain parameters to ensure that the L-constant of each layer in the model is bounded, thereby bounding the total L-constant. Gradient Penalty, on the other hand, notes that a sufficient condition for $\Vert D\Vert_{L}\leq 1$ is $\Vert \nabla_x D(x)\Vert \leq 1$, so it imposes a "soft constraint" through the penalty term $(\Vert \nabla_x D(x)\Vert - 1)^2$.
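For concreteness, here is a minimal PyTorch sketch of that penalty term, following the usual WGAN-GP recipe of evaluating the gradient at random interpolates between real and fake samples (the interpolation trick and the coefficient `lambda_gp` are standard WGAN-GP choices, not details fixed by this article):

```python
import torch

def gradient_penalty(D, x_real, x_fake, lambda_gp=10.0):
    """Soft constraint lambda * (||grad_x D(x)|| - 1)^2 at random interpolates."""
    # One uniform interpolation coefficient per sample, broadcast over the rest.
    eps = torch.rand(x_real.size(0), *([1] * (x_real.dim() - 1)),
                     device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake.detach()).requires_grad_(True)
    # create_graph=True so the penalty itself can be differentiated
    # w.r.t. the discriminator's parameters.
    grads, = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```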
The Gradient Normalization introduced in this article is based on the same sufficient condition: it uses the gradient to transform $D(x)$ into a $\hat{D}(x)$ that automatically satisfies $\Vert\nabla_x \hat{D}(x)\Vert \leq 1$. Specifically, we usually use ReLU or LeakyReLU as the activation function, and under these activations $D(x)$ is a "piecewise linear function": except at the region boundaries, $D(x)$ is linear within each local region, and correspondingly $\nabla_x D(x)$ is a locally constant vector.
Thus, Gradient Normalization simply takes $\hat{D}(x)=D(x)/\Vert \nabla_x D(x)\Vert$. Since $\Vert \nabla_x D(x)\Vert$ is constant within each linear region, it can be pulled outside the gradient, giving:
\begin{equation}\Vert\nabla_x \hat{D}(x)\Vert = \left\Vert \nabla_x \left(\frac{D(x)}{\Vert \nabla_x D(x)\Vert}\right)\right\Vert=\left\Vert \frac{\nabla_x D(x)}{\Vert \nabla_x D(x)\Vert}\right\Vert=1\end{equation}
Of course, this might lead to division-by-zero errors, so the two papers proposed different solutions. The first paper (ICCV) directly added $|D(x)|$ to the denominator, which also guarantees the boundedness of the function:
\begin{equation} \hat{D}(x) = \frac{D(x)}{\Vert \nabla_x D(x)\Vert + |D(x)|}\in [-1,1]\end{equation}
The second paper (WACV) instead added an $\epsilon$ to keep the denominator away from zero:
\begin{equation} \hat{D}(x) = \frac{D(x)\cdot \Vert \nabla_x D(x)\Vert}{\Vert \nabla_x D(x)\Vert^2 + \epsilon}\end{equation}
The second paper also mentioned testing $\hat{D}(x)=D(x)/(\Vert \nabla_x D(x)\Vert+\epsilon)$, noting the results were slightly worse but similar.
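In code, the transformation is easy to write down. Below is a minimal PyTorch sketch of the first (ICCV) variant $\hat{D}(x)=D(x)/(\Vert\nabla_x D(x)\Vert+|D(x)|)$; treat it as my reading of the formula rather than the authors' reference implementation:

```python
import torch

def grad_normalized_D(D, x):
    """Wrap a discriminator into D_hat = D / (||grad_x D|| + |D|)."""
    if not x.requires_grad:          # real samples are leaves without grad;
        x = x.requires_grad_(True)   # generator outputs already carry grad
    d = D(x).flatten()               # per-sample scalar outputs, shape (B,)
    # create_graph=True keeps the graph, so both players can
    # backpropagate through D_hat (hence second-order gradients).
    grads, = torch.autograd.grad(d.sum(), x, create_graph=True)
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return d / (grad_norm + d.abs())
```

Note that `torch.autograd.grad(..., create_graph=True)` sits inside every forward pass of $\hat{D}$, whether it is the discriminator or the generator being updated; this is the source of the extra time and memory cost discussed below.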
Experimental Results
Now let's look at the experiments. Naturally, since both papers were accepted to top conferences, the reported results are positive; some of them are shown below:
[Experimental results table from the ICCV paper]
[Experimental results table from the WACV paper]
[Generation effect demonstration from the ICCV paper]
Remaining Doubts
The results look good, the theory seems sound, and it was recognized by two top conferences simultaneously, which undoubtedly makes it appear to be good work. However, my confusion has only just begun.
The most important issue is this: under the piecewise linear assumption, the gradient of $D(x)$ is constant within each region but discontinuous across region boundaries (if the gradient were continuous everywhere, a piecewise-constant gradient would have to be globally constant, making $D(x)$ linear rather than piecewise linear). Yet $D(x)$ itself is continuous, so $\hat{D}(x)=D(x)/\Vert \nabla_x D(x)\Vert$ is a continuous function divided by a discontinuous one, i.e., itself a discontinuous function!
So the question arises: it seems rather incredible that a discontinuous function can serve as a discriminator. Note that this discontinuity is not confined to a few isolated boundary points; it is a jump between entire regions, so it cannot simply be ignored. On Reddit, other readers have raised the same doubt, but so far the authors have not provided a convincing explanation (Link).
Another issue: if the piecewise linear assumption were really what makes the method work, then using $\hat{D}(x)=\left\langle \frac{\nabla_x D(x)}{\Vert \nabla_x D(x)\Vert}, x\right\rangle$ as the discriminator should behave almost the same in theory (within each linear region it differs from $D(x)/\Vert \nabla_x D(x)\Vert$ only by a constant term). In my experiments, however, such a $\hat{D}(x)$ performs extremely poorly. So one possibility is that Gradient Normalization is indeed effective, but for reasons less simple than the analysis in the two papers; perhaps more complex mechanisms are at play that we have not yet discovered. It is also possible that our understanding of GANs is still far from sufficient, i.e., the continuity requirements on the discriminator may be quite different from what we imagine.
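For reference, the alternative discriminator mentioned above can be sketched in the same style (again my own construction of the formula, with a small `eps` guarding the division, not code from either paper):

```python
import torch

def unit_grad_inner_product_D(D, x, eps=1e-12):
    """D_hat(x) = <grad_x D / ||grad_x D||, x>."""
    if not x.requires_grad:
        x = x.requires_grad_(True)
    d = D(x).flatten()
    grads, = torch.autograd.grad(d.sum(), x, create_graph=True)
    g = grads.flatten(1)
    unit = g / (g.norm(2, dim=1, keepdim=True) + eps)
    # Inner product of the unit gradient with the (flattened) input.
    return (unit * x.flatten(1)).sum(dim=1)
```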
Finally, in my own experiments Gradient Normalization did not perform as well as Gradient Penalty. Moreover, Gradient Penalty requires second-order gradients only when training the discriminator, whereas Gradient Normalization requires them when training both the generator and the discriminator, so Gradient Normalization is noticeably slower and consumes noticeably more memory. In my personal experience, it is not a particularly friendly scheme.
Summary
This article introduced a new scheme for implementing WGAN—Gradient Normalization. The scheme is simple in form and the results reported in the papers are quite good, but I personally believe there are still many points worth questioning.