By 苏剑林 | March 06, 2019
In this post, I would like to share a recent work: by simply modifying the original GAN model, we can turn the discriminator into an encoder. This allows the GAN to simultaneously possess generation and encoding capabilities with almost no increase in training cost. This new model is called O-GAN (Orthogonal GAN) because it is based on the orthogonal decomposition of the discriminator's degrees of freedom, representing the most thorough utilization of the discriminator's capacity.
Arxiv Link: https://papers.cool/arxiv/1903.01931
Open Source Code: https://github.com/bojone/o-gan
Background
I have been immersed in the field of generative models for a long time now. Not only have I written several blog posts about generative models, but I have also submitted a few small papers related to generative models to Arxiv. Since diving into this "pit," my understanding of generative models, GANs in particular, has deepened, and at times I felt I had made some incremental improvements (hence the Arxiv submissions). In reality, though, those were minor adjustments of little significance.
However, the model I am introducing today is one that I consider more valuable than the sum of all my previous GAN-related work: it provides the simplest solution currently available to train a GAN model with encoding capabilities.
Nowadays, GANs have become increasingly mature and massive. State-of-the-art models like BigGAN and StyleGAN are well-known and widely used. However, these advanced GAN models currently only function as generators; they lack an encoder function. This means they can continuously generate new images but cannot extract features from existing ones.
Of course, there has been research on GANs with encoders, and I have even written about the topic on this blog (see "BiGAN-QP: A Simple and Clear Encoding & Generation Model"). But regardless of whether they have encoding capabilities, most GANs share a common trait: once training is complete, the discriminator becomes useless. Theoretically, as training progresses, the discriminator degenerates (e.g., it tends towards a constant).
Anyone who has worked with GANs knows that the complexity of the discriminator and generator networks is comparable (and if there is an encoder, its complexity is similar as well). Discarding the discriminator after training is a serious waste of a large network! Generally, the architecture of a discriminator is very similar to an encoder. A natural idea is: can the discriminator and encoder share most of their weights? To my knowledge, among all past GAN-related models, only IntroVAE achieved this. However, IntroVAE's approach is relatively complex, and there is currently no publicly available code that successfully reproduces IntroVAE (I tried to reproduce it myself but failed).
The solution proposed in this article is extremely simple: by slightly modifying the original GAN model, the discriminator can be transformed into an encoder, with almost no increase in complexity or computational cost.
Model
Without further ado, let's introduce the model. First, let's write the general GAN formulation:
\begin{equation}\begin{aligned}D =& \mathop{\text{argmin}}_{D} \mathbb{E}_{x\sim p(x), z\sim q(z)}\Big[f(D(x)) + g(D(G(z)))\Big]\\
G =& \mathop{\text{argmin}}_{G} \mathbb{E}_{z\sim q(z)}\Big[h(D(G(z)))\Big]
\end{aligned}\end{equation}
To avoid confusion, I will describe the symbols here. Here $x\in \mathbb{R}^{n_x}, z\in \mathbb{R}^{n_z}$, $p(x)$ is the "evidence distribution" of the real image set, and $q(z)$ is the distribution of noise (in this article, it is an $n_z$-dimensional standard normal distribution). $G: \mathbb{R}^{n_z} \to \mathbb{R}^{n_x}$ and $D: \mathbb{R}^{n_x} \to \mathbb{R}$ are the generator and discriminator, respectively. $f, g, h$ are certain deterministic functions; different GANs correspond to different $f, g, h$. Sometimes we add normalization or regularization methods, such as spectral normalization or gradient penalties; for simplicity, these are not explicitly written out.
Next, we define several vector operators:
\begin{equation}\text{avg}(z)=\frac{1}{n_z}\sum_{i=1}^{n_z} z_i,\quad \text{std}(z)=\sqrt{\frac{1}{n_z}\sum_{i=1}^{n_z} (z_i-\text{avg}(z))^2}, \quad \mathcal{N}(z)=\frac{z - \text{avg}(z)}{\text{std}(z)}\end{equation}
This may look sophisticated, but it is simply the mean and standard deviation of the vector's elements, together with the standardized vector. In particular, when $n_z \geq 3$ (which holds for any GAN of practical interest), $[\text{avg}(z), \text{std}(z), \mathcal{N}(z)]$ are functionally independent, so this is essentially an "orthogonal decomposition" of the original vector $z$.
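For concreteness, here is a minimal NumPy sketch of these three operators (the function names are my own, not from the official code):

```python
import numpy as np

def avg(z):
    # mean of the vector's elements
    return np.mean(z, axis=-1, keepdims=True)

def std(z):
    # standard deviation of the vector's elements (biased, i.e. divide by n_z)
    return np.std(z, axis=-1, keepdims=True)

def normalize(z):
    # N(z): subtract the element-wise mean, divide by the element-wise std
    return (z - avg(z)) / std(z)

z = np.random.randn(4, 128)                      # a batch of 4 latent vectors, n_z = 128
z_norm = normalize(z)
print(avg(z_norm).ravel(), std(z_norm).ravel())  # ~0 and ~1 by construction
```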
As mentioned, the discriminator's structure is similar to an encoder, except that an encoder outputs a vector while a discriminator outputs a scalar. Thus, I can write the discriminator as a composite function:
\begin{equation}D(x)\triangleq T(E(x))\end{equation}
Here $E$ is a mapping from $\mathbb{R}^{n_x} \to \mathbb{R}^{n_z}$, and $T$ is a mapping from $\mathbb{R}^{n_z} \to \mathbb{R}$. It is easy to imagine that the number of parameters in $E$ will far exceed those in $T$. We want $E(x)$ to have encoding capabilities.
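As a purely illustrative PyTorch sketch (the layers and sizes below are placeholders, not the architecture used in the paper), the split could look like this:

```python
import torch
import torch.nn as nn

n_z = 128

# E: R^{n_x} -> R^{n_z}, a convolutional encoder holding most of the parameters
E = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Flatten(),
    nn.LazyLinear(n_z),   # projects to an n_z-dimensional code
)

# T: R^{n_z} -> R, a tiny head with very few parameters
T = nn.Linear(n_z, 1)

def D(x):
    # the full discriminator is just the composition T(E(x))
    return T(E(x))
```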
How do we achieve this? We just need to add a loss: the Pearson correlation coefficient!
\begin{equation}\begin{aligned}T,E =& \mathop{\text{argmin}}_{T,E} \mathbb{E}_{x\sim p(x), z\sim q(z)}\Big[f(T(E(x))) + g(T(E(G(z)))) - \lambda \rho(z, E(G(z)))\Big]\\
G =& \mathop{\text{argmin}}_{G} \mathbb{E}_{z\sim q(z)}\Big[h(T(E(G(z)))) - \lambda \rho(z, E(G(z)))\Big]
\end{aligned}\end{equation}
Where
\begin{equation}\rho(z, \hat{z})=\frac{\sum\limits_{i=1}^{n_z} (z_i - \text{avg}(z))(\hat{z}_i - \text{avg}(\hat{z}))/n_z}{\text{std}(z)\times \text{std}(\hat{z})}=\cos(\mathcal{N}(z), \mathcal{N}(\hat{z}))\end{equation}
If $\lambda=0$, it is just an ordinary GAN (with the discriminator decomposed into $E$ and $T$). By adding this correlation coefficient, intuitively, we hope that $z$ and $E(G(z))$ are as linearly correlated as possible. Why add it this way? We will discuss this at the end.
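A minimal PyTorch sketch of this correlation term (the helper name `pearson_rho` is mine) might be:

```python
import torch

def pearson_rho(z, z_hat, eps=1e-8):
    # Pearson correlation = cosine similarity of the mean-centered vectors
    z = z - z.mean(dim=-1, keepdim=True)
    z_hat = z_hat - z_hat.mean(dim=-1, keepdim=True)
    num = (z * z_hat).sum(dim=-1)
    den = z.norm(dim=-1) * z_hat.norm(dim=-1) + eps
    return num / den

# sketch of how it enters the two objectives (lam is the weight lambda):
# disc_loss = f_term + g_term - lam * pearson_rho(z, E(G(z))).mean()
# gen_loss  = h_term          - lam * pearson_rho(z, E(G(z))).mean()
```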
Clearly, this correlation coefficient can be embedded into any existing GAN. The modifications are minimal (split the discriminator, add a loss). I have experimented with various GANs and found that they can all be trained successfully.
In this way, the GAN discriminator $D$ is split into $E$ and $T$, and $E$ becomes an encoder. Most of the discriminator's parameters are now utilized. However, $T$ remains. After training, $T$ is still useless. Although the number of parameters in $T$ is relatively small and the waste is minimal, for a "perfectionist" like me, it is still unsettling.
Can we eliminate $T$ as well? After multiple trials, the conclusion is: we can! Because we can directly use $\text{avg}(E(x))$ as the discriminator:
\begin{equation}\begin{aligned}E =& \mathop{\text{argmin}}_{E} \mathbb{E}_{x\sim p(x), z\sim q(z)}\Big[f(\text{avg}(E(x))) + g(\text{avg}(E(G(z)))) - \lambda \rho(z, E(G(z)))\Big]\\
G =& \mathop{\text{argmin}}_{G} \mathbb{E}_{z\sim q(z)}\Big[h(\text{avg}(E(G(z)))) - \lambda \rho(z, E(G(z)))\Big]
\end{aligned}\label{eq:simplest}\end{equation}
By doing this, the entire model no longer has $T$, only the pure generator $G$ and encoder $E$. There is no redundancy in the entire model at all!
Experiments
Why does this work? We will discuss that at the end. First, let's look at the experimental results. After all, if the experiments aren't good, no matter how beautiful the theory is, it has no meaning.
Note that theoretically, the correlation coefficient term introduced here cannot improve the quality of the generative model. Therefore, there are two main goals for the experiment: 1. Will this extra loss damage the quality of the original generative model? 2. Can this extra loss truly turn $E$ into an effective encoder?
I mentioned that this method can be embedded into any GAN. In this experiment, the GAN used is a variant of my previous GAN-QP:
\begin{equation}\begin{aligned}E =& \mathop{\text{argmin}}_{E} \mathbb{E}_{x\sim p(x), z\sim q(z)}\Big[\text{avg}(E(x)) - \text{avg}(E(G(z))) + \lambda_1 R_{x,z} - \lambda_2 \rho(z, E(G(z)))\Big]\\
G =& \mathop{\text{argmin}}_{G} \mathbb{E}_{z\sim q(z)}\Big[\text{avg}(E(G(z))) - \lambda_2 \rho(z, E(G(z)))\Big]
\end{aligned}\label{eq:simplest-2}\end{equation}
Where
\begin{equation}R_{x,z} = \frac{[\text{avg}(E(x)) - \text{avg}(E(G(z)))]^2}{\Vert x - G(z)\Vert^2}\end{equation}
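To make the training procedure concrete, here is a hedged PyTorch sketch of one alternating update for this variant, reusing the hypothetical `E`, `G`, `n_z` and `pearson_rho` from the earlier snippets; the hyper-parameter values are illustrative, not the settings used in the paper:

```python
import torch

lam1, lam2 = 0.5, 0.5            # placeholder weights

def avg(v):
    return v.mean(dim=-1)        # avg(E(x)) over the n_z components

def e_step(x_real, opt_E):
    z = torch.randn(x_real.size(0), n_z)
    x_fake = G(z).detach()                       # no gradient into G here
    e_real, e_fake = E(x_real), E(x_fake)
    diff = avg(e_real) - avg(e_fake)
    # GAN-QP style quadratic penalty R_{x,z}
    r = diff.pow(2) / (x_real - x_fake).flatten(1).pow(2).sum(dim=-1)
    loss = (diff + lam1 * r - lam2 * pearson_rho(z, e_fake)).mean()
    opt_E.zero_grad(); loss.backward(); opt_E.step()

def g_step(batch_size, opt_G):
    z = torch.randn(batch_size, n_z)
    e_fake = E(G(z))
    loss = (avg(e_fake) - lam2 * pearson_rho(z, e_fake)).mean()
    opt_G.zero_grad(); loss.backward(); opt_G.step()
```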
Regarding the datasets, this experiment was quite comprehensive, covering four datasets: CelebA HQ, FFHQ, LSUN-church (outdoor), and LSUN-bedroom, all at $128\times 128$ resolution (I also did some $256\times 256$ experiments with good results, but didn't include them in the paper). The model architecture was the usual DCGAN. For other details, please refer directly to the paper or code.
Sample results (images omitted here): random generation, reconstruction, and linear interpolation on CelebA HQ, FFHQ, LSUN-church, and LSUN-bedroom.
Regardless of what you think, I think it's quite good~
1. The Random Generation results are quite good, indicating that the newly introduced correlation coefficient term did not degrade the generation quality;
2. The Reconstruction results are quite good, indicating that $E(x)$ indeed extracts the main features of $x$;
3. The Linear Interpolation results are quite good, indicating that $E(x)$ has indeed learned features that are close to being linearly separable.
Theory
Now that we've checked the results, let's discuss the theory.
Obviously, the role of this extra reconstruction term is to make $z$ as "correlated" as possible with $E(G(z))$. For this, I believe most readers' first thought would be the MSE loss $\Vert z - E(G(z))\Vert^2$ rather than the $\rho(z, E(G(z)))$ used in this article. But in fact, if we add $\Vert z - E(G(z))\Vert^2$, the training will almost always fail. Why does $\rho(z, E(G(z)))$ succeed then?
According to the previous definition, $E(x)$ outputs an $n_z$-dimensional vector, but $T(E(x))$ only outputs a scalar. That is to say, $E(x)$ provides $n_z$ degrees of freedom, and as a discriminator, $T(E(x))$ must occupy at least one degree of freedom (theoretically, it only needs one). If we minimize $\Vert z - E(G(z))\Vert^2$, the training process will force $E(G(z))$ to be exactly equal to $z$, meaning all $n_z$ degrees of freedom are occupied by it. No degrees of freedom remain for the discriminator to distinguish between real and fake. Therefore, adding $\Vert z - E(G(z))\Vert^2$ will likely lead to failure. But $\rho(z, E(G(z)))$ is different; $\rho(z, E(G(z)))$ has nothing to do with $\text{avg}(E(G(z)))$ and $\text{std}(E(G(z)))$ (changing only the $\text{avg}$ and $\text{std}$ of vector $E(G(z))$ won't change the value of $\rho(z, E(G(z)))$ because $\rho$ itself subtracts the mean and divides by the standard deviation first). This means that even if we maximize $\rho(z, E(G(z)))$, we still leave at least two degrees of freedom for the discriminator.
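This invariance is easy to verify numerically. The following NumPy check (mine, for illustration only) shifts and rescales the "reconstruction" and shows that $\rho$ does not change:

```python
import numpy as np

def rho(z, z_hat):
    # Pearson correlation as the cosine of the centered vectors
    z, z_hat = z - z.mean(), z_hat - z_hat.mean()
    return z @ z_hat / (np.linalg.norm(z) * np.linalg.norm(z_hat))

z = np.random.randn(128)
z_hat = 0.7 * z + 0.3 * np.random.randn(128)   # an imperfect "reconstruction"
print(rho(z, z_hat))
print(rho(z, 5.0 * z_hat - 3.0))               # identical: avg and std remain free for the discriminator
```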
This is why in $\eqref{eq:simplest}$, we can directly use $\text{avg}(E(x))$ as the discriminator, as it is not affected by $\rho(z, E(G(z)))$.
A similar example is InfoGAN. InfoGAN also includes a module that reconstructs input information, and this module shares most weights with the discriminator (encoder). Since InfoGAN actually only reconstructs part of the input information, the reconstruction term does not occupy all the degrees of freedom of the encoder. Thus, what InfoGAN does is reasonable—as long as at least one degree of freedom is left for the discriminator.
There is another fact that helps with understanding. During adversarial training, the noise follows $z \sim \mathcal{N}(0, I_{n_z})$. When the generator is well trained, theoretically $G(z)$ will be a realistic image for any $z \sim \mathcal{N}(0, I_{n_z})$. In fact, the converse also holds: if $G(z)$ is a realistic image, then $z$ should come from $\mathcal{N}(0, I_{n_z})$ (i.e., lie in the high-probability region of $\mathcal{N}(0, I_{n_z})$). Going one step further, for $z \sim \mathcal{N}(0, I_{n_z})$ we have $\text{avg}(z) \approx 0$ and $\text{std}(z) \approx 1$. Thus, if $G(z)$ is a realistic image, a necessary condition is $\text{avg}(z) \approx 0$ and $\text{std}(z) \approx 1$.
Applying this conclusion, if we want the reconstruction to be good, meaning we want $G(E(x))$ to be a realistic image, a necessary condition is $\text{avg}(E(x)) \approx 0$ and $\text{std}(E(x)) \approx 1$. This indicates that for a good $E(x)$, we can assume $\text{avg}(E(x))$ and $\text{std}(E(x))$ are known (equal to 0 and 1, respectively). Since they are known, there is no need to fit them. In other words, they can be excluded from the reconstruction term. In fact:
\begin{equation}-\rho(z, E(G(z)))\sim \left\Vert \mathcal{N}(z) - \mathcal{N}(E(G(z)))\right\Vert^2\end{equation}
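To spell this relation out (a one-line check, writing $\hat{z}=E(G(z))$): each standardized vector has zero mean and unit standard deviation, so $\Vert \mathcal{N}(z)\Vert^2=\Vert \mathcal{N}(\hat{z})\Vert^2=n_z$ and $\langle \mathcal{N}(z), \mathcal{N}(\hat{z})\rangle = n_z\, \rho(z,\hat{z})$, hence
\begin{equation}\left\Vert \mathcal{N}(z) - \mathcal{N}(\hat{z})\right\Vert^2 = 2n_z - 2\langle \mathcal{N}(z), \mathcal{N}(\hat{z})\rangle = 2n_z\big(1-\rho(z,\hat{z})\big)\end{equation}
which differs from $-\rho(z, E(G(z)))$ only by a positive scale and an additive constant.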
That is to say, if we exclude $\text{avg}(E(x))$ and $\text{std}(E(x))$ from the MSE loss and then omit the constants, it is actually $-\rho(z, E(G(z)))$. This further explains the rationality of $\rho(z, E(G(z)))$. Furthermore, based on this derivation, the reconstruction process is not $G(E(x))$ but rather:
\begin{equation}\hat{x}=G(\mathcal{N}(E(x)))\end{equation}
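In code, this reconstruction path is just the composition above; a minimal sketch reusing the earlier hypothetical names:

```python
import torch

def normalize(v):
    # N(.): subtract the element-wise mean, divide by the (biased) element-wise std
    m = v.mean(dim=-1, keepdim=True)
    s = (v - m).pow(2).mean(dim=-1, keepdim=True).sqrt()
    return (v - m) / s

@torch.no_grad()
def reconstruct(x):
    # encode, re-standardize the code, then decode with the generator
    return G(normalize(E(x)))
```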
Finally, this extra reconstruction term can theoretically prevent the occurrence of mode collapse. It is quite obvious: since the reconstruction quality is good, the generation quality cannot be that bad, so mode collapse won't happen easily. If we must find a mathematical basis, we can view $\rho(z, E(G(z)))$ as a lower bound on the mutual information between $Z$ and $G(Z)$. Thus, minimizing $-\rho(z, E(G(z)))$ is effectively maximizing the mutual information between $Z$ and $G(Z)$, which is equivalent to maximizing the entropy of $G(Z)$. A larger entropy for $G(Z)$ indicates increased diversity, moving away from mode collapse. For similar derivations, refer to "GAN from the Perspective of Energy (Part 2): GAN = 'Analysis' + 'Sampling'".
Conclusion
This article introduced a scheme that requires only simple modifications to the original GAN to transform the original GAN discriminator into an effective encoder. Multiple experiments show that such a scheme is feasible. Deep reflection on the theory reveals that this is essentially an orthogonal decomposition of the original discriminator (encoder) and full utilization of the degrees of freedom after decomposition. Therefore, the model is called "Orthogonal GAN (O-GAN)."
A small change for an encoder—why not give it a try? I welcome everyone to test it out~
Afterword:
In hindsight, the thinking behind this model is essentially a decomposition into "magnitude and direction," which is not hard to understand, but arriving at it was not so easy.
Initially, I was also stuck in the trap of $\Vert z - E(G(z))\Vert^2$ and couldn't pull myself out. Later, I thought of many tricks and finally stabilized the model under the $\Vert z - E(G(z))\Vert^2$ reconstruction loss (which took several months), but the model became very ugly (introducing a triple-adversarial GAN). So I set about simplifying the model. Later, I tried using the $\cos$ value as the reconstruction loss and found that it could converge simply. I then reflected on the principles behind this, which might involve degrees of freedom.
Then, I tried to decompose $E(x)$ into norm and direction, using the norm $\Vert E(x)\Vert$ as the discriminator and the $\cos$ as the reconstruction loss, with a hinge loss for the discriminator. This actually has clear geometric meaning and sounds more elegant. It worked on some datasets, but the generalizability was poor (CelebA was fine, LSUN was not). Another issue was that $\Vert E(x)\Vert$ is non-negative, making it hard to embed into general GANs, and many techniques for stabilizing GANs couldn't be used.
Then I thought about how to let the norm-based score take both positive and negative values. My first idea was to take the logarithm of the norm, so that norms less than 1 become negative and norms greater than 1 become positive, achieving the goal. Unfortunately, the results were still not good. Many other schemes failed as well, until I finally realized I could give up using the norm (which corresponds to the variance) for the discriminator loss and just use the mean. That is how it eventually became $\text{avg}(E(x))$; this transition took a long time.
Also, reconstruction loss is generally taken to measure the difference between $x$ and $G(E(x))$, but I found that measuring only the difference between $z$ and $E(G(z))$ is the lowest-cost option, because computing a full reconstruction takes extra time. Finally, I performed many experiments; many ideas succeeded on CelebA but failed on LSUN. So the final, simple-looking model is actually the distillation of a long, difficult process.
The entire model stems from an obsession: since the discriminator has the structure of an encoder, it should not be wasted. Given the previous success of IntroVAE, I believed there must be a simpler solution. I experimented for several months, ran hundreds of models, and finally solved this problem completely recently.
Besides IntroVAE, I was also greatly inspired by the paper Deep Infomax. The appendix of Deep Infomax provided a new way to approach GANs, and I began thinking about the new model from the methods there.