Idempotent Generative Network (IGN): A GAN That Attempts to Unify Discrimination and Generation

By 苏剑林 | January 31, 2024

A while ago, a generative model named "Idempotent Generative Network" (IGN) attracted some attention. It claimed to be a new type of generative model independent of the existing VAE, GAN, flow, and diffusion models, and it features single-step sampling. Perhaps because everyone has grown weary of the multi-step sampling of current mainstream diffusion models, any "rustle in the grass" claiming single-step sampling easily attracts attention. Furthermore, the word "idempotent" in the name IGN added a sense of mystery, further heightening expectations and successfully piquing my interest. However, I was busy with other things at the time and didn't get around to reading the model's details carefully.

Recently, having some free time, I remembered IGN and pulled up the paper. After reading it, though, I was quite puzzled: how is this a new model? Isn't it just a GAN variant? Its difference from a regular GAN is that it merges the generator and discriminator into one. Does this "merging into one" offer any special benefit, such as better training stability? Personally, I don't feel it does. Below I share my process of understanding IGN from a GAN perspective, along with my doubts.

Generative Adversarial

Regarding GANs (Generative Adversarial Networks), I studied them systematically a few years ago (see the GAN tag for related articles), but I haven't followed them closely in recent years. So I'll first give a brief review of GANs to facilitate the comparison between GANs and IGN in subsequent sections.

A GAN has two basic components: a Discriminator and a Generator. They can be described as a "judge" and a "counterfeiter." The discriminator is responsible for distinguishing between real samples and fake samples produced by the generator, while the generator is responsible for mapping simple random noise to target samples and uses signals from the discriminator to improve its generation quality. Through constant "attack and defense," the generator's quality improves until the discriminator can no longer distinguish between real and fake samples, achieving a realistic effect.

Taking WGAN as an example, the training goal of the discriminator $D_\theta$ is to widen the score gap between real and fake samples (under the sign convention used here, a lower score means "more real"):

\begin{equation}\max_{\theta} D_{\theta}(G_{\varphi}(z)) - D_{\theta}(x)\label{eq:d-loss}\end{equation}

where $x$ is a real sample from the training set, $z$ is random noise, $G_{\varphi}$ is the generator, and $G_{\varphi}(z)$ is a fake sample. The training goal of the generator is to narrow the score gap between real and fake samples, i.e., to minimize the above expression. For the generator, however, $D_{\theta}(x)$ does not depend on the parameters $\varphi$ and is effectively a constant, so the objective simplifies to:

\begin{equation}\min_{\varphi} D_{\theta}(G_{\varphi}(z))\label{eq:g-loss}\end{equation}

Besides this, there is also the matter of Lipschitz constraints, but those are details. For interested readers, you can further read articles like "The Art of Mutual Confrontation: WGAN-GP from Scratch" and "From Wasserstein Distance and Duality Theory to WGAN".

Generally, GANs are trained with two alternating losses. Sometimes they can be written as a single loss optimized in two opposite directions, performing gradient descent on some parameters and gradient ascent on others. This $\min$-$\max$ training process is usually unstable and prone to collapse; even when training succeeds, it may suffer from "mode collapse," where the generated results lack diversity.
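To make the alternating updates concrete, here is a minimal PyTorch sketch. The toy network sizes, learning rates, and dummy data are placeholder assumptions of mine, and the Lipschitz constraint (e.g., a gradient penalty) is omitted for brevity:

```python
import torch
import torch.nn as nn

dim_x, dim_z = 784, 128  # hypothetical data and noise dimensions
G = nn.Sequential(nn.Linear(dim_z, 256), nn.ReLU(), nn.Linear(256, dim_x))
D = nn.Sequential(nn.Linear(dim_x, 256), nn.ReLU(), nn.Linear(256, 1))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)

def alternating_step(x):
    z = torch.randn(x.size(0), dim_z)
    # Discriminator step: minimize D(x) - D(G(z)), widening the gap
    # (low score = real under the sign convention above).
    d_loss = (D(x) - D(G(z).detach())).mean()
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    # Generator step: minimize D(G(z)), pushing fake scores toward "real".
    g_loss = D(G(z)).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

alternating_step(torch.randn(32, dim_x))  # one step on a dummy batch
```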

Single Loss

Some readers might object: "You said GANs use two alternating losses, but IGN clearly uses a single loss; how can you say IGN is a special case of GAN?"

In fact, IGN's single-loss formulation is a bit of a "cheat." By the same logic, a GAN can also be written in single-loss form. How? It's simple: let $\theta', \varphi'$ be copies of the weights $\theta, \varphi$ that always satisfy $\theta' \equiv \theta$ and $\varphi' \equiv \varphi$ but through which no gradients are computed. Then equations \eqref{eq:d-loss} and \eqref{eq:g-loss} can be merged into:

\begin{equation}\min_{\theta,\varphi} D_{\theta}(x) - D_{\theta}(G_{\varphi'}(z)) + D_{\theta'}(G_{\varphi}(z))\label{eq:pure-one-loss}\end{equation}

At this point, the gradients with respect to $\theta, \varphi$ are the same as when the two losses are kept separate, so this is an equivalent implementation. But why is it a "cheat"? Because it involves no real technique; it merely rewrites the $\min$-$\max$ notation as a single $\min$. Implemented literally, one would have to clone $D_{\theta'}$ and $G_{\varphi'}$ at every step and stop their gradients, which is very inefficient for training.
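Taken literally, equation \eqref{eq:pure-one-loss} would look like the following sketch, reusing the toy G and D defined above; the per-step deepcopy is exactly the inefficiency just mentioned:

```python
import copy

def merged_loss(x):
    z = torch.randn(x.size(0), dim_z)
    # Frozen per-step copies play the roles of D_{theta'} and G_{varphi'}.
    D_frozen = copy.deepcopy(D).requires_grad_(False)
    G_frozen = copy.deepcopy(G).requires_grad_(False)
    # min over (theta, varphi) of D(x) - D(G'(z)) + D'(G(z))
    return (D(x) - D(G_frozen(z)) + D_frozen(G(z))).mean()
```

Calling `merged_loss(x).backward()` and stepping a joint optimizer over both parameter sets reproduces the two separate updates in one pass, just wastefully.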

In fact, to write a GAN as a single loss while keeping it practical, one can refer to my previous article "Clever Gradient Cutting: Implementing GAN with a Single Loss": the framework's native stop_gradient operator, combined with a few gradient tricks, does the job. Specifically, stop_gradient forces the gradient of a chosen part of the model to be 0. For example:

\begin{equation}\nabla_{\theta,\varphi} D_{\theta}(G_{\varphi}(z)) = \left(\frac{\partial D_{\theta}(G_{\varphi}(z))}{\partial\theta},\frac{\partial D_{\theta}(G_{\varphi}(z))}{\partial\varphi}\right)\label{eq:full-grad}\end{equation}

After adding the stop_gradient operator (abbreviated as $\color{skyblue}{\text{sg}}$), we have:

\begin{equation}\nabla_{\theta,\varphi} D_{\theta}(\color{skyblue}{\text{sg}(}G_{\varphi}(z)\color{skyblue}{)}) = \left(\frac{\partial D_{\theta}(G_{\varphi}(z))}{\partial\theta},0\right)\label{eq:stop-grad}\end{equation}

Thus, through the stop_gradient operator, we can easily block the inner gradient of a nested function (i.e., the gradient of $\varphi$). What about blocking the outer gradient (the gradient of $\theta$), as required for the generator? There is no direct way, but we can use a trick: subtract equation \eqref{eq:stop-grad} from equation \eqref{eq:full-grad}:

\begin{equation}\nabla_{\theta,\varphi} D_{\theta}(G_{\varphi}(z)) - \nabla_{\theta,\varphi} D_{\theta}(\color{skyblue}{\text{sg}(}G_{\varphi}(z)\color{skyblue}{)}) = \left(0,\frac{\partial D_{\theta}(G_{\varphi}(z))}{\partial\varphi}\right)\end{equation}
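These two identities are easy to verify numerically. Below is a toy check with scalar stand-ins for the parameters and networks, using PyTorch's detach as the $\color{skyblue}{\text{sg}}$ operator:

```python
import torch

theta = torch.tensor(2.0, requires_grad=True)   # stand-in for theta
varphi = torch.tensor(3.0, requires_grad=True)  # stand-in for varphi
Df = lambda u: theta * u                        # toy D_theta
Gf = lambda z: varphi * z                       # toy G_varphi
z = torch.tensor(1.5)

full = Df(Gf(z))         # grads: (G(z), theta*z) = (4.5, 3.0)
sg = Df(Gf(z).detach())  # grads: (G(z), 0)       = (4.5, 0.0)
diff = full - sg         # grads: (0, theta*z)    = (0.0, 3.0)
print(torch.autograd.grad(diff, [theta, varphi]))  # -> (tensor(0.), tensor(3.))
```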

This achieves the blocking of the outer gradient. By combining these, we get one way to train a GAN with a single loss:

\begin{equation}\begin{gathered} \min_{\theta,\varphi} \underbrace{D_{\theta}(x) - D_{\theta}(\color{skyblue}{\text{sg}(}G_{\varphi}(z)\color{skyblue}{)})}_{\text{Gradient of } \varphi \text{ removed}} + \underbrace{D_{\theta}(G_{\varphi}(z)) - D_{\theta}(\color{skyblue}{\text{sg}(}G_{\varphi}(z)\color{skyblue}{)})}_{\text{Gradient of } \theta \text{ removed}} \\[8pt] = \min_{\theta,\varphi} D_{\theta}(x) - 2 D_{\theta}(\color{skyblue}{\text{sg}(}G_{\varphi}(z)\color{skyblue}{)}) + D_{\theta}(G_{\varphi}(z))\end{gathered}\end{equation}

This way, there's no need to repeatedly clone the model, and gradient equivalence is achieved within a single loss.
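In PyTorch, detach plays the role of $\color{skyblue}{\text{sg}}$, and the whole objective becomes a single backward pass. A sketch, again reusing the toy G and D from the first code block:

```python
# One optimizer over both parameter sets; separate optimizers with
# different learning rates would work just as well.
opt = torch.optim.Adam(list(D.parameters()) + list(G.parameters()), lr=1e-4)

def single_loss_step(x):
    z = torch.randn(x.size(0), dim_z)
    fake = G(z)
    # D(x) - 2*D(sg(G(z))) + D(G(z)): one loss, correct grads for both nets
    loss = (D(x) - 2 * D(fake.detach()) + D(fake)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

single_loss_step(torch.randn(32, dim_x))
```

Note that `fake` is computed once and fed to D both detached and undetached, so the generator's forward pass is not duplicated; only the discriminator runs an extra time.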

Idempotent Generation

After all that, it's finally time for our main character—the Idempotent Generative Network (IGN)—to take the stage. But before its formal entrance, please wait a moment while we discuss the motivation behind IGN.

A notable characteristic of GANs is that once training is successful, only the generator is usually kept; the discriminator is mostly "discarded." However, in a reasonable GAN, the generator and discriminator often have the same order of magnitude of parameters. Discarding the discriminator means half of the parameters are wasted, which is unfortunate. To address this, some work has tried incorporating encoders into GANs and sharing some parameters between the discriminator and encoder to improve parameter utilization. Among them, the most minimalist work is the O-GAN I proposed, which only slightly modifies the discriminator's structure and adds an extra loss to turn the discriminator into an encoder without increasing parameters or computation. It is a work I am quite satisfied with.

As the title suggests, IGN is a GAN that attempts to merge the discriminator and generator: the generator acts as "both player and referee," so from this perspective IGN can also be seen as a way to improve parameter utilization. First, IGN assumes that $z$ and $x$ have the same size, so the input and output sizes of the generator $G_{\varphi}$ are identical; this differs from regular GANs, where the dimensionality of $z$ is usually smaller than that of $x$. With input and output of the same size, real images themselves can also be fed into the generator for further computation. IGN therefore designs the discriminator as a reconstruction loss:

\begin{equation}\delta_{\varphi}(x) = \Vert G_{\varphi}(x) - x\Vert^2\label{eq:ign-d}\end{equation}

$\delta_{\varphi}$ follows the notation from the original IGN paper and has no special meaning. This design fully reuses the generator's parameters and adds no extra parameters, which seems very elegant. Now, if we substitute this discriminator into equation \eqref{eq:pure-one-loss}, we get:

\begin{equation}\min_{\varphi}\underbrace{\delta_{\varphi}(x) - \delta_{\varphi}(G_{\varphi'}(z))}_{\text{Discriminator loss}} + \underbrace{\delta_{\varphi'}(G_{\varphi}(z))}_{\text{Generator loss}}\end{equation}

Doesn't this look exactly like the **Final optimization objective** in the original IGN paper? Of course, the original paper includes two adjustable coefficients. In fact, every term's coefficient in equation \eqref{eq:pure-one-loss} is adjustable, so that's nothing special. Thus, it's clear that IGN can be entirely derived from a GAN perspective; it is a special case of GAN—even though the authors stated they did not think of IGN from a GAN perspective.
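Read off from the equation above, an IGN-style training step might look like the following sketch. This is my reconstruction under stated assumptions, not the paper's code: the plain MLP `f`, the generic weights `lam_t` and `lam_g`, and the frozen deepcopy standing in for $\varphi'$ are all placeholders.

```python
import copy
import torch
import torch.nn as nn

dim_x = 784  # hypothetical; z and x share this size in IGN
f = nn.Sequential(nn.Linear(dim_x, 256), nn.ReLU(), nn.Linear(256, dim_x))
opt_f = torch.optim.Adam(f.parameters(), lr=1e-4)

def delta(g, u):  # delta(u) = ||g(u) - u||^2, the reconstruction "score"
    return ((g(u) - u) ** 2).sum(dim=-1)

def ign_step(x, lam_t=1.0, lam_g=1.0):  # generic weights, not the paper's values
    z = torch.randn_like(x)                            # z has the same size as x
    f_frozen = copy.deepcopy(f).requires_grad_(False)  # the varphi' copy
    loss = (delta(f, x)                      # reconstruct real samples (score low)
            - lam_t * delta(f, f_frozen(z))  # push fake samples' score up
            + lam_g * delta(f_frozen, f(z))  # generator term: make f(z) look real
            ).mean()
    opt_f.zero_grad(); loss.backward(); opt_f.step()

ign_step(torch.randn(32, dim_x))  # one step on a dummy batch
```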

The term "idempotent" comes from the authors' belief that when IGN is successfully trained, the discriminator's score for real samples is 0, meaning $G_{\varphi}(x) = x$. From this, it can be inferred that:

\begin{equation}G_{\varphi}(\cdots G_{\varphi}(x)) = \cdots = G_{\varphi}(G_{\varphi}(x)) = G_{\varphi}(x) = x\end{equation}

In other words, applying $G_{\varphi}$ to a real sample $x$ multiple times leaves the result unchanged, which is the mathematical meaning of "idempotence." However, theoretically, we cannot guarantee that the GAN discriminator's loss (for real samples) will be exactly zero, so it is hard to achieve true idempotence. The experimental results in the original paper also reflect this.
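One can probe this numerically after training: apply the network twice and compare. A quick sketch, reusing `f` and `dim_x` from the previous block, with random tensors standing in for real samples:

```python
with torch.no_grad():
    x = torch.randn(32, dim_x)  # stand-in for a batch of real samples
    fx, ffx = f(x), f(f(x))
    print('||f(x) - x||    :', (fx - x).norm().item())    # ~0 iff delta(x) ~ 0
    print('||f(f(x)) - f(x)||:', (ffx - fx).norm().item())  # ~0 iff approx. idempotent
```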

Personal Analysis

A question worth considering is: **Why can the reconstruction loss \eqref{eq:ign-d} successfully serve as a discriminator? Or, given that many expressions can be constructed using $G_{\varphi}(x)$ and $x$, can any of them serve as a discriminator?**

From the perspective of "reconstruction loss as a discriminator," IGN is very similar to EBGAN. However, this doesn't mean EBGAN's success explains IGN's, because EBGAN's generator is independent of the discriminator and lacks the constraint of fully shared parameters. Thus, EBGAN's success is "within expectations" and fits GAN's original design. But IGN is different because its discriminator and generator share parameters completely, and GAN training itself has significant instability, making it easy for both to "sink together."

In my view, IGN has a chance not to fail because it happens to satisfy "self-consistency." First, the fundamental goal of a GAN is for $G_{\varphi}(z)$ to output a real image for input noise $z$. In IGN's design of "reconstruction loss as a discriminator," even if the discriminator's optimal loss isn't zero, it might be close, meaning $G_{\varphi}(x) \approx x$ is approximately satisfied. Thus, it simultaneously satisfies the condition: "for input image $x$, $G_{\varphi}(x)$ can output a real image." That is, regardless of the input, the output space is that of real samples. This self-consistency is crucial; otherwise, the generator might "fall apart" trying to generate in two different directions.

That being said, what tangible improvement does IGN offer over a general GAN? Please forgive my dullness, but I really can't see the benefit of IGN. Regarding parameter utilization, IGN's parameter sharing looks like it increases efficiency, but in fact, to ensure that the input and output of $G_{\varphi}$ have the same size, IGN uses an autoencoder structure, whose parameter count and computation are comparable to the combined generator and discriminator of a typical GAN! In other words, IGN doesn't reduce the parameter count; if anything, it increases the total computation, because it enlarges the generator.

I also ran a simple experiment with IGN and found that its training suffers from the same instability; one could even say it is more unstable, because hard constraints like "parameter sharing + Euclidean distance" are more likely to amplify instability, leading to "sinking together" rather than "rising together." Furthermore, because IGN's generator has identical input and output sizes, it loses the advantage general GAN generators have of mapping a low-dimensional latent space to high-dimensional data. IGN is also prone to mode collapse, and because of the Euclidean distance, the generated images tend to be blurry, like those of a VAE.

Summary

This article introduced the recently popular Idempotent Generative Network (IGN) from a GAN perspective, compared the connections and differences between IGN and GANs, and shared some personal analysis of IGN.