By 苏剑林 | January 30, 2019
"Look at those pits being dug, what's so different about them~" In this series, we attempt to understand GANs from the perspective of energy, a perspective we will find remarkably beautiful and intuitive. It is directly inspired by the Bengio team's new work "Maximum Entropy Generators for Energy-Based Models", which appeared on arXiv a few days ago. Of course, the connection between energy models and GANs has a long history and is not original to this article; that paper, however, makes the connection more meticulous and complete. This article also adds some of my own understanding and reflections, striving to be more accessible and comprehensive.
As the first article, let's start with a straightforward analogical derivation: GAN is actually a journey of "digging pits" and "jumping into pits" one after another (or digging then jumping?).
In this section, we use as popular a metaphor as possible to explain what GAN is from an energy perspective.
First, we have a batch of samples $x_1,x_2,\dots,x_{n}$. We hope to find a generative model that can produce a batch of new samples $\hat{x}_1,\hat{x}_2,\dots,\hat{x}_{n}$ very similar to the originals. How do we produce them? Simple: follow a two-step process.
We treat the real samples $x_1,x_2,\dots,x_{n}$ as coordinates and dig many pits at these coordinates. The distribution of these pits can be described by an energy function $U(x)$. Consequently, the real samples $x_1,x_2,\dots,x_{n}$ are equivalent to being placed at the bottom of the pits, and then we place the generated fake samples $\hat{x}_1,\hat{x}_2,\dots,\hat{x}_{n}$ at the "waist" of the pits:
GAN Step One: "Digging Pits"
Now we fix $U(x)$, meaning we no longer move the pits. Then we release the fake samples $\hat{x}_1,\hat{x}_2,\dots,\hat{x}_{n}$. Obviously, they will slowly roll down to the bottom of the pits. Since the bottom of the pits represents real samples, $\hat{x}_1,\hat{x}_2,\dots,\hat{x}_{n}$ become very much like real samples:
GAN Step Two: "Jumping into Pits"
This is the workflow of GAN!
Note that the two steps above are not just simple metaphors but a complete description of GAN. Based on these two steps, we can even directly write out the GAN training formulas.
First, let's look at "digging pits." We said we want to place real samples at the bottoms of the pits and fake samples at the waists, so that the fake samples can later roll to the bottom. This means the "average altitude" of the fake samples should be higher than that of the real samples, i.e., we make
\begin{equation}\mathbb{E}_{x\sim p(x)}\big[U(x)\big] - \mathbb{E}_{x\sim q(x)}\big[U(x)\big]\label{eq:eq-e}\end{equation}as small as possible. Here we use $p(x)$ to denote the distribution of real samples and $q(x)$ for the distribution of fake samples. Fake samples are generated via $x=G(z)$, where $z\sim q(z)$ follows a standard normal distribution.
In addition, we said real samples should be at the bottom of the pit. In mathematical terms, the bottom of the pit is a local minimum point where it's best if the derivative equals 0, i.e., ideally satisfying $\nabla_x U(x)=0$. In terms of an optimization target, that means making $\Vert \nabla_x U(x)\Vert^2$ as small as possible. Combining these, we get the optimization target for $U$:
\begin{equation}\begin{aligned}U =& \mathop{\text{argmin}}_{U}\mathbb{E}_{x\sim p(x)}\big[U(x)\big] - \mathbb{E}_{x\sim q(x)}\big[U(x)\big] + \lambda \mathbb{E}_{x\sim p(x)}\big[\Vert \nabla_x U(x)\Vert^2\big]\\ =& \mathop{\text{argmin}}_{U}\mathbb{E}_{x\sim p(x)}\big[U(x)\big] - \mathbb{E}_{z\sim q(z)}\big[U(G(z))\big] + \lambda \mathbb{E}_{x\sim p(x)}\big[\Vert \nabla_x U(x)\Vert^2\big] \end{aligned}\label{eq:eq-ee}\end{equation}Note: Previously, regarding gradient penalty, we always had two confusions: 1. Is it better for the gradient penalty to be centered around 0 or centered around 1; 2. Should gradient penalty be applied to real samples, fake samples, or interpolated samples? Now, based on the energy perspective, we can conclude that "performing a 0-centered gradient penalty on real samples" is better, because it means (overall) placing the real samples at local minima.
Thus, from the energy perspective, we have a very intuitive answer for gradient penalty.
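To make this concrete, here is a minimal numerical sketch of the digging objective $\eqref{eq:eq-ee}$. A toy one-dimensional energy with an analytic gradient stands in for a neural discriminator with autograd; the function names, constants, and sample values are purely illustrative.

```python
import numpy as np

# Toy 1-D energy ("terrain"): U(x) = (x - b)**2, a single pit at x = b,
# with an analytic gradient. In a real GAN, U would be a neural network
# and grad_U would come from automatic differentiation.
def U(x, b=0.0):
    return (x - b) ** 2

def grad_U(x, b=0.0):
    return 2.0 * (x - b)

def digging_loss(real, fake, lam=10.0):
    """E_p[U] - E_q[U] + lam * E_p[|grad U|^2]: lower the terrain under
    real samples, raise it under fakes, and apply the 0-centered
    gradient penalty on the real samples only."""
    return U(real).mean() - U(fake).mean() + lam * (grad_U(real) ** 2).mean()

real = np.array([0.1, -0.1])   # near the pit bottom b = 0
fake = np.array([1.0, -1.0])   # on the "waist" of the pit
print(digging_loss(real, fake, lam=0.1))
```

The penalty term is small precisely when the real samples already sit at stationary points of the terrain, which is the "real samples at local minima" condition in prose form.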
Next, let's look at "jumping into pits." That is, once the pits are dug and $U$ is fixed, we let the fake samples roll to the bottom, which means lowering $U(x)$ to roll into the nearest pit, so:
\begin{equation}G = \mathop{\text{argmin}}_{G}\mathbb{E}_{z\sim q(z)}\big[U(G(z))\big]\label{eq:eq-g}\end{equation}As we can see, the discriminator is actually "creating the terrain," and the generator is minimizing the potential energy. This is the main idea of energy-based GAN.
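The "jumping into pits" step $\eqref{eq:eq-g}$ is then just gradient descent on a fixed terrain. As a sketch, a toy double-well energy $U(x)=(x^2-1)^2$ with pits at $x=\pm 1$ stands in for a trained discriminator:

```python
import numpy as np

# "Jumping into pits": with the terrain U fixed, fake samples follow
# gradient descent on U, i.e. they roll downhill toward the nearest pit.
# Toy terrain with two pits at x = -1 and x = +1: U(x) = (x**2 - 1)**2.
def grad_U(x):
    return 4.0 * x * (x ** 2 - 1.0)

fake = np.array([0.5, -0.5, 1.8])       # initial fake samples on the slopes
for _ in range(200):
    fake = fake - 0.05 * grad_U(fake)   # plain gradient descent on U
print(np.round(fake, 3))                # each sample ends in its nearest pit
```

Each sample settles into the pit on whose slope it started, which is exactly the behavior the pictures describe.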
If the pits in reality were as simple as the diagrams above, it might only take two steps to train a generative model. However, real-world pits can be very complex. For example, in the figure below, as the fake sample $\hat{x}_1$ slides down, it won't necessarily reach the pit of $x_1$, but might reach an intermediate pit. This intermediate pit might not represent a real sample, but just a "pseudo-real" one. Therefore, we need to continuously improve the fake samples and also continuously correct the pits (for example, striving to "shave off" peaks that block progress in the next step). This means we need to repeatedly and alternately execute steps $\eqref{eq:eq-ee}$ and $\eqref{eq:eq-g}$.
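The alternation of "digging" and "jumping" can be sketched end to end on a toy problem. Here the terrain is a single movable pit $U(x;b)=(x-b)^2$, the "generator" is reduced to a single scalar sample, and all gradients are written analytically; this is a hand-built illustration, not the paper's actual setup:

```python
import numpy as np

# Alternating dig / jump on a toy 1-D problem.
real = np.array([1.5, 2.5])          # real samples, mean 2.0
b, g = 0.0, 5.0                      # initial pit position and fake sample
lr, lam = 0.05, 1.0

for _ in range(1000):
    # Dig: lower U on real data, raise it on the fake, with the
    # 0-centered gradient penalty lam * E_p[(2(x - b))**2] on real data.
    grad_b = -2.0 * (1.0 + 4.0 * lam) * np.mean(real - b) + 2.0 * (g - b)
    b -= lr * grad_b
    # Jump: the fake sample rolls downhill on the (momentarily) fixed terrain.
    g -= lr * 2.0 * (g - b)

print(round(b, 2), round(g, 2))  # both settle near the real mean 2.0
```

The pit migrates toward the real data while the fake sample chases the pit; neither step alone suffices, which is why the two are interleaved.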
Pit distributions in reality may be more complex
See? Just by imagining a few pits, we have derived the complete framework of GAN, and even an upgraded version of WGAN-GP with a 0-centered gradient penalty. GAN is nothing more than the art of pits!
For further discussion of this form of GAN, see my earlier post "WGAN-div: An Obscure WGAN Pit Filler" or the paper "Which Training Methods for GANs do actually Converge?".
This imagery also helps answer several other questions. For example: can the discriminator do without gradient penalty? Why do most GANs, especially the generator, avoid momentum-based optimizers, or keep the momentum very small when they do use one? And how does mode collapse arise?
Gradient penalty is theoretically beautiful, but it is indeed too slow. So from a practical standpoint, it's best not to use gradient penalty if possible. However, if no gradient penalty is used and we directly minimize equation $\eqref{eq:eq-e}$, numerical instability easily occurs.
This is not hard to understand, because without constraints, it's easy to have $U(x)\to -\infty$ for real samples and $U(x)\to +\infty$ for fake samples. That is, the discriminator optimizes too aggressively, and the gap gets too wide (infinite). Then a natural idea is to set thresholds for real and fake samples respectively; if the optimization of $U(x)$ exceeds this threshold, stop optimizing it. For example:
\begin{equation}\mathbb{E}_{x\sim p(x)}\big[\max(0, 1+U(x))\big] + \mathbb{E}_{x\sim q(x)}\big[\max(0,1-U(x))\big]\label{eq:eq-e-hinge}\end{equation}In this way, for $x\sim p(x)$, if $U(x) < -1$, then $\max(0, 1+U(x))=0$. For $x\sim q(x)$, if $U(x) > 1$, then $\max(0, 1-U(x))=0$. In both cases, $U(x)$ will no longer be optimized. This means $U(x)$ doesn't need to be too small for real samples and doesn't need to be too large for fake samples, thereby preventing over-optimization of $U(x)$.
This scheme is the hinge loss used by SNGAN, SAGAN, and BigGAN.
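In code, the hinge objective $\eqref{eq:eq-e-hinge}$ is only a couple of lines. This sketch applies it to pre-computed energy values in plain NumPy rather than to a real discriminator's outputs:

```python
import numpy as np

# Hinge-loss variant of the digging objective: stop pushing U down on a
# real sample once U(x) < -1, and stop pushing it up on a fake sample
# once U(x) > 1.
def hinge_d_loss(u_real, u_fake):
    return np.maximum(0.0, 1.0 + u_real).mean() + np.maximum(0.0, 1.0 - u_fake).mean()

# Samples past their thresholds contribute nothing, so U is not over-optimized:
print(hinge_d_loss(np.array([-2.0]), np.array([3.0])))   # both saturated -> 0.0
print(hinge_d_loss(np.array([0.5]), np.array([0.0])))    # 1.5 + 1.0 = 2.5
```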
Of course, if $U(x)$ itself is non-negative [such as using the MSE of an autoencoder as $U(x)$ in EBGAN], then equation $\eqref{eq:eq-e-hinge}$ can be slightly modified:
\begin{equation}\mathbb{E}_{x\sim p(x)}\big[U(x)\big] + \mathbb{E}_{x\sim q(x)}\big[\max(0,m-U(x))\big]\label{eq:eq-e-hinge2}\end{equation}where $m > 0$.
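A similarly minimal sketch of the margin variant $\eqref{eq:eq-e-hinge2}$, again on toy energy values:

```python
import numpy as np

# Margin form for a non-negative energy (e.g. autoencoder MSE as in EBGAN):
# real samples are always pushed down; fake samples are pushed up only
# while U(x) < m.
def ebgan_d_loss(u_real, u_fake, m=1.0):
    return u_real.mean() + np.maximum(0.0, m - u_fake).mean()

print(ebgan_d_loss(np.array([0.2]), np.array([0.5]), m=1.0))  # 0.2 + 0.5 = 0.7
print(ebgan_d_loss(np.array([0.0]), np.array([2.0]), m=1.0))  # fake saturated -> 0.0
```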
As for the choice of optimizer, we can see the answer from the "jumping into pits" diagram. Optimizers with momentum help us find better global minima faster, but for GAN, we don't actually need to run to a better minimum point; we only need to run to the nearest minimum point. If we jump out of the nearest minimum and run to a deeper one, we might lose diversity or even suffer mode collapse.
For example, in the figure below, for $\hat{x}_2$, an optimization algorithm without momentum allows $\hat{x}_2$ to stop at $x_2$. If momentum is included, it might skip $x_2$ and even run to $x_1$. Although $x_1$ is also a real sample, if $\hat{x}_1$ and $\hat{x}_2$ both converge toward $x_1$, there might be no fake sample capable of generating $x_2$, thus losing diversity.
Comparison of optimization trajectories with and without momentum: without momentum, fake samples only need to drop into the nearest pit; with momentum, they may cross the nearest pit and reach farther ones, leading fake samples to cluster around certain real samples and losing diversity.
Therefore, in GAN optimizers, momentum cannot be too large; too much might lead to a loss of diversity in generated samples or cause other instabilities. Similarly, the learning rate cannot be too large. In short, all acceleration methods shouldn't be too intense.
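The overshooting effect of momentum can be checked numerically. The sketch below compares heavy-ball updates with plain gradient descent on the same toy double-well terrain; the momentum coefficient and step size are illustrative choices, not recommended GAN settings:

```python
import numpy as np

def grad_U(x):
    # Double-well terrain U(x) = (x**2 - 1)**2 with pits at x = -1 and x = +1
    return 4.0 * x * (x ** 2 - 1.0)

def descend(x, steps=300, lr=0.05, momentum=0.0):
    v, crossed = 0.0, False
    for _ in range(steps):
        v = momentum * v - lr * grad_U(x)   # heavy-ball update
        x = x + v
        crossed = crossed or x < 0.0        # did we overshoot the barrier at 0?
    return x, crossed

# Starting on the right slope, plain gradient descent stays in the nearest pit:
x_plain, crossed_plain = descend(1.8, momentum=0.0)
# Heavy momentum can carry the sample over the barrier toward the other pit:
x_mom, crossed_mom = descend(1.8, momentum=0.9)
print(round(x_plain, 3), crossed_plain)
print(crossed_mom)   # True: the nearest pit was overshot
```

The momentum trajectory crosses the barrier between the two pits, which in GAN terms is exactly the diversity-losing jump described above.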
What is mode collapse? Why does mode collapse occur? We can still use this imagery to provide an easy explanation. Previously, we drew the fake samples $\hat{x}$ very reasonably, but if initialization is poor or optimization is unreasonable, the $\hat{x}$ samples might cluster around a few specific pits, for example:
Mode collapse illustration
In this case, following the optimization process described above, all fake samples will rush toward $x_n$, so the model can only generate a single (or a few) styles of samples. This is mode collapse. Simply put, mode collapse occurs because the fake samples are too concentrated and not "uniform" enough. Therefore, we can add a term to the generator to encourage a uniform trend in fake samples. This term is the entropy of the fake samples $H(X)=H(G(Z))$. We want the entropy of fake samples to be as large as possible, which means more chaotic and more uniform. So the generator's objective can be changed to:
\begin{equation}G = \mathop{\text{argmin}}_{G} -H(G(Z)) + \mathbb{E}_{z\sim q(z)}\big[U(G(z))\big]\label{eq:eq-gg}\end{equation}In theory, this can solve the mode collapse problem. As for how to calculate $H(X)$, we will discuss it in detail later.
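As a structural sketch of objective $\eqref{eq:eq-gg}$: the entropy estimator below is a crude, hypothetical stand-in (the mean log pairwise distance within a batch), used only to show how the extra term penalizes collapsed samples; the actual estimator for $H(G(Z))$ is deferred, as noted above.

```python
import numpy as np

def U(x):
    return (x ** 2 - 1.0) ** 2          # toy double-well terrain

def entropy_proxy(batch):
    # Hypothetical stand-in for H(G(Z)): mean log pairwise distance,
    # which grows as the batch spreads out. Not the paper's estimator.
    d = np.abs(batch[:, None] - batch[None, :])
    return np.log(d[d > 0]).mean()

def g_objective(batch, weight=1.0):
    # Eq. form: -H(G(Z)) + E_z[U(G(z))]
    return -weight * entropy_proxy(batch) + U(batch).mean()

collapsed = np.array([0.99, 1.0, 1.01])   # all fakes crowded into one pit
diverse = np.array([-1.0, 1.0, 0.99])     # fakes spread over both pits
print(g_objective(collapsed) > g_objective(diverse))  # True: collapse is penalized
```

Even though both batches sit near pit bottoms (similar energy), the collapsed batch scores much worse because its entropy term is far lower.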
For GAN, the most popular and easy-to-understand perspective is the analogy of the "counterfeiter vs. discriminator" competition, which directly leads to the standard GAN. However, this popular analogy cannot be further extended to understand WGAN or regularization terms like gradient penalty.
In contrast, the energy perspective is quite flexible. It even lets us intuitively understand WGAN, gradient penalty, and related techniques, which count among the most advanced results in GANs today. Although the energy perspective looks more complex than "counterfeiter vs. discriminator," its physical meaning is quite clear, and with a little thought it proves more interesting and inspiring; it has a "tastes better the more you chew" feel~